LEADS Blog

Working with LCSH

One of my goals after working with this new tool is to obtain the entire LCSH dataset and try matching on my local machine. The previous method, while effective, may not scale well, so we want to test whether we get better results by downloading the data and checking it ourselves.
The formats in which the data is made available are not ideal for manipulation, so my next task will be to try converting the data from N-Triples (.nt) format to CSV using Apache Jena. The previous fellow made some notes about trying this method, so I will be reading his instructions and seeing whether I can replicate it, having never used Jena before.
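As a rough illustration of what such a conversion involves, here is a minimal Python sketch. It is not the Jena approach from the notes: it swaps in a simplified regular expression that only handles straightforward N-Triples lines, and the sample triples, the example.org identifiers, and the function name `nt_to_rows` are all made up for the example.

```python
import csv
import io
import re

# Simplified pattern for one N-Triples line: subject, predicate, object, then " ."
# Assumption: subjects are IRIs or blank nodes and lines carry no trailing comments.
TRIPLE = re.compile(r'^\s*(<[^>]*>|_:\S+)\s+(<[^>]*>)\s+(.+?)\s*\.\s*$')

def nt_to_rows(lines):
    """Yield (subject, predicate, object) tuples from N-Triples lines."""
    for line in lines:
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comment lines
        match = TRIPLE.match(line)
        if not match:
            continue  # this sketch silently skips lines it cannot parse
        s, p, o = match.groups()
        # Strip angle brackets from IRIs; leave literals exactly as written.
        if o.startswith("<"):
            o = o.strip("<>")
        yield s.strip("<>"), p.strip("<>"), o

# Hypothetical sample data shaped like SKOS subject triples.
sample = '''<http://example.org/subjects/sh0001> <http://www.w3.org/2004/02/skos/core#prefLabel> "Cats"@en .
<http://example.org/subjects/sh0001> <http://www.w3.org/2004/02/skos/core#broader> <http://example.org/subjects/sh0002> .
'''

rows = list(nt_to_rows(sample.splitlines()))

# Write the triples out as CSV with a header row.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["subject", "predicate", "object"])
writer.writerows(rows)
csv_text = out.getvalue()
print(csv_text)
```

A real RDF parser would be safer for the full dataset, since literals with escapes, datatypes and unusual whitespace are easy to get wrong with a hand-written pattern.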
Once I obtain the data in a format that I can use, I will add it to my current workflow and see what the results are looking like. Hopefully the results will be useful and something that can be scaled up.
— Julaine Clunis

New Data Science Tool

For the project, we are also interested in matching against the Getty Art & Architecture Thesaurus (AAT). We wrote a SPARQL query to retrieve subject terms from the AAT and downloaded the results in JSON format.
Having data in these different formats, I needed a way to work with both and to evaluate data in one type of file against the other. In my search for answers I came across a data analytics tool. It can be used for data preparation, blending, advanced and predictive analytics, and other data science tasks, and it can take inputs in various file formats and work with them directly.
A unique feature of the tool is the ability to build a workflow which can be exported and shared and which other members of a team can use as is or can probably easily turn into code if need be.
I’ve managed to join the JSON and CSV files and, after performing some transformations, was able to identify exact matches. The tool also has a fuzzy match function, which I am still trying to figure out and get working in an effective, reproducible workflow. I suspect that will take up quite a bit of my time.
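The join-and-match step can be sketched in plain Python. This is not the tool's workflow, just an equivalent idea under stated assumptions: the JSON shape, the `label` and `subject` field names, the sample terms and the `normalize` helper are all hypothetical, and the only transformations applied are lowercasing and whitespace cleanup.

```python
import csv
import io
import json

def normalize(term):
    """Lowercase, trim, and collapse internal whitespace."""
    return " ".join(term.lower().split())

# Hypothetical AAT export: a JSON list of objects with "id" and "label" fields.
aat_json = '[{"id": "aat-1", "label": "Oil  Paintings"}, {"id": "aat-2", "label": "Textiles"}]'
# Hypothetical local subject list in CSV form with a "subject" column.
subjects_csv = "subject\noil paintings\nPhotographs\n textiles \n"

aat_terms = json.loads(aat_json)
# Index the AAT labels by their normalized form for fast lookup.
aat_index = {normalize(t["label"]): t["id"] for t in aat_terms}

# Keep each local subject whose normalized form also appears in the AAT index.
matches = []
for row in csv.DictReader(io.StringIO(subjects_csv)):
    key = normalize(row["subject"])
    if key in aat_index:
        matches.append((row["subject"], aat_index[key]))

print(matches)
```

Normalizing both sides before comparing is what turns near-identical strings into exact matches; anything that still differs after that is a candidate for fuzzy matching.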

Julaine Clunis

Clustering

One of the things we’ve noticed about the dataset is that, beyond duplicate terms, there are subject terms that refer to the same thing but are spelled or entered differently by the contributing institutions. We’ve been thinking about using clustering applications on the data to see what kinds of matches are returned.
It was necessary to first do some reading on what the different clustering methods do and how they might work for the data we have. We ended up trying some clustering using various key collision methods (fingerprint, n-gram fingerprint) as well as nearest-neighbor (kNN) and Levenshtein distance methods. They return different results, and we are still reviewing those results before performing any merge functions. Terms can look the same or seem similar but in fact be different, so it is not as simple as merging everything that matches.
One important question is how accurate the clusters are and whether we can trust the results enough to go ahead and merge automatically. My feeling is that a lot of human oversight is needed to evaluate the clusters.
Another thing we want to test is how much faster the reconciliation process would be if we accepted and merged the cluster results, and whether it is worth the time to do so; i.e., if we cluster and then do string matching, do the results improve or are they basically the same?
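To make the two families of methods concrete, here is a small Python sketch. It is a simplified take on fingerprint keying and a textbook Levenshtein distance, not the exact implementation of any particular clustering application; the sample terms and function names are made up for illustration.

```python
import string
from collections import defaultdict

def fingerprint(term):
    """Key for key-collision clustering: normalized, de-duplicated, sorted tokens."""
    cleaned = term.strip().lower()
    # Remove ASCII punctuation so "U.S." and "US" produce the same token.
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)

def levenshtein(a, b):
    """Edit distance computed with the standard two-row dynamic programming table."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

# Hypothetical subject terms with the kinds of variation described above.
terms = ["Civil War, U.S.", "civil war US", "U.S. Civil War", "Photographs", "Photograhps"]

# Key collision: terms that share a fingerprint land in the same cluster.
clusters = defaultdict(list)
for term in terms:
    clusters[fingerprint(term)].append(term)
multi = {key: members for key, members in clusters.items() if len(members) > 1}
print(multi)

# Distance-based view: the misspelling is missed by fingerprinting but is close in edit distance.
print(levenshtein("photographs", "photograhps"))
```

The fingerprint groups the three reordered and repunctuated "Civil War" variants but treats the misspelled "Photograhps" as a separate term, while edit distance catches the misspelling. That difference is one reason the methods return different clusters, and neither says whether two terms in a cluster really mean the same thing.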

Julaine Clunis

Julaine Clunis, Week 1: Getting Started

Hi everyone!

This is Julaine, and my assignment is with the Digital Public Library of America (DPLA). The DPLA has more than 3 million unique subject headings, with only a portion of those coming from controlled vocabularies, which can lead to various issues when records use slight term variations or synonyms for the same concept.
The aim of my project is to continue developing and testing an effective method for analyzing record content and matching it, including keywords, with relevant controlled terms from a defined list, in an effort to create a consistent vocabulary that aids users, can be reliably re-ingested, and consistently supports analytics.
I have spent the last couple of days reading through a ton of documentation about the work already completed on this project, familiarizing myself with the DPLA Metadata Application Profile, and getting set up with the software and data recommended for use. I have been exploring Apache Spark for the first time and am slowly finding my way around it (downloading, installing and setting up the environment on my machine, and reviewing tutorials), so I haven’t really done much in terms of coming up with solutions to this problem, as I am just getting to know the tools and the data.
My mentors have been incredibly supportive and helpful and make themselves available to me in several ways. I expect I will learn a lot from working with them and am feeling really thankful for that. We use various tools such as Slack, Zoom and email to stay in touch so I am feeling positive about having access to direction or support if and when I need it.
Well, that is about all I have to report at this time.
I wish everyone the best of luck going forward with their projects.

Julaine Clunis