LEADS Blog

Bridget Disney, California Digital Library – YAMZ

We have been duplicating our setup for the local instance of YAMZ on the Amazon AWS server. The process is similar – kind of – and we’ve come across and worked through some major glitches along the way.
One challenge has been setting up the database. First we had to figure out where PostgreSQL was installed: its address is specified in the code, but the installation had moved to a different location on the new server. The code also goes through several steps to determine which database to use (local or remote), and the rules have changed on the new system. Because of that, we have had to figure out our new environments and permissions, documenting the process as we go. We’ve set up a markdown file in GitHub that will be the final destination for our process documentation, but in the meantime we made entries in a Google Doc as we worked through the AWS installation.
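As a rough illustration of that local-versus-remote selection logic (the variable names and defaults here are hypothetical, not the actual YAMZ code):

    import os

    def get_database_url():
        # Hypothetical sketch: prefer an explicit environment variable on the
        # AWS server and fall back to a local PostgreSQL instance otherwise.
        remote_url = os.environ.get("DATABASE_URL")
        if remote_url:
            return remote_url  # remote (AWS) database
        return "postgresql://postgres@localhost:5432/yamz"  # local development database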
Finally, we used pg_dump/pg_restore to move the data from the old PostgreSQL database to the new one, so now we have over 2500 records and a functioning website on Amazon AWS! This has been a long time coming, but it has helped me see the purpose of the whole project, which is to allow people to enter terms and then collaborate to determine which of those terms will become standard in different environments. For this to happen, the system will have to be used frequently and consistently over time.
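The dump-and-restore step itself boils down to a pair of PostgreSQL commands; here is a hedged sketch, wrapped in Python only to keep the examples in one language (hosts, users, and database names are placeholders):

    import subprocess

    # Placeholder connection details – not the real servers or credentials.
    subprocess.run(["pg_dump", "-Fc", "-h", "old-host", "-U", "yamz",
                    "-f", "yamz.dump", "yamz"], check=True)
    subprocess.run(["pg_restore", "-h", "new-host", "-U", "yamz",
                    "-d", "yamz", "--clean", "yamz.dump"], check=True)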
I still have some concerns. Did we document the process correctly? It does not seem feasible to wipe everything out and reinstall it to make sure. Also, we still haven’t worked out the process that should be used for checking out code to make changes. 
It’s been a productive summer and we’ve learned a lot, but I feel we are running out of time before completing our mission. Starting and stopping, summer to summer, without continuous focus can be detrimental to projects. This is not the first time I’ve encountered this as it seems to be prevalent in academic life.
So, in summary, I see two challenges to library/data science projects:
  1. Bridging the gap between librarians and computer science knowledge
  2. Maintaining the continuity of ongoing projects
LEADS Blog

Jamillah Gabriel: Python Functions for Merging and Visualizing

This past week, I’ve been working on a function in Python that merges the two datasets (WRA and FAR) in order to simplify the process of querying the data.

[Screenshot: the Python merge function]
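As a rough stand-in for the function in the screenshot, a merge along these lines could be done with pandas (the column names here are hypothetical, not necessarily the ones in the WRA and FAR files):

    import pandas as pd

    # Hypothetical join keys shared by the two datasets.
    wra = pd.read_csv("wra.csv")
    far = pd.read_csv("far.csv")
    merged = pd.merge(wra, far,
                      on=["last_name", "first_name", "family_number", "birth_year"],
                      how="outer")
    merged.to_csv("wra_far_merged.csv", index=False)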

 

The reason for merging the data was to find a simpler alternative to the previous search function developed by Densho, which used if/else logic and for loops to pull data from each dataset separately.

[Screenshot: the previous Densho search function]
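As a heavily simplified stand-in for that earlier loop-based approach (field names hypothetical):

    def search(wra_records, far_records, last_name):
        # Walk each dataset separately and collect the matching rows.
        results = []
        for record in wra_records:
            if record.get("last_name") == last_name:
                results.append(record)
        for record in far_records:
            if record.get("last_name") == last_name:
                results.append(record)
        return results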

 

Now one can search the data for a particular person and retrieve all of the available information about that person in a single query. After the merge, the data output looks something like this when formatted as a list:

[Screenshot: sample output from the merged data, formatted as a list]
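A minimal sketch of what such a lookup against the merged data might look like (column names again hypothetical):

    def find_person(merged, last_name, first_name):
        # Return every matching row as a list of field/value dictionaries.
        hits = merged[(merged["last_name"] == last_name) &
                      (merged["first_name"] == first_name)]
        return hits.to_dict(orient="records")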

 

In addition to this, I’ve also played with some basic visualizations using Python to display some of the data in pie charts. I’m hoping to wrap up the last week working on more visualizations and functions for querying data.
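As a sketch of what one of those basic pie charts might look like with matplotlib, continuing from the merged file above (the “camp” column is a hypothetical example):

    import pandas as pd
    import matplotlib.pyplot as plt

    merged = pd.read_csv("wra_far_merged.csv")
    counts = merged["camp"].value_counts()   # hypothetical column
    plt.pie(counts, labels=counts.index, autopct="%1.1f%%")
    plt.title("Records by camp")
    plt.show()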

LEADS Blog

Working with LCSH

One of my goals after working with this new tool is to obtain the entire LCSH dataset and try to do the matching on a local machine. This is partly because the previous method, while effective, may not scale well, so we want to test whether we can get better results by downloading the data and checking it ourselves.
Since the formats in which the data is made available are not ideal for manipulation, my next task will be to convert the data from N-Triples (.nt) to CSV using Apache Jena. The previous fellow made some notes about trying this method, so I will be reading his instructions and seeing if I can replicate it, having never used Jena before.
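For comparison, here is a minimal Python/rdflib sketch of the same nt-to-CSV idea (a different tool than Jena, and it assumes the LCSH N-Triples expose headings as skos:prefLabel):

    import csv
    from rdflib import Graph
    from rdflib.namespace import SKOS

    g = Graph()
    g.parse("lcsh.nt", format="nt")   # the full file is large; a sample works for testing

    with open("lcsh.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["uri", "prefLabel"])
        for concept, label in g.subject_objects(SKOS.prefLabel):
            writer.writerow([str(concept), str(label)])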
Once I obtain the data in a format I can use, I will add it to my current workflow and see what the results look like. Hopefully they will be useful and something that can be scaled up.
— Julaine Clunis
LEADS Blog

New Data Science Tool

For the project, we are also interested in matching against AAT. We have written a SPARQL query to get subject terms from the Getty AAT, with the results downloaded in JSON format.
Having data in these various formats, I needed to find a way to work with both and evaluate the data in one type of file against the other. In the search for answers I came across a tool for data analytics. It can be used for data preparation, blending, advanced and predictive analytics, and other data science tasks, and it is able to take inputs from various file formats and work with them directly.
A unique feature of the tool is the ability to build a workflow which can be exported and shared and which other members of a team can use as is or can probably easily turn into code if need be.
I’ve managed to join the JSON and CSV files and check for matches, and I was able to identify exact matches after performing some transformations. The tool also has a fuzzy match function which I am still trying to figure out and get working in an effective, reproducible workflow. I suspect that will take up quite a bit of my time.
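Outside that tool, a rough pandas sketch of the same join-and-exact-match step (the column names, and the assumption that the SPARQL results were flattened to a simple records-style JSON file, are both hypothetical):

    import pandas as pd

    aat = pd.read_json("aat_terms.json")        # flattened SPARQL results
    subjects = pd.read_csv("subject_terms.csv")

    # Light normalization before matching: trim whitespace and lowercase.
    aat["term_norm"] = aat["term"].str.strip().str.lower()
    subjects["term_norm"] = subjects["subject"].str.strip().str.lower()

    exact_matches = subjects.merge(aat, on="term_norm", how="inner")
    print(len(exact_matches), "exact matches")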

Julaine Clunis
LEADS Blog

Clustering

One of the things we’ve noticed about the dataset is that, beyond exact duplicates, there are subject terms that refer to the same thing but are spelled or entered differently by the contributing institutions. We’ve been thinking about using clustering applications to look at the data and see what kinds of matches are returned.
It was necessary to first do some reading on what the different clustering methods do and how they might work for the data we have. We did end up trying some clustering using key collision methods (fingerprint, n-gram fingerprint) as well as nearest-neighbor methods using kNN and Levenshtein distance. They return different results, and we are still reviewing those results before performing any merges. It is possible that terms look the same or seem similar but are in fact different, so it is not as simple as merging everything that matches.
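For reference, a minimal sketch of the key collision idea behind the fingerprint method (normalize each term to a key, then group terms that share a key); this is a simplification, not the exact algorithm the clustering tool uses:

    import re
    import unicodedata
    from collections import defaultdict

    def fingerprint(term):
        # Lowercase, strip accents and punctuation, then sort the unique tokens.
        term = unicodedata.normalize("NFKD", term).encode("ascii", "ignore").decode()
        term = re.sub(r"[^\w\s]", " ", term.lower())
        return " ".join(sorted(set(term.split())))

    def cluster(terms):
        clusters = defaultdict(set)
        for t in terms:
            clusters[fingerprint(t)].add(t)
        # Keep only keys that collect more than one distinct spelling.
        return {k: v for k, v in clusters.items() if len(v) > 1}

    print(cluster(["African-American art", "Art, African American", "african american art"]))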
One important question is how accurate the clusters are and whether we can trust the results enough to go ahead and merge automatically. My feeling is that a lot of human oversight is needed to evaluate the clusters.
Another thing we want to test is how much faster the reconciliation process would be if we accepted and merged the cluster results, and whether it is worth the time to do so: if we cluster and then do string matching, is there an improvement in the results, or are they basically the same?

Julaine Clunis
LEADS Blog

Extracting Subjects

After my last post I spent some time, along with my mentors, figuring out how to isolate the subject headings and IDs from the dataset. We decided that, since the dataset was so large and my machine did not have the power to handle it all, we would do all our tests with a sample subset. Using some Python code with Apache Spark, we managed to isolate the subject terms from these records and output them as a CSV file. The sample yielded over 700,000 subject terms.
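A hedged sketch of the kind of Spark step involved (the JSON layout and field names, e.g. sourceResource.subject, are assumptions rather than the actual schema):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, col

    spark = SparkSession.builder.appName("extract-subjects").getOrCreate()

    # Hypothetical layout: each record carries a list of subject objects with a "name" field.
    records = spark.read.json("sample_records.json")
    subjects = (records
                .select(explode(col("sourceResource.subject")).alias("subject"))
                .select(col("subject.name").alias("term")))
    subjects.write.csv("subject_terms_csv", header=True)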
One of the goals of this project is to map these terms against LCSH. At first my idea was to download the LCSH dataset in XML and see what kind of scripting I could do with it. However, I discovered that there is a Python script which extends OpenRefine and performs reconciliation against the LOC API, which we decided to test. It allows you to load a CSV file and run the reconciliation script against it. We found that this is an effective method for finding close matches where the confidence level for a match is over 85%. The reconciliation process returns the term as listed in LCSH along with a URI, which can be saved with the original data. The biggest concern with this method is the time it takes to run within OpenRefine; however, my mentors feel that this process can be captured and run in a similar way outside the tool using other programming methods.
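One way to run something similar outside OpenRefine would be to query the Library of Congress service directly. The sketch below is not the extension script itself; it assumes the id.loc.gov suggest endpoint and its OpenSearch-style JSON response, which would need to be verified:

    import requests

    def loc_suggest(term):
        # Assumed endpoint and response shape: [query, [labels], [descriptions], [uris]].
        url = "https://id.loc.gov/authorities/subjects/suggest/"
        resp = requests.get(url, params={"q": term}, timeout=30)
        resp.raise_for_status()
        _, labels, _, uris = resp.json()
        return list(zip(labels, uris))

    print(loc_suggest("Quilting"))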
Later we manually checked the items that were returned to see if they were in fact matches, and happily everything checked out. There remains a question as to whether there are subjects that are not close/exact matches but rather fuzzy matches, and how to identify and get URI results for those. Also, the dataset seems to have a number of duplicates and data that may need some cleaning and preparation, so that is another thing that may need to be examined.

Julaine Clunis
LEADS Blog

Rongqian Ma; Week 4-5: Visualizing Decorations Information

Decoration information is one of the most complex categories of information in the dataset, and visualizing it requires a good deal of data pre-processing. The dataset contains two layers of information: a) what decorations the manuscripts include; and b) how those decorations are arranged across the manuscripts. Conveying both may help communicate the decorative characteristics of the book of hours. For the what part, I identified several major decorative elements of the manuscripts from the dataset and color-coded each element in the Excel sheet, such as the illuminated initial, miniature (large and small), foliate, border (border decorations), bookplate (usually indicating the ownership of the book), catalog, notation, and multiple pictorial themes and imageries (e.g., annunciation, crucifixion, Pentecost, betrayal, lamentation, Mary, Christ).

Figure 1 shows my preliminary attempt to visualize this decorative information. I coded the major decorative patterns in the left half of the coding graph and the major pictorial themes (e.g., Virgin, Christ, Annunciation) in the right half. From this preliminary coding graph, there appear to be two general decorative styles for the book of hours: one focuses on making the manuscripts beautiful, and the other focuses on displaying stories and the meaning behind them through pictorial representations of the texts. I then went back to the original digitized images of the manuscript collection and found that the patterns of the first style were mostly used to decorate the texts (they appear surrounding the texts), while the other style appears mostly as full-leaf miniatures supplementing the texts. A preliminary analysis of the two styles’ relationship with the geographic information also suggests that the first decoration style is mostly associated with France, while the style that emphasizes miniature storytelling is more associated with production locations such as Bruges.

For the second step, I explored the transitions and relationships among different decorative elements using Tableau, Voyant, and Wordle. Figure 2 is a word cloud showing the frequency of the major decoration elements across the whole manuscript collection. Voyant Tools, in comparison, provides a way to further examine the strength of the relationships among decorative elements across the dataset. Here is an example: treating all the decoration information as texts, the “links” feature in Voyant shows the relationships among different elements. For instance, the link between “illuminated” and “initial” is the strongest, and there are also associations among other elements of decoration, such as “decorated,” “line,” “miniature,” “border,” “bookplate,” and “vignette.” The dataset also confirms that elements such as illuminated initials, miniatures, and bookplates indicating the ownership of the book are the most common. The links, however, do not reveal any relationships among the different pictorial themes.
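The word cloud and Voyant views come from those tools directly, but the underlying frequency counts could also be reproduced with a few lines of Python; the file and column names below are hypothetical:

    import csv
    from collections import Counter

    terms = Counter()
    with open("decorations.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            terms.update(row["decoration"].lower().split())

    # e.g. how often "illuminated", "initial", or "miniature" appear
    print(terms.most_common(10))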

Figure 1. Preliminary coding graph of the decorative patterns (left half) and pictorial themes (right half).

Figure 2. Word cloud of the major decoration elements.

Figure 3. Voyant analysis of the decoration information.
LEADS Blog

Week 6 – Sonia Pascua – Parser, Python & Mapping

 

LEADS site: Digital Scholarship Center
Project title: SKOS of the 1910 Library of Congress Subject Heading

 

Finally I met my mentor, Peter Logan, last Monday, and it was great to see him in person. In this meeting I presented the progress of the project, and we figured out that a TEI format would perhaps be a good data format for me to move forward with. As a pending action item, the TEI format will be generated and provided by Peter.

Here are some of the matters to ponder on in this project.
  • I was able to write a parser in Python to extract the elements from the SKOS RDF/XML format of the 1910 LCSH (a rough sketch of the idea appears after this list).
  • A concerted assessment by Jane, Peter, and me resulted in the following:
The sample entry from LCSH:

[Image: 1910 LCSH entry]

SKOS RDF version:

[Image: SKOS RDF/XML output from MultiTes]
Assessment:
Concept: Abandoned children
PrefLabel: first SEE instance 
USE: succeeding SEE instances – Foundlings & Orphans and orphan-asylums
    • There are entries in LCSH with multiple SEE terms; when such an entry is converted to SKOS RDF/XML using MultiTes, only the first term is captured as the PrefLabel and the rest fall into AltLabel. How SEE should be represented is a challenge. Based on LCSH, a concept with a SEE tag should use the SEE term as the subject heading. That is the case for the first term in the SEE tag: it becomes the PrefLabel. However, AltLabel is used for the succeeding SEE terms, which looks like an incorrect representation. Multiple PrefLabels are going to be explored. Can it be done? Wouldn’t it violate the LCSH or SKOS rules? I need to conduct further investigation on this.
    • It is decided for now that USE will be transferred to AltLabel. We will set up a meeting with Joan, the developer of HIVE, about how USE and USE FOR will be represented in HIVE.
    • I brought up some alphanumeric strings in the 1910 LCSH that are recognized Library of Congress Classification numbers. Do they still need to be represented? As per Jane, they can be kept as Notes.
    • I also need to investigate how BT and NT are going to be represented, both in SKOS and in the HIVE DB.
    • The current SKOS RDF/XML at hand shows different SKOS elements, some of which have no representation in HIVE. To address this, we will bring the concern to Joan and consult with her on how these can be added or mapped to the existing HIVE DB fields.
    • Since a text file is the input to the parser script I wrote, it is recommended to work from a text file of the 1910 LCSH. Peter will provide the TEI format.
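A minimal sketch of the extraction idea referenced above, using rdflib as a simplified stand-in for the actual parser (the filename is a placeholder):

    from rdflib import Graph
    from rdflib.namespace import RDF, SKOS

    g = Graph()
    g.parse("lcsh1910_skos.rdf", format="xml")   # placeholder for the MultiTes output

    for concept in g.subjects(RDF.type, SKOS.Concept):
        pref = g.value(concept, SKOS.prefLabel)
        alts = [str(a) for a in g.objects(concept, SKOS.altLabel)]
        print(pref, "| AltLabels:", "; ".join(alts))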

Additionally, earlier today the LEADS-4-NDP 1-minute madness was held. I presented the progress of the project to my co-fellows and the LEADS-4-NDP advisory board.

 

LEADS Blog

Jamillah Gabriel: Working Around the Unique Identifier

In recent weeks, my project has taken an unexpected turn from data storytelling and visualization toward data processing. As it turns out, our partner organization (Densho.org) has already done some data cleaning in OpenRefine, created a database, and begun preliminary data processing. I’ll be using Python and Jupyter Notebook to continue the work they’ve started, first by testing previous processes and then by creating new ones. I also found out that the data doesn’t have unique identifiers, so I’ll be using the following workaround to isolate pockets of data.

 

[Screenshot: the workaround query code]

 

In this partial example (there’s more to it than what’s seen in this screenshot), I’ll need to query the data using a for loop that searches for a combination of first name, last name, family number, and year of birth in order to precisely locate data in a way that potentially replicates the use of a unique identifier. I’m finding that not having a unique identifier makes it much more difficult to access data quickly and accurately, but hopefully this for loop will do the trick. I’m looking forward to playing with the code more and seeing what can be discovered.
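A simplified sketch of that workaround (the field names are placeholders, and the real code does more than this):

    def find_records(records, first_name, last_name, family_number, birth_year):
        # Treat the combination of these four fields as a stand-in for a unique identifier.
        matches = []
        for record in records:
            if (record.get("first_name") == first_name
                    and record.get("last_name") == last_name
                    and record.get("family_number") == family_number
                    and record.get("birth_year") == birth_year):
                matches.append(record)
        return matches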