LEADS Blog

Week 9 – Sonia Pascua – 1910 LCSH Database Schema

LEADS site: Digital Scholarship Center
Project title: SKOS of the 1910 Library of Congress Subject Heading
The next milestone in this project was loading the 1910 LCSH concepts into a database. Below are screenshots of the CONCEPT table, whose records are the concepts of the 1910 LCSH. The resulting database, named “lchs1910.db”, has been added to the list of vocabulary databases in HIVE. The next steps are to formulate a test case, which Peter will provide, and to execute a query to check the results. We are also considering loading the created RDF or db into the live HIVE, and Joan Boone, the developer of HIVE, is assisting. I can’t wait to see the final output of the testing and the live 1910 LCSH.
Volume 1 – Database Schema Letters A-F
Volume 2 – Database Schema Letters G-P
Volume 3 – Database Schema Letters S-Z
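
A query check like the one planned for the test case could be sketched as follows. This is only an illustration: it assumes the vocabulary database can be read as a SQLite file, and the column names (`id`, `pref_label`) and sample labels are hypothetical, since the post only names the CONCEPT table. An in-memory database stands in for “lchs1910.db”.

```python
import sqlite3


def find_concepts(conn, prefix):
    """Return (id, pref_label) rows whose label starts with `prefix`."""
    cur = conn.execute(
        "SELECT id, pref_label FROM CONCEPT "
        "WHERE pref_label LIKE ? ORDER BY pref_label",
        (prefix + "%",),
    )
    return cur.fetchall()


# In-memory stand-in; a real check would connect to the actual database file.
conn = sqlite3.connect(":memory:")
# Hypothetical schema: only the table name CONCEPT comes from the post.
conn.execute("CREATE TABLE CONCEPT (id INTEGER PRIMARY KEY, pref_label TEXT)")
conn.executemany(
    "INSERT INTO CONCEPT (pref_label) VALUES (?)",
    [("Abbeys",), ("Abbreviations",), ("Zoology",)],  # made-up sample rows
)

print(find_concepts(conn, "Abb"))
```

Comparing the returned rows against the expected concepts in the test case would then show whether the load worked.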

Rongqian Ma; Week 8-10

Week 8-10: Re-organizing place and date information. Based on the problems that appeared in the current version of the visualizations, I performed another round of data cleaning and modification, especially for the date and geography information. To reduce the number of categories in each visualization, I merged more of the data into broader categories. For example, all city information was merged into countries, single dates (e.g., 1470) were merged into the corresponding time period (e.g., 1470 was merged into the 1450-1475 period), and inconsistencies across the time and geography categories were further resolved. As the example at the end of this post demonstrates, the new version of the visualizations is “cleaner” in terms of the number of categories and more readable. Over the last couple of weeks, I have also discussed the visualizations and the problems I had with my mentor, and worked with my mentor on the data merge. I’m also working on a potential poster submission to iConference 2020.
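
The kind of merging described above can be sketched in Python. This is a minimal illustration, not the procedure actually used: the city-to-country table, the sample records, and the choice of fixed 25-year periods counted from 1300 are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical lookup table; the real dataset's places are not listed in the post.
CITY_TO_COUNTRY = {"Paris": "France", "Rouen": "France", "Florence": "Italy"}


def to_country(place):
    """Map a city to its country; leave anything else unchanged."""
    return CITY_TO_COUNTRY.get(place, place)


def to_period(year, width=25, origin=1300):
    """Map a single year to a fixed-width period label, e.g. 1470 -> '1450-1475'.

    Assumes periods are half-open, so 1475 falls into '1475-1500'.
    """
    start = origin + ((year - origin) // width) * width
    return f"{start}-{start + width}"


# Made-up sample records: (place, year)
records = [("Paris", 1470), ("Rouen", 1462), ("Florence", 1455), ("France", 1480)]
merged = [(to_country(place), to_period(year)) for place, year in records]

print(Counter(country for country, _ in merged))
print(Counter(period for _, period in merged))
```

After the merge, the four sample records fall into two countries and two periods, which is the reduction in categories the visualizations benefit from.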

 

Example: 


 


Rongqian Ma; Week 6-7: Exploring Timeline JS for the Stories of Book of Hours

Week 6-7: Exploring Timeline JS for the story of Book of Hours. I spent the past two weeks designing and creating a timeline of the evolution of the book of hours using the Timeline JS visualization site. The timeline incorporates as much of the available information in the dataset as possible, including the date information, locations, digital images, and some textual descriptions. Timeline JS is an effective storytelling platform that combines multimedia resources and information. I initially started exploring this tool while visualizing the date information of the dataset; I wanted to find a form of visualization that could examine the relationships between different categories of the dataset, especially those among the temporal, geographical, and content information of the manuscripts. I was able to create a timeline that demonstrates the evolution of book of hours manuscripts from the 14th to the 16th centuries and to develop a multimedia narrative of the book of hours. The biggest challenge in creating the timeline is generating reasonable and meaningful period intervals. Because the date information in the original dataset is presented largely as text and descriptions (e.g., circa 1460, mid-15th century, 1450-1460), it is important to reformat it into an easily computed form. Following this task, the major work in deciding on the intervals is to summarize the characteristics of each representative period and present them in the timeline. Creating the timeline also entails reviewing the relevant literature on the book of hours, choosing pictorial representations, and illustrating the characteristics of the book of hours for each time period based on other categories of information (e.g., geolocations, decorations).
Despite its advantages as an effective tool, Timeline JS appears to be more a platform for “display of findings and results” than an approach to “visual analysis.” [Following discussions with my mentor, she is going to help with the textual descriptions of the book of hours in general and of each section, which I really appreciate!]
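
To illustrate the date reformatting step mentioned in this post, here is a small Python sketch that turns the three textual forms quoted above into numeric (start, end) year ranges. It is an assumption-laden example, not the method actually used: in particular, splitting a century into "early", "mid", and "late" thirds is my own convention for the sketch.

```python
import re

# Assumed convention: a century is split into rough thirds.
CENTURY_PARTS = {None: (0, 99), "early": (0, 33), "mid": (33, 66), "late": (66, 99)}


def parse_date(text):
    """Return (start_year, end_year) for a textual date, or None if unrecognized."""
    s = text.strip().lower()

    # Explicit range, e.g. "1450-1460"
    m = re.fullmatch(r"(\d{4})\s*-\s*(\d{4})", s)
    if m:
        return int(m[1]), int(m[2])

    # Single year, optionally approximate, e.g. "circa 1460" or "1470"
    m = re.fullmatch(r"(?:circa|ca\.?|c\.)?\s*(\d{4})", s)
    if m:
        year = int(m[1])
        return year, year

    # Century, optionally qualified, e.g. "mid-15th century"
    m = re.fullmatch(r"(?:(early|mid|late)[\s-]+)?(\d{1,2})(?:st|nd|rd|th)\s+century", s)
    if m:
        start = (int(m[2]) - 1) * 100
        lo, hi = CENTURY_PARTS[m[1]]
        return start + lo, start + hi

    return None


for raw in ["circa 1460", "mid-15th century", "1450-1460"]:
    print(raw, "->", parse_date(raw))
```

Values that return `None` can be collected and reviewed by hand, which keeps unusual descriptions from being silently forced into an interval.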


Week 7-8 – Sonia Pascua, The SKOS of 1910 LCSH in RDF/XML format

LEADS site: Digital Scholarship Center
Project title: SKOS of the 1910 Library of Congress Subject Heading
Technically, the project output was accomplished this week: the SKOS of the 1910 LCSH in a machine-readable format, RDF/XML. However, integrating the 1910 LCSH vocabulary, which is now in RDF/XML, into HIVE for use in automatic indexing is also one of the goals of this project.
The last two weeks of the project will focus on parsing the SKOS elements to map them to the database fields of HIVE. In addition, the vocabularies will be added to the database to build the LCSH db. Once the LCSH db is available, the SQL scripts and queries of HIVE should be able to retrieve the data and use the indexing capabilities of HIVE.
See screenshot below of the 1910 LCSH SKOS.
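
As a rough illustration of the planned parsing step, the sketch below reads SKOS concepts from RDF/XML with Python's standard library and flattens them into dictionaries that could be mapped to database fields. The sample document, the example.org URIs, and the field names (`uri`, `pref_label`, `related`, `notes`) are hypothetical; the actual HIVE field names are not given in this post.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"
NS = {"rdf": RDF, "skos": SKOS}

# Made-up sample in the shape of a SKOS RDF/XML file.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:skos="{SKOS}">
  <skos:Concept rdf:about="http://example.org/lcsh1910/Abbeys">
    <skos:prefLabel xml:lang="en">Abbeys</skos:prefLabel>
    <skos:related rdf:resource="http://example.org/lcsh1910/Monasteries"/>
    <skos:note>Hypothetical note text.</skos:note>
  </skos:Concept>
</rdf:RDF>"""


def concepts_to_rows(xml_text):
    """Flatten each skos:Concept into a dict of candidate database fields."""
    root = ET.fromstring(xml_text)
    rows = []
    for concept in root.findall("skos:Concept", NS):
        rows.append({
            "uri": concept.get(f"{{{RDF}}}about"),
            "pref_label": concept.findtext("skos:prefLabel", default="", namespaces=NS),
            "related": [r.get(f"{{{RDF}}}resource")
                        for r in concept.findall("skos:related", NS)],
            "notes": [n.text or "" for n in concept.findall("skos:note", NS)],
        })
    return rows


print(concepts_to_rows(SAMPLE))
```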
Furthermore, below are the challenges that this project encountered:
  • Digitization – The TEI version of the 1910 LCSH turned out to be incomplete, so we need to go back to the digitization of the print copies and redo the OCR process.
  • Encoding – Parsing, one of the activities in this project, encountered not only syntactic and basic semantic structure errors but also logic errors and syntax/semantics interaction errors.
  • Programming
    • Characterizing the states, if possible, and enumerating all of them so that a conditional statement can be composed.
    • The data is so unclean that patterns are hard to identify for logic formulation.
  • Digitalization – MultiTes or Python Program
    • Using MultiTes is a manual process but yields 98% accuracy in terms of representation.
    • Building a Python program to automate SKOS creation from TEI format to RDF/XML format ran into pattern recognition challenges, because the text produced by the OCR process was difficult to match with regular expressions. This yielded a higher percentage of errors, which were identified from the 47 inconsistencies found in the evaluation conducted when the control structures of the program were constructed. Further investigation could verify the percent error once the output is compared to the MultiTes version of the SKOS RDF/XML.
  • Metadata – SKOS elements are limited to Concept, PrefLabel, Related and Notes. AltLabel, USE, USE FOR, BT and NT are not represented because the HIVE database has no provision for them.

The SKOS-ification of the 1910 LCSH brought many challenges, which we documented to contribute to case studies in digitization, encoding, programming, digitalization and metadata practices.
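
For readers unfamiliar with the target format, the following sketch shows how SKOS RDF/XML restricted to the elements listed under Metadata (Concept, prefLabel, related, note) could be written with Python's standard library. It is not the program built in this project; the input dictionaries and the example.org URIs are hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"
ET.register_namespace("rdf", RDF)
ET.register_namespace("skos", SKOS)


def build_skos(concepts):
    """Serialize a list of concept dicts as SKOS RDF/XML (limited element set)."""
    root = ET.Element(f"{{{RDF}}}RDF")
    for c in concepts:
        node = ET.SubElement(root, f"{{{SKOS}}}Concept", {f"{{{RDF}}}about": c["uri"]})
        label = ET.SubElement(node, f"{{{SKOS}}}prefLabel")
        label.text = c["pref_label"]
        for target in c.get("related", []):
            ET.SubElement(node, f"{{{SKOS}}}related", {f"{{{RDF}}}resource": target})
        for text in c.get("notes", []):
            note = ET.SubElement(node, f"{{{SKOS}}}note")
            note.text = text
    return ET.tostring(root, encoding="unicode")


# Made-up input record for illustration only.
xml_out = build_skos([{
    "uri": "http://example.org/lcsh1910/Abbeys",
    "pref_label": "Abbeys",
    "related": ["http://example.org/lcsh1910/Monasteries"],
    "notes": ["Hypothetical note text."],
}])
print(xml_out)
```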


Bridget Disney, California Digital Library – YAMZ

We have been duplicating our setup for the local instance of YAMZ on the Amazon AWS server. The process is similar – kind of – and we’ve come across and worked through some major glitches in its setup.
One challenge that we have experienced is setting up the database. First we had to figure out where PostgreSQL was installed. The address is specified in the code, but it had moved to a different location on the new server. The code goes through several steps to determine which database to use (local or remote), and the rules have changed on the new system. Because of that, we have had to figure out our new environments and our permissions, documenting the process as we go along. We’ve set up a markdown file in GitHub which will be the final destination for our process documentation, but in the meantime, we made entries to a file in Google Docs as we worked through the process of the AWS installation.
Finally, we used pg_dump/pg_restore to move the data from the old to the new PostgreSQL database, so now we have over 2500 records and a functioning website on Amazon AWS! This has been a long time coming, but it has helped me see the purpose of the whole project, which is to allow people to enter terms and then collaborate to determine which of those terms will become standard in different environments. In order for this to happen, this system will have to be used frequently and consistently over time.
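
For the process documentation, the dump-and-restore step could be recorded as a small script along these lines. This is a sketch under assumptions, not the exact commands we ran: the database names, host names, and dump path are placeholders, and the custom archive format is just one reasonable choice.

```python
import subprocess


def dump_command(dbname, dump_path, host="localhost"):
    """Build a pg_dump call that writes a custom-format archive."""
    return ["pg_dump", f"--host={host}", "--format=custom",
            f"--file={dump_path}", dbname]


def restore_command(dbname, dump_path, host="localhost"):
    """Build a pg_restore call that loads the archive into an existing database."""
    return ["pg_restore", f"--host={host}", f"--dbname={dbname}",
            "--no-owner", dump_path]


def migrate(old_db, new_db, dump_path, old_host, new_host):
    """Run the dump and then the restore, stopping on the first failure."""
    subprocess.run(dump_command(old_db, dump_path, old_host), check=True)
    subprocess.run(restore_command(new_db, dump_path, new_host), check=True)


# Placeholder names; printing the commands lets them be reviewed before running.
print(" ".join(dump_command("yamz_old", "yamz.dump", "old-server.example.org")))
print(" ".join(restore_command("yamz_new", "yamz.dump", "localhost")))
```

Keeping the commands in a script rather than only in notes also makes it easier to check later whether the documented process matches what was actually run.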
I still have some concerns. Did we document the process correctly? It does not seem feasible to wipe everything out and reinstall it to make sure. Also, we still haven’t worked out the process that should be used for checking out code to make changes. 
It’s been a productive summer and we’ve learned a lot, but I feel we are running out of time before completing our mission. Starting and stopping, summer to summer, without continuous focus can be detrimental to projects. This is not the first time I’ve encountered this as it seems to be prevalent in academic life.
So, in summary, I see two challenges to library/data science projects:
  1. Bridging the gap between librarians and computer science knowledge
  2. Maintaining the continuity of ongoing projects

Jamillah Gabriel: Python Functions for Merging and Visualizing

This past week, I’ve been working on a function in Python that merges the two different datasets (WRA and FAR) so as to simplify the process of querying the data.

The reason for merging the data was to find a simpler alternative to the previous search function developed by Densho, which used if/else statements and for loops to pull data from each dataset.

Now, one can search the data for a particular person and recover all of the available information about that person in a simple query. After the merge, the output for a person can be formulated as a list.

In addition to this, I’ve also played with some basic visualizations using Python to display some of the data in pie charts. I’m hoping to wrap up the last week working on more visualizations and functions for querying data.
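
Since the screenshots of the functions are not reproduced here, the following is a minimal sketch of the merge-then-query idea described in this post. It is not the actual function: it assumes both datasets share an identifier column (called `person_id` here), and all field names and sample rows are invented for illustration.

```python
from collections import defaultdict


def merge_records(wra, far, key="person_id"):
    """Combine rows from both datasets into one dict per person.

    Field names are prefixed with their source so nothing is overwritten.
    """
    merged = defaultdict(dict)
    for source, rows in (("wra", wra), ("far", far)):
        for row in rows:
            for field, value in row.items():
                if field != key:
                    merged[row[key]][f"{source}_{field}"] = value
    return dict(merged)


def find_person(merged, person_id):
    """Return everything known about one person as a list of (field, value) pairs."""
    return sorted(merged.get(person_id, {}).items())


# Invented sample rows; the real WRA and FAR columns differ.
wra_rows = [{"person_id": 1, "name": "Example Person", "camp": "Camp A"}]
far_rows = [{"person_id": 1, "name": "Example Person", "entry_date": "1942"}]

merged = merge_records(wra_rows, far_rows)
print(find_person(merged, 1))
```

With the data merged once up front, a lookup becomes a single dictionary access instead of a loop over each dataset.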


Working with LCSH

One of my goals after working with this new tool is to obtain the entire LCSH dataset and try to do the matching on a local machine. This is partly because the previous method, while effective, may not scale well, so we wish to test whether we can get better results by downloading the data and checking it ourselves.
Since the formats in which the data is made available are not ideal for manipulation, my next task will be to try to convert the data from nt format to csv using Apache Jena. The previous fellow made some notes about trying this method, so I will be reading his instructions and seeing whether I can replicate them, having never used Jena before.
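
To show what the nt-to-csv conversion amounts to, here is a simplified Python sketch. It is not Apache Jena and not the previous fellow's method: it is a stand-in that handles only plain one-line triples, leaves escape sequences in literals untouched, and drops language tags and datatypes. The sample triple is made up.

```python
import csv
import io
import re

# One triple per line: subject, predicate, object, then a closing dot.
TRIPLE = re.compile(r'^\s*(<[^>]*>|_:\S+)\s+(<[^>]*>)\s+(.+?)\s*\.\s*$')


def strip_term(term):
    """Remove <...> around IRIs and quotes (plus any tag) around literals."""
    if term.startswith("<") and term.endswith(">"):
        return term[1:-1]
    m = re.match(r'^"((?:[^"\\]|\\.)*)"', term)
    if m:
        return m.group(1)
    return term  # blank nodes are kept as written


def nt_to_csv(nt_text):
    """Convert N-Triples text to CSV text with a header row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["subject", "predicate", "object"])
    for line in nt_text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        m = TRIPLE.match(line)
        if m:
            writer.writerow([strip_term(t) for t in m.groups()])
    return out.getvalue()


sample = ('<http://example.org/s> '
          '<http://www.w3.org/2004/02/skos/core#prefLabel> "Abbeys"@en .\n')
print(nt_to_csv(sample))
```

For the full dataset a proper RDF toolkit is the safer route, since real N-Triples files contain escapes and edge cases this sketch ignores.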
Once I obtain the data in a format that I can use, I will add it to my current workflow and see what the results look like. Hopefully the results will be useful and something that can be scaled up.
— Julaine Clunis

New Data Science Tool

For the project, we are also interested in matching against AAT. We have written a SPARQL query to get subject terms from the Getty AAT, and the results were downloaded in json format.
Having data in these various formats, I needed to find a way to work with both and to evaluate data in one type of file against the other. In the search for answers, I came across a tool for data analytics. It can be used for data preparation, blending, advanced and predictive analytics, and other data science tasks, and it can take inputs from various file formats and work with them directly.
A unique feature of the tool is the ability to build a workflow which can be exported and shared, and which other members of a team can use as is or can probably easily turn into code if need be.
I’ve managed to join the json and csv files and check for matches, and was able to identify exact matches after performing some transformations. This tool has a fuzzy match function which I am still trying to figure out and get working in an effective workflow that can be reproduced. I suspect that will be taking up quite a bit of my time.
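
If the workflow were turned into code, the exact-then-fuzzy matching could look roughly like the Python sketch below. It does not reproduce the tool's fuzzy match function; `difflib` from the standard library is used as a stand-in, and the normalization rules, the 0.85 cutoff, and the sample terms are assumptions for illustration.

```python
import difflib
import re


def normalize(term):
    """Lowercase, replace punctuation with spaces, and collapse whitespace.

    Note: this simple rule also drops non-ASCII letters.
    """
    return " ".join(re.sub(r"[^a-z0-9 ]+", " ", term.lower()).split())


def match_terms(local_terms, vocab_terms, cutoff=0.85):
    """Split local terms into exact matches, fuzzy matches, and unmatched."""
    vocab = {normalize(t): t for t in vocab_terms}
    exact, fuzzy, unmatched = {}, {}, []
    for term in local_terms:
        key = normalize(term)
        if key in vocab:
            exact[term] = vocab[key]
            continue
        close = difflib.get_close_matches(key, list(vocab), n=1, cutoff=cutoff)
        if close:
            fuzzy[term] = vocab[close[0]]
        else:
            unmatched.append(term)
    return exact, fuzzy, unmatched


# Made-up terms standing in for the csv (local) and json (vocabulary) inputs.
exact, fuzzy, unmatched = match_terms(
    ["Stained glass.", "Manuscrpts", "Zzz"],
    ["stained glass", "manuscripts"],
)
print(exact, fuzzy, unmatched)
```

Keeping fuzzy matches separate from exact ones makes it easy to review them before accepting any as real matches.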

Julaine Clunis

Clustering

One of the things that we’ve noticed about the dataset is that, beyond duplicate terms, there are subject terms that refer to the same thing but are spelled or entered differently by the contributing institutions. We’ve been thinking about using clustering applications to look at the data to see what kinds of matches are returned.
It was necessary to first do some reading on what the different clustering methods would do and how that might work for the data we have. We did end up trying some clustering using various key collision methods (Fingerprint, n-gram fingerprint) as well as KNN and Levenshtein distance methods. They return different results, and we are still reviewing these results before performing any merges. It is possible that terms look the same or seem similar but are in fact different, so it is not as simple as just merging everything that matches.
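
To make the key collision idea concrete, here is a minimal Python sketch of fingerprint and n-gram fingerprint keys as they are commonly described. It is not the clustering application we used, the exact normalization steps are an assumption, and the sample terms are made up.

```python
import re
import unicodedata
from collections import defaultdict


def to_ascii(term):
    """Strip accents by decomposing characters and dropping non-ASCII parts."""
    return unicodedata.normalize("NFKD", term).encode("ascii", "ignore").decode("ascii")


def fingerprint(term):
    """Key = lowercased, punctuation-free, de-duplicated, sorted tokens."""
    text = re.sub(r"[^\w\s]", "", to_ascii(term).strip().lower())
    return " ".join(sorted(set(text.split())))


def ngram_fingerprint(term, n=2):
    """Key = sorted set of character n-grams of the squashed term."""
    text = re.sub(r"[^a-z0-9]", "", to_ascii(term).lower())
    grams = {text[i:i + n] for i in range(len(text) - n + 1)}
    return "".join(sorted(grams))


def cluster(terms, key=fingerprint):
    """Group terms that share a key; only groups with more than one term are kept."""
    buckets = defaultdict(list)
    for term in terms:
        buckets[key(term)].append(term)
    return [group for group in buckets.values() if len(group) > 1]


terms = ["Art, Medieval", "medieval art", "Medieval Art.", "Architecture"]
print(cluster(terms))
```

Terms that collide on a key are only candidates for merging, which is why the clusters still need to be checked before anything is combined.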
One important question is how accurate the clusters are and whether we can trust the results enough to go ahead and merge automatically. My feeling is that a lot of human oversight is needed to evaluate the clusters.
Another thing we want to test is how much faster the reconciliation process would be if we accepted and merged the results from the clusters, and whether it is worth the time to do so, i.e., if we cluster and then do string matching, is there an improvement in the results or are they basically the same?
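
The sketch below illustrates why human review matters for distance-based clusters. It computes Levenshtein distance with a standard dynamic programming approach and lists close pairs for inspection rather than merging them. The threshold and the sample terms are assumptions chosen for the example, not values from our data.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def review_candidates(terms, max_distance=1):
    """Return (term_a, term_b, distance) for pairs close enough to review by hand."""
    pairs = []
    for a, b in combinations(terms, 2):
        d = levenshtein(a.lower(), b.lower())
        if d <= max_distance:
            pairs.append((a, b, d))
    return pairs


print(review_candidates(["color", "colour", "colon"]))
```

In this toy example "color" is equally close to "colour" (a spelling variant) and to "colon" (a different term), so distance alone cannot decide which pairs are safe to merge.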

Julaine Clunis