Extracting Subjects

After my last post, I spent some time with my mentors figuring out how to isolate the subject headings and IDs from the dataset. Since the dataset was so large and my machine did not have the power to handle it all, we decided to run all our tests on a sample subset. Using Python code with Apache Spark, we managed to isolate the subject terms from these records and output them as a CSV file. The sample yielded over 700,000 subject terms.
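The actual extraction ran in Apache Spark; outside Spark, the same step can be sketched in plain Python. The records and the "subjects" field name below are assumptions for illustration, not the dataset's real schema:

```python
import csv
import io
import json

# Hypothetical sample records standing in for the real dataset.
records = [
    '{"id": "rec1", "subjects": ["Folk songs", "Ballads"]}',
    '{"id": "rec2", "subjects": ["Hymns"]}',
]

# Flatten each record's subject list into one (record_id, subject) row per term.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["record_id", "subject"])
for line in records:
    rec = json.loads(line)
    for subject in rec.get("subjects", []):
        writer.writerow([rec["id"], subject])

print(out.getvalue())
```

In Spark the same flattening would typically be expressed with an `explode` over the subjects column before writing to CSV.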
One of the goals of this project is to map these terms against LCSH. My first idea was to download the LCSH dataset in XML and see what kind of scripting I could do with it. However, I discovered a Python script that extends OpenRefine and performs reconciliation against the LOC API, which we decided to test. It lets you load a CSV file and run the reconciliation script against it. We found this to be an effective method for finding close matches where the confidence level for a match is over 85%. The reconciliation process returns the term as listed in LCSH along with a URI, which can be saved with the original data. The biggest concern with this method is how long it takes to run within OpenRefine; however, my mentors feel the process can be captured and run in a similar way outside the tool using other programming methods.
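The 85% confidence threshold can be illustrated with a small local sketch. This is not the OpenRefine reconciliation service itself; it just scores string similarity against a hand-made candidate list (the candidate labels and the `best_match` helper are assumptions for illustration):

```python
from difflib import SequenceMatcher


def match_confidence(term, candidate):
    """Similarity ratio (0.0-1.0) between a local term and a candidate label."""
    return SequenceMatcher(None, term.lower(), candidate.lower()).ratio()


def best_match(term, candidates, threshold=0.85):
    """Return (label, score) for the best candidate above the threshold, else None."""
    scored = [(match_confidence(term, c), c) for c in candidates]
    score, label = max(scored)
    return (label, score) if score >= threshold else None


# Hypothetical LCSH candidate labels.
candidates = ["Children--Books and reading", "Children's literature"]
print(best_match("Childrens literature", candidates))
print(best_match("Astronomy", candidates))
```

A real pipeline would fetch candidates from the LOC API rather than a local list, but the thresholding logic would look much the same.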
Later we manually checked the items that were returned to verify that they were in fact matching, and happily everything checked out. A question remains about subjects that are not close/exact matches but rather fuzzy matches, and how to identify them and get URI results for those. Also, the dataset seemed to contain a number of duplicates and data that may need some cleaning and preparation, so that is another thing that may need to be examined.
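One simple cleaning step for the duplicates question would be to collapse trivial variants (case, extra whitespace) before reconciliation. A minimal sketch, with hypothetical sample terms:

```python
def normalize(term):
    """Collapse case and internal/surrounding whitespace so trivial variants collide."""
    return " ".join(term.split()).casefold()


def dedupe(terms):
    """Keep the first occurrence of each normalized form, preserving input order."""
    seen, unique = set(), []
    for t in terms:
        key = normalize(t)
        if key not in seen:
            seen.add(key)
            unique.append(t)
    return unique


raw = ["Folk songs", "folk songs ", "Folk  Songs", "Hymns"]
print(dedupe(raw))  # -> ['Folk songs', 'Hymns']
```

This only catches exact duplicates after normalization; the fuzzy-match question raised above would still need a similarity-based approach.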

Julaine Clunis

Rongqian Ma; Week 4-5: Visualizing Decorations Information

Decoration information is one of the most complex categories of information in the manuscript dataset, and visualizing it requires a good deal of data pre-processing. The dataset contains two layers of information: a) what decorations the manuscripts include; and b) how those decorations are arranged across the manuscripts. Delivering this information may help communicate the decorative characteristics of the books of hours. For the "what" part, I identified several major decorative elements from the dataset and color-coded each element in the Excel sheet, such as the illuminated initial, miniature (large and small), foliate, border (border decorations), bookplate (usually indicating the ownership of the book), catalog, notation, and multiple pictorial themes and imageries (e.g., annunciation, crucifixion, Pentecost, betrayal, lamentation, Mary, Christ). Figure 1 demonstrates my preliminary attempt to visualize the decorative information of the manuscripts. I coded the major decorative patterns in the left half of the coding graph and the major pictorial themes (e.g., Virgin, Christ, Annunciation) in the right half. From this preliminary coding graph, there appear to be two general decorative styles for the book of hours: one focuses on making the manuscripts beautiful, while the other focuses on displaying stories and the meaning behind them through pictorial representations of the texts. I then went back to check the original digitized images of the manuscript collection and found that the patterns were mostly used to decorate texts (appearing around the texts), while the other style appears mostly as full-leaf miniatures supplementing the texts.
A preliminary analysis of the two styles' relationship with the geographic information also suggests that the first decoration style is mostly associated with France, while the style that emphasizes miniature storytelling is more associated with production locations such as Bruges.

For the second step, I explored the transitions as well as the relationships among different decorative elements using Tableau, Voyant, and Wordle. Figure 2 is a word cloud demonstrating the frequency of the major decoration elements across the whole manuscript collection. Voyant Tools, in comparison, provides a way to further demonstrate the strength of the relationships among decorative elements across the dataset. For example, treating all the decoration information as text, the "links" feature in Voyant shows the relationships among different elements: the link between "illuminated" and "initial" is the strongest, and there are also associations among elements such as "decorated," "line," "miniature," "border," "bookplate," and "vignette." The dataset has also attested that patterns such as illuminated initials, miniatures, and bookplates (demonstrating the ownership of the book) are the most common elements. The links, however, do not capture any of the relationships among different themes.
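The frequency counts behind a word cloud like Figure 2 can be reproduced with a few lines of Python. The decoration strings below are hypothetical stand-ins for the dataset's free-text decoration field:

```python
from collections import Counter

# Hypothetical decoration descriptions, one string per manuscript record.
decorations = [
    "illuminated initial, miniature, border",
    "illuminated initial, bookplate",
    "miniature, border, vignette",
]

# Split each comma-separated description into elements and tally them.
counts = Counter(
    element.strip()
    for entry in decorations
    for element in entry.split(",")
)
print(counts.most_common(3))
```

Tools like Wordle and Voyant perform essentially this tally before sizing the words, so the counts here are a way to sanity-check what the visualizations show.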

Figure 1.

Figure 2.
Figure 3. Voyant analysis of the decorating information. 

Week 6 – Sonia Pascua – Parser, Python & Mapping


LEADS site: Digital Scholarship Center
Project title: SKOS of the 1910 Library of Congress Subject Heading


I finally met my mentor, Peter Logan, last Monday, and it was great to see him in person. In this meeting I presented the progress of the project, and we figured out that TEI would likely be a good data format for me to move forward with. As a pending action item, Peter will generate and provide the TEI format.

Here are some of the matters to ponder in this project.
  • I was able to write a parser in Python to extract the elements from the SKOS RDF/XML format of the 1910 LCSH.
  • A joint assessment by Jane, Peter, and me resulted in the following:
The sample entry from LCSH

SKOS RDF version

Concept : Abandoned children 
PrefLabel: first SEE instance 
USE: succeeding SEE instances – Foundlings & Orphans and orphan-asylums
    • There is an entry in LCSH with multiple SEE terms; when it is converted to SKOS RDF/XML using MultiTes, only the first term is captured as a prefLabel and the rest fall into altLabel. How SEE should be represented is a challenge. Based on LCSH, a concept with a SEE tag should use the SEE term as the subject heading. That is the case for the first term in the SEE tag, which became the prefLabel; however, altLabel is used as the tag for the succeeding SEE terms, which seems to be an incorrect representation. Multiple prefLabels are going to be explored. Can it be done? Wouldn't it violate the LCSH or SKOS rules? I need to conduct further investigation on this.
    • For now, it has been decided that USE terms will be transferred to altLabel. We will set up a meeting with Joan, the developer of HIVE, to discuss how USE and "Use for" will be represented in HIVE.
    • I brought up the alphanumeric strings in the 1910 LCSH that are recognized Library of Congress Classification numbers. Do they still need to be represented? Per Jane, they can be kept as Notes.
    • I also need to investigate how BT (broader term) and NT (narrower term) are going to be represented, both in SKOS and in the HIVE DB.
    • The current SKOS RDF/XML at hand shows several SKOS elements that have no representation in HIVE. To address this, we will bring the concern to Joan and consult with her on how these can be added or mapped to the existing HIVE DB fields.
    • Since the parser script I wrote takes a text file as input, it is recommended to work from a text file of the 1910 LCSH. Peter will provide the TEI format.
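As a rough sketch of the parsing step, the snippet below pulls prefLabel and altLabel values out of SKOS RDF/XML using only Python's standard library. The sample concept mirrors the MultiTes behavior described above (first SEE term as prefLabel, the rest as altLabel); the label values and `extract_labels` helper are illustrative assumptions, not the project's actual parser:

```python
import xml.etree.ElementTree as ET

SKOS = "http://www.w3.org/2004/02/skos/core#"

# Hypothetical SKOS RDF/XML fragment shaped like MultiTes output.
sample = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                     xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="#abandoned-children">
    <skos:prefLabel>Foundlings</skos:prefLabel>
    <skos:altLabel>Abandoned children</skos:altLabel>
    <skos:altLabel>Orphans and orphan-asylums</skos:altLabel>
  </skos:Concept>
</rdf:RDF>"""


def extract_labels(xml_text):
    """Collect prefLabel/altLabel text for every skos:Concept in the document."""
    root = ET.fromstring(xml_text)
    concepts = []
    for concept in root.iter(f"{{{SKOS}}}Concept"):
        concepts.append({
            "prefLabel": [e.text for e in concept.findall(f"{{{SKOS}}}prefLabel")],
            "altLabel": [e.text for e in concept.findall(f"{{{SKOS}}}altLabel")],
        })
    return concepts


labels = extract_labels(sample)
print(labels)
```

Checking how many prefLabel entries each concept carries this way would make it easy to spot where the multiple-SEE-term question arises across the full file.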

Additionally, earlier today the LEADS-4-NDP 1-minute madness was held, where I presented the progress of the project to my co-fellows and the LEADS-4-NDP advisory board.