News & Events

Metadata Research Center: Data Science Foundations Carpentry

The Metadata Research Center is hosting a Data Science Foundations Carpentry on January 23rd, 2020 at Drexel University’s College of Computing and Informatics.

Date: January 23rd, 2020
Time: 8am-4:30pm
Location: 3675 Market Street, Philadelphia, PA
Room: TBA
Registration Link: TBA
Event Page: View Here

The workshop will cover:

  • GitHub basics
  • OpenRefine
  • APIs
  • Best data practices
  • Jargon-busting

The full carpentry abstract is provided below.


Jane Greenberg and Xia Lin Present LEADS-4-NDP Updates at ALISE ’19

On September 25th, Jane Greenberg and Xia Lin participated in a juried panel on “Curricula Models and Resources Along the Data Continuum: Lessons learned in the development and delivery of research data management and data science education” at the annual ALISE conference in Knoxville, Tennessee. Jane and Xia presented updates on the LEADS-4-NDP program. LEADS Team members include: Jane Greenberg, Xia Lin, Weimao Ke, Il-Yeol Song, Jake Williams, Erjia Yan, Liz Costello, and Sam Grabus.

Presentation slides can be viewed here.


Sonia Pascua presents at DCMI and NKOS in Seoul, South Korea

MRC doctoral student Sonia Pascua presented her research at the Dublin Core Metadata Initiative conference (DCMI 2019) in Seoul, South Korea, on September 23rd, 2019. Sonia presented a short paper co-authored with MRC’s Kai Li, titled “Toward A Metadata Activity Matrix: Conceptualizing and Grounding the Research Life-cycle and Metadata Connections.”

Sonia Pascua discussing the addition of the 1910 LCSH to the HIVE platform at NKOS 2019.

Sonia also presented at the Networked Knowledge Organization Systems (NKOS 2019) Workshop at the National Library of Korea, Seoul, South Korea on September 25th, 2019. Sonia presented a long paper co-authored with Jane Greenberg, Peter Logan, and Joan Boone, entitled “SKOS of the 1910 Library of Congress Subject Heading for the Transformation of the Keywords to Controlled Vocabulary of the Nineteenth-Century Encyclopedia Britannica.” Sonia’s NKOS presentation slides may be viewed here.


Sonia Pascua Presents at the University of the Philippines

MRC doctoral student Sonia Pascua presented at the Lecture and Workshop on Linked Data at the University of the Philippines School of Library and Information Studies on September 18th, 2019. Sonia presented her research on SKOS and the 1910 Library of Congress Subject Headings, as part of her LEADS-4-NDP summer fellowship.

Sonia Pascua presenting at the University of the Philippines iSchool.


LEADS Forum: January 24th, 2020

After two successful years of the LEADS-4-NDP program, the Metadata Research Center and Drexel CCI will host a LEADS forum on Friday, January 24th, here at Drexel University.

2018 LEADS cohort

LEADS-4-NDP 2019 cohort
2019 LEADS cohort

This event is an opportunity for LEADS advisory board members, mentors, and fellows from both cohorts of the LEADS program to get together. The forum will include a panel of project mentors, student presentations, breakout groups, and an opportunity to discuss different models for continuing the LEADS program.

What: LEADS-4-NDP Forum
Date: January 24th, 2020
Time: 10am – 3pm
Where: 3675 Market St, Quorum, floor 2
Drexel University
Philadelphia, PA
Forum agenda: TBA.

LEADS Blog

Wrap Up

This has been quite an amazing experience for me, and I am very grateful for the opportunity.

As noted in my previous posts, my task was to find a method or approach for matching terms to similar terms in the primary vocabularies and for making the terminology more consistent to support analytics.
I explored two methods for term matching.

Method 1

The first method utilized OpenRefine and its reconciliation services via the API of the focus vocabulary. A Python script matched terms in the DPLA dataset with terms from LCSH, LCNAF, and AAT. This method is very time-consuming: a small sample of the dataset, consisting of about 796,508 terms, took about 5-6 hours to process and returned only about 16% matching terms, all of them exact matches. While this method can certainly be used to find exact matches, testing should be done to ascertain whether the slow speed was due to the machine and connection specs of the testing machine. The method did not prove useful for fuzzy matches, however: variant and compound terms were completely ignored unless they matched exactly. Below is an example of the results returned through the reconciliation process.
[Image: sample results from the reconciliation process]
The scripts used for reconciliation are open source and freely available via GitHub and may be used and modified to suit the needs of the task at hand.
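For reference, a reconciliation service of this kind accepts terms batched into a JSON query object. The sketch below builds such a payload in Python; the structure follows the OpenRefine reconciliation protocol, but the parameter choices are illustrative, and the live POST to a service endpoint is omitted.

```python
import json

def build_recon_queries(terms, query_type=None, limit=3):
    """Build a reconciliation-API query batch (one entry per term).

    The OpenRefine reconciliation protocol expects a JSON object
    keyed "q0", "q1", ... with each value holding a query string
    and options such as a result limit and an optional type filter.
    """
    queries = {}
    for i, term in enumerate(terms):
        q = {"query": term, "limit": limit}
        if query_type:
            q["type"] = query_type
        queries[f"q{i}"] = q
    return queries

# Serialize the batch as a service would receive it; the terms here
# are examples, not actual DPLA records.
payload = json.dumps(build_recon_queries(["Philadelphia", "Kerosene lamps"]))
print(payload)
```

Each `qN` result returned by the service carries candidate matches with scores, which is what the reconciliation columns in the results above reflect.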
Method 2
The second method involved obtaining the data locally then constructing a workflow inside the Alteryx Data Analytics platform. To obtain the data, Apache Jena was used to convert the N-Triple files from the Library of Congress and the Getty into comma-separated values format for easy manipulation. These files could then be pulled into the workflow. 
[Image: Alteryx workflow]
The first step was data preparation and cleaning: removing leading and trailing spaces, converting all labels to lowercase, and removing extraneous characters. We then added unique identifiers and source labels to the data for use later in the process. The data was then joined on the label field to obtain exact matches. This process returned more exact-match results than the previous method on the same data, and even with the full (not sampled) dataset the entire process took a little under 5 minutes. The records that did not match were then run through a fuzzy match tool, where algorithms such as key match, Levenshtein, and Jaro, or various combinations of these, can be used to find non-exact matches.
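The Alteryx workflow itself isn't reproduced here, but the prepare-and-join steps can be sketched in plain Python. The cleaning regex and field names below are illustrative, not the actual workflow configuration.

```python
import re

def normalize(label):
    """Trim, lowercase, and strip extraneous characters from a label."""
    label = label.strip().lower()
    label = re.sub(r"[^\w\s,&()'-]", "", label)  # drop stray punctuation
    return re.sub(r"\s+", " ", label)            # collapse internal whitespace

def exact_join(dpla_terms, vocab_terms, source):
    """Join two term lists on the normalized label, tagging each hit
    with a record id and the source vocabulary."""
    vocab_index = {normalize(t): t for t in vocab_terms}
    matches = []
    for i, term in enumerate(dpla_terms):
        key = normalize(term)
        if key in vocab_index:
            matches.append({"id": i, "term": term,
                            "match": vocab_index[key], "source": source})
    return matches

hits = exact_join(["  Kerosene Lamps "], ["kerosene lamps"], "AAT")
# hits[0]["match"] == "kerosene lamps"
```

Records absent from `hits` would then be handed to the fuzzy-matching stage.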
Each algorithm returns different results, and more study is needed to determine which method, or which combination, yields the best and most consistent results.
What is true of all of the algorithms, though, is that a match score lower than 85% seems to result in matches that are not quite correct, with correct matches interspersed. Even high match scores using the character Levenshtein algorithm display this problem, with LCSH compound terms in particular. For example, [finance–law and legislation] is shown as a match with [finance–law and legislation–peru]. While these are similar, should they be considered any kind of match for the purposes of this exercise? If so, how should the match be described?
[Image: Character Levenshtein results]
[Image: Character Levenshtein results]
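To make the scoring discussion concrete, here is a plain-Python character-level Levenshtein similarity. Alteryx's fuzzy match tool computes its scores internally, so this sketch only approximates the behavior described; the point it illustrates is that appending a short subdivision such as “–peru” costs only a few edits, so the compound term still scores fairly high.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Normalized score in [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

# LCSH subdivisions written with "--" for ASCII clarity
s = similarity("finance--law and legislation",
               "finance--law and legislation--peru")
print(round(s, 2))  # 0.82 — six inserted characters out of 34
```

A threshold check like `s >= 0.85` would reject this particular pair, but slightly shorter subdivisions push similar pairs above any fixed cutoff, which is exactly the compound-term problem noted above.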

Still, despite these problems, trying various algorithms and varying the match thresholds returns many more matches than the exact-match method alone. This method also seems useful for matching terms written in the LCSH compound-term style with close matches in AAT. Below are some examples of the results.
[Image: Character: Best of Levenshtein & Jaro]
[Image: Word: Best of Levenshtein & Jaro]
In the second image, consider the kerosene lamps example. In the DPLA data, the term seems to have been labeled using the LCSH format as [lamp–kerosene], but the algorithm shows it as a close match with the term [lamp, kerosene] in AAT.
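One lightweight preprocessing idea this example suggests (a hypothetical helper, not part of the workflow described above) is rewriting LCSH-style subdivision separators into AAT's comma style before matching, so such pairs become exact matches:

```python
def lcsh_to_aat_style(term):
    """Rewrite an LCSH-style compound ("lamp--kerosene") into the
    comma-separated style AAT tends to use ("lamp, kerosene")."""
    return ", ".join(part.strip() for part in term.split("--"))

print(lcsh_to_aat_style("lamp--kerosene"))  # lamp, kerosene
```

Whether this rewrite is always safe for multi-level subdivisions would need the same kind of study as the threshold question.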
The results from these algorithms need to be studied and refined further so that the best results can be obtained consistently. I hope to look at these results in more depth for a paper or conference at some point and to come up with a recommended, usable workflow.
This is where I was at the end of the ten weeks, and I am hoping to find time to look deeper into this problem. I welcome any comments or thoughts, and again I want to say how grateful I am for the opportunity to work on this project.

Julaine Clunis