Week 2: Understanding the limitations of data – What we can’t do

LEADS site: Repository Analytics & Metrics Portal



After developing some visualizations to understand the relationships between columns in the RAMP dataset, we had a follow-up meeting to discuss the results.
The visualizations I discussed at the meeting focus on aggregations of categorical values in the RAMP dataset, including the number of visits for each index and each domain name (URL), the number of visitors for citable and non-citable content, the number of visits by user device, and histograms for position, clicks, and clickThrough.
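As a minimal sketch of what these aggregations look like in code, here is a pandas example; the column names mirror the fields mentioned above but are assumptions about the actual RAMP schema:

```python
import pandas as pd

# Toy stand-in for the RAMP usage data; column names follow the fields
# discussed above (device, citableContent, clicks, position), but the real
# schema may differ.
ramp = pd.DataFrame({
    "device": ["desktop", "mobile", "desktop", "tablet"],
    "citableContent": ["Yes", "No", "Yes", "Yes"],
    "clicks": [3, 1, 2, 5],
    "position": [1.2, 8.5, 3.3, 2.0],
})

# Number of visits (rows) per device category
visits_by_device = ramp.groupby("device").size()

# Total clicks for citable vs. non-citable content
clicks_by_citable = ramp.groupby("citableContent")["clicks"].sum()

print(visits_by_device)
print(clicks_by_citable)

# A histogram of a continuous field such as position would then be, e.g.:
# ramp["position"].plot.hist(bins=20)
```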
In the meeting, we also discussed the possibility of incorporating external data such as metadata for each index. One of our mentors, Jonathan, has been trying to merge metadata into the older RAMP dataset period (2018), and we can also extract metadata from the new dataset that we want to focus on analyzing.
What I will do next for this dataset is extract the metadata and enrich the data, so we can understand more about user behavior through the metadata and form a research question to focus on for the RAMP dataset.
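The enrichment step could look roughly like the following pandas sketch, where the join key (url) and the metadata fields are hypothetical:

```python
import pandas as pd

# Hypothetical usage records keyed by item URL
usage = pd.DataFrame({
    "url": ["/item/1", "/item/2"],
    "visits": [120, 45],
})

# Hypothetical metadata extracted for the same items (subjects, titles, etc.)
metadata = pd.DataFrame({
    "url": ["/item/1", "/item/2"],
    "subject": ["biology", "history"],
})

# A left join keeps every usage row even when metadata is missing
enriched = usage.merge(metadata, on="url", how="left")
print(enriched)
```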
Nikolaus Parulian



Jamillah Gabriel: From Relocation to Internment to Detention (and Everything in Between)

In the past couple of weeks, a flurry of articles have been published about concentration camps and their place in American society and history. My mentor shared them with me and I have found them useful in contextualizing my work with the Japanese American internment cards. I’m reminded of how my LEADS project and the data I’m working with are still relevant today, when concentration camps can’t be relegated to the past and, in fact, are very much a reincarnated racist reality in the present. Three of the four articles sent to me (listed below) connect the history of Japanese American internment camps with current issues around the detention camps that have been implemented to detain migrant children crossing the border from Mexico, and highlight the fact that this, unfortunately, is history repeating itself. For instance, Ft. Sill, which is now a migrant detention center, was founded in 1869 and was once “a relocation camp for Native Americans, a boarding school for Native children separated from their families, and an internment camp for 700 Japanese American men in 1942” (Hennessy-Fiske, 2019). Its unmitigated and irreconcilable history is a continued legacy of racial difference, segregation, and discrimination. All of the articles reinforce the importance of this project that I (and two other LEADS fellows before me) am working on, but the last piece, written by the granddaughter of a survivor of the Japanese American incarcerations, is truly the most motivating factor for this work: so that former internees and their family members can know their own histories.




Friedman, M. (2019, June 19). American concentration camps: A history lesson for Liz Cheney. The Typescript. Retrieved from http://thetypescript.com/american-concentration-camps-a-history-lesson-for-liz-cheney

Hennessy-Fiske, M. (2019, June 22). Japanese internment camp survivors protest Ft. Sill migrant detention center. Los Angeles Times. Retrieved from https://www.latimes.com/nation/la-na-japanese-internment-fort-sill-2019-story.html

Provost, L. (2019, June 22). Prepared for arrest: Japanese-Americans protest at Fort Sill over incoming migrant children. The Duncan Banner. Retrieved from https://www.duncanbanner.com/news/prepared-for-arrest-japanese-americans-protest-at-fort-still-over/article_789070aa-9542-11e9-8107-9fcd6387dce9.html

Sakurai, C. (2019, June 25). More than a name in the census: Piecing together the story of my grandmother’s life. National Japanese American Historical Society. Retrieved from https://www.facebook.com/notes/national-japanese-american-historical-society/more-than-a-name-in-the-census-piecing-together-the-story-of-my-grandmothers-lif/2679119588783598


Jamillah R. Gabriel, PhD Student, MLIS, MA
School of Information Sciences
University of Illinois at Urbana-Champaign



Week 2: Elaborating on our multi-level alignment idea and an initial exploration on the BHL collection

This week I explored the multi-level alignment idea further, and I am now fairly convinced that we can frame it as a ‘dataset merging’ problem.
The dataset merging idea is not new. For example, a paper from my PhD advisor’s group briefly discusses how to merge taxonomic data: Towards Best-effort Merge on Taxonomically Organized Data.
But our group at UIUC (in collaboration with systematics experts from ASU) has mainly been working on the actual taxonomic name alignment rather than on ‘dataset merging’.
For the dataset merging idea, our proposal is pretty simple.
If we can align taxonomic names, we should also be able to align other things in the dataset such as spatial information (in our case, countries/areas).
Naturally, finding the intersection between my project site, the Academy of Natural Sciences, and my interest in taxonomy became the priority for this week. The task I set for myself was to find a species that is endemic to or popular across Taiwan (my geographical point of interest), and that also happens to appear somewhere in the text of either the proceedings or the journals of the Academy of Natural Sciences.
The quest went on with me fascinated (and slightly sidetracked) by all the orchid populations and varieties Taiwan has. To my surprise, one news article (in Chinese) mentioned that Taiwan has more than 0.9 billion moth orchids!
(image source: britannica.com)
Then I went on to build our dataset merging idea around the orchids:
Basically, the idea is that if we have two occurrence datasets on orchids, then we can do the dataset merging with the two datasets like the figure shown above, with each column being one ‘taxonomy alignment problem’.
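A minimal pandas sketch of that per-column idea follows; the datasets, name spellings, and alignment mappings below are invented for illustration:

```python
import pandas as pd

# Two hypothetical orchid occurrence datasets that use different conventions
ds_a = pd.DataFrame({"name": ["Phalaenopsis aphrodite"], "region": ["Taiwan"]})
ds_b = pd.DataFrame({"name": ["P. aphrodite"], "region": ["Formosa"]})

# Each column gets its own alignment mapping: one 'taxonomy alignment
# problem' per column. These mappings are illustrative, not authoritative.
name_alignment = {"P. aphrodite": "Phalaenopsis aphrodite"}
region_alignment = {"Formosa": "Taiwan"}

ds_b = ds_b.assign(
    name=ds_b["name"].replace(name_alignment),
    region=ds_b["region"].replace(region_alignment),
)

# After aligning column by column, merging reduces to a union of records
merged = pd.concat([ds_a, ds_b], ignore_index=True).drop_duplicates()
print(merged)
```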
Just as I was almost set on going with the beautiful orchid flowers, I finally turned back to BHL to search the keyword “Taiwan” with the Titles filter set to “Academy of Natural Sciences”. This is when I found a whole new world of Mollusca (snails)!
The entry that returned results for the intersection of “Taiwan” and “ANS” is from the Proceedings of the Academy of Natural Sciences, v. 57, 1905, and the title of the page/chapter is “Catalogue of the Land and Fresh-water Mollusca of Taiwan (Formosa) with descriptions of new species”.
As the BHL search interface above shows, the scientific names on this page were also extracted and shown in the bottom left corner. Having made this breakthrough with the Mollusca (possibly endemic to Taiwan), I will begin to work with these species on the dataset merging idea next week!
Yi-Yun Cheng
PhD student, Research Assistant
School of Information Sciences, University of Illinois at Urbana-Champaign
Twitter: @yiyunjessica



Week 2-3: Sonia Pascua, I am one of the “Mixes” in the Metadata Mixer

LEADS site: Digital Scholarship Center
Project title: SKOS of the 1910 Library of Congress Subject Heading


On June 13, 2019, I presented our LEADS-4-NDP project at the Metadata Mixer.

I started my lightning talk by discussing the bigger picture of our project.

The Digital Scholarship Center has an ongoing project, the Nineteenth-Century Knowledge Project, which is building the most extensive open digital collection available today for studying the structure and transformation of nineteenth-century knowledge, using historic editions of the Encyclopedia Britannica. The project is making significant progress toward establishing controlled vocabulary terms for metadata consistency and interoperability, and it utilizes the vocabularies in HIVE, especially LCSH.

Our project works on the SKOS-ification of the 1910 LCSH.

The hypothesis we would like to explore is that there may be a gap, which we call a “vocabulary divide,” between the vocabularies of the past and the present. With the current version of LCSH (2016) in HIVE, we aim to include the 1910 version of LCSH to support research using resources from the past, especially nineteenth-century knowledge.

Above is our conceptual model. As shown, the 1910 LCSH would be digitized to a text format for easy manipulation of words. Then from the text, be it in CSV, XLS, or DocX format, the RDF/XML format is constructed for HIVE integration. Once the 1910 LCSH is in HIVE, it can be used as a tool for automatic indexing.
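As a rough sketch of the text-to-RDF/XML step, here is how a single digitized 1910 heading might be rendered as a SKOS concept using Python's standard library; the URI scheme is a placeholder, and the real HIVE integration may expect different conventions:

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("skos", SKOS_NS)

# One digitized 1910 heading, e.g. a CSV row ("absorption", "Absorption"),
# becomes one skos:Concept (the base URI below is hypothetical)
root = ET.Element(f"{{{RDF_NS}}}RDF")
concept = ET.SubElement(
    root,
    f"{{{SKOS_NS}}}Concept",
    {f"{{{RDF_NS}}}about": "http://example.org/lcsh1910/absorption"},
)
label = ET.SubElement(concept, f"{{{SKOS_NS}}}prefLabel")
label.text = "Absorption"

# Serialize to RDF/XML for HIVE integration
rdf_xml = ET.tostring(root, encoding="unicode")
print(rdf_xml)
```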
In the 5-minute talk, I was able to present the proof of concept.
We formulated use cases based on the two data sets, the 1910 LCSH and the 2016 LCSH. Four scenarios were devised for data analysis. The gap, or “vocabulary divide,” is verified and validated by these use cases.
A simulation was conducted on the word “Absorption.” An article about the sun was taken from the 1911 Encyclopedia Britannica and subjected to text analysis using TagCrowd, which extracted the frequencies of the words in the article. For subject cataloging, which was done manually, descriptors were selected to represent the ABOUTNESS of the article; the 1910 LCSH was used for indexing, and a vocabulary was generated. The same process was then executed using the 2016 LCSH in HIVE for automatic indexing. The case study fell under scenario 2, meaning the word “Absorption” intersected both data sets; thus the word existed from 1910 through 2016.
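The TagCrowd-style frequency extraction can be approximated in a few lines of Python, shown here on a stand-in sentence rather than the actual Britannica article:

```python
from collections import Counter
import re

# Stand-in text; in the real analysis this would be the full 1911
# Encyclopedia Britannica article on the sun
text = "The sun radiates energy; absorption of that energy warms the earth."
words = re.findall(r"[a-z]+", text.lower())

# Drop very short tokens for a cleaner frequency list (a crude stopword filter)
freq = Counter(w for w in words if len(w) > 3)
print(freq.most_common(3))
```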

Rongqian Ma; Week 1-2: Getting familiar with the project and exploring the initial dataset

My LEADS fellowship placement is with the University of Pennsylvania Libraries, Digital Research Services. The project this year aims to visualize a digitized collection of book of hours manuscripts produced in medieval Europe. The major idea behind the project is to better introduce and communicate this specific genre of book production to audiences, using visual forms and languages.

During the Drexel University boot camp, June 6-8, I made the best use of my time to visit the UPenn Library and meet with my project mentor, Ms. Dot Porter. We discussed the project goals and the major tasks needed to successfully deliver the project. We identified two possible ways to present and share our major project outcomes: one as a research paper, and the other as an interactive website displaying and communicating the visualizations.

I spent the first week of the LEADS project getting familiar with the “book of hours” as a genre and an artifact, reading secondary sources recommended by my mentor. Reading those materials gave me a better understanding of the book of hours in terms of its history, major characteristics, and uniqueness in the religious life of the middle ages, which has been helpful for thinking of ways to visualize the manuscript data. Week 2 was mostly spent browsing the dataset and proposing visualization strategies. The initial book of hours dataset contains information on 185 digitized manuscripts, including their dates of production, the provenance of production and circulation, the contents (i.e., passages of prayer), and the decorations. Thinking about visualization strategies, my mentor and I had a Skype check-in to discuss which types of visualizations and graphs to create and some potential problems in the visualization process. I also reflected on the ideas and theories from the information visualization session at the boot camp when trying to identify the most effective visualization strategies for the manuscript data. Following the discussion with my mentor, I started actually working with the initial dataset — the provenance data of manuscript productions in particular. As the visualization work goes on, I feel that each graph tends to be more complex than it appears, and manuscript data visualization is quite a craft.


Alyson Gamble: So it begins…

My name is Alyson Gamble and I’m a doctoral student from Simmons University. My placement in LEADS-4-NDP is with the Historical Society of Pennsylvania. Before the LEADS boot camp at Drexel, I was able to spend half a day at my host site. My mentor, Caroline Hayden, gave me a great tour of the HSP’s buildings and collections. I met other HSP employees who were part of the project last year. Being able to visit my host site in person helped acquaint me with both the people and the collections. I’ll be focusing on historical public school records from Philadelphia.
Figure 1. Picture of the historical marker for the Historical Society of Pennsylvania
The boot camp itself was very informative. Since I was not familiar with all of the concepts we discussed, I made sure to remember that this was an educational opportunity and to recognize that I don’t (and won’t) know everything. I enjoyed the presenters’ lessons, especially ones with a strong data visualization component. From my past experience and research, I’ve learned how important visualizations are for making data understandable. With the right visualization, a person can gain insight into data that they wouldn’t otherwise notice. Right now, I’m very fond of The Pudding (https://pudding.cool/) for data journalism; one of my students from my previous life as a science librarian, Caitlyn Ralph, is one of the site’s stars and I adore all of the work that she and others do on the publication. My favorite data visualization tool is currently Tableau, which I learned a lot about from Jess Cohen-Tanugi at Harvard’s Lamont Library. It’s pretty easy to use and makes nice dataviz. I’m especially fond of the idea of using Tableau as a sandbox for determining what kind of visuals will work best for a data set before creating those visuals in another program like R.
Figure 2. A picture of the final day of bootcamp
My favorite part of the bootcamp, though, was the opportunity to meet other doctoral students and the LEADS-4-NDP staff. Since I don’t get to interact with PhD students in person very often, it was a treat to spend time with the other LEADS Fellows. I’m very excited for our time together in this program and for seeing what happens with our projects.
Figure 3. Benches in the Drexel courtyard

Jessica Cheng, Week 1: Explore the direction of our project — Multi-layer taxonomy alignment problems

Week 1 of the LEADS fellowship project started with exploring the actual problems and directions we would like to work on over the course of 10 weeks.
To identify the problems, I reviewed some literature that might guide me toward the intersection of my research interests and the Academy of Natural Sciences’ (ANS) goals. These topics include, but are not limited to, taxonomies, knowledge graphs, biodiversity, geopolitics, and knowledge organization. We also want to link this project to the Biodiversity Heritage Library (BHL) and ANS collections.
Given the conversations I had with my mentor Steve Dilliplane at both the LEADS boot camp and the NASKO 2019 conference, I came up with an interesting idea of a ‘multi-layer’ taxonomy alignment problem/framework, which may ultimately guide us toward constructing a data-driven biodiversity knowledge graph/ontology of a specific species we wish to examine.
Species occurrence datasets often contain records based on the Darwin Core metadata standard. Multi-layer in this case refers to the different fields in an occurrence dataset; these can include (again, not exclusively): species names, characters/phenotypes, habitat information, geolocations (country, city, latitude, longitude), IUCN Red List or other endangered species classifications, etc. Depending on what type of metadata the dataset actually uses, my thought is that each of these fields can itself be a taxonomy.
1st taxonomy: species names
2nd taxonomy: geographic regions (geopolitical realities may exist) 
3rd taxonomy: phenotypes
…and more
How do we proceed? BHL/ANS or co-occurrence datasets & TAP:
Say we have two different datasets from BHL/ANS about Grizzly Bears. 
Each of these fields can itself have a taxonomy alignment problem.
One dataset may only locate the Grizzly bear (species name identified by author X) in the lower 48 states, and lists the bears as endangered.
The other dataset may be an occurrence dataset of Grizzly bears (species name identified by author Y) in Alaska, where the bears are more than abundant.
1st taxonomy alignment problem (TAP): align the species names given by Author X vs. Author Y
2nd TAP: geographic regions – lower 48 states vs. Alaska
3rd TAP: endangered species list – one classification vs. another classification
In this case we can align the multiple layers in the different datasets, and each layer will yield multiple possible worlds (merged solutions).
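One way to picture how the layers combine: if each layer independently yields a set of merged solutions, then the candidate merged datasets are the combinations across layers. A toy sketch, with invented solution labels:

```python
from itertools import product

# Hypothetical merged solutions ('possible worlds') for each alignment layer
name_worlds = ["X-name equals Y-name", "X-name broader than Y-name"]
region_worlds = ["lower 48 states + Alaska (union)"]
status_worlds = ["endangered status wins", "status kept region-specific"]

# Every combination of one solution per layer is one candidate merged dataset
combined = list(product(name_worlds, region_worlds, status_worlds))
print(len(combined))  # 2 * 1 * 2 = 4 possible worlds
```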
Ontologies/knowledge graph/linked data:
If the above-mentioned approach is feasible, my guess is that for each of the possible worlds we come up with, we can then patch them together to form our own ‘grizzly bear’ ontologies/knowledge graphs. This would enable us to visualize and query them for future uses.
– Work with a particular species the Academy of Natural Sciences is most proud of?
– What does the actual dataset look like? 
– Are there any relations across different layers?
10-week rough timeline:
Week 1-2: identify the problems and research questions & come up with a 3-4 page proposal draft
Week 3-8: execute, implement the proposal 
Week 9-10: wrap up and draft deliverables 


Yi-Yun Cheng
PhD student, Research Assistant
School of Information Sciences, University of Illinois at Urbana-Champaign
Twitter: @yiyunjessica



Minh Pham, Week 1: Exploring the data

My placement is with the Repository Analytics & Metrics Portal (RAMP) project at Montana State University. Nikolaus – another LEADS fellow on the same project – provided a nice overview of the project. Thanks, Nikolaus!


Before the bootcamp, Nikolaus and I had an online meeting with our mentor, Dr. Kenning Arlitsch, and other members of the project team, who helped us understand more about the project and familiarized us with the data collected from the RAMP service. Thanks to the bootcamp, I came home filled with new knowledge about library science in general and metadata in particular, along with new techniques in database management, visualization, and analysis with text mining and machine learning methods.


For week 1, I focused on exploring the data by doing descriptive analysis and creating crude visualizations. The RAMP data covers over 50 IRs and contains over 400 million rows. Due to the amount of data and the memory constraints of my laptop, it takes R anywhere from a couple of minutes to hours to run a command or knit a document. I looked into working with RStudio Cloud, but its current version does not let us upload and work with data as large as RAMP’s. For now, I have to handle results generated from R the old-school way: copying and pasting them one by one into a Word doc, rather than knitting all results into a single document using an R notebook or R Markdown.
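One common workaround for this memory problem is chunked processing: streaming the file and aggregating as you go instead of loading everything at once. The sketch below shows the idea with pandas' chunked CSV reader (R's readr and data.table offer analogous chunked approaches); a tiny generated file stands in for the real RAMP export, whose file name and columns are placeholders:

```python
import pandas as pd

# Create a tiny stand-in file; in practice this would be the multi-gigabyte
# RAMP export with hundreds of millions of rows.
pd.DataFrame({"clicks": range(10)}).to_csv("ramp_sample.csv", index=False)

# Stream the file in chunks instead of loading it all into memory at once,
# accumulating only the aggregate we care about
total_rows = 0
for chunk in pd.read_csv("ramp_sample.csv", chunksize=3):
    total_rows += len(chunk)
print(total_rows)  # 10
```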


My plan for the 2nd week is to refine the visualizations for aesthetics and readability, and to merge the RAMP data with other data to explore research possibilities.


Minh Pham


Hanlin Zhang, LEADS Blog #1: Yamz Kickoff

June 23rd, 2019


This summer, I’m going to work with my mentor John Kunze from the California Digital Library (CDL) and another LEADS-4-NDP fellow, Bridget Disney (University of Missouri), on some awesome metadata research! What Jane Greenberg, John Kunze, and other researchers in the area of metadata standards find problematic is that when a metadata standard is being discussed and created, people (mostly domain experts) spend a relatively large amount of time discussing and setting the standards, controlled vocabularies, etc., but have little time left to test the actual performance of the standard and then revise it.


YAMZ (Yet Another Metadata Zoo) creates a unique experience, similar to Wikipedia and Stack Overflow, in the sense that the community can co-edit and vote on a standard. Our first kickoff meeting with our LEADS-4-NDP site supervisor John was on Friday. We learned that yamz.net is currently deployed on the free tier of Heroku and is going to be migrated to Amazon Web Services (AWS) this summer, and Bridget and I are going to be part of that work. I’m very excited that we will be involved in this process, and I expect to learn a lot of cool stuff.


To read more about Yamz:



The goals for next week:

  • Rewrite the README and improve its readability

  • Figure out how to remotely connect to CDL, preferably through a Drexel University Network.


Hanlin Zhang

Jamillah Gabriel: Getting Acquainted with the Data

For my LEADS project, I’ll be working with the Digital Curation Innovation Center (DCIC) at the University of Maryland on a project that examines Japanese American internment camp archival records collected over a period of four years, from 1942 to 1946. I’m really excited to work on this project because of its cultural importance and the potential impact it could have on the Japanese American community, which, up to this point, has not had access to these records. The records consist of 25,000 cards that include details such as incidents in the camp, births and deaths, entries and exits, as well as transfers between camps.


After talking with my internship mentor, Richard Marciano, I decided to work on data that might help us track the movement of the internees within and among the camps from entry to exit in hopes that it might provide some insight into their lives. Additionally, examining data about the births and deaths in the camps could provide additional context that can aid in telling a more complete story of the Japanese American citizens who were subjected to imprisonment in internment camps. While the entire scope of the project has not been fleshed out completely, the preliminary steps of the research project will include parsing through three data files, looking at the previous projects conducted by MLIS students, reading the grant application which will allow the release of key data to the public, and viewing the “Resistance at Tule Lake” documentary. After these initial steps, I’ll begin to conceptualize what this data project will look like in terms of data processing and visualization.
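As a rough illustration of the movement-tracking idea, the sketch below reconstructs one person's path through camps from entry/exit events; the records, fields, and dates here are entirely hypothetical, not drawn from the actual cards:

```python
import pandas as pd

# Hypothetical event records in the style of the internment cards: one row
# per entry/exit event, with camp and date
events = pd.DataFrame({
    "person": ["A", "A", "A"],
    "camp": ["Tule Lake", "Tule Lake", "Manzanar"],
    "event": ["entry", "exit", "entry"],
    "date": pd.to_datetime(["1942-05-01", "1943-09-15", "1943-09-20"]),
})

# Ordering each person's events by date reconstructs their movement path;
# dict.fromkeys removes consecutive duplicates while preserving order
path = (events.sort_values(["person", "date"])
              .groupby("person")["camp"]
              .apply(lambda camps: list(dict.fromkeys(camps))))
print(path["A"])  # ['Tule Lake', 'Manzanar']
```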


I’m looking forward to what this project will bring to light in the remaining weeks of the internship!


Jamillah R. Gabriel