On Friday, October 11th, Jane Greenberg presented with Peter Logan about HIVE, historical controlled vocabularies, and Temple University’s 19th-Century Knowledge Project at the Digital Hermeneutics conference, held at the German Historical Institute.
MRC doctoral student Sam Grabus gave a guest lecture on October 8th in the INFO: 821 Foundations of Information Science doctoral seminar. Sam presented her data sharing research, “Toward a Metadata Framework for Sharing Sensitive and Closed Data: An Analysis of Data Sharing Agreement Attributes.”
Presentation slides can be viewed here.
On September 25th, Jane Greenberg and Xia Lin participated in a juried panel on “Curricula Models and Resources Along the Data Continuum: Lessons learned in the development and delivery of research data management and data science education” at the annual ALISE conference in Knoxville, Tennessee. Jane and Xia presented updates on the LEADS-4-NDP program. LEADS team members include Jane Greenberg, Xia Lin, Weimao Ke, Il-Yeol Song, Jake Williams, Erjia Yan, Liz Costello, and Sam Grabus.
Presentation slides can be viewed here.
MRC doctoral student Sonia Pascua presented her research at the Dublin Core Metadata Initiative conference (DCMI 2019) in Seoul, South Korea, on September 23rd, 2019. Sonia presented a short paper co-authored with MRC’s Kai Li, titled “Toward A Metadata Activity Matrix: Conceptualizing and Grounding the Research Life-cycle and Metadata Connections.”
Sonia also presented at the Networked Knowledge Organization Systems (NKOS 2019) Workshop at the National Library of Korea, Seoul, South Korea on September 25th, 2019. Sonia presented a long paper co-authored with Jane Greenberg, Peter Logan, and Joan Boone, entitled “SKOS of the 1910 Library of Congress Subject Heading for the Transformation of the Keywords to Controlled Vocabulary of the Nineteenth-Century Encyclopedia Britannica.” Sonia’s NKOS presentation slides may be viewed here.
MRC doctoral student Sonia Pascua presented at the Lecture and Workshop on Linked Data at the University of the Philippines School of Library and Information Studies on September 18th, 2019. Sonia presented her research on SKOS and the 1910 Library of Congress Subject Headings, conducted as part of her LEADS-4-NDP summer fellowship.
After two successful years of the LEADS-4-NDP program, the Metadata Research Center and Drexel CCI will host a LEADS forum on Friday, January 24th, here at Drexel University.
This event is an opportunity for LEADS advisory board members, mentors, and fellows from both cohorts of the LEADS program to get together. The forum will include a panel of project mentors, student presentations, breakout groups, and an opportunity to discuss different models for continuing the LEADS program.
What: LEADS-4-NDP Forum
Date: January 24th, 2020
Time: 10am – 3pm
Where: 3675 Market St, Quorum, floor 2
Forum agenda: TBA.
In this project, we applied network analysis and community detection methods to identify meaningful publisher clusters based on the ISBN publisher codes they use. From my perspective, this unsupervised learning approach was selected because no large-scale baseline test has been conducted, so a supervised approach using real-world data is not possible.
In the end, we obtained yearly publisher clusters that hopefully reflect the relationships between publishers in a given year. That said, community detection methods are difficult to combine with temporal considerations. The year may not be a fully meaningful unit for analyzing how publishers are connected to each other (the relationship between any two publishers may well change in the middle of a given year), but we still hope this approach to publisher clusters will generate more granular results than pooling the data from all years. The next step, which turned out to be much more substantial than expected, is to evaluate the results manually. Hopefully this project will be published in the near future.
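The yearly clustering idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used in the project: the record layout `(year, publisher, isbn_publisher_code)` and the function name `yearly_publisher_clusters` are hypothetical, and plain connected components (publishers linked when they share a code in the same year) stand in for a real community detection algorithm.

```python
from collections import defaultdict


def yearly_publisher_clusters(records):
    """Group publishers per year by shared ISBN publisher codes.

    `records` is an iterable of (year, publisher, isbn_publisher_code)
    tuples. This layout is an assumption made for the sketch.
    Connected components are used as a simple stand-in for
    community detection.
    """
    # year -> code -> set of publishers that used the code that year
    by_year = defaultdict(lambda: defaultdict(set))
    for year, publisher, code in records:
        by_year[year][code].add(publisher)

    clusters = {}
    for year, code_map in by_year.items():
        parent = {}  # union-find forest for this year only

        def find(x):
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x

        def union(a, b):
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

        # Publishers sharing a code in this year fall into one cluster.
        for publishers in code_map.values():
            ordered = sorted(publishers)
            find(ordered[0])  # register single-publisher codes too
            for other in ordered[1:]:
                union(ordered[0], other)

        groups = defaultdict(set)
        for publisher in list(parent):
            groups[find(publisher)].add(publisher)
        clusters[year] = sorted(sorted(group) for group in groups.values())
    return clusters
```

Because each year is clustered independently, a link that exists for only part of a year still counts for the whole year, which is the granularity trade-off described above.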
Despite its limitations, I learned a lot from this project. This was the first time I worked with library metadata at a really large scale. As it was almost my first project too large to handle in R, I gained extensive experience using Python to process XML data. During the process, I also read a lot about the publishing industry, whose relevance to our project proved to be more than significant.
The last point above is also one that I wish I had better appreciated at the beginning of this project. The most challenging part of this project was not any technical issue, but the complexity of the reality that we aim to understand through data analysis. Publishers and imprints can mean very different things in different social and data contexts, and there are different approaches to clustering them, each giving the clusters its own meaning. My lack of appreciation of how the real publishing industry works prevented me from foreseeing the difficulties of evaluating the results. I think, in a way, this means that field knowledge is fundamental to any algorithmic understanding of this topic (or of other topics data scientists have to work on), and that any automatic method is only secondary to the final solution to this question.
Week 8-10: Re-organizing place and date information. Based on the problems that appeared in the current version of the visualizations, I performed another round of data cleaning and modification, especially for the date and geography information. With the goal of reducing the number of categories in each visualization, I merged more of the data into broader categories. For example, all city information was merged into countries, single dates (e.g., 1470) were merged into the corresponding time period (e.g., the year 1470 was merged into the 1450-1475 time period), and inconsistencies in the data across the time and geography categories were further reconciled. As demonstrated in the following example, the new version of the visualizations is “cleaner” in terms of the number of categories and is more readable. Over the last couple of weeks, I have also discussed the visualizations and the problems I had with my mentor, and we have worked together on the data merge. I’m also working on a potential poster submission to iConference 2020.
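The two merges described above can be sketched roughly as follows. This is a minimal illustration, not the cleaning script used in the project: the function names, the 25-year bin width (inferred from the 1450-1475 example), the choice to place a boundary year such as 1475 in the later period, and the sample city-to-country table are all assumptions.

```python
def to_period(year, width=25):
    """Map a single year to a fixed-width period label.

    Assumes periods start on multiples of `width`, so 1470 falls in
    "1450-1475". A boundary year such as 1475 goes to the later period.
    """
    start = year - (year % width)
    return f"{start}-{start + width}"


# Hypothetical lookup table; a real one would come from the dataset.
CITY_TO_COUNTRY = {
    "Venice": "Italy",
    "Mainz": "Germany",
}


def to_country(place):
    """Replace a known city with its country; leave other values unchanged."""
    return CITY_TO_COUNTRY.get(place, place)
```

Applying `to_period` and `to_country` to the date and place columns before plotting reduces the number of distinct categories each visualization has to show.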