This week I explored the multi-level alignment idea further, and I am now almost convinced that we can apply it to a ‘dataset merging’ problem.
The dataset merging idea is not new. For example, one paper from my PhD advisor briefly discusses how to merge taxonomic data: Towards Best-effort merge on taxonomically organized data
Our group at UIUC (in collaboration with systematics experts from ASU), however, has mainly been working on aligning the taxonomic names themselves rather than on ‘dataset merging’.
For the dataset merging idea, our proposal is pretty simple.
If we can align taxonomic names, we should also be able to align other things in the dataset such as spatial information (in our case, countries/areas).
Naturally, finding the intersection between my project site, the Academy of Natural Sciences, and my interest in taxonomy became the priority for this week. The task I set for myself was to find a species that is endemic to or widespread across Taiwan (my geographical point of interest) and that also happens to appear somewhere in the text of either the proceedings or the journals of the Academy of Natural Sciences.
The quest went on with me fascinated (and slightly sidetracked) by the orchid populations and varieties found in Taiwan. To my surprise, one news article (in Chinese) mentioned that Taiwan has more than 0.9 billion moth orchids!
I then sketched out our dataset merging idea, starting with the orchids:
Basically, the idea is that if we have two occurrence datasets on orchids, we can merge them as in the figure shown above, treating each column as its own ‘taxonomy alignment problem’.
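To make the column-by-column idea a little more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the column names (`taxon`, `area`), the sample records, and the helper functions `align_column` and `merge_datasets` are hypothetical, and each alignment is reduced to a simple lookup table that treats every mapping as a one-to-one equivalence. Real taxonomic alignments are often not that clean, so this is only a rough picture of the idea, not our actual method.

```python
from typing import Dict, List

Record = Dict[str, str]


def align_column(records: List[Record], column: str,
                 alignment: Dict[str, str]) -> List[Record]:
    """Rewrite one column through its alignment table.

    Values with no entry in the table are kept unchanged.
    """
    return [{**r, column: alignment.get(r[column], r[column])} for r in records]


def merge_datasets(a: List[Record], b: List[Record],
                   alignments: Dict[str, Dict[str, str]]) -> List[Record]:
    """Map dataset b onto dataset a's vocabulary, one column at a time,
    then take the union of the records (dropping exact duplicates)."""
    for column, alignment in alignments.items():
        b = align_column(b, column, alignment)
    merged: List[Record] = []
    for record in a + b:
        if record not in merged:
            merged.append(record)
    return merged


# Made-up occurrence records, for illustration only.
dataset_a = [{"taxon": "Phalaenopsis aphrodite", "area": "Taiwan"}]
dataset_b = [
    {"taxon": "P. aphrodite", "area": "Formosa"},
    {"taxon": "Phalaenopsis equestris", "area": "Formosa"},
]

# One alignment per column: each is its own small 'alignment problem'.
alignments = {
    "taxon": {"P. aphrodite": "Phalaenopsis aphrodite"},
    "area": {"Formosa": "Taiwan"},
}

merged = merge_datasets(dataset_a, dataset_b, alignments)
for record in merged:
    print(record)
```

Once both the name column and the area column have been aligned, the first record of `dataset_b` becomes identical to the record in `dataset_a`, so the merge keeps it only once.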
Just as I was almost set on going with the beautiful orchid flowers, I finally turned back to BHL, searched for the keyword “Taiwan”, and set the Titles filter to “Academy of Natural Sciences”. This is when I found a whole new world of Mollusca (snails)!
The entry at the intersection of “Taiwan” and “ANS” comes from the Proceedings of the Academy of Natural Sciences, v.57, 1905, and the title of the page/chapter is “Catalogue of the Land and Fresh-water Mollusca of Taiwan (Formosa) with descriptions of new species”.
As the BHL search interface above shows, the scientific names on this page were also extracted and listed in the bottom left corner. With this breakthrough on the Mollusca (possibly endemic to Taiwan), I will begin working with these species on the dataset merging idea next week!