Project
Title: CAREER: A Unified Architecture for Data Mining Large Biomedical Literature Databases
Sponsor: National Science Foundation (NSF), Award No. IIS 0448023
PI: Xiaohua Hu
Amount: $415,000
Duration: March 15, 2005 to Feb 28, 2010

Project Description
The large number of documents in biomedical literature databases and the
lack of formal structure in their natural-language narrative make searching
and processing them very difficult for many scientists involved in
bioinformatics research. This CAREER project is investigating the efficiency
and effectiveness of various data mining techniques and methods and developing
a unified framework for mining large biomedical literature databases. Currently
we are focusing on graph-based text mining techniques and methods and their
application to biomedical literature. The project is testing these methods
in real-world bioinformatics domains such as chromatin interaction networks and
microarray data analysis. The software package developed from this effort, the
Dragon Toolkit, is freely available for academic research use. Related
publications from this project are listed below.
The broader impact of this project on society is the generation of a novel
unified architecture for biomedical literature data mining. This integrated and
complementary approach has the potential to create a powerful new tool for
bioinformatics and for most text processing tasks.
This project has the potential to attract diverse collaborators who have an
interest in accessing complex biomedical or general scientific data and
information. Students are involved in this research through hands-on projects,
a Co-Op program and courses at both the graduate and undergraduate level.
Researchers involved in the project
Xiaohua Hu (Faculty, PI)
Illhoi Yoo (former Ph.D. student, graduated in June 2006, tenure-track faculty at Univ. of Missouri-Columbia)
Xiaohua Zhou (graduated in 2008, Data Analyzing Director at LYZ Capital Advisors LLC)
Xiaodan Zhang (graduated in 2009, Research Scientist at Vertex Pharmaceuticals)
Caimei Lu (Ph.D. student)
Xin Chen (Ph.D. student)
The Dragon Toolkit
The Dragon Toolkit is a Java-based development package for academic research
use in language modeling (LM) and information retrieval (IR). Language modeling
has recently emerged as an attractive new framework for text information
retrieval and text mining (TM). However, most free Java-based search engines,
such as Lucene, do not support LM very well. The Lemur toolkit is designed for
LM and IR but is written in C and C++, which may be a hindrance to people who
prefer Java programming. The Dragon Toolkit is tailored for researchers who
work on large-scale LM and IR and prefer Java programming. Moreover, unlike
Lucene and Lemur, it provides built-in support for semantic-based IR and TM.
The Dragon Toolkit integrates a set of NLP tools, which enables it to index
text collections with various representation schemes, including words,
phrases, ontology-based concepts and relationships. To minimize the learning
time, however, we intentionally keep the package small and simple. The toolkit
lacks some features that are part of the Lemur toolkit, including distributed
IR and cross-language IR.
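As a point of reference for the language modeling approach to IR mentioned above, the following is a minimal sketch of query-likelihood scoring with Jelinek-Mercer smoothing. It is plain Java and does not use the Dragon Toolkit API; the class name QueryLikelihoodSketch, its method names, the smoothing weight and the toy documents are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: a hypothetical class, not part of the Dragon Toolkit. */
public class QueryLikelihoodSketch {

    /** Counts how often each term occurs in a tokenized text. */
    public static Map<String, Integer> termCounts(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    /**
     * Returns log p(query | doc) under a unigram document model smoothed
     * with the collection model (Jelinek-Mercer):
     *   p(w | d) = (1 - lambda) * tf(w, d) / |d| + lambda * cf(w) / |C|
     * lambda is assumed to lie strictly between 0 and 1. Query terms that
     * never occur in the collection are skipped.
     */
    public static double logQueryLikelihood(List<String> query, List<String> doc,
                                            List<List<String>> collection, double lambda) {
        Map<String, Integer> docCounts = termCounts(doc);

        // Collection-wide term frequencies and total token count.
        Map<String, Integer> collectionCounts = new HashMap<>();
        long collectionLength = 0;
        for (List<String> d : collection) {
            for (String token : d) {
                collectionCounts.merge(token, 1, Integer::sum);
                collectionLength++;
            }
        }

        double logProb = 0.0;
        for (String term : query) {
            int cf = collectionCounts.getOrDefault(term, 0);
            if (cf == 0) {
                continue; // unseen in every document, so it cannot affect the ranking
            }
            double pDoc = doc.isEmpty()
                    ? 0.0
                    : (double) docCounts.getOrDefault(term, 0) / doc.size();
            double pCollection = (double) cf / collectionLength;
            logProb += Math.log((1 - lambda) * pDoc + lambda * pCollection);
        }
        return logProb;
    }

    public static void main(String[] args) {
        // Toy, pre-tokenized documents (made up for this example).
        List<String> d1 = Arrays.asList("gene", "expression", "gene");
        List<String> d2 = Arrays.asList("protein", "network");
        List<List<String>> collection = Arrays.asList(d1, d2);
        List<String> query = Arrays.asList("gene");

        System.out.println("d1: " + logQueryLikelihood(query, d1, collection, 0.5));
        System.out.println("d2: " + logQueryLikelihood(query, d2, collection, 0.5));
    }
}
```

Running main prints one log-likelihood score per toy document. The document that contains the query term scores higher, while the smoothing term keeps the other document's score finite.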
How to Cite Dragon Toolkit
If you use the Dragon Toolkit in your research, please cite it in your
published papers:
Zhou X., Zhang X., and Hu X., Dragon Toolkit: Incorporating Auto-learned
Semantic Knowledge into Large-Scale Text Retrieval and Mining, in the
Proceedings of the 2007 IEEE International Conference on Tools with Artificial
Intelligence (ICTAI), pp 197-201
http://www.dragontoolkit.org
Download Dragon Toolkit
The Dragon Toolkit source code, binary libraries (including external
libraries) and necessary supporting data are available for download.
Papers published related to this project
- Hu X., Park E.K., Zhang X., Microarray Gene Cluster Identification and
Annotation through Cluster Ensemble and EM based Informative Textual
Summarization, in IEEE Transactions on Information Technology in Biomedicine,
Sept. 2009, Vol. 13, No. 5, pp 832-840
- Hu X., Ng M., Wu F., Sokhansanj B., Mining, Modeling and Evaluation of
Sub-Network from Large Biomolecular Networks and its Comparison Study, IEEE
Transactions on Information Technology in Biomedicine, March 2009, Vol. 13,
No. 2, pp 184-194
- Hu X., Zhang X., Lu C., Park E.K., Exploring Wikipedia as External Knowledge
for Document Clustering, in the 15th ACM SIGKDD Conference on Knowledge
Discovery and Data Mining, pp 389-396
- Wu D., Hu X., He T., Exploratory Analysis of Protein Translation Regulatory
Networks Using Hierarchical Random Graphs, in the 2009 IEEE International
Conference on Bioinformatics and Biomedicine
- Chen X., Hu X., Shen X., Spatial Weighting for Bag-of-Visual-Words
Representation and Its Application in Content-Based Image Retrieval, in the
13th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD
2009), pp 867-874
- Zhang X., Hu X., Zhou X., A Comparative Evaluation of Different Link Types
on Enhancing Document Clustering, in the 31st Annual International ACM SIGIR
Conference on Research & Development on Information Retrieval (SIGIR 2008),
pp 555-562
- Achananuparp P., Hu X., Shen X., The Evaluation of Sentence Similarity
Measures, in the 2008 International Conference on Data Warehousing and
Knowledge Discovery (DaWaK 2008), pp 305-316
- Hu X., Sokhansanj B., Wu D., Tang Y., A Novel Approach for Mining and
Dynamic Fuzzy Simulation of Biomolecular Network, IEEE Transactions on Fuzzy
Systems, Vol. 15, No. 6, Dec 2007, pp 1219-1229
- Zhang X., Jing L., Hu X., Ng M., Xia J., Zhou X., Medical Document Clustering
Using Ontology Based Term Similarity Measures, in the
International Journal of Data Warehousing and Mining, 2008
- Zhou X., Zhang X., Hu X., Semantic Smoothing for Bayesian Text
Classification with Small Training Data, SIAM SDM 08
- Zhou X., Hu X., Zhang X., A Segment-based Hidden Markov Model for
Real-Setting Pinyin-to-Chinese Conversion, in the ACM CIKM 2007, pp 1027-1030
(acceptance rate: 26%, 512 submissions)
- Hu X., Wu D., Data Mining and Predictive
Modeling of Biomolecular Network from Biomedical Literature Databases,
IEEE/ACM Transactions on Computational Biology and Bioinformatics,
(March-April 2007)
- Zhou X., Hu X., Zhang X., Topic Signature Language Models for Ad-hoc
Retrieval, in the IEEE Transactions on Knowledge and Data Engineering (IEEE
TKDE)
- Hu X., Wu F.X., Ng M., Sokhansanj B., Mining and Dynamic Simulation of
Sub-Networks from Large Biomolecular Networks, in the 2007 International
Conference on Artificial Intelligence, June 25-28, Las Vegas, USA (Best Paper
Award, out of 500 submissions)
- Yoo I., Hu X., Song I-Y., A Coherent Document Clustering and Text
Summarization Approach through a Scale-free Ontology-enriched Graphical
Representation, BMC Bioinformatics
- Yoo I., Hu X., Song I-Y., Biomedical Ontology Improves Biomedical Literature
Clustering Performance: A Comparison Study, in the International Journal of
Bioinformatics Research and Application
- Song M., Song I-Y., Hu X., Allen B., Integration of Association Rules and
Ontology for Semantic Query Expansion, in the Journal of Data and Knowledge
Engineering (DKE)
- Zhou X., Zhang X., Hu X., Semantic Smoothing of Document Models for
Agglomerative Clustering, accepted in the Twentieth International Joint
Conference on Artificial Intelligence (IJCAI 07), Hyderabad, India, Jan 6-12,
2007
- Zhang X., Jing L., Hu X., Ng M., Zhou X., A Comprehensive Study of
Ontology Based Term Similarity Measures on Document Clustering, accepted in the
12th International Conference on Database Systems for Advanced
Applications (DASFAA 2007)
- Zhou X., Hu X., Lin X., Zhang X., Relation-based Document Retrieval for
Biomedical IR, in LNCS Computational Systems Biology, Vol. 4, pp 112-128
- Huang Z., Li Y., Hu X., Anti-parallel Coiled Coils Structure Prediction by
Support Vector Machine Classification, in LNCS Transactions on Systems
Biology, Vol. 4, pp 1-8
- Hu X., Lin T.Y., Song I-Y., Lin X., Yoo I., Song M., A Semi-supervised
Efficient Learning Approach to Extract Biological Relationships from Web-based
Biomedical Digital Library, in International Journal of Web Intelligence and
Agent System, Vol. 4, No. 3, 2006
- Zhou X., Hu X., Zhang X., Lin X., Song I-Y., Context-Sensitive Semantic
Smoothing for the Language Modeling Approach to Genomic IR, in the Proceedings
of the 29th Annual International ACM SIGIR Conference on Research &
Development on Information Retrieval (SIGIR 2006), pp 170-177 (acceptance
rate: 18.5%, 74/399)
- Hu X., Zhang X., Yoo I., Zhang Y-Q., A Semantic Approach for Mining Hidden
Links from Complementary and Non-Interactive Biomedical Literature, in the
Proceedings of the 6th SIAM International Conference on Data Mining (SIAM SDM
06), April 20-22, 2006, Bethesda, MD, USA, pp 200-209 (acceptance rate: 16%,
40/244)
- Hu X., Zhang X., Zhou X., Integration of Cluster Ensemble and EM based Text
Mining for Microarray Gene Cluster Identification and Annotation, in the
Proceedings of the 15th ACM Conference on Information and Knowledge Management
(ACM CIKM 2006), poster paper (537 submissions, 15% acceptance rate for full
papers, 10% acceptance rate for poster papers)
- Zhang X., Zhou X., Hu X., Semantic Smoothing for Model-based Document
Clustering, accepted in the 2006 IEEE International Conference on Data Mining
(IEEE ICDM06), Dec. 18-22, 2006, Hong Kong (800 submissions, acceptance rate:
20%)
- Hu X., Constructing Ensembles of
Classifiers for Data Mining Applications based on Rough Set Theory and
Set-Oriented Database Operations, in the Proceedings of the 2006
IEEE International Conference on Granular Computing (IEEE GrC 2006),
Atlanta, GA, May 15-17, 2006, pp 67-73 (acceptance rate: 15%, 49/321)
- Yoo I., Hu X., Song I-Y., Integration of Semantic-based Bipartite Graph
Representation and Mutual Refinement Strategy for Biomedical Literature
Clustering, in the Proceedings of the 12th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining (SIGKDD 2006), short paper, pp 791-796
(acceptance rate for full papers: 11%, acceptance rate for short papers: 12%,
50 full papers and 55 short papers out of 457 submissions)
- Yoo I., Hu X., A Comprehensive Comparison Study of Document Clustering for
A Biomedical Digital Library MEDLINE, in the Proceedings of the 2006 ACM/IEEE
Joint Conference on Digital Library (JCDL 2006), June 11-15, 2006, Chapel
Hill, NC, USA, pp 220-229 (acceptance rate: 15%, 28/188)
- Yoo I., Hu X., Clustering Ontology-enriched Graph Representation for
Biomedical Documents based on Scale-Free Network Theory, accepted in the IEEE
Conference on Intelligent Systems (IEEE IS06), Sept 4-6, 2006 (acceptance
rate: 16.7%, 100/600)
- Zhou X., Zhang X., Hu X., Using Concept-based Indexing to Improve Language
Modeling Approach to Genomic IR, in the Proceedings of the 28th European
Conference on Information Retrieval (ECIR 2006), pp 444-455 (acceptance rate:
20%, 37/178)
- Zhu W., Xu X., Hu X., Song I-Y., Allen B., Using UMLS-based Re-Weighting
Terms as a Query Expansion Strategy, in the Proceedings of the 2006 IEEE
International Conference on Granular Computing (IEEE GrC 2006), Atlanta, GA,
May 15-17, 2006, pp 217-222 (acceptance rate: 15%, 49/321)
- Zhou X., Hu X., Lin X., Han H., Zhang X., Relational-based Document
Retrieval for Biomedical Literature Databases, in the Proceedings of the 11th
International Conference on Database Systems for Advanced Applications (DASFAA
2006), pp 689-701 (acceptance rate: 25%, 47/188)
- Yoo I., Hu X., Clustering Large Collection of Biomedical Literature based
on Ontology-enriched Bipartite Graph Representation and Mutual Refinement
Strategy, in the Proceedings of the 10th Pacific-Asia Conference on Knowledge
Discovery and Data Mining (PAKDD 2006), pp 303-312 (acceptance rate: 20%,
100/500)
- Wu C., Hu X., Yang X., Yang J., Expanding Tolerance RST Models based on
Cores of Maximal Compatible Blocks, accepted as full paper in the 5th
International Conference on Rough Sets and Current Trends in Computing (RSCTC
2006) (acceptance rate: 27.4%, 91/332)
- Zhang X., Wu D., Zhou X., Hu X., A Language Modeling Text Mining Approach
to the Annotation of Protein Community, accepted in the Proceedings of the 6th
IEEE Symposium on Bioinformatics and Bioengineering (BIBE 06) (acceptance
rate: 34%, 33/81)
- Yoo I., Hu X., Biomedical Ontology MeSH Improves Document Clustering
Quality on MEDLINE Articles: A Comparison Study, accepted in the 19th IEEE
International Symposium on Computer-Based Medical Systems, Salt Lake City,
Utah, June 22-23, 2006
- Xu X., Zhu W., Hu X., Song I-Y., A Comparison of Local Analysis,
Global Analysis and Ontology-based Query Expansion Strategies for
Bio-medical Literature Search, accepted in 2006 IEEE International
Conference on Systems, Man and Cybernetics (IEEE SMC 2006), Taiwan, ROC,
Oct 18-21, 2006
- Wu C., Hu X., Wang X., Yang X., Pan Y., Knowledge Dependency Relationships
in Incomplete Information System Based on Tolerance Relations, accepted in the
2006 IEEE International Conference on Systems, Man and Cybernetics (IEEE SMC
2006), Taiwan, ROC, Oct 18-21, 2006
- Zhou X., Zhang X., Hu X., MaxMatcher: Biological Concept Extraction Using
Approximate Dictionary Lookup, in the Proceedings of the 9th Biennial Pacific
Rim International Conference on Artificial Intelligence (PRICAI 2006), short
paper, pp 1145-1149 (acceptance rate: 16.8%, 100/596)
- Yoo I., Hu X., Song I-Y., A Coherent Document Clustering and Text
Summarization Approach through a Scale-free Ontology-enriched Graphical
Representation, accepted in the 8th International Conference on Data
Warehousing and Knowledge Discovery (DaWaK 2006), Krakow, Poland, Sept. 4-8,
2006 (acceptance rate: 35%, 52/145)
- Zhong H., Hu X., Object Oriented Modeling of Protein Translation Systems,
in the Proceedings of the 2006 IEEE International Conference on Granular
Computing (IEEE GrC 2006), Atlanta, GA, May 15-17, 2006 (short paper), pp
353-356 (acceptance rate: 31%, 101/321)
- Hu X., Zhang X., Wu D., Zhou X., Rumm P., Integration of Instance-based
Learning & Text Mining for Identification of Potential Virus / Bacterium as
Bio-terrorism Weapons, in the Proceedings of the 2006 IEEE Intelligence and
Security Informatics Conference (short paper), pp 548-553
- Wu D., Hu X., Mining and Analyzing the Topological Structure of
Protein-Protein Interaction Networks, in the Proceedings of the 2006 ACM
Symposium on Applied Computing (Bioinformatics Track), April 23-27, Dijon,
Bourgogne, France, pp 185-189 (acceptance rate: 32.4%, 300/927)