(367d) Entity Extraction for Ontology Based Intelligent Querying in the Pharmaceutical Domain | AIChE



Patil, S. V. - Presenter, Purdue University

Every pharmaceutical product under development generates a large body of information in the form of scientific documents. These documents cover wide-ranging topics such as pre-formulation studies, product formulation, process development, and manufacturing [1]. The information is also represented in a variety of ways: raw data or unstructured laboratory reports; tabular or graphical data describing study designs or experimental results; and mathematical models. Because this information lacks an inherent structure, it cannot be organized and reused efficiently for new design decisions. Relational databases address this problem partially by facilitating efficient retrieval, but they suffer from the following limitations: (1) schemata that define entity attributes and relationships must be created a priori and are static; (2) populating the databases is largely a manual process and hence not scalable; and (3) there is no way to extract implicit relationships that might exist across entities.

In this work, we address the problem of extracting entities (or concepts) and relations between entities to automatically build an ontology over a corpus of pharmaceutical documents. We use a classification model based on conditional random fields [2] to tag document text using predefined entity types such as TABLET, API, MANUFACTURING_PROCESS and OPERATING_CONDITION. We build an interface to the Purdue Ontology for Pharmaceutical Engineering (POPE) [3] such that the ontology engine is populated with entities and relations automatically. The ontology is then used to search for associations between entities and answer questions that help in making design decisions.
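To make the tagging step concrete, the following is a minimal sketch of how a linear-chain conditional random field assigns entity types to a token sequence by Viterbi decoding. It is illustrative only: the abstract does not describe the features, training data, or label encoding actually used, so the feature functions, the hand-set weights in EMISSION and TRANSITION, and all function names here are assumptions. A trained model would learn the weights from annotated documents. Only the entity type names are taken from the text above.

```python
from typing import Dict, List, Tuple

# Entity types named in the abstract, plus "O" for tokens outside any entity.
LABELS = ["O", "API", "TABLET", "MANUFACTURING_PROCESS", "OPERATING_CONDITION"]

# Hypothetical emission weights: (feature, label) -> weight.
# A trained CRF would learn these; here they are set by hand for illustration.
EMISSION: Dict[Tuple[str, str], float] = {
    ("bias", "O"): 1.0,
    ("word=ibuprofen", "API"): 3.0,
    ("word=tablets", "TABLET"): 3.0,
    ("word=wet", "MANUFACTURING_PROCESS"): 2.0,
    ("word=granulation", "MANUFACTURING_PROCESS"): 3.0,
    ("is_number", "OPERATING_CONDITION"): 2.0,
    ("word=rpm", "OPERATING_CONDITION"): 3.0,
}

# Hypothetical transition weights: (previous label, label) -> weight.
# They reward multi-token entities that stay in the same type.
TRANSITION: Dict[Tuple[str, str], float] = {
    ("MANUFACTURING_PROCESS", "MANUFACTURING_PROCESS"): 0.5,
    ("OPERATING_CONDITION", "OPERATING_CONDITION"): 0.5,
}


def features(tokens: List[str], i: int) -> List[str]:
    """Return the (assumed) binary features that fire for token i."""
    word = tokens[i].lower()
    feats = ["bias", "word=" + word]
    if word.replace(".", "", 1).isdigit():
        feats.append("is_number")
    return feats


def emission_score(tokens: List[str], i: int, label: str) -> float:
    """Sum the weights of all features of token i paired with this label."""
    return sum(EMISSION.get((f, label), 0.0) for f in features(tokens, i))


def viterbi(tokens: List[str]) -> List[str]:
    """Return the highest-scoring label sequence under the toy weights."""
    if not tokens:
        return []
    # best[i][label] = (score of best path ending in label at i, previous label)
    best = [{lab: (emission_score(tokens, 0, lab), None) for lab in LABELS}]
    for i in range(1, len(tokens)):
        column = {}
        for lab in LABELS:
            def path_score(prev: str) -> float:
                return best[i - 1][prev][0] + TRANSITION.get((prev, lab), 0.0)
            prev_best = max(LABELS, key=path_score)
            score = path_score(prev_best) + emission_score(tokens, i, lab)
            column[lab] = (score, prev_best)
        best.append(column)
    # Follow the back-pointers from the best final label.
    label = max(LABELS, key=lambda lab: best[-1][lab][0])
    path = [label]
    for i in range(len(tokens) - 1, 0, -1):
        label = best[i][label][1]
        path.append(label)
    return path[::-1]


def extract_entities(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge adjacent tokens that share an entity type into (type, text) pairs."""
    spans: List[Tuple[str, str]] = []
    prev = "O"
    for tok, lab in zip(tokens, labels):
        if lab != "O":
            if lab == prev:
                spans[-1] = (lab, spans[-1][1] + " " + tok)
            else:
                spans.append((lab, tok))
        prev = lab
    return spans


if __name__ == "__main__":
    sentence = "Ibuprofen tablets were made by wet granulation at 150 rpm".split()
    print(extract_entities(sentence, viterbi(sentence)))
```

The (type, text) pairs returned by extract_entities are the kind of output that could be passed on to an ontology interface. How such pairs map onto POPE classes and relations is not specified in this abstract, so it is left out of the sketch.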

Fluck et al. [4] provide a general overview of information extraction in the life sciences industries, with a special emphasis on biomedical entity extraction (for example, protein and gene names). They also describe the specific challenges in chemical entity recognition and highlight some of the recent work in that direction. Banville [5] reports the problems in mining chemical structural information from pharmaceutical literature, which arise mainly from the non-standard representation of chemical structures. While there is substantial effort on entity extraction and question answering in the biomedical and clinical domains [6-10], little focused research addresses this problem as applied to pharmaceutical drug design and discovery. Our work is an effort in this direction.


1. P. Beringer, A. DerMarderosian and L. Felton. Remington: The Science and Practice of Pharmacy, 21st Edition, Lippincott, Williams and Wilkins, University of the Sciences, Philadelphia, 2006.

2. J. Lafferty, A. McCallum and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. 18th International Conf. on Machine Learning, Morgan Kaufmann, San Francisco, CA, 2001, pp. 282-289.

3. L. M. Hailemariam, A. Jain, P. Suresh, V. P. K. Akkisetty, G. Joglekar, S-H. Hsu, K. R. Morris, G. V. Reklaitis, P. K. Basu and V. Venkatasubramanian. The Pope Ontology for Pharmaceutical Product Development. AIChE Annual Meeting, Salt Lake City, 2007.

4. J. Fluck, M. Zimmermann, G. Kurapkat and M. Hofmann. Information extraction technologies for the life science industry. Drug Discovery Today, Vol. 2, No. 3, 2005, Elsevier, doi:10.1016/j.ddtec.2005.08.013.

5. D. L. Banville. Mining chemical structural information from the drug literature. Drug Discovery Today, Vol. 11, No. 1/2, January 2006, Elsevier.

6. R. McDonald and F. Pereira. Identifying gene and protein mentions in text using conditional random fields. BMC Bioinformatics, 6(Suppl 1):S6, 2005, doi:10.1186/1471-2105-6-S1-S6.

7. L. Tanabe, N. Xie, L. H. Thom, W. Matten and W. J. Wilbur. GENETAG: a tagged corpus for gene/protein named entity recognition. BMC Bioinformatics, 6(Suppl 1):S3, 2005, doi:10.1186/1471-2105-6-S1-S3.

8. B. Settles. ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text. Bioinformatics, 21(14):3191-3192, 2005, doi:10.1093/bioinformatics/bti475.

9. D. Demner-Fushman and J. Lin. Knowledge Extraction for Clinical Question Answering: Preliminary Results. In Proceedings of the AAAI-05 Workshop on Question Answering in Restricted Domains, 2005.

10. P. Zweigenbaum. Question Answering in Biomedicine. In Proceedings of the Workshop on Natural Language Processing for Question Answering, EACL 2003.