(383h) Machine Learning and Natural Language Processing for Pharmaceutical Product Engineering

Remolona, M. F., Columbia University
Venkatasubramanian, V., Columbia University
Pharmaceutical product engineering is a "Big Data" discipline. It requires a detailed understanding of the drug chemistry during production and within the body, the manufacturing processes and conditions, and the pharmacokinetics of a disease, all of which are data intensive. In fact, a typical New Drug Application (NDA) contains more than 100,000 pages of varied information. In this talk, we present a framework, called HOLMES, for the automatic extraction of knowledge from primary sources related to pharmaceutical product engineering. The extracted information is then stored in ontologies, which are computer-readable semantic knowledge representations used in artificial intelligence. We describe the Machine Learning (ML) algorithms and Natural Language Processing (NLP) techniques that HOLMES uses for Entity and Concept Recognition and for Relation Extraction. We will discuss our progress on the creation of an entity-concept-and-relation databank (7968 entities and concepts, 1665 relations); the application of different ML algorithms for joint Entity and Concept detection; and the development of a relation clustering algorithm using common feature sets.
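
As a rough illustration of what storing extracted entities, concepts, and relations in an ontology-like structure might look like, the following is a minimal Python sketch. It is not the HOLMES implementation: the class names (`Triple`, `TinyOntology`), the methods, and the example entities and relation (`ibuprofen`, `tablet`, `has_dosage_form`) are all hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One extracted relation: (subject entity, relation name, object entity)."""
    subject: str
    relation: str
    obj: str


class TinyOntology:
    """Hypothetical in-memory store; a real ontology would use a formal language such as OWL."""

    def __init__(self):
        self.concept_of = {}   # entity -> concept it was recognized as
        self.triples = set()   # set of Triple, so duplicates are ignored

    def add_entity(self, entity, concept):
        # Record the concept assigned to an entity (joint entity/concept detection output).
        self.concept_of[entity] = concept

    def add_relation(self, subject, relation, obj):
        # Only allow relations between entities that have already been recognized.
        for entity in (subject, obj):
            if entity not in self.concept_of:
                raise KeyError(f"unknown entity: {entity}")
        self.triples.add(Triple(subject, relation, obj))

    def relations_of(self, subject):
        # Return all triples whose subject matches, in sorted order.
        return sorted(t for t in self.triples if t.subject == subject)


# Illustrative (made-up) extraction results.
onto = TinyOntology()
onto.add_entity("ibuprofen", "ActiveIngredient")
onto.add_entity("tablet", "DosageForm")
onto.add_relation("ibuprofen", "has_dosage_form", "tablet")
onto.add_relation("ibuprofen", "has_dosage_form", "tablet")  # duplicate, stored once
```

Keeping entities typed by concept and relations as subject-relation-object triples is one common way to make extracted knowledge queryable; the actual representation used in HOLMES may differ.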