Deep learning applied to Genomics: Deep semantic DNA and protein representation | AIChE

We are witnessing an explosion in the biological sequence information stored in databases such as GenBank and UniProt. This growth is largely driven by massive improvements in sequencing technology and has made tens of thousands of genomes and metagenomes available. The utility of each genome depends on high-quality annotation, yet assigning function to proteins with no known homologs remains an unsolved problem. I will discuss how Denovium developed a state-of-the-art artificial intelligence (AI) platform, the Denovium Engine, which can find genes and assign function to them directly from assembled DNA. Our deep learning (DL) models interpret proteins from their primary amino acid sequence and learn multiple protein features implicitly in a single step. Denovium’s DL protein model encodes proteins as high-dimensional representations (embeddings), allowing the accurate assignment of over 700,000 labels across 25 distinct tasks. The model can be used to rapidly search sequence databases for proteins that have <25% amino acid identity to known enzymes. I will also discuss how these models can be leveraged to evolve proteins and, looking to the future, how we may be able to generate proteins de novo. Denovium was recently acquired by Absci, and its technology is currently being applied to the design of therapeutic proteins and the cell lines that produce them.
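
The embedding-based database search mentioned above can be illustrated with a minimal sketch. This is not the Denovium Engine, whose architecture is not described in this abstract. As a stand-in for a learned embedding, the sketch assumes a simple normalised k-mer count vector, and it ranks database entries by cosine similarity to the query. The function names (`embed`, `cosine`, `search`) and the sequences are hypothetical and exist only for illustration.

```python
import math
from collections import Counter


def embed(seq, k=2):
    """Toy stand-in for a learned embedding: L2-normalised k-mer counts.

    ASSUMPTION: a real DL protein model would produce a dense learned
    vector here; k-mer counts are used only to keep the sketch runnable.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    norm = math.sqrt(sum(v * v for v in counts.values()))
    if norm == 0:
        return {}  # sequence shorter than k: no k-mers to count
    return {kmer: v / norm for kmer, v in counts.items()}


def cosine(a, b):
    """Cosine similarity of two sparse, already L2-normalised vectors."""
    return sum(v * b.get(kmer, 0.0) for kmer, v in a.items())


def search(query, database, top_n=3):
    """Return the top_n (score, name) pairs, most similar first."""
    q = embed(query)
    scored = [(cosine(q, embed(seq)), name) for name, seq in database.items()]
    return sorted(scored, reverse=True)[:top_n]


# Hypothetical sequences, invented for this example only.
database = {
    "enzyme_A": "MKTAYIAKQRQISFVKSHFSRQ",
    "enzyme_B": "MKTAYIAKQRQISFVKAHFSRQ",
    "unrelated": "GGGGSGGGGSGGGGS",
}

for score, name in search("MKTAYIAKQRQISFVKSHFSRQ", database):
    print(f"{name}: {score:.3f}")
```

In a production setting the embedding of every database entry would typically be computed once and stored, so that a query only requires one embedding step followed by a nearest-neighbour lookup. The k-mer stand-in used here cannot detect remote homologs below 25% identity; that capability would have to come from the learned representation itself.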