Break | AIChE


In up to 67% of cases in patients with intellectual and developmental disorders (IDDs), clinical genomic sequencing identifies sequence variations whose biological significance is uncertain, termed “variants of uncertain significance” (VUSs). The VUSs considered here are missense variants, in which a nucleotide change in the genome alters the amino acid sequence of the resulting protein. Experimental research can help determine a VUS’s significance, but reclassifying VUSs with current lab-based techniques could take years. To speed up reclassification, we employ recent deep learning advances such as ESM-1v, a pre-trained protein language model, to identify the functional effects of sequence variations. We apply ESM-1v to study the top 10th percentile of pathogenic variants and the bottom 10th percentile of benign variants drawn from the complete set of 33,700 variants in the 1,513 genes known to be involved in IDDs. Because ESM-1v outputs an embedding (a vector representation) for each mutated sequence, we use it in our model to featurize sequences rather than explicitly computing protein structure for every variant, which accelerates training. Once completed, our model will be evaluated on test data to ensure it is calibrated, accurate, and interpretable before use in a clinical setting, where it will be tested against a data set of real VUSs.
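As a rough illustration of the preprocessing this implies, the sketch below builds the mutated protein sequence for a missense variant and picks the extreme tenths of a scored variant set. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function names `apply_missense` and `select_extremes`, the `R3W`-style variant notation, and the idea of ranking by a single numeric score are all hypothetical, and the ESM-1v embedding step itself is deliberately left out.

```python
import re
from typing import Dict, List, Tuple

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def apply_missense(sequence: str, variant: str) -> str:
    """Return the mutated sequence for a variant such as 'R3W'.

    Assumed notation (hypothetical): reference residue, 1-based position,
    alternate residue. The reference residue is checked against the sequence.
    """
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", variant)
    if match is None:
        raise ValueError(f"unrecognized variant notation: {variant!r}")
    ref, pos, alt = match.group(1), int(match.group(2)), match.group(3)
    if ref not in AMINO_ACIDS or alt not in AMINO_ACIDS:
        raise ValueError(f"non-standard amino acid in {variant!r}")
    if not 1 <= pos <= len(sequence):
        raise ValueError(f"position {pos} is outside the sequence")
    if sequence[pos - 1] != ref:
        raise ValueError(f"expected {ref} at position {pos}, found {sequence[pos - 1]}")
    return sequence[: pos - 1] + alt + sequence[pos:]


def select_extremes(
    scores: Dict[str, float], fraction: float = 0.10
) -> Tuple[List[str], List[str]]:
    """Return (lowest, highest) variant names, each covering `fraction` of the set.

    Assumption: each variant carries one numeric score; how that score is
    defined is not specified in the text above.
    """
    ranked = sorted(scores, key=lambda name: scores[name])
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k], ranked[-k:]
```

In such a setup, each sequence returned by `apply_missense` would be the input passed to the language model for embedding, and `select_extremes` would pick the subsets to study first.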