(314e) Combining Protein Sequence and Structure Pretraining

Authors 

Yeh, H., University of Chicago
Proteins efficiently and precisely perform complex tasks under a wide variety of conditions. This combination of versatility and selectivity makes them critical not only to life but also to a myriad of human-designed applications. Engineered proteins play increasingly essential roles in industries and applications spanning pharmaceuticals, agriculture, specialty chemicals, and fuels. The ability of a protein to perform a desired function is determined by its amino acid sequence, often mediated through folding to a three-dimensional structure. Machine-learning methods that predict fitness will enable the engineering of new protein functions.

Large pretrained protein language models have advanced the ability of machine-learning methods to predict protein structure and function from sequence, especially when labeled training data is sparse. Most modern self-supervised protein sequence pretraining uses a neural network model trained with either an autoregressive likelihood or the masked language modeling (MLM) task introduced for natural language by BERT (bidirectional encoder representations from transformers). For example, ESM-1b is a 650M-parameter transformer model, and CARP-640M is a 640M-parameter convolutional neural network (CNN), both trained on the MLM task using sequences from UniRef. While large pretrained models have had remarkable success in predicting the effects of mutations on protein fitness, predicting protein structure from sequence, and assigning functional annotations, pretraining often provides very little benefit on the types of tasks and datasets encountered in protein engineering. Furthermore, pretraining on sequences alone ignores other sources of information about proteins, including structure.
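To make the MLM objective concrete, the sketch below implements BERT-style masked-residue pretraining on protein sequences. It is a minimal illustration only: the tiny transformer, 22-token vocabulary, 15% masking rate, and random batch are assumptions for demonstration, not the ESM-1b or CARP-640M training configuration.

```python
# Minimal sketch of BERT-style masked language modeling on protein sequences.
# Illustrative assumptions: vocabulary, masking scheme, and model size are not
# those of ESM-1b or CARP-640M.
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # 20 canonical residues
PAD = len(AMINO_ACIDS)                  # index 20: padding token
MASK = PAD + 1                          # index 21: mask token
VOCAB_SIZE = MASK + 1

def mask_tokens(seq_ids: torch.Tensor, mask_rate: float = 0.15):
    """Corrupt ~15% of residues with the MASK token (a simplification of
    BERT's 80/10/10 scheme). Labels are -100 (ignored by the loss)
    everywhere except at masked positions."""
    labels = seq_ids.clone()
    is_masked = torch.rand(seq_ids.shape) < mask_rate
    labels[~is_masked] = -100
    corrupted = seq_ids.clone()
    corrupted[is_masked] = MASK
    return corrupted, labels

class TinyProteinMLM(nn.Module):
    def __init__(self, d_model: int = 64, n_layers: int = 2, n_heads: int = 4):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model, padding_idx=PAD)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # (batch, length) token indices -> (batch, length, vocab) logits
        return self.lm_head(self.encoder(self.embed(tokens)))

# One illustrative training step on a random batch of "sequences".
model = TinyProteinMLM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
batch = torch.randint(0, len(AMINO_ACIDS), (8, 128))   # 8 sequences of length 128
inputs, labels = mask_tokens(batch)
logits = model(inputs)
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB_SIZE),
                                   labels.reshape(-1), ignore_index=-100)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```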

Previous work shows that a graph neural network (GNN) that encodes structural information in addition to sequence context can also be used for masked language modeling of proteins, achieving much better performance on the reconstruction task than sequence-only models while using a fraction of the data and model parameters (Figure 1). In this work, we show that a GNN MLM pretrained on 19,000 sequences and structures outperforms sequence-only pretraining on protein engineering tasks where a structure is available, including zero-shot mutant fitness prediction and tasks in the FLIP (Fitness Landscape Inference for Proteins) benchmark. We then show that using the output logits from a fixed CARP-640M, pretrained on 42M sequences, as input to the GNN further improves performance on both the pretraining MLM task and downstream tasks.
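A rough sketch of this combination is given below: per-residue output logits from a frozen pretrained sequence model serve as node features for a structure-aware GNN that refines them over a residue contact graph. The frozen-model output, contact-graph construction, and mean-aggregation message passing here are simplified stand-ins chosen for illustration, not the actual CARP-640M interface or the GNN architecture used in this work.

```python
# Hedged sketch: feed frozen sequence-model logits into a structure-based GNN.
# All components below are simplified assumptions for illustration.
import torch
import torch.nn as nn

VOCAB_SIZE = 22          # 20 amino acids + padding + mask, as in the sketch above

def contact_edges(coords: torch.Tensor, cutoff: float = 10.0) -> torch.Tensor:
    """Adjacency matrix from C-alpha coordinates: residues within `cutoff`
    angstroms are connected (a common, but here assumed, graph construction)."""
    dist = torch.cdist(coords, coords)            # (L, L) pairwise distances
    adj = (dist < cutoff).float()
    adj.fill_diagonal_(0.0)
    return adj

class StructureMLMHead(nn.Module):
    """Two rounds of mean-aggregation message passing over the contact graph,
    starting from the frozen sequence model's logits, followed by a per-residue
    classifier over amino acid types."""
    def __init__(self, d_hidden: int = 128):
        super().__init__()
        self.embed = nn.Linear(VOCAB_SIZE, d_hidden)
        self.mp1 = nn.Linear(2 * d_hidden, d_hidden)
        self.mp2 = nn.Linear(2 * d_hidden, d_hidden)
        self.out = nn.Linear(d_hidden, VOCAB_SIZE)

    def propagate(self, h, adj, layer):
        # Average neighbor features, concatenate with the node's own state.
        neighbor_mean = adj @ h / adj.sum(-1, keepdim=True).clamp(min=1.0)
        return torch.relu(layer(torch.cat([h, neighbor_mean], dim=-1)))

    def forward(self, seq_logits: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.embed(seq_logits))
        h = self.propagate(h, adj, self.mp1)
        h = self.propagate(h, adj, self.mp2)
        return self.out(h)                        # refined per-residue logits

# Illustrative forward pass for a single protein of length 150.
L = 150
coords = torch.randn(L, 3) * 10                   # placeholder C-alpha coordinates
with torch.no_grad():
    seq_logits = torch.randn(L, VOCAB_SIZE)       # stand-in for frozen CARP-640M output
gnn = StructureMLMHead()
refined_logits = gnn(seq_logits, contact_edges(coords))   # (L, VOCAB_SIZE)
```

Training such a head on the same masked-residue reconstruction objective, while keeping the sequence model fixed, is one plausible way to read the setup described above; the exact coupling used in the work may differ.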

Historically, most pretraining methods for proteins have treated the amino-acid sequence as text and borrowed methods from natural language processing. However, proteins are not sentences, and protein sequence databases contain additional data that can be useful for pretraining, including structure; functional annotations; ligand, substrate, or cofactor information; and free text. Integrating this information into pretrained models will be essential to leveraging all available information for protein engineering.