(451g) A Transferable Diffusion Model for Coarse-Grained Backmapping | AIChE

Shmilovich, K., University of Chicago
Ferguson, A., University of Chicago
Coarse-grained molecular models of proteins give access to length and time scales unattainable by all-atom models and enable the simulation of important long-time-scale processes such as aggregation and folding. The reduced resolution of coarse-grained models yields computational speedups but sacrifices the atomistic detail that can be vital for a complete understanding of mechanism. Backmapping is the process of restoring all-atom detail to coarse-grained molecular representations in order to recover atomistic-level insight. Conventional backmapping approaches generate initial all-atom structures from geometric rules and then apply energy relaxation to eliminate unphysical high-energy overlaps and produce stable all-atom configurations. The need for energy minimization typically makes these procedures slow and computationally expensive. Recently, data-driven approaches have shown great promise in furnishing trainable models that efficiently backmap small molecules and proteins. In this work, we report a novel backmapping approach based on autoregressive denoising diffusion probabilistic models that restores all-atom detail to coarse-grained simulations represented only by C-alpha coordinates. The generation process is conditioned on the coarse-grained protein configuration and, in an autoregressive fashion, on any previously backmapped side chains in order to avoid steric clashes. Because the model is inherently local and transferable, it applies to proteins of arbitrary size with linear scaling. We train the model on over 100K proteins in the SidechainNet training data set and demonstrate state-of-the-art performance on systems including DE Shaw training trajectories of fast-folding mini-proteins, ensembles of intrinsically disordered proteins, and randomly sampled selections from the Protein Data Bank.
Furthermore, we demonstrate that fine-tuning the transferable model on a given system can further improve performance in recapitulating protein-specific side-chain distributions. We make the backmapping tool available as a free, open-source Python package.
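
The autoregressive, locally conditioned generation described above can be illustrated with a minimal sketch. This is not the authors' implementation or the API of their package: the function names (`noise_schedule`, `toy_noise_predictor`, `sample_residue`, `backmap`), the fixed number of atoms per residue, the 10 Å cutoff, and the linear noise schedule are all assumptions made for illustration. In particular, `toy_noise_predictor` is a hypothetical stand-in for a trained network and ignores its conditioning input. The sketch only shows the control flow: residues are visited one at a time, each is sampled by a standard DDPM reverse process, and the conditioning context holds nearby C-alpha atoms plus side chains that have already been placed.

```python
import numpy as np

def noise_schedule(n_steps=50, beta_min=1e-4, beta_max=0.02):
    """Linear beta schedule (an assumed choice, common in DDPMs)."""
    betas = np.linspace(beta_min, beta_max, n_steps)
    alphas = 1.0 - betas
    return betas, alphas, np.cumprod(alphas)

def toy_noise_predictor(x_t, t, alpha_bars, context):
    """HYPOTHETICAL stand-in for a trained denoising network.

    It pretends the clean atoms sit at zero offset from the C-alpha and
    returns the noise implied by that guess. A real model would use `context`.
    """
    x0_guess = np.zeros_like(x_t)
    return (x_t - np.sqrt(alpha_bars[t]) * x0_guess) / np.sqrt(1.0 - alpha_bars[t])

def sample_residue(context, n_atoms, schedule, rng, predictor=toy_noise_predictor):
    """DDPM reverse (ancestral) sampling of atom offsets for one residue."""
    betas, alphas, alpha_bars = schedule
    x = rng.standard_normal((n_atoms, 3))  # start from pure Gaussian noise
    for t in range(len(betas) - 1, -1, -1):
        eps_hat = predictor(x, t, alpha_bars, context)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        # No noise is added on the final step (t == 0).
        noise = rng.standard_normal(x.shape) if t > 0 else np.zeros_like(x)
        x = mean + np.sqrt(betas[t]) * noise
    return x

def backmap(ca_coords, atoms_per_residue=4, cutoff=10.0, n_steps=50, seed=0):
    """Place atoms residue by residue, conditioning on local context only."""
    rng = np.random.default_rng(seed)
    schedule = noise_schedule(n_steps)
    placed = []
    for i, ca in enumerate(ca_coords):
        # Local neighborhood: C-alpha atoms within the (assumed) cutoff.
        near = np.linalg.norm(ca_coords - ca, axis=1) < cutoff
        context = {
            "ca": ca_coords[near] - ca,
            # Autoregressive part: only residues already backmapped (j < i).
            "placed": [placed[j] - ca for j in range(i) if near[j]],
        }
        offsets = sample_residue(context, atoms_per_residue, schedule, rng)
        placed.append(ca + offsets)
    return np.stack(placed)  # shape: (n_residues, atoms_per_residue, 3)

if __name__ == "__main__":
    # Five C-alpha atoms spaced 3.8 Angstrom apart along a diagonal line.
    ca = np.cumsum(np.full((5, 3), 3.8 / np.sqrt(3.0)), axis=0)
    print(backmap(ca).shape)
```

In this sketch the loop over residues runs once per residue. If the number of neighbors inside the cutoff is bounded and the neighbor search uses a spatial data structure rather than the brute-force distance computation shown here, the overall cost grows linearly with protein size, which is one way to read the linear-scaling claim in the abstract.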