(203e) Improving Data Sub-Selection for Supervised Tasks with Principal Covariates Regression

Authors 

Cersonsky, R. - Presenter, EPFL STI IMX COSMO
Helfrecht, B., École Polytechnique Fédérale de Lausanne
Kliavinek, S., École Polytechnique Fédérale de Lausanne
Engel, E. A., TCM Group, Cavendish Laboratory, University of Cambridge
Ceriotti, M., École Polytechnique Fédérale de Lausanne
Data analyses based on linear methods constitute the simplest, most robust, and most transparent approaches to the automatic processing of large amounts of data for building supervised or unsupervised machine learning models. Principal covariates regression (PCovR) is an underappreciated method that interpolates between principal component analysis and linear regression and can be used to conveniently reveal structure-property relations in terms of simple-to-interpret, low-dimensional maps. We have recently introduced methods that incorporate PCovR into two popular data selection approaches, CUR decomposition and farthest point sampling (FPS), which iteratively identify the most diverse samples and the most discriminating features. While our approach is completely general, here we focus on systems relevant to atomistic simulations, chemistry, and materials science, fields where feature and sample selection are an increasingly common practice. Our results show that these selection methods identify data subsets that outperform those chosen by their unsupervised counterparts, as we demonstrate with models of increasing complexity: ridge regression, kernel ridge regression, and feed-forward neural networks.
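To make the two ingredients concrete, the sketch below shows a sample-space PCovR projection, built by diagonalizing a Gram matrix that interpolates between the PCA term X X^T and the regression term based on the least-squares prediction of Y, together with a greedy farthest-point-sampling selector driven by the distances that matrix induces. This is a minimal NumPy sketch under simplifying assumptions (centered data, ridge-regularized least squares, no normalization of the two terms), not the authors' reference implementation; the function names and the distance definition in pcovr_fps are illustrative.

```python
# Minimal sketch of sample-space PCovR and a PCovR-driven farthest point
# sampling (FPS) selector. Assumes centered X and Y; normalization and
# scaling conventions from the papers are omitted for brevity.
import numpy as np

def pcovr_projection(X, Y, alpha=0.5, n_components=2, ridge=1e-8):
    """Project samples onto a PCovR latent space.

    X : (n_samples, n_features), centered
    Y : (n_samples, n_targets), centered
    alpha : mixing parameter; 1 -> pure PCA, 0 -> pure regression
    """
    # Ridge-regularized least-squares prediction of Y from X
    W = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ Y)
    Y_hat = X @ W

    # Modified Gram matrix interpolating between PCA and regression
    K_tilde = alpha * (X @ X.T) + (1 - alpha) * (Y_hat @ Y_hat.T)

    # Latent projection from the leading eigenpairs of K_tilde
    evals, evecs = np.linalg.eigh(K_tilde)
    idx = np.argsort(evals)[::-1][:n_components]
    T = evecs[:, idx] * np.sqrt(np.clip(evals[idx], 0.0, None))
    return T, K_tilde

def pcovr_fps(K_tilde, n_select, start=0):
    """Greedy FPS using the distances induced by K_tilde:
    d^2(i, j) = K_ii + K_jj - 2 K_ij."""
    diag = np.diag(K_tilde)
    selected = [start]
    d2 = diag + diag[start] - 2 * K_tilde[:, start]
    for _ in range(n_select - 1):
        new = int(np.argmax(d2))
        selected.append(new)
        d2 = np.minimum(d2, diag + diag[new] - 2 * K_tilde[:, new])
    return selected
```

Setting alpha = 1 recovers a purely unsupervised, PCA-like selection, while smaller values bias the choice toward samples that are most informative for regressing Y; the indices returned by pcovr_fps can then be used to train the downstream ridge, kernel ridge, or neural-network models mentioned above.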

This work draws on:

B. A. Helfrecht, R. K. Cersonsky, G. Fraux, and M. Ceriotti, "Structure-Property Maps with Kernel Principal Covariates Regression," Machine Learning: Science and Technology 1 (4), 2020.

R. K. Cersonsky, B. A. Helfrecht, E. A. Engel, and M. Ceriotti, "Improving Sample and Feature Selection with Principal Covariates Regression," arXiv preprint arXiv:2012.12253, 2020.