(714d) Development of An Efficient Variable Selection Methodology for Calibration Model Design | AIChE

Authors 

Fujiwara, K. - Presenter, Kyoto University
Kano, M., Kyoto University



In the pharmaceutical industry, documents on quality by design (QbD) and process analytical technology (PAT) have been issued by the Food and Drug Administration (FDA) and the International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH), and online process monitoring and control technologies have attracted much attention. Near-infrared spectroscopy (NIRS) is a powerful online monitoring tool because it is noninvasive and its measuring time is short. Many material attributes, such as water content, blend uniformity, content uniformity and coating thickness, have been estimated by using calibration models built on NIR spectra. Partial least squares (PLS), in particular, has been used to build accurate calibration models with a small number of latent variables.

In general, a calibration model can fit the samples used for model construction well when the number of input variables is large. However, its estimation performance may deteriorate because of measurement noise when input variables that are not physically related to the output are used. Although appropriate input variables have to be selected when a calibration model is constructed, the computational load increases dramatically if all possible combinations of input variables are tested. A systematic methodology for selecting appropriate input variables is therefore required to improve both the estimation performance and the efficiency of calibration model design.
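To give a feel for why testing every combination is impractical, the short sketch below counts the non-empty subsets of p candidate variables, which is 2^p - 1. The function name is illustrative only; the value 2203 is the number of spectral points reported later in this abstract.

```python
def count_subsets(n_variables: int) -> int:
    """Number of non-empty subsets of n_variables candidate inputs (2^p - 1)."""
    return 2 ** n_variables - 1

for p in (10, 20, 2203):
    n = count_subsets(p)
    # Report only the order of magnitude; the full integer is very long for large p.
    print(f"{p} variables -> roughly 10^{len(str(n)) - 1} candidate subsets")
```

Even a few dozen variables already rule out exhaustive search, which is why heuristic or structured selection methods are used instead.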

Although a genetic algorithm (GA) can be applied to variable selection, its computational load is still heavy. On the other hand, PLS-based variable selection methods, such as PLS-Beta and variable influence on projection (VIP), have been proposed. In addition, stepwise regression and the least absolute shrinkage and selection operator (Lasso) have been used. These methods evaluate each candidate variable independently as to whether or not it should be used as an input variable; however, such an evaluation is not appropriate because each variable is correlated with other variables.
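As an illustration of how a per-variable score such as VIP is obtained, the sketch below evaluates the commonly used VIP formula from the outputs of an already fitted single-output PLS model. The function name and the argument layout (scores T, weights W, y-loadings q) are assumptions made for this example, not details taken from this work.

```python
import numpy as np

def vip_scores(T, W, q):
    """VIP per variable. T: (n, A) scores, W: (p, A) weights, q: (A,) y-loadings."""
    p = W.shape[0]
    # Output variance explained by each latent variable.
    ss = (q ** 2) * np.sum(T ** 2, axis=0)
    # Squared weights after normalising each weight vector to unit length.
    w_norm2 = (W / np.linalg.norm(W, axis=0)) ** 2
    return np.sqrt(p * (w_norm2 @ ss) / ss.sum())

# Small hand-made matrices, used only to exercise the formula (not real PLS output).
T = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
W = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
q = np.array([0.5, 1.0])
print(vip_scores(T, W, q))
```

By construction the squared VIP values sum to the number of variables, so a threshold of 1 is a common, though not universal, cut-off. Note that each variable receives its own score, which is the one-at-a-time evaluation discussed above.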

This work proposes a new methodology for selecting input variables using a correlation-based clustering method, referred to as nearest correlation spectral clustering (NCSC). NCSC was originally proposed for clustering samples on the basis of the correlation between variables without assuming any distribution. NCSC integrates the nearest correlation (NC) method, which can detect samples whose correlation is similar to that of a query sample, and spectral clustering (SC), which can partition a weighted graph. The NC method constructs a weighted graph that expresses the correlation-based similarities between samples, and SC partitions the constructed graph.
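The two stages can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the affinity rule (shift each query sample to the origin and count pairs of remaining samples whose absolute cosine exceeds a threshold), the threshold `gamma`, the small connecting weight `eps`, and the two-way split by the Fiedler vector of the unnormalised Laplacian are all choices made here for the example.

```python
import numpy as np

def nc_affinity(X, gamma=0.995):
    """Count how often two samples look collinear when viewed from a third (query) sample."""
    n = X.shape[0]
    S = np.zeros((n, n))
    for q in range(n):
        idx = np.array([i for i in range(n) if i != q])
        D = X[idx] - X[q]                       # move the query to the origin
        norms = np.linalg.norm(D, axis=1, keepdims=True)
        norms[norms == 0] = np.inf              # duplicates of the query contribute nothing
        U = D / norms
        near = np.abs(U @ U.T) >= gamma         # |cosine| close to 1 -> same line through the query
        np.fill_diagonal(near, False)
        S[np.ix_(idx, idx)] += near
    return S

def fiedler_bipartition(S, eps=1e-3):
    """Two-way spectral split of the weighted graph with affinity matrix S."""
    n = S.shape[0]
    W = S + eps * (1.0 - np.eye(n))             # weak links keep the graph connected
    L = np.diag(W.sum(axis=1)) - W              # unnormalised graph Laplacian
    _, vecs = np.linalg.eigh(L)                 # eigenvalues come back in ascending order
    return vecs[:, 1] > 0                       # sign of the second eigenvector

# Toy data: three points on the x-axis and three on the y-axis.
X = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0],
              [0.0, 1.0], [0.0, 2.0], [0.0, 3.0]])
labels = fiedler_bipartition(nc_affinity(X))
print(labels)
```

In the toy data the two groups have identical means and scales and differ only in their correlation structure, which is the situation a correlation-based similarity is meant to handle. Which group receives `True` depends on the arbitrary sign of the eigenvector.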

The proposed method first clusters the variables into groups on the basis of the correlation between variables by using NCSC. After variable clustering, each variable group is examined as to whether or not it should be used as input variables according to its contribution to the estimates. This method is referred to as NCSC-based variable selection (NCSC-VS).
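The group-wise evaluation step could look like the sketch below. The abstract does not state how the contribution of a group is measured, so this example substitutes a simple stand-in: the R^2 of an ordinary least-squares fit that uses only the variables of one group, followed by a hypothetical threshold. The function names, the threshold and the synthetic data are all assumptions for illustration.

```python
import numpy as np

def group_scores(X, y, labels):
    """R^2 of a least-squares fit that uses one variable group at a time (stand-in criterion)."""
    scores = {}
    yc = y - y.mean()
    for g in np.unique(labels):
        Xg = X[:, labels == g]
        Xg = Xg - Xg.mean(axis=0)               # centre the columns of this group
        coef, *_ = np.linalg.lstsq(Xg, yc, rcond=None)
        resid = yc - Xg @ coef
        scores[int(g)] = 1.0 - (resid @ resid) / (yc @ yc)
    return scores

def select_groups(scores, threshold=0.5):
    """Keep the groups whose score reaches the (hypothetical) threshold."""
    return sorted(g for g, s in scores.items() if s >= threshold)

# Synthetic data: the output depends only on the variables in group 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
labels = np.array([0, 0, 1, 1])                 # group label of each variable (column)
y = X[:, 0] + X[:, 1]
scores = group_scores(X, y, labels)
print(scores, select_groups(scores))
```

The point of the sketch is the unit of decision: whole groups of correlated variables are kept or discarded together, rather than individual variables.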

The variable selection result of NCSC-VS was compared with those of the conventional methods through an application to pharmaceutical process data provided by Daiichi Sankyo Co., Ltd. The target drug products consist of six components. Several blending experiments were conducted with different active pharmaceutical ingredient (API) contents. After each blending experiment, the granules for tableting were taken out, and NIR spectra (2203 points over 800–2500 nm) and the API content were measured. The objective is to select appropriate input wavelengths of the NIR spectra for constructing a precise calibration model that can estimate the API content.

The calibration data and the validation data consist of 576 and 20 samples, respectively. Before modeling, a first-derivative Savitzky-Golay smoothing filter was applied to the spectra.
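A first-derivative Savitzky-Golay filter fits a low-order polynomial to each sliding window and returns the slope of that polynomial at the window centre. The sketch below derives the filter coefficients by least squares; the window length of 5 and polynomial order of 2 are assumptions for the example, since the abstract does not report the settings used. In practice `scipy.signal.savgol_filter` with `deriv=1` provides the same operation.

```python
import numpy as np

def savgol_derivative_coeffs(window=5, polyorder=2):
    """Weights that give the fitted polynomial's slope at the window centre."""
    half = window // 2
    offsets = np.arange(-half, half + 1)
    A = np.vander(offsets, polyorder + 1, increasing=True)  # columns: 1, k, k^2, ...
    return np.linalg.pinv(A)[1]                 # row 1 is the linear (slope) coefficient

def first_derivative(spectrum, window=5, polyorder=2):
    """First derivative per sampling step; edge points without a full window are dropped."""
    c = savgol_derivative_coeffs(window, polyorder)
    # np.convolve flips its kernel, so reverse the weights to apply them as a correlation.
    return np.convolve(spectrum, c[::-1], mode="valid")

# A straight line with slope 3 should give a constant derivative of 3.
print(first_derivative(3.0 * np.arange(10.0)))
```

The derivative removes additive baseline offsets between spectra, while the polynomial fit limits the noise amplification that plain differencing would cause.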

A PLS model employing all the wavelengths, called PLS-All, was constructed as a benchmark, and the number of its latent variables was determined by cross-validation. The wavelengths were selected by using PLS-Beta, VIP, Lasso, Stepwise and the proposed NCSC-VS. The results show that the estimation performance of Stepwise was worse than that of PLS-All. PLS-Beta, VIP, SR and Lasso improved the estimation performance. The proposed NCSC-VS, however, achieved higher performance than the conventional methods, and its root mean square error (RMSE) was about 37% lower than that of PLS-All. In addition, the wavelengths selected by NCSC-VS consisted almost entirely of specific peaks, and this result is consistent with the physicochemical knowledge that peaks in spectra contain much information about compounds.
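For reference, the comparison metric can be written out as below: RMSE on the validation samples, and the relative improvement of a candidate model over a reference model. The function names and the numbers are made up purely to show the arithmetic; they are not the data or results of this study.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between measured and estimated values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def relative_improvement(rmse_reference, rmse_candidate):
    """Fraction by which the candidate's RMSE is lower than the reference's."""
    return (rmse_reference - rmse_candidate) / rmse_reference

# Toy values only: the reference is off by 1.0 everywhere, the candidate by 0.5.
y_true = [10.0, 12.0, 14.0, 16.0]
pred_reference = [11.0, 11.0, 15.0, 15.0]
pred_candidate = [10.5, 11.5, 14.5, 15.5]
print(relative_improvement(rmse(y_true, pred_reference), rmse(y_true, pred_candidate)))  # 0.5
```

Under this definition, an improvement of about 37% means the candidate's RMSE is roughly 0.63 times that of the reference model.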

Therefore, it is concluded that the proposed NCSC-VS can select meaningful wavelengths for calibration model design.