F-Test in k-Fold Cross Validation and Its Application to the Discovery of Biological Networks
- Type: Conference Presentation
- Conference Type: AIChE Annual Meeting
- Presentation Date: November 10, 2010
In recent years there has been considerable emphasis on applying systems approaches to decipher and reconstruct cellular networks from high-throughput data. To avoid over-fitting the data and to ensure that the resulting model has good predictive power, cross-validation is often used in data-driven input-response (or input/output (I/O)) modeling. The F-test is commonly used to compare the fit errors (usually the sum of squared prediction errors (SSE)) of the model on the training and test sets. If a sufficiently large dataset is available, the data can be divided into non-overlapping training and test sets and the F-test can be applied, subject to the assumption that the experimental data points, and hence the prediction errors, are normally distributed. However, when large datasets are not available owing to the cost of conducting experiments, as is often the case for biological systems, k-fold cross-validation (CV) is used. In k-fold CV, the entire dataset is randomly divided into k groups. The model is developed using (k - 1) groups as the training set, and the remaining group is used as the test set. This process is repeated until each of the k groups has been used as the test set exactly once. The mean of the SSE for the test sets is then compared with the mean of the SSE for the training sets through an F-test. In this case the training sets overlap: each group, i.e. a 1/k fraction of the samples, appears in (k - 1) of the k training sets. Hence, the computation of the degrees of freedom (DOF) for the average SSE of the training sets is not straightforward. To the best of our knowledge, in most existing work on k-fold CV, the comparison between the average SSE for the training and test sets is carried out qualitatively in an ad hoc fashion. In this work, we have developed a rigorous procedure to compute the DOFs for a robust F-test in k-fold cross-validation.
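The k-fold procedure described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the data are synthetic, a one-input ordinary least-squares line stands in for the I/O model, and the helper names (`fit_line`, `sse`, `k_fold_sse`) are hypothetical. The final ratio is only a raw per-sample comparison of test and training error; it does not reproduce the degrees-of-freedom procedure developed in this work, which the abstract does not spell out.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x (a stand-in for the I/O model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def sse(model, xs, ys):
    """Sum of squared prediction errors of the fitted line on (xs, ys)."""
    intercept, slope = model
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

def k_fold_sse(xs, ys, k=10, seed=0):
    """Randomly split the data into k groups; each group is the test set once."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint groups
    train_sse, test_sse = [], []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        x_tr = [xs[i] for i in train_idx]
        y_tr = [ys[i] for i in train_idx]
        x_te = [xs[i] for i in test_idx]
        y_te = [ys[i] for i in test_idx]
        model = fit_line(x_tr, y_tr)  # train on the other (k - 1) groups
        train_sse.append(sse(model, x_tr, y_tr))
        test_sse.append(sse(model, x_te, y_te))
    return folds, train_sse, test_sse

# Synthetic data (assumed for illustration only): y = 2 + 0.5*x + noise.
rng = random.Random(1)
n, k = 50, 10
xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
ys = [2.0 + 0.5 * x + rng.gauss(0.0, 1.0) for x in xs]

folds, train_sse, test_sse = k_fold_sse(xs, ys, k=k)
mean_train_sse = sum(train_sse) / k
mean_test_sse = sum(test_sse) / k

# Each sample is held out once, so it appears in (k - 1) training sets.
# This overlap is why the training-set DOF is not simply the sample count.
appearances = [sum(1 for fold in folds if i not in fold) for i in range(n)]

# Raw per-sample error ratio (k divides n here, so set sizes are equal).
# This is NOT an F statistic with corrected degrees of freedom.
ratio = (mean_test_sse / (n / k)) / (mean_train_sse / (n - n / k))

print("mean training SSE:", mean_train_sse)
print("mean test SSE:", mean_test_sse)
print("training sets per sample:", set(appearances))
print("test/train per-sample error ratio:", ratio)
```

The `appearances` count makes the overlap concrete: with k = 10, every sample contributes to nine of the ten training-set SSE values, so those values are not independent.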
We have applied this k-fold CV approach to a partial least squares (PLS)-based method for identifying the interactions between different signaling proteins, using phosphoprotein data from mouse macrophage RAW 264.7 cells provided by the Alliance for Cellular Signaling (AfCS). A value of k = 10 was used. In the PLS-based modeling scheme used here, only one output is modeled at a time (1), which differs from the traditional way of applying the PLS technique to I/O data. Once the I/O model is deemed robust based on the F-test, significant interactions are selected through a t-test (1, 2) and are used to reconstruct the phosphoprotein signaling network. Important signaling events, such as the activation of glycogen synthase kinase 3 by protein kinase B (Akt), are captured by our reconstructed network. Our analysis also generates novel links and testable hypotheses. We will also show the application of the approach to least-squares regression and principal component regression-based techniques for modeling I/O data.
References
1. Gupta, S., M. R. Maurya, and S. Subramaniam. 2010. Identification of crosstalk between phosphoprotein signaling pathways in RAW 264.7 macrophage cells. PLoS Comput Biol. 6:e1000654.
2. Pradervand, S., M. R. Maurya, and S. Subramaniam. 2006. Identification of signaling components required for the prediction of cytokine release in RAW 264.7 macrophages. Genome Biol. 7:R11.