(341f) A Method for Learning a Sparse Classification Model in the Presence of Missing Data
AIChE Annual Meeting
2016
Computing and Systems Technology Division
Computational Methods in Biological and Biomedical Systems
Tuesday, November 15, 2016 - 2:00pm to 2:18pm
To test the algorithm, a case study on the classification of two types of acute leukemia is presented. The dataset consists of microarray gene expression data; it is a public benchmark problem that has been widely studied [2]. Missing values are artificially introduced so that they are representative of the missing data observed in microarrays [3]. The proposed approach is compared with the nearest shrunken centroids algorithm [4] and sparse linear discriminant analysis [5]. For the comparison, missing data is handled with complete case analysis, mean imputation, and k-nearest neighbor imputation, all common approaches in the field. In this case study, the proposed approach outperforms the alternative methods.
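The three missing-data treatments named above can be illustrated with a minimal NumPy sketch. This is not the code used in the study. The function names, the toy matrix `X`, the convention that rows are samples and columns are genes, and the simplified neighbor search (mean squared difference over jointly observed columns, neighbors taken among rows) are all assumptions made for illustration.

```python
import numpy as np

def complete_case(X):
    """Complete case analysis: keep only columns (genes) with no missing entries."""
    keep = ~np.isnan(X).any(axis=0)
    return X[:, keep]

def mean_impute(X):
    """Mean imputation: fill each missing entry with its column's observed mean."""
    col_means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_means, X)

def knn_impute(X, k=2):
    """Simplified k-nearest neighbor imputation (illustrative assumption).

    Each missing entry is filled with the average of that column over the
    k closest rows that have the column observed. Distance between two rows
    is the mean squared difference over columns observed in both rows.
    Entries with no usable neighbor are left as NaN.
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    missing = np.isnan(X)
    for i in np.flatnonzero(missing.any(axis=1)):
        shared = ~missing[i] & ~missing              # columns observed in both rows
        diff = np.where(shared, X - X[i], 0.0)       # ignore columns not shared
        counts = shared.sum(axis=1)
        dist = np.full(len(X), np.inf)
        ok = counts > 0
        dist[ok] = (diff[ok] ** 2).sum(axis=1) / counts[ok]
        dist[i] = np.inf                             # a row is not its own neighbor
        order = np.argsort(dist)
        for j in np.flatnonzero(missing[i]):
            donors = [r for r in order
                      if np.isfinite(dist[r]) and not missing[r, j]][:k]
            if donors:
                filled[i, j] = X[donors, j].mean()
    return filled

# Toy expression matrix (hypothetical values): 4 samples x 3 genes, one missing entry.
X = np.array([
    [1.0, 2.0, 3.0],
    [1.0, np.nan, 3.0],
    [5.0, 6.0, 7.0],
    [1.0, 2.0, 4.0],
])

X_cc = complete_case(X)      # drops the gene with a missing value
X_mean = mean_impute(X)      # fills with the column mean of observed values
X_knn = knn_impute(X, k=2)   # fills from the two most similar samples
```

On this toy matrix, complete case analysis discards the second gene entirely, mean imputation fills the gap with the average over all other samples, and the neighbor-based variant borrows only from the two samples that most resemble the incomplete one.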
[1] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39: 1-38, 1977.
[2] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286: 531-537, 1999.
[3] O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, and R. B. Altman. Missing value estimation methods for DNA microarrays. Bioinformatics, 17: 520-525, 2001.
[4] R. Tibshirani, T. Hastie, B. Narasimhan, and G. Chu. Diagnosis of multiple cancer types by shrunken centroids of gene expression. PNAS, 99: 6567-6572, 2002.
[5] K. Sjöstrand, L. H. Clemmensen, R. Larsen, and B. Ersbøll. SpaSM: A Matlab toolbox for sparse statistical modeling. Journal of Statistical Software, 2012.