(560g) Biclustering Via Optimal Ordering | AIChE

(560g) Biclustering Via Optimal Ordering

Authors 

DiMaggio, P. A. Jr. - Presenter, Princeton University
McAllister, S. R. - Presenter, Princeton University
Floudas, C. A. - Presenter, Princeton University
Feng, X. - Presenter, Princeton University
Rabinowitz, J. D. - Presenter, Princeton University
Rabitz, H. A. - Presenter, Princeton University


Problems of data organization and data clustering are prevalent across a number of different disciplines. The goal of data clustering, regardless of the application, is to organize data in such a way that objects which exhibit ?similar? attributes are grouped together. The definition of similarity depends on the application and may correspond to the direct comparison of values or the degree of correlation among trends or patterns of values. Several methods have been proposed for the clustering of large-scale data. The most frequently used techniques are hierarchical [1] and partitioning [2] clustering because of their availability and ease of implementation. However, these approaches result in suboptimal clusters because the comparisons between terms are evaluated locally.

Rearrangement clustering has emerged as an effective technique for optimally minimizing the sum of the pair-wise distances between rearranged rows and columns. The bond energy algorithm (BEA) was originally proposed as a method for finding ?good? solutions to this problem [3] and it was subsequently discovered that this problem could be formulated as a traveling salesman problem (TSP) which can be solved to optimality [4,5] using existing methods. The concept of biclustering has emerged in the context of microarray experiments where the rearrangement of the rows and columns is needed to identify a bicluster, which corresponds to a subset of genes related under a certain set of conditions. This is particularly useful when a gene is involved in more than one biological process or belongs to a group of genes that are co-expressed under limited conditions. An assortment of techniques and formulations has been proposed, many of which utilize heuristics to identify biclusters and often require normalization to elucidate patterns of interest. A good review on biclustering can be found elsewhere [6].

In this work, we present a method for the optimal rearrangement of both the rows and columns of a data matrix. Various objective functions can be used to quantify the degree of similarity among adjacent rows or columns in the final arrangement. We present both a network flow representation, which can be extended to address sparse data sets, and a traveling salesman problem (TSP) representation to model the actual permutations of the rows and columns. We compared the results of our method with hierarchical clustering for (a) metabolite concentration data [7], (b) colon cancer data [8], and (c) gene expression data [9]. For each of these data sets, our method provides a denser grouping of related metabolites and annotated genes (i.e., rows of the data matrix) than does hierarchical clustering, which suggests that optimal ordering has distinct advantages over local ordering. Our method also reconstructs underlying fundamental patterns in the data. For the metabolite concentration data, the nitrogen and carbon starvation conditions are perfectly separated and the in the colon cancer data the tissue types are partitioned into tumor-rich and normal-rich regions.

[1] M.S. Eisen, P.T. Spellman, P.O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. PNAS, 95(25):14863-14868, 1998.

[2] J.A. Hartigan and M.A. Wong. Algorithm AS 136: a K-means clustering algorithm. Applied Statistics, 28:100-108, 1979.

[3] W.T. McCormick Jr, P.J. Schweitzer, and T.W. White. Problem decomposition and data reorganization by a clustering technique. Operations Research, 20(5):993-1009, 1972.

[4] J.K. Lenstra. Clustering a data array and the traveling-salesman problem. Operations Research, 22(2):413-414, 1974.

[5] J.K. Lenstra and A.H.G. Rinnooy Kan. Some simple applications of the traveling salesman problem. Operations Research Quarterly, 26(4):717-733, 1975.

[6] S.C. Madeira and A.L. Oliveira. Biclustering algorithms for biological data analysis: A survey. IEE-ACM Trans. Comp. Bio., 1(1):24-45, 2004.

[7] M. J. Brauer, J. Yuan, B. Bennett, W. Lu, E. Kimball, D. Bostein, and J.D. Rabinowitz. Conservation of the metabolomic response to starvation across two divergent microbes. Proc. Natl. Acad. Sci., 103: 19302-19307.

[8] U. Alon, N. Barkai, D.A. Notterman, K. Gish, S. Ybarra, D. Mack, and A.J. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci., 96:6745-6750, 1999.

[9] R.B. Brem and L. Kruglyak. The landscape of genetic complexity across 5,700 gene expression traits in yeast. Proc. Natl. Acad. Sci., 102(5):1572-1577, 2005.