(679a) A Novel Optimization-Based Clustering Approach and Prediction of Optimal Number of Clusters: Global Optimum Search with Enhanced Positioning (EP_GOS_Clust) | AIChE

Authors 

Tan, M. P. - Presenter, Princeton University
Broach, J. R. - Presenter, Princeton University
Floudas, C. A. - Presenter, Princeton University


Cluster analysis of genome-wide expression data from DNA microarray hybridization studies has proven to be a useful tool for identifying biologically relevant groupings of genes. Patterns seen in genome-wide expression experiments can point to unknown regulatory elements. Also, since genes with similar functions tend to cluster together, grouping genes of known function with poorly characterized genes may provide a simple means of gaining insight into the functions of the uncharacterized genes. It is hence important to apply a rigorous yet intuitive clustering algorithm to uncover these genomic relationships. However, several popular clustering algorithms are sensitive to the initialization point and leave the user broad latitude in choosing the number of clusters. Furthermore, it is unclear how well these algorithms find the tightest possible groupings of the data.

In this presentation, a novel clustering algorithm framework, EP_GOS_Clust, is introduced [1]. It is based on a variant of the Generalized Benders Decomposition, denoted as the Global Optimum Search [2, 3], and includes a procedure to determine the optimal number of clusters. As an investigative study, the proposed algorithm is applied to experimental DNA microarray data centered on the Ras signaling pathway in the yeast Saccharomyces cerevisiae. The clustering results are compared to those obtained with existing popular clustering algorithms. The proposed approach outperforms these algorithms in both intra-cluster similarity and inter-cluster dissimilarity, often considered the two key tenets of clustering. The implementation is also structured to expedite the determination of the optimal number of clusters.
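As an illustration of the two clustering criteria mentioned above, the following minimal Python sketch scores a given cluster assignment. It is not the EP_GOS_Clust formulation: the function names, the use of squared Euclidean distance, and the definition of the inter-cluster term as the sum of squared distances between cluster centers are all assumptions made for this example.

```python
from typing import Dict, List, Sequence, Tuple

Point = Sequence[float]


def centroid(points: List[Point]) -> List[float]:
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance (an assumed choice of distance measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_error_sums(data: List[Point], labels: List[int]) -> Tuple[float, float]:
    """Return (intra, inter) error sums for a given cluster assignment.

    intra: sum of squared distances from each point to its cluster center
           (smaller means tighter clusters).
    inter: sum of squared distances between all pairs of cluster centers
           (larger means better separated clusters).
    """
    groups: Dict[int, List[Point]] = {}
    for point, label in zip(data, labels):
        groups.setdefault(label, []).append(point)
    centers = {label: centroid(points) for label, points in groups.items()}

    intra = sum(sq_dist(point, centers[label]) for point, label in zip(data, labels))
    keys = sorted(centers)
    inter = sum(
        sq_dist(centers[a], centers[b])
        for i, a in enumerate(keys)
        for b in keys[i + 1:]
    )
    return intra, inter


# Hypothetical toy data: two obvious groups along the first coordinate.
data = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
good = cluster_error_sums(data, [0, 0, 1, 1])  # groups match the structure
poor = cluster_error_sums(data, [0, 1, 0, 1])  # groups cut across it
```

On this toy data the assignment that follows the visible structure has the smaller intra-cluster sum and the larger inter-cluster sum, which is the sense in which one clustering can be said to outperform another on both criteria.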

In laying the groundwork for the development of EP_GOS_Clust, we also study the effects of different normalization methods and pre-clustering techniques on clustering quality [4]. The aim of pre-clustering is to use just enough discriminatory characteristics to form rough information profiles, so that data points with similar features can be pre-grouped and outliers deemed not significant to the clustering process can be removed. With respect to the clustering of DNA microarray data, we compare the merits of normalizing expression data across genes as opposed to over each experiment. We also study the effects that different pre-clustering approaches have on clustering quality. Specifically, we look at pre-clustering genes based on both actual expression data and symbolic representations such as {+, o, -}. In our assessment, we again use the intra- and inter-cluster error sums. We also use publicly available Gene Ontology resources to determine the pre-clustering method that yields clusters with the highest level of biological coherence.
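The two normalization directions and the symbolic pre-grouping idea can be sketched in Python as follows. This is a hedged illustration only: the abstract does not state which normalization methods or thresholds were compared, so the z-score scaling, the `threshold` value, and all function names here are assumptions. The matrix is assumed to hold one gene per row and one experiment per column.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List

Matrix = List[List[float]]  # rows = genes, columns = experiments (assumed layout)


def zscore(values: List[float]) -> List[float]:
    """Scale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        return [0.0] * len(values)  # flat profile: nothing to scale
    return [(v - mu) / sd for v in values]


def normalize_rows(matrix: Matrix) -> Matrix:
    """Normalize each gene's profile over the experiments."""
    return [zscore(row) for row in matrix]


def normalize_columns(matrix: Matrix) -> Matrix:
    """Normalize each experiment over all genes."""
    scaled_columns = [zscore(list(column)) for column in zip(*matrix)]
    return [list(row) for row in zip(*scaled_columns)]


def symbolic_profile(row: List[float], threshold: float = 0.5) -> str:
    """Map each value to '+', 'o', or '-' using an assumed cutoff."""
    return "".join(
        "+" if v > threshold else "-" if v < -threshold else "o" for v in row
    )


def pregroup(matrix: Matrix, threshold: float = 0.5) -> Dict[str, List[int]]:
    """Group gene indices that share the same symbolic profile."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for index, row in enumerate(matrix):
        groups[symbolic_profile(row, threshold)].append(index)
    return dict(groups)


# Hypothetical expression values for three genes over three experiments.
expression = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [3.0, 2.0, 1.0],
]
groups = pregroup(normalize_rows(expression))
```

In this toy example the first two genes differ in magnitude but share the same rising trend, so row-wise normalization followed by symbolic encoding places them in the same pre-group, while the third gene, with the opposite trend, forms its own group.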

[1]-Tan, M. P.; Broach, J. R.; Floudas, C. A.; A Novel Clustering Approach and Prediction of Optimal Number of Clusters: Global Optimum Search with Enhanced Positioning (EP_GOS_Clust); 2006; In Preparation

[2]-Floudas, C. A.; Nonlinear and Mixed-Integer Optimization: Fundamentals and Applications; Oxford University Press; 1995

[3]-Floudas, C. A.; Aggarwal, A.; Ciric, A. R.; Global Optimum Search for Non Convex NLP and MINLP Problems; Comp. & Chem. Eng.; 13(10); 1989; pp. 1117-1132

[4]-Tan, M. P.; Broach, J. R.; Floudas, C. A.; Evaluation of Normalization and Pre-Clustering Issues on a Novel Mixed-Integer Nonlinear Optimization-Based Clustering Approach; 2006; In Preparation