# (301p) Comparative Assessment of Clustering Techniques for the Analysis of Temporal Gene Expression Data

#### AIChE Annual Meeting

#### 2006

#### Computing and Systems Technology Division

#### Poster Session: Computers in Operations and Information Processing

#### Tuesday, November 14, 2006 - 3:15pm to 5:45pm

The advent of microarrays has been a boon to researchers in molecular biology and genetics. However, despite the ability to measure the expression levels of thousands of genes at once, the true potential of microarray experiments has not yet been realized, mainly because of limitations in the data analysis step of the experiment. An area of active research is clustering algorithms that work under the primary assumption that co-expressed genes should exhibit a relatively high degree of similarity in function and regulation (6). However, a major problem in analyzing transcriptional information is that no rigorous computational methods exist for the biological evaluation of clustering results and, furthermore, different algorithms produce different results with no clear justification for the discrepancies. Therefore, the main question we wish to address in this presentation is how to assess the actual information that is generated by clustering microarray data. Furthermore, we will demonstrate the advantages of the rigorous and unbiased integration of clustering with the selection of informative genes.

In this work, we evaluate four clustering algorithms on their ability to segregate genes by function using expression profiles only, and we assess the major similarities and differences among the resulting solutions. The four algorithms are: hierarchical clustering based on CLUTO, a general-purpose toolkit for clustering various datasets (8); Cluster Analysis of Gene Expression Dynamics (CAGED), a clustering algorithm based on Bayesian learning of auto-regressive models and specially designed to analyze temporal gene expression data (5); and two clustering algorithms for temporal expression data that combine selection and clustering, STEM (3) and SLINGSHOTS (7).

Hierarchical clustering is a widely used general-purpose clustering algorithm. It takes a bottom-up approach in which the two closest expression profiles, as measured by a symmetric distance metric, are merged, then the next closest, and so forth. Variations of the technique differ in the distance metric, such as Pearson's correlation or Euclidean distance, and in the linkage type, which specifies how sub-clusters are merged with each other. Examples are single linkage, in which the distance between two sub-clusters is taken between their two closest elements; complete linkage, in which the distance is taken between their two farthest elements; and the centroid method, in which the mean expression profile of each sub-cluster is computed and then used to calculate the distance. In this work we use Euclidean distance and the centroid linkage method.
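The merging procedure described above can be illustrated with a minimal sketch. It assumes Euclidean distance between cluster centroids, as stated in the text; the function names, the toy profiles, and the stopping rule (a fixed number of clusters) are assumptions made for illustration and do not reproduce the CLUTO implementation.

```python
from math import dist

def centroid(profiles):
    """Mean expression profile of a cluster (component-wise average)."""
    n = len(profiles)
    return [sum(col) / n for col in zip(*profiles)]

def centroid_linkage(profiles, n_clusters):
    """Bottom-up clustering: repeatedly merge the two clusters whose
    centroids are closest in Euclidean distance.
    Stopping at n_clusters is an assumption of this sketch."""
    clusters = [[i] for i in range(len(profiles))]  # start from singletons
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca = centroid([profiles[i] for i in clusters[a]])
                cb = centroid([profiles[i] for i in clusters[b]])
                d = dist(ca, cb)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return clusters

# Hypothetical toy data: two rising and two falling profiles.
toy = [[0.0, 1.0, 2.0], [0.1, 1.1, 2.1], [2.0, 1.0, 0.0], [2.1, 0.9, 0.1]]
result = centroid_linkage(toy, 2)
```

On this toy input the two rising profiles end up in one cluster and the two falling profiles in the other. The exhaustive pair search is kept for readability rather than speed.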

CAGED also takes a bottom-up approach to clustering. It works by constructing an auto-regressive model that describes every expression profile in its cluster within a certain tolerance. The probability that a model describes the data is based on the residual sum of squares error term, from which a Bayesian classifier can be built. The posterior probability of this Bayesian classifier is then used to determine whether a merge should be accepted: if the posterior probability at iteration n is greater than that at iteration n-1, the merge is accepted. In effect, the method performs pair-wise clustering in a fashion similar to hierarchical clustering while using a more sophisticated metric of temporal gene expression similarity.
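To make the residual sum of squares term concrete, the following is a minimal sketch of fitting a first-order auto-regressive model to a single profile by ordinary least squares. The model order, the function name `ar1_rss`, and the example series are assumptions for illustration; CAGED's Bayesian scoring and merge test are not reproduced here.

```python
def ar1_rss(series):
    """Least-squares fit of x[t] = b0 + b1 * x[t-1].
    Returns (b0, b1, rss), where rss is the residual sum of squares.
    A first-order model is an assumption of this sketch."""
    x = series[:-1]   # values at time t-1
    y = series[1:]    # values at time t
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx if sxx > 0 else 0.0
    b0 = my - b1 * mx
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return b0, b1, rss

# Hypothetical profile that doubles at every time point, so an AR(1)
# model with b0 = 0 and b1 = 2 fits it with no residual error.
b0, b1, rss = ar1_rss([1.0, 2.0, 4.0, 8.0, 16.0])
```

A small residual sum of squares indicates that the auto-regressive model describes the profile well, which is the quantity the text says the merge probability is built on.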

The two clustering/selection approaches, STEM and SLINGSHOTS, operate under a different assumption. Instead of clustering bottom-up, both methods define a set of bins and assign each gene expression profile to a bin (hash value) based on the shape of the profile. The two hashing approaches differ slightly: STEM uses variable unit steps to determine whether time point N is 0, +1, +2, -1, or -2 steps away from time point N-1, while SLINGSHOTS uses a quantization based on Gaussian breakpoints to achieve a roughly equal distribution of symbols. As a consequence of this hashing, both algorithms have at this point performed a fine-grained clustering of the data in linear time. This fine-grained clustering of the time series yields a very large number of clusters, of order s^t, where s is the number of symbols used to define the search space and t is the number of time points being analyzed. The primary difference between the two algorithms lies in how bins are selected. STEM selects bins based on their population: for every bin, it counts the number of genes that have hashed to that bin and calculates the probability of that many genes hashing to the same bin given a set of genes with random profiles. SLINGSHOTS also assumes that bins containing more expression profiles are more informative than less populated motifs. However, it chooses the cutoff based on an optimization criterion, namely that genes relevant to an experimental perturbation ought to show significant deviations from their baseline levels. The peaks are therefore added to the set of informative motifs in order of their population until the maximum difference in their cumulative distribution function is reached.
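The hashing step can be sketched as follows for a Gaussian-breakpoint quantization of the kind described above. The z-normalization of each profile, the default of three symbols, and all names in the code are assumptions made for illustration; the sketch is not the SLINGSHOTS or STEM implementation.

```python
from collections import defaultdict
from statistics import NormalDist, mean, pstdev

def gaussian_breakpoints(n_symbols):
    """Cut points that split a standard normal into equal-probability bands."""
    nd = NormalDist()
    return [nd.inv_cdf(k / n_symbols) for k in range(1, n_symbols)]

def hash_profile(profile, breakpoints):
    """Z-normalize a profile (an assumption of this sketch), then map each
    time point to a symbol index by counting the breakpoints it exceeds."""
    mu, sigma = mean(profile), pstdev(profile)
    if sigma == 0:
        z = [0.0] * len(profile)
    else:
        z = [(v - mu) / sigma for v in profile]
    return tuple(sum(v > b for b in breakpoints) for v in z)

def bin_profiles(profiles, n_symbols=3):
    """One hash per profile in a single pass, hence linear time in the
    number of genes."""
    cuts = gaussian_breakpoints(n_symbols)
    bins = defaultdict(list)
    for gene, profile in profiles.items():
        bins[hash_profile(profile, cuts)].append(gene)
    return dict(bins)

# Hypothetical profiles: geneA and geneB rise, geneC falls.
profiles = {
    "geneA": [0.0, 1.0, 2.0, 3.0],
    "geneB": [10.0, 20.0, 30.0, 40.0],
    "geneC": [3.0, 2.0, 1.0, 0.0],
}
bins = bin_profiles(profiles)
```

Here geneA and geneB hash to the same bin because they share a shape after normalization, while geneC falls in a different bin. With s = 3 symbols and t = 4 time points there are 3^4 = 81 possible bins, which illustrates how quickly the s^t search space grows.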

The quality of each clustering technique will be evaluated through its ability to "enrich" ontologies, that is, its ability to group genes by expression level in a way that can be rationalized by their annotated functions. To make this determination, we first cluster with each of the four methods, following best practices to obtain a proper solution. We then evaluate the algorithms on their ability to enrich ontologies, using a significance calculation that assumes the number of ontologies per cluster follows a hypergeometric distribution. Finally, we evaluate the ability of each clustering or selection algorithm to select for ontologies that are known to be related to inflammation and burn injury, such as cholesterol biosynthesis and interleukin-related pathways. This will be used to assess the ability of the various hashing algorithms to provide information about the mechanism that underlies the experimental response. The comparisons will be discussed in the context of three inflammation-specific experiments and their corresponding data: a corticosteroid data set with 17 time points (1), a burn data set (4), and a bacterial endotoxin-induced sepsis data set with 6 time points (2).
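A hypergeometric significance value of the kind mentioned above can be computed as an upper-tail probability. The sketch below is a minimal illustration; the function name, the argument names, and the example counts are hypothetical, and the exact test and any multiple-testing correction used in this work are not specified in the text.

```python
from math import comb

def enrichment_pvalue(cluster_hits, cluster_size, annotated_total, population):
    """Probability of seeing at least `cluster_hits` annotated genes in a
    cluster of `cluster_size` genes drawn without replacement from
    `population` genes, of which `annotated_total` carry the annotation."""
    upper = min(cluster_size, annotated_total)
    denom = comb(population, cluster_size)
    tail = sum(
        comb(annotated_total, k)
        * comb(population - annotated_total, cluster_size - k)
        for k in range(cluster_hits, upper + 1)
    )
    return tail / denom

# Hypothetical counts: 10 genes in total, 4 carry the ontology term,
# and a cluster of 3 genes contains 2 of them.
p = enrichment_pvalue(cluster_hits=2, cluster_size=3,
                      annotated_total=4, population=10)
```

For these counts the tail probability is (36 + 4) / 120 = 1/3, so such a small cluster would not be considered enriched. A small p-value indicates that the ontology term is over-represented in the cluster relative to chance.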

The number of significant clusters depends on the particular algorithm. The methods will therefore be analyzed in a comparative manner to assess major similarities and discrepancies in terms of co-expressed genes, as well as the relative functional enrichment of the clusters produced by each class. Overall, we determine that, despite the different numbers of clusters, STEM, SLINGSHOTS, and hierarchical clustering yield a good segregation of gene ontologies into different clusters, whereas CAGED in general yields poor functional enrichment across clusters. However, a more in-depth analysis of the results for STEM and SLINGSHOTS will also demonstrate the importance of combining clustering and selection for upgrading the information content of the experimental data. While hierarchical clustering exhibits bias toward grouping ontologies in a manner similar to the expression profiles, the majority of its ontologies are not statistically over-represented, whereas a much larger percentage of the inflammation-specific ontologies selected by STEM and SLINGSHOTS are statistically over-represented.

1. Almon RR, DuBois DC, Pearson KE, Stephan DA, Jusko WJ. 2003. Gene arrays and temporal patterns of drug response: corticosteroid effects on rat liver. Funct Integr Genomics 3: 171-9

2. Calvano SE, Xiao W, Richards DR, Felciano RM, Baker HV, et al. 2005. A network-based analysis of systemic inflammation in humans. Nature 437: 1032-7

3. Ernst J, Bar-Joseph Z. 2006. STEM: a tool for the analysis of short time series gene expression data. BMC Bioinformatics 7: 191

4. Jayaraman A, Yarmush ML, Roth CM. 2005. Evaluation of an in vitro model of hepatic inflammatory response by gene expression profiling. Tissue Eng 11: 50-63

5. Ramoni MF, Sebastiani P, Kohane IS. 2002. Cluster analysis of gene expression dynamics. Proc Natl Acad Sci U S A 99: 9121-6

6. Wolfe CJ, Kohane IS, Butte AJ. 2005. Systematic survey reveals general applicability of "guilt-by-association" within gene coexpression networks. BMC Bioinformatics 6: 227

7. Yang E, Berthiaume F, Yarmush ML, Androulakis IP. 2006. An integrative systems biology approach for analyzing liver hypermetabolism. Presented at 9th Int. Symp. Process Systems Engineering and 16th European Symp. Computer Aided Process Engineering, Garmisch-Partenkirchen, Germany

8. Zhao Y, Karypis G. 2005. Data clustering in life sciences. Mol Biotechnol 31: 55-80