(374f) Identification of Gene Networks: Characterization of the Problem and Assessment of the Tools | AIChE

Authors 

Guner, U. - Presenter, Georgia Institute of Technology
Lee, J. H. - Presenter, Korea Advanced Institute of Science and Technology (KAIST)
Realff, M. - Presenter, Georgia Institute of Technology

Two major concepts of molecular biology are that (i) genes, the fundamental units of heredity, are encoded as sequences of chemical bases in DNA and (ii) a gene is expressed when its DNA sequence is transcribed into an RNA intermediate, which is then translated into protein. Proteins, in turn, perform regulatory, catalytic, mechanical, and electrical functions [1].

Genes, proteins, and metabolites can regulate one another in various ways. Regulatory proteins bind to DNA to affect the transcription of genes. Proteins can also combine to form multi-protein complexes that take part in various regulatory functions [1]. All these interactions form a complex network of regulatory control. Experimentally, it is quite hard to obtain information on the levels of gene regulation. A major challenge in biology is to map out and model the topological and dynamical properties of these networks.

Recently, diverse types of genomic data have been obtained to shed light on transcription regulation, e.g., DNA sequence data, micro-array gene expression data, and protein-DNA binding data. The advent of such diverse data has motivated various researchers to develop computational methods to model transcription regulation [2]. Protein-DNA binding data help identify the regulators involved in transcription. Time-series micro-array expression experiments are the main source of dynamic information about the expression of thousands of genes that are activated or repressed in response to external stimuli [3].

Extensive studies on gene regulatory network modeling using time-series data have focused on linear discrete-time model equations. In this model, the expression level of a gene is assumed to be the concentration of its transcript. The concentration of a particular transcript at time point $t+1$, $x_i(t+1)$, is given by a linear function of the concentrations of the other RNA species at time point $t$, $x_j(t)$:

$$x_i(t+1) = \sum_{j=1}^{N} a_{ij}\, x_j(t) + e_i(t) \tag{1}$$

where $N$ is the number of transcripts in the network, $a_{ij}$ is the regulatory strength between the gene pair $i$ and $j$ (the effect of gene $j$ on gene $i$), and $e_i(t)$ is the error term for the difference between the observation and the model. The errors are assumed to be Gaussian with zero mean and standard deviation $\sigma$. The aim is to estimate the parameter values $a_{ij}$ from the micro-array observations $y_i(t)$, thereby reconstructing the gene network. A negative $a_{ij}$ indicates inhibition, and a positive $a_{ij}$ indicates activation between the gene pair. In general, only a small subset of all RNA species regulates a particular transcript, which means most of the $a_{ij}$ are zero. In other words, gene networks are sparse [4].
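The linear discrete-time model described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function name `simulate_network`, the five-gene matrix `A`, and the noise level are hypothetical choices for illustration, not values from the study.

```python
import numpy as np

def simulate_network(A, x0, n_steps, sigma, rng):
    """Iterate x(t+1) = A @ x(t) + e(t), with e(t) ~ N(0, sigma^2 I).

    Returns an (N, n_steps) array whose column t holds the transcript
    concentrations at time point t.
    """
    N = A.shape[0]
    X = np.empty((N, n_steps))
    X[:, 0] = x0
    for t in range(n_steps - 1):
        X[:, t + 1] = A @ X[:, t] + rng.normal(0.0, sigma, size=N)
    return X

rng = np.random.default_rng(0)
N = 5

# Hypothetical sparse regulatory matrix: most entries are zero.
A = np.diag([0.8, 0.6, 0.7, 0.5, 0.9])
A[1, 0] = 0.5    # positive entry: gene 0 activates gene 1
A[2, 1] = -0.4   # negative entry: gene 1 inhibits gene 2
A[4, 3] = 0.3    # positive entry: gene 3 activates gene 4

X = simulate_network(A, x0=np.ones(N), n_steps=20, sigma=0.05, rng=rng)
print(X.shape)                               # (genes, time points)
print(np.count_nonzero(A), "of", A.size, "entries are non-zero")
```

Row `i` of `A` lists the regulators of gene `i`, so the sparsity of the network shows up directly as zeros in that row.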

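Microarray data is usually limited and subject to high levels of additive and multiplicative errors [5]. Therefore, one can write the measured concentration levels of the genes as follows:

$$y_i(t) = x_i(t) + v_i(t), \qquad v_i(t) = x_i(t)\,\epsilon_i(t) + \delta_i(t) \tag{2}$$

In this equation, $y_i(t)$ is the measured value and $x_i(t)$ is the unknown true value of the concentration of gene $i$ at time point $t$, and $v_i(t)$ is the measurement error. The terms $\epsilon_i(t)$ and $\delta_i(t)$ correspond to the multiplicative and additive parts of the measurement error.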
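A measurement model with multiplicative and additive error parts can be sketched as follows. The exact form used here (true value times a zero-mean Gaussian factor, plus a zero-mean Gaussian offset) is an assumption made for illustration, as are the function name `add_measurement_error` and the noise levels.

```python
import numpy as np

def add_measurement_error(X, sigma_mult, sigma_add, rng):
    """Return noisy measurements Y = X + V and the error V itself.

    Assumed error form: V = X * eps + delta, where eps and delta are
    independent zero-mean Gaussian terms (multiplicative and additive parts).
    """
    eps = rng.normal(0.0, sigma_mult, size=X.shape)    # multiplicative part
    delta = rng.normal(0.0, sigma_add, size=X.shape)   # additive part
    V = X * eps + delta
    return X + V, V

rng = np.random.default_rng(0)
X_true = np.array([[1.0, 2.0, 4.0],
                   [0.5, 0.5, 0.5]])   # hypothetical true concentrations
Y_meas, V_err = add_measurement_error(X_true, sigma_mult=0.2, sigma_add=0.1, rng=rng)
print(Y_meas)
```

Under this assumed form, the multiplicative part grows with the true concentration while the additive part does not, so highly expressed genes carry larger absolute errors.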

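Using equations (1) and (2), one can write the model for all genes,

$$y(t+1) - v(t+1) = A\,\bigl(y(t) - v(t)\bigr) + e(t) \tag{3}$$

where $x(t) = [x_1(t), \ldots, x_N(t)]^T = y(t) - v(t)$, $y(t) = [y_1(t), \ldots, y_N(t)]^T$, $v(t) = [v_1(t), \ldots, v_N(t)]^T$, $e(t) = [e_1(t), \ldots, e_N(t)]^T$, and $A = [a_{ij}]$ is the $N \times N$ matrix of regulatory strengths.

Equation (3) can be written for all time points, $t = 1, \ldots, T-1$, as follows:

$$Y_1 - V_1 = A\,(Y_0 - V_0) + E \tag{4}$$

where $Y_0 = [y(1), \ldots, y(T-1)]$ and $Y_1 = [y(2), \ldots, y(T)]$, $V_0 = [v(1), \ldots, v(T-1)]$ and $V_1 = [v(2), \ldots, v(T)]$, and $E = [e(1), \ldots, e(T-1)]$.

One can see that the error terms on the two sides of equation (4), $V_1$ and $V_0$, are serially correlated: they share the columns $v(2), \ldots, v(T-1)$ and differ only in the first column of $V_0$ and the last column of $V_1$.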
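The column overlap behind this serial correlation can be checked directly. The sketch below assumes that the regression matrices are built by dropping the last and the first column of one time-ordered array; the helper name `lagged_matrices` and the toy array are hypothetical.

```python
import numpy as np

def lagged_matrices(M):
    """Split an (N, T) time-ordered array into predictor and response blocks.

    M0 holds time points 1..T-1 and M1 holds time points 2..T, so the two
    blocks share every column except the first of M0 and the last of M1.
    """
    return M[:, :-1], M[:, 1:]

# Toy error array with N = 3 genes and T = 4 time points.
V = np.arange(12.0).reshape(3, 4)
V0, V1 = lagged_matrices(V)

# Columns 2..T-1 appear in both blocks, which is what correlates the errors.
print(np.array_equal(V0[:, 1:], V1[:, :-1]))
```

The same split applied to the measurement array gives the predictor and response data for any of the regression methods.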

A significant problem from the regression standpoint is that both the independent and the dependent variables carry a high level of noise. Moreover, these noise terms are serially correlated. Other challenging characteristics include the limited amount of available data and the sparse but unknown structure of the parameter matrix. Information about the network topology is accessible only to a limited extent, through noisy protein-DNA binding data.

Many parameter estimation algorithms have been applied to this problem in the gene network identification literature [1]. Here, we benchmark different regression methods for this model. In the context of this problem, the most commonly used method is least squares estimation. In classical least squares regression theory, the errors are assumed to be confined to the response variables. In this model, however, the predictor variables are also noisy (see the measurement-error term on the right-hand side of equation (4)), so the least squares estimator is not appropriate. Total least squares is another fitting method that is appropriate when there are errors in both the independent and the dependent variables [6]. Constrained total least squares is a further improvement over total least squares that accounts for the correlation between the errors in the two variable types. However, its formulation results in a non-convex optimization problem [6,7]. One can also employ pseudo-linear regression under a similar objective. Finally, partial least squares is a biased regression method that is used to alleviate large variances. We compare the performance of the least squares, total least squares, constrained total least squares, and partial least squares methods with respect to different noise levels, problem sizes, and data sizes through in-silico examples. The comparison of the various methods under different conditions will give valuable insight into this difficult estimation problem.
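As a rough illustration of the first two estimators in this comparison, the sketch below contrasts ordinary least squares with the classical SVD-based total least squares solution on a small in-silico example. It is not the benchmark of this study: the function names, the four-gene network, and the noise levels are hypothetical, the measurement error is additive only, and constrained total least squares and partial least squares are not implemented. The classical total least squares solution used here also ignores the serial correlation of the errors.

```python
import numpy as np

def ols_estimate(Y0, Y1):
    """Least squares: minimise ||Y1 - A @ Y0||_F, treating Y0 as error-free."""
    # lstsq solves Y0.T @ A.T ~= Y1.T, so the result is transposed back.
    return np.linalg.lstsq(Y0.T, Y1.T, rcond=None)[0].T

def tls_estimate(Y0, Y1):
    """Classical total least squares via the SVD of the stacked data."""
    N = Y0.shape[0]
    # Right singular vectors of [Y0.T, Y1.T], a (samples, 2N) matrix.
    _, _, Vt = np.linalg.svd(np.hstack([Y0.T, Y1.T]), full_matrices=True)
    V = Vt.T
    V12, V22 = V[:N, N:], V[N:, N:]
    # TLS solution of Y0.T @ X ~= Y1.T is X = -V12 @ inv(V22); A = X.T.
    return -np.linalg.solve(V22.T, V12.T)

rng = np.random.default_rng(1)
N, T = 4, 60

# Hypothetical sparse, stable network (lower triangular, |eigenvalues| < 1).
A_true = np.diag([0.9, 0.7, 0.8, 0.6])
A_true[1, 0], A_true[2, 1], A_true[3, 2] = 0.4, -0.5, 0.3

# Simulate the true concentrations, then add measurement error.
X = np.empty((N, T))
X[:, 0] = 1.0
for t in range(T - 1):
    X[:, t + 1] = A_true @ X[:, t] + rng.normal(0.0, 0.1, size=N)
Y = X + rng.normal(0.0, 0.05, size=X.shape)

Y0, Y1 = Y[:, :-1], Y[:, 1:]   # predictors and responses
for name, estimator in [("OLS", ols_estimate), ("TLS", tls_estimate)]:
    err = np.linalg.norm(estimator(Y0, Y1) - A_true)
    print(f"{name}: Frobenius error = {err:.3f}")
```

Repeating such a run over many noise realisations, noise levels, and data lengths is one way to organise an in-silico comparison; a single run like this one says little about which estimator is better in general.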

REFERENCES

[1] Driscoll, M. E., Gardner, T. S., Identification and control of gene networks in living organisms via supervised and unsupervised learning. Journal of Process Control 16 (2006), 303-311.

[2] Sun, N., Carroll, R. J., Zhao, H., Bayesian error analysis model for reconstructing transcriptional regulatory networks. PNAS 103 (21) (2006), 7988-7993.

[3] Ernst, J., Vainas, O., Harbison, C. T., Simon, I., Bar-Joseph, Z., Reconstructing dynamic regulatory maps. Molecular Systems Biology 3 (74) (2007), 1-13.

[4] Ideker, T., Thorsson, V., Siegel, A. F., Hood, L. E., Testing for differentially-expressed genes by maximum-likelihood analysis of microarray data. Journal of Computational Biology 2 (2005), 65-88.

[5] Gardner, T. S., Faith, J. J., Reverse-engineering transcription control networks. Physics of Life Reviews 2 (2005), 65-88.

[6] Bansal, M., Della Gatta, G., di Bernardo, D., Inference of gene regulatory networks and compound mode of action from time course gene expression profiles. Bioinformatics 22 (2006), 815-822.

[7] Huffel, S. V. (1991). The total least squares problem: computational aspects and analysis, Society for Industrial and Applied Mathematics, Philadelphia.

[8] Kim, J., Bates, D. G., Postlethwaite, I., Harrison, P., and Cho, K. (2007). BMC Bioinformatics, 8, 8.