International Journal of Biophysics
2012; 2(2): 18-25
doi: 10.5923/j.biophysics.20120202.02
Enrique Hernández-Lemus1, 2
1Computational Genomics Department, National Institute of Genomic Medicine, México City, 14610, México
2Complexity in Systems Biology, Center for Complexity Sciences, National Autonomous University of México, Mexico City, 04510, México
Correspondence to: Enrique Hernández-Lemus, Computational Genomics Department, National Institute of Genomic Medicine, México City, 14610, México.
Copyright © 2012 Scientific & Academic Publishing. All Rights Reserved.
Whole genome transcriptional regulation involves an enormous number of physicochemical processes responsible for phenotypic variability and organismal function. The actual mechanisms of regulation are only partially understood. In this context, an extremely important open problem is the probabilistic inference of gene regulatory networks. A plethora of methods and algorithms exists, many of them inspired by statistical mechanics and grounded in information theory. However, an important shortcoming of most of these methods, when it comes to deconvolving the actual functional structure of gene regulatory networks, lies in the presence of indirect interactions. We present a proposal to discover and assess such indirect interactions within the framework of information theory by means of the data processing inequality. We also present some examples of the applicability of the method in several instances in the field of functional genomics.
Keywords: Gene Regulatory Networks, Inference And Assessment, Data Processing Inequality, Information Theory
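The joint probability distribution (JPD) for the expression levels of all genes, $P(\{g_i\})$, $i = 1, \dots, N$, could be written as follows[8]:

$$P(\{g_i\}) = \frac{1}{Z} \exp\left[-H(\{g_i\})\right] \qquad (1)$$

$$H(\{g_i\}) = \sum_{i} \phi_i(g_i) + \sum_{i,j} \phi_{ij}(g_i, g_j) + \sum_{i,j,k} \phi_{ijk}(g_i, g_j, g_k) + \dots \qquad (2)$$

Here $N$ is the number of genes, $Z$ is a normalization factor (the partition function) and the $\phi$'s are interaction potentials. A set of variables (genes) $\{g_i\}$ interacts with each other if and only if the potential $\phi$ between such set of variables is non-zero. The relative contribution of $\phi$ is taken as proportional to the strength of the interaction between this set.

Equation 2 does not define the potentials uniquely; thus, additional constraints should be provided in order to avoid ambiguity. A usual approach is to specify the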
$\phi$'s using maximum entropy (MaxEnt) approximations consistent with the available information on the system in the form of marginals. In the case of the gene network inference problem, the use of marginals is closely related to a class of methods commonly termed hidden Markov models (HMMs)[1]. As in the case of HMMs, the rationale behind marginals lies in recognizing that, even though some priors are given, there remains a (probably quite large) set of unknown parameters that may affect the inference process and should be taken into account, even if by an indirect treatment. Hidden Markov models and MaxEnt approaches differ in the marginalizing procedure: in HMMs the hidden states take the place of the unknown variables, whereas in MaxEnt approximations these are marginalized instead. A common way to do so is to consider that interaction potentials (already marginalized or, to use the language of statistical physics, coarse-grained) are in some sense equivalent to correlation measures. To be more precise, two highly correlated genes (say, in their mRNA expression levels) are believed to be physically interacting (by means of some still undisclosed, but probably physically complex, mechanisms) in the transcriptional regulation network[9]. Hence, the interaction potentials $\phi_{ij}$ are approximated by correlation measures, say the mutual information $I(g_i; g_j)$.
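As an illustration of how mutual information can play the role of a pairwise correlation measure between expression profiles, the following is a minimal sketch that assumes a simple histogram (plug-in) estimator. The function names `mutual_information` and `mi_matrix`, the number of bins and the (genes x samples) array layout are hypothetical choices made for this example; they are not the estimator used in reference[8].

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Plug-in estimate of I(X;Y), in nats, from two expression profiles."""
    # Joint histogram of the two profiles, normalised to a joint distribution.
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = counts / counts.sum()
    # Marginals are obtained by summing the joint over the other variable.
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    # Sum p(x,y) * log[p(x,y) / (p(x) p(y))] over non-empty cells only.
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])))

def mi_matrix(expr, bins=8):
    """Symmetric matrix of pairwise MI for a (genes x samples) array."""
    n = expr.shape[0]
    mi = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            mi[i, j] = mi[j, i] = mutual_information(expr[i], expr[j], bins)
    return mi
```

Histogram estimators are biased for small samples, so in practice the bin count (or an alternative estimator) would have to be chosen with the sample size in mind.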
Consider two random variables, $X$ and $Y$, whose mutual information is $I(X;Y)$. Now consider a third random variable, $Z$, that is a (probabilistic) function of $Y$ only. It can be shown that $P(Z|X,Y) = P(Z|Y)$, which in turn implies that $P(X|Y,Z) = P(X|Y)$, as follows from Bayes' theorem.

The DPI simply states that $Z$ cannot have more information about $X$ than $Y$ has about $X$; that is, $I(X;Z) \le I(X;Y)$. This inequality, which is a property of Shannon's information, can be proved as follows:

$$I(X;Z) = H(X) - H(X|Z) \le H(X) - H(X|Y,Z) = H(X) - H(X|Y) = I(X;Y)$$

The inequality follows because conditioning on an extra variable (in this case $Y$ as well as $Z$) can only decrease entropy (in a similar way to what occurs in statistical physics: adding constraints to a thermal system can only decrease its thermodynamic entropy, whereas removing constraints, say by allowing an irreversible process to take place, can only increase it), and the second to last equality follows because $P(X|Y,Z) = P(X|Y)$[8,12]. More formally,

Definition 1 Three random variables
$X$, $Y$ and $Z$ are said to form a Markov chain (in that order), denoted $X \rightarrow Y \rightarrow Z$, if the conditional distribution of $Z$ depends only on $Y$ and is independent of $X$. That is, if we know $Y$, knowing $X$ does not tell us any more about $Z$ than if we know only $Y$. If $X$, $Y$ and $Z$ form a Markov chain, then the joint probability distribution can be written:

$$P(X, Y, Z) = P(X)\,P(Y|X)\,P(Z|Y) \qquad (3)$$
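As a numerical illustration of the Markov chain factorization and of the DPI stated above, the following minimal sketch builds a joint distribution of the form $P(X)\,P(Y|X)\,P(Z|Y)$ from randomly drawn conditional tables and compares the resulting mutual informations. The alphabet sizes, the random seed and the helper names `mi_from_joint` and `markov_chain_joint` are arbitrary assumptions made only for this example.

```python
import numpy as np

def mi_from_joint(p_ab):
    """I(A;B), in nats, from a joint probability table p_ab[a, b]."""
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    mask = p_ab > 0
    return float(np.sum(p_ab[mask] * np.log(p_ab[mask] / (p_a @ p_b)[mask])))

def markov_chain_joint(p_x, p_y_given_x, p_z_given_y):
    """Joint P(x, y, z) = P(x) P(y|x) P(z|y) for a chain X -> Y -> Z."""
    # The axes of the returned array are ordered as [x, y, z].
    return np.einsum("x,xy,yz->xyz", p_x, p_y_given_x, p_z_given_y)

rng = np.random.default_rng(1)
p_x = rng.dirichlet(np.ones(3))                  # P(x)
p_y_given_x = rng.dirichlet(np.ones(4), size=3)  # row x holds P(y | x)
p_z_given_y = rng.dirichlet(np.ones(3), size=4)  # row y holds P(z | y)

p_xyz = markov_chain_joint(p_x, p_y_given_x, p_z_given_y)
i_xy = mi_from_joint(p_xyz.sum(axis=2))  # marginalise over z
i_xz = mi_from_joint(p_xyz.sum(axis=1))  # marginalise over y
i_yz = mi_from_joint(p_xyz.sum(axis=0))  # marginalise over x

print(f"I(X;Y)={i_xy:.4f}  I(Y;Z)={i_yz:.4f}  I(X;Z)={i_xz:.4f}")
```

For any tables drawn in this way, $I(X;Z)$ should not exceed either $I(X;Y)$ or $I(Y;Z)$, since the reversed sequence $Z \rightarrow Y \rightarrow X$ is also a Markov chain.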
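Theorem 1 (Data processing inequality) If $X$, $Y$ and $Z$ form a Markov chain $X \rightarrow Y \rightarrow Z$, then

$$I(X;Y) \ge I(X;Z) \qquad (4)$$

Proof. By the chain rule for mutual information, $I(X;Y,Z) = I(X;Z) + I(X;Y|Z) = I(X;Y) + I(X;Z|Y)$. By the Markov property, since $X$ and $Z$ are independent given $Y$, $I(X;Z|Y) = 0$; then, since $I(X;Y|Z) \ge 0$, we have: $I(X;Y) \ge I(X;Z)$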
Q.E.D.

In reference[8] the application of DPI has shown that if genes $g_1$ and $g_3$ interact only through a third gene, $g_2$, within a given GRN, we have that $I(g_1; g_3) \le \min[I(g_1; g_2), I(g_2; g_3)]$. Hence, the smallest of the three MIs can come from indirect interactions only, so the proposed algorithm examines each gene triplet for which all three MIs are greater than some threshold value $I_0$
and removes the edge with the smallest value. DPI is thus useful to quantify efficiently the dependencies among a large number of genes. The DPI algorithm eliminates those statistical dependencies that might be of an indirect nature, such as between two genes that are separated by intermediate steps in a transcriptional cascade. Such genes will very likely have non-linearly correlated expression profiles, which may result in high MI, and would otherwise be selected as candidate interacting genes.

In fields such as developmental biology and cancer genetics, there is a growing need to place the vast number of newly identified gene variants into well-ordered genetic and molecular pathways. This will require efficient methods to determine which genes interact directly and indirectly. In this sense, a methodology such as DPI-characterization will prove extremely useful. For instance, the role of transcriptional cascades in development is becoming evident. Well-known examples include the hierarchical interactions underlying hematopoiesis and adipogenesis in vertebrates and the ecdysone and segmentation gene pathways in Drosophila[25]. In such cases, “...gene expression in such cascades is predominantly controlled at the level of transcript initiation, and is based on interactions between sequence-specific transcription factors and their cis-acting response elements. Two types of regulatory relationships, direct and indirect, can be defined. Direct interactions occur independently of intermediary gene regulation but need not involve direct molecular contact between the regulator and its target gene promoter. Indirect interactions require the activation or repression of intermediary genes, the products of which act on the target gene in question....”[25].
This is precisely the scenario in which a methodology such as DPI-pruning becomes relevant to distinguish between these two different (but often indistinguishable) conditions, with the aim of discerning the actual functional mechanisms behind them. For instance, intron regulation of transcription has been elucidated. Introns are able to affect gene expression significantly, both in plants and in many other eukaryotes, in a variety of ways. Some introns may contain enhancer elements or other types of promoters, whereas others function by elevating mRNA accumulation through a process called intron-mediated enhancement (IME). The intron regions causing IME must be inside transcribed sequences, near the start of a gene and in their natural orientation, in order to increase expression. Detection of IME activity by sequencing is not easy; however, by observing DPI-curated networks we may be able to infer some candidate genes and perform deeper studies on this reduced set only.
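The triplet-wise pruning step described above (examine every gene triplet whose three MIs exceed the threshold and remove the weakest edge) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of reference[8]: the function name `dpi_prune`, the dense MI matrix input and the choice to mark edges first and remove them only at the end are assumptions of this sketch, and no tolerance parameter is included.

```python
import itertools
import numpy as np

def dpi_prune(mi, threshold=0.0):
    """Return a boolean adjacency matrix after triplet-wise DPI pruning.

    mi        : symmetric (genes x genes) matrix of mutual information values
    threshold : MI value that an edge must exceed to be considered at all
    """
    n = mi.shape[0]
    # Start from every pair whose MI exceeds the threshold.
    adj = mi > threshold
    np.fill_diagonal(adj, False)
    to_remove = np.zeros_like(adj)
    for i, j, k in itertools.combinations(range(n), 3):
        # Only triplets in which all three MIs pass the threshold are examined.
        if adj[i, j] and adj[j, k] and adj[i, k]:
            edges = [(i, j), (j, k), (i, k)]
            a, b = min(edges, key=lambda e: mi[e])
            to_remove[a, b] = to_remove[b, a] = True
    # Removal is deferred so that the result does not depend on triplet order.
    return adj & ~to_remove

# Hypothetical three-gene cascade: genes 0 and 2 are linked only through gene 1.
mi_example = np.array([[0.0, 0.8, 0.3],
                       [0.8, 0.0, 0.7],
                       [0.3, 0.7, 0.0]])
pruned = dpi_prune(mi_example, threshold=0.1)
```

The loop visits every triplet, so its cost grows with the cube of the number of genes; this is acceptable for a sketch but would need care at whole genome scale.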
Consider a random variable $X$ distributed according to some empirical distribution $f_\theta(x)$. A statistic $T(X)$ extracts some of the information in the observed sample $X$; by the DPI, $I(\theta; T(X)) \le I(\theta; X)$. In the cases in which equality holds, we call $T(X)$ a sufficient statistic for $\theta$. That is to say, a sufficient statistic for some distribution $f_\theta$ extracts all of the information within the data (samples) $X$ about the value of $\theta$.

Sufficiency can also be stated as a factorization condition:

$$f_\theta(x) = g_\theta(T(x))\,h(x) \qquad (5)$$

with $g_\theta$ and $h$ non-negative functions. We call equation 5 a factorization theorem[13] and it is a necessary and sufficient condition for sufficient statistics. If no such factorization exists for $f_\theta(x)$ (in the support under consideration), then $T(X)$ is not a sufficient statistic (in that support). Factorization theorems are important in minimal network estimation since they provide a somewhat independent way of assessing sufficient statistics, complementary to DPI inference.

With this in mind, we can see that DPI (via the sufficient statistics argument) may be useful to infer minimal networks, i.e. the smallest GRNs that are able to capture almost all of the information content of the correlation structure of the actual (larger) biological network.
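The sufficient statistics argument can be illustrated with a small worked example. The sketch below assumes a toy setting that is not taken from the paper: a parameter with two equally likely values, four independent Bernoulli draws as the sample, and two candidate statistics (the number of ones, which is sufficient for a Bernoulli parameter, and the first draw alone, which is not). All names and numerical values are hypothetical.

```python
import itertools
import numpy as np

def mi_from_joint(p_ab):
    """I(A;B), in nats, from a joint probability table p_ab[a, b]."""
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    mask = p_ab > 0
    return float(np.sum(p_ab[mask] * np.log(p_ab[mask] / (p_a @ p_b)[mask])))

thetas = np.array([0.2, 0.7])  # hypothetical parameter values
prior = np.array([0.5, 0.5])   # hypothetical prior over the parameter
n = 4                          # number of Bernoulli draws in one sample
samples = list(itertools.product([0, 1], repeat=n))

# Joint P(theta, x) over all binary sequences x of length n.
p_theta_x = np.array([
    [w * t ** sum(x) * (1 - t) ** (n - sum(x)) for x in samples]
    for t, w in zip(thetas, prior)
])

def joint_with_statistic(stat):
    """Joint P(theta, T(x)), pooling all sequences that share the same T(x)."""
    values = sorted({stat(x) for x in samples})
    p = np.zeros((len(thetas), len(values)))
    for col, x in enumerate(samples):
        p[:, values.index(stat(x))] += p_theta_x[:, col]
    return p

i_full = mi_from_joint(p_theta_x)                               # I(theta; X)
i_sum = mi_from_joint(joint_with_statistic(sum))                # T = number of ones
i_first = mi_from_joint(joint_with_statistic(lambda x: x[0]))   # T = first draw

print(f"I(theta;X)={i_full:.6f}  I(theta;sum)={i_sum:.6f}  I(theta;x1)={i_first:.6f}")
```

In this toy case the number of ones attains equality in the DPI, while the first draw alone loses information about the parameter, which is the distinction the factorization condition formalizes.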
, whereas the lower FDR-corrected p-value for network B is
. DPI thus improved p-value performance by almost two orders of magnitude. DPI assessment also revealed new significant biochemical pathways, some of the more important being: urokinase plasminogen activation and the related plasmin synthesis and activation; innate immune system; cell junction organization; and HNP1-4/CD4/Defensin signaling.

As we can see, global topological features pointing to greater modularity (hence robustness), clearer functional mechanisms related to inflammation and growth receptor signaling (two hallmark processes in cancer), as well as stronger statistics were attained after careful DPI-pruning of the network. This means that, at least in this case, the DPI methodology presents itself as an efficient tool for the analysis (both functional and modular) of biological networks.
, where $N$ is the number of genes. The problem of inference (usually consuming thousands of computation hours) has been addressed at the whole genome network level by constructing a 15,222-gene network of the plant Arabidopsis thaliana from 3,137 microarray experiments in 30 minutes on a 2,048-CPU IBM Blue Gene/L, and in 2 hours and 25 minutes on an 8-node Cell blade cluster[24].