American Journal of Bioinformatics Research
p-ISSN: 2167-6992 e-ISSN: 2167-6976
2012; 2(5): 68-78
doi: 10.5923/j.bioinformatics.20120205.01
Hasin Afzal Ahmed 1, Priyakshi Mahanta 1, Dhruba Kumar Bhattacharyya 1, Jugal Kumar Kalita 2
1Department of Comp. Sc. and Engg. ,TezpurUniversity, Napaam, 784028, India
2Department of Comp. Sc., University of Colorado, Colorado Springs, USA
Correspondence to: Dhruba Kumar Bhattacharyya , Department of Comp. Sc. and Engg. ,TezpurUniversity, Napaam, 784028, India.
| Email: | ![]() |
Copyright © 2012 Scientific & Academic Publishing. All Rights Reserved.
DNA microarray technology has revolutionized biological and medical research by enabling biologists to measure expression levels of thousands of genes in a single experiment. Different computational techniques have been proposed to extract important biological information from the massive amount of gene expression data generated by DNA microarray technology. This paper presents a top down hierarchical clustering algorithm that produces a tree of genes called GERC tree (GERC stands for Gene Expression Recursive Clustering) along with the generated clusters. GERC tree is an ample resource of biological information about the genes in an expression dataset. Unlike dendrogram, a GERC tree is not a binary tree. Genes in a leaf node of GERC tree represent a cluster. The clustering method was used with real-life datasets and the proposed method has been found satisfactory in terms of homogeneity, p value and z-score.
Keywords: Hierarchical Clustering, Divisive Clustering, Mean Squared Residue, Gene Expression Data

Mean squared residue of a subspace cluster is computed as,
![]() | Figure 1. Visual interpretation of Mean Squared Residueo |
Where amean is the mean of all the elements of g1 and bmean is the mean of all the elements of g2. MRD of the gene pair g1 and g2 with respect to a subspace of conditions λ can be computes as,
Following definitions and theorems provide the theoretical basis and soundness of the proposed measure based on [12].Definition 1: Coherent genes: Two genes are called coherent if similarity between the two genes is more than a given threshold in terms of a particular proximity measure.Definition 2: Expression pattern: The expression pattern of a gene is defined as the discretized form of the gene expression values. Two genes are said to have similar expression pattern if their discretized values are same. Mathematically, two genes gi and gj have similar expression pattern if
Where n is the total number of conditions.Definition 3: Constant valued genes: For two genes gi=(a1, a2,...., an) and gj=(b1, b2,..., bn) if a1=a2=...=an=b1=b2=...=bn, then the genes are called constant valued genes.Definition 4: Constant row genes: For two genes gi=(a1, a2,...., an) and gj=(b1, b2,..., bn), if a1=a2=...=an and b1=b2=...=bn, then the genes are called constant row genes.Definition 5: Constant column genes: For two genes gi=(a1, a2,..., an) and gj=(b1, b2,..., bn), if a1=b1, a2=b2, ..., an=bn, then the genes are called constant column genes.Definition 6: Additive genes: For two genes gi=(a1, a2,...,an) and gj=(b1, b2,...,bn) if b1=a1+d, b2=a2+d, ..., bn=an+d, where d is an additive constant, then the genes are called additive genes.Properties of MRDThe MRD measure is capable of detecting four types of coherence (a) Coherence among constant valued genes (b) Coherence among constant row genes (c) Coherence among constant column genes (d) Coherence among additive genes. Next we present some of the properties of MRD.Theorem 1. MRD of two constant valued genes is always zero.Proof:Let the two genes be g1=(a1, a2,..., an) and g2=(b1,b2,..., bn). Since the two genes are constant valued soa1=a2=...=an=b1=b2=...=bn=x (say).Now mean of the two genes will be, amean=bmean=x.
Theorem 2. MRD of two constant row genes is always zero.Proof:Let the two genes be g1=(a1, a2,..., an) and g2=(b1,b2,..., bn). Since these are constant row genes, so a1=a2=a3=...=an=x (say) and b1=b2=b3=...=bn=y (say).
Theorem 3. MRD of two constant column genes is always zero.Proof:Let the two genes be g1=(a1, a2,..., an) and g2=(b1,b2,..., bn). Since these are constant column genes, so a1=b1, a2=b2,..., an=bn.Now mean of two genes will be amean=bmean=m(say).
Theorem 4. MRD of two additive genes is always zero.Proof:Let the two genes be g1=(a1, a2,..., an) and g2=(b1,b2,..., bn). Since the genes are additive, so bi=ai+d, for i=1,2,3,...,n. Here d is an additive constant.

![]() | Figure 2. Structure of GERC tree |
![]() | Figure 3. Dendrogram versus GERC tree |
|
|
In GERC, the reference gene is used as an input parameter. The algorithm tries to find the initial cluster to which this gene potentially belongs to. While finding this initial cluster it tries to locate all genes which have similar expression pattern with the reference gene in terms of at least (Nc/2)+1 number of conditions. If we want to explore the entire dataset we can use each of the available genes or a set of stochastically selected genes as reference genes. Once we discover the initial clusters we move to step 2 of the algorithm. In the second step, we create a single node first with all genes of the initial cluster in it and then iteratively cluster each node(having genes more than preferred volume umber of genes) of the tree until all the processing nodes in the tree have less than or equal to preferred volume number of genes. The input parameter step down ratio is actually used to reduce the value of MRD threshold as towards leaf nodes of the tree the similarity among genes increases and require a smaller value of MRD threshold. e.g. if the MRD threshold provided by user is 1 and step down ratio is .5, then the first node will use 1 as its MRD threshold while clustering. On successful division of the node to its children, the successive level nodes will use thresholds .5(1*.5),.25(.5*.5) …. and so on. Finally the sub-trees (one sub-tree in case of single reference gene) generated from the initial clusters are combined to form the GERC tree. The roots of these subtrees are made children of a single node that contains the set of entire genes and this node becomes the root of generated GERC tree.
|
|
|
|
|
![]() | Figure 4. Tuning the value of δ for Dataset 3 |
|
|
![]() | Figure 5. Tree generated from an initial cluster for yeast Dataset 2 |
![]() | Figure 6. Tree generated from an initial cluster for yeast Dataset 3 |
![]() | Figure 7. Tree generated from an initial cluster for yeast Dataset 4 |
| [1] | M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein, “Cluster analysis and display of genome-wide expression patterns,” Proceedings of Natl. Acad. Sci. U.S.A., vol. 95, pp. 14 863–14 868, 1998. |
| [2] | J. Dopazo and J. Carazo, “Phylogenetic reconstruction usingan unsupervised neural network that adopts the topology of aphylogenetic tree,” J Mol Eval, vol. 44, pp. 226–233, 2002. |
| [3] | A. Bhattacharya and R. K. De, “Divisive correlation clutering algorithm (dcca) for grouping of genes,” Bioinformatics, vol. 24, no. 11, pp. 1359–1366, June 2008. |
| [4] | D. Jiang, J. Pei, and A. Zhang, “Dhc:a density-based hierarchical clustering method for time series gene expression data,” Proceedings of IEEE International Symposium on Bioinformatic and Bioengineering, pp. 393-400, 2003. |
| [5] | F. Luo, L. Khan, F. Bastani, I. L. Yen and J. Zhou, “A dynamically growing self-organizing tree (dgsot) for hierarchical clustering gene expression profiles,” Bioinformatics, vol. 20, no. 16, pp. 2605–2617, 2004. |
| [6] | K. Rose, “Deterministic annealing for clustering, compression, classification, regression, and related optimization prob-lems,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2210–2239, 2002. |
| [7] | H. A. Ahmed, P. Mahanta, D. K. Bhattacharyya and J. Kalita, “GERC: Tree based clustering for gene expression data,” Proceedings of 2011 IEEE 11th International Conference on Bioinformatics and Bioengineering, pp. 299-302, 2011. |
| [8] | H. Kim, G. H. Golub and H. Park, “Missing value estimation for DNA microarray gene expression data: local least squares imputation,” Bioinformatics, vol. 21, no. 2, pp. 187–198, 2005. |
| [9] | R. Das, J. Kalita and D. K. Bhattacharyya, “A new approachfor clustering gene expression time series data,” Int. J. Bioinformatics Res. Appl., vol. 5, pp. 310–328, 2009. |
| [10] | S. C. Madeira and A. L. Oliveira, “Biclustering algorithm for biological data analysis: A survey,” IEEE/ACM Trans. Comput. Biol. Bioinformatics, vol. 1, pp. 24–45, 2004. |
| [11] | Y. Cheng and G. M. Church, “Biclustering of expression data,” Proceedings of the eighth International Conference on Intelligent Systems for Molecular Biology, vol. 1, pp. 93–103, 2000. |
| [12] | A. Mukhopadhyay, U. Maulik, S. Bandyopadhyay, “A novel coherence measure for discovering scaling biclusters from gene expression data,” J. Bioinformatics and Computational Biology, vol. 7, no. 5, pp. 853-868, 2009. |
| [13] | R. J. Cho, M. J. Campbell, E. A. Winzeler, L. Steinmetz, A. Conway, L. Wodicka, T. G. Wolfsberg, A. E. Gabrielian, D. Landsman, D. J. Lockhart, and R. W. Davis, “A genome-wide transcriptional analysis of the mitotic cell cycle,” Molecular cell, vol. 2, no. 1, pp. 65–73, 1998. |
| [14] | J. L. DeRisi, V. R. Iyer and P. O. Brown, “Exploring the metabolic and genetic control of gene expression on a genomic scale,” Science, vol. 278, no. 5338, pp. 680-686, 1998. |
| [15] | P. T. Spellman, G. Sherlock, M. Q. Zhang, V. R. Iyer, K. Anders, M. B. Eisen, P. O. Brown, D. Botstein and B. Futcher, “Comprehensive identification of cell cycle–regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization,” Molecular biology of the cell, vol. 9, no. 12, pp. 3273–3297, 1998. |
| [16] | S. Chu, J. DeRisi, M. Eisen, J. Mulholland, D. Botstein, P. O. Brown and I. Herskowitz, “The transcriptional program of sporulation in budding yeast,” Science, vol. 282, no. 5389, pp. 699–705, 1998. |
| [17] | I. Ulitsky, A. Maron-Katz, S. Shavit, D. Sagir, C. Linhart, R. Elkon, A. Tanay, R. Sharan, Y. Shiloh and R. Shamir, “Expander: from expression microarrays to networks and functions,” nature protocols, vol. 5, no. 2, pp. 303–322, 2010. |
| [18] | R. Sharan and R. Shamir, “Click: A clustering algorithm with applications to gene expression analysis,” Proceedings of the eighth International Conference on Intelligent Systems for Molecular Biology, vol. 8, pp. 307–316, 2000. |
| [19] | G. F. Berriz, O. D. King, B. Bryant, C. Sander, and F. P. Roth,“Characterizing gene sets with FuncAssociate.” Bioinformatics, vol. 19, no. 18, pp. 2502–2504, 2003. |
| [20] | F. Gibbons, F. Roth, “Judging the quality of gene expression-based clustering methods using gene annotation,” Genome Research, vol. 12 , pp. 1574-1581, 2002. |
| [21] | J. A. Hartigan and M. A. Wong, “A k-means clustering algorithm,” JSTOR: Applied Statistics, vol. 28, pp. 100–108, 1979. |
| [22] | T. Kohonen, “The self-organizing map,” Proceedings of the IEEE, vol. 78, no. 9, pp. 1464-1480, 1990. |
| [23] | S. Sarmah, R. D. Sarmah, D. K. Bhattacharyya, “An effective density-based hierarchical clustering technique to identify coherent patterns from gene expression data,” Proceedings of the 15th Pacific Asia conference on Advances in knowledge discovery and data mining,” pp. 225-236, 2011. |