American Journal of Bioinformatics Research
p-ISSN: 2167-6992 e-ISSN: 2167-6976
2012; 2(4): 40-46
doi: 10.5923/j.bioinformatics.20120204.02
Suvendu Kanungo 1, Gadadhar Sahoo 2, Manoj Madhava Gore 3
1Department of Computer Sc., BIT Mesra (Allahabad Campus), Allahabad, 211010, India
2Department of IT, BIT Mesra, Ranchi, 835215, India
3Department of Computer Sc. & Engg., MNNIT, Allahabad, 211004, India
Correspondence to: Suvendu Kanungo , Department of Computer Sc., BIT Mesra (Allahabad Campus), Allahabad, 211010, India.
| Email: | ![]() |
Copyright © 2012 Scientific & Academic Publishing. All Rights Reserved.
Biclustering is a vital data mining tool which is commonly employed on microarray data sets for analysis task in bioinformatics research and medical applications. There has been extensive research on biclustering of gene expression data arising from microarray experiment. This technique is an important analysis tool in gene expression measurement, when some genes have multiple functions and experimental conditions are diverse. In this paper, we introduce a new framework for biclustering of gene expression data. The basis of this framework is the construction of a range bipartite graph for the representation of 2-dimensional gene expression data. We have constructed this range bipartite graph by partitioning the set of experimental conditions into two disjoint sets. The key benefit of this representation is that, it leads to a compact representation of all similar value ranges between experimental conditions. Based on this problem formulation, an efficient algorithm is proposed that searches for constrained maximal cliques in this range bipartite graph, in order to extract a set of biclusters. Our technique is scalable to practical gene expression data and can produce different types of biclusters amid noise. The experimental evaluation of this technique also reveals its accuracy and effectiveness with respect to noise handling and execution time in comparison to other similar techniques.
Keywords: Microarray, Biclustering, Gene Expression Data, Bipartite Graph
, where n and m are the number of rows and columns of the input data matrix and d is the average degree of overlap among biclusters, which is slower than our approach.Wang and Liu[8] proposed RMSBE, which can identify optimal square biclusters with the maximum similarity score. This method performs multiple scans of the data matrix in order to compute similarity score, reference gene identification and bicluster identification. The time complexity of this technique is
where n is number of rows and m is number of columns. Due to this cubic nature of complexity, it is not feasible for very high dimensional data. Prelic et al.[9] proposed BiMax, which can identify constant biclusters. This method discretize the input expression matrix into a binary matrix based on a threshold value. Therefore it is difficult to identify coherent biclusters.Bergmann et al.[10] proposed the iterative signature algorithm (ISA) that uses gene signatures and condition signatures in order to extract biclusters with both up and down-regulated expression values. They identify several transcription modules (biclusters) by executing the algorithm on reference gene sets. The reference gene sets needs to be carefully selected for extraction of good quality biclusters.Zhao and Zaki[11] proposed Tricluster, for mining coherent clusters in 3-dimensional gene expression data sets. They construct a range multigraph and then searches for constrained maximal cliques in this multigraph, in order to extract a set of biclusters. However, they do not address the issue of noisy data, where as our approach can effectively handle noisy data.
be a set of n genes and
be a set of m experimental conditions. Microarray data-set is a real valued
expression matrix
where
,
and each entry
corresponds to the logarithm of the relative abundance of mRNA of a gene under a specific experimental condition
. A bicluster corresponds to a sub matrix that exhibits some coherent tendency. Let B be a sub matrix of dataset D i.e Bicluster
where
and
, provided certain conditions of homogeneity are satisfied. We define the volume or size of a bicluster B as the number of elements
, such that
and
. Let S be the set of all biclusters that satisfy the given homogeneity conditions, then
is called a maximal bicluster iff there doesn’t exist another bicluster
such that
Let
be any arbitrary sub matrix of B.B will be a valid bicluster iff it is a maximal bicluster satisfying the following conditions:
between two specified column values for a given row and
be the weight of the row for this specified two column values. Let us consider
and
be the weighted difference of two column values for a given row i and j respectively. We need that
where
is the multiple of maximum weight in the corresponding gene-set i.e
We also need that
and
, where
and
denote minimum cardinality thresholds for each dimension.We consider an edge as valid, when the weighted difference range for a condition pair satisfies the threshold value so that it generates a gene-set. In order to produce large enough clusters, the minimum size constraints i.e
, and
are imposed. B is a scaling bicluster if
and
and
where
is a constant multiplicative factor. B is a shifting bicluster iff
and
and
where
is constant additive factor. B is a constant bicluster if
or
B is a constant row bicluster if
or
. Similarly B is a constant column bicluster if
or
, where
is a typical value within the bicluster;
and
are adjustment for row and column respectively. B is overlap bicluster if
is the sum or product of the contribution of different biclusters to which they belong. Bipartite Graph: A graph
is called Bipartite if its vertex set V can be decomposed into two disjoint subsets
and
(i.e.
) such that every edge in E joins a vertex in
with a vertex in
.We consider weighted bipartite graph
with
where
denotes the weight of the edge
between vertices iand j.
, where k is a positive integer and
where 35 is the minimum sample size for large sample procedures[16]. The values within the intervals are then smoothed by interval means. We found that this way of partitioning is very effective and can deal with outliers efficiently.
and
, and the maximum weighted difference threshold
, let
and
be any two condition columns of D and let
be the weighted difference of the expression values of gene
in columns
and
such that
, where
. In order to incorporate the idea of mutual importance between two columns, we have computed the weight of all rows for specified column pairs. A difference range is defined as an interval of difference values
, with
. Let
be the set of genes, whose difference w.r.t. columns
and
lie in the given weighted difference range. A difference range is called valid iff
where
is the multiple of maximum row weight in the corresponding gene set. Normally, for microarray experiment data, genes and conditions are represented by
and
vertex sets respectively, and the edge weight
represents the response of
gene to
condition. However, in order to have a very compact representation, in this paper, we construct the weighted undirected bipartite graph by partitioning the condition set into two disjoint sets called upper layer
and lower layer
. The conditions that do not have any data values are not considered in the formation of disjoint sets. Here, each edge in the range bipartite graph has associated with it the rank and gene-set corresponding to the weighted difference range on that edge. Different bipartite graphs emerged for different threshold value, which is the multiple of maximum weight value in the corresponding gene-set. Consequently, we will have different types of biclusters. We have given priority to all valid edges that have large number of genes in the gene-set, by assigning rank
to these edges. Ranks have been assigned on the basis of total number of genes in the gene-set i.e.
The gene-sets having highest cardinality have been assigned rank 1, the second highest rank 2, the third highest rank 3 and so on. The inclusion and deletion of edges depends upon the value of
and order in which we process these ranks and conditions. Let
and
be the set all valid edges and gene-sets respectively, in ascending order of rank values.
|
and
in Table 1. Let
, which is the maximum row weight in the gene-set, and considering
then there is only one valid weighted difference range
and the corresponding gene-set in sorted order is
. Similarly, we compute weighted difference range for other experimental conditions. In this case, the number of valid ranges depends on the value of
. For the sorted difference values, we find all valid weighted difference ranges for all pair of columns
. Further, for simplicity, we have not considered rows with weight 0.![]() | Figure 1. Weighted difference range for c0 and c1 |
and
, for construction of bipartite graph is given in Algorithm I. From Table 1, a maximal weighted range bipartite graph is constructed (Figure 2). Let
be the set of columns with missing value in each row. Let
be the set of conditions such that
. We have taken weight of the edge as rank for the corresponding gene-set. This algorithm gives priority to the edges having large rank in order to compensate the deletion of few valid edges, while partitioning the condition set into two disjoint sets. If there is a tie among rank values, then we randomly select an edge and start constructing the bipartite graph. As we deal with noisy data, additive and multiplicative methods of finding clusters may not always lead to good results. Therefore, instead of comparing two column values independently[11], we have computed weight of each row for any two specified column values. We build bipartite graph model of data, after properly conditioning the input data. ![]() | Figure 2. Weighted Range Bipartite Graph G′ |
Input:
Output: Creation of two disjoint sets
and
Initialization: 
1. while
2. for each valid edge
do3. if
and
then4. insert
and
in
5. endif6.if
and
then7. insert
and
in
8. endif9. endfor10. endwhile
into two disjoint sets
and
Each edge of this graph is associated with a rank value and the corresponding gene-set. The algorithm for extraction of biclusters from this graph by employing depth first search technique is given in Algorithm II. This algorithm requires the value of
and the weighted bipartite graph
as its input parameter. The output of this algorithm is a set of biclusters
and the number of such biclusters depends upon the dataset.For example, let us consider the value of input parameters
and
. The algorithm starts searching the graph
at a valid edge having highest rank value. In this case, it starts with the valid edge
and get the corresponding reference cluster
. Then, other adjacent valid edges are explored that leads to the formation of final bicluster
. If a new edge does not have the sufficient number of genes and conditions, the reference bicluster is declared as the final bicluster, since this is maximal and satisfies all required conditions. Further, other valid edges are searched that leads to the formation of final bicluster
. Here both the biclusters satisfies the minimum threshold value.Algorithm II: Bicluster ExtractionInput:
Output: SInitialization:
Call
1.
2. if
then3. if
such that
then4. if
then5. remove
6.
7. endif8. endif9. endif10. foreach
do11.
12.
13. forall edges
do14. if
then15.
16. endif17. if
then18.
19. endif20. endfor21. endfor22. Return S
. Here, we have considered the experimental conditions as vertices for the bipartite graph. These vertices are partitioned into two disjoint sets on the basis of rank of valid edges. This way of constructing bipartite graph leads to deletion of some valid edges and as a result, fewer edges need to be processed in comparison to other graph approach for solving the same problem. Therefore, the running time can be significantly reduced for this representation of gene expression data using a bipartite graph. The bicluster extraction step depends on the value of input parameters and datasets. This step is most expensive as there can be exponential number of clusters. Since the experimental conditions are only considered as vertices, and some uninteresting edges may be pruned, the depth of the search in weighted range bipartite graph to extract biclusters is likely to be small in comparison to multigraph representation[11].
where the size of matrix and bicluster vary from 100 × 100 to 110 × 110 and from 10 × 10 to 20 × 20, respectively. The steps for evaluation of coherent biclusters are same as that of constant bicluster, but rows and columns in a bicluster have a 0.02 increasing trend. The parameter setting for different algorithms is shown in Table 2. In order to validate the accuracies of different algorithms, we apply the gene match score proposed by Zimmermann et al.[9]. Let
and
be two sets of biclusters. The match score of
with respect to
is given by:![]() | (1) |
represent the set of implanted biclusters and M be the set of output biclusters of an algorithm. The score
represents the degree of similarity between extracted biclusters and the implanted biclusters, where as the score
represents how well each of the true biclusters extracted by the bicluster algorithm.Figure 3 illustrates the experimental results with respect to the accuracy evaluation of constant biclusters. As per the experimental results, in case of high noise level for extraction of constant biclusters; BiRange along with cHawk, ISA and RMSBE shows high accuracies; BiMax and SAMBA perform moderately, and CC perform poorly. Figure 4 illustrates the experimental results with respect to the accuracy evaluation of coherent biclusters. For coherent biclusters, BiRange has a comparable accuracy with RMSBE and cHawk. Figure 5 illustrates the experimental results with respect to the accuracy evaluation of overlapped biclusters. In case of overlapped biclusters, BiRange is marginally affected by the overlap degree of the implanted biclusters.
|
![]() | Figure 3. Accuracy of Constant Biclusters |
![]() | Figure 4. Accuracy of Coherent Biclusters |
![]() | Figure 5. Accuracy of Overlapped Biclusters |
![]() | Figure 6. Proportion of GO Based Enriched Biclusters |
![]() | Figure 7. Performance of BiRange, cHawk and RMSBE |