American Journal of Bioinformatics Research
p-ISSN: 2167-6992 e-ISSN: 2167-6976
2016; 6(1): 19-25
doi:10.5923/j.bioinformatics.20160601.03

Md. Siraj-Ud-Doulah, Md. Bipul Hossen
Department of Statistics, Begum Rokeya University, Rangpur, Bangladesh
Correspondence to: Md. Siraj-Ud-Doulah, Department of Statistics, Begum Rokeya University, Rangpur, Bangladesh.
Copyright © 2016 Scientific & Academic Publishing. All Rights Reserved.
This work is licensed under the Creative Commons Attribution International License (CC BY).
http://creativecommons.org/licenses/by/4.0/

DNA microarray experiments have emerged as one of the most popular tools for the large-scale analysis of gene expression. The challenge to the biologist is to apply appropriate statistical techniques to determine which changes are relevant. One such tool is clustering: a method for discerning hidden patterns in data without supervision and in the absence of any prior knowledge. Clustering is a popular method for the analysis of microarray data, but it poses several challenges. The results obtained from common clustering algorithms are not consistent, and even with multiple runs of different algorithms a further validation step is required. Because well-defined class labels are absent and the number of clusters is unknown, the unsupervised problem of finding an optimal clustering is hard. Obtaining a consensus of judiciously chosen clusterings not only provides stable results but also lends a high level of confidence in their quality. Several base algorithm runs are used to generate clusterings, and a co-association matrix over pairs of points is obtained using a configurable majority criterion. Synthetic as well as real-world datasets are used in the experiments, and the results are compared using various internal and external validity measures. In this paper, the results obtained from consensus clustering are consistent and more accurate than those from the base algorithms. The consensus algorithm can also identify the number of clusters and detect outliers.
Keywords: Consensus Clustering, Linkage, Microarray, Outliers, Validation Indexes
Cite this paper: Md. Siraj-Ud-Doulah, Md. Bipul Hossen, Performance Evaluation of Clustering Methods in Microarray Data, American Journal of Bioinformatics Research, Vol. 6 No. 1, 2016, pp. 19-25. doi: 10.5923/j.bioinformatics.20160601.03.
Pearson Correlation Coefficient (PCC)
The Pearson Correlation Coefficient measures how closely two variables follow the best-fitting line obtained by minimizing the sum of squared deviations. For two variables it is defined as the ratio of the covariance of the variables to the product of their standard deviations [7].
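For reference, the standard sample form of the coefficient (the rendered formula is not reproduced in this version of the text) is

\[ r_{xy} = \frac{\operatorname{cov}(x, y)}{s_x\, s_y} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^{2}}\;\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^{2}}}. \]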
Spearman Rank Correlation Coefficient (SRCC)
The Spearman Rank Correlation Coefficient is a nonparametric measure of the dependence between two variables. It is similar to the Pearson correlation coefficient except that it operates on the rank order of the variables; it is therefore less sensitive to outliers and does not rely on assumptions about the distribution of the data [3].
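When all ranks are distinct, the coefficient takes the familiar closed form (shown here for reference; the original formula image is not reproduced)

\[ \rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^{2}}{n(n^{2}-1)}, \]

where d_i is the difference between the two ranks assigned to observation i.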
Kendall tau Rank Correlation Coefficient (KTRCC)
The Kendall tau Rank Correlation Coefficient is another nonparametric measure of the dependence between variables, with an associated hypothesis test. It is more intuitive and easier to calculate than the Spearman Rank Correlation Coefficient. A pair of data points is considered concordant if the values increase (or decrease) together in all dimensions [9]. If the value of one point is higher in one dimension while that of the other point is higher in another dimension, the pair is called discordant.
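In terms of concordant and discordant pairs, the coefficient can be written (for reference; not reproduced in the source layout) as

\[ \tau = \frac{n_c - n_d}{\tfrac{1}{2}\, n(n-1)}, \]

where n_c and n_d are the numbers of concordant and discordant pairs among the n(n-1)/2 possible pairs.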

Table: Linkage Rules
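The linkage-rule table itself is not reproduced here. Assuming it listed the standard agglomerative rules (an assumption, not taken from the lost table), the usual definitions of the distance between two clusters A and B are

\[ d_{\text{single}}(A,B) = \min_{a \in A,\, b \in B} d(a,b), \qquad d_{\text{complete}}(A,B) = \max_{a \in A,\, b \in B} d(a,b), \qquad d_{\text{average}}(A,B) = \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} d(a,b). \]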

d(k, j) = distance between clusters k and j
Δ(l) = intracluster distance (diameter) of cluster l
K = number of clusters

Silhouette Width
For any element i, the Silhouette value measures by how much the average between-cluster distance exceeds the average within-cluster distance [3]:

\[ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \]

where
a(i) = average distance of element i to the other elements in the same cluster
b(i) = average distance of element i to the elements in its nearest neighboring cluster

Hubert Gamma Statistic
Hubert Γ is defined [14] as

\[ \Gamma = \frac{2}{n(n-1)} \sum_{i<k} P(i,k)\, Q(i,k), \]

where
P(i, k) = distance between elements i and k
Q(i, k) = distance between the clusters to which elements i and k belong (represented by their centroids)

Entropy
Assuming that a point has equal probability of belonging to any cluster, the entropy of a clustering is defined as [14]:

\[ H = -\sum_{k=1}^{K} p_k \log p_k, \qquad p_k = \frac{n_k}{n}, \]

where n_k is the number of points in cluster k, n is the total number of points, and K = number of clusters.

Figure 1. Linkage Rules (Distance measures)
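As a minimal, illustrative sketch (not the authors' code), the Hubert Γ and entropy definitions above can be computed as follows; the function names, and the use of scikit-learn's silhouette_score for the Silhouette Width, are assumptions made for illustration only.

```python
# Illustrative sketch (not the authors' code) of the internal indices defined above.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import silhouette_score   # average Silhouette Width

def hubert_gamma(X, labels):
    """Raw Hubert Gamma: average over pairs of P(i,k) * Q(i,k)."""
    P = squareform(pdist(X))                                   # element-to-element distances
    uniq = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in uniq])
    idx = np.searchsorted(uniq, labels)                        # map each element to its cluster index
    Q = squareform(pdist(centroids))[np.ix_(idx, idx)]         # centroid distance for each pair
    iu = np.triu_indices(len(X), k=1)                          # count each pair once
    return (P[iu] * Q[iu]).mean()

def clustering_entropy(labels):
    """Entropy of the cluster-size distribution: -sum p_k log p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

# Example usage on a clustering result (X, labels):
#   sw = silhouette_score(X, labels); hg = hubert_gamma(X, labels); H = clustering_entropy(labels)
```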
Figure 2. Clustering Algorithms (Validation Indexes)
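The paper compares results using internal as well as external validity measures. As one illustration (not necessarily the specific measure used by the authors), an external index such as the adjusted Rand index can be computed with scikit-learn when reference class labels are known:

```python
# Illustrative only: an external validity measure computed against known class labels.
from sklearn.metrics import adjusted_rand_score

true_labels  = [0, 0, 1, 1, 2, 2]     # hypothetical reference partition
found_labels = [1, 1, 0, 0, 2, 2]     # hypothetical clustering result
print(adjusted_rand_score(true_labels, found_labels))   # 1.0: identical up to label renaming
```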
Figure 3. K-Means clustering (k=4 & k=5)
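For orientation only, a minimal scikit-learn sketch of running K-Means with k = 4 and k = 5, as compared in Figure 3; the data matrix, seed, and scoring below are placeholders rather than the paper's actual settings.

```python
# Illustrative sketch of the comparison in Figure 3; the data matrix is a random
# stand-in for a samples-by-genes expression matrix, not the paper's data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(200, 50)
for k in (4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}: average silhouette width = {silhouette_score(X, labels):.3f}")
```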
Figure 4. Consensus Clustering
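Following the description in the abstract (several base runs, a co-association matrix over pairs of points, and a configurable majority criterion), the sketch below outlines one way such a consensus can be assembled in Python. The parameter values and the helper name consensus_labels are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the co-association (evidence accumulation) idea described in the
# abstract; parameters and helper names are illustrative, not the authors' procedure.
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def consensus_labels(X, n_runs=20, k_range=(3, 4, 5), majority=0.5):
    """Cluster the co-association matrix built from several base K-Means runs."""
    n = len(X)
    coassoc = np.zeros((n, n))
    rng = np.random.default_rng(0)
    for r in range(n_runs):                                   # base clusterings with varying k
        k = int(rng.choice(k_range))
        labels = KMeans(n_clusters=k, n_init=5, random_state=r).fit_predict(X)
        coassoc += (labels[:, None] == labels[None, :])
    coassoc /= n_runs                                         # fraction of runs each pair co-clusters
    dist = 1.0 - coassoc                                      # turn agreement into a distance
    Z = linkage(squareform(dist, checks=False), method='average')
    # merge only pairs that agree in at least a majority of runs
    return fcluster(Z, t=1.0 - majority, criterion='distance')

# Example usage:  labels = consensus_labels(np.random.rand(100, 20))
```

With this construction, points that never co-cluster with any group in at least the majority fraction of runs end up in small or singleton clusters, which is one simple way to surface the outliers that the consensus approach is said to detect.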