American Journal of Biomedical Engineering
p-ISSN: 2163-1050 e-ISSN: 2163-1077
2012; 2(5): 206-211
doi: 10.5923/j.ajbe.20120205.03
Khalid Raza, Akhilesh Mishra
Department of Computer Science, Jamia Millia Islamia (Central University), New Delhi, 110025, India
Correspondence to: Khalid Raza, Department of Computer Science, Jamia Millia Islamia (Central University), New Delhi, 110025, India.
Copyright © 2012 Scientific & Academic Publishing. All Rights Reserved.
The high-throughput data generated by microarray experiments provide the complete set of genes expressed in a given cell or organism under particular conditions. The analysis of these enormous datasets has opened a new dimension for researchers. In this paper we describe a novel algorithm for microarray data analysis that focuses on identifying genes that are differentially expressed under particular internal or external conditions and that could be potential drug targets. The algorithm takes time-series gene expression data as input and identifies the genes that are differentially expressed. It applies standard statistics-based techniques such as the log transformation, mean, log-sigmoid function, and coefficient of variation; it does not use clustering analysis. The proposed algorithm has been implemented in Perl. Time-series gene expression data on the yeast Saccharomyces cerevisiae from the Stanford Microarray Database (SMD), consisting of 6154 genes, were used to validate the algorithm. The method extracted 48 of the 6154 genes. These genes are mostly responsible for the yeast's resistance to high temperature.
Keywords: Microarray Data Analysis, Gene Expression, Differentially Expressed Genes, Drug Target
Cite this paper: Khalid Raza, Akhilesh Mishra, "A Novel Anticlustering Filtering Algorithm for the Prediction of Genes as a Drug Target", American Journal of Biomedical Engineering, Vol. 2 No. 5, 2012, pp. 206-211. doi: 10.5923/j.ajbe.20120205.03.
One way to improve data discrimination is to transform the cy5/cy3 ratio by taking its base-2 logarithm. The transformation produces a more uniform distribution of the data and displays up-regulated and down-regulated genes more symmetrically, which makes them more comparable. To normalize the data further, we plot the log ratio of cy5/cy3 against the average log intensity. In this representation the data are roughly symmetrically distributed around the horizontal axis, so differentially expressed genes can be visualized more easily. This form of representation is called an 'intensity ratio plot'. Linear regression is used in all these instances. A non-linear regression may produce a better fit and help to eliminate bias for data that do not conform to a linear relationship owing to systematic sampling error. The most frequently used regression of this type is LOWESS (locally weighted scatter plot smoother) regression[9].

Step 2. Elimination of genes that fail to provide data in the majority of experiments. In this step we remove the rows corresponding to genes that were not expressed, or mostly not expressed, on any chip. In many cases the expression of some genes cannot be measured on the chip because of experimental problems: (i) wrong probing of the gene on the microarray chip, (ii) specialized genes that are expressed only in a specific cell or condition and are therefore not expressed in the cell under study, (iii) scanner problems in reading the fluorescence value of the gene in that region, and (iv) defects in the robotic probing machine that produces the microarray chip. It is not a hard and fast rule that a row must be eliminated whenever data are missing: the gene may genuinely not be expressed under that particular condition.
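As an illustration of the Step 1 transformation described above, the following minimal sketch computes, for each spot, the base-2 log ratio and the average log intensity that form the two axes of the intensity ratio plot. The function name `intensity_ratio_point` and the sample intensities are hypothetical, and the average log intensity is assumed here to be the mean of the two log2 channel intensities.

```python
import math

def intensity_ratio_point(cy5, cy3):
    """Return (average log2 intensity, log2 ratio) for one spot.

    Both intensities must be positive; math.log2 raises ValueError otherwise.
    """
    log_ratio = math.log2(cy5 / cy3)                       # vertical axis
    avg_log = 0.5 * (math.log2(cy5) + math.log2(cy3))      # horizontal axis (assumed definition)
    return avg_log, log_ratio

# Hypothetical raw intensities: 4-fold up, 4-fold down, unchanged.
spots = [(800.0, 200.0), (200.0, 800.0), (400.0, 400.0)]
points = [intensity_ratio_point(cy5, cy3) for cy5, cy3 in spots]

for avg_log, log_ratio in points:
    print(f"average log intensity = {avg_log:.3f}, log ratio = {log_ratio:+.3f}")
```

On the log scale a four-fold increase and a four-fold decrease sit at equal distances above and below zero, which is the symmetry the text refers to.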
In this algorithm, if 40 % or fewer of the values in a row are missing, the missing values are filled with zero, indicating that the gene is not expressed. If more than 40 % of the values in a row are missing, the row is removed from the main dataset and is not used in further analysis.

Step 3. Analysis of significance of data. In this step we check the significance of the data. The t-statistic is based on the assumption that the variability in these measurements follows a normal distribution, which means that the data contain some pattern that can be analyzed and interpreted as a result. Data that are highly random and carry no significance cannot be processed further.

Step 4. Replicate handling. In replicate handling we remove genes whose expression level has been recorded more than once in the gene expression data, so that each gene has only one entry. This removes redundancy from the dataset. Multiple entries may arise because a single probe is present at more than one position, because different genes coding for the same protein occupy different positions, or because of manual or machine error in detecting and recording the expression level of a gene. These redundancies increase the volume of data as well as the analysis time. This step is optional if we are sure that the data are mature and free of redundancy.

Step 5. Elimination of genes having less than a two-fold change in expression level. We eliminate genes that do not show considerable variation in expression level. In the dataset, a positive value means up-regulation of the cy5-labeled gene and a negative value means down-regulation of the cy5-labeled gene.
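The missing-value rule of Step 2 (fill with zero when at most 40 % of a row is missing, otherwise drop the row) can be sketched as follows. This is only an illustrative sketch, not the authors' Perl implementation: the function name `handle_missing`, the use of `None` to mark a missing measurement, and the example rows are all assumptions.

```python
def handle_missing(rows, max_missing_fraction=0.40):
    """Apply the 40 % rule to a dict mapping gene name -> list of values.

    Missing measurements are represented by None (an assumption of this sketch).
    """
    kept = {}
    for gene, values in rows.items():
        if not values:
            continue  # nothing measured at all: drop the row
        missing = sum(v is None for v in values)
        if missing / len(values) <= max_missing_fraction:
            # Few gaps: treat each gap as "not expressed" (zero).
            kept[gene] = [0.0 if v is None else v for v in values]
        # Otherwise the row is left out of the result entirely.
    return kept

# Hypothetical rows with five time points each.
example = {
    "geneA": [1.2, None, 0.8, 1.1, 0.9],    # 20 % missing: kept, gap filled with 0
    "geneB": [None, None, None, 0.5, 0.7],  # 60 % missing: removed
}
filtered = handle_missing(example)
print(filtered)
```

Exposing the threshold as a parameter makes it easy to test how sensitive the final gene list is to the 40 % choice.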
Genes that are neither up-regulated nor down-regulated by at least a factor of two relative to the normal condition, or that show little variation in expression level between the control and diseased conditions, are not useful[10]. Thus, we have filtered the data and kept only those genes whose expression level differs from the control condition by at least this margin. For this, we take the mean of each row and extract only those rows (genes) for which mean ≥ 1 or mean ≤ −1, which on the log2 scale corresponds to at least a doubling or halving of expression relative to the normal condition.

Step 6. Conversion of the dataset using the log-sigmoid function. At this step we use the log-sigmoid function to transform the values into the range [0, 1], which makes the data more convergent and helps further analysis. This conversion is also useful because it maps all negative values into the positive range, which is convenient for statistical data analysis. The log-sigmoid transformation takes an input that can have any value between −∞ and +∞ and squashes the output into the range [0, 1]. The transfer function is given by

$$f(x) = \frac{1}{1 + e^{-x}}$$
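Steps 5 and 6 can be illustrated with a short sketch. It reads the fold-change condition as "row mean at least 1 or at most −1" and uses the standard log-sigmoid form 1/(1 + e^(−x)); the function names `passes_fold_filter` and `logsig` and the sample rows are hypothetical.

```python
import math
from statistics import mean

def passes_fold_filter(log_ratios, threshold=1.0):
    """Keep a row whose mean log2 ratio is >= threshold or <= -threshold."""
    return abs(mean(log_ratios)) >= threshold

def logsig(x):
    """Log-sigmoid 1 / (1 + exp(-x)), written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

# Hypothetical log2-ratio rows.
rows = {
    "up":   [1.5, 0.9, 1.2],     # mean 1.2   -> kept
    "flat": [0.3, -0.2, 0.4],    # mean ~0.17 -> dropped
    "down": [-1.4, -1.1, -0.9],  # mean ~-1.13 -> kept
}
kept = {g: [logsig(v) for v in vals]
        for g, vals in rows.items() if passes_fold_filter(vals)}
print(sorted(kept))
```

After the transformation, negative log ratios fall below 0.5 and positive ones above it, so every value is positive before the coefficient of variation is computed.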
Step 7. Elimination of genes that have high variation across the collection of samples. At this step we remove genes whose expression level does not vary consistently across the experimental conditions of the time series. In this algorithm, we eliminate genes with more than 36 % variation, because we are interested in genes that show a consistent differential expression level in the disease case. Such a gene or its product can then be used as a drug target to inhibit the symptoms of that disease.

Suppose we have n genes measured at m different time points; the expression levels of gene n will be $x_{n1}, x_{n2}, \ldots, x_{nm}$. We calculate the coefficient of variation (CV) for each row in the dataset. The CV is given by

$$CV = \frac{SD}{\bar{x}} \times 100$$

where SD and $\bar{x}$ are the standard deviation and mean of the row, respectively. Genes whose CV is less than 36 % are selected because they show consistency; genes whose CV is more than 36 % are deleted because they do not act as marker genes of that state. This cut-off can be changed as needed: the larger the cut-off, the larger the output. The flow chart of the proposed algorithm is presented in Figure 1.

Figure 1. Flow chart of the proposed algorithm
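The coefficient-of-variation filter of Step 7 can be sketched as below. The text does not say whether the sample or the population standard deviation is used; this sketch assumes the sample standard deviation (`statistics.stdev`) and keeps rows at or below the cut-off. The names `coefficient_of_variation` and `select_consistent` and the example rows are hypothetical.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """CV in percent: standard deviation divided by mean, times 100.

    Assumes a non-zero mean, which holds for log-sigmoid outputs.
    """
    return stdev(values) / mean(values) * 100.0

def select_consistent(rows, cutoff=36.0):
    """Keep genes whose CV (percent) does not exceed the cut-off."""
    return {gene: vals for gene, vals in rows.items()
            if coefficient_of_variation(vals) <= cutoff}

# Hypothetical log-sigmoid-transformed rows.
rows = {
    "steady":  [0.80, 0.82, 0.78, 0.81],  # small spread around the mean
    "erratic": [0.10, 0.90, 0.20, 0.95],  # large spread around the mean
}
selected = select_consistent(rows)
print(list(selected))
```

Raising `cutoff` admits more genes, in line with the remark that a larger cut-off yields a larger output.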
Figure 2. Dot plot of C.V. values
| [1] | Joshua W.K. Ho, Maurizio Stefani, Cristobal G. dos Remedios and Michael A. Charleston. Differential variability analysis of gene expression and its application to human diseases. Bioinformatics 2008, 24(13):390-398. |
| [2] | Andersen CL, Jensen JL, Orntoft TF. Normalization of Real-Time Quantitative Reverse Transcription-PCR Data: A Model-Based Variance Estimation Approach to Identify Genes Suited for Normalization, Applied to Bladder and Colon Cancer Data Sets. Cancer Res 2004, 64:5245-5250. |
| [3] | de Brouwer AP, van Bokhoven H, Kremer H. Comparison of 12 reference genes for normalization of gene expression levels in Epstein-Barr virus-transformed lymphoblastoid cell lines and fibroblasts. Molecular Diagnosis and Therapeutics 2006, 10(3):197-204. |
| [4] | Saviozzi S, Cordero F, Lo M, Novello S, Giorgio VS, Calogero R. Selection of suitable reference genes for accurate normalization of gene expression profile studies in non-small cell lung cancer. BMC Cancer 2006, 6:200. |
| [5] | Szabo A, Perou CM, Karaca M, Perreard L, Quackenbush JF, Bernard PS. Statistical modeling for selecting housekeeper genes. Genome Biology 2004, 5:R59. |
| [6] | Lindsey J Maccoux, Dylan N Clements, Fiona Salway and Philip JR Day. Identification of new reference genes for the normalisation of canine osteoarthritic joint tissue transcripts from microarray data. BMC Molecular Biology 2007, 8:62. |
| [7] | J. Quackenbush. Computational Approaches to Analysis of DNA Microarray Data. Yearb Med Inform 2006, 91-103. |
| [8] | Alvis Brazma, Jaak Vilo. Gene expression data analysis. FEBS Lett 2000, 480:17-24. |
| [9] | M. Madan Babu. An Introduction to Microarray Data Analysis. Computational Genomics (Ed: R. Grant), Horizon |
| [10] | Xutao Deng, Jun Xu, James Hui, Charles Wang. Probability fold change: A robust computational approach for identifying differentially expressed gene lists. Comput Methods Programs Biomed 2009, 93(2):124-139. |
| [11] | C. Debouck and P. N. Goodfellow. DNA microarrays in drug discovery and development. Nat Genet 1999, 21(1):48-50 [PMID: 9915501]. |
| [12] | D. E. Bassett Jr, M. B. Eisen and M. S. Boguski. Gene expression informatics--it's all in your mine. Nature Genetics 1999, 21(1 Suppl):51-55. |
| [13] | A. Brazma, A. Robinson, G. Cameron and M. Ashburner. One-stop shop for microarray data. Nature 2000, 403(6771):699-700. |
| [14] | A. Brazma and J. Vilo. Gene expression data analysis. FEBS Letters 2000, 480(1):17-24. |
| [15] | William Shannon, Robert Culverhouse, Jill Duncan. Analyzing microarray data using cluster analysis. Pharmacogenomics 2003, 4(1):41-52. |
| [16] | http://genome-www.stanford.edu/yeast_stress/data.shtml |