American Journal of Bioinformatics Research
p-ISSN: 2167-6992 e-ISSN: 2167-6976
2019; 9(1): 1-10
doi:10.5923/j.bioinformatics.20190901.01

Salah H. Abid, Jinan H. Farhood
Al-Mustansiriyah University, Iraq
Correspondence to: Salah H. Abid, Al-Mustansiriyah University, Iraq.
Copyright © 2019 The Author(s). Published by Scientific & Academic Publishing.
This work is licensed under the Creative Commons Attribution International License (CC BY).
http://creativecommons.org/licenses/by/4.0/

Many studies have discussed different numerical representations of DNA sequences, while far fewer deal with image analysis of aspects related to DNA. In this paper, we propose a new image-similarity algorithm to compare images of the eigenvalues of the variance-covariance matrix of the Fast Fourier Transform (FFT) of numerical representations of DNA sequences from five organisms: Human, E. coli, Rat, Wheat and Grasshopper. The algorithm is based on the randomized block design model. It should be noted that this is the first time that the eigenvalues of the variance-covariance matrix of the FFT of numerical representations of DNA sequences have been used in this and related analyses.
Keywords: FFT scaling, DNA, Randomized block design, Image similarity, Eigenvalues
Cite this paper: Salah H. Abid, Jinan H. Farhood, Image Analysis Based on the Eigenvalues of Variance Covariance Matrix of FFT Scaling of DNA Sequences: An Empirical Study for Some Organisms, American Journal of Bioinformatics Research, Vol. 9 No. 1, 2019, pp. 1-10. doi: 10.5923/j.bioinformatics.20190901.01.
5′ carbon of one sugar linked to the 3′ carbon of the next, giving the string direction. DNA molecules occur naturally as a double helix composed of polynucleotide strands with the bases facing inward. The two strands are complementary, so it is sufficient to represent a DNA molecule by a sequence of bases on a single strand; refer to Fig. 1. Thus, a strand of DNA can be represented as a sequence of letters, termed base pairs (bp), from the finite alphabet $\{A, C, G, T\}$. The order of the nucleotides contains the genetic information specific to the organism [Stoffer, D. (2012)].

Figure 1. The general structure of DNA and its bases
and
. The background behind the problem is discussed in detail in the study by Waterman and Vingron (1994). For example, every new DNA or protein sequence is compared with one or more sequence databases to find similar or homologous sequences that have already been studied, and there are numerous examples of important discoveries resulting from these database searches.

One naive approach for exploring the nature of a DNA sequence is to assign numerical values (or scales) to the nucleotides and then proceed with standard time series methods. It is clear, however, that the analysis will depend on the particular assignment of numerical values. Consider the artificial sequence ACGTACGTACGT... Setting A = G = 0 and C = T = 1 yields the numerical sequence 010101010101..., or one cycle every two base pairs (i.e., a frequency of oscillation of $\omega = 1/2$ cycle/bp, or a period of oscillation of length $1/\omega = 2$ bp/cycle). Another interesting scaling is A = 1, C = 2, G = 3, and T = 4, which results in the sequence 123412341234..., or one cycle every four bp ($\omega = 1/4$ cycle/bp).
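The effect of the two scalings can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper name `dominant_frequency` and the choice of 16 repeats of ACGT are illustrative assumptions, not part of the original study. It maps the letters to numbers, removes the mean, and reports the frequency (in cycles/bp) at which the raw periodogram is largest.

```python
import numpy as np

def dominant_frequency(seq, scale):
    """Return the frequency (cycles/bp) with the largest periodogram value.

    `scale` is a dict mapping each nucleotide letter to a number
    (a hypothetical helper for illustration only).
    """
    x = np.array([scale[base] for base in seq], dtype=float)
    x = x - x.mean()                        # remove the mean (zero frequency)
    power = np.abs(np.fft.rfft(x)) ** 2     # raw periodogram, up to a constant
    freqs = np.fft.rfftfreq(len(x), d=1.0)  # frequencies in cycles per base pair
    return freqs[np.argmax(power)]

seq = "ACGT" * 16  # artificial sequence ACGTACGT...

# Scaling 1: A = G = 0, C = T = 1 gives 0101...
f_binary = dominant_frequency(seq, {"A": 0, "C": 1, "G": 0, "T": 1})

# Scaling 2: A = 1, C = 2, G = 3, T = 4 gives 1234...
f_ramp = dominant_frequency(seq, {"A": 1, "C": 2, "G": 3, "T": 4})

print(f_binary, f_ramp)
```

Under the first scaling the dominant frequency is one cycle every two bp, and under the second it is one cycle every four bp, so the two scalings expose different periodicities of the same sequence.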
In this example, both scalings of the nucleotides are interesting and bring out different properties of the sequence. It is clear, then, that one does not want to focus on only one scaling. Instead, the focus should be on finding all possible scalings that bring out interesting features of the data. Rather than choose values arbitrarily, the spectral envelope approach selects scales that help emphasize any periodic feature that exists in a DNA sequence of virtually any length in a quick and automated fashion. In addition, the technique can determine whether a sequence is merely a random assignment of letters [Stoffer, D. (2012)].

Fourier analysis has been applied successfully in DNA analysis; McLachlan and Stewart (1976) and Eisenberg et al. (1994) studied the periodicity in proteins using Fourier analysis.

Stoffer et al. (1993a) proposed the spectral envelope as a general technique for analyzing categorical-valued time series in the frequency domain. The basic technique is similar to the methods established by Tavaré and Giddings (1989) and Viari et al. (1990); however, there are some differences. The main difference is that the spectral envelope methodology is developed in a statistical setting, which allows the investigator to distinguish between significant results and results that can be attributed to chance.

Marhon and Kremer (2011) partition the identification of protein-coding regions into four discrete steps. Based on this partitioning, digital signal processing (DSP) techniques can be easily described and compared in terms of their implementations of the processing steps. Bajic (2000) presented a new methodology for the analysis of DNA/RNA and protein sequences. It is based on a combined application of spectral analysis and artificial neural networks for extracting a common spectral characterization of a group of sequences that have the same or similar biological functions.
Han et al. (2018) considered Fourier transform infrared (FTIR) spectroscopy as a powerful tool for analysing the characteristics of DNA sequences. Their work investigated the key factors in FTIR spectroscopic analysis of DNA and explored the influence of FTIR acquisition parameters, including FTIR sampling techniques, pretreatment temperature, and sample concentration, on calf thymus DNA. The results showed that the FTIR sampling techniques had a significant influence on the spectral characteristics, spectral quality, and sampling efficiency. Ruiz et al. (2018) proposed a novel approach for performing cluster analysis of DNA sequences based on genomic signal processing (GSP) methods and the K-means algorithm. They also proposed a visualization method that facilitates inspection and analysis of the results and of possible hidden behaviors. Since the type of numerical representation of a DNA sequence strongly affects prediction accuracy and precision, Mabrouk (2017) compared different DNA numerical representations by measuring the sensitivity, specificity, correlation coefficient (CC) and processing time for protein-coding region detection. Roy and Barman (2011) estimated and compared the spectral content of coding and non-coding segments of DNA sequences by both parametric and nonparametric methods, in an attempt to bring to light hidden internal properties of the DNA sequence that help distinguish coding regions from non-coding ones. Galleani and Garello (2006) presented a new approach in which the mapping is not kept fixed: it is allowed to vary so as to minimize the spectrum entropy, thus detecting the main hidden periodicities.
The new technique is first introduced and discussed through a number of case studies, then extended to encompass time-frequency analysis.

For analyzing periodicities in categorical-valued time series, the concept of the spectral envelope was introduced by Stoffer et al. (1993) as a computationally simple and general statistical methodology for the harmonic analysis and scaling of non-numeric sequences. The spectral envelope methodology is computationally fast and simple because it is based on the fast Fourier transform and is nonparametric (i.e., it is model independent). This makes the methodology ideal for the analysis of long DNA sequences. Fourier analysis has been used in the analysis of correlated data (time series) since the turn of the twentieth century. Of fundamental interest in the use of Fourier techniques is the discovery of hidden periodicities or regularities in the data. Since a DNA sequence can be regarded as a categorical-valued time series, it is of interest to discover ways in which time series methodologies based on Fourier (or spectral) analysis can be applied to discover patterns in a long DNA sequence or similar patterns in two long sequences. In fact, the spectral envelope is an extension of spectral analysis to categorical-valued data such as DNA sequences.

An algorithm for estimating the spectral envelope and the optimal scalings given a particular DNA sequence $X_t$ with alphabet $\{c_1, \dots, c_{k+1}\}$ is as follows [Stoffer, D. (2012)].

1. Given a DNA sequence of length $n$, form the $k \times 1$ vectors $Y_t$, $t = 1, \dots, n$; namely, for $j = 1, \dots, k$, $Y_t = e_j$ if $X_t = c_j$, where $e_j$ is a $k \times 1$ vector with a 1 in the $j$th position and zeros elsewhere, and $Y_t = 0$ if $X_t = c_{k+1}$.

2. Calculate the Fast Fourier Transform (FFT) of the data,
$$d(j/n) = n^{-1/2} \sum_{t=1}^{n} Y_t \exp(-2\pi i\, t j / n).$$
Note that $d(j/n)$ is a $k \times 1$ complex-valued vector. Calculate the periodogram, $I(j/n) = d(j/n)\, d^{*}(j/n)$, for $j = 1, \dots, [n/2]$, and retain only the real part, say $I^{re}(j/n)$.

3. Smooth the real part of the periodogram as preferred to obtain $\hat{f}^{re}(j/n)$, a consistent estimator of the real part of the spectral matrix.

4. Calculate the $k \times k$ variance-covariance matrix of the data,
$$S = n^{-1} \sum_{t=1}^{n} (Y_t - \bar{Y})(Y_t - \bar{Y})',$$
where $\bar{Y}$ is the sample mean of the data.

5. For each $\omega_j = j/n$, $j = 1, \dots, [n/2]$, determine the largest eigenvalue and the corresponding eigenvector of the matrix $2 n^{-1} S^{-1/2} \hat{f}^{re}(\omega_j) S^{-1/2}$.

6. The sample spectral envelope $\hat{\lambda}(\omega_j)$ is the eigenvalue obtained in the previous step.

7. The optimal sample scaling is $\hat{\beta}(\omega_j) = S^{-1/2} u(\omega_j)$, where $u(\omega_j)$ is the eigenvector obtained in the previous step.

In this paper, we propose a new image-similarity algorithm to compare images of the eigenvalues of the variance-covariance matrix of the Fast Fourier Transform (FFT) of numerical representations of DNA sequences from five organisms: Human, E. coli, Rat, Wheat and Grasshopper. The algorithm is based on the randomized block design model. It should be noted that this is the first time that the eigenvalues of the variance-covariance matrix of the FFT of numerical representations of DNA sequences have been used in this and related analyses.
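The spectral envelope steps listed above can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes NumPy, uses a simple moving average over neighbouring frequencies as the smoother in step 3, maps the last letter of the alphabet to the zero vector, and applies the $2/n$ normalization of step 5. The function name `spectral_envelope`, the `span` parameter and the synthetic test sequence are hypothetical.

```python
import numpy as np

def spectral_envelope(seq, alphabet="ACGT", span=3):
    """Sketch of the sample spectral envelope for a categorical sequence."""
    n = len(seq)
    k = len(alphabet) - 1
    index = {c: j for j, c in enumerate(alphabet)}

    # Step 1: indicator vectors Y_t; the last letter maps to the zero vector.
    Y = np.zeros((n, k))
    for t, c in enumerate(seq):
        j = index[c]
        if j < k:
            Y[t, j] = 1.0

    # Step 2: FFT at frequencies j/n, j = 1..[n/2], and real part of periodogram.
    m = n // 2
    d = np.fft.fft(Y, axis=0)[1:m + 1] / np.sqrt(n)          # shape (m, k)
    I_re = np.real(d[:, :, None] * np.conj(d[:, None, :]))   # shape (m, k, k)

    # Step 3: smooth each matrix entry across frequencies (assumed smoother).
    kernel = np.ones(span) / span
    f_re = np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="same"), 0, I_re)

    # Step 4: variance-covariance matrix S and its inverse square root.
    S = np.cov(Y, rowvar=False, bias=True)
    w, V = np.linalg.eigh(S)
    S_inv_half = V @ np.diag(w ** -0.5) @ V.T

    # Steps 5-7: largest eigenvalue (envelope) and optimal scaling per frequency.
    freqs = np.arange(1, m + 1) / n
    lam = np.empty(m)
    beta = np.empty((m, k))
    for i in range(m):
        M = (2.0 / n) * S_inv_half @ f_re[i] @ S_inv_half
        vals, vecs = np.linalg.eigh(M)       # eigenvalues in ascending order
        lam[i] = vals[-1]
        beta[i] = S_inv_half @ vecs[:, -1]
    return freqs, lam, beta

# Synthetic sequence (assumption): random letters with every third base set to A.
rng = np.random.default_rng(0)
letters = rng.choice(list("ACGT"), size=300)
letters[::3] = "A"
freqs, lam, beta = spectral_envelope("".join(letters))
peak = freqs[np.argmax(lam)]
print(peak)
```

With a period-3 pattern planted in the synthetic sequence, the envelope is expected to peak near one cycle every three bp; the smoother can shift the maximum by one frequency bin.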
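The randomized block design model for $a$ treatments and $b$ blocks can be written as
$$y_{ij} = \mu + \tau_i + \beta_j + \varepsilon_{ij}, \quad i = 1, \dots, a, \; j = 1, \dots, b,$$
where $y_{ij}$ is the measurement for treatment $i$ in block $j$, $\mu$ is the overall mean, $\tau_i$ is the effect of treatment $i$, $\beta_j$ is the effect of block $j$ and $\varepsilon_{ij}$ is the error in measurement for treatment $i$ and block $j$. In most studies, $\varepsilon_{ij}$ is assumed to be a normal variate with mean zero and variance $\sigma^2$. Generally, the normality assumption is not strictly necessary, because this analysis is robust to changes in the distribution of the error random variable. In the analysis of variance, instead of explaining the variance only through error and treatment, we also include the block as a possible source of variance in the data.

The hypotheses under test in this analysis are
$$H_0: \tau_1 = \tau_2 = \dots = \tau_a$$
versus $H_1$: at least one of the values differs from the others. The test statistic is
$$F_0 = \frac{MSTr}{MSE},$$
based on $df_1 = a - 1$ and $df_2 = (a - 1)(b - 1)$, where $MSTr = SSTr/(a - 1)$ and $MSE = SSE/[(a - 1)(b - 1)]$, with $SSTr$ and $SSE$ the treatment and error sums of squares.

The decision rule is based on the P-value, $P = P(F \ge F_0)$, where $F$ follows an $F$ distribution with $df_1$ and $df_2$ degrees of freedom: reject $H_0$ if P-value $\le \alpha$ and do not reject $H_0$ if P-value $> \alpha$, where $\alpha$ is the significance level.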
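The test statistic for the randomized block design can be illustrated with a short sketch. It assumes NumPy and a data table with treatments in rows and blocks in columns; the function name `rbd_f_statistic` and the small data table are made up for illustration and are not taken from the study.

```python
import numpy as np

def rbd_f_statistic(y):
    """F statistic for treatments in a randomized block design.

    `y` has one row per treatment and one column per block
    (an assumed layout for this sketch).
    """
    y = np.asarray(y, dtype=float)
    a, b = y.shape                      # a treatments, b blocks
    grand = y.mean()
    ss_treat = b * np.sum((y.mean(axis=1) - grand) ** 2)   # treatment sum of squares
    ss_block = a * np.sum((y.mean(axis=0) - grand) ** 2)   # block sum of squares
    ss_total = np.sum((y - grand) ** 2)
    ss_error = ss_total - ss_treat - ss_block              # residual sum of squares
    df1, df2 = a - 1, (a - 1) * (b - 1)
    f0 = (ss_treat / df1) / (ss_error / df2)               # MSTr / MSE
    return f0, df1, df2

# Hypothetical measurements: 3 treatments (rows) by 3 blocks (columns).
data = [[9, 12, 14],
        [11, 13, 16],
        [15, 15, 19]]
f0, df1, df2 = rbd_f_statistic(data)
print(f0, df1, df2)
```

The P-value is then the upper-tail probability of an $F$ distribution with `df1` and `df2` degrees of freedom evaluated at `f0`; one way to obtain it, if SciPy is installed, is `scipy.stats.f.sf(f0, df1, df2)`.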
Figure 2. Representation of E. coli eigenvalue vectors
Figure 3. Representation of Grasshopper eigenvalue vectors
Figure 4. Representation of Human eigenvalue vectors
Figure 5. Representation of Rat eigenvalue vectors
Figure 6. Representation of Wheat eigenvalue vectors
Figure 7. Similarity rate between images of DNA representation for E. coli and Grasshopper
Figure 8. Similarity rate between images of DNA representation for E. coli and Human
Figure 9. Similarity rate between images of DNA representation for E. coli and Rat
Figure 10. Similarity rate between images of DNA representation for E. coli and Wheat
Figure 11. Similarity rate between images of DNA representation for Rat and Grasshopper
Figure 12. Similarity rate between images of DNA representation for Wheat and Grasshopper
Figure 13. Similarity rate between images of DNA representation for Human and Grasshopper
Figure 14. Similarity rate between images of DNA representation for Rat and Human
Figure 15. Similarity rate between images of DNA representation for Human and Wheat
Figure 16. Similarity rate between images of DNA representation for Rat and Wheat
