American Journal of Bioinformatics Research
p-ISSN: 2167-6992 e-ISSN: 2167-6976
2013; 3(1): 1-9
doi:10.5923/j.bioinformatics.20130301.01
M. Tahir1, M. Sardaraz1, Ataul Aziz Ikram1, Hassan Bajwa2
1Department of Computing and Technology, Iqra University, Islamabad, Pakistan
2Department of Electrical Engineering, University of Bridgeport
Correspondence to: M. Tahir, Department of Computing and Technology, Iqra University, Islamabad, Pakistan.
Copyright © 2012 Scientific & Academic Publishing. All Rights Reserved.
Next-generation high-throughput sequencing technologies have opened up a wide range of new genome research opportunities. High-throughput sequencing technologies produce a massive amount of short-read data in a single run. The large datasets produced by short-read sequencing technologies are highly error-prone compared to traditional Sanger sequencing approaches. These errors are critical, and removing them is challenging. There is therefore a pressing demand for statistical bioinformatics tools to analyze such large amounts of data. In this paper, we present a review of genome sequence short-read error correction tools and algorithms, together with the parameters used to measure them. We further present comprehensive details of the datasets and the results obtained under the defined parameters. The review also presents the current state of the art and future directions in the area.
Keywords: Algorithms, Genome Sequence, Hadoop MapReduce, High Throughput, Next-Generation Sequencing, Short Reads Error Correction
Cite this paper: M. Tahir, M. Sardaraz, Ataul Aziz Ikram, Hassan Bajwa, Review of Genome Sequence Short Read Error Correction Algorithms, American Journal of Bioinformatics Research, Vol. 3 No. 1, 2013, pp. 1-9. doi: 10.5923/j.bioinformatics.20130301.01.
be a disjoint partition of {1,…,k}, where τ is the maximum number of errors to be corrected and d(x,y) denotes the Hamming distance between k-mers x and y. These partitions represent spaced seeds over k-mers; the restriction of a k-mer x to the seed sj is denoted x(sj). If two k-mers x and y satisfy d(x,y) ≤ τ, then by the pigeonhole principle there exists some j such that x(sj) = y(sj). Based on this observation, the algorithm creates spaced seeds sj = {j, j+(τ+1), j+2(τ+1), …}. It makes τ+1 copies of the set of k-mers, with the jth copy restricted to the indices in sj, and sorts each copy lexicographically. It then linearly scans each copy and identifies equivalence blocks, which are used to compute the Hamming distance between the full-length k-mers in each block; every pair of k-mers within Hamming distance τ has its components joined [50]. The algorithm then finds the connected components ("clusters"). If a cluster has size greater than one and a unique consensus string, it corrects every k-mer in the cluster to that consensus. A singleton k-mer is trivially its own consensus: it is either the only element in its generating set or simply contains more than τ errors. Small clusters of size two or three may have multiple consensus strings, which creates ambiguity; in these cases the Hamming graph does not provide good information about the correctness of the k-mers, so the decision is based solely on the multiplicity of each k-mer x: if the multiplicity is greater than a threshold value, x is kept in the dataset, otherwise it is removed. Computing the consensus of a cluster takes linear
time. In [6] the authors present a novel approach, "Reptile", to cope with the error correction problem for short reads produced by Illumina/Solexa sequencers. Reptile operates on the k-mer spectrum and corrects errors using Hamming distance; it also exploits neighboring k-mers, using both the potential corrections and the contextual information they offer. The algorithm has two phases. (a) Information extraction: the k-spectrum Rk of the read set R is extracted and, for a given threshold d, a Hamming graph GH = (VH, EH) is built whose vertices represent the k-mers in Rk and whose edges connect k-mers within Hamming distance d. This is used to compute tile occurrences. (b) Individual read correction: the algorithm places a tile t at the beginning of a read and identifies the d-mutant tiles of t. The next step is to correct errors in t wherever applicable; this is done by adjusting the tile t and identifying d-mutants repeatedly until the tile placement choices are exhausted. This amounts to a total run time of O(nL log(nL)) and space usage of O(|Rk| + |R||t|). Reptile is fast, has low memory requirements, and provides high accuracy. However, it is single-threaded, cannot determine all errors unambiguously, and does not explicitly model repeats. In [12] the authors present an algorithm, "Quake", for the detection and correction of errors in Illumina/Solexa short reads. Quake is based on a maximum-likelihood paradigm that incorporates quality values and nucleotide-specific miscall rates. Quake uses k-mer coverage to distinguish trusted k-mers in the genome sequence and then searches for error corrections. Let the observed nucleotides be denoted by O = O1, O2, …, On and the actual nucleotides of the sequence fragment by A = A1, A2, …, An. Bayes' rule gives the conditional probability of the actual nucleotides given the observed ones, as in equation (1) [12]:

P(A | O) = P(O | A) P(A) / P(O)   (1)
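Quake's use of Bayes' rule in equation (1) can be illustrated with a toy per-base posterior computation. The uniform-miscall error model, the flat prior, and the function name below are simplifying assumptions for illustration; Quake itself learns nucleotide-specific miscall rates from the data.

```python
# Sketch of equation (1) at a single position: posterior over the true
# nucleotide A given an observed call O and its Phred quality score.
# Assumes a uniform miscall model, not Quake's learned miscall matrix.

def posterior(observed, quality, prior=None):
    """Return P(A = b | O = observed) for each base b, via Bayes' rule."""
    bases = "ACGT"
    p_err = 10 ** (-quality / 10)                  # Phred: P(miscall)
    prior = prior or {b: 0.25 for b in bases}      # flat prior (assumption)
    # Likelihood P(O | A): correct call with prob 1 - p_err,
    # otherwise a uniform miscall to one of the three other bases.
    lik = {b: (1 - p_err) if b == observed else p_err / 3 for b in bases}
    z = sum(lik[b] * prior[b] for b in bases)      # normalizer P(O)
    return {b: lik[b] * prior[b] / z for b in bases}

post = posterior("A", quality=20)   # Q20: P(miscall) = 0.01
```

With a flat prior the posterior simply re-normalizes the likelihoods, so a Q20 call of "A" yields P(A) = 0.99; an informative prior (e.g. from k-mer coverage, as in Quake) would shift these values.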
for the correction within a weak l-tuple. CUDA delta mutation consists of two phases, i.e., delta (Δ) mutation voting and identifying all l-tuples. On successful completion of the first phase, the corresponding counters in the voting matrix are incremented to mark mutation positions, and errors are fixed based on the highest scores in the matrix. CUDA provides multithreading, parallelism, space efficiency, and accuracy, but it does not support distributed-memory operation. Research in [18, 47] presents a short-read error correction algorithm, "Euler-SR", for Applied Biosystems and Illumina/Solexa reads based on the spectral alignment method. This method first establishes a spectrum of trusted k-mers from the input data set and then corrects each read against that spectrum alone. The algorithm is fast and fixes a memory leak of its predecessor, but its accuracy drops significantly when choosing optimal parameters, and it also requires large memory. Another related work, presented in [46], describes a computational method to detect and correct errors in genomic sequence repeats for the Pacific Biosciences platform. The method is available as the software package REDEEM (Read Error Detection and correction via Expectation Maximization). To correct errors in a read r, it considers each nucleotide of r to have appeared in up to k k-mers. Suppose the nucleotide at position 1 ≤ i ≤ L of the read appears at position 1 ≤ t ≤ k of k-mer xl. The probability that the true nucleotide at position t was b, prior to a possible misread, is given by equation (2) in [46].
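The trusted-spectrum idea behind Euler-SR's spectral alignment can be sketched as follows. The function names and the fixed count cutoff are illustrative assumptions, not the tool's actual interface: k-mers occurring at least `cutoff` times form the trusted spectrum, and a read containing any untrusted k-mer is flagged for correction.

```python
from collections import Counter

def trusted_spectrum(reads, k, cutoff=2):
    """Build the set of k-mers occurring >= cutoff times across all reads."""
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    return {kmer for kmer, c in counts.items() if c >= cutoff}

def needs_correction(read, spectrum, k):
    """A read is suspect if any of its k-mers falls outside the spectrum."""
    return any(read[i:i + k] not in spectrum
               for i in range(len(read) - k + 1))

reads = ["ACGTACGT", "ACGTACGT", "ACGTTCGT"]   # last read carries one error
spec = trusted_spectrum(reads, k=4)
```

The actual spectral alignment step then edits each suspect read with a minimal set of changes so that every k-mer it contains becomes trusted; the sketch above covers only the detection side.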
) is performed for the inspection of nodes from level s to level t+q. The algorithm identifies all nodes with at least two children where one child w has a weight smaller than the threshold weight. It then finds the set of reads R(w). Correction is performed against a sibling of w that fits the suffix, examining each read Ri belonging to R(w): SHREC locates the error position in Ri and corrects it to the nucleotide on the associated sibling's edge. SHREC is robust and accurate and achieves good sensitivity and specificity, but it suffers from large memory usage, does not explicitly model repeats, and makes the choice of optimal parameters difficult. Research in [49] describes a modified version of the SHREC algorithm aimed at correcting SOLEXA/Illumina reads. SHREC's statistical model is modified to accommodate reads of variable lengths. Suppose a set of s reads contains reads of r distinct lengths, denoted l1, l2, …, lr, and that the number of reads of length li is Ki; let m be large enough that each sequence of length m appears once. The weight Wm of a node at level m is the number of suffixes whose path in the trie passes through that node, Wm = Σi=1..r W(m,i), where W(m,i) is the contribution of reads of length li to the node's weight. If li < m, then W(m,i) = 0; otherwise each read is a Bernoulli trial for containing the path label. There are n substrings of length m in the genome, and a read of length li samples ai = li − m + 1 of them, so the success probability of the Bernoulli trial is ai/n. With Ki trials, W(m,i) is distributed according to the binomial distribution Bin(Ki, ai/n). Therefore the expected value is E(W(m,i)) = Ki·ai/n for li ≥ m, and 0 otherwise, and the variance is as given in equation (3) [49]:

Var(W(m,i)) = Ki · (ai/n) · (1 − ai/n)   (3)
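The expectation and variance of the node weight W(m,i) derived above are the standard binomial moments, which can be computed directly. The concrete numbers below (read length, trie level, genome size) are illustrative only.

```python
# Sketch of the node-weight statistics in the generalized SHREC model [49]:
# a read of length l_i samples a_i = l_i - m + 1 of the n length-m genome
# substrings, so W(m, i) ~ Bin(K_i, a_i / n).

def weight_stats(l_i, m, K_i, n):
    """Return (E[W(m,i)], Var[W(m,i)]) for reads of length l_i."""
    if l_i < m:
        return 0.0, 0.0                   # short reads contribute nothing
    p = (l_i - m + 1) / n                 # Bernoulli success probability a_i/n
    return K_i * p, K_i * p * (1 - p)     # binomial mean and variance

# e.g. 1000 reads of length 100, trie level m = 20, n = 1e6 substrings
mean, var = weight_stats(l_i=100, m=20, K_i=1000, n=1_000_000)
```

A node whose observed weight falls far below this expectation (relative to the standard deviation) is what the algorithm treats as likely erroneous.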
< 1.0). Coral performs the correction of a base only if its quality exceeds the threshold tQ (0.5 ≤ tQ < 1.0). The time complexity of applying the quality threshold tQ is O(L), the worst-case time complexity of detecting potential errors is O(L·mmax), and O(e·L²·mmax) time is taken to correct them. Coral has lower memory requirements than SHREC, is faster for short reads, and is multithreaded. Coral also has some disadvantages: it requires more memory than Reptile and Quake, and it is not compatible with color-space reads. In [51] the authors present a mapping algorithm, "SHRiMP2", based on the original short-read mapping techniques. SHRiMP2 targets mapping sensitivity and achieves high accuracy at significant speed through a caching heuristic absent from previous versions. SHRiMP2 supports FASTA and FASTQ input, SAM output, and the mapping of Illumina/Solexa, Roche/454, and AB/SOLiD reads. Additionally, it has a paired mapping mode and parallel computation capability. It supports both letter-space and color-space reads, provides multithreaded and parallel computation, and has better sensitivity; it is, however, slower and incurs the overhead of scanning the whole genome. In [54] the authors present a novel algorithm, "ECHO", for correcting base-call errors in short reads without a reference genome for the Illumina/Solexa platform. ECHO is based on a probabilistic model and can assign a quality score to each corrected base. There is no need to specify parameters or unknown values for optimization: ECHO sets the parameters automatically and estimates the error characteristics specific to each sequencing run. ECHO improves on the accuracy of previous error correction methods tuned for a specific sequence coverage depth and position in the read. ECHO performs error correction as a pre-processing step and considerably facilitates de novo assembly. In particular, ECHO improves correction towards the ends of reads, where previous methods are less effective; but it is slower and has high memory requirements.
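Several of the reviewed methods (Reptile's d-mutant tiles, the Hamming-graph clustering in [50]) depend on enumerating all k-mers within Hamming distance d of a given k-mer. A minimal, unoptimized sketch of that enumeration, with an illustrative function name:

```python
# Enumerate the Hamming ball of radius d around a k-mer by repeatedly
# mutating one position at a time; a set deduplicates revisited strings.

def mutants(kmer, d, alphabet="ACGT"):
    """Return every string within Hamming distance d of kmer (including it)."""
    results = {kmer}
    frontier = {kmer}
    for _ in range(d):
        nxt = set()
        for s in frontier:
            for i, c in enumerate(s):
                for b in alphabet:
                    if b != c:
                        nxt.add(s[:i] + b + s[i + 1:])
        results |= nxt
        frontier = nxt
    return results
```

The ball grows as Σ_{j≤d} C(k,j)·3^j, e.g. 10 strings for k = 3, d = 1, which is why the tools above prune candidates with multiplicity thresholds rather than enumerating naively at large d.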