American Journal of Bioinformatics Research
p-ISSN: 2167-6992 e-ISSN: 2167-6976
2014; 4(1): 1-5
doi:10.5923/j.bioinformatics.20140401.01
1Department of Mathematical Science, University of Saint Joseph, West Hartford, 06117, USA
2Department of Chemical Biology and Center for Cancer Prevention Research, The State University of New Jersey, Rutgers, 08854, USA
Correspondence to: Hong Zhou, Department of Mathematical Science, University of Saint Joseph, West Hartford, 06117, USA.
| Email: | ![]() |
Copyright © 2012 Scientific & Academic Publishing. All Rights Reserved.
The Basic Local Alignment Search Tool (BLAST) is the most powerful and popular sequence alignment tool used in biological research, which includes the siRNA or shRNA design. Unfortunately, BLAST can overlook some significant homologies that may cause off-target effect for siRNA gene silence. Therefore, the Smith-Waterman alignment algorithm has been called upon, since it guarantees to find all possible mismatch alignments that may cause off-target effect. However, the Smith-Waterman alignment algorithm suffers from its inefficiency in searching through a large sequence database. This paper presents a two-phase homology search strategy that preserves the strength of Smith-Waterman alignment algorithm while shortening its running time. In the first phase of this algorithm, selected siRNA sequences are divided into multiple mutually disjointed substrings and each is used to scan the sequence database for perfect matches against other genes. Only the sequences that have a perfect match to substrings (of a given siRNA) are kept for the second phase. The second phase is the bona fide Smith-Waterman procedure. During this phase, the algorithm only checks the local vicinity sequences where a substring lands on a perfect match. This two-phased arrangement of the algorithm significantly improves the efficiency of the original Smith-Waterman algorithm by concentrating the search on localized regions instead of the whole genome database.
Keywords: siRNA, shRNA, Smith-Waterman, Sequence Alignment, Off-Target
Cite this paper: Hong Zhou, Hong Wang, Localized Smith-Waterman Algorithm for Fast and Complete Search of siRNA off-Target Sequences, American Journal of Bioinformatics Research, Vol. 4 No. 1, 2014, pp. 1-5. doi: 10.5923/j.bioinformatics.20140401.01.
must be constructed. Thus, the computational complexity is of order O(mn). When searching for possible off-target alignment, a siRNA sequence must be aligned against all expressed sequences (mRNAs) except the target gene itself (or gene slice variants). In the human mRNA RefSeq database used for this study, there are 68822 non-redundant sequences with an average length of 3452bp. Therefore, the general computational cost for a siRNA sequence of 21nt would be at least
(suppose only aligning the sense strand of the siRNA sequence against other mRNA sequences). This cost-prohibitive computation makes it almost impossible to incorporate Smith-Waterman algorithm into routine siRNA design. To make Smith-Waterman algorithm practical in siRNA design, we propose a modified application of the algorithm, which was motivated by the following realization.For a siRNA sequence of length m, an off-target homology is defined as a sequence that has less than x mismatches when aligned against the siRNA sequence. Thus, after the siRNA sequence is divided into x equal substrings (as equal as possible), at least one substring must have a perfect match with the off-target region. For the remainder of this paper, let’s assume m=21 and x=3 unless stated otherwise. Under this condition, an off-target homology can only have a maximum of two mismatches, i.e., 0, 1, or 2 mismatches. When there are a maximum of two mismatches, no matter where the possible two mismatches are, at least one third of the siRNA sequence must have an exact match with the homological region as shown in figure 1.
where m is the length of the siRNA sequence, n is the length of the searched gene sequence, x is the number of substrings, and
annotates the substring length. If we assume that each of the four different bases AGCT has an equal probability to appear at any given nucleotide position, then the probability for i consecutive bases to be matched is
. However, in the character by character matching procedure, the second base is required to be compared only if the first base is matched, and the third base is required for comparison only if the first two bases are matched, and so on. Thus, the probability of comparing two bases of the substring is
while the probability of comparing three bases of the substring is
. Therefore the total computational cost for the first phase would be![]() | (1) |
. The second phase is Smith-Waterman alignment algorithm conducted between the siRNA sequence and the potential off-target region whose maximum length is
. The computational cost of this alignment procedure is therefore of order
. However, the probability of finding a potential off-target region is only
, so the actual computational cost of the second phase is![]() | (2) |
![]() | (3) |
![]() | Figure 4. The value of x impacts the efficiency of the two-phase algorithm. sw: Smith-Waterman algorithm alone. tp: the proposed two-phase algorithm |