American Journal of Bioinformatics Research

p-ISSN: 2167-6992    e-ISSN: 2167-6976

2014;  4(1): 11-22

doi:10.5923/j.bioinformatics.20140401.03

Functional Characterization of Expressed Sequence Tags of Bread Wheat (Triticum aestivum) and Analysis of CRISPR Binding Sites for Targeted Genome Editing

Shailesh Sharma, Santosh Kumar Upadhyay

National Agri-Food Biotechnology Institute (Department of Biotechnology, Government of India), C-127, Industrial Area, S.A.S. Nagar, Phase 8, Mohali, Punjab, 160071, India

Correspondence to: Santosh Kumar Upadhyay, National Agri-Food Biotechnology Institute (Department of Biotechnology, Government of India), C-127, Industrial Area, S.A.S. Nagar, Phase 8, Mohali, Punjab, 160071, India.

Copyright © 2014 Scientific & Academic Publishing. All Rights Reserved.

Abstract

Bread wheat (Triticum aestivum) is one of the leading food crops worldwide. However, functional characterization of the wheat genome is still in progress because of its huge size (~17 Gb). We aimed to contribute to this effort by functionally characterizing EST sequences. Wheat EST sequences (1.2 million available in the EST database) were cleaned and assembled into 27268 contigs under stringent parameters. About 89% (24339) of the contigs were functionally annotated by BlastX search against the NCBI-NR protein database at an e-value cutoff of 10⁻⁵. The annotated contigs were further classified into Gene Ontology (GO) terms and mapped to KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways using the Blast2GO program. A total of 78827 GO terms and 132 KEGG pathways were identified. Purine metabolism and starch and sucrose metabolism were the major pathways. The inositol phosphate metabolism pathway, responsible for the synthesis of phytic acid (an anti-nutritional component), was also well represented in wheat. We identified 3327 EST-SSRs in 2832 contigs and probable CRISPR binding sites in each contig. Further, a hypothetical phytic acid biosynthetic pathway and possible target genes for reducing the phytic acid content of wheat with the CRISPR-Cas system are described. Our study provides genetic information about an important food crop as well as a method for nutritional improvement using a modern biotechnology tool.

Keywords: Bread wheat, EST, KEGG, GO, CRISPR, Phytic acid

Cite this paper: Shailesh Sharma, Santosh Kumar Upadhyay, Functional Characterization of Expressed Sequence Tags of Bread Wheat (Triticum aestivum) and Analysis of CRISPR Binding Sites for Targeted Genome Editing, American Journal of Bioinformatics Research, Vol. 4 No. 1, 2014, pp. 11-22. doi: 10.5923/j.bioinformatics.20140401.03.

1. Introduction

Bread wheat (Triticum aestivum) is one of the most important food crops, accounting for ~21% of the food calories of 75% of the world population (Braun et al. 2010). This figure is continuously increasing, and it is estimated that the demand for wheat will double by 2050. On the other hand, changes in climatic conditions might decrease wheat production in the coming years (Rosegrant et al. 2010). The introduction of new genetic and molecular biology tools for genome sequencing and genome engineering, along with breeding programs, might be very useful for understanding wheat biology and improving crop yield (Wilson et al. 2004; Upadhyay et al. 2013). Bread wheat has one of the most complicated genomes: it is allohexaploid and, at ~17 Gb, one of the largest, about 40 times the size of the rice genome (Arumuganathan and Earle 1991). Characterization of such a genome is itself a big challenge; however, it is urgently needed.
Expressed sequence tags (ESTs) provide very useful information about gene sequences and their expression (Duggan et al. 1999). EST sequencing of many plant species is either completed or under way, and ESTs are very useful in gene discovery (Ewing et al. 1999; Fulton et al. 2002; Hughes et al. 2004; Ronning et al. 2003; Schlueter et al. 2004). Since the amount of genome sequence is continuously increasing due to the decrease in sequencing cost, functional annotation and characterization have become a great challenge. In the case of wheat (www.wheatgenome.org), genome sequencing is progressing rapidly, and functional characterization of wheat ESTs can be a quick and complementary approach. Further, this resource will be highly valuable in crop improvement programs as well as during annotation of the wheat genome.
Genome manipulation has become a very important factor in crop improvement. Transcription activator-like effector nucleases (TALENs) and zinc finger nucleases (ZFNs) have been used to introduce valuable mutations in plants and other organisms (Chen et al. 2013; Zhang et al. 2010). However, these technologies require protein engineering and are complicated to design. A new technology based on the prokaryotic type II CRISPR-Cas9 (clustered regularly interspaced short palindromic repeats-CRISPR associated protein 9) system has been reported for genome editing (Cong et al. 2013). Although some nonspecific editing has also been reported (Fu et al. 2013), the CRISPR-Cas system is very simple to design and highly effective (Mali et al. 2013; Cong et al. 2013). Further, nonspecific binding can be suppressed by choosing specific target sequences. In an earlier study, we reported that the CRISPR-Cas genome editing system is effective in wheat (Upadhyay et al. 2013). Therefore, it can be utilized in crop improvement strategies.
In the present study, we aimed to functionally characterize the wheat ESTs available in the database and to determine the CRISPR binding sites in wheat. Here, we report the assembly of 1.2 million EST sequences of wheat (T. aestivum) available in the NCBI EST database under stringent parameters, annotation of the resulting contigs, GO mapping, KEGG pathway analysis and the frequency of EST-SSRs. We also observed that the CRISPR-Cas mediated genome editing tool might be very effective in wheat because target sites occur frequently. Further, because phytic acid is an anti-nutritional component (Raboy 2009), we report a probable phytic acid biosynthetic pathway in wheat and CRISPR target genes for reducing its synthesis.

2. Materials and methods

EST retrieval and assembly
A total of 1286914 EST sequences of Triticum aestivum were downloaded from the NCBI EST database (http://www.ncbi.nlm.nih.gov/nucest). The EST sequences were subjected to quality improvement following the method and parameters described by Manickavelu et al. (2012). Vector sequence contamination was removed with the help of the NCBI UniVec database (ftp://ftp.ncbi.nih.gov/pub/UniVec/) using the Cross_match program (Ewing et al. 1998). Poly A/T sequences and other ‘X’ characters were removed using the est_trimmer.pl script (http://pgrc.ipk-gatersleben.de/misa/download/est_trimmer.pl). The processed EST sequences were assembled using the CAP3 program (Huang and Madan 1999) with a minimum overlap of 45 bp, more than 90% similarity and at least 3 good EST reads. Further, ESTs were mapped onto the contigs to determine the number of ESTs involved in the development of each contig. The length distributions of ESTs and contigs were also investigated to assess the improvement in average length of contigs over ESTs.
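The trimming of poly A/T stretches and masked characters can be illustrated with a minimal Python sketch. It is not the est_trimmer.pl implementation; the function name trim_est and the threshold of 5 bases (min_run) are assumptions made only for illustration.

```python
import re

def trim_est(seq, min_run=5):
    """Trim masked bases and terminal poly-A/T runs from one EST (illustrative only)."""
    seq = seq.upper()
    # Drop vector-masked 'X' and ambiguous 'N' characters at either end.
    seq = seq.strip("XN")
    # Remove a poly-T or poly-A run of at least min_run bases at the 5' end.
    seq = re.sub(r"^(?:T{%d,}|A{%d,})" % (min_run, min_run), "", seq)
    # Remove a poly-A or poly-T run of at least min_run bases at the 3' end.
    seq = re.sub(r"(?:A{%d,}|T{%d,})$" % (min_run, min_run), "", seq)
    return seq
```

In this sketch, short terminal runs below min_run are left untouched so that genuine coding sequence ending in a few A or T bases is not removed.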
Annotations of assembled contigs
Following the criteria of Ewing et al. (1999), only contigs were used for further characterization. We performed a Blastx search of the contigs against the NCBI non-redundant (nr; ftp://ftp.ncbi.nih.gov/blast/db/FASTA/) protein database at an e-value ≤ 10⁻⁵ for functional annotation. Sequences were further analyzed with the Blast2GO program using default parameters (Conesa et al. 2008) for GO (gene ontology) mapping, InterProScan, enzyme code assignment and KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway analysis. Updated versions of the different databases (NCBI, UniProt, Swiss-Prot, UniRef and others) were used. We proposed a hypothetical pathway for phytic acid biosynthesis in wheat. We also investigated the species showing the top blast hits with the wheat contigs.
Identification of CRISPR binding sites
The clustered regularly interspaced short palindromic repeats (CRISPR) and CRISPR-associated protein (Cas) system has been reported as an efficient tool for RNA-guided genome editing in crop plants such as wheat (Upadhyay et al. 2013). It is based on a ribonucleoprotein complex formed by the guide RNA and the Cas9 protein, which binds to a 20-nucleotide target sequence by base pairing and cleaves dsDNA at a specific position (Jore et al. 2011; Mali et al. 2013). A conserved trinucleotide sequence motif, NGG (known as the protospacer adjacent motif, PAM), immediately 3′ of the target sequence is also essential for cleavage (Gasiunas et al. 2012). The CRISPR-Cas system might be a very useful tool for functional genomics in wheat. To identify the CRISPR binding sites in wheat contigs, a Python script was developed using the following criteria: (1) G/C (N)18 G/C NGG, (2) G/C (N)18 A/T NGG, (3) A/T (N)18 G/C NGG and (4) A/T (N)18 A/T NGG, where N denotes any nucleotide and NGG at the end is the PAM sequence. CRISPR target sites in important genes of the phytic acid biosynthetic pathway were also analyzed.
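The four criteria above can be sketched in Python as follows. This is a minimal illustration and not the script used in the study; the names find_crispr_sites and motif_counts are hypothetical, and scanning only the forward strand is an assumption, since the strand handling is not stated in the text.

```python
import re
from collections import Counter

# A 20-nt target followed by the NGG PAM; the lookahead lets overlapping sites be reported.
SITE = re.compile(r"(?=([ACGT]{20}[ACGT]GG))")

def base_class(base):
    """Classify a nucleotide as G/C or A/T."""
    return "G/C" if base in "GC" else "A/T"

def find_crispr_sites(seq):
    """Return (start, target, pam, motif) for every forward-strand site in seq."""
    seq = seq.upper()
    sites = []
    for m in SITE.finditer(seq):
        hit = m.group(1)
        target, pam = hit[:20], hit[20:]
        # The motif class depends only on the first and last base of the 20-nt target.
        motif = "%s(N)18%s" % (base_class(target[0]), base_class(target[-1]))
        sites.append((m.start(), target, pam, motif))
    return sites

def motif_counts(seq):
    """Count the sites of one sequence per start/end motif class."""
    return Counter(site[3] for site in find_crispr_sites(seq))
```

Summing motif_counts over all contigs would give per-class frequencies of the kind compared in the Results.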
Identification of EST-SSRs
EST-SSRs were identified using the MIcroSAtellite (MISA) identification tool (http://pgrc.ipk-gatersleben.de/misa/misa.html). Dinucleotide (DNRs), trinucleotide (TNRs), tetranucleotide (TtNRs), pentanucleotide (PNRs), hexanucleotide (HNRs), heptanucleotide (HpNRs) and octanucleotide repeats (ONRs) were identified with the criterion of a minimum of five repeat units. Mononucleotide repeats (MNRs) were not considered in this study.
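The search criterion (repeat units of 2 to 8 nucleotides, at least five tandem copies, no mononucleotide repeats) can be sketched with regular expressions. This is only an illustration of the criterion and not a reimplementation of MISA; the function names are hypothetical, and skipping motifs that are themselves repetitions of a shorter unit (for example ATAT) is an assumption of this sketch.

```python
import re

def is_primitive(motif):
    """True if the motif is not a repetition of a shorter unit (e.g. 'ATAT' or 'AA')."""
    n = len(motif)
    return not any(n % d == 0 and motif[:d] * (n // d) == motif for d in range(1, n))

def find_ssrs(seq, min_repeats=5, unit_sizes=range(2, 9)):
    """Return sorted (start, motif, repeat_count) tuples for perfect tandem repeats."""
    seq = seq.upper()
    hits = []
    for k in unit_sizes:
        # One k-mer followed by at least (min_repeats - 1) further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_primitive(motif):
                hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)
```

Because the leftmost match is reported, the motif of a repeat is named in the phase in which it first appears in the sequence (for example GCC rather than CCG if the preceding base is G).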
Identification of Transcription Factors
To identify the transcription factors in bread wheat, we downloaded the reported transcription factors of rice (Oryza sativa) from the plant transcription factor database (http://planttfdb.cbi.pku.edu.cn/download.php) and used them for a local blast search against the contigs (Perez-Rodrigues et al. 2009). We used the parameters for identification of transcription factors described earlier (Manickavelu et al. 2012), requiring at least 80% sequence similarity with a minimum of 50% query coverage.
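The two thresholds can be applied to BLAST results as in the following sketch. It assumes, purely for illustration, that the local search was run with BLAST+ tabular output requested as -outfmt "6 std qlen", so that each line has the 12 standard columns plus the query length; the function name passes_tf_filter is hypothetical, and percent identity is used here as the similarity measure.

```python
def passes_tf_filter(line, min_identity=80.0, min_coverage=50.0):
    """Check one tab-delimited BLAST hit (12 standard columns + qlen) against both cutoffs."""
    fields = line.rstrip("\n").split("\t")
    pident = float(fields[2])                    # percent identity of the alignment
    qstart, qend = int(fields[6]), int(fields[7])
    qlen = int(fields[12])                       # query length (extra 13th column)
    # Query coverage: aligned part of the query as a percentage of its full length.
    coverage = 100.0 * (abs(qend - qstart) + 1) / qlen
    return pident >= min_identity and coverage >= min_coverage
```

Contigs with at least one hit passing this filter would then be counted as putative transcription factor encoding sequences.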
Data access
Since we used sequence data available in the NCBI EST database, we could not deposit the contigs developed in this study in the database. Therefore, we provide the sequence file of the contigs with the manuscript (Supplementary data 1).

3. Results

Bread wheat (Triticum aestivum) EST sequences and assembly
We downloaded 1286914 ESTs of Triticum aestivum from the NCBI database and subjected them to quality improvement. Poly A, T, N and vector sequences were trimmed. Further, ESTs with ambiguous nucleotide sequences were discarded. A total of 1169253 quality ESTs were used for CAP3 assembly under stringent conditions: a minimum overlap of 45 nucleotides with more than 90% similarity and at least 3 good EST reads at the clip position. We obtained 27268 contigs and 522651 singlets as the output of the CAP3 program (Table 1). Only contigs were used for further analysis in this study, as in other studies (Ewing et al. 1999). Length distribution analysis showed that 48% and 29% of the ESTs were in the range of 501-750 and 251-500 nucleotides, respectively, and only 15% of the ESTs were longer than 750 nucleotides (Figure 1A). In contrast, among the contigs developed after assembly, 75% were longer than 750 nucleotides and 20% were in the range of 501-750 nucleotides (Figure 1B). The length of the contigs ranged from 111 to 4106 nucleotides, with an average contig length of 1018 nucleotides. The number of ESTs assembled in each contig ranged from 3 to 6322. About 73% of the contigs were developed by assembly of 3 to 50 ESTs, and the remaining 27% by assembly of more than 50 ESTs (Figure 1C).
Table 1. Details of ESTs and contigs of Wheat (Triticum aestivum)
     
Figure 1. Sequence length distribution of ESTs and contigs, and number of ESTs involved in contig development. (A) Sequence length distribution of ESTs. (B) Sequence length distribution of contigs. (C) Number of ESTs involved in the development of contigs. The figure shows a significant improvement in contig length and the involvement of 3 to more than 1000 ESTs in contig development
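The length classes used for the distribution analysis can be computed with a short Python sketch. The bin boundaries follow the ranges quoted in the text (251-500, 501-750 and longer than 750 nucleotides); the lowest class and the function names are assumptions made for illustration.

```python
from collections import Counter

# Inclusive (low, high, label) length classes; anything above 750 falls in ">750".
BINS = [(0, 250, "<=250"), (251, 500, "251-500"), (501, 750, "501-750")]

def length_bin(n):
    """Return the label of the length class that contains n nucleotides."""
    for low, high, label in BINS:
        if low <= n <= high:
            return label
    return ">750"

def length_distribution(lengths):
    """Return the percentage of sequences in each non-empty length class."""
    counts = Counter(length_bin(n) for n in lengths)
    total = sum(counts.values())
    return {label: 100.0 * c / total for label, c in counts.items()}
```

Applying length_distribution separately to the EST lengths and the contig lengths gives the two distributions that are compared above.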
Annotation of contigs
Only contigs were used for functional annotation by Blastx search against the NCBI-nr protein database at an e-value of 10⁻⁵. Blast2GO analysis was also performed with default parameters for annotation, and similar results were obtained. We could annotate 89% of the contigs with significant similarity; a total of 24339 contigs were annotated (Supplementary File 1). About 82% of the contigs were annotated at more than 70% similarity, including 26% with more than 90% similarity. We observed top blast hits with closely related plant species (Supplementary File 2, Figure 2). About 33, 27 and 22% of the top blast hits were obtained from Aegilops tauschii, Hordeum vulgare and Triticum urartu, respectively. However, only 7.3% of the top hits were with Triticum aestivum, which indicates that sequences that had not been functionally annotated earlier were annotated here. This might also indicate the poor availability of annotated sequence resources for bread wheat.
Figure 2. Species showing top blast hits during annotation of wheat contigs. A. tauschii is on top, followed by H. vulgare and T. urartu. Similarity with T. aestivum was lower than with these three species, indicating that most of the contigs annotated in the present study were new
GO annotation
Blast2GO analysis was used to assign GO terms to the contigs. GO terms are classified into three categories: cellular component, molecular function and biological process. The sum over the GO categories does not match the number of annotated contigs because several contigs were classified into more than one category. A total of 78827 GO terms were assigned on the basis of similarity. Of these, 21585 were assigned to the cellular component, 22698 to the molecular function and 34544 to the biological process category (Supplementary File 3). Among cellular components, plasma membrane (1695), nucleus (1568), mitochondrion (1460), cytoplasmic membrane-bounded vesicle (1425), integral to membrane (1156), chloroplast (1072) and membrane (1052) were the most represented (Figure 3A). In the case of molecular function, ATP binding (1390), DNA binding (806), metal ion binding (804), nucleotide binding (726), structural constituent of ribosome (617), protein binding (609), zinc ion binding (589), catalytic activity (583) and hydrolase activity (537) were at the top (Figure 3B). Biological processes related to oxidation-reduction (988), translation (663), metabolic process (613), response to cadmium ion (466), protein phosphorylation (463), regulation of transcription (430), response to salt stress (418) and cellular process (409) were among the most enriched (Figure 3C).
Figure 3. GO categorization of wheat contigs. Sequences were categorized into (A) cellular component, (B) molecular function and (C) biological process on the basis of similarity
KEGG Pathway
KEGG pathways were obtained with the Blast2GO tool. We identified 132 KEGG pathways involving 6095 contigs (Supplementary File 4). Purine metabolism (313), starch and sucrose metabolism (220), phenylalanine metabolism (196), glycolysis/gluconeogenesis (185) and phenylpropanoid biosynthesis (184) were the top five pathways (Figure 4). Besides these, several pathways involved in metabolism related to primary productivity, such as carbon fixation in photosynthetic organisms, the pentose phosphate pathway, pyruvate metabolism, the citrate cycle and others, were also well represented. Genes involved in the biosynthetic pathways of flavonoids, terpenoids, carotenoids, steroids, vitamins, lipids and other metabolites were also detected (Supplementary File 4). We analyzed the presence of genes of the inositol phosphate metabolism pathway and proposed a probable phytic acid biosynthetic pathway in wheat.
Figure 4. Top 20 identified KEGG pathways. Purine metabolism (313) and starch and sucrose metabolism (220) are the most highly represented pathways
Hypothetical Phytic acid biosynthetic pathway in wheat
Two major phytic acid biosynthetic pathways are reported in plants: (1) lipid dependent and (2) lipid independent (Raboy 2009). We mapped our contigs to KEGG pathways using the Blast2GO program and found that 41 contigs mapped to the inositol phosphate metabolism pathway (Supplementary File 5), which is responsible for the synthesis of phytic acid. We analyzed the representation of contigs in this pathway and proposed a hypothetical phytic acid biosynthetic pathway with reference to the reported pathway (Raboy 2009). We could not find any gene involved in the lipid independent pathway; however, the lipid dependent pathway was almost completely represented from glucose-6P to phytic acid in the analyzed data (Figure 5). We observed the presence of the enzymes responsible for the conversion of glucose-6P to myo-inositol. Although the enzyme for the conversion of myo-inositol to phosphatidyl-myo-inositol was not found, we found each of the enzymes responsible for the stepwise conversion of phosphatidyl-myo-inositol to phytic acid. Further, we could not find some other enzymes involved in a bypass pathway from myo-inositol 1,4,5 P3 to phytic acid.
Figure 5. A hypothetical phytic acid biosynthetic pathway developed on the basis of KEGG mapping of wheat contigs
CRISPR target sites in wheat
The clustered regularly interspaced short palindromic repeats (CRISPR) and CRISPR-associated protein (Cas) system has been developed as an attractive tool for genome engineering. CRISPR-Cas technology is also effective for genome editing in wheat (Upadhyay et al. 2013). Therefore, we aimed to identify the probable CRISPR binding sites in our contigs, which might be useful in future studies. We observed at least one CRISPR target site in each annotated contig (Supplementary File 6). A maximum of 363 target sites were detected in Contig20171. Further, we classified the target sites into four different motifs, as explained in Materials and methods. We found that CRISPR sites starting with G/C and ending with A/T were the most frequent, followed by those starting and ending with G/C (Figure 6). The results show that wheat is highly amenable to editing by the CRISPR-Cas system.
Figure 6. Frequency of CRISPR binding sites in wheat. The number of probable target sites for each combination of start and end bases is shown. NGG at the 3' end is the PAM sequence
The CRISPR-Cas genome editing system can also be useful in crop improvement by engineering the phytic acid biosynthetic pathway for reduced phytate content. We analyzed the probable CRISPR binding sites in the contigs represented in the phytic acid biosynthetic pathway and found that all of them were amenable to editing (Supplementary File 7). Several probable CRISPR binding sites were present in each contig. Although any gene in the pathway can be targeted, Contig 9217 (annotated as probable inositol-pentakisphosphate 2-kinase-like), responsible for the conversion of myo-inositol 1,3,4,5,6 P5 to phytic acid, might be an important target for genetic engineering in wheat to reduce the phytate content, as reported in maize (Shukla et al. 2009).
EST-SSRs
The 24339 annotated contigs were used for EST-SSR mining using the MISA (MIcroSAtellite identification tool) Perl script (http://pgrc.ipk-gatersleben.de/misa/) with the standard criterion of 5 motif repeats. A total of 3327 EST-SSRs were identified in 2832 contigs after excluding mononucleotide repeats from the analysis. We identified more than one EST-SSR in 391 contigs. Dinucleotide repeats (DNRs, 640), trinucleotide repeats (TNRs, 1971), tetranucleotide repeats (TtNRs, 139), pentanucleotide repeats (PNRs, 37) and hexanucleotide repeats (HNRs, 21) were detected (Supplementary File 8). We could not detect heptanucleotide, octanucleotide or longer repeats. TNRs (70%) were the most abundant EST-SSRs, followed by DNRs (23%) and TtNRs (5%) (Figure 7A). CT and GA among DNRs, with 96 motifs, and CCG and CGC among TNRs, with 145 and 135 motifs, respectively, were the most abundant SSRs (Figure 7B).
Figure 7. Frequency distribution of 3327 EST-SSRs motifs identified in wheat contigs. (A) Distribution of different kinds of SSRs. TNRs (70 %) are among the most abundant EST-SSRs identified. (B) Number of top 20 EST-SSRs detected
Transcription factors
Among the 24339 annotated contigs, a total of 597 were mapped as transcription factor (TF) encoding genes (Supplementary File 9) on the basis of homology with O. sativa transcription factors. The MYB family (59), AP2 domain containing proteins (25), bZIP (25), nuclear transcription factor Y subunits (21), helix-loop-helix DNA binding domain containing proteins (20) and zinc finger C3H type (18) were the most highly represented transcription factors (Figure 8).
Figure 8. Distribution of identified transcription factors from wheat ESTs. MYB, AP2 domain containing protein and bZIP transcription factors are highly represented

4. Discussion

ESTs have been reported as a valuable resource for functional genomics research. Millions of uncharacterized ESTs are available in public databases and need to be functionally characterized. Although EST data contain several ambiguities (such as the presence of vector sequence, short reads, poly A/T sequence, uncertain (N) nucleotides and others), they still provide valuable information. Further, several bioinformatics tools are available to improve the quality of EST sequences and reduce these ambiguities (Ewing et al. 1999).
Wheat is one of the most valuable food crops worldwide (Braun et al. 2010). Several groups are working to decode the wheat genome, but because of its huge size and complicated allopolyploid nature, it is still unresolved. Millions of wheat EST sequences are available in public databases without functional characterization. We aimed to contribute to wheat genomics by characterizing these sequences. About 1.2 million ESTs were downloaded from the database and subjected to quality improvement under several stringent parameters using different Perl scripts (Ewing et al. 1999; Manickavelu et al. 2012). This resulted in a ~9% reduction in the number of EST sequences. The quality ESTs (1169253) were assembled into 27268 contigs under stringent parameters. We found a large number of singlets (522651), which might be due to the very stringent assembly parameters, as only regions represented in at least 3 EST sequences were considered. To increase the certainty of the results, only contigs were used for further characterization (Ewing et al. 1999). The length distribution showed that 77% of the ESTs were in the range of 251-750 nucleotides, whereas 75% of the contigs were longer than 750 nucleotides (Figure 1). The average contig length was 1018 nucleotides, which is higher than in several other studies (Manickavelu et al. 2012; Somers et al. 2003; Wilson et al. 2004; Zhang et al. 2004). About 70% of the contigs were developed by assembly of more than 10 ESTs, which supports the accuracy of the sequences (Figure 1C).
Blastx against the NCBI-nr protein database and Blast2GO analysis were performed for functional characterization of the contigs. A total of 89% of the contigs were annotated at an e-value of 10⁻⁵, of which 82% were annotated with more than 70% sequence similarity. We observed top blast hit similarity of the contigs mostly with members of the Poaceae family, wherein Aegilops tauschii, Hordeum vulgare and Triticum urartu ranked above Triticum aestivum. This result indicates that most of the genes annotated in the present study had not been annotated earlier; otherwise, the highest similarity would have been with T. aestivum. Further, A. tauschii and T. urartu contributed the D and A genomes of bread wheat (T. aestivum), respectively; therefore, their high homology was expected (Brenchley et al. 2012; Ling et al. 2013; Jia et al. 2013).
Gene ontology (GO) analysis was performed using the Blast2GO program (Conesa et al. 2008), which provides information in three major categories: cellular component, molecular function and biological process. We observed membrane, nucleus, mitochondrion and chloroplast as the major cellular components in wheat, as reported earlier (Manickavelu et al. 2012). In the case of biological process, redox reactions, translation, metabolic and cellular processes, transcription regulation, post-translational protein modifications such as phosphorylation, and responses to different stresses were highly represented. Similar to other members of Poaceae, binding activities such as ATP, GTP, DNA, RNA, metal and protein binding were the major molecular functions in wheat (Alexandrov et al. 2009; Kikuchi et al. 2003; Zhang et al. 2004).
The Blast2GO program was also used for KEGG (Kanehisa et al. 2000) pathway mapping using the enzyme commission numbers assigned to the annotated sequences. This is another approach to functional genomics, which emphasizes the different metabolic and biochemical pathways. We observed that pathways involved in the metabolism of different sugars, core building blocks such as purines, pyrimidines and amino acids, primary productivity and energy production were highly enriched in wheat. Besides these pathways, we analyzed the genes involved in the phytic acid biosynthetic pathway. Phytic acid is an anti-nutrient compound in most crop plants and accounts for ~75% of total phosphorus in seed. It is also responsible for environmental pollution (Raboy 2001; 2003; Stevenson-Paulik et al. 2005). It is highly desirable at present to reduce the phytate content of crop plants. Several approaches have been initiated to reduce the phytate content in crop plants such as maize (Raboy 2009; Shukla et al. 2009); however, an effective approach in wheat has not yet been reported. Besides this, the genes involved in the complete phytic acid biosynthetic pathway in wheat are also not known. Therefore, it is highly desirable to know the pathway in wheat so that target genes can be selected for genetic manipulation using modern genetic engineering tools. We found that 41 contigs mapped to inositol phosphate metabolism (KEGG map00562, Supplementary File 5), a KEGG pathway involved in phytic acid biosynthesis. On the basis of this mapping and the available literature (Raboy 2009), we proposed a probable phytic acid biosynthetic pathway in wheat (Figure 5). The phytic acid biosynthetic pathway basically comprises a lipid dependent and a lipid independent pathway. We found that the lipid dependent pathway was almost completely represented by the contigs developed in the present study; however, we could not find any genes involved in the lipid independent pathway. This might be due to the unavailability of a complete dataset. Further, it is also possible that the lipid dependent pathway is more enriched than the lipid independent pathway; however, this remains a matter for further study.
The present study also analyzed the suitability of the CRISPR-Cas genome editing tool for genetic engineering in wheat. Earlier, we reported that this system is effective in wheat; however, its efficiency for editing different target genes needs to be established (Upadhyay et al. 2013). A Python script was developed to analyze the CRISPR binding sites in the wheat contigs used in the present study. We found that almost every contig was amenable to editing with this system (Supplementary File 6), which indicates a bright future for CRISPR-Cas technology in wheat crop improvement programs. We particularly analyzed the CRISPR target regions in the contigs mapped to the phytic acid biosynthetic pathway and found that the system can be very useful there. Shukla et al. (2009) targeted inositol-1,3,4,5,6-pentakisphosphate 2-kinase in maize with zinc finger nucleases (ZFNs) to reduce the phytic acid content. We identified a homologous gene (Contig 9217) in wheat, which can be a suitable target. Since the CRISPR-Cas system is RNA guided and very simple to design, it can be more useful than other tools such as ZFNs, which involve protein engineering (Chen et al. 2013; Cong et al. 2013; Mali et al. 2013; Zhang et al. 2010).
We also analyzed the frequency of EST-SSRs and transcription factors in the annotated contigs. ESTs are a rich source of SSR markers, and EST-SSRs show high polymorphism and repeatability and are very easy to use in studies (Li et al. 2008). We identified 3327 EST-SSRs in 11% of the annotated wheat contigs. The frequency observed is considerably higher than in earlier studies in wheat, which reported SSRs in ~5% of genes (Rota et al. 2005). Trinucleotide repeats were the most highly represented SSRs in the present study, among which CCG repeats were the most common. This result is in agreement with an earlier report (Rota et al. 2005).
To analyze the presence of transcription factors in wheat contigs, we downloaded the sequences of O. sativa transcription factors from the plant transcription factor database and blasted them against the contigs under stringent parameters. We found that 597 contigs mapped to transcription factor encoding genes (Supplementary File 9). MYB family transcription factors were the most enriched, followed by AP2 domain containing proteins and bZIP. These transcription factors are involved in different regulatory and developmental pathways (Cai et al. 2012; Katiyar et al. 2012; Pré et al. 2008).
In conclusion, the present study provides a comprehensive analysis of wheat ESTs and highlights the potential of genetic engineering tools for crop improvement programs. Functional characterization of the ESTs after quality improvement provides new information, including a phytic acid biosynthetic pathway in wheat and a probable method and target genes for reducing the phytate content in nutritional enrichment programs.

ACKNOWLEDGEMENTS

Authors are thankful to the Executive Director, National Agri-Food Biotechnology Institute (NABI), Mohali, India. SKU is thankful to Department of Science and Technology (DST), Government of India for DST INSPIRE Faculty Fellowship.

Supplementary File Legends

Supplementary File 1: Annotation details of wheat contigs with NCBI NR database using BlastX with e-value 10⁻⁵.
Supplementary File 2: Top blast hit species distribution of wheat contigs at NCBI-nr protein database.
Supplementary File 3: GO functional annotation details of wheat contigs.
Supplementary File 4: Details of KEGG pathways identified.
Supplementary File 5: KEGG pathway for Inositol Phosphate Metabolism (map00562). Highlighted EC numbers were identified in our contigs. A total of 41 contigs were involved in this pathway (Supplementary File 4) by encoding different important enzymes. List of enzymes and encoding contigs are given below the figure.
Supplementary File 6: Probable target sites in annotated contigs of wheat for CRISPR-Cas mediated genome editing tool.
Supplementary File 7: Probable CRISPR binding sites in the contigs represented in phytic acid biosynthetic pathway.
Supplementary File 8: Distribution of 3327 EST-SSRs motifs in wheat annotated contigs. TNRs (1971; 70 %) were among the most abundant EST-SSRs.
Supplementary File 9: Transcription factor identification. Details of transcription factor encoding genes identification from wheat contigs.
Supplementary Data 1: Sequences of annotated contigs.

References

[1]  Alexandrov NN, Brover VV, Freidin S, et al. 2009. Insights into corn genes derived from large-scale cDNASequencing, Plant Mol Biol vol 69, pp 179–94.
[2]  Arumuganathan, Earle ED. 1991. Estimation of nuclear DNA content of plants by flow cytometry. Plant Molecular Biology Reporter vol 9, pp 229-241.
[3]  Braun HJ, Atlin G, Payne T. 2010. Multi location testing as a tool to identify plant response to global climate change. In, pp Reynolds, C.R.P. (Ed.), Climate Change and Crop Production. CABI: London, UK.
[4]  Brenchley R, Spannagl M, Pfeifer M, Barker GL, D'Amore R, et al. 2012. Analysis of the bread wheat genome using whole-genome shotgun sequencing. Nature vol 491, pp 705–710.
[5]  Cai H, Tian S, Dong H. 2012. Large scale in silico identification of MYB family genes from wheat expressed sequence tags. Mol Biotechnol vol 52, pp 184-92.
[6]  Chen K, Gao C. 2013. TALENs: Customizable Molecular DNA Scissors for Genome Engineering of Plants. J Genet Genomics vol 40, pp 271-9.
[7]  Conesa A, Götz S. 2008. Blast2GO: A comprehensive suite for functional analysis in plant genomics. Int J Plant Genomics vol 2008, pp 619832.
[8]  Cong L, Ran FA, Cox D, Lin S, Barretto R, et al. 2013. Multiplex genome engineering using CRISPR/Cas systems. Science vol 339, pp 819-23.
[9]  Duggan DJ, Bittner M, Chen Y, Meltzer P, Trent JM. 1999. Expression profiling using cDNA microarrays. Nat Genet vol 21, pp 10–4.
[10]  Ewing B, Hillier L, Wendl MC, Green P. 1998. Base-calling of automated sequencer traces using phred I. Accuracy assessment. Genome Res vol 8, pp 175–185.
[11]  Ewing RM, Kahla AB, Poirot O, Lopez F, Audic S, Claver JM. 1999. Large-scale statistical analyses of rice ESTs reveal correlated patterns of gene expression. Genome Res vol 9, pp 950–9.
[12]  Fu Y, Foden JA, Khayter C, Maeder ML, Reyon D, et al. 2013. High-frequency off-target mutagenesis induced by CRISPR-Cas nucleases in human cells. Nature Biotechnology; doi:10.1038/nbt.2623.
[13]  Fulton TM, Hoeven R, Eannetta NT, Tanksley SD. 2002. Identification, analysis, and utilization of conserved ortholog set markers for comparative genomics in higher plants. Plant Cell vol 14, pp 1457–67.
[14]  Gasiunas G, Barrangou R, Horvath P, Siksnys V. 2012. Cas9-crRNA ribonucleoprotein complex mediates specific DNA cleavage for adaptive immunity in bacteria. Proc Natl Acad Sci USA vol 109, pp E2579–E2586.
[15]  Gene Ontology Consortium. 2006. The Gene Ontology (GO) project in 2006. Nucleic Acids Res vol 34(Database issue), pp D322–D6.
[16]  Huang X, Madan A. 1999. CAP3: A DNA sequence assembly program. Genome Res vol 9, pp 868.
[17]  Hughes A, Friedman R. 2004. Expression patterns of duplicate genes in the developing root in Arabidopsis thaliana. J Mol Evol vol 60, pp 247–56.
[18]  Jia J, Zhao S, Kong X, Li Y, Zhao G, et al. 2013. Aegilops tauschii draft genome sequence reveals a gene repertoire for wheat adaptation. Nature vol 496, pp 91–95.
[19]  Jore MM, Lundgren M, Duijn EV, Bultema JB, Westra ER, et al. 2011. Structural basis for CRISPR RNA-guided DNA recognition by Cascade. Nat Struct Mol Biol vol 18, pp 529–536.
[20]  Kanehisa M, Goto S. 2000. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res vol 28, pp 27–30.
[21]  Katiyar A, Smita S, Lenka SK, Rajwanshi R, Chinnusamy V, Bansal KC. 2012. Genome-wide classification and expression analysis of MYB transcription factor families in rice and Arabidopsis. BMC Genomics vol 13, pp 544.
[22]  Kikuchi S, Sathoh K, Nagata T, et al. 2003. Collection, mapping, and annotation of over 28,000 cDNA clones from japonica rice. Science vol 301, pp 376.
[23]  Li L, Wang J, Guo Y, Jiang F, Xu Y, et al. 2008. Development of SSR markers from ESTs of gramineous species and their chromosome location on wheat. Progress in Natural Science vol 18, pp 1485–1490.
[24]  Ling HQ, Zhao S, Liu D, Wang J, Sun H, Zhang C, et al. 2013. Draft genome of the wheat A-genome progenitor Triticum urartu. Nature vol 496, pp 87–90.
[25]  Mali P, Yang L, Esvelt KM, Aach J, Guell M, et al. 2013. RNA-guided human genome engineering via Cas9. Science vol 339, pp 823-6.
[26]  Manickavelu A, Kawaura K, Oishi K, Shin T, Kohara Y, et al. 2012. Comprehensive Functional Analyses of Expressed Sequence Tags in Common Wheat (Triticum aestivum). DNA Research vol 19, pp 165–177.
[27]  Perez-Rodriguez P, Riano-Pachon DM, Correa LGG, Rensing SA, Kersten B, Mueller-Roeber B. 2009. PlnTFDB: updated content and new features of the plant transcription factor database. Nucleic Acids Res; doi:10.1093/nar/gkp805.
[28]  Pré M, Atallah M, Champion A, Vos MD, Pieterse CMJ, Memelink J. 2008. The AP2/ERF Domain Transcription Factor ORA59 Integrates Jasmonic Acid and Ethylene Signals in Plant Defense. Plant Physiology vol 147, pp 1347-1357.
[29]  Raboy V. 2009. Approaches and challenges to engineering seed phytate and total phosphorus. Plant Science vol 177, pp 281–296.
[30]  Raboy V. 2003. Myo-Inositol-1,2,3,4,5,6-hexakisphosphate. Phytochemistry vol 64, pp 1033–1043.
[31]  Raboy V. 2001. Seeds for a better future: ‘low phytate’ grains help to overcome malnutrition and reduce pollution. Trends Plant Sci vol 6, pp 458–462.
[32]  Ronning CM, Stegalkina SS, Ascenzi RA, et al. 2003. Comparative analyses of potato expressed sequence tag libraries. Plant Physiol vol 131, pp 419–29.
[33]  Rosegrant MW, Agcaoili M. 2010. Global Food Demand, Supply, and Price Prospects to 2010. International Food Policy Research Institute: Washington, DC.
[34]  Rota ML, Kantety RV, Yu JK, Sorrells ME. 2005. Nonrandom distribution and frequencies of genomic and EST-derived microsatellite markers in rice, wheat, and barley. BMC Genomics vol 6, pp 23; doi:10.1186/1471-2164-6-23.
[35]  Schlueter JA, Dixon P, Granger C, et al. 2004. Mining EST databases to resolve evolutionary events in major crop species. Genome vol 47, pp 868–76.
[36]  Shukla VK, Doyon Y, Miller JC, Moehle EA, Worden SE, et al. 2009. Precise genome modification in the crop species Zea mays using zinc-finger nucleases. Nature vol 459; doi:10.1038/nature07992.
[37]  Somers DJ, Kirkpatrick R, Moniwa M, Walsh A. 2003. Mining single-nucleotide polymorphisms from hexaploid wheat ESTs. Genome vol 49, pp 431–7.
[38]  Stevenson-Paulik J, Bastidas RJ, Chiou ST, Frye RA. 2005. Generation of phytate-free seeds in Arabidopsis through disruption of inositol polyphosphate kinases. Proc Natl Acad Sci USA vol 102, pp 12612–12617.
[39]  Upadhyay SK, Kumar J, Alok A, Tuli R. 2013. RNA guided genome editing for multiple target gene mutations in wheat. G3 (Bethesda) vol 3, pp 2233-8.
[40]  Wilson ID, Barker GLA, Beswick RW, et al. 2004. A transcriptomics resource for wheat functional genomics. Plant Biotech vol 2, pp 495–506.
[41]  Zhang D, Choi DW, Wanamaker S, et al. 2004. Construction and evaluation of cDNA libraries for large-scale expressed sequence tag sequencing in wheat (Triticum aestivum L.). Genetics vol 168, pp 595–608.
[42]  Zhang F, Maeder ML, Unger-Wallace E, Hoshaw JP, Reyon D, et al. 2010. High frequency targeted mutagenesis in Arabidopsis thaliana using zinc finger nucleases. Proc Natl Acad Sci USA vol 107, pp 12028-33.