TIGR Bos taurus Gene Index
Definitions:
Protocol for Assembly of ESTs and Transcripts
Preparation of EST data
- Sequences were extracted from dbEST and subjected to quality-control screening: removal of vector, poly(A), poly(T), or CT sequence; a minimum length of 100 bp; and less than 3% N.
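The length and N-content thresholds above can be illustrated with a minimal Python sketch. It assumes vector, poly(A), poly(T), and CT trimming has already been done by some other step and only applies the two numeric thresholds stated in the protocol. The function names and the (identifier, sequence) record layout are hypothetical and are not part of the TIGR pipeline.

```python
def passes_quality_screen(seq, min_length=100, max_n_fraction=0.03):
    """Return True if an already-trimmed EST meets both numeric thresholds.

    Assumption: trimming of vector, poly(A), poly(T), and CT sequence
    happened upstream; this function does not perform it.
    """
    seq = seq.upper()
    # Minimum length = 100 bp, so a 100 bp sequence is accepted.
    if len(seq) < min_length:
        return False
    # Strictly less than 3% N, as stated in the protocol.
    n_fraction = seq.count("N") / len(seq)
    return n_fraction < max_n_fraction


def screen_ests(records):
    """Keep only the (identifier, sequence) pairs that pass the screen."""
    return [(ident, seq) for ident, seq in records if passes_quality_screen(seq)]
```

For example, `screen_ests([("est1", "A" * 100), ("est2", "A" * 10)])` keeps only `est1`, because `est2` is shorter than 100 bp.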
Preparation of non-redundant Bos taurus transcript (ET) database
- All Bos taurus sequences from GenBank were extracted.
- Non-coding sequences were discarded; cDNAs and coding sequences from genomic entries were retained.
- Redundant entries for the same gene were removed, retaining a link to each accession number.
- Sequences and related information are stored in TIGR's Expressed Gene Anatomy Database (EGAD).
- The curated ET data set is available as a multi-sequence FASTA file. Please see the EGAD main page for more information.
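The idea of removing redundant entries while retaining their accession numbers can be sketched as follows. The protocol does not say how redundancy was detected, so this Python sketch assumes the simplest possible rule, exact sequence identity (ignoring case), purely for illustration. The function name and the accession strings in the usage note are hypothetical.

```python
def collapse_redundant(entries):
    """Collapse entries with identical sequences, keeping every accession.

    Assumption: two entries are redundant only if their sequences match
    exactly (case-insensitive). The real curation rule is not documented
    here and was likely more permissive.
    """
    merged = {}  # sequence -> accessions, in first-seen order
    for accession, seq in entries:
        merged.setdefault(seq.upper(), []).append(accession)
    # One record per unique sequence, with all linked accessions.
    return [(accessions, seq) for seq, accessions in merged.items()]
```

Called on `[("X1", "acgt"), ("X2", "ACGT"), ("X3", "GGCC")]`, this returns one record for `ACGT` linked to both `X1` and `X2`, and one record for `GGCC` linked to `X3`.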
Assembly
- Cleaned EST sequences and non-redundant transcript (ET) sequences were combined.
- Sequences were assembled into contigs using the CAP3 Sequence Assembly Program. Each contig is assigned a TC (Tentative Bos taurus Consensus) number. TCs are consensus sequences based on two or more ESTs (and possibly an ET) that overlap for at least 40 bases with at least 95% sequence identity. These strict criteria help minimize the creation of chimeric contigs. A TC may comprise ESTs derived from different tissues.
- The best hits for TCs were assigned by searching the TC set against a non-redundant amino acid database (nraa) using DNA-Protein Search (DPS), a search method developed in-house (Microbial and Comparative Genomics, Vol. 1, No. 4, 1996). For each TC, the top five hits by score (cutoff value of 350) were selected and displayed.
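The numeric thresholds in the assembly steps above can be written out as a short Python sketch. It does not reimplement CAP3 or DPS; it only checks an already-computed overlap against the 40-base and 95% identity criteria, and filters an already-computed hit list by the score cutoff of 350. Both function names and the (name, score) hit layout are hypothetical, and treating a score of exactly 350 as passing is an assumption, since the protocol does not say whether the cutoff is inclusive.

```python
def overlap_qualifies(overlap_length, matching_bases,
                      min_overlap=40, min_identity=0.95):
    """Check an aligned overlap against the stated TC thresholds."""
    # At least 40 overlapping bases.
    if overlap_length < min_overlap:
        return False
    # At least 95% identity within the overlap.
    return matching_bases / overlap_length >= min_identity


def select_best_hits(hits, cutoff=350, max_hits=5):
    """Return up to max_hits (name, score) pairs, best score first.

    Assumption: a hit scoring exactly the cutoff is kept.
    """
    kept = [hit for hit in hits if hit[1] >= cutoff]
    kept.sort(key=lambda hit: hit[1], reverse=True)
    return kept[:max_hits]
```

An overlap of 40 bases with 39 matches qualifies (97.5% identity), whereas 40 bases with 37 matches (92.5%) does not. A TC whose hits all score below 350 ends up with an empty hit list.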
Caveats
- TCs are only as good as the ESTs underlying them; chimeric ESTs may exist and can give rise to chimeric TCs.
- Some redundancy remains in the TC set because sequences must match end to end, at a given percent identity, to be combined.
- Directionality of the TCs should not be assumed.
- Not all TCs contain protein-coding regions.
References related to EST strategy [Entrez links]
Adams, MD et al., "Complementary DNA sequencing: expressed sequence tags and human genome project", Science 252: 1651-6 (1991) [91262645]
Adams, MD et al., "Sequence identification of 2,375 human brain genes", Nature 355: 632-4 (1992) [92168112]
Adams, MD et al., "3,400 new expressed sequence tags identify diversity of transcripts in human brain", Nat Genet 4: 256-67 (1993) [93364420]
Adams, MD et al., "Rapid cDNA sequencing (expressed sequence tags) from a directionally cloned human infant brain cDNA library", Nat Genet 4: 373-80 (1993) [94004965]
Adams, MD et al., "Initial assessment of human gene diversity and expression patterns based upon 83 million nucleotides of cDNA sequence", Nature 377(Suppl.): 3-174 (1995) [96026280]