  • TCs:

    Tentative Consensus sequences are created by assembling ESTs into virtual transcripts. In some cases, TCs contain full or partial cDNA sequences (ETs) obtained by classical methods. TCs contain information on the source library and abundance of ESTs and in many cases represent full-length transcripts. Alternative splice forms are built into separate TCs.

    TCs are actual assemblies, with a consensus sequence, and not simply clusters of overlapping sequences.

  • ESTs:

    Expressed Sequence Tags are partial, single-pass sequences from either end of a cDNA clone. The EST strategy was developed to allow rapid identification of expressed genes by sequence analysis.

  • ETs:

    The qcGene Database (qcGene) contains a set of nucleotide sequences that represent mature transcripts (ETs). The ETs are curated for nomenclature and links have been made to related accessions. Sequences were downloaded from GenBank (mRNAs and sequences derived from genomic sequences). RefSeq divisions of GenBank (both known and model) were also downloaded and parsed. Where available, 5' and 3' non-coding regions were included. Alternative splice forms of genes are explicitly represented.

  • Singleton ESTs:

    Also referred to as singletons, these are ESTs that are not contained in any assembly. They went through the assembly process but did not meet the match criteria (see below) to be assembled with any other EST in the collection of ESTs and other GenBank sequences used to create the consensus sequences for a particular Gene Index.
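
The distinction between assembled sequences and singletons can be illustrated with a minimal sketch. It assumes that some assembler has already assigned each sequence a cluster label; the function name, the label format, and the input mapping are hypothetical and are not part of the Gene Index pipeline.

```python
from collections import defaultdict


def split_assemblies(assignments):
    """Partition sequence IDs into multi-member assemblies and singletons.

    `assignments` maps a sequence ID to a cluster label (hypothetical
    output of an assembler). A cluster with a single member is a singleton.
    """
    clusters = defaultdict(list)
    for seq_id, label in assignments.items():
        clusters[label].append(seq_id)

    # Assemblies need at least two member sequences.
    assemblies = {
        label: sorted(ids) for label, ids in clusters.items() if len(ids) >= 2
    }
    # Everything that ended up alone in its cluster is a singleton.
    singletons = sorted(ids[0] for ids in clusters.values() if len(ids) == 1)
    return assemblies, singletons
```

For example, `split_assemblies({"est1": "c1", "est2": "c1", "est3": "c2"})` groups `est1` and `est2` into one assembly and reports `est3` as a singleton.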

Protocol for Assembly of ESTs and Transcripts

Preparation of EST data

  • Sequences were extracted from dbEST and subjected to quality control screening: removal of vector, E. coli, polyA, polyT, or CT sequence; a minimum length of 100 bp; and less than 3% N.
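
The length and ambiguity thresholds above can be expressed as a short filter. This is a minimal sketch that assumes vector, E. coli, and poly-tail trimming has already been applied to the input; the function and constant names are illustrative only and do not come from the original pipeline.

```python
MIN_LENGTH = 100        # minimum sequence length in bp (from the text)
MAX_N_FRACTION = 0.03   # fraction of N bases must be below 3% (from the text)


def passes_quality_screen(seq):
    """Return True if an already-trimmed sequence meets both thresholds."""
    if len(seq) < MIN_LENGTH:
        return False
    # Count ambiguous bases case-insensitively.
    n_fraction = seq.upper().count("N") / len(seq)
    return n_fraction < MAX_N_FRACTION
```

A 100 bp read with two N bases passes this screen, while a 96 bp read or a 100 bp read with four N bases does not.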

Preparation of transcript (ET) database

  • All sequences from the appropriate divisions of GenBank (including RefSeq) were extracted.
  • Non-coding sequences were discarded and cDNAs and coding sequences from genomic entries were saved.
  • Sequences and related information (e.g. PubMed links) are stored in the qcGene database (qcGene).


Assembly and annotation

  • Cleaned EST sequences and non-redundant transcript (ET) sequences were combined.
  • Using the Paracel Transcript Assembler Program, sequences were assembled into contigs. TCs are consensus sequences based on two or more ESTs (and possibly an ET) that overlap for at least 40 bases with at least 94% sequence identity. These strict criteria help minimize the creation of chimeric contigs. These contigs are assigned a TC (Tentative Consensus) number. TCs may comprise ESTs derived from different tissues.
  • The best hits for TCs were assigned by searching the TC set against a non-redundant amino acid database (nraa) using BLAT. The top five hits by score were selected and displayed for each TC.
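
The overlap criteria and the hit selection described above can be sketched as follows. This is not the Paracel Transcript Assembler or BLAT; it only illustrates the stated thresholds. It assumes the overlapping region is supplied as two equal-length aligned strings with "-" for gaps, and that identity is measured as matching columns divided by the aligned length, which is one of several possible definitions. All names are hypothetical.

```python
MIN_OVERLAP = 40      # minimum overlap in bases (from the text)
MIN_IDENTITY = 0.94   # minimum sequence identity (from the text)


def overlap_qualifies(aligned_a, aligned_b):
    """Check an aligned overlap against the length and identity thresholds.

    Assumption: identity = matching non-gap columns / alignment length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned strings must have equal length")
    length = len(aligned_a)
    if length < MIN_OVERLAP:
        return False
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-"
    )
    return matches / length >= MIN_IDENTITY


def top_hits(hits, n=5):
    """Return the n highest-scoring (name, score) pairs, best first."""
    return sorted(hits, key=lambda hit: hit[1], reverse=True)[:n]
```

Under these assumptions, a 40-base overlap with three mismatches (92.5% identity) is rejected, while a 50-base overlap with two mismatches (96% identity) is accepted.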


Caveats

  • TCs are only as good as the ESTs underlying them; unspliced or chimeric ESTs can produce unspliced or chimeric TCs
  • There is still redundancy in the TC set, because sequences must match end to end and at a certain percent identity to be combined
  • Directionality of the TCs should not be assumed
  • Not all TCs contain protein-coding regions

