The DFCI Gene Indices

Frequently Asked Questions About the DFCI Gene Indices


This page answers commonly asked questions about the DFCI Gene Indices.
Before you begin, please note:
  • All of the Gene Indices are built with the same structure, so the answers to the questions are valid for ANY of the gene indices.
  • All answers requiring navigation of TGI web pages begin from the main index page. Links to each species are provided there: click on the picture icon or the organism name.

Category: Terminology & Explanations
  1. What are the different identifiers and what do they mean?
  2. How are the gene indices constructed?
  3. How are the Gene Index sequences annotated and what is the meaning of the various qualifiers on your web pages?
  4. How are TC numbers given and how do you track a TC from release to release?
  5. What is in the alignment diagram in the TC Report?
  6. Is it possible to print out the TC Report with the alignment diagram?
  7. Is it possible to highlight a specific sequence in the alignment diagram?
  8. What is the orientation of a TC and how do you determine that?
  9. Do the arrows in the alignment diagram of a TC report indicate the orientation of its component sequences?
  10. What do the red/green tags in the TC sequence and in the alignment diagram of TC Reports mean?
  11. What does the gray background in the EST sequence in EST Reports mean?
  12. How does the DFCI Gene Index compare with UniGene?
  13. How are the GO terms assigned to THC/TCs? Is it possible for the GO terms to be different from the tentative annotation?
  14. How are the Medline IDs assigned to THC/TCs?
  15. What are the criteria for designing the unique oligos?

1. What are the different identifiers and what do they mean?

  • ESTs, "Expressed Sequence Tags", are partial, single-pass sequences from either end of a cDNA clone.
  • THCs, "Tentative Human Consensus" sequences, are assemblies of human ESTs.
  • TCs, "Tentative Consensus" sequences, are assemblies of non-human ESTs.
  • HT stands for "Human Transcript".
  • ET stands for "Expressed Transcript" and is used to denote a non-human transcript.
More descriptions can be found on the Definition page.
The identifiers (an abbreviation followed by a number, e.g. THC101183 or EST68924) correspond to sequences in the indices and are unique for each index. An identifier may be hot-linked to a report page that gives more information about the sequence it represents.
HTs and ETs are part of TIGR's Expressed Gene Anatomy Database (EGAD). They are included in the assembly process and may appear in the THC or TC reports as an NP#.

2. How are the gene indices constructed?

The EST sequences for a given organism are extracted from dbEST when a gene index is being built. The new sequences are screened for quality control purposes (vector, E. coli, and polyA/T trimming; length >= 100 bp; < 3% N).
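The length and N-content thresholds above can be sketched as a simple filter. This is only an illustration, not the pipeline's actual code: the function name is hypothetical, and it assumes the vector, E. coli, and polyA/T trimming has already been applied to the sequence.

```python
def passes_length_and_n_filter(seq: str, min_len: int = 100,
                               max_n_frac: float = 0.03) -> bool:
    """Return True if an already-trimmed EST is at least min_len bases
    long and its fraction of N (ambiguous) bases is below max_n_frac."""
    seq = seq.upper()
    if len(seq) < min_len:
        return False
    return seq.count("N") / len(seq) < max_n_frac
```

A 100 bp sequence with no ambiguous bases passes; a 96 bp sequence, or a 100 bp sequence with four N bases, does not.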
The ET sequences are extracted from appropriate divisions of GenBank and participate in the clustering and assembly process along with the cleaned ESTs.
ESTs and ETs are compared and clustered together if they meet the following criteria:

  • a minimum of 40 base pair match
  • greater than 94% identity in the overlap region
  • a maximum unmatched overhang of 30 base pairs
These clusters are then assembled into Tentative Consensus (TC) sequences using Paracel Transcript Assembler.
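The three clustering criteria can be written as a single predicate over a pairwise overlap. The sketch below is illustrative only: the `Overlap` record and its field names are assumptions made for the example, and computing the alignment itself is outside its scope.

```python
from dataclasses import dataclass


@dataclass
class Overlap:
    """Summary of one pairwise overlap (hypothetical record for this sketch)."""
    match_length: int        # length of the matching region, in base pairs
    percent_identity: float  # identity within the overlap region, 0-100
    max_overhang: int        # longest unmatched overhang, in base pairs


def meets_clustering_criteria(o: Overlap) -> bool:
    """Apply the thresholds listed above: a match of at least 40 bp,
    more than 94% identity, and an overhang of at most 30 bp."""
    return (o.match_length >= 40
            and o.percent_identity > 94.0
            and o.max_overhang <= 30)
```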

3. How are the Gene Index sequences annotated and what is the meaning of the various qualifiers on your web pages?

In the regular gene index building process, the Tentative Consensus sequences (TCs) are searched against a non-redundant protein database. In rare cases, when their number is not prohibitive, the singletons are also included in this protein search.
How is the "annotation" of a TC assigned? When expressed transcripts (ETs) are part of the assembly of a TC, their annotations are concatenated and shown in the Expressed Transcripts entry on the TC report page. In this case they also provide the Tentative Annotation for that TC sequence. When no ET sequences are included in the TC, the best-scoring protein hit is used as the Tentative Annotation.
The protein hits displayed in the Similarity search results table use the following assignment convention:

criterion                          qualifier
100% > percent identity >= 90%     homologue to
 90% > percent identity >= 70%     similar to
percent identity < 70%             weakly similar to
protein coverage > 98%             complete
protein coverage <= 98%            partial
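
The convention in the table above can be sketched as two small lookup functions. The function names are hypothetical. Note that the table does not say what qualifier applies at exactly 100% identity, so the sketch returns None in that case instead of guessing.

```python
from typing import Optional


def identity_qualifier(percent_identity: float) -> Optional[str]:
    """Map percent identity to the qualifier given in the table."""
    if 90.0 <= percent_identity < 100.0:
        return "homologue to"
    if 70.0 <= percent_identity < 90.0:
        return "similar to"
    if percent_identity < 70.0:
        return "weakly similar to"
    return None  # exactly 100% is not covered by the table


def coverage_qualifier(percent_coverage: float) -> str:
    """Map protein coverage to 'complete' (> 98%) or 'partial' (<= 98%)."""
    return "complete" if percent_coverage > 98.0 else "partial"
```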

4. How are TC numbers given and how do you track a TC from release to release?

For the Tentative Consensus (TC) numbering scheme, consecutive numbers are used to name the assemblies of a given Gene Index Release. The numbers will not be re-used in future Releases. To illustrate, the TC range for Release 1.0 may be TC1-TC1000. For Release 2.0, the assemblies will then start from TC1001. If you search with an old TC number, the current TC containing the majority of the old TC's component sequences will be returned.

The multiple TC numbers listed in the TC report are just a tracking record (history) of how that TC has evolved over different Releases.
For example, in the entry ">TC209054      TC55451 TC72468 TC89812 TC104422 TC130266 TC140024", the component sequences that were in TC55451 went into TC72468 in the next Release. In the same way, the sequences later went into TC140024 and eventually into TC209054 in the current Release. Tracking TC history is important because new ESTs are continually generated and new assemblies are made.
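
A tracking record like the one in the example can be split into the current TC and its earlier identifiers with a few lines of code. This is a minimal sketch: the function name is hypothetical, and it assumes the record is whitespace-separated with the current TC first, as in the example.

```python
from typing import List, Tuple


def parse_tc_history(record: str) -> Tuple[str, List[str]]:
    """Split a record such as '>TC209054  TC55451 TC72468' into the
    current TC identifier and the list of earlier identifiers."""
    ids = record.strip().lstrip(">").split()
    if not ids:
        raise ValueError("empty TC history record")
    return ids[0], ids[1:]


current, history = parse_tc_history(
    ">TC209054      TC55451 TC72468 TC89812 TC104422 TC130266 TC140024"
)
```

Here `current` is the TC in the current Release and `history` lists the identifiers it carried in earlier Releases, oldest first.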


5. What is in the alignment diagram in the TC Report?

The alignment diagram in the TC Report shows the layout of the component sequences relative to the consensus sequence (TC) and to each other. EST sequences are shown as black arrows and ET sequences as red arrows. Please see question 9 for an explanation of the directionality of the arrows. The numbers associated with the arrows correspond to entries in the component sequence table located below the alignment diagram. EST sequences that come from the 5' and 3' ends of the same cDNA clone (clone mates) share the same number and are distinguished by the letter a or b. The clone mates are also shown in the Opposite Ends table in the TC Report. Please see question 10 for an explanation of the red and green tags.


6. Is it possible to print out the TC Report with the alignment diagram?

Yes. Clicking the 'Printable version' button located at the bottom right corner of a TC Report lets you print the TC Report with the sequence alignment diagram.


7. Is it possible to highlight a specific sequence in the alignment diagram?

Yes. By supplying an extra parameter in the TC Report URL, you can highlight a specific sequence or sequences in the alignment diagram. Please see question 1 in the Availability category for how to construct a TC Report URL. Append &highlight= followed by the accessions (GenBank accession, clone name, source, or EST ID). Multiple accessions can be entered using a comma as the delimiter.

    Example:
    http://biocomp.dfci.harvard.edu/cgi-bin/tgi/tc_report.pl?tc=TC5926&species=A.salmon&highlight=AJ532826
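
If you build such URLs in a script, the following sketch may help. The base address and the parameter names (tc, species, highlight) are taken from the example URL above; the helper function itself is hypothetical and not part of the site.

```python
from typing import Sequence
from urllib.parse import urlencode

# Base address copied from the example URL above.
TC_REPORT_BASE = "http://biocomp.dfci.harvard.edu/cgi-bin/tgi/tc_report.pl"


def tc_report_url(tc: str, species: str, highlight: Sequence[str] = ()) -> str:
    """Build a TC Report URL, optionally highlighting one or more accessions."""
    params = {"tc": tc, "species": species}
    if highlight:
        # Multiple accessions are joined with commas, as described above.
        params["highlight"] = ",".join(highlight)
    # safe="," keeps the comma delimiter unescaped in the query string.
    return TC_REPORT_BASE + "?" + urlencode(params, safe=",")


url = tc_report_url("TC5926", "A.salmon", ["AJ532826"])
```

With these arguments, `url` is identical to the example URL above.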

8. What is the orientation of a TC and how do you determine that?

The DNA template strand for a TC is inferred by an orientation voting scheme and can be 'coding strand', 'reverse complement', or 'undetermined'. The evidence considered is the directionality of the component sequences (5' EST, 3' EST, ETs), the orientation of the protein hits, and the presence of polyA/T in the ESTs. If a TC is determined to be from the reverse complement strand, the sequence is reverse-complemented to give the coding strand. Please check the orientation table in the TC Report for examples.
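
To give a feel for what a voting scheme of this kind looks like, here is a deliberately simplified sketch. It is not the scheme used for the Gene Indices: the answer above does not say how the evidence is weighted, so this example assumes one equal vote per piece of evidence and treats a tie, or no evidence at all, as 'undetermined'.

```python
from collections import Counter
from typing import Iterable


def vote_orientation(votes: Iterable[str]) -> str:
    """Return the majority orientation among the votes.

    Each vote is 'coding strand' or 'reverse complement'; any other
    value is ignored. Equal weighting is an assumption of this sketch.
    """
    counts = Counter(v for v in votes
                     if v in ("coding strand", "reverse complement"))
    forward = counts["coding strand"]
    reverse = counts["reverse complement"]
    if forward > reverse:
        return "coding strand"
    if reverse > forward:
        return "reverse complement"
    return "undetermined"
```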


9. Do the arrows in the alignment diagram of a TC report indicate the orientation of its component sequences?

The orientation of ESTs and Transcripts in an assembly cannot be assumed to be 5' to 3'. The arrows that appear in the alignment diagram indicate the direction of the sequence in relation to the consensus sequence only. The arrow "----->" means that the primary strand of an EST was used, and the arrow "<------" means that the reverse complement of an EST was used in the assembly construction.
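
For readers unfamiliar with the term, the reverse complement of a sequence is obtained by complementing each base and reversing the result. A minimal sketch (the function name is ours, and only A, C, G, T, and N are handled):

```python
# Translation table for complementing bases; N stays N, case is preserved.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]
```

For example, the reverse complement of AATAAA is TTTATT.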


10. What do the red/green tags in the TC sequence and in the alignment diagram of TC Reports mean?

A polyadenylation signal detected in the TC sequence is tagged with a red background. The screening is based on locating the hexanucleotide sequence 'AATAAA' in the last 30 nucleotides of the sequence (black letters over red) or 'TTTATT' in the first 30 nucleotides (white letters over red).
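
The screening rule just described can be sketched as follows. The function name and return format are assumptions for illustration; the sketch reports only the first occurrence of each hexamer in its window, and it requires the hexamer to lie entirely inside the 30-nucleotide window.

```python
from typing import List, Tuple


def find_polya_signals(seq: str, window: int = 30) -> List[Tuple[str, int]]:
    """Return (hexamer, 0-based position) pairs for 'AATAAA' found in the
    last `window` nucleotides and 'TTTATT' found in the first `window`."""
    seq = seq.upper()
    hits = []
    tail_start = max(0, len(seq) - window)
    pos = seq.find("AATAAA", tail_start)  # search only the 3' window
    if pos != -1:
        hits.append(("AATAAA", pos))
    pos = seq[:window].find("TTTATT")     # search only the 5' window
    if pos != -1:
        hits.append(("TTTATT", pos))
    return hits
```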

Trimming of polyA/polyT from ESTs is also shown in the alignment diagram in the TC report.

  • [tag image] represents a trimmed polyT at the start of an EST sequence;
  • [tag image] represents a trimmed polyA at the end of an EST sequence.
Both are shown in the relative position they occupy in the original sequence.

11. What does the gray background in the EST sequence in EST Reports mean?

The EST report shows the entire sequence of the EST, with the trimmed regions on a gray background (this includes polyA/T trimming, vector and adaptor sequences, and low-quality sequence). The trimming is done in the EST cleaning process for quality control purposes (see question 2). Only the region on a white background (the cleaned region) actually enters the gene index build process.


12. How does the DFCI Gene Index compare with UniGene?

All DFCI Gene Index assemblies are made by first clustering the EST sequences and then assembling these clusters into consensus sequences, or THC/TCs. EST sequences and transcripts are compared and clustered together if they meet the following criteria:

  • a minimum of 40 base pair match
  • greater than 94% identity in the overlap region
  • a maximum unmatched overhang of 30 base pairs
These clusters are then assembled into consensus sequences using Paracel Transcript Assembler.

UniGene links ESTs in a cluster if the sequences have a 50 base pair overlap in the 3' untranslated region (UTR) with 100% identity. These clusters are not run through the more stringent assembly process and consensus sequences are not made. For this reason you will often find several DFCI THCs contained within one UniGene cluster.


13. How are the GO terms assigned to THC/TCs? Is it possible for the GO terms to be different from the tentative annotation?

In the regular gene index building process, the TCs are searched against a non-redundant protein database. The top protein hit that has 75% similarity, 50% coverage, and a non-automatic GO assignment is taken. The GO terms assigned to that protein are transferred to the THC/TC. The protein hit taken here may or may not agree with the tentative annotation (the annotation process is described in question 3).
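
The selection step can be sketched as below. Several details are assumptions made for this illustration and are not stated in the answer above: the thresholds are treated as minimums (>= 75% similarity, >= 50% coverage), "top" is taken to mean the highest search score, and each hit is assumed to arrive with its non-automatic GO terms already separated out. The `ProteinHit` record is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class ProteinHit:
    """One protein search hit (hypothetical record for this sketch)."""
    accession: str
    score: float
    percent_similarity: float
    percent_coverage: float
    # GO terms with a non-automatic assignment; empty if there are none.
    curated_go_terms: List[str] = field(default_factory=list)


def go_terms_for_tc(hits: Sequence[ProteinHit]) -> List[str]:
    """Return the GO terms of the best-scoring eligible hit, or [] if none."""
    eligible = [h for h in hits
                if h.percent_similarity >= 75.0
                and h.percent_coverage >= 50.0
                and h.curated_go_terms]
    if not eligible:
        return []
    best = max(eligible, key=lambda h: h.score)
    return list(best.curated_go_terms)
```

Because the best-scoring hit overall may lack a non-automatic GO assignment, the hit selected here can differ from the one used for the tentative annotation.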


14. How are the Medline IDs assigned to THC/TCs?

The TC report page shows only one identifier for each of the top 5 protein hits. Proteins whose sequences are identical to those listed in the "Similarity search results" table of the TC report can appear in different databases (GenBank, SwissProt, PIR) under different accessions and can be mentioned in different publications. We collect all Medline IDs found in documents that mention a protein whose sequence is identical to that of one of the top 5 hits.


15. What are the criteria for designing the unique oligos?

We're essentially using the OligoPicker algorithm kindly provided by Xiaowei Wang at Harvard (http://pga.mgh.harvard.edu/oligopicker/) for this purpose.
The design criteria are as follows:

  • A primary 15-mer contiguous match filter (i.e., every 15-mer in the oligo is unique).
  • An oligo blast score < 30 (this corresponds to a perfect 15-mer match).
  • Oligo Tm is within 5 degrees of the median Tm (calculated based on the GC content of the 70-mers).
  • Oligo doesn't have a low-complexity region (this includes screening for organism specific repeats where available).
  • Oligo doesn't cross-react with rRNA and snRNA filter sequences from the same and related organisms.
  • Oligo doesn't self-anneal to the cDNA complement.
  • Oligo should be readily accessible for hybridisation and should preferably be designed outside of any region that has a high propensity for secondary structure formation. A Gibbs free-energy threshold > -20 kcal/mol is chosen for this purpose.
  • The unique oligo is designed from the region that lies within 1000 residues from the 3' end.
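
Three of the criteria above are simple enough to illustrate in a few lines. The sketch below is not OligoPicker and does not reproduce its implementation: the function names are hypothetical, the melting temperatures are assumed to be computed elsewhere, and the BLAST score, low-complexity, cross-reaction, self-annealing, and free-energy checks are not covered.

```python
from statistics import median
from typing import Iterable, Sequence, Set


def kmers(seq: str, k: int = 15) -> Set[str]:
    """All contiguous k-mers of seq (empty set if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shares_kmer(oligo: str, other_seqs: Iterable[str], k: int = 15) -> bool:
    """True if any k-mer of the oligo also occurs in another sequence,
    i.e. the oligo fails the contiguous-match filter."""
    oligo_kmers = kmers(oligo.upper(), k)
    return any(oligo_kmers & kmers(s.upper(), k) for s in other_seqs)


def within_tm_window(tm: float, all_tms: Sequence[float],
                     window: float = 5.0) -> bool:
    """True if tm lies within `window` degrees of the median Tm."""
    return abs(tm - median(all_tms)) <= window


def candidate_region(transcript: str, max_from_3prime: int = 1000) -> str:
    """The part of the transcript within max_from_3prime residues of the 3' end."""
    return transcript[-max_from_3prime:]
```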

Return to TGI main page.
Comments and suggestions: contact us.

Acknowledgements
   The Gene Index Project is supported in part by funding from the US National Science Foundation through grant #DBI-0552416.