The purpose of this page is to provide answers to commonly asked questions about the DFCI Gene Indices.
| criterion | qualifier |
|---|---|
| 100% > percent identity >= 90% | homologue to |
| 90% > percent identity >= 70% | similar to |
| percent identity < 70% | weakly similar to |
| protein coverage > 98% | complete |
| protein coverage <= 98% | partial |
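The thresholds in the table above can be read as a simple lookup. The following is a minimal sketch, not the project's actual code; the function names are hypothetical, and returning no qualifier at exactly 100% identity is an assumption, since the table does not list that case.

```python
from typing import Optional, Tuple


def identity_qualifier(percent_identity: float) -> Optional[str]:
    """Map percent identity to the qualifier listed in the table."""
    if percent_identity >= 100:
        # Assumption: the table lists no qualifier for 100% identity.
        return None
    if percent_identity >= 90:
        return "homologue to"
    if percent_identity >= 70:
        return "similar to"
    return "weakly similar to"


def coverage_qualifier(protein_coverage: float) -> str:
    """Map protein coverage (in percent) to 'complete' or 'partial'."""
    return "complete" if protein_coverage > 98 else "partial"


def qualifiers(percent_identity: float, protein_coverage: float) -> Tuple[Optional[str], str]:
    """Return both qualifiers for one protein hit."""
    return identity_qualifier(percent_identity), coverage_qualifier(protein_coverage)
```

For instance, a hit with 92% identity and 99% coverage would be labelled "homologue to" and "complete" under this reading.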
4. How are TC numbers given and how do you track a TC from release to release?
In the Tentative Consensus (TC) numbering scheme, consecutive numbers are used to name the assemblies in a given Gene Index Release. The numbers are not re-used in future Releases. To illustrate, if the TC range for Release 1.0 is TC1-TC1000, the assemblies for Release 2.0 will start from TC1001. If you search with an old TC, the current TC containing the majority of the old TC's component sequences is returned.
The multiple TC numbers listed in the TC Report are a tracking record (history) of
how that TC has evolved over successive Releases.
For example, in the record >TC209054 TC55451 TC72468 TC89812 TC104422 TC130266 TC140024,
the component sequences that were in TC55451 went into TC72468 in the next Release.
In the same way, the sequences later went into TC140024 and ended up in TC209054
in the current Release. Tracking TC history is important because new ESTs are generated and new assemblies are made.
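A history record like the one in this example can be split into the current TC and its predecessors. This is a minimal sketch that assumes the record format shown here (a leading ">" followed by whitespace-separated TC identifiers, current TC first, earlier TCs in Release order); the function name is hypothetical.

```python
from typing import List, Tuple


def parse_tc_history(record: str) -> Tuple[str, List[str]]:
    """Split a TC history record into (current TC, list of earlier TCs)."""
    # Drop the leading '>' and any stray commas, then split on whitespace.
    tokens = [t.strip(",") for t in record.lstrip(">").split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        raise ValueError("empty TC history record")
    return tokens[0], tokens[1:]


current, history = parse_tc_history(
    ">TC209054      TC55451 TC72468 TC89812 TC104422 TC130266 TC140024"
)
print(current)   # the TC in the current Release
print(history)   # earlier TCs, oldest first (assumed ordering)
```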
5. What is in the alignment diagram in the TC Report?
The alignment diagram in the TC Report shows the layout of the component sequences relative to the consensus sequence (TC) and to each other. EST sequences are shown as black arrows and ET sequences as red arrows. Please see question 9 for an explanation of the directionality of the arrows. The numbers associated with the arrows correspond to entries in the component sequence table located below the alignment diagram. EST sequences from the 5' and 3' ends of the same cDNA clone (clone mates) share the same number and are distinguished by the letter a or b. The clone mates are also listed in the Opposite Ends table in the TC Report. Please see question 10 for an explanation of the red and green tags.
6. Is it possible to print out the TC Report with the alignment diagram?
Yes. Clicking the 'Printable version' button at the bottom right corner of a TC Report lets you print the TC Report together with the sequence alignment diagram.
7. Is it possible to highlight a specific sequence in the alignment diagram?
Yes. By supplying an extra parameter in the TC Report URL, you can highlight a specific sequence or sequences in the alignment diagram. Please see question 1 in the Availability category for how to construct a TC Report URL. Append &highlight= followed by the accessions (GenBank accession, clone name, source, or EST ID). Multiple accessions can be entered using a comma as the delimiter.
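As a rough illustration of building such a URL, the sketch below appends the highlight parameter to an existing TC Report URL. The base URL and the accessions are placeholders, not real addresses or records, and the helper name is hypothetical; the sketch assumes the TC Report URL already carries a query string, as the "&" in &highlight= implies.

```python
from typing import Iterable
from urllib.parse import quote


def add_highlight(tc_report_url: str, accessions: Iterable[str]) -> str:
    """Append &highlight= with comma-separated accessions to a TC Report URL."""
    # Percent-encode each accession; the commas between them stay literal.
    value = ",".join(quote(acc, safe="") for acc in accessions)
    return f"{tc_report_url}&highlight={value}"


# Placeholder URL and accessions, for illustration only.
url = add_highlight("http://example.org/tc_report?tc=TC209054", ["AB000001", "AB000002"])
print(url)
```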
8. What is the orientation of a TC and how do you determine that?
The DNA template strand for a TC is inferred by an orientation voting scheme and can be 'coding strand', 'reverse complement', or 'undetermined'. The evidence for this inference comes from the directionality of the component sequences (5' ESTs, 3' ESTs, ETs), the orientation of the protein hits, and the presence of polyA/T in the ESTs. If a TC is determined to be from the reverse complement strand, the sequence is reverse-complemented to give the coding strand. Please check the orientation table in the TC Report for examples.
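The exact voting rules are not described here, so the following is only a toy sketch of the general idea. It assumes each piece of evidence casts an equally weighted "+" (coding strand) or "-" (reverse complement) vote and that a tie gives 'undetermined'; the real scheme may weight evidence differently. All names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def vote_orientation(votes: Iterable[str]) -> str:
    """Simple majority vote over '+' and '-' evidence (assumed equal weights)."""
    counts = Counter(votes)
    if counts["+"] > counts["-"]:
        return "coding strand"
    if counts["-"] > counts["+"]:
        return "reverse complement"
    return "undetermined"


def orient(seq: str, votes: Iterable[str]) -> Tuple[str, str]:
    """Return the sequence on the inferred coding strand, plus the call."""
    call = vote_orientation(votes)
    if call == "reverse complement":
        return reverse_complement(seq), call
    return seq, call
```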
9. Do the arrows in the alignment diagram of a TC report indicate the orientation of its component sequences?
The orientation of ESTs and transcripts in an assembly cannot be assumed to be 5' to 3'. The arrows in the alignment diagram indicate only the direction of each sequence relative to the consensus sequence. The arrow "----->" means that the primary strand of an EST was used, and the arrow "<------" means that the reverse complement of an EST was used in the assembly construction.
10. What do the red/green tags in the TC sequence and in the alignment diagram of TC Reports mean?
A polyadenylation signal detected in the TC sequence is tagged with a red background.
The screening looks for the hexanucleotide sequence 'AATAAA' in the last
30 nucleotides of the sequence (black letters over red)
or 'TTTATT' in the first 30 nucleotides (white letters over red).
EST trimming of polyA/polyT is also shown in the alignment diagram in the TC Report.
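The screening rule described above can be sketched as a pair of substring searches. This is an illustrative reading of the rule, not the project's code: it assumes the hexamer must lie entirely within the 30-nucleotide window, and the function name is hypothetical.

```python
from typing import List, Tuple


def find_polya_signal(seq: str, window: int = 30) -> List[Tuple[str, int]]:
    """Return (hexamer, 0-based position) for each polyadenylation signal found.

    Looks for 'AATAAA' in the last `window` nucleotides and for
    'TTTATT' in the first `window` nucleotides.
    """
    s = seq.upper()
    hits = []

    # 'AATAAA' must start inside the last `window` nucleotides.
    tail_start = max(0, len(s) - window)
    pos = s.find("AATAAA", tail_start)
    if pos != -1:
        hits.append(("AATAAA", pos))

    # 'TTTATT' must lie entirely within the first `window` nucleotides.
    pos = s[:window].find("TTTATT")
    if pos != -1:
        hits.append(("TTTATT", pos))

    return hits
```

A sequence with 'AATAAA' only near its 5' end returns no hit, because that hexamer is searched for only in the last 30 nucleotides.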
11. What does the gray background in the EST sequence in EST Reports mean?
The EST Report shows the entire sequence of the EST, with the trimmed regions on a gray background (this includes polyA/T trimming, vector and adaptor sequences, and low-quality sequence). The trimming is done in the EST cleaning process for quality control purposes (see question 2). Only the region on a white background (the cleaned region) actually enters the gene index build process.
12. How does the DFCI Gene Index compare with UniGene?
All DFCI Gene Index assemblies are made by first clustering the EST sequences and then assembling these clusters into consensus sequences, or THC/TCs. EST sequences and transcripts are compared and clustered together if they meet the following criteria:
UniGene links ESTs in a cluster if the sequences have a 50 base pair overlap in the 3' untranslated region (UTR) with 100% identity. These clusters are not run through the more stringent assembly process and consensus sequences are not made. For this reason you will often find several DFCI THCs contained within one UniGene cluster.
13. How are the GO terms assigned to THC/TCs? Is it possible for the GO terms to be different from the tentative annotation?
In the regular gene index building process, the TCs are searched against a non-redundant protein database. The top protein hit that has 75% similarity and 50% coverage and carries a non-automatic GO assignment is taken. The GO terms assigned to that protein are transferred to the THC/TC. The protein hit taken here may or may not agree with the tentative annotation (the annotation process is described in question 3).
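The selection step can be pictured as a filter over a ranked hit list. The sketch below is an assumption-laden illustration: it treats the 75% and 50% figures as minimum thresholds, assumes hits arrive sorted best-first, and uses a hypothetical `ProteinHit` record whose field names are not taken from the Gene Index software.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProteinHit:
    """Hypothetical record for one protein search hit."""
    accession: str
    similarity: float          # percent similarity
    coverage: float            # percent coverage
    go_terms: List[str] = field(default_factory=list)
    go_is_automatic: bool = True


def pick_go_source(hits: List[ProteinHit],
                   min_similarity: float = 75.0,
                   min_coverage: float = 50.0) -> Optional[ProteinHit]:
    """Return the first hit (assumed best-first) that passes both
    thresholds and has a non-automatic GO assignment, else None."""
    for hit in hits:
        if (hit.similarity >= min_similarity
                and hit.coverage >= min_coverage
                and hit.go_terms
                and not hit.go_is_automatic):
            return hit
    return None
```

Because the chosen hit must carry a non-automatic GO assignment, it can differ from the hit used for the tentative annotation, which is why the two may disagree.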
14. How are the Medline IDs assigned to THC/TCs?
The TC Report page shows only one identifier for each of the top five protein hits. Proteins whose sequences are identical to those listed in the "Similarity search results" table of the TC Report can appear in different databases (GenBank, SwissProt, PIR) under different accessions and can be mentioned in different publications. We collect all Medline IDs found in documents that mention a protein whose sequence is identical to that of one of the top five hits.
15. What are the criteria for designing the unique oligos?
We essentially use the OligoPicker algorithm, kindly provided by Xiaowei Wang at Harvard (http://pga.mgh.harvard.edu/oligopicker/), for this purpose.
The design criteria are as follows:
| Acknowledgements |
|---|
| The Gene Index Project is supported in part by funding from the US National Science Foundation through grant #DBI-0552416. |