The Gene Indices group is committed to making software tools freely
available to the scientific community. The software provided on this
page represents modified versions of our internal software, designed to
run independently of the operating system environment (i.e. not linked
to the database and software environment at DFCI).
This software is available free of charge under the Artistic License;
download and use of this software constitutes agreement to abide by the
terms of the Artistic License.
If you would like to be notified about software updates, please
click on the package logo and fill in your contact information.
The Gene Index Project and tools are supported in part by funding from the US
Department of Energy, Grant #DE-FG02-99ER62852, and the US National Science Foundation, Grant
#DBI-9983070. Additional funds are provided by the US National Science Foundation
through grants #DBI-9813392 and #DBI-9975866.
TGI Clustering tools (TGICL): a software system for fast clustering of large EST datasets
This package automates clustering and assembly of a large EST/mRNA dataset. The clustering is performed by a
slightly modified version of NCBI's megablast, and the resulting clusters are then assembled using the CAP3 assembly program.
TGICL starts with a large multi-FASTA file (and an optional peer quality values file) and outputs the assembly files as
produced by CAP3. Both clustering and assembly phases can be parallelized by distributing the searches and
the assembly jobs across multiple CPUs, as TGICL can take advantage of either SMP machines or PVM
(Parallel Virtual Machine) clusters.
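The idea of distributing work over a large multi-FASTA file can be illustrated with a minimal Python sketch. It is not taken from TGICL or psx: the function names (read_fasta_records, chunk_records) and the fixed chunk size are assumptions chosen for illustration. It only shows how records might be grouped into independent chunks that separate CPUs could then search or assemble.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def read_fasta_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield each FASTA record (header plus sequence lines) as one string."""
    record: List[str] = []
    for line in lines:
        # A new header closes the record collected so far.
        if line.startswith(">") and record:
            yield "".join(record)
            record = []
        if line.strip():  # skip blank lines
            record.append(line if line.endswith("\n") else line + "\n")
    if record:
        yield "".join(record)


def chunk_records(records: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Group records into lists of at most chunk_size, one list per job."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    demo = ">est1\nACGT\n>est2\nGGCC\nTTAA\n>est3\nAAAA\n".splitlines(keepends=True)
    for number, chunk in enumerate(chunk_records(read_fasta_records(demo), 2), 1):
        print(f"job {number}: {len(chunk)} record(s)")
```

Each chunk could then be handed to a separate worker process; how TGICL actually schedules jobs on SMP machines or PVM clusters is not shown here.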
The two full precompiled packages below were built on Linux and SunOS, respectively.
They include CAP3, mgblast and all the other binaries for the respective platform (except, of course, for
base Unix utilities such as 'sed' and 'sort').
Please note that only the Linux version was thoroughly tested at DFCI.
- tgicl_linux.tar.gz Linux x86 (glibc-2.1, requires perl>=v5.6).
- tgicl_sunos.tar.gz built on SunOS 5.8 sparc SUNW, Ultra-250
The platform-independent Perl scripts can also be downloaded separately.
These scripts are likely to be updated more often than
the binaries, which are rather stable. When only the scripts change, there is no need to
download the full, large packages provided above again, although
those packages also include the scripts.
Please note that the entire project is hosted at sourceforge.net, where you can find all previous versions, including the source code. Just click the "View all Files" button and browse the versions listed there.
For advanced users, a more complex version of TGICL, with its own custom
incremental assembler and extensive grid support (SGE and Condor), can be downloaded here:
(this package contains the scripts and all the necessary pre-compiled binaries for 32bit Linux systems;
for the full source of these binaries see below).
CAP3 Huang, X. and Madan, A. (1999) CAP3: A DNA Sequence Assembly Program. Genome Research, 9: 868-877.
megablast Zhang, Z., Schwartz, S., Wagner, L. and Miller, W. (2000) A greedy algorithm for aligning DNA sequences. J Comput Biol, 7(1-2): 203-214.
NCBI Toolkit a great
resource of bioinformatics source code and tools, provided by the U.S.
National Center for Biotechnology Information.
If you want to build these programs from source, the C++ source code
for most of the tools is provided below. We do not include here the source code
for CAP3, the formatdb utility, or the other standard BLAST-related tools that are part of the NCBI Toolkit.
Please note that in order to compile mgblast, a specific
version of the NCBI Toolkit is needed (the C version, not the C++ one; check the README file in the 'mgblast' source package
below for the specific version of the NCBI Toolkit that should be installed). CAP3 is closed source but one can and
should download the latest public binary build for the target platform in order to use it with TGICL.
The author of CAP3, Dr. Xiaoqiu Huang,
may be contacted for CAP3-specific issues, and currently the CAP3 binary for multiple platforms can be downloaded from:
Our source packages:
- mdust.tar.gz Standalone low complexity ("dust") filter.
- psx.tar.gz a parallel multi-FASTA file processing tool.
- pvmsx.tar.gz a parallel multi-FASTA file processing tool using PVM
- cdbfasta.tar.gz fasta file indexing/retrieving utility (described below)
- mgblast.tar.gz a customized version of megablast (requires NCBI C Toolkit for compilation)
The following utilities depend on gclib, a genomic C++ library of
utility classes and functions,
which should be downloaded first in order to compile them.
- zmsort.tar.gz a merge-sort utility for compressed alignment files, with multi-file output option
- tclust.tar.gz a transitive closure clustering tool with overlap filtering options.
- sclust.tar.gz a seeded clustering tool that processes pairwise alignments.
- nrcl.tar.gz a containment clustering and layout utility that processes pairwise alignments.
- mgmerge.tar.gz a merge utility for compressed alignment data (used by gicl only)
- mblasm.tar.gz a sequence assembler from pairwise gapped alignments (used by gicl only)
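The transitive closure clustering that tclust is described as performing can be illustrated with a short Python sketch. This is not the tclust source: it assumes the pairwise alignments have already been filtered and reduced to pairs of sequence IDs, and it uses a union-find structure, which is one common way to compute such clusters. The function name transitive_closure_clusters is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def transitive_closure_clusters(pairs: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Cluster IDs so that any two IDs linked by a chain of pairs share a cluster."""
    parent: Dict[str, str] = {}

    def find(item: str) -> str:
        """Return the representative of item's cluster, compressing the path."""
        parent.setdefault(item, item)
        root = item
        while parent[root] != root:
            root = parent[root]
        while parent[item] != root:
            next_item = parent[item]
            parent[item] = root
            item = next_item
        return root

    # Each pair (assumed to be an accepted overlap) merges two clusters.
    for first, second in pairs:
        root_first, root_second = find(first), find(second)
        if root_first != root_second:
            parent[root_second] = root_first

    clusters: Dict[str, List[str]] = defaultdict(list)
    for item in list(parent):
        clusters[find(item)].append(item)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    overlaps = [("est1", "est2"), ("est2", "est3"), ("est4", "est5")]
    for members in transitive_closure_clusters(overlaps):
        print(" ".join(members))
```

In this sketch est1 and est3 end up in the same cluster even though they are never paired directly, which is the defining property of transitive closure.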
clview: an assembly file viewer.
This is a graphical, interactive tool for inspecting the ACE format assembly files generated by
CAP3 or phrap. Besides ACE files, the program also supports a custom cluster layout format
that gives an overview of possible multiple alignments derived from pairwise alignments alone,
where no detailed nucleotide-level alignment is needed or provided. The
"containment clustering" program (nrcl) mentioned in the TGI Clustering
tools (TGICL) above can generate such a "cluster layout" file (*.lyt).
Here is a precompiled Linux version with the required dynamic FOX library included:
The program was built using the FOX toolkit by Jeroen van der Zijp,
a portable and feature-rich C++ framework for developing graphical user interfaces under Unix and Windows.
In order to compile the source code of this viewer,
you need to download and install the FOX library.
SeqClean: a script for automated trimming and validation of ESTs or other DNA sequences
by screening for various contaminants, low quality and low-complexity sequences.
Precompiled Linux versions can be found here:
seqclean.tar.gz (32-bit). There is also a 64-bit version, seqclean_x86_64.tar.gz. Please see the README file first.
Please note that no contaminant database is included in the package; you need to provide your own screening files
or download and format a generic vector database like NCBI's
The package also doesn't include NCBI's blastall and megablast utilities, which you should obtain from the NCBI site
(their full source is included in the NCBI C Toolkit).
The C/C++ source for the other programs in the package can be found in gclib package:
- mdust.tar.gz Standalone low complexity ("dust") filter.
- trimpoly.tar.gz polyA/T and N trimming utility.
- cdbfasta.tar.gz A FASTA record indexing/retrieving utility (described separately)
- psx.tar.gz A parallel multi-FASTA file processing tool on a multi-CPU machine.
- pvmsx.tar.gz A parallel multi-FASTA file processing tool on a PVM cluster.
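To give a rough idea of what polyA/T and N trimming involves, here is a minimal Python sketch. It is not the trimpoly algorithm: it assumes uppercase sequences, removes only exact runs at the sequence ends, and uses a hypothetical minimum run length of 8. A real trimming tool would likely tolerate mismatches inside the poly-A/T stretch. All function names are illustrative.

```python
def trim_poly_tail(seq: str, base: str = "A", min_run: int = 8) -> str:
    """Remove a trailing run of `base` if it is at least min_run long."""
    stripped = seq.rstrip(base)
    run_length = len(seq) - len(stripped)
    return stripped if run_length >= min_run else seq


def trim_poly_head(seq: str, base: str = "T", min_run: int = 8) -> str:
    """Remove a leading run of `base` if it is at least min_run long."""
    stripped = seq.lstrip(base)
    run_length = len(seq) - len(stripped)
    return stripped if run_length >= min_run else seq


def trim_n_ends(seq: str) -> str:
    """Remove undetermined bases (N) from both ends of the sequence."""
    return seq.strip("N")


def trim_sequence(seq: str) -> str:
    """Apply N trimming first, then poly-A tail and poly-T head trimming."""
    return trim_poly_head(trim_poly_tail(trim_n_ends(seq)))


if __name__ == "__main__":
    print(trim_sequence("NNACGTACGTAAAAAAAAAAN"))
```

Short runs are left untouched on purpose, since a few terminal A or T bases may belong to the transcript itself.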
cdbfasta/cdbyank: fast indexing/retrieval of FASTA records from flat file databases. These two utilities are based on the
"cdb" (Constant DataBase) concept and the file-based hashing algorithm developed by
D.J. Bernstein (http://cr.yp.to/djb.html).
The source code is a C++ port of the original cdb library,
modified to create a compact, separate index file while keeping the original flat file in its original format.
Multi-FASTA files of up to 4GB can be indexed with cdbfasta and then any one or more records can be
quickly retrieved using cdbyank.
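The general principle of indexing a flat file without modifying it can be sketched in Python as follows. This is only an illustration of the concept, not the cdbfasta index format or the cdb hashing scheme: an in-memory dictionary stands in for the on-disk index, and the function names build_offset_index and fetch_record are hypothetical.

```python
import io
from typing import BinaryIO, Dict


def build_offset_index(fasta: BinaryIO) -> Dict[str, int]:
    """Map each record ID (first word after '>') to the byte offset of its header."""
    index: Dict[str, int] = {}
    fasta.seek(0)
    while True:
        offset = fasta.tell()
        line = fasta.readline()
        if not line:
            break
        if line.startswith(b">"):
            fields = line[1:].split()
            if fields:
                index[fields[0].decode("ascii")] = offset
    return index


def fetch_record(fasta: BinaryIO, index: Dict[str, int], record_id: str) -> str:
    """Seek straight to a record and read until the next header or end of file."""
    fasta.seek(index[record_id])
    lines = [fasta.readline()]
    while True:
        line = fasta.readline()
        if not line or line.startswith(b">"):
            break
        lines.append(line)
    return b"".join(lines).decode("ascii")


if __name__ == "__main__":
    demo = io.BytesIO(b">est1 first\nACGT\n>est2 second\nGGCC\nTTAA\n")
    offsets = build_offset_index(demo)
    print(fetch_record(demo, offsets, "est2"), end="")
```

Because only offsets are stored, the FASTA file itself stays untouched, and a lookup costs one seek plus the read of a single record rather than a scan of the whole file.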