DFCI Gene Indices Software Tools

The Gene Indices group is committed to making software tools freely available to the scientific community. The software provided on this page represents modified versions of our internal software, designed to run independently of the operating system environment (e.g. not linked to the database and software environment at DFCI). This software is available free of charge under the Artistic License; download and use of this software constitutes agreement to abide by the terms of the Artistic License.

If you would like to be notified about software updates, please click on the package logo and fill in your contact information.

The Gene Index Project and its tools are supported in part by funding from the US Department of Energy, Grant #DE-FG02-99ER62852, and the US National Science Foundation, Grant #DBI-9983070. Additional funds are provided by the US National Science Foundation through grants #DBI-9813392 and #DBI-9975866.

TGICL download TGI Clustering tools (TGICL): a software system for fast clustering of large EST datasets

This package automates clustering and assembly of a large EST/mRNA dataset. The clustering is performed by a slightly modified version of NCBI's megablast, and the resulting clusters are then assembled using the CAP3 assembly program. TGICL starts with a large multi-FASTA file (and an optional peer quality values file) and outputs the assembly files as produced by CAP3. Both the clustering and the assembly phases can be parallelized by distributing the searches and the assembly jobs across multiple CPUs, as TGICL can take advantage of either SMP machines or PVM (Parallel Virtual Machine) clusters. The two full precompiled packages below were built on Linux and SunOS, respectively. Each includes CAP3, mgblast and all the other binaries for its platform (except, of course, base Unix utilities such as 'sed' and 'sort'). Please note that only the Linux version was thoroughly tested at DFCI.

  • tgicl_linux.tar.gz Linux x86 (glibc-2.1, requires perl>=v5.6).
  • tgicl_sunos.tar.gz built on SunOS 5.8 sparc SUNW, Ultra-250
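
The parallelization described above relies on a large multi-FASTA file being divisible into independent pieces of work. The following Python sketch is only an illustration of that idea and is not taken from the TGICL scripts: it parses multi-FASTA lines into records and groups them into fixed-size chunks, each of which could then be handed to a separate search or assembly job. The function names (read_fasta, chunk_records) and the chunk_size parameter are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', concatenated sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from the lines of a multi-FASTA file."""
    header = None
    seq_parts: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(seq_parts)
            header, seq_parts = line[1:], []
        elif header is not None:
            seq_parts.append(line.strip())
    if header is not None:
        yield header, "".join(seq_parts)


def chunk_records(records: Iterable[Record], chunk_size: int) -> Iterator[List[Record]]:
    """Group records into lists of at most chunk_size records (hypothetical unit of work)."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    chunk: List[Record] = []
    for record in records:
        chunk.append(record)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # last, possibly shorter, chunk
```

In this sketch every chunk is self-contained, so the chunks can be processed in any order and on any CPU or node; how TGICL actually slices and schedules its input is not shown here.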

The platform-independent Perl scripts can also be downloaded separately from tgicl_scripts.tar.gz. These scripts are likely to be updated more often than the binaries, which are rather stable. When only the scripts have changed, there is no need to download the full, large packages provided above again, although they also include these scripts.

Please note that the entire project is hosted at sourceforge.net, where you can find all previous versions, including the source code. Just click on the View all Files button and browse the versions listed there.


For advanced users, a more complex version of TGICL that has its own custom incremental assembler and extensive grid support (SGE and Condor) can be downloaded here:

  • gicl_Linux-i686.tar.gz (this package contains the scripts and all the necessary pre-compiled binaries for 32bit Linux systems; for the full source of these binaries see below).
  • CAP3 Huang, X. and Madan, A. (1999) CAP3: A DNA Sequence Assembly Program. Genome Research, 9: 868-877.
  • megablast Zhang, Z., Schwartz, S., Wagner, L. and Miller, W. (2000) A greedy algorithm for aligning DNA sequences. J Comput Biol, 7(1-2): 203-214.
  • NCBI Toolkit a great resource of bioinformatics source code and tools, provided by U.S. National Center for Biotechnology Information

If you want to build these programs from source, the C++ source code for most of the tools is provided below. We do not include here the source code for CAP3, for the formatdb utility, or for the other standard BLAST-related tools that are part of the NCBI Toolkit. Please note that in order to compile mgblast, a specific version of the NCBI Toolkit is needed (the C version, not the C++ one; check the README file in the 'mgblast' source package below for the specific version of the NCBI Toolkit that should be installed). CAP3 is closed source, but one can and should download the latest public binary build for the target platform in order to use it with TGICL. The author of CAP3, Dr. Xiaoqiu Huang, may be contacted for CAP3-specific issues, and currently the CAP3 binary for multiple platforms can be downloaded from: http://seq.cs.iastate.edu/cap3.html

Our source packages:

  • mdust.tar.gz Standalone low complexity ("dust") filter.
  • psx.tar.gz a parallel multi-FASTA file processing tool.
  • pvmsx.tar.gz a parallel multi-FASTA file processing tool using PVM
  • cdbfasta.tar.gz fasta file indexing/retrieving utility (described below)
  • mgblast.tar.gz a customized version of megablast (requires NCBI C Toolkit for compilation)

The following utilities depend on gclib, a genomic C++ library of utility classes and functions, which should be downloaded first in order to compile them.

  • zmsort.tar.gz a merge-sort utility for compressed alignment files, with multi-file output option
  • tclust.tar.gz a transitive closure clustering tool with overlap filtering options.
  • sclust.tar.gz a seeded clustering tool that processes pairwise alignments.
  • nrcl.tar.gz a containment clustering and layout utility that processes pairwise alignments.
  • mgmerge.tar.gz a merge utility for compressed alignment data (used by gicl only)
  • mblasm.tar.gz a sequence assembler from pairwise gapped alignments (used by gicl only)

clview download clview : an assembly file viewer.

This is a graphical, interactive tool for inspecting the ACE format assembly files generated by CAP3 or phrap. Besides ACE files, the program also supports a custom cluster layout format that gives an overview of possible multiple alignments derived just from pairwise alignments, where no detailed nucleotide-level alignment is needed or provided. The "containment clustering" program (nrcl) mentioned under the TGI Clustering tools (TGICL) above can generate such a "cluster layout" file (*.lyt). Here is a precompiled Linux version with the required dynamic FOX library included: clview_linux_i386.tar.gz. The program was built using the FOX toolkit by Jeroen van der Zijp, a portable and feature-rich C++ framework for developing graphical user interfaces under Unix and Windows. In order to compile the source code of this viewer, you need to download and install the FOX library.

Seqclean download SeqClean: a script for automated trimming and validation of ESTs or other DNA sequences by screening for various contaminants, low quality and low-complexity sequences.

Precompiled Linux versions can be found here: seqclean.tar.gz (32-bit). There is also a 64-bit version: seqclean_x86_64.tar.gz. Please see the README file first. Please note that no contaminant database is included in the package: you need to provide your own screening files or download and format a generic vector database such as NCBI's UniVec. The package also does not include NCBI's blastall and megablast utilities, which you should obtain from the NCBI site (their full source is included in the NCBI C Toolkit). The C/C++ source for the other programs in the package can be found in the gclib package:

  • mdust.tar.gz Standalone low complexity ("dust") filter.
  • trimpoly.tar.gz polyA/T and N trimming utility.
  • cdbfasta.tar.gz A FASTA record indexing/retrieving utility (described separately)
  • psx.tar.gz A parallel multi-FASTA file processing tool on a multi-CPU machine.
  • pvmsx.tar.gz A parallel multi-FASTA file processing tool on a PVM cluster.

Cdbyank download cdbfasta/cdbyank: fast indexing/retrieval of FASTA records from flat file databases. These two utilities are based on the "cdb" (Constant DataBase) concept and the file-based hashing algorithm developed by D.J. Bernstein (http://cr.yp.to/djb.html). The source code is a C++ port of the original cdb library, modified to create a compact, separate index file while keeping the original flat file in its original format. Multi-FASTA files of up to 4GB can be indexed with cdbfasta, and then any one or more records can be quickly retrieved using cdbyank.
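
The idea of a separate index that leaves the flat file untouched can be sketched in a few lines of Python. This is only an illustration of the concept and not the cdbfasta/cdbyank implementation: an in-memory dictionary stands in for the on-disk cdb hash file, the key is assumed to be the first whitespace-delimited token of each header line, and the function names (build_index, fetch_record) are hypothetical.

```python
import io
from typing import BinaryIO, Dict


def build_index(fasta: BinaryIO) -> Dict[str, int]:
    """Map each record key to the byte offset of its '>' header line."""
    index: Dict[str, int] = {}
    while True:
        offset = fasta.tell()
        line = fasta.readline()
        if not line:
            break  # end of file
        if line.startswith(b">"):
            fields = line[1:].split()
            if fields:  # skip headers that carry no identifier
                index[fields[0].decode()] = offset
    return index


def fetch_record(fasta: BinaryIO, index: Dict[str, int], key: str) -> str:
    """Seek straight to a record and return it, header included."""
    fasta.seek(index[key])
    lines = [fasta.readline()]  # the header line itself
    while True:
        line = fasta.readline()
        if not line or line.startswith(b">"):
            break  # reached the next record or the end of the file
        lines.append(line)
    return b"".join(lines).decode()


# An in-memory file stands in for a multi-FASTA file on disk.
demo = io.BytesIO(b">seq1 first\nACGT\nAC\n>seq2\nGGCC\n")
demo_index = build_index(demo)
print(fetch_record(demo, demo_index, "seq2"), end="")
```

Because only offsets are stored, retrieval costs one seek plus the read of a single record, regardless of where the record sits in the file.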

Das_viewer download

DAS/XML Genomic Viewer. The Genomic Viewer borrows modules from http://www.biodas.org (lstein@cshl.org) and http://webreference.com. The application is in constant flux and development and may not be fully compliant with the current DAS protocol at all times.
For program description please refer to dasview.html
Program source: das_viewer.tar.gz
For DAS/XML Viewer Comments/Questions send mail to Foo Cheung

For specific feature requests or programming questions please write directly to the author (Geo Pertea).

Comments and suggestions : Contact Us

   The Gene Index Project is supported in part by funding from the US National Science Foundation through grant #DBI-0552416.