DFCI Gene Indices Software Tools

The Gene Indices group is committed to making software tools freely available to the scientific community. The software provided on this page represents modified versions of our internal software, designed to run independently of the operating system environment (e.g. not linked to the database and software environment at DFCI). This software is available free of charge under the Artistic License; download and use of this software constitutes agreement to abide by the terms of the Artistic License.

If you would like to be notified about software updates, please click on the package logo and fill in your contact information.

The Gene Index Project and its tools are supported in part by funding from the US Department of Energy, Grant #DE-FG02-99ER62852, and the US National Science Foundation, Grant #DBI-9983070. Additional funds are provided by the US National Science Foundation through grants #DBI-9813392 and #DBI-9975866.


TGICL download TGI Clustering tools (TGICL): a software system for fast clustering of large EST datasets

This package automates clustering and assembly of a large EST/mRNA dataset. The clustering is performed by a slightly modified version of NCBI's megablast, and the resulting clusters are then assembled using the CAP3 assembly program. TGICL starts with a large multi-FASTA file (and an optional peer quality values file) and outputs the assembly files as produced by CAP3. Both the clustering and assembly phases can be parallelized by distributing the searches and the assembly jobs across multiple CPUs, as TGICL can take advantage of either SMP machines or PVM (Parallel Virtual Machine) clusters. Here is a link to the README file which comes with the package.
The two full precompiled packages below were built on Linux and SunOS, respectively. Each includes CAP3, mgblast and all the other binaries for its platform (except, of course, for base Unix utilities such as 'sed' and 'sort'). Please note that only the Linux version was thoroughly tested at DFCI.

The platform-independent Perl scripts can also be downloaded separately from tgicl_scripts.tar.gz. These scripts are likely to be updated more often than the binaries, which are rather stable. When only the scripts change, there is no need to download the full, large packages provided above again, although they also include these scripts.
Last update of these files was on 12/05/2003.

Acknowledgements:

  • CAP3 Huang, X. and Madan, A. (1999) CAP3: A DNA Sequence Assembly Program. Genome Research, 9: 868-877.
  • megablast Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb Miller (2000), "A greedy algorithm for aligning DNA sequences", J Comput Biol 2000; 7(1-2):203-14
  • NCBI Toolkit a great resource of bioinformatics source code and tools, provided by U.S. National Center for Biotechnology Information

If you want to build the package on other platforms or customize it, portable C++ source code for most of the tools included in the package is given below. We do not include the source code for the formatdb utility or the rest of the NCBI Toolkit sources and libraries, which you should download from NCBI (the C version, not the C++ one) in order to compile mgblast. Also, we cannot distribute the source of the CAP3 assembly program; you might want to contact the author, Dr. Xiaoqiu Huang, for a precompiled version for your specific platform. Source packages:

  • mdust.tar.gz Standalone low complexity ("dust") filter.
  • psx.tar.gz a parallel multi-FASTA file processing tool.
  • pvmsx.tar.gz a parallel multi-FASTA file processing tool using PVM
  • mgblast.tar.gz a customized version of megablast (requires NCBI C Toolkit for compilation)
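
As an illustration of what "parallel multi-FASTA file processing" involves, here is a minimal Python sketch that reads records from a multi-FASTA stream and groups them into fixed-size slices, each of which could be handed to a separate worker. This is only an assumed slicing scheme for illustration; the function names are hypothetical and the code does not reproduce how psx or pvmsx are actually implemented.

```python
from itertools import islice


def read_fasta_records(lines):
    """Yield (header, sequence) tuples from an iterable of multi-FASTA lines."""
    header, seq_parts = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(seq_parts)
            header, seq_parts = line[1:], []
        elif header is not None:
            seq_parts.append(line.strip())
    if header is not None:
        yield header, "".join(seq_parts)


def slice_records(records, slice_size):
    """Yield lists of at most slice_size records (hypothetical work units)."""
    it = iter(records)
    while True:
        batch = list(islice(it, slice_size))
        if not batch:
            return
        yield batch
```

Each slice returned by `slice_records` is independent of the others, which is what makes it possible to distribute the work across CPUs.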

The following utilities depend on the TGI C++ class library which should be downloaded first in order to compile them.

  • cdbfasta.tar.gz a fasta record indexing/retrieving utility (described separately)
  • zmsort.tar.gz a merge-sort utility for compressed files, with multi-file output option
  • tclust.tar.gz a transitive closure clustering tool with overlap filtering options.
  • sclust.tar.gz a seeded clustering tool by processing pairwise alignments.
  • nrcl.tar.gz a containment clustering and layout utility by processing pairwise alignments.
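
The idea of transitive closure clustering over pairwise alignments can be sketched in a few lines of Python: if sequence A overlaps B and B overlaps C, then A, B and C end up in the same cluster even if A and C never align directly. The sketch below uses a union-find structure and assumes the input is simply a list of ID pairs that already passed any overlap filtering; it is an illustration of the concept, not the tclust implementation, and the function name is hypothetical.

```python
from collections import defaultdict


def transitive_closure_clusters(pairs):
    """Group IDs so that any chain of pairwise hits links them into one cluster."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path directly at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for a, b in pairs:
        union(a, b)

    clusters = defaultdict(list)
    for seq_id in parent:
        clusters[find(seq_id)].append(seq_id)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical pairwise hits between ESTs.
hits = [("est1", "est2"), ("est2", "est3"), ("est4", "est5")]
print(transitive_closure_clusters(hits))
```

A known property of this approach is that a single spurious overlap can merge two otherwise unrelated clusters, which is presumably why overlap filtering options matter before the closure is computed.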



clview download clview : an assembly file viewer.

This is a graphical, interactive tool for inspecting the ACE format assembly files generated by CAP3 or phrap. Besides ACE files, the program also supports a custom cluster layout format that gives an overview of possible multiple alignments derived only from pairwise alignments, where no detailed nucleotide-level alignment is needed or provided. The "containment clustering" program (nrcl) mentioned under the TGI Clustering tools (TGICL) above can generate such a "cluster layout" file (*.lyt). Here is a precompiled Linux version with the required dynamic FOX library included: clview_linux_i386.tar.gz. The program was built using the FOX toolkit by Jeroen van der Zijp, a portable and feature-rich C++ framework for developing graphical user interfaces under Unix and Windows. In order to compile the source code of this viewer, you need to download the FOX library and the TGI C++ class library.



Seqclean download SeqClean: a script for automated trimming and validation of ESTs or other DNA sequences by screening for various contaminants and for low-quality and low-complexity sequences.

A precompiled Linux version is here: seqclean.tar.gz. Please see the README file first. Please note that no contaminant database is included in the package; you need to provide your own screening files or download and format a generic vector database such as NCBI's UniVec. The package also does not include NCBI's blastall and megablast utilities, which you should obtain from the NCBI site (their full source is included in the NCBI C Toolkit). The C/C++ source for the other programs in the package is provided here:




Cdbyank download cdbfasta/cdbyank: fast indexing/retrieval of fasta records from flat file databases. These two utilities are based on the "cdb" (Constant DataBase) concept and the file-based hashing algorithm developed by D.J. Bernstein (http://cr.yp.to/djb.html). The source code is a C++ port of the original cdb library, modified to create a compact, separate index file while keeping the original flat file in its original format. Multi-FASTA files of up to 4GB can be indexed with cdbfasta, and then any one or more records can be quickly retrieved using cdbyank. Here is a brief usage description.
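
The general idea of a separate index over an unmodified flat file can be illustrated with a short Python sketch: one pass records the byte offset of each record's header line, and retrieval then seeks straight to that offset instead of scanning the file. The sketch stores the index as JSON purely for simplicity; it does not use the cdb hashing format, and the function names and file layout are assumptions for illustration rather than the behavior of cdbfasta or cdbyank.

```python
import json


def build_index(fasta_path, index_path):
    """Record the byte offset of every '>' header line, keyed by sequence ID."""
    index = {}
    with open(fasta_path, "rb") as fasta:
        while True:
            offset = fasta.tell()
            line = fasta.readline()
            if not line:
                break
            if line.startswith(b">"):
                fields = line[1:].split()
                if fields:
                    # The ID is the first whitespace-delimited token of the header.
                    index[fields[0].decode("ascii")] = offset
    # The index lives in its own file; the FASTA file itself is left untouched.
    with open(index_path, "w") as out:
        json.dump(index, out)
    return index


def fetch_record(fasta_path, index_path, seq_id):
    """Return one full FASTA record by seeking directly to its stored offset."""
    with open(index_path) as f:
        index = json.load(f)
    with open(fasta_path, "rb") as fasta:
        fasta.seek(index[seq_id])
        lines = [fasta.readline()]  # the header line
        for line in fasta:
            if line.startswith(b">"):
                break  # reached the next record
            lines.append(line)
    return b"".join(lines).decode("ascii")
```

Because only offsets are stored, the index stays small relative to the sequence data, and the original multi-FASTA file remains usable by any other tool.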



Das_viewer download

DAS/XML Genomic Viewer: The Genomic Viewer borrows modules from http://www.biodas.org (lstein@cshl.org) and http://webreference.com. The application is under constant development and may not always be fully compliant with the current DAS protocol.
For program description please refer to dasview.html
Program source: das_viewer.tar.gz
For DAS/XML Viewer Comments/Questions send mail to Foo Cheung



For specific feature requests or programming questions please write directly to the author (Geo Pertea).

Comments and suggestions : Contact Us

Acknowledgements
   The Gene Index Project is supported in part by funding from the US National Science Foundation through grant #DBI-0552416.