The Gene Indices group is committed to making software tools freely
available to the scientific community. The software provided on this
page represents modified versions of our internal software, designed to
run independently of our local environment (i.e., not linked
to the database and software environment at DFCI).
This software is available free of charge under the Artistic License;
download and use of this software constitutes agreement to abide by the
terms of the Artistic License.
If you would like to be notified about software updates, please
click on the package logo and fill in your contact information.
The Gene Index Project and its tools are supported in part by funding from the US
Department of Energy, Grant #DE-FG02-99ER62852, and the US National Science Foundation, Grant
#DBI-9983070. Additional funds are provided by the US National Science Foundation
through grants #DBI-9813392 and #DBI-9975866.
TGI Clustering tools (TGICL): a software system for fast clustering of large EST datasets
This package automates clustering and assembly of a large EST/mRNA dataset. The clustering is performed by a
slightly modified version of NCBI's megablast, and the resulting clusters are then assembled using the CAP3 assembly program.
TGICL starts with a large multi-FASTA file (and an optional peer quality values file) and outputs the assembly files as
produced by CAP3. Both clustering and assembly phases can be parallelized by distributing the searches and
the assembly jobs across multiple CPUs, as TGICL can take advantage of either SMP machines or PVM
(Parallel Virtual Machine) clusters.
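The idea of distributing work over slices of a multi-FASTA file can be sketched as follows. This is only an illustration, not the code used by TGICL: the function names, the slice size, and the use of a Python thread pool in place of SMP or PVM workers are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def iter_fasta_records(lines):
    """Yield each FASTA record (header line plus sequence lines) as one string."""
    record = []
    for line in lines:
        if line.startswith(">") and record:
            yield "".join(record)
            record = []
        record.append(line)
    if record:
        yield "".join(record)


def iter_slices(records, slice_size):
    """Group records into lists of at most slice_size records."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == slice_size:
            yield batch
            batch = []
    if batch:
        yield batch


def process_in_parallel(lines, worker, slice_size=2, max_workers=2):
    """Apply worker (a hypothetical per-slice job) to every slice; results keep slice order."""
    slices = iter_slices(iter_fasta_records(lines), slice_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, slices))
```

In a real pipeline the worker would launch a search or an assembly job on its slice; here any callable that accepts a list of records will do.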
Here is a link to the README file which comes
with the package.
The two full precompiled packages below were built on Linux and SunOS, respectively.
They include CAP3, mgblast and all the other binaries for the respective platform (except, of course, for
base Unix utilities such as 'sed' and 'sort').
Please note that only the Linux version was thoroughly tested at DFCI.
The platform independent perl scripts can also be downloaded separately
from tgicl_scripts.tar.gz.
These scripts are likely to be updated more often than
the binaries, which are rather stable. When only the scripts change, there is no need to
download the full, large packages provided above again, although
they also include these scripts.
These files were last updated on 12/05/2003.
Acknowledgements:
- CAP3: Huang, X. and Madan, A. (1999) CAP3: A DNA Sequence Assembly Program. Genome Research, 9: 868-877.
- megablast: Zhang, Z., Schwartz, S., Wagner, L. and Miller, W. (2000) A greedy algorithm for aligning DNA sequences. Journal of Computational Biology, 7(1-2): 203-214.
- NCBI Toolkit: a great resource of bioinformatics source code and tools, provided by the U.S. National Center for Biotechnology Information.
If you want to build the package on other platforms or customize it, portable C++ source code
for most of the tools included in the package is given below. We do not include the source code
for the formatdb utility or the rest of the NCBI Toolkit
sources and libraries; you should download these from NCBI (the C version, not the C++ one)
in order to compile mgblast. Also, we cannot distribute the source of the CAP3 assembly program;
you may want to contact the author, Dr. Xiaoqiu Huang,
for a precompiled version for your specific platform.
Source packages:
- mdust.tar.gz: a standalone low-complexity ("dust") filter.
- psx.tar.gz: a parallel multi-FASTA file processing tool.
- pvmsx.tar.gz: a parallel multi-FASTA file processing tool using PVM.
- mgblast.tar.gz: a customized version of megablast (requires the NCBI C Toolkit for compilation).
The following utilities depend on the
TGI C++ class library
which should be downloaded first in order to compile them.
- cdbfasta.tar.gz: a FASTA record indexing/retrieving utility (described separately).
- zmsort.tar.gz: a merge-sort utility for compressed files, with a multi-file output option.
- tclust.tar.gz: a transitive closure clustering tool with overlap filtering options.
- sclust.tar.gz: a seeded clustering tool that processes pairwise alignments.
- nrcl.tar.gz: a containment clustering and layout utility that processes pairwise alignments.
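To illustrate what transitive closure clustering over pairwise hits means, here is a minimal sketch using a union-find structure. It is not the tclust implementation and it ignores the overlap filtering options; the function name and the input format (a list of ID pairs) are assumptions made for the example.

```python
from collections import defaultdict


def transitive_closure_clusters(pairs):
    """Cluster IDs so that any two IDs linked by a chain of pairwise hits end up together."""
    parent = {}

    def find(x):
        # Locate the root of x's set, then compress the path walked on the way.
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two sets

    clusters = defaultdict(list)
    for seq_id in parent:
        clusters[find(seq_id)].append(seq_id)
    # Sort members and clusters so the result is deterministic.
    return sorted(sorted(members) for members in clusters.values())
```

For example, the hits (est1, est2) and (est2, est3) place est1 and est3 in the same cluster even though they were never aligned to each other directly.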
clview: an assembly file viewer.
This is a graphical, interactive tool for inspecting the ACE format assembly files generated by
CAP3 or phrap. Besides ACE files, the program also supports a custom cluster layout format
that gives an overview of a possible multiple alignment inferred from pairwise alignments alone,
where no detailed nucleotide-level alignment is needed or provided. The
"containment clustering" program (nrcl) mentioned in the TGI Clustering
tools (TGICL) section above can generate such a "cluster layout" file (*.lyt).
Here is a precompiled Linux version with the required dynamic FOX library included:
clview_linux_i386.tar.gz
The program was built using the FOX toolkit by Jeroen van der Zijp,
a portable and feature-rich C++ framework for developing graphical user interfaces under Unix and Windows.
In order to compile the source code of this viewer,
you need to download the FOX library
and the TGI C++ class library.
SeqClean: a script for automated trimming and validation of ESTs or other DNA sequences
by screening for various contaminants, low quality and low-complexity sequences.
A precompiled Linux version is here:
seqclean.tar.gz. Please see the
README file first.
Please note that no contaminant database is included in the package; you need to provide your own screening files
or download and format a generic vector database such as NCBI's
UniVec.
The package also does not include NCBI's blastall and megablast utilities, which you should obtain from the NCBI site
(their full source is included in the NCBI C Toolkit).
The C/C++ source for the other programs in the package is provided here:
- mdust.tar.gz: a standalone low-complexity ("dust") filter.
- trimpoly.tar.gz: a polyA/T and N trimming utility.
- cdbfasta.tar.gz: a FASTA record indexing/retrieving utility (described separately).
- psx.tar.gz: a parallel multi-FASTA file processing tool for a multi-CPU machine.
- pvmsx.tar.gz: a parallel multi-FASTA file processing tool for a PVM cluster.
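As a rough illustration of polyA/T and N trimming, the sketch below strips terminal N characters, a leading poly-T run and a trailing poly-A run. It is deliberately simplistic and is not the trimpoly algorithm: the function name, the exact-match rule and the min_run threshold are assumptions chosen for the example.

```python
import re


def trim_poly(seq, min_run=10):
    """Return seq in upper case with terminal Ns, 5' poly-T and 3' poly-A runs removed."""
    seq = seq.upper().strip("N")
    # Only runs of at least min_run identical bases at the very ends are trimmed.
    seq = re.sub(r"^T{%d,}" % min_run, "", seq)
    seq = re.sub(r"A{%d,}$" % min_run, "", seq)
    # Trimming a run may expose further terminal Ns.
    return seq.strip("N")
```

A production tool would normally tolerate a few mismatches inside the run, which this sketch does not attempt.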
cdbfasta/cdbyank: fast indexing/retrieval of FASTA records from flat file databases. These two utilities are based on the
"cdb" (Constant DataBase) concept and the file-based hashing algorithm developed by
D.J. Bernstein (http://cr.yp.to/djb.html).
The source code is a C++ port of the original cdb library,
modified to create a compact, separate index file while keeping the original flat file in its original format.
Multi-FASTA files of up to 4GB can be indexed with cdbfasta and then any one or more records can be
quickly retrieved using cdbyank. Here is a
brief usage description.
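The principle of indexing a flat file without modifying it can be sketched by recording the byte offset of each record header. This sketch keeps the index in an in-memory dictionary and does not reproduce the cdb on-disk hash format or the cdbfasta/cdbyank command-line interface; both function names are hypothetical.

```python
def build_fasta_index(path):
    """Map each record ID (first word after '>') to the byte offset of its header line."""
    index = {}
    with open(path, "rb") as handle:
        offset = 0
        for line in handle:
            if line.startswith(b">"):
                fields = line[1:].split()
                if fields:
                    index[fields[0].decode("ascii")] = offset
            offset += len(line)
    return index


def fetch_record(path, index, seq_id):
    """Seek straight to a record and read it up to the next header or end of file."""
    with open(path, "rb") as handle:
        handle.seek(index[seq_id])
        lines = [handle.readline()]  # the header line itself
        for line in handle:
            if line.startswith(b">"):
                break
            lines.append(line)
    return b"".join(lines).decode("ascii")
```

Because retrieval is a single seek followed by a short read, the cost of fetching one record does not depend on how many records precede it in the file.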