------------------------------------------ The Gene Indices Clustering Tools (TGICL) ------------------------------------------ 0.Introduction 1.System Requirements 2.Installation 3.Usage 4.Output files 5.Dealing with large clusters 6.Command-line clustering utilities 7.PVM related issues 8.Copyright 9.Contact information 0.Introduction ============== The purpose of this package is to efficiently cluster and create assemblies (contigs) from a set of DNA sequences given in a fasta file. The "clustering" phase is intended to partition the input data set into smaller groups of sequences (clusters) that have stringent similarity are potentially coming from the same longer original sequence. However, the clustering phase does not perform any multiple alignment but only fast pairwise alignments (using megablast), which are then filtered and used to build subsets of sequences by a transitive closure approach. In the assembly phase each such cluster is sent to the assembly program (cap3) which attempts the multiple alignment of the sequences in the cluster and creates one or more contigs (consensus sequences). Both clustering and assembly phases can be executed in parallel on multiple CPU machines or in a PVM environment. 1.System Requirements ===================== *ix86 Linux with glibc >= 2.1, libstdc++ >= 2.9. (for the Linux compiled version) *Perl version >= 5.6 *common Unix utilities like sort, sed, cat etc. should be available. 2.Installation ============== Create a directory where you plan the package to reside. Copy the downloaded archive there, then unpack it using a command like this: tar xfvz tgicl.tar.gz This should unpack a few files in the current directory and will create a bin subdirectory with several files. The program to run is tgicl script from the main directory. When launched, this program will prefix the local ./bin subdirectory to the shell's path. Optionally, you may move/copy all binaries and scripts from ./bin subdirectory somewhere in your working shell PATH. You can also copy the main script (tgicl) to your preferred script location (which should be in the shell's PATH), in which case the module Mailer.pm should go in the same directory with the tgicl script or into a one of the PERLLIB (%INC) directories. There are 3 perl scripts in this package: tgicl bin/tgicl_asm.psx bin/tgicl_cluster.psx They all have the perl location as the first line, set to #!/usr/bin/perl If this is not the valid path for your perl installation you need to change these lines in all three files, to point to your actual perl binary location. Also, if you have a better/newer version of cap3 that you prefer to use you should remove or rename the cap3 program included in this package in the ./bin subdirectory (otherwise during the execution of tgicl, the included version will take precedence, as the ./bin subdirectory is added to the PATH environment in front of the existing list of paths) 3.Usage ======= The only input data file required is the multi-fasta file containing all the sequences to be clustered. Optionally, a file containing the quality-values for all the sequences can be provided (-q option) and it will be used during the assembly phase. Each sequence in the input flat file database is identified by the first space-terminated token found after the '>' character. These sequence identifiers should be unique and they shouldn't start with a prefix like "et|" or "np|" unless they are full-length gene transcripts (see below: 4.Dealing with large clusters). The clustering program procedure is not considering sequence similarity against lower-case letters in the sequence, so unless you actually used a repeat masker to do that on purpose, you should make sure that all the nucleotide sequences are uppercase letters. Then go to a working directory where you have permissions to write and create files (this can be the same directory where your input fasta file resides) and plenty of disk space. You can run the "tgicl" script without any parameters to see the usage instructions. To cluster and assemble the clusters with the default options, just type: tgicl fastadb This will run both clustering and assembly procedures. The resulting ACE format assembly files will be in the ./asm_*/ACE files. An "asm_X" subdirectory is created for each CPU in a parallel processing setting (where X is a number from 1 to the number of CPUs you specified, e.g. "asm_1", "asm_2", etc.). With the command line above, only asm_1 directory will be created as by default the program is using one CPU, but if you have a dual-CPU machine you can instruct the program to use both CPUs by adding "-c 2" to the command line above. The list of singletons will be in the file fastadb.singletons. 4.Output files ============== The program creates a few working files during the clustering and assembly process and several output files to be futher processed by the user (the assembly results). First, tgicl creates the necessary index/search files for mgblast searches (formatdb)and for fast sequence retrieval (cdbfasta). For repeated tgicl runs on the same dataset, this phase can be disabled ("-I" option) as it takes some time to (re)create the indices for large input files. The files created by formatdb (NCBI's blast index files), which are required for running mgblast, have these names: fastadb.nhr fastadb.nin fastadb.nsq (assuming the name of the input file was "fastadb") The cdbfasta creates only one index file which is used for fast retrieval of cluster sequences during assembly phase: fastadb.cidx During the clustering phase, cluster_* temporary directories will be created for distributed searching of the databases. If only one CPU is used, only the cluster_1 directory will be created. However, after the clustering is finished, these directories are removed and one or more files named hitsort_NNN.Z will be created instead (where NNN can be 001, 002, ... etc.). This file is the compressed output of the mgblast pairwise searches, a special tab delimited format where each line has the following tab delimited fields: 1) query sequence name 2) length of query sequence 3) start coordinate of the alignemnt on the query sequence 4) end coordinate of the alignment on the query sequence 5) subject sequence name (db sequence name) 6) length of subject sequence 7) start coordinate of the alignment on the subject sequence 8) end coordinate of the alignment on the subject sequence 9) percent of identity for the alignment 10) score of the alignment 11) bit-score of the alignment 12) orientation of the alignment (+/-) These alignments (the overlap information) are analyzed by the clustering programs (tclust, sclust and nrcl) in order to build the clusters. When the input dataset is large, these files can become also extremely large, so tgicl compresses them. In order to build the clusters efficiently and accurately, all the overlaps in the hitsort_*.Z files are sorted by score (best alignments first). The clustering programs take the overlap data contained in the hitsort_*.Z files as an input. The resulting clusters are written in a *_clusters file (fastadb_clusters using the example above). A cluster file has a pseudo FASTA format: each record is a actually a cluster definition and consists in a header line having this format: >cluster_name number_of_components .. followed by the list of sequence names contained in the cluster. In the assembly phase, a FASTA file containing the actual sequence data is built for each cluster, and this file is passed to the assembler the usual way. The resulting *.cap.ace files usually created by cap3 for each of the sequence cluters are concatenated inside each asm_* subdirectory as a single file called ACE. The plain FASTA file containing the resulted contig sequences only can be found in asm_*/contigs files, which are created by concatenating individual *.cap.contigs as they are created by cap3. The singletons resulted in the assembly process are written into asm_*/singlets file, by concatenating all the *.cap.singlets files as produced by cap3 during assembly of the clusters. By subtracting the list of the component sequences in all these ACE files from the list of sequences in the input datafile, tgicl determines the final list of singletons, which is written in an output file called *.singletons (fastadb._singletons in the example above). This is just a list of sequences names, not the sequences themselves. If the user needs the actual FASTA file with the singleton sequences, these can be easily obtained with the following command: cdbyank fastadb.cidx < fastadb.singletons > singleton.seqs Please note that the this global singletons list includes the singleton sequences found in asm_*/singlets, as the singletons resulted in the clustering process are also included. The tgicl program also creates a few log files for the main process, as well as for each of the subprocesses launched for each CPU/node in a parallel processing environment. These log files should be inspected after tgicl terminates, in order to ensure no error messages an unsignaled errors appeared during the process. Assuming that the input file was called "fastadb", the log files would be: err_tgicl_fastadb.log - the log file capturing stderr output of the main script tgicl_fastadb.log - the log file capturing stdout messages of the main script asm_*/err_log - the stderr capturing any error messages from each assembly subprocess (these files should be normally empty) asm_*/log_std - the stdout capture of each assembly subprocess 5.Dealing with large clusters ============================= The main goal of the clustering step (before the actual assembly step) is to partition a large input sequence set into reasonably smaller sequence subsets that can be processed by the assembly program without running out of memory. The transitive closure procedure based on pairwise searches does this, but it might still end up with large clusters, with ESTs from many different genes coming together due to repeats, shared protein domains, or just contaminating sequence (long untrimmed vector sequences). The assembly program might run out of memory and crash when the number of sequences in a cluster is very large, as the number of overlaps to store and to analyze is too large. There are mainly two possible reasons for this: 1)the cluster is made of ESTs coming from one highly expressed gene, with very deep coverage of the transcript 2)the cluster is made of ESTs coming from several highly expressed genes, or even many more distinct genes but they are put together because of chimeric sequences or because of several common sequence substrings - either repeats, shared domains or just unremoved contaminants (large vector sequences left in there?). The situation 1) is actually the worst, because cap3 does not seem to be able to assemble such a high coverage on a reasonable hardware configuration. In the second case, when we suspect the sequences in a large transitive closure cluster are not coming from the same gene, there are a few ways to partition it further: * make sure you removed the vector sequence from your ESTs * run RepeatBlaster to mask known repeats (use lowercase masking!) * attempt to increase the clustering stringency (the -p, -l and -v options of tgicl). For example you might try -p 98 -l 40 -v 20. This will only put together sequences linked to other sequences by overlaps of at least 98% identity, at least 40bp, and at most 20bp overlap distance of sequence end. Actually -v could be set to an extremely low value (down to 1) - if you trust that your EST dataset have all vector/linker sequences removed. * use known full-length gene transcripts (full mRNAs) or similarly long and good quality reference sequences and enable the "seeded clustering" option - see below. The clustering procedure within tgicl is able to make use of specially designated full-length gene sequences in the input fasta file, by not allowing linking of two clusters containing two such sequences unless the full-length sequences have a very good direct overlap. The -s option allows you to specify the maximum number of sequences in a cluster. When this option is given, any cluster with a sequence count larger than the value of the -s parameter will go through a second partitioning attempt, based on full-length transcripts (if any) present within that cluster. In order to take advantage of this feature, the full-length transcripts should be marked by using a "np|" or "et|" prefix (all the other sequences in the input file should not use this prefix, of course). That is, for a full gene sequence, instead of using a defline like this: >AA124125 ...add the "et|" prefix like this: >et|AA124125 This prefix will tell the clustering program that the sequence is a full length sequence and if it's in a large transitive closure cluster, a re-clustering will be performed within that cluster to partition the component sequences by grouping them stringently around the full-length gene sequences (ET = Expressed Transcripts) Going back at the unfortunate case 1) mentioned above: when it's only about one very highly expressed gene (that is, seeded clustering didn't help), usually only the first clusters found in the *_clusters file are very large, so the only way is to take special care of those. A two pass assembly solution can be attempted for each of these large clusters, by assembling smaller subsets (subclusters) and then assemble the contigs resulted from the first step. Whatever the reason a huge cluster (thousands of sequences) appeared, a good insight of its structure can be gained by running the "containment clustering" procedure - the "nrcl" program included in the bin directory of the tgicl package. This program can create subclusters by finding the larger "parent" sequences which almost completely include (contain) other sequences. Each subcluster will be determined by one large parent sequence which acts as a representative sequence for all the included (contained) sequences. So one can assemble separately each of these "container"-based clusters and then take the resulting clusters and assemble them (two-pass assembly). Usually each "parent" (container) sequence comes from one single gene or, if such a containment cluster is still very large, it might be either a large chimeric sequence, "parenting" - alignment-wise containing - ESTs from different genes, or (the worst case) it could be one extremely highly expressed, almost full length transcript. For example, in order to see what "containers" are in the largest cluster resulted from the initial tgicl clustering, below are a few steps that can be performed in order to separate and investigate only that cluster. The first two lines of the *_clusters file is the largest cluster, so we can create a separate file only containing that cluster: head -2 *_clusters > largest_cluster Then run the "containment" clustering procedure, instructing it to only consider the sequences from the cluster of interest: zcat hitsort*.Z | nrcl -c largest_cluster PID=98 OVHANG=20 Then you can just look at the names of the container ("parent") sequences - the largest sequences containing other sequences: grep "^>" largest_cluster.lyt | more The first number following the sequence name is the number of "contained" (children) sequences (with at least 98 percent identity and with at most 20nt end-trimming). You can take a few of the largest sequences and blast them against a known genes database (for example at NCBI's BLAST site) - to see what genes are they coming from, or if they look like possibly chimerical sequences. For a two pass assembly of that initial large cluster, based on these containment clustering results, you just need to convert each body line of the largest_cluster.scls file (generated by nrcl command above), into an independent cluster file and assemble this new subclusters separately (using the -a option of tgicl; WARNING: be sure to make a backup the existing asm_* subdirectories before doing what I suggest below, as the other, smaller assemblies are probably already there and they are okay, but tgicl removes the asm_* directories when it's run). Below there is a two-pass assembly procedure based on those container sequences (assuming perl binary is in your path): First, create a new "cluster file" with all the "container" clusters: grep -v "^>" largest_cluster.scls | perl -ne 'print \ ">C1CL".(++$r)."\t".scalar(split)."\n$_"' > cl1_subclusters tgicl can then be run with the option "-a cl1_subclusters" in order to assemble only the containment subclusters found obtained from the first large cluster. The resulting sequences from the asm_*/contigs and asm_*/singlets file should be then put in one fasta file and reassembled (manually) with cap3. 6.Command-line clustering utilities =================================== There are a few standalone programs included in the package, other than mgblast and CAP3. They can be used independently for advanced or customized clustering/assembly solutions: tclust, nrcl and sclust. Brief usage instructions for any of them can be obtained by typing the name of the program followed by only the "-h" parameter. tclust ------ A transitive-closure clustering program. In its simplest usage it takes pairs of space delimited tokens (one pair per line) at stdin or as the first file parameter and outputs the clusters built by single transitive linking of the tokens. A more advanced usage is when the input data is not simple pairs, but instead is the tab delimited output of a pairwise search program, formatted like the mgblast output (the one stored by tgicl in the hitsort_*.Z files described above). With such an input, the program can actually filter the input data by various criteria and process only overlaps allowed by the filter, thus validating or ignoring links between sequences when performing the transitive closure to build the sequence clusters. The links (alignments) can be filtered on the fly by using thresholds for percentage of identity, coverage of the shorter sequence etc. For example, in order to build high stringency clusters from the pairwise searches performed by tgicl, using 98% identity and a shorter sequence coverage of 80%, the tclust command can be: zcat hitsort_*.Z | tclust SCOV=80 PID=98 -o scov80pid98.cls In the command above, the clusters will be written in the file "scov80pid98.cls", having the format described in "4.Output files" section. nrcl ---- "Non-redundification clustering" - is a specialized version of tclust which builds "containment clusters". These clusters where there is always a larger "parent" sequence covering (containing) almost completely all the other, smaller sequences, within user specified thresholds. An extra feature of this program is that it is able to provide the layout overview (coordinates only) of the multiple alignment which would result for each cluster. A special layout file format is provided which can be visualized by using the Cluster Viewer (clview) program available as a separate package at the TGI software page. As mentioned in the previous topic, nrcl program can be used for investigating very large transitive-closure clusters, when the assembly program runs out of memory because of highly expressed or possibly chimeric transcripts. This is because the complex layout of such a cluster can be usually reduced to several representative sequences - the "parent" sequences found by nrcl - and group many little, redundant sequences under them. The investigation can then focus only onto these larger container sequences, which can be analyzed separately, looking for possible contamination or just assembling each of the containment clusters separately and then assemble the resulting contigs (two-step assembly), as described in the previous topic. sclust ------ This is also a program derived from tclust which is specialized in dealing with full-length mRNAs (known transcripts) and performing the "seeded clustering" procedure based on those. The full-length transcripts are recognized by the "et|" prefix used in front of each such sequence name in the input file. The program supports various options for tweaking the "seeded clustering" procedure. cdbfasta/cdbyank ----------------- These multi-FASTA file indexing and retrieval tools are described at the TGI software page and also available separately there. See 9.Contact information for details. 7.PVM related issues ==================== For the PVM option (using multiple machines), you need the PVM (Parallel Virtual Machine) libraries to be installed and the usual PVM_* environment variables to be set accordingly on each node machine. The current user should have full rlogin permissions of any of the node machines. Also, the working directory should be centrally mounted (NFS) in the same mounting point for all the PVM nodes, as one subdirectory will be created in the working directory for each node. Sometimes pvmd fails to start on a slave machine, in which case you should try to halt the pvmd manually on that machine. 8.Copyright =========== Copyright (c) 2005-2006, Dana-Farber Cancer Institute, All Rights Reserved This software is OSI Certified Open Source Software. OSI Certified is a certification mark of the Open Source Initiative. The folowing programs are distributed as binaries: cap3 ---- Contig Assembly Program version 3 Huang, X. and Madan, A. (1999) CAP3: A DNA Sequence Assembly Program. Genome Research, 9: 868-877. mgblast ------- Modified version of the megablast program from the NCBI toolkit: Zhang, Schwartz, Wagner, and Miller., A Greedy Algorithm for Aligning DNA Sequences, J. Comp. Biol. 2000, Feb-Apr;7(1-2):203-14 9.Contact ========= This and other bioinformatics tools are available at The Gene Indices Software page at: http://compbio.dfci.harvard.edu/tgi/software For problems and questions related to this package please contact Geo Pertea at gpertea@jimmy.harvard.edu . Please note that for problems related to cap3 and megablast algorithms you should rather contact the original authors of these programs. (See page of Dr. Xiaqiu Huang, author of cap3, at http://www.cs.iastate.edu/~xqhuang/)