In most cases, the UAGC sequencing service includes signal processing on your data, and contig assembly may also be included. The signal processing phase for 454 (Titanium) data is computationally intensive, and is most efficiently done using the UA HPC/HTC systems. The sequence assembly phase can also be done on UA HPC/HTC. Annotation of bacterial genomes may be done via the RAST and/or NCBI PGAAP servers.
See the High Throughput Computing FAQs
First check that you have enough disk space on UA HPC/HTC to hold the data, or use an xdisk allocation if you need more space (xdisk usage is covered in the ICE cluster FAQ). If you have .sff files, you can simply transfer them to ICE using the scp command with your UANetID password:
scp *.sff email@example.com:dest_dir.
If you need to do signal processing (see Question 4 below), you will need to use scp to copy the entire image processing directory, e.g. 'D_yyyy_mm_dd_hh_mm_ss_FLX05080355_imageProcessingOnly', as follows:
scp -r D_yyyy_mm_dd_hh_mm_ss_FLX05080355_imageProcessingOnly firstname.lastname@example.org:dest_dir
No, image processing must be done on the 454 instrument computer.
You probably don't need to, as this is done as part of the sequencing service! But if you insist: Copy the script /genome/ICEbin/454sig.pbs or /genome/ICEbin/454sigPairedEnd.pbs to your directory and modify it to use your email, group, and data file locations. The 454 software runs best with bash, so leave the first line as is. Submit the job using one of the commands:
Using 8 processors, your signal processing run should take 10 hours or less on ICE. You will receive an email message when your job begins, ends, or aborts. For more information on PBS jobs, see the HPC FAQ.
There are several choices for de novo assemblers: Roche newbler/gsAssembler, Celera, and MIRA have been tested and seem to work reasonably well with 454 data. Running each is described below in a separate question/answer. Each assembler is installed in a different location, necessitating the alteration of the PATH environment variable in the .cshrc file in the user's home directory. For an example, see the file /genome/.cshrc.ex
Remember that having fewer contigs does not necessarily imply a better assembly, as repeat regions may have been collapsed. For mapping, run gsMapper (see next question).
Roche's gsMapper and gsAssembler (aka newbler) can be run interactively from the ICE command line. To use roche software, first run the command: 'module load roche454'. If you wish to view the resulting assembly using consed (version 17 is required) be sure to select 'Complete consed folder' in the gsAssembler Parameters Tab before starting the assembly. After creating a new project, select the Project Tab, select the 'GS reads' Tab and click the '+' button to add sff files to the project. Set the desired parameters from the Parameters Tab and press the 'Start' button. After the computation has finished you can use the 'Results files', 'Alignment results', and 'Flowgrams' Tabs. Small genome assemblies typically take about 10 minutes to complete on ICE. For larger projects, you may use /genome/ICEbin/454asm.pbs as a template for a batch assembly. For more complete information see the Roche manual. If you are using the command line assembly tool runAssembly and you are combining paired end reads with shotgun reads, you must specify the paired end reads first, e.g. runAssembly -o MyProject -p Mypairedend.sff Myshotgun.sff
Before running mira, you must extract fasta, quality, and xml clipping information from the454 sff files, using the sff_extract command (one long line). For example:
sff_extract -s proj_in.454.fasta -q proj_in.454.fasta.qual -x proj_traceinfo_in.454.xml FX3UAMY01.sff FX3UAMY02.sffBe sure to name the files as illustrated. The command to run the mira assembler must be entered on one long line as well. Example:
mira -project=proj -job=denovo,genome,normal,454 >& proj.log
The main log file will contain any error messages. If the log file contains a message that says 'megahubs' were found, try re-running the mira command and include the options:
-SK:mnr=yes and -SK:nrr=10There is a sample PBS script /rsgrps1/nrsc/genome/ICEmirademo.csh that you may copy and modify. For more hints on assembling difficult projects, see: http://chevreux.org/uploads/media/mira3_hard.html. The proj_d_info directory contains an assembly.txt file that summarizes the assembly results, and the proj_d_results directory contains the .ace, .fasta, and .fasta.qual files for the resulting assembly. For more information on mira, see http://mira-assembler.sourceforge.net/docs/mira.html
The Celera assembler is installed in /genome/celera/wgs-5.4/Linux-amd64/bin so you will need to add this to your PATH environment variable. The celera assembly process also requires a non-standard Perl library, so you will also need to add the following line to the .cshrc file in your home directory:
setenv PERLLIB /genome/ICEbin/lib
After adding this line, type the command:
There are several steps involved in running the Celera assembler. The first is to convert your sff file(s) to frg format by running the sffToCA command, e.g.:
sffToCA -trim hard -clear 454 -libraryname MyLib -output FX3UAMY01.frg FX3UAMY01.sff
The -trim hard option is recommended for unpaired Titanium reads to tell the assembler to ignore low-quality bases that are beyond the trimming point established by the Roche software. The libraryname is used for removing duplicate reads, and is stored in the resulting frg file. For paired-end Titanium reads, -trim chop is recommended to make sure the linker sequence is removed.
Then create a specifications file, containing lines such as:
unitigger = bog
Your spec file tells the celera assembler to use the bog 'best overlap graph' method (recommended for 454 data), and the mer overlapper, which takes into account the homopolymer length uncertainty characteristic of 454 reads. However, sometimes the mer overlapper fails or runs for too long. In that case, you can use the default setting of overlapper=ovl. Setting utgErrorRate=0.03 is recommended for 454 Titanium reads (as opposed to the default error rate of 0.015, which is appropriate for Sanger reads). The createACE=1 option produces an ace-format file that can be opened in consed . This option stores the ace file in a compressed format (bz2) that will need to be unzipped before it can be viewed (see below). Lastly, the spec file should list the frg file(s) you wish to assemble.
The command to run the assembler is:
runCA-OBT.pl -d asmdir -p project -s specfileIf you prefer not to use a spec file, you can put the options from that file directly on the command line. The last argument should be your .frg file(s). For example:
runCA -d asmdir -p project doOverlapTrimming=1 overlapper=mer unitigger=bog utgErrorRate=0.03 createACE=1 FX3UAMY01.frg
The command should be one long line, even if it wraps around. Do not hit enter until after the frg file list. Also, note that some options are preceded by a minus sign (-d and -p), while others are not.
Output files, including fasta files of all the assembled contigs, can be found in the 9-terminator directory. If you used the createACE=1 option to output an ace file, you will need to run the following command to unzip the compressed file before opening it:
consed -nophd -ace project.ace
gatekeeper -dumpfrg project.gkpStore > project.frg
Consed version 19 is installed in /genome/ICEbin/consed19 so you will need to add this to your PATH environment variable. (See the High Throughout Computing FAQs for instructions on setting your PATH.) For gsAssembler you can run consed as usual (from the edit_dir, type the command:
consed). For celera and mira assemblies you must run:
consed -nophd -ace project.ace
For annotation of microbial genomes, submit your assembled contigs to the RAST server or to the NCBI PGAAP server. RAST is very easy to use. You may wish to try different assemblers and submit the results from each to RAST. Then you may be able to judge which assembly is best suited for submission to PGAAP.