Running BLAST Jobs on the UA Supercomputers

General information on running UA HPC/HTC jobs can be found here and at http://rc.arizona.edu.
It is recommended that you read through the High Performance Computing FAQ in addition to reading this BLAST FAQ.

Bioperl users: on the login.hpc.arizona.edu systems, bioperl is automatically included when you load the perl module (there is no separate bioperl module to load.) You must load the perl module to have bioperl included, as the default perl in /usr/bin doesn't include bioperl.

Note: items below in italics should not be typed literally – instead substitute the appropriate file name or login name.

DO NOT run UA HPC/HTC BLAST searches against any of the databases in /genome on the last day of the month or the first day of the month, because these databases are updated in the early hours of the first day of each month and using a database that is being updated can have unpredictable results.

1. How do I get an account on the UA HPC/HTC systems?

To get an account on UA HPC/HTC systems, you need a UANetID and a faculty sponsor.  If you are a faculty member or PI and wish to sponsor an account, the link is: http://sponsor.hpc.arizona.edu

The accounts page is: http://account.arizona.edu

 

2. How do I login to UA HPC/HTC?

From an XTerminal window on a Linux or MacOSX computer, use the ssh command:

 ssh -X mylogin@login.hpc.arizona.edu

(The -X argument allows you to open X Window GUI based applications such as the nedit editor, gsAssembler, and consed.) 

From a Windows computer, connect to UA HPC/HTC using a Window-based ssh client, such as SSH Secure Shell or PuTTy; however, in order to run X Window applications you will also need to install the XMing server on your Windows desktop. To download SSH or PuTTy, go to https://sitelicense.arizona.edu/.

Alternatively, you can use the BioDesk application if you have an account on one of the BCF servers. For help with BioDesk, see the BioDesk help page.

 

3. How do I access BLAST programs and databases?

Before running blast commands, you'll need to load the blast module by running the command:
     module load blast

There are different blast programs for comparing nucleotide and protein sequences:

      blastn: compare nucleotide queries against a nucleotide BLAST database

      blastp: compare protein queries against a protein BLAST database

      blastx: compare translated nucleotide queries  against a protein BLAST database

      tblastn: compare protein queries against a translated nucleotide BLAST database

      tblastx: compare translated nucleotide queries  against a translated nucleotide BLAST database

For more information onBLAST, see:

http://www.ncbi.nlm.nih.gov/books/NBK21097/

http://www.ncbi.nlm.nih.gov/books/NBK1762/

On the ICE cluster system or the IBM (HTC) you can use at most 12 processors and 23Gb memory for a BLAST run. The amount of memory needed to run a BLAST job depends on the size of the BLAST database being queried, the number and size of the query sequences, and the number of blast matches requested.  For large query sets and/or blasts to the NCBI nt (nucleotide) or nr (protein) BLAST database, requesting 23Gb of memory may be necessary.  If is also a good idea to set -max_target_seqs to 10 or 20, rather than using the default value of 500. 

You must submit BLAST jobs to the PBS (Portable Batch System), as interactive jobs on HPC/HTC are limited to one hour and 2Gb of memory. See the next question for instructions on writing PBS submit scripts and submitting them to be run.

Input files must be in Fasta format (nucleotide or protein), and there may be many sequences in one input file. If the files have been created or edited on a Windows machine or a Mac, they may contain invisible end-of-line characters that can cause problems on a Unix system. See below for commands that will allow you to check for and remove these end-of-line characters. Also see below if you wish to build your own BLAST database from a FASTA file.

There are several options for the format of the BLAST output. By default BLAST output contains, for each query sequence, a list of one-line descriptions of the hit (or subject) sequences, followed by the pairwise alignments. It is possible to get a simpler tab-delimited output using the -outfmt 6 or -outfmt 7 option with the blastall command (see below.) XML output can also be specified, with the -outfmt 5 option. For more details on the options for blast, run the commands:

    blastn -help

    blastp -help

    etc.

In the /genome directory are several blast databases, including uniprot, est, nt, and nr. These are updated periodically. If you need any of these databases to be updated or if you need additional databases loaded onto the campus supercomputer you may contact Susan Miller. To see exactly which nucleotide databases are available, and how large the files are, run the command:
        ls -l   /genome/*.nsq

For protein databases, type:
        ls -l   /genome/*.psq

It is also possible to build your own blast database using the makeblastdb command. For options, type:
        makeblastdb -help

A formatdb.log file will contain the status of the makeblastdb command after it is run.

In the early morning hours of the first day of each month, the nr, nt, and est databases are downloaded from NCBI. The uniprot_sprot and uniprot_trembl and Pfam datadases are also updated at this time. If there is a BLAST job running with one of these databases while the update is in progress, it is likely that the results will be invalidated. For this reason, DO NOT START BLAST jobs using these databases on the last day or early on the first day of the month!

 

4. How do I write a submit file (batch script) and submit a BLAST job to the PBS system?

First run the va (view allocations) command to find your group ID - this is needed for the MyGroup attribute in the batch script.  Then copy the sample file /genome/blastn.csh (or /genome/blastp.csh) and edit it using the nedit editor:

cp   /genome/blastn.csh  myblast.csh

nedit  myblast.csh &


In nedit, FIRST TIME ONLY:
Go to Preferences -> Default Settings -> Wrap and choose None
Go to Preferences->Save Defaults then click OK

Edit the myblast.csh file to specify MyJob, MyGroup, MyEmail, MyDir, and BLAST input/output options. Be careful to spell all information correctly and make sure that the blast command does not span multiple lines, but instead is one long line.

The primary queue is called 'standard'. The sample script /genome/ICEblast.csh requests 23Gb memory and 12 cpus for 24 hours of walltime, for a total CPU time request of 288 hours. This is reflected in the lines:

#PBS -q standard
#PBS -l select=1:ncpus=12:mem=23Gb
#PBS -l cput=288:0:0
#PBS -l walltime=24:0:0

source /usr/share/Modules/init/csh
module load blast

cd MyBlastDir

time blastn -db /genome/nt  -num_threads 12 ...
 

Make sure that the blastn -num_threads flag matches the #PBS ncpus value, because the latter is the number of cpus you allocation will be charged for using. Save the file and exit the editor.

Submit the batch job and look at the batch queues:

qsub myblast.csh

The PBS system will return a job number for your batch submission. To see a list of all of your pending or running batch jobs, run the command:

qstat -u  myUANetID

After submitting a batch job, you can logout of UA HPC/HTC, as an email message will be sent when your job has been run.

 

5. How can I monitor progress of my PBS job?

Run the command: 

qstat -u  myUANetID

For details on a specific job (including CPU and memory usage), run the command: 

qstat -f  JobNumber | head

where JobNumber is the number returned by the qsub command.

 

6.  What options should I use in my blast command?

There are many options to use with the blast commands. To see a brief list of them, type the blast command followed by -h (for a longer description of the options, use -help.)  Example:  blastn -help

It is especially important to understand these options:

   -evalue  EvalueCutoff

        (to reduce poor alignments reported, use a value less than or equal to 1e-3. For only very good alignments, use 1e-100.  For blasting very short queries such as primer sequences, use -evalue 10)

   -num_descriptions  20

        (to limit the number of one-line descriptions of hits - the default is 500!)

   -num_alignments  20

        (to limit the number of pairwise alignments shown - default is 250!)

   -show_gis

        (show GI numbers in hit descriptions - allows linking back to NCBI)

 

7. How will I know the results of my BLAST run?

After BLAST is complete you will receive an email message from PBS (sent by 'root'.)  If this message reports Exit status = 0, the job ran without errors. Otherwise you need to look at the batch error file.

The batch system creates an output file and an error file, named by appending filename extensions consisting of the letter o or e respectively and the job number to the jobname specified in the PBS script.  Example:

        -rw------- 1 sjmiller nirav   0 Feb  3  2011 B11.e208756
        -rw------- 1 sjmiller nirav 254 Feb  3  2011 B11.o208756

The output file contains the standard output from the job, possibly including the line ‘Warning: no access to tty (Bad file number).’ This warning can be ignored. If the job had no errors, the error file will be empty.  Otherwise, the error file will contain more specific information about the error that occurred. It also may include the line ‘stty: tcgetattr: Not a typewriter’, which can be ignored.

 

8. What is a likely cause of errors such as: ‘[blastall] FATAL ERROR: blast: Unable to open input file /home2/u22/sjmiller/Sfile1.tfa’?

There may be a problem in the path or the name of the query input file in the csh batch submit file. Find the correct path with the pwd command and remember that filenames and directory paths are case sensitive. Another (much less likely) possibility is that the permissions on the file are such that it cannot be read. Use 'ls -l' to see the permissions and if necessary, use the 'chmod +r file' command to add read permission.

 

9.  What are possible causes of errors such as 'Segmentation violation' or 'Bus error'?

If your BLAST job was running in the early hours of the first day of the month, and you are using one of the /genome databases that are automatically updated during this time, your job could result in a Segmentation fault or Bus Error.

Another possibility is that if you used a file transfer to move a large input file to the supercomputer, there may be NULL characters in the input file. To check for any NULL characters, run the command:

      tr "\000" "@" < input.fa | grep -n "@"

The output will show line numbers on which NULL characters are found. If you see evidence of NULL characters, download the file again.

 

10. What is a likely cause of errors such as: '/pbs/mom1/mom_priv/jobs/1754...: Command not found.’?

This type of error can occur if you have edited your script or data files on a PC and transferred them to a Unix system. Check for control-M characters (^M) at the end of lines in your script by using the command: .

If ^M’s are present, remove them with the command:
     tr -d "\015" <PCscript.csh >UNIXscript.csh
If the file was edited on a Mac and transferred to marin, use the command:
     tr –s “\015” "\012" <MACscript.csh >UNIXscript.csh

After running the tr command, submit UNIXscript.csh with the qsub command.

 

11.  What can cause errors like this: ‘[blastall] ERROR: Repeat: SeqPortNew: lcl|Repeat stop(1289) >= len(122)...‘?

This type of error can result from a blast database that was built from a file that is not in true FASTA format. Proper FASTA format considers only the first word after the > as the sequence name and anything after the first space is considered annotation. For example, if an input file looks like this:
>Repeat Id: 1234567
ccgcgatgctagtca...
>Repeat Id: 2345678
tttgtcagtgtcg...
all of the sequences have the same name, 'Repeat'. To remove the common portion of the names, the following sed command can be used:
sed -e "s/Repeat Id: //" < Input.fa >NewInput.fa
After that the NewInput.fa file can be used as input to the formatdb command.

 


12.  What can cause PBS to terminate blast jobs immediately and/or produce errors such as: 

terminate called after throwing an instance of 'St9bad_alloc'
  what():  std::bad_alloc
terminate called recursively.‘

This type of error can occur if the job needs more memory or virtual memory than your submit script requested.  Blast runs that return a very large number of hits can cause this problem, so unless you need to see 500 hits and 250 alignments, it is recommended that you use the -num_descriptions and -num_alignment options to blast to limit the output (see the question above on blast options.)  If you do need to increase the memory available to the threads of your blast run, try adding to the submit script:   #PBS -l pvmem=23Gb  (assuming that you have also requested ncpus=12:mem=23Gb.)  Another example to look at is in the file /genome/blastp.csh.