FAQs - How to Run BLAST Jobs on the Supercomputer

 

Note: items in italics should not be typed literally – instead substitute the appropriate file name or login name

General information on running supercomputer jobs can be found at: http://bcf.arl.arizona.edu/resources/faqs/hp_computing.php

It is recommended that you read through the High Performance Computing FAQ in addition to this BLAST FAQ.

Also be sure to read Question 5 as it includes information on when NOT to run a Supercomputer BLAST!

1. How do I get an account on the Supercomputer?

Apply for an account on the campus Supercomputer:
http://www.hpc.arizona.edu/access.shtml

2. What input file format is needed?

Input files must be in Fasta format (nucleotide or protein), and there may be many sequences in one input file. If the files have been created or edited on a Windows machine or a Mac, they may contain invisible end-of-line characters that can cause problems on a Unix system.  See Question 11 for commands that will allow you to check for and remove these end-of-line characters.   Also see Question 12 if you wish to build your own BLAST database from a FASTA file.
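A quick sanity check before submitting a job is to count the sequences in the input file. The sketch below is written in sh syntax (an assumption; the interactive commands in this FAQ are typed at a csh prompt), and sample.fa with its two sequences is made up for illustration.

```shell
#!/bin/sh
# Create a tiny two-sequence FASTA file for illustration (made-up sequences).
printf '>seq1 first example\nACGTACGT\n>seq2 second example\nTTGACC\n' > sample.fa

# Each sequence starts with a '>' header line, so counting those lines
# gives the number of sequences in the file.
count_fasta_seqs() {
    grep -c '^>' "$1"
}

count_fasta_seqs sample.fa
```

If the count is far from what you expect, the file is probably not in FASTA format or was damaged in transfer.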

3. What is the format of the output?

There are several options for the format of the output.  By default BLAST output contains, for each query sequence, a list of one-line descriptions of the hit (or subject) sequences, followed by the pairwise alignments.  It is possible to get a simpler tab-delimited output using the -m 8 or -m 9 option with the blastall command (see below.)  XML output can also be specified, with the -m 7 option.  For more details on the options for blastall, see the NCBI page: http://www.ncbi.nlm.nih.gov/BLAST/docs/blastall.html
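The tab-delimited output from -m 8 has 12 columns per hit, which makes it easy to filter with standard Unix tools. The following is a minimal sketch in sh syntax, not part of the BLAST distribution: the file name hits.m8 and both rows are invented for illustration, and filter_by_evalue is a hypothetical helper.

```shell
#!/bin/sh
# Two made-up rows in the 12-column tabular layout written by -m 8:
# query, subject, %identity, alignment length, mismatches, gap openings,
# query start, query end, subject start, subject end, E-value, bit score.
printf 'q1\tsubjA\t98.50\t200\t3\t0\t1\t200\t11\t210\t1e-50\t380\n' >  hits.m8
printf 'q1\tsubjB\t71.00\t60\t17\t1\t5\t64\t100\t159\t0.5\t32.1\n'  >> hits.m8

# filter_by_evalue FILE CUTOFF
# Print only the rows whose E-value (column 11) is at or below CUTOFF.
filter_by_evalue() {
    awk -F '\t' -v max="$2" '($11 + 0) <= (max + 0)' "$1"
}

filter_by_evalue hits.m8 1e-3
```

With the sample rows above, only the subjA hit passes the 1e-3 cutoff.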

4. Which BLAST databases are available on the Supercomputer?

The /genome directory on aura holds several BLAST databases, including swissprot, est, nt, and nr. These are updated periodically. If you need any of these databases updated, or need additional databases loaded onto the campus Supercomputer, you may build them yourself using the /usr/local/BLAST/formatdb command, or contact Susan Miller (sjmiller@email.arizona.edu).

5. How often are the BLAST databases updated on the Supercomputer?

In the early morning hours of the first day of each month, the nr, nt, and est databases are downloaded from NCBI.  If there is a BLAST job running with one of these databases while the update is in progress, it is likely that the results will be invalidated.  For this reason, DO NOT START batch BLAST jobs using these databases on the last day or first day of the month!
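One way to avoid submitting during the update window is to test the date first. The sketch below, in sh syntax, is only an illustration: days_in_month and in_update_window are hypothetical helper names, and the check relies on the system clock of the machine where it runs.

```shell
#!/bin/sh
# days_in_month YEAR MONTH -> prints the number of days in that month
days_in_month() {
    year=$1
    month=${2#0}          # strip a leading zero, e.g. 02 -> 2
    case $month in
        4|6|9|11) echo 30 ;;
        2)
            if [ $((year % 4)) -eq 0 ] && { [ $((year % 100)) -ne 0 ] || [ $((year % 400)) -eq 0 ]; }; then
                echo 29
            else
                echo 28
            fi ;;
        *) echo 31 ;;
    esac
}

# in_update_window YEAR MONTH DAY -> succeeds on the first or last day of the month
in_update_window() {
    day=${3#0}
    [ "$day" -eq 1 ] || [ "$day" -eq "$(days_in_month "$1" "$2")" ]
}

if in_update_window "$(date +%Y)" "$(date +%m)" "$(date +%d)"; then
    echo "Database update window: wait before submitting nr, nt, or est jobs."
else
    echo "Outside the update window."
fi
```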

 

6. How do I transfer files to the Supercomputer?

On Amadeus, use gzip to compress the file (a .gz extension is added by gzip):
gzip   file

Use the sftp command to copy the file to the Supercomputer:
sftp  mylogin@hpc.arizona.edu (then enter password)
put   file.gz
quit

7. How do I access the Supercomputer to run BLAST (for example from biodesk session 20)?

Use the xhost command on amadeus to allow the editor on the Supercomputer to display in your biodesk session (or use the pico editor on the Supercomputer):
xhost  +

Use ssh to log in to the Supercomputer:
ssh  mylogin@hpc.arizona.edu (then enter password)

You will see the command prompt change to [aura][~]> . (The name of the Supercomputer is aura.)   Copy the sample batch script to your directory:
cp   /genome/hpcblast.csh  myblast.csh

Unzip the file transferred from Amadeus:
gunzip   file.gz

Use the pwd command to find the full path of your home directory on aura – you will use this in your blast batch script.
pwd

Set up to edit the batch script by setting the DISPLAY variable to reference your amadeus session (use your current session number):
setenv   DISPLAY  amadeus.biosci:20
nedit   myblast.csh

FIRST TIME ONLY:
Go to Preferences->Default Settings->Wrap and choose None
Go to Preferences->Save Defaults then click OK

Edit the file to specify Jobname, email, and BLAST input/output options. Make sure the input file is exactly right and use the full path for input and output files (full path is output by pwd command). The blastall command must be all on one line! Save the file and exit the editor.

Submit the batch job and look at the batch queues:
qsub  myblast.csh
qstat -a

Log out of aura:
logout

8. What do I need to know about the Supercomputer batch queues?

There are several queues available. The sample script hpcblast.csh uses the 4-processor queue on node zero. This is reflected in the lines:
#PBS -l nodes=aura0:ppn=4
#PBS -q aura_4p
#PBS -l cput=240:0:0
#PBS -l walltime=60:0:0
time /usr/local/BLAST/blastall -a 4 ...

It is recommended that you run a smaller-scale job on the 4-processor queue until you know that your batch script is correct. Jobs typically wait longer to get into the 8- or 16-processor queues, and you can avoid wasting this time by first confirming through the 4-processor queue that your script runs and produces the correct output. After that, you can scale up to 8 or 16 processors by changing the lines above according to the guidelines at: http://www.super.arizona.edu/batch-queues.shtml

Note that walltime is the maximum amount of time allowed per processor, and cput is this number multiplied by the number of processors (in the sample above, 60 hours x 4 processors = 240 hours). The nodes= name for 8 or 16 processors is aura6, not aura0.

The qsub command submits a batch script to the queuing system and prints a batch job number. To delete a job from the queue, use the command: qdel jobnumber
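The relationship between the resource lines can be sketched as a small sh helper that derives cput from walltime and the processor count. This is an illustration, not part of hpcblast.csh: pbs_resource_lines is a hypothetical name, the queue name must be passed in because only aura_4p is named in this FAQ, and setting ppn equal to the processor count on aura6 is an assumption to check against the batch-queues page.

```shell
#!/bin/sh
# pbs_resource_lines PROCS WALLTIME_HOURS QUEUE
# Prints the PBS resource directives for a job, deriving cput from walltime.
pbs_resource_lines() {
    procs=$1
    wall_hours=$2
    queue=$3
    # The FAQ gives aura0 for the 4-processor queue and aura6 for 8 or 16.
    if [ "$procs" -eq 4 ]; then node=aura0; else node=aura6; fi
    # cput is the per-processor limit multiplied by the number of processors.
    cput_hours=$((wall_hours * procs))
    # Assumption: ppn equals the processor count for every queue.
    echo "#PBS -l nodes=${node}:ppn=${procs}"
    echo "#PBS -q ${queue}"
    echo "#PBS -l cput=${cput_hours}:0:0"
    echo "#PBS -l walltime=${wall_hours}:0:0"
}

# Reproduces the four directives from the sample script.
pbs_resource_lines 4 60 aura_4p
```

When scaling up, remember to change the -a value on the blastall line as well as the directives.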

9. What are all of those options for the blastall command?

There are many options - to see a list of them all type the command:

   /usr/local/BLAST/blastall  -

Although these are called options, several of them are necessary:

   -i  Inputfile

   -o Outputfile

   -a NumberOfProcessors

   -p BlastProgram (blastn, blastx, etc.)

   -d BlastDatabase (nt, nr, uniprot, etc.)

And others are a very good idea to include to keep your output reasonable:

   -e EvalueCutoff (to reduce poor alignments reported, use a value less than or equal to 1e-3.  For only very good alignments, use 1e-100)

   -v 10 (to limit the number of one-line descriptions of hits - the default is 500!)

   -b 10 (to limit the number of pairwise alignments shown - default is 250!)

   -I T (show GI numbers in hit descriptions - allows linking back to NCBI)

Most of the time you will want to include this as well:

   -F F  (turn OFF filtering of low-complexity sequence, such as repetitive elements)
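Put together, the options above form one long command line. The sh sketch below only prints such a line so that it can be pasted into the batch script; it does not run BLAST. blast_command is a hypothetical helper, the /home/mylogin paths are placeholders for the directory reported by pwd, and referring to the database as /genome/nt is an assumption based on the /genome directory mentioned in Question 4.

```shell
#!/bin/sh
# blast_command PROGRAM DATABASE INPUT OUTPUT PROCS
# Prints a single-line blastall command using the options listed above.
blast_command() {
    echo "/usr/local/BLAST/blastall -p $1 -d $2 -i $3 -o $4 -a $5 -e 1e-3 -v 10 -b 10 -I T -F F"
}

# Hypothetical paths: substitute your own directory and file names.
blast_command blastn /genome/nt /home/mylogin/query.fa /home/mylogin/query.bln 4
```

Keeping the command on one line matters because the batch script treats each line as a separate command.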

10. How will I know the results of my BLAST run?

After BLAST is complete you will receive an email message. If the message reports Exit status = 0, the job ran without errors. Otherwise you need to look at the batch error file. The batch system creates an output file and an error file, named by appending the letter o or e and the job number to the jobname specified in the PBS script. The output file contains the standard output from the job, possibly including the line ‘Warning: no access to tty (Bad file number).’ This warning can be ignored. The error file will contain more specific information about the error that occurred. It also may include the line ‘stty: tcgetattr: Not a typewriter’, which can be ignored.
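When reading the batch files it can help to hide the two harmless lines first. The following sh sketch assumes the usual PBS naming pattern of jobname.e followed by the job number; the file myblast.e1754 and its second line are made up for illustration, and real_messages is a hypothetical helper.

```shell
#!/bin/sh
# Made-up error file containing one harmless line and one line worth reading.
printf '%s\n' 'stty: tcgetattr: Not a typewriter' \
              'example: a line that needs attention' > myblast.e1754

# real_messages FILE
# Print the file without the two lines this FAQ says can be ignored.
# -F treats the patterns as fixed strings, so the parentheses are literal.
real_messages() {
    grep -v -F -e 'stty: tcgetattr: Not a typewriter' \
               -e 'Warning: no access to tty (Bad file number).' "$1"
}

real_messages myblast.e1754
```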

11. What is a likely cause of errors such as:

‘/pbs/mom1/mom_priv/jobs/1754.aura.SC: Command not found.’ ?

This type of error can occur if you have edited your script or data files on a PC and transferred them to a Unix system. Check for control-M characters (^M) at the end of lines in your script by using the command: 'cat -vet script'.

If ^M’s are present, remove them with the command:

tr   -d   "\015"   <PCscript  >UNIXscript

If the file was edited on a Mac and transferred to aura, use the command:

tr   -s   "\015"   "\012"   <MACscript   >UNIXscript
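The effect of both conversions can be tried on throwaway files before touching a real script. This is a minimal sketch in sh syntax: the file names and contents are invented, and has_cr is a hypothetical helper that reports whether a file still contains carriage returns (the characters shown as ^M).

```shell
#!/bin/sh
# Build two small test files: one with PC (CR LF) line endings and one
# with old-style Mac (CR only) line endings.
printf 'echo one\r\necho two\r\n' > PCscript
printf 'echo one\recho two\r'     > MACscript

# has_cr FILE -> succeeds if FILE differs from a copy with all CRs removed,
# that is, if it contains at least one carriage return.
has_cr() {
    ! tr -d '\015' < "$1" | cmp -s - "$1"
}

tr -d '\015'        < PCscript  > UNIXscript_from_pc    # delete every CR
tr -s '\015' '\012' < MACscript > UNIXscript_from_mac   # turn CRs into newlines

for f in PCscript MACscript UNIXscript_from_pc UNIXscript_from_mac; do
    if has_cr "$f"; then echo "$f: contains ^M"; else echo "$f: clean"; fi
done
```

Note that the -s flag also squeezes runs of newlines into one, so blank lines in a Mac file are removed by the second command.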

12. What can cause errors like this:

[blastall] ERROR: Repeat: SeqPortNew: lcl|Repeat stop(1289) >= len(122)...?

This type of error can result from a blast database that was built from a file that is not in true FASTA format.  True FASTA format considers only the first word after the > as the sequence name and anything after the first space is considered annotation.  For example, if an input file looks like this:

>Repeat Id: 1234567

ccgcgatgctagtca...

>Repeat Id: 2345678

tttgtcagtgtcg...

all of the sequences have the same name, 'Repeat'!  To remove the common portion of the names, the following sed command can be used:

sed  -e  "s/Repeat Id: //"   < Input.fa   >NewInput.fa

After that the NewInput.fa file can be used as input to the formatdb command.
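Before running formatdb it may be worth checking a FASTA file for repeated sequence names. The sh sketch below builds a small file modeled on the example in this question (single spaces in the header lines) and lists any name that occurs more than once; duplicate_names is a hypothetical helper, not a BLAST tool.

```shell
#!/bin/sh
# A small file modeled on the example above (sequences shortened).
printf '>Repeat Id: 1234567\nccgcgatgctagtca\n>Repeat Id: 2345678\ntttgtcagtgtcg\n' > Input.fa

# duplicate_names FILE
# Print each sequence name (the first word after '>') that occurs more
# than once; print nothing if every name is unique.
duplicate_names() {
    grep '^>' "$1" | awk '{ print substr($1, 2) }' | sort | uniq -d
}

duplicate_names Input.fa                      # prints: Repeat

# Apply the sed fix from the text, then check again.
sed -e "s/Repeat Id: //" < Input.fa > NewInput.fa
duplicate_names NewInput.fa                   # prints nothing
```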

13. How do I retrieve my BLAST output file (for example, if the BLAST output is in the file MyBLASTout.bln)?

ssh  mylogin@hpc.arizona.edu (then enter password)
gzip  MyBLASTout.bln
sftp  mylogin@amadeus.biosci
put  MyBLASTout.bln.gz
quit
logout

14. How can I view and organize my BLAST output?

SCOOTaR is a browser-based Sequence Comparison Output Organizing Tool that can help you filter, sort and organize your BLAST/FASTA/HMMER output.  To use SCOOTaR you need a mysql login name and password on  amadeus.  See http://bcf.arl.arizona.edu/resources/faqs/SCOOTaRFAQs.php for more information.

Unzip the output file:
gunzip  MyBLASTout.bln.gz

Load the BLAST results file into SCOOTaR using your mysql login/password/database (type all on one line):
loadscoot -v mysqldb /home/my_dept/my_login/MyBLASTout.bln projectname

Go to http://scoot.biosci.arizona.edu

Click the SCOOT Tab, and you will see 3 smaller Tabs: Load Data, View Results, User Manual. Read through the User Manual and play with SCOOT.

Some caveats:
SCOOTaR works only in Internet Explorer.
At present, BLAST result files must be imported into SCOOT either by using the loadscoot command on amadeus as described above, or by first downloading the BLAST results to a PC and loading them into SCOOT from there. You can View Results from a Mac or a PC, but for now loading works only from a PC or amadeus.

15. Where can I get help with UNIX commands?

http://bcf.arl.arizona.edu/resources/docs/unix.php
