Running BLAST Jobs on the UA marin Supercomputer

General information on running supercomputer jobs can be found at: http://bcf.arl.arizona.edu/high-throughput-computing/high-performance-computing-system-marin-2.html

It is recommended that you read through the New High Performance Computing FAQ in addition to this BLAST FAQ.

 

Note: placeholder items (shown in italics on the original page, such as file, mylogin, and jobnumber) should not be typed literally; substitute the appropriate file name, login name, or job number.

Also be sure to read Question 5 as it includes information on when NOT to run a Supercomputer BLAST!

1. How do I get an account on the Supercomputer?

You first need a faculty sponsor.  Apply for an account on the campus Supercomputer here:
http://www.hpc.arizona.edu/2007/accounts.shtml

2. What input file format is needed?

Input files must be in FASTA format (nucleotide or protein), and one input file may contain many sequences. If the files have been created or edited on a Windows machine or a Mac, they may contain invisible end-of-line characters that can cause problems on a Unix system. See Question 14 for commands that will allow you to check for and remove these end-of-line characters. Also see Question 4 if you wish to build your own BLAST database from a FASTA file.
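
A quick sanity check on an input file can catch problems before a job is queued. The following is a minimal sketch, written for sh/bash rather than csh; the function names and the sample sequences are made up for illustration. It counts the sequences in a file and confirms that the first non-empty line is a FASTA header.

```shell
#!/bin/sh
# Count the sequences in a FASTA file by counting header lines
# (lines that start with ">").
count_fasta_records() {
    grep -c '^>' "$1"
}

# Succeed only if the first non-empty line of the file is a FASTA header.
starts_with_header() {
    first=$(grep -v '^[[:space:]]*$' "$1" | head -n 1)
    case "$first" in
        '>'*) return 0 ;;
        *)    return 1 ;;
    esac
}

# Demo on a small made-up file (hypothetical sequences).
demo=$(mktemp)
printf '>seq1 first example\nACGTACGT\n>seq2 second example\nTTGACA\n' > "$demo"
echo "records: $(count_fasta_records "$demo")"
if starts_with_header "$demo"; then
    echo "looks like FASTA"
fi
rm -f "$demo"
```

On a real file, call the functions with the full path of your query file in place of the temporary demo file.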

3. What is the format of the output?

There are several options for the format of the output. By default, BLAST output contains, for each query sequence, a list of one-line descriptions of the hit (or subject) sequences, followed by the pairwise alignments. Simpler tab-delimited output is available with the -m 8 option of the blastall command, or with -m 9, which adds comment lines (see Question 9 for other options). XML output can be requested with the -m 7 option. For more details on the options for blastall, see the NCBI page: http://www.ncbi.nlm.nih.gov/BLAST/docs/blastall.html
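
Tab-delimited output is convenient because it can be filtered with standard Unix tools. The sketch below (sh/bash, with a hypothetical function name and made-up sample hits) prints, for each query, the first listed hit whose e-value is at or below a cutoff. It relies on the standard 12-column layout of -m 8 output.

```shell
#!/bin/sh
# Columns of blastall -m 8 output (tab-separated):
#  1 query id    2 subject id   3 % identity   4 alignment length
#  5 mismatches  6 gap opens    7 q. start     8 q. end
#  9 s. start   10 s. end      11 e-value     12 bit score

# Print query id, subject id and e-value of the first hit listed for each
# query that passes the cutoff.
# $1 = tabular BLAST file, $2 = e-value cutoff (for example 1e-10)
top_hits() {
    awk -F '\t' -v OFS='\t' -v cutoff="$2" '
        ($11 + 0) <= (cutoff + 0) && !seen[$1]++ { print $1, $2, $11 }
    ' "$1"
}

# Demo on three made-up hit lines.
demo=$(mktemp)
printf 'q1\tsubjA\t98.5\t200\t3\t0\t1\t200\t11\t210\t1e-50\t380\n'  > "$demo"
printf 'q1\tsubjB\t80.0\t100\t20\t0\t1\t100\t5\t104\t2e-05\t60\n'  >> "$demo"
printf 'q2\tsubjC\t75.0\t60\t15\t0\t1\t60\t9\t68\t0.5\t30\n'       >> "$demo"
top_hits "$demo" 1e-10
rm -f "$demo"
```

With the demo data only the q1/subjA line passes the 1e-10 cutoff.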

4. Which BLAST databases are available on the Supercomputer?

On marin, the /genome directory holds several BLAST databases, including uniprot, est, nt, and nr. These are updated periodically (see Question 5). If you need any of these databases updated, or need additional databases loaded onto the campus supercomputer, contact the support staff by e-mail (the address is not reproduced here). To see exactly which nucleotide databases are available, run the following two commands:

        ls -l /genome/*.nsq         
        ls -l /genome/*/*.nsq

For protein databases, type:

        ls -l /genome/*.psq
        ls -l /genome/*/*.psq

It is also possible to build your own blast database using the /usr/local/blast/bin/formatdb command.  For options, type:

/usr/local/blast/bin/formatdb -

A formatdb.log file will be created, containing the status of the formatdb command.

5. How often are the BLAST databases updated on the Supercomputer?

In the early morning hours of the first day of each month, the nr, nt, and est databases are downloaded from NCBI. The uniprot_sprot, uniprot_trembl, and Pfam databases are also updated at this time. If a BLAST job is running with one of these databases while the update is in progress, it is likely that the results will be invalidated. For this reason, DO NOT START batch BLAST jobs using these databases on the last day or early on the first day of the month!
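
The timing rule above can be checked before submitting a job. This is only a sketch under stated assumptions: the function name is hypothetical, the script is written for sh/bash, and "date -d tomorrow" is GNU date syntax that may not exist on every system. It is also deliberately conservative and treats the whole first day of the month as unsafe.

```shell
#!/bin/sh
# Succeeds (exit status 0) when a job should NOT be started: today is the
# first of the month, or tomorrow is the first (so today is the last day).
# $1 = today's day of month, $2 = tomorrow's day of month (two digits each)
in_update_window() {
    [ "$1" = "01" ] || [ "$2" = "01" ]
}

today=$(date +%d)
# GNU date only (an assumption). If this fails, the last-day check is skipped.
tomorrow=$(date -d tomorrow +%d 2>/dev/null) || tomorrow="??"

if in_update_window "$today" "$tomorrow"; then
    echo "Monthly update window: do not start BLAST jobs on /genome databases."
else
    echo "Outside the monthly update window."
fi
```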

 

6. How do I transfer files to the Supercomputer?

On amadeus, use gzip to compress the file (a .gz extension is added by gzip):
gzip   file

Use the sftp command to copy the file to the Supercomputer:
sftp  mylogin@marin.hpc.arizona.edu (then enter the first 8 characters of your UANetID password)
put   file.gz
quit

7. How do I access the Supercomputer to run BLAST?

Use ssh to login to the Supercomputer.
ssh -X  mylogin@marin.hpc.arizona.edu (then enter the first 8 characters of your UANetID password)

You will see the command prompt change to [marin][~]> . (The name of the Supercomputer front-end is marin.)   Copy the sample batch script to your directory:
cp   /genome/marinblast.csh  myblast.csh

Unzip the file transferred from amadeus:
gunzip   file.gz

Use the pwd (present working directory) command to find the full path of your home directory on marin; you will use this in your BLAST batch script. Use the va (view allocations) command to find your group ID, which is also needed in the batch script.

pwd
va

Open the batch script in the nedit editor:
nedit   myblast.csh  &

FIRST TIME ONLY:
Go to Preferences->Default Settings->Wrap and choose None
Go to Preferences->Save Defaults then click OK

Edit the file to specify MyJob, MyGroup, MyEmail, and BLAST input/output options. Make sure the input file is exactly right and use the full path for input and output files (full path is output by the pwd command).  Save the file and exit the editor.

Submit the batch job and look at the batch queues:
qsub  myblast.csh
qstat -a

Log out of marin:

logout

8. What do I need to know about the Supercomputer batch queues?

The primary queue is named 'default'. The sample script marinblast.csh requests 4 processors for 24 hours of walltime, for a total CPU request of 96 hours. This is reflected in the lines:

#PBS -q default
#PBS -l ncpus=4
#PBS -l cput=96:0:0
#PBS -l walltime=24:0:0
time /usr/local/blast/bin/blastall -a 4 ...
 

It is recommended that you first run a smaller-scale job using 4 processors until you know that your batch script is correct. Jobs that request more processors typically wait longer in the queue, so confirming that your script runs and produces the correct output with 4 processors avoids wasted waiting time and charges for processing time you do not need. After that, you can scale up to a larger number of CPUs. Note that walltime is equal to cput divided by the number of processors requested (here, 96 / 4 = 24 hours). Make sure that the blastall -a flag matches the #PBS ncpus value, because the latter is the number of CPUs your allocation will be charged for using.
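
The two consistency rules above (ncpus must equal the blastall -a value, and cput is processors times walltime) can be checked mechanically. The following sh/bash sketch uses hypothetical helper names and a made-up demo script; the blastall arguments after -a in the demo are placeholders, not the contents of marinblast.csh.

```shell
#!/bin/sh
# Extract N from the first "#PBS -l ncpus=N" line of a batch script.
pbs_ncpus() {
    sed -n 's/^#PBS[[:space:]]*-l[[:space:]]*ncpus=\([0-9][0-9]*\).*/\1/p' "$1" |
        head -n 1
}

# Extract N from the "-a N" flag on the first uncommented blastall line.
blast_threads() {
    grep 'blastall' "$1" | grep -v '^#' | head -n 1 |
        sed -n 's/.*[[:space:]]-a[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# cput hours = number of processors x walltime hours.
cput_hours() {
    echo $(( $1 * $2 ))
}

# Demo on a made-up batch script.
demo=$(mktemp)
cat > "$demo" <<'EOF'
#PBS -q default
#PBS -l ncpus=4
#PBS -l cput=96:0:0
#PBS -l walltime=24:0:0
time /usr/local/blast/bin/blastall -a 4 -p blastn -d nt -i in.fa -o out.bln
EOF
if [ "$(pbs_ncpus "$demo")" = "$(blast_threads "$demo")" ]; then
    echo "ncpus and -a agree: $(pbs_ncpus "$demo")"
else
    echo "MISMATCH between #PBS ncpus and blastall -a"
fi
echo "cput hours for 4 CPUs x 24 h: $(cput_hours 4 24)"
rm -f "$demo"
```

Running the same two functions on your own myblast.csh before qsub is a cheap way to avoid being charged for processors that blastall does not use.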

The qsub command submits a batch script to the queuing system and prints a batch job number. To delete a job from the queue, use the command: qdel jobnumber

9. What are all of those options for the blastall command?

There are many options - to see a list of them all type the command:

   /usr/local/blast/bin/blastall  -

Although these are called options, several of them are required:

   -i  Inputfile

   -o Outputfile

   -a NumberOfProcessors

   -p BlastProgram (blastn, blastx, etc.)

   -d BlastDatabase (nt, nr, uniprot, etc.)

And others are a very good idea to include to keep your output reasonable:

   -e EvalueCutoff (to reduce poor alignments reported, use a value less than or equal to 1e-3.  For only very good alignments, use 1e-100)

   -v 20 (to limit the number of one-line descriptions of hits - the default is 500!)

   -b 20 (to limit the number of pairwise alignments shown - default is 250!)

   -I T (show GI numbers in hit descriptions - allows linking back to NCBI)

Most of the time you will want to include this as well:

   -F F  (turn OFF filtering of low-complexity sequence, such as repetitive elements)
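
Putting the required and recommended options together, a complete command line might be assembled as sketched below. The file names and the choice of blastn against nt are hypothetical; only the flags themselves come from the list above. The sketch (sh/bash) prints the command instead of running it, so that it can be inspected and then pasted into the batch script.

```shell
#!/bin/sh
# Hypothetical file names; replace with the full paths reported by pwd.
QUERY=/home/mylogin/input.fa
OUTPUT=/home/mylogin/MyBLASTout.bln
NCPUS=4          # must equal the "#PBS -l ncpus" value in the batch script

# Build the command as text so it can be checked before use.
blast_command() {
    echo "/usr/local/blast/bin/blastall -p blastn -d nt" \
         "-i $QUERY -o $OUTPUT -a $NCPUS" \
         "-e 1e-3 -v 20 -b 20 -I T -F F"
}

blast_command
```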

10. What parameters should I use to BLAST short sequences (such as potential primers)?

    -e 100   or  1000

    -W  7   (word size 7) 

11. How will I know the results of my BLAST run?

After BLAST is complete you will receive an email message. If the message reports Exit status = 0, the job ran without errors. Otherwise you need to look at the batch error file. The batch system creates an output file and an error file, named by appending the letter o or e and the job number to the jobname specified in the PBS script. The output file contains the standard output from the job, possibly including the line ‘Warning: no access to tty (Bad file number).’ This warning can be ignored. The error file will contain more specific information about the error that occurred. It also may include the line ‘stty: tcgetattr: Not a typewriter’, which can be ignored.
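
To find and read the batch files described above, something like the following sh/bash sketch may help. It assumes the usual PBS naming pattern JobName.o<number> and JobName.e<number>; the function names and the demo error text are made up. The second function hides the two harmless messages so that only real errors remain.

```shell
#!/bin/sh
# List the PBS output (.o) and error (.e) files for a job name in the
# current directory.
job_files() {
    ls "$1".o[0-9]* "$1".e[0-9]* 2>/dev/null
}

# Print the contents of the error file(s) without the two messages that
# can safely be ignored.
real_errors() {
    cat "$1".e[0-9]* 2>/dev/null |
        grep -v -e 'no access to tty' -e 'Not a typewriter'
}

# Demo in a scratch directory with made-up file contents.
demo=$(mktemp -d)
(
    cd "$demo" || exit 1
    : > MyJob.o1754
    printf 'stty: tcgetattr: Not a typewriter\n[blastall] FATAL ERROR: made-up example\n' > MyJob.e1754
    job_files MyJob
    real_errors MyJob
)
rm -rf "$demo"
```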

12. What is a likely cause of errors such as:

‘[blastall] FATAL ERROR: blast: Unable to open input file /home2/u22/sjmiller/Sfile1.tfa’ ?

There may be a problem in the path or the name of the query input file in the csh batch submit file.  Find the correct path with the pwd command and remember that filenames and directory paths are case sensitive.  Another (much less likely) possibility is that the permissions on the file are such that it cannot be read.  Use 'ls -l' to see the permissions and if necessary, use the 'chmod +r file' command to add read permission.
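
The checks described above can be combined into one small test to run before submitting. This is a sketch for sh/bash with a hypothetical function name; the check for an empty file is an extra convenience that is not part of the original answer.

```shell
#!/bin/sh
# Report why blastall might be unable to open a query file.
# $1 = full path of the input file
check_input() {
    if [ ! -e "$1" ]; then
        echo "missing: $1 (check the path and upper/lower case)"
        return 1
    elif [ ! -r "$1" ]; then
        echo "unreadable: $1 (try: chmod +r $1)"
        return 2
    elif [ ! -s "$1" ]; then
        echo "empty: $1"
        return 3
    fi
    echo "ok: $1"
}

# Demo on a temporary file that exists and has content.
demo=$(mktemp)
echo '>seq1' > "$demo"
check_input "$demo"
rm -f "$demo"
```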

13. What are possible causes of errors such as 'Segmentation violation' or 'Bus error'? 

If your BLAST job was running in the early hours of the first day of the month and used one of the /genome databases that are automatically updated during this time (see Question 5), the update could cause a Segmentation violation or Bus error.

Another possibility is that if you used a file transfer to move a large input file to the supercomputer, there may be NULL characters in the input file.  To check for any NULL characters, run the command:

tr "\000" "@" < input.fa | grep -n "@" 

The output will show line numbers on which NULL characters are found. 
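
If the input file legitimately contains "@" characters (for example in header lines), the command above reports false positives. An alternative sketch for sh/bash, with a hypothetical function name, counts the NULL bytes directly. A non-zero count suggests transferring the file again.

```shell
#!/bin/sh
# Count the NULL (\000) bytes in a file: delete every other byte, then
# count what is left.
count_nul_bytes() {
    tr -cd '\000' < "$1" | wc -c | tr -d ' '
}

# Demo on a made-up file that contains one NULL byte.
demo=$(mktemp)
printf '>seq1\nACGT\000ACGT\n' > "$demo"
echo "NULL bytes found: $(count_nul_bytes "$demo")"
rm -f "$demo"
```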

 

14. What is a likely cause of errors such as: ‘/pbs/mom1/mom_priv/jobs/1754...: Command not found.’ ?

This type of error can occur if you have edited your script or data files on a PC and transferred them to a Unix system. Check for control-M characters (^M) at the end of lines in your script by using the command: 'cat -vet script'.

If ^M's are present, remove them with the command:

            tr -d "\015" < PCscript.csh > UNIXscript.csh

If the file was edited on a Mac and transferred to marin, use the command:

            tr -s "\015" "\012" < MACscript.csh > UNIXscript.csh


After running the tr command, submit UNIXscript.csh with the qsub command.
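
When it is not clear which kind of line ending a file has, the two cases can be told apart by counting carriage returns and line feeds. The sketch below is for sh/bash and uses hypothetical function names. For Mac-style files it uses plain tr without -s, which keeps blank lines instead of squeezing them.

```shell
#!/bin/sh
# Count occurrences of one byte (given as an octal escape) in a file.
count_byte() {
    tr -cd "$1" < "$2" | wc -c | tr -d ' '
}

# Print "unix" (LF only), "dos" (CR+LF) or "mac" (CR only).
line_ending_type() {
    cr=$(count_byte '\015' "$1")
    lf=$(count_byte '\012' "$1")
    if [ "$cr" -eq 0 ]; then
        echo unix
    elif [ "$lf" -eq 0 ]; then
        echo mac
    else
        echo dos
    fi
}

# Write a Unix-style copy of $1 to $2.
to_unix() {
    case "$(line_ending_type "$1")" in
        dos) tr -d '\015' < "$1" > "$2" ;;
        mac) tr '\015' '\012' < "$1" > "$2" ;;
        *)   cp "$1" "$2" ;;
    esac
}

# Demo on a made-up two-line script with PC line endings.
demo=$(mktemp)
fixed=$(mktemp)
printf '#PBS -q default\r\necho hello\r\n' > "$demo"
echo "before: $(line_ending_type "$demo")"
to_unix "$demo" "$fixed"
echo "after:  $(line_ending_type "$fixed")"
rm -f "$demo" "$fixed"
```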

 

15. What can cause errors like this:

[blastall] ERROR: Repeat: SeqPortNew: lcl|Repeat stop(1289) >= len(122)...?

This type of error can result from a blast database that was built from a file that is not in true FASTA format.  Proper FASTA format considers only the first word after the > as the sequence name and anything after the first space is considered annotation.  For example, if an input file looks like this:

>Repeat Id: 1234567
ccgcgatgctagtca...
>Repeat Id: 2345678
tttgtcagtgtcg...

all of the sequences have the same name, 'Repeat'.  To remove the common portion of the names, the following sed command can be used:

sed -e "s/Repeat  *Id:  *//" < Input.fa > NewInput.fa

After that the NewInput.fa file can be used as input to the formatdb command.
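
Before running formatdb, it can be worth checking whether any sequence names are repeated. The following sh/bash sketch uses a hypothetical function name and made-up sequences. It takes the first word after ">" on each header line and prints every name that occurs more than once; no output means all names are unique.

```shell
#!/bin/sh
# Print each sequence name (first word after ">") that appears more than
# once in a FASTA file.
duplicate_names() {
    grep '^>' "$1" | sed 's/^>//' | awk '{ print $1 }' | sort | uniq -d
}

# Demo: two made-up records that share the name "Repeat".
demo=$(mktemp)
printf '>Repeat Id: 1234567\nccgcgatg\n>Repeat Id: 2345678\ntttgtcag\n' > "$demo"
echo "duplicated names:"
duplicate_names "$demo"
rm -f "$demo"
```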

16. How do I retrieve my BLAST output file from the supercomputer (for example, if the blast output is in the file MyBLASTout.bln)?

ssh -X  mylogin@marin.hpc.arizona.edu (then enter the first 8 characters of your UANetID password)
gzip  MyBLASTout.bln
sftp  mylogin@amadeus.biosci
put  MyBLASTout.bln.gz
quit
logout

17. How can I view and organize my BLAST output?

SCOOTaR is a browser-based Sequence Comparison Output Organizing Tool that can help you filter, sort, and organize your BLAST/FASTA/HMMER output. To use SCOOTaR you need a MySQL login name and password on amadeus. See http://bcf.arl.arizona.edu/resources/faqs/SCOOTaRFAQs.php for more information.

Unzip the output file:
gunzip  MyBLASTout.bln.gz

Load the BLAST results file into SCOOTaR using your mysql login/password/database (type all on one line):
loadscoot -v mysqldb /home/my_dept/my_login/MyBLASTout.bln  projectname

Go to http://scoot.biosci.arizona.edu

Click the SCOOT Tab, and you will see 3 smaller Tabs: Load Data, View Results, User Manual. Read through the User Manual and play with SCOOT.

Some caveats:
At present, BLAST result files must be imported into SCOOT either by using the loadscoot command on amadeus as described above, or by first downloading the BLAST results to a PC and loading them into SCOOT from there. You can View Results from a Mac or a PC, but for now loading works only from a PC or amadeus.

18. Where can I get help with UNIX commands?

http://bcf.arl.arizona.edu/resources/docs/unix.php