Migrating to the PLATFORM LSF batch system from PBS TORQUE

The batch queue system used on COSMA4 is being changed to the same version as that used on the new Dirac machine COSMA5. This gives Durham users a consistent platform for jobs on both machines. Note that we have a guaranteed fraction of time on COSMA5, and Durham users are also expected to gain Dirac time on COSMA5. The change also makes the day-to-day management of the dual system easier. Another expected benefit is the seamless movement of jobs between the two systems.

The new batch system is called PLATFORM LSF (Load Sharing Facility), or just LSF for short. It provides much the same functionality as the currently used PBS/TORQUE/MAUI, but with some subtle differences and changes in syntax that you will need to be aware of.

Submitting a batch job

In LSF you submit a batch job using the bsub command. This looks similar to qsub, but there is one big difference you need to take care with: bsub reads a job script only from standard input. It does not read a script file supplied on the command-line. In simple terms, where in the past you did:

    % qsub batchfile
    
you now need to do:
    % bsub < batchfile
    
or, to make it very clear what is going on:
    % cat batchfile | bsub
    
If you just put the script name on the command-line, as you did with qsub, LSF will try to run the script as a command. It will not look for job options in the script preamble.

Job options

As with qsub, bsub allows you to specify the number of cores etc. that a job requires using command-line arguments (see the bsub man page). You can also give these as directives in the preamble of a script. For bsub these are lines starting with the following (they are read from standard input, so they need the < syntax):

    #BSUB
    
Commonly used options under LSF are:
    #BSUB -L /bin/sh             # login shell used to initialise the job environment (/bin/tcsh, /bin/ksh etc.)
    #BSUB -n ncores              # number of cores required
    #BSUB -J myjob               # name of job
    #BSUB -o myjob.out           # log file for standard output (appends)
    #BSUB -e myjob.err           # log file for standard error (appends)
    #BSUB -oo myjob.log          # log file for standard output (overwrites)
    #BSUB -eo myjob.err          # log file for standard error (overwrites)
    #BSUB -q queue               # target queue for job execution
    #BSUB -W 0:15                # wall clock limit for job (hours:minutes)
    #BSUB -u user@durham         # email address for output
    #BSUB -N                     # send email even if writing a log file
    #BSUB -P project             # project to charge time
    #BSUB -x                     # give the job exclusive use of its nodes
    
Note that without -e or -eo, standard output and standard error are written to the same log file. Also note that the trailing "#" comments after each option are for explanation only and are not allowed in an actual script. Please remove them if you cut and paste this section.
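As an illustration, a complete input script using some of these options might look like the sketch below. The job name, core count, wall clock limit and the program my_program are placeholders rather than recommended values. The explanatory comments sit on their own lines, after the directives, so the file can be submitted as it stands.

```bash
#!/bin/bash
#BSUB -n 16
#BSUB -J myjob
#BSUB -oo myjob.log
#BSUB -eo myjob.err
#BSUB -q cosma
#BSUB -W 2:00
#BSUB -x

# The #BSUB lines above are read by "bsub < batchfile"; to the shell itself
# they are ordinary comments.
echo "Job $LSB_JOBID running on: $LSB_HOSTS"

# my_program is a hypothetical executable: replace it with your own command.
./my_program
```

Save this as, say, batchfile and submit it with "bsub < batchfile". Running "bsub batchfile" instead would treat the file as a command and silently ignore all of the #BSUB lines.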

Controlling jobs

You can see which jobs you have queued or have running using the command:

    % bjobs
    
If you want to see the jobs running for all users, then use:
    % bjobs -u all
    
The nodes in use, the reasons why a job is pending etc. are reported if you add the -l option flag. You can see all the jobs that you have run recently using the -a option flag.

To delete a job use the:

    % bkill jobid
    
command. The job id is reported by bjobs.

The queues that are available for submission are reported by the:

    % bqueues [-l]
    
command. The main queue on COSMA4 is called "cosma" and the serial queue is called "cordelia". The COSMA5 queue is called "cosma5".
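When scripting submissions, it can be handy to capture the job ID so that the job can later be checked with bjobs or removed with bkill. The sketch below assumes that bsub confirms a submission with a message of the form "Job <12345> is submitted to queue <cosma>.". The function name jobid_from_bsub is hypothetical and not part of LSF.

```bash
#!/bin/bash
# jobid_from_bsub (hypothetical helper): read bsub's confirmation message on
# standard input, assumed to look like
#   Job <12345> is submitted to queue <cosma>.
# and print only the numeric job ID (nothing if the line does not match).
jobid_from_bsub() {
    sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p'
}
```

For example:

    % jobid=$(bsub < batchfile | jobid_from_bsub)
    % bjobs -l $jobid
    % bkill $jobid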

Useful environment variables

    LSB_ERRORFILE: Name of the error file
    LSB_JOBID:     Batch job ID assigned by LSF.
    LSB_JOBINDEX:  Index of the job that belongs to a job array.
    LSB_HOSTS:     The list of hosts that are used to run the batch job.
    LSB_QUEUE:     The name of the queue the job is dispatched from.
    LSB_JOBNAME:   Name of the job.
    LS_SUBCWD:     The directory where the job was submitted.
    
As with PBS, these are available to any scripts or processes run by the job.
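The sketch below shows one way these variables might be used inside a job: it records some job details in a directory named after the job ID, under the submission directory. The run_<jobid> naming and the info.txt file are only illustrations. The fallback values after ":-" let the same script be test-run outside LSF, where these variables are unset.

```bash
#!/bin/bash
# Build a per-job output directory from LSF's job variables.
# Outside LSF the variables are unset, so fall back to local defaults.
subdir="${LS_SUBCWD:-$PWD}"
jobid="${LSB_JOBID:-local}"
jobname="${LSB_JOBNAME:-interactive}"

outdir="$subdir/run_$jobid"
mkdir -p "$outdir"

# Record where and how this job ran.
echo "Job $jobname ($jobid) from queue ${LSB_QUEUE:-none}" > "$outdir/info.txt"
echo "Hosts: ${LSB_HOSTS:-$HOSTNAME}" >> "$outdir/info.txt"
```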

Array jobs

Array jobs are created by submitting a job whose name is suffixed by a range. A simple example is:

    % bsub -J "myjob[1-4]" myscript
    
This creates four jobs with the same job ID but different job indices. Any other job options, such as the number of cores, should also be given on the command-line. In this case myscript will need to be executable and available on your $PATH (otherwise replace myscript with its full path name).

The index of a running job is determined using the:

    LSB_JOBINDEX
    
environment variable. Job arrays can also be specified in the preamble of an input script:
    #BSUB -J myjob[1-4]
    
To increment the index using a step greater than 1, you use the syntax:
    #BSUB -J myjob[1-10:2]
    
It is also possible to have discontiguous sequences of job indices:
    #BSUB -J myjob[1-10:2,20-30:3,40,50]
    
The job array index can also be used in the names of the output files via the special string %I:
    #BSUB -J myjob[1-4]
    #BSUB -o myjob%I.log
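
Inside the script, LSB_JOBINDEX is typically used to give each element of the array a different input. In the sketch below, the input files input1.dat to input4.dat, the input_file helper and the program my_program are all hypothetical. With these directives, each element's output goes to its own log, from myjob1.log to myjob4.log. Outside LSF the index defaults to 1.

```bash
#!/bin/bash
#BSUB -J myjob[1-4]
#BSUB -o myjob%I.log

# Map an array index to that element's input file (hypothetical naming).
input_file() {
    echo "input${1}.dat"
}

# LSF sets LSB_JOBINDEX to 1, 2, 3 or 4 here; default to 1 outside LSF.
index="${LSB_JOBINDEX:-1}"
infile=$(input_file "$index")
echo "array element $index will read $infile"
# ./my_program "$infile"    # hypothetical program
```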
    

Passing command-line values to jobs

PBS supported the qsub command-line option -v, which set variables within the environment of a job. LSF does not support a similar mechanism, so anyone using this option will need to make some significant changes.

The LSF method for passing arguments is not to use an input script, but to submit a normal command instead. For instance, if I have an executable script myscript.sh on my $PATH, I could run it using:

    % bsub -n 4 -oo myjob.log myscript.sh arg1 arg2
    
Then arg1 and arg2 will be available as $1 and $2 in myscript.sh. The same would be true if this were a c-shell script. Note again that you cannot use #BSUB options in the preamble of such scripts, so all options must be put on the command-line. If the command is not on your $PATH, you will need to supply its full path name.
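A minimal sketch of what myscript.sh might contain, written for bash, is shown below. It checks that exactly two arguments were supplied, because a missing argument is otherwise easy to overlook in the job log. The argument meanings and the echo lines are placeholders for real work.

```bash
#!/bin/bash
# myscript.sh: a hypothetical worker script submitted as a plain command, e.g.
#   bsub -n 4 -oo myjob.log myscript.sh arg1 arg2
# LSF passes arg1 and arg2 through as $1 and $2. Any #BSUB lines here would
# be ignored, so all job options belong on the bsub command-line.

main() {
    if [ "$#" -ne 2 ]; then
        echo "usage: myscript.sh arg1 arg2" >&2
        return 1
    fi
    echo "firstvalue = $1"
    echo "secondvalue = $2"
}

main "$@"
```

Make the script executable with "chmod +x myscript.sh". If it is not on your $PATH, you can give its full path at submission time, for example:

    % bsub -n 4 -oo myjob.log $PWD/myscript.sh 10 20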

Another way to pass in arguments is to edit the input script on the fly, which is easily done using sed:

    % sed 's/%args%/arg1 arg2/g' < batchfile | bsub
    
That would replace all occurrences of the string %args% with arg1 arg2, which could be handled in the following manner in a c-shell input script:

    #  Edited args:
    set args = (%args%)
    echo "firstvalue = $args[1]"
    echo "secondvalue = $args[2]"
    
In bash that would be:
    args=(%args%)
    echo "firstvalue = ${args[0]}"
    echo "secondvalue = ${args[1]}"
    
For standard Bourne shell you'd need to use something like:
    args="%args%"
    arg1=`echo $args | awk '{print $1}'`
    arg2=`echo $args | awk '{print $2}'`
    echo "firstvalue = $arg1"
    echo "secondvalue = $arg2"
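
The sed approach can be wrapped in a small shell function. In this sketch, submit_with_args and SUBMIT_CMD are hypothetical names and not part of LSF. The function uses "|" rather than "/" as the sed delimiter, so that arguments such as path names containing "/" survive the substitution. It also escapes the characters that are special in a sed replacement. Setting SUBMIT_CMD=cat prints the edited script instead of submitting it.

```bash
#!/bin/bash
# submit_with_args (hypothetical helper): replace every %args% in a batch
# script with the given arguments and pipe the result to SUBMIT_CMD,
# which defaults to bsub. SUBMIT_CMD=cat prints the edited script instead.
submit_with_args() {
    local batchfile="$1"
    shift
    local escaped
    # Backslash-escape \, & and | so sed treats them literally in the
    # replacement; "|" is the delimiter below, so "/" in paths is safe.
    escaped=$(printf '%s' "$*" | sed 's/[\\&|]/\\&/g')
    sed "s|%args%|${escaped}|g" < "$batchfile" | ${SUBMIT_CMD:-bsub}
}
```

Once the function is defined in your shell, the first command below is equivalent to the sed example above. The second prints the edited script for checking:

    % submit_with_args batchfile arg1 arg2
    % SUBMIT_CMD=cat submit_with_args batchfile /data/run1 42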
    

Using modules with bash scripts

When LSF accepts a batch job, it attempts to retain the environment of the submitting process (i.e. the shell in the terminal you're typing in). This means that bash jobs will not source any of their initialisation scripts. Sourcing them is necessary to define the module command (for tcsh scripts the .cshrc/.tcshrc file is always read, so that works). To work around this, make sure your script is run as either an interactive or a login shell. To do this, add the "-l" flag to the hashbang:

    #!/bin/bash -l
    
Alternatively, you could use "-i" (interactive shell) if your .bashrc script sets up the module command.
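A missing "-l" usually only shows up as a "command not found" error part way through a job. For that reason, a script can check for the module command before relying on it. In the sketch below, the job name, queue and log file are placeholders. The module to load is left as a comment because module names are site-specific, and the helper have_module_command is hypothetical.

```bash
#!/bin/bash -l
#BSUB -n 1
#BSUB -J modcheck
#BSUB -oo modcheck.log
#BSUB -q cosma

# "module" is normally a shell function created by the login start-up files,
# which is why the "-l" on the first line matters for bash jobs.
have_module_command() {
    type module >/dev/null 2>&1
}

if have_module_command; then
    module list
    # module load <name>    (site-specific module name goes here)
else
    echo "module command not defined: check the '#!/bin/bash -l' line" >&2
fi
```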


Peter W. Draper
Last modified: Mon Dec 10 11:50:28 GMT 2012