The Durham COSMA Batch System

Overview of the batch system

The Durham COSMA machines, COSMA4 and COSMA5, have an integrated set of batch queues managed by the PLATFORM LSF (Load Sharing Facility) job scheduler. Jobs are run by the scheduler after submission to the appropriate queue. Which queue to submit your jobs to depends on the projects you are working on. If in doubt, ask your project lead, supervisor or co-workers, or email cosma-support@durham.ac.uk.

The batch queues

Currently we have three main queues:

These are used by the majority of jobs, and all users have the right to use at least one of them. Access rights are controlled by membership of projects, which are the same as UNIX groups. COSMA5 projects have a DiRAC-assigned group code (usually starting with dp followed by three digits) together with a quarterly allocation of time. COSMA4 users just need to be in the durham project (all Virgo consortium members and Durham locals should be in this group). You can check which projects you are in using the command id, which lists the UNIX groups you are a member of. A more authoritative list of group members known to the batch system can be found using the command:

   bugroup
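
As an illustration of the id check described above, the following is a minimal sketch that filters your UNIX groups for DiRAC-style project codes. It assumes the codes are exactly dp followed by three digits, which the text only says is usual, and the helper name dirac_projects is hypothetical.

```shell
#!/bin/bash
# Sketch: list the DiRAC-style project groups the current user belongs to.
# Assumption: project codes look like "dp" followed by exactly three digits.

# dirac_projects (hypothetical helper): reads a space-separated group list
# on stdin and prints only the names matching the assumed dpNNN pattern.
dirac_projects() {
    tr ' ' '\n' | grep -E '^dp[0-9]{3}$'
}

# id -Gn prints the names of all groups of the current user on one line.
id -Gn | dirac_projects || echo "no dpNNN project groups found for $(id -un)"
```

A group that appears here but not in the bugroup output is probably not yet known to the batch system, since bugroup is the more authoritative of the two.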
      

In addition to the three main queues we also have:

The -prince queues are only available on request, for jobs that cannot run on the cosma5 or cosma queues. Usually this means that they require more time than the run-time limit on cosma5 or cosma and either cannot, for technical reasons, be restarted, or are inefficient to restart. The latter usually applies to very large jobs expected to use a lot of run-time: after each restart they have to be fitted back into the machine, holding nodes idle in the process.

The cosma5-pauper queue is used to reduce the priority of COSMA5 projects that have exceeded their quarterly allocation. It has a reduced run-time limit as well as a reduced priority. This allows other projects to get time preferentially without stopping progress on over-budget projects.

The shm4 and shm5 queues are available for large shared memory jobs that cannot run on COSMA4 or COSMA5 compute nodes. These queues share resources with the interactive login machines.

Detailed descriptions of the queues and their families

See the following links for details about the various queues. DiRAC users only need to read about the COSMA5 family.

Using the PLATFORM LSF batch system

PLATFORM LSF has a number of man pages available on the system, which should be consulted for detailed information about any command. A useful overview of LSF commands and working practices for COSMA is also available.
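
To give a flavour of what a submission looks like, here is a minimal sketch that writes an LSF job script using standard #BSUB directives. The queue name cosma5 is taken from the text above, but the project code dp004, the core count, the run-time limit, the job name and the program ./my_program are all illustrative assumptions, not recommended COSMA settings.

```shell
#!/bin/bash
# Sketch: write an example LSF job script to example_job.lsf.
# All values below are placeholders; check the queue descriptions and
# your own project code before submitting anything.

cat > example_job.lsf <<'EOF'
#!/bin/bash
#BSUB -L /bin/bash
#BSUB -q cosma5
#BSUB -P dp004
#BSUB -n 16
#BSUB -J example_job
#BSUB -o example_job.%J.out
#BSUB -e example_job.%J.err
#BSUB -W 01:00

# Hypothetical MPI program; replace with your own executable.
mpirun ./my_program
EOF

echo "wrote example_job.lsf"
```

On a machine with LSF installed, a script like this would typically be submitted with bsub < example_job.lsf. The standard commands bjobs, bqueues and bkill can then be used to check job status, inspect the queues and remove a job.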

Local utilities

There are also a number of locally developed LSF and related commands available:

These concentrate on providing more condensed information than can be easily extracted from the standard commands. They also help you make more effective use of the queues.