Using back-fill to run jobs

Normally, at very busy times the batch queue system forward-allocates nodes so that pending jobs will eventually run; this applies both to nodes that are in use and to ones that are not. This is essential if any large jobs are ever to be given time to run (otherwise small jobs that can run immediately would constantly jump the queue and delay big jobs further). As most of the queues are run-time limited, big jobs are guaranteed to start on the timescale of the run-time limit. As a result, the queues can at times look (in showq and cutilisation) as if they are just sitting idle and not running jobs. That is deliberate.

During this time, known as the back-fill window, any nodes that are not running jobs can be used, provided you guarantee that your job will complete before the nodes are required. To give that guarantee you need to specify a run-time limit using the -W option of bsub.

The two back-fill commands are c4backfill and c5backfill. Here is an example of the output of one:

cosma-e > c4backfill
Backfill availability for the cosma queue
SLOTS:    70
RUNTIME:  6 hours 37 minutes 0 seconds
HOSTS:    12*m4140 12*m4168 11*m4071 11*m4153 12*m4061 12*m4084 

What this means is that 70 slots (cores) are available for 6 hours 37 minutes (if you looked at the showq output you could probably work out which job is expected to run at around that time). Note that there can be multiple back-fill windows, as more than one job can be holding forward allocations. If the window is reported as unlimited, those slots are available immediately, subject to the run-time limit of the queue (so not literally unlimited). The output can also say none, in which case you will just have to wait in the queue.
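
If you want to act on this output from a script, the three fields can be pulled apart as in the following Python sketch. It is only an illustration: parse_backfill is a hypothetical helper, not part of the back-fill commands, and it assumes the output always has the layout shown in the example above. How an unlimited or none window is printed is not covered here, so any run-time that does not match the pattern is simply left as None.

```python
import re

# The example output from above, used here as test input.
SAMPLE = """\
Backfill availability for the cosma queue
SLOTS:    70
RUNTIME:  6 hours 37 minutes 0 seconds
HOSTS:    12*m4140 12*m4168 11*m4071 11*m4153 12*m4061 12*m4084
"""


def parse_backfill(text):
    """Parse one back-fill window into slots, run-time (seconds) and hosts."""
    info = {"slots": None, "runtime_seconds": None, "hosts": {}}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "SLOTS" and value.isdigit():
            info["slots"] = int(value)
        elif key == "RUNTIME":
            # Assumed format: "<h> hours <m> minutes <s> seconds".
            match = re.fullmatch(
                r"(\d+) hours? (\d+) minutes? (\d+) seconds?", value
            )
            if match:
                hours, minutes, seconds = map(int, match.groups())
                info["runtime_seconds"] = hours * 3600 + minutes * 60 + seconds
        elif key == "HOSTS":
            # Each entry looks like "<free slots>*<host name>".
            for item in value.split():
                count, _, host = item.partition("*")
                info["hosts"][host] = int(count)
    return info
```

For the example, parse_backfill(SAMPLE) gives 70 slots, a run-time of 23820 seconds (6 hours 37 minutes) and six hosts whose free slots add up to 70.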

On COSMA5 using this back-fill window is easy: you would just submit a job that uses 70 or fewer cores and specify "-W 6:00" if you can make use of 6 hours of run-time. Assuming no one has submitted another job in the meantime, your job will then run.
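
As a rough sketch of that recipe, the Python below turns a back-fill window into a bsub command line. The helper names (wall_limit, bsub_command), the safety margin and the job script ./my_job.sh are all made up for illustration; only the bsub options -n (number of slots) and -W (run-time limit, given as hours:minutes) are real. No queue option is included, since the right queue name depends on the system you are submitting to.

```python
def wall_limit(window_seconds, margin_minutes=0):
    """Return an 'H:MM' value for bsub -W that fits inside the window."""
    # Round the window down to whole minutes, then keep a safety margin.
    minutes = window_seconds // 60 - margin_minutes
    if minutes <= 0:
        raise ValueError("window is too short for the requested margin")
    return f"{minutes // 60}:{minutes % 60:02d}"


def bsub_command(cores, window_slots, window_seconds, job, margin_minutes=0):
    """Build a bsub argument list for a job that fits the back-fill window."""
    if cores > window_slots:
        raise ValueError("job asks for more cores than the window offers")
    limit = wall_limit(window_seconds, margin_minutes)
    # "job" is whatever you would normally submit (hypothetical here).
    return ["bsub", "-n", str(cores), "-W", limit, job]


# The window from the example: 70 slots for 6 hours 37 minutes.
WINDOW_SECONDS = 6 * 3600 + 37 * 60
command = bsub_command(70, 70, WINDOW_SECONDS, "./my_job.sh", margin_minutes=37)
```

Joining command with spaces gives "bsub -n 70 -W 6:00 ./my_job.sh", which matches the 6 hour request described above. With no margin the limit would be 6:37, which leaves no slack if the job takes slightly longer than expected.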

On COSMA4 it is a little harder, as some of these nodes have only 11 cores available rather than 12, and jobs asking for more than 12 cores are forced to use exclusive access. So in fact you could run 22 single-core jobs (2*11) or up to 48 cores (4*12) in a single job.
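
The arithmetic for the example HOSTS line can be checked with a short Python sketch. It assumes that a completely free COSMA4 node offers 12 slots and that a job needing exclusive access can only use such nodes; cosma4_capacity is a hypothetical helper written for this page, not an existing command.

```python
FULL_NODE_SLOTS = 12  # assumption: a completely free COSMA4 node has 12 slots


def cosma4_capacity(hosts_line):
    """Split the free slots into whole-node and part-node capacity."""
    # Each entry looks like "<free slots>*<host name>".
    counts = [int(item.split("*")[0]) for item in hosts_line.split()]
    full = [c for c in counts if c == FULL_NODE_SLOTS]
    partial = [c for c in counts if c < FULL_NODE_SLOTS]
    return {
        # Cores usable by one job that needs exclusive access to its nodes.
        "exclusive_job_cores": sum(full),
        # Slots on partly used nodes, suitable for single-core jobs.
        "partial_node_slots": sum(partial),
    }


HOSTS = "12*m4140 12*m4168 11*m4071 11*m4153 12*m4061 12*m4084"
capacity = cosma4_capacity(HOSTS)
```

For the example this gives 48 cores on the four fully free nodes and 22 slots on the two nodes with 11 free, which together account for the 70 slots reported.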