Normally, at very busy times, the batch queue system forward-allocates
nodes so that pending jobs will eventually run. This applies both to
nodes that are in use and to nodes that are not. This is essential if
any large jobs are ever to be given time to run (otherwise small jobs
that can run would constantly jump the queue and delay big jobs
further). As most of the queues are run-time limited, big jobs are
guaranteed to be able to run on the timescale of the run-time limit.
As a result, the queues can at times look (in
cutilisation) like they are
just sitting and not running jobs; that is deliberate.
During this time, known as the back-fill window, any nodes that are not
running jobs can be used, provided you guarantee that your job will
complete before the nodes are required. This means you will need to specify a
run-time limit using the -W option of
The two back-fill commands are
Here is an example of the output of one:
    cosma-e > c4backfill
    Backfill availability for the cosma queue
    SLOTS: 70
    RUNTIME: 6 hours 37 minutes 0 seconds
    HOSTS: 12*m4140 12*m4168 11*m4071 11*m4153 12*m4061 12*m4084
What this means is that 70 slots (cores) are available for 6 hours 37
minutes (if you looked at the
showq output you could
probably work out which job is expected to run at around that time). Note
that there can be multiple back-fill windows as more than one job can be
holding forward allocations. If the window is
those slots are available immediately with the run-time limit of the
queue (so not literally unlimited). It can also say
in which case you'll just have to wait in the queue.
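If you want to use these numbers in a script, the output can be picked apart fairly easily. The following is only a sketch: it assumes the output has exactly the SLOTS, RUNTIME and HOSTS fields shown in the example above, and the function name parse_backfill is made up for illustration. It does not handle the other window values mentioned above.

```python
import re

# Sample text copied from the example output above.
SAMPLE = """\
Backfill availability for the cosma queue
SLOTS: 70
RUNTIME: 6 hours 37 minutes 0 seconds
HOSTS: 12*m4140 12*m4168 11*m4071 11*m4153 12*m4061 12*m4084
"""


def parse_backfill(text):
    """Return (slots, runtime_seconds, {host: free_slots}).

    Hypothetical helper: assumes the field layout of the example output.
    """
    slots = int(re.search(r"SLOTS:\s*(\d+)", text).group(1))
    runtime = re.search(
        r"RUNTIME:\s*(\d+) hours (\d+) minutes (\d+) seconds", text
    )
    hours, minutes, seconds = (int(v) for v in runtime.groups())
    hosts = {}
    # Each HOSTS entry looks like "12*m4140": free slots, then host name.
    for token in re.search(r"HOSTS:\s*(.*)", text).group(1).split():
        count, name = token.split("*")
        hosts[name] = int(count)
    return slots, hours * 3600 + minutes * 60 + seconds, hosts


slots, runtime_seconds, hosts = parse_backfill(SAMPLE)
print(slots, runtime_seconds, hosts)
```

Note that the per-host counts add up to the SLOTS figure (4*12 + 2*11 = 70), which is a useful sanity check on the parsing.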
On COSMA5 using this back-fill window is easy: you would just submit a job that uses 70 or fewer cores and specify "-W 6:00" if you could make use of 6 hours of run-time. Assuming no one has submitted another job in the meantime, your job would then run.
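As a rough illustration of picking the run-time limit, the sketch below turns a window length into a value of the "hours:minutes" form used in "-W 6:00" above. The function name walltime_limit and the idea of subtracting a safety margin are assumptions made for this example, not a site rule; choose whatever margin suits your job.

```python
def walltime_limit(window_seconds, margin_minutes=30):
    """Format a run-time limit as H:MM that fits inside the window.

    margin_minutes is an assumed safety margin, so the job asks for
    somewhat less time than the window offers.
    """
    usable_minutes = window_seconds // 60 - margin_minutes
    if usable_minutes <= 0:
        raise ValueError("back-fill window is too short for this margin")
    hours, minutes = divmod(usable_minutes, 60)
    return f"{hours}:{minutes:02d}"


# Window from the example output: 6 hours 37 minutes 0 seconds.
window = 6 * 3600 + 37 * 60
print(walltime_limit(window))                     # default 30 minute margin
print(walltime_limit(window, margin_minutes=37))  # rounds down to whole hours
```

With a 37 minute margin this gives the "6:00" used in the text; with the default margin it gives a slightly longer limit that still ends before the window closes.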
On COSMA4 it is a little harder, as some of these nodes only have 11 cores available, not 12, and jobs asking for more than 12 cores are forced to use exclusive access. So in fact you could run 22 1-core jobs (2*11) or up to 48 cores (4*12) in a single job.
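The COSMA4 arithmetic can be checked with a few lines of code. This is a sketch under the assumptions stated in the text: a fully free node offers 12 cores, a multi-node job needs whole (exclusive) nodes, and the partly free nodes are only usable for 1-core jobs. The function name cosma4_options is invented for the example, and the host list is copied from the example output.

```python
CORES_PER_NODE = 12  # assumption from the text: a fully free node has 12 cores

# Free slots per host, taken from the HOSTS line of the example output.
hosts = {
    "m4140": 12, "m4168": 12, "m4071": 11,
    "m4153": 11, "m4061": 12, "m4084": 12,
}


def cosma4_options(hosts, cores_per_node=CORES_PER_NODE):
    """Return (max cores for one exclusive job, number of 1-core jobs).

    Exclusive jobs can only use nodes that are completely free; the
    remaining slots on partly used nodes are counted as 1-core jobs.
    """
    full_nodes = [h for h, free in hosts.items() if free == cores_per_node]
    partial_slots = sum(free for free in hosts.values() if free < cores_per_node)
    return len(full_nodes) * cores_per_node, partial_slots


max_exclusive_cores, single_core_jobs = cosma4_options(hosts)
print(max_exclusive_cores, single_core_jobs)
```

For the example window this reproduces the figures above: 48 cores (4*12) for a single job, or 22 1-core jobs (2*11) on the two partly used nodes.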