showq

Running the showq command gives an overview of the jobs that are running and pending in all queues for all users, and of the resources those jobs are using. Its options let you restrict the view to a single queue, a family of related queues, or the jobs of one user, and can add extra columns such as the CPU time a job has used, the time it has remaining, and when a pending job is expected to start (the start time is only indicative: it works well only for the finite queues and can change as users with differing priorities submit new jobs).

The options available are:

-h             get help.
-q queue_name  show information for the given queue.
-p             show information for all queues in a family.
-r             show the project.
-s             show information for all queues whose names start with queue_name.
-u user_name   only show information for the given user.
-n host_name   only show information for the given host.
-j             show the job name.
-t             show the time remaining column.
-c             show the CPU time used column.

Note that the -p and -s options are incompatible; if both are given, only -s is used.

Examples using showq

Inspecting the cosma queue

Here is the output of a simple request to see the cosma queue status.

cosma-e > showq -q cosma
ACTIVE JOBS -------------
JOBID       USER     STAT  QUEUE         CORES/NODES SUBMIT TIME   START TIME   
605221      spike    RUN   cosma          240/20X    Nov22 11:33   Nov22 12:15  
605609      dph6bmh  RUN   cosma          120/10X    Nov25 10:30   Nov28 20:56  
607175      dph6bmh  RUN   cosma          120/10X    Nov28 18:03   Nov29 02:54  
607213      spike    RUN   cosma          240/20X    Nov29 14:29   Dec01 19:21  
609746      ppgm62   RUN   cosma          512/43X    Dec04 17:11   Dec04 21:25  
613101      dph6bmh  RUN   cosma          120/10X    Dec10 16:33   Dec10 16:33  
612851      ppgm62   RUN   cosma          512/43X    Dec10 11:37   Dec11 03:53  
613680      dph6bfg  RUN   cosma            1/ 1     Dec11 20:09   Dec11 20:09  
613681      dph6bfg  RUN   cosma            1/ 1     Dec11 20:09   Dec11 20:09  
613682      dph6bfg  RUN   cosma            1/ 1     Dec11 20:09   Dec11 20:09  
613683      dph6bfg  RUN   cosma            1/ 1     Dec11 20:09   Dec11 20:09  
613684      dph6bfg  RUN   cosma            1/ 1     Dec11 20:09   Dec11 20:09  
613810      jch      RUN   cosma          256/22X    Dec12 13:44   Dec12 13:44  
614000      till     RUN   cosma          264/44X    Dec12 16:29   Dec12 17:58  
614045      jvbq85   RUN   cosma          108/ 9X    Dec12 18:05   Dec12 18:05  
15 total jobs

SUSPENDED JOBS -------------
no matching jobs found

queue summary
2497 of 2784 cores used (89.69%),  287 available
 232 of  232 nodes used (100.00%),    0 available

overall statistics for nodes available
2497 of 2784 cores used (89.69%),  287 unused
 232 of  232 nodes used (100.00%),    0 closed by admin,    0 unavailable

other queues using these hosts: 

PENDING JOBS -------------
JOBID       USER     STAT  QUEUE         CORES/NODES SUBMIT TIME   START TIME   
613083      ppgm62   PEND  cosma          512/ 1X    Dec10 12:36                
613709      ppgm62   PEND  cosma          512/ 1X    Dec12 11:08                
614001      till     PEND  cosma          264/ 1X    Dec12 16:29   Dec12 18:54  
614002      till     PEND  cosma          264/ 1X    Dec12 16:29                
614003      till     PEND  cosma          264/ 1X    Dec12 16:29                
    

As is immediately clear, the queue is running at nearly full capacity and no more jobs can start (it was 18:15 on Dec12 when the command was run). All 232 nodes of the queue are in use. There are 287 cores "free", but the pending jobs cannot use them, either because they require exclusive node access or because of their size.
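The utilisation figures in the queue summary follow directly from the core counts. This small Python sketch (using the numbers from the output above) reproduces the arithmetic:

```python
# Reproduce the "queue summary" arithmetic from the showq output above.
total_cores = 2784
used_cores = 2497

free_cores = total_cores - used_cores
percent_used = 100.0 * used_cores / total_cores

print(f"{used_cores} of {total_cores} cores used "
      f"({percent_used:.2f}%), {free_cores} available")
```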

Job 605221 is using 240 cores on 20 nodes to which it has exclusive access (20X). The start time for job 614001 is only an estimate: it assumes that jobs complete at the end of their runtime, not before, and that no users with higher priority submit new jobs.
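The CORES/NODES field can be unpicked mechanically. Here is a small, hypothetical Python helper; the field format is inferred from the output above, with a trailing X marking exclusive node access:

```python
import re

def parse_cores_nodes(field):
    """Split a showq CORES/NODES value such as '240/20X' or '1/ 1'
    into (cores, nodes, exclusive)."""
    m = re.fullmatch(r"\s*(\d+)/\s*(\d+)(X?)\s*", field)
    if m is None:
        raise ValueError(f"unrecognised CORES/NODES field: {field!r}")
    return int(m.group(1)), int(m.group(2)), m.group(3) == "X"

print(parse_cores_nodes("240/20X"))  # job 605221: 240 cores, 20 exclusive nodes
print(parse_cores_nodes("1/ 1"))    # a single-core job without exclusive access
```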

Seeing how much CPU time a job has used

This can be done by adding the -c option:


cosma-e > showq -q cosma -c
ACTIVE JOBS -------------
JOBID       USER     STAT  QUEUE         CORES/NODES CPU HOURS   SUBMIT TIME    START TIME   
605221      spike    RUN   cosma          240/20X      116655.3  Nov22 11:33    Nov22 12:15  
605609      dph6bmh  RUN   cosma          120/10X       40006.1  Nov25 10:30    Nov28 20:56  
607175      dph6bmh  RUN   cosma          120/10X       39290.5  Nov28 18:03    Nov29 02:54  
607213      spike    RUN   cosma          240/20X       63112.9  Nov29 14:29    Dec01 19:21  
609746      ppgm62   RUN   cosma          512/43X       96715.8  Dec04 17:11    Dec04 21:25  
613101      dph6bmh  RUN   cosma          120/10X        5973.1  Dec10 16:33    Dec10 16:33  
612851      ppgm62   RUN   cosma          512/43X       19679.7  Dec10 11:37    Dec11 03:
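The CPU HOURS column is, to a good approximation, just cores multiplied by the wall-clock time elapsed since the job started. A quick Python check for job 605221 (240 cores, started Nov22 12:15, queried around 18:15 on Dec12; the year is an arbitrary assumption) comes out close to the 116655.3 reported above:

```python
from datetime import datetime

start = datetime(2012, 11, 22, 12, 15)  # job 605221 start time (year assumed)
now = datetime(2012, 12, 12, 18, 15)    # approximate time the command was run
cores = 240

elapsed_hours = (now - start).total_seconds() / 3600.0
cpu_hours = cores * elapsed_hours
print(f"{cpu_hours:.1f}")  # close to the 116655.3 shown by showq -c
```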

Inspecting a family of queues

The main COSMA4 queues and all the COSMA5 queues are arranged as families. For instance, the COSMA5 queues are cosma5, cosma5-prince and cosma5-pauper, and the COSMA4 queues are cosma and cosma-prince. The queues in a family share the same nodes, so a query about one of them gives only a partial picture of the state of COSMA5 or COSMA4. To get a full picture of a family, use the three options -ptc in a command like the following:


cosma-e > showq -q cosma5 -ptc
ACTIVE JOBS -------------
JOBID       USER     STAT  QUEUE         CORES/NODES CPU HOURS   SUBMIT TIME    START TIME    TIME REMAINING
612545      dc-jone1 RUN   cosma5         640/40X       45464.5  Dec08 16:38    Dec09 19:30    0:00:57:42  
612505      rrtx34   RUN   cosma5         512/32X       16244.3  Dec07 18:05    Dec11 10:49    1:16:16:22  
612335      till     RUN   cosma5         512/32X       16243.9  Dec06 16:06    Dec11 10:49    1:16:16:25  
613639      till     RUN   cosma5         512/32X       13581.2  Dec11 16:01    Dec11 16:01    1:21:28:27  
613697      dc-bush1 RUN   cosma5         144/ 9X        1443.4  Dec12 08:31    Dec12 08:31    2:13:58:35  
613811      rcrain   RUN   cosma5-prince 2048/128X       9173.3  Dec12 14:02    Dec12 14:04    0:05:31:15  
613812      rcrain   RUN   cosma5-prince 2048/128X       8884.9  Dec12 14:03    Dec12 14:12    0:05:39:42  
614038      till     RUN   cosma5         256/16X          56.7  Dec12 17:02    Dec12 18:19    0:00:46:43  
8 total jobs

SUSPENDED JOBS -------------
no matching jobs found

summary
overall statistics for available queues
6672 of 6704 cores used (99.52%),   32 available
 419 of  420 nodes used (99.76%),    1 closed by admin,    0 unavailable

available queues: cosma5-prince,  c4h, cosma5,  shm, cosma, cordelia, cosma5-pauper,  c4l, 
noqueue

PENDING JOBS -------------
JOBID       USER     STAT  QUEUE         CORES/NODES CPU HOURS   SUBMIT TIME    START TIME    TIME REMAINING
613819      jlvc76   PEND  cosma5-prince 4096/ 1X           0.0  Dec12 14:12    Dec13 00:23    0:00:00:00  
613695      wangjie  PEND  cosma5        1024/ 1X           0.0  Dec12 02:54    Dec14 11:00    3:00:00:01  
613896      arj      PEND  cosma5        1536/ 1X           0.0  Dec12 14:46    Dec15 08:42    0:04:30:01  
614039      till     PEND  cosma5         256/ 2X           0.0  Dec12 17:02    Dec12 19:30    0:01:00:01  
614040      till     PEND  cosma5         256/ 1X           0.0  Dec12 17:02                   0:01:00:01  
614041      till     PEND  cosma5         256/ 1X           0.0  Dec12 17:02                   0:01:00:01  
6 total jobs
    

No project is currently over budget, so no jobs are running in the pauper queue; if any were, they would appear here.

Two of these three queues are time limited, so the start times shown will hold only if nothing changes, that is if no jobs complete early and no new jobs are submitted (new jobs may come from users with a higher priority). In this case the unlimited prince queue is full, as it only allows 4096 cores, but the pending job is still given a start time. That time will also be correct, because the two running jobs were given a runtime limit using the -W option on submission.
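The TIME REMAINING column appears to use a days:hours:minutes:seconds layout (e.g. 1:16:16:22 for job 612505). A hypothetical Python helper to turn such a value into seconds, assuming that layout, might look like:

```python
def time_remaining_seconds(field):
    """Convert a showq TIME REMAINING value such as '1:16:16:22'
    (days:hours:minutes:seconds) into a total number of seconds."""
    days, hours, minutes, seconds = (int(part) for part in field.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

print(time_remaining_seconds("1:16:16:22"))  # job 612505: 144982 seconds left
```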

Note that job 614039 has a CORES/NODES value of 256/2X, not 256/1X like the others. This indicates that the job has reserved 2 of the nodes it requires to run (this number can be much larger, depending on job sizes and the load profile). The reservation is for the future, so the nodes are still available to run other jobs, but only jobs that fit within a fixed time window and so do not stop this job from starting at the expected time. See the c5backfill command for how to utilise this window; a similar c4backfill command exists for the COSMA4 queues.