The Torque command qstat is used to inquire about the status of one or more jobs:

$ qstat -f <jobid>     (Inquire about a particular jobid)
$ qstat -r             (List all running jobs)
$ qstat -a             (List all jobs)

In addition, the Moab scheduler can be queried using the showq command:

$ showq -r             (List all running jobs)
$ showq                (List all jobs)

To check the status of a particular jobid, use the checkjob command:

$ checkjob <jobid>
Adding one or more -v flags to this command increases the verbosity.
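
When the state of a job is needed in a script, it can be pulled out of the qstat -f output. The following is a minimal sketch, assuming the full listing contains an indented line of the form "job_state = R"; the function name job_state is a hypothetical helper, not part of Torque.

```shell
# Hypothetical helper: read `qstat -f <jobid>` output on stdin and print
# the value of the job_state attribute (for example R, Q or C).
job_state() {
    awk -F' = ' '$1 ~ /^[[:space:]]*job_state$/ { print $2; exit }'
}

# Usage on a system with Torque installed:
#   qstat -f <jobid> | job_state
```

The helper only parses text, so it can be tried on a saved copy of the qstat -f output before being used in a job script.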

Badly behaving jobs

Another useful command for monitoring batch jobs is pestat, available as a module. With the -f option it shows the status of badly behaving jobs, with the offending fields marked by a star (*):

$ module load tools pestat
$ pestat -f
Listing only nodes that are flagged by *
  node state  load    pmem ncpu   mem   resi usrs tasks  jobids/users
  risoe-r01-f002  free     2* 1034109  32 1046397   8017  1/1    1    103125 s147214
  risoe-r01-f010  free  0.53* 1034109  32 1046397   8451  0/0    0
  risoe-r01-f012  free  0.55* 1034109  32 1046397   8019  0/0    0
  risoe-r02-f019  offl* 0.27  1034107  64 1046395   6590  0/0    0
  risoe-r02-f024  free     1* 1034109  32 1046397   8730  0/0    0
  risoe-r03-cn001  excl    29* 128946  28 133042   8266  1/1    1    100096 qyli
...

One of the most common problems with batch jobs is exhausting the available RAM.

An example of usage of pestat:

$ pestat | grep -e node -e 263945
  node state  load    pmem ncpu   mem   resi usrs tasks  jobids/users
  q008  excl  4.08    7974   4  18628   1275  1/1    4    263945 user
  q037  excl  4.02    7974   4  18628   1285  1/1    4    263945 user

The example job above is behaving correctly. Please consult the script located at `which pestat` for a description of the fields. The most important fields are:

state = Torque state (second column)
    the node can be free (not all cores used), excl (all cores used) or down
load = CPU load average (third column)
pmem = physical memory (fourth column)
    amount of physical RAM installed in the node
ncpu = total number of CPU cores (fifth column)
resi = resident (used) memory (seventh column)
    total memory in use on the given node (the one reported under RES by the "top" command)

If the used memory exceeds the physical RAM on the node, or the CPU load is significantly lower than the number of CPU cores, the job becomes a candidate to be killed.

An example of a job exceeding physical memory:

$ pestat -f | grep 128081
m016  busy* 4.00    7990   4  23992   9937* 1/1    4    128081 user
m018  excl  4.00    7990   4  23992   9755* 1/1    4    128081 user
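
The memory rule can also be checked directly by comparing the resi and pmem columns. This is a minimal sketch, assuming the column layout shown in the listings on this page (pmem in the fourth column, resi in the seventh); over_memory_nodes is a hypothetical helper name and is not provided by pestat.

```shell
# Hypothetical helper: read pestat output on stdin and print
# "node resi pmem" for every node whose resident memory exceeds
# its physical memory. Star markers (*) are stripped before comparing.
over_memory_nodes() {
    awk '{
        pmem = $4; resi = $7
        gsub(/\*/, "", pmem)
        gsub(/\*/, "", resi)
        # Skip the header line and any line without numeric fields.
        if (pmem ~ /^[0-9]+$/ && resi ~ /^[0-9]+$/ && resi + 0 > pmem + 0)
            print $1, resi, pmem
    }'
}

# Usage on a system with pestat loaded:
#   pestat | over_memory_nodes
```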

An example of a job with incorrect CPU load:

$ pestat -f | grep 129284
a014  excl  7.00*  24098   8  72097   2530  1/1    8    129284 user
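
The CPU load rule can be sketched in the same way. The example below assumes the column layout used on this page (load in the third column, ncpu in the fifth) and only looks at nodes that list at least one job. The helper name underloaded_nodes and the default tolerance of 0.5 are assumptions made for illustration; the threshold that pestat itself applies is described in the script.

```shell
# Hypothetical helper: read pestat output on stdin and print
# "node load ncpu" for nodes that run at least one job but whose load
# is more than a tolerance below the number of cores.
# $1: tolerance in load units (assumed default: 0.5)
underloaded_nodes() {
    awk -v tol="${1:-0.5}" 'NF > 9 {
        load = $3; ncpu = $5
        gsub(/\*/, "", load)
        # Skip the header line and any line without numeric fields.
        if (load ~ /^[0-9.]+$/ && ncpu ~ /^[0-9]+$/ && ncpu - load > tol)
            print $1, load, ncpu
    }'
}

# Usage on a system with pestat loaded:
#   pestat | underloaded_nodes 0.5
```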

Searching for free resources

Show what resources are available for immediate use (see `Batch_jobs#batch-job-node-properties`_ for more options):

Fatnode:

$ showbf -f fatnode

Thinnode:

$ showbf -f thinnode

pestat can also be used to check what resources are free:

$ pestat | grep free
  risoe-r01-f006  free    29* 1034109  32 1046397  13226  1/1    1    100074 qyli
  risoe-r01-f010  free   2.4* 1034109  32 1046397  79972  2/1    1*   20078 jogon 20079 jogon
  risoe-r01-f013  free  0.84  1034109  32 1046397   8395  0/0    1    102268 jensf
  risoe-r02-f015  free  0.81  1034109  32 1046397   8212  0/0    1    102268 jensf
  risoe-r02-f017  free  0.15* 1034109  32 1046397   8489  0/0    1    102268 jensf
  risoe-r02-f023  free  0.56  1034109  32 1046397   8313  0/0    1    102268 jensf
  risoe-r02-f024  free  0.08* 1034109  32 1046397   8101  0/0    1    102268 jensf
  risoe-r02-f025  free  0.02* 1034109  32 1046397   7984  0/0    1    102268 jensf
  risoe-r08-cn289  free   1.4  128946  28 133042   3117  1/1    1    102536 csabai
  risoe-r08-cn300  free   1.5  128943  56 133039   6995  3/2    1*   102406 jensf 102407 jensf 102533 cgar
  risoe-r12-cn527  free   1.3  128946  28 133042   2741  1/1    1    10047 sira
  risoe-r02-f018  free    29* 1034110  64 1046398  15376  1/1    1    99432 qyli


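In the listing above, the node risoe-r01-f010 shows 2/1 in the usrs column (8th column) and 1 task (9th column), with two jobs from user jogon (20078 and 20079) listed in the last column. The node risoe-r02-f024 shows no user sessions (0/0) and almost no load, so it is essentially idle.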
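
For a quick overview, the free nodes can be summarised by subtracting the tasks column from ncpu. This is a rough sketch only: it assumes the column layout shown above and that each task occupies one core, which may not hold for every job. The helper name free_node_summary is hypothetical.

```shell
# Hypothetical helper: read pestat output on stdin and, for nodes in the
# "free" state, print "node cores" where cores = ncpu - tasks.
# Assumption: one task corresponds to one occupied core.
free_node_summary() {
    awk '$2 ~ /^free/ {
        tasks = $9
        gsub(/\*/, "", tasks)
        if ($5 ~ /^[0-9]+$/ && tasks ~ /^[0-9]+$/)
            print $1, $5 - tasks
    }'
}

# Usage on a system with pestat loaded:
#   pestat | free_node_summary
```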