# Troubleshoot Jobs
## Overview
In computational work, it is common for a job not to do what you expected or intended. This can be confusing and is a leading source of questions to Research Computing. This guide explains queueing system outcomes and shows you how to proactively diagnose the most common issues.
## Job is not running
If your job is queued longer than expected, you can find a reason with the `squeue -j jobId` command, where `jobId` is the SLURM job number.
```
$ squeue -j 12345
 JOBID PARTITION     NAME     USER ST       TIME  NODELIST(REASON)
 12345    normal  testing    NetID PD       0:00  (QOSMaxJobsPerUserLimit)
```
In the example, job 12345 has a state (`ST`) of pending (`PD`). The reason is listed last, under the `NODELIST(REASON)` heading (a compute node list is printed there if the job is running). The reason listed for the example job is `QOSMaxJobsPerUserLimit`, which translates to a queue (partition) based resource limit for users. It is important to know that the cluster has limits on what resources any single lab group or user can use. These resource limits, along with other cluster policies, help maintain fair utilization of the cluster.
So, the example job is in the queue because of the cluster's resource limits. Several scenarios could lead to this result. The most likely explanation is that you are already running jobs on the cluster and have hit that particular limit. In that case, the best thing to do is wait, and the job will likely start running when your previous jobs finish. However, other resource limits are less obvious. For example, the cluster limits the number of interactive jobs you can run.
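To see which of your jobs are already queued or running (and therefore which limit you may have hit), you can list them with `squeue`. This is a minimal sketch; replace `NetID` with your username.

```
# List all of your queued and running jobs, with their states and reasons
$ squeue -u NetID
```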
The reason listed for our example job is one of several that SLURM provides. Please see the table below for additional examples.
| Reason | Why | What to do |
|---|---|---|
| `QOSMaxJobsPerUserLimit` | You reached the number of running jobs allowed per user for the corresponding partition. | You can wait for previous jobs to finish, or cancel running jobs (see the sketch after this table). |
| `Resources` | The job is waiting for resources to become available. | In many cases you should wait and the job will run. Additionally, try to make sure that you don't request more resources than you need. |
| `Dependency` | A job dependency is not yet satisfied. | You can learn more about job dependencies in our video for creating an advanced SLURM script, and see the sketch after this table. |
| `Maintenance` | Your job is blocked until maintenance is finished. | You can wait, or cancel and resubmit your job according to the job scheduling and maintenance section of our job submission guide. |
| `Priority` | Your job is waiting for higher priority jobs to finish. | Please wait and rest assured your job will run in due time. For more info about job priority, please see the job priority section of our job submission guide. |
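If you decide to cancel jobs or chain them with a dependency, the commands below are a minimal sketch. The job IDs `12345` and `12346` and the script name `step2.slurm` are placeholders for your own jobs and files.

```
# Cancel a single job by its SLURM job ID
$ scancel 12345

# Cancel all of your own queued and running jobs
$ scancel -u NetID

# Submit a job that starts only after job 12346 completes successfully
$ sbatch --dependency=afterok:12346 step2.slurm
```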
**When to contact help**
It is not possible to provide an exhaustive table of scenarios, reasons, and guidance here. When in doubt, please feel free to contact help-rcc@mcw.edu.
## Job failed immediately
Many issues can cause a job to fail immediately. By immediately, we mean the job starts and finishes without producing any useful output. Often there is an error listed in the job output file. The output file is named according to your job name, or can have a specific name based on your `#SBATCH --output` option. Most often we find the job output file is something like `slurm-12345.out`, where 12345 is the jobId number.
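As a minimal sketch, a job script can control its output file name with the `#SBATCH --output` option; the job name `myjob` and file pattern below are placeholders, and `%j` is replaced by the job ID.

```
#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --output=myjob-%j.out   # %j expands to the SLURM job ID

# With the example jobId, the output would be written to myjob-12345.out
```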
Incorrect file names or paths are the most common source of immediate job failure. In this case you will see a specific error in your output file with the syntax `command: /file/path: No such file or directory`. This indicates that your job is trying to manipulate a file or directory that does not exist. Most often this is a simple typo, but it could also be caused by trying to use files in `/group`, which is not available on compute nodes. We suggest double-checking the file names and paths, and then resubmitting. If the issue persists, contact help-rcc@mcw.edu.
Storage limits can also cause this issue. Every user has access to at least three storage paths: `/home/netId`, `/group/pi_netId`, and `/scratch/g/pi_netId`. Each of these spaces has a finite limit according to our storage guide. Your job should be using `/scratch` for input/output and will fail immediately if your scratch quota is 100% full. You can find your available storage paths and quotas with the `mydisks` command.
```
$ mydisks
=====My Lab=====
Size  Used  Avail  Use%  File
 47G   29G    19G   61%  /home/netId
932G  158G   774G   17%  /group/pi_netId
4.6T     0   4.6T    0%  /scratch/g/pi_netId
```
Finally, if you run jobs in OnDemand often, your home directory will fill with temporary files that are created every time you start an OnDemand job. If you primarily use OnDemand and your apps are failing to start, check your home directory limit with `mydisks`. If your home directory is full, look for files in `/home/netId/ondemand`, where `netId` is your username. You can safely clean out this folder, then log out of OnDemand and log back in.
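A minimal sketch of checking and cleaning the OnDemand folder follows; replace `netId` with your username and review the folder contents before deleting anything you want to keep.

```
# See how much space the OnDemand session files are using
$ du -sh /home/netId/ondemand

# Remove the temporary session files, then log out of OnDemand and back in
$ rm -rf /home/netId/ondemand/*
```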
## Job stopped unexpectedly
Jobs that start and run correctly can still stop or fail unexpectedly for a variety of reasons. Again, the output file is a good starting place and may contain a useful error for diagnosing the issue.
### Memory
A common failure is the job running out of memory. Jobs that fail with memory issues often produce normal, useful output until they stop abruptly. In this case, your job output file might reference an `OOM` or `Out-Of-Memory` error.
Another way to diagnose memory issues is the `seff` command.
```
$ seff 250
Job ID: 250
Cluster: cluster
User/Group: user/sg-group
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 5
CPU Utilized: 00:00:01
CPU Efficiency: 6.67% of 00:00:15 core-walltime
Job Wall-clock time: 00:00:03
Memory Utilized: 0.00 MB (estimated maximum)
Memory Efficiency: 0.00% of 30.00 GB (30.00 GB/node)
```
If your job fails due to memory, the `State:` output will show `OUT_OF_MEMORY`, and you will typically see memory utilization at or above what you allocated. To fix a memory issue, try increasing the amount of memory requested in the job script and resubmit.
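As a sketch, increasing the memory request in a job script might look like the lines below; the values are placeholders, so adjust them to what your application actually needs.

```
#SBATCH --mem=64G             # total memory per node
# or request memory per CPU instead (a ## line is ignored by SLURM):
##SBATCH --mem-per-cpu=8G
```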
### Walltime
Another common failure of running jobs is a job timeout. Again, you will see an abrupt end to an otherwise well-running job. A job times out if it tries to run longer than your requested walltime or the maximum walltime. Walltime tells the cluster how long you expect the job to run, and has a maximum value of 7 days. You can also use the `seff` command to diagnose this issue. If a job fails due to timeout, the `State:` output will show `TIMEOUT`. To fix a walltime issue, try increasing the walltime or shortening the simulation, and resubmit.
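As a sketch, the walltime request in a job script looks like this; the value is a placeholder in `days-hours:minutes:seconds` format and cannot exceed the 7-day maximum.

```
#SBATCH --time=3-00:00:00     # request 3 days of walltime (maximum is 7-00:00:00)
```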
## Other resources
Research Computing provides a web portal with all job information called XDMoD. XDMoD collects job accounting data and node-level metrics for all cluster jobs. This data can be used for troubleshooting in the event of a failed job. However, XDMoD is only useful for retrospective analysis: it collects and aggregates data once per day rather than in real time, so jobs that run and finish one day will be available in XDMoD the following day.
Please see the XDMoD guide for more info.
## Getting help
It is not possible to provide an exhaustive list of scenarios, reasons, and guidance here. When in doubt, please contact help-rcc@mcw.edu for expert help.