Troubleshoot Jobs#

Jobs often fail or do not behave as intended. This guide shows how to monitor your job proactively and diagnose issues both in real time and after the job has finished. Depending on the platform you use to access the cluster, several monitoring and diagnostic options are available. We suggest that you familiarize yourself with all of them.

Command-line#

The command line offers several powerful tools to monitor your jobs and diagnose issues. The first is the squeue command, which prints the current set of jobs in the queue. Checking the queue after you submit any job is best practice: it quickly tells you whether your job is running, where it is running, and for how long. When the cluster is busy, or your job violates a scheduler policy, the job may be stuck in the queue. In that case, the squeue output shows the status of your job(s) and the reason they are waiting.

To list only your jobs:

squeue -u NetID
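When a job appears stuck, it can help to narrow the listing to pending jobs and show the scheduler's reason for each. The following is a minimal sketch, assuming a standard Slurm squeue; the helper name pending_jobs is hypothetical, and the column widths in the format string are only a suggestion.

```shell
# Hypothetical helper: list one user's pending jobs together with the
# reason the scheduler gives for holding each of them.
# Usage: pending_jobs NetID   (defaults to $USER when no argument is given)
pending_jobs() {
    # -t PENDING  restricts the listing to jobs waiting in the queue
    # -o ...      selects columns: job ID, partition, name, state,
    #             time used, and node list or pending reason (%R)
    squeue -u "${1:-$USER}" -t PENDING -o "%.10i %.9P %.20j %.2t %.10M %R"
}
```

Calling pending_jobs NetID prints one line per pending job. The last column holds the reason in parentheses, in the same position as NODELIST(REASON) in the default squeue output.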

Another useful command is seff, which prints the workload efficiency metrics for a job. Although you can run this tool during a job, it is best used after the job has finished. The output gives an estimate of CPU and memory efficiency, which is very useful when benchmarking a workload to choose an appropriate memory limit.

To print the efficiency metrics for a job, pass its job ID:

$ seff 250
Job ID: 250
Cluster: cluster
User/Group: user/sg-group
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 5
CPU Utilized: 00:00:01
CPU Efficiency: 6.67% of 00:00:15 core-walltime
Job Wall-clock time: 00:00:03
Memory Utilized: 0.00 MB (estimated maximum)
Memory Efficiency: 0.00% of 30.00 GB (30.00 GB/node)
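The CPU efficiency figure in the seff output above can be reproduced by hand: it is the CPU time used divided by the core-walltime, that is, cores multiplied by wall-clock time. The sketch below assumes this reading of the output; the function name cpu_efficiency is hypothetical and is not part of seff.

```shell
# Hypothetical helper: recompute the CPU efficiency that seff reports.
# Usage: cpu_efficiency CPU_SECONDS CORES WALL_SECONDS
cpu_efficiency() {
    # efficiency = CPU time used / (cores * wall-clock time), as a percentage
    awk -v cpu="$1" -v cores="$2" -v wall="$3" \
        'BEGIN { printf "%.2f%%\n", 100 * cpu / (cores * wall) }'
}

# Values from the seff output above: 1 s of CPU time, 5 cores, 3 s of wall time.
cpu_efficiency 1 5 3   # prints 6.67%
```

A low percentage such as this one suggests that the job requested more cores than it could keep busy.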

Accessing Compute Node#

You may want to access the compute node that is running your job to see further information in real time. Direct compute node access via SSH is prohibited. If you need command-line access to a compute node during your job, you should run the job interactively. Please see interactive jobs for details.

While direct SSH to a compute node is prohibited, there are other ways to pull real-time diagnostics from the compute node(s) that are running your job. For instance, you can run an additional command within an already running job with the srun command.

Suppose that we already have a running job on the cluster.

$ squeue -u NetID
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            263052    normal  testing    NetID  R       0:11      1 cn60

We can retrieve information, in this case the hostname, from the compute node via srun.

$ srun --jobid 263052 hostname
cn60.cluster.local

For useful diagnostics about running processes, we run the ps u -u $USER command.

$ srun --jobid 263052 ps u -u $USER
USER           PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
username     67486  0.0  0.0 126484  2836 pts/0    Ss+  12:31   0:00 /usr/bin/bash
username     73572  0.0  0.0 165776  1932 ?        R    12:37   0:00 /usr/bin/ps u -u username

Inline use of srun is limited, because srun runs a single command with its arguments. To print more information, you can pass a script instead.
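As an illustration of the script approach, the sketch below writes a small diagnostic script and makes it executable. The file name node_report.sh and the choice of commands are assumptions for this example; it also assumes the script is saved on a filesystem that the compute node can see, and that the listed utilities are installed there.

```shell
# Create a small diagnostic script (hypothetical name: node_report.sh).
# The quoted 'EOF' keeps the contents literal, so nothing is expanded here.
cat > node_report.sh <<'EOF'
#!/bin/bash
# Print a short report about the node this script runs on.
echo "== host ==";      hostname
echo "== load ==";      uptime
echo "== memory ==";    free -h
echo "== processes =="; ps -u "$(id -un)" -o pid,pcpu,pmem,rss,etime,comm
EOF

# srun executes the file directly, so it must be executable.
chmod +x node_report.sh
```

The script can then be run inside the job from the earlier example with srun --jobid 263052 ./node_report.sh. Because the script runs on the compute node, commands such as id -un are evaluated there rather than on the login node. If the step waits instead of starting, the job's resources may be fully in use; on recent Slurm versions the --overlap option of srun may help, depending on the cluster's configuration.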

Web portal (XDMoD)#

XDMoD collects job accounting data and node-level metrics for all cluster jobs. This data can be used to troubleshoot a job after it has crashed.

Please see the XDMoD guide for more info.