Job scheduling and maintenance#
Why is my job blocked?
This is the most common question before cluster maintenance. It has a simple, but hard to explain answer. Since we answer this question often, we wanted to share the answer, and provide an easy tool to help.
Some background info...
Each job has a walltime request. The walltime request sets the amount of time that your job will be allowed to run on the cluster. On the HPC cluster, the maximum allowed walltime is 7 days.
If you submit a job before a maintenance window and the job is sitting in the queue, first check to see why with squeue
. In the output, you'll see the last column NODELIST(REASON)
. If your job is held for maintenance, you'll see (ReqNodeNotAvail, Reserved for maintenance)
.
Answering the question...
If you submit your job with X hours walltime request, and the maintenance window starts in less than X hours, the job will be held until after the maintenance window closes. The reason is that the scheduler cannot guarantee your job will finish before maintenance begins.
Fixing the issue...
To make the job run, resubmit your job with a walltime request that ensures the job will finish before the maintenance window. We can use a simple formula to find that walltime.
walltime = ( $maint_start_time - $current_time )
To simplify the process, RCC created a new login node command, maxwalltime
. This command will solve the above formula, and give you the maximum walltime request that will allow your job to run before maintenance.
We hope this helps and stay tuned for more helpful tips and solutions posted here in the News section.