
What is scratch space?

Every RCC user has access to scratch space (i.e., /scratch) on the cluster, but many do not use it properly, or might not understand its purpose. Here we'll talk about scratch space, why it exists, and how to use it properly.

Overview

Scratch space is traditionally the high-performance storage component in any cluster. Its purpose is to hold temporary files generated by running jobs. Many HPC jobs need to write large files, or many files, that are temporary and only used during the job. These might include checkpoints (incremental save points) and other intermediate files. Such files may be needed to diagnose a failed job, but they are not kept long term and are not integral to publishable results.

A common question is why we separate scratch space from general file storage (i.e., /group), since the separation can require copying data between storage systems and extra data management steps. The simple answer is that high-performance storage is expensive. If we combined the general-purpose and scratch workloads into one storage system, we would have to choose between two poor options. If the combined storage were all general purpose, job workloads could crash or severely degrade the slower storage system. If it were all high performance, we would pay a premium to hold general file data (which is most of our data) that does not need that performance.

Why talk about scratch space?

Most users do not understand or properly use scratch space. We have 215 TB of high-performance scratch space, and approximately 50-60% of it is improperly utilized at any given time. Much of this improper use consists of old files that were never cleaned up, or of files kept in scratch to avoid additional storage charges in /group.

Why is this a problem?

RCC staff constantly have to contact users to ask them to clean up their scratch space. This takes time and effort, and is not very effective. The scratch space remains unnecessarily full, which means new users will not have space to work. Moreover, existing users are forced to work within a small amount of space, which hurts productivity.

Proper use

We have a simple workflow to follow:

  1. User copies job input/supporting files from an RGS directory within /group/{PI_NetID}/... to their scratch directory /scratch/g/{PI_NetID}
  2. User submits job that computes with the staged job input/supporting files
  3. Job finishes and user copies results from /scratch/g/{PI_NetID} back to /group/{PI_NetID}/...
  4. User continues with further computations with the job input/supporting files
  5. User finishes computations and deletes unneeded job input/supporting files from /scratch/g/{PI_NetID}
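
The workflow above can be sketched in Python. This is only an illustration, not an RCC-provided tool: the helper names stage_in, stage_out, and clean_scratch are hypothetical, and the directories are passed in by the caller rather than hard-coded. It uses only the standard library and assumes Python 3.8 or newer (for the dirs_exist_ok argument).

```python
import shutil
from pathlib import Path


def stage_in(group_dir: Path, scratch_dir: Path) -> None:
    """Step 1: copy job input/supporting files from group storage to scratch."""
    shutil.copytree(group_dir, scratch_dir, dirs_exist_ok=True)


def stage_out(scratch_results: Path, group_results: Path) -> None:
    """Step 3: copy results from scratch back to group storage."""
    shutil.copytree(scratch_results, group_results, dirs_exist_ok=True)


def clean_scratch(scratch_dir: Path) -> None:
    """Step 5: delete the staged files from scratch once they are no longer needed."""
    shutil.rmtree(scratch_dir)
```

In practice, group_dir would be a directory under /group/{PI_NetID}/... and scratch_dir a directory under /scratch/g/{PI_NetID}. Job submission (step 2) is left out because it depends on the scheduler. Call clean_scratch only after stage_out has succeeded, since it removes the directory permanently.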

Summary

Try to remember that scratch space is shared. Proper use of scratch space benefits all users. So, clean up your scratch space by following a simple rule: when you are not running jobs, your scratch space should be empty!
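
One way to check yourself against this rule is to list whatever is still sitting in a scratch directory, oldest first. The sketch below is a minimal example using only the Python standard library; the function name leftover_files and the min_age_days parameter are assumptions made for illustration, not part of any RCC tooling, and file age is judged by modification time.

```python
import os
import time
from pathlib import Path


def leftover_files(scratch_dir, min_age_days=0.0, now=None):
    """Return (path, size_bytes, age_days) tuples for files under scratch_dir
    that were last modified at least min_age_days ago, oldest first."""
    now = time.time() if now is None else now
    found = []
    for dirpath, _dirnames, filenames in os.walk(scratch_dir):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                info = path.lstat()
            except OSError:
                continue  # file disappeared while scanning, e.g. removed by a job
            age_days = (now - info.st_mtime) / 86400
            if age_days >= min_age_days:
                found.append((path, info.st_size, age_days))
    # Oldest files first, since those are the likeliest cleanup candidates.
    return sorted(found, key=lambda item: item[2], reverse=True)
```

An empty result for your directory under /scratch/g/{PI_NetID} while no jobs are running means the rule is satisfied. The function only reports files and never deletes anything, so it is safe to run at any time.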