Submission Scripts
In order for the cluster to run a user’s job, the user has to write and submit a Slurm script. A Slurm script is a shell script that tells Slurm how many resources should be allocated for the job, loads the job’s dependencies (such as modules), and runs any other tools, such as debuggers, that the user would like to add.
Below is an example of a simple Slurm script that runs a Python job.
```bash
#!/bin/bash
#SBATCH --cpus-per-task=4
#SBATCH --mem=2G
#SBATCH --time=1:00:00

# The anaconda module provides Python
module load anaconda/anaconda-2023.09

# Activate one of the pre-made ARCC environments (or
# your own environment) on top of the base environment.
conda activate base

# Run a Python script
python3 python-script.py
```
The very first line of every Slurm script is #!/bin/bash. This is called a shebang, and it tells the system to use the bash shell to run the script.
The second section of slurm-script.slurm is where the user writes the Slurm directives. Although these lines look like comments because they begin with the “#” character, they let us choose how many resources to allocate for the job, along with many other attributes. The Slurm directive syntax is as follows: #SBATCH [option]
Below is a list of some commonly used Slurm directives with brief explanations:
Slurm Option | Description |
---|---|
--nodes=<n> | Number of nodes to request. Default is one. |
--time=dd-hh:mm:ss | Amount of time needed for this job. Default is one hour. |
--ntasks-per-node=<n> | Number of tasks per requested node. |
--mem=<n>G | The memory needed for the job. |
--gres=gpu:<n> | Number of GPUs needed for the job. |
--gres=gpu:<type>:<n> | Request a specific type of GPU. |
--cpus-per-task=<n> | Number of CPUs allocated to each task. |
--job-name=<name> | Gives your job a name. |
--output=<name> | Specifies the file name for standard output. Use %j for the job ID. |
--error=<name> | Specifies the file name for standard error. Use %j for the job ID. |
--partition=<name> | Specifies the partition to submit the job to. |
Slurm offers many more directives, which can be found in the Slurm documentation.
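As an illustrative sketch of how several of these directives can be combined, the script below extends the earlier example. The job name, output file names, partition name, and resource amounts are placeholders; check your cluster’s documentation for the partitions and GPU types that actually exist on your system.

```bash
#!/bin/bash
#SBATCH --job-name=my-gpu-job        # Name shown in the queue
#SBATCH --nodes=1                    # Request a single node
#SBATCH --ntasks-per-node=1          # Run one task on that node
#SBATCH --cpus-per-task=8            # 8 CPUs for that task
#SBATCH --mem=16G                    # 16 GB of memory
#SBATCH --time=02:00:00              # 2-hour time limit
#SBATCH --gres=gpu:1                 # One GPU of any type
#SBATCH --partition=gpu              # Placeholder partition name
#SBATCH --output=my-gpu-job_%j.out   # Standard output file; %j is the job ID
#SBATCH --error=my-gpu-job_%j.err    # Standard error file; %j is the job ID

# Load the modules the job needs, then run it
module load anaconda/anaconda-2023.09
python3 python-script.py
```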
Submitting a Slurm Script
Once the Slurm script is written, the user may submit it using the sbatch command. The syntax is as follows:
sbatch [slurm script]
If the user has successfully submitted the script, the following message should be displayed:
Submitted batch job [job ID]
Once the job has been submitted and given a job ID, the user may check its current status using squeue -u [username]. This will list all of the user’s jobs that are currently queued or running.
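For example, assuming the script above is saved as slurm-script.slurm and the username is jdoe (both placeholders), submitting the job and checking on it might look like this:

```bash
# Submit the Slurm script (the file name is a placeholder)
sbatch slurm-script.slurm
# Expected response: Submitted batch job 123456

# List your queued and running jobs (replace jdoe with your username)
squeue -u jdoe
```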
Getting an Interactive Session on a Compute Node
Using a Slurm script to submit your jobs is the recommended method; however, the user can also connect directly (interactively) to a compute node and start their job manually, without the need for a Slurm script.
Warning
Deploying your jobs directly from a compute node is not recommended for long jobs, because losing your connection will automatically terminate your job.
In order to connect to a compute node, the user can use the srun command. The syntax for srun is as follows:
srun [slurm directives] --pty bash
Similarly to a Slurm script, the user can add several Slurm directives to their srun command. Slurm will then try to connect the user to a compute node that has the available resources described by those directives.
For example, the job defined in the Slurm script above can be run interactively using the following srun command:
srun --time=1:00:00 --cpus-per-task=4 --mem=2G --pty bash
This requests a 1-hour job with 4 CPUs and 2GB of memory. If the allocation is successful, your prompt will show you’re connected to a compute node.
At this point, the user can load any modules and run their code manually.
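As a rough sketch of a full interactive session, using the same module and script names as the example Slurm script above:

```bash
# Request an interactive 1-hour session with 4 CPUs and 2 GB of memory
srun --time=1:00:00 --cpus-per-task=4 --mem=2G --pty bash

# Once the prompt shows a compute node, work as usual:
module load anaconda/anaconda-2023.09
conda activate base
python3 python-script.py

# Leave the compute node and release the allocation
exit
```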