Job Submission within the ARCC
Workflow on an HPC system typically involves submitting jobs to a batch workload manager, where they wait until sufficient compute resources are available to execute them. Scheduling, resource management, and accounting of usage on ARCC resources such as Stokes and Newton are handled by the Simple Linux Utility for Resource Management (SLURM). To use our computing facilities, you will have to learn some of the basics of job and account management in SLURM.
To create a job, you typically need to write a submit script. A submit script is just like any other Unix shell script, except that it also contains directives to the workload manager that tell it how many resources you need, for how long, and so on. Submit scripts are submitted to SLURM, which assigns each job a job ID and puts it in a partition (SLURM calls queues "partitions"). You can see what partitions are available and for what nodes by typing the following command at the Unix command line:
$ sinfo
PARTITION    AVAIL  TIMELIMIT   NODES  STATE  NODELIST
normal*      up     infinite    178    Idle*  ec[1-8,155-324]
preemptable  up     2-00:00:00  178    Idle*  ec[1-8,155-324]
You can see the status of all running and pending jobs using the following command:
$ squeue
JOBID  PARTITION  NAME      USER      ST  TIME   NODES  NODELIST(REASON)
37     normal     SocNetPD  pwiegand  R   15:32  2      ec[7-8]
To create a SLURM submit script, use an editor to create a text file. Put each SLURM directive on its own line, prefixed with #SBATCH.
$ vi simple-submit.slurm
Here is a very simple example MPI script:
# Load modules
echo "Slurm nodes assigned :$SLURM_JOB_NODELIST"
module load openmpi-1.8.6-ic-2015.3.187
The meaning of each of these parameters is as follows:
- account indicates which account you are using (see below);
- nodes indicates how many compute nodes you are requesting;
- ntasks-per-node indicates how many cores per compute node you would like;
- time indicates how long you expect the job to take (in terms of walltime, not cumulative compute time);
- error specifies the name of the file into which the error stream of your job will be directed (the %J argument gives the job ID);
- output specifies the name of the file into which the output stream of your job will be directed (the %J argument gives the job ID);
- job-name specifies the name of the job as you will see it in squeue, for example.
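Putting the parameters above together, a complete submit script might look like the sketch below. The snippet writes the script to simple-submit.slurm. The node, task, time, and job-name values mirror the interactive srun example later on this page; the account name my_account, the error and output file names, and the executable ./my_mpi_program are hypothetical placeholders, not values taken from the ARCC example.

```shell
# Write a sketch of a submit script to simple-submit.slurm.
# Assumptions: my_account, the .err/.out file names, and ./my_mpi_program
# are placeholders; replace them with your own values.
cat > simple-submit.slurm <<'EOF'
#!/bin/bash
#SBATCH --account=my_account
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=10
#SBATCH --time=01:00:00
#SBATCH --error=SimpleMPIJob-%J.err
#SBATCH --output=SimpleMPIJob-%J.out
#SBATCH --job-name=SimpleMPIJob

# Load modules
echo "Slurm nodes assigned :$SLURM_JOB_NODELIST"
module load openmpi-1.8.6-ic-2015.3.187

# Launch the (hypothetical) MPI program on the allocated cores
mpirun ./my_mpi_program
EOF

# Show the directives that SLURM will read from the script
grep '^#SBATCH' simple-submit.slurm
```

The quoted heredoc delimiter ('EOF') keeps $SLURM_JOB_NODELIST unexpanded in the file, so it is evaluated only when the job actually runs.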
You should try to be as accurate as possible with the time parameter. If it is too short, your job will be killed before it finishes. If it is too long, it may take the scheduler longer to find resources than it should. By default, your job is given just under 2 GB of memory per core (specifically 1990 MB). If your job tries to use more than this, it will most likely crash. If you need more, you must request it using the --mem-per-cpu=<amount-in-MB> parameter. Our SLURM configuration will not let a job use more memory or cores than were requested.
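As a rough illustration of the memory rule above, the following sketch converts a per-core memory requirement into the --mem-per-cpu flag only when it exceeds the 1990 MB default. The 4 GB requirement is a made-up example value, and the command is printed rather than executed.

```shell
# Default memory per core on ARCC systems, in MB (from the text above)
default_mb=1990
# Hypothetical requirement for this example: 4 GB per core
needed_gb=4
needed_mb=$(( needed_gb * 1024 ))

# Only add the flag when the default is not enough
mem_flag=""
if [ "$needed_mb" -gt "$default_mb" ]; then
  mem_flag="--mem-per-cpu=${needed_mb}"
fi

# Print the resulting command so it can be checked before running it
echo "sbatch ${mem_flag} simple-submit.slurm"
```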
To submit this script to the scheduler, type:
$ sbatch simple-submit.slurm
The job will be placed in a queue and then deployed to the resources assigned to it as soon as the workload manager can find those resources. When the job executes, it will begin in the directory you were in when you submitted it. So if you submit from the directory in which you would like to run, you do not need to change to that directory in your submit script. Our builds of MPI are smart enough to know which nodes you were assigned and to use the Infiniband network, so you do not have to create a hosts file; however, if you do need to know what resources were assigned to you from inside the submit script, you can use the $SLURM_JOB_NODELIST environment variable.
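If a program does need an explicit list of hosts, one possible approach is sketched below. It expands the compact node list with scontrol show hostnames and falls back gracefully when run outside a job, where the SLURM variables are not set. The file name hosts.txt is an arbitrary choice for this example.

```shell
# Inside a job, SLURM sets SLURM_JOB_NODELIST (compact form, e.g. ec[7-8])
# and SLURM_SUBMIT_DIR. Outside a job both are unset, so fall back gracefully.
nodelist="${SLURM_JOB_NODELIST:-}"
node_count=0
if [ -n "$nodelist" ] && command -v scontrol >/dev/null 2>&1; then
  # Expand the compact list into one hostname per line
  scontrol show hostnames "$nodelist" > hosts.txt
  node_count=$(( $(wc -l < hosts.txt) ))
fi

echo "Submitted from: ${SLURM_SUBMIT_DIR:-$PWD}"
echo "Nodes assigned: $node_count"
```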
The ARCC has placed some examples on the system in the following directory:
SLURM provides several ways to interact with the workload manager. For example, if you just want to run 26 copies of a particular executable wherever SLURM can find room for them, type:
$ srun --ntasks=26 ./myexecutable
Or, if you just want SLURM to allocate resources for you and then figure out what to do with them once you get them, you can use salloc. The salloc command will create a new shell on the login node and allocate the requested resources (once they are available). If you want to use them, you will have to either use srun or ssh to one of the nodes you were assigned. You can use this idea to get an interactive session. Indeed, we strongly advise you do so for large compilations since the ulimits for users on the login nodes are more limited than on the compute nodes.
The following is an example shell session using salloc.
$ salloc --nodes=2 --ntasks-per-node=3 --time=00:30:00
salloc: Granted job allocation 100
$ srun hostname
$ ssh ec155
Last login: Sun Dec 20 09:32:03 2015 from euser1
$ echo "I get three whole cores here!"
I get three whole cores here!
Connection to ec155 closed.
salloc: Relinquishing job allocation 100
If you need to cancel the job before it has completed, type:
$ scancel <job-id-number>
To learn more about your job, either while it is waiting to run (pending) or while it is running, type:
$ scontrol show job <job-id-number>
If your job is waiting in the queue and you are confused about why it is not running while other jobs seem to be, you can use that command to learn more. Also, squeue provides an abbreviated "REASON" column to give you some idea whether there is a problem with your job. Keep in mind that prioritization of jobs is a complicated function of several factors, including recent usage. Those who use system resources less have higher priority than those who use them more; this is known as FairShare. Older jobs (those that have been in the queue longer) also have increased priority. You can see the relative priorities of all waiting jobs with the following command:
$ sprio
Currently, we have only two partitions (queues): normal and preemptable. By default, jobs are queued into the normal partition. There are no constraints on this partition, but once the account being charged for the hours has reached its limit for the month (see the next subsection), no more jobs will be executed in the normal partition and running jobs will be killed. If there are compute resources available and you still have work to do, you can submit your job to the preemptable partition. Jobs submitted to this partition have a lower priority than jobs submitted to the normal partition and, as the name implies, jobs running in that partition can be preempted by other jobs. That is, if a higher-priority job requests resources that are otherwise unavailable, SLURM will cancel the preemptable job to make space for it. To submit to the preemptable partition, type:
$ sbatch --partition=preemptable --qos=preemptable simple-submit.slurm
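Because a preemptable job can be cancelled at any time, it may help to structure the work so that a resubmitted job picks up where the previous one stopped. The sketch below shows one simple way to do that inside a submit script, assuming the work can be split into independent steps. The file name progress.txt, the step count, and the placeholder work are all hypothetical.

```shell
# Hypothetical progress file and step count; adjust for your own workload.
progress_file="progress.txt"
total_steps=5

# Resume after the last completed step if an earlier run was preempted
start=1
if [ -f "$progress_file" ]; then
  start=$(( $(cat "$progress_file") + 1 ))
fi

step=$start
while [ "$step" -le "$total_steps" ]; do
  # Placeholder for the real work of this step
  echo "running step $step"
  # Record the step only after it has finished
  echo "$step" > "$progress_file"
  step=$(( step + 1 ))
done
```

Recording a step only after it completes means a job cancelled mid-step repeats that step on the next run rather than skipping it.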
It is generally best if distributed jobs run on nodes attached to the same Infiniband switch. Luckily, you don't have to specify switches to try to keep your jobs on one switch. SLURM knows the topology of our network and tries to find resources for a job that are all on one switch, if that is possible.
For more comprehensive information about using SLURM and job directives, consult the SLURM Quick Start User's Guide. You can also scan over these slides from a HUGO event last year. Also, NERSC maintains a useful Slurm-To-Torque translative page.
Sometimes when debugging, compiling, or running certain software, it will be necessary to run your job interactively. To start an interactive session, you can use the following command:
$ srun --pty bash
The above command automatically uses the default parameters that are set up on Stokes and logs you directly into the node associated with your job. If anything needs to differ from the defaults, you can specify parameters just as you would in a submission script. They follow the same syntax as those in the "#SBATCH" lines in your script.
An example interactive session command submitted with the same parameters as the example MPI script above:
$ srun --account=`id -g -n` --nodes=4 --ntasks-per-node=10 --time=01:00:00 --job-name=SimpleMPIJob --pty bash
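The backquoted id -g -n in the command above expands to the name of your primary Unix group, which the command appears to assume is also the name of your SLURM account. The sketch below builds the same command in a variable and prints it, so you can inspect the expanded account name before starting a session; treat the account-equals-group assumption as something to verify for your own login.

```shell
# Assumption: your SLURM account has the same name as your primary Unix group.
account=$(id -g -n)

# Same parameters as the interactive example above
interactive_cmd="srun --account=${account} --nodes=4 --ntasks-per-node=10 --time=01:00:00 --job-name=SimpleMPIJob --pty bash"

# Print the command instead of running it, so you can inspect it first
echo "$interactive_cmd"
```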
Every user on ARCC resources is associated with a particular SLURM account. Typically, a principal investigator (PI) is responsible for an account, and the "users" of that account are the PI and his or her students. There is a limit to the compute resources available to each account each month. On Stokes, currently each account is given at least 80,000 dedicated processor hours to spend. SLURM prefers to report this number in minutes, so the standard monthly allocation is 4,800,000 minutes.
Your usage is a total of all the processor time you have consumed. For example, if you run a job for 10 minutes on 2 nodes using 6 cores on each node, you will have consumed two hours of compute time (10*2*6=120 minutes). You can see usage from the beginning of the month by typing the following command, replacing 12/1/15 with the start date in which you are interested:
$ sreport cluster AccountUtilizationByUser start=12/1/15
Cluster/Account/User Utilization 2015-12-01T00:00:00 - 2015-12-17T23:59:59 (1468800 secs)
Use reported in TRES Minutes
Cluster   Account         Login     Proper Name     Used       Energy
--------- --------------- --------- --------------- ---------- ----------
stokes    root                                      195        0
stokes    root            root      root            4          0
stokes    pwiegand                                  191        0
stokes    pwiegand        nlucas                    3          0
stokes    pwiegand        pwiegand                  188        0
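The usage arithmetic described above is simple enough to check in the shell. This sketch reproduces the 10-minute, 2-node, 6-core example and the conversion of the 80,000-hour Stokes allocation into minutes; the variable names are chosen for this illustration only.

```shell
# Example job from the text: 10 minutes on 2 nodes with 6 cores each
minutes=10
nodes=2
cores_per_node=6
used_minutes=$(( minutes * nodes * cores_per_node ))

# Standard monthly allocation on Stokes, converted from hours to minutes
allocation_hours=80000
allocation_minutes=$(( allocation_hours * 60 ))

echo "Job cost: $used_minutes processor minutes"
echo "Monthly allocation: $allocation_minutes processor minutes"
```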
You can also use sshare to see your usage in terms of how it relates to your job priority:
Account              User       RawShares  NormShares  RawUsage    EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
pwiegand             pwiegand   1          1.000000    11280       1.000000      0.333333
The RawUsage column here reports usage in seconds. EffectvUsage (effective usage) is a ratio that gives a sense of how much of the cluster resources that have been used were used by you relative to others in your account. The FairShare column gives your user's current fair share value. This is a number between 0 and 1. The more resources you consume relative to everyone else on the cluster, the lower that number gets and the lower your job priority is. This ensures no one monopolizes the resources unfairly. You can find out more information about FairShare here.
To determine what the constraints are for your account, you can use the following command and look for the cpu=<minutes> in the GrpTRESMins column.
$ sacctmgr show qos arcc
Name       Priority   GraceTime  Preempt    PreemptMode Flags  UsageThres UsageFactor GrpTRES       GrpTRESMins
---------- ---------- ---------- ---------- ----------- ------ ---------- ----------- ------------- -------------
arcc       100        00:00:00              cluster                        1.000000                  cpu=4800000
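To relate the limit to actual usage, you can subtract the minutes reported by sreport from the cpu=<minutes> value. The sketch below does this with the numbers shown on this page (the 4,800,000-minute limit and the 191 minutes used by the pwiegand account); in practice you would substitute the values from your own output.

```shell
# Value copied from the GrpTRESMins column above
grp_tres_mins="cpu=4800000"
# Strip the "cpu=" prefix to get the limit in minutes
limit_minutes="${grp_tres_mins#cpu=}"

# "Used" value for the pwiegand account in the sreport output on this page
used_minutes=191

remaining_minutes=$(( limit_minutes - used_minutes ))
echo "Remaining: $remaining_minutes minutes (about $(( remaining_minutes / 60 )) hours)"
```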
You can also see recent history of your account using sacct:
$ sacct -X
JobID        JobName    Partition  Account    AllocCPUS  State      ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
85           bash       normal     pwiegand   10         CANCELLED+ 0:0
86           bash       normal     pwiegand   6          TIMEOUT    1:0
87           bash       normal     pwiegand   1          CANCELLED+ 0:0
88           bash       normal     pwiegand   1          CANCELLED+ 0:0
89           bash       normal     pwiegand   1          CANCELLED+ 0:0
90           small-tes+ preemptab+ pwiegand   12         PREEMPTED  0:0
You can find more information about SLURM accounting here.
Newton is a smaller cluster with some specialized GPU resources. Consequently, the monthly allocation is lower on Newton (10,000 dedicated processor hours of CPU time and 2,000 hours of GPU time). Also, since users will typically be requesting specialized resources, you need to know how to specify this in your submit script. The parameter of interest here is --gres (for generic resource). To indicate that you would like such resources, you specify the resource and the number, for example: --gres=gpu:2. Also, because Newton is configured so that SLURM schedules the cores "closest" to the GPU, you will typically want to ensure you are using one CPU core per task. Here is an example submit script that asks for one node, four cores per node, and two GPUs on each node:
# Load modules
echo "Slurm nodes assigned :$SLURM_JOB_NODELIST"
module load cuda/cuda-9.0
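For reference, a full GPU submit script matching that description (one node, four cores per node, two GPUs per node, one CPU core per task) might look like the sketch below. The snippet writes it to gpu-submit.slurm. The file name, the account my_account, the time limit, the job name, the error and output file names, and the executable ./my_gpu_program are assumptions made for this illustration.

```shell
# Write a sketch of a GPU submit script to gpu-submit.slurm.
# Assumptions: my_account, the time limit, the job and file names, and
# ./my_gpu_program are placeholders; replace them with your own values.
cat > gpu-submit.slurm <<'EOF'
#!/bin/bash
#SBATCH --account=my_account
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:2
#SBATCH --time=01:00:00
#SBATCH --error=GpuJob-%J.err
#SBATCH --output=GpuJob-%J.out
#SBATCH --job-name=GpuJob

# Load modules
echo "Slurm nodes assigned :$SLURM_JOB_NODELIST"
module load cuda/cuda-9.0

# Run the (hypothetical) GPU program
./my_gpu_program
EOF

# Confirm the GPU request made it into the script
grep -- '--gres' gpu-submit.slurm
```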
You can find more information about SLURM Generic Resources.