Introduction to Job Scheduling
Job schedulers are software that provides a batching capability on a computational cluster. Users submit their jobs to the scheduler, which determines which nodes best fit each job in terms of time and functionality. The job scheduler also tracks how much time your job takes and charges it against your allocation. Users submit their jobs via scripts from the login/user node; each job then enters an appropriate queue for placement by the job scheduler.
Figure: Example of a job scheduler.[1]
The ARCC uses the Slurm Workload Manager (usually just called "Slurm") for job scheduling. Slurm is used on most clusters, including those at UCF as well as many regional and national systems.
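For illustration, here is a minimal sketch of a Slurm batch script that could be submitted from the login node. The job name, output file name, and the program my_program are placeholders, not ARCC-specific requirements; the #SBATCH directives are standard Slurm options.

#!/bin/bash
#SBATCH --job-name=example_job      # name shown in the queue (placeholder)
#SBATCH --nodes=1                   # number of nodes requested
#SBATCH --ntasks=4                  # number of CPU cores (tasks) requested
#SBATCH --time=01:00:00             # maximum wall time (1 hour)
#SBATCH --output=example_job.out    # file that captures the job's standard output

# Run the (hypothetical) program on the allocated resources
srun ./my_program

The script is submitted with sbatch (for example, sbatch example_job.sh); the scheduler then places the job on suitable nodes when they become available.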
Warning
Job schedulers, including Slurm, also enforce the maximum amount of time your job is allowed to run, and will kill it if it reaches that limit!
Choose wisely when requesting resources; a sketch of the corresponding Slurm directives follows this list. For example, if you:
- Request significantly more time than your job actually needs, then the scheduler may take longer to find a slot for your job.
- Request less time than your job actually needs to finish, then the scheduler will simply kill the job once the time allocated is up.
- Request resources (memory, GPUs) that are not available at all, then your job may be stuck in the queue forever.
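As a sketch of how such requests are written, the following #SBATCH directives (standard Slurm options; the values are only illustrative) set the wall-time limit, memory, and GPU count for a job:

#SBATCH --time=02:00:00    # maximum wall time; the scheduler kills the job after 2 hours
#SBATCH --mem=16G          # memory per node; requesting more than any node has keeps the job queued
#SBATCH --gres=gpu:1       # request one GPU (only meaningful on a GPU cluster such as Newton)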
Slurm Allocations within the UCF ARCC
At UCF, each faculty member who requests access to the clusters is currently granted subsidized time on the two clusters, Stokes and Newton. We call this an allocation. A student (or other user) working under that faculty member draws from that allocation.
On Stokes (our general-purpose cluster), each faculty member is currently granted 80,000 core-hours per month. These are available each month and do not roll over if unused. On Newton (our GPU cluster), each faculty member receives 10,000 CPU core-hours and 2,000 GPU hours per month. These, similarly, do not roll over if unused.
Note
Any user can see the state of their allocation at any time by running "myusage" on either cluster. For example, on Stokes:
euser1 1912% myusage
Usage for Account arcc on the stokes cluster for start=2025-02-01 end=2025-03-01
==================================================
CPU Used: 3.4 hours of 80,000.0 (0.1%)
==================================================
USER CPU Hours
---- ---------
arcc 3.4
or, on Newton:
evuser1 1182% myusage
Usage for Account arcc on the Newton cluster for start=2025-02-01T00:00:00 end=2025-03-01T00:00:00
==================================================
CPU Used: 114.2 hours of 10,000.0 (1.1%)
GPU Used: 14.3 hours of 2,000.0 (0.8%)
==================================================
USER CPU-Hours GPU-Hours
---- --------- ---------
arcc 114.2 14.3
When a job ends, the total number of core-hours (and GPU hours, if applicable) is deducted from the user's allocation. For example, if a user's job takes 3 hours to run and uses 8 cores each on 5 nodes, then 120 total core-hours are deducted (3 x 8 x 5 = 120).
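As a sketch, the resource request corresponding to that example could be written with the following standard Slurm directives (the values simply mirror the example above):

#SBATCH --nodes=5              # 5 nodes
#SBATCH --ntasks-per-node=8    # 8 cores on each node
#SBATCH --time=03:00:00        # up to 3 hours of wall time
# If the job runs for the full 3 hours:
#   3 hours x 8 cores/node x 5 nodes = 120 core-hours charged to the allocation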
[1] Retrieved from https://hpc-wiki.info/hpc/File:Batch_System.PNG on February 23, 2025.