Slurm partitions
The default partition is named `cpu`. It is limited to 960 compute nodes.
| Partition | Nodes | Time limit | Node list | Memory |
|---|---|---|---|---|
| dev | 8 | 30:00 | login[0001-0008] | 257496 MiB (251 GiB) |
| cpu | 960 | 2-00:00:00 | cn[0001-0960] | 257470 MiB (251 GiB) |
| longcpu | 6 | 4-00:00:00 | cn[0010-0015] | 257470 MiB (251 GiB) |
| gpu | 60 | 2-00:00:00 | gn[01-60] | 515517 MiB (503 GiB) |
| largemem | 192 | 2-00:00:00 | cn[0385-0576] | 1031613 MiB (1007 GiB) |
Partitions and jobs
Slurm views the resources of a cluster as nodes. Nodes with the same hardware configuration are grouped into partitions. A partition is therefore a logical set of nodes, but it can also be regarded as a job queue with its own constraints, such as limits on job size, time limits, which users may use the partition, and so on. Jobs are allocated nodes within a partition until the resources (nodes, processors, memory, etc.) of that partition are exhausted. Once a set of nodes has been allocated to a job, the user can launch parallel work in the form of job steps in any configuration within that allocation. For example, a single job step may use all the nodes allocated to the job, or several job steps may run at the same time, each independently using part of the allocation. Slurm also manages the processors allocated to a job, so that multiple job steps can be submitted at once and are queued until resources become available within the job allocation.
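As a rough sketch of this job-step model (the node counts, task counts and the program name `my_app` below are illustrative assumptions, not taken from the cluster documentation), a job script could start two steps at the same time, each on half of the allocation:

```
#!/bin/bash
#SBATCH --job-name=step-demo       # illustrative job name
#SBATCH --partition=cpu            # default partition
#SBATCH --nodes=4                  # whole allocation for the job
#SBATCH --ntasks-per-node=8
#SBATCH --time=00:30:00

# Each srun starts one job step on part of the allocation. Both steps
# run concurrently; a step that asked for more resources than are free
# inside the job would be queued until an earlier step finishes.
srun --nodes=2 --ntasks=16 ./my_app input1.dat &
srun --nodes=2 --ntasks=16 ./my_app input2.dat &
wait                               # wait for both job steps to finish
```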
To make a job run on the appropriate node type, specify the partition in the job script with the `--partition` option followed by the partition name.
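For example, a minimal job script targeting the `largemem` partition from the table above could look like this (the memory request and the program name are placeholders):

```
#!/bin/bash
#SBATCH --partition=largemem       # run on the large-memory nodes
#SBATCH --nodes=1
#SBATCH --mem=500G                 # placeholder memory requirement
#SBATCH --time=02:00:00

srun ./memory_hungry_app           # placeholder program
```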
The partitions available in the cluster can be listed with the `sinfo` command.
```
[user@login0001]# sinfo -s
PARTITION AVAIL TIMELIMIT NODES(A/I/O/T) NODELIST
cpu* up 2-00:00:00 0/936/24/960 cn[0001-0960]
largemem up 2-00:00:00 0/185/7/192 cn[0385-0576]
gpu up 2-00:00:00 0/59/1/60 gn[01-60]
longcpu up 4-00:00:00 0/22/0/22 cn[0010-0015]
dev up 30:00 0/8/0/8 login[0001-0008]
```
See also the following options (combined in the example after this list):
- `sinfo -l -N` – detailed information
- `sinfo -T` – display reservations
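These options can be combined and restricted to a single partition; the partition name below is chosen only as an example:

```
sinfo -l -N -p gpu    # per-node details for the gpu partition
sinfo -T              # list current reservations
```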
Detailed information about all partitions in the cluster can be displayed with the `scontrol show partition` command.
```
# scontrol show partition
PartitionName=cpu
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=YES QoS=N/A
DefaultTime=00:10:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=2-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=cn[0001-0960]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=245760 TotalNodes=960 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerCPU=1000 MaxMemPerNode=UNLIMITED
TRESBillingWeights=CPU=1.0,Mem=1G
PartitionName=largemem
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=00:10:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=2-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=cn[0385-0576]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=YES:4
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=49152 TotalNodes=192 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerCPU=1000 MaxMemPerNode=UNLIMITED
TRESBillingWeights=CPU=1,Mem=0.25G
PartitionName=gpu
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=00:10:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=2-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=gn[01-60]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=15360 TotalNodes=60 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerCPU=1000 MaxMemPerNode=UNLIMITED
TRESBillingWeights=CPU=1.0,Mem=0.5G,GRES/gpu=64.0
PartitionName=longcpu
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=00:10:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=4-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=cn[0010-0015]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=YES:4
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=1536 TotalNodes=6 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerCPU=1000 MaxMemPerNode=UNLIMITED
TRESBillingWeights=CPU=1.0,Mem=1G
PartitionName=dev
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=00:30:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=login[0001-0008]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=768 TotalNodes=8 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
```
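The command also accepts a partition or node name, which is convenient when only one of the entries above is needed; the names used here are taken from the lists above and serve only as examples:

```
scontrol show partition gpu    # details of a single partition
scontrol show node cn0001      # details of a single node
```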
There is a special partition on Vega called `dev`. It consists of the login nodes only and is intended for testing: each user can test their scripts there before submitting them to the other partitions.
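For example, a short interactive test within the partition's 30-minute limit could be started like this (the script name is a placeholder):

```
srun --partition=dev --ntasks=1 --time=00:05:00 ./my_test.sh
```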
The `squeue` command can be used to check the partition name (PARTITION), the nodes allocated within the partition (NODELIST) and the status (ST) of the jobs running in these partitions, where R means running and PD pending. See the man page (`man squeue`) for more information.
```
[user@login0004]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
65646 cpu chem mike R 24:19 2 cn[0027-0028]
65647 cpu bio joan R 0:09 1 cn0014
65648 cpu math phil PD 0:00 6 (Resources)
```
The `squeue` command has many options with which the user can easily view information about the jobs of interest:
- `squeue -l` – detailed information about jobs (-l = long),
- `squeue -u $USER` – jobs belonging to $USER,
- `squeue -p <partition_name>` – jobs queued in the given partition,
- `squeue -t PD` – pending jobs,
- `squeue -j <job_id> --start` – estimated start time of a job.

The `scontrol` command can be used to get more detailed information about nodes, partitions, jobs, job steps and the configuration:

- `scontrol show partition` – partition data,
- `scontrol show nodes` – node information.
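These options can also be combined; the job ID below is taken from the `squeue` example above and is only illustrative:

```
squeue -u $USER -p cpu -t PD       # your pending jobs in the cpu partition
squeue -j 65648 --start            # estimated start time of a job
scontrol show job 65648            # full details of a single job
```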