Glossary

Slurm – Slurm (Simple Linux Utility for Resource Management) is a free and open-source job scheduler for Linux and Unix-like kernels that manages computing resources and schedules tasks on them. It is the scheduler used on the Vega cluster.

Scheduler – Software that finds appropriate resources for computational tasks. It allocates nodes, CPUs, memory, and other computing resources and schedules job execution on them.

Partition – A partition is a set of compute nodes with a common feature. Some other batch systems call it a queue.

Node – A node is a physical server that can handle computing tasks and run batch jobs. It is interconnected with other nodes and contains CPUs, memory, and other devices.

Job – A job (batch job) is the basic unit of computing in Slurm. A job allocates resources to a user for a specified amount of time.

Job steps – Job steps are sets of tasks within a job.

Task – A task is a single process. A multi-process program comprises several tasks.
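
The relationship between job, job step, and task can be observed from inside a running program. The following is a minimal sketch, assuming the program is launched with srun inside a batch job, so that Slurm sets the environment variables SLURM_JOB_ID, SLURM_STEP_ID, SLURM_PROCID, and SLURM_NTASKS for each task. The function name describe_task and the fallback values are illustrative only.

```python
import os


def describe_task(env=os.environ):
    """Describe where this process sits in the job / job step / task hierarchy."""
    job_id = env.get("SLURM_JOB_ID", "unknown")    # the job (allocation)
    step_id = env.get("SLURM_STEP_ID", "unknown")  # the job step within the job
    rank = env.get("SLURM_PROCID", "0")            # rank of this task in the step
    ntasks = env.get("SLURM_NTASKS", "1")          # number of tasks requested
    return f"job {job_id}, step {step_id}, task rank {rank} of {ntasks}"


if __name__ == "__main__":
    # Outside of Slurm the variables are missing and the fallbacks are printed.
    print(describe_task())
```

When such a script is started with several tasks, each process should report the same job and step but a different task rank.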

Age – Age is the length of time a job has been waiting in the queue, eligible to be scheduled.

Priority – Priority is a score used to order pending jobs in the queue: jobs with a higher score run before those with a lower score. The priority calculation is based primarily on an account's historical usage of cluster resources. Accounts with high utilization (i.e. lots of jobs and lots of CPUs) have lower priority scores than accounts with lower usage.
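
How a score orders the queue can be illustrated with a small sketch. It assumes that the score is a weighted sum of normalized factors (each between 0 and 1), in the style of Slurm's multifactor priority plugin. The weights, job names, and factor values are made up for illustration and are not the Vega configuration.

```python
from dataclasses import dataclass

# Hypothetical weights; the real values are chosen by the cluster administrators.
WEIGHTS = {"age": 1000, "fairshare": 10000}


@dataclass
class PendingJob:
    name: str
    age: float        # 0.0 = just submitted, 1.0 = waited the maximum counted time
    fairshare: float  # 0.0 = account used far more than its share, 1.0 = unused


def priority(job: PendingJob) -> int:
    """Weighted sum of the job's normalized factors."""
    return round(WEIGHTS["age"] * job.age + WEIGHTS["fairshare"] * job.fairshare)


def order_queue(jobs):
    """Return pending jobs sorted so that the highest priority comes first."""
    return sorted(jobs, key=priority, reverse=True)


queue = [
    PendingJob("heavy_user_job", age=0.9, fairshare=0.1),
    PendingJob("light_user_job", age=0.2, fairshare=0.8),
]

for job in order_queue(queue):
    print(job.name, priority(job))
```

With these assumed weights the fair-share factor dominates, so the job from the lightly used account is ordered first even though it has waited for less time.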

Limits – A variety of limits are used to ensure equitable access to computing resources. The primary limit is a maximum on the number of CPUs in use by any account or user.

Fair share – Fair share is one of the factors that determine the priority of a job. It is computed from a user's effective usage (past usage) and the amount of resources that user should be able to use.
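
One way to combine past usage and allotted share into a single factor is the formula documented for Slurm's classic fair-share algorithm, F = 2^(-U/S), where U is the normalized effective usage and S is the normalized share. The sketch below assumes that formula without the optional dampening factor. Which algorithm Vega actually uses (for example Fair Tree) is not stated here, and the function name is illustrative.

```python
def fair_share_factor(usage: float, shares: float) -> float:
    """Classic-style fair-share factor, between 0 and 1.

    usage:  fraction of the cluster's recent usage consumed by the user (0..1)
    shares: fraction of the cluster the user is allotted (must be > 0)
    """
    if shares <= 0:
        raise ValueError("shares must be positive")
    if usage < 0:
        raise ValueError("usage must not be negative")
    return 2.0 ** (-usage / shares)


# A user with no recent usage gets the maximum factor; the factor is halved
# each time the usage grows by another full share.
for used in (0.0, 0.25, 0.5):
    print(used, fair_share_factor(used, shares=0.25))
```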

QOS – Quality of Service, a set of rules that apply to a job.

Backfill – Backfill is a method that a scheduler uses to maximize resource utilization. It allows (small) lower-priority jobs to start before large higher-priority jobs, as long as doing so does not delay the start of the higher-priority jobs.
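
The backfill condition can be sketched as a simple check. This is a deliberately simplified model, not Slurm's actual implementation: it assumes a single pool of CPUs, one waiting higher-priority job with a known planned start time, and it uses the lower-priority job's time limit as its worst-case run time. All names are hypothetical.

```python
def can_backfill(free_cpus: int, job_cpus: int, job_time_limit: int,
                 now: int, reserved_start: int) -> bool:
    """Decide whether a lower-priority job may start immediately.

    free_cpus:       CPUs idle right now
    job_cpus:        CPUs requested by the lower-priority job
    job_time_limit:  its requested time limit, in minutes
    now:             current time, in minutes
    reserved_start:  planned start time of the higher-priority job, in minutes
    """
    fits_now = job_cpus <= free_cpus
    # Even if the job runs up to its full time limit, it must have released
    # its CPUs by the time the higher-priority job is planned to start.
    done_in_time = now + job_time_limit <= reserved_start
    return fits_now and done_in_time


print(can_backfill(free_cpus=16, job_cpus=8, job_time_limit=30, now=0, reserved_start=60))
print(can_backfill(free_cpus=16, job_cpus=8, job_time_limit=90, now=0, reserved_start=60))
```

Under this model, requesting a short and realistic time limit makes a job more likely to be backfilled. A real scheduler may also start jobs that run past the planned start time if they use only resources the higher-priority job does not need.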

InfiniBand – InfiniBand is the interconnect used on the Vega cluster. The HDR (High Data Rate) version of InfiniBand transfers data at up to 200 Gb/s.
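
Because the rate is given in gigabits (Gb) rather than gigabytes (GB), a short calculation helps put it in perspective. The sketch below assumes the full 200 Gb/s peak rate and ignores protocol overhead, so the result is a lower bound on transfer time rather than a measured figure. The function names are illustrative.

```python
LINK_RATE_GBIT_S = 200  # HDR peak rate, in gigabits per second


def peak_gbytes_per_second(rate_gbit_s: float = LINK_RATE_GBIT_S) -> float:
    """Convert a link rate from gigabits per second to gigabytes per second."""
    return rate_gbit_s / 8


def min_transfer_seconds(size_bytes: int, rate_gbit_s: float = LINK_RATE_GBIT_S) -> float:
    """Lower bound on the time needed to move size_bytes over one link."""
    return size_bytes * 8 / (rate_gbit_s * 1e9)


print(peak_gbytes_per_second())        # peak rate in GB/s
print(min_transfer_seconds(10**12))    # seconds for 1 TB (10^12 bytes)
```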