The resources on HPC Vega are meant to be shared fairly among all projects and users. Therefore, every participating user is subject to the same rules regarding the availability of resources and job waiting times. This system is in place to encourage users to balance their resource usage over time and to de-prioritize users who do not. Thus no user gets to dominate the cluster.
Slurm recomputes job priorities regularly so that they reflect the continuously changing situation. For instance, if the priority is configured to take into account the user's past usage of the cluster, one user's running jobs lower the priority of that user's pending jobs.
There are multiple factors that influence job priority. The job's priority at any given time will be a weighted sum of all enabled factors.
The factors that define job priority are (ranked by their priority weight):
Fairshare: The largest factor in determining priority on HPC Vega. It is a measure of the user's past usage of the cluster: it tracks the difference between the portion of the computing resources that has been promised and the amount of resources that the user has actually consumed. Users with a larger Fair Share score receive a higher priority in the queue. Consuming resources faster than your expected rate of usage will usually cause your Fair Share score to decrease, and the more extreme the overuse, the more severe the likely drop. Conversely, using less than your expected share tends to raise the score.
TRES factors: Each TRES (trackable resource) type has its own priority factor for a job, which represents the amount of that TRES type requested or allocated in a given partition.
Job size: The job size factor correlates to the number of nodes or CPUs the job has requested. Currently jobs are not prioritized based on their size.
Partition: The partition to which a job is submitted, specified with the --partition submission parameter. Each partition can be assigned an integer priority. The larger the number, the greater the job priority will be for jobs that request to run in this partition. Currently no partitions are prioritized.
QOS: A quality of service associated with the job, specified with the --qos submission parameter. Each QOS can be assigned an integer priority. The larger the number, the greater the job priority will be for jobs that request this QOS.
Job age: How long the job has been waiting in the queue. The age factor represents the length of time a job has been sitting in the queue and eligible to run. In general, the longer a job waits in the queue, the larger its age factor grows. However, the age factor for a dependent job will not change while it waits for the job it depends on to complete.
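To illustrate the Fairshare item above, here is a minimal Python sketch of how a score can fall as usage exceeds the promised share. It assumes the classic Slurm fair-share formula, factor = 2^(-usage/share), without a damping factor; HPC Vega may be configured with a different algorithm, so treat the numbers as illustrative only. The function name fairshare_factor and its arguments are hypothetical.

```python
def fairshare_factor(usage_fraction: float, share_fraction: float) -> float:
    """Illustrative fair-share factor in the range (0, 1].

    Assumption: classic Slurm formula 2 ** -(usage / share), no damping.
    usage_fraction: portion of the cluster's recent usage consumed by the user.
    share_fraction: portion of the cluster promised to the user.
    """
    if share_fraction <= 0:
        # No promised share: treat the user as having the lowest factor.
        return 0.0
    return 2.0 ** (-(usage_fraction / share_fraction))


# A user who was promised 25% of the cluster:
unused = fairshare_factor(0.0, 0.25)     # nothing consumed yet
on_target = fairshare_factor(0.25, 0.25) # consumed exactly the promised share
overused = fairshare_factor(0.5, 0.25)   # consumed twice the promised share
print(unused, on_target, overused)
```

Under this assumed formula, the factor starts at 1.0 for an unused share, halves once usage matches the promised share, and keeps shrinking as overuse grows.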
Putting the factors together, the job priority is calculated as:
Job_priority =
    (PriorityWeightFairshare) * (fair-share_factor) +
    (PriorityWeightAge) * (age_factor) +
    (PriorityWeightJobSize) * (job_size_factor) +
    (PriorityWeightPartition) * (partition_factor) +
    (PriorityWeightQOS) * (QOS_factor) +
    SUM(TRES_weight_cpu * TRES_factor_cpu,
        TRES_weight_<type> * TRES_factor_<type>,
        ...)
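The weighted sum above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name job_priority, the dictionary keys, and every weight and factor value below are made up for the example and are not the values configured on HPC Vega. The job size and partition weights are set to 0 to mirror the statement that these factors are currently not prioritized.

```python
def job_priority(weights, factors, tres_weights=None, tres_factors=None):
    """Weighted sum of priority factors (each factor is expected in 0.0..1.0)."""
    tres_weights = tres_weights or {}
    tres_factors = tres_factors or {}
    # Weight * factor for each non-TRES component; a missing factor counts as 0.
    base = sum(w * factors.get(name, 0.0) for name, w in weights.items())
    # Same for each TRES type (cpu, mem, ...).
    tres = sum(w * tres_factors.get(t, 0.0) for t, w in tres_weights.items())
    return base + tres


# Hypothetical weights; real values come from the cluster's Slurm configuration.
weights = {"fairshare": 100000, "age": 10000, "job_size": 0,
           "partition": 0, "qos": 50000}
# Hypothetical factors for one pending job.
factors = {"fairshare": 0.5, "age": 0.25, "job_size": 1.0,
           "partition": 1.0, "qos": 0.5}

priority = job_priority(weights, factors,
                        tres_weights={"cpu": 1000},
                        tres_factors={"cpu": 0.25})
print(priority)  # 50000 + 2500 + 0 + 0 + 25000 + 250 = 77750.0
```

Because a factor with weight 0 contributes nothing, the job size and partition factors have no effect on the result here, whatever their values.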