Specification of the Resource Reservation and Backfilling
Grid Engine 6.0 scheduler enhancement

Andreas Haas

19 February 2004


0. Introduction

Adding "Resource Reservation" and "Backfilling" capabilities to the
Grid Engine 6.0 scheduler is a first step towards a look-ahead
scheduler. For the sake of orientation it is helpful to delimit the
terms "resource reservation" and "backfilling" from other terms
occasionally used to describe look-ahead scheduler capabilities in a
less specific fashion:

Resource Reservation

     A job-specific reservation created by the scheduler for pending
     jobs. During the reservation the resources are blocked for lower
     priority jobs.

Backfilling

     The process of starting lower priority jobs from the job priority
     list despite higher priority pending jobs that own a future
     reservation of the same resources. Thus backfilling has a meaning
     only in the context of resource reservation or advance
     reservation.

Advance Reservation

     A reservation (possibly independent of a particular job) that can
     be requested by a user or administrator and is created by the
     scheduler. The reservation causes the associated resources to be
     blocked for other jobs.

Preemption

     The process of interrupting job executions in order to free
     resources for particular jobs.

See under 'Discussion' below for more details on resource reservation
and backfilling as implemented with Grid Engine 6.0.


1. Acknowledgements

I gratefully acknowledge useful conversations with, and input in other
forms from, Lev Markov, Fritz Ferstl, Martin Berger and others.


2. Discussion

The 6.0 resource reservation and backfilling capabilities consider
built-in and user-defined consumable resources associated with the
cluster (aka the global host), a particular host or a particular queue
instance. To treat parallel jobs in a similar fashion, the limited
slot amount of parallel environments is also considered in reservation
and backfilling.

a) Resource Reservation

Resource reservation can be used to guarantee that resources are
dedicated to jobs in job priority order. A good example that helps to
comprehend the problem solved by resource reservation/backfilling is
the so-called "large parallel job starvation problem". In this
scenario there is one high priority pending job A (possibly parallel)
that requires a large quota of a particular resource, and a stream of
smaller, lower priority jobs B(i) requiring a smaller quota of the
same resource. Without resource reservation an assignment for A can
not be guaranteed, assuming the stream of B(i) jobs does not stop -
even if job A actually has higher priority than the B(i) jobs:

   A
   |
   +--------+--------+--------+--------+--------+--------+----------+
   |  B(0)  |  B(2)  |  B(4)  |  B(6)  |  B(8)  |  B(10) |          |
   +--------+--------+--------+--------+--------+--------+    A     |
   |  B(1)  |  B(3)  |  B(5)  |  B(7)  |  B(9)  |  B(11) |          |
   +--------+--------+--------+--------+--------+--------+----------+-->

With resource reservation job A gets a reservation that blocks the
lower priority B(i) jobs and thus guarantees that resources will be
available for A as soon as possible:

   A
   |
   +--------+----------+--------+--------+
   |  B(0)  |          |  B(2)  |  ...
   +--------+    A     +--------+--------+
   |        |          |  B(1)  |  B(3)  |  ...
   +--------+----------+--------+--------+------------------------->
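For illustration, such a job mix might be submitted as follows (a
sketch only; it assumes a parallel environment named "pvm" is
configured and uses the sleeper.sh example script as job payload):

   # high priority (possibly parallel) job A; requests a reservation
   qsub -N A -p 100 -R y -pe pvm 8 -l h_rt=3600 \
        $SGE_ROOT/examples/jobs/sleeper.sh 3600

   # stream of lower priority jobs B(i); no reservation (default -R n)
   qsub -N B -l h_rt=600 $SGE_ROOT/examples/jobs/sleeper.sh 600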
b) Backfilling

Backfilling allows jobs of lower priority to utilize resources that
are blocked due to a reservation. Backfilling can take place only if
there is a runnable job with a prospective runtime small enough to
allow the blocked resource to be used without endangering the
reservation of the higher priority job. In the example above a low
priority job C of very short duration could be started prior to job A
due to backfilling:

   A
   |
   +--------+----------+--------+--------+
   |  B(0)  |          |  B(2)  |  ...
   +---+----+    A     +--------+--------+
   | C |    |          |  B(1)  |  B(3)  |  ...
   +---+----+----------+--------+--------+------------------------->

The benefit of backfilling is improved resource utilization.

c) Tuning considerations

Since reservation scheduling requires the scheduler to look ahead, the
Grid Engine scheduler is more performance-consuming in reservation
mode than in non-reservation mode. In smaller clusters the additional
effort is certainly negligible as long as there are only a few pending
jobs. With growing cluster size, however, and in particular with a
growing number of pending jobs, the additional effort becomes
noticeable. The key to tuning resource reservation is to limit the
overall number of reservations that are made during a scheduling
interval. To accomplish this, two means are provided:

(1) To limit the absolute number of reservations made during a
    schedule interval, the 'max_reservation' parameter in the
    scheduler configuration can be used by the Grid Engine
    administrator. E.g. when max_reservation is set to 20, no more
    than 20 reservations are made within a scheduling interval, and as
    a result the additional effort for reservation scheduling is
    limited accordingly.

(2) The -R y|n submit option allows reservation scheduling to be
    restricted to those jobs that actually are critical. In the
    examples above there is no need to schedule B(i) job reservations
    just for the sake of guaranteeing the job A resource assignment.
    The only job that needs to be submitted using -R y is job A. This
    means all B(i) jobs can be submitted with -R n (the default)
    without actually endangering the reservation for A.

d) Monitoring

The discussion about tuning indicates that it can be useful for
administrators to gain a certain degree of understanding of how
scheduler decisions are actually influenced by resource reservation.
For this purpose the MONITOR setting of the 'params' scheduler
configuration parameter can be used. This causes additional
information to be provided in the 'schedule' file kept in the Grid
Engine common directory. The following example briefly introduces
scheduler monitoring.

Assume the following sequence of jobs

   qsub -N L4_RR -R y -l h_rt=30,license=4 -p 100 $SGE_ROOT/examples/jobs/sleeper.sh 20
   qsub -N L5_RR -R y -l h_rt=30,license=5 $SGE_ROOT/examples/jobs/sleeper.sh 20
   qsub -N L1_RR -R y -l h_rt=31,license=1 $SGE_ROOT/examples/jobs/sleeper.sh 20

being submitted into a cluster where the global 'license' consumable
resource is limited to 5 licenses.
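Such a setup might be prepared along the following lines (a sketch
only; the exact complex attribute columns depend on the installed
version, and both qconf -mc and qconf -msconf open an editor):

   # make 'license' a requestable consumable in the complex
   # configuration ("qconf -mc"), e.g. with an entry like:
   #
   #   license   lic   INT   <=   YES   YES   0   0

   # attach 5 licenses to the global host, limiting them cluster-wide
   qconf -mattr exechost complex_values license=5 global

   # switch on scheduler monitoring via the params line of the
   # scheduler configuration ("qconf -msconf"):
   #
   #   params   MONITOR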
Due to the use of the default priority settings in the scheduler
configuration

   weight_priority          1.000000
   weight_urgency           0.100000
   weight_ticket            0.010000

it is ensured that the -p priority of the L4_RR job overwhelms the
license-based urgency, finally resulting in a prioritization such as

   job-ID  prior    name
   ---------------------
     3127  1.08000  L4_RR
     3128  0.10500  L5_RR
     3129  0.00500  L1_RR

In this case traces of those jobs can be found in the 'schedule' file
for six schedule intervals:

   ::::::::
   3127:1:STARTING:1077903416:30:G:global:license:4.000000
   3127:1:STARTING:1077903416:30:Q:all.q@carc:slots:1.000000
   3128:1:RESERVING:1077903446:30:G:global:license:5.000000
   3128:1:RESERVING:1077903446:30:Q:all.q@bilbur:slots:1.000000
   3129:1:RESERVING:1077903476:31:G:global:license:1.000000
   3129:1:RESERVING:1077903476:31:Q:all.q@es-ergb01-01:slots:1.000000
   ::::::::
   3127:1:RUNNING:1077903416:30:G:global:license:4.000000
   3127:1:RUNNING:1077903416:30:Q:all.q@carc:slots:1.000000
   3128:1:RESERVING:1077903446:30:G:global:license:5.000000
   3128:1:RESERVING:1077903446:30:Q:all.q@es-ergb01-01:slots:1.000000
   3129:1:RESERVING:1077903476:31:G:global:license:1.000000
   3129:1:RESERVING:1077903476:31:Q:all.q@es-ergb01-01:slots:1.000000
   ::::::::
   3128:1:STARTING:1077903448:30:G:global:license:5.000000
   3128:1:STARTING:1077903448:30:Q:all.q@carc:slots:1.000000
   3129:1:RESERVING:1077903478:31:G:global:license:1.000000
   3129:1:RESERVING:1077903478:31:Q:all.q@bilbur:slots:1.000000
   ::::::::
   3128:1:RUNNING:1077903448:30:G:global:license:5.000000
   3128:1:RUNNING:1077903448:30:Q:all.q@carc:slots:1.000000
   3129:1:RESERVING:1077903478:31:G:global:license:1.000000
   3129:1:RESERVING:1077903478:31:Q:all.q@es-ergb01-01:slots:1.000000
   ::::::::
   3129:1:STARTING:1077903480:31:G:global:license:1.000000
   3129:1:STARTING:1077903480:31:Q:all.q@carc:slots:1.000000
   ::::::::
   3129:1:RUNNING:1077903480:31:G:global:license:1.000000
   3129:1:RUNNING:1077903480:31:Q:all.q@carc:slots:1.000000
   ::::::::

Each section shows all resource utilizations that were taken into
account for one schedule interval. The 'RUNNING' entries show
utilizations of jobs that were already running at the beginning of the
interval, 'STARTING' entries denote immediate utilizations that were
decided within the interval, and 'RESERVING' entries show utilizations
that are planned for the future, i.e. reservations.

The format of the schedule file is

   <jobid>        The job's id.
   <taskid>       The array task id or 1 in case of non-array jobs.
   <state>        One of RUNNING/SUSPENDED/MIGRATING/STARTING/RESERVING.
   <start_time>   Start time in seconds after 1.1.1970.
   <duration>     Assumed job duration in seconds.
   <level>        One of {P,G,H,Q} standing for {PE,Global,Host,Queue}.
   <object>       The name of the PE/global/host/queue.
   <resource>     The name of the consumable resource.
   <utilization>  The resource utilization debited for the job.

A line "::::::::" marks the beginning of a new schedule interval.

Please note that this file is not truncated. Make sure monitoring is
switched off if you have no automated procedure set up that truncates
the 'schedule' file.
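Since each record in the 'schedule' file is a plain colon-separated
line, standard tools suffice for a first analysis. For instance, the
following command (assuming the default cell name) would list all
reservations recorded in the file:

   awk -F: '$3 == "RESERVING" { printf "job %s: %s %s at %s:%s, start %s, duration %ss\n", $1, $9, $8, $6, $7, $4, $5 }' $SGE_ROOT/default/common/schedule

For the trace above this would report lines such as

   job 3128: 5.000000 license at G:global, start 1077903446, duration 30s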
3. Changes to the command line interface and configuration file formats

The changes as seen from the end user's perspective are an improved
behaviour of the "-pe parallel_environment range" submit option and a
new submit switch "-R y|n" for the commands

   qsub(1) qrsh(1) qsh(1) qlogin(1) qalter(1)

Old behaviour:

<      -pe parallel_environment n[-[m]]|[-]m,...
<
<           You can specify the PE name by using the wildcard
<           character "*", thus the request "pvm*" will match any
<           parallel environment with a name starting with the
<           string "pvm".
<
<           The range specification is a list of range expressions
<           of the form n-m (n as well as m being positive non-zero
<           integer numbers), where m is an abbreviation for m-m,
<           -m is a short form for 1-m and n- is an abbreviation
<           for n-infinity. The range specification is processed as
<           follows: The largest number of queues requested is
<           checked first. If enough queues meeting the specified
<           attribute list are available, all are allocated. The
<           next smaller number of queues is checked next and so
<           forth.

New behaviour:

>      -pe parallel_environment n[-[m]]|[-]m,...
>
>           You can specify the PE name by using the wildcard
>           character "*", thus the request "pvm*" will match any
>           parallel environment with a name starting with the
>           string "pvm".
>
>           The range specification is a list of range expressions
>           of the form n-m (n as well as m being positive non-zero
>           integer numbers), where m is an abbreviation for m-m,
>           -m is a short form for 1-m and n- is an abbreviation
>           for n-infinity. When a range specification covers
>           multiple slot numbers, all enlisted slot numbers are
>           considered valid assignments for the job and the
>           largest possible assignment is chosen for the job.
>           Similarly, in case multiple parallel environments match
>           the specified PE name and the range specification
>           covers more than one slot number, the largest possible
>           parallel environment assignment is chosen.

New switch:

>      -R y|n
>           Available for qsub, qrsh, qsh, qlogin and qalter.
>
>           Indicates whether a reservation for this job should be
>           done. Reservation is never done for immediate jobs
>           submitted using the -now yes option. Please note that
>           irrespective of the reservation request, job reservation
>           might be disabled using max_reservation in sched_conf(5)
>           and might be limited only to a certain number of high
>           priority jobs.
>
>           By default jobs are submitted with the -R n option.
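Taken together, the new -pe semantics and the -R switch might be
exercised as follows (a sketch; the parallel environment name is a
placeholder and the sleeper.sh example script serves as job payload):

   # request 4 to 16 slots from any parallel environment matching
   # "pvm*" and ask for a reservation; the scheduler chooses the
   # largest possible assignment and, if the job cannot be started
   # immediately, reserves the resources instead
   qsub -pe "pvm*" 4-16 -R y -l h_rt=3600 \
        $SGE_ROOT/examples/jobs/sleeper.sh 600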
From the administrator's point of view, two new scheduler
configuration parameters were added to sched_conf(5):

>      max_reservation
>           The maximum number of reservations scheduled within a
>           schedule interval. When a runnable job can not be started
>           due to a shortage of resources, a reservation can be
>           scheduled instead. A reservation can cover consumable
>           resources with the global host, any execution host and
>           any queue. For parallel jobs reservations are also done
>           for the 'slots' resource as specified in sge_pe(5). As
>           job runtime the maximum of the time specified with -l
>           h_rt=... or -l s_rt=... is assumed. For jobs that have
>           neither of them the default_duration is assumed.
>           Reservations prevent jobs of lower priority as specified
>           in sge_priority(5) from utilizing the reserved resource
>           quota during the time of the reservation. Jobs of lower
>           priority are allowed to utilize those reserved resources
>           only if their prospective job end is before the start of
>           the reservation (backfilling). Reservation is done only
>           for non-immediate jobs (-now no) that request reservation
>           (-R y). If max_reservation is set to "0" no job
>           reservation is done.
>
>           Note that reservation scheduling can be performance-
>           consuming and hence reservation scheduling is switched
>           off by default. Since reservation scheduling performance
>           consumption is known to grow with the number of pending
>           jobs, the use of the -R y option is recommended only for
>           those jobs actually queuing for bottleneck resources.
>           Together with the max_reservation parameter this
>           technique can be used to narrow down performance impacts.
>
>      default_duration
>           When job reservation is enabled through the
>           max_reservation sched_conf(5) parameter, the default
>           duration is assumed as runtime for jobs that have neither
>           -l h_rt=... nor -l s_rt=... specified. In contrast to an
>           h_rt/s_rt time limit the default_duration is not
>           enforced.

Another parameter was added to the scheduler configuration
sched_conf(5) to allow monitoring of resource reservation:

>      params
>           This is foreseen for passing additional parameters to the
>           Sun Grid Engine scheduler. The following values are
>           recognized currently:
>
>           MONITOR
>                If set, the scheduler records information for each
>                scheduling run that allows job resource utilization
>                to be reproduced, in the file
>                $SGE_ROOT/<cell>/common/schedule.

Yet another setting was added to the scheduler configuration
sched_conf(5) to allow modeling of the difference between net and
gross job runtimes:

>           DURATION_OFFSET
>                If set, overrides the default of 60 seconds that the
>                N1GE scheduler assumes as the offset between net job
>                runtimes and the gross time until resources become
>                available again when planning resource utilization.
>                A job's net runtime as specified with -l h_rt=...,
>                -l s_rt=... or the default_duration always differs
>                from the job's gross runtime due to delays before
>                and after the actual job start. Amongst the delays
>                before job start are the time until the end of a
>                schedule_interval, the time it takes to deliver a
>                job to sge_execd(8), and the delays that a prolog in
>                queue_conf(5), start_proc_args in sge_pe(5) and
>                starter_method in queue_conf(5) may effectuate. The
>                delays after a job's actual run comprise delays due
>                to a forced job termination (notify,
>                terminate_method or checkpointing), procedures run
>                after the actual job finish such as stop_proc_args
>                in sge_pe(5) or epilog in queue_conf(5), and the
>                delay until a new schedule_interval.
>
>                If the offset is too low, resource reservations (see
>                max_reservation) can be delayed repeatedly due to a
>                too optimistic job circulation time.
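Putting these pieces together, the reservation-related part of a
scheduler configuration tuned along the lines of this specification
might look as follows (a sketch only, shown in sched_conf(5) file
format with placeholder values; the exact params syntax should be
checked against the installed version):

   schedule_interval     0:0:15
   max_reservation       20
   default_duration      0:10:0
   params                MONITOR,DURATION_OFFSET=120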
4. Implementation

For reservation scheduling, major reconstructions were made in the
scheduling code. The new algorithm consists of parts that are used
both for sequential and parallel jobs and of other parts where the
handling of sequential and parallel jobs differs. The important steps
of all parts of the algorithm are summarized in broad terms:

a) Common part of the algorithm that is used both for sequential and
   parallel jobs

* jobs are processed in priority order to find suitable assignments
  (dispatch_jobs())

* if possible, an assignment is made that starts the job immediately;
  otherwise a reservation is made afterwards (select_assign_debit())

Past that stage the algorithm splits for sequential and parallel jobs.

b) Parts of the algorithm that are used for sequential jobs only

* for sequential job assignments the earliest job start time is
  determined for each queue instance and the earliest one is chosen;
  the secondary criterion for queue selection is the minimization of
  the job's soft request violations (sge_sequential_assignment())

* for determining the earliest start time per queue, at first the
  earliest start time due to cluster-wide (=global) resource
  availability is determined; then for each queue instance the
  earliest start time due to host resource availability and due to
  queue instance resource availability is determined
  (sequential_tag_queues_suitable4job())

* for sequential jobs, determining the earliest start time due to
  resource availability is done by ensuring that various constraints
  are fulfilled and by then determining the earliest start time at
  which the requests can be fulfilled with the resources of a resource
  container (sequential_queue_time(), sequential_global_time(),
  sequential_host_time())

* the earliest start time due to the resources of a resource container
  is the maximum of the earliest start times for all resources
  comprised by the resource container that are requested by the job
  (rc_time_by_slots())

* the earliest start time due to the availability of a single resource
  is determined by ensuring that non-consumable resource requests are
  fulfilled, or by finding the earliest time at which the utilization
  of a consumable resource is below the threshold required for the
  request (ri_time_by_slots())

* the earliest time at which a utilization is below a threshold is
  determined by searching backwards in resource utilization diagrams
  (utilization_below())

c) Parts of the algorithm that are used for parallel jobs only

* for parallel job assignments the earliest job start time among all
  parallel environments matching the parallel environment request is
  determined by determining the earliest start time with each eligible
  parallel environment; the parallel environment with the earliest
  start time is chosen. Afterwards, in case of a slot range request,
  the slot amount for the chosen parallel environment assignment is
  maximized (sge_select_parallel_environment())

* the earliest job start time for a parallel environment is determined
  by iterating through all times at which resource utilization within
  the schedule actually changes with respect to the consumable
  resources requested by the job, and by checking whether resources
  would be sufficient for the job at that particular time. As an
  optimization for the time iteration a queue end time iterator (QETI)
  is used (parallel_reservation_max_time_slots())

* checking whether resources would be sufficient at a particular time
  with a parallel environment is done by checking at first whether the
  'slots' resource of the parallel environment is sufficient and then
  whether the remaining job resource requests can be fulfilled
  (parallel_assignment())

* checking whether the remaining resources meet the requests of the
  parallel job at a particular time is done by checking whether
  cluster-wide (=global) resources are sufficient and whether the
  slots available on all eligible hosts together(!) are sufficient for
  the parallel job (parallel_tag_queues_suitable4job())

* at each host the number of slots available at a particular time for
  the parallel job is determined as the minimum of the slots available
  due to resource availability on the host on the one hand, and the
  slots available on all eligible queues together(!)
  on the other hand for the parallel job (parallel_tag_hosts_queues())

* the number of slots available for a parallel job at a particular
  time in a resource container is determined by ensuring that various
  constraints are met and by then determining the number of slots
  available at that particular time due to resource requests and
  resource availability (parallel_queue_slots(),
  parallel_host_slots(), parallel_global_slots(),
  parallel_available_slots())

* the number of slots available for a parallel job at a particular
  time in a resource container is determined as the minimum of the
  slots available for all resources that are requested by the job and
  limited in the resource container (rc_slots_by_time())

* the number of slots available with a single resource can be zero for
  static resources, or it is determined based on the maximum
  utilization within the specific time frame, the total amount of the
  resource and the per-task request of the parallel job
  (ri_slots_by_time())

* the maximum utilization within a time frame is determined based on
  the information in the resource utilization diagram of the
  consumable resource (utilization_max())
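The effect of the parallel branch of the algorithm can be observed
from the outside with the monitoring facility described under
'Monitoring' above. The following sketch (the parallel environment
name and the cell name are placeholders) submits a parallel job with a
slot range and a reservation request; with MONITOR set, the planned
utilization of parallel environment slots then shows up as 'P' level
entries in the 'schedule' file:

   # parallel job with a slot range and reservation request; if it
   # cannot be started at once, a slot assignment is reserved
   qsub -N P_RR -R y -pe "pvm*" 4-16 -l h_rt=3600 \
        $SGE_ROOT/examples/jobs/sleeper.sh 600

   # show the reservations planned for parallel environment slots
   awk -F: '$3 == "RESERVING" && $6 == "P"' $SGE_ROOT/default/common/schedule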