<div class="sright">
  <div class="sub-content-head">
    Maui Scheduler  </div>
  <div id="sub-content-rpt" class="sub-content-rpt" >
 <div class="tab-container docs" id="tab-container">
<div class="topNav">

<div class="docsSearch">
</div>

<div class="navIcons topIcons">
<a href="index.php"><img src="/resources/docs/images/home.png" title="Home" alt="Home" border="0"></a>
<a href="5.0prioritization.php"><img src="/resources/docs/images/upArrow.png" title="Up" alt="Up" border="0"></a>
<a href="5.1.5prioritymanagement.php"><img src="/resources/docs/images/prevArrow.png" title="Previous" alt="Previous" border="0"></a>
<a href="5.3nodeaccess.php"><img src="/resources/docs/images/nextArrow.png" title="Next" alt="Next" border="0"></a>
</div>

<h1>5.2 Node Allocation</h1>

<p> While job prioritization allows a site to determine 
which job to run, node allocation policies allow a site to specify how available
resources should be allocated to each job. The algorithm used is specified
by the parameter <a href="a.fparameters.php#nodeallocationpolicy">NODEALLOCATIONPOLICY</a>
. There are multiple node allocation policies to choose from allowing
selection based on reservation constraints, node configuration, available
resource constraints, and other issues. These policies can be specified
with a system wide default value and on a per job override basis.  The following
algorithms are available and described in detail below: <a href="#FIRSTAVAILABLE">
FIRSTAVAILABLE</a>
, <a href="#LASTAVAILABLE">LASTAVAILABLE</a>
, <a href="#PRIORITY">PRIORITY</a>
, <a href="#CPULOAD">CPULOAD</a>
, <a href="#MINRESOURCE">MINRESOURCE</a>
, <a href="#CONTIGUOUS">CONTIGUOUS</a>
, <a href="#MAXBALANCE">MAXBALANCE</a>
, <a href="#FASTEST">FASTEST</a>
, and <a href="#LOCAL">LOCAL</a>.
<!-- Additional load allocation polices such as may be enabled through
extension libraries such as G2. Documentation for the extension library
of interest should be consulted.--> </p>

<ul>
   <li><a href="#node">5.2.1 Node Allocation Overview</a>
   <li><a href="#resource">5.2.2 Resource Based Algorithms</a>
   <ul>
      <li><a href="#CPULOAD">5.2.2.1 CPULOAD</a>
      <li><a href="#FIRSTAVAILABLE">5.2.2.2 FIRSTAVAILABLE</a>
      <li><a href="#LASTAVAILABLE">5.2.2.3 LASTAVAILABLE</a>
      <li><a href="#PRIORITY">5.2.2.4 PRIORITY</a>
      <li><a href="#MINRESOURCE">5.2.2.5 MINRESOURCE</a>
      <li><a href="#CONTIGUOUS">5.2.2.6 CONTIGUOUS</a>
      <li><a href="#MAXBALANCE">5.2.2.7 MAXBALANCE</a>
      <li><a href="#FASTEST">5.2.2.8 FASTEST</a>
      <li><a href="#LOCAL">5.2.2.9 LOCAL</a>
   </ul>
   <li><a href="#time">5.2.3 Time Based Algorithms</a>
   <li><a href="#LOCAL">5.2.4 Locally Defined Algorithms</a>
   <li><a href="#pref">5.2.5 Specifying 'Per Job' Resource Preferences</a>
</ul>

<p>
<img src="/images/logo1.gif" height=24> Node allocation policy, along with many many of settings can be set graphically with the <a href="../mcm/index.php">Moab Cluster Manager</a><sup>TM</sup>.

<p><a name="node"></a>
<h2>5.2.1 Node Allocation Overview</h2>
 </p>
<p> Node allocation is the process of selecting the best
resources to allocate to a job from a list of available resources. Making
this decision intelligently is important in an environment which possesses
one or more of the following attributes: </p>
<p> - heterogeneous resources (resources
which vary from node to node in terms of quantity or quality)</p>
<p> - shared nodes (nodes may be utilized
by more than one job)</p>
<p> - reservations or service guarantees
<br>
</p>
<p> - non-flat network (a network in
which a perceptible performance degradation may potentially exist depending
on workload placement)</p>
<p><br>
<h3>5.2.1.1 Heterogeneous Resources</h3>
 </p>
<p> If the available compute resources
have differing configurations, and a subset of the submitted jobs cannot
run on all of the nodes, then allocation decisions can significantly affect
scheduling performance. For example, a system may be comprised of two
nodes, A and B, which are identical in all respects except for RAM, possessing
256MB and 1GB of RAM respectively. Two single processor jobs, X and
Y, are submitted, one requesting at least 512 MB of RAM, the other, at least
128 MB. The scheduler could run job X on node A in which case job Y
would be blocked until job X completes. A more intelligent approach
may be to allocate node B to job X because it has the fewest available resources
yet still meets the constraints. This is somewhat of a 'bestfit' approach
in the configured resource dimension and is essentially what is done by the
'MINRESOURCE' algorithm.</p>
<p><br>
<h3>5.2.1.2 Shared Nodes</h3>
<p> Shared node systems are most 
often involve SMP nodes although this is not mandatory. Regardless, 
when sharing the resources of a given node amongst tasks from more than one
job, resource contention and fragmentation issues arise. </p>
<p> Most current systems still do
not do a very good job of logically partitioning the resources (i.e., CPU,
Memory, network bandwidth, etc.) available on a given node. Consequently
contention often arises between tasks of independent jobs on the node.
This can result in a slowdown for all jobs involved which can have significant
ramifications if large way parallel jobs are involved. </p>
<p> On large way SMP systems (i.e.,
&gt; 32 processors/node), job packing can result in intra-node fragmentation. 
For example, again take two nodes, A and B each with 64 processors. 
Assume they are currently loaded with various jobs and have 24 and 12 processors 
free respectively. Two jobs are submitted, Job X requesting 10 processors, 
and job Y requesting 20 processors. Job X can start on either node but
starting it on node A will prevent job Y from running. An algorithm 
to handle intra-node fragmentation is pretty straightforward for a single 
resource case, but what happens when jobs request a combination of processors, 
memory, and local disk. Determining the correct node suddenly gets significantly
more complex<!--difficult. Algorithms to handle these type of issues are currently
available in the G2 extension library.--></p>

<h3>5.2.1.3 Reservations or Service Guarantees</h3>
<p> A reservation based system adds
the time dimension into the node allocation decision. With reservations, 
node resources must be viewed in a type of two dimension 'node-time' space. 
Allocating nodes to jobs fragments this node-time space and makes it more 
difficult to schedule jobs in the remaining, more constrained node-time slots.
Allocation decisions should be made in such a way as top minimize this fragmentation
and maximize the schedulers ability to continue to start jobs in existing
slots. See the figure to hopefully remove a small amount of the incoherency
contained in the above sentences. In this figure, Job A and job B are
already running. A reservation, X, has been created some time in the
future. Assume that job A is 2 hours long and job B is 3 hours long.
Again, two new single processor jobs are submitted, C and D; job C requires
3 hours of compute time while job D requires 5 hours. Either job will
just fit in the free space located above Job A or in the free space located
below job B. If job C is placed above Job A, job D, requiring 5 hours
of time will be prevented from running by the presence of reservation X.
However, if job C is placed below job B, job D can still start immediately
above Job A.<img src="/images/nodespace.gif" border="2" height="146" width="271" align="Right">
Hopefully, this canned example demonstrates the importance of time based
reservation information in making node allocation decisions, both at the
time of starting jobs, and at the time of creating reservations. The
impact of time based issues grows significantly with the number of reservations
in place on a given system. The LASTAVAILABLE algorithm works on this
premise, locating resources which have the smallest space between the end
of a job under consideration and the start of a future reservation.</p>

<h3>5.2.1.4 Non-flat Network</h3>
<p> On systems where network connections
do not resemble a flat 'all-to-all' topology, the placement of tasks may
present a significant impact on the performance of communication intensive
parallel jobs. If latencies and bandwidth of the network between any
two nodes vary significantly, the node allocation algorithm should attempt
to pack tasks of a given job as close to each other as possible to minimize
the impact of these bandwidth and latency differences. </p>
<a name="resource"></a>
<h2>5.2.2 Resource Based Algorithms</h2>
<p> Maui contains a number of allocation algorithms which 
address some of the needs described above. Additional 'homegrown' allocation
algorithms may also be created and interfaced into the Maui scheduling system.
The current suite of algorithms is described below. </p>
<a name="CPULOAD"></a>
  <h3>5.2.2.1 CPULOAD</h3>
  <p> Nodes are selected which have
the maximum amount of available, unused cpu power, i.e. &lt;#of CPU's&gt; 
- &lt;CPU load&gt;. Good algorithm for timesharing node systems. 
This algorithm is only applied to jobs starting immediately. For the
purpose of future reservations, the MINRESOURCE algorithm is used. <br>
 </p>

  <a name="FIRSTAVAILABLE"></a>
  <h3>5.2.2.2 FIRSTAVAILABLE</h3>
  <p><b></b> Simple first come, 
first server algorithm where nodes are allocated in the order they are presented
by the resource manager. This is a very simple, and very fast algorithm. 
  <br>
 </p>

  <a name="LASTAVAILABLE"></a>
  <h3>5.2.2.3 LASTAVAILABLE</h3>
  <p><b></b> Algorithm which selects 
resources so as to minimize the amount of time after the job and before the
the trailing reservation. This algorithm is a 'best fit in time' algorithm
which minimizes the impact of reservation based node-time fragmentation. 
It is useful in systems where a large number of reservations (job, standing, 
or administrative) are in place. <br>
 </p>
  <a name="PRIORITY"></a>
  <h3>5.2.2.4 PRIORITY</h3>
  <p><b></b> This algorithm allows 
a site to specify the priority of various static and dynamic aspects of compute
nodes and allocate them accordingly. It is highly flexible allowing
node attribute and usage information to be combined with reservation affinity.
Using node allocation priority, the following priority components can be
specified: <br>
 
  <table border="1" width="100%" nosave="">
 <tbody>
      <tr>
 <td><b>Component Name</b></td>
  <td><b>Description</b></td>
 </tr>
  <tr>
 <td><b>ADISK</b></td>
  <td>local disk currently available to batch jobs</td>
 </tr>
  <tr>
 <td><b>AMEM</b></td>
  <td>real memory currently available to batch jobs</td>
 </tr>
  <tr>
 <td><b>APROCS</b></td>
  <td>processors currently available to batch jobs (configured procs - dedicated 
procs)</td>
 </tr>
  <tr>
 <td><b>ASWAP</b></td>
  <td>virtual memory currently available to batch jobs</td>
 </tr>
  <tr>
 <td><b>CDISK</b></td>
  <td>total local disk allocated for use by batch jobs</td>
 </tr>
  <tr>
 <td><b>CMEM</b></td>
  <td>total real memory on node</td>
 </tr>
  <tr>
 <td><b>CPROCS</b></td>
  <td>total processors on node</td>
 </tr>
  <tr>
 <td><b>CSWAP</b></td>
  <td>total virtually memory configured on node</td>
 </tr>
  <tr>
 <td><b>JOBCOUNT</b></td>
  <td>number of jobs currently running on node</td>
 </tr>
  <tr>
 <td><b>LOAD</b></td>
  <td>current 1 minute load average</td>
 </tr>
  <tr>
        <td valign="Top"><b>PREF</b><br>
        </td>
        <td valign="Top">node meets job specific resource preferences<br>
        </td>
      </tr>
      <tr>
 <td><b>PRIORITY</b></td>
  <td>admin specified node priority</td>
 </tr>
  <tr>
 <td><b>RESAFFINITY</b></td>
  <td>reservation affinity for job being evaluated</td>
 </tr>
  <tr>
 <td><b>SPEED</b></td>
  <td>if set, node 'procspeed'. otherwise, relative node 'speed'</td>
 </tr>
  <tr>
 <td><b>USAGE</b></td>
  <td>percentage of time node has been running batch jobs since the last statistics
initialization</td>
 </tr>
 
    </tbody>
  </table>
  </p>
  <p> The node allocation priority function can be specified 
on a node by node or cluster wide basis. In both cases, the recommended 
approach is to specify the <b>PRIORITYF</b> attribute with the <b><a href="a.fparameters.php#nodecfg">
NODECFG</a>
  </b> parameter. A few examples follow. </p>
  <p>Example 1: Favor the fastest nodes with the most available memory 
which are running the fewest jobs. </p>
  <p><tt><font size="-1">NODEALLOCATIONPOLICY PRIORITY<br>
NODECFG[DEFAULT] PRIORITYF='SPEED+ .01 * AMEM - 10 * JOBCOUNT'</font></tt> </p>
  <p>Example 2: A site has a batch system consisting of two dedicated 
'batchX' nodes, as well as numerous desktop systems. The allocation 
function should favor batch nodes first, followed by desktop systems which 
are the least loaded and have received the least historical usage. </p>
  <p><tt><font size="-1">NODEALLOCATIONPOLICY PRIORITY<br>
NODECFG[DEFAULT] PRIORITYF='-LOAD - 5*USAGE'</font></tt> <br>
  <tt><font size="-1">NODECFG[batch1] PRIORITY=1000 PRIORITYF='PRIORITY +
APROCS'</font></tt> <br>
  <tt><font size="-1">NODECFG[batch2] PRIORITY=1000 PRIORITYF='PRIORITY +
APROCS'</font></tt> </p>
  <p>Example 3: Pack tasks onto loaded nodes first. </p>
  <p><tt><font size="-1">NODEALLOCATIONPOLICY PRIORITY<br>
NODECFG[DEFAULT] PRIORITYF=JOBCOUNT</font></tt> </p>

<table class="note">
  <tbody>
    <tr>
      <td class="noteIMG"><img src="/resources/docs/images/note.png" title="Note" alt="Note"></td>
      <td class="noteDetail">As in the example above, if spaces are placed within 
the priority function for readability, the priority function value will need
to be quoted to allow proper parsing.</td>
    </tr>
  </tbody>
</table>
  <a name="MINRESOURCE"></a>
  <h3>5.2.2.5 MINRESOURCE</h3>
  <p> This algorithm priorities 
nodes according to the configured resources on each node. Those nodes 
with the fewest configured resources which still meet the job's resource constraints
are selected. <br>
 </p>
  <a name="CONTIGUOUS"></a>
  <h3>5.2.2.6 CONTIGUOUS</h3>
  <p><b></b> This algorithm will 
allocate nodes in contiguous (linear) blocks as required by the Compaq RMS
system. <br>
 </p>
  <a name="MAXBALANCE"></a>
  <h3>5.2.2.7 MAXBALANCE</h3> 
  <p> This algorithm will attempt 
to allocate the most 'balanced' set of nodes possible to a job. In most
cases, but not all, the metric for balance of the nodes is node speed. 
Thus, if possible, nodes with identical speeds will be allocated to the job.
If identical speed nodes cannot be found, the algorithm will allocate the
set of nodes with the minimum node speed 'span' or range. <br />
 </p>
  <a name="FASTEST"></a>
  <h3>5.2.2.8 FASTEST</h3>
  <p> This algorithm will select 
nodes in 'fastest node first' order. Nodes will be selected by node 
  <i>speed</i> if specified. If node speed is not specified, nodes
will be selected by processor speed. If neither is specified, nodes
will be selected in a random order. <br>
 </p>
  <a name="LOCAL"></a>
  <h3>5.2.2.9 LOCAL</h3> 
  <p><b> </b>This will call the 
locally created 'contrib' node allocation algorithm. </p>
  <p><b>See also</b> </p>
  <p><b> </b>N/A.</p>

 <a name="time"></a>
  <h2>5.2.3 Time Based Algorithms</h2>
 
  <p> Under Construction </p>
  <p><a name="locally"></a>
  <h2>5.2.4 Locally Defined Algorithms</h2>
 </p>
  <p> Under Construction</p>
  <h2><a name="pref"></a>
5.2.5 Specifying 'Per Job' Resource Preferences</h2>
  <p> While the resource based node allocation algorithms
can make a good guess at what compute resources would best satisfy a job,
sites often possess a subset of jobs which benefit from more explicit resource
allocation specification. For example one job may perform best on a
particular subset of nodes due to direct access to a tape drive, another
may be very memory intensive. Resource preferences are distinct from
node requirements. While the former describes what a job needs to run
at all, the latter describes what the job needs to run well. In general,
a scheduler must satisfy a job's node requirement specification, and then,
as best possible, should satisfy the job's resource preferences.</p>
 <h3>5.2.5.1  Specifying Resource Preferences</h3>
  <p> A number of resources managers natively support the
concept of resource preferences (ie, Loadleveler). When using these
systems, the language specific preferences keywords may be used. For
systems which do not support resource preferences natively, Maui provides
a resource manager extension keyword, '<b>PREF</b>' which may be utilized
to specify desired resources. This extension allows specification of node
features, memory, swap, and disk space conditions which define whether or
not the node is considered to be 'preferred. (NOTE: Maui 3.2.5
only supports feature based preferences)<br>
  </p>
<h3>5.2.5.2  Selecting 'Preferred' Resources</h3>
  </p>
  <p> Enforcing resource preferences is not completely
straightforward. A site may have a number of potentially conflicting
desires which the scheduler is asked to simultaneously satisfy. For
example, a scheduler may be asked to maximize the proximity of the allocated
nodes at the same time it is supposed to satisfy resource preferences and
minimize node overcommitment. To allow site specific 'weighting' of
these varying desires, Maui allows resources preferences to be enabled through
the '<a href="#PRIORITY">Priority</a>
' node allocation algorithm. For example, to utilize resource preferences
together with node load, the following configuration might be used:</p>
  <p>-----<br>
  </p>
  <pre>NODEALLOCATIONPOLICY PRIORITY<br>NODECFG[DEFAULT]     PRIORITYF='5 * PREF - LOAD'</pre>
  <p>-----</p>
  <p> To request specific resource preferences, a user could then submit
a job indicating those preferences. In the case of a PBS job, the following
might work:</p>
  <p>----<br>
qsub -l nodes=4,walltime=1:00:00 -W x=PREF(FEATURE:FAST,FEATURE:TAPE)<br>
----</p>

<div class="navIcons bottomIcons">
<a href="index.php"><img src="/resources/docs/images/home.png" title="Home" alt="Home" border="0"></a>
<a href="5.0prioritization.php"><img src="/resources/docs/images/upArrow.png" title="Up" alt="Up" border="0"></a>
<a href="5.1.5prioritymanagement.php"><img src="/resources/docs/images/prevArrow.png" title="Previous" alt="Previous" border="0"></a>
<a href="5.3nodeaccess.php"><img src="/resources/docs/images/nextArrow.png" title="Next" alt="Next" border="0"></a>
</div>


 </div>
 </div>
  </div>
  <div class="sub-content-btm"></div>
</div>
</div>