Execd - the execution daemon
The Execution Daemon (execd) is the instance in Grid Engine that
-
starts jobs,
-
controls jobs (e.g. suspends or unsuspends a job, or reprioritizes the
processes associated with a job),
-
gathers information about jobs (e.g. resource usage and exit code), and
-
gathers information about the execution host it controls (e.g. load and
free memory).
There is one execd on each host of a cluster.
Process flow
When execd starts up, the following actions are taken:
-
General initializations
(source/daemons/execd/execd.c: main)
-
Connect to commd.
(source/libs/comm/commlib.c: enroll)
-
Try to contact qmaster and register with qmaster.
(source/daemons/execd/setup_execd.c: sge_setup_sge_execd)
If the execd cannot contact qmaster, it continues and retries contacting
qmaster at regular intervals.
-
Look for old jobs (jobs that were started before execd was shut down).
(source/daemons/execd/setup_execd.c: sge_setup_old_jobs)
-
Establish process control for still running jobs.
(source/daemons/execd/execd_ck_to_do.c: register_at_ptf)
-
Cleanup finished jobs and report them to qmaster.
(source/daemons/execd/reaper_execd.c: clean_up_old_jobs)
After startup, execd enters its main loop where it
-
receives requests,
-
processes requests, and
-
sends reports in regular intervals.
PDC - the Portable Data Collector
The PDC is a module inside the execd that collects information about running
jobs, such as CPU usage and memory consumption.
Data is collected for all processes of a job on the basis of a criterion
unique to the job. On systems that support some sort of jobid for a hierarchy
of processes, this jobid is used. On all other systems, an additional
group id (gid) is created on behalf of each job and used instead.
The jobid / additional group id is attached to the root process of a
job and is inherited by its child processes.
The PDC is implemented in source/daemons/pdc.c.
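The grouping criterion described above can be illustrated with a small
sketch. It works on made-up process records instead of real operating system
data; the record fields, the gid and job numbers, and the function name
usage_by_job are assumptions made for this illustration and are not taken
from the PDC source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessSample:
    """One observed process (hypothetical record, not a PDC structure)."""
    pid: int
    gids: frozenset      # all group ids attached to the process
    cpu_seconds: float   # CPU time consumed so far
    rss_kb: int          # resident memory in kilobytes


def usage_by_job(samples, job_of_gid):
    """Sum the usage of all processes carrying a job's additional group id.

    job_of_gid maps each additional group id to the job it was created for.
    Processes that carry none of these ids belong to no job and are skipped.
    """
    totals = {}
    for proc in samples:
        for gid in proc.gids:
            job = job_of_gid.get(gid)
            if job is None:
                continue
            entry = totals.setdefault(
                job, {"cpu_seconds": 0.0, "rss_kb": 0, "pids": []})
            entry["cpu_seconds"] += proc.cpu_seconds
            entry["rss_kb"] += proc.rss_kb
            entry["pids"].append(proc.pid)
    return totals


# Job 42 was given the additional group id 20001 (both numbers are made up).
# Processes 100 and 101 carry that id; process 200 does not.
samples = [
    ProcessSample(pid=100, gids=frozenset({500, 20001}), cpu_seconds=1.5, rss_kb=2048),
    ProcessSample(pid=101, gids=frozenset({500, 20001}), cpu_seconds=2.5, rss_kb=1024),
    ProcessSample(pid=200, gids=frozenset({500}), cpu_seconds=9.0, rss_kb=4096),
]
print(usage_by_job(samples, {20001: 42}))
```

Because the additional group id is inherited by child processes, a lookup of
this kind finds every process of a job without tracking parent/child
relationships.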
PTF - the Priority Translation Facility
In SG3E mode (product mode sgeee), Grid Engine provides a share-based
scheduler. Each job gets a certain share of the system resources.
Different mechanisms (policies) exist to assign shares to a job.
The sum of all shares for a job is expressed in so-called tickets: a job
has a certain number of tickets, enabling it to run with certain process
priorities.
If multiple jobs are running concurrently on a host, their different
shares of system resources, i.e. their different numbers of tickets, can be
mapped to priorities in the operating system.
Setting priorities in the operating system is done by either setting
the nice value for all processes of a job or by using special priority
mapping facilities provided by the underlying operating system.
Grid Engine reassigns the number of tickets per job at regular intervals.
The PTF then maps the number of tickets of a job to nice values (or another
operating system priority representation) and renices all processes of
the job.
Like the PDC, the PTF uses the jobid / additional group id to capture
all processes of a job.
The PTF is implemented in source/daemons/execd/ptf.c.
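As an illustration of mapping tickets to nice values, the following sketch
uses a simple linear scale. The scale, the nice range 0 to 19, and the
function name tickets_to_nice are assumptions chosen for this example; this
document does not describe the formula the PTF actually uses.

```python
def tickets_to_nice(tickets, best_nice=0, worst_nice=19):
    """Map each job's ticket count to a nice value.

    Assumption for illustration: a linear scale on which the job holding
    the most tickets gets best_nice and a job without tickets gets
    worst_nice. Integer arithmetic keeps the result deterministic.
    """
    most = max(tickets.values(), default=0)
    span = worst_nice - best_nice
    if most <= 0:
        # No job holds any tickets: treat all jobs alike.
        return {job: best_nice for job in tickets}
    return {job: worst_nice - (count * span) // most
            for job, count in tickets.items()}


# Three concurrently running jobs with made-up ticket counts.
print(tickets_to_nice({"job_1": 1000, "job_2": 500, "job_3": 0}))
```

Applying the result would mean renicing every process of each job, for
example with os.setpriority on Unix systems. That step is left out here so
that the sketch has no side effects.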
Requests to execd
Execd requests are specified by a request tag (e.g. TAG_JOB_EXECUTION).
For incoming requests a mapping is done from a request tag to a callback
function that processes the request.
Execd accepts and processes the following requests:
-
Execute a job (TAG_JOB_EXECUTION)
If a request to execute a job is received from the qmaster, the job
is spooled to disk and started via a shepherd process - see the shepherd
documentation.
During the job's runtime, the processes of the job can be monitored
and controlled by the PDC and PTF modules of the execd.
After the job has finished, all relevant information about the job is
gathered and the job end is reported to the qmaster.
The function execd_job_exec in source/daemons/execd/execd_job_exec.c
processes this type of request.
-
Execute a task inside a parallel job (TAG_SLAVE_ALLOW)
The model of parallel jobs in Grid Engine provides the concept of a
tight integration of a parallel job's tasks in Grid Engine. In this tight
integration, the tasks of a parallel job are under full control of Grid
Engine.
Tasks can be started with the qrsh binary (qrsh -inherit).
qrsh -inherit itself contacts Execd using the GDI function
sge_qrexec().
Like a single job, a task is started via a shepherd process and can
be monitored and controlled by PDC and PTF.
After a task finishes, all relevant information about the task is gathered
and the task end is reported to the qmaster.
The function execd_job_slave in source/daemons/execd/execd_job_exec.c
processes this type of request.
-
Assign tickets to a running job, reprioritize job (TAG_CHANGE_TICKET)
At regular intervals, the number of tickets is reassigned to each running
job. The qmaster reports the number of tickets to the execds.
The number of tickets is mapped to an operating system nice value or
another priority representation provided by the operating system, and all
processes of the job are reprioritized accordingly.
The function execd_ticket in source/daemons/execd/execd_ticket.c
processes this type of request.
-
Acknowledgement from qmaster for a previously sent job report (TAG_ACK_REQUEST)
After a job or a task finishes, the execd reports this as a job report
to the qmaster. The qmaster must acknowledge a job report; if no
acknowledgement arrives at the execd within a certain interval, the job
report is resent.
The function execd_c_ack in source/daemons/execd/job_report_execd.c
processes this type of request.
-
Signal all jobs in a queue (TAG_SIGQUEUE)
The qmaster asks the execd to send a certain signal to all jobs in
a certain queue. This, for example, can be triggered by suspending the
queue.
The execd signals the process group of each job in the queue.
The function execd_signal_queue in source/daemons/execd/execd_signal_queue.c
processes this type of request.
-
Signal a job (TAG_SIGJOB)
The qmaster can ask the execd to send a certain signal to a single
job, for example, when the job is to be suspended.
The execd will signal the process group of this job.
This request is also processed by the function execd_signal_queue
in source/daemons/execd/execd_signal_queue.c.
-
Shutdown (TAG_KILL_EXECD)
Tells the execd to do a clean shutdown.
The function execd_kill_execd in source/daemons/execd/execd_kill_execd.c
processes this type of request.
-
Activate/deactivate certain features, e.g. job reprioritization via the PTF
(TAG_NEW_FEATURES)
The function execd_new_features in source/daemons/execd/execd_kill_execd.c
processes this type of request.
-
Configuration changed (TAG_GET_NEW_CONF)
If the cluster configuration (either the global configuration or that of a
specific host) is changed, the qmaster notifies all affected hosts about the
configuration change.
The function execd_get_new_conf in source/daemons/execd/execd_get_new_conf.c
processes this type of request.
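The mapping from request tags to callback functions can be pictured as a
lookup table. The sketch below is only an illustration: the tag names are the
ones listed above, but the table, the callback bodies, and the payload
formats are placeholders and do not reflect the execd source.

```python
def job_execution(payload):
    """Placeholder for spooling a job and starting it via a shepherd."""
    return f"start job {payload}"


def change_ticket(payload):
    """Placeholder for mapping tickets to a priority and renicing the job."""
    return f"reprioritize job {payload['job']} with {payload['tickets']} tickets"


def kill_execd(payload):
    """Placeholder for a clean shutdown."""
    return "shut down"


# Request tag -> callback. Only three of the tags listed above are wired up.
CALLBACKS = {
    "TAG_JOB_EXECUTION": job_execution,
    "TAG_CHANGE_TICKET": change_ticket,
    "TAG_KILL_EXECD": kill_execd,
}


def dispatch(tag, payload=None):
    """Look up the callback registered for a request tag and run it."""
    callback = CALLBACKS.get(tag)
    if callback is None:
        return f"no callback registered for {tag}"
    return callback(payload)


print(dispatch("TAG_JOB_EXECUTION", 42))
print(dispatch("TAG_CHANGE_TICKET", {"job": 42, "tickets": 500}))
```

With a table like this, supporting a new request type means adding one
entry; the code that receives requests does not change.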
Reports from execd to qmaster
The execd sends reports to the qmaster at regular intervals. These reports
contain
-
Load values
All load values collected by the load sensor(s) of an execd in a load
report interval are sent to the qmaster in one report message - see also
man page sge_conf(5).
-
Job reports
Job reports are created during a job's runtime by the PDC and report the
job's resource consumption accumulated so far. A job report is also created
when a job finishes, to report the final resource consumption. Multiple
job reports are collected and sent to the qmaster in one report message.
Job reports for tasks of parallel jobs are sent to the qmaster immediately
(see source/daemons/execd/reaper_execd.c: the variable flush_jr
defines whether a job report is sent immediately or with the report interval).
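The difference between collected and immediately sent job reports can be
sketched as a small queue. Everything in this sketch is an assumption made
for illustration: the class and method names, the use of plain numbers as
timestamps, and the simplification that every report stays queued until it
is acknowledged (loosely following the TAG_ACK_REQUEST description). The
flush argument merely plays the role that flush_jr is described as having.

```python
class JobReportQueue:
    """Collects job reports and decides when they are sent (illustrative)."""

    def __init__(self, report_interval):
        self.report_interval = report_interval
        self.pending = {}      # job id -> latest report, kept until acknowledged
        self.flush = False     # True if a report asked for immediate sending
        self.last_sent = 0.0

    def add(self, job_id, usage, flush=False):
        """Queue a report; flush=True requests sending without delay."""
        self.pending[job_id] = usage
        self.flush = self.flush or flush

    def poll(self, now):
        """Return the reports to send at time now, or [] if none are due."""
        due = self.flush or now - self.last_sent >= self.report_interval
        if not due or not self.pending:
            return []
        self.flush = False
        self.last_sent = now
        return sorted(self.pending.items())

    def acknowledge(self, job_id):
        """Drop a report once it has been acknowledged."""
        self.pending.pop(job_id, None)


queue = JobReportQueue(report_interval=40)
queue.add(1, {"cpu": 1.0})
print(queue.poll(now=10))               # interval not yet over: nothing sent
queue.add(2, {"cpu": 0.5}, flush=True)  # e.g. a task of a parallel job
print(queue.poll(now=11))               # both reports go out in one message
```

An unacknowledged report stays in the queue, so it goes out again with the
next message; this is the simplest way to model the resend behaviour.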
The load sensor interface
A load sensor is a module that retrieves host-specific values and
passes them to the execd.
The execd reports these host-specific values, called load values
in the following text, to the qmaster.
The execd contains a load sensor for the common host characteristics
like load, total memory, free memory, total swap, free swap etc.
The file doc/load_parameters.asc contains a detailed
description of all load values including platform dependencies.
Load values are retrieved by the (platform dependent)
functions get_load_avg and get_cpu_load in the file
source/libs/uti/sge_getloadavg.c.
Memory load values are retrieved by the (platform
dependent) function load_mem in the file source/libs/uti/sge_loadmem.c.
In addition, there exists an interface to integrate one or multiple
external load sensors into the execd - see man page sge_conf(5).
This is done, for example, to integrate license counters from licensing
systems into Grid Engine or to provide additional host characteristics
to Grid Engine that are not handled by the built-in load sensor.
An external load sensor can be any executable, such as a binary, a shell
script, or a Perl script.
It can be configured in the (host specific) cluster configuration by
setting the parameter load_sensor - see man page sge_conf(5)
- and is started by the execd as a child process (see function sge_ls_start_ls
in file source/daemons/execd/sge_load_sensor.c).
Multiple load sensors can be started by one execd.
A load sensor gets commands from execd on stdin and has to report the
load values on stdout. It has to implement the following protocol:
Commands from execd
-
Retrieve and send load values
At a regular interval, defined as load_report_interval in the
cluster configuration, the execd asks the load sensor to retrieve current
load values and send them back to the execd.
The execd sends a single linefeed (\n) to trigger this action.
-
Shutdown
The execd can tell the load sensor to shut down. To do so, it sends the
command quit followed by a linefeed.
Format of load values
A record containing all load values provided by a load sensor may only
be sent after a request from execd.
The record is formed by
-
the keyword begin followed by a linefeed
-
any number of load values, each on a single line
-
the keyword end followed by a linefeed
The format for a load value is
hostname:name:value
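The protocol above can be sketched as a minimal load sensor. The function
names, the load value name scratch_free_mb, the host name node1, and the
choice to ignore unknown commands are assumptions made for this example; only
the commands and the record format come from the description above.

```python
import io


def format_record(host, values):
    """Build one record: begin, one host:name:value line per value, end."""
    lines = ["begin"]
    lines += [f"{host}:{name}:{value}" for name, value in values.items()]
    lines.append("end")
    return "\n".join(lines) + "\n"


def run_load_sensor(commands, out, host, read_values):
    """Answer requests from the execd until quit or end of input.

    commands    -- stream to read commands from (stdin in a real sensor)
    out         -- stream to write records to (stdout in a real sensor)
    read_values -- callable returning a dict of load value name -> value
    """
    for line in commands:
        command = line.rstrip("\n")
        if command == "quit":
            break
        if command == "":
            # A bare linefeed is the request for a fresh record.
            out.write(format_record(host, read_values()))
            out.flush()  # the reader sits on a pipe, so do not buffer
        # Anything else is ignored (an assumption, not part of the protocol).


# Simulated session: one request for load values, then shutdown.
session = io.StringIO("\nquit\n")
output = io.StringIO()
run_load_sensor(session, output, "node1", lambda: {"scratch_free_mb": 1024})
print(output.getvalue(), end="")
```

A real sensor would call run_load_sensor with sys.stdin and sys.stdout and
pass the host name under which the execution host is known to Grid Engine.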
Examples of load sensors are installed in the directory
$SGE_ROOT/util/resources/loadsensors
Further information on setting up load sensors can be found at
http://gridengine.sunsource.net/project/gridengine/howto/loadsensor.html
Copyright 2001 Sun Microsystems, Inc. All rights reserved.