The shepherd
The shepherd is the component of Grid Engine that starts and controls a
job. It
- sets up resources before the job starts (prolog, PE start-up),
- sets the job's environment, limits etc.,
- starts the job (as a child process of the shepherd),
- controls the job - sends signals to the job (suspend/unsuspend, terminate)
  and takes care of checkpointing/migration of a job,
- cleans up resources after job termination (PE shut-down, epilog).
Process Flow
Initialization
The following actions are taken to initialize the shepherd (see source/clients/shepherd.c:
main):
- Set up signal handlers and signal masks (a sketch of this setup follows
  the list)
  (source/libs/uti/sge_set_def_sig_mask.c: sge_set_def_sig_mask,
  source/clients/shepherd.c: set_sig_handler, source/clients/shepherd.c:
  set_shepherd_signal_mask)
- Read the configuration for the job
  (source/daemons/common/config_file: read_config)
- Check the configuration (source/daemons/shepherd/shepherd.c: verify_method):
  - terminate, suspend, resume methods,
  - access to the job's spool directory,
  - validity of the checkpointing interface.
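The following fragment is an illustrative sketch only, not the Grid Engine
sources; it shows the common POSIX pattern behind such an initialization:
install handlers with sigaction and restrict the signal mask with
sigprocmask so the process only sees the signals it wants to handle. The
choice of signals and the handler behaviour are simplified assumptions.

    /* Illustrative sketch: install simple handlers and a restrictive
     * signal mask, roughly the pattern used when a daemon initializes
     * its signal handling. */
    #include <signal.h>
    #include <unistd.h>

    static volatile sig_atomic_t received_signal = 0;

    static void handler(int sig)
    {
        received_signal = sig;        /* only remember it, act on it later */
    }

    static void setup_signals(void)
    {
        struct sigaction sa;
        sigset_t mask;

        sa.sa_handler = handler;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;
        sigaction(SIGTERM, &sa, NULL);  /* e.g. job termination requested */
        sigaction(SIGTTIN, &sa, NULL);  /* e.g. "read signal file", see the
                                         * "signal" spool file below */

        /* block everything except the signals we want to receive */
        sigfillset(&mask);
        sigdelset(&mask, SIGTERM);
        sigdelset(&mask, SIGTTIN);
        sigprocmask(SIG_SETMASK, &mask, NULL);
    }

    int main(void)
    {
        setup_signals();
        pause();                         /* wait for a signal to arrive */
        return received_signal;
    }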
Setting up the job environment
A job runs in a defined process environment. This process environment
comprises environment variables, limits, special parallel processing
environments, security settings etc.
In addition to the standard setup procedure, scripts can be defined
to be executed as prolog/epilog for the job and as setup/shutdown procedures
for parallel environments.
Other components of the job environment can be:
- If available (depends on the system platform) and requested: set up a
  processor set for the job.
- If available and requested: set up the AFS security mechanism.
- If requested: start a prolog script - see queue_conf(5). A sketch of
  running such a script is shown below.
- In case of parallel jobs that need special setup: start a PE startup
  procedure - see sge_pe(5).
This process environment is set up in source/daemons/shepherd/shepherd.c:
main.
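As an illustration only (this is not the Grid Engine implementation), a
prolog or PE startup procedure is essentially a script that is run in a
separate child process and waited for before the job itself starts. The
path "/path/to/prolog.sh" is a made-up placeholder.

    /* Illustrative sketch: run a prolog-like script in a child process
     * through a shell and wait for its exit status. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int run_script(const char *path)
    {
        pid_t pid = fork();
        int status;

        if (pid == -1) {
            return -1;                            /* fork failed */
        }
        if (pid == 0) {
            /* child: run the script through a shell */
            execl("/bin/sh", "sh", "-c", path, (char *)NULL);
            _exit(127);                           /* exec failed */
        }

        /* parent: wait for the script to finish */
        waitpid(pid, &status, 0);
        return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    }

    int main(void)
    {
        printf("prolog exit status: %d\n", run_script("/path/to/prolog.sh"));
        return 0;
    }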
Running the job
A job is executed as a child process of the shepherd process. During the
runtime of the job, the shepherd parent process controls its child and
may send signals to the child process.
The job's startup comprises the following steps (see
source/daemons/shepherd/shepherd.c: start_child):
- If the job has been checkpointed previously and shall now be rerun, a
  checkpoint restart method is set up.
- Fork a child process that will execute the job.
- The parent waits for the child to exit; if any signals have to be
  delivered to the job, the parent process sends these signals to the
  process group of its child process.
The child process starts up the job (see
source/daemons/shepherd/builtin_starter.c: son and
source/daemons/shepherd/builtin_starter.c: start_command); a simplified
sketch follows the list:
- Set up its own process group.
- Set limits according to the queue/job configuration.
- Set up the environment variables.
- If requested, set a special group id different from the user's primary
  group id - see execd_param USE_QSUB_GID in the cluster configuration:
  sge_conf(5).
- Set an operating system job id if the operating system provides this
  feature; otherwise set an additional user group id to control/monitor all
  child processes of the job, see the documentation of the execd.
- Set up the redirection of stdin (from /dev/null) and stdout/stderr (to
  files).
- Set up the commandline depending on the type of job (jobscript, qsh,
  qlogin/qrsh):
  - If it is a batch job, execute the jobscript with commandline parameters.
  - If it is a qsh job, start an xterm process as defined in the cluster
    configuration, parameter xterm, see sge_conf(5).
  - If it is a qrsh or qlogin job, call the qlogin_starter that starts a
    telnetd, rshd or rlogind as defined in the cluster configuration,
    parameters qlogin_daemon, rsh_daemon and rlogin_daemon, see sge_conf(5).
- Execute the commandline built in the previous step.
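The following is an illustrative sketch of that sequence, not the code in
builtin_starter.c: fork a child, give it its own process group, apply an
example limit, redirect the standard streams and exec the command. The
file names ("job.sh", "job.o1", "job.e1") and the CPU limit value are
made-up placeholders.

    /* Illustrative sketch: start a job as a child in its own process
     * group, with a resource limit and redirected standard streams. */
    #include <fcntl.h>
    #include <sys/resource.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t child = fork();

        if (child == 0) {
            /* child: own process group, so signals can be sent to the
             * whole job via kill(-pgid, sig) */
            setpgid(0, 0);

            /* example limit as it could be derived from the queue/job
             * configuration (placeholder value) */
            struct rlimit cpu = { 3600, 3600 };
            setrlimit(RLIMIT_CPU, &cpu);

            /* stdin from /dev/null, stdout/stderr into files */
            int in  = open("/dev/null", O_RDONLY);
            int out = open("job.o1", O_WRONLY | O_CREAT | O_TRUNC, 0644);
            int err = open("job.e1", O_WRONLY | O_CREAT | O_TRUNC, 0644);
            dup2(in, STDIN_FILENO);
            dup2(out, STDOUT_FILENO);
            dup2(err, STDERR_FILENO);

            /* batch job case: execute the job script through a shell */
            execl("/bin/sh", "sh", "job.sh", (char *)NULL);
            _exit(127);                       /* exec failed */
        }

        /* parent: deliver signals to the job's process group if needed,
         * e.g. kill(-child, SIGSTOP) for a suspend, then wait for the job */
        int status;
        waitpid(child, &status, 0);
        return 0;
    }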
After a job exits, job information is spooled in the job's spool directory
for further use by the execution daemon, containing:
- exit code
- usage / accounting information
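A minimal sketch of collecting this information, assuming the parent has
already forked the job: the exit code is taken from the wait status (the
128+signal convention used here for signalled jobs is an assumption, not
necessarily what the shepherd writes) and stored in a file named like the
"exit_status" spool file described below; usage comes from getrusage().

    /* Illustrative sketch: collect exit code and resource usage of the
     * finished job child and spool the exit code into a file. */
    #include <stdio.h>
    #include <sys/resource.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    static void spool_exit_status(pid_t child)
    {
        int status;
        struct rusage usage;

        if (waitpid(child, &status, 0) != child) {
            return;
        }

        /* one single line with the numeric exit code */
        int exit_code = WIFEXITED(status) ? WEXITSTATUS(status)
                                          : 128 + WTERMSIG(status);
        FILE *f = fopen("exit_status", "w");
        if (f != NULL) {
            fprintf(f, "%d\n", exit_code);
            fclose(f);
        }

        /* accumulated usage of all waited-for children */
        if (getrusage(RUSAGE_CHILDREN, &usage) == 0) {
            fprintf(stderr, "user cpu: %lds, system cpu: %lds\n",
                    (long)usage.ru_utime.tv_sec,
                    (long)usage.ru_stime.tv_sec);
        }
    }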
Cleaning up the job environment
After a job has terminated, the job environment is cleaned up:
- If it is a parallel job and a PE stop procedure has been defined: execute
  the PE stop procedure, see sge_pe(5).
- Execute an epilog script if defined, see queue_conf(5).
- Free a previously created processor set.
- Shut down security mechanisms set up during job setup.
- Write the job's exit code to a file and possibly to a qrsh process (see
  also "Communication with qlogin/qrsh").
The process environment is cleaned up in source/daemons/shepherd/shepherd.c:
main.
Communication with the execd - spooled files
The shepherd and the execution daemon share some information by writing
spool files.
The execution daemon passes information like the job's configuration
and the environment to the shepherd (see for example
source/daemons/execd/exec_job.c: sge_exec_job); the shepherd reports
information like the job's exit code and usage information back to the
execution daemon (see for example source/daemons/shepherd/shepherd.c:
wait_my_child).
The spool files are located in a host specific spool directory (usually
$SGE_ROOT/default/spool/<hostname>).
The spool directory can be defined in the cluster configuration - see
parameter execd_spool_dir in the cluster configuration, sge_conf(5).
Under the host specific spool directory, a subdirectory active_jobs
is created by the execd; for each job, the execd creates a subdirectory
<jobid>.<taskid> that holds the job specific spool files.
If the job is a parallel job with a tight integration, this job specific
spool directory is created on each execution host involved in the execution
of the job; for each task a subdirectory <petaskid> is created
that holds the task specific spool files.
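The resulting paths can be composed as sketched below; the base directory,
the ids and the <petaskid> value are made-up example values, not output of
Grid Engine.

    /* Illustrative sketch: compose the job/task specific spool paths
     * described above from example values. */
    #include <stdio.h>

    int main(void)
    {
        const char *spool_dir = "/sge_root/default/spool/hostA"; /* execd_spool_dir */
        int job_id = 4711, task_id = 1;
        char path[1024];

        /* <execd_spool_dir>/active_jobs/<jobid>.<taskid> */
        snprintf(path, sizeof(path), "%s/active_jobs/%d.%d",
                 spool_dir, job_id, task_id);
        printf("job spool directory:  %s\n", path);

        /* tightly integrated parallel task: one subdirectory per <petaskid> */
        snprintf(path, sizeof(path), "%s/active_jobs/%d.%d/%s",
                 spool_dir, job_id, task_id, "1.hostA" /* example petaskid */);
        printf("task spool directory: %s\n", path);
        return 0;
    }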
The following spool files exist in Grid Engine:
- addgrpid
  Contains one line with the additional group id used to control and
  monitor the job, if the operating system does not provide a job id
  feature (see file osjobid).
- checkpointed
  If a checkpointed job is restarted, the shepherd writes a "1" into this
  file. If a new checkpoint is requested during startup of the job, no new
  checkpoint will be written.
- config
  The job's configuration - contains one line per configuration parameter.
- environment
  All environment variables to be set up for the job in the form
  <variable>=<value>.
  This file is written by the execd and used by the shepherd to set up the
  job's environment.
- error
  Contains an error message in case of severe errors during the startup
  of a job (e.g. the execd cannot start the shepherd).
- exit_status
  The numeric exit code of the job in one single line.
- job_pid
  The process id of the job (the shepherd's child process).
- osjobid
  Contains one line with the operating system jobid used to control and
  monitor the job (if the operating system provides this feature).
- pid
  The process id of the shepherd.
- processor_set_number
  If a processor set is created for the execution of a job, the shepherd
  writes the processor set number to this file. It is used after job exit
  to free the processor set.
- shepherd_about_to_exit
  In parallel jobs, multiple instances of a task may be executed in
  sequence. For example, a qmake job will be assigned a fixed number of
  slots in a Grid Engine cluster and will execute multiple tasks (e.g.
  compile tasks) in each slot.
  If this happens in fast sequence, the execd may not yet have seen the
  termination of a previously exited task when it receives the order to
  start the next task. If in this case all slots are in use, and the execd
  would therefore have to reject the execution of an additional task, it
  checks for the existence of the spool file "shepherd_about_to_exit" for
  all running tasks of the job that requests execution of the next task;
  if it finds a task that has already exited but is not yet cleaned up, it
  can allow the execution of the next task.
- signal
  Usually, if the shepherd has to deliver a signal to a job, the execd will
  notify the shepherd with (potentially another) signal, which the shepherd
  then maps to the signal to deliver and sends to the job's process group.
  If the shepherd receives a SIGTTIN, it reads the signal to deliver
  from the spool file "signal" and sends this signal to the job's
  process group (a sketch of this pattern follows the file list).
- trace
  A file containing debug information about the job's execution.
If the job is a parallel job, the following additional files are created:
- if it is a parallel job with tight integration, a subdirectory for each
  task containing the files described above,
- pe_hostfile
  A file describing the host setup of a parallel job, containing each
  involved host, the queues the job was spooled into and the number of
  reserved slots (tasks) per host.
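The pattern behind the "signal" file can be sketched as follows; this is
an illustration of the described mechanism, not the shepherd's actual
code, and the handler deliberately only sets a flag while the file is read
later from the normal control flow.

    /* Illustrative sketch: on SIGTTIN, read the signal number to deliver
     * from the "signal" spool file and forward it to the job's process
     * group. */
    #include <signal.h>
    #include <stdio.h>
    #include <sys/types.h>

    static volatile sig_atomic_t got_sigttin = 0;

    static void on_sigttin(int sig)
    {
        (void)sig;
        got_sigttin = 1;                 /* handled later, outside the handler */
    }

    static void forward_spooled_signal(pid_t job_pgid)
    {
        FILE *f = fopen("signal", "r");  /* spool file written by the execd */
        int sig_to_deliver;

        if (f != NULL) {
            if (fscanf(f, "%d", &sig_to_deliver) == 1) {
                kill(-job_pgid, sig_to_deliver);  /* whole process group */
            }
            fclose(f);
        }
        got_sigttin = 0;
    }

    /* installed with e.g. signal(SIGTTIN, on_sigttin); the shepherd's wait
     * loop would call forward_spooled_signal() whenever got_sigttin is set. */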
Communication with qlogin/qrsh
Interactive jobs submitted using qlogin or qrsh require additional
communication between the shepherd and the qlogin/qrsh client command - see
also the documentation of qrsh.
Qlogin/qrsh sets up a socket and writes the port address to an environment
variable "QRSH_PORT" in the format "<host>:<port>" (see source/clients/qsh/qsh.c:
main).
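A sketch of reading such a "<host>:<port>" value and connecting to it is
shown below; this is an illustration under simplified assumptions (IPv4
only, minimal error handling), not the shepherd's implementation, and the
data written over the socket is only a placeholder.

    /* Illustrative sketch: parse "<host>:<port>" from QRSH_PORT and
     * connect to the waiting qlogin/qrsh client. */
    #include <arpa/inet.h>
    #include <netdb.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        const char *qrsh_port = getenv("QRSH_PORT");  /* e.g. "hostA:40123" */
        char host[256];
        int port;

        if (qrsh_port == NULL ||
            sscanf(qrsh_port, "%255[^:]:%d", host, &port) != 2) {
            fprintf(stderr, "QRSH_PORT not set or malformed\n");
            return 1;
        }

        struct hostent *he = gethostbyname(host);
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (he == NULL || fd == -1) {
            return 1;
        }

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons((unsigned short)port);
        memcpy(&addr.sin_addr, he->h_addr_list[0], sizeof(addr.sin_addr));

        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0) {
            const char *msg = "0\n";      /* placeholder for the data sent */
            write(fd, msg, strlen(msg));
        }
        close(fd);
        return 0;
    }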
The shepherd connects to this port and sends the following information
to the qlogin/qrsh client: