The shepherd
The shepherd is the component of Grid Engine that starts and controls a
job. It
- sets up resources before the job starts (prolog, PE start-up),
- sets the job's environment, limits etc.,
- starts the job (as a child process of the shepherd),
- controls the job - sends signals to the job (suspend/unsuspend, terminate)
  and takes care of checkpointing/migration of a job,
- cleans up resources after job termination (PE shut-down, epilog).
Process Flow
Initialization
The following actions are taken to initialize the shepherd (see source/clients/shepherd.c:
main):
- Set up signal handlers and signal masks (a sketch of this setup follows
  the list)
  (source/libs/uti/sge_set_def_sig_mask.c: sge_set_def_sig_mask,
  source/clients/shepherd.c: set_sig_handler, source/clients/shepherd.c:
  set_shepherd_signal_mask)
- Read the configuration for the job
  (source/daemons/common/config_file: read_config)
- Check the configuration (source/daemons/shepherd/shepherd.c: verify_method):
  - terminate, suspend, resume methods,
  - access to the job's spool directory,
  - validity of the checkpointing interface.
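The following fragment is an illustrative sketch only, not the Grid Engine
sources; it shows the common POSIX pattern behind such an initialization:
install handlers with sigaction and restrict the signal mask with
sigprocmask so the process only sees the signals it wants to handle. The
choice of signals and the handler behaviour are simplified assumptions.

    /* Illustrative sketch: install simple handlers and a restrictive
     * signal mask, roughly the pattern used when a daemon initializes
     * its signal handling. */
    #include <signal.h>
    #include <unistd.h>

    static volatile sig_atomic_t received_signal = 0;

    static void handler(int sig)
    {
        received_signal = sig;        /* only remember it, act on it later */
    }

    static void setup_signals(void)
    {
        struct sigaction sa;
        sigset_t mask;

        sa.sa_handler = handler;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;
        sigaction(SIGTERM, &sa, NULL);  /* e.g. job termination requested */
        sigaction(SIGTTIN, &sa, NULL);  /* e.g. "read signal file", see the
                                         * "signal" spool file below */

        /* block everything except the signals we want to receive */
        sigfillset(&mask);
        sigdelset(&mask, SIGTERM);
        sigdelset(&mask, SIGTTIN);
        sigprocmask(SIG_SETMASK, &mask, NULL);
    }

    int main(void)
    {
        setup_signals();
        pause();                         /* wait for a signal to arrive */
        return received_signal;
    }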
Setting up the job environment
A job runs in a defined process environment. This process environment
comprises environment variables, limits, special parallel processing
environments, security settings etc.
In addition to the standard setup procedure, scripts can be defined
to be executed as prolog/epilog for the job and as setup/shutdown procedures
for parallel environments.
Other components of the job environment can be:
- If available (depends on the system platform) and requested: set up a
  processor set for the job.
- If available and requested: set up the AFS security mechanism.
- If requested: start a prolog script - see queue_conf(5). A sketch of
  running such a script is shown below.
- In case of parallel jobs that need special setup: start a PE startup
  procedure - see sge_pe(5).
This process environment is set up in source/daemons/shepherd/shepherd.c:
main.
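As an illustration only (this is not the Grid Engine implementation), a
prolog or PE startup procedure is essentially a script that is run in a
separate child process and waited for before the job itself starts. The
path "/path/to/prolog.sh" is a made-up placeholder.

    /* Illustrative sketch: run a prolog-like script in a child process
     * through a shell and wait for its exit status. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int run_script(const char *path)
    {
        pid_t pid = fork();
        int status;

        if (pid == -1) {
            return -1;                            /* fork failed */
        }
        if (pid == 0) {
            /* child: run the script through a shell */
            execl("/bin/sh", "sh", "-c", path, (char *)NULL);
            _exit(127);                           /* exec failed */
        }

        /* parent: wait for the script to finish */
        waitpid(pid, &status, 0);
        return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    }

    int main(void)
    {
        printf("prolog exit status: %d\n", run_script("/path/to/prolog.sh"));
        return 0;
    }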
Running the job
A job is executed as a child process of the shepherd process. During the
runtime of the job, the shepherd parent process controls its child and
may send signals to the child process.
The job's startup comprises the following steps (see
source/daemons/shepherd/shepherd.c: start_child):
- If the job has been checkpointed previously and shall now be rerun, a
  checkpoint restart method is set up.
- Fork a child process that will execute the job.
- The parent waits for the child to exit; if any signals have to be
  delivered to the job, the parent process sends these signals to the
  process group of its child process.
The child process starts up the job (see
source/daemons/shepherd/builtin_starter.c: son and
source/daemons/shepherd/builtin_starter.c: start_command); a simplified
sketch follows the list:
- Set up its own process group.
- Set limits according to the queue/job configuration.
- Set up the environment variables.
- If requested, set a special group id different from the user's primary
  group id - see execd_param USE_QSUB_GID in the cluster configuration:
  sge_conf(5).
- Set an operating system job id if the operating system provides this
  feature; otherwise set an additional user group id to control/monitor all
  child processes of the job, see the documentation of the execd.
- Set up the redirection of stdin (from /dev/null) and stdout/stderr (to
  files).
- Set up the commandline depending on the type of job (jobscript, qsh,
  qlogin/qrsh):
  - If it is a batch job, execute the jobscript with commandline parameters.
  - If it is a qsh job, start an xterm process as defined in the cluster
    configuration, parameter xterm, see sge_conf(5).
  - If it is a qrsh or qlogin job, call the qlogin_starter that starts a
    telnetd, rshd or rlogind as defined in the cluster configuration,
    parameters qlogin_daemon, rsh_daemon and rlogin_daemon, see sge_conf(5).
- Execute the commandline built in the previous step.
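The following is an illustrative sketch of that sequence, not the code in
builtin_starter.c: fork a child, give it its own process group, apply an
example limit, redirect the standard streams and exec the command. The
file names ("job.sh", "job.o1", "job.e1") and the CPU limit value are
made-up placeholders.

    /* Illustrative sketch: start a job as a child in its own process
     * group, with a resource limit and redirected standard streams. */
    #include <fcntl.h>
    #include <sys/resource.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t child = fork();

        if (child == 0) {
            /* child: own process group, so signals can be sent to the
             * whole job via kill(-pgid, sig) */
            setpgid(0, 0);

            /* example limit as it could be derived from the queue/job
             * configuration (placeholder value) */
            struct rlimit cpu = { 3600, 3600 };
            setrlimit(RLIMIT_CPU, &cpu);

            /* stdin from /dev/null, stdout/stderr into files */
            int in  = open("/dev/null", O_RDONLY);
            int out = open("job.o1", O_WRONLY | O_CREAT | O_TRUNC, 0644);
            int err = open("job.e1", O_WRONLY | O_CREAT | O_TRUNC, 0644);
            dup2(in, STDIN_FILENO);
            dup2(out, STDOUT_FILENO);
            dup2(err, STDERR_FILENO);

            /* batch job case: execute the job script through a shell */
            execl("/bin/sh", "sh", "job.sh", (char *)NULL);
            _exit(127);                       /* exec failed */
        }

        /* parent: deliver signals to the job's process group if needed,
         * e.g. kill(-child, SIGSTOP) for a suspend, then wait for the job */
        int status;
        waitpid(child, &status, 0);
        return 0;
    }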
After a job exits, job information is spooled in the job's spool directory
for further use by the execution daemon, containing:
- exit code
- usage / accounting information
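A minimal sketch of collecting this information, assuming the parent has
already forked the job: the exit code is taken from the wait status (the
128+signal convention used here for signalled jobs is an assumption, not
necessarily what the shepherd writes) and stored in a file named like the
"exit_status" spool file described below; usage comes from getrusage().

    /* Illustrative sketch: collect exit code and resource usage of the
     * finished job child and spool the exit code into a file. */
    #include <stdio.h>
    #include <sys/resource.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    static void spool_exit_status(pid_t child)
    {
        int status;
        struct rusage usage;

        if (waitpid(child, &status, 0) != child) {
            return;
        }

        /* one single line with the numeric exit code */
        int exit_code = WIFEXITED(status) ? WEXITSTATUS(status)
                                          : 128 + WTERMSIG(status);
        FILE *f = fopen("exit_status", "w");
        if (f != NULL) {
            fprintf(f, "%d\n", exit_code);
            fclose(f);
        }

        /* accumulated usage of all waited-for children */
        if (getrusage(RUSAGE_CHILDREN, &usage) == 0) {
            fprintf(stderr, "user cpu: %lds, system cpu: %lds\n",
                    (long)usage.ru_utime.tv_sec,
                    (long)usage.ru_stime.tv_sec);
        }
    }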
Cleaning up the job environment
After a job has terminated, the job environment is cleaned up:
- If it is a parallel job and a PE stop procedure has been defined: execute
  the PE stop procedure, see sge_pe(5).
- Execute an epilog script if defined, see queue_conf(5).
- Free a previously created processor set.
- Shut down security mechanisms set up during job setup.
- Write the job's exit code to a file and possibly to a qrsh process (see
  also "Communication with qlogin/qrsh").
The process environment is cleaned up in source/daemons/shepherd/shepherd.c:
main.
Communication with the execd - spooled files
The shepherd and the execution daemon share some information by writing
spool files.
The execution daemon passes information like the job's configuration
and the environment to the shepherd (see for example
source/daemons/execd/exec_job.c: sge_exec_job); the shepherd reports
information like the job's exit code and usage information back to the
execution daemon (see for example source/daemons/shepherd/shepherd.c:
wait_my_child).
The spool files are located in a host specific spool directory (usually
$SGE_ROOT/default/spool/<hostname>).
The spool directory can be defined in the cluster configuration - see
parameter execd_spool_dir in the cluster configuration, sge_conf(5).
Under the host specific spool directory, a subdirectory active_jobs
is created by the execd; for each job, the execd creates a subdirectory
<jobid>.<taskid> that holds the job specific spool files.
If the job is a parallel job with a tight integration, this job specific
spool directory is created on each execution host involved in the execution
of the job; for each task a subdirectory <petaskid> is created
that holds the task specific spool files.
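The resulting paths can be composed as sketched below; the base directory,
the ids and the <petaskid> value are made-up example values, not output of
Grid Engine.

    /* Illustrative sketch: compose the job/task specific spool paths
     * described above from example values. */
    #include <stdio.h>

    int main(void)
    {
        const char *spool_dir = "/sge_root/default/spool/hostA"; /* execd_spool_dir */
        int job_id = 4711, task_id = 1;
        char path[1024];

        /* <execd_spool_dir>/active_jobs/<jobid>.<taskid> */
        snprintf(path, sizeof(path), "%s/active_jobs/%d.%d",
                 spool_dir, job_id, task_id);
        printf("job spool directory:  %s\n", path);

        /* tightly integrated parallel task: one subdirectory per <petaskid> */
        snprintf(path, sizeof(path), "%s/active_jobs/%d.%d/%s",
                 spool_dir, job_id, task_id, "1.hostA" /* example petaskid */);
        printf("task spool directory: %s\n", path);
        return 0;
    }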
The following spool files exist in Grid Engine:
- addgrpid
  Contains one line with the additional group id used to control and
  monitor the job, if the operating system does not provide a job id
  feature (see file osjobid).
- checkpointed
  If a checkpointed job is restarted, the shepherd writes a "1" into this
  file. If a new checkpoint is requested during startup of the job, no new
  checkpoint will be written.
- config
  The job's configuration - contains one line per configuration parameter.
- environment
  All environment variables to be set up for the job in the form
  <variable>=<value>.
  This file is written by the execd and used by the shepherd to set up the
  job's environment.
- error
  Contains an error message in case of severe errors during the startup
  of a job (e.g. the execd cannot start the shepherd).
- exit_status
  The numeric exit code of the job in one single line.
- job_pid
  The process id of the job (the shepherd's child process).
- osjobid
  Contains one line with the operating system jobid used to control and
  monitor the job (if the operating system provides this feature).
- pid
  The process id of the shepherd.
- processor_set_number
  If a processor set is created for the execution of a job, the shepherd
  writes the processor set number to this file. It is used after job exit
  to free the processor set.
- shepherd_about_to_exit
  In parallel jobs, multiple instances of a task may be executed in
  sequence. For example, a qmake job will be assigned a fixed number of
  slots in a Grid Engine cluster and will execute multiple tasks (e.g.
  compile tasks) in each slot.
  If this happens in fast sequence, the execd may not yet have seen the
  termination of a previously exited task when it receives the order to
  start the next task. If in this case all slots are in use, and the execd
  would therefore have to reject the execution of an additional task, it
  checks for the existence of the spool file "shepherd_about_to_exit" for
  all running tasks of the job that requests execution of the next task;
  if it finds a task that has already exited but is not yet cleaned up, it
  can allow the execution of the next task.
- signal
  Usually, if the shepherd has to deliver a signal to a job, the execd will
  notify the shepherd with (potentially another) signal, which the shepherd
  then maps to the signal to deliver and sends to the job's process group.
  If the shepherd receives a SIGTTIN, it reads the signal to deliver
  from the spool file "signal" and sends this signal to the job's
  process group (a sketch of this pattern follows the file list).
- trace
  A file containing debug information about the job's execution.
If the job is a parallel job, the following additional files are created:
- if it is a parallel job with tight integration, a subdirectory for each
  task containing the files described above,
- pe_hostfile
  A file describing the host setup of a parallel job, containing each
  involved host, the queues the job was spooled into and the number of
  reserved slots (tasks) per host.
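The pattern behind the "signal" file can be sketched as follows; this is
an illustration of the described mechanism, not the shepherd's actual
code, and the handler deliberately only sets a flag while the file is read
later from the normal control flow.

    /* Illustrative sketch: on SIGTTIN, read the signal number to deliver
     * from the "signal" spool file and forward it to the job's process
     * group. */
    #include <signal.h>
    #include <stdio.h>
    #include <sys/types.h>

    static volatile sig_atomic_t got_sigttin = 0;

    static void on_sigttin(int sig)
    {
        (void)sig;
        got_sigttin = 1;                 /* handled later, outside the handler */
    }

    static void forward_spooled_signal(pid_t job_pgid)
    {
        FILE *f = fopen("signal", "r");  /* spool file written by the execd */
        int sig_to_deliver;

        if (f != NULL) {
            if (fscanf(f, "%d", &sig_to_deliver) == 1) {
                kill(-job_pgid, sig_to_deliver);  /* whole process group */
            }
            fclose(f);
        }
        got_sigttin = 0;
    }

    /* installed with e.g. signal(SIGTTIN, on_sigttin); the shepherd's wait
     * loop would call forward_spooled_signal() whenever got_sigttin is set. */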
Communication with qlogin/qrsh
Interactive jobs submitted using qlogin or qrsh require additional
communication between the shepherd and the qlogin/qrsh client command - see
also the documentation of qrsh.
Qlogin/qrsh sets up a socket and writes the port address to an environment
variable "QRSH_PORT" in the format "<host>:<port>" (see source/clients/qsh/qsh.c:
main).
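A sketch of reading such a "<host>:<port>" value and connecting to it is
shown below; this is an illustration under simplified assumptions (IPv4
only, minimal error handling), not the shepherd's implementation, and the
data written over the socket is only a placeholder.

    /* Illustrative sketch: parse "<host>:<port>" from QRSH_PORT and
     * connect to the waiting qlogin/qrsh client. */
    #include <arpa/inet.h>
    #include <netdb.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        const char *qrsh_port = getenv("QRSH_PORT");  /* e.g. "hostA:40123" */
        char host[256];
        int port;

        if (qrsh_port == NULL ||
            sscanf(qrsh_port, "%255[^:]:%d", host, &port) != 2) {
            fprintf(stderr, "QRSH_PORT not set or malformed\n");
            return 1;
        }

        struct hostent *he = gethostbyname(host);
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (he == NULL || fd == -1) {
            return 1;
        }

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons((unsigned short)port);
        memcpy(&addr.sin_addr, he->h_addr_list[0], sizeof(addr.sin_addr));

        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0) {
            const char *msg = "0\n";      /* placeholder for the data sent */
            write(fd, msg, strlen(msg));
        }
        close(fd);
        return 0;
    }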
The shepherd connects to this port and sends the following information
to the qlogin/qrsh client: