shadowd - qmaster breakdown detection daemon

The qmaster is the central control unit of a Grid Engine cluster. If it cannot provide service, cluster operation is severely affected (note, however, that jobs already executing will finish normally). To prevent the qmaster from becoming a single point of failure, the shadowd is available. It detects when the currently active qmaster becomes unavailable and starts up a new qmaster/schedd pair. The shadowd has to run on a so-called shadow master host, which must have the same access permissions to the qmaster spool files as the currently active qmaster host. If multiple shadowds are active in a cluster, they run a protocol to ensure that only one new qmaster/schedd pair is started. For more detailed information, see the man page sge_shadowd(8).

The shadowd mechanism will only work if all Grid Engine client commands can access the same shared file system for the cell directory. The cell directory (often $SGE_ROOT/default) contains all configuration files and the spooling directories for the cluster.

To start up a shadowd, use the Grid Engine cluster startup script sge5 with the -shadowd option.

Note that the shadowd will not start a new qmaster/schedd if the file "lock" exists in the qmaster spool directory. This file is written by the qmaster in the case of a regular shutdown of the cluster (qconf -km). The shadowd detects that the qmaster service is unavailable by reading a so-called heartbeat file, which an active qmaster updates periodically. If the file stops being updated, the shadowd does not start a new qmaster/schedd immediately. It waits for several minutes to ensure that a temporary network overload or breakdown is not the cause of the problem.
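
The two file checks described above can be sketched in C as follows. This is a minimal illustration, not the code from shadowd.c: the function names read_heartbeat() and lock_exists() are hypothetical, and the sketch assumes that the heartbeat file holds a single non-negative decimal counter, which the text above does not specify.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: return the counter stored in <spool_dir>/heartbeat,
 * or -1 if the file is missing or unreadable.
 * ASSUMPTION: the file contains one non-negative decimal integer.
 */
long read_heartbeat(const char *spool_dir)
{
    char path[4096];
    long beat = -1;
    FILE *fp;

    assert(spool_dir != NULL);
    snprintf(path, sizeof path, "%s/heartbeat", spool_dir);

    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    if (fscanf(fp, "%ld", &beat) != 1 || beat < 0)
        beat = -1;              /* empty or malformed file */
    fclose(fp);
    return beat;
}

/*
 * Hypothetical helper: true if <spool_dir>/lock exists. While it exists,
 * no new qmaster/schedd pair may be started.
 */
bool lock_exists(const char *spool_dir)
{
    char path[4096];
    struct stat sb;

    assert(spool_dir != NULL);
    snprintf(path, sizeof path, "%s/lock", spool_dir);
    return stat(path, &sb) == 0;
}
```

A caller would poll read_heartbeat() at intervals and compare each value with the previous one. A value that stays the same over a long period, together with lock_exists() returning false, is the condition under which a takeover may be considered.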
 
 

The shadowd main function is implemented in the file:

gridengine/source/daemons/shadowd/shadowd.c

The main loop works as follows:

1. Check for a running shadowd on the local host. If a shadowd is already running: stop

2. Prepare for the later enroll() call of the commlib: prepare_enroll()

Set all parameters for communication with the commd
3. Install exit functions and set up signal handlers

4. Parse command line parameters (only -help is available)

5. Get qmaster spool directory from configuration file

The configuration file is $SGE_ROOT/default/common/configuration
6. Switch to the admin user (also stored in the configuration file)

7. Get the qmaster heartbeat (from the file heartbeat in the qmaster spool directory)

8. If the heartbeat is unchanged for more than one heartbeat interval (240 seconds), start the delay timer (600 seconds)

9. If the heartbeat is still unchanged after both the heartbeat interval and the delay time have elapsed:

if no valid lock file (from another shadowd) is found:
- lock the qmaster
- start up qmaster and schedd on the local host
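
The timing logic of steps 7 to 9 can be sketched as a small decision function. This is a sketch under stated assumptions, not the implementation in shadowd.c: the type and function names (shadow_state, shadow_action, shadow_step) are hypothetical, and only the two constants, 240 and 600 seconds, are taken from the steps above.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

enum {
    HEARTBEAT_INTERVAL = 240,   /* seconds, from step 8 */
    DELAY_TIME         = 600    /* seconds, from step 8 */
};

/* Hypothetical result codes for one pass of the monitoring loop. */
typedef enum {
    QM_ALIVE,     /* heartbeat changed, or not yet overdue */
    QM_SUSPECT,   /* heartbeat overdue, delay timer running */
    QM_BLOCKED,   /* heartbeat dead, but a valid lock file exists */
    QM_TAKEOVER   /* heartbeat dead, no lock: start qmaster/schedd */
} shadow_action;

/* Hypothetical per-shadowd bookkeeping. */
typedef struct {
    long   last_beat;     /* heartbeat value seen most recently */
    time_t last_change;   /* time at which that value first appeared */
} shadow_state;

/*
 * Decide what to do given the heartbeat value just read, the current
 * time, and whether a valid lock file was found.
 */
shadow_action shadow_step(shadow_state *st, long beat, time_t now,
                          bool lock_present)
{
    double stale;

    assert(st != NULL);

    if (beat != st->last_beat) {          /* qmaster is still updating */
        st->last_beat = beat;
        st->last_change = now;
        return QM_ALIVE;
    }

    stale = difftime(now, st->last_change);
    if (stale <= HEARTBEAT_INTERVAL)
        return QM_ALIVE;
    if (stale <= HEARTBEAT_INTERVAL + DELAY_TIME)
        return QM_SUSPECT;                /* step 8: waiting out the delay */

    /* step 9: interval and delay both elapsed */
    return lock_present ? QM_BLOCKED : QM_TAKEOVER;
}
```

Keeping the decision separate from file and process handling makes the timing rules easy to check in isolation. A real daemon would call such a function from its polling loop and, on QM_TAKEOVER, write the lock before starting the new qmaster and schedd.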