The shadowd mechanism only works if all Grid Engine client commands can access the cell directory on the same shared file system. The cell directory (often $SGE_ROOT/default) holds all configuration files and the spooling directories for the cluster.
To start a shadowd, use the Grid Engine cluster startup script sge5 with the -shadowd option.
Note that the shadowd will not start a new qmaster/schedd
if the file "lock" exists in the qmaster spool directory. The qmaster
writes this file during a regular shutdown of the cluster
(qconf -km). The shadowd detects that the qmaster service is
unavailable first by reading a so-called heartbeat file, which an active
qmaster updates periodically. If the file stops being updated, the shadowd
does not start a new qmaster/schedd immediately. It waits for several
minutes to make sure that a temporary network overload or breakdown is not
the cause of the problem.
The shadowd main function is implemented in the file:
gridengine/source/daemons/shadowd/shadowd.c
The main loop looks as follows:
1. Check for a running shadowd on the local host. If there is already a running shadowd: stop
2. Prepare for the later enroll() call of commlib: prepare_enroll()
   Set all parameters for communication with commd
3. Install exit functions and set up signal handlers
4. Parse command line parameters (only -help available)
5. Get the qmaster spool directory from the configuration file
   The configuration file is $SGE_ROOT/default/common/configuration
6. Switch to the admin user (also stored in the configuration file)
7. Get the qmaster heartbeat (from the file heartbeat in the qmaster spool directory)
8. When the heartbeat is unchanged for more than one heartbeat interval (240 seconds), start the delay timer (600 seconds)
9. When the heartbeat is still unchanged after heartbeat interval time and delay time:
   if no valid lock file (from another shadowd) is found:
   - lock qmaster
   - start up qmaster and schedd on the local host
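
The timing rule in steps 7 to 9 can be illustrated with a small Python sketch. This is not the code from shadowd.c. The class name HeartbeatWatcher, its methods, and the treatment of the heartbeat as an opaque string are assumptions made for illustration; only the two timeouts (240 and 600 seconds) come from the list above. The sketch also folds both lock conditions (the "lock" file left by a regular shutdown and a lock held by another shadowd) into a single flag, which is a simplification.

```python
from dataclasses import dataclass
from typing import Optional

HEARTBEAT_INTERVAL = 240  # seconds, from step 8
DELAY_TIME = 600          # seconds, from step 8


@dataclass
class HeartbeatWatcher:
    """Hypothetical helper: tracks how long the heartbeat has been unchanged."""

    last_value: Optional[str] = None
    last_change: float = 0.0

    def observe(self, value: str, now: float) -> None:
        # Reset the clock whenever the observed heartbeat value changes.
        if value != self.last_value:
            self.last_value = value
            self.last_change = now

    def should_take_over(self, now: float, lock_present: bool) -> bool:
        # Take over only when the heartbeat has been unchanged for longer
        # than the heartbeat interval plus the delay time, and no lock
        # file blocks the start of a new qmaster/schedd.
        stale_for = now - self.last_change
        return stale_for > HEARTBEAT_INTERVAL + DELAY_TIME and not lock_present
```

A caller would read the heartbeat file at regular intervals, pass its content to observe(), and start qmaster and schedd only when should_take_over() returns true. With these values, a takeover happens no earlier than 840 seconds after the last heartbeat change, which matches the "several minutes" of waiting described earlier.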