Grid Engine Debug-Monitoring

Overview

The rmon library is a module that helps run Grid Engine components in a debug trace mode. It is most often used to monitor the behavior of daemons. In this mode the daemon does not detach from the controlling terminal but stays in the foreground and prints monitoring messages. These messages make it possible to trace what happens inside a distributed system like Grid Engine. Tracing is important for many tasks:

find the location/reason for a daemon crash.
monitor daemon-internal processes in development phases.
find bottlenecks or just develop a sense for possible bottlenecks in the overall architecture.

The monitoring library provides a set of functions which are used to implement macros like DPRINTF, DTRACE, DENTER and DEXIT. These macros are widely used in the source code.

Running daemons in monitoring mode

Usually, users don't start the daemons directly; they use the start-up script instead. For a monitoring run, however, the daemon has to be started directly from a shell. That shell needs the same environment that the start-up script would set (SGE_ROOT, SGE_QMASTER_PORT, ...). This is easily achieved by sourcing the settings file:

% source $SGE_ROOT/default/common/settings.csh
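Before launching a daemon by hand, it can help to confirm that the variables named above are really present in the shell. The following is a minimal sketch for sh users (csh has no functions, so it is written in sh). The function name check_sge_env is hypothetical and not part of Grid Engine, and it checks only the two variables named above; the start-up script may set others.

```sh
# Hypothetical helper (not shipped with Grid Engine): report which of the
# variables mentioned above are missing from the current shell.
check_sge_env() {
    missing=0
    for var in SGE_ROOT SGE_QMASTER_PORT; do
        # Indirect lookup: fetch the value of the variable named in $var.
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            echo "missing: $var" >&2
            missing=1
        fi
    done
    # Exit status 0 if both variables are set, 1 otherwise.
    return $missing
}
```

It could be used as "check_sge_env && echo environment looks complete" right before the daemon is started.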

The monitoring library distinguishes 8 categories of messages and 8 layers. Layers and categories exist so that unnecessary debug output can be hidden and only the pertinent output is displayed. All DPRINTF messages belong to the info category; all DENTER/DEXIT messages are trace messages. The layer of a function is passed as an argument to the DENTER macro. The environment variable SGE_DEBUG_LEVEL tells the daemon to run in monitoring mode and selects which categories and layers of messages are displayed, so it must be set before the daemon is started. For example,

% setenv SGE_DEBUG_LEVEL "1 0 0 0 0 0 0 0"

causes all top-layer DENTER/DEXIT messages to be printed. As a convenience, there are small shell script functions for sh/csh that make it easy to set and unset the debug level. To activate them, run:

% source $SGE_ROOT/util/dl.csh

This activates an alias 'dl', which can be used like a command to operate on SGE_DEBUG_LEVEL:

% dl 1

% echo $SGE_DEBUG_LEVEL

2 0 0 0 0 0 0 0

% dl 2

% echo $SGE_DEBUG_LEVEL

3 0 0 0 0 0 0 0
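The level string holds one field per layer, with the top layer first. Judging from the examples above, a field value of 1 selects trace messages (DENTER/DEXIT), 2 selects info messages (DPRINTF), and 3 selects both. The following sh sketch builds such a string by hand; make_debug_level is a hypothetical helper, not one of the functions shipped with Grid Engine.

```sh
# Hypothetical helper: print an 8-field level string with VALUE in field
# POSITION (0 = first field, the top layer) and 0 in every other field.
make_debug_level() {
    position=$1
    value=$2
    level=""
    i=0
    while [ "$i" -lt 8 ]; do
        if [ "$i" -eq "$position" ]; then
            field=$value
        else
            field=0
        fi
        # Append the field, separated by a blank unless it is the first one.
        level="${level}${level:+ }${field}"
        i=$((i + 1))
    done
    echo "$level"
}
```

An sh user could then write SGE_DEBUG_LEVEL=`make_debug_level 0 3`; export SGE_DEBUG_LEVEL to get the same setting that "dl 2" produces.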

The following "dl" settings are supported; each one provides a useful combination of debug output:

dl 1 --> 2 0 0 0 0 0 0 0 TOP_LAYER

dl 2 --> 3 0 0 0 0 0 0 0 same as dl 1, plus DENTER/DEXIT

dl 3 --> 2 2 0 0 0 0 2 0 TOP_LAYER + CULL_LAYER + GDI_LAYER

dl 4 --> 3 3 0 0 0 0 3 0 same as dl 3, plus DENTER/DEXIT

dl 5 --> 3 0 0 3 0 0 3 0 --> useful sometimes for qmon

dl 6 --> undefined

dl 7 --> 0 0 0 0 3 0 0 0 --> only used for CORBA, here: unsupported

dl 8 --> 2 0 0 0 0 2 0 0 TOP_LAYER + COMMD_LAYER

dl 9 --> 3 0 0 0 0 3 0 0 same as dl 8, plus DENTER/DEXIT

dl 10 --> 3 3 3 0 0 0 0 3 TOP_LAYER + CULL_LAYER + BASIS_LAYER + PACK_LAYER --> usually not of any interest
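To read a level string the other way round, the field positions can be mapped back to layer names. The sh sketch below does this under two assumptions: the layer names are assigned by field position as the table above implies, and the values 1, 2 and 3 mean trace, info and both, as the earlier examples suggest. The fourth and fifth fields are not named in the table, so they get placeholder labels. describe_level is a hypothetical name, not a Grid Engine command.

```sh
# Hypothetical helper: list the layers that a level string switches on.
describe_level() {
    # Names by field position; field3 and field4 are placeholders because
    # the table above does not name those layers.
    names="TOP_LAYER CULL_LAYER BASIS_LAYER field3 field4 COMMD_LAYER GDI_LAYER PACK_LAYER"
    # Split the level string into positional parameters (unquoted on purpose).
    set -- $1
    for name in $names; do
        case "${1:-0}" in
            0) ;;
            1) echo "$name: trace" ;;
            2) echo "$name: info" ;;
            3) echo "$name: info + trace" ;;
            *) echo "$name: value $1 (not decoded in this sketch)" ;;
        esac
        # Move on to the next field, if there is one.
        if [ "$#" -gt 0 ]; then
            shift
        fi
    done
}
```

For instance, describe_level "$SGE_DEBUG_LEVEL" shows at a glance which layers a given "dl" setting has enabled.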

Limitations of the Grid Engine Monitoring System

A frequently overlooked limitation of the Grid Engine monitoring mechanism is that it can only be as good as the debug information the code is set up to produce. You might not be able to identify a problem via debug traces simply because no message related to the problem is generated. In many practical cases it is necessary to enhance existing messages or add new ones, then rebuild Grid Engine and run it in debug mode again. The monitoring system is a developer tool, so it is the developer who decides which layer a function belongs to and which messages are printed. Sometimes new modules start out with many monitoring messages in the top layer and, over time, end up in a less significant layer without any messages.

Monitoring messages must not be seen as the primary interface for reporting error conditions. If you feel that information about an error condition is lacking, first look at the output the Grid Engine component itself produces and at the messages it writes to the pertinent log files. If you find nothing there and consider making a change, the primary goal should be to deliver the error information to the administrator or user via output and log files, not to the developer via monitoring messages.

Implementation Overview

For the librmon.a library, only the files rmon_macros.c, rmon_semaph.c and rmon_monitoring_level.c are needed. The monitoring levels are defined in rmon_monitoring_level.h. The macros mentioned above are defined in sgermon.h.