Spooling is done through a spooling framework that can have different implementations, e.g. spooling to ascii files, to a database, etc.
In a first step, spooling for monitoring and accounting is done by a separate event client that subscribes to a certain set of object types and simply spools them through the spooling framework.
Qmaster still spools its own ascii files. If the spooling framework proves to be stable, qmaster will be switched over to use it, and the Grid Engine administrator can decide which spooling type to use.
If qmaster is set to spool into a database, and a common production and reporting database is used, the event client is not needed.
There is one implementation for each object type; for reading most objects, a common function read_object is used.
| Object | Implementation | Structure | Comment |
|---|---|---|---|
| Accounting | daemons/qmaster/job_exit.c, clients/qacct/qacct.c | Ascii file, one line per record, fixed delimiter | Nothing to do. The same information can come from spooling with history. |
| Calendar | common/read_write_cal.c | Ascii file per object, one whitespace separated name/value per line | |
| Checkpoint Environment | common/read_write_ckpt.c | Ascii file per object, one whitespace separated name/value per line | Sublist: queues, only names, could be stored as string |
| Cluster configuration | common/rw_configuration.c | Ascii file per object, one whitespace separated name/value per line | Probably merge with host objects |
| Complex | common/sge_complex.c | Ascii file per complex, one line per complex attribute, whitespace separated fields | Need rules for spooling of complex attributes: on/off; min, max, avg in a certain interval. |
| History | common/complex_history.c | Directory for hosts and queues, one file per timestamp, complex file format | Nothing to do. The same information can come from spooling with history. |
| Host | common/read_write_host.c | Ascii file per object, one whitespace separated name/value per line. Admin and submit hosts only contain one attribute, the name. | Admin/exec/submit hosts are different objects; they should be merged into one object. |
| Hostgroup | common/read_write_host_group.c | | Not active |
| Job | daemons/common/read_write_job.c | Directory structure, multiple binary files (cull packing buffer); the job script is stored separately | |
| Manager/Operator | daemons/qmaster/read_write_manop.c | Ascii files, one line per user name | Would better be an attribute of a user object |
| Messages | | Ascii files, one line per record, fixed delimiter | No real objects at the moment, but each message has a structure well suited for storage in database tables. |
| Parallel Environment | common/read_write_pe.c | Ascii file per object, one whitespace separated name/value per line | Sublist: queues, only names, could be stored as string |
| Project | common/read_write_userprj.c | Ascii file per object, one whitespace separated name/value per line | Usage and longterm usage are sublists, stored as name/value pairs: cpu, mem, io, finished jobs. Could also be stored as single attributes. |
| Queue | common/read_write_queue.c | Ascii file per object, one whitespace separated name/value per line | Qtype is stored as a bitfield, spooled as a list of type identifiers. Sublists: thresholds (name/value pairs), owner, user, xuser, subordinates, complexes, projects, xprojects (string lists), complex_values (name/value pairs). |
| Sharetree | common/sge_sharetree.c | One ascii file, references by node ids within the file | |
| User | common/read_write_userprj.c | Ascii file per object, one whitespace separated name/value per line, special format for project related data | |
| Usermapping | common/read_write_ume.c | | Not active |
| Userset | common/read_write_userset.c | Ascii file per object, one whitespace separated name/value per line | |
Spooling is done in a certain spooling context. A spooling context defines how objects are spooled. Multiple spooling contexts can be used within one process.

Examples for spooling types/destinations:

- Ascii file, one record per file, name/value pairs per line
- Ascii file, fixed delimiters for objects and attributes
- Cull binary file (currently used for jobs, combined with a sophisticated directory structure)
- XML files; they could easily replace the cull binary file format, as hierarchies can be implemented in a straightforward and readable way
- Database files (e.g. Xbase)
- SQL database
- LDAP repository (for certain objects like users)

Further information stored in a spooling context:

- whether to spool historical data (with timestamp) or a snapshot
- spooling type specific information, e.g. delimiters for ascii file spooling, file handles, database connections etc., if they are to be kept open
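A spooling context as described above could be modeled as a struct bundling the chosen spooling type, the history/snapshot setting, and backend specific data behind a function pointer. The following is a minimal sketch; all names and fields are illustrative assumptions, not the actual Grid Engine API:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of a spooling context (illustrative, not the real API). */
typedef struct spooling_context spooling_context;

struct spooling_context {
    const char *type_name;   /* e.g. "ascii", "cull", "sql" */
    int spool_history;       /* nonzero: spool historical data with timestamp */
    /* callback implementing the actual write for this spooling type */
    int (*spool_object)(spooling_context *ctx, const char *object_name);
    void *type_data;         /* delimiters, file handles, db connection, ... */
};

/* trivial backend for demonstration: counts how many objects were spooled */
static int counting_spool(spooling_context *ctx, const char *object_name) {
    (void)object_name;
    int *counter = (int *)ctx->type_data;
    (*counter)++;
    return 0;  /* success */
}

spooling_context make_counting_context(int *counter) {
    spooling_context ctx;
    ctx.type_name = "counting";
    ctx.spool_history = 0;
    ctx.spool_object = counting_spool;
    ctx.type_data = counter;
    return ctx;
}
```

Because each backend only differs in the callback and its type_data, several such contexts (e.g. ascii files plus a reporting database) can coexist within one process.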
Many Grid Engine object types contain sublists. In the current implementation, these hierarchical data structures are stored in different ways:

- by referencing other objects using string lists, e.g. the queue names in pe objects reference queue objects
- by using name/value pairs in string lists, e.g. complex variables set for queues are stored in a string list containing tuples in the format <name>=<value>
- by using special formats within the same ascii file (e.g. the user object or the sharetree); we should avoid these in the future
- by using the cull binary format as spool file format, including sublists; we should not differentiate between ascii and cull binary file formats in the future
- by using directory hierarchies (e.g. storing array tasks within the job's spool directory); for file based storage, we will also need these in future implementations
For the new implementation, we will have to differentiate between file based formats and database storage.

For file based storage, we should use the following strategies:

- when referencing other spooled objects, store a unique key; lists of such keys can be stored as string lists
- name/value pairs can be stored in string lists in the existing format <name>=<value>
- we will have to continue using directory hierarchies for job spooling, due to limitations on the number of files per directory
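The <name>=<value> tuples mentioned above can be handled by a small helper like the following sketch (the function name is illustrative, not the actual Grid Engine code):

```c
#include <assert.h>
#include <string.h>

/* Split a "<name>=<value>" tuple as used in string-list sublists.
 * Copies both parts into caller-provided buffers; returns 0 on success,
 * -1 if no '=' is found or a part does not fit. Illustrative sketch. */
int split_name_value(const char *tuple,
                     char *name, size_t name_len,
                     char *value, size_t value_len) {
    const char *eq = strchr(tuple, '=');
    if (eq == NULL)
        return -1;
    size_t nlen = (size_t)(eq - tuple);
    size_t vlen = strlen(eq + 1);
    if (nlen >= name_len || vlen >= value_len)
        return -1;
    memcpy(name, tuple, nlen);
    name[nlen] = '\0';
    memcpy(value, eq + 1, vlen + 1);  /* includes terminating '\0' */
    return 0;
}
```

A tuple like "cpu=12.5" from a project usage sublist would split into name "cpu" and value "12.5".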
For database storage, we should use the following strategies:

- Referencing a single other object can be done by storing a unique key.
- Referencing lists of other objects can also be done by storing a string list of keys, if we accept performance drawbacks for certain queries, e.g. "which pe's contain queue xyz". Better would be to use mapping tables, e.g. a table pe_queues that links queues to pe's. Problem: special keywords like "all" would have to be handled by either a pseudo queue "all" or a mapping entry without a queue reference.
- Name/value pairs have to be stored in additional tables. In certain cases these can be extended mapping tables, e.g. mapping complex attributes to queues and giving them a value.
The hierarchy job – ja_task – pe_task can easily be implemented by referencing the hierarchically superior object in the subordinate object: pe_tasks reference the ja_task, ja_tasks reference the job.
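The performance drawback of string-list references mentioned above can be made concrete: answering "which pe's contain queue xyz" means scanning and tokenizing every pe's stored queue list, whereas a pe_queues mapping table could be answered by an indexed lookup. A minimal sketch of the per-row scan such a query would need (names are illustrative):

```c
#include <assert.h>
#include <string.h>

/* Return nonzero if a whitespace separated key list (as stored in a
 * string-list column) contains exactly the given key. With a mapping
 * table this would be a single indexed lookup instead of a scan over
 * every row. Illustrative sketch only. */
int key_list_contains(const char *key_list, const char *key) {
    size_t klen = strlen(key);
    const char *p = key_list;
    while (*p != '\0') {
        while (*p == ' ')                   /* skip separators */
            p++;
        const char *start = p;
        while (*p != '\0' && *p != ' ')     /* find end of token */
            p++;
        if ((size_t)(p - start) == klen && strncmp(start, key, klen) == 0)
            return 1;
    }
    return 0;
}
```

Note the exact-length comparison: queue "xyz" must not match an entry "xyz.q".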
| reference type | current implementation | new filebased | new database |
|---|---|---|---|
| referencing objects | object id from cull | object id from cull | object id, either from cull or database internal serial number |
| list of references | string list or cull sublist | string list | mapping table |
| name/value pairs | string list or cull sublist | string list | mapping table with value |
| subordinate objects | special format, or spool in cull binary format | break up such hierarchies (e.g. possible in the user object), or store the data in additional files or a directory structure and reference these files | store them in additional tables and make them reference their superior object |
| job hierarchy | directory hierarchy | directory hierarchy | subordinate objects reference superior objects |
In the current implementation we have different spooling policies depending on the component that does the spooling.
The main spooling component is qmaster.
But execd also spools jobs and related information, e.g. queues or parallel environment information.
This related information reflects the status of the spooled object at the time the job was delivered to execd.
It is also possible that execd spools different job attributes than qmaster does.
Different approaches are possible to address this issue. The following discusses some ideas.
All daemons use a common database. The execds can write directly to the database; qmaster is notified about changes by the database.

Pros:

- Reduced message transfer volume between qmaster and execd
- Reduced spooling overhead in qmaster
- More accurate data in the database, as data doesn't have to go through qmaster

Cons:

- Danger of inconsistencies between data in qmaster and data in the database. This problem exists with any implementation, but most probably qmaster should be the instance that holds the most recent information.
- Scalability issues; it takes away the possibility of local spooling.

Probably not an option for the near future.
Each execd has its own area for spooling, usually file based, either on a local disk (recommended) or via NFS mount.
Use formats that allow the spooling of hierarchical data, i.e. either the cull binary format or an XML format.
As execd spools information in a different way (not all, or other, attributes than qmaster; a different strategy for sublists), the spooling implementation either has to provide means to override the default spooling strategies defined for certain object types, or two spooling strategies have to be defined per object type.

Pros:

- spooling load can easily be distributed by using local file systems
- execd is the only instance that needs to spool hierarchical data in non-normalized form, as the sub objects to be spooled are only valid for the lifetime of the one spooled object type (job related data)

Cons:

- different spooling strategies within one cluster have to be implemented
- spooling remains a bottleneck when NFS has to be used for some reason, e.g. diskless compute engines
- on very big SMP machines (some hundred processors) spooling could become a bottleneck due to slow file spooling
The cull definition will have to contain information about which fields have to be spooled and how sublists are spooled.

Replace the many similar definitions for the same data types by a combination of flags. Example: we currently have 14 definitions for the string datatype (SGE_STRING, SGE_STRINGH, SGE_STRING_HU, SGE_KSTRING, ...).

A list element definition like

SGE_KULONGH(JB_job_number)

could be replaced by

SGE_ULONG(JB_job_number, HASH | UNIQUE | SPOOL | QIDL_K)

or

SGE_LIST_ELEMENT(JB_job_number, ULONG | HASH | UNIQUE | SPOOL | SHOW | QIDL_K)

A keyword DEFAULT could be used if no special settings are needed for a field.
The descriptor field mt has lots of free space (currently only 4 bits of the (32 bit) integer are used for the data type) that could hold the following additional information:

- ARRAY: for an array implementation (optionally to be done in a separate step)
- HASH: enable hashing for the field
- UNIQUE: the attribute has unique values within one list. This is currently only checked for attributes that have hashing enabled, but could be extended to any operation setting values.
- SPOOL: shall the attribute be spooled
- SHOW: shall the attribute be shown (e.g. in qconf -s*, qstat -j etc.)
- CONFIG: shall the attribute be configurable, i.e. be contained in the temporary files created for qconf -m* or qconf -mattr operations

Probably we should use a prefix like CULL_ or SGE_ to ensure uniqueness, e.g. CULL_HASH instead of HASH.
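The bit layout sketched above could look like the following; the mask values and names are assumptions for illustration, not the actual cull definitions:

```c
#include <assert.h>

/* Hypothetical layout of the 32 bit mt descriptor field: the low 4 bits
 * keep the data type, the free upper bits carry per-attribute flags.
 * All values are illustrative. */
enum {
    CULL_TYPE_MASK = 0x0000000f,   /* low 4 bits: lUlongT, lStringT, ... */
    CULL_ARRAY     = 0x00000010,
    CULL_HASH      = 0x00000020,
    CULL_UNIQUE    = 0x00000040,
    CULL_SPOOL     = 0x00000080,
    CULL_SHOW      = 0x00000100,
    CULL_CONFIG    = 0x00000200
};

/* extract the data type from the combined descriptor word */
int mt_get_type(unsigned int mt) {
    return (int)(mt & CULL_TYPE_MASK);
}

/* test a single flag bit */
int mt_has_flag(unsigned int mt, unsigned int flag) {
    return (mt & flag) != 0;
}
```

A macro like SGE_ULONG(JB_job_number, HASH | UNIQUE | SPOOL) would then simply OR the type code with the flag bits into mt.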
To be able to interface a database using mechanisms like SQL, each object must know which attributes have changed. Otherwise the whole object has to be spooled on each spooling function call, even if only a few attributes have changed or the object hasn't changed at all.

This could be achieved by wrapping the lMultiType enum type in a struct and reserving one bit for a "changed" flag, or by adding a bitfield containing this information to the lListElem data type, which would consume less memory.
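The second variant, a changed-bitfield per list element, could be sketched as follows; all names are illustrative, not the actual lListElem implementation:

```c
#include <assert.h>

/* Sketch of attribute change tracking via a bitfield in the element,
 * one bit per attribute position. Illustrative only. */
typedef struct {
    unsigned long values[8];   /* attribute values (simplified to ulong) */
    unsigned int  changed;     /* bit i set: attribute i was modified */
} elem_sketch;

void elem_set_ulong(elem_sketch *ep, int pos, unsigned long value) {
    if (ep->values[pos] != value) {
        ep->values[pos] = value;
        ep->changed |= 1u << pos;   /* remember the change for spooling */
    }
}

/* after a successful spool operation, the dirty information is reset */
void elem_clear_changed(elem_sketch *ep) {
    ep->changed = 0;
}

int elem_is_changed(const elem_sketch *ep, int pos) {
    return (int)((ep->changed >> pos) & 1u);
}
```

A spool_attribute call would then only emit SQL UPDATEs for positions whose changed bit is set, and clear the bitfield on commit.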
A set of attribute names is generated using the NAMEDEF macros for each object type.
These attribute names have very limited use in the current implementation; they are only used for debugging purposes (lWrite* function calls).
For spooling, information output and configuration changes we also need attribute names. These names are currently hardcoded in the spooling, output and parsing functions.
It would be better to extend the existing NAMEDEF macros to create struct objects containing both the internal attribute name and an attribute name to be used for the other purposes.
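Such an extended name definition could pair the internal cull name with the external name used for spooling, output and configuration, e.g. (a sketch; the macro, struct and names are assumptions, not the real NAMEDEF macros):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sketch: pair the internal cull attribute name with an external name
 * usable for spooling, output and configuration. Illustrative only. */
typedef struct {
    const char *internal_name;   /* e.g. "JB_job_number" */
    const char *external_name;   /* e.g. "job_number" */
} attr_name;

/* stringize the internal identifier, take the external name verbatim */
#define NAMEDEF2(internal, external) { #internal, external }

static const attr_name job_names[] = {
    NAMEDEF2(JB_job_number, "job_number"),
    NAMEDEF2(JB_job_name,   "job_name")
};

/* look up the external name for a given internal name */
const char *external_name_of(const char *internal) {
    size_t i;
    for (i = 0; i < sizeof(job_names) / sizeof(job_names[0]); i++)
        if (strcmp(job_names[i].internal_name, internal) == 0)
            return job_names[i].external_name;
    return NULL;
}
```

The spooling, output and parsing functions could then share one table per object type instead of hardcoding names.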
The spooling framework interface consists of the following functions:

- create_spooling_context
- free_spooling_context
- spool_prepare
- spool_commit
- spool_object
- spool_attribute
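The function list above suggests a transaction-like usage pattern: prepare, spool one or more objects or attributes, commit. A compilable sketch with trivial bodies follows; all types, signatures and behavior are assumptions for illustration, not the real implementation:

```c
#include <assert.h>
#include <stdlib.h>

/* Possible shape of the spooling framework interface. Illustrative. */
typedef struct {
    const char *type;     /* "ascii", "cull", "sql", ... */
    int in_transaction;   /* set between spool_prepare and spool_commit */
} spooling_context;

spooling_context *create_spooling_context(const char *type) {
    spooling_context *ctx = malloc(sizeof *ctx);
    if (ctx != NULL) {
        ctx->type = type;
        ctx->in_transaction = 0;
    }
    return ctx;
}

void free_spooling_context(spooling_context *ctx) {
    free(ctx);
}

/* bracket a set of spooling calls, e.g. an SQL transaction or an
 * open/close of the spool file */
int spool_prepare(spooling_context *ctx) { ctx->in_transaction = 1; return 0; }
int spool_commit(spooling_context *ctx)  { ctx->in_transaction = 0; return 0; }

/* spool a whole object, or only a single (changed) attribute */
int spool_object(spooling_context *ctx, const void *object) {
    (void)object;
    return ctx->in_transaction ? 0 : -1;  /* require a prepare first */
}

int spool_attribute(spooling_context *ctx, const void *object, int attr_pos) {
    (void)attr_pos;
    return spool_object(ctx, object);
}
```

For an SQL backend, prepare/commit would map directly to BEGIN/COMMIT; for file backends they allow keeping the file handle open across many object writes.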
First step:
Provide an install_monitoring script to set up the event client and its spooling configuration.
Second step:
In the qmaster installation, decide which spooling type to use, with type specific further actions (for an SQL database, query the user for parameters and test the database).
The implementation can be done in separate steps that can each undergo thorough testing. Time estimates are net times and include documentation and testing.
| task | est. time [weeks] |
|---|---|
| implement the suggested cull object definition changes | 2 |
| implement tracking of attribute changes | 2 |
| implement file based spooling. Restrict to the following text file formats: | 3 |
| make a compile time switch that will make the new spooling functions used by qmaster for some selected object types. Only for test purposes. | 1 |
| implement database storage | 8 |
| create an event client that subscribes all events for all object types and spools them to a database | 2 |
| do extensive tests with qmaster using some of the new spooling functions to files and the event client attached; continue tests during the next phases | 2 |
| Sum essential steps | 20 |
| make qmaster and execd use the new spooling framework (compile time option), test different spooling strategies | 4 |
| make the new spooling framework the default, create means to configure spooling strategies during the installation process | 2 |
| create install_monitoring that will install the event client separately | 1 |
| create means to update the database structure, backup and purging of outdated information | 2 |
| build clients that use the database as source of information instead of qmaster (qhost, qstat, qacct) | 2 |
| change qconf and qalter to use the new spooling framework for reading information and for creating and processing the data to be configured | 2 |
| Sum additional steps | 13 |