Spooling is done through a spooling framework that can have different implementations, e.g. spooling to ascii files, to a database, etc.
In a first step, spooling for monitoring and accounting is done by a separate event client that subscribes to a certain set of object types and simply spools them through the spooling framework.
Qmaster still spools its own ascii files. If the spooling framework proves to be stable, qmaster will be switched over to use it, and the Grid Engine administrator can decide which spooling type to use.
If qmaster is set to spool into a database, and a common production and reporting database is used, the event client is not needed.
There is one implementation for each object type; for reading most objects, a common function read_object is used.
| Object | Implementation | Structure | Comment |
|---|---|---|---|
| Accounting | daemons/qmaster/job_exit.c, clients/qacct/qacct.c | Ascii file, one line per record, fixed delimiter | Nothing to do. The same information can come from spooling with history. |
| Calendar | common/read_write_cal.c | Ascii file per object, one whitespace separated name/value per line | |
| Checkpoint Environment | common/read_write_ckpt.c | Ascii file per object, one whitespace separated name/value per line | Sublist: queues, only names, could be stored as string |
| Cluster configuration | common/rw_configuration.c | Ascii file per object, one whitespace separated name/value per line | Probably merge with host objects |
| Complex | common/sge_complex.c | Ascii file per complex, one line per complex attribute, whitespace separated fields | Need rules for spooling of complex attributes: on/off; min, max, avg in a certain interval. |
| History | common/complex_history.c | Directory for hosts and queues, one file per timestamp, complex file format | Nothing to do. The same information can come from spooling with history. |
| Host | common/read_write_host.c | Ascii file per object, one whitespace separated name/value per line. Admin and submit hosts only contain one attribute, the name. | Admin/exec/submit hosts are different objects; they should be merged into one object. |
| Hostgroup | common/read_write_host_group.c | | Not active |
| Job | daemons/common/read_write_job.c | Directory structure, multiple binary files (cull packing buffer); the job script is stored separately | |
| Manager/Operator | daemons/qmaster/read_write_manop.c | Ascii files, one line per user name | Would better be an attribute of a user object |
| Messages | | Ascii files, one line per record, fixed delimiter | No real objects at the moment, but each message has a structure well suited for storage in database tables. |
| Parallel Environment | common/read_write_pe.c | Ascii file per object, one whitespace separated name/value per line | Sublist: queues, only names, could be stored as string |
| Project | common/read_write_userprj.c | Ascii file per object, one whitespace separated name/value per line | Usage and longterm usage are sublists, stored as name/value pairs: cpu, mem, io, finished jobs. Could also be stored as single attributes. |
| Queue | common/read_write_queue.c | Ascii file per object, one whitespace separated name/value per line | Qtype is stored as a bitfield, spooled as a list of type identifiers. Sublists: thresholds (name/value pairs), owner, user, xuser, subordinates, complexes, projects, xprojects (string lists), complex_values (name/value pairs). |
| Sharetree | common/sge_sharetree.c | One ascii file, references by node ids within the file | |
| User | common/read_write_userprj.c | Ascii file per object, one whitespace separated name/value per line, special format for project related data | |
| Usermapping | common/read_write_ume.c | | Not active |
| Userset | common/read_write_userset.c | Ascii file per object, one whitespace separated name/value per line | |
Spooling is done in a certain spooling context. A spooling context defines how objects are spooled. Multiple spooling contexts can be used within one process.

Examples for spooling types/destinations:

- Ascii file, one record per file, name/value pairs per line
- Ascii file, fixed delimiters for objects and attributes
- Cull binary file (currently used for jobs, combined with a sophisticated directory structure)
- XML files; they could easily replace the cull binary file format, as hierarchies can be implemented in a straightforward and readable way
- Database files (e.g. Xbase)
- SQL database
- LDAP repository (for certain objects like users)

Further information stored in a spooling context:

- whether to spool historical data (with timestamp) or a snapshot
- spooling type specific information, e.g. delimiters for ascii file spooling, file handles, database connections etc., if they are to be kept open
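A spooling context as described above could be modeled as a struct bundling the chosen spooling type, the history/snapshot setting, and backend specific data behind a function pointer. The following is a minimal sketch; all names and fields are illustrative assumptions, not the actual Grid Engine API:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of a spooling context (illustrative, not the real API). */
typedef struct spooling_context spooling_context;

struct spooling_context {
    const char *type_name;   /* e.g. "ascii", "cull", "sql" */
    int spool_history;       /* nonzero: spool historical data with timestamp */
    /* callback implementing the actual write for this spooling type */
    int (*spool_object)(spooling_context *ctx, const char *object_name);
    void *type_data;         /* delimiters, file handles, db connection, ... */
};

/* trivial backend for demonstration: counts how many objects were spooled */
static int counting_spool(spooling_context *ctx, const char *object_name) {
    (void)object_name;
    int *counter = (int *)ctx->type_data;
    (*counter)++;
    return 0;  /* success */
}

spooling_context make_counting_context(int *counter) {
    spooling_context ctx;
    ctx.type_name = "counting";
    ctx.spool_history = 0;
    ctx.spool_object = counting_spool;
    ctx.type_data = counter;
    return ctx;
}
```

Because each backend only differs in the callback and its type_data, several such contexts (e.g. ascii files plus a reporting database) can coexist within one process.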
Many Grid Engine object types contain sublists. In the current implementation, these hierarchical data structures are stored in different ways:

- by referencing other objects using string lists, e.g. the queue names in pe objects reference queue objects
- by using name/value pairs in string lists, e.g. complex variables set for queues are stored in a string list containing tuples in the format <name>=<value>
- by using special formats within the same ascii file (e.g. the user object or the sharetree); we should avoid these in the future
- by using the cull binary format as spool file format, including sublists; we should not differentiate between ascii and cull binary file formats in the future
- by using directory hierarchies (e.g. storing array tasks within the job's spool directory); for file based storage, we will also need these in future implementations
For the new implementation, we will have to differentiate between file based formats and database storage.

For file based storage, we should use the following strategies:

- when referencing other spooled objects, store a unique key; lists of such keys can be stored as string lists
- name/value pairs can be stored in string lists in the existing format <name>=<value>
- we will have to continue using directory hierarchies for job spooling, due to limitations on the number of files per directory
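The <name>=<value> tuples mentioned above can be handled by a small helper like the following sketch (the function name is illustrative, not the actual Grid Engine code):

```c
#include <assert.h>
#include <string.h>

/* Split a "<name>=<value>" tuple as used in string-list sublists.
 * Copies both parts into caller-provided buffers; returns 0 on success,
 * -1 if no '=' is found or a part does not fit. Illustrative sketch. */
int split_name_value(const char *tuple,
                     char *name, size_t name_len,
                     char *value, size_t value_len) {
    const char *eq = strchr(tuple, '=');
    if (eq == NULL)
        return -1;
    size_t nlen = (size_t)(eq - tuple);
    size_t vlen = strlen(eq + 1);
    if (nlen >= name_len || vlen >= value_len)
        return -1;
    memcpy(name, tuple, nlen);
    name[nlen] = '\0';
    memcpy(value, eq + 1, vlen + 1);  /* includes terminating '\0' */
    return 0;
}
```

A tuple like "cpu=12.5" from a project usage sublist would split into name "cpu" and value "12.5".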
For database storage, we should use the following strategies:

- Referencing a single other object can be done by storing a unique key.
- Referencing lists of other objects can also be done by storing a string list of keys, if we accept performance drawbacks for certain queries, e.g. "which pe's contain queue xyz". Better would be to use mapping tables, e.g. a table pe_queues that links queues to pe's. Problem: special keywords like "all" would have to be handled by either a pseudo queue "all" or a mapping entry without a queue reference.
- Name/value pairs have to be stored in additional tables. In certain cases these can be extended mapping tables, e.g. mapping complex attributes to queues and giving them a value.
The hierarchy job – ja_task – pe_task can easily be implemented by referencing the hierarchically superior object in the subordinate object: pe_tasks reference the ja_task, ja_tasks reference the job.
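The performance drawback of string-list references mentioned above can be made concrete: answering "which pe's contain queue xyz" means scanning and tokenizing every pe's stored queue list, whereas a pe_queues mapping table could be answered by an indexed lookup. A minimal sketch of the per-row scan such a query would need (names are illustrative):

```c
#include <assert.h>
#include <string.h>

/* Return nonzero if a whitespace separated key list (as stored in a
 * string-list column) contains exactly the given key. With a mapping
 * table this would be a single indexed lookup instead of a scan over
 * every row. Illustrative sketch only. */
int key_list_contains(const char *key_list, const char *key) {
    size_t klen = strlen(key);
    const char *p = key_list;
    while (*p != '\0') {
        while (*p == ' ')                   /* skip separators */
            p++;
        const char *start = p;
        while (*p != '\0' && *p != ' ')     /* find end of token */
            p++;
        if ((size_t)(p - start) == klen && strncmp(start, key, klen) == 0)
            return 1;
    }
    return 0;
}
```

Note the exact-length comparison: queue "xyz" must not match an entry "xyz.q".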
| reference type | current implementation | new filebased | new database |
|---|---|---|---|
| referencing objects | object id from cull | object id from cull | object id, either from cull or database internal serial number |
| list of references | string list or cull sublist | string list | mapping table |
| name/value pairs | string list or cull sublist | string list | mapping table with value |
| subordinate objects | special format, or spool in cull binary format | break up such hierarchies (e.g. possible in the user object), or store the data in additional files or a directory structure and reference these files | store them in additional tables and make them reference their superior object |
| job hierarchy | directory hierarchy | directory hierarchy | subordinate objects reference superior objects |
In the current implementation we have different spooling policies depending on the component that does the spooling.
The main spooling component is qmaster.
But execd also spools jobs and related information, e.g. queues or parallel environment information.
This related information reflects the status of the spooled object at the time the job was delivered to execd.
It is also possible that execd spools different job attributes than qmaster does.
Different approaches are possible to address this issue. The following discusses some ideas.
All daemons use a common database. The execds can write directly to the database; qmaster is notified about changes by the database.

Pros:

- Reduced message transfer volume between qmaster and execd
- Reduced spooling overhead in qmaster
- More accurate data in the database, as data doesn't have to go through qmaster

Cons:

- Danger of inconsistencies between data in qmaster and data in the database. This problem exists with any implementation, but most probably qmaster should be the instance that holds the most recent information.
- Scalability issues; it takes away the possibility of local spooling.

Probably not an option for the near future.
Each execd has its own area for spooling, usually file based, either on a local disk (recommended) or via NFS mount.
Use formats that allow the spooling of hierarchical data, i.e. either the cull binary format or an XML format.
As execd spools information in a different way (not all, or other, attributes than qmaster; a different strategy for sublists), the spooling implementation either has to provide means to override the default spooling strategies defined for certain object types, or two spooling strategies have to be defined per object type.

Pros:

- spooling load can easily be distributed by using local file systems
- execd is the only instance that needs to spool hierarchical data in non-normalized form, as the sub objects to be spooled are only valid for the lifetime of the one spooled object type (job related data)

Cons:

- different spooling strategies within one cluster have to be implemented
- spooling remains a bottleneck when NFS has to be used for some reason, e.g. diskless compute engines
- on very big SMP machines (some hundred processors) spooling could become a bottleneck due to slow file spooling
The cull definition will have to contain information about which fields have to be spooled and how sublists are spooled.

Replace the many similar definitions for the same data types by a combination of flags. Example: we currently have 14 definitions for the string datatype (SGE_STRING, SGE_STRINGH, SGE_STRING_HU, SGE_KSTRING, ...).

A list element definition like

SGE_KULONGH(JB_job_number)

could be replaced by

SGE_ULONG(JB_job_number, HASH | UNIQUE | SPOOL | QIDL_K)

or

SGE_LIST_ELEMENT(JB_job_number, ULONG | HASH | UNIQUE | SPOOL | SHOW | QIDL_K)

A keyword DEFAULT could be used if no special settings are needed for a field.
The descriptor field mt has lots of free space (currently only 4 bits of the (32 bit) integer are used for the data type) that could hold the following additional information:

- ARRAY: for an array implementation (optionally to be done in a separate step)
- HASH: enable hashing for the field
- UNIQUE: the attribute has unique values within one list. This is currently only checked for attributes that have hashing enabled, but could be extended to any operation setting values.
- SPOOL: shall the attribute be spooled
- SHOW: shall the attribute be shown (e.g. in qconf -s*, qstat -j etc.)
- CONFIG: shall the attribute be configurable, i.e. be contained in the temporary files created for qconf -m* or qconf -mattr operations

Probably we should use a prefix like CULL_ or SGE_ to ensure uniqueness, e.g. CULL_HASH instead of HASH.
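The bit layout sketched above could look like the following; the mask values and names are assumptions for illustration, not the actual cull definitions:

```c
#include <assert.h>

/* Hypothetical layout of the 32 bit mt descriptor field: the low 4 bits
 * keep the data type, the free upper bits carry per-attribute flags.
 * All values are illustrative. */
enum {
    CULL_TYPE_MASK = 0x0000000f,   /* low 4 bits: lUlongT, lStringT, ... */
    CULL_ARRAY     = 0x00000010,
    CULL_HASH      = 0x00000020,
    CULL_UNIQUE    = 0x00000040,
    CULL_SPOOL     = 0x00000080,
    CULL_SHOW      = 0x00000100,
    CULL_CONFIG    = 0x00000200
};

/* extract the data type from the combined descriptor word */
int mt_get_type(unsigned int mt) {
    return (int)(mt & CULL_TYPE_MASK);
}

/* test a single flag bit */
int mt_has_flag(unsigned int mt, unsigned int flag) {
    return (mt & flag) != 0;
}
```

A macro like SGE_ULONG(JB_job_number, HASH | UNIQUE | SPOOL) would then simply OR the type code with the flag bits into mt.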
To be able to interface a database using mechanisms like SQL, each object must know which attributes have changed. Otherwise the whole object has to be spooled on each spooling function call, even if only a few attributes have changed or the object hasn't changed at all.

This could be achieved by wrapping the lMultiType enum type in a struct and reserving one bit for a "changed" flag, or by adding a bitfield containing this information to the lListElem data type, which would consume less memory.
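The second variant, a changed-bitfield per list element, could be sketched as follows; all names are illustrative, not the actual lListElem implementation:

```c
#include <assert.h>

/* Sketch of attribute change tracking via a bitfield in the element,
 * one bit per attribute position. Illustrative only. */
typedef struct {
    unsigned long values[8];   /* attribute values (simplified to ulong) */
    unsigned int  changed;     /* bit i set: attribute i was modified */
} elem_sketch;

void elem_set_ulong(elem_sketch *ep, int pos, unsigned long value) {
    if (ep->values[pos] != value) {
        ep->values[pos] = value;
        ep->changed |= 1u << pos;   /* remember the change for spooling */
    }
}

/* after a successful spool operation, the dirty information is reset */
void elem_clear_changed(elem_sketch *ep) {
    ep->changed = 0;
}

int elem_is_changed(const elem_sketch *ep, int pos) {
    return (int)((ep->changed >> pos) & 1u);
}
```

A spool_attribute call would then only emit SQL UPDATEs for positions whose changed bit is set, and clear the bitfield on commit.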
A set of attribute names is generated using the NAMEDEF macros for each object type.
These attribute names have very limited use in the current implementation; they are only used for debugging purposes (lWrite* function calls).
For spooling, information output and configuration changes we also need attribute names. These names are currently hardcoded in the spooling, output and parsing functions.
It would be better to extend the existing NAMEDEF macros to create struct objects containing both the internal attribute name and an attribute name to be used for the other purposes.
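Such an extended name definition could pair the internal cull name with the external name used for spooling, output and configuration, e.g. (a sketch; the macro, struct and names are assumptions, not the real NAMEDEF macros):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sketch: pair the internal cull attribute name with an external name
 * usable for spooling, output and configuration. Illustrative only. */
typedef struct {
    const char *internal_name;   /* e.g. "JB_job_number" */
    const char *external_name;   /* e.g. "job_number" */
} attr_name;

/* stringize the internal identifier, take the external name verbatim */
#define NAMEDEF2(internal, external) { #internal, external }

static const attr_name job_names[] = {
    NAMEDEF2(JB_job_number, "job_number"),
    NAMEDEF2(JB_job_name,   "job_name")
};

/* look up the external name for a given internal name */
const char *external_name_of(const char *internal) {
    size_t i;
    for (i = 0; i < sizeof(job_names) / sizeof(job_names[0]); i++)
        if (strcmp(job_names[i].internal_name, internal) == 0)
            return job_names[i].external_name;
    return NULL;
}
```

The spooling, output and parsing functions could then share one table per object type instead of hardcoding names.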
The spooling framework interface consists of the following functions:

- create_spooling_context
- free_spooling_context
- spool_prepare
- spool_commit
- spool_object
- spool_attribute
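The function list above suggests a transaction-like usage pattern: prepare, spool one or more objects or attributes, commit. A compilable sketch with trivial bodies follows; all types, signatures and behavior are assumptions for illustration, not the real implementation:

```c
#include <assert.h>
#include <stdlib.h>

/* Possible shape of the spooling framework interface. Illustrative. */
typedef struct {
    const char *type;     /* "ascii", "cull", "sql", ... */
    int in_transaction;   /* set between spool_prepare and spool_commit */
} spooling_context;

spooling_context *create_spooling_context(const char *type) {
    spooling_context *ctx = malloc(sizeof *ctx);
    if (ctx != NULL) {
        ctx->type = type;
        ctx->in_transaction = 0;
    }
    return ctx;
}

void free_spooling_context(spooling_context *ctx) {
    free(ctx);
}

/* bracket a set of spooling calls, e.g. an SQL transaction or an
 * open/close of the spool file */
int spool_prepare(spooling_context *ctx) { ctx->in_transaction = 1; return 0; }
int spool_commit(spooling_context *ctx)  { ctx->in_transaction = 0; return 0; }

/* spool a whole object, or only a single (changed) attribute */
int spool_object(spooling_context *ctx, const void *object) {
    (void)object;
    return ctx->in_transaction ? 0 : -1;  /* require a prepare first */
}

int spool_attribute(spooling_context *ctx, const void *object, int attr_pos) {
    (void)attr_pos;
    return spool_object(ctx, object);
}
```

For an SQL backend, prepare/commit would map directly to BEGIN/COMMIT; for file backends they allow keeping the file handle open across many object writes.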
First step:
Provide an install_monitoring script to set up the event client and its spooling configuration.
Second step:
In the qmaster installation, decide which spooling type to use, with type specific further actions (for an SQL database, query the user for parameters and test the database).
The implementation can be done in separate steps that can each undergo thorough testing. Time estimates are net times and include documentation and testing.
| task | est. time [weeks] |
|---|---|
| implement the suggested cull object definition changes | 2 |
| implement tracking of attribute changes | 2 |
| implement file based spooling. Restrict to the following text file formats: | 3 |
| make a compile time switch that will make the new spooling functions used by qmaster for some selected object types. Only for test purposes. | 1 |
| implement database storage | 8 |
| create an event client that subscribes all events for all object types and spools them to a database | 2 |
| do extensive tests with qmaster using some of the new spooling functions to files and the event client attached; continue tests during the next phases | 2 |
| Sum essential steps | 20 |
| make qmaster and execd use the new spooling framework (compile time option), test different spooling strategies | 4 |
| make the new spooling framework the default, create means to configure spooling strategies during the installation process | 2 |
| create install_monitoring that will install the event client separately | 1 |
| create means to update the database structure, backup and purging of outdated information | 2 |
| build clients that use the database as source of information instead of qmaster (qhost, qstat, qacct) | 2 |
| change qconf and qalter to use the new spooling framework for reading information and for creating and processing the data to be configured | 2 |
| Sum additional steps | 13 |