Slurm Profile Accounting Plugin API (AcctGatherProfileType)
Overview
This document describes Slurm profile accounting plugins and the API that defines them. It is intended as a resource to programmers wishing to write their own Slurm profile accounting plugins.
A profiling plugin allows more detailed information on the execution of jobs than can reasonably be kept in the accounting database. (Not all jobs need to be profiled.) A separate user guide documents how to use the hdf5 version of the plugin. An influxdb plugin is also available since Slurm 17.11.
The plugin provides an API for making calls to store data at various points in a step's lifecycle. It collects data periodically from potentially several sources. The periodic samples are eventually consolidated into one time series dataset for each node of a job.
The plugin's primary work is done within slurmstepd on the compute nodes. It assumes a shared file system, presumably on the management network. This avoids having to transfer files back to the controller at step end. Data is typically gathered at job_acct_gather interval or acct_gather_energy interval and the volume is not expected to be burdensome.
The hdf5 and influxdb implementations record I/O counts from the network interface (Interconnect), I/O counts from the node to the Lustre parallel file system, disk I/O counts, CPU and memory utilization for each task, and a record of energy use.
The hdf5 implementation stores this data in an HDF5 file for each step on each node of the job. A separate program (sh5util) is provided to consolidate all the node-step files into one container for the job. HDF5 is a well-known structured data format that allows different types of related data to be stored in one file. Its internal structure resembles a file system, with groups being similar to directories and datasets being similar to files. There are commodity programs, notably HDF5View, for viewing and manipulating these files. sh5util also provides some capability for extracting subsets of data for import into other analysis tools such as spreadsheets.
This plugin is incompatible with --enable-front-end. If you need to simulate a large configuration, please use --enable-multiple-slurmd.
Slurm profile accounting plugins must conform to the Slurm Plugin API with the following specifications:
const char plugin_name[]="full text name"
A free-formatted ASCII text string that identifies the plugin.
const char plugin_type[]="major/minor"
The major type must be "acct_gather_profile." The minor type can be any suitable name for the type of profile accounting. We currently use
- none — No profile data is gathered.
- hdf5 — Gets profile data about energy use, i/o sources (Lustre, network) and task data such as local disk i/o, CPU and memory usage.
const uint32_t plugin_version
If specified, identifies the version of Slurm used to build this plugin and
any attempt to load the plugin from a different version of Slurm will result
in an error.
If not specified, then the plugin may be loaded by Slurm commands and
daemons from any version, however this may result in difficult to diagnose
failures due to changes in the arguments to plugin functions or changes
in other Slurm functions used by the plugin.
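For orientation, here is a minimal sketch of these declarations as they might appear in a plugin source file. The name strings are only illustrative; SLURM_VERSION_NUMBER is provided by the Slurm headers.
#include <stdint.h>
#include <slurm/slurm.h>

/* Illustrative names only; a real plugin picks its own minor type. */
const char plugin_name[] = "AcctGatherProfile example plugin";
const char plugin_type[] = "acct_gather_profile/example";
const uint32_t plugin_version = SLURM_VERSION_NUMBER;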
The programmer is urged to study src/plugins/acct_gather_profile/acct_gather_profile_hdf5.c and src/common/slurm_acct_gather_profile.c for a sample implementation of a Slurm profile accounting plugin.
API Functions
All of the following functions are required. Functions which are not implemented must be stubbed.
int init (void)
Description:
Called when the plugin is loaded, before any other functions are
called. Put global initialization here.
Returns:
SLURM_SUCCESS on success, or
SLURM_ERROR on failure.
void fini (void)
Description:
Called when the plugin is removed. Clear any allocated storage here.
Returns: None.
Note: These init and fini functions are not the same as those described in the dlopen (3) system library. The C run-time system co-opts those symbols for its own initialization. The system _init() is called before the Slurm init(), and the Slurm fini() is called before the system's _fini().
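A minimal sketch of these two entry points, assuming the plugin has nothing to set up or tear down:
#include <slurm/slurm_errno.h>

extern int init(void)
{
        /* Global initialization would go here. */
        return SLURM_SUCCESS;
}

extern void fini(void)
{
        /* Release anything allocated in init(). */
}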
void acct_gather_profile_p_child_forked(void)
Description:
Called from slurmstepd between fork() and exec() of the application.
Close open files here.
Returns: None.
void acct_gather_profile_p_conf_options(s_p_options_t **full_options, int *full_options_cnt)
Description:
Defines configuration options in acct_gather.conf.
Arguments:
full_options (out) -- option definitions.
full_options_cnt (out) -- number of entries in full_options.
Returns: None.
void acct_gather_profile_p_conf_set(s_p_hashtbl_t *tbl)
Description:
Sets configuration options from acct_gather.conf.
Arguments:
tbl -- hash table of options.
Returns: None.
void* acct_gather_profile_p_conf_get(void)
Description:
Gets configuration options from acct_gather.conf.
Returns:
void* pointer to slurm_acct_gather_conf_t on success, or
NULL on failure.
int acct_gather_profile_p_node_step_start(stepd_step_rec_t* job)
Description:
Called once per step on each node from slurmstepd, before launching tasks.
Provides an opportunity to create files and perform other node-step level
initialization.
Arguments:
job -- stepd_step_rec_t structure containing information about the step.
Returns:
SLURM_SUCCESS on success, or
SLURM_ERROR on failure.
int acct_gather_profile_p_node_step_end(stepd_step_rec_t* job)
Description:
Called once per step on each node from slurmstepd, after all tasks end.
Provides an opportunity to close files, etc.
Arguments:
job -- stepd_step_rec_t structure containing information about the step.
Returns:
SLURM_SUCCESS on success, or
SLURM_ERROR on failure.
int acct_gather_profile_p_task_start(stepd_step_rec_t* job, uint32_t taskid)
Description:
Called once per task from slurmstepd, BEFORE node step start is called.
Provides an opportunity to gather beginning values from node counters
(bytes_read ...).
Arguments:
job -- stepd_step_rec_t structure containing information about the step.
taskid -- Slurm task id.
Returns:
SLURM_SUCCESS on success, or
SLURM_ERROR on failure.
int acct_gather_profile_p_task_end(stepd_step_rec_t* job, pid_t taskpid)
Description:
Called once per task from slurmstepd.
Provides an opportunity to put final data for a task.
Arguments:
job -- stepd_step_rec_t structure containing information about the step.
taskpid -- task process id (pid_t).
Returns:
SLURM_SUCCESS on success, or
SLURM_ERROR on failure.
int acct_gather_profile_p_add_sample_data(uint32_t type, void* data)
Description:
Puts data at the Node Samples level. Typically called from something invoked
at either the job_acct_gather interval or the acct_gather_energy interval.
All samples in the same group will eventually be consolidated into one
time series.
Arguments:
type -- identifies the type of data.
data -- data structure to be put to the file.
Returns:
SLURM_SUCCESS on success, or
SLURM_ERROR on failure.
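To illustrate the shape of these callbacks, the following fragment sketches stub implementations of two of them. It is only a sketch: the temporary file path is made up, the stepd_step_rec_t definition comes from Slurm's internal slurmd headers (omitted here), and none of this is taken from the hdf5 plugin's actual source.
#include <stdio.h>
#include <stdint.h>

#include <slurm/slurm_errno.h>  /* SLURM_SUCCESS, SLURM_ERROR */

/* Hypothetical per-node state; a real plugin keeps richer context. */
static FILE *profile_file = NULL;

extern int acct_gather_profile_p_node_step_start(stepd_step_rec_t *job)
{
        (void) job;
        /* Open the per-node, per-step output file (location illustrative). */
        profile_file = fopen("/tmp/profile_node_step.tmp", "w");
        return profile_file ? SLURM_SUCCESS : SLURM_ERROR;
}

extern int acct_gather_profile_p_add_sample_data(uint32_t type, void *data)
{
        (void) data;
        if (!profile_file)
                return SLURM_ERROR;
        /* A real plugin would dispatch on 'type' and serialize 'data'
         * into the appropriate time-series group. */
        fprintf(profile_file, "sample type=%u\n", (unsigned int) type);
        return SLURM_SUCCESS;
}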
Parameters
Parameters in slurm.conf can be used to configure the plugin and the
frequency at which to gather node profile data. The acct_gather.conf file
provides profile configuration options.
The default profile value is none, which means no profiling will be done for
jobs. The hdf5 plugin also adds its own options to acct_gather.conf. Use
caution when setting the default to values other than none, as a file will be
created for each job. This option is provided for test systems.
Most of the sources of profile data are associated with various acct_gather
plugins. The acct_gather.conf file has settings for the various sampling
mechanisms that can be used to change the frequency at which samples occur.
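For illustration, enabling profiling with the hdf5 plugin might involve settings along the following lines. The option names come from slurm.conf and acct_gather.conf; the values and the directory path are only examples.
# slurm.conf (example values)
AcctGatherProfileType=acct_gather_profile/hdf5
JobAcctGatherFrequency=30

# acct_gather.conf (example values)
ProfileHDF5Dir=/app/slurm/profile_data
ProfileHDF5Default=None

# Individual jobs can then request profiling, e.g. srun --profile=all ...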
Data Types
A plugin-like structure is implemented to generalize HDF5 data operations from
various sources. A C typedef is defined for each datatype. These declarations
are in /common/slurm_acct_gather_profile.h so the datatypes are common to all
profile plugins. The operations are defined via structures of function
pointers, declared in /plugins/acct_gather_profile/common/profile_hdf5.h, and
should work for implementations other than hdf5.
Functions must be implemented to perform the various operations for each
datatype. The plugin API includes an argument for the datatype so that the
implementation of that API can call the specific operation for that datatype.
Groups in the HDF5 file containing a dataset include an attribute for the
datatype so that the program that merges step files into the job file can
discover the type of the group and do the right thing.
For example, the typedef for the energy sample datatype:
typedef struct profile_energy {
char tod[TOD_LEN]; // Not used in node-step
time_t time;
uint64_t watts;
uint64_t cpu_freq;
} profile_energy_t;
A factory method is implemented for each type to construct a structure with
functions implementing the various operations for the type. The following
structure of functions is required for each type.
/*
* Structure of function pointers of common operations on a
* profile data type. (Some may be stubs, particularly if the data type
* does not represent a time series.)
* dataset_size -- size of one dataset (structure size).
* create_memory_datatype -- creates hdf5 memory datatype
* corresponding to the datatype structure.
* create_file_datatype -- creates hdf5 file datatype
* corresponding to the datatype structure.
* create_s_memory_datatype -- creates hdf5 memory datatype
* corresponding to the summary datatype structure.
* create_s_file_datatype -- creates hdf5 file datatype
* corresponding to the summary datatype structure.
* init_job_series -- allocates a buffer for a complete time
* series (in job merge) and initializes each member
* merge_step_series -- merges all the individual time samples
* into a single data set with one item per sample.
* Data items can be scaled (e.g. subtracting beginning time),
* differenced (to show counts in an interval), or otherwise
* transformed as appropriate for the series.
* series_total -- accumulate or average members in the entire
* series to be added to the file as totals for the node or
* task.
* extract_series -- format members of a structure for putting
* to a file data extracted from a time series to be imported into
* another analysis tool. (e.g. format as comma separated value.)
* extract_totals -- format members of a structure for putting
* to a file data extracted from a time series total to be imported
* into another analysis tool. (e.g. format as comma separated values.)
*/
typedef struct profile_hdf5_ops {
int (*dataset_size) ();
hid_t (*create_memory_datatype) ();
hid_t (*create_file_datatype) ();
hid_t (*create_s_memory_datatype) ();
hid_t (*create_s_file_datatype) ();
void* (*init_job_series) (int, int);
void (*merge_step_series) (hid_t, void*, void*, void*);
void* (*series_total) (int, void*);
void (*extract_series) (FILE*, bool, int, int, char*,
char*, void*);
void (*extract_totals) (FILE*, bool, int, int, char*,
char*, void*);
} profile_hdf5_ops_t;
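As a hedged illustration of how the factory method mentioned above might wire a datatype to its operations, the sketch below returns the ops structure for an energy series. PROFILE_ENERGY_DATA and the energy_* helpers are hypothetical placeholders, not names taken from the Slurm source.
/* Illustrative sketch only. */
typedef enum {
        PROFILE_ENERGY_DATA,
        PROFILE_TASK_DATA
} profile_data_kind_t;

static int energy_dataset_size(void)
{
        return (int) sizeof(profile_energy_t);
}

/* ...the remaining energy_* operations would be implemented similarly... */

static profile_hdf5_ops_t *profile_factory(profile_data_kind_t kind)
{
        static profile_hdf5_ops_t energy_ops = {
                .dataset_size = energy_dataset_size,
                /* .create_memory_datatype = energy_create_memory_datatype, ... */
        };

        switch (kind) {
        case PROFILE_ENERGY_DATA:
                return &energy_ops;
        default:
                return NULL;    /* caller must handle unknown types */
        }
}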
Interaction between type and hdf5
Note there are two different data types for supporting time series:
1) A primary type is defined for gathering data in the node step file.
It is typically named profile_{series_name}_t.
2) Another type is defined for summarizing series totals.
It is typically named profile_{series_name}_s_t. It does not have a 'factory'.
It is only used in the functions of the primary data type, and the primary
type's structure has operations to create the appropriate hdf5 objects.
When adding a new type, the profile_factory function has to be modified to
return an ops structure for the type.
Last modified 27 March 2015