Elasticsearch Guide

Slurm provides multiple Job Completion Plugins. These plugins are an orthogonal way to provide historical job accounting data for finished jobs.

In most installations, Slurm is already configured with an AccountingStorageType plugin — usually slurmdbd. In these situations, the information captured by a completion plugin is intentionally redundant.

The jobcomp/elasticsearch plugin can be used together with a web layer on top of the Elasticsearch server — such as Kibana — to visualize your finished jobs and the state of your cluster. Some of these visualization tools also let you easily create different types of dashboards, diagrams, tables, histograms and/or apply customized filters when searching.

Configuration

The plugin requires the libcurl library to be installed and usable on the controller, and the development libraries to be available at configure time. At Slurm configuration time, the configure script should emit a message like this if the appropriate library and headers have been successfully located:

checking whether libcurl is usable... yes

There are two configure options that control whether and where to look for the library:
--with-libcurl=PREFIX	look for the curl library in PREFIX/lib and headers in PREFIX/include (default PREFIX is curl-config path or $PATH).
--without-libcurl

The Elasticsearch instance should be running and reachable from every configured SlurmctldHost. Refer to the Elasticsearch Official Documentation for further details on setup and configuration.

There are four slurm.conf options related to this plugin:

  • JobCompType is used to select the job completion plugin type to activate. It should be set to jobcomp/elasticsearch.
    JobCompType=jobcomp/elasticsearch
  • JobCompLoc should be set to the Elasticsearch server URL, including the port number after the colon ":".
    JobCompLoc=http://<elasticserver>:<port>
    The plugin will remove any trailing slashes from that URL, and append /slurm/jobcomp at the end. The first part of the path — slurm — defines the Elasticsearch index name and the second — jobcomp — is the index type name. These concepts are further described in the Elasticsearch documentation.
  • JobCompParams accepts a comma-delimited list of options for the connection to the Elasticsearch server:
    • JobCompParams=timeout=5
      Use a timeout when communicating with the Elasticsearch server. After the timeout, the plugin reports an error and queues the job record to try again after 30 seconds.
    • JobCompParams=connect_timeout=5
      Use a timeout when connecting to the Elasticsearch server. After the timeout, the plugin reports an error and queues the job record to try again after 30 seconds.
  • DebugFlags could include the Elasticsearch flag for extra debugging purposes.
    DebugFlags=Elasticsearch
    It is a good idea to turn this on initially until you have verified that finished jobs are properly indexed. Note that you do not need to manually create the Elasticsearch index, since the plugin will automatically do so when trying to index the first job document.
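
The URL handling described for JobCompLoc above (trailing slashes removed, then /slurm/jobcomp appended) can be sketched in shell to predict the endpoint the plugin would target. This is only an illustration of the documented behavior, not the plugin's own code; the function name jobcomp_index_url and the example URL are hypothetical.

```shell
# Hypothetical helper: mirrors the documented handling of JobCompLoc.
# $1: the value configured as JobCompLoc in slurm.conf.
jobcomp_index_url() {
    url=$1
    # Remove every trailing slash, one at a time.
    while [ "${url%/}" != "$url" ]; do
        url=${url%/}
    done
    # Append the index name (slurm) and the index type name (jobcomp).
    printf '%s/slurm/jobcomp\n' "$url"
}

# Example with an assumed local server; prints
# http://localhost:9200/slurm/jobcomp
jobcomp_index_url 'http://localhost:9200//'
```

Comparing this result with the URL that appears in the slurmctld log while DebugFlags=Elasticsearch is enabled may help spot a mistyped JobCompLoc.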

Visualization

Once jobs are being indexed, it is a good idea to use a web visualization layer to analyze the data. Kibana is a recommended open-source data visualization plugin for Elasticsearch. Once installed, an Elasticsearch index name or pattern has to be configured to instruct Kibana to retrieve the data. The appropriate index for Slurm is either slurm or slurm*. Once data is loaded it is possible to create tables where each row is a finished job, ordered by any column you choose — the @end_time timestamp is suggested — and any dashboards, graphs, or other analysis of interest.

Testing and Debugging

For debugging purposes, you can use the curl command or any similar tool to perform REST requests against Elasticsearch directly. Some of the following examples using the curl tool may be useful.

Query information about the slurm index, including the document count (which should be one per job indexed):

$ curl -XGET http://localhost:9200/_cat/indices/slurm?v
health status index uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   slurm 103CW7GqQICiMQiSQv6M_g   5   1          9            0    142.8kb        142.8kb
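
To compare the document count against the number of finished jobs in a script, the docs.count column can be pulled out of that output. The following is a minimal sketch, assuming the header row is present (the ?v parameter) and that no cell in the slurm row is empty; the function name slurm_docs_count is hypothetical.

```shell
# Hypothetical helper: print docs.count for the "slurm" index from
# `_cat/indices?v` output read on stdin. Columns are located by header
# name rather than by fixed position.
slurm_docs_count() {
    awk '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == "index")      idx = i
                if ($i == "docs.count") cnt = i
            }
            next
        }
        idx && cnt && $idx == "slurm" { print $cnt }
    '
}

# Usage against a live server (assumed to listen on localhost:9200):
#   curl -s 'http://localhost:9200/_cat/indices/slurm?v' | slurm_docs_count
```

Fed the sample output shown above, the function prints 9.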

Query all indexed jobs in the slurm index:

$ curl -XGET 'http://localhost:9200/slurm/_search?pretty=true&q=*:*' | less
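
The q parameter accepts a query string, so the same request can be narrowed to specific documents instead of matching everything. Below is a small sketch that builds such a URL; the helper name slurm_search_url is hypothetical, and the field name jobid is an assumption about how the plugin names the job ID in indexed documents, so check one indexed document for the actual field names first.

```shell
# Hypothetical helper: build a search URL for the slurm index.
# $1: Elasticsearch base URL, $2: query string, $3: max hits (default 10).
slurm_search_url() {
    printf '%s/slurm/_search?pretty=true&q=%s&size=%s\n' "$1" "$2" "${3:-10}"
}

# Usage against a live server (assumed to listen on localhost:9200),
# looking up a single job by an assumed "jobid" field:
#   curl -XGET "$(slurm_search_url http://localhost:9200 'jobid:1234' 1)"
```

Query strings containing spaces or other reserved characters would need URL encoding before being passed in; the sketch does not do that.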

Delete the slurm index (caution!):

$ curl -XDELETE http://localhost:9200/slurm
{"acknowledged":true}

Query information about _cat options. More can be found in the official documentation.

$ curl -XGET http://localhost:9200/_cat

Failure management

When the primary slurmctld is shut down, information about all completed but not yet indexed jobs held within the Elasticsearch plugin is saved to a file named elasticsearch_state, located in the StateSaveLocation. This permits the plugin to restore the information when slurmctld is restarted; the restored records are then sent to the Elasticsearch database once the connection is restored.
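
To locate that file on a given controller, the StateSaveLocation can be read from the running configuration. The sketch below assumes the "Name = value" layout printed by scontrol show config; the function name elasticsearch_state_path is hypothetical.

```shell
# Hypothetical helper: read `scontrol show config` output on stdin and
# print the expected path of the plugin's state file.
elasticsearch_state_path() {
    awk -F' *= *' '$1 == "StateSaveLocation" { print $2 "/elasticsearch_state" }'
}

# Usage on the controller:
#   ls -l "$(scontrol show config | elasticsearch_state_path)"
```

The presence of the file only shows that state was saved at shutdown; it does not by itself indicate how many job records are still waiting to be indexed.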

Acknowledgments

The Elasticsearch plugin was created as part of Alejandro Sanchez's Master's Thesis.

Last modified 5 April 2019