Elasticsearch Guide
Slurm provides multiple Job Completion Plugins. These plugins are an orthogonal way to provide historical job accounting data for finished jobs.
In most installations, Slurm is already configured with an AccountingStorageType plugin — usually slurmdbd. In these situations, the information captured by a completion plugin is intentionally redundant.
The jobcomp/elasticsearch plugin can be used together with a web layer on top of the Elasticsearch server — such as Kibana — to visualize your finished jobs and the state of your cluster. Some of these visualization tools also let you easily create different types of dashboards, diagrams, tables, histograms and/or apply customized filters when searching.
Configuration
The plugin requires the libcurl library to be installed and usable on the controller, and the development libraries to be available at configure time. At Slurm configuration time, the configure script should emit a message like this if the appropriate library and headers have been successfully located:
checking whether libcurl is usable... yes
There are two configure options to control whether to look for the library or not and where:
--with-libcurl=PREFIX    look for the curl library in PREFIX/lib and headers in PREFIX/include (default PREFIX is curl-config path or $PATH).
--without-libcurl
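Before pointing configure at a non-default location, it can help to confirm that the directory has the layout the --with-libcurl option expects. The following is a minimal sketch, not part of Slurm: the helper name libcurl_prefix_looks_usable and the /opt/curl path in the usage line are assumptions made for illustration. It only checks for curl/curl.h under PREFIX/include and a libcurl file under PREFIX/lib; the configure script still performs its own usability check.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Slurm): report whether PREFIX has the
# layout --with-libcurl=PREFIX looks for, i.e. headers in PREFIX/include and
# the library in PREFIX/lib.
libcurl_prefix_looks_usable() {
    prefix=$1
    # libcurl installs its public header as curl/curl.h
    [ -f "$prefix/include/curl/curl.h" ] || return 1
    # Accept any shared or static libcurl file directly under PREFIX/lib
    for lib in "$prefix"/lib/libcurl.*; do
        [ -e "$lib" ] && return 0
    done
    return 1
}
```

For example: libcurl_prefix_looks_usable /opt/curl && ./configure --with-libcurl=/opt/curl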
The Elasticsearch instance should be running and reachable from every configured SlurmctldHost. Refer to the Elasticsearch Official Documentation for further details on setup and configuration.
There are four slurm.conf options related to this plugin:
- JobCompType is used to select the job completion plugin type to activate. It should be set to jobcomp/elasticsearch.
  JobCompType=jobcomp/elasticsearch
- JobCompLoc should be set to the Elasticsearch server URL, including the port number after the colon ":".
  JobCompLoc=http://<elasticserver>:<port>
  The plugin will remove any trailing slashes from that URL and append /slurm/jobcomp at the end. The first part of the path, slurm, defines the Elasticsearch index name, and the second, jobcomp, is the index type name. These concepts are further described in the Elasticsearch documentation.
- JobCompParams should be set to a comma-delimited list of options for connecting to the Elasticsearch server:
  - JobCompParams=timeout=5
    Use a timeout when communicating with the Elasticsearch server. After the timeout, error out and queue the job record for 30 seconds to try again.
  - JobCompParams=connect_timeout=5
    Use a timeout when connecting to the Elasticsearch server. After the timeout, error out and queue the job record for 30 seconds to try again.
- DebugFlags could include the Elasticsearch flag for extra debugging purposes.
  DebugFlags=Elasticsearch
  It is a good idea to turn this on initially until you have verified that finished jobs are properly indexed.
Note that you do not need to manually create the Elasticsearch index, since the plugin will automatically do so when trying to index the first job document.
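The options above can be pulled together in a short sketch. The two shell helpers below are hypothetical (their names, and the es.example.com host and 9200 port used in the usage line, are assumptions for illustration). The first imitates the documented JobCompLoc handling, dropping trailing slashes and appending /slurm/jobcomp; it is a shell rendering of the behaviour described above, not the plugin's own code. The second prints a slurm.conf fragment with the example values from this section.

```shell
#!/bin/sh
# Hypothetical helper: imitate the documented JobCompLoc handling by removing
# trailing slashes and appending /slurm/jobcomp (index name / index type name).
jobcomp_index_url() {
    url=$1
    # ${url%/} drops one trailing "/"; loop until none are left
    while [ "${url%/}" != "$url" ]; do
        url=${url%/}
    done
    printf '%s/slurm/jobcomp\n' "$url"
}

# Hypothetical helper: print a slurm.conf fragment for the given server and
# port. The timeout values are the example values from this guide.
print_jobcomp_conf() {
    server=$1
    port=$2
    cat <<EOF
JobCompType=jobcomp/elasticsearch
JobCompLoc=http://${server}:${port}
JobCompParams=timeout=5,connect_timeout=5
DebugFlags=Elasticsearch
EOF
}
```

For example, print_jobcomp_conf es.example.com 9200 prints the four settings, and jobcomp_index_url http://es.example.com:9200/ prints the URL that job documents would be sent to under the rule described above.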
Visualization
Once jobs are being indexed, it is a good idea to use a web visualization layer to analyze the data. Kibana is a recommended open-source data visualization plugin for Elasticsearch. Once installed, an Elasticsearch index name or pattern has to be configured to instruct Kibana to retrieve the data. The appropriate index for Slurm is either slurm or slurm*. Once data is loaded it is possible to create tables where each row is a finished job, ordered by any column you choose — the @end_time timestamp is suggested — and any dashboards, graphs, or other analysis of interest.
Testing and Debugging
For debugging purposes, you can use the curl command or any similar tool to perform REST requests against Elasticsearch directly. Some of the following examples using the curl tool may be useful.
Query information about the slurm index, including the document count (which should be one per job indexed):
$ curl -XGET http://localhost:9200/_cat/indices/slurm?v
health status index uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   slurm 103CW7GqQICiMQiSQv6M_g   5   1          9            0    142.8kb        142.8kb
Query all indexed jobs in the slurm index:
$ curl -XGET 'http://localhost:9200/slurm/_search?pretty=true&q=*:*' | less
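When the index holds many jobs, it may be more practical to look only at the most recently finished ones. The sketch below is a hypothetical helper (the name recent_jobs_url is an assumption) that builds a search URL using the sort and size URI-search parameters of Elasticsearch, ordered by the @end_time field mentioned in this guide. It assumes @end_time has been mapped as a sortable date field in your index, which should be verified on your installation.

```shell
#!/bin/sh
# Hypothetical helper: build a URI-search URL for the COUNT most recently
# finished jobs in the slurm index, newest first.
# Arguments: BASE (Elasticsearch URL, e.g. http://localhost:9200) and COUNT.
recent_jobs_url() {
    base=${1%/}     # tolerate a single trailing slash on BASE
    count=$2
    # q=*:* matches every document; sort and size limit and order the hits
    printf '%s/slurm/_search?pretty=true&q=*:*&sort=@end_time:desc&size=%s\n' \
        "$base" "$count"
}
```

For example: curl -XGET "$(recent_jobs_url http://localhost:9200 5)" | less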
Delete the slurm index (caution!):
$ curl -XDELETE http://localhost:9200/slurm
{"acknowledged":true}
Query information about _cat options. More can be found in the official documentation.
$ curl -XGET http://localhost:9200/_cat
Failure management
When the primary slurmctld is shut down, information about all completed but not yet indexed jobs held within the Elasticsearch plugin is saved to a file named elasticsearch_state, which is located in the StateSaveLocation. This permits the plugin to restore the information when the slurmctld is restarted, and the information will be sent to the Elasticsearch database when the connection is restored.
Acknowledgments
The Elasticsearch plugin was created as part of Alejandro Sanchez's Master's Thesis.
Last modified 5 April 2019