This file describes changes in recent versions of Slurm. It primarily
documents those changes that are of interest to users and administrators.

* Changes in Slurm 20.02.6
==========================
 -- Fix sbcast --fanout option.
 -- Tighten up keyword matching for --dependency.
 -- Fix "squeue -S P" not sorting by partition name.
 -- Fix segfault in slurmctld if group resolution fails during job credential
    creation.
 -- sacctmgr - Honor PreserveCaseUser when creating users with load command.
 -- Avoid attempting to schedule jobs on magnetic reservations when they aren't
    allowed.
 -- Always make sure we clear the magnetic flag from a job.
 -- In backfill avoid NULL pointer dereference.
 -- Fix segfault at end of slurmctld if you have a magnetic reservation and
    you shut down the slurmctld.
 -- Silence security warning when Slurm is trying a job for a magnetic
    reservation.
 -- Have sacct exit correctly when a user/group id isn't valid.
 -- Remove extra \n from invalid user/group id error message.
 -- Detect when extern steps trigger OOM events and mark extern step correctly.
 -- pam_slurm_adopt - permit root access to the node before reading the config
    file, which will give root a chance to fix the config if missing or broken.
 -- Reset DefMemPerCPU, MaxMemPerCPU, and TaskPluginParam (among other minor
    flags) on reconfigure.
 -- Fix incorrect memory handling of mail_user when updating mail_type=none.
 -- Handle mail_user and mail_type independently.
 -- Fix thread-safety issue with assoc_mgr_get_admin_level().
 -- Ignore step features if equal to job features.
 -- Fix slurmstepd segfault caused by incorrect strtok() usage.
 -- CRAY - Remove unneeded ATP spank plugin from ansible playbook.
 -- Fix core selection for exclusive step on nodes where CPUs == Cores.
 -- Fix topology aware scheduling reservations.
 -- Fix loading cpus_per_task on a job from state file.
 -- When a partition has no nodes fix estimate of max cpus possible on a job
    trying to run there.
 -- In cons_tres fix sorting functions to handle node/topo weight correctly.
 -- Fix regression in 20.02.5 where you couldn't request constraints with a
    simple & and a count.
 -- Limit the number of threads for servicing emails.
 -- Avoid possible double init race condition in assoc_mgr_lock().
 -- Add missing locks in slurm_cred_handle_reissue().
 -- Add missing locks in slurm_cred_revoked().
 -- Fix slurmctld segfault due to tight reconfigure RPC requests by serializing
    the RPC handler processing logic.
 -- Use _exit() instead of exit() after fork().
 -- Perl API - fix hang reading config in configless environments.
 -- slurmrestd - request detailed node information to populate GRES fields.
 -- slurmrestd - request detailed job information to populate GRES fields.
 -- Fix job license update bug on array tasks or hetjob components.
 -- Fix job partition update bug on array tasks or hetjob components.
 -- Fix slurmctld segfault on _pick_best_nodes() when processing a job request
    with XOR'd constraints and no nodeset has the feature.
 -- Fix job requests rejected with incorrect NODE_CONFIG_UNAVAIL when nodes are
    actually only busy due to an overlapping MAINT reservation.
 -- Fix sacctmgr allowing the deletion of a user's default account.
 -- Fix srun and other Slurm commands running within a "configless" salloc
    terminal.
 -- MySQL - Correctly handle QOS deletion from association tables.
 -- Fix update of First_Cores flag in a reservation.
 -- Fix parsing of update reservation flags.
 -- Fix --switches for cons_tres.
 -- Retry connection on ETIMEDOUT in slurm_send_addr_recv_msgs.
 -- Fix wait for RPC_PROLOG_LAUNCH notification 2*MessageTimeout.
 -- Have slurm_send_addr_recv_msgs conn_timeout to match rpc_wait in slurmd.
 -- pam_slurm_adopt - operate correctly even if ConstrainRAMSpace is not
    enabled on the node by falling back to the cpuset, devices, or freezer
    subsystem instead.
 -- slurmrestd - use memmove() instead of memcpy() in string manipulation to
    avoid bugs related to overlapping memory regions.
 -- slurmrestd - avoid xassert() failure on duplicated headers in request.
 -- Remove stale 'ReqNodeNotAvail, Reserved for maintenance' message from
    pending jobs after a maintenance reservation ended.
 -- MySQL - Stop steps from printing when outside time range.
 -- Fixed kmem limit calculation to use MaxKmemPercent correctly.
 -- Fix initialization of cpuset.mems/cpus on uid cgroup subdir.
 -- MySQL - Remove potential race condition when sending updates to a cluster
    and commit_delay used.
 -- Avoid double free of step_record_t in the slurmctld when node is removed
    from config.
 -- cons_tres - fix regression regarding gpus with --cpus-per-task.
 -- Don't send job completion email for revoked federation jobs.
 -- PMIx - fix potential buffer overflows from use of unpackmem().
    CVE-2020-27745.
 -- X11 forwarding - fix potential leak of the magic cookie when sent as an
    argument to the xauth command. CVE-2020-27746.

* Changes in Slurm 20.02.5
==========================
 -- Fix leak of TRESRunMins when job time is changed with --time-min.
 -- pam_slurm - explicitly initialize slurm config to support configless mode.
 -- scontrol - Fix exit code when creating/updating reservations with wrong
    Flags.
 -- When a GRES has a no_consume flag, report 0 for allocated.
 -- Fix cgroup cleanup by jobacct_gather/cgroup.
 -- When creating reservations/jobs don't allow counts on a feature unless
    using an XOR.
 -- Improve number of boards discovery.
 -- Fix updating a reservation NodeCnt on a zero-count reservation.
 -- slurmrestd - provide explicit error messages when PSK auth fails.
 -- cons_tres - fix job requesting single gres per-node getting two or more
    nodes with fewer CPUs than requested per-task.
 -- cons_tres - fix calculation of cores when using gres and cpus-per-task.
 -- cons_tres - fix job not getting access to socket without GPU or with less
    than --gpus-per-socket when not enough cpus available on required socket
    and not using --gres-flags=enforce-binding.
 -- Fix HDF5 type version build error.
 -- Fix creation of CoreCnt only reservations when the first node isn't
    available.
 -- Fix wrong DBD Agent queue size in sdiag when using accounting_storage/none.
 -- Improve job constraints XOR option logic.
 -- Fix preemption of hetjobs when needed nodes not in leader component.
 -- Fix wrong bit_or() messing potential preemptor jobs node bitmap, causing
    bad node deallocations and even allocation of nodes from other partitions.
 -- Fix double-deallocation of preempted non-leader hetjob components.
 -- slurmdbd - prevent truncation of the step nodelists over 4095.
 -- Fix nodes remaining in drain state after rebooting with ASAP option.

* Changes in Slurm 20.02.4
==========================
 -- srun - suppress job step creation warning message when waiting on
    PrologSlurmctld.
 -- slurmrestd - fix incorrect return values in data_list_for_each() functions.
 -- mpi/pmix - fix issue where HetJobs could fail to launch.
 -- slurmrestd - set content-type header in responses.
 -- Fix cons_res GRES overallocation for --gres-flags=disable-binding.
 -- Fix cons_res incorrectly filtering cores with respect to GRES locality for
    --gres-flags=disable-binding requests.
 -- Fix regression where a dependency on multiple jobs in a single array using
    underscores would only add the first job.
 -- slurmrestd - fix corrupted output due to incorrect use of memcpy().
 -- slurmrestd - address a number of minor Coverity warnings.
 -- Handle retry failure when slurmstepd is communicating with srun correctly.
 -- Fix jobacct_gather possibly duplicate stats when _is_a_lwp error shows up.
 -- Fix tasks binding to GRES which are closest to the allocated CPUs.
 -- Fix AMD GPU ROCM 3.5 support.
 -- Fix handling of job arrays in sacct when querying specific steps.
 -- slurmrestd - avoid fallback to local socket authentication if JWT
    authentication is ill-formed.
 -- slurmrestd - restrict ability of requests to use different authentication
    plugins.
 -- slurmrestd - unlink named unix sockets before closing.
 -- slurmrestd - fix invalid formatting in openapi.json.
 -- Fix batch jobs stuck in CF state on FrontEnd mode.
 -- Add a separate explicit error message when rejecting changes to active node
    features.
 -- cons_common/job_test - fix slurmctld SIGABRT due to double-free.
 -- Fix updating reservations to set the duration correctly if updating the
    start time.
 -- Fix update reservation to promiscuous mode.
 -- Fix override of job tasks count to max when ntasks-per-node present.
 -- Fix min CPUs per node not being at least CPUs per task requested.
 -- Fix CPUs allocated to match CPUs requested when requesting GRES and
    threads per core equal to one.
 -- Fix NodeName config parsing with Boards and without CPUs.
 -- Ensure SLURM_JOB_USER and SLURM_JOB_UID are set in SrunProlog/Epilog.
 -- Fix error messages for certain invalid salloc/sbatch/srun options.
 -- pmi2 - clean up sockets at step termination.
 -- Fix 'scontrol hold' to work with 'JobName'.
 -- sbatch - handle --uid/--gid in #SBATCH directives properly.
 -- Fix race condition in job termination on slurmd.
 -- Print specific error messages if trying to use certain
    priority/multifactor factors that cannot work without SlurmDBD.
 -- Avoid partial GRES allocation when --gpus-per-job is not satisfied.
 -- Cray - Avoid referencing a variable outside of its correct scope when
    dealing with creating steps within a het job.
 -- slurmrestd - correctly handle larger addresses from accept().
 -- Avoid freeing wrong pointer with SlurmctldParameters=max_dbd_msg_action
    with another option after that.
 -- Restore MCS label when suspended job is resumed.
 -- Fix insufficient lock levels.
 -- slurmrestd - use errno from job submission.
 -- Fix "user" filter for sacctmgr show transactions.
 -- Fix preemption logic.
 -- Fix no_consume GRES for exclusive (whole node) requests.
 -- Fix regression in 20.02 that caused an infinite loop in slurmctld when
    requesting --distribution=plane for the job.
 -- Fix parsing of the --distribution option.
 -- Add CONF READ_LOCK to _handle_fed_send_job_sync.
 -- prep/script - always call slurmctld PrEp callback in _run_script().
 -- Fix node estimation for jobs that use GPUs or --cpus-per-task.
 -- Fix jobcomp, job_submit and cli_filter Lua implementation plugins causing
    slurmctld and/or job submission CLI tools segfaults due to bad return
    handling when the respective Lua script failed to load.
 -- Fix propagation of gpu options through hetjob components.
 -- Add SLURM_CLUSTERS environment variable to scancel.
 -- Fix packing/unpacking of "unlinked" jobs.
 -- Connect slurmstepd's stderr to srun for steps launched with --pty.
 -- Handle MPS correctly when doing exclusive allocations.
 -- slurmrestd - fix compiling against libhttpparser in a non-default path.
 -- slurmrestd - avoid compilation issues with libhttpparser < 2.6.
 -- Fix compile issues when compiling slurmrestd without --enable-debug.
 -- Reset idle time on a reservation that is getting purged.
 -- Fix reoccurring reservations that have Purge_comp= to keep correct
    duration if they are purged.
 -- scontrol - changed the "PROMISCUOUS" flag to "MAGNETIC".
 -- Early return from epilog_set_env in case of no_consume.
 -- Fix cons_common/job_test start time discovery logic to prevent skewed
    results between "will run test" executions.
 -- Ensure TRESRunMins limits are maintained during "scontrol reconfigure".
 -- Improve error message when host lookup fails.

* Changes in Slurm 20.02.3
==========================
 -- Factor in ntasks-per-core=1 with cons_tres.
 -- Fix formatting in error message in cons_tres.
 -- Fix calling stat on a NULL variable.
 -- Fix minor memory leak when using reservations with flags=first_cores.
 -- Fix gpu bind issue when CPUs=Cores and ThreadsPerCore > 1 on a node.
 -- Fix --mem-per-gpu for heterogeneous --gres requests.
 -- Fix slurmctld load order in load_all_part_state().
 -- Fix race condition not finding jobacct gather task cgroup entry.
 -- Suppress error message when selecting nodes on disjoint topologies.
 -- Improve performance of _pack_default_job_details() with large number of job
    arguments.
 -- Fix archive loading previous to 17.11 jobs per-node req_mem.
 -- Fix regression validating that --gpus-per-socket requires
    --sockets-per-node for steps. Should only validate allocation requests.
 -- error() instead of fatal() when parsing an invalid hostlist.
 -- nss_slurm - fix potential deadlock in slurmstepd on overloaded systems.
 -- cons_tres - fix --gres-flags=enforce-binding and related --cpus-per-gres.
 -- cons_tres - Allocate lowest numbered cores when filtering cores with gres.
 -- Fix getting system counts for named GRES/TRES.
 -- MySQL - Fix for handling typed GRES for association rollups.
 -- Fix step allocations when tasks_per_core > 1.
 -- Fix allocating more GRES than requested when asking for multiple GRES
    types.

* Changes in Slurm 20.02.2
==========================
 -- Fix slurmctld segfault when checking no_consume GRES node allocation
    counts.
 -- Fix resetting of cloud_dns on a reconfigure.
 -- squeue - change output for dependency column to use "(null)" instead of ""
    for no dependencies as documented in the man page, and used by other
    columns.
 -- Clear node_cnt_wag after job update.
 -- Fix regression where AccountingStoreJobComment was not defaulting to 'yes'.
 -- Send registration message immediately after a node is resumed.
 -- Cray - Fix hetjobs when using only a single component in the step launch.
 -- Cray - Fix hetjobs launched without component 0.
 -- Cray - Quiet cookies missing message which is expected for hetjobs.
 -- Fix handling of -m/--distribution options for across socket/2nd level by
    task/affinity plugin.
 -- Fix grp_node_bitmap error when slurmctld started before slurmdbd.
 -- Fix scheduling issue when there are not enough nodes available to run a job
    resulting in possible job starvation.
 -- Make it so mpi/cray_shasta appears in srun --mpi=list.
 -- Don't requeue jobs that have been explicitly canceled.
 -- Fix error message for a regular user trying to update licenses on a running
    job.
 -- Fix backup slurmctld handling for logrotation via SIGUSR2.
 -- Fix reservation feature specification when looking for inactive features
    after active features fails.
 -- Prevent misleading error messages for reservation creation.
 -- Print message in scontrol when a request fails for not having enough nodes.
 -- Fix duplicate output in sacct with multiple resv events.
 -- auth/jwt - return correct gid for a given user. This was incorrectly
    assuming the user's primary group name matched their username.
 -- slurmrestd - permit non-SlurmUser/root job submission.
 -- Use host IP if hostname unknown for job submission for allocating node.
 -- Fix issue with primary_slurmdbd_resumed_operation trigger not happening on
    slurmctld restart.
 -- Fix race in acct_gather_interconnect/ofed on step termination.
 -- Fix typo of SlurmctldProlog -> PrologSlurmctld in error message.
 -- slurm.spec - add SuSE-specific dependencies for optional slurmrestd
    package.
 -- Fix FreeBSD build issues.
 -- Fixed sbatch not processing --ignore-pbs in batch script.
 -- Don't clear the qos_id of an invalid QOS.
 -- Allow a job that was once FAIL_[QOS|ACCOUNT] to be eligible again if the
    qos|account limitation is remedied.
 -- Fix core reservations using the FLEX flag to allow use of resources outside
    of the reservation allocation.
 -- Fix MPS without File with 1 GPU, and without GPUs.
 -- Add FreeBSD support to proctrack/pgid plugin.
 -- Fix remote dependency testing for meta job in job array.
 -- Fix preemption when dealing with a job array.
 -- Don't send remote non-pending singleton dependencies on federation update.
 -- slurmrestd - fix crash on empty query.
 -- Fix race condition which could lead to invalid references in backfill.
 -- Fix edge case in _remove_job_hash().
 -- Fix exit code when using --cluster/-M client options.
 -- Fix compilation issues in GCC10.
 -- Fix invalid references when federated job is revoked while in backfill
    loop.
 -- Fix distributing job steps across idle nodes within a job.
 -- Fix detected floating reservation overlapping.
 -- Break infinite loop in cons_tres dealing with incorrect tasks per tres
    request resulting in slurmctld hang.
 -- Send the current (not the previous) reason for a pending job to client
    commands like squeue/scontrol.
 -- Fix incorrect lock levels for select_g_reconfigure().
 -- Handle hidden nodes correctly in slurmrestd.
 -- Allow sacctmgr to use MaxSubmitP[U|A] as format options.
 -- Fix segfault when trying to delete a corrupted association.
 -- Fix setting ntasks-per-core when using --multithread.
 -- Only override job wait reason to priority if Reason=None or
    Reason=Resources.
 -- Perl API / seff - fix missing symbol issue with accounting_storage/slurmdbd.
 -- slurm.spec - add --with cray_shasta option.
 -- Downgrade "Node config differ.." error message if config_overrides enabled.
 -- Add client error when using --gpus-per-socket without --sockets-per-node.
 -- Fix nvml/rsmi debug statements making it to stderr.
 -- NodeSets - fix slurmctld segfault in newer glibc if any nodes have no
    defined features.
 -- ConfigLess - write out plugstack config to correct config file name in the
    config cache.
 -- priority/multifactor - gracefully handle NULL list of associations or array
    of siblings when calculating FairTree fairshare.
 -- Fix cons_tres --exclusive=user to allocate only requested number of CPUs.
 -- Add MySQL deadlock detection and automatic retry mechanism.
 -- Reject repeating floating reservations as they aren't supported.
 -- Fix testing of reservation flags that may be NO_VAL64.
 -- Fix _verify_node_state memory requested as --mem-per-gpu DefMemPerGPU.
 -- Fix DependencyNeverSatisfied not set as the job's state reason if
    kill_invalid_depend or --kill-on-invalid-dep are used.
 -- pam_slurm_adopt - explicitly call slurm_conf_init().
 -- configless - fix plugstack.conf handling for client commands.
 -- Set SLURM_JOB_USER and SLURM_JOB_UID in task_epilog correctly.
 -- slurmrestd - authenticate job submissions by SlurmUser properly.

* Changes in Slurm 20.02.1
==========================
 -- Improve job state reason for jobs hitting partition_job_depth.
 -- Speed up testing of singleton dependencies.
 -- Fix negative loop bound in cons_tres.
 -- srun - capture the MPI plugin return code from mpi_hook_client_fini() and
    use as final return code for step failure.
 -- Fix segfault in cli_filter/lua.
 -- Fix --gpu-bind=map_gpu reusability if tasks > elements.
 -- Make sure config_flags on a gres are sent to the slurmctld on node
    registration.
 -- Prolog/Epilog - Fix missing GPU information.
 -- Fix segfault when using config parser for expanded lines.
 -- Fix bit overlap test function.
 -- Don't accrue time if job begin time is in the future.
 -- Remove accrue time when updating a job start/eligible time to the future.
 -- Fix regression in 20.02.0 that broke --depend=expand.
 -- Reset begin time on job release if it's not in the future.
 -- Fix for recovering burst buffers when using high-availability.
 -- Fix invalid read due to freeing an incorrectly allocated env array.
 -- Update slurmctld -i message to warn about losing data.
 -- Fix scontrol cancel_reboot so it clears the DRAIN flag and node reason for a
    pending ASAP reboot.

* Changes in Slurm 20.02.0
==========================
 -- Fix minor memory leak in slurmd on reconfig.
 -- Fix invalid ptr reference when rolling up data in the database.
 -- Change shtml2html.py to require python3 for RHEL8 support, and match
    man2html.py.
 -- slurm.spec - override "hardening" linker flags to ensure RHEL8 builds in a
    usable manner.
 -- Fix type mismatches in the perl API.
 -- Prevent use of uninitialized slurmctld_diag_stats.
 -- Fixed various Coverity issues.
 -- Only show warning about root-less topology in daemons.
 -- Fix accounting of jobs in IGNORE_JOBS reservations.
 -- Fix issue with batch steps state not loading correctly when upgrading from
    19.05.
 -- Deprecate max_depend_depth in SchedulerParameters and move it to
    DependencyParameters.
 -- Silence erroneous error on slurmctld upgrade when loading federation state.
 -- Break infinite loop in cons_tres dealing with incorrect tasks per tres
    request resulting in slurmctld hang.
 -- Improve handling of --gpus-per-task to make sure appropriate number of GPUs
    is assigned to job.
 -- Fix seg fault on cons_res when requesting --spread-job.

* Changes in Slurm 20.02.0rc1
=============================
 -- sbatch - fix segfault when no newline at the end of a burst buffer file.
 -- Change scancel to only check job's base state when matching -t options.
 -- Save job dependency list in state files.
 -- cons_tres - allow jobs to be run on systems with root-less topologies.
 -- Restore pre-20.02pre1 PrologSlurmctld synchronization behavior to avoid
    various race conditions, and ensure proper batch job launch.
 -- Add new slurmrestd command/daemon which implements the Slurm REST API.

* Changes in Slurm 20.02.0pre1
==============================
 -- Avoid possible race when 2 conf files are read at the same exact time.
 -- Add last and mean backfill table size to sdiag output.
 -- Add support for additional job submit environment variables:
    SALLOC_MEM_PER_CPU, SALLOC_MEM_PER_NODE, SBATCH_MEM_PER_CPU and
    SBATCH_MEM_PER_NODE.
 -- Add 'Agent thread count' stat to sdiag.
 -- Add sdiag -M, --clusters option.
 -- NodeName configurations with CPUs != Sockets*Cores or
    Sockets*Cores*Threads will be rejected with fatal.
 -- Add scontrol write config option.
 -- Increase maximum number of hostlist ranges from 64k to 256k.
 -- Don't acquire unneeded locks in slurmctld _run_prolog thread.
 -- Fix sinfo/squeue sort by nodename/nodeaddr/hostname.
 -- Optimize getting wckey and associations usage.
 -- Keep SLURM_MPI_TYPE variable in srun when not set to 'none'.
 -- Remove slurm.spec-legacy packaging file.
 -- pam_slurm_adopt - with action_unknown=newest configured, pick a user job
    even when failing to get cgroup mtime.
 -- Fix "srun --export=" parsing to handle nested commas.
 -- Add default "reboot requested" reason to nodes when rebooting with scontrol.
 -- Duplicate PartitionName entries in slurm.conf will now fatal() instead of
    printing an error message and ignoring the successive records.
 -- Remove the smap command.
 -- Change exclusive behavior of a node to include all GRES on a node as well
    as the cpus.
 -- Append ": reboot issued" to node reason when reboot is issued from
    controller. Previously only happened when nextstate was specified.
 -- Add default jobname of "no-shell" for salloc --no-shell.
 -- Save reservation state when automatically shrinking nodes.
 -- Add slurm.conf option MaxDBDMsgs to control how many messages will be
    stored in the slurmctld before throwing them away when the slurmdbd is down.
 -- Change default SLURM_PMIX_TMPDIR to include user id to avoid potential
    conflicts on development systems running multiple Slurm instances.
 -- Return a newly added ESLURM_DEFER error and set a job state reason to
    FAIL_DEFER for immediate alloc requests if defer in SchedulerParameters.
 -- Make slurmctld fatal if unable to load a script or a job environment when
    building the launch job message.
 -- Removed the checkpoint plugin interface and all associated API calls.
 -- Add job_get_grace_time() functions to preempt plugins and refactor
    slurm_job_check_grace() to use them.
 -- Remove --disable-iso8601 configure option.
 -- Display StepId=.batch instead of StepId=.4294967294 in output of
    "scontrol show step". (slurm_sprint_job_step_info())
 -- Make it so you can have a grace time when preempting by requeue.
 -- Translate MpiDefault=openmpi to functionally-equivalent MpiDefault=none,
    and remove the mpi/openmpi plugin.
 -- burst_buffer/datawarp - add a set of % symbols that will be replaced by
    job details. E.g., %d will be filled in with the WorkDir for the job.
 -- Fix sacctmgr show events to support node list ranges.
 -- Add SchedulerParameters option bf_one_resv_per_job to disallow adding more
    than one backfill reservation per job.
 -- Allow sacctmgr to filter node events by states that are flags.
 -- Allow sacctmgr to filter node events by REBOOT state/flag.
 -- Add ability to set MailType and MailUser of job with scontrol.
 -- slurm_init_job_desc_msg() initializes mail_type as uint16_t. This allows
    mail_type to be set to NONE with scontrol.
 -- Add new slurm_spank_log() function to print messages back to the user from
    within a SPANK plugin. (This can be done with slurm_error() instead, but
    that will always prepend "error: " to every message which may lead to
    confusion.)
 -- Enforce specification of partition and ALL nodes with PART_NODES flag.
 -- Add 'promiscuous' flag to a reservation.
 -- Implement the idea of PURGE_COMP=timespec.
 -- SPANK - removed never-implemented slurm_spank_slurmd_init() interface. This
    hook has always been accessible through slurm_spank_init() in the
    S_CTX_SLURMD context instead.
 -- sbcast - add new BcastAddr option to NodeName lines to allow sbcast traffic
    to flow over an alternate network path.
 -- Add auth/jwt plugin.
 -- Add new 'scontrol token' subcommand.
 -- PMIx - improve performance of proc map generation.
 -- For a heterogeneous job to be considered for preemption all components
    must be eligible for preemption.
 -- Added JobCompParams to slurm.conf.
 -- Add configuration parameter DependencyParameters to slurm.conf.
 -- Deprecate kill_invalid_depend in SchedulerParameters and move it to new
    DependencyParameters.
 -- Enable job dependencies for any job on any cluster in the same federation.
 -- Stricter escaping of strings sent to Elasticsearch.
 -- Allow clusters to be added automatically to db at startup of ctld.
 -- Add AccountingStorageExternalHost slurm.conf parameter.
 -- Add support for srun -M --jobid=# for existing remote allocations.
 -- Remove LicensesUsed from 'scontrol show config'.
 -- sbatch - adjusted backoff times for "--wait" option to reduce load on
    slurmctld. This results in a steady-state delay of 32s between queries,
    instead of the prior 10s delay.
 -- Add SchedulerParameters option bf_running_job_reserve to add backfill
    reservations for jobs running on whole nodes.
 -- salloc/sbatch/srun - error on invalid --profile option strings.
 -- Remove max_job_bf option and replace with bf_max_job_test.
 -- Disable sbatch, salloc, srun --reboot for non-admins.
 -- jobcomp/elasticsearch - added connect_timeout and timeout options to
    JobCompParams.
 -- SPANK - added support for S_JOB_GID in the job script context with
    spank_get_item().
 -- Prolog/Epilog - add SLURM_JOB_GID environment variable.
 -- Add gpu/rsmi plugin to support AMD GPUs.
 -- Make it so you can "stack" the energy plugins.
 -- Add energy accounting plugin for AMD GPU.

* Changes in Slurm 19.05.9
==========================

* Changes in Slurm 19.05.8
==========================
 -- sbatch - handle --uid/--gid in #SBATCH directives properly.
 -- Fix HDF5 type version build error.
 -- PMIx - fix potential buffer overflows from use of unpackmem().
    CVE-2020-27745.
 -- X11 forwarding - fix potential leak of the magic cookie when sent as an
    argument to the xauth command. CVE-2020-27746.

* Changes in Slurm 19.05.7
==========================
 -- Fix handling of -m/--distribution options for across socket/2nd level by
    task/affinity plugin.
 -- Fix grp_node_bitmap error when slurmctld started before slurmdbd.
 -- Fix compilation issues in GCC10.
 -- Fix distributing job steps across idle nodes within a job.
 -- Break infinite loop in cons_tres dealing with incorrect tasks per tres
    request resulting in slurmctld hang.
 -- priority/multifactor - gracefully handle NULL list of associations or array
    of siblings when calculating FairTree fairshare.
 -- Fix cons_tres --exclusive=user to allocate only requested number of CPUs.
 -- Add MySQL deadlock detection and automatic retry mechanism.
 -- Fix _verify_node_state memory requested as --mem-per-gpu DefMemPerGPU.
 -- Factor in ntasks-per-core=1 with cons_tres.
 -- Fix formatting in error message in cons_tres.
 -- Fix gpu bind issue when CPUs=Cores and ThreadsPerCore > 1 on a node.
 -- Fix --mem-per-gpu for heterogeneous --gres requests.
 -- Fix slurmctld load order in load_all_part_state().
 -- Fix getting system counts for named GRES/TRES.
 -- MySQL - Fix for handling typed GRES for association rollups.
 -- Fix step allocations when tasks_per_core > 1.

* Changes in Slurm 19.05.6
==========================
 -- Fix OverMemoryKill.
 -- Fix memory leak in scontrol show config.
 -- Remove PART_NODES reservation flag after ignoring it at creation.
 -- Fix deprecation of MemLimitEnforce parameter.
 -- X11 forwarding - alter Xauthority regex to work when "FamilyWild" cookies
    are present in the "xauth list" output.
 -- Fix memory leak when utilizing core reservations.
 -- Fix issue where adding WCKeys and then using them right away didn't always
    work.
 -- Add cosmetic batch step to correct component in a hetjob.
 -- Fix to make scontrol write config create a usable config without editing.
 -- Fix memory leak when pinging backup controller.
 -- Fix issue with 'scontrol update' not enforcing all QoS / Association limits.
 -- Fix to properly schedule certain jobs with cons_tres plugin.
 -- Fix FIRST_CORES for reservations when using cons_tres.
 -- Fix sbcast -C argument parsing.
 -- Replace/deprecate max_job_bf with bf_max_job_test and print error message.
 -- sched/backfill - fix options parsing when bf_hetjob_prio enabled.
 -- Fix for --gpu-bind when no gpus requested.
 -- Fix sshare -l crash with large values.
 -- Fix printing NULL job and step pointers.
 -- Break infinite loop in cons_tres dealing with incorrect tasks per tres
    request resulting in slurmctld hang.
 -- Improve handling of --gpus-per-task to make sure appropriate number of GPUs
    is assigned to job.

* Changes in Slurm 19.05.5
==========================
 -- Fix both socket-[un]constrained GRES issues that would lead to incorrect
    GRES allocations and GRES underflow errors at deallocation time.
 -- Reject unrunnable jobs submitted to reservations.
 -- Fix misleading error returned for immediate allocation requests when defer
    in SchedulerParameters by decoupling defer from too fragmented logic.
 -- Fix printf format string error on FreeBSD.
 -- Fix parsing of delay_boot in controller when additional arguments follow it.
 -- Fix --ntasks-per-node in cons_tres.
 -- Fix array tasks getting same reject reason.
 -- Ignore DOWN/DRAIN partitions in reduce_completing_frag logic.
 -- Fix alloc_node validation when updating a job.
 -- Fix for requesting specific nodes when using cons_tres topology.
 -- Ensure x11 is setup before launching a job step.
 -- Fix incorrect SLURM_CLUSTER_NAME env var in batch step.
 -- Perl API - Fix undefined symbol for slurmdbd_pack_fini_msg.
 -- Install slurmdbd.conf.example with 0600 permissions to encourage secure
    use. CVE-2019-19727.
 -- srun - do not continue with job launch if --uid fails. CVE-2019-19728.

* Changes in Slurm 19.05.4
==========================
 -- Don't allow empty string as a reservation name; generate a name if empty
    string is provided.
 -- Fix salloc segfault when using --no-shell option.
 -- Fix divide by zero when normalizing partition priorities.
 -- Restore ability to set JobPriorityFactor to 0 on a partition.
 -- Fix multi-partition non-normalized job priorities.
 -- Adjust precedence between --mem-per-cpu and --mem-per-node to enforce
    them as mutually exclusive. Specifying either on the command line will now
    explicitly override any value inherited through the environment.
 -- Always print node's version, if it exists, in scontrol show nodes.
 -- sbatch - ensure SLURM_NTASKS_PER_NODE is exported when --ntasks-per-node
    is set.
 -- slurmctld - fix memory leak when using DebugFlags=Reservation.
 -- Reset --mem and --mem-per-cpu options correctly when using --mem-per-gpu.
 -- Use correct function signature for step_set_env() in gres plugin interface.
 -- Restore pre-19.05 hostname handling behavior for AllocNodes by always
    truncating to just the host portion and dropping any domain name portion
    returned by gethostbyaddr().
 -- Fix abort initializing a configuration without acct_gather.conf.
 -- Fix GRES binding and CLOUD nodes GRES setup regressions.
 -- Make sview work with glib2 v2.62.
 -- Fix slurmctld abort when in developer mode and submitting to multiple
    partitions with a bad QOS and not enforcing QOS.
 -- Enforce PART_NODES if only PartitionName is specified.
 -- Fix slurmd -G functionality.
 -- Fix build on 32-bit systems.
 -- Remove duplicate log entry on update job.
 -- sched/backfill - fix the estimated sched_nodes for multi-part jobs.
 -- slurm.spec - fix pmix_version global context macro.
 -- Fix cons_tres topology logic incorrectly evaluating insufficient resources.
 -- Fix job "--switches=count@time" option handling in cons_tres topology.
 -- scontrol - allow changes to the WorkDir for pending jobs.
 -- Enable coordinators to delete users if they only belong to accounts that
    the coordinator is over.
 -- Fix regression on update from older versions with DefMemPerCPU.
 -- Fix issues with --gpu-bind while using cgroups.
 -- Suspend nodes after being down for SuspendTime.
 -- Fix rebooting nodes from skipping nextstate states on boot.
 -- Fix regression in reservation creation logic from 19.05.3 which would
    incorrectly deny certain valid reservations from being created.
 -- slurmdbd - process sacct/sacctmgr job queries from older clients correctly.

* Changes in Slurm 19.05.3-2
============================
 -- Fix missing include for Cray Aries systems.
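The 19.05.4 AllocNodes entry above describes truncating a resolved name to its
host portion. A minimal sketch of that idea follows. It is an illustration
only, not Slurm's C implementation; the function name short_hostname is
hypothetical, and it assumes the host portion is everything before the first
dot of a DNS name (it is not meant for dotted IP addresses).

```python
def short_hostname(name: str) -> str:
    """Return only the host portion of a possibly fully qualified name.

    Hypothetical helper: assumes `name` is a DNS name, so the host
    portion is everything before the first dot.
    """
    return name.split(".", 1)[0]


# A fully qualified name and its short form compare equal after truncation.
print(short_hostname("login1.cluster.example.com"))
print(short_hostname("login1"))
```

With this kind of truncation, an AllocNodes comparison sees the same value
whether the resolver returns a short or a fully qualified name.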

* Changes in Slurm 19.05.3
==========================
 -- Fix missing check from conversion of cray -> cray_aries.
 -- Improve job state reason string when required nodes are not available by not including those that don't belong to the job partition.
 -- Set a more appropriate ESLURM_RESERVATION_MAINT job state reason for jobs requesting feature(s) and required nodes are in a maintenance reservation.
 -- Fix logic to better handle maintenance reservations.
 -- Add spank options to cache in remote callback.
 -- Enforce the use of spank_option_getopt().
 -- Fix select plugins' will run test under-allocating nodes usage for completing jobs.
 -- Nodes in COMPLETING state treated as being currently available for job will-run test.
 -- Cray - fix contribs slurm.conf.j2 with updated cray_aries plugin names.
 -- job_submit/lua - fix problem where nil was expected for min_mem_per_cpu.
 -- Fix extra, unaccounted TRESRunMins usage created by heterogeneous jobs when running with the priority/multifactor plugin.
 -- Detach threads once they are done to avoid having to join them in track scripts code.
 -- Handle situation where a slurmctld tries to communicate with slurmdbd more than once at the same time.
 -- Fix XOR/XAND features like cpu&fastio&[knl|westmere] to be resolved correctly.
 -- Don't update [min|max]_exit_code on job array task requeue.
 -- Don't assume the first node of a job is the batch host when testing if the job's allocated nodes are booted/ready.
 -- Make --batch= requests wait for all nodes to be booted so that it can choose the batch host after the nodes have been booted (possibly with different features).
 -- Fix talking to batch host on its protocol version when using --batch.
 -- gres/mic plugin - add missing fini() function to clean up plugin state.
 -- Move _validate_node_choice() before prolog/epilog check.
 -- Look forward one week while creating a new reservation.
 -- Set missing resv_desc.flags before calling _select_nodes().
 -- Use correct start_time for TIME_FLOAT reservation in _job_overlap().
 -- Properly enforce a job's mem-per-cpu option when allocating the node exclusively to that job.
 -- sched/backfill - clear estimated sched_nodes as done for start_time.
 -- Have safe_[read|write] handle EAGAIN and EINTR.
 -- Fix checking for flag with logical AND.
 -- Correct "extern" definition of variable if compiling with __APPLE__.
 -- Deprecate FastSchedule. FastSchedule will be removed in 20.02. The FastSchedule=2 functionality (used for testing and development) has been retained as the new SlurmdParameters=config_overrides option.
 -- Fix preemption issue when picking nodes for a feature job request.
 -- Fix race condition preventing held array job from getting a db_index.
 -- Fix select/cons_tres gres code infinite loop leaving slurmctld unresponsive.
 -- Remove redefinition of global variable in gres.c
 -- Fix issue where GPU devices are denied access when MPS is enabled.
 -- Fix uninitialized errors when compiling with CFLAGS="--coverage".
 -- Fix scancel --full for proctrack/cgroups.
 -- Fix sdiag backfill last and mean queue length stats.
 -- Do not remove batch host when resizing/shrinking a batch job.
 -- nss_slurm - fix file descriptor leaks.
 -- Fix preemption for jobs using complex feature requests (e.g. -C "[rack1*2&rack2*4]").
 -- Fix memory leaks in preemption when jobs request multiple features.
 -- Allow Operator users to show/fix runaways.
 -- Disallow coordinators to show/fix runaways.
 -- mpi/pmi2 - increase array len to avoid buffer size exceeded error.
 -- Preserve rebooting node's nextstate when updating state with scontrol.
 -- Fully merge slurm.conf and gres.conf before node_config_load().
 -- Remove FastSchedule dependence from gres.conf's AutoDetect=nvml.
 -- Forbid mix of typed and untyped GRES of same name in slurm.conf.
 -- cons_tres: Prevent creating a job without CPUs.
 -- Prevent underflow when filtering cores with gres.
 -- proctrack/cray_aries: use current pid instead of thread if we're in a fork.
 -- Fix missing check for prolog launch credential creation failure that can lead to segfaults.

* Changes in Slurm 19.05.2
==========================
 -- Wrap END_TIMER{,2,3} macro definition in "do {} while (0)" block.
 -- Allow account coordinators to add users who don't already have an association with any account.
 -- If only allowing particular alloc nodes in a partition, deny any request coming from an alloc node of NULL.
 -- Prevent partial-load of plugins which can leave certain interfaces in an inconsistent state.
 -- Remove stray __USE_GNU macro definitions from source.
 -- Fix loading fed state by backup on subsequent takeovers.
 -- Add missing job read lock when loading fed job state.
 -- Add missing fed_job_info jobs if fed state is lost.
 -- Do not build cgroup plugins on FreeBSD or NetBSD, and use proctrack/pgid by default instead.
 -- Do not build switch/cray_aries plugin on FreeBSD, NetBSD, or macOS.
 -- Fix build on FreeBSD.
 -- Fix race condition in route/topology plugin.
 -- In munge decode set the alloc_node field to the text representation of an IP address if the reverse lookup fails.
 -- Fix infinite loop in slurmstepd handling for nss_slurm REQUEST_GETGR RPC.
 -- Fix slurmstepd early assertion fail which prevented batch job launch or tasks launch on non-Linux systems.
 -- Fix regression with SLURM_STEP_GPUS env var being renamed SLURM_STEP_GRES.
 -- Fix pmix v3 linking if no rpath is allowed on build.
 -- Fix sacctmgr error handling when removing associations and users.
 -- Allow sacctmgr to add users to WCKeys without having TrackWCKey set in the slurm.conf.
 -- Allow sacctmgr to delete WCKeys from users.
 -- Change GRES type set by gpu/gpu_nvml plugin to be more specific - based on device name instead of brand name.
 -- cli_filter - fix logic error with option lookup functions.
 -- Fix bad testing of NodeFeatures debug flag in contribs/cray.
 -- Clean up track_script code to avoid race conditions and invalid memory access.
 -- Fix jobs being killed after being requeued by preemption.
 -- Make register nodes verify correctly when using cons_tres.
 -- Fix srun --mem-per-cpu being ignored.
 -- Fix segfault in _update_job() under certain conditions.
 -- job_submit/lua - restore slurm.FAILURE as a synonym for slurm.ERROR.

* Changes in Slurm 19.05.1-2
============================
 -- Fix mistake in QOS time limit calculations for UsageFactor != 0 with any combination of flags set.

* Changes in Slurm 19.05.1
==========================
 -- accounting_storage/mysql - fix incorrect function names in error messages.
 -- accounting_storage/slurmdbd - trigger an fsync() on the dbd.messages state file to ensure it is committed to disk properly.
 -- Prevent the JobHeldUser state reason from being updated at allocation time.
 -- Fix dump/load of rejected heterogeneous jobs.
 -- For heterogeneous jobs, do not count each component against the QOS or association job limit multiple times.
 -- Comment out documentation for the incomplete and currently unusable burst_buffer/generic plugin.
 -- Add new error ESLURM_INVALID_TIME_MIN_LIMIT to make note when a time_min limit is invalid based on timelimit.
 -- Correct slurmdb cluster record pack with NULL pointer input.
 -- Clearer error message for ESLURM_INVALID_TIME_MIN_LIMIT.
 -- Fix SchedulerParameter bf_min_prio_reserve error when not the last parameter
 -- When fixing runaway jobs, change to reroll from earliest submit time, and never reroll from Unix epoch.
 -- Display submit time when running sacctmgr show runawayjobs and add format option to display eligible time.
 -- jobcomp/elasticsearch - fix minor race related to JobCompLoc setup.
 -- For HetJobs, ensure SLURM_PACK_JOB_ID is set regardless of whether PrologFlags=Alloc is enabled.
 -- Fix PriorityFlags regression with the mutation of FAIR_TREE to NO_FAIR_TREE.
 -- select/cons_res - fix debug flag SelectType handling in select_p_job_test.
 -- Fix sacctmgr archive dump commit confirmation.
 -- Prevent extra resources from being allocated when combining certain flags.
 -- Cray - fix template generator with updated cray_aries plugin names.
 -- accounting_storage/slurmdbd - provide additional detail in several error messages.
 -- Backfill - If a job has a time_limit guess the end time of a job better if OverTimeLimit is Unlimited.
 -- Remove premature call to get system gpus before querying fake gpus that should override the real.
 -- Fix segfault in epilog_set_env() when gres_devices is NULL.
 -- Fix (un)supported states in sacct.
 -- Adjust build system to no longer use the AC_FUNC_MALLOC autoconf macro.
 -- srun - restore the --cpu_bind option to srun.
 -- Add UsageFactorSafe QOS flag to control applying UsageFactor at submission/scheduling time.
 -- Create missing reservations on DBD_MODIFY_RESV.
 -- Add error message when attempting to update association manager and object doesn't exist.
 -- Fix security issue in accounting_storage/mysql plugin on archive file loads by always escaping strings within the slurmdbd. CVE-2019-12838.

* Changes in Slurm 19.05.0
==========================
 -- Fix deprecated group by clause to use order by.
 -- NVML - Get rid of unneeded * when passing nvmlDevice_t to functions.
 -- NVML - Fix clang warning about unneeded variable initialization.
 -- NVML - remove unneeded {}.
 -- Add timers to new site_factor plugin APIs to warn of slow-running plugins, which can lead to issues with throughput and responsiveness.
 -- X11 forwarding - ignore screen value for local DISPLAY.
 -- Add missing locks protecting slurmctld_config.server_thread_count access.
 -- Fix jobs stuck from FedJobLock when requeueing in a federation
 -- Fix requeueing job in a federation of clusters with differing associations
 -- sacctmgr - free memory before exiting in 'sacctmgr show runaway'.
 -- Fix seff showing memory overflow when steps tres mem usage is 0.
 -- Fix memory leaks in 'sacctmgr show runawayjobs'.
 -- Fix potential deadlock in nss_slurm.
 -- Fix memory leaks due to incomplete slurmdb_cluster_cond_t destructor.
 -- Alter reservation flags column in slurmdbd to use uint64_t instead of uint16_t to ensure all current flags are saved correctly. Older releases unfortunately could not store details for newer flags (using bits 17-32) due to this field being silently truncated.
 -- Modify task layout with --overcommit option plus a heterogeneous job allocation so that a cyclic task distribution can start happening before all CPUs on all nodes are fully allocated. The number of tasks per node will be unchanged from the previous algorithm, but tasks will be distributed in a cyclic fashion first and then extra tasks placed on nodes with more CPUs. Previously all CPUs would be fully allocated in a cyclic fashion, then excess tasks distributed evenly across all allocated nodes.
 -- In select/cons_tres: Only allocate 1 CPU per node with the --overcommit option.
 -- In select/cons_res: Only allocate 1 CPU per node with the --overcommit and --nodelist options.
 -- Fix DefMemPer[CPU|Node] assignment on multi-partition job requests.
 -- Fix wrongly setting start_time to 0 for multi-part jobs.
 -- Upon archive file name collision, create new archive file instead of overwriting the old one to prevent lost records.
 -- Limit archive files to 50000 records per file so that archiving large databases will succeed.
 -- Remove stray newlines in SPANK plugin error messages.
 -- Fix archive loading events.
 -- Fix main scheduler from potentially not running through whole queue.
 -- Fix variable initialization to avoid slurmctld abort.
 -- In partition preemption, sort preemptor jobs only if they overlap a preemptable partition.
 -- cons_tres/dist_tasks - fix variable usage in cyclic distribution.
 -- cons_res/job_test - prevent a job from overallocating a node's memory.
 -- cons_res/job_test - fix to consider a node's current allocated memory when testing a job's memory request.
 -- Fix issue where multi-node job steps on cloud nodes wouldn't finish cleaning up until the end of the job (rather than the end of the step).
 -- Fix packing pack_jobid in an sbcast.
 -- Fix GCC 9 compiler warnings.
 -- Add new job bit_flags of JOB_DEPENDENT.
 -- Make it so dependent jobs reset the AccrueTime and do not count against any AccrueTime limits.
 -- Fix sacctmgr --parsable2 output for reservations and tres.
 -- In multi-node systems make sure GRES are found on node when not bound to specific sockets.
 -- Fix gres-per-task logic for gres not bound to sockets.
 -- Fix issue when --gpus plus --cpus-per-gres was forcing socket binding unnecessarily.
 -- Change event table's state column to handle 32bits.
 -- Prevent slurmctld from potential segfault after job_start_data() called for completing job.
 -- Fix jobs getting on nodes with "scontrol reboot asap".
 -- Record node reboot events to database.
 -- Fix node reboot failure message getting to event table.
 -- Don't write "(null)" to event table when no event reason exists.
 -- Fix invalid memory read in cons_tres.
 -- Fix minor memory leak when clearing runaway jobs.
 -- Avoid flooding slurmctld and logging when prolog complete RPC errors occur.
 -- Fix slurmctld node_scheduler's feature_bitmap memory leak.
 -- Fatal when reading config if Alloc flag configured on FrontEnd mode.
 -- Modifications needed to run Federations with clusters running different select/switch plugins.
 -- Fix Clang errors for zero initializing struct with nested arrays.
 -- Fix minor memory leak in pmi2.
 -- MySQL - Fix minor memory leak when querying suspended jobs fails.
 -- Fix seff human readable memory string for values below a megabyte.
 -- Avoid slurmctld abort if GRES defined in gres.conf, but not in the node configuration of slurm.conf.
 -- Calculate task count for job with --gpus-per-task option, but no explicit task count.

* Changes in Slurm 19.05.0rc1
=============================
 -- Set CUDA_VISIBLE_DEVICES environment variable in Prolog and Epilog for jobs requesting gres/gpu.
 -- Remove '-U' argument - which was deprecated when '-A' was made the single character option before the Slurm 2.1 release - as an alternative to '--account' for salloc/sbatch/srun.
 -- Remove direct BLCR support and srun_cr.
 -- Make slurm_print_node_table only print a node's slurmd version if it is different to the one reported by slurm_load_ctl_conf.
 -- Call gres plugin environment setup even if gres not requested in job.
 -- Do not set CUDA_VISIBLE_DEVICES=NoDevFiles when no gres requested.
 -- If GRES configuration data is unavailable from gres.conf, then use the node's "Gres=" information in slurm.conf. This will eliminate or minimize the gres.conf file in many situations.
 -- Fix checking IPMI XCC raw command response length.
 -- jobacct_gather/common - improve lightweight process identification.
 -- Cloud/PowerSave Improvements:
    - Better responsiveness to resuming and suspending.
    - Powering down nodes not eligible to be allocated until after SuspendTimeout.
    - Powering down nodes put in "Powering Down / %" state until after SuspendTimeout.
 -- Add idle_on_node_suspend SlurmctldParameter to make nodes idle regardless of state when suspended.
 -- Add PowerSave DebugFlag for Suspend/Resume debugging.
 -- Changed "scontrol reboot" to not default to ALL nodes.
 -- Changed "scontrol completing" to include two new fields - EndTime and CompletingTime.
 -- select/cons_tres - prevent job from overallocating a node's memory.
 -- Refactor CLI option parsing for salloc/sbatch/srun into a central set of functions in src/common/slurm_opt.c. Note that this new option parsing can be stricter in a few specific situations - places that used to ignore invalid options and still submit/launch a job or job step may return an error() and refuse to proceed instead.
 -- Add preempt_send_user_signal SlurmctldParameter option to send user signal (e.g. --signal=) at preemption if it hasn't already been sent.
 -- Add PreemptExemptTime parameter to slurm.conf and QOS to guarantee a minimum runtime before preemption.
 -- Set job's preempt time for non-grace time preemptions.
 -- Add sinfo format option to show used gres.
 -- Add reboot_from_controller SlurmctldParameter to allow RebootProgram to be run from the controller instead of the slurmds.
 -- Fix increasing of job size when extern steps exist.
 -- Reset GPU-related arguments to salloc/sbatch/srun for each separate heterogeneous job component.
 -- Do not set "(null)" for SLURM_JOB_CONSTRAINTS when no constraints are set in PrologSlurmctld/EpilogSlurmctld.
 -- Add SRUN_EXPORT_ENV as an input environment variable to srun.
 -- Return an error for invalid #SBATCH directives, and do not submit the job.
 -- Add S_JOB_ARRAY_ID and S_JOB_ARRAY_TASK_ID to spank_get_item().
 -- Change container_{g,p}_add_pid() to container_{g,p}_join() and remove the 'pid_t pid' argument.
 -- Add new site_factor plugin type to permit sites to build plugins to set and modify the site priority factor value both initially on job submission, and periodically every PriorityCalcPeriod.
 -- Rename Cray plugins to cray_aries in preparation for Cray/Shasta.
 -- Allow Het Jobs to work on a Cray.
 -- Add new cli_filter plugin type to permit sites to build plugins to log, modify, or reject CLI options within the salloc/sbatch/srun commands themselves.
 -- Allocate nodes that are booting. Previously, nodes that were being booted were off limits for allocation. This caused more nodes to be booted than needed in a cloud environment.
 -- pam_slurm_adopt - inject SLURM_JOB_ID environment variable into adopted processes.
 -- PMIx - use the Tree-based collective for empty fence operations.
 -- PMIx - replace use of the non-standard PMIX_VAL_SET macro with the standardized PMIX_VALUE_LOAD macro.
 -- slurm.spec - change --without cray option to set configure option of --enable-really-no-cray.
 -- slurm.spec - add new --with slurmsmwd option.
 -- pmi2: add mutex locking to all API calls to ensure thread-safety.
 -- Fix QOS usage factor to apply to TRES time limits and usage.
 -- Fix multi-cluster srun's with Select/Cray and other_cons_res.

* Changes in Slurm 19.05.0pre3
==============================
 -- Fix RPM packaging for accounting_storage/mysql.

* Changes in Slurm 19.05.0pre2
==============================
 -- Removed select/serial plugin.
 -- Remove 512-character line length limit in slurm_print_topo_record(). (Used by "scontrol show topology".)
 -- Removed crypto/openssl plugin.
 -- Tweak the sdiag gettimeofday() line format for greater clarity.
 -- Add support for SALLOC/SBATCH/SLURM_NO_KILL environment variables. Add salloc/sbatch/srun support for optional "--no-kill=off" option to disable the environment variables.
 -- Fix salloc and missing SLURM_NTASKS.
 -- Alter the backfill scheduler behavior to prevent it from scheduling lower priority jobs on resources that become available during the backfill scheduling cycle when bf_continue is enabled. This behavior was available as the bf_ignore_newly_avail_nodes option in 18.08.4+, but is now enabled by default. (The SchedulerParameters option of bf_ignore_newly_avail_nodes is also now removed, although harmless if still set.)
 -- Make LaunchParameters=send_gids the default, introducing the reverse option "disable_send_gids" to go back to the original behavior.
 -- Limit pam_slurm_adopt to run only in the sshd context by default, for security reasons. A new module option 'service=' can be used to allow different PAM applications to work. The option 'service=*' can be used to restore the old behavior of always performing the adopt logic regardless of the PAM application context.
 -- pam_slurm_adopt: Use uid to determine whether root is logging in.
 -- Remove sbatch --x11 option.
    Slurm's internal X11 forwarding is now only supported from salloc, or an allocating srun command.
 -- Suppressed printing of job id in sbatch when quiet flag is set.
 -- Changed sreport 'SizesByAccount' and 'SizesByAccountAndWckey' default behavior and added new 'AcctAsParent' option.
 -- Add ave watts to api and sview.
 -- Added printf attribute to setenvf() and corrected related warnings.
 -- Kill a running/pending job if it is allocated GRES, that GRES has a "File" configuration, and the GRES count changes.
 -- Add new DebugFlag=Accrue for accrue accounting debugging purposes.
 -- Change CryptoType option to CredType, and rename crypto/munge plugin to cred/munge.
 -- Add slurmd -G option to print GRES configuration and exit. This is useful for testing and debugging.
 -- Support GRES types that include numbers (e.g. "--gres=gpu:123g:2").
 -- Remove MemLimitEnforce parameter and move functionality into JobAcctGatherParam=OverMemoryKill.
 -- sview - disable admin mode option (which would not work anyways) if the user is not an admin in SlurmDBD.
 -- Remove joules reporting from sview and scontrol.
 -- Change the default fair share algorithm to "fair tree". The new PriorityFlags option of NO_FAIR_TREE can be used to revert to "classic" fair share scheduling instead.
 -- libslurmdb has been merged into libslurm.
 -- Added -b as a short option for --begin and removed the -b option which was a leftover artifact from the Moab compatibility work.
 -- Add ArrayTaskThrottle to "scontrol show job" output.
 -- Added SPRIO_FORMAT env variable to the sprio command.
 -- Add batch step at the beginning of a batch job so that squeue, sstat, and sacct will show the batch step.
 -- Deprecated 32-bit builds.
 -- Make -l and -o mutually exclusive in sacct, squeue, sinfo, and sprio.
 -- Disable running job expansion by default. A new SchedulerParameter of permit_job_expansion has been added for sites that wish to re-enable it.
 -- Permit changing a job array's ArrayTaskThrottle value even if the job is terminated (for job requeue).
 -- Add scontrol requeue option of "Incomplete" which will requeue jobs only if they failed to complete with an exit code of zero.
 -- Modify GrpNodes limit to apply to unique nodes allocated (avoid double counting nodes allocated to multiple jobs in the same QOS or association).
 -- If a job submit does NOT include --cpus-per-task option, then report the value as "N/A" rather than always mapping the value to 1.
 -- X11 forwarding - use the raw value from gethostname() with xauth to avoid authentication issues when Slurm has internally stripped off the domain portion.
 -- Change how slurmd fills in the registration message version string from PACKAGE_VERSION to SLURM_VERSION_STRING, affecting how the version is displayed with sview, sinfo, scontrol and through the API.
 -- Remove autogen.sh script. Please use the autoreconf command instead.
 -- Disable a configuration of SelectTypeParameters=CR_ONE_TASK_PER_CORE with SelectType=select/cons_tres. This will be addressed later.
 -- job_submit/lua - expose more fields off the partition record.
 -- task/cgroup - prevent setting a memory.soft_limit_in_bytes higher than the memory.limit_in_bytes since the hard limit will take precedence anyway.
 -- If a GrpNodes limit is configured in an association, partition QOS or job QOS then favor use of nodes already allocated to that entity. This will result in the configured node "Weight" being incremented by one for nodes which are not preferred. Consider adjusting configured node "Weight" values to achieve the desired node preferences.
 -- Add full node state debug2 output to slurmdbd node up/down update
 -- Set CUDA_VISIBLE_DEVICES and CUDA_MPS_ACTIVE_THREAD_PERCENTAGE environment variables in Prolog and Epilog for jobs requesting gres/mps.
 -- Added thresholds for backfill parameters.
 -- Fix for backfill sleep overflow when large values are set.
 -- Execute Epilog on nodes relinquished from job (i.e. job resized).
 -- Rename burst_buffer/cray plugin to burst_buffer/datawarp.
 -- X11 Forwarding - reimplement using new internal network forwarding RPCs.
 -- Remove slurm_jobcomp_get_errno and slurm_jobcomp_strerror from jobcomp plugin API.
 -- Optimize backfill for checking max jobs per assoc, partition, user, etc.

* Changes in Slurm 19.05.0pre1
==============================
 -- Run epilog and clean up allocation when a job is resized to zero and its resources transferred to another job (--depend=expand).
 -- If GRES are associated with specific sockets, identify those sockets in the output of "scontrol show node". For example, if all 4 GPUs on a node are associated with socket zero, then "Gres=gpu:4(S:0)". If associated with sockets 0 and 1 then "Gres=gpu:4(S:0-1)". The information of which specific GPUs are associated with specific sockets is not reported, but only available by parsing the gres.conf file.
 -- Add configuration parameter "GpuFreqDef" to control a job's default GPU frequency.
 -- Add job flags to the database. Currently used to determine which scheduler scheduled the job.
 -- Add constraints/features to the database.
 -- Add last reason job didn't run before resources/priority to the database.
 -- Make it so we set the alloc_node in a resource allocation based on the auth plugin instead of the rpc call.

* Changes in Slurm 18.08.10
===========================

* Changes in Slurm 18.08.9
==========================
 -- Wrap END_TIMER{,2,3} macro definition in "do {} while (0)" block.
 -- Make sview work with glib2 v2.62.
 -- Make Slurm compile on Linux after sys/sysctl.h was deprecated.
 -- Install slurmdbd.conf.example with 0600 permissions to encourage secure use. CVE-2019-19727.
 -- srun - do not continue with job launch if --uid fails. CVE-2019-19728.

* Changes in Slurm 18.08.8
==========================
 -- Update "xauth list" to use the same 10000ms timeout as the other xauth commands.
 -- Fix issue in gres code to handle a gres cnt of 0.
 -- Don't purge jobs if backfill is running.
 -- Verify job is pending when adding/removing accrual time.
 -- Don't abort when the job doesn't have an association that was removed before the job was able to make it to the database.
 -- Set state_reason if select_nodes() fails job for QOS or Account.
 -- Avoid segfault on referencing association without a valid_qos bitmap.
 -- If Association/QOS is removed on a pending job set that job as ineligible.
 -- When changing a job's account/qos always make sure you remove the old limits.
 -- Don't reset a FAIL_QOS or FAIL_ACCOUNT job reason until the qos or account changed.
 -- Restore "sreport -T ALL" functionality.
 -- Correctly typecast signals being sent through the api.
 -- Properly initialize structures throughout Slurm.
 -- Sync "numtask" squeue format option for jobs and steps to "numtasks".
 -- Fix sacct -PD to avoid CA before start jobs.
 -- Fix potential deadlock with backup slurmctld.
 -- Fixed issue with jobs not appearing in sacct after dependency satisfied.
 -- Fix showing non-eligible jobs when asking with -j and not -s.
 -- Fix issue with backfill scheduler scheduling tasks of an array when not the head job.
 -- accounting_storage/mysql - fix SIGABRT in the archive load logic.
 -- accounting_storage/mysql - fix memory leak in the archive load logic.
 -- Limit records per single SQL statement when loading archived data.
 -- Fix unnecessary reloading of job submit plugins.
 -- Allow job submit plugins to be turned on/off with a reconfigure.
 -- Fix segfault when loading/unloading Lua job submit plugin multiple times.
 -- Fix printing duplicate error messages of jobs rejected by job submit plugin.
 -- Fix printing of job submit plugin messages of het jobs without pack id.
 -- Fix memory leak in group_cache.c
 -- Fix jobs stuck from FedJobLock when requeueing in a federation
 -- Fix requeueing job in a federation of clusters with differing associations
 -- sacctmgr - free memory before exiting in 'sacctmgr show runaway'.
 -- Fix seff showing memory overflow when steps tres mem usage is 0.
 -- Upon archive file name collision, create new archive file instead of overwriting the old one to prevent lost records.
 -- Limit archive files to 50000 records per file so that archiving large databases will succeed.
 -- Remove stray newlines in SPANK plugin error messages.
 -- Fix archive loading events.
 -- In select/cons_res: Only allocate 1 CPU per node with the --overcommit and --nodelist options.
 -- Fix main scheduler from potentially not running through whole queue.
 -- cons_res/job_test - prevent a job from overallocating a node's memory.
 -- cons_res/job_test - fix to consider a node's current allocated memory when testing a job's memory request.
 -- Fix issue where multi-node job steps on cloud nodes wouldn't finish cleaning up until the end of the job (rather than the end of the step).
 -- Fix issue with a 17.11 sbcast call to a 18.08 daemon.
 -- Add new job bit_flags of JOB_DEPENDENT.
 -- Make it so dependent jobs reset the AccrueTime and do not count against any AccrueTime limits.
 -- Fix sacctmgr --parsable2 output for reservations and tres.
 -- Prevent slurmctld from potential segfault after job_start_data() called for completing job.
 -- Fix jobs getting on nodes with "scontrol reboot asap".
 -- Record node reboot events to database.
 -- Fix node reboot failure message getting to event table.
 -- Don't write "(null)" to event table when no event reason exists.
 -- Fix minor memory leak when clearing runaway jobs.
 -- Avoid flooding slurmctld and logging when prolog complete RPC errors occur.
 -- Fix GCC 9 compiler warnings.
 -- Fix seff human readable memory string for values below a megabyte.
 -- Fix dump/load of rejected heterogeneous jobs.
 -- For heterogeneous jobs, do not count each component against the QOS or association job limit multiple times.
 -- slurmdbd - avoid reservation flag column corruption with the use of newer flags, instead preserve the older flag fields that we can still fit in the smallint field, and discard the rest.
 -- Fix security issue in accounting_storage/mysql plugin on archive file loads by always escaping strings within the slurmdbd. CVE-2019-12838.

* Changes in Slurm 18.08.7
==========================
 -- Set debug statement to debug2 to avoid benign error messages.
 -- Add SchedulerParameters option of bf_hetjob_immediate to attempt to start a heterogeneous job as soon as all of its components are determined able to do so.
 -- Fix underflow causing decay thread to exit.
 -- Fix main scheduler not considering hetjobs when building the job queue.
 -- Fix regression for sacct to display old jobs without a start time.
 -- Fix setting correct number of gres topology bits.
 -- Update hetjobs pending state reason when appropriate.
 -- Fix accounting_storage/filetxt's understanding of TRES.
 -- Set Accrue time when not enforcing limits.
 -- Fix srun segfault when requesting a hetjob with test_exec or bcast options.
 -- Hide multipart priorities log message behind Priority debug flag.
 -- sched/backfill - Make hetjobs sensitive to bf_max_job_start.
 -- Fix slurmctld segfault due to job's partition pointer NULL dereference.
 -- Fix issue with OR'ed job dependencies.
 -- Add new job's bit_flags of INVALID_DEPEND to prevent rebuilding a job's dependency string when it has at least one invalid and purged dependency.
 -- Promote federation unsynced siblings log message from debug to info.
 -- burst_buffer/cray - fix slurmctld SIGABRT due to illegal read/writes.
 -- burst_buffer/cray - fix memory leak due to unfreed job script content.
 -- node_features/knl_cray - fix script_argv use-after-free.
 -- burst_buffer/cray - fix script_argv use-after-free.
 -- Fix invalid reads of size 1 due to non null-terminated string reads.
 -- Add extra debug2 logs to identify why BadConstraints reason is set.

* Changes in Slurm 18.08.6-2
============================
 -- Remove deadlock situation when logging and --enable-debug is used.
 -- Fix RPM packaging for accounting_storage/mysql.

* Changes in Slurm 18.08.6
==========================
 -- Added parsing of -H flag with scancel.
 -- Fix slurmsmwd build on 32-bit systems.
 -- acct_gather_filesystem/lustre - add support for Lustre 2.12 client.
 -- Fix per-partition TRES factors/priority
 -- Fix per-partition NICE priority
 -- Fix partition access check validation for multi-partition job submissions.
 -- Prevent segfault on empty response in 'scontrol show dwstat'.
 -- node_features/knl_cray plugin - Preserve node's active features if it has already booted when slurmctld daemon is reconfigured.
 -- Detect missing burst buffer script and reject job.
 -- GRES: Properly reset the topo_gres_cnt_alloc counter on slurmctld restart to prevent underflow.
 -- Avoid errors from packing accounting_storage_mysql.so when RPM is built without mysql support.
 -- Remove deprecated -t option from slurmctld --help.
 -- acct_gather_filesystem/lustre - fix stats gathering.
 -- Enforce documented default usage start and end times when querying jobs from the database.
 -- Fix issues when querying running jobs from the database.
 -- Deny sacct request where start time is later than the end time requested.
 -- Fix sacct verbose about time and states queried.
 -- burst_buffer/cray - allow 'scancel --hurry ' to tear down a burst buffer that is currently staging data out.
 -- X11 forwarding - allow setup if the DISPLAY environment variable lacks a screen number. (Permit both "localhost:10.0" and "localhost:10".)
 -- docs - change HTML title to include the page title or man page name.
 -- X11 forwarding - fix an unnecessary error message when using the local_xauthority X11Parameters option.
-- Add use_raw_hostname to X11Parameters. -- Fix smail so it passes job arrays to seff correctly. -- Don't check InactiveLimit for salloc --no-shell jobs. -- Add SALLOC_GRES and SBATCH_GRES as input to salloc/sbatch. -- Remove drain state when node doesn't reboot by ResumeTimeout. -- Fix considering "resuming" nodes in scheduling. -- Do not kill suspended jobs due to exceeding time limit. -- Add NoAddrCache CommunicationParameter. -- Don't ping powering up cloud nodes. -- Add cloud_dns SlurmctldParameter. -- Consider --sbindir configure option as the default path to find slurmstepd. -- Fix node state printing of DRAINED$ -- Fix spamming dbd of down/drained nodes in maintenance reservation. -- Avoid buffer overflow in time_str2secs. -- Calculate suspended time for suspended steps. -- Add null check for step_ptr->step_node_bitmap in _pick_step_nodes. -- Fix multi-cluster srun issue after 'scontrol reconfigure' was called. -- Fix accessing response_cluster_rec outside of write locks. -- Fix Lua user messages not showing up on rejected submissions. -- Fix printing multi-line error messages on rejected submissions. * Changes in Slurm 18.08.5-2 ============================ -- Fix Perl build for 32-bit systems. * Changes in Slurm 18.08.5 ========================== -- Backfill - If a job has a time_limit guess the end time of a job better if OverTimeLimit is Unlimited. -- Fix "sacctmgr show events event=cluster" -- Fix sacctmgr show runawayjobs from sibling cluster -- Avoid bit offset of -1 in call to bit_nclear(). -- Ensure that "hbm" is a configured GresType on knl systems. -- Fix NodeFeaturesPlugins=node_features/knl_generic to allow gres other than knl. -- cons_res: Prevent overflow on multiply. -- Better debug for bad values in gres.conf. -- Fix double accounting of energy at end of job. -- Read gres.conf for cloud nodes on slurmctld. -- Don't assume the first node of a job is the batch host when purging jobs from a node. 
-- Better debugging when a job doesn't have a job_resrcs ptr. -- Store ave watts in energy plugins. -- Add XCC plugin for reading Lenovo Power. -- Fix minor memory leak when scheduling rebootable nodes. -- Fix debug2 prefix for sched log. -- Fix printing correct SLURM_JOB_ACCOUNT_PACK_GROUP_* in env for a Het Job. -- sbatch - search current working directory first for job script. -- Make it so held jobs reset the AccrueTime and do not count against any AccrueTime limits. -- Add SchedulerParameters option of bf_hetjob_prio=[min|avg|max] to alter the job sorting algorithm for scheduling heterogeneous jobs. -- Fix initialization of assoc_mgr_locks and slurmctld_locks lock structures. -- Fix segfault with job arrays using X11 forwarding. -- Revert regression caused by e0ee1c7054 which caused negative values and values starting with a decimal to be invalid for PriorityWeightTRES and TRESBillingWeight. -- Fix possibility to update a job's reservation to none. -- Suppress connection errors to primary slurmdbd when backup dbd is active. -- Suppress connection errors to primary db when backup db kicks in -- Add missing fields for sacct --completion when using jobcomp/filetxt. -- Fix incorrect values set for UserCPU, SystemCPU, and TotalCPU sacct fields when JobAcctGatherType=jobacct_gather/cgroup. -- Fix srun printing invalid option msg twice. -- Remove unused -b flag from getopt call in sbatch. -- Disable reporting of node TRES in sreport. -- Re-enable features combined by OR within parentheses for non-knl setups. -- Prevent sending duplicate requests to reboot a node before ResumeTimeout. -- Down nodes that don't reboot by ResumeTimeout. -- Update seff to reflect API change from rss_max to tres_usage_in_max. -- Add missing TRES constants from perl API. -- Fix issue where sacct would return incorrect array tasks when querying specific tasks. -- Add missing variables to slurmdb_stats_t in the perlapi. 
-- Fix nodes not getting reboot RPC when job requires reboot of nodes. -- Fix failure to update the partition list of a job. -- Use slurm.conf gres ids instead of gres.conf names to get a gres type name. -- Add mitigation for a potential heap overflow on 32-bit systems in xmalloc. CVE-2019-6438. * Changes in Slurm 18.08.4 ========================== -- burst_buffer/cray - avoid launching a job that would be immediately cancelled due to a DataWarp failure. -- Fix message sent to user to display preempted instead of time limit when a job is preempted. -- Fix memory leak when a failure happens processing a node's gres config. -- Improve error message when failures happen processing a node's gres config. -- When building rpms ignore redundant standard rpaths and insecure relative rpaths, for RHEL based distros which use "check-rpaths" tool. -- Don't skip jobs in scontrol hold. -- Avoid locking the job_list when unneeded. -- Allow --cpu-bind=verbose to be used with SLURM_HINT environment variable. -- Make it so fixing runaway jobs will not alter the same job requeued when not runaway. -- Avoid checking state when searching for runaway jobs. -- Remove redundant check for end time of job when searching for runaway jobs. -- Make sure that we properly check for runawayjobs where another job might have the same id (for example, if a job was requeued) by also checking the submit time. -- Add scontrol update job ResetAccrueTime to clear a job's time previously accrued for priority. -- cons_res: Delay exiting cr_job_test until after cores/cpus are calculated and distributed. -- Fix bug where binary in cwd would trump binary in PATH with test_exec. -- Fix check to test printf("%s\n", NULL); to not require -Wno-format-truncation CFLAG. -- Fix JobAcctGatherParams=UsePss to report the correct usage. -- Fix minor memory leak in pmix plugin. -- Fix minor memory leak in slurmctld when reading configuration. -- Handle return codes correctly from pthread_* functions. 
-- Fix minor memory leak when a slurmd is unable to contact a slurmctld when trying to register. -- Fix sreport sizesbyaccount report when using Flatview and accounts. -- Fix incorrect shift when dealing with node weights and scheduling. -- libslurm/perl - Fix segfault caused by incorrect hv_to_slurm_ctl_conf. -- Add qos and assoc options to confirmation dialogs. -- Handle updating identical license or partition information correctly. -- Make sure accounts and QOS' are all lower case to match documentation when read in from the slurm.conf file. -- Don't consider partitions without enough nodes in reservation, main scheduler. -- Set SLURM_NTASKS correctly if having to determine from other options. -- Removed GCP scripts from contribs. Now located at: https://github.com/SchedMD/slurm-gcp. -- Don't check existence of srun --prolog or --epilog executables when set to "none" and SLURM_TEST_EXEC is used. -- Add "P" suffix support to job and step tres specifications. -- When doing a reconfigure handle QOS' GrpJobsAccrue correctly. -- Remove unneeded extra parentheses from sh5util. -- Fix jobacct_gather/cgroup to work correctly when more than one task is started on a node. -- If requesting --ntasks-per-node with no tasks set tasks correctly. -- Accept modifiers for TRES originally added in 6f0342e0358. -- Don't remove reservation on slurmctld restart if nodes are removed from configuration. -- Fix bad xfree in task/cgroup. -- Fix removing counters if a job array isn't subject to limits and is canceled while pending. -- Make sure SLURM_NTASKS_PER_NODE is set correctly when env is overwritten by the command line. -- Clean up step on a failed node correctly. -- mpi/pmix: Fixed the logging of collective state. -- mpi/pmix: Make multi-slurmd work correctly when using ring communication. -- mpi/pmix: Fix double invocation of the PMIx lib fence callback. -- mpi/pmix: Remove unneeded libpmix callback drop in tree-based coll. 
-- Fix race condition in route/topology when the slurmctld is reconfigured. -- In route/topology validate the slurmctld doesn't try to initialize the node system. -- Fix issue when requesting invalid gres. -- Validate job_ptr in backfill before restoring preempt state. -- Fix issue when job's environment is minimal and only contains variables Slurm is going to replace internally. -- When handling runaway jobs remove all usage before rollup to remove any time that wasn't existent instead of just updating lines that have time with a lesser time. -- salloc - set SLURM_NTASKS_PER_CORE and SLURM_NTASKS_PER_SOCKET in the environment if the corresponding command line options are used. -- slurmd - fix handling of the -f flag to specify alternate config file locations. -- Fix scheduling logic to avoid using nodes that require a reboot for KNL node change when possible. -- Fix scheduling logic bug. There should have been a test for _not_ NODE_SET_REBOOT to continue. -- Fix a scheduling logic bug with respect to XOR operation support when there are down nodes. -- If there is a constraint construct of the form "[...&...]" then an error is generated if more than one of those specifications contains KNL NUMA or MCDRAM modes. -- Fix stepd segfault race if slurmctld hasn't registered with the launching slurmd yet delivering its TRES list. -- Add SchedulerParameters option of bf_ignore_newly_avail_nodes to avoid scheduling lower priority jobs on resources that become available during the backfill scheduling cycle when bf_continue is enabled. -- Decrement message_connections in stepd code on error path correctly. -- Decrease an error message to be debug. -- Fix missing suffixes in squeue. -- pam_slurm_adopt - send an error message to the user if no Slurm jobs can be located on the node. -- Run SlurmctldPrimaryOffProg when the primary slurmctld process shuts down. -- job_submit/lua: Add several slurmctld return codes. -- job_submit/lua: Add user/group info to jobs. 
-- Fix formatting issues when printing uint64_t. -- Bump RLIMIT_NOFILE for daemons in systemd services. -- Expand %x in job name in 'scontrol show job'. -- salloc/sbatch/srun - print warning if mutually exclusive options of --mem and --mem-per-cpu are both set. * Changes in Slurm 18.08.3 ========================== -- Fix regression in 18.08.1 that caused dbd messages to not be queued up when the dbd was down. -- Fix regression in 18.08.1 that can cause a slurmctld crash when splitting job array elements. * Changes in Slurm 18.08.2 ========================== -- Correctly initialize variable in env_array_user_default(). -- Remove race condition when signaling starting step. -- Fix issue where 17.11 jobs using GRES didn't initialize new 18.08 structures after unpack. -- Stop removing nodes once the minimum CPU or node count for the job is reached in the cons_res plugin. -- Process any changes to MinJobAge and SlurmdTimeout in the slurmctld when it is reconfigured to determine changes in its background timers. -- Use previous SlurmdTimeout in the slurmctld after a reconfigure to determine the time a node has been down. -- Fix multi-cluster srun between clusters with different SelectType plugins. -- Fix removing job licenses on reconfig/restart when configured license counts are 0. -- If a job requested multiple licenses and one license was removed then on a reconfigure/restart all of the licenses, including the valid ones, would be removed. -- Fix issue where job's license string wasn't updated after a restart when licenses were removed or added. -- Add allow_zero_lic to SchedulerParameters. -- Avoid scheduling tasks in excess of ArrayTaskThrottle when canceling tasks of an array. -- Fix jobs that request memory per node and task count that can't be scheduled right away. -- Avoid infinite loop with jobacct_gather/linux when pids wrap around /proc/sys/kernel/pid_max. -- Fix --parsable2 output for sacct and sstat commands to remove a stray trailing delimiter. 
-- When modifying a user's name in sacctmgr enforce PreserveCaseUser. -- When adding a coordinator or user that was once deleted enforce PreserveCaseUser. -- Correctly handle scenarios where a partition's MaxMemPerCPU is less than a job's --mem-per-cpu and also -c is greater than 1. -- Set AccrueTime correctly when MaxJobsAccrue is disabled and BeginTime has not been established. -- Correctly account for job arrays for new {Max/Grp}JobsAccrue limits. * Changes in Slurm 18.08.1 ========================== -- Remove commented-out parts of man pages related to cons_tres work in 19.05, as these were showing up on the web version due to a syntax error. -- Prevent slurmctld performance issues in main background loop if multiple backup controllers are unavailable. -- Add missing user read association lock in burst_buffer/cray during init(). -- Fix incorrect spacing for PartitionName lines in 'scontrol write config'. -- Fix creation of step hwloc xml file for after cpuset cgroup has been created. -- Add userspace as a valid default governor. -- Add timers to group_cache_lookup so if going slow advise LaunchParameters=send_gids. -- Fix SLURM_STEP_GRES=none to work correctly. -- Fix potential memory leak when a failure happens unpacking a ctld_multi_msg. -- Fix potential double free when a failure happens when unpacking a node_registration_status_msg. -- Fix sacctmgr show runaways. -- Removed non-POSIX append operator from configure script for non-bash support. -- Fix sacct to not print huge reserve times when the job was never eligible. -- burst_buffer/cray - Add missing locks around assoc_mgr when timing out a burst buffer. -- burst_buffer/cray - Update burst buffers when an association or qos is removed from the system. -- Remove documentation for deprecated Cray/ALPS systems. Please switch to Native Cray mode instead. -- Completely copy features when copying the list in the slurmctld. 
-- PMIX - Fix issue with packing processes when using an arbitrary task distribution. -- Fix hostlists to be able to handle nodenames with '-' in them surrounded by integers. -- Added sort option to sprio output. -- Fix correct job CPU count allocated. -- Fix sacctmgr setting GrpJobs limit when setting GrpJobsAccrue limit. -- Change the defaults to MemLimitEnforce=no and NoOverMemoryKill (See RELEASE_NOTES). -- Prevent abort when using Cray node features plugin on non-knl. -- Add ability to reboot down nodes with scontrol reboot_nodes. -- Protect against sending to the slurmdbd if the connection has gone away. -- Fix invalid read when not using backup slurmctlds. -- Prevent acct coordinators from changing default acct on add user. -- Don't allow scontrol top to modify job priorities when priority == 1. -- slurmsmwd - change parsing code to handle systems with the svid or inst fields set in xtconsumer output. -- Fix infinite loop in slurmctld if GRES is specified without a count. -- sacct: Print error when unknown arguments are found. -- Fix checking missing return codes when unpacking structures. -- Fix slurm.spec-legacy including slurmsmwd -- More explicit error message when cgroup oom-kill events detected. -- When updating an association and are unable to find parent association initialize old fairshare association pointer correctly. -- Wrap slurm_cond_signal() calls with mutexes where needed. -- Fix correct timeout with resends in slurm_send_only_node_msg. -- Fix pam_slurm_adopt to honor action_adopt_failure. -- Have the slurmd recreate the hwloc xml file for the full system on restart. -- sdiag - correct the units for the gettimeofday() stat to microseconds. -- Set SLURM_CLUSTER_NAME environment variable in MailProg to the ClusterName. -- smail - use SLURM_CLUSTER_NAME environment variable. -- job_submit/lua - expose argc/argv options through lua interface. 
-- slurmdbd - prevent false-positive warning about innodb settings having been set too low if they're actually set over 2GB. * Changes in Slurm 18.08.0 ========================== -- Fix segfault on job arrays when starting controller without dbd up. -- Fix pmi2 to build with gcc 8.0+. -- Remove the development snapshot of select/cons_tres plugin. -- Fix slurmd -C to not print benign error from xcpuinfo. -- Fix potential double locks in the assoc_mgr. -- Fix sacct truncate flag behavior. Truncated pending jobs will always return a start and end time set to the window end time, so elapsed time is 0. -- Fix extern step hanging forever when canceled right after creation. -- sdiag - add slurmctld agent count. -- Remove requirement to have cgroup_allowed_devices_file.conf in order to constrain devices. By default all devices are allowed, and GRES that are associated with a device file but are not requested are restricted. -- Fix proper alignment of clauses when determining if more nodes are needed for an allocation. -- Fix race condition when canceling a federation job that just started running. -- Prevent extra resources from being allocated when combining certain flags. -- Fix problem in task/affinity plugin that can lead to slurmd fatal()'ing when using --hint=nomultithread. -- Fix left over socket file when step is ending and using pmi2 with %n or %h in the spool dir. -- Don't remove hwloc full system xml file when shutting down the slurmd. -- Fix segfault that could happen with a het job when it was canceled while starting. -- Fix scan-build false-positive warning about invalid memory access in the _ping_controller() function. -- Add control_inx value to trigger_info_msg_t to permit future work in the trigger management code to distinguish which of multiple backup controllers has changed state. * Changes in Slurm 18.08.0rc1 ============================== -- Add TimelimitRaw sacct output field to display timelimit numbers. 
-- Fix job array preemption during backfill scheduling. -- Fix scontrol -o show assoc output. -- Add support for sacct --whole-hetjob=[yes|no] option. -- Make salloc handle node requests the same as sbatch. -- Add shutdown_on_reboot SlurmdParameter to control whether the slurmd will shut itself down or not when a reboot request is received. -- Add cancel_reboot scontrol option to cancel pending reboot of nodes. -- Make Users case insensitive in the database based on Parameters=PreserveCaseUser in the slurmdbd.conf. -- Improve scheduling when dealing with node_features that could have a boot delay. -- Fix issue if a step launch fails we don't get a bunch of '(null)' strings in the step record for usage. -- Changed the default AuthType for slurmdbd to auth/munge. -- Make it so libpmi.so doesn't link to libslurm.so.$apiversion. -- Added 'remote-fs.target' to After directive of slurmd.service file. -- Fix filetxt plugin to handle it when you aren't running a jobacct_gather plugin. -- Remove drain on node when reboot nextstate used. -- Speed up pack of job's qos. -- Fix race condition when trying to update reservation in the database. -- For the PrologFlags slurm.conf option, make NoHold mutually exclusive with Contain and/or X11 options. -- Revise the handling of SlurmctldSyslogLevel and SlurmdSyslogLevel options in slurm.conf and DebugLevelSyslog in slurmdbd.conf. -- Gate reading the cgroup.conf file. -- Gate reading the acct_gather_* plugins. -- Add sacctmgr options to prevent/manage job queue stuffing: - GrpJobsAccrue= Maximum number of pending jobs in aggregate able to accrue age priority for this association and all associations which are children of this association. To clear a previously set value use the modify command with a new value of -1. - MaxJobsAccrue= Maximum number of pending jobs able to accrue age priority at any given time for the given association. This is overridden if set directly on a user. Default is the cluster's limit. 
To clear a previously set value use the modify command with a new value of -1. - MinPrioThreshold Minimum priority required to reserve resources when scheduling. * Changes in Slurm 18.08.0pre2 ============================== -- Remove support for "ChosLoc" configuration parameter. -- Configuration parameters "ControlMachine", "ControlAddr", "BackupController" and "BackupAddr" replaced by an ordered list of "SlurmctldHost" records with the optional address appended to the name enclosed in parentheses. For example: "SlurmctldHost=head(12.34.56.78)". An arbitrary number of backup servers can be configured. -- When a pending job's state includes "UnavailableNodes" do not include the nodes in FUTURE state. -- Remove --immediate option from sbatch. -- Add infrastructure for per-job and per-step TRES parameters: tres-per-job, tres-per-node, tres-per-socket, tres-per-task, cpus-per-tres, mem-per-tres, tres-bind and tres-freq. These new parameters are not currently used, but have been added to the appropriate RPCs. -- Add DefCpuPerGpu and DefMemPerGpu to global and per-partition configuration parameters. Shown in scontrol/sview as "JobDefaults=...". NOTE: These options are for future use and currently have no effect. -- Fix to always set the correct status on job update in mysql. -- Add ValidateMode configuration parameter to knl_cray.conf for static MCDRAM/NUMA configurations. -- Fix security issue in accounting_storage/mysql plugin by always escaping strings within the slurmdbd. CVE-2018-7033. -- Disable local PTY output processing when using 'srun --unbuffered'. This prevents the PTY subsystem from inserting extraneous \r characters into the output stream. -- Change the column name for the %U (User ID) field in squeue to 'UID'. -- CRAY - Add CheckGhalQuiesce to the CommunicationParameters. -- When a process is core dumping, avoid terminating other processes in that task group. This fixes a problem with writing out incomplete OpenMP core files. 
-- CPU frequency management enhancements: If scaling_available_frequencies file is not available, then derive values from scaling_min_freq and scaling_max_freq values. If cpuinfo_cur_freq file is not available then try to use scaling_cur_freq. -- Add pending jobs count to sdiag output. -- Fix update job function. There were some inconsistencies in the behavior that caused time limits to be modified when swapping QOS, bad permissions check for a coordinator and AllowQOS and DenyQOS were not enforced on job update. -- Add configuration parameters SlurmctldPrimaryOnProg and SlurmctldPrimaryOffProg, which define programs to execute when a slurmctld daemon becomes the primary server or goes from primary to backup mode. -- Add configuration parameter SlurmctldAddr for use with virtual IP to manage backup slurmctld daemons. -- Explicitly shutdown the slurmd process when instructed to reboot. -- Add ability to create/update partition with TRESBillingWeights through scontrol. -- Calculate TRES billing values at submission so that billing limits can be enforced at submission with QOS DenyOnLimit. -- Add node_features plugin function "node_features_p_reboot_weight()" to return the node weight to be used for a compute node that requires reboot for use (e.g. to change the NUMA mode of a KNL node). -- Add NodeRebootWeight parameter to knl.conf configuration file. -- Fix insecure handling of job requested gid field. CVE-2018-10995. -- Fix srun to return highest signal of any task. -- Completely remove "gres" field from step record. Use "tres_per_node", "tres_per_socket", etc. -- Add "Links" parameter to gres.conf configuration file. -- Force slurm_mktime() to set tm_isdst to -1 so anyone using the function doesn't forget to set it. -- burst_buffer.conf - Add SetExecHost flag to enable burst buffer access from the login node for interactive jobs. -- Append ", with requeued tasks" to job array "end" emails if any tasks in the array were requeued. 
This is a hint to use "sacct --duplicates" to see the whole picture of the array job. -- Add ResumeFailProgram slurm.conf option to specify a program that is called when a node fails to respond by ResumeTimeout. -- Add new job pending reason of "ReqNodeNotAvail, reserved for maintenance". -- Remove AdminComment += syntax from 'scontrol update job'. -- sched/backfill: Reset job time limit if needed for deadline scheduling. -- For heterogeneous job component with required nodes, explicitly exclude those nodes from all other job components. -- Add name of partition used to srun --test-only output (valuable for jobs submitted to multiple partitions). -- If MailProg is not configured and "/bin/mail" (the default) does not exist, but "/usr/bin/mail" does exist then use "/usr/bin/mail" as a default value. -- sdiag output now reports outgoing slurmctld message queue contents. -- Fix performance issue when reading slurm conf having nodes with features. -- Make it so the slurmdbd's pid file gets created before initing the database. -- Improve escaping special characters on user commands when specifying paths. -- Fix directory names with special char '\' that are not handled correctly. -- Add salloc/sbatch/srun option of --gres-flags=disable-binding to disable filtering of CPUs with respect to generic resource locality. This option is currently required to use more CPUs than are bound to a GRES (i.e. if a GPU is bound to the CPUs on one socket, but resources on more than one socket are required to run the job). This option may permit a job to be allocated resources sooner than otherwise possible, but may result in lower job performance. -- SlurmDBD - Print warning if MySQL/MariaDB internal tuning is not at least half of the recommended values. -- Move libpmi from src/api to contribs/pmi. -- Add ability to specify a node reason when rebooting nodes with "scontrol reboot". -- Add nextstate option to "scontrol reboot" to dictate state of node after reboot. 
-- Consider "resuming" (nextstate=resume) nodes as available in backfill future scheduling and don't replace "resuming" nodes in reservations. -- Add the use of an xml file to help performance when using hwloc. * Changes in Slurm 18.08.0pre1 ============================== -- Add new burst buffer state of "teardown-fail" to indicate the burst buffer teardown operation is failing on specific buffers. This changes the numeric value of the BB_STATE_COMPLETE type. Any Slurm version 17.02 or 17.11 tool used to report burst buffer state information will report a state of "66" rather than "complete" for burst buffers which have been deleted, but still exist in the slurmctld daemon's tables (a very short-lived situation). -- Multiple backup slurmctld daemons can be configured: * Specify "BackupController#=" and "BackupAddr#=" to identify up to 9 backup servers.
* Output format of "scontrol ping" and the daemon status at the end of "scontrol status" is modified to report up status of the primary and all backup servers. * "scontrol takeover [#]" command can now identify the SlurmctldHost index number. Default value is "1" (the first backup configured SlurmctldHost). -- Enable jobs with zero node count for creation and/or deletion of persistent burst buffers. * The partition default MinNodes configuration parameter is now 0 (previously 1 node). * Zero size jobs disabled for job arrays and heterogeneous jobs, but supported for salloc, sbatch and srun commands. -- Add "scontrol show dwstat" command to display Cray burst buffer status. -- Add "GetSysStatus" option to burst_buffer.conf file. For burst_buffer/cray this would indicate the location of the "dwstat" command. -- Add node and partition configuration options of "CpuBind" to control default task binding. Modify the scontrol to report and modify these parameters. -- Add "NumaCpuBind" option to knl.conf file to automatically change a node's CpuBind parameter based upon changes to a node's NUMA mode. -- Add sbatch "--batch" option to identify features required on batch node. For example "sbatch --batch=haswell ...". -- Add "BatchFeatures" field to output of "scontrol show job". -- Add support for "--bb" option to sbatch command. -- Add new SystemComment field to job data structure and database. Currently used for Burst Buffer error logs. -- Expand reservation "flags" field from 32 to 64 bits. -- Add job state flag of "SIGNALING" to avoid race condition with multiple SIGSTOP/SIGCONT signals for the same job being active at the same time. -- Properly handle srun --will-run option when there are jobs in COMPLETING state. -- Properly report who is signaling a step. -- Don't combine updated reservation records in sreport's reservation report. 
-- node_features plugin - Add support for XOR & XAND of job constraints (node feature specifications). -- Add support for parentheses in a job's constraint specification to group like options together. For example --constraint="[(knl&snc4&flat)*4&haswell*1]" might be used to specify that four nodes with the features "knl", "snc4" and "flat" plus one node with the feature "haswell" are required. -- Improvements to how srun searches for the executable when using cwd. -- Now programs can be checked before execution if test_exec is set when using multi-prog option. -- Report NodeFeatures plugin configuration with scontrol and sview commands. -- Add acct_gather_profile/influxdb plugin. -- Add new job state of SO/STAGE_OUT indicating that burst buffer stage-out operation is in progress. -- Correct SLURM_NTASKS and SLURM_NPROCS environment variable for heterogeneous job step. Report values representing full allocation. -- Expand advanced reservation feature specification to support parentheses and counts of nodes with specified features. Nodes with the feature currently active will be preferred. -- Defer job signaling until prolog is completed -- Have the primary slurmctld wait until the backup has completely shutdown before taking control. -- Fix issue where unpacking job state after TRES count changed could lead to invalid reads. -- Heterogeneous job step allocations supported with * Open MPI (with Slurm's PMI2 and PMIx plugins) and * Intel MPI (with Slurm's PMI2 plugin) -- Remove redundant function arguments from task plugins: * Remove "job_id" field from task_p_slurmd_batch_request() function. * Remove "job_id" field from task_p_slurmd_launch_request() function. * Remove "job_id" field from task_p_slurmd_reserve_resources() function. -- Change function name from node_features_p_changible_feature() to node_features_p_changeable_feature in node_features plugin. -- Add Slurm configuration file check logic using "slurmctld -t" command. 
* Changes in Slurm 17.11.14 =========================== * Changes in Slurm 17.11.13-2 ============================= -- Fix Perl build for 32-bit systems. * Changes in Slurm 17.11.13 =========================== -- Add mitigation for a potential heap overflow on 32-bit systems in xmalloc. CVE-2019-6438. * Changes in Slurm 17.11.12 =========================== -- Fix regression in 17.11.10 that caused dbd messages to not be queued up when the dbd was down. * Changes in Slurm 17.11.11 =========================== -- Correctly initialize variable in env_array_user_default(). -- Correctly handle scenarios where a partition's MaxMemPerCPU is less than a job's --mem-per-cpu and also -c is greater than 1. * Changes in Slurm 17.11.10 =========================== -- Move priority_sort_part_tier from slurmctld to libslurm to make it possible to run the regression tests 24.* without changing that code since it links directly to the priority plugin where that function isn't defined. -- Fix issue where job time limits can increase to max walltime when updating a job with scontrol. -- Fix invalid protocol_version manipulation on big endian platforms causing srun and sattach to fail. -- Fix for QOS, Reservation and Alias env variables in srun. -- mpi/pmi2 - Backport 6a702158b49c4 from 18.08 to avoid dangerous detached thread. -- When allowing heterogeneous steps make sure we copy all the options to avoid copying strings that may be overwritten. -- Print correctly when sh5util finds an empty file. -- Fix sh5util to not seg fault on exit. -- Fix sh5util to check correctly for H5free_memory. -- Adjust OOM monitoring function in task/cgroup to prevent problems in regression suite from leaked file descriptors. -- Fix issue with gres when defined with a type and no count (i.e. gres=gpu/tesla) it would get a count of 0. -- Allow sstat to talk to slurmd's that are new in protocol version. -- Permit database names over 33 characters in accounting_storage/mysql. 
-- Fix negative values when profiling. -- Fix srun segfault caused by invalid memory reads on the env. -- Fix segfault on job arrays when starting controller without dbd up. -- Fix pmi2 to build with gcc 8.0+. -- Fix proper alignment of clauses when determining if more nodes are needed for an allocation. -- Fix race condition when canceling a federation job that just started running. -- Prevent extra resources from being allocated when combining certain flags. -- Fix problem in task/affinity plugin that can lead to slurmd fatal()'ing when using --hint=nomultithread. -- Fix left over socket file when step is ending and using pmi2 with %n or %h in the spool dir. -- Fix incorrect spacing for PartitionName lines in 'scontrol write config'. -- Fix sacct to not print huge reserve times when the job was never eligible. -- burst_buffer/cray - Add missing locks around assoc_mgr when timing out a burst buffer. -- burst_buffer/cray - Update burst buffers when an association or qos is removed from the system. -- If failed over to a backup controller, ensure the agent thread is launched to handle deferred tasks. -- Fix correct job CPU count allocated. -- Protect against sending to the slurmdbd if the connection has gone away. -- Fix checking missing return codes when unpacking structures. -- Fix slurm.spec-legacy including slurmsmwd -- More explicit error message when cgroup oom-kill events detected. -- When updating an association and are unable to find parent association initialize old fairshare association pointer correctly. -- Wrap slurm_cond_signal() calls with mutexes where needed. -- Fix correct timeout with resends in slurm_send_only_node_msg. -- Fix pam_slurm_adopt to honor action_adopt_failure. -- job_submit/lua - expose argc/argv options through lua interface. * Changes in Slurm 17.11.9-2 ============================ -- Fix printing of node state "drain + reboot" (and other node state flags). -- Fix invalid read (segfault) when sorting multi-partition jobs. 
-- Move several new error() messages to debug() to keep them out of users' srun output. * Changes in Slurm 17.11.9 ========================== -- Fix segfault in slurmctld when a job's node bitmap is NULL during a scheduling cycle. Primarily caused by EnforcePartLimits=ALL. -- Remove erroneous unlock in acct_gather_energy/ipmi. -- Enable support for hwloc version 2.0.1. -- Fix 'srun -q' (--qos) option handling. -- Fix socket communication issue that can lead to lost task completion messages, which will cause a permanently stuck srun process. -- Handle creation of TMPDIR if environment variable is set or changed in a task prolog script. -- Avoid node layout fragmentation if running with a fixed CPU count but without Sockets and CoresPerSocket defined. -- burst_buffer/cray - Fix datawarp swap default pool overriding jobdw. -- Fix incorrect job priority assignment for multi-partition job with different PriorityTier settings on the partitions. -- Fix sinfo to print correct node state. * Changes in Slurm 17.11.8 ========================== -- Fix incomplete RESPONSE_[RESOURCE|JOB_PACK]_ALLOCATION building path. -- Do not allocate nodes that were marked down due to the node not responding by ResumeTimeout. -- task/cray plugin - search for "mems" cgroup information in the file "cpuset.mems" then fall back to the file "mems". -- Fix ipmi profile debug uninitialized variable. -- Improve detection of Lua package on older RHEL distributions. -- PMIx: fixed the direct connect inline msg sending. -- MYSQL: Fix issue not handling all fields when loading an archive dump. -- Allow a job_submit plugin to change the admin_comment field during job_submit_plugin_modify(). -- job_submit/lua - fix access into reservation table. -- MySQL - Prevent deadlock caused by archive logic locking reads. -- Don't enforce MaxQueryTimeRange when requesting specific jobs. -- Modify --test-only logic to properly support jobs submitted to more than one partition.
-- Prevent slurmctld from aborting when attempting to set non-existing qos as def_qos_id. -- Add new job dependency type of "afterburstbuffer". The pending job will be delayed until the first job completes execution and its burst buffer stage-out is completed. -- Reorder proctrack/task plugin load in the slurmstepd to match that of slurmd and avoid the race condition that calling task before proctrack can introduce. -- Prevent reboot of a busy KNL node when requesting inactive features. -- Revert to previous behavior when requesting memory per cpu/node introduced in 17.11.7. -- Fix to reinitialize previously adjusted job members to their original value when validating the job memory in multi-partition requests. -- Fix _step_signal() from always returning SLURM_SUCCESS. -- Combine active and available node feature change logs on one line rather than one line per node for performance reasons. -- Prevent occasionally leaking freezer cgroups. -- Fix potential segfault when closing the mpi/pmi2 plugin. -- Fix issues with --exclusive=[user|mcs] to work correctly with preemption or when job requests a specific list of hosts. -- Make code compile with hdf5 1.10.2+. -- mpi/pmix: Fixed the collectives canceling. -- SlurmDBD: improve error message handling on archive load failure. -- Fix incorrect locking when deleting reservations. -- Fix incorrect locking when setting up the power save module. -- Fix setting format output length for squeue when showing array jobs. -- Add xstrstr function. -- Fix printing out of --hint options in sbatch, salloc --help. -- Prevent possible divide by zero in _validate_time_limit(). -- Add Delegate=yes to the slurmd.service file to prevent systemd from interfering with the jobs' cgroup hierarchies. -- Change the backlog argument to the listen() syscall within srun to 4096 to match elsewhere in the code, and avoid communication problems at scale.
* Changes in Slurm 17.11.7 ========================== -- Fix for possible slurmctld daemon abort with NULL pointer. -- Fix different issues when requesting memory per cpu/node. -- PMIx - override default paths at configure time if --with-pmix is used. -- Have sprio display jobs before eligible time when PriorityFlags=ACCRUE_ALWAYS is set. -- Make sure locks are always in place when calling _post_qos_list(). -- Notify srun and ctld when unkillable stepd exits. -- Fix slurmstepd deadlock in stepd cleanup caused by race condition in the jobacct_gather fini() interfaces introduced in 17.11.6. -- Fix slurmstepd deadlock in PMIx startup. -- task/cgroup - fix invalid free() if the hwloc library does not return a string as expected. -- Fix insecure handling of job requested gid field. CVE-2018-10995. -- Add --without x11 option to rpmbuild in slurm.spec. * Changes in Slurm 17.11.6 ========================== -- CRAY - Add slurmsmwd to the contribs/cray dir. -- sview - fix crash when closing any search dialog. -- Fix initialization of variable in stepd when using native x11. -- Fix reading slurm_io_init_msg to handle partial messages. -- Fix scontrol create res segfault when wrong user/account parameters given. -- Fix documentation for sacct on parameter -X (--allocations) -- Change TRES Weights debug messages to debug3. -- FreeBSD - assorted fixes to restore build. -- Fix for not tracking environment variables from unrelated different jobs. -- PMIX - Added the direct connect authentication. When upgrading this may cause issues with jobs using pmix starting on mixed slurmstepd versions where some are less than 17.11.6. -- Prevent the backup slurmctld from losing the active/available node features list on takeover. -- Add documentation for fix IDLE*+POWER due to capmc stuck in Cray systems. -- Fix missing mutex unlock when prolog is failing on a node, leading to a hung slurmd. -- Fix locking around Cray CCM prolog/epilog. -- Add missing fed_mgr read locks. 
-- Fix issue incorrectly setting a job time_start to 0 while requeueing. -- smail - remove stray '-s' from mail subject line. -- srun - prevent segfault if ClusterName setting is unset but SLURM_WORKING_CLUSTER environment variable is defined. -- In configurator.html web pages change default configuration from task/none to task/affinity plugin and from select/linear plugin to select/cons_res plus CR_Core. -- Allow jobs to run beyond a FLEX reservation end time. -- Fix problem with job state_reason being wrongly set to Reservation. -- Prevent bit_ffs() from returning value out of bitmap range. -- Improve performance of 'squeue -u' when PrivateData=jobs is enabled. -- Make UnavailableNodes value in job reason be correct for each job. -- Fix 'squeue -o %s' on Cray systems. -- Fix incorrect error thrown when cancelling part of a job array. -- Fix error code and scheduling problem for --exclusive=[user|mcs]. -- Fix build when lz4 is in a non-standard location. -- Be able to force power_down of cloud node even if in power_save state. -- Allow cloud nodes to be recognized in Slurm when booted out of band. -- Fix race condition in _pack_job_gres() when it is called multiple times. -- Increase duration of "sleep" command used to keep extern step alive. -- Remove unsafe usage of pthread_cancel in slurmstepd that can lead to deadlock in glibc. -- Fix total TRES Billing on partitions. -- Don't tear down a BB if a node fails and --no-kill or resize of a job happens. -- Remove unsafe usage of pthread_cancel in pmix plugin that can lead to deadlock in glibc. -- Fix fatal in controller when loading completed trigger. -- Ignore reservation overlap at submission time. -- GRES type model and QOS limits documentation added. -- slurmd - fix ABRT on SIGINT after reconfigure with MemSpecLimit set. -- PMIx - move two error messages on retry to debug level, and only display the error after the retry count has been exceeded. -- Increase number of tries when sending responses to srun.
-- Fix checkpointing requeued/completing jobs in a bad state which caused a segfault on restart. -- Fix srun on ppc64 platforms. -- Prevent slurmd from starting steps if the Prolog returns an error when using PrologFlags=alloc. -- priority/multifactor - prevent segfault running sprio if a partition has just been deleted and PriorityFlags=CALCULATE_RUNNING is turned on. -- job_submit/lua - add ESLURM_INVALID_TIME_LIMIT return code value. -- job_submit/lua - print an error if the script calls log.user in job_modify() instead of returning it to the next submitted job erroneously. -- select/linear - handle job resize correctly. -- select/cons_res - improve handling of --cores-per-socket requests. * Changes in Slurm 17.11.5 ========================== -- Fix cloud nodes getting stuck in DOWN+POWER_UP+NO_RESPOND state after not responding by ResumeTimeout. -- Add job's array_task_cnt and user_name along with partitions [max|def]_mem_per_[cpu|node], max_cpus_per_node, and max_share with the SHARED_FORCE definition to the job_submit/lua plugin. -- srun - fix for SLURM_JOB_NUM_NODES env variable assignment. -- sacctmgr - fix runaway jobs identification. -- Fix for always setting the correct status on job update in mysql. -- Fix issue where, if running with an association manager cache (slurmdbd was down when slurmctld was started), you could lose QOS usage information. -- CRAY - Fix spec file to work correctly. -- Set scontrol exit code to 1 if attempting to update a node state to DRAIN or DOWN without specifying a reason. -- Fix race condition when running with an association manager cache (slurmdbd was down when slurmctld was started). -- Print out missing SLURM_PERSIST_INIT slurmdbd message type. -- Fix two build errors related to use of the O_CLOEXEC flag with older glibc. -- Add Google Cloud Platform integration scripts into contribs directory. -- Fix minor potential memory leak in backfill plugin. -- Add missing node flags (maint/power/etc) to node states.
-- Fix issue where job time limits may end up at 1 minute when using the NoReserve flag on their QOS. -- Fix security issue in accounting_storage/mysql plugin by always escaping strings within the slurmdbd. CVE-2018-7033. -- Soften messages about best_fit topology to debug2 to avoid alarm. -- Fix issue in sreport reservation utilization report to handle more allocated time than 100% (Flex reservations). -- When a job is requesting a Flex reservation prefer the reservation's nodes over any other nodes. * Changes in Slurm 17.11.4 ========================== -- Add fatal_abort() function to be able to get core dumps if we hit an "impossible" edge case. -- Link slurmd against all libraries that slurmstepd links to. -- Fix limits enforce order when they're set at partition and other levels. -- Add slurm_load_single_node() function to the Perl API. -- slurm.spec - change dependency for --with lua to use pkgconfig. -- Fix small memory leaks in node_features plugins on reconfigure. -- slurmdbd - only permit requests to update resources from operators or administrators. -- Fix handling of partial writes in io_init_msg_write_to_fd() which can lead to job step launch failure under higher cluster loads. -- MYSQL - Fix to handle quotes in a given work_dir of a job. -- sbcast - fix a race condition that leads to "Unspecified error". -- Log that support for the ChosLoc configuration parameter will end in Slurm version 18.08. -- Fix backfill performance issue where bf_min_prio_reserve was not respected. -- Fix MaxQueryTimeRange checks. -- Print MaxQueryTimeRange in "sacctmgr show config". -- Correctly check return codes when creating a step to check if needing to wait to retry or not. -- Fix issue where a job could be denied by Reason=MaxMemPerLimit when not requesting any tasks. -- In perl tools, fix for regexp that caused extra incorrectly shown results. -- Add some extra locks in fed_mgr to be extra safe. -- Minor memory leak fixes in the fed_mgr on slurmctld shutdown. 
-- Make sreport job reports also report duplicate jobs correctly. -- Fix issues restoring certain Partition configuration elements, especially when ReconfigFlags=KeepPartInfo is enabled. -- Don't add TRES whose value is NO_VAL64 when building string line. -- Fix removing array jobs from hash in slurmctld. -- Print out missing user messages from jobsubmit plugin when srun/salloc are waiting for an allocation. -- Handle --clusters=all as case insensitive. -- Only check requested clusters in federation when using --test-only submission option. -- In the federation, make it so you can cancel stranded sibling jobs. -- Silence an error from PSS memory stat collection process. -- Requeue jobs allocated to nodes requested to DRAIN or FAIL if nodes are POWER_SAVE or POWER_UP, preventing jobs from starting on NHC-failed nodes. -- Make MAINT and OVERLAP reservation flags order agnostic on overlap test. -- Preserve node features when slurmctld daemons are reconfigured, including active and available KNL features. -- Prevent creation of multiple io_timeout threads within srun, which can lead to fatal() messages when those unexpected and additional mutexes are destroyed when srun shuts down. -- burst_buffer/cray - Prevent use of "#DW create_persistent" and "#DW destroy_persistent" directives available in Cray CLE6.0UP06. This will be supported in Slurm version 18.08. Use "#BB" directives until then. -- Fix task/cgroup affinity to behave correctly. -- FreeBSD - fix build on systems built with WITHOUT_KERBEROS. -- Fix to restore pn_min_memory calculated result to correctly enforce MaxMemPerCPU setting on a partition when the job uses --mem. -- slurmdbd - prevent infinite loop if a QOS is set to preempt itself. -- Fix issue with log rotation for slurmstepd processes. * Changes in Slurm 17.11.3-2 ============================ -- Revert node_features changes in 17.11.3 that lead to various segfaults on slurmctld startup.
* Changes in Slurm 17.11.3 ========================== -- Send SIG_UME correctly to a step. -- Sort sreport's reservation report by cluster, time_start, resv_name instead of cluster, resv_name, time_start. -- Avoid setting node in COMPLETING state indefinitely if the job initiating the node reboot is cancelled while the reboot is in progress. -- Scheduling fix for changing node features without any NodeFeatures plugins. -- Improve logic when summarizing job arrays mail notifications. -- Add scontrol -F/--future option to display nodes in FUTURE state. -- Fix REASONABLE_BUF_SIZE to actually be 3/4 of MAX_BUF_SIZE. -- When a job array is preempting make it so tasks in the array don't wait to preempt other possible jobs. -- Change free_buffer to FREE_NULL_BUFFER to prevent possible double free in slurmstepd. -- node_feature/knl_cray - Fix memory leaks that occur when slurmctld is reconfigured. -- node_feature/knl_cray - Fix memory leak that can occur during normal operation. -- Fix srun environment variables for --prolog script. -- Fix job array dependency with "aftercorr" option and some task arrays in the first job fail. This fix lets all task array elements that can run proceed rather than stopping all subsequent task array elements. -- Fix potential deadlock in the slurmctld when using list_for_each. -- Fix for possible memory corruption in srun when running heterogeneous job steps. -- Fix output file containing "%t" (task ID) for heterogeneous job step to be based upon global task ID rather than task ID for that component of the heterogeneous job step. -- MYSQL - Fix potential abort when attempting to make an account a parent of itself. -- Fix potentially uninitialized variable in slurmctld.
-- MYSQL - Fix issue for multi-dimensional machines when using sacct to find jobs that ran on specific nodes. -- Reject --acctg-freq at submit if invalid. -- Added info string on sh5util when deleting an empty file. -- Correct dragonfly topology support when job allocation specifies desired switch count. -- Fix minor memory leak on an sbcast error path. -- Fix issues when starting the backup slurmdbd. -- Revert uid check when requesting a jobid from a pid. -- task/cgroup - add support to detect OOM_KILL cgroup events. -- Fix whole node allocation cpu counts when --hint=nomultithread. -- Allow execution of task prolog/epilog when uid has access rights by a secondary group id. -- Validate command existence on the srun *[pro|epi]log options if LaunchParameter test_exec is set. -- Fix potential memory leak if clean starting and the TRES didn't change from when last started. -- Fix for association MaxWall enforcement when none is given at submission. -- Add a job's allocated licenses to the [Pro|Epi]logSlurmctld. -- burst_buffer/cray: Attempts by job to create persistent burst buffer when one already exists owned by a different user will be logged and the job held. -- CRAY - Remove race in the core_spec where we add the slurmstepd to the job container where if the step was canceled would also cancel the stepd erroneously. -- Make sure the slurmstepd blocks signals like SIGTERM correctly. -- SPANK - When slurm_spank_init_post_opt() fails return error correctly. -- When revoking a sibling job in the federation we want to send a start message before purging the job record to get the uid of the revoked job. -- Make JobAcctGatherParams options case-insensitive. Previously, UsePss was the only correct capitalization; UsePSS or usepss were silently ignored. -- Prevent pthread_atfork handlers from being added unnecessarily after 'scontrol reconfigure', which can eventually lead to a crash if too many handlers have been registered.
-- Better debug messages when MaxSubmitJobs is hit. -- Docs - update squeue man page to describe all possible job states. -- Prevent orphaned step_extern steps when a job is cancelled while the prolog is still running. * Changes in Slurm 17.11.2 ========================== -- jobcomp/elasticsearch - append Content-Type to the HTTP header. -- MYSQL - Fix potential abort of slurmdbd when job has no TRES. -- Add advanced reservation flag of "REPLACE_DOWN" to replace DOWN or DRAINED nodes. -- slurm.spec-legacy - add missing libslurmfull.so to slurm.files. -- Fix squeue job ID filtering for pending job array records. -- Fix potential deadlock in _run_prog() in power save code. -- MYSQL - Add dynamic_offset in the database to force range for auto increment ids for the tres_table. -- MYSQL - Fix fallout from MySQL auto increment bug, see RELEASE_NOTES, only affects current 17.11 users tracking licenses or GRES in the database. -- Refactor logging logic to avoid possible memory corruption on non-x86 architectures. -- Fix memory leak when getting jobs from the slurmdbd. -- Fix incorrect logic behind MemorySwappiness, and only set the value when specified in the configuration. * Changes in Slurm 17.11.1-2 ============================ -- MYSQL - Make index for pack_job_id * Changes in Slurm 17.11.1 ========================== -- Fix --with-shared-libslurm option to work correctly. -- Make it so only daemons log errors on configuration option duplicates. -- Fix for ConstrainDevices=yes to work correctly. -- Fix to purge old jobs using burst buffer if slurmctld daemon restarted after the job's burst buffer work was already completed. -- Make logging prefix for slurmstepd to happen as soon as possible. -- mpi/pmix: Fix the job registration for the PMIx v2.1. -- Fix uid check for signaling a step with anything but SIGKILL. -- Return ESLURM_TRANSITION_STATE_NO_UPDATE instead of EAGAIN when trying to signal a step that is still running a prolog. 
-- Update Cray slurm_playbook.yaml with latest recommended version. -- Only say a prolog is done running after the extern step is launched. -- Wait to start a batch step until the prolog and extern step are fully ran/launched. Only matters if running with PrologFlags=[contain|alloc]. -- Truncate a range for SlurmctldPort to FD_SETSIZE elements and throw an error, otherwise network traffic may be lost due to poll() not detecting traffic. -- Fix for srun --pack-group option that can reuse/corrupt memory. -- Fix handling ultra long hostlists in a hostfile. -- X11: fix xauth regex to handle '-' in hostnames again. -- Fix potential node reboot timeout problem for "scontrol reboot" command. -- Add ability for squeue to sort jobs by submit time. -- CRAY - Switch to standard pid files on Cray systems. -- Update jobcomp records on duplicate inserts. -- If unrecognized configuration file option found then print an appropriate fatal error message rather than relying upon random errno value. -- Initialize job_desc_msg_t's instead of just memset'ing them. -- Fix divide by zero when job requests no tasks and more memory than MaxMemPer{CPU|NODE}. -- Avoid changing Slurm internal errno on syslog() failures. -- BB - Only launch dependent jobs after the burst buffer is staged-out completely instead of right after the parent job finishes. -- node_features/knl_generic - If plugin can not fully load then do not spawn a background pthread (which will fail with invalid memory reference). -- Don't set the next jobid to give out to the highest jobid in the system on controller startup. Just use the checkpointed next use jobid. -- Docs - add Slurm/PMIx and OpenMPI build notes to the mpi_guide page. -- Add lustre_no_flush option to LaunchParameters for Native Cray systems. -- Fix rpmbuild issue with rpm 4.13+ / Fedora 25+. -- sacct - fix the display for the NNodes field when using the --units option. -- Prevent possible double-xfree on a buffer in stepd_completion. 
-- Fix for record job state on successful allocation but failed reply message. -- Fill in the user_name field for batch jobs if not sent by the slurmctld (which is the default behavior if LaunchParameters=send_gids is not enabled). This prevents job launch problems for sites using UsePAM=1. -- Handle syncing federated jobs that ran on non-origin clusters and were cancelled while the origin cluster was down. -- Fix accessing variable outside of lock. -- slurm.spec: move libpmi to a separate package to solve a conflict with the version provided by PMIx. This will require a separate change to PMIx as well. -- X11 forwarding: change xauth handling to use hostname/unix:display format, rather than localhost:display. -- mpi/pmix - Fix warning if not compiling with debug. * Changes in Slurm 17.11.0 ========================== -- Fix documentation for MaxQueryTimeRange option in slurmdbd.conf. -- Avoid srun abort trying to run on heterogeneous job component that has ended. -- Add SLURM_PACK_JOB_ID,SLURM_PACK_JOB_OFFSET to PrologSlurmctld and EpilogSlurmctld environment. -- Treat ":" in #SBATCH arguments as fatal error. The "#SBATCH packjob" syntax must be used instead. -- job_submit/lua plugin: expose pack_job fields to get. -- Prevent scheduling deadlock with multiple components of heterogeneous job in different partitions (i.e. one heterogeneous job component is higher priority in one partition and another component is lower priority in a different partition). -- Fix for heterogeneous job starvation bug. -- Fix some slurmctld memory leaks. -- Add SLURM_PACK_JOB_NODELIST to PrologSlurmctld and EpilogSlurmctld environment. -- If PrologSlurmctld fails for pack job leader then requeue or kill all components of the job. -- Fix for multiple --pack-group srun arguments given out of order. -- Update slurm.conf(5) man page with updated example logrotate script. -- Add SchedulerParameters=whole_pack configuration parameter.
If set, then hold, release and cancel operations on any component of a heterogeneous job will be applied to all components. -- Handle FQDNs in xauth cookies for x11 display forwarding properly. -- For heterogeneous job steps, the srun --open-mode option default value will be set to "append". -- Pack job scheduling list not being cleared between runs of the backfill scheduler resulted in various anomalies. -- Fix backward compatibility for pmix versions < 1.1.5. -- Fix use-after-free that can lead to slurmstepd segfaulting when setting ulimit values. -- Add heterogeneous job start data to sdiag output. -- X11 forwarding - handle systems with X11UseLocalhost=no set in sshd_config. -- Fix potential issue with missing symbols in gres plugins. -- Ignore querying clusters in federation that are down from status commands. -- Base federated jobs off of origin job and not the local cluster in API. -- Remove erroneous double '-' on rpath for libslurmfull. -- Remove version from libslurmfull and move it to $LIBDIR/slurm since the ABI could change from one version to the other. -- Fix unused wall time for reservations. -- Convert old reservation records to insert unused wall into the rows. -- slurm.spec: further restructuring and improvements. -- Allow node state to be updated between FAIL and DRAIN. -- x11 forwarding: handle build with alternate location for libssh2. * Changes in Slurm 17.11.0rc3 ============================== -- Fix extern step to wait until launched before allowing job to start. -- Add missing locks around figuring out TRES when clean starting the slurmctld. -- Cray modulefile: avoid removing /usr/bin from path on module unload. -- Make reoccurring reservations show up in the database. -- Adjust related resources (cpus, tasks, gres, mem, etc.) when updating NumNodes with scontrol. -- Don't initialize MPI plugins for batch or extern steps. -- slurm.spec - do not install a slurm.conf file under /etc/ld.so.conf.d.
-- X11 forwarding - fix keepalive message generation code. -- If heterogeneous job step is unable to acquire MPI reserved ports then avoid referencing NULL pointer. Retry assigning ports ONLY for non-heterogeneous job steps. -- If any acct_gather_*_init fails fatal instead of error and keep going. -- launch/slurm plugin - Avoid using global variable for heterogeneous job steps, which could corrupt memory. * Changes in Slurm 17.11.0rc2 ============================== -- Prevent slurmctld abort with NodeFeatures=knl_cray and non-KNL nodes lacking any configured features. -- The --cpu_bind and --mem_bind options have been renamed to --cpu-bind and --mem-bind for consistency with the rest of Slurm's options. Both old and new syntaxes are supported for now. -- Add slurmdb_connection_commit to the slurmdb api to commit when needed. -- Add the federation api's to the slurmdb.h file. -- Add job functions to the db_api. -- Fix sacct to always use the db_api instead of sometimes calling functions directly. -- Fix sacctmgr to always use the db_api instead of sometimes calling functions directly. -- Fix sreport to always use the db_api instead of sometimes calling functions directly. -- Make global uid to the db_api to minimize calls to getuid(). -- Add support for HWLOC version 2.0. -- Added more validation logic for updates to node features. -- Added node_features_p_node_update_valid() function to node_features plugin. -- If a job is held due to bad constraints and a node's features change then test the job again to see if can run with the new features. -- Added node_features_p_changible_feature() function to node_features plugin. -- Avoid rebooting a node if a job's requested feature is not under the control of the node_features plugin and is not currently active. -- node_features/knl_generic plugin: Do not clear a node's non-KNL features specified in slurm.conf. 
-- Added SchedulerParameters configuration option "disable_hetero_steps" to disable job steps that span multiple components of a heterogeneous job. Disabled by default except with mpi/none plugin. This limitation to be removed in Slurm version 18.08. * Changes in Slurm 17.11.0rc1 ============================== -- Added the following jobcomp/script environment variables: CLUSTER, DEPENDENCY, DERIVED_EC, EXITCODE, GROUPNAME, QOS, RESERVATION, USERNAME. The format of LIMIT (job time limit) has been modified to D-HH:MM:SS. -- Fix QOS usage factor applying to individual TRES run minute usage. -- Print numbers using exponential format if required to fit in allocated field width. The sacctmgr and sshare commands are impacted. -- Make it so a backup DBD doesn't attempt to create database tables and relies on the primary to do so. -- By default have Slurm dynamically link to libslurm.so instead of static linking. If static linking is desired configure with --without-shared-libslurm. -- Change --workdir in sbatch to be --chdir as in all other commands (salloc, srun). -- Add WorkDir to the job record in the database. -- Make the UsageFactor of a QOS work when a qos has the nodecay flag. -- Add MaxQueryTimeRange option to slurmdbd.conf to limit accounting query ranges when fetching job records. -- Add LaunchParameters=batch_step_set_cpu_freq to allow the setting of the cpu frequency on the batch step. -- CRAY - Fix statically linked applications to CRAY's PMI. -- Fix - Raise an error back to the user when trying to update currently unsupported core-based reservations. -- Do not print TmpDisk space as part of 'slurmd -C' line. -- Fix to test MaxMemPerCPU/Node partition limits when scheduling, previously only checked on submit. -- Work for heterogeneous job support (complete solution in v17.11): * Set SLURM_PROCID environment variable to reflect global task rank (needed by MPI). * Set SLURM_NTASKS environment variable to reflect global task count (needed by MPI). 
* In srun, if only some steps are allocated and one step allocation fails, then delete all allocated steps. * Get SPANK plugins working with heterogeneous jobs. The spank_init_post_opt() function is executed once per job component. * Modify sbcast command and srun's --bcast option to support heterogeneous jobs. * Set more environment variables for MPI: SLURM_GTIDS and SLURM_NODEID. * Prevent a heterogeneous job allocation from including the same nodes in multiple components (required by MPI jobs spanning components). * Modify step create logic so that all components of a heterogeneous job launched by a single srun command have the same step ID value. -- Modify output of "--mpi=list" to avoid duplicates for version numbers in mpi/pmix plugin names. -- Allow nodes to be rebooted while in a maintenance reservation. -- Show nodes as down even when nodes are in a maintenance reservation. -- Harden the slurmctld HA stack to mitigate certain split-brain issues. -- Work for heterogeneous job support (complete solution in v17.11): * Add burst buffer support. * Remove srun's --mpi-combine option (always combined). * Add SchedulerParameters configuration option "enable_hetero_steps" to enable job steps that span multiple components of a heterogeneous job. Disabled by default as most MPI implementations and Slurm configurations are not currently supported. Limitation to be removed in Slurm version 18.08. * Synchronize application launch across multiple components with debugger. * Modify slurm_kill_job_step() to cancel all components of a heterogeneous job step (used by MPI). * Set SLURM_JOB_NUM_NODES environment variable as needed by MVAPICH. * Base time limit upon the time that the latest job component is available (after all nodes in all components booted and ready for use). -- Add cluster name to smail tool email header. -- Speedup arbitrary distribution algorithm.
-- Modify "srun --mpi=list" output to match valid option input by removing the "mpi/" prefix on each line of output. -- Automatically set the reservation's partition for the job if not the cluster default. -- mpi/pmi2 plugin - vestigial pointer could be referenced at shutdown with invalid memory reference resulting. -- Fix _is_gres_cnt_zero() to return false for an improper input string. -- Cleanup all pthread_create calls and replace with new slurm_thread_create macro. -- Removed obsolete MPI plugins. Remaining options are openmpi, pmi2, pmix. -- Removed obsolete checkpoint/poe plugin. -- Process spank environment variable options before processing spank command line options. Spank plugins should be able to handle option callbacks being called multiple times. -- Add support for specialized cores with task/affinity plugin (previously only supported with task/cgroup plugin). -- Add "TaskPluginParam=SlurmdOffSpec" option that will prevent the Slurm compute node daemons (slurmd and slurmstepd) from executing on specialized cores. -- CRAY - Make native mode default, use --disable-native-cray to use ALPS instead of native Slurm. -- Add ability to prevent suspension of some count of nodes in a specified range using the SuspendExcNodes configuration parameter. -- Add SLURM_WCKEY to PrologSlurmctld and EpilogSlurmctld environment. -- Return user response string in response to successful job allocation request not only on failure. Set in LUA using function 'slurm.user_msg("STRING")'. -- Add 'scontrol write batch_script ' command to retrieve the batch script for a given job. -- Remove option to display the batch script as part of 'scontrol show job'. -- On native Cray systems the configured RebootProgram is executed on the head node by the slurmctld daemon rather than by the slurmd daemons on the compute nodes. The "capmc_resume" program from "contribs/cray" can be used. 
-- Modify "scontrol top" command to accept a comma separated list of job IDs as an argument rather than a single job ID. -- Add MemorySwappiness value to cgroup.conf. -- Add new "billing" TRES which allows jobs to be limited based on the job's billable TRES calculated by the job's partition's TRESBillingWeights. -- sbatch - force line-buffered output so 'sbatch -W' returns the jobid over a piped output immediately. -- Regular user use of "scontrol top" command is now disabled. Use the configuration parameter "SchedulerParameters=enable_user_top" to enable that functionality. The configuration parameter "SchedulerParameters=disable_user_top" will be silently ignored. -- Add -TALL to sreport. -- Removed unused SlurmdPlugstack option and associated framework. -- Correct logic for line continuation in srun --multi-prog file. -- Add DBD Agent queue size to sdiag output. -- Add running job count to sdiag output. -- Print unix timestamps next to ASCII timestamps in sdiag output. -- In a job allocation spanning KNL and non-KNL nodes and requiring a reboot, do not attempt to set default NUMA or MCDRAM modes on non-KNL nodes. -- Change the default handling of pending jobs after their reservation is gone: rather than letting them run outside of the reservation, put them in held state. Added NO_HOLD_JOBS_AFTER_END reservation flag to use old default. -- When creating a reservation, validate the CoreCnt specification matches the number of nodes listed. -- When creating a reservation, correct logic for ignoring job allocations on request. -- Deprecate BLCR plugin, and do not build by default. -- Change sreport report titles from "Use" to "Usage" * Changes in Slurm 17.11.0pre2 ============================== -- Initial work for heterogeneous job support (complete solution in v17.11): * Modified salloc, sbatch and srun commands to parse command line, job script and environment variables to recognize requests for heterogeneous jobs. 
Same commands also modified to set environment variables describing each component of the heterogeneous job. * Modified job allocate, batch job submit and job "will-run" requests to pass a list of job specifications and get a list of responses. * Modify slurmctld daemon to process a heterogeneous job request and create multiple job records as needed. * Added new fields to job record: pack_job_id, pack_job_offset and pack_job_set (set of job IDs). Added to slurmctld state save/restore logic and job information reported. * Display new job fields in "scontrol show job" output. * Modify squeue command to display heterogeneous job records using "#+#" format. The squeue --job=# output lists all components of a heterogeneous job. * Modify scancel logic to cancel all components of a heterogeneous job with a single request/RPC. * Configuration parameter DebugFlags value of "HeteroJobs" added. * Job requeue and suspend/resume modified to operate on all components of a heterogeneous job with a single request/RPC. * New web page added to describe heterogeneous jobs. * Descriptions of new API added to man pages. * Modified email notifications to only operate on the first job component. * Purge heterogeneous job records at the same time and not by individual components. * Modified logic for heterogeneous jobs submitted to multiple clusters ("--clusters=...") so the job will be routed to the cluster that is expected to start all components earliest. * Modified srun to create multiple job steps for heterogeneous job allocations. * Modified launch plugin to accept a pointer to job step options structure rather than work from a single/common data structure. -- Improve backfill scheduling algorithm with respect to starting jobs as soon as possible while avoiding advanced reservations. -- Add URG as an option to 'scancel --signal'. -- Check if the buffer returned from slurm_persist_msg_pack() isn't NULL. -- Modify all daemons to re-open log files on receipt of SIGUSR2 signal. 
This is much faster than using SIGHUP to re-read the configuration file and rebuild various tables. -- Add PrivateData=events configuration parameter -- Work for heterogeneous job support (complete solution in v17.11): * Add pointer to job option structure to job_step_create_allocation() function used by srun. * Parallelize task launch for heterogeneous job allocations (initial work). * Make packjobid, packjoboffset, and packjobidset fields available in squeue output. * Modify smap command to display heterogeneous job records using "#+#" format. * Add srun --pack-group and --mpi-combine options to control job step launch behaviour (not fully implemented). * Add pack job component ID to srun --label output (e.g. "P0 1:" for job component 0 and task 1). * jobcomp/elasticsearch: Add pack_job_id and pack_job_offset fields. * sview: Modified to display pack job information. * Major re-write of task state container logic to support a list of containers rather than one container per srun command. * Add some regression tests. * Add srun pack job environment variables when performing job allocation. -- Set Reason=dependency over Reason=JobArrayTaskLimit for pending jobs. -- Add slurm.conf configuration parameters SlurmctldSyslogDebug and SlurmdSyslogDebug to control which messages from the slurmctld and slurmd daemons get written to syslog. -- Add slurmdbd.conf configuration parameter DebugLevelSyslog to control which messages from the slurmdbd daemon get written to syslog. -- Fix handling of GroupUpdateForce option. -- Work for heterogeneous job support (complete solution in v17.11): * Add support to sched/backfill for concurrent allocation of all pack job components including support of --time-min option. * Defer initiation of a heterogeneous job until all components can be started at the same time, taking into consideration association and QOS limits for the job as a whole. 
* Perform limit check on heterogeneous job as a whole at submit time to reject jobs that will never be able to run. * Add pack_job_id and pack_job_offset to accounting database. * Modified sacct to accept pack job ID specification using "#+#" notation. * Modified sstat to accept pack job ID specification using "#+#" notation. -- Clear a job's "wait reason" value of "BeginTime" after that time has passed. Previously a reason of "BeginTime" could be reported long after the job's requested begin time had passed. -- Split group_info in slurm_ctl_conf_t into group_force and group_time. -- Work for heterogeneous job support (complete solution in v17.11): * Fix I/O race condition on step termination for srun launching multiple pack job groups. * If prolog is running when attempting to signal a step, then return EAGAIN and retry rather than simply returning SLURM_ERROR and aborting. * Modify launch/slurm plugin to signal all components of a pack job rather than just the one (modify to use a list of step context records). * Add logic to support srun --mpi-combine option. * Set up debugger data structures. * Disable cancellation of individual component while the job is pending. * Modify scontrol job hold/release and update to operate with heterogeneous job id specification (e.g. "scontrol hold 123+4"). * If srun lacks application specification for some component, the next one specified will be used for earlier components. * Changes in Slurm 17.11.0pre1 ============================== -- Interpret all format options in output/error file to log prolog errors. Prior logic only supported "%j" (job ID) option. -- Add the configure option --with-shared-libslurm which will link to libslurm.so instead of libslurm.o thus reducing the footprint of all the binaries. -- In switch plugin, added plugin_id symbol to plugins and wrapped switch_jobinfo_t with dynamic_plugin_data_t in interface calls in order to pass switch information between clusters with different switch types. 
-- Switch naming of acct_gather_infiniband to acct_gather_interconnect -- Make it so you can "stack" the interconnect plugins. -- Add a last_sched_eval timestamp to record when a job was last evaluated by the main scheduler or backfill. -- Add scancel "--hurry" option to avoid staging out any burst buffer data. -- Simplify the sched plugin interface. -- Add new advanced reservation flags of "weekday" (repeat on each weekday; Monday through Friday) and "weekend" (repeat on each weekend day; Saturday and Sunday). -- Add new advanced reservation flag of "flex", which permits jobs requesting the reservation to begin prior to the reservation's start time and use resources inside or outside of the reservation. A typical use case is to prevent jobs not explicitly requesting the reservation from using those reserved resources rather than forcing jobs requesting the reservation to use those resources in the time frame reserved. -- Add NoDecay flag to QOS. -- Node "OS" field expanded from "sysname" to "sysname release version" (e.g. change from "Linux" to "Linux 4.8.0-28-generic #28-Ubuntu SMP Sat Feb 8 09:15:00 UTC 2017"). -- jobcomp/elasticsearch - Add "job_name" and "wc_key" fields to stored information. -- jobcomp/filetxt - Add ArrayJobId, ArrayTaskId, ReservationName, Gres, Account, QOS, WcKey, Cluster, SubmitTime, EligibleTime, DerivedExitCode and ExitCode. -- scontrol modified to report core IDs for reservation containing individual cores. -- MYSQL - Get rid of table join during rollup which speeds up the process dramatically on large job/step tables. -- Add ability to define features on clusters for directing federated jobs to different clusters. -- Add new RPC to process multiple federation RPCs in a single communication. -- Modify slurm_load_jobs() function to load job information from all clusters in a federation. -- Add squeue --local and --sibling options to modify filtering of jobs on federated clusters. 
-- Add SchedulerParameters option of bf_max_job_user_part to specify the maximum number of jobs per user for any single partition. This differs from bf_max_job_user in that a separate counter is applied to each partition rather than having a single counter per user applied to all partitions. -- Modify backfill logic so that bf_max_job_user, bf_max_job_part and bf_max_job_user_part options can all be used independently of each other. -- Add sprio -p/--partition option to filter jobs by partition name. -- Add partition name to job priority factor response message. -- Add sprio --local and --sibling options for use in federation of clusters. -- Add sprio "%c" format to print cluster name in federation mode. -- Modify sinfo logic to provide a unified view of all nodes and partitions in a federation, add --local option to only report local state information even in a cluster, print cluster name with "%V" format option, and optionally sort by cluster name. -- If a task in a parallel job fails and it was launched with the --kill-on-bad-exit option then terminate the remaining tasks using the SIGCONT, SIGTERM and SIGKILL signals rather than just sending SIGKILL. -- Include submit_time when doing the sort for job scheduling. -- Modify sacct to report all jobs in federation by default. Also add --local option. -- Modify sacct to accept "--cluster all" option (in addition to the old "--cluster -1", which is still accepted). -- Modify sreport to report all jobs in federation by default. Also add --local option. -- sched/backfill: Improve assoc_limit_stop configuration parameter support. -- KNL features: Always keep active and available features in the same order: first site-specific features, next MCDRAM modes, last NUMA modes. -- Changed default ProctrackType to cgroup. -- Add "cluster_name" field to node_info_t and partition_info_t data structure. It is filled in only when the cluster is part of a federation and SHOW_FEDERATION flag used. 
-- Functions slurm_load_node() and slurm_load_partitions() modified to show all nodes/partitions in a federation when the SHOW_FEDERATION flag is used. -- Add federated views to sview. -- Add --federation option to sacct, scontrol, sinfo, sprio, squeue, sreport to show a federated view. Will show local view by default. -- Add FederationParameters=fed_display slurm.conf option to configure status commands to display a federated view by default if the cluster is a member of a federation. -- Log the down nodes whenever slurmctld restarts. -- Report that "CPUs" plus "Boards" in node configuration invalid only if the CPUs value is not equal to the total thread count. -- Extend the output of the seff utility to also include the job's wall-clock time. -- Add bf_max_time to SchedulerParameters. -- Add bf_max_job_assoc to SchedulerParameters. -- Add new SchedulerParameters option bf_window_linear to control the rate at which the backfill test window expands. This can be used on a system with a modest number of running jobs (hundreds of jobs) to help prevent expected start times of pending jobs from being pushed forward in time. On systems with large numbers of running jobs, performance of the backfill scheduler will suffer and fewer jobs will be evaluated. -- Improve scheduling logic with respect to license use and node reboots. -- CRAY - Alter algorithm to come up with the SLURM_ID_HASH. -- Implement federated scheduling and federated status outputs. -- The '-q' option to srun has changed from being the short form of '--quit-on-interrupt' to '--qos'. -- Change sched_min_interval default from 0 to 2 microseconds. * Changes in Slurm 17.02.12 ========================== -- Fix segfault in slurmdbd hourly rollup when having a job outside a reservation, with no end_time set, from an assoc that's in a reservation. * Changes in Slurm 17.02.11 ========================== -- Fix insecure handling of user_name and gid fields. CVE-2018-10995. 
* Changes in Slurm 17.02.10 ========================== -- Fix updating of requested TRES memory. -- Cray modulefile: avoid removing /usr/bin from path on module unload. -- Fix issue when resetting the partition pointers on nodes. -- Show reason field in 'sinfo -R' when a node is marked as failed. -- Fix potential of slurmstepd segfaulting when the extern step fails to start. -- Allow nodes state to be updated between FAIL and DRAIN. -- Avoid registering a job's credential multiple times. -- Fix sbatch --wait to stop waiting after job is gone from memory. -- Fix memory leak of MailDomain configuration string when slurmctld daemon is reconfigured. -- Fix to properly remove extern steps from the starting_steps list. -- Fix Slurm to work correctly with HDF5 1.10+. -- Add support in salloc/srun --bb option for "access_mode" in addition to "access" for consistency with DW options. -- Fix potential deadlock in _run_prog() in power save code. -- MYSQL - Add dynamic_offset in the database to force range for auto increment ids for the tres_table. -- Avoid setting node in COMPLETING state indefinitely if the job initiating the node reboot is cancelled while the reboot is in progress. -- node_feature/knl_cray - Fix memory leaks that occur when slurmctld reconfigured. -- node_feature/knl_cray - Fix memory leak that can occur during normal operation. -- Fix job array dependency with "aftercorr" option when some array tasks in the first job fail. This fix lets all task array elements that can run proceed rather than stopping all subsequent task array elements. -- Fix whole node allocation cpu counts when --hint=nomultithread. -- NRT - Fix issue when running on a HFI (p775) system with multiple protocols. -- Fix uninitialized variables when unpacking slurmdb_archive_cond_t. -- Fix security issue in accounting_storage/mysql plugin by always escaping strings within the slurmdbd. CVE-2018-7033. 
* Changes in Slurm 17.02.9 ========================== -- When resuming powered down nodes, mark DOWN nodes right after ResumeTimeout has been reached (previous logic would wait about one minute longer). -- Fix sreport not showing full column name for TRES Count. -- Fix slurmdb_reservations_get() giving wrong usage data when jobs spanned a reservation that was modified. -- Fix sreport reservation utilization report showing bad data. -- Show all TRES' on a reservation in sreport reservation utilization report by default. -- Fix sacctmgr show reservation handling "end" parameter. -- Work around issue with sysmacros.h and gcc7 / glibc 2.25. -- Fix layouts code to only allow setting a boolean. -- Fix sbatch --wait to keep waiting even if a message timeout occurs. -- CRAY - If configured with NodeFeatures=knl_cray and there are non-KNL nodes which include no features the slurmctld will abort without this patch when attempting strtok_r(NULL). -- Fix regression in 17.02.7 which would run the spank_task_privileged as part of the slurmstepd instead of its child process. -- Fix security issue in Prolog and Epilog by always prepending SPANK_ to all user-set environment variables. CVE-2017-15566. * Changes in Slurm 17.02.8 ========================== -- Add 'slurmdbd:' to the accounting plugin to notify message is from dbd instead of local. -- mpi/mvapich - Buffer being only partially cleared. No failures observed. -- Fix for job --switch option on dragonfly network. -- In salloc with --uid option, drop supplementary groups before changing UID. -- jobcomp/elasticsearch - strip any trailing slashes from JobCompLoc. -- jobcomp/elasticsearch - fix memory leak when transferring generated buffer. -- Prevent slurmstepd ABRT when parsing gres.conf CPUs. -- Fix sbatch --signal to signal all MPI ranks in a step instead of just those on node 0. -- Check multiple partition limits when scheduling a job that were previously only checked on submit. 
-- Cray: Avoid running application/step Node Health Check on the external job step. -- Optimization enhancements for partition based job preemption. -- Address some build warnings from GCC 7.1, and one possible memory leak if /proc is inaccessible. -- If creating/altering a core based reservation with scontrol/sview on a remote cluster correctly determine the select type. -- Fix autoconf test for libcurl when clang is used. -- Fix default location for cgroup_allowed_devices_file.conf to use correct default path. -- Document NewName option to sacctmgr. -- Reject a second PMI2_Init call within a single step to prevent slurmstepd from hanging. -- Handle old 32bit values stored in the database for requested memory correctly in sacct. -- Fix memory leaks in the task/cgroup plugin when constraining devices. -- Make extremely verbose info messages debug2 messages in the task/cgroup plugin when constraining devices. -- Fix issue that would deny the stepd access to /dev/null where GRES has a 'type' but no file defined. -- Fix issue where the slurmstepd would fatal on job launch if you have no gres listed in your slurm.conf but some in gres.conf. -- Fix validating time spec to correctly validate various time formats. -- Make scontrol work correctly with job update timelimit [+|-]=. -- Reduce the visibility of a number of warnings in _part_access_check. -- Prevent segfault in sacctmgr if no association name is specified for an update command. -- burst_buffer/cray plugin modified to work with changes in Cray UP05 software release. -- Fix job reasons for jobs that are violating assoc MaxTRESPerNode limits. -- Fix segfault when unpacking a 16.05 slurm_cred in a 17.02 daemon. -- Fix setting TRES limits with case insensitive TRES names. -- Add alias for xstrncmp() -- slurm_xstrncmp(). -- Fix sorting of case insensitive strings when using xstrcasecmp(). -- Gracefully handle race condition when reading /proc as process exits. 
-- Avoid error on Cray duplicate setup of core specialization. -- Skip over undefined (hidden in Slurm) nodes in pbsnodes. -- Add empty hashes in perl api's slurm_load_node() for hidden nodes. -- CRAY - Add rpath logic to work for the alpscomm libs. -- Fixes for administrator extended TimeLimit (job reason & time limit reset). -- Fix gres selection on systems running select/linear. -- sview: Added window decorator for maximize,minimize,close buttons for all systems. -- squeue: interpret negative length format specifiers as a request to delimit values with spaces. -- Fix the torque pbsnodes wrapper script to parse a gres field with a type set correctly. * Changes in Slurm 17.02.7 ========================== -- Fix deadlock if requesting to create more than 10000 reservations. -- Fix potential memory leak when creating partition name. -- Execute the HealthCheckProgram once when the slurmd daemon starts rather than executing repeatedly until an exit code of 0 is returned. -- Set job/step start and end times to 0 when using --truncate and start > end. -- Make srun --pty option ignore EINTR allowing windows to resize. -- When resuming node only send one message to the slurmdbd. -- Modify srun --pty option to use configured SrunPortRange range. -- Fix issue with whole gres not being printed out with Slurm tools. -- Fix issue with multiple jobs from an array are prevented from starting. -- Fix for possible slurmctld abort with use of salloc/sbatch/srun --gres-flags=enforce-binding option. -- Fix race condition when using jobacct_gather/cgroup where the memory of the step wasn't always gathered correctly. -- Better debug when slurmdbd queue is filling up in the slurmctld. -- Fixed truncation on scontrol show config output. -- Serialize updates from the dbd to the slurmctld. -- Fix memory leak in slurmctld when agent queue to the DBD has filled up. -- CRAY - Throttle step creation if trying to create too many steps at once. 
-- If failing after switch_g_job_init happened make sure switch_g_job_fini is called. -- Fix minor memory leak if launch fails in the slurmstepd. -- Fix issue where UnkillableStepProgram if step was in an ending state. -- Fix bug when tracking multiple simultaneous spawned ping cycles. -- jobcomp/elasticsearch plugin now saves state of pending requests on slurmctld daemon shutdown so they can be recovered on restart. -- Fix issue when using an alternate munge key when communicating on a persistent connection. -- Document inconsistent behavior of GroupUpdateForce option. -- Fix bug in selection of GRES bound to specific CPUs where the GRES count is 2 or more. Previous logic could allocate CPUs not available to the job. -- Increase buffer to handle long /proc//stat output so that Slurm can read correct RSS value and take action on jobs using more memory than requested. -- Fix srun jobs that can run immediately to run in the highest priority partition when multiple partitions are listed. scontrol show jobs can potentially show the partition list in priority order. -- Fix starting controller if StateSaveLocation path didn't exist. -- Fix inherited association 'max' TRES limits combining multiple limits in the tree. -- Sort TRES id's on limits when getting them from the database. -- Fix issue with pmi[2|x] when TreeWidth=1. -- Correct buffer size used in determining specialized cores to avoid possible truncation of core specification and not reserving the specified cores. -- Close race condition on Slurm structures when setting DebugFlags. -- Make it so the cray/switch plugin grabs new DebugFlags on a reconfigure. -- Fix incorrect lock levels when creating or updating a reservation. -- Fix overlapping reservation resize. -- Add logic to help support Dell KNL systems where syscfg is different than the normal Intel syscfg. -- CRAY - Fix BB to handle type= correctly, regression in 17.02.6. 
* Changes in Slurm 17.02.6 ========================== -- Fix configurator.easy.html to output the SelectTypeParameters line. -- If a job requests a specific memory requirement then gets something else from the slurmctld make sure the step allocation is made aware of it. -- Fix missing initialization in slurmd. -- Fix potential degradation when running HTC (> 100 jobs a sec) like workflows through the slurmd. -- Fix race condition which could leave a stepd hung on shutdown. -- CRAY - Add configuration for ATP to the ansible play script. -- Fix potential to corrupt DBD message. -- burst_buffer logic modified to support sizes in both SI and IEC size units (e.g. M/MiB for powers of 1024, MB for powers of 1000). * Changes in Slurm 17.02.5 ========================== -- Prevent segfault if a job was blocked from running by a QOS that is then deleted. -- Improve selection of jobs to preempt when there are multiple partitions with jobs subject to preemption. -- Only set kmem limit when ConstrainKmemSpace=yes is set in cgroup.conf. -- Fix bug in task/affinity that could result in slurmd fatal error. -- Increase number of jobs that are tracked in the slurmd as finishing at one time. -- Note when a job finishes in the slurmd to avoid a race when launching a batch job takes longer than it takes to finish. -- Improve slurmd startup on large systems (> 10000 nodes) -- Add LaunchParameters option of cray_net_exclusive to control whether all jobs on the cluster have exclusive access to their assigned nodes. -- Make sure srun inside an allocation gets --ntasks-per-[core|socket] set correctly. -- Only make the extern step at job creation. -- Fix for job step task layout with --cpus-per-task option. -- Fix --ntasks-per-core option/environment variable parsing to set the requested value, instead of always setting one (srun). -- Correct error message when ClusterName in configuration files does not match the name in the slurmctld daemon's state save file. 
-- Better checking when a job is finishing to avoid underflow on jobs submitted to a QOS/association. -- Handle partition QOS submit limits correctly when a job is submitted to more than 1 partition or when the partition is changed with scontrol. -- Performance boost for when Slurm is dealing with credentials. -- Fix race condition which could leave a stepd hung on shutdown. -- Add lua support for opensuse. * Changes in Slurm 17.02.4 ========================== -- Do not attempt to schedule jobs after changing the power cap if there are already many active threads. -- Job expansion example in FAQ enhanced to demonstrate operation in heterogeneous environments. -- Prevent scontrol crash when operating on array and no-array jobs at once. -- knl_cray plugin: Log incomplete capmc output for a node. -- knl_cray plugin: Change capmc parsing of mcdram_pct from string to number. -- Remove log files from test20.12. -- When rebooting a node and using the PrologFlags=alloc make sure the prolog is run after the reboot. -- node_features/knl_generic - If a node is rebooted for a pending job, but fails to enter the desired NUMA and/or MCDRAM mode then drain the node and requeue the job. -- node_features/knl_generic disable mode change unless RebootProgram configured. -- Add new burst_buffer function bb_g_job_revoke_alloc() to be executed if there was a failure after the initial resource allocation. Does not release previously allocated resources. -- Test if the node_bitmap on a job is NULL when testing if the job's nodes are ready. This will be NULL if a job was revoked while beginning. -- Fix incorrect lock levels when testing when job will run or updating a job. -- Add missing locks to job_submit/pbs plugin when updating a job's dependencies. -- Add support for lua5.3 -- Add min_memory_per_node|cpu to the job_submit/lua plugin to deal with lua not being able to deal with pn_min_memory being a uint64_t. Scripts are urged to change to these new variables to avoid issues. 
If not set the variables will be 'nil'. -- Calculate priority correctly when 'nice' is given. -- Fix minor typos in the documentation. -- node_features/knl_cray: Preserve non-KNL active features if slurmctld reconfigured while node boot in progress. -- node_features/knl_generic: Do not repeatedly log errors when trying to read KNL modes if not KNL system. -- Add missing QOS read lock to backfill scheduler. -- When doing a dlopen on liblua only attempt the version compiled against. -- Fix null-dereference in sreport cluster utilization when configured with memory-leak-debug. -- Fix Partition info in 'scontrol show node'. Previously duplicate partition names, or Partitions the node did not belong to could be displayed. -- Fix it so the backup slurmdbd will take control correctly. -- Fix unsafe use of MAX() macro, which could result in problems cleaning up accounting plugins in slurmd, or repeat job cancellation attempts in scancel. -- Fix 'scontrol update reservation duration=unlimited' to set the duration to 365-days (as is done elsewhere), rather than 49710 days. -- Check if variable given to scontrol show job is a valid jobid. -- Fix WithSubAccounts option to not include WithDeleted unless requested. -- Prevent a job tested on multiple partitions from being marked WHOLE_NODE_USER. -- Prevent a race between completing jobs on a user-exclusive node from leaving the node owned. -- When scheduling take the nodes in completing jobs out of the mix to reduce fragmentation. SchedulerParameters=reduce_completing_frag -- For jobs submitted to multiple partitions, report the job's earliest start time for any partition. -- Backfill partitions that use QOS Grp limits to "float" better. -- node_features/knl_cray: don't clear configured GRES from non-KNL node. -- sacctmgr - prevent segfault in command when a request is denied due to insufficient privileges. -- Add warning about libcurl-devel not being installed during configure. 
-- Streamline job purge by handling file deletion on a separate thread. -- Always set RLIMIT_CORE to the maximum permitted for slurmd, to ensure core files are created even on non-developer builds. -- Fix --ntasks-per-core option/environment variable parsing to set the requested value, instead of always setting one. -- If trying to cancel a step that hasn't started yet for some reason return a good return code. -- Fix issue with sacctmgr show where user='' * Changes in Slurm 17.02.3 ========================== -- Increase --cpu_bind and --mem_bind field length limits. -- Fix segfault when using AdminComment field with job arrays. -- Clear Dependency field when all dependencies are satisfied. -- Add --array-unique to squeue which will display one unique pending job array element per line. -- Reset backfill timers correctly without skipping over them in certain circumstances. -- When running the "scontrol top" command, make sure that all of the user's jobs have a priority that is lower than the selected job. Previous logic would permit other jobs with equal priority (no jobs with higher priority). -- Fix perl api so we always get an allocation when calling Slurm::new(). -- Fix issue with cleaning up cpuset and devices cgroups when multiple steps end at the same time. -- Document that PriorityFlags option of DEPTH_OBLIVIOUS precludes the use of FAIR_TREE. -- Fix issue if an invalid message came in a Slurm daemon/command may abort. -- Make it impossible to use CR_CPU* along with CR_ONE_TASK_PER_CORE. The options are mutually exclusive. -- ALPS - Fix scheduling when ALPS doesn't agree with Slurm on what nodes are free. -- When removing a partition make sure it isn't part of a reservation. -- Fix seg fault if attempting to load non-existent burstbuffer plugin. -- Fix to backfill scheduling with respect to QOS and association limits. Jobs submitted to multiple partitions are most likely to be affected. 
 -- sched/backfill: Improve assoc_limit_stop configuration parameter support.
 -- CRAY - Add ansible play and README.
 -- sched/backfill: Fix bug related to advanced reservations and the need to
    reboot nodes to change KNL mode.
 -- Preempt plugins - fix check for 'preempt_youngest_first' option.
 -- Preempt plugins - fix incorrect casts in preempt_youngest_first mode.
 -- Preempt/job_prio - fix incorrect casts in sort function.
 -- Fix to make task/affinity work with ldoms where there are more than 64
    cpus on the node.
 -- When using node_features/knl_generic make it so the slurmd doesn't
    segfault when shutting down.
 -- Fix potential double-xfree() when using job arrays that can lead to
    slurmctld crashing.
 -- Fix priority/multifactor priorities on a slurmctld restart if not using
    accounting_storage/[mysql|slurmdbd].
 -- Fix NULL dereference reported by CLANG.
 -- Update proctrack documentation to strongly encourage use of
    proctrack/cgroup.
 -- Fix potential memory leak if job fails to begin after nodes have been
    selected for a job.
 -- Handle a job that made it out of the select plugin without a job_resrcs
    pointer.
 -- Fix potential race condition when persistent connections are being closed
    at shutdown.
 -- Fix incorrect lock levels when submitting a batch job or updating a job
    in general.
 -- CRAY - Move delay waiting for job cleanup to after we check once.
 -- MYSQL - Fix memory leak when loading archived jobs into the database.
 -- Fix potential race condition when starting the priority/multifactor
    plugin's decay thread.
 -- Sanity check to make sure we have started a job in acct_policy.c before we
    clear it as started.
 -- Allow reboot program to use arguments.
 -- Message Aggr - Remove race condition on slurmd shutdown with respect to
    destroying a mutex.
 -- Fix updating job priority on multiple partitions to be correct.
 -- Don't remove admin comment when updating a job.
 -- Return error when bad separator is given for scontrol update job licenses.

* Changes in Slurm 17.02.2
==========================
 -- Update hyperlink to LBNL Node Health Check program.
 -- burst_buffer/cray - Add support for line continuation.
 -- If a job is cancelled by the user while its allocated nodes are being
    reconfigured (i.e. the capmc_resume program is rebooting nodes for the
    job) and the node reconfiguration fails (i.e. the reboot fails), then
    don't requeue the job but leave it in a cancelled state.
 -- capmc_resume (Cray resume node script) - Do not disable changing a node's
    active features if SyscfgPath is configured in the knl.conf file.
 -- Improve the srun documentation for the --resv-ports option.
 -- burst_buffer/cray - Fix parsing for discontinuous allocated nodes. A job
    allocation of "20,22" must be expressed as "20\n22".
 -- Fix rare segfault when shutting down slurmctld and still sending data to
    the database.
 -- Fix gres output of a job that is updated while pending so that it is
    displayed correctly by Slurm tools.
 -- Fix pam_slurm_adopt.
 -- Fix missing unlock when job_list doesn't exist when starting
    priority/multifactor.
 -- Fix segfault if slurmctld is shutting down and the slurmdbd plugin was in
    the middle of setting db_indexes.
 -- Add ESLURM_JOB_SETTING_DB_INX to errno to note when a job can't be updated
    because the dbd is setting a db_index.
 -- Fix possible double insertion into database when a job is updated at the
    moment the dbd is assigning a db_index.
 -- Fix memory error when updating a job's licenses.
 -- Fix seff to work correctly with non-standard perl installs.
 -- Export missing slurmdbd_defs_[init|fini] needed for libslurmdb.so to work.
 -- Fix sacct returning far more than requested when querying against a job
    array task id.
 -- Fix double read lock of tres when updating gres or licenses on a job.
 -- Make sure locks are always in place when calling
    assoc_mgr_make_tres_str_from_array.
 -- Prevent slurmctld SEGV when creating reservation with duplicated name.
 -- Consider QOS flags Partition[Min|Max]Nodes when doing backfill.
 -- Fix slurmdbd_defs.c to not have half symbols go to libslurm.so and the
    other half go to libslurmdb.so.
 -- Fix 'scontrol show jobs' to remove an errant newline when 'Switches' is
    printed.
 -- Better code for handling memory required by a task on a heterogeneous
    system.
 -- Fix regression in 17.02.0 with respect to GrpTresMins on a QOS or
    Association.
 -- Cleanup to make "make dist" work.
 -- Schedule interactive jobs quicker.
 -- Perl API - correct value of MEM_PER_CPU constant to correctly handle
    memory values.
 -- Fix 'flags' variable to be 32 bit from the old 16 bit value in the perl
    api.
 -- Export sched_nodes for a job in the perl api.
 -- Improve error output when updating a reservation that has already started.
 -- Fix --ntasks-per-node issue with srun so DenyOnLimit would work correctly.
 -- node_features/knl_cray plugin - Fix memory leak.
 -- Fix wrong cpu_per_task count issue on heterogeneous system when dealing
    with steps.
 -- Fix double free issue when removing usage from an association with
    sacctmgr.
 -- Fix issue with SPANK plugins attempting to set null values as environment
    variables, which leads to the command segfaulting on newer glibc versions.
 -- Fix race condition on slurmctld startup when plugins have not gone through
    init() ahead of the rpc_manager processing incoming messages.
 -- job_submit/lua - expose admin_comment field.
 -- Allow AdminComment field to be set by the job_submit plugin.
 -- Allow AdminComment field to be changed by any Administrator.
 -- Fix key words in jobcomp select.
 -- MYSQL - Streamline job flush sql when doing a clean start on the
    slurmctld.
 -- Fix potential infinite loop when talking to the DBD when shutting down the
    slurmctld.
 -- Fix MCS filter.
 -- Make it so pmix can be included in the plugin rpm without having to
    specify --with-pmix.
 -- MYSQL - Fix initial load when not using the DBD.
 -- Fix scontrol top to not make jobs priority 0 (held).
 -- Downgrade info message about exceeding partition time limit to a debug2.

* Changes in Slurm 17.02.1-2
============================
 -- Replace clock_gettime with time(NULL) for very old systems without the
    call.

* Changes in Slurm 17.02.1
==========================
 -- Modify pam module to work when configured NodeName and NodeHostname
    differ.
 -- Update sbatch/srun man pages to explain the "filename pattern" more
    clearly.
 -- Add %x to sbatch/srun filename pattern to represent the job name.
 -- job_submit/lua - Add job "bitflags" field.
 -- Update slurm.spec file to note obsolete RPMs.
 -- Fix deadlock scenario when dumping configuration in the slurmctld.
 -- Remove unneeded job lock when running assoc_mgr cache. This lock could
    cause potential deadlock when/if TRES changed in the database and the
    slurmctld wasn't made aware of the change. This would be very rare.
 -- Fix missing locks in gres logic to avoid potential memory race.
 -- If gres is NULL on a job don't try to process it when returning detailed
    information about a job to scontrol.
 -- Fix print of consumed energy in sstat when no energy is being collected.
 -- Print formatted tres string when creating/updating a reservation.
 -- Fix issues with QOS flags Partition[Min|Max]Nodes to work correctly.
 -- Prevent manipulation of the cpu frequency and governor for batch or extern
    steps. This addresses an issue where the batch step would inadvertently
    set the cpu frequency maximum to the minimum value supported on the node.
 -- Convert a slurmctld power management data structure from array to list in
    order to eliminate the possibility of zombie child suspend/resume
    processes.
 -- Burst_buffer/cray - Prevent slurmctld daemon abort if "paths" operation
    fails. Now job will be held. Update job update time when held.
 -- Fix issues with QOS flags Partition[Min|Max]Nodes to work correctly.
 -- Refactor slurmctld agent logic to eliminate some pthreads.
 -- Added "SyscfgTimeout" parameter to knl.conf configuration file.
 -- Fix for CPU binding for job steps run under a batch job.

* Changes in Slurm 17.02.0
==========================
 -- job_submit/lua - Make "immediate" parameter available.
 -- Fix srun I/O race condition to eliminate an error message that might be
    generated if the application exits with outstanding stdin.
 -- Fix regression when purging/archiving jobs/events.
 -- Add new job state JOB_OOM indicating Out Of Memory condition as detected
    by task/cgroup plugin.
 -- If QOS has been added to the system, re-evaluate Deny/AllowQOS on
    partitions.
 -- Deny job with duplicate GRES requested.
 -- Fix segfault when loading very old assoc_mgr usage.
 -- CRAY systems: Restore TaskPlugins order of task/cray before task/cgroup.
 -- Task/cray: Treat missing "mems" cgroup with "debug" messages rather than
    "error" messages. The file may be missing at step termination due to a
    change in how cgroups are released at job/step end.
 -- Fix for job constraint specification with counts, --ntasks-per-node value,
    and no node count.
 -- Fix ordering of step task allocation to fill in a socket before going into
    another one.
 -- Fix configure to not require C++.
 -- job_submit/lua - Remove access to slurmctld internal reservation fields of
    job_pend_cnt and job_run_cnt.
 -- Prevent job_time_limit enforcement from blocking other internal operations
    if a large number of jobs need to be cancelled.
 -- Add 'preempt_youngest_order' option to preempt/partition_prio plugin.
 -- Fix controller being able to talk to a pre-released DBD.
 -- Added ability to override the invoking uid for "scontrol update job" by
    specifying "--uid=|-u ".
 -- Changed file broadcast "offset" from 32 to 64 bits in order to support
    files over 2 GB.
 -- slurm.spec - do not install init scripts alongside systemd service files.

* Changes in Slurm 17.02.0rc1
==============================
 -- Add port info to 'sinfo' and 'scontrol show node'.
 -- Fix errant definition of USE_64BIT_BITSTR which can lead to core dumps.
 -- Move BatchScript to end of each job's information when using
    "scontrol -dd show job" to make it more readable.
 -- Add SchedulerParameters configuration parameter of "default_gbytes", which
    treats numeric only (no suffix) value for memory and tmp disk space as
    being in units of Gigabytes. Mostly for compatibility with LSF.
 -- Fix race condition in srun/sattach logic which would prevent srun from
    terminating.
 -- Bitstring operations are now 64bit instead of 32bit.
 -- Replace hweight() function in bitstring with faster version.
 -- scancel would treat a non-numeric argument as the name of jobs to be
    cancelled (a non-documented feature). Cancelling jobs by name now requires
    the "--jobname=" command line argument.
 -- scancel modified to note that no jobs satisfy the filter options when the
    --verbose option is used along with one or more job filters (e.g.
    "--qos=").
 -- Change _pack_cred to use pack_bit_str_hex instead of pack_bit_fmt for
    better scalability and performance.
 -- Add BootTime configuration parameter to knl.conf file to optimize resource
    allocations with respect to required node reboots.
 -- Add node_features_p_boot_time() to node_features plugin to optimize
    scheduling with respect to node reboots.
 -- Avoid allocating resources to a job in the event that its run time plus
    boot time (if needed) extend into an advanced reservation.
 -- Burst_buffer/cray - Avoid stage-out operation if job never started.
 -- node_features/knl_cray - Add capability to detect Uncorrectable Memory
    Errors (UME) and if detected then log the event in all job and step stderr
    with a message of the form:
    error: *** STEP 1.2 ON tux1 UNCORRECTABLE MEMORY ERROR AT 2016-12-14T09:09:37 ***
    Similar logic added to node_features/knl_generic in version 17.02.0pre4.
 -- If job is allocated nodes which are powered down, then reset job start
    time when the nodes are ready and do not charge the job for power up time.
 -- Add the ability to purge transactions from the database.
 -- Add support for requeueing of federated jobs (BETA).
 -- Add support for interactive federated jobs (BETA).
 -- Add the ability to purge rolled up usage from the database.
 -- Properly set SLURM_JOB_GPUS environment variable for Prolog.

* Changes in Slurm 17.02.0pre4
==============================
 -- Add support for per-partition OverTimeLimit configuration.
 -- Add --mem_bind option of "sort" to run zonesort on KNL nodes at step
    start.
 -- Add LaunchParameters=mem_sort option to configure running of zonesort
    by default at step startup.
 -- Add "FreeSpace" information for each pool to the "scontrol show
    burstbuffer" output. Required changes to the burst_buffer_info_t data
    structure.
 -- Add new node state flag of NODE_STATE_REBOOT for node reboots triggered by
    "scontrol reboot" commands. Previous logic re-used NODE_STATE_MAINT flag,
    which could lead to inconsistencies. Add "ASAP" option to "scontrol
    reboot" command that will drain a node in order to reboot it as soon as
    possible, then return it to service.
 -- Allow unit conversion routine to convert 1024M to 1G.
 -- switch/cray plugin - change legacy spool directory location.
 -- Add new PriorityFlags option of INCR_ONLY, which prevents a job's priority
    from being decremented.
 -- Make it so we don't purge job start messages until after we purge step
    messages. Hopefully this will reduce the number of messages lost when
    filling up memory when the database/DBD is down.
 -- Added SchedulerParameters option of "bf_job_part_count_reserve". Jobs
    below the specified threshold will not have resources reserved for them.
 -- If GRES are configured with file IDs, then "scontrol -d show node" will
    not only identify the count of currently allocated GRES, but their
    specific index numbers (e.g.
    "GresUsed=gpu:alpha:2(IDX:0,2),gpu:beta:0(IDX:N/A)"). Ditto for job
    information with "scontrol -d show job".
 -- Add new mcs/account plugin.
 -- Add "GresEnforceBind=Yes" to "scontrol show job" output if so configured.
 -- Add support for SALLOC_CONSTRAINT, SBATCH_CONSTRAINT and SLURM_CONSTRAINT
    environment variables to set default constraints for salloc, sbatch and
    srun commands respectively.
 -- Provide limited support for the MemSpecLimit configuration parameter
    without the task/cgroup plugin.
 -- node_features/knl_generic - Add capability to detect Uncorrectable Memory
    Errors (UME) and if detected then log the event in all job and step stderr
    with a message of the form:
    error: *** STEP 1.2 ON tux1 UNCORRECTABLE MEMORY ERROR AT 2016-12-14T09:09:37 ***
 -- Add SLURM_JOB_GID to TaskProlog environment.
 -- burst_buffer/cray - Remove leading zeros from node ID lists passed to
    dw_wlm_cli program.
 -- Add "Partitions" field to "scontrol show node" output.
 -- Remove sched/wiki and sched/wiki2 plugins and associated code.
 -- Remove SchedulerRootFilter option and slurm_get_root_filter() API call.
 -- Add SchedulerParameters option of spec_cores_first to select specialized
    cores from the lowest rather than highest number cores and sockets.
 -- Add PrologFlags option of Serial to disable concurrent launch of Prolog
    and Epilog scripts.
 -- Fix security issue caused by insecure file path handling triggered by the
    failure of a Prolog script. To exploit this a user needs to anticipate or
    cause the Prolog to fail for their job. CVE-2016-10030.

* Changes in Slurm 17.02.0pre3
==============================
 -- Add srun host & PID to job step data structures.
 -- Avoid creating duplicate pending step records for the same srun command.
 -- Rewrite srun's logic for pending steps for better efficiency (fewer RPCs).
 -- Added new SchedulerParameters options step_retry_count and step_retry_time
    to control scheduling behaviour of job steps waiting for resources.
 -- Optimize resource allocation logic for --spread-job job option.
 -- Modify cpu_bind and mem_bind map and mask options to accept a repetition
    count to better support large task count.
    For example: "mask_mem:0x0f*2,0xf0*2" is equivalent to
    "mask_mem:0x0f,0x0f,0xf0,0xf0".
 -- Add support for --mem_bind=prefer option to prefer, but not restrict
    memory use to the identified NUMA node.
 -- Add mechanism to constrain kernel memory allocation using cgroups. New
    cgroup.conf parameters added: ConstrainKmemSpace, MaxKmemPercent, and
    MinKmemSpace.
 -- Correct invocation of man2html, which previously could cause FreeBSD
    builds to hang.
 -- MYSQL - Unconditionally remove 'ignore' clause from 'alter ignore'.
 -- Modify service files to not start Slurm daemons until after Munge has been
    started.
    NOTE: If you are not using Munge, but are using the "service" scripts to
    start Slurm daemons, then you will need to remove this check from the
    etc/slurm*service scripts.
 -- Do not process SALLOC_HINT, SBATCH_HINT or SLURM_HINT environment
    variables if any of the following salloc, sbatch or srun command line
    options are specified: -B, --cpu_bind, --hint, --ntasks-per-core, or
    --threads-per-core.
 -- burst_buffer/cray: Accept new jobs on backup slurmctld daemon without
    access to dw_wlm_cli command. No burst buffer actions will take place.
 -- Do not include SLURM_JOB_DERIVED_EC, SLURM_JOB_EXIT_CODE, or
    SLURM_JOB_EXIT_CODE in PrologSlurmctld environment (not available yet).
 -- Cray - set task plugin to fatal() if task/cgroup is not loaded after
    task/cray in the TaskPlugin settings.
 -- Remove separate slurm_blcr package. If Slurm is built with BLCR support,
    the files will now be part of the main Slurm packages.
 -- Replace sjstat, seff and sjobexit RPM packages with a single "contribs"
    package.
 -- Remove long since defunct slurmdb-direct scripts.
 -- Add SbcastParameters configuration option to control default file
    destination directory and compression algorithm.
 -- Add new SchedulerParameters option (max_array_tasks) to limit the maximum
    number of tasks in a job array independently from the maximum task ID
    (MaxArraySize).
 -- Fix issue where number of nodes is not properly allocated when sbatch and
    salloc are requested with -n tasks < hosts from -w hostlist or from -N.
 -- Add infrastructure for submitting federated jobs.

* Changes in Slurm 17.02.0pre2
==============================
 -- Add new RPC (REQUEST_EVENT_LOG) so that slurmd and slurmstepd can log
    events through the slurmctld daemon.
 -- Remove sbatch --bb option. That option was never supported.
 -- Automatically clean up task/cgroup cpuset and devices cgroups after steps
    are completed.
 -- Add federation read/write locks.
 -- Limit job purge run time to 1 second at a time.
 -- The database index for jobs is now 64 bits. If you happen to be close to
    4 billion jobs in your database you will want to update your slurmctld at
    the same time as your slurmdbd to prevent roll over of this variable, as
    it is 32 bit in previous versions of Slurm.
 -- Optionally lock slurmstepd in memory for performance reasons and to avoid
    possible SIGBUS if the daemon is paged out at the time of a Slurm upgrade
    (changing plugins). Controlled via new LaunchParameters options of
    slurmstepd_memlock and slurmstepd_memlock_all.
 -- Add event trigger on burst buffer errors (see strigger man page,
    --burst_buffer option).
 -- Add job AdminComment field which can only be set by a Slurm administrator.
 -- Add salloc, sbatch and srun option of --delay-boot=