From dennisvd@NIKHEF.NL Wed Dec 14 13:47:51 2016
Date: Wed, 14 Dec 2016 13:47:21
From: Dennis van Dok
Reply-To: LHC Computer Grid - Rollout
To: LCG-ROLLOUT@JISCMAIL.AC.UK
Subject: Re: [LCG-ROLLOUT] pbs_server crashes Torque 4.2.10

On 14-12-16 10:30, Carles Acosta wrote:
> Hello all,
>
> After having a test instance running Torque 4.2.10 + Maui for several
> months without any relevant issue, running hundreds of jobs, we decided
> to move our production batch system from Torque 2.5.13 to Torque 4.2.10,
> as we expected better performance. Running some ldapsearch queries
> against our BDIIs showed that several sites are already on version
> 4.2.10.
>
> However, we have now observed, only in production, that pbs_server
> crashes from time to time with this error:
>
> pbs_server[14661]: segfault at 0 ip 00000031bf281301 sp 00007fc6cdff47a8
> error 4 in libc-2.12.so[31bf200000+18a000]
>

Not sure if this is the same bug, but I did see crashes when we put
4.2.10 into production some time ago. I ran it through a debugger and
came up with a patch. I think this was incorporated upstream.

Best of luck,
Dennis

diff -ruN torque-4.2.10/src/server/req_runjob.c torque-4.2.10-patched/src/server/req_runjob.c
--- torque-4.2.10/src/server/req_runjob.c	2015-03-20 04:24:58.000000000 +0100
+++ torque-4.2.10-patched/src/server/req_runjob.c	2016-05-23 23:28:15.000000000 +0200
@@ -1801,9 +1801,16 @@
     {
     /* job has been checkpointed or files already staged in */
     /* in this case, exec_host must be already set */
-
+    /* this is an unsafe assumption so let's be extra sure before doing strdup() */
     if (prun->rq_destin && *prun->rq_destin) /* If a destination has been specified */
       {
+      /* check that execution host is actually set */
+      /* this can happen if running the job failed after stagein */
+      if (pjob->ji_wattr[JOB_ATR_exec_host].at_val.at_str == NULL)
+        {
+        req_reject(PBSE_EXECTHERE, 0, preq, NULL, "exec host not set but files staged in");
+        return(NULL);
+        }
       /* specified destination must match exec_host */
       if ((exec_host = strdup(pjob->ji_wattr[JOB_ATR_exec_host].at_val.at_str)) == NULL)
         {
diff -ruN torque-4.2.10/src/server/svr_jobfunc.c torque-4.2.10-patched/src/server/svr_jobfunc.c
--- torque-4.2.10/src/server/svr_jobfunc.c	2015-03-20 04:24:58.000000000 +0100
+++ torque-4.2.10-patched/src/server/svr_jobfunc.c	2016-05-23 23:28:15.000000000 +0200
@@ -891,6 +891,8 @@
     free(pjob.ji_wattr[JOB_ATR_exec_host].at_val.at_str);
     pjob.ji_wattr[JOB_ATR_exec_host].at_val.at_str = NULL;
     pjob.ji_wattr[JOB_ATR_exec_host].at_flags &= ~ATR_VFLAG_SET;
+    /* additionally clear the StagedIn flag if set */
+    pjob.ji_qs.ji_svrflags &= ~JOB_SVFLG_StagedIn;
     }
   }
 /* END release_node_allocation() */
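
For anyone who wants to see the idea of the patch outside the Torque
source tree, here is a minimal standalone sketch. Passing NULL to
strdup() is undefined behaviour, and with glibc it typically ends in a
segfault inside libc; that would be consistent with the "segfault at 0"
in the report above, but it is not confirmed to be the same crash. The
struct fake_job, FAKE_STAGED_IN, copy_exec_host(), release_nodes() and
self_check() are all hypothetical names made up for illustration; they
are not Torque's types or functions.

```c
#define _POSIX_C_SOURCE 200809L  /* for strdup() under strict -std= modes */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a job record; NOT Torque's real struct. */
struct fake_job {
    char *exec_host;   /* NULL once the node allocation has been released */
    unsigned flags;    /* bit field; only FAKE_STAGED_IN is used here */
};

#define FAKE_STAGED_IN 0x1u

/*
 * First hunk, in miniature: check the pointer before duplicating it.
 * Returns  0 and stores a heap copy in *out on success,
 *         -1 if exec_host is unset (the caller should reject the request),
 *         -2 if strdup() ran out of memory.
 */
int copy_exec_host(const struct fake_job *job, char **out)
{
    *out = NULL;
    if (job->exec_host == NULL)
        return -1;                     /* never hand NULL to strdup() */
    *out = strdup(job->exec_host);
    return (*out == NULL) ? -2 : 0;
}

/*
 * Second hunk, in miniature: when the host is released, also clear the
 * "staged in" flag so later code does not assume a host is still set.
 */
void release_nodes(struct fake_job *job)
{
    free(job->exec_host);
    job->exec_host = NULL;
    job->flags &= ~FAKE_STAGED_IN;
}

/* Small usage example: a released job is rejected instead of crashing. */
void self_check(void)
{
    struct fake_job job = { NULL, FAKE_STAGED_IN };
    char *copy = NULL;

    job.exec_host = strdup("node01/0");
    assert(job.exec_host != NULL);
    assert(copy_exec_host(&job, &copy) == 0);
    free(copy);

    release_nodes(&job);
    assert(copy_exec_host(&job, &copy) == -1);
    assert((job.flags & FAKE_STAGED_IN) == 0);
}
```

The two halves are meant to work together: the guard turns a crash into
a rejected request, and clearing the flag on release should keep the
server from reaching the guarded path with a stale "staged in" state in
the first place.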