|Posted on Monday, October 02, 2017 - 8:45 am: |
Our computing cluster admin and we have finally figured out why all running jobs on several queues get terminated at the same time with the error message:
Failed job: Job killed by daemon; job exceeded allotted CPU time
the error file pbs_stderr reads:
mkdir: cannot create directory ‘/scratch/webmo/webmo-5448/1678’: File exists
slurmstepd: error: *** JOB 607736 ON node762 CANCELLED AT 2017-09-30T15:36:44 DUE TO TIME LIMIT ***
rm: cannot remove ‘/scratch/webmo/webmo-5448/1678’: No such file or directory
This behaviour seems to be caused by stopping or restarting the SLURM control daemon, which is sometimes necessary for maintenance purposes.
How can we solve this issue or what further information is needed?
Post Number: 566
|Posted on Monday, October 02, 2017 - 10:26 am: |
If you go to daemon_pbs.cgi, the "monitor_jobs" subroutine, you will see a few lines where WebMO may kill jobs that it can no longer find in the SLURM queue (such as if the daemon were restarted!). You should be able to simply comment those out without adverse consequences in nearly every case.
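The behavior described above can be sketched as follows. Note this is an illustrative sketch in Python, not WebMO's actual code (daemon_pbs.cgi is Perl); all function names and the structure are assumptions made for explanation only:

```python
# Illustrative sketch of the monitoring logic described in the thread.
# NOT WebMO's real implementation; names and structure are assumptions.

def parse_squeue_ids(squeue_output: str) -> set[str]:
    """Parse job IDs from output in the style of `squeue -h -o %i`."""
    return {line.strip() for line in squeue_output.splitlines() if line.strip()}

def jobs_to_kill(tracked_jobs: set[str], queued_jobs: set[str],
                 kill_missing: bool = True) -> set[str]:
    """Return the tracked jobs a monitor loop would flag for termination.

    With kill_missing=True (the behavior the thread describes), any
    tracked job absent from the SLURM queue is flagged -- including jobs
    that only *look* missing because the control daemon was restarted.
    kill_missing=False corresponds to commenting out the kill lines in
    monitor_jobs: missing jobs are simply left alone.
    """
    if not kill_missing:
        return set()
    return tracked_jobs - queued_jobs

# Example: three tracked jobs, but squeue momentarily returns nothing
# (e.g., while the control daemon is down for maintenance).
tracked = {"607736", "607737", "607738"}
queued = parse_squeue_ids("")  # empty squeue output during restart
print(sorted(jobs_to_kill(tracked, queued)))         # all three flagged
print(sorted(jobs_to_kill(tracked, queued, False)))  # none flagged
```

The sketch shows why a daemon restart kills every running job at once: an empty queue listing is indistinguishable from all jobs having finished, so skipping the kill step trades automatic cleanup of truly vanished jobs for robustness against restarts.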
|Posted on Wednesday, October 04, 2017 - 4:16 am: |
Dear JR Schmidt,
we will try your suggestion and report back on how we fare.
Thank you very much already for the quick reply!
|Posted on Thursday, October 26, 2017 - 4:42 am: |
So far this has solved the issue, thank you very much!
There is a small side effect: calculations that run over the allotted time remain in the WebMO queue with a displayed time of 0.0 s, but that is a small price to pay and far better than having all calculations terminated uncontrollably.