WebMO - Computational chemistry on the WWW

WebMO Support Forum » Bug Reports » WebMO Version 17.X

WebMO Version 17.0.012e kills all jobs when SLURM control Daemon gets restarted

jmeyer
Unregistered guest
Posted on Monday, October 02, 2017 - 8:45 am:   

Our computing cluster admin and we have finally figured out why all running jobs on several queues get terminated at the same time with the error message:

Failed job: Job killed by daemon; job exceeded allotted CPU time

The error file pbs_stderr reads:
mkdir: cannot create directory ‘/scratch/webmo/webmo-5448/1678’: File exists
slurmstepd: error: *** JOB 607736 ON node762 CANCELLED AT 2017-09-30T15:36:44 DUE TO TIME LIMIT ***
rm: cannot remove ‘/scratch/webmo/webmo-5448/1678’: No such file or directory

This behaviour seems to be caused by a shutdown and/or restart of the SLURM control daemon, which is sometimes necessary for maintenance purposes.

How can we solve this issue or what further information is needed?

Best Regards,
J. Meyer
JR Schmidt
Moderator
Username: Schmidt

Post Number: 566
Registered: 11-2006
Posted on Monday, October 02, 2017 - 10:26 am:   

In the "monitor_jobs" subroutine of daemon_pbs.cgi, you will see a few lines where WebMO may kill jobs that it can no longer find in the SLURM queue (such as when the daemon is restarted!). You should be able to simply comment those out without adverse consequences in nearly every case.
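
For readers who would rather soften that check than remove it, here is a minimal sketch of the idea in Python. It is not WebMO's actual Perl code: the function names, the miss counter, and the threshold are all hypothetical. It assumes the queue is polled with something like `squeue --noheader --format=%i`, and that the command exits with a non-zero status while the control daemon is unreachable. A job is only reported as gone after it has been missing from several successful queries in a row, and a failed query counts as "no information" rather than "job missing".

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class QueueSnapshot:
    ok: bool                              # False if the scheduler query itself failed
    job_ids: Set[str] = field(default_factory=set)


def parse_squeue(returncode: int, stdout: str) -> QueueSnapshot:
    """Build a snapshot from the exit status and output of a queue query.

    Assumption: the output holds one job ID per line, as produced by
    `squeue --noheader --format=%i`.
    """
    if returncode != 0:
        # Scheduler could not be queried (e.g. control daemon restarting).
        return QueueSnapshot(ok=False)
    ids = {line.strip() for line in stdout.splitlines() if line.strip()}
    return QueueSnapshot(ok=True, job_ids=ids)


def jobs_considered_gone(tracked: List[str], snapshot: QueueSnapshot,
                         misses: Dict[str, int], threshold: int = 3) -> List[str]:
    """Return tracked jobs missing from `threshold` consecutive good snapshots.

    `misses` is a hypothetical per-job counter kept by the caller between polls.
    """
    if not snapshot.ok:
        # No information this round: leave all counters untouched.
        return []
    gone = []
    for job_id in tracked:
        if job_id in snapshot.job_ids:
            misses[job_id] = 0            # seen again, reset the counter
        else:
            misses[job_id] = misses.get(job_id, 0) + 1
            if misses[job_id] >= threshold:
                gone.append(job_id)
    return gone
```

With this kind of guard, a single poll that lands in the middle of a daemon restart does not cause any job to be flagged. Whether that is enough depends on how long the daemon stays down relative to the polling interval.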
jmeyer
Unregistered guest
Posted on Wednesday, October 04, 2017 - 4:16 am:   

Dear JR Schmidt,

We will try to apply your suggestion and will report back how we fare.
Thank you very much already for the quick reply!

Best Regards,
J. Meyer
jmeyer
Unregistered guest
Posted on Thursday, October 26, 2017 - 4:42 am:   

So far this has solved the issue, thank you very much!
There is a small side effect: calculations that run over the allotted time remain in the WebMO queue with a displayed time of 0.0s, but that is a small price to pay and far better than having all calculations terminated uncontrollably.
