Posted on Wednesday, October 06, 2010 - 4:29 pm:
We have a problem with jobs not completing and the queue getting backed up. Whenever this happens, I view the daemon file, and the job number listed appears to be a random old job (e.g. 2348 when the jobs being submitted are in the 7000s).
Deleting the daemon file and submitting a new job to restart the daemon clears it up for a little while (jobs go through), but the problem always comes back.
Is this a known issue? Is there an easy fix?
Posted on Wednesday, October 06, 2010 - 5:52 pm:
The 'daemon' file does not list a job ID, but rather a PROCESS ID for the WebMO daemon process.
It sounds like the WebMO daemon may have crashed. Normally, when a job is submitted, WebMO starts the daemon (if it is not already running). The daemon then starts the job.
However, if the daemon crashed, then WebMO may THINK the daemon is already running, when in fact it is not. Normally there is a check for this, but that check may have failed. Or it is possible that the daemon is 'hung', but not crashed.
Next time you see this problem, view the contents of the daemon file to find the process ID. Then do a 'ps -aef | grep <process>' to see whether the process is still running. This will provide useful diagnostic information.
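If it helps, the check above could be wrapped in a small shell function so it is quick to repeat each time the queue backs up. This is only a sketch and not part of WebMO: the function name check_daemon_pid is made up here, and it assumes the daemon file contains just the process ID as its first token. You pass in the path to the daemon file yourself, since its location depends on your installation.

```shell
#!/bin/sh
# check_daemon_pid FILE
# Hypothetical helper (not a WebMO tool): reports whether the PID stored in
# FILE belongs to a live process. Assumes the first token in FILE is the PID.
# Prints one of: running, stale, invalid, missing.
check_daemon_pid() {
    pidfile=$1
    if [ ! -r "$pidfile" ]; then
        echo "missing"
        return 2
    fi
    # Take the first whitespace-delimited token on the first line as the PID.
    pid=$(awk 'NR==1 {print $1; exit}' "$pidfile")
    case $pid in
        ''|*[!0-9]*) echo "invalid"; return 3 ;;
    esac
    # kill -0 sends no signal; it only tests that the process exists and that
    # we are allowed to signal it. If that fails (for example because the
    # process belongs to another user), fall back to asking ps about the PID.
    if kill -0 "$pid" 2>/dev/null || ps -p "$pid" >/dev/null 2>&1; then
        echo "running"
        return 0
    fi
    echo "stale"
    return 1
}
```

A result of "stale" would point to a crashed daemon that left its file behind, while "running" with jobs still stuck would suggest the daemon is hung rather than crashed. Note that a PID can be reused by an unrelated process, so it is still worth looking at the 'ps' output to confirm that the process really is the WebMO daemon.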