|Posted on Thursday, February 04, 2010 - 11:01 am: |
I was doing some calculations with my students two days ago. During our lab they managed to submit about 200 jobs to the queue. After 30 minutes or so the queued jobs became stuck. None of them went through. We waited and tried again, nothing. Finally we went to administer the jobs, and I started killing them one at a time, hoping to get the queue moving again. Nothing seemed to work, so instead of killing the 121 jobs one at a time, I checked them all in the Job Manager and hit the Delete key. They disappeared from the screen but this solved nothing. Now, in the top left corner in Job Manager, under Status, I see "admin" and right beneath it "120 Jobs" queued, but I can't find those 121 jobs anywhere anymore. Are they some phantom jobs that jam the system? I also deleted the failed ones but I still can't get new jobs to go through. Please advise. We have WebMO version 8.0.009p.
Post Number: 139
|Posted on Thursday, February 04, 2010 - 6:15 pm: |
This occurs occasionally with older versions of WebMO, and it can be cleared up pretty quickly:
Go to the command line, and go to the WebMO 'user' directory. There is a file called 'queue'. Open that file in a text editor, and delete ALL the contents. This will delete all the old jobs out of the queue.
Deleting them from the job manager should have done the same thing, but this will work for sure!
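For anyone who would rather script this step than use a text editor, here is a minimal Python sketch of emptying the queue file. The path in the usage comment is a placeholder (this thread does not give the full location of the WebMO directory), and `clear_queue_file` is a hypothetical helper name, not part of WebMO. Saving a backup copy first is an added precaution, not something the steps above require.

```python
from pathlib import Path


def clear_queue_file(queue_path):
    """Empty the queue file, keeping a backup copy next to it.

    Returns the path of the backup file.
    """
    queue = Path(queue_path)
    backup = queue.with_name(queue.name + ".bak")
    # Save the old contents in case they are needed later.
    backup.write_bytes(queue.read_bytes())
    # Truncate the file: this removes ALL queued entries.
    queue.write_bytes(b"")
    return backup


# Usage (placeholder path, adjust to your own installation):
# clear_queue_file("/path/to/webmo/user/queue")
```

Run it as a user with write access to the directory that holds the queue file.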
|Posted on Monday, February 08, 2010 - 12:44 pm: |
Thanks, JR, for your post. It solved the queue problem; the queue is now clean. However, when we submit new jobs they go to the queue but that's it: they never finish, nor do they fail. They just sit in the queue with the status shown in yellow and "queue 1/1" or whatever the number of submitted jobs is. It seems that our whole system is somehow still frozen and nothing reaches the computing phase. Any ideas how to solve this?
Post Number: 141
|Posted on Monday, February 08, 2010 - 2:24 pm: |
Go to the same WebMO "jobs" directory, and look for the 'daemon' file. If it exists (it probably does), read the contents, which should be a single number. That number is the process ID of the WebMO daemon.
Type 'kill <number>' (as root) to kill the daemon, in case it is still running. Then delete the daemon file. Subsequent jobs should run fine.
The root problem of both issues is likely the same, that the running version of the daemon got caught in some sort of infinite loop.
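The same steps can be sketched in Python, which may help where the `kill` command is not available (at a Windows command prompt the built-in equivalent is `taskkill /PID <number> /F`). This is only a sketch under assumptions: the path in the usage comment is a placeholder, `read_daemon_pid` and `stop_daemon` are hypothetical helper names rather than anything shipped with WebMO, and the 'daemon' file is assumed to contain nothing but the process ID, as described above.

```python
import os
import signal
from pathlib import Path


def read_daemon_pid(daemon_path):
    """Return the process ID stored in the daemon file, or None if the file is missing."""
    path = Path(daemon_path)
    if not path.exists():
        return None
    return int(path.read_text().strip())


def stop_daemon(daemon_path):
    """Try to kill the daemon, then delete the daemon file.

    Returns (pid, killed); killed is False if the process could not be
    signalled (for example because it was no longer running).
    """
    pid = read_daemon_pid(daemon_path)
    if pid is None:
        return None, False
    try:
        # On Windows, os.kill with SIGTERM terminates the process outright.
        os.kill(pid, signal.SIGTERM)
        killed = True
    except OSError:
        killed = False
    # Remove the stale daemon file whether or not the kill succeeded.
    Path(daemon_path).unlink()
    return pid, killed


# Usage (placeholder path, run with administrator/root rights):
# print(stop_daemon("/path/to/webmo/daemon"))
```

If `killed` comes back False, check by hand whether the process is still running before submitting new jobs.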
|Posted on Monday, February 08, 2010 - 3:35 pm: |
Found the 'daemon' file, process ID was 5952. Where do I type "kill 5952"? I tried at the command prompt but obviously the command is not recognized by DOS. So, I just deleted the file. Now the jobs move to "running" but never finish or fail. This is what I get (for Gamess) when I open the Raw Output file next to the red X while the job is running:
Distributed Data Interface kickoff program.
Initiating 1 compute processes on 1 nodes to run the following command:
ddikick.x error: execvp failed in Kickoff_Local.
Error: execvp(c:/WinGAMESS/gamess.07.exe,args) failed (errno=unknown).
Post Number: 142
|Posted on Monday, February 08, 2010 - 4:01 pm: |
I didn't realize you were on Windows.
I believe what you are seeing now is that you have not specified the location of the WinGAMESS executable correctly. Go to the 'Interface Manager' and click on the 'Edit' icon corresponding to GAMESS. Make sure that the name of the executable is right (it is probably not gamess.07.exe, as you are probably running a newer version!)
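A quick way to double-check the path outside of WebMO is a small Python sketch like the one below. `check_executable` is a hypothetical helper, not a WebMO feature, and the path in the usage comment is simply the one from the error message above; substitute whatever the Interface Manager shows. It only confirms that the file exists and is marked executable, so an "ok" result does not rule out other causes.

```python
import os


def check_executable(path):
    """Report whether path points at an existing, executable file."""
    if not os.path.isfile(path):
        return "missing"
    if not os.access(path, os.X_OK):
        return "not executable"
    return "ok"


# Usage (path taken from the error message; adjust as needed):
# print(check_executable("c:/WinGAMESS/gamess.07.exe"))
```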
|Posted on Tuesday, February 09, 2010 - 11:01 am: |
I checked in the Interface Manager and everything seemed OK. Yes, we do have gamess.07.exe and it was where it is supposed to be. We still can't run any Gamess calculations. They go to the queue and move to the running phase, but that is where they stop. The calculation never finishes. Any other ideas?
Post Number: 143
|Posted on Tuesday, February 09, 2010 - 4:04 pm: |
This appears to be a bug specific to the old WebMO v8, in how GAMESS jobs are run on Windows. Please email me for an update file.