|Posted on Thursday, August 11, 2016 - 12:14 pm: |
We recently installed Version 17 of WebMO Enterprise linked to a remote SGE queue, and everything was (is) working great when users work individually. However, a class of about 25 students tried to use the program during a lecture and submitted jobs within seconds of one another. After about 8 submissions they were frozen out and eventually received a timeout error. The jobs WERE successfully submitted to the queue, but the jobmgr.cgi page was not reachable until the daemon shut down.
This error is NOT reproducible on an installation that uses the WebMO daemon (that working installation is not linked to an SGE queue).
The error log is blank for WebMO, and the ssl_error_log displays 'Timeout' errors...
Any thoughts? I've been searching through the daemon_pbs.cgi and processcontrol_pbs.cgi scripts and haven't had any luck identifying the problem.
Thanks very much.
|Posted on Monday, March 09, 2020 - 3:29 pm: |
In case anyone else runs across this in the future, I reached out to James McNeely directly and he was kind enough to send the following:
"We were able to solve this ... our original 'solution' actually went through two steps.
First, if your WebMO installation is on an NFS-mounted folder: we had to move the database files to a local folder. Apparently the flock calls in the Perl scripts were misbehaving when the databases were NFS-mounted.
After this, our computing services team upgraded the OS to CentOS 7 a little while ago, and we ran into a whole lot of problems. At this point, I had to replace each flock call in all of the WebMO scripts with an fcntl call. That seemed to fix all of the issues."
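For anyone wondering what "replace flock with fcntl" looks like in practice, here is a rough sketch of the idea. It is written in Python only for illustration; the WebMO scripts are Perl, and this is not the actual patch described above. The names `locked` and `append_job` and the file layout are made up for the example. Python's `fcntl.flock` is the flock-style lock, while `fcntl.lockf` takes a POSIX record lock through the fcntl() system call, which is the kind of lock the quoted fix switched to.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def locked(path):
    """Open `path` and hold an exclusive fcntl-style lock while the block runs.

    Hypothetical helper for illustration; not part of WebMO.
    """
    # "a+" opens read/write and creates the file if it is missing.
    # An exclusive lock needs a descriptor that is open for writing.
    with open(path, "a+") as fh:
        # Before: fcntl.flock(fh, fcntl.LOCK_EX)   (flock-style lock)
        # After:  fcntl.lockf(...) locks the whole file via fcntl().
        fcntl.lockf(fh, fcntl.LOCK_EX)  # blocks until the lock is granted
        try:
            yield fh
        finally:
            fh.flush()  # push buffered writes out before releasing the lock
            fcntl.lockf(fh, fcntl.LOCK_UN)


def append_job(db_path, line):
    """Append one record to a flat-file 'database' under the lock."""
    with locked(db_path) as fh:
        fh.seek(0, os.SEEK_END)
        fh.write(line + "\n")
```

One thing to keep in mind with fcntl-style locks is that they belong to the process, and closing any descriptor the process holds on that file drops the lock, so the sketch keeps a single open handle for the whole critical section.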