Jobs can't create run_log file
Posted: Fri Jul 10, 2020 2:18 pm
by smcgrat
Hello,
Apologies if this is the wrong place for this post.
When we try to run a job with WebMO (version 20.0.009e), we get the following error:
Code:
[Fri Jul 10 10:20:35 2020] jobmgr.cgi: Cannot open file /usr/local/webmo/private/webmo/graeme/3/run_log: No such file or directory at text_dump.cgi line 78.
The other contents of that directory are:
Code:
$ ls -latrh /usr/local/webmo/private/webmo/graeme/3/
total 40K
drwxr-xr-x 5 webmo webmo 33 Jul 10 10:19 ..
-rw-r--r-- 1 webmo webmo 151 Jul 10 10:19 zmatrix
-rw-r--r-- 1 webmo webmo 315 Jul 10 10:19 job_options
-rw-r--r-- 1 webmo webmo 241 Jul 10 10:19 input.xyz
-rw-r--r-- 1 webmo webmo 96 Jul 10 10:19 connections
-rw-r--r-- 1 webmo webmo 43 Jul 10 10:19 charges
-rw-r--r-- 1 webmo webmo 372 Jul 10 10:19 input.com
-rw-r--r-- 1 webmo webmo 193 Jul 10 10:19 summary
-rw-r--r-- 1 webmo webmo 0 Jul 10 10:19 notes
-rw-r--r-- 1 webmo webmo 26 Jul 10 10:19 jobid
-rw-r--r-- 1 webmo webmo 1.9K Jul 10 10:19 pbs_script.sh
drwxr-xr-x 2 webmo webmo 188 Jul 10 10:19 .
-rw-r--r-- 1 webmo webmo 30 Jul 10 10:19 pbs_stderr
So I'm a bit confused as to why this is happening.
Apologies if I have missed something basic or have left any useful information out.
Regards
Sean
Re: Jobs can't create run_log file
Posted: Fri Jul 10, 2020 9:43 pm
by schmidt
Can you provide some additional context? Are you trying to get WebMO integrated with an existing queueing system? (From the directory listing, it appears the answer is "yes").
Assuming that is the case, I would recommend checking your Torque/SLURM/etc. job to see if WebMO actually submitted the job, and, if so, whether it was actually scheduled / executed by Torque/SLURM/etc. FYI, the actual script that WebMO submits to the queue is 'pbs_script.sh'. In the comment line of the script you can see the exact command line that was used to submit the script to the queue. Sometimes this can be useful, e.g. to login as user 'webmo' from the command line and manually submit the script (using those exact arguments) to debug any issues.
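As a rough sketch of that debugging step, the snippet below pulls comment lines that mention a submit command out of a generated script so the command can be re-run by hand. The function name `show_submit_command` is made up for illustration, and the idea that the command sits on a `#` comment line containing `sbatch` or `qsub` is an assumption about the script layout, not something WebMO's documentation is known to guarantee.

```shell
#!/bin/sh
# Hypothetical helper: print comment lines in a batch script that mention a
# submit command. Assumes such a line starts with '#' and contains the
# lowercase word "sbatch" or "qsub" (so "#SBATCH" directives are skipped).
show_submit_command() {
    script=$1
    # Keep matching comment lines, then strip the leading '#' and spaces.
    grep -E '^#.*(sbatch|qsub)' "$script" | sed -e 's/^#[[:space:]]*//'
}

# Example (path taken from the thread):
#   show_submit_command /usr/local/webmo/private/webmo/graeme/3/pbs_script.sh
```

The printed line can then be pasted into a shell while logged in as user 'webmo'.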
Re: Jobs can't create run_log file
Posted: Mon Jul 13, 2020 4:50 pm
by smcgrat
schmidt wrote: ↑Fri Jul 10, 2020 9:43 pm
Can you provide some additional context? Are you trying to get WebMO integrated with an existing queueing system? (From the directory listing, it appears the answer is "yes").
Apologies for the lack of context and thank you for your response.
Yes, trying to integrate with slurm running on a cluster.
Assuming that is the case, I would recommend checking your Torque/SLURM/etc. job to see if WebMO actually submitted the job, and, if so, whether it was actually scheduled / executed by Torque/SLURM/etc.
Thanks, yes it was scheduled and restarted 439 times for whatever reason. Here are the details of the job:
Code:
JobId=21265 JobName=WebMO_2
UserId=nobody(1000) GroupId=odias(1001) MCS_label=N/A
Priority=100104440 Nice=-100000000 Account=(null) QOS=normal
JobState=COMPLETING Reason=BeginTime Dependency=(null)
Requeue=0 Restarts=439 BatchFlag=2 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2020-07-08T09:03:00 EligibleTime=2020-07-08T09:07:00
AccrueTime=2020-07-08T09:05:00
StartTime=2020-07-08T09:05:00 EndTime=2020-07-09T09:05:00 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-07-08T09:05:00
Partition=compute AllocNode:Sid=hprc-guest-114-253:28498
ReqNodeList=(null) ExcNodeList=(null)
NodeList=
BatchHost=pople-n019
NumNodes=1 NumCPUs=16 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=16,mem=63000M,node=1,billing=16
Socks/Node=* NtasksPerN:B:S:C=16:0:*:* CoreSpec=*
MinCPUsNode=16 MinMemoryNode=63000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/usr/local/webmo/public_html/cgi-bin/webmo
StdErr=/usr/local/webmo/private/webmo/graeme/2/pbs_stderr
StdIn=/dev/null
StdOut=/usr/local/webmo/private/webmo/graeme/2/pbs_stdout
Power=
The standard output file: /usr/local/webmo/private/webmo/graeme/2/pbs_stdout doesn't exist though!
FYI, the actual script that WebMO submits to the queue is 'pbs_script.sh'. In the comment line of the script you can see the exact command line that was used to submit the script to the queue. Sometimes this can be useful, e.g. to login as user 'webmo' from the command line and manually submit the script (using those exact arguments) to debug any issues.
Thanks, that was helpful. I tested it as follows:
Code:
/bin/sbatch -J WebMO_2 -o /usr/local/webmo/private/webmo/graeme/2/pbs_stdout -e /usr/local/webmo/private/webmo/graeme/2/pbs_stderr -p 'compute' --nodes=1 --tasks-per-node=16 /usr/local/webmo/sbatch.sh
And it ran OK. Here are the contents of the /usr/local/webmo/sbatch.sh script:
Code:
[webmo@hprc-guest-114-253 ~]$ cat /usr/local/webmo/sbatch.sh
#!/bin/sh
#SBATCH -N 1
#SBATCH -t 00:01:00
#SBATCH -J webmotest
#SBATCH -p compute
echo ""
echo "srun hostname"
srun hostname
echo ""
echo "hostname -f"
hostname -f
echo ""
echo "free -g"
free -g
echo ""
echo "whoami"
whoami
echo ""
echo "env | grep SLURM"
env | grep SLURM
echo ""
echo "pwd"
pwd
echo ""
echo "ls -latrh"
ls -latrh
echo ""
exit
So on the face of all that it seems to me that the webmo machine can send jobs to the slurm queue successfully. Am I right in that or have I missed something?
Thanks
Sean
Re: Jobs can't create run_log file
Posted: Mon Jul 13, 2020 7:43 pm
by smcgrat
Sorry, I didn't fully understand how the submission worked. I submitted the /usr/local/webmo/private/webmo/graeme/2/pbs_script.sh script that had been generated by WebMO as follows:
Code:
$ /bin/sbatch -J WebMO_2 -o /usr/local/webmo/private/webmo/graeme/2/pbs_stdout -e /usr/local/webmo/private/webmo/graeme/2/pbs_stderr -p 'compute' --nodes=1 --tasks-per-node=16 /usr/local/webmo/private/webmo/graeme/2/pbs_script.sh
The job ran without error from slurm. Looking at the standard error for the script it says the following:
Code:
$ cat /usr/local/webmo/private/webmo/graeme/2/pbs_stderr
You don't exist, go away!
You don't exist, go away!
You don't exist, go away!
/var/spool/slurmd/job21395/slurm_script: line 45: /usr/local/webmo/private/webmo/status/2.faeITK: No such file or directory
The output file /usr/local/webmo/private/webmo/graeme/2/pbs_stdout is empty.
I can think of two possibly related issues.
1. The webmo user may not be working correctly. There is an LDAP server that both the WebMO host and the cluster use, but the webmo user isn't in that LDAP; it is manually created on both servers with corresponding UIDs. Could this be an issue, e.g. the user account not existing on the cluster nodes? Thinking about it, that is probably an issue.
2. Should the /tmp directory configured in WebMO be accessible in the cluster? It was unclear to me from the instructions whether that was needed. Sorry if I missed something.
Thanks
Sean
Re: Jobs can't create run_log file
Posted: Mon Jul 13, 2020 8:03 pm
by schmidt
Sean,
Let's first ensure that the job did not actually run after you manually submitted from the command line. Note that pbs_stdout probably SHOULD be empty, because output is typically redirected to other output files (e.g. output.log & run_log); check those after submitting manually.
In terms of your issues:
1) Is definitely a problem! SLURM is going to be very unhappy when it tries to log in to the remote node and execute the job as user 'webmo', only to find that the user may not exist. (This should be easy to diagnose. Log in on the webserver as user 'webmo' and just submit ANY old example job to SLURM to verify it works.)
2) The /tmp (or configured scratch directory) does NOT need to be NFS shared. This is local scratch, as is typical in computational chemistry.
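For point 1, a minimal sketch of a check that can be run on both the web server and a compute node. The helper name `user_resolves` is hypothetical, and the `srun` line in the comments assumes the 'compute' partition mentioned earlier in this thread.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given account name resolves on this host
# (through /etc/passwd, LDAP, or whatever name service the host is set up with).
user_resolves() {
    id -u "$1" >/dev/null 2>&1
}

# Example on the web server:
#   user_resolves webmo && echo "webmo exists here"
# Example on a compute node, assuming the 'compute' partition from this thread:
#   srun -p compute -N 1 id webmo
```

If the account resolves on the web server but not on the node, the job would be running as a UID with no matching user there.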
Re: Jobs can't create run_log file
Posted: Fri Jul 17, 2020 2:37 pm
by smcgrat
Thanks schmidt. Your help is very much appreciated.
I've created a webmo account in our LDAP and switched to that; hopefully that will make things a little easier.
I am submitting jobs to the slurm queue as the webmo user from the VM as follows:
Code:
sbatch -J WebMO_2 -o /usr/local/webmo/private/webmo/graeme/2/pbs_stdout -e /usr/local/webmo/private/webmo/graeme/2/pbs_stderr -p 'compute' --nodes=1 --tasks-per-node=16 /usr/local/webmo/private/webmo/graeme/2/pbs_script.sh
And getting this error in the output.log:
Code:
/home/support/apps/apps/gaussian/intel/15.0.6/g09/g09: error while loading shared libraries: libmkl_intel_ilp64.so: cannot open shared object file: No such file or directory
I suspect that WebMO needs to be told how to load Gaussian properly. So I have added:
Code:
source /home/support/apps/intel/15.0.6/composer_xe_2015.6.233/bin/compilervars.sh intel64
source /etc/profile.d/modules.sh
module load intel/15.0.6/composer_xe_2015.6.233 apps gaussian/intel/15.0.6/g09
To these files:
Code:
/home/users/webmo/.webmo_profile
/home/users/webmo/.bashrc
As per
https://www.webmo.net/link/help/BatchQueues.html which states "The script ".webmo_profile" in <webmoUserDir>, if it exists, will be sourced as part of the script that is sent to PBS/SLURM/etc".
I'm not really sure where the webmoUserDir is though! Is it the user's home folder or somewhere else? Sorry if I have missed something in the docs.
Regards
Sean
Re: Jobs can't create run_log file
Posted: Fri Jul 17, 2020 2:45 pm
by smcgrat
Actually, is the webmoUserDir where the .webmo_profile should go the same as the "User files directory" from the configuration step specified with setup.pl during installation?
I've put a .webmo_profile file there anyway to find out and see how it goes once the cluster runs the job.
Thanks
Sean
Re: Jobs can't create run_log file
Posted: Sat Jul 18, 2020 8:28 pm
by schmidt
You got it! The user directory is the one specified for user file storage during setup, likely ~webmo/webmo.
Re: Jobs can't create run_log file
Posted: Wed Jul 22, 2020 2:53 pm
by smcgrat
Hi, sorry to resurrect this yet again. I'm still getting problems unfortunately and would appreciate your help please. I've found a workaround, but it's messy and brittle, and I must be doing something wrong to be causing it.
The .webmo_profile environment control doesn't seem to be working as I expect. This is what it is set to currently:
Code:
$ cat /usr/local/webmo/private/webmo/.webmo_profile
source /home/support/apps/intel/15.0.6/composer_xe_2015.6.233/bin/compilervars.sh intel64
source /etc/profile.d/modules.sh
module use /home/support/modulefiles
module load intel/15.0.6/composer_xe_2015.6.233 apps gaussian/intel/15.0.6/g09
That should set the environment up to run Gaussian properly, including populating the LD_LIBRARY_PATH variable. Submitting to the queue:
Code:
sbatch -J WebMO_2 --reservation=webmo -o /usr/local/webmo/private/webmo/graeme/2/pbs_stdout -e /usr/local/webmo/private/webmo/graeme/2/pbs_stderr -p 'compute' --nodes=1 --tasks-per-node=16 /usr/local/webmo/private/webmo/graeme/2/pbs_script.sh
Leads to Gaussian jobs failing with:
Code:
[webmo@hprc-guest-114-253 gaussian_job]$ cat /usr/local/webmo/private/webmo/graeme/2/output.log
/home/support/apps/apps/gaussian/intel/15.0.6/g09/g09: error while loading shared libraries: libmkl_intel_ilp64.so: cannot open shared object file: No such file or directory
The LD_LIBRARY_PATH the run_gaussian.cgi script seemed to be picking up was:
Code:
/home/support/apps/apps/gaussian/intel/15.0.6/g09/bsd:/home/support/apps/apps/gaussian/intel/15.0.6/g09/local:/home/support/apps/apps/gaussian/intel/15.0.6/g09/extras:/home/support/apps/apps/gaussian/intel/15.0.6/g09
Which did not include the necessary path to where libmkl_intel_ilp64.so is.
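A small sketch of how that can be confirmed for any colon-separated path list. `find_lib_in_path` is a hypothetical helper written for this post, not part of WebMO or Gaussian.

```shell
#!/bin/sh
# Hypothetical helper: print the first directory in a colon-separated list
# that contains the named library file; return non-zero if none does.
find_lib_in_path() {
    lib=$1
    list=$2
    old_ifs=$IFS
    IFS=:
    # Unquoted on purpose so the list is split on ':' into directories.
    for dir in $list; do
        if [ -n "$dir" ] && [ -e "$dir/$lib" ]; then
            IFS=$old_ifs
            printf '%s\n' "$dir"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# Example:
#   find_lib_in_path libmkl_intel_ilp64.so "$LD_LIBRARY_PATH" || echo "not found"
```

Run against the value above, this would be expected to report nothing, since none of the four g09 directories is where the MKL library lives.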
To get Gaussian jobs to run it was necessary to explicitly set the LD_LIBRARY_PATH in /usr/local/webmo/public_html/cgi-bin/webmo/run_gaussian.cgi:
Code:
$ diff -u /usr/local/webmo/public_html/cgi-bin/webmo/run_gaussian-backup.cgi /usr/local/webmo/public_html/cgi-bin/webmo/run_gaussian.cgi
--- /usr/local/webmo/public_html/cgi-bin/webmo/run_gaussian-backup.cgi 2020-07-22 13:35:57.770487800 +0100
+++ /usr/local/webmo/public_html/cgi-bin/webmo/run_gaussian.cgi 2020-07-22 15:11:04.992462468 +0100
@@ -62,7 +62,9 @@
$ENV{'GAUSS_ARCHDIR'} = $GAUSS_ARCHDIR;
$ENV{'GMAIN'} = $GMAIN;
$ENV{'PATH'} = $ENV{'PATH'}.":".$ENV{'GAUSS_EXEDIR'};
-$ENV{'LD_LIBRARY_PATH'} = $LD_LIBRARY_PATH;
+#$ENV{'LD_LIBRARY_PATH'} = $LD_LIBRARY_PATH;
+$ENV{'LD_LIBRARY_PATH'} = '/home/support/apps/intel/15.0.6/composer_xe_2015.6.233/compiler/lib/intel64:/home/support/apps/intel/15.0.6/composer_xe_2015.6.233/mpirt/lib/intel64:/home/support/apps/intel/15.0.6/composer_xe_2015.6.233/ipp/lib/intel64:/home/support/apps/intel/15.0.6/composer_xe_2015.6.233/ipp/tools/intel64/perfsys:/home/support/apps/intel/15.0.6/composer_xe_2015.6.233/mkl/lib/intel64:/home/support/apps/intel/15.0.6/composer_xe_2015.6.233/tbb/lib/intel64/gcc4.4:/home/support/apps/intel/15.0.6/composer_xe_2015.6.233/debugger/libipt/intel64/lib:/home/support/apps/apps/gaussian/intel/15.0.6/g09/bsd:/home/support/apps/apps/gaussian/intel/15.0.6/g09/local:/home/support/apps/apps/gaussian/intel/15.0.6/g09/extras:/home/support/apps/apps/gaussian/intel/15.0.6/g09';
+print "LD_LIBRARY_PATH = $LD_LIBRARY_PATH\n";
# if we are using PBS, find out which host we are running on
if ($externalBatchQueue)
Which is a bit of a kludge.
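A slightly less brittle variant might be to capture whatever LD_LIBRARY_PATH the profile produces instead of pasting the value by hand. This is only a sketch: `ld_path_from_profile` is a hypothetical helper, and it assumes the profile can be sourced by the shell running it (a profile that itself uses `source` or `module` would need bash and the modules setup available).

```shell
#!/bin/sh
# Hypothetical helper: source a profile in a subshell and print the
# LD_LIBRARY_PATH it leaves behind, without touching the caller's environment.
ld_path_from_profile() {
    (
        unset LD_LIBRARY_PATH
        # Discard anything the profile itself prints.
        . "$1" >/dev/null 2>&1
        printf '%s\n' "$LD_LIBRARY_PATH"
    )
}

# Example (path taken from the thread):
#   ld_path_from_profile /usr/local/webmo/private/webmo/.webmo_profile
```

That at least shows whether the profile on its own yields the MKL directory, separately from what run_gaussian.cgi does with the variable afterwards.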
What have I missed here? I have to be doing something wrong.
Thanks in advance.
Sean