Jobs can't create run_log file

smcgrat
Posts: 14
Joined: Mon Jul 06, 2020 4:51 pm
Full Name: Sean McGrath
Organization: Trinity College Dublin

Jobs can't create run_log file

Post by smcgrat »

Hello,

Apologies if this is the wrong place for this post.

When we try to run a job with WebMO (version 20.0.009e), we get the following error:

[Fri Jul 10 10:20:35 2020] jobmgr.cgi: Cannot open file /usr/local/webmo/private/webmo/graeme/3/run_log: No such file or directory at text_dump.cgi line 78.

The other contents of that directory are:

$ ls -latrh /usr/local/webmo/private/webmo/graeme/3/
total 40K
drwxr-xr-x 5 webmo webmo   33 Jul 10 10:19 ..
-rw-r--r-- 1 webmo webmo  151 Jul 10 10:19 zmatrix
-rw-r--r-- 1 webmo webmo  315 Jul 10 10:19 job_options
-rw-r--r-- 1 webmo webmo  241 Jul 10 10:19 input.xyz
-rw-r--r-- 1 webmo webmo   96 Jul 10 10:19 connections
-rw-r--r-- 1 webmo webmo   43 Jul 10 10:19 charges
-rw-r--r-- 1 webmo webmo  372 Jul 10 10:19 input.com
-rw-r--r-- 1 webmo webmo  193 Jul 10 10:19 summary
-rw-r--r-- 1 webmo webmo    0 Jul 10 10:19 notes
-rw-r--r-- 1 webmo webmo   26 Jul 10 10:19 jobid
-rw-r--r-- 1 webmo webmo 1.9K Jul 10 10:19 pbs_script.sh
drwxr-xr-x 2 webmo webmo  188 Jul 10 10:19 .
-rw-r--r-- 1 webmo webmo   30 Jul 10 10:19 pbs_stderr

So I'm a bit confused as to why this is happening.

Apologies if I have missed something basic or have left any useful information out.

Regards

Sean

schmidt
Posts: 83
Joined: Sat May 30, 2020 3:00 pm
Full Name: JR Schmidt
Organization: WebMO, LLC

Re: Jobs can't create run_log file

Post by schmidt »

Can you provide some additional context? Are you trying to get WebMO integrated with an existing queueing system? (From the directory listing, it appears the answer is "yes").

Assuming that is the case, I would recommend checking your Torque/SLURM/etc. job to see if WebMO actually submitted the job, and, if so, whether it was actually scheduled / executed by Torque/SLURM/etc. FYI, the actual script that WebMO submits to the queue is 'pbs_script.sh'. In the comment line of the script you can see the exact command line that was used to submit the script to the queue. Sometimes this can be useful, e.g. to log in as user 'webmo' from the command line and manually submit the script (using those exact arguments) to debug any issues.
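
As a minimal sketch of pulling that command line out of the generated script: the helper below prints the first comment line that mentions sbatch. The name submit_command is hypothetical, and the exact wording of WebMO's comment line is an assumption here, so adjust the pattern to whatever your pbs_script.sh actually contains.

```shell
# submit_command FILE: print the first '#' comment line in FILE that
# mentions "sbatch", with the leading '#' and spaces removed.
# Assumption: the submit command is recorded on such a comment line.
# The match is case-sensitive, so '#SBATCH' directives are not picked up.
submit_command() {
    grep -m 1 -E '^#.*sbatch' "$1" | sed -E 's/^#+[[:space:]]*//'
}

# Example (path taken from the thread):
#   submit_command /usr/local/webmo/private/webmo/graeme/3/pbs_script.sh
```

The printed command can then be run by hand as user 'webmo' to see what the scheduler reports.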

Re: Jobs can't create run_log file

Post by smcgrat »

schmidt wrote:
Fri Jul 10, 2020 9:43 pm
Can you provide some additional context? Are you trying to get WebMO integrated with an existing queueing system? (From the directory listing, it appears the answer is "yes").

Apologies for the lack of context and thank you for your response.

Yes, we are trying to integrate with Slurm running on a cluster.

schmidt wrote:
Assuming that is the case, I would recommend checking your Torque/SLURM/etc. job to see if WebMO actually submitted the job, and, if so, whether it was actually scheduled / executed by Torque/SLURM/etc.

Thanks. Yes, it was scheduled, and it restarted 439 times for whatever reason. Here are the details of the job:

JobId=21265 JobName=WebMO_2
   UserId=nobody(1000) GroupId=odias(1001) MCS_label=N/A
   Priority=100104440 Nice=-100000000 Account=(null) QOS=normal
   JobState=COMPLETING Reason=BeginTime Dependency=(null)
   Requeue=0 Restarts=439 BatchFlag=2 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2020-07-08T09:03:00 EligibleTime=2020-07-08T09:07:00
   AccrueTime=2020-07-08T09:05:00
   StartTime=2020-07-08T09:05:00 EndTime=2020-07-09T09:05:00 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-07-08T09:05:00
   Partition=compute AllocNode:Sid=hprc-guest-114-253:28498
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=
   BatchHost=pople-n019
   NumNodes=1 NumCPUs=16 NumTasks=16 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=16,mem=63000M,node=1,billing=16
   Socks/Node=* NtasksPerN:B:S:C=16:0:*:* CoreSpec=*
   MinCPUsNode=16 MinMemoryNode=63000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/usr/local/webmo/public_html/cgi-bin/webmo
   StdErr=/usr/local/webmo/private/webmo/graeme/2/pbs_stderr
   StdIn=/dev/null
   StdOut=/usr/local/webmo/private/webmo/graeme/2/pbs_stdout
   Power=

The standard output file, /usr/local/webmo/private/webmo/graeme/2/pbs_stdout, doesn't exist though!
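
For picking individual values out of a dump like the one above, here is a small sketch. The helper name job_field is hypothetical; it only assumes the KEY=VALUE layout visible in the output.

```shell
# job_field KEY: read `scontrol show job` text on stdin and print the value
# of the first KEY=VALUE token whose key is exactly KEY.
job_field() {
    tr -s '[:space:]' '\n' |
        awk -F= -v key="$1" '$1 == key { print substr($0, length(key) + 2); exit }'
}

# Example, using the job id from the thread:
#   scontrol show job 21265 | job_field Restarts
#   scontrol show job 21265 | job_field UserId
```

Fields such as JobState, Restarts and UserId are usually the first ones worth comparing against what was expected.
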

schmidt wrote:
FYI, the actual script that WebMO submits to the queue is 'pbs_script.sh'. In the comment line of the script you can see the exact command line that was used to submit the script to the queue. Sometimes this can be useful, e.g. to login as user 'webmo' from the command line and manually submit the script (using those exact arguments) to debug any issues.

Thanks, that was helpful. I tested it as follows:

/bin/sbatch -J WebMO_2 -o /usr/local/webmo/private/webmo/graeme/2/pbs_stdout -e /usr/local/webmo/private/webmo/graeme/2/pbs_stderr -p 'compute' --nodes=1 --tasks-per-node=16 /usr/local/webmo/sbatch.sh

And it ran OK. Here are the contents of the /usr/local/webmo/sbatch.sh script:

[webmo@hprc-guest-114-253 ~]$ cat /usr/local/webmo/sbatch.sh
#!/bin/sh
#SBATCH -N 1 
#SBATCH -t 00:01:00
#SBATCH -J webmotest           
#SBATCH -p compute   

echo ""
echo "srun hostname"
srun hostname
echo ""
echo "hostname -f"
hostname -f
echo ""
echo "free -g"
free -g
echo ""
echo "whoami"
whoami
echo ""
echo "env | grep SLURM"
env | grep SLURM
echo ""
echo "pwd"
pwd
echo ""
echo "ls -latrh"
ls -latrh
echo ""
exit 

So, on the face of it, it seems to me that the WebMO machine can submit jobs to the Slurm queue successfully. Am I right, or have I missed something?

Thanks

Sean

Re: Jobs can't create run_log file

Post by smcgrat »

Sorry, I didn't fully understand how the submission worked. I submitted the /usr/local/webmo/private/webmo/graeme/2/pbs_script.sh script that had been generated by WebMO as follows:

$ /bin/sbatch -J WebMO_2 -o /usr/local/webmo/private/webmo/graeme/2/pbs_stdout -e /usr/local/webmo/private/webmo/graeme/2/pbs_stderr -p 'compute' --nodes=1 --tasks-per-node=16 /usr/local/webmo/private/webmo/graeme/2/pbs_script.sh

The job ran without error from Slurm. The standard error for the script says the following:

$ cat /usr/local/webmo/private/webmo/graeme/2/pbs_stderr
You don't exist, go away!
You don't exist, go away!
You don't exist, go away!
/var/spool/slurmd/job21395/slurm_script: line 45: /usr/local/webmo/private/webmo/status/2.faeITK: No such file or directory

The output file /usr/local/webmo/private/webmo/graeme/2/pbs_stdout is empty.

I can think of two possibly related issues.

1. The webmo user may not be working correctly. There is an LDAP server that both the WebMO host and the cluster use, but the webmo user isn't in that LDAP; it is manually created on both servers with corresponding UIDs. Could this be an issue, e.g. the user account not existing on the cluster nodes? Thinking about it, that is probably an issue.

2. Should the /tmp directory configured in WebMO be accessible on the cluster? It was unclear to me from the instructions whether that was needed. Sorry if I missed something.

Thanks

Sean

Re: Jobs can't create run_log file

Post by schmidt »

Sean,

Let's first make sure that the job did not actually run after you manually submitted it from the command line. Note that pbs_stdout probably SHOULD be empty, because output is typically redirected to other output files (e.g. output.log and run_log); check those after submitting manually.

In terms of your issues:

1) Is definitely a problem! SLURM is going to be very unhappy when it tries to log in to the remote node and execute the job as user 'webmo', only to find that the user may not exist. (This should be easy to diagnose. Log in on the web server as user 'webmo' and just submit ANY old example job to SLURM to verify it works.)

2) The /tmp (or configured scratch directory) does NOT need to be NFS shared. This is local scratch, as is typical in computational chemistry.
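
A rough sketch of the account check from point 1, assuming a POSIX shell on both the web server and the compute nodes. The helper names user_known and report_user are hypothetical; the partition name is the one used earlier in the thread.

```shell
# user_known NAME: succeed if NAME resolves to an account on this host.
user_known() {
    id -u "$1" >/dev/null 2>&1
}

# report_user NAME: print a one-line verdict that includes the host name,
# so output collected from several nodes can be told apart.
report_user() {
    if user_known "$1"; then
        printf '%s: user %s found (uid %s)\n' "$(uname -n)" "$1" "$(id -u "$1")"
    else
        printf '%s: user %s NOT found\n' "$(uname -n)" "$1"
    fi
}

# On the web server:        report_user webmo
# On a compute node, e.g.:  srun -p compute -N 1 id webmo
```

If the compute node cannot resolve the account, or reports a different uid than the web server does, that would be consistent with the "You don't exist, go away!" lines in pbs_stderr.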

Re: Jobs can't create run_log file

Post by smcgrat »

Thanks schmidt. Your help is very much appreciated.

I've created a webmo account in our LDAP and transferred to that account; hopefully that will make things a little easier.

I am submitting jobs to the slurm queue as the webmo user from the VM as follows:

sbatch -J WebMO_2  -o /usr/local/webmo/private/webmo/graeme/2/pbs_stdout -e /usr/local/webmo/private/webmo/graeme/2/pbs_stderr -p 'compute' --nodes=1 --tasks-per-node=16 /usr/local/webmo/private/webmo/graeme/2/pbs_script.sh

And getting this error in the output.log:

/home/support/apps/apps/gaussian/intel/15.0.6/g09/g09: error while loading shared libraries: libmkl_intel_ilp64.so: cannot open shared object file: No such file or directory

I suspect that WebMO needs to be told how to load Gaussian properly, so I have added:

source /home/support/apps/intel/15.0.6/composer_xe_2015.6.233/bin/compilervars.sh intel64
source /etc/profile.d/modules.sh
module load intel/15.0.6/composer_xe_2015.6.233 apps gaussian/intel/15.0.6/g09

To these files:

/home/users/webmo/.webmo_profile
/home/users/webmo/.bashrc

As per https://www.webmo.net/link/help/BatchQueues.html which states "The script ".webmo_profile" in <webmoUserDir>, if it exists, will be sourced as part of the script that is sent to PBS/SLURM/etc".

I'm not really sure where the webmoUserDir is though! Is it the user's home folder or somewhere else? Sorry if I have missed something in the docs.

Regards

Sean

Re: Jobs can't create run_log file

Post by smcgrat »

Actually, is the webmoUserDir where the .webmo_profile should go the same as the "User files directory" from the configuration step specified with setup.pl during installation?

I've put a .webmo_profile file there anyway to find out and see how it goes once the cluster runs the job.
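
One way to see what a profile contributes without waiting for the cluster, sketched under the assumption that bash is available: source the file in an otherwise empty environment and print the resulting LD_LIBRARY_PATH. The helper name profile_ld_path is hypothetical.

```shell
# profile_ld_path FILE: source FILE in a near-empty environment and print
# the LD_LIBRARY_PATH it leaves behind. Pass an absolute path to FILE.
profile_ld_path() {
    env -i PATH=/usr/bin:/bin bash --noprofile --norc -c \
        '. "$1" >/dev/null 2>&1; printf "%s\n" "$LD_LIBRARY_PATH"' _ "$1"
}

# Example (path taken from the thread):
#   profile_ld_path /usr/local/webmo/private/webmo/.webmo_profile
```

This only shows what the profile itself sets. The real job also passes through the scheduler and WebMO's own scripts, which may change the environment again afterwards.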

Thanks

Sean

Re: Jobs can't create run_log file

Post by schmidt »

You got it! The user directory is the one specified for user file storage during setup, likely ~webmo/webmo.

Re: Jobs can't create run_log file

Post by smcgrat »

Hi, sorry to resurrect this yet again. I'm still getting problems unfortunately and would appreciate your help please. I've found a workaround, but it's messy and brittle, and I must be doing something wrong to be causing it.

The .webmo_profile environment control doesn't seem to be working as I expect. This is what it is currently set to:

$ cat /usr/local/webmo/private/webmo/.webmo_profile
source /home/support/apps/intel/15.0.6/composer_xe_2015.6.233/bin/compilervars.sh intel64
source /etc/profile.d/modules.sh
module use /home/support/modulefiles
module load intel/15.0.6/composer_xe_2015.6.233 apps gaussian/intel/15.0.6/g09

That should set up the environment to run Gaussian properly, including populating the LD_LIBRARY_PATH variable. Submitting to the queue with:

sbatch -J WebMO_2 --reservation=webmo -o /usr/local/webmo/private/webmo/graeme/2/pbs_stdout -e /usr/local/webmo/private/webmo/graeme/2/pbs_stderr -p 'compute' --nodes=1 --tasks-per-node=16 /usr/local/webmo/private/webmo/graeme/2/pbs_script.sh

Leads to Gaussian jobs failing with:

[webmo@hprc-guest-114-253 gaussian_job]$ cat /usr/local/webmo/private/webmo/graeme/2/output.log
/home/support/apps/apps/gaussian/intel/15.0.6/g09/g09: error while loading shared libraries: libmkl_intel_ilp64.so: cannot open shared object file: No such file or directory

The LD_LIBRARY_PATH that the run_gaussian.cgi script seemed to be picking up was:

/home/support/apps/apps/gaussian/intel/15.0.6/g09/bsd:/home/support/apps/apps/gaussian/intel/15.0.6/g09/local:/home/support/apps/apps/gaussian/intel/15.0.6/g09/extras:/home/support/apps/apps/gaussian/intel/15.0.6/g09

Which did not include the path to the directory containing libmkl_intel_ilp64.so.
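
To check that kind of claim mechanically, here is a minimal sketch that walks a colon-separated list and reports the first directory holding a given library. The helper name find_lib is hypothetical.

```shell
# find_lib LIB PATHLIST: print the first directory in the colon-separated
# PATHLIST that contains LIB; return non-zero if no directory does.
find_lib() {
    lib=$1
    old_ifs=$IFS
    IFS=:
    for dir in $2; do
        if [ -n "$dir" ] && [ -e "$dir/$lib" ]; then
            IFS=$old_ifs
            printf '%s\n' "$dir"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# Example:
#   find_lib libmkl_intel_ilp64.so "$LD_LIBRARY_PATH" || echo "not on path"
```

Running it against both the list printed by the job and the list produced by the profile shows which of the two is missing the MKL directory.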

To get Gaussian jobs to run, it was necessary to explicitly set LD_LIBRARY_PATH in /usr/local/webmo/public_html/cgi-bin/webmo/run_gaussian.cgi:

$ diff -u /usr/local/webmo/public_html/cgi-bin/webmo/run_gaussian-backup.cgi /usr/local/webmo/public_html/cgi-bin/webmo/run_gaussian.cgi
--- /usr/local/webmo/public_html/cgi-bin/webmo/run_gaussian-backup.cgi  2020-07-22 13:35:57.770487800 +0100
+++ /usr/local/webmo/public_html/cgi-bin/webmo/run_gaussian.cgi 2020-07-22 15:11:04.992462468 +0100
@@ -62,7 +62,9 @@
 $ENV{'GAUSS_ARCHDIR'} = $GAUSS_ARCHDIR;
 $ENV{'GMAIN'} = $GMAIN;
 $ENV{'PATH'} = $ENV{'PATH'}.":".$ENV{'GAUSS_EXEDIR'};
-$ENV{'LD_LIBRARY_PATH'} = $LD_LIBRARY_PATH;
+#$ENV{'LD_LIBRARY_PATH'} = $LD_LIBRARY_PATH;
+$ENV{'LD_LIBRARY_PATH'} = '/home/support/apps/intel/15.0.6/composer_xe_2015.6.233/compiler/lib/intel64:/home/support/apps/intel/15.0.6/composer_xe_2015.6.233/mpirt/lib/intel64:/home/support/apps/intel/15.0.6/composer_xe_2015.6.233/ipp/lib/intel64:/home/support/apps/intel/15.0.6/composer_xe_2015.6.233/ipp/tools/intel64/perfsys:/home/support/apps/intel/15.0.6/composer_xe_2015.6.233/mkl/lib/intel64:/home/support/apps/intel/15.0.6/composer_xe_2015.6.233/tbb/lib/intel64/gcc4.4:/home/support/apps/intel/15.0.6/composer_xe_2015.6.233/debugger/libipt/intel64/lib:/home/support/apps/apps/gaussian/intel/15.0.6/g09/bsd:/home/support/apps/apps/gaussian/intel/15.0.6/g09/local:/home/support/apps/apps/gaussian/intel/15.0.6/g09/extras:/home/support/apps/apps/gaussian/intel/15.0.6/g09';
+print "LD_LIBRARY_PATH = $LD_LIBRARY_PATH\n";

 # if we are using PBS, find out which host we are running on
 if ($externalBatchQueue)

Which is a bit of a kludge.

What have I missed here? I have to be doing something wrong.
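
One possible reading of the diff, offered as a guess rather than a confirmed diagnosis: the stock line assigns LD_LIBRARY_PATH from a value held by WebMO instead of extending the inherited one, which would discard whatever .webmo_profile exported. If that is the cause, merging the two lists would be less brittle than hard-coding the full value. A small shell sketch of such a merge (merge_paths is a hypothetical helper, not part of WebMO):

```shell
# merge_paths A B: join two colon-separated lists into one, keeping the
# original order and dropping empty entries and duplicates.
merge_paths() {
    printf '%s:%s' "$1" "$2" | tr ':' '\n' | awk 'NF && !seen[$0]++' | paste -sd: -
}

# Example: combine the inherited value with a Gaussian directory.
#   merge_paths "$LD_LIBRARY_PATH" /home/support/apps/apps/gaussian/intel/15.0.6/g09
```

The same idea applied inside run_gaussian.cgi would mean appending to the existing environment value rather than replacing it; whether that is appropriate for a given WebMO version is an assumption to verify first.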

Thanks in advance.

Sean
