External Batch Queues
WebMO Enterprise allows jobs to be submitted to an external batch queue and then run on compute nodes.
A batch queue is a job control system for scheduling and running jobs on remote compute nodes. A batch queue coordinates all computationally intensive jobs submitted on a system, including both WebMO and non-WebMO jobs.
Batch queues are commonly found on computer clusters, in which users submit jobs on a head node that are then run on compute nodes.
The built-in WebMO queue coordinates running of WebMO jobs by a single WebMO installation. An external batch queue coordinates computationally intensive jobs from multiple WebMO and non-WebMO sources.
Requesting Batch Queue Resources
If an external batch queue is installed and enabled, WebMO users can request a specific batch queue from the Choose Engine page and/or computational resources from the Advanced tab of the Job Options page.
Managing Batch Queues
The system administrator must install an external batch queue on the system and its associated compute nodes before WebMO can use it.
The WebMO admin user administers batch queues with the Batch Queue Manager. The admin user can:
- Enable batch queueing
- Define batch queue usage
- Configure batch queue engines
- Edit batch queues
Preparing for Batch Queues
Batch queue software must be installed on the head node/web server and on the compute nodes before WebMO can use it. Common batch queue systems include PBS, Torque/Maui, Sun Grid Engine, LSF, and SLURM.
To enable interaction with queuing systems, the following criteria must be met:
- WebMO Enterprise must be installed on a queuing system submit host, i.e., where one can run 'qsub'; typically this is the head node of the cluster
- WebMO Enterprise must be installed in a user's home directory; typically this is the 'webmo' user, i.e., in /home/webmo/public_html
- Suexec must be enabled, so that web processes run under the script owner's system ID
- Home directories must be NFS mounted across the compute nodes, so that the head and compute nodes see the same file system
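The first and last criteria above can be spot-checked from a shell on the intended host. The following is a minimal sketch, not part of WebMO: the function name find_submit_cmd is hypothetical, and the filesystem check assumes GNU coreutils stat (it prints "unknown" elsewhere).

```shell
#!/bin/sh
# Sketch only: find_submit_cmd is a hypothetical helper, not a WebMO tool.
# Print the first batch submission command found in PATH
# (qsub for PBS/Torque/SGE, sbatch for SLURM, bsub for LSF).
find_submit_cmd() {
    for cmd in qsub sbatch bsub; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd"
            return 0
        fi
    done
    return 1
}

if submit=$(find_submit_cmd); then
    echo "Submit command available: $submit"
else
    echo "No qsub/sbatch/bsub in PATH; this may not be a submit host"
fi

# Report the filesystem type of the home directory; on a cluster with shared
# home directories this would typically be nfs. Assumes GNU stat.
fstype=$(stat -f -c %T "$HOME" 2>/dev/null || echo unknown)
echo "Home directory filesystem: $fstype"
```

Running the same script on a compute node and comparing the reported filesystem type is one way to confirm that both see the same home directory.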
Some queueing systems require additional configuration:
- Torque and PBS (CentOS 7+, Ubuntu 16+, Debian 8+, SuSe 12+): By default, the systemd daemon creates a private /tmp directory for services, including Apache, which breaks the Torque qsub/qstat commands. In addition, the locked-memory limit (`ulimit -l`) defaults to 64 kB for Apache processes, and systemd ignores changes in /etc/security/limits.conf. Both defaults must be overridden as follows:
- Edit /usr/lib/systemd/system/httpd.service (CentOS, Debian, Ubuntu) or /etc/systemd/system/httpd.service (SuSe) and set
[Service]
PrivateTmp=false
LimitMEMLOCK=infinity
- Restart the daemons
$ sudo systemctl daemon-reload
$ sudo systemctl restart httpd
- SGE: The variables "sge_qmaster" and "sge_execd" must be defined in /etc/services with the correct ports. Modern implementations of SGE use port 6444, while older versions of SGE used port 536. On some systems, the ports are instead set with an environment variable (e.g., SGE_QMASTER_PORT), but this is insufficient for WebMO.
- LSF: A variable of the name "lsf_root" must be defined in the pbs.conf configuration file, and point to the LSF installation directory (which should have etc, lib, conf... subdirectories).
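For the SGE requirement above, the /etc/services entries can be inspected with a short script. This is a minimal sketch under stated assumptions: check_sge_services is a hypothetical helper name, not part of SGE or WebMO, and it only reports which entries exist without judging whether the port values are correct for the local installation.

```shell
#!/bin/sh
# Sketch only: check_sge_services is a hypothetical helper.
# Print the sge_qmaster and sge_execd entries from a services file and
# return non-zero if either one is missing.
check_sge_services() {
    services_file=${1:-/etc/services}
    status=0
    for name in sge_qmaster sge_execd; do
        # Match the service name at the start of a line, followed by whitespace
        entry=$(grep -E "^${name}[[:space:]]" "$services_file" 2>/dev/null | head -n 1)
        if [ -n "$entry" ]; then
            echo "$entry"
        else
            echo "missing: $name" >&2
            status=1
        fi
    done
    return "$status"
}

check_sge_services /etc/services \
    || echo "SGE ports are not fully defined in /etc/services"
```

Each matching line is printed, so the port values can be compared against those actually used by the SGE daemons.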
The script ".webmo_profile" in <webmoUserDir>, if it exists, is sourced as part of the script that is sent to PBS/SLURM/etc. It serves the same purpose as a typical .profile or .bash_profile script that is normally sourced during login: it allows the system administrator to load additional modules, adjust the path, and otherwise configure the environment, in particular for MPI. Note that this script can use the WEBMO_ENGINE and WEBMO_ENGINE_VERSION environment variables to configure the environment for a particular computational engine.
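A .webmo_profile that branches on the engine might look roughly like the following. This is an illustrative sketch only: the engine names ("gaussian", "orca"), the module name, and the /opt/orca install location are assumptions, as is the exact set of values WebMO places in WEBMO_ENGINE; adapt all of them to the local installation.

```shell
# Sketch of a .webmo_profile; engine names, module names, and paths are
# hypothetical and must be adapted to the local installation.
configure_engine_env() {
    # $1 = engine name, $2 = engine version
    case "$1" in
        gaussian)
            # Load a version-specific module when the module command exists
            if command -v module >/dev/null 2>&1; then
                module load "gaussian/$2"
            fi
            ;;
        orca)
            # Put an assumed version-specific install directory first in PATH
            export PATH="/opt/orca/$2:$PATH"
            ;;
        *)
            # Leave the environment unchanged for other engines
            ;;
    esac
}

configure_engine_env "${WEBMO_ENGINE:-}" "${WEBMO_ENGINE_VERSION:-}"
```

Wrapping the logic in a function that works only on its arguments avoids overwriting variables in the batch script that sources the file.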
Before enabling a batch queue, verify that the webmo system user can submit jobs to the batch queuing system from the command line:
- Log into a shell as the webmo system user
- Use qsub to submit a computational chemistry job
- Verify that the job completes successfully
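Before trying a real computational chemistry job, a trivial job can confirm that submission itself works for the webmo system user. The following is a minimal sketch for a Torque-style PBS queue; the job name, file location, and resource request are illustrative assumptions, and the script only submits when qsub is present.

```shell
#!/bin/sh
# Sketch only: file name, job name, and resources are illustrative assumptions.
job_file="${TMPDIR:-/tmp}/webmo_queue_test.sh"

# Write a trivial PBS/Torque job that reports where and as whom it ran
cat > "$job_file" <<'EOF'
#!/bin/sh
#PBS -N webmo_queue_test
#PBS -l nodes=1:ppn=1
#PBS -j oe
echo "Running on $(hostname) as $(whoami)"
echo "Home directory: $HOME"
ls -d "$HOME" >/dev/null && echo "Home directory is visible on this node"
EOF

# Submit only when qsub is available; otherwise just report where the file is
if command -v qsub >/dev/null 2>&1; then
    qsub "$job_file"
else
    echo "qsub not found; job script written to $job_file"
fi
```

If this job runs and its output shows a compute node hostname and a visible home directory, the remaining step is to repeat the test with an actual engine input.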
Troubleshooting Batch Queues
1. Check the batch queue (PBS, SLURM, etc.) logs. Was the job submitted? Under what UID was it submitted? Why was the job terminated?
2. Run the WebMO batch script from the command line. The EXACT WebMO script that was submitted to the batch queue is stored in the job directory (~webmo/webmo/<username>/<job number>) as pbs_script.sh. The EXACT command used to submit the job is contained in a comment at the top of the script. Therefore, an excellent diagnostic test is to log in at the command line as system user "webmo" and use that EXACT same command to submit pbs_script.sh. This command will involve sudo if system users are set up, and it should work without a password if passwordless sudo is correctly set up. Since this is the same process used by WebMO, it will work (or fail) in the same way as when WebMO submits the job, but problems are much easier to diagnose from the command line.
A sample command line session follows:
$ cat /home/webmo/webmo/smith/163/pbs_script.sh
#!/bin/sh
# Submitted using: /usr/bin/sudo -u smith /usr/bin/sbatch -J WebMO_163 -o /home/webmo/webmo/smith/163/pbs_stdout -e /home/webmo/webmo/smith/163/pbs_stderr -p 'hour' --nodes=1 --tasks-per-node=1
...
$ /usr/bin/sudo -u smith /usr/bin/sbatch -J WebMO_163 -o /home/webmo/webmo/smith/163/pbs_stdout -e /home/webmo/webmo/smith/163/pbs_stderr -p 'hour' --nodes=1 --tasks-per-node=1 /home/webmo/webmo/smith/163/pbs_script.sh
Submitted batch job 25687
$ squeue -u smith
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
25687 hour WebMO_163 smith R 0:03 1 hpc201
$ tail /home/webmo/webmo/smith/163/output.out
...
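The submission command can also be pulled out of the script comment instead of being copied by hand. This is a minimal sketch that assumes the "# Submitted using:" comment format shown in the sample session; extract_submit_cmd is a hypothetical helper name, not a WebMO tool.

```shell
#!/bin/sh
# Sketch only: extract_submit_cmd is a hypothetical helper, not a WebMO tool.
# Print the submission command recorded in the "# Submitted using:" comment
# of a WebMO batch script, followed by the script path itself.
extract_submit_cmd() {
    script=$1
    cmd=$(sed -n 's/^# Submitted using: //p' "$script" | head -n 1)
    [ -n "$cmd" ] || return 1
    echo "$cmd $script"
}

# Print the command for review rather than running it blindly
# (the job path is the one used in the sample session)
extract_submit_cmd /home/webmo/webmo/smith/163/pbs_script.sh 2>/dev/null \
    || echo "No submission comment found"
```

Printing the command first, and only then pasting it into the shell, keeps the diagnostic step identical to what WebMO ran while leaving a chance to inspect the sudo user and queue options.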