Requirements for CLC Grid Integration
- A grid submission system with a working DRMAA library must already be in place. Further details below.
- The CLC Server must be installed on a Linux-based system configured as a submit host in the grid environment.
- The user running the CLC Server process must exist on all the grid nodes. This user is seen as the submitter of the grid jobs.
- CLC Server file locations holding data that will be used must be mounted on the grid nodes at the same path as on the master CLC Server, and they must be accessible to the user that runs the CLC Server process.
- A CLC Network License Manager with one or more available CLC Grid Worker licenses must be reachable from the execution hosts in the grid setup.
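The shared file location requirement above can be spot-checked with a short script run on each grid node as the user that runs the CLC Server process. This is a minimal sketch, not part of the CLC software: the function name `check_file_locations` is hypothetical, and "accessible" is assumed here to mean that the directory exists at the given path and can be read and traversed by the current user.

```python
import os


def check_file_locations(paths):
    """Return a dict mapping each path to a list of problems (empty if none).

    Assumption: "accessible" means the path is an existing directory that
    the current user can read and traverse. Adjust to local requirements.
    """
    problems = {}
    for path in paths:
        issues = []
        if not os.path.isdir(path):
            # Either the file system is not mounted here, or it is mounted
            # at a different path than on the master CLC Server.
            issues.append("not a directory, or not mounted at this path")
        elif not os.access(path, os.R_OK | os.X_OK):
            issues.append("not readable by the current user")
        problems[path] = issues
    return problems
```

Calling `check_file_locations` with the same list of paths on the master server and on every grid node, and comparing the results, would show whether any node is missing a mount or lacks permissions.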
Supported grid scheduling systems
Grid scheduling systems to be used to execute jobs submitted by a CLC Server must have:
- A working DRMAA library.
- A mechanism to limit the number of CLC jobs simultaneously running on the grid nodes to the number of CLC Grid Worker licenses (see below).
Grid integration has been verified using the following third-party scheduling systems:
- SLURM 16.05.2 and 21.08.1
- UNIVA 8.4.1 and 8.6.1.7
- LSF 9.1.1 and 10.1
- PBS Pro 2020.1.4 and 2021.1.1
Notes about DRMAA for each of the grid scheduling systems are provided in the appendix, including information relating to compilation, where relevant.
Limiting CLC grid job number to the number of CLC Grid Worker licenses
The grid scheduling system must be configured to limit the number of CLC jobs simultaneously running on the grid nodes to the number of CLC Grid Worker licenses. Where more CLC jobs than this are launched, excess tasks should be held in the queue until a license becomes available.
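The intended behavior can be illustrated with a small simulation. This is a sketch only, not how any particular scheduler is implemented: it assumes all jobs are submitted at time 0, that each job uses exactly one license, and that queued jobs start in submission order. The function name `simulate_license_limited_queue` is hypothetical.

```python
import heapq
from collections import deque


def simulate_license_limited_queue(job_durations, license_count):
    """Return a (start, end) time pair for each job, in submission order.

    Assumptions: all jobs are submitted at time 0, each job holds one
    license while it runs, and held jobs are released first-in, first-out.
    """
    if license_count < 1:
        raise ValueError("at least one license is required")

    pending = deque(enumerate(job_durations))
    running = []  # min-heap of (end_time, job_index)
    schedule = [None] * len(job_durations)
    now = 0

    while pending:
        if len(running) == license_count:
            # All licenses are in use: the next job stays queued until
            # the earliest-finishing running job releases its license.
            end_time, _ = heapq.heappop(running)
            now = end_time
        index, duration = pending.popleft()
        heapq.heappush(running, (now + duration, index))
        schedule[index] = (now, now + duration)

    return schedule
```

For example, `simulate_license_limited_queue([5, 3, 4, 2], 3)` returns `[(0, 5), (0, 3), (0, 4), (3, 5)]`: with three licenses, the fourth job is held until the second job finishes at time 3, rather than failing.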
For SLURM, the number of CLC Grid Worker licenses can be configured as described at https://slurm.schedmd.com/licenses.html. For LSF and UNIVA, a "Consumable Resource" is configured, as described in Configuring licenses as a consumable resource. Relevant information about configuring consumable resources for PBS Pro can be found in the administrator's guide for that scheduling software.
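For SLURM, locally managed licenses are declared with a `Licenses=` entry in slurm.conf and requested per job with the `--licenses` option (short form `-L`). The sketch below only builds those two strings for a given license count. The helper name `slurm_license_settings` and the license name `clcgridworker` are hypothetical, chosen for illustration, and the assumption that each CLC job requests one license should be checked against your own setup and the SLURM documentation linked above.

```python
def slurm_license_settings(license_name, license_count, per_job=1):
    """Build the slurm.conf entry and the matching job submission option.

    The license name is an arbitrary, site-chosen label; it is not
    defined by the CLC software (assumption for illustration).
    """
    if license_count < 1 or per_job < 1:
        raise ValueError("license counts must be positive")
    if per_job > license_count:
        raise ValueError("a job cannot request more licenses than exist")

    # Line for slurm.conf: the total number of licenses SLURM may hand out.
    conf_line = f"Licenses={license_name}:{license_count}"
    # Option added to each CLC job submission: licenses one job consumes.
    submit_option = f"--licenses={license_name}:{per_job}"
    return conf_line, submit_option
```

With three CLC Grid Worker licenses, `slurm_license_settings("clcgridworker", 3)` returns `("Licenses=clcgridworker:3", "--licenses=clcgridworker:1")`. Jobs submitted with that option while all three licenses are in use remain pending until a license is released.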
TORQUE from Adaptive Computing is an example of a system that works for submitting CLC jobs, but that cannot be supported because it provides no means of limiting the number of CLC jobs sent simultaneously to the cluster to the number of CLC Grid Worker licenses. For example, with TORQUE and three Grid Worker licenses, up to three jobs can run simultaneously. However, if three jobs are already running and you launch a fourth, that fourth job will fail because no license is available for it.
This limitation can be overcome, allowing you to work with systems such as TORQUE, if you control job submission in some other way so that the number of licenses is not exceeded. One possibility is a one-node-runs-one-job setup. You could then set up a queue that sends jobs to only a certain number of nodes, where that number matches the number of CLC Grid Worker licenses you have.