Requirements for CLC Grid Integration
- A grid submission system with a working DRMAA library must already be in place. Further details below.
- The CLC Server must be installed on a Linux-based system that is configured as a submit host in the grid environment.
- The user running the CLC Server process must exist on all the grid nodes. This user is seen as the submitter of the grid jobs.
- CLC Server file locations holding data that will be used must be mounted on the grid nodes with the same path as on the master CLC Server, and must be accessible to the user that runs the CLC Server process.
- Licenses for the execution nodes.
Note: When downloaded license files are being used to license the master node, a CLC Network License Manager with one or more available CLC Grid Worker licenses must be reachable from the grid nodes.
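The user and file location requirements above can be checked on each grid node before jobs are submitted. The following is a minimal sketch, not part of the CLC software: the function name check_grid_node and the example user name and path mentioned afterwards are hypothetical. It should be run as the user that runs the CLC Server process, because the access check applies to the current user.

```python
import os
import pwd


def check_grid_node(user_name, data_paths):
    """Return a list of problems found on the node this script runs on.

    user_name  -- hypothetical name of the user running the CLC Server process
    data_paths -- file locations expected at the same path as on the master
    """
    problems = []

    # The submitting user must exist on every grid node.
    try:
        pwd.getpwnam(user_name)
    except KeyError:
        problems.append(f"user {user_name!r} does not exist on this node")

    # Each file location must be mounted and accessible to the current user.
    for path in data_paths:
        if not os.path.isdir(path):
            problems.append(f"{path} is not mounted or is not a directory")
        elif not os.access(path, os.R_OK | os.X_OK):
            problems.append(f"{path} is not readable by the current user")

    return problems
```

For example, check_grid_node("clcserver", ["/data/clc"]) returns an empty list when the (hypothetical) user clcserver exists and /data/clc is mounted and readable on that node.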
Supported grid scheduling systems
A grid scheduling system used to execute jobs submitted by a CLC Server must have:
- A working DRMAA library.
- A mechanism to limit the number of CLC jobs running simultaneously on the grid nodes to the number of licensed CLC Grid Workers (see below).
Grid integration has been verified using the following third-party scheduling systems:
- SLURM 16.05.2 and 21.08.1
- UNIVA 8.4.1 and 8.6.1.7
- LSF 9.1.1 and 10.1
- PBS Pro 2020.1.4 and 2021.1.1
Notes about DRMAA for each of the grid scheduling systems are provided in the appendix, including information relating to compilation, where relevant.
Limiting CLC grid job number to the number of CLC Grid Worker licenses
The grid scheduling system must be configured to limit the number of CLC jobs simultaneously running on the grid nodes to the number of CLC Grid Worker licenses (or execution node licenses) that have been purchased. If more CLC jobs than this are launched, the excess jobs should be held in the queue until a license becomes available.
For SLURM, the number of CLC Grid Worker licenses can be configured as described on https://slurm.schedmd.com/licenses.html. For LSF and UNIVA, a "Consumable Resource" can be configured, as described in Configuring licenses as a consumable resource. Relevant information about configuring consumable resources for PBS Pro can be found in the administrator's guide for that scheduling software.
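Once licenses have been configured in SLURM, it can be useful to confirm how many are in use. The sketch below is an illustration only. It assumes that the SLURM command "scontrol show licenses" is available on the submit host and prints key=value pairs in the form shown in the SLURM licenses documentation. The license name clcgridworker is a hypothetical example, not a name required by the CLC software.

```python
import subprocess


def parse_licenses(text):
    """Parse key=value output into {license_name: {field: int}}.

    Assumes each license starts with a LicenseName=<name> token followed by
    numeric fields such as Total, Used and Free. Non-numeric fields are skipped.
    """
    licenses = {}
    current = None
    for token in text.split():
        key, sep, value = token.partition("=")
        if not sep:
            continue
        if key == "LicenseName":
            current = licenses.setdefault(value, {})
        elif current is not None and value.isdigit():
            current[key] = int(value)
    return licenses


def query_slurm_licenses():
    """Run scontrol on the submit host and return the parsed license table."""
    result = subprocess.run(
        ["scontrol", "show", "licenses"],
        capture_output=True, text=True, check=True,
    )
    return parse_licenses(result.stdout)
```

With a hypothetical license named clcgridworker, query_slurm_licenses()["clcgridworker"]["Free"] would give the number of licenses not currently held by running jobs.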
TORQUE from Adaptive Computing is an example of a system that works for submitting CLC jobs, but that cannot be supported because it does not provide a means of limiting the number of CLC jobs sent simultaneously to the cluster to match the number of CLC Grid Worker licenses. So, with TORQUE, if you have three Grid Worker licenses, up to three jobs can be run simultaneously. However, if three jobs are already running and you launch a fourth job, that job will fail because no license is available for it.
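The difference between a scheduler that holds excess jobs and one that lets them start without a license can be illustrated with a small simulation. This is a simplified model, not a description of how any particular scheduler is implemented: all jobs are assumed to be submitted at time 0, each job needs exactly one license, and the function name simulate and the durations used are hypothetical.

```python
import heapq


def simulate(num_licenses, durations, hold_excess):
    """Model jobs submitted at time 0 that each need one license.

    Returns (finish, failed): finish maps job index to completion time,
    failed lists the jobs that started with no license available.
    """
    finish = {}
    failed = []
    in_use = []  # min-heap of finish times for jobs holding a license

    for job, duration in enumerate(durations):
        if len(in_use) < num_licenses:
            start = 0  # a license is free immediately
        elif hold_excess:
            # License-aware scheduler: wait in the queue for the next release.
            start = heapq.heappop(in_use)
        else:
            # No license limit enforced: the job starts and fails.
            failed.append(job)
            continue
        end = start + duration
        heapq.heappush(in_use, end)
        finish[job] = end

    return finish, failed


# Three licenses, four jobs.
print(simulate(3, [10, 20, 30, 5], hold_excess=True))
# -> ({0: 10, 1: 20, 2: 30, 3: 15}, [])
print(simulate(3, [10, 20, 30, 5], hold_excess=False))
# -> ({0: 10, 1: 20, 2: 30}, [3])
```

In the first call the fourth job waits until the first license is released at time 10 and then completes. In the second call it fails, which corresponds to the TORQUE behavior described above.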
This limitation can be overcome, allowing you to work with systems such as TORQUE, if you control job submission in some other way so that the number of licenses is not exceeded. One option is a one-node-runs-one-job setup. You could then set up a queue where jobs are sent only to a certain number of nodes, where that number matches the number of CLC Grid Worker licenses you have.
