Q&CE Cluster¶
Introduction¶
The Q&CE cluster is a compute cluster for running large batch jobs that need more resources than are available on your own computer. It consists of a login node, an admin node and a number of compute nodes connected via an internal network. To manage the resources on the compute nodes, the Q&CE cluster uses the resource manager Slurm (Simple Linux Utility for Resource Management) to handle the queuing, scheduling and execution of jobs.
Note
If you want to submit jobs to the cluster, you need special access rights. To apply for these rights, send a mail to the Q&CE Administrator with:
- your Q&CE account name
- the type of jobs you want to run (e.g. MATLAB, Python, etc.)
To use the cluster, you have to log in to the cluster login node qce-cluster-login.ewi.tudelft.nl with SSH. On the login node you can submit your job to the compute nodes. When a user submits a job to the cluster, the resource manager queues the job and prioritizes it according to the requested resources and the user's recent usage. The job is then scheduled and the resources (e.g. CPUs, memory, time) are allocated. When the requested resources become available, the job is started on the compute nodes. During job execution, the resource manager monitors and enforces the allocated resources.
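For example, from a terminal on your own machine you could log in as follows (replace <netid> with your own Q&CE account name):
ssh <netid>@qce-cluster-login.ewi.tudelft.nl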
Warning
Do not run your production jobs directly on the login node; use sbatch to submit them to the resource manager. You can use the login node for job testing and post-processing.
The compute nodes have the following resources available:
Node | Cores | Threads | Memory | Local Storage | GPU | CPU Speed |
---|---|---|---|---|---|---|
qce-cn[001-004] | 28 | 56 | 192 GB | 1 TB | none | 2.4 GHz |
qce-qn[001-004] | 28 | 56 | 384 GB | 3 TB | K80 (2) | 2.0 GHz |
qce-gpu01 | 32 | 64 | 384 GB | 500 GB | RTX2080Ti (4) | 2.3 GHz |
qce-gpu02 | 32 | 64 | 768 GB | 500 GB | RTX2080Ti (4) | 2.3 GHz |
Four nodes on the cluster are equipped with NVIDIA Tesla K80 accelerator cards, each of which contains two GPUs with 12 GB of memory per GPU. Two nodes are equipped with four NVIDIA RTX2080Ti GPUs, which have 11 GB of memory per GPU.
Cluster Software¶
The Q&CE cluster compute nodes are installed with a minimal set of software and libraries to keep incompatibility problems to a minimum. This means that not all applications will run on the compute nodes. If you need special shared libraries to run your application, you have to provide them yourself and make sure they can be found by the loader. The following application types are supported on the Q&CE cluster:
- MATLAB (see How to run MATLAB in batch mode on the cluster in the FAQ)
- Cadence Spectre with apptainer (see How to run Cadence Spectre on the cluster in the FAQ)
- Xilinx Vivado (see How to run Xilinx Vivado in batch mode on the cluster in the FAQ)
- Synopsys HSpice
Warning
- The cluster does not support interactive applications.
- Applications with heavy IO usage (e.g. Big Data applications) are discouraged from using the Q&CE cluster and will be terminated in case of problems.
Job Specification¶
Users generally specify batch jobs by writing a job script file and submitting it to Slurm with the sbatch command. The sbatch command takes a number of options (some of which can be omitted or defaulted). These options define the requirements of the job, which the scheduler uses to determine what is needed to run your job and to schedule it to run as soon as possible, subject to the constraints of the system, the usage policies and the other users of the cluster. Before you specify a job for batch processing, it is important to know its requirements so that it can run properly on the cluster. Each program and workflow has unique requirements, so we advise that you determine what resources you need before you write your job script.
Keep in mind that while increasing the amount of compute resources you request may decrease the time it takes to run your job, it will also increase the amount of time your job spends waiting in the queue. You may request whatever resources you need but be mindful that other researchers need to be able to use those resources as well. It is strongly recommended that you include at least the following specifications:
- How long the job will run
- The CPU requirements
- The memory requirements
- The partition to run in
More details about Slurm, job submission and sbatch parameters can be found on the official Slurm Documentation site. Also see the man pages for the Slurm commands (e.g. man sbatch).
Warning
All specified job resources are enforced by the resource manager. If your job uses more time or memory than specified, it will be terminated. The number of specified CPUs is the maximum your job will be able to use; the job will not be terminated if it tries to use more.
Creating a job script¶
The most basic parameter given to the sbatch command is the script to run. This must be given on the command line, not inside the script file. The job script must start with a line that specifies the shell under which the script is to run, i.e. the very first line of your script should generally be either #!/bin/tcsh or #!/bin/bash for the tcsh or bash shell, respectively. This line is typically followed by a number of #SBATCH lines specifying the job requirements, and then the actual commands that you wish to have executed when the job is started on the compute nodes. The #SBATCH lines should come BEFORE any non-blank/non-comment lines in your script file; any #SBATCH lines that come after non-blank/non-comment lines may be ignored by the scheduler.
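As an illustration, a minimal job script could look like this (the resource values below are placeholders, not recommendations):
#!/bin/bash
#SBATCH -J example-job # Job name (placeholder)
#SBATCH -t 10:00 # Run time (mm:ss) 10 min
#SBATCH -n 1 # Number of tasks/processes
#SBATCH -c 2 # Number of cpus/threads per task
#SBATCH --mem 1G # use 1GB
# commands executed on the compute node
echo "Running on $(hostname)"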
Note
The sbatch parameters mentioned in the next paragraphs are just a standard set suitable for most jobs. If you want more information about these or other parameters, please visit the Slurm Documentation site.
Specifying the amount of time your job will run¶
It is very important to specify the amount of time you expect your job to take. If you specify a time that is too short, your job will be terminated by the scheduler before it completes, so always add a buffer to account for variability in run times; you do not want your job to be killed when it is 99.9% complete. However, if you specify a time that is too long, you run the risk of having your job sit in the queue for longer than necessary, as the scheduler attempts to find available resources on which to run it.
Parameter | Description |
---|---|
-t <time> | Total run time for the job. Valid time formats are: minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes and days-hours:minutes:seconds |
--time=<time> | |
If you do not specify a time parameter in your job script, you get the default runtime of 1 minute! Since that is not likely to be sufficient for your job to complete, specify a reasonable time. This greatly aids the scheduler in making the best utilization of resources. The maximum runtime for a job is (currently) 7 days.
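For example, either of the following lines (use only one) requests a run time; the values are illustrative, pick what fits your own job:
#SBATCH -t 2:00:00 # Run time (hh:mm:ss) 2 hours
#SBATCH --time=1-12:00:00 # Run time (days-hours:minutes) 1 day and 12 hours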
Specifying the cpu requirements¶
Normally a job consists of a specified number of tasks. You want each task to run on its own set of CPU cores/threads to maximize performance. You can specify the number of tasks and the number of cores/threads needed for each task. If you specify more tasks or CPUs per task than are available in a node, the resource manager will allocate multiple nodes, and your job has to manage the multi-node configuration and communication itself! Most jobs run just one task, and that is the default if not specified. In most cases you only specify the number of CPUs your job will use. Because each CPU core on the cluster has two hardware threads, always select an even number of CPUs.
Parameter | Description |
---|---|
-n <tasks> | Number of concurrent tasks to allocate |
--ntasks=<tasks> | |
-c <ncpus> | Number of cores/threads to allocate for each task (always use an even number with a minimum of 2) |
--cpus-per-task=<ncpus> | |
Note
Slurm sees each hardware thread as a CPU. The total number of hardware threads allocated for your job is <tasks> x <ncpus>. When you want a real core for each task, make sure you always specify an even number for <ncpus>, with a minimum of 2.
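For example, a job with one task that should get 4 real cores (8 hardware threads) could be specified as follows (illustrative values):
#SBATCH -n 1 # Number of tasks/processes
#SBATCH -c 8 # 8 hardware threads = 4 real cores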
Specifying memory requirements¶
If you want to request a specific amount of memory for your job, you can choose between the total amount of memory needed for the job and the amount of memory needed for each CPU/task. The two parameters are mutually exclusive, so pick the one that is best for your job. In most cases the total amount of memory needed for the job is specified.
Parameter | Description |
---|---|
--mem=<size[units]> | Total amount of memory needed for the job. [units] can be [K|M|G|T], default is M |
--mem-per-cpu=<size[units]> | Amount of memory needed for each task/CPU. [units] can be [K|M|G|T], default is M |
If you don’t specify the memory requirements for your job, you get a default value of 256 Megabytes!
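For example, either of the following lines (do not combine them) requests memory, with illustrative values:
#SBATCH --mem=8G # 8GB for the whole job
#SBATCH --mem-per-cpu=2G # 2GB per allocated cpu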
Specifying GPU requirements (optional)¶
If you want to run CUDA-enabled applications on the cluster, you have to specify that you want to use GPU(s). You can specify the type and the number of GPUs your application needs.
Parameter | Description |
---|---|
--gres=gpu[[:type]:count] | GPU type (if applicable) and number of GPUs needed |
Currently the cluster has two types of GPU. If you want a specific type of GPU, set the type accordingly. The types are “K80” and “RTX2080Ti”.
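For example, the following line requests two RTX2080Ti GPUs (the count is illustrative):
#SBATCH --gres=gpu:RTX2080Ti:2 # Two RTX2080Ti GPUs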
Specifying the partition¶
You can specify the partition in which you want your job to run. Partitions are usually groups of nodes which have the same hardware characteristics and/or resource restrictions. Currently the following partitions are available:
Partition | Description |
---|---|
general | Default partition for all users. The maximum runtime is 7 days |
long | Partition for long-running jobs. The maximum runtime is 31 days. Resource constraints: max 4 CPUs and 16 GB memory per job. You can submit a maximum of 20 jobs, of which 10 can run simultaneously |
Parameter | Description |
---|---|
-p <partition_name> | Partition to allocate the job in |
--partition=<partition_name> | |
If you don’t specify a partition, your job will be allocated in the general (default) partition.
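For example, to submit a job to the long partition:
#SBATCH -p long # Use the "long" partition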
Specifying the license requirements¶
Some applications need to check out license(s) before they are able to start. Currently the following applications need to use the license option:
Application | License name | Available licenses |
---|---|---|
Cadence Spectre | spectre | 200 |
Parameter | Description |
---|---|
-L <license>:<num> | Number of licenses from <license> to check out |
--license=<license>:<num> | |
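For example, a Cadence Spectre job that needs two licenses (an illustrative count) could request them as follows:
#SBATCH -L spectre:2 # Check out 2 spectre licenses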
Email events¶
The Slurm scheduler can send an email when certain events related to your job occur, such as the start and end of execution or the occurrence of an error. You need to specify a valid email address and the events you want to be informed about. Multiple events may be specified in a comma-separated list, or you can specify multiple --mail-type lines.
Parameter | Description |
---|---|
--mail-user=<email address> | Email address to send the event emails to |
--mail-type=<event,[event]> | Event(s) to send an email on. Common events are: BEGIN, END, FAIL and ALL |
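For example, to receive a mail when the job starts, ends or fails (the address is a placeholder, use your own):
#SBATCH --mail-user=<your email address> # Email address for event mails
#SBATCH --mail-type=BEGIN,END,FAIL # Events to mail on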
Job name¶
By default, Slurm will use the name of the job script as the job name. If you wish to specify a different name for your job, you can set the job name in your batch script.
Parameter | Description |
---|---|
-J <job_name> | Set the job name |
--job-name=<job_name> | |
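For example (the name is a placeholder):
#SBATCH -J my-simulation # Job name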
Output files¶
By default, Slurm will direct both the “stdout” and “stderr” streams of your job to a file named slurm-<jobnumber>.out in the directory where you executed the sbatch command. You can override this behavior and specify an output file. If you want to send “stderr” output to a different file than “stdout”, you can specify an error file. The file(s) will be created in the directory where you submitted the sbatch command unless you specify a full path in <filespec>.
Parameter | Description |
---|---|
-o <filespec> | Send output to file <filespec>. You can use replacement symbols such as %j (job ID) |
--output=<filespec> | |
-e <filespec> | Send “stderr” output to file <filespec>. You can use replacement symbols such as %j (job ID) |
--error=<filespec> | |
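For example, to write “stdout” and “stderr” to separate files that include the job ID (the file names are illustrative):
#SBATCH -o myjob.%j.out # stdout, %j expands to the job ID
#SBATCH -e myjob.%j.err # stderr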
Working directory¶
The working directory in which your job runs will be the directory from which you ran the sbatch command, unless you specify otherwise. The easiest way to change this behavior is to add the appropriate cd command before any other commands in your job script. You can also specify the working directory with the following parameter.
Parameter | Description |
---|---|
-D <workdir> | Set the working directory of the batch script to <workdir> before it is executed. The path can be specified as a full path or as a path relative to the directory where the sbatch command is executed |
--chdir=<workdir> | |
Note
Make sure your working directory is on a network share which is available to the compute nodes.
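For example, to run the job in a subdirectory of the directory where you submit it (the path is a placeholder):
#SBATCH -D my-project # Working directory, relative to where sbatch is executed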
Environment variables¶
If your job needs certain environment variables set to function properly, it is best to set them in your job script. When the job is executed, Slurm sets a number of environment variables with job information. Some of the common input and output environment variables available within the sbatch job script are listed in the table below. For additional information, see the man page for sbatch.
Environment variable | Definition |
---|---|
$SLURM_JOB_ID | ID of the job allocation |
$SLURM_SUBMIT_DIR | Directory where the job was submitted |
$SLURM_JOB_NODELIST | List of nodes allocated to the job |
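For example, a job script could use these variables to report where it is running (a minimal sketch):
echo "Job $SLURM_JOB_ID was submitted from $SLURM_SUBMIT_DIR"
echo "Allocated nodes: $SLURM_JOB_NODELIST"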
Job script examples¶
Below you will find two job script examples, each with a shell line, #SBATCH parameters and the commands to run the job.
In this example a ‘stress’ test is started which creates 8 processes, uses about 2 GB of memory and has a runtime of 15 minutes. Two hardware threads are allocated per process so that each process gets a whole CPU core (8 x 2 = 16). A mail is sent when the job starts, ends or fails. The job output and errors are sent to a specified output file.
#!/bin/bash
#SBATCH -J stress-test # Job name
#SBATCH -o stress-test.%j.out # Name of output file (%j expands to jobId)
#SBATCH -p general # Use the "general" partition (default)
#SBATCH -n 1 # Number of tasks/processes
#SBATCH -c 16 # We want real cores, so 8 x 2 = 16
#SBATCH -t 16:00 # Run time (mm:ss) 16 min
#SBATCH --mem 3G # use 3GB
#SBATCH --mail-user=<your email address> # Email address for event mails (placeholder)
#SBATCH --mail-type=BEGIN,END,FAIL # Send a mail when the job starts, ends or fails
# cd to working directory
cd ~/stress-test/
# run stress test program with 4 CPU and 4 Memory tasks using about 2GB of memory
# and timeout after 15 minutes
./stress -v -c 4 -m 4 -t 15m
The example below shows a single-process, multithreaded MATLAB batch job. A single MATLAB instance is started which uses a maximum of 6 threads and runs for about 12 minutes. It uses no more than 512 MB of memory. Job output and errors are separated by specifying separate files for “stdout” and “stderr”.
#!/bin/bash
#SBATCH -J matlab-multi # Job name
#SBATCH -o matlab-multi.%j.out # Name of output file (%j expands to jobId)
#SBATCH -e matlab-multi.%j.err # Name of error file (%j expands to jobId)
#SBATCH -p general # Use the "general" partition (default)
#SBATCH -n 1 # Number of tasks/processes
#SBATCH -c 6 # Number of cpus/threads per task
#SBATCH -t 15:00 # Run time (mm:ss) 15 min
#SBATCH --mem 512 # use 512MB
# enable MATLAB
module load matlab
# cd to working directory
cd ~/matlab-test/
# run a MATLAB batch job which uses at most 6 threads and 512MB of memory
matlab -batch comp_multi
Job Control & Monitoring¶
The Slurm resource manager provides many commands to control and monitor jobs on the cluster. These commands are available on the Q&CE cluster login node qce-cluster-login.ewi.tudelft.nl. The table below shows a list of commonly used Slurm commands that are discussed in the following paragraphs.
Command | Description |
---|---|
sbatch | Submits a job script to the cluster |
scancel | Cancels/terminates a job |
squeue | View information about jobs located in the scheduling queue |
sinfo | View information about nodes and partitions |
sjstat | View information about jobs and cluster resources |
scontrol | View specific information about partitions, jobs, licenses etc. |
For a more in-depth description of these and other available commands and command options, see the Slurm Documentation site or use the man pages on the cluster login node: man <command>.
Job submission¶
When you have created a job script, you can submit your job to the Q&CE cluster with the sbatch command:
[somebody@qce-cluster-login ~]$ sbatch stress.job
Submitted batch job 24
When the job is submitted, you can view the job state with the squeue command:
[somebody@qce-cluster-login ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
24 general stress-t somebody R 0:17 1 qce-cn003
In this example, the job has been running for 17 seconds on node qce-cn003.
Terminating a job¶
If you want to terminate a running or queued job, you can execute the scancel command with the <job-id> as a parameter:
[somebody@qce-cluster-login ~]$ scancel 24
[somebody@qce-cluster-login ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
Job 24 is now terminated; there are no more jobs in the queue.
Job Monitoring¶
With the squeue, sinfo and sjstat commands you can monitor the state of your job(s) and check the available resources on the cluster.
The squeue command shows all jobs which are currently queued in the job scheduler. With the -l option you get some more information:
[somebody@qce-cluster-login ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
27 general stress-t somebody R 1:13 1 qce-cn003
26 general stress-t somebody R 1:53 1 qce-cn003
[somebody@qce-cluster-login ~]$ squeue -l
Thu Dec 13 16:54:32 2018
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
27 general stress-t somebody RUNNING 1:21 16:00 1 qce-cn003
26 general stress-t somebody RUNNING 2:01 16:00 1 qce-cn003
Information about the different job states and their reasons can be found in the Slurm documentation and the man page of squeue.
The sinfo command shows the availability and state of the Slurm partitions and nodes. With the -l option you get more detailed information:
[somebody@qce-cluster-login ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
general* up 7-00:00:00 1 mix qce-cn003
general* up 7-00:00:00 1 idle qce-cn004
[somebody@qce-cluster-login ~]$ sinfo -l
Thu Dec 13 16:54:41 2018
PARTITION AVAIL TIMELIMIT JOB_SIZE ROOT OVERSUBS GROUPS NODES STATE NODELIST
general* up 7-00:00:00 1-infinite no NO all 1 mixed qce-cn003
general* up 7-00:00:00 1-infinite no NO all 1 idle qce-cn004
Information about the different node and partition states can be found in the Slurm documentation and the man page of sinfo.
The sjstat command is a script which combines cluster resource and job state information. With the -v option you get more verbose output:
[somebody@qce-cluster-login ~]$ sjstat
Scheduling pool data:
-------------------------------------------------------------
Pool Memory Cpus Total Usable Free Other Traits
-------------------------------------------------------------
general* 193000Mb 56 2 2 1
Running job data:
----------------------------------------------------------------------
JobID User Procs Pool Status Used Master/Other
----------------------------------------------------------------------
27 somebody 16 general R 1:35 qce-cn003
26 somebody 16 general R 2:15 qce-cn003
[somebody@qce-cluster-login ~]$ sjstat -v
Scheduling pool data:
----------------------------------------------------------------------------------
Total Usable Free Node Time Other
Pool Memory Cpus Nodes Nodes Nodes Limit Limit traits
----------------------------------------------------------------------------------
general* 193000Mb 56 2 2 1 UNLIM 7-00:00:00
Running job data:
---------------------------------------------------------------------------------------------------
Time Time Time
JobID User Procs Pool Status Used Limit Started Master/Other
---------------------------------------------------------------------------------------------------
27 somebody 16 general R 1:39 16:00 12-13T16:53:11 qce-cn003
26 somebody 16 general R 2:19 16:00 12-13T16:52:31 qce-cn003
Use the sjstat man page for more options and information.
The scontrol show lic command shows information about the current license usage:
[somebody@qce-cluster-login ~]$ scontrol show lic
LicenseName=spectre
Total=200 Used=0 Free=200 Remote=no
Use the scontrol man page for more information.