Q&CE Cluster

Introduction

The Q&CE cluster is a compute cluster for running large batch jobs that need more resources than are available on your own computer. It consists of a login node, an admin node, and a number of compute nodes connected via an internal network. To manage the resources on the compute nodes, the Q&CE cluster uses the Slurm resource manager (Simple Linux Utility for Resource Management) to handle the queuing, scheduling, and execution of jobs.

Note

If you want to submit jobs to the cluster, you need special access rights. To apply for access, send an email to the Q&CE Administrator with:

  • Your Q&CE account name
  • The type of jobs you want to run (e.g. MATLAB, Python, etc.)

To use the cluster, you have to log in to the cluster login node qce-cluster-login.ewi.tudelft.nl with SSH. On the cluster login node you can submit your job to the compute nodes. When a user submits a job to the cluster, the resource manager queues the job and prioritizes it according to the requested resources and the user's recent usage. The job is then scheduled and the resources (e.g. CPUs, memory, time) are allocated. When the requested resources become available, the job is started on the compute nodes. During job execution, the resource manager monitors and enforces the allocated resources.

Warning

Do not run your production jobs directly on the login node; use sbatch to submit them to the resource manager. You can use the login node for job testing and post-processing.

The compute nodes have the following resources available:

Node             Cores  Threads  Memory  Local Storage  GPU      CPU Speed
qce-cn[001-004]  28     56       192 GB  1 TB           none     2.4 GHz
qce-qn[001-004]  28     56       384 GB  3 TB           K80 (2)  2.0 GHz

Four nodes on the cluster are equipped with NVIDIA Tesla K80 accelerator cards; each K80 contains two GPUs with 12 GB of memory per GPU.

Cluster Software

The Q&CE cluster compute nodes are installed with a minimal set of software and libraries to keep incompatibility problems to a minimum. This means that not all applications will run on the compute nodes. If you need special shared libraries to run your application, you have to provide them yourself and make sure they are found by the loader. The following application types are supported on the Q&CE cluster:

User developed applications
You can run applications you developed yourself on the cluster if your application is statically linked or you provide all the needed shared libraries yourself.
Singularity containers
You can run Singularity containers you created yourself on the cluster (see the Singularity User Guide for more information).
Python applications
The cluster compute nodes have Python 2.7 and Python 3.6 installed with the standard Python libraries. If you need special Python libraries, you can set up a Python virtual environment (see How to setup a python virtual environment in the FAQ) and install all the dependencies yourself.
CUDA (enabled) applications
You can run your own CUDA applications or CUDA-enabled applications, like mumax3, on the cluster.
Commercial applications
The following commercial applications will run in batch mode on the cluster:

Warning

  • The cluster does not support interactive applications.
  • Applications with heavy I/O usage (i.e. Big Data applications) are discouraged from using the Q&CE cluster and will be terminated in case of problems.

Job Specification

Users generally specify batch jobs by writing a job script file and submitting the job to Slurm with the sbatch command. The sbatch command takes a number of options (some of which can be omitted or defaulted). These options define various requirements of the job, which are used by the scheduler to figure out what is needed to run your job, and to schedule it to run as soon as possible, subject to the constraints on the system, usage policies, and considering the other users of the cluster. Before you specify a job for batch processing, it is important to know what the job requirements are so that it can run properly on the cluster. Each program and workflow has unique requirements so we advise that you determine what resources you need before you write your job script.

Keep in mind that while increasing the amount of compute resources you request may decrease the time it takes to run your job, it will also increase the amount of time your job spends waiting in the queue. You may request whatever resources you need but be mindful that other researchers need to be able to use those resources as well. It is strongly recommended that you include at least the following specifications:

  • How long the job will run
  • The cpu requirements
  • The memory requirements
  • The partition to run in

More details about SLURM, job submission and sbatch parameters can be found at the official SLURM Documentation site. Also see the man pages for the SLURM commands (i.e. man sbatch).

Warning

All specified job resources will be enforced by the resource manager. If your job uses more time or memory than specified, it will be terminated. The number of specified CPUs is the maximum your job will be able to use; your job will not be terminated if it tries to use more.

Creating a job script

The most basic parameter given to the sbatch command is the script to run. This must be given on the command line, not inside the script file. The job script must start with a line which specifies the shell under which the script is to run; that is, the very first line of your script should generally be either #!/bin/tcsh or #!/bin/bash for the tcsh or bash shell, respectively. This line is typically followed by a number of #SBATCH lines specifying the job requirements, and then the actual commands that you wish to have executed when the job is started on the compute nodes. The #SBATCH lines should come BEFORE any non-blank/non-comment lines in your script file. Any #SBATCH lines which come after non-blank/non-comment lines might get ignored by the scheduler.
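The structure described above can be sketched as a minimal job script; the job name, resource values, and commands are just illustrative examples:

```shell
#!/bin/bash

#SBATCH -J my-job               # Job name (example)
#SBATCH -t 10:00                # Run time (mm:ss): 10 minutes
#SBATCH -n 1                    # One task
#SBATCH -c 2                    # Two hardware threads (one real core)
#SBATCH --mem 1G                # 1 GB of memory

# The actual commands come after the #SBATCH lines
echo "Job started on $(hostname)"
```

Save this as, for example, my-job.sh and submit it on the login node with: sbatch my-job.sh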

Note

The sbatch parameters mentioned in the next paragraphs are just a standard set suitable for most jobs. If you want more information about these or other parameters, please visit the SLURM Documentation site.

Specifying the amount of time your job will run

It is very important to specify the amount of time you expect your job to take. If you specify a time that is too short, your job will be terminated by the scheduler before it completes, so always add a buffer to account for variability in run times; you do not want your job to be killed when it is 99.9% complete. However, if you specify a time that is too long, your job may sit in the queue longer than necessary while the scheduler attempts to find available resources on which to run it.

Parameter      Description
-t <time>      Total run time for the job. Valid <time> formats are:
--time=<time>    • minutes
                 • minutes:seconds
                 • hours:minutes:seconds
                 • days-hours
                 • days-hours:minutes
                 • days-hours:minutes:seconds

If you do not specify a time parameter in your jobscript, you get the default runtime of 1 minute! Since that is not likely to be sufficient for your job to complete, specify a reasonable time. This greatly aids the scheduler in making the best utilization of resources. The maximum runtime for a job is (currently) 7 days.
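The lines below illustrate some of the time formats; the values are only examples, and a real job script should contain exactly one time specification:

```shell
#SBATCH -t 30                   # 30 minutes
#SBATCH -t 4:30:00              # 4 hours and 30 minutes (hours:minutes:seconds)
#SBATCH --time=2-12             # 2 days and 12 hours (days-hours)
```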

Specifying the cpu requirements

Normally a job consists of a specified number of tasks. You want each task to run on its own CPU cores/threads to maximize performance. You can specify the number of tasks and the number of cores/threads needed for each task. If you specify more tasks or CPUs per task than are available in a single node, the resource manager will allocate multiple nodes, and your job has to manage the multi-node configuration and communication itself! Most jobs run just one task, and that is the default if not specified. In most cases you specify the number of CPUs your job will use. Because each CPU core on the cluster has two hardware threads, always select an even number of CPUs.

Parameter                Description
-n <tasks>               Number of concurrent tasks to allocate
--ntasks=<tasks>
-c <ncpus>               Number of cores/threads to allocate for each task
--cpus-per-task=<ncpus>  (always use an even number with a minimum of 2)

Note

Slurm sees each hardware thread as a CPU. The total number of hardware threads allocated for your job is <tasks> x <ncpus>. When you want a real core for each task, make sure you always specify an even number for <ncpus>, with a minimum of 2.
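As a sketch, a job running 4 tasks with one real core (two hardware threads) per task would allocate 4 x 2 = 8 hardware threads in total:

```shell
#SBATCH -n 4                    # 4 concurrent tasks
#SBATCH -c 2                    # 2 hardware threads (1 real core) per task
```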

Specifying memory requirements

If you want to request a specific amount of memory for your job, you can choose between the total amount of memory needed for the job or the amount of memory needed for each cpu/task. Both parameters are mutually exclusive so pick the one that’s best for your job. In most cases the total amount of memory needed for the job is specified.

Parameter                    Description
--mem=<size[units]>          Total amount of memory needed for the job.
                             [units] can be [K|M|G|T], default is M
--mem-per-cpu=<size[units]>  Amount of memory needed for each task/cpu.
                             [units] can be [K|M|G|T], default is M

If you don’t specify the memory requirements for your job, you get a default value of 256 Megabytes!
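The two forms are illustrated below with example values; since the parameters are mutually exclusive, a real job script should contain only one of them:

```shell
#SBATCH --mem=4G                # 4 GB in total for the job
#SBATCH --mem-per-cpu=512       # or: 512 MB per allocated CPU (units default to M)
```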

Specifying GPU requirements (optional)

If you want to run CUDA (enabled) applications on the cluster, you have to specify that you want to use GPU(s). You can specify the type and number of GPUs your application needs.

Parameter                  Description
--gres=gpu[[:type]:count]  GPU type (if applicable) and number of GPUs needed

Currently the cluster has only one type of GPU, so type is not necessary.
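For example, to request two GPUs on a node (the count is illustrative):

```shell
#SBATCH --gres=gpu:2            # request 2 GPUs; no type needed on this cluster
```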

Specifying the partition

You can specify the partition in which you want your job to run. Partitions are usually a group of nodes which have the same hardware characteristics and/or resource restrictions. Currently the following partitions are available:

Partition  Description
general    Default partition for all users. The maximum runtime is 7 days.
long       Partition for long running jobs. The maximum runtime is 31 days.
           Resource constraints: max 4 CPUs and 16 GB of memory per job. You can
           submit a maximum of 20 jobs, of which 10 can run simultaneously.

Parameter                     Description
-p <partition_name>           Partition to allocate the job in
--partition=<partition_name>

If you don’t specify a partition, your job will be allocated in the general (default) partition.
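For example, to run a long job outside the default partition:

```shell
#SBATCH -p long                 # run in the "long" partition instead of "general"
```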

Specifying the license requirements

Some applications need to check out license(s) before they are able to start. Currently the following applications need to use the license option:

Application      License name  Available licenses
Cadence Spectre  spectre       200

Parameter                   Description
-L <license>:<num>          Number of licenses from <license> to check out
--licenses=<license>:<num>
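For example, a job that runs Cadence Spectre and needs a single license would include:

```shell
#SBATCH -L spectre:1            # check out 1 Cadence Spectre license
```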

Email events

The SLURM scheduler can send an email when certain events related to your job occur, such as the start and end of execution or an error. You need to specify a valid email address and the events you want to be informed of. Multiple events may be specified in a comma-separated list, or you can specify multiple --mail-type lines.

Parameter                    Description
--mail-user=<email address>  Email address to send the event emails to
--mail-type=<event[,event]>  Event(s) to send an email on. Common events are:
                               • BEGIN when the job starts to execute
                               • END when the job completes
                               • FAIL if and when the job fails
                               • REQUEUE if and when the job is requeued
                               • ALL for all of the above cases
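For example, to be mailed when your job completes or fails:

```shell
#SBATCH --mail-user=<your email address>   # replace with your own address
#SBATCH --mail-type=END,FAIL               # mail when the job completes or fails
```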

Job name

By default, Slurm will use the name of the job script as the job name. If you wish to specify a different name for your job, you can set the job name in your batch script.

Parameter              Description
-J <job_name>          Set the job name
--job-name=<job_name>

Output files

By default, slurm will direct both the “stdout” and “stderr” streams for your job to a file named slurm-<jobnumber>.out in the directory where you executed the sbatch command. You can override this behavior and specify an output file. If you want to send “stderr” output to a different file from “stdout”, you can specify an error file. The file(s) will be created in the directory where you submitted the sbatch command unless you specify a full path in <filespec>.

Parameter            Description
-o <filespec>        Send output to file <filespec>.
--output=<filespec>  You can use the following replacement symbols:
                       • %u : your username
                       • %j : the job allocation number
-e <filespec>        Send “stderr” output to file <filespec>.
--error=<filespec>   You can use the following replacement symbols:
                       • %u : your username
                       • %j : the job allocation number
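For example, with the replacement symbols a job named "myjob" (the name is illustrative) submitted by user somebody as job 24 would write the files somebody-myjob.24.out and somebody-myjob.24.err:

```shell
#SBATCH -o %u-myjob.%j.out      # "stdout" file: username and job number in the name
#SBATCH -e %u-myjob.%j.err      # send "stderr" to a separate file
```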

Working directory

The working directory in which your job runs will be the directory from which you ran the sbatch command, unless you specify otherwise. The easiest way to change this behavior is to add the appropriate cd command before any other commands in your job script. You can also specify the working directory with the following parameter.

Parameter          Description
-D <workdir>       Set the working directory of the batch script to <workdir> before it is
--chdir=<workdir>  executed. The path can be specified as a full path or as a path relative
                   to the directory where the sbatch command is executed.

Note

Make sure your working directory is on a network share which is available to the compute nodes.
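For example, with a hypothetical project directory on a network share:

```shell
#SBATCH --chdir=/home/somebody/myproject   # example path; use your own network share
```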

Environment variables

If your job needs certain environment variables set to function properly, it's best to set them in your job script. When the job is executed, Slurm sets a number of environment variables with job information. Some of the common input and output environment variables available within the sbatch job script are listed in the chart below. For additional information, see the man page for sbatch.

Environment variable  Definition
$SLURM_JOB_ID         ID of the job allocation
$SLURM_SUBMIT_DIR     Directory where the job was submitted
$SLURM_JOB_NODELIST   List of nodes allocated to the job
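A sketch of how these variables can be used inside a job script; the fallback values only make the snippet safe to run outside Slurm:

```shell
#!/bin/bash
#SBATCH -J env-demo
#SBATCH -t 5:00

# These variables are set by Slurm when the job runs on a compute node
echo "Job ID:     ${SLURM_JOB_ID:-not set}"
echo "Submit dir: ${SLURM_SUBMIT_DIR:-not set}"
echo "Node list:  ${SLURM_JOB_NODELIST:-not set}"
```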

Job script examples

Below you will find two job script examples with a shell script line, #SBATCH parameters, and the commands to run the job.

In this case a ‘stress’ test will be started which creates 8 processes, uses about 2 GB of memory, and has a runtime of 15 minutes. The number of CPUs per task is set to 2 to allocate whole CPU cores for the processes. An email will be sent when the job starts and when it ends or fails. The job output and errors are sent to a specified output file.

#!/bin/bash

#SBATCH -J stress-test                 # Job name
#SBATCH -o stress-test.%j.out          # Name of output file (%j expands to jobId)
#SBATCH -p general                     # Use the "general" partition (default)
#SBATCH -n 1                           # Number of tasks/processes
#SBATCH -c 16                          # We want real cores, so 8 x 2 = 16
#SBATCH -t 16:00                       # Run time (mm:ss) 16 min
#SBATCH --mem 3G                       # use 3GB
#SBATCH --mail-user=<your email address>   # Email address for the event emails
#SBATCH --mail-type=BEGIN,END,FAIL     # Mail on job start, completion and failure

# cd to working directory
cd ~/stress-test/
# run stress test program with 4 CPU and 4 Memory tasks using about 2GB of memory
# and timeout after 15 minutes
./stress -v -c 4 -m 4 -t 15m

The example below shows a single-process, multithreaded MATLAB batch job. A single MATLAB instance is started which will use a maximum of 6 threads and runs for about 12 minutes. It uses no more than 512 MB of memory. Job output and errors are separated by specifying separate files for “stdout” and “stderr”.

#!/bin/bash

#SBATCH -J matlab-multi                # Job name
#SBATCH -o matlab-multi.%j.out         # Name of output file (%j expands to jobId)
#SBATCH -e matlab-multi.%j.err         # Name of error file (%j expands to jobId)
#SBATCH -p general                     # Use the "general" partition (default)
#SBATCH -n 1                           # Number of tasks/processes
#SBATCH -c 6                           # Number of cpus/threads per task
#SBATCH -t 15:00                       # Run time (mm:ss) 15 min
#SBATCH --mem 512                      # use 512MB

# enable MATLAB
module load matlab
# cd to working directory
cd ~/matlab-test/
# run a MATLAB batch job which uses at most 6 threads and 512MB of memory
matlab -batch comp_multi

Job Control & Monitoring

The Slurm resource manager provides a number of commands to control and monitor jobs on the cluster. These commands are available on the Q&CE cluster login node qce-cluster-login.ewi.tudelft.nl. The table below shows a list of commonly used Slurm commands that will be discussed in the following paragraphs.

Command   Description
sbatch    Submits job scripts to the cluster
scancel   Cancels/terminates a job
squeue    View information about jobs in the scheduling queue
sinfo     View information about nodes and partitions
sjstat    View information about jobs and cluster resources
scontrol  View specific information about partitions, jobs, licenses, etc.

For a more in-depth description of these and other available commands and command options, see the SLURM Documentation site or use the man pages on the cluster login node: man <command>.

Job submission

When you have created a job script, you can submit your job to the Q&CE cluster with the sbatch command:

[somebody@qce-cluster-login ~]$ sbatch stress.job
Submitted batch job 24

When the job is submitted, you can view the job state with the squeue command:

[somebody@qce-cluster-login ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                24   general stress-t somebody  R       0:17      1 qce-cn003

In this example, the job has been running for 17 seconds on node qce-cn003.

Terminating a job

If you want to terminate a running or queued job you can execute the scancel command with the <job-id> as a parameter:

[somebody@qce-cluster-login ~]$ scancel 24
[somebody@qce-cluster-login ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

Job 24 is now terminated, there are no more jobs in the queue.

Job Monitoring

With the squeue, sinfo and sjstat commands you can monitor the state of your job(s) and check the available resources on the cluster.

The squeue command shows all jobs which are currently queued in the job scheduler. With the -l option you get some more information:

[somebody@qce-cluster-login ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                27   general stress-t somebody  R       1:13      1 qce-cn003
                26   general stress-t somebody  R       1:53      1 qce-cn003
[somebody@qce-cluster-login ~]$ squeue -l
Thu Dec 13 16:54:32 2018
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
                27   general stress-t somebody  RUNNING       1:21     16:00      1 qce-cn003
                26   general stress-t somebody  RUNNING       2:01     16:00      1 qce-cn003

Information about the different job states and their reasons can be found in the Slurm documentation and the man page of squeue.

The sinfo command shows the availability and state of the Slurm partitions and nodes. With the -l option you get more detailed information:

[somebody@qce-cluster-login ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
general*     up 7-00:00:00      1    mix qce-cn003
general*     up 7-00:00:00      1   idle qce-cn004
[somebody@qce-cluster-login ~]$ sinfo -l
Thu Dec 13 16:54:41 2018
PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT OVERSUBS     GROUPS  NODES       STATE NODELIST
general*     up 7-00:00:00 1-infinite   no       NO        all      1       mixed qce-cn003
general*     up 7-00:00:00 1-infinite   no       NO        all      1        idle qce-cn004

Information about the different node and partition states can be found in the Slurm documentation and the man page of sinfo.

The sjstat command is a script which combines cluster resource and job state information. With the -v option you get more verbose output:

[somebody@qce-cluster-login ~]$ sjstat

Scheduling pool data:
-------------------------------------------------------------
Pool        Memory  Cpus  Total Usable   Free  Other Traits
-------------------------------------------------------------
general*   193000Mb    56      2      2      1

Running job data:
----------------------------------------------------------------------
JobID    User      Procs Pool      Status        Used  Master/Other
----------------------------------------------------------------------
27       somebody     16 general   R             1:35  qce-cn003
26       somebody     16 general   R             2:15  qce-cn003

[somebody@qce-cluster-login ~]$ sjstat -v

Scheduling pool data:
----------------------------------------------------------------------------------
                           Total  Usable   Free   Node   Time      Other
Pool         Memory  Cpus  Nodes   Nodes  Nodes  Limit  Limit      traits
----------------------------------------------------------------------------------
general*    193000Mb    56      2       2      1  UNLIM 7-00:00:00

Running job data:
---------------------------------------------------------------------------------------------------
                                                 Time        Time            Time
JobID    User      Procs Pool      Status        Used       Limit         Started  Master/Other
---------------------------------------------------------------------------------------------------
27       somebody     16 general   R             1:39       16:00  12-13T16:53:11  qce-cn003
26       somebody     16 general   R             2:19       16:00  12-13T16:52:31  qce-cn003

Use the sjstat man pages for more options and information.

The scontrol show lic command shows information about the current license usage:

[somebody@qce-cluster-login ~]$ scontrol show lic
LicenseName=spectre
    Total=200 Used=0 Free=200 Remote=no

Use the scontrol man pages for more information.