Using GPUs and CUDA on the cluster

This section describes best practices for using GPUs and CUDA framework on the HPC cluster.

Requesting GPUs for a job

GPUs can be requested in much the same way as other cluster resources:

qsub -l nodes=1:ppn=1:gpus=1:shared

Additionally, it is possible to specify the compute mode for a GPU:

  • shared: GPU available for multiple processes

  • exclusive_process: only one compute process is allowed to run on the GPU

To request a specific GPU model, additionally specify the feature parameter: -l feature=v100
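For example, a complete request for one exclusive V100 GPU might look as follows (the job-script name is a placeholder, and the exact syntax may vary with the scheduler configuration):

qsub -l nodes=1:ppn=1:gpus=1:exclusive_process -l feature=v100 job.sh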

GPUs available on the cluster “Rudens”:

GPU model   Arch          Compute cap.  FP64 perf.  FP16 Tensor perf.  Memory  NVLink           qsub feature
Tesla K40   Kepler        3.5           1.3 Tflops  N/A                12 GB   N/A              k40
Tesla V100  Volta         7.0           7.5 Tflops  112 Tflops         16 GB   N/A              v100
A100        Ampere        8.0           9.7 Tflops  312 Tflops         40 GB   Bridge 600 GB/s  a100
L40S        Ada Lovelace  8.9           N/A         362 Tflops         48 GB   N/A              l40s

For more details, please see the section “HPC hardware specifications”.

Development using CUDA Toolkit

CUDA (Compute Unified Device Architecture) is a parallel computing platform and application programming interface (API) model created by NVIDIA. It allows software developers to use a CUDA-enabled graphics processing unit (GPU) for general purpose processing, an approach known as General Purpose GPU (GPGPU) computing. Find more information on the availability of GPUs in the previous section.
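For a quick check of what a job actually sees, the CUDA runtime API can enumerate the visible GPUs. The following is a minimal sketch (illustrative only, not one of the cluster's stock examples):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // Report name, compute capability and memory of each visible GPU
        printf("GPU %d: %s (compute capability %d.%d, %.0f GB)\n",
               i, prop.name, prop.major, prop.minor,
               prop.totalGlobalMem / 1e9);
    }
    return 0;
}

Compile it with nvcc device-query.cu -o device-query.out after loading a CUDA module, as described in the next subsection.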

Preparing the Working Environment

You need to load a version of the CUDA library (and compiler). Several versions of the CUDA library are available; list them with the module avail cuda command. A particular version of CUDA is activated via the module load command (the second form below loads the default version):

module load cuda/cuda-<version>
module load cuda
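For example (the version string below is illustrative; check module avail cuda for what is actually installed):

module avail cuda
module load cuda/cuda-11.2
nvcc --version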

CUDA tools/libraries installed on the cluster:

  • C++ extensions for CUDA programming

  • cuDNN library for neural networks

  • cuBLAS library for dense linear algebra (a usage sketch follows this list)

  • NCCL library for multi-GPU collective communication
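The following is a minimal cuBLAS sketch (illustrative only, error checking omitted for brevity) that computes y = alpha*x + y on the GPU:

#include <cstdio>
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main()
{
    const int n = 4;
    const float alpha = 2.0f;
    float hx[n] = {1, 2, 3, 4};
    float hy[n] = {10, 20, 30, 40};

    // Allocate device buffers and copy the input vectors over
    float *dx, *dy;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, n * sizeof(float), cudaMemcpyHostToDevice);

    // y = alpha * x + y, performed on the GPU by cuBLAS
    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSaxpy(handle, n, &alpha, dx, 1, dy, 1);
    cublasDestroy(handle);

    cudaMemcpy(hy, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)
        printf("%g ", hy[i]);   // expected: 12 24 36 48
    printf("\n");

    cudaFree(dx);
    cudaFree(dy);
    return 0;
}

Link against the library when compiling: nvcc saxpy.cu -o saxpy.out -lcublas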

Usage examples for the cluster

Job examples are available in /opt/exp_soft/user_info/cuda. The following Linux graphical tools can be used for programming and debugging: emacs and ddd.

module load cuda
emacs hello-world.cu
nvcc -g -G hello-world.cu -o hello-world.out
ddd --debugger cuda-gdb hello-world.out
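The hello-world.cu used above could look like this minimal sketch (hypothetical contents; the stock examples live in /opt/exp_soft/user_info/cuda):

#include <cstdio>

// Kernel: each GPU thread prints its global index.
__global__ void hello()
{
    printf("Hello from GPU thread %d\n",
           blockIdx.x * blockDim.x + threadIdx.x);
}

int main()
{
    hello<<<2, 4>>>();        // launch 2 blocks of 4 threads each
    cudaDeviceSynchronize();  // wait for the kernel so its output is flushed
    return 0;
}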

NVIDIA Nsight, a development environment based on Eclipse, can also be used:

module load cuda
nsight

Note. To use GUI tools, X11 forwarding must be enabled so that the remote graphical interface can be displayed locally.
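For example, enable X11 forwarding when connecting to the cluster (the hostname below is a placeholder):

ssh -X username@<login-node>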

GPU Code Generation Options

To achieve the best possible performance whilst being portable, GPU code should be generated for the architecture(s) it will be executed upon.

This is controlled by passing -gencode arguments to NVCC which, unlike the -arch and -code arguments, allow building ‘fatbinary’ executables that are optimised for multiple device architectures.

Each -gencode argument takes two values, a virtual architecture and a real architecture, used in NVCC’s two-stage compilation. For example, -gencode=arch=compute_60,code=sm_60 specifies a virtual architecture of compute_60 and a real architecture of sm_60.

The minimum specified virtual architecture must be less than or equal to the Compute Capability of the GPU that executes the code. To build a CUDA application which targets any GPU on the HPC cluster “Rudens”, use the following -gencode arguments (note that sm_89 requires CUDA 11.8 or newer, while compute_35 support was removed in CUDA 12.0):

nvcc filename.cu \
   -gencode=arch=compute_35,code=sm_35 \
   -gencode=arch=compute_70,code=sm_70 \
   -gencode=arch=compute_80,code=sm_80 \
   -gencode=arch=compute_89,code=sm_89