# Using GPUs and CUDA on the cluster
This section describes best practices for using GPUs and CUDA framework on the HPC cluster.
## Requesting GPUs for a job
GPUs can be requested in a similar way as other cluster resources.
```
qsub -l nodes=1:ppn=1:gpus=1:shared
```
Additionally, it is possible to specify compute mode for a GPU: 
- `shared`: GPU available for multiple processes 
- `exclusive_process`: only one compute process is allowed to run on the GPU

To request specific GPU model additionally specify the feature parameter: `-l feature=v100`

GPUs available on the cluster “Rudens”:
| GPU model | Arch | CUDA  | FP64 power |	FP16 Tensor power |	Memory | NVLink| qsub feature | 
| --------- | ---- | ----- | ---------- | ------------ | ------- | ------|------------- |
| Tesla K40 |	Kepler | 3.5 | 1.3 Tflops |	N/A |	12 GB |	N/A | k40 |
| Tesla V100 |	Volta |	7.0 |	7.5 Tflops |	112  Tflops |	16 GB | N/A |	v100 |
| A100 | Ampere | 8.0 | 9.7 Tflops | 312 Tflops | 40 GB | Bridge 600 GB/s | a100 |
| L40S | Ada Lovelace | 8.9 | N/A |  362 Tflops | 48 GB | N/A | l40s |

For more details, please see the section “HPC hardware specifications”.

## Development using CUDA Toolkit
CUDA (Compute Unified Device Architecture) is a parallel computing platform and application programming interface (API) model created by NVIDIA. It allows software developers to use a CUDA-enabled graphics processing unit (GPU) for general purpose processing, an approach known as General Purpose GPU (GPGPU) computing. Find more information on the availability of GPUs in the previous section.
### Preparing the Working Environment
You need to load a version of the CUDA library (and compiler). Several versions of the CUDA library are available, list them with `module avail cuda` command. Different versions of CUDA are activated via the ‘module load’ command:
```
module load cuda/cuda-<version>
```

```
module load cuda
```

 CUDA tools/libraries installed on the cluster:
- C++ extensions for CUDA programming
- CuDNN library for neural networks
- CuBLAS
- NCCL

### Usage examples for the cluster
Job examples available here: `/opt/exp_soft/user_info/cuda`.
The following Linux-related graphical tools can be used for programming and debugging: 
`emacs` and `ddd`.
```
module load cuda
emacs hello-world.cu
nvcc -g -G hello-world.cu –o hello-world.out
ddd –debugger cuda-gdb hello-world.out
```
NVIDIA Nsight, a development environment based on Eclipse and development by NVIDIA, can be used.
```
module load cuda
nsight
```
Note. To be able to use GUI tools, X11 forwarding must be performed to provide a remote graphical interface. 

### GPU Code Generation Options
To achieve the best possible performance whilst being portable, GPU code should be generated for the architecture(s) it will be executed upon.

That is controlled by specifying `-gencode` arguments to NVCC which, unlike the `-arch` and `-code` arguments, allows for ‘fatbinary’ executables that are optimised for multiple device architectures.

Each `-gencode` argument requires two values, the virtual architecture and real architecture, for use in NVCC’s two-stage compilation. I.e. `-gencode=arch=compute_60, code=sm_60` specifies a virtual architecture of compute_60 and real architecture `sm_60`.

The minimum specified virtual architecture must be less than or equal to the GPU's Compute Capability used to execute the code.
To build a CUDA application which targets any GPU on HPC cluster “Rudens”, use the following -gencode arguments (for CUDA 8.0):
```
nvcc filename.cu \
   -gencode=arch=compute_35,code=sm_35 \
```