CUDA is a parallel computing platform and application programming interface (API) created by NVIDIA that allows software developers to use a CUDA-enabled GPU for general-purpose processing. The CUDA platform is designed to work with programming languages such as C and C++. Documentation for using CUDA can be found on its official website.
Currently, CUDA versions 7.5, 8.0, 9.0, 9.1, 10.0, 10.1, and 11.1 are available on the cluster.
To run CUDA through the command line, first load the CUDA module and create a CUDA source file. To load the module, use the command:
module load cuda
To load a specific version of the module, append the version number to the load command:
module load cuda/8.0
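To see which versions are installed on the cluster, list the available CUDA modules (module avail is a standard command in both Environment Modules and Lmod, the module systems commonly used on clusters):
module avail cuda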
Compile the CUDA file using
nvcc <filename>.cu -o <filename>
where <filename> is the name of the CUDA file. nvcc separates the source code into host and device components: device functions are compiled by the NVIDIA compiler, while host functions are compiled by the standard host compiler (such as gcc). The -o <filename> flag creates an executable called <filename>, which can then be run with ./<filename>.
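As a minimal illustration of this host/device split (this hello-world kernel is not part of the tutorial below; the file name hello.cu is ours):

#include <stdio.h>

// device function: runs on the GPU, handled by the NVIDIA compiler
__global__ void hello(void) {
    printf("Hello from the GPU\n");
}

// host function: runs on the CPU, handled by the host compiler
int main(void) {
    hello<<<1, 1>>>();        // launch the kernel with 1 block of 1 thread
    cudaDeviceSynchronize();  // wait for the kernel to finish before exiting
    return 0;
}

Compiling with nvcc hello.cu -o hello and running ./hello prints the message from the GPU.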
This tutorial was adapted from an NVIDIA CUDA tutorial.
1. Create a CUDA script. This repository provides a simple script, vector_add.cu, which performs vector addition, using blocks and threads to execute in parallel.
#include <stdio.h>
#include <stdlib.h>

// perform vector addition utilizing blocks and threads
__global__ void add(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n) // avoid accessing beyond the end of the arrays
        c[index] = a[index] + b[index];
}

// populate a vector with random ints
void random_ints(int *a, int N) {
    for (int i = 0; i < N; i++) {
        a[i] = rand() % 1000;
    }
}

#define N (2048*2048)          // overall size of the data set
#define THREADS_PER_BLOCK 512  // threads per block

int main(void) {
    int *a, *b, *c;        // host copies of a, b, c
    int *d_a, *d_b, *d_c;  // device copies of a, b, c
    int size = N * sizeof(int);

    // allocate space for device copies of a, b, and c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // allocate space for host copies and set up input values
    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);

    // copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // launch the add() kernel with enough blocks to cover all N elements
    // (rounding up), while the bounds check in the kernel avoids accessing
    // beyond the end of the arrays
    add<<<(N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c, N);

    // copy the result back to the host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // clean up
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
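The example above omits error checking for brevity. In practice, CUDA API calls return a cudaError_t, and kernel-launch errors can be retrieved with cudaGetLastError(); a minimal sketch of that pattern (the CHECK macro is our own name, not part of the tutorial):

#include <stdio.h>
#include <stdlib.h>

// hypothetical helper macro: abort with a readable message if a CUDA call fails
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// usage, e.g. inside main():
//   CHECK(cudaMalloc((void **)&d_a, size));
//   add<<<blocks, THREADS_PER_BLOCK>>>(d_a, d_b, d_c, N);
//   CHECK(cudaGetLastError());        // catch kernel-launch errors
//   CHECK(cudaDeviceSynchronize());   // catch errors raised during execution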
2. Compile the CUDA script using nvcc vector_add.cu -o vector_add
which creates an executable called vector_add.
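Depending on the GPUs in the cluster and the CUDA version loaded, it can also help to compile for a specific architecture with nvcc's -arch flag (sm_70 below is only an example; match it to the actual GPU model):
nvcc -arch=sm_70 vector_add.cu -o vector_add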
3. Prepare the submission script, which is submitted to the Slurm scheduler as a job in order to run the CUDA program. This repository provides the script job.sh as an example.
#!/bin/bash
#SBATCH --job-name=cuda_test
#SBATCH -o cuda_out%j.out
#SBATCH -e cuda_err%j.err
#SBATCH --gres=gpu:1
echo -e '\nsubmitted cuda job'
echo 'hostname'
hostname
# load the cuda module
module load cuda
# recompile the vector_add.cu file
nvcc vector_add.cu -o vector_add
# run the vector_add program
./vector_add
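Depending on how Slurm is configured on the cluster, the job may also need a partition and a time limit, for example (the partition name gpu is hypothetical; check the cluster's documentation):
#SBATCH --partition=gpu
#SBATCH --time=00:10:00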
4. Submit the job using
sbatch job.sh
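While the job is queued or running, check its status with Slurm's squeue command:
squeue -u $USER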
5. Examine the results. Per the #SBATCH directives in job.sh, standard output is written to cuda_out%j.out and standard error to cuda_err%j.err, where %j is replaced by the job ID.
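For example, if the job ID were 12345 (a made-up value), the output could be viewed with:
cat cuda_out12345.out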