DGX is NVIDIA infrastructure for deploying applications quickly and scaling them across multiple nodes. Documentation for DGX can be found on NVIDIA's official website.

Activating DGX

To run an application on Hyperion with DGX, load the Anaconda and CUDA modules, then activate the DGX conda environment.

module load python3/anaconda/2020.02
module load cuda/11.1
source activate /work/examples/.conda/dgx
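
To confirm that the activation worked, a quick check along these lines can help. This is a minimal sketch and not part of the original instructions: check_tool is a hypothetical helper name, and the tool names it checks (conda, python, nvcc) are assumptions about what the loaded modules and the environment put on your PATH.

```shell
#!/bin/sh
# Hypothetical helper: report whether a command is available on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Assumed tool names; adjust to what your environment actually provides.
for tool in conda python nvcc; do
    check_tool "$tool" || true
done

# conda sets CONDA_PREFIX to the path of the active environment, if any.
echo "Active conda environment: ${CONDA_PREFIX:-none}"
```

If the environment is active, the last line should print the environment path used in the activation command.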
    
To exit the virtual environment, use the command

source deactivate

Running DGX through a job script

1. Ensure that the virtual environment is available and can be activated by following the steps described above.

2. Create a job script. This repository provides a simple script, job.sh, which demonstrates running a job on DGX and using NVIDIA's GPU commands.

Example DGX script job.sh


#!/bin/bash
#SBATCH --job-name=dgx_test
#SBATCH -N 1 # number of nodes
#SBATCH -n 24    ## number of CPU cores
#SBATCH --gres=gpu:1   ## number of GPUs
#SBATCH --output dgx_%j.out #Output file
#SBATCH --error dgx_%j.err #Error output file
#SBATCH -p dgx_aic #DGX group

#Load desired modules
module load python3/anaconda/2020.02
module load cuda/11.1
source activate /work/examples/.conda/dgx
#The following is a sample script

echo " The host name is"
hostname

echo " The current directory is:"
pwd

echo -e " \nConda environment list\n"
conda list

echo -e "\nCUDA Visible Devices  "
echo "$CUDA_VISIBLE_DEVICES"

echo -e "\nNvidia GPU info\n"
nvidia-smi

#Add your script. Example:
# python ./your_script.py
#See our tutorials for how to run other applications on our clusters.
        

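3. Submit the job using: sbatch job.sh

4. Examine the results. With the job script above, standard output is written to dgx_<jobid>.out and errors to dgx_<jobid>.err, where <jobid> is the job ID assigned at submission.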
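
The submit and examine steps can be sketched from the command line as follows. This is an illustrative sketch under stated assumptions: parse_job_id and result_files are hypothetical helper names, sbatch is assumed to print its usual "Submitted batch job <id>" message, and the file names follow the --output and --error lines of the example job.sh.

```shell
#!/bin/sh
# Hypothetical helper: extract the job ID (the last word) from
# sbatch's "Submitted batch job <id>" message.
parse_job_id() {
    printf '%s\n' "${1##* }"
}

# Hypothetical helper: file names produced by the --output/--error
# patterns in the example job.sh (dgx_%j.out and dgx_%j.err).
result_files() {
    echo "dgx_$1.out dgx_$1.err"
}

# Only attempt a submission where Slurm is available and job.sh exists.
if command -v sbatch >/dev/null 2>&1 && [ -f job.sh ]; then
    if submit_msg=$(sbatch job.sh); then
        job_id=$(parse_job_id "$submit_msg")
        echo "Submitted job $job_id"

        # Show the job's state while it is pending or running.
        squeue -j "$job_id"

        # By default these files appear in the directory the job was submitted from.
        echo "Results will be in: $(result_files "$job_id")"
    fi
fi
```

Once the job has finished, the two files can be read with any pager or with cat.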