Jobs on GPU nodes

GPU nodes on GRICAD clusters

For simulations that require GPU cards, several types of nodes are available:

  • The Froggy cluster has 9 nodes with two Kepler K20m GPUs each. These nodes are called frogkepler.

  • The Bigfoot cluster, dedicated to GPU computing:

    • Nodes with 4 NVIDIA Tesla V100 GPUs, 32 GB of RAM per GPU.

    • Nodes with 2 NVIDIA A100 GPUs, 32 GB of RAM per GPU.

    • Nodes with T4 GPUs, called Virgo, only available at night!

Using the Bigfoot cluster

The “Bigfoot” cluster is dedicated to computations that need co-processors, currently GPGPUs, to deliver their computing power. Access follows the usual pattern: from the GRICAD SSH bastions, connect to the bigfoot front-end:

login@trinity:~$ ssh bigfoot
Linux bigfoot 4.19.0-18-amd64 #1 SMP Debian 4.19.208-1 (2021-09-29) x86_64
                       Welcome to Bigfoot cluster! 

                             :            :
                            .'            :
                        _.-"              :
                    _.-"                  '.
    ..__...____...-"                       :
   : \_\                                    :
   :    .--"                                 :
   `.__/  .-" _                               :
      /  /  ," ,-                            .'
     (_)(`,(_,'L_,_____       ____....__   _.'
      "' "             """""""          """   

GPU, GPU, GPU, ... ;-)

            Type 'chandler' to get cluster status  
          Type 'recap.py' to get cluster properties

Sample OAR submissions: 
  # Get a A100 GPU and all associated cpu and memory resources:
  oarsub -l /nodes=1/gpu=1 --project test -p "gpumodel='A100'" "nvidia-smi -L"
  # Get a MIG partition of an A100 on a devel node, to make some tests
  oarsub -l /nodes=1/gpu=1/migdevice=1 --project test -t devel "nvidia-smi -L"

Last login: Mon Jan 10 17:37:43 2022 from 129.88.178.43
login@bigfoot:~$ 

The chandler command gives an overview of the available resources and their current state.

Several GPU models are available. The recap.py command gives up-to-date information about the hardware configuration of the different nodes, in particular the model and the number of GPUs available in each node:

login@bigfoot:~$ recap.py 
 ================================================================================
|   node  | cpumodel       | gpumodel  | gpus | cpus | cores| mem | mem/gpu |MIG|
 ================================================================================
|bigfoot1 | intel Gold 6130| V100      |   4  |   2  |   32 | 192 |   96  |  NO |
|    [ + 1  more node(s) ]                                                      |
|bigfoot3 | intel Gold 6130| V100      |   4  |   2  |   32 | 192 |   96  |  NO |
|bigfoot4 | intelGold 5218R| V100      |   4  |   2  |   40 | 192 |   96  |  NO |
|    [ + 1  more node(s) ]                                                      |
|bigfoot6 | intelGold 5218R| V100      |   4  |   2  |   40 | 192 |   96  |  NO |
|bigfoot7 | amd   EPYC 7452| A100      |   2  |   2  |   64 | 192 |   96  | YES |
|bigfoot8 | intelGold 5218R| V100      |   4  |   2  |   40 | 192 |   48  |  NO |
|bigfoot9 | amd   EPYC 7452| A100      |   2  |   2  |   64 | 192 |   96  |  NO |
|    [ + 2  more node(s) ]                                                      |
|bigfoot12| amd   EPYC 7452| A100      |   2  |   2  |   64 | 192 |   96  |  NO |
|virgo1   | intel      vcpu| T4        |   1  |   1  |    2 |   4 |    4  |  NO |
|    [ + 33 more node(s) ]                                                      |
|virgo35  | intel      vcpu| T4        |   1  |   1  |    2 |   4 |    4  |  NO |
 ================================================================================
 # of GPUS: 10 A100, 28 V100, 35 T4
                                       
login@bigfoot:~$ 

The nodes, apart from the Virgo T4 nodes, are interconnected via the same low-latency Omni-Path network as the Dahu cluster.

The usual storage spaces are available from the front-end and all nodes:

  • /home: a space dedicated to the Bigfoot cluster, to clearly distinguish environments
  • /bettik: a high-performance capacitive workspace, shared with the Dahu and Luke clusters
  • /silenus: a temporary ultra-high-performance scratch space (SSD/NVMe), shared with the Dahu cluster (not yet available on the Virgo nodes, coming soon)
  • Mantis: cloud storage managed with iRODS

Software environment

The classical Nix and Guix environments are available and shared with the Dahu and Luke clusters, as is the specific application directory /applis.

To install the libraries commonly used in GPU computing, you can use the predefined conda environments.

To use NVIDIA GPUs, you will also need to source the appropriate CUDA toolkit. You can use the following script, passing the name of the cluster and the desired toolkit version. Example for toolkit version 11.2:

user@bigfoot2:~$ source /applis/environments/cuda_env.sh bigfoot 11.2

Submitting a job

To launch a job, we use the OAR resource manager (whose essential commands are described on this page), and more particularly the oarsub command.

The particularity of the Bigfoot cluster is that the resource unit to request is usually a GPU. The other resources of the compute nodes (CPU cores and memory) have been distributed and associated with the GPUs according to the hardware configuration of the nodes, which is heterogeneous.

It is recommended (but not mandatory) to specify the GPU model you want to obtain on your compute nodes, using the OAR gpumodel property.

The following example gives the minimum options to submit a job requiring a single GPU on a node equipped with NVIDIA A100 GPUs:

oarsub -l /nodes=1/gpu=1 -p "gpumodel='A100'" ...

OAR will also allocate, on a pro-rata basis, a certain number of general-purpose computing cores (cpu-cores) and a certain amount of memory.

This other example job will get 2 Nvidia A100 GPUs on the same node, and the associated cpu and memory resources:

oarsub -l /nodes=1/gpu=2 -p "gpumodel='A100'" ...
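For non-interactive runs, the same request can be written as a batch script using `#OAR` directives and submitted with `oarsub -S`. A minimal sketch, assuming a project named `test` (the script name and body are placeholders):

```shell
#!/bin/bash
# Hypothetical job script (submit with: oarsub -S ./gpu_job.sh).
# The #OAR lines carry the same options as the command line above.
#OAR -l /nodes=1/gpu=2
#OAR -p gpumodel='A100'
#OAR --project test

echo "Job started"
# ... launch the GPU application here, e.g.: nvidia-smi -L
```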

OAR allocates GPU units according to the gpudevice property. To find out which units are allocated, use the oarprint command once on the node (interactively):

user@bigfoot2:~$ oarprint gpudevice
1
0

The same command also shows which CPU cores are associated with each gpudevice:

user@bigfoot8:~$ oarprint -P gpudevice,cpuset core
0 3
1 12
1 18
0 9
0 1
1 17
1 14
0 5
1 10
1 16
0 8
0 4
0 6
1 15
0 7
1 11
0 2
1 13
0 0
1 19

Here, we obtained 20 compute cores and 2 GPUs, each GPU being associated with 10 cores, whose ranks within the compute node are listed.
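To see at a glance how many cores each GPU received, the oarprint output can be summarized with awk. A minimal sketch, using a few sample lines in place of the real oarprint output:

```shell
# Count cores per gpudevice from "oarprint -P gpudevice,cpuset core"
# style output (first column: gpudevice, second column: core rank).
# The sample below stands in for: oarprint -P gpudevice,cpuset core
sample="0 3
1 12
1 18
0 9"
echo "$sample" \
  | awk '{cores[$1]++} END {for (g in cores) print "gpu " g ": " cores[g] " cores"}' \
  | sort
# prints:
# gpu 0: 2 cores
# gpu 1: 2 cores
```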

To find out how much memory has been allocated, we can query the job's cgroup as follows:

user@bigfoot8:~$ cat /dev/oar_cgroups_links/memory/`cat /proc/self/cpuset`/memory.limit_in_bytes
100693262336
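The limit is reported in bytes; it can be converted to GiB with plain shell arithmetic (the value below is the one from the sample above):

```shell
# Convert the cgroup memory limit from bytes to GiB.
# On a node you would read the value from the cgroup file instead:
#   limit_bytes=$(cat /dev/oar_cgroups_links/memory/$(cat /proc/self/cpuset)/memory.limit_in_bytes)
limit_bytes=100693262336
echo "$(( limit_bytes / 1024 / 1024 / 1024 )) GiB"   # prints: 93 GiB
```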

This amount of memory varies with the number of GPUs obtained on the node and the characteristics of the node. The mem_per_gpu property (in GB) describes the configuration of the nodes and lets you constrain job submission on the amount of memory needed for the job. For example:

oarsub -l /nodes=1/gpu=2 -p "gpumodel='A100' and mem_per_gpu > 64" ...

For NVIDIA GPUs, the nvidia-smi command gives information about the accessible GPUs. It also shows that OAR only grants access to the GPUs that have been allocated:

user@bigfoot:~$ oarsub -l /nodes=1/gpu=2 -p "gpumodel='V100'" --project test -I
[ADMISSION RULE] Set default walltime to 7200.
[ADMISSION RULE] Modify resource description with type constraints
OAR_JOB_ID=293
Interactive mode: waiting...
Starting...
Connect to OAR job 293 via the node bigfoot8
user@bigfoot8:~$ nvidia-smi -L
GPU 0: Tesla V100-SXM2-32GB (UUID: GPU-263f55be-a11c-81be-8af6-e948471cb954)
GPU 1: Tesla V100-SXM2-32GB (UUID: GPU-9e4e2b6c-19ea-73ef-7026-00619f988787)
user@bigfoot8:~$ logout
Connection to bigfoot8 closed.
Disconnected from OAR job 293.
user@bigfoot:~$ oarsub -l /nodes=1/gpu=1 -p "gpumodel='A100'" --project test -I
[ADMISSION RULE] Set default walltime to 7200.
[ADMISSION RULE] Modify resource description with type constraints
OAR_JOB_ID=294
Interactive mode: waiting...
Starting...
Connect to OAR job 294 via the node bigfoot12
user@bigfoot12:~$ nvidia-smi -L
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-45d882aa-be45-3db7-bd3a-06da9fcaf3b1)
user@bigfoot12:~$ logout

Limitations

The maximum walltime is 48 hours. This is necessary to allow a reasonable turnover and fair sharing of resources, and to facilitate maintenance operations and machine updates. If your jobs need a longer execution time, you must set up or use checkpointing within your applications (i.e. your applications must be able to save their state to files, so that a later run can reload those files and restart from that state).
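The checkpoint requirement can be illustrated with a minimal, generic sketch (the state file and the loop are illustrative, not a GRICAD mechanism): the job records its progress in a file after each step, and a resubmitted job resumes from that file instead of starting over.

```shell
#!/bin/bash
# Minimal checkpoint/restart sketch: resume from a state file if it
# exists, so a job killed at the walltime can be resubmitted and
# continue where it left off.
STATE_FILE=checkpoint.state

step=0
[ -f "$STATE_FILE" ] && step=$(cat "$STATE_FILE")   # restore saved state

while [ "$step" -lt 5 ]; do
    # ... one unit of real computation would go here ...
    step=$((step + 1))
    echo "$step" > "$STATE_FILE"                    # checkpoint after each step
done
echo "finished at step $step"
```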

Sandboxing: the devel mode

The special -t devel submission mode is available to facilitate job tuning. It is limited to jobs of 30 minutes maximum and allocates resources on nodes dedicated to this mode, making these sandbox resources much more readily available than production resources.

But be careful: this sandbox runs on smaller GPUs, which are in fact partitions of NVIDIA A100 GPUs. Each devel-mode GPU is actually a sub-GPU, referred to here, by abuse of language, as a mig (MIG = Multi-Instance GPU). Resource selection is therefore a bit different: you have to specify the number of migdevice resources you want. In general, request only one, as it does not make sense to work on several:

oarsub -l /nodes=1/gpu=1/migdevice=1,walltime=00:10:00 -t devel ...

“Virgo” virtual nodes: T4 GPU for opportunistic computing

The Bigfoot cluster also hosts “virtual” nodes with small NVIDIA T4 GPUs. During the day, these GPUs are used by students for their training; they are made available by the Fall project.

At night, starting at midnight, the virtual machines are switched on, each one being allocated a physical GPU. The virgo nodes then become active in the Bigfoot cluster, offering NVIDIA T4 resources to jobs waiting for this gpumodel. The virtual machines are shut down at 6am, so jobs must be short enough (walltime ending before 6am) to be allowed on these resources.

oarsub -l /nodes=1/gpu=1,walltime=04:00:00 -p "gpumodel='T4'" ...

Using Dahu GPU nodes

The software environment

To install the libraries commonly used in GPU computing, you can use the predefined conda environments. The procedure is described here.

To use GPUs, you will also need to source the appropriate CUDA toolkit.

To do this, you can use the following script, passing the name of the cluster and the desired toolkit version.

In the following example, the script loads CUDA toolkit 10.1 on the Dahu cluster:

user@bigfoot2:~$ source /applis/environments/cuda_env.sh dahu 10.1

Starting a job on GPU nodes

To launch a job, we always use the OAR resource manager (whose essential commands are described on this page), and more specifically the oarsub command.

In practice, to launch a job on GPU nodes, we must specify when calling the oarsub command that the type of our job will be gpu (-t gpu).

To request a GPU and the 8 associated CPU cores on Dahu:

oarsub -t gpu -l /nodes=1/gpudevice=1 ...

To request 2 GPUs on the same node (and thus 16 CPU cores):

oarsub -t gpu -l /nodes=1/gpudevice=2 ...

OAR allocates GPU units according to the gpudevice property. To find out which units are allocated, use the oarprint command once on the node (interactively):

user@bigfoot2:~$ oarprint gpudevice
1
0

Using Froggy GPU nodes

The software environment

Due to the age of the machine, the software environment with the classical libraries for these nodes is set up with the module command.

Running a job on GPU nodes

To launch a job, we always use the OAR resource manager (whose essential commands are described on this page), and more specifically the oarsub command.

In practice, to launch a job on GPU nodes, we have to specify when calling oarsub that the type of our job is gpu (-t gpu). It is also possible to specify the number of GPUs we want for our job using the gpu property (e.g. -l /nodes=1/gpu=2).

There is a difference between Dahu and Froggy in the property used to reserve a number of GPUs: on Dahu, you have to use the gpudevice property; on Froggy, gpu.

Here is an example of an interactive session that reserves two GPUs, loads the CUDA environment and compiles a CUDA utility:

[bouttier@froggy1 ~] oarsub -I --project test -l /nodes=1/gpu=2 -t gpu
[ADMISSION RULE] Set default walltime to 1800.
[ADMISSION RULE] Modify resource description with type constraints
[COMPUTE TYPE] Setting compute=NO
[GPUNODE] Adding gpu node restriction
OAR_JOB_ID=349170
Interactive mode : waiting...
Starting...
Connect to OAR job 349170 via the node frogkepler3
[bouttier@frogkepler3 ~]$ source /applis/site/env.bash
[bouttier@frogkepler3 ~]$ module load cuda/6.5
[bouttier@frogkepler3 ~]$ rsync -a $cuda_SAMPLES_DIR .
[bouttier@frogkepler3 ~]$ cd cuda-samples/NVIDIA_CUDA-6.5_Samples/1_Utilities/deviceQuery
[bouttier@frogkepler3 deviceQuery]$ make
g++ -m64  -I/opt/cuda/5.0//include -I. -I.. -I../../common/inc -o deviceQuery.o -c deviceQuery.cpp
g++ -m64 -o deviceQuery deviceQuery.o -L/opt/cuda/5.0//lib64 -lcuda -lcudart
mkdir -p ../../bin/linux/release
cp deviceQuery ../../bin/linux/release
[bouttier@frogkepler3 deviceQuery]$ ./deviceQuery
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "Tesla K20m"
 CUDA Driver Version / Runtime Version          5.0 / 5.0
 CUDA Capability Major/Minor version number:    3.5
 Total amount of global memory:                 4800 MBytes (5032706048 bytes)
 (13) Multiprocessors x (192) CUDA Cores/MP:    2496 CUDA Cores
 GPU Clock rate:                                706 MHz (0.71 GHz)
 Memory Clock rate:                             2600 Mhz
 Memory Bus Width:                              320-bit
 L2 Cache Size:                                 1310720 bytes
 Max Texture Dimension Size (x,y,z)             1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
 Max Layered Texture Size (dim) x layers        1D=(16384) x 2048, 2D=(16384,16384) x 2048
 Total amount of constant memory:               65536 bytes
 Total amount of shared memory per block:       49152 bytes
 Total number of registers available per block: 65536
 Warp size:                                     32
 Maximum number of threads per multiprocessor:  2048
 Maximum number of threads per block:           1024
 Maximum sizes of each dimension of a block:    1024 x 1024 x 64
 Maximum sizes of each dimension of a grid:     2147483647 x 65535 x 65535
 Maximum memory pitch:                          2147483647 bytes
 Texture alignment:                             512 bytes
 Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
 Run time limit on kernels:                     No
 Integrated GPU sharing Host Memory:            No
 Support host page-locked memory mapping:       Yes
 Alignment requirement for Surfaces:            Yes
 Device has ECC support:                        Enabled
 Device supports Unified Addressing (UVA):      Yes
 Device PCI Bus ID / PCI location ID:           2 / 0
 Compute Mode:
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "Tesla K20m"
 CUDA Driver Version / Runtime Version          5.0 / 5.0
 CUDA Capability Major/Minor version number:    3.5
 Total amount of global memory:                 4800 MBytes (5032706048 bytes)
 (13) Multiprocessors x (192) CUDA Cores/MP:    2496 CUDA Cores
 GPU Clock rate:                                706 MHz (0.71 GHz)
 Memory Clock rate:                             2600 Mhz
 Memory Bus Width:                              320-bit
 L2 Cache Size:                                 1310720 bytes
 Max Texture Dimension Size (x,y,z)             1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
 Max Layered Texture Size (dim) x layers        1D=(16384) x 2048, 2D=(16384,16384) x 2048
 Total amount of constant memory:               65536 bytes
 Total amount of shared memory per block:       49152 bytes
 Total number of registers available per block: 65536
 Warp size:                                     32
 Maximum number of threads per multiprocessor:  2048
 Maximum number of threads per block:           1024
 Maximum sizes of each dimension of a block:    1024 x 1024 x 64
 Maximum sizes of each dimension of a grid:     2147483647 x 65535 x 65535
 Maximum memory pitch:                          2147483647 bytes
 Texture alignment:                             512 bytes
 Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
 Run time limit on kernels:                     No
 Integrated GPU sharing Host Memory:            No
 Support host page-locked memory mapping:       Yes
 Alignment requirement for Surfaces:            Yes
 Device has ECC support:                        Enabled
 Device supports Unified Addressing (UVA):      Yes
 Device PCI Bus ID / PCI location ID:           132 / 0
 Compute Mode:
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.0, CUDA Runtime Version = 5.0, NumDevs = 2, Device0 = Tesla K20m, Device1 = Tesla K20m