For simulations that require GPU cards to run, multiple nodes are available:
The Bigfoot cluster, dedicated to GPU computing:
Nodes with 4 NVIDIA Tesla V100 GPUs, 32 GB of RAM per GPU.
Nodes with 2 NVIDIA A100 GPUs, 40 GB of RAM per GPU.
Nodes with 4 AMD MI210/XGMI GPUs, 64 GB of RAM per GPU.
Nodes with T4 GPUs, called Virgo, only available at night!
The “Bigfoot” cluster is dedicated to calculations requiring the use of nodes delivering computing power through co-processors, currently GPGPUs. Access is done in the usual way, from the Gricad SSH bastions, to the bigfoot front-end:
login@trinity:~$ ssh bigfoot
Linux bigfoot 4.19.0-18-amd64 #1 SMP Debian 4.19.208-1 (2021-09-29) x86_64
Welcome to Bigfoot cluster!
: :
.' :
_.-" :
_.-" '.
..__...____...-" :
: \_\ :
: .--" :
`.__/ .-" _ :
/ / ," ,- .'
(_)(`,(_,'L_,_____ ____....__ _.'
"' " """"""" """
GPU, GPU, GPU, ... ;-)
Type 'chandler' to get cluster status
Type 'recap.py' to get cluster properties
Sample OAR submissions:
# Get a A100 GPU and all associated cpu and memory resources:
oarsub -l /nodes=1/gpu=1 --project test -p "gpumodel='A100'" "nvidia-smi -L"
# Get a MIG partition of an A100 on a devel node, to make some tests
oarsub -l /nodes=1/gpu=1/migdevice=1 --project test -t devel "nvidia-smi -L"
Last login: Mon Jan 10 17:37:43 2022 from 129.88.178.43
login@bigfoot:~$
The chandler command gives an overview of the available resources and their instantaneous state.
Several GPU models are available. The recap.py command gives up-to-date information about the hardware configuration of the different nodes, in particular the model and the number of GPUs available inside the nodes:
login@bigfoot:~$ recap.py
================================================================================
| node | cpumodel | gpumodel | gpus | cpus | cores| mem | mem/gpu |MIG|
================================================================================
|bigfoot1 | intel Gold 6130| V100 | 4 | 2 | 32 | 192 | 96 | NO |
| [ + 1 more node(s) ] |
|bigfoot3 | intel Gold 6130| V100 | 4 | 2 | 32 | 192 | 96 | NO |
|bigfoot4 | intelGold 5218R| V100 | 4 | 2 | 40 | 192 | 96 | NO |
| [ + 1 more node(s) ] |
|bigfoot6 | intelGold 5218R| V100 | 4 | 2 | 40 | 192 | 96 | NO |
|bigfoot7 | amd EPYC 7452| A100 | 2 | 2 | 64 | 192 | 96 | YES |
|bigfoot8 | intelGold 5218R| V100 | 4 | 2 | 40 | 192 | 48 | NO |
|bigfoot9 | amd EPYC 7452| A100 | 2 | 2 | 64 | 192 | 96 | NO |
| [ + 2 more node(s) ] |
|bigfoot12| amd EPYC 7452| A100 | 2 | 2 | 64 | 192 | 96 | NO |
|virgo1 | intel vcpu| T4 | 1 | 1 | 2 | 4 | 4 | NO |
| [ + 33 more node(s) ] |
|virgo35 | intel vcpu| T4 | 1 | 1 | 2 | 4 | 4 | NO |
================================================================================
# of GPUS: 10 A100, 28 V100, 35 T4
login@bigfoot:~$
The nodes, apart from the Virgo T4 nodes, are interconnected via the same low-latency Omnipath network as the Dahu cluster.
The usual storage spaces are available from the front-end and all nodes:
The classical NIX and GUIX environments are available and shared with the Dahu and Luke clusters, as well as the specific application directory /applis.
To install the libraries commonly used in GPU computing, you can use the predefined conda environments.
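For example, a minimal way to list and activate one of these environments is sketched below; the shared initialization script path /applis/environments/conda.sh (as on the other GRICAD clusters) and the environment name are assumptions, to be adapted to what is actually installed:
# Load the shared conda installation (path assumed, as on the other GRICAD clusters)
source /applis/environments/conda.sh
# List the predefined environments, then activate one (the name here is illustrative)
conda env list
conda activate my-gpu-env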
To use NVIDIA GPUs, you will also need to source the appropriate CUDA toolkit. You can use the following script by passing the name of the cluster and the desired version of the toolkit. Example for the toolkit version 11.7:
user@bigfoot:~$ source /applis/environments/cuda_env.sh 11.7
You can also list all available CUDA toolkits on the cluster using the cuda_env.sh script:
user@bigfoot:~$ source /applis/environments/cuda_env.sh -l
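Once a toolkit has been sourced, a quick way to check that it is active is to query the nvcc compiler, assuming the script adds the toolkit's bin directory to your PATH; the reported release should match the requested version:
user@bigfoot:~$ source /applis/environments/cuda_env.sh 11.7
user@bigfoot:~$ which nvcc
user@bigfoot:~$ nvcc --version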
To launch a job, we use the OAR resource manager (whose essential commands are described on this page), and more particularly the oarsub command.
The particularity of the Bigfoot cluster is that the resource unit to request is usually a GPU. The other resources of the compute nodes (cpu-cores and memory) have been distributed and associated with each GPU according to the hardware configuration of the nodes, which is heterogeneous.
It is recommended (but not mandatory) to specify the GPU model you want to obtain on your compute nodes, using the OAR gpumodel property.
The following example gives the minimum options to submit a job requiring a single GPU on a node that has Nvidia A100 GPUs:
oarsub -l /nodes=1/gpu=1 -p "gpumodel='A100'" ...
OAR will also allocate, on a pro-rata basis, a certain number of general-purpose computing cores (cpu-cores) and a certain amount of memory.
This other example job will get 2 Nvidia A100 GPUs on the same node, and the associated cpu and memory resources:
oarsub -l /nodes=1/gpu=2 -p "gpumodel='A100'" ...
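Following the same logic, a request along these lines should reserve a whole V100 node (its 4 GPUs and, consequently, all of its cores and memory); this is a sketch, to be completed with your project and walltime options:
oarsub -l /nodes=1/gpu=4 -p "gpumodel='V100'" ...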
OAR allocates GPU units according to the gpudevice property. To find out which units have been allocated, use the oarprint command once on the node (interactively):
user@bigfoot2:~$ oarprint gpudevice
1
0
This same command also lets you see which cpu-cores are associated with each gpudevice:
user@bigfoot8:~$ oarprint -P gpudevice,cpuset core
0 3
1 12
1 18
0 9
0 1
1 17
1 14
0 5
1 10
1 16
0 8
0 4
0 6
1 15
0 7
1 11
0 2
1 13
0 0
1 19
Here, we have obtained 20 compute cores and 2 GPUs, each GPU being associated with 10 cores whose ranks within the compute node are listed.
To know the amount of core memory allocated, we can query the cgroup of the job in the following way:
user@bigfoot8:~$ cat /dev/oar_cgroups_links/memory/`cat /proc/self/cpuset`/memory.limit_in_bytes
100693262336
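The value is reported in bytes; a small shell sketch to read it in GiB (the variable name is only for readability, and the value shown here corresponds to about 93 GiB):
user@bigfoot8:~$ LIMIT_FILE=/dev/oar_cgroups_links/memory/`cat /proc/self/cpuset`/memory.limit_in_bytes
user@bigfoot8:~$ echo $(( $(cat $LIMIT_FILE) / 1024 / 1024 / 1024 )) GiB
93 GiB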
This amount of memory varies depending on the number of GPUs obtained on the node and the characteristics of the node. The mem_per_gpu property (in GB) lets you know the configuration of the nodes and put constraints, at submission time, on the amount of core memory needed for the job. For example:
oarsub -l /nodes=1/gpu=2 -p "gpumodel='A100' and mem_per_gpu > 64" ...
For Nvidia GPUs, the nvidia-smi command provides information about the accessible GPUs. It also shows that OAR only allows access to the GPUs that have been allocated:
user@bigfoot:~$ oarsub -l /nodes=1/gpu=2 -p "gpumodel='V100'" --project test -I
[ADMISSION RULE] Set default walltime to 7200.
[ADMISSION RULE] Modify resource description with type constraints
OAR_JOB_ID=293
Interactive mode: waiting...
Starting...
Connect to OAR job 293 via the node bigfoot8
user@bigfoot8:~$ nvidia-smi -L
GPU 0: Tesla V100-SXM2-32GB (UUID: GPU-263f55be-a11c-81be-8af6-e948471cb954)
GPU 1: Tesla V100-SXM2-32GB (UUID: GPU-9e4e2b6c-19ea-73ef-7026-00619f988787)
user@bigfoot8:~$ logout
Connection to bigfoot8 closed.
Disconnected from OAR job 293.
user@bigfoot:~$ oarsub -l /nodes=1/gpu=1 -p "gpumodel='A100'" --project test -I
[ADMISSION RULE] Set default walltime to 7200.
[ADMISSION RULE] Modify resource description with type constraints
OAR_JOB_ID=294
Interactive mode: waiting...
Starting...
Connect to OAR job 294 via the node bigfoot12
user@bigfoot12:~$ nvidia-smi -L
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-45d882aa-be45-3db7-bd3a-06da9fcaf3b1)
user@bigfoot12:~$ logout
The maximum walltime is 48 hours. This is necessary to allow a reasonable turnover for a fair sharing of resources. It is also necessary to facilitate maintenance operations and machine updates. If your jobs need a longer execution time, you must set up or use checkpoint features within your applications (i.e. your applications must be able to save their state in files in order to restart later on this state by loading these files).
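As an illustration only (not a GRICAD-endorsed recipe): the OAR --checkpoint option asks the scheduler to send a signal to the job a given number of seconds before its walltime expires (SIGUSR2 by default, configurable with --signal), so that a checkpoint-aware application, or a wrapper script trapping that signal, can save its state before being killed. The wrapper name below is hypothetical:
# Ask OAR to send SIGUSR2 to the job 600 seconds before the walltime is reached
oarsub -l /nodes=1/gpu=1,walltime=48:00:00 --project test --checkpoint 600 ./my_checkpointable_app.sh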
The submission of a CPU job using a script is explained on this page. Submitting a GPU job is done in the same way, except for the content of the dump.sh script, which is as follows:
#!/bin/bash
#OAR -n Hello_World
#OAR -l /nodes=1/gpu=1,walltime=00:01:30
#OAR -p gpumodel='A100'
#OAR --stdout hello_world.out
#OAR --stderr hello_world.err
#OAR --project test
cd /bettik/bouttier/
/bin/hostname >> dumb.txt
In this example, we request a job on an A100 GPU.
#OAR -p gpumodel='A100'
Unlike interactive submission, in a submission script, -p gpumodel='A100' is not written in quotes.
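The script is then made executable and passed to oarsub with the -S option, so that the #OAR directives it contains are taken into account:
chmod +x dump.sh
oarsub -S ./dump.sh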
AMD GPUs are operated via the “amdgpu” and “rocm” layers.
To reach these specific nodes, you need to add the amd job type, for example:
user@bigfoot:~$ oarsub -t amd -l /nodes=1/gpu=2 --project test -I
or as directives in a script:
#!/bin/bash
#OAR -n Hello_World
#OAR -l /nodes=1/gpu=1,walltime=00:01:30
#OAR -p gpumodel='MI210'
#OAR -t amd
#OAR --stdout hello_world.out
#OAR --stderr hello_world.err
#OAR --project test
For the environment, we recommend the following NIX shell, which provides the rocm, opencl and openmpi utilities, including drivers for exploiting XGMI links. You can copy/paste this entire section into an interactive job on your node:
source /applis/site/nix.sh
NIX_PATH="nixpkgs=channel:nixos-23.11" nix-shell -p nur.repos.gricad.openmpi4 -p nur.repos.gricad.ucx -p rocmPackages.rocm-smi -p clinfo -p rocm-opencl-runtime -p rocm-opencl-icd
Or run your program passively by prefixing it with the nix-shell as follows in a script:
source /applis/site/nix.sh
export NIX_PATH="nixpkgs=channel:nixos-23.11"
nix-shell --command <./your_program> -p nur.repos.gricad.openmpi4 -p nur.repos.gricad.ucx -p rocmPackages.rocm-smi -p clinfo -p rocm-opencl-runtime -p rocm-opencl-icd
Here’s a more complete example, allowing you to configure OpenCL locally, interactively:
$ source /applis/site/nix.sh
$ NIX_PATH="nixpkgs=channel:nixos-23.11" nix-shell -p nur.repos.gricad.openmpi4 -p nur.repos.gricad.ucx -p rocmPackages.rocm-smi -p clinfo -p rocm-opencl-runtime -p rocm-opencl-icd
[nix-shell:~]$ mkdir -p ~/.local/etc/OpenCL/vendors
[nix-shell:~]$ echo `nix eval --raw nixpkgs.rocm-opencl-runtime.outPath`/lib/libamdocl64.so > ~/.local/etc/OpenCL/vendors/amdocl64.icd
[nix-shell:~]$ export OCL_ICD_VENDORS=~/.local/etc/OpenCL/vendors/amdocl64.icd
[nix-shell:~]$ clinfo
Number of platforms 1
Platform Name AMD Accelerated Parallel Processing
[...]
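From the same shell, you can also check the AMD GPUs allocated to the job with rocm-smi (provided by the rocmPackages.rocm-smi package listed above); the exact output depends on the node and on your allocation:
[nix-shell:~]$ rocm-smi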
Test of the IDEFIX application developed with Kokkos (https://kokkos.org/). The Kokkos C++ EcoSystem is a solution for writing modern C++ applications, a programming model for performance and portability.
Compilation of the IDEFIX code in a nix shell:
user@bigfoot:~$ cd idefix-bench/benchmark
user@bigfoot:~/idefix-bench/benchmark$ export NIX_PATH="nixpkgs=channel:nixpkgs-unstable"
user@bigfoot:~/idefix-bench/benchmark$ . ./sourceMeFirst.sh
user@bigfoot:~/idefix-bench/benchmark$ nix-shell -p nur.repos.gricad.openmpi4 -p nur.repos.gricad.ucx -p rocm-smi -p hip -p cmake
The cmake options to compile for AMD HIP are “-DKokkos_ENABLE_HIP=ON -DKokkos_ARCH_VEGA90A=ON” (for AMD Mi200/Mi250X). The HIP C++ compiler command is “hipcc”.
user@bigfoot:~/idefix-bench/benchmark$ cmake $IDEFIX_DIR -DKokkos_ENABLE_HIP=ON -DKokkos_ARCH_VEGA90A=ON -DCMAKE_CXX_COMPILER=hipcc
user@bigfoot:~/idefix-bench/benchmark$ make
user@bigfoot:~/idefix-bench/benchmark$ oarsub -I -lnodes=1 --project admin -t amd
user@bigfoot14:~/idefix-bench/benchmark$ . /applis/site/nix.sh
user@bigfoot14:~/idefix-bench/benchmark$ mpirun -np 2 --mca btl '^openib' -x UCX_TLS=sm,self,rocm_copy,rocm_ipc --mca pml ucx -x UCX_RNDV_THRESH=128 --mca osc ucx ./idefix
devel mode
The special -t devel submission mode is available to facilitate job tuning. It is limited to jobs of 2 hours maximum, and allocates resources on nodes dedicated to this mode. This makes sandbox resources much more readily available than production resources.
But be careful: this sandbox runs on smaller GPUs, which are in fact partitions of Nvidia A100 GPUs. Each Nvidia A100 GPU in devel mode is in fact a sub-GPU, which we will call mig by abuse of language (MIG = Multi Instance GPU). The selection of resources is therefore a bit different: you have to specify the number of migdevice you want. In general we only ask for one, because it does not make sense to work on several:
oarsub -l /nodes=1/gpu=1/migdevice=1,walltime=00:10:00 -t devel ...
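For example, a short interactive devel job on one MIG partition, in line with the sample submissions shown in the login banner, can be obtained like this:
oarsub -l /nodes=1/gpu=1/migdevice=1,walltime=00:30:00 --project test -t devel -I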
The bigfoot cluster also hosts “virtual” nodes with small Nvidia T4 GPUs. These GPUs are used during the day by students for their training. They are made available by the Fall project.
At night, from midnight, the virtual machines are switched on, each being allocated a physical GPU. The virgo nodes then become active in the Bigfoot cluster, offering Nvidia T4 resources to the jobs waiting for this gpumodel. The virtual machines are shut down at 6am. Jobs must be sufficiently short (their walltime must allow them to finish before 6am) to be allowed on these resources.
oarsub -l /nodes=1/gpu=1,walltime=04:00:00 -p "gpumodel='T4'" ...
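A corresponding submission script would look like the A100 example above, simply swapping the gpumodel and keeping a walltime that fits in the midnight-to-6am window; this is a sketch, and the job name and output file names are illustrative:
#!/bin/bash
#OAR -n night_job
#OAR -l /nodes=1/gpu=1,walltime=04:00:00
#OAR -p gpumodel='T4'
#OAR --stdout night_job.out
#OAR --stderr night_job.err
#OAR --project test
# List the T4 GPU allocated to the job
nvidia-smi -L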