Jobs on GPU nodes

GPU nodes on GRICAD clusters

For simulations that require GPU cards, several types of nodes are available:

  • The Froggy cluster has 9 nodes with two Kepler K20m GPUs each. These nodes are called frogkepler.

  • The Bigfoot cluster, dedicated to GPU computing:

    • Nodes with 4 NVIDIA Tesla V100 GPUs, 32 GB of RAM per GPU.

    • Nodes with 2 NVIDIA A100 GPUs, 32 GB of RAM per GPU.

    • Nodes with T4 GPUs, called Virgo, only available at night!

Using the Bigfoot cluster

The “Bigfoot” cluster is dedicated to computations that need co-processors, currently GPGPUs, to deliver their computing power. Access follows the usual pattern: from the GRICAD SSH bastions, connect to the bigfoot front-end:

login@trinity:~$ ssh bigfoot
Linux bigfoot 4.19.0-18-amd64 #1 SMP Debian 4.19.208-1 (2021-09-29) x86_64
                       Welcome to Bigfoot cluster! 

                             :            :
                            .'            :
                        _.-"              :
                    _.-"                  '.
    ..__...____...-"                       :
   : \_\                                    :
   :    .--"                                 :
   `.__/  .-" _                               :
      /  /  ," ,-                            .'
     (_)(`,(_,'L_,_____       ____....__   _.'
      "' "             """""""          """   

GPU, GPU, GPU, ... ;-)

            Type 'chandler' to get cluster status  
          Type 'recap.py' to get cluster properties

Sample OAR submissions: 
  # Get a A100 GPU and all associated cpu and memory resources:
  oarsub -l /nodes=1/gpu=1 --project test -p "gpumodel='A100'" "nvidia-smi -L"
  # Get a MIG partition of an A100 on a devel node, to make some tests
  oarsub -l /nodes=1/gpu=1/migdevice=1 --project test -t devel "nvidia-smi -L"

Last login: Mon Jan 10 17:37:43 2022 from 129.88.178.43
login@bigfoot:~$ 

The chandler command gives an overview of the available resources and their current state.

Several GPU models are available. The recap.py command gives up-to-date information about the hardware configuration of the different nodes, in particular the model and the number of GPUs available in each node:

login@bigfoot:~$ recap.py 
 ================================================================================
|   node  | cpumodel       | gpumodel  | gpus | cpus | cores| mem | mem/gpu |MIG|
 ================================================================================
|bigfoot1 | intel Gold 6130| V100      |   4  |   2  |   32 | 192 |   96  |  NO |
|    [ + 1  more node(s) ]                                                      |
|bigfoot3 | intel Gold 6130| V100      |   4  |   2  |   32 | 192 |   96  |  NO |
|bigfoot4 | intelGold 5218R| V100      |   4  |   2  |   40 | 192 |   96  |  NO |
|    [ + 1  more node(s) ]                                                      |
|bigfoot6 | intelGold 5218R| V100      |   4  |   2  |   40 | 192 |   96  |  NO |
|bigfoot7 | amd   EPYC 7452| A100      |   2  |   2  |   64 | 192 |   96  | YES |
|bigfoot8 | intelGold 5218R| V100      |   4  |   2  |   40 | 192 |   48  |  NO |
|bigfoot9 | amd   EPYC 7452| A100      |   2  |   2  |   64 | 192 |   96  |  NO |
|    [ + 2  more node(s) ]                                                      |
|bigfoot12| amd   EPYC 7452| A100      |   2  |   2  |   64 | 192 |   96  |  NO |
|virgo1   | intel      vcpu| T4        |   1  |   1  |    2 |   4 |    4  |  NO |
|    [ + 33 more node(s) ]                                                      |
|virgo35  | intel      vcpu| T4        |   1  |   1  |    2 |   4 |    4  |  NO |
 ================================================================================
 # of GPUS: 10 A100, 28 V100, 35 T4
                                       
login@bigfoot:~$ 

The nodes, apart from the Virgo T4 nodes, are interconnected via the same low-latency Omni-Path network as the Dahu cluster.

The usual storage spaces are available from the front-end and all nodes:

  • /home: a space dedicated to the Bigfoot cluster, to clearly distinguish environments
  • /bettik: a high-performance capacitive workspace, shared with the Dahu and Luke clusters
  • /silenus: a temporary ultra-high-performance scratch space (SSD/NVMe), shared with the Dahu cluster (not yet available on the Virgo nodes, coming soon)
  • Mantis: cloud storage managed with iRODS

Software environment

The classical Nix and Guix environments are available and shared with the Dahu and Luke clusters, as is the specific application directory /applis.

To install the libraries commonly used in GPU computing, you can use the predefined conda environments.

To use NVIDIA GPUs, you will also need to source the appropriate CUDA toolkit. You can use the following script, passing the name of the cluster and the desired toolkit version. Example for toolkit version 11.2:

user@bigfoot2:~$ source /applis/environments/cuda_env.sh bigfoot 11.2

Submitting a job

To launch a job, we use the OAR resource manager (whose essential commands are described on this page), and more particularly the oarsub command.

The particularity of the Bigfoot cluster is that the resource unit to request is usually a GPU. The other resources of the compute nodes (CPU cores and memory) have been distributed and associated with the GPUs according to the hardware configuration of the nodes, which is heterogeneous.

It is recommended (but not mandatory) to specify the GPU model you want to obtain on your compute nodes, using the OAR gpumodel property.

The following example gives the minimum options to submit a job requiring a single GPU on a node equipped with NVIDIA A100 GPUs:

oarsub -l /nodes=1/gpu=1 -p "gpumodel='A100'" ...

OAR will also allocate, on a pro-rata basis, a certain number of general-purpose computing cores (cpu-cores) and a certain amount of memory.

This other example job will get 2 Nvidia A100 GPUs on the same node, and the associated cpu and memory resources:

oarsub -l /nodes=1/gpu=2 -p "gpumodel='A100'" ...
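For non-interactive runs, the same request can be written as a batch script using `#OAR` directives and submitted with `oarsub -S`. A minimal sketch, assuming a project named `test` (the script name and body are placeholders):

```shell
#!/bin/bash
# Hypothetical job script (submit with: oarsub -S ./gpu_job.sh).
# The #OAR lines carry the same options as the command line above.
#OAR -l /nodes=1/gpu=2
#OAR -p gpumodel='A100'
#OAR --project test

echo "Job started"
# ... launch the GPU application here, e.g.: nvidia-smi -L
```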

OAR allocates GPU units according to the gpudevice property. To find out which units are allocated, use the oarprint command once on the node (interactively):

user@bigfoot2:~$ oarprint gpudevice
1
0

The same command also shows which CPU cores are associated with each gpudevice:

user@bigfoot8:~$ oarprint -P gpudevice,cpuset core
0 3
1 12
1 18
0 9
0 1
1 17
1 14
0 5
1 10
1 16
0 8
0 4
0 6
1 15
0 7
1 11
0 2
1 13
0 0
1 19

Here, we obtained 20 compute cores and 2 GPUs, each GPU being associated with 10 cores, whose ranks within the compute node are listed.
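To see at a glance how many cores each GPU received, the oarprint output can be summarized with awk. A minimal sketch, using a few sample lines in place of the real oarprint output:

```shell
# Count cores per gpudevice from "oarprint -P gpudevice,cpuset core"
# style output (first column: gpudevice, second column: core rank).
# The sample below stands in for: oarprint -P gpudevice,cpuset core
sample="0 3
1 12
1 18
0 9"
echo "$sample" \
  | awk '{cores[$1]++} END {for (g in cores) print "gpu " g ": " cores[g] " cores"}' \
  | sort
# prints:
# gpu 0: 2 cores
# gpu 1: 2 cores
```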

To find out how much memory has been allocated, we can query the job's cgroup as follows:

user@bigfoot8:~$ cat /dev/oar_cgroups_links/memory/`cat /proc/self/cpuset`/memory.limit_in_bytes
100693262336
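The limit is reported in bytes; it can be converted to GiB with plain shell arithmetic (the value below is the one from the sample above):

```shell
# Convert the cgroup memory limit from bytes to GiB.
# On a node you would read the value from the cgroup file instead:
#   limit_bytes=$(cat /dev/oar_cgroups_links/memory/$(cat /proc/self/cpuset)/memory.limit_in_bytes)
limit_bytes=100693262336
echo "$(( limit_bytes / 1024 / 1024 / 1024 )) GiB"   # prints: 93 GiB
```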

This amount of memory varies with the number of GPUs obtained on the node and the characteristics of the node. The mem_per_gpu property (in GB) describes the configuration of the nodes and lets you constrain job submission on the amount of memory needed for the job. For example:

oarsub -l /nodes=1/gpu=2 -p "gpumodel='A100' and mem_per_gpu > 64" ...

For NVIDIA GPUs, the nvidia-smi command gives information about the accessible GPUs. It also shows that OAR only grants access to the GPUs that have been allocated:

user@bigfoot:~$ oarsub -l /nodes=1/gpu=2 -p "gpumodel='V100'" --project test -I
[ADMISSION RULE] Set default walltime to 7200.
[ADMISSION RULE] Modify resource description with type constraints
OAR_JOB_ID=293
Interactive mode: waiting...
Starting...
Connect to OAR job 293 via the node bigfoot8
user@bigfoot8:~$ nvidia-smi -L
GPU 0: Tesla V100-SXM2-32GB (UUID: GPU-263f55be-a11c-81be-8af6-e948471cb954)
GPU 1: Tesla V100-SXM2-32GB (UUID: GPU-9e4e2b6c-19ea-73ef-7026-00619f988787)
user@bigfoot8:~$ logout
Connection to bigfoot8 closed.
Disconnected from OAR job 293.
user@bigfoot:~$ oarsub -l /nodes=1/gpu=1 -p "gpumodel='A100'" --project test -I
[ADMISSION RULE] Set default walltime to 7200.
[ADMISSION RULE] Modify resource description with type constraints
OAR_JOB_ID=294
Interactive mode: waiting...
Starting...
Connect to OAR job 294 via the node bigfoot12
user@bigfoot12:~$ nvidia-smi -L
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-45d882aa-be45-3db7-bd3a-06da9fcaf3b1)
user@bigfoot12:~$ logout

Limitations

The maximum walltime is 48 hours. This is necessary to allow a reasonable turnover and fair sharing of resources, and to facilitate maintenance operations and machine updates. If your jobs need a longer execution time, you must set up or use checkpointing within your applications (i.e. your applications must be able to save their state to files, so that a later run can reload those files and restart from that state).
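The checkpoint requirement can be illustrated with a minimal, generic sketch (the state file and the loop are illustrative, not a GRICAD mechanism): the job records its progress in a file after each step, and a resubmitted job resumes from that file instead of starting over.

```shell
#!/bin/bash
# Minimal checkpoint/restart sketch: resume from a state file if it
# exists, so a job killed at the walltime can be resubmitted and
# continue where it left off.
STATE_FILE=checkpoint.state

step=0
[ -f "$STATE_FILE" ] && step=$(cat "$STATE_FILE")   # restore saved state

while [ "$step" -lt 5 ]; do
    # ... one unit of real computation would go here ...
    step=$((step + 1))
    echo "$step" > "$STATE_FILE"                    # checkpoint after each step
done
echo "finished at step $step"
```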

Sandboxing: the devel mode

The special -t devel submission mode is available to facilitate job tuning. It is limited to jobs of 30 minutes maximum and allocates resources on nodes dedicated to this mode, making these sandbox resources much more readily available than production resources.

But be careful: this sandbox runs on smaller GPUs, which are in fact partitions of NVIDIA A100 GPUs. Each devel-mode GPU is actually a sub-GPU, referred to here, by abuse of language, as a mig (MIG = Multi-Instance GPU). Resource selection is therefore a bit different: you have to specify the number of migdevice resources you want. In general, request only one, as it does not make sense to work on several:

oarsub -l /nodes=1/gpu=1/migdevice=1,walltime=00:10:00 -t devel ...

“Virgo” virtual nodes: T4 GPU for opportunistic computing

The Bigfoot cluster also hosts “virtual” nodes with small NVIDIA T4 GPUs. During the day, these GPUs are used by students for their training; they are made available by the Fall project.

At night, starting at midnight, the virtual machines are switched on, each one being allocated a physical GPU. The virgo nodes then become active in the Bigfoot cluster, offering NVIDIA T4 resources to jobs waiting for this gpumodel. The virtual machines are shut down at 6am, so jobs must be short enough (walltime ending before 6am) to be allowed on these resources.

oarsub -l /nodes=1/gpu=1,walltime=04:00:00 -p "gpumodel='T4'" ...

Using Dahu GPU nodes

The software environment

To install the libraries commonly used in GPU computing, you can use the predefined conda environments. The procedure is described here.

To use GPUs, you will also need to source the appropriate CUDA toolkit.

To do this, you can use the following script, passing the name of the cluster and the desired toolkit version.

In the following example, the script loads CUDA toolkit 10.1 on the Dahu cluster:

user@bigfoot2:~$ source /applis/environments/cuda_env.sh dahu 10.1

Starting a job on GPU nodes

To launch a job, we always use the OAR resource manager (whose essential commands are described on this page), and more specifically the oarsub command.

In practice, to launch a job on GPU nodes, we must specify when calling the oarsub command that the type of our job will be gpu (-t gpu).

To request a GPU and the 8 associated CPU cores on Dahu:

oarsub -t gpu -l /nodes=1/gpudevice=1 ...

To request 2 GPUs on the same node (and thus 16 CPU cores):

oarsub -t gpu -l /nodes=1/gpudevice=2 ...

OAR allocates GPU units according to the gpudevice property. To find out which units are allocated, use the oarprint command once on the node (interactively):

user@bigfoot2:~$ oarprint gpudevice
1
0

Using Froggy GPU nodes

The software environment

Due to the age of the machine, the software environment with the classical libraries for these nodes is set up with the module command.

Running a job on GPU nodes

To launch a job, we always use the OAR resource manager (whose essential commands are described on this page), and more specifically the oarsub command.

In practice, to launch a job on GPU nodes, we have to specify when calling oarsub that the type of our job is gpu (-t gpu). It is also possible to specify the number of GPUs we want for our job using the gpu property (e.g. -l /nodes=1/gpu=2).

There is a difference between Dahu and Froggy in the property used to reserve a number of GPUs: on Dahu, you have to use the gpudevice property; on Froggy, gpu.

Here is an example of an interactive session that reserves two GPUs, loads the CUDA environment and compiles a CUDA utility:

[bouttier@froggy1 ~] oarsub -I --project test -l /nodes=1/gpu=2 -t gpu
[ADMISSION RULE] Set default walltime to 1800.
[ADMISSION RULE] Modify resource description with type constraints
[COMPUTE TYPE] Setting compute=NO
[GPUNODE] Adding gpu node restriction
OAR_JOB_ID=349170
Interactive mode : waiting...
Starting...
Connect to OAR job 349170 via the node frogkepler3
[bouttier@frogkepler3 ~]$ source /applis/site/env.bash
[bouttier@frogkepler3 ~]$ module load cuda/6.5
[bouttier@frogkepler3 ~]$ rsync -a $cuda_SAMPLES_DIR .
[bouttier@frogkepler3 ~]$ cd cuda-samples/NVIDIA_CUDA-6.5_Samples/1_Utilities/deviceQuery
[bouttier@frogkepler3 deviceQuery]$ make
g++ -m64  -I/opt/cuda/5.0//include -I. -I.. -I../../common/inc -o deviceQuery.o -c deviceQuery.cpp
g++ -m64 -o deviceQuery deviceQuery.o -L/opt/cuda/5.0//lib64 -lcuda -lcudart
mkdir -p ../../bin/linux/release
cp deviceQuery ../../bin/linux/release
[bouttier@frogkepler3 deviceQuery]$ ./deviceQuery
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "Tesla K20m"
 CUDA Driver Version / Runtime Version          5.0 / 5.0
 CUDA Capability Major/Minor version number:    3.5
 Total amount of global memory:                 4800 MBytes (5032706048 bytes)
 (13) Multiprocessors x (192) CUDA Cores/MP:    2496 CUDA Cores
 GPU Clock rate:                                706 MHz (0.71 GHz)
 Memory Clock rate:                             2600 Mhz
 Memory Bus Width:                              320-bit
 L2 Cache Size:                                 1310720 bytes
 Max Texture Dimension Size (x,y,z)             1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
 Max Layered Texture Size (dim) x layers        1D=(16384) x 2048, 2D=(16384,16384) x 2048
 Total amount of constant memory:               65536 bytes
 Total amount of shared memory per block:       49152 bytes
 Total number of registers available per block: 65536
 Warp size:                                     32
 Maximum number of threads per multiprocessor:  2048
 Maximum number of threads per block:           1024
 Maximum sizes of each dimension of a block:    1024 x 1024 x 64
 Maximum sizes of each dimension of a grid:     2147483647 x 65535 x 65535
 Maximum memory pitch:                          2147483647 bytes
 Texture alignment:                             512 bytes
 Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
 Run time limit on kernels:                     No
 Integrated GPU sharing Host Memory:            No
 Support host page-locked memory mapping:       Yes
 Alignment requirement for Surfaces:            Yes
 Device has ECC support:                        Enabled
 Device supports Unified Addressing (UVA):      Yes
 Device PCI Bus ID / PCI location ID:           2 / 0
 Compute Mode:
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "Tesla K20m"
 CUDA Driver Version / Runtime Version          5.0 / 5.0
 CUDA Capability Major/Minor version number:    3.5
 Total amount of global memory:                 4800 MBytes (5032706048 bytes)
 (13) Multiprocessors x (192) CUDA Cores/MP:    2496 CUDA Cores
 GPU Clock rate:                                706 MHz (0.71 GHz)
 Memory Clock rate:                             2600 Mhz
 Memory Bus Width:                              320-bit
 L2 Cache Size:                                 1310720 bytes
 Max Texture Dimension Size (x,y,z)             1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
 Max Layered Texture Size (dim) x layers        1D=(16384) x 2048, 2D=(16384,16384) x 2048
 Total amount of constant memory:               65536 bytes
 Total amount of shared memory per block:       49152 bytes
 Total number of registers available per block: 65536
 Warp size:                                     32
 Maximum number of threads per multiprocessor:  2048
 Maximum number of threads per block:           1024
 Maximum sizes of each dimension of a block:    1024 x 1024 x 64
 Maximum sizes of each dimension of a grid:     2147483647 x 65535 x 65535
 Maximum memory pitch:                          2147483647 bytes
 Texture alignment:                             512 bytes
 Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
 Run time limit on kernels:                     No
 Integrated GPU sharing Host Memory:            No
 Support host page-locked memory mapping:       Yes
 Alignment requirement for Surfaces:            Yes
 Device has ECC support:                        Enabled
 Device supports Unified Addressing (UVA):      Yes
 Device PCI Bus ID / PCI location ID:           132 / 0
 Compute Mode:
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.0, CUDA Runtime Version = 5.0, NumDevs = 2, Device0 = Tesla K20m, Device1 = Tesla K20m