The Kraken Supercomputer

Kraken

The Kraken platform

In short, as of July 2025, Kraken is composed of:

11328 AMD Genoa CPU cores, 28 Nvidia H100 GPUs, a 200 Gb/s interconnect, 300 TB of all-NVMe scratch, and 768 or 1536 GB of RAM per node

In detail (first step of the setup):

  • 44 CPU computing nodes with 192 cores each:
    • 2 AMD EPYC 9654 96C 2.4 GHz processors
    • 768 GB RAM (4 GB /core)
    • 1.92 TB local NVMe SSD scratch
    • Infiniband NDR 200 Gb/s HBA
  • 8 CPU computing fat nodes with 192 cores each:
    • 2 AMD EPYC 9654 96C 2.4 GHz processors
    • 1536 GB RAM (8 GB / core)
    • 1.92 TB local NVMe SSD scratch
    • Infiniband NDR 200 Gb/s HBA
  • 7 GPU computing nodes:
    • 2 AMD EPYC 9654 96C 2.4 GHz processors
    • 1536 GB RAM
    • 4 GPU Nvidia H100 94GB HBM2e
    • 3.84 TB local NVMe SSD scratch
    • 2 x Infiniband NDR 200 Gb/s HBA
  • Infiniband NDR 200Gb/s non-blocking Interconnect (2x200 Gb/s for GPU nodes)
  • Hoyt scratch: 300 TB all-NVMe BeeGFS distributed scratch filesystem
  • Access to Bettik, Mantis and Summer
  • DWC (Direct Water Cooling) on all the computing nodes
  • OAR v3 batch scheduler and Alumet system profiling tool

The cluster configuration

Kraken has been split into 2 specialized clusters, accessible from 2 different frontends:

  • kraken-cpu is the general purpose CPU nodes cluster
  • kraken-gpu is the GPU nodes cluster

Each cluster has its own distinct /home volume for the personal directories. Both clusters share the same high-performance distributed scratch directory /hoyt and also have access to /bettik and Mantis (iRods).

Local scratches are mounted on /var/tmp (on every node) and the OAR property scratch1 gives the amount of free local scratch space in MB.

The nodes of the kraken-cpu cluster are named kraken-c[n] (for 768 GB RAM nodes) and kraken-f[n] (for 1536 GB RAM fat nodes).

The nodes of the kraken-gpu cluster are named kraken-g[n].

Each kraken-cpu node has one connected 200 Gb/s Infiniband interface named ibp65s0f0.

Each kraken-gpu node has two connected 200 Gb/s Infiniband interfaces named ibp163s0 and ibp195s0.

Warning: more IB interfaces are available on the Kraken’s nodes, but they are not connected.

Listing resources

With OAR3, the new -S option of the oarnodes command shows a condensed snapshot of the resources, their state and the jobs running on them:

oarnodes -S

oarnodes

The chandler command is still there, but it may need a large terminal to format the output correctly. For now, it is still the best way to see whether some nodes are in “drained” mode (softly disabled ahead of an upcoming maintenance).

AMD EPYC 9654 architecture and OAR configuration

The following AMD documentation has been used to configure the cluster efficiently, and may be used as a reference to understand the topology of the computing nodes of Kraken:

https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/tuning-guides/58002_amd-epyc-9004-tg-hpc.pdf

Here is a schematic representation of the Kraken’s CPU nodes topology:

Kraken CPU topology

Each computing unit (a single “core” of the cluster) represents a unique OAR “resource”. The resources are structured to follow the AMD 9654 topology:

  • 8 “cores” per “die”
  • 3 “dies” per “numa node”
  • 4 “numa nodes” per “cpu”
  • 2 “cpus” per “node”

The hierarchy of computing resources created into the OAR configuration of the cluster is, from the larger to the smaller element: /nodes/cpu/numa/die/core

In other words, each resource has a unique id for each of the following properties: core, die, numa, cpu, network_address (with nodes as an alias for network_address)
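As a sanity check, the per-level core counts implied by this topology can be computed with a few lines of Python:

```python
# Core counts implied by the AMD EPYC 9654 topology described above
CORES_PER_DIE = 8
DIES_PER_NUMA = 3
NUMAS_PER_CPU = 4
CPUS_PER_NODE = 2

cores_per_numa = CORES_PER_DIE * DIES_PER_NUMA        # 24 cores per numa node
cores_per_cpu = cores_per_numa * NUMAS_PER_CPU        # 96 cores per cpu
cores_per_node = cores_per_cpu * CPUS_PER_NODE        # 192 cores per node
print(cores_per_numa, cores_per_cpu, cores_per_node)  # 24 96 192
```

These counts are worth keeping in mind when sizing requests: /numa=1 is 24 cores, /cpu=1 is 96 cores and /nodes=1 is 192 cores.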

For example, here are the properties of kraken-c12:

  Id     Network address   State   Available upto   cpu   core   die   numa  
 ─────────────────────────────────────────────────────────────────────────── 
  2113   kraken-c12        Alive   2147483647       22    2112   264   88    
  2114   kraken-c12        Alive   2147483647       22    2113   264   88    
  2115   kraken-c12        Alive   2147483647       22    2114   264   88    
  2116   kraken-c12        Alive   2147483647       22    2115   264   88    
  2117   kraken-c12        Alive   2147483647       22    2116   264   88    
  2118   kraken-c12        Alive   2147483647       22    2117   264   88    
  2119   kraken-c12        Alive   2147483647       22    2118   264   88    
  2120   kraken-c12        Alive   2147483647       22    2119   264   88    
  2121   kraken-c12        Alive   2147483647       22    2120   265   88    
  2122   kraken-c12        Alive   2147483647       22    2121   265   88
[...]
  2136   kraken-c12        Alive   2147483647       22    2135   266   88    
  2137   kraken-c12        Alive   2147483647       22    2136   267   89    
  2138   kraken-c12        Alive   2147483647       22    2137   267   89    
  2139   kraken-c12        Alive   2147483647       22    2138   267   89    
  2140   kraken-c12        Alive   2147483647       22    2139   267   89    
[...]
  2296   kraken-c12        Alive   2147483647       23    2295   286   95
  2297   kraken-c12        Alive   2147483647       23    2296   287   95
  2298   kraken-c12        Alive   2147483647       23    2297   287   95
  2299   kraken-c12        Alive   2147483647       23    2298   287   95
  2300   kraken-c12        Alive   2147483647       23    2299   287   95
  2301   kraken-c12        Alive   2147483647       23    2300   287   95
  2302   kraken-c12        Alive   2147483647       23    2301   287   95
  2303   kraken-c12        Alive   2147483647       23    2302   287   95
  2304   kraken-c12        Alive   2147483647       23    2303   287   95

Generally, those ids are only used internally by OAR and you never have to specify them. Instead, you ask for a given number of nodes, cpus, numas, dies or cores, and OAR allocates some of them to your job.

For example, the following OAR job will run on 48 cores of the same CPU:

oarsub -l /nodes=1/cpu=1/numa=2

An admission rule called [ANTIFRAG] has been written to try to avoid CPU fragmentation when you only ask for a given number of cores. This rule tries to allocate full dies and will suggest that you fill full numa nodes or cpus. For example:

# The following request:
oarsub -l /core=42
# will be automatically converted into 
oarsub -l /numa=2
# and will result in a job having 48 cores.
# The following request:
oarsub -l /core=380
# will be automatically converted into 
oarsub -l /nodes=2
# and will result in a job having 384 cores.
# The following request:
oarsub -l /core=100
# will be automatically converted into 
oarsub -l /die=13
# and will result in a job having 104 cores, and a warning will suggest that you use full numa nodes:
  [ANTIFRAG] Warning: the number of dies asked is not a multiple of 3: consider using /core=120 to use full numa nodes

The [ANTIFRAG] admission rule goal is to limit the situations where several different jobs share the same die, numa or cpu, because that may lead to performance issues.

The [ANTIFRAG] admission rule will never be as efficient as you can be. We strongly recommend that you understand the cluster topology and try to match your job requests to it as best you can. The [ANTIFRAG] rule is automatically disabled as soon as you yourself specify a /nodes, /cpu, /numa or /die property in your resource request (for example -l /nodes=2/cpu=1), on the assumption that you know what you are doing.
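The exact [ANTIFRAG] rule is implemented server-side in OAR, but its rounding behaviour, as far as the three examples above show, can be sketched in Python: round the /core request up to full dies, then promote the result to the largest unit it fills exactly. This is a reconstruction consistent with the documented examples, not the actual rule:

```python
import math

# Topology constants of the AMD EPYC 9654 nodes (see above)
CORES_PER_DIE = 8
CORES_PER_NUMA = CORES_PER_DIE * 3    # 24
CORES_PER_CPU = CORES_PER_NUMA * 4    # 96
CORES_PER_NODE = CORES_PER_CPU * 2    # 192

def antifrag(cores):
    """Sketch of the [ANTIFRAG] conversion: round a /core request up to
    full dies, then express it with the largest unit it fills exactly."""
    dies = math.ceil(cores / CORES_PER_DIE)
    total = dies * CORES_PER_DIE
    if total % CORES_PER_NODE == 0:
        return f"/nodes={total // CORES_PER_NODE}", total
    if total % CORES_PER_CPU == 0:
        return f"/cpu={total // CORES_PER_CPU}", total
    if total % CORES_PER_NUMA == 0:
        return f"/numa={total // CORES_PER_NUMA}", total
    return f"/die={dies}", total

print(antifrag(42))   # ('/numa=2', 48)
print(antifrag(380))  # ('/nodes=2', 384)
print(antifrag(100))  # ('/die=13', 104)
```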

Fat nodes

Fat nodes are considered as “normal” nodes with twice as much memory. They are simply nodes of the kraken-cpu cluster tagged with the property fat=YES. All jobs can run on the fat nodes, but jobs explicitly asking for fat nodes have a higher priority on them.

To explicitly target the fat nodes, the job must be of the type fat, using the -t fat option of oarsub, for example:

oarsub -l /cpu=1 -t fat 

If you want to avoid the fat nodes, you have to exclude them from your request using the fat property like this:

oarsub -I -l /numa=1 -p "fat = 'NO'"

Memory

There is 4 GB of RAM per CPU core on each node. This value is 8 GB per CPU core on the fat nodes.

So if your job is memory bound and needs, for example, 200 GB of RAM, then you need at least 200/4 = 50 cores, even if your job is sequential.

Asking for 50 cores to get 200 GB is not enough if your job requires this amount of memory on a single host! A 50-core job may be spread over several nodes, with one die on one host and another elsewhere… So, if you need 200 GB of memory addressable from one unique process, you should ask for /nodes=1/die=7 (remember, it is better to stay within a multiple of dies, so actually 56 cores, hence 224 GB of RAM).
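This memory arithmetic can be written as a minimal Python sketch (assuming the 4 GB per core of the regular nodes; use 8 GB per core for the fat nodes):

```python
import math

GB_PER_CORE = 4      # regular kraken-cpu nodes; 8 on fat nodes
CORES_PER_DIE = 8

def cores_for_ram(ram_gb, gb_per_core=GB_PER_CORE):
    """Return (dies, cores, ram_gb) needed on one node to reach ram_gb
    of RAM, rounded up to full dies as recommended above."""
    cores = math.ceil(ram_gb / gb_per_core)   # minimum number of cores
    dies = math.ceil(cores / CORES_PER_DIE)   # round up to full dies
    return dies, dies * CORES_PER_DIE, dies * CORES_PER_DIE * gb_per_core

print(cores_for_ram(200))  # (7, 56, 224) -> oarsub -l /nodes=1/die=7
```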

GPU nodes

GPU nodes are available as a separate cluster managed by the frontend kraken-gpu.

The OAR configuration is simpler; the resource hierarchy is as follows:

  • 48 cores per GPU
  • 4 GPUs per node

The cores cannot be separated from their attached GPU. The only way to ask for resources on kraken-gpu is to ask for a number of GPUs or for a number of nodes, for example:

# Asking for 2 GPUs on the same node:
oarsub -l /nodes=1/gpu=2
# Asking for 2 full nodes, which will result in 8 GPUs:
oarsub -l /nodes=2
# You may want one GPU per node on 2 different nodes (2 GPUs total):
oarsub -l /nodes=2/gpu=1

Memory on GPU nodes

There is 8 GB of RAM per core, which amounts to 384 GB per GPU. As a consequence, if you ask for 2 GPUs on the same node, for example, you get 768 GB of RAM.
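The same arithmetic for the GPU nodes, as a small Python sketch:

```python
# RAM that comes with a GPU request on kraken-gpu (figures from above)
CORES_PER_GPU = 48
GB_PER_CORE = 8

def host_ram_for_gpus(n_gpus):
    """Host RAM (in GB) attached to a request of n_gpus GPUs."""
    return n_gpus * CORES_PER_GPU * GB_PER_CORE

print(host_ram_for_gpus(1), host_ram_for_gpus(2))  # 384 768
```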

Each H100 GPU has 94 GB of HBM2e memory.

GPU isolation

Your job can only see the exact number of GPU devices it has requested; the other devices of the node are automatically masked. This way, the first device of your job on a given node is always given the id 0. You can see the real device paths of your GPUs by using the oarprint gpudevice command from inside a job, for example:

bzizou@kraken-gpu:~$ oarsub -I -l /gpu=2 --project test
[ADMISSION RULE] Set default walltime to 7200.
# INFO:  Moldable instance:  1  Estimated nb resources:  96  Walltime:  7200
# Warning: no nodes request... A /nodes=<N> prefix is recommended when asking for more than 1 GPU,
# for example /nodes=1/gpu=2
OAR_JOB_ID=62512
Interactive mode: waiting...
Starting...
Connect to OAR job 62512 via the node kraken-g1
bzizou@kraken-g1:~$ nvidia-smi 
Tue Jul 22 15:56:08 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.51.03              Driver Version: 575.51.03      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100                    Off |   00000000:A6:00.0 Off |                    0 |
| N/A   42C    P0             67W /  700W |       0MiB /  95830MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100                    Off |   00000000:C6:00.0 Off |                    0 |
| N/A   42C    P0             70W /  700W |       0MiB /  95830MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
bzizou@kraken-g1:~$ oarprint gpudevice
/dev/nvidia2
/dev/nvidia3

In this example, you can see that the two GPUs your program can see have the ids ‘0’ and ‘1’. The nvidia-smi command can only see those 2 devices; the 2 others of the same node are masked. The oarprint gpudevice output shows that you actually obtained the Nvidia GPUs with hardware ids ‘2’ and ‘3’.

MPI

Several Open MPI environments are available:

  • Directly installed into the OS (which is Debian/Trixie), currently openmpi 5.0.7 with gcc 14.2
  • Environment modules:
bzizou@kraken-g1:~$ module avail openmpi 
----------------------------------------------------- /softs/Modules -----------------------------
openmpi/4.1.8/cuda-12.9-gcc-12.4.0  openmpi/4.1.8/gcc-14.2.0            openmpi/5.0.8/gcc-14-2.0  
openmpi/4.1.8/cuda-12.9-gcc-14.2.0  openmpi/5.0.7/cuda-12.9-gcc-12.4.0  
openmpi/4.1.8/gcc-12.4.0            openmpi/5.0.7/gcc-11.4.1            

Known to work so far are openmpi/5.0.8/gcc-14-2.0, openmpi/4.1.8/gcc-14.2.0 and openmpi/4.1.8/cuda-12.9-gcc-14.2.0.

(a cuda keyword in the module name means it is intended for use on kraken-gpu)

  • Nix / Guix (check GRICAD doc)

Recommended:

mpirun --hostfile $OAR_NODEFILE --prefix $OPENMPI_PATH -x LD_LIBRARY_PATH --mca plm_rsh_agent "oarsh" --mca pml ucx --mca btl ^tcp,openib,uct -x UCX_TLS=shm,self,cuda_copy,dc,rc,ud,gdr_copy,tcp

This is normally not needed, as UCX is already configured accordingly, but you may have to specify the IB devices:

# kraken-cpu
  -x UCX_NET_DEVICES=ibp65s0f0:1
# kraken-gpu
  -x UCX_NET_DEVICES=ibp163s0:1,ibp195s0:1

If you want to test multi-rail on the GPU nodes (because they have 2 connected IB interfaces), you can add:

  -x UCX_MAX_EAGER_RAILS=2 -x UCX_MAX_RNDV_RAILS=2 -x UCX_NET_DEVICES=ibp163s0:1,ibp195s0:1 

If you have a working combination of MPI environment and options, we may be interested in your experience! Feel free to report it to sos-gricad@univ-grenoble-alpes.fr!

Dashboards

Dashboards are available here (need VPN access): https://gricad-dashboards.univ-grenoble-alpes.fr/dashboards/f/feotgbsryapz4a

  • Alumet dashboards show CPU / GPU resources usage by job
  • Hoyt dashboards show the load on the scratch filesystem
  • Other dashboards to monitor temperatures, total load, …

Sandboxing (devel jobs)

If you are developing your job scripts, you may need to run test jobs just to check that they launch correctly. For that, there is a devel queue with a high priority, so that such test jobs start before the others. You have to specify the devel type in the OAR submission: simply add -t devel to your oarsub command options. The rules are the following:

  • devel jobs have a maximum walltime of 30 minutes
  • you can start only 1 devel job at a time
  • a devel job is automatically placed into the devel queue (do NOT specify -q devel, use -t devel instead)
  • the devel queue is not allowed for non-devel jobs