In short, as of July 2025, Kraken is composed of:
11328 AMD Genoa CPU cores, 28 Nvidia H100 GPUs, a 200 Gb/s interconnect, 300 TB of all-NVMe scratch, and 768 or 1536 GB of RAM per node.
In detail (first step of the setup):
Kraken has been split into 2 specialized clusters, accessible from 2 different frontends:
kraken-cpu is the general purpose CPU nodes cluster; kraken-gpu is the GPU nodes cluster.
Each cluster has its own distinct /home volume for the personal directories. Both clusters share the same high-performance distributed scratch directory /hoyt, and both also have access to /bettik and Mantis (iRODS).
Local scratches are mounted on /var/tmp (on every node), and the OAR property scratch1 gives the amount of free local scratch available in MB.
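For example, a job that needs a guaranteed amount of local scratch can filter on this property with the usual -p option of oarsub (a sketch: the 100000 MB threshold and the script name are illustrative, only the scratch1 property name comes from this page):

```shell
# Request a node with at least 100000 MB (~100 GB) of free local scratch.
# The threshold and the script name are examples, not prescriptions.
oarsub -l /nodes=1 -p "scratch1 > 100000" ./my_job.sh
```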
The nodes of the kraken-cpu cluster are named kraken-c[n] (for 768 GB RAM nodes) and kraken-f[n] (for 1536 GB RAM fat nodes).
The nodes of the kraken-gpu cluster are named kraken-g[n].
Each kraken-cpu node has one connected 200 Gb/s Infiniband interface named ibp65s0f0.
Each kraken-gpu node has two connected 200 Gb/s Infiniband interfaces named ibp163s0 and ibp195s0.
Warning: more IB interfaces are available on Kraken’s nodes, but they are not connected.
With OAR3, there is a new -S option of the oarnodes command to see a condensed snapshot of the resources, their state and the jobs running on them:
oarnodes -S
The chandler command is still there, but it may need a large terminal to format the output correctly. It is, for now, still the best way to see whether some nodes are in “drained” mode (softly disabled ahead of an upcoming maintenance).
The following AMD documentation has been used to configure the cluster efficiently, and may be used as a reference to understand the topology of the computing nodes of Kraken:
Here is a schematic representation of the Kraken’s CPU nodes topology:
Each computing unit, a single “core” of the cluster, represents a unique OAR “resource”. The resources are structured to follow the AMD 9654 topology:
The hierarchy of computing resources created in the OAR configuration of the cluster is, from the largest to the smallest element: /nodes/cpu/numa/die/core
In other words, each resource has a unique id for each of the following properties: core, die, numa, cpu, network_address (with nodes as an alias for network_address)
For example, here are the properties of kraken-c12:
Id Network address State Available upto cpu core die numa
───────────────────────────────────────────────────────────────────────────
2113 kraken-c12 Alive 2147483647 22 2112 264 88
2114 kraken-c12 Alive 2147483647 22 2113 264 88
2115 kraken-c12 Alive 2147483647 22 2114 264 88
2116 kraken-c12 Alive 2147483647 22 2115 264 88
2117 kraken-c12 Alive 2147483647 22 2116 264 88
2118 kraken-c12 Alive 2147483647 22 2117 264 88
2119 kraken-c12 Alive 2147483647 22 2118 264 88
2120 kraken-c12 Alive 2147483647 22 2119 264 88
2121 kraken-c12 Alive 2147483647 22 2120 265 88
2122 kraken-c12 Alive 2147483647 22 2121 265 88
[...]
2136 kraken-c12 Alive 2147483647 22 2135 266 88
2137 kraken-c12 Alive 2147483647 22 2136 267 89
2138 kraken-c12 Alive 2147483647 22 2137 267 89
2139 kraken-c12 Alive 2147483647 22 2138 267 89
2140 kraken-c12 Alive 2147483647 22 2139 267 89
[...]
2296 kraken-c12 Alive 2147483647 23 2295 286 95
2297 kraken-c12 Alive 2147483647 23 2296 287 95
2298 kraken-c12 Alive 2147483647 23 2297 287 95
2299 kraken-c12 Alive 2147483647 23 2298 287 95
2300 kraken-c12 Alive 2147483647 23 2299 287 95
2301 kraken-c12 Alive 2147483647 23 2300 287 95
2302 kraken-c12 Alive 2147483647 23 2301 287 95
2303 kraken-c12 Alive 2147483647 23 2302 287 95
2304 kraken-c12 Alive 2147483647 23 2303 287 95
Generally, those ids are only used internally by OAR and you never have to specify them. Instead, you ask for a given number of nodes, cpus, numas, dies or cores, and OAR will allocate some of them to your job.
For example, the following OAR job will run on 48 cores of the same CPU:
oarsub -l /nodes=1/cpu=1/numa=2
An admission rule called [ANTIFRAG] has been written to try to avoid cpu fragmentation when you only ask for a given number of cores. This rule will try to allocate full dies and will suggest that you fill numas or cpus. For example:
# The following request:
oarsub -l /core=42
# will be automatically converted into
oarsub -l /numa=2
# and will result in a job having 48 cores.
# The following request:
oarsub -l /core=380
# will be automatically converted into
oarsub -l /nodes=2
# and will result in a job having 384 cores.
# The following request:
oarsub -l /core=100
# will be automatically converted into
oarsub -l /die=13
# and will result in a job having 104 cores, and a warning will suggest you to use full numa nodes:
[ANTIFRAG] Warning: the number of dies asked is not a multiple of 3: consider using /core=120 to use full numa nodes
The goal of the [ANTIFRAG] admission rule is to limit the situations where several different jobs share the same die, numa or cpu, because that may lead to performance issues.
The [ANTIFRAG] admission rule will never be as efficient as you may be. We strongly recommend that you understand the cluster topology and try to match your job requests to the topology as best as you can. The [ANTIFRAG] rule is automatically disabled as soon as you yourself specify a /node, /cpu, /numa or /die property in your resource request (for example -l /nodes=2/cpu=1), on the assumption that you know what you are doing.
Fat nodes are considered “normal” nodes having twice as much memory. They are just nodes of the kraken-cpu cluster tagged with the property fat=YES. All jobs can run on the fat nodes, but jobs explicitly asking for fat nodes have a higher priority on them.
To explicitly target the fat nodes, the job must be of type fat, using the -t fat option of oarsub, for example:
oarsub -l /cpu=1 -t fat
If you want to avoid the fat nodes, you have to exclude them from your request using the fat property, like this:
oarsub -I -l /numa=1 -p "fat = 'NO'"
There is 4 GB of RAM per cpu-core on each node (8 GB per cpu-core on the fat nodes).
So, if your job is memory bound and you need, for example, 200 GB of RAM, then you need at least 200/4 = 50 cores, even if your job is sequential.
Asking for 50 cores to get 200 GB is not enough if your job requires this amount of memory on the same host! A 50-core job may be spread over several nodes, one die on one host or another… So, if you need 200 GB of addressable memory from one unique process, you should ask for /nodes=1/die=7 (remember, it is better to stay within a multiple of dies, so actually 56 cores, hence 224 GB of RAM).
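The arithmetic above can be sketched as a small helper (the constants of 4 GB per core, 8 GB on fat nodes, and 8 cores per die come from this page; the helper itself is illustrative):

```python
import math

GB_PER_CORE = 4      # RAM granted per core (8 on fat nodes)
CORES_PER_DIE = 8    # AMD 9654: dies of 8 cores

def dies_for_memory(ram_gb, gb_per_core=GB_PER_CORE):
    """Return (dies, cores, ram_gb) needed to reach ram_gb on one host,
    rounding up to full dies as the ANTIFRAG rule favors."""
    cores = math.ceil(ram_gb / gb_per_core)
    dies = math.ceil(cores / CORES_PER_DIE)
    return dies, dies * CORES_PER_DIE, dies * CORES_PER_DIE * gb_per_core

# 200 GB needs 50 cores, rounded up to 7 full dies = 56 cores = 224 GB:
print(dies_for_memory(200))      # (7, 56, 224)
# The same requirement on a fat node needs only 4 dies:
print(dies_for_memory(200, 8))   # (4, 32, 256)
```

The first result matches the /nodes=1/die=7 request recommended above.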
GPU nodes are available as a separate cluster, managed by the frontend kraken-gpu.
The OAR configuration is simpler: the resource hierarchy is as follows:
The cores are not severable from the attached GPU. The only way to ask for resources on kraken-gpu is to ask for a number of GPUs or a number of nodes, for example:
# Asking for 2 GPUs on the same node:
oarsub -l /nodes=1/gpu=2
# Asking for 2 full nodes, which will result in 8 GPUs:
oarsub -l /nodes=2
# You may want one GPU per node on 2 different nodes (2 GPUs total):
oarsub -l /nodes=2/gpu=1
There is 8 GB of RAM per core, so actually 384 GB per GPU. As a consequence, if you ask for 2 GPUs on the same node, for example, you have 768 GB of RAM.
Each H100 GPU has 94 GB of HBM2e memory.
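The per-GPU RAM arithmetic can be made explicit (the 48 cores per GPU figure is inferred from this page: 384 GB per GPU at 8 GB per core; the sketch is illustrative):

```python
GB_PER_CORE = 8       # RAM granted per core on the GPU nodes
CORES_PER_GPU = 48    # inferred: 384 GB per GPU / 8 GB per core

def host_ram_gb(gpus):
    """RAM (GB) that comes with a request of `gpus` GPUs on one node."""
    return gpus * CORES_PER_GPU * GB_PER_CORE

print(host_ram_gb(1))  # 384
print(host_ram_gb(2))  # 768
```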
Your job can only see the exact number of GPU devices it has requested; the other devices of the node are automatically masked. This way, the first device of your job on a given node always has id 0.
You can see the real device path of your GPUs by using the oarprint gpudevice command from inside a job, for example:
bzizou@kraken-gpu:~$ oarsub -I -l /gpu=2 --project test
[ADMISSION RULE] Set default walltime to 7200.
# INFO: Moldable instance: 1 Estimated nb resources: 96 Walltime: 7200
# Warning: no nodes request... A /nodes=<N> prefix is recommended when asking for more than 1 GPU,
# for example /nodes=1/gpu=2
OAR_JOB_ID=62512
Interactive mode: waiting...
Starting...
Connect to OAR job 62512 via the node kraken-g1
bzizou@kraken-g1:~$ nvidia-smi
Tue Jul 22 15:56:08 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.51.03 Driver Version: 575.51.03 CUDA Version: 12.9 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 Off | 00000000:A6:00.0 Off | 0 |
| N/A 42C P0 67W / 700W | 0MiB / 95830MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 Off | 00000000:C6:00.0 Off | 0 |
| N/A 42C P0 70W / 700W | 0MiB / 95830MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
bzizou@kraken-g1:~$ oarprint gpudevice
/dev/nvidia2
/dev/nvidia3
In this example, you can see that the two GPUs your program can see have ids ‘0’ and ‘1’. The nvidia-smi command can only see those 2 devices; the 2 others of the same node are masked. The oarprint gpudevice command shows that you actually obtained the nvidia GPUs having hardware ids ‘2’ and ‘3’.
Several OpenMPI environments are available:
bzizou@kraken-g1:~$ module avail openmpi
----------------------------------------------------- /softs/Modules -----------------------------
openmpi/4.1.8/cuda-12.9-gcc-12.4.0 openmpi/4.1.8/gcc-14.2.0 openmpi/5.0.8/gcc-14-2.0
openmpi/4.1.8/cuda-12.9-gcc-14.2.0 openmpi/5.0.7/cuda-12.9-gcc-12.4.0
openmpi/4.1.8/gcc-12.4.0 openmpi/5.0.7/gcc-11.4.1
Known working so far are openmpi/5.0.8/gcc-14-2.0, openmpi/4.1.8/gcc-14.2.0 and openmpi/4.1.8/cuda-12.9-gcc-14.2.0 (a cuda keyword in the module name means it is for use on kraken-gpu).
Recommended:
mpirun --hostfile $OAR_NODEFILE --prefix $OPENMPI_PATH -x LD_LIBRARY_PATH --mca plm_rsh_agent "oarsh" --mca pml ucx --mca btl ^tcp,openib,uct -x UCX_TLS=shm,self,cuda_copy,dc,rc,ud,gdr_copy,tcp
Normally not needed as UCX is already configured accordingly, but you may have to specify the IB devices:
# kraken-cpu
-x UCX_NET_DEVICES=ibp65s0f0:1
# kraken-gpu
-x UCX_NET_DEVICES=ibp163s0:1,ibp195s0:1
If you want to test multi-rail on the GPU nodes (because they have 2 connected IB interfaces), you can add:
-x UCX_MAX_EAGER_RAILS=2 -x UCX_MAX_RNDV_RAILS=2 -x UCX_NET_DEVICES=ibp163s0:1,ibp195s0:1
If you have a working combination of MPI environment and options, we would be interested in your experience! Feel free to report it to sos-gricad@univ-grenoble-alpes.fr!
Dashboards are available here (need VPN access): https://gricad-dashboards.univ-grenoble-alpes.fr/dashboards/f/feotgbsryapz4a
If you are in the phase of developing your job scripts, you may need to run test jobs just to check that they launch correctly. For that, there is a devel queue which has a high priority, so that such test jobs start before the others. You have to specify the devel type in the oar submission: simply add -t devel to your oarsub command options.
The rules are the following:
devel is a job type, not a queue (do not use -q devel; use -t devel instead)
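For example, a quick interactive test submission might look like this (the resource request and walltime are illustrative; only the -t devel type is prescribed by this page):

```shell
# Illustrative devel test job: small, short, and high priority.
oarsub -t devel -l /core=4,walltime=00:10:00 -I
```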