List of hardware / configuration changes
Here is the changelog of the HPC facilities. The latest changes are at the top of the list.
2025-07-23
- The new “Kraken” supercomputer is now open to all users in beta-production mode. Check the Kraken doc here: https://gricad-doc.univ-grenoble-alpes.fr/hpc/kraken/kraken/
“Beta” production means that the platform is not yet in its final configuration, especially regarding the network connection with common services such as Bettik, Mantis,… It also means
that stability issues may occur: nodes may be rebooted and jobs stopped without notice. Support is limited and use is at your own risk.
2025-07-09
- Frontend systems upgraded
- Grafana update on gricad-dashboards
- Dahu-oar3: OAR3 upgraded
2025-06-30
2025-04-18
- Bigfoot: applied a quota of 176 resources max for all running jobs of a given user, to avoid monopolistic situations on the GPUs (see the example below to check your running jobs)
- Bigfoot: Tania, the process killer, could not send e-mails (and had not for weeks…). Fixed.
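To check your own running jobs and the resources they hold, a minimal sketch using the standard OAR client on the Bigfoot frontend (the job id is a placeholder):
oarstat -u $USER                 # list your jobs
oarstat -f -j <job_id>           # full details of one job, including its assigned resources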
2025-04-16
2025-04-11
- Upgraded the Nix CLI to 2.24.14 and updated the documentation
- Fixed a performance issue at login on bastions trinity and rotule
2025-04-07
- OAR3 packages update on dahu-oar3 server
2025-04-03
- /applis/site/nix.sh now enables the experimental nix flakes commands by default
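A short sketch of what the flakes commands allow once the script has been sourced; the nixpkgs flake reference and the hello package are only illustrative:
source /applis/site/nix.sh
nix flake metadata nixpkgs       # inspect a flake
nix run nixpkgs#hello            # build and run a package straight from a flake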
2025-03-19
- Updated default nix environment (/applis/site/nix.sh -> 24.11)
2025-02-28
2025-02-19
- Bigfoot: nodes and frontend auth system plugged into a new dedicated LDAP slave
2025-02-13
- Datacenter IMAG ventilation test (“ventitest”)
- Bettik: replaced one meta-data node with a new one (grant from IPAG)
- Upgraded OS of luke/dahu/bigfoot frontends
- Silenus: Optimized kernel config and rebooted
- Started to deploy new cybersecurity policy
2025-01-30
- Bigfoot: removed /etc/singularity as it caused an error message when using Apptainer (which is the Singularity replacement)
2025-01-28
2025-01-03
- Mantis: from now on, data older than 2 years are automatically migrated to the backup resources.
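To see on which iRODS resource your Mantis files currently sit, a minimal sketch assuming the standard icommands (the collection path is a placeholder; the resource name appears in the long listing):
ils -l /path/to/your/collection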
2025-01-22
- Bigfoot frontend: auth is now connected to a local LDAP replica
- Upgraded the OS and iRODS on the cargo host
- Added security tools on Eli and some service nodes
- CiGri: fixed a bug with OAR_AUTO_RESUBMIT jobs
2025-01-15
- All clusters: added the Mamba 2.0.5 (Miniforge) environment manager.
- Use
source /applis/environments/mamba.sh
to activate the Mamba commands.
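A typical Mamba workflow once the script has been sourced, as a sketch; the environment name and packages are only illustrative, and shell integration is assumed to be set up by the sourced script:
source /applis/environments/mamba.sh
mamba create -n myenv python=3.12 numpy     # hypothetical environment
mamba activate myenv
mamba deactivate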
2025-01-14
2025-01-08
- CiGri: fixed a bug with the “Nikita” module, which is responsible for killing running jobs when a campaign is canceled
2024-12-06
- Mantis: we started to automatically move files older than 2 years to the “BACKUP” resources, to free up disk space on the “IMAG” resources
- Mantis: disabled “deduplication” on the Backup resources as it caused performance issues
2024-12-04
- Bigfoot:
- Added the latest Nvidia CUDA Toolkit (12.6.3) with the cuDNN (9.6.0) libraries.
- Check the available toolkit versions with:
source /applis/environments/cuda_env.sh -l
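A minimal sketch of loading and checking a toolkit; selecting a specific version may require an extra argument to the script, which is an assumption here (check the -l output and the cluster documentation):
source /applis/environments/cuda_env.sh -l   # list the available toolkits
source /applis/environments/cuda_env.sh      # load a toolkit
nvcc --version                               # confirm the CUDA compiler now in the PATH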
2024-11-26
- A long SPRING (UGA network fabric) preventive maintenance required us to shut down the computing services. During this shutdown, we also performed some preventive actions and upgrades:
- Upgraded kernels on almost every server: Trinity, Rotule, Mantis servers, Eli servers, cluster frontends + reboot
- Upgraded Mantis to iRods 4.3.3
- Upgraded PostgreSQL for the OAR database of Dahu
- Cleaned the OAR database of Dahu (removed jobs older than 2 years, and vacuum)
- RAM of dahu-workflow1 and dahu-workflow2 upgraded to 192 GB
- BIGFOOT7: Fan replaced
- Bettik meta-data servers replaced
2024-11-18
- Fixed a bug in Colmet: jobs that do not have any process were preventing other data from being collected, resulting, among other things, in bad power usage estimations.
2024-11-15
- A new workflow frontend has been installed: dahu-workflow2
- Fixed a CiGri bug with errors like ‘prepared statement “stmt_1730797794_6939266” does not exist’
2024-10-29
- Tania (the process sniper) now kills greedy processes using too much MEMORY on the dahu/bigfoot frontends (it used to kill only processes using too much CPU, but it now also monitors memory)
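As a reminder, heavy computations should run on compute nodes rather than on the frontends; a minimal sketch of moving an interactive workload onto a node with OAR (resources, walltime and project name are only illustrative):
oarsub -I -l /nodes=1/core=4,walltime=02:00:00 --project my-project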
2024-10-22
- Eli and all instances were restarted due to a necessary firmware upgrade of all the servers’ disks, recommended by Dell. The platform was stopped between 9:00 and 12:30.
2024-10-18
- Upgraded the iRODS clients (icommands) to 4.3.3 (Warning: a configuration change is necessary; it is made automatically by the /applis/site/nix.sh script. The change occurs in the ~/.irods/irods_environment.json file, where the string “PAM” should be replaced by “pam_password”)
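A quick way to check that your own file has been updated; the key name shown is the usual iRODS one and is given here as an assumption:
grep -i authentication ~/.irods/irods_environment.json
# expected after the update, something like:
#   "irods_authentication_scheme": "pam_password",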
2024-10-09
- Upgraded the dahu-visu operating system to the latest Debian version (Bookworm)
2024-10-08
- Added an Ansible sync of the mounts at boot time on Dahu and Bigfoot nodes
- CiGri: fixed a bug with uncleaned SQL connections
2024-10-01
- Replaced GPU #1 of bigfoot11 (some users’ jobs failed on this GPU and DCGMI reported bad performance; Dell granted a replacement after days of investigation)
2024-09-30
- Updated admission rules on Dahu: automatic redirection of “ljc” jobs to the long queue
- Updated admission rules on Bigfoot: prevent submissions requesting a full GPU (/gpu=1) on the devel nodes
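For reference, a minimal sketch of a full-GPU request on Bigfoot, which is now rejected on the devel nodes and should target the regular nodes instead (walltime, project and script names are only illustrative):
oarsub -l /nodes=1/gpu=1,walltime=04:00:00 --project my-project "./my_gpu_job.sh"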
2024-09-09
- Upgraded OAR3 on dahu-oar3 with the experimental packages 3.0.0dev14 (fixes scheduling issues, API compatibility with oar2/cigri3.2, and performance issues)
2024-09-05
- Upgraded OS of Killeen, the grid server
- Upgraded CiGri to version 3.2: you now have support for OAR3 clusters (dahu-oar3 for instance), a new efficient “temporal_grouping” option, and many bug fixes. More information here: https://github.com/oar-team/cigri/blob/3.2/CHANGELOG.md (for OAR3, you have to set up a JWT token; check the
gridtoken
command)
- Security updates of the OS of the Luke frontend
- Security updates of the OS of the Dahu frontend
2024-08-20
- Pinned the Apptainer version to 1.3.2 and pointed the Singularity link to this version of Apptainer on all Dahu/Bigfoot nodes and frontends
2024-07-23
- Changed the CMOS battery of a Mantis node during the “ventitest” service interruption
2024-06-13
- Dahu-oar3: fixed energy-saving and scheduling issues
2024-05-30
- Updated and restarted Eli and all Elasticsearch/OpenSearch instances
2024-04-22
- Bigfoot: continued the systemd OAR job manager adaptations: fixed bigfoot-gh1 being suspected when users have a “-” in their login, and confined the GPU devices into the systemd slice
2024-04-15
2024-04-12
- Bigfoot:
- Doc update for AMD GPUs (ROCm Nix packages upgrade)
2024-04-10
- Bigfoot:
- Upgraded the bigfoot-gh1 kernel to 6.5.0-1014 and the Nvidia drivers to 550, which fixed the cpuset bug, so bigfoot-gh1 is back online in OAR
- Upgraded the AMD GPU firmware on bigfoot13 in an attempt to fix some issues with inter-GPU communication
- Nix: added support for aarch64 (ARM64) architecture into /applis/site/nix.sh
2024-03-29
2024-03-25
- Dahu and Bigfoot nodes:
- Added packages (dependencies of apptainer): squashfuse, fuse2fs, gocryptfs
- singularity is now a symbolic link to apptainer
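A small check showing that both names now resolve to the same runtime; the container image is only illustrative:
apptainer --version
singularity --version                          # same binary, reached through the symbolic link
apptainer exec my_image.sif python3 --version  # hypothetical image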
2024-03-21
- Added a new queue, long, for jobs with a walltime between 48h and 160h on the Dahu cluster, with a high priority on 2 nodes (dahu106 and dahu107, tagged with the long=YES OAR property)
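A minimal submission sketch for this queue; the resources, walltime and script name are only illustrative:
oarsub -q long -l /nodes=1/core=32,walltime=100:00:00 --project my-project "./my_long_job.sh"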
2024-03-20
- Installed a new bigfoot-gh1 node: it is an experimental node containing an Nvidia Grace Hopper GH200 motherboard (72 ARM64 Grace cores + a Hopper GPU). As the node is still unstable, it is often in “drain” mode. When it is not drained, you can submit jobs with the -t gh type from the bigfoot frontend.
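A minimal interactive submission sketch targeting this node; the walltime and project name are only illustrative:
oarsub -t gh -l /nodes=1,walltime=01:00:00 --project my-project -I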
2024-03-05
- MAJOR MAINTENANCE
- Mantis:
- OS upgrade of the nodes
- iRODS servers upgrade 4.2.12 -> 4.3.1
- clients upgrade postponed, as there is a configuration change to deploy into users’ home directories
- Dahu, Luke and Bigfoot nodes:
- Security updates
- Firmware upgrades
- Upgrade BeeGFS clients to 7.4.2
- Nvidia drivers re-deployment on Bigfoot nodes
- Bettik:
- upgrade servers to BeeGFS 7.4.2
- data migration and decommissioning of bettik-data1
- Silenus:
- upgrade servers to BeeGFS 7.4.2
- RAM upgrade of the meta-data server 32 GB -> 254 GB
- Nix: upgraded the daemon and clients to 2.18
- SSH gateways OS Upgrade
- Vacuum of OAR databases (Bigfoot/Luke/Dahu)