List of hardware / configuration changes
Here is a change log of the HPC facilities. The latest changes are at the top of the list.
2024-12-06
- Mantis: we started automatically moving files older than 2 years to the “BACKUP” resources, to free up disk space on the “IMAG” resources
- Mantis: disabled “deduplication” on the Backup resources as it caused performance issues
2024-12-04
- Bigfoot:
  - Added the latest Nvidia Toolkit (12.6.3) with cuDNN (9.6.0) libraries.
  - Check available toolkit versions with:
    source /applis/environments/cuda_env.sh -l
2024-11-26
- A long preventive maintenance of SPRING (the UGA network fabric) forced us to shut down the computing services. During this shutdown, we also carried out some preventive actions and upgrades:
  - Upgraded kernels on almost every server: Trinity, Rotule, Mantis servers, Eli servers, cluster frontends, and rebooted them
  - Upgraded Mantis to iRODS 4.3.3
  - Upgraded PostgreSQL of the OAR database of Dahu
  - Cleaned the OAR database of Dahu (removed jobs older than 2 years, and vacuum)
  - RAM of dahu-workflow1 and dahu-workflow2 upgraded to 192 GB
  - BIGFOOT7: fan replaced
  - Bettik meta-data servers replaced
2024-11-18
- Fixed a bug in colmet: jobs without any process were preventing other data from being collected, resulting, among other things, in inaccurate power usage estimates.
2024-11-15
- A new workflow frontend has been installed: dahu-workflow2
- Fixed a CiGri bug with errors like ‘prepared statement “stmt_1730797794_6939266” does not exist’
2024-10-29
- Tania (process sniper) now also kills greedy processes using too much memory on the dahu/bigfoot frontends (it previously only killed processes using too much CPU)
2024-10-22
- Eli and all instances were restarted due to a necessary firmware upgrade of all the servers' disks, recommended by Dell. The platform was stopped between 9:00 and 12:30.
2024-10-18
- Upgraded iRODS clients (icommands) to 4.3.3 (Warning: a configuration change is necessary; it is made automatically by the /applis/site/nix.sh script. The change occurs in the ~/.irods/irods_environment.json file, where the string “PAM” must be replaced by “pam_password”)
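The configuration change described in this entry can be sketched as follows. This is only an illustration and not the contents of the /applis/site/nix.sh script: the function names are hypothetical, and the sketch assumes that the “PAM” string is the value of the `irods_authentication_scheme` key, which the entry itself does not state.

```python
import json
from pathlib import Path

# Assumption: the "PAM" string is stored under this key of the JSON file.
AUTH_KEY = "irods_authentication_scheme"


def migrate_auth_scheme(config: dict) -> dict:
    """Return a copy of the client config with "PAM" replaced by "pam_password"."""
    updated = dict(config)
    if updated.get(AUTH_KEY) == "PAM":
        updated[AUTH_KEY] = "pam_password"
    return updated


def migrate_file(path: Path) -> bool:
    """Rewrite the JSON file at `path` in place; return True if it was changed."""
    config = json.loads(path.read_text())
    updated = migrate_auth_scheme(config)
    if updated == config:
        return False  # already migrated, or not using PAM
    path.write_text(json.dumps(updated, indent=4) + "\n")
    return True
```

Under these assumptions, calling `migrate_file(Path.home() / ".irods" / "irods_environment.json")` would apply the change to a user's own file. All other settings are left untouched, and a second call changes nothing.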
2024-10-09
- Upgraded the dahu-visu operating system to the latest Debian version (Bookworm)
2024-10-08
- Added an Ansible sync of the mounts at boot time on Dahu and Bigfoot nodes
- CiGri: bug fix for SQL connections that were not cleaned up
2024-10-01
- Replaced GPU #1 of bigfoot11 (some users' jobs failed on this GPU and DCGMI reported poor performance; Dell agreed to a replacement after days of investigation)
2024-09-30
- Updated admission rules on Dahu: automatic redirection of “ljc” jobs to the long queue
- Updated admission rules on Bigfoot: prevent submissions requesting a full GPU (/gpu=1) on the devel nodes
2024-09-09
- Upgraded OAR3 on dahu-oar3 with experimental packages 3.0.0dev14 (fixes scheduling issues, API compatibility with oar2/cigri3.2, and performance issues)
2024-09-05
- Upgraded OS of Killeen, the grid server
- Upgraded CiGri to version 3.2: it now supports OAR3 clusters (dahu-oar3 for instance), adds a new efficient “temporal_grouping” option, and fixes many bugs. More information here: https://github.com/oar-team/cigri/blob/3.2/CHANGELOG.md and here (for OAR3, you have to set up a JWT token; check the gridtoken command)
- Security updates of OS of the Luke frontend
- Security updates of OS of the Dahu frontend
2024-08-20
- Pinned Apptainer to version 1.3.2 and linked Singularity to this version of Apptainer on all Dahu/Bigfoot nodes and frontends
2024-07-23
- Changed CMOS battery of a Mantis node during vetntitest service interruption
2024-06-13
- Dahu-oar3: fixed energy saving and scheduling
2024-05-30
- Updated and restarted Eli and all Elastic/Open-search instances
2024-04-22
- Bigfoot: continued adapting the systemd OAR job manager: fixed bigfoot-gh1 becoming Suspected for users with a “-” in their login, and confined GPU devices to the systemd slice
2024-04-15
2024-04-12
- Bigfoot:
  - Doc update for AMD GPUs (ROCm Nix packages upgrade)
2024-04-10
- Bigfoot:
  - Upgraded bigfoot-gh1 kernel to 6.5.0-1014 and Nvidia drivers to 550 -> fixed cpuset bug, so bigfoot-gh1 is back online in OAR
  - Upgraded AMD GPU firmware on bigfoot13 as an attempt to fix some issues with inter-GPU communication
- Nix: added support for aarch64 (ARM64) architecture into /applis/site/nix.sh
2024-03-29
2024-03-25
- Dahu and Bigfoot nodes:
  - Added packages (deps of apptainer): squashfuse, fuse2fs, gocryptfs
  - singularity is now a symbolic link to apptainer
2024-03-21
- Added a new queue “long” for jobs with a walltime between 48h and 160h on the Dahu cluster, with a high priority on 2 nodes (dahu106 and dahu107, tagged with the long=YES OAR property)
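As a rough illustration of the walltime window of this queue, the sketch below checks whether a requested walltime falls between 48h and 160h. It is not the OAR admission rule itself: the function names are hypothetical, the walltime is assumed to be written as H:MM:SS, and treating both bounds as inclusive is an assumption.

```python
LONG_MIN_HOURS = 48   # lower bound quoted in the entry above
LONG_MAX_HOURS = 160  # upper bound quoted in the entry above


def walltime_to_seconds(walltime: str) -> int:
    """Convert a walltime written as H:MM:SS (assumed format) to seconds."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def fits_long_queue(walltime: str) -> bool:
    """Return True if the walltime lies in the 48h-160h window (bounds assumed inclusive)."""
    total = walltime_to_seconds(walltime)
    return LONG_MIN_HOURS * 3600 <= total <= LONG_MAX_HOURS * 3600
```

For example, `fits_long_queue("72:00:00")` is true, while `fits_long_queue("24:00:00")` is not.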
2024-03-20
- Installed a new bigfoot-gh1 node: it’s an experimental node containing an Nvidia Grace-Hopper GH200 motherboard (72 cores ARM64 Grace + Hopper GPU). As the node is still unstable, it’s often in the “drain” mode. When not drained, you can submit jobs with the -t gh type from the bigfoot frontend.
2024-03-05
- MAIN MAINTENANCE
  - Mantis:
    - OS upgrade of the nodes
    - iRODS servers upgrade 4.2.12 -> 4.3.1
    - clients upgrade postponed as there’s a configuration change to deploy into users' home directories
  - Dahu, Luke and Bigfoot nodes:
    - Security updates
    - Firmware upgrades
    - Upgrade BeeGFS clients to 7.4.2
    - Nvidia drivers re-deployment on Bigfoot nodes
  - Bettik:
    - upgrade servers to BeeGFS 7.4.2
    - data migration and decommissioning of bettik-data1
  - Silenus:
    - upgrade servers to BeeGFS 7.4.2
    - RAM upgrade of the meta-data server 32 GB -> 254 GB
  - Nix: upgrade daemon and clients to 2.18
  - SSH gateways OS upgrade
  - Vacuum of OAR databases (Bigfoot/Luke/Dahu)