List of hardware / configuration changes

This is a change log of the HPC facilities. The latest changes are at the top of the list.

2024-09-09

  • Upgraded OAR3 on dahu-oar3 with experimental packages 3.0.0dev14 (fixes scheduling issues, API compatibility with oar2/cigri3.2, and performance issues)

2024-09-05

  • Upgraded OS of Killeen, the grid server
  • Upgraded CiGri to version 3.2: it now supports OAR3 clusters (dahu-oar3 for instance), adds a new efficient “temporal_grouping” option, and fixes many bugs. More information here: https://github.com/oar-team/cigri/blob/3.2/CHANGELOG.md. For OAR3, you have to set up a JWT token; check the gridtoken command.
  • Security updates of OS of the Luke frontend
  • Security updates of OS of the Dahu frontend

2024-08-20

  • Pinned Apptainer to version 1.3.2 and linked Singularity to this version of Apptainer on all Dahu/Bigfoot nodes and frontends

2024-07-23

  • Changed CMOS battery of a Mantis node during vetntitest service interruption

2024-06-13

  • Dahu-oar3: fixed energy saving and scheduling

2024-05-30

  • Updated and restarted Eli and all Elastic/Open-search instances

2024-04-22

  • Bigfoot: continued systemd OAR job manager adaptations: fixed bigfoot-gh1 being set to the Suspected state for users with a “-” in their login, and confined GPU devices to the systemd slice

2024-04-15

2024-04-12

  • Bigfoot:
    • Doc update for AMD GPUs (ROCm Nix packages upgrade)

2024-04-10

  • Bigfoot:
    • Upgraded bigfoot-gh1 kernel to 6.5.0-1014 and Nvidia drivers to 550; this fixed the cpuset bug, so bigfoot-gh1 is back online in OAR
    • Upgraded AMD GPU firmware on bigfoot13 in an attempt to fix some issues with inter-GPU communication
  • Nix: added support for aarch64 (ARM64) architecture into /applis/site/nix.sh

2024-03-29

2024-03-25

  • Dahu and Bigfoot nodes:
    • Added packages (deps of apptainer): squashfuse, fuse2fs, gocryptfs
    • singularity is now a symbolic link to apptainer

2024-03-21

  • Added a new queue, long, for jobs with a walltime between 48h and 160h on the Dahu cluster, with high priority on 2 nodes (dahu106 and dahu107, tagged with the long=YES OAR property)
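
As a rough illustration of the walltime window above, the following Python sketch checks whether a requested walltime falls in the 48h to 160h range and builds a matching oarsub command line. The helper names, the treatment of the bounds (above 48h, up to and including 160h), the /nodes=1 resource request and the ./job.sh script name are assumptions made for illustration. Whether -q long must be passed explicitly is also an assumption; the cluster's admission rules may route such jobs on their own.

```python
import shlex

LONG_MIN_HOURS = 48    # lower bound from the changelog entry (assumed exclusive)
LONG_MAX_HOURS = 160   # upper bound from the changelog entry (assumed inclusive)


def walltime_hours(walltime: str) -> float:
    """Convert an OAR-style 'H:MM:SS' walltime string to hours."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return hours + minutes / 60 + seconds / 3600


def fits_long_queue(walltime: str) -> bool:
    """True if the walltime lies in the (assumed) window of the long queue."""
    return LONG_MIN_HOURS < walltime_hours(walltime) <= LONG_MAX_HOURS


def long_queue_command(walltime: str, script: str = "./job.sh") -> str:
    """Build (but do not run) an oarsub command line targeting the long queue."""
    if not fits_long_queue(walltime):
        raise ValueError(f"walltime {walltime} is outside the 48h-160h window")
    # /nodes=1 and the script name are placeholders, not site requirements.
    args = ["oarsub", "-q", "long", "-l", f"/nodes=1,walltime={walltime}", script]
    return shlex.join(args)
```

For example, long_queue_command("100:00:00") returns the string "oarsub -q long -l /nodes=1,walltime=100:00:00 ./job.sh", while a 24h walltime raises ValueError. Site-specific options such as a project name are left out here.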

2024-03-20

  • Installed a new bigfoot-gh1 node: it’s an experimental node containing an Nvidia Grace-Hopper GH200 motherboard (72-core ARM64 Grace CPU + Hopper GPU). As the node is still unstable, it’s often in “drain” mode. When it is not drained, you can submit jobs with the -t gh type from the bigfoot frontend.
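
A minimal sketch of such a submission, written in Python so the argument list can be inspected before anything is sent. Only the -t gh job type comes from the entry above; the function names, the default walltime, the ./job.sh script name and the dry-run behaviour are assumptions made for illustration, and any site-specific options (a project name, for instance) are omitted.

```python
import subprocess


def gh_job_args(script: str, walltime: str = "01:00:00") -> list[str]:
    """Argument list for an oarsub submission using the gh job type."""
    return ["oarsub", "-t", "gh", "-l", f"walltime={walltime}", script]


def submit_gh_job(script: str, walltime: str = "01:00:00", dry_run: bool = True) -> list[str]:
    """Return the oarsub arguments, running them only when dry_run is False."""
    args = gh_job_args(script, walltime)
    if not dry_run:
        # Only meaningful on the bigfoot frontend, and only when the node is not drained.
        subprocess.run(args, check=True)
    return args
```

Calling submit_gh_job("./job.sh") leaves dry_run at its default and simply returns the arguments, which makes it easy to check the command before submitting it for real.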

2024-03-05

  • MAIN MAINTENANCE
    • Mantis:
      • OS upgrade of the nodes
      • iRODS servers upgrade 4.2.12 -> 4.3.1
      • client upgrade postponed, as a configuration change must first be deployed into users’ home directories
    • Dahu, Luke and Bigfoot nodes:
      • Security updates
      • Firmware upgrades
      • Upgrade BeeGFS clients to 7.4.2
      • Nvidia drivers re-deployment on Bigfoot nodes
    • Bettik:
      • upgrade servers to BeeGFS 7.4.2
      • data migration and decommissioning of bettik-data1
    • Silenus:
      • upgrade servers to BeeGFS 7.4.2
      • RAM upgrade of the metadata server: 32 GB -> 254 GB
    • Nix: upgrade daemon and clients to 2.18
    • SSH gateways OS Upgrade
    • Vacuum of OAR databases (Bigfoot/Luke/Dahu)