Advanced job management

On our clusters, thanks to OAR, users can finely tune their job submissions. Before going further, make sure you have a good understanding of the behaviour and usage of OAR commands, as described here.

Checkpointing

This feature is still experimental. We advise you to test it before automating your submissions or workflows.

In some cases, jobs need to run for longer than the maximum allowed walltime, and the program has no way to save its state so that it can restart from where it stopped. On Dahu compute nodes, CRIU, coupled with OAR via a daemon, allows jobs to checkpoint themselves. CRIU can dump an entire process tree (sequential jobs only; it does not handle inter-node communication) and its memory to disk, so that the job can be stopped and later restarted from the dump.

A simple mechanism has been set up to link CRIU to the checkpoint and idempotent OAR features, so that a long job is automatically checkpointed shortly before its walltime expires and automatically restarted afterwards.
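
For reference, the mechanism relies on two standard OAR features: the --checkpoint option (OAR sends a signal, SIGUSR2 by default, the given number of seconds before the walltime) and the idempotent job type (a job exiting with code 99 is automatically re-submitted). A minimal command-line sketch of the same request, with a placeholder script name (my_job.sh):

# Command-line equivalent of the #OAR directives used in the script below
oarsub -t idempotent --checkpoint 240 -l /nodes=1/core=4,walltime=00:05:00 ./my_job.sh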

To set up such a job, you can adapt the following script to your needs:

#!/bin/bash
#OAR -n test_script
#OAR -t idempotent
#OAR -l /nodes=1/core=4,walltime=00:05:00
#OAR -O test.%jobid%.stdout
#OAR -E test.%jobid%.stderr
#OAR --project test
#OAR --checkpoint 240
#OAR --notify mail:Bruno.Bzeznik@univ-grenoble-alpes.fr

# Timeout to adapt: 600 is a good value for bigger jobs
RESUME_TIMEOUT=90

# Handler for checkpointing signal sent by OAR
handler() { echo "Caught checkpoint signal at: `date`"
            echo "Checkpointing..."
            echo -e "$PROG_PID\n$(pwd)" > /var/lib/checkpoints/$OAR_JOB_ID.checkpoint
          }
trap handler SIGUSR2

# Load environment
source /applis/site/nix.sh

# A checkpoint exists, resuming it
if [ -e checkpoint_ok ]
then
  rm -f checkpoint/pidfile
  sleep 30
  echo -e "$(pwd)" > /var/lib/checkpoints/$OAR_JOB_ID.resume
  # Wait for the restore (for pidfile to be created)
  declare -i c=1
  while [ \! -e checkpoint/pidfile -a $c -le $RESUME_TIMEOUT ]
  do
    sleep 1
    let c++
  done
  if [ $c -gt $RESUME_TIMEOUT ]
  then
     echo "ERROR: Timeout on resume!" >&2
     exit 3
  fi
  sleep 5
  PROG_PID=$(cat checkpoint/pidfile)

# No checkpoint, starting the program
else
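  # Example program: replace the stress command below with your own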
  nohup stress --cpu 4 --io 4 --vm 2 -v  &
  PROG_PID=$!
fi

# Wait for $PROG_PID (`wait` does not work in all cases, and 
# bash kills the script when a trap occurs within a wait)
while [ -e /proc/$PROG_PID ]
do
  sleep 1
done

# Now that the process has exited, we have to wait for the checkpoint
# to be finished. The checkpoint_ok file is removed only before doing
# a new checkpoint.
while [ \! -e checkpoint_ok ]
do
  sleep 1
done

# Idempotent job exits with 99 code to be automatically re-submitted
exit 99

The job working directory MUST be on a shared filesystem, such as /bettik, as the dump is created inside a checkpoint sub-directory that needs to be available from every node for the resume. All the files opened by your program must, of course, also be on a shared filesystem. Be careful about temporary directories!
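
For example, a possible way to prepare and submit the job from a shared directory (the /bettik path is illustrative; adapt it to your own space):

# Work from a directory on a shared filesystem so the checkpoint dump is reachable from all nodes
mkdir -p /bettik/$USER/my_checkpointed_job
cd /bettik/$USER/my_checkpointed_job
cp ~/test_script.sh .
oarsub -S ./test_script.sh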

Parts of the previous script are examples to be replaced with your own values. In particular, your program launch command should be placed between nohup and &. The final & is essential: your program must run in the background so that the script can capture the PID to be checkpointed.
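
For instance, assuming your program is a binary named my_solver (a placeholder name), the launch lines would become:

# The program runs in the background so that its PID can be captured
nohup ./my_solver --input my_config.txt &
PROG_PID=$!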

In some cases, the resume might fail. One possible reason is that the PID used by the program could not be restored because it was already in use by another process on the node. In that case, the automatic re-submission system stops and you will have to submit your job again manually, hoping that the new job will start on another node where the required PID is available.
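
In that situation, a manual re-submission could look like this (the <jobid> placeholder is the id of the failed job, the script name is the one from the example above):

# Inspect the error output of the failed job, then submit the script again by hand
cat test.<jobid>.stderr
oarsub -S ./test_script.sh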

The output of your program always goes into the output file of the first job of the idempotency tree (test.%jobid%.stdout in our example, with the jobid of the initial job).
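
To follow the program output, you can therefore watch the output file of that initial job, for example:

# <initial_jobid> is the id of the first job of the idempotency tree
tail -f test.<initial_jobid>.stdout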

Testing the checkpointing

It is important to test whether the checkpointing works. To do so, you can force the job to checkpoint itself with the following command:

oardel --checkpoint $JOB_ID

A checkpoint directory should be created in the working directory, and the job should complete (be killed). As this is a manual checkpoint, you will have to restart your job manually, so submit it again with oarsub. Then, check that the job resumes correctly. It is recommended to repeat this operation at least once more, to make sure that a resumed job can also be checkpointed.
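
A possible test sequence (job id and script name are illustrative) is sketched below:

# 1. Submit the job and note the returned OAR_JOB_ID
oarsub -S ./test_script.sh
# 2. Once the program is running, force a checkpoint
oardel --checkpoint <jobid>
# 3. Check that the dump was written in the working directory
ls checkpoint/
# 4. Re-submit manually and verify that the program resumes from the dump
oarsub -S ./test_script.sh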