On our clusters, thanks to OAR, users can tune their job submissions very finely. Before going further, make sure you have a good understanding of the behaviour and usage of the OAR commands, as described here.
This feature is still experimental. We advise you to test it before automating your submissions or workflows.
In some cases, a job needs to run longer than the maximum allowed walltime, and the program has no way to save its state so that it can restart from where it stopped. On Dahu compute nodes, CRIU, coupled with OAR via a daemon, allows jobs to checkpoint themselves. CRIU can dump an entire process tree (sequential jobs only, no inter-node communication handling) and its memory to disk, so that the job can be stopped and later restarted from the dump.
A simple mechanism links CRIU to the checkpoint and idempotent features of OAR, so that a long job is automatically checkpointed shortly before its walltime is reached and automatically restarted afterwards.
To set up such a job, you can adapt the following script to your needs:
#!/bin/bash
#OAR -n test_script
#OAR -t idempotent
#OAR -l /nodes=1/core=4,walltime=00:05:00
#OAR -O test.%jobid%.stdout
#OAR -E test.%jobid%.stderr
#OAR --project test
#OAR --checkpoint 240
#OAR --notify mail:Bruno.Bzeznik@univ-grenoble-alpes.fr
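# The "idempotent" job type and the "--checkpoint 240" directive above make OAR
# send the checkpoint signal (SIGUSR2 by default) 240 seconds before the walltime,
# and automatically re-submit the job when it exits with code 99.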
# Timeout to adapt: 600 is a good value for bigger jobs
RESUME_TIMEOUT=90
# Handler for checkpointing signal sent by OAR
handler() {
    echo "Caught checkpoint signal at: $(date)"
    echo "Checkpointing..."
    echo -e "$PROG_PID\n$(pwd)" > /var/lib/checkpoints/$OAR_JOB_ID.checkpoint
}
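# The file written above is picked up by the node's checkpoint daemon, which uses
# CRIU to dump the process tree into the checkpoint sub-directory of the working directory.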
trap handler SIGUSR2
# Load environment
source /applis/site/nix.sh
# A checkpoint exists, resuming it
if [ -e checkpoint_ok ]
then
    rm -f checkpoint/pidfile
    sleep 30
    echo -e "$(pwd)" > /var/lib/checkpoints/$OAR_JOB_ID.resume
    # Wait for the restore (for pidfile to be created)
    declare -i c=1
    while [ \! -e checkpoint/pidfile -a $c -le $RESUME_TIMEOUT ]
    do
        sleep 1
        let c++
    done
    if [ $c -eq $RESUME_TIMEOUT ]
    then
        echo "ERROR: Timeout on resume!" >&2
        exit 3
    fi
    sleep 5
    PROG_PID=$(cat checkpoint/pidfile)
# No checkpoint, starting the program
else
    nohup stress --cpu 4 --io 4 --vm 2 -v &
    PROG_PID=$!
fi
# Wait for $PROG_PID (`wait` does not work in all cases, and
# bash kills the script when a trap occurs within a wait)
while [ -e /proc/$PROG_PID ]
do
    sleep 1
done
# Now that the process has exited, we have to wait for the checkpoint
# to be finished. The checkpoint_ok file is removed only before doing
# a new checkpoint.
while [ \! -e checkpoint_ok ]
do
    sleep 1
done
# Idempotent job exits with 99 code to be automatically re-submitted
exit 99
The job working directory MUST be on a shared filesystem, such as /bettik, as the dump is created inside a checkpoint sub-directory that needs to be reachable from every node for the resume. All the files opened by your program must, of course, also be on a shared filesystem. Be careful about temporary directories!
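For instance, assuming the script above is saved as checkpoint_job.sh in a directory under /bettik (both names are purely illustrative), the submission could look like this:

# Submit from a shared directory so that the checkpoint dump is reachable from any node
cd /bettik/$USER/my_long_job
chmod +x checkpoint_job.sh
oarsub -S ./checkpoint_job.sh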
The example-specific parts of the script above are to be replaced; in particular, replace the stress command (used here as a dummy workload) by your own program. Your program launch command must be placed between nohup and &. The final & is very important, as your program has to run in the background for the script to catch the PID to be checkpointed.
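For example, with a hypothetical binary called my_simulation (the name and options are purely illustrative), the launch lines would become:

# Start the real program in the background and record its PID for checkpointing
nohup ./my_simulation --input params.in &
PROG_PID=$!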
In some cases, the resume may fail. One of the reasons is that the PID used by the program could not be restored because it was already in use by another process on the node. In that case, the automatic re-submission chain stops and you will have to submit your job again manually, hoping that the new job will start on a node where the requested PID is available.
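Re-submitting after such a failure simply means running oarsub again on the same script, from the same working directory: since the checkpoint_ok file and the dump are still present, the new job will try to resume from the existing checkpoint (script name as in the illustrative example above):

# Re-submit manually; the script will resume from the existing dump in ./checkpoint
oarsub -S ./checkpoint_job.sh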
The output of your program will always go into the output files of the first job of the idempotency tree (test.%jobid%.stdout in our example, with the jobid of the initial job).
It is important to test whether the checkpoint works. To do so, you can force the job to checkpoint itself with the following command:
oardel --checkpoint $JOB_ID
A checkpoint directory should be created in the working directory, and the job should complete (be killed). As this is a manual checkpoint, you will have to restart your job manually, so submit it again with oarsub. Then, check that the job resumes correctly. It is recommended to repeat this operation at least once more, to make sure that a resumed job can itself be checkpointed again.
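A possible test cycle, again assuming the script is saved as checkpoint_job.sh (the grep pattern only extracts the job id from the oarsub output), could look like this:

# 1. Submit the job and keep its id
JOB_ID=$(oarsub -S ./checkpoint_job.sh | grep -oP 'OAR_JOB_ID=\K[0-9]+')
# 2. Once the job is running, force a checkpoint: the dump is written into ./checkpoint
#    and the job is killed
oardel --checkpoint $JOB_ID
# 3. Re-submit manually: the script detects the checkpoint_ok file and resumes from the dump
oarsub -S ./checkpoint_job.sh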