Job management

Job manager introduction

Cluster notion

A cluster is a set of machines consisting of one or more frontends and compute nodes. A frontend is the machine to which the user connects via SSH, while the compute nodes are the machines on which the user's programs actually run.

All programs are launched from a frontend through a resource manager, which finds and allocates the required resources among the compute nodes.

The resource manager used is OAR.

Job notion

A job is what the user submits to the resource manager to execute a program. It consists of a description of the resources needed to run the program and the commands to execute the program, usually provided in a script.

When submitting a job, the user describes the resources it needs: the number of nodes, the number of cores per node, and the maximum computing time (walltime). If the user omits this information, the manager applies default values: 1 node, 1 core and a 2-hour walltime (defaults can vary from one cluster to another).
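
In other words, assuming the defaults quoted above (they can differ between clusters, as the examples further below show), a submission without an explicit resource request behaves roughly like:

oarsub -l /nodes=1/core=1,walltime=2:00:00 --project <your-project> <command>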

A job has a life cycle:

  1. Submitted to the manager
  2. Waiting for free resources (Waiting)
  3. Running
  4. Terminated

OAR job management commands

Submitting a job

The command to submit a job is oarsub. This command has many options that can be found using the --help option.

[cgirpi@froggy2 ~]$ oarsub --help
Usage: /usr/lib/oar/oarsub [options] [-I|-C|<script>]
Submit a job the OAR batch scheduler
Options are:
 -I, --interactive             Request an interactive job. Open a login shell
                               on the first node of the reservation instead of
                               running a script.
 -C, --connect=<job id>        Connect to a running job
 -l, --resource=<list>         Set the requested resources for the job.
                               The different parameters are resource properties
                               registered in OAR database, and `walltime' which
                               specifies the duration before the job must be
                               automatically terminated if still running.
                               Walltime format is [hour:mn:sec|hour:mn|hour].
                               Ex: host=4/cpu=1,walltime=2:00:00
     --array <number>          Specify an array job with 'number' subjobs
     --array-param-file <file> Specify an array job on which each subjob will
                               receive one line of the file as parameter
 -S, --scanscript              Batch mode only: asks oarsub to scan the given
                               script for OAR directives (#OAR -l ...)
 -q, --queue=<queue>           Set the queue to submit the job to
 -p, --property="<list>"       Add constraints to properties for the job.
                               (format is a WHERE clause from the SQL syntax)
 -r, --reservation=<date>      Request a job start time reservation,
                               instead of a submission. The date format is
                               "YYYY-MM-DD HH:MM:SS".
     --checkpoint=<delay>      Enable the checkpointing for the job. A signal
                               is sent DELAY seconds before the walltime on
                               the first processus of the job
     --signal=<#sig>           Specify the signal to use when checkpointing
                               Use signal numbers, default is 12 (SIGUSR2)
 -t, --type=<type>             Specify a specific type (deploy, besteffort,
                               cosystem, checkpoint, timesharing)
 -d, --directory=<dir>         Specify the directory where to launch the
                               command (default is current directory)
     --project=<txt>           Specify a name of a project the job belongs to
 -n, --name=<txt>              Specify an arbitrary name for the job
 -a, --anterior=<job id>       Anterior job that must be terminated to start
                               this new one
     --notify=<txt>            Specify a notification method
                               (mail or command to execute). Ex:
                                   --notify "mail:name@domain.com"
                                   --notify "exec:/path/to/script args"
     --resubmit=<job id>       Resubmit the given job as a new one
 -k, --use-job-key             Activate the job-key mechanism.
 -i, --import-job-key-from-file=<file>
                               Import the job-key to use from a files instead
                               of generating a new one.
     --import-job-key-inline=<txt>
                               Import the job-key to use inline instead of
                               generating a new one.
 -e  --export-job-key-to-file=<file>
                               Export the job key to a file. Warning: the
                               file will be overwritten if it already exists.
                               (the %jobid% pattern is automatically replaced)
 -O  --stdout=<file>           Specify the file that will store the standart
                               output stream of the job.
                               (the %jobid% pattern is automatically replaced)
 -E  --stderr=<file>           Specify the file that will store the standart
                               error stream of the job.
                               (the %jobid% pattern is automatically replaced)
     --hold                    Set the job state into Hold instead of Waiting,
                               so that it is not scheduled (you must run
                               "oarresume" to turn it into the Waiting state)
 -s, --stagein=<dir|tgz>       Set the stagein directory or archive
     --stagein-md5sum=<md5sum> Set the stagein file md5sum
 -D, --dumper                  Print result in DUMPER format
 -X, --xml                     Print result in XML format
 -Y, --yaml                    Print result in YAML format
 -J, --json                    Print result in JSON format
 -h, --help                    Print this help message
 -V, --version                 Print OAR version number

Basic submission

In its most basic form, you specify only the project and the command to run: oarsub --project <your-project> <command>.

For example, here is a submission of the command /bin/hostname:

[cgirpi@froggy1 ~]$ oarsub --project cecipex /bin/hostname
[ADMISSION RULE] Set default walltime to 1800.
[ADMISSION RULE] Modify resource description with type constraints
[COMPUTE TYPE] Setting compute=YES
[ADMISSION RULE] Antifragmentation activated
[ADMISSION RULE] You requested 0 cores
[ADMISSION RULE] No antifrag for small jobs
OAR_JOB_ID=15576103

Each job is uniquely identified by its job identifier (OAR_JOB_ID). In the example, this identifier has the value 15576103.

Be aware that the command you pass as an argument to oarsub, which will be executed on the node, must have execute permissions and be accessible from the compute nodes. In particular, if the command is not in your PATH environment variable, you must pass it with its full path.
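
As a minimal sketch (the script name and path are purely illustrative):

# Make sure your script is executable...
chmod +x /home/myuser/myscript.sh
# ...and submit it with its full path so the nodes can find it
oarsub --project <your-project> /home/myuser/myscript.sh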

Submission with explicit resources request

With the previous submission command, we did not request any specific resources. As the return messages show, the OAR resource manager therefore allocated us the default walltime (1800 seconds) and noted that no compute core was explicitly requested.

Walltime and end of job: If your job ends normally (or crashes!) before the time indicated by the walltime, everything is fine: the resources are released immediately. On the other hand, if the job is still running at the end of the walltime, it will be killed. So specify a walltime long enough for the job to finish within the limit, but not so large that it waits unnecessarily in the resource manager queue.

To explicitly reserve resources, use the -l option of the oarsub command. For example:

bouttier@froggy1 $ oarsub --project test -l /nodes=4/core=1,walltime=1:00:00 /bin/hostname
[ADMISSION RULE] Modify resource description with type constraints
[COMPUTE TYPE] Setting compute=YES
[ADMISSION RULE] Antifragmentation activated
[ADMISSION RULE] You requested 4 cores
[ADMISSION RULE] No antifrag for small jobs
OAR_JOB_ID=15676746

With this command, we asked for 1 core on 4 nodes, for a total of 4 cores, for a maximum duration (walltime) of one hour.

As we have seen before, each job is identified by a number, unique within each cluster, contained in the $OAR_JOB_ID environment variable, here 15676746. By default, since the job runs on resources that are not directly accessible, we do not have direct access to the standard output and error streams in which our commands may print information. In reality, these streams are copied to two job-specific files, named OAR.$OAR_JOB_ID.stdout and OAR.$OAR_JOB_ID.stderr respectively. These files are created by OAR in the directory from which you submitted the job.
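
For instance, once the job above has finished, you can read its output from the submission directory (job ID taken from the example above):

cat OAR.15676746.stdout
cat OAR.15676746.stderr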

The most commonly used keywords are described in this table:

Keyword    Meaning
nodes      Number of nodes requested
core       Number of cores requested
cpu        Number of CPUs requested
walltime   Maximum requested execution time

Here are several examples of resource reservations:

oarsub -l /nodes=10,walltime=00:05:00 --project test /bin/hostname

Here we reserve 10 full nodes for 5 minutes of maximum execution time.

oarsub -l /core=54,walltime=00:00:30 --project test /bin/hostname

In this case, 54 cores are reserved, which the manager will choose according to the availability of the machine (on an undefined number of nodes at the time of submission), all for 30 seconds of execution time.

oarsub -l /nodes=3/cpu=1,walltime=2:00:00 --project test /bin/hostname

Finally, here we ask for 3 nodes with 1 CPU on each (the other CPUs of these nodes remain available for other jobs), for a maximum of 2 hours of execution time.

Memory: The RAM allocated to you corresponds to the cores you have reserved. Thus, if you reserve a whole node, you have access to all of the node's RAM. On the other hand, if you ask for n cores, you will have access to RAM_of_the_node * (n / total_cores_on_the_node).
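
For example, on a hypothetical node with 16 cores and 64 GB of RAM (illustrative figures, not the specification of any particular cluster), reserving 4 cores gives you access to 64 GB * (4/16) = 16 GB of RAM.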

Submitting an interactive job

A common way to get a good understanding of cluster computing is to submit an interactive job. In short, you ask for resources, and the resource manager connects you to one of them (a node on which it has allocated you at least one core), from which you can run your commands, scripts and programs interactively. At the end of the walltime, your session is automatically killed. If you leave it earlier, the manager frees your resources.

Interactive submission is done using the -I option of the oarsub command. For example:

bouttier@froggy1 $ oarsub -I -l /nodes=1,walltime=1:00:00 --project admin
[ADMISSION RULE] Modify resource description with type constraints
[COMPUTE TYPE] Setting compute=YES
[ADMISSION RULE] Antifragmentation activated
[ADMISSION RULE] You requested 16 cores
[ADMISSION RULE] Disabling antifrag optimization as you are already using /node property
OAR_JOB_ID=15676747
Interactive mode : waiting...
Starting...

Connect to OAR job 15676747 via the node frog80

bouttier@frog80 $

We can make several remarks here:

  • We launched oarsub on the frontend (froggy1 in the first prompt)
  • The -I option has been included
  • We did not specify a command to execute (no /bin/hostname)
  • The job manager allocated us 16 cores since we asked for 1 whole node (which has 16 cores on the froggy cluster)
  • Once the interactive job is launched, we are connected to our job via the frog80 node.

We see here, with the second prompt, that we are connected to the frog80 node, from which we can manipulate files, run commands and launch scripts interactively. To quit the job, simply leave the node via the exit command.

Submitting a job via a submission script

Interactive submission is useful for getting to grips with the cluster and for doing tests. However, it is not recommended for automatically launching a set of commands (file preparation, setting up the software environment, execution of the main program, retrieval of output files and cleanup).

In this use case, we submit a job via a submission script. This script can be written in different interpreted languages, most often Bash or Python.

The submission script contains the set of instructions you need to perform your experiment, as well as OAR directives that tell the resource manager everything it needs to know. These directives are lines beginning with the string #OAR. Note that although such lines look like shell comments, they are not mere comments: they are interpreted by OAR.

The submission script will then be passed as a parameter to the oarsub command using the -S option. It must first be made executable using the chmod +x command.

As a picture is worth a thousand words, let's look at a concrete example. Here is a submission script that is completely useless but illustrates all the mechanisms we need to present here. We will therefore logically call it dumb.sh; its contents are as follows:

#!/bin/bash

#OAR -n Hello_World
#OAR -l /core=1,walltime=00:01:30
#OAR --stdout hello_world.out
#OAR --stderr hello_world.err
#OAR --project test

cd /bettik/bouttier/
/bin/hostname >> dumb.txt

We can distinguish two main parts in this script:

  • A part dedicated to the directives that will be read by OAR, indicated by #OAR
  • The commands executed on the resources reserved on the cluster.

For the latter, this script (which will only work on the luke and dahu clusters) will:

  • change directory to the /bettik/bouttier/ folder.
  • write the output of the /bin/hostname command to the dumb.txt file. For it to work on froggy, you would have to work in the /scratch folder instead of the /bettik folder, as sketched after this list.
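
A hypothetical froggy variant of the two command lines would therefore look like this (the exact path under /scratch depends on your account):

cd /scratch/bouttier/
/bin/hostname >> dumb.txt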

The OAR directives use oarsub options that we have already encountered. Let's look line by line at what we ask the resource manager to do:

#OAR -n Hello_World

Here we name our job Hello_World.

#OAR -l /core=1,walltime=00:01:30

We ask for 1 core for 1 minute 30 seconds maximum.

#OAR --stdout hello_world.out

The standard output file will be called hello_world.out.

#OAR --stderr hello_world.err

The standard error file will be called hello_world.err.

#OAR --project test

We indicate that we belong to the test project.

In reality, these directives are simply the options we normally pass to the oarsub command. Note that all the options described here can also be used directly on the command line with oarsub.

Here, the standard output and error files are named independently of the job ID. This can be dangerous: if you submit the script multiple times, every job will write to the same files, and the output of individual jobs may be lost. To avoid this, you can include the %jobid% pattern in the file names; OAR automatically replaces it with the job identifier (see the oarsub help above and the sketch below).
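
For instance, based on the %jobid% pattern documented in the oarsub help above, the two directives could be rewritten as:

#OAR --stdout hello_world.%jobid%.out
#OAR --stderr hello_world.%jobid%.err

Each submission then produces its own pair of output files.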

Before submitting the script, you must first make it executable:

chmod +x dumb.sh

Now we can submit it:

oarsub -S ./dumb.sh

Once the requested resources are available, the series of commands described in the script will be executed.
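
Once the job has finished, you can check the result in the files named by the directives (assuming you submitted from the current directory):

cat hello_world.out
cat hello_world.err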

Following a job

To find out the status and characteristics of a submitted job, use the oarstat command. Executed alone, this command returns the status of the jobs submitted by all users on the cluster. For readability, we restrict it to a single user with the -u option followed by the login:

bouttier@f-dahu:~$ oarstat -u bouttier
Job id    S User     Duration   System message
--------- - -------- ---------- ------------------------------------------------
4936641   W bouttier    0:00:00 R=1,W=0:1:30,J=B,N=Hello_World,P=admin,T=heterogeneous

The result lists all the jobs submitted by this user; here there is only one. We see its identifier, its status (here W for Waiting: it is waiting to be launched), how much time has elapsed since the beginning of its execution (here 0 since it is still waiting) and its characteristics (requested resources, walltime, name, project, type).

To know the detailed information of a job, once we have its identifier, we can execute the following command:

bouttier@f-dahu:~$ oarstat -f -j 4936641
Job_Id: 4936641
    job_array_id = 4936641
    job_array_index = 1
    name = Hello_World
    project = admin
    owner = bouttier
    state = Terminated
    wanted_resources = -l "{type = 'default'}/core=1,walltime=0:1:30"
    types = heterogeneous
    dependencies =
    assigned_resources = 1067
    assigned_hostnames = dahu66
    queue = default
    command = dumb.sh
    exit_code = 32512 (127,0,0)
    launchingDirectory = /home/bouttier
    stdout_file = hello_world.out
    stderr_file = hello_world.err
    jobType = PASSIVE
    properties = ((((hasgpu='NO') AND compute = 'YES') AND sequential='YES') AND desktop_computing = 'NO') AND drain='NO'
    reservation = None
    walltime = 0:1:30
    submissionTime = 2019-11-01 16:17:58
    startTime = 2019-11-01 16:18:09
    stopTime = 2019-11-01 16:18:11
    cpuset_name = bouttier_4936641
    initial_request = oarsub -S dumb.sh; #OAR -n Hello_Worl; #OAR -l /core=1,walltime=00:01:3; #OAR --stdout hello_world.ou; #OAR --stderr hello_world.er; #OAR --project admi
    message = R=1,W=0:1:30,J=B,N=Hello_World,P=admin,T=heterogeneous (Karma=0.000,quota_ok)
    scheduledStart = no prediction
    resubmit_job_id = 0
    events =
2019-11-01 16:18:12> SWITCH_INTO_TERMINATE_STATE:[bipbip 4936641] Ask to change the job state

Here, the line state = Terminated tells us that it is now finished.

Deleting a job

If you have started a job and find that there is no need for it to continue (useless calculation, error in the submission script, etc.), you can delete it using the oardel command:

bouttier@f-dahu:~$ oarsub -S dumb.sh
[PARALLEL] Small jobs (< 32 cores) restricted to tagged nodes
[ADMISSION RULE] Modify resource description with type constraints
OAR_JOB_ID=4936645
bouttier@f-dahu:~$ oardel 4936645
Deleting the job = 4936645 ...REGISTERED.
The job(s) [ 4936645 ] will be deleted in the near future.

Karma

On an OAR cluster, you can see a karma value when you submit an interactive job, when you request the status of a job (oarstat -j $JOB_ID), or when you check your accounting data. This value is used to ensure fair sharing of the resources on each GRICAD cluster: the system tries to share resources equitably among users. The lower your karma, the more likely your job is to start before that of a user with a higher karma value. Karma is a function of how much computing time you have requested and actually consumed in the recent past (over a sliding window, usually a week or two depending on the platform). Note, however, that karma (and the fair-sharing algorithm) only comes into play when the system is saturated with jobs; most of the time, scheduling is FIFO with backfilling.
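
Your current karma appears, for example, in the message line of the detailed oarstat output shown earlier:

message = R=1,W=0:1:30,J=B,N=Hello_World,P=admin,T=heterogeneous (Karma=0.000,quota_ok)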