Job management

Job manager introduction

The notion of a cluster

A cluster is a set of machines consisting of one or more frontends and compute nodes. A frontend is the machine to which the user connects via ssh, while the compute nodes are the machines on which the user runs their programs.

All programs are launched from a frontend through a resource manager, which finds and allocates the required resources among the compute nodes.

The resource manager used is OAR.

The notion of a job

A job is what the user submits to the resource manager to execute a program. It consists of a description of the resources needed to run the program and the commands to execute the program, usually provided in a script.

The resource description required to submit a job includes the number of nodes, the number of cores per node, and the maximum computing time (walltime). If the user omits this information, the manager applies default values: 1 node, 1 core and a 2-hour walltime.

A job has a life cycle; it is:

  1. Submitted to the manager
  2. Waiting for free resources (Waiting)
  3. Running
  4. Terminated

OAR job management commands

Submitting a job

The command to submit a job is oarsub. This command has many options that can be found using the help option: oarsub --help.

Basic submission

The most basic form of use is the one where you only specify the project and the command to run: oarsub --project <your-project> <command>.

Below is an example of submitting the command /bin/hostname:

[cgirpi@froggy1 ~]$ oarsub --project cecipex /bin/hostname
[ADMISSION RULE] Set default walltime to 1800.
[ADMISSION RULE] Modify resource description with type constraints
[COMPUTE TYPE] Setting compute=YES
[ADMISSION RULE] Antifragmentation activated
[ADMISSION RULE] You requested 0 cores
[ADMISSION RULE] No antifrag for small jobs
OAR_JOB_ID=15576103

Each job is uniquely identified by its job identifier (OAR_JOB_ID). In the example, this identifier has the value 15576103.

Be aware that the command you pass as an argument to oarsub, and that you want to execute on the node, must have execution rights and be accessible from the nodes. In particular, if the command is not in your PATH environment variable, you must pass it with its full path.
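
As an illustration, a minimal sketch (the script name my_script.sh and its location in your home directory are hypothetical):

# Give the script execution rights
chmod +x ~/my_script.sh
# Submit it with its full path so that the nodes can find it
oarsub --project <your-project> ~/my_script.sh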

Sandboxing: the devel mode

The special -t devel submission mode is available to facilitate job tuning. It is limited to jobs of 2 hours maximum and allocates resources on nodes dedicated to this mode. This makes sandbox resources much more readily available than production resources.

As of February 2024, the sandbox nodes are accessible from a dedicated front-end, which is also used for pre-production testing of the new version of OAR (v3). Usage remains the same, but you simply need to connect to the dahu-oar3 front-end before you can launch a development job:

# Connect to the dahu-oar3 frontend
ssh dahu-oar3
# Launch a devel job with the '-t devel' option and a walltime less than 30 minutes
oarsub -l /nodes=1/core=1,walltime=00:10:00 -t devel ...

To make it easier to switch from one front-end to another without having to type in a password, we advise you to install a local ssh key, if you haven’t already done so, from the Dahu front-end:

user@f-dahu:~$ ssh-keygen -t rsa
user@f-dahu:~$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Submission with explicit resources request

With the previous submission command, we did not request any specific resources. The OAR resource manager therefore allocated us the default walltime (1800 seconds) and told us in its return message that no compute core was requested.

Walltime and end of job: If your job ends normally (or crashes!) before the time indicated by the walltime, everything is fine: the resources are released immediately. On the other hand, if it is still running at the end of the walltime, it will be killed. So be sure to specify a walltime long enough for your job to finish, but not so large that the job waits unnecessarily in the resource manager queue.

To explicitly reserve resources, use the -l option of the oarsub command. For example:

bouttier@froggy1 $ oarsub --project test -l /nodes=4/core=1,walltime=1:00:00 /bin/hostname
[ADMISSION RULE] Modify resource description with type constraints
[COMPUTE TYPE] Setting compute=YES
[ADMISSION RULE] Antifragmentation activated
[ADMISSION RULE] You requested 4 cores
[ADMISSION RULE] No antifrag for small jobs
OAR_JOB_ID=15676746

With this command, we asked for 1 core on 4 nodes, for a total of 4 cores, for a maximum duration (walltime) of one hour.

As we have seen before, each job is identified by a number, unique within each cluster, which is stored in the $OAR_JOB_ID environment variable, here 15676746. By default, since the job runs on resources that are not directly accessible, we do not have direct access to the standard and error outputs in which our commands can return information. In reality, these outputs are redirected to two job-specific files, named OAR.$OAR_JOB_ID.stdout and OAR.$OAR_JOB_ID.stderr respectively. These files are created by OAR in the directory from which you submitted your job.
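
For example, once the job above (identifier 15676746) has finished, you could inspect its outputs from the directory where you ran oarsub:

# Standard output of the job (here, the hostnames printed by /bin/hostname)
cat OAR.15676746.stdout
# Standard error of the job
cat OAR.15676746.stderr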

The most commonly used keywords are described in this table:

Keyword     Meaning
nodes       Number of nodes requested
core        Number of cores requested
cpu         Number of CPUs requested
walltime    Maximum requested execution time

Here are several examples of resource reservations:

oarsub -l /nodes=10,walltime=00:05:00 --project test /bin/hostname

Here we reserve 10 full nodes for 5 minutes of maximum execution time.

oarsub -l /core=54,walltime=00:00:30 --project test /bin/hostname

In this case, 54 cores are reserved, which the manager will choose according to the availability of the machine (on an undefined number of nodes at the time of submission), all for 30 seconds of execution time.

oarsub -l /nodes=3/cpu=1,walltime=2:00:00 --project test /bin/hostname

Finally, here, we ask for 3 nodes with 1 CPU on each (the other CPU of each node remains available for other jobs), for a maximum of 2 hours of execution time.

Memory: The RAM allocated to you corresponds to the cores you have reserved. Thus, if you reserve a whole node, you have access to all of the node's RAM. On the other hand, if you ask for n cores, you will have access to RAM_of_the_node * (n / nb_total_cores_on_the_node).
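
As a worked example (the node characteristics are hypothetical): on a node with 32 cores and 192 GB of RAM, requesting 8 cores gives you access to 192 * (8/32) = 48 GB of RAM.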


The oarsub -l, -t and -p parameters

Requesting specific resources is done using the -l, -t and -p parameters of oarsub. The -l parameter is used to quantify the requested resources, -t to target a specific type and -p to constrain certain properties of the resources.

It is up to you to request resources that are consistent with the resources available on the clusters. You can get information about these resources via the recap.py, chandler and oarnodes commands in addition to the information in this documentation.

The -l parameter

With the -l parameter you control the topology of your request, from the point of view of the cluster's resources, as well as its walltime.

For an HPC cluster (Dahu, Luke), this topology is:

switch / node / cpu / core

For a GPU cluster like Bigfoot it becomes:

switch / node / gpu / migdevice

Specifying switch=1 ensures that all the following elements of the request are connected to a single switch. Be careful, if you don’t define your request more precisely, it amounts to asking for all the computing cores attached to the same switch!

switch=2/node=1 would request one node on two separate switches for a total of two complete nodes. A more consistent request would be switch=1/node=2 which would ensure that both requested nodes are connected to the same switch which could be important for distributed computing.

node=1/cpu=1/core=16 would request 16 execution cores on a single processor on a single node.

node=1/gpu=1 would require a full GPU on one node.

An example of a full -l parameter could therefore be: -l "/node=1/cpu=1/core=16,walltime=02:00:00".

A fairly common counter example is -l "/node=1/cpu=3" which cannot work since no Dahu node has three CPUs. Upon submission you will get an error: There are not enough resources for your request.

Any property can be used as a value for the -l parameter. It just needs to be considered from the point of view of resource quantification and make sense in the cluster topology. For example, specifying cpumodel=1 ensures that all compute cores are taken from the same processor model (without imposing a specific model, unlike -p), regardless of the nodes they are taken from.
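
As an illustration, a hedged sketch of such a request (the project name, script and walltime are placeholders):

# 32 cores, all taken from nodes sharing the same (unspecified) CPU model
oarsub --project <your-project> -l /cpumodel=1/core=32,walltime=04:00:00 ./my_script.sh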

The -t parameter

The -t parameter allows you to select the type of job submitted, which targets certain types of resources with potentially specific admission rules. If it is not specified, the default type is automatically assigned.

On Dahu the possibilities are:

  • devel - development job, see above.
  • heterogeneous - allows jobs to run on all nodes in the cluster without being limited to the initial homogeneous partition.
  • fast - job targeting only high frequency CPU resources, faster but with fewer cores per CPU.
  • fat - job exclusively targeting very high memory capacity resources.
  • besteffort - useful via CiGri.
  • visu - Visualization.

On Bigfoot they are:

  • devel - development job, see above.
  • heterogeneous - allows jobs to run on all nodes in the cluster without being limited to the initial homogeneous partition.
  • besteffort - useful via CiGri.

On Luke they are:

Heterogeneous type and Dahu.
The heterogeneous type is particularly important on Dahu. If this type is not specified, jobs will only run on the initial, homogeneous partition of the cluster.
The homogeneous partition consists of nodes with identical characteristics to the first nodes added to the cluster. On a cluster like Dahu that has been built by successive additions over time, this partition is relatively small.
If you want your job not to be limited to this one partition for execution, it is imperative to specify the -t heterogeneous parameter.
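
For example, a sketch of a Dahu submission allowed to use the whole cluster (the project name, resources and script are placeholders):

# Allow the job to run outside the initial homogeneous partition
oarsub -t heterogeneous -l /nodes=1/core=8,walltime=01:00:00 --project <your-project> ./my_script.sh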

The -p parameter

All other properties are selectable via the -p parameter, which can be specified multiple times.

Some useful properties are, for example, for Dahu :

  • cpumarch
  • cpumodel
  • cpufreq
  • memcore
  • n_cores
  • total_mem
  • scratchX
  • scratchX_type
  • scratchX_loc

and for Bigfoot :

  • gpumodel
  • mem_per_gpu
  • scratchX
  • scratchX_type
  • scratchX_loc

The other properties are either not useful for resource selection or are exploited via other mechanisms (-l and -t). Using them will be counterproductive without a very thorough understanding of their operation combined with a very specific use case.
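
As an illustration, a hedged sketch combining -l and -p (the property values are assumptions; check them against what recap.py, chandler or oarnodes report on the cluster):

# On Bigfoot: request one GPU of a given model (the value is an example)
oarsub -l /nodes=1/gpu=1,walltime=02:00:00 -p "gpumodel='A100'" --project <your-project> ./my_gpu_script.sh
# On Dahu: constrain the CPU model (the value is an example)
oarsub -l /nodes=1/cpu=1,walltime=02:00:00 -p "cpumodel='Gold 6130'" --project <your-project> ./my_script.sh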


Submitting an interactive job

A common practice to get a good understanding of cluster computing is to submit an interactive job. In short, you ask for resources, and the resource manager connects you to one of them (a node on which it has allocated you a core), from which you can run your commands/scripts/programs interactively. At the end of the walltime, your session is automatically killed. If you leave it earlier, the manager frees your resources.

Interactive submission is done using the -I option of the oarsub command. For example:

bouttier@froggy1 $ oarsub -I -l /nodes=1,walltime=1:00:00 --project admin
[ADMISSION RULE] Modify resource description with type constraints
[COMPUTE TYPE] Setting compute=YES
[ADMISSION RULE] Antifragmentation activated
[ADMISSION RULE] You requested 16 cores
[ADMISSION RULE] Disabling antifrag optimization as you are already using /node property
OAR_JOB_ID=15676747
Interactive mode : waiting...
Starting...

Connect to OAR job 15676747 via the node frog80

bouttier@frog80 $

We can make several remarks here:

  • We launched oarsub on the frontend (froggy1 in the first prompt)
  • The -I option has been included
  • We did not specify a command to execute (no /bin/hostname)
  • The job manager allocated us 16 cores since we asked for 1 whole node (which has 16 cores on the Froggy cluster)
  • Once the interactive job is launched, we are connected to our job via the frog80 node.

We can see from the second prompt that we are now connected to the frog80 node, from which we can manipulate files, run commands and launch scripts interactively. To quit the job, simply leave the node with the exit command.

Submitting a job via a submission script

Interactive submission is useful for getting to grips with the cluster and running tests. However, it is not recommended for automatically launching a set of commands (file preparation, setting up the software environment, execution of the main program, retrieval of output files and cleanup).

In this use case, we submit the job via a submission script. This script can be written in different interpreted languages, most often bash or python.

The submission script contains the set of instructions you need to perform your experiment, as well as OAR directives that tell the resource manager everything it needs to know. These directives are indicated by the #OAR string at the beginning of a line. Please note that a line starting with #OAR is not a mere comment in your submission script: it is interpreted by OAR.

The submission script will then be passed as a parameter to the oarsub command using the -S option. It must first be made executable using the chmod +x command.

As a picture is worth a thousand words, let's move on to a concrete example. Here is a submission script that is completely useless but illustrates all the mechanisms we need to present here. We will therefore logically call it dumb.sh, and its contents are as follows:

#!/bin/bash

#OAR -n Hello_World
#OAR -l /core=1,walltime=00:01:30
#OAR --stdout hello_world.out
#OAR --stderr hello_world.err
#OAR --project test

cd /bettik/bouttier/
/bin/hostname >> dumb.txt

We can distinguish two main parts in this script:

  • A part dedicated to the directives that will be read by OAR, indicated by #OAR
  • The commands executed on the resources reserved on the cluster.

For the latter, this script (which will only work on the Luke and Dahu clusters) will:

  • change directory to the /bettik/bouttier/ folder.
  • write the output of the /bin/hostname command to the dumb.txt file. For it to work on froggy, you would have to work in the /scratch folder instead of the /bettik folder.

For the OAR directives, we recognize some familiar oarsub options. Let's look line by line at what we ask the resource manager to do:

#OAR -n Hello_World

Here we name our job Hello_World.

#OAR -l /core=1,walltime=00:01:30

We ask for 1 core for 1 minute 30 seconds maximum.

#OAR --stdout hello_world.out

The standard output file will be called hello_world.out.

#OAR --stderr hello_world.err

The standard error file will be called hello_world.err.

#OAR --project test

We indicate that we belong to the test project.

In reality, these directives simply carry the options we normally pass to the oarsub command. Please note that all the options described here can also be used on the command line with oarsub.

Here, the standard output and error files are named independently of the job identifier. This can be dangerous: if you submit the script multiple times, each job will write to the same files and you will potentially lose the output of previous jobs.
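
One way to avoid this, assuming your OAR version supports the %jobid% substitution in output file names, is to include the job identifier in the directives:

#OAR --stdout hello_world.%jobid%.out
#OAR --stderr hello_world.%jobid%.err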

Before submitting the script, you must first make it executable:

chmod +x dumb.sh

Now we can submit the script:

oarsub -S ./dumb.sh

Once the requested resources are available, the series of commands described in the script will be executed.
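
A sketch of how you might then follow the job and check its results (the commands only assume what the script above does):

# Check the job status
oarstat -u $USER
# Once it is Terminated, look at the output files named in the directives
cat hello_world.out
cat hello_world.err
# ... and at the file written by the script itself
cat /bettik/bouttier/dumb.txt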

Moldable jobs

A moldable job is obtained by chaining several -l directives in the same oarsub command. For example:

oarsub -l "/nodes=1/core=32,walltime=08:00:00" -l "/nodes=1/core=16,walltime=14:00:00" --project test runme.sh 

With this request, you let OAR choose one or the other of your resource requests according to the availability of the cluster. It will automatically pick the one that will finish first.

Following a job

To find out the status and characteristics of a submitted job, use the oarstat command. Executed alone, this command will return the status of jobs submitted by all users on the cluster. For readability reasons, we will restrict it to a single user using the -u option followed by the login:

bouttier@f-dahu:~$ oarstat -u bouttier
Job id    S User     Duration   System message
--------- - -------- ---------- ------------------------------------------------
4936641   W bouttier    0:00:00 R=1,W=0:1:30,J=B,N=Hello_World,P=admin,T=heterogeneous

The result lists all the jobs submitted by this user; we see that he has only one. For this job we see its identifier, its status (here W for Waiting: it is waiting to be launched), how much time has elapsed since the beginning of its execution (here 0, since it is still waiting), and its characteristics (requested resources, walltime, name, project, type).

To know the detailed information of a job, once we have its identifier, we can execute the following command:

bouttier@f-dahu:~$ oarstat -f -j 4936641
Job_Id: 4936641
    job_array_id = 4936641
    job_array_index = 1
    name = Hello_World
    project = admin
    owner = bouttier
    state = Terminated
    wanted_resources = -l "{type = 'default'}/core=1,walltime=0:1:30"
    types = heterogeneous
    dependencies =
    assigned_resources = 1067
    assigned_hostnames = dahu66
    queue = default
    command = dumb.sh
    exit_code = 32512 (127,0,0)
    launchingDirectory = /home/bouttier
    stdout_file = hello_world.out
    stderr_file = hello_world.err
    jobType = PASSIVE
    properties = ((((hasgpu='NO') AND compute = 'YES') AND sequential='YES') AND desktop_computing = 'NO') AND drain='NO'
    reservation = None
    walltime = 0:1:30
    submissionTime = 2019-11-01 16:17:58
    startTime = 2019-11-01 16:18:09
    stopTime = 2019-11-01 16:18:11
    cpuset_name = bouttier_4936641
    initial_request = oarsub -S dumb.sh; #OAR -n Hello_Worl; #OAR -l /core=1,walltime=00:01:3; #OAR --stdout hello_world.ou; #OAR --stderr hello_world.er; #OAR --project admi
    message = R=1,W=0:1:30,J=B,N=Hello_World,P=admin,T=heterogeneous (Karma=0.000,quota_ok)
    scheduledStart = no prediction
    resubmit_job_id = 0
    events =
2019-11-01 16:18:12> SWITCH_INTO_TERMINATE_STATE:[bipbip 4936641] Ask to change the job state

Here, the line state = Terminated tells us that it is now finished.

Deleting a job

If you have started a job and found that there is no need for it to continue (useless calculation, error in the submission script, etc.), you can delete your submission using the oardel command:

bouttier@f-dahu:~$ oarsub -S dumb.sh
[PARALLEL] Small jobs (< 32 cores) restricted to tagged nodes
[ADMISSION RULE] Modify resource description with type constraints
OAR_JOB_ID=4936645
bouttier@f-dahu:~$ oardel 4936645
Deleting the job = 4936645 ...REGISTERED.
The job(s) [ 4936645 ] will be deleted in the near future.

Karma

On an OAR cluster, you can see a karma value when you submit an interactive job, when you request the status of a job (oarstat -j $JOB_ID), or when you check your accounting data. This value is used to ensure fair sharing of the resources on each GRICAD cluster: the system tries to share resources equitably among users. The lower your karma, the more likely it is that your job will start before that of a user with a higher karma value. Karma is a function of how much computing time you have requested and actually consumed in the past (over a sliding window, usually a week or two depending on the platform). Note, however, that karma (and the fair-sharing algorithm) only comes into play when the system is full of jobs. Most of the time, scheduling is FIFO with backfilling.
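
If you want to check your karma outside of a submission, here is a sketch (the job identifier and dates are placeholders, and the --accounting option is an assumption that depends on the OAR version deployed on the cluster):

# Karma appears in the "message" field of the job details
oarstat -f -j <job_id> | grep -i karma
# Accounting over a time window
oarstat --accounting "2019-01-01, 2019-12-31" -u <login>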