A cluster is a set of machines consisting of one or more frontends and compute nodes. A frontend is the machine to which the user connects via ssh, while the compute nodes are the machines on which the user's programs actually run.
All programs are launched from a frontend through a resource manager, which finds and allocates the required resources among the nodes.
The resource manager used is OAR.
A job is what the user submits to the resource manager to execute a program. It consists of a description of the resources needed to run the program and the commands to execute the program, usually provided in a script.
When submitting a job, the resource manager expects the number of nodes, the number of cores per node, and the maximum computing time (walltime). If the user omits this information, default values apply: 1 node, 1 core and a 2-hour maximum.
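As a minimal, hedged sketch (the script name my_script.sh and the project name are placeholders), the two cases look like this:
# Submission relying entirely on the default resources
oarsub --project <your-project> ./my_script.sh
# The same submission with the resources stated explicitly
oarsub --project <your-project> -l /nodes=1/core=1,walltime=02:00:00 ./my_script.sh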
A job has a life cycle: once submitted it waits in the queue, is launched, runs, and finally terminates (successfully or not).
The command to submit a job is oarsub. This command has many options, which can be listed with the help option: oarsub --help.
The most basic form of use is the one where you only specify the project and the command to run: oarsub --project <your-project> <command>.
For example, here is a submission of the command /bin/hostname:
[cgirpi@froggy1 ~]$ oarsub --project cecipex /bin/hostname
[ADMISSION RULE] Set default walltime to 1800.
[ADMISSION RULE] Modify resource description with type constraints
[COMPUTE TYPE] Setting compute=YES
[ADMISSION RULE] Antifragmentation activated
[ADMISSION RULE] You requested 0 cores
[ADMISSION RULE] No antifrag for small jobs
OAR_JOB_ID=15576103
Each job is uniquely identified by its job identifier (OAR_JOB_ID). In the example, this identifier has the value 15576103.
Be aware that the command you pass as argument to oarsub, which is the one executed on the node, must have execution rights and be accessible from the nodes. In particular, if its path is not in the PATH environment variable, you must pass the command with its full path.
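For instance (a minimal sketch; my_script.sh is a placeholder name), you would give your script execution rights and submit it with its full path:
# Give the script execution rights
chmod +x $HOME/my_script.sh
# Submit it with its full path so that the nodes can find it
oarsub --project <your-project> $HOME/my_script.sh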
devel mode: The special -t devel submission mode is available to facilitate job tuning. It is limited to jobs of 2 hours maximum and allocates resources on nodes dedicated to this mode. This makes these sandbox resources much more readily available than production resources.
As of February 2024, the sandbox nodes are accessible from a dedicated front-end, which is also used for pre-production testing of the new version of OAR (v3). Usage remains the same, but you simply need to connect to the dahu-oar3 front-end before you can launch a development job:
# Connect to the dahu-oar3 frontend
ssh dahu-oar3
# Launch a devel job with the '-t devel' option and a walltime less than 30 minutes
oarsub -l /nodes=1/core=1,walltime=00:10:00 -t devel ...
To make it easier to switch from one front-end to another without having to type in a password, we advise you to install a local ssh key, if you haven’t already done so, from the Dahu front-end:
user@f-dahu:~$ ssh-keygen -t rsa
user@f-dahu:~$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
With the basic submission command above, we did not request any specific resources. Thus, as the return messages show, the OAR resource manager allocated the default walltime (1800 seconds) and reported that no compute core was explicitly requested.
Walltime and end of job: If your job ends normally (or crashes!) before the time indicated by the walltime, everything is fine and the resources are released immediately. On the other hand, if it is still running at the end of the walltime, it will be killed. So be sure to specify a walltime that is long enough to finish within the time limit, but not so large that the job waits unnecessarily in the resource manager queue.
To explicitly reserve resources, use the -l option of the oarsub command. For example:
bouttier@froggy1 $ oarsub --project test -l /nodes=4/core=1,walltime=1:00:00 /bin/hostname
[ADMISSION RULE] Modify resource description with type constraints
[COMPUTE TYPE] Setting compute=YES
[ADMISSION RULE] Antifragmentation activated
[ADMISSION RULE] You requested 4 cores
[ADMISSION RULE] No antifrag for small jobs
OAR_JOB_ID=15676746
With this command, we asked for 1 core on 4 nodes, for a total of 4 cores, for a maximum duration (walltime) of one hour.
As we have seen before, each job is identified by a number, unique within the cluster, available in the $OAR_JOB_ID environment variable, here 15676746. By default, since the job runs on resources that are not directly accessible, we do not have direct access to the standard and error outputs in which our commands may return information. In reality, these outputs are copied to two job-specific files, named OAR.$OAR_JOB_ID.stdout and OAR.$OAR_JOB_ID.stderr respectively. These files are created by OAR in the directory from which you submitted your job.
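For example, once the job above has finished, its outputs can be read from the submission directory (a simple sketch using the job identifier from the example):
# Standard output and standard error of job 15676746
cat OAR.15676746.stdout
cat OAR.15676746.stderr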
The most commonly used keywords are described in the table below:
Keywords | Meaning |
---|---|
nodes | No. of nodes requested |
core | No. of cores requested |
cpu | No. of CPUs requested |
walltime | Maximum requested execution time |
Here are several examples of resource reservations:
oarsub -l /nodes=10,walltime=00:05:00 --project test /bin/hostname
Here we reserve 10 full nodes for 5 minutes of maximum execution time.
oarsub -l /core=54,walltime=00:00:30 --project test /bin/hostname
In this case, 54 cores are reserved, which the manager will choose according to the availability of the machine (on a number of nodes that is not known at submission time), all for 30 seconds of execution time.
oarsub -l /nodes=3/cpu=1,walltime=2:00:00 --project test /bin/hostname
Finally, here we ask for 3 nodes with 1 CPU on each (the other CPU of each of these nodes may be allocated to other jobs), for a maximum of 2 hours of execution time.
Memory: The RAM allocated to you is the amount corresponding to the cores you have reserved. Thus, if you reserve 1 whole node, you have access to all of the node's RAM. On the other hand, if you ask for n cores, you will have access to RAM_of_the_node*(n/nb_total_core_on_the_node).
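For example, on a hypothetical node with 32 cores and 192 GB of RAM, reserving 8 cores would give access to about 192*(8/32) = 48 GB of RAM.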
Requesting specific resources is done using the -l, -t and -p parameters of oarsub. The -l parameter is used to quantify the requested resources, -t to target a specific type and -p to constrain certain properties of the resources.
It is up to you to request resources that are consistent with the resources available on the clusters. You can get information about these resources via the recap.py, chandler and oarnodes commands, in addition to the information in this documentation.
With the -l parameter you control the topology of your request, from the point of view of the cluster's resources, as well as its walltime.
For an HPC cluster (Dahu, Luke), this topology is: switch > node > cpu > core.
For a GPU cluster like Bigfoot, a gpu level is added to this hierarchy.
Specifying switch=1 ensures that all the following elements of the request are connected to a single switch. Be careful: if you do not define your request more precisely, this amounts to asking for all the computing cores attached to the same switch!
switch=2/node=1 would request one node on each of two separate switches, for a total of two complete nodes. A more consistent request would be switch=1/node=2, which would ensure that both requested nodes are connected to the same switch, which can be important for distributed computing.
node=1/cpu=1/core=16 would request 16 execution cores on a single processor of a single node.
node=1/gpu=1 would request a full GPU on one node.
An example of a full -l parameter could therefore be: -l "/node=1/cpu=1/core=16,walltime=02:00:00".
A fairly common counter-example is -l "/node=1/cpu=3", which cannot work since no Dahu node has three CPUs. Upon submission you will get an error: There are not enough resources for your request.
Any property can be used as a level in the -l parameter. It just needs to be considered from the point of view of resource quantification and make sense in the cluster topology. For example, specifying cpumodel=1 is equivalent to ensuring that all compute cores are taken from the same processor model (without imposing a specific model, unlike -p), regardless of the nodes they are taken from.
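As a hedged sketch (core count and walltime are arbitrary, and my_script.sh is a placeholder), such a request could look like this:
# Ask for 32 cores, all taken from nodes sharing the same (unspecified) CPU model
oarsub -l "/cpumodel=1/core=32,walltime=01:00:00" --project <your-project> ./my_script.sh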
The -t parameter allows you to select the type of job submitted, which will select certain types of resources with potentially specific admission rules. If it is not specified, the default type is automatically assigned.
On Dahu the possibilities are:
On Bigfoot they are:
On Luke they are:
Heterogeneous type and Dahu.
The heterogeneous type is particularly important on Dahu. If this type is not specified, jobs will only run on the initial, homogeneous partition of the cluster.
The homogeneous partition consists of nodes with identical characteristics to the first nodes added to the cluster. On a cluster like Dahu that has been built by successive additions over time, this partition is relatively small.
If you want your job not to be limited to this single partition for execution, it is imperative to specify the -t heterogeneous parameter.
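As an illustration (a sketch only; the resource amounts and script name are placeholders), such a submission could look like this:
# Allow the job to run on any Dahu node, not only the homogeneous partition
oarsub -t heterogeneous -l /nodes=1/core=32,walltime=04:00:00 --project <your-project> ./my_script.sh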
All other properties are selectable via the -p parameter, which can be specified multiple times.
Some useful properties are, for example, for Dahu:
and for Bigfoot:
The other properties are either not useful for resource selection or are exploited via other mechanisms (-l and -t). Using them will be counterproductive without a very thorough understanding of their operation combined with a very specific use case.
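For illustration only (a sketch: the property value 'Gold 6130' is an assumption and may not exist on the cluster), the -p parameter takes an expression on resource properties:
# Constrain the reserved cores to nodes whose cpumodel property matches the given value
oarsub -p "cpumodel='Gold 6130'" -l /nodes=1/core=16,walltime=02:00:00 --project <your-project> ./my_script.sh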
A common practice for getting to grips with cluster computing is to submit an interactive job. In short, you ask for resources, and the resource manager connects you to one of them (a node on which it has allocated you a core), from which you can run your commands/scripts/programs interactively. At the end of the walltime, your session is automatically killed. If you leave it earlier, the manager frees your resources.
Interactive submission is done using the -I option of the oarsub command. For example:
bouttier@froggy1 $ oarsub -I -l /nodes=1,walltime=1:00:00 --project admin
[ADMISSION RULE] Modify resource description with type constraints
[COMPUTE TYPE] Setting compute=YES
[ADMISSION RULE] Antifragmentation activated
[ADMISSION RULE] You requested 16 cores
[ADMISSION RULE] Disabling antifrag optimization as you are already using /node property
OAR_JOB_ID=15676747
Interactive mode : waiting...
Starting...
Connect to OAR job 15676747 via the node frog80
bouttier@frog80 $
We can make several remarks here:
- we launched the oarsub command on the frontend (froggy1 in the first prompt)
- the -I option has been included
- no command to execute (such as /bin/hostname) was specified
- OAR connected us to the frog80 node.
We see here, with the second prompt, that we are connected to the frog80 node, from which we can manipulate files, run commands and run scripts interactively. To quit the job, we just leave the node via the exit command.
Interactive submission is useful for getting to grips with the cluster and doing tests. On the other hand, it is not recommended for automatically launching a set of commands (file preparation, preparation of the software environment, execution of the main program, recovery of output files and cleanup).
In this use case, we will submit a job via a submission script. This submission script can be written in different interpreted languages, most often bash or python.
The submission script will contain the set of instructions you need to perform your experiment, as well as OAR directives that tell the resource manager everything it needs to know. These directives are indicated by the #OAR string at the beginning of a line. Please note that #OAR is not a comment in your submission script.
The submission script is then passed as a parameter to the oarsub command using the -S option. It must first be made executable using the chmod +x command.
As a picture is worth a thousand words, let's move on to a concrete example. Here is a submission script that is completely useless but will illustrate all the mechanisms we need to present here. We will therefore logically call it dumb.sh, and its contents are as follows:
#!/bin/bash
#OAR -n Hello_World
#OAR -l /core=1,walltime=00:01:30
#OAR --stdout hello_world.out
#OAR --stderr hello_world.err
#OAR --project test
cd /bettik/bouttier/
/bin/hostname >> dumb.txt
We can distinguish two main parts in this script: the lines beginning with #OAR, which are the OAR directives, and the classic shell commands.
For the latter, this script, which will only work on the luke and dahu clusters:
- moves into the /bettik/bouttier/ directory
- appends the output of the /bin/hostname command to the dumb.txt file. For it to work on froggy, you would have to work in the /scratch folder instead of the /bettik folder.
For the OAR directives, we recognise some familiar options of oarsub. Let's look line by line at what we ask the resource manager to do:
#OAR -n Hello_World
Here we name our job Hello_World.
#OAR -l /core=1,walltime=00:01:30
We ask for 1 core for 1 minute 30 seconds maximum.
#OAR --stdout hello_world.out
The standard output file will be called hello_world.out.
#OAR --stderr hello_world.err
The standard error file will be called hello_world.err.
#OAR --project test
We indicate that we belong to the test project.
In reality, these directives are simply the options we normally pass to the oarsub command. Note that all the options described here can also be used on the command line with oarsub.
Here, the standard and error output files are named independently of the job ID. This can be dangerous: if you submit the script multiple times, each job will write to the same files and you will potentially lose the output of previous jobs.
Before submitting the script, you must first make it executable:
chmod +x dumb.sh
Now we can submit it:
oarsub -S ./dumb.sh
Once the requested resources are available, the series of commands described in the script will be executed.
A moldable job is obtained by chaining several -l directives in the same oarsub command. For example:
oarsub -l "/nodes=1/core=32,walltime=08:00:00" -l "/nodes=1/core=16,walltime=14:00:00" --project test runme.sh
With this request, you give OAR the possibility to choose one or the other of your resource requests according to the availability of the cluster. It will automatically choose the one that will finish first.
To find out the status and characteristics of a submitted job, use the oarstat command. Executed alone, this command returns the status of the jobs submitted by all users on the cluster. For readability reasons, we will restrict it to a single user using the -u option followed by the login:
bouttier@f-dahu:~$ oarstat -u bouttier
Job id S User Duration System message
--------- - -------- ---------- ------------------------------------------------
4936641 W bouttier 0:00:00 R=1,W=0:1:30,J=B,N=Hello_World,P=admin,T=heterogeneous
The result lists all the jobs submitted by this user; we see that there is only one. We see its identifier, its status (here W for Waiting: it is waiting to be launched), how much time has elapsed since the beginning of its execution (here 0, since it is still waiting) and its characteristics (requested resources, walltime, name, project, type).
To get detailed information about a job, once we have its identifier, we can execute the following command:
bouttier@f-dahu:~$ oarstat -f -j 4936641
Job_Id: 4936641
job_array_id = 4936641
job_array_index = 1
name = Hello_World
project = admin
owner = bouttier
state = Terminated
wanted_resources = -l "{type = 'default'}/core=1,walltime=0:1:30"
types = heterogeneous
dependencies =
assigned_resources = 1067
assigned_hostnames = dahu66
queue = default
command = dumb.sh
exit_code = 32512 (127,0,0)
launchingDirectory = /home/bouttier
stdout_file = hello_world.out
stderr_file = hello_world.err
jobType = PASSIVE
properties = ((((hasgpu='NO') AND compute = 'YES') AND sequential='YES') AND desktop_computing = 'NO') AND drain='NO'
reservation = None
walltime = 0:1:30
submissionTime = 2019-11-01 16:17:58
startTime = 2019-11-01 16:18:09
stopTime = 2019-11-01 16:18:11
cpuset_name = bouttier_4936641
initial_request = oarsub -S dumb.sh; #OAR -n Hello_Worl; #OAR -l /core=1,walltime=00:01:3; #OAR --stdout hello_world.ou; #OAR --stderr hello_world.er; #OAR --project admi
message = R=1,W=0:1:30,J=B,N=Hello_World,P=admin,T=heterogeneous (Karma=0.000,quota_ok)
scheduledStart = no prediction
resubmit_job_id = 0
events =
2019-11-01 16:18:12> SWITCH_INTO_TERMINATE_STATE:[bipbip 4936641] Ask to change the job state
Here, the line state = Terminated tells us that it is now finished.
If you have started a job and found that there is no need for it to continue (useless calculation, error in the submission script, etc.), you can delete your submission using the oardel command:
bouttier@f-dahu:~$ oarsub -S dumb.sh
[PARALLEL] Small jobs (< 32 cores) restricted to tagged nodes
[ADMISSION RULE] Modify resource description with type constraints
OAR_JOB_ID=4936645
bouttier@f-dahu:~$ oardel 4936645
Deleting the job = 4936645 ...REGISTERED.
The job(s) [ 4936645 ] will be deleted in the near future.
On an OAR cluster, you can see a karma value when you submit an interactive job, when you request the status of a job (oarstat -j $JOB_ID) or when you check your accounting data. This value is used to ensure fair sharing of the resources on each GRICAD cluster: the system tries to share resources equitably among users. The lower your karma, the more likely it is that your job will start before that of a user with a higher karma value. Karma is a function of how much computing time you have requested and actually consumed in the past (during a sliding window, usually a week or two depending on the platform). Note, however, that karma (and the fair-sharing algorithm) is only used when the system is full of jobs; most of the time, scheduling is FIFO with backfilling.