Managing a workflow with Snakemake

Snakemake is a workflow engine, inspired by Make, written in Python, designed for the reproducible and traceable automation of pipelines. It lets you define rules describing how to generate output files from input files using scripts or commands. The workflow is expressed in a Snakefile, in Python syntax, and Snakemake automatically builds the dependency graph to execute the steps in the right order, in parallel if possible.

This page is a guide to getting to grips with Snakemake and using it for your projects. However, Snakemake is a project with a very active community, and new versions with new features are released very regularly. More information, including the latest updates, is available at https://snakemake.readthedocs.io/en/stable/#.

Reference paper: Mölder, F., Jablonski, K. P., et al. (2021) Sustainable data analysis with Snakemake. F1000Research, 10, 33. https://doi.org/10.12688/f1000research.29032.1

Installing Snakemake

Snakemake can easily be installed on your local machine or in a virtual environment (Conda, Venv, Poetry, etc.) using the command:

pip install snakemake

If the pip command is not recognised, check that Python is correctly installed on your machine or in your virtual environment and, in the latter case, that your virtual environment is correctly activated.

Creating rules for workflow management

What is a rule?

Snakemake workflow management is controlled by rules.

A rule is defined (as a minimum) by:

  • a name (e.g. my_rule);
  • one or more input files;
  • one or more output files;
  • a shell command to generate the outputs from the inputs.

A rule is correct when executing the shell command generates the specified output files from the specified input files.

How do I create a rule?

To create a rule, you need to start by creating a Snakefile in your working directory. This file can simply be named Snakefile, or be given the .smk extension specific to Snakemake (for example, you can create a my_snakefile.smk file at the root of this directory). This file will record all the rules that will govern your workflow. The Snakefile is based on the Python language, so you can also write additional scripts in Python.

The syntax of a Snakemake rule is as follows:

rule my_rule: 
    input: 
        "path/to/my/input/file1", 
        "path/to/my/input/file2",
        ...
    output: 
        "path/to/my/output/file1", 
        "path/to/my/output/file2",
        ...
    shell: 
        "myshellcommand"

In this my_rule example, the shell command myshellcommand is triggered by the existence of all the files specified in the input field and generates the files specified in the output field.

The absence of one or more files specified in the input field will prevent the rule from executing.

The absence of one or more of the files specified in the output field after execution of the command given in the shell field will cause an error which will stop the workflow from executing. In addition, any other output files generated will be automatically deleted, even if they are valid.

How to organise a workflow?

A workflow is defined by one or more Snakemake rules, all recorded in the Snakefile, which are chained together until the final outputs are created.

There are four key principles to remember when designing your workflow based on rules:

  • A rule is only executed if all the input files specified in its input field exist.
  • Files can be used as input by (and therefore trigger) several rules at the same time.
  • The output files of one rule can be used as input files for another rule, even if they do not yet exist when the workflow is launched.
  • The workflow stops when there are no more rules to execute or when the final outputs are reached (defined in the all rule).

Therefore, by cascading rule inputs and outputs, it is possible to design a network from which Snakemake will deduce the order in which the rules are executed.

The order in which the rules are written in the Snakefile is irrelevant: the actual order will be automatically deduced from the inputs and outputs of each rule. However, you must remember to create an all rule containing only an input field, in which you specify the final files generated by the workflow. This all rule must be placed at the beginning of the Snakefile.

Basic example of workflow

Let’s take an example made up of four rules.

We have a working directory in which Snakemake is installed. In this directory, we find a file1 file and a Snakefile named ABCD.smk whose contents are shown below:

# All rule: the final file generated by this workflow is file5
rule all: 
    input: 
        "file5"

# Copy file1 to file2
rule A: 
    input: 
        "file1"
    output: 
        "file2"
    shell: 
        "cp file1 file2"

# Copy file2 to file3
rule B: 
    input: 
        "file2"
    output: 
        "file3"
    shell: 
        "cp file2 file3"

# Copy file2 to file4
rule C: 
    input: 
        "file2"
    output: 
        "file4"
    shell: 
        "cp file2 file4"

# Creation of file5 thanks to the existence of file3 and file4
rule D: 
    input: 
        "file3", 
        "file4"
    output: 
        "file5"
    shell: 
        "touch file5"

Let’s analyse the previous Snakefile in parallel with a workflow simulation:

  • In our directory, the presence of the file file1 will trigger rule A, since this file constitutes all the input required for this rule. No other rule in the Snakefile requires file1 as input. The file2 file is created by the command given in the shell field of rule A. No error is returned, since the output field specifies that only file2 will be created by rule A.
  • The appearance of the file2 file will trigger rules B and C. If several jobs are made available to Snakemake, these rules can be run in parallel (see the section on job management). Otherwise, they will be executed sequentially. When they are executed, the file3 and file4 files appear.
  • Since file3 and file4 are the inputs to rule D, rule D is finally executed and file5 is created.
  • The file5 file is not an input to any of the rules specified in the Snakefile. There are no rules left to execute. The general execution of the workflow stops without error.

From this Snakefile, Snakemake is therefore able to deduce the order of the steps to be executed, which can be represented in the form of a graph:

flowchart TD
    S{Start} -->|file1 exists| A
    A -->|cp file1 file2| B
    B -->|cp file2 file3| D
    A -->|cp file1 file2| C
    C -->|cp file2 file4| D
    D -->|touch file5| stop{Stop}
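
Assuming file1 already exists in the working directory, this example can be launched with a single core (the launch options are detailed later in this guide):

snakemake -s ABCD.smk --cores 1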

What happens if there is an error?

As we saw above, the execution of the entire workflow depends on the correct generation of the outputs of each stage, which are used as inputs for the following stages. But what do you do if an error occurs at an intermediate stage of the workflow?

This is one of Snakemake’s strengths: if the workflow is interrupted during execution, Snakemake is able to analyse the files in the working directory and restart the workflow from the interrupted step. What’s more, if some of the outputs from steps that have already been executed are modified, Snakemake restarts the workflow from the most upstream step affected by the modification.

Basic example of workflow with interruption

Let’s go back to the previous workflow, and imagine that the workflow is interrupted because of an error during the execution of rule C. As rule C fails, file file4 is not created and rule D cannot be started, as not all its input files are present. The workflow stops.

We find the source of the error in the execution of rule C, and restart the workflow.

Snakemake’s “reasoning” will be as follows:

  • Both file1 and file2 exist, and file2 is newer than file1. Rule A is therefore not rerun.
  • The file2 and file3 files exist, and the file3 file is newer than the file2 file. Rule B is not rerun.
  • The file2 file exists, but the file4 file does not: step C must be rerun.
  • The file4 file does not exist: rule D cannot be run.

From this analysis, Snakemake deduces that the workflow must be restarted from rule C; once the file4 file has been created, rule D will be launched. Steps A and B will not be rerun.

Note that if the file1 file has been modified after the interrupted execution, it becomes more recent than the file2 file. Rule A, and therefore the entire workflow, will then be rerun the next time the workflow is launched.

Additional features

Interpolating

Snakemake supports interpolation within rules, which makes them more generic. A reference to a field of the rule must be placed {within braces}.

# Copy file1 to file2
rule A: 
    input: 
        "file1"
    output: 
        "file2"
    shell: 
        "cp {input} {output}"

Interpolation only works within a rule. It is not possible to refer to the inputs or outputs of another rule.
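
Inputs and outputs can also be named, which makes interpolation easier to read when a rule has several of them. Below is a minimal sketch, with purely illustrative file names:

# Concatenate two named inputs into a single output
rule concat:
    input:
        first="file1",
        second="file2"
    output:
        "file_combined"
    shell:
        "cat {input.first} {input.second} > {output}"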

Wildcards

Snakemake also allows abstraction in file names, so that the same rule can be applied several times to different files.

Let’s take an example. Below is a Snakefile:

# Number of files
nb_files = 4

# Definition of the final output
rule all:
    input: 
        "final_output"

# Creation of 'nb_files' input_file files
rule A:
    input: 
        "initial_file"
    output: 
        expand("input_file{file_id}", file_id=range(1, nb_files+1))
    shell: 
        "touch {output}"

# Input_files are copied to the same number of output_files
rule B: 
    input: 
        "input_file{file_id}"
    output: 
        "output_file{file_id}"
    shell: 
        "cp input_file{wildcards.file_id} output_file{wildcards.file_id}"

# If all output_files are present, generate the final_output file
rule C: 
    input: 
        expand("output_file{file_id}", file_id=range(1, nb_files+1))
    output: 
        "final_output"
    shell:
        "touch final_output"

In this example, wildcards are used in rule B to copy input_files to output_files indexed in the same way (input_file1 is copied to output_file1, input_file2 is copied to output_file2, etc.). The file_id index is named in the input and output fields, and referred to in the shell field using wildcards.file_id.

There can be several wildcards in the same rule. Simply name them differently and refer to them in the shell field by their respective names.
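
For instance, a rule with two wildcards might look like this (the file naming scheme is purely illustrative):

# Copy files indexed by both a sample identifier and a lane identifier
rule copy_pair:
    input:
        "sample{sample_id}_lane{lane_id}"
    output:
        "copy_sample{sample_id}_lane{lane_id}"
    shell:
        "cp sample{wildcards.sample_id}_lane{wildcards.lane_id} copy_sample{wildcards.sample_id}_lane{wildcards.lane_id}"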

The expand function is used to generate lists of file names by filling in one or more parameters, as in the example above.
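
The expand function can also combine several parameters; the following call (with illustrative values) produces the four names output_a_1, output_a_2, output_b_1 and output_b_2:

expand("output_{letter}_{number}", letter=["a", "b"], number=[1, 2])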

Functions

As mentioned above, the Snakefile is a file based on the Python language. It is therefore possible to write Python functions in the Snakefile and call them in the rule fields (except in the shell field).

Below is an example taken from the official Snakemake documentation (https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html):

def myfunc(wildcards):
    return [... a list of input files depending on given wildcards ...]

rule:
    input:
        myfunc
    output:
        "someoutput.{somewildcard}.txt"
    shell:
        "..."

More information on using the functions in the Snakefile is available in the official Snakemake documentation.

Run a workflow with Snakemake

Once the Snakefile has been written, you need to be able to run the workflow managed by Snakemake.

This can be done from a terminal using the snakemake command, with the options we will describe here.

Execution and resource allocation

To launch a workflow execution with Snakemake, you need to specify a Snakefile and the resources to be used (jobs, cores, etc.).

If you allocate several jobs to the execution of your workflow, Snakemake will take care of efficiently parallelizing the steps that can be parallelized.

Launching the workflow with 1 core:

snakemake -s my_snakefile.smk --cores 1

Launching the workflow with 4 jobs:

snakemake -s my_snakefile.smk --jobs 4
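
Before launching a long workflow, the --dry-run (or -n) option can be used to display the steps Snakemake plans to execute without actually running them:

snakemake -s my_snakefile.smk --cores 1 --dry-run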

Graph generation

Snakemake also offers a command to generate a graph from the various rules you have specified in the Snakefile, without running the workflow. This graph is returned in Graphviz Dot language.

This graph can be generated simply with the command:

snakemake -s my_snakefile.smk --dag
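
If Graphviz is installed locally, the Dot output can be piped to the dot command to produce an image of the graph, for example:

snakemake -s my_snakefile.smk --dag | dot -Tpng > dag.png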

Snakemake and computing clusters

Some use cases may require you to run a workflow totally or partially on a computing cluster (for example, on Bigfoot if your workflow includes AI inference steps).

The aim of this section is to explain how to set up communication between your local machine and a computing cluster as part of a workflow managed by Snakemake.

Prerequisites

Before you start, make sure you have an accessible personal space on your computing cluster. You must have the necessary authorisations to be able to access the computing infrastructure. Check your SSH authorisations and configurations before proceeding. You will also need to create a Perseus account and possibly request access to a project in order to continue. More information is available at https://gricad-doc.univ-grenoble-alpes.fr/hpc/connexion/.

Running part of the workflow on a computing cluster

Let’s say you have a workflow described by the following Snakefile:

# Number of files
nb_files = 4

# Definition of the final output (local)
rule all:
    input: 
        "final_output"

# Creation of 'nb_files' input_file files (local)
rule preprocessing:
    input: 
        "initial_file"
    output: 
        expand("input_file{file_id}", file_id=range(1, nb_files+1))
    shell: 
        "touch {output}"

# Generation of output_files from input_files via cluster calculation
rule cluster_inference: 
    input: 
        "input_file{file_id}"
    output: 
        "output_file{file_id}"
    shell: 
        "calcul_cluster.sh" # Cluster inference with oarsub

# If all output_files are present, generate the final_output file (local)
rule postprocessing: 
    input: 
        expand("output_file{file_id}", file_id=range(1, nb_files+1))
    output: 
        "final_output"
    shell:
        "touch final_output"

The workflow therefore looks like the figure below:

flowchart TD
    S{Start} --> PreProcessing
    PreProcessing --> ClusterInference
    ClusterInference -->PostProcessing
    PostProcessing -->stop{Stop}

In this workflow, we eventually want the preprocessing and postprocessing stages to be run locally, and the cluster_inference stage to be run on the cluster (using oarsub).

As things stand, all the steps are carried out locally. So we’re going to adapt our workflow and our Snakefile so that we can run the cluster_inference part on the computing cluster.

Installing the Snakemake Cluster Generic plugin

The Snakemake community is very active and several complementary packages have been developed. In particular, the snakemake-executor-plugin-cluster-generic package facilitates the use of computing clusters in a workflow managed by Snakemake.

Start by installing the package in your work environment:

pip install snakemake-executor-plugin-cluster-generic

Distribute rules between local execution and execution on the cluster

At the beginning of your Snakefile, you need to add a line that will tell Snakemake which rules should be run locally. By default, the other rules will be executed on the cluster.

Adapting to the example above, the line to add to the start of your Snakefile is as follows:

workflow._localrules = set(["all", "preprocessing", "postprocessing"])

Thanks to this line, Snakemake will understand that the all, preprocessing and postprocessing rules should be run locally, and that the other rules (cluster_inference, here) should be run on the cluster.
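
Note that Snakemake also provides a localrules directive for this purpose; placed at the top of the Snakefile, the equivalent declaration would be:

localrules: all, preprocessing, postprocessing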

Transferring the necessary data and files

We are going to surround our cluster_inference step with a step that transfers the data from your local machine to the computing cluster (transfer) and a step that brings the data inferred on the cluster back to your local machine (transfer_back).

Our workflow therefore takes the following form:

flowchart TD
    S{Start} --> PreProcessing
    PreProcessing --> TransferData
    TransferData --> ClusterInference
    ClusterInference --> TransferBackData
    TransferBackData --> PostProcessing
    PostProcessing -->stop{Stop}

These data transfer steps can be carried out with rsync via cargo, the GRICAD node dedicated to transferring data to and from the clusters.

The aim of the TransferData step is to transfer the data and files required for inference to the computing infrastructure. The resulting rule can therefore be written as follows:

rule transfer: 
    input: 
        expand("input_file{file_id}", file_id=range(1, nb_files+1))
    output: 
        "transfer_OK"
    shell:
        "transfer_to_cluster.sh && touch transfer_OK"

In the rule above, data and file transfers are carried out using the transfer_to_cluster.sh script. This script might look like this:

rsync -avxH "$PWD/data/" "$USER@cargo.univ-grenoble-alpes.fr:/path/to/your/cluster/directory/data/"

Once this transfer has been completed, we apply the touch transfer_OK command to indicate locally that the transfer to the cluster has been completed, and trigger the next step (for which the transfer_OK file will be one of the inputs).

The data transferred must include the jobs to be run with oarsub as well as any qsub and qstat wrappers, which make it easier to pass options at runtime.

The aim of the TransferBackData step is to repatriate locally the output generated after inference on the computing infrastructure. The resulting rule can therefore be written in the following form:

rule transfer_back: 
    input: 
        "inference_cluster_OK"
    output: 
        expand("output_file{file_id}", file_id=range(1, nb_files+1))
    shell:
        "transfer_back_from_cluster.sh && touch transfer_back_OK"

In the rule above, the output is retrieved using the transfer_back_from_cluster.sh script. This script might look like this:

rsync -avxH "$USER@cargo.univ-grenoble-alpes.fr:/path/to/your/cluster/directory/outputs/" "$PWD/outputs/" 

Once this transfer has been completed, we apply the touch transfer_back_OK command to indicate locally that the output has been repatriated. The next stage (postprocessing here) is then triggered by the repatriated output_file files, which constitute its inputs.

The transfer and transfer_back rules must be run locally. You should therefore remember to add them to the list of rules to be executed locally at the beginning of your Snakefile.

Remote execution with oarsub

To tell Snakemake that part of the workflow will be run on a compute cluster, you need to specify this in the workflow launch command.

snakemake -s my_snakefile.smk --jobs 10 --executor cluster-generic \
    --cluster-generic-submit-cmd "local_qsub.sh" \
    --cluster-generic-status-cmd "local_qstat.sh"

The --executor cluster-generic flag is used to specify to Snakemake that the workflow will be partially executed on a compute cluster.

The --cluster-generic-submit-cmd "local_qsub.sh" flag tells Snakemake which script to run to submit jobs on the cluster. local_qsub.sh is a local wrapper which will itself launch the qsub.sh wrapper previously transferred to the cluster, which will submit jobs using oarsub.

The --cluster-generic-status-cmd "local_qstat.sh" flag tells Snakemake which script to run to check the state of jobs on the cluster. local_qstat.sh is a local wrapper which will itself launch the qstat.sh wrapper previously transferred to the cluster, which will check the state of jobs with oarstat.

In the Snakefile, the cluster_inference rule will take the form:

rule cluster_inference:
    input:
        "transfer_OK" # Output file from the data transfer stage on the cluster
    output:
        "job_running"
    shell:
        "job.sh" # Script to run with oarsub on the cluster

This rule will be launched on the cluster by qsub.sh, itself launched by Snakemake with local_qsub.sh.

The contents of the local_qsub.sh file will be as follows (to be completed as required):

ssh cluster_name "/path/to/run/directory/qsub.sh $*"

This script therefore connects to the cluster via SSH and launches the qsub.sh wrapper previously transferred there. The job_running file, which marks the end of the job locally, is created later by local_qstat.sh once the job has terminated (see below).
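
The qsub.sh wrapper executed on the cluster is not shown in this guide. As a purely illustrative sketch (the project name, resources and paths are hypothetical), it could submit the script received from Snakemake with oarsub and print the resulting job identifier, which will later be passed to the status command:

#!/bin/bash
# qsub.sh (on the cluster): submit the job script received in argument with oarsub
# and print the OAR job id so that it can be used by the status wrapper
cd /path/to/run/directory
oarsub --project my-project -l /nodes=1/core=4,walltime=01:00:00 "$@" | grep "OAR_JOB_ID" | cut -d'=' -f2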

The contents of the local_qstat.sh file will be as follows (to be completed as required):

tmp=$(ssh cluster_name "/path/to/run/directory/qstat.sh $*")

oar_state=$(echo "$tmp" | cut -d' ' -f 2)
case ${oar_state} in
        Terminated) touch "job_running"; echo "success";;
        Running|Finishing|Waiting|toLaunch|Launching|Hold|toAckReservation) echo "running";;
        Error|toError) echo "failed";;
        Suspended|Resuming) echo "suspended";;
        *) echo "unknown";;
esac

The local_qstat.sh script will be called repeatedly by Snakemake until the job is completed. If the job succeeds, the job_running file will be created and will trigger the rest of the workflow (data repatriation with the transfer_back rule). Otherwise, the workflow will stop with an “output file not created” error.
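
The qstat.sh wrapper executed on the cluster is not shown here either. A minimal sketch, assuming the job identifier returned at submission is passed as the first argument, could rely on oarstat:

#!/bin/bash
# qstat.sh (on the cluster): print "<job_id>: <state>" for the job id received in argument
oarstat -s -j "$1"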

Template for a workflow running on a GRICAD cluster

Carrying out the above steps can be quite complex and time-consuming. A template to make this tool easier to use is currently being designed.