Snakemake is a workflow engine inspired by Make and written in Python, designed for reproducible and traceable pipeline automation. It lets you define rules describing how to generate output files from input files using scripts or commands. The workflow is expressed in a Snakefile, written in Python-based syntax, and Snakemake automatically builds the dependency graph so that the steps are executed in the right order, in parallel where possible.
This page is a guide to getting to grips with Snakemake and using it for your projects. However, Snakemake is a project with a very active community, and new versions with new features are released very regularly. More information, including the latest updates, is available at https://snakemake.readthedocs.io/en/stable/#.
Reference paper: Mölder, F., Jablonski, K. P., et al. (2021) Sustainable data analysis with Snakemake. F1000Research, 10, 33. https://doi.org/10.12688/f1000research.29032.1
Snakemake can easily be installed on your local machine or in a virtual environment (Conda, Venv, Poetry, etc.) using the command:
pip install snakemake
If the pip command is not recognised, check that Python is correctly installed on your machine or in your virtual environment and, in the latter case, that your virtual environment is correctly activated.
Snakemake workflow management is based on rules.
A rule is defined (as a minimum) by:
- one or more input files,
- one or more output files,
- a shell command to generate the output from the input.
A rule is correct when executing the shell command generates the specified output files from the specified input files.
To create a rule, start by creating a Snakefile in your working directory: either a file named Snakefile, or a file with the .smk extension specific to Snakemake (for example, a my_snakefile.smk file at the root of this directory). This file will hold all the rules that govern your workflow. The Snakefile is based on the Python language, so you can also write additional Python code in it.
The syntax of a Snakemake rule is as follows:
rule my_rule:
    input:
        "path/to/my/input/file1",
        "path/to/my/input/file2",
        ...
    output:
        "path/to/my/output/file1",
        "path/to/my/output/file2",
        ...
    shell:
        "myshellcommand"
In this my_rule example, the myshellcommand is triggered by the existence of all files specified in the input field and generates the files specified in the output field.
The absence of one or more files specified in the input field will prevent the rule from executing.
The absence of one or more files specified in the output field after execution of the command given in the shell field will cause an error that stops the workflow. In addition, any other output files generated by the rule will be automatically deleted, even if they are valid.
A workflow is defined by one or more Snakemake rules, all recorded in the Snakefile, which are chained together until the final outputs are created.
There are four key principles to remember when designing your rule-based workflow:
- a rule is only triggered when all the files listed in its input field exist;
- a rule is only considered complete when all the files listed in its output field have been created;
- the outputs of one rule can serve as the inputs of another rule;
- the final files of the workflow are declared as the inputs of a dedicated rule (the all rule).
Therefore, by cascading rule inputs and outputs, it is possible to design a network from which Snakemake will deduce the order in which the rules are executed.
The order in which the rules are written in the Snakefile is irrelevant: the actual order will be automatically deduced from the inputs and outputs of each rule. However, you must remember to create an all rule containing only an input field, in which you specify the final files generated by the workflow. This all rule must be placed at the beginning of the Snakefile.
Let’s take an example made up of four rules.
We have a working directory in which Snakemake is installed. In this directory, we find a file1 file and a Snakefile named ABCD.smk whose contents are shown below:
# All rule: the final file generated by this workflow is file5
rule all:
    input:
        "file5"

# Copy file1 to file2
rule A:
    input:
        "file1"
    output:
        "file2"
    shell:
        "cp file1 file2"

# Copy file2 to file3
rule B:
    input:
        "file2"
    output:
        "file3"
    shell:
        "cp file2 file3"

# Copy file2 to file4
rule C:
    input:
        "file2"
    output:
        "file4"
    shell:
        "cp file2 file4"

# Create file5 once file3 and file4 both exist
rule D:
    input:
        "file3",
        "file4"
    output:
        "file5"
    shell:
        "touch file5"
Let’s analyse the previous Snakefile in parallel with a workflow simulation:
- The presence of file1 triggers rule A, since this file constitutes all the input required for this rule. No other rule in the Snakefile requires file1 as input. file2 is created by the command given in the shell field of rule A. No error is returned, since the output field specifies that only file2 will be created by rule A.
- The presence of file2 triggers rules B and C. If several jobs are made available to Snakemake, these rules can be run in parallel (see the section on job management). Otherwise, they are executed sequentially. Once they have run, file3 and file4 appear.
- As file3 and file4 are the inputs to rule D, rule D is finally executed and file5 is created.
- file5 is not an input to any of the rules specified in the Snakefile. There are no rules left to execute. The workflow stops without error.
From this Snakefile, Snakemake is therefore able to deduce the order of the steps to be executed, which can be represented in the form of a graph:
flowchart TD
S{Start} -->|file1 exists| A
A -->|cp file1 file2| B
B -->|cp file2 file3| D
A -->|cp file1 file2| C
C -->|cp file2 file4| D
D -->|touch file5| stop{Stop}
As we saw above, the execution of the entire workflow depends on the correct generation of the outputs of each step, which are used as inputs for the following steps. But what do you do if an error occurs at an intermediate stage of the workflow?
This is one of Snakemake’s strengths: if the workflow is interrupted during execution, Snakemake is able to analyse the files in the working directory and resume the workflow from the interrupted step. What’s more, if some of the outputs of steps that have already been executed are modified, Snakemake re-runs the workflow from the most upstream step affected by the modification.
Let’s go back to the previous workflow and imagine that it is interrupted by an error during the execution of rule C. As rule C fails, file4 is not created and rule D cannot start, since not all of its input files are present. The workflow stops.
We find the source of the error in the execution of rule C, and restart the workflow.
Snakemake’s “reasoning” will be as follows:
- file1 and file2 exist, and file2 is newer than file1: rule A is not rerun.
- file2 and file3 exist, and file3 is newer than file2: rule B is not rerun.
- file2 exists, but file4 does not: rule C must be rerun.
- file4 does not exist: rule D cannot run yet.
From this analysis, Snakemake deduces that the workflow must be restarted from rule C; once rule C has created file4, rule D is launched in turn. Steps A and B are not rerun.
Note that if file1 has been modified after the interrupted execution, it becomes newer than file2. Rule A, and with it the entire workflow, will then be rerun the next time the workflow is launched.
Snakemake allows interpolation within rules to make them more generic: the fields of a rule can be referenced in its shell command by placing the field name {within braces}.
# Copy file1 to file2
rule A:
    input:
        "file1"
    output:
        "file2"
    shell:
        "cp {input} {output}"
Interpolation only works within a rule. It is not possible to refer to the inputs or outputs of another rule.
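When a rule has several inputs, they can also be named directly in the input field and referenced individually in the shell command. Here is a minimal sketch (the rule name, file names and field names are hypothetical):
# Concatenate two named inputs into a single output
rule merge:
    input:
        first="file1",
        second="file2"
    output:
        "file3"
    shell:
        "cat {input.first} {input.second} > {output}"
Unnamed inputs can also be accessed by position, for example {input[0]} and {input[1]}.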
Snakemake also allows abstraction in file names, so that the same rule can be applied several times to different files.
Let’s take an example. Below is a Snakefile:
# Number of files
nb_files = 4

# Definition of the final output
rule all:
    input:
        "final_output"

# Creation of 'nb_files' input_file files
rule A:
    input:
        "initial_file"
    output:
        expand("input_file{file_id}", file_id=range(1, nb_files+1))
    shell:
        "touch {output}"

# Each input_file is copied to an output_file with the same index
rule B:
    input:
        "input_file{file_id}"
    output:
        "output_file{file_id}"
    shell:
        "cp input_file{wildcards.file_id} output_file{wildcards.file_id}"

# If all output_files are present, generate the final_output file
rule C:
    input:
        expand("output_file{file_id}", file_id=range(1, nb_files+1))
    output:
        "final_output"
    shell:
        "touch final_output"
In this example, wildcards are used in rule B to copy input_files to output_files indexed in the same way (input_file1 is copied to output_file1, input_file2 is copied to output_file2, etc.). The file_id index is named in the input and output fields, and referred to in the shell field using wildcards.file_id.
There can be several wildcards in the same rule. Simply name them differently and refer to them in the shell field by their respective names.
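For example, a rule could use two wildcards, one for a sample name and one for a file extension. A minimal, hypothetical sketch:
# Copy any file matching <sample>.<ext> into a results directory
rule copy_result:
    input:
        "{sample}.{ext}"
    output:
        "results/{sample}.{ext}"
    shell:
        "cp {wildcards.sample}.{wildcards.ext} results/{wildcards.sample}.{wildcards.ext}"
Here the values of sample and ext are deduced by Snakemake from the name of the requested output file.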
The expand() function is used to generate lists of file names by filling a name pattern with the given parameter values, as in the example above.
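To make the effect concrete, with nb_files = 4 as above, the expand() call used in rule C simply builds the following list of file names:
expand("output_file{file_id}", file_id=range(1, nb_files+1))
# With nb_files = 4, this returns:
# ["output_file1", "output_file2", "output_file3", "output_file4"]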
As mentioned above, the Snakefile is a file based on the Python language. It is therefore possible to write Python functions in the Snakefile and call them in the rule fields (except in the shell field).
Below is an example taken from the official Snakemake documentation (https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html):
def myfunc(wildcards):
return [... a list of input files depending on given wildcards ...]
rule:
input:
myfunc
output:
"someoutput.{somewildcard}.txt"
shell:
"..."
More information on using the functions in the Snakefile is available in the official Snakemake documentation.
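As a concrete illustration, here is a minimal, hypothetical input function that selects the input files according to the value of a dataset wildcard (the function name, file paths and wildcard are all made up for the example):
# Return different input files depending on the value of the 'dataset' wildcard
def choose_input(wildcards):
    if wildcards.dataset == "small":
        return ["data/small.csv"]
    return ["data/large_part1.csv", "data/large_part2.csv"]

rule process:
    input:
        choose_input
    output:
        "results/{dataset}.txt"
    shell:
        "wc -l {input} > {output}"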
Once the Snakefile has been written, you need to be able to run the workflow managed by Snakemake.
This can be done from a terminal using the snakemake command, with the options we will describe here.
To launch a workflow execution with Snakemake, you need to specify a Snakefile and the resources to be used (jobs, cores, etc.).
If you allocate several jobs to the execution of your workflow, Snakemake will take care of efficiently parallelizing the steps that can be parallelized.
Launching the workflow with 1 core:
snakemake -s my_snakefile.smk --cores 1
Launching the workflow with 4 jobs:
snakemake -s my_snakefile.smk --jobs 4
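Before launching a long run, you can also check what Snakemake plans to execute without actually running anything, using the standard dry-run option:
# Show the jobs that would be executed, without running them
snakemake -s my_snakefile.smk --cores 1 --dry-run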
Snakemake also offers a command to generate a graph from the various rules you have specified in the Snakefile, without running the workflow. This graph is returned in Graphviz Dot language.
This graph can be generated simply with the command:
snakemake -s my_snakefile.smk --dag
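The Dot output can then be rendered as an image, assuming Graphviz (the dot tool) is installed on your machine:
# Render the dependency graph as a PNG image with Graphviz
snakemake -s my_snakefile.smk --dag | dot -Tpng > dag.png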
Some use cases may require you to run a workflow totally or partially on a computing cluster (for example, on Bigfoot if your workflow includes AI inference steps).
The aim of this section is to explain how to set up communication between your local machine and a computing cluster as part of a workflow managed by Snakemake.
Before you start, make sure you have an accessible personal space on your computing cluster. You must have the necessary authorisations to be able to access the computing infrastructure. Check your SSH authorisations and configurations before proceeding. You will also need to create a Perseus account and possibly request access to a project in order to continue. More information is available at https://gricad-doc.univ-grenoble-alpes.fr/hpc/connexion/.
Let’s say you have a workflow described by the following Snakefile:
# Number of files
nb_files = 4

# Definition of the final output (local)
rule all:
    input:
        "final_output"

# Creation of 'nb_files' input_file files (local)
rule preprocessing:
    input:
        "initial_file"
    output:
        expand("input_file{file_id}", file_id=range(1, nb_files+1))
    shell:
        "touch {output}"

# Generation of output_files from input_files via computation on the cluster
rule cluster_inference:
    input:
        "input_file{file_id}"
    output:
        "output_file{file_id}"
    shell:
        "calcul_cluster.sh" # Cluster inference with oarsub

# If all output_files are present, generate the final_output file (local)
rule postprocessing:
    input:
        expand("output_file{file_id}", file_id=range(1, nb_files+1))
    output:
        "final_output"
    shell:
        "touch final_output"
The workflow therefore looks like the figure below:
flowchart TD
S{Start} --> PreProcessing
PreProcessing --> ClusterInference
ClusterInference -->PostProcessing
PostProcessing -->stop{Stop}
In this workflow, we ultimately want the preprocessing and postprocessing stages to run locally, and the cluster_inference stage to run on the cluster (using oarsub).
As things stand, all the steps are carried out locally. So we’re going to adapt our workflow and our Snakefile so that we can run the cluster_inference part on the computing cluster.
The Snakemake community is very active and several complementary packages have been developed. In particular, the snakemake-executor-plugin-cluster-generic package makes it easier to use compute clusters in a workflow managed by Snakemake.
Start by installing the package in your work environment:
pip install snakemake-executor-plugin-cluster-generic
At the beginning of your Snakefile, you need to add a line that will tell Snakemake which rules should be run locally. By default, the other rules will be executed on the cluster.
Adapting to the example above, the line to add to the start of your Snakefile is as follows:
workflow._localrules = set(["all", "preprocessing", "postprocessing"])
Thanks to this line, Snakemake will understand that the all, preprocessing and postprocessing rules should be run locally, and that the other rules (cluster_inference, here) should be run on the cluster.
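Note that workflow._localrules is an internal attribute of Snakemake. The localrules directive documented by Snakemake serves the same purpose and can be used instead, at the top of the Snakefile:
# Declare the rules to be executed locally; all other rules go to the cluster
localrules: all, preprocessing, postprocessing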
We are going to frame our cluster_inference step with a step that transfers data from your local machine to the compute cluster (transfer) and a step that brings the data produced on the cluster back to your local machine (transfer_back).
Our workflow therefore takes the following form:
flowchart TD
S{Start} --> PreProcessing
PreProcessing --> TransferData
TransferData --> ClusterInference
ClusterInference --> TransferBackData
TransferBackData --> PostProcessing
PostProcessing -->stop{Stop}
These data transfer steps can be carried out with rsync via cargo, the GRICAD node dedicated to transferring data to and from the clusters.
The aim of the TransferData step is to transfer the data and files required for inference to the computing infrastructure. The resulting rule can therefore be written as follows:
rule transfer:
    input:
        expand("input_file{file_id}", file_id=range(1, nb_files+1))
    output:
        "transfer_OK"
    shell:
        "transfer_to_cluster.sh && touch transfer_OK"
In the rule above, data and file transfers are carried out using the transfer_to_cluster.sh script. This script might look like:
rsync -avxH "$PWD/data/" "$USER@cargo.univ-grenoble-alpes.fr:/path/to/your/cluster/directory/data/"
Once this transfer has been completed, we apply the touch transfer_OK command to indicate locally that the transfer to the cluster has been completed, and trigger the next step (for which the transfer_OK file will be one of the inputs).
The data transferred must include the jobs to be run with oarsub as well as any qsub and qstat wrappers, which make it easier to pass options at runtime.
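The cluster-side qsub.sh wrapper is not detailed in this guide. A minimal sketch, assuming a hypothetical OAR project name and resource request, could look like this (the submit command is expected to print the job identifier, which Snakemake then passes to the status command):
#!/bin/bash
# qsub.sh -- hypothetical wrapper, run on the cluster front-end, that submits
# the script passed as argument with oarsub and prints the OAR job id
oarsub --project my-project -l /nodes=1,walltime=02:00:00 "$@" | grep OAR_JOB_ID | cut -d'=' -f2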
The aim of the TransferBackData step is to bring the outputs generated by the inference on the computing infrastructure back to the local machine. The resulting rule can therefore be written in the following form:
rule transfer_back:
    input:
        "job_running"
    output:
        expand("output_file{file_id}", file_id=range(1, nb_files+1))
    shell:
        "transfer_back_from_cluster.sh && touch transfer_back_OK"
In the rule above, the output is retrieved using the transfer_back_from_cluster.sh script. This script might look like:
rsync -avxH "$USER@cargo.univ-grenoble-alpes.fr:/path/to/your/cluster/directory/outputs/" "$PWD/outputs/"
Once this transfer has been completed, the touch transfer_back_OK command indicates locally that the outputs have been brought back, and the next stage can then be triggered.
The transfer and transfer_back rules must be run locally. You should therefore remember to add them to the list of rules to be executed locally at the beginning of your Snakefile.
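With the rules used in this example, the line at the beginning of the Snakefile therefore becomes:
workflow._localrules = set(["all", "preprocessing", "postprocessing", "transfer", "transfer_back"])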
To tell Snakemake that part of the workflow will be run on a compute cluster, you need to specify this in the workflow launch command.
snakemake -s my_snakefile.smk --jobs 10 --executor cluster-generic \
    --cluster-generic-submit-cmd "local_qsub.sh" \
    --cluster-generic-status-cmd "local_qstat.sh"
The --executor cluster-generic flag is used to specify to Snakemake that the workflow will be partially executed on a compute cluster.
The --cluster-generic-submit-cmd "local_qsub.sh" flag tells Snakemake which script to run to submit jobs on the cluster. local_qsub.sh is a local wrapper which will itself launch the qsub.sh wrapper previously transferred to the cluster, which will submit jobs using oarsub.
The --cluster-generic-status-cmd "local_qstat.sh" flag tells Snakemake which script to run to check the state of jobs on the cluster. local_qstat.sh is a local wrapper which will itself launch the qstat.sh wrapper previously transferred to the cluster, which will check the state of jobs with oarstat.
In the Snakefile, the cluster_inference rule will take the following form:
rule cluster_inference:
    input:
        "transfer_OK", # Output file from the data transfer stage on the cluster
    output:
        "job_running",
    shell:
        "job.sh" # Script to run with oarsub on the cluster
This rule will be launched on the cluster by qsub.sh, itself launched by Snakemake with local_qsub.sh.
The contents of the local_qsub.sh file will be as follows (to be completed as required):
ssh cluster_name "/path/to/run/directory/qsub.sh $*"
This script therefore connects to the cluster via SSH and launches the qsub.sh wrapper previously transferred there. The job_running file itself is only created later, by the status script described below, once the job has terminated.
The contents of the local_qstat.sh file will be as follows (to be completed as required):
tmp=$(ssh cluster_name "/path/to/run/directory/qstat.sh $*")
oar_state=$(echo "$tmp" | cut -d' ' -f 2)
case ${oar_state} in
    Terminated) touch "job_running"; echo "success";;
    Running|Finishing|Waiting|toLaunch|Launching|Hold|toAckReservation) echo "running";;
    Error|toError) echo "failed";;
    Suspended|Resuming) echo "suspended";;
    *) echo "unknown";;
esac
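The cluster-side qstat.sh wrapper is not shown here either. A minimal sketch, assuming it receives the OAR job identifier as its first argument and that oarstat prints the job id followed by its state (which the local parsing above relies on), could be:
#!/bin/bash
# qstat.sh -- hypothetical wrapper, run on the cluster front-end, that prints
# the state of the OAR job whose id is passed as first argument
oarstat -j "$1" -s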
The local_qstat.sh script is run repeatedly by Snakemake until the job completes. If the job succeeds, the job_running file is created and used to launch the rest of the workflow (bringing the data back with the transfer_back rule). Otherwise, the workflow stops with a missing output file error.
Carrying out the above steps can be quite complex and time-consuming. A template to make this tool easier to use is currently being designed.