Introduction

A workflow management system (WMS) is a piece of software that sets up, performs and monitors a defined sequence of computational tasks (i.e. "a workflow"). Snakemake is a WMS that was developed in the bioinformatics community, and as such it has a number of features that make it particularly well-suited for creating reproducible and scalable data analyses.

First of all, the language you use to formulate your workflows is based on Python, a language with a strong standing in academia. However, you are not required to know how to code in Python to work efficiently with Snakemake. Workflows can easily be scaled from your desktop to server, cluster, grid or cloud environments. This makes it possible to develop a workflow on your laptop, perhaps using only a small subset of your data, and then run the real analysis on a cluster. Snakemake also has several features for defining the environment in which each task is carried out. This is important in bioinformatics, where workflows often involve running a large number of small third-party tools.

Snakemake is primarily intended to work on files (rather than for example streams, reading/writing from databases or passing variables in memory). This fits well with many fields of bioinformatics, notably next-generation sequencing, that often involve computationally expensive operations on large files. It's also a good fit for a scientific research setting, where the exact specifications of the final workflow aren't always known at the beginning of a project.

Lastly, a WMS is a very important tool for making your analyses reproducible. By keeping track of when each file was generated, and by which operation, it is possible to ensure that there is a consistent "paper trail" from raw data to final results. Snakemake also has features that allow you to package and distribute the workflow, and any files it involves, once it's done.

If you want to read more, the official Snakemake documentation is a good place to start.

This tutorial depends on files from the course GitHub repo. Take a look at the setup instructions if you haven't done so already, then open up a terminal, go to workshop-reproducible-research/snakemake and activate your snakemake-env Conda environment.

The basics

In this part of the tutorial we will create a very simple workflow from scratch, in order to show the fundamentals of how Snakemake works. The workflow will take two files as inputs, a.txt and b.txt, and the purpose is to convert the text in the files to upper case and then to concatenate them.

Run the following shell commands. The first one will make an empty file named Snakefile, which will later contain the workflow. The second and third commands generate two files containing some arbitrary text.

touch Snakefile
echo "This is a.txt" > a.txt
echo "This is b.txt" > b.txt

Then open Snakefile in your favorite text editor (Vim is available if you're running this in the course Docker container). A Snakemake workflow is based on rules, which take some file(s) as input, perform some type of operation on them, and generate some file(s) as output. Here is a very simple rule that takes a.txt as input and produces a.upper.txt as output. Copy this rule to your Snakefile and save it.

rule convert_to_upper_case:
    input:
        "a.txt"
    output:
        "a.upper.txt"
    shell:
        """
        tr [a-z] [A-Z] < {input} > {output}
        """

Attention

Indentation is important in Snakefiles, so make sure that you have the correct number of spaces before input/output/shell and their respective subsections. The number of spaces per level doesn't matter as long as you're consistent. Here we use four, but you could just as well use two for a more compact look. Don't use tabs (unless your editor automatically converts them to spaces).

A rule has a name, here it's convert_to_upper_case. Make an effort to name your rules in a way that makes it easy to understand the purpose of the rule, as rule names are one of the main ways to interact with the workflow. The shell section contains the shell commands that will convert the text in the input file to upper case and send it to the output file. In the shell command string, we can refer to elements of the rule via curly brackets. Here, we refer to the output file by specifying {output} and to the input file by specifying {input}. If you're not very familiar with Bash, this particular command can be read like "send the contents of a.txt to the program tr, which will convert all characters in the set [a-z] to the corresponding character in the set [A-Z], and then send the output to a.upper.txt".

Now let's run our first Snakemake workflow. When a workflow is executed Snakemake tries to generate a set of target files. Target files can be specified via the command line (or, as you will see later, in several other ways). Here we ask Snakemake to make the file a.upper.txt. It's good practice to first run with the flag -n (or --dry-run), which will show what Snakemake plans to do without actually running anything. You can also use the flag -p to show the shell commands that it will execute, and the flag -r to show the reason for running a specific rule. snakemake --help will show you all available flags.

$ snakemake -n -r -p a.upper.txt

Building DAG of jobs...
Job counts:
    count   jobs
    1   convert_to_upper_case
    1

[Thu Nov 19 13:31:30 2020]
rule convert_to_upper_case:
    input: a.txt
    output: a.upper.txt
    jobid: 0
    reason: Missing output files: a.upper.txt


        tr [a-z] [A-Z] < a.txt > a.upper.txt

Job counts:
    count   jobs
    1   convert_to_upper_case
    1
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

You can see that Snakemake plans to run one job: the rule convert_to_upper_case with a.txt as input and a.upper.txt as output. The reason for doing this is that the file a.upper.txt is missing. Now execute the workflow without the -n flag and check that the contents of a.upper.txt are as expected. Then try running the same command again. What do you see? It turns out that Snakemake only reruns jobs if one of the input files is newer than one of the output files, or if one of the input files will be updated by another job. This is how Snakemake ensures that everything in the workflow is up to date. We will get back to this shortly.
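
To execute the workflow for real, simply drop the -n flag, e.g.:

snakemake -p a.upper.txt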

What if we ask Snakemake to generate the file b.upper.txt?

$ snakemake -n -r -p b.upper.txt

Building DAG of jobs...
MissingRuleException:
No rule to produce b.upper.txt (if you use input functions make sure that they do not raise unexpected exceptions)

That didn't work well. We could copy the rule to make a similar one for b.txt, but that would be a bit cumbersome. This is where named wildcards, one of the most powerful features of Snakemake, come in. Simply change the input from input: "a.txt" to input: "{some_name}.txt" and the output to output: "{some_name}.upper.txt". Now try asking for b.upper.txt again.
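
For reference, after the change the rule should look something like this:

rule convert_to_upper_case:
    input:
        "{some_name}.txt"
    output:
        "{some_name}.upper.txt"
    shell:
        """
        tr [a-z] [A-Z] < {input} > {output}
        """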

Tada! What happens here is that Snakemake looks at all the rules it has available (only one in this case) and tries to assign values to all wildcards so that the targeted files can be generated. In this case it was quite simple: you can see that it says wildcards: some_name=b, but for large workflows with multiple wildcards it can get much more complex. Named wildcards are what enable a workflow (or single rules) to be efficiently generalized and reused between projects or shared between people.

It seems we have the first part of our workflow working. Now it's time to make the second rule, for concatenating the outputs from convert_to_upper_case. The rule structure will be similar; the only difference is that here we have two inputs instead of one. This can be expressed in two ways, either with named inputs like this:

input:
    firstFile="...",
    secondFile="..."
shell:
    """
    some_function {input.firstFile} {input.secondFile}
    """

Or with indexes like this:

input:
    "...",
    "..."
shell:
    """
    some_function {input[0]} {input[1]}
    """

Attention

If you have multiple inputs or outputs they need to be delimited with a comma (as seen above). Forgetting the comma is a very common mistake when writing Snakemake workflows. The parser will complain, but sometimes the error message can be difficult to interpret.

Now try to construct this rule yourself and name it concatenate_a_and_b. The syntax for concatenating two files in Bash is cat first_file.txt second_file.txt > output_file.txt. Call the output c.txt. Run the workflow in Snakemake and validate that the output looks as expected.

Wouldn't it be nice if our workflow could be used for any files, not just a.txt and b.txt? We can achieve this by using named wildcards (or in other ways, as we will discuss later). As we've mentioned, Snakemake looks at all the rules it has available and tries to assign values to all wildcards so that the targeted files can be generated. We therefore have to name the output file in a way that also contains information about which input files it should be based on. Try to figure out how to do this yourself. If you're stuck you can look at the spoiler below, but spend some time on it before you do. Also rename the rule to concatenate_files to reflect its new, more general use.

Click to see how to implement concatenate_files
rule concatenate_files:
    input:
        "{first}.upper.txt",
        "{second}.upper.txt"
    output:
        "{first}_{second}.txt"
    shell:
        """
        cat {input[0]} {input[1]} > {output}
        """

We can now control which input files to use by the name of the file we ask Snakemake to generate.

$ snakemake a_b.txt

Building DAG of jobs...
Using shell: /usr/local/bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       concatenate_files
        2       convert_to_upper_case
        3

rule convert_to_upper_case:
    input: b.txt
    output: b.upper.txt
    jobid: 1
    wildcards: some_name=b

Finished job 1.

1 of 3 steps (33%) done

rule convert_to_upper_case:
    input: a.txt
    output: a.upper.txt
    jobid: 2
    wildcards: some_name=a

Finished job 2.
2 of 3 steps (67%) done

rule concatenate_files:
    input: a.upper.txt, b.upper.txt
    output: a_b.txt
    jobid: 0
    wildcards: second=b, first=a

Finished job 0.
3 of 3 steps (100%) done

Neat!

Quick recap

In this section we've learned:

  • How a simple Snakemake rule looks.
  • How to define target files when executing a workflow.
  • How to use named wildcards for writing generic and flexible rules.

Tip

You can name a file whatever you want in a Snakemake workflow, but you will find that everything falls into place much more nicely if the filename reflects the file's path through the workflow, e.g. sample_a.trimmed.deduplicated.sorted.bam.

Visualizing workflows

All that we've done so far could quite easily be done in a simple shell script that takes the input files as parameters. Let's now take a look at some of the features where a WMS like Snakemake really adds value compared to a more straightforward approach. One such feature is the possibility to visualize your workflow. Snakemake can generate two types of graphs: one that shows how the rules are connected, and one that shows how the jobs (i.e. executions of a rule with some given inputs/outputs/settings) are connected. First we look at the rule graph. The following command will generate a rule graph in the dot language and pipe it to the program dot, which in turn will save a visualization of the graph as a PNG file (if you're having trouble displaying PNG files you could use SVG or JPG instead).

snakemake --rulegraph a_b.txt | dot -Tpng > rulegraph.png

That looks simple enough: the output from the rule convert_to_upper_case will be used as input to the rule concatenate_files. For a more typical bioinformatics project the rule graph can look something like this when you include all the rules, from processing of the raw data to generating figures for the paper.

While saying that it's easy to read might be a bit of a stretch, it definitely gives you a better overview of the project than you would have without a WMS.

The second type of graph is based on the jobs, and looks like this for our little workflow (use --dag instead of --rulegraph).

snakemake --dag a_b.txt | dot -Tpng > jobgraph.png

The main difference here is that now each node is a job instead of a rule. You can see that the wildcards used in each job are also displayed. Another difference is the dotted lines around the nodes. A dotted line is Snakemake's way of indicating that a job doesn't need to be rerun in order to generate a_b.txt. Validate this by running snakemake -n -r a_b.txt; it should say that there is nothing to be done.

We've discussed before that one of the main purposes of using a WMS is that it automatically makes sure that everything is up to date. This is done by recursively checking that outputs are always newer than inputs for all the rules involved in the generation of your target files. Now try to change the contents of a.txt to some other text and save it. What do you think will happen if you run snakemake -n -r a_b.txt again?

Click to see output
$ snakemake -n -r a_b.txt

Building DAG of jobs...
rule convert_to_upper_case:
    input: a.txt
    output: a.upper.txt
    jobid: 2
    reason: Updated input files: a.txt
    wildcards: some_name=a

rule concatenate_files:
    input: a.upper.txt, b.upper.txt
    output: a_b.txt
    jobid: 0
    reason: Input files updated by another job: a.upper.txt
    wildcards: first=a, second=b

Job counts:
        count   jobs
        1       concatenate_files
        1       convert_to_upper_case
        2

Were you correct? Also generate the job graph and compare to the one generated above. What's the difference? Now rerun without -n and validate that a_b.txt contains the new text. Note that Snakemake doesn't look at the contents of files when trying to determine what has changed, only at the timestamp for when they were last modified.

We've seen that Snakemake keeps track of whether files in the workflow have changed, and automatically makes sure that any results depending on such files are regenerated. What about if the rules themselves are changed? It turns out that there are multiple ways to deal with this, but the most straightforward is to manually specify that you want to rerun a rule (and thereby also all the steps between that rule and your target). Let's say that we want to modify the rule concatenate_files to also include which files were concatenated.

rule concatenate_files:
    input:
        "{first}.upper.txt",
        "{second}.upper.txt"
    output:
        "{first}_{second}.txt"
    shell:
        """
        echo 'Concatenating {input}' | cat - {input[0]} {input[1]} > {output}
        """

Note

It's not really important for the tutorial, but the shell command used here first outputs "Concatenating " followed by a space delimited list of the files in input. This string is then sent to the program cat where it's concatenated with input[0] and input[1] (the parameter - means that it should read from standard input). Lastly, the output from cat is sent to {output}.

If you now run the workflow as before you should get "Nothing to be done", because no files involved in the workflow have been changed. Instead we have to force Snakemake to rerun the rule by using the -R flag. Let's try a dry-run.

snakemake a_b.txt -r -n -R concatenate_files

Note that the reason for the job is now "Forced execution". You can target files as well as rules, so you would get the same result with -R a_b.txt. Whenever you've made changes to a rule that will affect the output it's good practice to force re-execution like this. Still, there can be situations where you don't know if any rules have been changed. Maybe several people collaborate on the same workflow but are using it on different files, for example. Snakemake keeps track of how all files were generated (when, by which rule, which version of the rule, and by which commands). You can export this information to a tab-delimited file like this:

snakemake a_b.txt -D > summary.tsv

The contents of summary.tsv is shown in the table below (scroll horizontally to see the full table).

| output_file | date | rule | version | log-file(s) | input-file(s) | shellcmd | status | plan |
|---|---|---|---|---|---|---|---|---|
| a_b.txt | Thu Nov 16 12:03:11 2017 | concatenate_files | - | | a.upper.txt,b.upper.txt | cat a.upper.txt b.upper.txt > a_b.txt | rule implementation changed | no update |
| a.upper.txt | Thu Nov 16 12:03:11 2017 | convert_to_upper_case | - | | a.txt | tr [a-z] [A-Z] < a.txt > a.upper.txt | ok | no update |
| b.upper.txt | Thu Nov 16 12:03:11 2017 | convert_to_upper_case | - | | b.txt | tr [a-z] [A-Z] < b.txt > b.upper.txt | ok | no update |

You can see in the second last column that the rule implementation for a_b.txt has changed. The last column shows if Snakemake plans to regenerate the files when it's next executed. None of the files will be regenerated because Snakemake doesn't regenerate files by default if the rule implementation changes. From a reproducibility perspective maybe it would be better if this was done automatically, but it would be very computationally expensive and cumbersome if you had to rerun your whole workflow every time you fix a spelling mistake in a comment somewhere. So, it's up to us to look at the summary table and rerun things as needed. You can get a list of the files for which the rule implementation has changed, and then force Snakemake to regenerate these files with the -R flag.

snakemake a_b.txt -R $(snakemake a_b.txt --list-code-changes)

Here the $(snakemake a_b.txt --list-code-changes) part outputs the files that are then used as targets for the snakemake a_b.txt -R part.

Clever, right? There are a bunch of these --list-xxx-changes flags that can help you keep track of your workflow. You can list all options with snakemake --help. Run with the -D flag again to make sure that the summary table now looks like expected.

You might wonder where Snakemake keeps track of all these things. It stores all this information in a hidden subdirectory called .snakemake. This is convenient since it's easy to delete if you no longer need it, and everything is contained in the project directory. Just be sure to add it to .gitignore so that you don't end up tracking it with Git.

By now you should be familiar with the basic functionality of Snakemake, and you can build advanced workflows with only the features we have discussed here. There's a lot we haven't covered though, in particular when it comes to making your workflow more reusable. In the following section we will start with a workflow that is fully functional but not very flexible. We will then gradually improve on it, and at the same time showcase some Snakemake features we haven't discussed yet. Note that this can get a little complex at times, so if you felt that this section was a struggle then you could move on to one of the other tutorials instead.

Quick recap

In this section we've learned:

  • How to use --dag and --rulegraph for visualizing the job and rule graphs, respectively.
  • How to force Snakemake to rerun relevant parts of the workflow after there have been changes.
  • How Snakemake tracks file changes and provenance, and how to use the summary table and the .snakemake directory.

The MRSA workflow

As you might remember from the intro, we are attempting to understand how lytic bacteriophages can be used as a future therapy for the multi-resistant bacterium MRSA (methicillin-resistant Staphylococcus aureus). In order to do this we have performed RNA-seq on three strains: one test and two controls. We have already set up a draft Snakemake workflow for the RNA-seq analysis and it seems to be running nicely. It's now up to you to modify this workflow to make it more flexible and reproducible!

Tip

This section will leave a little more up to you compared to the previous one. If you get stuck at some point the final workflow after all the modifications is available as git/Snakefile.

This will require some more packages, so add the following lines to environment.yml.

# For aggregating output from FastQC
- multiqc=1.7
# For mapping reads to a genome
- bowtie2=2.3.5.1
# For sorting the output from Bowtie2
- samtools=1.10
# For generating a count table for further analysis
- htseq=0.11.2

You are probably already in your snakemake-env environment, otherwise activate it (use conda info --envs if you are unsure). You can update the current environment to contain the new packages like this:

conda env update -f environment.yml

Tip

Here we have one Conda environment for executing the whole Snakemake workflow. Snakemake also supports using explicit Conda environments on a per-rule basis, by specifying something like conda: "rule-specific-env.yml" in the rule definition and running Snakemake with the --use-conda flag. The given rule will then be run in the Conda environment specified in rule-specific-env.yml, which Snakemake will create and activate on the fly.
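
A minimal sketch of what this could look like (the rule, file names and environment file below are just placeholders), run with snakemake --use-conda:

rule some_rule:
    input:
        "input.txt"
    output:
        "output.txt"
    conda:
        "rule-specific-env.yml"
    shell:
        """
        some_program {input} > {output}
        """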

Done! Let's start by generating the rule graph so that we get an overview of the workflow.

snakemake -s snakefile_mrsa.smk --rulegraph | dot -Tpng > rulegraph_mrsa.png

There are two differences in this command compared to the ones we've used before. The first is that we're using the -s flag to specify which Snakemake workflow to run. We didn't need to do that before since Snakefile is the default name. The second is that we don't define a target. In the toy example we used a_b.txt as a target, and the wildcards were resolved based on that. How come we don't need to do that here? It turns out that, by default, Snakemake targets the first rule in a workflow. By convention we call this rule all and let it serve as a rule for aggregating the main outputs of the workflow.

Now take some time and look through the workflow file and try to understand how the rules fit together. Use the rule graph as an aid. The rules represent a quite standard, although somewhat simplified, workflow for RNA-seq analysis. If you are unfamiliar with the purpose of the different operations (index genome, FastQC and so on), then take a look at the intro.

Also generate the job graph in the same manner. Here you can see that three samples will be downloaded from SRA (Sequence Read Archive); SRR935090, SRR935091, and SRR935092. Those will then be quality controlled with FastQC and aligned to a genome. The QC output will be aggregated with MultiQC and the alignments will be used to generate a count table, i.e. a table that shows how many reads map to each gene for each sample. This count table is then what the downstream analysis will be based on (in the R Markdown tutorial and in the Docker tutorial).

Now try to run the whole workflow. Hopefully you see something like this.

Building DAG of jobs...
Using shell: /usr/local/bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    3   align_to_genome
    1   all
    3   fastqc
    1   generate_count_table
    1   generate_rulegraph
    3   get_SRA_by_accession
    1   get_genome_fasta
    1   get_genome_gff3
    1   index_genome
    1   multiqc
    3   sort_bam
    19

rule get_genome_fasta:
    output: data/raw_external/NCTC8325.fa.gz
    jobid: 18

--2017-11-16 22:13:28--  ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/fasta/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_nctc_8325/dna//Staphylococcus_aureus_subsp_aureus_nctc_8325.ASM1342v1.dna_rm.toplevel.fa.gz
           => ‘data/raw_external/NCTC8325.fa.gz’
Resolving ftp.ensemblgenomes.org (ftp.ensemblgenomes.org)... 193.62.197.94
Connecting to ftp.ensemblgenomes.org (ftp.ensemblgenomes.org)|193.62.197.94|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
.
.
[lots of stuff]
.
.
localrule all:
    input: results/tables/counts.tsv, results/multiqc.html, results/rulegraph.png
    jobid: 0


Finished job 0.
19 of 19 steps (100%) done

After everything is done, the workflow will have resulted in a bunch of files in the directories data, intermediate and results. Take some time to look through the structure, in particular the quality control reports in results and the count table in results/tables.

Parameters

In a typical bioinformatics project, considerable effort is spent on tweaking parameters for the various programs involved. It would be inconvenient if you had to edit the shell commands themselves every time you wanted to run with a new setting. Luckily, there is a better option for this: the params keyword.

rule some_rule:
    input:
        "..."
    output:
        "..."
    params:
        cutoff=2.5
    shell:
        """
        some_program --cutoff {params.cutoff} {input} {output}
        """

We run most of the programs with default settings in our workflow. However, there is one parameter in the rule get_SRA_by_accession that we use to determine how many reads to retrieve from SRA for each sample (-X 25000). Change this rule to use a parameter max_reads instead, set the value to 20000, and run through the workflow. Remember that Snakemake doesn't automatically rerun rules after parameter changes, so you have to trigger the execution of get_SRA_by_accession with -R.

rule get_SRA_by_accession:
    """
    Retrieve a single-read FASTQ file from SRA (Sequence Read Archive) by run accession number.
    """
    output:
        "data/raw_internal/{sra_id}.fastq.gz"
    shell:
        """
        fastq-dump {wildcards.sra_id} -X 25000 --readids \
            --dumpbase --skip-technical --gzip -Z > {output}
        """

The parameter values we set in the params section don't have to be static; they can be any Python expression. In particular, Snakemake provides a global dictionary of configuration parameters called config. Let's modify get_SRA_by_accession to look something like this in order to access the elements of this dictionary:

rule get_SRA_by_accession:
    """
    Retrieve a single-read FASTQ file from SRA (Sequence Read Archive) by run accession number.
    """
    output:
        "data/raw_internal/{sra_id}.fastq.gz"
    params:
        max_reads = config["max_reads"]
    shell:
        """
        fastq-dump {wildcards.sra_id} -X {params.max_reads} --readids \
            --dumpbase --skip-technical --gzip -Z > {output}
        """

The config variable is just a normal Python dictionary, but it has the special feature that we can change the parameter values from the command line by using the snakemake --config KEY=VALUE syntax. Try this out for yourself.
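
For example, something like this (here as a dry-run, with -R so that the rule would actually be rerun with the new value):

snakemake -n -r -R get_SRA_by_accession --config max_reads=1000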

From a reproducibility perspective, it's not optimal to set parameters from the command line, since it's difficult to keep track of which parameter values were used. A much better alternative is to use the --configfile FILE option. Here we can collect all the project-specific settings, sample ids, and so on in one file. This also enables us to write the Snakefile in a more general manner so that it can be better reused between projects. Like several other files used in these tutorials, this file should be in YAML format. Create the file below and save it as config.yml.

max_reads: 25000

If we now run Snakemake with --configfile config.yml, it will parse this file to form the config dictionary. If you want to overwrite a parameter value, e.g. for testing, you can still use the --config flag.
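
For example, something along these lines:

snakemake -n --configfile config.yml

If you also pass --config max_reads=1000, that value takes precedence over the one in config.yml.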

Tip

Rather than supplying the config file from the command line you could also add the line configfile: "config.yml" to the top of your Snakefile.

Logs

As you probably noticed, it was difficult to follow how the workflow progressed since some rules printed a lot of output to the terminal. In some cases this output also contained important information, such as statistics on the sequence alignments or genome indexing. This could be valuable, for example, if you get unexpected results later in the project and want to debug. It's also important from a reproducibility perspective that the "paper trail" describing how the outputs were generated is saved. Luckily, Snakemake has a feature that can help with this. Just as we define input and output in a rule we can also define log.

rule some_rule:
    input:
        "..."
    output:
        "..."
    log:
        "..."
    shell:
        """
        echo 'Converting {input} to {output}' > {log}
        """

A log file is not different from any other output file, but it's dealt with a little differently by Snakemake. For example, it's shown in the file summary when using -D and, unlike regular output files, it is not deleted if the job fails, which is precisely when you need it for debugging. Explicitly defining a log also makes the purpose of the file clear. We probably don't need to save logs for all the rules, only the ones with interesting output:

  • get_genome_fasta and get_genome_gff3 would be good to log since they are dependent on downloading files from an external server.
  • multiqc aggregates quality control data for all the samples into one html report, and the log contains information about which samples were aggregated.
  • index_genome outputs some statistics about the genome indexing.
  • align_to_genome outputs important statistics about the alignments. This is probably the most important log to save.

Now add a log file to some or all of the rules above. A good place to save them would be results/logs/rule_name/. Be sure to include any wildcards used in the rule in the log file name as well, e.g. {some_wildcard}.log, so that you don't end up with identical log names for different samples.
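
As an example, for align_to_genome the log directive could look something like this (placed inside the rule, alongside input and output):

    log:
        "results/logs/align_to_genome/{sra_id}.log"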

You also have to specify in the shell section of each rule what you want the log to contain. Some of the programs we use send their log information to standard out, some to standard error and some let us specify a log file via a flag. To save some time you can use the info below.

# Wget has a -o flag for specifying the log file
wget remote_file -O output_file -o {log}

# MultiQC writes to standard error so we redirect with "2>"
multiqc -n output_file input_files 2> {log}

# Bowtie2-build redirects to standard out so we use ">"
bowtie2-build input_file index_dir > {log}

# Bowtie2 writes main output to standard out and the log info to standard error
bowtie2 -x index_dir -U input_file > output_file 2> {log}

Now rerun the whole workflow by using the -F flag. Do the logs contain what they should? Note how much easier it is to follow the progression of the workflow when the rules write to logs instead of to the terminal. If you run with -D (or -S for a simpler version) you will see that the summary table now also contains the log file for each of the files in the workflow.

Temporary files

It's not uncommon that workflows contain temporary files that should be kept for some time and then deleted once they are no longer needed. A typical case could be that some operation generates a file, which is then compressed to save space or indexed to make searching faster. There is then no need to save the original output file. Take a look at the job graph for our workflow again. The output from align_to_genome is a bam file, which contains information about all the reads for a sample and where they map in the genome. For downstream processing we need this file to be sorted by genome coordinates. This is what the rule sort_bam is for. We therefore end up with both intermediate/{sra_id}.bam and intermediate/{sra_id}.sorted.bam.

In Snakemake we can mark an output file as temporary like this:

output: temp("...")

The file will then be deleted as soon as all jobs where it's an input have finished. Now do this for the output of align_to_genome. We have to rerun the rule for it to trigger, so use -R align_to_genome. It should look something like this:

.
.
rule sort_bam:
    input: intermediate/SRR935090.bam
    output: intermediate/SRR935090.sorted.bam
    jobid: 2
    wildcards: sra_id=SRR935090

Removing temporary output file intermediate/SRR935090.bam.
Finished job 2.
.
.
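
If you're unsure how to mark the output, in align_to_genome it could look something like this (keeping the wildcard constraint from the original rule):

output:
    temp("intermediate/{sra_id,\w+}.bam")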

Tip

Sometimes you may want to trigger removal of temporary files without actually rerunning the jobs. You can then use the --delete-temp-output flag. In some cases you may instead want to run only parts of a workflow and therefore want to prevent files marked as temporary from being deleted (because the files are needed for other parts of the workflow). In such cases you can use the --notemp flag.

Snakemake has a number of options for marking files:

  • temp("..."): The output file should be deleted once it's no longer needed by any rules.
  • protected("..."): The output file should be write-protected. Typically used to protect files that require a huge amount of computational resources from being accidentally deleted.
  • ancient("..."): The timestamp of the input file is ignored and it's always assumed to be older than any of the output files.
  • touch("..."): The output file should be "touched", i.e. created or updated, when the rule has finished. Typically used as "flag files" to enforce some rule execution order without real file dependencies.
  • directory("..."): The output is a directory rather than a file.
  • dynamic("{some_wildcard}..."): This one is a bit tricky. It's used when the number of output files from a rule cannot be determined beforehand. A typical use case could be if you run some clustering analysis and end up with one file per cluster.

Rule targets

So far we have only defined the inputs/outputs of a rule as strings, or in some cases a list of strings, but Snakemake allows us to be much more flexible than that. In fact, we can use any Python expression, or even functions, as long as they return a string or a list of strings. Consider the rule align_to_genome below.

rule align_to_genome:
    """
    Align a fastq file to a genome index using Bowtie 2.
    """
    input:
        "data/raw_internal/{sra_id}.fastq.gz",
        "intermediate/NCTC8325.1.bt2",
        "intermediate/NCTC8325.2.bt2",
        "intermediate/NCTC8325.3.bt2",
        "intermediate/NCTC8325.4.bt2",
        "intermediate/NCTC8325.rev.1.bt2",
        "intermediate/NCTC8325.rev.2.bt2"
    output:
        "intermediate/{sra_id,\w+}.bam"
    shell:
        """
        bowtie2 -x intermediate/NCTC8325 -U {input[0]} > {output}
        """

Here we have seven inputs: the fastq file with the reads, and six files with similar file names from the Bowtie 2 genome indexing. We can try to tidy this up by using a Python expression to generate the list of index files instead. If you're familiar with Python you could do this with a list comprehension like this:

input:
    "data/raw_internal/{sra_id}.fastq.gz",
    ["intermediate/NCTC8325.{my_substr}.bt2".format(my_substr=substr) for
        substr in ["1", "2", "3", "4", "rev.1", "rev.2"]]

This will take the elements of the list of substrings one by one, and insert each element in the place of {my_substr}. Since this type of aggregating rule is quite common, Snakemake also has a more compact way of achieving the same thing.

input:
    "data/raw_internal/{sra_id}.fastq.gz",
    expand("intermediate/NCTC8325.{my_substr}.bt2",
           my_substr = ["1", "2", "3", "4", "rev.1", "rev.2"])

Now change the rules index_genome and align_to_genome to use the expand() expression.
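
For index_genome, assuming its outputs are the six Bowtie 2 index files listed above, the output section could be rewritten along these lines:

output:
    expand("intermediate/NCTC8325.{substr}.bt2",
           substr = ["1", "2", "3", "4", "rev.1", "rev.2"])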

In the workflow we decide which samples to run by including the SRR ids in the names of the inputs to the rules multiqc and generate_count_table. This is a potential source of errors, since it's easy to change in one place and forget to change in the other. As we've mentioned before, but not really used so far, Snakemake allows us to use Python code "everywhere". Let's therefore define a list of sample ids and put it at the very top of the Snakefile, just before the rule all.

SAMPLES = ["SRR935090", "SRR935091", "SRR935092"]

Now use expand() in multiqc and generate_count_table to use SAMPLES for the sample ids. Much better!
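
As a sketch, the bams input of generate_count_table (shown in full in the next section) could then be written like this:

input:
    bams = expand("intermediate/{sample_id}.sorted.bam", sample_id = SAMPLES),
    annotation = "data/raw_external/NCTC8325.gff3.gz"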

Shadow rules

Take a look at the rule generate_count_table below. Since input.annotation is compressed, it is first unzipped to a temporary file. htseq-count then generates a temporary count table, which is finally prepended with a header containing the sample names. Lastly, the two temporary files are deleted.

rule generate_count_table:
    """
    Generate a count table using htseq-count.
    """
    input:
        bams=["intermediate/SRR935090.sorted.bam", "intermediate/SRR935091.sorted.bam", "intermediate/SRR935092.sorted.bam"],
        annotation="data/raw_external/NCTC8325.gff3.gz"
    output:
        "results/tables/counts.tsv"
    shell:
        """
        # htseq-count cannot use .gz, so unzip to a temporary file first
        gunzip -c {input.annotation} > tempfile

        # Save the count table as a temporary file and then prepend a header line
        # with the sample names
        htseq-count --format bam --type gene --idattr gene_id {input.bams} tempfile > tempfile2
        echo '{input.bams}' | tr ' ' '\t' | cat - tempfile2 > {output}

        # Remove the temporary files
        rm tempfile tempfile2
        """

There are a number of drawbacks to having files that aren't explicitly part of the workflow as input/output files to rules (such as the two temporary files here):

  • Snakemake cannot clean up these files if the job fails, as it would do for normal output files.
  • If several jobs are run in parallel there is a risk that they write to tempfile at the same time. This can lead to very scary results.
  • Sometimes we don't know the names of all the files that a program can generate. It is, for example, not unusual that programs leave some kind of error log behind if something goes wrong.

All of these issues can be dealt with by using the shadow option for a rule. The shadow option causes each execution of the rule to be run in an isolated temporary directory (located in .snakemake/shadow/ by default). There are a few variants of shadow. The simplest is shadow: "minimal", which means that the rule is executed in an empty directory into which the input files of the rule have been symlinked. For the rule below, that means that the only file available would be input.txt. The shell commands would generate the files some_other_junk_file and output.txt. Lastly, Snakemake will move the output file (output.txt) to its "real" location and remove the whole shadow directory. We therefore never have to think about manually removing some_other_junk_file.

rule some_rule:
    input:
        "input.txt"
    output:
        "output.txt"
    shadow: "minimal"
    shell:
        """
        touch some_other_junk_file
        cp {input} {output}
        """

Try this out for the rules where we have to "manually" deal with files that aren't tracked by Snakemake (multiqc, index_genome, generate_count_table). Also remove the shell commands that remove temporary files from those rules, as they are no longer needed. Now rerun the workflow and validate that the temporary files don't show up in your working directory.

Tip

Some people use the shadow option for almost every rule and some never use it at all. One thing to keep in mind is that it leads to some extra file operations when the outputs are moved to their final location. This is no issue when the shadow directory is on the same disk as the output directory, but if you're running on a distributed file system and generate very many or very large files it might be worth considering other options (see e.g. the --shadow-prefix flag).

Generalizing the workflow

It's generally a good idea to separate project-specific parameters from the actual implementation of the workflow. If we want to move all project-specific information to config.yml, and let the Snakefile be a more general RNA-seq analysis workflow, we need the config file to:

  • Specify which samples to run.
  • Specify which genome to align to and where to download its sequence and annotation files.
  • (Any other parameters we might need to make it into a general workflow, e.g. to support both paired-end and single-read sequencing)

Note

Putting all configuration in config.yml will break the generate_rulegraph rule. You can fix it either by replacing --config max_reads=0 with --configfile=config.yml in the shell command of that rule in the Snakefile, or by adding configfile: "config.yml" to the top of the Snakefile (as mentioned in a tip above).

The first point is straightforward: rather than using SAMPLES = ["..."] in the Snakefile we define it as a parameter in config.yml. Do a dry-run afterwards to make sure that everything works as expected. You can either add it as a list on a single line, e.g. sample_ids: ["..."], or you can use this YAML notation:

sample_ids:
- SRR935090
- SRR935091
- SRR935092
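
In the Snakefile, the hard-coded list can then be replaced with a lookup in the config dictionary, for example (adjust the key to whatever name you chose in config.yml):

SAMPLES = config["sample_ids"]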

The second point is trickier. Writing workflows in Snakemake is quite straightforward when the logic of the workflow is reflected in the file names, i.e. my_sample.trimmed.deduplicated.sorted.fastq, but that isn't always the case. In our case we have the FTP paths to the genome sequence and annotation where the naming doesn't quite fit with the rest of the workflow. The easiest solution is probably to make three parameters to hold these values, say genome_id, genome_fasta_path and genome_gff_path, but we will go for a somewhat more complex but very useful alternative. We want to construct a dictionary where something that will be a wildcard in the workflow is the key and the troublesome name is the value. An example might make this clearer (this is also in config.yml). This is a nested dictionary where "genomes" is a key with another dictionary as value, which in turn has genome ids as keys and so on. The idea is that we have a wildcard in the workflow that takes the id of a genome as value (either "NCTC8325" or "ST398" in this case). The fasta and gff3 paths can then be retrieved based on the value of the wildcard.

genomes:
  NCTC8325:
    fasta: ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/fasta/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_nctc_8325/dna//Staphylococcus_aureus_subsp_aureus_nctc_8325.ASM1342v1.dna_rm.toplevel.fa.gz
    gff3: ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/gff3/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_nctc_8325//Staphylococcus_aureus_subsp_aureus_nctc_8325.ASM1342v1.37.gff3.gz
  ST398:
    fasta: ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/fasta/bacteria_18_collection//staphylococcus_aureus_subsp_aureus_st398/dna/Staphylococcus_aureus_subsp_aureus_st398.ASM958v1.dna.toplevel.fa.gz
    gff3: ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/gff3/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_st398//Staphylococcus_aureus_subsp_aureus_st398.ASM958v1.37.gff3.gz

Let's now look at how to do the mapping from genome id to fasta path in the rule get_genome_fasta. This is how the rule currently looks (if you have added the log section as previously described).

rule get_genome_fasta:
    """
    Retrieve the sequence in fasta format for a genome.
    """
    output:
        "data/raw_external/NCTC8325.fa.gz"
    log:
        "results/logs/get_genome_fasta/NCTC8325.log"
    shell:
        """
        wget ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/fasta/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_nctc_8325/dna//Staphylococcus_aureus_subsp_aureus_nctc_8325.ASM1342v1.dna_rm.toplevel.fa.gz -O {output} -o {log}
        """

We don't want the hardcoded genome id, so replace it with a wildcard, say {genome_id}. In general, we want the rules as far downstream as possible in the workflow to be the ones that determine what the wildcards should resolve to. In our case this is align_to_genome. You can think of it as the rule that really "needs" the file asking for it, and then it's up to Snakemake to determine how it can use all the available rules to generate it. Here we say "I need this genome index to align my sample to" and it's up to Snakemake to determine how to download and build the index.

We're almost done, but there is one more tricky thing left. Take a look below at the params section. Here we have defined a function to generate the parameter fasta_path. This is what's called an anonymous function, or lambda expression, but any normal function would work as well. What happens is that the function takes the wildcards object as its only argument. This object allows access to the wildcards values via attributes (here wildcards.genome_id). The function will then look in the nested config dictionary and return the value of the fasta path for the key wildcards.genome_id. Neat!

rule get_genome_fasta:
    """
    Retrieve the sequence in fasta format for a genome.
    """
    output:
        "data/raw_external/{genome_id}.fa.gz"
    params:
        fasta_path = lambda wildcards: config["genomes"][wildcards.genome_id]["fasta"]
    log:
        "results/logs/get_genome_fasta/{genome_id}.log"
    shell:
        """
        wget {params.fasta_path} -O {output} -o {log}
        """

Now change get_genome_gff3 in the same way.
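
If you get stuck, get_genome_gff3 could end up looking something like this (mirroring get_genome_fasta above; the parameter name is just a suggestion):

rule get_genome_gff3:
    """
    Retrieve annotation in gff3 format for a genome.
    """
    output:
        "data/raw_external/{genome_id}.gff3.gz"
    params:
        gff3_path = lambda wildcards: config["genomes"][wildcards.genome_id]["gff3"]
    log:
        "results/logs/get_genome_gff3/{genome_id}.log"
    shell:
        """
        wget {params.gff3_path} -O {output} -o {log}
        """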

Also change index_genome to use a wildcard rather than a hardcoded genome id. Here you will run into a complication if you have followed the previous instructions and used the expand() expression. We want the list to expand to ["intermediate/{genome_id}.1.bt2", "intermediate/{genome_id}.2.bt2", ...], i.e. only expanding the part referring to the Bowtie 2 index, not the genome_id wildcard. To keep the genome_id wildcard from being expanded we have to "mask" it with double curly brackets: {{genome_id}}.
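
With the masking in place, the output of index_genome could look something like this:

output:
    expand("intermediate/{{genome_id}}.{substr}.bt2",
           substr = ["1", "2", "3", "4", "rev.1", "rev.2"])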

Lastly, we need to define somewhere which genome id we actually want to use. This needs to be done both in align_to_genome and generate_count_table. Do this by introducing a parameter in config.yml called "genome_id". See below for an example for align_to_genome. Here the substr wildcard gets expanded from a list, while genome_id gets its value from the config file.

input:
    expand("intermediate/{genome_id}.{substr}.bt2",
           genome_id = config["genome_id"],
           substr = ["1", "2", "3", "4", "rev.1", "rev.2"])
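
In config.yml this is then just a single additional entry, e.g.:

genome_id: "NCTC8325"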

Summary

Well done!

  • You have a general RNA-seq pipeline which can easily be reused between projects, thanks to clear separation between code and settings.
  • You have great traceability due to logs and summary tables.
  • You have clearly defined the environment for the workflow using Conda.
  • You have kept the workflow neat and free from temporary files by using temp() and shadow.
  • You have a logical directory structure which makes it easy to separate raw data, intermediate files, and results.
  • You have set up your project in a way that makes it very easy to distribute and reproduce, either via Git, Snakemake's --archive option or a Docker image.

Extra material

Running jobs in Singularity or Docker containers

Snakemake also supports defining a Singularity or Docker container for each rule (you will have time to work on the Docker tutorial and the Singularity tutorial later during the course). Analogous to using a rule-specific Conda environment, specify container: "docker://some-account/rule-specific-image" in the rule definition. Instead of a link to a container image, it is also possible to provide the path to a *.sif file (a Singularity image file). When executing Snakemake, add the --use-singularity flag to the command line. Snakemake will then create a Singularity container on the fly from the image or file provided in the rule definition, and run the rule inside that container.

You can find pre-made Singularity or Docker images for many tools on https://biocontainers.pro/ (bioinformatics-specific) or on https://hub.docker.com/.

Here is an example for a rule and its execution:

rule align_to_genome:
    """
    Align a fastq file to a genome index using Bowtie 2.
    """
    input:
        "data/raw_internal/{sra_id}.fastq.gz",
        "intermediate/NCTC8325.1.bt2",
        "intermediate/NCTC8325.2.bt2",
        "intermediate/NCTC8325.3.bt2",
        "intermediate/NCTC8325.4.bt2",
        "intermediate/NCTC8325.rev.1.bt2",
        "intermediate/NCTC8325.rev.2.bt2"
    output:
        "intermediate/{sra_id,\w+}.bam"
    container: "docker://quay.io/biocontainers/bowtie2:2.3.4.1--py35h2d50403_1"
    shell:
        """
        bowtie2 -x intermediate/NCTC8325 -U {input[0]} > {output}
        """

Start your Snakemake workflow with the following command:

snakemake --use-singularity

Feel free to modify the MRSA workflow according to this example. Since Singularity is container software developed for HPC clusters (the Mac version, for example, is still in beta), running your updated Snakemake workflow with Singularity locally on your own computer might not work. In the next section we explain how you can run Snakemake workflows on UPPMAX, where Singularity is pre-installed.

Running Snakemake workflows on UPPMAX

There are several options for executing Snakemake workflows on UPPMAX (an HPC cluster with the SLURM workload manager). In any case, we highly recommend using a session manager like tmux or screen, so that you can run your workflow in a session in the background while doing other things on the cluster or even logging out of the cluster.

Run your workflow in an interactive job

For short workflows with only a few rules that need the same compute resources in terms of CPU (cores), you can start an interactive job (in your tmux or screen session) and run your Snakemake workflow as you would on your local machine. Make sure to give your interactive job enough time to finish running all rules of your Snakemake workflow.

Cluster configuration

For workflows with long run times and/or where each rule requires different compute resources, Snakemake can be configured to automatically send each rule as a job to the SLURM queue and to track the status of each job.

The relevant parameters for such a cluster configuration are --cluster and --cluster-config, in combination with a cluster.yaml file that specifies default and rule-specific compute resources and your compute account details.

Here is an example for a cluster.yaml file:

# cluster.yaml - cluster configuration file
__default__:
  account: # fill in your project compute account ID
  partition: core
  time: 01:00:00
  ntasks: 1
  cpus-per-task: 1
### rule-specific resources
trimming:
  time: 01-00:00:00
mapping:
  time: 01-00:00:00
  cpus-per-task: 16

Start your Snakemake workflow in a tmux or screen session with the following command:

snakemake \
    -j 10 \
    --cluster-config cluster.yaml \
    --cluster "sbatch \
               -A {cluster.account} \
               -p {cluster.partition} \
               -t {cluster.time} \
               --ntasks {cluster.ntasks} \
               --cpus-per-task {cluster.cpus-per-task}"

The additional parameter -j specifies the number of jobs that Snakemake is allowed to send to SLURM at the same time.

SLURM Profile

In future Snakemake versions, the cluster configuration will be replaced by so-called Profiles. The SLURM Profile needs to be set up with the software cookiecutter (available via Conda). During the SLURM Profile setup, you will be asked for several values for your Profile, e.g. for a Profile name or your compute project account ID.
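
The setup could look something like this (assuming the official Snakemake-Profiles SLURM profile repository; the exact URL and prompts may differ between versions):

conda install -c conda-forge cookiecutter
cookiecutter https://github.com/Snakemake-Profiles/slurm.git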

Rule-specific resources can be defined in each rule via the resources: directive, for example:

rule align_to_genome:
    input:
        "{genome_id}.bt2",
        "{sample}.fastq.gz"
    output:
        "{sample}.bam"
    resources:
        runtime = 360
    threads: 10
    shell:
        """
        aligner -t {threads} -i {input[1]} -x {input[0]} > {output}
        """

Any rule for which runtime is specified in the resources directive will be submitted as one job to the SLURM queue with runtime as the allocated time. Similarly, the number specified in the threads directive will be used as the number of allocated cores.

You can even specify if rules should be resubmitted to the SLURM queue, asking for more resources on subsequent attempts. Do this by modifying the resources directive with e.g.:

    resources:
        runtime = lambda wildcards, attempt: attempt*360

In this example, the rule will ask for 360 minutes on the first attempt, for 2*360 minutes on the second attempt, etc.

Finally, instead of specifying compute resources in the resources and threads directives, it is possible to set up the SLURM Profile by providing a cluster.yaml file. When you create the profile with cookiecutter and you are prompted for cluster_config []: enter the path to a cluster.yaml file, e.g.: cluster_config []: config/cluster.yaml. Now you can control resource allocations for rules either using the resources: directive in each rule, or by adding that information to the config/cluster.yaml file. If resources are found in both locations, the allocations in the cluster.yaml file will take precedence.

With this setup you can start the workflow with your SLURM Profile as follows from within a tmux or screen session:

snakemake -j 10 --profile your_profile_name