Snakemake

Snakemake is a workflow management system to create reproducible and scalable data analyses in a human readable, Python based language. You define individual analysis steps as rules with a defined input and output, where dependencies are defined implicitly when the input of one rule matches the output of another rule.
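The idea of implicit dependencies can be illustrated with a small, purely conceptual Python sketch. It is not how snakemake is implemented: the rule names and file names are hypothetical, file names are matched exactly (no wildcards), and the rules are assumed to contain no cycles.

```python
# Conceptual sketch only: hypothetical rules, exact file name matching, no wildcards.
RULES = {
    "map_reads": {"input": ["reads.fastq"], "output": ["mapped.bam"]},
    "sort_bam": {"input": ["mapped.bam"], "output": ["sorted.bam"]},
    "call_variants": {"input": ["sorted.bam"], "output": ["calls.vcf"]},
}


def rules_needed(target, rules):
    """Return the names of the rules needed to create `target`, in execution order."""
    # Map every output file to the rule that produces it.
    producers = {out: name for name, rule in rules.items() for out in rule["output"]}
    order = []

    def visit(path):
        name = producers.get(path)
        if name is None or name in order:
            return  # a source file without a producing rule, or a rule already scheduled
        for dependency in rules[name]["input"]:
            visit(dependency)  # a rule's inputs have to exist before it can run
        order.append(name)

    visit(target)
    return order


print(rules_needed("sorted.bam", RULES))  # ['map_reads', 'sort_bam']
```

Requesting sorted.bam only pulls in the two rules that lead to it, while call_variants is left out. The same principle is what makes it possible to request a single output file when debugging.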

If you want to get to know its core features, there is a great snakemake tutorial that is easy to set up and has exercises to work through at your own pace. Once you have a general understanding of how snakemake works, you could, for example, try out one of the standardized snakemake workflows.

Snakemake workflows: reusable analysis workflows

The snakemake module system allows for flexibly combining different existing workflows for more complex analyses. It is also used for the deployment of standardized workflows that you can find in the snakemake workflow catalog. Just click on the Usage button of any of the listed standardized workflows to get a walkthrough of how to set up this workflow for your own data analysis (for example the dna-seq-varlociraptor workflow Usage).

Snakemake wrappers: reusable snakemake rules

Snakemake wrappers are reusable snakemake rules. They wrap one command of a command line tool to provide a specific output from a provided input file. You can simply copy-paste such a wrapper into your own workflow and use it. Just check out the compendium of wrappers and meta-wrappers (copy-pastable collections of wrappers that perform multiple analysis steps). And if the tool or command you are looking for is not yet on there, consider contributing a new snakemake wrapper once you have got the tool working in your own workflow.

Snakemake debugging tricks

Saving time on DAG building and resolution

To quickly debug a particular rule, specify the output of that rule as the desired output of your snakemake run. To avoid clashes with command line argument specifications, it is best to provide the desired output file as the first argument, right after snakemake, e.g.:

snakemake path/to/wanted_output.txt --jobs 5 --use-conda

language-specific debugging

R

logging

Standard code we use for redirecting any stdout and stderr output to the log: file specified in the rule calling the R script:

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

Also, if you are looking to have proper backtraces even for unexpected errors (errors not properly handled in your code or in a package you load), you can use:

rlang::global_entrace()

You will need to have the rlang package installed, which comes with the tidyverse, for example. For details on the function, see the documentation (https://rlang.r-lib.org/reference/global_entrace.html). Also, this is not expected to incur a performance reduction.

interactive debugging

You can save the entire current state of a workspace in R. To debug an error, insert this right before the code that triggers it:

save.image(file = "my_dump.RData")

In an interactive R session, first load all the library()s that you need for the script. Then you can load the full workspace and interactively explore / debug what’s going on:

load("my_dump.RData")

python

logging

Standard code we use for redirecting any stderr output to the log: file specified in the rule calling the python script:

import sys
sys.stderr = open(snakemake.log[0], "w", buffering=1)

Here, the buffering=1 ensures that line buffering is used, so that stderr lines are written to the log file whenever a full line is available. This avoids information not getting printed before throwing an error due to some longer buffering.
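The effect of line buffering can be checked outside of snakemake with a minimal sketch. The `snakemake` object below is a hypothetical stand-in (a `SimpleNamespace` pointing to a temporary file), because the real object only exists inside scripts executed by snakemake.

```python
import sys
import tempfile
from pathlib import Path
from types import SimpleNamespace

# Assumption: a stand-in for the `snakemake` object that snakemake injects into scripts.
log_path = Path(tempfile.mkdtemp()) / "example.log"
snakemake = SimpleNamespace(log=[str(log_path)])

original_stderr = sys.stderr
sys.stderr = open(snakemake.log[0], "w", buffering=1)
print("first message", file=sys.stderr)

# The log file is still open, but the complete line has already been written to disk.
logged_so_far = log_path.read_text()

# Clean up, so that the rest of the session uses the usual stderr again.
sys.stderr.close()
sys.stderr = original_stderr
print(repr(logged_so_far))
```

With the default buffering, such a short message would usually only reach the file once the buffer fills up or the file is closed.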

interactive debugging

Inserting the following into a python script executed in a snakemake rule will throw you into an interactive python debugging shell where you have access to the current state, including locals() and globals() variables:

import pdb; pdb.set_trace()

Please note that you must not redirect the standard output to any log file, which you might have done before for other debugging purposes. If you do, the python debugger will still interrupt the pipeline execution, but you will not see the debugging shell and will thus be unable to interact with the program state or resume the execution.
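If sys.stdout was reassigned within the python script itself, one possible workaround is to hand the debugger the interpreter's original output stream explicitly. This is only a sketch: the helper name `terminal_debugger` is hypothetical, and it will not help if the output was redirected outside of python (for example by a shell redirect in the rule).

```python
import pdb
import sys


def terminal_debugger(stdout=None):
    """Create a pdb instance that writes to the given or the original output stream."""
    # sys.__stdout__ keeps the stream python started with, even after sys.stdout was reassigned.
    return pdb.Pdb(stdout=stdout if stdout is not None else sys.__stdout__)


# Usage inside a script, at the line you want to inspect (kept commented out here):
# terminal_debugger().set_trace()
```

Note that pdb reads its commands from plain standard input when an explicit output stream is passed, so line editing and history may be more limited than in the usual debugging shell.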
