Introduction

In this mdbook, we will collect resources and tips for data scientists in bioinformatics. It is meant as a quick reference for anybody, from someone getting started without any prior programming knowledge to experienced data scientists. We initiated its development for a community of bioinformaticians at the DKTK site Essen Düsseldorf, but want to make this open for anybody from the wider local and global data science communities.

Contributions

Contributions are very welcome, and can be anything from asking even the smallest question by opening an issue, up to directly suggesting changes or additions via pull requests. The only thing to keep in mind is to keep this collection a welcoming place by following our code of conduct.

Bioinformatics

For a tool to convert between lots of bioinformatics file formats, see the bioconvert description in the data formats section.

bioconda

Bioconda lets you install thousands of software packages related to biomedical research using the conda package manager.

It has great usage documentation, so check out how to get started using bioconda on its own website.

The first step (also described and linked to on the website) is to install conda, the package manager you need to use the bioconda channel. Here, we recommend using the mambaforge installation.

conda basics

There is a separate page with some basic conda commands to get you started.

contributing to bioconda

While you will nowadays find most bioinformatics packages on bioconda (and checking out a package on bioconda is a great way of checking how often it is downloaded), you will sometimes encounter packages that are not available there. If you feel comfortable with installing such a tool manually, please consider contributing your solution to the channel in the form of a recipe. There is a whole documentation section dedicated to helping you contribute package recipes to bioconda.

bioconductor

The bioconductor website says it best:

The mission of the Bioconductor project is to develop, support, and disseminate free open source software that facilitates rigorous and reproducible analysis of data from current and emerging biological assays. We are dedicated to building a diverse, collaborative, and welcoming community of developers and data scientists.

Bioconductor uses the R statistical programming language, and is open source and open development.

Thus, bioconductor provides standardized and reviewed bioinformatics packages, usually with good and extensive documentation. All of its packages can for example be installed using conda via the bioconda channel.

Coding tools

useful coding conventions

naming things: the case for snake case

"Things" can be anything from a file or folder name to the name of a single variable name. In coding, and in work in general, you have to come up with names for stuff constantly. The first guide should usually be the style guide of the language you are working in (for example, see the python style guide).

For anything where your language's style guide leaves you alone (or that is not in a language with a style guide), we recommend using snake_case or lower_case_with_underscores as a general rule. Names are very easy to read because words are clearly separated, and you'll never have to think twice about which letter was capitalized.

Also, don't abbreviate unnecessarily. Typing out a longer variable name is usually something that your integrated development environment (IDE) will do for you (if you're not very quick at it anyway)---what it can't do for you is remember what the heck that particular variable s stands for...

integrated development environments (IDEs)

If everything you want to do can be done in R, your best choice of IDE is probably Rstudio. For all other purposes, you will have to look around and try what works for you.

Microsoft Visual Studio Code

VS Code offers a lot of plugins for all common languages, plus lots of useful plugins for other stuff (and all for free). It has extensive documentation, but the best way to get started is probably to simply download VS Code and jump right in. Several people in our group currently use it, with the ability to work on a remote project on a server via ssh being a very useful and heavily used feature.

VS Code has a very lively Extension Marketplace that provides lots of useful plugins. Here are some that have proven very useful to us:

  • The remote ssh extension enables you to log onto a server via an ssh connection. You can then open a project folder and browse and edit files, while you can also open a Terminal / shell to run commands.
  • The snakemake extension gives you syntax highlighting for Snakefiles.
  • The Rainbow CSV extension gives you systematic per-column colors in comma-separated and tab-separated value files.
  • The Edit CSV extension allows you to edit .csv and .tsv files in a tabular view, so that you can create and delete columns and rows, and copy-paste things.

In addition, VS Code will often offer to install plugins for filetypes you open, or for languages that you use. Explore.

JetBrains

JetBrains develops a suite of IDEs for different languages (just use the language filter on the right) and offers a free version for academic purposes.

Licensing

In general, you will have to check with your institution / employer if there are any requirements or limitations for releasing software and code that you develop. This especially applies to licensing.

Open Source licensing

To maximize possibilities of code re-use, which should be the default in research, Open Source licenses are the best choice. For quickly deciding on an Open Source license for your project, use the interactive tool at choosealicense.org. For a comprehensive list, check out the Open Source Initiative's community-approved Open Source licenses.

Markdown

Markdown is a plain-text syntax for writing structured texts that are both human-readable as plain text and machine-readable for automatic parsing into different formats (HTML, PDF, ...).

For a standardized syntax definition, check out the CommonMark project. They have a quick Markdown syntax overview that also links to a 10 minute tutorial.

Markdown is used extensively in many projects, e.g. for writing this mdbook or for writing README.md files with GitHub flavoured Markdown on GitHub. As it literally takes no more than 10 minutes to learn the basics and a couple of minutes to look up anything that is a bit more complicated (for example tables in GitHub flavoured Markdown), it is really worthwhile checking it out.

regular expressions

As the xkcd comic below suggests, regular expressions are a really powerful tool to systematically search (and replace) strings in very large text files. If you have ever used search-and-replace in a word processor or a spreadsheet software, you have probably seen cases where just searching for a particular word doesn't do the trick. This is where regular expressions come in.

To get started, it makes sense to familiarize yourself with the general principle of regexes. For example, check out the regular-expressions.info Quick Start. As this introduction also mentions, you will then have to venture on to learn how exactly regular expressions work in the language or tool you want to use.
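To make this concrete, here is a minimal sketch using grep with extended regular expressions on the command line (the file name and the pattern are just placeholder examples):

# print all lines that contain an Ensembl-style gene identifier (ENSG followed by 11 digits)
grep -E "ENSG[0-9]{11}" annotation.tsv

# the same search, case-insensitive, printing only the matching part of each line
grep -E -i -o "ensg[0-9]{11}" annotation.tsv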

regular expression flavours

Here is a table with resources on the exact regular expression syntax in different tools and languages. These are often referred to as regular expression flavours.

why regexes (xkcd reasoning)

(alt text of the embedded xkcd comic: "Wait, forgot to escape a space. Wheeeeee[taptaptap]eeeeee.")

user experience / bug reports

friction logging

Friction logging is a technique to narratively describe your user experience with some process. This can be the installation and setup of a software, the deployment of a standardized workflow, following along a tutorial, or any other process you go through on your computer. It can be a very useful document for developers who want to improve their "product", whatever that is (software, web interface, installation process). And it is very easy to get started with: a quick motivation and description of the technique can be found in this friction logging blog post by Aja Hammerly.

version control

git: the version control tool

For quite some years now, git has been the de-facto standard for distributed version control. It can be used to granularly track changes to your code base while collaborating within and across teams. And its wide adoption means that it integrates well into the graphical user interface of any modern integrated development environment (IDE), while you can also use it directly on the command line.
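As a minimal sketch of what everyday command-line usage looks like (the repository URL and file name are placeholders):

git clone https://github.com/some_user/some_repository.git
cd some_repository
# ... edit some_file.txt ...
git status
git add some_file.txt
git commit -m "describe what you changed and why"
git push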

learning git

  • happy git with R: A very good and comprehensive, self-paced course for learning how to use git, applying this to coding in R (and integrating it with using the RStudio IDE).
  • git book: The Getting Started and Git Basics chapters should get you going, the following chapters introduce you to more advanced applications.
  • printout git cheat sheet: A git cheat sheet with the most important git subcommands at one glance. Having this printed out and ready at hand will make looking them up a lot quicker. It is provided by GitHub in a great number of languages.
  • interactive git cheat sheet: For interactively exploring different git subcommands and what they do, with a visualization of different places where code changes can be tracked, this is a very cool tool. It's for playing around or looking stuff up once you feel a bit more comfortable with git.
  • git command reference: In addition to a full reference of all git subcommands (that you can execute on the command line), this page also provides links to the above mentioned cheat sheets.

GitHub: de facto standard online platform to work with git repositories

GitHub has become the de facto standard for distributed version control using git repositories. It has extensive free features, especially for publicly hosted repositories. This includes continuous integration resources, even for private repositories. And GitHub provides extensive documentation. However, repositories will be hosted on company servers that will not necessarily conform to your country's data regulations. Thus, for certain projects you might have to use an in-house platform like GitLab (see below).

GitLab: Open Source online platform to work with git repositories

GitLab is an Open Source alternative to GitHub, with most of the same features. As the platform itself is Open Source code, it can be installed and hosted in-house by an institution. Thus, it is the go-to solution for educational and research institutions that want to avoid having (private) repositories hosted on outside servers.

collaborating using git: git, GitHub and GitLab flow

For collaborative work on a code base, you need to agree on some basic rules of how to use git and GitHub or GitLab. Usually, you can pick up on a lot of these conventions by simply using one of the online platforms and by looking at existing projects. Most of them implicitly use GitHub flow, so reading through the explanation linked here will get you a long way.

GitHub flow is a reduction of the more complicated git flow from 2010 to a setup where code is released often and quality is assured by continuous integration. Should you ever encounter a project where GitHub flow does not provide the level of control you need, GitLab flow should get you started.
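As a rough, hedged sketch of what GitHub flow looks like on the command line (branch and file names are placeholders; the pull request itself is opened in the web interface):

# create a feature branch for your change (git checkout -b works on older git versions)
git switch -c add-new-feature
# ... edit files, then record your changes ...
git add changed_file.txt
git commit -m "add new feature"
# push the branch and open a pull request on GitHub
git push -u origin add-new-feature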

Command line

Unix shell / linux command line / bash

bash is the most common Linux shell, a command-line environment where you can enter and thus run commands interactively. You can find extensive resources about it online, and oftentimes searching for your particular problem online will provide suggestions for solutions.

Here are two quick-and-easy introductions that focus on the most important aspects and should be doable in a few hours:

Here is a very well curated and more thorough introduction to the command line with a motivating example running through the course:

And some additional material, for those who want to dive in deeper or skip ahead:

And here are some more detailed resources:

bash scripts

You can write full bash scripts as plain text files, with one command per line and flow control (like if-else clauses) where needed, so that you can easily repeat any number of steps. Such a script can simply contain commands; if it is saved as path/to/some_script.bash, you can execute it with:

bash path/to/some_script.bash

Alternatively, you can specify within the file itself that it is a bash script, by including the following first line in the file path/to/another_script.bash (if you are interested in what this line does, check out the explanation on stackoverflow):

#!/usr/bin/env bash

If you then make that file executable for your user with chmod u+x another_script.bash, you can execute it by simply calling the name:

path/to/another_script.bash

Your shell will know to use the bash from your main environment to execute this script.
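To make this concrete, a minimal (hypothetical) path/to/another_script.bash could look like this:

#!/usr/bin/env bash
# abort on errors and on the use of undefined variables
set -euo pipefail

# count the lines of a (placeholder) input file
echo "number of lines in myfile.txt:"
wc -l myfile.txt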

conda / mamba (software installation)

Conda is an open-source package management system and environment management system that runs on Windows, macOS, and Linux. Conda quickly installs, runs, and updates packages and their dependencies.

This is how the conda documentation puts it. One more major selling point: you don't need administrator rights to install conda itself or to install packages with it. And for our (bioinformatic) purposes, there are two more very useful aspects:

  1. There is bioconda, a conda channel with pretty much any tool in bioinformatics pre-packaged for you.
  2. conda seamlessly integrates with snakemake for automatic software installation in automated analysis workflows.

installing conda / mamba

We recommend using the mambaforge installation. mamba is a drop-in replacement command for most of conda (it can do exactly the same things), but it is much faster at solving package dependencies. This can mean a major speedup for creating and changing environments. Whenever a command does not actually work with mamba, you can simply use conda instead (conda is installed in the background anyway and is used for any task where speed is not an issue).

creating and using environments

conda makes it easy to create and manage new (named) software environments. We recommend creating a new (and minimal) software environment whenever possible (for example, in a snakemake workflow, each rule will have its own software environment). You then only activate that software environment whenever you need the software installed into it. By keeping environments minimal, you avoid clashes in the dependencies that software might have (for example, if software X needs python version 3, while software Y still requires python version 2, these cannot be installed into the same environment---but X and Y will happily install in separate environments, which will pull in the required python version).

But, long story short, here's the command to create an environment:

mamba create -n snakemake_env snakemake snakedeploy snakefmt

You can then activate that environment with:

mamba activate snakemake_env

And you will be able to call the installed tools, for example snakemake:

snakemake --version
snakemake --help

If you are done with your current session of using the software, the following gets you back to your original environment (the one you were in before activating snakemake_env):

mamba deactivate

If you no longer need the entire environment (and the software contained in it) and want to delete it, you can:

mamba env remove -n snakemake_env

And then, there are many other things you can do with environments---just look up the details whenever you need them.

installing specific versions of software

Sometimes, you want to install a specific version of a software, for example because something was proven to work with that version in the past. You can achieve this with:

mamba create -n snakemake_env_7_19_1 snakemake=7.19.1

And in case you ever need to dive into the details, there is of course documentation on how conda handles version number specifications.
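A couple of further (hedged) examples of version specifications that conda/mamba understand---quote them so that your shell does not interpret the special characters:

# any snakemake version from 7.19 up to (but excluding) 8
mamba create -n snakemake_env "snakemake>=7.19,<8"

# any patch release of snakemake 7.19
mamba create -n snakemake_env "snakemake=7.19.*"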

linux tools

Any linux distribution comes with a set of linux tools for the command line. In bioinformatics (and data science in general), especially the line by line tools for plain text files are very useful. Thus, we will introduce a number of these tools here and provide links for useful tutorials and documentation for the more complex ones. However, you can usually get detailed documentation for any of these tools on the command-line. Simply write man followed by the tool's name to get its full manual, for example man head. You can scroll up and down and leave the manual with q to drop back to the command-line (controls are the same as the command-line text file viewer less).

common command-line arguments

Most command-line tools have certain standard command-line arguments that usually come in a short (one dash - and just one letter) and a long version (two dashes -- and a full word):

  • -h/--help: display a help message
  • --version: display the version of the tool that you have installed
  • --verbose: provide more verbose output, oftentimes useful for debugging problems

head and tail

head -n 5 myfile.txt will display the first five lines of the plain text file myfile.txt. tail -n 5 myfile.txt will display the last five lines of the plain text file myfile.txt.

less: quick look at text files

less is a tool for quickly looking at text files on the command line. Conjure it up with less myfile.txt, scroll up and down with the arrow and page up/down keys, search for something like thisword with /thisword (followed by enter) and quit by hitting q.

grep: line by line plain text file searching

grep allows you to search for strings and regular expressions in text files. It will return any line containing a search string. For example, grep "gene" myfile.txt would return any line containing the string gene in myfile.txt.

sed: line by line plain text file editing

sed is a great tool to work on plain-text files line by line. For example, you can easily search and replace patterns using regular expressions.
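For example (a minimal sketch; file name and patterns are placeholders):

# replace every occurrence of "foo" with "bar" and print the result to stdout
sed 's/foo/bar/g' myfile.txt

# the same idea with an extended regular expression and a capture group
sed -E 's/gene_([0-9]+)/GENE-\1/g' myfile.txt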

awk: line by line editing of tabular plain text files

awk is a great tool to work on plain-text tabular files (for example tab-separated or comma-separated files, .tsv and .csv files, respectively). It provides you with direct accessors to individual columns (for example $1 for the first column) and with lots of functionality to work with them.
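For example (again a sketch with placeholder file and column numbers):

# print the first and third column of a tab-separated file
awk -F '\t' '{ print $1, $3 }' myfile.tsv

# print only those lines where the value in the second column is greater than 10
awk -F '\t' '$2 > 10' myfile.tsv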

further resources

Data analyses

differential expression analysis

A classic case of bioinformatic data analysis is to check whether expression levels of individual transcripts differ between two (or more) groups of samples. These can be a treatment group compared to a control group, different time points, or groups differing according to any other annotation variable from their metadata.

Up to a certain point, these analyses differ depending on the raw data you start from (for example expression microarrays, full transcriptome RNA-seq data, or specialized RNA sequencing methods). Thus, we have one section per data type here, and include linkouts to standardized (snakemake) workflows wherever available, which will make it much quicker for you to get started with your analysis, and much easier to update and/or redo it later. However, after the initial data processing, differential expression analysis is usually performed with the same (statistical) methods. Thus, we have another section dedicated to differential expression analysis tools. Finally, transcripts (or genes) that have been found to be differentially expressed can be analyzed further---for example via gene set enrichment, pathway enrichment, or pathway perturbation analysis. We list some tools for these tasks in the section analysis of differentially expressed transcripts.

data generation / experimental design

A good recap of available research and recommendations is given in the section Designing better RNA-seq experiments of this review paper:

Stark, R., Grzelak, M. & Hadfield, J. RNA sequencing: the teenage years. Nat Rev Genet 20, 631–656 (2019). https://doi.org/10.1038/s41576-019-0150-2

number of biological replicates

You should always include biological replicates in any RNAseq analysis. Differential expression analysis tools will either not work at all without biological replicates, or the results will not be very useful. As Michael Love, the developer of DESeq2, stated, without biological replicates "you have no idea [of] the degree of within-group variability, so you can't perform statistical inference (determining if the difference you see is more than you would typically see among replicates within one condition)".

How many biological replicates to include depends on how sensitive your analysis needs to be and how heterogeneous you expect your replicates to be within sample groups. Anything below 3 samples is generally discouraged, as a dedicated study on biological replicate numbers that had low within-group variability/heterogeneity could only detect about 85% of expected differentially expressed genes with a fold change of 2 or higher. Thus, any more subtle fold changes will most likely be missed; the same study required at least 6 biological replicates to detect about 80% of expected differentially expressed genes with a fold change of 1.2 or above.

The general rules here are:

  1. If you want to detect differentially expressed genes with smaller effect sizes (lower fold changes), you will need more biological replicates.
  2. The more heterogeneity/variability you expect within a group of biological replicates, the more biological replicates you need. For example, in vitro experiments with biological replicates from the same cell line should allow for fewer biological replicates, while studies where samples within a group come from different patients will usually require more biological replicates per group.

sequencing depth

Generally, sequencing depth is not as important a factor as the number of biological replicates. The above-cited review article concludes that for eukaryotic genomes, around 10 - 30 million reads per sample give good results.

sequencing length

According to the above-cited review article, if you only want to do general differential expression analysis, single-end reads with a length of at least 50 base pairs should suffice. If you are also interested in the expression of different transcript isoforms, the detection of fusion gene transcripts or de novo transcript assembly, longer paired-end reads will improve the sensitivity of your analyses.

data types

RNA-seq data

TODO: provide good self-contained tutorial for working with RNA-seq data

There are several standardized snakemake workflows; the following ones are maintained by the Kösterlab and associated people:

microarray data

While fewer and fewer new microarray data sets are being generated, a lot of existing microarray data sets can be re-analyzed for new research questions. If you want to get started with such analyses, there is a great hands-on tutorial by Greg Tucker-Kellogg: Introduction to gene expression microarray analysis in R and Bioconductor. It includes an introduction to Bioconductor, works on publicly available data and is completely self-contained. You can work through it at your own pace and all it requires is some basic R knowledge---for the latter, see our section on learning R.

Once you are familiar with the data type and the analysis, there's a standardized snakemake workflow for microarray analysis by James Ashmore. While we haven't used this workflow ourselves (but might in the future, we'll try to update the info here if we do), this workflow is:

This manuscript also gives good and detailed recommendations for microarray analysis in general, so you can also consult it if you are just looking for help on certain analysis steps.

differential expression analysis tools

sleuth

TODO: add some info. Resources: sleuth walkthroughs for different analysis types.

DESeq2

TODO: add some info. Resources: extensive HTML documentation page for DESeq2; main bioconductor package page for DESeq2.

edgeR

TODO: add some info. Resources: extensive PDF user guide for edgeR; main bioconductor package page for edgeR.

analysis of differentially expressed transcripts

gene set enrichment

pathway enrichment

pathway perturbation

This section gathers tools for downloading data, which may include raw sequencing data, genes, genomes, annotations, etc.

tools for downloading from SRA, ENA, NCBI and Ensembl

fastq-dl

fastq-dl lets you download FASTQ data from the European Nucleotide Archive or the Sequence Read Archive repositories. It allows you to use any ENA/SRA accession number, for a Study, a Sample, an Experiment, or a Run.

ncbi-datasets: gathering data across NCBI databases

ncbi-datasets is a very well-documented open source tool from NCBI (National Center for Biotechnology Information) for gathering various kinds of data, including genes, genomes, annotations, etc. for a desired species. As a side note, it has specific options for retrieving SARS-CoV-2 data.

Installation

It is offered as a conda package and can be installed via:

conda create -n ncbi-download -c conda-forge ncbi-datasets-cli

Example usage

Retrieval of human reference genome:

datasets download genome taxon human --reference --filename human-reference.zip

Retrieval of genomic assemblies for a specific SARS-CoV-2 lineage with the host being homo sapiens:

datasets download virus genome taxon SARS-CoV-2 --complete-only --filename B.1.525.zip --lineage B.1.525 --host "homo sapiens"

There is a very nice illustration showing the overall structure of the data download functionality in the ncbi-datasets repository's documentation.

TCGA and TARGET datasets

GDC data portal

The Genomic Data Commons (GDC) data portal is a good place to start exploring the data available via The Cancer Genome Atlas Program (TCGA) and the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) programs. However, this interface does not really offer any easy ways of programmatically accessing and reanalysing this data. Thus, we provide some tools and resources below that should make this easier.

TCGAbiolinks is a well-documented bioconductor package for working with GDC data. In addition to the TCGA and TARGET projects, TCGAbiolinks also provides access to gene expression from healthy individuals via the public Genotype-Tissue Expression Project (GTEx) data. Check out the main bioconductor landing page for TCGAbiolinks for the original papers introducing the functionality and for links to the extensive documentation with walkthroughs. In addition, the TCGAWorkflow vignette shows how to use it for reanalysis of TCGA data.

Data formats and tools

This section gathers common bioinformatics data formats, where you can find specifications of them and which tools you can use to handle them sanely.

bioconvert: converting between bioinformatics file formats

There's a plethora of bioinformatics file formats, and you often need to convert from one to another. bioconvert is a tool that integrates as many of these conversions as possible. It's a community effort, so feel free to contribute a converter if the one you need doesn't exist yet, but you found (or implemented) a tool that can do it.

sequencing data (fastq)

These are tools for manipulating fasta and fastq files. For tools that can programmatically download fasta and fastq files from repositories, see the section on data download.

general purpose (swiss army knife) tools

seqkit has a lot of subcommands for most of the operations you might ever want to do on fastq or fasta files. It is really quick and well-documented.
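For example, a quick (hedged) sketch of two common tasks (the file names are placeholders):

# basic per-file statistics (number of reads, read lengths, ...)
seqkit stats reads.fastq.gz

# extract the first 100 records for a quick look
seqkit head -n 100 reads.fastq.gz > first_100_reads.fastq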

Splitting FASTQ data

fastqsplitter

Fastqsplitter is a tool that allows you to split FASTQ files into a desired number of output files evenly.

Installation of fastqsplitter via conda:

mamba create -n fastqsplitter -c bioconda fastqsplitter

Usage of fastqsplitter:

fastqsplitter -i input.fastq.gz -o output_1.fastq.gz -o output_2.fastq.gz -o output_3.fastq.gz

In the example above, fastqsplitter divides the input fastq into 3 evenly distributed fastq files, inferring this number from the output files specified by the user.

comma-separated or tab-separated value files (CSV/TSV)

tools

These files can also often be handled with standard linux command line tools. However, those usually do not respect header lines and require complicated extra code to deal with them. The following tools all respect header lines. However, always be aware of the trade-off between having lots of functionality (or specific functionality like a join command) and the cost in processing speed or memory usage that this might incur.

miller

Miller is a command-line tool for querying, shaping, and reformatting data files in various formats including CSV, TSV, JSON, and JSON Lines.

Or, to rephrase this: miller is the swiss army knife of CSV/TSV files, you should be able to do almost anything with it. It uses the linux pipe philosophy, streaming through input and output wherever possible, to also reduce memory usage. And it employs multithreading to a certain degree, especially for (de-)compression and for multiple commands in its then-chains.
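For example (a hedged sketch on a placeholder CSV file):

# pretty-print a CSV file as aligned, human-readable columns
mlr --icsv --opprint cat myfile.csv

# select two columns and then filter rows, chaining the two steps with `then`
mlr --csv cut -f sample,count then filter '$count > 10' myfile.csv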

xsv

xsv is not as versatile as miller, but blazingly fast. It also follows the linux pipe philosophy and streams wherever possible. If it can do what you are looking for, it is probably the fastest option out there.
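For example (again with placeholder file and column names):

# quick summary statistics for each column
xsv stats myfile.csv

# select columns by name
xsv select sample,count myfile.csv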

csvkit

csvkit can for example join multiple csv files by a common column at once, using csvjoin, whereas other tools will require pipe chaining of multiple calls to their join command for this. However, it is a lot slower than xsv.
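For example, a hedged sketch of such a join on a shared (placeholder) column:

csvjoin -c sample_id samples.csv measurements.csv > joined.csv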

YAML syntax

YAML is a human friendly language for the communication of data between people and computers. The language is rich enough to represent almost any conceivable data structure or object inside a running computer program, as a readable plain text file. This type of language is often called a “serialization language” and YAML is definitely one of those. But unlike most serialization languages, the YAML format tries hard to make simple things really simple.

This is how the YAML data project puts it, and it's spot on. Thus, for anything where we need configuration files that should be readable by both humans and machines, we use the YAML language.

learning YAML

debugging YAML

For a very strict yaml linter, check out yamllint. It is available via the conda-forge channel, so you can install it into its own yamllint conda/mamba environment with:

mamba create -n yamllint yamllint
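You can then activate the environment and run it on any YAML file (the file path is a placeholder):

mamba activate yamllint
yamllint path/to/config.yaml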

YAML specification

All the gory details are in the official specification.

Data visualization

If you are interested in a conceptual framework for data visualizations, read up on the grammar of graphics. Otherwise, jump right into one of the topics covered here.

color palettes

Color palettes are a central tool in data visualization. Thus, it's important to learn which color scale to use when. And for any visualization using color, please make sure to use colorblind safe palettes:

color blind friendly palettes

If you just want a colorblind-safe color scale, here are some great resources:

If you're looking for a thorough introduction to the topic, datawrapper has you covered with a three-part blog series:

  1. How your colorblind and colorweak readers see your colors
  2. What to consider when visualizing data for colorblind readers
  3. What’s it like to be colorblind

grammar of graphics

Basically, a grammar of graphics is a framework which follows a layered approach to describe and construct visualizations or graphics in a structured manner.

This is how Dipanjan Sarkar puts it in his introduction to the grammar of graphics on towardsdatascience. While the most popular implementation of this concept is in the ggplot2 package (part of the tidyverse in R), implementations in other languages have followed. For example, the above linked blog post uses the python implementation in the package plotnine.

streamlit.io

For building apps and demos to showcase your project and/or project results, this Python library makes it very easy to implement an app in a straightforward way: https://streamlit.io/

Using Streamlit:

  • Project reports can be presented with sidebars, widgets, etc., and plots can be inserted (support for Vega-Lite, Graphviz and many others)
  • Algorithm development could be showcased to non-bioinformaticians with some mockup information
  • A tutorial to create your first app: https://docs.streamlit.io/library/get-started/create-an-app

Languages

In this section, we'll gather useful resources for different programming languages. Each subsection starts out with learning resources to get you started.

python

Python is a programming language that lets you work quickly and integrate systems more effectively.

That is how python.org puts it. We can maybe add that python comes with a great Open Source community around it, including a large package ecosystem with packages for everything from data science to bioinformatics and machine learning. We suggest some below, but whenever you need to do something in python (like parsing a particular file format or running some machine learning method), it usually makes sense to first check whether there is a package that will provide you with all the functionality for this.

learning programming with python

The software carpentries provide a great Programming with Python course that you can also work through self-paced. It includes linkouts for how to set up for the course, so you should be able to get yourself up and learning in no time.

python coding style

python is known for its rigorous style guide, some of which (like indentation) is also rigorously enforced. There's a lot to learn from the style guide, so it's well worth a read!

pandas: the standard for tabular data

pandas is the current de facto standard package for working with tabular data (a.k.a. 95% of all of data science / bioinformatics). Its documentation provides a number of resources for Getting Started, just see which material fits you best (for example, have you used R or Excel in the past...?).

Generally, there are often multiple ways of achieving the same thing with pandas. To be consistent and make your life a bit easier, there's a suggestion for Minimally Sufficient Pandas, a subset of all pandas functionality. It makes sense to familiarize with that subset early on while learning to use pandas.

polars: the future for tabular data

There's also a newer python library for tabular data, called polars. It has a more efficient implementation and a cleaner usage syntax. However, it is not as widely adopted and will not work on some older computers, so the current go-to recommendation is still pandas. And while polars is probably the future for tabular data in python, a transition from pandas to polars won't be too difficult.

R

learning programming with R

If you are new to programming and want to start learning using R, the free online book Hands-On Programming with R (HOPR) is a great place to start. It really starts from scratch, including information on how to install R and Rstudio and getting you acquainted with basic principles of programming through hands-on exercises.

quick introduction to R

If you have programmed or written scripts before, this quick Intro to R by Greg Tucker-Kellogg should get you up to speed in a few hours. It is a self-contained course that you can easily do at your own pace. And it introduces all the necessary basics for starting to work with R. For installation, I nevertheless recommend the information on how to install R and Rstudio from HOPR.

Stat545: Data wrangling, exploration, and analysis with R

If you are looking for a very detailed and granular introduction to data analysis with R, STAT545: Data wrangling, exploration, and analysis with R by Jenny Bryan et al. provides a thorough and accessible course. You can work through it self-paced.

interactive learning via swirl

swirl offers interactive learning within the R command line, from beginner to advanced levels, and is very user friendly.

R4DS: R for Data Science

If you have programmed before and want to learn how to use R for data analysis, the free online book R for Data Science is all you need. It introduces both concepts and tools in a very accessible way, including example code and coding exercises. It is based on the tidyverse packages.

tidyverse

Use the tidyverse for modern, vectorized, and efficient code, with excellent documentation (and cool package badges;). If you want a thorough introduction, the R4DS book is there for you. And if you want to jump right in, here's a table of the most important packages and their online documentation, in the order that you will probably most often use them in a project:

important packages

| name | online docs (cheatsheets!) | purpose |
| --- | --- | --- |
| readr | https://readr.tidyverse.org/ | read in (rectangular) data (csv, tsv, xlsx, ...) |
| tidyr | https://tidyr.tidyverse.org/ | get data into tidy format |
| stringr | https://stringr.tidyverse.org/ | string manipulation |
| dplyr | https://dplyr.tidyverse.org/ | transform data (in tidy format) with sql-like syntax |
| ggplot2 | https://ggplot2.tidyverse.org/ | easy and versatile plotting with the grammar of graphics |

jump right in

If you do want to jump right in, depending on your previous exposure to the tidyverse, I recommend to at least:

  1. Get all the cheatsheets and have those ready on your machine or even printed out. Either from the online documentation linked above, or from the Rstudio compilation of all tidyverse cheatsheets.
  2. Read through the introductions of the online vignettes above, and the conceptual vignettes that they link to. They do a very good job at explaining basic concepts that will make writing analyses with the tidyverse a lot easier and more intuitive.

The most important concepts are probably:

  1. Always get your data into a tidy format first. How tidy data should look and why, is described in this tidy data vignette.
  2. When working in the tidyverse, you will always use pipes. There is a good introductory chapter on pipes in the R4DS online book.
  3. There’s also a vignette on how to work with two tables. For example, if you want to filter one table to only include entries also present in a second table. Or if you want to use a second table, to add annotation fields to a first table.

style guide

https://style.tidyverse.org

A detailed and well-written style guide for using R with the tidyverse. It covers topics like good names, syntax recommendations and best practices for error messages. Many of the things covered here can also be applied in other programming languages.

styler: automatic formatting

https://styler.r-lib.org

Automatic formatting according to tidyverse style guide. This is also available for VS Code via the R (or REditorSupport) extensions.

Rstudio: an integrated development environment tailor-made for R

Rstudio is an IDE dedicated to R. It is probably the most comprehensive IDE for this language, and it makes it very easy to get going with a productive workflow. You can get the free desktop version for your operating system on posit's Rstudio download page.

In Rstudio, you have a console with an interactive R session on the bottom left (like running R on the command line). Here, you can always conjure up the help information of any function by putting a question mark in front of it (for example ?mutate()) and hitting enter. This will open the respective help page in the bottom right window. There, you will first find a description of the function and each of its arguments, and further down some quick examples on tibbles (data frames) that are immediately usable in the interactive session. So just copy and paste them, and start modifying them to your needs.

If you are using Ubuntu as your operating system and want to install R (and R packages) through conda, here's a useful trick:

alternative IDE: R (REditorSupport) extension for VS Code

https://github.com/REditorSupport/vscode-R

Provides full language support for R in VS Code, including code completion, formatting, syntax highlighting, plot and data views, etc. This should make VS Code configurable in the same way that RStudio works.

debugging / error handling

For debugging, you want to have error messages that are as informative as possible. There are two types of situations where you can improve the quality of the error messages you (and others) get:

  1. Someone else's code is throwing an error and you want a clear backtrace of the error. Here, you can use the tidyverse's rlang::global_entrace() function to turn any uncaught error into an error with a proper backtrace. You can do this either in a particular script (probably at the top), or in your global .Rprofile file.
  2. You are writing your own code and can create more informative error messages (so users might not even have to read through the backtrace to identify their problem!). Here, the recommendation in the tidyverse style guide points to the cli package that provides nice functionality for useful error messages via cli_abort.

Rust

Rust is a good language to write (bioinformatics) software in, with a focus on efficiency and reliability. You will be able to analyze large (bioinformatics) data sets very quickly, with inherently safe memory management and easy parallelization. It has extensive documentation and a very active user community.

learning Rust

The language website already has a lot of resources, for example:

And then, there is a public 4-day course for learning Rust, called "Comprehensive Rust". While this does require that you already have some programming experience and is originally meant as a course to be taught, you can nevertheless work through it at your own pace.

Working with reference data

This section collects resources that help you work with reference data, be it reference genomes, transcriptomes or different types of annotations.

Liftover between reference genomes

Tasks like read mapping and variant calling are executed with respect to a reference genome build or version. For Homo sapiens there are currently two commonly used builds: GRCh37 (hg19) and GRCh38 (hg38). Sometimes you need to convert (liftover) coordinates from one reference build to another, for example to be able to use coordinate-specific annotations or to correctly compare coordinate-specific results files. Or you might want to do something funkier, like lifting over coordinates between different species. Here, we collect tools and resources for different coordinate-specific formats and how to use them.

chain files

A chain file describes a pairwise alignment between two reference assemblies. (From the CrossMap docs) All the tools mentioned below use a chain file for the liftover process, so you will need one for your pair of reference genomes.

downloading chain files

The CrossMap documentation on chain files has a good link collection with chain files available from Ensembl and UCSC. The bcftools +liftover paper suggests that the UCSC chain files are more comprehensive than the Ensembl chain files:

For the liftover from GRCh37 to GRCh38, we did not use the Ensembl chain file (GRCh37_to_GRCh38.chain.gz) generated from Ensembl assembly mappings (http://github.com/Ensembl/ensembl/) as this resulted in a much higher rate of variants dropped compared with using the UCSC chain file.

creating chain files

These tools are untested, but we document them here to try out in the future. Please report back to this knowledge base on their usage if you come to try them out:

tools

VCF liftover

picard LiftoverVcf

For VCF files, Picard offers the command LiftoverVcf.

Tested usage for a reference fasta and a callset vcf:

mamba create -n picard picard
mamba activate picard
picard CreateSequenceDictionary  -R {input.reference} -O {input.reference}.dict
picard LiftoverVcf  -I {input.callset}  -O {output} --CHAIN {input.liftover_chain} --REJECT {output}_rejected_variants.vcf -R {input.reference}

bcftools +liftover plugin

This looks like a very thorough tool to lift over indels (in addition to SNVs, which all tools are doing OK), judging from the great paper introducing the bcftools +liftover plugin. While installation is not easily automated yet, it is definitely worth trying out once it is integrated as a plugin into bcftools itself. Then, it should be pretty straightforward to conduct a liftover.

other format liftover

CrossMap

CrossMap allows for a wider range of formats to be lifted over, including SAM/BAM/CRAM, VCF, BED, GFF/GTF, MAF, BigWig and Wiggle format. It is available for installation via bioconda and has very clear usage instructions, with detailed information for the different formats it supports. However, analyses by the authors of bcftools +liftover and transanno suggest that the VCF/BCF-specific tools perform better on this data type.

Statistics

For a general understanding of Probability, Mathematical Statistics, Stochastic Processes, I often turn to Random on randomservices.org, a project by Kyle Siegrist. It has great explanations and little apps to play around with certain processes and probability distributions. Especially the explanations of probability distributions are very useful and often much better than what you can find on Wikipedia.

bayesian statistics

You want a basic understanding or intuition of bayesian statistics (as opposed to frequentist statistics)? The presentation Bayesian Statistics (a very brief introduction) by Ken Rice could be a good starting point.

learning bayesian statistics with R

There is Bayes Rules! An Introduction to Applied Bayesian Modeling by Alicia A. Johnson, Miles Q. Ott, and Mine Dogucu. They provide a thorough introduction to bayesian statistics, all with exercises in R (even including their own dedicated R package bayesrule to accompany the book).

A slightly shorter introduction to bayesian statistics in R can be found in the Bayesian statistics chapter of the book Learning Statistics with R by Danielle Navarro. It is aimed at a psychology audience, but the examples should all be generic enough to follow.

Or you could check out this Bayesian models in R r-bloggers post by Francisco Lima. It also has R code to follow along.

learning bayesian statistics with python

You would rather learn bayesian statistics using python? Look no further, here's Think Bayes 2 by Allen B. Downey. It comes with a Jupyter Notebook per chapter, so you can directly run (and play with) the code in the book.

Causal inference and Bayesian statistics

Richard McElreath has a nice introduction to causal inference and applying it to Bayesian methods called Statistical Rethinking. It is also available as a set of lectures on youtube, updated annually. There is also an R package rethinking to go with the book.

frequentist statistics

learning statistical modeling

Susan Holmes and Wolfgang Huber have written the very comprehensive book Modern Statistics for Modern Biology, and you can read it all online for free. Most importantly though, it has example R code and exercises to follow along. Thus, if you want to learn all about frequentist statistics---with application in modern biology---this is the perfect resource.

all the hypothesis tests

The Handbook of Biological Statistics by John H. McDonald has great explanations of all the statistical tests you could ever need in biomedical research. And it has this amazing tabular overview of all of the statistical tests, so you can easily choose the one you need for your data set at hand.

If you want to understand a bit more about how the theory behind a number of these tests is very similar (and thus not all too hard to actually understand), you can check out Common statistical tests are linear models by Jonas Kristoffer Lindeløv. The article has a great summary table and provides example code in R. So it's also a great resource if you just want quick example code for running a particular statistical test in R.

Workflow management

Snakemake

Snakemake is a workflow management system to create reproducible and scalable data analyses in a human readable, Python based language. You define individual analysis steps as rules with a defined input and output, where dependencies are defined implicitly when the input of one rule matches the output of another rule.

If you want to get to know its core features, there is a great snakemake tutorial that is easy to set up and has exercises to work through at your own pace. And once you have a general understanding of how snakemake works, you could for example try to use one of the standardized snakemake workflows.

Snakemake workflows: reusable analysis workflows

The snakemake module system allows for flexibly combining different existing workflows for more complex analyses. It is also used for the deployment of standardized workflows that you can find in the snakemake workflow catalog. Just click on the Usage button of any of the listed standardized workflows to get a walkthrough of how to set up this workflow for your own data analysis (for example the dna-seq-varlociraptor workflow Usage).

Snakemake wrappers: reusable snakemake rules

Snakemake wrappers are reusable snakemake rules. They wrap one command of a command line tool to provide a specific output from a provided input file. You can simply copy-paste such a wrapper into your own workflow and use it. Just check out the compendium of wrappers and meta-wrappers (copy-pastable collections of wrappers to perform multiple analysis steps). And if the tool or command you are looking for is not yet on there, consider contributing a new snakemake wrapper, once you have got the tool working in your own workflow.

Snakemake debugging tricks

Saving time on DAG building and resolution

To quickly debug a particular rule, specify the output of that rule as the desired output of your snakemake run. To avoid clashes with command line argument specifications, it is best to provide the desired output file as the 1st argument right after snakemake, e.g.:

snakemake path/to/wanted_output.txt --jobs 5 --use-conda

language-specific debugging

R

logging

Standard code we use for redirecting any stderr output to the log: file specified in the rule calling the R script:

log <- file(snakemake@log[[1]], open="wt")
sink(log)
sink(log, type="message")

Also, if you are looking to have proper backtraces even for unexpected errors (errors not properly handled in your code or in a package you load), you can use:

rlang::global_entrace()

You will need to have the package rlang installed, but this for example comes with the tidyverse. For info on the function, see the documentation: https://rlang.r-lib.org/reference/global_entrace.html Also, this is not expected to incur a performance reduction.

interactive debugging

You can save the entire current state of a workspace in R, so you can insert this right before some code triggers an error:

save.image(file = "my_dump.RData")

In an interactive R session, first load all the library()s that you need for the script. Then you can load the full workspace and interactively explore / debug what’s going on:

load("my_dump.RData")

python

logging

Standard code we use for redirecting any stderr output to the log: file specified in the rule calling the python script:

import sys
sys.stderr = open(snakemake.log[0], "w")

interactive debugging

Inserting the following into a python script executed in a snakemake rule will throw you into an interactive python debugging shell where you have access to the current state, including locals() and globals() variables:

import pdb; pdb.set_trace()

Working on servers

This section collects resources that are helpful for working on servers. For example how to log into servers via ssh, how to create persistent shell sessions so that you can leave things running when you have to log out, and how to get data onto and off of servers.

persistent screen sessions

A number of tools allow you to set up persistent bash sessions on a server (or another computer) that can continue to run when you log off / disconnect from them and that can have multiple windows (shell sessions / terminals) open at once. Thus, commands (and terminal session histories) will be kept alive even when logged out, and you can easily re-attach to a session that is still running. Here we introduce some that are commonly used.

screen

screen is one such tool for persistent shell sessions that you can detach from, and reattach to later, keeping them running in the background.

Essential instructions

The bare essentials are described in this guide:

And the most important command line arguments and keystrokes can be found in this reference:

And if you are looking for a more detailed intro, check out the Getting Started section of screen's online documentation.
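As a quick, hedged summary of the most common screen commands and keystrokes (the session name is a placeholder):

# start a new named session
screen -S my_session
# detach from the running session: press Ctrl+A, then D
# list all running sessions
screen -ls
# re-attach to the named session
screen -r my_session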

tmux

tmux stands for 'terminal multiplexer'. It is also meant for running commands and workflows in persistent terminal sessions that you can log out of.

Essential instructions

  1. Create a new session with a name:
tmux new -s test
  2. Detach from the session you created by pressing Ctrl+B, releasing Ctrl and B, and then quickly pressing D---so it's basically Ctrl+B followed by D. Releasing Ctrl and B is important so that you detach from the session instead of killing it (yes, it happens). The message should indicate:
[detached (from session test)]
  3. List the sessions that exist:
tmux ls
  4. Then, to get back into the newly created session:
tmux attach -t test

Other shortcuts that can be useful

Ctrl+B [    # enter scroll mode
q           # exit scroll mode

ssh

ssh allows for secure remote access to servers or other computers over insecure networks (including the internet).

getting started with ssh

While there is extensive documentation on ssh commands and related commands such as scp (for copying files using the ssh protocol), the following Wiki page will probably get you started pretty quickly:

ssh keys

ssh key files are a great way to set up safe ssh connections, and if properly set up with your operating system keychain, you can save on a lot of password typing. There's a great guide for generating and setting up ssh keys that you can quickly work through step by step. No need to remember any of this by heart, just look it up when you need it.
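As a rough sketch (the key comment and server address are placeholders), generating a key pair and copying the public key to a server usually boils down to:

# generate a new key pair (ed25519 is a good modern default)
ssh-keygen -t ed25519 -C "your_email@example.com"

# copy the public key to the server, so that future logins can use the key
ssh-copy-id <user>@<server_address>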

data transfer onto and from linux servers

FTP with Filezilla

Filezilla gives you a nice graphical user interface for data transfer, if you have an (S)FTP server interface available. You connect very easily and have a sane way of transferring data with a tree- and folder-view for both source and destination. And you can monitor the transfer while it is happening, and flexibly configure an overwrite policy if necessary. One important advantage is that you can install and use Filezilla on any operating system.

In general, any transfer should usually work with the following steps:

  1. Install Filezilla: https://filezilla-project.org/
  2. Connect to the server via Quickconnect in the top bar:
  • Host: (s)ftp://<server_address>
  • Username: <user_name> (optional, only if authentication is necessary)
  • Password: <password> (optional, only if authentication is necessary)
  • Port: <port> (optional, only if a non-standard port is specified in the server's documentation)
  3. On both panels, navigate to the folders (locally where the data is, remotely where it should go).
  4. Highlight the files you want to transfer.
  5. Right-click and select Upload.

rsync (via ssh)

If you have a Unix shell / linux command line available on both machines that you are trying to transfer data between, this is usually the easiest and most robust way of transferring data. You can quickly transfer full folders recursively, with rsync optimizing the transfer speed and keeping things like modification timestamps intact. It is always recommended to run transfer commands in a persistent shell session, so that your transfer keeps running even when your connection is interrupted or your terminal is closed unexpectedly.

And here are standard rsync commands you can use:

One to take the folder at /abs/src/path/folder and transfer it to /abs/dest/path/folder is (assuming, you are on the src machine):

rsync -avP /abs/src/path/folder <user>@<dest_server_address>:/abs/dest/path/folder

The command line options used here are:

  • -a: short for --archive mode, which sets -rlptgoD (no -H,-A,-X):
    • -r (--recursive): recursively copy sub-directories
    • -l (--links): copy symlinks as symlinks
    • -p (--perms): preserve permissions
    • -t (--times): preserve modification times
    • -g (--group): preserve the group
    • -o (--owner): preserve the owner (if you are the super-user)
    • -D: summary flag for:
      • --devices: preserve device files (super-user only)
      • --specials: preserve special files
  • -v (--verbose): produce more verbose output, useful for debugging when errors occur
  • -P: summary flag for:
    • --partial: keep partially transferred files
    • --progress: show progress during transfer

Also, if you want to rename the folder during the transfer, you just add a trailing / to the source path:

rsync -avP /abs/src/path/folder/ <user>@<dest_server_address>:/abs/dest/path/new_folder

Then everything within the source folder gets copied to the destination folder.

The same commands as given above can also be run to download things from a server to a local machine. In that case, you simply make the server address and path your source, and your local path your destination:

rsync -avP <user>@<server_address>:/abs/remote/path/folder /abs/local/path/folder

And either of the local paths can also be a relative path. So if you are for example in a folder /home/user42/ that holds a subfolder folder/, you can do:

rsync -avP folder <user>@<dest_server_address>:/abs/dest/path/folder