linux tools

Any Linux distribution comes with a set of tools for the command line. In bioinformatics (and data science in general), the tools that process plain text files line by line are especially useful. We therefore introduce a number of these tools here and link to useful tutorials and documentation for the more complex ones. You can usually also get detailed documentation for any of these tools directly on the command line: write man followed by the tool's name to get its full manual, for example man head. You can scroll up and down and leave the manual with q to drop back to the command line (the controls are the same as in the command-line text file viewer less).

common command-line arguments

Most command-line tools support a few standard command-line arguments. Arguments usually come in a short version (one dash - and a single letter) and a long version (two dashes -- and a full word):

-h/--help: display a help message
--version: display the version of the tool that you have installed
--verbose: provide more verbose output, oftentimes useful for debugging problems
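
As a small illustration of short and long versions of the same argument, the sketch below calls head once with -n and once with --lines. It assumes GNU coreutils (the default on most Linux distributions); other implementations of head may only support the short form and may not offer --help or --version.

```shell
# Short and long version of the same argument (assumes GNU coreutils head).
short_out=$(printf 'a\nb\nc\n' | head -n 2)
long_out=$(printf 'a\nb\nc\n' | head --lines=2)
echo "$short_out"
echo "$long_out"

# Ask the tool for its installed version and for its help message.
version_text=$(head --version)
help_text=$(head --help)
echo "$version_text"
```

Both calls to head should print the same two lines, because -n and --lines are two spellings of one argument.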

`awk`: line by line editing of tabular plain text files

awk is a great tool to work on plain-text tabular files (for example tab-separated or comma-separated files, .tsv and .csv files, respectively). It provides you with direct accessors to individual columns (for example $1 for the first column) and with lots of functionality to work with them.
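
A minimal sketch of the column accessors, using a made-up tab-separated file (the file name, gene names and lengths are purely illustrative):

```shell
# Create a small tab-separated example file with made-up values.
tmpdir=$(mktemp -d)
printf 'name\tchrom\tlength\n'  > "$tmpdir/genes.tsv"
printf 'geneA\tchr1\t1200\n'   >> "$tmpdir/genes.tsv"
printf 'geneB\tchr1\t800\n'    >> "$tmpdir/genes.tsv"
printf 'geneC\tchr2\t1500\n'   >> "$tmpdir/genes.tsv"

# -F'\t' sets the tab as column separator; NR > 1 skips the header line.
# Print the first and third column of every remaining line.
awk_cols=$(awk -F'\t' 'NR > 1 { print $1, $3 }' "$tmpdir/genes.tsv")
echo "$awk_cols"

# Only keep lines where the third column is larger than 1000.
awk_long=$(awk -F'\t' 'NR > 1 && $3 > 1000 { print $1 }' "$tmpdir/genes.tsv")
echo "$awk_long"

# Sum up the third column and print the total once all lines are read.
awk_sum=$(awk -F'\t' 'NR > 1 { total += $3 } END { print total }' "$tmpdir/genes.tsv")
echo "$awk_sum"

rm -r "$tmpdir"
```

For comma-separated files, the same commands should work with -F',' as long as no field contains a quoted comma.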

`grep`: line by line plain text file searching

grep allows you to search for strings and regular expressions in text files. It returns every line that contains a match. For example, grep "gene" myfile.txt would return every line containing the string gene in myfile.txt.
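
The following sketch shows a few common variations on a made-up example file (its contents are purely illustrative): plain string search, counting matches, ignoring case, and a regular expression.

```shell
# Create a small example file with made-up content.
tmpfile=$(mktemp)
printf 'gene\tabc1\nexon\tabc1\ngene\txyz2\nGENE\tqrs3\n' > "$tmpfile"

# All lines containing the string "gene" (case-sensitive).
grep_lines=$(grep "gene" "$tmpfile")
echo "$grep_lines"

# -c counts the matching lines instead of printing them.
grep_count=$(grep -c "gene" "$tmpfile")

# -i ignores upper and lower case, so "GENE" matches as well.
grep_count_i=$(grep -c -i "gene" "$tmpfile")

# -E enables extended regular expressions; ^ anchors the match at the line start.
grep_regex=$(grep -E "^exon" "$tmpfile")
echo "$grep_count $grep_count_i"

rm "$tmpfile"
```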

`head` or `tail`: first or last lines of a file

head -n 5 myfile.txt will display the first five lines of the plain text file myfile.txt. tail -n 5 myfile.txt will display the last five lines of the plain text file myfile.txt.
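
Because both tools read from standard input when no file is given, they can be combined with a pipe. A minimal sketch on a generated example file containing the numbers 1 to 20, one per line:

```shell
# Create an example file with the numbers 1 to 20, one per line.
tmpfile=$(mktemp)
seq 1 20 > "$tmpfile"

# First and last five lines.
first_five=$(head -n 5 "$tmpfile")
last_five=$(tail -n 5 "$tmpfile")

# Lines 6 to 10: take the first ten lines, then the last five of those.
middle=$(head -n 10 "$tmpfile" | tail -n 5)

# tail -n +2 starts at line 2, which is handy for skipping a header line.
without_first=$(tail -n +2 "$tmpfile" | head -n 3)

echo "$middle"
rm "$tmpfile"
```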

`htop`: interactive processes viewer

htop is an interactive process viewer for the terminal that allows you to look at the resource usage of individual processes and of the entire computer. This can be useful both locally and when working on a remote server.

`htop` on machines with lots of CPUs

Sometimes, servers have so many CPUs that the default display of htop, which shows one CPU meter for every CPU, takes up the entire screen, so that you cannot see the process table at all. If your htop --version is >=3.2.2, you can simply hit the # hotkey to toggle all of the meters at the top on and off. If your version is older, you can easily install the latest htop with conda if you have the conda-forge channel enabled, for example with: conda create -n htop htop.

If you do want to keep the meters, but want to collapse all the CPU meters into one average CPU meter, you can consider editing your ~/.config/htop/htoprc configuration. Just open it with your editor of choice and make sure that you remove any reference to LeftCPUs and RightCPUs and replace it with CPU. The four lines regarding the meters could for example look like this:

left_meters=CPU Memory Swap
left_meter_modes=1 1 1
right_meters=Tasks LoadAverage Uptime
right_meter_modes=2 2 2

`less`: quick look at text files

less is a tool for quickly looking at text files on the command line. Conjure it up with less myfile.txt, scroll up and down with the arrow and page up/down keys, search for something like thisword with /thisword (followed by enter) and quit by hitting q.

`sed`: line by line plain text file editing

sed is a great tool to work on plain-text files line by line. For example, you can easily search and replace patterns using regular expressions.
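
A minimal search-and-replace sketch on a made-up example file (the contents are purely illustrative). By default, sed prints the edited lines to standard output and leaves the file itself unchanged.

```shell
# Create a small tab-separated example file with made-up values.
tmpfile=$(mktemp)
printf 'chr1\t100\t200\nchr2\t300\t400\n' > "$tmpfile"

# s/pattern/replacement/ replaces the first match on each line.
# Here: remove a "chr" prefix at the start of the line.
sed_stripped=$(sed 's/^chr//' "$tmpfile")
echo "$sed_stripped"

# -E enables extended regular expressions; \1 refers to the first
# parenthesized group, here the chromosome number.
sed_renamed=$(sed -E 's/^chr([0-9]+)/chromosome_\1/' "$tmpfile")
echo "$sed_renamed"

rm "$tmpfile"
```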

sed tutorial for getting started

further resources

useful bioinformatics one-liners: these mostly use only basic Linux tools to achieve common bioinformatics tasks

A collection of resources for data scientists (not only) in bioinformatics