Workflow

Single cell data analysis was performed as outlined by Lun et al. (2016), using Scran and Scater. After cell and gene filtration (see Filtration), raw counts were normalized with Scater using a linear scaling approach. In the following, the covariates G1, G2M were considered to cause technical variation. Per-gene biological variance was estimated by fitting a mean-variance trend model using above covariates as design matrix (see hvg.tsv). Subsequently, the covariates were used to remove batch effects from normalized expressions. Finally, highly variable genes/transcripts (HVGs) were extracted from the trend model and their pairwise correlation was analyzed (see Results for details).

Detailed software versions can be found under Rules

Results

Dimension Reduction

hvg-pca.test-condition.svg

PCA plot constructed from normalized log-expression values of correlated HVGs, where each point represents a cell.

First, second, and third components are shown, along with the percentage of variance explained. Bars represent the coordinates of the cells on each axis. Cells were colored by test-condition. See Lun et al. (2016).

hvg-tsne.test-condition.perp=10.seed=23213.svg

t-SNE plot constructed from normalized log-expression values of correlated HVGs, where each point represents a cell.

Cells were colored by test-condition. T-SNE visualizations can be misleading, due to parameter choices and the fact that it is non-deterministic. We therefore run t-SNE for different perplexities and random seeds. This plot shows results for perplexity=10 and seed=23213. T-SNE results will only be usable for you, if clustering is similar for different seeds. See here for additional details.

hvg-tsne.test-condition.perp=10.seed=789789.svg

t-SNE plot constructed from normalized log-expression values of correlated HVGs, where each point represents a cell.

Cells were colored by test-condition. T-SNE visualizations can be misleading, due to parameter choices and the fact that it is non-deterministic. We therefore run t-SNE for different perplexities and random seeds. This plot shows results for perplexity=10 and seed=789789. T-SNE results will only be usable for you, if clustering is similar for different seeds. See here for additional details.

hvg-tsne.test-condition.perp=10.seed=897354.svg

t-SNE plot constructed from normalized log-expression values of correlated HVGs, where each point represents a cell.

Cells were colored by test-condition. T-SNE visualizations can be misleading, due to parameter choices and the fact that it is non-deterministic. We therefore run t-SNE for different perplexities and random seeds. This plot shows results for perplexity=10 and seed=897354. T-SNE results will only be usable for you, if clustering is similar for different seeds. See here for additional details.

hvg-tsne.test-condition.perp=20.seed=23213.svg

t-SNE plot constructed from normalized log-expression values of correlated HVGs, where each point represents a cell.

Cells were colored by test-condition. T-SNE visualizations can be misleading, due to parameter choices and the fact that it is non-deterministic. We therefore run t-SNE for different perplexities and random seeds. This plot shows results for perplexity=20 and seed=23213. T-SNE results will only be usable for you, if clustering is similar for different seeds. See here for additional details.

hvg-tsne.test-condition.perp=20.seed=789789.svg

t-SNE plot constructed from normalized log-expression values of correlated HVGs, where each point represents a cell.

Cells were colored by test-condition. T-SNE visualizations can be misleading, due to parameter choices and the fact that it is non-deterministic. We therefore run t-SNE for different perplexities and random seeds. This plot shows results for perplexity=20 and seed=789789. T-SNE results will only be usable for you, if clustering is similar for different seeds. See here for additional details.

hvg-tsne.test-condition.perp=20.seed=897354.svg

t-SNE plot constructed from normalized log-expression values of correlated HVGs, where each point represents a cell.

Cells were colored by test-condition. T-SNE visualizations can be misleading, due to parameter choices and the fact that it is non-deterministic. We therefore run t-SNE for different perplexities and random seeds. This plot shows results for perplexity=20 and seed=897354. T-SNE results will only be usable for you, if clustering is similar for different seeds. See here for additional details.

hvg-tsne.test-condition.perp=5.seed=23213.svg

t-SNE plot constructed from normalized log-expression values of correlated HVGs, where each point represents a cell.

Cells were colored by test-condition. T-SNE visualizations can be misleading, due to parameter choices and the fact that it is non-deterministic. We therefore run t-SNE for different perplexities and random seeds. This plot shows results for perplexity=5 and seed=23213. T-SNE results will only be usable for you, if clustering is similar for different seeds. See here for additional details.

hvg-tsne.test-condition.perp=5.seed=789789.svg

t-SNE plot constructed from normalized log-expression values of correlated HVGs, where each point represents a cell.

Cells were colored by test-condition. T-SNE visualizations can be misleading, due to parameter choices and the fact that it is non-deterministic. We therefore run t-SNE for different perplexities and random seeds. This plot shows results for perplexity=5 and seed=789789. T-SNE results will only be usable for you, if clustering is similar for different seeds. See here for additional details.

hvg-tsne.test-condition.perp=5.seed=897354.svg

t-SNE plot constructed from normalized log-expression values of correlated HVGs, where each point represents a cell.

Cells were colored by test-condition. T-SNE visualizations can be misleading, due to parameter choices and the fact that it is non-deterministic. We therefore run t-SNE for different perplexities and random seeds. This plot shows results for perplexity=5 and seed=897354. T-SNE results will only be usable for you, if clustering is similar for different seeds. See here for additional details.

Filtration

50-highest-genes.svg

Percentage of total counts assigned to the top 50 most highly-abundant features. For each feature, each bar represents the percentage assigned to that feature for a single cell, while the circle represents the average across all cells. Bars are coloured by the total number of expressed features in each cell, while circles are coloured according to whether the feature is labeled as a control feature.

This should generally be dominated by constitutively expressed transcripts, such as those for ribosomal or mitochondrial proteins. The presence of other classes of features may be cause for concern if they are not consistent with expected biology. For example, a top set containing many spike-in transcripts suggests that too much spike-in RNA was added during library preparation, while the absence of ribosomal proteins and/or the presence of their pseudogenes are indicative of suboptimal alignment. See Lun et al. (2016).

avg-counts.svg

Histogram of mean counts for features (genes/transcripts) across all cells. The blue dashed line depicts the threshold for classification as low expression feature. Low expression features are subsequently removed from the analysis.

cell-filtering.tsv

Cells were filtered as suggested by Lun et al. (2016) by removing by removing outliers regarding

  • library size,
  • number of expressed features,
  • percentage of reads mapping to mitochondrial genes,
  • percentage of reads mapping to spike-ins.

Outliers are defined as occuring beyond 3 median absolute deviations (MADs) of the median.

Highly Variable Genes

hvg-clusters.svg

Larger sets of correlated features were assembled by treating features as nodes in a graph and each pair of features with significantly large (FDR ≥ 0.05) correlations as an edge. Clusters in this graph represent a set of correlated features. See Lun et al. (2016).

hvg-corr-heatmap.svg

Heatmap of mean-centred normalized log-expression values for significantly (FDR ≥ 0.05) correlated HVGs. Dendrograms were formed by hierarchical clustering on the Euclidean distances between features (rows) or cells (columns). See Lun et al. (2016).

hvg-correlations.tsv

Table of pairs of correlated HVGs.

Correlations between genes are quantified by computing Spearman's rho, which accommodates non-linear relationships in the expression values. See Lun et al. (2016).

hvg-expr-dists.svg

Violin plots of normalized log-expression values for the top 20 HVGs. Each point represents the log-expression value in one cell.

hvg.tsv

Genes/transcripts identified as highly variable (HVGs).

We identify HVGs to focus on the genes that are driving heterogeneity across the population of cells. This requires estimation of the variance in expression for each gene, followed by decomposition of the variance into biological and technical components. HVGs are then identified as those genes with the largest biological components. This avoids prioritizing genes that are highly variable due to technical factors such as sampling noise during RNA capture and library preparation. See Lun et al. (2016).

Variance was estimated by fitting a mean-variance trend to endogenous features, assuming that most of them do not vary across cells.Then, we considered all genes with a minimum average difference in true (biological) log2 expression of 0.5 between any two cells and an associated FDR of 0.05.

mean-vs-variance.svg

Variance of normalized log-expression values for each feature, plotted against the mean log-expression. The blue line represents the mean-dependent trend fitted to the variances of the endogenous genes. Variance estimates for spike-in transcripts are highlighted in red. See Lun et al. (2016).

Normalization

size-factors-vs-libsize.svg

Size factors for normalization were calculated by pooling counts from many cells and then deconvolving the obtained pooled size factors for individual cells (Lun et al (2016)). In this plot, each point is a cell, showing the estimated size factor vs. the library size. For spike-ins, size factors were calculated separately, as these should not be affected by library size. Then, size factors have been used to normalize counts into log-transformed expressions where cell-specific biases are removed. Thereby log-transformation provides a variance stabilization: high abundance genes no longer dominate downstream analysis. See Lun et al. (2016).

Quality Control

cycle-scores.test-condition.svg

Cells are classified into cycle phases using the method of Scialdone et al. (2015). This plot shows the scores for G1 and G2/M, each point represents a cell. Points are color coded by .

explained-variance.svg

Density plot of the percentage of variance explained by various covariates.

For each feature, the percentage of the variance of the normalized log-expression values across cells that is explained by each covariate is calculated. Each curve corresponds to one factor and represents the distribution of percentages across all genes. See Lun et al. (2016).

expressed-genes.svg

Histogram of number of expressed genes in each cell.

library-size.svg

Histogram of observed library sizes.

mito-proportion.svg

Histogram of the proportion of reads mapped to mitochondrial genes.

spike-proportion.svg

Histogram of the proportion of reads mapped to spike-in transcripts.

Statistics

If the workflow has been executed in cluster/cloud, runtimes include the waiting time in the queue.

Rules

Rule Jobs Output Singularity Conda environment
hvg_tsne 9
  • plots/hvg-tsne.test-condition.perp=5.seed=897354.svg
  • plots/hvg-tsne.test-condition.perp=20.seed=897354.svg
  • plots/hvg-tsne.test-condition.perp=5.seed=23213.svg
  • plots/hvg-tsne.test-condition.perp=10.seed=789789.svg
  • plots/hvg-tsne.test-condition.perp=20.seed=23213.svg
  • plots/hvg-tsne.test-condition.perp=10.seed=897354.svg
  • plots/hvg-tsne.test-condition.perp=5.seed=789789.svg
  • plots/hvg-tsne.test-condition.perp=10.seed=23213.svg
  • plots/hvg-tsne.test-condition.perp=20.seed=789789.svg
  • bioconductor-scater =1.6
  • bioconductor-scran =1.6
  • bioconductor-singlecellexperiment =1.0
  • r-rcolorbrewer =1.1
  • bioconductor-rbgl =1.54
  • bioconductor-rgraphviz =2.22
  • r-gplots =3.0
  • bioconductor-limma =3.34
  • r-rtsne =0.13
hvg_pca 1
  • plots/hvg-pca.test-condition.svg
  • bioconductor-scater =1.6
  • bioconductor-scran =1.6
  • bioconductor-singlecellexperiment =1.0
  • r-rcolorbrewer =1.1
  • bioconductor-rbgl =1.54
  • bioconductor-rgraphviz =2.22
  • r-gplots =3.0
  • bioconductor-limma =3.34
hvg 1
  • analysis/variance.rds
  • tables/hvg.tsv
  • plots/hvg-expr-dists.svg
  • plots/mean-vs-variance.svg
  • bioconductor-scater =1.6
  • bioconductor-scran =1.6
  • bioconductor-singlecellexperiment =1.0
  • r-rcolorbrewer =1.1
  • bioconductor-rbgl =1.54
  • bioconductor-rgraphviz =2.22
  • r-gplots =3.0
  • bioconductor-limma =3.34
qc 1
  • plots/library-size.svg
  • plots/expressed-genes.svg
  • plots/mito-proportion.svg
  • plots/spike-proportion.svg
  • bioconductor-scater =1.6
  • bioconductor-scran =1.6
  • bioconductor-singlecellexperiment =1.0
  • r-rcolorbrewer =1.1
  • bioconductor-rbgl =1.54
  • bioconductor-rgraphviz =2.22
  • r-gplots =3.0
  • bioconductor-limma =3.34
correlation 1
  • tables/hvg-correlations.tsv
  • plots/hvg-clusters.svg
  • plots/hvg-corr-heatmap.svg
  • bioconductor-scater =1.6
  • bioconductor-scran =1.6
  • bioconductor-singlecellexperiment =1.0
  • r-rcolorbrewer =1.1
  • bioconductor-rbgl =1.54
  • bioconductor-rgraphviz =2.22
  • r-gplots =3.0
  • bioconductor-limma =3.34
cell_cycle_scores 1
  • plots/cycle-scores.test-condition.svg
  • bioconductor-scater =1.6
  • bioconductor-scran =1.6
  • bioconductor-singlecellexperiment =1.0
  • r-rcolorbrewer =1.1
  • bioconductor-rbgl =1.54
  • bioconductor-rgraphviz =2.22
  • r-gplots =3.0
  • bioconductor-limma =3.34
explained_variance 1
  • plots/explained-variance.svg
  • bioconductor-scater =1.6
  • bioconductor-scran =1.6
  • bioconductor-singlecellexperiment =1.0
  • r-rcolorbrewer =1.1
  • bioconductor-rbgl =1.54
  • bioconductor-rgraphviz =2.22
  • r-gplots =3.0
  • bioconductor-limma =3.34
batch_effect_removal 1
  • analysis/design-matrix.rds
  • analysis/normalized.batch-removed.rds
  • bioconductor-scater =1.6
  • bioconductor-scran =1.6
  • bioconductor-singlecellexperiment =1.0
  • r-rcolorbrewer =1.1
  • bioconductor-rbgl =1.54
  • bioconductor-rgraphviz =2.22
  • r-gplots =3.0
  • bioconductor-limma =3.34
normalize 1
  • analysis/normalized.rds
  • plots/size-factors-vs-libsize.svg
  • bioconductor-scater =1.6
  • bioconductor-scran =1.6
  • bioconductor-singlecellexperiment =1.0
  • r-rcolorbrewer =1.1
  • bioconductor-rbgl =1.54
  • bioconductor-rgraphviz =2.22
  • r-gplots =3.0
  • bioconductor-limma =3.34
load_counts 1
  • analysis/all.rds
  • bioconductor-scater =1.6
  • bioconductor-scran =1.6
  • bioconductor-singlecellexperiment =1.0
  • r-rcolorbrewer =1.1
  • bioconductor-rbgl =1.54
  • bioconductor-rgraphviz =2.22
  • r-gplots =3.0
  • bioconductor-limma =3.34
cell_cycle 1
  • analysis/cell-cycle-assignments.rds
  • bioconductor-scater =1.6
  • bioconductor-scran =1.6
  • bioconductor-singlecellexperiment =1.0
  • r-rcolorbrewer =1.1
  • bioconductor-rbgl =1.54
  • bioconductor-rgraphviz =2.22
  • r-gplots =3.0
  • bioconductor-limma =3.34
filter_genes 1
  • analysis/filtered.rds
  • plots/avg-counts.svg
  • plots/50-highest-genes.svg
  • bioconductor-scater =1.6
  • bioconductor-scran =1.6
  • bioconductor-singlecellexperiment =1.0
  • r-rcolorbrewer =1.1
  • bioconductor-rbgl =1.54
  • bioconductor-rgraphviz =2.22
  • r-gplots =3.0
  • bioconductor-limma =3.34
filter_cells 1
  • analysis/filtered-cells.rds
  • tables/cell-filtering.tsv
  • bioconductor-scater =1.6
  • bioconductor-scran =1.6
  • bioconductor-singlecellexperiment =1.0
  • r-rcolorbrewer =1.1
  • bioconductor-rbgl =1.54
  • bioconductor-rgraphviz =2.22
  • r-gplots =3.0
  • bioconductor-limma =3.34