Workflow
Single-cell data analysis was performed as outlined by Lun et al. (2016), using Scran and Scater. After cell and gene filtration (see Filtration), raw counts were normalized with Scater using a linear scaling approach. In the following, the covariates G1 and G2M were treated as sources of technical variation. Per-gene biological variance was estimated by fitting a mean-variance trend model with the above covariates as the design matrix (see hvg.tsv). Subsequently, the covariates were used to remove batch effects from the normalized expression values. Finally, highly variable genes/transcripts (HVGs) were extracted from the trend model and their pairwise correlations were analyzed (see Results for details).
Detailed software versions can be found under Rules.
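As a rough illustration of what linear scaling normalization means, the sketch below divides each cell's counts by a size factor proportional to its library size and then log-transforms the result. The function name normalize_counts, the pseudocount of 1, and the use of plain library-size factors are assumptions made for this example; the size factors actually computed by the workflow may be derived differently.

```python
import math

def normalize_counts(counts):
    """counts: list of per-cell lists of raw counts (cells x genes).

    Returns log2(count / size_factor + 1) for every entry.
    Assumes every cell has a non-zero library size (i.e. cells were filtered).
    """
    lib_sizes = [sum(cell) for cell in counts]
    mean_lib = sum(lib_sizes) / len(lib_sizes)
    # Size factors are scaled so that their mean across cells is 1.
    size_factors = [size / mean_lib for size in lib_sizes]
    return [
        [math.log2(c / sf + 1) for c in cell]
        for cell, sf in zip(counts, size_factors)
    ]

# Two hypothetical cells with the same composition but different depth.
cells = [[10, 0, 30], [20, 0, 60]]
norm = normalize_counts(cells)
```

After scaling, the two cells in this toy input have the same normalized profile, because they differ only in sequencing depth.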
Results
Dimension Reduction
hvg-pca.test-condition.svg
PCA plot constructed from normalized log-expression values of correlated HVGs, where each point represents a cell.
First, second, and third components are shown, along with the percentage of variance explained. Bars represent the coordinates of the cells on each axis. Cells were colored by test-condition. See Lun et al. (2016).
hvg-tsne.test-condition.perp=10.seed=23213.svg
t-SNE plot constructed from normalized log-expression values of correlated HVGs, where each point represents a cell.
Cells were colored by test-condition. t-SNE visualizations can be misleading because they depend on parameter choices and the algorithm is non-deterministic. We therefore run t-SNE for several perplexities and random seeds. This plot shows results for perplexity=10 and seed=23213. t-SNE results are only reliable if the clustering is similar across different seeds. See here for additional details.
hvg-tsne.test-condition.perp=10.seed=789789.svg
t-SNE plot constructed from normalized log-expression values of correlated HVGs, where each point represents a cell.
Cells were colored by test-condition. t-SNE visualizations can be misleading because they depend on parameter choices and the algorithm is non-deterministic. We therefore run t-SNE for several perplexities and random seeds. This plot shows results for perplexity=10 and seed=789789. t-SNE results are only reliable if the clustering is similar across different seeds. See here for additional details.
hvg-tsne.test-condition.perp=10.seed=897354.svg
t-SNE plot constructed from normalized log-expression values of correlated HVGs, where each point represents a cell.
Cells were colored by test-condition. t-SNE visualizations can be misleading because they depend on parameter choices and the algorithm is non-deterministic. We therefore run t-SNE for several perplexities and random seeds. This plot shows results for perplexity=10 and seed=897354. t-SNE results are only reliable if the clustering is similar across different seeds. See here for additional details.
hvg-tsne.test-condition.perp=20.seed=23213.svg
t-SNE plot constructed from normalized log-expression values of correlated HVGs, where each point represents a cell.
Cells were colored by test-condition. t-SNE visualizations can be misleading because they depend on parameter choices and the algorithm is non-deterministic. We therefore run t-SNE for several perplexities and random seeds. This plot shows results for perplexity=20 and seed=23213. t-SNE results are only reliable if the clustering is similar across different seeds. See here for additional details.
hvg-tsne.test-condition.perp=20.seed=789789.svg
t-SNE plot constructed from normalized log-expression values of correlated HVGs, where each point represents a cell.
Cells were colored by test-condition. t-SNE visualizations can be misleading because they depend on parameter choices and the algorithm is non-deterministic. We therefore run t-SNE for several perplexities and random seeds. This plot shows results for perplexity=20 and seed=789789. t-SNE results are only reliable if the clustering is similar across different seeds. See here for additional details.
hvg-tsne.test-condition.perp=20.seed=897354.svg
t-SNE plot constructed from normalized log-expression values of correlated HVGs, where each point represents a cell.
Cells were colored by test-condition. t-SNE visualizations can be misleading because they depend on parameter choices and the algorithm is non-deterministic. We therefore run t-SNE for several perplexities and random seeds. This plot shows results for perplexity=20 and seed=897354. t-SNE results are only reliable if the clustering is similar across different seeds. See here for additional details.
hvg-tsne.test-condition.perp=5.seed=23213.svg
t-SNE plot constructed from normalized log-expression values of correlated HVGs, where each point represents a cell.
Cells were colored by test-condition. t-SNE visualizations can be misleading because they depend on parameter choices and the algorithm is non-deterministic. We therefore run t-SNE for several perplexities and random seeds. This plot shows results for perplexity=5 and seed=23213. t-SNE results are only reliable if the clustering is similar across different seeds. See here for additional details.
hvg-tsne.test-condition.perp=5.seed=789789.svg
t-SNE plot constructed from normalized log-expression values of correlated HVGs, where each point represents a cell.
Cells were colored by test-condition. t-SNE visualizations can be misleading because they depend on parameter choices and the algorithm is non-deterministic. We therefore run t-SNE for several perplexities and random seeds. This plot shows results for perplexity=5 and seed=789789. t-SNE results are only reliable if the clustering is similar across different seeds. See here for additional details.
hvg-tsne.test-condition.perp=5.seed=897354.svg
t-SNE plot constructed from normalized log-expression values of correlated HVGs, where each point represents a cell.
Cells were colored by test-condition. t-SNE visualizations can be misleading because they depend on parameter choices and the algorithm is non-deterministic. We therefore run t-SNE for several perplexities and random seeds. This plot shows results for perplexity=5 and seed=897354. t-SNE results are only reliable if the clustering is similar across different seeds. See here for additional details.
Filtration
50-highest-genes.svg
Percentage of total counts assigned to the top 50 most highly-abundant features. For each feature, each bar represents the percentage assigned to that feature for a single cell, while the circle represents the average across all cells. Bars are coloured by the total number of expressed features in each cell, while circles are coloured according to whether the feature is labeled as a control feature.
This should generally be dominated by constitutively expressed transcripts, such as those for ribosomal or mitochondrial proteins. The presence of other classes of features may be cause for concern if they are not consistent with expected biology. For example, a top set containing many spike-in transcripts suggests that too much spike-in RNA was added during library preparation, while the absence of ribosomal proteins and/or the presence of their pseudogenes are indicative of suboptimal alignment. See Lun et al. (2016).
avg-counts.svg
Histogram of mean counts for features (genes/transcripts) across all cells. The blue dashed line marks the threshold below which a feature is classified as a low-expression feature. Low-expression features are subsequently removed from the analysis.
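The filtering step described above can be sketched as follows. The function name filter_low_expression, the threshold value of 1.0, and the choice to keep features whose mean equals the threshold are hypothetical; the report does not state the actual cutoff.

```python
def filter_low_expression(counts, feature_names, min_avg_count):
    """Keep features whose mean count across all cells reaches min_avg_count.

    counts: list of per-cell lists of raw counts (cells x features).
    """
    n_cells = len(counts)
    kept = []
    for j, name in enumerate(feature_names):
        avg = sum(cell[j] for cell in counts) / n_cells
        if avg >= min_avg_count:
            kept.append(name)
    return kept

# Three hypothetical cells and three hypothetical features.
counts = [[0, 5, 12], [1, 3, 8], [0, 4, 10]]
kept = filter_low_expression(counts, ["geneA", "geneB", "geneC"], min_avg_count=1.0)
```

In this toy input, geneA has a mean count of about 0.33 and would be dropped, while geneB and geneC are retained.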
cell-filtering.tsv
Cells were filtered as suggested by Lun et al. (2016) by removing outliers with respect to
- library size,
- number of expressed features,
- percentage of reads mapping to mitochondrial genes,
- percentage of reads mapping to spike-ins.
Outliers are defined as values lying more than 3 median absolute deviations (MADs) from the median.
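A minimal sketch of the MAD-based outlier rule is given below. The function name mad_outliers is made up for this example, and the sketch flags values on both sides of the median on the raw scale, which is a simplification: depending on the metric, the actual filtering may use a log scale or test only one tail. The scaling constant 1.4826 mirrors the default of R's mad() function.

```python
import statistics

def mad_outliers(values, nmads=3.0):
    """Return one boolean per value, True where the value lies more than
    nmads median absolute deviations from the median."""
    med = statistics.median(values)
    abs_dev = [abs(v - med) for v in values]
    # 1.4826 makes the MAD comparable to a standard deviation for normal data.
    mad = 1.4826 * statistics.median(abs_dev)
    return [dev > nmads * mad for dev in abs_dev]

# Hypothetical library sizes; the last cell has a very small library.
library_sizes = [1000, 1100, 950, 1050, 1020, 100]
flags = mad_outliers(library_sizes)
```

Here only the last cell is flagged: the median is 1010, the scaled MAD is about 74, and the cell with a library size of 100 lies far more than 3 MADs below the median.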
Highly Variable Genes
hvg-clusters.svg
Larger sets of correlated features were assembled by treating features as nodes in a graph and each pair of features with a significantly large correlation (FDR ≤ 0.05) as an edge. Each cluster in this graph represents a set of correlated features. See Lun et al. (2016).
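The graph construction can be sketched as below. This is a simplification under stated assumptions: it groups features by plain connected components using a union-find structure, whereas the clustering method used for the figure may be stricter (for example, requiring highly connected subgraphs). The function name correlated_clusters and the input tuples are hypothetical; pairs whose FDR is at or below the threshold are treated as significant.

```python
def correlated_clusters(pairs, fdr_threshold=0.05):
    """pairs: iterable of (feature1, feature2, fdr) tuples.

    Returns a list of sets, one per connected component of the graph
    whose edges are the significantly correlated pairs.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for f1, f2, fdr in pairs:
        if fdr <= fdr_threshold:
            # Merge the components containing the two features.
            parent[find(f1)] = find(f2)

    clusters = {}
    for feature in list(parent):
        clusters.setdefault(find(feature), set()).add(feature)
    return list(clusters.values())

# Hypothetical pairs; the E-F pair is not significant and adds no edge.
pairs = [("A", "B", 0.01), ("B", "C", 0.03), ("D", "E", 0.02), ("E", "F", 0.20)]
clusters = correlated_clusters(pairs)
```

With this input, A, B, and C form one cluster and D and E form another; F appears in no cluster because its only pair fails the FDR threshold.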
hvg-corr-heatmap.svg
Heatmap of mean-centred normalized log-expression values for significantly correlated HVGs (FDR ≤ 0.05). Dendrograms were formed by hierarchical clustering on the Euclidean distances between features (rows) or cells (columns). See Lun et al. (2016).
hvg-correlations.tsv
Table of pairs of correlated HVGs.
Correlations between genes are quantified by computing Spearman's rho, which accommodates non-linear relationships in the expression values. See Lun et al. (2016).
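To illustrate why a rank-based measure accommodates non-linear relationships, the sketch below computes Spearman's rho as the Pearson correlation of ranks, using only the standard library. The helper names and the two toy expression vectors are made up for this example, and the sketch assumes neither input is constant.

```python
def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical expression values with a monotonic but non-linear relationship.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [v ** 3 for v in gene_a]
rho = spearman_rho(gene_a, gene_b)
r = pearson(gene_a, gene_b)
```

Because gene_b increases monotonically with gene_a, the ranks agree perfectly and rho reaches its maximum of 1, whereas the Pearson correlation on the raw values is lower.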