Single-Cell Differential Expression Analysis

In this vignette, we show you how to perform single cell differential expression analysis using single cell RNA-seq data with the ROSeq package.

The ROSeq package models the rank distribution of read counts for each gene. These models can then be used to identify robustly differentially expressed genes between groups of cells. For more information, please refer to the original manuscript by Lalit et al.

Preparing data

The analysis starts with a matrix of read counts. Here the rows correspond to the genes while the columns correspond to individual, single cells (in the case of single cell data). The ROSeq package includes the dataset published by Tung et al. The dataset contains scdata: read count data corresponding to 288 single cells each, made available from three human induced pluripotent stem cell lines (NA19098, NA19101 and NA19239). Here we load the cells corresponding to individuals NA19101 and NA19239. In addition, bcdata: bulk or population RNA-Seq data corresponding to these three individuals was also made available, in the form of 3 replicates each.

# Load the example dataset
data("Tung2017")
classOne<-"NA19101"
classTwo<-"NA19239"
keep <- anno[,1] == classOne | anno[,1] == classTwo
scdata <- scdata[,keep]
keep <- colnames(bcdata) == classOne | colnames(bcdata) == classTwo
bcdata <- bcdata[,keep]

# Factor determining cell types
bcgroups <- factor(gsub(paste0("(", classOne, '|', classTwo, ").*"), "\\1", colnames(bcdata)), levels = c(classOne, classTwo))
scgroups <- factor(gsub(paste0("(", classOne, '|', classTwo, ").*"), "\\1", colnames(scdata)), levels = c(classOne, classTwo))

# The group factor should be named accordingly
names(bcgroups) <- colnames(bcdata)  
table(bcgroups)

## bcgroups
## NA19101 NA19239 
##       3       3

names(scgroups) <- colnames(scdata)  
table(scgroups)

## scgroups
## NA19101 NA19239 
##     288     288

As a first step, good-quality genes are identified as those which have more than five non-zero values. This filtering approach follows from Hemberg Lab’s notes on scRNA-Seq analysis.

# Preprocessing Stage - Clean up the dataset : Filtering the single cell data by identifying genes which have more than 5 non zero values
gkeep <- rowSums(scdata > 0) > 5
scdatafiltered <- scdata[gkeep,]
scngenes <- length(scdatafiltered[,1])
common <- intersect(rownames(bcdata), rownames(scdatafiltered))  
bcdatafiltered<-bcdata[common, ] # give you common rows in bulk cell data frame  
scdatafiltered<-scdatafiltered[common, ] # give you common rows in single cell data frame
scnclassOne<-length(which(scgroups==classOne))
scnclassTwo<-length(which(scgroups==classTwo))

# Preprocessing Stage - Visually inspecting the coverage
sccoverage <- colSums(scdatafiltered)/scngenes
scord <- order(scgroups)
bar.positions <- barplot(sccoverage[scord],col=scgroups[scord], xaxt='n', ylab="Average Counts per gene for each cell")
axis(side=1,at=c(bar.positions[scnclassOne/2], bar.positions[scnclassOne+scnclassTwo/2]), labels=c(classOne,classTwo), tick=FALSE)

Fitting and Testing for Differential Expression

As a next step we normalize the datasets using median normalization and run ROSeq. Specifying multiple cores at this stage would significantly speed up the process of fitting the parametric model and evaluating for differential expression.

# Solving Stage - Applying Median normalization to scfiltereddata
library(parallel)
library(MASS)
lib_size = colSums(scdatafiltered)
scdatafilteredmediannorm <- t(t(scdatafiltered)/lib_size*median(lib_size)) 
numCores <- detectCores()
results<-callROSeq(scdata=scdatafilteredmediannorm, scgroups, numCores)
pvalues_ROSeq<-c(rep(NA, nrow(scdatafilteredmediannorm)))
for (gene in 1:nrow(scdatafilteredmediannorm)){
  pvalues_ROSeq[gene]<-results[[gene]][12]
}
pvaluesadjusted_ROSeq <-p.adjust(pvalues_ROSeq, method = "fdr")
names(pvaluesadjusted_ROSeq)<-rownames(scdatafilteredmediannorm)
# Saving Results as .RData
save(pvaluesadjusted_ROSeq, file = paste0(classOne,"_", classTwo, "_", "ROSeq.RData"))

Identifying the top-n most differentially expressed genes

Visualizing in a heat-map form, the top twenty differentially expressed genes as predicted by ROSeq, appears as follows. This serves as a qualititative, sanity check that ROSeq is calling out the correct genes.

library(gplots)
scdataordered<-scdatafilteredmediannorm[order(pvaluesadjusted_ROSeq), ]
resultsordered<-results[order(pvaluesadjusted_ROSeq)]
# Post- processing Stage: Grab the top 20 differentially expressed genes
scfirst20<-head(as.matrix(scdataordered), n = 20L)
my_palette <- colorRampPalette(c("green", "yellow", "red"))(n = 299)

col_breaks = c(seq(-1,-0.51,length=100),           # for green
               seq(-0.5,0.8,length=100),           # for yellow
               seq(0.81,1,length=100))             # for red

heatmap.2(scfirst20,
          main = paste0(classOne, " vs ", classTwo), # heat map title
          notecol="black",      # change font color of cell labels to black
          density.info="none",  # turns off density plot inside color legend
          trace="none",         # turns off trace lines inside the heat map
          margins =c(5,10),     # widens margins around plot
          col=my_palette,       # use on color palette defined earlier
          breaks=col_breaks,    # enable color transition at specified limits
          dendrogram="none",     # only draw a row dendrogram
          Colv="NA", Rowv="NA", labCol = FALSE)

We also glance at the model fit for some of the genes. For example, looking at the most and the least differentially expressed genes, gives the following results:

Establishing Ground Truth criteria for differential expression using bulk cell data

In addition to scRNA-Seq read counts, bulk or population RNA-Seq data corresponding to these three individuals was also made available, in the form of 3 replicates each. Differential expression was performed on this bulk cell RNA-Seq read count data in a pair-wise fashion using a standard, bulk cell Differential Expression technique called DESeq (see Anders et al.) in order to establish the ground truth : genes with an adjusted p-value less than 0.05 and an absolute log2(foldchange) greater than two were considered to be differentially expressed (DE). Similarly, genes with an adjusted p-value greater than 0.1 were NOT considered to be differentially expressed (notDE).

##       Length Class  Mode     
## DE      182  -none- character
## notDE 14112  -none- character

Evaluating AUC by including the Ground Truth and comparing performance with other methods like Wilcoxon Rank-Sum, MAST, SCDE and ROTS

Median Normalization was implemented next for use by methods such as Wilcoxon Rank-Sum and ROSeq, while TMM Normalization was implemented for use by ROTS (other packages such as MAST and SCDE have an inbuilt normalization step in their implementation), in order to equalize coverage across the two sub-populations or classes of cells under investigation.

Finally, all the analyses above are combined and overlaid in one single plot:

# Solving Stage - Evaluating ROSeq performance 
AUC_Wilcoxon<-round(DE_Quality_AUC_plot(pvaluesadjusted_Wilcoxon, "red"), 3)
AUC_MAST<-round(DE_Quality_AUC_plot_add(pvaluesadjusted_MAST, "blue"), 3)
AUC_ROTS<-round(DE_Quality_AUC_plot_add(pvaluesadjusted_ROTS, "magenta"), 3)
AUC_SCDE<-round(DE_Quality_AUC_plot_add(pvaluesadjusted_SCDE, "green"), 3)
AUC_ROSeq<-round(DE_Quality_AUC_plot_add(pvaluesadjusted_ROSeq, "black"), 3)

# Add Legend
addLegend(AUC_MAST, AUC_ROSeq, AUC_ROTS, AUC_SCDE, AUC_Wilcoxon)

Getting Started with `ROSeq`

December 4, 2018

Single-Cell Differential Expression Analysis

Preparing data

Fitting and Testing for Differential Expression

Identifying the top-n most differentially expressed genes

Establishing Ground Truth criteria for differential expression using bulk cell data

Evaluating AUC by including the Ground Truth and comparing performance with other methods like Wilcoxon Rank-Sum, MAST, SCDE and ROTS

Getting Started with ROSeq

December 4, 2018

Single-Cell Differential Expression Analysis

Preparing data

Fitting and Testing for Differential Expression

Identifying the top-n most differentially expressed genes

Establishing Ground Truth criteria for differential expression using bulk cell data

Evaluating AUC by including the Ground Truth and comparing performance with other methods like Wilcoxon Rank-Sum, MAST, SCDE and ROTS

Getting Started with `ROSeq`