ROSeqIn this vignette, we show you how to perform single cell differential expression analysis using single cell RNA-seq data with the ROSeq package.
The ROSeq package models the rank distribution of read counts for each gene. These models can then be used to identify robustly differentially expressed genes between groups of cells. For more information, please refer to the original manuscript by Lalit et al.
The analysis starts with a matrix of read counts. Here the rows correspond to the genes while the columns correspond to individual, single cells (in the case of single cell data). The ROSeq package includes the dataset published by Tung et al. The dataset contains scdata: read count data corresponding to 288 single cells each, made available from three human induced pluripotent stem cell lines (NA19098, NA19101 and NA19239). Here we load the cells corresponding to individuals NA19101 and NA19239. In addition, bcdata: bulk or population RNA-Seq data corresponding to these three individuals was also made available, in the form of 3 replicates each.
# Load the example dataset
data("Tung2017")
classOne<-"NA19101"
classTwo<-"NA19239"
keep <- anno[,1] == classOne | anno[,1] == classTwo
scdata <- scdata[,keep]
keep <- colnames(bcdata) == classOne | colnames(bcdata) == classTwo
bcdata <- bcdata[,keep]
# Factor determining cell types
bcgroups <- factor(gsub(paste0("(", classOne, '|', classTwo, ").*"), "\\1", colnames(bcdata)), levels = c(classOne, classTwo))
scgroups <- factor(gsub(paste0("(", classOne, '|', classTwo, ").*"), "\\1", colnames(scdata)), levels = c(classOne, classTwo))
# The group factor should be named accordingly
names(bcgroups) <- colnames(bcdata)
table(bcgroups)
## bcgroups
## NA19101 NA19239
## 3 3
names(scgroups) <- colnames(scdata)
table(scgroups)
## scgroups
## NA19101 NA19239
## 288 288
As a first step, good-quality genes are identified as those which have more than five non-zero values. This filtering approach follows from Hemberg Lab’s notes on scRNA-Seq analysis.
# Preprocessing Stage - Clean up the dataset : Filtering the single cell data by identifying genes which have more than 5 non zero values
gkeep <- rowSums(scdata > 0) > 5
scdatafiltered <- scdata[gkeep,]
scngenes <- length(scdatafiltered[,1])
common <- intersect(rownames(bcdata), rownames(scdatafiltered))
bcdatafiltered<-bcdata[common, ] # give you common rows in bulk cell data frame
scdatafiltered<-scdatafiltered[common, ] # give you common rows in single cell data frame
scnclassOne<-length(which(scgroups==classOne))
scnclassTwo<-length(which(scgroups==classTwo))
# Preprocessing Stage - Visually inspecting the coverage
sccoverage <- colSums(scdatafiltered)/scngenes
scord <- order(scgroups)
bar.positions <- barplot(sccoverage[scord],col=scgroups[scord], xaxt='n', ylab="Average Counts per gene for each cell")
axis(side=1,at=c(bar.positions[scnclassOne/2], bar.positions[scnclassOne+scnclassTwo/2]), labels=c(classOne,classTwo), tick=FALSE)
As a next step we normalize the datasets using median normalization and run ROSeq. Specifying multiple cores at this stage would significantly speed up the process of fitting the parametric model and evaluating for differential expression.
# Solving Stage - Applying Median normalization to scfiltereddata
library(parallel)
library(MASS)
lib_size = colSums(scdatafiltered)
scdatafilteredmediannorm <- t(t(scdatafiltered)/lib_size*median(lib_size))
numCores <- detectCores()
results<-callROSeq(scdata=scdatafilteredmediannorm, scgroups, numCores)
pvalues_ROSeq<-c(rep(NA, nrow(scdatafilteredmediannorm)))
for (gene in 1:nrow(scdatafilteredmediannorm)){
pvalues_ROSeq[gene]<-results[[gene]][12]
}
pvaluesadjusted_ROSeq <-p.adjust(pvalues_ROSeq, method = "fdr")
names(pvaluesadjusted_ROSeq)<-rownames(scdatafilteredmediannorm)
# Saving Results as .RData
save(pvaluesadjusted_ROSeq, file = paste0(classOne,"_", classTwo, "_", "ROSeq.RData"))
Visualizing in a heat-map form, the top twenty differentially expressed genes as predicted by ROSeq, appears as follows. This serves as a qualititative, sanity check that ROSeq is calling out the correct genes.
library(gplots)
scdataordered<-scdatafilteredmediannorm[order(pvaluesadjusted_ROSeq), ]
resultsordered<-results[order(pvaluesadjusted_ROSeq)]
# Post- processing Stage: Grab the top 20 differentially expressed genes
scfirst20<-head(as.matrix(scdataordered), n = 20L)
my_palette <- colorRampPalette(c("green", "yellow", "red"))(n = 299)
col_breaks = c(seq(-1,-0.51,length=100), # for green
seq(-0.5,0.8,length=100), # for yellow
seq(0.81,1,length=100)) # for red
heatmap.2(scfirst20,
main = paste0(classOne, " vs ", classTwo), # heat map title
notecol="black", # change font color of cell labels to black
density.info="none", # turns off density plot inside color legend
trace="none", # turns off trace lines inside the heat map
margins =c(5,10), # widens margins around plot
col=my_palette, # use on color palette defined earlier
breaks=col_breaks, # enable color transition at specified limits
dendrogram="none", # only draw a row dendrogram
Colv="NA", Rowv="NA", labCol = FALSE)
We also glance at the model fit for some of the genes. For example, looking at the most and the least differentially expressed genes, gives the following results:
In addition to scRNA-Seq read counts, bulk or population RNA-Seq data corresponding to these three individuals was also made available, in the form of 3 replicates each. Differential expression was performed on this bulk cell RNA-Seq read count data in a pair-wise fashion using a standard, bulk cell Differential Expression technique called DESeq (see Anders et al.) in order to establish the ground truth : genes with an adjusted p-value less than 0.05 and an absolute log2(foldchange) greater than two were considered to be differentially expressed (DE). Similarly, genes with an adjusted p-value greater than 0.1 were NOT considered to be differentially expressed (notDE).
## Length Class Mode
## DE 182 -none- character
## notDE 14112 -none- character
Median Normalization was implemented next for use by methods such as Wilcoxon Rank-Sum and ROSeq, while TMM Normalization was implemented for use by ROTS (other packages such as MAST and SCDE have an inbuilt normalization step in their implementation), in order to equalize coverage across the two sub-populations or classes of cells under investigation.
Finally, all the analyses above are combined and overlaid in one single plot:
# Solving Stage - Evaluating ROSeq performance
AUC_Wilcoxon<-round(DE_Quality_AUC_plot(pvaluesadjusted_Wilcoxon, "red"), 3)
AUC_MAST<-round(DE_Quality_AUC_plot_add(pvaluesadjusted_MAST, "blue"), 3)
AUC_ROTS<-round(DE_Quality_AUC_plot_add(pvaluesadjusted_ROTS, "magenta"), 3)
AUC_SCDE<-round(DE_Quality_AUC_plot_add(pvaluesadjusted_SCDE, "green"), 3)
AUC_ROSeq<-round(DE_Quality_AUC_plot_add(pvaluesadjusted_ROSeq, "black"), 3)
# Add Legend
addLegend(AUC_MAST, AUC_ROSeq, AUC_ROTS, AUC_SCDE, AUC_Wilcoxon)