This workshop is being held at the BioInfoSummer workshop on 3-7 December 2018.

1 Activity Overview

In this workshop you will be interrogating a mouse embryonic stem cell data collection. A data collection is a number of distinct datasets that have some kind of meaningful reason to be combined/merged.
This data comes from Kolodziejczyk et al: Kolodziejczyk, A. A. et al. Single Cell RNA-Sequencing of Pluripotent States Unlocks Modular Transcriptional Variation. Cell Stem Cell 17, 471-485 (2015).
We will cover three main kinds of merging of data: unsupervised, supervised and semi-supervised.
Have a look at the many Case Studies available on the scMerge website for more examples.

2 Links to resources

scMerge website
scMerge vignette and Case Studies
scMerge preprint
Yingxin Lin’s presentation slides at BioCAsia 2018
Analysis of Single Cell Tutorial, with thanks to Vladimir Kiselev, Tallulah Andrews, Jennifer Westoby, Davis McCarthy, Maren Büttner and Martin Hemberg.
Stephanie Hicks’ workshop presented on Tuesday 4 December 2018 at BioInfoSummer

3 scMerge installation

To complete today’s workshops, you will need to have R and Rstudio installed on your computer. Before attending the workshop, ensure R is at least version 3.5.0.

To install Bioconductor software, BiocManager must firstly be obtained from CRAN.

# If you need to install Bioconductor software
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
    
# Some CRAN packages required by scMerge
install.packages(c("ruv", "rsvd", "igraph", "pdist", "proxy", "foreach", "doSNOW", "distr", "Rcpp", "RcppEigen"))
devtools::install_github("theislab/kBET")

# Some BioConductor packages required by scMerge
BiocManager::install(c("SingleCellExperiment", "M3Drop"), version = "3.8")

# Installing scMerge and the data files using
devtools::install_github("SydneyBioX/scMerge.data")
devtools::install_github("SydneyBioX/scMerge")

During this workshop you are also likely to use the ggplot2 package and scater package, so load that into your workspace. If you run into errors with loading these packages, see if they are installed on your computer.

library(ggplot2)
# if not installed use
# install.packages("ggplot2")

library(scater)
# if not installed use
# BiocManager::install("scater", version = "3.8")

4 Interrogating mouse embryonic stem cell (ESC) dataset

Load the data from the scMerge.data package

library(scMerge)
data("sce_mESC", package = "scMerge.data")
sce_mESC

## class: SingleCellExperiment 
## dim: 24224 704 
## metadata(0):
## assays(2): counts logcounts
## rownames(24224): ENSMUSG00000000001 ENSMUSG00000000028 ...
##   ENSMUSG00000060565 ENSMUSG00000080069
## rowData names(0):
## colnames(704): ola_mES_lif_1_1.counts ola_mES_lif_1_10.counts ...
##   ola_mES_2i_5_95.counts ola_mES_2i_5_96.counts
## colData names(2): cellTypes batch
## reducedDimNames(0):
## spikeNames(0):

class(sce_mESC)

## [1] "SingleCellExperiment"
## attr(,"package")
## [1] "SingleCellExperiment"

sce_mESC is a SingleCellExperiment object. This is an S4 object that saves a lot of different pieces of information in the one object. You can browse the methods or functions that can be performed on the SingleCellExperiment object with the showMethods function.

showMethods(classes = "SingleCellExperiment")

Have a look at the different pieces of information within this object

dim(sce_mESC) # number of genes and number of cells

## [1] 24224   704

assays(sce_mESC)

## List of length 2
## names(2): counts logcounts

names(colData(sce_mESC))

## [1] "cellTypes" "batch"

table(sce_mESC$batch, sce_mESC$cellTypes)

##         
##          2i a2i lif
##   batch1  0   0  81
##   batch2 82  93  90
##   batch3 59  66  79
##   batch4 72   0   0
##   batch5 82   0   0

This data contains 5 batches, and we happen to know it has 3 cell types.

4.1 PCA plot of unmerged data

Using functions from the scater package, we can generate a number of informative graphs. The following is calling the plotPCA function from within the scater package. We set point colours by cell type and point shapes by batch.

scater::plotPCA(sce_mESC, 
                colour_by = "cellTypes", 
                shape_by = "batch")

## Warning: 'add_ticks' is deprecated.
## Use '+ geom_rug(...)' instead.

By default, plotPCA pulls data from the ‘logcounts’ assay. If you wish to change that, you can change the assay within the run_args parameter.

names(assays(sce_mESC)) # these are the assays

## [1] "counts"    "logcounts"

scater::plotPCA(sce_mESC, 
                colour_by = "cellTypes", 
                shape_by = "batch",
                run_args = list(exprs_values = "counts")) + 
  labs(title = "counts")

## Warning: 'add_ticks' is deprecated.
## Use '+ geom_rug(...)' instead.

4.2 Running unsupervised scMerge

There are four key arguments to scMerge:

sce_combine - the SingleCellExperiment object of interest
ctl - set of negative control gene names
kmeansK - an integer vector of length same as the number of batches indicating the number of clusters we expect
assay_name - the assay name for the scMerged dataset

For this example, we can see the number of cell types per batch, so we can supply a very well-informed kmeansK vector.

Since this is mouse data, we can use preloaded stably expressed genes in segList_ensemblGeneID as negative control genes. These are genes that we have found to be stably expressed across multiple mouse embryonic development time points.

table(sce_mESC$batch, sce_mESC$cellTypes)

##         
##          2i a2i lif
##   batch1  0   0  81
##   batch2 82  93  90
##   batch3 59  66  79
##   batch4 72   0   0
##   batch5 82   0   0

data("segList_ensemblGeneID")

sce_mESC <- scMerge(sce_combine = sce_mESC, 
                    ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
                    kmeansK = c(1,3,3,1,1),
                    assay_name = "scMerge_unsupervised")

## No maker nor marker_list information was supplied 
## Finding HVG...
## [1] 3082
## Clustering within each batch...
## Performing pca...
## Creating Mutual Nearest Cluster...
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7

##   group batch cluster
## 1     3     2       1
## 2     3     3       3
## 3     2     2       2
## 4     2     3       2
## 5     1     2       3
## 6     1     3       1
## 7     2     1       1
## 8     1     4       1
## 9     1     5       1
## Dimension of the replicates mapping matrix 
## [1] 704 356
## 
## Performing RUV normalisation... This might take a few minutes...

We now have have a new assay in our sce_mESC object called scMerge_unsupervised. Note that we also have a network graph generated, this shows the estimated sets of mutual nearest clusters, forming three distinct modules. Also notice that the distinct modules contain four, two and three nodes, corresponding to the batches containing cells for each cell type.

Visualise the scMerged data using plotPCA:

# note there is a new scMerge_unsupervised assay slot
sce_mESC

## class: SingleCellExperiment 
## dim: 24224 704 
## metadata(3): ruvK ruvK_optimal scRep_res
## assays(3): counts logcounts scMerge_unsupervised
## rownames(24224): ENSMUSG00000000001 ENSMUSG00000000028 ...
##   ENSMUSG00000060565 ENSMUSG00000080069
## rowData names(0):
## colnames(704): ola_mES_lif_1_1.counts ola_mES_lif_1_10.counts ...
##   ola_mES_2i_5_95.counts ola_mES_2i_5_96.counts
## colData names(2): cellTypes batch
## reducedDimNames(0):
## spikeNames(0):

scater::plotPCA(sce_mESC, 
                colour_by = "cellTypes", 
                shape_by = "batch",
                run_args = list(exprs_values = "scMerge_unsupervised")) + 
  labs(title = "scMerge_unsupervised")

## Warning: 'add_ticks' is deprecated.
## Use '+ geom_rug(...)' instead.

4.2.1 What if I don’t know how to select kmeansK?

We found that it’s better to overestimate kmeansK than to underestimate kmeansK. Try with setting kmeansK = c(5,5,5,5,5).

sce_mESC <- scMerge(sce_combine = sce_mESC, 
                    ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
                    kmeansK = c(5,5,5,5,5),
                    assay_name = "scMerge_unsupervised_wrongk")

## No maker nor marker_list information was supplied 
## Finding HVG...
## [1] 3082
## Clustering within each batch...
## Performing pca...
## Creating Mutual Nearest Cluster...
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

##    group batch cluster
## 1      1     1       3
## 2      1     2       3
## 3      1     3       4
## 4      1     4       2
## 5      2     5       4
## 6      4     2       1
## 7      4     3       2
## 8      2     2       2
## 9      2     3       1
## 10     5     2       5
## 11     5     3       3
## 12     2     4       4
## 13     3     3       5
## 14     3     4       5
## 15     3     5       2
## Dimension of the replicates mapping matrix 
## [1] 704 373
## 
## Performing RUV normalisation... This might take a few minutes...

scater::plotPCA(sce_mESC, 
                colour_by = "cellTypes", 
                shape_by = "batch",
                run_args = list(exprs_values = "scMerge_unsupervised_wrongk")) + 
  labs(title = "scMerge_unsupervised_wrongk")

## Warning: 'add_ticks' is deprecated.
## Use '+ geom_rug(...)' instead.

The network diagram appears to show four distinct modules, suggesting perhaps it has overestimated the number of cell types present in this data collection.

Notice that even overestimating the kmeansK vector still does a decent job as far as the PCA plot goes.

4.2.2 Plotting Explanatory variables

The plotExplanatoryVariables function in scater breaks down the variance associated with each gene into components, and plots these as density plots. Typically we want variability to be explained by biological things, i.e. cell types, and not by technical things, i.e. batch. Let’s plot the explanatory variables for both unMerged and scMerged data. (Note this is a ggplot object, so you can add ggplot things to it, e.g. adding labels with labs.)

scater::plotExplanatoryVariables(sce_mESC,
                                 exprs_values = "logcounts", 
                                 variables = c("batch", "cellTypes")) + 
  labs(title = "logcounts")

scater::plotExplanatoryVariables(sce_mESC,
                                 exprs_values = "scMerge_unsupervised", 
                                 variables = c("batch", "cellTypes")) + 
  labs(title = "scMerge_unsupervised")

scater::plotExplanatoryVariables(sce_mESC,
                                 exprs_values = "scMerge_unsupervised_wrongk", 
                                 variables = c("batch", "cellTypes")) + 
  labs(title = "scMerge_unsupervised_wrongk")

Here we can see that the scMerged data has more genes with higher variance explained by cell type and not batch.

4.3 Running supervised scMerge

Let’s say that you have performed separate analysis on these 5 different batches of data, and you are fairly confident that the cell type labels you have assigned to these cells are accurate. But now you would like to combine all these data into a single dataset so that you can perform some further downstream analysis. Can you still use the information from your previous analyses or previous knowledge? With supervised scMerge you can ensure that the prior cell type information is included in the merging process.

With fully supervised scMerge, you do not need to provide the kmeansK argument, as no mutual nearest clusters are estimated. If you do provide kmeansK, it will do nothing.

(As an aside, if you want to perform merging of data and avoid estimating mutual nearest clusters to identify pseudoreplicates, simply include your replicate information within the cell_type argument.)

sce_mESC <- scMerge(sce_mESC,
                    ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
                    assay_name = "scMerge_supervised",
                    cell_type = sce_mESC$cellTypes)

## Cell type information was supplied, and mutual nearest neighbour option was not selected 
## Performing supervised scMerge 
## Dimension of the replicates mapping matrix 
## [1] 704 353
## 
## Performing RUV normalisation... This might take a few minutes...

scater::plotPCA(sce_mESC, 
                colour_by = "cellTypes", 
                shape_by = "batch",
                run_args = list(exprs_values = "scMerge_supervised")) + 
  labs(title = "scMerge_supervised")

## Warning: 'add_ticks' is deprecated.
## Use '+ geom_rug(...)' instead.

4.4 Running semi-supervised scMerge

Now let’s say that there are only a subset of cells that you have prior knowledge about. You can incorporate this information into the scMerge algorithm by providing the cell type labels and noting which cells you have this information on.

For this example, we assume that we are quite sure that the 2i cells are 2i cells, but we don’t assume knowledge about the other cells. We do this by setting cell_type that includes which are the 2i cells, and also setting cell_type_inc as the index values for the 2i cells.

Note that in this semi-supervised mode you still need to provide a kmeansK argument.

sce_mESC <- scMerge(sce_mESC,
                    ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
                    kmeansK = c(1,3,3,1,1),
                    assay_name = "scMerge_semisupervised1",
                    cell_type = sce_mESC$cellTypes,
                    cell_type_inc = which(sce_mESC$cellTypes == "2i"))

## No maker nor marker_list information was supplied 
## Finding HVG...
## [1] 3082
## Clustering within each batch...
## Performing pca...
## Creating Mutual Nearest Cluster...
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7

##   group batch cluster
## 1     3     2       1
## 2     3     3       3
## 3     2     2       2
## 4     2     3       2
## 5     1     2       3
## 6     1     3       1
## 7     2     1       1
## 8     1     4       1
## 9     1     5       1
## Performing semi-supervised scMerge with subsets of known cell type
## Dimension of the replicates mapping matrix 
## [1] 704 207
## 
## Performing RUV normalisation... This might take a few minutes...

scater::plotPCA(sce_mESC, 
                colour_by = "cellTypes", 
                shape_by = "batch",
                run_args = list(exprs_values = "scMerge_semisupervised1")) + 
  labs(title = "scMerge_semisupervised1")

## Warning: 'add_ticks' is deprecated.
## Use '+ geom_rug(...)' instead.

A different type of semi-supervised scMerge is if you provide the cell type labels, but still wish for the groups to be matched using the mutual nearest clusters method.

sce_mESC <- scMerge(sce_mESC,
                    ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
                    assay_name = "scMerge_semisupervised2",
                    cell_type = sce_mESC$cellTypes,
                    cell_type_match = TRUE)

## Cell type information was supplied, and mutual nearest neighbour option was selected 
## Finding MNC from the known cell types of different batches...
## No maker nor marker_list information was supplied 
## Finding HVG...
## [1] 3082
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7

##   group batch cluster
## 1     1     2       1
## 2     1     3       1
## 3     3     2       2
## 4     3     3       2
## 5     2     2       3
## 6     2     3       3
## 7     2     1       1
## 8     1     4       1
## 9     1     5       1
## Dimension of the replicates mapping matrix 
## [1] 704 356
## 
## Performing RUV normalisation... This might take a few minutes...

scater::plotPCA(sce_mESC, 
                colour_by = "cellTypes", 
                shape_by = "batch",
                run_args = list(exprs_values = "scMerge_semisupervised2")) + 
  labs(title = "scMerge_semisupervised2")

## Warning: 'add_ticks' is deprecated.
## Use '+ geom_rug(...)' instead.

4.5 More on stably expressed genes

If you’re working with a different organism (e.g. human) you will want to use a different set of genes as negative control genes when merging your data. scMerge provides lists of stably expressed genes for human and mouse, based on microarray, RNA-Seq and scRNA-Seq datasets. These are provided as gene symbols or as ENSEMBL gene IDs.

data("segList")
str(segList)

## List of 2
##  $ human:List of 3
##   ..$ human_scSEG     : chr [1:1076] "AAR2" "AATF" "ABCF3" "ABHD2" ...
##   ..$ bulkMicroarrayHK: chr [1:553] "AATF" "ABL1" "ACAT2" "ACTB" ...
##   ..$ bulkRNAseqHK    : chr [1:3803] "AAGAB" "AAMP" "AAR2" "AARS" ...
##  $ mouse:List of 3
##   ..$ mouse_scSEG     : chr [1:826] "Nfatc3" "Ythdf2" "Gtf2e2" "Mrpl45" ...
##   ..$ bulkMicroarrayHK: chr [1:553] "Aatf" "Abl1" "Acat2" "Actb" ...
##   ..$ bulkRNAseqHK    : chr [1:3803] "Aagab" "Aamp" "Aar2" "Aars" ...

str(segList_ensemblGeneID)

## List of 2
##  $ human:List of 3
##   ..$ human_scSEG     : chr [1:1124] "ENSG00000131043" "ENSG00000275700" "ENSG00000276072" "ENSG00000161204" ...
##   ..$ bulkMicroarrayHK: chr [1:613] "ENSG00000275700" "ENSG00000276072" "ENSG00000097007" "ENSG00000120437" ...
##   ..$ bulkRNAseqHK    : chr [1:4047] "ENSG00000103591" "ENSG00000127837" "ENSG00000131043" "ENSG00000090861" ...
##  $ mouse:List of 3
##   ..$ mouse_scSEG     : chr [1:765] "ENSMUSG00000058835" "ENSMUSG00000026842" "ENSMUSG00000027671" "ENSMUSG00000020152" ...
##   ..$ bulkMicroarrayHK: chr [1:519] "ENSMUSG00000018697" "ENSMUSG00000026842" "ENSMUSG00000023832" "ENSMUSG00000029580" ...
##   ..$ bulkRNAseqHK    : chr [1:3434] "ENSMUSG00000037257" "ENSMUSG00000006299" "ENSMUSG00000027628" "ENSMUSG00000031960" ...

Alternatively, you may want to identify your own set of stably expressed genes. Key assumptions behind this is that the provided data does not have batch effects and covers a wide range of cell types. You can imagine that if you run scSEGIndex on a data from a single cell type, a lot of cell type specific markers may be called as stably expressed, while they’re not necessarily stably expressed across other cell types.

(Note the following line will not run since exprsMat is not yet defined.)

scSEGIndex(exprsMat, cell_type = NULL, ncore = 1)

5 Extension work

5.1 More data

If you have extra time, or want to continue exploring scMerge in your own time, have a look at the other Case Study datasets on the scMerge website. Download the data associated with another case study and explore the data using this lab as inspiration.

5.2 Fast computation

The RUVIII function in the package ruv is great, but slow for the large single-cell data. We have been able to achieve faster computation by introducting the fastRUVIII function in scMerge. It includes:
- randomised SVD algorithm: to obtain a very accurate approximation of the noise structure in the data by performing a partial SVD.

usage of the package Rcpp to speed up the matrix calculation.
rsvd_prop is a parameter between 0 and 1, controlling the degree of approximations.

We recommend only use this option when the number of cells is large in your single cell data (at least 10,000 cells).

?fastRUVIII

## starting httpd help server ... done

sce_mESC <- scMerge(sce_mESC, 
                    ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
                    kmeansK = c(1,3,3,1,1),
                    assay_name = "scMerge_fast", 
                    fast_svd = TRUE, 
                    rsvd_prop = 0.1)

## No maker nor marker_list information was supplied 
## Finding HVG...
## [1] 3082
## Clustering within each batch...
## Performing pca...
## Creating Mutual Nearest Cluster...
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7

##   group batch cluster
## 1     1     2       1
## 2     1     3       2
## 3     2     2       2
## 4     2     3       1
## 5     3     2       3
## 6     3     3       3
## 7     2     1       1
## 8     1     4       1
## 9     1     5       1
## Dimension of the replicates mapping matrix 
## [1] 704 356
## 
## Performing RUV normalisation... This might take a few minutes...

6 Finish

sessionInfo()

## R version 3.5.0 (2018-04-23)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252   
## [3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
## [5] LC_TIME=English_Australia.1252    
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] scater_1.10.0               scMerge_0.1.14             
##  [3] ggplot2_3.1.0               knitr_1.20                 
##  [5] SingleCellExperiment_1.4.0  SummarizedExperiment_1.12.0
##  [7] DelayedArray_0.8.0          BiocParallel_1.16.0        
##  [9] matrixStats_0.54.0          Biobase_2.42.0             
## [11] GenomicRanges_1.34.0        GenomeInfoDb_1.18.1        
## [13] IRanges_2.16.0              S4Vectors_0.20.1           
## [15] BiocGenerics_0.28.0        
## 
## loaded via a namespace (and not attached):
##  [1] viridis_0.5.1            viridisLite_0.3.0       
##  [3] foreach_1.4.4            DelayedMatrixStats_1.4.0
##  [5] gtools_3.8.1             assertthat_0.2.0        
##  [7] statmod_1.4.30           M3Drop_1.8.0            
##  [9] GenomeInfoDbData_1.2.0   vipor_0.4.5             
## [11] yaml_2.2.0               numDeriv_2016.8-1       
## [13] pillar_1.3.0             backports_1.1.2         
## [15] lattice_0.20-35          glue_1.3.0              
## [17] bbmle_1.0.20             digest_0.6.18           
## [19] RColorBrewer_1.1-2       XVector_0.22.0          
## [21] colorspace_1.3-2         htmltools_0.3.6         
## [23] Matrix_1.2-14            plyr_1.8.4              
## [25] pkgconfig_2.0.2          zlibbioc_1.28.0         
## [27] purrr_0.2.5              scales_1.0.0            
## [29] gdata_2.18.0             HDF5Array_1.10.0        
## [31] tibble_1.4.2             withr_2.1.2             
## [33] lazyeval_0.2.1           magrittr_1.5            
## [35] crayon_1.3.4             evaluate_0.12           
## [37] gplots_3.0.1             beeswarm_0.2.3          
## [39] tools_3.5.0              stringr_1.3.1           
## [41] Rhdf5lib_1.4.0           munsell_0.5.0           
## [43] irlba_2.3.2              bindrcpp_0.2.2          
## [45] compiler_3.5.0           rsvd_1.0.0              
## [47] caTools_1.17.1.1         rlang_0.3.0.1           
## [49] rhdf5_2.26.0             grid_3.5.0              
## [51] RCurl_1.95-4.11          iterators_1.0.10        
## [53] rstudioapi_0.8           igraph_1.2.2            
## [55] bitops_1.0-6             rmarkdown_1.10          
## [57] gtable_0.2.0             codetools_0.2-15        
## [59] ruv_0.9.7                reshape2_1.4.3          
## [61] R6_2.3.0                 gridExtra_2.3           
## [63] dplyr_0.7.8              bindr_0.1.1             
## [65] rprojroot_1.3-2          KernSmooth_2.23-15      
## [67] stringi_1.2.4            ggbeeswarm_0.6.0        
## [69] Rcpp_1.0.0               tidyselect_0.2.5

scMerge: R package for merging single cell RNA-Seq datasets

Dr Shila Ghazanfar, Ms Yingxin Lin, Mr Kevin Wang

6 December 2018