Skip to contents

tl;dr

To demonstrate, we’ll use one of the mammary gland datasets from the scRNAseq package. We will subset it down to a random set of 500 cells for speed.

library(scRNAseq)
sce <- BachMammaryData(samples="G_2")

set.seed(1000)
sce <- sce[,sample(ncol(sce), 500)]

For the purposes of this demonstration, we’ll perform an extremely expedited analysis. One would usually take more care here and do some quality control, create some diagnostic plots, etc., but we don’t have the space for that.

library(scuttle)
sce <- logNormCounts(sce)

library(scran)
dec <- modelGeneVar(sce)

library(scater)
set.seed(1000)
sce <- runPCA(sce, ncomponents=10, subset_row=getTopHVGs(dec, n=1000))

library(bluster)
clusters <- clusterRows(reducedDim(sce, "PCA"), NNGraphParam())

sce <- runTSNE(sce, dimred="PCA")
plotTSNE(sce, colour_by=I(clusters), text_by=I(clusters))

We then run findDoubletClusters() to test each cluster against the null hypothesis that it does consist of doublets. The null is rejected if a cluster has many DE genes that lie outside the expression limits defined by the “source” clusters. On the other hand, if num.de is low, the cluster’s expression profile is consistent with the doublet hypothesis.

library(scDblFinder)
tab <- findDoubletClusters(sce, clusters)
tab
## DataFrame with 7 rows and 9 columns
##       source1     source2    num.de median.de               best     p.value
##   <character> <character> <integer> <integer>        <character>   <numeric>
## 6           2           1         0       122 ENSMUSG00000019256 1.00000e+00
## 3           6           5         8        46 ENSMUSG00000002985 8.41352e-05
## 7           5           1        11       393 ENSMUSG00000075705 2.04690e-10
## 4           7           2        24        63 ENSMUSG00000022491 4.15645e-13
## 1           7           6        86       715 ENSMUSG00000001349 8.49703e-08
## 5           7           2       112      1007 ENSMUSG00000024610 2.38765e-11
## 2           6           5       124       336 ENSMUSG00000023505 4.06936e-06
##   lib.size1 lib.size2      prop
##   <numeric> <numeric> <numeric>
## 6  0.927081  0.443667     0.084
## 3  2.026090  0.859424     0.242
## 7  1.195040  1.249946     0.030
## 4  0.630479  1.646729     0.150
## 1  0.800035  2.253942     0.192
## 5  0.836792  2.185591     0.032
## 2  1.078655  0.457542     0.270

Mathematical background

Consider a cell population ii that has mean transcript count λgi\lambda_{gi} for gene gg. Assume that each population exhibits a unique scaling bias sis_i, representing the efficiency of library preparation for that population. The observed read/UMI count for each gene is then μgi=siλgi\mu_{gi}=s_i\lambda_{gi}. (For simplicity, we will ignore gene-specific scaling biases, as this is easily accommodated by considering λgiϕgλgi\lambda_{gi} \equiv \phi_g \lambda_{gi} for some bias ϕg\phi_g.) The expected total count for each population is Ni=gμgiN_i = \sum_g \mu_{gi}.

Now, let us consider a doublet population jj that forms from two parent populations i1i_1 and i2i_2. The observed read count for gg in jj is μgj=sj(λgi1+λgi2)\mu_{gj} = s_j (\lambda_{gi_1} + \lambda_{gi_2}). Note that sjs_j need not be any particular function of si1s_{i_1} and si2s_{i_2}. Rather, this relationship depends on how quickly the reverse transcription and amplification reagents are saturated during library preparation, which is difficult to make assumptions around.

Normalization by library size

We obtain log-normalized expression values for each cell based on the library size. Assume that the library size-normalized expression values are such that μgi1Ni11<μgi2Ni21\mu_{gi_1}N_{i_1}^{-1} < \mu_{gi_2}N_{i_2}^{-1}, i.e., the proportion of gg increases in i2i_2 compared to i1i_1. The contribution of each sis_i cancels out, yielding λgi1gλgi1<λgi2gλgi2. \frac{\lambda_{gi_1}}{\sum_g \lambda_{gi_1}} < \frac{\lambda_{gi_2}}{\sum_g \lambda_{gi_2}} \;. The normalized expression value of the doublet cluster jj is subsequently λgi1+λgi2g(λgi1+λgi2), \frac{\lambda_{gi_1} + \lambda_{gi_2}}{\sum_g (\lambda_{gi_1} + \lambda_{gi_2})} \;, and it is fairly easy to show that λgi1gλgi1<λgi1+λgi2g(λgi1+λgi2)<λgi2gλgi2. \frac{\lambda_{gi_1}}{\sum_g \lambda_{gi_1}} < \frac{\lambda_{gi_1} + \lambda_{gi_2}}{\sum_g (\lambda_{gi_1} + \lambda_{gi_2})} < \frac{\lambda_{gi_2}}{\sum_g \lambda_{gi_2}} \;. In other words, the expected library size-normalized expression of our gene in the doublet cluster lies between that of the two parents.

It is harder to provide theoretical guarantees with arbitrary size factors, which is why we only use the library sizes for normalization instead. The exception is that of spike-in size factors that would estimate sis_i directly. This would allow us to obtain estimates of λgi\lambda_{gi} for the parent clusters and of λgi1+λgi2\lambda_{gi_1} + \lambda_{gi_2} for the doublets. In this manner, we could more precisely identify doublet clusters as those where the normalized expression value is equal to the sum of the parents. Unfortunately, spike-ins are generally not available for droplet-based data sets where doublets are most problematic.

Testing for (lack of) intermediacy

We want to identify the clusters that may be comprised of doublets of other clusters. For each cluster jj', we test for differential expression in the library size-normalized expression profiles against every other cluster ii'. For each pair of other clusters i1i'_1 and i2i'_2, we identify genes that change in jj' against both i1i'_1 and i2i'_2in the same direction. The presence of such genes violates the intermediacy expected of a doublet cluster and provides evidence that jj' is not a doublet of i1i'_1 and i2i'_2.

Significant genes are identified by an intersection-union test on the pp-values from the pairwise comparisons between jj' and i1i'_1 or i2i'_2. (Specifically, tt-tests are used via the findMarkers() function from scran.) The pp-value for a gene is set to unity when the signs of the log-fold changes are not the same between comparisons. Multiple correction testing is applied using the Benjamini-Hochberg method, and the number of genes detected at a specified false discovery rate (usually 5%) is counted. The pair (i1,i2)(i'_1, i'_2) with the fewest detected genes are considered as the putative parents of jj'.

In theory, it is possible to compute the Simes’ combined pp-value across all genes to reject the doublet hypothesis for jj'. This would provide a more rigorous approach to ruling out potential doublet/parent combinations. However, this is very sensitive to misspecification of clusters – see below.

Calling doublet clusters

Assuming that most clusters are not comprised of doublets, we identify clusters that have an unusually low number of detected genes that violate the intermediacy condition. This is achieved by identifying small outliers on the log-transformed number of detected genes, using the median absolute deviation-based method in the function. (We use a log-transformation simply to improve resolution at low values.) Clusters are likely to be doublets if they are outliers on this metric.

Doublet clusters should also have larger library sizes than the proposed parent clusters. This is consistent with the presence of more RNA in each doublet, though the library size of the doublet cluster need not be a sum of that of the parent clusters (due to factors such as saturation and composition effects). The proportion of cells assigned to the doublet cluster should also be “reasonable”; exactly what this means depends on the experimental setup and the doublet rate of the protocol in use.

Discussion

The biggest advantage of this approach lies in its interpretability. Given a set of existing clusters, we can explicitly identify those that are likely to be doublets. We also gain some insight onto the parental origins of each putative doublet cluster, which may be of some interest. We avoid any assumptions about doublet formation that are otherwise necessary for the simulation-based methods. In particular, we do not require any knowledge about exact the relationship between sjs_j and sis_i, allowing us to identify doublets even when the exact location of the doublet is unknown (e.g., due to differences in RNA content between the parent clusters).

The downside is that, of course, we are dependent on being supplied with sensible clusters where the parental and doublet cells are separated. The intermediacy requirement is loose enough to provide some robustness against misspecification, but this only goes so far. In addition, this strategy has a bias towards calling clusters with few cells as doublets (or parents of doublets) because the DE detection power is low. This can be somewhat offset by comparing num.de against median.de as latter will be low for clusters involved in systematically low-powered comparisons, though it is difficult to adjust for the exact effect of the differences of power on the IUT.

Session information

## R Under development (unstable) (2025-01-04 r87523)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] scDblFinder_1.21.2          bluster_1.17.0             
##  [3] scater_1.35.0               ggplot2_3.5.1              
##  [5] scran_1.35.0                scuttle_1.17.0             
##  [7] ensembldb_2.31.0            AnnotationFilter_1.31.0    
##  [9] GenomicFeatures_1.59.1      AnnotationDbi_1.69.0       
## [11] scRNAseq_2.21.0             SingleCellExperiment_1.29.1
## [13] SummarizedExperiment_1.37.0 Biobase_2.67.0             
## [15] GenomicRanges_1.59.1        GenomeInfoDb_1.43.2        
## [17] IRanges_2.41.2              S4Vectors_0.45.2           
## [19] BiocGenerics_0.53.3         generics_0.1.3             
## [21] MatrixGenerics_1.19.0       matrixStats_1.5.0          
## [23] BiocStyle_2.35.0           
## 
## loaded via a namespace (and not attached):
##   [1] jsonlite_1.8.9           magrittr_2.0.3           ggbeeswarm_0.7.2        
##   [4] gypsum_1.3.0             farver_2.1.2             rmarkdown_2.29          
##   [7] fs_1.6.5                 BiocIO_1.17.1            zlibbioc_1.53.0         
##  [10] ragg_1.3.3               vctrs_0.6.5              memoise_2.0.1           
##  [13] Rsamtools_2.23.1         RCurl_1.98-1.16          htmltools_0.5.8.1       
##  [16] S4Arrays_1.7.1           AnnotationHub_3.15.0     curl_6.1.0              
##  [19] BiocNeighbors_2.1.2      xgboost_1.7.8.1          Rhdf5lib_1.29.0         
##  [22] SparseArray_1.7.2        rhdf5_2.51.1             sass_0.4.9              
##  [25] alabaster.base_1.7.2     bslib_0.8.0              htmlwidgets_1.6.4       
##  [28] desc_1.4.3               alabaster.sce_1.7.0      httr2_1.0.7             
##  [31] cachem_1.1.0             GenomicAlignments_1.43.0 igraph_2.1.3            
##  [34] mime_0.12                lifecycle_1.0.4          pkgconfig_2.0.3         
##  [37] rsvd_1.0.5               Matrix_1.7-1             R6_2.5.1                
##  [40] fastmap_1.2.0            GenomeInfoDbData_1.2.13  digest_0.6.37           
##  [43] colorspace_2.1-1         dqrng_0.4.1              irlba_2.3.5.1           
##  [46] ExperimentHub_2.15.0     textshaping_0.4.1        RSQLite_2.3.9           
##  [49] beachmat_2.23.6          labeling_0.4.3           filelock_1.0.3          
##  [52] httr_1.4.7               abind_1.4-8              compiler_4.5.0          
##  [55] bit64_4.5.2              withr_3.0.2              BiocParallel_1.41.0     
##  [58] viridis_0.6.5            DBI_1.2.3                HDF5Array_1.35.2        
##  [61] alabaster.ranges_1.7.0   alabaster.schemas_1.7.0  MASS_7.3-64             
##  [64] rappdirs_0.3.3           DelayedArray_0.33.3      rjson_0.2.23            
##  [67] tools_4.5.0              vipor_0.4.7              beeswarm_0.4.0          
##  [70] glue_1.8.0               restfulr_0.0.15          rhdf5filters_1.19.0     
##  [73] grid_4.5.0               Rtsne_0.17               cluster_2.1.8           
##  [76] gtable_0.3.6             data.table_1.16.4        metapod_1.15.0          
##  [79] BiocSingular_1.23.0      ScaledMatrix_1.15.0      XVector_0.47.1          
##  [82] ggrepel_0.9.6            BiocVersion_3.21.1       pillar_1.10.1           
##  [85] limma_3.63.2             dplyr_1.1.4              BiocFileCache_2.15.0    
##  [88] lattice_0.22-6           rtracklayer_1.67.0       bit_4.5.0.1             
##  [91] tidyselect_1.2.1         locfit_1.5-9.10          Biostrings_2.75.3       
##  [94] knitr_1.49               gridExtra_2.3            bookdown_0.42           
##  [97] ProtGenerics_1.39.1      edgeR_4.5.1              xfun_0.50               
## [100] statmod_1.5.0            UCSC.utils_1.3.0         lazyeval_0.2.2          
## [103] yaml_2.3.10              evaluate_1.0.1           codetools_0.2-20        
## [106] tibble_3.2.1             alabaster.matrix_1.7.4   BiocManager_1.30.25     
## [109] cli_3.6.3                systemfonts_1.1.0        munsell_0.5.1           
## [112] jquerylib_0.1.4          Rcpp_1.0.13-1            dbplyr_2.5.0            
## [115] png_0.1-8                XML_3.99-0.18            parallel_4.5.0          
## [118] pkgdown_2.1.1            blob_1.2.4               bitops_1.0-9            
## [121] viridisLite_0.4.2        alabaster.se_1.7.0       scales_1.3.0            
## [124] purrr_1.0.2              crayon_1.5.3             rlang_1.1.4             
## [127] KEGGREST_1.47.0