scDblFinder — scDblFinder • scDblFinder

Identification of heterotypic (or neotypic) doublets in single-cell RNAseq using cluster-based generation of artificial doublets.

Usage

scDblFinder(
  sce,
  clusters = NULL,
  samples = NULL,
  clustCor = NULL,
  artificialDoublets = NULL,
  knownDoublets = NULL,
  knownUse = c("discard", "positive"),
  dbr = NULL,
  dbr.sd = NULL,
  dbr.per1k = 0.008,
  nfeatures = 1352,
  dims = 20,
  k = NULL,
  removeUnidentifiable = TRUE,
  includePCs = 19,
  propRandom = 0,
  propMarkers = 0,
  aggregateFeatures = FALSE,
  returnType = c("sce", "table", "full", "counts", "scores"),
  score = c("xgb", "weighted", "ratio"),
  processing = "default",
  metric = "logloss",
  nrounds = 0.25,
  max_depth = 4,
  iter = 3,
  trainingFeatures = NULL,
  unident.th = NULL,
  multiSampleMode = c("split", "singleModel", "singleModelSplitThres", "asOne"),
  threshold = TRUE,
  verbose = TRUE,
  BPPARAM = SerialParam(progressbar = verbose),
  ...
)

Arguments

sce: A SummarizedExperiment-class, SingleCellExperiment-class, or array of counts.
clusters: The optional cluster assignments. This is used to make doublets more efficiently. clusters should either be a vector of labels for each cell, or the name of a colData column of sce. Alternatively, if `clusters=TRUE`, fast clustering will be performed. If `clusters` is a single integer, it will determine how many clusters to create (using k-means clustering). If `clusters` is NULL or FALSE, purely random artificial doublets will be generated.
samples: A vector of the same length as cells (or the name of a column of colData(x)), indicating to which sample each cell belongs. Here, a sample is understood as being processed independently. If omitted, doublets will be searched for with all cells together. If given, doublets will be searched for independently for each sample, which is preferable if they represent different captures. If your samples were multiplexed using cell hashes, what you want to give here are the different batches/wells (i.e. independent captures, since doublets cannot arise across them) rather than biological samples.
clustCor: Include Spearman correlations to cell type averages in the predictors. If `clustCor` is a matrix of cell type marker expressions (with features as rows and cell types as columns), the subset of these which are present in the selected features will be correlated to each cell to produce additional predictors (i.e. one per cell type). Alternatively, if `clustCor` is a positive integer, this number of inter-cluster markers will be selected and used for correlation (se `clustCor=Inf` to use all available genes).
artificialDoublets: The approximate number of artificial doublets to create. If NULL, will be the maximum of the number of cells or 5*nbClusters^2 (with a minimum of 1500).
knownDoublets: An optional logical vector of known doublets (e.g. through cell barcodes), or the name of a colData column of `sce` containing that information. The way these are used depends on the `knownUse` argument.
knownUse: The way to use known doublets, either 'discard' (they are discarded for the purpose of training, but counted as positive for thresholding) or 'positive' (they are used as positive doublets for training - usually leads to a mild decrease in accuracy due to the fact that known doublets typically include a sizeable fraction of homotypic doublets). Note that `scDblFinder` does *not* enforce that the knownDoublets be necessarily called as doublets in the final classification, if they are not predicted as such.
dbr: The expected doublet rate, i.e. the proportion of the cells expected to be doublets. If omitted, will be calculated automatically based on the `dbr.per1k` argument and the number of cells.
dbr.sd: The uncertainty range in the doublet rate, interpreted as a +/- around `dbr`. During thresholding, deviation from the expected doublet rate will be calculated from these boundaries, and will be considered null within these boundaries. If NULL, will be 40% of `dbr`. Set to `dbr.sd=0` to disable the uncertainty around the doublet rate, or to `dbr.sd=1` to disable any expectation of the number of doublets (thus letting the thresholding be entirely driven by the misclassification of artificial doublets).
dbr.per1k: This is an alternative way of providing the expected doublet rate as a fraction of the number of (the thousands of) cells captured. The default, 0.008 (e.g. 3.2% doublets among 4000 cells), is appropriate for standard 10X chips. For High Throughput (HT) 10X chips, use half, i.e. 0.004. (Some more recent chips might have this rate even lower).
nfeatures: The number of top features to use. Alternatively, a character vectors of feature names (e.g. highly-variable genes) to use.
dims: The number of dimensions used.
k: Number of nearest neighbors (for KNN graph). If more than one value is given, the doublet density will be calculated at each k (and other values at the highest k), and all the information will be used by the classifier. If omitted, a reasonable set of values is used.
removeUnidentifiable: Logical; whether to remove artificial doublets of a combination that is generally found to be unidentifiable.
includePCs: The index of principal components to include in the predictors (e.g. `includePCs=1:2`), or the number of top components to use (e.g. `includePCs=10`, equivalent to 1:10).
propRandom: The proportion of the artificial doublets which should be made of random cells (as opposed to inter-cluster combinations). If clusters is FALSE or NULL, this is ignored (and set to 1).
propMarkers: The proportion of features to select based on marker identification.
aggregateFeatures: Whether to perform feature aggregation (recommended for ATAC). Can also be a positive integer, in which case this will indicate the number of components to use for feature aggregation (if TRUE, `dims` will be used.)
returnType: Either "sce" (default, returns a SingleCellExperiment with additional colData columns), "scores" (returns a data.frame of scores and doublet calls for each barcode), "table" (to return the table of cell attributes including artificial doublets), or "full" (returns an SCE object containing both the real and artificial cells).
score: Score to use for final classification.
processing: Counts (real and artificial) processing before KNN. Either 'default' (normal scater-based normalization and PCA), "rawPCA" (PCA without normalization), "rawFeatures" (no normalization/dimensional reduction), "normFeatures" (uses normalized features, without PCA) or a custom function with (at least) arguments `e` (the matrix of counts) and `dims` (the desired number of dimensions), returning a named matrix with cells as rows and components as columns.
metric: Error metric to optimize during training (e.g. 'merror', 'logloss', 'auc', 'aucpr').
nrounds: Maximum rounds of boosting. If NULL, will be determined through cross-validation. If a number <=1, will used the best cross-validation round minus `nrounds` times the standard deviation of the classification error.
max_depth: Maximum depths of each tree.
iter: A positive integer indicating the number of scoring iterations (ignored if `score` isn't based on classifiers). At each iteration, real cells that would be called as doublets are excluding from the training, and new scores are calculated. Recommended values are 1 or 2.
trainingFeatures: The features to use for training (defaults to an optimal pre-selection based on benchmark datasets). To exclude features (rather than list those to be included), prefix them with a "-".
unident.th: The score threshold below which artificial doublets will be considered unidentifiable.
multiSampleMode: Either "split" (recommended if there is heterogeneity across samples), "singleModel", "singleModelSplitThres", or "asOne" (see details below).
threshold: Logical; whether to threshold scores into binary doublet calls
verbose: Logical; whether to print messages and the thresholding plot.
BPPARAM: Used for multithreading when splitting by samples (i.e. when `samples!=NULL`); otherwise passed to eventual PCA and K/SNN calculations.
...: further arguments passed to getArtificialDoublets.

Value

The sce object with several additional colData columns, in particular `scDblFinder.score` (the final score used) and `scDblFinder.class` (whether the cell is called as 'doublet' or 'singlet'). See vignette("scDblFinder") for more details; for alternative return values, see the `returnType` argument.

Details

This function generates artificial doublets from real cells, evaluates their prevalence in the neighborhood of each cells, and uses this along with additional cell-level features to classify doublets. The approach is complementary to doublets identified via cell hashes and SNPs in multiplexed samples: the latter can identify doublets formed by cells of the same type from two samples, which are nearly undistinguishable from real cells transcriptionally, but cannot identify doublets made by cells of the same sample. See vignette("scDblFinder") for more details on the method.

The `clusters` and `propRandom` argument determines whether the artificial doublets are generated between clusters or randomly.

When multiple samples/captures are present, they should be specified using the samples argument. In this case, we recommend the use of BPPARAM to perform several of the steps in parallel. Artificial doublets and kNN networks will be computed separately; then the behavior will then depend on the `multiSampleMode` argument:

split: the whole process is split by sample. This is the default and recommended mode, because it is the most robust (e.g. to heterogeneity between samples, also for instance in the number of cells), and in practice we have not seen major gains in sharing information across samples;
singleModel: the doublets are generated on a per-sample basis, but the classifier and thresholding will be trained globally;
singleModelSplitThres: the doublets are generated on a per-sample basis, the classifier is trained globally, but the final thresholding is per-sample;
asOne: the doublet rate (if not given) is calculated as the weighted average of sample-specific doublet rates, and all samples are otherwise run as if they were one sample. This can get computationally more intensive, and can lead to biases if there are batch effects.

When inter-sample doublets are available, they can be provided to `scDblFinder` through the knownDoublets argument to improve the identification of further doublets. How exactly these are used depends on the `knownUse` argument: with 'discard' (default), the known doublets are excluded from the training step, but counted as positives. With 'positive', they are included and treated as positive doublets for the training step. Note that because known doublets can in practice include a lot of homotypic doublets, this second approach can often lead to a slight decrease in the accuracy of detecting heterotypic doublets.

Finally, for some types of data, such as single-cell ATAC-seq, selecting a number of top features is ineffective due to the high sparsity of the signal. In such contexts, rather than _selecting_ features we recommend to use the alternative approach of _aggregating_ similar features (with `aggregateFeatures=TRUE`), which strongly improves accuracy. See the vignette for more detail.

Examples

library(SingleCellExperiment)
sce <- mockDoubletSCE()
sce <- scDblFinder(sce)
#> Creating ~1500 artificial doublets...
#> Dimensional reduction
#> Evaluating kNN...
#> Training model...
#> iter=0, 23 cells excluded from training.
#> [1]	train-logloss:0.471404 
#> Will train until train_logloss hasn't improved in 2 rounds.
#> 
#> [2]	train-logloss:0.340009 
#> [3]	train-logloss:0.256516 
#> [4]	train-logloss:0.202024 
#> [5]	train-logloss:0.163660 
#> [6]	train-logloss:0.131863 
#> [7]	train-logloss:0.111575 
#> [8]	train-logloss:0.094119 
#> [9]	train-logloss:0.080730 
#> [10]	train-logloss:0.070168 
#> [11]	train-logloss:0.061699 
#> [12]	train-logloss:0.055297 
#> [13]	train-logloss:0.049246 
#> [14]	train-logloss:0.045006 
#> [15]	train-logloss:0.041082 
#> [16]	train-logloss:0.037230 
#> [17]	train-logloss:0.034096 
#> [18]	train-logloss:0.031961 
#> [19]	train-logloss:0.029695 
#> iter=1, 19 cells excluded from training.
#> [1]	train-logloss:0.471069 
#> Will train until train_logloss hasn't improved in 2 rounds.
#> 
#> [2]	train-logloss:0.339401 
#> [3]	train-logloss:0.255269 
#> [4]	train-logloss:0.201083 
#> [5]	train-logloss:0.158154 
#> [6]	train-logloss:0.130112 
#> [7]	train-logloss:0.107241 
#> [8]	train-logloss:0.091591 
#> [9]	train-logloss:0.079358 
#> [10]	train-logloss:0.068245 
#> [11]	train-logloss:0.060573 
#> [12]	train-logloss:0.052238 
#> [13]	train-logloss:0.046550 
#> [14]	train-logloss:0.042383 
#> [15]	train-logloss:0.037792 
#> [16]	train-logloss:0.033966 
#> [17]	train-logloss:0.031337 
#> [18]	train-logloss:0.028771 
#> [19]	train-logloss:0.026748 
#> [20]	train-logloss:0.024873 
#> [21]	train-logloss:0.023266 
#> [22]	train-logloss:0.021835 
#> iter=2, 19 cells excluded from training.
#> [1]	train-logloss:0.471069 
#> Will train until train_logloss hasn't improved in 2 rounds.
#> 
#> [2]	train-logloss:0.339401 
#> [3]	train-logloss:0.255269 
#> [4]	train-logloss:0.201083 
#> [5]	train-logloss:0.158154 
#> [6]	train-logloss:0.130112 
#> [7]	train-logloss:0.107241 
#> [8]	train-logloss:0.091591 
#> [9]	train-logloss:0.079358 
#> [10]	train-logloss:0.068245 
#> [11]	train-logloss:0.060573 
#> [12]	train-logloss:0.052238 
#> [13]	train-logloss:0.046550 
#> [14]	train-logloss:0.042383 
#> [15]	train-logloss:0.037792 
#> [16]	train-logloss:0.033966 
#> [17]	train-logloss:0.031337 
#> [18]	train-logloss:0.028771 
#> [19]	train-logloss:0.026748 
#> [20]	train-logloss:0.024873 
#> [21]	train-logloss:0.023266 
#> Threshold found:0.528
#> 18 (3.5%) doublets called
table(truth=sce$type, call=sce$scDblFinder.class)
#>          call
#> truth     singlet doublet
#>   singlet     500       0
#>   doublet       0      18