Doublet identifiation in single-cell ATAC-seq
Pierre-Luc Germain
University and ETH ZürichSource:
vignettes/scATAC.Rmd
scATAC.Rmd
Abstract
An introduction to the methods implemented for doublet detection in single-cell ATAC-seq.
Introduction
Analyses in single-cell RNAseq are typically limited to a relatively small (e.g. one or two thousands) set of features that are most informative; these are often the genes with a higher expression (and hence more chances of being quantified). In contrast, single-cell ATACseq (scATACseq) data is considerably more sparse, with most reads being spread across hundreds of thousands of regions. In this context, selecting a subset of genes is highly ineffective, and therefore many of the methods developed for single-cell RNAseq are not easily applicable, and need to be adapted. Methods have therefore been developed specifically for scATACseq data (Granja et al. 2021; Thibodeau et al. 2021).
This vignette presents different approaches to doublet detection in
single-cell ATAC-seq implemented in the package: the first is an
adaptation of scDblFinder
, the second a reimplementation of
the AMULET method from Thibodeau et al. (2021). The latter has the
advantage of capturing homotypic doublets, but does not perform well in
all datasets, and especially requires the cells to have a high library
size. We therefore next present two ways of combining the two.
Applying the scDblFinder method
With default parameters, the scDblFinder
method performs
very poorly on scATACseq data due to the increase spread of the reads
across many features. Since working with all features (i.e. tiles or
peaks) is computationally very expensive, an alternative approach is to
begin by reducing the size of the dataset, not through
selection (as in scRNAseq), but by aggregating
correlated features into a relatively small set. This has the advantage
of using all information, as well as making the count data more
continuous. This method yields comparable performance to specialized
single-cell ATACseq software (Germain et al., 2021).
The feature aggregation can be triggered using the
aggregateFeatures=TRUE
argument, which will aggregate peak
or tile counts into the number of meta-features defined by the
nfeatures
. If the number of meta-features is low (which we
recommend), the meta-features can be directly used to calculated
distances rather than going through the SVD step (which can be triggered
with the processing
argument). Such an example would
be:
suppressPackageStartupMessages(library(scDblFinder))
# we use a dummy SingleCellExperiment as example:
sce <- mockDoubletSCE(ngenes=300)
# setting low number of artificial doublets (same as ncells) just for speedup:
sce <- scDblFinder(sce, artificialDoublets=1, aggregateFeatures=TRUE, nfeatures=25, processing="normFeatures")
## Aggregating features...
## Creating ~526 artificial doublets...
## Dimensional reduction
## Evaluating kNN...
## Training model...
## iter=0, 17 cells excluded from training.
## iter=1, 34 cells excluded from training.
## iter=2, 36 cells excluded from training.
## Threshold found:0.719
## 18 (3.4%) doublets called
If you encounter problems running the aggregation-based approach on
large datasets, first make sure you have the mbkmeans
package installed.
Using the Amulet method
The AMULET method from Thibodeau et al. (2021) is based on the
assumption that, in a diploid cell, any given genomic region should be
captured at most twice. Therefore, cells with loci covered by more than
two fragments are indicative of the droplet being a doublet. Of note,
this approach has the advantage of capturing homotypic doublets, which
instead tend to be missed by other methods. Since it was only available
in the form of a mixture of java and python scripts, we re-implemented
the method in scDblFinder
(see ?amulet
),
leading to equal or superior results to the original implementation
(Germain et al. 2021).
As in the original implementation, we recommend excluding the
mitochondrial and sex chromosomes, as well as repetitive regions. This
can be specified with the regionsToExclude
argument (see
the underlying ?getFragmentOverlaps
). It can be used as
follows:
# here we use a dummy fragment file for example:
fragfile <- system.file("extdata", "example_fragments.tsv.gz", package="scDblFinder")
# we might also give a GRanges of repeat elements, so that these regions are excluded:
suppressPackageStartupMessages(library(GenomicRanges))
repeats <- GRanges("chr6", IRanges(1000,2000))
# it's better to combine these with mitochondrial and sex chromosomes
otherChroms <- GRanges(c("M","chrM","MT","X","Y","chrX","chrY"),IRanges(1L,width=10^8))
# here since I don't know what chromosome notation you'll be using I've just put them all,
# although this will trigger a warning when combining them:
toExclude <- suppressWarnings(c(repeats, otherChroms))
# we then launch the method
res <- amulet(fragfile, regionsToExclude=toExclude)
## Fragment file is not tabix-indexed, requiring thewhole file to be imported in memory.
## 12:24:09 PM - Splitting and subsetting barcodes...
## 12:24:09 PM - Obtaining overlaps...
res
## nFrags uniqFrags nAbove2 total.nAbove2 p.value q.value
## barcode1 878 878 1 1 0.475069053 0.791781755
## barcode2 2401 2401 0 0 0.798103482 0.798103482
## barcode3 2325 2325 1 1 0.475069053 0.791781755
## barcode4 1882 1882 0 0 0.798103482 0.798103482
## barcode5 1355 1355 6 6 0.001335761 0.006678806
The results is a data.frame with statistics for each barcode,
including a p-value. In contrast to the scDblFinder
score,
a lower p-value here is indicative of the droplet being more likely to
be a doublet (as in the original method). By default, only the barcodes
with a minimum number of reads are considered, but it is possible to
specify the droplets for which to gather statistics using the
barcodes
argument.
While the package includes an implementation that works based on
peak/tile count matrices (see ?amuletFromCounts
), it has a
much lower performance with respect to the one based directly on the
fragment files (see ?amulet
), and we therefore discourage
its use.
The workhorse behind the amulet
function is the
getFragmentOverlaps
, which also includes all of the
relevant arguments. If the fragment files are not Tabix-indexed, the
whole fragment file will have to be loaded in memory for processing;
while this ensures relatively rapid computation, it has high memory
requirements. Therefore, if the fragment file is Tabix-indexed (as is
for instance done as part of the ArchR pipeline), it will be read and
processed per chromosome, which is a little slower due to overhead, but
keeps memory requirements rather low. This behavior can be disabled by
specifying fullInMemory=TRUE
.
Combining mehtods
While the scDblFinder
-based approach generally performs
well, none of the two approach is optimal across all datasets tested. We
therefore investigated two strategies for combining the rationales of
each approach.
The Amulet method tends to perform best with datasets that have
homotypic doublets and where cells have a high library size (i.e. median
library size per cell of 10-15k reads), while the
scDblFinder
-based approach works better for heterotypic
doublets. Until an optimal solution is found, we recommend using
multiple approaches to inform decisions, in particular using the p-value
combination method below.
The Clamulet method
The clamulet
method (Classification-powered Amulet-like
method) operates similarly to the scDblFinder
method, but
generates artificial doublets by operating on the fragment coverages.
This has the advantage that the number of loci covered by more than two
reads can be computed for artificial doublets, enabling the use of this
feature (along with the kNN-based ones) in a classification scheme. It
however has the disadvantage of being rather slow and memory hungry, and
appears to be outperformed by a simple p-value combination of the two
methods (see below). We therefore do not recommend its
usage.
The clamulet
method uses the aforementioned aggregation
approach, and its usage includes a number of arguments from both the
scDblFinder
and amulet
method (see in
particular ?getFragmentOverlaps
):
# not run
d <- clamulet("path/to/fragments.tsv.gz")
Since our dummy fragment file is so small (5 barcodes), here we’ll have to adjust the arguments for an example to run:
d <- clamulet(fragfile, k=2, nfeatures=3)
## 12:24:10 PM - Reading full fragments...
## 12:24:10 PM - Splitting and subsetting barcodes...
## 12:24:10 PM - Computing coverages
## 12:24:10 PM - Obtaining windows
## 12:24:11 PM - Obtaining window counts
## 12:24:11 PM - Aggregating features
## Warning in (function (A, nv = 5, nu = nv, maxit = 1000, work = nv + 7, reorth =
## TRUE, : You're computing too large a percentage of total singular values, use a
## standard svd instead.
## 12:24:11 PM - Computing features for artificial doublets
## 12:24:11 PM - Counting overlaps for real cells
## 12:24:11 PM - Counting overlaps for artificial doublets
## 12:24:11 PM - Scoring network
## 12:24:11 PM - Iterative training
## iter=0, 0 cells excluded from training.
## iter=1, 0 cells excluded from training.
## 12:24:11 PM Done!
d
## total nAbove2 total.nAbove2 weighted ratio.k2 include.in.training
## barcode1 20 1 1 0.3822253 0.5 TRUE
## barcode2 14 0 0 NaN 0.5 TRUE
## barcode3 11 1 1 1.0000000 1.0 TRUE
## barcode4 12 0 0 NaN 0.5 TRUE
## barcode5 18 6 6 0.2831145 0.5 TRUE
## score
## barcode1 0.4174298
## barcode2 0.4174298
## barcode3 0.4174298
## barcode4 0.4174298
## barcode5 0.4174298
The score can then be interpreted as for scDblFinder
. We
however note that this method proved inferior to
alternatives.
Simple p-value combination
The amulet and scDblFinder scores above can be simply combined by treating them as p-values and aggregating them (here using Fisher’s method from the aggregation package, but see also the metap package):
res$scDblFinder.p <- 1-colData(sce)[row.names(res), "scDblFinder.score"]
res$combined <- apply(res[,c("scDblFinder.p", "p.value")], 1, FUN=function(x){
x[x<0.001] <- 0.001 # prevent too much skew from very small or 0 p-values
suppressWarnings(aggregation::fisher(x))
})
We found this to perform better than averaging the scores or their ranks, and while it is not the very best method in any of the datasets tested, it has a more robust performance overall (see Germain et al., 2021).
References
Jeffrey M. Granja et al., “ArchR Is a Scalable Software Package for Integrative Single-Cell Chromatin Accessibility Analysis,” Nature Genetics, February 25, 2021, 1–9, https://doi.org/10.1038/s41588-021-00790-6
Asa Thibodeau et al., “AMULET: A Novel Read Count-Based Method for Effective Multiplet Detection from Single Nucleus ATAC-Seq Data,” Genome Biology 22, no. 1 (December 2021): 252, https://doi.org/10.1186/s13059-021-02469-x
Pierre-Luc Germain et al., “Doublet Identification in Single-Cell Sequencing Data Using ScDblFinder” (F1000Research, September 28, 2021), https://doi.org/10.12688/f1000research.73600.1
Session information
## R Under development (unstable) (2025-01-04 r87523)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] scDblFinder_1.21.2 SingleCellExperiment_1.29.1
## [3] SummarizedExperiment_1.37.0 Biobase_2.67.0
## [5] GenomicRanges_1.59.1 GenomeInfoDb_1.43.2
## [7] IRanges_2.41.2 S4Vectors_0.45.2
## [9] BiocGenerics_0.53.3 generics_0.1.3
## [11] MatrixGenerics_1.19.0 matrixStats_1.5.0
## [13] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-9 gridExtra_2.3 rlang_1.1.4
## [4] magrittr_2.0.3 scater_1.35.0 compiler_4.5.0
## [7] systemfonts_1.1.0 vctrs_0.6.5 pkgconfig_2.0.3
## [10] crayon_1.5.3 fastmap_1.2.0 XVector_0.47.1
## [13] scuttle_1.17.0 Rsamtools_2.23.1 rmarkdown_2.29
## [16] UCSC.utils_1.3.0 ggbeeswarm_0.7.2 ragg_1.3.3
## [19] xfun_0.50 bluster_1.17.0 zlibbioc_1.53.0
## [22] cachem_1.1.0 beachmat_2.23.6 jsonlite_1.8.9
## [25] DelayedArray_0.33.3 BiocParallel_1.41.0 irlba_2.3.5.1
## [28] parallel_4.5.0 cluster_2.1.8 R6_2.5.1
## [31] bslib_0.8.0 limma_3.63.2 rtracklayer_1.67.0
## [34] xgboost_1.7.8.1 jquerylib_0.1.4 Rcpp_1.0.13-1
## [37] bookdown_0.42 knitr_1.49 Matrix_1.7-1
## [40] igraph_2.1.3 tidyselect_1.2.1 abind_1.4-8
## [43] yaml_2.3.10 viridis_0.6.5 codetools_0.2-20
## [46] curl_6.1.0 lattice_0.22-6 tibble_3.2.1
## [49] evaluate_1.0.1 desc_1.4.3 Biostrings_2.75.3
## [52] pillar_1.10.1 BiocManager_1.30.25 RCurl_1.98-1.16
## [55] ggplot2_3.5.1 munsell_0.5.1 scales_1.3.0
## [58] glue_1.8.0 metapod_1.15.0 tools_4.5.0
## [61] BiocIO_1.17.1 data.table_1.16.4 BiocNeighbors_2.1.2
## [64] ScaledMatrix_1.15.0 locfit_1.5-9.10 GenomicAlignments_1.43.0
## [67] fs_1.6.5 scran_1.35.0 XML_3.99-0.18
## [70] grid_4.5.0 edgeR_4.5.1 colorspace_2.1-1
## [73] GenomeInfoDbData_1.2.13 beeswarm_0.4.0 BiocSingular_1.23.0
## [76] restfulr_0.0.15 vipor_0.4.7 cli_3.6.3
## [79] rsvd_1.0.5 textshaping_0.4.1 S4Arrays_1.7.1
## [82] viridisLite_0.4.2 dplyr_1.1.4 gtable_0.3.6
## [85] sass_0.4.9 digest_0.6.37 SparseArray_1.7.2
## [88] ggrepel_0.9.6 dqrng_0.4.1 rjson_0.2.23
## [91] htmlwidgets_1.6.4 htmltools_0.5.8.1 pkgdown_2.1.1
## [94] lifecycle_1.0.4 httr_1.4.7 statmod_1.5.0
## [97] MASS_7.3-64