simPIC 1.1.0
simPIC is an R package for the simple simulation of single-cell RNA sequencing data. This vignette gives an overview and introduction to simPIC’s functionality.
simPIC can be installed from Bioconductor:
BiocManager::install("simPIC")
Assuming you already have a matrix of count data similar to that you wish to simulate there are two simple steps to creating a simulated data set with simPIC.
# Load package
suppressPackageStartupMessages({
library(simPIC)
})
# Load test data
set.seed(12)
counts <- readRDS(system.file("extdata", "test.rds", package = "simPIC"))
# Estimate parameters
est <- simPICestimate(counts)
#> simPIC is:
#> estimating library size parameters...
#> estimating sparsity...
#> estimating peak mean parameters...
#> using weibull distribution for estimating peak mean
# Simulate data using estimated parameters
sim <- simPICsimulate(est)
#> simPIC is:
#> updating parameters...
#> setting up SingleCellExperiment object...
#> Simulating library size...
#> Simulating peak mean...
#> using weibull distribution for simulating peak mean
#> Simulating true counts...
#> Done!!
These steps will be explained in detail in the following sections but briefly the first step takes a dataset and estimates simulation parameters from it and the second step takes those parameters and simulates a new dataset.
simPIC
recommends to use a Paired-Insertion Count (PIC) matrix for optimal use
of the quantitative information present in scATAC-seq data. You can convert your
own matrix to PIC by following the PICsnATAC
vignette. Briefly, you need three input files for PIC -
singlecell.csv
)peaks.bed
)fragment.tsv.gz
)pic_mat <- PIC_counting(
cells = cells,
fragment_tsv_gz_file_location = fragment_tsv_gz_file_location,
peak_sets = peak_sets
)
The core of the simPIC model is a gamma-Poisson distribution. This model is used
to generate a uniform quantification based peak-by-cell count matrix. Mean
chromatin accessibility signals for each peak are simulated from a weibull
distribution with default
settings. Users have the flexibility to choose from gamma
, lognormal
or
pareto
distribution as well. Each cell is given an expected library size,
simulated from a log-normal distribution to match to a given dataset. Sparsity
is imposed on counts simulated from a Poisson
distribution.
simPICcount
classAll the parameters for the simPIC simulation are stored in a simPICcount
object. A class specifically desgined for storing simPIC scATAC-seq simulation
parameters. Let’s create a new one and see what it looks like.
sim.params <- newsimPICcount()
we can see the default values for the simPICcount
object parameters. These
values are based on provided test data.
sim.params
#> An object of class "simPICcount"
#> Slot "nPeaks":
#> [1] 5000
#>
#> Slot "nCells":
#> [1] 700
#>
#> Slot "seed":
#> [1] 47341
#>
#> Slot "default":
#> [1] TRUE
#>
#> Slot "pm.distr":
#> [1] "weibull"
#>
#> Slot "lib.size.meanlog":
#> [1] 6.687082
#>
#> Slot "lib.size.sdlog":
#> [1] 0.344361
#>
#> Slot "peak.mean.shape":
#> [1] 0.7909301
#>
#> Slot "peak.mean.rate":
#> [1] 7.100648
#>
#> Slot "peak.mean.scale":
#> [1] 0.09522228
#>
#> Slot "peak.mean.pi":
#> [1] -17.17441
#>
#> Slot "peak.mean.meanlog":
#> [1] -2.825233
#>
#> Slot "peak.mean.sdlog":
#> [1] -1.366378
#>
#> Slot "sparsity":
#> [1] 0.95132112 0.95280678 0.41876963 0.91809606 0.31367613 0.37465769
#> [7] 0.11508070 0.11565853 0.91282331 0.12997421 0.29936492 0.07657539
#> [13] 0.37739953 0.20495343 0.98123142 0.98941133 0.31369648 0.53168575
#> [19] 0.65535333 0.07456167 0.64285361 0.27511977 0.81004527 0.66217860
#> [25] 0.92540088 0.50633174 0.49355193 0.90714596 0.46009649 0.03976855
#> [ reached getOption("max.print") -- omitted 199970 entries ]
To get a particular parameter, for e.g., number of peaks, we can use simPICget
function:
simPICget(sim.params, "nPeaks")
#> [1] 5000
Alternatively, to give a parameter a new value we can use the
setsimPICparameters
function:
sim.params <- setsimPICparameters(sim.params, nPeaks = 2000)
simPICget(sim.params, "nPeaks")
#> [1] 2000
To get or set multiple parameters use simPICget
or setsimPICparameters
functions:
# Set multiple parameters at once (using a list)
sim.params <- setsimPICparameters(sim.params,
update = list(nPeaks = 8000, nCells = 500)
)
# Extract multiple parameters as a list
params <- simPICgetparameters(
sim.params,
c("nPeaks", "nCells", "peak.mean.shape")
)
# Set multiple parameters at once (using additional arguments)
params <- setsimPICparameters(sim.params,
lib.size.sdlog = 3.5, lib.size.meanlog = 9.07
)
params
#> An object of class "simPICcount"
#> Slot "nPeaks":
#> [1] 8000
#>
#> Slot "nCells":
#> [1] 500
#>
#> Slot "seed":
#> [1] 47341
#>
#> Slot "default":
#> [1] TRUE
#>
#> Slot "pm.distr":
#> [1] "weibull"
#>
#> Slot "lib.size.meanlog":
#> [1] 9.07
#>
#> Slot "lib.size.sdlog":
#> [1] 3.5
#>
#> Slot "peak.mean.shape":
#> [1] 0.7909301
#>
#> Slot "peak.mean.rate":
#> [1] 7.100648
#>
#> Slot "peak.mean.scale":
#> [1] 0.09522228
#>
#> Slot "peak.mean.pi":
#> [1] -17.17441
#>
#> Slot "peak.mean.meanlog":
#> [1] -2.825233
#>
#> Slot "peak.mean.sdlog":
#> [1] -1.366378
#>
#> Slot "sparsity":
#> [1] 0.95132112 0.95280678 0.41876963 0.91809606 0.31367613 0.37465769
#> [7] 0.11508070 0.11565853 0.91282331 0.12997421 0.29936492 0.07657539
#> [13] 0.37739953 0.20495343 0.98123142 0.98941133 0.31369648 0.53168575
#> [19] 0.65535333 0.07456167 0.64285361 0.27511977 0.81004527 0.66217860
#> [25] 0.92540088 0.50633174 0.49355193 0.90714596 0.46009649 0.03976855
#> [ reached getOption("max.print") -- omitted 199970 entries ]
simPIC allows you to estimate many of it’s parameters from a
SingleCellExperiment object containing counts or a counts matrix using the
simPICestimate
function.
# Get the counts from test data
count <- readRDS(system.file("extdata", "test.rds", package = "simPIC"))
# Check that counts is a dgCMatrix
class(count)
#> [1] "dgCMatrix"
#> attr(,"package")
#> [1] "Matrix"
typeof(count)
#> [1] "S4"
# Check the dimensions, each row is a peak, each column is a cell
dim(count)
#> [1] 5000 700
# Show the first few entries
count[1:5, 1:5]
#> 5 x 5 sparse Matrix of class "dgCMatrix"
#> TTGCCCACACTCAAGT-1 TATGTGGGTCCAAGAG-1
#> chr22-29837913-29838812 1 2
#> chr12-90076794-90077665 . .
#> chr11-63574191-63575102 . .
#> chr12-54977628-54978479 . .
#> chr1-221715272-221716182 . .
#> CGATGATCAAGTAACA-1 GTCGTAATCAGGAAGC-1
#> chr22-29837913-29838812 . .
#> chr12-90076794-90077665 . .
#> chr11-63574191-63575102 . .
#> chr12-54977628-54978479 . .
#> chr1-221715272-221716182 . .
#> TACTAGGCACAAACAA-1
#> chr22-29837913-29838812 .
#> chr12-90076794-90077665 .
#> chr11-63574191-63575102 .
#> chr12-54977628-54978479 .
#> chr1-221715272-221716182 1
new <- newsimPICcount()
new <- simPICestimate(count)
#> simPIC is:
#> estimating library size parameters...
#> estimating sparsity...
#> estimating peak mean parameters...
#> using weibull distribution for estimating peak mean
## estimating using gamma distribution
## new <- simPICestimate(count, pm.distr = "gamma")
Here we estimated parameters from a counts matrix using default parameters. The estimation process has the following steps:
For more details of the estimation procedures see ?simPICestimate
.
Once we have a set of parameters we are happy with we can use simPICsimulate
to simulate counts. To make adjustments to the parameters provide them as
additional arguments. Alternatively if we don’t supply any parameters defaults
will be used:
sim <- simPICsimulate(new, nCells = 1000)
#> simPIC is:
#> updating parameters...
#> setting up SingleCellExperiment object...
#> Simulating library size...
#> Simulating peak mean...
#> using weibull distribution for simulating peak mean
#> Simulating true counts...
#> Done!!
sim
#> class: SingleCellExperiment
#> dim: 5000 1000
#> metadata(1): Params
#> assays(1): counts
#> rownames(5000): Peak1 Peak2 ... Peak4999 Peak5000
#> rowData names(2): Peak exp.peakmean
#> colnames(1000): Cell1 Cell2 ... Cell999 Cell1000
#> colData names(2): Cell exp.libsize
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
## simulating using gamma distribution
## sim <- simPICsimulate(new, nCells =1000, pm.distr = "gamma")
Looking at the output of simPICsimulate
we can see that sim
is
SingleCellExperiment
object with 5000 features (peaks) and
1000 samples (cells). The main part of this object is a features by
samples matrix containing the simulated counts (accessed using counts
).
Additionally a SingleCellExperiment
contains information about each cell
(accessed using colData
) and each peak (accessed using rowData
). simPIC uses
these slots, as well as assays
, to store information about the intermediate
values of the simulation.
# Access the counts
counts(sim)[1:5, 1:5]
#> Cell1 Cell2 Cell3 Cell4 Cell5
#> Peak1 0 0 0 0 0
#> Peak2 0 0 1 1 0
#> Peak3 1 0 1 1 0
#> Peak4 0 0 0 0 0
#> Peak5 0 0 0 0 0
# Information about peaks
head(rowData(sim))
#> DataFrame with 6 rows and 2 columns
#> Peak exp.peakmean
#> <character> <numeric>
#> Peak1 Peak1 6.75102e-06
#> Peak2 Peak2 2.68961e-04
#> Peak3 Peak3 1.02505e-04
#> Peak4 Peak4 3.80896e-06
#> Peak5 Peak5 4.83026e-05
#> Peak6 Peak6 4.76012e-05
# Information about cells
head(colData(sim))
#> DataFrame with 6 rows and 2 columns
#> Cell exp.libsize
#> <character> <numeric>
#> Cell1 Cell1 1467
#> Cell2 Cell2 575
#> Cell3 Cell3 1122
#> Cell4 Cell4 1105
#> Cell5 Cell5 925
#> Cell6 Cell6 887
# Peak by cell matrices
names(assays(sim))
#> [1] "counts"
For more details about the SingleCellExperiment object refer to thevignette.
The simPICsimulate
function provides additional simulation details:
* Cell information (colData
)
* Cell
- Unique cell identifier.
* exp.libsize
- The expected library size for that cell. (not obtained from
the final simulated counts)
* Peak information (rowData
)
* Peak
- Unique peak identifier.
* exp.peakmean
- The expected peak means for that peak. (not obtained from
the final simulated counts)
* Peak by cell information (assays
)
* counts
- The final simulated counts.
For more information on the simulation see ?simPICsimulate
.
simPIC provides a function simPICcompare
that aims to make these comparisons
easier. This function takes a list of SingleCellExperiment
objects, combines
the datasets and produces comparison plots. Let’s make two small simulations and
see how they compare.
sim1 <- simPICsimulate(nPeaks = 20000, nCells = 1000)
#> simPIC is:
#> updating parameters...
#> setting up SingleCellExperiment object...
#> Simulating library size...
#> Simulating peak mean...
#> using weibull distribution for simulating peak mean
#> Simulating true counts...
#> Done!!
sim2 <- simPICsimulate(nPeaks = 20000, nCells = 1000)
#> simPIC is:
#> updating parameters...
#> setting up SingleCellExperiment object...
#> Simulating library size...
#> Simulating peak mean...
#> using weibull distribution for simulating peak mean
#> Simulating true counts...
#> Done!!
comparison <- simPICcompare(list(real = sim1, simPIC = sim2))
names(comparison)
#> [1] "RowData" "ColData" "Plots"
names(comparison$Plots)
#> [1] "Means" "Variances" "MeanVar" "LibrarySizes" "ZerosPeak"
#> [6] "ZerosCell" "MeanZeros" "PeakMeanNzp"
The returned list has three items. The first two are the combined datasets by
peak (RowData
) and by cell (ColData
) and the third contains some comparison
plots (produced using ggplot2
), for example a plot of the distribution of
means:
comparison$Plots$Means
These are only a few of the plots you might want to consider but it should be easy to make more using the returned data.
If you use simPIC in your work please cite our paper:
citation("simPIC")
#> To cite package 'simPIC' in publications use:
#>
#> Chugh S, McCarthy D, Shim H (2024). _simPIC: simPIC: flexible
#> simulation of paired-insertion counts for single-cell ATAC-sequencing
#> data_. R package version 1.1.0,
#> <https://github.com/sagrikachugh/simPIC>.
#>
#> A BibTeX entry for LaTeX users is
#>
#> @Manual{,
#> title = {simPIC: simPIC: flexible simulation of paired-insertion counts for single-cell
#> ATAC-sequencing data},
#> author = {Sagrika Chugh and Davis McCarthy and Heejung Shim},
#> year = {2024},
#> note = {R package version 1.1.0},
#> url = {https://github.com/sagrikachugh/simPIC},
#> }
sessionInfo()
#> R version 4.4.0 RC (2024-04-16 r86468)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] simPIC_1.1.0 SingleCellExperiment_1.27.0
#> [3] SummarizedExperiment_1.35.0 Biobase_2.65.0
#> [5] GenomicRanges_1.57.0 GenomeInfoDb_1.41.0
#> [7] IRanges_2.39.0 S4Vectors_0.43.0
#> [9] BiocGenerics_0.51.0 MatrixGenerics_1.17.0
#> [11] matrixStats_1.3.0 BiocStyle_2.33.0
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.1 dplyr_1.1.4 farver_2.1.1
#> [4] fastmap_1.1.1 digest_0.6.35 lifecycle_1.0.4
#> [7] survival_3.6-4 magrittr_2.0.3 compiler_4.4.0
#> [10] rlang_1.1.3 sass_0.4.9 tools_4.4.0
#> [13] utf8_1.2.4 yaml_2.3.8 knitr_1.46
#> [16] S4Arrays_1.5.0 labeling_0.4.3 fitdistrplus_1.1-11
#> [19] DelayedArray_0.31.0 abind_1.4-5 BiocParallel_1.39.0
#> [22] withr_3.0.0 grid_4.4.0 fansi_1.0.6
#> [25] beachmat_2.21.0 colorspace_2.1-0 ggplot2_3.5.1
#> [28] scales_1.3.0 MASS_7.3-60.2 tinytex_0.50
#> [ reached getOption("max.print") -- omitted 42 entries ]