library(Seurat)
#> Loading required package: SeuratObject
#> Loading required package: sp
#> 'SeuratObject' was built with package 'Matrix' 1.7.0 but the current
#> version is 1.7.1; it is recomended that you reinstall 'SeuratObject' as
#> the ABI for 'Matrix' may have changed
#>
#> Attaching package: 'SeuratObject'
#> The following objects are masked from 'package:base':
#>
#> intersect, t
library(curl)
#> Using libcurl 8.5.0 with OpenSSL/3.0.13
library(SimBu)
This chapter covers the different input and output options of the package in detail.
The input for your simulations is always a SummarizedExperiment object. You can create this object with different constructor functions, which are explained below. It is also possible to merge multiple dataset objects into one.
Sfaira is not covered in this vignette, but in “Public Data Integration”.
Using existing count matrices and annotations is already covered in the “Getting started” vignette; this section explains some additional details.
When generating a dataset from your own data, you need to provide the count_matrix parameter of dataset(); additionally, you can provide a TPM matrix with the tpm_matrix parameter. This will then lead to two simulations, one based on counts and one based on TPMs. In both matrices, genes are located in the rows and cells in the columns.
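As a minimal sketch of the expected layout (toy data with hypothetical names, not part of SimBu), the following builds a small genes-by-cells count matrix and a per-cell scaled matrix whose columns each sum to one million. Scaling raw counts this way only mimics the shape of TPM data; true TPM values also account for gene length.

```r
library(Matrix)

set.seed(1)
# toy matrix: 5 genes (rows) x 4 cells (columns); all names are hypothetical
toy_counts <- Matrix(matrix(rpois(20, 5) + 1, nrow = 5), sparse = TRUE)
rownames(toy_counts) <- paste0("gene-", 1:5)
colnames(toy_counts) <- paste0("cell-", 1:4)

# scale every column (cell) so that it sums to 1e6
toy_scaled <- t(1e6 * t(toy_counts) / colSums(toy_counts))

dim(toy_scaled)      # genes x cells
colSums(toy_scaled)  # every cell should sum to (numerically) 1e6
```

Checking the dimensions and column sums before building a dataset is a cheap way to catch a transposed or unnormalized matrix.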
In addition, an annotation table with the cell-type annotations is needed. It needs to consist of at least two columns, ID and cell_type, where ID has to be identical to the column names of the provided matrix/matrices. If not all cells appear in both the annotation and the matrix, the intersection of both is used to generate the dataset.
Here is some example data:
counts <- Matrix::Matrix(matrix(stats::rpois(3e5, 5), ncol = 300), sparse = TRUE)
tpm <- Matrix::Matrix(matrix(stats::rpois(3e5, 5), ncol = 300), sparse = TRUE)
tpm <- Matrix::t(1e6 * Matrix::t(tpm) / Matrix::colSums(tpm))
colnames(counts) <- paste0("cell-", rep(1:300))
colnames(tpm) <- paste0("cell-", rep(1:300))
rownames(counts) <- paste0("gene-", rep(1:1000))
rownames(tpm) <- paste0("gene-", rep(1:1000))
annotation <- data.frame(
  "ID" = paste0("cell-", rep(1:300)),
  "cell_type" = c(
    rep("T cells CD4", 50),
    rep("T cells CD8", 50),
    rep("Macrophages", 100),
    rep("NK cells", 10),
    rep("B cells", 70),
    rep("Monocytes", 20)
  ),
  row.names = paste0("cell-", rep(1:300))
)
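The intersection rule for annotation IDs and matrix column names can be previewed with base R before building a dataset. The following is only an illustrative sketch on toy objects with hypothetical names; it is not SimBu's internal code.

```r
# toy annotation and matrix with partially overlapping cells (hypothetical names)
toy_annotation <- data.frame(
  ID = c("cell-1", "cell-2", "cell-3"),
  cell_type = c("B cells", "B cells", "NK cells")
)
toy_matrix <- matrix(
  1:8,
  nrow = 2,
  dimnames = list(c("gene-1", "gene-2"), c("cell-2", "cell-3", "cell-4", "cell-5"))
)

# cells present in both the annotation and the matrix
shared_cells <- intersect(toy_annotation$ID, colnames(toy_matrix))

# cells that have no counterpart on the other side
setdiff(toy_annotation$ID, shared_cells)     # only in the annotation
setdiff(colnames(toy_matrix), shared_cells)  # only in the matrix

# subset both inputs to the shared cells, in the same order
toy_annotation_sub <- toy_annotation[match(shared_cells, toy_annotation$ID), ]
toy_matrix_sub <- toy_matrix[, shared_cells, drop = FALSE]
```

Running such a check on your own data shows how many cells would be lost to mismatched identifiers.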
seurat_obj <- Seurat::CreateSeuratObject(counts = counts, assay = "gene_expression", meta.data = annotation)
# store normalized matrix in the 'data' layer
SeuratObject::LayerData(seurat_obj, assay = "gene_expression", layer = "data") <- tpm
seurat_obj
#> An object of class Seurat
#> 1000 features across 300 samples within 1 assay
#> Active assay: gene_expression (1000 features, 0 variable features)
#> 2 layers present: counts, data
It is possible to use a Seurat object to build a dataset. Give the name of the layer containing the count data (counts_layer), the name of the column in the meta table with the unique cell IDs (cell_id_col) and the name of the column in the meta table with the cell type identifier (cell_type_col). Additionally, you may give the name of the layer containing TPM data (tpm_layer).
ds_seurat <- SimBu::dataset_seurat(
  seurat_obj = seurat_obj,
  counts_layer = "counts",
  cell_id_col = "ID",
  cell_type_col = "cell_type",
  tpm_layer = "data",
  name = "seurat_dataset"
)
#> Filtering genes...
#> Created dataset.
It is possible to use an h5ad file directly, a file format which stores AnnData objects. As h5ad files can store cell-specific information in the obs layer, no additional annotation input for SimBu is needed.
Note: if you want both counts and TPM data as input, you have to provide two files, and the cell annotation has to match between these two files. SimBu expects cells in the columns and genes/features in the rows of the input matrix, but this is not necessarily the case for AnnData objects (https://falexwolf.de/img/scanpy/anndata.svg). SimBu can therefore handle h5ad files with cells in either the obs or the var layer. If your cells are in obs, use cells_in_obs=TRUE, and FALSE otherwise. This will also automatically transpose the matrix.
To specify which columns in the cell annotation layer correspond to the cell identifiers and cell type labels, use the cell_id_col and cell_type_col parameters, respectively.
As this function uses the SimBu python environment to read the h5ad files and extract the data, it may take some extra time on first use, when the conda environment is initialized.
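To illustrate what the transposition amounts to, here is a small base R sketch on a toy matrix with hypothetical names. It assumes the cells are the observations, so the matrix arrives as cells by genes; it does not reproduce how SimBu reads h5ad files.

```r
# toy expression matrix laid out as cells x genes (hypothetical names),
# i.e. the cells are the observations
x_cells_by_genes <- matrix(
  1:6,
  nrow = 3,
  dimnames = list(paste0("cell-", 1:3), paste0("gene-", 1:2))
)

# transposing yields the genes x cells layout that the input matrix must have
x_genes_by_cells <- t(x_cells_by_genes)

dim(x_cells_by_genes)  # 3 cells, 2 genes
dim(x_genes_by_cells)  # 2 genes, 3 cells
```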
# example h5ad file, where cell type info is stored in `obs` layer
# h5 <- system.file("extdata", "anndata.h5ad", package = "SimBu")
# ds_h5ad <- SimBu::dataset_h5ad(
# h5ad_file_counts = h5,
# name = "h5ad_dataset",
# cell_id_col = 0, # this will use the rownames of the metadata as cell identifiers
# cell_type_col = "group", # this will use the 'group' column of the metadata as cell type info
# cells_in_obs = TRUE # in case your cell information is stored in the var layer, switch to FALSE
# )
You can merge multiple datasets with the dataset_merge function:
ds <- SimBu::dataset(
  annotation = annotation,
  count_matrix = counts,
  tpm_matrix = tpm,
  name = "test_dataset"
)
#> Filtering genes...
#> Created dataset.
ds_multiple <- SimBu::dataset_merge(
  dataset_list = list(ds_seurat, ds),
  name = "ds_multiple"
)
#> Filtering genes...
#> Created dataset.
The simulation object contains three named entries:
bulk: a SummarizedExperiment object with the pseudo-bulk dataset(s) stored in the assays. They can be accessed like this:
simulation <- SimBu::simulate_bulk(
  data = ds_multiple,
  scenario = "random",
  scaling_factor = "NONE",
  nsamples = 10, ncells = 100
)
#> Finished simulation.
dim(SummarizedExperiment::assays(simulation$bulk)[["bulk_counts"]])
#> [1] 1000 10
dim(SummarizedExperiment::assays(simulation$bulk)[["bulk_tpm"]])
#> [1] 1000 10
If only the count matrix was given to the dataset initially, only the bulk_counts assay is filled.
cell_fractions: a table where rows represent the simulated samples and columns represent the different simulated cell types. The entries in the table store the specific cell-type fraction per sample.
scaling_vector: a named list with the scaling value used for each cell from the single-cell dataset.
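The assays in the bulk entry can also be inspected with the generic SummarizedExperiment accessors. The sketch below uses a toy object with made-up values as a stand-in for simulation$bulk; only the assay names bulk_counts and bulk_tpm are taken from the text above.

```r
library(SummarizedExperiment)

# toy stand-in for simulation$bulk: 3 genes x 2 samples (hypothetical values)
toy_bulk <- SummarizedExperiment(
  assays = list(
    bulk_counts = matrix(1:6, nrow = 3),
    bulk_tpm = matrix(seq(0.5, 3, by = 0.5), nrow = 3)
  )
)

# list the assays that are present
assayNames(toy_bulk)

# two equivalent ways to extract a single assay
counts_a <- assays(toy_bulk)[["bulk_counts"]]
counts_b <- assay(toy_bulk, "bulk_counts")
identical(counts_a, counts_b)
```

Calling assayNames() on a real simulation result is a quick way to see which assays were filled.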
utils::sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] curl_5.2.3 Seurat_5.1.0 SeuratObject_5.0.2 sp_2.1-4
#> [5] SimBu_1.8.0
#>
#> loaded via a namespace (and not attached):
#> [1] RColorBrewer_1.1-3 jsonlite_1.8.9
#> [3] magrittr_2.0.3 spatstat.utils_3.1-0
#> [5] farver_2.1.2 rmarkdown_2.28
#> [7] zlibbioc_1.52.0 vctrs_0.6.5
#> [9] ROCR_1.0-11 spatstat.explore_3.3-3
#> [11] htmltools_0.5.8.1 S4Arrays_1.6.0
#> [13] SparseArray_1.6.0 sass_0.4.9
#> [15] sctransform_0.4.1 parallelly_1.38.0
#> [17] KernSmooth_2.23-24 bslib_0.8.0
#> [19] htmlwidgets_1.6.4 ica_1.0-3
#> [21] plyr_1.8.9 plotly_4.10.4
#> [23] zoo_1.8-12 cachem_1.1.0
#> [25] igraph_2.1.1 mime_0.12
#> [27] lifecycle_1.0.4 pkgconfig_2.0.3
#> [29] Matrix_1.7-1 R6_2.5.1
#> [31] fastmap_1.2.0 GenomeInfoDbData_1.2.13
#> [33] MatrixGenerics_1.18.0 fitdistrplus_1.2-1
#> [35] future_1.34.0 shiny_1.9.1
#> [37] digest_0.6.37 colorspace_2.1-1
#> [39] patchwork_1.3.0 S4Vectors_0.44.0
#> [41] tensor_1.5 RSpectra_0.16-2
#> [43] irlba_2.3.5.1 GenomicRanges_1.58.0
#> [45] labeling_0.4.3 progressr_0.15.0
#> [47] spatstat.sparse_3.1-0 fansi_1.0.6
#> [49] polyclip_1.10-7 httr_1.4.7
#> [51] abind_1.4-8 compiler_4.4.1
#> [53] withr_3.0.2 BiocParallel_1.40.0
#> [55] fastDummies_1.7.4 highr_0.11
#> [57] MASS_7.3-61 proxyC_0.4.1
#> [59] DelayedArray_0.32.0 tools_4.4.1
#> [61] lmtest_0.9-40 httpuv_1.6.15
#> [63] future.apply_1.11.3 goftest_1.2-3
#> [65] glue_1.8.0 nlme_3.1-166
#> [67] promises_1.3.0 grid_4.4.1
#> [69] Rtsne_0.17 cluster_2.1.6
#> [71] reshape2_1.4.4 generics_0.1.3
#> [73] spatstat.data_3.1-2 gtable_0.3.6
#> [75] tidyr_1.3.1 data.table_1.16.2
#> [77] utf8_1.2.4 XVector_0.46.0
#> [79] spatstat.geom_3.3-3 BiocGenerics_0.52.0
#> [81] RcppAnnoy_0.0.22 ggrepel_0.9.6
#> [83] RANN_2.6.2 pillar_1.9.0
#> [85] stringr_1.5.1 spam_2.11-0
#> [87] RcppHNSW_0.6.0 later_1.3.2
#> [89] splines_4.4.1 dplyr_1.1.4
#> [91] lattice_0.22-6 deldir_2.0-4
#> [93] survival_3.7-0 tidyselect_1.2.1
#> [95] miniUI_0.1.1.1 pbapply_1.7-2
#> [97] knitr_1.48 gridExtra_2.3
#> [99] IRanges_2.40.0 SummarizedExperiment_1.36.0
#> [101] scattermore_1.2 stats4_4.4.1
#> [103] xfun_0.48 Biobase_2.66.0
#> [105] matrixStats_1.4.1 stringi_1.8.4
#> [107] UCSC.utils_1.2.0 lazyeval_0.2.2
#> [109] yaml_2.3.10 evaluate_1.0.1
#> [111] codetools_0.2-20 tibble_3.2.1
#> [113] cli_3.6.3 uwot_0.2.2
#> [115] xtable_1.8-4 reticulate_1.39.0
#> [117] munsell_0.5.1 jquerylib_0.1.4
#> [119] Rcpp_1.0.13 GenomeInfoDb_1.42.0
#> [121] spatstat.random_3.3-2 globals_0.16.3
#> [123] png_0.1-8 spatstat.univar_3.0-1
#> [125] parallel_4.4.1 ggplot2_3.5.1
#> [127] dotCall64_1.2 sparseMatrixStats_1.18.0
#> [129] listenv_0.9.1 viridisLite_0.4.2
#> [131] scales_1.3.0 ggridges_0.5.6
#> [133] leiden_0.4.3.1 purrr_1.0.2
#> [135] crayon_1.5.3 rlang_1.1.4
#> [137] cowplot_1.1.3