Single cell data is gaining sophistication - Cells can be measured in multiple modalities including gene expression, chromatin accessibility, cell surface markers and protein expression. These orthogonal measures of the same or matched cells enable a holistic construction of the cell state. However it has been challenging to share multiomic data, especially in an integrated format that consolidates the multiple layers of measurement. The MultiAssayExperiment
container (implemented in the MultiAssayExperiment package) provides a framework to package the various modalities into a single object on a per cell basis.
The scMultiome package is a collection of public single cell multiome data sets pre-processed and packaged into MultiAssayExperiment
objects for downstream analysis. It also provides basic functions to save the MultiAssayExperiment
as .hdf5
files so that users need to only load the desired modalities into memory.
The ‘scMultiome’ package is similar to SingleCellMultiModal in terms of providing multimodal data as an ExperimentHub package. One key difference is that we included additional functionalities to load only the desired modalities, and allow users to save their MAE
objects as .hdf5
s. The selective loading of experiments is desirable because multimodal data can be large: A typical 10x scMultiome consists of 100-200K atac-seq peaks on top of the typical 30-50K genes from rna-seq. Finally, scMultiome provides a list of transcription factor and chromatin co-activator binding sites compiled from ENCODE and ChIP-Atlas ChIP-seq data as GRangesList
objects. ChIP-seq data complements transcription factor motif information by potentially distinguishing related family members and providing occupancy information of chromatin factors that do not directly bind to DNA.
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("scMultiome")
View currently available data sets and the names of their accessor functions. Help pages for particular accessors contain more information on their data sets.
library(scMultiome)
listDatasets()
Call | Author | Title | Species | Lineage | CellNumber | Multiome | DiskSize | Version |
---|---|---|---|---|---|---|---|---|
colonHealthy() | Zhang | Healthy colon | Homo sapiens | Colon | 59231 | unpaired | 6.7 GB | 2022-09-21 |
hematopoiesis() | Granje | scATAC-seq and unpaired scRNA-seq of hematopoetic cells | Homo sapiens | blood | 10250 | unpaired | 1.0 GB | 2019-10-29 |
prostateENZ() | Taavitsainen | LNCaP Cells Treated with Enzalutamide | Homo sapiens | Prostate | 15522 | unpaired | 2.9 GB | 2022-09-06 |
reprogramSeq() | Xie | Reprogram-seq of LNCaP cells | Homo sapiens | Prostate | 3903 | paired | 430 MB | 2022-10-04 |
tfBinding(“hg19”, “atlas”) | ChipAtlas, ENCODE | TF Binding hg19 ChIPAtlas+ENCODE | Homo sapiens | All | Bulk | n/a | 275 MB | 2022-09-20 |
tfBinding(“hg19”, “atlas.sample”) | ChipAtlas | TF Binding hg19 ChIPAtlas by sample | Homo sapiens | All | Bulk | n/a | 20 GB | 2024-04-06 |
tfBinding(“hg19”, “atlas.tissue”) | ChipAtlas | TF Binding hg19 ChIPAtlas by tissue | Homo sapiens | All | Bulk | n/a | 13 GB | 2024-04-06 |
tfBinding(“hg19”, “encode.sample”) | ENCODE | TF Binding hg19 ENCODE by sample | Homo sapiens | All | Bulk | n/a | 1013 MB | 2024-04-06 |
tfBinding(“hg38”, “atlas”) | ChipAtlas, ENCODE | TF Binding hg38 ChIPAtlas+ENCODE | Homo sapiens | All | Bulk | n/a | 280 MB | 2022-09-20 |
tfBinding(“hg38”, “atlas.sample”) | ChipAtlas | TF Binding hg38 ChIPAtlas by sample | Homo sapiens | All | Bulk | n/a | 20 GB | 2024-04-06 |
tfBinding(“hg38”, “atlas.tissue”) | ChipAtlas | TF Binding hg38 ChIPAtlas by tissue | Homo sapiens | All | Bulk | n/a | 13 GB | 2024-04-06 |
tfBinding(“hg38”, “encode.tissue”) | ENCODE | TF Binding hg38 ENCODE by sample | Homo sapiens | All | Bulk | n/a | 1007 MB | 2024-04-06 |
tfBinding(“mm10”, “atlas”) | ChipAtlas, ENCODE | TF Binding mm10 ChIPAtlas+ENCODE | Mus musculus | All | Bulk | n/a | 160 MB | 2022-09-20 |
tfBinding(“mm10”, “atlas.sample”) | ChipAtlas | TF Binding mm10 ChIPAtlas by sample | Mus musculus | All | Bulk | n/a | 12 GB | 2024-04-06 |
tfBinding(“mm10”, “atlas.tissue”) | ChipAtlas | TF Binding mm10 ChIPAtlas by tissue | Mus musculus | All | Bulk | n/a | 7.5 GB | 2024-04-06 |
tfBinding(“mm10”, “encode.sample”) | ENCODE | TF Binding mm10 ENCODE by sample | Mus musculus | All | Bulk | n/a | 47 MB | 2024-04-06 |
tfMotifs(‘mouse’) | Greenleaf Lab | TF motifs human | Mus musculus | n/a | n/a | n/a | 256 KB | 2 |
Access a data set by calling its accessor function:
prostateENZ(metadata = TRUE)
#> ExperimentHub with 1 record
#> # snapshotDate(): 2024-04-29
#> # names(): EH7792
#> # package(): scMultiome
#> # $dataprovider: Tampere University
#> # $species: Homo sapiens
#> # $rdataclass: MultiAssayExperiment
#> # $rdatadateadded: 2022-12-13
#> # $title: LNCaP Cells Treated with Enzalutamide
#> # $description: Multiome data (scATAC and scRNAseq) for enzalutamide-treated...
#> # $taxonomyid: 9606
#> # $genome: hg38
#> # $sourcetype: tar.gz
#> # $sourceurl: https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE168667&forma...
#> # $sourcesize: NA
#> # $tags: c("CellCulture", "ExpressionData", "GEO", "Homo_sapiens_Data",
#> # "SequencingData", "SingleCellData", "Tissue")
#> # retrieve record with 'object[["EH7792"]]'
See the help files for accessor functions for possible extra options, e.g. ?prostateENZ
.
In addition to single cell multiomic data, the package also provides binding sites of transcription factors, obtained from bulk ChIP-seq studies (merging of ChIP-Atlas and ENCODE).
If multiple ChIP-seq files are available for the same transcription factor, the peaks are merged to create a union set of peaks. Currently three genomic builds genomes are provided: hg38, hg19, and mm10.
The ChIP-seq data is packaged into individual RDS files and they are accessed with a common accessor function, tfBinding
, specifying the genome and source.
tfBinding(genome = "hg38", source ="atlas", metadata = TRUE)
#> ExperimentHub with 1 record
#> # snapshotDate(): 2024-04-29
#> # names(): EH7814
#> # package(): scMultiome
#> # $dataprovider: Genentech
#> # $species: Homo sapiens
#> # $rdataclass: GRangesList
#> # $rdatadateadded: 2023-01-13
#> # $title: TF Binding Info hg38 (ChIP-Atlas and ENCODE)
#> # $description: Combined transcription factor ChIP-seq data from ChIP-Atlas ...
#> # $taxonomyid: 9606
#> # $genome: hg38
#> # $sourcetype: BED
#> # $sourceurl: https://github.com/inutano/chip-atlas/
#> # $sourcesize: NA
#> # $tags: c("CellCulture", "ExpressionData", "GEO", "Homo_sapiens_Data",
#> # "SequencingData", "SingleCellData", "Tissue")
#> # retrieve record with 'object[["EH7814"]]'
If you want to contribute your publicly available multiome data set, please read this section and contact the package maintainer, Xiaosai Yao xiaosai.yao@gmail.com.
Once the data has been preprocessed and converted into MultiAssayExperiment
, it can be saved in a hdf5 file using saveMAE
. You also need to provide metadata, documentation, and the accessor function that will retrieve the data from ExperimentHub
.
Adding data involves updating the package and as such, it must be done in “developer mode”. The developer mode allows access to additional tools such as documentation templates.
To work in developer mode, you must first clone the package repository with git
. Create a branch from master
and work on that.
Start an R session in the package directory (e.g. by opening the RStudio project in RStudio) and load all the functions. This is necessary for the R engine to temporarily identify your working directory as the package installation directory, and to expose the internal functions that you will be using.
MultiAssayExperiment
is saved in the hdf5 format as it splits the MAEs into individual experiments so that you can choose to load selected experiments. Upon loading, selected experiments are reassembled and wrapped into an MAE object. Assays are represented by DelayedMatrix
objects to save memory.and saved as a hdf5 file.
Currently only MultiAssayExperiment
objects are supported. Experiments must be objects that inherit from SummarizedExperiment
and will usually be SingleCellExperiment
, hence full support is provided for the latter and their slots (reducedDims
and altExps
).
# construct a dummy data set
mae <- dummyMAE()
mae
#> A MultiAssayExperiment object of 2 listed
#> experiments with user-defined names and respective classes.
#> Containing an ExperimentList class object of length 2:
#> [1] EXP1: SingleCellExperiment with 20 rows and 10 columns
#> [2] EXP2: SingleCellExperiment with 20 rows and 10 columns
#> Functionality:
#> experiments() - obtain the ExperimentList instance
#> colData() - the primary/phenotype DataFrame
#> sampleMap() - the sample coordination DataFrame
#> `$`, `[`, `[[` - extract colData columns, subset, or experiment
#> *Format() - convert into a long or wide DataFrame
#> assays() - convert ExperimentList to a SimpleList of matrices
#> exportClass() - save data to flat files
# name the file to save to
fileName <- tempfile(fileext = ".h5")
# save data set
saveMAE(mae, fileName)
#> creating h5 file
#> saving experiment EXP1
#> dissassembling experiment
#> writing class
#> writing assays
#> ... counts
#> ... logcounts
#> writing properties
#> ...colData
#> ...colnames
#> ...rowData
#> ...rownames
#> ... rowRanges
#> ... metadata
#> writing reducedDims
#> ... PCA
#> ... tSNE
#> writing altExps
#> ... Hash
#> dissassembling experiment
#> writing class
#> writing assays
#> ... counts
#> writing properties
#> ...colnames
#> ... metadata
#> ... Protein
#> dissassembling experiment
#> writing class
#> writing assays
#> ... counts
#> writing properties
#> ...colnames
#> ... metadata
#> ... done
#> saving experiment EXP2
#> dissassembling experiment
#> writing class
#> writing assays
#> ... counts
#> ... logcounts
#> writing properties
#> ...colData
#> ...colnames
#> ...rowData
#> ...rownames
#> ... rowRanges
#> ... metadata
#> writing reducedDims
#> ... PCA
#> ... tSNE
#> writing altExps
#> ... Hash
#> dissassembling experiment
#> writing class
#> writing assays
#> ... counts
#> writing properties
#> ...colnames
#> ... metadata
#> ... Protein
#> dissassembling experiment
#> writing class
#> writing assays
#> ... counts
#> writing properties
#> ...colnames
#> ... metadata
#> ... done
#> $EXP1
#> [1] TRUE
#>
#> $EXP2
#> [1] TRUE
You can use testFile
to validate that your data set can be reconstructed.
testFile(fileName)
#> TESTING loading MAE from file: /tmp/RtmpMbeDoI/file2996c772b9ebe7.h5
#> LOADING ALL EXPERIMENTS: EXP1, EXP2
#> loading dataset stored in file /tmp/RtmpMbeDoI/file2996c772b9ebe7.h5
#> loading experiments
#> loading EXP1
#> loading properties
#> ... colnames
#> ... colData
#> ... rownames
#> ... rowData
#> ... rowRanges
#> ... metadata
#> loading assays
#> ... counts
#> ... logcounts
#> loading reducedDims
#> ... PCA
#> ... tSNE
#> loading altExps
#> ... Hash
#> loading EXP1/altExps/Hash
#> loading properties
#> ... colnames
#> ... metadata
#> loading assays
#> ... counts
#> rebuilding experiment
#> ...colnames
#> ...metadata
#> ... Protein
#> loading EXP1/altExps/Protein
#> loading properties
#> ... colnames
#> ... metadata
#> loading assays
#> ... counts
#> rebuilding experiment
#> ...colnames
#> ...metadata
#> rebuilding experiment
#> ...colData
#> ...colnames
#> ...rowData
#> ...rowRanges
#> ...rownames
#> ...metadata
#> ...reducedDims
#> ...altExps
#> loading EXP2
#> loading properties
#> ... colnames
#> ... colData
#> ... rownames
#> ... rowData
#> ... rowRanges
#> ... metadata
#> loading assays
#> ... counts
#> ... logcounts
#> loading reducedDims
#> ... PCA
#> ... tSNE
#> loading altExps
#> ... Hash
#> loading EXP2/altExps/Hash
#> loading properties
#> ... colnames
#> ... metadata
#> loading assays
#> ... counts
#> rebuilding experiment
#> ...colnames
#> ...metadata
#> ... Protein
#> loading EXP2/altExps/Protein
#> loading properties
#> ... colnames
#> ... metadata
#> loading assays
#> ... counts
#> rebuilding experiment
#> ...colnames
#> ...metadata
#> rebuilding experiment
#> ...colData
#> ...colnames
#> ...rowData
#> ...rowRanges
#> ...rownames
#> ...metadata
#> ...reducedDims
#> ...altExps
#> building MultiAssayExperiment
#> A MultiAssayExperiment object of 2 listed
#> experiments with user-defined names and respective classes.
#> Containing an ExperimentList class object of length 2:
#> [1] EXP1: SingleCellExperiment with 20 rows and 10 columns
#> [2] EXP2: SingleCellExperiment with 20 rows and 10 columns
#> Functionality:
#> experiments() - obtain the ExperimentList instance
#> colData() - the primary/phenotype DataFrame
#> sampleMap() - the sample coordination DataFrame
#> `$`, `[`, `[[` - extract colData columns, subset, or experiment
#> *Format() - convert into a long or wide DataFrame
#> assays() - convert ExperimentList to a SimpleList of matrices
#> exportClass() - save data to flat files
For a detailed explanation of the process see ?saveMAE
.
To add metadata, we first create a R script to store your data set’s metadata. The metadata script provides key information such as data resource, species, genomic build and version number. This R script will be named inst/scripts/make-metadata-<DATASET_NAME>.R
.
makeMakeMetadata("dataset")
Metadata must be a 1-row data frame with specific columns and values must be character strings (some fields allow character vectors). See inst/scripts/make-metadata.R
for more information and inst/scripts/make-metadata-prostateENZ.R
for an example.
The file also stores metadata that will be returned by listDatasets
. Likewise, this must be a 1-row data frame and values must be character strings.
Once your make-metadata
file is ready, build the metadata
source(system.file("scripts", "make-metadata.R", package = "scMultiome"))
This must run without errors. If successful,
all the datasets will be captured in inst/extdata/manifest.csv
and inst/extdata/metadata.csv
Subsequently, validate metadata
ExperimentHubData::makeExperimentHubMetadata(dirname(system.file(package = "scMultiome")))
This call must also run without errors. It will return an ExperimentHub
object that will display your metadata in the form that the end users will see it.
Every data set needs an additional documentation to describe how the data was obtained. This doesn’t have to be a working script, just a report. Pseudocode is acceptable. Note that code evaluation has been disabled so that you can copy your actual code and the lengthy analysis does not run again.
Create an Rmarkdown file called inst/scripts/make-data-<DATASET_NAME>.Rmd
.
makeMakeData("dataset")
Every data set is accessed by its own accessor function, which provides access and help for your data set.
For every data set, we create a R
file that defines the function to retrieve the data set and provide important documentation. The package framework is constructed such that accessor functions are extremely simple and you can basically copy the original one (prostateENZ
) and most of its documentation.
Create an R file called <DATASET_NAME>.R
makeR("dataset")
This file will quote the accompanying Rmd file created above. This way the R file itself is more concise and easier to edit. Adjust the file accordingly:
make-metadata-<>.R
file.experiments
and metadata
.MultiAssayExperiment
.experiments
argument reflects the experiment names in your data set.Once the files are ready, build the documentation, and ?scMultiome
and ?<DATASET_NAME>
to review it.
# build documentation
devtools::document()
# view package man page
?scMultiome
# view your data set man page
help("dataset")
Edit the DESCRIPTION
file to update package metadata: add yourself as package author and package version.
desc::desc_add_author("<GIVEN_NAME>", "<FAMILY_NAME>", "<EMAIL>", role = "aut")
desc::desc_bump_version("minor")
Push your changes to git and create a merge request. Once the request is approved and merged to the master
branch, the package maintainer will take over and move on to the next stage.
The final step of the process is to place the data set in Bioconductor’s data bucket. This can only be done with Bioconductor’s knowledge and blessing.
Bioconductor will be notified that a scMultiome
update is coming by email at hubs@bioconductor.org
. They will issue a temporary SAS token for the Bioconductor data bucket. The data set will be placed in the Bioconductor staging directory with uploadFile
and Bioconductor will be notified again that the upload is ready. They will receive a link to the package repository, update metadata in ExperimentHub
and finalize the process.
uploadFile(file = fileName, sasToken = "<SAS_TOKEN>")
Consult this vignette for current Bioconductor requirements.
If you found any of this vignette or the process confusing, we would welcome feedback and gladly add more clarifications. Please email the package maintainer, Xiaosai Yao xiaosai.yao@gmail.com.
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] scMultiome_1.4.1 SingleCellExperiment_1.26.0
#> [3] MultiAssayExperiment_1.30.3 SummarizedExperiment_1.34.0
#> [5] Biobase_2.64.0 GenomicRanges_1.56.1
#> [7] GenomeInfoDb_1.40.1 IRanges_2.38.1
#> [9] S4Vectors_0.42.1 MatrixGenerics_1.16.0
#> [11] matrixStats_1.3.0 ExperimentHub_2.12.0
#> [13] AnnotationHub_3.12.0 BiocFileCache_2.12.0
#> [15] dbplyr_2.5.0 BiocGenerics_0.50.0
#> [17] BiocStyle_2.32.1
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.1 dplyr_1.1.4 blob_1.2.4
#> [4] filelock_1.0.3 Biostrings_2.72.1 fastmap_1.2.0
#> [7] digest_0.6.36 mime_0.12 lifecycle_1.0.4
#> [10] KEGGREST_1.44.1 RSQLite_2.3.7 magrittr_2.0.3
#> [13] compiler_4.4.1 rlang_1.1.4 sass_0.4.9
#> [16] tools_4.4.1 utf8_1.2.4 yaml_2.3.9
#> [19] knitr_1.48 S4Arrays_1.4.1 bit_4.0.5
#> [22] curl_5.2.1 DelayedArray_0.30.1 abind_1.4-5
#> [25] HDF5Array_1.32.0 withr_3.0.0 purrr_1.0.2
#> [28] desc_1.4.3 grid_4.4.1 fansi_1.0.6
#> [31] Rhdf5lib_1.26.0 cli_3.6.3 rmarkdown_2.27
#> [34] crayon_1.5.3 generics_0.1.3 httr_1.4.7
#> [37] DBI_1.2.3 cachem_1.1.0 rhdf5_2.48.0
#> [40] zlibbioc_1.50.0 AnnotationDbi_1.66.0 BiocManager_1.30.23
#> [43] XVector_0.44.0 vctrs_0.6.5 Matrix_1.7-0
#> [46] jsonlite_1.8.8 bookdown_0.40 bit64_4.0.5
#> [49] jquerylib_0.1.4 glue_1.7.0 BiocVersion_3.19.1
#> [52] UCSC.utils_1.0.0 tibble_3.2.1 pillar_1.9.0
#> [55] rappdirs_0.3.3 htmltools_0.5.8.1 rhdf5filters_1.16.0
#> [58] GenomeInfoDbData_1.2.12 R6_2.5.1 evaluate_0.24.0
#> [61] lattice_0.22-6 png_0.1-8 backports_1.5.0
#> [64] memoise_2.0.1 bslib_0.7.0 SparseArray_1.4.8
#> [67] checkmate_2.3.1 xfun_0.45 pkgconfig_2.0.3