Task 1: provide a unified representation of single-cell data
Challenges:
- Hundreds of scRNA-seq software tools
- Most R and Bioconductor packages define their own class
- Some extend SummarizedExperiment, some ExpressionSet
- Most packages don’t fully exploit the potential of SummarizedExperiment (e.g., an assay does not have to be an ordinary matrix; any matrix-like object works)
Proposed solutions:
- Create a class for developers to extend: SingleCellExperiment
Useful Bioconductor packages and other resources:
Task 2: scale-up of existing tools / implementation of tools to handle large-scale datasets
Challenges:
- Existing tools scale only to thousands of cells.
- 10x Genomics released a dataset of 1.3 million cells!
- Main problem: data of this size do not fit in memory!
Proposed solutions:
- HDF5 files + "chunk operations" (process one block of the data at a time)
- Simple algorithms + approximate, scalable methods
- Provide API to perform common operations independent of data representation (in memory vs. on disk)
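The combination of chunked processing and a representation-independent API can be sketched as follows. This is only an illustration in Python using the standard library, not the Bioconductor implementation: a flat binary file stands in for an HDF5 file, and the names `InMemoryCounts`, `OnDiskCounts`, `iter_cell_chunks`, and `gene_totals` are hypothetical. The point is that the algorithm sees only chunks of cells, so the same code runs whether the counts live in memory or on disk.

```python
import os
import struct
import tempfile


class InMemoryCounts:
    """Counts held in memory as a list of cells, each a list of per-gene counts."""

    def __init__(self, cells):
        self.cells = [list(c) for c in cells]
        self.n_genes = len(self.cells[0]) if self.cells else 0

    def iter_cell_chunks(self, chunk_size):
        for start in range(0, len(self.cells), chunk_size):
            yield self.cells[start:start + chunk_size]


class OnDiskCounts:
    """Counts in a flat binary file (a stand-in for HDF5), one record per cell."""

    def __init__(self, path, n_genes):
        self.path = path
        self.n_genes = n_genes

    @classmethod
    def write(cls, path, cells):
        cells = [list(c) for c in cells]
        n_genes = len(cells[0])
        with open(path, "wb") as fh:
            for cell in cells:
                # Each cell is stored as n_genes little-endian unsigned 32-bit ints.
                fh.write(struct.pack(f"<{n_genes}I", *cell))
        return cls(path, n_genes)

    def iter_cell_chunks(self, chunk_size):
        record = struct.Struct(f"<{self.n_genes}I")
        with open(self.path, "rb") as fh:
            while True:
                # Only chunk_size cells are read into memory at a time.
                block = fh.read(record.size * chunk_size)
                if not block:
                    break
                yield [list(row) for row in record.iter_unpack(block)]


def gene_totals(counts, chunk_size=2):
    """Per-gene sums, written against the chunk interface only."""
    totals = [0] * counts.n_genes
    for chunk in counts.iter_cell_chunks(chunk_size):
        for cell in chunk:
            for gene, value in enumerate(cell):
                totals[gene] += value
    return totals


if __name__ == "__main__":
    cells = [[1, 0, 3], [0, 2, 1], [4, 0, 0], [1, 1, 1], [0, 5, 2]]
    with tempfile.TemporaryDirectory() as tmp:
        on_disk = OnDiskCounts.write(os.path.join(tmp, "counts.bin"), cells)
        print(gene_totals(InMemoryCounts(cells)), gene_totals(on_disk))
        # prints: [6, 8, 7] [6, 8, 7]
```

Because `gene_totals` never asks where the data live, swapping the backend requires no change to the algorithm, which is the property the API point above is after.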
Useful Bioconductor packages and other resources:
Interested in contributing? Join the slack channel:
https://community-bioc.slack.com
Discussion points
- Benchmark (canonical datasets)
- Splatter (simulations of scRNA-seq)
- What to do next?
- BigDataAlgorithms: define scope, what functionalities we want
- Prior art in astronomy, etc?
- Visualization?
- Multi-assay?
- People are running single-cell assays that generate multiple types of data (e.g., RNA expression and methylation) from each single cell.
- Each assay can be stored in its own SingleCellExperiment, and these can then be put inside a MultiAssayExperiment to link up the row and column metadata.
- Multiple samples: a list of SingleCellExperiments vs. one giant joined SingleCellExperiment. Can we learn from flowSet?
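The multi-assay idea above (one container per assay, linked through shared cell identifiers and metadata) can be illustrated with a small Python sketch. This is not the MultiAssayExperiment API; the functions `shared_cells` and `subset_to_shared` and the toy data are hypothetical, and each assay is assumed to be a mapping from cell ID to that cell's measurements.

```python
def shared_cells(assays):
    """Cell IDs present in every assay, in the order of the first assay."""
    names = list(assays)
    if not names:
        return []
    common = set(assays[names[0]])
    for name in names[1:]:
        common &= set(assays[name])
    return [cell for cell in assays[names[0]] if cell in common]


def subset_to_shared(assays, cell_metadata):
    """Restrict every assay and the per-cell metadata to the shared cells."""
    keep = shared_cells(assays)
    linked = {
        name: {cell: per_cell[cell] for cell in keep}
        for name, per_cell in assays.items()
    }
    metadata = {cell: cell_metadata[cell] for cell in keep}
    return linked, metadata


if __name__ == "__main__":
    # Toy data: two assays measured on overlapping sets of cells.
    assays = {
        "rna": {"cell1": [5, 0], "cell2": [1, 3], "cell3": [0, 2]},
        "methylation": {"cell2": [0.1], "cell3": [0.9], "cell4": [0.5]},
    }
    cell_metadata = {
        "cell1": {"sample": "A"},
        "cell2": {"sample": "A"},
        "cell3": {"sample": "B"},
        "cell4": {"sample": "B"},
    }
    linked, metadata = subset_to_shared(assays, cell_metadata)
    print(sorted(metadata))  # prints: ['cell2', 'cell3']
```

Keeping the assays separate and linking them by cell ID leaves each assay free to have its own features and its own subset of cells, whereas one joined object would force a common shape on all of them.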