Over decades, many departments across gRED and Roche have generated large amounts of drug response screening data using Genentech’s rich drug compounds inventory. While extensive labor and time has been invested to generate these data, they are not analyzed in a standardized manner for meaningful comparison. On one hand, large screens are performed across many cell lines and drugs in a semi-automated manner. On the other hand, small-scale studies, which focused on factors that contribute to sensitivity and resistance to certain therapies, are generally performed by each individual scientist with limited automation. These are complementary approaches but were rarely handled the same way. Commercial softwares are available for analyzing large datasets, whereas researchers for small-scale datasets often process data ad hoc through software like PRISM.
Here, we propose a suite of computational tools that enable the processing, archiving, and visualization of drug response data from any experiment, regardless of size or experimental design, thus ensuring reproducibility and implementation of the Findable, Accessible, Interoperable, and Reusable (F.A.I.R.) principles, with the goal of making this accessible to the public community.
For now we share a subset of the gDR suite components for pre-processing and processing the data.
gDR suite consists of a few packages that power our app and make it a comprehensive tool. All the packages under the gDR umbrella are stored in the gDR platform GitHub organization.
We are happy to share with you our packages for importing, processing and managing gDR data: - gDRimport - gDRcore - gDRutils - gDRtestData
The gDR data model is based on the SummarizedExperiment and BumpyMatrix. If readers are unfamiliar with these data models, we recommend first reading SummarizedExperiment vignettes, followed by the BumpyMatrix vignettes. The SummarizedExperiment data structure enables ease of subsetting within the SummarizedExperiment object, but also provides ease when trying to correlate drug response data with genomic data, as these data may jointly stored in a MultiAssayExperiment. The BumpyMatrix allows for storage of multi-dimensional data while retaining a matrix abstraction.
This data structure is the core data structure that all downstream processing functions as well as visualization tools operate off of.
The gDR suite was designed in a modular manner, such that a user can jump into the “standard” end-to-end gDR processing pipeline at several entry points as is suitable for his or her needs. The full pipeline involves:
manifest, template(s), raw data
|
| 1. Aggregating all raw data and metadata
| into a single long table.
|
V
single, long table
|
| 2. Transforming the long table into
| a SummarizedExperiment object with BumpyMatrix assays
| by specifying what columns belong on rows,
| columns, and nested.
|
V
SummarizedExperiment object
with raw and treated assays
|
| 3. Normalizing, averaging, and fitting data.
|
V
SummarizedExperiment object
with raw, treated, normalized,
averaged, and metric assays,
ready for use by downstream visualization
A user should be able to enter any part of this pipeline as long as they are able to create the intermediate object (i.e., the individual manifest, template, and raw data files, or a single, long table, or a SummarizedExperiment object with Bumpy assays).
The gDR suite ultimately requires a single, long merged table containing both raw data and metadata.
To support a common use case, we provide a convenience function that takes three objects: manifest, template(s), and raw data to create this single, long merged table for the user. The manifest contains metadata on the experimental design, template files specify the drugs and cell lines used, and the raw data output files obtained from a plate reader or a scanner.
Exemplary data can be found here:
library(gDR)
#> Loading required package: gDRcore
#> Loading required package: gDRimport
#> Loading required package: gDRutils
# get test data from gDRimport package
# i.e. paths to manifest, templates and results files
td <- get_test_data()
manifest_path(td)
[1] “/media/volume/teran2_disk/biocbuild/bbs-3.20-bioc/R/site-library/gDRimport/extdata/data1/manifest.xlsx”
template_path(td)
[1] “/media/volume/teran2_disk/biocbuild/bbs-3.20-bioc/R/site-library/gDRimport/extdata/data1/Template_7daytreated.xlsx” [2] “/media/volume/teran2_disk/biocbuild/bbs-3.20-bioc/R/site-library/gDRimport/extdata/data1/Template_Untreated.xlsx”
result_path(td)
[1] “/media/volume/teran2_disk/biocbuild/bbs-3.20-bioc/R/site-library/gDRimport/extdata/data1/RawData_day0.xlsx” [2] “/media/volume/teran2_disk/biocbuild/bbs-3.20-bioc/R/site-library/gDRimport/extdata/data1/RawData_day7.xlsx”
Using the convenience function import_data
, the long table is easily created:
# Import data
imported_data <-
import_data(manifest_path(td), template_path(td), result_path(td))
head(imported_data)
This function will expect certain “identifiers” that tell the processing functions which columns in the long table map to certain expected fields, so that each column is interpreted correctly. For more details regarding these identifiers, see the “Details” section of ?identifiers
. Use set_env_identifier
or set_SE_identifiers
to set up the correct mappings between the expected fields and your long table column names.
Next, we can transform the long table into our initial SummarizedExperiment object. To do so, we need to tell the software: - What should go on rows and columns versus be nested in the assay. - Which rows in our table to consider as “control” versus “treated” for normalization. - Which data type should be converted into SE.
We can do so by setting the untreated_tag
identifier like set_env_identifier("untreated_tag" = c("MY_CONTROL_TERMINOLOGY_HERE"))
.
specifying the nested_keys
argument within create_and_normalize_SE
and specifiying data_type
.
inl <- prepare_input(imported_data)
#> Warning in .set_nested_confounders(nested_confounders = nested_confounders, : 'Plate' nested confounder(s) is/are not present in the data.
#> Switching into 'Barcode' nested confounder(s).
detected_data_types <- names(inl$exps)
detected_data_types
#> [1] "combination" "single-agent"
se <- create_and_normalize_SE(
inl$df_list[["single-agent"]],
data_type = "single-agent",
nested_confounders = inl$nested_confounders)
#> INFO [2024-10-22 18:25:09]
#> INFO [2024-10-22 18:25:09]
se
#> class: SummarizedExperiment
#> dim: 3 6
#> metadata(3): identifiers experiment_metadata Keys
#> assays(3): RawTreated Controls Normalized
#> rownames(3): G00002_drug_002_moa_A_168 G00004_drug_004_moa_A_168
#> G00011_drug_011_moa_B_168
#> rowData names(4): Gnumber DrugName drug_moa Duration
#> colnames(6): CL00011_cellline_BA_breast_cellline_BA_unknown_26
#> CL00012_cellline_CA_breast_cellline_CA_unknown_30 ...
#> CL00015_cellline_FA_breast_cellline_FA_unknown_42
#> CL00018_cellline_IB_breast_cellline_IB_unknown_54
#> colData names(6): clid CellLineName ... subtype ReferenceDivisionTime
Note that this has created a SummarizedExperiment object with rowData
, colData
, metadata
and 3 assays
.
Next, we can average and fit the data of interest.
se <- average_SE(se, data_type = "single-agent")
se <- fit_SE(se, data_type = "single-agent")
se
#> class: SummarizedExperiment
#> dim: 3 6
#> metadata(5): identifiers experiment_metadata Keys fit_parameters
#> .internal
#> assays(5): RawTreated Controls Normalized Averaged Metrics
#> rownames(3): G00002_drug_002_moa_A_168 G00004_drug_004_moa_A_168
#> G00011_drug_011_moa_B_168
#> rowData names(4): Gnumber DrugName drug_moa Duration
#> colnames(6): CL00011_cellline_BA_breast_cellline_BA_unknown_26
#> CL00012_cellline_CA_breast_cellline_CA_unknown_30 ...
#> CL00015_cellline_FA_breast_cellline_FA_unknown_42
#> CL00018_cellline_IB_breast_cellline_IB_unknown_54
#> colData names(6): clid CellLineName ... subtype ReferenceDivisionTime
Steps (2) and (3) can be combined into a single step through a convenience function: runDrugResponseProcessingPipeline
. Moreover, the output is MultiAssayExperiment
object with one experiment per each detected data type. Currently four data types are supported: ‘single-agent’, ‘cotreatment’, ‘codilution’ and ‘matrix’. The first three data types are processed via the ‘single-agent’ model while the ‘marix’ data is processed via the ‘combintation’ model.
# Run gDR pipeline
mae <- runDrugResponseProcessingPipeline(imported_data)
mae
#> A MultiAssayExperiment object of 2 listed
#> experiments with user-defined names and respective classes.
#> Containing an ExperimentList class object of length 2:
#> [1] combination: SummarizedExperiment with 2 rows and 6 columns
#> [2] single-agent: SummarizedExperiment with 3 rows and 6 columns
#> Functionality:
#> experiments() - obtain the ExperimentList instance
#> colData() - the primary/phenotype DataFrame
#> sampleMap() - the sample coordination DataFrame
#> `$`, `[`, `[[` - extract colData columns, subset, or experiment
#> *Format() - convert into a long or wide DataFrame
#> assays() - convert ExperimentList to a SimpleList of matrices
#> exportClass() - save data to flat files
Note that our final MultiAssayExperiment object can be made up of multiple experiments with multiple assays:
names(mae)
#> [1] "combination" "single-agent"
SummarizedExperiment::assayNames(mae[[1]])
#> [1] "RawTreated" "Controls" "Normalized" "Averaged"
#> [5] "excess" "all_iso_points" "isobolograms" "scores"
#> [9] "Metrics"
Each assay from each experiment can be easily transformed to data.table
format using convert_se_assay_to_dt
function:
library(kableExtra)
se <- mae[["single-agent"]]
head(convert_se_assay_to_dt(se, "Metrics"))
#> rId cId
#> <char> <char>
#> 1: G00002_drug_002_moa_A_168 CL00011_cellline_BA_breast_cellline_BA_unknown_26
#> 2: G00002_drug_002_moa_A_168 CL00011_cellline_BA_breast_cellline_BA_unknown_26
#> 3: G00004_drug_004_moa_A_168 CL00011_cellline_BA_breast_cellline_BA_unknown_26
#> 4: G00004_drug_004_moa_A_168 CL00011_cellline_BA_breast_cellline_BA_unknown_26
#> 5: G00011_drug_011_moa_B_168 CL00011_cellline_BA_breast_cellline_BA_unknown_26
#> 6: G00011_drug_011_moa_B_168 CL00011_cellline_BA_breast_cellline_BA_unknown_26
#> x_mean x_AOC x_AOC_range xc50 x_max ec50
#> <num> <num> <num> <num> <num> <num>
#> 1: 0.9157913 0.08420865 0.05733832 7.4683485 0.4225500 5.7741811
#> 2: 0.7420679 0.25793215 0.25820178 4.1712724 0.3065500 100.0000000
#> 3: 0.6675856 0.33241443 0.32656451 0.4942034 0.0260500 0.4793539
#> 4: 0.5092443 0.49075566 0.48159819 0.3199532 -0.4177500 0.5757493
#> 5: 0.3225188 0.67748125 NA -Inf 0.3225188 NA
#> 6: 0.2600750 0.73992500 NA -Inf 0.2600750 NA
#> x_inf x_0 h r2 p_value rss x_sd_avg
#> <num> <num> <num> <num> <num> <num> <num>
#> 1: 0.14499197 1 1.3311221 0.9199573 0.0001450862960 0.03159935 0.8794050
#> 2: -0.76364407 1 0.2918356 0.7732475 0.0055517820959 0.06111735 0.7276889
#> 3: 0.04254419 1 2.9149107 0.9751992 0.0000024023028 0.04890734 0.7335509
#> 4: -0.48824612 1 1.1597048 0.9858751 0.0000003349268 0.03576754 0.5983619
#> 5: NA NA NA NA NA NA 0.2540203
#> 6: NA NA NA NA NA NA 0.3936048
#> fit_type maxlog10Concentration N_conc normalization_type
#> <char> <num> <int> <char>
#> 1: DRC3pHillFitModelFixS0 1.000000 9 RV
#> 2: DRC3pHillFitModelFixS0 1.000000 9 GR
#> 3: DRC3pHillFitModelFixS0 1.000000 9 RV
#> 4: DRC3pHillFitModelFixS0 1.000000 9 GR
#> 5: DRCTooFewPointsToFit -0.823909 1 RV
#> 6: DRCTooFewPointsToFit -0.823909 1 GR
#> fit_source Gnumber DrugName drug_moa Duration clid CellLineName Tissue
#> <char> <char> <char> <char> <num> <char> <char> <char>
#> 1: gDR G00002 drug_002 moa_A 168 CL00011 cellline_BA breast
#> 2: gDR G00002 drug_002 moa_A 168 CL00011 cellline_BA breast
#> 3: gDR G00004 drug_004 moa_A 168 CL00011 cellline_BA breast
#> 4: gDR G00004 drug_004 moa_A 168 CL00011 cellline_BA breast
#> 5: gDR G00011 drug_011 moa_B 168 CL00011 cellline_BA breast
#> 6: gDR G00011 drug_011 moa_B 168 CL00011 cellline_BA breast
#> parental_identifier subtype ReferenceDivisionTime
#> <char> <char> <num>
#> 1: cellline_BA unknown 26
#> 2: cellline_BA unknown 26
#> 3: cellline_BA unknown 26
#> 4: cellline_BA unknown 26
#> 5: cellline_BA unknown 26
#> 6: cellline_BA unknown 26
Once the data is stored in the database, there are multiple ways to visualize the data depending on the scientific needs. The primary method to do is through our RShiny visualization tool ‘gDRviz’. Here, users can search and select experiments present in the database, and use downstream visualization modules to look at dose response curves, heatmaps, etc.
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /media/volume/teran2_disk/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] kableExtra_1.4.0 gDR_1.3.1 gDRutils_1.3.16 gDRimport_1.3.5
#> [5] gDRcore_1.3.13 BiocStyle_2.33.1
#>
#> loaded via a namespace (and not attached):
#> [1] viridisLite_0.4.2 fastmap_1.2.0
#> [3] BumpyMatrix_1.13.0 TH.data_1.1-2
#> [5] digest_0.6.37 lifecycle_1.0.4
#> [7] survival_3.7-0 magrittr_2.0.3
#> [9] compiler_4.4.1 rlang_1.1.4
#> [11] sass_0.4.9 drc_3.0-1
#> [13] tools_4.4.1 plotrix_3.8-4
#> [15] utf8_1.2.4 yaml_2.3.10
#> [17] data.table_1.16.2 knitr_1.48
#> [19] lambda.r_1.2.4 S4Arrays_1.5.11
#> [21] DelayedArray_0.31.14 xml2_1.3.6
#> [23] abind_1.4-8 multcomp_1.4-26
#> [25] BiocParallel_1.39.0 purrr_1.0.2
#> [27] BiocGenerics_0.51.3 grid_4.4.1
#> [29] stats4_4.4.1 fansi_1.0.6
#> [31] colorspace_2.1-1 scales_1.3.0
#> [33] gtools_3.9.5 MASS_7.3-61
#> [35] MultiAssayExperiment_1.31.5 SummarizedExperiment_1.35.4
#> [37] cli_3.6.3 mvtnorm_1.3-1
#> [39] rmarkdown_2.28 crayon_1.5.3
#> [41] rstudioapi_0.17.0 httr_1.4.7
#> [43] readxl_1.4.3 cachem_1.1.0
#> [45] stringr_1.5.1 zlibbioc_1.51.2
#> [47] splines_4.4.1 assertthat_0.2.1
#> [49] parallel_4.4.1 formatR_1.14
#> [51] BiocManager_1.30.25 cellranger_1.1.0
#> [53] XVector_0.45.0 matrixStats_1.4.1
#> [55] vctrs_0.6.5 Matrix_1.7-1
#> [57] sandwich_3.1-1 jsonlite_1.8.9
#> [59] carData_3.0-5 bookdown_0.41
#> [61] car_3.1-3 IRanges_2.39.2
#> [63] S4Vectors_0.43.2 Formula_1.2-5
#> [65] systemfonts_1.1.0 testthat_3.2.1.1
#> [67] jquerylib_0.1.4 rematch_2.0.0
#> [69] glue_1.8.0 codetools_0.2-20
#> [71] stringi_1.8.4 futile.logger_1.4.3
#> [73] GenomeInfoDb_1.41.2 GenomicRanges_1.57.2
#> [75] UCSC.utils_1.1.0 munsell_0.5.1
#> [77] tibble_3.2.1 pillar_1.9.0
#> [79] htmltools_0.5.8.1 brio_1.1.5
#> [81] GenomeInfoDbData_1.2.13 R6_2.5.1
#> [83] evaluate_1.0.1 lattice_0.22-6
#> [85] Biobase_2.65.1 futile.options_1.0.1
#> [87] backports_1.5.0 bslib_0.8.0
#> [89] svglite_2.1.3 SparseArray_1.5.45
#> [91] checkmate_2.3.2 xfun_0.48
#> [93] MatrixGenerics_1.17.0 zoo_1.8-12
#> [95] pkgconfig_2.0.3