Evgenii Chekalin, Billy Zeng, Patrick Newbury, Benjamin Glicksberg, Jing Xing, Ke Liu, Dimitri Joseph, Bin Chen

Edited on September 28, 2020; Compiled on September 29, 2020

Package overview

As the field of precision medicine progresses, we start to tailor treatments for cancer patients classified not only by their clinical, but also by their molecular features. The emerging cancer subtypes defined by these features require dedicated resources to assist the discovery of drug candidates for preclinical evaluation. Voluminous cancer patient gene expression profiles have been accumulated in public databases, enabling the creation of a cancer-specific expression signature. Meanwhile, large-scale gene expression profiles of chemical compounds became available in recent years. By matching the cancer-specific expression signature to compound-induced gene expression profiles from large drug libraries, researchers can prioritize small molecules that present high potency to reverse expression of signature genes for further experimental testing of their efficacy. This approach has proven to be an efficient and cost-effective way to identify efficacious drug candidates. However, the success of this approach requires multiscale procedures, imposing significant challenges to many labs. Therefore, we present OCTAD (http://octad.org): an open workspace for virtually screening compounds targeting precise cancer patient groups using gene expression features. We have included 19,127 patient tissue samples covering 50 cancer types, and expression profiles for 12,442 distinct compounds. We will continue to include more tissue samples. We streamline all the procedures including deep-learning based reference tissue selection, disease gene expression signature creation, drug reversal potency scoring, and in silico validation. We release OCTAD as a web portal and a standalone R package to allow experimental and computational scientists to easily navigate the tool. The code and data can also be employed to answer various biological questions.

Workflow

We use Hepatocellular Carcinoma (HCC) to illustrate the utility of the desktop pipeline. We provide the code and data for investigating differential expression, pathway enrichment, drug prediction and hit selection, and in silico validation. In this workflow, we will select case tissue samples from our compiled TCGA data and compute control tissue samples from the GTEx data. Note that the compiled data contains adjacent normal samples which can also serve as control tissues. By default, the octad package uses Small OCTAD dataset containing expression values only for LINCS 978 landmark genes required for sRGES score computation. To download the full expression values, please refer to the link octad.counts.and.tpm.h5 (~3G). We recommend to use the full expression matrix. By default, computated results are stored in the temporary directory. ## Select case samples Choosing cases (tumor samples from the phenoDF data.frame) and controls (corresponding samples treated as background such as normal tissue, adjacent normal tissue or tumor tissue samples without a specific mutation) is critical to achieving the best results. Several methods are included in the provided code which demonstrates multiple control sample selection methods. There are no built-in validation steps to evaluate case samples. Visualization of cases in a t-SNE (t-Distributed Stochastic Neighbor Embedding) plot could help understand their relations with other OCTAD samples. Samples sharing similar transcriptomic profiles tend to cluster together in the t-SNE plot. The cases scattering in multiple clusters are not recommended to choose as a group. Phenotype data frame phenoDF contains tissue types including normal tissue, adjacent normal tissue, primary cancer, recurrent cancer, and metastatic cancer.

To list all available samples from the octad database, use the phenoDF data frame. To select HCC samples, subset the phenoDF:

#select data
library(octad)
## Loading required package: magrittr
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Loading required package: ggplot2
## Loading required package: edgeR
## Loading required package: limma
## Loading required package: RUVSeq
## Loading required package: Biobase
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## The following object is masked from 'package:limma':
## 
##     plotMA
## The following objects are masked from 'package:dplyr':
## 
##     combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
##     as.data.frame, basename, cbind, colnames, dirname, do.call,
##     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
##     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
##     pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
##     tapply, union, unique, unsplit, which.max, which.min
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: EDASeq
## Loading required package: ShortRead
## Loading required package: BiocParallel
## Loading required package: Biostrings
## Loading required package: S4Vectors
## Loading required package: stats4
## 
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:dplyr':
## 
##     first, rename
## The following objects are masked from 'package:base':
## 
##     I, expand.grid, unname
## Loading required package: IRanges
## 
## Attaching package: 'IRanges'
## The following objects are masked from 'package:dplyr':
## 
##     collapse, desc, slice
## Loading required package: XVector
## Loading required package: GenomeInfoDb
## 
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
## 
##     strsplit
## Loading required package: Rsamtools
## Loading required package: GenomicRanges
## 
## Attaching package: 'GenomicRanges'
## The following object is masked from 'package:magrittr':
## 
##     subtract
## Loading required package: GenomicAlignments
## Loading required package: SummarizedExperiment
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
## 
## Attaching package: 'matrixStats'
## The following objects are masked from 'package:Biobase':
## 
##     anyMissing, rowMedians
## The following object is masked from 'package:dplyr':
## 
##     count
## 
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
## 
##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
##     colWeightedMeans, colWeightedMedians, colWeightedSds,
##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
##     rowWeightedSds, rowWeightedVars
## The following object is masked from 'package:Biobase':
## 
##     rowMedians
## 
## Attaching package: 'GenomicAlignments'
## The following object is masked from 'package:dplyr':
## 
##     last
## 
## Attaching package: 'ShortRead'
## The following object is masked from 'package:dplyr':
## 
##     id
## The following object is masked from 'package:magrittr':
## 
##     functions
## Loading required package: DESeq2
## Loading required package: rhdf5
## Loading required package: foreach
## Loading required package: Rfast
## Loading required package: Rcpp
## Loading required package: RcppZiggurat
## 
## Attaching package: 'Rfast'
## The following objects are masked from 'package:MatrixGenerics':
## 
##     colMads, colMaxs, colMedians, colMins, colRanks, colVars, rowMads,
##     rowMaxs, rowMedians, rowMins, rowRanks, rowVars
## The following objects are masked from 'package:matrixStats':
## 
##     colMads, colMaxs, colMedians, colMins, colRanks, colVars, rowMads,
##     rowMaxs, rowMedians, rowMins, rowRanks, rowVars
## The following object is masked from 'package:Biobase':
## 
##     rowMedians
## The following object is masked from 'package:dplyr':
## 
##     nth
## Loading required package: octad.db
## Loading required package: ExperimentHub
## Loading required package: AnnotationHub
## Loading required package: BiocFileCache
## Loading required package: dbplyr
## 
## Attaching package: 'dbplyr'
## The following objects are masked from 'package:dplyr':
## 
##     ident, sql
## 
## Attaching package: 'AnnotationHub'
## The following object is masked from 'package:Biobase':
## 
##     cache
## Welcome to octad.db package. This is a database package, pipeline available via the octad package. If you want to run the pipeline on the webserver, please refer to octad.org
## Loading required package: httr
## 
## Attaching package: 'httr'
## The following object is masked from 'package:Biobase':
## 
##     content
phenoDF=get_ExperimentHub_data("EH7274") #load data.frame with samples included in the OCTAD database. 
## snapshotDate(): 2022-10-24
## see ?octad.db and browseVignettes('octad.db') for documentation
## loading from cache
head(phenoDF) #list all data included within the package
##                  sample.id sample.type                  biopsy.site cancer
## 1 GTEX-1117F-0226-SM-5GZZ7      normal       ADIPOSE - SUBCUTANEOUS normal
## 2 GTEX-1117F-0426-SM-5EGHI      normal            MUSCLE - SKELETAL normal
## 3 GTEX-1117F-0526-SM-5EGHJ      normal              ARTERY - TIBIAL normal
## 4 GTEX-1117F-0626-SM-5N9CS      normal            ARTERY - CORONARY normal
## 5 GTEX-1117F-0726-SM-5GIEN      normal     HEART - ATRIAL APPENDAGE normal
## 6 GTEX-1117F-1326-SM-5EGHH      normal ADIPOSE - VISCERAL (OMENTUM) normal
##   data.source gender read.count.file age_in_year metastatic_site tumor_grade
## 1        GTEX Female      TOIL.RData        <NA>            <NA>        <NA>
## 2        GTEX Female      TOIL.RData        <NA>            <NA>        <NA>
## 3        GTEX Female      TOIL.RData        <NA>            <NA>        <NA>
## 4        GTEX Female      TOIL.RData        <NA>            <NA>        <NA>
## 5        GTEX Female      TOIL.RData        <NA>            <NA>        <NA>
## 6        GTEX Female      TOIL.RData        <NA>            <NA>        <NA>
##   tumor_stage gain_list loss_list mutation_list subtype
## 1        <NA>                              <NA>    <NA>
## 2        <NA>                              <NA>    <NA>
## 3        <NA>                              <NA>    <NA>
## 4        <NA>                              <NA>    <NA>
## 5        <NA>                              <NA>    <NA>
## 6        <NA>                              <NA>    <NA>
HCC_primary=subset(phenoDF,cancer=='liver hepatocellular carcinoma'&sample.type == 'primary') #select data
case_id=HCC_primary$sample.id #select cases

The sample ids will be stored in the character vector case_id. The code can be easily modified to select for other cancers or a set of samples based on mutations and copy numbers (e.g., TP53 mutation or MYC amplification). It is also recommended to use the R package cgdsr to select TCGA samples based on more clinical and molecular features.

Compute or select control samples

Use the function computeRefTissue to compute appropriate normal tissues via comparing gene expression features between case samples and normal tissue samples. Users can select adjacent normal tissue samples if available. By default, features from the precomputed AutoEncoder file are used, but other features such as top varying genes across samples can be employed as well. Pairwise Spearman correlation is computed between every case sample and every normal sample using these features. For each normal sample, its median correlation with all case samples is then computed. Top correlated normal samples (defined by control_size) are then selected as control.

#computing top 50 reference tissues 
control_id=computeRefTissue(case_id,output=FALSE,adjacent=TRUE,source = "octad",control_size = 50)  
## see ?octad.db and browseVignettes('octad.db') for documentation
## loading from cache
## see ?octad.db and browseVignettes('octad.db') for documentation
## loading from cache
#please note, if \code{output = TRUE}, \code{outputFolder} variable must be specified, otherwise it will be written to \code{tempdir()}
# use adjacent normal tissue samples as control_id allow you to avoid running this function    

There is also an availability to select control samples by hand:

#computing top 50 reference tissues 
control_id=subset(phenoDF,biopsy.site=='LIVER'&sample.type=='normal')$sample.id[1:50] #select first 50 samples from healthy liver
# use adjacent normal tissue samples as control_id allow you to avoid running this function    

The relationships among case, control and other samples can be visualised through a t-SNE matrix precomputed based on the features derived from autoencoder.

tsne=get_ExperimentHub_data("EH7276") #Download file with tsneresults for all samples in the octad.db once. After this it will be cached and no additional download required.
## see ?octad.db and browseVignettes('octad.db') for documentation
## loading from cache
tsne$type <- "others"
tsne$type[tsne$sample.id %in% case_id] <- "case"
tsne$type[tsne$sample.id %in% control_id] <- "control"

#plot
p2 <- ggplot(tsne, aes(X, Y, color = type)) + geom_point(alpha = 0.4)+
    labs(title = paste ('TNSE PLOT'), x= 'TSNE Dim1', y='TSNE Dim2', caption="OCTAD")+
    theme_bw()
p2