1 Introduction

CTdata is the companion package for CTexploreR and provides omics data to select and characterise cancer testis genes. The data come from public databases and include expression and methylation values of genes in normal and tumor samples as well as in tumor cell lines. Expression in cells treated with a demethylating agent is also available.

The data are served through the ExperimentHub infrastructure, which allows them to be downloaded only once and cached for further use. Currently available data are summarised in the table below and detailed in the next section.

library("CTdata")
DT::datatable(CTdata())
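The caching behaviour described above can be inspected directly through ExperimentHub. The following is a minimal sketch, assuming the ExperimentHub package is installed and a network connection is available on first use; the variable names `eh` and `ctdata_records` are illustrative only.

```r
library("ExperimentHub")

# Connect to the hub (metadata is downloaded once, then read from the cache)
eh <- ExperimentHub()

# List the hub records associated with CTdata
ctdata_records <- query(eh, "CTdata")
ctdata_records

# Directory in which downloaded resources are cached locally
hubCache(eh)
```

Once a dataset has been retrieved, later calls to the corresponding accessor print "loading from cache", as in the outputs shown in the next sections.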

2 Installation

To install the package from Bioconductor:

if (!require("BiocManager"))
    install.packages("BiocManager")

BiocManager::install("CTdata")

To install the package from GitHub:

if (!require("BiocManager"))
    install.packages("BiocManager")

BiocManager::install("UCLouvain-CBIO/CTdata")

3 Available data

For details about each dataset, see its manual page.

3.1 Global datasets

3.1.1 GTEX data

A SummarizedExperiment object with gene expression data in normal tissues from the GTEx database:

library("SummarizedExperiment")
GTEX_data()
## see ?CTdata and browseVignettes('CTdata') for documentation
## loading from cache
## class: SummarizedExperiment 
## dim: 24359 32 
## metadata(0):
## assays(1): TPM
## rownames(24359): ENSG00000243485 ENSG00000237613 ... ENSG00000198695
##   ENSG00000198727
## rowData names(3): external_gene_name GTEX_category max_TPM_somatic
## colnames(32): Testis Ovary ... Uterus Vagina
## colData names(0):
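As a brief illustration of how such an object can be explored, the sketch below extracts the TPM assay and tabulates one of the gene annotations. It relies only on the assay, column and rowData names visible in the output above; the variable names `gtex` and `tpm` are illustrative.

```r
library("CTdata")
library("SummarizedExperiment")

gtex <- GTEX_data()

# Expression matrix (genes x tissues), in TPM
tpm <- assay(gtex, "TPM")

# Expression of the first five genes in two of the tissues
tpm[1:5, c("Testis", "Ovary")]

# Number of genes in each GTEX_category (missing values counted separately)
table(rowData(gtex)$GTEX_category, useNA = "ifany")
```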

3.1.2 CCLE data

A SummarizedExperiment object with gene expression data in cancer cell lines from CCLE:

CCLE_data()
## see ?CTdata and browseVignettes('CTdata') for documentation
## loading from cache
## class: SummarizedExperiment 
## dim: 24327 1229 
## metadata(0):
## assays(1): TPM
## rownames(24327): ENSG00000000003 ENSG00000000005 ... ENSG00000284543
##   ENSG00000284546
## rowData names(5): external_gene_name
##   percent_of_positive_CCLE_cell_lines
##   percent_of_negative_CCLE_cell_lines max_TPM_in_CCLE CCLE_category
## colnames(1229): LC1SQSF COLO794 ... ECC2 A673
## colData names(30): DepMap_ID cell_line_name ... Cellosaurus_issues type

3.1.3 Normal tissue gene expression

A SummarizedExperiment object with gene expression values in normal tissues with or without allowing multimapping:

normal_tissues_multimapping_data()
## see ?CTdata and browseVignettes('CTdata') for documentation
## loading from cache
## class: SummarizedExperiment 
## dim: 24359 18 
## metadata(0):
## assays(2): TPM_no_multimapping TPM_with_multimapping
## rownames(24359): ENSG00000000003 ENSG00000000005 ... ENSG00000284543
##   ENSG00000284546
## rowData names(3): external_gene_name lowly_expressed_in_GTEX
##   multimapping_analysis
## colnames(18): adrenal_gland breast_epithelium ... transverse_colon
##   upper_lobe_of_left_lung
## colData names(0):
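The two assays can be compared to see which genes gain signal when multimapped reads are kept. This is only a sketch: averaging the difference across all tissues is an arbitrary choice made here for illustration, not the criterion used by the package, and the variable names are hypothetical.

```r
library("CTdata")
library("SummarizedExperiment")

nt <- normal_tissues_multimapping_data()

# Same genes and tissues, counted without and with multimapped reads
no_mm <- as.matrix(assay(nt, "TPM_no_multimapping"))
with_mm <- as.matrix(assay(nt, "TPM_with_multimapping"))

# Mean TPM gained per gene when multimapping is allowed (illustrative metric)
gain <- rowMeans(with_mm - no_mm, na.rm = TRUE)

# Ten genes with the largest mean gain
top <- head(order(gain, decreasing = TRUE), 10)
data.frame(gene = rowData(nt)$external_gene_name[top],
           mean_TPM_gain = gain[top])
```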

3.1.4 Demethylated gene expression

A SummarizedExperiment object containing the differential expression analysis of genes (with RNAseq expression values) in cell lines treated or not with a demethylating agent (5-Aza-2’-Deoxycytidine):

DAC_treated_cells()
## see ?CTdata and browseVignettes('CTdata') for documentation
## loading from cache
## class: SummarizedExperiment 
## dim: 24359 32 
## metadata(0):
## assays(1): log1p
## rownames(24359): ENSG00000243485 ENSG00000237613 ... ENSG00000198695
##   ENSG00000198727
## rowData names(18): external_gene_name logFC_B2-1 ... padj_TS603 induced
## colnames(32): B2-1_CTL_rep1 B2-1_CTL_rep2 ... TS603_DAC_rep1
##   TS603_DAC_rep2
## colData names(9): ref cell ... library lab

As above, with multimapping:

DAC_treated_cells_multimapping()
## see ?CTdata and browseVignettes('CTdata') for documentation
## loading from cache
## class: SummarizedExperiment 
## dim: 24359 32 
## metadata(0):
## assays(1): log1p
## rownames(24359): ENSG00000243485 ENSG00000237613 ... ENSG00000198695
##   ENSG00000198727
## rowData names(18): external_gene_name logFC_B2-1 ... padj_TS603 induced
## colnames(32): B2-1_CTL_rep1 B2-1_CTL_rep2 ... TS603_DAC_rep1
##   TS603_DAC_rep2
## colData names(9): ref cell ... library lab
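The sample and gene annotations of these objects can be summarised as follows. This sketch uses only names visible in the output above (the `cell` column of the colData, the `induced` column and the `logFC_` prefix in the rowData); the type and coding of `induced` is not shown here, so it is simply tabulated.

```r
library("CTdata")
library("SummarizedExperiment")

dac <- DAC_treated_cells()

# Number of samples per cell line
table(colData(dac)$cell)

# Per-cell-line log fold change columns stored in the rowData
logfc_cols <- grep("^logFC_", colnames(rowData(dac)), value = TRUE)
logfc_cols

# Distribution of the 'induced' annotation across genes
table(rowData(dac)$induced, useNA = "ifany")
```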

3.1.5 TCGA data

A SummarizedExperiment with gene expression data in TCGA samples (tumor and peritumoral samples: SKCM, LUAD, LUSC, COAD, ESCA, BRCA and HNSC):

TCGA_TPM()
## see ?CTdata and browseVignettes('CTdata') for documentation
## loading from cache
## class: SummarizedExperiment 
## dim: 24350 4141 
## metadata(0):
## assays(1): TPM
## rownames(24350): ENSG00000000003 ENSG00000000005 ... ENSG00000284543
##   ENSG00000284546
## rowData names(19): external_gene_name percent_pos_SKCM ...
##   max_TPM_in_TCGA TCGA_category
## colnames(4141): TCGA-EB-A5SF-01A-11R-A311-07
##   TCGA-EE-A3J8-06A-11R-A20F-07 ... TCGA-CV-6935-11A-01R-1915-07
##   TCGA-CV-7183-01A-11R-2016-07
## colData names(65): patient sample ... CD8_T_cells proliferation_score

3.1.6 Testis scRNAseq data

A SingleCellExperiment object containing gene expression from a testis single-cell RNAseq experiment (The adult human testis transcriptional cell atlas, Guo et al. 2018):

library("SingleCellExperiment")
testis_sce()
## see ?CTdata and browseVignettes('CTdata') for documentation
## loading from cache
## class: SingleCellExperiment 
## dim: 19777 6490 
## metadata(0):
## assays(2): counts logcounts
## rownames(19777): FAM87B LINC00115 ... NCF4-AS1 LINC01689
## rowData names(4): external_gene_name percent_pos_testis_germcells
##   percent_pos_testis_somatic testis_cell_type
## colnames(6490): Donor2-AAACCTGGTGCCTTGG-1 Donor2-AAACCTGTCAACGGGA-1 ...
##   Donor1-TTTGTCAGTGTGCGTC-2 Donor1-TTTGTCATCCAAACTG-2
## colData names(6): nGene nUMI ... Donor sizeFactor
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

3.1.7 Human Protein Atlas data

A SingleCellExperiment object containing gene expression in different human cell types, based on scRNAseq data obtained from the Human Protein Atlas (https://www.proteinatlas.org):

scRNAseq_HPA()
## see ?CTdata and browseVignettes('CTdata') for documentation
## loading from cache
## class: SingleCellExperiment 
## dim: 20082 66 
## metadata(0):
## assays(1): TPM
## rownames(20082): ENSG00000000003 ENSG00000000005 ... ENSG00000288684
##   ENSG00000288695
## rowData names(4): external_gene_name max_TPM_in_a_somatic_cell_type
##   max_in_germcells_group Higher_in_somatic_cell_type
## colnames(66): Adipocytes T-cells ... Syncytiotrophoblasts Extravillous
##   trophoblasts
## colData names(2): Cell_type group
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

3.2 CT genes determination

With the datasets above, we generated a list of 298 CT genes (see figure below for details).

We used multimapping because many CT genes belong to gene families whose members have identical or nearly identical sequences. This is likely why these genes are not detected in the GTEx database: the GTEx processing pipeline specifies that overlapping intervals between genes are excluded from all genes for counting. Some CT genes can thus only be detected in RNAseq data in which multimapping reads are not discarded.
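To see how many CT genes are concerned, the `lowly_expressed_in_GTEX` and `multimapping_analysis` columns of the CT gene table (described in the CT genes section) can be cross-tabulated. This is a minimal sketch; the possible values of `multimapping_analysis` are not listed in this document, so no assumption is made about them.

```r
library("CTdata")

ct <- CT_genes()

# Cross-tabulate GTEx detection status against the multimapping analysis result
mm_table <- table(lowly_expressed_in_GTEX = ct$lowly_expressed_in_GTEX,
                  multimapping_analysis = ct$multimapping_analysis,
                  useNA = "ifany")
mm_table
```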

3.3 CT datasets

3.3.1 Methylation in normal tissues

A RangedSummarizedExperiment containing methylation of CpGs located within CT promoters in normal tissues:

CT_methylation_in_tissues()
## see ?CTdata and browseVignettes('CTdata') for documentation
## loading from cache
## class: RangedSummarizedExperiment 
## dim: 51725 14 
## metadata(0):
## assays(1): ''
## rownames: NULL
## rowData names(0):
## colnames(14): adipose colon ... thyroid sperm
## colData names(0):

A SummarizedExperiment with the mean promoter methylation of Cancer-Testis genes in normal tissues:

CT_mean_methylation_in_tissues()
## see ?CTdata and browseVignettes('CTdata') for documentation
## loading from cache
## class: SummarizedExperiment 
## dim: 298 14 
## metadata(0):
## assays(1): ''
## rownames(298): TTLL10 TAS1R1 ... RBMY1F RBMY1J
## rowData names(7): ensembl_gene_id CpG_density ... somatic_methylation
##   germline_methylation
## colnames(14): adipose colon ... thyroid sperm
## colData names(0):
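One possible use of this object is to contrast promoter methylation in sperm with the other tissues. The sketch below assumes that every column other than `sperm` can be treated as a somatic tissue, which is not verified here since only some of the column names are shown above; the variable names are illustrative.

```r
library("CTdata")
library("SummarizedExperiment")

ctm <- CT_mean_methylation_in_tissues()

# The single assay is unnamed, so it is retrieved by position
meth <- as.matrix(assay(ctm))

# Assumption: all columns except "sperm" are somatic tissues
other_cols <- setdiff(colnames(meth), "sperm")

# Mean methylation in the other tissues minus methylation in sperm
delta <- rowMeans(meth[, other_cols], na.rm = TRUE) - meth[, "sperm"]

# Genes whose promoter is most methylated elsewhere relative to sperm
head(sort(delta, decreasing = TRUE))
```

The `somatic_methylation` and `germline_methylation` columns of the rowData hold the annotations computed by the package authors and should be preferred over this ad hoc difference.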

3.3.2 TCGA data

A RangedSummarizedExperiment with methylation of CpGs located within CT promoters in TCGA samples (tumor and peritumoral samples: SKCM, LUAD, LUSC, COAD, ESCA, BRCA and HNSC):

TCGA_CT_methylation()
## see ?CTdata and browseVignettes('CTdata') for documentation
## loading from cache
## class: RangedSummarizedExperiment 
## dim: 666 3423 
## metadata(0):
## assays(1): methylation
## rownames(666): cg26578072 cg27631599 ... cg16626452 cg22051787
## rowData names(52): address_A address_B ... MASK_extBase MASK_general
## colnames(3423): TCGA-ER-A42L-06A-11D-A24V-05
##   TCGA-WE-A8K1-06A-21D-A373-05 ... TCGA-BB-A6UO-01A-12D-A34K-05
##   TCGA-IQ-A61O-01A-11D-A30F-05
## colData names(3): samples sample project_id
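Because this is a ranged object, the genomic position of each probe is available alongside the methylation values. A short sketch, using the `project_id` column and the `methylation` assay shown in the output above:

```r
library("CTdata")
library("SummarizedExperiment")

tm <- TCGA_CT_methylation()

# Number of samples per TCGA project
table(colData(tm)$project_id)

# Genomic coordinates of the first three probes
rowRanges(tm)[1:3]

# Methylation values of the first three probes in the first two samples
assay(tm, "methylation")[1:3, 1:2]
```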

3.3.3 CCLE data

A matrix with gene expression correlations in CCLE cancer cell lines:

dim(CCLE_correlation_matrix())
## see ?CTdata and browseVignettes('CTdata') for documentation
## loading from cache
## [1]   298 24327
CCLE_correlation_matrix()[1:10, 1:5]
## see ?CTdata and browseVignettes('CTdata') for documentation
## loading from cache
##                 ENSG00000000003 ENSG00000000005 ENSG00000000419 ENSG00000000457
## ENSG00000162571     0.103434061     0.001049710    -0.042549895      0.14462034
## ENSG00000173662    -0.071928957    -0.023465128    -0.072986999      0.08297871
## ENSG00000157330    -0.009096577    -0.012775872    -0.013617443      0.01686471
## ENSG00000234593    -0.042559061    -0.022602098    -0.001087715     -0.03065048
## ENSG00000117148    -0.027897133     0.031152498     0.012210680      0.08032635
## ENSG00000131914     0.098018050     0.062352652     0.046643780      0.05024423
## ENSG00000142698    -0.009432254    -0.014082458     0.007391337      0.02454294
## ENSG00000143006     0.018450551    -0.005250837    -0.052121771      0.01223201
## ENSG00000237853    -0.138887798    -0.012717974    -0.037329853      0.05698454
## ENSG00000226088     0.116214009     0.016494255     0.050339175     -0.02698655
##                 ENSG00000000460
## ENSG00000162571     0.051584992
## ENSG00000173662     0.054997238
## ENSG00000157330    -0.041359090
## ENSG00000234593    -0.014316867
## ENSG00000117148     0.081165649
## ENSG00000131914     0.032435630
## ENSG00000142698     0.008474658
## ENSG00000143006    -0.006112485
## ENSG00000237853     0.005744308
## ENSG00000226088     0.029441251
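Rows of the matrix correspond to CT genes and columns to all genes, so the genes most correlated with a given CT gene can be retrieved by sorting one row. The sketch below uses the first row shown above (ENSG00000162571) as an example.

```r
library("CTdata")

cm <- CCLE_correlation_matrix()

# CT gene of interest (first row in the output above)
gene <- "ENSG00000162571"

# Correlation of that gene with every other gene, excluding itself
r <- cm[gene, ]
r <- r[names(r) != gene]

# Ten most positively correlated genes (sort() drops missing values)
head(sort(r, decreasing = TRUE), 10)
```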

3.3.4 CT genes

A tibble with Cancer-Testis (CT) genes and their characteristics:

CT_genes()
## see ?CTdata and browseVignettes('CTdata') for documentation
## loading from cache
## # A tibble: 298 × 36
##    ensembl_gene_id external_gene_name family chr   strand transcription_start_…¹
##    <chr>           <chr>              <chr>  <chr>  <int>                  <int>
##  1 ENSG00000162571 TTLL10             <NA>   1          1                1173880
##  2 ENSG00000173662 TAS1R1             <NA>   1          1                6555307
##  3 ENSG00000157330 CFAP107            <NA>   1          1               12746200
##  4 ENSG00000234593 KAZN-AS1           <NA>   1         -1               14419973
##  5 ENSG00000117148 ACTL8              <NA>   1          1               17755333
##  6 ENSG00000131914 LIN28A             <NA>   1          1               26410817
##  7 ENSG00000142698 C1orf94            <NA>   1          1               34176907
##  8 ENSG00000143006 DMRTB1             <NA>   1          1               53459399
##  9 ENSG00000237853 NFIA-AS1           <NA>   1         -1               61253510
## 10 ENSG00000226088 HHLA3-AS1          <NA>   1         -1               70360437
## # ℹ 288 more rows
## # ℹ abbreviated name: ¹​transcription_start_site
## # ℹ 30 more variables: X_linked <lgl>, TPM_testis <dbl>, max_TPM_somatic <dbl>,
## #   GTEX_category <chr>, lowly_expressed_in_GTEX <lgl>,
## #   multimapping_analysis <chr>, testis_specificity <chr>,
## #   testis_cell_type <chr>, Higher_in_somatic_cell_type <lgl>,
## #   percent_of_positive_CCLE_cell_lines <dbl>, …
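The table can be filtered like any data frame. As a minimal sketch using columns listed in the output above, the following selects X-linked CT genes and summarises the `testis_specificity` annotation:

```r
library("CTdata")

ct <- CT_genes()

# X-linked CT genes, with their chromosome and gene family
x_linked <- ct[ct$X_linked %in% TRUE,
               c("external_gene_name", "chr", "family")]
x_linked

# Number of CT genes in each testis_specificity category
table(ct$testis_specificity, useNA = "ifany")
```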

Session information

## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] SingleCellExperiment_1.24.0 SummarizedExperiment_1.32.0
##  [3] Biobase_2.62.0              GenomicRanges_1.54.0       
##  [5] GenomeInfoDb_1.38.0         IRanges_2.36.0             
##  [7] S4Vectors_0.40.0            BiocGenerics_0.48.0        
##  [9] MatrixGenerics_1.14.0       matrixStats_1.0.0          
## [11] CTdata_1.2.0                BiocStyle_2.30.0           
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.0              dplyr_1.1.3                  
##  [3] blob_1.2.4                    filelock_1.0.2               
##  [5] Biostrings_2.70.0             bitops_1.0-7                 
##  [7] fastmap_1.1.1                 RCurl_1.98-1.12              
##  [9] BiocFileCache_2.10.0          promises_1.2.1               
## [11] digest_0.6.33                 mime_0.12                    
## [13] lifecycle_1.0.3               ellipsis_0.3.2               
## [15] KEGGREST_1.42.0               interactiveDisplayBase_1.40.0
## [17] RSQLite_2.3.1                 magrittr_2.0.3               
## [19] compiler_4.3.1                rlang_1.1.1                  
## [21] sass_0.4.7                    tools_4.3.1                  
## [23] utf8_1.2.4                    yaml_2.3.7                   
## [25] knitr_1.44                    S4Arrays_1.2.0               
## [27] htmlwidgets_1.6.2             bit_4.0.5                    
## [29] curl_5.1.0                    DelayedArray_0.28.0          
## [31] abind_1.4-5                   withr_2.5.1                  
## [33] purrr_1.0.2                   grid_4.3.1                   
## [35] fansi_1.0.5                   ExperimentHub_2.10.0         
## [37] xtable_1.8-4                  cli_3.6.1                    
## [39] rmarkdown_2.25                crayon_1.5.2                 
## [41] generics_0.1.3                httr_1.4.7                   
## [43] DBI_1.1.3                     cachem_1.0.8                 
## [45] zlibbioc_1.48.0               AnnotationDbi_1.64.0         
## [47] BiocManager_1.30.22           XVector_0.42.0               
## [49] vctrs_0.6.4                   Matrix_1.6-1.1               
## [51] jsonlite_1.8.7                bookdown_0.36                
## [53] bit64_4.0.5                   crosstalk_1.2.0              
## [55] jquerylib_0.1.4               glue_1.6.2                   
## [57] DT_0.30                       BiocVersion_3.18.0           
## [59] later_1.3.1                   tibble_3.2.1                 
## [61] pillar_1.9.0                  rappdirs_0.3.3               
## [63] htmltools_0.5.6.1             GenomeInfoDbData_1.2.11      
## [65] R6_2.5.1                      dbplyr_2.3.4                 
## [67] lattice_0.22-5                evaluate_0.22                
## [69] shiny_1.7.5.1                 AnnotationHub_3.10.0         
## [71] png_0.1-8                     memoise_2.0.1                
## [73] httpuv_1.6.12                 bslib_0.5.1                  
## [75] Rcpp_1.0.11                   SparseArray_1.2.0            
## [77] xfun_0.40                     pkgconfig_2.0.3