1 Mapping between symbols

1.1 org.Hs.eg.db

The org packages contain information to map between different symbols. Here we check for available org packages.

BiocManager::available("^org\\.")
##  [1] "org.Ag.eg.db"      "org.At.tair.db"    "org.Bt.eg.db"     
##  [4] "org.Ce.eg.db"      "org.Cf.eg.db"      "org.Dm.eg.db"     
##  [7] "org.Dr.eg.db"      "org.EcK12.eg.db"   "org.EcSakai.eg.db"
## [10] "org.Gg.eg.db"      "org.Hs.eg.db"      "org.Mm.eg.db"     
## [13] "org.Mmu.eg.db"     "org.Pf.plasmo.db"  "org.Pt.eg.db"     
## [16] "org.Rn.eg.db"      "org.Sc.sgd.db"     "org.Ss.eg.db"     
## [19] "org.Xl.eg.db"

The regular expression "^org\\." insists that the package name starts with org ("^org") followed by a literal period ("\\.") rather than a wild-card representing any character.
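
The effect of escaping the period can be checked with base R alone. The sketch below uses made-up candidate names (they are illustrative only, not real package names) and compares the escaped pattern with an unescaped one.

```r
# Hypothetical candidate names, chosen only to exercise the two patterns
candidates <- c("org.Hs.eg.db", "orgXHs.eg.db", "my.org.db", "TxDb.org")

# "^org\\." : anchored at the start, followed by a literal period
escaped <- grepl("^org\\.", candidates)

# "^org." : anchored at the start, followed by any single character
unescaped <- grepl("^org.", candidates)

data.frame(candidates, escaped, unescaped)
```

Only the first name matches the escaped pattern, whereas the unescaped pattern also accepts "orgXHs.eg.db".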

In addition to these packages, many org resources are available from AnnotationHub, described below.

library(AnnotationHub)
query(AnnotationHub(), "^org\\.")
## snapshotDate(): 2019-05-02
## AnnotationHub with 1710 records
## # snapshotDate(): 2019-05-02 
## # $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
## # $species: Escherichia coli, 'Chlorella vulgaris'_C-169, 'Klebsiella a...
## # $rdataclass: OrgDb
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass,
## #   tags, rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH70563"]]' 
## 
##             title                                                     
##   AH70563 | org.Ag.eg.db.sqlite                                       
##   AH70564 | org.At.tair.db.sqlite                                     
##   AH70565 | org.Bt.eg.db.sqlite                                       
##   AH70566 | org.Cf.eg.db.sqlite                                       
##   AH70567 | org.Gg.eg.db.sqlite                                       
##   ...       ...                                                       
##   AH73812 | org.Plasmodium_vivax.eg.sqlite                            
##   AH73813 | org.Burkholderia_mallei_ATCC_23344.eg.sqlite              
##   AH73814 | org.Bacillus_cereus_(strain_ATCC_14579_|_DSM_31).eg.sqlite
##   AH73815 | org.Bacillus_cereus_ATCC_14579.eg.sqlite                  
##   AH73816 | org.Schizosaccharomyces_cryophilus_OY26.eg.sqlite

The naming convention of org objects uses a two-letter code to represent the species (e.g., Hs is Homo sapiens), followed by the central identifier used to map to and from other symbols. For org.Hs.eg.db, the central identifier is the Entrez gene identifier (eg). To map from, say, HGNC symbol to Ensembl identifier, a map must exist between the gene symbol and the Entrez identifier, and then from the Entrez identifier to the Ensembl identifier.

Many additional org packages are available on AnnotationHub, as mentioned briefly below.

library(org.Hs.eg.db)

We can discover available keytypes() for querying the database, and columns() to map to. keys() lists the keys themselves, e.g.,

head(keys(org.Hs.eg.db))
## [1] "1"  "2"  "3"  "9"  "10" "11"
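
The output above shows keys(); keytypes() and columns() are called the same way. A minimal sketch, assuming org.Hs.eg.db is installed (output is omitted because it depends on the package version):

```r
library(org.Hs.eg.db)

# Kinds of identifier that can be used as keys when querying
kt <- keytypes(org.Hs.eg.db)
head(kt)

# Kinds of information that can be retrieved
cols <- columns(org.Hs.eg.db)
head(cols)

# keys() accepts a keytype, so keys other than the default can be listed
head(keys(org.Hs.eg.db, keytype = "SYMBOL"))
```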

Here are a handful of ENTREZID keys

eid <- sample(keys(org.Hs.eg.db), 10)

Two main functions are select() and mapIds(). mapIds() is more focused. It guarantees a one-to-one mapping between keys and a single selected column. By default, if a key maps to multiple values, then the ‘first’ value returned by the database is used. The return value is a named vector; the 1:1 mapping between query and return value makes this function particularly useful in pipelines where a single mapping must occur.

mapIds(org.Hs.eg.db, eid, "SYMBOL", "ENTREZID")
## 'select()' returned 1:1 mapping between keys and columns
##      112268315          54012      106481389          57325           4113 
## "LOC112268315"      "ZNF299P"    "RNU6-657P"        "KAT14"       "MAGEB2" 
##      100418862      104413892          79160      106481873      105369543 
##     "SPECC1P1"      "F10-AS1"    "LINC01711"     "RNU4-84P" "LOC105369543"
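
The ‘first value’ behavior is controlled by the multiVals= argument of mapIds(). The sketch below assumes org.Hs.eg.db is installed and uses a single illustrative key (Entrez id 57325, KAT14) to contrast the default with multiVals = "list", which keeps every value for each key.

```r
library(org.Hs.eg.db)

key <- "57325"   # KAT14, used here only as an illustrative key

# Default (multiVals = "first"): one GO id per key, as a named character vector
first_go <- mapIds(org.Hs.eg.db, key, "GO", "ENTREZID")

# multiVals = "list": all GO ids per key, as a named list
all_go <- mapIds(org.Hs.eg.db, key, "GO", "ENTREZID", multiVals = "list")

first_go
lengths(all_go)
```

Either form still has exactly one element per key, so the result can be added to a table of keys without changing its number of rows.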

select() is more general, returning a data.frame of keys, plus one or more columns. If a key maps to multiple values, then multiple rows are returned.

map <- select(org.Hs.eg.db, eid, c("SYMBOL", "GO"), "ENTREZID")
## 'select()' returned 1:many mapping between keys and columns
dim(map)
## [1] 17  5
head(map)
##    ENTREZID       SYMBOL         GO EVIDENCE ONTOLOGY
## 1 112268315 LOC112268315       <NA>     <NA>     <NA>
## 2     54012      ZNF299P       <NA>     <NA>     <NA>
## 3 106481389    RNU6-657P       <NA>     <NA>     <NA>
## 4     57325        KAT14 GO:0000086      IEA       BP
## 5     57325        KAT14 GO:0004402      IDA       MF
## 6     57325        KAT14 GO:0005515      IPI       MF
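
The symbol-to-Ensembl mapping described earlier does not require the user to perform the intermediate Entrez step by hand: it is enough to name the keytype supplied and the column wanted. A minimal sketch, assuming org.Hs.eg.db is installed; the two gene symbols are arbitrary examples.

```r
library(org.Hs.eg.db)

# Illustrative gene symbols; any valid symbols could be used here
symbols <- c("BRCA1", "TP53")

# keytype = "SYMBOL" describes the keys supplied;
# column = "ENSEMBL" describes the identifier wanted in return
ensembl_ids <- mapIds(org.Hs.eg.db, keys = symbols,
                      column = "ENSEMBL", keytype = "SYMBOL")
ensembl_ids
```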

1.2 GO.db

GO.db

library(GO.db)

2 Transcript annotations

2.1 TxDb.Hsapiens.UCSC.hg38.knownGene

TxDb packages contain information about gene models (exon, gene, transcript coordinates). There are a number of TxDb packages available to install:

library(dplyr)    # for `%>%`
BiocManager::available("^TxDb") %>%
    tibble::enframe(name = NULL)
## # A tibble: 34 x 1
##    value                                
##    <chr>                                
##  1 TxDb.Athaliana.BioMart.plantsmart22  
##  2 TxDb.Athaliana.BioMart.plantsmart25  
##  3 TxDb.Athaliana.BioMart.plantsmart28  
##  4 TxDb.Btaurus.UCSC.bosTau8.refGene    
##  5 TxDb.Celegans.UCSC.ce11.ensGene      
##  6 TxDb.Celegans.UCSC.ce11.refGene      
##  7 TxDb.Celegans.UCSC.ce6.ensGene       
##  8 TxDb.Cfamiliaris.UCSC.canFam3.refGene
##  9 TxDb.Dmelanogaster.UCSC.dm3.ensGene  
## 10 TxDb.Dmelanogaster.UCSC.dm6.ensGene  
## # … with 24 more rows

and to download from AnnotationHub

query(AnnotationHub(), "^TxDb\\.")
## snapshotDate(): 2019-05-02
## AnnotationHub with 94 records
## # snapshotDate(): 2019-05-02 
## # $dataprovider: UCSC
## # $species: Rattus norvegicus, Gallus gallus, Macaca mulatta, Caenorhab...
## # $rdataclass: TxDb
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass,
## #   tags, rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH52245"]]' 
## 
##             title                                        
##   AH52245 | TxDb.Athaliana.BioMart.plantsmart22.sqlite   
##   AH52246 | TxDb.Athaliana.BioMart.plantsmart25.sqlite   
##   AH52247 | TxDb.Athaliana.BioMart.plantsmart28.sqlite   
##   AH52248 | TxDb.Btaurus.UCSC.bosTau8.refGene.sqlite     
##   AH52249 | TxDb.Celegans.UCSC.ce11.refGene.sqlite       
##   ...       ...                                          
##   AH70596 | TxDb.Ptroglodytes.UCSC.panTro5.refGene.sqlite
##   AH70597 | TxDb.Rnorvegicus.UCSC.rn5.refGene.sqlite     
##   AH70598 | TxDb.Rnorvegicus.UCSC.rn6.refGene.sqlite     
##   AH70599 | TxDb.Sscrofa.UCSC.susScr11.refGene.sqlite    
##   AH70600 | TxDb.Sscrofa.UCSC.susScr3.refGene.sqlite

Here we load the TxDb object containing gene models for Homo sapiens using annotations provided by UCSC for the hg38 genome build, using the knownGene annotation track.

library(TxDb.Hsapiens.UCSC.hg38.knownGene)

2.2 exons(), transcripts(), genes()

The coordinates of annotated exons can be extracted as a GRanges object

exons(TxDb.Hsapiens.UCSC.hg38.knownGene)
## GRanges object with 647025 ranges and 1 metadata column:
##                    seqnames        ranges strand |   exon_id
##                       <Rle>     <IRanges>  <Rle> | <integer>
##        [1]             chr1   11869-12227      + |         1
##        [2]             chr1   12010-12057      + |         2
##        [3]             chr1   12179-12227      + |         3
##        [4]             chr1   12613-12697      + |         4
##        [5]             chr1   12613-12721      + |         5
##        ...              ...           ...    ... .       ...
##   [647021] chrUn_GL000220v1 155997-156149      + |    647021
##   [647022] chrUn_KI270442v1 380608-380726      + |    647022
##   [647023] chrUn_KI270442v1 217250-217401      - |    647023
##   [647024] chrUn_KI270744v1   51009-51114      - |    647024
##   [647025] chrUn_KI270750v1 148668-148843      + |    647025
##   -------
##   seqinfo: 595 sequences (1 circular) from hg38 genome

Additional information is also present in the database, for instance the GENEID (Entrez gene id for these TxDb)

ex <- exons(TxDb.Hsapiens.UCSC.hg38.knownGene, columns = "GENEID")
ex
## GRanges object with 647025 ranges and 1 metadata column:
##                    seqnames        ranges strand |          GENEID
##                       <Rle>     <IRanges>  <Rle> | <CharacterList>
##        [1]             chr1   11869-12227      + |       100287102
##        [2]             chr1   12010-12057      + |       100287102
##        [3]             chr1   12179-12227      + |       100287102
##        [4]             chr1   12613-12697      + |       100287102
##        [5]             chr1   12613-12721      + |       100287102
##        ...              ...           ...    ... .             ...
##   [647021] chrUn_GL000220v1 155997-156149      + |       109864274
##   [647022] chrUn_KI270442v1 380608-380726      + |            <NA>
##   [647023] chrUn_KI270442v1 217250-217401      - |            <NA>
##   [647024] chrUn_KI270744v1   51009-51114      - |            <NA>
##   [647025] chrUn_KI270750v1 148668-148843      + |            <NA>
##   -------
##   seqinfo: 595 sequences (1 circular) from hg38 genome

Note that the object reports “595 sequences”; this is because the exons include both standard chromosomes and partially assembled contigs. Use keepStandardChromosomes() to update the object to contain only exons found on the ‘standard’ chromosomes; the pruning.mode= argument determines whether sequence names that are ‘in use’ (have exons associated with them) can be dropped.

std_ex <- keepStandardChromosomes(ex, pruning.mode="coarse")
std_ex
## GRanges object with 591211 ranges and 1 metadata column:
##            seqnames      ranges strand |          GENEID
##               <Rle>   <IRanges>  <Rle> | <CharacterList>
##        [1]     chr1 11869-12227      + |       100287102
##        [2]     chr1 12010-12057      + |       100287102
##        [3]     chr1 12179-12227      + |       100287102
##        [4]     chr1 12613-12697      + |       100287102
##        [5]     chr1 12613-12721      + |       100287102
##        ...      ...         ...    ... .             ...
##   [591207]     chrM   5826-5891      - |            <NA>
##   [591208]     chrM   7446-7514      - |            <NA>
##   [591209]     chrM 14149-14673      - |            <NA>
##   [591210]     chrM 14674-14742      - |            <NA>
##   [591211]     chrM 15956-16023      - |            <NA>
##   -------
##   seqinfo: 25 sequences (1 circular) from hg38 genome
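
The role of pruning.mode= is easier to see on a small object. The sketch below builds a toy GRanges (the ranges are invented for illustration) with one range on a non-standard sequence, and contrasts pruning.mode = "coarse" with the default, which refuses to drop a sequence that is in use.

```r
library(GenomicRanges)
library(GenomeInfoDb)   # provides keepStandardChromosomes()

# Toy ranges: three on standard chromosomes, one on an unplaced contig
gr <- GRanges(
    c("chr1", "chr2", "chr3", "chr1_gl000191_random"),
    IRanges(start = c(1, 11, 21, 31), width = 5)
)

# "coarse" drops the non-standard sequence together with its ranges
pruned <- keepStandardChromosomes(gr, pruning.mode = "coarse")
seqlevels(pruned)
length(pruned)

# The default pruning.mode ("error") signals an error instead, because
# the contig to be dropped still has a range associated with it
default_mode <- tryCatch(
    keepStandardChromosomes(gr),
    error = function(e) "error"
)
```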

It is then possible to ask all sorts of questions, e.g., the number of exons on each chromosome

table(seqnames(std_ex))
## 
##  chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9 chr10 chr11 chr12 
## 54957 43673 35902 23108 26710 26200 28551 22424 20967 20752 35167 34142 
## chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22  chrX  chrY 
## 10059 20483 22509 29016 36559 10389 35837 12821  6621 13013 18396  2918 
##  chrM 
##    37

or the identity of the exons with more than 10000 nucleotides.

std_ex[width(std_ex) > 10000]
## GRanges object with 267 ranges and 1 metadata column:
##         seqnames              ranges strand |          GENEID
##            <Rle>           <IRanges>  <Rle> | <CharacterList>
##     [1]     chr1   32485101-32496686      + |          728116
##     [2]     chr1   35919499-35930528      + |           26523
##     [3]     chr1   36055637-36072500      + |          192669
##     [4]     chr1   92387011-92402056      + |           79871
##     [5]     chr1   96813273-96823738      + |           58155
##     ...      ...                 ...    ... .             ...
##   [263]     chrX 140774403-140793215      + |          286411
##   [264]     chrX   73841382-73851592      - |            7503
##   [265]     chrX   73841382-73852723      - |            7503
##   [266]     chrX 132369317-132379677      - |           55796
##   [267]     chrX 138614731-138632986      - |            2258
##   -------
##   seqinfo: 25 sequences (1 circular) from hg38 genome

and of course more scientifically relevant questions.
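
transcripts() and genes() follow the same pattern as exons(). A minimal sketch, assuming the TxDb package used above is installed; `txdb` is just a shorter local name introduced here, and output is omitted.

```r
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene   # shorter alias, for convenience

# One range per annotated transcript
tx <- transcripts(txdb)

# One range per gene, spanning all of its transcripts
gn <- genes(txdb)

tx
gn
```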

2.3 exonsBy(), transcriptsBy(), etc
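
The *By() functions return the same features grouped by another feature, as a GRangesList with one element per group. The sketch below assumes the TxDb package used above is installed and groups exons and transcripts by gene; `txdb` is a local alias introduced for this example.

```r
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene   # shorter alias, for convenience

# Exons grouped by gene; list names are the gene identifiers
ex_by_gene <- exonsBy(txdb, by = "gene")

# Transcripts grouped by gene
tx_by_gene <- transcriptsBy(txdb, by = "gene")

# Number of exons and transcripts for the first few genes
head(lengths(ex_by_gene))
head(lengths(tx_by_gene))
```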

2.4 ensembldb

The ensembldb package provides access to similar, but richer, information from Ensembl, with most data resources available via AnnotationHub; the AnnotationHub query asks for records that include both EnsDb and a particular Ensembl release.

library(ensembldb)
query(AnnotationHub(), c("^EnsDb\\.", "Ensembl 96"))
## snapshotDate(): 2019-05-02
## AnnotationHub with 0 records
## # snapshotDate(): 2019-05-02
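
The query above returned no records for this snapshot. One possible reason, stated here as an assumption rather than a diagnosis, is that the pattern "^EnsDb\\." requires a metadata field to begin with "EnsDb." exactly. A looser sketch drops the anchor and the period, and filters on species instead of release; it needs network access, and the records returned depend on the hub snapshot.

```r
library(AnnotationHub)

ah <- AnnotationHub()

# Records whose metadata mention both "EnsDb" and "Homo sapiens"
ens <- query(ah, c("EnsDb", "Homo sapiens"))

length(ens)
ens
```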

3 Accessing online resources

3.1 biomaRt

library(biomaRt)

Visit the biomart website and figure out how to browse data to retrieve, e.g., genes on chromosomes 21 and 22. You’ll need to browse to the ensembl mart, Homo sapiens data set, establish filters for chromosomes 21 and 22, and then specify that you’d like the Ensembl gene id attribute returned.

Now do the same process in biomaRt:

library(biomaRt)
head(listMarts(), 3)                      ## list marts
head(listDatasets(useMart("ensembl")), 3) ## mart datasets
ensembl <-                                ## fully specified mart
    useMart("ensembl", dataset = "hsapiens_gene_ensembl")

head(listFilters(ensembl), 3)             ## filters
myFilter <- "chromosome_name"
substr(filterOptions(myFilter, ensembl), 1, 50) ## return values
myValues <- c("21", "22")
head(listAttributes(ensembl), 3)          ## attributes
myAttributes <- c("ensembl_gene_id","chromosome_name")

## assemble and query the mart
res <- getBM(attributes =  myAttributes, filters =  myFilter,
             values =  myValues, mart = ensembl)

3.2 KEGGREST

library(KEGGREST)
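
As a brief sketch of what KEGGREST offers (network access is required, and the organism code and gene identifier below are illustrative choices), keggList() enumerates entries in a KEGG database and keggGet() retrieves full records.

```r
library(KEGGREST)

# Human ("hsa") pathways, as a named character vector
pathways <- keggList("pathway", "hsa")
head(pathways)

# Full record for one gene; "hsa:7157" is TP53, picked only as an example
entry <- keggGet("hsa:7157")
names(entry[[1]])
```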

3.3 AnnotationHub

AnnotationHub provides a resource of annotations that are available without requiring an annotation package.

library(AnnotationHub)
ah <- AnnotationHub()

One example of such annotations is the org-style data resources for non-model organisms. Discover available resources using the flexible query() command.

query(ah, "^org\\.")
## AnnotationHub with 1710 records
## # snapshotDate(): 2019-05-02 
## # $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
## # $species: Escherichia coli, 'Chlorella vulgaris'_C-169, 'Klebsiella a...
## # $rdataclass: OrgDb
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass,
## #   tags, rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH70563"]]' 
## 
##             title                                                     
##   AH70563 | org.Ag.eg.db.sqlite                                       
##   AH70564 | org.At.tair.db.sqlite                                     
##   AH70565 | org.Bt.eg.db.sqlite                                       
##   AH70566 | org.Cf.eg.db.sqlite                                       
##   AH70567 | org.Gg.eg.db.sqlite                                       
##   ...       ...                                                       
##   AH73812 | org.Plasmodium_vivax.eg.sqlite                            
##   AH73813 | org.Burkholderia_mallei_ATCC_23344.eg.sqlite              
##   AH73814 | org.Bacillus_cereus_(strain_ATCC_14579_|_DSM_31).eg.sqlite
##   AH73815 | org.Bacillus_cereus_ATCC_14579.eg.sqlite                  
##   AH73816 | org.Schizosaccharomyces_cryophilus_OY26.eg.sqlite

Find out more about a particular resource using [ to select just that resource, or use mcols() on a subset of resources, e.g.,

ah["AH70563"]
## AnnotationHub with 1 record
## # snapshotDate(): 2019-05-02 
## # names(): AH70563
## # $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
## # $species: Anopheles gambiae
## # $rdataclass: OrgDb
## # $rdatadateadded: 2019-04-29
## # $title: org.Ag.eg.db.sqlite
## # $description: NCBI gene ID based annotations about Anopheles gambiae
## # $taxonomyid: 180454
## # $genome: NCBI genomes
## # $sourcetype: NCBI/ensembl
## # $sourceurl: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/, ftp://ftp.ensembl....
## # $sourcesize: NA
## # $tags: c("NCBI", "Gene", "Annotation") 
## # retrieve record with 'object[["AH70563"]]'
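
The mcols() route returns the same metadata as a DataFrame, one row per record, which is convenient for filtering many resources at once. A minimal sketch (network access is needed on first use; output is omitted because it depends on the hub snapshot):

```r
library(AnnotationHub)

ah <- AnnotationHub()
orgs <- query(ah, "^org\\.")

# Metadata for the subset, one row per record
meta <- mcols(orgs)

# A few informative columns for the first few records
head(meta[, c("title", "species", "rdatadateadded")], 3)
```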

Retrieve and use a resource by using [[ with the corresponding identifier, e.g.,

org <- ah[["AH70563"]]
## downloading 0 resources
## loading from cache 
##     'AH70563 : 77309'
org
## OrgDb object:
## | DBSCHEMAVERSION: 2.1
## | Db type: OrgDb
## | Supporting package: AnnotationDbi
## | DBSCHEMA: ANOPHELES_DB
## | ORGANISM: Anopheles gambiae
## | SPECIES: Anopheles
## | EGSOURCEDATE: 2019-Apr26
## | EGSOURCENAME: Entrez Gene
## | EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
## | CENTRALID: EG
## | TAXID: 180454
## | GOSOURCENAME: Gene Ontology
## | GOSOURCEURL: ftp://ftp.geneontology.org/pub/go/godatabase/archive/latest-lite/
## | GOSOURCEDATE: 2019-Apr24
## | GOEGSOURCEDATE: 2019-Apr26
## | GOEGSOURCENAME: Entrez Gene
## | GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
## | KEGGSOURCENAME: KEGG GENOME
## | KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
## | KEGGSOURCEDATE: 2011-Mar15
## | GPSOURCENAME: UCSC Genome Bioinformatics (Anopheles gambiae)
## | GPSOURCEURL: 
## | GPSOURCEDATE: 2018-Oct2
## | ENSOURCEDATE: 2019-Apr08
## | ENSOURCENAME: Ensembl
## | ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta
## 
## Please see: help('select') for usage information

Determine the central key, and the columns that can be mapped between

chooseCentralOrgPkgSymbol(org)
## [1] "ENTREZID"
columns(org)
##  [1] "ACCNUM"       "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [5] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL" 
##  [9] "GENENAME"     "GO"           "GOALL"        "ONTOLOGY"    
## [13] "ONTOLOGYALL"  "PATH"         "PMID"         "REFSEQ"      
## [17] "SYMBOL"       "UNIGENE"      "UNIPROT"

Here are some Entrez identifiers, and their corresponding symbols for Anopheles gambiae, either allowing for 1:many maps (select()) or enforcing 1:1 maps (mapIds()). We use AnnotationDbi::select() to disambiguate between the select() generic defined in AnnotationDbi and the select() generic defined in dplyr: these methods have incompatible signatures and ‘contracts’, and so must be invoked in a way that resolves our intention explicitly.

library(dplyr)    # for `%>%`
eid <- head(keys(org))
AnnotationDbi::select(org, eid, "SYMBOL", "ENTREZID")
## 'select()' returned 1:1 mapping between keys and columns
##   ENTREZID          SYMBOL
## 1  1267437 AgaP_AGAP012606
## 2  1267439 AgaP_AGAP012559
## 3  1267440 AgaP_AGAP012558
## 4  1267447 AgaP_AGAP012586
## 5  1267450 AgaP_AGAP012834
## 6  1267459 AgaP_AGAP012589
eid %>%
    mapIds(x = org, "SYMBOL", "ENTREZID") %>%
    tibble::enframe("ENTREZID", "SYMBOL")
## 'select()' returned 1:1 mapping between keys and columns
## # A tibble: 6 x 2
##   ENTREZID SYMBOL         
##   <chr>    <chr>          
## 1 1267437  AgaP_AGAP012606
## 2 1267439  AgaP_AGAP012559
## 3 1267440  AgaP_AGAP012558
## 4 1267447  AgaP_AGAP012586
## 5 1267450  AgaP_AGAP012834
## 6 1267459  AgaP_AGAP012589

3.4 ExperimentHub

ExperimentHub is analogous to AnnotationHub, but contains curated experimental results. Increasingly, ExperimentHub packages are provided to document and ease access to these resources. A great example of an ExperimentHub package is curatedTCGAData.

library(ExperimentHub)
library(curatedTCGAData)
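
Because ExperimentHub mirrors the AnnotationHub interface, the hub can be queried directly in the same way. The sketch below (network access is needed on first use) searches for records associated with curatedTCGAData; the search term is simply the package name.

```r
library(ExperimentHub)

eh <- ExperimentHub()

# Records whose metadata mention the curatedTCGAData package
tcga <- query(eh, "curatedTCGAData")

length(tcga)
tcga
```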

The curatedTCGAData package provides an interface to a collection of resources available through ExperimentHub. The interface is straightforward. Use curatedTCGAData() to discover available types of data, choosing assay types after identifying cancer types.

curatedTCGAData()
## Please see the list below for available cohorts and assays
## Available Cancer codes:
##  ACC BLCA BRCA CESC CHOL COAD DLBC ESCA GBM HNSC KICH
##  KIRC KIRP LAML LGG LIHC LUAD LUSC MESO OV PAAD PCPG
##  PRAD READ SARC SKCM STAD TGCT THCA THYM UCEC UCS UVM 
## Available Data Types:
##  CNACGH CNACGH_CGH_hg_244a
##  CNACGH_CGH_hg_415k_g4124a CNASeq CNASNP
##  CNVSNP GISTIC_AllByGene GISTIC_Peaks
##  GISTIC_ThresholdedByGene Methylation
##  Methylation_methyl27 Methylation_methyl450
##  miRNAArray miRNASeqGene mRNAArray
##  mRNAArray_huex mRNAArray_TX_g4502a
##  mRNAArray_TX_g4502a_1
##  mRNAArray_TX_ht_hg_u133a Mutation
##  RNASeq2GeneNorm RNASeqGene RPPAArray
curatedTCGAData("BRCA")
##                                         Title DispatchClass
## 31                       BRCA_CNASeq-20160128           Rda
## 32                       BRCA_CNASNP-20160128           Rda
## 33                       BRCA_CNVSNP-20160128           Rda
## 35             BRCA_GISTIC_AllByGene-20160128           Rda
## 36                 BRCA_GISTIC_Peaks-20160128           Rda
## 37     BRCA_GISTIC_ThresholdedByGene-20160128           Rda
## 39  BRCA_Methylation_methyl27-20160128_assays        H5File
## 40      BRCA_Methylation_methyl27-20160128_se           Rds
## 41 BRCA_Methylation_methyl450-20160128_assays        H5File
## 42     BRCA_Methylation_methyl450-20160128_se           Rds
## 43                 BRCA_miRNASeqGene-20160128           Rda
## 44                    BRCA_mRNAArray-20160128           Rda
## 45                     BRCA_Mutation-20160128           Rda
## 46              BRCA_RNASeq2GeneNorm-20160128           Rda
## 47                   BRCA_RNASeqGene-20160128           Rda
## 48                    BRCA_RPPAArray-20160128           Rda
curatedTCGAData("BRCA", c("RNASeqGene", "CNVSNP"))
##                       Title DispatchClass
## 33     BRCA_CNVSNP-20160128           Rda
## 47 BRCA_RNASeqGene-20160128           Rda

Adding dry.run = FALSE triggers the actual download (first time only) of the data from ExperimentHub, and presentation to the user as a MultiAssayExperiment.

mae <- curatedTCGAData("BRCA", c("RNASeqGene", "CNVSNP"), dry.run=FALSE)
mae
## A MultiAssayExperiment object of 2 listed
##  experiments with user-defined names and respective classes. 
##  Containing an ExperimentList class object of length 2: 
##  [1] BRCA_CNVSNP-20160128: RaggedExperiment with 284458 rows and 2199 columns 
##  [2] BRCA_RNASeqGene-20160128: SummarizedExperiment with 20502 rows and 878 columns 
## Features: 
##  experiments() - obtain the ExperimentList instance 
##  colData() - the primary/phenotype DataFrame 
##  sampleMap() - the sample availability DataFrame 
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment 
##  *Format() - convert into a long or wide DataFrame 
##  assays() - convert ExperimentList to a SimpleList of matrices

It is then easy to work with these data, via individual assays or in a more integrative analysis. For example, the distribution of library sizes in the RNASeq data can be visualized with

mae[["BRCA_RNASeqGene-20160128"]] %>%
    assay() %>%
    colSums() %>%
    density() %>%
    plot(main = "TCGA BRCA RNASeq Library Size")

4 Annotating variants

4.1 VariantAnnotation

library(VariantAnnotation)
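
As a starting point, VariantAnnotation can read a VCF file into a VCF object. This sketch uses the small example file shipped with the package, so no external data are needed; the genome label "hg19" is the one conventionally used with that file.

```r
library(VariantAnnotation)

# Example VCF distributed with the package
fl <- system.file("extdata", "chr22.vcf.gz", package = "VariantAnnotation")

vcf <- readVcf(fl, genome = "hg19")
vcf

# Genomic positions of the variants
head(rowRanges(vcf), 3)

# Genotype calls for the first few variants and samples
geno(vcf)$GT[1:3, 1:3]
```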

4.2 ensemblVEP

library(ensemblVEP)

5 Provenance

sessionInfo()
## R version 3.6.0 Patched (2019-04-26 r76431)
## Platform: x86_64-apple-darwin17.7.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS:   /Users/ma38727/bin/R-3-6-branch/lib/libRblas.dylib
## LAPACK: /Users/ma38727/bin/R-3-6-branch/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] ensemblVEP_1.27.0                      
##  [2] VariantAnnotation_1.31.3               
##  [3] Rsamtools_2.1.2                        
##  [4] Biostrings_2.53.0                      
##  [5] XVector_0.25.0                         
##  [6] RaggedExperiment_1.9.0                 
##  [7] curatedTCGAData_1.7.0                  
##  [8] MultiAssayExperiment_1.11.4            
##  [9] SummarizedExperiment_1.15.5            
## [10] DelayedArray_0.11.2                    
## [11] BiocParallel_1.19.0                    
## [12] matrixStats_0.54.0                     
## [13] ExperimentHub_1.11.1                   
## [14] KEGGREST_1.25.0                        
## [15] biomaRt_2.41.3                         
## [16] ensembldb_2.9.2                        
## [17] AnnotationFilter_1.9.0                 
## [18] TxDb.Hsapiens.UCSC.hg38.knownGene_3.4.6
## [19] GenomicFeatures_1.37.3                 
## [20] GenomicRanges_1.37.14                  
## [21] GenomeInfoDb_1.21.1                    
## [22] dplyr_0.8.2                            
## [23] GO.db_3.8.2                            
## [24] org.Hs.eg.db_3.8.2                     
## [25] AnnotationDbi_1.47.0                   
## [26] IRanges_2.19.10                        
## [27] S4Vectors_0.23.17                      
## [28] Biobase_2.45.0                         
## [29] AnnotationHub_2.17.3                   
## [30] BiocFileCache_1.9.1                    
## [31] dbplyr_1.4.2                           
## [32] BiocGenerics_0.31.4                    
## [33] BiocStyle_2.13.2                       
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.4.0                    bit64_0.9-7                  
##  [3] shiny_1.3.2                   assertthat_0.2.1             
##  [5] interactiveDisplayBase_1.23.0 BiocManager_1.30.5.1         
##  [7] blob_1.1.1                    BSgenome_1.53.0              
##  [9] GenomeInfoDbData_1.2.1        yaml_2.2.0                   
## [11] progress_1.2.2                lattice_0.20-38              
## [13] pillar_1.4.2                  RSQLite_2.1.1                
## [15] backports_1.1.4               glue_1.3.1                   
## [17] digest_0.6.19                 promises_1.0.1               
## [19] htmltools_0.3.6               httpuv_1.5.1                 
## [21] Matrix_1.2-17                 XML_3.98-1.20                
## [23] pkgconfig_2.0.2               bookdown_0.11                
## [25] zlibbioc_1.31.0               purrr_0.3.2                  
## [27] xtable_1.8-4                  later_0.8.0                  
## [29] tibble_2.1.3                  lazyeval_0.2.2               
## [31] cli_1.1.0                     magrittr_1.5                 
## [33] crayon_1.3.4                  mime_0.7                     
## [35] memoise_1.1.0                 evaluate_0.14                
## [37] fansi_0.4.0                   tools_3.6.0                  
## [39] prettyunits_1.0.2             hms_0.4.2                    
## [41] stringr_1.4.0                 compiler_3.6.0               
## [43] rlang_0.4.0                   grid_3.6.0                   
## [45] RCurl_1.95-4.12               rappdirs_0.3.1               
## [47] bitops_1.0-6                  rmarkdown_1.13               
## [49] codetools_0.2-16              DBI_1.0.0                    
## [51] curl_3.3                      R6_2.4.0                     
## [53] GenomicAlignments_1.21.4      knitr_1.23                   
## [55] rtracklayer_1.45.1            bit_1.1-14                   
## [57] utf8_1.1.4                    zeallot_0.1.0                
## [59] ProtGenerics_1.17.2           stringi_1.4.3                
## [61] Rcpp_1.0.1                    png_0.1-7                    
## [63] vctrs_0.1.0                   tidyselect_0.2.5             
## [65] xfun_0.8