---
title: "TCGAloading"
author: "Marcel Ramos, Levi Waldron"
date: "July 8, 2015"
vignette: >
  %\VignetteIndexEntry{TCGAloading}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output: BiocStyle::html_document
---

# Overview

This vignette shows how to load TCGA multi-assay datasets and convert
them to Bioconductor core objects (ExpressionSet, GRanges, and
GRangesList). It uses the RTCGAToolbox Bioconductor library, but
currently you must use the pre-devel version at:

```{r, eval=FALSE}
library(BiocInstaller)
biocLite("LiNk-NY/RTCGAToolbox")
```

# Loading data from TCGA

All fully-open data from TCGA that is accessible through the
[firehose_get](https://confluence.broadinstitute.org/display/GDAC/Download)
command-line program.

You have to select a "run date" for processed data:

```{r, message=FALSE}
library(RTCGAToolbox)
(rundates <- getFirehoseRunningDates())
```

And an "analysis date" for analyzed data, such as GISTIC2 regions of
recurrent copy number variation:

```{r}
(analysisdates <- getFirehoseAnalyzeDates())
```

For exactly reproduceable results you would hard-code one of these,
since data can change over time as samples are added or new algorithms
are used for processing and analysis, but here will just use the most
recent versions rundates[1] and analysisdates[1].

The following commands (not evaluated) will fetch data for five cancer
types:

```{r, eval=FALSE}
ov <- getFirehoseData("OV", runDate=rundates[1], gistic2_Date=analysisdates[1], RNAseq_Gene=TRUE, 
        miRNASeq_Gene=TRUE, RNAseq2_Gene_Norm=TRUE, CNA_SNP=TRUE, CNV_SNP=TRUE, CNA_Seq=TRUE, 
        CNA_CGH=TRUE,  Methylation=TRUE, Mutation=TRUE, mRNA_Array=TRUE, miRNA_Array=TRUE, 
        RPPA=TRUE, todir = NULL)
 
gbm <- getFirehoseData("GBM", runDate=rundates[1], gistic2_Date=analysisdates[1], RNAseq_Gene=TRUE, 
       	miRNASeq_Gene=TRUE, RNAseq2_Gene_Norm=TRUE, CNA_SNP=TRUE, CNV_SNP=TRUE, CNA_Seq=TRUE, 
        CNA_CGH=TRUE,  Methylation=TRUE, Mutation=TRUE, mRNA_Array=TRUE, miRNA_Array = TRUE,  
        RPPA=TRUE, todir = NULL)
 
coad <- getFirehoseData("COAD", runDate=rundates[1], gistic2_Date=analysisdates[1], RNAseq_Gene=TRUE, 
        miRNASeq_Gene=TRUE, RNAseq2_Gene_Norm=TRUE, CNA_SNP = TRUE, CNV_SNP=TRUE, CNA_Seq = TRUE, 
        CNA_CGH = TRUE, Methylation = TRUE, Mutation = TRUE, mRNA_Array = TRUE, miRNA_Array = TRUE, 
        RPPA = TRUE, todir = NULL)
 
laml <- getFirehoseData("LAML", runDate=rundates[1], gistic2_Date=analysisdates[1], RNAseq_Gene=TRUE, 
          miRNASeq_Gene=TRUE, RNAseq2_Gene_Norm=TRUE, CNA_SNP = TRUE, CNV_SNP=TRUE, CNA_Seq = TRUE, 
          CNA_CGH = TRUE, Methylation = TRUE, Mutation = TRUE, mRNA_Array = TRUE, miRNA_Array = TRUE, 
          RPPA = TRUE, todir = NULL)
 
blca <- getFirehoseData("BLCA", runDate=rundates[1], gistic2_Date=analysisdates[1], RNAseq_Gene=TRUE, 
          miRNASeq_Gene=TRUE, RNAseq2_Gene_Norm=TRUE, CNA_SNP = TRUE, CNV_SNP=TRUE, CNA_Seq = TRUE, 
          CNA_CGH = TRUE, Methylation = TRUE, Mutation = TRUE, mRNA_Array = TRUE, miRNA_Array = TRUE, 
          RPPA = TRUE, todir = NULL)
```

Instead, we'll just load the ovarian cancer dataset from the
bioc2015multiomicsworkshop for demonstration:

```{r}
library(bioc2015multiomicsworkshop)
data(laml)
laml
```

What we really want are Bioconductor core data objects.  The extract()
function from RTCGAToolbox creates the appropriate object.  If
clinical=FALSE, it will create simple matrices and GRanges, if
clinical=TRUE, it will create ExpressionSet and SummarizedExperiment
objects.  

Note that there is an unfortunate inconsistency in the the names of
the assays between what we saw above in the show method above for the
laml object, and in the arguments to getFirehoseData().  The extract()
function uses the same data types as in the arguments to
getFirehoseData(), with case and underscore-insensitive matching.  The
following choices are available:

```{r}
choices <- tolower(gsub("_", "", c("RNAseq_Gene", "miRNASeq_Gene",
             "RNAseq2_Gene_Norm", "CNA_SNP", "CNV_SNP", "CNA_Seq",
             "CNA_CGH", "Methylation", "Mutation", "mRNA_Array",
             "miRNA_Array", "RPPA")))
```

For example, for copy number we get a GRangesList:

```{r}
cna <- extract(laml, "cnasnp", clinical=TRUE)
cna
```

And for RNAseq we get an ExpressionSet:

```{r}
rnaseq <- extract(laml, "rnaseqgene", clinical=TRUE)
rnaseq
```

There is partial but not full overlap in the patients:
```{r}
summary(names(cna) %in% sampleNames(rnaseq))
```

They have the same clinical data, albeit as a DataFrame for the
GRangesList, and an AnnotatedDataFrame for the ExpressionSet:

```{r}
elementMetadata(cna)[1:2, 1:4]
pData(rnaseq)[1:2, 1:4]
```

Note that in the first two rows of the cna object, we have two different samples for the same patient (tcga-ab-2802 is the patient identifier, -03 and -11 are sample types).  See the [TCGA barcode information](https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode), and the [Code Tables Report](https://tcga-data.nci.nih.gov/datareports/codeTablesReport.htm?codeTable=Sample%20type) for an explanation of the sample types. 03 is "Primary Blood Derived Cancer - Peripheral Blood" and 11 is "Solid Tissue Normal"