BiocIntro 0.0.10
Description: This workshop will introduce you to the Bioconductor collection of R packages for statistical analysis and comprehension of high-throughput genomic data. The emphasis is on data exploration, using RNA-sequence gene expression experiments as a motivating example. How can I access common sequence data formats from R? How can I use information about gene models or gene annotations in my analysis? How do the properties of my data influence the statistical analyses I should perform? What common workflows can I perform with R and Bioconductor? How do I deal with very large data sets in R? These are the sorts of questions that will be tackled in this workshop.
Requirements: You will need to bring your own laptop. The workshop will use cloud-based resources, so your laptop will need a web browser and WiFi capabilities. Participants should have used R and RStudio for tasks such as those covered in introductory workshops earlier in the week. Some knowledge of the biology of gene expression and of concepts learned in a first course in statistics will be helpful.
Relevance: This workshop is relevant to anyone eager to explore
genomic data in R. The workshop will help connect ‘core’ R concepts
for working with data (e.g., data management via data.frame()
,
statistical modelling with lm()
or t.test()
, visualization using
plot()
or ggplot()
) to the special challenges of working with large
genomic data sets. It will be especially helpful to those who have or
will have their own genomic data, and are interested in more fully
understanding how to work with it in R.
RNA-seq
cell dex
SRR1039508 N61311 untrt
SRR1039509 N61311 trt
SRR1039512 N052611 untrt
SRR1039513 N052611 trt
SRR1039516 N080611 untrt
SRR1039517 N080611 trt
SRR1039520 N061011 untrt
SRR1039521 N061011 trt
source: http://bio.lundberg.gu.se/courses/vt13/rnaseq.html
SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
ENSG00000000003 679 448 873 408 1138
ENSG00000000005 0 0 0 0 0
ENSG00000000419 467 515 621 365 587
... ... ... ... ... ...
SRR1039517 SRR1039520 SRR1039521
ENSG00000000003 1047 770 572
ENSG00000000005 0 0 0
ENSG00000000419 799 417 508
... ... ... ...
Research question
untrt
and trt
experimental treatments?Our goal
Sample information
read.table()
integer()
factor()
and NA
data.frame()
: coordinated management
$
[ , ]
samples_file <-
system.file(package="BiocIntro", "extdata", "samples.tsv")
samples <- read.table(samples_file)
samples
## cell dex avgLength
## SRR1039508 N61311 untrt 126
## SRR1039509 N61311 trt 126
## SRR1039512 N052611 untrt 126
## SRR1039513 N052611 trt 87
## SRR1039516 N080611 untrt 120
## SRR1039517 N080611 trt 126
## SRR1039520 N061011 untrt 101
## SRR1039521 N061011 trt 98
samples$dex <- relevel(samples$dex, "untrt")
Counts
head()
to view the first few.matrix()
rather than data.frame()
counts_file <-
system.file(package="BiocIntro", "extdata", "counts.tsv")
counts <- read.table(counts_file)
dim(counts)
## [1] 63677 8
head(counts)
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000000003 679 448 873 408 1138
## ENSG00000000005 0 0 0 0 0
## ENSG00000000419 467 515 621 365 587
## ENSG00000000457 260 211 263 164 245
## ENSG00000000460 60 55 40 35 78
## ENSG00000000938 0 0 2 0 1
## SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003 1047 770 572
## ENSG00000000005 0 0 0
## ENSG00000000419 799 417 508
## ENSG00000000457 331 233 229
## ENSG00000000460 63 76 60
## ENSG00000000938 0 0 0
counts <- as.matrix(counts)
GRanges
)Row annotations.
url <- "ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz"
BiocFileCache
to download the resource once to a location
that persists across R sessions.library(BiocFileCache)
About Bioconductor packages
BiocFileCache
is available from https://bioconductor.orgBiocFileCache
from the BiocFileCache ‘landing
page’.BiocFileCache
package vignette (access the
vignette from within R: browseVignettes("BiocFileCache")
).BiocManager::install("BiocFileCache")
.?bfcrpath
.gtf_file <- bfcrpath(rnames = url)
## Using temporary cache /var/folders/yn/gmsh_22s2c55v816r6d51fx1tnyl61/T//RtmpLUzYG5/BiocFileCache
## adding rname 'ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz'
read.table()
or similar, but contain structured information that
we want to represent in R.Common sequence data formats
Use the rtracklayer package to import the file.
library(rtracklayer)
gtf <- import(gtf_file)
A GRanges
object
seqnames()
(e.g.,
chromosome), start()
/ end()
/ width()
, strand()
, etc.$
or mcols()$
to access annotations on ranges.gene_id
column as names()
.rowidx <- gtf$type == "gene"
colidx <- c("gene_id", "gene_name", "gene_biotype")
genes <- gtf[rowidx, colidx]
names(genes) <- genes$gene_id
genes$gene_id <- NULL
genes
## GRanges object with 63677 ranges and 2 metadata columns:
## seqnames ranges strand | gene_name gene_biotype
## <Rle> <IRanges> <Rle> | <character> <character>
## ENSG00000223972 1 11869-14412 + | DDX11L1 pseudogene
## ENSG00000227232 1 14363-29806 - | WASH7P pseudogene
## ENSG00000243485 1 29554-31109 + | MIR1302-10 lincRNA
## ENSG00000237613 1 34554-36081 - | FAM138A lincRNA
## ENSG00000268020 1 52473-54936 + | OR4G4P pseudogene
## ... ... ... ... . ... ...
## ENSG00000198695 MT 14149-14673 - | MT-ND6 protein_coding
## ENSG00000210194 MT 14674-14742 - | MT-TE Mt_tRNA
## ENSG00000198727 MT 14747-15887 + | MT-CYB protein_coding
## ENSG00000210195 MT 15888-15953 + | MT-TT Mt_tRNA
## ENSG00000210196 MT 15956-16023 - | MT-TP Mt_tRNA
## -------
## seqinfo: 265 sequences from an unspecified genome; no seqlengths
SummarizedExperiment
)Three different data sets
counts
: results of the RNAseq workflowsamples
: sample and experimental design informationgenes
: information about the genes that we’ve assayed.Coordinate our manipulation
Use the SummarizedExperiment package and data representation.
[,]
assay()
, rowData()
,
rowRanges()
, colData()
, etc.library(SummarizedExperiment)
samples
rows match the order of the
samples in the columns of the counts
matrix, and the order of the
genes
rows match the order of the rows of the counts
matrix.SummarizedExperiment
to coordinate our data manipulation.samples <- samples[colnames(counts),]
genes <- genes[rownames(counts),]
se <- SummarizedExperiment(
assays = list(counts = counts),
rowRanges = genes, colData = samples
)
se
## class: RangedSummarizedExperiment
## dim: 63677 8
## metadata(0):
## assays(1): counts
## rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
## ENSG00000273493
## rowData names(2): gene_name gene_biotype
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(3): cell dex avgLength
Gestalt
t.test()
for each row of the count matrix, asking
whether the trt
samples have on average counts that differ from
the untrt
samples.The DESeq2 pacakge
library(DESeq2)
dds <- DESeqDataSet(se, ~ cell + dex)
fit <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
destats <- results(fit)
destats
## log2 fold change (MLE): dex trt vs untrt
## Wald test p-value: dex trt vs untrt
## DataFrame with 63677 rows and 6 columns
## baseMean log2FoldChange lfcSE
## <numeric> <numeric> <numeric>
## ENSG00000000003 708.602169691234 -0.38125388742934 0.100654430181804
## ENSG00000000005 0 NA NA
## ENSG00000000419 520.297900552084 0.206812715390398 0.112218674568195
## ENSG00000000457 237.163036796015 0.0379205923946151 0.14344471633862
## ENSG00000000460 57.9326331250967 -0.0881676962628265 0.287141995236272
## ... ... ... ...
## ENSG00000273489 0.275899382507797 1.48372584344306 3.51394515550546
## ENSG00000273490 0 NA NA
## ENSG00000273491 0 NA NA
## ENSG00000273492 0.105978355992386 -0.463691271907546 3.52308373749196
## ENSG00000273493 0.106141666408122 -0.521381077922898 3.53139001322807
## stat pvalue padj
## <numeric> <numeric> <numeric>
## ENSG00000000003 -3.78775069056286 0.000152017272514002 0.00128292609656079
## ENSG00000000005 NA NA NA
## ENSG00000000419 1.84294384322566 0.0653372100662581 0.196469601297369
## ENSG00000000457 0.26435684326705 0.791504962999781 0.91141814384918
## ENSG00000000460 -0.307052600196215 0.75880333554496 0.895006448013164
## ... ... ... ...
## ENSG00000273489 0.422239328669782 0.672850337762336 NA
## ENSG00000273490 NA NA NA
## ENSG00000273491 NA NA NA
## ENSG00000273492 -0.131615171950935 0.895288684444562 NA
## ENSG00000273493 -0.147641884914972 0.88262539793309 NA
SummarizedExperiment
, so that we can
manipulate these in a coordianted fashion too.rowData(se) <- cbind(rowData(se), destats)
order()
and head()
to identify the row indexes of the top 20
most differentially expressed (based on adjusted P-value) genes.SummarizedExperiment
to contain just these rows.assay()
of our subset as a heatmaptop20idx <- head( order(rowData(se)$padj), 20)
top20 <- se[top20idx,]
heatmap(assay(top20))