--- title: "Appendix: Gene Expression Using salmon" author: "Martin Morgan % \VignetteIndexEntry{Appendix: Gene Expression Using salmon} % \VignetteEngine{knitr::rmarkdown} --- ```{r style, echo = FALSE, results = 'asis'} knitr::opts_chunk$set( eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")), cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")) ) suppressPackageStartupMessages({ library(Biostrings) library(ShortRead) library(rtracklayer) library(tximport) library(SummarizedExperiment) library(DESeq2) library(tidyverse) }) ``` # Introduction This describes some of the 'upstream' steps required for known-gene differential expression analysis, using the ultra-fast alignment work flow that goes from FASTQ files to count data without intermediate BAM files. We follow the [salmon tutorial][], _R_-ified in the later stages. We assume a Linux operating system with folder `~/a/salmon/tutorial`, and with `salmon` installed at `~/bin/salmon`. ```{r} TUTORIAL <- "~/a/salmon/tutorial" SALMON <- "~/bin/salmon" ``` [salmon tutorial]: https://combine-lab.github.io/salmon/getting_started/ # Download The tutorial requires FASTQ files representing (paired-end) sequenced samples. We retrieve them from the short read archive (SRA) using the `wget` command line tool. ``` #!/bin/bash DIR=~/a/tutorial SRA=ftp://ftp.sra.ebi.ac.uk/vol1/fastq for i in `seq 28 40`; do mkdir -p ${DIR}/data/DRR0161${i}; cd ${DIR}/data/DRR0161${i}; wget -bqc ${SRA}/DRR016/DRR0161${i}/DRR0161${i}_1.fastq.gz; wget -bqc ${SRA}/DRR016/DRR0161${i}/DRR0161${i}_2.fastq.gz; done cd $DIR ``` We will align the sequences to a reference transcriptome, in this case from Ensembl for _Arabidopsis thaliana_. We download the sequence and index it. ``` #! /bin/bash DIR=~/a/tutorial ENSEMBL=ftp://ftp.ensemblgenomes.org/pub/plants/release-28 curl \ ${ENSEMBL}/fasta/arabidopsis_thaliana/cdna/Arabidopsis_thaliana.TAIR10.28.cdna.all.fa.gz \ -o ${DIR}/athal.fa.gz salmon index -t ${DIR}/athal.fa.gz -i ${DIR}/athal_index ``` If we were interested, it's easy enough to read this in to _R_ ```{r} library(Biostrings) readDNAStringSet(file.path(TUTORIAL, "athal.fa.gz")) ``` We also need to know how transcripts map to genes since we'll do a gene-level analysis. We download the annotation file (gff3) matching the transcript data base. ``` #! /bin/bash DIR=~/a/tutorial ENSEMBL=ftp://ftp.ensemblgenomes.org/pub/plants/release-28 curl \ ${ENSEMBL}/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.28.gff3.gz\ -o ${DIR}/Arabidopsis_thaliana.TAIR10.28.gff3.gz ``` # Quantification The tutorial bash script invokes `salmon` on each sample, using the paired-end FASTQ files. ``` #!/bin/bash DIR=/home/ubuntu/tutorial SALMON=/home/ubuntu/bin/salmon for fn in $DIR/data/DRR0161{25..40}; do sample=`basename ${fn}` echo "Processing sample ${samp}" ${SALMON} quant -i athal_index -l A \ -1 ${fn}/${sample}_1.fastq.gz \ -2 ${fn}/${sample}_2.fastq.gz \ -p 8 -o quants/${sample}_quant done ``` An _R_ version to quantify one sample is ```{r} quantify = function(sample_file, output, salmon, doit = TRUE) { sample = basename(sample_file) samples = file.path(sample_file, paste0(sample, "_", 1:2, ".fastq.gz")) output = file.path(output, paste0(sample, "_quant")) args = c( "quant", "-i", index, "-l A", "-1", samples[1], "-2", samples[2], "-p", parallel::detectCores(), "-o", output ) if (doit) system2(salmon, args) else { txt = paste(salmon, paste(args, collapse=" ")) message(txt) invisible(0) } } ``` Here we quantify all samples (set `doit = TRUE` to actually perform the counting step. ```{r, eval = FALSE} index = file.path(TUTORIAL, "athal_index") output = file.path(TUTORIAL, "quants") data_dir = file.path(TUTORIAL, "data") sample_files = normalizePath(dir(data_dir, full = TRUE)) args = list(salmon = SALMON, output = output, doit = FALSE) Map(quantify, sample_files, MoreArgs = args) ``` # Analysis in _R_ We need to provide a mapping between the transcripts that the reads were aligned to and the genes that we will perform the analysis on. We import the GFF annotation file, and extract the transcripts and their parents (genes) as a tibble. ## Transcript - gene map ```{r} library(rtracklayer) library(tidyverse) file = file.path(TUTORIAL, "Arabidopsis_thaliana.TAIR10.29.gff3") gff = import(file) tx2gene = tibble( txid = gff$transcript_id, geneid = as.character(gff$Parent) ) %>% na.omit() ``` ## Input To import the count data, we use the [tximport][] package to import the data. We find the relevant count files, and provide names to identify each file (these names are propagated by the import function to the column names of the output count matrix). ```{r} library(tximport) library(SummarizedExperiment) files = dir( file.path(TUTORIAL, "quants"), pattern = ".sf", recursive = TRUE, full = TRUE ) names(files) = sub("_quant", "", basename(dirname(files))) counts = tximport(files, type = "salmon", tx2gene = tx2gene) names(counts) ``` The input is a list with three identically-dimensioned matrices and a fourth element describing how the other elements were determine. We put the three matrices into a SummarizedExperiment, to connect up with the [DESeq2][] work flow ``` library(SummarizedExperiment) se = SummarizedExperiment(counts[-4]) ``` ## Experimental design Next steps: add the experimental design as `colData()` on the `SummarizedExperiment`. [tximport]: https://bioconductor.org/packages/tximport [DESeq2]: https://bioconductor.org/packages/DESeq2