This describes some of the ‘upstream’ steps required for known-gene differential expression analysis, using the ultra-fast alignment work flow that goes from FASTQ files to count data without intermediate BAM files.
We follow the salmon tutorial, R-ified in the later stages. We assume a Linux operating system with folder ~/a/salmon/tutorial
, and with salmon
installed at ~/bin/salmon
.
TUTORIAL <- "~/a/salmon/tutorial"
SALMON <- "~/bin/salmon"
The tutorial requires FASTQ files representing (paired-end) sequenced samples. We retrieve them from the short read archive (SRA) using the wget
command line tool.
#!/bin/bash
DIR=~/a/tutorial
SRA=ftp://ftp.sra.ebi.ac.uk/vol1/fastq
for i in `seq 28 40`;
do
mkdir -p ${DIR}/data/DRR0161${i};
cd ${DIR}/data/DRR0161${i};
wget -bqc ${SRA}/DRR016/DRR0161${i}/DRR0161${i}_1.fastq.gz;
wget -bqc ${SRA}/DRR016/DRR0161${i}/DRR0161${i}_2.fastq.gz;
done
cd $DIR
We will align the sequences to a reference transcriptome, in this case from Ensembl for Arabidopsis thaliana. We download the sequence and index it.
#! /bin/bash
DIR=~/a/tutorial
ENSEMBL=ftp://ftp.ensemblgenomes.org/pub/plants/release-28
curl \
${ENSEMBL}/fasta/arabidopsis_thaliana/cdna/Arabidopsis_thaliana.TAIR10.28.cdna.all.fa.gz \
-o ${DIR}/athal.fa.gz
salmon index -t ${DIR}/athal.fa.gz -i ${DIR}/athal_index
If we were interested, it’s easy enough to read this in to R
library(Biostrings)
readDNAStringSet(file.path(TUTORIAL, "athal.fa.gz"))
## A DNAStringSet instance of length 41392
## width seq names
## [1] 462 ATGTCCCTTCTGTTTCAACA...GAAGAGATGTCGGCATTAA ATMG00010.1 cdna:...
## [2] 324 ATGTTTCAGTTCGCTAAGTT...CTTTCTTTTCAACTCGTAA ATMG00030.1 cdna:...
## [3] 948 ATGACAAAGCGTGAGTATAA...CAGACATTATGTCGACTAG ATMG00040.1 cdna:...
## [4] 396 ATGTCTGGCGTTTACCTGAC...AAGCGGCGGGTTCGGGTAA ATMG00050.1 cdna:...
## [5] 542 CCAATTTTTGGGCCAATTCC...AAAGTCAAGTCAAGAATAA ATMG00060.1 cdna:...
## ... ... ...
## [41388] 952 GCCATTCCACATTTTTTTAT...ATGAAAAGAAGTAGGAATT AT1G80970.2 cdna:...
## [41389] 1418 GAACATAGTTAGACCGTGTA...CTTTTGTTTGTTGAATCAC AT1G80980.1 cdna:...
## [41390] 690 ATGGATAGATTGAGTGACTT...TCGTGACGACTGCTGCTGA AT1G80990.1 cdna:...
## [41391] 276 ATGTTATGTGCGTGTTGCGT...TGCTTTTTCGGATGATTGA AT1G81001.1 cdna:...
## [41392] 739 AAAGAAACCTCTTCCTCTAA...ATTAGAATAACATTATTTT AT1G81020.1 cdna:...
We also need to know how transcripts map to genes since we’ll do a gene-level analysis. We download the annotation file (gff3) matching the transcript data base.
#! /bin/bash
DIR=~/a/tutorial
ENSEMBL=ftp://ftp.ensemblgenomes.org/pub/plants/release-28
curl \
${ENSEMBL}/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.28.gff3.gz\
-o ${DIR}/Arabidopsis_thaliana.TAIR10.28.gff3.gz
The tutorial bash script invokes salmon
on each sample, using the paired-end FASTQ files.
#!/bin/bash
DIR=/home/ubuntu/tutorial
SALMON=/home/ubuntu/bin/salmon
for fn in $DIR/data/DRR0161{25..40};
do
sample=`basename ${fn}`
echo "Processing sample ${samp}"
${SALMON} quant -i athal_index -l A \
-1 ${fn}/${sample}_1.fastq.gz \
-2 ${fn}/${sample}_2.fastq.gz \
-p 8 -o quants/${sample}_quant
done
An R version to quantify one sample is
quantify = function(sample_file, output, salmon, doit = TRUE) {
sample = basename(sample_file)
samples = file.path(sample_file, paste0(sample, "_", 1:2, ".fastq.gz"))
output = file.path(output, paste0(sample, "_quant"))
args = c(
"quant", "-i", index, "-l A",
"-1", samples[1], "-2", samples[2],
"-p", parallel::detectCores(),
"-o", output
)
if (doit)
system2(salmon, args)
else {
txt = paste(salmon, paste(args, collapse=" "))
message(txt)
invisible(0)
}
}
Here we quantify all samples (set doit = TRUE
to actually perform the counting step.
index = file.path(TUTORIAL, "athal_index")
output = file.path(TUTORIAL, "quants")
data_dir = file.path(TUTORIAL, "data")
sample_files = normalizePath(dir(data_dir, full = TRUE))
args = list(salmon = SALMON, output = output, doit = FALSE)
Map(quantify, sample_files, MoreArgs = args)
We need to provide a mapping between the transcripts that the reads were aligned to and the genes that we will perform the analysis on. We import the GFF annotation file, and extract the transcripts and their parents (genes) as a tibble.
library(rtracklayer)
library(tidyverse)
file = file.path(TUTORIAL, "Arabidopsis_thaliana.TAIR10.29.gff3")
gff = import(file)
tx2gene = tibble(
txid = gff$transcript_id,
geneid = as.character(gff$Parent)
) %>% na.omit()
To import the count data, we use the tximport package to import the data. We find the relevant count files, and provide names to identify each file (these names are propagated by the import function to the column names of the output count matrix).
library(tximport)
library(SummarizedExperiment)
files = dir(
file.path(TUTORIAL, "quants"),
pattern = ".sf", recursive = TRUE, full = TRUE
)
names(files) = sub("_quant", "", basename(dirname(files)))
counts = tximport(files, type = "salmon", tx2gene = tx2gene)
## reading in files with read_tsv
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## summarizing abundance
## summarizing counts
## summarizing length
names(counts)
## [1] "abundance" "counts" "length"
## [4] "countsFromAbundance"
The input is a list with three identically-dimensioned matrices and a fourth element describing how the other elements were determine. We put the three matrices into a SummarizedExperiment, to connect up with the DESeq2 work flow
library(SummarizedExperiment)
se = SummarizedExperiment(counts[-4])
Next steps: add the experimental design as colData()
on the SummarizedExperiment
.