scPipe
is a package initially designed to process single-cell RNA-sequencing (scRNA-seq) data generated by different protocols. We have modified it to accommodate pre-processing capability of single-cell ATAC-Seq (Assay for Transposase-Accessible Chromatin using sequencing) data pre-processing. scPipe
ATAC-Seq module is designed for protocols without UMIs, but can also adapt to any UMI protocols if the need arise.
scPipe
ATAC-Seq module consists of two major components. The first is data pre-processing with barcodes as raw fastq as input and a feature count matrix as output. Second is the data pre-processing with barcode CSV file as input and a feature count matrix as output.
10X ATAC
method currently is the most popular method to generate scATAC-Seq data with higher sensitivity and lower cost. The structure of the 10X ATAC library is shown below.
The output fastq files from a 10X ATAC
experiment is paired-ended and data is contained within both reads.
The pipeline is visually described via the workflow diagram depicted below. The functions will be explained in greater detail.
It is not mandatory to specify an output folder even though it can be specified. If no output_folder
is defined a folder named scPipe-atac-output
will get created in the working directory.
We begin by loading the library.
library(scPipe)
data.folder <- system.file("extdata", package = "scPipe", mustWork = TRUE)
Specify the output folder.
output_folder <- paste0(tempdir(), "/scPipeATACVignette")
To process the data, we need the fastq
files (both compressed and uncompressed versions are accepted) and a cell barcode annotation. We have supplied very small sample FASTQ files of chr21. The barcode annotation could be either in a .fastq
format or a .csv
file with at least two columns, where the first column has the cell id and the second column contains the barcode sequence. All these files can be found in the data
folder of the scPipe
package:
# data fastq files
r1 <- file.path(data.folder, "small_chr21_R1.fastq.gz")
r2 <- file.path(data.folder, "small_chr21_R3.fastq.gz")
# barcodes in fastq format
barcode_fastq <- file.path(data.folder, "small_chr21_R2.fastq.gz")
# barcodes in .csv format
barcode_1000 <- file.path(data.folder, "chr21_modified_barcode_1000.csv")
The pipeline starts with fastq file reformatting. We move the barcode and UMI sequences (if available) to the read name and leave the transcript sequence as is. This outputs a read name that looks like @[barcode_sequence]_[UMI_sequence]#[readname]
. Usually the scATAC-Seq data is paired-end and a 16bp long barcode is located on both reads. Here the barcode information is located on a separate fastq file and the length of the barcode fastq file matches the length of the reads files. Therefore, you need a minimal example like below to generate the output.
sc_atac_trim_barcode (r1 = r1,
r2 = r2,
bc_file = barcode_fastq,
rmN = FALSE,
rmlow = FALSE,
output_folder = output_folder)
## Output Directory Does Not Exist. Created Directory: /tmp/RtmpF20gvE/scPipeATACVignette
## Saving the output at location:
## /tmp/RtmpF20gvE/scPipeATACVignette
## No valid_barcode_file provided; no barcode error correction will occur.
## Total reads: 35000
## Total reads removed due to N's in barcodes: 0
## Total reads removed due to low quality barcodes: 0
## Total barcodes provided in FASTQ file: 16567
## time elapsed: 287 milliseconds
This generates two output fastq files that are appended by the prefix demux_
to notify that the new files are the reformatted (a.k.a. demultiplexed) .fastq
files. These will get saved in the scPipe-atac-output
directory if the user has not specified an output_folder
.
However, if the barcodes are in the form of a .csv
file, some extra information on 0-indexed barcode start (id1_st
, id2_st
) and the barcode length (id1_len
, id2_len
) are also required to be entered by the user. The algorithm is flexible to do a “look-around” to identify whether the correct parameters are used for a subset of data (hence saving time) and report back to the console if it believes the barcode position is incorrect and/or should be shifted.
sc_atac_trim_barcode (r1 = r1,
r2 = r2,
bc_file = barcode_1000,
id1_st = -1,
id1_len = -1,
id2_st = -1,
id2_len = -10,
output_folder = output_folder,
rmN = TRUE)
To accommodate combinatorial indexing in some scATAC-seq protocols, the bc_file
parameter will accept a list of barcode files as well (currently only implemented for the fastq
approach).
Additionally, rmN
, rmlow
and min_qual
parameters define the quality thresholds for the reads. If there are Ns
in the barcode or UMI positions those will be discarded by rmN = TRUE
. rmlow =TRUE
will remove reads having lower quality than what is defined by min_qual
(default: 20).
Completion of this function will output three different outputs depending on the findings; * complete matches: When the barcode is completely matched and identified in the correct position * partial matches: When the barcode is identified in the location specified but corrected with hamming distance approach * unmatched: no barcode match is found in the given position even after hamming distance corrections are applied
NOTE: we use a zero based index system, so the indexing of the sequence starts at zero.
Next, we align reads to the genome. This example uses Rsubread
but any aligner that support RNA-seq alignment and gives standard BAM output can be used here.
demux_r1 <- file.path(output_folder, "demux_completematch_small_chr21_R1.fastq.gz")
demux_r2 <- file.path(output_folder, "demux_completematch_small_chr21_R3.fastq.gz")
reference = file.path(data.folder, "small_chr21.fa")
aligned_bam <- sc_aligning(ref = reference,
R1 = demux_r1,
R2 = demux_r2,
nthreads = 12,
output_folder = output_folder)
## ATAC-Seq mode is selected...
## Genome index location not specified. Looking for the index in/tmp/RtmpF20gvE/scPipeATACVignette
## Genome index not found. Creating one in /tmp/RtmpF20gvE/scPipeATACVignette ...
##
## ========== _____ _ _ ____ _____ ______ _____
## ===== / ____| | | | _ \| __ \| ____| /\ | __ \
## ===== | (___ | | | | |_) | |__) | |__ / \ | | | |
## ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
## ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
## ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
## Rsubread 2.21.0
##
## //================================= setting ==================================\\
## || ||
## || Index name : genome_index ||
## || Index space : base space ||
## || Index split : no-split ||
## || Repeat threshold : 100 repeats ||
## || Gapped index : no ||
## || ||
## || Free / total memory : 51.3GB / 125.4GB ||
## || ||
## || Input files : 1 file in total ||
## || o small_chr21.fa ||
## || ||
## \\============================================================================//
##
## //================================= Running ==================================\\
## || ||
## || Check the integrity of provided reference sequences ... ||
## || No format issues were found ||
## || Scan uninformative subreads in reference sequences ... ||
## || 3 uninformative subreads were found. ||
## || These subreads were excluded from index building. ||
## || Estimate the index size... ||
## || 8%, 0 mins elapsed, rate=380.7k bps/s ||
## || 16%, 0 mins elapsed, rate=714.4k bps/s ||
## || 24%, 0 mins elapsed, rate=1033.8k bps/s ||
## || 33%, 0 mins elapsed, rate=1304.0k bps/s ||
## || 41%, 0 mins elapsed, rate=1577.4k bps/s ||
## || 49%, 0 mins elapsed, rate=1796.8k bps/s ||
## || 58%, 0 mins elapsed, rate=1996.0k bps/s ||
## || 66%, 0 mins elapsed, rate=2217.8k bps/s ||
## || 74%, 0 mins elapsed, rate=2382.4k bps/s ||
## || 83%, 0 mins elapsed, rate=2578.4k bps/s ||
## || 91%, 0 mins elapsed, rate=2798.4k bps/s ||
## || 3.0 GB of memory is needed for index building. ||
## || Build the index... ||
## || 8%, 0 mins elapsed, rate=41.3k bps/s ||
## || 16%, 0 mins elapsed, rate=81.2k bps/s ||
## || 24%, 0 mins elapsed, rate=120.0k bps/s ||
## || 33%, 0 mins elapsed, rate=157.2k bps/s ||
## || 41%, 0 mins elapsed, rate=194.6k bps/s ||
## || 49%, 0 mins elapsed, rate=229.4k bps/s ||
## || 58%, 0 mins elapsed, rate=263.1k bps/s ||
## || 66%, 0 mins elapsed, rate=298.0k bps/s ||
## || 74%, 0 mins elapsed, rate=329.9k bps/s ||
## || 83%, 0 mins elapsed, rate=363.2k bps/s ||
## || 91%, 0 mins elapsed, rate=397.7k bps/s ||
## || Save current index block... ||
## || [ 0.0% finished ] ||
## || [ 10.0% finished ] ||
## || [ 20.0% finished ] ||
## || [ 30.0% finished ] ||
## || [ 40.0% finished ] ||
## || [ 50.0% finished ] ||
## || [ 60.0% finished ] ||
## || [ 70.0% finished ] ||
## || [ 80.0% finished ] ||
## || [ 90.0% finished ] ||
## || [ 100.0% finished ] ||
## || ||
## || Total running time: 0.2 minutes. ||
## |Index /tmp/RtmpF20gvE/scPipeATACVignette/genome_index was successfully built. ||
## || ||
## \\============================================================================//
## No partial matches, checking for reads with non-matched barcodes.
## No reads found with non-matching barcodes.
## Outputted demultiplexing stats file to/tmp/RtmpF20gvE/scPipeATACVignette/scPipe_atac_stats//demultiplexing_stats.csv
## Output file name is not provided. Aligned reads are saved in /tmp/RtmpF20gvE/scPipeATACVignette/demux_completematch_small_chr21_R1_aligned.bam
##
## ========== _____ _ _ ____ _____ ______ _____
## ===== / ____| | | | _ \| __ \| ____| /\ | __ \
## ===== | (___ | | | | |_) | |__) | |__ / \ | | | |
## ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
## ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
## ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
## Rsubread 2.21.0
##
## //================================= setting ==================================\\
## || ||
## || Function : Read alignment (DNA-Seq) ||
## || Input file 1 : demux_completematch_small_chr21_R1.fastq.gz ||
## || Input file 2 : demux_completematch_small_chr21_R3.fastq.gz ||
## || Output file : demux_completematch_small_chr21_R1_aligned.bam (BAM), ... ||
## || Index name : genome_index ||
## || ||
## || ------------------------------------ ||
## || ||
## || Threads : 12 ||
## || Phred offset : 33 ||
## || # of extracted subreads : 10 ||
## || Min read1 vote : 3 ||
## || Min read2 vote : 1 ||
## || Max fragment size : 600 ||
## || Min fragment size : 50 ||
## || Max mismatches : 3 ||
## || Max indel length : 5 ||
## || Report multi-mapping reads : yes ||
## || Max alignments per multi-mapping read : 1 ||
## || ||
## \\============================================================================//
##
## //=============== Running (29-Oct-2024 19:27:12, pid=1772257) ================\\
## || ||
## || Check the input reads. ||
## || The input file contains base space reads. ||
## || Initialise the memory objects. ||
## || Estimate the mean read length. ||
## || The range of Phred scores observed in the data is [2,37] ||
## || Create the output BAM file. ||
## || Check the index. ||
## || Init the voting space. ||
## || Global environment is initialised. ||
## || Load the 1-th index block... ||
## || The index block has been loaded. ||
## || Start read mapping in chunk. ||
## || Estimated fragment length : 123 bp ||
## || ||
## || Completed successfully. ||
## || ||
## \\==================================== ====================================//
##
## //================================ Summary =================================\\
## || ||
## || Total fragments : 35,000 ||
## || Mapped : 18,851 (53.9%) ||
## || Uniquely mapped : 18,057 ||
## || Multi-mapping : 794 ||
## || ||
## || Unmapped : 16,149 ||
## || ||
## || Properly paired : 14,524 ||
## || Not properly paired : 4,327 ||
## || Singleton : 1,458 ||
## || Chimeric : 0 ||
## || Unexpected strandness : 114 ||
## || Unexpected fragment length : 180 ||
## || Unexpected read order : 2,575 ||
## || ||
## || Indels : 197 ||
## || ||
## || Running time : 0.6 minutes ||
## || ||
## \\============================================================================//
Next, the BAM file needs to be modified in a way one or two new columns are generated for the cellular barcode tag and the molecular barcode (i.e. UMI) tag denoted by CB:Z:
and OX:Z:
, respectively.
sorted_tagged_bam <- sc_atac_bam_tagging (inbam = aligned_bam,
output_folder = output_folder,
bam_tags = list(bc="CB", mb="OX"),
nthreads = 12)
## Detected bc_len: 16 Detected UMI len: 0
## updating progress every 3 minutes...
## number of read processed:0
## 0 0 0 0
## 70000 reads processed, 70k reads/sec
## number of read processed: 70000
## time elapsed: 1022 milliseconds
## Demultiplexed BAM file sorted and indexed ...
## Using default value for barcode length (bc_length = 16)
## Generating mapping statistics per barcode
## Iterating through 5,000,000 reads at a time
## chunk1
## Completed generating mapping statistics per barcode, saved to /tmp/RtmpF20gvE/scPipeATACVignette/scPipe_atac_stats/mapping_stats_per_barcode.csv
## The output tagged and sorted BAM file was sent to /tmp/RtmpF20gvE/scPipeATACVignette
Next, PCR duplicate reads are removed from the BAM file. If samtools
is installed, then sc_atac_remove_duplicates
can be used, with the installation path of samtools
being an optional argument if a particular version is preferred. However, if samtools
isn’t installed, the PCR duplicate removal should be performed externally.
removed <- sc_atac_remove_duplicates(sorted_tagged_bam,
output_folder = output_folder)
if (!isFALSE(removed))
sorted_tagged_bam <- removed
Next, a fragment file
in a .bed
format is generated, where each line represents a unique fragment generated by the assay. This file is used to generate useful quality control statistics, as well as for the filter
and cellranger
cell calling methods.
sc_atac_create_fragments(inbam = sorted_tagged_bam,
output_folder = output_folder)
## Output folder name is: /tmp/RtmpF20gvE/scPipeATACVignette
## Output BED file: /tmp/RtmpF20gvE/scPipeATACVignette/fragments.bed
Similar to many other tools, the the ability to call peaks on a pseudo-bulk level on the demultiplexed reads has also been incorporated. MACS3
is used underneath to achieve this functionality. If the genome size is not provided then it will be roughly estimated.
# CHECK IF THE .narrowPeak file is small enough to include in the package,
# otherwise, we need to make this basilisk call work??
sc_atac_peak_calling(inbam = sorted_tagged_bam,
ref = reference,
genome_size = NULL,
output_folder = output_folder)
After the read alignment and BAM file demultiplexing, a feature set can be used to find the overlap between the aligned reads and the features using the sc_atac_feature_counting
function.
Accepted format of the feature_input
should be either a BED format (i.e. format should have three main columns; chromosome, feature start and feature end, extension of the file is not considered) or a genome.fasta
file. If using a BED format as the feature_input
, the feature_type
should be either “peak” or “tss”. If using a .fasta
for the feature_input
, the feature_type
needs to be defined as genome_bin
. This genome.fasta
file will be used within the function to generate a genome_bin
that defines the array of features. The size of the bins can be set using the bin_size
parameter (default: 2000).
Note: avoid using input BAM files larger than 8GB for best performance
features <- file.path(output_folder, "NA_peaks.narrowPeak")
sc_atac_feature_counting (fragment_file = file.path(output_folder, "fragments.bed"),
feature_input = features,
bam_tags = list(bc="CB", mb="OX"),
feature_type = "peak",
organism = "hg38",
yieldsize = 1000000,
exclude_regions = TRUE,
output_folder = output_folder,
fix_chr = "none"
)
If genome_bin approach is selected, the following can be used:
reference <- file.path(data.folder, "small_chr21.fa")
sc_atac_feature_counting (fragment_file = file.path(output_folder, "fragments.bed"),
feature_input = reference,
bam_tags = list(bc="CB", mb="OX"),
feature_type = "genome_bin",
organism = "hg38",
cell_calling = FALSE,
yieldsize = 1000000,
exclude_regions = TRUE,
output_folder = output_folder,
fix_chr = "none"
)
Call calling is performed on the data prior to finding overlaps. The cell calling methods available are the emptyDrops
method from DropletUtils, filter
which filters the cells based on various QC cut-offs, and cellranger
which models the signal and noise as a mixture of two negative binomial distributions, though the filter
method is recommended as it is most commonly used and the best-suited for scATAC-seq data.
For the filter
cell-calling method, there are a variety of QC metrics that can be used, including:
min_uniq_frags
max_uniq_frags
min_frac_peak
min_frac_tss
min_frac_enhancer
min_frac_promoter
max_frac_mito
features <- file.path(output_folder, "NA_peaks.narrowPeak")
sc_atac_feature_counting (fragment_file = file.path(output_folder, "fragments.bed"),
feature_input = features,
bam_tags = list(bc="CB", mb="OX"),
feature_type = "peak",
organism = "hg38",
cell_calling = "filter",
min_uniq_frags = 0,
min_frac_peak = 0,
min_frac_promoter = 0,
yieldsize = 1000000,
exclude_regions = TRUE,
output_folder = output_folder,
fix_chr = "none"
)
## `peak`, `tss` or `gene` feature_type is selected for feature input
## Creating Galignment object for the feature input ...
## hg38is a recognized organism. Using annotation files in repository.
## Generating TSS plot data
## Generating feature-barcode fragment count matrix
## Average no. of fragments per barcode: 1.34885245901639
## Raw feature matrix generated: /tmp/RtmpF20gvE/scPipeATACVignette/unfiltered_feature_matrix.rds
## Calculating TSS enrichment scores
## Generating QC metrics for cells
## `filter` function is used for cell calling ...
## cell called and stored in /tmp/RtmpF20gvE/scPipeATACVignette
## Cells with less than 200 counts are filtered out.
## Number of cells to remove:32
## No cells were filtered out since otherwise there would be too few left.
## features with less than 10 counts are filtered out.
## Number of features to remove:8
## No features were filtered out since otherwise there would be too few left.
## making sparse
## Sparse matrix generated
## Sparse count matrix is saved in
## /tmp/RtmpF20gvE/scPipeATACVignette/sparse_matrix.mtx
## Binary matrix is saved in:
## /tmp/RtmpF20gvE/scPipeATACVignette/binary_matrix.rds
## Computing feature QC metrics
## writing to csv
## SCE object is saved in:
## /tmp/RtmpF20gvE/scPipeATACVignette/scPipe_atac_SCEobject.rds
## sc_atac_feature_counting completed in 10.9906167984009 seconds
Mapping quality (MAPQ) value can be set to filter data further in step (default: 30). Additionally, regions that are considered anomalous, unstructured, or high signal in next-generation sequencing experiments are excluded using an inbuilt excluded_regions
file (available are for organism
hg19
, hg38
, and mm10
) or a user provided excluded_regions_filename
file in BED format.
The discrepancy of chr
between the alignment file and the feature file/excluded regions file can also be fixed (using fix_char
parameter) if needed by adding the string chr
to the beginning of either the features, and/or excluded regions. So the options for fix_char
are “none”, “excluded_regions”, “feature”, “both”.
NOTE genome_bin
approach may be more reliable in detecting sensitive features than using a pseudo-bulk peak calling approach, hence it would make the function slower as well.
The sc_atac_feature_counting
function generates a matrix format of the feature by cell matrix (features as rows, cell barcodes as columns) in several formats (raw, sparse, binary, jaccard) that can be used downstream to generate a singleCellExperiment, SCE
object.
feature_matrix <- readRDS(file.path(output_folder, "unfiltered_feature_matrix.rds"))
dplyr::glimpse(feature_matrix)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## ..@ i : int [1:1559] 3 5 3 3 3 3 3 3 3 1 ...
## ..@ p : int [1:1526] 0 1 2 3 4 5 6 7 8 9 ...
## ..@ Dim : int [1:2] 11 1525
## ..@ Dimnames:List of 2
## .. ..$ feature: chr [1:11] "chr21:284477-284941" "chr21:287515-288874" "chr21:325151-325387" "chr21:414105-416460" ...
## .. ..$ barcode: chr [1:1525] "AAACGAACAGTAAGCG" "AAACGAACATGGATGG" "AAACGAATCGGACGAA" "AAACGAATCTGTGTGA" ...
## ..@ x : num [1:1559] 1 1 1 1 1 1 2 1 1 1 ...
## ..@ factors : list()
sparseM <- readRDS(file.path(output_folder, "sparse_matrix.rds"))
dplyr::glimpse(sparseM)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## ..@ i : int [1:50] 3 3 3 3 3 1 3 3 3 3 ...
## ..@ p : int [1:33] 0 1 2 3 4 5 7 8 9 11 ...
## ..@ Dim : int [1:2] 11 32
## ..@ Dimnames:List of 2
## .. ..$ feature: chr [1:11] "chr21:284477-284941" "chr21:287515-288874" "chr21:325151-325387" "chr21:414105-416460" ...
## .. ..$ barcode: chr [1:32] "AAATGAGCAATCAGGG" "AACTTGGCATGGCCGT" "AATACGCAGTTGGAAT" "ACAGCGCAGATGCGCA" ...
## ..@ x : num [1:50] 7 12 16 3 6 1 21 6 6 9 ...
## ..@ factors : list()