library(data.table)
library(ggplot2)
library(CellBarcode)

1 Introduction

1.1 About the package

Whatโ€™s this package used for?

Cellular DNA barcoding (genetic lineage tracing) is a powerful tool for lineage tracing and clonal tracking studies. This package provides a toolbox for DNA barcode analysis, from extraction from fastq files to barcode error correction and quantification.

What types of barcode can this package handle?

The package can handle all kinds of barcodes, as long as the barcodes have a pattern which can be matched by a regular expression, and each barcode is within a single sequencing read. It can handle barcodes with flexible length, and barcodes with UMI (unique molecular identifier).

This tool can also be used for the pre-processing part of amplicon data analysis such as CRISPR gRNA screening, immune repertoire sequencing and meta genome data.

What can the package do?

The package provides functions for 1). Sequence quality control and filtering, 2). Barcode (and UMI) extraction from sequencing reads, 3). Sample and barcode management with metadata, 4). Barcode filtering.

1.2 About function naming

Most of the functions in this packages have names with bc_ as initiation. We hope it can facilitate the syntax auto-complement function of IDE (integrated development toolkit) or IDE-like tools such as RStudio, R-NVIM (in VIM), and ESS (in Emacs). By typing bc_ you can have a list of suggested functions, then you can pick the function you need from the list.

TODO: the function brain-map

1.3 About test data set

The test data set in this package can be accessed by

system.file("extdata", "mef_test_data", package="CellBarcode")

The data are from Jos et. al (TODO: citation). There are 7 mouse embryo fibroblast (MEF) lines with different DNA barcodes. The barcodes are in vivo inducible VDJ barcodes (TODO: add citation when have). These MEF lines were mixed in a ratio of 1:2:4:8:16:32:64.

sequence clone size 2^x
AAGTCCAGTTCTACTATCGTAGCTACTA 1
AAGTCCAGTATCGTTACGCTACTA 2
AAGTCCAGTCTACTATCGTTACGACAGCTACTA 3
AAGTCCAGTTCTACTATCGTTACGAGCTACTA 4
AAGTCCATCGTAGCTACTA 5
AAGTCCAGTACTGTAGCTACTA 6
AAGTCCAGTACTATCGTACTA 7

Then 5 pools of 196 to 50000 cells were prepared from the MEF lines mixture. For each pool 2 technical replicates (NGS libraries) were prepared and sequenced, finally resulting in 10 samples.

sample name cell number replication
195_mixa 195 mixa
195_mixb 195 mixb
781_mixa 781 mixa
781_mixb 781 mixb
3125_mixa 3125 mixa
3125_mixb 3125 mixb
12500_mixa 12500 mixa
12500_mixb 12500 mixb
50000_mixa 50000 mixa
50000_mixb 50000 mixb

The original FASTQ files are relatively large, so only 2000 reads for each sample have been randomly sampled as a test set here.

example_data <- system.file("extdata", "mef_test_data", package = "CellBarcode")
fq_files <- dir(example_data, "fastq.gz", full=TRUE)

# prepare metadata for the samples
metadata <- stringr::str_split_fixed(basename(fq_files), "_", 10)[, c(4, 6)]
metadata <- as.data.frame(metadata)
sample_name <- apply(metadata, 1, paste, collapse = "_")
colnames(metadata) = c("cell_number", "replication")
# metadata should has the row names consistent to the sample names
rownames(metadata) = sample_name
metadata
#>            cell_number replication
#> 195_mixb           195        mixb
#> 50000_mixa       50000        mixa
#> 50000_mixb       50000        mixb
#> 12500_mixa       12500        mixa
#> 12500_mixb       12500        mixb
#> 3125_mixa         3125        mixa
#> 3125_mixb         3125        mixb
#> 781_mixa           781        mixa
#> 781_mixb           781        mixb
#> 195_mixa           195        mixa

2 Installation

Install from Bioconductor.

if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("CellBarcode")

Install the development version from Github.

# install.packages("remotes")
remotes::install_github("wenjie1991/CellBarcode")

3 A basic workflow

Here is an example of a basic workflow:

# install.packages("stringr")
library(CellBarcode)
library(magrittr)

# The example data is the mix of MEF lines with known barcodes
# 2000 reads for each file have been sampled for this test dataset
# extract UMI barcode with regular expression
bc_obj <- bc_extract(
  fq_files,  # fastq file
  pattern = "([ACGT]{12})CTCGAGGTCATCGAAGTATC([ACGT]+)CCGTAGCAAGCTCGAGAGTAGACCTACT", 
  pattern_type = c("UMI" = 1, "barcode" = 2),
  sample_name = sample_name,
  metadata = metadata
)
bc_obj
#> Bonjour le monde, This is a BarcodeObj.
#> ----------
#> It contains: 
#> ----------
#> @metadata: 4 field(s) available:
#> cell_number  replication  raw_read_count  barcode_read_count
#> ----------
#> @messyBc: 10 sample(s) for raw barcodes:
#>     In sample $195_mixb there are: 1329 Tags
#>     In sample $50000_mixa there are: 1318 Tags
#>     In sample $50000_mixb there are: 1389 Tags
#>     In sample $12500_mixa there are: 1325 Tags
#>     In sample $12500_mixb there are: 1365 Tags
#>     In sample $3125_mixa there are: 1294 Tags
#>     In sample $3125_mixb there are: 1301 Tags
#>     In sample $781_mixa there are: 1297 Tags
#>     In sample $781_mixb there are: 1306 Tags
#>     In sample $195_mixa there are: 1352 Tags

# sample subset operation, select technical repeats 'mixa'
bc_sub = bc_subset(bc_obj, sample=replication == "mixa")
bc_sub 
#> Bonjour le monde, This is a BarcodeObj.
#> ----------
#> It contains: 
#> ----------
#> @metadata: 4 field(s) available:
#> cell_number  replication  raw_read_count  barcode_read_count
#> ----------
#> @messyBc: 5 sample(s) for raw barcodes:
#>     In sample $50000_mixa there are: 1318 Tags
#>     In sample $12500_mixa there are: 1325 Tags
#>     In sample $3125_mixa there are: 1294 Tags
#>     In sample $781_mixa there are: 1297 Tags
#>     In sample $195_mixa there are: 1352 Tags

# filter the barcode, UMI barcode amplicon >= 2 & UMI counts >= 2
bc_sub <- bc_cure_umi(bc_sub, depth = 2) %>% bc_cure_depth(depth = 2)

# select barcodes with a white list
bc_2df(bc_sub)
#>    sample_name                      barcode_seq count
#> 1   50000_mixa           AAGTCCAGTACTGTAGCTACTA     8
#> 2   50000_mixa         AAGTCCAGTATCGTTACGCTACTA     9
#> 3   50000_mixa              AAGTCCATCGTAGCTACTA     3
#> 4   12500_mixa         AAGTCCAGTATCGTTACGCTACTA    19
#> 5   12500_mixa           AAGTCCAGTACTGTAGCTACTA    10
#> 6   12500_mixa              AAGTCCATCGTAGCTACTA     3
#> 7    3125_mixa         AAGTCCAGTATCGTTACGCTACTA    13
#> 8    3125_mixa           AAGTCCAGTACTGTAGCTACTA     9
#> 9    3125_mixa              AAGTCCATCGTAGCTACTA     5
#> 10   3125_mixa AAGTCCAGTTCTACTATCGTTACGAGCTACTA     6
#> 11    781_mixa         AAGTCCAGTATCGTTACGCTACTA     7
#> 12    781_mixa           AAGTCCAGTACTGTAGCTACTA     7
#> 13    195_mixa         AAGTCCAGTATCGTTACGCTACTA    11
#> 14    195_mixa              AAGTCCATCGTAGCTACTA     9
#> 15    195_mixa           AAGTCCAGTACTGTAGCTACTA     9
#> 16    195_mixa AAGTCCAGTTCTACTATCGTTACGAGCTACTA     2
bc_sub[c("AAGTCCAGTACTATCGTACTA", "AAGTCCAGTACTGTAGCTACTA"), ]
#> Bonjour le monde, This is a BarcodeObj.
#> ----------
#> It contains: 
#> ----------
#> @metadata: 5 field(s) available:
#> cell_number  replication  raw_read_count  barcode_read_count  depth_cutoff
#> ----------
#> @messyBc: 5 sample(s) for raw barcodes:
#>     In sample $50000_mixa there are: 438 Tags
#>     In sample $12500_mixa there are: 491 Tags
#>     In sample $3125_mixa there are: 455 Tags
#>     In sample $781_mixa there are: 525 Tags
#>     In sample $195_mixa there are: 488 Tags
#> ----------
#> @cleanBc: 5 samples for cleaned barcodes
#>     In sample $50000_mixa there are: 1 barcodes
#>     In sample $12500_mixa there are: 1 barcodes
#>     In sample $3125_mixa there are: 1 barcodes
#>     In sample $781_mixa there are: 1 barcodes
#>     In sample $195_mixa there are: 1 barcodes

# export the barcode counts to data.frame
head(bc_2df(bc_sub))
#>   sample_name              barcode_seq count
#> 1  50000_mixa   AAGTCCAGTACTGTAGCTACTA     8
#> 2  50000_mixa AAGTCCAGTATCGTTACGCTACTA     9
#> 3  50000_mixa      AAGTCCATCGTAGCTACTA     3
#> 4  12500_mixa AAGTCCAGTATCGTTACGCTACTA    19
#> 5  12500_mixa   AAGTCCAGTACTGTAGCTACTA    10
#> 6  12500_mixa      AAGTCCATCGTAGCTACTA     3

# export the barcode counts to matrix
head(bc_2matrix(bc_sub))
#>                                  X12500_mixa X195_mixa X3125_mixa X50000_mixa
#> AAGTCCAGTACTGTAGCTACTA                    10         9          9           8
#> AAGTCCAGTATCGTTACGCTACTA                  19        11         13           9
#> AAGTCCAGTTCTACTATCGTTACGAGCTACTA           0         2          6           0
#> AAGTCCATCGTAGCTACTA                        3         9          5           3
#>                                  X781_mixa
#> AAGTCCAGTACTGTAGCTACTA                   7
#> AAGTCCAGTATCGTTACGCTACTA                 7
#> AAGTCCAGTTCTACTATCGTTACGAGCTACTA         0
#> AAGTCCATCGTAGCTACTA                      0

4 Sequence quality control

4.1 Evaluation

In a full analysis starting from fastq files, the first step is to check the seqencing quality and filter as required. The bc_seq_qc function is for checking the sequencing quality. If multiple samples are input the output is a BarcodeQcSet object, otherwise a BarcodeQC object will be returned. In addition, bc_seq_qc also can handle the ShortReadQ, DNAStringSet and other data types.

qc_noFilter <- bc_seq_qc(fq_files)
qc_noFilter
#> The sequence QC set, use `[]` to select sample:
#>      5290_10_BCM_195_mef_mixb_GTCATTG_S11_R1_001.fastq.gz
#>      5290_1_BCM_50000_mef_mixa_GTTCTCC_S2_R1_001.fastq.gz
#>      5290_2_BCM_50000_mef_mixb_GATGTGT_S5_R1_001.fastq.gz
#>      5290_3_BCM_12500_mef_mixa_TGCCTTG_S4_R1_001.fastq.gz
#>      5290_4_BCM_12500_mef_mixb_TAACTGC_S8_R1_001.fastq.gz
#>      5290_5_BCM_3125_mef_mixa_GCTTCCA_S9_R1_001.fastq.gz
#>      5290_6_BCM_3125_mef_mixb_TGTGAGT_S7_R1_001.fastq.gz
#>      5290_7_BCM_781_mef_mixa_CCTTACC_S12_R1_001.fastq.gz
#>      5290_8_BCM_781_mef_mixb_CGTATCC_S13_R1_001.fastq.gz
#>      5290_9_BCM_195_mef_mixa_GTACTGT_S14_R1_001.fastq.gz
bc_names(qc_noFilter)
#>  [1] "5290_10_BCM_195_mef_mixb_GTCATTG_S11_R1_001.fastq.gz"
#>  [2] "5290_1_BCM_50000_mef_mixa_GTTCTCC_S2_R1_001.fastq.gz"
#>  [3] "5290_2_BCM_50000_mef_mixb_GATGTGT_S5_R1_001.fastq.gz"
#>  [4] "5290_3_BCM_12500_mef_mixa_TGCCTTG_S4_R1_001.fastq.gz"
#>  [5] "5290_4_BCM_12500_mef_mixb_TAACTGC_S8_R1_001.fastq.gz"
#>  [6] "5290_5_BCM_3125_mef_mixa_GCTTCCA_S9_R1_001.fastq.gz" 
#>  [7] "5290_6_BCM_3125_mef_mixb_TGTGAGT_S7_R1_001.fastq.gz" 
#>  [8] "5290_7_BCM_781_mef_mixa_CCTTACC_S12_R1_001.fastq.gz" 
#>  [9] "5290_8_BCM_781_mef_mixb_CGTATCC_S13_R1_001.fastq.gz" 
#> [10] "5290_9_BCM_195_mef_mixa_GTACTGT_S14_R1_001.fastq.gz"
class(qc_noFilter)
#> [1] "BarcodeQcSet"
#> attr(,"package")
#> [1] "CellBarcode"

The bc_plot_seqQc function can be invoked with a BarcodeQcSet as argument, and the output is a QC summary with two panels. The first shows the ratio of ATCG bases for each sequencing cycle with one sample per row; this allows the user to, for example, identify constant or random parts of the sequencing read. The second figure shows the average sequencing quality index of each cycle (base).

For the test set, the first 12 bases are UMI, which are random. This is followed by the constant region of the barcode (the PCR primer selects reads with this sequence), and here we observe a specific base for each cycle across all the samples.

bc_plot_seqQc(qc_noFilter)