1 Load contig files
2 High confidence UMIs belonging to T cells per cell
3 Reads / UMIs
4 Apply T-cell contig UMI filter
5 Colophon

This vignette shows an example of loading data from the CellRanger pipeline and performing some QC to select barcodes. If gene expression was also collected, it is better to do joint cell calling.
Some types of multiplets and debris can be better assessed using the gene expression data. See vignette('repertoire_and_expression') for details of how to merge repertoire data with a SingleCellExperiment object.

library(CellaRepertorium)
library(dplyr)
library(ggplot2)
library(readr)
library(tidyr)
library(stringr)

1 Load contig files

files = list.files(system.file('extdata', package = 'CellaRepertorium'), pattern = "all_contig_annotations_.+?.csv.xz", recursive = TRUE, full.names = TRUE)
# Pull out sample and population names
samp_map = tibble(anno_file = files, pop = str_match(files, 'b6|balbc')[,1], sample = str_match(files, '_([0-9])\\.')[,2])

knitr::kable(samp_map)

anno_file	pop	sample
/tmp/RtmpO2z9C7/Rinst23915a6af304bb/CellaRepertorium/extdata/all_contig_annotations_b6_4.csv.xz	b6	4
/tmp/RtmpO2z9C7/Rinst23915a6af304bb/CellaRepertorium/extdata/all_contig_annotations_b6_5.csv.xz	b6	5
/tmp/RtmpO2z9C7/Rinst23915a6af304bb/CellaRepertorium/extdata/all_contig_annotations_b6_6.csv.xz	b6	6
/tmp/RtmpO2z9C7/Rinst23915a6af304bb/CellaRepertorium/extdata/all_contig_annotations_balbc_1.csv.xz	balbc	1
/tmp/RtmpO2z9C7/Rinst23915a6af304bb/CellaRepertorium/extdata/all_contig_annotations_balbc_2.csv.xz	balbc	2
/tmp/RtmpO2z9C7/Rinst23915a6af304bb/CellaRepertorium/extdata/all_contig_annotations_balbc_3.csv.xz	balbc	3
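The population and sample labels in the table above are parsed from the file names with regular expressions. The following is a minimal sketch of that step on two hypothetical paths (example_files is made up for illustration and only follows the same naming pattern); it shows that column 1 of the str_match() result holds the full match, while later columns hold capture groups.

```r
library(stringr)

# Hypothetical file names that follow the naming pattern used above
example_files = c('extdata/all_contig_annotations_b6_4.csv.xz',
                  'extdata/all_contig_annotations_balbc_1.csv.xz')

# Column 1 of the str_match() result is the full match
pop = str_match(example_files, 'b6|balbc')[, 1]

# Column 2 is the first capture group: the single digit between '_' and '.'
sample = str_match(example_files, '_([0-9])\\.')[, 2]

data.frame(pop = pop, sample = sample)
```

Note that the sample label is returned as a character string, not a number, and that the pattern as written only captures single-digit sample ids.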

PBMC pooled from BALB/c and C57BL/6 mice were assayed on 10X Genomics V3 chemistry, and a library enriched for TCR was run. To illustrate the functionality of this package, cell barcodes were subsampled 3 times for each of the BALB/c and Black6 pools to generate distinct samples, which is reflected in the sample column. More details are available in the scripts in the script directory of this package.

# read in CSV
all_anno = samp_map %>% rowwise() %>% mutate(anno = list(read_csv(anno_file, col_types = cols(
  barcode = col_character(),
  is_cell = col_logical(),
  contig_id = col_character(),
  high_confidence = col_logical(),
  length = col_double(),
  chain = col_character(),
  v_gene = col_character(),
  d_gene = col_character(),
  j_gene = col_character(),
  c_gene = col_character(),
  full_length = col_logical(),
  productive = col_character(),
  cdr3 = col_character(),
  cdr3_nt = col_character(),
  reads = col_double(),
  umis = col_double(),
  raw_clonotype_id = col_character(),
  raw_consensus_id = col_character()
))))

all_anno = all_anno %>% unnest(cols = c(anno))

(The column types typically don’t need to be specified in such detail, but watch for issues in the high_confidence, is_cell and full_length columns, which may be read as character rather than logical depending on your specific inputs. Either is fine, but you will want to be consistent across files.)
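One way to keep such a flag column consistent is to coerce it after reading. The sketch below uses two made-up data frames (file_a, file_b) and a hypothetical helper, as_flag(), that is not part of CellaRepertorium; it assumes the text form of the flag is "True"/"False", as in the productive column used later in this vignette.

```r
# Hypothetical columns as they might arrive from two different files
file_a = data.frame(barcode = c('AAAC-1', 'AAAG-1'), is_cell = c(TRUE, FALSE))
file_b = data.frame(barcode = 'AAAT-1', is_cell = 'True', stringsAsFactors = FALSE)

# Hypothetical helper: return a logical vector whether the input is
# already logical or is text such as "True" / "False"
as_flag = function(x) if (is.logical(x)) x else toupper(x) == 'TRUE'

file_a$is_cell = as_flag(file_a$is_cell)
file_b$is_cell = as_flag(file_b$is_cell)

# Both columns are now logical, so the tables can be stacked safely
combined = rbind(file_a, file_b)
str(combined)
```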

We read in several files of annotated “contigs” output from 10X Genomics VDJ version 3.0.

The pipeline for assembling reads into contigs, and mapping them to UMIs and cells, is described in the 10X Genomics documentation, and its source code is available here.

cell_tbl = unique(all_anno[c("barcode","pop","sample","is_cell")])
cdb = ContigCellDB(all_anno, contig_pk = c('barcode','pop','sample','contig_id'), cell_tbl = cell_tbl, cell_pk = c('barcode','pop','sample'))

Note that initially there are 3818 contigs.
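The contig_pk and cell_pk arguments name the columns that identify a contig and a cell, respectively. Before constructing the object, it can be useful to confirm that these columns really do identify rows uniquely. The following base R sketch does this on a made-up table (toy_contigs is hypothetical and much smaller than the real data).

```r
# Hypothetical contig table: the same barcode appears in two samples
toy_contigs = data.frame(
  barcode   = c('AAAC-1', 'AAAC-1', 'AAAC-1'),
  pop       = c('b6', 'b6', 'balbc'),
  sample    = c('4', '4', '1'),
  contig_id = c('AAAC-1_contig_1', 'AAAC-1_contig_2', 'AAAC-1_contig_1')
)

contig_pk = c('barcode', 'pop', 'sample', 'contig_id')
cell_pk   = c('barcode', 'pop', 'sample')

# anyDuplicated() returns 0 when no row repeats on the key columns
contig_key_ok = anyDuplicated(toy_contigs[contig_pk]) == 0

# One row per cell: unique combinations of the cell key
toy_cells = unique(toy_contigs[cell_pk])

contig_key_ok
nrow(toy_cells)
```

In this toy table the barcode alone would not be a valid cell key, because the same barcode occurs in two samples; including pop and sample in the key resolves this.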

cdb = mutate_cdb(cdb, celltype = guess_celltype(chain))
cdb = filter_cdb(cdb, high_confidence)

After filtering for only high_confidence contigs there are 2731 contigs.

We read in the contig annotation file for each of the samples, and annotate each contig as an alpha-beta T cell, gamma-delta T cell, B cell or chimeric “multi” cell type based on its chain annotation, using guess_celltype(chain).
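To make the idea of a chain-based annotation concrete, here is a toy lookup. It is an illustrative assumption, not the implementation of guess_celltype(): the function toy_celltype(), the exact chain names it recognizes, and the choice to label everything else as "Multi" are all hypothetical.

```r
# Hypothetical stand-in for a chain-to-celltype mapping; the real
# guess_celltype() in CellaRepertorium may behave differently
toy_celltype = function(chain) {
  ifelse(chain %in% c('TRA', 'TRB'), 'T_ab',
  ifelse(chain %in% c('TRG', 'TRD'), 'T_gd',
  ifelse(chain %in% c('IGH', 'IGK', 'IGL'), 'B',
         'Multi')))  # assumption: anything else is treated as chimeric
}

toy_labels = toy_celltype(c('TRA', 'TRB', 'TRD', 'IGK', 'Multi'))
toy_labels
```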

2 High confidence UMIs belonging to T cells per cell

total_umi = crosstab_by_celltype(cdb)
T_ab_umi = total_umi[c(cdb$cell_pk,"is_cell","T_ab")]

ggplot(T_ab_umi, aes(color = factor(is_cell), x = T_ab, group = interaction(is_cell, sample, pop))) + stat_ecdf() + coord_cartesian(xlim = c(0, 10)) + ylab('Fraction of barcodes') + theme_minimal() + scale_color_discrete('10X called cell?')

10X defines a procedure to separate cells from background that fits a Gaussian mixture model to the UMI distributions for each sample. However, in some cases it may be desirable to implement a common QC threshold with a different stringency, for example:

When comparing across multiple samples.
When a sample has been enriched for a particular cell type (e.g., with pre-sequencing flow cytometry).

When we consider only high confidence UMIs that unambiguously map to T cells, most “non cells” have 1 or fewer, while most putative cells have >5. However, we might want to adopt a different UMI-based cell filter, as is done below.
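The empirical CDF plotted above can also be read off numerically: its height at a given x is simply the fraction of barcodes with at most x UMIs. The sketch below computes that fraction at x = 1 for a made-up table (toy_umi is hypothetical; the counts are not taken from the data in this vignette).

```r
# Hypothetical per-barcode T_ab UMI counts
toy_umi = data.frame(
  is_cell = c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE),
  T_ab    = c(0, 1, 1, 3, 2, 6, 8, 12)
)

# Fraction of barcodes with 1 or fewer UMIs, split by the 10X cell call.
# This is the height of each group's ECDF at x = 1.
frac_le1 = tapply(toy_umi$T_ab <= 1, toy_umi$is_cell, mean)
frac_le1
```

Repeating the calculation over a range of candidate thresholds gives a quick view of how many barcodes each choice would retain.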

3 Reads / UMIs

qual_plot = ggplot(cdb$contig_tbl, aes(x = celltype, y= umis)) + geom_violin() + geom_jitter() + facet_wrap(~sample + pop) + scale_y_log10() + xlab("Annotated cell type")

qual_plot 
#> Warning: Groups with fewer than two data points have been dropped.
#> Groups with fewer than two data points have been dropped.
#> Groups with fewer than two data points have been dropped.

qual_plot + aes(y = reads)
#> Warning: Groups with fewer than two data points have been dropped.
#> Groups with fewer than two data points have been dropped.
#> Groups with fewer than two data points have been dropped.

The number of UMIs and reads by sample and annotated cell type.

4 Apply T-cell contig UMI filter

# Require at least 2 UMIs mapping to high confidence T cell contigs.
good_bc = total_umi %>% ungroup() %>% filter(is_cell) %>% filter(T_ab >= 2)
total_cells = good_bc %>% group_by(sample, pop) %>% summarize(good_bc = n())
#> `summarise()` has grouped output by 'sample'. You can override using the
#> `.groups` argument.
knitr::kable(total_cells)

sample	pop	good_bc
1	balbc	133
2	balbc	137
3	balbc	143
4	b6	149
5	b6	150
6	b6	148

Apply a filter on UMIs.
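The same filter-then-count pattern can be tried on a small made-up table to see exactly which barcodes survive. In the sketch below, toy_umi and its values are hypothetical. It uses count() instead of group_by() followed by summarize(), which returns an ungrouped result and so avoids the message about grouped output.

```r
library(dplyr)

# Hypothetical per-barcode table with the same columns used above
toy_umi = tibble(
  sample  = c('1', '1', '1', '4', '4'),
  pop     = c('balbc', 'balbc', 'balbc', 'b6', 'b6'),
  barcode = c('AAA-1', 'CCC-1', 'GGG-1', 'AAA-1', 'TTT-1'),
  is_cell = c(TRUE, TRUE, FALSE, TRUE, TRUE),
  T_ab    = c(5, 1, 7, 2, 9)
)

# Keep barcodes that 10X called as cells and that have at least 2 T_ab UMIs
toy_good = toy_umi %>% filter(is_cell, T_ab >= 2)

# Number of retained barcodes per sample and population
toy_counts = toy_good %>% count(sample, pop, name = 'good_bc')
toy_counts
```

Here CCC-1 is dropped for having too few UMIs, and GGG-1 is dropped because it was not called as a cell, despite its high UMI count.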

contigs_qc = semi_join(cdb$contig_tbl, good_bc %>% select(sample, pop, barcode)) %>% 
  filter(full_length, productive == 'True', high_confidence, chain != 'Multi')
#> Joining with `by = join_by(pop, sample, barcode)`

Finally, we keep only high confidence, full length, productive \(\alpha\)-\(\beta\) T cell contigs.
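A semi_join() keeps rows of the first table that have a match in the second, without adding any columns. The sketch below repeats the step on made-up tables (toy_contigs and toy_good are hypothetical) and passes the key columns explicitly through by, which makes the join keys visible in the code and suppresses the "Joining with" message.

```r
library(dplyr)

# Hypothetical contig table; length-1 values are recycled by tibble()
toy_contigs = tibble(
  barcode         = c('AAA-1', 'AAA-1', 'CCC-1', 'GGG-1'),
  sample          = '1',
  pop             = 'balbc',
  chain           = c('TRA', 'TRB', 'TRB', 'Multi'),
  full_length     = c(TRUE, TRUE, FALSE, TRUE),
  productive      = 'True',
  high_confidence = TRUE
)

# Hypothetical set of barcodes that passed the UMI filter
toy_good = tibble(barcode = c('AAA-1', 'CCC-1'), sample = '1', pop = 'balbc')

# Keep contigs from passing barcodes, then apply the contig-level filters
toy_qc = semi_join(toy_contigs, toy_good, by = c('sample', 'pop', 'barcode')) %>%
  filter(full_length, productive == 'True', high_confidence, chain != 'Multi')
toy_qc
```

Only the two AAA-1 contigs remain: the CCC-1 contig is not full length, and GGG-1 is not among the passing barcodes.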

5 Colophon

sessionInfo()
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] Biostrings_2.70.0       GenomeInfoDb_1.38.0     IRanges_2.36.0         
#>  [4] S4Vectors_0.40.0        BiocGenerics_0.48.0     ggdendro_0.1.23        
#>  [7] XVector_0.42.0          purrr_1.0.2             stringr_1.5.0          
#> [10] tidyr_1.3.0             readr_2.1.4             ggplot2_3.4.4          
#> [13] dplyr_1.1.3             CellaRepertorium_1.12.0 BiocStyle_2.30.0       
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.2.0        farver_2.1.1            bitops_1.0-7           
#>  [4] fastmap_1.1.1           RCurl_1.98-1.12         digest_0.6.33          
#>  [7] lifecycle_1.0.3         magrittr_2.0.3          compiler_4.3.1         
#> [10] progress_1.2.2          rlang_1.1.1             sass_0.4.7             
#> [13] tools_4.3.1             utf8_1.2.4              yaml_2.3.7             
#> [16] knitr_1.44              prettyunits_1.2.0       labeling_0.4.3         
#> [19] bit_4.0.5               plyr_1.8.9              RColorBrewer_1.1-3     
#> [22] withr_2.5.1             grid_4.3.1              fansi_1.0.5            
#> [25] colorspace_2.1-0        future_1.33.0           globals_0.16.2         
#> [28] scales_1.2.1            MASS_7.3-60             cli_3.6.1              
#> [31] rmarkdown_2.25          crayon_1.5.2            generics_0.1.3         
#> [34] reshape2_1.4.4          tzdb_0.4.0              minqa_1.2.6            
#> [37] cachem_1.0.8            zlibbioc_1.48.0         splines_4.3.1          
#> [40] parallel_4.3.1          BiocManager_1.30.22     vctrs_0.6.4            
#> [43] boot_1.3-28.1           Matrix_1.6-1.1          jsonlite_1.8.7         
#> [46] bookdown_0.36           hms_1.1.3               bit64_4.0.5            
#> [49] archive_1.1.6           listenv_0.9.0           magick_2.8.1           
#> [52] jquerylib_0.1.4         glue_1.6.2              parallelly_1.36.0      
#> [55] nloptr_2.0.3            codetools_0.2-19        cowplot_1.1.1          
#> [58] stringi_1.7.12          gtable_0.3.4            lme4_1.1-34            
#> [61] broom.mixed_0.2.9.4     munsell_0.5.0           tibble_3.2.1           
#> [64] pillar_1.9.0            furrr_0.3.1             htmltools_0.5.6.1      
#> [67] GenomeInfoDbData_1.2.11 R6_2.5.1                vroom_1.6.4            
#> [70] evaluate_0.22           lattice_0.22-5          backports_1.4.1        
#> [73] broom_1.0.5             bslib_0.5.1             Rcpp_1.0.11            
#> [76] nlme_3.1-163            xfun_0.40               forcats_1.0.0          
#> [79] pkgconfig_2.0.3

Quality control and Exploration of UMI-based repertoire data

24 October 2023
