Contents

1 Introduction

The software package ZygosityPredictor allows to predict how many copies of a gene are affected by mutations, in particular small variants (single nucleotide variants, SNVs, and small insertions and deletions, Indels). In addition to the basic calculations of the affected copy number of a variant, ZygosityPredictor can phase multiple variants in a gene and ultimately make a prediction if and how many wild-type copies of the gene are left. This information proves to be of particular use in the context of translational medicine. For example, in cancer genomes, ZygosityPredictor can address whether unmutated copies of tumor-suppressor genes are present. ZygosityPredictor was developed to handle somatic and germline small variants. In addition to the small variant context, it can assess larger deletions, which may cause losses of exons or whole genes.

2 Installation

The following code can be used to install ZygosityPredictor. The installation needs to be done once.

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("ZygosityPredictor")

3 load example data

To demonstrate the use of ZygosityPredictor, NGS data from the Seq-C2 project were used [1]. In the following chunk, all required datalayer of the WGS_LL_T_1 sample are loaded. The variants are loaded as GRanges objects, one for somatic copy number alterations (GR_SCNA), one for germline- and one for somatic small variants (GR_GERM_SMALL_VARS and GR_SOM_SMALL_VARS). The input formats will be discussed in more detail in section 4.

library(ZygosityPredictor)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)
library(GenomicRanges)
## Loading required package: stats4
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:dplyr':
## 
##     combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
##     as.data.frame, basename, cbind, colnames, dirname, do.call,
##     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
##     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
##     pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
##     tapply, union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
## 
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:dplyr':
## 
##     first, rename
## The following object is masked from 'package:utils':
## 
##     findMatches
## The following objects are masked from 'package:base':
## 
##     I, expand.grid, unname
## Loading required package: IRanges
## 
## Attaching package: 'IRanges'
## The following objects are masked from 'package:dplyr':
## 
##     collapse, desc, slice
## Loading required package: GenomeInfoDb
# file to sequence alignment 
FILE_BAM <- system.file("extdata", "ZP_example.bam", 
                        package = "ZygosityPredictor")
# meta information of the sample
PURITY <- 0.98
PLOIDY <- 1.57
SEX <- "female"
# variants
data("GR_SCNA")
data("GR_GERM_SMALL_VARS")
data("GR_SOM_SMALL_VARS")
# used gene model
data("GR_GENE_MODEL")

4 Calculation of affected copies of a variant

Two functions are provided to calculate how many copies are affected by single small variants, based on two formulas, one for germline variants and one for somatic variants.

4.1 Germline variants

To calculate the affected copies for a germline variant by using aff_germ_copies(), the following inputs are required:

  • af: numeric; between 0 and 1; calculated allele frequency of the variant in the tumor sample
  • tcn: numeric; total copy number at the position of the variant
  • purity: numeric; between 0 and 1; purity or tumor cell content of the tumor sample
  • c_normal: numeric; expected copy number at the position of the variant in normal tissue, 1 for gonosomes in male samples, and 2 for male autosomes and all chromosomes in female samples. (The function can also assess the c_normal parameter by itself, but then the following two inputs must be provided: chr and sex)
  • chr: (only if c_normal is not provided) character; can be either a single number or in the “chr1” format; chromosome of the variant
  • sex: (only if c_normal is not provided) character; either “male” or “female” / “m” or “f”; sex of the sample
  • af_normal: (default 0.5) numeric; allele-frequency of germline variant in normal tissue. 0.5 represents heterozygous variants in diploid genome, 1 would be homozygous. Could be relevant if germline CNVs are present at the position. Then also the c_normal parameter would have to be adjusted.

the output is a numeric value that indicates the affected copies.

## as an example we take the first variant of our prepared input data and 
## extract the required information from different input data layer
## the allele frequency and the chromosome can be taken from the GRanges object

AF = elementMetadata(GR_GERM_SMALL_VARS[1])[["af"]]
CHR = seqnames(GR_GERM_SMALL_VARS[1]) %>%
  as.character()

## the total copy number (tcn) can be extracted from the CNV object by selecting
## the CNV from the position of the variant

TCN = elementMetadata(
  subsetByOverlaps(GR_SCNA, GR_GERM_SMALL_VARS[1])
  )[["tcn"]]

## purity and sex can be taken from the global variables of the sample
## with this function call the affected copies are calculated for the variant
aff_germ_copies(af=AF,
                tcn=TCN,
                purity=PURITY,
                chr=CHR,
                sex=SEX)
## [1] 1.50733

4.2 Somatic variants

To calculate how many copies are affected by a somatic variant by aff_som_copies(), the same inputs are required, but a different formula is evaluated:

## the function for somatic variants works the same way as the germline function

AF = elementMetadata(GR_SOM_SMALL_VARS[1])[["af"]]
CHR = seqnames(GR_SOM_SMALL_VARS[1]) %>%
  as.character()
TCN = elementMetadata(
  subsetByOverlaps(GR_SCNA, GR_SOM_SMALL_VARS[1])
  )[["tcn"]]

aff_som_copies(af=AF,
               chr=CHR,
               tcn=TCN,
               purity=PURITY,
               sex=SEX)
## [1] 1.614292

4.3 Calculate affected copies of a set of variants

In order to apply the previously mentioned functions to a whole set of variants and calculate the affected copies, the following code can be used.

## as an example we calculate the affected copies for the somatic variants:
GR_SOM_SMALL_VARS %>%
  ## cnv information for every variant is required.. merge with cnv object
  IRanges::mergeByOverlaps(GR_SCNA) %>% 
  as_tibble() %>%
  ## select relevant columns
  select(chr=1, pos=2, gene, af, tcn) %>%
  mutate_at(.vars=c("tcn", "af"), .funs=as.numeric) %>%
  rowwise() %>%
  mutate(
    aff_copies = aff_som_copies(chr, af, tcn, PURITY, SEX),
    wt_copies = tcn-aff_copies
  )
## # A tibble: 6 × 7
## # Rowwise: 
##   chr         pos gene      af   tcn aff_copies wt_copies
##   <fct>     <int> <chr>  <dbl> <dbl>      <dbl>     <dbl>
## 1 chr6    4734801 CDYL  0.327   4.90      1.61      3.29 
## 2 chr6    4734837 CDYL  0.274   4.90      1.35      3.55 
## 3 chr8  143805314 SCRIB 0.0608  2.92      0.180     2.74 
## 4 chr8  143805342 SCRIB 0.0515  2.92      0.152     2.77 
## 5 chr17   7675088 TP53  0.474   1.49      0.724     0.763
## 6 chr17  41771694 JUP   0.857   1.06      0.947     0.117

5 Predict Zygosity

In this section, we will use the WGS_LL_T_1 dataset from the Seq-C2 project as an example to investigate whether mutations in the following genes result in total absence of wildtype copies. The genes which were selected as an example for the analysis are shown below. The example data set was reduced to these genes.

5.1 Format of input data

Some inputs are optional, while others are compulsory. The latter are labeled with “**”. Of note, ZygosityPredictor is applied downstream of variant calling, therefore the variant calls, including information on identified somatic copy number aberrations (sCNAs), have to be provided. The inputs can be divided into five classes:

  • File paths:
    • bamDna** : character; path to indexed alignment (.bam format)
    • bamRna: character; path to rna-sequencing data (.bam format).
    • vcf: character, or character vector containing several vcf file paths; path to variant call file (.vcf.gz format). Will be used (if provided) for extended SNP phasing if variants on the same gene are too far away from each other for direct phasing
  • Sample meta information:
    • purity** : numeric; between 0 and 1; indicates purity or tumor cell content of the sample
    • ploidy: numeric; ground ploidy of the sample
    • sex** : character; “male” or “female” / “m” or “f”; sex of the patient the sample was taken from
  • Variants
    • somCna** : GRanges object; containing all genomic segments (sCNA) with annotated total copy number (default metadata column name “tcn”, custom name can be provided by COLNAME_TCN) and information about LOH (default column name “cna_type”, custom name can be provided by COLNAME_CNA_TYPE). The cna_type column should contain the string “LOH” if loss-of heterozygosity is present at the segment. If large deletions should be included to the analysis the total copy number has to be decreased accordingly. If the total copy number is smaller than 0.5, the tool will assume a homozygous deletion. An incomplete deletion is assumed if at least one copy is lost compared to the ploidy of the sample (works only if the ploidy is provided as an input)
    • somSmallVars: GRanges object; containing all somatic small variants. Required metadata columns are: reference base (“REF”/”ref”), alternative base (“ALT”/”alt”), allele frequency in the tumor sample (raw allele frequency, i.e. as measured in the tumor sample; not the corrected allele frequency in the supposedly pure tumor) (“AF”/”af”), gene (“gene”/”GENE”, according to the used gene model (GENCODE39 in the example data) and the annotation provided below). If no relevant somatic small variants are present, can also be NULL or not provided.
    • germSmallVars: GRanges object; Analogous to GR_SOM_SMALL_VARS. If no relevant germline small variants are present, can also be NULL or not provided.
  • Used Gene model
    • geneModel** : GRanges object; containing the gene model for the used reference genome. Required metadata columns are: “gene”. Artificially restricting the gene model can be used to tell the tool which genes to analyze. In the case of this vignette, the object contains only the genes we selected.
  • Options
    • includeIncompleteDel: logical, default=TRUE; Should incomplete deletions (monoallelic deletions in a diploid sample) be included in the evaluation? Since these often span large parts of the chromosomes and will lead to many affected genes, it can be advisable to include or exclude them, depending on the research question.
    • includeHomoDel: logical, default=TRUE; Should homozygous deletions be included in the evaluation?
    • showReadDetail: logical, default=FALSE; If this option is TRUE, another table is added to the output that contains more detailed information about the classification of the read pairs. More detailed information is provided in section 4.3.4.
    • assumeSomCnaGaps: logical, default=FALSE; Only required if the somCna object lacks copy number information for genomic segments on which small variants are detected. By default, variants in such regions will be excluded from the analysis as required information about the copy number is missing. These variants will be attached to the final output list in a separate tibble. To include them, this flag must be set TRUE and the ground ploidy must be given as an input. This ground ploidy will then be taken as tcn in the missing regions.
    • byTcn: logical, default=TRUE; optional if includeHomoDel or includeIncompleteDel is TRUE. If FALSE the tool will not use tcn as a criterion to assign large deletions. It will use the cna_type column and check for indicating strings like HOMDEL/HomoDel/DEL. Some commonly used strings are covered. It is recommended to leave this flag TRUE.
    • printLog: logical, default=TRUE; If TRUE, the tool will print detailed information how the assessment is done for each gene.
    • colnameTcn: character; indicating the name of the metadata column containing the tcn information in the somCna object. If not provided the tool tries to detect the column according to default names.
    • colnameCnaType: character; The same as for colnameTcn, but for cna-type information.
    • distCutOff: numeric, default=5000; if input vcf is provided and SNP phasing is performed, this will limt the distance at which the SNP phasing should not be tried anymore. As the probability of finding overlapping reads at such a long distance is very low and the runtime will increase exponentially.

5.2 Predict zygosity for a set of genes in a sample

The prediction of zygosity for a set of genes in a sample can be assessed by the predict_zygosity() function.

Important note: The runtime of the analysis depends strongly on the number of genes to be assessed and on the number of input variants. It is therefore recommended to reduce the number of genes to the necessary ones. Also, depending on the research question to be addressed, the variants should be filtered to the most relevant ones, not only because of runtime considerations, but also to sharpen the final result. A large number of mutations in a gene, some of which are of little biological relevance or even SNPs, will inevitably reduce the validity of the results.

full_prediction = predict_zygosity(
  purity = PURITY, 
  ploidy = PLOIDY,
  sex = SEX,
  somCna = GR_SCNA, 
  somSmallVars = GR_SOM_SMALL_VARS, 
  germSmallVars = GR_GERM_SMALL_VARS, 
  geneModel = GR_GENE_MODEL,
  bamDna = FILE_BAM,
  showReadDetail = TRUE
)
## no RNA file provided: Analysis will be done without RNA reads

5.3 Interpretation of results

Of note, the results displayed here were chosen to explain and exemplify the functionality of the tool; biological and medical impact of the specific variants had not been a selection criterion. The result which is returned by the function consists of a list of three (or up to six) tibbles:

  • Evaluation per variant
  • Evaluation per gene
  • Phasing info
  • Read pair info (only if showReadDetail=TRUE)
  • Variants not covered by somCna (only if present and no sCNA gap assumption was done)
  • detailed information about extended SNP phasing if it was performed

5.3.1 Evaluation per variant

The first result of the function is the evaluation per variant. In this step all information required for subsequent steps is annotated and the affected copies per variant are calculated. For every variant, the function checks whether it already affects all copies of the gene. The format of the output is a tibble; the number of rows corresponds to the total number of input variants. The tool annotates a few self-explanatory columns such as the origin of the respective variant (germline/somatic) or the class (snv/ins/del). It also appends information from the sCNA results: the total copy number at the position of the variant and the information if a loss of heterozygosity is present (cna_type). Also, an ID is assigned to every small variant. Then, the genes are numbered consecutively in order to unambiguously assign variants to genes in the following analysis. The most important results of this step are the calculation of the affected and wildtype copies, as well as, depending on the data, an initial check of whether a variant already affects all copies. Of note, there can be situations in which left wildtype copies are below 0.5,
but still this information is not sufficient to predict “all copies affected” without doubt. Depending on the origin of the variant, further criteria must be met (e.g., LOH). The procedure for this first check is shown in the pre_info column.

# here the new columns are selected and viewed
full_prediction$eval_per_variant %>%
  # these steps are just to have a better overview
  # round numeric columns
  mutate_at(.vars=c("af","tcn", "aff_cp", "wt_cp"),
            .funs = round, 2) %>%
  # to get a better overview, the columns which are already in the inut are 
  # removed
  select(-chr, -pos, -alt, -ref, -af)
## # A tibble: 19 × 10
##    gene   mut_id origin   class   tcn tcn_assumed cna_type aff_cp wt_cp pre_info
##    <chr>  <chr>  <chr>    <chr> <dbl> <lgl>       <chr>     <dbl> <dbl> <chr>   
##  1 TRANK1 m1     germline snv    1.49 FALSE       LOH        1.5  -0.01 germlin…
##  2 BRCA1  m1     germline snv    1.06 FALSE       HZ         1.07 -0.01 germlin…
##  3 ELP2   m1     germline del    2    FALSE       HZ         1.07  0.93 germlin…
##  4 ELP2   m2     germline snv    2    FALSE       HZ         0.8   1.2  germlin…
##  5 ELP2   m3     germline snv    2    FALSE       HZ         0.8   1.2  germlin…
##  6 CDYL   m1     somatic  snv    4.9  FALSE       HZ         1.61  3.29 somatic…
##  7 CDYL   m2     somatic  snv    4.9  FALSE       HZ         1.35  3.55 somatic…
##  8 SCRIB  m1     somatic  snv    2.92 FALSE       LOH        0.18  2.74 somatic…
##  9 SCRIB  m2     somatic  snv    2.92 FALSE       LOH        0.15  2.77 somatic…
## 10 TP53   m1     somatic  snv    1.49 FALSE       LOH        0.72  0.76 somatic…
## 11 JUP    m1     somatic  del    1.06 FALSE       LOH        0.95  0.12 somatic…
## 12 TRIM3  m1     somatic  homd…  0.03 FALSE       homdel     1.54  0.03 homdel …
## 13 TRIM3  m2     somatic  inco…  1.06 FALSE       incompl…   0.51  1.06 incompl…
## 14 TRIM3  m3     somatic  inco…  0.03 FALSE       incompl…   1.54  0.03 incompl…
## 15 TP53   m2     somatic  inco…  1.49 FALSE       incompl…   0.08  1.49 incompl…
## 16 JUP    m2     somatic  inco…  1.06 FALSE       incompl…   0.51  1.06 incompl…
## 17 BRCA1  m2     somatic  inco…  1.06 FALSE       incompl…   0.51  1.06 incompl…
## 18 BRCA1  m3     somatic  inco…  1.06 FALSE       incompl…   0.51  1.06 incompl…
## 19 TRANK1 m2     somatic  inco…  1.49 FALSE       incompl…   0.08  1.49 incompl…

For example, mutation m1 in TRANK1 (first line) fulfills all criteria that are sufficient for a germline variant to say with a high degree of certainty that all copies are affected, which can be seen in the pre_info column. The same applies to the somatic JUP variant. In the case of the ELP2, CDYL and SCRIB genes, there are two or even three variants, neither of which leads to complete loss of the wildtype gene, which is why the next step is to check whether they together affect all copies. Homzygous deletions (homdels) always lead automatically to “all copies affected”, whereas incomplete deletions do not affect all copies.

5.3.2 Aggregation per gene

The aggregation at gene level is one of the main features of ZygosityPredictor. A prediction of how many copies are affected by mutations is provided for every gene. The final prediction is stored in tibble format with three columns: gene, status - either “all copies affected” or “wt copies left” - and an info column that explains how the prediction was made for that gene.

full_prediction$eval_per_gene
## # A tibble: 8 × 3
##   gene   status              info                                               
##   <chr>  <chr>               <chr>                                              
## 1 TRANK1 all_copies_affected germline-variant -> LOH -> AF >= 0.5               
## 2 BRCA1  wt_copies_left      germline-variant -> no LOH                         
## 3 ELP2   all_copies_affected phasing: at least one of the combinations of muts …
## 4 CDYL   wt_copies_left      phasing: variants detected on the same reads -> sa…
## 5 SCRIB  wt_copies_left      phasing: variants found only on different reads ->…
## 6 TP53   wt_copies_left      somatic-variant -> LOH ->  left wt copies 0.76     
## 7 JUP    all_copies_affected somatic-variant -> LOH -> left wt copies 0.12      
## 8 TRIM3  all_copies_affected homdel

Each gene appears once in the table. If there are several variants in one gene, the info always refers to the one that affects the highest number of copies. If most copies are affected by a combination of two small variants, the info refers to both of them. In general, the heterozygous deletions have the lowest priority; and if any other small variant or homdel is present in the gene, it will always be indicated as the more relevant variant. In the case of ELP2, CDYL and SCRIB, several small variants are present, which is why the tool attempts to do phasing at the present position. This is indicated by “phasing” in the info column. The exact results of the phasing can be viewed in detail in the next step.

5.3.3 Haplotype-phasing results

A haplotype phasing is only performed if more than one variant occurs in a gene, each of which does not affect all copies of the gene by itself. The tool then tries to phase at this position to check if the variants are located on the same or different copies. That means that not at every prediction a phasing is done and therefore the following output is not always present.

full_prediction$phasing_info
## # A tibble: 5 × 17
##   gene  comb   dist   tcn status left_wt_cp  both  none  mut1  mut2 dev_var
##   <chr> <chr> <dbl> <dbl> <chr>       <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 ELP2  m1-m2    20  2    diff         0.13     0     2     1     6       0
## 2 ELP2  m2-m3  8828  2    null         2        0     0     0     0       0
## 3 ELP2  m1-m3  8848  2    null         2        0     0     0     0       0
## 4 CDYL  m1-m2    36  4.9  same         3.29    25    22     0     0       0
## 5 SCRIB m1-m2    28  2.92 diff         2.59     0    38     5     7       0
## # ℹ 6 more variables: no_overlap <dbl>, none_raw <dbl>, info <chr>,
## #   DNA_rds <dbl>, RNA_rds <dbl>, class_comb <chr>

In the present case, phasing was performed for three genes separately. The output contains information such as the distance between the two variants and the tcn at the position. If tcn differs at both positions, for example for breakpoint genes, the larger one is displayed here. The status column shows whether the mutations were found on the same copy (same) or on different copies (diff). This prediction is determined by whether the mutations are on the same reads / read pairs. The column DNA_rds contains the number of reads / read pairs that cover both positions. If RNA is available, the number of reads / read pairs is also displayed for RNA reads. Depending on whether the variants were found on different or the same copies, a final prediction for the remaining wildtype-copies can be calculated from the affected copies of both variants. This final calculation can be found in the column left_wt_cp. Only if the variants are found on different copies and the remaining wildtype-copies are below 0.5, it can be assumed that no wildtype copies are left. The four columns both, none, mut1, and mut2 indicate how many of the reads / read pairs belong to each category: Both variants detected, none, only the first, or only the second. RNA reads that do not contain an exon necessary for phasing, e.g., due to alternative splicing or that map somewhere entirely different in the genome, end up in a category called non-overlapping. The category dev_var contains reads / read pairs that carry a different variant than the one to be expected at the position. The output for ELP2 shows how genes are handled in which more than 2 variants occur. Each mutation is then combined once with each other to perform the phasing on each combination. In the case of ELP2 three variants are present, resulting in three possible combinations. Two of the variants are located close to each other (m1 and m2), while the third one has a distance of almost 9000 basepairs to them. Therefore, it is only possible to assign a status to the combination of the close variants. The two other combinations get the status “null” as no overlapping reads / read pairs are present. Theoretically, the tool can attempt phasing if no reads overlap both positions. For such distant variants, intermediate SNPs can be used, which then have to be provided by the input vcf.

5.3.4 Detailed info about used read pairs

This output is only provided if the option showReadDetail is set TRUE. It consists of a table containing every read pair that was used to perform haplotype phasing. In more detail, the read name of the read in the bam file is provided with the classification in one of the four categories mentioned before (both, none, mut1, mut2, no_overlap, dev_var).

full_prediction$readpair_info
## # A tibble: 108 × 5
##    qname                                  result origin comb  gene 
##    <chr>                                  <chr>  <chr>  <chr> <chr>
##  1 K00281:15:HM775BBXX:8:2113:22972:30398 mut2   DNA    m1-m2 ELP2 
##  2 K00281:21:HT52WBBXX:2:1203:24870:26652 mut2   DNA    m1-m2 ELP2 
##  3 K00281:21:HT52WBBXX:2:1203:24880:26424 mut2   DNA    m1-m2 ELP2 
##  4 K00281:15:HM775BBXX:7:2112:14925:27989 mut1   DNA    m1-m2 ELP2 
##  5 K00281:21:HT52WBBXX:3:1205:26230:48843 none   DNA    m1-m2 ELP2 
##  6 K00281:15:HM775BBXX:4:2210:28767:35409 none   DNA    m1-m2 ELP2 
##  7 K00281:15:HM775BBXX:6:2224:16102:47682 mut2   DNA    m1-m2 ELP2 
##  8 K00281:21:HT52WBBXX:2:2118:32157:42478 mut2   DNA    m1-m2 ELP2 
##  9 K00281:15:HM775BBXX:4:1213:4117:24208  mut2   DNA    m1-m2 ELP2 
## 10 K00281:21:HT52WBBXX:1:1117:2270:27690  both   DNA    m1-m2 CDYL 
## # ℹ 98 more rows

For example here, the first row tells us that the read / read pair with the name K00281:15:HM775BBXX:8:2113:22972:30398 (first row) stem from a fragment of the gene ELP2 and can be used for the phasing of the combination of the variants m1 and m2. It was assigned into the category “mut2” which means that m2 is present on that read / read pair, but m1 was not detected. Depending on the number of variants of the used sample and the coverage, this table can get very big, which may affect runtime.

6 References

Appendix

  1. Fang LT, Zhu B, Zhao Y, Chen W, Yang Z, Kerrigan L, Langenbach K, de Mars M, Lu C, Idler K, et al. Establishing community reference samples, data and call sets for benchmarking cancer mutation detection using whole-genome sequencing. Nature Biotechnology. 2021;39(9):1151-1160 / PMID:34504347
  2. Lawrence M, Huber W, Pages H, Aboyoun P, Carlson M, et al. (2013) Software for Computing and Annotating Genomic Ranges. PLoS Comput Biol 9(8): e1003118. doi:10.1371/journal.pcbi.1003118*
  3. Wickham H, François R, Henry L, Müller K (2022). dplyr: A Grammar of Data Manipulation. R package version 1.0.10, https://CRAN.R-project.org/package=dplyr.
  4. Wickham H (2022). stringr: Simple, Consistent Wrappers for Common String Operations. https://stringr.tidyverse.org, https://github.com/tidyverse/stringr.
sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] XVector_0.42.0          GenomicRanges_1.54.0    GenomeInfoDb_1.38.0    
##  [4] IRanges_2.36.0          S4Vectors_0.40.0        BiocGenerics_0.48.0    
##  [7] stringr_1.5.0           dplyr_1.1.3             ZygosityPredictor_1.2.0
## [10] BiocStyle_2.30.0       
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.0            blob_1.2.4                 
##  [3] filelock_1.0.2              Biostrings_2.70.0          
##  [5] bitops_1.0-7                fastmap_1.1.1              
##  [7] RCurl_1.98-1.12             BiocFileCache_2.10.0       
##  [9] VariantAnnotation_1.48.0    GenomicAlignments_1.38.0   
## [11] XML_3.99-0.14               digest_0.6.33              
## [13] lifecycle_1.0.3             KEGGREST_1.42.0            
## [15] RSQLite_2.3.1               magrittr_2.0.3             
## [17] compiler_4.3.1              rlang_1.1.1                
## [19] sass_0.4.7                  progress_1.2.2             
## [21] tools_4.3.1                 igraph_1.5.1               
## [23] utf8_1.2.4                  yaml_2.3.7                 
## [25] rtracklayer_1.62.0          knitr_1.44                 
## [27] prettyunits_1.2.0           S4Arrays_1.2.0             
## [29] bit_4.0.5                   curl_5.1.0                 
## [31] DelayedArray_0.28.0         xml2_1.3.5                 
## [33] abind_1.4-5                 BiocParallel_1.36.0        
## [35] withr_2.5.1                 purrr_1.0.2                
## [37] grid_4.3.1                  fansi_1.0.5                
## [39] biomaRt_2.58.0              SummarizedExperiment_1.32.0
## [41] cli_3.6.1                   rmarkdown_2.25             
## [43] crayon_1.5.2                generics_0.1.3             
## [45] httr_1.4.7                  rjson_0.2.21               
## [47] DBI_1.1.3                   cachem_1.0.8               
## [49] zlibbioc_1.48.0             parallel_4.3.1             
## [51] AnnotationDbi_1.64.0        BiocManager_1.30.22        
## [53] restfulr_0.0.15             matrixStats_1.0.0          
## [55] vctrs_0.6.4                 Matrix_1.6-1.1             
## [57] jsonlite_1.8.7              bookdown_0.36              
## [59] hms_1.1.3                   bit64_4.0.5                
## [61] GenomicFeatures_1.54.0      jquerylib_0.1.4            
## [63] glue_1.6.2                  codetools_0.2-19           
## [65] stringi_1.7.12              BiocIO_1.12.0              
## [67] tibble_3.2.1                pillar_1.9.0               
## [69] rappdirs_0.3.3              htmltools_0.5.6.1          
## [71] BSgenome_1.70.0             GenomeInfoDbData_1.2.11    
## [73] R6_2.5.1                    dbplyr_2.3.4               
## [75] evaluate_0.22               lattice_0.22-5             
## [77] Biobase_2.62.0              png_0.1-8                  
## [79] Rsamtools_2.18.0            memoise_2.0.1              
## [81] bslib_0.5.1                 SparseArray_1.2.0          
## [83] xfun_0.40                   MatrixGenerics_1.14.0      
## [85] pkgconfig_2.0.3