1 Basics

1.1 Installing tidyFlowCore

R is an open-source statistical environment which can be easily modified to enhance its functionality via packages. tidyFlowCore is an R package available via Bioconductor, an open-source repository for R packages related to biostatistics and biocomputing.

R can be installed on any operating system from CRAN, after which you can install tidyFlowCore by using the following commands in your R session:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
      install.packages("BiocManager")
  }

BiocManager::install("tidyFlowCore")

## Check that you have a valid Bioconductor installation
BiocManager::valid()

1.2 Preliminaries

tidyFlowCore adopts the so-called “tidy” functional programming paradigm developed by Wickham et al. in the tidyverse ecosystem of R packages (Wickham, François, Henry, Müller, and Vaughan, 2023). For information about the tidyverse ecosytem broadly, feel free to reference the (free) R for Data Science book, the tidyverse website, or this paper describing the larger tidyomics project.

tidyFlowCore integrates the flowCore Bioconductor package’s data analysis capabilities with those of the tidyverse. If you’re relatively unfamiliar with the Bioconductor project, you might be interested in this blog post.

1.3 Asking for help

Learning to use R and Bioconductor can be challenging, so it’s important to know where to get help. The main place to ask questions about tidyFlowCore is the Bioconductor support site. Use the tidyFlowCore tag there and look at previous posts.

You can also ask questions on GitHub or Twitter. But remember, if you’re asking for help, follow the posting guidelines. Make sure to include a simple example that reproduces your issue (a “reprex”) and your session info to help developers understand and solve your problem.

1.4 Citing tidyFlowCore

If you use tidyFlowCore for your research, please use the following citation.

citation("tidyFlowCore")
#> Warning in person1(given = given[[i]], family = family[[i]], middle =
#> middle[[i]], : It is recommended to use 'given' instead of 'middle'.
#> To cite package 'tidyFlowCore' in publications use:
#> 
#>   Keyes TJ (2024). _tidyFlowCore: Bringing flowCore to the tidyverse_.
#>   doi:10.18129/B9.bioc.tidyFlowCore
#>   <https://doi.org/10.18129/B9.bioc.tidyFlowCore>,
#>   https://github.com/keyes-timothy/tidyflowCore/tidyFlowCore - R
#>   package version 1.0.0,
#>   <http://www.bioconductor.org/packages/tidyFlowCore>.
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Manual{,
#>     title = {tidyFlowCore: Bringing flowCore to the tidyverse},
#>     author = {Timothy J Keyes},
#>     year = {2024},
#>     url = {http://www.bioconductor.org/packages/tidyFlowCore},
#>     note = {https://github.com/keyes-timothy/tidyflowCore/tidyFlowCore - R package version 1.0.0},
#>     doi = {10.18129/B9.bioc.tidyFlowCore},
#>   }

2 tidyFlowCore quick start

tidyFlowCore allows you to treat flowCore data structures like tidy data.frames or tibbles. It does so by implementing dplyr, tidyr, and ggplot2 verbs that can be deployed directly on the flowFrame and flowSet S4 classes.

In this section, we give a brief example of how tidyFlowCore can enable a data analysis pipeline to use all the useful functions of the flowCore package and many of the functions of the dplyr, tidyr, and ggplot2 packages.

2.1 Load required packages

library(tidyFlowCore)
library(flowCore)

2.2 Read data

For our example here, we download some publicly available mass cytometry (CyTOF) data downloadable through the (Weber, M, Soneson, and Charlotte, 2019) package. These data are made available as a flowCore::flowSet S4 object, Bioconductor’s standard data structure for cytometry data.

# read data from the HDCytoData package
bcr_flowset <- HDCytoData::Bodenmiller_BCR_XL_flowSet()
#> see ?HDCytoData and browseVignettes('HDCytoData') for documentation
#> loading from cache
#> Warning in updateObjectFromSlots(object, ..., verbose = verbose): dropping
#> slot(s) 'colnames' from object = 'flowSet'

To read more about this dataset, run the following command:

?HDCytoData::Bodenmiller_BCR_XL_flowSet

2.3 Data transformation

The flowCore package natively supports multiple types of data preprocessing and transformations for cytometry data through the use of its tranform class.

For example, if we want to apply the standard arcsinh transformation often used for CyTOF data to our current dataset, we could use the following code:

asinh_transformation <- flowCore::arcsinhTransform(a = 0, b = 1/5, c = 0)
transformation_list <- 
  flowCore::transformList(
    colnames(bcr_flowset), 
    asinh_transformation
  )

transformed_bcr_flowset <- flowCore::transform(bcr_flowset, transformation_list)

Alternatively, we can also use the tidyverse’s functional programming paradigm to perform the same transformation. For this, we use the mutate-across framework via tidyFlowCore:

transformed_bcr_flowset <- 
  bcr_flowset |>
  dplyr::mutate(across(-ends_with("_id"), \(.x) asinh(.x / 5)))

2.4 Cell type counting

Suppose we’re interested in counting the number of cells that belong to each cell type (encoded in the population_id column of bcr_flowset) in our dataset. Using standard flowCore functions, we could perform this calculation in a few steps:

# extract all expression matrices from our flowSet 
combined_matrix <- flowCore::fsApply(bcr_flowset, exprs)

# take out the concatenated population_id column 
combined_population_id <- combined_matrix[, 'population_id']

# perform the calculation
table(combined_population_id)
#> combined_population_id
#>     1     2     3     4     5     6     7     8 
#>  3265  6651 62890 51150  1980 18436 24518  3901

tidyFlowCore allows us to perform the same operation simply using the dplyr package’s count function, with output provided in the convenient form of a tibble:

bcr_flowset |> 
  dplyr::count(population_id)
#> # A tibble: 8 × 2
#>   population_id     n
#>           <dbl> <int>
#> 1             1  3265
#> 2             2  6651
#> 3             3 62890
#> 4             4 51150
#> 5             5  1980
#> 6             6 18436
#> 7             7 24518
#> 8             8  3901

tidyFlowCore also makes it easy to perform the counting after grouping by other variables in our metadata. For example, we can see that bcr_flowset contains data from multiple .FCS files, each of which is associated with a file name.

flowCore::pData(object = bcr_flowset)
#>                                                                  name
#> PBMC8_30min_patient1_BCR-XL.fcs       PBMC8_30min_patient1_BCR-XL.fcs
#> PBMC8_30min_patient1_Reference.fcs PBMC8_30min_patient1_Reference.fcs
#> PBMC8_30min_patient2_BCR-XL.fcs       PBMC8_30min_patient2_BCR-XL.fcs
#> PBMC8_30min_patient2_Reference.fcs PBMC8_30min_patient2_Reference.fcs
#> PBMC8_30min_patient3_BCR-XL.fcs       PBMC8_30min_patient3_BCR-XL.fcs
#> PBMC8_30min_patient3_Reference.fcs PBMC8_30min_patient3_Reference.fcs
#> PBMC8_30min_patient4_BCR-XL.fcs       PBMC8_30min_patient4_BCR-XL.fcs
#> PBMC8_30min_patient4_Reference.fcs PBMC8_30min_patient4_Reference.fcs
#> PBMC8_30min_patient5_BCR-XL.fcs       PBMC8_30min_patient5_BCR-XL.fcs
#> PBMC8_30min_patient5_Reference.fcs PBMC8_30min_patient5_Reference.fcs
#> PBMC8_30min_patient6_BCR-XL.fcs       PBMC8_30min_patient6_BCR-XL.fcs
#> PBMC8_30min_patient6_Reference.fcs PBMC8_30min_patient6_Reference.fcs
#> PBMC8_30min_patient7_BCR-XL.fcs       PBMC8_30min_patient7_BCR-XL.fcs
#> PBMC8_30min_patient7_Reference.fcs PBMC8_30min_patient7_Reference.fcs
#> PBMC8_30min_patient8_BCR-XL.fcs       PBMC8_30min_patient8_BCR-XL.fcs
#> PBMC8_30min_patient8_Reference.fcs PBMC8_30min_patient8_Reference.fcs

tidyFlowCore makes it easy to perform grouped tidy operations, like counting, using information in our flowSet’s metadata:

bcr_flowset |> 
  # use the .tidyFlowCore_identifier pronoun to access the name of 
  # each experiment in the flowSet 
  dplyr::count(.tidyFlowCore_identifier, population_id)
#> # A tibble: 128 × 3
#>    .tidyFlowCore_identifier           population_id     n
#>    <chr>                                      <dbl> <int>
#>  1 PBMC8_30min_patient1_BCR-XL.fcs                1    31
#>  2 PBMC8_30min_patient1_BCR-XL.fcs                2   112
#>  3 PBMC8_30min_patient1_BCR-XL.fcs                3   761
#>  4 PBMC8_30min_patient1_BCR-XL.fcs                4  1307
#>  5 PBMC8_30min_patient1_BCR-XL.fcs                5     5
#>  6 PBMC8_30min_patient1_BCR-XL.fcs                6   127
#>  7 PBMC8_30min_patient1_BCR-XL.fcs                7   444
#>  8 PBMC8_30min_patient1_BCR-XL.fcs                8    51
#>  9 PBMC8_30min_patient1_Reference.fcs             1    52
#> 10 PBMC8_30min_patient1_Reference.fcs             2   132
#> # ℹ 118 more rows

2.5 Nesting and unnesting

flowFrame and flowSet data objects have a clear relationship with one another in the flowCore application programming interface (API). Essentially, flowSets are nested flowFrames - or, in other words, flowSets are made up of multiple flowFrames!

tidyFlowCore provides a useful API for converting between flowSet and flowFrame data structures at various degrees of nesting using the group/nest and ungroup/unnest verbs. Note that in the dplyr and tidyr APIs, group/nest and ungroup/unnest are not synonyms (grouped data.frames are different from nested data.frames). However, because of how flowFrames and flowSets are structured, tidyFlowCore’s group/nest and ungroup/unnest functions have identical behavior, respectively.

# unnesting a flowSet results in a flowFrame with an additional column, 
# 'tidyFlowCore_name` that identifies cells based on which experiment in the 
# original flowSet they come from
bcr_flowset |> 
  dplyr::ungroup()
#> flowFrame object 'filefa4693e0b6415'
#> with 172791 cells and 40 observables:
#>                    name               desc     range  minRange  maxRange
#> $P1                Time               Time   2399633    0.0000   2399632
#> $P2         Cell_length        Cell_length        69    0.0000        68
#> $P3      CD3(110:114)Dd     CD3(110:114)Dd      9383  -61.6796      9382
#> $P4       CD45(In115)Dd      CD45(In115)Dd      5035    0.0000      5034
#> $P5        BC1(La139)Dd       BC1(La139)Dd     14306 -100.8797     14305
#> ...                 ...                ...       ...       ...       ...
#> $P36           group_id           group_id         3         0         2
#> $P37         patient_id         patient_id         9         0         8
#> $P38          sample_id          sample_id        17         0        16
#> $P39      population_id      population_id         9         0         8
#> $P40 .tidyFlowCore_name .tidyFlowCore_name        17         0        16
#> 297 keywords are stored in the 'description' slot
# flowSets can be unnested and re-nested for various analyses
bcr_flowset |> 
  dplyr::ungroup() |> 
  # group_by cell type
  dplyr::group_by(population_id) |> 
  # calculate the mean HLA-DR expression of each cell population
  dplyr::summarize(mean_hladr_expression = mean(`HLA-DR(Yb174)Dd`)) |> 
  dplyr::select(population_id, mean_hladr_expression)
#> # A tibble: 8 × 2
#>   population_id mean_hladr_expression
#>           <dbl>                 <dbl>
#> 1             3                  3.67
#> 2             7                  3.33
#> 3             4                  4.33
#> 4             2                 87.1 
#> 5             6                 88.2 
#> 6             8                  3.12
#> 7             1                 51.4 
#> 8             5                 18.0

2.6 Plotting

tidyFlowCore also provides a direct interface between ggplot2 and flowFrame or flowSet data objects. For example…

# cell population names, from the HDCytoData documentation
population_names <- 
  c(
    "B-cells IgM-", 
    "B-cells IgM+", 
    "CD4 T-cells", 
    "CD8 T-cells", 
    "DC", 
    "monocytes",            
    "NK cells",             
    "surface-"
  )

# calculate mean CD20 expression across all cells
mean_cd20_expression <- 
  bcr_flowset |> 
  dplyr::ungroup() |> 
  dplyr::summarize(mean_expression = mean(asinh(`CD20(Sm147)Dd` / 5))) |> 
  dplyr::pull(mean_expression)

# calculate mean CD4 expression across all cells
mean_cd4_expression <- 
  bcr_flowset |> 
  dplyr::ungroup() |> 
  dplyr::summarize(mean_expression = mean(asinh(`CD4(Nd145)Dd` / 5))) |> 
  dplyr::pull(mean_expression)

bcr_flowset |> 
  # preprocess all columns that represent protein measurements
  dplyr::mutate(dplyr::across(-ends_with("_id"), \(.x) asinh(.x / 5))) |> 
  # plot a CD4 vs. CD45 scatterplot
  ggplot2::ggplot(ggplot2::aes(x = `CD20(Sm147)Dd`, y = `CD4(Nd145)Dd`)) +
  # add some reference lines
  ggplot2::geom_hline(
    yintercept = mean_cd4_expression, 
    color = "red", 
    linetype = "dashed"
  ) + 
  ggplot2::geom_vline(
    xintercept = mean_cd20_expression, 
    color = "red", 
    linetype = "dashed"
  ) + 
  ggplot2::geom_point(size = 0.1, alpha = 0.1) +
  # facet by cell population
  ggplot2::facet_wrap(
    facets = ggplot2::vars(population_id), 
    labeller = 
      ggplot2::as_labeller(
        \(population_id) population_names[as.numeric(population_id)]
      )
  ) + 
  # axis labels
  ggplot2::labs(
    x = "CD20 expression (arcsinh)", 
    y = "CD4 expression (arcsinh)"
  )