Background

The current flowQ package does the quality assessment on the ungated FCM data. However, there is a need to identify deviant samples by monitoring the consistencies of the underlying statistical properties of different gated cell populations (such as white blood cells, lymphacytes, monocytes etc). The flowQ package was also not designed for dealing with large datasets. To meet these needs, We developed QUALIFIER package using the gating template created in flowJo and performing QA checks on different gated populations. It divides the data preprocessing from the actual outlier detection process so that the statistics are calcuated all at once and the outlier detection and visualization can be done more efficiently and interactively. ncdfFlow is used to solve the memory limit issue for large datasets.

Parsing the QA gating template

parseWorkspace function from flowWorkspace package is used to parse the flowJo workspace. Since we are only interested in the gating strategy (gates defined in the template),execute can be set as FALSE
to skip the actual gating process. useInternal is set as TRUE to use internal structure (c++) for faster parsing. Please see the documentation of flowWorkspace package for more details regarding how to use parseWorkspace function.

ws<-openWorkspace("./data/QA_MFI_RBC_bounary_eventsV3.xml")##should replace with your own xml workspace file path
GT<-parseWorkspace(ws,execute=FALSE,useInternal=TRUE)

Apply the gating template to the data

Then gating hierarchy gh_template containing the actual template gates is extracted from the result GatingSetGT”.

gh_template<-GT[[1]]                

The GatingSet method here is the contructor that takes a gatingHierarchy object as the template and a list of FCS file names (filenames) that need the quality assurance. The result “G” is the GatingSet that contains the gated data and some of the cell population statistics that can be viewed by getPopStats method.

##datapath is the path where FCS files stores
G<-GatingSet(gh_template,filenames,path="./data")
getPopStats(G[[1]])

Optionally, isNcdf can be set to TRUE to support netCDF storage of large flow datasets when memory resource is limited.

Calculating the statistics

After flow data is gated, the statistics of the gated data need to be extracted and saved before the QA checks. db is an environment that serves as a container storing all the QA data and results. Firsly, initDB function initializes and creates the data structures for QA. Then qaPreprocess is a convenient wrapper that calls underlining routines (getQAStats},saveToDB) to calculates/extracts statistics of gated cell populations and save them along with the gating set and the FCS meta data.metaFile` is a csv file that contains the extra annoations about each FCS file.It should at least have one column speficifying FCS file names.

db<-new.env()
initDB(db)
qaPreprocess(db=db,gs=G
            ,metaFile="./data/FCS_File_mapping.csv" #should replace with your own FCS meta data file path
            ,fcs.colname="FCS_Files"
            ,date.colname=c("RecdDt","AnalysisDt")
    )

fcs.colname and date.colname are the arguments to identify the columns in the meta data (the flat table stored as a csv file) that specifies the FCS filenames and the dates (to be formatted as to “m/d/y”). Once the preprocessing steps are finished, the data are ready for quality assessments.

Defining qaTasks

read.qaTask reads external csv spreadsheet that contains the descriptions of each QA task and creates a list of qaTask objects.

checkListFile<-file.path(system.file("data",package="QUALIFIER"),"qaCheckList.csv.gz")
qaTask.list<-read.qaTask(db,checkListFile=checkListFile)
## Outlier function is not specified!
## 'outlier.norm' will be used as default outlier function.
## Outlier function is not specified!
## 'qoutlier' will be used as default outlier function.
## [1] "7 qaTask created ahd saved in db!"
qaTask.list[1:2]
## $MFIOverTime
## qaTask: MFIOverTime 
## Level : Assay 
## Description : Fluorescence stability over time 
## population:  MFI
## Default formula :MFI ~ RecdDt | channel * stain
## <environment: 0x55ffb49682c8>
## Plot type:  xyplot
## $horiz
## [1] FALSE
## 
## $scales
## $scales$format
## [1] "%m/%d/%y"
## 
## 
## 
## $NumberOfEvents
## qaTask: NumberOfEvents 
## Level : Tube 
## Description : Number of Events Collected 
## population:  root
## Default formula :count ~ RecdDt | Tube
## <environment: 0x55ffb49e2278>
## Plot type:  xyplot
## $horiz
## [1] FALSE
## 
## $scales
## $scales$format
## [1] "%m/%d/%y"

The qaTask can also be created individually by the contructor makeQaTask.

Quality assessment and visualiztion

qaCheck and plot are the two main methods to perform the quality assessment and visualize the QA results. They both use the information stored in qaTask object and the formula, which is given either explicitly by the argument or implicitly by the qaTask object. It is generally of the form y ~ x | g1 * g2 * … , where y is the statistics to be checked in this QA and must be one of the four types:

            "MFI": 
                Median Fluorescence Intensity of the cell population specified by `qaTask`,
        
            "proportion": 
                    the percentage of the cell population specified by `qaTask` in the parent population, 
        
            "count": 
                    the number of events of the cell population specified by `qaTask`,
        
            "spike": 
                    the variance of intensity over time of each channel, which indicates the stability of the fluorescence intensity.

x specifies the variable plotted on x-axis (such as date) in the plot method.

g1, g2, … are the conditioning variables, which divide the data into subgroups and apply the outlier detection whitin each individual group or plot the subgroups in different panels. They may also be omitted, in which case the outlier detection is peformed in the entire dataset.

For example, RBC Lysis efficiency (percentage of WBC population) check is defined by qaTask .

qaTask.list[["RBCLysis"]]
## qaTask: RBCLysis 
## Level : Tube 
## Description : Sufficient RBC lysis 
## population:  WBC_perct
## Default formula :proportion ~ RecdDt | Tube
## <environment: 0x55ffb4b14830>
## Plot type:  xyplot
## $horiz
## [1] FALSE
## 
## $scales
## $scales$format
## [1] "%m/%d/%y"

According to the formula stored in qaTask, it uses the statistical property “proportion” and groups the data by “Tube” (or staining panel). “RecdDt” is only for the plotting purpose (specifing the x-axis). Cell population is defined as “WBC_perct”

qaCheck(qaTask.list[["RBCLysis"]]
       ,outlierfunc=list(func = outlier.cutoff
                        ,args =  list(lBound=0.8)
                       )
      )

qaCheck reads all the necessary information about the gated data from qaTask object. Users only needs to specificy how the outliers are called. This is done by providing an outlier detection function outlierfunc that takes a numeric vector as input and returns a logical vector as the output. Here outlier.cutoff provided by the package is used and threshold “lBound” is specified (“less than”, use uBound for “larger than”).

After the outliers are called, the results can be plotted by plot method.

plot(qaTask.list[["RBCLysis"]],xlab="Record Date",ylab="percent")

By default all the data are plotted, argument “subset” can be used to visualize a small subset.

plot(qaTask.list[["RBCLysis"]],subset=Tube=='CD8/CD25/CD4/CD3/CD62L',xlab="Record Date",ylab="percent")

clearCheck is the method to removes the outlier results detected by the previous qaCheck call for the specific qaTask.

clearCheck(qaTask.list[["RBCLysis"]])

With scatterPlot flag set as true and subset properly specified plot method can generate scatter plots for the selected FCS files,

plot(qaTask.list[["RBCLysis"]],subset=name=='06087181_F01_I010.fcs',scatterPlot=TRUE)
QUALIFIER-plot-subset2

QUALIFIER-plot-subset2

x term in the formula is normally ignored in qaCheck. However, when “plotType” of the qaTask is “bwplot”, it is used as the conditioning variable that divides the data into subgroups, within which the outlierfunc is applied.

qaTask.list[["MNC"]]
## qaTask: MNC 
## Level : Assay 
## Description : Consistency of Lymphocyte/MNC Gate 
## population:  MNC
## Default formula :proportion ~ coresampleid
## <environment: 0x55ffb4a50aa0>
## Plot type:  bwplot
## $horiz
## [1] FALSE
## 
## $scales
## $scales$format
## [1] "%m/%d/%y"

This qaTask detects the significant variance of MNC cell population percentage among aliquots, which have the same “coresampleid”. Plot type of this object tells the method to group data by “coresampleid”.

qaCheck(qaTask.list[["MNC"]],z.cutoff=1.5)

Interquartile Range based outlier detection function is used to detect outliers

plot(qaTask.list[["MNC"]],proportion~factor(coresampleid),xlab="Sample ID",ylab="percent",scales=list(x=list(rot=45)))

The red circles in the boxplot indicate the possible outlier samples and the box of red color indicates the entire sample group has significant variance and is marked as the group outlier. By default qaCheck uses normal-distribution-based outlier function to detect group outliers.
User-defined function can also supplied through gOutlierfunc argument.Again the function should take a numeric vector as input and returns a logical vector as the output. The formula supplied here in the plot method overwrites the one stored in the qaTask object, thus change the way of viewing the data.

With scatterPlot and subset arguments, scatter plots can be generated for the selected FCS files or sample groups,

plot(qaTask.list[["MNC"]]
        ,scatterPlot=TRUE
        ,subset=coresampleid==11730)
QUALIFIER-plot-MNC-scatter

QUALIFIER-plot-MNC-scatter

We can also apply simple aggregation to the statisics through the formula.

qaTask.list[["BoundaryEvents"]]
## qaTask: BoundaryEvents 
## Level : Channel 
## Description : Off-scale Boundary Events 
## population:  margin
## Default formula :proportion ~ RecdDt | channel
## <environment: 0x55ffb4bcc380>
## Plot type:  xyplot
## $horiz
## [1] FALSE
## 
## $scales
## $scales$format
## [1] "%m/%d/%y"

Here the default formula only extracts the “proportion” from each individual channel. In order to check the total percentage of boundary events of all channels for each fcs file, we can write a new formula by applying aggregation function “sum” to “proportion” and group the data by fcs file (“name” in this case).

qaCheck(qaTask.list[["BoundaryEvents"]]
        ,sum(proportion) ~ RecdDt | name
        ,outlierfunc=list(func = outlier.cutoff
                          ,args =list(uBound=0.0003)
                          )
        )

And we still can visualize the results chanel by chanel.

plot(qaTask.list[["BoundaryEvents"]],proportion ~ RecdDt | channel,xlab="Record Date",ylab="percent")