Artistic-2.0

*flowCut* can be downloaded from https://github.com/jmeskas/flowCut or cloned from https://github.com/jmeskas/flowCut.git. For more information on installation guidelines, see the GitHub, Bioconductor and CRAN websites. Once installed, to load the library, type the following into R:
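```r
library(flowCut)
```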

*flowCut* automatically removes outlier events in flow cytometry data files caused by abnormal flow behaviours resulting from clogs and other common technical problems. Our approach identifies both regions of low density and segments (default size of 500 events) that are significantly different from the rest of the file.

Eight measures of each segment are calculated: mean, median, 5th, 20th, 80th and 95th percentiles, second moment (variation) and third moment (skewness). Summing these 8 measures over all cleaning channels gives a single value per segment, and the density of these values forms the density of summed measures distribution. This distribution, together with two parameters (*MaxValleyHgt* and *MaxPercCut*), determines which segments are significantly different from the other segments. All events in those segments are removed. We also flag files that display any of the following: 1) not monotonically increasing in time, 2) sudden changes in fluorescence, 3) a large gradual change of fluorescence in all channels, or 4) a very large gradual change of fluorescence in one channel. (Throughout this vignette, four-character codes of only Ts and Fs will be seen, e.g., TFTF. These codes record the pass/flag results of the 4 flagging tests.)
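The per-segment summary described above can be sketched as follows. Note this is a hypothetical illustration, not flowCut's internal code; `segmentMeasures` is an invented helper name.

```r
# Hypothetical sketch of the eight per-segment measures described above;
# 'x' is the vector of events (one channel) falling in a single segment.
segmentMeasures <- function(x) {
  c(mean   = mean(x),
    median = median(x),
    quantile(x, probs = c(0.05, 0.20, 0.80, 0.95)),
    var    = var(x),                             # second moment (variation)
    skew   = mean((x - mean(x))^3) / sd(x)^3)    # third moment (skewness)
}

# Split one channel into segments of 500 events and compute the measures
events   <- rnorm(5000)
segments <- split(events, ceiling(seq_along(events) / 500))
measures <- sapply(segments, segmentMeasures)    # 8 measures x 10 segments
```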

The *flowCut* dataset contains flowFrames borrowed from the

We run *flowCut* with default parameters with the exception of

A list containing four elements will be returned. The first element, *$frame*, is the flowFrame returned after *flowCut* has cleaned the data. The second element,

```
## [,1]
## Is it monotonically increasing in time "T"
## Largest continuous jump "0.051"
## Continuous - Pass "T"
## Mean of % of range of means divided by range of data "0.075"
## Mean of % - Pass "T"
## Max of % of range of means divided by range of data "0.114"
## Max of % - Pass "T"
## Has a low density section been removed "T"
## % of low density removed "5.32"
## How many segments have been removed "4"
## % of events removed from segments removed "12.04"
## Worst channel "FL1-H"
## % of events removed "17.36"
## FileID "1"
## Type of Gating "MaxPercCut"
## Was the file run twice "No"
## Has the file passed "T"
```

Setting the *Plot* parameter to ‘All’ above created a plot in the flowCut folder under your current working directory, which is the same as Figure . The first five sub-plots in Figure are the channels that undergo fluorescence analysis by *flowCut*. The segments that are removed because they are significantly different are coloured black. The removed low density sections lower than

The top and bottom horizontal dark brown lines correspond to the 98th and 2nd percentiles of the data after cleaning. The 98th and 2nd percentile lines for the full data before cleaning are coloured light brown, but are most likely not visible. Most of the time these two sets of lines are indistinguishable on the plot because there is no significant change in percentiles before and after cleaning. We found that the top and bottom 2\(\%\) of events are spread out too much to serve as reliable maximum and minimum values of the file when comparing the range of the means to the range of the data.

Connecting the means of all segments gives the brown line in the middle. Sometimes the brown line has pink parts; the pink parts are the means of the segments before cleaning. From the difference between pink and brown, the user can see which segments were removed, and perhaps understand why as well. Of course, this shows only the mean, one of the 8 measures used to judge whether a segment is removed, so information on the other 7 measures is not shown.

The numbers on top of each of the eight plots (in the form A / B ( C )) indicate the mean drift before cutting (A), the mean drift after cutting (B) and the maximum one-step change after cutting (C). The mean drift is calculated as the difference between the maximum and minimum segment means divided by the 2-98 percentile difference of the data. This catches any gradual changes or fluctuations in the file. If the mean drift is significant, *flowCut* will attempt to clean the file (pre-cleaning) or flag it (post-cleaning). The parameters *MeanOfMeans* and *MaxOfMeans* control the strictness of these flags, and we recommend not changing these values unless a greater or weaker strictness is desired. The *MeanOfMeans* parameter is the threshold for the average of the range of means over all channels, whereas *MaxOfMeans* is the threshold for the single channel with the highest range of means.
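The mean drift formula above can be sketched in a few lines of R. This is a hypothetical illustration of the calculation, not flowCut's internal implementation:

```r
# Mean drift of one channel: range of the segment means divided by the
# 2-98 percentile range of the data (sketch of the formula above).
meanDrift <- function(events, segmentSize = 500) {
  segs  <- split(events, ceiling(seq_along(events) / segmentSize))
  means <- sapply(segs, mean)
  q     <- quantile(events, probs = c(0.02, 0.98))
  (max(means) - min(means)) / (q[2] - q[1])
}

# A stable channel yields a small drift; a drifting one yields a large drift
meanDrift(rnorm(5000))                                 # small value
meanDrift(rnorm(5000) + seq(0, 3, length.out = 5000))  # much larger value
```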

Figure illustrates the range of the means of the segments on the CD33 channel from Figure . We will explain the ideas in the previous paragraph with this picture in mind. The range of 0.231 in Figure is the difference between the maximum and the minimum segment mean in one channel. If this range for the largest channel exceeds *MaxOfMeans*, the file will undergo cleaning. If the average of this range over all markers exceeds *MeanOfMeans*, the file will undergo cleaning. In other words, *MeanOfMeans* and *MaxOfMeans* essentially set a restriction on the tolerable mean drift.

The value in the bracket (C) is associated with the parameter *MaxContin*, which bounds the change in means between adjacent segments in each channel. This parameter catches abrupt mean changes such as spikes. The default is 0.1 and we recommend not changing this value. In other words, if adjacent segments have differences in their means that exceed *MaxContin*, then the file will undergo cleaning.
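The adjacent-segment check can be sketched in the same style as the mean drift. Again this is a hypothetical illustration, not flowCut's internal code:

```r
# Largest one-step change between adjacent segment means, scaled by the
# 2-98 percentile range of the data; compared against MaxContin (default 0.1).
maxContinChange <- function(events, segmentSize = 500) {
  segs  <- split(events, ceiling(seq_along(events) / segmentSize))
  means <- sapply(segs, mean)
  q     <- quantile(events, probs = c(0.02, 0.98))
  max(abs(diff(means))) / (q[2] - q[1])
}

# A sudden spike in one segment produces a large adjacent-segment change
x <- rnorm(5000)
x[2001:2500] <- x[2001:2500] + 4   # simulate a spike in segment 5
maxContinChange(x)                 # well above the 0.1 default
```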

If, after cleaning, the file still exceeds at least one of the threshold parameters (*MaxOfMeans*, *MeanOfMeans*, *MaxContin*), then the file will be flagged. Hence, these three parameters judge first whether the file needs to be cleaned, and second whether it was cleaned sufficiently to avoid being flagged afterwards.

The 8 statistical measures *flowCut* calculates for each segment are summed (over all cleaning channels) to obtain a single statistical value per segment. Plotting the density of these summed measures over all segments gives the last plot in Figure . The

The title of the image displays whether the file passed or was flagged (this information is also visible in *$data*). In Figure , ‘1 Passed TTTT’ is displayed. The ‘1’ is the file’s ID, ‘Passed’ states that the file passed, and ‘TTTT’ shows that it passed all four flagging tests (please see the Introduction for the four flagging tests).

Generally, the deGate function returns a gating value from a 1D density profile. It was originally designed to separate cell populations, but it can also be utilized in our outlier detection methodology. We adapted the deGate function so that it always returns a gating line to the right of the majority of events (or the highest peak), because we are only interested in removing the cells that are most different. Since the density of summed measures distribution is of Z-scores, the significantly different segments naturally lie on the right of the distribution.

The size of the second biggest peak plays a role in the gating process. If the second biggest peak is less than 10% of the height of the biggest peak, the file is treated as a uni-modal distribution. This is achieved by setting the *tinypeak.removal* parameter in deGate to 0.1. For these uni-modal distributions, we want to calculate a gating threshold to the right of the highest peak that naturally separates the segments in the density of summed measures distribution. Therefore, we want to find a valley that is close to zero in the density distribution. To do this, we first allow deGate to search for valleys that have at least 0.1% (*tinypeak.removal* = 0.001) of the height of the highest peak, and then require the valley to be very short. How short a valley is allowed can be controlled by adjusting the *MaxValleyHgt* parameter; we find that setting it to 0.1 forces the algorithm to always find the best natural separation in the density of summed measures distribution. Details on how changing this parameter affects results are discussed in the next section. There are times when the second biggest peak is larger than 10% of the height of the highest peak. We label these cases as bi-modal, and calculating the gating line between the two peaks is straightforward using deGate.
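Assuming deGate from the flowDensity package, the uni-modal gating step described above can be sketched as follows. The one-channel flowFrame setup here is a hypothetical stand-in for what flowCut does internally with the summed-measure Z-scores:

```r
library(flowCore)     # flowFrame container
library(flowDensity)  # provides deGate

# Hypothetical setup: wrap summed-measure Z-scores in a one-channel
# flowFrame, mimicking what flowCut does internally before gating.
z  <- c(rnorm(480), rnorm(20, mean = 5))   # mostly typical segments, a few outliers
ff <- flowFrame(matrix(z, ncol = 1, dimnames = list(NULL, "Z")))

# Uni-modal case: search for short valleys (down to 0.1% of the main
# peak's height) to the right of the highest peak.
cutoff <- deGate(ff, channel = "Z", tinypeak.removal = 0.001, upper = TRUE)
```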

There are two additional cases that would stop the algorithm from cleaning a file that seems to have a clearly separable distribution. The first is that *flowCut* has a preliminary checking step. If the file passes all the flagging tests (values less than

This example shows what *MaxValleyHgt* does and how changing its value affects the segment deletion analysis.

The *MaxValleyHgt* parameter plays a role in calculating the threshold on the density of summed measures distribution. *MaxValleyHgt* defines the upper limit of the ratio between the height of the point where the cutoff line intersects the density plot and the height of the maximum peak. Setting the number higher can potentially cut off more events, whereas lowering it will potentially remove fewer.

For example, if we set *MaxValleyHgt* to 0.01 we have:
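A sketch of such a call, where `f` stands in for a flowFrame loaded earlier in the vignette:

```r
# Re-run flowCut with a stricter MaxValleyHgt ('f' is a placeholder flowFrame)
res <- flowCut(f, MaxValleyHgt = 0.01, Plot = "All")
res$data   # summary table, including the pass/flag status
```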