1 Introduction

The package provides a high-level processing of genomic datasets focused in nucleosome positioning experiments, despite they should be also applicable to chromatin inmunoprecipitation (ChIP) experiments in general.

The aim of this package is not providing an all-in-one data analysis pipeline but complement those existing specialized libraries for low-level data importation and pre-processment into / framework.

works with data from the two main high-troughput technologies available nowadays for ChIP: Next Generation Sequencing/NGS (ChIP-seq) and Tiling Microarrays (ChIP-on-Chip).

This is a brief summary of the main functions:

For more details about the functions and how to use them refer to the manual.

This software was published in Bioinformatics Journal. See the paper for additional information [@Flores2011].

2 Reading data

As mentioned previously, uses the pre-processed data of other lower level packages for data importation, supporting a few but common formats that should fulfill the requirements of most users.

ExpressionSet from package is used for Tiling Array experiments as described in and other packages for the Tiling Array manipulation. This kind of experiments can be readed with the processTilingArray function.

AlignedRead from package is recommended for NGS, covering most of the state of the art sequencing technologies. Additionally, support for reads in RangedData format is also provided (a range per read with a strand column).

2.1 Reading Tiling Arrays

Tiling Arrays are a cheap and fast way to have low-resolution nucleosome coverage maps. They have been widely used in literature [@Yuan2005, @Lee2007, @Mavrich2008], but complex statistical methods were needed for their processing [@Liu2007].

This kind of microarrays cover a part of the genome with certain spacing between probes which causes a drop in the resolution and originates some problems. The nucleosome calling from Tiling Array data required hard work on bioinformatics side and use of heavy and artificious statistical machinery such as Hidden Markov Models [@Yuan2005, @Lee2007] or higher order Bayesian Networks [@Kuan2009].

presents a new method based on a simple but effective peak calling method which achieves a great performance at low computing cost that will be presented in subsequent sections.

In order to standardize the data coming both from Tiling Arrays and NGS, the array fluorescence intensities (usually the ratio of the hybridization of nucleosomal and control sample) are converted to 1bp resolution by inferring the missed values from the neighboring probes. This is done by the function processTilingArray:

processTilingArray(data, exprName, chrPattern, inferLen=50)

An example of a processed dataset is provided in this package. See the help page of tilingArray_preproc for details on how it has been created. This object is a numeric vector covering the 8000 first positions of chromosome 1 in yeast (Saccharomices cerevisiae genome SacCer1).

head(nucleosome_tiling, n=25)
#>  [1] 1.273222 1.281978 1.290734 1.299490 1.308246 1.352696 1.397145 1.441595
#>  [9] 1.486044 1.501795 1.517547 1.533298 1.549049 1.547577 1.546105 1.544633
#> [17] 1.543161 1.539886 1.536612 1.533337 1.530063 1.488922 1.447782 1.406642
#> [25] 1.365502

This values represent the normalized fluorescence intensity from hybridized sample of nucleosomal DNA versus naked DNA obtained from . The values can be either direct observations (if a probe was starting at that position) or a inferred value from neighboring probes. This data can be passed directly to the filtering functions, as described later in the section 3.

2.2 Importing BAM files

Additionally, the function importBAM, allows to directly import into the mapped reads of a NGS experiment contained in a BAM file. The user has to specify whether the file contains paired-end or single-end read fragments.

sample.file <- system.file("extdata", "cellCycleM_chrII_5000-25000.bam",
reads <- readBAM(sample.file, type="paired")
#> GRanges object with 6 ranges and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]    chrII 5790-5912      *
#>   [2]    chrII 5791-5920      *
#>   [3]    chrII 5809-5914      *
#>   [4]    chrII 5811-5983      *
#>   [5]    chrII 5815-5934      *
#>   [6]    chrII 5822-6096      *
#>   -------
#>   seqinfo: 17 sequences from an unspecified genome; no seqlengths

2.3 Next Generation Sequencing

NGS has become one of the most popular technique to map nucleosome in the genome in the last years [@Kaplan2009, @Schones2008, @Xi2010]. The drop of the costs of a genome wide sequencing together with the high resolution coverage maps obtained, made it the election of many scientists.

The package allows reading of the data coming from many sources (Bowtie, MAQ, Illumina pipeline…) and has become one of the most popular packages in / for NGS data manipulation.

A new package, called , has been recently created to perform preprocessing and quality assesment on NGS experiments. supports most of the output generated by the functions on that package and recommends its use for quality control and correction of common biases that affect NGS.

handles ShortRead and RangedData data formats. The dataset nucleosome_htseq includes some NGS reads obtained from a nucleosome positioning experiment also from yeast genome, following a protocol similar to the one described in [@Lee2007].

The paired-end reads coming from Illumina Genome Analyzer II sequencer were mapped using Bowtie and imported into using . Paired ends where merged and sorted according the start position. Those in the first 8000bp of chromosome 1 where saved for this example. Further details are in the reference [@Deniz2011]:

#> [1] "GRanges"
#> attr(,"package")
#> [1] "GenomicRanges"
#> GRanges object with 18001 ranges and 0 metadata columns:
#>           seqnames    ranges strand
#>              <Rle> <IRanges>  <Rle>
#>       [1]     chr1     1-284      +
#>       [2]     chr1     5-205      +
#>       [3]     chr1     5-205      +
#>       [4]     chr1     5-209      +
#>       [5]     chr1     5-283      +
#>       ...      ...       ...    ...
#>   [17997]     chr1 7994-8151      +
#>   [17998]     chr1 7994-8151      +
#>   [17999]     chr1 7994-8151      +
#>   [18000]     chr1 7994-8152      +
#>   [18001]     chr1 7994-8152      +
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Now we will transform the reads to a normalized format. Moreover, as the data is paired-ended and we are only interested in mononucleosomes (which are typically 147bp), we will discard the reads with a length greater than 200bp, allowing margin for some underdigestion but discarding extra long reads. Note that the behaviour of fragmentLen is different for single-ended data, see the manual page of this function for detailed information.

As our final objective is identifying the nucleosome positions, and does it from the dyad, we will increase the sharpness of the dyads by removing some bases from the ends of each read. In the next example, we will create two new objects, one with the original paired-end reads and another one with the reads trimmed to the middle 40bp around the dyad (using the trim argument).

# Process the paired end reads, but discard those with length > 200
reads_pair <- processReads(nucleosome_htseq, type="paired", fragmentLen=200)

# Process the reads, but now trim each read to 40bp around the dyad
reads_trim <- processReads(nucleosome_htseq, type="paired", fragmentLen=200,

The next step is obtain the coverage (the count of how many reads are in each position). The standard package function coverage will work well here, but it is a common practice to normalize the coverage values according to the total number of short reads obtained in the NGS experiment. The common used unit is reads per milon (r.p.m.) which is the coverage value divided by the total number of reads and multiplied per one milion. A quick and efficient way to do this with is the coverage.rpm function. 1 Note that conversion in the example dataset gives huge values. This is because r.p.m. expects a large number of reads, and this dataset is only a fraction of a whole one. Also take into account that reads from single-ended (or trimmed reads) and reads from paired-ended could have different mean value of coverage

# Calculate the coverage, directly in reads per million (r.p.m)
cover_pair <- coverage.rpm(reads_pair)
cover_trim <- coverage.rpm(reads_trim)

In Figure 1 we can observe the effect of trim attribute plotting both coverages. Note that the coverages are normalized in the range 0–1:

2.4 MNase bias correction

Variation in the sharpness of the peaks using \texttt{trim} attribute.

Figure 1: Variation in the sharpness of the peaks using attribute

The Microccocal Nuclease is a widely used enzyme that has been proved to have a biase for certain dinucleotide steps [@Deniz2011]. In this package we offer a quick way to inspect the effect of such artifact by correcting the profiles of nucleosomal DNA reads with a mock sample of naked DNA digested with MNase.

The use of this function requires a paired-end control sample and a paired end or extended single-read nucleosomal DNA sample. A toy example generated using synthetic data can be found in Figure 2.

# Toy example
map <- syntheticNucMap(as.ratio=TRUE, wp.num=50, fuz.num=25)
exp <- coverage(map$syn.reads)
ctr <- coverage(map$ctr.reads)
corrected <- controlCorrection(exp, ctr)
Toy example of MNase biase correction. Random nucleosomal and control reads have been generated using \texttt{synteticNucMap} function and corrected using \texttt{controlCorrect}.

Figure 2: Toy example of MNase biase correction
Random nucleosomal and control reads have been generated using function and corrected using .

3 Signal Smoothing and Nucleosome Calling

In the previous sections we converted the experimental data from NGS or Tiling Arrays to a continous, 1bp resolution signal. In this section we will remove the noise present in the data and score the peaks identified, giving place to the nucleosome calls.

Previously, in the literature, Hidden Markov Models, Support Vector Machines or other complex intelligent agents where used for this task [@Yuan2005, @Lee2007, @Kuan2009, @Chen2010, @Xi2010]. This was needed for dealing with the noise and uncertain characterization of the fuzzy positioning of the nucleosomes.

Despite this approach is a valid way to face the problem, the use of such artificious constructs is difficult to implement and sometimes requires a subjective modeling of the solution, constraining or at least conditioning the results observed.

The method presented here proposes to keep it simple, allowing the researcher to study the results he or she is interested a posteriori.

aim is to evaluate where the nucleosomes are located and how accurate that position is. We can find a nucleosome read in virtually any place in the genome, but some positions will show a high concentration and will allow us to mark this nucleosome as well-positioned whereas other will be less phased giving place to fuzzy or de-localized nucleosomes [@Jiang2009].

We think it’s better to provide a detailed but convenient identification of the relevant nucleosome regions and score them according to its grade of fuzziness. From our point of view, every researcher should make the final decision regarding filtering, merging or classifying the nucleosomes according its necessities, and is only a tool to help in this dirty part of the research.

3.1 Noise removal

NGS and specially Tiling Array data show a very noisy profile which complicates the process of the nucleosome detection from peaks in the signal. A common approach used in the literature is smooth the signal with a sliding window average and then use a Hidden Markov Model to calculate the probabilities of having one or another state.

Original intensities from tiling array experiment. Smoothing using a sliding window of variable length (0, 20, 50 and 100 bp) is presented.

Figure 3: Original intensities from tiling array experiment
Smoothing using a sliding window of variable length (0, 20, 50 and 100 bp) is presented.

As can be seen in Figure 3, data needs some smoothing to be interpretable, but a simple sliding window average is not sufficient. Short windows allow too much noise but larger ones change the position and the shape of the peaks.

proposes a method of filtering based on the Fourier Analysis of the signal and the selection of its principal components.

Any signal can be described as a function of individual periodic waves with different frequencies and the combination of them creates more complex signals. The noise in a signal can be described as a small, non periodic fluctuations, and can be easily identified and removed [@Smith1999].

uses this theory to transform the input data into the Fourier space using the Fast Fourier Transform (FFT). A FFT has a real and a imaginary component. The representation of the real component it’s called the power spectrum of the signals and shows which are the frequencies that have more weight (power) in the signal. The low frequency components (so, very periodic) usually have a huge influence in the composite signal, but its rellevance drops as the frequency increases.

We can look at the power spectrum of the example dataset with the following command:

fft_ta <- filterFFT(nucleosome_tiling, pcKeepComp=0.01, showPowerSpec=TRUE)
Power spectrum of the example Tiling Array data, percentile 1 marked with a dashed line.

Figure 4: Power spectrum of the example Tiling Array data, percentile 1 marked with a dashed line

In the Figure 4 only the half of the components are plotted, as the spectrum is repeated symmetrically respect to its middle point. The first component (not shown in the plot), has period 1, and, in practice, is a count of the lenght of the signal, so it has a large value.

High frequency signals are usually echoes (repeating waves) of lower frequencies, i.e. a peak at 10 will be the sum of the pure frequence 10 plus the echo of the frequency 5 in its 2nd repetition. Echoes can be ignored without losing relevant information.

The approach follows is supposing that with just a small percentage of the components of the signal, the input signal can be recreated with a high precision, but without a significant amount of noise. We check empirically that with 1% or 2% of the components (this means account 1 or 2 components for each 100 positions of the genomic data) it’s enough to recreate the signal with a very high correlation (>0.99). Tiling Array could require more smoothing (about 1% should be fine) and NGS has less noise and more components can be selected for fitting better the data (about 2%), See Figure 4 for the selected components in the example.

In order to easy the choice of the pcKeepComp parameter, includes a function for automatic detection of a fitted value that provides a correlation between the original and the filtered profiles close to the one specified. See the manual page of pcKeepCompDetect for detailed information.

In short, the cleaning process consists on converting the coverage/intensity values to the Fourier space, and knock-out (set to 0) the components greater than the given percentile in order to remove the noise from the profile. Then the inverse Fast Fourier Transform is applyied to recreate the filtered signal. In Figure 5 the filtered signal is overlapped to the raw signal.

The cleaning of the input has almost no effect on the position and shape of the peaks, mantaining a high correlation with the original signal but allowing achieve a great performance with a simple peak detection algorithm: