0. Installation

Please note that ‘rifi’ is only available for Unix based systems. To install this package, start R (>= version “4.2”) and enter:

if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")

BiocManager::install("rifi")

I. Introduction

The stability or halflife of bacterial transcripts is often estimated using Rifampicin timeseries data. Rifampicin has the special feature that it prevents the initiation of transcprition, but RNA polymerases which are already elongating are unaffected (Campbell et al. 2001). This has the implication that the RNA concentrations of positions downstream of the transcriptional start site appear unchanged until the last polymerase has passed this point. The result is a delayed exponential decay (Chen et al. 2015), which can be fitted by the following model:

$$c(t,n) = \begin{cases} \frac{\alpha}{\lambda} & \quad \text{if } t < \frac{n}{v}\\ \frac{\alpha}{\lambda} \times e^{-\lambda t} & \quad \text{if } t \geq \frac{n}{v} \end{cases}$$

The model (Chen et al. 2015) consists of two phases; the firts phase describes the delay where the transcript concentration is in its steady state defined by the ratio of the synthesis rate $$\alpha$$ and the decay constant $$\lambda$$ ($$steadystate = \frac{\alpha}{\lambda}$$). The delay is dependent on the distance from the transcriptional start site $$n$$ and the transcription velocity $$v$$. If the time after the Rifampicin additon is greater than the delay ($$delay = \frac{n}{v}$$) the exponential decay phase starts.

In addition to the standard model, we are using a second model which describes the behaviour at positions were the concentration increases after Rifampicin addition (Figure 1, right panel). This phenomenon can be explained by Rifampicin relievable transcription termination, e.g. through the transcriptional interference (TI) collision mechanism (Shearwin, Callen, and Egan 2005) or termination by short-lived factors such as sRNAs (Wang et al. 2015). In the following we will call this model the ‘TI model’ which consists of three phases:

$$c(t,n) = \begin{cases} \frac{\alpha - \alpha \times \beta}{\lambda} & \quad \text{if } t < \frac{n - n_{term}}{v}\\ \frac{\alpha}{\lambda} - \frac{\alpha \times \beta}{\lambda} \times e ^{-\lambda (t -\frac{n - n_{term}}{v})} & \quad \text{if } \frac{n - n_{term}}{v} < t < \frac{n}{v}\\ (\frac{\alpha}{\lambda} - \frac{\alpha \times \beta}{\lambda} \times e ^{-\lambda (t -\frac{n_{term}}{v})}) \times e^{-\lambda (t-\frac{n}{v})} & \quad \text{if } \frac{n}{v} \leq t \end{cases}$$

The first phase describes again the steady state concentration at a given transcript position, but here the synthesis rate $$\alpha$$ is reduced by the TI-termination-factor $$\beta$$. We assume a short lived factor responsible for the termination whose synthesis is stopped after rifampicin addition. Thus after the relieve of termination all polymerases that start at the transcriptional start site can reach positions downstream of the former termination site ($$n_{term}$$), the time polymerases need from the position of termination to the position $$n$$ is delay for the increase ($$delay_{increase}= \frac{n - n_{term}}{v}$$). After the last polymerase has passed the respective position, the exponential decay phase starts.

‘rifi’ is a tool to do a stability analysis on high-throughput rifampicin data. RNA sequencing and microarray data derived from rifampicin treated bacteria with sufficiently high time resolution can reveal many insights into the mechanics of transcription, RNAP velocity and RNA stability. ‘rifi’ is a tool for the holistic identification of these transcription processes. The core part of the data analysis by rifi is the utilization of one of the two non linear regression models applied on the time series data of each probe (or bin), giving the probe/bin specific delay, decay constant $$\lambda$$ and half-life ($$t_\frac{1}{2} = \frac{\ln(2)}{\lambda}$$) (Figure 1, left panel).

After the fit of the individual probes/bins, common worklfows usually combine the individual half-life values based on the given genome annotation to get an average for the gene based stability. This procedure can not deal with differences within a given gene, e.g. due to processing sites. ‘rifi’ uses an annotation agnostic approach to get an unbiased estimate of individual transcripts as they actually appear in vivo. probes/bins with equal properties in the extracted values delay, half-life, TI_termination_factor and the given intensity values are combined into segments by dynamic programming (called fragmentation in ‘rifi’), independent of an existing genome annotation (Figure 2). The fragmentation is performed hierarchically.
Initially segments of bins are grouped by regions without significant sequencing depth into position_segments. Those are grouped into delay_fragments by common velocity. Subsequently, each delay-fragment is grouped by similar half-life into half_life_fragments, on which the bins finally are grouped into intensity_fragments by similar intensity. From the fragmentation, many events can be extracted; iTSS (internal transcription start sites), transcription pausing_sites, velocity_changes,processing_sites, partial terminations, as well as instances of Rifampicin relievable transcription termination, e.g.  by TI (transcription interference). All data are integrated to give an estimate of continuous transcriptional units, i.e. operons. Comprehensive output tables and visualizations of the full genome result and the individual fits for all probes/bins are produced.

1. Quickstart

If you have your data prepared as described in [The Input Data Frame] (#the-input-data-frame) you can use the rifi_wrapper to run ’rifi" with default options. rifi_wrapper conveniently wraps all functions included in rifi. That allows the user to run the whole workflow with one function. If the data contain a background component, e.g. in case of microarray data, take to define a meaningful background intensity.

The functions used are:

For rifi_wrapper you only need to provide the path to a .gff file of the respective genome and the input SummarizedExperiment object. The genome annotation is needed for the visualization and to map fragmented segments to annotated genes for an easier interpretation.

library(rifi)
data(example_input_e_coli)
Path = gzfile(system.file("extdata", "gff_e_coli.gff3.gz", package = "rifi"))
wrapper_minimal <-
rifi_wrapper(
inp = example_input_e_coli,
cores = 2,
path = Path,
bg = 0,
restr = 0.01
)
}

The wrapper saves the output of each sub-function in a list. Thus each intermediate result can be re-run with custom settings. the object ‘wrapper_minimal’ contains only minimal artificial data to reduce the runtime of the test run.

data(wrapper_minimal)

# list of 6 SummarizedExperiment objects
length(wrapper_minimal)
## [1] 6
#the first intermediate result
wrapper_minimal[[1]]
## class: RangedSummarizedExperiment
## dim: 4 33
## assays(1): ''
## rownames: NULL
## rowData names(7): position ID ... flag position_segment
## colnames: NULL
## colData names(2): timepoint replicate

2. The output

A small example output can be loaded with data(summary_minimal). The final output is a SummarizedExperiment object.

All main results are stored as metadata in the SummarizedExperiment object. The first five entries are not of immediate importance.

data(summary_minimal)

# the main results
names(metadata(summary_minimal))   
##  [1] "timepoints"                        "replicates"
##  [3] "logbook"                           "logbook_details"
##  [5] "annot"                             "dataframe_summary_1"
##  [7] "dataframe_summary_2"               "dataframe_summary_events"
##  [9] "dataframe_summary_events_HL_int"   "dataframe_summary_events_ps_itss"
## [11] "dataframe_summary_events_velocity" "dataframe_summary_TI"

1. bin/probe based results

Entry 6, dataframe_summary_1, contains all fit results for each individual bin/probe and a mapping to the genome annotation. They can be exported to an .csv file using write.csv().

# bin/probe probe based results
head(metadata(summary_minimal)$dataframe_summary_1) ## ID feature_type gene locus_tag position strand segment TU ## 1 1 <NA> <NA> <NA> 50 + S_1 TU_1 ## 2 2 <NA> <NA> <NA> 100 + S_1 TU_1 ## 3 3 <NA> <NA> <NA> 150 + S_1 TU_1 ## 4 4 CDS thrL BW25113_RS00005 200 + S_1 TU_1 ## 5 5 CDS thrL BW25113_RS00005 250 + S_1 TU_1 ## 6 6 <NA> <NA> <NA> 300 + S_1 TU_1 ## delay_fragment delay HL_fragment half_life intensity_fragment intensity flag ## 1 D_1 0.08 Dc_1 1.57 I_1 138.55 _ABG_ ## 2 D_1 0.17 Dc_1 1.54 I_1 146.89 _ABG_ ## 3 D_1 0.50 Dc_1 1.52 I_1 152.35 _ABG_ ## 4 D_1 0.76 Dc_1 1.54 I_1 163.39 _ABG_ ## 5 D_1 1.00 Dc_1 1.50 I_1 149.01 _ABG_ ## 6 D_1 1.21 Dc_1 1.58 I_1 160.37 _ABG_ ## TI_termination_factor ## 1 NA ## 2 NA ## 3 NA ## 4 NA ## 5 NA ## 6 NA #export to csv write.csv(metadata(summary_minimal)$dataframe_summary_1, file="filename.csv")

2. fragment based results

Entry 7, dataframe_summary_2, contains all fit results for each intensity_fragment. Due to the hierachic fragmentation strategy [Figure 2] several intensity_fragment (I_x) can belong to one decay_fragment (Dc_x), one delay_fragment (D_x) and one transcriptional unit (TU_x). The fragment IDs can be used to locate the fragment in the output PDF file. The fragments are mapped to the genome annotation and the mean halflife given in minutes and the mean intensity is given with their standard deviations. Also the estimated velocity in nt/min is given. The table can be exported as an .csv file using write.csv().

# all estimated fragments
metadata(summary_minimal)$dataframe_summary_2 ## feature_type gene locus_tag first_position_frg last_position_frg ## 1 NA|CDS NA|thrL NA|BW25113_RS00005 50 300 ## 2 CDS thrA BW25113_RS00010 350 600 ## 3 CDS thrA BW25113_RS00010 650 900 ## 4 NA NA NA 1151 1051 ## 5 NA NA NA 1001 901 ## strand TU segment delay_fragment HL_fragment half_life HL_SD HL_SE ## 1 + TU_1 S_1 D_1 Dc_1 1.54 0.03 0.01 ## 2 + TU_1 S_1 D_1 Dc_2 3.04 0.04 0.02 ## 3 + TU_1 S_1 D_2 Dc_3 3.00 0.05 0.02 ## 4 - TU_2 S_2 D_3 Dc_4 2.00 0.02 0.01 ## 5 - TU_2 S_2 D_3 Dc_4 2.00 0.02 0.01 ## intensity_fragment intensity intensity_SD intensity_SE velocity ## 1 I_1 151.53 9.12 3.72 143.86 ## 2 I_2 300.24 5.75 2.35 143.86 ## 3 I_3 446.67 7.36 3.00 417.25 ## 4 I_4 205.03 1.78 1.03 201.10 ## 5 I_5 102.47 3.88 2.24 201.10 #export to csv write.csv(metadata(summary_minimal)$dataframe_summary_2, file="filename.csv")

3. Transcription events

Entry 8, dataframe_summary_events, contains all estimated transcriptional events. Events always appear between two transcript fragments.

# all estimated events
metadata(summary_minimal)$dataframe_summary_events ## event p_value p_adjusted FC_HL FC_intensity FC_HL_adapted ## 1 iTSS_I NA NA NA NA NA ## 2 Termination NA NA NA NA 0.99 ## 3 iTSS_II NA NA NA NA 1.97 ## 4 iTSS_II NA NA NA NA 0.99 ## 5 Int_event NA NA NA 0.98 1.97 ## 6 Int_event NA NA NA 0.57 0.99 ## 7 Int_event NA NA NA -1.00 0.99 ## 8 HL_event 2.68e-10 4.02e-10 0.98 NA NA ## 9 HL_event 1.07e-11 3.21e-11 -0.02 NA NA ## 10 velocity 1.91e-04 1.91e-04 NA NA NA ## FC_HL_FC_intensity event_position velocity_ratio feature_type gene ## 1 NA 625 NA CDS thrA ## 2 0.50 1026 NA ## 3 1.00 325 NA ## 4 1.51 625 NA CDS thrA ## 5 NA 325 NA ## 6 NA 625 NA CDS thrA ## 7 NA 1026 NA ## 8 NA 325 NA ## 9 NA 625 NA CDS thrA ## 10 NA 625 2.9 CDS thrA ## locus_tag strand TU segment_1 segment_2 ## 1 BW25113_RS00010 + TU_1 S_1|TU_1|D_1 S_1|TU_1|D_2 ## 2 - TU_2 S_2|TU_2|D_3|Dc_4|I_4 S_2|TU_2|D_3|Dc_4|I_5 ## 3 + TU_1 S_1|TU_1|D_1|Dc_1|I_1 S_1|TU_1|D_1|Dc_2|I_2 ## 4 BW25113_RS00010 + TU_1 S_1|TU_1|D_1|Dc_2|I_2 S_1|TU_1|D_1|Dc_3|I_3 ## 5 + TU_1 S_1|TU_1|D_1|Dc_1|I_1 S_1|TU_1|D_1|Dc_1|I_2 ## 6 BW25113_RS00010 + TU_1 S_1|TU_1|D_1|Dc_2|I_2 S_1|TU_1|D_1|Dc_2|I_3 ## 7 - TU_2 S_2|TU_2|D_3|Dc_4|I_4 S_2|TU_2|D_3|Dc_4|I_5 ## 8 + TU_1 S_1|TU_1|D_1|Dc_1 S_1|TU_1|D_1|Dc_2 ## 9 BW25113_RS00010 + TU_1 S_1|TU_1|D_1|Dc_2 S_1|TU_1|D_1|Dc_3 ## 10 BW25113_RS00010 + TU_1 S_1|TU_1|D_1 S_1|TU_1|D_2 ## event_duration gap_fragments features ## 1 -1.27 50 2 ## 2 NA 50 3 ## 3 NA 50 4 ## 4 NA 50 4 ## 5 NA 50 2 ## 6 NA 50 2 ## 7 NA 50 2 ## 8 NA 50 2 ## 9 NA 50 2 ## 10 NA 50 2 #export to csv write.csv(metadata(summary_minimal)$dataframe_summary_events, file="filename.csv")

Changes in the linear increase of the delay indicate a potential pausing site if there is a sudden delay increase (Figure 3A), a potential internal starting site iTSS_I if there is a sudden decrease in the delay (Figure 3B) and a velocity change of the RNA polymerase if there is a slope change (Figure 3C). The events are statistically evaluated by Ancova (apply_ancova) or a t-test (apply_Ttest_delay).
A fragment border between halflife segments (HL_event) which belong to the same transcriptional unit might indicate a processing site (Figure 3D) and a fragment border between intensity fragments (Int_event) (Figure 3E) can indicate a processing site (Figure 3F), a new transcriptional start site (iTSS_II) (Figure 3G) or an partial termination (Figure 3H), depending on the respective intensity foldchanges and the halflife foldchanges.
Each event is described by is type, its p-Value and adjusted p-Value, the foldchange or event duration, the estimated event position, a mapping to existing annotations and the IDs of the two bordering fragments.