A fast introduction into the analysis of DNA mutational events by means of automorphisms between two DNA sequences algebraically represented as Abelian finite group.

This is a R package to compute the automorphisms between pairwise aligned DNA
sequences represented as elements from a Genomic Abelian group as described in
reference (1). In a general scenario, whole chromosomes or genomic
regions from a population (from any species or close related species) can be
algebraically represented as a direct sum of cyclic groups or more specifically
Abelian *p*-groups. Basically, we propose the representation of multiple
sequence alignments (MSA) of length *N* as a finite Abelian group created by
the direct sum of Abelian group of *prime-power order*:

\[ \qquad G = (\mathbb{Z}_{p^{\alpha_{1}}_1})^{n_1} \oplus (\mathbb{Z}_{p^{\alpha_{2}}_1})^{n_2} \oplus \dots \oplus (\mathbb{Z}_{p^{\alpha_{k}}_k})^{n_k} \]

Where, the \(p_i\)’s are prime numbers, \(\alpha_i \in \mathbb{N}\) and \(\mathbb{Z}_{p^{\alpha_{i}}_i}\) is the group of integer modulo \(p^{\alpha_{i}}_i\).

For the purpose of estimating the automorphism between two aligned DNA sequences, \(p^{\alpha_{i}}_i \in \{5, 2^6, 5^3 \}\).

Herein, automorphisms are considered algebraic descriptions of mutational event observed in codon sequences represented on different Abelian groups. In particular, as described in references (3-4), for each representation of the codon set on a defined Abelian group there are 24 possible isomorphic Abelian groups. These Abelian groups can be labeled based on the DNA base-order used to generate them. The set of 24 Abelian groups can be described as a group isomorphic to the symmetric group of degree four (\(S_4\), see reference (4)).

For further support about the symmetric group on the 24 Abelian group of
genetic-code cubes, users can also see
Symmetric Group of the Genetic-CodeCubes.,
specifically the Mathematica notebook
*IntroductionToZ5GeneticCodeVectorSpace.nb* and interact with it using
Wolfram Player, freely available (for Windows and Linux OS) at,
https://www.wolfram.com/player/.

```
library(Biostrings)
library(GenomAutomorphism)
```

A pairwise sequence alignment of protein coding regions SARS coronavirus GZ02 (GenBank: AY390556.1) and Bat SARS-like coronavirus isolate Rs7327 (GenBank: KY417151.1) is provided with the package.

```
data(covid_aln, package = "GenomAutomorphism")
covid_aln#> DNAMultipleAlignment with 2 rows and 29166 columns
#> aln names
#> [1] ATGGAGAGCCTTGTTCTTGGTGTCAACGAGAAAACACACGTCCAAC...CAAAATTCCATGAGTGGAGCTTCTGCTGATTCAACTCAGGCATAA lcl|AY390556.1_cd...
#> [2] ATGGAGAGCCTTGTTCTTGGTGTCAACGAGAAAACACACGTCCAAC...CAAAATTCCATGAGTGGAGCTTCTGCTGATTCAACTCAGGCATAA lcl|KY417151.1_cd...
```

Group operations defined on the sets of DNA bases and codons are associated to
physicochemical or/and biophysical relationships between DNA bases and between
codons and aminoacids. In other words, a proper definition of a group
operation on the set of bases or on the set of codons will encode the
physicochemical or/and biophysical relationships between the set’s elements.
Thus, *by group operations defined on the set of bases or on the set of
codons, we understand an encoding applied to represent specified
physicochemical or/and biophysical relationships as group operations between
the elements of the set*. Then, we shall say that * such an encoding permits
the representation of DNA bases, codons, genes, and genomic sequences as
elements from algebraic structures*.

The DNA base set can be represented in 24 possible base orders, which leads to
24 possible representations of the genetic code. Each genetic code
representation base-triplets on the Galois field *GF(4)* (or in *GF(5)*) leads
to genetic code vector 3D-space, which is mathematically equivalent to a cube
inserted in the 3D space (1). Each cube is denoted according to the
corresponding base order.

Given a base-order, say ‘ACGT’, the Abelian group defined on this ordered set is isomorphic to the Abelian group defined on the set of integers modulo 4 (\(\mathbb{Z}_{4}\)). In practical terms, this is equivalent to replace each DNA base by the corresponding integer element. The base replacement in cube “ACGT and group”Z4" (\(\mathbb{Z}_{4}\)) is:

```
base2int("ACGT", group = "Z4", cube = "ACGT")
#> [1] 0 1 2 3
```

The base replacement in cube "ACGT and group ‘Z5’ (\(\mathbb{Z}_{5}\)):

```
base2int("ACGT", group = "Z5", cube = "ACGT")
#> [1] 1 2 3 4
```

After the DNA sequence is read, the corresponding codon sequences can be represented in the Abelian group \(\mathbb{Z}_{64}\) (i.e., the set of integers remainder modulo 64). The codon coordinates are requested on the cube ACGT. Following reference (4)), cubes are labeled based on the order of DNA bases used to define the sum operation.

```
codon_coord(
codons <-codon = covid_aln,
cube = "ACGT",
group = "Z64",
chr = 1L,
strand = "+",
start = 1,
end = 750)
codons#> CodonGroup object with 250 ranges and 4 metadata columns:
#> seqnames ranges strand | seq1 seq2 coord1 coord2
#> <Rle> <IRanges> <Rle> | <character> <character> <character> <character>
#> [1] 1 1 + | ATG ATG 50 50
#> [2] 1 2 + | GAG GAG 10 10
#> [3] 1 3 + | AGC AGC 33 33
#> [4] 1 4 + | CTT CTT 55 55
#> [5] 1 5 + | GTT GTT 59 59
#> ... ... ... ... . ... ... ... ...
#> [246] 1 246 + | GAT GAT 11 11
#> [247] 1 247 + | AAG AAG 2 2
#> [248] 1 248 + | AGC AGC 33 33
#> [249] 1 249 + | TAC TAT 13 15
#> [250] 1 250 + | GAG GAG 10 10
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
```

The codon sequences (*seq1* and *seq2*) with their corresponding coordinates
(left) are returned, as well as the coordinated representation on
\(\mathbb{Z}_{64}\) (*coord1* and *coord2*).

The particular interest are the coordinate representation on *“dual”*
genetic-code cubes. These are cubes where codons with complementary base pairs
have the same coordinates in the corresponding cubes, as shown in reference
(4)). Each pair of *“dual”* cubes integrates a group.

For example, let’s consider the complementary codons “ACG” and “TGC”, with complementary base pairs: A::T, C:::G, and G:::C, where symbol “:” denotes the hydrogen bonds between the bases.

```
c("ACG", "TGC")
x0 <- DNAStringSet(x0)
x1 <-
x1#> DNAStringSet object of length 2:
#> width seq
#> [1] 3 ACG
#> [2] 3 TGC
```

Their representations on the dual cubes “ACGT” and “TGCA” on \(\mathbb{Z}_{4}\) are:

```
base_coord(x1, cube = "ACGT")
x2 <-
x2#> BaseGroup object with 3 ranges and 4 metadata columns:
#> seqnames ranges strand | seq1 seq2 coord1 coord2
#> <Rle> <IRanges> <Rle> | <character> <character> <numeric> <numeric>
#> [1] 1 1 + | A T 0 3
#> [2] 1 2 + | C G 1 2
#> [3] 1 3 + | G C 2 1
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
base_coord(x1, cube = "TGCA")
x2. <-
x2.#> BaseGroup object with 3 ranges and 4 metadata columns:
#> seqnames ranges strand | seq1 seq2 coord1 coord2
#> <Rle> <IRanges> <Rle> | <character> <character> <numeric> <numeric>
#> [1] 1 1 + | A T 3 0
#> [2] 1 2 + | C G 2 1
#> [3] 1 3 + | G C 1 2
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
```

The sum of base coordinates modulo \(\mathbb{Z}_{4}\) is 3.

```
## cube "ACGT"
$coord1 + x2$coord2) %% 4
(x2#> [1] 3 3 3
## cube "TGCA"
$coord1 + x2.$coord2) %% 4
(x2.#> [1] 3 3 3
```

The same result for the same codon on different cubes

```
## Codon ACG
$coord1 + x2.$coord1) %% 4
(x2#> [1] 3 3 3
## Codon TGC
$coord2 + x2.$coord2) %% 4
(x2#> [1] 3 3 3
```

Their codon representation on \(\mathbb{Z}_{64}\) are:

```
## cube ACGT
codon_coord(codon = x2, group = "Z64")
x3 <-
x3#> CodonGroup object with 1 range and 4 metadata columns:
#> seqnames ranges strand | seq1 seq2 coord1 coord2
#> <Rle> <IRanges> <Rle> | <character> <character> <character> <character>
#> [1] 1 1 + | ACG TGC 18 45
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
## cube TGCA
codon_coord(codon = x2., group = "Z64")
x3. <-
x3.#> CodonGroup object with 1 range and 4 metadata columns:
#> seqnames ranges strand | seq1 seq2 coord1 coord2
#> <Rle> <IRanges> <Rle> | <character> <character> <character> <character>
#> [1] 1 1 + | ACG TGC 45 18
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
```

The sum of base coordinates modulo \(\mathbb{Z}_{64}\) is 63.

```
## cube "ACGT"
as.numeric(x3$coord1) + as.numeric(x3$coord2)) %% 64
(#> [1] 63
## cube "TGCA"
as.numeric(x3.$coord1) + as.numeric(x3.$coord2)) %% 64
(#> [1] 63
```

The same result for the same codon on different cubes

```
## Codon ACG
as.numeric(x3$coord1) + as.numeric(x3.$coord1)) %% 64
(#> [1] 63
## Codon TGC
as.numeric(x3$coord2) + as.numeric(x3.$coord2)) %% 64
(#> [1] 63
```

Automorphisms can be computed starting directly from the *FASTA* file. Notice
that we can work only with genomic regions of our interest by giving the
*start* and *end* alignment coordinates. In \(\mathbb{Z}_{64}\) automorphisms
are described as functions \(f(x) = k\,x\quad mod\,64\), where \(k\) and \(x\) are
elements from the set of integers modulo 64. Below, in function
automorphism three important arguments are given
values: *group = “Z64”*, *cube = c(“ACGT”, “TGCA”)*, and *cube_alt = c(“CATG”,
“GTAC”)*.

In groups “Z64” and “Z125” not all the mutational events can be described as
automorphisms from a given cube. The analysis of automorphisms is then
accomplished in the set of *dual* genetic-code cubes. A character string
denoting pairs of *dual* genetic-code cubes, is given as argument for *cube*.
Setting for group specifies on which group the automorphisms will be
computed. These groups can be: “Z5”, “Z64”, “Z125”, and “Z5^3”.

If automorphisms are not found in first set of dual cubes, then the algorithm search for automorphisms in a alternative set of dual cubes.

```
automorphisms(
autm <-seqs = covid_aln,
group = "Z64",
cube = c("ACGT", "TGCA"),
cube_alt = c("CATG", "GTAC"),
start = 1,
end = 750,
verbose = FALSE)
#> Warning: 'IS_BIOC_BUILD_MACHINE' environment variable detected, setting BiocParallel workers to 4 (was 71)
autm#> Automorphism object with 250 ranges and 8 metadata columns:
#> seqnames ranges strand | seq1 seq2 aa1 aa2 coord1 coord2 autm
#> <Rle> <IRanges> <Rle> | <character> <character> <character> <character> <numeric> <numeric> <numeric>
#> [1] 1 1 + | ATG ATG M M 50 50 1
#> [2] 1 2 + | GAG GAG E E 10 10 1
#> [3] 1 3 + | AGC AGC S S 33 33 1
#> [4] 1 4 + | CTT CTT L L 55 55 1
#> [5] 1 5 + | GTT GTT V V 59 59 1
#> ... ... ... ... . ... ... ... ... ... ... ...
#> [246] 1 246 + | GAT GAT D D 11 11 1
#> [247] 1 247 + | AAG AAG K K 2 2 1
#> [248] 1 248 + | AGC AGC S S 33 33 1
#> [249] 1 249 + | TAC TAT Y Y 13 15 11
#> [250] 1 250 + | GAG GAG E E 10 10 1
#> cube
#> <character>
#> [1] ACGT
#> [2] ACGT
#> [3] ACGT
#> [4] ACGT
#> [5] ACGT
#> ... ...
#> [246] ACGT
#> [247] ACGT
#> [248] ACGT
#> [249] ACGT
#> [250] ACGT
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
```

Observe that two new columns were added, the automorphism coefficient \(k\)
(named as *autm*) and the genetic-code cube where the automorphism was found.
By convention the DNA sequence is given for the positive strand. Since the
*dual cube* of **“ACGT”** corresponds to the complementary base order
**TGCA**, automorphisms described by the cube **TGCA** represent mutational
events affecting the DNA negative strand (-).

The last result can be summarized by gene regions as follow:

```
automorphismByRanges(autm)
aut_range <-
aut_range#> GRanges object with 9 ranges and 1 metadata column:
#> seqnames ranges strand | cube
#> <Rle> <IRanges> <Rle> | <character>
#> [1] 1 1-102 + | ACGT
#> [2] 1 103 - | TGCA
#> [3] 1 104-105 + | ACGT
#> [4] 1 106 - | TGCA
#> [5] 1 107-201 + | ACGT
#> [6] 1 202 - | TGCA
#> [7] 1 203-205 + | ACGT
#> [8] 1 206 - | TGCA
#> [9] 1 207-250 + | ACGT
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
```

That is, function automorphismByRanges permits the classification of the pairwise alignment of protein-coding sub-regions based on the mutational events observed on it quantitatively represented as automorphisms on genetic-code cubes.

Searching for automorphisms on \(\mathbb{Z}_{64}\) permits us a quantitative differentiation between mutational events at different codon positions from a given DNA protein-encoding region. As shown in reference (4) a set of different cubes can be applied to describe the best evolutionary aminoacid scale highly correlated with aminoacid physicochemical properties describing the observed evolutionary process in a given protein.

More information about this subject can be found in the supporting material from reference (4)) at GitHub GenomeAlgebra_SymmetricGroup, particularly by interacting with the Mathematica notebook Genetic-Code-Scales_of_Amino-Acids.nb.