MicroRNAs are deeply linked to the emergence of the complex octopus brain

Soft-bodied cephalopods such as octopuses are exceptionally intelligent invertebrates with a highly complex nervous system that evolved independently from vertebrates. Because of elevated RNA editing in their nervous tissues, we hypothesized that RNA regulation may play a major role in the cognitive success of this group. We thus profiled messenger RNAs and small RNAs in three cephalopod species including 18 tissues of the Octopus vulgaris. We show that the major RNA innovation of soft-bodied cephalopods is an expansion of the microRNA (miRNA) gene repertoire. These evolutionarily novel miRNAs were primarily expressed in adult neuronal tissues and during the development and had conserved and thus likely functional target sites. The only comparable miRNA expansions happened, notably, in vertebrates. Thus, we propose that miRNAs are intimately linked to the evolution of complex animal brains.

Three million sequencing reads generated by two full-length RNA-seq methods (Iso-seq and FLAM-seq (16,66)) were used to build a catalog of high-quality mRNAs isoforms (Methods). The isoforms were then filtered (Methods) to produce a final set of 59,579 mRNA isoforms associated with 10,957 reference genes out of 25,335 present in the genome. After filtering, and in combination with a current genome annotation, we estimate that the octopus genome encodes ~80,000 alternative mRNA isoforms. An updated annotation that combines original genome annotation with the isofroms obtained in this study is available as Supplementary Data 1.
We have used our newly assembled transcriptome to characterize alternative splicing in the genome of O. sinensis (Methods). O. sinensis chromosome-level genome assembly ASM634580v1 (71) was used due to its much higher completeness in comparison to a draft genome assembly of O. vulgaris (58). Both species belong to the same species complex and only recently have been recognized as separate species belonging to the same species complex (72,73).
In the obtained annotation, 9769 out of 25,335 protein coding genes (38.6%) encoded multiple mRNA isoforms. As in other bilaterians (54), the predominant alternative splicing event in the octopus is exon skipping (6513 cases) followed by alternative transcription start and termination. (4773 and 2580 cases respectively), alternative first exon (2313) and intron retention (1927) (Methods). Individual tissues varied in alternative splicing abundance with neuronal tissues often having higher exon-skipping rates (Methods, Fig. S2). We checked whether A-to-I editing may contribute to an increased number of mRNA isoforms by affecting alternative splicing of messenger RNAs. Splicing requires donor and acceptor splicing signals in the mRNA (GT and AG respectively) as well as other splicing-regulatory sequences within introns (branch point, polypyrimidine tract, etc.) As A-to-I editing is thought to occur co-transcriptionally, it is feasible for splice sites to be created by adenosine deamination. A notable example of such event in the autoregulatory loop of ADAR2 itself, where adenosine deamination in ADAR2 mRNA creates a proximal splicing acceptor site and results in a premature translation termination due to a frame shift thus possibly acting in self-inhibitory fashion (74). However, in human cell lines, ADAR enzymes have been found to target splicing-related motifs only rarely (75). If splice donor or/and acceptor site are created by A-to-I editing, their genomic sequence will be AT or/and AA respectively. To identify such cases, we have mapped available RNA-seq data from representative protostomian species (Methods). As the result, we did not find any non-canonical splice junctions in any of the species investigated (Table S3).
We have also used total RNA-seq data to annotate circular RNAs (circRNAs) in the tissues by detecting back-splicing events (56). As ADARs have been shown to antagonize back-splicing events by melting dsRNA structures within introns (14), we expected circRNAs in cephalopods to be not as abundant as in other animals. We have predicted circRNAs using total RNA-seq datasets by searching for backsplicing events, and rigorously filtered putative predictions to retain a set of high-quality 296 circRNAs (Methods). These RNAs were expressed in highly tissue-specific manner: the majority (200 / 296) were detected in only one of the tissues, and only 32 in more than 3 tissues.

Alternative cleavage and polyadenylation
We used FLAM-seq to reconstruct mRNA cleavage sites. The sequence context around cleavage sites in octopus strongly resembles that of other metazoans. More precisely, 71% of the sites were canonical metazoan polyadenylation signals (A(A/U)UAAA and variants) ( Fig. S3 A-D). Thus, 3'-UTRs and their isoforms are likely generated by a canonical mechanism.
To determine 3'-UTR lengths, we have used FLAMseq data. Genes with higher coverage may theoretically exhibit longer maximal 3'-UTRs as there would be a higher chance of capturing those. However, the relationship between the sequencing depth per gene and the maximal recovered 3'-UTR length is weak and explained about 1% of the variability in the 3'-UTR length ( fig. S3E). We thus concluded that our FLAM-seq dataset allows for an unbiased 3'-UTR length estimation for sampled genes. The median length of 3'-UTRs in the octopus was approximately 380 nucleotides. As in other bilaterians, we find that genes expressed in the nervous system utilize, on average, longer 3'UTRs ( Fig. S3F) (76,77). We also found 4,800 genes utilizing tandem alternative polyadenylation sites. As expected, distal cleavage sites usually contained stronger polyadenylation signals ( Fig. S3 D) (78).
Different genes showed profound differences in 3'-UTR lengths based on whether they have multiple cleavage sites annotated and their tissue expression patterns. Genes with multiple annotated 3'-UTRs exhibited longer 3'-UTRs compared to genes with only one cleavage site captured. The genes specifically expressed in the nervous tissues utilized the longest 3'-UTRs ( fig S3). In addition to the lengths of 3'-UTRs, we have investigated steady-state lengths of mRNA poly(A) tails. Similar to the 3'-UTRs, poly(A) tails were longer in the neuronal tissues (123 -138.5 nt) than in non-neuronal (45-110 nt). The shortest poly(A) tails have been observed in the testis (median length of 45 nt). In addition, mRNA tails in testis showed unusually high percentage of guanosines (up to 10%) ( Fig. S3) Finally, we investigated poly(A) tail lengths. As in other animals, steady-state lengths of poly(A) tails are longer in neuronal tissues (16) Interestingly, poly(A) tails in the testis were shorter than in other tissues and contained a high proportion of guanosines ( Fig. S3 G,H), which have not been described in any other animal.
Thus, while we found unusual poly(A) tails in the testis, we conclude that the octopus employs polyadenylation patterns similar to those observed in other lophotrochozoans. In summary, the transcriptome of O. vulgaris does not show major departures from other lophotrochozoans in terms of alternative splicing diversity and rates, as well as in mRNA cleavage and polyadenylation. This matches a previous observation of gene content and intronic architecture of coleoids resembling that of other "slow-evolving" lophotrochozoans (5).

RNA editing index
A-to-I editing index has been computed as in (59). The main motivation for using this index instead of defining editing sites is explained below. First of all, it avoids defining RNA editing sites explicitly using an arbitrary RNA sequencing depth threshold. Secondly, different genomic features have different coverage by the RNA sequencing reads. Such uneven coverage leads to more editing sites being defined in the coding sequences per unit of length and may lead to false conclusion that coding sequences are edited at higher levels. Thus, when defining editing sites, the coverage should be accounted for. RNA editing index, on the other hand, is obtained by pulling the information from all reference adenosines for which there is any coverage present. This explicitly accounts for uneven coverage and thus allows comparison of different genomic regions. A hypothetical example illustrating this is shown in (Fig.S1A and B) When explicitly calling editing sites, feature A appears to be edited more (i.e. more editing sites per unit length) than feature B where no editing sites have been called due to low coverage. However, when comparing editing indices, the opposite is true. We have computed the editing index per each type of genomic feature in O. sinensis and O. bimaculoides genomes using the data generated in this study and (5) dataset respectively. Among the genic features in both species, the introns appear to have the highest editing indices followed by 3'-UTRs and coding sequences (Fig. S1).

Dependence between tissue sampling and complement of annotated miRNAs
MiRNAs were annotated in every species independently based on genome sequence and small RNA sequencing data (Materials and Methods). In only 4 cases, a putative microRNA locus has lacked reads in one of the species but was then annotated based on sequence similarity of the pre-miRNA in each of the respective genomes (see annotations in Mirgenedb. Dependence between sequencing depth and the capture of miRNAs To understand whether and how sequencing depth would affect miRNA detection rates we have subsampled small-RNA sequencing libraries from all 3 species in the study to the same number of reads (from 100,000 up to 8 millions). For every subsampled dataset, the reads have been aligned to the set of miRNA precursors and the number of detected miRNAs has been recorded (Fig. S5). In E. scolopes and O. bimaculoides, a sequencing depth of 1M reads is already sufficient to recover all coleoid miRNAs annotated in the genome despite the fact that in the case of E. scolopes the sequencing library has been produced from a cell-free hemolymph. In O. vulgaris, a sequencing depth of 1-3 million reads has been sufficient to recover all predicted coleoid microRNAs in almost all tissues of the animal. Non-neuronal tissues exhibited higher saturation depths when compared to neuronal tissues (Wilcoxon ranked sum test p-value = 0.0442), while no difference has been found between 2 library preparation methods used in the study (Wilcoxon paired ranked sum test p-value = 0.2).

A-to-I editing of miRNAs
Changes in the seed sequence of a miRNA (positions 2-8 counted from the 5' end) are expected to alter target recognition (28). However, similar to mammals (79), our data show that miRNA seed regions are not edited to any considerable extent. In the mature miRNA sequences, the mismatches were more abundant towards the 3'-end of the molecule (Fig. S7B). The most common types of mismatches were non-templated 3' additions of cytosine, adenine and uridine (Fig. S7C). When pooling the data from all datasets, the average number of reads mapping to the position in a seed was 49,532 (median 27,967; minimum 12). 146/147 miRNAs have had all 7 bases of the seed profiled with at least 10 sequencing reads and 136 with at least 100 reads. The majority of mismatches were specific to either Truseq (6,443) or Clontech (2,355) datasets and thus probably represent sequencing errors. In the remaining 894 cases, where a particular mismatch has been recovered by both library preparation methods, there was a mild correlation in the estimated mismatch frequencies between the methods (Spearman's rho = 0.46, correlation test p-value < 4.93E-43). In total, 5 miRNAs show signs of editing in their seed sequence above 1%. Of these, only 3 cases have been recovered in multiple tissues and none by both library preparation methods simultaneously (Table S4).
De-novo creation of miRNA targeting sites by RNA editing Having established the higher conservation for miRNA response elements ("MRE's") compared to non-MRE controls (Main text, Methods), we tallied cases where A-to-I editing possibly created functional MREs by introducing A-to-G substitutions. We reasoned that if such "one-off MREs" (8-mers convertible to an MRE via A-to-G mismatch) exist and are functional, they would, similarly to canonical targeting sites, display higher conservation rates compared to other octamers (vis above). We recorded, genome-wide, the conservation of such one-off MREs and observed no trace of higher conservation when compared to G-to-A one-off controls (i.e. octamers convertible to an MRE via G-to-A substitution, Fig. 8E, p > 0.05, Wilcoxon rank sum test with continuity correction). Furthermore, we have checked whether one-off MREs with editing events had higher conservation rates compared to the same type of 8-mers not targeted by ADAR and found no difference. Out of 71,772 one-off MREs conserved between 2 octopus species (306 sequences), only 159 (57 different sequences) have been found to be targeted by ADAR. These 57 one-off MREs were conserved at the same rates as their unedited counterparts (p > 0.05; Wilcoxon rank sum test). This suggests that the de-novo creation of miRNA targeting sites via A-to-I editing may not be a widespread phenomenon in cephalopods. Similarly, A-to-I editing events with the potential to destroy the target sites have been found to happen rarely. Out of 10,053 MREs conserved between 2 octopus species, only 39 (0.3%) have been found to be targeted by ADAR (Methods). This is in line with the known preference of ADAR for the double-stranded RNA and the general depletion of the secondary RNA structure around functional miRNA targeting sites (80). Fig. S1. Genomic targets of A-to-I editing (A) Definition of calling editing sites. Editing sites are required to have minimal coverage (e.g., 10 RNA-seq reads) to increase the precision. Some genomic features (e.g., introns) will not have coverage above the threshold across their whole length. Thus, normalization by feature-length will lead to the underestimation of editing intensity in such loci. (B) Editing index approach introduced in 76 . A-to-G mismatches are pooled across the whole feature. Note that the index implicitly accounts for the coverage and uncovered positions have no effect on the resulting value.   (5)) and O. vulgaris (C) (Methods). Wilcoxon rank sum test with continuity correction was used to test for the higher median per-tissue exon skipping rate between neuronal and nonneuronal tissue types, * p< 0.05,** p < 0.01, *** p < 0.001   Relationship between subsampled sequencing depth (x-axis) per library and the number of captured miRNAs for every species and small RNA sequencing library used in the study. For each library, a total number of miRNA precursor sequences with at least one mapping read has been counted. Dots specify the depth at which all coleoid miRNAs have been captured (if such saturation has been achieved). For further details, vis Supplementary Text.

Fig. S6. Fraction of transcriptome dedicated miRNAs of different phylogenetic ages
MiRNA expression as fraction of counts (as quantified by miRDeep2) colored according to the phylogenetic node of origin in all tissues profiled in the study. Tissues in have been sorted according to the proportion of reads assigned to miRNAs of coleoid origins.  (81)).
(C). Total proportion of mismatches called at different positions of miRNAs separated by the alternative base type.  Tables   Table S1. (separate file) Overview of datasets generated in this study (O. vulgaris only)