1 Retrieval of UCSC RepeatMasker annotations through AnnotationHub resources

The UCSCRepeatMasker package provides metadata for AnnotationHub resources associated with UCSC RepeatMasker annotations. The original data can be downloaded from UCSC at URLs of the form https://hgdownload.soe.ucsc.edu/goldenPath/XXXX/database/rmsk.txt.gz, where XXXX is the code of the corresponding UCSC genome version (e.g., hg38). Details about how those original data were processed into AnnotationHub resources can be found in the source file:

UCSCRepeatMasker/scripts/make-data_UCSCRepeatMasker.R

while details on how the metadata for those resources has been generated can be found in the source file:

UCSCRepeatMasker/scripts/make-metadata_UCSCRepeatMasker.R
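As a minimal sketch of the URL pattern described above, the following base R snippet fills in the genome code. The helper name `rmsk_url` is hypothetical and is not part of the UCSCRepeatMasker package; it only illustrates how the XXXX placeholder is substituted.

```r
# Hypothetical helper (an assumption for illustration, not a package function):
# build the rmsk.txt.gz download URL for a given UCSC genome code.
rmsk_url <- function(genome) {
  stopifnot(is.character(genome), length(genome) == 1L, nzchar(genome))
  sprintf("https://hgdownload.soe.ucsc.edu/goldenPath/%s/database/rmsk.txt.gz",
          genome)
}

# Genome code for the human assembly used later in this document
rmsk_url("hg38")
```

For hg38 this should produce the same address that is stored in the `srcurl` metadata field of the corresponding resource.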

UCSC RepeatMasker annotations can be retrieved using AnnotationHub, a web resource that provides a central location where genomic files (e.g., VCF, bed, wig) and other resources from standard (e.g., UCSC, Ensembl) and distributed sites can be found. The Bioconductor AnnotationHub package is a client for this web resource; it creates and manages a local cache of the files retrieved by the user, which helps with quick and reproducible access.

For example, to list the available UCSC RepeatMasker annotations for the human genome, we first load the AnnotationHub package:

library(AnnotationHub)

and then query the annotation hub as follows:

ah <- AnnotationHub()
query(ah, c("RepeatMasker", "Homo sapiens"))
## AnnotationHub with 2 records
## # snapshotDate(): 2022-01-31
## # $dataprovider: UCSC
## # $species: Homo sapiens
## # $rdataclass: GRanges
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## #   rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH99002"]]' 
## 
##             title                                                   
##   AH99002 | UCSC RepeatMasker annotations (Mar2020) for Human (hg19)
##   AH99003 | UCSC RepeatMasker annotations (Sep2021) for Human (hg38)
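Record identifiers can also be looked up from the query result rather than copied by hand. The sketch below assumes a working internet connection and that the hub still contains records matching these search terms; the variable names `q` and `id_hg38` are illustrative only, and the identifiers returned may differ from those printed above if the hub snapshot has changed.

```r
library(AnnotationHub)

ah <- AnnotationHub()
q <- query(ah, c("RepeatMasker", "Homo sapiens"))

# Inspect selected metadata columns of the matching records
mcols(q)[, c("title", "genome")]

# Select the identifier(s) whose genome field is hg38
# (may have length 0 or > 1 depending on the hub snapshot)
id_hg38 <- names(q)[q$genome == "hg38"]
id_hg38
```

Selecting by the `genome` column avoids hard-coding an identifier, at the cost of having to check that exactly one record was matched.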

We can retrieve the desired resource, e.g., UCSC RepeatMasker annotations for hg38, using the following syntax:

rmskhg38 <- ah[["AH99003"]]
rmskhg38
## GRanges object with 5633664 ranges and 4 metadata columns:
##                        seqnames        ranges strand |   swScore     repName
##                           <Rle>     <IRanges>  <Rle> | <integer> <character>
##         [1]                chr1   10001-10468      + |       463   (TAACCC)n
##         [2]                chr1   15798-15849      + |        18   (TGCTCC)n
##         [3]                chr1   16713-16744      + |        18      (TGG)n
##         [4]                chr1   18907-19048      + |       239         L2a
##         [5]                chr1   19972-20405      + |       994          L3
##         ...                 ...           ...    ... .       ...         ...
##   [5633660] chrX_KV766199v1_alt 179150-179234      - |       255    MIR1_Amn
##   [5633661] chrX_KV766199v1_alt 184474-184785      - |      2039       AluJb
##   [5633662] chrX_KV766199v1_alt 186964-187271      - |       386      MLT1G3
##   [5633663] chrX_KV766199v1_alt 187486-187569      - |       270      MLT1G3
##   [5633664] chrX_KV766199v1_alt 187597-187822      - |      1301       L1MA8
##                  repClass     repFamily
##               <character>   <character>
##         [1] Simple_repeat Simple_repeat
##         [2] Simple_repeat Simple_repeat
##         [3] Simple_repeat Simple_repeat
##         [4]          LINE            L2
##         [5]          LINE           CR1
##         ...           ...           ...
##   [5633660]          SINE           MIR
##   [5633661]          SINE           Alu
##   [5633662]           LTR     ERVL-MaLR
##   [5633663]           LTR     ERVL-MaLR
##   [5633664]          LINE            L1
##   -------
##   seqinfo: 640 sequences (1 circular) from hg38 genome

Note that the data are returned as a GRanges object; please consult the vignettes of the GenomicRanges package for details on how to manipulate this type of object. The contents of the four metadata columns (swScore, repName, repClass and repFamily) are described on the UCSC Genome Browser web page for the RepeatMasker database schema. Please consult the credits and references sections on that page for information on how to cite these data.
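To give a flavour of such manipulations, here is a small sketch that works on a toy GRanges object rebuilt from the first five ranges printed above, so that it runs without downloading the full resource. The object names `toy` and `line_elems` are illustrative; the same expressions should apply to the full object, assuming it has the columns shown in the printout.

```r
library(GenomicRanges)

# Toy object with the first five hg38 ranges shown in the printout
toy <- GRanges(
  seqnames = rep("chr1", 5),
  ranges = IRanges(start = c(10001, 15798, 16713, 18907, 19972),
                   end   = c(10468, 15849, 16744, 19048, 20405)),
  strand = rep("+", 5),
  swScore = c(463L, 18L, 18L, 239L, 994L),
  repName = c("(TAACCC)n", "(TGCTCC)n", "(TGG)n", "L2a", "L3"),
  repClass = c("Simple_repeat", "Simple_repeat", "Simple_repeat",
               "LINE", "LINE"),
  repFamily = c("Simple_repeat", "Simple_repeat", "Simple_repeat",
                "L2", "CR1")
)

# Keep only the ranges annotated as LINE elements
line_elems <- toy[toy$repClass == "LINE"]

# Number of annotated repeats per class
table(toy$repClass)

# Total number of bases covered by each class
tapply(width(toy), toy$repClass, sum)
```

Subsetting with a logical vector on a metadata column and summarising widths by class are two common starting points; the GenomicRanges vignettes cover overlap queries and other operations.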

The GRanges object contains further metadata accessible with the metadata() method as follows:

metadata(rmskhg38)
## $srcurl
## [1] "https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/rmsk.txt.gz"
## 
## $srcVersion
## [1] "Sep2021"
## 
## $citation
## A. Smit, R. Hubley, P. Green (1996-2010). _RepeatMasker Open-3.0_.
## <URL: https://www.repeatmasker.org>.
## 
## $gdesc
## | organism: Homo sapiens (Human)
## | genome: hg38
## | provider: UCSC
## | release date: Jun. 2013
## | ---
## | seqlengths:
## |                  chr1                 chr2 ...  chrX_KV766199v1_alt
## |             248956422            242193529 ...               188004

2 Session information

sessionInfo()
## R Under development (unstable) (2022-01-05 r81451)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] GenomicRanges_1.47.6 GenomeInfoDb_1.31.4  IRanges_2.29.1      
## [4] S4Vectors_0.33.10    AnnotationHub_3.3.8  BiocFileCache_2.3.4 
## [7] dbplyr_2.1.1         BiocGenerics_0.41.2  BiocStyle_2.23.1    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.8                    png_0.1-7                    
##  [3] Biostrings_2.63.1             assertthat_0.2.1             
##  [5] digest_0.6.29                 utf8_1.2.2                   
##  [7] mime_0.12                     R6_2.5.1                     
##  [9] RSQLite_2.2.9                 evaluate_0.14                
## [11] httr_1.4.2                    pillar_1.7.0                 
## [13] zlibbioc_1.41.0               rlang_1.0.1                  
## [15] curl_4.3.2                    jquerylib_0.1.4              
## [17] blob_1.2.2                    rmarkdown_2.11               
## [19] stringr_1.4.0                 RCurl_1.98-1.6               
## [21] bit_4.0.4                     shiny_1.7.1                  
## [23] compiler_4.2.0                httpuv_1.6.5                 
## [25] xfun_0.29                     pkgconfig_2.0.3              
## [27] htmltools_0.5.2               tidyselect_1.1.1             
## [29] KEGGREST_1.35.0               GenomeInfoDbData_1.2.7       
## [31] tibble_3.1.6                  interactiveDisplayBase_1.33.0
## [33] bookdown_0.24                 fansi_1.0.2                  
## [35] withr_2.4.3                   crayon_1.4.2                 
## [37] dplyr_1.0.8                   later_1.3.0                  
## [39] bitops_1.0-7                  rappdirs_0.3.3               
## [41] jsonlite_1.7.3                xtable_1.8-4                 
## [43] lifecycle_1.0.1               DBI_1.1.2                    
## [45] magrittr_2.0.2                cli_3.1.1                    
## [47] stringi_1.7.6                 cachem_1.0.6                 
## [49] XVector_0.35.0                promises_1.2.0.1             
## [51] bslib_0.3.1                   ellipsis_0.3.2               
## [53] filelock_1.0.2                generics_0.1.2               
## [55] vctrs_0.3.8                   tools_4.2.0                  
## [57] bit64_4.0.5                   Biobase_2.55.0               
## [59] glue_1.6.1                    purrr_0.3.4                  
## [61] BiocVersion_3.15.0            fastmap_1.1.0                
## [63] yaml_2.2.2                    AnnotationDbi_1.57.1         
## [65] BiocManager_1.30.16           memoise_2.0.1                
## [67] knitr_1.37                    sass_0.4.0