Using the GEOfastq Package



GEOfastq can be installed from Bioconductor as follows:

if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("GEOfastq")

Overview of GEOfastq

The NCBI Gene Expression Omnibus (GEO) offers a convenient interface to explore high-throughput experimental data such as RNA-seq. GEO deposits RNA-seq data as sra files to the Sequence Read Archive (SRA), and these can be converted to fastq files using fastq-dump. This conversion can be quite slow, so it is usually more convenient to download the fastq files that the European Nucleotide Archive (ENA) generates for a GEO accession. GEOfastq crawls GEO to retrieve sample metadata and ENA fastq URLs, and then downloads the fastq files.
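To make the ENA side of this more concrete, the sketch below builds the directory fragment under which ENA's FTP server stores the fastq files for a run accession. It follows ENA's published layout (the first six characters of the accession, an extra zero-padded subdirectory for accessions with more than six digits, then the accession itself). The helper name ena_run_dir is hypothetical and is not part of GEOfastq; the package's own get_dldir may differ in detail.

```r
# Hypothetical helper, not part of GEOfastq: directory fragment for a run
# accession under ENA's fastq FTP area, following ENA's documented layout.
ena_run_dir <- function(run) {
    stopifnot(is.character(run), length(run) == 1L,
              grepl("^[SED]RR[0-9]{6,9}$", run))

    # First level: the three letters plus the first three digits.
    prefix <- substr(run, 1, 6)

    # Accessions with exactly six digits have no intermediate directory.
    if (nchar(run) == 9L) {
        return(paste(prefix, run, sep = "/"))
    }

    # Longer accessions: digits from the seventh onwards, zero-padded to 3.
    extra <- substr(run, 10, nchar(run))
    sub_dir <- sprintf("%03d", as.integer(extra))
    paste(prefix, sub_dir, run, sep = "/")
}

ena_run_dir("SRR014242")   # "SRR014/SRR014242"
ena_run_dir("SRR1016916")  # "SRR101/006/SRR1016916"
```

The accession SRR1016916 is used only to illustrate the seven-digit case.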

Getting Started using GEOfastq

To get fastq data for a GEO series, we first retrieve the metadata for a GEO accession:

library(GEOfastq)

gse_name <- 'GSE133758'
gse_text <- crawl_gse(gse_name)
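Because crawling involves network requests, it can be worth checking that an accession is well formed before passing it on. The sketch below is a minimal base R check, assuming only that GEO series accessions are "GSE" followed by digits and sample accessions are "GSM" followed by digits. The helper name is_geo_accession is hypothetical and is not part of GEOfastq.

```r
# Hypothetical helper, not part of GEOfastq: check the shape of a GEO
# series (GSE) or sample (GSM) accession without contacting GEO.
is_geo_accession <- function(x, type = c("GSE", "GSM")) {
    type <- match.arg(type)
    grepl(paste0("^", type, "[0-9]+$"), x)
}

is_geo_accession("GSE133758")          # series accession
is_geo_accession("GSM315559", "GSM")   # sample accession
```

This only validates the format; it does not confirm that the accession exists in GEO.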

Next, we extract the sample accessions for this study and retrieve the GEO metadata and ENA fastq URL for one example sample:

gsm_names <- extract_gsms(gse_text)
gsm_name <- gsm_names[182]
srp_meta <- crawl_gsms(gsm_name)
#> 1 GSMs to process

Now that we have retrieved the necessary metadata, we are ready to download the fastq files for this sample:

data_dir <- tempdir()

# for this example, replace srp_meta with a run that has a smaller fastq file
srp_meta <- data.frame(
        run  = 'SRR014242',
        row.names = 'SRR014242',
        gsm_name = 'GSM315559',
        ebi_dir = get_dldir('SRR014242'), stringsAsFactors = FALSE)

res <- get_fastqs(srp_meta, data_dir)
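After the download finishes, one way to confirm that files arrived is to list them. The sketch below uses only base R and assumes that the downloaded files sit somewhere under the target directory with the usual .fastq.gz suffix; the helper name list_fastqs is hypothetical and is not part of GEOfastq.

```r
# Hypothetical helper, not part of GEOfastq: list gzip-compressed fastq
# files found under a directory, together with their sizes in bytes.
list_fastqs <- function(dir) {
    paths <- list.files(dir, pattern = "\\.fastq\\.gz$",
                        full.names = TRUE, recursive = TRUE)
    data.frame(file = basename(paths),
               size_bytes = file.size(paths),
               stringsAsFactors = FALSE)
}

# Usage with the directory from the example above:
# list_fastqs(data_dir)
```

An empty result (zero rows) suggests that nothing was downloaded or that the files were written elsewhere.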

Session info

The following packages and versions were used in the production of this vignette.

#> R version 4.4.0 RC (2024-04-16 r86468)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/ 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> other attached packages:
#> [1] GEOfastq_1.13.0
#> loaded via a namespace (and not attached):
#>  [1] digest_0.6.35     R6_2.5.1          codetools_0.2-20  fastmap_1.1.1    
#>  [5] doParallel_1.0.17 xfun_0.43         iterators_1.0.14  cachem_1.0.8     
#>  [9] parallel_4.4.0    knitr_1.46        RCurl_1.98-1.14   htmltools_0.5.8.1
#> [13] rmarkdown_2.26    lifecycle_1.0.4   bitops_1.0-7      cli_3.6.2        
#> [17] foreach_1.5.2     sass_0.4.9        jquerylib_0.1.4   compiler_4.4.0   
#> [21] plyr_1.8.9        tools_4.4.0       evaluate_0.23     bslib_0.7.0      
#> [25] Rcpp_1.0.12       yaml_2.3.8        rlang_1.1.3       jsonlite_1.8.8