library(BiocStyle)
library(HPAanalyze)
library(dplyr)

1 Background

The Human Protein Atlas (HPA) is a comprehensive resource for exploring the human proteome. It contains a vast amount of proteomics and transcriptomics data generated from antibody-based tissue micro-array profiling and RNA deep-sequencing.

The program has generated protein expression profiles in normal human tissues (with cell type-specific expression patterns), cancers and cell lines via an innovative immunohistochemistry-based approach. These profiles are accompanied by a large collection of high-quality histological staining images, annotated with clinical data and quantification. The database also classifies proteins into both functional classes (such as transcription factors or kinases) and project-related classes (such as candidate genes for cancer). Starting from version 4.0, the HPA includes subcellular location profiles derived from confocal images of immunofluorescently stained cells. Together, these data provide a detailed picture of protein expression in human cells and tissues, facilitating tissue-based diagnostics and research.

Data from the HPA are freely available via proteinatlas.org, allowing scientists to access and incorporate the data into their research. The R package hpar was previously created for fast and easy programmatic access to HPA data. Here, we introduce HPAanalyze, an R package that aims to simplify exploratory analysis of those data and to provide functionality complementary to hpar.

1.1 The different HPA data formats

The Human Protein Atlas project provides data via two main mechanisms: full datasets in the form of downloadable compressed tab-separated files (.tsv), and individual entries in XML, RDF and TSV formats. The full downloadable datasets include normal tissue, pathology (cancer), subcellular location, RNA gene and RNA isoform data. For individual entries, the XML format is the most comprehensive, providing information on the target protein, the antibodies, a summary for each tissue and detailed data from each sample, including clinical data, IHC scoring and image download links.
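As a rough illustration of the downloadable tab-separated format, the sketch below parses a few rows with base R. Every value is an invented placeholder, and the column names (Gene, Gene name, Tissue, Cell type, Level, Reliability) are an assumption about the layout of the normal tissue file rather than a copy of HPA data.

```r
# Toy stand-in for a downloaded .tsv file; all values are invented placeholders.
tsv_text <- paste(
  "Gene\tGene name\tTissue\tCell type\tLevel\tReliability",
  "ID_0001\tGENE_A\tcerebellum\tPurkinje cells\tHigh\tApproved",
  "ID_0001\tGENE_A\thippocampus\tglial cells\tLow\tApproved",
  "ID_0002\tGENE_B\tcerebellum\tPurkinje cells\tNot detected\tUncertain",
  sep = "\n"
)

# read.delim() defaults to tab-separated input with a header row.
# check.names = FALSE keeps "Gene name" and "Cell type" exactly as written.
toy_tissue <- read.delim(text = tsv_text, check.names = FALSE,
                         stringsAsFactors = FALSE)

# Inspect the resulting data frame: one row per gene/tissue/cell type.
str(toy_tissue)
```

For a real file, the `text` argument would be replaced by the path to the decompressed .tsv file.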

1.2 HPAanalyze overview

HPAanalyze is designed to fulfill three main tasks: (1) importing, subsetting and exporting the downloadable datasets; (2) visualizing the downloadable datasets for exploratory analysis; and (3) working with the individual XML files. The package aims to serve researchers with little programming experience, while still allowing power users to use the imported data as desired.
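To make task (1) concrete, here is a minimal base R sketch of the subset-then-export idea on a toy data frame. It does not use HPAanalyze itself, which provides its own functions for these steps; the column names, gene names and values below are hypothetical placeholders chosen only for illustration.

```r
# Toy data frame standing in for an imported dataset; values are placeholders.
toy_tissue <- data.frame(
  gene   = c("GENE_A", "GENE_A", "GENE_B", "GENE_C"),
  tissue = c("cerebellum", "hippocampus", "cerebellum", "breast"),
  level  = c("High", "Low", "Not detected", "Medium"),
  stringsAsFactors = FALSE
)

# Subset: keep only the genes and tissues of interest.
target_genes   <- c("GENE_A", "GENE_B")
target_tissues <- c("cerebellum")
toy_subset <- toy_tissue[toy_tissue$gene %in% target_genes &
                           toy_tissue$tissue %in% target_tissues, ]

# Export: write the subset to a temporary CSV file, then read it back
# to confirm that the round trip preserves the rows.
out_file <- tempfile(fileext = ".csv")
write.csv(toy_subset, out_file, row.names = FALSE)
round_trip <- read.csv(out_file, stringsAsFactors = FALSE)
```

Filtering with `%in%` keeps the selection vectors separate from the data, so the same pattern works whether one or many genes are requested.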

2 Visualize protein expression data

Visualization is currently available for the normal tissue, pathology (cancer) and subcellular location datasets. The fastest and easiest way to use it is to call hpaVis and rely on its defaults for anything you do not specify.

hpaVis(targetGene = c("GCH1", "PTS", "SPR", "DHFR"),
       targetTissue = c("cerebellum", "cerebral cortex", "hippocampus"),
       targetCancer = c("glioma"))
#> No data provided. Use version 18.
#> targetCellType variable not specified, visualize all.
#> targetCellType variable not specified, visualize all.
#> Use hpaListParam() to list possible values for target variables.
#> Use hpaListParam() to list possible values for target variables.

Of course, we cannot visualize everything in those big datasets, so when arguments are omitted, defaults are used and you will receive messages reporting which ones were applied.

hpaVis()

# No data provided. Use version 18.
# targetGene variable not specified, default to TP53, RB1, MYC, KRAS and EGFR.
# targetTissue variable not specified, default to breast.
# targetCellType variable not specified, visualize all.
# targetCancer variable not specified, default to breast cancer
# Use hpaListParam() to list possible values for target variables.

You can also use hpaVis to show just one or two of the three graphs.

hpaVis(visType = "Patho",
       targetGene = c("GCH1", "PTS", "SPR", "DHFR"),
       targetCancer = c("glioma", "breast cancer"))
#> No data provided. Use version 18.