library(BiocStyle)
library(HPAanalyze)
library(dplyr)
library(xml2)
The Human Protein Atlas (HPA) lets you download very detailed data for each protein as an XML file, and hpaXml lets you retrieve those files automatically from the HPA server and parse them. However, due to a technical limitation, you cannot save the resulting "xml_document"/"xml_node" objects: they are external pointers that are no longer valid once reloaded in a new R session. The question is: how do you keep a version of these files for use when you are offline, or for reproducibility?
If you look at the “Downloadable data” page on the HPA website, you will see how these files are downloaded. Basically, you append an entry's XML file name (such as the ENSG00000134057.xml used in this example) to http://www.proteinatlas.org to download an individual entry (that’s what hpaXmlGet does behind the scenes), or you download the whole set at once.
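As a minimal sketch of the single-entry route, the helpers below build the URL and download the file only when no local copy exists yet. The names hpa_xml_url and cache_hpa_xml are hypothetical (they are not part of HPAanalyze), and the https scheme and the "base URL / Ensembl ID + .xml" pattern are assumptions that should be checked against the “Downloadable data” page.

```r
# Build the per-gene XML URL. The "<base>/<Ensembl ID>.xml" pattern and the
# https scheme are assumptions; verify them on the "Downloadable data" page.
hpa_xml_url <- function(ensembl_id, base_url = "https://www.proteinatlas.org") {
  paste0(base_url, "/", ensembl_id, ".xml")
}

# Hypothetical helper: download an entry once and reuse the local copy later.
cache_hpa_xml <- function(ensembl_id, dir = "data") {
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  dest <- file.path(dir, paste0(ensembl_id, ".xml"))
  if (!file.exists(dest)) {
    # Only reached when there is no local copy; needs an internet connection.
    utils::download.file(hpa_xml_url(ensembl_id), destfile = dest, mode = "wb")
  }
  dest
}
```

Called once while online, cache_hpa_xml("ENSG00000134057") would leave data/ENSG00000134057.xml on disk; later calls return the same path without touching the network.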
From there, you can import the file using xml2::read_xml(). The output should be exactly the same as what hpaXmlGet returns.
## same as hpaXmlGet("ENSG00000134057")
CCNB1xml <- xml2::read_xml("data/ENSG00000134057.xml")
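If the document is already in memory from an online session, another possible route to a local copy is xml2::write_xml(), which serializes an xml_document to a file that read_xml() can import again. The sketch below uses a tiny stand-in document with made-up element names rather than a real HPA entry.

```r
library(xml2)

# Stand-in for a document fetched online; the element names are made up.
doc <- read_xml("<proteinAtlas><entry><name>CCNB1</name></entry></proteinAtlas>")

# Write the in-memory document to disk, then import it again.
xml_path <- file.path(tempdir(), "example_entry.xml")
write_xml(doc, xml_path)
doc_from_disk <- read_xml(xml_path)

# The re-imported copy carries the same content as the original.
xml_text(xml_find_first(doc_from_disk, "//name"))
```

The same two calls should apply to an object returned by hpaXmlGet, since it is an xml_document as well.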
Since the umbrella function hpaXml takes either the Ensembl ID or the imported xml_document object, you can feed it what you just imported and get the expected result.
CCNB1_parsed <- hpaXml(CCNB1xml)
You can use the other hpaXml functions as well.
hpaXmlProtClass(CCNB1xml)
hpaXmlTissueExprSum(CCNB1xml)
hpaXmlAntibody(CCNB1xml)
hpaXmlTissueExpr(CCNB1xml)
It is recommended that you save your parsed objects for reproducibility. Unlike the xml_document object, these parsed objects are just regular R lists of standard vectors or data frames, so you can save them as usual.
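One common way to do this is base R's saveRDS() and readRDS(). The sketch below uses a small stand-in list in place of a real parsed result; its element names and values are made up for illustration, and the file goes to a temporary directory.

```r
# Stand-in for a parsed result: a plain list of vectors and data frames.
# The element names and values are made up for illustration.
parsed_example <- list(
  ProtClass = data.frame(
    id = c("Ab", "Cc"),
    name = c("class A", "class B"),
    stringsAsFactors = FALSE
  ),
  TissueExprSum = list(summary = "example summary")
)

# Write the object to disk, then read it back (e.g. in a later session).
rds_path <- file.path(tempdir(), "parsed_example.rds")
saveRDS(parsed_example, rds_path)
restored <- readRDS(rds_path)

# Check that the round trip preserved the object.
identical(parsed_example, restored)
```

With the real object, the equivalent call would be along the lines of saveRDS(CCNB1_parsed, "data/CCNB1_parsed.rds"), where the file path is only an example.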
Anh Tran, 2018
Please cite: Tran AN, Dussaq AM, Kennell T, Willey C, Hjelmeland A. HPAanalyze: An R Package that Facilitates the Retrieval and Analysis of The Human Protein Atlas Data. bioRxiv 355032; doi: https://doi.org/10.1101/355032