R xml2 error: Start tag expected, '<' not found [4]


I am trying to import an XML file from a URL:

library(xml2)

x <- read_xml('https://ftp.ncbi.nlm.nih.gov/pub/GTR/data/gtr_ftp.xml.gz')
Error in read_xml.raw(raw, encoding = encoding, base_url = base_url, as_html = as_html,  : 
  Start tag expected, '<' not found [4]

According to the documentation, I should be able to pass a URL for a .gz file and have it uncompressed automatically. If I download the file, unzip it locally, and then use read_xml, it works fine. The file is quite large (~2 GB unzipped), so I am not sure whether that is a problem over a connection. Any thoughts on how I can read this directly from a connection?


There are 2 answers

MrFlick On BEST ANSWER

The catch is that the documentation says "Local paths ending in .gz, .bz2, .xz, .zip will be automatically uncompressed" (emphasis on "Local" added). The logic seems to be in the xml2:::path_to_connection function. URLs are not automatically uncompressed; only local files on disk are.

The read_xml function will use the curl package to work with URLs if it is installed. If you have that package, you can wrap the curl connection in gzcon to decompress the stream as it is read. Assuming you have enough RAM, you could try:

url <- 'https://ftp.ncbi.nlm.nih.gov/pub/GTR/data/gtr_ftp.xml.gz'
x <- read_xml(gzcon(curl::curl(url)))
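If streaming over the connection proves unreliable for a file this size, another option is to lean on the "local paths" rule quoted above: download the compressed file to a temporary path and hand that path to read_xml. The sketch below is only an illustration; read_remote_gz_xml is a hypothetical helper name, not something xml2 provides, and it assumes there is enough disk space for the compressed file and enough RAM for the parsed document.

```r
library(xml2)

# Hypothetical helper (not part of xml2): download a remote .gz file to a
# temporary local path, then let read_xml() decompress it as a local file.
read_remote_gz_xml <- function(url) {
  tmp <- tempfile(fileext = ".xml.gz")
  # Remove the temporary file once the document has been parsed into memory
  on.exit(unlink(tmp), add = TRUE)
  # mode = "wb" keeps the compressed bytes intact (matters on Windows)
  download.file(url, tmp, mode = "wb", quiet = TRUE)
  # The path ends in .gz, so xml2 uncompresses it automatically
  read_xml(tmp)
}
```

It would be called as x <- read_remote_gz_xml('https://ftp.ncbi.nlm.nih.gov/pub/GTR/data/gtr_ftp.xml.gz'). For a large download, the default download.file timeout (60 seconds) may be too short; raising it with options(timeout = ...) beforehand is probably needed.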
kjhughes On

The xml2 documentation describes what read_xml() accepts as its first argument:

A string can be either a path, a url or literal xml. Urls will be converted into connections either using base::url or, if installed, curl::curl. Local paths ending in .gz, .bz2, .xz, .zip will be automatically uncompressed. [emphasis on "Local" added]

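The argument in your failing case is a URL pointing to a remote compressed file, not a local path, so it is not uncompressed automatically and the parser receives the gzip bytes as-is.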
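The difference can be reproduced without touching the network. The following is a small sketch using a made-up one-element document written to a temporary .gz file; the file contents and variable names are illustrative only. Reading the local path should work, while passing the same compressed bytes directly (roughly what read_xml ends up with for a URL) is expected to fail with a parse error.

```r
library(xml2)

# Write a tiny gzip-compressed XML file to a temporary local path
tmp <- tempfile(fileext = ".xml.gz")
con <- gzfile(tmp, "wb")
writeLines("<root><item>1</item></root>", con)
close(con)

# Local path ending in .gz: xml2 uncompresses it before parsing
doc <- read_xml(tmp)
print(xml_name(doc))

# The same file as raw compressed bytes: no decompression step is applied,
# so the parser is expected to reject it; the error is captured, not thrown
raw_gz <- readBin(tmp, "raw", n = file.size(tmp))
res <- tryCatch(read_xml(raw_gz), error = function(e) conditionMessage(e))
print(res)
```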