In R's xml2 library, I don't understand how xml_find_all and xml_find_first work

I am trying to mimic a simple example to retrieve named nodes with xml_find_first() and xml_find_all() functions. The simple example works very well:

library(xml2)
x <- read_xml("<foo><bar><baz/></bar><baz/></foo>")
xml_find_all(x, ".//baz")
xml_find_all(x, ".//bar")
xml_find_first(x, ".//bar")

As expected, the output for the three cases is:

{xml_nodeset (2)}
[1] <baz/>
[2] <baz/>

{xml_nodeset (1)}
[1] <bar>\n  <baz/>\n</bar>

{xml_node}
<bar>
[1] <baz/>

Now, with a more complex production example, the two functions seem to behave differently:

library(xml2)
yy <- read_xml(
  '<workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
      <fileVersion appName="xl" lastEdited="3" lowestEdited="5" rupBuild="9302"/>
      <workbookPr/>
      <workbookProtection/>
      <bookViews>
          <workbookView windowWidth="27090" windowHeight="8700" tabRatio="500" activeTab="1"/>
      </bookViews>
      <sheets>
          <sheet name="PARTICIPANTES" sheetId="1" r:id="rId1"/>
          <sheet name="ORDENADOS" sheetId="2" r:id="rId2"/>
      </sheets>
      <calcPr calcId="144525"/>
  </workbook>'
)

xml_find_first(yy, ".//sheets")
xml_find_first(yy, "//sheets")
xml_find_all(yy, "//sheets")

In all cases nothing is found: the first two calls return a missing node and the third returns an empty node set:

{xml_missing}
<NA>

{xml_missing}
<NA>

{xml_nodeset (0)}

Is there something I am missing about these functions?

There is 1 answer

Answered by Parfait

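The nodes are not found because the workbook declares a default namespace, identified by xmlns="...", which differs from the prefixed namespace xmlns:r="...". In XPath 1.0 an unprefixed name such as sheets matches only elements that belong to no namespace, so the expression has to refer to the element through a prefix bound to the default namespace. Consider xml_ns_rename to rename the automatic prefix (d1) that xml2 assigns to the default namespace. You can then use the new prefix in any XPath expression by passing the renamed namespaces as the ns argument.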

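# xml_ns() lists the document's namespaces; the default one is labelled "d1".
# Rename that prefix to "doc" so it reads better in XPath expressions.
ns <- xml_ns_rename(xml_ns(yy), d1 = "doc")

# Pass the renamed namespaces as the third argument (ns)
xml_find_first(yy, ".//doc:sheets", ns)
xml_find_first(yy, "//doc:sheets", ns)
xml_find_all(yy, "//doc:sheets", ns)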
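
For comparison, here is a minimal sketch of a few other ways to reach the same nodes without renaming anything. It is an illustration under stated assumptions, not the only approach: the document doc is a trimmed stand-in for the workbook in the question, and the variable names are made up for this example. It relies on xml_find_all() using xml_ns(x) as its default ns argument, so the automatic d1 prefix should work as-is.

```r
library(xml2)

# Trimmed stand-in for the workbook in the question (hypothetical example data)
doc <- read_xml(
  '<workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
             xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
     <sheets>
       <sheet name="PARTICIPANTES" sheetId="1" r:id="rId1"/>
       <sheet name="ORDENADOS" sheetId="2" r:id="rId2"/>
     </sheets>
   </workbook>'
)

# Inspect the namespaces; xml2 labels the default (unprefixed) one "d1"
print(xml_ns(doc))

# Option 1: use the automatic "d1" prefix directly, no renaming needed
sheets_d1 <- xml_find_all(doc, ".//d1:sheet")
sheet_names <- xml_attr(sheets_d1, "name")

# Option 2: match on the local name only, ignoring the namespace
sheets_local <- xml_find_all(doc, ".//*[local-name() = 'sheet']")

# Option 3: strip the default namespace, after which unprefixed XPath works.
# Note that xml_ns_strip() modifies the document in place, so it comes last.
xml_ns_strip(doc)
sheets_plain <- xml_find_all(doc, ".//sheet")
```

Option 1 is the least invasive. Option 3 is convenient for exploration, but it changes the document, which matters if you intend to write it back out.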