I have tried using pandas read_xml, and it reads most of the XML fine, but it leaves some parts out because they are in a slightly different format. I have included an extract below: it reads "Type" and "Activation" fine but not the "Amt" value. It picks up the column heading "Amt", just not the value. Could anyone point me in the right direction on how to get it to read that value? Thanks.
<Type>PYI</Type>
<Activation>N</Activation>
<Amt val="4000" curr="GBP"/>
xml_df = pd.read_xml(xml_data)
Is anybody able to help? I have gone through the documentation for pandas.read_xml, but I can't see why it wouldn't pick this up.
By default, pandas.read_xml parses all the immediate descendants of a set of nodes, including their child nodes and attributes. Unless the xpath argument says otherwise, read_xml will not go further than immediate descendants.

To illustrate with your use case: your XML is likely set up so that <Type> and its siblings, <Activation> and <Amt>, sit under a common parent (call it <row>) and are parsed as columns. However, <Amt> does not contain a text node, only attributes, so the value in that column comes out empty.

But then you ask, why did read_xml ignore the val and curr attributes? Because neither is an immediate descendant of <row>. They belong to <Amt>, which makes them grandchildren of <row>. If the attributes were moved onto <row> itself, they would be captured.

To capture those attributes as they stand, adjust the xpath argument to point to their immediate parent, <Amt>.

To have such attributes captured together with the <row>-level information, consider XSLT, a special-purpose language for transforming XML, to flatten your original XML so that val and curr sit at the <row> level. read_xml parses that intermediate output when you pass the transformation through its stylesheet argument.
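
Here is a minimal sketch of the three cases described above: the default parse, xpath pointed at <Amt>, and an XSLT stylesheet. It assumes lxml is installed (it is read_xml's default parser and is needed for stylesheet). The <data> and <row> wrapper elements and the Amt_val / Amt_curr column names are my own placeholders, since the full file isn't shown, so adjust them to your actual XML.

```python
from io import StringIO

import pandas as pd

# Hypothetical wrapper (<data>/<row>) around the extract from the question.
xml_data = """\
<data>
  <row>
    <Type>PYI</Type>
    <Activation>N</Activation>
    <Amt val="4000" curr="GBP"/>
  </row>
</data>"""

# 1) Default parse: only the immediate descendants of each <row> are read.
#    <Amt> has no text node, so its column is present but empty.
default_df = pd.read_xml(StringIO(xml_data))
print(default_df)

# 2) Point xpath at <Amt> so that its attributes become the columns.
amt_df = pd.read_xml(StringIO(xml_data), xpath=".//Amt")
print(amt_df)

# 3) Flatten with XSLT so the attributes sit at the <row> level.
#    Amt_val and Amt_curr are illustrative names, not anything pandas requires.
xsl = """\
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="/data">
    <xsl:copy>
      <xsl:apply-templates select="row"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="row">
    <xsl:copy>
      <xsl:copy-of select="Type|Activation"/>
      <Amt_val><xsl:value-of select="Amt/@val"/></Amt_val>
      <Amt_curr><xsl:value-of select="Amt/@curr"/></Amt_curr>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>"""

flat_df = pd.read_xml(StringIO(xml_data), stylesheet=StringIO(xsl))
print(flat_df)
```

Option 2 is the lighter change if you only need the amounts; option 3 keeps Type and Activation in the same frame as the amount and currency. Both the XML and the stylesheet are wrapped in StringIO because recent pandas versions deprecate passing literal XML strings.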