Use Python to read an XML file with duplicate tag names


I need to read an XML file that contains multiple nested elements sharing the same tag name, Keyword.

Here is my example input, xmlfile.xml:

<LogPattern>    
    <DefectID>5664</DefectID>   
    <EscalationID>5782</EscalationID>
    <Title>ProcessImageRequests exception in ImageServer</Title>
    
    <KeywordList Type="sequence">       
        <Keyword name="a1" subrelationship="And">
            <Keyword name="a1001" subrelationship="Or">
                <Keyword name="a1001001" Value="Keyword1001001"/>
                <Keyword name="a1001002" subrelationship="Not" value="Keyword1001002"/>
            </Keyword>
            <Keyword name="a1002" LoggerName="Generator" Value="Keyword1002"/>
            <Keyword name="a1003" Value="Keyword1003"/>
            <Keyword name="a1004" Value="Keyword1004"/>
                
            
        </Keyword>
        <Keyword name="a2" Value="Keyword2"/>
        <Keyword name="a3" Value="Keyword3"/>
        <Keyword name="a4" Value="Keyword4"/>
    
    </KeywordList>
</LogPattern>

Here is my code, which reads the XML file and gets the elements with the tag name Keyword:

import xml.dom.minidom as xmldom


if __name__ == '__main__':
    xml_file = xmldom.parse('xmlfile.xml')
    eles = xml_file.documentElement
    print(eles.tagName)
    defect = eles.getElementsByTagName('DefectID')[0].firstChild.data
    escalation = eles.getElementsByTagName('EscalationID')[0].firstChild.data
    title = eles.getElementsByTagName('Title')[0].firstChild.data
    key = eles.getElementsByTagName('Keyword')
    print(key[0].getAttribute('name'))

This gets every Keyword element in one flat list, so the parent/child relationships are lost.

In the XML file, Keyword name="a1" is the parent of Keyword name="a1001", Keyword name="a1002", Keyword name="a1003" and Keyword name="a1004".

Is there a way to determine which Keyword elements are children of Keyword name="a1" without changing the tag name, and to print the result as a dictionary?

Expected output as a dictionary:


{'a1':
    {'a1001':
        {'a1001001': 'Keyword1001001',
         'a1001002': 'Keyword1001002'},
     'a1002': 'Keyword1002',
     'a1003': 'Keyword1003',
     'a1004': 'Keyword1004'
    }
}

There is 1 answer

Answered by Сергей Кох

xml.dom.minidom is not well suited to this problem, because getElementsByTagName returns every Keyword descendant at once, including the nested ones, so the hierarchy is lost.

import xml.dom.minidom as xmldom
from pprint import pprint

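if __name__ == '__main__':
    dom = xmldom.parse('xmlfile.xml')
    keywords = dom.getElementsByTagName("KeywordList")
    # getElementsByTagName searches all descendants, not just direct children,
    # so nested Keyword elements end up in the same flat list.
    keyword = keywords[0].getElementsByTagName("Keyword")
    pprint(keyword)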

-----------------------------

[<DOM Element: Keyword at 0x216a1bce160>,
 <DOM Element: Keyword at 0x216a1bce1f0>,
 <DOM Element: Keyword at 0x216a1bce280>,
 <DOM Element: Keyword at 0x216a1bce310>,
 <DOM Element: Keyword at 0x216a1bce3a0>,
 <DOM Element: Keyword at 0x216a1bce430>,
 <DOM Element: Keyword at 0x216a1bce4c0>,
 <DOM Element: Keyword at 0x216a1bce550>,
 <DOM Element: Keyword at 0x216a1bce5e0>,
 <DOM Element: Keyword at 0x216a1bce670>]
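If minidom has to be used anyway, one possible workaround is to skip getElementsByTagName and filter childNodes by hand, which keeps only the direct children of a node. This is just a sketch: the helper name child_keywords and the inlined SAMPLE_XML string (a trimmed copy of the question's file) are made up for illustration.

```python
import xml.dom.minidom as xmldom

# Trimmed copy of the question's XML, inlined so the sketch runs without a file.
SAMPLE_XML = """<LogPattern>
    <KeywordList Type="sequence">
        <Keyword name="a1" subrelationship="And">
            <Keyword name="a1001" subrelationship="Or">
                <Keyword name="a1001001" Value="Keyword1001001"/>
                <Keyword name="a1001002" subrelationship="Not" value="Keyword1001002"/>
            </Keyword>
            <Keyword name="a1002" LoggerName="Generator" Value="Keyword1002"/>
            <Keyword name="a1003" Value="Keyword1003"/>
            <Keyword name="a1004" Value="Keyword1004"/>
        </Keyword>
        <Keyword name="a2" Value="Keyword2"/>
        <Keyword name="a3" Value="Keyword3"/>
        <Keyword name="a4" Value="Keyword4"/>
    </KeywordList>
</LogPattern>"""


def child_keywords(node):
    """Return only the direct child elements of node whose tag is Keyword."""
    return [
        child for child in node.childNodes
        # childNodes also contains whitespace text nodes, so check the type first.
        if child.nodeType == child.ELEMENT_NODE and child.tagName == 'Keyword'
    ]


if __name__ == '__main__':
    dom = xmldom.parseString(SAMPLE_XML)
    keyword_list = dom.getElementsByTagName('KeywordList')[0]
    for keyword in child_keywords(keyword_list):
        nested = [child.getAttribute('name') for child in child_keywords(keyword)]
        print(keyword.getAttribute('name'), nested)
```

Run on the sample, the loop visits only a1, a2, a3 and a4 at the top level, and a1 lists a1001 through a1004 as its direct children.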

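With xml.etree.ElementTree the task becomes solvable, because iterating over an element yields only its direct children, so you can walk each level of same-named tags separately. The xml.dom.minidom documentation itself suggests xml.etree.ElementTree for users who are not already proficient with the DOM.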

import xml.etree.ElementTree as ET


if __name__ == '__main__':
    tree = ET.parse('xmlfile.xml')
    keywords = tree.find('KeywordList')
    # Iterating over an element visits its direct children only.
    for keyword in keywords:
        print(keyword.attrib)

--------------------------------

{'name': 'a1', 'subrelationship': 'And'}
{'name': 'a2', 'Value': 'Keyword2'}
{'name': 'a3', 'Value': 'Keyword3'}
{'name': 'a4', 'Value': 'Keyword4'}
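To get from this per-level iteration to the nested dictionary the question asks for, the same idea can be applied recursively. The following is a minimal sketch, not a definitive implementation: the function names keyword_to_dict and build_keyword_dict and the inlined SAMPLE_XML string are assumptions for illustration. It also assumes that a Keyword with nested Keyword children has no value of its own, and it accepts both spellings of the attribute, since the sample file uses Value on most elements but value on a1001002.

```python
import xml.etree.ElementTree as ET
from pprint import pprint

# Trimmed copy of the question's XML, inlined so the sketch runs without a file.
SAMPLE_XML = """<LogPattern>
    <KeywordList Type="sequence">
        <Keyword name="a1" subrelationship="And">
            <Keyword name="a1001" subrelationship="Or">
                <Keyword name="a1001001" Value="Keyword1001001"/>
                <Keyword name="a1001002" subrelationship="Not" value="Keyword1001002"/>
            </Keyword>
            <Keyword name="a1002" LoggerName="Generator" Value="Keyword1002"/>
            <Keyword name="a1003" Value="Keyword1003"/>
            <Keyword name="a1004" Value="Keyword1004"/>
        </Keyword>
        <Keyword name="a2" Value="Keyword2"/>
        <Keyword name="a3" Value="Keyword3"/>
        <Keyword name="a4" Value="Keyword4"/>
    </KeywordList>
</LogPattern>"""


def keyword_to_dict(element):
    """Map one Keyword element to {name: value} or {name: {nested keywords}}."""
    # findall with a bare tag name matches direct children only.
    children = element.findall('Keyword')
    if children:
        nested = {}
        for child in children:
            nested.update(keyword_to_dict(child))
        return {element.get('name'): nested}
    # Leaf keyword: the sample XML spells the attribute 'Value' and 'value'.
    return {element.get('name'): element.get('Value', element.get('value'))}


def build_keyword_dict(xml_text):
    """Build the nested dictionary for every top-level Keyword in KeywordList."""
    root = ET.fromstring(xml_text)
    result = {}
    for keyword in root.find('KeywordList').findall('Keyword'):
        result.update(keyword_to_dict(keyword))
    return result


if __name__ == '__main__':
    pprint(build_keyword_dict(SAMPLE_XML))
```

For the sample this gives a1 as a nested dictionary (with a1001 nested once more inside it), while a2, a3 and a4 map directly to their values. Duplicate name attributes at the same level would overwrite each other in this sketch, so it relies on the names being unique among siblings.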