I need to preprocess XML files for an NER task, and I am struggling with the conversion of the XML files. I guess there is a nice and easy way to solve the following problem.
Given an annotated text in XML with the following structure as input:
<doc>
Some <tag1>annotated text</tag1> in <tag2>XML</tag2>.
</doc>
As output, I want a CoNLL file in IOB2 tagging format, as follows:
Some O
annotated B-TAG1
text I-TAG1
in O
XML B-TAG2
. O
Let's convert your XML file to a TXT (called 'read.txt') as follows:
Then, using a regex and several if-else conditions, the code below returns an 'output.txt' file in CoNLL format, as you want.
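Here is a minimal sketch of that approach, not a definitive implementation. It assumes the tags carry no attributes, entities are not nested, and the root element is called `doc`. The names `TOKEN_RE`, `xml_to_conll` and `convert_file` are made up for illustration. Tokenization is deliberately naive (runs of word characters, plus single punctuation marks), so words containing hyphens or apostrophes would be split.

```python
import re

# One match per item: an opening/closing tag, a word, or a punctuation mark.
TOKEN_RE = re.compile(r"<(/?)(\w+)>|(\w+)|([^\w\s<>])")


def xml_to_conll(xml_text, root="doc"):
    """Return a list of 'token LABEL' lines in IOB2 format."""
    lines = []
    current = None   # upper-cased name of the entity tag we are inside
    first = False    # True until the first token of that entity is emitted
    for match in TOKEN_RE.finditer(xml_text):
        closing, tag, word, punct = match.groups()
        if tag is not None:
            if tag == root:
                continue              # the root element is not an entity
            if closing:
                current = None        # leaving the entity
            else:
                current, first = tag.upper(), True
        else:
            token = word or punct
            if current is None:
                label = "O"
            elif first:
                label = "B-" + current
                first = False
            else:
                label = "I-" + current
            lines.append(f"{token} {label}")
    return lines


def convert_file(src="read.txt", dst="output.txt"):
    """Read the XML text from src and write the CoNLL lines to dst."""
    with open(src, encoding="utf-8") as f:
        lines = xml_to_conll(f.read())
    with open(dst, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
```

Calling `convert_file()` with the default arguments reads 'read.txt' and writes 'output.txt'. For anything beyond simple, flat annotations, a real XML parser such as `xml.etree.ElementTree` would probably be more robust than a regex.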
output.txt: