TL;DR: I need to stop hxt from converting &lt; into < in the conversion between String and XMLTree, and vice-versa.
I'm using hxt to parse XML documents.
In the documents I need to parse, users can put anything they want in specific tags (for the sake of the example, let's call them <mytag> and </mytag>), but they have to escape any HTML inside. So, instead of writing <mytag><b>text</b></mytag> they have to write <mytag>&lt;b&gt;text&lt;/b&gt;</mytag>.
I don't control the format, I just have to parse it.
Specifically, I have to load the document into a String, use a user-provided xpath-query (given also as a String) to select portions of the document, convert the matching results to a list of Strings, and do the same thing again individually for each result to extract the sub-portions I need. For example, having:
<mytags>
<mytag><attribute type="somethingIrecognize">&lt;b&gt;text1&lt;/b&gt;</attribute><attribute type="someOtherThing">fdfdjskadjfk</attribute></mytag>
<mytag><attribute type="somethingIrecognize">&lt;b&gt;text2&lt;/b&gt;</attribute><attribute type="someOtherThing">fdfdjskadjfk</attribute></mytag>
</mytags>
The first XPath string (which has to be given as argument to my function as a Haskell String), has to first produce a list of tags. In this case, for example, applying "//mytags/*" first should produce:
[ "<mytag><attribute type=\"somethingIrecognize\">&lt;b&gt;text1&lt;/b&gt;</attribute><attribute type=\"someOtherThing\">value1</attribute></mytag>"
, "<mytag><attribute type=\"somethingIrecognize\">&lt;b&gt;text2&lt;/b&gt;</attribute><attribute type=\"someOtherThing\">value2</attribute></mytag>"
]
Now I can run the second search individually on each result to extract the data I need. For example, applying "//mytag/attribute[contains(@type,"somethingIrecognize")]/text()" to the first value should give me:
<b>text1</b>
Sometimes, users include HTML that is absolutely broken, but I still want to keep it literally the way they wrote it. For example, they might write (without a closing tag </b>):
<mytags>
<mytag>
<attribute type="somethingIrecognize">&lt;b&gt;text1</attribute>
<attribute type="someOtherThing">value1</attribute>
</mytag>
</mytags>
I'm finding an issue using HXT where, when I read the XML, select what I need (with xpath) and then turn it into a String for further processing at a later stage, the &lt;b&gt;text&lt;/b&gt; has been converted into <b>text</b>.
This makes later processors choke because, when the inner HTML is very broken, and no matter how hard I try to disable error detection, HXT still complains about missing closing tags, etc. and won't let me process it further.
In case it matters, I'm using this xshow to turn each result back into a string so that I can process it later with a second xpath query: https://hackage.haskell.org/package/hxt-9.3.1.22/docs/Text-XML-HXT-DOM-ShowXml.html#v:xshow
Is there a way to prevent that conversion?
Substitution for default XML entities (&lt;, &gt;, &amp;, &apos; and &quot;) occurs within the low-level parser and can't be disabled, unless you're willing to patch a custom version of hxt.
However, if you use xshowEscapeXml in place of xshow, it will escape & and < in text (though not > or the others), which is typically going to be enough for an XML parser to correctly re-parse the output.
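To make "escape & and < only" concrete, here is a minimal stand-alone sketch that uses nothing but base. Note that escapeTextMinimal is a hypothetical helper written for illustration; it is not part of hxt and only mimics, on a plain String, the kind of text escaping described above. With hxt itself you would rely on xshowEscapeXml rather than anything like this.

```haskell
module Main where

-- Hypothetical helper (NOT part of hxt): escape only the two characters
-- that would otherwise be read as markup when the text is parsed again.
-- '>', '\'' and '"' are deliberately left untouched.
escapeTextMinimal :: String -> String
escapeTextMinimal = concatMap escapeChar
  where
    escapeChar '&' = "&amp;"
    escapeChar '<' = "&lt;"
    escapeChar c   = [c]

main :: IO ()
main = do
  -- Broken inner HTML as it looks after the parser has substituted
  -- the entities (no closing </b>).
  let unescaped = "<b>text1"
  putStrLn (escapeTextMinimal unescaped)
  -- prints: &lt;b>text1
```

Because the lone < is turned back into &lt;, the unclosed b is ordinary text again rather than an unbalanced tag, so a second parse of the serialized result should no longer complain about missing closing tags.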