
Java: remove < and > from text in XML (not tags)


I'm having a hard time escaping XML so that it can be processed in Java. I'm using JTidy to escape unwanted characters, but I can't get rid of "<" and ">" inside text values such as <tag> capacity < 1000 </tag>.

I'm using the code below to escape the input:

    public String CleanXML(String input){

        Tidy tidy = new Tidy();
        tidy.setInputEncoding("UTF-16");
        tidy.setOutputEncoding("UTF-16");
        tidy.setWraplen(Integer.MAX_VALUE);
        tidy.setXmlOut(true);
        tidy.setSmartIndent(true);
        tidy.setXmlTags(true);
        tidy.setMakeClean(true);
        tidy.setForceOutput(true);
        tidy.setQuiet(true);
        tidy.setShowWarnings(false);
        StringReader in = new StringReader(input);
        StringWriter out = new StringWriter();
        tidy.parse(in, out);

        return out.toString();
    }

There are 2 answers

Nilanka Manoj (best answer)

Use the following function:

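    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    private static final Pattern TAG_REGEX = Pattern.compile("<tag>(.+?)</tag>", Pattern.DOTALL);

    public String CleanXML(String input) {
        final Matcher matcher = TAG_REGEX.matcher(input);
        // Strings are immutable, so the cleaned text is collected in a buffer.
        final StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            // Keep only letters, digits, and whitespace in the text between the tags.
            String valueReplace = matcher.group(1).replaceAll("[^a-zA-Z0-9\\s]", "");
            matcher.appendReplacement(result, Matcher.quoteReplacement("<tag>" + valueReplace + "</tag>"));
        }
        // Copy whatever follows the last match unchanged.
        matcher.appendTail(result);
        return result.toString();
    }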

It uses a regular expression to find the text between the tags, then removes every character that is not a letter, a digit, or whitespace. The regular expression and the basic idea are taken from "Java regex to extract text between tags".
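If the value itself should survive (for example the "<" in "capacity < 1000"), a variation on the same regex idea is to escape the angle brackets instead of deleting them. The following is only a sketch: the class and method names are made up for illustration, and it assumes the element is literally named `tag`, has no attributes, and contains no child elements.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagTextEscaper {
    // Assumption: the element is literally <tag>...</tag> with no nested elements.
    private static final Pattern TAG_REGEX =
            Pattern.compile("<tag>(.*?)</tag>", Pattern.DOTALL);

    // Replaces '<' and '>' inside the element text with XML entities.
    // Ampersands are left alone, so existing entities are not double-escaped.
    public static String escapeTagText(String input) {
        Matcher matcher = TAG_REGEX.matcher(input);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String escaped = matcher.group(1)
                    .replace("<", "&lt;")
                    .replace(">", "&gt;");
            // quoteReplacement guards against '$' or '\' in the text being
            // interpreted as group references.
            matcher.appendReplacement(result,
                    Matcher.quoteReplacement("<tag>" + escaped + "</tag>"));
        }
        matcher.appendTail(result);
        return result.toString();
    }
}
```

With the input from the question, `TagTextEscaper.escapeTagText("<tag> capacity < 1000 </tag>")` returns `<tag> capacity &lt; 1000 </tag>`, which an XML parser should then accept.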

Nilanka Manoj

If you want to remove the tag delimiters from the XML, convert it to a map and build the string you need from that; see "XML to map in Java".

If you want to clean the attribute values, you can iterate over the map, clean each value, and then either build a string or convert the map back to XML; see "map to XML in Java".
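A rough sketch of the clean-and-rebuild step might look like this. It assumes the XML has already been turned into a flat `Map<String, String>` of element name to text (how that conversion is done is outside this sketch), and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MapCleaner {
    // Removes '<' and '>' from every value; keys (element names) are untouched.
    public static Map<String, String> cleanValues(Map<String, String> elements) {
        // LinkedHashMap keeps the original element order.
        Map<String, String> cleaned = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : elements.entrySet()) {
            cleaned.put(entry.getKey(), entry.getValue().replaceAll("[<>]", ""));
        }
        return cleaned;
    }

    // Rebuilds a flat XML fragment, one element per map entry.
    // Assumption: no nesting and no attributes.
    public static String toXml(Map<String, String> elements) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> entry : elements.entrySet()) {
            sb.append('<').append(entry.getKey()).append('>')
              .append(entry.getValue())
              .append("</").append(entry.getKey()).append('>');
        }
        return sb.toString();
    }
}
```

Calling `toXml(cleanValues(map))` on a map that holds `tag` with the value ` capacity < 1000 ` gives back a `<tag>` element whose text no longer contains the "<".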