Background
I need to parse a URL string out of HTML (it seems to be embedded in JSON), so I tried org.apache.commons.text.StringEscapeUtils.unescapeJson.
An example of such a URL, as it appears in the input, starts like this:
https:\/\/scontent.cdninstagram.com\/v\/t51.2885-19\/40405422_462181764265305_1222152915674726400_n.jpg?stp=dst-jpg_s150x150\\u0026
The problem
Some characters don't seem to be handled. If I run this:
val test="https:\\/\\/scontent.cdninstagram.com\\/v\\/t51.2885-19\\/40405422_462181764265305_1222152915674726400_n.jpg?stp=dst-jpg_s150x150\\\\u0026\n"
Log.d("AppLog", "${StringEscapeUtils.unescapeJson(test)}")
the result is:
https://scontent.cdninstagram.com/v/t51.2885-19/40405422_462181764265305_1222152915674726400_n.jpg?stp=dst-jpg_s150x150\u0026
You can see that "\u0026" is still in it. I found that this solves it:
StringEscapeUtils.unescapeJson(input).replace("\\u0026","&").replace("\\/", "/")
This works, but I think I should use something more official, since such direct substring replacement might fail on other inputs.
What I've tried
Looking at the unescapeJson code (which seems to be the same for Java and JSON), I thought that maybe I could just add the rules:
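import java.util.Collections
import org.apache.commons.lang3.StringUtils
import org.apache.commons.text.translate.AggregateTranslator
import org.apache.commons.text.translate.EntityArrays
import org.apache.commons.text.translate.LookupTranslator
import org.apache.commons.text.translate.OctalUnescaper
import org.apache.commons.text.translate.UnicodeUnescaper

/** Based on StringEscapeUtils.unescapeJson, but with the addition of 2 more rules. */
fun unescapeUrl(input: String): String {
    val unescapeJavaMap = hashMapOf<CharSequence, CharSequence>(
        "\\\\" to "\\",
        "\\\"" to "\"",
        "\\'" to "'",
        "\\" to StringUtils.EMPTY,
        // added rules:
        "\\u0026" to "&",
        "\\/" to "/"
    )
    val aggregateTranslator = AggregateTranslator(
        OctalUnescaper(),
        UnicodeUnescaper(),
        LookupTranslator(EntityArrays.JAVA_CTRL_CHARS_UNESCAPE),
        LookupTranslator(Collections.unmodifiableMap(unescapeJavaMap))
    )
    return aggregateTranslator.translate(input)
}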
This doesn't work. It leaves the string with "\u0026" in it.
The questions
What did I do wrong here? How can I fix this?
Is it true that it's best to use something similar to the original code instead of "replace"?
BTW, I use this on Android with Kotlin, but the same can be done in Java on a PC.
Let me just give you my working example using StringEscapeUtils.unescapeJson(input) without replace. I've also looked into the StringEscapeUtils source code, which might help you a bit.
Here is my working Kotlin code (Java works the same in my test).
Output:
As you can see, the outputs are identical regardless of using the replace logic. I'm using org.apache.commons:commons-text:1.10.0.
If we look into their source code, it becomes clear that we don't have to add any replace("\\u0026", "&").replace("\\/", "/") because:
- \u0026 is already covered by the UnicodeUnescaper in UNESCAPE_JAVA, the implementation your unescapeUrl was originally replicated from.
- The "\\/" string (a backslash followed by a slash) is handled by another existing rule in UNESCAPE_JAVA, unescapeJavaMap.put("\\", StringUtils.EMPTY), which is also replicated in your unescapeUrl.
So, answering your questions (NB: also see the UPDATE below, which takes into account the "broken" input that the author posted later):
Plain StringEscapeUtils.unescapeJson(input) should be enough: as you can see, it works in my Kotlin example (Java as well). Maybe the difference is the version of the commons-text library? But I doubt that. I'm also on a PC, not Android. See the UPDATE below, which explains the "broken" input posted later by the author and how to deal with it.
I hope this answer helps. Also, as was mentioned in the comments, a good example from you would be very helpful!
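For readers who want to check this claim themselves, here is a minimal sketch of such a comparison. It is not the listing from this answer: the example.com URL is made up, and it assumes org.apache.commons:commons-text:1.10.0 is on the classpath and that the input is well-formed, meaning every escape uses a single backslash.

```kotlin
import org.apache.commons.text.StringEscapeUtils

fun main() {
    // Hypothetical, well-formed input. Raw text:
    // https:\/\/example.com\/a.jpg?x=1\u0026y=2
    val input = "https:\\/\\/example.com\\/a.jpg?x=1\\u0026y=2"

    // Plain unescapeJson: \/ loses its backslash, \u0026 becomes &
    val plain = StringEscapeUtils.unescapeJson(input)

    // The same call followed by the manual replacements from the question
    val withReplace = plain.replace("\\u0026", "&").replace("\\/", "/")

    println(plain)                // prints https://example.com/a.jpg?x=1&y=2
    println(plain == withReplace) // prints true
}
```

On this kind of input the extra replace calls find nothing left to replace, which is why both variants agree.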
UPDATE: Looking into the author's example (posted later), I can see that the escaped-unicode representation of the ampersand is effectively double-escaped in the input, as \\u0026 instead of \u0026. Hence the problem. If you look into the source code of UNESCAPE_JAVA (UNESCAPE_JSON), you will see that the \\ string gets transformed into a single backslash \, as in unescapeJavaMap.put("\\\\", "\\"), and then the translator loop advances the index by 2, since two characters have been consumed, which places the index at the u character. The backslash that was just produced is never looked at again together with u0026, so no Unicode unescaping happens.
I would say this is an upstream problem: something sends you a badly formatted string. Ideally, it should be fixed there so that characters represented in escaped-unicode format are not double-escaped. Then \\u0026 would become \u0026.
You can also compose your own AggregateTranslator so that it properly handles this scenario. There might be a few options, but they could all be error-prone and stop working properly in other scenarios, so you have to be careful with that.
You can also run the unescapeJson method twice, which works in your particular example: StringEscapeUtils.unescapeJson(StringEscapeUtils.unescapeJson(input)). But obviously, you could easily over-unescape the input.
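To make the index-advance behaviour and the over-unescaping risk concrete, here is a simplified model in plain Kotlin. It is an illustration only, not the commons-text implementation: unescapeOnce is a hypothetical helper that mimics just three of the rules discussed above (\uXXXX decoding, \\ to \, and dropping a lone backslash), and the sample strings are made up.

```kotlin
// Simplified, hypothetical model of one left-to-right unescape pass.
// NOT the commons-text code; it only mimics the rules discussed above.
fun unescapeOnce(input: String): String {
    val out = StringBuilder()
    var i = 0
    while (i < input.length) {
        val c = input[i]
        if (c != '\\') {
            out.append(c)
            i++
            continue
        }
        val next = input.getOrNull(i + 1)
        val isUnicodeEscape = next == 'u' && i + 6 <= input.length &&
            input.substring(i + 2, i + 6).all { it in '0'..'9' || it in 'a'..'f' || it in 'A'..'F' }
        when {
            // \uXXXX: decode the four hex digits into one character
            isUnicodeEscape -> {
                out.append(input.substring(i + 2, i + 6).toInt(16).toChar())
                i += 6
            }
            // \\ -> \ ; both input characters are consumed, so the emitted
            // backslash is never re-examined together with what follows it
            next == '\\' -> {
                out.append('\\')
                i += 2
            }
            // a lone backslash (as in \/) is simply dropped
            else -> i++
        }
    }
    return out.toString()
}

fun main() {
    // Double-escaped input. Raw text: x=1\\u0026y=2
    val doubleEscaped = "x=1\\\\u0026y=2"
    val once = unescapeOnce(doubleEscaped)
    println(once)               // prints x=1\u0026y=2
    println(unescapeOnce(once)) // prints x=1&y=2

    // Correctly escaped input. Raw text: C:\\data (encodes the literal C:\data)
    val path = "C:\\\\data"
    println(unescapeOnce(path))               // prints C:\data
    println(unescapeOnce(unescapeOnce(path))) // prints C:data (over-unescaped)
}
```

The second pass repairs the double-escaped ampersand, but it also eats the backslash in a value that was escaped correctly, which is the over-unescaping risk mentioned above.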