I need to convert universal character name (UCN) data from a database to UTF-8. It seems trivial, but I have spent hours reading about Unicode, UTF-8, wide strings, and so on, without any result.
As an example, the following string needs to be converted from "D\u00c3\u00bcsseldorf" to "Düsseldorf".
What I tried:
char str[] = "\u00c3\u00bc"; // corresponds to ü
size_t str_len = strlen(str);
for (size_t i = 0; i < str_len; i++)
    printf("%02hhx ", str[i]);
printf("- %zu - %s\n", str_len, str); // prints "c3 83 c2 bc - 4 - Ã¼"
c3 is correct, but the next 3 bytes are unexpected.
The compiler only considers the first part of the UCN (\u00c3).
wchar_t wcs[] = L"\u00c3\u00bc";
size_t wcs_len = wcslen(wcs);
for (size_t i = 0; i < wcs_len; i++)
    printf("%02hhx ", wcs[i]);
printf("- %zu - %ls\n", wcs_len, wcs); // prints "c3 bc - 2 - Ã¼"
Looks better.
The entire UCN is considered (c3 bc), but still no ü.
char str[] = "\xc3\xbc";
size_t str_len = strlen(str);
for (size_t i = 0; i < str_len; i++)
printf("%02hhx ", str[i]);
printf("- %zu %s\n", str_len, str); // prints "c3 bc - 2 ü"
This prints the ü, but I had to change str from UCNs to hex escapes.
What am I missing to get from \u00c3\u00bc to ü?
--- UPDATE ---
Like Rob Napier described, I have to change the initial string literal, since it was badly/double encoded. I believe the only solution is to manually change "D\u00c3\u00bcsseldorf" to "Düsseldorf" or "D\u00fcsseldorf". Both ways require a manual change.
Changing it to "D\xc3\xbcsseldorf" produces the correct result "Düsseldorf", but only by coincidence, because the character following the second injected byte (\xbc) is not a hex digit (the letter s). "AAA\xc3\xbcBBB" gives "AAAû" (0x41 0x41 0x41 0xc3 0xbb), because the \x escape keeps consuming the hex digits BBB. Too bad that \x in a string literal doesn't stop after one byte.
This is where you went wrong. This is not ü. This is Ã¼, just as is being output. The UCN for ü is \u00fc: LATIN SMALL LETTER U WITH DIAERESIS.

Unicode code points (which are what UCNs encode) assign a single number to each Unicode character. They are the identifier for the character, not the encoding.

What you've written here is the UTF-8 encoding of ü. UTF-8 is a way of writing down Unicode code points. Except for ASCII values (0-127), the UTF-8 bytes are always very different from the code point's value. (UTF-8 is possibly the most clever and useful text encoding ever devised. But it is not trivial to understand.)

If you want to hand-encode UTF-8, then the \x syntax is correct. You can inject arbitrary bytes into a C string that way. Generally you should prefer the \u00fc syntax when expressing a character, however.

The reason your first byte seemed correct is that the UTF-8 encoding of Ã is c3 83. "c3" is the first byte of the UTF-8 encoding of many modified Latin characters. Seeing a lot of c3 bytes is an easy way to detect Western European UTF-8 text.