I need to convert universal character name (UCN) data from a database to UTF-8. It seems trivial, but I have spent hours reading about Unicode, UTF-8, wide strings, and so on, without any result.
As an example, the following string needs to be converted from "D\u00c3\u00bcsseldorf" to "Düsseldorf".
What I tried:
char str[] = "\u00c3\u00bc"; // corresponds to ü
size_t str_len = strlen(str);
for (size_t i = 0; i < str_len; i++)
    printf("%02hhx ", str[i]);
printf("- %zu - %s\n", str_len, str); // prints "c3 83 c2 bc - 4 - Ã¼"
c3 is correct, but the next 3 bytes are unexpected.
The compiler only considers the first part of the UCN (\u00c3).
wchar_t wcs[] = L"\u00c3\u00bc";
size_t wcs_len = wcslen(wcs);
for (size_t i = 0; i < wcs_len; i++)
    printf("%02hhx ", wcs[i]);
printf("- %zu - %ls\n", wcs_len, wcs); // prints "c3 bc - 2 - Ã¼"
Looks better.
The entire UCN is considered (c3 bc), but still no ü.
char str[] = "\xc3\xbc";
size_t str_len = strlen(str);
for (size_t i = 0; i < str_len; i++)
printf("%02hhx ", str[i]);
printf("- %zu %s\n", str_len, str); // prints "c3 bc - 2 ü"
This prints the ü, but I had to change str from UCNs to hex escapes.
What am I missing to get from \u00c3\u00bc to ü?
--- UPDATE ---
Like Rob Napier described, I have to change the initial string literal, since it was badly/double encoded. I believe the only solution is to manually change "D\u00c3\u00bcsseldorf" to "Düsseldorf" or "D\u00fcsseldorf". Both ways require a manual change.
Changing it to "D\xc3\xbcsseldorf" produces the correct result "Düsseldorf", but only by coincidence, because the character following the second injected byte (\xbc) is not a hex digit (the letter s). "AAA\xc3\xbcBBB" gives "AAAû" (0x41 0x41 0x41 0xc3 0xbb), because the \x escape keeps consuming the hex digits BBB. Too bad that \x in a string literal doesn't stop after one byte.
This is where you went wrong. This is not ü. This is Ã¼, just as is being output. The UCN for ü is \u00fc: LATIN SMALL LETTER U WITH DIAERESIS.

Unicode code points (which are what UCNs encode) assign a single number to each Unicode character. They are the identifier for the character, not the encoding.

What you've written here is the UTF-8 encoding of ü. UTF-8 is a way of writing down Unicode code points. Except for ASCII values (0-127), the UTF-8 bytes are always very different from the code point's value. (UTF-8 is possibly the most clever and useful text encoding ever devised. But it is not trivial to understand.)

If you want to hand-encode UTF-8, then the \x syntax is correct. You can inject arbitrary bytes into a C string that way. Generally you should prefer the \u00fc syntax when expressing a character, however.

The reason your first byte seemed correct is that the UTF-8 encoding of Ã is c3 83. "c3" is the first byte of the UTF-8 encoding of many modified Latin characters. Seeing a lot of c3 bytes is an easy way to detect Western European UTF-8 text.