Does UTF-8 have more than one version?


I read the following in PHP Manual > Language Reference > Types: Details of the String Type:

Given that PHP does not dictate a specific encoding for strings, one might wonder how string literals are encoded. For instance, is the string "á" equivalent to "\xE1" (ISO-8859-1), "\xC3\xA1" (UTF-8, C form), "\x61\xCC\x81" (UTF-8, D form) or any other possible representation?

What do "UTF-8, C form" and "UTF-8, D form" mean? Are they two versions of UTF-8?

There is 1 answer

Ayb009 On

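No, there is only one UTF-8. "C form" and "D form" refer to the Unicode normalization forms NFC (canonical composition) and NFD (canonical decomposition). Normalization decides which code points represent a character; UTF-8 then encodes those code points into bytes in the usual way. In C form an accented letter is stored as a single precomposed code point where one exists, and in D form it is stored as a base letter followed by a combining mark. Both sequences are valid UTF-8 and render identically, but they are different byte strings. Example:

  • (é) in C form is the single code point U+00E9, encoded in UTF-8 as two bytes: 0xC3 0xA9
  • (é) in D form is the two code points U+0065 U+0301 (e followed by a combining acute accent), encoded in UTF-8 as three bytes: 0x65 0xCC 0x81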
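
To check these byte sequences yourself, here is a minimal sketch in Python (chosen only because its standard library ships `unicodedata`; the question itself is about PHP). The helper name `utf8_hex` is made up for this illustration.

```python
import unicodedata


def utf8_hex(text: str, form: str) -> str:
    """Normalize text to the given form ("NFC" or "NFD"),
    then return its UTF-8 bytes as space-separated hex."""
    normalized = unicodedata.normalize(form, text)
    return " ".join(f"{byte:02x}" for byte in normalized.encode("utf-8"))


a_acute = "\u00e1"  # á, the character from the PHP manual quote
e_acute = "\u00e9"  # é, the character from the example above

print(utf8_hex(a_acute, "NFC"))  # c3 a1
print(utf8_hex(a_acute, "NFD"))  # 61 cc 81
print(utf8_hex(e_acute, "NFC"))  # c3 a9
print(utf8_hex(e_acute, "NFD"))  # 65 cc 81
```

The first two results match the "\xC3\xA1" and "\x61\xCC\x81" literals quoted from the manual. In PHP, the intl extension's `Normalizer::normalize()` with `Normalizer::FORM_C` or `Normalizer::FORM_D` performs the same conversion, provided that extension is installed.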