I am going through Go By Example, and the strings and runes section is terribly confusing.
Running this:
sample := "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"
fmt.Println(sample)
fmt.Printf("%%q: %q\n", sample)
fmt.Printf("%%+q: %+q\n", sample)
yields this:
��=� ⌘
%q: "\xbd\xb2=\xbc ⌘"
%+q: "\xbd\xb2=\xbc \u2318"
...which is fine. The 1st, 2nd and 4th runes seem to be non-printable, which I guess means that \xbd, \xb2 and \xbc are simply not supported by Unicode or something (correct me if I'm wrong), and so they show up as �. Both %q and %+q correctly escape those 3 non-printable runes.
But now when I iterate over the string like so:
for _, runeValue := range sample {
	fmt.Printf("% x, %q, %+q\n", runeValue, runeValue, runeValue)
}
suddenly the 3 non-printable runes are not escaped by %q and remain as �, and %+q attempts to reveal their underlying code point, which is obviously incorrect:
fffd, '�', '\ufffd'
fffd, '�', '\ufffd'
3d, '=', '='
fffd, '�', '\ufffd'
20, ' ', ' '
2318, '⌘', '\u2318'
Even stranger, if I iterate over the string as a byte slice:
for _, runeValue := range []byte(sample) {
	fmt.Printf("% x, %q, %+q\n", runeValue, runeValue, runeValue)
}
suddenly, these runes are no longer non-printable, and their underlying code points are correct:
bd, '½', '\u00bd'
b2, '²', '\u00b2'
3d, '=', '='
bc, '¼', '\u00bc'
20, ' ', ' '
e2, 'â', '\u00e2'
8c, '\u008c', '\u008c'
98, '\u0098', '\u0098'
Can someone explain what's happening here?
fmt.Printf does a lot of magic under the covers to render as much useful information as it can via type inspection etc. If you want to verify whether a string (or a byte slice) is valid UTF-8, use the standard library package unicode/utf8. For example:
Scanning the individual runes of the string, we can identify what makes this string invalid (from a UTF-8 encoding perspective). Note: the hex value 0xfffd indicates a bad rune was encountered. This error value is defined as a package constant, utf8.RuneError: https://go.dev/play/p/9NO9xMvcxCp
produces:
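The linked playground code and its output are not reproduced here. As a rough sketch of the kind of scan described above (not necessarily what the playground does), the following decodes the string one rune at a time with utf8.DecodeRuneInString. The helper name invalidOffsets is hypothetical and introduced only for this illustration. A decoded rune of utf8.RuneError with a width of 1 marks an invalid byte, which distinguishes it from a genuine U+FFFD in the input (that one is 3 bytes wide).

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// invalidOffsets is a hypothetical helper: it returns the byte offsets
// at which s contains a byte that is not part of a valid UTF-8 sequence.
func invalidOffsets(s string) []int {
	var bad []int
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		// RuneError with size 1 means "invalid encoding", not a real U+FFFD.
		if r == utf8.RuneError && size == 1 {
			bad = append(bad, i)
		}
		i += size
	}
	return bad
}

func main() {
	sample := "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"

	fmt.Println(invalidOffsets(sample)) // [0 1 3]

	// Report each decoded rune, showing the raw byte where decoding failed.
	for i := 0; i < len(sample); {
		r, size := utf8.DecodeRuneInString(sample[i:])
		if r == utf8.RuneError && size == 1 {
			fmt.Printf("offset %d: invalid byte 0x%02x\n", i, sample[i])
		} else {
			fmt.Printf("offset %d: %q (U+%04X, %d bytes)\n", i, r, r, size)
		}
		i += size
	}
}
```

This also accounts for the two loops in the question. Ranging over the string decodes UTF-8, so each stray byte comes back as the rune 0xfffd and the original byte value is lost. Ranging over []byte(sample) hands fmt.Printf plain integers, and %q on an integer formats it as if it were a code point, which is why 0xbd prints as '½' (U+00BD) even though the string never contained that character.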