runes within strings

Question

Asked by Farhan Alvi on 25 March 2023 at 17:24 (145 views)

I am going through Go By Example, and the strings and runes section is terribly confusing.

Running this:

    sample := "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"
    fmt.Println(sample)
    fmt.Printf("%%q: %q\n", sample)
    fmt.Printf("%%+q: %+q\n", sample)

yields this:

��=� ⌘
%q: "\xbd\xb2=\xbc ⌘"
%+q: "\xbd\xb2=\xbc \u2318"

...which is fine. The 1st, 2nd and 4th runes seem to be non-printable, which I guess means that \xbd, \xb2 and \xbc are simply not supported by Unicode or something (correct me if I'm wrong), and so they show up as �. Both %q and %+q also correctly escape those 3 non-printable runes.

But now when I iterate over the string like so:

    for _, runeValue := range sample {
        fmt.Printf("% x, %q, %+q\n", runeValue, runeValue, runeValue)
    }

suddenly the 3 non-printable runes are not escaped by %q and remain as �, and %+q attempts to reveal their underlying code point, which is obviously incorrect:

 fffd, '�', '\ufffd'
 fffd, '�', '\ufffd'
 3d,   '=' ,  '='
 fffd, '�', '\ufffd'
 20,   ' ' ,  ' '
 2318, '⌘', '\u2318'

Even stranger, if I iterate over the string as a byte slice:

    for _, runeValue := range []byte(sample) {
        fmt.Printf("% x, %q, %+q\n", runeValue, runeValue, runeValue)
    }

suddenly, these runes are no longer non-printable, and their underlying code points are correct:

 bd, '½', '\u00bd'
 b2, '²', '\u00b2'
 3d, '=', '='
 bc, '¼', '\u00bc'
 20, ' ', ' '
 e2, 'â', '\u00e2'
 8c, '\u008c', '\u008c'
 98, '\u0098', '\u0098'

Can someone explain what's happening here?

1 answer

**colm.anseo** · Accepted Answer · 2023-03-25T18:44:43+00:00

fmt.Printf does a lot of work under the covers: it inspects the type of each argument to render as much useful information as it can, which is why the same verb can produce different output for a string, a rune and a byte. To check whether a string (or a byte slice) is valid UTF-8, use the standard library package unicode/utf8.

For example:

    import (
        "fmt"
        "unicode/utf8"
    )

    var sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"

    fmt.Printf("%q valid? %v\n", sample, utf8.ValidString(sample)) // reports "false"
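The type inspection mentioned above can be seen in isolation with a small sketch. It formats the same byte value 0xbd three ways: as a one-byte string, as the rune that ranging over the string yields for it, and as a plain byte. The helper name `describe` is made up for this illustration and is not part of the original answer.

```go
package main

import "fmt"

// describe (hypothetical helper) formats one value with %q and %+q
// so that the results for different types can be compared side by side.
func describe(v any) string {
	return fmt.Sprintf("%q %+q", v, v)
}

func main() {
	// A string: %q sees a byte that is not valid UTF-8 and escapes it as \xbd.
	fmt.Println(describe("\xbd"))

	// A rune: ranging over the string has already replaced the bad byte
	// with U+FFFD, so %q prints the replacement character itself.
	fmt.Println(describe(rune(0xfffd)))

	// A byte: %q treats the integer 0xbd as the code point U+00BD,
	// which is the valid, printable character '½'.
	fmt.Println(describe(byte(0xbd)))
}
```

In other words, the byte 0xbd on its own is not a valid UTF-8 sequence, but the number 0xbd is a perfectly good code point, which would be consistent with the byte-slice loop in the question printing '½'.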

Ranging over the string rune by rune shows what makes it invalid from a UTF-8 encoding perspective. Note: the value 0xfffd (U+FFFD, the Unicode replacement character) indicates that a bad rune was encountered. This error value is defined as the package constant utf8.RuneError:

    for _, r := range sample {
        // range yields utf8.RuneError (0xfffd) for each byte that is not valid UTF-8
        validRune := r != utf8.RuneError

        if validRune {
            fmt.Printf("'%c' validRune: true   hex: %4x\n", r, r)
        } else {
            fmt.Printf("'%c' validRune: false\n", r)
        }
    }

https://go.dev/play/p/9NO9xMvcxCp

produces:

'�' validRune: false
'�' validRune: false
'=' validRune: true   hex:   3d
'�' validRune: false
' ' validRune: true   hex:   20
'⌘' validRune: true   hex: 2318
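One caveat with comparing against utf8.RuneError alone: a string may legitimately contain a correctly encoded U+FFFD. A minimal sketch of a stricter check follows, assuming you want the byte offsets of the invalid bytes. It relies on utf8.DecodeRuneInString returning a width of 1 together with utf8.RuneError for an invalid byte, whereas a real U+FFFD decodes with a width of 3. The function name `invalidOffsets` is invented for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// invalidOffsets (hypothetical helper) returns the byte offsets in s that
// do not start a valid UTF-8 sequence.
func invalidOffsets(s string) []int {
	var offsets []int
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		// RuneError with width 1 means an invalid byte; a genuine,
		// correctly encoded U+FFFD has width 3 and is not flagged.
		if r == utf8.RuneError && size == 1 {
			offsets = append(offsets, i)
		}
		i += size
	}
	return offsets
}

func main() {
	sample := "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"
	fmt.Println(invalidOffsets(sample))           // [0 1 3]
	fmt.Println(invalidOffsets("\ufffd is fine")) // []
}
```

The offsets 0, 1 and 3 correspond to the bytes \xbd, \xb2 and \xbc, the same three positions the loop above reports as invalid.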
