Test if char* string contains multibyte characters

Question

5.8k views Asked by cpx At 16 February 2011 at 05:26

I receive a byte stream buffer from a TCP server, and it may contain multibyte sequences that encode Unicode characters. Is there always a BOM I can check for to detect those characters? If not, how would you detect them?

There are 5 answers

Enno On 16 February 2011 at 05:32

BOMs are mostly optional. If the server that you're receiving from is sending multibyte characters, it might assume that you know this and save itself the bytes for the BOM (2 in UTF-16, 3 in UTF-8). Are you asking for a way to tell whether data that you receive is likely to be a multibyte string?

Washu On 16 February 2011 at 05:34

There are lots of ways to detect multibyte characters, and unfortunately... none of them are reliable.

If this is the response to a web request, check the headers: the Content-Type header will often state the page encoding, which can indicate whether multibyte characters are present.

You can also check for a BOM. The BOM character (U+FEFF) is not expected inside ordinary text, so it can't hurt to see whether one is there. However, BOMs are optional and are often absent (depending on implementation, configuration, etc.).
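As a rough sketch of the BOM check, the function below compares the start of a buffer against the byte order marks of UTF-8, UTF-16 and UTF-32. The names `Bom`, `starts_with` and `detect_bom` are made up for this illustration and do not come from any answer here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical result type: which BOM, if any, starts the buffer.
enum class Bom { None, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// True if the buffer begins with the given byte sequence.
static bool starts_with(const unsigned char *data, std::size_t len,
                        const unsigned char *mark, std::size_t mark_len) {
    return len >= mark_len && std::memcmp(data, mark, mark_len) == 0;
}

Bom detect_bom(const unsigned char *data, std::size_t len) {
    assert(data != nullptr || len == 0);

    static const unsigned char utf32le[] = {0xFF, 0xFE, 0x00, 0x00};
    static const unsigned char utf32be[] = {0x00, 0x00, 0xFE, 0xFF};
    static const unsigned char utf8[]    = {0xEF, 0xBB, 0xBF};
    static const unsigned char utf16le[] = {0xFF, 0xFE};
    static const unsigned char utf16be[] = {0xFE, 0xFF};

    // UTF-32 is tested before UTF-16 because FF FE 00 00 also begins with FF FE.
    if (starts_with(data, len, utf32le, sizeof utf32le)) return Bom::Utf32LE;
    if (starts_with(data, len, utf32be, sizeof utf32be)) return Bom::Utf32BE;
    if (starts_with(data, len, utf8, sizeof utf8))       return Bom::Utf8;
    if (starts_with(data, len, utf16le, sizeof utf16le)) return Bom::Utf16LE;
    if (starts_with(data, len, utf16be, sizeof utf16be)) return Bom::Utf16BE;
    return Bom::None;
}
```

A result of `Bom::None` says nothing about the encoding, since the BOM is optional. The bytes FF FE 00 00 are also ambiguous: they could be a UTF-16LE BOM followed by U+0000, which this sketch reports as UTF-32LE.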

Artyom On 19 February 2011 at 19:34

In UTF-8, any byte that has its 8th (highest) bit set is part of a multibyte code point. So checking (0x80 & c) != 0 for each byte is the simplest way to do this.
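A minimal sketch of that per-byte test applied to a whole buffer, assuming the data is already known to be UTF-8. The function name `contains_multibyte` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: true if any byte in [buf, buf + len) has its high bit set,
// i.e. the buffer contains at least one byte of a multibyte UTF-8 sequence.
bool contains_multibyte(const char *buf, std::size_t len) {
    assert(buf != nullptr || len == 0);
    for (std::size_t i = 0; i < len; ++i) {
        // Convert to unsigned char first: plain char may be signed.
        if ((static_cast<unsigned char>(buf[i]) & 0x80) != 0) {
            return true;
        }
    }
    return false;
}
```

The length is passed explicitly instead of relying on a terminating NUL, because a buffer read from a socket is not necessarily NUL-terminated.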

ynn On 22 March 2020 at 10:42

Let me implement dan04's answer.

The code below uses C++14. If you can only use an older version of C++, you have to rewrite the binary literals (e.g. 0b10) as decimal or hexadecimal literals (e.g. 2).

Implementation

int is_utf8_character(unsigned char c) { //takes `unsigned char` so the shifts never see a negative value
    if ((c >> 7) == 0b1) {
        if ((c >> 6) == 0b10) {
            return 2; //2nd, 3rd or 4th byte of a multi-byte character
        } else {
            return 1; //1st byte of a multi-byte character
        }
    } else {
        return 0; //a single-byte (ASCII) character
    }
}

Example

Code

#include <cstddef>
#include <iostream>
#include <string>

using namespace std;

namespace N {

    int is_utf8_character(unsigned char c) { //takes `unsigned char` so the shifts never see a negative value
        if ((c >> 7) == 0b1) {
            if ((c >> 6) == 0b10) {
                return 2; //2nd, 3rd or 4th byte of a multi-byte character
            } else {
                return 1; //1st byte of a multi-byte character
            }
        } else {
            return 0; //a single-byte (ASCII) character
        }
    }

    //Counts characters by skipping every continuation byte.
    unsigned get_string_length(const string &s) {
        unsigned length = 0;
        for (size_t i = 0; i < s.size(); ++i) {
            if (is_utf8_character(s[i]) != 2) {
                ++length;
            }
        }
        return length;
    }

    unsigned get_string_display_width(const string &s) {
        unsigned width = 0;
        for (size_t i = 0; i < s.size(); ++i) {
            if (is_utf8_character(s[i]) == 0) {
                width += 1;
            } else if (is_utf8_character(s[i]) == 1) {
                width += 2; //We assume a multi-byte character is twice as wide as a single-byte character.
            }
        }
        return width;
    }

}

int main() {

    const string s = "こんにちはhello"; //"hello" is "こんにちは" in Japanese.

    for (size_t i = 0; i < s.size(); ++i) {
        cout << N::is_utf8_character(s[i]) << " ";
    }
    cout << "\n\n";

    cout << "       Length: " << N::get_string_length(s) << "\n";
    cout << "Display Width: " << N::get_string_display_width(s) << "\n";

}

Output

1 2 2 1 2 2 1 2 2 1 2 2 1 2 2 0 0 0 0 0 

       Length: 10
Display Width: 15

dan04 (accepted answer) On 16 February 2011 at 06:41

If you know that the data is UTF-8, then you just have to check the high bit:

0xxxxxxx = single-byte ASCII character
1xxxxxxx = part of multi-byte character

Or, if you need to distinguish lead/trail bytes:

10xxxxxx = 2nd, 3rd, or 4th byte of multi-byte character
110xxxxx = 1st byte of 2-byte character
1110xxxx = 1st byte of 3-byte character
11110xxx = 1st byte of 4-byte character
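The bit patterns above can be turned into a lookup of the sequence length, and from there into a structural check of a buffer. This is only a sketch under the assumption that a structural check is enough: it does not reject overlong encodings, surrogates, or code points above U+10FFFF. Both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of bytes in the UTF-8 sequence started by `c`,
// or 0 if `c` is a continuation byte or cannot start a sequence.
int utf8_sequence_length(unsigned char c) {
    if ((c & 0x80) == 0x00) return 1; // 0xxxxxxx
    if ((c & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((c & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((c & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0;                         // 10xxxxxx or 11111xxx
}

// Hypothetical helper: true if every lead byte is followed by the right number
// of 10xxxxxx continuation bytes and the buffer does not end mid-sequence.
bool has_utf8_structure(const unsigned char *data, std::size_t len) {
    assert(data != nullptr || len == 0);
    std::size_t i = 0;
    while (i < len) {
        const int n = utf8_sequence_length(data[i]);
        if (n == 0 || static_cast<std::size_t>(n) > len - i) {
            return false; // stray continuation byte or truncated sequence
        }
        for (int k = 1; k < n; ++k) {
            if ((data[i + k] & 0xC0) != 0x80) {
                return false; // expected a continuation byte
            }
        }
        i += static_cast<std::size_t>(n);
    }
    return true;
}
```

With data arriving over TCP, a multibyte character can be split across two reads, so a `false` result on a single chunk may only mean that the last sequence is incomplete.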
