I receive a byte stream buffer from a TCP server that may contain multibyte sequences encoding Unicode characters. Can I always rely on checking for a BOM to detect those characters, and if not, how would you do it?
Test if char* string contains multibyte characters
5.8k views Asked by cpx At
There are 5 answers
There are lots of ways to detect multibyte characters, and unfortunately... none of them are reliable.
If this is a response to a web request, check the headers: the Content-Type header often indicates the page encoding (which can be indicative of multibyte character presence).
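As a rough illustration of the header check, here is a minimal sketch that pulls the charset parameter out of a Content-Type value. The function name `charset_from_content_type` is hypothetical, and the parsing is deliberately simplified (it just looks for the substring `charset=`), so treat it as a starting point rather than a full header parser.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the lower-cased charset named in a Content-Type
// value, or an empty string if no charset parameter is present.
// Simplified: searches for the substring "charset=" without full header parsing.
std::string charset_from_content_type(std::string value) {
    std::transform(value.begin(), value.end(), value.begin(),
                   [](unsigned char ch) { return static_cast<char>(std::tolower(ch)); });
    const std::string key = "charset=";
    std::size_t pos = value.find(key);
    if (pos == std::string::npos) return "";
    pos += key.size();
    if (pos < value.size() && value[pos] == '"') ++pos;  // skip an optional opening quote
    const std::size_t end = value.find_first_of("\"; \t", pos);
    return value.substr(pos, end == std::string::npos ? std::string::npos : end - pos);
}
```

A result such as "utf-8" or "shift_jis" tells you multibyte characters are possible; an empty result tells you nothing either way.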
You can also check for BOMs. They should not appear at the start of normal text anyway, so it can't hurt to see if they're there. However, they are optional and often will not be present (depending on implementation, configuration, etc.).
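The BOM check could look roughly like the sketch below. The helper name `detect_bom` is an assumption for illustration; the byte sequences are the standard BOMs for UTF-8, UTF-16 and UTF-32. The UTF-32LE pattern is tested before UTF-16LE because it starts with the same two bytes.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: returns the encoding implied by a leading BOM,
// or an empty string when the buffer does not start with one.
std::string detect_bom(const char* data, std::size_t len) {
    struct Bom { const char* bytes; std::size_t size; const char* name; };
    static const Bom boms[] = {
        {"\xEF\xBB\xBF",     3, "UTF-8"},
        {"\xFF\xFE\x00\x00", 4, "UTF-32LE"},  // must precede UTF-16LE (shared prefix FF FE)
        {"\x00\x00\xFE\xFF", 4, "UTF-32BE"},
        {"\xFF\xFE",         2, "UTF-16LE"},
        {"\xFE\xFF",         2, "UTF-16BE"},
    };
    for (const Bom& b : boms) {
        if (len >= b.size && std::memcmp(data, b.bytes, b.size) == 0) {
            return b.name;
        }
    }
    return "";  // no BOM: the encoding is still unknown
}
```

An empty result does not mean the data is single-byte, only that no BOM was sent.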
Let me implement dan04's answer.
Hereafter I use C++14. If you can only use an older version of C++, you have to rewrite binary literals (e.g. 0b10) to integer literals (e.g. 2).
Implementation
int is_utf8_character(unsigned char c) { //casts to `unsigned char` to force logical shifts
    if ((c >> 7) == 0b1) {
        if ((c >> 6) == 0b10) {
            return 2; //2nd, 3rd or 4th byte of a utf-8 character
        } else {
            return 1; //1st byte of a utf-8 character
        }
    } else {
        return 0; //a single byte character (not a utf-8 character)
    }
}
Example
Code
#include <cstddef>
#include <iostream>
#include <string>
using namespace std;

namespace N {

int is_utf8_character(unsigned char c) { //casts to `unsigned char` to force logical shifts
    if ((c >> 7) == 0b1) {
        if ((c >> 6) == 0b10) {
            return 2; //2nd, 3rd or 4th byte of a utf-8 character
        } else {
            return 1; //1st byte of a utf-8 character
        }
    } else {
        return 0; //a single-byte (ASCII) character, not part of a multi-byte sequence
    }
}

//Counts characters by skipping continuation bytes.
unsigned get_string_length(const string &s) {
    unsigned width = 0;
    for (size_t i = 0; i < s.size(); ++i) {
        if (is_utf8_character(s[i]) != 2) {
            ++width;
        }
    }
    return width;
}

unsigned get_string_display_width(const string &s) {
    unsigned width = 0;
    for (size_t i = 0; i < s.size(); ++i) {
        if (is_utf8_character(s[i]) == 0) {
            width += 1;
        } else if (is_utf8_character(s[i]) == 1) {
            width += 2; //We assume a multi-byte character is twice as wide as a single-byte character.
        }
    }
    return width;
}

}

int main() {
    const string s = "こんにちはhello"; //"hello" is "こんにちは" in Japanese.
    for (size_t i = 0; i < s.size(); ++i) {
        cout << N::is_utf8_character(s[i]) << " ";
    }
    cout << "\n\n";
    cout << " Length: " << N::get_string_length(s) << "\n";
    cout << "Display Width: " << N::get_string_display_width(s) << "\n";
}
Output
1 2 2 1 2 2 1 2 2 1 2 2 1 2 2 0 0 0 0 0
Length: 10
Display Width: 15
If you know that the data is UTF-8, then you just have to check the high bit:
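A minimal sketch of that check, assuming the buffer really is UTF-8 (the function name `contains_multibyte` is made up for illustration): a byte of the form 0xxxxxxx is a single-byte ASCII character, while any byte of the form 1xxxxxxx is part of a multi-byte character.

```cpp
#include <cstddef>

// Hypothetical helper: true if any byte of a UTF-8 buffer has its high bit set,
// i.e. the buffer contains at least one multi-byte character.
bool contains_multibyte(const char* data, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i) {
        // Cast first so the test does not depend on whether plain char is signed.
        if (static_cast<unsigned char>(data[i]) & 0x80) {
            return true;
        }
    }
    return false;
}
```

Passing the length explicitly, instead of relying on a terminating NUL, suits data read from a socket, where the buffer is not necessarily NUL-terminated.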
Or, if you need to distinguish lead/trail bytes:
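The bit patterns are fixed by the UTF-8 encoding itself: 10xxxxxx is a trail (2nd, 3rd or 4th) byte, 110xxxxx starts a 2-byte character, 1110xxxx a 3-byte character, and 11110xxx a 4-byte character. The sketch below classifies a single byte accordingly; the enum and function names are assumptions for illustration.

```cpp
// Hypothetical classification of a single byte from a UTF-8 stream.
enum class Utf8Byte { Ascii, Trail, Lead2, Lead3, Lead4, Invalid };

Utf8Byte classify_utf8_byte(unsigned char c) {
    if ((c & 0x80) == 0x00) return Utf8Byte::Ascii;  // 0xxxxxxx: single-byte character
    if ((c & 0xC0) == 0x80) return Utf8Byte::Trail;  // 10xxxxxx: 2nd, 3rd or 4th byte
    if ((c & 0xE0) == 0xC0) return Utf8Byte::Lead2;  // 110xxxxx: 1st byte of 2-byte character
    if ((c & 0xF0) == 0xE0) return Utf8Byte::Lead3;  // 1110xxxx: 1st byte of 3-byte character
    if ((c & 0xF8) == 0xF0) return Utf8Byte::Lead4;  // 11110xxx: 1st byte of 4-byte character
    return Utf8Byte::Invalid;                        // 11111xxx: never occurs in UTF-8
}
```

This looks only at the bit pattern of one byte. It does not validate whole sequences, so bytes such as 0xC0 or 0xC1, which match a lead pattern but cannot occur in well-formed UTF-8, are still reported as lead bytes.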