Accessing individual characters (wchar_t) in a wstring

I am reading in text from a file that contains unicode characters and I store the text into a wstring. I am interested in iterating over the wstring to determine which characters need more than one byte for storage.

My issue is that str.length() (where str is a wstring) seems to be indicating the number of bytes in the string instead of the number of characters. Also, as I iterate over the string using str[i], the bracket operator seems to return only 1 byte.

Here is some example code to replicate my issue:

wifstream inFile;
inFile.open(L"myFile.txt");
    
wstring str;
getline(inFile, str);

wcout << str.length() << endl;
for (unsigned int i = 0; i < str.length(); i++) {
  wcout << str[i] << L" (" << (unsigned int)str[i] << L')' << endl;
}

wofstream outFile;
outFile.open(L"outFile.txt");
outFile << str << endl;

outFile.close();
inFile.close();

Output of code:

5
H (72)
├ (195)
í (161)
l (108)
o (111)

I tried with a file that contains the string "Hálo". str.length() reports 5, which appears to be the minimum number of bytes needed to store the string (assuming one byte for every character except the á). This confuses me because sizeof(wchar_t) is 2 in my environment, so I figured an array of 4 characters within the wstring would require at least 8 bytes. Yet it seems "Hálo" is being stored as 01001000 {11000011 10100001} 01101100 01101111 (curly brackets mark the unicode character). So as I iterate over this, every element comes back as if it were just a char, and the unicode character á comes back as 2 characters: ├í.

Strangely enough, when I write the wstring to a file (seen in the code above), the text comes out as expected with the unicode character properly interpreted.

Is there a way to iterate over the actual characters within the wstring instead of just the bytes? Also, why is the wstring storing it in just 5 bytes instead of 8? I suppose it saves space but it makes accessing the elements seem unintuitive.

EDIT: I understand that my terminal may not be able to display a wchar_t properly, though I would still hope to print the integer value of it.

1 answer

Remy Lebeau On

All that you said about std::wstring is incorrect. It does not store bytes, and its length() is not expressed in bytes (those are true for std::string instead).

std::wstring holds wchar_t elements, and its length() is the number of wchar_t elements in the string, not the number of bytes. On Windows wchar_t is 2 bytes (used for UCS-2/UTF-16), whereas on most other platforms wchar_t is 4 bytes (used for UTF-32).
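As a small illustration of that distinction, here is a sketch that separates the element count from the storage size. The helper names storage_bytes and sample are made up for this example and are not part of any library.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: bytes occupied by the string's elements
// (ignores the terminator and any allocator overhead).
std::size_t storage_bytes(const std::wstring& s) {
    return s.length() * sizeof(wchar_t);
}

// "Hálo" spelled with a universal character name, so the encoding of the
// source file itself does not affect the result.
std::wstring sample() {
    return L"H\u00E1lo";
}
```

With a correctly decoded string, sample().length() is 4 on every platform, while storage_bytes(sample()) is 8 where wchar_t is 2 bytes and 16 where it is 4 bytes.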

To read a file into a std::wstring using std::wifstream, you need to imbue() the correct std::locale into the std::wifstream to handle the file's encoding (ANSI, UTF-8, etc.) so that its bytes can be decoded into wchar_t characters.

In your case, your file is encoded in UTF-8, as the UTF-8 encoded form of Hálo is the byte sequence:

H - 0x48
á - 0xC3 0xA1
l - 0x6C
o - 0x6F
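To see where those bytes come from, here is a minimal sketch of the UTF-8 encoding rule. The function name encode_utf8 is an assumption for illustration, and the sketch does not validate its input (surrogates and values above U+10FFFF are not rejected).

```cpp
#include <string>

// Illustrative UTF-8 encoder for a single code point (no validation).
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx (plain ASCII, e.g. H, l, o)
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx (e.g. U+00E1 becomes 0xC3 0xA1)
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Encoding H, á (U+00E1), l and o this way gives 1 + 2 + 1 + 1 = 5 bytes, which matches the length of 5 reported in the question.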

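Since your std::wifstream doesn't know the data is UTF-8, it is simply widening each byte as-is into its own wchar_t, which is why you get 5 elements instead of 4. You need to imbue() a UTF-8 locale into the stream before reading this file, so that the bytes 0xC3 0xA1 are decoded into the single character á (U+00E1) rather than two separate characters, which your console displays as ├í.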
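One possible way to do that, as a sketch rather than the only approach, is to imbue a locale carrying the std::codecvt_utf8 facet. Note that this facet is deprecated since C++17, and with a 2-byte wchar_t it only covers UCS-2 (std::codecvt_utf8_utf16 would be needed for characters outside the BMP). The function names read_utf8_line and needs_multibyte_utf8 are made up for this example.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper: reads the first line of a UTF-8 file into a wstring.
std::wstring read_utf8_line(const std::string& path) {
    std::wifstream in;
    // Imbue before opening/reading; the locale takes ownership of the facet.
    in.imbue(std::locale(std::locale::classic(),
                         new std::codecvt_utf8<wchar_t>));
    in.open(path);

    std::wstring line;
    std::getline(in, line);  // stays empty if the file could not be opened
    return line;
}

// Hypothetical helper: true if the decoded character lies outside ASCII,
// i.e. it needs more than one byte when encoded as UTF-8.
bool needs_multibyte_utf8(wchar_t c) {
    return static_cast<unsigned long>(c) > 0x7F;
}
```

With the file from the question, the returned string should have length() 4, and iterating over it with str[i] should yield 72, 225, 108, 111, with only the second element flagged by needs_multibyte_utf8.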