How to use fribidi with std::string?

171 views Asked by At

I'm trying to write a function that will run the fribidi algorithm on a std::string and return a reordered std::string. I hope it to be safe enough for any std::string, and in case something fails in the way, it can return the original std::string.

I saw many examples online that use std::wstring, but I wonder whether I can avoid this conversion. Here's my attempt (I may have forgotten some includes).

# fribidi-test.cpp
#include <cstring>
#include <iostream>
#include <string>
#include <stdio.h>
#define FRIBIDI_NO_DEPRECATED
#include <fribidi/fribidi.h>

std::string fribidi_str_convert(std::string string_orig) {
    std::cerr << "dbg: orig: " + string_orig + "\n";
    FriBidiChar fribidi_in_char;
    FriBidiStrIndex fribidi_len = fribidi_charset_to_unicode(
        FRIBIDI_CHAR_SET_UTF8,
        string_orig.c_str(),
        string_orig.size(),
        &fribidi_in_char
    );
    fprintf(stderr, "len is %i\n", fribidi_len);
    // https://github.com/fribidi/fribidi#api
    // Let fribidi think about the main direction by it's own (https://stackoverflow.com/q/58166995/4935114)
    FriBidiCharType fribidi_pbase_dir = FRIBIDI_TYPE_LTR;
    // Prepare output variable
    FriBidiChar     fribidi_visual_char;
    fribidi_boolean stat = fribidi_log2vis(
        /* input */
        &fribidi_in_char,
        fribidi_len,
        &fribidi_pbase_dir,
        /* output */
        &fribidi_visual_char,
        NULL,
        NULL,
        NULL
    );
    fprintf(stderr, "stat is: %d\n", stat);
    if (stat) {
        char string_formatted_ptr;
        // Convert from fribidi unicode back to ptr
        FriBidiStrIndex new_len = fribidi_unicode_to_charset(
            FRIBIDI_CHAR_SET_UTF8,
            &fribidi_visual_char,
            fribidi_len,
            &string_formatted_ptr
        );
        fprintf(stderr, "new_len is: %d\n", new_len);
        if (new_len) {
            fprintf(stderr, "string_formatted_ptr is: %s\n", &string_formatted_ptr);
            std::string string_formatted_out(&string_formatted_ptr, new_len);
            return string_formatted_out;
        };
    };
    return string_orig;
};

int main() {
    std::string orig = "אריק איינשטיין";
    std::cerr << "main: orig: " + orig + "\n";
    std::cerr << "main: transformed: " + fribidi_str_convert(orig) + "\n";
};

I compile and run it with:

g++ $(pkg-config --libs fribidi) fribidi-test.cpp -o fribidi-test && ./fribidi-test

My problem is that I'm getting a malformed output:

main: orig: ןייטשנייא קירא
dbg: orig: ןייטשנייא קירא
len is 14
stat is: 2
new_len is: 27
string_formatted_ptr is: אĐןייטשנייא קי
main: transformed: אĐןייטשנייא ק

That Đ character is not supposed to be there. What I want to get is:

main: orig: ןייטשנייא קירא
dbg: orig: ןייטשנייא קירא
len is 14
stat is: 2
new_len is: 27
string_formatted_ptr is: אריק איינשטיין
main: transformed: אריק איינשטיין

Is this related to UTF16 encoding? and the fact that the new length is 27 - almost twice as that of the original length?

2

There are 2 answers

0
KamilCuk On BEST ANSWER

This is very wrong. You can't expect to store a string into one character. Char is a char. It is not a pointer. Not a string. Remember to compile your programs with -fsanitize=undefined and also check with valgrind.

FriBidiChar fribidi_in_char;
FriBidiStrIndex fribidi_len = fribidi_charset_to_unicode(.... 
    &fribidi_in_char
);
    char string_formatted_ptr;
    FriBidiStrIndex new_len = fribidi_unicode_to_charset(...
        &string_formatted_ptr
    );

Also, }; - just use }. There is no (need for) ; after } (in these cases).

It's cstdio in C++.

Prefer to << string << string instead of << string + string to (I think) reduce memory allocations.

Fribidi API is bad, because I do not see how to calculate memory needed for the charset_to_unicode. Even the fribidy program - https://github.com/fribidi/fribidi/blob/cffa3047a0db9f4cd391d68bf98ce7b7425be245/bin/fribidi-main.c#L64 - just uses a constant amount of super big value. Also, fribidi program is the example that does not use std::wstring, because it is in C.

The following program uses a constant big buffer size like fribidi program:

#include <cassert>
#include <cstdio>
#include <cstring>
#include <iostream>
#include <string>
#include <string_view>
#include <vector>
#include <iomanip>
#define FRIBIDI_NO_DEPRECATED
#include <fribidi/fribidi.h>

#define MAX_STR_LEN 65000

std::string fribidi_str_convert(const std::string& string_orig) {
        std::cerr << "dbg: orig: " + string_orig + "\n";
        std::vector<FriBidiChar> fribidi_in_char(MAX_STR_LEN);
        const FriBidiStrIndex fribidi_len = fribidi_charset_to_unicode(FRIBIDI_CHAR_SET_UTF8, string_orig.c_str(),
                                                                       string_orig.size(), fribidi_in_char.data());
        assert(fribidi_len < MAX_STR_LEN);
        fribidi_in_char.resize(fribidi_len);
        fprintf(stderr, "len is %i\n", fribidi_len);
        //
        FriBidiCharType fribidi_pbase_dir = FRIBIDI_TYPE_LTR;
        std::vector<FriBidiChar> fribidi_visual_char(fribidi_len + 1);
        const fribidi_boolean stat = fribidi_log2vis(fribidi_in_char.data(), fribidi_len, &fribidi_pbase_dir,
                                               fribidi_visual_char.data(), NULL, NULL, NULL);
        fprintf(stderr, "stat is: %d\n", stat);
        //
        if (stat) {
                //
                std::string string_formatted_ptr(MAX_STR_LEN, 0);
                const FriBidiStrIndex new_len = fribidi_unicode_to_charset(FRIBIDI_CHAR_SET_UTF8, fribidi_visual_char.data(),
                                                                           fribidi_len, string_formatted_ptr.data());
                assert(new_len < MAX_STR_LEN);
                string_formatted_ptr.resize(new_len);
                fprintf(stderr, "new_len is: %d\n", new_len);
                //
                return string_formatted_ptr;
        }
        return string_orig;
}

int main() {
        const std::string orig = "אריק איינשטיין";
        std::cerr << "main: orig: " << orig << "\n";
        const auto ret = fribidi_str_convert(orig);
        std::cerr << "main: transformed: " << std::setw(10 + orig.size()) << ret << "\n";
}

and outputs:

$ g++ -lfribidi 1.cpp && ./a.out 
main: orig: אריק איינשטיין
dbg: orig: אריק איינשטיין
len is 14
stat is: 2
new_len is: 27
main: transformed:           ןייטשנייא קירא

Knowing that FriBidiChar is uint32_t and fribidi internally uses UTF-32 and that wchar_t on Linux is UTF-32, it would be preferable to use std::wstring (or wchar_t) to know how much memory to allocate. You could also count codepoints in UTF-8 input string and then precalculate the length of UTF-8 represetation of fribidi_visual_char to allocate memory for fribidi_unicode_to_charset.

0
Mooing Duck On
FriBidiChar fribidi_in_char;

FriBidiStrIndex fribidi_len = fribidi_charset_to_unicode(
    FRIBIDI_CHAR_SET_UTF8,
    string_orig.c_str(),
    string_orig.size(),
    &fribidi_in_char
);

The last parameter has to be a pointer to a buffer large enough to write the results. Due to this violation, the fribidi_charset_to_unicode will write its output to whatever bytes happen to be in memory after this character, which is undefined behavior. Sometimes the app might crash, sometimes it'll produce the right results, sometimes it'll produce the wrong results, and sometimes it'll erase all the files in the user's home directory. The solution to this is to simply pass nullptr as the output, in which case it calculates the length, but does not write.


Similarly:
FriBidiChar     fribidi_visual_char;

fribidi_boolean stat = fribidi_log2vis(
    /* input */
    &fribidi_in_char,
    fribidi_len,
    &fribidi_pbase_dir,
    /* output */
    &fribidi_visual_char, //here
    NULL,
    NULL,
    NULL
);

The fourth to last parameter, visual_str is supposed to be a pointer to a buffer of length fribidi_len, but you're just passing in a pointer to a single character. Due to this violation, the fribidi_log2vis will write its output to whatever bytes happen to be in memory after this character, which is undefined behavior. Sometimes the app might crash, sometimes it'll produce the right results, sometimes it'll produce the wrong results, and sometimes it'll erase all the files in the user's home directory.

You need to allocate enough memory for it to do it's job:

std::basic_string<FriBidiChar> fribidi_visual_chars(fribidi_len+1, '\0');

fribidi_boolean stat = fribidi_log2vis(
    /* input */
    &fribidi_in_char,
    fribidi_len,
    &fribidi_pbase_dir,
    /* output */
    fribidi_visual_chars.data(), //here
    NULL,
    NULL,
    NULL
);

And again with string_formatted_ptr and fribidi_unicode_to_charset.


At a casual glance over their API, I think what you want is this:
std::string fribidi_str_convert(const std::string& string_orig)
  FriBidiStrIndex fribidi_len = fribidi_charset_to_unicode(
    FRIBIDI_CHAR_SET_UTF8,
    string_orig.c_str(),
    string_orig.size(),
    nullptr);
  std::basic_string<FriBidiChar> fribidi_visual_chars(fribidi_len+1, '\0');
  fribidi_charset_to_unicode(
    FRIBIDI_CHAR_SET_UTF8,
    string_orig.c_str(),
    string_orig.size(),
    fribidi_visual_chars.data());

  // I'm uncertain how to calculate the length, 
  // so I assumed its the same as the input :(
  std::basic_string<FriBidiChar> fribidi_visual_char(fribidi_len+1, '\0');
  fribidi_boolean stat = fribidi_log2vis(
    /* input */
    &fribidi_in_char,
    fribidi_len,
    &fribidi_pbase_dir,
    /* output */
    fribidi_visual_char.data(),
    NULL,
    NULL,
    NULL);
  if (!stat) return string_orig;

  FriBidiStrIndex new_len = fribidi_unicode_to_charset(
        FRIBIDI_CHAR_SET_UTF8,
        &fribidi_visual_char,
        fribidi_len,
        nullptr);
  std::string string_formatted_out(new_len+1, '\0');
  fribidi_unicode_to_charset(
        FRIBIDI_CHAR_SET_UTF8,
        &fribidi_visual_char,
        fribidi_len,
        string_formatted_out.data());
  return string_formatted_out;
}