libstdc++ deprecation message for u8path suggests a strict aliasing violation as a workaround?

C++20 deprecates std::filesystem::u8path:

#include <filesystem>

std::string foo();

int main()
{
    auto path = std::filesystem::u8path(foo());
}

libstdc++ 13 has a deprecation warning in place:

<source>:7:40: warning: 'std::filesystem::__cxx11::path std::filesystem::__cxx11::u8path(const _Source&) [with _Source = std::__cxx11::basic_string<char>; _Require = path; _CharT = char]' is deprecated: use 'path((const char8_t*)&*source)' instead [-Wdeprecated-declarations]
    7 |     auto path = std::filesystem::u8path(foo());
      |                 ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~

The proposed cast path((const char8_t*)&*source) looks like an outright strict aliasing violation to me, and hence UB.

Is that correct? Is GCC making any additional guarantees that make this legal?

And lastly, is there a better workaround if my path is stored in std::string and I don't want to rewrite everything to std::u8string?

There are 2 answers

Jan Schultke

In short, there is undefined behavior in your example. However, the actual cause is not a strict aliasing violation, but a precondition violation because of a hypothetical strict aliasing violation.

No undefined behavior due to strict aliasing

There is no strict aliasing violation ([basic.lval] p11) because any access of the characters would happen within the constructor of std::filesystem::path or other parts of the filesystem library, and those could be permitted to type-pun in ways that the user can't.

(const char8_t*)&* is essentially a reinterpret_cast<const char8_t*> of your data. reinterpret_cast on its own is valid, even if accessing objects through the pointer wouldn't be. With the resulting pointer, you would call the following constructor:

template<class Source>
path(const Source& source, format fmt = auto_format);

Effects: Let s be the effective range of source or the range [first, last), with the encoding converted if required. Finds the detected-format of s and constructs an object of class path for which the pathname in that format is s.

- [fs.class.path] std::filesystem::path constructor 3

The format detection, argument format conversions, and type and encoding conversions for the path are all defined mathematically or through prose. For example, the encoding conversion is defined in [fs.path.type.cvt] p3:

For member function arguments that take character sequences representing paths and for member functions returning strings, value type and encoding conversion is performed if the value type of the argument or return value differs from path​::​value_type. For the argument or return value, the method of conversion and the encoding to be converted to is determined by its value type:

  • [...]
  • char8_t: The encoding is UTF-8. The method of conversion is unspecified.

The implementation has a lot of freedom when it comes to implementing this. The std::filesystem::path constructor could have relaxed aliasing rules for instance.

Undefined behavior due to precondition violation

The issue lies in the use of value type:

An input iterator i supports the expression *i, resulting in a value of some object type T, called the value type of the iterator.

Your iterator would be of type const char8_t*, and indirection (*i) would not be valid for it because it would hypothetically violate strict aliasing. Therefore, what you're passing to the path constructor has no value type, and the behavior is undefined because of a precondition violation.

GCC strict aliasing relaxations between character types

I was unable to find details about this in the GCC documentation, but char8_t appears to be able to alias char:

auto alias(char c) {
    return *reinterpret_cast<char8_t*>(&c); // OK, no -Wstrict-aliasing
}

See Compiler Explorer.

Presumably, you are thus relying on compiler extensions.

Tom Honermann

Disclaimer: I'm the author of the P0482 (char8_t: A type for UTF-8 characters and strings) proposal that deprecated std::filesystem::u8path() in C++20.

The recommendation offered by the libstdc++ warning is fine. Normally such casts are problematic, but char, unsigned char, and std::byte are granted special powers to alias other types. This is courtesy of [basic.lval]p11 that states:

If a program attempts to access the stored value of an object through a glvalue whose type is not similar to one of the following types the behavior is undefined:

  • the dynamic type of the object,
  • a type that is the signed or unsigned type corresponding to the dynamic type of the object, or
  • a char, unsigned char, or std​::​byte type.

...

However, aliasing casts are to be avoided whenever possible and I recommend a different solution. std::filesystem::path has a constructor template that accepts a classic range (an iterator pair) and deduces encoding based on the value type of the range. For UTF-8 input that is held in char-based storage, all that is needed is a range adapter to convert the char-based values to char8_t. This can be performed with no run-time overhead:

std::string utf8_encoded_filename = ...;
auto char8_view =
    std::ranges::views::transform(utf8_encoded_filename,
                                  [](char c) { return (char8_t)c; });
std::filesystem::path p(char8_view.begin(), char8_view.end());

Such a view adapter is sufficiently useful that a family of charN_t view adapters is being considered for standardization in P2728 (Unicode in the Library, Part 1: UTF Transcoding) (revision 6 at the time of this writing).

If desired, such a view adapter may be employed to provide a u8path() replacement that, again, poses no run-time overhead.

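#include <concepts>
#include <filesystem>
#include <ranges>
#include <string>
#include <utility>

// View adapter: presents a range of char as a range of char8_t.
template<std::ranges::viewable_range R>
requires std::same_as<std::ranges::range_value_t<R>, char>
constexpr auto as_char8_t(R &&r) {
  return std::ranges::views::transform(std::forward<R>(r),
                                       [](char c) { return (char8_t)c; });
}
// Replacement for the deprecated std::filesystem::u8path().
std::filesystem::path u8path(const std::string &s) {
  auto char8_view = as_char8_t(s);
  return std::filesystem::path(char8_view.begin(), char8_view.end());
}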

If you've read to this point, you are probably now wondering why u8path was deprecated if there are demonstrable use cases for it. Unfortunately, the original proposal lacks much defense of the deprecation. The motivation was that, prior to deprecation, there were only two interfaces in the standard library that required char-based input that held UTF-8 data; all other interfaces expect char-based input to hold text data encoded in the ordinary literal encoding used for character and string literals or in the execution encoding. Those two interfaces were:

The ordinary literal encoding and the execution encoding are not UTF-8 in practice for some popular (and some not-so-popular) implementations, and this is expected to remain true for the foreseeable future. It is challenging to properly use and maintain text stored in multiple encodings without support from the type system, and mistakes lead to quality and security issues. It was seen as important that the standard not encourage further use of char with UTF-8 encoded text, at least not for portable code, and there was concern about potentially having to add u8-prefixed versions of many interfaces in the future. The inability to use the type system to overload based on encoding would have been an impediment to the development of generic code libraries.

Note: For backward compatibility with existing code that passes a u8"" string literal to std::filesystem::u8path(), though deprecated, std::filesystem::u8path() was modified via P1423 (char8_t backward compatibility remediation) to accept a range of char8_t.