Encoding.UTF8.GetString includes BOM in output string


Consider this code

using var mem = new MemoryStream();
await using var writer = new StreamWriter(mem, Encoding.UTF8);

await writer.WriteLineAsync("Test");
await writer.FlushAsync();
mem.Position = 0;

Then this code throws

var x = Encoding.UTF8.GetString(mem.ToArray());
if (x[0] != 'T') throw new Exception("Bom is present in string");

It throws because the BOM is present in the string. That doesn't make sense to me, since GetString should decode the bytes into a string.

This code works as intended and does not include the BOM

using var reader = new StreamReader(mem, Encoding.UTF8);
var x = await reader.ReadToEndAsync();
if (x[0] != 'T') throw new Exception("Bom is present in string");

Does anyone know Microsoft's reasoning for this? To me it seems strange to keep a BOM in the output of a method called GetString.

There is 1 answer

Answer by Panagiotis Kanavos (best answer)

It's important to remember that the Encoding class only deals with the encoding itself, not with streams, files or packets. GetString converts the full or partial contents of a byte buffer into a Unicode string. It may be called on the entire buffer, or on just a part of it with GetString(byte[] bytes, int index, int count).
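To illustrate the point about partial buffers, here is a minimal sketch that decodes a whole buffer and then only a slice of it. It assumes C# top-level statements, and the sample text and variable names are made up for illustration.

```csharp
using System;
using System.Text;

// Encode a sample string; GetBytes never adds a BOM.
byte[] buffer = Encoding.UTF8.GetBytes("Hello, World");

// Decode the entire buffer.
string all = Encoding.UTF8.GetString(buffer);

// Decode only 5 bytes starting at offset 7. GetString has no notion of
// where a file or stream begins; it just decodes the bytes it is given.
string part = Encoding.UTF8.GetString(buffer, 7, 5);

Console.WriteLine(all);
Console.WriteLine(part);
```

Because any slice of a buffer is a valid input, GetString has no reliable way to know whether the first bytes are a file-level BOM or ordinary content.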

GetString neither generates nor handles BOM bytes. The bytes were emitted by StreamWriter because the encoding passed to it explicitly specifies a BOM. The StreamWriter.Flush() source code shows that the method explicitly writes the output of Encoding.GetPreamble() to the stream:

if (preamble.Length > 0)
    stream.Write(preamble, 0, preamble.Length);

GetBytes generates the bytes for the actual string contents only. Its inverse, GetString, doesn't handle BOMs either; those are handled by the StreamReader class or by any custom code that reads the raw bytes.
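As an example of such custom code, the following sketch skips the preamble by hand before calling GetString. It is one possible approach under stated assumptions, not the way StreamReader does it internally: it assumes C# top-level statements, and the names bytes, preamble, offset and text are illustrative.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

using var mem = new MemoryStream();
using var writer = new StreamWriter(mem, Encoding.UTF8);

writer.WriteLine("Test");
writer.Flush();

byte[] bytes = mem.ToArray();

// The preamble for Encoding.UTF8 is the three-byte UTF-8 BOM (EF BB BF).
byte[] preamble = Encoding.UTF8.GetPreamble();

// Skip the preamble only if the buffer actually starts with it.
bool hasBom = bytes.Length >= preamble.Length
              && bytes.Take(preamble.Length).SequenceEqual(preamble);
int offset = hasBom ? preamble.Length : 0;

// Decode the remaining bytes with the index/count overload.
string text = Encoding.UTF8.GetString(bytes, offset, bytes.Length - offset);

Console.WriteLine(text[0] == 'T');
```

When the data comes from a stream anyway, wrapping it in a StreamReader, as in the question, is usually simpler than checking the preamble manually.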


From the Encoding.UTF8 property remarks:

The UTF8Encoding object that is returned by this property might not have the appropriate behavior for your app.

  • It returns a UTF8Encoding object that provides a Unicode byte order mark (BOM). To instantiate a UTF8 encoding that doesn't provide a BOM, call any overload of the UTF8Encoding constructor.

StreamWriter uses UTF-8 without a BOM when no encoding is specified, both in .NET Framework and .NET Core:

This constructor creates a StreamWriter with UTF-8 encoding without a Byte-Order Mark (BOM), so its GetPreamble method returns an empty byte array.
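The two ways of avoiding the BOM on the writing side can be sketched as follows. This is a hedged illustration that assumes C# top-level statements; the variable names (mem1, utf8NoBom and so on) are invented for the example.

```csharp
using System;
using System.IO;
using System.Text;

// Option 1: pass no encoding; StreamWriter defaults to UTF-8 without a BOM.
using var mem1 = new MemoryStream();
using var writer1 = new StreamWriter(mem1);
writer1.WriteLine("Test");
writer1.Flush();

// Option 2: pass a UTF8Encoding instance configured not to emit a BOM.
var utf8NoBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);
using var mem2 = new MemoryStream();
using var writer2 = new StreamWriter(mem2, utf8NoBom);
writer2.WriteLine("Test");
writer2.Flush();

// With no preamble in the buffers, GetString returns just the text.
string s1 = Encoding.UTF8.GetString(mem1.ToArray());
string s2 = Encoding.UTF8.GetString(mem2.ToArray());

Console.WriteLine(s1[0] == 'T');
Console.WriteLine(s2[0] == 'T');
```

Either option keeps the BOM out of the byte buffer in the first place, so the check from the question no longer throws.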