Big file compression using DeflateStream


I'm trying to write a streaming zip compression tool: it gets data from a third-party service, transforms it, compresses it, appends the compressed bytes to a zip archive byte stream, and sends that to another service.
The CRC-32 is updated after every chunk.

I emulate the third-party service by reading a file in chunks.
This version runs, but after extracting I get an empty file. The archive itself is not empty: I can see the data in a hex editor. I suspect something is wrong with the CRC-32.
If I compress the whole file at once, it works fine.
Here is my question: is it possible to compress a large amount of data in chunks with DeflateStream? I need to extract this data later with regular zip tools.

public async Task<byte[]> Compress(string fileName, IAsyncEnumerable<byte[]> data)
{
    var crc32Helper = new System.IO.Hashing.Crc32();
    var lfh = ZipTools.GetLocalFileHeaderEntry(fileName);
    testResult.AddRange(lfh);
    var bytearray = new List<byte>();
    await foreach (var chunk in data)
    {
        _originalSize += (ulong) chunk.Length;
        var compressedData = Compress(chunk);
        _compressedSize += (ulong) compressedData.Length;
        crc32Helper.Append(chunk);
        testResult.AddRange(compressedData);
    }
    _originalSize += (ulong) bytearray.Count;

    _crc32 = crc32Helper.GetCurrentHashAsUInt32();
    var cd = ZipTools.GetCentralDirectoryEntry(
        fileName,
        _crc32,
        (ulong) lfh.Length + _compressedSize,
        _compressedSize,
        _originalSize);
    testResult.AddRange(cd);
    return testResult.ToArray();
}

public byte[] Compress(byte[] data)
{
    using var input = new MemoryStream(data);
    using var resultStream = new MemoryStream();
    using (DeflateStream compressionStream = new DeflateStream(resultStream, CompressionMode.Compress))
    {
        input.CopyTo(compressionStream);
    }

    return resultStream.ToArray();
}

There are 2 answers

Mark Adler

You can try to adapt the C code in zipflow to use in your C# application. zipflow takes in chunks of data and streams out a zip file, without ever having to keep either the entire input of any entry or the output zip file in memory or the file system.

I don't know what "ZipTools" is, nor can I find it. However, writing multiple deflate streams to a single entry will not work: the entry's data needs to be a single deflate stream. Also, there is no evidence of writing an end-of-central-directory record, which must follow the central directory.
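As a minimal sketch of the "single deflate stream" point, the chunks can all be written into one DeflateStream that stays open for the whole entry, instead of creating a new DeflateStream per chunk. The class and method names here (SingleDeflateSketch, CompressChunks, Decompress) are hypothetical and only illustrate the idea; this is not the zipflow approach and it does not write any zip headers.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

// Hypothetical helper, for illustration only.
public static class SingleDeflateSketch
{
    // Feeds every chunk into ONE DeflateStream, so the result is a single
    // raw deflate stream rather than several concatenated ones.
    public static byte[] CompressChunks(IEnumerable<byte[]> chunks)
    {
        using var output = new MemoryStream();
        using (var deflate = new DeflateStream(output, CompressionMode.Compress, leaveOpen: true))
        {
            foreach (var chunk in chunks)
            {
                deflate.Write(chunk, 0, chunk.Length);
            }
        } // Disposing the DeflateStream flushes it and writes the final block.

        return output.ToArray();
    }

    // Inflates a raw deflate stream back into the original bytes.
    public static byte[] Decompress(byte[] compressed)
    {
        using var input = new MemoryStream(compressed);
        using var inflate = new DeflateStream(input, CompressionMode.Decompress);
        using var result = new MemoryStream();
        inflate.CopyTo(result);
        return result.ToArray();
    }
}
```

In a real streaming tool the output would presumably go to the destination stream rather than a MemoryStream, and the CRC-32 would still be computed over the uncompressed chunks, as the question's code already does.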

Charlieface

I don't really understand why you are messing around with central directories and building the Zip manually. None of this is necessary as you can use ZipArchive to do this in one go.

Furthermore, you can't compress chunks of bytes like that and then just concatenate them. The Deflate algorithm doesn't work that way.

Your concern about flushing is misplaced: if the ZipArchive is closed then everything is flushed. You just need to make it leave the stream open once you dispose it.

I would advise you to only work with Stream, but you could use byte[] or Memory<byte> if absolutely necessary.

public async Task<Memory<byte>> Compress(string fileName, IAsyncEnumerable<Memory<byte>> data)
{
    var ms = new MemoryStream();
    using (var zip = new ZipArchive(ms, ZipArchiveMode.Create, leaveOpen: true))
    {
        var entry = zip.CreateEntry(fileName);
        await using var zipStream = entry.Open();
        await foreach (var bytes in data)
        {
            await zipStream.WriteAsync(bytes);
        }
    }
    // The archive is complete only after the ZipArchive has been disposed.
    return ms.GetBuffer().AsMemory(0, (int)ms.Length);
}
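To check that an archive built chunk by chunk through ZipArchive can be read back, a synchronous round trip along these lines may help. This is only a sketch: ZipRoundTripSketch, BuildZip and ReadEntry are hypothetical names, and it uses byte[] chunks and an in-memory stream purely to keep the example short.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

// Hypothetical helper, for illustration only.
public static class ZipRoundTripSketch
{
    // Writes all chunks into a single zip entry and returns the archive bytes.
    public static byte[] BuildZip(string entryName, IEnumerable<byte[]> chunks)
    {
        using var ms = new MemoryStream();
        using (var zip = new ZipArchive(ms, ZipArchiveMode.Create, leaveOpen: true))
        {
            var entry = zip.CreateEntry(entryName);
            using var entryStream = entry.Open();
            foreach (var chunk in chunks)
            {
                entryStream.Write(chunk, 0, chunk.Length);
            }
        } // Disposing the ZipArchive writes the central directory.

        return ms.ToArray();
    }

    // Opens the archive in Read mode and returns the entry's uncompressed bytes.
    public static byte[] ReadEntry(byte[] zipBytes, string entryName)
    {
        using var ms = new MemoryStream(zipBytes);
        using var zip = new ZipArchive(ms, ZipArchiveMode.Read);
        var entry = zip.GetEntry(entryName)
            ?? throw new FileNotFoundException($"Entry '{entryName}' not found.");
        using var entryStream = entry.Open();
        using var result = new MemoryStream();
        entryStream.CopyTo(result);
        return result.ToArray();
    }
}
```

Calling ReadEntry(BuildZip("data.txt", chunks), "data.txt") should return the concatenation of the chunks; "data.txt" is just a placeholder entry name.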

If you want to avoid even the MemoryStream and upload directly to HttpClient then you can use a custom HttpContent that "pulls" the data as and when needed.

This example is taken from the documentation.

public class ZipUploadContent : HttpContent
{
    private readonly string _fileName;
    private readonly IAsyncEnumerable<Stream> _data;

    public ZipUploadContent(string fileName, IAsyncEnumerable<Stream> data)
    {
        _fileName = fileName;
        _data = data;
    }

    protected override bool TryComputeLength(out long length)
    {
        length = 0;
        return false;
    }

    protected override Task SerializeToStreamAsync(Stream stream, TransportContext? context)
        => SerializeToStreamAsync(stream, context, CancellationToken.None);

    protected override async Task SerializeToStreamAsync(Stream stream, TransportContext? context, CancellationToken cancellationToken)
    {
        // Write the archive straight to the request stream; leave that stream open for HttpClient.
        using var zip = new ZipArchive(stream, ZipArchiveMode.Create, leaveOpen: true);
        var entry = zip.CreateEntry(_fileName);
        await using var zipStream = entry.Open();
        await foreach (var inputStream in _data.WithCancellation(cancellationToken))
        {
            await inputStream.CopyToAsync(zipStream, cancellationToken);
        }
    }

    protected override void SerializeToStream(Stream stream, TransportContext? context, CancellationToken cancellationToken)
        => Task.Run(() => SerializeToStreamAsync(stream, context, cancellationToken)).Wait();
}