I copied a large number of gzip files from Google Cloud Storage to AWS's S3 using s3DistCp (as this AWS article describes). When I try to compare the files' checksums, they differ (md5, sha-1, and sha-256 all have the same issue).
If I compare the sizes (bytes) or the decompressed contents of a few files (diff or another checksum), they match. (In this case, I'm comparing files pulled directly down from Google via gsutil vs pulling down my distcp'd files from S3).
Using file, I do see a difference between the two:
file1-gs-direct.gz: gzip compressed data, original size modulo 2^32 91571
file1-via-s3.gz: gzip compressed data, from FAT filesystem (MS-DOS, OS/2, NT), original size modulo 2^32 91571
My Goal/Question:
My goal is to verify that my downloaded files match the original files' checksums, but I don't want to have to re-download or analyze the files directly on Google. Is there something I can do on my s3-stored files to reproduce the original checksum?
Things I've tried:
Re-gzipping with different compressions: While I wouldn't expect s3DistCp to change the original file's compression, here's my attempt at recompressing:
target_sha=$(shasum -a 1 file1-gs-direct.gz | awk '{print $1}')
for i in {1..9}; do
cur_sha=$(cat file1-via-s3.gz | gunzip | gzip -n -$i | shasum -a 1 | awk '{print $1}')
echo "$i. $target_sha == $cur_sha ? $([[ $target_sha == $cur_sha ]] && echo 'Yes' || echo 'No')"
done
1. abcd...1234 == dcba...4321 ? No
2. ... ? No
...
9. ... ? No
While typing out my question, I figured out the answer:
S3DistCp is apparently switching the "OS" version in the gzip header, which explains the "FAT filesystem" label I'm seeing with file. (Note: to rule out S3 itself causing the issue, I copied my "file1-gs-direct.gz" up to S3, and after pulling it back down, the checksum remains the same.)
A byte-level diff of the two files shows that the difference is in the gzip header.
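It turns out the 10th byte in a gzip file "identifies the type of file system on which compression took place" (Gzip RFC, RFC 1952). In that RFC, a value of 0 means FAT filesystem (MS-DOS, OS/2, NT/Win32) and 255 means unknown.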
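To see this for yourself, something like the following should work. It is only a sketch using standard `od` and `cmp`; the function names `gzip_header` and `diff_bytes` are hypothetical helpers made up for illustration, not part of any tool mentioned above.

```shell
# Hypothetical helper: dump the fixed 10-byte gzip header as hex.
# Layout per RFC 1952: ID1 ID2 CM FLG MTIME(4 bytes) XFL OS
gzip_header() {
  od -An -tx1 -N10 "$1"
}

# Hypothetical helper: list every byte position where two files differ.
# cmp -l prints: position (decimal, 1-based), then each file's byte in octal.
# cmp exits non-zero when the files differ, so the status is ignored here.
diff_bytes() {
  cmp -l "$1" "$2" || true
}

# Example usage (file names taken from the question):
#   gzip_header file1-gs-direct.gz
#   gzip_header file1-via-s3.gz
#   diff_bytes file1-gs-direct.gz file1-via-s3.gz
```

If the OS byte is the only difference, `diff_bytes` should print a single line whose first column is 10.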
Using hexedit, I'm able to change my "via-s3" file's OS from 00 to FF, and then the checksums match.
Caveat: Editing this on a file that is later decompressed may cause unexpected issues, so use with caution. (In my case, I'm only doing a file checksum, so in the worst case a file shows as mismatching even though the uncompressed contents are the same.)
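For many files, a hex editor gets tedious. Below is a rough sketch of doing the same edit from the shell on a copy of the file, so the S3 version is left untouched. The helper names `gzip_os_byte` and `gzip_set_os_byte` are hypothetical, and it assumes the original files all carry FF in the OS byte, which I only verified for the files I checked.

```shell
# Hypothetical helper: print the OS byte (offset 9, zero-based) as two hex digits.
gzip_os_byte() {
  od -An -tx1 -j9 -N1 "$1" | tr -d ' \n'
}

# Hypothetical helper: copy a gzip file and overwrite the OS byte in the copy.
# Usage: gzip_set_os_byte in.gz out.gz ff
gzip_set_os_byte() {
  cp "$1" "$2" || return 1
  # Convert the hex value to an octal escape, which printf handles portably.
  printf "\\$(printf '%03o' "0x$3")" |
    dd of="$2" bs=1 seek=9 count=1 conv=notrunc 2>/dev/null
}

# Example usage (file names taken from the question):
#   gzip_set_os_byte file1-via-s3.gz file1-patched.gz ff
#   shasum -a 1 file1-patched.gz file1-gs-direct.gz
```

Only one header byte changes, so the file size and the compressed payload stay the same.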