I am trying to compare MongoDB (latest build from the git repo) compression rates for snappy, zstd, etc. Here is the relevant snippet from my /etc/mongod.conf:
storage:
  dbPath: /var/lib/mongodb
  journal:
    enabled: true
  wiredTiger:
    engineConfig:
      journalCompressor: snappy
    collectionConfig:
      blockCompressor: snappy
    indexConfig:
      prefixCompression: true
My test case inserts entries into a collection. Each document has an _id and 1MB of binary data, randomly generated using faker. I insert 5GB or 7GB of data, but the storage size does not appear to be compressed. The AWS instance hosting the mongod has 15GB of memory and 100GB of disk space. Here is example dbStats output for two runs (a simplified sketch of the insert loop follows the numbers):
5GB data:
{'Data Size': 5243170000.0,
'Index Size': 495616.0,
'Storage size': 5265686528.0,
'Total Size': 5266182144.0}
7GB data:
{'Data Size': 7340438000.0,
'Index Size': 692224.0,
'Storage size': 7294259200.0,
'Total Size': 7294951424.0}
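For reference, the insert loop boils down to roughly the following (a simplified sketch; the database and collection names are illustrative):

from faker import Faker
from pymongo import MongoClient

fake = Faker()
coll = MongoClient()["compressiontest"]["blobs"]

# roughly 5GB: 5 * 1024 documents of 1MB random binary each
for _ in range(5 * 1024):
    # _id is auto-generated; "data" is 1MB of random bytes from faker
    coll.insert_one({"data": fake.binary(length=1024 * 1024)})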
Is there something wrong with my config? Or does compression not kick in until the data size is substantially larger than the memory size, or the available storage size? What am I missing here?
Thanks a ton for your help.
Compression algorithms work by identifying repeating patterns in the data, and replacing those patterns with identifiers that are significantly smaller.
Unless the random number generator is a very bad one, random data doesn't have any patterns and so doesn't compress well.
A quick demonstration:
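Something along these lines, using Python's gzip module at its best/slowest setting (the exact sizes will vary from run to run):

import gzip
import os

# 10MB of random bytes, similar to what faker's binary provider produces
original = os.urandom(10 * 1024 * 1024)

# gzip at maximum compression (level 9)
compressed = gzip.compress(original, compresslevel=9)

print(f"original:   {len(original):,} bytes")
print(f"compressed: {len(compressed):,} bytes")
# the "compressed" output comes out slightly larger than the input,
# because random data has no repeating patterns for the compressor to exploit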
This shows that even gzip with its best/slowest compression setting generates a "compressed" file that is actually bigger than the original.
You might try using a random object generator to create the documents, like this one: https://stackoverflow.com/a/2443944/2282634
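If you want to keep using faker, generating text instead of raw binary gives the compressor repeating patterns to work with. A rough sketch (assumes pymongo and Faker; the database and collection names are made up, and dataSize vs storageSize should now differ noticeably):

from faker import Faker
from pymongo import MongoClient

fake = Faker()
db = MongoClient()["compressiontest"]
coll = db["docs"]

# lorem-ipsum text is full of repeated words, so snappy/zstd can compress it
coll.insert_many(
    [{"payload": fake.text(max_nb_chars=100_000)} for _ in range(500)]
)

# compare the uncompressed data size with the compressed on-disk size
stats = db.command("collstats", "docs")
print(stats["size"], stats["storageSize"])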