Spark bzip2 compression ratio is poor


I'm seeking your help with an issue I've been having over the last couple of days with bzip2 compression. We need to compress our output text files into bzip2 format.

The problem is that we only go from 5 GB uncompressed to 3.2 GB compressed with bzip2, a ratio of roughly 1.6:1. Seeing other projects compress their 5 GB files down to only 400 MB (about 12.5:1) makes me wonder if I'm doing something wrong.

Here is my code:

import org.apache.spark.sql.SaveMode

iDf
  .repartition(iNbPartition)        // control the number (and size) of output files
  .write
  .option("compression", "bzip2")   // select the bzip2 codec for the text output
  .mode(SaveMode.Overwrite)
  .text(iOutputPath)

I am also importing this codec:

import org.apache.hadoop.io.compress.BZip2Codec

Besides that, I am not setting any configs in my spark-submit, because I've tried many with no luck.
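
For reference, here is a minimal sketch of selecting the codec at the session level instead of on the writer. The spark.hadoop. prefix forwards keys into the underlying Hadoop configuration, and the keys below are the standard Hadoop FileOutputFormat settings; the writer-level option above should already make them unnecessary, so this is only an assumed-equivalent alternative:

import org.apache.spark.sql.SparkSession

// Sketch only: request bzip2 as the Hadoop-level output codec for the session.
// The spark.hadoop. prefix copies these keys into the Hadoop Configuration.
val spark = SparkSession.builder()
  .appName("bzip2-output")
  .config("spark.hadoop.mapreduce.output.fileoutputformat.compress", "true")
  .config("spark.hadoop.mapreduce.output.fileoutputformat.compress.codec",
    "org.apache.hadoop.io.compress.BZip2Codec")
  .getOrCreate()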

Would really appreciate your help with this.


1 Answer

Answered by KhribiHamza

Thanks for your help guys; the solution was in the bzip2 algorithm itself. Since my data is anonymized in a random way, it is essentially random, and the algorithm is no longer efficient on it: random data has high entropy and little redundancy for bzip2 to exploit.
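
To see this, here is a minimal sketch (not Spark-specific, assuming org.apache.commons.compress is on the classpath, as it ships with Hadoop/Spark distributions) that compresses a highly repetitive sample and a random sample of the same size; the random one barely shrinks:

import java.io.ByteArrayOutputStream
import scala.util.Random
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorOutputStream

object Bzip2EntropyDemo {
  // Compress a byte array with bzip2 and return the compressed size in bytes.
  def bzip2Size(data: Array[Byte]): Int = {
    val buf = new ByteArrayOutputStream()
    val bz  = new BZip2CompressorOutputStream(buf)
    bz.write(data)
    bz.close() // finishes the bzip2 stream
    buf.size()
  }

  def main(args: Array[String]): Unit = {
    val n = 5 * 1024 * 1024 // 5 MB per sample

    // Repetitive text: low entropy, compresses extremely well.
    val repetitive = ("the quick brown fox " * (n / 20)).getBytes("UTF-8")

    // Random bytes: near-maximal entropy, almost incompressible.
    val random = Array.fill[Byte](n)(Random.nextInt(256).toByte)

    println(s"repetitive: ${bzip2Size(repetitive)} bytes compressed (from $n)")
    println(s"random:     ${bzip2Size(random)} bytes compressed (from $n)")
  }
}

On fully random input, bzip2 can even expand the data slightly, which is consistent with the modest 5 GB to 3.2 GB ratio above: anonymized text simply has little redundancy left for the codec to remove.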

Thank you again