Spark bzip2 compression ratio is poor


I'm seeking your help with an issue I've been having over the last couple of days with bzip2 compression. We need to compress our output text files into bzip2 format.

The problem is that we only go from 5 GB uncompressed to 3.2 GB compressed with bzip2, a ratio of roughly 1.6:1. Seeing other projects compress their 5 GB files down to only 400 MB (about 12.5:1) makes me wonder if I'm doing something wrong.

Here is my code:

import org.apache.spark.sql.SaveMode

iDf
  .repartition(iNbPartition)        // control the number (and size) of output files
  .write
  .option("compression", "bzip2")   // select the bzip2 codec for the text output
  .mode(SaveMode.Overwrite)
  .text(iOutputPath)

I am also importing this codec:

import org.apache.hadoop.io.compress.BZip2Codec

Besides that, I am not setting any configs in my spark-submit, because I've tried many with no luck.
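
For reference, here is a minimal sketch of selecting the codec at the session level instead of on the writer. The spark.hadoop. prefix forwards keys into the underlying Hadoop configuration, and the keys below are the standard Hadoop FileOutputFormat settings; the writer-level option above should already make them unnecessary, so this is only an assumed-equivalent alternative:

import org.apache.spark.sql.SparkSession

// Sketch only: request bzip2 as the Hadoop-level output codec for the session.
// The spark.hadoop. prefix copies these keys into the Hadoop Configuration.
val spark = SparkSession.builder()
  .appName("bzip2-output")
  .config("spark.hadoop.mapreduce.output.fileoutputformat.compress", "true")
  .config("spark.hadoop.mapreduce.output.fileoutputformat.compress.codec",
    "org.apache.hadoop.io.compress.BZip2Codec")
  .getOrCreate()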

Would really appreciate your help with this.


1 Answer

Answered by KhribiHamza

Thanks for your help guys; the solution was in the bzip2 algorithm itself. Since my data is anonymized in a random way, it is essentially random, and the algorithm is no longer efficient on it: random data has high entropy and little redundancy for bzip2 to exploit.
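
To see this, here is a minimal sketch (not Spark-specific, assuming org.apache.commons.compress is on the classpath, as it ships with Hadoop/Spark distributions) that compresses a highly repetitive sample and a random sample of the same size; the random one barely shrinks:

import java.io.ByteArrayOutputStream
import scala.util.Random
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorOutputStream

object Bzip2EntropyDemo {
  // Compress a byte array with bzip2 and return the compressed size in bytes.
  def bzip2Size(data: Array[Byte]): Int = {
    val buf = new ByteArrayOutputStream()
    val bz  = new BZip2CompressorOutputStream(buf)
    bz.write(data)
    bz.close() // finishes the bzip2 stream
    buf.size()
  }

  def main(args: Array[String]): Unit = {
    val n = 5 * 1024 * 1024 // 5 MB per sample

    // Repetitive text: low entropy, compresses extremely well.
    val repetitive = ("the quick brown fox " * (n / 20)).getBytes("UTF-8")

    // Random bytes: near-maximal entropy, almost incompressible.
    val random = Array.fill[Byte](n)(Random.nextInt(256).toByte)

    println(s"repetitive: ${bzip2Size(repetitive)} bytes compressed (from $n)")
    println(s"random:     ${bzip2Size(random)} bytes compressed (from $n)")
  }
}

On fully random input, bzip2 can even expand the data slightly, which is consistent with the modest 5 GB to 3.2 GB ratio above: anonymized text simply has little redundancy left for the codec to remove.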

Thank you again