I am executing a Hive INSERT OVERWRITE query.
SET hive.merge.mapredfiles=false;
SET mapred.reduce.tasks=${hiveconf:NUM_REDUCERS};
SET hive.mapred.supports.subdirectories=TRUE;
SET mapred.input.dir.recursive=TRUE;
SET hive.execution.engine=tez;
SET hive.tez.container.size=6144;
INSERT OVERWRITE TABLE CS.STAGING_AVRO PARTITION (CPD_DT='${hiveconf:CPD_DT}', CUT_NO='${hiveconf:CUT_NO}')
SELECT DISTINCT CS_TRAN.field1, CS_TRAN.field2 ....
FROM ( SELECT A, B, C, ... FROM NMS1.TBL1 WHERE ... ) CS_TRAN
JOIN NMS2.TBL2 WHERE ....
This INSERT OVERWRITE query joins data from two tables and writes the result into the CS.STAGING_AVRO table, partitioned by date and cut number.
The issue I am facing is that it creates one big file per partition.
For example:
- /data/cpd_dt=20230220/cut_no=00/000000_0 --> 3gb size
- /data/cpd_dt=20230220/cut_no=01/000000_0 --> 3gb size
- /data/cpd_dt=20230220/cut_no=02/000000_0 --> 3gb size
In every partition it is creating a single file of about 3 GB. I want to split this one large file into many files of roughly 500 MB each, but I have not been able to achieve this.
I tried the following settings, but they didn't work:
SET mapred.max.split.size=1000000;
SET hive.exec.reducers.bytes.per.reducer=1000000;
SET hive.merge.size.per.task=1000000;
SET hive.merge.smallfiles.avgsize=1000000;
SET hive.exec.compress.output=true;
Please tell me what needs to be done here. Any help is appreciated.
Try this:
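(The original SET statements were not preserved in this post; the following is a plausible reconstruction, assuming the answer meant raising the HDFS block size for the table's output to 512 MB, which matches the "multiply by 2 for 1 GB" remark below.)

```sql
-- Reconstruction (assumption): write output with a 512 MB HDFS block size.
-- 512 MB = 512 * 1024 * 1024 = 536870912 bytes; multiply by 2 for 1 GB.
SET dfs.blocksize=536870912;
SET dfs.block.size=536870912;  -- older, deprecated name of the same property
```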
These will set the block size to 512 MB. If you want to increase it to 1 GB, just multiply by 2. But I am afraid you have to recreate the partitioned table; otherwise Hive won't create smaller files.
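If recreating the table is not an option, a different knob may help (a sketch, not from the original answer, assuming the final write happens in a reduce stage, which the DISTINCT forces here): cap the bytes processed per reducer so each partition's output is spread across several reducers, and keep Hive from merging the resulting files back into one.

```sql
-- Sketch (assumption): aim for roughly 500 MB per reducer, and hence per output file.
SET hive.exec.reducers.bytes.per.reducer=524288000;  -- 500 * 1024 * 1024 bytes
-- Prevent Tez from merging the smaller output files back into one per partition.
SET hive.merge.tezfiles=false;
```

The 1000000 values tried in the question are about 1 MB, not 500 MB, which is worth double-checking when experimenting with these byte-valued settings.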