I am executing a Hive INSERT OVERWRITE query.
SET hive.merge.mapredfiles=false;
SET mapred.reduce.tasks=${hiveconf:NUM_REDUCERS};
SET hive.mapred.supports.subdirectories=TRUE;
SET mapred.input.dir.recursive=TRUE;
SET hive.execution.engine=tez;
SET hive.tez.container.size=6144;
INSERT OVERWRITE TABLE CS.STAGING_AVRO PARTITION (CPD_DT='${hiveconf:CPD_DT}', CUT_NO='${hiveconf:CUT_NO}')
SELECT DISTINCT CS_TRAN.field1, CS_TRAN.field2 ....
FROM ( SELECT A, B, C, ... FROM NMS1.TBL1 WHERE ... ) CS_TRAN
JOIN NMS2.TBL2 WHERE ....
This INSERT OVERWRITE query joins data from two tables and writes the result into the CS.STAGING_AVRO table, partitioned by date and cut number.
The issue I am facing is that it creates one big file per partition.
For example:
- /data/cpd_dt=20230220/cut_no=00/000000_0 --> 3gb size
- /data/cpd_dt=20230220/cut_no=01/000000_0 --> 3gb size
- /data/cpd_dt=20230220/cut_no=02/000000_0 --> 3gb size
In every partition it is creating a single file of about 3 GB. I want to split this one large file into many files of roughly 500 MB each, but I have not been able to achieve this.
I tried the following settings, but they didn't work:
SET mapred.max.split.size=1000000;
SET hive.exec.reducers.bytes.per.reducer=1000000;
SET hive.merge.size.per.task=1000000;
SET hive.merge.smallfiles.avgsize=1000000;
SET hive.exec.compress.output=true;
Please tell me what needs to be done here. Any help is appreciated.
Try this:
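(The original SET statements were not preserved in this post; the following is a plausible reconstruction, assuming the answer meant raising the HDFS block size for the table's output to 512 MB, which matches the "multiply by 2 for 1 GB" remark below.)

```sql
-- Reconstruction (assumption): write output with a 512 MB HDFS block size.
-- 512 MB = 512 * 1024 * 1024 = 536870912 bytes; multiply by 2 for 1 GB.
SET dfs.blocksize=536870912;
SET dfs.block.size=536870912;  -- older, deprecated name of the same property
```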
These will set the block size to 512 MB. If you want to increase it to 1 GB, just multiply by 2. But I am afraid you have to recreate the partitioned table; otherwise Hive won't create smaller files.
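If recreating the table is not an option, a different knob may help (a sketch, not from the original answer, assuming the final write happens in a reduce stage, which the DISTINCT forces here): cap the bytes processed per reducer so each partition's output is spread across several reducers, and keep Hive from merging the resulting files back into one.

```sql
-- Sketch (assumption): aim for roughly 500 MB per reducer, and hence per output file.
SET hive.exec.reducers.bytes.per.reducer=524288000;  -- 500 * 1024 * 1024 bytes
-- Prevent Tez from merging the smaller output files back into one per partition.
SET hive.merge.tezfiles=false;
```

The 1000000 values tried in the question are about 1 MB, not 500 MB, which is worth double-checking when experimenting with these byte-valued settings.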