I am trying to move data in S3 that is partitioned at rest on a date string (source) to another S3 location where it is partitioned at rest (destination) as `year=yyyy/month=mm/day=dd/`.
While I am able to read the entire source location in Spark and write it out in the destination partition layout to a temporary HDFS location, s3DistCp fails to copy it from HDFS to S3. It fails with an OutOfMemory error.
I am trying to write close to 2 million small files (roughly 20 KB each).
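For context, this is roughly what the Spark step looks like. A minimal sketch, assuming the source is Parquet and the date string lives in a column named `dt` formatted as `yyyy-mm-dd`; the column name, format, and app name are illustrative assumptions, not the exact job:

```python
# Sketch: read the date-string-partitioned source and rewrite it to a
# temporary HDFS location laid out as year=yyyy/month=mm/day=dd/.
# Assumes Parquet input and a 'dt' column formatted as 'yyyy-mm-dd'.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("repartition-by-date").getOrCreate()

# Original S3 source, partitioned on a date string at rest.
df = spark.read.parquet("s3a://<original_source_path>/")

df = (
    df.withColumn("year", F.substring("dt", 1, 4))
      .withColumn("month", F.substring("dt", 6, 2))
      .withColumn("day", F.substring("dt", 9, 2))
)

# Stage to the HDFS path that s3DistCp later copies from (--src below).
(
    df.write
      .partitionBy("year", "month", "day")
      .mode("overwrite")
      .parquet("hdfs:///<source_path>/")
)
```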
My s3DistCp job runs with the following arguments:
```bash
sudo -H -u hadoop nice -10 bash -c "if hdfs dfs -test -d hdfs:///<source_path>; then /usr/lib/hadoop/bin/hadoop jar /usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar -libjars /usr/share/aws/emr/s3-dist-cp/lib/ -Dmapreduce.job.reduces=30 -Dmapreduce.child.java.opts=-Xmx2048m --src hdfs:///<source_path> --dest s3a://<destination_path> --s3ServerSideEncryption; fi"
```
It fails with:

```
[2020-08-06 14:23:36,038] {bash_operator.py:126} INFO - # java.lang.OutOfMemoryError: Java heap space
[2020-08-06 14:23:36,038] {bash_operator.py:126} INFO - # -XX:OnOutOfMemoryError="kill -9 %p"
```
The EMR cluster I am running this on is:
"master_instance_type": "r5d.8xlarge",
"core_instance_type": "r5.2xlarge",
"core_instance_count": "8",
"task_instance_types": [ "r5.2xlarge","m5.4xlarge"],
"task_instance_count": "1000"
Any suggestions on which s3DistCp configurations I could increase so that it can copy this without running out of memory?
I ended up running this iteratively; on the AWS stack described above, s3DistCp was able to handle about 300K files per iteration without OOM.
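A minimal sketch of that iterative approach, batching by `year=/month=` prefix so each s3DistCp invocation stays well under ~300K files; the prefix list and batching unit are illustrative assumptions, not the exact commands I ran:

```python
# Sketch: drive one s3-dist-cp run per month prefix so each batch stays
# under the ~300K-file limit observed above.
import subprocess

# Example batching unit: one prefix per month of a single year.
prefixes = ["year=2020/month={:02d}".format(m) for m in range(1, 13)]

for prefix in prefixes:
    src = "hdfs:///<source_path>/{}".format(prefix)
    dest = "s3a://<destination_path>/{}".format(prefix)

    # Only copy prefixes that actually exist in HDFS (mirrors the hdfs dfs -test guard above).
    if subprocess.run(["hdfs", "dfs", "-test", "-d", src]).returncode != 0:
        continue

    subprocess.run(
        [
            "/usr/lib/hadoop/bin/hadoop", "jar",
            "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar",
            "--src", src,
            "--dest", dest,
            "--s3ServerSideEncryption",
        ],
        check=True,  # fail fast if a batch does not complete cleanly
    )
```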