Running a DistCp Java job using Hadoop YARN


I want to copy files present in HDFS to an S3 bucket using Java code. My implementation looks like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.tools.DistCpOptions;
import org.apache.hadoop.tools.OptionsParser;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Class name is just a wrapper so the snippet compiles; connection details,
// credentials and paths are placeholders set elsewhere in my application.
public class HdfsToS3Copy {

    private static final Logger logger = LoggerFactory.getLogger(HdfsToS3Copy.class);

    private static String hdfsUrl;
    private static String hdfsUser;
    private static String s3AccessKey;
    private static String s3SecretKey;
    private static String s3EndPoint;
    private static String srcDir;
    private static String dstDir;

    private static void setHadoopConfiguration(Configuration conf) {
        conf.set("fs.defaultFS", hdfsUrl);
        conf.set("fs.s3a.access.key", s3AccessKey);
        conf.set("fs.s3a.secret.key", s3SecretKey);
        conf.set("fs.s3a.endpoint", s3EndPoint);
        conf.set("hadoop.job.ugi", hdfsUser);
        System.setProperty("com.amazonaws.services.s3.enableV4", "true");
    }

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        setHadoopConfiguration(conf);
        try {
            DistCpOptions distCpOptions = OptionsParser.parse(new String[]{srcDir, dstDir});
            DistCp distCp = new DistCp(conf, distCpOptions);
            distCp.execute();
        } catch (Exception e) {
            logger.info("Exception occurred while copying file {}", srcDir);
            logger.error("Error ", e);
        }
    }
}

Now this code runs fine, but the problem is that it doesn't launch the DistCp job on the YARN cluster. It launches the local job runner instead, which causes timeouts when copying large files.
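From what I have read, the job falls back to LocalJobRunner when mapreduce.framework.name is left at its default value of "local", so I suspect I need to point the Configuration at YARN explicitly. Something like the snippet below is what I have in mind ("rm-host" and the port are placeholders for my cluster, and I am not sure this is the complete set of properties):

    // Hypothetical additions to setHadoopConfiguration(); "rm-host" is a placeholder.
    conf.set("mapreduce.framework.name", "yarn");              // default is "local"
    conf.set("yarn.resourcemanager.hostname", "rm-host");
    conf.set("yarn.resourcemanager.address", "rm-host:8032");  // 8032 is the default RM port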

[2020-08-23 21:16:53.759][LocalJobRunner Map Task Executor #0][INFO][S3AFileSystem:?] Getting path status for s3a://***.distcp.tmp.attempt_local367303638_0001_m_000000_0 (***.distcp.tmp.attempt_local367303638_0001_m_000000_0)
[2020-08-23 21:16:53.922][LocalJobRunner Map Task Executor #0][INFO][S3AFileSystem:?] Delete path s3a://***.distcp.tmp.attempt_local367303638_0001_m_000000_0 - recursive false
[2020-08-23 21:16:53.922][LocalJobRunner Map Task Executor #0][INFO][S3AFileSystem:?] Getting path status for s3a://*** .distcp.tmp.attempt_local367303638_0001_m_000000_0 (**.distcp.tmp.attempt_local367303638_0001_m_000000_0)
[2020-08-23 21:16:54.007][LocalJobRunner Map Task Executor #0][INFO][S3AFileSystem:?] Getting path status for s3a://****
[2020-08-23 21:16:54.118][LocalJobRunner Map Task Executor #0][ERROR][RetriableCommand:?] Failure in Retriable command: Copying hdfs://*** to s3a://***
com.amazonaws.SdkClientException: Unable to execute HTTP request: Read timed out
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1189)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1135)

Please help me understand how I can configure YARN so that the DistCp job runs on the cluster rather than locally.
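Would loading the cluster's own client configuration files into the Configuration be the right way to do this? A sketch of what I mean (the /etc/hadoop/conf paths are placeholders for wherever the client configs live on my machine):

    import org.apache.hadoop.fs.Path;

    // Hypothetical: pick up mapreduce.framework.name, yarn.resourcemanager.*, etc.
    // from the cluster's client configs instead of the built-in defaults.
    conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
    conf.addResource(new Path("/etc/hadoop/conf/yarn-site.xml"));
    conf.addResource(new Path("/etc/hadoop/conf/mapred-site.xml"));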
