TechQA.

Parallelization with AWS Glue DynamicFrames and ThreadPoolExecutor?

63 views Asked by Nicolás Sánchez At 15 February 2024 at 15:29

So I have been trying to exploit the parallelization features of AWS Glue jobs. This is the approach I have been following:

Reading data from the Glue Data Catalog into DynamicFrames with glueContext.create_dynamic_frame.from_catalog.
Creating a transformation function to process these DynamicFrames.
Using ThreadPoolExecutor to run the transformations in parallel (code example below).

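from concurrent.futures import ThreadPoolExecutor

# dynamic_df is a single DynamicFrame; this is the call that raises the error below
with ThreadPoolExecutor(max_workers=5) as executor:
    output_dfs = executor.map(transform_func, dynamic_df)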

The problem is that I get the following error:

TypeError: 'DynamicFrame' object is not iterable
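
As far as I understand, executor.map(func, iterable) iterates over its second argument and submits one task per element, so any single object that does not implement iteration fails in the same way. Here is a minimal sketch that reproduces this without Glue. FakeFrame is a hypothetical stand-in for a DynamicFrame, the table names are made up, and transform_func is only a placeholder for my real transformation:

```python
from concurrent.futures import ThreadPoolExecutor


class FakeFrame:
    """Hypothetical stand-in for a DynamicFrame: a plain object with no __iter__."""

    def __init__(self, name):
        self.name = name


def transform_func(frame):
    # Placeholder; the real function would call frame.toDF() and transform the result.
    return f"transformed {frame.name}"


frames = [FakeFrame("orders"), FakeFrame("customers")]

with ThreadPoolExecutor(max_workers=5) as executor:
    # Passing one non-iterable object: map() tries to iterate over it and fails.
    try:
        executor.map(transform_func, frames[0])
        error_message = None
    except TypeError as exc:
        error_message = str(exc)

    # Passing a list of frames: one task per frame, results in input order.
    results = list(executor.map(transform_func, frames))

print(error_message)
print(results)
```

So the call itself works when it receives a list of frames rather than one frame. Whether running the transformations in threads actually speeds things up in a Glue job is a separate matter that this sketch does not show.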

I also tried converting the DynamicFrame to a PySpark DataFrame with the .toDF() method (see: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-toDF) before passing it to executor.map. However, toDF() has to be part of the transformation I want to parallelize: I found that calling it outside is way too slow, so I want to include it inside the body of transform_func.

My question is: how can we parallelize a DynamicFrame, i.e. pass it as an argument to executor.map()?

There are 0 answers