How to pre-build the worker container in Dataflow? [Insight: "SDK worker container image pre-building: can be enabled"]


I'm wondering how to pre-build the worker container and, at the same time, use a setup.py file for multi-file dependencies.

Even when I used the official template linked below, I still got the Insight "SDK worker container image pre-building: can be enabled". Is it a bug?

https://github.com/GoogleCloudPlatform/python-docs-samples/tree/main/dataflow/flex-templates/pipeline_with_dependencies

1 Answer

Answered by Valentyn:

When you specify additional dependencies for a Python pipeline using flags such as --requirements_file, --setup_file, or --extra_package, you can pre-build a container image that installs these dependencies before a Dataflow Python worker starts. This is accomplished by supplying the --prebuild_sdk_container_engine pipeline option; see https://cloud.google.com/dataflow/docs/guides/build-container-image#pre-build_a_container_image_when_submitting_the_job. This option can help users who want to optimize worker startup time but don't want to build their own custom container manually.
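
For illustration, here is a minimal launch sketch. The project, bucket, and registry values are placeholders, and the option names assume a recent Apache Beam Python SDK:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder values: replace project, bucket, and registry with your own.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
    # Multi-file dependencies packaged via setup.py:
    setup_file="./setup.py",
    # Pre-build a worker image with the dependencies above installed,
    # instead of installing them on every worker at startup:
    prebuild_sdk_container_engine="cloud_build",
    docker_registry_push_url="us-central1-docker.pkg.dev/my-project/my-repo",
)

with beam.Pipeline(options=options) as pipeline:
    pipeline | beam.Create(["hello"]) | beam.Map(print)
```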

However, when you use a custom container image, it is better to install the necessary dependencies directly in the image when you build it: https://cloud.google.com/dataflow/docs/guides/build-container-image#preinstall_using_a_dockerfile. If you do that, you no longer need to supply pipeline options that install dependencies at runtime, and the pre-building workflow is unnecessary.
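
With that approach, a launch sketch might look like the following (the image URL is a placeholder; --sdk_container_image is the Beam option that points workers at a custom image):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# All dependencies (and the pipeline package itself) are already baked into
# the custom image, so no --requirements_file/--setup_file/--extra_package
# flags are needed here.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
    sdk_container_image=(
        "us-central1-docker.pkg.dev/my-project/my-repo/beam-worker:latest"
    ),
)

with beam.Pipeline(options=options) as pipeline:
    pipeline | beam.Create([1, 2, 3]) | beam.Map(lambda x: x * 2) | beam.Map(print)
```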

Custom containers give you more control over image customization and result in a more reproducible runtime environment than container pre-building.

If the pipeline package is installed in the custom container image, supplying --setup_file is not necessary, unless you have made local changes that are not yet reflected in the custom image. If you omit --setup_file, the insight will no longer be shown after the next observation period (the next day).
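
For reference, the setup.py used for multi-file dependencies is typically a small setuptools script along these lines (the package name and dependency list below are assumptions). The same package can be installed into the custom image at build time, for example with a RUN pip install . step in the Dockerfile, after which --setup_file can be omitted from the launch command:

```python
# setup.py -- a minimal sketch; package name and dependencies are placeholders.
import setuptools

setuptools.setup(
    name="my_pipeline_package",
    version="0.0.1",
    # find_packages() picks up all local modules the pipeline imports,
    # so they are staged to the workers as one installable package.
    packages=setuptools.find_packages(),
    install_requires=[
        # Third-party dependencies the pipeline code needs at runtime,
        # e.g. "pandas==2.1.4",
    ],
)
```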