I'm trying to install dependencies in a Dataflow pipeline. First I used the requirements_file flag (roughly as sketched below), but I get ModuleNotFoundError: No module named 'unidecode' [while running 'Map(wordcleanfn)-ptransform-54']. The only extra package I add is unidecode. As a second option I configured a custom Docker image following the Google documentation (the Dockerfile follows the sketch).
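The first attempt looked roughly like this (a sketch, not my exact command; requirements.txt contains the single line "unidecode"):

python -m mytestcode \
--project myprojectid \
--region us-central1 \
--temp_location gs://mybucket/beam_test/tmp/ \
--staging_location gs://mybucket/beam_test/stage_output/ \
--runner DataflowRunner \
--requirements_file requirements.txt

The Dockerfile: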
FROM apache/beam_python3.10_sdk:2.52.0
ENV RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT=1
RUN pip install unidecode
RUN apt-get update && apt-get install -y
ENTRYPOINT ["/opt/apache/beam/boot"]
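Building and pushing the image looked roughly like this (approximate commands; the tag is the one used in the pipeline command below, and authentication to Artifact Registry via gcloud auth configure-docker is assumed):

docker build -t us-central1-docker.pkg.dev/myprojectid/myimagerepo/dataflowtest-image:0.0.1 .
docker push us-central1-docker.pkg.dev/myprojectid/myimagerepo/dataflowtest-image:0.0.1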
I built the image on a VM inside the GCP project and pushed it to Artifact Registry. Then I generated the pipeline template with:
python -m mytestcode \
--project myprojectid \
--region us-central1 \
--temp_location gs://mybucket/beam_test/tmp/ \
--runner DataflowRunner \
--staging_location gs://mybucket/beam_test/stage_output/ \
--template_name mytestcode_template \
--customvariable 500 \
--experiments use_runner_v2 \
--sdk_container_image us-central1-docker.pkg.dev/myprojectid/myimagerepo/dataflowtest-image:0.0.1 \
--sdk_location container
Finally, I created a job from the template through the Dataflow UI (roughly equivalent to the gcloud command sketched below), but the error is the same. Can someone help me? My understanding is that the workers are still using the default Beam SDK image rather than my custom container. Is that correct, and how can I fix it?
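For reference, I believe the UI launch is equivalent to something like this (the template GCS path here is assumed for illustration, since my code only takes --template_name):

gcloud dataflow jobs run mytestcode-test-job \
--gcs-location gs://mybucket/beam_test/stage_output/mytestcode_template \
--region us-central1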