I have a Dataflow pipeline written in Apache Beam that reads millions of GCS objects containing logs. The paths of those objects come from a GCS folder of file-pointer files that list all of them. The pipeline does not scale out to more workers; it stays stuck on a single worker.
Pipeline code:
PCollection<String> pointerPattern =
    pipeline.apply(Create.of("gs://file-pointer-bucket/*"));   // matches five pointer files, each listing ~500K log-file paths

PCollection<String> filePaths = pointerPattern
    .apply(FileIO.matchAll())
    .apply(FileIO.readMatches())
    .apply(TextIO.readFiles());    // emits one log-file path per line

PCollection<String> validFilePaths =
    filePaths.apply(Filter.by(path -> isInTimeRange(path)));   // isInTimeRange: my time-range check; drops only a small fraction of paths

PCollection<String> logs = validFilePaths
    .apply(FileIO.matchAll())
    .apply(FileIO.readMatches())
    .apply(TextIO.readFiles());    // reads the actual log lines
Why can't Dataflow scale out to a large number of workers and read these millions of files in parallel, instead of staying on one worker and reading each file serially?
I tried inserting a Reshuffle at different stages to see whether breaking fusion would solve this, but with no success.
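For reference, this is roughly one of the placements I tried (a sketch only, assuming Reshuffle.viaRandomKey() between the path filter and the second match/read; the surrounding transforms are the same as above):

PCollection<String> shuffledPaths =
    validFilePaths.apply(Reshuffle.viaRandomKey());   // intended to break fusion so the file reads can spread across workers

PCollection<String> logs = shuffledPaths
    .apply(FileIO.matchAll())
    .apply(FileIO.readMatches())
    .apply(TextIO.readFiles());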