I have a Dataflow pipeline written in Apache Beam that reads millions of GCS objects containing logs. The paths of those objects come from a GCS folder of file-pointer files that list all of them. The pipeline does not scale out to more workers; it stays stuck on a single worker.
Pipeline code:
PCollection<String> pointerPattern =
    pipeline.apply(Create.of("gs://file-pointer-bucket/*"));   // matches five pointer files, each listing ~500K log-file paths

PCollection<String> filePaths = pointerPattern
    .apply(FileIO.matchAll())
    .apply(FileIO.readMatches())
    .apply(TextIO.readFiles());    // emits one log-file path per line

PCollection<String> validFilePaths =
    filePaths.apply(Filter.by(path -> isInTimeRange(path)));   // isInTimeRange: my time-range check; drops only a small fraction of paths

PCollection<String> logs = validFilePaths
    .apply(FileIO.matchAll())
    .apply(FileIO.readMatches())
    .apply(TextIO.readFiles());    // reads the actual log lines
Why can't Dataflow scale out to a large number of workers and read these millions of files in parallel, instead of staying on one worker and reading each file serially?
I tried inserting a Reshuffle at different stages to see whether breaking fusion would solve this, but with no success.
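For reference, this is roughly one of the placements I tried (a sketch only, assuming Reshuffle.viaRandomKey() between the path filter and the second match/read; the surrounding transforms are the same as above):

PCollection<String> shuffledPaths =
    validFilePaths.apply(Reshuffle.viaRandomKey());   // intended to break fusion so the file reads can spread across workers

PCollection<String> logs = shuffledPaths
    .apply(FileIO.matchAll())
    .apply(FileIO.readMatches())
    .apply(TextIO.readFiles());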