I'm a newbie to Java Dataflow, and I want to read a CSV file from GCS and get a PCollection as output.
How can I do it efficiently?
Thanks a lot for your help.
Ideally, you would convert the lines of that CSV file into something else, such as a typed object. If you can share more details about what you are trying to do with that PCollection, we could provide more insight.
Here is a very generic example that might help. Suppose that you have a CSV file at gs://sample-bucket-name/teams.csv.

You could create a pipeline and a component that converts those lines to a POJO whose schema Beam knows.

A ParDo function can do the conversion. You can use any tool you'd like to read and parse the CSV, instead of String.split.

Finally, the pipeline uses TextIO to read the text file into a PCollection of Team. Ideally, you would want to use other IOs to fulfill your needs, and write your results somewhere.
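The parsing step that the ParDo would perform can be sketched in plain Java, without any Beam dependency, so it is easy to test on its own. This is only a sketch under assumptions: the answer does not show the columns of teams.csv, so the two-column `id,name` layout, the `Team` fields, and the names `TeamCsv` and `parseLine` are all hypothetical. In a Beam pipeline, a method like this would typically be called from a DoFn's `@ProcessElement` method, applied with `ParDo.of(...)` to the lines produced by `TextIO.read().from("gs://sample-bucket-name/teams.csv")`.

```java
import java.io.Serializable;
import java.util.Optional;

// Hypothetical helper: parsing logic only, no Beam types involved.
final class TeamCsv {

    // Hypothetical POJO. The real columns of teams.csv are not shown in the
    // answer, so "id" and "name" are assumptions made for illustration.
    static final class Team implements Serializable {
        final int id;
        final String name;

        Team(int id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    // Parses one "id,name" line. Returns Optional.empty() for null input,
    // a wrong column count, or a non-numeric id (such as a header row).
    static Optional<Team> parseLine(String line) {
        if (line == null) {
            return Optional.empty();
        }
        // limit -1 keeps trailing empty fields so the column count is reliable
        String[] parts = line.split(",", -1);
        if (parts.length != 2) {
            return Optional.empty();
        }
        try {
            int id = Integer.parseInt(parts[0].trim());
            return Optional.of(new Team(id, parts[1].trim()));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // prints "1 Lakers"
        parseLine("1,Lakers").ifPresent(t -> System.out.println(t.id + " " + t.name));
    }
}
```

Returning an Optional lets the calling DoFn skip header or malformed rows instead of failing the whole job. Note that String.split does not handle quoted fields that contain commas, so a real CSV parser is the safer choice if the file may contain them.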