join datasets with tfx tensorflow transform

I am trying to replicate in TensorFlow Transform some data preprocessing that I have done in pandas.

I have a few CSV files, which I joined and aggregated with pandas to produce a training dataset. Now, as part of productionising the model, I would like this preprocessing to be done at scale with Apache Beam and TensorFlow Transform. However, it is not quite clear to me how to reproduce the same data manipulation there. Let's look at the two main operations: JOIN dataset a and dataset b to produce c, and GROUP BY col1 on dataset c. These are quite straightforward in pandas, but how would I do them in TensorFlow Transform running on Apache Beam? Am I using the wrong tool for the job? If so, what would be the right tool?
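For concreteness, here is a minimal sketch of the pandas version of this preprocessing; the file names, the join key "key", and the sum() aggregation are placeholders standing in for my real columns and logic:

```python
# Minimal pandas sketch of the current preprocessing.
# "a.csv", "b.csv", the join key "key", and sum() are placeholders.
import pandas as pd

df_a = pd.read_csv("a.csv")
df_b = pd.read_csv("b.csv")

# JOIN dataset a and dataset b to produce c.
df_c = df_a.merge(df_b, on="key")

# GROUP BY col1 on dataset c and aggregate.
training_data = df_c.groupby("col1").sum()
```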
You can use the Beam DataFrames API to do the join and other preprocessing exactly as you would have in pandas. You can then use to_pcollection to get a PCollection that you can pass directly to your TensorFlow Transform operations, or save it as a file to read in later.

For top-level functions (such as merge), you need to import Beam's deferred pandas module (call it beam_pd) and use beam_pd.func(...) in place of pd.func(...).
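A rough sketch of how this can fit together is below; it is an illustration under assumptions, not a tested pipeline. The file paths, the join key "key", the col1 column, and the sum() aggregation are placeholders, and it calls the DataFrame.merge method directly so the top-level beam_pd wrapper is not needed here:

```python
# Sketch only: paths, the join key "key", col1, and sum() are placeholders.
import apache_beam as beam
from apache_beam.dataframe.io import read_csv
from apache_beam.dataframe.convert import to_pcollection

with beam.Pipeline() as p:
    # Deferred (Beam) dataframes that mirror the pandas API.
    df_a = p | "ReadA" >> read_csv("gs://my-bucket/a.csv")
    df_b = p | "ReadB" >> read_csv("gs://my-bucket/b.csv")

    # JOIN a and b to produce c, then GROUP BY col1 -- same calls as pandas.
    df_c = df_a.merge(df_b, on="key")
    aggregated = df_c.groupby("col1").sum()

    # Convert back to a PCollection of schema'd rows that can be passed on
    # to tf.Transform or written to a file for later use.
    rows = to_pcollection(aggregated)
```

From there, the resulting PCollection can be fed into tf.Transform (e.g. tft_beam.AnalyzeAndTransformDataset, after attaching the appropriate metadata), or written out and re-read by the rest of your TFX pipeline.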