join datasets with tfx tensorflow transform

I am trying to replicate in TensorFlow Transform some data preprocessing that I have done in pandas.

I have a few CSV files, which I joined and aggregated with pandas to produce a training dataset. Now, as part of productionising the model, I would like this preprocessing to be done at scale with Apache Beam and TensorFlow Transform. However, it is not quite clear to me how to reproduce the same data manipulation there. Let's look at the two main operations: JOIN dataset a and dataset b to produce c, and GROUP BY col1 on dataset c. These are quite straightforward in pandas, but how would I do them in TensorFlow Transform running on Apache Beam? Am I using the wrong tool for the job? If so, what would be the right tool?
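For concreteness, here is a minimal sketch of the pandas version of this preprocessing; the file names, the join key "key", and the sum() aggregation are placeholders standing in for my real columns and logic:

```python
# Minimal pandas sketch of the current preprocessing.
# "a.csv", "b.csv", the join key "key", and sum() are placeholders.
import pandas as pd

df_a = pd.read_csv("a.csv")
df_b = pd.read_csv("b.csv")

# JOIN dataset a and dataset b to produce c.
df_c = df_a.merge(df_b, on="key")

# GROUP BY col1 on dataset c and aggregate.
training_data = df_c.groupby("col1").sum()
```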
You can use the Beam DataFrames API to do the join and other preprocessing exactly as you would have in pandas. You can then use to_pcollection to get a PCollection that you can pass directly to your TensorFlow Transform operations, or save it as a file to read in later.

For top-level functions (such as merge), you need to import Beam's deferred pandas module (call it beam_pd) and use beam_pd.func(...) in place of pd.func(...).
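A rough sketch of how this can fit together is below; it is an illustration under assumptions, not a tested pipeline. The file paths, the join key "key", the col1 column, and the sum() aggregation are placeholders, and it calls the DataFrame.merge method directly so the top-level beam_pd wrapper is not needed here:

```python
# Sketch only: paths, the join key "key", col1, and sum() are placeholders.
import apache_beam as beam
from apache_beam.dataframe.io import read_csv
from apache_beam.dataframe.convert import to_pcollection

with beam.Pipeline() as p:
    # Deferred (Beam) dataframes that mirror the pandas API.
    df_a = p | "ReadA" >> read_csv("gs://my-bucket/a.csv")
    df_b = p | "ReadB" >> read_csv("gs://my-bucket/b.csv")

    # JOIN a and b to produce c, then GROUP BY col1 -- same calls as pandas.
    df_c = df_a.merge(df_b, on="key")
    aggregated = df_c.groupby("col1").sum()

    # Convert back to a PCollection of schema'd rows that can be passed on
    # to tf.Transform or written to a file for later use.
    rows = to_pcollection(aggregated)
```

From there, the resulting PCollection can be fed into tf.Transform (e.g. tft_beam.AnalyzeAndTransformDataset, after attaching the appropriate metadata), or written out and re-read by the rest of your TFX pipeline.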