As far as I understand, TFX (TensorFlow Extended) uses MLMD (ML Metadata) to record and retrieve metadata associated with workflows. This may include:
1. results of pipeline components
2. metadata about artifacts generated through the components of the pipelines
3. metadata about executions of these components
4. metadata about the pipeline and associated lineage information
Features:
Does the above (e.g. #1, "results of pipeline components") imply that MLMD stores actual data (e.g. input features for ML training)? If not, what is meant by "results of pipeline components"?
Orchestration and pipeline history:
Also, when using TFX with e.g. Airflow, which has its own metastore (metadata about DAGs, their runs, and other Airflow configuration such as users, roles, and connections), does MLMD store redundant information? Does it supersede Airflow's metastore?
TFX is an ML pipeline/workflow framework, so when you write a TFX application you are essentially constructing the structure of the workflow and preparing it to accept a particular set of data and process or use it (transformations, model building, inference, deployment, etc.). In that respect it never stores the actual data; it stores the information (metadata) needed to process or use the data. For example, to check for anomalies it has to remember the previous data schema/statistics (not the actual data), so it saves that information as metadata in MLMD, alongside the actual run metadata.

Airflow will also save run metadata. This can be seen as a subset of all the metadata, and it is very limited compared to what is saved in MLMD, so there is some redundancy. TFX is the controller: it defines and makes use of the underlying Airflow orchestration. MLMD will not supersede Airflow's metastore, but a run will definitely fail if the two clash.
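To make the schema/statistics point concrete, here is a toy sketch in plain Python. It is not the MLMD API: `ArtifactRecord`, `ToyMetadataStore`, `summarize`, `schema_anomalies`, and the `/data/...` paths are all hypothetical names invented for illustration. The only thing it is meant to show is that an anomaly check can run against a recorded summary (column names, example count) and a URI pointing at the data, without the store ever holding the rows themselves.

```python
from dataclasses import dataclass, field


@dataclass
class ArtifactRecord:
    """Hypothetical metadata record: a pointer to the data plus a summary of it."""
    type_name: str
    uri: str  # where the data lives; the data itself is not stored here
    properties: dict = field(default_factory=dict)


class ToyMetadataStore:
    """In-memory stand-in for a metadata store (illustrative, not the MLMD API)."""

    def __init__(self):
        self._records = []

    def put(self, record):
        self._records.append(record)
        return len(self._records) - 1  # id of the stored record

    def latest(self, type_name):
        # Walk backwards so the most recently stored record wins.
        for record in reversed(self._records):
            if record.type_name == type_name:
                return record
        return None


def summarize(rows):
    """Reduce raw rows to schema/statistics metadata."""
    columns = sorted({key for row in rows for key in row})
    return {"columns": columns, "num_examples": len(rows)}


def schema_anomalies(previous, current):
    """Compare two summaries; only metadata is needed, not the rows."""
    prev_cols, cur_cols = set(previous["columns"]), set(current["columns"])
    return {
        "missing_columns": sorted(prev_cols - cur_cols),
        "new_columns": sorted(cur_cols - prev_cols),
    }


store = ToyMetadataStore()

# Run 1: the rows stay in external storage; only their summary is recorded.
run1_rows = [{"age": 31, "income": 52000}, {"age": 45, "income": 61000}]
store.put(ArtifactRecord("Statistics", "/data/run1/stats", summarize(run1_rows)))

# Run 2: look up the previous summary, compare, then record the new one.
run2_rows = [{"age": 29, "zipcode": "94110"}]
previous = store.latest("Statistics")
current = summarize(run2_rows)
anomalies = schema_anomalies(previous.properties, current)
store.put(ArtifactRecord("Statistics", "/data/run2/stats", current))

print(anomalies)
```

In this sketch the second run is flagged as missing `income` and having a new `zipcode` column, using nothing but the summary recorded for the first run.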