I have a data connection source that creates two datasets:
- Dataset X (Snapshot)
- Dataset Y (Incremental)
The two datasets pull from the same source. Dataset X consists of the current state of all rows in the source table. Dataset Y pulls all rows that have been updated since the last build. These two datasets are then merged downstream into dataset Z with dataset Z being either dataset X or the most recent version of each row from dataset Y. This allows us to both have low latency updates and maintain good partitioning.
When rows are deleted in the source table, the rows are no longer present in dataset X but are still present in dataset Y.
What would be the best way keep these 'deleted' rows in dataset Z? Ideally I would also be able to snapshot dataset Y without losing any of the 'deleted' rows.
Good question! From what I understand, you want dataset
Zto have only the most up-to-date rows, including the most up-to-date deleted rows. Both updated rows and deleted rows are present inY. In this case, I would suggest first unioningYandXtogether, so that all rows, including deleted rows are present in the union dataset. Then, use a window function over a date column in order to get the most recent version of each row. Here is an outline of the pyspark code I would suggest for this:Note that this solution does not get around the issue of what happens if Y snapshots.