I have 2 tables in SQL Server - staging_table and final_table.
Data is pulled from an API and goes through staging_table and from there it is merged into the final table.
The final_table contains 20 million rows accumulated over time. The staging table is a temporary table that gets refreshed during each data pipeline run, and may contain about 10000 records at a time. It is then merged into final table for updates, deletes and inserts.
This is what my SQL merge query currently looks like:
WITH trg AS
(
SELECT *
FROM final_table
WHERE transaction_id IN (SELECT transaction_id FROM staging_table)
)
MERGE trg
USING staging_table AS src ON trg.transaction_id = src.transaction_id
WHEN NOT MATCHED BY TARGET THEN
INSERT (practice_id...)
VALUES (practice_id...)
-- matching records in source and target - update
WHEN MATCHED AND
(ISNULL(trg.[practice_id], '') != ISNULL(src.[practice_id], '')
...)
THEN
UPDATE
SET [practice_id] = src.[practice_id],
...;
-- exists in target but not in source - delete
WHEN NOT MATCHED BY SOURCE THEN
DELETE;
I'm using a CTE for target table trg so that during deletion, it will only delete against the 10000 or so records in staging table if not matched by source. Otherwise it would've deleted millions of existing records. However there is obviously issue here because if the API deleted 1 record from their end, this record won't even be in staging table but it exists in the final table and I'm wondering how I can delete this.
Please help.
If you can get the API to add a flag in staging for "deleted_flag" and get those records inserted into the staging table, you can simplify your MERGE statement to only do UPSERTs. Completely gut the portion doing DELETE and handle it outside the MERGE.
Then, handle the delete outside of the MERGE.
Also, make sure you've got an good indexes on transaction_id for both the staging and final table. Include any other evaluated columns as indexed or INCLUDE to avoid key lookups.