Let's say, for instance, that I have a Cloud environment and a Client environment, and I want to sync a large amount of data from the cloud to the client. Let's say I have a db table in the cloud named Files and I want an identical table to exist in the client environment.
Now let's assume a few things:
- The Files table is very big.
- The data of each row in Files can be updated at any time, and each row has a last_update column.
- I want to fetch only the deltas and make sure the two environments stay identical.
My solution:
- I make a full sync first, returning all the entries to the client.
- I keep a LastSyncTime in the client environment and keep syncing deltas from that LastSyncTime.
- I do the full sync and the delta syncs using paging: the client fires a first request to get the Count of results for the delta, followed by as many further requests as the Page Size of each request requires.
For example, the count:
SELECT COUNT(*) FROM files WHERE last_update > @LastSyncTime
The page fetching:
SELECT col1, col2..
FROM files
WHERE last_update > @LastSyncTime
ORDER BY files.id
LIMIT @LIMIT
OFFSET @OFFSET
My problem:
What if, for example, the first fetch (the Count fetch) takes some time (a few minutes, say), and during that time more entries are updated or added and now fall inside the last_update window?
For example:
- The Count fetch gave 100 entries for last_update > 1000 seconds.
- 1 entry was updated while the Count was being fetched.
- Now last_update > 1000 seconds gives 101 entries.
- The page fetches, ordered by id, will only get 100 of the 101 entries.
- 1 entry is missed and never synced to the client.
I have tried 2 other options:
- Syncing with a from-to date range limit on last_update.
- Ordering by last_update instead of the id column.
I see issues in both options.
Do not use OFFSET and LIMIT; it goes from OK, to slow, to slower. Instead, keep track of "where you left off" with last_update so that it can be more efficient.

More discussion:

- Since there can be dups of the datetime, be flexible about how many rows to do at once.
- Run this continually. Don't use cron except as a 'keep-alive'.
- There is no need for the initial copy; this code does it for you.
- It is vital to have INDEX(last_update).

Here is the code:
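A minimal sketch of one pass, assuming MySQL, a chunk size of roughly 100 rows, and a session variable @last_sync holding the client's LastSyncTime (the variable names are illustrative):

-- Initialise @last_sync once to something older than every row (e.g. 0 or
-- '1970-01-01'); that makes these passes perform the initial full copy too.

SET @cutoff = NULL;

-- Probe: find the last_update value about 100 rows past where we left off.
-- This is the brief scan of ~100 consecutive entries in INDEX(last_update).
SELECT @cutoff := last_update
    FROM files
    WHERE last_update > @last_sync
    ORDER BY last_update
    LIMIT 99, 1;                 -- the 100th matching row, if there is one

-- Fewer than 100 new/changed rows? Then take everything up to the newest one.
SET @cutoff = COALESCE(@cutoff,
                       (SELECT MAX(last_update) FROM files
                         WHERE last_update > @last_sync));

-- Heavy lifting: fetch the chunk. It may hold slightly more than 100 rows
-- if several rows share the cutoff timestamp; that is fine.
SELECT *
    FROM files
    WHERE last_update > @last_sync
      AND last_update <= @cutoff;

-- ...ship these rows to the client...

SET @last_sync = COALESCE(@cutoff, @last_sync);  -- stay put if nothing was new

Running such passes back-to-back, and sleeping briefly whenever a pass returns nothing, is the "run this continually" part above.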
SELECT @cutoff will be fast -- it is a brief scan of 100 consecutive rows in the index. SELECT * does the heavy lifting, and takes time proportional to the number of rows -- no extra overhead for OFFSET. 100 rows should take about 1 second for the read (assuming a spinning disk and non-cached data).

Instead of initially getting COUNT(*), I would start by getting MAX(last_update), since the rest of the code is based on last_update. This query is "instantaneous" since it only has to probe the end of the index. But I claim you don't even need that!

A possible bug: if rows in the 'source' can be deleted, how do you recognize that?
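For reference, the MAX(last_update) probe mentioned above would be something like:

SELECT MAX(last_update) FROM files;   -- reads only the right-hand end of INDEX(last_update)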
COUNT(*), I would start by gettingMAX(last_update)since the rest of the code is basing onlast_update. This query is "instantaneous" since it only has to probe the end of the index. But I claim you don't even need that!A possible bug: If rows in the 'source' can be deleted, how to you recognize that?