Insert data with timestamps from the past into a time series DB

Let's consider InfluxDB as an example of a TSDB. In outline, it looks like InfluxDB stores data in append-only files sorted by time. But it also claims that it is possible to insert data with arbitrary timestamps, not just append. In the IoT world it is quite a usual scenario to occasionally find some data from the past (for example, some devices were offline for a while and then came back online) and to put this data into the time series DB to plot some charts. How does InfluxDB deal with such scenarios? Will it rewrite the append-only files completely?

There is 1 answer

peterdn On

This is how I understand it. InfluxDB creates a logical database (a shard) for each block of time for which it has data. By default (for a retention policy with infinite duration), the shard group duration is 1 week. Therefore, if you insert measurements with timestamps from, e.g., 4 weeks ago, they land in the shard covering that week and do not affect shards from subsequent weeks.
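To make the time-bucketing idea concrete, here is a minimal Python sketch. It is not InfluxDB code: the function name `shard_window` is made up, and aligning windows to multiples of the duration since the Unix epoch is an assumption for illustration only.

```python
from datetime import datetime, timedelta, timezone

SHARD_GROUP_DURATION = timedelta(weeks=1)  # the 1-week default mentioned above
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)  # assumed alignment origin


def shard_window(ts, duration=SHARD_GROUP_DURATION):
    """Return the [start, end) window of the hypothetical shard group holding ts."""
    index = (ts - EPOCH) // duration  # how many whole windows precede ts
    start = EPOCH + index * duration
    return start, start + duration


now = datetime(2024, 5, 15, 12, 0, tzinfo=timezone.utc)
late = now - timedelta(weeks=4)  # a point that arrives 4 weeks late

# The late point maps to a different window than current writes,
# so it would only touch the shard that covers its own week.
print(shard_window(now))
print(shard_window(late))
```

Under these assumptions, a write is routed purely by its timestamp, so a late point never touches the shards holding more recent data.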

Within each shard, incoming writes are first appended to a WAL (write-ahead log) and also cached in memory. When the WAL and cache are sufficiently full, they are snapshotted to disk, converting them to level 0 TSM (Time-Structured Merge Tree) files. These files are read-only, and measurements are ordered first by series and then by time.
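The write path described above can be modelled with a toy in-memory class. This is only a sketch of the idea: `ToyShard`, its `snapshot_threshold`, and the tuple layout are all hypothetical and far simpler than the real storage engine.

```python
class ToyShard:
    """Toy model of one shard: WAL + cache, snapshotted to sorted read-only files."""

    def __init__(self, snapshot_threshold=4):
        self.wal = []        # append-only log, in arrival order
        self.cache = {}      # series -> list of (timestamp, value)
        self.tsm_files = []  # immutable snapshots ("level 0" files)
        self.snapshot_threshold = snapshot_threshold  # hypothetical size limit

    def write(self, series, ts, value):
        # Writes are always appended, whatever their timestamp is.
        self.wal.append((series, ts, value))
        self.cache.setdefault(series, []).append((ts, value))
        if len(self.wal) >= self.snapshot_threshold:
            self.snapshot()

    def snapshot(self):
        # Sorting happens here: first by series, then by time.
        entries = tuple(
            (series, ts, value)
            for series in sorted(self.cache)
            for ts, value in sorted(self.cache[series])
        )
        self.tsm_files.append(entries)  # a tuple stands in for a read-only file
        self.wal.clear()
        self.cache.clear()


shard = ToyShard(snapshot_threshold=4)
for series, ts, value in [("cpu", 30, 0.3), ("mem", 10, 0.9),
                          ("cpu", 10, 0.1), ("cpu", 20, 0.2)]:
    shard.write(series, ts, value)  # timestamps arrive out of order
```

The point of the sketch is that out-of-order timestamps cost nothing at write time: ordering is imposed only when the cache is flushed to a new file.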

As TSM files grow, they are compacted together, increasing their level. Multiple level 0 snapshots are compacted to produce level 1 files. Less often, multiple level 1 files are compacted to produce level 2 files, and so on up to a maximum of level 4. Compaction ensures that TSM files are optimised to (ideally) contain a minimum set of series, and to minimally overlap with other TSM files. This means that fewer TSM files need to be searched for any particular series/time lookup.
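A compaction step can be sketched as a merge of already-sorted files into one larger sorted file. The `compact` function below is a hypothetical illustration, not the real compactor; letting the later file win on a duplicate (series, timestamp) pair is an assumption of this sketch.

```python
import heapq


def compact(files):
    """Merge sorted runs of (series, ts, value) into one sorted run.

    `files` is ordered oldest to newest. On a duplicate (series, ts) the value
    from the newer file wins (an assumption made for this sketch).
    """
    merged = {}
    # heapq.merge keeps ties in the order of the input iterables,
    # so entries from newer files arrive last and overwrite older ones.
    for series, ts, value in heapq.merge(*files, key=lambda e: (e[0], e[1])):
        merged[(series, ts)] = value
    return [(series, ts, value) for (series, ts), value in merged.items()]


# Two level 0 files whose time ranges overlap for the series "cpu".
level0_a = [("cpu", 10, 1.0), ("cpu", 30, 3.0)]
level0_b = [("cpu", 20, 2.0), ("mem", 10, 9.0)]

level1 = compact([level0_a, level0_b])
```

After the merge, a lookup for "cpu" around timestamp 20 has to consult one file instead of two, which is the read-side benefit described above.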

So knowing this, how would InfluxDB suffer under a workload of writes with random timestamps? If the timestamps are sparsely distributed and our shard group duration is short, i.e. most writes hit different shards, then we will end up with many shards. This means many almost-empty data files, which is inefficient (this very issue is addressed in the InfluxDB FAQ). On the other hand, if the random timestamps are concentrated in one or two shards, their lower-level TSM files will likely overlap significantly in time, meaning all of them need to be searched even for queries over narrow time ranges. This will affect read performance on these kinds of queries.
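The overlap effect can be illustrated by counting how many files a narrow query has to touch. This is a rough sketch with made-up numbers: each file is reduced to a hypothetical (min_time, max_time) pair, and `files_to_search` is not an InfluxDB function.

```python
def files_to_search(file_ranges, query_start, query_end):
    """Return the files whose [min_time, max_time] range intersects the query."""
    return [
        (lo, hi) for lo, hi in file_ranges
        if lo <= query_end and query_start <= hi
    ]


# Files produced by mostly in-order writes: time ranges barely overlap.
in_order_files = [(0, 99), (100, 199), (200, 299)]

# Files produced by writes with random timestamps inside the same shard:
# every snapshot spans almost the whole shard.
random_files = [(3, 290), (12, 280), (1, 299)]

narrow_query = (150, 160)
hit_in_order = files_to_search(in_order_files, *narrow_query)
hit_random = files_to_search(random_files, *narrow_query)

print(len(hit_in_order), len(hit_random))
```

With these example ranges, the in-order layout needs one file for the narrow query, while the random layout needs all three until compaction reduces the overlap.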

More information can be found in these resources: