Reproducible replays of single-threaded write logs using Clojure STM

Question

Reproducible replays of single-threaded write logs using Clojure STM

75 views Asked by Nathan Tuggy At 05 May 2023 at 14:51

I'm working on a project that replays an event log (flat file format, for historical reasons) that should produce a resulting model identical to the legacy code, which is single-threaded. I'm trying to figure out good ways to speed this up; obviously the parsing and some of the validation can be parallelized, which brings them down to maybe 10% of the total time including I/O, but much of the validation is ordering-specific, and allowing reordering of events could result in unexpected validation failures, or worse, silently different results.

This seems like a job for Software Transactional Memory, since this is basically just a classic case of transactions just like databases. Except it isn't: this is offline processing, not online; the purpose of parallelizing this is to maximize deserialization speed, which effectively maximizes write contention and also the chance for reordering. (And these replays will occur repeatedly, so even a 0.1% chance of trouble per run is too high.) And Clojure's STM is of course specifically designed to retry transactions from a later basis if there's conflict in the refs — reordering semantically-related events. (This is fine in online transaction processing, where there is by definition no way to tell which event should run first. It's not fine when there is already a set ordering.) There are options to set minimum and maximum basis histories for each ref, which could potentially help, but seem unable to completely prevent this in general.

Is there a robust, principled way to still make this work (exploiting the concurrency of unrelated writes to different refs), or is it fundamentally a Hard™ problem? Or, equivalently, is STM actually the right hammer for this nail?

Original Q&A

There are 1 answers

**Alan Thompson** · Answer 1 · 2023-05-05T16:36:07+00:00

It seems the transaction part must be serialized in a single thread to guarantee the same ordering. However, all pre-processing (and post-processing?) could be parallelized for a nice speedup.

A. Pre-proc:

Read the source file in large chunks of data as a lazy stream of bytes.
Use a thread pool of workers to parse the text into a stream of small, efficient events.

B. Tx:

A single thread performs the transactions into a mutable data store (Clojure atom or real DB or...?). Since a single thread will have no contention (i.e. no aborted transactions or re-trys) it will be very fast (please see many YouTube videos about the LMax Exchange from Martin Thompson and Dave Farley).

C. Post-Proc:

This could also be parallelized if desired.

For (A), you could use the pmap function to get started, or just jump into the excellent Claypoole library.

TechQA.

Reproducible replays of single-threaded write logs using Clojure STM

There are 1 answers

Related Questions in PARALLEL-PROCESSING

Related Questions in CLOJURE

Related Questions in STM

Popular Questions

Trending Questions