I'm working on a project that replays an event log (flat file format, for historical reasons) that should produce a resulting model identical to the legacy code, which is single-threaded. I'm trying to figure out good ways to speed this up; obviously the parsing and some of the validation can be parallelized, which brings them down to maybe 10% of the total time including I/O, but much of the validation is ordering-specific, and allowing reordering of events could result in unexpected validation failures, or worse, silently different results.
This seems like a job for Software Transactional Memory, since this is basically just a classic case of transactions just like databases. Except it isn't: this is offline processing, not online; the purpose of parallelizing this is to maximize deserialization speed, which effectively maximizes write contention and also the chance for reordering. (And these replays will occur repeatedly, so even a 0.1% chance of trouble per run is too high.) And Clojure's STM is of course specifically designed to retry transactions from a later basis if there's conflict in the refs — reordering semantically-related events. (This is fine in online transaction processing, where there is by definition no way to tell which event should run first. It's not fine when there is already a set ordering.) There are options to set minimum and maximum basis histories for each ref, which could potentially help, but seem unable to completely prevent this in general.
Is there a robust, principled way to still make this work (exploiting the concurrency of unrelated writes to different refs), or is it fundamentally a Hard™ problem? Or, equivalently, is STM actually the right hammer for this nail?
It seems the transaction part must be serialized in a single thread to guarantee the same ordering. However, all pre-processing (and post-processing?) could be parallelized for a nice speedup.
A. Pre-proc:
B. Tx:
atomor real DB or...?). Since a single thread will have no contention (i.e. no aborted transactions or re-trys) it will be very fast (please see many YouTube videos about the LMax Exchange from Martin Thompson and Dave Farley).C. Post-Proc:
For (A), you could use the
pmapfunction to get started, or just jump into the excellent Claypoole library.