Is cache coherency required for memory consistency?


Cache coherency deals with the ordering of reads and writes to a single memory location in the presence of caches, while memory consistency is about the ordering of accesses across all locations, with or without caches.

Normally, processors/compilers guarantee only weak memory ordering, requiring the programmer to insert appropriate synchronisation events (fences, acquire/release operations) to ensure memory consistency.
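For concreteness, here is a minimal C11 sketch of such a synchronisation event: a release store paired with an acquire load, which is what a programmer typically inserts on a weakly ordered machine to hand data from one thread to another.

```c
/* Minimal C11 sketch: a release/acquire pair orders the plain access
 * to `data` across threads on a weakly ordered machine. */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static int data;                 /* plain shared data      */
static atomic_int ready = 0;     /* synchronisation flag   */

static void *producer(void *arg) {
    (void)arg;
    data = 42;                                         /* ordinary write */
    atomic_store_explicit(&ready, 1,
                          memory_order_release);       /* sync event     */
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    while (!atomic_load_explicit(&ready,
                                 memory_order_acquire))  /* sync event   */
        ;                                                /* spin         */
    printf("%d\n", data);        /* guaranteed to print 42 */
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```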

My question is: if programmers have to insert these events anyway, is it possible to achieve memory consistency without cache coherency in cache-based processors? What are the trade-offs? As far as I know, GPUs don't have coherent caches, so it should indeed be possible.

My intuition is that synchronisation events would become terribly slow without cache coherency, because whole caches might need to be invalidated/flushed at each sync, instead of specific lines being flushed/invalidated continuously in the background by the coherency machinery. But I could not find any material discussing these trade-offs (and ChatGPT didn't help either ;)).
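To make the intuition concrete, here is a rough C sketch of what the same release/acquire pair might look like on a non-coherent cache-based machine. The cache_writeback_all()/cache_invalidate_all() intrinsics are hypothetical stand-ins for whatever such hardware would provide (some ISAs offer line-granular flush/invalidate by address, which helps, but only if software knows which lines to name):

```c
/* Hypothetical intrinsics: not a real API, just stand-ins for ISA-level
 * cache maintenance operations on a non-coherent machine. */
extern void cache_writeback_all(void);   /* flush all dirty lines to memory */
extern void cache_invalidate_all(void);  /* drop all (possibly stale) lines */

static volatile int flag;   /* assume the flag itself lives in uncached memory */
static int data;

void release_side(void) {
    data = 42;
    cache_writeback_all();   /* expensive: writes back every dirty line,
                                not just the one holding `data` */
    flag = 1;
}

void acquire_side(void) {
    while (flag == 0)
        ;                    /* spin until the writer signals */
    cache_invalidate_all();  /* expensive: also evicts unrelated hot data */
    /* subsequent reads of `data` now fetch fresh values from memory */
}
```

Besides the latency of the flush itself, the whole-cache invalidate evicts unrelated hot lines, so part of the cost of each sync is paid again afterwards as extra misses.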


There is 1 answer

Answer by Nimish Shah:

There is a section on this in the book "Parallel Computer Architecture: A Hardware/Software Approach" by Culler, Singh, and Gupta (perhaps a bit outdated):

Shared Address Space without Coherent Replication

Systems in this category support a shared address space abstraction through the language and compiler but without automatic replication and coherence, just like the CRAY T3D and T3E did in hardware. One type of example is a data parallel language like High Performance Fortran (see Chapter 2). The distributions of data specified by the user, together with the owner computes rule, are used by the compiler or run-time system to translate off-node memory references to explicit messages, to make messages larger, to align data for better spatial locality, and so on. Replication and coherence are usually left up to the user, which compromises ease of programming; alternatively, system software may try to manage coherent replication in main memory automatically. Efforts similar to HPF are being made with languages based on C and C++ as well (Bodin et al. 1993; Larus, Richards, and Viswanathan 1996).

A more flexible language- and compiler-based approach is taken by the Split-C language (Culler et al. 1993). Here, the user explicitly specifies arrays as being local or global (shared) and for global arrays specifies how they should be laid out among physical memories. Computation may be assigned independently of the data layout, and references to global arrays are converted into messages by the compiler or run-time system based on the layout. The decoupling of computation assignment from data distribution makes the language much more flexible than an owner computes rule for load-balancing irregular programs, but it still does not provide automatic support for replication and coherence, which can be difficult for the programmer to manage. Of course, all these software systems can be easily ported to hardware-coherent shared address space machines, in which case the shared address space, replication, and coherence are implicitly provided. In this case, the run-time system may be used to manage replication and coherence in main memory and to transfer data in larger chunks than cache blocks, but these capabilities may not be necessary.
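For a flavour of what this looks like in code, here is an illustrative Split-C-style fragment. The syntax is reconstructed from the Split-C paper (spread arrays, the split-phase ":=" assignment, barrier()/sync(), and the built-ins MYPROC/PROCS), so treat it as pseudocode rather than exact, compilable Split-C:

```c
/* Illustrative Split-C-style sketch: a "spread" array is laid out across
 * processor memories, any processor may reference any element, and the
 * compiler/run-time turns remote references into messages. There is no
 * coherent caching of remote data; synchronisation is explicit. */

double A[PROCS]::;              /* spread array: one element per processor */

void splitc_main(void) {
    double x;

    A[MYPROC] = MYPROC;         /* write to the locally resident element   */
    barrier();                  /* explicit global synchronisation         */

    /* Split-phase read of a neighbour's element: ":=" issues the get
     * asynchronously and sync() waits for it to complete. */
    x := A[(MYPROC + 1) % PROCS];
    sync();

    barrier();
}
```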

The C- and C++-based languages mentioned above are:

Parallel Programming in C**: A Large-Grain Data-Parallel Programming Language

Implementing a parallel C++ runtime system for scalable parallel systems

Parallel programming in Split-C

These look like predecessors of CUDA. So a lack of coherency perhaps makes sense for massively parallel workloads, where the relatively slow synchronizations (due to the lack of coherency) still account for only a tiny fraction of the overall runtime.

CRAY T3D and T3E indeed had a shared address space without hardware-supported coherency.