Atomic Read-Modify-Write Operations are Unnecessary for Shared-Memory Work Stealing

Abstract

We present a work-stealing algorithm for total-store memory architectures, such as Intel's X86, that does not rely on atomic read-modify-write instructions such as compare-and-swap. In our algorithm, processors communicate solely by reading from and writing (non-atomically) into weakly consistent memory. We also show that join resolution, an important problem in scheduling parallel programs, can also be solved without using atomic read-modify-write instructions. At a high level, our work-stealing algorithm closely resembles traditional work-stealing algorithms, but certain details are more complex. Instead of relying on atomic read-modify-write operations, our algorithm uses a steal protocol that enables processors to perform load balancing by using only two memory cells per processor. The steal protocol permits data races but guarantees correctness by using a time-stamping technique. Proving the correctness of our algorithms is made challenging by weakly consistent shared-memory that permits processors to observe sequentially inconsistent views. We therefore carefully specify our algorithms and prove them correct by considering a costed refinement of the X86-TSO model, a precise characterization of total-store-order architectures. We show that our algorithms are practical by implementing them as part of a C++ library and performing an experimental evaluation. Our results show that our work-stealing algorithm is competitive with the state-of-the-art implementations even on current architectures where atomic read-modify-write instructions are cheap. Our join resolution algorithm incurs a relatively small overhead compared to an efficient algorithm that uses atomic read-modify-write instructions.

Paper

Umut A. Acar, Arthur Charguéraud, Stefan Muller and Mike Rainey
Inria Technical Report, September 2013