Abstractions for parallelism and scheduling algorithms

While multicore computers have become mainstream, multicore programming remains full of pitfalls. Using P cores to speed up one's program by a factor close to P turns out to be far trickier than it sounds. Beyond the immediate difficultly of parallelizing existing algorithms, one faces the problem of amortizing the costs of task creation, the challenge of dealing with dynamic load balancing, the hazards of concurrent programming, and the headaches associated with weak memory models.

However, by using the right programming language abstractions together with a flexible scheduler, multicore programming can become much more accessible. Since January 2011, I have been working with Umut Acar and Mike Rainey, at the Max Planck Institute for Software System, on the development of new abstractions and scheduling algorithms for multicore programming.

Granularity control using complexity annotations

One central difficulty in parallelism is to spawn sequential subtasks of the right size. On the one hand, creating sequential tasks that are too small leads to unacceptable overheads. On the other hand, creating sequential tasks that are too big caps the number of cores that can be used. The traditional approach to granularity control consists in deciding whether to sequentialize subtasks or not based on a cutoff value hard-coded in the source code, however this approach is not portable at all. Another approach is auto-tuning, which consists in automatically trying various possible values for the cutoff on a given hardware, but this takes time and requires samples of input data.

We have developed a new approach to granularity control that combines asymptotic complexity annotations with runtime profiling. This approach applies to any divide-and-conquer algorithm whose worst-case complexity matches its average complexity. We only require the programmer to annotate his parallel functions with an asymptotic complexity expression. We then use runtime profiling for deducing the constant factors that apply. Using this information, we are able to predict execution time and enforce our scheduling policy: any subtask that is predicted to take less than a fixed amount of time gets sequentialized.

We have established bounds showing that our granularity control strategy leads to provably-good parallel runtimes. Moreover, we have shown that it works very well in practice. These results are described in the following publication.

Oracle Scheduling: Controlling Granularity in Implicitly Parallel Languages
with U. Acar and M. Rainey
OOPSLA, October 2011

A modular scheduler for dynamic computation DAGs

We have developed a new interface for parallel schedulers that contains only three functions: one function to add a node to the computation DAG (i.e., spawn a subtask), one function to add an edge (i.e., add a dependency between two tasks), and one function to transfer outgoing edges from a node to another one (this is useful for dynamically expanding an existing node into a subgraph). For each task, one may explicitly specify the strategy that should be used for detecting the readiness of the task, that is, for maintaining the number of incoming edges. Indeed, using a shared counter does not always scales up well with the number of cores, so depending on the algorithms other strategies such as message-passing or distributed counters might perform better.

All the existing parallelism constructs such as fork-join, sync-spawn, futures and lazy futures, as well as global termination detection (useful for, e.g., a DFS algorithm) can be easily expressed on top of our interface. We have implemented in C++ a scheduler that implements our general dynamic DAG interface. More details can be found in the following workshop paper.

Efficient Primitives for Creating and Scheduling Parallel Computations
with U. Acar and M. Rainey
Declarative Aspects and Applications of Multicore Programming (DAMP), January 2012
We have moreover implemented our scheduler modularly with respect to the load balancing algorithm, allowing to switch at runtime between different strategies. This flexibility allows to select the load balancing strategy that is the best suited for each algorithm, and thereby obtain good speedups. For example, we were able to set up a new sorting algorithm that, on 30 cores, achieves a 26x speedup over a sequential call to the quicksort from the C library.

Synchronization-free work-stealing

Work stealing is a classic algorithm for performing dynamic load-balancing. Each processor maintains a deque of ready tasks, and processors that run out of work steal work from the deque of another processor picked uniformly at random. The algorithm has been improved over the years in order to reduce the number of expensive synchronization operations, such as compare-and-swap of fetch-and-add. The state-of-the-art algorithm (Chase and Lev, 2005) only requires compare-and-swap when the deques contain less than one element, however it still requires to pay for one load-store memory barrier for every task.

We have developed a variant to work-stealing that does not require any expensive synchronization operation. More precisely, our algorithm does not use any lock, compare-and-swap operation, fetch-and-add operation, nor load-store barrier. We require only a small number of store-store memory barriers. These barriers are much cheaper than load-store barrier because they do not require the immediate flushing of the write buffer. In particular, these barriers are not even needed at all on x86 architecture.

This work has been submitted to an international conference.