Alain Darte, Daniel Chavarria-Miranda, Robert Fowler, and John Mellor-Crummey. Latin hyper-rectangles for efficient parallelization of line-sweep computations. Submitted to the Annals of Operations Research, December 2001. [ps], [pdf]

Multipartitioning is a strategy for partitioning multi-dimensional arrays among a collection of processors so that line-sweep computations can be performed efficiently. The principal property of a multipartitioned array is that for a line sweep along any array dimension, all processors have the same number of tiles to compute at each step in the sweep, in other words, it describes a latin hyper-rectangle, natural extension of the notion of latin squares. This property results in full, balanced parallelism. A secondary benefit of multipartitionings is that they induce only coarse-grain communication.
All of the multipartitionings described in the literature to date assign only one tile per processor per hyperplane of a multipartitioning (latin hyper-cube). While this class of multipartitionings is optimal for two dimensions, in three dimensions it requires the number of processors to be a perfect square. This paper considers the general problem of computing optimal multipartitionings for multi-dimensional data volumes on an arbitrary number of processors. We describe an algorithm to compute a d-dimensional multipartitioning of a multi-dimensional array for an arbitrary number of processors. When using a multipartitioning to parallelize a line sweep computation, the best partitioning is the one that exploits all of the processors and has the smallest communication volume. To compute the best multipartitioning of a multi-dimensional array, we describe a cost model for selecting d, the dimensionality of the best partitioning, and the number of cuts along each partitioned dimension. In practice, our technique will choose a 3-dimensional multipartitioning for a 3-dimensional line-sweep computation, except when p is a prime; previously, a 3-dimensional multipartitioning could be applied only when sqrt(p) is integral.
Finally, we describe a prototype implementation of generalized multipartitioning in the Rice dHPF compiler and performance results obtained when using it to parallelize a line sweep computation for different numbers of processors.
Keywords: Generalized latin squares, partitions of integers, loop parallelization, array mapping, High Performance Fortran.