prange

Introduction

The goal is for a non-intrusive worksharing construct that enables the use of many CPU cores, without getting in the way, and while still being as close to pure Python as we can manage. Simplicity and safety is valued above handling all cases. We divide into:

A very simple example is to multiply an array by a scalar:

from cython.parallel import *
def func(np.ndarray[double] x, double alpha):
    cdef Py_ssize_t
    with nogil:
        for i in prange(x.shape[0]):
            x[i] = alpha * x[i]

The only change required is range -> prange. This simply hints to Cython that the iterations of the loop body can be executed in any order. If this can be proven to not be the case, Cython is even free to raise a compile-time error.

Also, the prange loop body has the restriction that any functions called must be reentrant.

Python compatability, parallel_for

In pure Python, prange (from the shadow module) is simply implemented as xrange/range. Using multithreading would slow down the majority of usecases anyway, because of the GIL.

Other choices of syntax, such as parallel_for i in range(n), may be more rational, but are unable to execute as pure Python code, and are therefore ruled out.

Threading technology

The first implementation will use (a small subset of!) OpenMP, simply in order to get somewhere quickly. But we do keep the door open for changing that.

Simple usecases

We do automatic inference of thread-private variables in the cases where it is safe. Consider the following example:

from cython.parallel cimport *
def f(np.ndarray[double] x, double alpha):
    cdef double s = 0
    cdef double tmp
    with nogil:
        for i in prange(x.shape[0]):
            # alpha is only read, so shared
            # tmp assigned before being used -> safe and natural to make it implicitly thread-private
            tmp = alpha * i
            s += x[i] * tmp # turns into reduction + thread-private
        s += tmp * 10 # after the loop we emulate sequential loop execution(OpenMP lastprivate)
    return s

By using prange, we promise that the order of execution of each iteration is arbitrary, so the loop body can not depend on values computed by another iteration of the same loop:

Advanced usecases

Explicit thread-local variables

Unlike the simple case above, one may need explicit thread-local variables, e.g., for caching purposes. Because values now carry across iterations of the loop, one needs to explicitly declare them.

from cython.parallel import prange, threadlocal
def f(np.ndarray[int] ell_arr, np.ndarray[double] out):
    cdef Py_ssize_t i, ell
    cdef threadlocal(int) old_ell = -1
    cdef threadlocal(double) alpha = np.nan
    assert ell_arr.shape[0] == m_arr.shape[0] == out.shape[0]
    assert np.all(ell_arr > 0)
    with nogil:
        for i in prange(out.shape[0]):
            ell = ell_arr[i]
            if ell != old_ell:
                alpha = function_of_ell(ell)
                old_ell = ell
            out[i] = alpha * sqrt(ell + i)

The idea is that ell_arr has long stretches of constant value, and that function_of_ell is slow to compute, so that each thread wants to have their own cached copy of alpha between each iteration.

Note that this is a very different use of thread-local variables. We:

Note that pure Python operation still works just fine.

The parallel block, threadsavailable, threadid

Some threads require setup and teardown of scratch space. For these cases there's the parallel block, as well as a couple of auxiliary functions:

from cython.parallel cimport parallel, prange, threadsavailable, threadid

cdef double* buf = <double*>malloc(100 * threadsavailable(schedule='dynamic') * sizeof(double))
cdef double* threadbuf
with nogil, parallel:
    threadbuf = buf + threadid() * 100 # thread setup
    for i in prange(n, schedule='dynamic'):
        ...
    # any thread teardown
free(buf)

There may be more than one prange after one another; at least for the time being they'll terminate with a barrier. Non-prange loops are executed in each thread (if you use the parallel section, you're considered an expert that can deal with this subtle difference).

The parallel block is an obvious place to support more of the OpenMP standard eventually, such as critical sections, barriers etc., if we wish to do so.

Note that threadsavailable must take the same scheduling parameters as prange in order to give an accurate answer. The implementation is based on simply entering an OpenMP parallel section, fetch number of threads, and return.

Scheduling hints

prange takes a number of keyword arguments. For now these would simply be a subset of the OpenMP scheduler flags. We may have a mix of arguments that are hints (backends that don't support them ignore them) and required (backends that don't support them can't be used).

Implementation notes/future

with gil

with gil should be supported eventually, although it can be skipped in a first release. It can be implemented through using the master block in OpenMP. Of course, getting the GIL serializes the execution, so the likely usecase is for raising exceptions.

Break/continue/raise

In the first implementation one would perhaps not support break, continue and raise. (yield is likely never supported?). But when they get implemented:

It is likely OK to sacrifice a little bit of performance (on the order of 10-20%) in order to have threads be responsive and terminate quickly on an exception. If one needs top performance one can always avoid raise/continue/break and use flags manually.

enhancements/prange (last edited 2011-04-06 08:48:27 by DagSverreSeljebotn)