Cython for NumPy users
- Cython at a glance
- Your Cython environment
- The first Cython program
- Adding types
- Cython specific methods
- Using the Numpy C API
- The future
This tutorial is aimed at NumPy users who have no experience with Cython at all. If you have some knowledge of Cython you may want to skip to the Efficient indexing section which explains the new improvements made in summer 2008.
For passing (the data in) a numpy array to C or C++ code, see: tutorials/NumpyPointerToC
The main scenario considered is NumPy end-use rather than NumPy/SciPy development. The reason is that Cython is not (yet) able to support functions that are generic with respect to datatype and the number of dimensions in a high-level fashion. This restriction is much more severe for SciPy development than more specific, "end-user" functions. See the last section for more information on this.
The style of this tutorial will not fit everybody, so you can also consider:
Robert Bradshaw's slides on Cython for SciPy2008 (a higher-level and quicker introduction)
Basic Cython documentation (see Cython front page).
Note: The fast array access documented below is a completely new feature, and there may be bugs waiting to be discovered. It might be a good idea to do a manual sanity check on the C code Cython generates before using this for serious purposes, at least until some months have passed.
Cython at a glance
Cython is a compiler which compiles Python-like code files to C code. Still, Cython is not a Python to C translator. That is, it doesn't take your full program and "turns it into C" -- rather, the result makes full use of the Python runtime environment. A way of looking at it may be that your code is still Python in that it runs within the Python runtime environment, but rather than compiling to interpreted Python bytecode one compiles to native machine code (but with the addition of extra syntax for easy embedding of faster C-like code).
This has two important consequences:
- Speed. How much depends very much on the program involved though. Typical Python numerical programs would tend to gain very little as most time is spent in lower-level C that is used in a high-level fashion. However for-loop-style programs can gain many orders of magnitude, when typing information is added (and is so made possible as a realistic alternative).
- Easy calling into C code. One of Cython's purposes is to allow easy wrapping of C libraries. When writing code in Cython you can call into C code as easily as into Python code.
Some Python constructs are not yet supported, though making Cython compile all Python code is a stated goal (among the more important omissions are inner functions and generator functions).
Your Cython environment
Using Cython consists of these steps:
- Write a .pyx source file
- Run the Cython compiler to generate a C file
- Run a C compiler to generate a compiled library
- Run the Python interpreter and ask it to import the module
However there are several options to automate these steps:
- The SAGE mathematics software system provides excellent support for using Cython and NumPy from an interactive command line (like IPython) or through a notebook interface (like Maple/Mathematica). See this documentation.
- A version of pyximport is shipped with Cython, so that you can import pyx-files dynamically into Python and have them compiled automatically.
- Cython supports distutils so that you can very easily create build scripts which automate the process; this is the preferred method for full programs.
- Manual compilation (see below)
Note: If using an interactive command line environment other than SAGE, like IPython or Python itself, it is important that you restart the process when you recompile the module. It is not enough to issue an "import" statement again.
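The pyximport route mentioned above can be sketched as follows (a minimal sketch, assuming Cython is installed and that convolve1.pyx -- the example module from this tutorial -- sits on the import path):

```python
# pyximport compiles .pyx files on the fly at import time.
# This requires Cython to be installed.
import pyximport
pyximport.install()

# This import triggers the cython + C compilation behind the scenes.
import convolve1
```

This is convenient for experimentation; for full programs the distutils route below is preferred.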
Unless you are used to some other automatic method: download Cython (0.9.8.1.1 or later), unpack it, and run the usual python setup.py install. This will install a cython executable on your system. It is also possible to use Cython from the source directory without installing (simply launch cython.py in the root directory).
As of this writing SAGE comes with an older release of Cython than required for this tutorial. So if using SAGE you should download the newest Cython and then execute
$ cd path/to/cython-distro
$ path-to-sage/sage -python setup.py install
This will install the newest Cython into SAGE.
As it is always important to know what is going on, I'll describe the manual method here. First Cython is run:
$ cython yourmod.pyx
This creates yourmod.c which is the C source for a Python extension module. A useful additional switch is -a which will generate a document (yourmod.html) that shows which Cython code translates to which C code line by line.
Then we compile the C file. This may vary according to your system, but the C file should be built like Python was built. Python documentation for writing extensions should have some details. On Linux this often means something like
$ gcc -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing -I/usr/include/python2.5 -o yourmod.so yourmod.c
gcc should have access to the NumPy C header files so if they are not installed at /usr/include/numpy or similar you may need to pass another option for those.
This creates yourmod.so in the same directory, which is importable by Python by using a normal import yourmod statement.
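Since distutils is the preferred method for full programs, here is a minimal build script sketch to go with the manual method above (the module name convolve1 is just this tutorial's example; numpy.get_include() supplies the NumPy header directory mentioned earlier):

```python
# setup.py -- minimal distutils build script for a Cython module.
# Build with:  python setup.py build_ext --inplace
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

import numpy

setup(
    cmdclass={'build_ext': build_ext},
    ext_modules=[
        Extension("convolve1", ["convolve1.pyx"],
                  # make the NumPy C header files visible to the C compiler
                  include_dirs=[numpy.get_include()]),
    ],
)
```

Running the build_ext command performs the cython and gcc steps for you and leaves an importable convolve1.so next to the source.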
The first Cython program
The code below does 2D discrete convolution of an image with a filter (and I'm sure you can do better! -- let it serve for demonstration purposes). It is both valid Python and valid Cython code. I'll refer to it as both convolve_py.py for the Python version and convolve1.pyx for the Cython version -- Cython uses ".pyx" as its file suffix.
```
from __future__ import division
import numpy as np
def naive_convolve(f, g):
    # f is an image and is indexed by (v, w)
    # g is a filter kernel and is indexed by (s, t),
    #   it needs odd dimensions
    # h is the output image and is indexed by (x, y),
    #   it is not cropped
    if g.shape[0] % 2 != 1 or g.shape[1] % 2 != 1:
        raise ValueError("Only odd dimensions on filter supported")
    # smid and tmid are number of pixels between the center pixel
    # and the edge, ie for a 5x5 filter they will be 2.
    #
    # The output size is calculated by adding smid, tmid to each
    # side of the dimensions of the input image.
    vmax = f.shape[0]
    wmax = f.shape[1]
    smax = g.shape[0]
    tmax = g.shape[1]
    smid = smax // 2
    tmid = tmax // 2
    xmax = vmax + 2*smid
    ymax = wmax + 2*tmid
    # Allocate result image.
    h = np.zeros([xmax, ymax], dtype=f.dtype)
    # Do convolution
    for x in range(xmax):
        for y in range(ymax):
            # Calculate pixel value for h at (x,y). Sum one component
            # for each pixel (s, t) of the filter g.
            s_from = max(smid - x, -smid)
            s_to = min((xmax - x) - smid, smid + 1)
            t_from = max(tmid - y, -tmid)
            t_to = min((ymax - y) - tmid, tmid + 1)
            value = 0
            for s in range(s_from, s_to):
                for t in range(t_from, t_to):
                    v = x - smid + s
                    w = y - tmid + t
                    value += g[smid - s, tmid - t] * f[v, w]
            h[x, y] = value
    return h
```
This should be compiled to produce yourmod.so (for Linux systems). We run a Python session to test both the Python version (imported from .py-file) and the compiled Cython module.
```
In : import numpy as np
In : import convolve_py
In : convolve_py.naive_convolve(np.array([[1, 1, 1]], dtype=np.int),
...:     np.array([[1], [2], [1]], dtype=np.int))
Out:
array([[1, 1, 1],
       [2, 2, 2],
       [1, 1, 1]])
In : import convolve1
In : convolve1.naive_convolve(np.array([[1, 1, 1]], dtype=np.int),
...:     np.array([[1], [2], [1]], dtype=np.int))
Out:
array([[1, 1, 1],
       [2, 2, 2],
       [1, 1, 1]])
In : N = 100
In : f = np.arange(N*N, dtype=np.int).reshape((N,N))
In : g = np.arange(81, dtype=np.int).reshape((9, 9))
In : %timeit -n2 -r3 convolve_py.naive_convolve(f, g)
2 loops, best of 3: 1.86 s per loop
In : %timeit -n2 -r3 convolve1.naive_convolve(f, g)
2 loops, best of 3: 1.41 s per loop
```
There's not such a huge difference yet, because the C code still does exactly what the Python interpreter does (meaning, for instance, that a new object is allocated for each number used). Look at the generated html file and see what is needed for even the simplest statements -- you get the point quickly. We need to give Cython more information; we need to add types.
Adding types
To add types we use custom Cython syntax, so we are now breaking Python source compatibility. Here's convolve2.pyx. Read the comments!
```
from __future__ import division
import numpy as np
# "cimport" is used to import special compile-time information
# about the numpy module (this is stored in a file numpy.pxd which is
# currently part of the Cython distribution).
cimport numpy as np
# We now need to fix a datatype for our arrays. I've used the variable
# DTYPE for this, which is assigned to the usual NumPy runtime
# type info object.
DTYPE = np.int
# "ctypedef" assigns a corresponding compile-time type to DTYPE_t. For
# every type in the numpy module there's a corresponding compile-time
# type with a _t-suffix.
ctypedef np.int_t DTYPE_t
# The builtin min and max functions work with Python objects, and are
# therefore very slow. So we create our own.
#  - "cdef" declares a function which has much less overhead than a normal
#    def function (but it is not Python-callable)
#  - "inline" is passed on to the C compiler which may inline the functions
#  - The C type "int" is chosen as return type and argument types
#  - Cython allows some newer Python constructs like "a if x else b", but
#    the resulting C file compiles with Python 2.3 through to Python 3.0 beta.
cdef inline int int_max(int a, int b): return a if a >= b else b
cdef inline int int_min(int a, int b): return a if a <= b else b
# "def" can type its arguments but not have a return type. The type of the
# arguments for a "def" function is checked at run-time when entering the
# function.
#
# The arrays f, g and h are typed as "np.ndarray" instances. The only effect
# this has is to a) insert checks that the function arguments really are
# NumPy arrays, and b) make some attribute access like f.shape[0] much
# more efficient. (In this example this doesn't matter though.)
# The "not None" will enforce that an exception is raised if None is passed in.
# If the "not None" is left out, the function will accept either an ndarray or
# None, but the None will cause problems later if not explicitly handled.
def naive_convolve(np.ndarray f not None, np.ndarray g not None):
    if g.shape[0] % 2 != 1 or g.shape[1] % 2 != 1:
        raise ValueError("Only odd dimensions on filter supported")
    assert f.dtype == DTYPE and g.dtype == DTYPE
    # The "cdef" keyword is also used within functions to type variables. It
    # can only be used at the top indentation level (there are non-trivial
    # problems with allowing them in other places, though we'd love to see
    # good and thought out proposals for it).
    #
    # For the indices, the "int" type is used. This corresponds to a C int,
    # other C types (like "unsigned int") could have been used instead.
    # Purists could use "Py_ssize_t" which is the proper Python type for
    # array indices.
    cdef int vmax = f.shape[0]
    cdef int wmax = f.shape[1]
    cdef int smax = g.shape[0]
    cdef int tmax = g.shape[1]
    cdef int smid = smax // 2
    cdef int tmid = tmax // 2
    cdef int xmax = vmax + 2*smid
    cdef int ymax = wmax + 2*tmid
    cdef np.ndarray h = np.zeros([xmax, ymax], dtype=DTYPE)
    cdef int x, y, s, t, v, w
    # It is very important to type ALL your variables. You do not get any
    # warnings if not, only much slower code (they are implicitly typed as
    # Python objects).
    cdef int s_from, s_to, t_from, t_to
    # For the value variable, we want to use the same data type as is
    # stored in the array, so we use "DTYPE_t" as defined above.
    # NB! An important side-effect of this is that if "value" overflows its
    # datatype size, it will simply wrap around like in C, rather than raise
    # an error like in Python.
    cdef DTYPE_t value
    for x in range(xmax):
        for y in range(ymax):
            s_from = int_max(smid - x, -smid)
            s_to = int_min((xmax - x) - smid, smid + 1)
            t_from = int_max(tmid - y, -tmid)
            t_to = int_min((ymax - y) - tmid, tmid + 1)
            value = 0
            for s in range(s_from, s_to):
                for t in range(t_from, t_to):
                    v = x - smid + s
                    w = y - tmid + t
                    value += g[smid - s, tmid - t] * f[v, w]
            h[x, y] = value
    return h
```
At this point, have a look at the generated C code for convolve1.pyx and convolve2.pyx. Click on the lines to expand them and see corresponding C. (Note that this code annotation is currently experimental and especially "trailing" cleanup code for a block may stick to the last expression in the block and make it look worse than it is -- use some common sense).
Especially have a look at the for loops: In convolve1.c, these are ~20 lines of C code to set up while in convolve2.c a normal C for loop is used.
After building this and continuing my (very informal) benchmarks, I get
```
In : import convolve2
In : %timeit -n2 -r3 convolve2.naive_convolve(f, g)
2 loops, best of 3: 828 ms per loop
```
Cython specific methods
This section is about Cython's special features for working with numpy arrays. The paper Fast numerical computations with Cython (2009) is a more up to date source on Cython's numpy specific features.
There's still a bottleneck killing performance, and that is the array lookups and assignments. The []-operator still uses full Python operations -- what we would like to do instead is to access the data buffer directly at C speed.
What we need to do then is to type the contents of the ndarray objects. We do this with a special "buffer" syntax which must be told the datatype (first argument) and number of dimensions ("ndim" keyword-only argument, if not provided then one-dimensional is assumed).
More information on this syntax can be found here.
Showing the changes needed to produce convolve3.pyx only:
```
...
def naive_convolve(np.ndarray[DTYPE_t, ndim=2] f not None,
                   np.ndarray[DTYPE_t, ndim=2] g not None):
    ...
    cdef np.ndarray[DTYPE_t, ndim=2] h = ...
```
```
In : import convolve3
In : %timeit -n3 -r100 convolve3.naive_convolve(f, g)
3 loops, best of 100: 11.6 ms per loop
```
Note the importance of this change.
Gotcha: This efficient indexing only affects certain index operations, namely those with exactly ndim number of typed integer indices. So if v for instance isn't typed, then the lookup f[v, w] isn't optimized. On the other hand this means that you can continue using Python objects for sophisticated dynamic slicing etc. just as when the array is not typed.
Tuning indexing further
The array lookups are still slowed down by two factors:
- Bounds checking is performed.
- Negative indices are checked for and handled correctly.
The code above is explicitly coded so that it doesn't use negative indices, and it (hopefully) always accesses within bounds. We can add a decorator to disable bounds checking:
```
...
cimport cython
@cython.boundscheck(False) # turn off bounds-checking for entire function
def naive_convolve(np.ndarray[DTYPE_t, ndim=2] f not None,
                   np.ndarray[DTYPE_t, ndim=2] g not None):
    ...
```
Now bounds checking is not performed (and, as a side-effect, if you do happen to access out of bounds you will in the best case crash your program and in the worst case corrupt data). It is possible to switch bounds-checking mode in many ways, see compiler directives for more information.
Negative indices are dealt with by assuring Cython that the indices will be positive, by casting the variables to unsigned integer types (if you do have negative values, then this casting will create a very large positive value instead and you will attempt to access out-of-bounds values). Casting is done with a special <>-syntax. The code below is changed to use either unsigned ints or casting as appropriate:
```
...
cdef int s, t                   # changed
cdef unsigned int x, y, v, w    # changed
cdef int s_from, s_to, t_from, t_to
cdef DTYPE_t value
for x in range(xmax):
    for y in range(ymax):
        s_from = max(smid - x, -smid)
        s_to = min((xmax - x) - smid, smid + 1)
        t_from = max(tmid - y, -tmid)
        t_to = min((ymax - y) - tmid, tmid + 1)
        value = 0
        for s in range(s_from, s_to):
            for t in range(t_from, t_to):
                v = <unsigned int>(x - smid + s)    # changed
                w = <unsigned int>(y - tmid + t)    # changed
                value += g[<unsigned int>(smid - s), <unsigned int>(tmid - t)] * f[v, w]    # changed
        h[x, y] = value
...
```
(In the next Cython release we will likely add a compiler directive or argument to the np.ndarray-type specifier to disable negative indexing so that so much casting isn't necessary; feedback on this is welcome.)
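The out-of-bounds danger from casting a negative value to an unsigned type can be demonstrated from plain Python (using the standard library's ctypes here just for illustration):

```python
import ctypes

# Casting -1 to a 32-bit unsigned integer wraps around to a very large
# positive value, exactly as the <unsigned int> cast in Cython would.
wrapped = ctypes.c_uint32(-1).value
assert wrapped == 4294967295  # i.e. 2**32 - 1
```

Used as an array index, such a value is far out of bounds, which together with boundscheck(False) means reading or writing arbitrary memory.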
The function call overhead now starts to play a role, so we compare the latter two examples with larger N:
```
In : %timeit -n3 -r100 convolve4.naive_convolve(f, g)
3 loops, best of 100: 5.97 ms per loop
In : N = 1000
In : f = np.arange(N*N, dtype=np.int).reshape((N,N))
In : g = np.arange(81, dtype=np.int).reshape((9, 9))
In : %timeit -n1 -r10 convolve3.naive_convolve(f, g)
1 loops, best of 10: 1.16 s per loop
In : %timeit -n1 -r10 convolve4.naive_convolve(f, g)
1 loops, best of 10: 597 ms per loop
```
(Also this is a mixed benchmark as the result array is allocated within the function call.)
Speed comes with some cost. In particular, it can be dangerous to set typed objects (like f, g and h in our sample code) to None. Setting such objects to None is entirely legal, but all you can do with them is check whether they are None. All other use (attribute lookup or indexing) can potentially segfault or corrupt data (rather than raising exceptions as they would in Python).
The actual rules are a bit more complicated but the main message is clear: Do not use typed objects without knowing that they are not set to None.
Passing None (or not) into a function
By default, when you declare a type in the function signature, Cython will raise an exception if the function is called with another type passed in. However, it will also accept None, so that there is a way to signify "no value". If that None is not handled properly and is used in an indexing operation, you can get a segfault or bus error or other hard crash. If you want your function not to accept None, you can use the "not None" declaration, as so:
```
...
def naive_convolve(np.ndarray[DTYPE_t, ndim=2] f not None,
                   np.ndarray[DTYPE_t, ndim=2] g not None):
    ...
```
In this case the function will raise an exception immediately if the argument is None. It's probably good practice to always use "not None" unless you write code that explicitly checks for None.
More generic code
It would be possible to do
```
def naive_convolve(object[DTYPE_t, ndim=2] f, ...):
```
i.e. use object rather than np.ndarray. Under Python 3.0 this can allow your algorithm to work with any libraries supporting the buffer interface; and support under Python 2.x for e.g. the Python Imaging Library may easily be added if someone is interested.
There is some speed penalty to this though (as one makes more assumptions compile-time if the type is set to np.ndarray, specifically it is assumed that the data is stored in pure strided mode and not in indirect mode).
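What "pure strided mode" means can be seen from NumPy's Python level: the element at index (i, j) of a strided array lives at byte offset i*strides[0] + j*strides[1] from the start of the data buffer, with no indirection. A small sketch:

```python
import numpy as np

# A C-contiguous 3x4 array of 8-byte integers: moving one row ahead
# skips 4 elements * 8 bytes = 32 bytes, one column ahead 8 bytes.
a = np.arange(12, dtype=np.int64).reshape(3, 4)
assert a.strides == (32, 8)

# A transposed view is still strided, just with swapped strides;
# no data is copied.
assert a.T.strides == (8, 32)
```

Objects that store their data indirectly (through per-row pointers, say) cannot be described by two stride numbers, which is why assuming np.ndarray lets Cython generate faster code.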
Using the Numpy C API
If you're not writing a highly specialized function, you may want to write very general functions that work a lot like UFuncs. There are at least three generality features that numpy provides: handling of arbitrary dimensional arrays, array broadcasting and array type casting. The simplest way to take advantage of these features is to use the Numpy C-API.
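The broadcasting behaviour that the C-API exposes is the same one available at the Python level through np.broadcast; for instance:

```python
import numpy as np

# A (3, 1) array broadcast against a (1, 4) array yields a (3, 4) result.
a = np.arange(3).reshape(3, 1)
b = np.arange(4).reshape(1, 4)
assert np.broadcast(a, b).shape == (3, 4)
assert (a * b).shape == (3, 4)
```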
To access the C-API you must cimport numpy and call import_array. If you also wish to use Python numpy functions, you must import that separately.
```
import numpy as np
cimport numpy as np

np.import_array()
```
Dimensionally simple functions
If your function is dimensionally simple (no dimension is special), then it is easy to implement input array broadcasting by using multi-iterators to simultaneously iterate over multiple arrays. The API multi-iterator documentation is useful here.
The following example demonstrates how to do this.
```
import numpy as np
cimport numpy as np

np.import_array()

def prod2(a, b):
    # generate a new output array of the correct shape by
    # broadcasting the input arrays together
    out = np.empty(np.broadcast(a, b).shape, np.float)

    # generate the iterator over the input and output arrays,
    # does the same thing as PyArray_MultiIterNew
    cdef np.broadcast it = np.broadcast(a, b, out)

    while np.PyArray_MultiIter_NOTDONE(it):
        # PyArray_MultiIter_DATA is used to access the pointers the
        # iterator points to; the pointers must be cast and dereferenced
        aval = (<double*>np.PyArray_MultiIter_DATA(it, 0))[0]
        bval = (<double*>np.PyArray_MultiIter_DATA(it, 1))[0]

        (<double*>np.PyArray_MultiIter_DATA(it, 2))[0] = aval * bval

        # PyArray_MultiIter_NEXT is used to advance the iterator
        np.PyArray_MultiIter_NEXT(it)

    return out
```
Dimensionally complex functions
Sometimes your function will not be dimensionally simple. You may want to write general functions that can take an axis argument to treat a particular axis differently than other axes, for example for a moving average function.
One way to implement a dimensionally complex function is to use PyArray_IterAllButAxis to iterate over a single array but leave out iteration along the desired axis. The function then manually iterates over the distinct axis using pointer arithmetic. Manually iterating over the smallest axis also happens to be somewhat faster than using the iterator to iterate over it.
The following example demonstrates how to do this.
```
def ewma(a, double weight, int axis):
    cdef int i

    out = np.empty(a.shape, np.double)

    cdef np.flatiter ita = np.PyArray_IterAllButAxis(a, &axis)
    cdef np.flatiter ito = np.PyArray_IterAllButAxis(out, &axis)

    cdef int a_axis_stride = a.strides[axis]
    cdef int o_axis_stride = out.strides[axis]

    cdef int axis_length = out.shape[axis]

    cdef double avg
    cdef double value

    while np.PyArray_ITER_NOTDONE(ita):
        avg = 0.0
        for i in range(axis_length):
            # PyArray_ITER_DATA is used to access the pointer of the
            # iterator; (<char*>pointer) + i*a_axis_stride moves ahead
            # the correct number of bytes to the next element. The result
            # must then be cast to the correct pointer type and
            # dereferenced: (<double*>(char_pointer))[0]
            value = (<double*>((<char*>np.PyArray_ITER_DATA(ita)) + i*a_axis_stride))[0]

            avg = value*weight + avg*(1.0 - weight)

            # the result is written to the output array
            (<double*>((<char*>np.PyArray_ITER_DATA(ito)) + i*o_axis_stride))[0] = avg

        # PyArray_ITER_NEXT increments the iterators
        np.PyArray_ITER_NEXT(ita)
        np.PyArray_ITER_NEXT(ito)

    return out
```
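A pure-Python reference implementation is handy for sanity-checking a version written against the C API; this sketch (the helper name ewma_ref is mine, not part of the tutorial) computes the same recurrence with np.apply_along_axis:

```python
import numpy as np

def ewma_ref(a, weight, axis):
    # Pure-NumPy reference for the ewma above: walk along `axis`
    # applying avg = value*weight + avg*(1 - weight).
    def scan(v):
        out = np.empty(len(v), dtype=np.double)
        avg = 0.0
        for i, value in enumerate(v):
            avg = value * weight + avg * (1.0 - weight)
            out[i] = avg
        return out
    return np.apply_along_axis(scan, axis, np.asarray(a, dtype=np.double))
```

With weight = 1.0 the average just tracks the input, so ewma_ref returns the input unchanged -- a convenient first test case when comparing against the compiled version.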
For the time being, it is not possible to use array broadcasting using this method of implementing dimensionally complex functions. It is also not possible to have multiple special dimensions using this method. However, the Optimizing Iterator/UFunc Performance NEP will soon provide a more unified and convenient iteration interface which will support both features simultaneously and allow multiple special dimensions.
Keep in mind that the Numpy C API has a strange convention where most entry points which take a "dtype" reference "steal" a reference to the dtype (they Py_DECREF the dtype), so if you wish to use one of those entry points you will need to do a Py_INCREF of the dtype before passing it into the entry point:
```
from python_ref cimport Py_INCREF, Py_DECREF
cimport numpy as np

np.import_array()
...
Py_INCREF(dtype)
return np.PyArray_FromArray(
    instance, dtype, np.NPY_CARRAY | np.NPY_FORCECAST
)
```
The future
These are some points to consider for further development. All points listed here have gone through a lot of thinking and planning already; still they may or may not happen depending on available developer time and resources for Cython.
- Support for efficient access to structs/records stored in arrays; currently only primitive types are allowed (but see http://trac.cython.org/cython_trac/ticket/56).
- Support for efficient access to complex floating point types in arrays. The main obstacle here is getting support for efficient complex datatypes in Cython.
- Calling NumPy/SciPy functions currently has a Python call overhead; it would be possible to take a short-cut from Cython directly to C. (This does however require some isolated and incremental changes to those libraries; mail the Cython mailing list for details.)
- Efficient code that is generic with respect to the number of dimensions. This can probably be done today by calling the NumPy C multi-dimensional iterator API directly; however it would be nice to have for-loops over enumerate and ndenumerate on NumPy arrays create efficient code.
- A high-level construct for writing type-generic code, so that one can write functions that work simultaneously with many datatypes. Note however that a macro preprocessor language can help with doing this for now.