CEP 508 - Compile time unrolling
Implementation status: Not started
This is a proposal for creating a code tree transform that will take a normal code-tree and perform compile-time unrolling, calculations etc. The purpose is not to be an optimizing compiler, but have a quick way of allowing some of the same programming techniques that C++ templates allow for.
The creative use C++'s templates has gotten over the years should demonstrate that there are good uses this (see http://ubiety.uwaterloo.ca/~tveldhui/papers/Expression-Templates/exprtmpl.html, Blitz++ etc.).
This attempts to get some of the advantages without introducing a full template compiler. It also allows using the regular Python syntax rather than a function-style side-effect of the template compilation.
Examples of use
These examples may seem useless. The main usecase for now is the operators for NumPy arrays, which will contain lots of tests on access methods, contiguous mode etc. declared compile-time; and where it is not acceptable to insert run-time tests.
Make it possible to use compile-time constants in new ways:
DEF STRATEGY = 2 @cython.unroll def allocate_node(): if STRATEGY = 1: ..lots of code .. elif STRATEGY == 2: .. lots of code ..
This will generate C code that only contains the second block, while the first block will never make it past Cython.
A small modification gives:
@cython.unroll("strategy") def allocate_node(strategy): if strategy == 1: ..lots of code .. elif strategy == 2: .. lots of code .. a = allocate_node(1) b = allocate_node(2) c = allocate_node(1 if after_midnight() else 2)
This will generate three different C functions:
- One called to get the result of a, containing only the first block
- One called to get the result of b, containing only the second block
- A generic version for when the argument is only known run-time, which contains the if-test and both blocks. (Admittedly, one could "know" that there are only two cases that are already generated and move the if-test to the calling code, but that is not within the scope of this proposal.)
Nothing is done to prevent the user from hanging oneself -- the following will output 10000 seperate functions to the C file:
@cython.unroll("i") def fibonacci(i): if i == 0 or i == 1: return 1 return fibonacci(i-1) + fibonacci(i-2) a = fibonacci(10000)
Ie, @cython.unroll will only be used by people who know what they do.
All the expressions in a function are classified to either be known compile-time or run-time. A lot of cases are obvious, other can demand developing a full-fledged optimizing compiler. To avoid the latter some simple restrictions are put on this. Some guidelines:
- Compile-time DEFs are known compile-time.
- Only a restricted set of types (numbers, strings, lists, tuples; all Python-style) can be regarded as known compile-time.
- A simple corrollary to this is that in methods, "self" is always classified as run-time.
- Global variables are never considered known compile-time.
- Only parameters listed in @cython.unroll are candidates for being known compile-time. Whether these are known compile-time or not is determined at the place of calling the function; different compile-time-known arguments will result in multiple versions of the function being generated. It is perfectly ok to pass values only known run-time, this will create a specific version of the function where this parameter is considered not known compile-time.
- Functions: Most functions are considered run-time-only. The exception are whitelisted Python builtins (len, range, zip, etc.); only functions without side-effects and operating on and returning primitive types can be whitelisted.
Uncertain about this one:
- Function references. Compiling in a compile-time known function reference would be an excellent way of stringing together much code. This would mimic passing around code-generating templates to templates in C++. But it should not be a priority.
Constructs that will be supported:
Constructs that might be supported but not a priority:
Not sure about whether internal functions/closures should be supported, at least they should be supported in Cython first Anyway, they would then inherit .
How to implement
I'll introduce an example:
t = False cdef int s = 0 if t: a = [3, -2, 7] else: a = [1, 3, 2] for i in range(1, len(a)+1): s += a[i-1] ** 2 a = "something else"
The chances of the C translation of the above to be effectively inlined by GCC is effectively none at all: There will be loads of Python objects created, dictionary accesses and so on. Still the code can in principle be inlined to:
cdef int s = 14 a = "something else"
(This seems useless -- the really useful cases are when you have code that cannot be inlined at code-writing-time but can be inlined at code-compilation-time - because it gets different type arguments, or is simply in a "compile-time library" like mentioned above.)
This might seem like magic, but in reality I don't believe it is all that difficult. One important principle first: Everything that happens happen on a best-effort basis. I.e., if a case turns out to be difficult, simply leave it non-inlined!
This basically requires a "Python interpreter" transform that can be run on any fragments of our parse tree, and evaluates any compile-time-evaluateable expressions using the full Python machinery while leaving run-time-only expressions alone. This simplistic interpreter will:
- Keep a current scope of variables in a dictionary. Each variable can either be in a RUNTIME state, meaning that their value cannot be resolved at compile-time, or a COMPILETIME state meaning that their value can be known and is resolved.
- Recursively process nodes, always returning clones of the nodes instead of just passing them through, even if they don't change (this is so that the old syntax tree isn't destroyed when iterating through loops for instance, then it is needed many times).
- Handle each type of node in obvious and not-so-obvious ways. For instance for-loops one must first determine if they should be unrolled (see bottom), and if so they would repeatedly call the interpreter on the contents while setting the iterator variables to COMPILETIME but with different values.
- Functions are a bit tricky:
- For functions compiled in the same phase the compiler might simply proceed into them under certain conditions (or drop the conditions but make sure to memoize the results, also keying on the RUNTIME/COMPILETIME flags).
- For calls into Python libraries, a whitelist (of the full module name of the function) is checked. Functions on the whitelist are the common ones without side-effects (len, range, etc.); these can be called directly from the compiler itself to get the compile-time results. Functions that are not on the whitelist are deferred to runtime.
Statement lists must be processed in two passes: First process subnodes, then a seperate pass afterwards checks for redundant/combinable assignments/augmented arithmetic operations (taking care to not replace augmented aritmethic operations on Python objects as __iadd__ and friends might be overloaded in nontrivial ways, however on C types it is safe to replace many += with a = and many +).
In the example above, one would:
- The whole block is a statement list, enter first pass
- The assignment to t is cloned and t put into the symbols list with COMPILETIME status and its value recorded.
- The cdef instruction does about the same
- if t: Recursively enter the condition expression...it returns (COMPILETIME, False), so the if statement recurses into the else: part and returns it (replacing itself with the else-block). If RUNTIME was returned it would simply recurse but return itself.
- And so the story goes... at the for-loop things get interesting:
range is resolved to builtin.range, which is whitelisted, so the interpreter calls range itself with the parameters (since they are COMPILETIME), then once for each variable sets i to COMPILETIME and the index in the scope ... etcetera
- I'm sure you can imagine the rest.
- Finally, one is left with one big statement list with lots of += to s. Since s is of a native int type we know that these can be replaced with s = ... + ... instead, so that is done; the interpreter is then invoked on the chain in order to sum it up, and we are effectively done.
How much time?
A week of hard work? All the cases that should be handled are already nicely declared as node types in our tree, it's just about writing a small piece of interpretive action for each of them (in a Transform descendant of course, not in the nodes themselves). A day or so for the basic structure that will simply pass things through
More if it is found that the code tree of Cython relies heavily on "cross-references", which the cloning approach would break. But this will tend to break many transforms anyway and must be dealt with.