CEP 1000: Convention for native dispatches through Python callables

Many callable objects are simply wrappers around native code. This holds for any Cython function, f2py functions, manually written CPython extensions, Numba, etc.

Obviously, when native code calls other native code, it would be nice to skip the significant cost of boxing and unboxing all the arguments. Early binding at compile-time is only possible between different Cython modules, not between all the tools listed above.

CEP 523 deals with Cython-specific aspects (and is out-of-date w.r.t. this CEP); this CEP is intended to be about a cross-project convention only. If a success, this CEP may be proposesd as a PEP in a modified form.

Motivating example (looking a year or two into the future):

@numba
def f(x): return 2 * x

@cython.inline
def g(x : cython.double): return 3 * x

from fortranmod import h

print f(3)
print g(3)
print h(3)
print scipy.integrate.quad(f, 0.2, 3) # fast callback!
print scipy.integrate.quad(g, 0.2, 3) # fast callback!
print scipy.integrate.quad(h, 0.2, 3) # fast callback!

The native-call slot

We need fast access to probing whether a callable object supports this CEP. Other mechanisms, such as an attribute in a dict, is too slow for many purposes (quoting robertwb: "We're trying to get a 300ns dispatch down to 10ns; you do not want a 50ns dict lookup"). (Obviously, if you call a callable in a loop you can fetch the pointer outside of the loop. But in particular if this becomes a language feature in Cython it will be used in all sorts of places.)

So we hack another type slot into existing and future CPython implementations in the following way: This CEP provides a C header that for all Python versions define a macro Py_TPFLAGS_UNOFFICIAL_EXTRAS for a free bit in tp_flags in the PyTypeObject.

If present, then we extend PyTypeObject as follows:

typedef struct {
   PyTypeObject tp_main;
   size_t tp_unofficial_flags;
   size_t tp_nativecall_offset;
} PyUnofficialTypeObject;

tp_unofficial_flags is unused and should be all 0 for the time being, but can be used later to indicate features beyond this CEP.

If tp_nativecall_offset != 0, this CEP is supported, and the information for doing a native dispatch on a callable obj is located at

(char*)obj + ((PyUnofficialTypeObject*)obj->ob_type)->tp_nativecall_offset;

GIL-less accesss

It is OK to access the native-call table without holding the GIL. This should of course only be used to call functions that state in their signature that they don't need the GIL.

This is important for JITted callables who would like to rewrite their table as more specializations gets added; if one needs to reallocate the table, the old table must linger along long enough that all threads that are currently accessing it are done with it.

Native dispatch descriptor

The final format for the descriptor is not agreed upon yet; this sums up the major alternatives.

The descriptor should be a list of specializations/overload, each described by a function pointer and a signature specification string, such as "id)i" for int f(int, double).

The way it is stored must cater for two cases; first, when the caller expects one or more hard-coded signatures:

if (obj has signature "id)i") {
   call;
} else if (obj has signature "if)i") {
   call with promoted second argument;
} else {
   box all arguments;
   PyObject_Call;
}

The second is when a call stack is built dynamically while parsing the string. Since this has higher overhead anyway, optimizing for the first case makes sense.

Approach 1: Interning/run-time allocated IDs

1A: Let each overload have a struct

struct {
   size_t signature_id;
   char *signature;
   void *func_ptr;
};

Within each process run, there is a 1:1 between signature and signature_id. signature_id is allocated by some central registry.

1B: Intern the string instead:

struct {
   char *signature; /* pointer must come from the central registry */
   void *func_ptr;
};

However this is not trivial, since signature strings can be allocated on the heap (e.g., a JIT would do this), so interned strings must be memory managed and reference counted. This could be done by each object passing in the signature both when incref-ing and decref-ing the signature string in the interning machinery. Using Python bytes objects is another option. Another option would to re-use and never reclaim them, as they are likely to be small and limited in (distinct) number over even a long-running session.

Discussion

The cost of comparing a signature: Comparing a global variable (needle) to a value that is guaranteed to already be in cache (candidate match)

Pros:

Cons:

Approach 2: Efficient strcmp of verbatim signatures

The idea is to store the full signatures and the function pointers together in the same memory area, but still have some structure to allow for quick scanning through the list.

Each entry has the structure [signature_string, funcptr] where:

The point is that if you know a signature, you can quickly scan through the binary blob for the signature in 128 bit increments, without worrying about the variable size nature of each entry. The rules above protects against spurious matches.

Optional: Encoding

The strcmp approach can be made efficient for larger signatures by using a more efficient encoding than ASCII. E.g., an encoding could use 4 bits for the 12 most common symbols and 8 bits for 64 symbols (for a total of 78 symbols), of which some could be letter combinations ("Zd", "T{"). This should be reasonably simple to encode and decode.

The CEP should provide C routines in a header file to work with the signatures. Callers that wish to parse the format string and build a call stack on the fly should probably work with the encoded representation.

Discussion

The cost of comparing a signature: For the vast majority of functions, the cost is comparing a 64-bit number stored in the CPU instruction stream (needle) to a value that is guaranteed to already be in cache (candidate match).

Pros:

Cons:

Signature strings

Example: The function

int f(double x, float y);

would have the signature string "df)i" (or, to save space, "idf").

Fields would follow the PEP3118 extensions of the struct-module format string, but with some modifications:

enhancements/cep1000 (last edited 2012-04-15 05:43:09 by robertwb)