CEP 903 - Abstract Syntax Tree
Community status: Wanted
Implementation status: Prototyped
Comments: This is part of FabrizioM's GSoC 2008 project proposal
Contents
This Page describes the idea of introducing an Abstract Syntax Tree in Cython.
I suggest to follow the same transition made by Python from the 2.4 version to 2.5 PEP0339
Intro
In the current compilation phase, from Cython source to C code, two steps are involved:
- Parse the source code into a parse tree (Compiler/Parser.py)
- Emit C code based on the parse tree (Compiler/Nodes.py Compiler/ExprNodes.py)
This is not how a standard compiler works. The usual steps for compilation are:
- Parse source code into a parse tree (new Parser)
- Transform parse tree into an Abstract Syntax Tree (new Abstract Syntax Tree)
- Transform AST into a Control Flow Graph ( new Flow Graph)
- Emit C code based on the Control Flow Graph (Compiler/Nodes.py Compiler/ExprNodes.py)
Compare the current code for the 'if' stmt
1 def p_simple_expr(s):
2 pos = s.position()
3 expr = p_or_test(s)
4 if s.sy == 'if':
5 s.next()
6 test = p_or_test(s)
7 if s.sy == 'else':
8 s.next()
9 other = p_test(s)
10 return ExprNodes.CondExprNode(pos, test=test, true_val=expr, false_val=other)
11 else:
12 s.error("Expected 'else'")
13 else:
14 return expr
to the new one
1 ast = parser.parse (source)
2 ast.visit(AstVisitor())
3
4 class AstVisitor:
5 ...
6 def visitIfStmt (self, node):
7 node.test.accept (self)
8 node.then.accept (self)
9 node.else_.accept (self)
To generate the AST, you provide the required action. For example the for_stmt to build the ast.For Node:
1 def build_for_stmt(builder, nb):
2 """for_stmt: 'for' exprlist 'in' testlist ':' suite ['else' ':' suite]"""
3 atoms = get_atoms(builder, nb)
4 else_ = None
5 # skip 'for'
6 assign = to_lvalue(atoms[1], consts.OP_ASSIGN)
7 # skip 'in'
8 iterable = atoms[3]
9 # skip ':'
10 body = atoms[5]
11 # if there is a "else" statement
12 if len(atoms) > 6:
13 # skip 'else' and ':'
14 else_ = atoms[8]
15 builder.push(ast.For(assign, iterable, body, else_, atoms[0].lineno))
Where the ast.For in the last line is:
1 class For(Node):
2 def __init__(self, assign, list, body, else_, lineno=None):
3 self.assign = assign
4 self.list = list
5 self.body = body
6 self.else_ = else_
7 self.lineno = lineno
8
9 # ...
10 def accept(self, visitor):
11 return visitor.visitFor(self)
Notice here the visit Method. Ast's Nodes support the VisitorPattern
Creating the current Expression Tree from an AST
The steps to generate the current Cython Expression Tree are:
- Create the AST by parsing the source file
Create a CythonVisitor
Visit the Ast with the CythonVisitor
1 class CythonVisitor(ASTVisitor):
2 # ...
3 def visitCFunction (self, node):
4 pos = (__file__, 0, node.lineno)
5 basenode = CSimpleBaseTypeNode(pos,name = 'int',module_path = [],is_basic_c_type = True,signed = 1,longness = 0,is_self_arg = False)
6 base = CNameDeclaratorNode(pos, name = node.name, cname = node.name)
7 declarator = CFuncDeclaratorNode (pos,base = base,args = [],has_varargs = False,exception_value = None, exception_check = 0,nogil = True, with_gil = False)
8 suite = node.code.accept(self)
9 return CFuncDefNode(pos,visibility = 'private',base_type = basenode, declarator = declarator, body = suite,doc = node.doc,modifiers = [],api = False, overridable = False)
10 # ...
The above code is just a prototype of course. But it already translates with the new parser small Cython programs. The CFuncDefNode and other classes are used unmodified.
To use the ASTVisitor you simply do:
1 v = CythonVisitor()
2 tree = ast.accept (v)
The main benefit of this approach is this: you can feed different AstVisitor for a same ast. This is really useful if you want to support different Python API versions:
1 if opt.output == PYTHON_23:
2 v = CythonVisitor23()
3 elif opt.output == PYTHON_24:
4 v = CythonVisitor24()
5 elif opt.output == PYTHON_25:
6 v = CythonVisitor25()
7
8 tree = ast.accept (v)
9 output_code (tree)
10 # untested code
Why this is needed? Every release of Python API brings new C API features. You can add support for new C API language feature in a clear way. Look at the new Python 2.5 API features especially the big change for the Py_ssize_t at a high level should look like:
1 class CythonVisitor24(ASTVisitor):
2 # ...
3 def visitIndex (self, node):
4 return ssize_tProducer(node)
5 # ...
6
7 class CythonVisitor25(ASTVisitor):
8 # ...
9 def visitIndex (self, node):
10 return Py_ssize_tProducer(node)
11 # ...
12 #untested invented code
Right now, handling the logic for different versions will involve to plug obscure if in the parser.
More AST features
On fly code injection For the example supporting the with statement (an other proposal will be easy: You simply detect the WithNode in the generate AST and replace it with an new generate AstNodes:
1 class ASTVisitor:
2 def expandWith (self, node):
3 with_transform = """
4 mgr = (%(EXPR)s)
5 exit = mgr.__exit__ # Not calling it yet
6 value = mgr.__enter__()
7 exc = True
8 try:
9 try:
10 %(VAR)s
11 %(BLOCK)
12 except:
13 # The exceptional case is handled here
14 exc = False
15 if not exit(*sys.exc_info()):
16 raise
17 # The exception is swallowed if exit() returns true
18 finally:
19 # The normal and non-local-goto cases are handled here
20 if exc:
21 exit(None, None, None)
22 """ % {
23 EXPR : flatten(node.expr), # re transform ast to python code,
24 VAR : if node.value "%(VAR)s = value # Only if "as VAR" is present" % node.var or " ",
25 BLOCK : flatten(node.block)
26 }
27 new_node = build_ast (with_transform_code ) # parse the code and generate an ast subtree
28 return new_node
29
30 def visitWithNode (self, node):
31 return self.transformWith (node)
The above code is only a vision of how it would operate the With Transform, more advance transformations should be handled with the Mutator interface
For a complete FUNCTIONAL test case look at: /CythonVisitorTest
Design changes
DRAFT
Cython's current steps, for code transformation, are the following:
- analyse_declarations
- analyse_expressions
- generate_code
analyse_declarations
- current
- Make symbol table entries for all declarations at the current level, both explicit (def, cdef, etc.) and implicit (assignment to an otherwise undeclared name).
- new
this will be done in the CythonVisitor, after constructing the AST, it will be optimized with all the transformation that does not require type or flow information.
analyse_expressions
- current
-
- Determine the result types of expressions and fill in the
'type' attribute of each ExprNode. Insert coercion nodes into the tree where needed to convert to and from Python objects. Allocate temporary locals for intermediate results. Fill in the 'result_code' attribute of each ExprNode with a C code fragment.
- Determine the result types of expressions and fill in the
- new
for each function will be constructed a FlowGraph by visiting it with a FlowGraphBuilder(Visitor). The graph will be analyzed with well know standard techniques, with the objective of infer every variable's type, and apply transformation where needed. (i.e. dead code removal) After all the possible transformations are applied to the graph, the graph will be converted back to an aST subtree and attached to the main tree.
generate_code
- current
- Emit C code for all declarations, statements and expressions. Recursively applies the 3 processing phases to the bodies of functions.
generate_function_definitions
- Emit C code for the definitions of any structs, unions, enums and functions defined in the current scope-block.
generate_execution_code
- Emit C code for executable statements.
- new
- this phase should be just a simple visitor that flattens the AST structure, or creates the DOM structure.
With the current implementation, the code generation and the type inference are mixed in the same place, look for example at
1 # 2 class FuncDefNode(Node): 3 # ... 4 def generate_function_definitions(self, env, code, transforms): 5 code.mark_pos(self.pos) 6 7 # XXX A code generator should know anything about a Program Scope! 8 genv = env.global_scope() 9 lenv = LocalScope(name = self.entry.name, outer_scope = genv) 10 lenv.return_type = self.return_type 11 code.init_labels() 12 self.declare_arguments(lenv) 13 # XXX 14 15 # XXX This Phase should not be here! 16 transforms.run('before_analyse_function', self, env=env, lenv=lenv, genv=genv) 17 self.body.analyse_declarations(lenv) 18 self.body.analyse_expressions(lenv) 19 transforms.run('after_analyse_function', self, env=env, lenv=lenv, genv=genv) 20 # XXX 21 22 # Just this 23 self.generate_interned_num_decls(lenv, code) 24 self.generate_interned_name_decls(lenv, code) 25 self.generate_py_string_decls(lenv, code) 26 self.generate_cached_builtins_decls(lenv, code) 27 self.generate_const_definitions(lenv, code)
All this enhancements will be applied GRADUALLY.
Comments
Pros
Design improvement: Separation of how to parse the logic (Parser) from how to give a Meaning to it (Nodes, ExprNodes).
- Peephole optimization like a = 10 * 5 directly in the AST
- Code flow analysis could be implemented easily by
Cons
- should be clear that any future Pyrex feature parser-dependent will be not supported, but I am sure will be emulated without too much effort The previous code is functional but not yet published
Thanks for reading! (and Sorry for the BAD English)
robertwb: Could you please write an example mutator for the visitWithNode? Re-creating and re-parsing a huge string is a bad idea on several levels (efficiency, loosing line number information, indentation woes, etc.)
robertwb: you need to have a way of actually modifying the grammar file, not just adding new nodes (e.g. the positional_arguments note must accept type information.
robertwb: your Py_ssize_t index visitor example doesn't make any sense, could you come up with a more concrete example?
robertwb: I would really like to see control flow, if possible, as a central part of the project fabrizioM: here it is Control Flow
