TPG (Toy Parser Generator) is a Python parser generator. It is aimed at easy usage rather than performance. My inspiration was drawn from two different sources. The first was GEN6, a parser generator created at ENSEEIHT, where I studied. The second was PROLOG, especially DCG parsers. I wanted a generator with a simple and expressive syntax whose generated parsers work as the user expects. So I decided that TPG should be a recursive descent parser (a rule is a procedure that calls other procedures) and that the grammars should be attributed (attributes are the parameters of the procedures). This way TPG can be considered as a programming language, or more modestly as a Python extension.
TPG is available under the GNU Lesser General Public License.
Toy Parser Generator: A Python parser generator
Copyright (C) 2001-2013 Christophe Delord
This library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version.
This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public License along with this library; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
TPG is freely available on its web page (http://cdsoft.fr/tpg). It is distributed as a package using distutils.
TPG is a pure Python package. It may run on any platform supported by Python. The only requirement of TPG is Python 2.2 or newer (Python 3.2 is also supported). Python can be downloaded at http://www.python.org.
Download TPG-X.Y.Z.tar.gz, unpack and run the installation program:
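Assuming the standard distutils procedure (X.Y.Z stands for the actual version number):

    tar xzf TPG-X.Y.Z.tar.gz
    cd TPG-X.Y.Z
    python setup.py install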
You may need to be logged in as root to install TPG.
TPG should run on any system provided that Python is installed. You should be able to install it by running the setup.py script (see 2.3).
This short tutorial shows how to make a simple calculator. The calculator will compute basic mathematical expressions (+, -, *, /) possibly nested in parentheses. We assume the reader is familiar with regular expressions.
Expressions are defined with a grammar. For example an expression is a sum of terms and a term is a product of factors. A factor is either a number or a complete expression in parentheses.
We describe such grammars with rules. A rule describes the composition of an item of the language. In our grammar we have 3 items (Expr, Term, Fact). We will call these items 'symbols' or 'non terminal symbols'. The decomposition of a symbol is symbolized with →. The grammar of this tutorial is given in figure 3.1.
Figure 3.1: Grammar of expressions

Grammar rule                   | Description
Expr → Term (('+'|'-') Term)*  | An expression is a term, optionally followed by a plus ('+') or a minus ('-') sign and another term, any number of times (* means the preceding expression is repeated 0 or more times).
Term → Fact (('*'|'/') Fact)*  | A term is a factor, optionally followed by a '*' or '/' sign and another factor, any number of times.
Fact → number | '(' Expr ')'   | A factor is either a number or an expression in parentheses.
We have defined here the grammar rules (i.e. the sentences of the language). We now need to describe the lexical items (i.e. the words of the language). These words - also called terminal symbols - are described using regular expressions. In the rules we have already written some of these terminal symbols (+, -, *, /, (, )). We still have to define number. For the sake of simplicity, numbers are integers composed of digits (the corresponding regular expression can be [0-9]+). To simplify the grammar and then the Python script, we define two terminal symbols to group the operators (additive and multiplicative operators). We can also define a special symbol that is ignored by TPG. This symbol is used as a separator, which is generally useful for white spaces and comments. The terminal symbols are given in figure 3.2.
Figure 3.2: Terminal symbols

Terminal symbol | Regular expression | Comment
number          | [0-9]+ or \d+      | One or more digits
add             | [+-]               | A + or a -
mul             | [*/]               | A * or a /
spaces          | \s+                | One or more spaces
This is sufficient to define our parser with TPG. The grammar of the expressions in TPG can be found in figure 3.3.
Figure 3.3: Grammar of the expression recognizer
class Calc(tpg.Parser):
r""" separator spaces: ’\s+’ ; token number: ’\d+’ ; token add: ’[+-]’ ; token mul: ’[*/]’ ; START -> Expr ; Expr -> Term ( add Term )* ; Term -> Fact ( mul Fact )* ; Fact -> number | ’\(’ Expr ’\)’ ; """
Calc is the name of the Python class generated by TPG. START is a special non terminal symbol treated as the axiom of the grammar.
With this small grammar we can only recognize a correct expression. We will see in the next sections how to read the actual expression and to compute its value.
The input of the grammar is a string. To do something useful we need to read this string in order to transform it into an expected result.
This string can be read by catching the return values of terminal symbols. By default any terminal symbol returns a string containing the current token. So the token '\(' always returns the string '('. For some tokens it may be useful to compute a Python object from the token. For example number should return an integer instead of a string, and add and mul should return a function corresponding to the operator. That is why we will add a function to the token definitions: we associate int with number and make_op with add and mul.
int is a Python function converting objects to integers and make_op is a user defined function (figure 3.4).
Figure 3.4: make_op function
def make_op(s):
    return {
        '+': lambda x, y: x + y,
        '-': lambda x, y: x - y,
        '*': lambda x, y: x * y,
        '/': lambda x, y: x / y,
    }[s]
To associate a function with a token, it must be added after the token definition, as in figure 3.5.
Figure 3.5: Token definitions with functions
separator spaces: '\s+' ;

token number: '\d+' int ;
token add: '[+-]' make_op ;
token mul: '[*/]' make_op ;
We have specified the values returned by the tokens. To read such a value after a terminal symbol is recognized, we store it in a Python variable. For example, to save a number in a variable n we write number/n. In fact both terminal and non terminal symbols can return a value, and the syntax is the same for both sorts of symbols. In a non terminal symbol definition, the return value written on the left hand side is the expression returned by the symbol; the return values written on the right hand side are variables in which the returned values are saved. A small example may be easier to understand (figure 3.6).
Rule            | Comment
X/x ->          | Defines a symbol X. When X is called, x is returned.
    Y/y         | X starts with a Y. The return value of Y is saved in y.
    Z/z         | The return value of Z is saved in z.
    $ x = y+z $ | Computes x.
;               | Returns x.
In the example described in this tutorial, the computation of a Term is made by applying the operator to the factors; this value is then returned:
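The corresponding rule, taken from the complete parser in figure 3.7:

    Term/f -> Fact/f ( mul/op Fact/a $ f = op(f,a) $ )* ;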
This example shows how to include Python code in a rule. Here $...$ is copied verbatim in the generated parser.
Finally the complete parser is given in figure 3.7.
Figure 3.7: Expression recognizer and evaluator
class Calc(tpg.Parser):
r""" separator spaces: ’\s+’ ; token number: ’\d+’ int ; token add: ’[+-]’ make_op ; token mul: ’[*/]’ make_op ; START -> Expr ; Expr/t -> Term/t ( add/op Term/f $t=op(t,f)$ )* ; Term/f -> Fact/f ( mul Fact/a $f=op(f,a)$ )* ; Fact/a -> number/a | ’\(’ Expr/a ’\)’ ; """
Since TPG 3, embedding parsers in a script is very easy because the grammar is simply the doc string of a class (see figure 3.8).
Figure 3.8: Writing TPG grammars in Python
import tpg
class MyParser(tpg.Parser):
    r"""
    # Your grammar here
    """

# You can instantiate your parser here
my_parser = MyParser()
To use this parser you now just need to instantiate an object of the class Calc, as in figure 3.9.
Figure 3.9: Complete Python script with expression parser
import tpg
def make_op(s):
    return {
        '+': lambda x, y: x + y,
        '-': lambda x, y: x - y,
        '*': lambda x, y: x * y,
        '/': lambda x, y: x / y,
    }[s]

class Calc(tpg.Parser):
    r"""
    separator spaces: '\s+' ;

    token number: '\d+' int ;
    token add: '[+-]' make_op ;
    token mul: '[*/]' make_op ;

    START/e -> Term/e ;
    Term/t -> Fact/t ( add/op Fact/f $ t = op(t,f) $ )* ;
    Fact/f -> Atom/f ( mul/op Atom/a $ f = op(f,a) $ )* ;
    Atom/a -> number/a | '\(' Term/a '\)' ;
    """

calc = Calc()
expr = raw_input('Enter an expression: ')
print expr, '=', calc(expr)
This tutorial shows some of the possibilities of TPG. If you have read it carefully you may be able to start with TPG. The next chapters present TPG more precisely. They contain more examples to illustrate all the features of TPG.
Happy TPG’ing!
TPG is a package whose main purpose is to define a base class whose metaclass converts a doc string into a parser. You only need to import tpg and use the few objects it exports (parser base classes and exceptions):
The grammar must be in the doc string of the class (see figure 4.1).
Figure 4.1: Grammar embedding example
class Foo(tpg.Parser):
r""" START/x -> Bar/x ; Bar/x -> ’bar’/x ; """
Then you can use the new generated parser. The parser is simply a Python class (see figure 4.2).
Figure 4.2: Parser usage example
test = "bar"
my_parser = Foo()

x = my_parser(test)                 # Uses the START symbol
print x

x = my_parser.parse('Bar', test)    # Uses the Bar symbol
print x
The tpg script reads a Python script and replaces TPG grammars (in doc strings) with Python code. To produce a Python script (*.py) from a script containing grammars (*.pyg) you can use tpg as follows:
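For example, with a hypothetical calc.pyg:

    tpg calc.pyg        # produces calc.py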
tpg accepts some options on the command line:
Notice that .pyg files are valid Python scripts. So you can choose to run the .pyg file directly (slower startup but easier for debugging purposes) or to turn it into a .py file (faster startup but needs a "precompilation" stage). You can also write .py scripts containing grammars to be used as ordinary Python scripts.
In fact I only use the tpg script to convert tpg.pyg into tpg.py, because TPG obviously needs to be a pure Python script (it can not translate itself at runtime). In most cases it is very convenient to directly write grammars in Python scripts.
TPG grammars are contained in the doc string of the parser class. TPG grammars may contain three parts:
See figure 5.1 for a generic TPG grammar.
Figure 5.1: TPG grammar structure
class Foo(tpg.Parser):
r""" Here is some ReStructuredText documentation (for Sphinx for instance). And now, the grammar:: # Options set lexer = CSL # Tokens separator spaces ’\s+’ ; token int ’\d+’ int ; # Rules START -> X Y Z ; """ foo = Foo() result = foo("input string")
Comments in TPG start with # and run until the end of the line.
Some options can be set at the beginning of TPG grammars. The syntax for options is set option_name = option_value, as shown in figure 5.1.
The lexer option tells TPG which lexer to use.
The word_boundary option tells the lexer to search for word boundaries after identifiers.
The re module accepts some options to define the behaviour of the compiled regular expressions. These options can be changed for each parser.
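For example (the values are illustrative; lexer and word_boundary are the options described above):

    set lexer = CSL
    set word_boundary = False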
Python code sections are not handled by TPG. TPG won't complain about syntax errors in Python code sections; that is Python's job. They are copied verbatim into the generated Python parser.
Before TPG 3, Python code was enclosed in double curly brackets. That means that the Python code must not contain two consecutive closing brackets; you can avoid this by writing } } (with a space) instead of }} (without a space). This syntax is still available but the new syntax may be more readable. The new syntax uses $ to delimit code sections. When several $ sections are consecutive they are seen as a single section.
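A small illustration of both syntaxes (the action code is hypothetical):

    X -> A {{ count = count + 1 }} ;    # old syntax
    X -> A $ count = count + 1 $ ;      # new syntax, same effect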
Python code can appear in several parts of a grammar. Since indentation has a special meaning in Python it is important to know how TPG handles spaces and tabulations at the beginning of the lines.
When TPG encounters some Python code it removes, from all non blank lines, the spaces and tabulations that are common to every line. TPG considers spaces and tabulations as the same character, so it is important to always use the same indentation style; it is thus advised not to mix spaces and tabulations in indentation. The code is then reindented when generated, according to its location (in a class, in a method or at global scope).
The figure 5.2 shows how TPG handles indentation.
Example 1 (correct): these lines have four spaces in common; these spaces are removed.

    Code in the grammar (old syntax):
        {{
            if 1==2:
                print "???"
            else:
                print "OK"
        }}

    Code in the grammar (new syntax):
        $  if 1==2:
        $      print "???"
        $  else:
        $      print "OK"

    Generated code:
        if 1==2:
            print "???"
        else:
            print "OK"

Example 2 (WRONG): it is a bad idea to start a multiline code section on the first line, since the common indentation may be different from what you expect. No error will be raised by TPG but Python will not compile the generated code. The new syntax has no trouble in that case.

    Code in the grammar (old syntax):
        {{  if 1==2:
                print "???"
            else:
                print "OK"
        }}

    Generated code (badly indented):
        if 1==2:
              print "???"
          else:
              print "OK"

Example 3 (correct): indentation does not matter in a one line Python code section.

    Code in the grammar:
        {{    print "OK" }}
    or
        $       print "OK"
    or
        $ print "OK" $

    Generated code:
        print "OK"
TPG parsers are tpg.Parser classes. The grammar is the doc string of the class.
As TPG parsers are just Python classes, you can use them as normal classes. If you redefine the __init__ method, do not forget to call tpg.Parser.__init__.
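A minimal sketch (the extra attribute is hypothetical):

    import tpg

    class MyParser(tpg.Parser):
        r"""
        START -> 'x' ;
        """
        def __init__(self):
            tpg.Parser.__init__(self)   # do not forget this call
            self.count = 0              # your own initialization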
Each rule will be translated into a method of the parser.
The lexer is based on the re module. TPG benefits from the power of Python regular expressions. This document assumes the reader is familiar with regular expressions.
You can use the syntax of regular expressions as expected by the re module, except for the grouping-by-name syntax, which is used internally by TPG to decide which token is recognized.
Here is a summary of the regular expression syntax. For instance, you can match the characters not within a range by complementing the set: this is indicated by including a '^' as the first character of the set; a '^' elsewhere will simply match the '^' character. For example, [^5] will match any character except '5', and [^^] will match any character except '^'.
Tokens can be explicitly defined by the token and separator keywords.
A token is defined by:
Token definitions must end with a ; when no action is specified. The colon after the token name is optional.
See figure 6.1 for examples.
Figure 6.1: Token definition examples
#     name          reg. exp        action
token integer:      '\d+'           int ;
token ident  :      '[a-zA-Z]\w*'       ;
token bool   :      'True|False'    $ lambda b: b=='True'

separator spaces  : '\s+' ;     # white spaces
separator comments: '#.*' ;     # comments
The order of the declaration of the tokens is important: the first token that matches is returned. Regular expressions that describe keywords get a special treatment: TPG also looks for a word boundary after the keyword. If you try to match the keywords if and ifxyz, TPG will internally search for if\b and ifxyz\b. This way, if won't match ifxyz and won't interfere with general identifiers (\w+ for example). This behaviour can be disabled since version 3 of TPG (see 5.3.2).
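For illustration (hypothetical token names):

    token kw_if: 'if' ;     # internally searched as if\b
    token ident: '\w+' ;    # so 'ifxyz' is matched by ident, not by kw_if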
There are two kinds of tokens. Tokens defined by the token keyword are parsed by the parser and tokens defined by the separator keyword are considered as separators (white spaces or comments for example) and are wiped out by the lexer.
Tokens can also be defined on the fly: their definitions are then inlined in the grammar rules. This feature may be useful for keywords or punctuation signs. Unlike predefined tokens, inline tokens can not be transformed by an action; they always return the matched text as a string.
See figure 6.2 for examples.
Figure 6.2: Inline token definition examples
IfThenElse ->
    'if' Cond 'then' Statement 'else' Statement
;
Inline tokens have a higher precedence than predefined tokens to avoid conflicts (an inlined if will not be matched as a predefined identifier).
TPG works in two stages: the lexer first splits the input string into a list of tokens and then the parser parses this list. The default lexer is lazy in TPG 3: tokens are generated while parsing, so TPG 3 needs less memory when parsing huge files.
The lexer splits the input string according to the token definitions (see 6.2). When the input string can not be matched a tpg.LexicalError exception is raised.
The lexer may loop indefinitely if a token can match an empty string since empty strings are everywhere.
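For instance, a separator written with * instead of + can match the empty string (a sketch of what to avoid):

    separator spaces: '\s*' ;   # WRONG: \s* matches the empty string, the lexer may loop
    separator spaces: '\s+' ;   # correct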
Tokens are matched as symbols are recognized. Predefined tokens have the same syntax as non terminal symbols. The token text (or the result of the function associated with the token) can be saved with the infix / operator (see figure 6.3).
Inline tokens have a similar syntax: you just write the regular expression (in a string). Its text can also be saved (see figure 6.4).
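A sketch of both cases, reusing the tutorial tokens (the rules themselves are illustrative):

    START/n -> number/n ;       # predefined token: its return value is saved in n
    START/op -> '[-+]'/op ;     # inline token: its matched text is saved in op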
A parser is declared as a tpg.Parser class. The doc string of the class contains the definition of the tokens and rules.
Rule declarations have two parts. The left side declares the symbol associated to the rule, its attributes and its return value. The right side describes the decomposition of the rule. Both parts of the declaration are separated with an arrow (→) and the declaration ends with a ;.
The symbol defined by the rule, as well as the symbols that appear in the rule, can have attributes and return values. The attribute list - if any - is given as an object list enclosed in angle brackets. The return value - if any - is extracted by the infix / operator. When no return value is specified, TPG creates a variable named after the symbol. See figure 7.1 for an example.
Figure 7.1: Rule declaration
SYMBOL <att1, att2, att3> / return_expression_of_SYMBOL ->
    A <x, y> / ret_value_of_A
    B <y, z> / ret_value_of_B
;

S1 / $f(A,B)$ ->
    A       # return value of A stored in variable A
    B       # return value of B stored in variable B
;

S2 ->       # we can use S2 to compute the return value
    A $ S2 = A
    B $ S2 = f(S2, B)
;
Each time a terminal symbol is encountered in a rule, the parser compares it to the current token in the token list. If it is different the parser backtracks.
You can start the parser from the axiom or from any other non terminal symbol. When the parser can not parse the whole token list a tpg.SyntacticError is raised. The value returned by the parser is the return value of the parsed symbol.
The axiom is a special non terminal symbol named START. Parsers are callable objects. When an instance of a parser is called, the START rule is parsed. The first argument of the call is the string to parse. The other arguments of the call are given to the START symbol.
This makes it possible to simply write x = calc("1+1") to parse and compute an expression, if calc is an instance of an expression parser.
It’s also possible to start parsing from any other non terminal symbol. TPG parsers have a method named parse. The first argument is the name of the symbol to start from. The second argument is the string to parse. The other arguments are given to the specified symbol.
For example to start parsing a Term you can write:
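A sketch, reusing the calc instance from the tutorial (the input string is arbitrary):

    x = calc.parse('Term', "2*3*4")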
To parse a non terminal symbol in a rule, TPG calls the rule corresponding to the symbol.
Sequences in grammar rules describe in which order symbols should appear in the input string. For example the sequence A B recognizes an A followed by a B. Sequences can be empty.
For example to say that a sum is a term plus another term you can write:
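A possible rule (the symbol names are illustrative):

    Sum -> Term '\+' Term ;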
Alternatives in grammar rules describe several possible decompositions of a symbol. The infix pipe operator (|) is used to separate alternatives. A | B recognizes either an A or a B. If both A and B can be matched only the first match is considered. So the order of alternatives is very important. If an alternative has an empty choice, it must be the last. Empty choices in other positions will be reported as syntax errors.
For example to say that an atom is an integer or an expression in parentheses you can write:
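A possible rule, reusing the tutorial tokens:

    Atom -> number | '\(' Expr '\)' ;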
Repetitions in grammar rules describe how many times an expression should be matched.
Repetitions are greedy. They are translated into Python loops, so whatever the length of the repetition, the Python stack will not overflow.
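The three repetition operators in one illustrative rule:

    X -> A? B* C+ ;     # A is optional, B occurs 0 or more times, C occurs 1 or more times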
Figure 7.2 lists the different structures in increasing precedence order. To override the default precedence you can group expressions with parentheses.
Structure           | Example
Alternative         | A | B
Sequence            | A B
Repetitions         | A?, A*, A+
Symbol and grouping | A and ( ... )
Grammar rules can contain actions as Python code. Python code is copied verbatim into the generated code and is delimited by $...$, by a $ running to the end of the line, or by {{...}}.
Please be aware that indentation should obey Python indentation rules. See the grammar description for further information (see figure 5.2).
An abstract syntax tree (AST) is an abstract representation of the structure of the input. A node of an AST is a Python object (there is no constraint about its class). AST nodes are completely defined by the user.
The figure 7.3 shows a node symbolizing a couple.
Figure 7.3: AST example
class Couple:
    def __init__(self, a, b):
        self.a = a
        self.b = b

class Foo(tpg.Parser):
    r"""
    COUPLE/$Couple(a,b)$ -> '\(' ITEM/a ',' ITEM/b '\)' ;

    # which is equivalent to

    # COUPLE/c -> '\(' ITEM/a ',' ITEM/b '\)' $ c = Couple(a,b) $ ;
    """
AST are created in Python code (see section 7.9.1).
When parsing lists for example it is useful to save all the items of the list. In that case one can use a list and its append method (see figure 7.4).
Figure 7.4: AST update example
class ListParser(tpg.Parser):
    r"""
    LIST/l ->
        '\('            $ l = []
        ITEM/a          $ l.append(a)
        ( ',' ITEM/a    $ l.append(a)
        )*
        '\)'
    ;
    """
TPG can extract a portion of the input string. The idea is to put marks while parsing and then extract the text between the two marks. This extracts the whole text between the marks, including the tokens defined as separators.
TPG 3 doesn't handle Python objects the way TPG 2 did. Only identifiers, integers and strings are recognized directly. Other objects must be written in Python code delimited either by $...$ or by {{...}}.
An argument list is a comma separated list of objects. Remember that argument lists are enclosed in angle brackets.
Argument lists and tuples have the same syntax, except that argument lists can contain default arguments, argument lists (*args) and argument dictionaries (**kwargs), as in Python.
A Python code object is a piece of Python code in double curly brackets or in dollar signs. Python code used in an object expression must fit on one line.
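A sketch of an attribute list with a default argument and a code object (the rule and symbol names are hypothetical):

    RULE<x, y=1> -> OTHER<x> OTHER<$x+y$> ;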
Text extraction is done by the extract method. Marks can be put in the input string by the mark method or the prefix operator.
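A sketch, assuming the prefix mark operator is @ (the rule and token names are illustrative):

    BLOCK/t -> @start 'begin' ITEM* 'end' @end $ t = self.extract(start, end) $ ;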
The line and column methods return the line and column number of the current token. If the first parameter is a mark (see 7.9.2) the method returns the line number of the token following the mark.
The user can force the parser to backtrack in rule actions. The module has a WrongToken exception for that purpose (see figure 7.5).
Figure 7.5: Backtracking with WrongToken example
# NATURAL matches integers greater than 0
NATURAL/n -> number/n $ if n<1: raise tpg.WrongToken $ ;
Parsers have another useful method named check (see figure 7.6). This method checks a condition: if the condition is false, WrongToken is raised in order to backtrack.
Figure 7.6: Backtracking with the check method example
# NATURAL matches integers greater than 0
NATURAL/n -> number/n $ self.check(n>=1) $ ;
A shortcut for the check method is the check keyword followed by the condition to check (see figure 7.7).
Figure 7.7: Backtracking with the check keyword example
# NATURAL matches integers greater than 0
NATURAL/n -> number/n check $ n>=1 $ ;
The user can force the parser to stop and raise an exception. The parser classes have an error method for that purpose (see figure 7.8). This method raises a SemanticError.
Figure 7.8: Error reporting with the error method example
# FRACT parses fractions
FRACT/<n,d> -> number/n '/' number/d $ if d==0: self.error("Division by zero") $ ;
A shortcut for the error method is the error keyword followed by the object to give to the SemanticError exception (see figure 7.9).
Figure 7.9: Error reporting with the error keyword example
# FRACT parses fractions
FRACT/<n,d> -> number/n '/' number/d ( check d | error "Division by zero" ) ;
Before version 2 of TPG, lexers were context sensitive. That means that the parser commands the lexer to match some tokens, i.e. different tokens can be matched in the same input string according to the grammar rules being used. These lexers were very flexible but slower than context free lexers, because TPG backtracking caused tokens to be matched several times.
In TPG 2, the lexer is called before the parser and produces a list of tokens from the input string. This list is then given to the parser. In this case when TPG backtracks the token list remains unchanged.
Since TPG 2.1.2, context sensitive lexers have been reintroduced in TPG. By default lexers are context free but the CSL option (see 5.3.1) turns TPG into a context sensitive lexer.
CSL grammars have the same structure as non CSL grammars (see 5.1) except for the lexer = CSL option (see 5.3.1).
The CSL lexer is based on the re module. The difference with non CSL lexers is that the given regular expression is compiled as is, without any encapsulation.
In CSL parsers, tokens are matched as in non CSL parsers (see 6.3).
There is no difference between CSL and non CSL parsers.
When I need to debug a grammar I often add print statements to visualize the parser activity. With TPG 3 it is now possible to print such information automatically.
Normal parsers inherit from tpg.Parser. If you need a more verbose parser you can use tpg.VerboseParser instead. This parser prints information about the current token each time the lexer is called. The debugging information currently has two levels of detail.
The level is defined by the attribute verbose. Its default value is 1.
Figure 9.1: Verbose parser example
class Test(tpg.VerboseParser):
r""" START -> ’x’ ’y’ ’z’ ; """ verbose = 2
The information displayed by verbose parsers has the following format:
This chapter presents an extension of the calculator described in the tutorial (see 3). This calculator has more functions and a memory.
This calculator can compute some numerical functions (sin, cos, sqrt, ...). The make_op function (see figure 3.4) has been extended to return these functions. Tokens must also be defined to scan function names: funct1 defines the names of unary functions and funct2 the names of binary functions. Finally, a branch has been added to the grammar rule for atoms to parse function calls. The Function non terminal symbol parses unary and binary functions.
The calculator has memories. A memory cell is identified by a name. For example, if the user types pi = 4*atan(1), the memory cell named pi contains the value of π and cos(pi) returns -1.
To display the content of the whole memory, the user can type vars.
The variables are saved in a dictionary. In fact the parser itself is a dictionary (the parser class inherits from the dict class).
The START symbol parses a variable creation or a single expression and the Atom parses variable names (the Var symbol parses a variable name and returns its value).
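A sketch of such rules, reconstructed from this description (the token and rule details are assumptions; the actual calc.pyg source may differ):

    START/e -> ident/v '=' Expr/e $ self[v] = e $
             | Expr/e
    ;
    Var/$self[v]$ -> ident/v ;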
Here is the complete source code (calc.pyg):
In the previous example, the parser computes the value of the expression on the fly, while parsing. It is also possible to build an abstract syntax tree to store an abstract representation of the input. This may be useful when several passes are necessary.
This example shows how to parse an expression (infix, prefix or postfix) and convert it into infix, prefix and postfix notations. The expression is saved in a tree. Each node of the tree corresponds to an operator in the expression; each leaf is a number. Then, to write the expression in infix, prefix or postfix notation, we just need to walk through the tree in a particular order.
The AST of this converter has three types of node:
These classes are instantiated by the __init__ method. The infix, prefix and postfix methods return strings containing the representation of the node in infix, prefix and postfix notation.
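A minimal sketch of an operator node (the actual classes in notation.py may differ):

    class Op:
        # binary operator node: op is the operator symbol, a and b the operand nodes
        def __init__(self, op, a, b):
            self.op, self.a, self.b = op, a, b
        def infix(self):
            return "(%s %s %s)" % (self.a.infix(), self.op, self.b.infix())
        def prefix(self):
            return "%s %s %s" % (self.op, self.a.prefix(), self.b.prefix())
        def postfix(self):
            return "%s %s %s" % (self.a.postfix(), self.b.postfix(), self.op)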
The grammar for infix expressions is similar to the grammar used in the previous example.
The grammar for prefix expressions is very simple. A compound prefix expression is an operator followed by two subexpressions.
At first sight postfix and infix grammars may look very similar: only the position of the operators changes. So a compound postfix expression is a first expression followed by a second expression and an operator. This rule is left recursive. As TPG is a recursive descent parser, such rules are forbidden to avoid infinite recursion. To remove the left recursion, a classical solution is to rewrite the grammar like this:
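In abstract form (EXPR_POST and SEXPR_POST are the rule names used in the next paragraph):

    # left recursive (forbidden):
    #   EXPR_POST -> EXPR_POST EXPR_POST op | ATOM
    # rewritten without left recursion:
    #   EXPR_POST -> ATOM SEXPR_POST
    #   SEXPR_POST -> EXPR_POST op SEXPR_POST | (nothing)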
The parser first searches for an atomic expression and then builds the AST by passing partial expressions through the attributes of the SEXPR_POST symbol.
Here is the complete source code (notation.py):