PLY is a pure-Python implementation of the popular compiler construction tools lex and yacc.
Getting Started with PLY
To install PLY on your machine for Python 2/3, follow the steps outlined below:
- Download the source code from the official repository (https://github.com/dabeaz/ply).
- Unzip the downloaded zip file
- Navigate into the unzipped directory
- Run the following command in your terminal:
python setup.py install
If you completed all of the above, you should now be able to use the PLY module. You can test it out by opening a Python interpreter and typing import ply.lex; if no error is raised, the installation worked.
Note: Do not use pip to install PLY; it will install a broken distribution on your machine.
Part 1: Tokenizing Input with Lex
The code from example 1 carried out two steps: the first was tokenizing the input, which means looking for symbols that constitute the arithmetic expression, and the second was parsing, which involves analysing the extracted tokens and evaluating the result.
This section provides a simple example of how to tokenize user input, and then breaks it down line by line.
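The listing below follows the calclex.py example from the PLY documentation:

```python
import ply.lex as lex

# List of token names. This is always required
tokens = (
    'NUMBER',
    'PLUS',
    'MINUS',
    'TIMES',
    'DIVIDE',
    'LPAREN',
    'RPAREN',
)

# Regular expression rules for simple tokens
t_PLUS = r'\+'
t_MINUS = r'-'
t_TIMES = r'\*'
t_DIVIDE = r'/'
t_LPAREN = r'\('
t_RPAREN = r'\)'

# A regular expression rule with some action code
def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

# Define a rule so we can track line numbers
def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

# A string containing ignored characters (spaces and tabs)
t_ignore = ' \t'

# Error handling rule
def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

# Build the lexer
lexer = lex.lex()
```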
Save this file as calclex.py. We'll be using it when building our Yacc parser.
Import the module using import ply.lex as lex.
All lexers must provide a list called tokens that defines all of the possible token names that can be produced by the lexer. This list is always required. tokens may also be a tuple of strings (rather than a list), where each string names a token as before.
The regex rule for each token may be defined either as a string or as a function. In either case, the name should be prefixed with t_ to denote that it is a rule for matching tokens.
For simple tokens, the regular expression can be specified as a string:
t_PLUS = r'\+'
If some kind of action needs to be performed, a token rule can be specified as a function. Note that the rule is specified as a docstring within the function. The function accepts one argument, which is an instance of LexToken, performs some action, and then returns the token.
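For example, a NUMBER rule that converts the matched text into an integer:

```python
def t_NUMBER(t):
    r'\d+'                  # the regex rule, given as the docstring
    t.value = int(t.value)  # the action: convert the lexeme to an int
    return t                # return the token so it isn't discarded
```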
If you want to use an external string as the regex rule for the function instead of specifying a docstring, consider the following example.
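PLY provides the @TOKEN decorator for this purpose. A sketch based on the identifier example from the PLY documentation (it assumes an ID entry in the tokens list):

```python
from ply.lex import TOKEN

digit      = r'([0-9])'
nondigit   = r'([_A-Za-z])'
identifier = r'(' + nondigit + r'(' + digit + r'|' + nondigit + r')*)'

@TOKEN(identifier)
def t_ID(t):
    # The regex is attached by the decorator instead of a docstring
    return t
```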
An instance of LexToken (let's call this object t) has the following attributes:
- t.type, which is the token type as a string (e.g. 'PLUS'). By default, t.type is set to the name following the t_ prefix.
- t.value, which is the lexeme (the actual text matched).
- t.lineno, which is the current line number. This is not automatically updated, as the lexer knows nothing of line numbers; update it yourself using a function such as t_newline.
- t.lexpos, which is the position of the token relative to the beginning of the input text.
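Line numbers are commonly tracked with a t_newline rule, as in the PLY documentation:

```python
def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)  # count every newline that was matched
```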
If nothing is returned from a regex rule function, the token is discarded. Alternatively, you can discard a token by prefixing a regex rule variable with t_ignore_ instead of defining a function for the same rule.
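For example, this comment rule from the PLY documentation:

```python
def t_COMMENT(t):
    r'\#.*'
    pass  # No return value. Token discarded
```

...is the same as:

```python
t_ignore_COMMENT = r'\#.*'
```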
This of course won't work if you're carrying out some action when you see a comment; in that case, use a function to define the regex rule.
If you haven't defined a token for some characters but still want to ignore them, use t_ignore = "<characters to ignore>" (the t_ prefix is required here too).
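For example, to skip spaces and tabs, as in the calculator lexer:

```python
t_ignore = ' \t'
```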
When building the master regex, lex will add the regexes specified in the file as follows:
- Tokens defined by functions are added in the same order as they appear in the file.
- Tokens defined by strings are added in decreasing order of the length of the regex string defining them (longer expressions are added first).
If you are matching both == and = in the same file, take advantage of these rules: define both as strings, and the longer == rule will be added to the master regex first.
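A minimal sketch (the token names ASSIGN and EQ are illustrative, not from the original):

```python
t_ASSIGN = r'='
t_EQ     = r'=='   # longer regex string, so it is added to the master regex first
```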
Literals are tokens that are returned as they are. Both t.type and t.value will be set to the character itself. Define a list of literals as shown below.
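Both forms below are accepted:

```python
literals = ['+', '-', '*', '/']
# or, equivalently, as a single string:
literals = "+-*/"
```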
It is possible to write token functions that perform additional actions when literals are matched. However, you'll need to set the token type appropriately. For example:
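The brace-handling example from the PLY documentation illustrates this:

```python
literals = ['{', '}']

def t_lbrace(t):
    r'\{'
    t.type = '{'  # Set token type to the expected literal
    return t

def t_rbrace(t):
    r'\}'
    t.type = '}'  # Set token type to the expected literal
    return t
```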
Handle errors with the t_error function. Inside it, t.lexer.skip(n) skips ahead n characters in the input string.
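The error rule from the PLY documentation prints the offending character and skips past it:

```python
def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)
```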
Build the lexer using lexer = lex.lex().
You can also put everything inside a class and use an instance of the class to define the lexer.
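A condensed version of the class-based example from the PLY documentation:

```python
import ply.lex as lex

class MyLexer(object):
    # List of token names. This is always required
    tokens = ('NUMBER', 'PLUS', 'MINUS')

    # Regular expression rules for simple tokens
    t_PLUS   = r'\+'
    t_MINUS  = r'-'
    t_ignore = ' \t'

    # Token rules become methods; note the extra self argument
    def t_NUMBER(self, t):
        r'\d+'
        t.value = int(t.value)
        return t

    def t_error(self, t):
        print("Illegal character '%s'" % t.value[0])
        t.lexer.skip(1)

    # Build the lexer from this instance's rules
    def build(self, **kwargs):
        self.lexer = lex.lex(module=self, **kwargs)

    def test(self, data):
        self.lexer.input(data)
        for tok in self.lexer:
            print(tok)

m = MyLexer()
m.build()
m.test("1 + 2 - 3")
```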
Provide input using lexer.input(data), where data is a string. To get the tokens, use lexer.token(), which returns the next matched token (or None when the input is exhausted). You can also iterate over the lexer in a loop.
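Both styles below produce the same tokens:

```python
lexer.input("3 + 4 * 10")

# Pull tokens explicitly, one at a time
while True:
    tok = lexer.token()
    if not tok:
        break  # No more input
    print(tok)

# Or simply iterate over the lexer
lexer.input("3 + 4 * 10")
for tok in lexer:
    print(tok)
```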
Part 2: Parsing Tokenized Input with Yacc
This section explains how the tokenized input from Part 1 is processed. This is done using a context-free grammar (CFG): the grammar must be specified, and the tokens are processed according to it. Under the hood, PLY uses an LALR parser.
Each grammar rule is defined by a function where the docstring to that function contains the appropriate context-free grammar specification. The statements that make up the function body implement the semantic actions of the rule. Each function accepts a single argument p that is a sequence containing the values of each grammar symbol in the corresponding rule. The values of p[i] are mapped to grammar symbols as shown in the example below.
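The canonical addition rule from the PLY documentation:

```python
def p_expression_plus(p):
    'expression : expression PLUS term'
    #   ^            ^        ^    ^
    #  p[0]         p[1]     p[2] p[3]
    p[0] = p[1] + p[3]
```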
For tokens, the "value" of the corresponding p[i] is the same as the value attribute assigned to the token in the lexer module (t.value). So, PLUS will have the value '+'.
For non-terminals, the value is determined by whatever is placed in p[0]. If nothing is placed, the value is None. Also, p[-1] is not the same as p[3] in the example above: p is not a simple list, and negative indices like p[-1] can specify embedded actions (not discussed here).
Note that the function can have any name, as long as it is prefixed with p_. A p_error(p) rule is defined to catch syntax errors (the same idea as yyerror in yacc).
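A minimal error handler, as in the PLY documentation:

```python
def p_error(p):
    print("Syntax error in input!")
```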
Multiple grammar rules can be combined into a single function, which is a good idea if productions have a similar structure.
Character literals can be used instead of tokens.
Of course, the literals must be specified in the lexer module.
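This example from the PLY documentation demonstrates both points: several productions share a single function, and the operators are character literals:

```python
def p_binary_operators(p):
    '''expression : expression '+' term
                  | expression '-' term
       term       : term '*' factor
                  | term '/' factor'''
    if p[2] == '+':
        p[0] = p[1] + p[3]
    elif p[2] == '-':
        p[0] = p[1] - p[3]
    elif p[2] == '*':
        p[0] = p[1] * p[3]
    elif p[2] == '/':
        p[0] = p[1] / p[3]
```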
Empty productions have the form
'''symbol : '''
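In practice this is usually written as a dedicated rule that other rules can reference by name, as in the PLY documentation:

```python
def p_empty(p):
    'empty :'
    pass
```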
To explicitly set the start symbol, use start = 'foo', where foo is some non-terminal.
Setting precedence and associativity can be done using the precedence variable. Tokens are ordered from lowest to highest precedence. nonassoc means that those tokens do not associate: something like a < b < c is illegal, whereas a < b is still legal.
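The precedence table from the PLY documentation:

```python
precedence = (
    ('nonassoc', 'LESSTHAN', 'GREATERTHAN'),  # Nonassociative operators
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE'),
    ('right', 'UMINUS'),                      # Unary minus (highest precedence)
)
```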
parser.outis a debugging file that is created when the yacc program is executed for the first time. Whenever a shift/reduce conflict occurs, the parser always shifts.
The "Hello, World!" of PLY - A Simple Calculator
Let's demonstrate the power of PLY with a simple example: this program will take an arithmetic expression as a string input, and attempt to solve it.
Open up your favourite editor and copy the following code:
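A version assembled from the calculator example in the PLY documentation (the exact listing may differ in small details):

```python
import ply.lex as lex
import ply.yacc as yacc

# --- Lexer ---

tokens = ('NUMBER', 'PLUS', 'MINUS', 'TIMES', 'DIVIDE', 'LPAREN', 'RPAREN')

t_PLUS   = r'\+'
t_MINUS  = r'-'
t_TIMES  = r'\*'
t_DIVIDE = r'/'
t_LPAREN = r'\('
t_RPAREN = r'\)'
t_ignore = ' \t'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

lexer = lex.lex()

# --- Parser ---

precedence = (
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE'),
    ('right', 'UMINUS'),
)

def p_expression_binop(p):
    '''expression : expression PLUS expression
                  | expression MINUS expression
                  | expression TIMES expression
                  | expression DIVIDE expression'''
    if p[2] == '+':
        p[0] = p[1] + p[3]
    elif p[2] == '-':
        p[0] = p[1] - p[3]
    elif p[2] == '*':
        p[0] = p[1] * p[3]
    else:
        p[0] = p[1] / p[3]

def p_expression_uminus(p):
    'expression : MINUS expression %prec UMINUS'
    p[0] = -p[2]

def p_expression_group(p):
    'expression : LPAREN expression RPAREN'
    p[0] = p[2]

def p_expression_number(p):
    'expression : NUMBER'
    p[0] = p[1]

def p_error(p):
    print("Syntax error in input!")

parser = yacc.yacc()

while True:
    try:
        s = input('calc > ')  # use raw_input() on Python 2
    except EOFError:
        break
    if not s:
        continue
    print(parser.parse(s))
```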
Save this file as calc.py and run it.
You should get -8, which is the right answer for -4 * - (3 - 5).