Newt: Replacing Bison and Flex

Bison and Flex (or any of the yacc/lex family members) are a quick way to generate reliable parsers and lexers for language development. It's easy to write a token recognizer in Flex and a grammar in Bison, and you can readily hook code up to the resulting parsing operation. However, neither Bison nor Flex is really designed for embedded systems, where memory is limited and malloc is to be avoided.

When starting Newt, I didn't hesitate to use them, though; it was nice to use well-tested and debugged tools so that I could focus on other parts of the implementation.

With the rest of Newt working well, I decided to take another look at the cost of lexing and parsing to see whether I could reduce their impact on the system.

A Custom Lexer

Most mature languages end up with a custom lexer. I'm not really sure why this has to be the case, but it's pretty rare to run across anyone still using Flex for lexical analysis. The lexical structure of Python is simple enough that this isn't a huge burden; the hardest part of lexing Python code is dealing with indentation, and Flex wasn't really helping with that part much anyways.

So I decided to just follow the same path and write a custom lexer. The result compiles to only about 1400 bytes of Thumb code, a significant savings over the Flex version, which was about 6700 bytes.
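To show the shape of the indentation problem, here's a minimal sketch of the classic indent-stack technique a Python-style lexer needs. This is just the general idea, not Newt's actual code; all the names are invented:

#include <stdio.h>

#define MAX_DEPTH 32

static int indent_stack[MAX_DEPTH];     /* indent_stack[0] is column 0 */
static int depth;

/* Given the indentation column of a new logical line, emit the
 * INDENT/DEDENT tokens that precede the line's other tokens.
 * (No stack overflow check in this sketch.) */
static void
handle_indent(int col)
{
    if (col > indent_stack[depth]) {
        indent_stack[++depth] = col;
        printf("INDENT ");
    } else {
        while (depth > 0 && col < indent_stack[depth]) {
            depth--;
            printf("DEDENT ");
        }
        /* a real lexer would complain here if col didn't
         * exactly match indent_stack[depth] */
    }
}

int
main(void)
{
    int cols[] = { 0, 4, 4, 8, 0 };     /* indentation of five lines */

    for (int i = 0; i < 5; i++) {
        handle_indent(cols[i]);
        printf("line at column %d\n", cols[i]);
    }
    return 0;
}

Each line's indentation is compared against a stack of enclosing levels, emitting INDENT when it grows and one DEDENT for each level it closes.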

To help make the resulting language LL, I added lexical recognition of the 'is not' and 'not in' operators; instead of attempting to sort these out in the parser, the lexer does a bit of look-ahead and returns a single token for each of them.
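Here's that trick in isolation; the token names and the tiny word scanner are invented for the example, not taken from Newt:

#include <stdio.h>
#include <string.h>
#include <ctype.h>

typedef enum { TOK_WORD, TOK_IS, TOK_NOT, TOK_IN,
               TOK_IS_NOT, TOK_NOT_IN, TOK_EOF } token_t;

static const char *input = "a not in b is not c";
static char word[32];

/* scan one whitespace-separated word and classify it */
static token_t
next_word(void)
{
    int n = 0;

    while (isspace((unsigned char) *input))
        input++;
    if (!*input)
        return TOK_EOF;
    while (*input && !isspace((unsigned char) *input) && n < 31)
        word[n++] = *input++;
    word[n] = '\0';
    if (!strcmp(word, "is"))  return TOK_IS;
    if (!strcmp(word, "not")) return TOK_NOT;
    if (!strcmp(word, "in"))  return TOK_IN;
    return TOK_WORD;
}

static token_t saved;
static int     have_saved;

/* the lexer proper: one word of look-ahead fuses two-word operators */
static token_t
lex(void)
{
    token_t t;

    if (have_saved) {
        have_saved = 0;
        t = saved;
    } else {
        t = next_word();
    }
    if (t == TOK_IS || t == TOK_NOT) {
        token_t peek = next_word();

        if (t == TOK_IS && peek == TOK_NOT)
            return TOK_IS_NOT;
        if (t == TOK_NOT && peek == TOK_IN)
            return TOK_NOT_IN;
        saved = peek;           /* push the extra token back */
        have_saved = 1;
    }
    return t;
}

int
main(void)
{
    static const char *names[] = { "WORD", "IS", "NOT", "IN",
                                   "IS_NOT", "NOT_IN", "EOF" };
    token_t t;

    while ((t = lex()) != TOK_EOF)
        printf("%s ", names[t]);
    printf("\n");       /* prints: WORD NOT_IN WORD IS_NOT WORD */
    return 0;
}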

Parsing on the Cheap

Many common languages are "almost" LL; this may come from the popularity of recursive-descent parsers. In 'pure' form, recursive-descent parsers can only recognize LL languages, but it's easy to hack them up with a bit of look-ahead here and there to handle non-LL cases.
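As a toy illustration (the grammar and names here are hypothetical, not Newt's), suppose a statement can be either an assignment or a bare expression, both of which start with an identifier; the current token alone can't choose a production, but peeking one token further can:

#include <stdio.h>

typedef enum { T_IDENT, T_ASSIGN, T_PLUS, T_END } tok_t;

/* token stream for the hypothetical input "x = y" */
static tok_t toks[] = { T_IDENT, T_ASSIGN, T_IDENT, T_END };
static int pos;

static tok_t cur(void)  { return toks[pos]; }
static tok_t peek(void) { return toks[pos + 1]; }
static void advance(void) { pos++; }

/* expr : IDENT | IDENT '+' expr */
static void
parse_expr(void)
{
    advance();                          /* consume IDENT */
    if (cur() == T_PLUS) {
        advance();
        parse_expr();
    }
}

/* stmt : IDENT '=' expr | expr
 * Both alternatives start with IDENT, so this isn't LL(1);
 * one extra token of look-ahead resolves it. */
static void
parse_stmt(void)
{
    if (cur() == T_IDENT && peek() == T_ASSIGN) {
        printf("assignment\n");
        advance();                      /* IDENT */
        advance();                      /* '=' */
        parse_expr();
    } else {
        printf("expression\n");
        parse_expr();
    }
}

int
main(void)
{
    parse_stmt();       /* prints: assignment */
    return 0;
}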

Which is one reason we end up using parser generators designed to handle LALR languages instead; that class of grammars does cover most modern languages fairly well, only requiring a small number of kludges to paper over the remaining gaps. I think the other reason I like using Bison is that the way an LALR parser works makes it really easy to handle synthetic attributes during parsing. Synthetic attributes are those built from collections of tokens that match an implicit parse tree of the input.

The '$[0-9]+' notation within Bison actions represents the values of lower-level parse tree nodes, while '$$' is the attribute value passed to higher levels of the tree.
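For example, in a classic calculator fragment (invented here for illustration, not taken from Newt's grammar), $1 and $3 name the attribute values of the first and third right-hand-side symbols, and assigning to $$ hands the synthesized value up to the enclosing rule:

expr        : expr '+' term     { $$ = $1 + $3; }
        | term              { $$ = $1; }
        ;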

However, LALR parser generators are pretty complicated, and the resulting parse tables are not tiny. I wondered how much space I could save by using a simpler parser structure and, equally important, one designed for embedded use. Because Python is supposed to be an actual LL language, I decided to pull out an ancient tool I had and give it a try.

Lola: Reviving Ancient Lisp Code

Back in the 80s, I wrote a little lisp called Kalypso. One of the sub-projects resulted in an LL parser generator called Lola. LL parsers are a lot easier to understand than LALR parsers, so that's what I wrote.

A program written in a long-disused dialect of lisp offers two choices:

1) Get the lisp interpreter running again.

2) Re-write the program in an available language.

I started trying to get Kalypso working again, and decided that it was just too hard. Kalypso was not very portably written and depended on a lot of historical architecture, including the structure of a.out files and the mapping of memory.

So, as I was writing a Python-like language anyways, I decided to transliterate Lola into Python. It's now likely the least "Pythonic" program around, as it reflects a lot of common Lisp-like programming ideas. I've removed the worst of the recursive execution, but it is still full of list operations. The Python version uses tuples for lists, and provides 'head' and 'rest' operations to manipulate them. I probably should have just called these 'car' and 'cdr'...

One benefit of using Lisp was that I could write the grammar as s-expressions and avoid needing a parser for the Lola input language. Of course, Lola is a parser generator, so it actually bootstraps itself by having the grammar for the Lola language written as Python data structures, generating a parser for that and then parsing the user's input. Here's what the Lola grammar looks like, in Lola grammar syntax:

start       : non-term start
        |
        ;
non-term    : SYMBOL @NONTERM@ COLON rules @RULES@ SEMI
        ;
rules       : rule rules-p
        ;
rules-p     : VBAR rule rules-p
        |
        ;
rule        : symbols @RULE@
        ;
symbols     : SYMBOL @SYMBOL@ symbols
        |
        ;

Lola now has a fairly clean input syntax, including the ability to code actions in C (or whatever language). It has two output modules: a Python module that generates just the Python parse table structure, and a C module that generates a complete parser, ready to incorporate into your application much as Bison does.

Lola is available in my git repository, https://keithp.com/cgit/lola.git/

Actions in Bison vs Lola

Remember how I said that Bison makes processing synthetic attributes really easy? Well, the same is not true of the simple LL parser generated by Lola.

Actions in Lola are chunks of C code executed when they appear at the top of the parse stack. However, there isn't any context for them in the parsing process itself; the parsing process discards any knowledge of production boundaries. As a result, the actions have to manually track state on a separate attribute stack. There are pushes to this stack in one action that are expected to be matched by pops in another.
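Here's a sketch of the pattern, using invented helper names rather than Lola's generated code; an action like @SYMBOL@ pushes a value that a later action like @RULE@ must pop, in matching order:

#include <stdio.h>

static int value_stack[32];
static int value_top;

static void
value_push(int v)
{
    value_stack[value_top++] = v;
}

static int
value_pop(void)
{
    return value_stack[--value_top];
}

/* run for each SYMBOL in a rule, e.g. at @SYMBOL@ */
static void
action_symbol(int token_value)
{
    value_push(token_value);
}

/* run at the end of the rule, e.g. at @RULE@; it must pop exactly
 * what the earlier actions pushed, in reverse order */
static void
action_rule(void)
{
    int second = value_pop();
    int first  = value_pop();

    value_push(first + second);     /* the rule's synthesized value */
}

int
main(void)
{
    /* simulate the parser running the actions for one two-symbol rule */
    action_symbol(2);
    action_symbol(3);
    action_rule();
    printf("%d\n", value_pop());    /* prints: 5 */
    return 0;
}

Nothing in the parser checks that the pushes and pops line up; the pairing is entirely by convention.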

The resulting actions are not very pretty, and writing them is somewhat error-prone. I'd love to come up with a cleaner mechanism, and I've got some ideas, but those will have to wait for another time.

Bison vs Lola in Newt

Bison generated 4kB of parse tables and a 1470-byte parser. Lola generates 2kB of parse tables and a 1530-byte parser, so switching has saved about 2kB of memory. Most of the parser code in both cases is probably the actions, which is my guess as to why the two parsers are similar in size. I think the 2kB savings is worth it, but it's a close thing for sure.