First and foremost, let me say I have mixed feelings about the article. Although it addresses important issues, it concludes with a very simple lexing problem, not a challenging one. I think this alone greatly diminishes the strength of his argument.
Now let me give my own list of pros and cons of parser/lexer generators.
Pros
I am a big fan of DSLs. And parser/lexer generators obviously fall in this category of tools. They operate on a high level, declarative description of the language we wish to parse. This results in more maintainable software, with less bugs (most popular generators have been used extensively, so you can already be confident on the correctness of generated code).
At Nu Echo, I wrote two parsers for a language called ABNF (it's a language for defining grammars used in speech recognition applications), one using an LALR(1) parser generator and the other one being an hand-crafted recursive-descent parser. The latter is certainly faster and give better error diagnostics (I'll get back to this in a moment), but from a maintenance point of view, the former wins.
Another important aspect of parser generators is that they often support different parsing technologies/drivers. For instance, bison can generate LALR(1) or GLR parser (lalr-scm can, too). And ANTLR supports LL(1) but also LL(k) grammars. Without having to rewrite the source grammar. You can't do that with an hand-crafted parser. (Of course, you must stay in the same "family" of parsing technologies. A grammars written for an LR parser generator may not be appropriate for an LL parser generator, and vice versa.)
A corollary to this is that you can concentrate on offering the best interface to your users: the language itself. You don't have to adapt the language to a parsing technique. This is very important from a design point of view. Usability is (or should be?) one of the driving goals when designing a new language.
Cons
Unfortunately, they are more pragmatic issues that can preclude the use of parser/lexer generators. Here are a few:
- My experience with parser generators showed me that they are nice for command-line tools, but they do not produce good parsers for use in interactive environments (like an editor). (I must confess that I have almost exclusively used LR parser generators a la yacc/bison, so this argument may be a bit stretched. And I know that there have been quite a lot of work on interactive programming environments and incremental parsing.) Error recovery mechanisms are often limited and tend to produce cryptic error messages.
- Hand-crafted parsers are often much easier to debug. Ever tried to understand what a shift/reduce conflict is? And tried to figure out how to rewrite your grammar to resolve it? You have to understand the parsing technology. Talk about an abstraction leak!
- Generated parsers are derived resources in a development project (i.e. resources obtained from other resources). This means they are usually not kept in the CVS/git/SVN/... repository. This can complicate the build process. This may not be a big deal, but in some projects this can be trickier.
In some languages other than C/C++/Java, like Scheme, this may not be a problem since the parser generator can be called at macro-expansion time.
Et voilà! I'm surely missing a lot of other equally important aspects, but it's getting late and I am tired. And you? What are you reasons for using or not using a parser generator/lexer generator?




10 comments:
Some languages are too complex to hand off to a parser generator. Bjarne Stroustrup once lamented that the folks at Bell Labs twisted his arm to use yacc for the early versions of C++. He wished he'd just written a plain old recursive descent parser.
Currently, I'm maintaining a few OcamlYacc-based parsers. And really glad that the parsers weren't hand-written.
Funny, on the same day that was posted, I wrote an article about an ECMAScript compiler for Guile. It included an old version of your lalr-scm package.
At the time that I wrote the lexer, I didn't have a Scheme lexer generator, so I just wrote a hand-coded lexer. I considered writing the parser by hand as well, but that would be a terrible, terrible maintenance burden -- the generated parser has something like 450 states! So I'm happy with lalr-scm.
The one thing that I wish it had were access to the raw cons tokens, so that I can associate source information with tokens. Perhaps I'll hack that in (e.g. as $$1 or someting).
Anyway, nice article, and thanks for lalr-scm!
@Andy As you may probably know, the latest version of lalr-scm provides support for source location information. Could this new version be ported to Guile?
Hi Dominique,
I was actually unaware of that newer lalr-scm supported source locations. I'm sure that the new version can work with Guile. I'll look into this later this week.
Thanks!
I've talked with a bunch of really smart compiler guys, including Bob Jervis who wrote Turbo C++. They all say the same thing: most compiler authors don't use parser generators. Whatever the reason, the perceived benefits don't outweigh the costs and lack of flexibility.
I'm working on a new parser generator that is intended to be so good that hand-written parsers offer absolutely no benefit over it. It's called Gazelle: http://www.reverberate.org/gazelle/.
@Josh I tend to agree with your smart compiler guys. When you want to do something fancy (as you will inevitably have to do in a real product, as opposed to developing a small prototype or a toy compiler), parser generators are usually in the way. So instead of adapting them to your own requirements, you end up implementing the whole parser from scratch using top-down techniques. That's what i did a few times in the past.
Thanks for the reference to your Gazelle system. I'll take a look at it. Will it really succeed in eliminating our complaints about parser generators? ;-)
@Dominique: it absolutely will! :) Not yet it won't, but the more I work on it (and I work on it a lot) the more convinced I am that this is something big.
My top three goals are:
1. Gazelle parsers are reusable.
2. Gazelle parsers are fast.
3. Gazelle is flexible enough that you never run into a situation where you say "Gazelle is not giving me enough control."
I just finished working on a project where I found a new benefit to the use of a parser generator. This was a large project where a big team was coming up with a new DSL. I started writing a specially tailored parser, but the language spec simply changed too often. In the end I started using a yacc clone simply out of self-defense. It was just so much faster to tweak the grammar rules than the special-purpose code to handle each newly-modified construct.
"In some languages other than C/C++/Java, like Scheme, this may not be a problem since the parser generator can be called at macro-expansion time."
In C++, a parser generator can be called at template-expansion time (templates are C++'s hygienic macros, loosely speaking). This is actually done in the "Spirit" library, part of the boost libraries. Check it out, if you like C++.
George Kangas
@Georges I don't really like C++ ;-) but the Spirit library seems interesting. Thanks for the reference!
Post a Comment