
OAT - Protocol Experiments

13 Feb, 2015

I've recently been learning Jison, Lex and Yacc, and Ragel for parsing text formats into data structures. Part of the motivation for this was reading Eric Raymond's excellent book The Art of UNIX Programming again.

The code for these experiments is here:

https://bitbucket.org/thejimmyg/oat

I invested a lot of time in learning these technologies, expecting the payback to be the ability to write parsers directly for pretty much any text protocol or file format.

The reality though is that:

Almost anything real-world is very hard to parse this way. There are generally lots of edge cases that don't fit well with a single formal approach.
Lex and Yacc generate a huge amount of output, to the extent that I couldn't run even the simplest Lex and Yacc built code on my Pebble watch. It was simply too big. If you have to hand-write code for one of the devices you support, there is little point in then jumping through the Lex and Yacc hoops for a different device.
This is a deep and complex world, and one that isn't really worth getting involved in unless you have a grammar you know is designed to be parsed by a particular technique.

In particular, I put a lot of effort into designing a simple text protocol that looked a bit like a shell session, which I called oat.

I wrote three implementations, one with pure Jison (in the cmd directory), one with C and Ragel and one with Lex and Yacc:

https://bitbucket.org/thejimmyg/oat/src/e5417070654d563a77121b11aa126bfe6c61ad0c/jison/cmd/example/cmds?at=master&fileviewer=file-view-default
https://bitbucket.org/thejimmyg/oat/src/e5417070654d563a77121b11aa126bfe6c61ad0c/ragel/oat/?at=master
https://bitbucket.org/thejimmyg/oat/src/e5417070654d563a77121b11aa126bfe6c61ad0c/lex_and_yacc/?at=master

Once the C and Ragel version was finished, I got an intermittent problem every 1,000,000 or so requests where the Ragel-generated code would misbehave. Since something was going wrong between my code and Ragel, but not in a reproducible way, it was basically impossible to debug:

https://bitbucket.org/thejimmyg/oat/src/e5417070654d563a77121b11aa126bfe6c61ad0c/ragel/oat/?at=master

By the way, the above code follows many of the recommendations from Zed Shaw's excellent Learn C the Hard Way, and uses some of his macros too.

Update Sept 2016: It looks like this book is for purchase only now.

The code also uses a TDD approach with unit tests written in C and functional tests written in bash.

Although I still like the oat protocol, the reality is that binary protocols like MessagePack are much easier to parse because each variable-length value is preceded by its length, so the parser knows how much memory to allocate before it reads the value. There are also implementations for lots of languages, and the Python implementation even supports a streaming parser out of the box. As a result, I'd probably use MessagePack over TCP rather than oat in any real-world application.
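
To illustrate why a length prefix makes parsing easy, here is a minimal sketch in Python of decoding a MessagePack string by hand. It is not taken from the oat repo and it is not the API of any MessagePack library: decode_str is a hypothetical helper, and it only handles three of the string encodings (fixstr, str 8 and str 16).

```python
import struct


def decode_str(buf, pos=0):
    """Decode one MessagePack string starting at buf[pos].

    Hypothetical helper for illustration only. Returns (value, next_pos).
    """
    tag = buf[pos]
    if 0xA0 <= tag <= 0xBF:
        # fixstr: the length is stored in the low 5 bits of the tag byte
        length, start = tag & 0x1F, pos + 1
    elif tag == 0xD9:
        # str 8: the length is the single byte after the tag
        length, start = buf[pos + 1], pos + 2
    elif tag == 0xDA:
        # str 16: the length is a big-endian 16-bit integer after the tag
        (length,) = struct.unpack_from(">H", buf, pos + 1)
        start = pos + 3
    else:
        raise ValueError("not a string type handled by this sketch")

    end = start + length
    if end > len(buf):
        # The length is known up front, so a short buffer is easy to detect
        raise ValueError("truncated value: need more bytes")
    return buf[start:end].decode("utf-8"), end


# Two strings back to back: "oat" then "hi"
data = b"\xa3oat\xa2hi"
first, pos = decode_str(data)
second, pos = decode_str(data, pos)
print(first, second)
```

Because the length arrives before the data, there is no scanning for a terminator and no escaping to undo, and a reader on a socket knows exactly how many more bytes to wait for.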

One of the experiments in the oat repo is dsv.

It is basically a program that takes data like this, with | delimiters between values, ';\n' delimiters between rows, and percent-encoded values:

1|2|3;
4|5|%25|6;

And parses it to JavaScript:

[
    [ '1', '2', '3' ],
    [ '4', '5', '%', '6' ]
]

The dsv format used here is much easier to parse than CSV because awkward values are simply percent-encoded, so there are no quoting rules to handle. Also, | is less common than , in most data, so it probably makes a better delimiter.
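
As a rough sketch of how little code this takes, here is a dsv parser in Python (not the implementation in the oat repo, which I am not reproducing here). parse_dsv is a hypothetical name, and the sketch assumes that any literal |, ; or newline inside a value has been percent-encoded (%7C, %3B, %0A), which is what makes plain splitting safe.

```python
from urllib.parse import unquote


def parse_dsv(text):
    """Parse dsv text into a list of rows, each a list of decoded strings.

    Hypothetical helper for illustration only.
    """
    # Drop the terminator of the last row so the split below
    # does not produce a spurious empty row at the end.
    if text.endswith(";\n"):
        text = text[:-2]
    if not text:
        # Simplification: empty input and a lone ";\n" both give no rows
        return []
    # Split rows on ";\n", values on "|", then undo the percent-encoding.
    return [
        [unquote(value) for value in row.split("|")]
        for row in text.split(";\n")
    ]


sample = "1|2|3;\n4|5|%25|6;\n"
print(parse_dsv(sample))
# → [['1', '2', '3'], ['4', '5', '%', '6']]
```

The splitting happens before the decoding, so an encoded delimiter such as %7C ends up as a literal | inside a value instead of being treated as a separator.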