
OAT - Protocol Experiments

13 Feb, 2015

I've recently been learning Jison, Lex and Yacc, and Ragel for parsing text formats into data structures. Part of the motivation for this was reading Eric Raymond's excellent book The Art of UNIX Programming again.

The code for these experiments is here:

I invested a lot of time in learning these technologies, expecting the payoff to be the ability to write parsers directly for pretty much any text protocol or file format.

The reality though is that:

I wrote three implementations, one with pure Jison (in the cmd directory), one with C and Ragel and one with Lex and Yacc:

Once the C and Ragel version was finished, I hit an intermittent problem every 1,000,000 or so requests where the Ragel-generated code would misbehave. Since something was going wrong between my code and Ragel, but not in a reproducible way, it was basically impossible to debug:

By the way, the above code follows many of the recommendations from Zed Shaw's excellent Learn C the Hard Way, and uses some of his macros too.

Update Sept 2016: It looks like this book is for purchase only now.

The code also uses a TDD approach with unit tests written in C and functional tests written in bash.

Although I still like the oat protocol, the reality is that pure-binary protocols like MessagePack are much easier to parse because they send the length of each variable-length value before the value itself, so you can allocate enough memory before reading it. There are also implementations for lots of languages, and the Python implementation even supports a streaming parser out of the box. As a result, I'd probably use MessagePack over TCP rather than oat in any real-world application.
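
To make the length-prefix point concrete, here is a minimal sketch in JavaScript (Node.js) that decodes only MessagePack's string types. The readStr helper is hypothetical and written just for this illustration; it is not part of any MessagePack library, and a real application would use an existing implementation instead.

```javascript
// Sketch only: handles the MessagePack string family (fixstr, str 8, str 16, str 32).
// readStr is a hypothetical helper, not an API from any MessagePack library.
function readStr(buf, offset = 0) {
  if (offset >= buf.length) return null;            // nothing to read yet
  const tag = buf[offset];

  let headerSize;                                    // tag byte plus any length bytes
  if (tag >= 0xa0 && tag <= 0xbf) headerSize = 1;    // fixstr: length lives in the tag
  else if (tag === 0xd9) headerSize = 2;             // str 8: 1 length byte
  else if (tag === 0xda) headerSize = 3;             // str 16: 2 length bytes
  else if (tag === 0xdb) headerSize = 5;             // str 32: 4 length bytes
  else throw new Error('not a string tag: 0x' + tag.toString(16));

  if (buf.length < offset + headerSize) return null; // header not fully received

  let length;
  if (headerSize === 1) length = tag & 0x1f;         // low 5 bits of the tag
  else if (headerSize === 2) length = buf.readUInt8(offset + 1);
  else if (headerSize === 3) length = buf.readUInt16BE(offset + 1);
  else length = buf.readUInt32BE(offset + 1);

  const start = offset + headerSize;
  const end = start + length;
  if (buf.length < end) return null;                 // body not fully received

  // The length is known before the body is touched, so no scanning for a delimiter.
  return { value: buf.toString('utf8', start, end), next: end };
}
```

Calling readStr(Buffer.from([0xa3, 0x6f, 0x61, 0x74])) yields the string 'oat' and the offset of the next value, while a truncated buffer returns null so the caller can wait for more bytes. A text protocol has to scan for a terminator and unescape as it goes, which is where most of the parsing effort ends up.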

One of the experiments in the oat repo is dsv.

It is basically a program that takes data like this, with | delimiters between values, ';\n' delimiters between rows, and percent-encoded values:

1|2|3;
4|5|%25|6;

And parses it to JavaScript:

[
    [ '1', '2', '3' ],
    [ '4', '5', '%', '6' ]
]
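
The same transformation can be sketched in a few lines of plain JavaScript. This is not the Jison grammar from the repo, just a hand-written illustration; the function name parseDsv is an assumption, and it assumes values are percent-encoded in a way decodeURIComponent understands.

```javascript
// Hand-written sketch of the dsv transformation, not the Jison version in the repo.
// parseDsv is a hypothetical name chosen for this example.
function parseDsv(text) {
  const pieces = text.split(';\n');
  // Every row ends with ';\n', so the split leaves one empty piece at the end.
  if (pieces[pieces.length - 1] === '') pieces.pop();
  // Split each row on '|' first, then percent-decode each value. Decoding last
  // means an encoded '|' (%7C) inside a value is never mistaken for a delimiter.
  return pieces.map((row) => row.split('|').map((v) => decodeURIComponent(v)));
}

console.log(parseDsv('1|2|3;\n4|5|%25|6;\n'));
```

Because delimiters never appear unescaped inside a value, two split calls are enough; there is no quoting state to track.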

The dsv format used here is much easier to parse than CSV because the escaping of awkward values is much simpler. Also, | is less common than , in most data, so it probably makes a better delimiter.
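
To show how little escaping is involved, here is a sketch of the writing side in JavaScript. The names escapeValue and encodeDsv are hypothetical, and the set of characters that get percent-encoded (%, |, ;, carriage return and newline) is my assumption about the minimum needed to keep values unambiguous, not something taken from the repo.

```javascript
// Sketch of a dsv writer. escapeValue and encodeDsv are hypothetical names, and
// the escaped character set is an assumption, not taken from the oat repo.
function escapeValue(value) {
  // Percent-encode the escape character itself, both delimiters, and line breaks.
  return String(value).replace(/[%|;\r\n]/g, (ch) =>
    '%' + ch.charCodeAt(0).toString(16).toUpperCase().padStart(2, '0'));
}

function encodeDsv(rows) {
  // Values are joined with '|' and every row is terminated with ';\n'.
  return rows.map((row) => row.map(escapeValue).join('|') + ';\n').join('');
}

console.log(encodeDsv([['1', '2', '3'], ['4', '5', '%', '6']]));
```

A CSV writer, by contrast, has to decide when to wrap a value in quotes and then double any quotes inside it, and the reader has to track whether it is currently inside a quoted field.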

Copyright James Gardner 1996-2020 All Rights Reserved.