Lexing with Ragel and Parsing with Yacc using Go

Occasionally I need to parse some input for processing and analysis. A Go Advent post from 2014 described handwritten lexers and parsers using Go for this purpose. The post is worth a read, describing how you can implement your own sql interpreter using nothing more than the standard library. The approach is used in InfluxDB’s query engine and can be quite powerful, but requires a certain amount of ceremony that can also be avoided. By using a parser generator like Yacc and a lexer like Ragel you can better define what the structure is from how you interpret that structure.

Interestingly the article states the difficulty with using parser generators like Ragel and Yacc. These tools can be difficult but with a baseline understanding they can be used to create powerful and performant state-machines for processing input that requires a specific structure. The complexity of a simple parser can only go so far but a properly implemented generator can drive entire applications efficiently. As a concrete example the official GraphQL library uses Yacc as its query interpreter. This Yacc parser creates an AST representing a GraphQL query that can be fed into another component for execution.

Unfortunately there is little documentation on how these tools work together using Go. Even the well-written go documentation is a little thin on its Yacc integration. This post attempts to alleviate the initial hurdle.

As a concrete example we’re going to implement the linux documentation project’s yacc thermostat example using Go. This little program maintains a state machine managing a thermostat: you can use a specific set of commands to turn heat on/off or set a specific temperature. We’ll implement a program which takes in a string of commands as text and outputs the current state at each step.

To interpret the set of commands we’ll need a way of processing the raw text input. The tldp example uses lex, but we’ll use Ragel with its Go support. We do this by extracting symbols from the raw input. Once we have enough symbols to figure out what to do, like heat on, (notice how both heat and on mean nothing on their own) we’ll take an action using Yacc. Ragel tokenizes the input string into symbols while Yacc handles a sequence of these symbols.

Both Ragel and Yacc are generators: you define your lexer and parser, conceptually a state machine, using their grammar, and you run a tool to generate code so your program can run the machine. It’s essentially a DSL with a slight exception in that you embed your own code — in this case Go code — into their DSL. Code generation removes a lot of the ceremony around processing input and can be very performant for processing.

A Ragel Lexer

To start out we need a Go struct that implements our lexer. We’ll define a file, lex.rl, to hold our struct. This Ragel file will be transpiled to Go using Ragel’s cli. Even though this is a .rl file we will mix Go code with the Ragel DSL, and everything will be outputted as Go code.

type lexer struct {
data []byte
p, pe, cs int
ts, te, act int

Our struct has a field to hold data for the interpreter as well as a set of odd variables. These variables are used by the generated Ragel code to maintain the current position of its state machine as it interprets the data array. ts and te, for instance, refer to token start and token end. They are described in the Ragel documentation which is worth a read.

Our lexer is driven by Yacc. A lexer used by Yacc must adhere to a specific interface so the generated Yacc code can continuously invoke the lexer looking for symbols. The interface is called yyLexer:

type yyLexer interface {
Lex(lval *yySymType) int
Error(e string)

We add these functions to a pointer of ourlexer struct with a skeleton implementation:

func (lex *lexer) Lex(out *yySymType) int {
tok := 0

The %%{ and }%%elements denote a block of Ragel commands. These blocks will be replaced by generated Go code. main is the main Ragel state machine and we’ll eventually add our checks to this block. I’ve omitted the full state machine for brevity; you can view the full contents of this file on Github.

We also need some additional Ragel commands at the top of the file to augment the generated Ragel code:

machine thermostat;
write data;
access lex.;
variable p lex.p;
variable pe lex.pe;

This block declares the name of our machine, thermostat, and some other declarations. Our Lex method is function off of the lex *lexerpointer which holds our state variables. We need to tell the code generator to prefix lex. when the generated code must access any of its token variables like ts and teotherwise the code won’t compile. We also do this for the p and pe variables (used for partially matched positioning) which isn’t explicitly handled by the access command.

Let’s revisit our Lex method. It’s important to understand the signature of Lex(out *yySymType) int and how it’s used by Yacc. Using Yacc you define a set of symbols and an arrangement of those symbols. Our command target temperature 68is defined using three symbols: TOKTARGET for the word target, TOKTEMPERATUREfor temperature and NUMBER for any set of digits. Tokens can have associated values, so ourNUMBER can also hold a value 68.

fmt.Println("\tTemperature set ", $3);

The $3 pulls out the value of the third token, NUMBER. We define tokens as part of the Yacc grammar. I’ve created a file thermostat.y that defines our thermostat, and I defined the following tokens I want the Ragel lexer to produce:


The Lex method takes a single parameter, a pointer to yySymType. The yySymType is generated by Yacc via the %uniondeclaration. Using%union you can declare fields on yySymType that can be associated with tokens. This allows you to set strongly-typed values from your lexer accessible in your Yacc file.

We need to two fields, one to determine the on/off state associated with STATE and another to hold the digit value for NUMBER:

%union {   
isOn bool
temp int

So in our fmt.Println("\tTemperature set ", $3) command, where $3 is pulled from NUMBER, Yacc will inject a call to yySymType.temp to get the actual value. We just have to ensure we set the temp field on the out parameter passed to Lex.

Let’s ensure we set the out parameter in our Ragel lexer. A Ragel engine defines a set of states, and what to do when those states are reached. In order to be compatible with Yacc, we must tell Yacc what token is matched and set the value for that token. As an example we’ll implement two of the five tokens we want to extract: NUMBER and TOKHEAT.

digit+ => { out.temp, _ = strconv.Atoi(string(lex.data[lex.ts:lex.te])); tok = NUMBER; fbreak;};
'heat' => { tok = TOKHEAT; fbreak;};

When we hit any sequence of numbers, via the regex digit+, we’ll set out.temp and set the tok variable to NUMBER. If we see the literal string 'heat' we don’t need to set any output, but we do need to return a TOKHEAT. The => operation denotes the action to be taken, and anything between the {} characters are snippets of Go code that will execute when the regex is matched. In the case of digit+ we use the ts and te variables (token start and token end) to pull out the digit from the data field and convert the bytes to an integer to set temp. ts and te tell you were exactly in the data buffer the engine has matched. Finally we call fbreak which is a Ragel command to break execution from the engine. This preserves current state and will return command to the parser. Yacc will continue to gobble up these individual symbols until it hits a sequence of symbols it’s supposed to know about, like TOKHEAT TOKTEMPERATURE NUMBER we defined above.

Finally our main method instantiates both our lexer and parser and passes some raw text input to our lexer, along with some go:generate commands that will invoke the ragel and go tool yacc commands to autogenerate our code from lex.rl and thermostat.y files (you can either run make to generate and compile or justgo generate to generate).

//go:generate ragel -Z -G2 -o lex.go lex.rl
//go:generate go tool yacc thermostat.y

Ragel and Yacc complement each other with some overlap. They both define their own grammar for processing input. In the case of Ragel the input is an array of bytes. It reads through these bytes trying to match against of set of regular expressions you’ve defined. Once it has found a match it will invoke an action you define. Our sample we’ve defines a function that is compatible with Yacc via the yyLexer interface. This interfaces outputs a set of defined tokens and optional corresponding values. These symbols can be assembled together to build more complex processing. You define a set of arrangements in which tokens should appear. If Yacc sees a sequence of tokens as you’ve defined them it will also invoke a function. In our example we simple call Println but you could do whatever you want, like build an AST or invoke an action that’s part of a protocol.

As a follow up, try and modify the lexer to take in an input stream so you can implement a cli-based application to interact with the thermostat.

Code. Create. Conquer.

Code. Create. Conquer.