Comparing Rust parsers
Three main options come to mind when it comes to parsing in Rust: nom, pest and simple vanilla Rust. Let’s compare them.
What to parse
Let’s take a problem from Advent of Code as the string to parse. The input looks something like this:
[D]
[N] [C]
[Z] [M] [P]
1 2 3
move 1 from 2 to 1
move 3 from 1 to 3
move 2 from 2 to 1
move 1 from 1 to 2
The first part represents three stacks of crates. The second part is a list of instructions on how to move the crates.
Nom
Nom works by composing parsers. For example, to parse a single crate [X]
the following parser is used:
// parser for "[X]"
delimited(
tag("["),
take(1u8),
tag("]")
);
Both tag
and take
are parsers that are composed inside of the delimited
parser. This parser will only return the inside letter of a crate and discard the rest.
A parser for a single line of the move instructions has to following shape:
// parser for "move 1 from 2 to 1"
tuple((
preceded(tag("move "), u32), // "move" precedes "1"
preceded(tag(" from "), u32), // "from" precedes "2"
preceded(tag(" to "), u32), // "to" precedes "1"
)) // All inside a tuple
The cool thing is that u32
here is actually also a parser that will automatically parse the string into an integer for you. As you can see it’s all about composability.
Resources: nom repository, documentation, list of parsers
Pest
Parsers are defined in a special .pest
file. The parsing of a simple crate [X]
is done using
box = { "[" ~ ASCII_ALPHA_UPPER ~ "]" }
and a single move instruction
stack_number = { ASCII_DIGIT+ }
amount = { ASCII_DIGIT+ }
instr = { "move " ~ amount ~ " from " ~ stack_number ~ " to " ~ stack_number}
It is quite easy to read and write this syntax. However, the Rust code to run the parser must still be run. The function to convert the strings into integers is
fn parse_move(mut pairs: Pairs<Rule>) -> Instruction {
let amount = pairs.next().unwrap().as_str().parse::<u32>().unwrap();
let from = pairs.next().unwrap().as_str().parse::<usize>().unwrap();
let to = pairs.next().unwrap().as_str().parse::<usize>().unwrap();
Instruction { amount, from, to }
}
It’s pretty ugly. Also, if you make a change in the .pest
file you also have to make changes in your code.
Resources: playground, documentation, book.
Vanilla
Good ‘ol vanilla. You don’t have to learn anything new. Just use the reliable .split
. It’s also a bit more compact than the libraries.
Parsing not only a single crate but all of them with
// parses all the crates
s.split("\n")
.map(|s| s.chars().skip(1).step_by(4).collect::<Vec<_>>())
.collect::<Vec<_>>();
Parsing a single move instruction
// parses move instruction
let mut s = line.split(" ");
s.next();
let amount = s.next().unwrap().parse::<u32>().unwrap();
s.next();
let from = s.next().unwrap().parse::<usize>().unwrap();
s.next();
let to = s.next().unwrap().parse::<usize>().unwrap();
Instruction { amount, from, to }
Measurements
Comparing the different parsers is done using Criterion. This is a statistical benchmarking library written in Rust. Running the parsers gives the following plot.
Nom is the fastets, followed by the vanilla implementation, leaving pest in the last place. The funny thing is that pest claims on its home page that they are faster than nom. The creator of nom even wrote a post about this claim.
I wondered if any of the code that I wrote was slowing pest down, so I ran another benchmark with as little of my own code as possible.
Nom still wins. And nom is actually parsing strings to u32
here, which pest isn’t doing.
Another thing that bothered me was seeing how slow my vanilla implementation was compared to nom. Using callgrind and kcachegrind we can track down what’s slowing us down.
Parsing takes up 71% of the runtime while solving the actual problem is only 20%. And the parsing is almost entirely the parsing of the instructions. Following the instruction parsing node brings us to
We can see most of the time is spent in the next()
function of Split
, which we used earlier.
It is possible to avoid using split
by converting the input string into a list of bytes and picking the numbers out that way. Comparing the two versions shows a notable speedup.
Placing this new method next to the other parsers yields the final verdict.
Nom is faster still.
Conclusion
Nom is the fastest of them all, and also pretty easy to use. Pest is pretty slow and lies about it on its main page and the code isn’t very pretty. Vanilla code isn’t bad but would require more work to be on par with nom.
All the code is available on github.