---
title: Building a Toy Lexer in Ruby
published: "2020-07-15"
publisher: Honeybadger
author: Alex Braha Stoll
category: Ruby articles
tags:
  - Ruby
description: "Lexers are magical. They take your messy, hand-typed, human text, and convert it into a clean data structure that the computer can process. Every time you run a ruby program, use structured search or type in a date by hand, you'll find a lexer hard at work. In this article, Alex Braha Stoll pulls back the curtain to show us how lexers work and how to implement one for a simple programming language."
url: "https://www.honeybadger.io/blog/building-lexer-ruby/"
---

> ## Full Source on Github
> 
> A complete implementation of the Stoffle programming language is available at [GitHub](https://github.com/alexbrahastoll/stoffle). This reference implementation has a lot of comments to help make reading the code easier. Feel free to open an issue if you find bugs or have questions.

In this blog post, we're going to implement the lexer for Stoffle, a toy programing language built entirely in Ruby. You can read more about this project in the [first part of this series](https://www.honeybadger.io/blog/).

Learning how to build a lexer, even if it's a simple one, will make you a better programmer. It will help you understand how programming languages really work and will open doors to other fields, such as natural-language processing. It might even help you ace that technical interview.

## What's a Lexer?

Lexers, also known as scanners, convert strings of characters into groups of tokens. Imagine we have the following line of code:

```ruby
my_var = 1
```

This code may mean something to us as human beings, but to the computer, it's just text.

Our lexer should be able to parse this text and generate a data structure that looks something like this:

```ruby
[:identifier, :'=', :number]
```

Once parsed, we no longer need the orgininal text. Implementing our programming language will be a matter of transforming and interpreting the data structure generated by the lexer.

## How do you Build a Lexer?

If I were to give an elevator pitch of our implementation strategy, it would sound something like this:

We read the source code character by character, trying to assign each character to a specific "lexeme." Lexemes tell us which tokens to use in our final data structure.

Here are some helpful definitions:

- **Lexeme:** the sequence of characters that matches the pattern associated with a token. Stoffle uses an `fn` token to define a function, and to produce this token, our lexer must find the lexeme `"fn"` present in the source code.
    
- **Token:** a data structure that conveys the meaning of a lexeme. For example, the lexeme `"42.42"` would be interpreted as a number. Its token would probably include the numeric value `42.42`.
    

## Introducing Stoffle

Before we can build a lexer for Stoffle's syntax, we need to have some idea of what the syntax actually is.

The snippet below shows a simple Stoffle program. It sums all integers between two numbers.

```stoffle
fn sum_integers: first_integer, last_integer i = first_integer sum = 0 while i <= last_integer sum = sum + i i = i + 1 end println(sum) end sum_integers(1, 100)
```

At this point, our objective is not to comprehend every corner of the syntax, in which characters and combinations thereof are allowed, or every bit of the semantics (i.e., the meanings of characters and their combinations), but to start gaining a general sense of the Stoffle programming language.

> ## The Mathematician Who Inspired our Stoffle Sample Program
> 
> While thinking of a simple and suitable sample program, stories that a friend of mine used to tell about different mathematicians came to mind. [Carl Friedrich Gauss](https://en.wikipedia.org/wiki/Carl_Friedrich_Gauss#Early_years) supposedly figured out on his own, at only 7 years old, a formula to sum up numbers in a series.

## Stoffle's Reserved Words

After reading the sample above, you may be thinking: "this looks a lot like Ruby!". Well, you're darn right if that crossed your mind! It's no mystery that here at Honeybadger, we adore the Ruby programming language. Stoffle is heavily inspired by Ruby; however, there are some minor differences to keep things fresh.

Let's review the main differences between Ruby and Stoffle:

- In Stoffle, we use `fn` instead of `def` to declare functions;
- We indicate the parameters a function receives by inserting a `:` after the function's name instead of using matching `()`;
- We always use `()` when calling a function, even when it is not receiving any parameters.

Before being able to actually build a lexer for Stoffle, it's necessary to spell out the whole list of characters/keywords we will be looking for. So, here we go:

```
( ) : , . + - * / ! = == != > < >= <= " # true false and or if else elsif end fn return while println nil
```

## Getting Started: Implementing a Token

A token is a data structure. In our implementation, this will be expressed as a Ruby class with a couple attributes. Here are its essential bits:

```ruby
require 'forwardable' module Stoffle class Token extend Forwardable attr_reader :type, :lexeme, :literal, :location def_delegators :@location, :line, :col, :length def initialize(type, lexeme, literal, location) @type = type @lexeme = lexeme @literal = literal @location = location end # ... end end
```

The `@type` holds a symbol to identify which token we're dealing with. Our convention is to expect the symbol to be the lexeme (i.e., a string matching the token) converted to a Symbol type; for example, the symbol expected for the `"("` lexeme is `:'('`.

The exception is tokens whose pattern is a rule instead of a fixed set of characters. Identifiers, such as keywords and variable and function names, in Stoffle must be started by a letter and can also have numbers and underlines in their composition. As an example, the token object created for a `my_variable` identifier will have the `:identifier` symbol in its `@type` attribute.

The `@lexeme` is the string matching the pattern expected by a token. Here are some examples: `":"`, `">="`, `"my_var"`.

The `@literal` is optional and will be used for tokens that have a runtime value associated with them. For example, a `:number` token may have `"2020"` as the lexeme and `2020.0` as its value.

The `@location` holds a Ruby `Struct` containing the line and the column where we can find the lexeme at the source code. It also has the length of the lexeme. We won't go into detail about this localization mechanism, but you can check the `Location` implementation at [GitHub](https://github.com/alexbrahastoll/stoffle/blob/master/lib/stoffle/location.rb).

## The Lexer's Core Logic

The most important methods of the `Lexer` class are responsible for giving life to the basic strategy we discussed above. The public API of the `Lexer` class is very simple; after creating the object, we just have to call `#start_tokenization` to get the list of tokens that compose the source code that was supplied. `#start_tokenization` calls the private method `#tokenize` repeatedly to have each token produced. `#tokenize` - as we will see shortly - interacts with many other smaller, private methods, each responsible for dealing with a specific case (some examples: `#string`, `#number`, and `#identifier`).

Let's start from the beginning with `#start_tokenization`:

```ruby
class Lexer # ... def initialize(source) @source = source @tokens = [] @line = 0 @next_p = 0 @lexeme_start_p = 0 end def start_tokenization while source_uncompleted? tokenize end tokens << Token.new(:eof, '', nil, after_source_end_location) end # ... end
```

We loop through the source code until we reach the end of the file. As we consume new characters, we try to match them to various tokens that are part of Stoffle. To mark the end of the source in a clean manner, the method ends by adding a final `:eof` (eof stands for end of file) token to the instance variable `@tokens`.

The meat of the implementation is in the method we repeatedly call from `#start_tokenization`, the `#tokenize` method:

```ruby
# inside Stoffle::Lexer def tokenize self.lexeme_start_p = next_p token = nil c = consume return if WHITESPACE.include?(c) return ignore_comment_line if c == '#' if c == "\n" self.line += 1 tokens << token_from_one_char_lex(c) if tokens.last&.type != :"\n" return end token = if ONE_CHAR_LEX.include?(c) token_from_one_char_lex(c) elsif ONE_OR_TWO_CHAR_LEX.include?(c) token_from_one_or_two_char_lex(c) elsif c == '"' string elsif digit?(c) number elsif alpha_numeric?(c) identifier end if token tokens << token else raise('Unknown character.') end end
```

The `#tokenize` method is responsible, along with the help of auxiliary methods, for analyzing the current character, classifying it, and then emitting the appropriate token. Depending on the scenario, it may be necessary to consume multiple next characters before we're able to make a decision on which token to produce.

Before we explore the methods responsible for determining whether a character or group of characters match with the pattern associated with each specific token, we have to comprehend the mechanism used to progress through the source code. Its implementation happens mostly in two methods, `#lookahead` and `#consume`. First, there is the `#lookahead` method:

```ruby
# inside Stoffle::Lexer def lookahead(offset = 1) lookahead_p = (next_p - 1) + offset return "\0" if lookahead_p >= source.length source[lookahead_p] end
```

To move through the source code, we use three "pointers". These are variables holding integers whose values are a position in the array of characters that is our source code (the instance variable `@source`). Two of them - `@lexeme_start_p` and `@next_p` - are instance variables. The third one, `lookahead_p`, is local to `#lookahead`.

Both `@lexeme_start_p` and `@next_p` are initialized to 0 when a `Lexer` object is created. In our lexer, we're always consuming one character at a time. `@next_p` has the job of pointing to the **next** character in the pipeline. The task of `@lexeme_start_p` is a bit different, to track the position of the character that started the current lexeme (remember, some lexemes are made of multiple chars!).

The `#lookahead` method returns a character that is offset from the char we're currently consuming by an arbitrary number of positions. This offset is given by the value of the parameter `offset`, defaulting to 1. Notice that since **the offset is from the character currently being consumed**, calling `#lookahead` with an offset of 1 is equivalent to `source[next_p]`. The `#lookahead` implementation also includes a guard to protect us against going after the end of the source.

A simple but crucial bit is that when comparing `lookahead_p` with `source.length` we use `>=` because `source` is a plain old array whose index starts at 0. Notice also that when using `#lookahead`, we are only **looking** and not **consuming** (i.e., moving `@next_p`) characters.

If you understood what I elucidated above, `#consume` will be a stroll in the park and requires no further explanation:

```ruby
def consume c = lookahead self.next_p += 1 c end
```

## Whitespace Chars, Comments, and Newlines

Now, let's go back to `#tokenize` and inspect it part-by-part. First, let's look at the beginning of its body:

```ruby
# inside Stoffle::Lexer#tokenize self.lexeme_start_p = next_p token = nil c = consume return if WHITESPACE.include?(c) return ignore_comment_line if c == '#' if c == "\n" self.line += 1 tokens << token_from_one_char_lex(c) if tokens.last&.type != :"\n" return end
```

Every time we start the process of trying to emit the next token, we have to set `@lexeme_start_p` to the value being held by `@next_p` since this is, potentially, the pointer to the beginning of the next lexeme. We then consume the character we are about to analyze by calling our old and dear `#consume` method.

Next, we start by checking three uninteresting but relevant scenarios. If our character matches one of the chars we are classifying as whitespace, we don't generate a token and simply move on by returning. The same is true if we detect our character is a `#`, but in this case, we ignore all chars until the end of the line, not only the current one. If you want to check out the implementation of `#ignore_comment_line`, please take a look at Stoffle's GitHub repo.

The last lines of this segment are quite significant. As you might have realized, Stoffle does not use any explicit terminator, such as the `;` we see in many popular programming languages. In Ruby, a newline is used to terminate a line of code.

Here, we could go for more sophisticated rules, but since Stoffle is an educational project, it's better to keep it as simple as possible. In Stoffle, every newline at the end of a non-empty line is considered meaningful. It is trivial and easy to implement rule of thumb, but it has some consequences we must be aware of. The following snippet, which would be valid in Ruby, is **not** valid Stoffle code:

```stoffle
two = 1 + 1
```

If we try to later on run a program that is splitting a line in this fashion, we'll get an error. I think this is something we can live with and a reasonable compromise to keep things as uncomplicated as possible. So, if we find out the current char is a `\n`, we have to do two things:

- Advance a counter that keeps track of the current line we are processing (this will be used for precise error reporting, but it's out of the scope of this post);
- We only emit a newline if the last token is not also a newline token. A newline token marks the end of a significant line of code, so we don't want to emit tokens for blank lines that are only being used to organize the source code.

## One and Two Character Lexemes, Strings, Numbers, and Identifiers

Let's continue delving into `#tokenize`. Now, the main part of the method:

```ruby
# inside Stoffle::Lexer#tokenize token = if ONE_CHAR_LEX.include?(c) token_from_one_char_lex(c) elsif ONE_OR_TWO_CHAR_LEX.include?(c) token_from_one_or_two_char_lex(c) elsif c == '"' string elsif digit?(c) number elsif alpha_numeric?(c) identifier end
```

The first two branches of this conditional expression are straightforward. We first determine whether our character matches one of the multiple lexemes of length 1. If so, we emit the appropriate token. In case that doesn't happen, we test against a special group of lexemes containing 1 or 2 characters. This group contains lexemes such as `"="` and `"=="`.

The strategy here is not difficult; if the current character matches a lexeme of size 1 belonging to this group, we just make sure we're in fact dealing with the lexeme in question by taking a look at the next character in the source. The idea is to always "bite" as many chars as we can! This appproach guarantees, for example, that `==` results in the equality token being produced (and not the assignment token twice).

It's time now for us to inspect the `#string` method, which is called when we know we are dealing with a string:

```ruby
# inside Stoffle::Lexer def string while lookahead != '"' && source_uncompleted? self.line += 1 if lookahead == "\n" consume end raise 'Unterminated string error.' if source_completed? consume # consuming the closing '"'. lexeme = source[(lexeme_start_p)..(next_p - 1)] # the actual value of the string is the content between the double quotes. literal = source[(lexeme_start_p + 1)..(next_p - 2)] Token.new(:string, lexeme, literal, current_location) end
```

Here, we use the `#lookahead` method we described earlier (combined with the also familiar `#consume`) to keep consuming characters until the next char to be processed is the closing `"`. If we can't find the closing `"` before the end of the source, we raise an error. Causing our lexer to fail in such a manner is definitely not the best approach, but it will do for now.

Finally, we use our pointers to extract the lexeme and the actual value of the string. We include double quotes for the lexeme but not for the value.

In case our character being processed at `#tokenize` failed the string test, we next determine whether it is a number. If so, here's how we handle it and generate the associated token:

```ruby
# inside Stoffle::Lexer def number consume_digits # Look for a fractional part. if lookahead == '.' && digit?(lookahead(2)) consume # consuming the '.' character. consume_digits end lexeme = source[lexeme_start_p..(next_p - 1)] Token.new(:number, lexeme, lexeme.to_f, current_location) end
```

In this method, we continue moving through the source as long as the next characters are digits; this is basically what `#consume_digits` does. Note that we do take numbers with a fractional part into consideration. To reduce complexity, all numbers in Stoffle are floats, so we make sure to leverage Ruby's `#to_f` method to make the process of generating an actual float value for the token a piece of cake.

For characters that had no match up to this point, we perform a final verification to see if we are dealing with an alphanumeric. If positive, the method `#identifier` is the one used to handle the task:

```ruby
# inside Stoffle::Lexer def identifier while alpha_numeric?(lookahead) consume end identifier = source[lexeme_start_p..(next_p - 1)] type = if KEYWORD.include?(identifier) identifier.to_sym else :identifier end Token.new(type, identifier, nil, current_location) end
```

In this case, the plan is to keep consuming the next characters until we find one that is not a number, letter, or underscore, which is what `#alpha_numeric?` is verifying. For identifiers, there are two scenarios. We may be dealing with a language keyword (such as `fn` or `while`) or an identifier defined by the programmer (e.g., a variable name). Since keywords can't be used to name a variable or a function, we first try to match the lexeme against one of Stoffle's keywords. Not being able to do so, we are sure that the right thing to do is generate a generic `:identifier` token.

Finally, here's the last bit of `#tokenize`:

```ruby
# inside Stoffle::Lexer#tokenize if token tokens << token else raise('Unknown character.') end
```

We push the most recently generated token to `@tokens`. However, if no token was emitted, we can be sure at this point that we hit a character unsupported by Stoffle, and we should then ring the alarm.

## Your Next Steps

Hooray, we have just gone through the principal parts of our lexer! Since it would be exhausting to explore all the minor details, I urge you to scan the complete implementation at GitHub. If you like, take a look at the [test suite](https://github.com/alexbrahastoll/stoffle/blob/master/spec/lib/stoffle/lexer_spec.rb) included, as the examples provided there may help illuminate some detail of the implementation that you may have found mysterious at first glance.

> ## Other Usages of a Lexer
> 
> Now that you understand the basics on how to implement a lexer, you may be thinking of where else lexers are applied. A curious usage is in the field of natural language processing, in which a lexer can be used to transform raw input, either text or sound, into a more structured list of tokens to be leveraged by higher layers in a processing pipeline. Expect, however, these lexers to be significantly more complex than the one we have just implemented.

## Parting Thoughts

Our little language is starting to come to life! We now have a working lexer that is able to produce the valuable tokens our parser will need to do its job. The star of the next post in the series will be the parser itself. We'll see you soon in the next step of bringing Stoffle into being!

---

## Try Honeybadger for FREE

Intelligent logging, error tracking, and Just Enough APM™ in one dev-friendly platform. Find and fix problems before users notice.

[Start free trial](https://app.honeybadger.io/users/sign_up)

[See plans and pricing](https://www.honeybadger.io/plans/)