Full Source on Github
A complete implementation of the Stoffle programming language is available at GitHub. This reference implementation has a lot of comments to help make reading the code easier. Feel free to open an issue if you find bugs or have questions.
In this blog post, we're going to implement the lexer for Stoffle, a toy programing language built entirely in Ruby. You can read more about this project in the first part of this series.
Learning how to build a lexer, even if it's a simple one, will make you a better programmer. It will help you understand how programming languages really work and will open doors to other fields, such as natural-language processing. It might even help you ace that technical interview.
What's a Lexer?
Lexers, also known as scanners, convert strings of characters into groups of tokens. Imagine we have the following line of code:
my_var = 1
This code may mean something to us as human beings, but to the computer, it's just text.
Our lexer should be able to parse this text and generate a data structure that looks something like this:
[:identifier, :'=', :number]
Once parsed, we no longer need the orgininal text. Implementing our programming language will be a matter of transforming and interpreting the data structure generated by the lexer.
How do you Build a Lexer?
If I were to give an elevator pitch of our implementation strategy, it would sound something like this:
We read the source code character by character, trying to assign each character to a specific "lexeme." Lexemes tell us which tokens to use in our final data structure.
Here are some helpful definitions:
Lexeme: the sequence of characters that matches the pattern associated with a token. Stoffle uses an
fntoken to define a function, and to produce this token, our lexer must find the lexeme
"fn"present in the source code.
Token: a data structure that conveys the meaning of a lexeme. For example, the lexeme
"42.42"would be interpreted as a number. Its token would probably include the numeric value
Before we can build a lexer for Stoffle's syntax, we need to have some idea of what the syntax actually is.
The snippet below shows a simple Stoffle program. It sums all integers between two numbers.
fn sum_integers: first_integer, last_integer i = first_integer sum = 0 while i <= last_integer sum = sum + i i = i + 1 end println(sum) end sum_integers(1, 100)
At this point, our objective is not to comprehend every corner of the syntax, in which characters and combinations thereof are allowed, or every bit of the semantics (i.e., the meanings of characters and their combinations), but to start gaining a general sense of the Stoffle programming language.
The Mathematician Who Inspired our Stoffle Sample Program
While thinking of a simple and suitable sample program, stories that a friend of mine used to tell about different mathematicians came to mind. Carl Friedrich Gauss supposedly figured out on his own, at only 7 years old, a formula to sum up numbers in a series.
Stoffle's Reserved Words
After reading the sample above, you may be thinking: "this looks a lot like Ruby!". Well, you're darn right if that crossed your mind! It's no mystery that here at Honeybadger, we adore the Ruby programming language. Stoffle is heavily inspired by Ruby; however, there are some minor differences to keep things fresh.
Let's review the main differences between Ruby and Stoffle:
- In Stoffle, we use
defto declare functions;
- We indicate the parameters a function receives by inserting a
:after the function's name instead of using matching
- We always use
()when calling a function, even when it is not receiving any parameters.
Before being able to actually build a lexer for Stoffle, it's necessary to spell out the whole list of characters/keywords we will be looking for. So, here we go:
( ) : , . + - * / ! = == != > < >= <= " # true false and or if else elsif end fn return while println nil
Getting Started: Implementing a Token
A token is a data structure. In our implementation, this will be expressed as a Ruby class with a couple attributes. Here are its essential bits:
require 'forwardable' module Stoffle class Token extend Forwardable attr_reader :type, :lexeme, :literal, :location def_delegators :@location, :line, :col, :length def initialize(type, lexeme, literal, location) @type = type @lexeme = lexeme @literal = literal @location = location end # ... end end
@type holds a symbol to identify which token we're dealing with. Our convention is to expect the symbol to be the lexeme (i.e., a string matching the token) converted to a Symbol type; for example, the symbol expected for the
"(" lexeme is
The exception is tokens whose pattern is a rule instead of a fixed set of characters. Identifiers, such as keywords and variable and function names, in Stoffle must be started by a letter and can also have numbers and underlines in their composition. As an example, the token object created for a
my_variable identifier will have the
:identifier symbol in its
@lexeme is the string matching the pattern expected by a token. Here are some examples:
@literal is optional and will be used for tokens that have a runtime value associated with them. For example, a
:number token may have
"2020" as the lexeme and
2020.0 as its value.
@location holds a Ruby
Struct containing the line and the column where we can find the lexeme at the source code. It also has the length of the lexeme. We won't go into detail about this localization mechanism, but you can check the
Location implementation at GitHub.
The Lexer's Core Logic
The most important methods of the
Lexer class are responsible for giving life to the basic strategy we discussed above. The public API of the
Lexer class is very simple; after creating the object, we just have to call
#start_tokenization to get the list of tokens that compose the source code that was supplied.
#start_tokenization calls the private method
#tokenize repeatedly to have each token produced.
#tokenize - as we will see shortly - interacts with many other smaller, private methods, each responsible for dealing with a specific case (some examples:
Let's start from the beginning with
class Lexer # ... def initialize(source) @source = source @tokens =  @line = 0 @next_p = 0 @lexeme_start_p = 0 end def start_tokenization while source_uncompleted? tokenize end tokens << Token.new(:eof, '', nil, after_source_end_location) end # ... end
We loop through the source code until we reach the end of the file. As we consume new characters, we try to match them to various tokens that are part of Stoffle. To mark the end of the source in a clean manner, the method ends by adding a final
:eof (eof stands for end of file) token to the instance variable
The meat of the implementation is in the method we repeatedly call from
# inside Stoffle::Lexer def tokenize self.lexeme_start_p = next_p token = nil c = consume return if WHITESPACE.include?(c) return ignore_comment_line if c == '#' if c == "\n" self.line += 1 tokens << token_from_one_char_lex(c) if tokens.last&.type != :"\n" return end token = if ONE_CHAR_LEX.include?(c) token_from_one_char_lex(c) elsif ONE_OR_TWO_CHAR_LEX.include?(c) token_from_one_or_two_char_lex(c) elsif c == '"' string elsif digit?(c) number elsif alpha_numeric?(c) identifier end if token tokens << token else raise('Unknown character.') end end
#tokenize method is responsible, along with the help of auxiliary methods, for analyzing the current character, classifying it, and then emitting the appropriate token. Depending on the scenario, it may be necessary to consume multiple next characters before we're able to make a decision on which token to produce.
Before we explore the methods responsible for determining whether a character or group of characters match with the pattern associated with each specific token, we have to comprehend the mechanism used to progress through the source code. Its implementation happens mostly in two methods,
#consume. First, there is the
# inside Stoffle::Lexer def lookahead(offset = 1) lookahead_p = (next_p - 1) + offset return "\0" if lookahead_p >= source.length source[lookahead_p] end
To move through the source code, we use three "pointers". These are variables holding integers whose values are a position in the array of characters that is our source code (the instance variable
@source). Two of them -
@next_p - are instance variables. The third one,
lookahead_p, is local to
@next_p are initialized to 0 when a
Lexer object is created. In our lexer, we're always consuming one character at a time.
@next_p has the job of pointing to the next character in the pipeline. The task of
@lexeme_start_p is a bit different, to track the position of the character that started the current lexeme (remember, some lexemes are made of multiple chars!).
#lookahead method returns a character that is offset from the char we're currently consuming by an arbitrary number of positions. This offset is given by the value of the parameter
offset, defaulting to 1. Notice that since the offset is from the character currently being consumed, calling
#lookahead with an offset of 1 is equivalent to
#lookahead implementation also includes a guard to protect us against going after the end of the source.
A simple but crucial bit is that when comparing
source.length we use
source is a plain old array whose index starts at 0. Notice also that when using
#lookahead, we are only looking and not consuming (i.e., moving
If you understood what I elucidated above,
#consume will be a stroll in the park and requires no further explanation:
def consume c = lookahead self.next_p += 1 c end
Whitespace Chars, Comments, and Newlines
Now, let's go back to
#tokenize and inspect it part-by-part. First, let's look at the beginning of its body:
# inside Stoffle::Lexer#tokenize self.lexeme_start_p = next_p token = nil c = consume return if WHITESPACE.include?(c) return ignore_comment_line if c == '#' if c == "\n" self.line += 1 tokens << token_from_one_char_lex(c) if tokens.last&.type != :"\n" return end
Every time we start the process of trying to emit the next token, we have to set
@lexeme_start_p to the value being held by
@next_p since this is, potentially, the pointer to the beginning of the next lexeme. We then consume the character we are about to analyze by calling our old and dear
Next, we start by checking three uninteresting but relevant scenarios. If our character matches one of the chars we are classifying as whitespace, we don't generate a token and simply move on by returning. The same is true if we detect our character is a
#, but in this case, we ignore all chars until the end of the line, not only the current one. If you want to check out the implementation of
#ignore_comment_line, please take a look at Stoffle's GitHub repo.
The last lines of this segment are quite significant. As you might have realized, Stoffle does not use any explicit terminator, such as the
; we see in many popular programming languages. In Ruby, a newline is used to terminate a line of code.
Here, we could go for more sophisticated rules, but since Stoffle is an educational project, it's better to keep it as simple as possible. In Stoffle, every newline at the end of a non-empty line is considered meaningful. It is trivial and easy to implement rule of thumb, but it has some consequences we must be aware of. The following snippet, which would be valid in Ruby, is not valid Stoffle code:
two = 1 + 1
If we try to later on run a program that is splitting a line in this fashion, we'll get an error. I think this is something we can live with and a reasonable compromise to keep things as uncomplicated as possible. So, if we find out the current char is a
\n, we have to do two things:
- Advance a counter that keeps track of the current line we are processing (this will be used for precise error reporting, but it's out of the scope of this post);
- We only emit a newline if the last token is not also a newline token. A newline token marks the end of a significant line of code, so we don't want to emit tokens for blank lines that are only being used to organize the source code.
One and Two Character Lexemes, Strings, Numbers, and Identifiers
Let's continue delving into
#tokenize. Now, the main part of the method:
# inside Stoffle::Lexer#tokenize token = if ONE_CHAR_LEX.include?(c) token_from_one_char_lex(c) elsif ONE_OR_TWO_CHAR_LEX.include?(c) token_from_one_or_two_char_lex(c) elsif c == '"' string elsif digit?(c) number elsif alpha_numeric?(c) identifier end
The first two branches of this conditional expression are straightforward. We first determine whether our character matches one of the multiple lexemes of length 1. If so, we emit the appropriate token. In case that doesn't happen, we test against a special group of lexemes containing 1 or 2 characters. This group contains lexemes such as
The strategy here is not difficult; if the current character matches a lexeme of size 1 belonging to this group, we just make sure we're in fact dealing with the lexeme in question by taking a look at the next character in the source. The idea is to always "bite" as many chars as we can! This appproach guarantees, for example, that
== results in the equality token being produced (and not the assignment token twice).
It's time now for us to inspect the
#string method, which is called when we know we are dealing with a string:
# inside Stoffle::Lexer def string while lookahead != '"' && source_uncompleted? self.line += 1 if lookahead == "\n" consume end raise 'Unterminated string error.' if source_completed? consume # consuming the closing '"'. lexeme = source[(lexeme_start_p)..(next_p - 1)] # the actual value of the string is the content between the double quotes. literal = source[(lexeme_start_p + 1)..(next_p - 2)] Token.new(:string, lexeme, literal, current_location) end
Here, we use the
#lookahead method we described earlier (combined with the also familiar
#consume) to keep consuming characters until the next char to be processed is the closing
". If we can't find the closing
" before the end of the source, we raise an error. Causing our lexer to fail in such a manner is definitely not the best approach, but it will do for now.
Finally, we use our pointers to extract the lexeme and the actual value of the string. We include double quotes for the lexeme but not for the value.
In case our character being processed at
#tokenize failed the string test, we next determine whether it is a number. If so, here's how we handle it and generate the associated token:
# inside Stoffle::Lexer def number consume_digits # Look for a fractional part. if lookahead == '.' && digit?(lookahead(2)) consume # consuming the '.' character. consume_digits end lexeme = source[lexeme_start_p..(next_p - 1)] Token.new(:number, lexeme, lexeme.to_f, current_location) end
In this method, we continue moving through the source as long as the next characters are digits; this is basically what
#consume_digits does. Note that we do take numbers with a fractional part into consideration. To reduce complexity, all numbers in Stoffle are floats, so we make sure to leverage Ruby's
#to_f method to make the process of generating an actual float value for the token a piece of cake.
For characters that had no match up to this point, we perform a final verification to see if we are dealing with an alphanumeric. If positive, the method
#identifier is the one used to handle the task:
# inside Stoffle::Lexer def identifier while alpha_numeric?(lookahead) consume end identifier = source[lexeme_start_p..(next_p - 1)] type = if KEYWORD.include?(identifier) identifier.to_sym else :identifier end Token.new(type, identifier, nil, current_location) end
In this case, the plan is to keep consuming the next characters until we find one that is not a number, letter, or underscore, which is what
#alpha_numeric? is verifying. For identifiers, there are two scenarios. We may be dealing with a language keyword (such as
while) or an identifier defined by the programmer (e.g., a variable name). Since keywords can't be used to name a variable or a function, we first try to match the lexeme against one of Stoffle's keywords. Not being able to do so, we are sure that the right thing to do is generate a generic
Finally, here's the last bit of
# inside Stoffle::Lexer#tokenize if token tokens << token else raise('Unknown character.') end
We push the most recently generated token to
@tokens. However, if no token was emitted, we can be sure at this point that we hit a character unsupported by Stoffle, and we should then ring the alarm.
Your Next Steps
Hooray, we have just gone through the principal parts of our lexer! Since it would be exhausting to explore all the minor details, I urge you to scan the complete implementation at GitHub. If you like, take a look at the test suite included, as the examples provided there may help illuminate some detail of the implementation that you may have found mysterious at first glance.
Other Usages of a Lexer
Now that you understand the basics on how to implement a lexer, you may be thinking of where else lexers are applied. A curious usage is in the field of natural language processing, in which a lexer can be used to transform raw input, either text or sound, into a more structured list of tokens to be leveraged by higher layers in a processing pipeline. Expect, however, these lexers to be significantly more complex than the one we have just implemented.
Our little language is starting to come to life! We now have a working lexer that is able to produce the valuable tokens our parser will need to do its job. The star of the next post in the series will be the parser itself. We'll see you soon in the next step of bringing Stoffle into being!