The more you know about your tools, the better decisions you will make as a developer. It's often useful — especially when debugging performance issues — to understand what Ruby is actually doing when it runs your program.
In this post we'll follow the journey of a simple program as it's lexed, parsed and compiled into bytecode. We'll use the tools that Ruby gives us to spy on the interpreter every step of the way.
Don't worry — even if you're not an expert this post should be pretty easy to follow. It's more of a guided tour than a technical manual.
Meet our sample program
As an example, I'm going to use a single if/else statement. To save space, I'll write it using the ternary operator. But don't be fooled: it's just an if/else.
x > 100 ? 'foo' : 'bar'
As you'll see, even a simple program like this gets translated into quite a lot of data as it is processed.
Note: All of the examples in this post were written in Ruby (MRI) 2.2. If you're using another implementation of Ruby, the examples probably won't work as written.
Before the Ruby interpreter can run your program it has to convert it from a somewhat free-form programming language into more structured data.
The first step might be to break the program into chunks. These chunks are called tokens.
# This is a string
"x > 1"

# These are tokens
["x", ">", "1"]
The Ruby standard library provides a module called Ripper that lets us process Ruby code in much the same way as the Ruby interpreter.
In the example below we are using the tokenize method on our Ruby code. As you can see, it returns an array of tokens.
require 'ripper'
Ripper.tokenize("x > 1 ? 'foo' : 'bar'")
# => ["x", " ", ">", " ", "1", " ", "?", " ", "'", "foo", "'", " ", ":", " ", "'", "bar", "'"]
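One handy property of the token list is that whitespace is kept as tokens too, so nothing from the source is thrown away. Here is a small sketch using our actual sample program (the `source` and `tokens` variable names are just for illustration):

```ruby
require 'ripper'

source = "x > 100 ? 'foo' : 'bar'"
tokens = Ripper.tokenize(source)

# Print each token on its own line, quoted so the whitespace tokens are visible.
tokens.each { |token| puts token.inspect }

# Because no characters are dropped, joining the tokens gives back the source.
puts tokens.join == source
```

For valid code like this, joining the tokens reproduces the original string.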
The tokenizer is pretty stupid. You can feed it completely invalid Ruby and it will still tokenize it.
# bad code
Ripper.tokenize("1var @= \/foobar`")
# => ["1", "var"]
Lexing is one step beyond tokenization. The string is still broken into tokens, but additional data is added to the tokens.
In the example below we use Ripper to lex our small program. As you can see, it now tags each token with a type: an identifier (:on_ident), an operator (:on_op), an integer (:on_int), and so on.
require 'ripper'
require 'pp'
pp Ripper.lex("x > 100 ? 'foo' : 'bar'")
# [[[1, 0], :on_ident, "x"],
#  [[1, 1], :on_sp, " "],
#  [[1, 2], :on_op, ">"],
#  [[1, 3], :on_sp, " "],
#  [[1, 4], :on_int, "100"],
#  [[1, 7], :on_sp, " "],
#  [[1, 8], :on_op, "?"],
#  [[1, 9], :on_sp, " "],
#  [[1, 10], :on_tstring_beg, "'"],
#  [[1, 11], :on_tstring_content, "foo"],
#  [[1, 14], :on_tstring_end, "'"],
#  [[1, 15], :on_sp, " "],
#  [[1, 16], :on_op, ":"],
#  [[1, 17], :on_sp, " "],
#  [[1, 18], :on_tstring_beg, "'"],
#  [[1, 19], :on_tstring_content, "bar"],
#  [[1, 22], :on_tstring_end, "'"]]
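Once each token carries a type, it becomes easy to filter the stream. As a rough sketch, here is one way to drop the whitespace tokens and keep just the type and text of everything else. The `significant` name is my own, and entries are indexed rather than destructured because newer Ruby versions may append extra fields to each entry.

```ruby
require 'ripper'

source = "x > 100 ? 'foo' : 'bar'"

# Each entry starts with [position, type, text].
# Index 1 is the token type and index 2 is the token text.
significant = Ripper.lex(source)
                    .reject { |entry| entry[1] == :on_sp }
                    .map { |entry| [entry[1], entry[2]] }

# Print a simple two-column listing of type and text.
significant.each do |type, text|
  puts "#{type.to_s.ljust(20)} #{text.inspect}"
end
```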
There is still no real syntax checking going on at this point. The lexer will happily process invalid code.
Now that Ruby has broken up the code into more manageable chunks, it's time for parsing to begin.
During the parsing stage, Ruby transforms the text into something called an abstract syntax tree, or AST. The abstract syntax tree is a representation of your program in memory.
You might say that programming languages in general are just more user-friendly ways of describing abstract syntax trees.
require 'ripper'
require 'pp'
pp Ripper.sexp("x > 100 ? 'foo' : 'bar'")
# [:program,
#  [[:ifop,
#    [:binary, [:vcall, [:@ident, "x", [1, 0]]], :>, [:@int, "100", [1, 4]]],
#    [:string_literal, [:string_content, [:@tstring_content, "foo", [1, 11]]]],
#    [:string_literal, [:string_content, [:@tstring_content, "bar", [1, 19]]]]]]]
It might not be easy to read this output, but if you stare at it for long enough you can kind of see how it maps to the original program.
# Define a program
[:program,
 # Do an "if" operation
 [[:ifop,
   # Check the conditional (x > 100)
   [:binary, [:vcall, [:@ident, "x", [1, 0]]], :>, [:@int, "100", [1, 4]]],
   # If true, return "foo"
   [:string_literal, [:string_content, [:@tstring_content, "foo", [1, 11]]]],
   # If false, return "bar"
   [:string_literal, [:string_content, [:@tstring_content, "bar", [1, 19]]]]]]]
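Unlike the tokenizer and the lexer, the parser does care about syntax. The sketch below pokes at the tree for our sample program and then feeds Ripper.sexp a deliberately broken variant. The `invalid` string is a made-up example of mine, not something from the interpreter's own test suite.

```ruby
require 'ripper'

valid   = "x > 100 ? 'foo' : 'bar'"
invalid = "x > 100 ? 'foo' :" # hypothetical broken input: the else branch is missing

tree = Ripper.sexp(valid)

# tree[0] is :program and tree[1] is the list of statements.
# Our program has a single statement: the ternary node.
first_statement = tree[1][0]
puts first_statement[0].inspect

# Ripper.sexp returns nil when the source can't be parsed.
puts Ripper.sexp(invalid).inspect
```

Because the tree is just nested arrays, ordinary array indexing is enough to walk it.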
At this point, the Ruby interpreter knows exactly what you want it to do. It could run your program right now. And before Ruby 1.9, it would have. But now, there's one more step.
Compiling to bytecode
Instead of traversing the abstract syntax tree directly, Ruby now compiles it into lower-level bytecode.
This bytecode is then run by the Ruby virtual machine.
We can take a peek into the inner workings of the virtual machine via the RubyVM::InstructionSequence class. In the example below, we compile our sample program and then disassemble it to make it human readable.
puts RubyVM::InstructionSequence.compile("x > 100 ? 'foo' : 'bar'").disassemble
# == disasm: <RubyVM::InstructionSequence:<compiled>@<compiled>>==========
# 0000 trace            1                                               (   1)
# 0002 putself
# 0003 opt_send_without_block <callinfo!mid:x, argc:0, FCALL|VCALL|ARGS_SIMPLE>
# 0005 putobject        100
# 0007 opt_gt           <callinfo!mid:>, argc:1, ARGS_SIMPLE>
# 0009 branchunless     15
# 0011 putstring        "foo"
# 0013 leave
# 0014 pop
# 0015 putstring        "bar"
# 0017 leave
Whoa! This suddenly looks a lot more like assembly language than Ruby. Let's step through it and see if we can make sense of it.
# Call the method `x` on self and save the result on the stack
0002 putself
0003 opt_send_without_block <callinfo!mid:x, argc:0, FCALL|VCALL|ARGS_SIMPLE>

# Put the number 100 on the stack
0005 putobject        100

# Do the comparison (x > 100)
0007 opt_gt           <callinfo!mid:>, argc:1, ARGS_SIMPLE>

# If the comparison was false, jump to offset 0015
0009 branchunless     15

# If the comparison was true, return "foo"
0011 putstring        "foo"
0013 leave
0014 pop

# Here's offset 0015. We jumped here if the comparison was false. Return "bar"
0015 putstring        "bar"
0017 leave
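If you'd rather work with the instructions as data than read the disassembly text, RubyVM::InstructionSequence also has a to_a method. The sketch below pulls out just the instruction names. It assumes MRI, and the exact names you see can vary between Ruby versions, so treat the output as illustrative.

```ruby
iseq = RubyVM::InstructionSequence.compile("x > 100 ? 'foo' : 'bar'")

# The last element of to_a is the instruction list. It mixes instructions
# (arrays) with line numbers and labels, so we keep only the arrays.
instructions = iseq.to_a.last.select { |entry| entry.is_a?(Array) }

# The first element of each instruction is its name, as a symbol.
names = instructions.map(&:first)
puts names.inspect
```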
The Ruby virtual machine (YARV) then steps through these instructions and executes them. That's it!
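You can ask the virtual machine to run a compiled instruction sequence yourself with the eval method. Our sample program calls a method named `x`, so in this sketch I've assumed a definition for it (returning 150) and placed it inside the compiled source to keep the example self-contained.

```ruby
# Assumption: `x` is defined here purely for illustration. Defining it inside
# the compiled source means it ends up as a top-level method when evaluated.
source = "def x; 150; end\nx > 100 ? 'foo' : 'bar'"

iseq = RubyVM::InstructionSequence.compile(source)

# eval hands the bytecode to the virtual machine and returns the result.
result = iseq.eval
puts result
```

Since 150 is greater than 100, the branch is not taken and the program returns "foo".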
This ends our very simplified, cartoony tour of the Ruby interpreter. With the tools I've shown you here, it's possible to take a lot of the guesswork out of how Ruby is interpreting your programs. I mean, it doesn't get more concrete than an AST. And next time you're stumped by some weird performance issue, try looking at the bytecode. It probably won't solve your problem, but it might take your mind off of it. :)