A Rubyist's Guide to Big-O Notation
- By Starr Horne
- ruby
I don't have a degree in computer science. Lots of us Rubyists don't. So for a long time I avoided learning about big-O notation. It looks a little too much like higher math. I mean: O(N^2)
. Come on.
Instead, I learned rules of thumb:
- Finding a specific item in a Hash is faster than in an Array
- Avoid nested loops
- Watch out for accidental DB queries when generating lists in your views
These rules are nice, but unless you understand WHY they work, you're going to find yourself making mistakes and having inexplicable performance problems.
Why does it matter?
Big-O notation is just a fancy way of describing how your code's performance depends on the amount of data it's processing.
Performance means one of two things: speed, or ram usage. In computer science class you'd refer to these as "time complexity" and "space complexity." Big-O notation is used for both, but we're going to focus on speed here, since that seems to be the more common usage.
You would expect that processing an array of 100 items would be slower than an array of 10 items. But by how much? 10x, 100x? 1000x?
It's not a big deal with small datasets, but if your app gets exponentially slower for each row in the database it soon becomes a problem.
Before we get into the details, I made a quick chart showing common big-Os with emoji showing how they'll make you feel as your data scales.
Big O | Rank | Meaning |
---|---|---|
O(1) | 😎 | Speed doesn't depend on the size of the dataset |
O(log n) | 😁 | 10x the data means 2x more time |
O(n) | 😕 | 10x the data means 10x more time |
O(n log n) | 😖 | 10x the data means about 20x more time |
O(n^2) | 😫 | 10x the data take 100x more time |
O(2^n) | 😱 | The dilithium crystals are breaking up! |
So when someone says Array#bsearch
is better than Array#find
because it's O(log n)
vs O(n)
you can just compare 😁 to 😕 and see that they might be onto something.
For something a little more robust, check out the Big-O Cheat Sheet
Deciphering the Notation
You don't have to memorize all of the different Big-O values, so long as you understand how the notation works.
Take, for example, the horrible horrible O(2^n)
. If we were to express that in Ruby, it might look like this:
# O(2^n) translated to Ruby
def o(n)
2 ** n # This is ruby for 2^n
end
Still not obvious? How about if I rename the method and argument to something more user-friendly?
# O(2^n) translated to prettier Ruby
def execution_time(size_of_dataset)
2 ** size_of_dataset
end
You can do this for all of them:
# O(1)
def o1_execution_time(size_of_dataset)
1
end
# O(n)
def on_execution_time(size_of_dataset)
size_of_dataset
end
# O(n^2)
def on2_execution_time(size_of_dataset)
size_of_dataset * size_of_dataset
end
...etc
Now that you know how the notation works, let's take a look at some typical ruby code and see how it relates.
O(1)
When we say that something is O(1)
it means that its speed doesn't depend on the size of its data set.
For example, hash lookup times don't depend on the hash size:
# These should all take the same amount of time
hash_with_100_items[:a]
hash_with_1000_items[:a]
hash_with_10000_items[:a]
This is why we tend to say that hashes are faster than arrays for larger datasets.
O(n)
In contrast, Array#find
is O(n)
. That means Array#find
depends linearly on the number of items in the array. An array with 100 items will take 100 times longer to search than an array with one item
A lot of code that iterates over arrays follows the O(n)
pattern.
(0..9).each do |i|
puts i
end
# This example is 1/2 the speed of the previous because it contains 2x the items.
(0..19).each do |i|
puts i
end
O(n^2)
Code that fits an O(n^2)
profile tends to involve nested loops. That makes sense if you think about it. One loop gives you O(n)
, a second nested loop gives you O(n^2)
. If — for some unholy reason — you had a 5-level nested loop, it'd be O(n^5)
.
data = (0..100)
data.each do |d1|
data.each do |d2|
puts "#{ d1 }, #{ d2 }"
end
end
O(n log n)
O(n log n)
code is often the result of someone finding a clever way to reduce the amount of work that an otherwise O(n^2)
algorithm would do.
You wouldn't be able to just look at a piece of code and tell it's O(n log n)
. This is where the higher math comes in, and it's also where I bow out.
But it is important to know about O(n log n)
because it describes a lot of common search algorithms. Ruby's Array#sort
uses the venerable quicksort algorithm, which on average is O(n log n)
and worst-case is O(n^2)
If you're not familiar with quicksort, check out this excellent demonstration.
Putting it all together: Your Database
One of the most common problems with new web applications is that they're fast on the developer's computer, but in production they get slower and slower.
This happens because the amount of records in your database grows over time, but your code is asking the DB to do operations that don't scale well. ie. O(n)
or worse.
For example, did you know that in postgres, count queries are always O(n)
?
# This makes the DB iterate over every row of Users
# ...unless you're using a Rails counter cache.
Users.count
We can see this by using the postgres explain
command. Below, I use it to get the query plan for a count query. As you can see, it plans on doing a sequential scan (that means looping) over all 104,791 rows in the table.
# explain select count(*) from users;
QUERY PLAN
-----------------------------------------------------------------
Aggregate (cost=6920.89..6920.90 rows=1 width=0)
-> Seq Scan on users (cost=0.00..6660.71 rows=104701 width=0)
(2 rows)
Lots of common rails idioms can trigger unintentional sequential scans, unless you specifically optimize the database to prevent them.
# This asks the DB to sort the entire `products` table
Products.order("price desc").limit(1)
# If `hobby` isn't indexed, the DB loops through each row of Users to find it.
User.where(hobby: "fishing")
You can use the explain
command to see that as well. Here we see we're doing a sort (probably quicksort) on the whole table. If there are memory constraints, it may have chosen a different sorting algorithm with different performance characteristics.
# explain select * from users order by nickname desc limit 1;
QUERY PLAN
-------------------------------------------------------------------------
Limit (cost=7190.07..7190.07 rows=1 width=812)
-> Sort (cost=7190.07..7405.24 rows=104701 width=812)
Sort Key: nickname
-> Seq Scan on users (cost=0.00..6606.71 rows=104701 width=812)
The answer to all these problems is, of course, indexing. Telling the database to use an index is a little like in Ruby, when you use a hash lookup O(1)
instead of an array find O(n)
.
That's all, folks
I hope this has been a useful introduction to Big-O notation and how in impacts you as a ruby developer! Please ping me @StarrHorne if you have any questions.