When you're working on a young project you're constantly making decisions that will make it either easier or harder to scale later. Sometimes it's good to pick the short-term gain, to accrue a bit of technical debt so you can ship faster. But other times we pick up technical debt because we didn't know there was an alternative.
I can say this because here at Honeybadger, we did a few things that made life much harder on us than it had to be. Had we understood a few key points scaling would have been much less painful.
When you say "primary key" most of us think of an auto-incrementing number. This works well for small systems, but it introduces a big problem as you scale.
At any given moment only one database server can generate the primary key. That means all writes have to go through a single server. That's bad news if you want to do thousands of writes per second.
Using UUIDs as primary keys sidesteps this problem. If you're not familiar with them, UUIDs are unique identifiers that look like this:
Here's how wikipedia describes them:
When generated according to the standard methods, UUIDs are for practical purposes unique, without requiring a central registration authority or coordination between the parties generating them. The probability that a UUID will be duplicated is not zero, but is so close to zero as to be negligible.
Thus, anyone can create a UUID and use it to identify something with near certainty that the identifier does not duplicate one that has already been created to identify something else, and will not be duplicated in the future. Information labeled with UUIDs by independent parties can therefore be later combined into a single database, or transmitted on the same channel, without needing to resolve conflicts between identifiers.
When you use UUIDs as primary keys, all writes no longer have to go through a single database. Instead you can spread them out across many servers.
In addition they give you the flexibility to do things like generate the id of a record before its saved to the database. This might be useful if you want to send the record to a cache or search server, but don't want to wait for the database transaction to complete.
Enabling UUIDs by default in Rails apps is easy. Just edit the config file:
# config/application.rb config.active_record.primary_key = :uuid
You can use UUIDs for an individual table with Rails migrations:
create_table :users, id: :uuid do |t| t.string :name end
It's a simple configuration option that, if you enable it when you start development, will save you a world of trouble when you try to scale. As an extra bonus, it will make it harder for bots and bad actors to guess your private URLs.
Counts and counters
Once you start looking, counts and counters are everywhere. Your email client displays the number of unread emails. Blogs have a pagination footer that uses the total number of posts to calculate the number of pages.
There are two scaling problems with counters:
- Database queries like
select count(*) from usersare inherently slow. They literally loop through each record in the recordset to generate their results. If you have a million records, it's going to take a while.
- Attempts to speed up counters by using "counter caches" do work, but they limit your ability to spread writes across many database servers.
The easiest solution is to avoid using counters wherever possible. This is much easier when you're doing the initial design.
For example, you might choose to page by date-range instead of by count. You might choose not to show a mildly-useful statistic that will be hellish to generate later. You get the idea.
Expiring & warehousing data
There is an upper limit to how much data you can store in a single postgres table given the amount of RAM, CPU and disk IO you have. That means there will come a point when you need to move stale data out of your main tables.
Let's look at a simple case. You want to delete records more than a year old in a database that is a few gigabytes large.
If you've never dealt with this issue before you might be tempted to do something like this:
MyRecords.where("created_at < ?", 1.year.ago).destroy
The problem is that this query will take days or weeks to run. Your database is just too big.
This is a particularly painful problem, because often you don't realize you have it until it's too late. Hardly anyone thinks about data purging strategies when their company is young and their database has 1,000 records.
If you do manage to plan ahead, there is an easy solution. Just partition your tables. Instead of writing all of your data to
my_records you write this week's data to
my_records_1 and next week's data to
my_records_2. When it's time to delete last week's, just
drop table my_records_1. Unlike the delete, this query completes very quickly.
It's possible to partition by fields other than date, too. Whatever makes sense for your use-case.
There's even a postgres extension named pg_partman that takes care of all the details and allows you to partition your database without changing a line of your code. Or, if you prefer to have partitioning managed in Ruby, there's a handy gem called partitionable
Next time you find yourself building a project from scratch, I'd encourage you to take a few moments to think about scaling. Don't obsess over it. Don't spend days worrying about whether to use HAML or ERB. But do ask yourself if there aren't any easy wins that you can pick up simply by planning ahead.