Zero downtime migrations: 500 million rows

In this article I'm going to go over some of the tricks we use to handle large data migrations at Honeybadger.

Data makes things hard

When you have a lot of data, your life gets harder. When you only have 1000 rows, you can make DB-wide changes in IRB. With millions of rows, it's not that easy.

If you don't believe me, just try it. RAM will spike. Your app will get slow. It might even stop working altogether.

Of course it depends on your hardware. If you happen to be running on an underpowered EC2 instance, you might start having problems at 10,000 rows.

Forget about Rails migrations

When I say "migration" here, I'm not talking about Rails migrations. Don't get me wrong. I love 'em. Use 'em all the time. But they're not what I'm talking about.

What is a migration then?

It's a big change to your database.

Imagine that you have a database of mailing addresses. 5 million of them.

The data entry people have been sloppy, and so the addresses are in all kinds of formats. But you really need them to be in strict USPS format.

Fortunately, there's an API you can use for that. Unfortunately, you need to make an API call for each of the 5 million addresses.

...and you can't have any downtime.

That's what I call a migration. You may have another word for it. Maybe a four-letter word. But that's just semantics.
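To make the shape of that concrete, here's a minimal sketch of what such a script might look like. The UspsClient class and the standardized column are made up for illustration; the important parts are working in batches and throttling so production stays responsive.

`# Hypothetical sketch: standardize addresses in batches, gently.
Address.where(standardized: false).find_each(batch_size: 500) do |address|
  result = UspsClient.standardize(address.attributes) # hypothetical API client
  address.update(
    street: result.street,
    city:   result.city,
    state:  result.state,
    zip:    result.zip,
    standardized: true
  )
  sleep 0.05 # throttle so the API and your database both get breathing room
end`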

There's no magic bullet

I'm going to be talking about Postgres here, but Mongo users have the same problems.

No matter what we do, computers stubbornly refuse to be infinitely fast. For any computer, there's a dataset big enough to overwhelm it.

Understanding this is a kind of turning point in the training of young developers. I had to learn it the hard way.

Why Facebook gamers hate me, personally

I remember when I learned the lesson that big datasets are fundamentally different from small datasets.

Back when I was just starting out, I did some contract work for a Facebook gaming company. Not a big one. There was just one other dev. But the game was pretty popular, and this was the height of mafia/jewel/farm game mania.

The database was MySQL, but all the data was stored as JSON inside a text field on the user model. This, of course, meant that instead of ugly SQL

`UPDATE users SET something=true;`

...you could use elegant Ruby:

`User.find_each do |u|
  u.update(data: JSON.parse(u.data).merge(something: true).to_json)
end`

One time I had to write a rake task that tweaked a data structure for every user. It took about 30 minutes to run. That wouldn't have been a problem, except I hadn't planned for the "in-between" state. The application was still expecting the old data structure, but not everybody had that anymore. The result was a small UI bug, and a lot of pissed off Facebook users. Fortunately, they had an in-game forum that they could use to yell at me.

How do you plan for big migrations?

Problems are simple

Luckily, you've only got two problems to worry about:

  • The app breaking

  • The app slowing down

Yay!

Solutions are more complex

To avoid these, you need to start thinking differently about changes to your app and your data.

  • Embrace multi-step deploys

  • Expect database-wide changes to take an indefinite amount of time

  • Expect the scripts you wrote to do the DB migration to break

  • Plan ahead for pegged CPU, RAM and disk IO

  • Map out an escape route: You may need it

Code!

Battle-tested

A lot of people write blog posts about things they've never done. But this isn't one of those.

Here at Honeybadger HQ, we've used all the tricks I'm about to show you IRL.

Prepare your app to handle in-between time

If a migration takes a day, then for that day part of your data is "pending" and part is "processed".

Your app needs to be able to handle both pending and processed data.

Suppose you're splitting a single name field into first_name and last_name. Your code might look like this:

`class User < ActiveRecord::Base
  before_save :split_name

  def to_s
    pending? ? name : [first_name, last_name].join(" ")
  end

  # A record is "pending" until the backfill has populated the new columns
  def pending?
    first_name.blank?
  end

  def split_name
    self.first_name, self.last_name = name.split(" ", 2)
  end
end`

Now that you have two representations of a user's name, you need to:

  • Make sure you update both representations when new records are saved

  • Use the new data when possible, but fall back to the old data

  • Remove this ASAP. It's ugly as sin.

Know which data have been migrated

If your migration script fails, you need to be able to re-start it and pick up where you left off.

It can be as simple as saving processed ids to a flat file (sketched below). But we like to use the rollout gem.
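Here's a rough sketch of the flat-file version, just to show the idea; the migrate_user method is a stand-in for whatever work your migration actually does:

`# A resumable loop that checkpoints processed ids to a flat file.
done_file = Rails.root.join("tmp", "migrated_ids.txt")
done_ids  = File.exist?(done_file) ? File.readlines(done_file).map(&:to_i) : []

scope = User.all
scope = scope.where.not(id: done_ids) if done_ids.any?

File.open(done_file, "a") do |log|
  scope.find_each(batch_size: 1000) do |user|
    migrate_user(user) # hypothetical: your actual migration work
    log.puts(user.id)  # checkpoint, so a restart can skip this record
    log.flush
  end
end`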

Rollout

The rollout gem is built to make incremental releases easier. It lets you set a "feature available" flag on any model and gives you an easy way to see if the flag is set.

For the back end

We recently did a migration for all of our Projects. After each project was done, we set a flag like so:

`$rollout.activate_user(:done, project)`
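The migration loop itself can then check the same flag and skip anything that's already done, which makes it safe to re-run after a crash (the migrate_project method here is a stand-in for the real work):

`Project.find_each do |project|
  next if $rollout.active?(:done, project) # already migrated; skip on re-run
  migrate_project(project)                 # hypothetical: the actual migration work
  $rollout.activate_user(:done, project)
end`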

For the front end

Then in our front-end code, we could invoke either the new or old behavior based on that flag.

`render($rollout.active?(:done, project) ? "new_thing" : "old_thing")`

Granted, the rollout method names are a little out of sync with the use-case. But it works nicely.

Log all the things...separately

Chances are, you'll want to keep an eye on your migration. But if the output from your migration is mixed in with production data, it can be hard to tell what's happening.

And it'd be nice to have debug-level logging too, just in case your rake task aborts.

Use Rails environments

Here's a cool trick. Just use a separate "migration" Rails environment. All you need to do is add a few lines to your config/database.yml

`migration:
  adapter: postgresql
  database: ...
  username: ...
  password: ...`

Now all of your logs will be saved to log/migration.log. And if you're reporting errors to Honeybadger, they'll be flagged as migration errors. Pretty nifty, eh?
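One caveat: depending on your Rails version, you'll probably also want a matching config/environments/migration.rb so the new environment boots with sensible settings. A common pattern (an assumption here, not a requirement of the trick itself) is to reuse the production config and turn logging up, then run your task with RAILS_ENV=migration.

`# config/environments/migration.rb
# Reuse production settings, but log everything (assumed setup).
require Rails.root.join("config/environments/production")

Rails.application.configure do
  config.log_level = :debug
end`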

Keep an eye on your database

Large datasets tend to distort space and time. A totally innocent select count(*) from users will work fine with 1000 users, but take forever with 10,000,000 users.

Also, large datasets tend to accumulate strangeness. Did you forget to limit the user's name to 1000 chars? You're likely to find at least one user with a 2MB name.

Monitor long-running queries

It's important to check your database for queries taking an abnormally long time to run.

In postgres this is easy. The following command will update once per second and show you current queries, as well as how long they’ve been running.

`watch -n1 "psql -c \"select substring(query from 0 for 120) q,  now() - query_start from pg_stat_activity where state='active' order by query_start limit 20\""`

Limit all the things

If you do find that queries are taking longer than they should, chances are that you forgot to put a limit somewhere.

Suppose that you know a user never has more than 100 items in their shopping cart. Nobody has ever bought more than 100 items.

Well, you’re going to find someone who’s put 100,000 items in their cart. And when you do, all of those items are going to be loaded into RAM.
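In code, that means capping or batching anywhere you load an association, even when you're sure it's small. A couple of hedged examples (the cart_items association and process method are just stand-ins):

`# Cap the load explicitly...
items = user.cart_items.limit(100)

# ...or work in batches so the whole cart never sits in RAM at once
user.cart_items.find_each(batch_size: 100) do |item|
  process(item) # hypothetical per-item work
end`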

Limit all the strings

The same thing goes for string lengths. Sometimes you'll find abnormally long strings in your DB. You can truncate those right in SQL.

`select substring(message from 0 for 140) from tweets;`

Be careful with ActiveRecord

With that in mind, it's clear that you should prefer Rails' `find_each` over `all`, and that you should do as much processing as possible in Postgres.
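For instance, something like this keeps memory flat and pushes the heavy lifting into the database (the active column is just for illustration):

`# Bad: instantiates every user at once
User.all.each { |u| u.update(active: true) }

# Better: loads users 1,000 at a time
User.find_each(batch_size: 1000) { |u| u.update(active: true) }

# Best, when the logic allows it: one SQL statement, no Ruby objects at all
User.where(active: nil).update_all(active: true)`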

Avoid using production systems

Here at Honeybadger, we have servers whose only job in life is to mirror the production database and be available in case the primary dies.

These replicants are darn handy if you happen to be doing a read-intensive migration. When we switched to Elasticsearch, we had to run some queries that took 30 minutes to complete. By running them against the replicant, we were able to avoid slowing down production.

Doing this is pretty simple. Just create a new environment and point it at your replicant in the database.yml file:

`migration:
  adapter: postgresql
  host: replicant_hostname
  database: ...
  username: ...
  password: ...`

Hot swap systems

You can take this one step further if you like. Just do the migration completely on the replicant then cut over to it. The replicant becomes the new master and the master becomes the new replicant.

And paranoia

Of course there’s no reason that your changes on server A should affect server Z, but it never hurts to keep an eye on server Z. Surprising things can happen when you network computers together.
