Since I'm the one at Honeybadger primarily responsible for ops, and since we rely heavily on Postgres for everything we do, the GitLab incident struck close to home. We have fortunately never had a comparable failure at Honeybadger, but at a previous startup I did manage to wipe out the production database by mistake, so I know how it feels. Having read what happened at GitLab, and having just made some big changes to our infrastructure at Honeybadger, I thought now would be a good time to share how we run a sizeable Postgres installation. If nothing else, this will provide some additional documentation for Starr and Josh, should I ever get hit by a bus. :)
Backups & Disaster Recovery
When we first started Honeybadger we didn't have to worry much about scaling our database. The traffic was low enough that we just deployed a primary server and a backup server and used the default configuration options (with tuning by pg_tune). The only setup beyond installing the apt packages was configuring the replication. I set up streaming replication from the primary to the secondary, and I also set up wal-e on the primary to save the WAL segments to S3. This allowed the secondary to catch up with the primary from the WAL segments on S3 should the replication lag get so large that the segments were no longer available on the primary. It also allowed for disaster recovery in a separate datacenter, if necessary. In the worst-case scenario, we could spin up a new server in another datacenter, restore from the latest full backup generated by wal-e, then replay the remaining WAL segments to bring the new Postgres server up to date. I later set up a hot standby in a separate datacenter using exactly this method, and used streaming replication to keep that server in sync along with the in-datacenter replica.
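To make the restore order concrete, here is a minimal Python sketch of the worst-case recovery described above: pick the newest full backup, then replay the archived WAL segments that follow it. It is a toy model, not wal-e's API. The names BaseBackup and plan_restore are hypothetical, and WAL segments are simplified to plain integers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaseBackup:
    name: str       # hypothetical label for the full backup
    start_wal: int  # first WAL segment needed after restoring this backup


def plan_restore(backups, archived_wals):
    """Return the newest base backup and the WAL segments to replay on top of it."""
    if not backups:
        raise ValueError("no base backup available")
    # The newest backup is the one that starts at the highest WAL segment.
    latest = max(backups, key=lambda b: b.start_wal)
    available = set(archived_wals)
    replay = []
    segment = latest.start_wal
    # Replay has to be contiguous, so stop at the first gap in the archive.
    while segment in available:
        replay.append(segment)
        segment += 1
    return latest, replay
```

Calling plan_restore with two backups and an archive that is missing one segment returns the newer backup plus only the segments before the gap, which is why a complete WAL archive matters as much as the full backups themselves.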
As we added more customers and our workload increased, we added more sidekiq workers to handle the load. There was nothing remarkable about this until we hit the maximum number of connections allowed in our Postgres configuration. Eventually we ended up allowing 1024 connections, and at that point we decided we needed to bring in a connection pooler to take the load off the database. I evaluated pgpool and pgbouncer, and pgbouncer ended up working better for us. I really wanted the failover benefits that pgpool offered, but pgbouncer proved more stable, so I delayed my dream of having automated database failover. Using pgbouncer in transaction mode (and setting prepared_statements: false in database.yml) greatly reduced the number of active connections to Postgres, and it has been rock solid.
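The reason transaction mode cuts the connection count is that a client only holds a server connection for the duration of one transaction. The Python sketch below is a toy model of that idea, not pgbouncer's implementation. TransactionPool and the connection labels are hypothetical.

```python
import queue
import threading


class TransactionPool:
    """Toy model of transaction pooling: many clients share a few server connections."""

    def __init__(self, size):
        self._idle = queue.Queue()
        for i in range(size):
            self._idle.put(f"server-conn-{i}")  # hypothetical connection labels
        self._lock = threading.Lock()
        self._in_use = 0
        self.peak_in_use = 0

    def run_transaction(self, work):
        # Block until a server connection is free, then hold it for one transaction only.
        conn = self._idle.get()
        with self._lock:
            self._in_use += 1
            self.peak_in_use = max(self.peak_in_use, self._in_use)
        try:
            return work(conn)
        finally:
            with self._lock:
                self._in_use -= 1
            # Hand the connection back so another client can use it.
            self._idle.put(conn)
```

In this model, consecutive transactions from the same client can land on different server connections. That is also why session-level state such as prepared statements is unsafe in this mode: a statement prepared on one server connection is not guaranteed to exist on the next one.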
We recently moved from leasing bare metal servers to hosting everything on EC2. When I made this change, I knew it was time to stop pretending that servers don't die (since EC2 instances die all the time) and to come up with a database failover scenario that wouldn't involve one of us waking up at 3am. Achieving HA with a traditional relational database seems to be one of the eternal quests of operators, so it was with some trepidation that I once again set this goal for myself. I wasn't prepared to switch from pgbouncer to pgpool, so I looked for other options. Fortunately, in the time since I had last looked for a solution, two new, good candidates had arrived on the scene: Stolon and Patroni. After evaluating both, I opted for Patroni, and I got to work integrating it into our environment. It took a bit of head scratching to figure out how to get a Patroni-controlled Postgres instance to follow and fail over from a non-Patroni-controlled instance (our primary at the old datacenter), but I eventually got it, and it worked like a charm when it came time to do the cutover.
Patroni has high availability covered: if the leader Postgres instance dies, a leader election happens and one of the followers gets promoted to be the new leader. To handle failover, though, I had to find a way to get that change communicated to the pgbouncer instances. This task is handled by consul-template. Once the leader change is registered in Consul, the consul-template daemon running alongside each pgbouncer instance updates the pgbouncer configuration with the location of the new leader and reloads pgbouncer, which then relays database traffic to the new leader without breaking the database connection that the Rails application has with pgbouncer. Amazingly, it all seems to work. :)
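The render-and-reload step can be sketched in a few lines of Python. This is only an illustration of the idea, not consul-template itself and not our actual template. The function names, the database name, and the addresses are all hypothetical. The only real detail is the shape of the [databases] entry in pgbouncer's config file.

```python
def render_databases_section(db_name, leader_host, leader_port=5432):
    """Render the pgbouncer [databases] entry pointing at the current leader."""
    return f"[databases]\n{db_name} = host={leader_host} port={leader_port}\n"


def sync_config(current_text, db_name, leader_host, leader_port=5432):
    """Return (desired_text, needs_reload) for the leader currently registered.

    Actually writing the file and reloading pgbouncer (for example by sending
    it SIGHUP) is left out of this sketch.
    """
    desired = render_databases_section(db_name, leader_host, leader_port)
    # Only reload when the rendered config actually changed.
    return desired, desired != current_text
```

The point of comparing the rendered text against the current config is that pgbouncer only gets reloaded when the leader really moved, so routine polling does not disturb anything.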
Happy Servers, Happy Humans
It's been a lot of fun scaling up Honeybadger and making the infrastructure more resilient to failure. Kudos to all those who have created and contributed to the open source projects we use to make that happen!