Our Postgres Infrastructure

We love Postgres at Honeybadger, but it does require some care and feeding. Here's what we've done to scale Postgres along with the growing needs of our business.

Since I'm the one at Honeybadger primarily responsible for ops, and since we rely heavily on Postgres for everything we do, the GitLab incident struck close to home. We have fortunately never had a comparable failure at Honeybadger, but at a previous startup I did manage to wipe out the production database by mistake, so I know how it feels. Having read what happened at GitLab, and having just made some big changes to our infrastructure at Honeybadger, I thought now would be a good time to share how we run a sizeable Postgres installation. If nothing else, this will provide some additional documentation for Starr and Josh, should I ever get hit by a bus. :)

Backups & Disaster Recovery

When we first started Honeybadger we didn't have to worry much about scaling our database. The traffic was low enough that we just deployed a primary server and a backup server and used the default configuration options (with tuning by pg_tune). The only setup beyond installing the apt packages was configuring the replication. I set up streaming replication from the primary to the secondary, and I also set up wal-e on the primary to save the WAL segments to S3. This allowed the secondary to catch up to the primary from the WAL segments on S3, should the replication lag get so large that the segments it needed were no longer available on the primary. It also allowed for disaster recovery in a separate datacenter, if necessary. In the worst-case scenario, we could spin up a new server in another datacenter, restore from the latest full backup generated by wal-e, then replay the remaining WAL segments to bring the new Postgres server up to date. I later set up a hot standby in a separate datacenter using exactly this method, and used streaming replication to keep that server in sync along with the in-datacenter replica.
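
To make the moving parts concrete, here is a minimal sketch of how wal-e typically plugs into Postgres, written as a small Python helper that prints the relevant settings. The command forms follow the wal-e README; the `envdir` directory and the data directory are assumed paths for illustration, not values from our servers, and where `restore_command` lives depends on the Postgres version.

```python
# Sketch only: env_dir and data_dir are assumed paths, not values from this post.

def wal_e_settings(env_dir="/etc/wal-e.d/env"):
    """Return the two Postgres settings that hook wal-e into archiving and recovery."""
    wal_e = f"envdir {env_dir} wal-e"
    return {
        # Runs on the primary for every completed WAL segment and ships it to S3.
        "archive_command": f"{wal_e} wal-push %p",
        # Runs on a replica or restored server to pull a missing segment from S3.
        "restore_command": f'{wal_e} wal-fetch "%f" "%p"',
    }


def disaster_recovery_steps(data_dir, env_dir="/etc/wal-e.d/env"):
    """List, in order, what rebuilding a server from S3 involves."""
    wal_e = f"envdir {env_dir} wal-e"
    return [
        # 1. Download the most recent full base backup into an empty data directory.
        f"{wal_e} backup-fetch {data_dir} LATEST",
        # 2. Point recovery at the WAL archive.
        "configure restore_command as returned by wal_e_settings()",
        # 3. Postgres then replays archived segments until it is up to date.
        "start Postgres so it replays the archived WAL segments",
    ]


if __name__ == "__main__":
    for name, value in wal_e_settings().items():
        print(f"{name} = '{value}'")
    for step in disaster_recovery_steps("/var/lib/postgresql/main"):
        print(step)
```

The same `restore_command` serves both cases described above: a lagging replica that can no longer stream a segment from the primary, and a brand-new server being rebuilt from the latest base backup.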

Connection Pooling

As we added more customers and our workload increased, we added more Sidekiq workers to handle the load. There was nothing remarkable about this until we hit the maximum number of connections allowed in our Postgres configuration. Eventually we ended up allowing 1024 connections, and at that point we decided we needed to bring in a connection pooler to take the load off the database. I evaluated pgpool and pgbouncer, and pgbouncer ended up working better for us. I really wanted the failover benefits that pgpool offered, but pgbouncer proved more stable, so I delayed my dream of having automated database failover. Using pgbouncer in transaction mode greatly reduced the number of active connections to Postgres, and it has been rock solid. Transaction mode does require setting prepared_statements: false in database.yml, because a client is not guaranteed to get the same server connection from one transaction to the next.
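
As a rough illustration of transaction-mode pooling, the sketch below renders a bare-bones pgbouncer.ini from Python. The setting names (`pool_mode`, `max_client_conn`, `default_pool_size`, `listen_port`) are real pgbouncer options, but the database name, host, and every number here are made-up placeholders rather than our production values, and a real file would also need listen address and authentication settings.

```python
# Sketch only: the database name, host, and pool sizes are hypothetical examples.

def render_pgbouncer_ini(db_name, pg_host, pg_port=5432,
                         max_client_conn=2000, default_pool_size=20):
    """Build a minimal pgbouncer.ini that pools connections per transaction."""
    lines = [
        "[databases]",
        # Clients connect to pgbouncer as db_name; pgbouncer forwards to pg_host.
        f"{db_name} = host={pg_host} port={pg_port} dbname={db_name}",
        "",
        "[pgbouncer]",
        "listen_port = 6432",
        # A server connection goes back to the pool as soon as a transaction ends.
        "pool_mode = transaction",
        # Many application connections are accepted on the client side...
        f"max_client_conn = {max_client_conn}",
        # ...but only this many Postgres connections are opened per user/database pair.
        f"default_pool_size = {default_pool_size}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_pgbouncer_ini("app_production", "10.0.0.5"))
```

The gap between `max_client_conn` and `default_pool_size` is where the savings come from: workers can hold many idle connections to pgbouncer while Postgres itself sees only a small, busy pool.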

High Availability

We recently moved from leasing bare metal servers to hosting everything on EC2. When I made this change, I knew it was time to stop pretending that servers don't die (since EC2 instances die all the time) and to come up with a database failover scenario that wouldn't involve one of us waking up at 3am. Achieving HA with a traditional relational database seems to be one of the eternal quests of operators, so it was with some trepidation that I once again set this goal for myself. I wasn't prepared to switch from pgbouncer to pgpool, so I looked for other options. Fortunately, in the time since I had last looked for a solution, two new, good candidates had arrived on the scene: Stolon and Patroni. After evaluating both, I opted for Patroni, and I got to work integrating it into our environment. It took a bit of head scratching to figure out how to get a Patroni-controlled Postgres instance to follow and fail over from a non-Patroni-controlled instance (our primary at the old datacenter), but I eventually got it, and it worked like a charm when it came time to do the cutover.


Patroni has high availability covered: if the leader Postgres instance dies, a leader election happens and one of the followers gets promoted to be the new leader. To handle failover, though, I had to find a way to get that change communicated to the pgbouncer instances. This task is handled by consul-template. Once the leader change is registered in Consul, the consul-template daemon running alongside each pgbouncer instance updates the pgbouncer configuration with the location of the new leader and reloads pgbouncer, which then relays database traffic to the new leader without breaking the database connection that the Rails application has with pgbouncer. Amazingly, it all seems to work. :)
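
The effect of that consul-template step can be sketched in a few lines of Python. This is not how consul-template itself is implemented (it renders a template file and then runs a configured command); the function below is a hypothetical stand-in that shows only the config rewrite, and the database name and addresses are invented for the example.

```python
import re

# Sketch only: point_database_at is a hypothetical helper, not part of
# consul-template, Patroni, or pgbouncer.

def point_database_at(config_text, db_name, leader_host, leader_port=5432):
    """Rewrite the [databases] entry for db_name to target the new leader.

    Returns (new_text, changed) so the caller can skip the reload when the
    leader has not actually moved.
    """
    entry = re.compile(rf"^{re.escape(db_name)}\s*=.*$", re.MULTILINE)
    new_line = f"{db_name} = host={leader_host} port={leader_port} dbname={db_name}"
    # A callable replacement avoids any backslash interpretation in new_line.
    new_text, count = entry.subn(lambda _match: new_line, config_text, count=1)
    if count == 0:
        raise ValueError(f"no [databases] entry found for {db_name!r}")
    return new_text, new_text != config_text


if __name__ == "__main__":
    before = (
        "[databases]\n"
        "app = host=10.0.0.1 port=5432 dbname=app\n"
        "\n"
        "[pgbouncer]\n"
        "pool_mode = transaction\n"
    )
    after, changed = point_database_at(before, "app", "10.0.0.2")
    print(after)
    print("reload needed:", changed)
```

When `changed` is true, the remaining step is to tell pgbouncer to re-read its configuration, which it does on SIGHUP or on the `RELOAD` admin console command. Because the clients are connected to pgbouncer rather than to Postgres directly, their connections stay open while the backend address changes underneath them.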

Happy Servers, Happy Humans

It's been a lot of fun scaling up Honeybadger and making the infrastructure more resilient to failure. Kudos to all those who have created and contributed to the open source projects we use to make that happen!

Benjamin Curtis

Ben has been developing web apps and building startups since '99, and fell in love with Ruby and Rails in 2005. Before co-founding Honeybadger, he launched a couple of his own startups: Catch the Best, to help companies manage the hiring process, and RailsKits, to help Rails developers get a jump start on their projects. Ben's role at Honeybadger ranges from bare-metal to front-end... he keeps the server lights blinking happily, builds a lot of the back-end Rails code, and dips his toes into the front-end code from time to time. When he's not working, Ben likes to hang out with his wife and kids, ride his road bike, and of course hack on open source projects. :)
