Introducing our Sidekiq cluster script
At Honeybadger we depend heavily on Sidekiq in our processing pipeline. Nearly everything we do runs through a queue at some point. As a result, I want to make sure we are running Sidekiq well. With our recent move to EC2, we changed from having a stable set of long-lived servers to an ever-changing set of instances running our jobs. This prompted me to revisit how we start Sidekiq at boot time, as that is now much more important than it was previously.
In the past we depended on god to spin up all the worker processes, since we wanted it to monitor them and terminate the ones that ended up using too much memory. Our god config was hand-tuned for the servers that we had deployed to run those workers. Unfortunately, I found that when booting our old god config on the new instances using systemd, we'd sometimes get more workers running than our configuration specified. This would lead to memory pressure on the instances, which was bad news. After fiddling with it for a while, and realizing that we weren't using much of its functionality, I decided it was time to stop using god to manage the processes.
One simple command for running multiple Sidekiq processes
What I really wanted was a script that I could run as a systemd service that would spawn one Sidekiq process per core and that would restart processes that consumed too much memory. The script needed to be able to work with EC2 instances of various sizes, as the number of cores and the amount of RAM could vary as I tried different instance sizes to see what would work best for our workload. Since I couldn't find a script that would do exactly that, I wrote one:
Of course, there's a bit of borrowed code in that script. :) The code in the process_count method probably comes from Stack Overflow (I've been using it for years in our unicorn configurations) and the fork_child method is basically the sidekiq bin from the sidekiq gem.
This script spawns a number of child processes based on the number of CPU cores present, with each child process being the same as if you had run the sidekiq command directly. As a result, whatever command-line options you pass to this script will get passed to (and parsed by) the Sidekiq CLI code. In other words, you can pass any options to this script that you can pass to sidekiq when running it directly, though some options (like the pid file option) don't make sense to use. A thread is also spawned to periodically check the memory usage of each child process. If any child process crosses the usage threshold, it is killed and a new process is spawned to replace it.
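To make the memory check concrete, here is a minimal sketch of how such a monitor could look. It is not the script itself: rss_kb, over_limit, check_children and start_monitor are hypothetical names, the 60-second interval is an arbitrary choice, reading VmRSS from /proc assumes Linux, and the respawn argument stands in for whatever forks a new Sidekiq child.

```ruby
# Resident memory of a process in kilobytes, read from /proc (Linux only).
# Returns nil when the process no longer exists or reports no VmRSS.
def rss_kb(pid)
  line = File.foreach("/proc/#{pid}/status").find { |l| l.start_with?("VmRSS:") }
  line && line.split[1].to_i
rescue Errno::ENOENT, Errno::ESRCH
  nil
end

# Pick the pids whose memory usage is above the limit.
# Entries with unknown (nil) usage are skipped.
def over_limit(usage_by_pid, limit_kb)
  usage_by_pid.select { |_pid, kb| kb && kb > limit_kb }.keys
end

# One pass of the check: kill each heavy child, reap it, start a replacement.
# `respawn` is any callable that forks a new worker and returns its pid
# (an assumption of this sketch, standing in for fork_child).
def check_children(children, limit_kb, respawn)
  usage = children.map { |pid| [pid, rss_kb(pid)] }.to_h
  over_limit(usage, limit_kb).each do |pid|
    Process.kill("TERM", pid)
    Process.wait(pid)
    children.delete(pid)
    children << respawn.call
  end
  children
end

# Run the check periodically in a background thread.
def start_monitor(children, limit_kb, respawn, interval: 60)
  Thread.new do
    loop do
      sleep interval
      check_children(children, limit_kb, respawn)
    end
  end
end
```

Keeping the "who is over the limit" decision in a pure method makes it easy to exercise without forking anything; only check_children touches real processes.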
Working with the cluster
While the number of processes defaults to the number of cores (assuming you are dedicating an instance to running Sidekiq processes), you can override that by setting the SK_PROCESS_COUNT environment variable before running the script. Likewise, the memory threshold is set as a percentage of the total RAM of the instance (leaving some RAM to spare by adding one to the process_count value on line 58), but you can set whatever percentage you want with the SK_MEMORY_PCT_LIMIT environment variable. We use these two variables to restrict the number of Sidekiq processes and memory usage on instances that are running the Rails app that powers our UI.
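As a rough illustration of those two settings, the sketch below shows one way the defaults could be computed. The environment variable names come from the script, but everything else is an assumption: the method names memory_pct_limit and memory_limit_kb are hypothetical, and the default of 100 / (process_count + 1) percent is my reading of "adding one to the process_count value", not a copy of the actual line.

```ruby
require "etc"

# Number of worker processes: SK_PROCESS_COUNT if set, otherwise one per core.
def process_count(env = ENV)
  Integer(env.fetch("SK_PROCESS_COUNT") { Etc.nprocessors })
end

# Percentage of total RAM a single process may use.
# Assumed default: divide 100% by process_count + 1, which leaves roughly
# one process's share of RAM free for everything else on the instance.
def memory_pct_limit(env = ENV, count = process_count(env))
  Float(env.fetch("SK_MEMORY_PCT_LIMIT") { 100.0 / (count + 1) })
end

# Convert the percentage into a per-process limit in kilobytes.
def memory_limit_kb(total_ram_kb, pct)
  (total_ram_kb * pct / 100.0).floor
end
```

Under these assumptions, an instance with 3 processes would allow each one 25% of RAM, so on a machine with 8,000,000 kB of memory a child would be replaced once it passed 2,000,000 kB.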
As an added bonus, this script catches the signals typically used to manage Sidekiq processes and passes those signals on to the child processes. This allows us to use pkill -f -USR1 skcluster to have the child processes stop accepting new work (this is an early task in our deployment script), and we can use pkill -f skcluster to terminate everything.
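The signal forwarding can be sketched in a few lines. This is an illustration under assumptions rather than the script's actual code: forward_signal and trap_and_forward are hypothetical names, and only the two signals mentioned above (USR1 and TERM) are handled.

```ruby
# Send a signal to every child, ignoring children that have already exited.
def forward_signal(signal, children)
  children.each do |pid|
    begin
      Process.kill(signal, pid)
    rescue Errno::ESRCH
      # The child is already gone, so there is nothing to forward to.
    end
  end
end

# Install handlers in the parent so that signals sent to the cluster
# process are passed on to each Sidekiq child.
def trap_and_forward(children, signals = %w[USR1 TERM])
  signals.each do |sig|
    Signal.trap(sig) { forward_signal(sig, children) }
  end
end
```

A real parent process would also need to wait for its children and exit after forwarding TERM; that part is left out here to keep the sketch short.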
The systemd service definition is straightforward:
Since we use pgbouncer on every instance to proxy connections to postgres, we ensure that the pgbouncer service is running before the sidekiq processes get booted. If we were running redis on the same instances, we would also add redis-server.service to the Requires and After lines. The EnvironmentFile uses the - prefix to the filename to tell systemd that it's ok if the file doesn't exist. This allows us to omit that file when we want to go with the default process count and memory limit. We use the --require option to let Sidekiq know where to go to load our Rails application's configuration and initializers. Since the skcluster script catches SIGTERM, we can use systemctl restart skcluster.service to restart all the workers (this is a late task in our deployment script).
We've been using this script for a while now, and it has worked like a champ. We don't have any more problems with too many processes getting spawned at boot, we don't have any more memory usage warnings from our monitoring, and we can launch any type of EC2 instance that we want. Hassle-free ops is the kind of ops I like the most. :)