Recently I had to lean heavily on background workers because of the increasing traffic on Orfium. This caused me a little bit of panic because I am mostly an API/REST guy and not much of a fan of MQTT. I have to admit I was wrong. But let's take it from the beginning. Have you ever wondered how to handle scaling on Heroku?

Proudly Deploying our API

For Orfium, our traffic occasionally spikes and we need to scale up quickly. Heroku is great for this, but it requires that one of us is always watching the site. For months we’ve been manually scaling up and back down. Occasionally we’d scale up and then forget to back off on the dynos (web servers) when traffic went back down. It’s wasteful to have those extra dynos idle and costs us a lot of money.

Right now, on our platform we have our first contracts with some distributors, which meant we needed to build a new pipeline for them to upload hundreds of thousands of tracks, albums and playlists!

As an experienced API guy, I designed our API, and since we had some very specific flows to implement, we used django-rest-framework, which gave us the tools we needed. It took us less than a week to have a stable prototype to start evaluating with our first requests.

All tests passed with success and everyone was soooooo happy!

Proudly un-Deploying our API – Long Live MQTT

Then, real life started happening. Distributors would upload either a huge batch of tracks at once or just a single track, and the traffic pattern was completely unpredictable. We couldn't plan for this, and the flood of HTTP requests hitting our servers started pushing them to their limits… They never crashed, but our New Relic Apdex score was not happy at all.

With my fellow hacker Dimitri, we agreed we needed something more permanent. We rewrote most of the code and started processing everything with Celery workers. Obviously we still use tons of APIs, but this couldn't have happened without MQTT and Celery.


Celery is an asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well. The execution units, called tasks, are executed concurrently on one or more worker servers using multiprocessing, Eventlet, or gevent. Tasks can execute asynchronously (in the background) or synchronously (wait until ready). Celery is used in production systems to process millions of tasks a day.
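
As a tiny illustration (the task and module names below are made up, not from our actual codebase), a task is just a decorated function that a worker executes in the background:

from celery import Celery

celery = Celery('myproject', broker='amqp://guest@localhost//')

@celery.task
def import_track(track_id):
    # the heavy lifting happens here, outside the web request/response cycle
    pass

# .delay() enqueues the task; a worker picks it up asynchronously
import_track.delay(42)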

Now NewRelic provides us with complete analytics for each background process and the overall server load has been dramatically reduced.

Reducing Costs – Improving Scalability and Availability

I know this may sound a little bit provocative, but it couldn't be more true. This particular miracle also has a name.

It is called Hirefire.

We have been using it for the last month for our web dynos, and we are already paying less than half of what we used to on dynos; now it takes care of our workers as well!

Hirefire is also open source, which means you can host it anywhere you like, but I, like many others, prefer the hosted solution.

In this blog post I will focus only on how to scale Celery workers, but in any case, the guys are awesome and they have the BEST CUSTOMER SUPPORT EVER!! (thank you Michael)

Some Python

Let's dive into some code.

First, let's assume you have a Django app (or not) already set up with Celery.

Then you just need to install HireFire to automatically scale your worker dynos.

This is as simple as:

# pip install hirefire

from celery import Celery
from hirefire.procs.celery import CeleryProc

celery = Celery('myproject', broker='amqp://guest@localhost//')


class WorkerProc(CeleryProc):
    # The name field must be identical to the worker name in your
    # Procfile!! It is nowhere explicitly written, but otherwise
    # it won't work!
    name = 'worker'
    queues = ['celery']
    app = celery
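
For reference, here is roughly what the matching Procfile could look like (the gunicorn command and module names are placeholder assumptions — the important part is that the process is called worker, matching the name field above):

web: gunicorn myproject.wsgi
worker: celery -A myproject worker -Q celery --loglevel=info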

You can find more information in the corresponding docs. It is not difficult to set up, but you need to be a little bit careful.

If you follow the README of the lib, under the Django section (or whichever section works for you), you can test whether your configuration works in development by accessing http://localhost:8000/hirefire/development/info. When you access this URL, you should see a JSON response in your browser containing the configured procs and their queue sizes (try enqueuing a few jobs and refresh this URL to see if the sizes increase).
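
As a rough idea of what to expect (the exact shape may vary between versions of the lib), the response is a small JSON array with one entry per proc and its current queue size, something like:

[{"name": "worker", "quantity": 3}]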

A brief description of how it works:

Once per minute, HireFire will perform an HTTP GET request to https://your-domain.com/hirefire/HIREFIRE_TOKEN/info to fetch your current queue size(s). This, combined with whatever you've configured in the HireFire web interface for your worker manager (Worker.HireFire.JobQueue), will determine how to scale. So basically, the only thing the Python lib does is expose a route that HireFire will hit to retrieve your configured queue sizes.
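
For Django, wiring that route up is roughly the following (a sketch based on the lib's README — the dotted path to WorkerProc is a placeholder, and on Heroku the token should come from the HIREFIRE_TOKEN config var rather than being hardcoded):

# settings.py
HIREFIRE_TOKEN = 'development'  # use the real token from HireFire in production
HIREFIRE_PROCS = ['myproject.procs.WorkerProc']

MIDDLEWARE_CLASSES = (
    'hirefire.contrib.django.middleware.HireFireMiddleware',
    # ... your other middleware ...
)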

Basically what you want to do is the following:

  1. Log in to HireFire
  2. Add a new manager for your app with the name worker (assuming your worker is also named “worker” in your Procfile) of type Worker.HireFire.JobQueue
  3. On the app overview page in the HireFire UI you should see a row labeled “Token”. Click it and it’ll give you a command to run in the terminal to add the token to your app (see the example right after this list)
  4. Add the Python library to your app and configure it
  5. Test to see if JSON is returned when accessing localhost:8000/hirefire/development/info — add some jobs and refresh this url to see if the queue sizes change
  6. If it works, push this to Heroku
  7. Enable the newly created worker manager on HireFire and it should work.
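
For step 3, the command HireFire hands you is essentially something like the following (the token and app name here are obviously made-up placeholders):

heroku config:set HIREFIRE_TOKEN=abc123 --app your-heroku-app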

 

SPECIAL ALERT

If you don’t have a very specific need to keep the history of your tasks, consider the setting


CELERY_RESULT_BACKEND = None

It will really, really save you tons of money and debugging time!
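
In a Django project this just goes in your settings; a minimal sketch of what that could look like (note that with no result backend Celery won't store results at all, so nothing in your code should rely on fetching them back):

# settings.py -- old-style Celery setting names
CELERY_RESULT_BACKEND = None   # don't store task results in any backend
CELERY_IGNORE_RESULT = True    # optional: tasks skip publishing results entirely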

Final Thoughts

I couldn't recommend experimenting with Heroku autoscaling and Hirefire.io more! It is a decent, stable solution and it works great with a mature framework like Celery.

Apart from recommending our solution, this blog post is also a future reference for myself the next time I have a similar problem 😀 😀
