Daemons-Rails: scaling up to multiple workers - ruby-on-rails-3

So, I've been given a code base which uses daemons, daemons-rails and delayed jobs to trigger a number of *_ctl files in /lib/daemons/ but here's the problem:
If two people do an action which starts the daemons doing some heavy lifting then whichever one clicks second will have to wait for the first to complete. Not good. We need to start multiple daemons on each queue.
Ideally what I want to do is read a config file like this:
default:
  queues: default ordering
  num_workers: 10
hours:
  queues: slow_admin_tasks
  num_workers: 2
minutes:
  queues: minute
  num_workers: 2
This would mean that 10 daemon processes are started to listen to the default and ordering queues, 2 for slow_admin tasks etc.
How would I go about defining multiple daemons like this? It looks like the change might belong in one of these places:
/lib/daemons/*_ctl
/lib/daemons/*.rb
/lib/daemons/daemons
I thought it might be a change to the daemons-rails rake tasks, but they just hit the daemons file.
Has anyone looked into scaling daemons-rails in this way? Where can I get more information?

I suggest you try Foreman.
Take a look at this discussion.
Foreman can help manage multiple processes that your Rails app depends upon when running in development. You can find a tutorial regarding Foreman on RailsCasts. There's a video tutorial + some source code examples.
I also suggest you take a look at Sidekiq.
Sidekiq allows you to move jobs into the background for asynchronous processing. It uses threads instead of forks so it is much more efficient with memory compared to Resque. You can find a tutorial here.
I also suggest you take a look at Resque.
Resque is a Redis-backed Ruby library for creating background jobs, placing them on multiple queues, and processing them later. You can find a tutorial here.
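Whichever process manager ends up starting the workers, the config file from the question can be expanded into one entry per worker process. The following is only a minimal sketch: the method name `worker_specs` and the `name`/`queues` keys are hypothetical, and how each entry is actually launched (a daemons-rails control script, a Foreman process, a Sidekiq or Resque worker) is left open.

```ruby
require "yaml"

# Expand the config into one entry per worker process to start.
# `worker_specs` and the hash keys are hypothetical names for illustration.
def worker_specs(yaml_text)
  YAML.safe_load(yaml_text).flat_map do |group, settings|
    # "default ordering" becomes ["default", "ordering"]
    queues = settings.fetch("queues").split
    Array.new(settings.fetch("num_workers")) do |index|
      { name: "#{group}.#{index}", queues: queues }
    end
  end
end

config = <<~YAML
  default:
    queues: default ordering
    num_workers: 10
  hours:
    queues: slow_admin_tasks
    num_workers: 2
  minutes:
    queues: minute
    num_workers: 2
YAML

specs = worker_specs(config)
specs.each { |spec| puts "#{spec[:name]} listens on #{spec[:queues].join(', ')}" }
```

Each entry could then be handed to whatever actually starts a process, so changing `num_workers` in the file is the only step needed to scale a queue group.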

Related

Monitoring long lasting tasks in Airflow

I've seen people using Airflow to schedule hundreds of scraping jobs through Scrapyd daemons. However, one thing they miss in Airflow is monitoring long-lasting jobs like scraping: getting the number of pages and items scraped so far, and the number of URLs that have failed so far or were retried without success.
What are my options to monitor the current status of long-lasting jobs? Is there something already available, or do I need to resort to external solutions like Prometheus and Grafana and instrument the Scrapy spiders myself?
We've had better luck keeping our airflow jobs short and sweet.
With long-running tasks, you risk running into queue back-ups. And we've found the parallelism limits are not quite intuitive. Check out this post for a good breakdown.
In a case kind of like yours, when there's a lot of work to orchestrate and potentially retry, we reach for Kafka. The airflow dags pull messages off of a Kafka topic and then report success/failure via a Kafka wrapper.
We end up with several overlapping airflow tasks running in "micro-batches" reading a tune-able number of messages off Kafka, with the goal of keeping each airflow task under a certain run time.
By keeping the work small in each airflow task, we can easily scale the number of messages up or down to tune the overall task run time with the overall parallelism of the airflow workers.
Seems like you could explore something similar here?
Hope that helps!

When to use - Delayed Job vs RabbitMQ

Can someone clarify the advantages of using RabbitMQ (a message queue) instead of Delayed Job (background processing)?
Basically, I want to know when to use background processing and when to use a message queue.
My web application has 3 components: one main server, which will handle all user requests, and 2 app servers where all the background jobs (like es reindex, es record update, sending emails, crons) should be run.
I saw articles which say that using a database as a queue (Delayed Job) is very bad, as the consumers will be polling the database for new jobs and updating the statuses of jobs, which will lock the tables. How, then, do RabbitMQ or other message queues store jobs to avoid this problem?
There are other alternatives to Delayed Job, like Sidekiq, which runs over Redis instead of MySQL. Is it better to use Sidekiq instead of RabbitMQ?
And are there any advantages of using sidekiq over delayed job?
You have 2 workers and 1 web server: I guess your web app dispatches some delayed jobs to your workers. So you need a way to store the data related to those background jobs.
For that, you can use either a database (like Redis, which is what Sidekiq does) or a message queue (like RabbitMQ). A message queue is a specialized system that is very efficient for this use case (allowing a much higher throughput). A database gives you better introspection (you can query the jobs table to see what your current situation is), while a queuing system is more efficient but is also more of a black box and will require new skills.
If you do not have performance issues, the simpler the better; even a simple MySQL database should be enough. If you want a more powerful system or need a lot of monitoring, you can also consider using a specialized hosted service such as zenaton (I'm the founder) that will do all the heavy lifting for you, including scheduling or more sophisticated orchestration of your background jobs.
Both perform the same task, i.e. executing jobs in the background, but go about it differently.
With Delayed Job, jobs are stored in a database, and workers query it for pending jobs and then process them. It's simple to set up, but the performance and scalability aren't great.
RabbitMQ and alternatives such as Redis are harder to set up, but their performance, flexibility and scalability are great (we are talking upwards of 5000 jobs per second), and you tend to write less code.
Another option is to use a task orchestration system like Cadence Workflow. It supports both delayed execution and queueing, but provides a higher-level programming model and tons of features that neither queues nor delayed execution frameworks offer.
Cadence offers a lot of advantages over using queues for task processing.
Built-in exponential retries with unlimited expiration interval
Failure handling. For example, it lets you execute a task that notifies another service if both updates couldn't succeed during a configured interval.
Support for long-running heartbeating operations
Ability to implement complex task dependencies, for example chaining of calls or compensation logic in case of unrecoverable failures (SAGA)
Gives complete visibility into the current state of the update. For example, when using queues all you know is whether there are some messages in a queue, and you need an additional DB to track the overall progress. With Cadence every event is recorded.
Ability to cancel an update in flight.
Built-in distributed CRON
See the presentation that goes over Cadence programming model.

Celery with RabbitMQ creating too many queues

When running Django/Celery/RabbitMQ on production server, some tasks are sent and consumed correctly. However, RabbitMQ starts using up all the CPU after processing is done. I believe this is related to the following report.
RabbitMQ on EC2 Consuming Tons of CPU
In that thread, it is suggested to set these config values:
CELERY_IGNORE_RESULT
CELERY_AMQP_TASK_RESULT_EXPIRES
I forked and customized the celery-haystack package to set both those values when calling apply_async(); however, it seems to have had no effect.
I think Celery is creating a large number (one per task) of uid-named queues automatically to store results. But I don't seem to be able to stop it.
Any ideas?
I just spent a day digging into this problem myself. I think the two options you mentioned can be explained like this:
CELERY_IGNORE_RESULT: if True then the results of tasks will be ignored, hence they won't return anything where you call them with delay or apply_async.
CELERY_AMQP_TASK_RESULT_EXPIRES: the expiration time for a result stored in the result backend. You can set this option to a reasonable value so RabbitMQ can delete expired results.
The many queues generated are for storing results only. So in case you don't want to store any results, you can remove CELERY_RESULT_BACKEND option from your config file.
Have a nice day!

Repeatedly running a background script in Rails3/Heroku Cedar deployment

I am developing a Rails3 app which will run on the Heroku Cedar stack and needs to constantly check for new tweets under a certain hashtag. I have the logic to do this in place, but I would like to run this task in the background so as not to interfere with the main app's performance. I also need to write any new tweets found to a database, so I will need access to Active Record. I am looking for advice on what might be the best way to achieve this.
I do something similar; it doesn't matter for me if tweets are slightly out of date. We use the scheduler to run a rake task that watches a hashtag every 10 minutes. We can change the frequency of execution to hourly/daily should we feel 10 mins is too frequent.
You could use the Heroku scheduler to regularly execute a Rake task (or some other script).
Alternatively, if you're checking for Tweets in response to a certain user action or some other event, you could use a task queue like Delayed Job.

Gems/Services for autoscaling Heroku's dynos and workers

I want to know if there are any good solutions for autoscaling dynos AND workers on Heroku in a production environment (probably a different solution for each of those, as they are pretty unrelated). What are you/companies using, regarding this?
I found lots of options, but none of them seem really mature for a production environment.
There is Heroscale, which seems to introduce some latency as it does not run locally, and I have also heard of some downtime. There are modifications of delayed_job, which have not been updated for a long time and have some issues with current Bundler versions. There are also some alternatives related to resque, some of which seem not to handle certain HTTP exceptions very well, which results in the app crashing, and others which seem to need an always-running worker to schedule other workers and may also suffer from HTTP exception problems.
So, in the end: what is being used nowadays for autoscaling Heroku's dynos and workers in a production environment with Rails3?
Thanks in advance.
We ran into this a while ago and I spent quite a bit of time on this to my great frustration. I'll try to stick to the salient points. There are several Heroku autoscaling solutions that seem decent at first glance.
The example that has already been given, heroku-autoscaler, is actually for autoscaling dynos and is pretty much the only solution out there that claims to do this (and it certainly doesn't do it well). Most others will only claim to autoscale workers for you. So, let's focus on that first. The autoscalers you'll look at for workers depend on what you're actually using for your background workers, e.g. delayed_job or resque. Those are the most common background processing libs that people use, so the autoscalers will try to hook into one of them. You can use things like:
workless
hirefire
heroku-resque-auto-scale
etc
Some of these work on the Cedar stack; some might need a bit of tweaking. The problem with all of them is that it's like trying to pull yourself out of the swamp by your own hair. Let's take hirefire as an example (it's probably the best one of the lot). It modifies delayed_job so that the workers themselves can look at the queue and spin up more workers if necessary; if there are no more jobs in the queue, the workers will all shut each other down. There are several problems:
if you want to put a job on the queue to be executed in the future as opposed to right now, you're out of luck. A worker starts up when jobs enter the queue, but since the job is to be executed in the future, the worker will shut down and will not start up again unless another job enters the queue (that's the only thing that prompts workers to start up)
you lose the ability to retry failed jobs. Retrying is possible by default in delayed_job, but it takes a little while before a failed job is retried (and progressively longer if it fails multiple times). The workers will shut down during this delay, and there is nothing to prompt them to start up again (in essence this is the same issue as in the first scenario)
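To put numbers on the "progressively longer" retry delay: delayed_job's README describes the default reschedule time after a failure as 5 seconds + N ** 4, where N is the number of attempts so far. The sketch below just evaluates that formula; the method name `retry_delay` is made up for illustration, and the formula is worth checking against the delayed_job version in use.

```ruby
# Seconds to wait before the next run after `attempts` failed attempts,
# following the default formula documented in delayed_job's README (5 + N ** 4).
# `retry_delay` is a hypothetical helper name, not part of delayed_job.
def retry_delay(attempts)
  5 + attempts**4
end

# Show how quickly the gap between retries grows.
(1..5).each do |n|
  puts "after attempt #{n}: wait #{retry_delay(n)} seconds"
end
```

With gaps like these, a worker that shuts down as soon as the queue looks empty will be gone long before a later retry comes due.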
The thing that solves this problem is to have one worker running continuously, so that it can monitor the queue periodically and execute jobs when necessary or even spin up more workers. But if you do that, you're not saving any money (you have a worker running continuously 24/7 and have to pay for that), and saving money is the whole premise behind autoscalers on Heroku. In essence, if you only have occasional background processing to do, or you have background jobs that are likely to fail but succeed on retry, or you have background jobs that don't need to be executed instantly, there is no autoscaling library you can use that will work for you.
Here is one alternative. The guy who wrote Hirefire later spun it off into a webapp (Hirefire app), the essence of which is to externally monitor your Heroku workers/dynos for you and spin up/shut down workers/dynos as necessary. This was free in beta but it now costs money, less than what you'd pay to run a worker 24/7 but still not insignificant if you only need a few background jobs once in a while. Either way this is the only viable way to make sure your background job infrastructure does what you want (well that and rolling your own solution which means having a machine like an EC2 instance where you can put some scripts which will ping your heroku app and spin up/shut down workers as needed - a non-trivial amount of effort).
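For the roll-your-own route, the core of an external monitor is a small decision function. The sketch below is an assumption-laden illustration, not how Hirefire app works: `desired_workers`, `lookahead`, `jobs_per_worker` and `max` are all hypothetical names and tuning knobs, and the actual calls that read the queue and scale Heroku workers are left out. Because it looks at each job's scheduled run time, it also covers future-dated jobs and delayed retries, the two cases where in-worker autoscalers fall down.

```ruby
# Decide how many workers should be running, given the scheduled run time of
# every pending job. All parameter names here are hypothetical tuning knobs.
def desired_workers(run_at_times, now:, lookahead: 60, jobs_per_worker: 10, max: 5)
  # Count jobs that are due now or will come due within the lookahead window.
  due = run_at_times.count { |run_at| run_at <= now + lookahead }
  # One worker per `jobs_per_worker` due jobs, rounded up, capped at `max`.
  (due.to_f / jobs_per_worker).ceil.clamp(0, max)
end

now = Time.at(0)
pending = [now] * 25 + [now + 3600] # 25 jobs due now, one scheduled an hour out
puts desired_workers(pending, now: now)
```

A script on a separate machine would call something like this on a timer and then scale the worker count up or down to match.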
Now Hirefire app does offer to autoscale your dynos for you as well; it does this by hooking into the latency of your heroku request queue. However, I found that this didn't work well. Perhaps if you're close to the Amazon datacenter where your heroku app actually lives (we weren't), you might have a different experience. But for us it unnecessarily spun up a whole bunch of dynos and would never spin them down no matter how much I tweaked the settings. You can put it down to the fact that it was a beta; it may have improved since then, but that's the experience that I had.
Long story short, if you want to autoscale your workers, use Hirefire app, you'll be saving a lot less money than you thought, but it is still the cheapest option. If you want to autoscale dynos you're basically out of luck. This is just one of those limitations you live with for having the convenience of a platform like Heroku.
Heroku is offering a new add-on called AdeptScale which is now just out of Beta.
Here is the add-on page for AdeptScale
Here is the more detailed documentation for AdeptScale
Here is the form to sign up for Heroku's Beta Program
Hopefully this will be a robust solution for autoscaling Heroku dynos, as I'm still not happy with the current options.
Update (2/4/13): I signed up for Heroku's Beta program to try out this add-on, and it's worked really well for me. It occasionally scales up with traffic, but mostly sits on the minimum number of dynos I've set (2). It's greatly reduced my bill, and eliminated worry that I might be slow during peak usage times.
Update (3/6/13): Added link to Heroku's Sign up page for their beta program.
Update (4/14/13): Looks like auto-scaling is out of Beta. It's still working really well for me.
HireFire.io (The Service, not the Open Source Project) now allows you to use your New Relic metrics to auto-scale your web dynos. New Relic is a performance monitoring tool provided as an add-on through Heroku. They have a free tier and it's sufficient to use with HireFire.
You can auto-scale based on:
Response Time
This is the Response Time you find on the New Relic Dashboard. It's a combination of various factors including Request Queuing, Database Performance, App-Layer, Router, etc.
Apdex Score
This allows you to scale based on your New Relic Apdex Score, enabling you to scale based on user experience/satisfaction, which is determined by this score.
Aside from this, we have become language/framework agnostic. For worker dynos, all you have to do to get auto-scaling working is to set up a JSON endpoint at a certain path in your app that returns a very simple JSON string containing the queue size (we provide convenient, but not required, macros for the Ruby language and some out-of-the-box support for Django apps, but like I said it works for any language/framework by manually setting up a JSON endpoint - it's very easy). For web dynos, you can use the HireFire Metric Source with basically any language/framework, and the above-mentioned New Relic Metric Source for languages/frameworks that are supported by New Relic (these are common languages such as Ruby, Python, Java, etc).
Disclaimer: I built HireFire.
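As a rough idea of what such a JSON endpoint involves, here is a minimal Rack-style handler written with only the standard library. The path `/queue_size`, the payload shape, and the `queue_size` lookup are all assumptions made for illustration; the exact path and JSON format HireFire expects should be taken from its documentation.

```ruby
require "json"

# Hypothetical stand-in for a real lookup, such as counting pending jobs.
def queue_size
  7
end

# Rack-style handler: takes the request env hash and returns
# [status, headers, body]. The path and payload shape are assumptions.
def queue_endpoint(env)
  if env["PATH_INFO"] == "/queue_size"
    body = JSON.generate([{ name: "worker", quantity: queue_size }])
    [200, { "content-type" => "application/json" }, [body]]
  else
    [404, { "content-type" => "text/plain" }, ["not found"]]
  end
end

status, _headers, body = queue_endpoint({ "PATH_INFO" => "/queue_size" })
puts status
puts body.first
```

In a Rack-based app the same method could be mounted with something like `run method(:queue_endpoint)` in `config.ru`, since Rack only needs an object that responds to `call`.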
I'm trying to find a good way to autoscale dynos too.
https://github.com/ddollar/heroku-autoscale does this but has a disclaimer about its immaturity.
I've recently written a heroku auto scaling system called Heroku Vector:
https://github.com/wpeterson/heroku-vector
It allows you to scale multiple types of dynos based on different traffic sources. It currently supports NewRelic throughput and the number of busy Sidekiq threads. As traffic goes up or down, it will scale the number of dynos up or down. It's a daemon process that can be run in its own dyno on Heroku or elsewhere.