Monitoring long lasting tasks in Airflow - scrapy

I've seen people using Airflow to schedule hundreds of scraping jobs through Scrapyd daemons. However, one thing they miss in Airflow is monitoring long-lasting jobs like scraping: getting number of pages and items scraped so far, number of URL that failed so far or were retried without success.
What are my options to monitor current status of long lasting jobs? Is there something already available or I need to resort to external solutions like Prometheus, Grafana and instrument Scrapy spiders myself?

We've had better luck keeping our airflow jobs short and sweet.
With long-running tasks, you risk running into queue back-ups. And we've found the parallelism limits are not quite intuitive. Check out this post for a good breakdown.
In a case kind of like yours, when there's a lot of work to orchestrate and potentially retry, we reach for Kafka. The airflow dags pull messages off of a Kafka topic and then report success/failure via a Kafka wrapper.
We end up with several overlapping airflow tasks running in "micro-batches" reading a tune-able number of messages off Kafka, with the goal of keeping each airflow task under a certain run time.
By keeping the work small in each airflow task, we can easily scale the number of messages up or down to tune the overall task run time with the overall parallelism of the airflow workers.
Seems like you could explore something similar here?
Hope that helps!

Related

When to use - Delayed Job vs RabbitMQ

Can someone give me the clarity of the advantages of using RabbitMQ(message queue) instead of Delayed Job(background processing) ?
Basically I want to know when to use background processing and messaging queue ?
My web application has 3 components one main server which will handle all user requests and 2 app servers where all the background jobs(like es reindex, es record update, sending emails, crons) should be run.
I saw articles which say Database as a queue(delayed job) is very bad as the consumers will be polling the database for new jobs and updating the statuses of jobs which will lock the tables. Then how does rabbit MQ or other messaging queues store to avoid this problem.
There are other alternatives for delayed job like sidekiq which will run over redis instead of mysql. It is better to use sidekiq instead of rabbitmq?
And are there any advantages of using sidekiq over delayed job?
You have 2 workers and 1 web server: I guess your web app dispatches some delayed jobs to your workers. So you need a way to store the data related to those background jobs.
For that, you can use both a database (like Redis, this is what sidekick is doing) or a message queue (like RabbitMQ). A message queue is a specialized system that is very efficient for this use case (allowing a much higher throughput). A database would let you have a better introspection (as you can request the jobs table to see what your current situation is), while the queuing system would be more efficient but also is more a black box and will require new skills.
If you do not have performance issues, the simpler the better, even a simple mysql database should be enough. If you want a more powerful system or need a lot of monitoring you can also consider using a specialized hosted service such as zenaton (I'm founder) that will do all the heavy lifting for you, including scheduling or more sophisticated orchestration of your background jobs.
Both perform the same task, i.e executing jobs in the background, but go about it differently.
With delayed job one uses some sort of a database for storage, queries for the jobs thereafter then processes them. It's simple to set up but the performance and scalability aren't great.
RabbitMQ or its alternatives Redis e.t.c are harder to set up but their performance, flexibility and scalability is great, we are talking in the upwards of 5000 jobs per second besides you have tend to use less code.
Another option is to use task orchestration system like Cadence Workflow. It supports both delayed execution and queueing, but provides higher level programming model and tons of features that neither queues or delayed execution frameworks.
Cadence offers a lot of advantages over using queues for task processing.
Built it exponential retries with unlimited expiration interval
Failure handling. For example it allows to execute a task that notifies another service if both updates couldn't succeed during a configured interval.
Support for long running heartbeating operations
Ability to implement complex task dependencies. For example to implement chaining of calls or compensation logic in case of unrecoverble failures (SAGA)
Gives complete visibility into current state of the update. For example when using queues all you know if there are some messages in a queue and you need additional DB to track the overall progress. With Cadence every event is recorded.
Ability to cancel an update in flight.
Built in distributed CRON
See the presentation that goes over Cadence programming model.

Is it possible to limit number of oozie workflows running at the same time?

This is not clear to me from the docs. Here's our scenario and why we need this as succinctly as I can:
We have 60 coordinators running, launching workflows usually hourly, some of which have sub-workflows (some multiple in parallel). This works out to around 40 workflows running at any given time. However when cluster is under load or some underlying service is slow (e.g. impala or hbase), workflows will run longer than usual and back up so we can end up with 80+ workflows (including sub-workflows) running.
This sometimes results in ALL workflows hanging indefinitely, because we have only enough memory and cores allocated to this pool that oozie can start the launcher jobs (i.e. oozie:launcher:T=sqoop:W=JobABC:A=sqoop-d596:ID=XYZ), but not their corresponding actions (i.e. oozie:action:T=sqoop:W=JobABC:A=sqoop-d596:ID=XYZ).
We could simply allocate enough resources to the pool to accommodate for these spikes, but that would be a massive waste (hundreds of cores and GBs that other pools/tenants could never use).
So I'm trying to enforce some limit on number of workflows running, even if that means some will be running behind sometimes. BTW all our coordinators are configured with execution=LAST_ONLY, and any delayed workflow will simply catch up fully on the next run. We are on CDH 5.13 with Oozie 4.1; pools are setup with DRF scheduler.
Thanks in advance for your ideas.
AFAIK there is not a configuration parameter that let you control the number of workflows running at a given time.
If your coordinators are scheduled to run approximately in the same time-window, you could think to collapse them in just one coordinator/workflow and use the fork/join control nodes to control the degree of parallelism. Thus you can distribute your actions in a K number of queues in your workflow and this will ensure that you will not have more than K actions running at the same time, limiting the load on the cluster.
We use a script to generate automatically the fork queues inside the workflow and distribute the actions (of course this is only for actions that can run in parallel, i.e. there no data dependencies etc).
Hope this helps

Dataflow reading using PubSubIO is really slow

I'm having some trouble with a Dataflow pipeline that reads from PubSub and writes to BigQuery.
I had to drain it to perform some more complex updates. When I rerun the pipeline it started reading fom PubSub at a normal rate, but then after some minutes it stopped and now it is not reading messages from PubSub anymore! Data watermark is almost one week delayed and not progressing. There are more than 300k messages in the subscription to be read, according to Stackdriver.
It was running normally before the update, and now even if I downgrade my pipeline to the previous version (the one running before update), I still doesn't get it to work.
I tried several configurations:
1) We use Dataflow autoscaling, and I tried starting the pipeline with more powerful workers (n1-standard-64), and limiting it to ten workers, but it won't improve performance neither autoscale (it keeps only the initial worker).
2) I tried providing more disk through diskSizeGb (2048) and diskType (pd-ssd), but still no improvement.
3) Checked PubSub quotas and pull/push rates, but it's absolutely normal.
Pipeline shows no errors or warnings, and just won't progress.
I checked instances resources and CPU, RAM, disk read/write rates are all okay, compared to other pipelines. The only thing a little higher is network rates: about 400k bytes/sec (2000 packets/sec) outgoing and 300k bytes/sec incoming (1800 packets/sec).
What would you suggest I do?
The Dataflow SDK 2.x for Java and the Dataflow SDK for Python are based on Apache Beam. Make sure you are following the documentation as a reference when you update. Quotas can be an issue for slow running pipeline and lack of output but you mentioned those are fine.
It seems there is a need to look at the job. I recommend to open an issue on the PIT here and we’ll take a look. Make sure to provide your project id, job id and all the necessary details.

Daemons-Rails: scaling up to multiple workers

So, I've been given a code base which uses daemons, daemons-rails and delayed jobs to trigger a number of *_ctl files in /lib/daemons/ but here's the problem:
If two people do an action which starts the daemons doing some heavy lifting then whichever one clicks second will have to wait for the first to complete. Not good. We need to start multiple daemons on each queue.
Ideally what I want to do is read a config file like this:
default:
queues: default ordering
num_workers: 10
hours:
queues: slow_admin_tasks
num_workers: 2
minutes:
queues: minute
num_workers: 2
This would mean that 10 daemon processes are started to listen to the default and ordering queues, 2 for slow_admin tasks etc.
How would I go about defining multiple daemons like this, it looks like it might be in one of these places:
/lib/daemons/*_ctl
/lib/daemons/*.rb
/lib/daemons/daemons
I thought it might be a change to the daemons-rails rake tasks, but they just hit the daemons file.
Has anyone looked in to scaling daemons-rails in this way? Where can I get more information?
I suggest you to try Foreman.
Take a look at this discution.
Foreman can help manage multiple processes that your Rails app depends upon when running in development. You can find a tutorial regarding Foreman on RailsCasts. There's a video tutorial + some source code examples.
I also suggest you to take a look at Sidekiq.
Sidekiq allows you to move jobs into the background for asynchronous processing. It uses threads instead of forks so it is much more efficient with memory compared to Resque. You can find a tutorial here.
I also suggest you to take a look at Resque
Resque is a Redis-backed Ruby library for creating background jobs, placing them on multiple queues, and processing them later. You can find a tutorial here.

what are the best practices to run Hive queries

Usually Hive queries takes some time to execute which could be few minutes to hours. If several hundred Java clients are executing Hive queries then potentially such clients will be waiting for long time to get the results and may time out due to network issues. Is there a asynchronous feature with Hive that can be used instead of synchronous behavior?
What are the best practices to mitigate such issues?
Hadoop job scheduler provides guaranteed capacity to production jobs and good response time to interactive jobs while allocating resources fairly between users. You can check the following blog.
http://blog.cloudera.com/blog/2008/11/job-scheduling-in-hadoop/
There is no asynchronous feature with Hive