Scrapy Redis: fetch next_request without waiting for idle signal


I am using the Scrapy framework to make API calls (broad crawls), with scrapy-redis to run it across a distributed network. I fetch the start URLs from Redis and then use middleware to make the subsequent requests. The response time of a task (initial request + set of subsequent requests) varies with the API parameters.
Spiders in scrapy-redis rely on the spider idle signal to fetch start URLs, so I am unable to utilize all the resources: the spider waits for the whole batch of requests (batch size = 100) to finish before fetching more.
How can I tweak scrapy-redis so that it fetches new start URLs as soon as a task is over? I tried running multiple processes with redis-batch-size=1, but that didn't solve my problem, as each Scrapy process takes a lot of memory.

Related

Handling cache warm-up with twisted and systemd

I have a simple twisted application which I run using a systemd service, executing a script, which subsequently executes a .tac file.
The application is structured as a JSON RPC endpoint (fastjsonrpc), built into a t.w.r.Resource, which is in a t.w.s.Site, served by a t.a.i.TCPServer, with the whole thing packed into a t.a.Application. This works fine.
Where I do run into trouble is when I try to warm up caches at startup. This warm-up process is pretty slow (~300 seconds), which makes systemd time out and kill the process. Increasing the timeout is not really a viable option, since I wouldn't want this to block system boot.
Analogous code is used in a separate stack running on Flask from within Apache and wsgi. That server starts itself off and lets systemd go on while it takes its time building the caches. This behaviour is fine for me.
I've tried calling the warmup function using the following within the setup function of the t.w.r.Resource:
reactor.callLater(1, ep.warmup, None)
I've not yet tried using this from within systemd, and have been testing it from twistd directly on the command line. The server does work as expected, however it no longer responds to SIGINT (^C). Removing the callLater is all that's needed to let the server respond to SIGINT.
If the warmup function is called directly (not by callLater, i.e., the arrangement which makes systemd give up while waiting for warm up to complete), the resulting server also continues to respond to SIGINT.
Is there a better / good way to handle this sort of long-running warmup code?
Why would twistd / the reactor not respond to SIGINT? Am I missing something here?

Twisted is a single-threaded thing. It sounds like your "cache warmup" code is blocking the reactor for those 300 seconds. One easy way to fix this would be using deferToThread to let it run without blocking the reactor.

How to fetch Spark Streaming job statistics using REST calls when running in yarn-cluster mode

I have a Spark Streaming program running on a YARN cluster in "yarn-cluster" mode (--master yarn-cluster).
I want to fetch spark job statistics using REST APIs in json format.
I am able to fetch basic statistics using REST url call:
http://yarn-cluster:8088/proxy/application_1446697245218_0091/metrics/json. But this is giving very basic statistics.
However I want to fetch per executor or per RDD based statistics.
How can I do that using REST calls, and where can I find the exact REST URLs for these statistics?
Though $SPARK_HOME/conf/metrics.properties file sheds some light regarding urls i.e.
5. MetricsServlet is added by default as a sink in master, worker and client driver, you can send http request "/metrics/json" to get a snapshot of all the registered metrics in json format. For master, requests "/metrics/master/json" and "/metrics/applications/json" can be sent seperately to get metrics snapshot of instance master and applications. MetricsServlet may not be configured by self.
But those URLs return HTML pages, not JSON. Only "/metrics/json" returns stats in JSON format.
On top of that, finding the application_id programmatically is a challenge in itself when running in yarn-cluster mode.
I checked the REST API section of the Spark Monitoring page, but that didn't work when running the Spark job in yarn-cluster mode. Any pointers/answers are welcome.

You should be able to access the Spark REST API using:
http://yarn-cluster:8088/proxy/application_1446697245218_0091/api/v1/applications/
From here you can select the app-id from the list and then use the following endpoint to get information about executors, for example:
http://yarn-cluster:8088/proxy/application_1446697245218_0091/api/v1/applications/{app-id}/executors
I verified this with my spark streaming application that is running in yarn cluster mode.
I'll explain how I arrived at the JSON response using a web browser. (This is for a Spark 1.5.2 streaming application in yarn-cluster mode).
First, use the hadoop url to view the RUNNING applications. http://{yarn-cluster}:8088/cluster/apps/RUNNING.
Next, select a running application, say http://{yarn-cluster}:8088/cluster/app/application_1450927949656_0021.
Next, click on the TrackingUrl link. This uses a proxy and the port is different in my case: http://{yarn-proxy}:20888/proxy/application_1450927949656_0021/. This shows the Spark UI. Now, append api/v1/applications to this URL: http://{yarn-proxy}:20888/proxy/application_1450927949656_0021/api/v1/applications.
You should see a JSON response with the application name supplied to SparkConf and the start time of the application.
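The same steps can be scripted. The sketch below uses only the Python standard library and builds the URLs in the shape shown above; the helper names are hypothetical, and the proxy host and port must be whatever your cluster's TrackingUrl shows. It assumes each entry in the applications list carries an "id" field and simply takes the first entry:

```python
import json
from urllib.request import urlopen


def applications_url(proxy_base, yarn_app_id):
    """URL of the Spark REST application list behind the YARN proxy."""
    return f"{proxy_base.rstrip('/')}/proxy/{yarn_app_id}/api/v1/applications"


def executors_url(proxy_base, yarn_app_id, spark_app_id):
    """URL of the per-executor statistics for one Spark application."""
    return f"{applications_url(proxy_base, yarn_app_id)}/{spark_app_id}/executors"


def fetch_json(url, timeout=10):
    """GET a URL and decode the JSON body."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def fetch_executors(proxy_base, yarn_app_id):
    """List applications, take the first app id, then fetch its executors."""
    apps = fetch_json(applications_url(proxy_base, yarn_app_id))
    spark_app_id = apps[0]["id"]  # assumption: the first entry is the one wanted
    return fetch_json(executors_url(proxy_base, yarn_app_id, spark_app_id))
```

For example, fetch_executors("http://yarn-cluster:8088", "application_1446697245218_0091") would request the two URLs given in the answer, one after the other.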

I was able to reconstruct the metrics in the columns seen in the Spark Streaming web UI (batch start time, processing delay, scheduling delay) using the /jobs/ endpoint.
The script I used is available here. I wrote a short post describing and tying its functionality back to the Spark codebase. This does not need any web-scraping.
It works for Spark 2.0.0 and YARN 2.7.2, but may work for other version combinations too.

You'll need to scrape the HTML page to get the relevant metrics. There isn't a Spark REST endpoint for capturing this info.

rufus-scheduler and delayed_job on Heroku: why use a worker dyno?

I'm developing a Rails 3.2.16 app and deploying to a Heroku dev account with one free web dyno and no worker dynos. I'm trying to determine if a (paid) worker dyno is really needed.
The app sends various emails. I use delayed_job_active_record to queue those and send them out.
I also need to check a notification count every minute. For that I'm using rufus-scheduler.
rufus-scheduler seems able to run a background task/thread within a Heroku web dyno.
On the other hand, everything I can find on delayed_job indicates that it requires a separate worker process. Why? If rufus-scheduler can run a daemon within a web dyno, why can't delayed_job do the same?
I've tested the following for running my every-minute task and working off delayed_jobs, and it seems to work within the single Heroku web dyno:
config/initializers/rufus-scheduler.rb
require 'rufus-scheduler'
require 'delayed/command'
s = Rufus::Scheduler.singleton
s.every '1m', :overlap => false do # Every minute
  Rails.logger.info ">> #{Time.now}: rufus-scheduler task started"
  # Check for pending notifications and queue to delayed_job
  User.send_pending_notifications
  # work off delayed_jobs without a separate worker process
  Delayed::Worker.new.work_off
end
This seems so obvious that I'm wondering if I'm missing something? Is this an acceptable way to handle the delayed_job queue without the added complexity and expense of a separate worker process?
Update
As @jmettraux points out, Heroku will idle an inactive web dyno after an hour. I haven't set it up yet, but let's assume I'm using one of the various keep-alive methods to keep it from sleeping: Easy way to prevent Heroku idling?.

According to this
https://blog.heroku.com/archives/2013/6/20/app_sleeping_on_heroku
your dyno will go to sleep if it hasn't serviced requests for an hour. No dyno, no scheduling.
This could help as well: https://devcenter.heroku.com/articles/clock-processes-ruby

how to continue the request in mod_wsgi after processing the request

After processing the request in a mod_wsgi module, I want to continue the request as it would have run without the module. How can I do that?
def application(environ, startResponse):
    # do some processing
    # then continue the request
    ...

If you mean you want to perform some task after the response has been sent, see:
http://code.google.com/p/modwsgi/wiki/RegisteringCleanupCode
Doing such tasks in process can be problematic. You are better off submitting details into a separate task system such as Celery, Redis Queue or Gearman and let it handle it. That way the request handler thread is released to handle other requests and you don't reduce the capacity of the WSGI server as far as handling HTTP requests is concerned.
If this is not what you are asking, you need to explain it a bit better as your description is a little confusing.
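For the first reading above (running a task after the response has been sent), a minimal sketch of the cleanup-on-close pattern follows. It relies only on the WSGI rule that the server calls close() on the returned iterable once the response is finished. CleanupOnClose, after_response and the events list are hypothetical names, and the callback here only records that it ran:

```python
class CleanupOnClose:
    """Wrap a WSGI response iterable and run a callback when it is closed."""

    def __init__(self, iterable, callback, environ):
        self._iterable = iterable
        self._callback = callback
        self._environ = environ

    def __iter__(self):
        # Pass the response body through unchanged.
        for chunk in self._iterable:
            yield chunk

    def close(self):
        try:
            # Honour close() on the wrapped iterable if it has one.
            inner_close = getattr(self._iterable, "close", None)
            if inner_close is not None:
                inner_close()
        finally:
            # Runs after the server has finished sending the response.
            self._callback(self._environ)


events = []  # hypothetical record of cleanup calls, for illustration only


def after_response(environ):
    events.append(("cleanup", environ.get("PATH_INFO")))


def application(environ, start_response):
    body = b"hello\n"
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return CleanupOnClose([body], after_response, environ)
```

The callback still runs on the request handler thread, so the caveat in the answer applies: anything slow is better handed to a separate task system.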

concurrent handling in thin, unicorn, puma, webrick

If I have the following action in a controller
def give_a
  print a
  a = a+1
end
What happens in each webserver when a single request comes in, and when multiple requests are received?
I know that webrick and thin are single threaded, so I guess that means a request doesn't get processed until the current request is done.
What happens in concurrent webservers such as puma or unicorn (perhaps others)?
If there are 2 requests coming in and 2 unicorn threads handle them, would both responses give the same a value (in a situation where both requests enter the method at the same time)?
Or does it all depend on what happens on the server itself, and the access to data is serial?
Is there a way to have a mutex/semaphore for the concurrent webservers?

afaik, the rails application makes a YourController.new for each request env.
from what you post, it is not possible to see what a refers to. if it is some shared class variable, then it is mutable state and could be modified from both request threads.
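On the mutex question, the idea can be shown with a language-neutral sketch. The example below is Python, not Rails, and assumes a is shared mutable state as described above; Counter and run_requests are hypothetical names. The lock makes the read-then-increment step atomic, so no two "requests" can observe the same value:

```python
import threading


class Counter:
    """Shared state with a lock around the read-then-increment step."""

    def __init__(self):
        self._a = 0
        self._lock = threading.Lock()

    def give_a(self):
        # Only one thread at a time may read and update the shared value.
        with self._lock:
            current = self._a
            self._a = current + 1
        return current


def run_requests(counter, n_threads):
    """Simulate n_threads concurrent requests; return the values they saw, sorted."""
    seen = []
    seen_lock = threading.Lock()

    def handle_request():
        value = counter.give_a()
        with seen_lock:
            seen.append(value)

    threads = [threading.Thread(target=handle_request) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(seen)
```

A lock like this only protects threads inside one process. Servers that run several worker processes do not share in-memory variables between workers at all, so a value that must be consistent across workers has to live in an external store.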
