Scrapy distributed connection count - redis

Let's say I had a couple of servers each running multiple Scrapy spider instances at once. Each spider is limited to 4 concurrent requests with CONCURRENT_REQUESTS = 4. For concreteness, let's say there are 10 spider instances at once so I never expect more than 40 requests max at once.
If I need to know at any given time how many concurrent requests are active across all 10 spiders, I might think of storing that integer on a central redis server under some "connection_count" key.
My idea was then to write some downloader middleware that schematically looks like this:
class countMW(object):
    def process_request(self, request, spider):
        # Increment the redis key
        return None  # let Scrapy continue processing the request

    def process_response(self, request, response, spider):
        # Decrement the redis key
        return response

    def process_exception(self, request, exception, spider):
        # Decrement the redis key
        return None  # let other middlewares handle the exception
However, with this approach the connection count under the central key can exceed 40. I even get more than 4 for a single spider when the network is under load. The same happens for a single spider when I replace the redis store with an attribute on the spider instance itself, which rules out lag in updating the remote redis key as the cause.
My theory for why this doesn't work is that even though the request concurrency per spider is capped at 4, Scrapy still creates and queues more than 4 requests in the meantime, and those extra requests call process_request, incrementing the count long before they are fetched.
Firstly, is this theory correct? Secondly, if it is, is there a way to increment the redis count only when a true fetch is occurring (when the request becomes active), and to decrement it similarly?

In my opinion it is better to customize the scheduler, as that fits the Scrapy architecture better and gives you full control over how requests are emitted:
Scheduler
The Scheduler receives requests from the engine and enqueues them for feeding them later (also to the engine) when the engine requests them.
https://doc.scrapy.org/en/latest/topics/architecture.html?highlight=scheduler#component-scheduler
For example, you can find some ideas on how to customize the scheduler here: https://github.com/rolando/scrapy-redis

Your theory is partially correct. Requests are usually created much faster than they are fulfilled, and the engine hands not just some but ALL of these requests to the scheduler. However, these queued requests are not processed, and so do not call process_request, until they are fetched.
There is a slight lag between when the scheduler releases a request and when the downloader begins to fetch it, and this allows for the scenario you observe, where more than CONCURRENT_REQUESTS requests are active at the same time. Because Scrapy processes requests asynchronously, this bit of sloppy double dipping is baked in. So how do you deal with it? I'm sure you don't want to run synchronously.
So the question becomes: what is the motivation behind this? Are you just curious about the inner workings of Scrapy? Or do you have some ISP bandwidth cost limitations to deal with, for example? Because we have to define what we really mean by concurrency here.
When does a request become "active"?
When the scheduler releases it?
When the downloader begins to fetch it?
When the underlying Twisted deferred is created?
When the first TCP packet is sent?
When the first TCP packet is received?
Perhaps you could add your own scheduler middleware for finer-grained control, taking inspiration from Downloader.fetch.
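Whichever definition of "active" you settle on, the counting logic itself is worth making defensive. The following is a minimal sketch, not Scrapy's or scrapy-redis's own code. It assumes a client with incr/decr methods named like those of redis-py; InMemoryCounter is a hypothetical stand-in so the example runs without a Redis server, and the "_counted" meta flag is an invented name, not a Scrapy built-in. The flag ensures a request is decremented at most once, and only if this middleware actually incremented for it.

```python
from types import SimpleNamespace


class InMemoryCounter:
    """Hypothetical stand-in with incr/decr/get named like a redis-py client."""

    def __init__(self):
        self.values = {}

    def incr(self, key):
        self.values[key] = self.values.get(key, 0) + 1
        return self.values[key]

    def decr(self, key):
        self.values[key] = self.values.get(key, 0) - 1
        return self.values[key]

    def get(self, key):
        # Unlike a real Redis client, this returns a plain int.
        return self.values.get(key, 0)


class CountingMiddleware:
    """Counts requests that passed process_request and have not yet come back."""

    KEY = "connection_count"  # key name taken from the question
    FLAG = "_counted"         # assumed meta flag, chosen for this sketch

    def __init__(self, client):
        self.client = client

    def process_request(self, request, spider):
        request.meta[self.FLAG] = True
        self.client.incr(self.KEY)
        return None  # let Scrapy continue handling the request

    def _release(self, request):
        # pop() removes the flag, so a second call for the same request
        # (or a call for a request we never counted) does nothing.
        if request.meta.pop(self.FLAG, False):
            self.client.decr(self.KEY)

    def process_response(self, request, response, spider):
        self._release(request)
        return response

    def process_exception(self, request, exception, spider):
        self._release(request)
        return None


# Usage with a fake request object that only carries a meta dict.
counter = InMemoryCounter()
middleware = CountingMiddleware(counter)
fake_request = SimpleNamespace(meta={})

middleware.process_request(fake_request, spider=None)
count_while_active = counter.get("connection_count")

# Both hooks firing for one request must not decrement twice.
middleware.process_exception(fake_request, Exception("boom"), spider=None)
middleware.process_response(fake_request, response="stub", spider=None)
count_after = counter.get("connection_count")
```

This still counts from the moment process_request runs, so it does not by itself answer the question of when the fetch truly starts; it only keeps the counter from drifting.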

Related

Yielding more requests if scraper was idle more than 20s

I would like to yield more requests at the end of a CrawlSpider that uses Rules.
I noticed I was not able to feed more requests by doing this in the spider_closed method:
self.crawler.engine.crawl(r, self)
I noticed that this technique works in the spider_idle method, but I would like to wait until I am sure the crawl has finished before feeding in more requests.
I set the setting CLOSESPIDER_TIMEOUT = 30
What would be the code to wait 20 seconds idle before triggering the process of feeding more requests?
Is there a better way?
If it is really important that the previous crawling has completely finished before the new crawling starts, consider running either two separate spiders or the same spider twice in a row with different arguments that determine which URLs it crawls. See Run Scrapy from a script.
If you don’t really need the previous crawl to finish, and you simply have URLs that should have a higher priority than other URLs for some reason, consider using request priorities instead. See the priority parameter of the Request class constructor.

Are scrapy CONCURRENT_REQUESTS per spider or per machine?

Newbie designing his architecture question here:
My Goal
I want to keep track of multiple twitter profiles over time.
What I want to build:
A SpiderMother class that interfaces with some Database (holding CrawlJobs) to spawn and manage many small Spiders, each crawling 1 user-page on twitter at an irregular interval (the jobs will be added to the database according to some algorithm).
They get spawned as subprocesses by SpiderMother and, depending on the success of the crawl, the database job gets removed. Is this a good architecture?
Problem I see:
Let's say I spawn 100 spiders and my CONCURRENT_REQUESTS limit is 10. Will twitter.com be hit by all 100 spiders immediately, or do they line up and go one after the other?
Most Scrapy settings and runtime configuration are isolated to the currently open spider for the duration of the run. The default Scrapy downloader also acts per spider only, so you will indeed see 100 simultaneous requests if you fire up 100 processes. You have several options for enforcing per-domain concurrency globally, and none of them is particularly hassle-free:
Use just one spider running per domain and feed it through redis (check out scrapy-redis). Alternatively, don't spawn more than one spider at a time.
Have a fixed pool of spiders, or limit the number of spiders you spawn from your orchestrator. Set the concurrency settings to "desired_concurrency divided by number of spiders".
Override the Scrapy downloader class so that it stores its values externally (in redis, for example).
Personally, I would probably go with the first and, if I hit the performance limits of a single process, scale to the second.
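The "desired_concurrency divided by number of spiders" rule from the second option can be sketched as a small helper. The function names here are made up for illustration. The sketch assumes each spider needs a concurrency of at least 1 to make any progress, which means the effective total can exceed the target once there are more spiders than the desired concurrency.

```python
def per_spider_concurrency(desired_total, spider_count):
    """Value to use for CONCURRENT_REQUESTS in each spider process."""
    if spider_count < 1:
        raise ValueError("spider_count must be at least 1")
    # Integer division rounds down; a spider cannot run with a limit of 0.
    return max(1, desired_total // spider_count)


def effective_total(desired_total, spider_count):
    """Worst-case number of simultaneous requests across all spiders."""
    return per_spider_concurrency(desired_total, spider_count) * spider_count


# 40 total across 10 spiders gives 4 each, and the target is met exactly.
evenly_split = per_spider_concurrency(40, 10)

# 10 total across 100 spiders cannot go below 1 each, so the worst case
# is 100 simultaneous requests, ten times the target.
overshoot = effective_total(10, 100)
```

The second case is the scenario from the question: with 100 spawned spiders, a per-process setting alone cannot hold the global total at 10, which is why limiting the number of spiders comes first.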

Scrapy patterns for large number of requests

I need to scrape a large site. There are about ten categories and thousands (I don't really know how many) of articles in each category. The simplest approach would be to create a spider for each category and yield a request for every article link for further extraction.
What I'm thinking of is to make Top Level spiders which would extract article URLs from the categories into a queue. The Second Level (article) spiders would then each receive a constant number of URLs (say 100) from the queue, and when a spider finishes, another one is started. In this way a) we can control the number of spiders, which is a constant, say 20, b) we have the option of counting the number of articles in advance, and c) each spider has limited memory usage. Something similar worked fine in a previous project.
Does this make sense, or can you just fire as many requests from one spider as possible and it will work fine?
You could fire as many requests from one spider as possible.
This is because Scrapy doesn't process all requests at once; they are all just queued.
You can change the number of requests processed at once with the CONCURRENT_REQUESTS setting, which could indeed cause memory usage problems if it is too high (say 100). Remember that a scrapy job sets 512mb of memory by default per job.
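The idea of queuing everything while only processing a fixed number at once can be illustrated with plain asyncio. This is an analogy only: Scrapy is built on Twisted and does not work this way internally, and the URLs, the limit of 4, and the sleep that stands in for network I/O are all assumptions made for the sketch.

```python
import asyncio


async def crawl(urls, limit):
    """Start a task per URL, but let only `limit` of them 'download' at once."""
    semaphore = asyncio.Semaphore(limit)
    active = 0
    peak = 0

    async def fetch(url):
        nonlocal active, peak
        async with semaphore:  # waits here while `limit` fetches are running
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # placeholder for the real download
            active -= 1
        return url

    results = await asyncio.gather(*(fetch(url) for url in urls))
    return results, peak


# 50 requests are created immediately; at most 4 are in flight at any time.
urls = [f"https://example.com/article/{i}" for i in range(50)]
results, peak = asyncio.run(crawl(urls, limit=4))
```

All 50 tasks exist from the start, which is the cheap part; only the in-flight ones hold connections and response data.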

How to keep an API idempotent while receiving multiple requests with the same id at the same time?

From a lot of the articles and commercial APIs I have seen, most people make their APIs idempotent by asking the client to provide a requestId or idempotent-key (e.g. https://www.masteringmodernpayments.com/blog/idempotent-stripe-requests) and storing the requestId <-> response map. If a request comes in whose requestId is already in this map, the application just returns the stored response.
This all makes sense to me, but my problem is how to handle the case where the second call comes in while the first call is still in progress.
So here are my questions:
I guess the ideal behaviour would be for the second call to keep waiting until the first call finishes and then return the first call's response. Is this how people do it?
If yes, how long should the second call wait for the first call to finish?
If the second call has a wait time limit and the first call still hasn't finished, what should it tell the client? Should it simply not return a response, so that the client times out and retries?
For Wunderlist we use database constraints to make sure that no request id (which is a column in every one of our tables) is ever used twice. Since our database technology (postgres) guarantees that two records violating this constraint can never be inserted, we only need to react properly to the potential insertion error. Basically, we outsource this detail to our datastore.
I would recommend, no matter how you go about this, trying not to coordinate in your application. If you try to detect whether two things are happening at once, there is a high likelihood of bugs. Instead, there might be a system you already use which can make the guarantees you need.
Now, to specifically address your three questions:
For us, since we use database constraints, the database handles making things queue up and wait. This is why I personally prefer the old SQL databases - not for the SQL or relations, but because they are really good at locking and queuing. We use SQL databases as dumb disconnected tables.
This depends a lot on your system. We try to tune all of our timeouts to around 1s in each system and subsystem. We'd rather fail fast than queue up. You can measure and then look at your 99th percentile for timings and just set that as your timeout if you don't know ahead of time.
We would return a 504 HTTP status (and an appropriate response body) to the client. The reason for having an idempotent-key is so the client can retry a request, so we are never worried about timing out and letting them do just that. Again, we'd rather time out fast and fix the problems than let things queue up. If things queue up, then even after something is fixed one has to wait a while for things to get better.
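The database-constraint approach described in this answer can be sketched as follows. The answer uses postgres; this sketch swaps in Python's built-in sqlite3 with an in-memory database purely so that it runs on its own, and the payments table, its columns, and the create_payment function are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments ("
    " request_id TEXT PRIMARY KEY,"  # uniqueness is enforced by the database
    " amount INTEGER NOT NULL)"
)


def create_payment(conn, request_id, amount):
    """Insert once per request_id; on a repeat, return what was stored."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "INSERT INTO payments (request_id, amount) VALUES (?, ?)",
                (request_id, amount),
            )
        return "created", amount
    except sqlite3.IntegrityError:
        # The constraint rejected the duplicate; no application-level
        # coordination was needed to detect it.
        row = conn.execute(
            "SELECT amount FROM payments WHERE request_id = ?", (request_id,)
        ).fetchone()
        return "duplicate", row[0]


first = create_payment(conn, "req-123", 500)
retry = create_payment(conn, "req-123", 500)
row_count = conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
```

The application only reacts to the insertion error; deciding which of two simultaneous inserts wins is left entirely to the database.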
It's a bit hard to understand if the second call is from the same client with the same request token, or a different client.
Normally, in the case of concurrent requests from different clients operating on the same resource, you would also want to implement a versioning strategy alongside a request token for idempotency.
A typical version strategy in a relational database might be a version column with a trigger that auto increments the number each time a record is updated.
With this in place, all clients must specify their request token as well as the version they are updating (typically the If-Match header is used for this, and the version number is used as the value of the ETag).
On the server side, when it comes time to update the state of the resource, you first check that the version number in the database matches the version supplied in the ETag. If it does, you write the changes and the version increments. Assuming the second request was operating on the same version number as the first, it would then fail with a 412 (or 409, depending on how you interpret the HTTP specifications) and the client should not retry.
If you really want to stop the second request immediately while the first request is in progress, you are going down the route of pessimistic locking, which doesn't suit REST APIs that well.
In the case where you are actually talking about the client retrying with the same request token because it received a transient network error, it's almost the same case.
Both requests will be running at the same time. The second request starts because the first request has not finished and has not yet recorded the request token in the database, but whichever one finishes first will succeed and record the request token.
The other request will receive a version conflict (since the first request has incremented the version), at which point it should recheck the request token table, find its own token in there, assume that a concurrent request finished before it did, and return 200.
It seems like a lot, but if you want to cover all the weird and wonderful failure modes when you're dealing with REST, idempotency and concurrency, this is a way to deal with it.
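The version check at the heart of this answer can be sketched with a conditional UPDATE. This is a simplified illustration under several assumptions: it uses an in-memory sqlite3 database, increments the version inside the UPDATE statement instead of in a trigger, and leaves out the request token table. The documents table and the update_if_match function are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    " id INTEGER PRIMARY KEY,"
    " body TEXT NOT NULL,"
    " version INTEGER NOT NULL)"
)
conn.execute("INSERT INTO documents (id, body, version) VALUES (1, 'draft', 1)")
conn.commit()


def update_if_match(conn, doc_id, expected_version, new_body):
    """Apply the update only if the stored version equals the client's version."""
    with conn:
        cursor = conn.execute(
            "UPDATE documents SET body = ?, version = version + 1"
            " WHERE id = ? AND version = ?",
            (new_body, doc_id, expected_version),
        )
    # rowcount is 0 when the version no longer matches (or the row is missing).
    return 200 if cursor.rowcount == 1 else 412


# Two clients both read version 1 and then try to write.
first_status = update_if_match(conn, 1, expected_version=1, new_body="edit A")
second_status = update_if_match(conn, 1, expected_version=1, new_body="edit B")
stored = conn.execute("SELECT body, version FROM documents WHERE id = 1").fetchone()
```

Because the comparison and the write happen in one statement, the database decides the winner and the loser learns about it from the row count.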

Use Redis to track concurrent outbound HTTP requests

I'm a little new to Redis, but I'd like to see if it can be used to keep track of how many concurrent HTTP connections I'm making.
Here's the high level plan:
INCR requests
// request begins
HTTP.get(...)
// request ends
DECR requests
Then at any point, just call GET requests to see how many are currently open.
The ultimate goal here is to throttle my http requests to stay below some arbitrary amount, say 50 requests/s.
Is this the right way to do it? Are there any pitfalls?
As for pitfalls, the only one I can see is that a server that goes down or loses connection to Redis mid-request may never call DECR.
Since you don't know which server made which request, you can never reset the count to the correct value without bringing the system to a halt and resetting it to 0.
I'm not clear what you'd gain by using redis in this situation. It seems to me it would be more suitable to use just a global variable in your server. If your server goes down, so does your counter, so you don't have to put complicated things in place to deal with disconnection, inconsistencies, etc...
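The in-process alternative suggested here can be sketched with a lock-protected counter. The question's pseudocode is not tied to a language, so Python is used for illustration, and InFlightCounter and its track method are invented names. The try/finally makes sure the decrement runs even when the request raises; it cannot help if the whole process dies, but in that case the counter disappears along with it.

```python
import threading
from contextlib import contextmanager


class InFlightCounter:
    """Thread-safe count of operations currently in progress in this process."""

    def __init__(self):
        self._lock = threading.Lock()
        self._count = 0

    @property
    def count(self):
        with self._lock:
            return self._count

    @contextmanager
    def track(self):
        with self._lock:
            self._count += 1
        try:
            yield
        finally:
            # Runs on normal exit and on exceptions alike.
            with self._lock:
                self._count -= 1


counter = InFlightCounter()

with counter.track():
    during = counter.count  # the HTTP call would happen here

try:
    with counter.track():
        raise ConnectionError("simulated failed request")
except ConnectionError:
    pass

after_failure = counter.count
```

The same try/finally shape applies to the INCR/DECR version against Redis, with the caveat from the first answer that a crashed server never reaches its DECR.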