How to throttle requests to sites instead of to proxy server in scrapy? - scrapy

I am using a proxy and have set AUTOTHROTTLE_ENABLED to True. I was under the impression that scrapy would throttle requests to the sites I am crawling, but instead it seems to throttle requests to the proxy itself. How do I throttle requests to the sites rather than to the proxy?
Update: I am manually setting the proxy in meta for each request, instead of using the proxy middleware.
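Roughly like this (a minimal sketch; the spider details and the proxy URL are placeholders):

    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "example"

        def start_requests(self):
            # Proxy set per request through meta, rather than via a proxy middleware
            yield scrapy.Request(
                "https://example.com/page",
                meta={"proxy": "http://PROXY_HOST:PROXY_PORT"},
                callback=self.parse,
            )

        def parse(self, response):
            yield {"title": response.css("title::text").get()}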

I don't think this is possible to do solely from the spider side. Looking at the throttling algorithm and at the AutoThrottle extension source code, you can see that the delay being used is the time difference between sending a request and getting back the response. Everything that happens in between is added to this delay (including the proxy delay).
To verify this further, consider these steps:
AutoThrottle uses latency information from the response, found in response.meta['download_latency'] (see here)
The latency information ('download_latency') is set in the dedicated callback once the download is completed, by subtracting the start time from the current time (see here).
The start time is actually set just before the download agent is instructed to download the request, which means everything in between is added to the final latency (see here).
If you want to throttle based on the latency of the target site while going through a proxy, this will have to be handled by the proxy itself. I suggest using one of the managed proxy pool solutions, such as Crawlera.
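If you want to check this in your own runs, you can log download_latency from a downloader middleware and see that it covers the whole round trip through the proxy (a minimal sketch; the module path and priority are placeholders):

    import logging

    logger = logging.getLogger(__name__)

    class LatencyLoggerMiddleware:
        # Enable in settings.py, for example:
        # DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.LatencyLoggerMiddleware": 990}

        def process_response(self, request, response, spider):
            # download_latency is the time between handing the request to the
            # download agent and receiving the response, proxy time included
            latency = request.meta.get("download_latency")
            if latency is not None:
                logger.info("%s took %.3f s", request.url, latency)
            return response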

Related

Too many 429 errors when the cache extension and the proxy middleware are enabled at the same time in scrapy

I am using scrapy to crawl data. The target website blocks an IP after it sends about 1,000 requests.
To deal with this I wrote a proxy middleware, and because the amount of data is relatively large I also wrote a cache extension. When I enable both of them I get banned more often; it works well when only the proxy middleware is enabled.
I know that when the scrapy engine starts, extensions are started before middlewares. Could this be the reason? If not, what else should I consider?
Any suggestions would be appreciated!
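For context, this is roughly how such a setup is wired together in settings.py (a sketch only; the module paths, class names and priority numbers are placeholders, not the actual project code):

    # settings.py -- illustrative only
    DOWNLOADER_MIDDLEWARES = {
        # custom middleware that assigns a proxy to each outgoing request
        "myproject.middlewares.RandomProxyMiddleware": 350,
    }
    EXTENSIONS = {
        # custom cache extension
        "myproject.extensions.ResponseCacheExtension": 500,
    }
    # Note: Scrapy's built-in HTTP cache is a downloader middleware, not an
    # extension, and is enabled with HTTPCACHE_ENABLED = True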

DNS resolves correctly using command line tools but fails on browser

The dig, wget, nslookup and curl commands all work perfectly for a specific URL that I pointed to another server less than 24 hours ago.
The problem is that it simply refuses to resolve in the browser (Chrome, Safari and Firefox). The strangest part is that it is resolved successfully by Postman (testing the OPTIONS and the GET methods separately), yet still doesn't return a proper response on the browser side of things.
DNS checks come back positive, so I started to suspect that the problem actually lies in the headers of the HTTP requests being sent, especially since different responses are returned for requests that don't include the default browser headers (issued through the command-line tools and Postman) and those that do (issued by the browsers automatically, or manually via the dev tools).
After fully flushing the local system's DNS cache, including the browsers', and even trying another device on another network, I still get no response in the browser.
I kept going and attempted to verify this with a VPN (locally, which didn't work) and an online web proxy tool (which did work).
Finally, I extracted the router's default DNS server address and used nslookup to look up the URL again, this time explicitly specifying that DNS server. After getting a successful response with the correct values, I am now fairly sure the HTTP request itself is causing the problem.
The URL is hosted with the Amazon S3 static website hosting option, which I have used many times before with this exact same configuration and never had a problem with. Looking at recently added changes/features suggested that I might need to explicitly set a CORS policy for the newly created bucket, on top of the usual public access policy that is needed.
After applying that as well, it still doesn't seem to work.
As a quick change of direction that might make parts of what's going on clearer (I had started to think that the browser might not be getting the correct Content-Type header in the response, which should be text/html, and therefore might not handle the URL with the expected behavior), I applied a 301 redirection on the S3 bucket instead of the static file hosting. Again, it all works perfectly through the command-line tools, but not through the browsers.
In any case, the browser just doesn't seem to complete any of the requests sent to the URL.
It might be that the OPTIONS pre-flight request fails and the browser never goes on to issue the GET request, or that the URL is not being resolved along the DNS route the browser is taking; it is currently unclear to me which of these is the case.
Any ideas? (Besides the fact that it sometimes just takes longer for some DNS servers along the chosen route to update/refresh their cache, which doesn't appear to be affecting my local machine's DNS route in this specific case. That, said with caution, was verified by checking the different parts of the DNS configuration and prioritization throughout my system (macOS), including the fact that the response does come back with the correct address.)
Found my answer here:
https://serverfault.com/questions/942030/aws-s3-static-hosting-how-to-debug-connection-timeout
As linked there, more details can be found here:
Non-Authoritative-Reason header field [HTTP]
Solution & Explanation: Because of the domain extension I purchased (.dev), Chrome was silently using HTTPS: all .dev domains are on Chrome's HTTP Strict Transport Security (HSTS) preload list and must be served over HTTPS only. That is why the issue still showed up even when I explicitly typed http:// into the address bar.
This can be resolved by putting a CloudFront distribution with HTTPS support in front of the S3 static hosting, as usual (though it should be noted that HSTS listings can cause this in other situations as well, the .dev domain extension being just one of them).
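One quick way to confirm that the plain-HTTP endpoint itself is fine, and that the forced upgrade to HTTPS happens in the browser rather than on the server, is to fetch the URL outside any browser (a small sketch; the domain is a placeholder):

    import urllib.request

    # A plain HTTP request issued outside the browser, so no HSTS enforcement applies
    req = urllib.request.Request("http://example-site.dev/", method="GET")
    with urllib.request.urlopen(req, timeout=10) as resp:
        # If this succeeds while the browser fails, the problem is the
        # browser-side HSTS upgrade to HTTPS, not the S3 hosting itself
        print(resp.status, resp.headers.get("Content-Type"))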
Useful Resources (for debugging purposes)
In addition to what is stated here:
https://gist.github.com/stollcri/7c09bafc97223481920e
You can issue a lookup query (and also add or delete entries in your local set of HSTS listings) through Chrome's internal settings page at chrome://net-internals/#hsts
You can also check the current listings here: https://hstspreload.org/

Why use a service worker for caching when browser cache handles the caching?

I read that using a service worker for offline caching is similar to browser caching. If so, why would you prefer a service worker for caching? Browser caching checks whether the file has been modified and then serves it from the cache, and with a service worker we handle the same thing in our own code. The browser already has that feature by default, so why prefer a service worker?
Service Workers give you complete control over network requests. You can return anything you want for the fetch event; it does not need to be the past or current contents of that particular file.
However, if the HTTP cache handles your needs, you are under no obligation to use Service Workers.
They are also used for things such as push notifications.
Documentation: https://developer.mozilla.org/en-US/docs/Web/API/Service_Worker_API, https://developer.mozilla.org/en-US/docs/Web/API/Service_Worker_API/Using_Service_Workers
I wanted to share the points I observed while going through the service worker documentation and implementing it.
The browser cache is different: because a service worker supports offline caching, the web app can access cached content even when the network is unavailable.
A service worker helps provide a native app-like experience.
A service worker cannot modify DOM content, but it can still serve the pages within its scope. Using messaging (e.g. postMessage), it can communicate with the page, which can then update the DOM.
A service worker does not require user interaction or an open web page; it runs in the background.
Actually, responding to a request is slower when you use a service worker instead of the HTTP cache.
Because a service worker uses the Cache API to store content, it is slower than the browser cache (the memory and disk caches).
It isn't designed to be faster than the HTTP cache; however, with a service worker you can fully customize the response, and I think that full customizability is the reason to use one.
If your situation is not complicated enough to need that, you should not use it.

Apache: simultaneous connections to single script

How does Apache (the most popular version nowadays, I guess) handle a connection to a script when that script is already being executed for another connection?
My guess has always been that upon receipt of a request for a script, the script's contents are copied to memory/compiled/executed, and if another request for the same script arrives during this process, the same thing happens again (assuming Apache does not lock the script file, and simply allocates another share of memory/CPU for another compilation/storage/execution).
Or is there a queuing/waiting mechanism involved?
Assume the additional connection is afforded enough memory and CPU, and does not exceed the maximum connections setting.
The quick (and easy) answer is that every request is processed by a separate process.
Apache listens on a port and creates a new process to handle each incoming request. That means no shared memory.
Also take a look at the processes with the "ps" command; you will see one "httpd" process for each request.
Take a look here for a more detailed description of how it works: http://httpd.apache.org/docs/2.0/mod/worker.html
and look at google too :) http://docstore.mik.ua/orelly/weblinux2/apache/ch01_02.htm

Stop an in-progress query-string-authorized download on Amazon S3?

With Amazon S3, can I stop a query-string-authorized download that is in progress?
Are there other file download services that provide such a feature?
I'm not aware of a built-in way to do this. If I understand your goal, you want to potentially stop an HTTP response mid-stream based on some custom rules of your own. Is that right?
If so, perhaps you could write a very thin proxy to S3 that encapsulates this logic. If you ran the proxy on EC2 you wouldn't incur any additional bandwidth fees.
The downside is that you would have to manage scaling the proxy (i.e. adding more EC2 nodes based on traffic), so depending on your scaling requirements this could require a bit of work. But the proxy script itself would probably be fairly trivial. Something like:
    import requests
    def stream_object(s3_presigned_url, auth_ok, chunk_size=64 * 1024):
        # Sketch only: stream the object from S3, re-checking the auth condition per chunk
        with requests.get(s3_presigned_url, stream=True) as resp:
            for chunk in resp.iter_content(chunk_size):
                if not auth_ok():
                    break      # stop the response mid-stream
                yield chunk    # otherwise pass the chunk on to the caller
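Any web framework that can stream a generator as a response body (Flask, aiohttp and the like) could serve as the thin wrapper around a function like the one above.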
I'm not aware of anyone that allows this. In general, the authentication is only checked once, when you begin downloading, but not thereafter.
Can you describe what you're trying to do more broadly?