Too many 429 errors when the cache extension and the proxy middleware are enabled at the same time in Scrapy

I am using Scrapy to crawl data. The target website blocks an IP after it has sent about 1,000 requests.
To deal with this, I wrote a proxy middleware, and because the amount of data is relatively large, I also wrote a cache extension. When I enable both of them, I get banned more often; everything works well when only the proxy middleware is enabled.
I know that when the Scrapy engine starts, extensions are started earlier than middlewares. Could this be the reason? If not, what else should I consider?
Any suggestions will be appreciated!
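For reference, a minimal sketch of this kind of per-request proxy downloader middleware (the class name and proxy addresses are illustrative placeholders, not the actual code):

import random

class RandomProxyMiddleware:
    # Illustrative rotating-proxy downloader middleware; the proxy
    # addresses below are placeholders.
    PROXIES = [
        "http://proxy1.example.com:8080",
        "http://proxy2.example.com:8080",
    ]

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware honours meta["proxy"].
        request.meta["proxy"] = random.choice(self.PROXIES)

A middleware like this would be enabled through the DOWNLOADER_MIDDLEWARES setting in the usual way.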

Related

Domain URL masking

I am currently hosting the contents of a site with ProviderA. I have a domain registered with ProviderB. I want users to access the contents (www.providerA.com/sub/content) by visiting www.providerB.com. A domain forward is easy enough and works as intended, however, unless I embed the site in a frame (which is a big no-no), the actual URL reads www.providerA.com/sub/content despite the user inputting www.providerB.com.
I really need a solution for this: domain masking without the use of a frame. I'm sure this has been done before. An .htaccess domain rewrite?
Your help would be hugely appreciated! I'm going nuts trying to find a solution.
For Apache
The usual way: set up mod_proxy. The Apache on providerB becomes a client to providerA's Apache: it fetches the content and sends it back to the client.
But it looks like you only have .htaccess, so no proxy; you need full configuration access for that.
So you cannot do it that way; see: How to set up proxy in .htaccess
If you have PHP on providerB
Set up a proxy written in PHP. All requests to providerB are intercepted by that PHP proxy, which gets the content from providerA and sends it back, so it does the same thing as the Apache module. However, depending on the quality of the implementation, it might fail on some request types, sizes, timeouts, ...
Search for "php proxy" on the web; you will find a couple available on GitHub and elsewhere. YMMV as to how difficult each is to set up and how reliable it is.
No PHP, but some other server-side language
Obviously this could be done in another language (a rough Python sketch follows); I picked PHP because that is what I use the most.
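Purely for illustration, a bare-bones pass-through proxy in Python's standard library; the upstream URL and port are placeholders, and a real one would also need to forward headers, query strings, and POST bodies, and handle errors and timeouts:

from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

UPSTREAM = "https://www.providerA.com/sub/content"  # placeholder upstream

class ProxyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Fetch the same path from the upstream host and relay the response.
        with urlopen(UPSTREAM.rstrip("/") + self.path) as upstream:
            body = upstream.read()
            self.send_response(upstream.status)
            self.send_header("Content-Type",
                             upstream.headers.get("Content-Type", "text/html"))
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), ProxyHandler).serve_forever()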
The best solution would be to transfer the content to providerB :-)

How to throttle requests to sites instead of to the proxy server in Scrapy?

I am using a proxy and have set AUTOTHROTTLE_ENABLED to True. I was under the impression that Scrapy throttles the sites I am crawling, but instead it seems that Scrapy throttles requests to the proxy itself. How do I throttle requests to the sites instead of to the proxy?
Update: I am manually setting the proxy in meta when making each request, instead of using a proxy middleware.
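For reference, this is the per-request pattern meant by setting the proxy in meta (the URL and proxy address are placeholders):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        # Attach the proxy to each request via meta; the address is a placeholder.
        yield scrapy.Request(
            "https://example.com/page",
            callback=self.parse,
            meta={"proxy": "http://user:pass@proxy.example.com:8080"},
        )

    def parse(self, response):
        self.logger.info("fetched %s", response.url)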
I don't think this is possible to do solely from the spider side. Looking at the throttling algorithm and at the AutoThrottle extension source code, you can see that the delay being used is the time difference between sending a request and getting back a response. Everything that happens in between is added to this delay (including the proxy delay).
To verify this further, consider these steps:
1. AutoThrottle uses latency information from the response, found in response.meta['download_latency'] (see here).
2. The latency information ('download_latency') is set in a dedicated callback once the download is completed, by subtracting the start time from the current time (see here).
3. The start time is set just before the download agent is instructed to download the request, which means everything in between is added to the final latency (see here).
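To see the raw value AutoThrottle works with, you can log it from a callback; the spider below is only a sketch (name and URL are placeholders):

import scrapy

class LatencySpider(scrapy.Spider):
    name = "latency_check"
    start_urls = ["https://example.com/"]

    def parse(self, response):
        # download_latency covers the whole download, proxy hop included.
        latency = response.meta.get("download_latency")
        self.logger.info("download_latency for %s: %s s", response.url, latency)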
If you want to actually throttle according to the target site's latency while going through a proxy, that will have to be handled by the proxy itself. I suggest using one of the managed proxy pool solutions, such as Crawlera.

Cloudflare Bad Gateway 502 error

My users and I often run into a Cloudflare Bad Gateway 502 error. Trying to figure out what goes wrong is hard, because Cloudflare blames the hosting company and the hosting company blames Cloudflare. A typical situation when using Cloudflare.
What I have noticed is that nothing actually fails. The host receives the request and handles it just fine, but sometimes it takes a bit longer than usual to complete. Cloudflare can't wait, and instead throws a Bad Gateway error while the script is actually still running.
I've noticed this behavior when performing heavy back-end tasks (like generating 50+ PDFs). My users notice it when they try to upload an image (which often starts a resizing task).
Is there a way I can configure my server so that Cloudflare knows that the request is still being processed? Or should I just ditch Cloudflare overall?
The culprit was Railgun. After disabling Railgun (in Cloudflare's control panel) the Bad Gateway 502 errors immediately disappeared.
I dealt with this error for quite a long time, and Cloudflare support wasn't able to guide me.
To solve it I tried multiple tweaks and tricks.
The successful one was changing https to http in the site URL stored in your database (the wp_options table).
For example:
https://xxxxx.com/ to http://xxxxx.com/
Then switch your SSL setting to "Full" in the Cloudflare settings.
This should work fine, good luck.
I have researched this error in depth and noted down what I found in this blog post: https://modernbreeze.in/error-502-bad-gateway-cloudflare-how-to-fix-in-wordpress/
Please read it and let me know whether it solves the issue.

Python BaseHTTPServer vs Apache and mod_wsgi

I am setting up a very simple HTTP server for the first time, am considering my options, and would appreciate any feedback on the best way to proceed. My goal is pretty simple: I'm not serving any files, I only need to respond to a very specific HTTP POST request that will contain geolocation data, run some Python code, and return the results as JSON. I do need to be able to respond to multiple simultaneous requests. I would like to use HTTPS.
Looking on Stack Overflow, it seems I can go with either BaseHTTPServer and ThreadingMixIn, or Apache and mod_wsgi. I already have Apache installed, but have never configured it. Are there compelling reasons to go the more complicated Apache route (more complicated to me, because I would need to research configuring Apache and getting mod_wsgi going, whereas I already have a test instance of BaseHTTPServer up and running), or is it equally safe, secure (very important), and performant to use BaseHTTPServer for something so simple?
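For reference, a rough sketch of the BaseHTTPServer + ThreadingMixIn approach, using the Python 3 module names (http.server / socketserver; in Python 2 these are BaseHTTPServer and SocketServer). The request handling is a placeholder and TLS is not shown:

import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from socketserver import ThreadingMixIn

class ThreadedHTTPServer(ThreadingMixIn, HTTPServer):
    """Handle each request in its own thread."""

class GeoHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        data = json.loads(self.rfile.read(length) or b"{}")
        # Placeholder for the real geolocation computation.
        body = json.dumps({"echo": data}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    ThreadedHTTPServer(("", 8080), GeoHandler).serve_forever()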
BaseHTTPServer is not a production grade server.
If you don't understand how to set up Apache, but want to get something with mod_wsgi running quickly and easily, then you probably want to look at mod_wsgi express.
This gives you a way of installing mod_wsgi using Python's 'pip', and it also provides a way of starting up Apache/mod_wsgi with an auto-generated Apache and mod_wsgi configuration, so that you don't even need to know how to configure Apache.
The next version of mod_wsgi express to be released (version 4.3.0, likely released this week) can even set up an HTTPS site for you; you just need to have obtained a valid certificate or generated a self-signed certificate.
If you are interested, I would suggest using the mod_wsgi mailing list to ask for more details about using mod_wsgi express to run an HTTPS site.
http://code.google.com/p/modwsgi/wiki/WhereToGetHelp?tm=6#Asking_Your_Questions
You can start playing around with it for a normal HTTP site, though, by following the instructions at:
https://pypi.python.org/pypi/mod_wsgi
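For comparison, here is a sketch of the same kind of JSON POST endpoint written as a WSGI application, which is what Apache/mod_wsgi (or mod_wsgi express) would serve; the request handling is again only a placeholder:

import json

def application(environ, start_response):
    length = int(environ.get("CONTENT_LENGTH") or 0)
    data = json.loads(environ["wsgi.input"].read(length) or b"{}")
    # Placeholder for the real geolocation computation.
    body = json.dumps({"echo": data}).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ])
    return [body]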

How do I configure Apache - which does not have mod_expires or mod_headers - to send expiry headers?

The web server hosting my website is not returning Last-Modified or expiry headers. I would like to rectify this to ensure my web content is cacheable.
I don't have access to the Apache config files because the site is hosted in a shared environment that I have no control over. I can, however, make configurations via an .htaccess file. The server - Apache 1.3 - is not configured with mod_expires or mod_headers, and the company will not install these for me.
With these limitations in mind, what are my options?
Sorry for posting this here. I recognise this is not strictly a programming question, but more of a sysadmin question. When serverfault is public I'll make sure I direct questions of this nature there.
What sort of content? If it's static (HTML, images, CSS), then really the only way to attach headers is via the front-end web server. I'm surprised the hosting company doesn't have mod_headers enabled, although they might not enable it for .htaccess. Not caching is costing them more bandwidth and CPU (i.e., money).
If it's dynamic content, then you'll have control when generating the page. This will depend on your language; here's an example for PHP (it's from the PHP manual, and is a bad example, as it should also set the response code):
if (!headers_sent()) {
    header('Location: http://www.example.com/');
    exit;
}
Oh, and one thing about setting caching headers: don't set them for too long a duration, particularly for CSS and scripts. You may not think you want to change these, but you don't want a broken site while people still have the old content in their browsers. I would recommend maximum cache settings in the 4-8 hour range: good for a single user's session, or a work day, but not much more.