Increase Scrapy + Crawlera crawling speed

CONCURRENT_REQUESTS = 50
CONCURRENT_REQUESTS_PER_DOMAIN = 50
AUTOTHROTTLE_ENABLED = False
DOWNLOAD_DELAY = 0
After checking How to increase Scrapy crawling speed?, my scraper is still slow: it takes about 25 hours to scrape 12,000 pages (Google, Amazon). I use Crawlera. Is there more I can do to increase speed, and when CONCURRENT_REQUESTS = 50, does this mean I have 50 thread-like requests in flight?

#How to run several instances of a spider
Your spider can take arguments in the terminal like this: scrapy crawl spider -a arg=value.
Let's imagine you want to start 10 instances, because I guess you start with 10 urls (quote: input is usually 10 urls). The commands could look like this:
scrapy crawl spider -a arg=url1 &
scrapy crawl spider -a arg=url2 &
...
scrapy crawl spider -a arg=url10
The trailing & tells the shell to run the command in the background, so the next command starts without waiting for the previous one to finish. This is standard in bash on Ubuntu; on Windows, cmd.exe uses start for the same purpose.
##Spider source code
To be able to launch it as shown above, the spider can look like this:
import scrapy

class SpiderExample(scrapy.Spider):
    name = "spider"

    def __init__(self, arg=None, *args, **kwargs):
        # every -a option from the command line arrives here as a keyword argument
        super().__init__(*args, **kwargs)
        self.arg = arg  # or self.start_urls = [arg], which matches your use case
        # initialise anything else you need; it is then available as self.<name> in every method

    def parse(self, response):
        # called with the responses of start_urls; put your parsing instructions here
        pass
#How to avoid being banned
From what I read, you use Crawlera. Personally I have never used it; I have never needed a paid service for this.
##One IP for each spider
The goal here is clear. As I told you in a comment, I use Tor and Polipo. Tor needs an HTTP proxy like Polipo or Privoxy to work correctly with a Scrapy spider: Tor is tunneled through the HTTP proxy, and in the end the proxy works with a Tor IP. Where Crawlera can be interesting is that Tor's exit IPs are well known to some websites with a lot of traffic (and therefore with a lot of robots going through them too...). These websites can ban Tor IPs when they detect robot behaviour coming from the same IP.
That said, I don't know how Crawlera works, so I don't know whether you can open several ports and use several IPs with it; look into that yourself. In my case, with Polipo I can run several tunneled instances (each Polipo listens on the SOCKS port of its own Tor circuit) on several Tor circuits that I launch myself. Each Polipo instance has its own listening port. Then for each spider I can run the following:
scrapy crawl spider -a arg=url1 -s HTTP_PROXY=127.0.0.1:30001 &
scrapy crawl spider -a arg=url2 -s HTTP_PROXY=127.0.0.1:30002 &
...
scrapy crawl spider -a arg=url10 -s HTTP_PROXY=127.0.0.1:30010 &
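For completeness, here is a guess at how such an HTTP_PROXY setting can be wired up. Scrapy's built-in HttpProxyMiddleware reads environment variables and request.meta["proxy"], not a custom setting, so a small downloader middleware (the class name here is illustrative, not from the original post) can bridge the two:
class SettingsProxyMiddleware:
    def __init__(self, proxy):
        self.proxy = proxy

    @classmethod
    def from_crawler(cls, crawler):
        # read the HTTP_PROXY value passed with -s on the command line
        return cls(crawler.settings.get("HTTP_PROXY"))

    def process_request(self, request, spider):
        if self.proxy:
            # meta["proxy"] is the standard hook honoured by HttpProxyMiddleware
            request.meta["proxy"] = "http://" + self.proxy
Enable it in DOWNLOADER_MIDDLEWARES like any other downloader middleware.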
Here each port listens with a different IP, so for the website these are different users. Your spiders can then be more polite (look at the settings options) and your whole project still gets faster. So there is no need to go through the roof by setting CONCURRENT_REQUESTS or CONCURRENT_REQUESTS_PER_DOMAIN to 300; that only makes the website spin its wheels and generates unnecessary events like DEBUG: Retrying <GET https://www.website.com/page3000> (failed 5 times): 500 Internal Server Error.
My personal preference is to set a different log file for each spider. It avoids exploding the amount of output in the terminal and lets me read the events of each process in a more comfortable text file. It is easy to add to the command with -s LOG_FILE=thingy1.log, and it shows you easily whether some urls were not scraped as you wanted.
##Random user agent.
When I read that Crawlera is a smart solution because it uses the right user agent to avoid bans, I was surprised, because you can actually do that yourself, like here. The most important aspect when you do it yourself is to choose popular user agents, so you are overlooked among the large number of real users of that same agent; lists are available on several websites. Also be careful to take desktop user agents and not those of other devices like mobiles, because the rendered page (I mean the source code) is not necessarily the same and you can lose the information you want to scrape.
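For illustration, a minimal sketch of such a do-it-yourself random user-agent downloader middleware; the class name, the example user-agent strings, and the module path mentioned afterwards are assumptions, not something from the original post:
import random

USER_AGENTS = [
    # example strings only; use a list of popular desktop-browser user agents
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
]

class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        # overwrite the User-Agent header on every outgoing request
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        return None  # let the request continue through the download chain
Enable it in settings.py with something like DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.RandomUserAgentMiddleware": 400} (the module path and priority value are placeholders).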
The main con of my solution is that it consumes your computer's resources, so your choice of the number of instances will depend on your computer's capacities (RAM, CPU...) and your router's capacities too. Personally I am still on ADSL and, as I told you, 6000 requests were done in 20-30 minutes... But my solution does not consume more bandwidth than setting a crazy value for CONCURRENT_REQUESTS.

There are many things that can affect your crawler's speed, but the CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_DOMAIN settings are the first to look at. Try setting them to some absurdly high number like 300 and go from there.
see: https://docs.scrapy.org/en/latest/topics/settings.html#concurrent-requests
Other than that, ensure that you have AUTOTHROTTLE_ENABLED set to False and DOWNLOAD_DELAY set to 0.
Even if Scrapy is limited by its internal behaviour, you can always fire up multiple instances of your crawler and scale your speed that way. A common atomic approach is to put urls/ids into a Redis or RabbitMQ queue and pop → crawl them in multiple Scrapy instances.
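As a rough sketch of that pattern (assuming a local Redis server, the redis Python package, and a list key named crawl:urls that you fill beforehand), every spider instance you launch pops from the same shared queue:
import redis
import scrapy

class QueueSpider(scrapy.Spider):
    name = "queue_spider"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.queue = redis.Redis(host="localhost", port=6379)

    def start_requests(self):
        # keep popping until the shared queue is empty; each running
        # instance gets its own subset of the urls
        while True:
            url = self.queue.lpop("crawl:urls")
            if url is None:
                break
            yield scrapy.Request(url.decode("utf-8"), callback=self.parse)

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
Push the urls once (for example r.rpush("crawl:urls", *urls)) and then launch several scrapy crawl queue_spider processes in parallel.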

Related

Too many 429 errors when the cache extension and the proxy middleware are enabled at the same time in scrapy

I am using Scrapy to crawl data. The target website blocks my IP after it sends about 1000 requests.
To deal with this, I wrote a proxy middleware, and because the amount of data is relatively large, I also wrote a cache extension. When I enable both of them I get banned more often; it works well when only the proxy middleware is enabled.
I know that when the Scrapy engine starts, extensions are started earlier than middlewares. Could this be the reason? If not, what else should I consider?
Any suggestions will be appreciated!

Where is a TelnetConsole object constructed in Scrapy?

I'm running Scrapy from scripts, and found that logging doesn't work as expected from the point where it constructs a scrapy.extensions.telnet.TelnetConsole object. I tried to find where the object is constructed in the source files, but I couldn't.
Where does Scrapy construct a scrapy.extensions.telnet.TelnetConsole object when it is run from scripts?
TelnetConsole is a Scrapy extension that lets you connect to a running Scrapy process via telnet:
Telnet is an application protocol used on the Internet or local area network to provide a bidirectional interactive text-oriented communication facility using a virtual terminal connection. User data is interspersed in-band with Telnet control information in an 8-bit byte oriented data connection over the Transmission Control Protocol (TCP).
It allows you to do many things, like inspecting live Python objects and even pausing/resuming the crawl.
See the extensive official docs for the TelnetConsole extension for more.
It is constructed during the crawler's extension-initialisation step.
To disable it, simply set the TELNETCONSOLE_ENABLED setting to False in your settings.py or when running your crawler:
scrapy crawl myspider -s TELNETCONSOLE_ENABLED=False
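When running from a script, a minimal sketch could look like this (MySpider and its module path are placeholders for your own spider):
from scrapy.crawler import CrawlerProcess
from myproject.spiders import MySpider  # placeholder import for your spider class

process = CrawlerProcess(settings={
    "TELNETCONSOLE_ENABLED": False,  # the TelnetConsole extension is never constructed
})
process.crawl(MySpider)
process.start()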

Counting the number of requests an apache module has processed in a day

I have an Apache module that we use. What we would like to do is measure the number of requests the module has served in a day. I know we can grep the access_log or error_log and immediately get the number, but I would like to have some kind of counter inside the Apache module itself and log it with every request. This is not practical as-is, though, because a stock Apache installation uses 5 worker processes by default, so the module is loaded into each process separately and a per-process counter would be meaningless. These helper processes can also be restarted as needed.
Is there a way to achieve this? There is obviously only one httpd process listening on port 80, which then forwards each request to one of the helper processes, right? If so, can that process keep a counter, increment it, and log it with every request it forwards?
There must be a better solution for this.
Thanks
-P
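For what it's worth, a rough sketch of the grep-the-access-log approach mentioned in the question; the log path and the common-log date format are assumptions about the setup:
from datetime import datetime

LOG_PATH = "/var/log/apache2/access_log"     # assumption: adjust to your installation
today = datetime.now().strftime("%d/%b/%Y")  # date format used in the common log format

count = 0
with open(LOG_PATH) as log:
    for line in log:
        if today in line:
            count += 1

print(f"requests served today: {count}")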

How to set Robots.txt or Apache to allow crawlers only at certain hours?

As traffic is distributed unevenly over 24 hours, I would like to disallow crawlers during peak hours and allow them during non-busy hours.
Is there a method to achieve this?
Edit:
Thanks for all the good advice. This is another solution we found.
2bits.com has an article on setting up an iptables firewall to limit the number of connections from certain IP addresses:
the article
The iptables setting:
Using connlimit
In newer Linux kernels, there is a connlimit module for iptables. It can be used like this:
iptables -I INPUT -p tcp -m connlimit --connlimit-above 5 -j REJECT
This limits connections from each IP address to no more than 5 simultaneous connections. This "rations" connections, in a way, and prevents crawlers from hitting the site with many simultaneous connections.
You cannot determine at what time crawlers do their work; however, with Crawl-delay you may be able to reduce the frequency with which they request pages, which can be useful to prevent them from requesting pages in rapid succession.
For Example:
User-agent: *
Crawl-delay: 5
You can't control that in the robots.txt file. It's possible that some crawlers might support something like that, but none of the big ones do (as far as I know).
Dynamically changing the robots.txt file is also a bad idea in a case like this. Most crawlers cache the robots.txt file for a certain time, and continue using it until they refresh the cache. If they cache it at the "right" time, they might crawl normally all day. If they cache it at the "wrong" time, they would stop crawling altogether (and perhaps even remove indexed URLs from their index). For instance, Google generally caches the robots.txt file for a day, meaning that changes during the course of a day would not be visible to Googlebot.
If crawling is causing too much load on your server, you can sometimes adjust the crawl rate for individual crawlers. For instance, for Googlebot you can do this in Google Webmaster Tools.
Additionally, when crawlers attempt to crawl during times of high load, you can always just serve them a 503 HTTP result code. This tells crawlers to check back at some later time (you can also specify a retry-after HTTP header if you know when they should come back). While I'd try to avoid doing this strictly on a time-of-day basis (this can block many other features, such as Sitemaps, contextual ads, or website verification and can slow down crawling in general), in exceptional cases it might make sense to do that. For the long run, I'd strongly recommend only doing this when your server load is really much too high to successfully return content to crawlers.
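As a rough sketch of the 503-during-peak-hours idea (the peak-hour window, the bot user-agent substrings, and the WSGI deployment are all assumptions; the answer above does not prescribe an implementation):
from datetime import datetime

BOT_MARKERS = ("googlebot", "bingbot", "yandexbot")  # illustrative crawler UA substrings
PEAK_HOURS = range(9, 18)                            # illustrative busy window, server local time

def throttle_crawlers(app):
    """Wrap a WSGI app and answer crawlers with 503 + Retry-After during peak hours."""
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if datetime.now().hour in PEAK_HOURS and any(m in ua for m in BOT_MARKERS):
            start_response("503 Service Unavailable",
                           [("Retry-After", "7200"), ("Content-Type", "text/plain")])
            return [b"Busy right now, please retry later."]
        return app(environ, start_response)
    return middleware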
This is not possible using some robots.txt syntax - the feature simply isn't there.
You might be able to influence crawlers by actually altering the robots.txt file depending on the time of day. I expect Google checks the file shortly before crawling, for example. But obviously there is a huge danger of scaring crawlers away for good that way; that risk is probably more problematic than whatever load you are getting right now.
I don't think you can make an appointment with a search engine spider.
First, let it be clear (as John Mueller explained above): dynamically changing the robots.txt file is a bad idea in a case like this, because most crawlers cache it for a certain time and keep using it until they refresh the cache; depending on when they refresh, they may either crawl normally all day or stop crawling altogether (and perhaps even remove indexed URLs from their index). Google, for instance, generally caches robots.txt for a day.
I tried a cron job that renames the robots.txt file during the week, like an on/off switch. Say every Monday at midnight it renames "robots.txt" to "def-robots.txt", which then no longer blocks crawlers. I allow two to three days, then another scheduled cron job renames "def-robots.txt" back to "robots.txt", which starts blocking crawlers from accessing my sites again. So it is a long-winded way of doing it, but exactly what was described above is what happened to me.
I had a major decrease in my indexed links, if not a total loss, because GoogleBot could not verify that the links were still correct while robots.txt was blocking Google from my site for half the week. Simply put: a cron job that swaps in the customised file you want on a schedule can work to some extent, but it is the only way I have found to customise robots.txt on a scheduled basis.
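For illustration, the rename toggle could be a tiny script run from cron (the document root path and the script name are assumptions):
import os
import sys

DOCROOT = "/var/www/html"  # assumption: adjust to where your robots.txt lives

# usage from cron: `python toggle_robots.py off` to stop blocking, `python toggle_robots.py on` to block again
if sys.argv[1] == "off":
    os.rename(os.path.join(DOCROOT, "robots.txt"), os.path.join(DOCROOT, "def-robots.txt"))
else:
    os.rename(os.path.join(DOCROOT, "def-robots.txt"), os.path.join(DOCROOT, "robots.txt"))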
I used cron to modify the Apache config file.
You can add an Include line inside the <Directory ...> block in httpd.conf (e.g. Include bot_block.conf) that holds the filters for the spiders. I have not tried this in .htaccess.
I used SetEnvIf to set a variable and then Deny, so you can match on IP address, User-Agent, etc.
For example:
SetEnvIf Remote_Addr ^192.168.53.2$ timed_bot
SetEnvIfNoCase User-Agent "badbadbot.com" timed_bot
Deny from env=timed_bot
Use a cron job to copy the filters into the file at the time when you want to block the spiders, then restart Apache gracefully.
Use a cron job to overwrite the file with a blank one at the time when you want to allow the spiders again, then restart Apache gracefully.
I've implemented this method and it works.
It's not perfect, since it won't stop a bot that already has requests pending when the blocking time starts, but it should quiet them down after a little while.

Can we detect if a site is on CDN?

Is there a way to detect whether a site is on a Content Delivery Network and, if yes, can we tell which service they are using?
A method that works from the command line is to use the host command with the -a flag to see the DNS records, e.g.
host -a www.visitbritain.com
Returns:
www.visitbritain.com. 0 IN CNAME d18sjq5nyxcof4.cloudfront.net.
Here the CNAME entry tells us that the site is using CloudFront as its CDN.
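The same check can be done from Python, assuming the dnspython package is installed (the hostname is just the example from above):
import dns.resolver  # from the dnspython package

# raises dns.resolver.NoAnswer if the name has no CNAME record
answers = dns.resolver.resolve("www.visitbritain.com", "CNAME")
for record in answers:
    target = str(record.target)
    print("CNAME ->", target)
    if "cloudfront.net" in target:
        print("looks like Amazon CloudFront")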
Just take a look at the URLs of the images (and other media) on the site.
Reverse-look up the IPs of the hostnames you see there and you will see who owns them.
I built this little tool to identify the CDN used by a site or a domain, feel free to try it.
The URL: http://www.whatsmycdn.com/
You might also be able to tell from the HTTP headers of the media if the URL doesn't give it away. For example, media served by SimpleCDN has Server: SimpleCDN 5.6a4 in its headers.
CDN Planet now has its CDN Finder tool on GitHub:
http://www.cdnplanet.com/blog/better-cdn-finder/ The tool installs on the command line and lets you feed in host names and check whether they use a CDN.
If the website is using GCP's CDN, you can simply check it with curl:
curl -I <https://site url>
In the response you may find headers like the following:
x-goog-metageneration: 2
x-goog-stored-content-encoding: identity
x-goog-stored-content-length: 17393
x-goog-meta-object-id: 11602
x-goog-meta-source-id: 013dea516b21eedfd422a05b96e2c3e4
x-goog-meta-file-hash: cf3690283997e18819b224c6c094f26c
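A small Python sketch of the same header check, using the requests package (the URL and the list of headers to look at are just examples):
import requests

resp = requests.head("https://www.example.com/", allow_redirects=True, timeout=10)

# headers such as Server, Via or X-Cache often reveal the CDN in front of a site
for name in ("Server", "Via", "X-Cache", "X-Served-By"):
    if name in resp.headers:
        print(name, "->", resp.headers[name])

# content served through Google Cloud may also expose x-goog-* headers like those listed above
for name, value in resp.headers.items():
    if name.lower().startswith("x-goog"):
        print(name, "->", value)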
Yes, you can find out with:
host -a www.website.com
Apart from the excellent answers already posted here, which include some direct methods that may or may not work for every website out there, there is also an indirect way to see whether a CDN is in place, especially if it is your own website and you want to know whether you are getting what you are paying for!
The promise of a CDN is that connections from your users are terminated closer to them, so they incur less TCP/TLS connection-establishment overhead, and static content is cached closer to them so it loads faster and puts less strain on your origin servers.
To verify this, you can take measurements of site load times across the globe and see whether all users get similar load times. No, you don't have to set up a machine everywhere in the world to do that; someone has already done it for you.
Head to https://prober.tech/ and enter the URL you wish to test for load times.
Because that site itself is behind Cloudflare's CDN, you can put its own link in the test box and use it as a baseline!
More information on using the tool can be found here