How to ignore download timeout errors in Scrapy

I'm using scrapy 0.24.4
I have set the download timeout to 15 seconds. Some URLs take longer than 15 seconds to download, and my log file records these errors. Because of the sheer volume of URLs I'm crawling, I have many of these exceptions/errors in the log file, and it's becoming hard to see genuine errors. Is there a way to catch these specific exceptions and ignore them?
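One way that seems to work (a sketch, not necessarily the canonical answer for 0.24.4): give each request an errback and swallow timeout failures there, so they are not reported as errors. The spider name, start URL and empty parse() below are placeholders; only the errback wiring matters.

from twisted.internet import defer
from twisted.internet.error import TCPTimedOutError, TimeoutError

from scrapy.http import Request
from scrapy.spider import Spider


class QuietTimeoutSpider(Spider):
    # Placeholder spider: only the errback wiring is the point here.
    name = "quiet_timeouts"
    start_urls = ["http://example.com/"]

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, callback=self.parse, errback=self.handle_failure)

    def parse(self, response):
        pass  # normal item/link extraction goes here

    def handle_failure(self, failure):
        # Quietly drop download timeouts; let everything else be visible.
        if failure.check(TimeoutError, TCPTimedOutError, defer.TimeoutError):
            return
        self.log("Download failure: %r" % failure)

Because the errback consumes the timeout failure, those requests should no longer show up as download errors in the log, while genuine failures still do.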

Related

Apache: mem-cache an entire folder when a single file from that folder is requested

Is it possible to configure the Apache server in such a way that when a file is downloaded from a given folder, e.g. https://images.anywebsite.com/mpSTj8xbeu/1.jpg (but also 2.jpg, 3.jpg, etc.), the entire folder is added to a memory cache for a short period of time, e.g. 60 seconds, so that https://images.anywebsite.com/mpSTj8xbeu/4.jpg, https://images.anywebsite.com/mpSTj8xbeu/5.jpg, etc. are served from RAM rather than from the HDD?
Each photo on the website is downloaded with a delay of a few seconds due to HTML lazy loading.
So a single request to www.anywebsite.com makes around 10 requests to:
https://images.anywebsite.com/mpSTj8xbeu/*.jpg
each a few seconds after the previous one.
I was trying to find the right Apache module for this, without any luck.

Apache2: upload restarts over and over

We are using various upload scripts with the Perl CGI module for our CMS and have not encountered such a problem for years.
Our customer's employees are not able to complete a successful upload.
No matter what kind or size of file, no matter which browser they use, no matter whether they do it at work or log in from home.
If they try to use one of our system's upload pages, the following happens:
The upload seems to work until approximately 94% has been transferred. Suddenly, the upload restarts and the same procedure happens over and over again.
A look in the error log shows this:
Apache2::RequestIO::read: (70007) The timeout specified has expired at (eval 207) line 5
The weird thing is that if I log in to our customer's system using our VPN tunnel, I can never reproduce the error (I can from home, though).
I have googled without much success.
I checked the Apache timeout setting; it was at 300 seconds, which is more than generous.
I even checked the Content-Length field for a value of 0, because I found a forum entry referring to a CGI bug related to a Content-Length field of 0.
Now I am really stuck and running out of ideas.
Can you give me some new ones, please?
The Apache server is version 2.2.16, the Perl CGI module is version 3.43.
We are using mod_perl.
As far as we knew, our customer didn't use any kind of load balancing.
It turned out that, without letting anyone else know, our customer's infrastructure department had activated a load balancer. This way requests went to different servers and timed out.

Scrapy and long-running crawl

I am running a long-running crawl to download images that fails after a while (about an hour). I get the following error:
[1]+ Killed scrapy crawl myScrapper
Killed
I don't have anything in particular in the log other than images being truncated:
exceptions.IOError: image file is truncated (1 bytes not processed)
Does anybody have any idea?
Thanks!
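An aside rather than a confirmed fix (the bare "Killed" line usually means the kernel's out-of-memory killer stopped the process, so memory usage is worth checking separately): the IOError in the log comes from PIL/Pillow refusing to decode partially downloaded image files. If such images are acceptable to you, PIL can be told to tolerate truncated data. A minimal sketch, with the file path and helper name purely illustrative:

from PIL import Image, ImageFile

# Let PIL load images whose data is cut short instead of raising IOError.
ImageFile.LOAD_TRUNCATED_IMAGES = True


def make_thumbnail(path, size=(128, 128)):
    # Open the (possibly truncated) image and shrink it in place.
    image = Image.open(path)
    image.thumbnail(size)
    return image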

What's the performance impact of a large Apache access.log?

If a log file like access.log or error.log gets very large, will its size impact the performance of Apache or of users accessing the site? From my understanding, Apache doesn't read the entire log into memory, but just uses a file handle to write to it. Right? If so, I shouldn't have to remove the logs manually every time they get large, apart from filesystem concerns. Please help and correct me if I'm wrong. Or is there any Apache log I/O issue I'm supposed to take care of when running it?
Thanks very much
Well, I totally agree with you. As I understand it, Apache accesses the log files using file handles and just appends new messages at the end of the file. That's why a huge log file makes no difference when it comes to writing to the file. But if you want to open the file or read it with some kind of log-monitoring tool, the huge size will slow down the process of reading it.
So I would suggest you use log rotation to get an overall better end result.
This suggestion comes directly from the Apache website.
Log Rotation
On even a moderately busy server, the quantity of information stored in the log files is very large. The access log file typically grows 1 MB or more per 10,000 requests. It will consequently be necessary to periodically rotate the log files by moving or deleting the existing logs. This cannot be done while the server is running, because Apache will continue writing to the old log file as long as it holds the file open. Instead, the server must be restarted after the log files are moved or deleted so that it will open new log files.
From the Apache Software Foundation site

How to set Robots.txt or Apache to allow crawlers only at certain hours?

As traffic is distributed unevenly over 24 hours, I would like to disallow crawlers during peak hours and allow them at non-busy hours.
Is there a method to achieve this?
Edit:
Thanks for all the good advice.
This is another solution we found: 2bits.com has an article on setting up an iptables firewall to limit the number of connections from certain IP addresses.
The iptables setting from that article:
Using connlimit
In newer Linux kernels, there is a connlimit module for iptables. It can be used like this:
iptables -I INPUT -p tcp -m connlimit --connlimit-above 5 -j REJECT
This limits each IP address to no more than 5 simultaneous connections. It "rations" connections, so to speak, and prevents crawlers from hammering the site with many simultaneous requests.
You cannot determine what time the crawlers do their work; however, with Crawl-delay you may be able to reduce the frequency with which they request pages. This can be useful to prevent them from rapidly requesting pages.
For Example:
User-agent: *
Crawl-delay: 5
You can't control that in the robots.txt file. It's possible that some crawlers might support something like that, but none of the big ones do (as far as I know).
Dynamically changing the robots.txt file is also a bad idea in a case like this. Most crawlers cache the robots.txt file for a certain time, and continue using it until they refresh the cache. If they cache it at the "right" time, they might crawl normally all day. If they cache it at the "wrong" time, they would stop crawling altogether (and perhaps even remove indexed URLs from their index). For instance, Google generally caches the robots.txt file for a day, meaning that changes during the course of a day would not be visible to Googlebot.
If crawling is causing too much load on your server, you can sometimes adjust the crawl rate for individual crawlers. For instance, for Googlebot you can do this in Google Webmaster Tools.
Additionally, when crawlers attempt to crawl during times of high load, you can always just serve them a 503 HTTP result code. This tells crawlers to check back at some later time (you can also specify a retry-after HTTP header if you know when they should come back). While I'd try to avoid doing this strictly on a time-of-day basis (this can block many other features, such as Sitemaps, contextual ads, or website verification and can slow down crawling in general), in exceptional cases it might make sense to do that. For the long run, I'd strongly recommend only doing this when your server load is really much too high to successfully return content to crawlers.
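For illustration, a minimal sketch of that 503-under-load idea as a Python WSGI middleware; the class name, load threshold and Retry-After value are arbitrary assumptions, and in practice you might implement this in the web server or load balancer instead:

import os


class ShedCrawlersUnderLoad(object):
    """Answer 503 with Retry-After when the 1-minute load average is too high."""

    def __init__(self, app, max_load=8.0, retry_after=3600):
        self.app = app
        self.max_load = max_load
        self.retry_after = retry_after

    def __call__(self, environ, start_response):
        # os.getloadavg() is Unix-only; the threshold is a made-up example.
        if os.getloadavg()[0] > self.max_load:
            start_response("503 Service Unavailable",
                           [("Retry-After", str(self.retry_after)),
                            ("Content-Type", "text/plain")])
            return [b"Temporarily overloaded, please retry later.\n"]
        return self.app(environ, start_response)

Whether you apply this to all requests, as above, or only to known crawler user agents is a design choice; the point is just the 503 plus Retry-After response described in the answer.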
This is not possible with any robots.txt syntax - the feature simply isn't there.
You might be able to influence crawlers by actually altering the robots.txt file depending on the time of day. I expect Google will check the file immediately before crawling, for example. But obviously, there is a huge danger of scaring crawlers away for good that way - that risk is probably more problematic than whatever load you are getting right now.
I don't think you can make an appointment with a search engine spider.
First, let it be clear, as John Mueller's answer above explains: dynamically changing the robots.txt file is a bad idea in a case like this, because most crawlers cache the file for a while and may stop crawling altogether (and even drop indexed URLs) if they happen to cache it at the wrong time.
I tried a cron job that renames the robots.txt file during the week, like an on/off switch. Say every Monday at midnight it renames "robots.txt" to "def-robots.txt", which then no longer blocks crawlers. I allow two to three days, then another scheduled cron job renames "def-robots.txt" back to "robots.txt", which starts blocking crawlers from accessing my sites again. So it is a long-winded way of doing this, but the problem mentioned above is exactly what happened to me:
I had a major decrease in my indexed links, if not a total loss, because Googlebot could not verify that the links were still correct while robots.txt was blocking Google from accessing my site half the week. Simple. A cron job scheduled to change the file to whatever customization you want could work to some extent. That's the only way I have found to customize robots.txt on a scheduled basis.
I used cron to modify the Apache config file.
You can add an Include file within the <Directory ...> block (e.g. Include bot_block.conf) in httpd.conf that holds the filters for the spiders. I have not tried this in .htaccess.
I used SetEnvIf to set a variable and then Deny, so you can match on IP address, User-Agent, etc.
For example:
SetEnvIf Remote_Addr ^192.168.53.2$ timed_bot
SetEnvIfNoCase User-Agent "badbadbot.com" timed_bot
Deny from env=timed_bot
Use a cron job to copy the filters into the file at the time when you want to block the spiders, and
then restart Apache gracefully.
Use another cron job to overwrite the file with a blank one at the time when you want to allow the spiders again, and then restart Apache gracefully once more.
I've implemented this method and it works; a rough sketch of the toggle script is at the end of this answer.
It's not perfect, since it won't stop a bot that already has requests pending when the blocking time arrives, but it should quiet them down after a little while.
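For concreteness, a rough Python sketch of that cron-driven toggle; the include path, the apachectl invocation and the block/allow argument are assumptions to adapt to your own layout (the filter lines are the ones from the example above):

import subprocess
import sys

BOT_BLOCK_CONF = "/etc/apache2/bot_block.conf"   # the file pulled in via Include
FILTER_RULES = """\
SetEnvIf Remote_Addr ^192.168.53.2$ timed_bot
SetEnvIfNoCase User-Agent "badbadbot.com" timed_bot
Deny from env=timed_bot
"""


def main(mode):
    # "block" writes the filters; anything else empties the file.
    with open(BOT_BLOCK_CONF, "w") as conf:
        conf.write(FILTER_RULES if mode == "block" else "")
    # Reload Apache without dropping in-flight requests.
    subprocess.check_call(["apachectl", "graceful"])


if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else "allow")

Two crontab entries, one calling the script with block and one with allow, reproduce the schedule described above.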