I am running a long crawl to download images, and it fails after a while (about an hour). I get the following error:
[1]+ Killed scrapy crawl myScrapper
Killed
There is nothing in particular in the log other than errors about truncated images:
exceptions.IOError: image file is truncated (1 bytes not processed)
Does anybody have any idea?
Thanks!
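For what it's worth, the truncation error itself comes from Pillow (PIL) refusing to decode a partially downloaded file, while the bare "Killed" message usually means the process was terminated from outside (often by the kernel's out-of-memory killer), so the two may be unrelated. If you just want Pillow to tolerate the truncated files, a minimal sketch, assuming the images pipeline decodes with Pillow:

    # e.g. in settings.py or the spider module, before any image is decoded
    from PIL import ImageFile

    # Let Pillow load images even when the final block is missing,
    # instead of raising "image file is truncated"
    ImageFile.LOAD_TRUNCATED_IMAGES = True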
Related
I have set up a cluster to crawl a few websites from the web. I am using Apache Nutch 2.3.1 with Hadoop and HBase. I also have backup power for the cluster, but when a power failure lasts long enough, the backup runs out and the whole cluster goes down. Once the power problem is resolved (somehow), I want to resume my job from where it left off. For example, if my crawler was crawling 1000 URLs and the cluster went down at 50%, I want Nutch to fetch only the remaining 50% and not re-fetch the documents that have already been fetched.
I am running Apache Nutch 1.12 in local mode.
I needed to edit the seed file to remove a subdomain and add a few new domains, and I want to restart the crawl from the start.
The problem is that whenever I restart the crawl, it resumes from where I stopped it, which is in the middle of the subdomain I removed.
I stopped the crawl by killing the Java process (kill -9). I tried creating a .STOP file in the bin directory first, but that did not work, so I used kill.
Now whenever I restart the crawl, I can see from the output that it is resuming where the job was stopped. I googled and came across advice about stopping the Hadoop job, but I don't have any Hadoop files on my server; the only references to Hadoop are JAR files in the Apache Nutch directory.
How can I restart the crawl from the very start, not from where it was last stopped? Effectively, I want to start a fresh crawl.
Many thanks
To start from scratch, just specify a different crawl dir or delete the existing one.
Removing entries from the seed list will not affect the content of the crawldb or the segments. What you could do to remove a domain without restarting from zero is add a pattern to the URL filters so that its URLs get deleted from the crawldb during the update step, or at least not selected during the generate step.
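For instance, a hedged sketch of what such an exclusion rule could look like in conf/regex-urlfilter.txt (the subdomain name is made up; rules are evaluated top to bottom and the first match wins):

    # Reject the subdomain that was removed from the seed list (hypothetical name)
    -^https?://old\.example\.com/

    # Keep the default catch-all accept rule last
    +.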
I'm using scrapy 0.24.4
I have set the download timeout to 15 seconds. Some URLs take more than 15 seconds to download, and my log file records these errors. I have many of these exceptions/errors in the log simply because of the volume of URLs I'm crawling, and it's becoming hard to spot genuine errors. Is there a way to catch these specific exceptions and ignore them?
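One way to approach this (a sketch only; the on_error name is mine, and I have not verified the exact logging behaviour on 0.24.4) is to attach an errback to your requests and silently drop the timeout failures there, while still reporting everything else:

    # In the spider: route download errors to an errback and ignore timeouts.
    from twisted.internet.error import TimeoutError, TCPTimedOutError
    from scrapy.http import Request
    from scrapy.spider import Spider
    from scrapy import log


    class MySpider(Spider):
        name = 'myspider'
        start_urls = ['http://example.com/']  # placeholder URLs

        def start_requests(self):
            for url in self.start_urls:
                yield Request(url, callback=self.parse, errback=self.on_error)

        def parse(self, response):
            pass  # normal parsing goes here

        def on_error(self, failure):
            # Download timeouts are expected at this volume -- drop them quietly.
            if failure.check(TimeoutError, TCPTimedOutError):
                return
            # Anything else is a genuine error worth keeping in the log.
            self.log(repr(failure), level=log.ERROR)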
I am running a simple server app to receive uploads from a Fine Uploader web client. It is based on the Fine Uploader Java example and runs in Tomcat 6 with Apache sitting in front of it, using ProxyPass to route the requests. I am running into an occasional problem where the upload gets to 100% but ultimately fails. In the server logs, as well as on the client, I can see that Apache is timing out on the proxy with a 502 error.
After trying this and seeing the failure myself, I realized the problem occurs with really large files. The Java server app was taking longer than 30 seconds to reassemble the chunks into a single file, so Apache would kill the connection and stop waiting. I have increased Apache's Timeout to 300 seconds, which should largely correct the problem, but the potential for it remains.
Are there other ways to handle this so that the connection between Apache and Tomcat is not killed while the app is assembling the chunks on the server? I am currently using 2 MB chunks and was thinking maybe I should use a larger chunk size; with fewer chunks to assemble, perhaps the server code could finish faster. I could test that, but unless the speedup is dramatic, the potential for problems remains, and it will just be a matter of waiting for a large enough upload to trigger them.
It seems like you have two options:
1. Remove the timeout in Apache.
2. Delegate the chunk-combination effort to a separate thread, and return a response to the request as soon as possible.
With the latter approach, you will not be able to let Fine Uploader know if the chunk-combination operation failed, but perhaps you can perform a few quick sanity checks before responding, such as determining whether all chunks are accessible; a rough sketch of that approach follows below.
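Roughly, the idea looks like this (a language-agnostic sketch written in Python for brevity, even though the actual server is Java/Tomcat; handle_final_chunk and the chunk file layout are hypothetical):

    import os
    import shutil
    from concurrent.futures import ThreadPoolExecutor

    executor = ThreadPoolExecutor(max_workers=2)  # reuse one executor per process

    def all_chunks_present(chunk_dir, total_chunks):
        """Cheap sanity check performed before responding to the client."""
        return all(
            os.path.exists(os.path.join(chunk_dir, str(i)))
            for i in range(total_chunks)
        )

    def combine_chunks(chunk_dir, total_chunks, target_path):
        """Slow part -- runs on a worker thread, after the HTTP response is sent."""
        with open(target_path, 'wb') as target:
            for i in range(total_chunks):
                with open(os.path.join(chunk_dir, str(i)), 'rb') as chunk:
                    shutil.copyfileobj(chunk, target)

    def handle_final_chunk(chunk_dir, total_chunks, target_path):
        # Called by whatever handles the final upload request (hypothetical hook).
        if not all_chunks_present(chunk_dir, total_chunks):
            return {'success': False}   # fail fast while the client can still be told
        executor.submit(combine_chunks, chunk_dir, total_chunks, target_path)
        return {'success': True}        # respond before the merge finishes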
There's nothing Fine Uploader can do here; the issue is server-side. After Fine Uploader sends the request, its job is done until your server responds.
As you mentioned, it may be reasonable to increase the chunk size or make other changes to speed up the chunk combination operation to lessen the chance of a timeout (if #1 or #2 above are not desirable).
Is there a way to shut down a Scrapy crawl from the pipeline? My pipeline processes URLs and adds them to a list. When the list reaches a specified size, I want to shut down the crawler. I know about raising CloseSpider(), but that only seems to work when called from the spider.
Yes, it is possible. You can call crawler.engine.close_spider(spider, reason), where crawler is the running Crawler object (available to a pipeline through its from_crawler class method).
There is an example of this in the CloseSpider extension.
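A minimal sketch of a pipeline doing that (the threshold, the 'url' item field, and the reason string are made up for illustration):

    class UrlCollectorPipeline(object):

        MAX_URLS = 100  # hypothetical threshold

        def __init__(self, crawler):
            self.crawler = crawler
            self.urls = []

        @classmethod
        def from_crawler(cls, crawler):
            # Scrapy passes the running Crawler here, giving access to the engine.
            return cls(crawler)

        def process_item(self, item, spider):
            self.urls.append(item['url'])  # assumes the item has a 'url' field
            if len(self.urls) >= self.MAX_URLS:
                # Same call the CloseSpider extension makes.
                self.crawler.engine.close_spider(spider, 'url_limit_reached')
            return item

Note that close_spider shuts the spider down gracefully, so a few in-flight items may still pass through the pipeline after the call.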