Power failure handling in Apache Nutch 2.x

I have set up a cluster to crawl a few websites from the WWW. I am using Apache Nutch 2.3.1 with Hadoop and HBase. I also have backup power for the cluster, but when a power failure lasts long enough even this backup runs out and the whole cluster goes down in no time. When the power problem is solved (somehow), I want to resume my job from where it was the last time. For example, if my crawler was crawling 1000 URLs and the cluster went down at 50%, I want Nutch to fetch only the remaining 50% and not refetch the documents that were already fetched.

Related

Apache Nutch 2.3.1 checkpointing not working

I have configured Apache Nutch 2.3.1 on a single-node cluster (Hadoop 2.7.x and HBase 1.2.6) and I want to test its checkpointing feature. As far as I know, resuming is available in the fetch and parse stages. Suppose that at some point during fetching (or parsing) my complete cluster goes down due to some problem, e.g. a power failure. I assumed that when I restart the cluster and the crawler with the -resume flag, it would start to fetch only those URLs that were not yet fetched.
But what I observed (with debug logging enabled) is that it starts to refetch all URLs (with the same batchId) from the beginning to the end, even with the resume flag. The resume flag only works when a job (e.g. fetch) was completely finished. I have cross-checked this in the logs, which contain messages like "Skipping express.pk; already fetched".
Is my interpretation of the resume option in Nutch incorrect, or is there some problem in my cluster/configuration?
Your interpretation is right, and in this case the output from Nutch (the logs) is also correct.
If you check the code at https://github.com/apache/nutch/blob/release-2.3.1/src/java/org/apache/nutch/fetcher/FetcherJob.java#L119-L124, Nutch is only logging that it is skipping that URL because it was already fetched. Since Nutch works in batches, it needs to check all the URLs with the same batchId, but if you specify the resume flag it will log (only at DEBUG level) that it is skipping certain URLs. This is done mainly for troubleshooting when you run into an issue.
This happens because Nutch doesn't keep a record of the last processed URL, so it needs to start at the beginning of the same batch and work its way through it. Even knowing the last URL would not be enough, because you would also need the position of that URL within the batch.
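To illustrate the behavior (a minimal, self-contained sketch, not the actual FetcherJob code; the batch map and URLs below are made up), the resume flag only changes whether an already-fetched row is refetched, not whether it is visited:

import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: on resume, every row of the batch is still visited, but rows that
// already carry a fetch mark are skipped instead of refetched.
public class ResumeSketch {
    public static void main(String[] args) {
        // hypothetical batch: URL -> already fetched before the crash?
        Map<String, Boolean> batch = new LinkedHashMap<>();
        batch.put("http://example.com/a", true);
        batch.put("http://example.com/b", false);

        boolean resume = true;
        for (Map.Entry<String, Boolean> entry : batch.entrySet()) {
            if (resume && entry.getValue()) {
                // Nutch only emits the corresponding message at DEBUG level
                System.out.println("Skipping " + entry.getKey() + "; already fetched");
                continue;
            }
            System.out.println("Fetching " + entry.getKey());
        }
    }
}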

Long-running apps in YARN and yarn.nodemanager.log.retain-seconds

Assume my yarn.nodemanager.log.retain-seconds is set to 2 days and YARN log aggregation is turned off, so every generated log remains on the local filesystem. What would happen if I had a long-running job, e.g. Spark Streaming, that continued to run for more than 2 days? Would the logs be deleted from the local filesystem even though the Spark Streaming job is still running?
There is not much information on the web about this scenario, and in the yarn-site.xml documentation yarn.nodemanager.log.retain-seconds is only described as "Time in seconds to retain user logs. Only applicable if log aggregation is disabled".
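As a starting point, you can at least confirm which values the NodeManager will actually use by reading the effective configuration (a minimal sketch; it assumes yarn-site.xml is on the classpath, and 10800 seconds / 3 hours is the documented default when the property is unset):

import org.apache.hadoop.conf.Configuration;

// Sketch: print the effective log-retention and log-aggregation settings.
public class LogRetentionCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.addResource("yarn-site.xml"); // assumes the file is on the classpath
        long retainSeconds = conf.getLong("yarn.nodemanager.log.retain-seconds", 10800L);
        boolean aggregation = conf.getBoolean("yarn.log-aggregation-enable", false);
        System.out.println("log retain seconds = " + retainSeconds
                + ", log aggregation enabled = " + aggregation);
    }
}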

Apache Nutch restart crawl

I am running Apache Nutch 1.12 in local mode.
I needed to edit the seed file to remove a subdomain and add a few new domains, and I want to restart the crawl from the start.
The problem is that whenever I restart the crawl, it resumes from where I stopped it, which is in the middle of the subdomain I removed.
I stopped the crawl by killing the Java process (kill -9); I tried creating a .STOP file in the bin directory, but that did not work, so I used kill.
Now whenever I restart the crawl I can see from the output that it picks up where the job was stopped. I googled and came across instructions for stopping the Hadoop job, but I don't have any Hadoop files on my server; the only references to Hadoop are the jar files in the Apache Nutch directory.
How can I restart the crawl from the very start and not from where it was last stopped? Effectively I want to start a fresh crawl.
Many thanks
To start from scratch, just specify a different crawl dir or delete the existing one.
Removing entries from the seed list will not affect the content of the crawldb or the segments. What you could do to remove a domain without restarting from zero is to add a pattern to the URL filters, so that its URLs get deleted from the crawldb during the update step, or at least not selected during the generate step; see the example below.
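For example, assuming the default regex-urlfilter.txt is active and the removed subdomain is sub.example.com (a hypothetical name), an exclude rule placed before the final catch-all accept rule would look like this:

# skip everything from the removed subdomain
-^https?://sub\.example\.com/
# ... keep the existing rules, ending with the default accept-anything rule
+.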

How to know the status of Apache Solr for daily indexed documents

I am using Apache Solr 4.10.x. Apache Nutch is being used to crawl and index documents. Now that my crawler is running, I want to know how many documents are indexed on each iteration of Nutch, or per day.
Any idea, or any tool provided by Apache Solr for this purpose?
facet=true
facet.date=myDateField
facet.date.start=start_date
facet.date.end=end_date
facet.date.gap=+1MONTH (the gap is the bucket duration; use +1DAY in your case)
If you are querying Solr over HTTP, append these parameters to your request URL with &.
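Put together, a per-day count request could look like the following (a sketch: the core name collection1 and the date field tstamp from the stock Nutch Solr schema are assumptions, and the + in the gap must be URL-encoded as %2B):

http://localhost:8983/solr/collection1/select?q=*:*&rows=0&facet=true&facet.date=tstamp&facet.date.start=NOW/DAY-7DAYS&facet.date.end=NOW/DAY&facet.date.gap=%2B1DAY

The per-day counts are returned in the facet_counts section of the response.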
You can hit a URL with command=status.
For example, in my case it was:
qt=/dataimport&command=status
It gives you the status, such as committed or rolled back, and the total number of documents processed.
For more info, check
http://wiki.apache.org/solr/DataImportHandler
and look for "command".
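As a concrete example (host, port, and core name are assumptions, and this only applies if you load documents through the DataImportHandler rather than through Nutch's indexer):

http://localhost:8983/solr/collection1/dataimport?command=status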

Apache SSL Plone 4.2 random Proxy Error

I have Apache set up with Plone 4.2 and SSL, with the following rules in Apache's ssl.conf file:
RewriteEngine On
ProxyVia On
Redirect permanent / https://mywebsite.com/PloneSite/subfolder
RewriteRule ^/(.*) http://localhost:8080/VirtualHostBase/https/%{SERVER_NAME}:443/VirtualHostRoot/PloneSite/subfolder/$1 [L,P]
However, about once or twice a day (seemingly at random), the site gets really slow and eventually starts serving 502 errors (Proxy Error). The only thing that appears to fix it is to restart Plone with "plonectl restart". I'm really at a loss as to what is causing this; do any of the rules above appear to be incorrect?
This is not a proxy setup problem; Apache proxy rules for Plone either work or they don't. The proxy error is caused by Plone no longer responding, and that is why restarting Plone fixes the problem temporarily.
You'll need to figure out why Plone stops responding. This can have any number of reasons, and you'll have to pinpoint what is going on.
You could have a programming error, one that ties up a thread forever, in part of your site. Once you run out of threads, Plone can no longer serve normal requests and you get your proxy errors. You could use Products.signalstack to peek at what your threads are up to when your site is no longer responding.
It could be that something trashed your ZODB cache; if a web crawler tried to load all of your site in short succession, for example, it may have caused so much cache churn that it takes a while to rebuild your catalog cache. Take a close look at your log files (both from Apache and from the Plone instance) and look for patterns.
In such cases you'd either have to block the crawler or install better caching to lighten the load on your Plone server (Varnish does a great job in such caching setups, with some careful tuning).
Inexperienced catalog usage could also have trashed your ZODB cache, with the same results. In one (very bad) case I've seen, some code looked up all objects of a certain type from the catalog, called getObject() on every result (loading each and every object into memory), and then filtered that huge set down to the handful of objects that were actually needed. Instead, the catalog should have been used to narrow the list of objects significantly before loading the objects themselves.
It could also be that you are not taking advantage of ZODB blobs; storing large files on disk and serving them directly from there, instead of from ZODB objects, spares your memory cache significantly.
All in all this could be some work to sort out, depending on what the root cause is.