I have configured Apache Nutch 2.3.1 on a single-node cluster (Hadoop 2.7.x and HBase 1.2.6). I want to test its checkpointing feature. According to my information, resuming is available for the fetch and parse phases. Suppose that at some stage during fetching (or parsing) my whole cluster goes down due to some problem, e.g. a power failure. I assume that when I restart the cluster and the crawler with the -resume flag, it should fetch only those URLs that were not yet fetched.
But what I observed (with debug enabled) is that it starts to process all URLs with the same batchId again, right to the end, even with the resume flag. The resume flag only seems to matter when a job (e.g. fetch) had already finished completely. I have cross-checked this in the logs, which show messages like "Skipping express.pk; already fetched".
Is my interpretation of the resume option in Nutch incorrect, or is there some problem in my cluster/configuration?
Your interpretation is right, and in this case the output from Nutch (the logs) is also correct.
If you check the code at https://github.com/apache/nutch/blob/release-2.3.1/src/java/org/apache/nutch/fetcher/FetcherJob.java#L119-L124, Nutch is only logging that it is skipping that URL because it was already fetched. Since Nutch works in batches, it needs to check all the URLs in the same batchId, but if you specify the resume flag it will (only at DEBUG level) log that it is skipping certain URLs. This is done mainly for troubleshooting if you have some issue.
This happens because Nutch doesn't keep a record of the last processed URL, so it needs to start at the beginning of the same batch and work its way from there. Even knowing the last URL would not be enough, because you would also need the position of that URL within the batch.
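To make the batch-scanning behaviour concrete, here is a minimal, self-contained Java sketch of what -resume changes. The types and names are invented for illustration (they are not Nutch's classes; see the FetcherJob source linked above for the real mapper), but the control flow mirrors the description: every record in the batch is still visited, already-fetched records are simply skipped, and the skip is only reported at DEBUG level.

import java.util.List;

public class ResumeSketch {

    // Toy record standing in for a crawl datum; not a Nutch class.
    record Page(String url, String batchId, boolean alreadyFetched) {}

    static void fetchBatch(List<Page> pages, String batchId, boolean resume) {
        for (Page page : pages) {                     // the whole batch is always scanned
            if (!batchId.equals(page.batchId())) {
                continue;                             // record belongs to a different batch
            }
            if (resume && page.alreadyFetched()) {
                // Corresponds to the log line quoted in the question.
                System.out.println("DEBUG Skipping " + page.url() + "; already fetched");
                continue;                             // the earlier work is kept, nothing is refetched
            }
            System.out.println("Fetching " + page.url());
        }
    }

    public static void main(String[] args) {
        List<Page> batch = List.of(
                new Page("http://example.com/a", "1496321400-1234", true),
                new Page("http://example.com/b", "1496321400-1234", false));
        fetchBatch(batch, "1496321400-1234", true);   // run with "resume"
    }
}

Even in this toy version you can see why there is no shortcut: without a stored position inside the batch, the only safe strategy is to re-scan from the start and skip what is already done.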
Related
I have set up a cluster to crawl a few websites from the WWW. I am using Apache Nutch 2.3.1 with Hadoop and HBase. I also have backup power for the cluster, but when a power failure lasts long enough, even this backup runs out and the whole cluster goes down in no time. When the power problem is solved (somehow), I want to resume my job from where it was last time. For example, if my crawler was crawling 1000 URLs and the cluster went down after 50%, I want Nutch to fetch only the remaining 50% and not refetch the documents that were already fetched.
I am running Apache Nutch 1.12 in local mode.
I needed to edit the seed file to remove a subdomain and add a few new domains, and I want to restart the crawl from the start.
The problem is that whenever I restart the crawl, it restarts from where I stopped it, which is in the middle of the subdomain I removed.
I stopped the crawl by killing the Java process (kill -9); I tried creating a .STOP file in the bin directory, but that did not work, so I used kill.
Now whenever I restart the crawl, I can see from the output that it is restarting where the job was stopped. I googled and came across advice about stopping the Hadoop job, but I don't have any Hadoop files on my server; the only reference to Hadoop is the jar files in the Apache Nutch directory.
How can I restart the crawl from the very start and not from where it was last stopped? Effectively, I want to start a fresh crawl.
Many thanks
To start from scratch, just specify a different crawl dir or delete the existing one.
Removing entries from the seed list will not affect the content of the crawldb or the segments. What you could do to remove a domain without restarting from zero would be to add a pattern to the url filters so that the URLs get deleted from the crawldb during the update step or at least not selected during the generation step.
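For example, assuming the subdomain you dropped from the seed list were sub.example.com (a hypothetical name), a rule like this placed before the final accept-all pattern in conf/regex-urlfilter.txt would exclude it, so that, as described above, its URLs are no longer selected during generation and can be dropped from the crawldb during updates:

-^https?://([a-z0-9-]+\.)*sub\.example\.com/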
We are using various upload scripts based on the Perl CGI module for our CMS and had not encountered a problem like the following for years.
Our customer's employees are not able to complete a successful upload.
No matter what kind or size of file, no matter which browser they use, no matter if they do it at work or log in from home.
If they try to use one of our system's upload pages, the following happens:
The upload seems to work until approx. 94% has been transferred. Suddenly, the transfer restarts and the same procedure happens over and over again.
A look in the error log shows this:
Apache2::RequestIO::read: (70007) The timeout specified has expired at (eval 207) line 5
The weird thing is that if I log in to our customer's system using our VPN tunnel, I can never reproduce the error (I can from home, though).
I have googled without much success.
I checked the Apache timeout setting; it was at 300 seconds, which is more than generous.
I even checked the Content-Length field for a value of 0, because I found a forum entry referring to a CGI bug related to a Content-Length of 0.
Now I am really stuck and running out of ideas.
Can you give me some new ones, please?
The Apache server is version 2.2.16 and the Perl CGI module is version 3.43.
We are using mod_perl.
As far as we knew, our customer didn't use any kind of load balancing.
Without letting anyone else know, our customer's infrastructure department activated a load balancer. As a result, requests went to different servers and timed out.
I'm currently developing a website that's getting fairly frequent javascript updates and have just started using mod_pagespeed in an effort to ensure that customers will always have the latest code.
The docs tell me doing this will clear my pagespeed cache and force clients to get my new javascript/css:
sudo touch /var/cache/pagespeed/cache.flush
I did a test by changing some javascript code, hitting refresh on my browser to verify that I was still seeing the old code (my cache expiration is set to one day), then restarting apache, and I can indeed see my new changes.
Can I trust that a restart will always be sufficient, and that a cache.flush is not needed, or do I need to run the flush command as well? I'm reading that a restart of apache is required to clear the memory cache, but not how the file cache and/or cache.flush fits in with that.
Update:
I pulled the pagespeed code, and if I'm understanding correctly, the cache.flush process updates a timestamp.
It looks like that's happening in RewriteOptions::UpdateCacheInvalidationTimestampMs here:
http://modpagespeed.googlecode.com/svn/trunk/src/net/instaweb/rewriter/rewrite_options.cc
If I could figure out which timestamp this is updating, it seems like I could either check it, restart Apache, and check it again (to see whether the timestamp changed), or deduce from the filename/location/owner whether that is likely to happen.
Any more thoughts on this? Advice on how to figure out which timestamp is being updated? Other reasoning to make me feel better about either manually doing the extra flush command every time I update (when I'm already restarting apache for other reasons) or leaving it out?
Touch the cache.flush file:
sudo touch /var/cache/mod_pagespeed/cache.flush
Reference: https://developers.google.com/speed/pagespeed/module/system#flush_cache
No, a restart of Apache doesn't clear the PageSpeed cache. You have to do it manually by touching cache.flush.
What I like to do, to ensure that the whole cache for the entire web portion of the server is cleared:
Apache2 (this is a dry run; remove "-D" if you are sure you want to go through with it; -l sets the total cache size limit and -p the cache path):
htcacheclean -D -p/var/cache/apache2 -l100M
mod_pagespeed:
sudo touch /var/cache/mod_pagespeed/cache.flush
A restart of Apache should flush the cache.
I am using CloudFront with mod_pagespeed running on the server.
When updating a CSS file or flushing the cache, I see problematic behavior: the first refresh in the browser returns the original CSS (this is fine). When I refresh a second time, I get the correct rewritten CSS file name, but the content of the file served from CloudFront is still the original and not the correct rewritten content.
Why would this happen?
Any idea how to fix this?
Update:
For some reason it just stopped happening... I don't know why.
SimonW, since your original post, a feature has been added to PageSpeed (in March 2013, in version 1.2.24.1) to deal with this issue directly. The feature is configured via the following directive:
Apache:
ModPagespeedRewriteDeadlinePerFlushMs deadline_value_in_milliseconds
Nginx:
pagespeed RewriteDeadlinePerFlushMs deadline_value_in_milliseconds;
The docs describe the directive as follows (emphasis mine):
When PageSpeed attempts to rewrite an uncached (or expired) resource it will wait for up to 10ms per flush window (by default) for it to finish and return the optimized resource if it's available. If it has not completed within that time the original (unoptimized) resource is returned and the optimizer is moved to the background for future requests. The following directive can be applied to change the deadline. Increasing this value will increase page latency, but might reduce load time (for instance on a bandwidth-constrained link where it's worth waiting for image compression to complete). Note that a value less than or equal to zero will cause PageSpeed to wait indefinitely.
So, if you specify a value of 0 for deadline_value_in_milliseconds, you should always get the fully optimized page. I would caution that the latency can be high in some cases. In my case, I really wanted this behavior, even with the latency concern, because the content was to be cached on my CDN's edge servers, and thus I wanted the most optimized version possible to be served to the CDN for caching.
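For example, to make PageSpeed wait indefinitely for the optimized version (Apache syntax; use the equivalent pagespeed directive shown above for Nginx):

ModPagespeedRewriteDeadlinePerFlushMs 0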
This could happen if you have multiple backend servers and CloudFront is hitting a different one than the HTML request went through. In that case the resource was rewritten on the HTML server, but not on the other server. There is a short timeout and if the other server doesn't finish the rewrite in that time, it will just serve the original content with Cache-Control: private,max-age=300. It's possible CloudFront caches that for a little while (even though obviously it shouldn't), but then eventually re-requests the resource from your backend and gets the correctly rewritten version this time.