Apache Nutch restart crawl - apache

I am running Apache Nutch 1.12 in local mode.
I needed to edit the seed file to remove a subdomain and add a few new domains, and I want to restart the crawl from the start.
The problem is that whenever I restart the crawl, it resumes from where I stopped it, which is in the middle of the subdomain I removed.
I stopped the crawl by killing the Java process (kill -9). I first tried creating a .STOP file in the bin directory, but that did not work, so I used kill.
Now whenever I restart the crawl, I can see from the output that it is restarting where the job was stopped. I googled and came across advice on stopping the Hadoop job, but I don't have any Hadoop files on my server; the only references to Hadoop are jar files in the Apache Nutch directory.
How can I restart the crawl from the very start and not from where the crawl was last stopped? Effectively, I want to start a fresh crawl.
Many thanks

To start from scratch, just specify a different crawl dir or delete the existing one.
Removing entries from the seed list will not affect the content of the crawldb or the segments. To remove a domain without restarting from zero, you could add a pattern to the URL filters so that the URLs get deleted from the crawldb during the update step, or at least are not selected during the generation step.
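To illustrate the URL filter idea, here is a minimal Python sketch of how rules in the style of Nutch's conf/regex-urlfilter.txt are evaluated: each rule starts with `+` (accept) or `-` (reject), the first rule that matches decides, and a URL that matches no rule is rejected. This is only a model of the rule semantics, not Nutch's own code; the host name `old.example.com` and the helper names `parse_rules` and `accepts` are hypothetical.

```python
import re

# Rules written the way they would appear in a regex URL filter file.
# The host name below is a hypothetical placeholder for the removed subdomain.
RULES = """
# drop the removed subdomain
-^https?://old\\.example\\.com/
# accept everything else
+.
"""

def parse_rules(text):
    """Turn filter lines into (allow, compiled_pattern) pairs, keeping file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line[0] not in "+-":
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules

def accepts(url, rules):
    """Return True if the first matching rule is an accept rule."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False  # no rule matched: the URL is rejected
```

Because the first match wins, the reject rule for the removed subdomain has to come before the catch-all `+.` line.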

Related

Apache Nutch 2.3.1 checkpointing not working

I have configured Apache Nutch 2.3.1 on a single-node cluster (Hadoop 2.7.x and HBase 1.2.6). I need to check its checkpointing feature. As far as I know, resuming is available in the fetch and parse steps. Suppose that at some stage during fetching (or parsing) my whole cluster goes down due to some problem, e.g. a power failure. I expect that when I restart the cluster and the crawler with the -resume flag, it should fetch only those URLs that were not yet fetched.
But what I observed (with debug enabled) is that it starts to refetch all URLs (with the same batchId) to the end, even with the resume flag. The resume flag only works when a job (e.g. fetch) has completely finished. I have cross-checked this in the logs, which show messages like "Skipping express.pk; already fetched".
Is my interpretation of the resume option in Nutch incorrect?
Or is there some problem with the cluster/configuration?
Your interpretation is right. In this case the output from Nutch (the logs) is also correct.
If you check the code at https://github.com/apache/nutch/blob/release-2.3.1/src/java/org/apache/nutch/fetcher/FetcherJob.java#L119-L124, Nutch is only logging that it is skipping that URL because it was already fetched. Since Nutch works in batches, it needs to check all the URLs in the same batchId, but if you specify the resume flag it will log (only at DEBUG level) that it is skipping certain URLs. This is done mainly to help with troubleshooting.
This happens because Nutch doesn't keep a record of the last processed URL, so it needs to start at the beginning of the same batch and work its way from there. Even knowing the last URL is not enough, because you would also need the position of that URL in the batch.
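The behaviour described above can be pictured with a small Python sketch. It is a simplified model, not Nutch's actual implementation: the function `fetch_batch` and its parameters are hypothetical, and the real fetch is replaced by appending to a list. The point is that the scan always walks the whole batch, and with resume enabled the already fetched URLs are skipped (and logged) rather than fetched again.

```python
def fetch_batch(batch, already_fetched, resume, log):
    """Walk every URL in the batch; with resume=True, skip the fetched ones."""
    fetched_now = []
    for url in batch:  # the scan always starts at the beginning of the batch
        if resume and url in already_fetched:
            # Corresponds to the DEBUG message quoted in the question.
            log.append(f"Skipping {url}; already fetched")
            continue
        fetched_now.append(url)  # stand-in for the real network fetch
    return fetched_now
```

In this model the log grows by one line per skipped URL, which is why a resumed run can look like it is going through the entire batch again even though nothing is refetched.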

Power failure handling in Apache Nutch 2.x

I have set up a cluster to crawl a few websites from the WWW. I am using Apache Nutch 2.3.1 with Hadoop and HBase. I also have a backup for the cluster. But when a power failure lasts long enough, even this backup is exhausted and the whole cluster goes down abruptly. When the power problem is solved, I want to resume my job from where it left off. For example, if my crawler was crawling 1000 URLs and the cluster goes down after 50%, I want Nutch to fetch only the remaining 50% and not refetch the documents that were already fetched.

Easy Hosting Control Panel creates multiple backups

I have a server running Ubuntu 16.04, and apparently Easy Hosting Control Panel keeps creating backups, around 50 times a day, which fills the 50 GB disk and causes the server to crash.
The backups are created as multiple directories named Apache2.backupbyehcp inside the /etc directory.
I've tried deleting the backups one by one, but after a day they are back again.
I want to disable or limit the backups created.
Any help is greatly appreciated.
Here's a screen shot of the backup directories that are being created:
This is caused by EHCP trying to recover the webserver config each time it detects that the config is broken or the webserver is not responding.
This may result in such unexpected or unwanted behaviour.
What to do:
First, look for the problem in the webserver configs, for example with tail -f /var/log/ehcp.log, so that you can understand what is going wrong.
This is sometimes caused by incorrect custom webserver configurations made by an admin or reseller. You may disable custom webserver configs via EHCP GUI -> Options.
(I strongly suggest finding the cause of this.)
If everything regarding the webserver is okay and you just need to disable this backup, open install_lib.php in the ehcp dir, search for backupbyehcp, and disable that line.
Hope this helps.
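While tracking down the cause, it can help to see how often the backup directories appear. The following is a rough Python sketch, assuming the backups are directories directly under /etc whose names contain "backupbyehcp" as described in the question; the function name `list_ehcp_backups` and its parameters are hypothetical.

```python
import os

def list_ehcp_backups(root="/etc", marker="backupbyehcp"):
    """Return (name, mtime) pairs for directories under root whose name
    contains marker (case-insensitive), sorted oldest first."""
    found = []
    for entry in os.scandir(root):
        # Only real directories count; symlinks and plain files are ignored.
        if entry.is_dir(follow_symlinks=False) and marker in entry.name.lower():
            found.append((entry.name, entry.stat().st_mtime))
    return sorted(found, key=lambda item: item[1])
```

Comparing the modification times returned by `list_ehcp_backups()` with the timestamps in /var/log/ehcp.log may show which event triggers each new backup.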

Is it possible to invoke a crontab job once in the Apache PHP Slim framework?

I have a PHP script that runs as the Apache daemon user. It first removes any crontabs already installed by the daemon user and then adds the new crontab.
I want to invoke this only once, when my PHP Slim web application loads for the first time, ideally as part of the Apache startup.
It seems that putting it anywhere in the index.php file (either directly or via a require file) doesn't prevent the crontab script from being constantly reloaded. I've tried using $GLOBAL with an if statement, but I haven't been able to come up with a working solution. Surely there is something more rudimentary that lets me invoke a PHP script just once after Apache has been restarted?
Ideally I'd rather not modify the Apache startup script, since I want to keep everything within the Slim web application itself (for source control purposes).

Apache Following Old Symbolic Link After Updating The Link

I have Apache pointing to a symbolic link for a website. I'm using Capistrano to deploy the code, so each deploy updates the symbolic link to point to the new release.
After Capistrano updates the symbolic link to the new release directory, Apache still uses the previous release directory.
The weird thing is that it doesn't happen all the time. After a deploy, either everything works fine or Apache follows the previous link until I restart or reload Apache.
Any ideas?
This may be caused by Apache httpd holding a handle on the old symlink. As long as that is the case, the old data (symlink) is not thrown away on Unix-based filesystems.
It seems you are unable to predict when this is the case. Using lsof may help you check. Triggering a graceful restart of httpd could also help.
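The general effect can be reproduced outside Apache. The Python sketch below is only an illustration under assumptions, not a diagnosis of httpd itself: it opens a file through a symlink, repoints the symlink the way a deploy tool might, and then compares what the old handle sees with what a fresh path lookup sees. The directory names `release1`, `release2`, and `current`, and the function `demo_stale_handle`, are hypothetical.

```python
import os
import tempfile

def demo_stale_handle():
    """Return (content seen by a handle opened before the swap,
    content seen by a path lookup after the swap)."""
    with tempfile.TemporaryDirectory() as base:
        for release in ("release1", "release2"):
            os.mkdir(os.path.join(base, release))
            with open(os.path.join(base, release, "index.html"), "w") as f:
                f.write(release)

        current = os.path.join(base, "current")
        os.symlink(os.path.join(base, "release1"), current)

        # Open a file through the symlink, as a long-running process might.
        handle = open(os.path.join(current, "index.html"))

        # Repoint the symlink: create the new link, then rename it over the old one.
        tmp_link = os.path.join(base, "current.tmp")
        os.symlink(os.path.join(base, "release2"), tmp_link)
        os.replace(tmp_link, current)

        old_view = handle.read()  # handle was opened before the swap
        handle.close()
        with open(os.path.join(current, "index.html")) as f:
            new_view = f.read()   # path is resolved again after the swap
        return old_view, new_view
```

The handle opened before the swap keeps reading the old release, while a new lookup through `current` reaches the new one. A process only picks up the new target once it resolves the path again, which is what a restart or reload forces.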