bin/nutch inject crawl/crawldb urls not working - apache

I followed the tutorial on the NutchWiki to set up Nutch.
I downloaded the Nutch 2.x source and set all the configuration options.
The problem occurs as soon as I start to crawl.
When I run bin/nutch inject crawl/crawldb urls, I get this error message: Unrecognized arg urls
I followed all the steps in the tutorial: created the directories, edited the configuration files, and so on. I also noticed that there is no crawldb directory in apache-nutch-2.x/runtime/local/. Is it generated automatically, or do I need to create it manually?
Any help with this problem will be appreciated.

I ran into the same problem. The documentation seems to be outdated: it is written for 1.x.
For 2.x, I tried the following and it worked for me:
bin/nutch inject urls
Hope it helps.
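For reference, the difference between the two invocations can be sketched as below. This is a minimal sketch, not the tutorial's script: it assumes the commands are run from runtime/local, that the seed list lives in urls/seed.txt, and it uses example.com as a placeholder URL. The helper inject_cmd is hypothetical and only prints the command for each version line. In 1.x the injector takes a crawldb path plus a URL directory; in 2.x the crawl data goes to the storage backend configured through Gora, so only the URL directory is passed and no crawl/crawldb directory appears under runtime/local.

```shell
#!/usr/bin/env bash
# Sketch only: prepares a seed directory and prints the inject command
# for each Nutch version line. Directory and file names are assumptions.
set -eu

seed_dir="${SEED_DIR:-urls}"   # hypothetical seed directory
mkdir -p "$seed_dir"

# One URL per line; example.com is a placeholder.
printf '%s\n' 'http://example.com/' > "$seed_dir/seed.txt"

inject_cmd() {
  # $1 is the major version line: "1" or "2"
  case "$1" in
    1) echo "bin/nutch inject crawl/crawldb $seed_dir" ;;
    2) echo "bin/nutch inject $seed_dir" ;;
    *) echo "unknown Nutch version: $1" >&2; return 1 ;;
  esac
}

inject_cmd 1
inject_cmd 2
```

Copy the printed line that matches your installed version and run it from runtime/local.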

Related

Nutch 1.12 on Cygwin on Windows 7 - NullPointerException

I'm working to get Nutch running for the first time for a work project. At this time, the plan is to run Nutch from a single machine (Windows 7) to scrape content from a dozen or so web sites. Below is the command line output from Cygwin.
$ bin/nutch inject crawl/crawldb urls
Injector: starting at 2016-10-29 09:16:37
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:467)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:887)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:849)
at org.apache.hadoop.fs.FileSystem.createNewFile(FileSystem.java:1149)
at org.apache.nutch.util.LockUtil.createLockFile(LockUtil.java:58)
at org.apache.nutch.crawl.Injector.inject(Injector.java:357)
at org.apache.nutch.crawl.Injector.run(Injector.java:467)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.Injector.main(Injector.java:441)
Looking through the source, here are lines 440 thru 443 of org.apache.nutch.crawl.Injector:
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(NutchConfiguration.create(), new Injector(), args);
System.exit(res);
}
It's not clear whether it is NutchConfiguration.create() or new Injector() that is failing there. I set up my installation from the tutorial on the Nutch site. I put a list of 3 URLs, 1 per line, in the file ./urls/seed.txt, and edited ./conf/nutch-site.xml.
Any suggestions for investigating or debugging this would be appreciated.
Thank you!
OK, after some struggling, here are the final steps to get Hadoop working with Cygwin on Windows.
Download the versions of winutils.exe and hadoop.dll that match your Hadoop version from https://github.com/cdarlint/winutils and place them in a folder named bin.
Set HADOOP_HOME to the parent directory of that bin folder. (For example, if the two files were downloaded to D:\winutil\bin, then HADOOP_HOME = D:\winutil.)
Make sure to add D:\winutil\bin to the Windows PATH variable. This step is now required (it was not a while back).
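The steps above can be sketched for a single Cygwin session as follows. This is only an illustration: D:\winutil is the example location from the steps, the /cygdrive/d mapping for drive D: is an assumption about a default Cygwin setup, and the exports apply to the current shell only rather than to the Windows environment as a whole.

```shell
#!/usr/bin/env bash
# Sketch for a Cygwin shell; D:\winutil is the example location used above.
export HADOOP_HOME='D:\winutil'

# Assumption: Cygwin exposes drive D: as /cygdrive/d.
export PATH="$PATH:/cygdrive/d/winutil/bin"

# Report whether the two downloaded files are in the expected folder.
for f in winutils.exe hadoop.dll; do
  if [ -f "/cygdrive/d/winutil/bin/$f" ]; then
    echo "found $f"
  else
    echo "missing $f"
  fi
done
```

If either file is reported as missing, check the download location before running bin/nutch inject again.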
I had the same issue. I solved it by setting up Hadoop on the machine and putting winutils.exe in %HADOOP%/bin.
After that you will get a java.lang.UnsatisfiedLinkError. To solve it, open the nutch file in %NUTCH_HOME%/runtime/local/bin and comment out the lines below:
if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
NUTCH_OPTS=("${NUTCH_OPTS[@]}" -Djava.library.path="$JAVA_LIBRARY_PATH")
fi

Collectd : Could not find plugin "rrdtool" in /opt/collectd/lib/collectd

This is the first time I am working with collectd.
I have performed the following steps:
Downloaded https://collectd.org/files/collectd-5.5.2.tar.gz
Extracted the tar.
executed configure
executed make all install
changed the collectd.conf in /opt/collectd/etc/collectd.conf
uncommented the necessary plugin and made changes to file paths.
I have used the following link.
I am getting the above error when I try to run collectd.
However when I use csv plugin it works correctly.
As far as I understand, rrdtool is necessary in order to visualize the data, which is why I need it.
Is there any alternative to rrdtool for viewing the data in my browser, or any other tool or plugin I can use to visualize my CSV data?
This is what I have figured out after running configure:
Thank you
You have to find the missing library in that same output.
Dependencies:
collectd(x86-64) = 5.6.0-1.sdl7
libc.so.6(GLIBC_2.14)(64bit)
libdl.so.2()(64bit)
librrd_th.so.4()(64bit)
rtld(GNU_HASH)
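A quick way to confirm this is to check whether the plugin file itself exists. The sketch below assumes the plugin directory from the error message (/opt/collectd/lib/collectd) and relies on collectd loading each plugin as a <name>.so file from that directory; check_plugin is a hypothetical helper. If rrdtool.so is missing while csv.so is present, a likely cause (an assumption, not confirmed in this thread) is that configure did not find the librrd development files and skipped the plugin. Installing them (the package name varies by distribution, for example rrdtool-devel or librrd-dev) and re-running configure and make would be the next thing to try.

```shell
#!/usr/bin/env bash
# Sketch: check whether collectd plugins were built and installed.
# The default directory is the one named in the error message.
plugin_dir="${PLUGIN_DIR:-/opt/collectd/lib/collectd}"

check_plugin() {
  # $1 = plugin name; collectd looks for <name>.so in the plugin directory
  if [ -e "$plugin_dir/$1.so" ]; then
    echo "$1: installed"
  else
    echo "$1: not installed"
  fi
}

check_plugin rrdtool
check_plugin csv
```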

Nutch 2.3 not able to parse a URL which is correctly parsed through parsechecker

I am experimenting with Apache Nutch 2.3.
When I ran the parse command for the URL http://comptuergodzilla.blogspot.com, Nutch parsed the content correctly: I got all the outlinks and content in the ol and p column families respectively.
But when I did the same for the URL http://goal.com/en-india, it was not able to parse the site's outlinks and content.
What makes me scratch my head is that when I run the parsechecker command for the URL http://www.goal.com/en-india, I get all the parsed content and outlinks.
My questions are:
i. Why is the parse command not working? It should work if parsechecker parses the URL correctly.
ii. Do I have to build a separate HTMLParser plugin to achieve this?

Nutch 2.x: no errors, but no results either

I've been playing with Nutch 2.x for a while and have it set up according to the Nutch 2.x tutorial, as advised in this post. Still, I can't figure it out, and any help would be greatly appreciated.
When I use the inject command as per the tutorial, it injects the 2 URLs I have in seed.txt:
nutch inject ../local/urls/seed.txt
But when I run the crawl script, it doesn't visit any of the URLs:
bin/crawl ../local/urls/seed.txt TestCrawl *ttp://l*calhost:8983/solr 2
I've now started again with a completely new install of Nutch 2.2.1, HBase 0.94.10 and Solr 4.4.0, as advised by someone on the mailing list, because the versions mentioned in the tutorial are years old. The error I'm getting now is:
[root@localhost local]# bin/nutch inject /urls/seed.txt
InjectorJob: starting at 2013-08-11 17:59:32
InjectorJob: Injecting urlDir: /urls/seed.txt
InjectorJob: org.apache.gora.util.GoraException: java.lang.RuntimeException: java.lang.IllegalArgumentException: Not a host:port pair: �2249@localhost.localdomainlocalhost,45431,1376235201648
Although this is an old question, I have a suggestion.
Because Nutch is an Apache project, it obeys robots.txt, and that may be why you got nothing. You can edit src/java/org/apache/nutch/fetcher/FetcherReducer.java and comment out the following block, as shown:
/*if (!rules.isAllowed(fit.u.toString())) {
// unblock
fetchQueues.finishFetchItem(fit, true);
if (LOG.isDebugEnabled()) {
LOG.debug("Denied by robots.txt: " + fit.url);
}
output(fit, null, ProtocolStatusUtils.STATUS_ROBOTS_DENIED,
CrawlStatus.STATUS_GONE);
continue;
}
*/

enchant_broker_init() performance issue

I have a web site which uses enchant_broker_init().
I'm not sure why, but with enchant_broker_init() the page takes about 19 seconds to load.
Once I remove this function, the page loads right away.
Does anyone have an idea why this is, or how I can debug it?
Thanks
Steps to install the PHP enchant plugin:
APACHE
http://apache.mivzakim.net//httpd/binaries/win32/#warnings
PHP
http://windows.php.net/download/
ENABLE PHP ON APACHE
http://php.net/manual/en/install.windows.apache2.php
Enchant will not work using WAMP (at least for me it hasn't).
Afterwards, add the following path
share\myspell\dicts
to your PHP directory.
It should look like this:
C:\PHP\share\myspell\dicts
Put the dictionary files there.
.aff
.dic