Nutch 1.12 on Cygwin on Windows 7 - NullPointerException

I'm working to get Nutch running for the first time for a work project. For now, the plan is to run Nutch on a single machine (Windows 7) to scrape content from a dozen or so websites. Below is the command-line output from Cygwin.
$ bin/nutch inject crawl/crawldb urls
Injector: starting at 2016-10-29 09:16:37
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:467)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:887)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:849)
at org.apache.hadoop.fs.FileSystem.createNewFile(FileSystem.java:1149)
at org.apache.nutch.util.LockUtil.createLockFile(LockUtil.java:58)
at org.apache.nutch.crawl.Injector.inject(Injector.java:357)
at org.apache.nutch.crawl.Injector.run(Injector.java:467)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.Injector.main(Injector.java:441)
Looking through the source, here are lines 440 thru 443 of org.apache.nutch.crawl.Injector:
public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(NutchConfiguration.create(), new Injector(), args);
    System.exit(res);
}
It's not clear whether NutchConfiguration.create() or new Injector() is failing there. I set up my installation following the tutorial on the Nutch site. I put a list of 3 URLs, one per line, in the file ./urls/seed.txt, and edited ./conf/nutch-site.xml.
Any suggestions for investigating or debugging this would be appreciated.
Thank you!
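A note on reading the trace: its topmost frames are java.lang.ProcessBuilder.start and org.apache.hadoop.util.Shell.runCommand, reached through LockUtil.createLockFile, so the exception is raised while Hadoop shells out to set file permissions, well below Injector.main. ProcessBuilder.start throws NullPointerException when any element of the command is null. The sketch below reproduces only that behavior. That the null element is Hadoop's path to winutils.exe is an assumption, and NullCommandDemo is a hypothetical name.

```java
import java.io.IOException;
import java.util.Arrays;

// Hypothetical demo class; not part of Nutch or Hadoop.
final class NullCommandDemo {

    // Returns true when ProcessBuilder.start() rejects the command with a
    // NullPointerException, which happens if any command element is null.
    static boolean startThrowsNpe(String... command) {
        try {
            new ProcessBuilder(Arrays.asList(command)).start();
            return false;
        } catch (NullPointerException e) {
            return true;
        } catch (IOException e) {
            // The program could not be launched: a different failure mode.
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: the helper executable path was never resolved, so it is null.
        String winutils = null;
        System.out.println(startThrowsNpe(winutils, "chmod", "0644", "crawl/crawldb/.locked"));
    }
}
```

The null check runs before any process is launched, so the demo does not execute chmod.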

OK, after some struggling, here are the final steps to get Hadoop working with Cygwin on Windows:
Download the versions of winutils.exe and hadoop.dll that match your Hadoop version from https://github.com/cdarlint/winutils into a folder named bin.
Set HADOOP_HOME to the parent directory of that bin folder. (For example, if the two files were downloaded to D:\winutil\bin, then HADOOP_HOME = D:\winutil.)
Make sure to add D:\winutil\bin to the Windows PATH variable. This step is important now (it was not a while back).
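The layout described in these steps can be checked before running Nutch again. This is a minimal sketch, not an official tool: WinutilsLayoutCheck and missingFiles are hypothetical names, and the only thing it verifies is that winutils.exe and hadoop.dll exist under the bin folder of whatever HADOOP_HOME points to.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the class and method names are illustrative only.
final class WinutilsLayoutCheck {

    // Lists which of the two native files are missing from <hadoopHome>/bin.
    static List<String> missingFiles(Path hadoopHome) {
        List<String> missing = new ArrayList<>();
        for (String name : new String[] {"winutils.exe", "hadoop.dll"}) {
            if (!Files.isRegularFile(hadoopHome.resolve("bin").resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String home = System.getenv("HADOOP_HOME");
        if (home == null || home.isEmpty()) {
            System.out.println("HADOOP_HOME is not set");
            return;
        }
        System.out.println("Missing under " + home + ": " + missingFiles(Paths.get(home)));
    }
}
```

An empty list means both files are where the steps say they should be. It does not confirm that they match the Hadoop version in use.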

I had the same issue. I solved it by setting up Hadoop on the machine and placing winutils.exe in %HADOOP%/bin.
You will then get a java.lang.UnsatisfiedLinkError. To solve that, open the nutch script in %NUTCH_HOME%/runtime/local/bin and comment out the lines below:
if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
  NUTCH_OPTS=("${NUTCH_OPTS[@]}" -Djava.library.path="$JAVA_LIBRARY_PATH")
fi

Nutch 1.4 with Solr 3.4 - can't crawl URL, "no URLs to fetch"

I followed a tutorial for web crawling with Nutch using Cygwin, Tomcat, Nutch 1.4 and Solr 3.4. I was able to crawl a URL once, but somehow this doesn't work anymore, no matter which URL I try.
My regex-urlfilter.txt in runtime/local/conf is as follows:
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
+^http://([a-z0-9]*\.)*nutch.apache.org/
The only URL in my seed.txt in runtime/local/bin/urls is http://nutch.apache.org/.
For crawling I use the command
$ ./nutch crawl urls -dir newCrawl3 -solr http://localhost:8080/solr/ -depth 2 -topN 3
Console output is:
cygpath: can't convert empty path
crawl started in: newCrawl3
rootUrlDir = urls
threads = 10
depth = 2
solrUrl=http://localhost:8080/solr/
topN = 3
Injector: starting at 2017-05-18 17:03:25
Injector: crawlDb: newCrawl3/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2017-05-18 17:03:28, elapsed: 00:00:02
Generator: starting at 2017-05-18 17:03:28
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 3
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: newCrawl3
I know there are a few similar questions, but most of them are not resolved. Can anyone help?
Thank you very much in advance!
Why are you using such an old version of Nutch? Nevertheless, the problem you're facing is the space at the beginning of this line:
_+^http://([a-z0-9]*\.)*nutch.apache.org/
(I've highlighted the space with an underscore.) Every line that starts with a space, \n, or # is ignored by the configuration parser. Take a look at:
https://github.com/apache/nutch/blob/master/src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexURLFilterBase.java#L258-L269
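The effect of that leading space can be reproduced with a simplified stand-in for the rule parser. This sketch is not Nutch's code: RuleFileSketch is a hypothetical class, and it assumes that the first matching rule decides and that a URL matching no rule is rejected. With the accept line skipped, nothing is left to accept the seed URL.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Simplified stand-in for a regex URL filter; not Nutch's actual class.
final class RuleFileSketch {

    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    // Parses rule lines; empty lines and lines starting with ' ', '\n' or '#' are skipped.
    static List<Rule> parse(List<String> lines) {
        List<Rule> rules = new ArrayList<>();
        for (String line : lines) {
            if (line.isEmpty()) continue;
            char first = line.charAt(0);
            if (first == ' ' || first == '\n' || first == '#') continue;
            if (first != '+' && first != '-') {
                throw new IllegalArgumentException("Invalid first character: " + line);
            }
            rules.add(new Rule(first == '+', Pattern.compile(line.substring(1))));
        }
        return rules;
    }

    // Assumption: the first matching rule decides, and no match means rejection.
    static boolean accepts(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) return rule.accept;
        }
        return false;
    }

    public static void main(String[] args) {
        String url = "http://nutch.apache.org/";
        List<String> withSpace = Arrays.asList(
                "-^(file|ftp|mailto):", " +^http://([a-z0-9]*\\.)*nutch.apache.org/");
        List<String> fixed = Arrays.asList(
                "-^(file|ftp|mailto):", "+^http://([a-z0-9]*\\.)*nutch.apache.org/");
        System.out.println("with leading space: " + accepts(parse(withSpace), url));
        System.out.println("without it: " + accepts(parse(fixed), url));
    }
}
```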
You can try deleting the directory newCrawl3. Nutch will not crawl a URL again if it was crawled recently.

bin/nutch inject crawl/crawldb urls not working

I just followed the tutorial to setup Nutch from NutchWiki.
Downloaded Nutch 2.x src and set all configurations.
The problem occurs when I just started to crawl.
When I run the command bin/nutch inject crawl/crawldb urls, I get this error message: Unrecognized arg urls
I followed all the steps in the tutorial: created directories, made changes to configuration files, etc. I also have a question: there is no crawldb directory in apache-nutch-2.x/runtime/local/. Is it generated automatically, or do I need to create it manually?
Any help with this problem will be appreciated.
I was running into the same problem. The documentation seems to be outdated; it is for 1.x.
For 2.x I tried the following and it worked for me:
bin/nutch inject urls
Hope it helps.

Nutch 2.x No errors, No results neither

I've been playing with Nutch 2.x for a while and have it set up according to the Nutch 2.x tutorial, as advised in this post. Still, I can't figure it out; any help would be greatly appreciated.
When using the inject command as per the tutorial, it injects the 2 URLs I have in seed.txt:
nutch inject ../local/urls/seed.txt
but when running the script it doesn't visit any of the urls:
bin/crawl ../local/urls/seed.txt TestCrawl *ttp://l*calhost:8983/solr 2
I've now started again with a completely new install of Nutch 2.2.1, HBase 0.94.10 and Solr 4.4.0, as advised by someone on the mailing list, because the versions mentioned in the tutorial are years old. Now the error I'm getting is:
[root@localhost local]# bin/nutch inject /urls/seed.txt
InjectorJob: starting at 2013-08-11 17:59:32
InjectorJob: Injecting urlDir: /urls/seed.txt
InjectorJob: org.apache.gora.util.GoraException: java.lang.RuntimeException: java.lang.IllegalArgumentException: Not a host:port pair: �2249@localhost.localdomainlocalhost,45431,1376235201648
Although this is an old question, I have a suggestion.
Because Nutch is an Apache project, it obeys robots.txt; perhaps that is why you got nothing. You can edit src/java/org/apache/nutch/fetcher/FetcherReducer.java and comment out the robots.txt check, like this:
/*if (!rules.isAllowed(fit.u.toString())) {
    // unblock
    fetchQueues.finishFetchItem(fit, true);
    if (LOG.isDebugEnabled()) {
        LOG.debug("Denied by robots.txt: " + fit.url);
    }
    output(fit, null, ProtocolStatusUtils.STATUS_ROBOTS_DENIED,
            CrawlStatus.STATUS_GONE);
    continue;
}
*/
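Whether robots.txt is really the cause can be checked by hand before touching the fetcher. This is a minimal sketch, assuming only the "User-agent: *" group and plain prefix matching on Disallow. Real parsers, including whatever Nutch uses, handle more cases. RobotsSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Simplified robots.txt check; hypothetical helper, not Nutch's parser.
final class RobotsSketch {

    // Collects Disallow prefixes from the "User-agent: *" group only.
    static List<String> disallowedPrefixes(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean inStarGroup = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw.trim();
            int colon = line.indexOf(':');
            if (line.startsWith("#") || colon < 0) continue;
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                inStarGroup = value.equals("*");
            } else if (field.equals("disallow") && inStarGroup && !value.isEmpty()) {
                prefixes.add(value);
            }
        }
        return prefixes;
    }

    // A path is allowed unless it starts with one of the collected prefixes.
    static boolean isAllowed(String robotsTxt, String path) {
        for (String prefix : disallowedPrefixes(robotsTxt)) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /private/\n";
        System.out.println(isAllowed(robots, "/private/page.html"));
        System.out.println(isAllowed(robots, "/index.html"));
    }
}
```

If the seed URLs' paths come out as allowed for the site's actual robots.txt, the empty crawl probably has a different cause.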

Fatal error running unit test in Yii app

This question is specific to my situation and not resolved (yet). But if you are experiencing this problem, the troubleshooting steps may give you a good starting point.
I want to run a unit test in a Yii web application on localhost, which is running via WampServer 2.1 on Windows 7.
<?php
class LittleTest extends CTestCase
{
    public function testApprove()
    {
        $value1 = "1";
        $this->assertEquals($value1, $value1);
    }
}
?>
I receive a fatal error when I try to run the test. Here is how I run it, on the Windows command line:
C:\wamp\www\app\protected\tests>
C:\wamp\www\app\protected\tests>cd unit
C:\wamp\www\app\protected\tests\unit>phpunit LittleTest.php
I receive (along with some stack trace lines):
PHP Fatal error:
class 'CTestCase' not found in [path to file]\LittleTest.php on line 4
Troubleshooting steps to this point:
The app runs. The default index page of the app looks good and I have used the gii tool to create a model class.
From command line, I can see php and phpunit are available (and I've been over my pear install to make sure it's all good):
C:\wamp\www\app\protected\tests>
C:\wamp\www\app\protected\tests>phpunit --version
PHPUnit 3.7.13 by Sebastian Bergmann.
C:\wamp\www\app\protected\tests>
C:\wamp\www\app\protected\tests>php --version
PHP 5.3.5 (cli)
... etc
display_errors is turned on. display_startup_errors is turned on.
I tried renaming the class so that name did not match the document name:
class LittleTestTweak extends CTestCase
I'm not sure of the precise command that runs the test, so I have tried variants like:
php LittleTest.php
Also I've tried running it various places in the folder structure. Here is the immediate structure:
/tests
| bootstrap.php
| my_tree.txt
| phpunit.xml
| WebTestCase.php
|
|---- /fixtures
|---- /functional
| SiteTest.php
|
|---- /report
`---- /unit
LittleTest.php
I also checked my php.ini for the path to PEAR; as far as I can tell, it's correct (but how can I test it?):
include_path=".;C:\wamp\bin\php\php5.3.5\PEAR;C:\wamp\www\app
More Info
In response to this:
cd wamp\www\app\protected\tests
phpunit unit\LittleTest.php
I receive this:
Warning: require_once(PHPUnit/Extensions/SeleniumTestCase.php):
failed to open stream: No such file or directory in
C:\wamp\www\yii\framework\test\CWebTestCase.php on line 12
Call Stack:
0.0007 339624 1. {main}() C:\wamp\bin\php\php5.3.5\phpunit:0
0.0164 698440 2. PHPUnit_TextUI_Command::main()
C:\wamp\bin\php\php5.3.5\phpunit:46
0.0164 698856 3. PHPUnit_TextUI_Command->run()
C:\wamp\bin\php\php5.3.5\PEAR\PHPUnit\TextUI\Command.php:129
0.0164 698856 4. PHPUnit_TextUI_Command->handleArguments()
C:\wamp\bin\php\php5.3.5\PEAR\PHPUnit\TextUI\Command.php:138
0.0289 1220944 5. PHPUnit_TextUI_Command->handleBootstrap()
C:\wamp\bin\php\php5.3.5\PEAR\PHPUnit\TextUI\Command.php:606
0.0300 1233328 6. PHPUnit_Util_Fileloader::checkAndLoad()
C:\wamp\bin\php\php5.3.5\PEAR\PHPUnit\TextUI\Command.php:778
0.0330 1233424 7. PHPUnit_Util_Fileloader::load()
C:\wamp\bin\php\php5.3.5\PEAR\PHPUnit\Util\Fileloader.php:76
0.0334 1238096 8. include_once
('C:\wamp\www\app\protected\tests\bootstrap.php')
C:\wamp\bin\php\php5.3.5\PEAR\PHPUnit\Util\Fileloader.php:92
0.0412 1520256 9. require_once
('C:\wamp\www\app\protected\tests\WebTestCase.php')
C:\wamp\www\app\protected\tests\bootstrap.php:8
0.0413 1520520 10. YiiBase::autoload()
C:\wamp\www\yii\framework\YiiBase.php:0
0.0423 1543904 11.
include('C:\wamp\www\yii\framework\test\CWebTestCase.php')
C:\wamp\www\yii\framework\YiiBase.php:395
Fatal error: require_once(): Failed opening required
'PHPUnit/Extensions/SeleniumTestCase.php'
(include_path='.;C:\wamp\bin\php\php5.3.5\PEAR\pear;C:\wamp\bin\php\php5.3.5\pear')
in C:\wamp\www\yii\framework\test\CWebTestCase.php on line 12
Call Stack:
0.0007 339624 1. {main}() C:\wamp\bin\php\php5.3.5\phpunit:0
0.0164 698440 2. PHPUnit_TextUI_Command::main()
...et cetera...
The failed requirement is PHPUnit/Extensions/SeleniumTestCase.php. I wonder if the issue is that PHPUnit is installed locally under C:\wamp.
I opened my php.ini and added to include_path: C:\wamp\bin\php\php5.3.5\PEAR\PHPUnit\Extensions. Restarted wamp. No change in error reporting.
RESOLUTION
My security setup uses DansGuardian, and I had neglected to loosen the settings for banned extension types, which block file downloads. In fact I don't care to ban any types, and modifying that file allows everything to work. Whoops, that's my Linux setup. PHPUnit is working there, and it is working on WAMP also. Recreating the steps on WAMP is impossible, but I do know I had to open cmd.exe as administrator and run pear update-channels, pear upgrade-all, etc. I also had to clear pear's cache at one point, and I had to overcome an issue with curl recognition to install Selenium:
pear install --force phpunit/PHPUnit_Selenium
To run unit tests in Yii with phpunit, you'll need to let phpunit load the protected/tests/bootstrap.php file which basically sets up a configuration, and autoloads the required classes (mainly pertaining to testing). The bootstrap.php file loads yiit.php which actually autoloads the required classes.
Now we can load all this configuration either by command line options when running phpunit, or let the configuration be read automatically through the protected/tests/phpunit.xml file.
For the latter method, the directory from which phpunit is invoked must contain the phpunit.xml file; in the default Yii webapp, this directory is protected/tests. Therefore you need to do the following to run your tests:
cd wamp\www\app\protected\tests
phpunit unit\LittleTest.php

Hadoop configurations seem not to be read

Every time when I try to start my mapreduce application (in standalone Hadoop), it tries to put stuff in the tmp directory, which it can't:
Exception in thread "main" java.io.IOException: Failed to set permissions of path: \tmp\hadoop-username\mapred\staging\username-1524148556\.staging to 0700
It tries to use an invalid path (slashes should be the other way around for Cygwin).
I set hadoop.tmp.dir in core-site.xml (in the conf folder of Hadoop), but it seems that the config file is never read (if I put syntax errors in the file, it makes no difference). I added:
--config /home/username/hadoop-1.0.1/conf
To the command, but no difference. I also tried:
export HADOOP_CONF_DIR=/home/username/hadoop-1.0.1/conf
but that does not seem to have an effect either.
Any pointers on why the configs would not be read, or what else I am failing to see here?
Thanks!
It's not that the slashes are inverted; it's that /tmp is a Cygwin path which actually maps to /cygwin/tmp or c:\cygwin\tmp. Since Hadoop is Java and doesn't know about Cygwin mappings, it takes /tmp to mean c:\tmp.
There's an awful lot of stuff to patch if you want to get 1.0.1 running on Cygwin.
See: http://en.wikisource.org/wiki/User:Fkorning/Code/Hadoop-on-Cygwin
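The two readings of /tmp can be put side by side. This is a sketch that only manipulates strings, so it runs on any OS. CygwinPathSketch and its methods are hypothetical, the Cygwin install root is passed in as an assumption, and mount points such as /cygdrive are ignored.

```java
// Hypothetical helpers; they only build strings and never touch the file system.
final class CygwinPathSketch {

    // How Cygwin resolves a POSIX path, assuming the given install root.
    static String asCygwinSeesIt(String posixPath, String cygwinRoot) {
        return cygwinRoot + posixPath.replace('/', '\\');
    }

    // How a Windows JVM reads the same path: relative to the root of the current drive.
    static String asJavaSeesIt(String posixPath, String currentDrive) {
        return currentDrive + ":" + posixPath.replace('/', '\\');
    }

    public static void main(String[] args) {
        System.out.println(asCygwinSeesIt("/tmp", "c:\\cygwin"));
        System.out.println(asJavaSeesIt("/tmp", "c"));
    }
}
```

The mismatch between the two results is why a directory that exists and is writable from the Cygwin shell may still be the wrong one from Hadoop's point of view.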
I found the following link useful. It seems that the problem persists in newer versions of Hadoop: I'm using version 1.0.4 and I'm still facing it.
http://comments.gmane.org/gmane.comp.jakarta.lucene.hadoop.user/25837
UPDATED: In Mahout 0.7, for those who use the example from the "Mahout in Action" book, you should change the example code as follows:
File outFile = new File("output");
if (!outFile.exists()) {
    outFile.mkdir();
}
Path output = new Path("output");
HadoopUtil.delete(conf, output);
KMeansDriver.run(conf, new Path("testdata/points"), new Path("testdata/clusters"),
        output, new EuclideanDistanceMeasure(), 0.001, 10,
        true, 0.1, true);