Nutch 1.4 with Solr 3.4 - can't crawl URL, "no URLs to fetch"

I followed a tutorial for web crawling with Nutch using Cygwin, Tomcat, Nutch 1.4, and Solr 3.4. I was already able to crawl a URL once, but somehow this doesn't work anymore, no matter which URL I try.
My regex-urlfilter.txt in runtime/local/conf is as follows:
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!#=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
+^http://([a-z0-9]*\.)*nutch.apache.org/
The only URL in my seed.txt (in runtime/local/bin/urls) is http://nutch.apache.org/.
For crawling I use the command:
$ ./nutch crawl urls -dir newCrawl3 -solr http://localhost:8080/solr/ -depth 2 -topN 3
Console output is:
cygpath: can't convert empty path
crawl started in: newCrawl3
rootUrlDir = urls
threads = 10
depth = 2
solrUrl=http://localhost:8080/solr/
topN = 3
Injector: starting at 2017-05-18 17:03:25
Injector: crawlDb: newCrawl3/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2017-05-18 17:03:28, elapsed: 00:00:02
Generator: starting at 2017-05-18 17:03:28
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 3
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: newCrawl3
I know there are a few similar questions, but most of them are not resolved. Can anyone help?
Thank you very much in advance!

Why use a Nutch version that is really, really old? Nevertheless, the problem you're facing is the space at the beginning of this line:
_+^http://([a-z0-9]*\.)*nutch.apache.org/
(I've highlighted the space with an underscore.) Every line that starts with a space, \n, or # is ignored by the configuration parser; take a look at:
https://github.com/apache/nutch/blob/master/src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexURLFilterBase.java#L258-L269
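With the leading space removed, the accept rule in regex-urlfilter.txt should read:
+^http://([a-z0-9]*\.)*nutch.apache.org/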

You can also try deleting the newCrawl3 directory. Nutch will not crawl a URL again when it has been crawled recently.
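For example (reusing the crawl command from the question):
$ rm -r newCrawl3
$ ./nutch crawl urls -dir newCrawl3 -solr http://localhost:8080/solr/ -depth 2 -topN 3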

Related

Nutch 1.12 on Cygwin on Windows 7 - NullPointerException

I'm working on getting Nutch running for the first time for a work project. At this time, the plan is to run Nutch from a single machine (Windows 7) to scrape content from a dozen or so web sites. Below is the command-line output from Cygwin.
$ bin/nutch inject crawl/crawldb urls
Injector: starting at 2016-10-29 09:16:37
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:467)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:887)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:849)
at org.apache.hadoop.fs.FileSystem.createNewFile(FileSystem.java:1149)
at org.apache.nutch.util.LockUtil.createLockFile(LockUtil.java:58)
at org.apache.nutch.crawl.Injector.inject(Injector.java:357)
at org.apache.nutch.crawl.Injector.run(Injector.java:467)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.Injector.main(Injector.java:441)
Looking through the source, here are lines 440 through 443 of org.apache.nutch.crawl.Injector:
public static void main(String[] args) throws Exception {
  int res = ToolRunner.run(NutchConfiguration.create(), new Injector(), args);
  System.exit(res);
}
It's not clear whether it is NutchConfiguration.create() or new Injector() that is failing there. I set up my installation from the tutorial on the Nutch site. I put a list of 3 URLs, one per line, in the file ./urls/seed.txt, and edited ./conf/nutch-site.xml.
Any suggestions for investigation/debugging this would be appreciated.
Thank you!
OK, after struggling somewhat, here are the final steps to get Hadoop working with Cygwin/Windows.
Download the right version of winutils.exe and hadoop.dll into a folder named bin from https://github.com/cdarlint/winutils, based on your Hadoop version.
Set HADOOP_HOME to the directory containing that bin folder. (Note: if the two files above are downloaded to D:\winutil\bin, then HADOOP_HOME = D:\winutil.)
Make sure to add D:\winutil\bin to the Windows PATH variable. This step is important now (it was not a while back).
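In a Cygwin shell the environment setup would look roughly like this (a sketch; D:\winutil is the example path from the steps above):
$ export HADOOP_HOME='D:\winutil'
$ export PATH="$PATH:/cygdrive/d/winutil/bin"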
I had the same issue. I solved it by setting up Hadoop on the machine and including winutils.exe in %HADOOP_HOME%/bin.
Then you will get a java.lang.UnsatisfiedLinkError. To solve that, open the nutch script in %NUTCH_HOME%/runtime/local/bin and comment out the lines below:
if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
NUTCH_OPTS=("${NUTCH_OPTS[#]}" -Djava.library.path="$JAVA_LIBRARY_PATH")
fi
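After commenting those lines out, the block looks like this:
# if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
#   NUTCH_OPTS=("${NUTCH_OPTS[@]}" -Djava.library.path="$JAVA_LIBRARY_PATH")
# fi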

bin/nutch inject crawl/crawldb urls not working

I just followed the tutorial on the NutchWiki to set up Nutch.
I downloaded the Nutch 2.x source and set all the configurations.
The problem occurs when I start to crawl.
When I run the command bin/nutch inject crawl/crawldb urls, I get an error message like this: Unrecognized arg urls
I followed all the steps in the tutorial, created directories, made changes to the configuration files, etc. I also have a question: there is no crawldb directory in apache-nutch-2.x/runtime/local/. Is it generated automatically, or do I need to create it manually?
Any help with this problem will be appreciated.
I was running into the same problem. The documentation seems to be outdated; it is for 1.x.
For 2.x I tried the following, and it worked for me:
bin/nutch inject urls
Hope it helps.
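For background: in 2.x there is no crawldb directory at all; crawl data lives in the Gora-backed datastore (e.g., HBase), which is why inject takes only the seed directory. A typical 2.x cycle looks roughly like this (a sketch; exact flags vary slightly between 2.x releases):
$ bin/nutch inject urls
$ bin/nutch generate -topN 50
$ bin/nutch fetch -all
$ bin/nutch parse -all
$ bin/nutch updatedb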

Scrapy Downloading Files Error

I'm using the files pipeline in Scrapy to download subtitle files off of http://opensubtitles.org.
I've got a list of all the http://dl.opensubtitles.org links, and my spider follows these links and sends the urls to the pipeline.
It works to start, and I can download the first ~100 files without any issue.
However, around that point the links seem to start producing this error:
2016-06-09 11:44:02 [scrapy] WARNING: File (code: 301): Error downloading file from <GET http://dl.opensubtitles.org/en/download/vrf-108d030f/sub/24617> referred in
Does it have something to do with my code?
These are in my settings:
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = 'C:/Users/Rohan/Documents/Fitroom/subtitles/subFiles'
This is my pipeline:
class SubtitlesPipeline(object):
    def process_item(self, item, spider):
        return item
Thanks!
This error may be occurring due to a download timeout, because some files may be bigger in size. Increase the download timeout.
Try this in your settings.py file:
DOWNLOAD_TIMEOUT = 500
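Since the warning shows code 301 (a redirect), it may also be worth letting the files pipeline follow redirects; by default the media pipelines treat a redirect as a failed download. In newer Scrapy versions (1.4+) that is a settings.py flag:
MEDIA_ALLOW_REDIRECTS = True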

Nutch 2.3 not able to parse a URL which is correctly parsed through parsechecker

I am experimenting with Apache Nutch 2.3.
When I ran the parse command for the URL http://comptuergodzilla.blogspot.com, Nutch parsed the content correctly: I get all the outlinks and content in the ol and p column families, respectively.
But when I did the same for the URL http://goal.com/en-india, it was not able to parse the site's outlinks and content.
What makes me scratch my head is that after running the parsechecker command for the URL http://www.goal.com/en-india, I get all the parsed content and outlinks.
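For reference, the two invocations being compared look roughly like this in 2.3 (a sketch; <batchId> stands for the id produced by the fetch step):
$ bin/nutch parse <batchId>
$ bin/nutch parsechecker http://www.goal.com/en-india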
Regarding the above, my questions are:
i. Why is the parse command not working? It should work if parsechecker parses the URL correctly.
ii. Do I have to build a separate HTMLParser plugin to achieve the above?

Nutch 2.x - no errors, but no results either

I've been playing with Nutch 2.x for a while and have it set up according to the Nutch 2.x tutorial as advised in this post; still I can't figure it out - any help would be greatly appreciated.
When using the inject command as per the tutorial, it injects the 2 URLs I have in seeds.txt:
nutch inject ../local/urls/seed.txt
but when running the script it doesn't visit any of the URLs:
bin/crawl ../local/urls/seed.txt TestCrawl http://localhost:8983/solr 2
I've now started again with a completely new install of Nutch 2.2.1, HBase 0.94.10, and Solr 4.4.0, as advised by someone on the mailing list (since the versions mentioned in the tutorial are years old), and now the error I'm getting is:
[root@localhost local]# bin/nutch inject /urls/seed.txt
InjectorJob: starting at 2013-08-11 17:59:32
InjectorJob: Injecting urlDir: /urls/seed.txt
InjectorJob: org.apache.gora.util.GoraException: java.lang.RuntimeException: java.lang.IllegalArgumentException: Not a host:port pair: �2249#localhost.localdomainlocalhost,45431,1376235201648
Although this is an old question, I have a suggestion here.
Because Nutch is an Apache project, it obeys robots.txt; perhaps that is why you got nothing. You can edit src/java/org/apache/nutch/fetcher/FetcherReducer.java to comment out the robots.txt check, as shown below:
/*
if (!rules.isAllowed(fit.u.toString())) {
  // unblock
  fetchQueues.finishFetchItem(fit, true);
  if (LOG.isDebugEnabled()) {
    LOG.debug("Denied by robots.txt: " + fit.url);
  }
  output(fit, null, ProtocolStatusUtils.STATUS_ROBOTS_DENIED,
      CrawlStatus.STATUS_GONE);
  continue;
}
*/
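If you go this route, the modified class has to be rebuilt before the change takes effect; with an ant-based source checkout that is roughly:
$ ant runtime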