Nutch not crawling entire website - apache

I am using Nutch 2.3.1.
I perform the following commands to crawl a site:
./nutch inject ../urls/seed.txt
./nutch generate -topN 2500
./nutch fetch -all
The problem is that Nutch only crawls the first URL (the one specified in seed.txt), and the resulting data is only the HTML of that first page.
All the other URLs that were accumulated by the generate command are never actually crawled.
I cannot get Nutch to crawl the other generated URLs, and I cannot get it to crawl the entire website. What options do I need to use to crawl an entire site?
Does anyone have any insights or recommendations?
Thank you so much for your help

In the case that Nutch crawls only the one specified URL, please check the Nutch URL filter (conf/regex-urlfilter.txt). To crawl all URLs discovered from the seed, the content of regex-urlfilter.txt should be as follows.
# accept all URLs
+.
See details here: http://wiki.apache.org/nutch/NutchTutorial
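One more thing worth checking, offered as a sketch of a typical Nutch 2.x crawl round rather than a definitive fix: newly discovered links only enter the database after a parse and updatedb step, so fetching beyond the seed usually means repeating a full round like this (the paths and -topN value are just the ones from the question):
./nutch inject ../urls/seed.txt
./nutch generate -topN 2500
./nutch fetch -all
./nutch parse -all
./nutch updatedb -all
Then repeat generate/fetch/parse/updatedb until the site is exhausted. Without parse and updatedb, the next generate has nothing new to hand to fetch.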
Hope this helps,
Le Quoc Do

Related

Apache Nutch 2.3.1 fetch specific MIME type documents

I have configured Apache Nutch 2.3.1 with the Hadoop/HBase ecosystem. I have to crawl specific documents, i.e., documents having textual content only. I found that regex-urlfilter.txt can exclude MIME types, but I could not find any option to specify the MIME types that I do want to crawl. The problem with the regex URL filter is that there can be many MIME types, and the list will grow over time, so it is very difficult to include them all. Is there any way I can instruct Nutch to fetch only text/html documents, for example?
The URL filters only work with the URL, which means you can only make assertions based on that. Since the URL filters are executed before the documents are fetched/parsed, there is no MIME type available yet that could be used to allow or block URLs.
There is one other issue: what happens if you specify that you want to crawl a specific MIME type, but in the current crawl cycle there are no more documents with that MIME type? Then the crawl stalls until you add more URLs to crawl (manually), or until another URL becomes due for fetching.
The normal approach is to crawl/parse everything and extract all the links (you never know when a new link matching your requirements could appear), and then only index certain MIME types.
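A related workaround, sketched here under assumptions rather than taken from the plugins below: if you index everything, you can still restrict searches to one MIME type at query time. Assuming the index-more indexing filter is enabled and populates a type field in your Solr schema (the core name collection1 and the field name are assumptions; check your own schema.xml), a query limited to HTML pages could look like:
curl 'http://localhost:8983/solr/collection1/select?q=*:*&fq=type:"text/html"&wt=json'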
For Nutch 2.x I'm afraid there is currently no mechanism for doing this. In Nutch 1.x we have two:
https://github.com/apache/nutch/tree/master/src/plugin/index-jexl-filter
https://github.com/apache/nutch/tree/master/src/plugin/mimetype-filter (soon to be deprecated)
You could port either of these options into Nutch 2.x.
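If you do port or reuse one of these, remember that an indexing filter only takes effect once its plugin is listed in the plugin.includes property of conf/nutch-site.xml. The value below is only an illustrative sketch (your existing plugin list will differ); the point is appending mimetype-filter to whatever is already there:
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|mimetype-filter</value>
</property>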

How to crawl specific pages with nutch 2.3?

I am new to Nutch. After several hours of searching I couldn't figure out how to choose the right settings for my crawler.
First of all I have installed nutch 2.3 on Ubuntu 14.04 using hbase 0.94.14 and elasticsearch 1.4.2.
I started using nutch running the following commands in nutch's runtime/local directory:
bin/nutch inject seedfolder
bin/nutch generate -topN 20
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb -all
bin/nutch index -all
I can then access the crawled data via Elasticsearch, so everything seems to work nicely.
My problem starts when I want Nutch to crawl only the sites I am interested in. I have read lots of tutorials, including the ones on the Apache website. One thing that really confuses me is the big difference between the versions of Nutch; also, some of the questions I have never seem to have been asked or answered.
What I want to do:
I want to tell Nutch what page to crawl (of course it will be more than one, but let's keep it simple). I do that by adding a URL to my seed file and calling nutch inject. Now let's say I want to crawl http://www.pagetocrawl.com/intresting-facts; more precisely, I am interested in the contents of
http://www.pagetocrawl.com/intresting-facts?interesting-fact-id=1
http://www.pagetocrawl.com/intresting-facts?interesting-fact-id=2
http://www.pagetocrawl.com/intresting-facts?interesting-fact-id=3
...
What I thought I needed to do was edit NUTCH_HOME/runtime/local/conf/regex-urlfilter.txt and add something like
+http://www.pagetocrawl.com/intresting-facts
When I ran the generate and fetch commands I noticed that Nutch did in fact only crawl sites starting with pagetocrawl.com and didn't touch the other sites I had injected earlier. But then it crawled all the pages that
http://www.pagetocrawl.com/interesting-facts
linked to, that is, the imprint, the where-to-find-us page, etc. In the end it didn't crawl even one single interesting-fact page. So my two most important questions are: how can I tell Nutch to crawl only those subpages of my sites (filtered by regex-urlfilter.txt) that also match a specific pattern? And, as the next step, how can I be sure to crawl all of the relevant subpages (as long as they are linked from the http://www.pagetocrawl.com/interesting-facts page)?
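For reference, the stock regex-urlfilter.txt also ships with a rule like -[?*!@=] that skips URLs containing query characters, which on its own would keep the ?interesting-fact-id= pages from ever being fetched. A sketch of rules limited to the example pages (spelling as in the URLs above, and assuming these lines are placed before, or replace, that skip rule) would be:
+^http://www\.pagetocrawl\.com/intresting-facts(\?interesting-fact-id=[0-9]+)?$
-.
This is untested against this exact setup, but it is the usual shape of a "crawl only this pattern" filter.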
I've looked at
http://www.stackoverflow.com/questions/19731904/exclude-urls-without-www-from-nutch-1-7-crawl
but there the problem seems to appear one step earlier: I added my URL to regex-urlfilter.txt and it seems to be working, just not the way I expected it to.
I've also read that question:
http://www.stackoverflow.com/questions/3253525/how-to-index-only-pages-with-certain-urls-with-nutch
This seems to describe the very same problem I have, but since I am working with Nutch 2.3, the mergedb command doesn't seem to work any longer.
I really hope I have described my problem properly and that somebody can help me with it.

How to know the status of Apache Solr for daily indexed documents

I am using Apache Solr 4.10.x. Apache Nutch is being used to crawl and index documents. Now that my crawler is running, I want to know how many documents are indexed on each iteration of Nutch, or per day.
Is there any approach or tool provided by Apache Solr for this purpose?
facet=true
facet.date=myDateField
facet.date.start=start_date
facet.date.end=end_date
facet.date.gap=+1MONTH (use a duration such as +1DAY in your case)
Append all of these parameters to your request URL with & if you are querying over plain HTTP.
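As an illustrative sketch, assuming the stock Nutch Solr schema's tstamp field holds the indexing time (the field and core names are assumptions) and remembering that + must be URL-encoded as %2B, a per-day count over the last 30 days could look like:
http://localhost:8983/solr/collection1/select?q=*:*&rows=0&facet=true&facet.date=tstamp&facet.date.start=NOW/DAY-30DAYS&facet.date.end=NOW/DAY%2B1DAY&facet.date.gap=%2B1DAY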
You can hit a URL with command=status.
For example, in my case it was:
qt=/dataimport&command=status
It gives you the status, such as committed or rollback, and the total documents processed.
For more info, check
http://wiki.apache.org/solr/DataImportHandler
and look for "command".
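As a hedged example, assuming the DataImportHandler is registered at /dataimport on a core named collection1, the full status request would be something like:
curl "http://localhost:8983/solr/collection1/dataimport?command=status"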

Apache Nutch and Solr: queries

I have just started using Nutch 1.9 and Solr 4.10
After browsing through various pages I see that the syntax for running this version has changed, and that I have to update certain XML files to configure Nutch and Solr.
This version of the package doesn't require Tomcat to run. I started Solr:
java -jar start.jar
and checked localhost:8983/solr/admin; it's working.
I planted a seed in bin/url/seed.txt, and the seed is "simpleweb.org".
I ran this command in Nutch: ./crawl urls -dir crawl -depth 3 -topN 5
I got a few IO exceptions in the middle, so to avoid them I downloaded
patch-hadoop_7682-1.0.x-win.jar, made an entry in nutch-site.xml, and placed the jar file in Nutch's lib directory.
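For anyone reproducing this on Windows: the nutch-site.xml entry that patch is usually paired with looks roughly like the following. The class name comes from that particular patch jar, so treat it as an assumption and verify it against the jar you actually downloaded.
<property>
  <name>fs.file.impl</name>
  <value>com.conga.services.hadoop.patch.HADOOP_7682.WinLocalFileSystem</value>
  <description>Enables the HADOOP-7682 patch so Nutch can use the local filesystem on Windows (verify against your patch jar)</description>
</property>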
After running Nutch, the following folders were created:
apache-nutch-1.9\bin\-dir\crawldb\current\part-00000
I can see the following files in that path:
data
index
.data.crc
.index.crc
I want to know what to do with these files and what the next steps are. Can we view these files? If yes, how?
I indexed the data crawled by Nutch into Solr, using this command to link Solr with Nutch (the command completed successfully):
./crawl urls solr http://localhost:8983/solr/ -depth 3 -topN 5
Why do we need to index the data crawled by Nutch into Solr?
After crawling with Nutch (the command used for this was ./crawl urls -dir crawl -depth 3 -topN 5), can we view the crawled data, and if yes, where?
Or can we only view the crawled data entries after indexing the data crawled by Nutch into Solr?
How do we view the crawled data in the Solr web UI?
(The command used for this was ./crawl urls solr localhost:8983/solr/ -depth 3 -topN 5.)
Although Nutch was built to be a web-scale search engine, this is not the case any more. Currently, the main purpose of Nutch is large-scale crawling. What you do with the crawled data is then up to your requirements. By default, Nutch allows you to send data into Solr. That is why you can run
crawl url crawl solraddress depth level
You can also omit the Solr URL parameter. In that case, Nutch will not send the crawled data into Solr, and without sending the crawled data to Solr you will not be able to search it. Crawling data and searching data are two different, though closely related, things.
Generally, you will find the crawled data in crawl/segments, not crawl/crawldb. The crawldb folder stores information about the crawled URLs: their fetch status, the next fetch time, and some other information useful for crawling. Nutch stores the actual crawled data in crawl/segments.
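As a hedged illustration of inspecting that data from the command line (the segment name below is just a placeholder timestamp directory; substitute one that actually exists under crawl/segments):
bin/nutch readdb crawl/crawldb -stats
bin/nutch readseg -dump crawl/segments/20150101123456 segment_dump
The first command prints crawldb statistics; the second writes a plain-text dump of the fetched segment into the segment_dump folder.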
If you want an easy way to view crawled data, you might try Nutch 2.x, as it can store its crawled data in several back ends such as MySQL, HBase, and Cassandra through the Gora component.
To view data in Solr, you simply issue a query like this:
curl http://127.0.0.1:8983/solr/collection1/select/?q=*:*
Otherwise, you can always push your data into different stores by adding indexer plugins. Currently, Nutch supports sending data to Solr and Elasticsearch. These indexer plugins send structured data such as title, text, author, and other metadata.
The following summarizes what happens in Nutch:
seed list -> crawldb -> fetching raw data (downloading site contents) -> parsing the raw data -> structuring the parsed data into fields (title, text, anchor text, metadata and so on) -> sending the structured data to storage for usage (such as Elasticsearch or Solr).
Each of these stages is extendable and allows you to add your logic to suit your requirements.
I hope that clears your confusion.
You can run Nutch on Windows. I am also a beginner, and yes, it is a bit tough to install on Windows, but it does work! The "input path doesn't exist" problem can be solved by:
replacing the hadoop-core-1.2.0.jar file in apache-nutch-1.9/lib with hadoop-core-0.20.2.jar (from Maven),
then renaming this new file to hadoop-core-1.2.0.jar.
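A minimal sketch of those two steps, assuming the replacement jar was downloaded to the directory you start from and the Nutch install lives in apache-nutch-1.9:
cd apache-nutch-1.9/lib
mv hadoop-core-1.2.0.jar hadoop-core-1.2.0.jar.orig
cp ../../hadoop-core-0.20.2.jar hadoop-core-1.2.0.jar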

Is there any way to log the list of URLs 'ignored' in a Nutch crawl?

I am using Nutch to crawl a list of URLs specified in the seed file with depth 100 and topN 10,000 to ensure a full crawl. I am also trying to ignore URLs with repeated strings in their path, using this regex-urlfilter: http://rubular.com/r/oSkwqGHrri
However, I am curious to know which URLs have been ignored during crawling. Is there any way I can log the list of URLs "ignored" by Nutch while it crawls?
The links can be found by using the following command:
bin/nutch readdb PATH_TO_CRAWL_DB -stats -sort -dump DUMP_FOLDER -format csv
This will generate a part-00000 file in DUMP_FOLDER containing the URL list and each URL's status.
Those with the status of db_unfetched have been ignored by the crawler.
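As a quick, hedged follow-up, assuming the csv dump keeps each URL and its status name on the same line, the ignored ones can be pulled out with something like:
grep "db_unfetched" DUMP_FOLDER/part-00000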