Apache Nutch 2.3.1 fetch specific MIME type documents - apache

I have configured Apache Nutch 2.3.1 with the Hadoop/HBase ecosystem. I have to crawl specific documents, i.e., documents having textual content only. I have found that regex-urlfilter.txt can exclude certain MIME types, but I could not find any option to specify the MIME types that I do want to crawl. The problem with the regex URL filter is that there can be many MIME types to exclude, and their number will grow over time, so it is very difficult to list them all. Is there any way to instruct Nutch to fetch only text/html documents, for example?

The URL filters only work with the URL, which means you can only make assertions based on that. Since the URL filters are executed before the documents are fetched/parsed, there is no MIME type available yet that could be used to allow/block URLs.
There is one other question: what happens if you specify that you want to crawl a specific MIME type, but in the current crawl cycle there are no more documents with that MIME type? Then the crawl stops until you add more URLs to crawl (manually), or until another URL is due to be fetched.
The normal approach for this is to crawl/parse everything and extract all the links (you never know when a new link matching your requirements could appear), and then index only certain MIME types.
For Nutch 2.x I'm afraid there is currently no mechanism for doing this. In Nutch 1.x we have two:
https://github.com/apache/nutch/tree/master/src/plugin/index-jexl-filter
https://github.com/apache/nutch/tree/master/src/plugin/mimetype-filter (soon to be deprecated)
You could port either of these options into Nutch 2.x.
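If you go the "index everything, filter later" route, a quick way to see which MIME types actually ended up in Solr is a facet query on the content-type field. This is only a sketch: it assumes the stock Nutch Solr schema, where the index-more plugin writes the content type into a field named type, and a core called collection1; adjust both to your setup.
# count indexed documents per MIME type (field and core names are assumptions, see above)
curl "http://localhost:8983/solr/collection1/select?q=*:*&rows=0&facet=true&facet.field=type&wt=json"
Adding fq=type:text/html to the same request narrows the results to HTML documents at query time, which can tide you over until an index-time filter is ported.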

Related

Regarding configuring Drupal 7 with Apache Solr and Apache Nutch

I have installed Drupal 7 and the Apache Solr Search module and configured it with Apache Solr (Solr version 4.10.4). The content has been indexed from Drupal into Apache Solr, and searching also works fine. I need to configure the Nutch (Apache Nutch version 1.12) web crawler with Apache Solr and Drupal 7 so that it fetches the details from a specific URL (for example http://www.w3schools.com), and I need to be able to search those contents in Drupal. My problem is how to configure all three: Solr, Nutch, and Drupal 7. Can anyone suggest a solution for this?
OK... here's my ugly solution, which may fit what you are doing.
You can use a PHP field (a custom field with Display Suite) in your node (or page) which basically reads your full page with cURL and then prints the contents right there. This field should only be in a display of your node that nobody will see (except Apache Solr).
Finally, in the Solr config (which, honestly, I don't remember well) you can choose which display of the page, or which field, gets indexed; that will be your full page.
If all of this works, you don't need to integrate Nutch with Solr and Drupal.
Good luck :)
PS: If you have any doubt, just ask.
My 2 cents on this: it looks like you want to aggregate content from your Drupal site (your nodes) and external content that is hosted on your site but not stored as Drupal content, right? If this is the case then you don't need any integration between Nutch and Drupal; just index everything in the same Solr core/collection. Of course you'll need to make sure that the Solr schema is compatible (Nutch has its own metadata, different from the Drupal nodes).
Alternatively, if you index into separate cores/collections you can use the shards parameter to span your query over several cores and still get a single result set. With this approach, however, you'll need to keep an eye on the relevance of your results (the order of the documents) and also on which fields the Drupal Solr module uses to display the results, so in the end you'll still need to make the schemas of both cores compatible to some degree.
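For the separate-cores variant, here is a sketch of what such a sharded query could look like; the host, port, and the core names drupal and nutch are just placeholders for whatever your setup uses:
# one request, results merged from both cores via the shards parameter
curl "http://localhost:8983/solr/drupal/select?q=some+search+term&shards=localhost:8983/solr/drupal,localhost:8983/solr/nutch"
Keep in mind that distributed scoring across cores with different schemas can skew the ordering, which is exactly the relevance caveat mentioned above.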

Nutch not crawling entire website

I am using Nutch 2.3.1.
I perform the following commands to crawl a site:
./nutch inject ../urls/seed.txt
./nutch generate -topN 2500
./nutch fetch -all
The problem is, Nutch is only crawling the first URL (the one specified in seed.txt). The data is only the HTML from the first URL/page.
All the other URLs that were accumulated by the generate command are not actually crawled.
I cannot get Nutch to crawl the other generated URLs... I also cannot get Nutch to crawl the entire website. What options do I need to use to crawl an entire site?
Does anyone have any insights or recommendations?
Thank you so much for your help
In the case that Nutch crawls only the one specified URL, please check the Nutch URL filter (conf/regex-urlfilter.txt). To crawl all URLs in the seed, the content of regex-urlfilter.txt should be as follows:
# accept all URLs
+.
See details here: http://wiki.apache.org/nutch/NutchTutorial
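As a side note, the three commands in the question cover only part of a crawl round. In Nutch 2.x a round usually also parses the fetched pages and updates the database so that newly discovered links become candidates for the next generate. A minimal sketch of one round (batch handling simplified with -all; check bin/nutch for the exact arguments your version accepts):
./nutch inject ../urls/seed.txt
./nutch generate -topN 2500
./nutch fetch -all
./nutch parse -all
./nutch updatedb -all
# repeat generate / fetch / parse / updatedb for further rounds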
Hope this helps,
Le Quoc Do

How to know the status of Apache Solr for daily indexed documents

I am using Apache Solr 4.10.x. Apache Nutch is being used to crawl and index documents. Now that my crawler is running, I want to know how many documents are indexed on each iteration of Nutch, or per day.
Is there any approach or any tool provided by Apache Solr for this purpose?
facet=true
facet.date=myDateField
facet.date.start=start_date
facet.date.end=end_date
facet.date.gap=+1MONTH (or a duration like +1DAY in your case)
Append all of these to your request URL, separated with &, if you are querying over HTTP.
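Put together as a single request, this could look roughly like the following. The field name tstamp, the core name collection1, and the date range are assumptions based on the stock Nutch Solr schema; replace them with your own. Note that + in Solr date math has to be URL-encoded as %2B.
# count documents indexed per day over the last week
curl "http://localhost:8983/solr/collection1/select?q=*:*&rows=0&facet=true&facet.date=tstamp&facet.date.start=NOW/DAY-7DAYS&facet.date.end=NOW/DAY%2B1DAY&facet.date.gap=%2B1DAY"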
You can hit a URL with command=status.
For example, in my case it was:
qt=/dataimport&command=status
It gives you the status, like committed or rollback, or the total number of documents processed...
For more info, check
http://wiki.apache.org/solr/DataImportHandler
and look for "command".
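For completeness: that status command belongs to Solr's DataImportHandler, which a plain Nutch-to-Solr setup normally does not have configured, so it only applies if such a handler is registered (commonly at /dataimport). Assuming one exists on a core named collection1, the request would look like this:
curl "http://localhost:8983/solr/collection1/dataimport?command=status"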

Apache Nutch and Solr: queries

I have just started using Nutch 1.9 and Solr 4.10.
After browsing through certain pages I see that the syntax for running this version has changed, and I have to update certain XML files to configure Nutch and Solr.
This version of the package doesn't require Tomcat to run. I started Solr:
java -jar start.jar
and checked localhost:8983/solr/admin; it's working.
I planted a seed in bin/url/seed.txt, and the seed is "simpleweb.org".
Then I ran this command in Nutch: ./crawl urls -dir crawl -depth 3 -topN 5
I got a few IO exceptions in the middle, so to avoid them I downloaded patch-hadoop_7682-1.0.x-win.jar, made an entry in nutch-site.xml, and placed the jar file in Nutch's lib directory.
After running Nutch, the following folders were created:
apache-nutch-1.9\bin\-dir\crawldb\current\part-00000
I can see the following files in that path:
data
index
.data.crc
.index.crc
I want to know what to do with these files and what the next steps are. Can we view these files? If yes, how?
I indexed the crawled data from Nutch into Solr, to link Solr with Nutch (the command completed successfully):
./crawl urls solr http://localhost:8983/solr/ -depth 3 -topN 5
Why do we need to index the data crawled by Nutch into Solr?
After crawling with Nutch (command used: ./crawl urls -dir crawl -depth 3 -topN 5), can we view the crawled data? If yes, where?
Or can we view the crawled data entries only after indexing the data crawled by Nutch into Solr?
How do we view the crawled data in the Solr web UI? (command used: ./crawl urls solr localhost:8983/solr/ -depth 3 -topN 5)
Although Nutch was built to be a web-scale search engine, this is not the case any more. Currently, the main purpose of Nutch is to do large-scale crawling. What you do with the crawled data is then up to your requirements. By default, Nutch allows you to send data into Solr. That is why you can run
crawl <urlDir> <crawlDir> <solrAddress> <depth>
You can also omit the Solr URL parameter. In that case, Nutch will not send the crawled data into Solr, and without sending the crawled data to Solr you will not be able to search it. Crawling data and searching data are two different things, but they are closely related.
Generally, you will find the crawled data in crawl/segments, not crawl/crawldb. The crawldb folder stores information about the crawled URLs, their fetching status, and the next fetch time, plus some other useful information for crawling. Nutch stores the actual crawled data in crawl/segments.
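Two read-only commands help inspect those two places; the segment timestamp below is just a placeholder for whatever directory your crawl actually produced:
# summary of the crawldb: how many URLs are fetched, unfetched, gone, etc.
bin/nutch readdb crawl/crawldb -stats
# dump one segment (raw content, parse text, outlinks) into a readable text file
bin/nutch readseg -dump crawl/segments/20160101123456 segment_dump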
If you want an easy way to view crawled data, you might try Nutch 2.x, as it can store its crawled data in several back ends such as MySQL, HBase, and Cassandra through the Gora component.
To view data in Solr, you simply issue a query to Solr like this:
curl http://127.0.0.1:8983/solr/collection1/select/?q=*:*
Otherwise, you can always push your data into different stores by adding indexer plugins. Currently, Nutch supports sending data to Solr and Elasticsearch. These indexer plugins send structured data such as title, text, author, and other metadata.
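To see some of those structured fields for the documents that were sent to Solr, you can restrict the returned fields with fl. The field names url, title, and tstamp below follow the stock Nutch Solr schema, and collection1 is just the default core name; they may differ in a customized setup:
# show a handful of indexed documents with selected Nutch fields
curl "http://localhost:8983/solr/collection1/select?q=*:*&fl=url,title,tstamp&rows=5&wt=json"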
The following summarizes what happens in Nutch:
seed list -> crawldb -> fetching raw data (downloading site contents) -> parsing the raw data -> structuring the parsed data into fields (title, text, anchor text, metadata and so on) -> sending the structured data to storage for use (like Elasticsearch and Solr).
Each of these stages is extensible and allows you to add your own logic to suit your requirements.
I hope that clears up your confusion.
You can run Nutch on Windows (I am also a beginner). Yes, it is a bit tough to install on Windows, but it does work! The "input path doesn't exist" problem can be solved by:
replacing the hadoop-core-1.2.0.jar file in apache-nutch-1.9/lib with hadoop-core-0.20.2.jar (from Maven),
then renaming this new file to hadoop-core-1.2.0.jar.
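Spelled out as shell commands, the swap could look like this; the paths are illustrative, and the replacement jar is assumed to have been downloaded already (e.g. from Maven Central):
cd apache-nutch-1.9/lib
# keep a backup of the jar that ships with Nutch
mv hadoop-core-1.2.0.jar hadoop-core-1.2.0.jar.bak
# drop in the older Hadoop core jar under the file name Nutch's scripts expect
cp ~/Downloads/hadoop-core-0.20.2.jar hadoop-core-1.2.0.jar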

How to Crawl .pdf links using Apache Nutch

I got a website to crawl which includes some links to PDF files.
I want Nutch to crawl those links and dump them as .pdf files.
I am using Apache Nutch 1.6, and I am trying this in Java as:
ToolRunner.run(NutchConfiguration.create(), new Crawl(),
tokenize(crawlArg));
SegmentReader.main(tokenize(dumpArg));
Can someone help me with this?
If you want Nutch to crawl and index your PDF documents, you have to enable document crawling and the Tika plugin:
Document crawling
1.1 Edit regex-urlfilter.txt and remove any occurrence of "pdf"
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
1.2 Edit suffix-urlfilter.txt and remove any occurrence of "pdf"
1.3 Edit nutch-site.xml, add "parse-tika" and "parse-html" in the plugin.includes section
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with the
underlying commons-httpclient library.
</description>
</property>
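To check that the Tika plugin actually handles PDFs after these changes, Nutch's parser checker tool can be pointed at a single PDF URL; the URL below is just a placeholder, and you may want to confirm with bin/nutch that your release includes the parsechecker command:
# fetch and parse one document with the configured parser plugins, printing the extracted text
bin/nutch parsechecker -dumpText http://example.com/sample.pdf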
If what you really want is to download all PDF files from a page, you can use something like Teleport on Windows or wget on *nix.
You can either write your own plugin for the PDF MIME type, or use the embedded Apache Tika parser, which can retrieve text from PDFs.