I want to crawl a web site with Nutch and then index the result in Solr.
In my Solr schema.xml I have a field named content.
However, every site has its own page structure. For one site I want to store the value of its "body" tag in the content field (in the Solr schema),
and for another site I want to store the value of its "content" tag in that same field.
In other words, if the crawl result contains a "body" tag I want to store its value in the content field,
otherwise, if it contains a "content" tag, I want to store that value instead.
How can I do that?
Can I fill a Solr field from several possible tags in the Nutch crawl result, depending on which tag is found on each web site?
Indexing content with Nutch and posting it to Solr should be straightforward. But if you want to add logic, and the list of rules might grow, it may be worth using a content processing engine.
I've seen this tool used for that specific purpose; it uses Heritrix as the crawler and lets you write Groovy scripts to decide how to handle your content: www.searchtechnologies.com/aspire
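For the plain Nutch-to-Solr flow, the standard indexing step on the Nutch 1.x command line looks roughly like this (the Solr URL and crawl paths are placeholders for your own setup):

    # push the parsed segments into Solr; adjust the URL and paths to your install
    bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*

The per-site tag mapping you describe is exactly the kind of extra logic that this basic flow does not cover on its own.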
Related
I'm trying to fetch a list of several URLs and parse their title, keywords and description (and ignore everything else) using Apache Nutch.
After that I just want to save, for each URL, the title, keywords and description content (preferably without the tags themselves), without any indexing.
I looked at several examples of how to do this. Just a few of the ones I encountered:
How to parse content located in specific HTML tags using nutch plugin?
http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/
However, they all propose plugin configuration and settings that are complicated (at least to a Nutch newbie).
Since my use case sounds like a very common one, I was wondering if there is any simpler solution?
Thanks
I am crawling websites using Apache Nutch 2.2.1, which provides me with content to index in Solr. When Nutch fetches a page, there is contextual content such as "contact us", "legal notice" or other irrelevant information (generally coming from the top menu, left menu or footer of the page) that I do not need to index.
One solution would be to automatically select the most relevant part of the content to index, which could be done by an automatic summarizer. There is a plugin "summary-basic"; is it used for this purpose? If so, how is it configured? Other solutions are also welcome.
In regex-urlfilter.txt you can specify the list of URLs you want to ignore. You can add the links for "contact us" (and, typically, any other header or footer pages that you don't want to crawl) to that regex list. While crawling, Nutch will ignore those URLs and only fetch the required content. You can find regex-urlfilter.txt under the apache-nutch-2.2.1/conf folder.
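As a rough illustration (the hostnames and paths below are made up, so adapt the patterns to the actual pages you want to skip), the entries in conf/regex-urlfilter.txt could look like this:

    # skip boilerplate pages such as contact and legal notices (hypothetical URLs)
    -^http://www\.example\.com/contact-us
    -^http://www\.example\.com/legal-notice
    # accept anything else
    +.

Note that this only helps when the boilerplate lives on separate pages; it will not strip a menu or footer that is embedded in the same page as the main content.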
During solrindex, how can I tell Nutch to skip indexing documents with an empty content field?
I found http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/, but the index-omit plugin only lets Nutch filter out documents based on certain meta tag fields, not general fields such as content.
You might need to implement a new Nutch indexing filter that discards the document if the content is empty.
You can find more information on how to write a plugin here: https://wiki.apache.org/nutch/AboutPlugins
EDIT:
I wrote a simple plugin just as an example.
It looks at the "content" field and, if it is empty, ignores the document so that it is not indexed.
You can get it from here: https://github.com/nimeshjm/index-discardemptycontent
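For reference, the core of such a filter might look roughly like the sketch below. It is written against the Nutch 1.x IndexingFilter signature (Nutch 2.x differs), the class name is made up, and the plugin descriptor and build wiring are omitted, so treat the linked repository as the authoritative example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.indexer.IndexingException;
    import org.apache.nutch.indexer.IndexingFilter;
    import org.apache.nutch.indexer.NutchDocument;
    import org.apache.nutch.parse.Parse;

    // Sketch of an indexing filter: returning null tells Nutch to drop the
    // document from the indexing job, so it never reaches Solr.
    public class DiscardEmptyContentFilter implements IndexingFilter {

      private Configuration conf;

      @Override
      public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
          CrawlDatum datum, Inlinks inlinks) throws IndexingException {
        String text = parse.getText();   // plain text extracted by the parser
        if (text == null || text.trim().isEmpty()) {
          return null;                   // empty content: skip this document
        }
        return doc;                      // otherwise pass the document through unchanged
      }

      @Override
      public void setConf(Configuration conf) {
        this.conf = conf;
      }

      @Override
      public Configuration getConf() {
        return conf;
      }
    }

The filter still has to be declared in the plugin's plugin.xml and added to plugin.includes in nutch-site.xml before the indexing job will pick it up.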
I am using Nutch 1.4. I want to manipulate the crawled URL before indexing it.
For example, if my URL is http://xyz.com/home/xyz.aspx then I want to modify it to http://xyz.com/index.aspx?role=xyz, and only the latter should be indexed in Solr. The reason is that I don't want to expose the first URL; the second URL ultimately redirects to the same page.
Does Nutch provide a way to manipulate crawled URLs before indexing them in Solr?
There is no out-of-the-box way to modify the value fed to Solr unless you write a custom plugin to do so.
However, this can easily be handled on the client side before the results are displayed to the user.
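A minimal sketch of the client-side approach (the class name and regex are illustrative and match only the example URLs from the question): rewrite each result URL just before display, so Solr keeps the crawled URL but users never see it.

    // Illustrative client-side rewrite: map the crawled URL to the one you want to expose.
    public class UrlRewriter {
        public static String toPublicUrl(String crawledUrl) {
            return crawledUrl.replaceAll(
                "^http://xyz\\.com/home/(\\w+)\\.aspx$",
                "http://xyz.com/index.aspx?role=$1");
        }

        public static void main(String[] args) {
            // prints http://xyz.com/index.aspx?role=xyz
            System.out.println(toPublicUrl("http://xyz.com/home/xyz.aspx"));
        }
    }

If the rewritten form really has to be what is stored in the index, then the custom plugin route (an indexing filter that replaces the url field) is the way to go instead.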
I am new to the Apache Nutch/Solr family of products. I have set up basic Nutch (1.6) with Solr (4.3) and have successfully crawled a site; Solr has indexed my crawled data as well.
Now my question is: if I crawl a blog where users can leave comments (e.g. http://blogs.alliedtechnique.com/2009/04/16/setting-global-environment-variables-in-centos/), how can I make Nutch treat the users' comments and the main blog post as separate documents, so that when I search for a keyword it returns the main post and the comments as separate results? Later I could use that data for sentiment analysis as well.
I would greatly appreciate any help here.
Thanks.
Tony
You could use the xpath filter plugin to segregate crawled content into two different fields.
http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/
Content inside class="post" would go to field A, and content inside class="commentlist" would go to field B.
In your search page logic, you query Solr on field A only, so your search results come from the blog post and not from the comments.
The comment data is still stored on the document; it just isn't searched.
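To illustrate the query side (the core name and the post_content field are hypothetical; the actual names depend on how you configure the xpath plugin and your schema), a SolrJ 4.x search restricted to the post field could look like this:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class BlogSearch {
        public static void main(String[] args) throws SolrServerException {
            // Hypothetical core and field names; match them to your own schema.
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
            SolrQuery query = new SolrQuery("post_content:\"environment variables\"");
            query.setFields("url", "title");   // only return what the results page needs
            QueryResponse response = solr.query(query);
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("url"));
            }
        }
    }

Because the comment text sits in its own field, it stays available for later processing (e.g. sentiment analysis) without showing up in these search results.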