Any descriptive tutorials or clear guidance on crawling the web with Apache Solr 6.6?

I read on this question page that Solr 5+ supports web crawling, which means that we no longer need Nutch. Are there any examples or descriptions that explain how to set up Solr 6.6 to crawl a set of remote websites?

They most probably meant using the DataImportHandler (DIH) with the right data source, but I doubt this can replace Nutch and similar crawlers in many scenarios.
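To make the gap concrete, here is a minimal sketch of what "crawling into Solr" involves when done by hand: follow same-host links breadth-first, then send the pages to Solr's JSON update handler. The core name, the `content` field, and the helper names are assumptions for illustration, and the sketch stores raw HTML and ignores robots.txt, politeness delays, and deduplication, which is much of what Nutch provides.

```python
import json
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def fetch(url):
    """Download a page and decode it as text (network access required)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(seed, max_pages=10, fetch_page=fetch):
    """Breadth-first crawl restricted to the seed's host.

    Returns a list of dicts; the "content" field holds raw HTML here,
    which a real setup would strip or parse before indexing.
    """
    host = urlparse(seed).netloc
    queue, seen, docs = deque([seed]), {seed}, []
    while queue and len(docs) < max_pages:
        url = queue.popleft()
        html = fetch_page(url)
        docs.append({"id": url, "content": html})
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href).split("#")[0]  # resolve and drop fragment
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return docs


def build_update_request(solr_base, core, docs):
    """Build (but do not send) a POST to Solr's JSON update handler.

    solr_base (e.g. "http://localhost:8983") and core are assumptions;
    the fields in docs must exist in the core's schema.
    """
    return urllib.request.Request(
        f"{solr_base}/solr/{core}/update?commit=true",
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the request to `urllib.request.urlopen` would send it. The `fetch_page` parameter exists only so the crawl loop can be tried against canned pages without touching the network.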

Related

I started web crawling using StormCrawler, but I do not know where the crawled results go. I'm not using Solr or Elasticsearch

StormCrawler started crawling data, but I cannot find where the data is stored. I need to save this data to a database so I can connect it to a remote server and have it indexed. StormCrawler seems to focus mainly on Solr and Elasticsearch integration! I just want to store its data in a database so I can use it with any site search solution, such as Typesense, FlexSearch, or other site search software.
I'm very new to web scraping, so I have just installed the software required to run it.
SC does not focus on Solr or ES. There is, for instance, a SQL module.
As for "where the crawled results go", it depends on what your topology does. Maybe try to explain which bolts are running, etc.
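For the "store it in a database, then feed any search engine" part, the storage step itself is small whichever crawler produces the data. The sketch below is a generic illustration using Python and SQLite, not StormCrawler's SQL module; the table layout and function names are assumptions.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a SQLite database with one row per crawled URL."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS pages (
               url        TEXT PRIMARY KEY,
               title      TEXT,
               content    TEXT,
               fetched_at TEXT DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return conn


def save_page(conn, url, title, content):
    """Insert a page, replacing any earlier row for the same URL."""
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, title, content) VALUES (?, ?, ?)",
        (url, title, content),
    )
    conn.commit()


def export_pages(conn):
    """Return all pages as plain dicts, ready to send to a search engine."""
    rows = conn.execute("SELECT url, title, content FROM pages ORDER BY url")
    return [{"url": u, "title": t, "content": c} for u, t, c in rows]
```

Because `url` is the primary key, re-crawling a page overwrites its row instead of adding a duplicate, and `export_pages` gives a list that can be serialized to JSON for whichever site search tool ends up being used.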

How to restrict Apache Nutch 2.3.1 to crawl story content and not sidebars

I have to crawl some news websites. I have set up Apache Nutch 2.3.1 with Hadoop 2.7.4 and an HBase cluster. I have to provide search via Solr 6.6.1. After crawling some websites, I observed that Nutch crawls everything in a page. News websites have sidebars that contain the latest or top news, etc., and this sidebar content changes over time. Is there any way to ask Nutch to crawl only the main story content and skip such sidebars?
Well, since you're using Nutch 2.x this is a bit difficult. For Nutch 1.x you could use the Boilerpipe implementation that's available in Tika, but unfortunately it has not been ported to the 2.x branch yet.
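If the pages use semantic markup, a rough workaround outside Nutch is to drop the text inside sidebar-like elements before indexing. The sketch below is not Boilerpipe and not a Nutch plugin; it is a naive tag-based filter, and the set of skipped tags is an assumption that will not match every site (many put sidebars in plain `<div>` elements).

```python
from html.parser import HTMLParser

# Assumed set of elements treated as boilerplate; adjust per site.
SKIPPED_TAGS = {"aside", "nav", "header", "footer", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside any of SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)


def extract_main_text(html):
    """Return the page text with sidebar-like sections removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

The depth counter assumes the skipped elements are properly closed; on badly malformed HTML it can swallow too much or too little, which is where a statistical extractor such as Boilerpipe does better.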

How can I integrate the Apache Solr Search with my Drupal 7 Site?

Can anyone give me links to good tutorials that will help me integrate Solr search with my Drupal site and get good performance?
Which modules are available for Apache Solr search in Drupal 7.x?
Which version of Solr supports Drupal 7.x?
What configuration is required in Apache Solr and Drupal 7.x for search?
There are two modules that support Solr with Drupal that are widely used:
Search API Solr
ApacheSolr search
Both have their various configuration 'quirks'. I'd say you need to try both to see which one fits your site best.
Make sure you have Java 5 or higher installed already on your server.
Tutorial on setting up site with Search API for Solr
Tutorial on setting up site with ApacheSolr
Look at and use Search API. The Apache Solr module will not be the way forward for the future of Drupal.
The two maintainers of Search API and Apache Solr have met in person and determined a way forward for advanced search with Drupal, and they both agreed that Search API is it.

Have you indexed Nutch crawl results using Elasticsearch before?

Has anyone had any luck writing custom indexers for Nutch to index the crawl results with Elasticsearch? Or do you know of any that already exist?
I wrote an Elasticsearch plugin that mocks the Solr API. Using this plugin and the standard Nutch Solr indexer, you can easily send crawled data into Elasticsearch. The plugin and an example of how to use it with Nutch can be found on GitHub:
https://github.com/mattweber/elasticsearch-mocksolrplugin
I know that Nutch will be adding pluggable backends, and I'm glad to see it. I needed to integrate Elasticsearch with Nutch 1.3. The code is posted here; it is piggybacked off the Solr indexer code (src/java/org/apache/nutch/indexer/solr).
https://github.com/ctjmorgan/nutch-elasticsearch-indexer
I haven't done it, but this is definitely doable. It would require piggybacking on the Solr code (src/java/org/apache/nutch/indexer/solr) and adapting it to Elasticsearch. It would be a nice contribution to Nutch, BTW.
Time goes by, and Nutch is now well integrated with Elasticsearch. Here is a nice tutorial.
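Whichever indexer is used, what reaches Elasticsearch in the end is usually a bulk request. As a rough illustration (not the code of the Nutch indexer or of either plugin above), the body of a `_bulk` call is newline-delimited JSON: an action line followed by the document. The index name, the `url` field used as `_id`, and the function name are assumptions.

```python
import json


def build_bulk_body(index, docs, id_field="url"):
    """Build the NDJSON body for an Elasticsearch _bulk request.

    Each document becomes two lines: an "index" action carrying the
    target index and _id, then the document source itself.
    """
    if not docs:
        return ""
    lines = []
    for doc in docs:
        action = {"index": {"_index": index, "_id": doc[id_field]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"
```

The resulting string would be POSTed to the cluster's `_bulk` endpoint with the `Content-Type: application/x-ndjson` header.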

Using Nutch crawler with Solr

Can I integrate the Apache Nutch crawler with the Solr index server?
Edit:
One of our devs came up with a solution from these posts
Running Nutch and Solr
Update for Running Nutch and Solr
Answer
Yes
If you're willing to upgrade to Nutch 1.0, you can use solrindex as described in this article by Lucid Imagination: http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/.
It's still an open issue. If you're feeling adventurous, you could try applying those patches yourself, although it looks like it's not so simple.
Nutch 2.x is designed to use Solr by default. You can follow the steps in http://wiki.apache.org/nutch/Nutch2Tutorial, or the better instructions in the book "Web Crawling and Data Mining with Apache Nutch".