How to auto-index data using Solr and Nutch?

I want to automatically index a document or a website when it is fed to Apache Solr. How can we achieve this? I have seen examples that use a cron job called via a PHP script, but their explanations are not very clear. Using the Java API SolrJ, is there any way to index data automatically, without having to do it manually?

You can write a scheduler that calls the SolrJ code doing the indexing/reindexing.
For writing the scheduler, please refer to the links below:
http://www.mkyong.com/java/how-to-run-a-task-periodically-in-java/
http://archive.oreilly.com/pub/a/java/archive/quartz.html
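As a rough illustration of the scheduler idea, here is a minimal sketch that uses only the JDK's ScheduledExecutorService (Quartz, from the second link, would be an alternative). The class name PeriodicIndexer and the indexTask parameter are hypothetical; the task is where your own SolrJ indexing code would go, and no SolrJ calls are shown here.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: runs an indexing task repeatedly on a background thread.
class PeriodicIndexer {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    // Starts the task immediately, then waits `period` between the end of one
    // run and the start of the next, so slow runs never overlap.
    ScheduledFuture<?> start(Runnable indexTask, long period, TimeUnit unit) {
        Runnable safeTask = () -> {
            try {
                indexTask.run(); // assumption: your SolrJ indexing code goes here
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all future runs, so log and continue.
                System.err.println("Indexing run failed: " + e.getMessage());
            }
        };
        return scheduler.scheduleWithFixedDelay(safeTask, 0, period, unit);
    }

    // Stops scheduling new runs and waits briefly for a run in progress.
    void stop() {
        scheduler.shutdown();
        try {
            if (!scheduler.awaitTermination(10, TimeUnit.SECONDS)) {
                scheduler.shutdownNow();
            }
        } catch (InterruptedException e) {
            scheduler.shutdownNow();
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger runs = new AtomicInteger();
        PeriodicIndexer indexer = new PeriodicIndexer();
        // Placeholder task: a real one would add documents to Solr and commit.
        indexer.start(() -> System.out.println("index run " + runs.incrementAndGet()),
                100, TimeUnit.MILLISECONDS);
        Thread.sleep(350);
        indexer.stop();
    }
}
```

In a real deployment the period would more likely be minutes or hours, and the task would be long-lived inside your application rather than stopped after a few runs as in this demo.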

If you are using Apache Nutch, you have to use the Nutch solr-index plugin. With this plugin you can index web documents as soon as they are crawled by Nutch. The main question then is how to schedule Nutch to start periodically.
As far as I know, you have to use a scheduler for this purpose. I know of an old Nutch project called Nutch-base which uses Quartz for scheduling Nutch jobs. You can find the source code of Nutch-base at the following link:
https://github.com/mathieuravaux/nutchbase
If you look at this project, there is a plugin called admin-scheduling. Although it is implemented for an old version of Nutch, it could be a good starting point for developing a scheduler plugin for Nutch.
It is worth mentioning that if you are going to crawl a website periodically and fetch newly added links, you can use this tutorial.

Related

I started web crawling using Storm Crawler but I do not know where the crawled results go. I'm not using Solr or Elasticsearch

Storm Crawler started crawling data, but I cannot find where the data is stored. I need to save this data to a database so I can connect the data to a remote server and have it indexed. Storm Crawler seems to focus mainly on Solr and Elastic integration! I just want to store its data in a database so I can use it with any site search solution like Typesense, FlexSearch or any other site search software.
I'm very new to web scraping, so I have just installed the software required to run it.
SC does not focus on SOLR or ES. There is for instance a SQL module.
As for "where the crawled results go", it depends on what your topology does. Maybe try to explain what bolts are running etc...

Change the Solr Server URL on Drupal 7 from an external shell script

I have recently received a task of maintaining a Drupal site, with one of the tasks being writing a backup and import script for the dev site, so it can receive a daily dump of the live data.
I have done this, but we need to revert the Solr details to the dev Solr database. I only know how to do this manually using the UI tools (e.g. going to "https://WEBSITE.co.uk/admin/config/search/apachesolr/settings", clicking "edit", changing the "Solr server URL" in the UI menu and clicking "Save", e.g. "https://WEBSITE/admin/config/search/apachesolr/settings/dev_environment_search_server__0_0/edit?destination=admin/config/search/apachesolr/settings").
Would there be a way of changing this from a script, without the UI?
Also, updating the database table manually doesn't work unless you also clear the cache. Is there a way of clearing only the Solr cache so that this change takes effect? (I've been asked not to clear all caches.)
Could anyone help me?
If Solr is set up using Search API, you can add the Search API Override and Search API Solr Override modules, which let you control the configuration easily in your settings.local.php file.

ElasticSearch Indexing Confluence pages

Can ElasticSearch index Confluence pages?
There are a lot of river plugins but none for Confluence. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-plugins.html
There is a GitHub project, https://github.com/obazoud/elasticsearch-river-confluence, but the last commit was a year ago, so I guess it's not up to date.
Elasticsearch has deprecated rivers.
Elasticsearch has a solution built on top of it called Workplace Search, which can connect to Confluence for ingesting data.
Otherwise, you might need to do it yourself with a script that reads from the Confluence API and writes to Elasticsearch. You might also need to use the "ingest-attachment" plugin if you need to parse PDF content.
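For the script route, one piece that is easy to get wrong is the request body for Elasticsearch's _bulk endpoint, which expects newline-delimited JSON: one action line followed by one document line per page, each ending in a newline. The sketch below only builds that body. The class ConfluenceBulkBody, the Page type and the field names "title" and "body" are hypothetical; fetching pages from the Confluence API and POSTing the result (with Content-Type application/x-ndjson) are left out.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: turns already-fetched pages into a _bulk request body.
class ConfluenceBulkBody {

    // Minimal page holder; the fields are assumptions, not Confluence's schema.
    static final class Page {
        final String id;
        final String title;
        final String body;

        Page(String id, String title, String body) {
            this.id = id;
            this.title = title;
            this.body = body;
        }
    }

    // Encodes a Java string as a JSON string literal, including the quotes.
    static String jsonString(String s) {
        StringBuilder out = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.append('"').toString();
    }

    // Builds the NDJSON body: an "index" action line, then the document line.
    static String build(String index, List<Page> pages) {
        StringBuilder body = new StringBuilder();
        for (Page p : pages) {
            body.append("{\"index\":{\"_index\":").append(jsonString(index))
                .append(",\"_id\":").append(jsonString(p.id)).append("}}\n");
            body.append("{\"title\":").append(jsonString(p.title))
                .append(",\"body\":").append(jsonString(p.body)).append("}\n");
        }
        return body.toString();
    }

    public static void main(String[] args) {
        List<Page> pages = Arrays.asList(
                new Page("42", "Home", "Welcome to the \"wiki\"."),
                new Page("43", "Setup", "Step 1\nStep 2"));
        System.out.print(build("confluence", pages));
    }
}
```

Using the page id as the document _id means re-running the script overwrites existing documents instead of creating duplicates. In practice a JSON library would replace the hand-written escaping.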

What is the SOLR plugin for Liferay meant for?

I am using Liferay 6.1 and I am trying to learn how to incorporate search functionality into Liferay Portal. I was able to run Apache Solr inside Liferay's Tomcat container, but I don't understand what the Solr plugin for Liferay is meant for.
Here is the link
Can someone please explain what the benefits of using the plugin (for Liferay) are, and what it accomplishes on top of using Solr?
Thanks
Per this link, it is meant to externalize the search function from the portal.
Using Solr instead of Lucene gives you the additional capabilities of Solr, such as replication, sharding, result clustering through Carrot2, use of custom analyzers/stemmers, etc.
It can also offload search server processing to a separate cluster.
It opens up the possibility of a search-driven UI (faceted classification etc.) separate from your portal UI.

Have you indexed Nutch crawl results using Elasticsearch before?

Has anyone had any luck writing custom indexers for Nutch to index the crawl results with Elasticsearch? Or do you know of any that already exist?
I wrote an Elasticsearch plugin that mocks the Solr API. Using this plugin and the standard Nutch Solr indexer, you can easily send crawled data into Elasticsearch. The plugin and an example of how to use it with Nutch can be found on GitHub:
https://github.com/mattweber/elasticsearch-mocksolrplugin
I know that Nutch will be adding pluggable backends, and I am glad to see it. I needed to integrate Elasticsearch with Nutch 1.3. The code is posted here; it piggybacks off the Solr indexer code (src/java/org/apache/nutch/indexer/solr).
https://github.com/ctjmorgan/nutch-elasticsearch-indexer
I haven't done it, but this is definitely doable. It would require piggybacking on the Solr code (src/java/org/apache/nutch/indexer/solr) and adapting it to Elasticsearch. It would be a nice contribution to Nutch, BTW.
Time goes by, and Nutch is now well integrated with Elasticsearch. Here is a nice tutorial.