We are using Solr for indexing in our project, and indexing is working fine so far.
We need to handle the scenario where Solr is down for some reason.
What should the indexing process do then?
Is there a way to hook into Solr's restart so that a script runs to work through the backlog of pending index updates?
Please share your experiences of how you have handled such scenarios.
Thanks in advance.
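For what it's worth, one common pattern is to buffer failed updates durably and replay them once Solr is reachable again. A minimal sketch, assuming SolrJ 6+ and a hypothetical reindexFromSource hook that reloads a document from your system of record:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Collections;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ResilientIndexer {

    // Core URL is an assumption; point it at your own core.
    private final SolrClient solr =
            new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

    // Durable backlog: one failed document id per line (no concurrency handling here).
    private final Path backlog = Paths.get("backlog.txt");

    // Try to index; if Solr is unreachable, record the id for later replay.
    public void index(String id, String title) {
        try {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", id);
            doc.addField("title", title);
            solr.add(doc);
            solr.commit();
        } catch (Exception e) {
            try {
                Files.write(backlog, Collections.singletonList(id),
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            } catch (IOException io) {
                io.printStackTrace();
            }
        }
    }

    // Call this when Solr comes back (from a startup hook or a periodic health check).
    public void replayBacklog() throws IOException {
        if (!Files.exists(backlog)) {
            return;
        }
        List<String> pending = Files.readAllLines(backlog);
        Files.delete(backlog); // any replay failures get re-appended by index()
        for (String id : pending) {
            reindexFromSource(id);
        }
    }

    // Hypothetical hook: reload the document from your system of record
    // and push it through index() again.
    private void reindexFromSource(String id) { /* application-specific */ }
}
```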
I want to automatically index a document or a website when it is fed to Apache Solr. How can we achieve this? I have seen examples of using a cron job invoked via a PHP script, but they are not very clear in their explanation. Using the Java API SolrJ, is there any way we can index data automatically, without having to do it manually?
You can write a scheduler and have it call the SolrJ code that does the indexing/reindexing.
For writing the scheduler, refer to the links below:
http://www.mkyong.com/java/how-to-run-a-task-periodically-in-java/
http://archive.oreilly.com/pub/a/java/archive/quartz.html
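For a simple in-process approach, here is a minimal sketch using ScheduledExecutorService together with SolrJ (assuming SolrJ 6+; the core URL and document fields are placeholders for your own setup):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IndexScheduler {

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Re-run the indexing job every 30 minutes; tune the period to your needs.
        scheduler.scheduleAtFixedRate(IndexScheduler::indexPendingDocuments,
                0, 30, TimeUnit.MINUTES);
    }

    private static void indexPendingDocuments() {
        // Opening and closing a client per run keeps the sketch simple;
        // a long-lived client would also work.
        try (SolrClient solr =
                     new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");
            doc.addField("title", "Example document");
            solr.add(doc);
            solr.commit();
        } catch (Exception e) {
            // Log and let the next scheduled run retry.
            e.printStackTrace();
        }
    }
}
```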
If you are using Apache Nutch, you have to use Nutch's solr-index plugin. With this plugin you can index web documents as soon as they are crawled by Nutch. But the main question would be how to schedule Nutch to start periodically.
As far as I know, you have to use a scheduler for this purpose. I know of an old Nutch project called Nutch-base which uses Apache Quartz to schedule Nutch jobs. You can find the source code of Nutch-base at the following link:
https://github.com/mathieuravaux/nutchbase
If you look at this project, there is a plugin called admin-scheduling. Although it was implemented for an old version of Nutch, it could be a nice starting point for developing a scheduler plugin for Nutch.
It is worth mentioning that if you are going to crawl websites periodically and fetch newly arrived links, you can use this tutorial.
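As a rough illustration of the Quartz approach, here is a minimal sketch that runs the Nutch crawl script on a cron schedule (assuming Quartz 2.x; the script path and arguments are assumptions for your installation):

```java
import org.quartz.CronScheduleBuilder;
import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

public class NutchCrawlJob implements Job {

    @Override
    public void execute(JobExecutionContext context) {
        try {
            // Shell out to the Nutch crawl script; path and arguments are assumptions.
            Process p = new ProcessBuilder("/opt/nutch/bin/crawl", "urls", "crawl", "2")
                    .inheritIO()
                    .start();
            p.waitFor();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) throws Exception {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
        JobDetail job = JobBuilder.newJob(NutchCrawlJob.class)
                .withIdentity("nutchCrawl")
                .build();
        // Fire every night at 2 AM.
        Trigger trigger = TriggerBuilder.newTrigger()
                .withSchedule(CronScheduleBuilder.cronSchedule("0 0 2 * * ?"))
                .build();
        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }
}
```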
I am using Liferay 6.1 and I am trying to learn how to incorporate search functionality into Liferay Portal. I was able to run Apache Solr inside Liferay's Tomcat container, but I don't understand what the Solr plugin for Liferay is meant for.
Here is the link
Can someone please explain the benefits of using the plugin (for Liferay) and what it accomplishes on top of using Solr?
Thanks
Per this link, the plugin exists to externalize the search function from the portal.
Using Solr instead of Lucene gives you the additional capabilities of Solr, such as replication, sharding, result clustering through Carrot2, and the use of custom analyzers/stemmers.
It can also offload search processing to a separate cluster.
It opens up the possibility of a search-driven UI (faceted classification, etc.) separate from your portal UI.
When I make a change to my Solr (3.4) schema and then restart the process, my changes aren't immediately found. I have to stop the server, wait a few seconds (past when ps aux shows the Java process has terminated), and then start it again. Why is this?
Schema changes require the Solr server to be restarted because the schema is loaded into memory. The Solr Wiki has more information on schema changes. There is, however, a way around this: if you run Solr in a multi-core environment, you can dynamically create, reload, rename, load, and swap cores on the fly.
A good option for you might be to move your Solr core to the multi-core directory and start Solr in the multi-core home. Then, when you need to make a schema change, create a new core with the new schema, index your content in the background, and swap your primary core with the new core, as in the sketch below.
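To make the create-and-swap flow concrete, here is a minimal sketch driven through Solr's CoreAdmin HTTP API (assuming a legacy multi-core setup rather than SolrCloud; host, port, and core names are placeholders):

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class CoreSwap {

    private static final String ADMIN = "http://localhost:8983/solr/admin/cores";

    // Issue a CoreAdmin command and fail loudly on a non-200 response.
    private static void coreAdmin(String query) throws IOException {
        URL url = new URL(ADMIN + "?" + query);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        if (conn.getResponseCode() != 200) {
            throw new IOException("CoreAdmin call failed: " + query);
        }
        conn.disconnect();
    }

    public static void main(String[] args) throws IOException {
        // 1. Create a staging core whose instanceDir holds the new schema.
        coreAdmin("action=CREATE&name=staging&instanceDir=staging");
        // 2. ...reindex your content into the staging core in the background...
        // 3. Swap it with the live core; queries now hit the new schema.
        coreAdmin("action=SWAP&core=live&other=staging");
    }
}
```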
Best of luck to you!
Your browser may be displaying a cached result. I've found that forcing a refresh without the cache (Shift+Cmd+R in my browser) worked for me after reloading the Solr config.
Has anyone had any luck writing custom indexers for Nutch to index the crawl results with ElasticSearch? Or do you know of any that already exist?
I wrote an ElasticSearch plugin that mocks the Solr API. Using this plugin and the standard Nutch Solr indexer, you can easily send crawled data into ElasticSearch. The plugin and an example of how to use it with Nutch can be found on GitHub:
https://github.com/mattweber/elasticsearch-mocksolrplugin
I know that Nutch will be adding pluggable backends, and I'm glad to see it. I needed to integrate ElasticSearch with Nutch 1.3, so I piggybacked off the Solr indexer code (src/java/org/apache/nutch/indexer/solr). The code is posted here:
https://github.com/ctjmorgan/nutch-elasticsearch-indexer
Haven't done it, but it's definitely doable; it would require piggybacking on the Solr indexer code (src/java/org/apache/nutch/indexer/solr) and adapting it to ElasticSearch. It would be a nice contribution to Nutch, by the way.
Time goes by, and Nutch is now well integrated with ElasticSearch. Here is a nice tutorial.
Am I able to integrate the Apache Nutch crawler with the Solr index server?
Edit:
One of our devs came up with a solution based on these posts:
Running Nutch and Solr
Update for Running Nutch and Solr
Answer
Yes
If you're willing to upgrade to Nutch 1.0, you can use the solrindex command as described in this article by Lucid Imagination: http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/.
It's still an open issue. If you're feeling adventurous, you could try applying those patches yourself, although it looks like it's not so simple.
Nutch 2.x is designed to use Solr by default. You can follow the steps at http://wiki.apache.org/nutch/Nutch2Tutorial, or the better instructions in the book "Web Crawling and Data Mining with Apache Nutch".