Creating demo UI on top of Solr - lucene

I'm looking for example UIs on top of Solr that show off the available functionality in a demo, e.g. drill-down faceted search. I found Blacklight, which looks very interesting. Is there any other software worth researching, or is Blacklight definitely the way to go? Thanks.

Have you looked at using the Velocity templating built into Solr? You can find more about "Solritas" here: http://wiki.apache.org/solr/Solritas
I am about to put together a demo Solr site for a presentation, and am going down the Solritas route. You get faceting, clustering, and more! And no extra server to run.
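For a demo UI, the drill-down faceted search mentioned in the question maps onto Solr's standard `facet`, `facet.field` and `fq` request parameters. The following is a minimal sketch of building such a request URL, not part of Solritas itself; the core name `demo` and the field names `category` and `author` are hypothetical.

```python
from urllib.parse import urlencode


def facet_query_url(base_url, query, facet_fields, filters=()):
    """Build a Solr /select URL with field faceting and drill-down filters."""
    params = [
        ("q", query),
        ("wt", "json"),           # ask Solr for a JSON response
        ("facet", "true"),        # turn faceting on
        ("facet.mincount", "1"),  # hide facet values with zero hits
    ]
    # One facet.field parameter per field to count values for.
    params += [("facet.field", field) for field in facet_fields]
    # Each drill-down step adds a filter query (fq) on the chosen facet value.
    params += [("fq", '%s:"%s"' % (field, value)) for field, value in filters]
    return base_url + "/select?" + urlencode(params)


# Hypothetical core and fields: the user has clicked the "books" category.
url = facet_query_url(
    "http://localhost:8983/solr/demo",
    "*:*",
    ["category", "author"],
    filters=[("category", "books")],
)
print(url)
```

Each click on a facet value in the UI adds another `fq` entry, which narrows the result set while the facet counts are recomputed for the remaining documents.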

Related

Solr randomized continuous testing framework

Are there any really good testing frameworks for Solr? I heard about Randomized Continuous Testing. Does anyone know how to use Randomized Continuous Testing for Solr?
This question is far too broad. There are many test frameworks you can use for Solr (.. almost all of them), and you can use either a live backend or the Embeddable Solr Server for integration testing.
If you watch to the end of the presentation, Dawid gives several examples of how to do randomized testing with Solr and Lucene, including linking to LUCENE-3492 which tracks the issue in Lucene and Solr. There is a Wiki page that contains information about how to run the internal Solr tests.

Can we customize Lucene which is embedded in Solr?

Can we customize the Lucene that is embedded in Solr just as we can customize raw Lucene, so that we have "everything" in Solr that we have in Lucene?
I am asking because we are stuck deciding between Solr and Lucene. The thinking goes like this:
Argument 1:
"We might hit a dead end in the future if we choose Solr, so Lucene is the better choice. We might as well start writing HTTP wrappers and almost half of Solr ourselves on top of Lucene to be on the safe side."
Argument 2:
"Solr already has all the features we want to use, so why not just use it? Since people who commit to Lucene are also responsible for committing to Solr, all features of Lucene are available to Solr too..."
I went through many blogs and posts that say something like:
"For situations where you have very customized requirements requiring low-level access to the Lucene API classes, Solr would be more a hindrance than a help, since it is an extra layer of indirection."
- http://www.lucenetutorial.com/lucene-vs-solr.html
One way of defending Argument 2 is to confirm that we can customize the underlying Lucene in Solr just as we would if we had only Lucene.
Can someone provide a better way of closing this argument? :)
PS: We need a fast search with indexing and sharding terabytes of data...
Can we customize Lucene which is embedded in Solr?
Yes, you can. But keep this in mind:
Lucene and Solr committers are some of the foremost experts in the field of full-text search. They have several years of experience in this field. If you think you can do better than them, then go ahead and change Solr to your needs (it's Apache-licensed so there aren't any commercial restrictions), and if you do so try to do it so that you can later contribute it back to the project so everyone can benefit and the project moves forward.
For the vast majority of Solr users though, the stock product is more than enough and satisfies all needs.
In other words, before jumping in to change the code, ask on Stack Overflow or the solr-user mailing list; there's a good chance that you don't really need to change any code.
"Fast search with indexing and sharding terabytes of data" is precisely what Solr was built for. It would be a bad case of Not-Invented-Here not to use it or any of the other similar solutions, such as ElasticSearch, Sphinx, Xapian, etc. If you think you'll need to customize or extend the search server in any way, consider the license and underlying code of each one. Solr and ElasticSearch are both Apache-licensed so they don't have commercial restrictions and are built on top of Lucene, a well-known library.

Full text search for Rails 3

I’m currently evaluating full text search methods for Rails 3. Does anyone here have a recommendation? It seems to me that most of the known methods (Sunspot, Sphinx, Ferret, Xapian) aren’t yet ready for Rails 3. Is that so? At the moment I’ve got plenty of resources left on the machine where I’d like to deploy my app, but nevertheless I’d like to keep the idle load of the search engine as low as possible. I’m planning to use PostgreSQL, if that’s of any relevance here.
After some reading I’m almost sure that I’d like to use Sunspot or Xapian. But if there’s any other (and better) solution, please tell me :-) Especially regarding Sunspot, I’m not sure whether it is wise to have a complete Tomcat running in addition to my Rails app. Does anyone have experience with this setup?
Thanks in advance,
Ulf
If you are using PostgreSQL you can get an awful lot out of its built-in text search capabilities before you need to reach for external libraries. I've been using tsearch queries for years with excellent results.
PostgreSQL full text search analyses word proximity to calculate relevance and ranking, and offers useful features like highlighting of search results.
It is also aware of language specific normalisation rules, for example it knows to ignore the s and es pluralization suffixes in English; so searches for 'country' will also bring back highlighted results for 'countries', much the same way that Google does.
I'm not suggesting that you shouldn't use the libraries that you've mentioned, but it is worth investigating the database to see if it will already fulfil the majority, if not all, of your requirements.
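The built-in features described above (ranking, highlighting, English stemming) can be sketched as a single query. This is a minimal illustration, not a tuned setup: the table name `articles`, the column `body`, the `id` column and the `%(terms)s` placeholder style (as used by psycopg2) are assumptions, while `to_tsvector`, `plainto_tsquery`, `ts_rank` and `ts_headline` are standard PostgreSQL functions.

```python
def build_search_sql(table, column, language="english"):
    """Return a ranked, highlighted full text search query as a SQL string.

    The search terms are left as the named parameter %(terms)s so the
    database driver can bind them safely. The table and column names are
    interpolated directly, so they must come from trusted code, not users.
    """
    return (
        "SELECT id, "
        "ts_rank(to_tsvector('{lang}', {col}), query) AS rank, "
        "ts_headline('{lang}', {col}, query) AS snippet "
        "FROM {tbl}, plainto_tsquery('{lang}', %(terms)s) AS query "
        "WHERE to_tsvector('{lang}', {col}) @@ query "
        "ORDER BY rank DESC LIMIT 10"
    ).format(lang=language, col=column, tbl=table)


# Hypothetical table and column names.
sql = build_search_sql("articles", "body")
print(sql)
```

With a driver such as psycopg2 this would be run as `cursor.execute(sql, {"terms": "country"})`. Because the `english` configuration stems words, a search for "country" should also match rows containing "countries". For larger tables, a GIN index on the `to_tsvector` expression is the usual way to keep such queries fast.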
You can use sunspot with Rails3, no problem. We have done so successfully using the sunspot/sunspot_rails gems (1.2.rc4). And it's not too much of a hassle to run Solr within a Tomcat server.
For fulltext-search features you should use a search engine.
For example, you could use the Lucene library with JRuby.
If you'd like to stay with standard Ruby (CRuby), you could use Solr.
For Rails there are also some Solr plugins.
For example, http://wiki.apache.org/solr/SolRuby could be a good starting point.
Sunspot is Rails 3 ready; we're using it on a few Rails 3 apps already. I've had a lot of success with Solr and Sunspot, so much so that we're starting a blog series on it.

Lucene.Net and Geosearch - is it out there somewhere?

I've found an interesting article about Lucene and geosearching:
http://sujitpal.blogspot.com/2008/02/spatial-search-with-lucene.html
Is there an equivalent .NET implementation out there that I have been unable to find, or do I have to rework the Java code in his example to fit the .NET Framework?
I came across this article as well. I did not find a .NET-specific implementation in my Googling, so I am planning on porting this code when the need arises. Right now, I am just getting my feet wet with Lucene.NET and am not yet comfortable enough with it to start extending it.
The code in the article appears to be a derived example of the conceptual geo-distance functionality outlined in Lucene In Action. Although the book is based on the Java product, it is a great read. The samples port easily and it is full of information.
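The geo-distance idea itself is independent of the platform. As a rough sketch of the concept only (this is neither the article's code nor any Lucene or Lucene.NET API), a great-circle distance filter over stored latitude/longitude values can be written with the haversine formula. The function names and the sample points are hypothetical, and the Earth radius is the usual mean-radius approximation.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_radius(points, center, radius_km):
    """Keep the (name, lat, lon) tuples that lie within radius_km of center."""
    return [
        p for p in points
        if haversine_km(center[0], center[1], p[1], p[2]) <= radius_km
    ]


# Hypothetical documents with stored coordinates.
places = [("near", 0.0, 0.5), ("far", 0.0, 5.0)]
print(within_radius(places, (0.0, 0.0), 100.0))
```

In a search engine this kind of exact check is typically applied only to candidates that already passed a cheaper indexed filter, such as a bounding box, since computing the distance for every document does not scale.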
The latest Lucene.Net contrib folder contains a spatial contribution for performing geosearch; see
https://svn.apache.org/repos/asf/incubator/lucene.net/tags/Lucene.Net_2_9_1/contrib/Spatial.Net/
With Lucene.NET 3.0.3, soon to be released, there is a brand new spatial contrib. See:
http://www.code972.com/blog/2012/05/the-future-of-geo-spatial-searches-with-lucene/
There is a worked example at https://www.leapinggorilla.com/Blog/Read/1010/spatial-search-in-lucenenet---worked-example
Regards
Ismail

How do we create a simple search engine using Lucene, Solr or Nutch?

Our company has thousands of PDF documents. How do we create a simple search engine using Lucene, Solr or Nutch? We'll provide a basic Java/JSP web page where people can type in words and perform basic AND/OR queries, then show them the document links of all matching PDFs.
I have had good luck with Lucene, but it is not click, install and search; it does require a bit of work.
If you need something that you can download, install and be searching with inside 10 minutes, look at the free OmniFind Yahoo! Edition (http://omnifind.ibm.yahoo.net/). It uses Lucene but is packaged so that it is configured and ready to run upon install, which makes it a much easier way to try Lucene.
Nutch plus Lucene, with the PDF plugin enabled in Nutch, is your solution. Nutch can parse PDFs once the PDF plugin is enabled.
Lucene will allow you to index the crawled and parsed data, and Nutch has a servlet which gives you a search interface.
We use the same setup for our internal LANs.
None of the projects in the Lucene family can natively process PDFs, but there are utilities you can drop in and well-written examples of how to roll your own.
Lucene will do pretty much whatever you need it to do, but there is overhead in terms of your time, as Tony said above. Thousands of documents really isn't that many, so you might be able to get away with a lighter weight alternative.
That said, I would still recommend looking at Solr - it's much, much easier to set up than Lucene, has support for backups, replication, etc., as well as a nifty JSON interface which would fit your use case very well: http://wiki.apache.org/solr/SolJSON
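To illustrate why the JSON interface fits this use case, here is a minimal sketch of turning a Solr JSON response into a list of document links. The nesting under `response` and `docs` follows Solr's standard JSON response layout; the sample data, the `id` and `title` fields and the link prefix are hypothetical and depend on your schema.

```python
import json

# Hand-written sample in the shape of a Solr JSON response (wt=json).
SAMPLE_RESPONSE = """
{
  "responseHeader": {"status": 0, "QTime": 2},
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      {"id": "doc1.pdf", "title": "Annual report"},
      {"id": "doc2.pdf", "title": "Safety manual"}
    ]
  }
}
"""


def extract_links(raw_json, link_prefix):
    """Return (label, url) pairs for every matching document."""
    data = json.loads(raw_json)
    docs = data["response"]["docs"]
    # Fall back to the id as the label when a document has no title.
    return [(doc.get("title", doc["id"]), link_prefix + doc["id"]) for doc in docs]


links = extract_links(SAMPLE_RESPONSE, "http://intranet.example/pdfs/")
for label, url in links:
    print(label, url)
```

The web page then only needs to render these pairs as hyperlinks; the same parsing applies in any language with a JSON library.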
Google Search Appliance http://www.google.com/enterprise/gsa/
I think you want a system to manage your PDF files. Try DSpace (www.dspace.org). DSpace is a digital library system whose search is based on Lucene.
Take a look at EPrints. It includes a workflow for adding new documents, automatically indexes and thumbnails PDFs, and has fairly comprehensive full text search functionality. It can also be easily customised and branded.
Why re-invent the wheel? Again.
Answering such a broad question in this forum will be tough. I'd recommend you check out the book Lucene in Action, which covers the basics of indexing and searching in a quite readable fashion.
Given your application, it sounds like Nutch and Solr probably will not be necessary. Since all of your documents are available locally, Nutch probably won't be helpful. Solr may help you manage a cluster of searchers if you have a high query load, but Lucene is highly performant, and handles large document sets in a very scalable manner.
The one area that might consume a lot of your effort is the use of PDF. It's possible to index PDF documents, and there are Lucene contributions to facilitate the extraction of raw text from PDFs, but depending on the document, the quality of results can vary. Often, the context of a keyword in a PDF document is unclear because of formatting instructions, and that can make it hard to do proximity searches or show the context of a hit.
A great free search technology you might look at is the IBM Yahoo! free search. I'm not sure whether they followed through on plans to use Lucene under the covers, but it remains one of the really great, easy-to-use free search technologies. It handles up to 500K documents, I believe, and it supports PDF and other non-text formats as well. It has a graphical user interface, easy-to-customize search results, basic search analytics, a basic thesaurus, and a powerful API, so you can do pretty much whatever you want if the out-of-the-box results are not to your liking. We've suggested this to a number of clients with fewer than half a million documents, and they love it.
If you have a Linux server, you could use Beagle to index them, and then just use the search functionality that comes with it. It has an (experimental) web search interface, and it can be hooked into the Firefox search box as well.
It automatically indexes files as they're added, and I suspect that you'll find it much more efficient to enhance or fix Beagle than to write your own search interface to Lucene.
Having the (IMHO) distinct advantage of being on a Mac, I use SearchLight on a somewhat older G5. It is a nice web interface to Spotlight, Mac OS's built-in indexing service.