mg4j vs. Apache Lucene

Can anyone provide a simple comparative analysis of these search engines? What advantages does either framework have?
BTW, I've seen the following basic explanations of choosing mg4j from several academic papers:
combining indices over the same collection
multi-index queries
Update:
These slides (from mir2ed.org) contain a more recent overview of open source search engines, including Lucene and mg4j, with benchmarks covering memory and CPU usage, index size, search performance, search quality, etc.

Jeff Dalton reviewed many open source search engines including Lucene and mg4j in 2007, and updated the comparison in 2009.
I have not used mg4j. I have used Lucene, though. The number one feature of Lucene IMO is its wide adoption and wonderful community of users/developers/committers. This means that there is a fair chance that somebody worked on a use case similar to yours using Lucene.
Current weak points of Lucene are its scoring model and its ability to scale to large collections of text. The Lucene developers are working on these issues.
I believe that the choice of a search library is very dependent on your (academic or industrial) setting, the other parts of your application and your use case.

Related

Enterprise search platform vs General purpose search

I have a question about Solr. It is described as an enterprise search platform. Are there Enterprise oriented search platforms and general purpose search platforms? Can't you just use Solr for example to build a general purpose search engine? If there is such a distinction what are the major differences between them?
Enterprise is a vague term tacked on to things to say "Yes, you can totally use this in professional projects, it's super good". It's baloney, in short. When reading the front page of a software product (or any product really), I find it useful to ignore all adjectives and adverbs, which makes that first sentence on the Solr page read: "Solr is the search platform from the Apache Lucene project."
Don't know why I don't get hired to write ad copy.
I think it would be fair to say that Solr is a general purpose search server, sure (depending on what general purpose entails to you, of course). It indexes data, allows you to search it, and provides a lot of tools to do that in the way that best suits your data and users.
The term Search is overloaded with lots of semantics. It is often used to describe an action, a function or a technology. But more important with respect to the question is the fact that there are two common kinds of "search projects": Web Search and Enterprise Search projects.
Web Search is typically about indexing content from one kind of content source (web servers) serving content in HTML format. Most often it's only about public content, and document level security is not an issue. A typical example of this kind of solution is Google's Web Search, but most full-text Site Search solutions can also be seen as good examples of this category. For a basic solution, a crawler, an HTML markup removal tool, an indexing library and some "glue" are sufficient. Apache Nutch, or Apache Solr or ElasticSearch in combination with a web crawler, are good candidates for implementing this kind of solution.
Enterprise Search is typically about integrating content in various formats from multiple content sources. Typical examples of this kind of solution are corporate intranets, but Search Based Applications often also fall into this category. Those solutions typically come with additional requirements such as support for document level security, advanced linguistics, metadata extraction, data mappings and enrichments, synonyms etc. The projects are more complex and a more complex technology stack is needed. While Apache Solr or ElasticSearch can both be used, a lot of the required functionality is not part of the standard download and needs to be developed or integrated as part of the project. For both Apache Solr and ElasticSearch, however, there are also commercial distributions available that already extend the functionality of the standard download in the direction of Enterprise Search. Other good alternatives are commercial search engines.
I agree with @femtoRgon that Solr:
is a good General Purpose Search Platform
and not an Enterprise Search Platform
but an Enterprise Search Platform can be built with Solr
Solr is a search platform that can be customized for either general purpose search or Enterprise Search solutions. As suggested by Daniel in the previous comments, an Enterprise Search (ESearch) application is used by an enterprise/organization to search the organization's internal data, and in some cases external content as well, but only content related to the organization. Enterprises generally use various systems, either developed internally or supplied by a vendor, and the ESearch application should be able to connect to those internal systems and index their content, including the different file types, the metadata and, importantly, the security settings associated with each document.
To conclude, Solr is a search system that can be used to index and search content either as a general purpose application or as an ESearch application for an organization.

Entity Extraction/Recognition with free tools while feeding Lucene Index

I'm currently investigating options for extracting person names, locations, tech words and categories from text (a lot of articles from the web), which will then be fed into a Lucene/ElasticSearch index. The additional information is added as metadata and should increase the precision of the search.
E.g. when someone queries 'wicket' he should be able to decide whether he means the cricket sport or the Apache project. I tried to implement this on my own with minor success so far. Now I have found a lot of tools, but I'm not sure whether they are suited to this task, which of them integrate well with Lucene, or whether the precision of their entity extraction is high enough.
Dbpedia Spotlight, the demo looks very promising
OpenNLP requires training. Which training data to use?
OpenNLP tools
Stanbol
NLTK
balie
UIMA
GATE -> example code
Apache Mahout
Stanford CRF-NER
maui-indexer
Mallet
Illinois Named Entity Tagger Not open source but free
wikipedianer data
My questions:
Does anyone have experience with some of the tools listed above and their precision/recall? Or know whether training data is required and available?
Are there articles or tutorials where I can get started with entity extraction(NER) for each and every tool?
How can they be integrated with Lucene?
Here are some questions related to that subject:
Does an algorithm exist to help detect the "primary topic" of an English sentence?
Named Entity Recognition Libraries for Java
Named entity recognition with Java
The problem you are facing in the 'wicket' example is called entity disambiguation, not entity extraction/recognition (NER). NER can be useful, but only when the categories are specific enough. Most NER systems don't have enough granularity to distinguish between a sport and a software project (both would fall outside the typically recognized types: person, org, location).
For disambiguation, you need a knowledge base against which entities are disambiguated. DBpedia is a typical choice due to its broad coverage. See my answer to How to use DBPedia to extract Tags/Keywords from content?, where I provide more explanation and mention several tools for disambiguation, including:
Zemanta
Maui-indexer
Dbpedia Spotlight
Extractiv (my company)
These tools often use a language-independent API like REST, and I do not know that they directly provide Lucene support, but I hope my answer has been beneficial for the problem you are trying to solve.
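One way to picture the integration step, independent of which tool you pick: take whatever (text, type) pairs the extractor returns and fold them into the document as extra metadata fields before it goes to the index. This is only a sketch under assumptions. The `entity_<type>` field naming and the `enrich_document` helper are made up for illustration and are not part of Lucene, ElasticSearch or any of the tools above.

```python
import json
from collections import defaultdict


def enrich_document(doc, entities):
    """Return a copy of doc with one extra list field per entity type.

    entities is an iterable of (surface_text, entity_type) pairs, e.g. the
    parsed output of an external extraction service (hypothetical input shape).
    """
    by_type = defaultdict(set)
    for text, entity_type in entities:
        by_type[entity_type.lower()].add(text)

    enriched = dict(doc)  # leave the caller's document untouched
    for entity_type, names in by_type.items():
        # Hypothetical field naming scheme: entity_person, entity_location, ...
        enriched[f"entity_{entity_type}"] = sorted(names)
    return enriched


if __name__ == "__main__":
    doc = {"id": "1", "body": "Welcome to Apache Wicket, presented in London."}
    found = [("Apache Wicket", "Software"), ("London", "Location")]
    # The resulting JSON could be posted to whatever indexer you use.
    print(json.dumps(enrich_document(doc, found), indent=2))
```

A query for 'wicket' could then be narrowed by filtering on the added fields, assuming the extractor's types are fine-grained enough.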
You can use OpenNLP to extract names of people, places and organisations without training. You just use the pre-existing models, which can be downloaded from here: http://opennlp.sourceforge.net/models-1.5/
For an example of how to use one of these models see: http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind
Rosoka is a commercial product that provides a computation of "Salience" which measures the importance of the term or entity to the document. Salience is based on the linguistic usage and not the frequency. Using the salience values you can determine the primary topic of the document as a whole.
The output is in your choice of XML or JSON which makes it very easy to use with Lucene.
It is written in java.
There is an Amazon Cloud version available at https://aws.amazon.com/marketplace/pp/B00E6FGJZ0. The cost to try it out is $0.99/hour. The Rosoka Cloud version does not have all of the Java API features available to it that the full Rosoka does.
Yes, both versions perform entity and term disambiguation based on the linguistic usage.
Disambiguation, whether by a human or by software, requires enough contextual information to tell the senses apart. The context may be contained within the document, within a corpus constraint, or within the context of the user. The first is the most specific, and the last has the greatest potential for ambiguity. For example, typing the keyword "wicket" into a Google search could refer to cricket, the Apache software, or the Star Wars Ewok character (i.e. an entity). The sentence "The wicket is guarded by the batsman" has contextual clues within the sentence to interpret it as an object. "Wicket Wystri Warrick was a male Ewok scout" should interpret "Wicket" as the given name of the person entity "Wicket Wystri Warrick". "Welcome to Apache Wicket" has the contextual clues that "Wicket" is part of a place name, etc.
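To make the idea of contextual clues concrete, here is a deliberately naive sketch that scores each candidate sense of "wicket" by how many of its cue words appear in the surrounding text. It is not how Rosoka or any other product mentioned here works. The sense labels and cue word lists are invented for the example, and a real system would draw them from a knowledge base.

```python
import re

# Hypothetical cue words per sense; a real system would not hard-code these.
SENSE_CUES = {
    "cricket": {"batsman", "bowler", "stumps", "innings", "guarded"},
    "apache_wicket": {"apache", "java", "framework", "web", "component"},
    "ewok": {"ewok", "endor", "scout", "warrick"},
}


def disambiguate(text, cues=SENSE_CUES):
    """Return the sense whose cue words overlap most with text, or None."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    scores = {sense: len(tokens & words) for sense, words in cues.items()}
    best = max(scores, key=scores.get)
    # With no contextual clue at all, refuse to guess.
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    for sentence in (
        "The wicket is guarded by the batsman",
        "Welcome to Apache Wicket",
        "Wicket Wystri Warrick was a male Ewok scout",
        "wicket",
    ):
        print(sentence, "->", disambiguate(sentence))
```

The bare keyword query comes back as `None`, which is the point made above: without context there is nothing to disambiguate against.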
Lately I have been fiddling with Stanford CRF-NER. They have released quite a few versions: http://nlp.stanford.edu/software/CRF-NER.shtml
The good thing is you can train your own classifier. You should follow the link which has the guidelines on how to train your own NER. http://nlp.stanford.edu/software/crf-faq.shtml#a
Unfortunately, in my case, the named entities are not efficiently extracted from the document. Most of the entities go undetected.
Just in case you find it useful.

Can we customize Lucene which is embedded in Solr?

Can we customize the Lucene that is embedded in Solr, just as we can with raw Lucene? So that we can have "everything" in Solr that we have in Lucene?
I am asking this because we are stuck at the point of deciding between Solr and Lucene, thinking like so:
Argument 1 :
"We might hit a dead zone in future if
we choose Solr, and Lucene is a better
choice hence... So we might as well
start writing HTTP wrappers and almost
half of Solr ourselves on top of
Lucene to be on safer side. "
Argument 2 :
"Solr already has all the features we
want to use, so why not just use it ?
Since people who commit to Lucene are
also responsible for committing to
Solr, all features of Lucene are
available to Solr too..."
I went through many blogs and posts that say something like :
For situations where you have very customized requirements requiring
low-level access to the Lucene API classes, Solr would be more a
hindrance than a help, since it is an extra layer of indirection.
-http://www.lucenetutorial.com/lucene-vs-solr.html
One way of defending Argument 2 is by confirming that we can customize the underlying Lucene in Solr just like we would do if we had only Lucene.
Can someone provide a better way of closing this argument ? :)
ps : We need a fast search with indexing and sharding terabytes of data...
Can we customize Lucene which is embedded in Solr ?
Yes, you can. But keep this in mind:
Lucene and Solr committers are some of the foremost experts in the field of full-text search. They have several years of experience in this field. If you think you can do better than them, then go ahead and change Solr to your needs (it's Apache-licensed so there aren't any commercial restrictions), and if you do so try to do it so that you can later contribute it back to the project so everyone can benefit and the project moves forward.
For the vast majority of Solr users though, the stock product is more than enough and satisfies all needs.
In other words, before jumping in to change the code, ask on a mailing list (stackoverflow or solr-user), there's a good chance that you don't really need to change any code.
"Fast search with indexing and sharding terabytes of data" is precisely what Solr was built for. It would be a bad case of Not-Invented-Here not to use it or any of the other similar solutions, such as ElasticSearch, Sphinx, Xapian, etc. If you think you'll need to customize or extend the search server in any way, consider the license and underlying code of each one. Solr and ElasticSearch are both Apache-licensed so they don't have commercial restrictions and are built on top of Lucene, a well-known library.

Where do I begin learning Lucene.NET Solr Hadoop and MapReduce?

I'm a .NET developer and I need to learn Lucene so we can run a very large scale search service that removes entries that the end user doesn't have access to (i.e. a user can search for all documents with clearance level 3 or higher, but not clearance level 2 or 1).
Where do I start learning, which products should I consider? To be honest, I'm a little overwhelmed, but I'm determined to figure it all out... eventually.
If you want a book that covers all the basics of Lucene, consider "Lucene in Action". Even though the code samples are Java, you can easily port them to .NET. Of course, there also are tonnes of resources on the web, such as SO and the Lucene mailing lists which should help you along.
For the project you describe, you should look at Solr, since it abstracts away many of the scalability issues and can easily be integrated into your .NET app via SolrNet. To restrict access by level, your index documents should contain a field called "Level" (say), and behind the scenes you append a "Level:Level-1" clause to the user's query, using a boolean query construct.
At this stage, my recommendation would be to stay away from Hadoop (Apache's MapReduce implementation) for your project and stick with Solr. If you are keen to learn about it anyway, it too has a very useful book: you guessed it, "Hadoop in Action" (also from Manning Publications).
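As a rough illustration of the "append a restriction behind the user's query" idea, here is a sketch that builds the parameters for a Solr select request. It assumes the "Level" field is indexed as a number and that a user may see documents at or above a given level, as in the question. It also uses Solr's `fq` filter query parameter with a range clause instead of a boolean clause inside `q`, which is a swap on my part. The `build_secure_query` helper is made up for this example.

```python
from urllib.parse import urlencode


def build_secure_query(user_query, min_level, field="Level"):
    """Build a Solr query string that hides documents below min_level.

    The restriction is added server-side, so the end user never gets to
    edit it. The field name "Level" is an assumption taken from the answer.
    """
    level = int(min_level)  # reject anything that is not a plain number
    params = [
        ("q", user_query),                  # what the user typed
        ("fq", f"{field}:[{level} TO *]"),  # range filter: level and above
        ("wt", "json"),
    ]
    return urlencode(params)


if __name__ == "__main__":
    # Append the result to your Solr core's select handler URL.
    print(build_secure_query("annual report", 3))
```

From .NET you would set the same parameters through SolrNet rather than building the string by hand.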
You seem to be confused about what exactly each project (Lucene/Solr/Hadoop/etc) does. So the first thing to do would be understanding the purpose of each project. Read the docs and blogs about them. If possible, buy and read books about them.
For example, MapReduce and Hadoop have nothing to do with your security requirements. Hadoop is a platform for distributed, scalable computing. But Solr is scalable on its own. You might want to use Hadoop to distribute a crawler though (e.g. Nutch).

How do we create a simple search engine using Lucene, Solr or Nutch?

Our company has thousands of PDF documents. How do we create a simple search engine using Lucene, Solr or Nutch? We'll provide a basic Java/JSP web page where people can type in words and perform basic and/or queries, then show them the document links of all matching PDFs.
I have had good luck with Lucene, but it is not click, install and search; it does require a bit of work.
If you need something that you can download and install and be searching within 10 minutes, look at the free OmniFind Yahoo Edition http://omnifind.ibm.yahoo.net/. It uses Lucene, but is packaged so that it is configured and ready to run upon install, which is a much easier way to try Lucene.
Nutch + Lucene + Pdf plugin enabled in Nutch is your solution. Nutch allows you to parse pdfs by enabling the pdf plugin.
Lucene will allow you to index the crawled and parsed data, and Nutch has a servlet which gives you a search interface.
We use the same setup for our internal LANs.
None of the projects in the Lucene family can natively process PDFs, but there are utilities you can drop in and well written examples on how to roll your own.
Lucene will do pretty much whatever you need it to do, but there is overhead in terms of your time, as Tony said above. Thousands of documents really isn't that many, so you might be able to get away with a lighter weight alternative.
That said, I would still recommend looking at Solr - it's much, much easier to set up than Lucene, has support for backups, replication, etc., as well as a nifty JSON interface which would fit your use case very well: http://wiki.apache.org/solr/SolJSON
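To give an idea of what consuming that JSON looks like, here is a small sketch that pulls document links out of a Solr-style response body. The sample response below is hand-written for the example, and the `id` and `url` fields are assumptions about how you might set up your schema, not something Solr provides out of the box.

```python
import json

# Hand-written sample in the general shape of a Solr JSON response.
SAMPLE_RESPONSE = """
{
  "responseHeader": {"status": 0, "QTime": 2},
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      {"id": "doc-1", "url": "/pdfs/handbook.pdf"},
      {"id": "doc-2", "url": "/pdfs/policy.pdf"}
    ]
  }
}
"""


def extract_links(raw_json, link_field="url"):
    """Return the link field of every matching document, in result order."""
    body = json.loads(raw_json)
    docs = body.get("response", {}).get("docs", [])
    # Skip documents that were indexed without the (hypothetical) link field.
    return [doc[link_field] for doc in docs if link_field in doc]


if __name__ == "__main__":
    for link in extract_links(SAMPLE_RESPONSE):
        print(link)
```

In a JSP front end you would do the equivalent with a Java JSON library or a Solr client; the response shape is the same.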
Google Search Appliance http://www.google.com/enterprise/gsa/
I think you want a system to manage your PDF files. Try DSpace (www.dspace.org). DSpace is a digital library system, and its search is based on Lucene.
Take a look at eprints. It includes a workflow for adding new documents, automatically indexes and thumbnails PDFs and has fairly comprehensive full text search functionality. It can also be easily customised and branded.
Why re-invent the wheel. Again.
Answering such a broad question in this forum will be tough. I'd recommend you check out the book Lucene in Action, which covers the basics of indexing and searching in a quite readable fashion.
Given your application, it sounds like Nutch and Solr probably will not be necessary. Since all of your documents are available locally, Nutch probably won't be helpful. Solr may help you manage a cluster of searchers if you have a high query load, but Lucene is highly performant, and handles large document sets in a very scalable manner.
The one area that might consume a lot of your effort is the use of PDF. It's possible to index PDF documents, and there are Lucene contributions to facilitate the extraction of raw text from PDFs, but depending on the document, the quality of results can vary. Often, the context of a keyword in a PDF document is unclear because of formatting instructions, and that can make it hard to do proximity searches or show the context of a hit.
A great free search technology you might look at is the IBM Yahoo! free search. I'm not sure whether they followed through on plans to use Lucene under the covers, but it remains one of the really great, easy to use free search technologies. It handles up to 500K documents, I believe, and it supports PDF and other non-text formats as well. It has a graphical user interface, easy customization of search results, basic search analytics, a basic thesaurus, and a powerful API, so you can do pretty much whatever you want if the out of the box results are not to your liking. We've suggested this to a number of clients who had fewer than half a million documents, and they love it.
If you have a Linux server, you could use Beagle to index them, and then just use the search functionality that comes with it. It has an (experimental) web search interface, and it can be hooked into the Firefox search box as well.
It automatically indexes files as they're included, and I'd suspect that you'll find it much more efficient to enhance or fix beagle than to write your own search interface to Lucene.
Having the (imho) distinct advantage of being on a Mac, I use SearchLight on a somewhat older G5. It offers a nice web interface to Spotlight, Mac OS's built-in indexing service.