Enterprise search platform vs General purpose search - apache

I have a question about Solr. It is described as an enterprise search platform. Are there enterprise-oriented search platforms and general-purpose search platforms? Can't you just use Solr, for example, to build a general-purpose search engine? If there is such a distinction, what are the major differences between them?

Enterprise is a vague term tacked on to things to say "Yes, you can totally use this in professional projects, it's super good". It's baloney, in short. When reading the front page of a software product (or any product really), I find it useful to ignore all adjectives and adverbs, which makes that first sentence on the Solr page read: "Solr is the search platform from the Apache Lucene project."
Don't know why I don't get hired to write ad copy.
I think it would be fair to say that Solr is a general purpose search server, sure (depending on what general purpose entails to you, of course). It indexes data, allows you to search it, and provides a lot of tools to do that in the way that best suits your data and users.

The term Search is overloaded with lots of semantics. It is often used to denote or describe an action, a function, or a technology. But more important with respect to the question is the fact that there are two common kinds of "search projects": Web Search and Enterprise Search projects.
Web Search is typically about indexing content from one kind of content source (web servers) serving content in HTML format. Most often it deals only with public content, and document-level security is not an issue. A typical example of this kind of solution is Google's Web Search, but most full-text site search solutions can also be seen as good examples of this category. For a basic solution, a crawler, an HTML markup removal tool, an indexing library, and some "glue" are sufficient. Apache Nutch, or Apache Solr and Elasticsearch in combination with a web crawler, are good candidates for implementing this kind of solution.
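To make the "crawler + markup removal + indexing library + glue" idea concrete, here is a minimal sketch (not a production setup) that strips HTML with jsoup and pushes the text into Solr via SolrJ; the core name and the id/title/content field names are assumptions that depend on your schema:

```java
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SimpleWebIndexer {
    public static void main(String[] args) throws Exception {
        // The crawler part is omitted; assume it hands us a URL and the raw HTML it fetched.
        String url = "http://example.com/page.html";
        String html = "<html><head><title>Example</title></head><body><p>Hello world</p></body></html>";

        // Markup removal: jsoup parses the HTML and gives us the plain text plus the title.
        Document parsed = Jsoup.parse(html);

        // Indexing library: send the stripped text to a Solr core (names are illustrative).
        try (HttpSolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/websearch").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", url);
            doc.addField("title", parsed.title());
            doc.addField("content", parsed.text());
            solr.add(doc);
            solr.commit();
        }
    }
}
```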
Enterprise Search is typically about integrating content in various formats from multiple content sources. Typical examples of this kind of solution are corporate intranets, but Search Based Applications often also fall into this category. These solutions typically come with additional requirements such as support for document-level security, advanced linguistics, metadata extraction, data mappings and enrichment, synonyms, etc. The projects are more complex, and a more complex technology stack is needed. While Apache Solr or Elasticsearch can both be used, a lot of the required functionality is not part of the standard download and needs to be developed or integrated as part of the project. For both Apache Solr and Elasticsearch there are also commercial distributions available that already extend the standard download in the direction of Enterprise Search. Other good alternatives are commercial search engines.
I agree with femtoRgon that Solr:
is a good General Purpose Search Platform
and not an Enterprise Search Platform
but an Enterprise Search Platform can be built with Solr

Solr is a search platform that can be customized either for general-purpose search or for Enterprise Search solutions. As suggested by Daniel in the previous comments, an enterprise search application is used specifically by an enterprise/organization to search the organization's internal data, and in some cases it can search external content as well, but only content related to the organization. Enterprises generally use various systems, either internally developed or supplied by a vendor, and the enterprise search application should be able to connect to those internal systems and index their content, including the different file types, metadata, and, importantly, the security associated with each and every document.
To conclude, Solr is a search system that can be used to index and search content either as a general-purpose search engine or as an enterprise search application for an organization.
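As a sketch of how the document-level security mentioned above is often enforced, one common pattern is to index an access-control field with each document and restrict every query with a filter on the current user's groups. The core name, the acl_groups field, and the group values below are hypothetical; real deployments usually resolve the groups from a directory service rather than hard-coding them:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SecureSearch {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/intranet").build()) {
            SolrQuery query = new SolrQuery("quarterly report");
            // Only return documents whose ACL list contains one of the user's groups.
            query.addFilterQuery("acl_groups:(finance OR all_employees)");
            QueryResponse response = solr.query(query);
            System.out.println(response.getResults().getNumFound()
                    + " documents visible to this user");
        }
    }
}
```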

Related

What exactly are the UMLS and SNOMED-CT vocabularies used by cTAKES?

Very new to cTAKES and looking through the docs, I'm curious about what exactly the UMLS and SNOMEDCT "vocabularies" are. The user installation docs don't really seem to say, and simply applying for the UMLS license and reading the language around the UMLS Metathesaurus does not really divulge much more about the structure of the data being accessed. E.g., is it some online API service? Is it some files that come with the cTAKES download that can only be unlocked with a valid UMLS password that is checked against an online DB?
Info on what the UMLS Metathesaurus and SNOMEDCT are can be found here (https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/index.html) and here (https://www.ncbi.nlm.nih.gov/books/NBK9676/, specifically https://www.ncbi.nlm.nih.gov/books/NBK9684/):
The Metathesaurus is a very large, multi-purpose, and multi-lingual [relational?] vocabulary database that contains information about biomedical and health related concepts, their various names, and the relationships among them. Designed for use by system developers...
...The Metathesaurus contains concepts, concept names, and other attributes from more than 100 terminologies, classifications, and thesauri, some in multiple editions.
While I'm not sure how exactly cTAKES implements its use of the UMLS Metathesaurus (if anyone knows, please enlighten me), I assume that it accesses some API for a relational database, based on the UMLS credentials you need to add to the example scripts that come with the cTAKES download (see https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+4.0+User+Install+Guide#cTAKES4.0UserInstallGuide-(Recommended)AddUMLSaccessrights).
...You may select from two relational formats: the Rich Release Format (RRF), introduced in 2004, and the Original Release Format (ORF).
(I think) this is what powers the UIMA analysis engines used to process text in cTAKES:
UIMA is an architecture in which basic building blocks called Analysis Engines (AEs) are composed in order to analyze a document [...] How Annotators represent and share their results is an important part of the UIMA architecture. To enable composition and reuse, UIMA defines a Common Analysis Structure (CAS) precisely for these purposes. The CAS is an object-based container that manages and stores typed objects having properties and values, https://www.ibm.com/developerworks/data/downloads/uima/#How-does-it-work
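To illustrate the AE/CAS relationship in code, here is a toy example using the uimaFIT helper library. It is not cTAKES-specific, and the LengthAnnotator class and the sample sentence are made up purely for illustration; a real cTAKES pipeline composes many AEs that write typed annotations into the CAS:

```java
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
import org.apache.uima.fit.factory.AnalysisEngineFactory;
import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.jcas.JCas;

public class UimaSketch {

    // A trivial Analysis Engine: it only reads the document text from the CAS.
    // Real cTAKES AEs would instead add typed annotations (tokens, named entities, ...) to the CAS.
    public static class LengthAnnotator extends JCasAnnotator_ImplBase {
        @Override
        public void process(JCas jcas) {
            System.out.println("Document length: " + jcas.getDocumentText().length());
        }
    }

    public static void main(String[] args) throws Exception {
        JCas jcas = JCasFactory.createJCas();                // the shared CAS container
        jcas.setDocumentText("Patient denies chest pain.");  // the document to analyze
        AnalysisEngine ae = AnalysisEngineFactory.createEngine(LengthAnnotator.class);
        ae.process(jcas);                                    // the AE reads from / writes to the CAS
    }
}
```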

What is webcenter?

I tried to understand what the use of this tool is, but could not understand much from blogs and Oracle docs.
My questions are:
What are the highlights/features of this tool which make any company or architect decide that this is the appropriate tool they need for their web application?
How is it different from other Java IDEs like NetBeans and Eclipse?
WebCenter is not a tool. It is a branding of three different technologies brought together to support user interaction via web sites: WebCenter Sites (formerly FatWire) for web sites and portlets, WebCenter Content (formerly Stellent) for content management, and WebCenter Portal, rebranded from Spaces (which is similar to MS SharePoint), a pre-built web site with support for collaboration that integrates with Sites and Content. We use JDeveloper to design and build the pages (ADF Faces - JSF) that make up the sites, we use the Sites tool to build the site, and WebCenter Content has its own interface to check in, edit, manage, and search content.
Lots here and here as well as: WCContent, Sites, Portals.
You might also want to download the free Virtual Box image here, and play with the software. There are myriad books and tutorials that can be found with a little searching. Also, consider taking a class with Oracle which will explain these technologies in detail.
Also, my aggregation site here lists many books, blogs, and tips that may help.

Which one is better for efficient free text search, Hibernate Search or Lucene?

We are developing a web application using Spring MVC, Spring and Hibernate.
We need to add efficient free text search capabilities to our application. For this we are thinking of using either Hibernate Search (it uses Lucene under the hood) or Lucene directly.
What is the best option for us as we are already using hibernate in our application? What are the pros and cons of one over the other?
Thanks.
You said it yourself - you'll be using Lucene one way or the other.
The raw Lucene API isn't very easy to use. It's much more low-level than Hibernate Search. If you're already using Hibernate, then it's a no-brainer - use Hibernate Search to implement your text search functionality.
Disclaimer: I'm one of the developers of Hibernate Search.
The goal of the project is not to compete with Lucene or Solr, but to integrate as smoothly as possible with Hibernate applications, so that you avoid having to keep the two worlds in sync and duplicate all mapping and CRUD operations.
While we provide some common helpers and a nice encapsulation, Hibernate Search can also hand you a direct reference to the Lucene API, so in case you find yourself needing the "raw" Lucene API you will never be stuck. Also, for writing to the index, Hibernate Search provides a common pattern that will cover most known requirements, but in case you have very non-standard requirements you can take full control of the written Documents.
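To give a flavour of what that integration looks like, here is a minimal sketch in the classic Hibernate Search 5.x annotation and query-DSL style (newer 6.x versions use a different API); the Book entity and its fields are made up for illustration:

```java
import java.util.List;
import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;

import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;
import org.hibernate.search.jpa.FullTextEntityManager;
import org.hibernate.search.jpa.Search;
import org.hibernate.search.query.dsl.QueryBuilder;

@Entity
@Indexed                  // Hibernate Search keeps a Lucene index in sync with this entity
public class Book {
    @Id @GeneratedValue
    private Long id;

    @Field                // analyzed and indexed as a full-text field
    private String title;

    @Field
    private String description;

    // getters and setters omitted
}

class BookSearch {
    @SuppressWarnings("unchecked")
    static List<Book> search(EntityManager em, String terms) {
        FullTextEntityManager ftem = Search.getFullTextEntityManager(em);
        QueryBuilder qb = ftem.getSearchFactory()
                .buildQueryBuilder().forEntity(Book.class).get();
        // Build a Lucene query through the DSL; a hand-built Lucene Query works here too.
        org.apache.lucene.search.Query luceneQuery = qb.keyword()
                .onFields("title", "description")
                .matching(terms)
                .createQuery();
        return ftem.createFullTextQuery(luceneQuery, Book.class).getResultList();
    }
}
```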
Solr is a good alternative, but as it is a separate server you have to interact with it via REST APIs, which is quite different, with its own pros and cons. Having a second service to manage is not always wanted, and of course the remote invocations will never be as efficient as direct references to Lucene and to all its internal filters and caches.
Not all functionality of Lucene can be exposed via a remote API, and if you need to do some "low level" operation that is not implemented in Solr, you won't be able to do it (without patching Solr). Still, Solr is very cute, especially when you want to share the index with other non-Java applications, so we might eventually add a Solr backend for Hibernate Search to keep a Solr server in sync (especially if there's interest in it, and possibly some help).
Finally, the Lucene API is really hard-core stuff. We spend a lot of effort to make the best use of it, providing top performance while exposing a stable API to people using Hibernate Search; basically, until now all releases have been backwards compatible, giving a "drop-in" performance boost from the latest and greatest tricks in Lucene, which actually changes its API quite often. Those changes are always exciting, but be prepared to maintain that in your application if you don't use a proper abstraction.
The other way of using Lucene is to go through the middleman API known as Solr. Solr sits on top of Lucene and you perform HTTP calls for search. Please note that you will need to build and parse the XML that Solr consumes and returns. Much of the functionality of Lucene is exposed via Solr, which should be really helpful.

Where do I begin learning Lucene.NET Solr Hadoop and MapReduce?

I'm a .NET developer and I need to learn Lucene so we can run a very large-scale search service that removes entries that the end user doesn't have access to (i.e., a user can search all documents with clearance level 3 or higher, but not clearance level 2 or 1).
Where do I start learning, which products should I consider? To be honest, I'm a little overwhelmed, but I'm determined to figure it all out... eventually.
If you want a book that covers all the basics of Lucene, consider "Lucene in Action". Even though the code samples are Java, you can easily port them to .NET. Of course, there also are tonnes of resources on the web, such as SO and the Lucene mailing lists which should help you along.
For the project you describe, you should look at Solr, since it abstracts away many of the issues of scalability etc. and, via SolrNet, can easily integrate into your .NET app. To restrict access by level, your index documents should contain a field called "Level" (say), and behind the scenes you combine the user's query with a clause on that field restricting results to the user's clearance level or higher, using a boolean query construct.
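As a rough sketch of that "append a clearance clause" idea, here it is with the raw Lucene API in Java (the Lucene.NET and SolrNet equivalents follow the same pattern, e.g. a filter query in Solr); the index path, the content field, and the numeric clearance field are assumptions for illustration:

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class ClearanceFilteredSearch {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);

            // What the user typed.
            Query userQuery = new QueryParser("content", new StandardAnalyzer()).parse("budget plan");
            // Security restriction: clearance level 3 or higher (assumes an IntPoint "clearance" field).
            Query clearance = IntPoint.newRangeQuery("clearance", 3, Integer.MAX_VALUE);

            BooleanQuery secured = new BooleanQuery.Builder()
                    .add(userQuery, BooleanClause.Occur.MUST)    // scored, user-visible part
                    .add(clearance, BooleanClause.Occur.FILTER)  // restricts results, does not affect scoring
                    .build();

            TopDocs hits = searcher.search(secured, 10);
            System.out.println("Matching, visible documents: " + hits.totalHits);
        }
    }
}
```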
At this stage, my recommendation would be to stay away from Hadoop (the Apache MapReduce implementation) for your project and stick with Solr. If you are keen to learn about it, however, it too has a very useful book: you guessed it, "Hadoop in Action" (also from Manning Publications).
You seem to be confused about what exactly each project (Lucene/Solr/Hadoop/etc) does. So the first thing to do would be understanding the purpose of each project. Read the docs and blogs about them. If possible, buy and read books about them.
For example, MapReduce and Hadoop have nothing to do with your security requirements. Hadoop is a platform for distributed, scalable computing. But Solr is scalable on its own. You might want to use Hadoop to distribute a crawler though (e.g. Nutch).

How do we create a simple search engine using Lucene, Solr or Nutch?

Our company has thousands of PDF documents. How do we create a simple search engine using Lucene, Solr or Nutch? We'll provide a basic Java/JSP web page where people can type in words and perform basic and/or queries, then show them the document links of all matching PDFs.
I have had good luck with Lucene, but it is not click, install and search; it does require a bit of work.
If you need something that you can download and install and be searching within 10 minutes, look at the free OmniFind Yahoo! Edition (http://omnifind.ibm.yahoo.net/). It uses Lucene, but is packaged such that it is configured and ready to run upon install - a much easier way to try Lucene.
Nutch + Lucene + the PDF plugin enabled in Nutch is your solution. Nutch allows you to parse PDFs by enabling the PDF plugin.
Lucene will allow you to index the crawled and parsed data, and Nutch has a servlet which gives you a search interface.
We use the same for our internal LANs.
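For reference, enabling PDF parsing in Nutch is a configuration change along these lines; the exact plugin list depends on the Nutch version (older releases shipped a dedicated parse-pdf plugin, while recent 1.x releases delegate PDF parsing to parse-tika), so treat the value below as an illustrative sketch rather than a drop-in default:

```xml
<!-- nutch-site.xml: make sure a parser that understands PDF is included -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```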
None of the projects in the Lucene family can natively process PDFs, but there are utilities you can drop in and well written examples on how to roll your own.
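For the "roll your own" route, one commonly used combination is Apache PDFBox for text extraction plus Lucene for indexing. The sketch below assumes the PDFBox 2.x API and a local pdfs directory, and the index path and field names are made up for illustration:

```java
import java.io.File;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfIndexer {
    public static void main(String[] args) throws Exception {
        File[] pdfs = new File("pdfs").listFiles((dir, name) -> name.toLowerCase().endsWith(".pdf"));
        if (pdfs == null) {
            System.err.println("No pdfs directory found");
            return;
        }

        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("pdf-index")), cfg)) {
            for (File pdf : pdfs) {
                // Extract the raw text with PDFBox; quality varies with how the PDF was produced.
                try (PDDocument document = PDDocument.load(pdf)) {
                    String text = new PDFTextStripper().getText(document);
                    Document doc = new Document();
                    doc.add(new StringField("path", pdf.getAbsolutePath(), Field.Store.YES));
                    doc.add(new TextField("content", text, Field.Store.NO));
                    writer.addDocument(doc);
                }
            }
        }
    }
}
```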
Lucene will do pretty much whatever you need it to do, but there is overhead in terms of your time, as Tony said above. Thousands of documents really isn't that many, so you might be able to get away with a lighter weight alternative.
That said, I would still recommend looking at Solr - it's much, much easier to set up than Lucene, has support for backups, replication, etc., as well as a nifty JSON interface which would fit your use case very well: http://wiki.apache.org/solr/SolJSON
Google Search Appliance http://www.google.com/enterprise/gsa/
I think you want a system to manage your PDF files. Please try the DSpace system. DSpace is a digital library platform with Lucene-based search. www.dspace.org
Take a look at EPrints. It includes a workflow for adding new documents, automatically indexes and thumbnails PDFs, and has fairly comprehensive full-text search functionality. It can also be easily customised and branded.
Why re-invent the wheel. Again.
Answering such a broad question in this forum will be tough. I'd recommend you check out the book Lucene in Action, which covers the basics of indexing and searching in a quite readable fashion.
Given your application, it sounds like Nutch and Solr probably will not be necessary. Since all of your documents are available locally, Nutch probably won't be helpful. Solr may help you manage a cluster of searchers if you have a high query load, but Lucene is highly performant, and handles large document sets in a very scalable manner.
The one area that might consume a lot of your effort is the use of PDF. It's possible to index PDF documents, and there are Lucene contributions to facilitate the extraction of raw text from PDFs, but depending on the document, the quality of results can vary. Often, the context of a keyword in a PDF document is unclear because of formatting instructions, and that can make it hard to do proximity searches or show the context of a hit.
A great free search technology you might look at is the IBM Yahoo! free search. I'm not sure whether they followed through on plans to use Lucene under the covers, but it remains one of the really great, easy to use free search technologies. It handles up to 500K documents, I believe, and it supports PDF and other non-text formats as well. It has a graphical user interface, easy-to-customize search results, and basic search analytics, plus a basic thesaurus and a powerful API so you can do pretty much whatever you want if the out-of-the-box results are not to your liking. We've suggested this to a number of clients where there were fewer than half a million documents, and they love it.
If you have a Linux server, you could use Beagle to index them, and then just use the search functionality that comes with it. It has an (experimental) web search interface, and it can be hooked into the Firefox search box as well.
It automatically indexes files as they're added, and I'd suspect that you'll find it much more efficient to enhance or fix Beagle than to write your own search interface to Lucene.
Having the (imho) distinct advantage of being on a Mac, I use SearchLight on a somewhat older G5. It's a nice web interface to Spotlight, the Mac OS's built-in indexing service.