Applying one or more field level filtering on search results - lucene

How do I apply one or more field level filters on existing search results returned by lucene?
Thanks!
Ed

Bluntly, looking at your posting record, you seem to be re-asking the same or very similar question repeatedly.
For general help in the area of Lucene and Information Retrieval, Lucene in Action is excellent, and will provide you with a better grounding than the specific answers to specific questions you get here. Lucene also has excellent online documentation for whatever version you're using.

Related

Solr spellcheck vs fuzzy search

I don't quite understand the difference between apache solr's spell check vs fuzzy search functionality.
I understand that fuzzy search matches your search term with the indexed value based on some difference expressed in distance.
I also understand that spellcheck also give you suggestions based on how close your search term is to a value in the index.
So to me those two things are not that different though I am sure that this is due to my shortcoming in understanding each feature thoroughly.
If anyone could provide an explanation preferably via an example, I would greatly appreciate it.
Thanks,
Bob
I'm not a professional in the Solr but I try to explain.
Fuzzy search is a simple instruction for Solr to use a kind of spellchecking during requests - Solr’s standard query parser supports the fuzzy search and you can use this one without any additional settings, for example: roam~ or roam~1. And this so-colled spellcheking is used a Damerau-Levenshtein Distance or Edit Distance algorithm.
To use spellchecking you need to configure it in the solrconfig.xml (please, see here). It gives you sort of flexibility how to implement spellcheking (there are a couple of OOTB implementation) so, for example, you can use another index for spellcheck thereby you decrease load on main index. Also for spellchecking you use another URL: /spell so it is not a search query like fuzzy query.
Why should I use spellcheking or fuzzy search? I guess it is depended on your server loading because the fuzzy search is more expensive and not recommended by the Solr team.
P.S. It is my understanding of fuzzy and spellcheking so if somebody has more correct and clear explanation, please, give us advice how to deal with them.

incremental query vs. continuous query

I know that continuous query is a query which is registered once and it is evaluated continuously over a data stream. But, I don't understand what does incremental query means. I am reading about continuous data streams and the way we query for a specific pattern in the stream.
Can anyone explain me - what is an incremental query? Explanation with an example will be really helpful
Although after googling a lot, I find some definitions, but none of them explains clearly.
UPDATE:
I don't find the exact paper now in which I found this term, but in this paper I can find it on page no. 6.
You might already have researched incremental algorithm, I think it is what you're looking for.
I have never heard of an 'incremental' query. However that sounds a lot like doctrine's schema update command here in symfony's doc
Food for thought until someone come up with a better answer :)

Question Answering with Lucene

For a toy project, I want to implement an automated question answering system with Lucene and I'm trying to figure out a reasonable way to implement it. The basic operation is as follows:
1) The user will enter a question.
2) The system will identify the keywords in the question.
3) The keywords will be searched in a large knowledgebase and matching sentences will be shown as answers.
My knowledgebase (i.e., corpus) is not structured. It is just a large, continuous text (say, a user manual without any chapters). I mean that the only structure is that sentences and paragraphs are identified.
I plan to treat each sentence or paragraph as a separate document. To present the answer in a context, I may consider keeping one sentence/paragraph before/after the indexed one as payload. I would like to know if that makes sense. Also, I'm wondering if there are other tried and well-known approaches for that kind of systems. As an example, another approach that comes to mind is to index large chunks of the corpus as documents with the token positions, then process the vicinity of found keywords to construct my answers.
I would appreciate direct recommendations based on experience or intuition, but also tutorials or introductory materials to question-answering systems with Lucene in mind.
Thanks.
It's not an unreasonable approach to take.
One enhancement you might consider is incorporating learning feedback, so that you can continually improve the scoring of content vs search terms. To do this you would ask users to rate the answers that come back ('helpful vs unhelpful'), that way you can start to rank documents against keywords based on the historical data. You could classify potential documents as helpful/unhelpful for given keywords by using a simple Bayesian classifier.
Indexing each sentence as a document will give you some problems. You've pointed out one: you would need to store the surrounding texts a payloads. That means you'll need to store each sentence three times (before, during and after), and you'll have to manually get into the payload.
If you want to go the route of each sentence being a document, I would recommend coming up with an ID for each sentence and storing that as a separate field. Then you can display [ID-1, ID, ID+1] in each result.
The bigger question though is: how should you break up the text into documents? Identifying semantically related areas seems difficult, so doing it by sentence/paragraph might be the only way to go. A better way would be if you could find which text is the header of a section, and then put everything in that section as a document.
You might also want to use the index (if your corpus has one). The terms there could be boosted, as they are presumably more important.
Instead of luncene which does text indexing, search and retrieval, I think using something like Apache Mahout would help with this. Mahout considers text as knowledge and doing that makes the answering the question better than just text matching. Mahout is a machine learning and data mining f/w which fits this domain better. Just a very high level thought.
--Sai

Best practices for configuring a Solr schema

I'm currently configuring my schema.xml file and trying to figure out what's the best way to set up my documents. I use a RMDBS and thus many objects are relational.
Take this site for instance; a document typically consists of a question, followed by 0 or more answers. Say you'd want to set up fields for this, you would have to declare all question and answer fields in the same document, the way I see it. But given the fact that there can be more than one answer, you would have to create a document for each answer. So that means each question and each answer is stored in a separate document, which contains fields for both.
I don't see a different approach for this kind of problem, however I am relatively new to Solr and document DB's, so I may be wrong.
In short: what are the best practices if I would implement such a schema?
Another way to do it would be to have a question field and a multi-valued field for answers and have them in the same document. This is probably the best way to start unless you have specific requirements that favor the document-per-answer approach.
For example, if you needed to match individual answers as stand-alone search results, you may get better results and performance with the document-per-answer approach, as the "answer" documents will be scored, ranked and loaded in isolation.
But that would be an unconventional use of this type of data. Normally when you search a site like stack overflow, you are searching for a question and set of answers that cover a particular topic, so having everything in one document makes more sense.

Relevant Search Results Across Multiple Databases

I have three databases that all have the contents of several web pages in them. What would be the best way to go about searching all three and having the most relevant web page at the top of the search results?
The only way I can think of is break down content by word count and/or creating a complex set of search rules to give one content priority over another. This might be more trouble than what it's worth, but I was wondering if anybody knows a way or product out there that would be able to help me.
To further support Ivans answer above Lucene is the way to go. You haven't mentioned what platform you're on so I'll point out that you can use a .NET port of this too.
If you do use Lucene there is a very good book from Manning on the subject which I recommend you look at.
When it comes to populating your index, you have a couple of choices. For starters you can just dump all of your text into the index and allow the engine to just search on it. However, I'd recommend adding fixed fields to your index which will allow you to support things such as partitioned searches or searches against those fields only.
To explain, lets say you have a field for the website. Then you can partition your index by restricting the index search to those documents that have that website in that field.
The other process is to extract points of interest from your document and allow searches on those without searching the entire index entry. Your mileage may vary with this as the lucene engine is very well written so it may simply allow you to collect your searches into more logical units which helps you with your solution.
I've done this myself and it helps when answering management questions about what exactly is searched and indexed.
HTH!
If you're using MS SQL Server then the full text search can return a ranking for you. I haven't used it, so you'll need to check the documentation or online for specifics.