How to extract a single document out of a Lucene 4.0 index?

This may be one of the simplest and dullest questions ever, but after indexing all my Documents in Lucene, how can I retrieve the single Document that has a given id, stored e.g. in a StringField? It should be the equivalent of an SQL expression like
Select id, description
from index
where id = '1'
where the Document has two Fields, an ID and a description.
I apologize if this question has been asked many times before, but after hours of searching the internet with what were probably the wrong search terms, I decided to ask it here :)

The Lucene demo shows how to use Lucene's standard QueryParser to search for documents: http://lucene.apache.org/core/4_1_0/demo/overview-summary.html#overview_description

Here is an excellent tutorial on Lucene: Lucene in 5 minutes
It really does take only 5 minutes. You will find the answer in the "Search" and "Display" sections, and the query construction for your requirements in the "Query" section.

Related

Fulltext Solr statistical search

Suppose I have a couple of documents indexed with Solr 4.0. Each has 2 fields: a unique ID and a text DATA field. The DATA field contains a few paragraphs of text. Can anyone advise which analyzers/parsers I should use, and how to build a statistical query that returns a sorted list of the most frequently used words across the DATA fields of all documents?
For the most frequent terms, look into Solr's TermsComponent and StatsComponent.
Besides the answers mentioned here, you can use the "HighFreqTerms" class: it's in the lucene-misc-4.0 jar (which is bundled with Solr).
This is a command line application which lets you see the top terms for any field either by document frequency or by total term frequency (the -t option)
Here is the usage:
java org.apache.lucene.misc.HighFreqTerms [-t] [number_terms] [field]
-t: include totalTermFreq
Here's the original patch, which is committed and in the 4.0 (trunk) and branch_3x codebases: https://issues.apache.org/jira/browse/LUCENE-2393
For the ID field, use an analyzer based on the keyword tokenizer: it treats the entire content of the field as a single token.
For the DATA field, use a language-specific analyzer. Note that it is possible to auto-detect the language of the text (patch).
I'm not sure whether it's possible to find the most frequent words with Solr, but if you can use Lucene itself, take a look at this question. My own suggestion is to use the HighFreqTerms class from the Luke project.
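To make the two counting modes of HighFreqTerms concrete (document frequency by default, total term frequency with -t), here is a minimal plain-Java sketch. It does not call Lucene or Solr: the class name TopTermsSketch, the method topTerms, and the crude lowercase-and-split tokenizer are assumptions made up for illustration, and a real analyzer would tokenize differently.

```java
import java.util.*;

// Plain-Java model only; hypothetical names, no Lucene or Solr calls.
class TopTermsSketch {

    /**
     * Returns the n most frequent terms, highest count first (ties alphabetical).
     * totalTermFreq = true counts every occurrence; false counts each term
     * at most once per document (document frequency).
     */
    static LinkedHashMap<String, Integer> topTerms(List<String> docs, int n, boolean totalTermFreq) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : docs) {
            List<String> tokens = new ArrayList<>();
            // Assumed analysis: lowercase, split on anything that is not a letter or digit.
            for (String t : doc.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                if (!t.isEmpty()) tokens.add(t);
            }
            Collection<String> counted = totalTermFreq ? tokens : new HashSet<String>(tokens);
            for (String t : counted) counts.merge(t, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
        sorted.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        LinkedHashMap<String, Integer> top = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : sorted.subList(0, Math.min(n, sorted.size()))) {
            top.put(e.getKey(), e.getValue());
        }
        return top;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("the cat and the hat", "the dog");
        System.out.println(topTerms(docs, 2, true));   // total term frequency
        System.out.println(topTerms(docs, 2, false));  // document frequency
    }
}
```

On a real index the counts come from the term dictionary rather than from re-reading the text, which is why HighFreqTerms is cheap compared with a scan like this one.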

Getting exact matches in Lucene using the standard analyzer?

Given 2 documents with the content as follows
"I love Lucene"
"Lucene is nice"
I want to query Lucene only for those documents with "Lucene" at the beginning, i.e., everything that would match the regexp "^Lucene .*".
Is there a way to do this, given that I can't change the index and that it was analyzed using the standard analyzer?
Sure, take a look at SpanFirstQuery. Here is a good tutorial:
http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/
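As a rough illustration of the matching rule: in Lucene this would presumably be something like new SpanFirstQuery(new SpanTermQuery(new Term(field, "lucene")), 1), where the field name is whatever your index uses and the term is lowercased because the standard analyzer lowercases tokens. The sketch below is a plain-Java model of that rule, not Lucene code; the class SpanFirstSketch, the method startsWithTerm and the simple tokenizer are assumptions for illustration.

```java
import java.util.*;

// Plain-Java model of "term must end within the first `end` positions".
// Hypothetical names; this does not call Lucene.
class SpanFirstSketch {

    /** Assumed analysis: lowercase, split on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** True if `term` occurs among the first `end` tokens of `text`. */
    static boolean startsWithTerm(String text, String term, int end) {
        List<String> tokens = tokenize(text);
        for (int i = 0; i < Math.min(end, tokens.size()); i++) {
            if (tokens.get(i).equals(term)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(startsWithTerm("Lucene is nice", "lucene", 1)); // true
        System.out.println(startsWithTerm("I love Lucene", "lucene", 1));  // false
    }
}
```

Because the rule works on token positions, it needs no change to the index, which fits the constraint in the question.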

NOT query in Lucene

I need to run NOT queries on my Lucene index. Lucene currently allows NOT only when there are two or more terms in the query.
So I can do something like:
country:canada not sweden
but I can't run a query like:
country:not sweden
Could you please let me know if there is an efficient solution to this problem?
Thanks
A very late reply, but it might be useful for somebody else later:
*:* AND NOT country:sweden
If I'm not mistaken, this does a logical AND between all documents and the documents whose country is not "sweden".
Try with the following query in the search box:
NOT message:"warning"
message being the search field
Please check the answer to a similar question. The solution is to use MatchAllDocsQuery.
The short answer is that this is not possible using the standard Lucene.
Lucene does not allow a NOT query as a single term for the same reason its query parser does not allow leading-wildcard queries by default: in both cases the engine cannot use the search term as a key into the inverted index (the structure that maps each term to the documents containing it), so it has to scan through the index to determine whether each document is or is not a hit.
To take your case as an example:
To search for not sweden, the simplest (and possibly most efficient) approach would be to search for sweden and then "invert" the result set to return all documents that are not in that result set. Doing this requires finding all the wanted documents (i.e. those not in the result set) without a key to look them up by. That means iterating over the documents in the index, a task it is not optimised for, and hence speed would suffer.
If you really need this functionality, you could maintain your own list of items when indexing, so that a not sweden search becomes a sweden search using Lucene, followed by an inversion of the results using your set of items.
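The "search, then invert" idea can be sketched in plain Java with a toy inverted index. This is only a model of the reasoning above, not Lucene code: the class NotQuerySketch and its methods are made-up names, and the allDocs set stands in for the separately maintained list of items (it plays the same role as the *:* part of the *:* AND NOT country:sweden query suggested earlier).

```java
import java.util.*;

// Toy inverted index over a single "country" field.
// Hypothetical names; this does not call Lucene.
class NotQuerySketch {
    // term -> ids of the documents containing it
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();
    // every document id seen at indexing time (the "match all" set)
    private final SortedSet<Integer> allDocs = new TreeSet<>();

    void add(int docId, String country) {
        allDocs.add(docId);
        postings.computeIfAbsent(country, k -> new TreeSet<Integer>()).add(docId);
    }

    /** Positive query: the term itself is the lookup key. */
    SortedSet<Integer> matching(String country) {
        SortedSet<Integer> hits = postings.get(country);
        return hits == null ? new TreeSet<Integer>() : hits;
    }

    /** NOT query: start from all documents, then remove the positive hits. */
    SortedSet<Integer> notMatching(String country) {
        SortedSet<Integer> result = new TreeSet<>(allDocs);
        result.removeAll(matching(country));
        return result;
    }

    public static void main(String[] args) {
        NotQuerySketch index = new NotQuerySketch();
        index.add(1, "canada");
        index.add(2, "sweden");
        index.add(3, "norway");
        System.out.println(index.notMatching("sweden")); // [1, 3]
    }
}
```

The positive lookup touches only one postings list, while the negated one has to copy and walk the full document set, which is the cost described above.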
OK, I see what you are trying to do.
You can use it as a query refinement, since there are no unary Boolean operators in Lucene. Despite the answers above, I believe this is a better and more straightforward approach (note the space before the wildcard):
&query= *&qf=-country:Canada

Retrieving per keyword/field match position in Lucene Solr -- possible?

Is there any way to retrieve the match field/position for each keyword for each matching document from solr?
For example, if the document has title "Retrieving per keyword/field match position in Lucene Solr -- possible?" and the query is "solr keyword", I'd like to get, in addition to the doc-id (I normally only want the doc-id, not the full document), something that can tell me the matches are at:
solr:
title: 9
keyword:
title: 3
I'm pretty sure such info is computed during query execution (for phrase queries), but is it possible to return it to the application?
Thanks!
Debugging Relevance Issues in Search suggests using Solr analysis, which you can get to from the admin URL, using something like http://localhost:8983/solr/admin/analysis.jsp?highlight=on .
This highlights matching terms and gives their position.
AFAIK there is no way to do that directly, but you can use hit highlighting to implement it.
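If you end up computing positions on the application side (for example from the stored or highlighted field text), the bookkeeping could look roughly like the sketch below. It is plain Java and does not call Solr: the class MatchPositionsSketch, the method positions, the 1-based numbering and the simple tokenizer are all assumptions, and the positions only line up with the index if your analyzer tokenizes the same way.

```java
import java.util.*;

// Application-side sketch: 1-based token positions of each query term
// within one field's text. Hypothetical names; no Solr or Lucene calls.
class MatchPositionsSketch {

    /** Assumed analysis: lowercase, split on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** Maps each query term to the positions where it occurs in fieldText. */
    static Map<String, List<Integer>> positions(String fieldText, String query) {
        List<String> tokens = tokenize(fieldText);
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        for (String term : tokenize(query)) {
            List<Integer> hits = new ArrayList<>();
            for (int i = 0; i < tokens.size(); i++) {
                if (tokens.get(i).equals(term)) hits.add(i + 1);
            }
            result.put(term, hits);
        }
        return result;
    }

    public static void main(String[] args) {
        String title = "Retrieving per keyword/field match position in Lucene Solr -- possible?";
        System.out.println(positions(title, "solr keyword")); // {solr=[9], keyword=[3]}
    }
}
```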

All of these words feature

I have a "description" field indexed in Lucene. This field contains a book's description.
How do I achieve "All of these words" functionality on this field using the BooleanQuery class?
For example, if a user types in "top selling book", it should return books which have all of these words in their description.
Thanks!
There are two pieces to get this to work:
1. You need the incoming documents to be analysed properly, so that individual words are tokenised and indexed separately.
2. The user query needs to be tokenised, and the tokens combined with the AND operator.
For #1, there are a number of Analyzers and Tokenizers that come with Lucene - have a look in the org.apache.lucene.analysis package. There are options for many different languages, stemming, stopwords and so on.
For #2, there are again a lot of query parsers that come with Lucene, mainly in the org.apache.lucene.queryParser package. MultiFieldQueryParser might be good for you: to require every term to be present, just call
QueryParser.setDefaultOperator(QueryParser.AND_OPERATOR)
Lucene in Action, although a few versions old, is still accurate and extremely useful for more information on analysis and query parsing.
I believe if you add all query parts (one per term) via
BooleanQuery.add(Query, BooleanClause.Occur)
and set that second parameter to the constant BooleanClause.Occur.MUST, then you should get what you want. The equivalent query syntax would be "+term1 +term2 +term3 ...".
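If you would rather hand a query string to a query parser than build the BooleanQuery yourself, turning the user's input into that "+term" form is a small string transformation. The sketch below is plain Java with made-up names (AllWordsQuerySketch, requireAll); it only splits on whitespace and does not escape query-syntax characters, so real user input should first be escaped (the classic parser has a static QueryParser.escape method for that, as far as I know).

```java
// Builds the "+term1 +term2 ..." form from raw user input.
// Hypothetical helper; it does not call Lucene and does not escape special characters.
class AllWordsQuerySketch {

    static String requireAll(String userInput) {
        StringBuilder sb = new StringBuilder();
        for (String word : userInput.trim().split("\\s+")) {
            if (word.isEmpty()) continue;          // blank input yields an empty query
            if (sb.length() > 0) sb.append(' ');
            sb.append('+').append(word);           // '+' marks the term as required
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(requireAll("top selling book")); // +top +selling +book
    }
}
```

The parser's analyzer still processes each term afterwards, so this stays consistent with how the description field was indexed only if the same analyzer is used at query time.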