Lucene - Effective text search - pdfbox

I have an index generated by the PDFBox API class LucenePDFDocument. As the index contains only the text contents, I wish to search this index effectively.
I will search the 'contents' field with the search string, and the results must be ordered from most relevant to least relevant. The code below did display the files that contain the words of the searched text, e.g. 'What is your nationality', but the results didn't include a file containing this full sentence.
Which query parser and query should I use in this scenario?
Query query = new MultiFieldQueryParser(Version.LUCENE_30, fields,
        new StandardAnalyzer(Version.LUCENE_30))
        .parse(searchString);
TopScoreDocCollector collector = TopScoreDocCollector.create(5, false);
searcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
System.out.println("count " + hits.length);
for (ScoreDoc scoreDoc : hits) {
    int docId = scoreDoc.doc;
    Document d = searcher.doc(docId);
    System.out.println(d.getField("path"));
}

This is not about the programmatic part but about Lucene query syntax. To search for a whole phrase, just wrap it in double quotes, i.e. instead of searching
What is your nationality
search
"What is your nationality"
Without quotes, Lucene finds all documents containing any of the separate words, i.e. "what", "is", "your" and "nationality" ("is" and "your" may be dropped as stop words), and sorts them by the overall number of occurrences in the document, not only within that phrase. Since you limited the number of documents to 5 in TopScoreDocCollector, the file with the phrase may not appear in the results. Adding quotes makes Lucene ignore all documents that lack the exact phrase.
Also, if you search only the 'contents' field, you do not need MultiFieldQueryParser and can use the plain QueryParser instead.
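If the search string comes from user input, the quoting described above can be applied in code before the string reaches the query parser. The following is a minimal sketch in plain Java; the class and method names (PhraseQueryText, toPhrase) are made up for illustration and are not part of Lucene. It assumes the parser accepts backslash-escaped quotes inside a phrase, as the classic query syntax does.

```java
final class PhraseQueryText {
    private PhraseQueryText() {}

    /** Wraps raw user input in double quotes so a query parser reads it as one phrase. */
    static String toPhrase(String userInput) {
        // Escape backslashes first, then embedded quotes,
        // so the input cannot terminate the phrase early.
        String escaped = userInput.trim()
                .replace("\\", "\\\\")
                .replace("\"", "\\\"");
        return "\"" + escaped + "\"";
    }
}
```

The result of PhraseQueryText.toPhrase(searchString) would then be passed to parse(...) in place of the raw search string.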

Related

How to make the first n words more important in Lucene

I want to make the first n words of a document (n is a value I set) more important than the rest of the document in Lucene. How can I do that? I found something about boosting, but that boosts a whole field to make it more important, and my document is supposed to be a single field.
Is numbering the words at indexing time and boosting them a solution? Something like this:
TextField myField = new TextField("text",termAtt.toString(),Store.YES);
myField.setBoost(2);
document.add(myField);
for as long as I haven't reached the n-th word in my document?
I want the following result: let's say that the first 20 words in a document are more important than the rest. I have two identical documents with more than 20 words each. I add the word I am searching for as the first word of one document and as the last word of the other, and I want the first document to get the higher score.
The best approach would be to simply create two different fields, one containing the higher-value portion of the text (this wouldn't need to be stored), and the other containing the full text:
int leadinLength = 20;
String text = termAtt.toString();
// Note: substring counts characters, not words; cut at a word boundary if you need exactly n words.
String leadin = text.substring(0, Math.min(leadinLength, text.length()));
TextField myFieldLeadin = new TextField("text_leadin", leadin, Store.NO);
TextField myField = new TextField("text", text, Store.YES);
myFieldLeadin.setBoost(2);
document.add(myFieldLeadin);
document.add(myField);
You could use a MultiFieldQueryParser to streamline searching both fields at once, if desired, like:
Query query = new MultiFieldQueryParser(Version.LUCENE_48, new String[] {"text_leadin", "text"}, analyzer).parse("my search query");
TopDocs docs = searcher.search(query, 10);
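The snippet in this answer cuts the text by character index, while the question asks about the first n words. A rough sketch of extracting the lead-in by word count, in plain Java, could look like the following. The LeadIn class and firstWords method are hypothetical helpers, and splitting on whitespace is an assumption; an analyzer may tokenize differently.

```java
import java.util.Arrays;

final class LeadIn {
    private LeadIn() {}

    /** Returns the first maxWords whitespace-separated words of text, joined by single spaces. */
    static String firstWords(String text, int maxWords) {
        String trimmed = text.trim();
        if (trimmed.isEmpty() || maxWords <= 0) {
            return "";
        }
        String[] words = trimmed.split("\\s+");
        int count = Math.min(maxWords, words.length);
        // Keep only the leading words; shorter documents are returned whole.
        return String.join(" ", Arrays.copyOfRange(words, 0, count));
    }
}
```

The value of LeadIn.firstWords(fullText, 20) would be what goes into the boosted "text_leadin" field, with the full text going into "text" as before.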

Lucene Query does not return results even though it should

I am currently trying to get all documents from a Lucene index (v. 4) in a RAMDirectory.
On index creation the following addDocument function is used:
public void addDocument(int id, String[] values, String[] fields) throws IOException {
    Document doc = new Document();
    doc.add(new IntField("getAll", 1, IntField.TYPE_STORED));
    doc.add(new IntField("ID", id, IntField.TYPE_STORED));
    for (int i = 0; i < fields.length; i++) {
        doc.add(new TextField(fields[i], values[i], Field.Store.NO));
    }
    writer.addDocument(doc);
}
After calling this for all documents, the writer is closed.
As you can see from the first field added to the document, I added an additional field "getAll" to make it easy to retrieve all documents. If I understood it right, the query "getAll:1" should return all documents in the index. But that's not the case.
I am using the following function for that:
public List<Integer> getDocIds(int noOfDocs) throws IOException, ParseException {
    List<Integer> result = new ArrayList<Integer>(noOfDocs);
    Query query = parser.parse("getAll:1");
    ScoreDoc[] docs = searcher.search(query, noOfDocs).scoreDocs;
    for (ScoreDoc doc : docs) {
        result.add(doc.doc);
    }
    return result;
}
noOfDocs is the number of documents that were indexed. Of course I used the same RAMDirectory when creating the IndexSearcher.
Replacing the parsed query with a manually created TermQuery didn't help either.
The query returns no results.
I hope someone can help me find my error.
Thanks
I believe you are having trouble searching because you are using an IntField, rather than a StringField or TextField, for instance. IntField, and other numeric fields, are designed for numeric range querying, and are not indexed in their raw form. You may use a NumericRangeQuery to search for them.
Really, though, IntField should only be used, to my mind, for numeric values, and not for a string of digits, which is what you appear to have. IDs should be keyword or text fields, generally.
As far as pulling all records, you don't need to add a field to do that. Simply use a MatchAllDocsQuery.
I think first you should run Luke to verify the contents of the index.
Also, if you allow * as the first character of a query with queryParser.setAllowLeadingWildcard(true), then a query like ID:* would retrieve all documents without having to include the getAll field.

Space issue in Lucene.NET C#

I want to run a full-text search for a phrase that contains a space.
Ex: Tom is a very good boy in class.
I want to search for the keyword "very good".
I'm using a whitespace tokenizer to create and search the index, but it does not find the keyword when it contains a space.
Code:
Query searchItemQuery = new WildcardQuery(new Term(string-field-name, searchkeyword.ToLower()));
I've tried splitting the keyword, but that does not work properly.
Can anyone suggest a solution for this problem?
Thanks,
Vijay
Since you are working with a tokenized string, every word is a separate term.
In order to find a phrase consisting of multiple terms, you need to use a PhraseQuery instead of a WildcardQuery.
Like this:
PhraseQuery phraseQuery = new PhraseQuery();
phraseQuery.Add(new Term(string-field-name, "very"));
phraseQuery.Add(new Term(string-field-name, "good"));
Note also that you are using a wildcard query. Wildcards in phrase queries are a bit complex. Check this post for details: Lucene - Wildcards in phrases
Finally, I would suggest considering QueryParser instead of constructing the query manually.
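To see why a single term "very good" finds nothing while a phrase of two terms does, a toy model in plain Java may help. This is only an illustration of the idea (tokens at consecutive positions), not how Lucene or Lucene.NET implement PhraseQuery; the class and method names are invented, and lowercased whitespace splitting is an assumption standing in for the whitespace tokenizer.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class PhraseMatchSketch {
    private PhraseMatchSketch() {}

    /** Lowercases the text and splits it on whitespace, roughly like a whitespace tokenizer. */
    static List<String> tokenize(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    /** A single-term lookup: the whole search string must equal one token. */
    static boolean containsTerm(List<String> tokens, String term) {
        return tokens.contains(term.toLowerCase(Locale.ROOT));
    }

    /** A phrase lookup: the phrase tokens must appear consecutively and in order. */
    static boolean containsPhrase(List<String> tokens, List<String> phrase) {
        return !phrase.isEmpty() && Collections.indexOfSubList(tokens, phrase) >= 0;
    }
}
```

With the example sentence, containsTerm(tokens, "very good") is false because no single token contains a space, while containsPhrase(tokens, tokenize("very good")) is true.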

querying for a string'ed number in lucene finds nothing

I have an existing index with some documents I'm trying to search.
When I search a "real textual" field, everything is OK.
When I try to search a field which is a number, the search gives 0 results.
The code is something like this (it is pylucene but the concept is the same):
dir = SimpleFSDirectory(File(indexDir))
analyzer = StandardAnalyzer(Version.LUCENE_CURRENT)
searcher = IndexSearcher(dir)
query = QueryParser(Version.LUCENE_CURRENT, "id", analyzer).parse("902")
hits = searcher.search(query, MAX)
print hits.totalHits #gives me 0
A Luke search (id:902) gives me empty results as well.
When I look at the Overview tab in Luke, it says this field is UTF-8 (string).
Am I doing anything wrong?
edit:
It appears this happens on fields that are indexed and have no norms (according to the flags in Luke).
Can someone explain it?
I don't like answering my own questions but I believe this answer is an important reference.
The solution is to use a NumericRangeQuery whose lower and upper bounds are both the number you seek (this time in Java):
NumericRangeQuery.newIntRange("id", Integer.valueOf(902), Integer.valueOf(902),
        true, true)
Are you using SimpleAnalyzer while indexing? It strips off numbers. Make sure you are using same analyzer while indexing and searching.

Prevent KeywordTokenizer from creating multiple key-value pairs

I use the Lucene Java QueryParser with KeywordAnalyzer. A query like topic:(hello world) is broken up into multiple parts by the KeywordTokenizer, so the resulting Query object looks like this: topic:(hello) topic:(world), i.e. instead of one, I now have two key-value pairs. I would like the QueryParser to interpret hello world as one value, without using double quotes. What is the best way to do so?
Parsing topic:("hello world") results in a single key-value combination, but using double quotes is not an option.
I am not using the Lucene search engine. I am using Lucene's QueryParser just for parsing the query, not for searching. The text Hello World is entered at runtime by the user, so it can change. I would like KeywordTokenizer to treat Hello World as one token instead of splitting it into two tokens.
You'll need to use a BooleanQuery. Here's a code snippet using the .NET port of Lucene. This should work with both the KeywordAnalyzer and the StandardAnalyzer.
var luceneAnalyzer = new KeywordAnalyzer();
var query1 = new QueryParser("Topic", luceneAnalyzer).Parse("hello");
var query2 = new QueryParser("Topic", luceneAnalyzer).Parse("world");
BooleanQuery filterQuery = new BooleanQuery();
filterQuery.Add(query1, BooleanClause.Occur.MUST);
filterQuery.Add(query2, BooleanClause.Occur.MUST);
TopDocs results = searcher.Search(filterQuery);
You can construct the query directly as follows. This preserves the space.
Query query = new TermQuery(new Term("field", "value has space"));
If you print query as System.out.println(query); you will see the following result.
field:value has space
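If the query still has to go through QueryParser as text, another option that avoids double quotes is to escape the whitespace with a backslash, so the parser sees topic:hello\ world as one term. The sketch below is plain Java with invented names (QueryTextEscaper, escapeWhitespace, fieldQuery). It relies on the assumption that the classic query syntax treats a backslash-escaped space as part of the term, which is worth verifying against the Lucene version in use. It does not escape other special characters such as : or (.

```java
final class QueryTextEscaper {
    private QueryTextEscaper() {}

    /** Prefixes every whitespace character with a backslash and leaves everything else as is. */
    static String escapeWhitespace(String value) {
        StringBuilder out = new StringBuilder(value.length() + 8);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (Character.isWhitespace(c)) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    /** Builds "field:value" query text with the whitespace in the value escaped. */
    static String fieldQuery(String field, String value) {
        return field + ":" + escapeWhitespace(value);
    }
}
```

For user input "hello world", QueryTextEscaper.fieldQuery("topic", input) yields topic:hello\ world, which can then be handed to the parser.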