Lucene Highlighter with stemming analyzer

Lucene Highlighter with stemming analyzer - lucene

I am using Lucene's Highlighter class to highlight fragments of matched search results and it works well. I would like to switch from searching with the StandardAnalyzer to the EnglishAnalyzer, which will perform stemming of terms.
The search results are good, but now the highlighter doesn't always find a match. Here's an example of what I'm looking at:
document field text 1: Everyone likes goats.
document field text 2: I have a goat that eats everything.
Using the EnglishAnalyzer and searching for "goat", both documents are matched, but the highlighter is only able to find a matched fragment from document 2. Is there a way to have the highlighter return data for both documents?
I understand that the characters are different for the tokens, but the same tokens are still there, so it seems reasonable for it to just highlight whatever token is present at that location.
If it helps, this is using Lucene 3.5.

I found a solution to this problem. I changed from using the Highlighter class to using the FastVectorHighlighter. It looks like I'll pick up some speed improvements too (at the expense of storage of term vector data). For the benefit of anyone coming across this question later, here's a unit test showing how this all works together:
package com.sample.index;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.vectorhighlight.*;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.Before;
import org.junit.Test;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import static junit.framework.Assert.assertEquals;
public class TestIndexStuff {
public static final String FIELD_NORMAL = "normal";
public static final String[] PRE_TAGS = new String[]{"["};
public static final String[] POST_TAGS = new String[]{"]"};
private IndexSearcher searcher;
private Analyzer analyzer = new EnglishAnalyzer(Version.LUCENE_35);
#Before
public void init() throws IOException {
RAMDirectory idx = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, analyzer);
IndexWriter writer = new IndexWriter(idx, config);
addDocs(writer);
writer.close();
searcher = new IndexSearcher(IndexReader.open(idx));
}
private void addDocs(IndexWriter writer) throws IOException {
for (String text : new String[] {
"Pretty much everyone likes goats.",
"I have a goat that eats everything.",
"goats goats goats goats goats"}) {
Document doc = new Document();
doc.add(new Field(FIELD_NORMAL, text, Field.Store.YES,
Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
writer.addDocument(doc);
}
}
private FastVectorHighlighter makeHighlighter() {
FragListBuilder fragListBuilder = new SimpleFragListBuilder(200);
FragmentsBuilder fragmentBuilder = new SimpleFragmentsBuilder(PRE_TAGS, POST_TAGS);
return new FastVectorHighlighter(true, true, fragListBuilder, fragmentBuilder);
}
#Test
public void highlight() throws ParseException, IOException {
Query query = new QueryParser(Version.LUCENE_35, FIELD_NORMAL, analyzer)
.parse("goat");
FastVectorHighlighter highlighter = makeHighlighter();
FieldQuery fieldQuery = highlighter.getFieldQuery(query);
TopDocs topDocs = searcher.search(query, 10);
List<String> fragments = new ArrayList<String>();
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
fragments.add(highlighter.getBestFragment(fieldQuery, searcher.getIndexReader(),
scoreDoc.doc, FIELD_NORMAL, 10000));
}
assertEquals(3, fragments.size());
assertEquals("[goats] [goats] [goats] [goats] [goats]", fragments.get(0).trim());
assertEquals("Pretty much everyone likes [goats].", fragments.get(1).trim());
assertEquals("I have a [goat] that eats everything.", fragments.get(2).trim());
}
}

Related

Query Expansion lucene

I am new to lucene and I am trying to do query expansion.
I have referred to these two posts (first , second) and I've managed to reuse the code in a way that suits version 6.0.0, as the one in the previous is deprecated.
The issue is, either I'm not getting a results or I didn't access the results (expanded queries) appropriately.
Here is my code:
import com.sun.corba.se.impl.util.Version;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UnsupportedEncodingException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.text.ParseException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.ClassicTokenizer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.synonym.SynonymFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.analysis.synonym.WordnetSynonymParser;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.*;
public class Graph extends Analyzer
{
protected static TokenStreamComponents createComponents(String fieldName, Reader reader) throws ParseException{
System.out.println("1");
// TODO Auto-generated method stub
Tokenizer source = new ClassicTokenizer();
source.setReader(reader);
TokenStream filter = new StandardFilter( source);
filter = new LowerCaseFilter(filter);
SynonymMap mySynonymMap = null;
try {
mySynonymMap = buildSynonym();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
filter = new SynonymFilter(filter, mySynonymMap, false);
return new TokenStreamComponents(source, filter);
}
private static SynonymMap buildSynonym() throws IOException, ParseException
{ System.out.print("build");
File file = new File("wn\\wn_s.pl");
InputStream stream = new FileInputStream(file);
Reader rulesReader = new InputStreamReader(stream);
SynonymMap.Builder parser = null;
parser = new WordnetSynonymParser(true, true, new StandardAnalyzer(CharArraySet.EMPTY_SET));
System.out.print(parser.toString());
((WordnetSynonymParser) parser).parse(rulesReader);
SynonymMap synonymMap = parser.build();
return synonymMap;
}
public static void main (String[] args) throws UnsupportedEncodingException, IOException, ParseException
{
Reader reader = new FileReader("C:\\input.txt"); // here I have the queries that I want to expand
TokenStreamComponents TSC = createComponents( "" , new StringReader("some text goes here"));
**System.out.print(TSC); //How to get the result from TSC????**
}
#Override
protected TokenStreamComponents createComponents(String string)
{
throw new UnsupportedOperationException("Not supported yet."); //To change body of generated methods, choose Tools | Templates.
}
}
Please suggest ways to help me access the expanded queries!

So, are you just trying to figure out how to iterate through the terms from the TokenStreamComponents in your main method?
Something like this:
TokenStreamComponents TSC = createComponents( "" , new StringReader("some text goes here"));
TokenStream stream = TSC.getTokenStream();
CharTermAttribute termattr = stream.addAttribute(CharTermAttribute.class);
stream.reset();
while (stream.incrementToken()) {
System.out.println(termattr.toString());
}

Add a watermark on a pdf that contains images using pdfbox (1.7)

I have used the code suggested in:
PDFBox Overlay fails
to add a watermark to an existing pdf.
Unfortunately, the pdf produced is corrupted. The pdf reader complains when I open the document: "An error exists on this page. Acrobat may not display the page correctly. Please contact the person who created the PDF document to correct the problem".
The document is opened but it does not show the images.
It seems to happen with all the pdfs. It could be worth saying that it happens also with a different implementation that simply uses the Overlay class.
The following url points to a pdf that I used for my testing:
A pdf with an image
The code to test this transformation is:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.pdmodel.edit.PDPageContentStream;
import org.apache.pdfbox.pdmodel.graphics.PDExtendedGraphicsState;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectForm;
import org.apache.pdfbox.util.MapUtil;
/**
* This test is about overlaying with special effect.
*
* #author mkl
*/
public class OverlayWithEffect
{
final static File RESULT_FOLDER = new File("target/test-outputs", "assembly");
public static void overlayWithDarkenBlendMode(PDDocument document, PDDocument overlay) throws IOException
{
PDXObjectForm xobject = importAsXObject(document, (PDPage) overlay.getDocumentCatalog().getAllPages().get(0));
PDExtendedGraphicsState darken = new PDExtendedGraphicsState();
darken.getCOSDictionary().setName("BM", "Darken");
List<PDPage> pages = document.getDocumentCatalog().getAllPages();
for (PDPage page: pages)
{
if (page.getResources() == null) {
page.setResources(page.findResources());
}
if (page.getResources() != null) {
Map<String, PDExtendedGraphicsState> states = page.getResources().getGraphicsStates();
if (states == null) {
states = new HashMap<String, PDExtendedGraphicsState>();
}
String darkenKey = MapUtil.getNextUniqueKey(states, "Dkn");
states.put(darkenKey, darken);
page.getResources().setGraphicsStates(states);
PDPageContentStream stream = new PDPageContentStream(document, page, true, false, true);
stream.appendRawCommands(String.format("/%s gs ", darkenKey));
stream.drawXObject(xobject, 0, 0, 1, 1);
stream.close();
}
}
}
public static PDXObjectForm importAsXObject(PDDocument target, PDPage page) throws IOException
{
final PDStream xobjectStream = new PDStream(target, page.getContents().createInputStream(), false);
final PDXObjectForm xobject = new PDXObjectForm(xobjectStream);
xobject.setResources(page.findResources());
xobject.setBBox(page.findCropBox());
COSDictionary group = new COSDictionary();
group.setName("S", "Transparency");
group.setBoolean(COSName.getPDFName("K"), true);
xobject.getCOSStream().setItem(COSName.getPDFName("Group"), group);
return xobject;
}
public static void main(String[] args) throws COSVisitorException, IOException
{
InputStream sourceStream = new FileInputStream("x:/pdf-test.pdf");
InputStream overlayStream = new FileInputStream("x:/draft.pdf");
try {
final PDDocument document = PDDocument.load(sourceStream);
final PDDocument overlay = PDDocument.load(overlayStream);
overlayWithDarkenBlendMode(document, overlay);
document.save("x:/da-draft-5.pdf");
document.close();
}
finally {
sourceStream.close();
overlayStream.close();
}
}
}
I am using version 1.7 of pdfbox.
Thanks

As suggested by mkl, it is probably an issue with the version of pdfbox that I am using.

BM25 in Lucene 4.9

I'm struggling with with BM25Similarity class in Lucene (link). All examples provided on the Web refers to older implementation (link). I kindly ask for a pointer how to modify the standard toy example below to include BM25 similarity (create index and perform search).
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import java.io.IOException;
public class HelloLucene {
public static void main(String[] args) throws IOException, ParseException {
// Specify the analyzer for tokenizing text.
// The same analyzer should be used for indexing and searching
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_4_9);
// Create the index
Directory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_9, analyzer);
IndexWriter w = new IndexWriter(index, config);
addDoc(w, "Lucene in Action", "193398817");
addDoc(w, "Lucene for Dummies", "55320055Z");
addDoc(w, "Managing Gigabytes", "55063554A");
addDoc(w, "The Art of Computer Science", "9900333X");
w.close();
// Query
String querystr = args.length > 0 ? args[0] : "lucene";
// the "title" arg specifies the default field to use
// when no field is explicitly specified in the query.
Query q = new QueryParser(Version.LUCENE_4_9, "title", analyzer).parse(querystr);
// Search
int hitsPerPage = 10;
IndexReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// Display results
System.out.println("Found " + hits.length + " hits.");
for(int i=0;i<hits.length;++i) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
System.out.println((i + 1) + ". " + d.get("isbn") + "\t" + d.get("title"));
}
reader.close();
}
private static void addDoc(IndexWriter w, String title, String isbn) throws IOException {
Document doc = new Document();
doc.add(new TextField("title", title, Field.Store.YES));
// use a string field for isbn because we don't want it tokenized
doc.add(new StringField("isbn", isbn, Field.Store.YES));
w.addDocument(doc);
}
}

You just need to set the similarity in IndexSearcher:
searcher.setSimilarity(new BM25Similarity(1.2, 0.75));
And IndexWriterConfig:
config.setSimilarity(new BM25Similarity(1.2, 0.75));

Empty Query - More Like This (Lucene)

I'm new at Java and Lucene.
I'm trying a simple MLT test, but I'm not getting any results.
import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.Version;
import org.apache.lucene.index.DirectoryReader;
import java.io.IOException;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.SimpleFSDirectory;
public class HelloLucene {
public static void main(String[] args) throws IOException, ParseException {
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
String indexPath = "C:/Users/Name/Desktop/Lucene";
Directory index = new SimpleFSDirectory(new File(indexPath));
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
IndexWriter w = new IndexWriter(index, config);
addDoc(w, "Lucene in Action", "193398817");
addDoc(w, "Lucene for Dummies", "55320055Z");
addDoc(w, "Managing Gigabytes", "55063554A");
addDoc(w, "The Art of Computer Science", "9900333X");
w.close();
IndexReader reader = DirectoryReader.open(index);
System.out.println("Reader has " + reader.numDocs() + " docs on it.");
MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setMinTermFreq(1);
mlt.setMinDocFreq(1);
Query query = mlt.like(0); //doc ID
IndexSearcher searcher = new IndexSearcher(reader);
TopDocs topDocs = searcher.search(query,5);
System.out.println("Found " + topDocs.totalHits + " hits.");
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
Document doc = searcher.doc(scoreDoc.doc);
System.out.println(scoreDoc.doc + ". " + doc.get("isbn") + "\t" + doc.get("title"));
}
reader.close();
}
private static void addDoc(IndexWriter w, String title, String isbn) throws IOException {
Document doc = new Document();
doc.add(new TextField("title", title, Field.Store.YES));
// use a string field for isbn because we don't want it tokenized
doc.add(new StringField("isbn", isbn, Field.Store.YES));
w.addDocument(doc);
}
}
And this is what I get:
Reader has 4 docs on it.
Found 0 hits.
I tried using Luke to check the Doc's ID and there is nothing wrong apparently.
I even did some tests there, I dont know whats wrong :(
Did a lot of searching on the web, someone said something about the settings as MinTermFreq and MinDocFreq, I tried 1 and 0, but got nothing.
Does anyone have an idea?
Thanks in advance!
[SOLVED] Edited:
its working now!
I just had to add this:
mlt.setAnalyzer(analyzer);
mlt.setFieldNames(new String[] {"title"});

its working now!
I just had to add this:
mlt.setAnalyzer(analyzer);
mlt.setFieldNames(new String[] {"title"});

How to use prefix queries on fields in Lucene?

I am trying to use a PrefixQuery in Lucene for the purpose of autocomplete. I've made a simple test of what I thought should work, but it doesn't. I am indexing some simple strings and using the KeywordAnalyzer to make sure they are not tokenized, but my searches still do not match anything. How should I index and search a field to get a prefix match?
Here's the unit test I made to test with. Everything passes except the autocomplete and singleTerm methods.
package com.sample.index;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.Before;
import org.junit.Test;
import java.io.IOException;
import java.util.HashMap;
import static junit.framework.Assert.assertEquals;
import static junit.framework.Assert.assertFalse;
import static junit.framework.Assert.assertTrue;
public class TestIndexStuff {
public static final String FIELD_AUTOCOMPLETE = "autocomplete";
public static final String FIELD_NORMAL = "normal";
private IndexSearcher searcher;
private PerFieldAnalyzerWrapper analyzer;
#Before
public void init() throws IOException {
RAMDirectory idx = new RAMDirectory();
HashMap<String, Analyzer> fieldAnalyzers = new HashMap<String, Analyzer>();
fieldAnalyzers.put(FIELD_AUTOCOMPLETE, new KeywordAnalyzer());
analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_35), fieldAnalyzers);
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, analyzer);
IndexWriter writer = new IndexWriter(idx, config);
addDocs(writer);
writer.close();
searcher = new IndexSearcher(IndexReader.open(idx));
}
private void addDocs(IndexWriter writer) throws IOException {
for (String text : new String[]{"Fred Rogers", "Toni Reed Preckwinkle", "Randy Savage", "Kathryn Janeway", "Madonna", "Fred Savage"}) {
Document doc = new Document();
doc.add(new Field(FIELD_NORMAL, text, Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field(FIELD_AUTOCOMPLETE, text, Field.Store.YES, Field.Index.NOT_ANALYZED));
writer.addDocument(doc);
}
}
#Test
public void prefixParser() throws ParseException {
Query prefixQuery = new QueryParser(Version.LUCENE_35, FIELD_AUTOCOMPLETE, analyzer).parse("Fre*");
assertTrue(prefixQuery instanceof PrefixQuery);
Query normalQuery = new QueryParser(Version.LUCENE_35, FIELD_AUTOCOMPLETE, analyzer).parse("Fred");
assertFalse(normalQuery instanceof PrefixQuery);
}
#Test
public void normal() throws ParseException, IOException {
Query query = new QueryParser(Version.LUCENE_35, FIELD_NORMAL, analyzer).parse("Fred");
TopDocs topDocs = searcher.search(query, 10);
assertEquals(2, topDocs.totalHits);
}
#Test
public void autocomplete() throws IOException, ParseException {
Query query = new QueryParser(Version.LUCENE_35, FIELD_AUTOCOMPLETE, analyzer).parse("Fre*");
TopDocs topDocs = searcher.search(query, 10);
assertEquals(2, topDocs.totalHits);
}
#Test
public void singleTerm() throws ParseException, IOException {
Query query = new QueryParser(Version.LUCENE_35, FIELD_AUTOCOMPLETE, analyzer).parse("Mado*");
TopDocs topDocs = searcher.search(query, 10);
assertEquals(1, topDocs.totalHits);
}
}
edit: adding revised code for those who read this later to show full test after changing thanks to #jpountz. Rather than leave things as mixed case though, I chose to index them as lower case. I also added a unit test to make sure that a term in the middle would not be matched, since this should only match things that start with the search term.
package com.sample.index;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.Before;
import org.junit.Test;
import java.io.IOException;
import java.util.HashMap;
import static junit.framework.Assert.assertEquals;
import static junit.framework.Assert.assertFalse;
import static junit.framework.Assert.assertTrue;
public class TestIndexStuff {
public static final String FIELD_AUTOCOMPLETE = "autocomplete";
public static final String FIELD_NORMAL = "normal";
private IndexSearcher searcher;
private PerFieldAnalyzerWrapper analyzer;
#Before
public void init() throws IOException {
RAMDirectory idx = new RAMDirectory();
HashMap<String, Analyzer> fieldAnalyzers = new HashMap<String, Analyzer>();
fieldAnalyzers.put(FIELD_AUTOCOMPLETE, new KeywordAnalyzer());
analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_35), fieldAnalyzers);
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, analyzer);
IndexWriter writer = new IndexWriter(idx, config);
addDocs(writer);
writer.close();
searcher = new IndexSearcher(IndexReader.open(idx));
}
private void addDocs(IndexWriter writer) throws IOException {
for (String text : new String[]{"Fred Rogers", "Toni Reed Preckwinkle", "Randy Savage", "Kathryn Janeway", "Madonna", "Fred Savage"}) {
Document doc = new Document();
doc.add(new Field(FIELD_NORMAL, text, Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field(FIELD_AUTOCOMPLETE, text.toLowerCase(), Field.Store.YES, Field.Index.NOT_ANALYZED));
writer.addDocument(doc);
}
}
#Test
public void prefixParser() throws ParseException {
Query prefixQuery = new QueryParser(Version.LUCENE_35, FIELD_AUTOCOMPLETE, analyzer).parse("Fre*");
assertTrue(prefixQuery instanceof PrefixQuery);
Query normalQuery = new QueryParser(Version.LUCENE_35, FIELD_AUTOCOMPLETE, analyzer).parse("Fred");
assertFalse(normalQuery instanceof PrefixQuery);
}
#Test
public void normal() throws ParseException, IOException {
Query query = new QueryParser(Version.LUCENE_35, FIELD_NORMAL, analyzer).parse("Fred");
TopDocs topDocs = searcher.search(query, 10);
assertEquals(2, topDocs.totalHits);
}
#Test
public void autocomplete() throws IOException, ParseException {
Query query = new QueryParser(Version.LUCENE_35, FIELD_AUTOCOMPLETE, analyzer).parse("Fre*");
TopDocs topDocs = searcher.search(query, 10);
assertEquals(2, topDocs.totalHits);
}
#Test
public void beginningOnly() throws ParseException, IOException {
Query query = new QueryParser(Version.LUCENE_35, FIELD_AUTOCOMPLETE, analyzer).parse("R*");
TopDocs topDocs = searcher.search(query, 10);
assertEquals(1, topDocs.totalHits);
}
#Test
public void singleTerm() throws ParseException, IOException {
Query query = new QueryParser(Version.LUCENE_35, FIELD_AUTOCOMPLETE, analyzer).parse("Mado*");
TopDocs topDocs = searcher.search(query, 10);
assertEquals(1, topDocs.totalHits);
}
}

By default, QueryParser lowercases the terms of special queries (in particular prefix queries). To disable this, see QueryParser.setLowercaseExpandedTerms.
Replace
Query query = new QueryParser(Version.LUCENE_35, FIELD_AUTOCOMPLETE, analyzer).parse("Mado*");
with
QueryParser qp = new QueryParser(Version.LUCENE_35, FIELD_AUTOCOMPLETE, analyzer);
qp.setLowercaseExpandedTerms(false);
Query query = qp.parse("Mado*");
to fix your tests.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Lucene Highlighter with stemming analyzer - lucene

Related

Query Expansion lucene

Add a watermark on a pdf that contains images using pdfbox (1.7)

BM25 in Lucene 4.9

Empty Query - More Like This (Lucene)

How to use prefix queries on fields in Lucene?

Categories

Resources