I want to be able to search for special characters in my index.
I escaped all the special characters in the query string, but when I run a query such as + against the index, Lucene builds the query as +(), so it searches no fields.
How can I solve this problem? My index contains these special characters.
If you are using the StandardAnalyzer, it will discard non-alphanumeric characters. Try indexing the same value with a WhitespaceAnalyzer and see if that preserves the characters you need. It might also keep stuff you don't want: that's when you might consider writing your own Analyzer, which basically means creating a TokenStream stack that does exactly the kind of processing you need.
For example, the SimpleAnalyzer implements the following pipeline:
@Override
public TokenStream tokenStream(String fieldName, Reader reader) {
return new LowerCaseTokenizer(reader);
}
which just lower-cases the tokens.
The StandardAnalyzer does much more:
/** Constructs a {@link StandardTokenizer} filtered by a {@link
StandardFilter}, a {@link LowerCaseFilter} and a {@link StopFilter}. */
@Override
public TokenStream tokenStream(String fieldName, Reader reader) {
StandardTokenizer tokenStream = new StandardTokenizer(matchVersion, reader);
tokenStream.setMaxTokenLength(maxTokenLength);
TokenStream result = new StandardFilter(tokenStream);
result = new LowerCaseFilter(result);
result = new StopFilter(enableStopPositionIncrements, result, stopSet);
return result;
}
You can mix & match from these and other components in org.apache.lucene.analysis, or you can write your own specialized TokenStream instances that are wrapped into a processing pipeline by your custom Analyzer.
One other thing to look at is what sort of CharTokenizer you're using. CharTokenizer is an abstract class that specifies the machinery for tokenizing text strings. It's used by some simpler Analyzers (but not by the StandardAnalyzer). Lucene comes with two subclasses: a LetterTokenizer and a WhitespaceTokenizer. You can create your own that keeps the characters you need and breaks on those you don't by implementing the boolean isTokenChar(char c) method.
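For example, a tokenizer that keeps a few chosen special characters might look like this (a rough sketch, assuming an older Lucene release where CharTokenizer exposes isTokenChar(char); newer versions use isTokenChar(int) instead):
import java.io.Reader;
import org.apache.lucene.analysis.CharTokenizer;

/**
 * Hypothetical tokenizer: keeps letters, digits and a chosen set of special
 * characters as token characters, and breaks on everything else.
 */
public class SpecialCharTokenizer extends CharTokenizer {

    public SpecialCharTokenizer(Reader input) {
        super(input);
    }

    @Override
    protected boolean isTokenChar(char c) {
        // Keep letters, digits, and the special characters we care about.
        return Character.isLetterOrDigit(c) || c == '+' || c == '-' || c == '&';
    }
}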
This may no longer be relevant to the original author, but to be able to search for special characters you need to:
Create a custom analyzer
Use it for indexing and searching
Here is an example of how it works for me:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.RAMDirectory;
import org.junit.Test;
import java.io.IOException;
import static org.hamcrest.Matchers.equalTo;
import static org.junit.Assert.assertThat;
public class LuceneSpecialCharactersSearchTest {
/**
* Test that tries to search a string by some substring with each special character separately.
*/
@Test
public void testSpecialCharacterSearch() throws Exception {
// GIVEN
LuceneSpecialCharactersSearch service = new LuceneSpecialCharactersSearch();
String[] luceneSpecialCharacters = new String[]{"+", "-", "&&", "||", "!", "(", ")", "{", "}", "[", "]", "^", "\"", "~", "*", "?", ":", "\\"};
// WHEN
for (String specialCharacter : luceneSpecialCharacters) {
String actual = service.search("list's special-characters " + specialCharacter);
// THEN
assertThat(actual, equalTo(LuceneSpecialCharactersSearch.TEXT_WITH_SPECIAL_CHARACTERS));
}
}
private static class LuceneSpecialCharactersSearch {
private static final String TEXT_WITH_SPECIAL_CHARACTERS = "This is the list's of special-characters + - && || ! ( ) { } [ ] ^ \" ~ ? : \\ *";
private final IndexWriter writer;
public LuceneSpecialCharactersSearch() throws Exception {
Document document = new Document();
document.add(new TextField("body", TEXT_WITH_SPECIAL_CHARACTERS, Field.Store.YES));
RAMDirectory directory = new RAMDirectory();
writer = new IndexWriter(directory, new IndexWriterConfig(buildAnalyzer()));
writer.addDocument(document);
writer.commit();
}
public String search(String queryString) throws Exception {
try (IndexReader reader = DirectoryReader.open(writer, false)) {
IndexSearcher searcher = new IndexSearcher(reader);
String escapedQueryString = QueryParser.escape(queryString).toLowerCase();
Analyzer analyzer = buildAnalyzer();
QueryParser bodyQueryParser = new QueryParser("body", analyzer);
bodyQueryParser.setDefaultOperator(QueryParser.Operator.AND);
Query bodyQuery = bodyQueryParser.parse(escapedQueryString);
BooleanQuery query = new BooleanQuery.Builder()
.add(new BooleanClause(bodyQuery, BooleanClause.Occur.SHOULD))
.build();
TopDocs searchResult = searcher.search(query, 1);
return searcher.doc(searchResult.scoreDocs[0].doc).getField("body").stringValue();
}
}
/**
* Builds analyzer that is used for indexing and searching.
*/
private static Analyzer buildAnalyzer() throws IOException {
return CustomAnalyzer.builder()
.withTokenizer("whitespace")
.addTokenFilter("lowercase")
.addTokenFilter("standard")
.build();
}
}
}
I'm looking for general advice on how to search identifiers, product codes or phone numbers in Apache Lucene 8.x. Let's say I'm trying to search lists of product codes (like an ISBN, for example 978-3-86680-192-9). If somebody enters 9783 or 978 3 or 978-3, then 978-3-86680-192-9 should appear. The same should happen if an identifier uses any combination of letters, spaces, digits and punctuation (examples: TS 123, 123.abc). How would I do this?
I thought I could solve this with a custom analyzer that removes all punctuation and whitespace, but the results are mixed:
public class IdentifierAnalyzer extends Analyzer {
@Override
protected TokenStreamComponents createComponents(String fieldName) {
Tokenizer tokenizer = new KeywordTokenizer();
TokenStream tokenStream = new LowerCaseFilter(tokenizer);
tokenStream = new PatternReplaceFilter(tokenStream, Pattern.compile("[^0-9a-z]"), "", true);
tokenStream = new TrimFilter(tokenStream);
return new TokenStreamComponents(tokenizer, tokenStream);
}
@Override
protected TokenStream normalize(String fieldName, TokenStream in) {
TokenStream tokenStream = new LowerCaseFilter(in);
tokenStream = new PatternReplaceFilter(tokenStream, Pattern.compile("[^0-9a-z]"), "", true);
tokenStream = new TrimFilter(tokenStream);
return tokenStream;
}
}
So while I get the desired results when performing a PrefixQuery with TS1*, TS 1* (with whitespace) does not yield satisfactory results. When I look at the parsed query, I see that Lucene splits TS 1* into two queries: myField:TS myField:1*. WordDelimiterGraphFilter looks interesting, but I couldn't figure out how to apply it here.
This is not a comprehensive answer - but I agree that WordDelimiterGraphFilter may be helpful for this type of data. However, there could still be test cases which need additional handling.
Here is my custom analyzer, using a WordDelimiterGraphFilter:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.KeywordTokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilterFactory;
import java.util.Map;
import java.util.HashMap;
public class IdentifierAnalyzer extends Analyzer {
private WordDelimiterGraphFilterFactory getWordDelimiter() {
Map<String, String> settings = new HashMap<>();
settings.put("generateWordParts", "1"); // e.g. "PowerShot" => "Power" "Shot"
settings.put("generateNumberParts", "1"); // e.g. "500-42" => "500" "42"
settings.put("catenateAll", "1"); // e.g. "wi-fi" => "wifi" and "500-42" => "50042"
settings.put("preserveOriginal", "1"); // e.g. "500-42" => "500" "42" "500-42"
settings.put("splitOnCaseChange", "1"); // e.g. "fooBar" => "foo" "Bar"
return new WordDelimiterGraphFilterFactory(settings);
}
@Override
protected TokenStreamComponents createComponents(String fieldName) {
Tokenizer tokenizer = new KeywordTokenizer();
TokenStream tokenStream = new LowerCaseFilter(tokenizer);
tokenStream = getWordDelimiter().create(tokenStream);
return new TokenStreamComponents(tokenizer, tokenStream);
}
@Override
protected TokenStream normalize(String fieldName, TokenStream in) {
TokenStream tokenStream = new LowerCaseFilter(in);
return tokenStream;
}
}
It uses the WordDelimiterGraphFilterFactory helper, together with a map of parameters, to control which settings are applied.
You can see the complete list of available settings in the WordDelimiterGraphFilterFactory JavaDoc. You may want to experiment with setting/unsetting different ones.
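One way to experiment is to feed sample identifiers through the analyzer and print the tokens it emits. Here is a minimal sketch (the class and method names are my own, not part of the answer):
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerDebugger {

    /** Prints each token the given analyzer produces for the given text. */
    public static void printTokens(Analyzer analyzer, String field, String text) throws IOException {
        try (TokenStream stream = analyzer.tokenStream(field, text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println(term.toString());
            }
            stream.end();
        }
    }

    public static void main(String[] args) throws IOException {
        printTokens(new IdentifierAnalyzer(), "identifiers", "978-3-86680-192-9");
    }
}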
Here is a test index builder for the following 3 input values:
978-3-86680-192-9
TS 123
123.abc
public static void buildIndex() throws IOException, FileNotFoundException, ParseException {
final Directory dir = FSDirectory.open(Paths.get(INDEX_PATH));
Analyzer analyzer = new IdentifierAnalyzer();
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
iwc.setOpenMode(OpenMode.CREATE);
Document doc;
List<String> identifiers = Arrays.asList("978-3-86680-192-9", "TS 123", "123.abc");
try (IndexWriter writer = new IndexWriter(dir, iwc)) {
for (String identifier : identifiers) {
doc = new Document();
doc.add(new TextField("identifiers", identifier, Field.Store.YES));
writer.addDocument(doc);
}
}
}
This creates the following tokens:
For querying the above indexed data I used this:
public static void doSearch() throws IOException, ParseException {
Analyzer analyzer = new IdentifierAnalyzer();
QueryParser parser = new QueryParser("identifiers", analyzer);
List<String> searches = Arrays.asList("9783", "9783*", "978 3", "978-3", "TS1*", "TS 1*");
for (String search : searches) {
Query query = parser.parse(search);
printHits(query, search);
}
}
private static void printHits(Query query, String search) throws IOException {
System.out.println("search term: " + search);
System.out.println("parsed query: " + query.toString());
IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(INDEX_PATH)));
IndexSearcher searcher = new IndexSearcher(reader);
TopDocs results = searcher.search(query, 100);
ScoreDoc[] hits = results.scoreDocs;
System.out.println("hits: " + hits.length);
for (ScoreDoc hit : hits) {
System.out.println("");
System.out.println(" doc id: " + hit.doc + "; score: " + hit.score);
Document doc = searcher.doc(hit.doc);
System.out.println(" identifier: " + doc.get("identifiers"));
}
System.out.println("-----------------------------------------");
}
This uses the following search terms - all of which I pass into the classic query parser (though you could, of course, use more sophisticated query types via the API):
9783
9783*
978 3
978-3
TS1*
TS 1*
The only query which failed to find any matching documents was the first one:
search term: 9783
parsed query: identifiers:9783
hits: 0
This should not be a surprise, since this is a partial token, without a wildcard. The second query (with the wildcard added) found one document, as expected.
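If bare partial input like 9783 should also match without a wildcard, one option (my own sketch, not part of the answer above) is to additionally emit edge n-grams at index time, using org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter. Note that you would normally pair this with a separate query-time analyzer that does not produce n-grams, otherwise the query terms get expanded as well:
// Hypothetical index-time variant of createComponents() that appends an
// EdgeNGramTokenFilter, so "978-3-86680-192-9" also indexes prefixes such
// as "978", "9783", "97838", ... up to the configured maximum gram length.
@Override
protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer tokenizer = new KeywordTokenizer();
    TokenStream tokenStream = new LowerCaseFilter(tokenizer);
    tokenStream = getWordDelimiter().create(tokenStream);
    // minGram = 3, maxGram = 20, preserveOriginal = true keeps the full tokens too
    tokenStream = new EdgeNGramTokenFilter(tokenStream, 3, 20, true);
    return new TokenStreamComponents(tokenizer, tokenStream);
}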
The final query I tested, TS 1*, found three hits - but the one we want has the best matching score:
search term: TS 1*
parsed query: identifiers:ts identifiers:1*
hits: 3
doc id: 1; score: 1.590861
identifier: TS 123
doc id: 0; score: 1.0
identifier: 978-3-86680-192-9
doc id: 2; score: 1.0
identifier: 123.abc
I want to debug Lucene token filters and see their results. How can I apply a token filter to a token stream and inspect the result?
(using Lucene 4.10.3)
import java.io.IOException;
import java.io.StringReader;
import java.util.Iterator;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
public class TokenFilterExample {
public static void main(String[] args) throws IOException {
// 1] Create token stream
StringReader r = new StringReader("Hello World");
StandardTokenizer s = new StandardTokenizer(r);
// Create lower-case token filter
LowerCaseFilter f = new LowerCaseFilter(s);
// Print result
System.out.println(??????);
// close
f.close();
s.close();
}
}
The solution is this:
// Print result
f.reset();
while (f.incrementToken()) {
    System.out.println(f.getAttribute(CharTermAttribute.class));
}
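Putting the fix back into the original program, a complete version might look like this (a sketch for the Lucene 4.10.x API used in the question, with the same constructors as the question's code):
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenFilterExample {
    public static void main(String[] args) throws IOException {
        // 1] Create token stream
        StringReader r = new StringReader("Hello World");
        StandardTokenizer s = new StandardTokenizer(r);
        // 2] Create lower-case token filter
        LowerCaseFilter f = new LowerCaseFilter(s);
        // 3] Consume the stream: reset, iterate, then end and close
        f.reset();
        while (f.incrementToken()) {
            System.out.println(f.getAttribute(CharTermAttribute.class));
        }
        f.end();
        f.close(); // also closes the underlying tokenizer and reader
    }
}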
Here is my code:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import java.io.*;
import java.util.ArrayList;
/**
* This terminal application creates an Apache Lucene index in a folder and adds files into this index
* based on the input of the user.
*/
public class TextFileIndexer {
private static StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_44);
private Analyzer anal = new WhitespaceAnalyzer(Version.LUCENE_44);
private IndexWriter writer;
private ArrayList<File> queue = new ArrayList<File>();
public static void main(String[] args) throws IOException {
System.out.println("Enter the path where the index will be created: (e.g. /tmp/index or c:/temp/index)");
String indexLocation = null;
BufferedReader br = new BufferedReader(
new InputStreamReader(System.in));
String s = br.readLine();
TextFileIndexer indexer = null;
try {
indexLocation = s;
indexer = new TextFileIndexer(s);
} catch (Exception ex) {
System.out.println("Cannot create index..." + ex.getMessage());
System.exit(-1);
}
//===================================================
//read input from user until he enters q for quit
//===================================================
while (!s.equalsIgnoreCase("q")) {
try {
System.out.println("Enter the full path to add into the index (q=quit): (e.g. /home/ron/mydir or c:\\Users\\ron\\mydir)");
System.out.println("[Acceptable file types: .xml, .html, .html, .txt]");
s = br.readLine();
if (s.equalsIgnoreCase("q")) {
break;
}
//try to add file into the index
indexer.indexFileOrDirectory(s);
} catch (Exception e) {
System.out.println("Error indexing " + s + " : " + e.getMessage());
}
}
//===================================================
//after adding, we always have to call the
//closeIndex, otherwise the index is not created
//===================================================
indexer.closeIndex();
//=========================================================
// Now search
//=========================================================
IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(indexLocation)));
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(5, true);
s = "";
while (!s.equalsIgnoreCase("q")) {
try {
System.out.println("Enter the search query (q=quit):");
s = br.readLine();
if (s.equalsIgnoreCase("q")) {
break;
}
Query q = new QueryParser(Version.LUCENE_44, "contents", analyzer).parse(s);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// 4. display results
System.out.println("Found " + hits.length + " hits.");
for(int i=0;i<hits.length;++i) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
System.out.println((i + 1) + ". " + d.get("path") + " score=" + hits[i].score);
}
} catch (Exception e) {
System.out.println("Error searching " + s + " : " + e.getMessage());
}
}
}
/**
* Constructor
* @param indexDir the name of the folder in which the index should be created
* @throws java.io.IOException when an exception occurs creating the index.
*/
TextFileIndexer(String indexDir) throws IOException {
// the boolean true parameter means to create a new index every time,
// potentially overwriting any existing files there.
FSDirectory dir = FSDirectory.open(new File(indexDir));
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_44, analyzer);
writer = new IndexWriter(dir, config);
}
/**
* Indexes a file or directory
* @param fileName the name of a text file or a folder we wish to add to the index
* @throws java.io.IOException when an exception occurs while indexing
*/
public void indexFileOrDirectory(String fileName) throws IOException {
//===================================================
//gets the list of files in a folder (if user has submitted
//the name of a folder) or gets a single file name (if the user
//has submitted only the file name)
//===================================================
addFiles(new File(fileName));
int originalNumDocs = writer.numDocs();
for (File f : queue) {
FileReader fr = null;
try {
Document doc = new Document();
//===================================================
// add contents of file
//===================================================
fr = new FileReader(f);
// doc.add(new TextField("contents", fr));
doc.add(new StringField("path", f.getPath(), Field.Store.YES));
doc.add(new StringField("filename", f.getName(), Field.Store.YES));
writer.addDocument(doc);
System.out.println("Added: " + f);
BufferedReader br = new BufferedReader(new FileReader(fileName));
Field field = new StringField("contents", br.readLine().toString(),
Field.Store.YES);
doc.add(field);
writer.addDocument(doc);
} catch (Exception e) {
System.out.println("Could not add: " + f);
} finally {
fr.close();
}
}
int newNumDocs = writer.numDocs();
System.out.println("");
System.out.println("************************");
System.out.println((newNumDocs - originalNumDocs) + " documents added.");
System.out.println("************************");
queue.clear();
}
private void addFiles(File file) {
if (!file.exists()) {
System.out.println(file + " does not exist.");
}
if (file.isDirectory()) {
for (File f : file.listFiles()) {
addFiles(f);
}
} else {
String filename = file.getName().toLowerCase();
//===================================================
// Only index text files
//===================================================
if (filename.endsWith(".htm") || filename.endsWith(".html") ||
filename.endsWith(".xml") || filename.endsWith(".txt") || filename.endsWith(".pdf") ) {
queue.add(file);
} else {
System.out.println("Skipped " + filename);
}
}
}
/**
* Close the index.
* @throws java.io.IOException when an exception occurs while closing
*/
public void closeIndex() throws IOException {
writer.close();
}
}
But when I search for a particular string in the file, I get String Not Found. The output is as follows:
Enter the path where the index will be created: (e.g. /tmp/index or c:/temp/index)
D:/svn/phase2/JavaSource/test/test/
Enter the full path to add into the index (q=quit): (e.g. /home/ron/mydir or c:\Users\ron\mydir)
[Acceptable file types: .xml, .html, .html, .txt]
D:/svn/phase2/JavaSource/test/test
Skipped segments.gen
Skipped segments_1
Skipped write.lock
Added fileName : D:/svn/phase2/JavaSource/test/test
Added: D:\svn\phase2\JavaSource\test\test\demo.xml
Added fileName : D:/svn/phase2/JavaSource/test/test
Added: D:\svn\phase2\JavaSource\test\test\exe.xml
Added fileName : D:/svn/phase2/JavaSource/test/test
Added: D:\svn\phase2\JavaSource\test\test\Fruit.XML
Added fileName : D:/svn/phase2/JavaSource/test/test
Added: D:\svn\phase2\JavaSource\test\test\Influence_People.pdf
Added fileName : D:/svn/phase2/JavaSource/test/test
Added: D:\svn\phase2\JavaSource\test\test\new.html
Added fileName : D:/svn/phase2/JavaSource/test/test
Added: D:\svn\phase2\JavaSource\test\test\Toy.xml
************************
6 documents added.
************************
Enter the full path to add into the index (q=quit): (e.g. /home/ron/mydir or c:\Users\ron\mydir)
[Acceptable file types: .xml, .html, .html, .txt]
q
Enter the search query (q=quit):
for
Entered String is : for
fieldName =for
Found : 0 hits.
Enter the search query (q=quit):
i
Entered String is : i
Error searching i : this IndexReader is closed
Enter the search query (q=quit):
q
Entered String is : q
"for" and "i" are both stopwords, by default, in StandardAnalyzer, and so can not eally be searched for. The full list of default stop words is:
"a", "an", "and", "are", "as", "at", "be", "but", "by",
"for", "if", "in", "into", "is", "it",
"no", "not", "of", "on", "or", "such",
"that", "the", "their", "then", "there", "these",
"they", "this", "to", "was", "will", "with"
It seems likely there are other issues at work. I don't know why your reader would be closed for the second query, and I don't know where the output "fieldName =for" is coming from either. But hopefully that gets you started on debugging.
Have you tried debugging your code in Luke? (Lucene Index Toolbox)
http://code.google.com/p/luke/
Luke is really nice in performing searches using different analyzers, inspecting the index storage, understanding how documents are scored based on searches etc. It can help eliminate any problems with search code, since it directly works on the index files.
Luke works for both Java and .NET versions of Lucene.
I am using Lucene's Highlighter class to highlight fragments of matched search results and it works well. I would like to switch from searching with the StandardAnalyzer to the EnglishAnalyzer, which will perform stemming of terms.
The search results are good, but now the highlighter doesn't always find a match. Here's an example of what I'm looking at:
document field text 1: Everyone likes goats.
document field text 2: I have a goat that eats everything.
Using the EnglishAnalyzer and searching for "goat", both documents are matched, but the highlighter is only able to find a matched fragment from document 2. Is there a way to have the highlighter return data for both documents?
I understand that the characters are different for the tokens, but the same tokens are still there, so it seems reasonable for it to just highlight whatever token is present at that location.
If it helps, this is using Lucene 3.5.
I found a solution to this problem. I changed from using the Highlighter class to using the FastVectorHighlighter. It looks like I'll pick up some speed improvements too (at the expense of storage of term vector data). For the benefit of anyone coming across this question later, here's a unit test showing how this all works together:
package com.sample.index;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.vectorhighlight.*;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.Before;
import org.junit.Test;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import static junit.framework.Assert.assertEquals;
public class TestIndexStuff {
public static final String FIELD_NORMAL = "normal";
public static final String[] PRE_TAGS = new String[]{"["};
public static final String[] POST_TAGS = new String[]{"]"};
private IndexSearcher searcher;
private Analyzer analyzer = new EnglishAnalyzer(Version.LUCENE_35);
@Before
public void init() throws IOException {
RAMDirectory idx = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, analyzer);
IndexWriter writer = new IndexWriter(idx, config);
addDocs(writer);
writer.close();
searcher = new IndexSearcher(IndexReader.open(idx));
}
private void addDocs(IndexWriter writer) throws IOException {
for (String text : new String[] {
"Pretty much everyone likes goats.",
"I have a goat that eats everything.",
"goats goats goats goats goats"}) {
Document doc = new Document();
doc.add(new Field(FIELD_NORMAL, text, Field.Store.YES,
Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
writer.addDocument(doc);
}
}
private FastVectorHighlighter makeHighlighter() {
FragListBuilder fragListBuilder = new SimpleFragListBuilder(200);
FragmentsBuilder fragmentBuilder = new SimpleFragmentsBuilder(PRE_TAGS, POST_TAGS);
return new FastVectorHighlighter(true, true, fragListBuilder, fragmentBuilder);
}
@Test
public void highlight() throws ParseException, IOException {
Query query = new QueryParser(Version.LUCENE_35, FIELD_NORMAL, analyzer)
.parse("goat");
FastVectorHighlighter highlighter = makeHighlighter();
FieldQuery fieldQuery = highlighter.getFieldQuery(query);
TopDocs topDocs = searcher.search(query, 10);
List<String> fragments = new ArrayList<String>();
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
fragments.add(highlighter.getBestFragment(fieldQuery, searcher.getIndexReader(),
scoreDoc.doc, FIELD_NORMAL, 10000));
}
assertEquals(3, fragments.size());
assertEquals("[goats] [goats] [goats] [goats] [goats]", fragments.get(0).trim());
assertEquals("Pretty much everyone likes [goats].", fragments.get(1).trim());
assertEquals("I have a [goat] that eats everything.", fragments.get(2).trim());
}
}
Given a document {'foo', 'bar', 'baz'}, I want to match it using a SpanNearQuery with the tokens {'baz', 'extra'}.
But this fails.
How do I get around this?
Sample test (using Lucene 2.9.1) with the following results:
givenSingleMatch - PASS
givenTwoMatches - PASS
givenThreeMatches - PASS
givenSingleMatch_andExtraTerm - FAIL
...
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.After;
import org.junit.Assert;
import org.junit.Before;
import org.junit.Test;
import java.io.IOException;
public class SpanNearQueryTest {
private RAMDirectory directory = null;
private static final String BAZ = "baz";
private static final String BAR = "bar";
private static final String FOO = "foo";
private static final String TERM_FIELD = "text";
@Before
public void given() throws IOException {
directory = new RAMDirectory();
IndexWriter writer = new IndexWriter(
directory,
new StandardAnalyzer(Version.LUCENE_29),
IndexWriter.MaxFieldLength.UNLIMITED);
Document doc = new Document();
doc.add(new Field(TERM_FIELD, FOO, Field.Store.NO, Field.Index.ANALYZED));
doc.add(new Field(TERM_FIELD, BAR, Field.Store.NO, Field.Index.ANALYZED));
doc.add(new Field(TERM_FIELD, BAZ, Field.Store.NO, Field.Index.ANALYZED));
writer.addDocument(doc);
writer.commit();
writer.optimize();
writer.close();
}
@After
public void cleanup() {
directory.close();
}
@Test
public void givenSingleMatch() throws IOException {
SpanNearQuery spanNearQuery = new SpanNearQuery(
new SpanQuery[] {
new SpanTermQuery(new Term(TERM_FIELD, FOO))
}, Integer.MAX_VALUE, false);
TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);
Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
}
@Test
public void givenTwoMatches() throws IOException {
SpanNearQuery spanNearQuery = new SpanNearQuery(
new SpanQuery[] {
new SpanTermQuery(new Term(TERM_FIELD, FOO)),
new SpanTermQuery(new Term(TERM_FIELD, BAR))
}, Integer.MAX_VALUE, false);
TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);
Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
}
@Test
public void givenThreeMatches() throws IOException {
SpanNearQuery spanNearQuery = new SpanNearQuery(
new SpanQuery[] {
new SpanTermQuery(new Term(TERM_FIELD, FOO)),
new SpanTermQuery(new Term(TERM_FIELD, BAR)),
new SpanTermQuery(new Term(TERM_FIELD, BAZ))
}, Integer.MAX_VALUE, false);
TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);
Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
}
@Test
public void givenSingleMatch_andExtraTerm() throws IOException {
SpanNearQuery spanNearQuery = new SpanNearQuery(
new SpanQuery[] {
new SpanTermQuery(new Term(TERM_FIELD, BAZ)),
new SpanTermQuery(new Term(TERM_FIELD, "EXTRA"))
},
Integer.MAX_VALUE, false);
TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);
Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
}
}
SpanNearQuery lets you find terms that are within a certain distance of each other.
Example (from http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/):
Say we want to find lucene within 5
positions of doug, with doug following
lucene (order matters) – you could use
the following SpanQuery:
new SpanNearQuery(new SpanQuery[] {
new SpanTermQuery(new Term(FIELD, "lucene")),
new SpanTermQuery(new Term(FIELD, "doug"))},
5,
true);
(source: lucidimagination.com)
In this sample text, Lucene is within
3 of Doug
But for your example, the only match I can see is that both your query and the target document contain "baz" (and I am making the assumption that all of those terms are in a single field). In that case, you don't need to use any special query type. Using the standard mechanisms, you will get some non-zero weighting based on the fact that they both contain the same term in the same field.
Edit 3 - in response to the latest comment: the answer is that you cannot use SpanNearQuery to do anything other than what it is intended for, which is to find out whether multiple terms in a document occur within a certain number of positions of each other. I can't tell what your specific use case / expected results are (feel free to post them), but in the last case, if you only want to find out whether one or more of ("BAZ", "EXTRA") is in the document, a BooleanQuery will work just fine.
Edit 4 - now that you have posted your use case, I understand what it is you want to do. Here is how you can do it: use a BooleanQuery as mentioned above to combine the individual terms you want as well as the SpanNearQuery, and set a boost on the SpanNearQuery.
So, the query in text form would look like:
BAZ OR EXTRA OR "BAZ EXTRA"~100^5
(as an example - this would match all documents containing either "BAZ" or "EXTRA", but assign a higher score to documents where the terms "BAZ" and "EXTRA" occur within 100 positions of each other; adjust the proximity and boost as you like. This example is from the Solr cookbook, so it may not parse in Lucene, or may give undesirable results. That's OK, because in the next section I show you how to build this using the API).
Programmatically, you would construct this as follows:
BooleanQuery top = new BooleanQuery();
// Construct the terms since they will be used more than once
Term bazTerm = new Term("Field", "BAZ");
Term extraTerm = new Term("Field", "EXTRA");
// Add each term as "should" since we want a partial match
top.add(new TermQuery(bazTerm), BooleanClause.Occur.SHOULD);
top.add(new TermQuery(extraTerm), BooleanClause.Occur.SHOULD);
// Construct the SpanNearQuery, with slop 100 - a document will get a boost only
// if BAZ and EXTRA occur within 100 places of each other. The final parameter means
// that BAZ must occur before EXTRA.
SpanNearQuery spanQuery = new SpanNearQuery(
new SpanQuery[] { new SpanTermQuery(bazTerm),
new SpanTermQuery(extraTerm) },
100, true);
// Give it a boost of 5 since it is more important that the words are together
spanQuery.setBoost(5f);
// Add it as "should" since we want a match even when we don't have proximity
top.add(spanQuery, BooleanClause.Occur.SHOULD);
Hope that helps! In the future, try to start off by posting exactly what results you are expecting - even if it is obvious to you, it may not be to the reader, and being explicit can avoid having to go back and forth so many times.