I'm looking for general advice on how to search identifiers, product codes, or phone numbers in Apache Lucene 8.x. Let's say I'm trying to search lists of product codes (such as ISBNs, for example 978-3-86680-192-9). If somebody enters 9783, 978 3, or 978-3, then 978-3-86680-192-9 should appear. The same should happen if an identifier uses any combination of letters, spaces, digits, and punctuation (examples: TS 123, 123.abc). How would I do this?
I thought I could solve this with a custom analyzer that removes all the punctuation and whitespace, but the results are mixed:
public class IdentifierAnalyzer extends Analyzer {
@Override
protected TokenStreamComponents createComponents(String fieldName) {
Tokenizer tokenizer = new KeywordTokenizer();
TokenStream tokenStream = new LowerCaseFilter(tokenizer);
tokenStream = new PatternReplaceFilter(tokenStream, Pattern.compile("[^0-9a-z]"), "", true);
tokenStream = new TrimFilter(tokenStream);
return new TokenStreamComponents(tokenizer, tokenStream);
}
@Override
protected TokenStream normalize(String fieldName, TokenStream in) {
TokenStream tokenStream = new LowerCaseFilter(in);
tokenStream = new PatternReplaceFilter(tokenStream, Pattern.compile("[^0-9a-z]"), "", true);
tokenStream = new TrimFilter(tokenStream);
return tokenStream;
}
}
So while I get the desired results when performing a PrefixQuery with TS1*, TS 1* (with whitespace) does not yield satisfactory results. When I look into the parsed query, I see that Lucene splits TS 1* into two clauses: myField:TS myField:1*. WordDelimiterGraphFilter looks interesting, but I couldn't figure out how to apply it here.
This is not a comprehensive answer - but I agree that WordDelimiterGraphFilter may be helpful for this type of data. However, there could still be test cases which need additional handling.
Here is my custom analyzer, using a WordDelimiterGraphFilter:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.KeywordTokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilterFactory;
import java.util.Map;
import java.util.HashMap;
public class IdentifierAnalyzer extends Analyzer {
private WordDelimiterGraphFilterFactory getWordDelimiter() {
Map<String, String> settings = new HashMap<>();
settings.put("generateWordParts", "1"); // e.g. "PowerShot" => "Power" "Shot"
settings.put("generateNumberParts", "1"); // e.g. "500-42" => "500" "42"
settings.put("catenateAll", "1"); // e.g. "wi-fi" => "wifi" and "500-42" => "50042"
settings.put("preserveOriginal", "1"); // e.g. "500-42" => "500" "42" "500-42"
settings.put("splitOnCaseChange", "1"); // e.g. "fooBar" => "foo" "Bar"
return new WordDelimiterGraphFilterFactory(settings);
}
@Override
protected TokenStreamComponents createComponents(String fieldName) {
Tokenizer tokenizer = new KeywordTokenizer();
TokenStream tokenStream = new LowerCaseFilter(tokenizer);
tokenStream = getWordDelimiter().create(tokenStream);
return new TokenStreamComponents(tokenizer, tokenStream);
}
@Override
protected TokenStream normalize(String fieldName, TokenStream in) {
TokenStream tokenStream = new LowerCaseFilter(in);
return tokenStream;
}
}
It uses the WordDelimiterGraphFilterFactory helper, together with a map of parameters, to control which settings are applied.
You can see the complete list of available settings in the WordDelimiterGraphFilterFactory JavaDoc. You may want to experiment with setting/unsetting different ones.
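If you prefer to skip the factory, the same configuration can also be expressed as a flags bitmask passed directly to the filter. This is only a minimal sketch of an alternative createComponents, assuming the (TokenStream, int flags, CharArraySet protWords) constructor and the flag constants available on WordDelimiterGraphFilter in Lucene 8.x (you would import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilter instead of the factory):
@Override
protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer tokenizer = new KeywordTokenizer();
    TokenStream tokenStream = new LowerCaseFilter(tokenizer);
    // Same options as the factory settings above, combined into one bitmask
    int flags = WordDelimiterGraphFilter.GENERATE_WORD_PARTS
            | WordDelimiterGraphFilter.GENERATE_NUMBER_PARTS
            | WordDelimiterGraphFilter.CATENATE_ALL
            | WordDelimiterGraphFilter.PRESERVE_ORIGINAL
            | WordDelimiterGraphFilter.SPLIT_ON_CASE_CHANGE;
    // Last argument is an optional set of protected words that should never be split
    tokenStream = new WordDelimiterGraphFilter(tokenStream, flags, null);
    return new TokenStreamComponents(tokenizer, tokenStream);
}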
Here is a test index builder for the following 3 input values:
978-3-86680-192-9
TS 123
123.abc
public static void buildIndex() throws IOException, FileNotFoundException, ParseException {
final Directory dir = FSDirectory.open(Paths.get(INDEX_PATH));
Analyzer analyzer = new IdentifierAnalyzer();
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
iwc.setOpenMode(OpenMode.CREATE);
Document doc;
List<String> identifiers = Arrays.asList("978-3-86680-192-9", "TS 123", "123.abc");
try (IndexWriter writer = new IndexWriter(dir, iwc)) {
for (String identifier : identifiers) {
doc = new Document();
doc.add(new TextField("identifiers", identifier, Field.Store.YES));
writer.addDocument(doc);
}
}
}
With these settings, each value is indexed as the original string, the fully catenated form (for example 9783866801929), and the individual word/number parts.
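If you want to see exactly which tokens the analyzer emits for each value, you can consume a TokenStream directly. This is a small sketch using the standard reset/incrementToken/end pattern; the printTokens helper name and the java.io.IOException import are my additions:
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public static void printTokens(Analyzer analyzer, String text) throws IOException {
    try (TokenStream ts = analyzer.tokenStream("identifiers", text)) {
        CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
        ts.reset();                  // must be called before incrementToken()
        while (ts.incrementToken()) {
            System.out.println(termAtt.toString());
        }
        ts.end();                    // finish consuming the stream
    }
}
For example, printTokens(new IdentifierAnalyzer(), "TS 123") should show the lower-cased original value, the catenated form ts123, and the parts ts and 123.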
For querying the above indexed data I used this:
public static void doSearch() throws IOException, ParseException {
Analyzer analyzer = new IdentifierAnalyzer();
QueryParser parser = new QueryParser("identifiers", analyzer);
List<String> searches = Arrays.asList("9783", "9783*", "978 3", "978-3", "TS1*", "TS 1*");
for (String search : searches) {
Query query = parser.parse(search);
printHits(query, search);
}
}
private static void printHits(Query query, String search) throws IOException {
System.out.println("search term: " + search);
System.out.println("parsed query: " + query.toString());
IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(INDEX_PATH)));
IndexSearcher searcher = new IndexSearcher(reader);
TopDocs results = searcher.search(query, 100);
ScoreDoc[] hits = results.scoreDocs;
System.out.println("hits: " + hits.length);
for (ScoreDoc hit : hits) {
System.out.println("");
System.out.println(" doc id: " + hit.doc + "; score: " + hit.score);
Document doc = searcher.doc(hit.doc);
System.out.println(" identifier: " + doc.get("identifiers"));
}
System.out.println("-----------------------------------------");
}
This uses the following search terms - all of which I pass into the classic query parser (though you could, of course, use more sophisticated query types via the API):
9783
9783*
978 3
978-3
TS1*
TS 1*
The only query which failed to find any matching documents was the first one:
search term: 9783
parsed query: identifiers:9783
hits: 0
This should not be a surprise, since this is a partial token, without a wildcard. The second query (with the wildcard added) found one document, as expected.
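If unsuffixed partial input like 9783 also needs to match, one option (my own suggestion, not part of the original test) is to bypass the query parser and build a PrefixQuery directly from the normalized user input. Note that PrefixQuery terms are not analyzed, so the prefix must be lower-cased and stripped of punctuation the same way the index side does it; the buildPrefixQuery helper name is mine:
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;

private static Query buildPrefixQuery(String userInput) {
    // Apply the same normalization as the analyzer: lower-case, drop non-alphanumerics
    String normalized = userInput.toLowerCase().replaceAll("[^0-9a-z]", "");
    return new PrefixQuery(new Term("identifiers", normalized));
}
With this, buildPrefixQuery("9783"), buildPrefixQuery("978 3"), and buildPrefixQuery("978-3") all produce identifiers:9783*, which matches the catenated token 9783866801929.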
The final query I tested, TS 1*, found three hits - but the one we want has the best matching score:
search term: TS 1*
parsed query: identifiers:ts identifiers:1*
hits: 3
doc id: 1; score: 1.590861
identifier: TS 123
doc id: 0; score: 1.0
identifier: 978-3-86680-192-9
doc id: 2; score: 1.0
identifier: 123.abc
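As an aside (not part of the original test run): if you only want the intended document back for multi-clause input like TS 1*, you can make the classic query parser require every clause by switching its default operator to AND:
QueryParser parser = new QueryParser("identifiers", new IdentifierAnalyzer());
parser.setDefaultOperator(QueryParser.Operator.AND);
Query query = parser.parse("TS 1*");   // should now parse as +identifiers:ts +identifiers:1*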
Related
I'm attempting to create a custom analyser with multiple filters applied.
The issue is that only the last filter (LowerCaseFilter) is applied.
public class CustomAnalyzer : Analyzer
{
protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
{
Tokenizer tokenizer = new KeywordTokenizer(reader);
//Remove basic stop words a, an, the, in, on etc
TokenStream result = new StopFilter(GlobalVariables.LuceneVersion, tokenizer, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
//Remove tile/tiles
CharArraySet stopWords = new CharArraySet(GlobalVariables.LuceneVersion, 1, true)
{
"test",
};
result = new StopFilter(GlobalVariables.LuceneVersion, tokenizer, stopWords);
//Make case insenstive
result = new LowerCaseFilter(GlobalVariables.LuceneVersion, tokenizer);
return new TokenStreamComponents(tokenizer, result);
}
}
Don't pass the tokenizer into each filter, pass the previous filter in.
Tokenizer tokenizer = new KeywordTokenizer(reader);
TokenStream result = new StopFilter(GlobalVariables.LuceneVersion, tokenizer, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
CharArraySet stopWords = new CharArraySet(GlobalVariables.LuceneVersion, 1, true) { "test" };
result = new StopFilter(GlobalVariables.LuceneVersion, result, stopWords);
result = new LowerCaseFilter(GlobalVariables.LuceneVersion, result);
return new TokenStreamComponents(tokenizer, result);
I am using Lucene 3.6. I want to know why update does not work. Is there anything wrong?
public class TokenTest
{
private static String IndexPath = "D:\\update\\index";
private static Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_33);
public static void main(String[] args) throws Exception
{
try
{
update();
display("content", "content");
}
catch (IOException e)
{
e.printStackTrace();
}
}
@SuppressWarnings("deprecation")
public static void display(String keyField, String words) throws Exception
{
IndexSearcher searcher = new IndexSearcher(FSDirectory.open(new File(IndexPath)));
Term term = new Term(keyField, words);
Query query = new TermQuery(term);
TopDocs results = searcher.search(query, 100);
ScoreDoc[] hits = results.scoreDocs;
for (ScoreDoc hit : hits)
{
Document doc = searcher.doc(hit.doc);
System.out.println("doc_id = " + hit.doc);
System.out.println("内容: " + doc.get("content"));
System.out.println("路径:" + doc.get("path"));
}
}
public static String update() throws Exception
{
IndexWriterConfig writeConfig = new IndexWriterConfig(Version.LUCENE_33, analyzer);
IndexWriter writer = new IndexWriter(FSDirectory.open(new File(IndexPath)), writeConfig);
Document document = new Document();
Field field_name2 = new Field("path", "update_path", Field.Store.YES, Field.Index.ANALYZED);
Field field_content2 = new Field("content", "content update", Field.Store.YES, Field.Index.ANALYZED);
document.add(field_name2);
document.add(field_content2);
Term term = new Term("path", "qqqqq");
writer.updateDocument(term, document);
writer.optimize();
writer.close();
return "update_path";
}
}
I assume you want to update your document such that field "path" = "qqqq". You have this exactly backwards (please read the documentation).
updateDocument performs two steps:
Find and delete any documents containing term
In this case, none are found, because your indexed document does not contain path:qqqq
Add the new document to the index.
You appear to be doing the opposite: trying to look up by document and then add the term to it, and it doesn't work that way. What you are looking for, I believe, is something like:
Term term = new Term("content", "update");
document.removeField("path");
document.add("path", "qqqq");
writer.updateDocument(term, document);
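For completeness, here is a minimal sketch of that pattern using the Lucene 3.x API from the question. The value old_path is hypothetical and stands for whatever was actually indexed in the path field earlier; writer is an already open IndexWriter:
// Key the update on a term the existing document really contains
Term key = new Term("path", "old_path");   // hypothetical: the value indexed earlier

Document replacement = new Document();
replacement.add(new Field("path", "update_path", Field.Store.YES, Field.Index.ANALYZED));
replacement.add(new Field("content", "content update", Field.Store.YES, Field.Index.ANALYZED));

// Deletes all documents matching path:old_path, then adds the replacement document
writer.updateDocument(key, replacement);
writer.close();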
How do I implement a "Did you mean" / spell-checker feature in the Lucene full-text search engine?
After you've created the main index, you can create the spell-checker index from a dictionary built on it using:
public void createSpellCheckerIndex() throws CorruptIndexException,
IOException {
final IndexReader reader = IndexReader.open(this.indexDirectory, true);
final Dictionary dictionary = new LuceneDictionary(reader,
LuceneExample.FIELD);
final SpellChecker spellChecker = new SpellChecker(this.spellDirectory);
final Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
final IndexWriterConfig writerConfig = new IndexWriterConfig(
Version.LUCENE_36, analyzer);
spellChecker.indexDictionary(dictionary, writerConfig, true);
spellChecker.close();
}
and then ask for an array of suggestions with:
public String[] getSuggestions(final String queryString,
final int numberOfSuggestions, final float accuracy) {
try {
final SpellChecker spellChecker = new SpellChecker(
this.spellDirectory);
final String[] similarWords = spellChecker.suggestSimilar(
queryString, numberOfSuggestions, accuracy);
return similarWords;
} catch (final Exception e) {
return new String[0];
}
}
Example:
After indexing the following documents:
luceneExample.index("spell checker");
luceneExample.index("did you mean");
luceneExample.index("hello, this is a test");
luceneExample.index("Lucene is great");
and after creating the spell index with the method above, I tried searching for the string "lucete", asking for suggestions with:
final String query = "lucete";
final String[] suggestions = luceneExample.getSuggestions(query, 5,
0.2f);
System.out.println("Did you mean:\n" + Arrays.toString(suggestions));
This was the output:
Did you mean:
[lucene]
I want to search for special characters in my index.
I escaped all the special characters in the query string, but when I run a query such as + against the index, Lucene parses it as +().
Hence it searches no fields.
How can I solve this problem? My index contains these special characters.
If you are using the StandardAnalyzer, that will discard non-alphanumeric characters. Try indexing the same value with a WhitespaceAnalyzer and see if that preserves the characters you need. It might also keep stuff you don't want: that's when you might consider writing your own Analyzer, which basically means creating a TokenStream stack that does exactly the kind of processing you need.
For example, the SimpleAnalyzer implements the following pipeline:
@Override
public TokenStream tokenStream(String fieldName, Reader reader) {
return new LowerCaseTokenizer(reader);
}
which just lower-cases the tokens.
The StandardAnalyzer does much more:
/** Constructs a {@link StandardTokenizer} filtered by a {@link
StandardFilter}, a {@link LowerCaseFilter} and a {@link StopFilter}. */
@Override
public TokenStream tokenStream(String fieldName, Reader reader) {
StandardTokenizer tokenStream = new StandardTokenizer(matchVersion, reader);
tokenStream.setMaxTokenLength(maxTokenLength);
TokenStream result = new StandardFilter(tokenStream);
result = new LowerCaseFilter(result);
result = new StopFilter(enableStopPositionIncrements, result, stopSet);
return result;
}
You can mix & match from these and other components in org.apache.lucene.analysis, or you can write your own specialized TokenStream instances that are wrapped into a processing pipeline by your custom Analyzer.
One other thing to look at is what sort of CharTokenizer you're using. CharTokenizer is an abstract class that specifies the machinery for tokenizing text strings. It's used by some simpler Analyzers (but not by the StandardAnalyzer). Lucene comes with two subclasses: a LetterTokenizer and a WhitespaceTokenizer. You can create your own that keeps the characters you need and breaks on those you don't by implementing the boolean isTokenChar(char c) method.
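For example, a minimal CharTokenizer subclass could keep letters, digits, and a chosen set of special characters and break on everything else. This is only a sketch against the older Lucene 2.x/3.0 API that matches the isTokenChar(char c) signature mentioned above (newer versions take an int code point and use a different constructor), and the IdentifierTokenizer name is mine:
import java.io.Reader;
import org.apache.lucene.analysis.CharTokenizer;

public class IdentifierTokenizer extends CharTokenizer {
    public IdentifierTokenizer(Reader reader) {
        super(reader);
    }

    @Override
    protected boolean isTokenChar(char c) {
        // Keep letters, digits and a few special characters; break tokens on everything else
        return Character.isLetterOrDigit(c) || c == '-' || c == '+' || c == '#';
    }
}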
This may no longer be relevant for the original author, but to be able to search special characters you need to:
Create a custom analyzer
Use it for both indexing and searching
Here is an example of how it works for me:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.RAMDirectory;
import org.junit.Test;
import java.io.IOException;
import static org.hamcrest.Matchers.equalTo;
import static org.junit.Assert.assertThat;
public class LuceneSpecialCharactersSearchTest {
/**
* Test that tries to search a string by some substring with each special character separately.
*/
@Test
public void testSpecialCharacterSearch() throws Exception {
// GIVEN
LuceneSpecialCharactersSearch service = new LuceneSpecialCharactersSearch();
String[] luceneSpecialCharacters = new String[]{"+", "-", "&&", "||", "!", "(", ")", "{", "}", "[", "]", "^", "\"", "~", "*", "?", ":", "\\"};
// WHEN
for (String specialCharacter : luceneSpecialCharacters) {
String actual = service.search("list's special-characters " + specialCharacter);
// THEN
assertThat(actual, equalTo(LuceneSpecialCharactersSearch.TEXT_WITH_SPECIAL_CHARACTERS));
}
}
private static class LuceneSpecialCharactersSearch {
private static final String TEXT_WITH_SPECIAL_CHARACTERS = "This is the list's of special-characters + - && || ! ( ) { } [ ] ^ \" ~ ? : \\ *";
private final IndexWriter writer;
public LuceneSpecialCharactersSearch() throws Exception {
Document document = new Document();
document.add(new TextField("body", TEXT_WITH_SPECIAL_CHARACTERS, Field.Store.YES));
RAMDirectory directory = new RAMDirectory();
writer = new IndexWriter(directory, new IndexWriterConfig(buildAnalyzer()));
writer.addDocument(document);
writer.commit();
}
public String search(String queryString) throws Exception {
try (IndexReader reader = DirectoryReader.open(writer, false)) {
IndexSearcher searcher = new IndexSearcher(reader);
String escapedQueryString = QueryParser.escape(queryString).toLowerCase();
Analyzer analyzer = buildAnalyzer();
QueryParser bodyQueryParser = new QueryParser("body", analyzer);
bodyQueryParser.setDefaultOperator(QueryParser.Operator.AND);
Query bodyQuery = bodyQueryParser.parse(escapedQueryString);
BooleanQuery query = new BooleanQuery.Builder()
.add(new BooleanClause(bodyQuery, BooleanClause.Occur.SHOULD))
.build();
TopDocs searchResult = searcher.search(query, 1);
return searcher.doc(searchResult.scoreDocs[0].doc).getField("body").stringValue();
}
}
/**
* Builds analyzer that is used for indexing and searching.
*/
private static Analyzer buildAnalyzer() throws IOException {
return CustomAnalyzer.builder()
.withTokenizer("whitespace")
.addTokenFilter("lowercase")
.addTokenFilter("standard")
.build();
}
}
}
Given a document {'foo', 'bar', 'baz'}, I want to match using SpanNearQuery with the tokens {'baz', 'extra'}
But this fails.
How do I get around this?
Sample test (using lucene 2.9.1) with the following results:
givenSingleMatch - PASS
givenTwoMatches - PASS
givenThreeMatches - PASS
givenSingleMatch_andExtraTerm - FAIL
...
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.After;
import org.junit.Assert;
import org.junit.Before;
import org.junit.Test;
import java.io.IOException;
public class SpanNearQueryTest {
private RAMDirectory directory = null;
private static final String BAZ = "baz";
private static final String BAR = "bar";
private static final String FOO = "foo";
private static final String TERM_FIELD = "text";
@Before
public void given() throws IOException {
directory = new RAMDirectory();
IndexWriter writer = new IndexWriter(
directory,
new StandardAnalyzer(Version.LUCENE_29),
IndexWriter.MaxFieldLength.UNLIMITED);
Document doc = new Document();
doc.add(new Field(TERM_FIELD, FOO, Field.Store.NO, Field.Index.ANALYZED));
doc.add(new Field(TERM_FIELD, BAR, Field.Store.NO, Field.Index.ANALYZED));
doc.add(new Field(TERM_FIELD, BAZ, Field.Store.NO, Field.Index.ANALYZED));
writer.addDocument(doc);
writer.commit();
writer.optimize();
writer.close();
}
@After
public void cleanup() {
directory.close();
}
@Test
public void givenSingleMatch() throws IOException {
SpanNearQuery spanNearQuery = new SpanNearQuery(
new SpanQuery[] {
new SpanTermQuery(new Term(TERM_FIELD, FOO))
}, Integer.MAX_VALUE, false);
TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);
Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
}
@Test
public void givenTwoMatches() throws IOException {
SpanNearQuery spanNearQuery = new SpanNearQuery(
new SpanQuery[] {
new SpanTermQuery(new Term(TERM_FIELD, FOO)),
new SpanTermQuery(new Term(TERM_FIELD, BAR))
}, Integer.MAX_VALUE, false);
TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);
Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
}
@Test
public void givenThreeMatches() throws IOException {
SpanNearQuery spanNearQuery = new SpanNearQuery(
new SpanQuery[] {
new SpanTermQuery(new Term(TERM_FIELD, FOO)),
new SpanTermQuery(new Term(TERM_FIELD, BAR)),
new SpanTermQuery(new Term(TERM_FIELD, BAZ))
}, Integer.MAX_VALUE, false);
TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);
Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
}
@Test
public void givenSingleMatch_andExtraTerm() throws IOException {
SpanNearQuery spanNearQuery = new SpanNearQuery(
new SpanQuery[] {
new SpanTermQuery(new Term(TERM_FIELD, BAZ)),
new SpanTermQuery(new Term(TERM_FIELD, "EXTRA"))
},
Integer.MAX_VALUE, false);
TopDocs topDocs = new IndexSearcher(IndexReader.open(directory)).search(spanNearQuery, 100);
Assert.assertEquals("Should have made a match.", 1, topDocs.scoreDocs.length);
}
}
SpanNearQuery lets you find terms that are within a certain distance of each other.
Example (from http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/):
Say we want to find lucene within 5 positions of doug, with doug following lucene (order matters) – you could use the following SpanQuery:
new SpanNearQuery(new SpanQuery[] {
new SpanTermQuery(new Term(FIELD, "lucene")),
new SpanTermQuery(new Term(FIELD, "doug"))},
5,
true);
(source: lucidimagination.com)
In this sample text, Lucene is within 3 of Doug.
But for your example, the only match I can see is that both your query and the target document have "cd" (and I am making the assumption that all of those terms are in a single field). In that case, you don't need to use any special query type. Using the standard mechanisms, you will get some non-zero weighting based on the fact that they both contain the same term in the same field.
Edit 3 - in response to the latest comment, the answer is that you cannot use SpanNearQuery to do anything other than what it is intended for, which is to find out whether multiple terms in a document occur within a certain number of places of each other. I can't tell what your specific use case / expected results are (feel free to post them), but in the last case, if you only want to find out whether one or more of ("BAZ", "EXTRA") is in the document, a BooleanQuery will work just fine.
Edit 4 - now that you have posted your use case, I understand what it is you want to do. Here is how you can do it: use a BooleanQuery as mentioned above to combine the individual terms you want as well as the SpanNearQuery, and set a boost on the SpanNearQuery.
So, the query in text form would look like:
BAZ OR EXTRA OR "BAZ EXTRA"~100^5
(as an example - this would match all documents containing either "BAZ" or "EXTRA", but assign a higher score to documents where the terms "BAZ" and "EXTRA" occur within 100 places of each other; adjust the position and boost as you like. This example is from the Solr cookbook so it may not parse in Lucene, or may give undesirable results. That's ok, because in the next section I show you how to build this using the API).
Programmatically, you would construct this as follows:
BooleanQuery top = new BooleanQuery();
// Construct the terms since they will be used more than once
Term bazTerm = new Term("Field", "BAZ");
Term extraTerm = new Term("Field", "EXTRA");
// Add each term as "should" since we want a partial match
top.add(new TermQuery(bazTerm), BooleanClause.Occur.SHOULD);
top.add(new TermQuery(extraTerm), BooleanClause.Occur.SHOULD);
// Construct the SpanNearQuery, with slop 100 - a document will get a boost only
// if BAZ and EXTRA occur within 100 places of each other. The final parameter means
// that BAZ must occur before EXTRA.
SpanNearQuery spanQuery = new SpanNearQuery(
new SpanQuery[] { new SpanTermQuery(bazTerm),
new SpanTermQuery(extraTerm) },
100, true);
// Give it a boost of 5 since it is more important that the words are together
spanQuery.setBoost(5f);
// Add it as "should" since we want a match even when we don't have proximity
top.add(spanQuery, BooleanClause.Occur.SHOULD);
Hope that helps! In the future, try to start off by posting exactly what results you are expecting - even if it is obvious to you, it may not be to the reader, and being explicit can avoid having to go back and forth so many times.