Our database contains thousands of phone numbers in various formats. I am trying to strip all punctuation at index time and store only the digits, so that when a user types digits into a keyword field, the search matches on those digits alone. I thought a custom analyzer was the way to go, but I think I am missing an important step...
@Override
protected TokenStreamComponents createComponents(String fieldName) {
log.debug("Creating Components for Analyzer...");
final Tokenizer source = new KeywordTokenizer();
LowerCaseFilter lcFilter = new LowerCaseFilter(source);
PatternReplaceFilter prFilter = new PatternReplaceFilter(lcFilter,
Pattern.compile("[^0-9]"), "", true);
TrimFilter trimFilter = new TrimFilter(prFilter);
return new TokenStreamComponents(source, trimFilter);
}
...
@KeywordSearch
@Analyzer(impl = com.jjkane.common.search.analyzer.PhoneNumberAnalyzer.class)
@Field(name = "phone", index = org.hibernate.search.annotations.Index.YES, analyze = Analyze.YES, store = Store.YES)
public String getPhone() {
return this.phone;
}
This may just be ignorance on my part in attempting to do this... From all the documentation, it seems like I am on the right track, but the query never matches unless I submit (555)555-5555 as an exact match to what was in my db. If I put in 5555555555, I get nothing...
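For reference, the normalisation the analyzer is aiming for can be sketched in plain Java, without Lucene. This is only an illustration, assuming the same digits-only rule is applied both to the stored value and to whatever the user types; the class and method names are hypothetical and are not part of Lucene or Hibernate Search.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not a Lucene or Hibernate Search class.
final class PhoneDigits {
    // Same pattern as the PatternReplaceFilter above: anything that is not a digit.
    private static final Pattern NON_DIGITS = Pattern.compile("[^0-9]");

    private PhoneDigits() {}

    /** Strips every non-digit character from the raw value. */
    static String normalize(String raw) {
        return NON_DIGITS.matcher(raw).replaceAll("");
    }

    /** True when the stored value and the user input reduce to the same digits. */
    static boolean matches(String stored, String userInput) {
        return normalize(stored).equals(normalize(userInput));
    }
}
```

With this rule, PhoneDigits.matches("(555)555-5555", "5555555555") is true. If the real query only matches the punctuated form, one thing that may be worth checking is whether the indexed term and the query text both actually go through this normalisation.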
Related
Simple synonyms (wordA = wordB) are fine. When there are two or more synonyms (wordA = wordB = wordC ...), then phrase matching is only working for the first, unless the phrases have proximity modifiers.
I have a simple test case (it's delivered as an Ant project) which illustrates the problem.
Materials
You can download the test case here: mydemo.with.libs.zip (5MB)
That archive includes the Lucene 9.2 libraries which my test uses; if you prefer a copy without the JAR files you can download that from here: mydemo.zip (9KB)
You can run the test case by unzipping the archive into an empty directory and running the Ant command ant rnsearch
Input
When indexing the documents, the following synonym list is used (permuted as necessary):
note,notes,notice,notification
subtree,sub tree,sub-tree
I have three documents, each containing a single sentence. The three sentences are:
These release notes describe a document sub tree in a simple way.
This release note describes a document subtree in a simple way.
This release notice describes a document sub-tree in a simple way.
Problem
I believe that any of the following searches should match all three documents:
release note
release notes
release notice
release notification
"release note"
"release notes"
"release notice"
"release notification"
As it happens, the first four searches are fine, but the quoted phrases demonstrate a problem.
The searches for "release note" and "release notes" match all three records, but "release notice" only matches one, and "release notification" does not match any.
However if I change the last two searches like so:
"release notice"~1
"release notification"~2
then all three documents match.
What appears to be happening is that the first synonym is being given the same index position as the term, the second synonym has the position offset by 1, the third offset by 2, etc.
I believe that all the synonyms should be given the same position so that all four phrases match without the need for proximity modifiers at all.
Edit, here's the source of my analyzer:
public class MyAnalyzer extends Analyzer {
public MyAnalyzer(String synlist) {
this.synlist = synlist;
}
@Override
protected TokenStreamComponents createComponents(String fieldName) {
WhitespaceTokenizer src = new WhitespaceTokenizer();
TokenStream result = new LowerCaseFilter(src);
if (synlist != null) {
result = new SynonymGraphFilter(result, getSynonyms(synlist), Boolean.TRUE);
result = new FlattenGraphFilter(result);
}
return new TokenStreamComponents(src, result);
}
private static SynonymMap getSynonyms(String synlist) {
boolean dedup = Boolean.TRUE;
SynonymMap synMap = null;
SynonymMap.Builder builder = new SynonymMap.Builder(dedup);
int cnt = 0;
try {
BufferedReader br = new BufferedReader(new FileReader(synlist));
String line;
try {
while ((line = br.readLine()) != null) {
processLine(builder,line);
cnt++;
}
} catch (IOException e) {
System.err.println(" caught " + e.getClass() + " while reading synonym list,\n with message " + e.getMessage());
}
System.out.println("Synonym load processed " + cnt + " lines");
br.close();
} catch (Exception e) {
System.err.println(" caught " + e.getClass() + " while loading synonym map,\n with message " + e.getMessage());
}
if (cnt > 0) {
try {
synMap = builder.build();
} catch (IOException e) {
System.err.println(e);
}
}
return synMap;
}
private static void processLine(SynonymMap.Builder builder, String line) {
boolean keepOrig = Boolean.TRUE;
String terms[] = line.split(",");
if (terms.length < 2) {
System.err.println("Synonym input must have at least two terms on a line: " + line);
} else {
String word = terms[0];
String[] synonymsOfWord = Arrays.copyOfRange(terms, 1, terms.length);
addSyns(builder, word, synonymsOfWord, keepOrig);
}
}
private static void addSyns(SynonymMap.Builder builder, String word, String[] syns, boolean keepOrig) {
CharsRefBuilder synset = new CharsRefBuilder();
SynonymMap.Builder.join(syns, synset);
CharsRef wordp = SynonymMap.Builder.join(word.split("\\s+"), new CharsRefBuilder());
builder.add(wordp, synset.get(), keepOrig);
}
private String synlist;
}
The analyzer includes synonyms when it builds the index, and does not add synonyms when it is used to process a query.
For the "note", "notes", "notice", "notification" list of synonyms:
It is possible to build an index of the above synonyms so that every query listed in the question will find all three documents - including the phrase searches without the need for any ~n proximity searches.
I see there is a separate question for the other list of synonyms "subtree", "sub tree", "sub-tree" - so I will skip those here (I expect the below approach will not work for those, but I would have to take a closer look).
The solution is straightforward, and it's based on a realization that I was (in an earlier question) completely incorrect in an assumption I made about how to build the synonyms:
You can place multiple synonyms of a given word at the same position as the word when building your indexed data. I incorrectly thought you needed to provide the synonyms as a list, but you can provide them one at a time as words.
Here is the approach:
My analyzer:
Analyzer analyzer = new Analyzer() {
@Override
protected Analyzer.TokenStreamComponents createComponents(String fieldName) {
Tokenizer source = new StandardTokenizer();
TokenStream tokenStream = source;
tokenStream = new LowerCaseFilter(tokenStream);
tokenStream = new ASCIIFoldingFilter(tokenStream);
tokenStream = new SynonymGraphFilter(tokenStream, getSynonyms(), ignoreSynonymCase);
tokenStream = new FlattenGraphFilter(tokenStream);
return new Analyzer.TokenStreamComponents(source, tokenStream);
}
};
The getSynonyms() method used by the above analyzer, using the note,notes,notice,notification list:
private SynonymMap getSynonyms() {
// de-duplicate rules when loading:
boolean dedup = Boolean.TRUE;
// include original word in index:
boolean includeOrig = Boolean.TRUE;
String[] synonyms = {"note", "notes", "notice", "notification"};
// build a synonym map where every word in the list is a synonym
// of every other word in the list:
SynonymMap.Builder synMapBuilder = new SynonymMap.Builder(dedup);
for (String word : synonyms) {
for (String synonym : synonyms) {
if (!synonym.equals(word)) {
synMapBuilder.add(new CharsRef(word), new CharsRef(synonym), includeOrig);
}
}
}
SynonymMap synonymMap = null;
try {
synonymMap = synMapBuilder.build();
} catch (IOException ex) {
System.err.print(ex);
}
return synonymMap;
}
I looked at the indexed data by using org.apache.lucene.codecs.simpletext.SimpleTextCodec, to generate human-readable indexes (just for testing purposes):
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
iwc.setOpenMode(OpenMode.CREATE);
iwc.setCodec(new SimpleTextCodec());
This allowed me to see where the synonyms were inserted into the indexed data. So, for example, taking the word note, we see the following indexed entries:
term note
doc 0
freq 1
pos 2
doc 1
freq 1
pos 2
doc 2
freq 1
pos 2
So, that tells us that all three documents contain note at token position 2 (the 3rd word).
And for notification we see exactly the same data:
term notification
doc 0
freq 1
pos 2
doc 1
freq 1
pos 2
doc 2
freq 1
pos 2
We see this for all the words in the synonym list, which is why all 8 queries return all 3 documents.
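To make the link between token positions and phrase matching concrete, here is a toy model in plain Java. It is a simplified sketch, not Lucene's index format or its phrase query: the class ToyPositions and its methods are hypothetical, and it only handles exact two-word phrases with no slop.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy positional index for a single document; a simplified model only.
final class ToyPositions {
    private final Map<String, Set<Integer>> positions = new HashMap<>();

    /** Records that a term occurs at the given token position. */
    ToyPositions add(String term, int position) {
        positions.computeIfAbsent(term, t -> new TreeSet<>()).add(position);
        return this;
    }

    /** Exact two-word phrase: the second term must sit directly after the first. */
    boolean matchesPhrase(String first, String second) {
        Set<Integer> firstPositions = positions.getOrDefault(first, Collections.emptySet());
        Set<Integer> secondPositions = positions.getOrDefault(second, Collections.emptySet());
        for (int p : firstPositions) {
            if (secondPositions.contains(p + 1)) {
                return true;
            }
        }
        return false;
    }
}
```

If "release" is added at position 1 and all four synonyms at position 2, matchesPhrase("release", "notification") returns true. If "notification" is instead added at position 4 (the offset-by-2 layout described in the question), the same call returns false, which is why a proximity modifier was needed there.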
I am indexing technical documentation and incorporating synonyms at index time, so that users can search with a number of alternative patterns. But only some synonyms seem to be getting into the map.
I have a text file synonyms.list which contains a series of lines like so:
note,notes,notice
subtree,sub-tree,sub tree
My analyzer and synonym map builder (I've removed try and catch wrappers to save space, but they aren't the problem):
public class TechAnalyzer extends Analyzer {
@Override
protected TokenStreamComponents createComponents(String fieldName) {
WhitespaceTokenizer src = new WhitespaceTokenizer();
TokenStream result = new TechTokenFilter(new LowerCaseFilter(src));
result = new SynonymGraphFilter(result, getSynonyms(getSynonymsList()), Boolean.TRUE);
result = new FlattenGraphFilter(result);
return new TokenStreamComponents(src, result);
}
private static SynonymMap getSynonyms(String synlist) {
boolean dedup = Boolean.TRUE;
SynonymMap synMap = null;
SynonymMap.Builder builder = new SynonymMap.Builder(dedup);
int cnt = 0;
BufferedReader br = new BufferedReader(new FileReader(synlist));
String line;
while ((line = br.readLine()) != null) {
processLine(builder,line);
cnt++;
}
br.close();
if (cnt > 0) {
synMap = builder.build();
}
return synMap;
}
private static void processLine(SynonymMap.Builder builder, String line) {
boolean keepOrig = Boolean.TRUE;
String terms[] = line.split(",");
if (terms.length > 1) {
String word = terms[0];
String[] synonymsOfWord = Arrays.copyOfRange(terms, 1, terms.length);
for (String syn : synonymsOfWord) {
addPair(builder, word, syn, keepOrig);
}
}
}
private static void addPair(SynonymMap.Builder builder, String word, String syn, boolean keepOrig) {
CharsRef synp = SynonymMap.Builder.join(syn.split("\\s+"), new CharsRefBuilder());
CharsRef wordp = new CharsRef(word);
builder.add(wordp, synp, keepOrig);
// builder.add(synp, wordp, keepOrig); // ? do I need this??
}
}
I'm not splitting word in addPair() because (at the moment, anyway) the first term in every line of synonyms.list must be a word not a phrase.
My first question relates to that comment at the bottom of addPair(): if I am adding (word,synonym) to the map, do I also need to add (synonym,word)? Or is the map commutative? I can't tell, because of the problem I'm having which is the basis of the next question.
So... the technical documentation being indexed contains some documents which refer to "release notes", and some which refer to "release notices". There are also points described as a "release note". So I would like a search for any of "release note", "release notes", or "release notice" to match all three alternatives.
My code doesn't seem to enable this. If I index a single file which refers to "release notes" I can inspect the generated index with luke and I can see that the index only ever contains one synonym, not two. The same position in the index might have "note" and "notes", or "notes" and "notice", depending on the order of the words in the synonyms.list text file, but it will never have "note", "notes" and "notice".
Obviously I'm not building the map correctly, but the documentation hasn't helped me see what I am doing wrong.
If you've read this far, and can see the flaw in my code, please help me see it too!
Thanks, etc.
I am using Hibernate Search with Spring Boot. I have a requirement that users can apply the following search operators to the establishment name:
Starts with a word
.Ali --> means the name must start strictly with Ali, so AlAli should not be returned in the results
query = queryBuilder.keyword().wildcard().onField("establishmentNameEn")
.matching(term + "*").createQuery();
It returns mixed results containing the term at the start, in the middle, or at the end, which does not meet the requirement above.
Ends with a word
Kamran. --> means the name must end strictly with Kamran, so Kamranullah should not be returned in the results
query = queryBuilder.keyword().wildcard().onField("establishmentNameEn")
.matching("*"+term).createQuery();
As per the documentation, it's not a good idea to put "*" at the start. My question is: how can I achieve the expected result?
My domain class and analyzer:
@AnalyzerDef(name = "english", tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class), filters = {
@TokenFilterDef(factory = StandardFilterFactory.class),
@TokenFilterDef(factory = LowerCaseFilterFactory.class), })
@Indexed
@Entity
@Table(name = "DIRECTORY")
public class DirectoryEntity {
@Analyzer(definition = "english")
@Field(store = Store.YES)
@Column(name = "ESTABLISHMENT_NAME_EN")
private String establishmentNameEn;
// getter and setter
}
Two problems here:
Tokenizing
You're using a tokenizer, which means your searches will work with words, not with the full string you indexed. This explains why you're getting matches on terms in the middle of the sentence.
This can be solved by creating a separate field for these special begin/end queries, and using an analyzer with the KeywordTokenizer (which is a no-op).
For example:
@AnalyzerDef(name = "english", tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class), filters = {
@TokenFilterDef(factory = StandardFilterFactory.class),
@TokenFilterDef(factory = LowerCaseFilterFactory.class), })
@AnalyzerDef(name = "english_beginEnd", tokenizer = @TokenizerDef(factory = KeywordTokenizerFactory.class), filters = {
@TokenFilterDef(factory = StandardFilterFactory.class),
@TokenFilterDef(factory = LowerCaseFilterFactory.class), })
@Indexed
@Entity
@Table(name = "DIRECTORY")
public class DirectoryEntity {
@Analyzer(definition = "english")
@Field(store = Store.YES)
@Field(name = "establishmentNameEn_beginEnd", store = Store.YES, analyzer = @Analyzer(definition = "english_beginEnd"))
@Column(name = "ESTABLISHMENT_NAME_EN")
private String establishmentNameEn;
// getter and setter
}
Query analysis and performance
The wildcard query does not trigger analysis of the entered text. This will cause unexpected behavior. For example if you index "Ali", then search for "ali", you will probably get a result, but if you search for "Ali" you won't: the text was analyzed and indexed as "ali", which doesn't exactly match "Ali".
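One way to work around the missing analysis is to normalise the user input by hand before building the wildcard pattern. The sketch below is plain Java and assumes the field's analyzer does nothing beyond lower-casing; WildcardTerms and its methods are hypothetical names, and the input is not escaped, so any "*" or "?" typed by the user would still act as a wildcard.

```java
import java.util.Locale;

// Hypothetical helper: prepares user input for a wildcard query on a lower-cased field.
final class WildcardTerms {
    private WildcardTerms() {}

    /** "Starts with" pattern, lower-cased by hand because wildcard terms are not analyzed. */
    static String startsWith(String userInput) {
        return userInput.trim().toLowerCase(Locale.ROOT) + "*";
    }

    /** "Ends with" pattern; note that this produces a leading wildcard. */
    static String endsWith(String userInput) {
        return "*" + userInput.trim().toLowerCase(Locale.ROOT);
    }
}
```

The resulting string would then be passed to matching(...) in place of term + "*", so that a search for "Ali" is sent as "ali*" and lines up with the lower-cased term in the index.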
Additionally, as you are aware, a leading wildcard is very bad for performance.
If your field has a reasonable length (say, less than 30 characters), I would recommend using the "edge-ngram" analyzer instead; you will find an explanation here: Hibernate Search: How to use wildcards correctly?
Note that you will still need to use the KeywordTokenizer (unlike the example I linked).
This will take care of the "match the beginning of the text" query, but not the "match the end of the text" query.
To address that second query, I would create a separate field and a separate analyzer, similar to the one used for the first query, the only difference being that you insert a ReverseStringFilterFactory before the EdgeNGramFilterFactory. This will reverse the text before indexing ngrams, which should lead to the desired behavior. Do not forget to also use a separate query analyzer for this field, one that reverses the string.
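The idea behind both fields can be illustrated in plain Java. This is a sketch of the concept only, not of the Lucene filters themselves: EdgeNGrams and its methods are hypothetical, the gram sizes are arbitrary, and in a real setup the analyzers would produce these tokens at index time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java illustration of edge n-grams; the real work would be done by the analyzer.
final class EdgeNGrams {
    private EdgeNGrams() {}

    /** All prefixes of the lower-cased text, from minGram up to maxGram characters. */
    static List<String> prefixes(String text, int minGram, int maxGram) {
        String lower = text.toLowerCase(Locale.ROOT);
        List<String> grams = new ArrayList<>();
        for (int len = minGram; len <= Math.min(maxGram, lower.length()); len++) {
            grams.add(lower.substring(0, len));
        }
        return grams;
    }

    /** Reverses the text first, so the prefixes of the result are the suffixes of the input. */
    static List<String> reversedPrefixes(String text, int minGram, int maxGram) {
        return prefixes(new StringBuilder(text).reverse().toString(), minGram, maxGram);
    }

    /** "Starts with" becomes an exact lookup among the edge n-grams. */
    static boolean startsWith(String indexedText, String query, int maxGram) {
        return prefixes(indexedText, 1, maxGram).contains(query.toLowerCase(Locale.ROOT));
    }

    /** "Ends with": the reversed query must equal one of the reversed edge n-grams. */
    static boolean endsWith(String indexedText, String query, int maxGram) {
        String reversedQuery = new StringBuilder(query.toLowerCase(Locale.ROOT)).reverse().toString();
        return reversedPrefixes(indexedText, 1, maxGram).contains(reversedQuery);
    }
}
```

Because each n-gram is indexed as its own term, both checks become plain term lookups and no wildcard is needed at query time. Here startsWith("AlAli", "Ali", 30) is false and endsWith("Kamranullah", "Kamran", 30) is false, which matches the behaviour asked for in the question.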
How is it possible to add a suffix and prefix to an entity in Hibernate Search during indexing?
I need this to perform exact search.
E.g. if one is searching for "this is a test", then following entries are found:
* this is a test
* this is a test and ...
So I found the idea to add a prefix and suffix to the whole value during indexing, e.g.:
_____ this is a test _____
and if one searches for "this is a test" with the exact-search checkbox enabled, I'll change the search string to:
"_____ this is a test _____"
I created a FilterFactory for this, but with this one it adds the prefix and suffix to every term:
public boolean incrementToken() throws IOException {
if (!this.input.incrementToken()) {
return false;
} else {
String input = termAtt.toString();
// add "_____" at the beginning and ending of the phrase for exact match searching
input = "_____ " + input + " _____";
char[] newBuffer = input.toLowerCase().toCharArray();
termAtt.setEmpty();
termAtt.copyBuffer(newBuffer, 0, newBuffer.length);
return true;
}
}
This is not how you should do it.
What you need is that the string you index is considered a unique token. This way, you will only have results having the exact token.
To do so you need to define an analyzer based on the KeywordTokenizer.
@Entity
@AnalyzerDefs({
@AnalyzerDef(name = "keyword",
tokenizer = @TokenizerDef(factory = KeywordTokenizerFactory.class)
)
})
@Indexed
public class YourEntity {
@Fields({
@Field, // your default field with default analyzer if you need it
@Field(name = "propertyKeyword", analyzer = @Analyzer(definition = "keyword"))
})
private String property;
}
Then you should search on the propertyKeyword field. Note that the analyzer definition is global so you only need to declare the definition for one entity for it to be available for all your entities.
Take a look at the documentation about analyzers: http://docs.jboss.org/hibernate/stable/search/reference/en-US/html_single/#example-analyzer-def .
It's important to understand what an analyzer is for because usually the default one is not exactly the one you are looking for.
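The difference between the two fields can be modelled in a few lines of plain Java. This is a simplified sketch, not the Lucene tokenizers: ExactMatchDemo and its methods are hypothetical, word tokenizing is approximated by splitting on whitespace, and the keyword token is kept case-sensitive because the "keyword" analyzer above defines no filters.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Simplified model of two analysis strategies; not the Lucene tokenizers themselves.
final class ExactMatchDemo {
    private ExactMatchDemo() {}

    /** Word-level tokens, roughly what a word tokenizer plus lower-casing produces. */
    static Set<String> wordTokens(String value) {
        return new HashSet<>(Arrays.asList(value.toLowerCase(Locale.ROOT).trim().split("\\s+")));
    }

    /** A single token holding the whole value, as a keyword tokenizer would emit. */
    static Set<String> keywordToken(String value) {
        return Collections.singleton(value);
    }

    /** Exact search: the whole query string must equal the one indexed token. */
    static boolean exactMatch(String indexedValue, String query) {
        return keywordToken(indexedValue).contains(query);
    }

    /** Word search: every query word must appear among the indexed word tokens. */
    static boolean wordMatch(String indexedValue, String query) {
        return wordTokens(indexedValue).containsAll(wordTokens(query));
    }
}
```

For the indexed value "this is a test and more", wordMatch with the query "this is a test" is true while exactMatch is false; only the value "this is a test" itself passes exactMatch. No prefix or suffix marker is needed to get that behaviour.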
Currently I have an issue with the Lucene search (version 2.9).
I have a search term and I need to use it on several fields. Therefore, I have to use MultiFieldQueryParser. On the other hand, I have to use WildcardQuery, because our customer wants to search for a term inside a phrase (e.g. "CMH" should match "KRC250/CMH/830/T/H").
I have tried to replace the slashes ('/') with stars ('*') and use a BooleanQuery with enclosed stars for the term.
Unfortunately without any success.
Does anyone have any idea?
Yes, if the field shown is indexed as a single token, you would need to call setAllowLeadingWildcard(true), like:
parser.setAllowLeadingWildcard(true);
Query query = parser.parse("*CMH*");
However:
You don't mention how the field is analyzed. By default, the StandardAnalyzer is used, which will split it into tokens at slashes (or asterisks, when indexing data). If you are using this sort of analysis, you could simply create a TermQuery searching for "cmh" (StandardAnalyzer includes a LowerCaseFilter), or simply:
String[] fields = {"this", "that", "another"};
QueryParser parser = new MultiFieldQueryParser(Version.LUCENE_29, fields, analyzer); // assuming StandardAnalyzer
Query simpleQuery = parser.parse("CMH");
//Or even...
Query slightlyMoreComplexQuery = parser.parse("\"CMH/830/T\"");
I don't understand what you mean by a BooleanQuery with enclosed stars. If you can include code to illustrate that, it might help.
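To illustrate the tokenization point, here is a rough plain-Java approximation of splitting at non-alphanumeric characters and lower-casing. It is not the actual StandardAnalyzer, and SlashTokens is a hypothetical name; how the real analyzer splits a value like this depends on the analyzer actually configured, so inspecting the indexed terms is the reliable check.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Rough approximation of tokenizing at non-alphanumeric characters and lower-casing;
// this is not the actual StandardAnalyzer implementation.
final class SlashTokens {
    private SlashTokens() {}

    /** Splits the text at every run of non-alphanumeric characters and lower-cases each part. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^A-Za-z0-9]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    /** A plain term lookup: the lower-cased query must equal one of the tokens. */
    static boolean containsTerm(String indexedText, String query) {
        return tokenize(indexedText).contains(query.toLowerCase(Locale.ROOT));
    }
}
```

Under this assumption "KRC250/CMH/830/T/H" yields the tokens krc250, cmh, 830, t and h, so a lookup for "CMH" succeeds as an ordinary term match with no wildcards involved.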
Sorry, maybe I have described it a little bit wrong.
I took something like this:
BooleanQuery bq = new BooleanQuery();
foreach (string field in fields)
{
foreach (string tok in tokArr)
{
bq.Add(new WildcardQuery(new Term(field, " *" + tok + "* ")), BooleanClause.Occur.SHOULD);
}
}
return bq;
but unfortunately it did not work.
I have modified it like this
string newterm = string.Empty;
string[] tok = term.Split(new[] { ' ', '/' }, StringSplitOptions.RemoveEmptyEntries);
tok.ForEach(x => newterm += x.EnsureStartsWith(" *").EnsureEndsWith("* "));
var version = Lucene.Net.Util.Version.LUCENE_29;
var analyzer = new StandardAnalyzer(version);
var parser = new MultiFieldQueryParser(version, fields, analyzer);
parser.SetDefaultOperator(QueryParser.Operator.AND);
parser.SetAllowLeadingWildcard(true);
return parser.Parse(newterm);
and my customer loves it :-)