How to add prefix and suffix when indexing

How is it possible to add a suffix and prefix to an entity in Hibernate Search during indexing?
I need this to perform exact search.
E.g. if one is searching for "this is a test", then the following entries are found:
* this is a test
* this is a test and ...
So I found the idea to add a prefix and suffix to the whole value during indexing, e.g.:
_____ this is a test _____
and if one is searching for "this is a test" with the checkbox for exact search enabled, I'll change the search string to:
"_____ this is a test _____"
I created a FilterFactory for this, but with this one it adds the prefix and suffix to every term:
public boolean incrementToken() throws IOException {
    if (!this.input.incrementToken()) {
        return false;
    } else {
        String input = termAtt.toString();
        // add "_____" at the beginning and ending of the phrase for exact match searching
        input = "_____ " + input + " _____";
        char[] newBuffer = input.toLowerCase().toCharArray();
        termAtt.setEmpty();
        termAtt.copyBuffer(newBuffer, 0, newBuffer.length);
        return true;
    }
}

This is not how you should do it.
What you need is that the string you index is considered a unique token. This way, you will only have results having the exact token.
To do so you need to define an analyzer based on the KeywordTokenizer.
@Entity
@AnalyzerDefs({
    @AnalyzerDef(name = "keyword",
        tokenizer = @TokenizerDef(factory = KeywordTokenizerFactory.class)
    )
})
@Indexed
public class YourEntity {
    @Fields({
        @Field, // your default field with default analyzer if you need it
        @Field(name = "propertyKeyword", analyzer = @Analyzer(definition = "keyword"))
    })
    private String property;
}
Then you should search on the propertyKeyword field. Note that the analyzer definition is global so you only need to declare the definition for one entity for it to be available for all your entities.
Take a look at the documentation about analyzers: http://docs.jboss.org/hibernate/stable/search/reference/en-US/html_single/#example-analyzer-def .
It's important to understand what an analyzer is for because usually the default one is not exactly the one you are looking for.
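To make the difference between a word-based analyzer and a keyword analyzer concrete, here is a minimal plain-Java sketch. It does not use Lucene or Hibernate Search; the class and method names are hypothetical, and the "tokenized" match is a simplification that only checks that every query word occurs somewhere in the value (it ignores word order and positions).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Plain-Java illustration only; names are hypothetical and no Lucene API is used.
class KeywordVsTokenizedSketch {

    // Mimics a word-based analyzer: one lowercase term per whitespace-separated word.
    static Set<String> tokenizedTerms(String value) {
        String[] words = value.toLowerCase(Locale.ROOT).trim().split("\\s+");
        return new HashSet<>(Arrays.asList(words));
    }

    // Mimics a keyword-style analyzer without filters: the whole value is the only term.
    static Set<String> keywordTerms(String value) {
        return Collections.singleton(value);
    }

    // Simplified word-based match: every query word must occur in the indexed value.
    static boolean matchesTokenized(String indexed, String query) {
        return tokenizedTerms(indexed).containsAll(tokenizedTerms(query));
    }

    // Keyword match: the query must equal the single indexed term.
    static boolean matchesKeyword(String indexed, String query) {
        return keywordTerms(indexed).contains(query);
    }

    public static void main(String[] args) {
        String query = "this is a test";
        for (String doc : new String[] {"this is a test", "this is a test and more"}) {
            System.out.println(doc + " -> tokenized=" + matchesTokenized(doc, query)
                    + ", keyword=" + matchesKeyword(doc, query));
        }
    }
}
```

With the word-based field both values match the query, while the keyword-style field only matches the value that is identical to it, which is the exact-search behavior the question asks for.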

Related

I have synonym matching working EXCEPT in quoted phrases

Simple synonyms (wordA = wordB) are fine. When there are two or more synonyms (wordA = wordB = wordC ...), then phrase matching is only working for the first, unless the phrases have proximity modifiers.
I have a simple test case (it's delivered as an Ant project) which illustrates the problem.
Materials
You can download the test case here: mydemo.with.libs.zip (5MB)
That archive includes the Lucene 9.2 libraries which my test uses; if you prefer a copy without the JAR files you can download that from here: mydemo.zip (9KB)
You can run the test case by unzipping the archive into an empty directory and running the Ant command ant rnsearch
Input
When indexing the documents, the following synonym list is used (permuted as necessary):
note,notes,notice,notification
subtree,sub tree,sub-tree
I have three documents, each containing a single sentence. The three sentences are:
These release notes describe a document sub tree in a simple way.
This release note describes a document subtree in a simple way.
This release notice describes a document sub-tree in a simple way.
Problem
I believe that any of the following searches should match all three documents:
release note
release notes
release notice
release notification
"release note"
"release notes"
"release notice"
"release notification"
As it happens, the first four searches are fine, but the quoted phrases demonstrate a problem.
The searches for "release note" and "release notes" match all three records, but "release notice" only matches one, and "release notification" does not match any.
However if I change the last two searches like so:
"release notice"~1
"release notification"~2
then all three documents match.
What appears to be happening is that the first synonym is being given the same index position as the term, the second synonym has the position offset by 1, the third offset by 2, etc.
I believe that all the synonyms should be given the same position so that all four phrases match without the need for proximity modifiers at all.
Edit, here's the source of my analyzer:
public class MyAnalyzer extends Analyzer {
public MyAnalyzer(String synlist) {
this.synlist = synlist;
}
@Override
protected TokenStreamComponents createComponents(String fieldName) {
WhitespaceTokenizer src = new WhitespaceTokenizer();
TokenStream result = new LowerCaseFilter(src);
if (synlist != null) {
result = new SynonymGraphFilter(result, getSynonyms(synlist), Boolean.TRUE);
result = new FlattenGraphFilter(result);
}
return new TokenStreamComponents(src, result);
}
private static SynonymMap getSynonyms(String synlist) {
boolean dedup = Boolean.TRUE;
SynonymMap synMap = null;
SynonymMap.Builder builder = new SynonymMap.Builder(dedup);
int cnt = 0;
try {
BufferedReader br = new BufferedReader(new FileReader(synlist));
String line;
try {
while ((line = br.readLine()) != null) {
processLine(builder,line);
cnt++;
}
} catch (IOException e) {
System.err.println(" caught " + e.getClass() + " while reading synonym list,\n with message " + e.getMessage());
}
System.out.println("Synonym load processed " + cnt + " lines");
br.close();
} catch (Exception e) {
System.err.println(" caught " + e.getClass() + " while loading synonym map,\n with message " + e.getMessage());
}
if (cnt > 0) {
try {
synMap = builder.build();
} catch (IOException e) {
System.err.println(e);
}
}
return synMap;
}
private static void processLine(SynonymMap.Builder builder, String line) {
boolean keepOrig = Boolean.TRUE;
String terms[] = line.split(",");
if (terms.length < 2) {
System.err.println("Synonym input must have at least two terms on a line: " + line);
} else {
String word = terms[0];
String[] synonymsOfWord = Arrays.copyOfRange(terms, 1, terms.length);
addSyns(builder, word, synonymsOfWord, keepOrig);
}
}
private static void addSyns(SynonymMap.Builder builder, String word, String[] syns, boolean keepOrig) {
CharsRefBuilder synset = new CharsRefBuilder();
SynonymMap.Builder.join(syns, synset);
CharsRef wordp = SynonymMap.Builder.join(word.split("\\s+"), new CharsRefBuilder());
builder.add(wordp, synset.get(), keepOrig);
}
private String synlist;
}
The analyzer includes synonyms when it builds the index, and does not add synonyms when it is used to process a query.

For the "note", "notes", "notice", "notification" list of synonyms:
It is possible to build an index of the above synonyms so that every query listed in the question will find all three documents - including the phrase searches without the need for any ~n proximity searches.
I see there is a separate question for the other list of synonyms "subtree", "sub tree", "sub-tree" - so I will skip those here (I expect the below approach will not work for those, but I would have to take a closer look).
The solution is straightforward, and it's based on a realization that I was (in an earlier question) completely incorrect in an assumption I made about how to build the synonyms:
You can place multiple synonyms of a given word at the same position as the word, when building your indexed data. I incorrectly thought you needed to provide the synonyms as a list - but you can provide them one at a time as words.
Here is the approach:
My analyzer:
Analyzer analyzer = new Analyzer() {
@Override
protected Analyzer.TokenStreamComponents createComponents(String fieldName) {
Tokenizer source = new StandardTokenizer();
TokenStream tokenStream = source;
tokenStream = new LowerCaseFilter(tokenStream);
tokenStream = new ASCIIFoldingFilter(tokenStream);
tokenStream = new SynonymGraphFilter(tokenStream, getSynonyms(), ignoreSynonymCase);
tokenStream = new FlattenGraphFilter(tokenStream);
return new Analyzer.TokenStreamComponents(source, tokenStream);
}
};
The getSynonyms() method used by the above analyzer, using the note,notes,notice,notification list:
private SynonymMap getSynonyms() {
// de-duplicate rules when loading:
boolean dedup = Boolean.TRUE;
// include original word in index:
boolean includeOrig = Boolean.TRUE;
String[] synonyms = {"note", "notes", "notice", "notification"};
// build a synonym map where every word in the list is a synonym
// of every other word in the list:
SynonymMap.Builder synMapBuilder = new SynonymMap.Builder(dedup);
for (String word : synonyms) {
for (String synonym : synonyms) {
if (!synonym.equals(word)) {
synMapBuilder.add(new CharsRef(word), new CharsRef(synonym), includeOrig);
}
}
}
SynonymMap synonymMap = null;
try {
synonymMap = synMapBuilder.build();
} catch (IOException ex) {
System.err.print(ex);
}
return synonymMap;
}
I looked at the indexed data by using org.apache.lucene.codecs.simpletext.SimpleTextCodec, to generate human-readable indexes (just for testing purposes):
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
iwc.setOpenMode(OpenMode.CREATE);
iwc.setCodec(new SimpleTextCodec());
This allowed me to see where the synonyms were inserted into the indexed data. So, for example, taking the word note, we see the following indexed entries:
term note
doc 0
freq 1
pos 2
doc 1
freq 1
pos 2
doc 2
freq 1
pos 2
So, that tells us that all three documents contain note at token position 2 (the 3rd word).
And for notification we see exactly the same data:
term notification
doc 0
freq 1
pos 2
doc 1
freq 1
pos 2
doc 2
freq 1
pos 2
We see this for all the words in the synonym list, which is why all 8 queries return all 3 documents.
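The effect of token positions on phrase matching can be sketched without Lucene. The following is a simplified, hypothetical model of a positional index (class, method and parameter names are my own): with step 0 every synonym is stacked at the position of the original word, and with step 1 the first synonym shares that position while each further synonym is shifted by one, which roughly imitates the layout described in the question. It only models exact phrases, not proximity (~n) queries.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Simplified model of a positional index; it does not use Lucene.
class SynonymPositionSketch {

    static final List<String> SYNONYMS = Arrays.asList("note", "notes", "notice", "notification");

    static void add(Map<String, Set<Integer>> postings, String term, int pos) {
        postings.computeIfAbsent(term, k -> new TreeSet<>()).add(pos);
    }

    // Builds term -> positions for one document. The i-th synonym of a word
    // is added at (wordPosition + i * step): step 0 stacks all synonyms on
    // the word's position, step 1 spreads them over following positions.
    static Map<String, Set<Integer>> index(String text, int step) {
        Map<String, Set<Integer>> postings = new HashMap<>();
        String[] words = text.toLowerCase(Locale.ROOT).split("\\s+");
        for (int pos = 0; pos < words.length; pos++) {
            add(postings, words[pos], pos);
            if (SYNONYMS.contains(words[pos])) {
                int i = 0;
                for (String synonym : SYNONYMS) {
                    if (!synonym.equals(words[pos])) {
                        add(postings, synonym, pos + i * step);
                        i++;
                    }
                }
            }
        }
        return postings;
    }

    // Exact phrase match: term k of the phrase must occur at position start + k.
    static boolean matchesPhrase(Map<String, Set<Integer>> postings, String phrase) {
        String[] terms = phrase.toLowerCase(Locale.ROOT).split("\\s+");
        for (int start : postings.getOrDefault(terms[0], Collections.emptySet())) {
            boolean all = true;
            for (int k = 1; k < terms.length && all; k++) {
                all = postings.getOrDefault(terms[k], Collections.emptySet()).contains(start + k);
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String doc = "This release note describes a document subtree in a simple way.";
        for (String phrase : new String[] {"release note", "release notice", "release notification"}) {
            System.out.println(phrase + ": stacked=" + matchesPhrase(index(doc, 0), phrase)
                    + ", shifted=" + matchesPhrase(index(doc, 1), phrase));
        }
    }
}
```

In this model the stacked layout matches all three phrases, while the shifted layout only matches the phrase whose second word really sits at position 2, which is why the proximity modifiers were needed in the question.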

Starts with a word or ends with a word using hibernate search

I am using Hibernate Search with spring-boot. I have a requirement that the user will have search operators to perform the following on the establishment name:
Starts with a word
.Ali --> Means the phrase should strictly start with Ali, which means AlAli should not return in the results
query = queryBuilder.keyword().wildcard().onField("establishmentNameEn")
.matching(term + "*").createQuery();
It returns mixed results containing the term at the start, in the middle or at the end, which does not meet the above requirement.
Ends with a word
Kamran. --> Means it should strictly end with Kamran, meaning that Kamranullah should not be returned in the results
query = queryBuilder.keyword().wildcard().onField("establishmentNameEn")
.matching("*"+term).createQuery();
As per the documentation, it's not a good idea to put "*" at the start. My question here is: how can I achieve the expected result?
My domain class and analyzer:
@AnalyzerDef(name = "english", tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class), filters = {
    @TokenFilterDef(factory = StandardFilterFactory.class),
    @TokenFilterDef(factory = LowerCaseFilterFactory.class), })
@Indexed
@Entity
@Table(name = "DIRECTORY")
public class DirectoryEntity {
    @Analyzer(definition = "english")
    @Field(store = Store.YES)
    @Column(name = "ESTABLISHMENT_NAME_EN")
    private String establishmentNameEn;
    // getter and setter
}

Two problems here:
Tokenizing
You're using a tokenizer, which means your searches will work with words, not with the full string you indexed. This explains why you're getting matches on terms in the middle of the sentence.
This can be solved by creating a separate field for these special begin/end queries, and using an analyzer with the KeywordTokenizer (which is a no-op).
For example:
@AnalyzerDef(name = "english", tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class), filters = {
    @TokenFilterDef(factory = StandardFilterFactory.class),
    @TokenFilterDef(factory = LowerCaseFilterFactory.class), })
@AnalyzerDef(name = "english_beginEnd", tokenizer = @TokenizerDef(factory = KeywordTokenizerFactory.class), filters = {
    @TokenFilterDef(factory = StandardFilterFactory.class),
    @TokenFilterDef(factory = LowerCaseFilterFactory.class), })
@Indexed
@Entity
@Table(name = "DIRECTORY")
public class DirectoryEntity {
    @Analyzer(definition = "english")
    @Field(store = Store.YES)
    @Field(name = "establishmentNameEn_beginEnd", store = Store.YES, analyzer = @Analyzer(definition = "english_beginEnd"))
    @Column(name = "ESTABLISHMENT_NAME_EN")
    private String establishmentNameEn;
    // getter and setter
}
Query analysis and performance
The wildcard query does not trigger analysis of the entered text. This will cause unexpected behavior. For example if you index "Ali", then search for "ali", you will probably get a result, but if you search for "Ali" you won't: the text was analyzed and indexed as "ali", which doesn't exactly match "Ali".
Additionally, as you are aware, a leading wildcard is very bad performance-wise.
If your field has a reasonable length (say, less than 30 characters), I would recommend using the "edge-ngram" analyzer instead; you will find an explanation here: Hibernate Search: How to use wildcards correctly?
Note that you will still need to use the KeywordTokenizer (unlike the example I linked).
This will take care of the "match the beginning of the text" query, but not the "match the end of the text" query.
To address that second query, I would create a separate field and a separate analyzer, similar to the one used for the first query, the only difference being that you insert a ReverseStringFilterFactory before the EdgeNGramFilterFactory. This will reverse the text before indexing ngrams, which should lead to the desired behavior. Do not forget to also use a separate query analyzer for this field, one that reverses the string.
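The idea behind the edge n-gram field and the reversed field can be sketched in plain Java. This is only a hypothetical illustration of what such analyzers would produce (it does not call Lucene or Hibernate Search, and the class and method names are my own): the whole lowercased value is treated as one token, its prefixes are indexed, and the query is lowercased (and reversed for the "ends with" case) but not n-grammed. The maximum gram size of 30 follows the field length mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java illustration only; names are hypothetical and no Lucene API is used.
class EdgeNGramSketch {

    // Prefixes of the whole lowercased value (keyword-style: no word splitting).
    static List<String> edgeNGrams(String value, int minGram, int maxGram) {
        String token = value.toLowerCase(Locale.ROOT);
        List<String> grams = new ArrayList<>();
        for (int len = minGram; len <= Math.min(maxGram, token.length()); len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    // "Starts with": the lowercased query is looked up among the indexed prefixes.
    static boolean startsWith(String indexedValue, String query) {
        return edgeNGrams(indexedValue, 1, 30).contains(query.toLowerCase(Locale.ROOT));
    }

    // "Ends with": value and query are both reversed before the same lookup.
    static boolean endsWith(String indexedValue, String query) {
        return edgeNGrams(reverse(indexedValue), 1, 30)
                .contains(reverse(query).toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(startsWith("Ali Trading", "ali"));
        System.out.println(startsWith("AlAli Trading", "Ali"));
        System.out.println(endsWith("Store Kamran", "kamran"));
        System.out.println(endsWith("Store Kamranullah", "Kamran"));
    }
}
```

Note that a prefix lookup like this does not enforce a word boundary after the term; if that matters, the query or the indexed grams would need additional handling.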

Hibernate Search with Lucene Phone Number Analyzer issues

Our database contains thousands of numbers in various formats and what I am attempting to do is remove all punctuation at index time and store only the digits and then when a user types digits into a keyword field, only match on those digits. I thought that a custom analyzer was the way to go but I think I am missing an important step...
@Override
protected TokenStreamComponents createComponents(String fieldName) {
    log.debug("Creating Components for Analyzer...");
    final Tokenizer source = new KeywordTokenizer();
    LowerCaseFilter lcFilter = new LowerCaseFilter(source);
    PatternReplaceFilter prFilter = new PatternReplaceFilter(lcFilter,
            Pattern.compile("[^0-9]"), "", true);
    TrimFilter trimFilter = new TrimFilter(prFilter);
    return new TokenStreamComponents(source, trimFilter);
}
...
@KeywordSearch
@Analyzer(impl = com.jjkane.common.search.analyzer.PhoneNumberAnalyzer.class)
@Field(name = "phone", index = org.hibernate.search.annotations.Index.YES, analyze = Analyze.YES, store = Store.YES)
public String getPhone() {
    return this.phone;
}
This may just be ignorance on my part in attempting to do this... From all the documentation, it seems like I am on the right track, but the query never matches unless I submit (555)555-5555 as an exact match to what was in my db. If I put in 5555555555, I get nothing...
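For reference, the effect the filter chain above is meant to have on a single value can be reproduced with plain string operations. This is a hypothetical sketch (the class and method names are my own, and it does not use Lucene); it only shows the intended normalization, not whether the analyzer is actually applied to the indexed value or to the query in the setup described.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Plain-Java illustration only; names are hypothetical and no Lucene API is used.
class PhoneNormalizationSketch {

    private static final Pattern NON_DIGITS = Pattern.compile("[^0-9]");

    // Same steps as the chain in the question: lowercase, strip every
    // character that is not a digit, then trim.
    static String normalize(String raw) {
        return NON_DIGITS.matcher(raw.toLowerCase(Locale.ROOT)).replaceAll("").trim();
    }

    public static void main(String[] args) {
        System.out.println(normalize("(555)555-5555")); // prints 5555555555
    }
}
```

If both the stored value and the user's input go through the same normalization, "(555)555-5555" and "5555555555" end up as the same term, so a match that only succeeds on the punctuated form suggests that one side is not being normalized.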

Grails Searchable Plugin(Lucene) - 1 To Many Query

I am using the Grails Searchable plugin (0.6.4). I need to search the Members on the basis of privacy settings. The following is the DB design.
Member has MemberProfile, and MemberProfile has PrivacySettings
class Member extends {
String firstName
String lastName
static searchable = {
analyzer "simple"
only = ['firstName', 'lastName']
firstName boost: 5.0
profile component: true
profile reference: true
}
static hasOne = [profile: MemberProfile]
}
class MemberProfile {
static searchable = {
analyzer "simple"
only = ['currentCity', 'currentCountry']
privacySettings component: true
}
static hasMany = [privacySettings:PrivacySettings]
String currentCity
String currentCountry
List<PrivacySettings> privacySettings
}
//For instance Privacy Settings contains
//fieldSectionName: firstName , connectionLevel: true , memberLevel: false
// This means firstName will be shown to only members' friends(connection)
class PrivacySettings {
static searchable = {
analyzer "simple"
only = ['fieldSectionName', 'connectionLevel', 'memberLevel']
}
String fieldSectionName
boolean connectionLevel
boolean memberLevel
}
One member profile has many privacy settings for each field.
What would be the query to find only those members that have display_name in fieldSectionName and connectionLevel set to true in the privacy settings table?
I am trying something like this
def query="mydisplayname"
def searchResults = Member.search(query + "*" + " +(fieldSectionName:${'display_name'} AND connectionLevel:${true})", params)

I don't know Grails, but in Lucene the maximum number of clauses in a boolean query is 1024 by default.
You can increase this limit.
There would be a performance penalty, though. Alternatively, you can index the values on the document and search for them (Lucene will do an OR operation on different fields with the same name).
You can change the limit using the static setter on the BooleanQuery class:
BooleanQuery.setMaxClauseCount(99999);
Omri

I had the same issue in my Grails application. To resolve it, add org.apache.lucene.search.BooleanQuery.maxClauseCount = 99999 to your Config.groovy and restart your application.

How do I get a list of fields in a generic sObject?

I'm trying to build a query builder, where the sObject result can contain an indeterminate number of fields. I'm using the result to build a dynamic table, but I can't figure out a way to read the sObject for a list of fields that were in the query.
I know how to get a list of ALL fields using the getDescribe information, but the query might not contain all of those fields.
Is there a way to do this?

Presumably you're building the query up as a string, since it's dynamic, so couldn't you just loop through the fields in the describe information, and then use .contains() on the query string to see if it was requested? Not crazy elegant, but seems like the simplest solution here.
Taking this further, maybe you have the list of fields selected in a list of strings or similar, and you could just use that list?

Not sure if this is exactly what you were after but something like this?
public list<sObject> Querylist {get; set;}
Define Search String
string QueryString = 'select field1__c, field2__c from Object where';
Add as many of these as you need to build the search if the user searches on these fields
if(searchParameter.field1__c != null && searchParameter.field1__c != '')
{
QueryString += ' field1__c like \'' + searchParameter.field1__c + '%\' and ';
}
if(searchParameter.field2__c != null && searchParameter.field2__c != '')
{
QueryString += ' field2__c like \'' + searchParameter.field2__c + '%\' and ';
}
Remove the last and
QueryString = QueryString.substring(0, (QueryString.length()-4));
QueryString += ' limit 200';
add query to the list
for(sObject obj : database.query(QueryString))
{
Querylist.add(obj);
}

To get the list of fields in an sObject, you could use a method such as:
public Set<String> getFields(sObject sobj) {
Set<String> fieldSet = new Set<String>();
for (String field : sobj.getSobjectType().getDescribe().fields.getMap().keySet()) {
try {
sobj.get(field);
fieldSet.add(field);
} catch (Exception e) {
}
}
return fieldSet;
}
You should refactor this approach to bulkify it for your context, but it works. Just pass in an sObject and it'll give you back a set of the field names.

I suggest using a list of fields for creating both the query and the table. You can put the list of fields in the result so that it's accessible for anyone using it. Then you can construct the table by using result.getFields() and retrieve the data by using result.getRows().
for (sObject obj : result.getRows()) {
for (String fieldName : result.getFields()) {
table.addCell(obj.get(fieldName));
}
}
If you're trying to work with a query that's out of your control, you would have to parse the query to get the list of fields. But I wouldn't suggest trying that. It complicates code in ways that are hard to follow.
