Lucene boost case sensitive match

Our client uses a few acronyms on its site.
For example, let's say STACK is an acronym they are using.
When they are searching for "STACK" (keyword), they want documents that match "STACK" exactly (uppercase) to be on top of the search results, instead of documents matching "stack" lowercase.
Is there a way to achieve this? Maybe through query boosting somehow?
I'm using the StandardAnalyzer at the moment.

From the docs:
public final class StandardAnalyzer extends StopwordAnalyzerBase
Filters StandardTokenizer with StandardFilter, LowerCaseFilter and
StopFilter, using a list of English stop words.
So every token is lowercased at index time, and the index holds no difference between 'STACK' and 'stack'. You could index each keyword a second time as a StringField, which is not analyzed and therefore keeps the original case, and then boost matches on that keyword field.
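The two-field idea can be sketched outside Lucene with a toy scoring model. This is not Lucene's API or its scoring formula: the `score` and `search` functions and the `EXACT_BOOST` value are assumptions made up for illustration. The lowercased term set stands in for the analyzed field and the case-preserving set for the extra keyword field.

```python
import re

EXACT_BOOST = 2.0  # hypothetical boost for the case-preserving field

def tokenize(text):
    # Split on non-word characters, keeping the original case.
    return re.findall(r"\w+", text)

def score(query, doc):
    """Toy score: 1 point per case-insensitive match, plus a boost per exact-case match."""
    exact_terms = set(tokenize(doc))                 # stands in for the keyword field
    lower_terms = {t.lower() for t in exact_terms}   # stands in for the lowercased field
    total = 0.0
    for term in tokenize(query):
        if term.lower() in lower_terms:
            total += 1.0
        if term in exact_terms:
            total += EXACT_BOOST
    return total

def search(query, docs):
    # Return matching documents, best score first.
    scored = [(score(query, d), d) for d in docs]
    return [d for s, d in sorted(scored, key=lambda p: -p[0]) if s > 0]

docs = ["put it on the stack", "STACK annual report", "unrelated text"]
results = search("STACK", docs)
```

Both documents still match, but the one containing the uppercase form sorts first because it collects the extra boost.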

It is really hard to make any judgements about this one case (STACK vs stack). If all your acronyms are upper case, just exclude LowerCaseFilter from the analyzer chain. If some of your acronyms can contain dots or dashes (e.g. Y.M.C.A.), you probably need to use WhitespaceAnalyzer (instead of StandardAnalyzer) to ensure these are not split into multiple terms.
To me, boosting sounds superfluous here. If somebody enters a query that closely matches the acronym, the relevant document will be ranked high anyway because of its similarity.

Related

How to tackle efficient searching of a string that could have multiple variations?

My title sounds complicated, but the situation is very simple. People search on my site using a term such as "blackfriday".
When they conduct the search, my SQL code needs to look in various places such as a ProductTitle and ProductDescription field to find this term. For example:
SELECT *
FROM dbo.Products
WHERE ProductTitle LIKE '%blackfriday%' OR
ProductDescription LIKE '%blackfriday%'
However, the term appears differently in the database fields. It is most likely to appear with a space between the words, such as "Black Friday USA 2015". So without going through and adding more combinations to the WHERE clause such as WHERE ProductTitle LIKE '%Black-Friday%', is there a better way to accomplish this kind of fuzzy searching?
I have full-text search enabled on the above fields, but it's really not that good when I use the CONTAINS clause. And of course other terms may not be as neat as this example.
I should start by saying that "variations (of a string)" is a bit vague. You could mean plurality, verb tenses, synonyms, and/or combined words (or, ignoring spaces and punctuation between 2 words) like the example you posted: "blackfriday" vs. "black friday" vs "black-friday". I have a few solutions of which 1 or more together may work for you depending on your use case.
Ignoring punctuation
Full-text searches already ignore punctuation and treat it as a space. So black-friday will match black friday whether using FREETEXT or CONTAINS. But it won't match blackfriday.
Synonyms and combined words
Using FREETEXT or FREETEXTTABLE for your full text search is a good way to handle synonyms and some matching of combined words (I don't know which ones). You can customize the thesaurus to add more combined words assuming it's practical for you to come up with such a list.
Handling combinations of any 2 words
Maybe your use case calls for you to match poorly formatted text or hashtags. In that case I have a couple of ideas:
Write the full text query to cover each combination of words using a dictionary. For example your data layer can rewrite a search for black friday as CONTAINS(*, '"black friday" OR "blackfriday"'). This may have to get complex, for example would black friday treehouse have to be ("black friday" OR "blackfriday") AND ("treehouse" OR "tree house")? You would need a dictionary to figure out that "treehouse" is made up of 2 words and thus can be split.
If it's not practical to use a dictionary for the words being searched for (I don't know why, maybe acronyms or new memes) you could create a long query to cover every letter combination. So searching for do-re-mi could be "do re mi" OR "doremi" OR "do remi" OR "dore mi" OR "d oremi" OR "d o remi" .... Yes it will be a lot of combinations, but surprisingly it may run quickly because of how full text efficiently looks up words in the index.
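As a rough sketch of the "every letter combination" idea, the snippet below generates all spacing variants of a term and joins them into a search condition string. The function names are made up for illustration, and the resulting string is only assumed to be handed to CONTAINS as a parameter by your data layer.

```python
from itertools import product

def spacing_variants(term):
    """Return every way of splitting `term` (separators removed) into space-separated chunks."""
    letters = "".join(ch for ch in term if ch.isalnum())
    if not letters:
        return []
    variants = []
    # Each gap between adjacent letters is either closed ("") or a space (" ").
    for gaps in product(["", " "], repeat=len(letters) - 1):
        pieces = [letters[0]]
        for gap, ch in zip(gaps, letters[1:]):
            pieces.append(gap + ch)
        variants.append("".join(pieces))
    return variants

def contains_condition(term):
    # Quote each variant so it is treated as a phrase, then OR them together.
    return " OR ".join('"%s"' % v for v in spacing_variants(term))

variants = spacing_variants("do-re-mi")
condition = contains_condition("do-re-mi")
```

The number of variants doubles with every extra letter (2^(n-1) for n letters), so this is only workable for short terms.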
A hack / workaround if searching for multiple variations is very important.
Define which fields in the DB are searchable (e.g ProductTitle, ProductDescription)
Before saving these fields in the DB, replace each space (or run of consecutive spaces) with a placeholder, e.g. "%"
Search the DB for variation matches employing the placeholder
Do the reverse process when displaying these fields on your site (i.e replace placeholder with space)
Alternatively, you can enable regex matching for your users (meaning they can define a regex either explicitly or let your app build one from their search term). But it is slower and probably error-prone to do it this way.
After looking into everything, I have settled for using SQL's FREETEXT full-text search. It's not ideal, or accurate, but for now it will have to do.
My answer is probably inadequate, but do you have any scenarios which won't be addressed by the query below?
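One way the app could build a regex from the search term is to allow optional spaces or hyphens between every pair of characters. This is only a sketch run in application code; the function name and the choice of separators are assumptions, not something from the question.

```python
import re

def separator_tolerant_pattern(term):
    """Compile a regex matching `term` with optional spaces/hyphens between its characters."""
    chars = [re.escape(ch) for ch in term if not ch.isspace() and ch != "-"]
    return re.compile(r"[\s-]*".join(chars), re.IGNORECASE)

pattern = separator_tolerant_pattern("blackfriday")
titles = ["Black Friday USA 2015", "Black-Friday deals", "blackfriday", "Good Friday"]
matches = [t for t in titles if pattern.search(t)]
```

Here the first three titles match and "Good Friday" does not. A pattern like this cannot use an index, which is where the slowness mentioned above comes from.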
SELECT *
FROM dbo.Products
WHERE ProductTitle LIKE '%black%friday%' OR
ProductDescription LIKE '%black%friday%'

Lucene.NET 2.9 - MultiFieldQueryParser, boosted fields, stemming and prefixes

I have a system where the search queries multiple fields with different boost values. It is running on Lucene.NET 2.9.4 because it's an Umbraco (6.x) site, and that's what version of Lucene.NET the CMS uses.
My client asked me if I could add stemming, so I wrote a custom analyzer that does Standard / Lowercase / Stop / PorterStemmer. The stemming filter seems to work fine.
But now, when I try to use my new analyzer with the MultiFieldQueryParser, it's not finding anything.
The MultiFieldQueryParser is returning a query containing stemmed words - e.g. if I search for "the figure", what I get as part of the query it returns is:
keywords:figur^4.0 Title:figur^3.0 Collection:figur^2.0
i.e. it's searching the correct fields and applying the correct boosts, but trying to do an exact search on stemmed terms on indexes that contained unstemmed words.
I think what's actually needed is for the MultiFieldQueryParser to return a list of clauses which are of type PrefixQuery, so it'll output a query like
keywords:figur*^4.0 Title:figur*^3.0 Collection:figur*^2.0
If I try to just add a wildcard to the end of the term, and feed that into the parser, the stemmer doesn't kick in. i.e. it builds a query to look for "figure*".
Is there any way to combine MultiFieldQueryParser boosting and prefix queries?
You need to reindex using your custom analyzer. Applying a stemmer only at query time is useless. You might kludge together something using wildcards, but it would remain an ugly, unreliable kludge.
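To see why the index and the query have to go through the same analyzer, here is a toy inverted index. The `toy_stem` function is a crude suffix stripper invented for this example, not the Porter algorithm, and nothing here is Lucene.NET API.

```python
def toy_stem(word):
    # Crude suffix stripping for illustration only; NOT the Porter algorithm.
    for suffix in ("ing", "es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def analyze(text, stem):
    tokens = text.lower().split()
    return [toy_stem(t) for t in tokens] if stem else tokens

def build_index(docs, stem):
    # Map each indexed term to the set of document ids containing it.
    index = {}
    for doc_id, text in enumerate(docs):
        for term in analyze(text, stem):
            index.setdefault(term, set()).add(doc_id)
    return index

def search(index, query, stem):
    hits = set()
    for term in analyze(query, stem):
        hits |= index.get(term, set())
    return hits

docs = ["a figure in bronze", "two figures dancing"]
unstemmed_index = build_index(docs, stem=False)
stemmed_index = build_index(docs, stem=True)

# Stemmed query against unstemmed terms: "figur" is not in the index.
mismatch = search(unstemmed_index, "figure", stem=True)
# Same analysis on both sides: index and query agree on "figur".
match = search(stemmed_index, "figure", stem=True)
```

The first lookup finds nothing, which is the situation in the question; after reindexing with the stemming analyzer both documents are found.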

How lucene can be used to search words prefixed with adverbs/negations?

I am a newbie to Lucene. I want to know how I can use Lucene to search for a word that may be prefixed with an adverb. The documents contain only the words themselves, with no adverbs prefixed to them.
For example: if the term to be searched is 'very beautiful' and my document contains
only beautiful, then I want a hit. The word can also be prefixed with
negations like 'not very beautiful', or may not have a prefix at all,
like 'beautiful'. I can't just drop the prefixes because I need to
keep track of negations, which change the flow of further processing.
I tried fuzzy search but the results are not that satisfactory. Is there any way to accomplish this?
I could not find relevant answers for this.
If I was doing this, I would Google on "part of speech tagging" and "natural language processing". Once you have tagged your parts of speech, you could then apply Lucene indexing.
One way to implement this would be to index the words with their tags, like:
n:he v:is a:a aj:big n:man
for
he is a big man
where he and man are nouns, is is a verb, a is an article, and big is an adjective.
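A minimal sketch of producing such tagged tokens, and of stripping modifiers from a query while still remembering a negation, might look like this. The lexicon and word lists are hypothetical hand-written stand-ins; a real system would use a trained part-of-speech tagger.

```python
# Hypothetical hand-written lexicon standing in for a real part-of-speech tagger.
LEXICON = {"he": "n", "is": "v", "a": "a", "big": "aj", "man": "n"}

# Hypothetical modifier lists for the query side.
NEGATIONS = {"not", "never"}
ADVERBS = {"very", "quite"}

def tag_tokens(sentence, lexicon=LEXICON, unknown="x"):
    """Prefix each word with its tag, e.g. 'big' -> 'aj:big'."""
    words = sentence.lower().split()
    return " ".join("%s:%s" % (lexicon.get(w, unknown), w) for w in words)

def split_query(query):
    """Return (negated, terms): modifiers are dropped from the terms, negation is remembered."""
    words = query.lower().split()
    negated = any(w in NEGATIONS for w in words)
    terms = [w for w in words if w not in NEGATIONS and w not in ADVERBS]
    return negated, terms

tagged = tag_tokens("he is a big man")
negated, terms = split_query("not very beautiful")
```

The remaining terms can then be searched as usual, while the `negated` flag is kept for the later processing the question mentions.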

How to implement EXCEPT boolean operator of ISYS using Lucene API

I've read that EXCEPT is a boolean operator for queries in ISYS (which is an enterprise search engine). It has the following functionality.
Given the query First EXCEPT Second, the retrieved documents must contain the first search term, but only if the second term is not in the same paragraph as the first. Both terms can appear in the document; just not in the same paragraph.
Now how do I achieve this in Lucene?
Thank you :)
A rough outline of an implementation strategy would be to:
tokenize your input on paragraphs
index each paragraph separately, with a field referring to a common document identifier
use the BooleanQuery to construct a query that takes advantage of the above construction
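The outline above can be sketched with plain data structures. The assumptions here are mine: paragraphs are separated by blank lines, words are split on whitespace, and a document qualifies if at least one of its paragraphs contains the first term without the second. In Lucene that last step would correspond to a BooleanQuery with a required and a prohibited clause, followed by collecting the document identifiers.

```python
def split_paragraphs(text):
    # Assumes paragraphs are separated by blank lines.
    return [p for p in text.split("\n\n") if p.strip()]

def build_paragraph_index(docs):
    """One entry per paragraph: (document id, set of lowercased words)."""
    entries = []
    for doc_id, text in docs.items():
        for para in split_paragraphs(text):
            entries.append((doc_id, set(para.lower().split())))
    return entries

def except_query(entries, first, second):
    """Ids of documents with at least one paragraph containing `first` but not `second`."""
    first, second = first.lower(), second.lower()
    return {doc_id for doc_id, words in entries
            if first in words and second not in words}

docs = {
    "a": "apples and pears\n\noranges only",
    "b": "apples with oranges",
    "c": "apples here\n\napples with oranges",
}
entries = build_paragraph_index(docs)
hits = except_query(entries, "apples", "oranges")
```

Document "a" is returned even though it mentions both terms, because they sit in different paragraphs; "b" is excluded.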

All of these words feature

I have a "description" field indexed in Lucene. This field contains a book's description.
How do I achieve "All of these words" functionality on this field using the BooleanQuery class?
For example if a user types in "top selling book" then it should return books which have all of these words in its description.
Thanks!
There are two pieces to get this to work:
You need the incoming documents to be analysed properly, so that individual words are tokenised and indexed separately
The user query needs to be tokenised, and the tokens combined with the AND operator.
For #1, there are a number of Analyzers and Tokenizers that come with Lucene - have a look in the org.apache.lucene.analysis package. There are options for many different languages, stemming, stopwords and so on.
For #2, there are again a lot of query parsers that come with Lucene, mainly in the org.apache.lucene.queryParser package. MultiFieldQueryParser might be good for you: to require every term to be present, just call the following on your parser instance (named parser here)
parser.setDefaultOperator(QueryParser.AND_OPERATOR)
Lucene in Action, although a few versions old, is still accurate and extremely useful for more information on analysis and query parsing.
I believe if you add all query parts (one per term) via
BooleanQuery.add(Query, BooleanClause.Occur)
and set that second parameter to the constant BooleanClause.Occur.MUST, then you should get what you want. The equivalent query syntax would be "+term1 +term2 +term3 ...".
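As a small illustration of that query syntax, the helper below turns the user's input into a string where every word is required, and a second function mimics the "all terms must be present" check on plain text. Both function names are made up for this sketch, and special query characters are not escaped here.

```python
def all_words_query(user_input, field="description"):
    """Build a Lucene-syntax query string in which every word is required."""
    terms = user_input.lower().split()
    return " ".join("+%s:%s" % (field, t) for t in terms)

def matches_all(description, user_input):
    # Mimics MUST semantics: every query word has to occur in the description.
    words = set(description.lower().split())
    return all(t in words for t in user_input.lower().split())

query = all_words_query("top selling book")
hit = matches_all("the top selling cookery book of the year", "top selling book")
miss = matches_all("a selling guide", "top selling book")
```

The generated string would then be handed to a query parser, or each term added to a BooleanQuery with Occur.MUST as described above.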