Lucene: How to give less weight to Wordnet synonyms? - lucene

Looking at the WordnetSynonymParser class, I find no way to attach a weight to the expanded words (i.e. to set the boost of synonyms to 0.2).
I would want to have
"word" -> "word synonym^0.2"
synonym being only 20% the weight of regular word.
Thank you for your help!

There is no way to do this: the synonym parser just adds synonym tokens at the same position (take a look at the code), and SynonymMap is really just a key-value store.
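Since a SynonymMap entry carries no weight, one common workaround is to expand the query string yourself before handing it to the query parser, appending each synonym with a boost. A minimal sketch (the synonym table and the 0.2 boost are illustrative values, not anything WordnetSynonymParser provides):

```java
import java.util.List;
import java.util.Map;

// Sketch: expand a single-term query into "term synonym^0.2 ..." form
// before handing it to QueryParser. The synonym table and the boost
// value are illustrative, not anything Lucene supplies.
public class BoostedSynonymExpander {

    public static String expand(String term, Map<String, List<String>> synonyms, float boost) {
        StringBuilder sb = new StringBuilder(term);
        for (String syn : synonyms.getOrDefault(term, List.of())) {
            // Quote multi-word synonyms so the parser treats them as a phrase.
            String s = syn.contains(" ") ? "\"" + syn + "\"" : syn;
            sb.append(' ').append(s).append('^').append(boost);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> syns = Map.of("word", List.of("synonym"));
        System.out.println(expand("word", syns, 0.2f)); // word synonym^0.2
    }
}
```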

Related

Weighted synonyms

I use synonyms in Lucene to increase the recall of the search. For that I construct a SynonymMap and use a SynonymGraphFilter in my custom Analyzer.
The synonym map looks like:
vw -> volkswagen
bmw -> bayerische motoren werke
I use QueryParser to parse the query.
Now I would like to lower the boost for synonym terms (e.g. if I search for 'bmw', then the terms 'bayerische motoren werke' should get a lower boost).
How can I achieve it? It seems that Lucene supports this (see https://issues.apache.org/jira/browse/LUCENE-9171) however I do not know how to use it.
There are two different approaches for handling synonyms here:
(1) Your usage of SynonymMap, which, as you note, is a way to pre-build synonym lists, which can then be used in analyzers and general queries.
(2) The enhancement you mention.
As the enhancement ticket notes, "this has been done targeting the Synonyms Query".
The SynonymQuery class has a builder which allows you to add terms (as synonyms) with a boost value.
I do not believe there is any direct way to combine the two approaches. Synonym maps are not boost-aware. I think the best you can do is to iterate over your pre-defined list of synonyms, and feed the values into the synonym query builder.
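That iteration step might look like the following sketch. The synonym table, field name, and 0.2 boost are assumptions; the commented-out lines show where Lucene's SynonymQuery.Builder.addTerm(Term, float) (the LUCENE-9171 addition) would consume each pair:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: turn a plain synonym table into (term, boost) pairs suitable
// for SynonymQuery.Builder#addTerm(Term, float) from LUCENE-9171.
// The 1.0f / 0.2f boosts are illustrative.
public class WeightedSynonymPairs {

    public static Map<String, Float> pairsFor(String query, Map<String, String[]> table,
                                              float synonymBoost) {
        Map<String, Float> pairs = new LinkedHashMap<>();
        pairs.put(query, 1.0f); // the user's own term keeps full weight
        for (String syn : table.getOrDefault(query, new String[0])) {
            pairs.put(syn, synonymBoost);
        }
        return pairs;
    }

    public static void main(String[] args) {
        Map<String, String[]> table =
                Map.of("bmw", new String[] {"bayerische", "motoren", "werke"});
        Map<String, Float> pairs = pairsFor("bmw", table, 0.2f);
        // With Lucene on the classpath you would then do, roughly:
        //   SynonymQuery.Builder b = new SynonymQuery.Builder("body");
        //   pairs.forEach((text, boost) -> b.addTerm(new Term("body", text), boost));
        //   Query q = b.build();
        System.out.println(pairs);
    }
}
```

Note that SynonymQuery requires all its terms to be in the same field, which fits this per-field expansion.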

Basic Lucene Beginners q: Index and Autocomplete

I am using Lucene.NET and have a basic question:
Do I need to make an additional Index for Autocompletion?
I created an index based on two different tables from a database.
Here are two Docs:
stored,indexed,tokenized,termVector<URL:/Service/Zahlungsmethoden/Teilzahlung>
stored,indexed,tokenized,termVector<Website:Body:The Text of the first Page>
stored,indexed,tokenized,termVector<Website:ID:19>
stored,indexed,tokenized,termVector<Website:Title:Teilzahlung>
stored,indexed,tokenized,termVector<URL:/Service/Kundenservice/Kinderbetreeung>
stored,indexed,tokenized,termVector<Website:Body:The text of the second Page>
stored,indexed,tokenized,termVector<Website:ID:13>
stored,indexed,tokenized,termVector<Website:Title:Kinderbetreuung>
I need to create a dropdown for a search with suggestions:
e.g. the term "Pag" should suggest "Page"
so I assume that for every word (token) in every doc, I need a list like:
p
pa
pag
page
is this correct?
Where do I store these?
In an additional Index?
Or how would I re-arrange the existing structure of my index to hold the autocompletion-suggestions?
Thank you!
1) Like femtoRgon said above, look at the Lucene Suggest API.
2) That being said, one cheap way to perform auto-suggest is to look for all words that start with the string you've typed so far, like 'pa' returning 'pa' + 'pag' + 'page'. A wildcard query would return those results -- in Lucene query syntax, a query like 'pa*'. (You might want to restrict the suggestions to only those strings of length 2+.)
Mark Leighton Fisher has the right approach for a cheap way, but performing a wildcard query only returns you the documents, not the words. It's better to look at the implementation of WildcardQuery instead. You need to use the Terms object retrieved from the IndexReader and iterate through the terms in the index.
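The term-iteration idea can be sketched without Lucene by letting a sorted set stand in for the index's term dictionary; with Lucene you would get the same ordered walk from the field's Terms object via its TermsEnum (seeking to the prefix with seekCeil and stopping at the first non-matching term):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Sketch: collect suggestion words by walking the sorted term dictionary
// from the first term >= prefix, stopping at the first non-match. This
// mirrors the TermsEnum.seekCeil(...) pattern over the real index terms.
public class PrefixSuggester {

    public static List<String> suggest(TreeSet<String> termDictionary, String prefix) {
        List<String> hits = new ArrayList<>();
        for (String term : termDictionary.tailSet(prefix)) {
            if (!term.startsWith(prefix)) break; // terms are sorted, so we can stop early
            hits.add(term);
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>(List.of("page", "pager", "paint", "zebra"));
        System.out.println(suggest(terms, "pag")); // [page, pager]
    }
}
```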

How lucene can be used to search words prefixed with adverbs/negations?

I am a newbie to Lucene. I wanted to know how I can use Lucene to search for a word which may be prefixed with an adverb. The document contains only words, with no adverbs prefixed to them.
For example: if the term to be searched is 'very beautiful' and my document contains only 'beautiful', then I want a hit. The word can also be prefixed with negations like 'not very beautiful', or may have no prefix at all, like 'beautiful'. I just can't drop the prefixes, because I need to keep track of negations, which change the flow of further processing.
I tried fuzzy search, but the results are not that satisfactory. Is there any way to accomplish this?
I could not find relevant answers for this.
If I was doing this, I would Google on "part of speech tagging" and "natural language processing". Once you have tagged your parts of speech, you could then apply Lucene indexing.
One way to implement this would be to index the words with their tags, like:
n:he v:is a:a aj:big n:man
for
he is a big man
where he and man are nouns, is is a verb, a is an article, and big is an adjective.
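A minimal sketch of producing such tag-prefixed tokens, assuming the tags come from some external POS tagger (here a hard-coded lookup standing in for a real NLP library):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch: prefix each token with its part-of-speech tag before indexing,
// so "is" becomes "v:is". The tag map is a stand-in for a real POS tagger.
public class PosTokenizer {

    public static List<String> tagTokens(String sentence, Map<String, String> posTags) {
        List<String> out = new ArrayList<>();
        for (String word : sentence.split("\\s+")) {
            out.add(posTags.getOrDefault(word, "unk") + ":" + word);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> tags =
                Map.of("he", "n", "is", "v", "a", "a", "big", "aj", "man", "n");
        System.out.println(tagTokens("he is a big man", tags));
        // [n:he, v:is, a:a, aj:big, n:man]
    }
}
```

Indexing these tagged tokens lets a query distinguish, say, a negated phrase from a plain one, because the negation survives as part of the token stream.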

How to implement EXCEPT boolean operator of ISYS using Lucene API

I've studied that EXCEPT is a boolean operator for queries in ISYS (which is an enterprise search engine). It has the following functionality:
If the query is First EXCEPT Second, the retrieved documents must contain the first search term, but only if the second term is not in the same paragraph as the first. Both terms can appear in the document; just not in the same paragraph.
Now how do I achieve this in Lucene?
Thank you :)
A rough outline of an implementation strategy would be to:
tokenize your input on paragraphs
index each paragraph separately, with a field referring to a common document identifier
use a BooleanQuery to construct a query that takes advantage of the above construction
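The matching rule itself can be sketched independently of Lucene: a document matches First EXCEPT Second if at least one of its paragraphs contains the first term without the second. In the paragraph-per-document design above, the same rule becomes a BooleanQuery with a MUST clause for the first term and a MUST_NOT clause for the second, with hits deduplicated by the parent document identifier:

```java
import java.util.List;

// Sketch of the EXCEPT semantics: a document matches "first EXCEPT second"
// if at least one of its paragraphs contains `first` without `second`.
// With paragraphs indexed as separate Lucene documents, the equivalent is
// a BooleanQuery: MUST(first) + MUST_NOT(second), deduplicated by parent id.
public class ExceptMatcher {

    public static boolean matches(List<String> paragraphs, String first, String second) {
        for (String p : paragraphs) {
            if (p.contains(first) && !p.contains(second)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = List.of("alpha beta", "gamma beta");
        System.out.println(matches(doc, "alpha", "gamma")); // true: paragraph 1 has alpha, no gamma
        System.out.println(matches(doc, "alpha", "beta"));  // false: alpha never appears without beta
    }
}
```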

Lucene search and underscores

When I use Luke to search my Lucene index using a standard analyzer, I can see that the field I am searching for contains values of the form MY_VALUE.
When I search for field:"MY_VALUE", however, the query is parsed as field:"my value".
Is there a simple way to escape the underscore (_) character so that it will search for it?
EDIT:
4/1/2010 11:08AM PST
I think there is a bug in the tokenizer for Lucene 2.9.1 and it was probably there before.
Load up Luke and try to search for "BB_HHH_FFFF5_SSSS", when there is a number, the following tokens are returned:
"bb hhh_ffff5_ssss"
After some testing, I've found that this is because of the number. If I input
"BB_HHH_FFFF_SSSS", I get
"bb hhh ffff ssss"
At this point, I'm leaning towards a tokenizer bug, unless the presence of the number is supposed to cause this behavior, but I fail to see why.
Can anyone confirm this?
It doesn't look like you used the StandardAnalyzer to index that field. In Luke you'll need to select the analyzer that you used to index that field in order to match MY_VALUE correctly.
Incidentally, you might be able to match MY_VALUE by using the KeywordAnalyzer.
I don't think you'll be able to use the standard analyser for this use case.
Judging what I think your requirements are, the keyword analyser should work fine for little effort (the whole field becomes a single term).
I think some of the confusion arises when looking at the field with luke. The stored value is not what's used by queries, what you need are the terms. I suspect that when you look at the terms stored for your field, they'll be "my" and "value".
Hope this helps,
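The difference the answers describe can be sketched roughly as follows; this is only an approximation of the two analyzers (the real StandardAnalyzer has many more rules, including the digit-related behavior reported in the question), but it shows why the stored value "MY_VALUE" turns up as the terms "my" and "value":

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Sketch: approximate term output of two analyzers for a field value.
// The real StandardAnalyzer has many more rules (note the digit-related
// behavior reported in the question); this only illustrates why the
// stored value "MY_VALUE" shows up as the terms "my" and "value".
public class AnalyzerSketch {

    // Rough StandardAnalyzer behavior: split on non-alphanumerics, lowercase.
    public static List<String> standardTerms(String value) {
        return Arrays.stream(value.split("[^A-Za-z0-9]+"))
                     .filter(s -> !s.isEmpty())
                     .map(String::toLowerCase)
                     .collect(Collectors.toList());
    }

    // KeywordAnalyzer behavior: the entire field value is a single term.
    public static List<String> keywordTerms(String value) {
        return List.of(value);
    }

    public static void main(String[] args) {
        System.out.println(standardTerms("MY_VALUE")); // [my, value]
        System.out.println(keywordTerms("MY_VALUE"));  // [MY_VALUE]
    }
}
```

This is why a field indexed with KeywordAnalyzer can match field:"MY_VALUE" exactly, while the same query against a StandardAnalyzer-indexed field is parsed into two lowercased terms.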