Weighted synonyms - lucene

I use synonyms in Lucene to increase recall in search. For that I construct a SynonymMap and use a SynonymGraphFilter in my custom Analyzer.
The synonym map looks like:
vw -> volkswagen
bmw -> bayerische motoren werke
I use QueryParser to parse the query.
Now I would like to lower the boost for synonym terms (e.g., if I search for 'bmw', then the terms 'bayerische motoren werke' should have a lower boost).
How can I achieve this? It seems that Lucene supports it (see https://issues.apache.org/jira/browse/LUCENE-9171); however, I do not know how to use it.

There are two different approaches for handling synonyms here:
(1) Your usage of SynonymMap, which, as you note, is a way to pre-build synonym lists, which can then be used in analyzers and general queries.
(2) The enhancement you mention.
As the enhancement ticket notes, "this has been done targeting the Synonyms Query."
The SynonymQuery class has a builder which allows you to add terms (as synonyms) with a boost value.
I do not believe there is any direct way to combine the two approaches. Synonym maps are not boost-aware. I think the best you can do is to iterate over your pre-defined list of synonyms, and feed the values into the synonym query builder.
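One way to sketch the "iterate over your pre-defined list" idea without touching the analyzer is to expand the user's term into a query string before handing it to QueryParser, using the classic parser syntax for phrases and boosts (`"..."^0.5`). This is a swapped-in technique, not the SynonymQuery builder itself; it is used here because a multi-word synonym such as 'bayerische motoren werke' is a phrase rather than a single term. The class name, the synonym table, and the boost value are all hypothetical, and the sketch assumes the analyzer passed to QueryParser does not also apply the SynonymGraphFilter (otherwise terms would be expanded twice).

```java
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

class BoostedSynonymExpander {
    // Hypothetical synonym table mirroring the map in the question.
    static final Map<String, List<String>> SYNONYMS = Map.of(
        "vw", List.of("volkswagen"),
        "bmw", List.of("bayerische motoren werke"));

    // Expands one term into (term OR "synonym"^boost ...) in classic QueryParser syntax.
    // Terms without synonyms are returned unchanged. No escaping of special
    // query characters is done here; real input would need it.
    static String expand(String term, float synonymBoost) {
        List<String> synonyms = SYNONYMS.getOrDefault(term, List.of());
        if (synonyms.isEmpty()) {
            return term;
        }
        StringJoiner clause = new StringJoiner(" OR ", "(", ")");
        clause.add(term); // the original term keeps the default boost of 1
        for (String synonym : synonyms) {
            // Quote the synonym so multi-word entries are parsed as a phrase.
            clause.add("\"" + synonym + "\"^" + synonymBoost);
        }
        return clause.toString();
    }

    public static void main(String[] args) {
        // prints (bmw OR "bayerische motoren werke"^0.5)
        System.out.println(expand("bmw", 0.5f));
    }
}
```

The resulting string can be passed to QueryParser.parse(...) in place of the raw user input. If all synonyms are single terms in the same field, feeding them into the SynonymQuery builder as described above is likely the cleaner option.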

Related

What does the Liferay documentation mean by "without using the indexer"

In the Liferay documentation, many *LocalServiceUtil classes have search methods with the following documentation:
Returns an ordered range of all the [...] matching the parameters without using the indexer, including keyword parameters for [...].
What does the without using the indexer part of the sentence mean?
In particular, does it mean that it does not use any database indexes? Does it mean that for instance JournalArticleLocalServiceUtil.search can be expected to run much slower than the equivalent JournalArticleLocalServiceUtil.getArticles? Or is it a different meaning?
Or does this indexer refer to the indexes in the result set in the same method's documentation, maybe?
The indexer refers to searchengine indexers such as those using Lucene, Solr, Elastic (or similar) implementations.
The search and getArticles operations both query the database. If you do a keyword search, your database might not use a (DB) index, because content or title are not part of an index by default. Therefore, when there is a large number of articles, a keyword query against the search engine might give a better response time.

Is it possible to add operators into manchester syntax?

I want to add custom operators for temporal representation kind of like how this paper describes on page 8: https://pdfs.semanticscholar.org/c097/3553764e2959af3ad4513515588791a13867.pdf.
Although it can be done in SPARQL (I'm assuming through the use of registries like this -> https://github.com/dotnetrdf/dotnetrdf/wiki/DeveloperGuide-SPARQL-Operators), I would like to know if it is also possible to add these same operators to Manchester syntax, or to find an alternative way of doing things so that I can do a temporal query within a DL Query.

Hibernate Search - possible to get new Lucene query after facets applied?

A Lucene Query is generated as so:
Query luceneQuery = builder.all().createQuery();
Then facets are applied.
I'm not sure whether, when facets are applied, the luceneQuery is ANDed and ORed with other Queries, resulting in a new Lucene Query. Alternatively, perhaps a bunch of BitSets are applied to the original Query to refine the results. (I don't know.)
If a new query is generated I'd like to retrieve it. If not, I need a rethink. That's the crux of the question.
Why:
I'm applying a faceted search on a field with multiple possible values.
E.g. TMovie.class many-to-many TTag.class (multiple-value-facet)
I'm filtering on TMovie where TTag is some value.
Anyway, the filtering works but there is a known problem whereby the Facet-counts returned are incorrect.
Detailed here: Add faceting over multivalued to application using Hibernate Search and https://forum.hibernate.org/viewtopic.php?f=9&t=1010472
I'm using this solution:
http://sujitpal.blogspot.ie/2007/04/lucene-search-within-search-with.html (see comment on new API under article)
The BitSet solution (in this example at least) generates counts based on the original Lucene Query. This works perfectly. However.....
If alternate (different, not TTags) facets are applied to the original query some complications arise.
The Bitset solution calculates on the original Lucene query. It does not calculate on the lucene query now reduced by the application of alternate Facets (a different FacetSelection) (or even TTag Facets themselves for that matter). I.e. the count calculations are irrespective of any other FacetSelection Facets applied.
So...
A. can I get the new Lucene query after facets are applied? The BitSet solution applied to this would be correct.
B. Any other alternative suggestions?
Thanks so much.. All comments welcome.
John
Regarding your first question, applying a facet does not modify the original query; it uses a custom Collector called FacetCollector - see https://github.com/hibernate/hibernate-search/blob/master/engine/src/main/java/org/hibernate/search/query/collector/impl/FacetCollector.java. Under the hood the collector uses a Lucene FieldCache to do the facet count. That is also the root of the limitation for multi-value faceting: FieldCache does not support multiple values per field.
Anyways, no additional queries are applied during faceting and the original query is unmodified. The benefit of course is performance. The solution you are pointing to probably works as well, but relies on running multiple queries. However, it might be a valid work around for your use case.
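Since faceting leaves the query untouched, one option is to extend the BitSet work-around from the question so that it also intersects the other selected facets before counting. The following is a minimal sketch of that idea using plain java.util.BitSet; the class name, the bits helper, and the sample document ids are hypothetical, and obtaining the bit sets from Lucene or Hibernate Search is not shown.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BitSetFacetCounts {
    // For each facet value, counts the documents that match the base query
    // AND every other selected facet filter AND the facet value itself.
    static Map<String, Integer> count(BitSet queryHits,
                                      List<BitSet> selectedFilters,
                                      Map<String, BitSet> facetValues) {
        // Narrow the original result set by the facets already selected.
        BitSet restricted = (BitSet) queryHits.clone();
        for (BitSet filter : selectedFilters) {
            restricted.and(filter);
        }
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> entry : facetValues.entrySet()) {
            BitSet hits = (BitSet) restricted.clone();
            hits.and(entry.getValue());
            counts.put(entry.getKey(), hits.cardinality());
        }
        return counts;
    }

    // Helper: builds a BitSet with the given document ids set.
    static BitSet bits(int... docIds) {
        BitSet set = new BitSet();
        for (int id : docIds) {
            set.set(id);
        }
        return set;
    }

    public static void main(String[] args) {
        BitSet queryHits = bits(0, 1, 2, 3, 4);      // all movies matched by the query
        List<BitSet> selected = List.of(bits(0, 1, 2)); // e.g. another facet already chosen
        Map<String, BitSet> tags = new LinkedHashMap<>();
        tags.put("comedy", bits(0, 2, 4));           // movie 2 carries both tags
        tags.put("drama", bits(2, 3));
        // prints {comedy=2, drama=1}
        System.out.println(count(queryHits, selected, tags));
    }
}
```

Because a document may appear in several tag bit sets, multi-valued fields are counted once per value, which is the behaviour the FieldCache-based collector cannot provide.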

Lucene Synonym Filter behavior

I am trying to figure out how Lucene's analyzer works.
My question is: how does Lucene handle synonyms? Here is the situation:
we have single words and multi words
single: foo = bar
multi words: foo bar = foobar
For single words:
Does lucene expand the indexed records or not? I guess if a query has a word like "foo", it adds "bar" to the query too. I don't know if it happens for indexing or not?
For multi words:
Does lucene expand both query and indexing? for example if we have "foo bar", does it add foobar to the indexing/query?
My second question is: Lucene uses a stream of tokens and passes them to filters such as the lowercase filter. How does Lucene find multi-word synonyms? That is, how does it figure out that "foo bar" is a multi-word entry whose tokens belong together?
thanks
SynonymFilter can, optionally, keep the original word and add the synonym to the token stream as well, by setting keepOrig=true (see SynonymMap.Builder.add()). This behavior can cause problems for PhraseQueries and the like; see the first note in the SynonymFilter docs.
If you are using the same Analyzer for querying and indexing, then both queries and docs written to the index will, of course, be treated the same way. SynonymFilter with keepOrig set to true is one of the few filters that is reasonably often applied differently between querying and indexing, but that is entirely up to your implementation.
As far as how it is implemented, the source code is available to you.
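To give a rough idea of how multi-word entries can be detected in a token stream, here is a deliberately simplified sketch: it looks ahead over the next few tokens and applies the longest rule that matches at the current position. This is not Lucene's implementation (which builds the rules into an FST and emits synonyms at the same token position as the original); the class, the rule table, and the flat output list are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class NaiveSynonymMatcher {
    // Hypothetical rules from the question: "foo" -> "bar", "foo bar" -> "foobar".
    static final Map<List<String>, String> RULES = Map.of(
        List.of("foo"), "bar",
        List.of("foo", "bar"), "foobar");
    static final int MAX_RULE_LENGTH = 2; // longest rule input, in tokens

    // Walks the tokens left to right; at each position the longest matching
    // rule wins. With keepOrig the matched input tokens are kept as well.
    static List<String> expand(List<String> tokens, boolean keepOrig) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            int matched = 0;
            String synonym = null;
            // Look ahead: try the widest window first, then shrink it.
            int maxLen = Math.min(MAX_RULE_LENGTH, tokens.size() - i);
            for (int len = maxLen; len >= 1; len--) {
                String candidate = RULES.get(tokens.subList(i, i + len));
                if (candidate != null) {
                    matched = len;
                    synonym = candidate;
                    break;
                }
            }
            if (matched == 0) {
                out.add(tokens.get(i)); // no rule starts here, pass the token through
                i++;
            } else {
                if (keepOrig) {
                    out.addAll(tokens.subList(i, i + matched));
                }
                out.add(synonym);
                i += matched;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [foobar, baz]
        System.out.println(expand(List.of("foo", "bar", "baz"), false));
        // prints [foo, bar, foobar]
        System.out.println(expand(List.of("foo", "bar"), true));
    }
}
```

The point of the sketch is the lookahead: the filter has to buffer tokens until it knows whether a longer rule still applies, which is why "foo bar" becomes "foobar" rather than "bar bar".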

Case-insensitive search using Hibernate

I'm using Hibernate for ORM of my Java app to an Oracle database (not that the database vendor matters, we may switch to another database one day), and I want to retrieve objects from the database according to user-provided strings. For example, when searching for people, if the user is looking for people who live in 'fran', I want to be able to give her people in San Francisco.
SQL is not my strong suit, and I prefer Hibernate's Criteria building code to hard-coded strings as it is. Can anyone point me in the right direction about how to do this in code, and if that is impossible, what the hard-coded SQL should look like?
Thanks,
Yuval =8-)
For the simple case you describe, look at Restrictions.ilike(), which does a case-insensitive search.
Criteria crit = session.createCriteria(Person.class);
crit.add(Restrictions.ilike("town", "%fran%"));
List results = crit.list();
Criteria crit = session.createCriteria(Person.class);
crit.add(Restrictions.ilike("town", "fran", MatchMode.ANYWHERE));
List results = crit.list();
If you use Spring's HibernateTemplate to interact with Hibernate, here is how you would do a case insensitive search on a user's email address:
getHibernateTemplate().find("from User where upper(email)=?", emailAddr.toUpperCase());
You also do not have to put in the '%' wildcards. You can pass a MatchMode to tell the search how to behave. The options are START, ANYWHERE, EXACT, and END.
The usual approach to ignoring case is to convert both the database values and the input value to upper or lower case. The resulting SQL would look something like:
select f.name from f where UPPER(f.name) like '%FRAN%'
In Hibernate Criteria: Restrictions.like(...).ignoreCase()
I'm more familiar with NHibernate, so the syntax might not be 100% accurate.
For some more info, see the Pro Hibernate 3 extract and the Hibernate docs, section 15.2, Narrowing the result set.
This can also be done using the criterion Example, in the org.hibernate.criterion package.
public List findLike(Object entity, MatchMode matchMode) {
    Example example = Example.create(entity);
    example.enableLike(matchMode);
    example.ignoreCase();
    return getSession().createCriteria(entity.getClass()).add(example).list();
}
Just another way that I find useful to accomplish the above.
Since Hibernate 5.2, session.createCriteria is deprecated. Below is a solution using the JPA 2 CriteriaBuilder. It uses like and upper:
CriteriaBuilder builder = session.getCriteriaBuilder();
CriteriaQuery<Person> criteria = builder.createQuery(Person.class);
Root<Person> root = criteria.from(Person.class);
Expression<String> upper = builder.upper(root.get("town"));
criteria.where(builder.like(upper, "%FRAN%"));
session.createQuery(criteria.select(root)).getResultList();
Most default database collations are not case-sensitive, but in the SQL Server world it can be set at the instance, the database, and the column level.
You could look at using Compass, a wrapper around Lucene.
http://www.compass-project.org/
By adding a few annotations to your domain objects you can achieve this kind of thing.
Compass provides a simple API for working with Lucene. If you know how to use an ORM, then you will feel right at home with Compass with simple operations for save, and delete & query.
From the site itself.
"Building on top of Lucene, Compass simplifies common usage patterns of Lucene such as google-style search, index updates as well as more advanced concepts such as caching and index sharding (sub indexes). Compass also uses built in optimizations for concurrent commits and merges."
I have used this in the past and I find it great.