Hibernate Search: how to query for embeded entities - lucene

I like to use Hibernate Search for implementing an sophisticated autosuggestion feature across multiple input fields on a web page.
Each input field is for its own entity, let's say Country and City. There is a many-to-one relationship between both entities
(countries contain cities).
The autosuggestion should work such that when typing e.g. a country name prefix and the city field is already filled,
you get only suggestions for countries that have such a city (and vice versa).
The server side autosuggestion service should return list of projections
(entityId, entityName) which are rendered into the input field (dropdown, whatever).
According to the schema and after having read the manual I tried the following index schema:
SearchMapping mapping = new SearchMapping();
mapping.analyzerDef(...
.entity(City.class).indexed().indexName("MyIndex")
.property("cityId", ElementType.FIELD)
.documentId()
.name("id")
.property("name", ElementType.FIELD)
.documentId()
.name("id")
.property("country", ElementType.METHOD)
.indexEmbedded()
.entity(Country.class).indexed()
.property("id", ElementType.FIELD)
.documentId()
.name("id")
.property("name", ElementType.METHOD)
.field()
.name("name")
This mapping defines City to be the main entity, right?
I have indexed all cities and am able to query for them (also by combining both fields). However, I only get matches when querying for cities.
i.e. when querying like
fullTextSession.getSearchFactory().buildQueryBuilder().forEntity(City.class).get();
This is not useful for the country field becuse when I type in "Spain", I get a single row for each city of Spain. (Spain, Spain, Spain, Spain ,.... ;-))
The question is: How is it possible to search for country entities? Changing the index structure? The indexing procedure? Or how to query?
The only way I found was to setup a Facet for country, and you the different possible facets as autosuggestion. However, this is also not perfect
since it is not possible to sort facets alphabetically.
Of course, in this example, I could switch both entities in the mapping, but suppose scenarios with more complex entity graphs.
UPDATE: adding queries requested in comment
For building queries, I employ the QueryBuilder. The following produces a result set like in the Spain example:
fullTextSession.getSearchFactory().buildQueryBuilder().forEntity(City.class).get();
with query:
country.name:Spain
If I try to use a query builder for countries
fullTextSession.getSearchFactory().buildQueryBuilder().forEntity(Country.class).get();
and query:
name:Spain
I get no results.

You are not showing your actual query. You don't have to use the query DSL, but you can also write native Lucene queries. In both cases (DSL or native Lucene) you can combine queries via boolean logic. Embedded entities follow the java bean notation. The country name would for example in a city query be reached as country.name. Again, without your actual query it is hard to give any more specific feedback.
Last, but not least, facets can also be sorted alphabetically. Check FacetSortOrder.COUNT_DESC.

Related

Is it possible to order lucene documents by matching term?

I'm using Lucene 4.10.3 with Java 1.7
I'm wondering whether it's possible to order query results the matching term?
Simply put, if my documents conatin a text field;
The query is
text:a*
I want documents with ab, then ac, then ad etc.
The real case is more complex however, what I'm actually trying to accomplish is to "stuff" a relational DB into my lucene Index (probably not the best idea?).
An appropriate example would be :
I have documents representing books in a library. every book has a title and also a list of people who has borrowed this book and the date of borrowing.
when a user searches for a book with title containing "JAVA", I want to give priority to books that were borrowed by this user. This could be accomplished by adding a TextField "borrowers", adding a SHOULD clause on it and ordering by score)
also, if there are several books with "JAVA" that this user has borrowed before, I want to show the most recent borrowed ones first. so I thought to create a TextField "borrowers" that will look like
borrowers : "user1__20150505 user2__20150506" etc.
I will add a BooleanClause borrowers: user1* and order by matching term.
any other solution ideas will be welcome
I understand your real problem is more complex, but maybe this is helpful anyway.
You could first search for Tokens in the index that match your query, then for each matching token executing a query using this token specifically.
See https://lucene.apache.org/core/6_0_1/core/org/apache/lucene/index/TermsEnum.html for that. Just seek to the prefix and iterate until the prefix stops matching.
In general it is sometimes easy to just issue two queries. For example one within the corpus of books the user as borrowed before and another witin the whole corpus.
These approaches may not work, but in that case you could implement a custom Scorer somehow mapping the ordering to a number.
See http://opensourceconnections.com/blog/2014/03/12/using-customscorequery-for-custom-solrlucene-scoring/

Lucene Fuzzy Search for customer names and partial address

I was going thru all the existing questions posts but couldn't get something much relevant.
I have file with millions of records for person first name, last name, address1, address2, country code, date of birth - I would like to check my list of customers with above file on daily basis (my customer list also get updated daily and file also gets updated daily).
For first name and last name I would like fuzzy match (may be lucene fuzzyquery/levenshtein distance 90% match) and for remaining fields country and date of birth I wanted exact match.
I am new to Lucene, but by looking at number of posts, looks like its possible.
My questions are:
How should I index my input file? I need to build index on combination of FN, LN, country, DOB and use the index for search
How I can use Fuzzy query of Lucene here?
Is there any other way I can implement the same?
Rushik, here are a few ideas:
Consider using Solr. It is much easier to start using it, rather than bare Lucene.
Build a Lucene/Solr index of the file. It appears that a document per customer is enough, if you use a multi-valued field or two different fields for addresses.
Do you have a unique id per person? To use Solr, you need one. In Lucene, you can get away without using a unique id.
Store the country code as a "keyword". If you only require exact match for date of birth, you may do the same. For range queries, you will need another representation.
I assume your customer list is smaller than the file. A possible policy would be to daily index the changes in the file (Here a unique id is really handy - otherwise you need to delete by query, which may miss the mark). Then you can optimize the index, and after that run a search for your updated customer list.
What you describe is a BooleanQuery, Whose clauses are fuzzy queries for the first and last names and term queries for the other fields. You can create the query programmaticaly or using the query parser.
Consider using soundex for names as described here.
Some academic papers on this subject are well worth reading (google for the free PDFs):
A Comparison of Personal Name Matching: Techniques and Practical Issues (2006)
Overview of Record Linkage and Current Research Directions (2006)
A Parallel Open Source Data Linkage System (2004)
You should also consider the following libraries/frameworks:
Duke: https://github.com/larsga/Duke
Febrl: http://sourceforge.net/projects/febrl/
(Answered for future visitors.)

Loading a entity IDs CSV column as hydrated entities in NHibernate

I have a number of database table that looks like this:
EntityId int : 1
Countries1: "1,2,3,4,5"
Countries2: "7,9,10,22"
I would like to have NHibernate load the Country entities identifed as 1,2,3,4,5,7,9 etc. whenever my EntityId is loaded.
The reason for this is that we want to avoid a proliferation of joins, as there are scores of these collections.
Does fetching the countries have to happen as the entity is loaded - it is acceptable that you run a query to fetch these? (You could amend your DAO to run the query after fetching the entity.) The reason I ask is that simply running a query compared to having custom code invoked when entities are loaded requires less "plumbing" and framework innards.
After fecthing your entity, and you have the Country1,Country2 lists, You can run a query like:
select c from Country c where c.id in (:Country1)
passing :Country1 as a named parameter. You culd also retrieve all rows for both sets of countries
select Entity e where e.id in (:Country1, :Country2)
I'm hoping the country1 & country2 strings can be used as they are, but I have a feeling this won't work. If so, you should convert the Strings to a collection of Integers and pass the collection as the query parameter.
EDIT: The "plumbing" to make this more transparent comes in the form of the IInterceptor interface. This allows you to plug in to how entities are loaded, saved, updated, flushed etc. Your entity will look something like this
class MyEntity
{
IList<Country> Country1;
IList<Country> Country2;
// with public getter/setters
String Country1IDs;
String Country2IDs;
// protected getter and setter for NHibernate
}
Although the domain object has two representations of the list - the actual entities, and the list of IDs, this is the same intrusion that you have when declaring a regular ID field in an entity. The collections (country1 and Country2) are not persisted in the mapping file.
With this in place, you provide an IInterceptor implementation that hooks the loading and saving. On loading, you fetch the value of the countryXID property an use to load the list of countries (as I described above.) On saving, you turn the IList of countries into an ID list, and save this value.
I couldn't find the documentation for IInterceptor, but there are many projects on the net using it. The interface is described in this article.
No you cannot, at least not with default functionality.
Considering that there is no SPLIT string function in SQL it would be hard for any ORM to detect the discreet integer values delimited by commas in a varchar column. If you somehow (custom sql func) overcame that obstacle, your best shot would be to use some kind of component/custom user type that would still make a smorgasbond of joins on the 'Country' table to fetch, in the end, a collection of country entities...
...but I'm not sure it can be done, and it would also mean writing from scratch the persistence mechanism as well.
As a side note, I must say that i don't understand the design decision; you denormalized the db and, well, since when joins are bad?
Also, the other given answer will solve your problem without re-designing your database, and without writing a lot of experimental plumbing code. However, it will not answer your question for hydration of the country entities
UPDATE:
on a second thought, you can cheat, at least for the select part.
You could make a VIEW the would split the values and display them as separate rows:
Entity-Country1 View:
EntityId Country
1 1
1 2
1 3
etc
Entity-Country2 View:
EntityId Country
1 7
1 9
1 10
etc
Then you can map the view

nhibernate and DDD suggestion

I am fairly new to nHibernate and DDD, so please bear with me.
I have a requirement to create a new report from my SQL table. The report is read-only and will be bound to a GridView control in an ASP.NET application.
The report contains the following fields Style, Color, Size, LAQty, MTLQty, Status.
I have the entities for Style, Color and Size, which I use in other asp.net pages. I use them via repositories. I am not sure If should use the same entities for my report or not. If I use them, where I am supposed to map the Qty and Status fields?
If I should not use the same entities, should I create a new class for the report?
As said I am new to this and just trying to learn and code properly.
Thank you
For reports its usually easier to use plain values or special DTO's. Of course you can query for the entity that references all the information, but to put it into the list (eg. using databinding) it's handier to have a single class that contains all the values plain.
To get more specific solutions as the few bellow you need to tell us a little about your domain model. How does the class model look like?
generally, you have at least three options to get "plain" values form the database using NHibernate.
Write HQL that returns an array of values
For instance:
select e1.Style, e1.Color, e1.Size, e2.LAQty, e2.MTLQty
from entity1 inner join entity2
where (some condition)
the result will be a list of object[]. Every item in the list is a row, every item in the object[] is a column. This is quite like sql, but on a higher level (you describe the query on entity level) and is database independent.
Or you create a DTO (data transfer object) only to hold one row of the result:
select new ReportDto(e1.Style, e1.Color, e1.Size, e2.LAQty, e2.MTLQty)
from entity1 inner join entity2
where (some condition)
ReportDto need to implement a constructor that has all this arguments. The result is a list of ReportDto.
Or you use CriteriaAPI (recommended)
session.CreateCriteria(typeof(Entity1), "e1")
.CreateCriteria(typeof(Entity2), "e2")
.Add( /* some condition */ )
.Add(Projections.Property("e1.Style", "Style"))
.Add(Projections.Property("e1.Color", "Color"))
.Add(Projections.Property("e1.Size", "Size"))
.Add(Projections.Property("e2.LAQty", "LAQty"))
.Add(Projections.Property("e2.MTLQty", "MTLQty"))
.SetResultTransformer(AliasToBean(typeof(ReportDto)))
.List<ReportDto>();
The ReportDto needs to have a proeprty with the name of each alias "Style", "Color" etc. The output is a list of ReportDto.
I'm not schooled in DDD exactly, but I've always modeled my nouns as types and I'm surprised the report itself is an entity. DDD or not, I wouldn't do that, rather I'd have my reports reflect the results of a query, in which quantity is presumably count(*) or sum(lineItem.quantity) and status is also calculated (perhaps, in the page).
You haven't described your domain, but there is a clue on those column headings that you may be doing a pivot over the data to create LAQty, MTLQty which you'll find hard to do in nHibernate as its designed for OLTP and does not even do UNION last I checked. That said, there is nothing wrong with abusing HQL (Hibernate Query Language) for doing lightweight reporting, as long as you understand you are abusing it.
I see Stefan has done a grand job of describing the syntax for that, so I'll stop there :-)

Lucene exact ordering

I've had this long term issue in not quite understanding how to implement a decent Lucene sort or ranking. Say I have a list of cities and their populations. If someone searches "new" or "london" I want the list of prefix matches ordered by population, and I have that working with a prefix search and an sort by field reversed, where there is a population field, IE New Mexico, New York; or London, Londonderry.
However I also always want the exact matching name to be at the top. So in the case of "London" the list should show "London, London, Londonderry" where the first London is in the UK and the second London is in Connecticut, even if Londonderry has a higher population than London CT.
Does anyone have a single query solution?
dlamblin,let me see if I get this correctly: You want to make a prefix-based query, and then sort the results by population, and maybe combine the sort order with preference for exact matches.
I suggest you separate the search from the sort and use a CustomSorter for the sorting:
Here's a blog entry describing a custom sorter.
The classic Lucene book describes this well.
API for
Sortcomparator
says
There is a distinct Comparable for each unique term in the field - if
some documents have the same term in
the field, the cache array will have
entries which reference the same
Comparable
You can apply a
FieldSortedHitQueue
to the sortcomparator which has a Comparator field for which the api says ...
Stores a comparator corresponding to
each field being sorted by.
Thus the term can be sorted accordingly
My current solution is to create an exact searcher and a prefix searcher, both sorted by reverse population, and then copy out all my hits starting from the exact hits, moving to the prefix hits. It makes paging my results slightly more annoying than I think it should be.
Also I used a hash to eliminate duplicates but later changed the prefix searcher into a boolean query of a prefix search (MUST) with an exact search (MUST NOT), to have Lucene remove the duplicates. Though this seemed even more wasteful.
Edit: Moved to a comment (since the feature now exists): Yuval F Thank you for your blog post ... How would the sort comparator know that the name field "london" exactly matches the search term "london" if it cannot access the search term?