automatic indexing using hibernate search - lucene

Is it possible to have Hibernate Search index records automatically when they are inserted or updated in the database? At the moment the indexing is done manually by running the application, and someone has to keep an eye on it. I would like the code to handle indexing automatically on every change, with no manual checking needed.

Yes, it is very much possible. You can just use annotations on your Entities. Take a look at this guide:
http://java.dzone.com/articles/hibernate-search-mapping-entit
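For illustration, a minimal sketch of an entity mapped for automatic indexing (the class and field names are made up):

import javax.persistence.Entity;
import javax.persistence.Id;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;

@Entity
@Indexed // Hibernate Search keeps the Lucene index in sync on insert/update/delete
public class Book {

    @Id
    private Long id;

    @Field // marks this property for full-text indexing
    private String title;

    @Field
    private String author;

    // getters and setters omitted
}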
Edit:
If your Hibernate properties are correct, then once your index is built you don't have to index your tables manually. Each insert going through an EntityManager/SessionFactory will hit Hibernate Search, and if the entity is indexed, the index will be updated as well. Have you configured your search properly? Take a look at the following links:
http://docs.jboss.org/seam/snapshot/en-US/html/search.html
http://docs.jboss.org/hibernate/search/3.1/reference/en/html/search-configuration-event.html
As the documentation clearly states:
'By default, every time an object is inserted, updated or deleted through Hibernate, Hibernate Search updates the according Lucene index. It is sometimes desirable to disable that features if either your index is read-only or if index updates are done in a batch way (see Chapter 6, Manual indexing).'
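For completeness, a hedged sketch of the basic configuration, shown programmatically here; the same properties usually go in hibernate.cfg.xml or persistence.xml, and the index path is illustrative:

import org.hibernate.cfg.Configuration;

Configuration cfg = new Configuration();
// Keep the Lucene index on the filesystem at the given path. Older
// Hibernate Search versions spell the provider as
// org.hibernate.search.store.FSDirectoryProvider instead of "filesystem".
cfg.setProperty("hibernate.search.default.directory_provider", "filesystem");
cfg.setProperty("hibernate.search.default.indexBase", "/var/lucene/indexes");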

Related

Incremental indexing for semantic search

I wonder if there are standards or best practices for performing incremental indexing of a triple store for semantic search purposes.
Indeed, to support semantic search one usually uses Solr or Elasticsearch, where resources are indexed according to some specific SPARQL query. While one can re-index the entire resource set once a day, for instance, that is not really desirable. Hence the need to perform it incrementally. However, that requires tracking changes somehow, with the ultimate goal of being able to index or delete only whatever has changed.
For instance, to index only what has changed, the SPARQL query should include some kind of timestamp filter, as in the sketch below.
If anyone has suggestions, or experience with doing this, that they would like to share, it would be much appreciated.
So far I have been somewhat inspired by the EEA ElasticSearch RDF River Plugin. I'm also looking at the Changeset Ontology.
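To make the timestamp idea concrete, here is a minimal sketch using Apache Jena; the dcterms:modified property, and the assumption that the store records such timestamps at all, are illustrative:

import java.util.Calendar;
import org.apache.jena.query.ParameterizedSparqlString;

public class IncrementalIndexQuery {
    // Build a SPARQL query that selects only the resources modified since
    // the last indexing run.
    public static String build(Calendar lastRun) {
        ParameterizedSparqlString pss = new ParameterizedSparqlString(
            "PREFIX dcterms: <http://purl.org/dc/terms/>\n"
            + "SELECT ?resource ?modified WHERE {\n"
            + "  ?resource dcterms:modified ?modified .\n"
            + "  FILTER (?modified > ?since)\n"
            + "}");
        pss.setLiteral("since", lastRun); // bound as an xsd:dateTime literal
        return pss.toString();
    }
}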
The easiest way to accomplish this would be to get something involved in the transaction lifecycle. Then you're able to see the changes to the database, which will give you the graph that needs to be indexed.
But don't dismiss doing a full re-index on a periodic schedule, such as nightly. Unless your requirement is that full-text searches must always be against the most recent data and your data changes quickly, a full re-index on a regular basis will work just fine.

Elasticsearch _boost deprecated. Any alternative?

So basically _boost, which was a mapping option to give a field a certain boost, is now deprecated.
The page suggests using a "function score instead of boost". But function score means:
The function_score allows you to modify the score of documents that are retrieved by a query
So it's not even an alternative: a function score just modifies the score of the documents at query time.
How do I alter the relevance of a field at mapping time?
Is that option no longer valid? Was it removed with no replacement?
The option is no longer valid and there is no direct replacement. The problem is that index-time boosting was removed from Lucene 4.0, upon which Elasticsearch runs. Elasticsearch then used its own implementation, which had its own issues. A good write-up of the issues can be found here: http://blog.brusic.com/2014/02/document-boosting-in-elasticsearch.html and the issue deprecating boost at index time here: https://github.com/elasticsearch/elasticsearch/issues/4664
To summarize, it basically was not working in a transparent and understandable way: you could boost one document by 100 and another by 50, hit the same keyword, and yet get the same score. So the decision was made to remove it and rely on function score queries, which have the benefit of being much more transparent and predictable in their impact on scoring.
If you feel strongly that function score queries do not meet your needs and use case, I'd open an issue on GitHub and explain your case.
Function score query can be used to boost the whole document. If you want to use field boost, you can do so with a multi match query or a term query.
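For instance, a hedged sketch with the 1.x-era Elasticsearch Java API; the field names and the boost factor are illustrative:

import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

public class FieldBoostQuery {
    // Query-time field boosting: matches in "title" count three times as
    // much as matches in "body".
    public static QueryBuilder build(String text) {
        return QueryBuilders.multiMatchQuery(text)
                .field("title", 3.0f)
                .field("body");
    }
}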
I don't know about your case, but I believe you have strong reasons to boost documents at index time. It is generally recommended to boost at query time, as index-time boosting requires reindexing the data whenever your boost criteria change. That said, in my application we have implemented both index- and query-time boosting. We are using:
Index-time boosting (document boosting) to boost documents which we know will always be a top hit for our search, e.g. searching for the word "google" should always put a document containing "google.com" at the top. We do this using a custom boost field and a custom boosting script. Please see this link: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html
Query-time boosting (per-field boosting): we use the ES Java APIs to execute our queries and apply field-level boosting at query time to each field, as it is highly flexible and allows us to change the field-level boosting without reindexing the whole data set.
You can have a look at this, it might be helpful for you: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html#_field_value_factor
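A hedged sketch of that approach with the same Java API; the stored numeric "popularity" field is an assumption:

import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.query.functionscore.ScoreFunctionBuilders;

public class DocumentBoostQuery {
    // function_score with field_value_factor: multiply each document's score
    // by the value of a stored numeric field, approximating an index-time
    // document boost at query time.
    public static QueryBuilder build(String text) {
        return QueryBuilders.functionScoreQuery(QueryBuilders.matchQuery("title", text))
                .add(ScoreFunctionBuilders.fieldValueFactorFunction("popularity"));
    }
}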
I have described my complete use case here, hopefully you will find it helpful.

Why RavenDB reads all documents in indexing process and not only collections used by index?

I have a fairly large database with ~2.6 million documents, of which two collections hold about 1.2 million each and the rest are small collections (<1000 documents). When I create a new index for a small collection, indexing takes a long time to complete (so temp indexes are useless). It seems that the RavenDB indexing process reads every document in the database and checks whether it should be added to the index. I think it would perform better to index only the collections used by the index.
Also, when using Smuggler to export data, if I want to export only one small collection, it reads all documents and the export can take quite a long time. At the same time, a custom app that uses the RavenDB LINQ API and indexes can export the data in seconds.
Why does RavenDB behave like this? And is there perhaps a configuration setting that might change this behavior?
RavenDB doesn't actually have any real concept of a "collection". All documents are pretty much the same. It simply looks at the Raven-Entity-Name metadata in each document to determine how to group things together for purposes of querying by type and displaying the "Collections" page in the management studio.
I am not sure of the specific rationale for this. I think it has something to do with the underlying ESENT tables used by the document store. Perhaps Ayende can answer better. Your particular use cases are good examples for why it might be done differently.
One thing you could try is to use multiple databases. You could put your high-volume documents in one database and everything else in another. Of course, you may have problems with indexing related documents, multi-map/reduce, or other scenarios where documents of different types need to be together in the same database.
It seems the answer to my question is coming in RavenDB 3.0. Ayende says:
In RavenDB 2.x, you still have to pay the full price for indexing everything, but that isn’t the case in RavenDB 3.0. What we have done is to effectively optimize the process so that in this case, we will preload all of the documents taking part in the relevant collection, and send them directly to be indexed.
We do this by utilizing the Raven/DocumentsByEntityName index. Which has already indexed everything in the database anyway. This is a nice little feature, because it allows us to really take advantage of the work we already did long ago. Using one index to pre-populate another is a neat trick, and one that I am very happy about.
And here is the full blog post: http://ayende.com/blog/165923/shiny-features-in-the-depth-new-index-optimization

Hibernate Search, Entities, and SQL VIEWs

I have a table that maintains rows of products that are for sale (tbl_products) using PostgreSQL 9.1. There are also several other tables that maintain ratings on the items, comments, etc. We're using JPA/Hibernate for ORM in a Seam application, and have the appropriate entities wired up properly. In an effort to provide better listings of these items, I've created a SQL VIEW (v_product_summary) that aggregates some of the basic product data (name, description, price, etc.) with data from the other tables (number of comments, average rating, etc.). This provides a nice concise view of the data, and I've created a corresponding JPA entity object that provides read-only access to the view data.
Everything is working fine with respect to running JPQL queries on either the Product object (tbl_products) or the ProductSummary (v_product_summary) objects. However, we'd like to provide a richer search experience using Hibernate Search and Lucene. The issue we're running into, though, is how do we query the ProductSummary objects using Hibernate Search? They're not indexed upon creation, because they're never really "created". They're obtained as read-only objects from the v_product_summary VIEW. An index entry is only created on Product when it's persisted to the database, and not for ProductSummary since it's never persisted.
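For reference, one common way to map such a read-only view entity looks roughly like this; the class shape and column names are illustrative, not the poster's actual code:

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;
import org.hibernate.annotations.Immutable;

@Entity
@Immutable // Hibernate will never write to this mapping
@Table(name = "v_product_summary")
public class ProductSummary {

    @Id
    private Long id;

    @Column(name = "name")
    private String name;

    @Column(name = "avg_rating")
    private Double avgRating;

    // getters omitted
}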
Our thought is that we should be able to:
1. Persist our Product object to the database
2. Immediately query the corresponding ProductSummary object using the product's ID
3. Manually update the Hibernate Search index for the ProductSummary object
Is this possible? Is this even a good idea? I can see there will be a performance impact since we're executing a query for the ProductSummary object every time a new Product is persisted. However, products are not added to the database at a high volume, so I don't think this will be a huge issue.
We'd really like to find a better, more efficient way of accomplishing this. Can anyone provide any tips or recommendations? If we do go the route of updating the search index manually, is that even doable? Can anyone provide a resource explaining how we can add a single ProductSummary to the index?
Any help you can provide is GREATLY appreciated.
If I understand the question correctly, you're trying to do the normal thing of persisting an object and indexing it at that point, but you're dealing with 2 separate objects.
I find myself doing kludgey things in Hibernate all the time; it feels like it almost demands it of you. Yes, there'd be a performance impact, and as you say, it is probably not a big deal, but it might be worth profiling.
A part of me remembers there's a way you can refresh the object upon write, and wonders if there's a way you can wrap the Product and the ProductSummary and tweak the mapping so that you read part and write part of it (waves hands on syntax and mapping). Or create a Hibernate-facing object with read-only fields that can be split and merged into your two objects. I don't know if your design allows Hibernate-only objects; it's a common idiom in my system.
Either way could be useful if you had a lot of objects in this situation; if this is the only object you're searching this way, your three steps look much clearer.
As for the syntax for adding an object manually, I think you're looking for something like this, after your fetch:
// Requires org.hibernate.search.FullTextSession and org.hibernate.search.Search
FullTextSession textSession = Search.getFullTextSession(session);
textSession.index(myProductSummary); // manually (re)index this instance
Was that all you wanted?
Since you are using PostgreSQL, you could insert into the view and use a rule to redirect the insert to the appropriate table.
A PostgreSQL rule is a way to change a query just before it gets executed. I used one in an application which needed a schema change but required the old queries to keep working for a little while.
You can check out the documentation about rules on insert queries on the PostgreSQL site.
Since you'll be inserting into and updating the view, Hibernate Search will work as usual.
EDIT
An easier strategy: you could insert and update ProductSummary whenever you do so on Product, and tell PostgreSQL to ignore the inserts, updates and deletes on the view.
On the database side"
CREATE RULE dontinsert AS ON INSERT TO v_product_summary DO INSTEAD NOTHING;
CREATE RULE dontupdate AS ON UPDATE TO v_product_summary DO INSTEAD NOTHING;
CREATE RULE dontdelete AS ON DELETE TO v_product_summary DO INSTEAD NOTHING;
But I guess you will need to hack a little, since the JDBC call executeUpdate will return 0, and Hibernate will probably freak out.
Technically I think this would be possible, but your entire efficiency dilemma might be better solved with something like memcached, making performance less of an issue and perhaps improving code maintainability, depending on how you currently have it implemented at the statement level. By updating the search index manually, do you mean the database index? That is not recommended, and I'm not sure it's even doable. Why not index them on creation?

Is there a set of best practices for building a Lucene index from a relational DB?

I'm looking into using Lucene and/or Solr to provide search in an RDBMS-powered web application. Unfortunately for me, all the documentation I've skimmed deals with how to get the data out of the index; I'm more concerned with how to build a useful index. Are there any "best practices" for doing this?
Will multiple applications be writing to the database? If so, it's a bit tricky; you have to have some mechanism to identify new records to feed to the Lucene indexer.
Another point to consider is do you want one index that covers all of your tables, or one index per table. In general, I recommend one index, with a field in that index to indicate which table the record came from.
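A minimal sketch of that single-index approach with the Lucene 2.x/3.x-era API (field names are illustrative):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class RowIndexer {
    // Build one Lucene document per database row, with a discriminator
    // field recording which table the row came from.
    public static Document toDocument(String table, String id, String text) {
        Document doc = new Document();
        doc.add(new Field("table", table, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("id", id, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("content", text, Field.Store.NO, Field.Index.ANALYZED));
        return doc;
    }
}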
Hibernate has support for full text search, if you want to search persistent objects rather than unstructured documents.
There's an OpenSymphony project called Compass of which you should be aware. I have stayed away from it myself, primarily because it seems to be way more complicated than search needs to be. Also, as far as I can tell from the documentation (I confess I haven't found the time necessary to read it all), it stores Lucene segments as blobs in the database. If you're familiar with the Lucene architecture, Compass implements a Lucene Directory on top of the database. I think this is the wrong approach. I would leverage the database's built-in support for indexing and implement a Lucene IndexReader instead. The same criticism applies to distributed cache implementations, etc.
I haven't explored this at all, but take a look at LuSql.
Using Solr would be straightforward as well, but there'll be some DRY violations between the Solr schema.xml and your actual database schema. (FYI, Solr does support wildcards, though.)
We are rolling out our first application that uses Solr tonight. With Solr 1.3, they've included the DataImportHandler that allows you to specify your database tables (they call them entities) along with their relationships. Once defined, a simple HTTP request will trigger an import of your data.
Take a look at the Solr wiki page for DataImportHandler for details.
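As a hedged sketch, the import can be kicked off with a plain HTTP GET; the host, port, and handler path below are the common defaults, but they depend on your setup:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class TriggerImport {
    public static void main(String[] args) throws Exception {
        // DataImportHandler's full-import command; a delta-import command
        // also exists for picking up only changed rows.
        URL url = new URL("http://localhost:8983/solr/dataimport?command=full-import");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // Solr's XML status response
            }
        }
    }
}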
As an introduction, Brian McCallister wrote a nice blog post: Using Lucene with OJB.