Incremental indexing for semantic search

I wonder whether there are standards or best practices for performing incremental indexing of a triple store for semantic search purposes.
Indeed, to support semantic search one usually uses Solr or Elasticsearch, where resources are indexed according to some specific SPARQL query. While one could re-index the entire resource set once a day, for instance, that is not really desirable. Hence the need to perform it incrementally. However, that somehow requires tracking changes, with the ultimate goal of being able to index or delete only whatever has changed.
For instance, to index only what has changed, the SPARQL query should include some timestamp filter.
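For illustration, here is a minimal sketch of such a delta query using Apache Jena's ARQ client; the dcterms:modified predicate, the endpoint URL, and the cut-off timestamp are all assumptions, since the question doesn't fix a schema:

```java
// A minimal sketch of a delta query, assuming Apache Jena and that every
// resource carries a dcterms:modified timestamp (an assumption, not a given).
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;

public class DeltaQuery {
    public static void main(String[] args) {
        // Only select resources modified since the last successful index run;
        // the literal below is a placeholder for the stored "last run" time.
        String sparql =
            "PREFIX dcterms: <http://purl.org/dc/terms/> " +
            "SELECT ?s ?modified WHERE { " +
            "  ?s dcterms:modified ?modified . " +
            "  FILTER (?modified > \"2024-01-01T00:00:00Z\"^^<http://www.w3.org/2001/XMLSchema#dateTime>) " +
            "}";
        // Hypothetical endpoint URL; substitute your triple store's SPARQL endpoint.
        try (QueryExecution qexec =
                 QueryExecutionFactory.sparqlService("http://localhost:3030/ds/sparql", sparql)) {
            ResultSet results = qexec.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                System.out.println(row.getResource("s") + " changed at " + row.get("modified"));
            }
        }
    }
}
```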
If anyone has suggestions or experience with performing this that they would like to share, it would be much appreciated.
So far I have been somewhat inspired by the EEA ElasticSearch RDF River Plugin. I'm also looking at the Changeset Ontology.

The easiest way to accomplish this would be to hook into the transaction lifecycle. Then you're able to see the changes to the database, which will give you the graph that needs to be indexed.
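For example, here is a minimal sketch of that idea, assuming an RDF4J (formerly Sesame) repository; the listener wiring and the re-index call-out are illustrative only, not a complete implementation:

```java
// A minimal sketch of hooking into the transaction lifecycle, assuming an
// RDF4J-based store; collects the subjects touched in a transaction so that
// only those get re-indexed on commit.
import java.util.HashSet;
import java.util.Set;

import org.eclipse.rdf4j.model.IRI;
import org.eclipse.rdf4j.model.Resource;
import org.eclipse.rdf4j.model.Value;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.event.base.RepositoryConnectionListenerAdapter;

public class DirtySubjectListener extends RepositoryConnectionListenerAdapter {
    // Subjects touched in the current transaction.
    private final Set<Resource> dirty = new HashSet<>();

    @Override
    public void add(RepositoryConnection conn, Resource subject, IRI predicate,
                    Value object, Resource... contexts) {
        dirty.add(subject);
    }

    @Override
    public void remove(RepositoryConnection conn, Resource subject, IRI predicate,
                       Value object, Resource... contexts) {
        dirty.add(subject);
    }

    @Override
    public void commit(RepositoryConnection conn) {
        // Hypothetical: push only the changed subjects to Solr/Elasticsearch here.
        for (Resource subject : dirty) {
            System.out.println("re-index " + subject);
        }
        dirty.clear();
    }
}
```

You would register such a listener by wrapping the repository in a NotifyingRepositoryWrapper; note that the dirty set as written is not thread-safe, so a real implementation would need per-transaction or concurrent state.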
But don't dismiss doing a full re-index on a periodic schedule, such as nightly. Unless your requirement is that full-text searches must always be against the most recent data and your data changes quickly, a full re-index on a regular basis will work just fine.

Why are aggregate functions like group_by not supported in Hibernate Search?
I have a use case where I need to fetch results after applying a group by in the query.
There is no technical reason, if this is what you mean. We could probably add it, but there simply wasn't enough demand for this feature to make it to the top of our priority list.
If you want to see a feature added to Hibernate Search, feel free to create a ticket on our JIRA instance, describing in detail your use case and the API you would expect.
Note that I am not 100% sure we would implement it for the Lucene backend, since that would probably require a lot of effort. But for people using Elasticsearch behind Hibernate Search, we may at least introduce ways to use Elasticsearch's aggregation support from within Hibernate Search. We are currently experimenting with Hibernate Search 6 and trying this is on my checklist.
In the meantime, if you want us to suggest alternatives, please provide more details about your use case: domain model, mapping, fields you would like to aggregate as part of your "group by"...
Why it's missing
The primary reason this is not supported by Hibernate Search is that no one ever asked for it or contributed it.
Another reason is that, since the results would be "groups of entities" while the FullTextQuery API returns a List of entities, this would need a new API specifically for running such queries.
How to get it added
We could build that, but if there is not much interest in the feature, it might not be worth the maintenance work.
If you need such a feature, I suggest you open an issue on the Hibernate Search issue tracker so that other people can also vote for it or express interest. Ideally, someone needing it, like yourself, might be willing to create a patch or at least start a proof of concept.
Alternatives
Until Hibernate Search provides direct support for it, you can still run such queries yourself. See "Using IndexReaders directly" in the documentation for how to work on the Lucene index directly.
Using an IndexReader you can always read from and search the Lucene index with any advanced feature for which Hibernate Search doesn't provide an API.
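For instance, here is a hedged sketch of a group-by over the raw index, assuming a Hibernate Search 5-style SearchFactory and Lucene's grouping module; the Book entity and its "category" field are hypothetical, and the field would need to be indexed with doc values for GroupingSearch to work:

```java
// A hedged sketch: borrow an IndexReader from Hibernate Search and run a
// Lucene grouping query that Hibernate Search itself has no API for.
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.grouping.GroupingSearch;
import org.apache.lucene.search.grouping.TopGroups;
import org.hibernate.search.SearchFactory;

public class GroupByExample {
    public static TopGroups<?> groupByCategory(SearchFactory searchFactory) throws Exception {
        // Borrow a reader for the Book index; must be returned afterwards.
        IndexReader reader = searchFactory.getIndexReaderAccessor().open(Book.class);
        try {
            IndexSearcher searcher = new IndexSearcher(reader);
            GroupingSearch grouping = new GroupingSearch("category");
            grouping.setGroupDocsLimit(5); // top 5 documents per group
            // First 10 groups across all documents.
            return grouping.search(searcher, new MatchAllDocsQuery(), 0, 10);
        } finally {
            searchFactory.getIndexReaderAccessor().close(reader);
        }
    }
}

/** Hypothetical indexed entity standing in for your real domain class. */
class Book {
}
```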

Developing a search and tag heavy website

I'm in the planning phase of developing a very tag heavy website. Everything will essentially be associated with tags and the entire site would be based on searching these tags.
Now, I've been thinking a lot about going the nosql route here, since from what I read and understand, it makes the most sense for something like this.
Would it be best to go with a NoSQL database system? Would it make sense to go with a relational database system? Should I think about incorporating something like Solr?
What would the ideal setup be?
UPDATE:
Ideally they would be user generated, but we all know how that would turn out with giving users that much power. So, let’s change up the requirements and say that users WILL NOT have the power to create tags.
Searching on tags based on text matches is something that would probably be useful and needed. If the tag is “garage sale”, the search for “sale” should also pick this up, at a lower relevance for sure.
I can’t imagine the usage being so much that scaling would be an issue.
Thanks
I would spend a bit of time thinking about these tags. For example, are these tags going to be user generated or will you provide a few tags and let users select which ones they want?
Will you need to search on tags based on text matches? For example if a tag is "garage sale" do you want to search for "sale" to also pick this up? Maybe at a lower relevance?
Also, what kind of usage are you looking at? One good thing about Solr is that it's super easy to scale and synchronize data: it is easy to deploy multiple nodes, shard collections, and replicate data to other nodes, something that traditional databases struggle with.
Another thing to keep in mind is that Solr is usually not the official "repository of record"; most of the time the data gets fed to it from a DB somewhere, but all reading activity is done from Solr.
See this answer for a SQL solution. Offhand I can't think of any advantage to using most NoSQL databases (i.e. key-value, columnar, or document) as the SQL solution will be more compact and ought to give good performance; a graph database may be appropriate if you're doing a lot of navigational type queries on your tags, but it doesn't sound like that's the case.
Use of Solr (or ElasticSearch or whatever) is orthogonal to your primary database; it may be appropriate to incorporate a search tool if users are typing inexact tags for search, but I recommend integrating a stemming library or something along those lines before turning to a full blown search tool.
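To make the stemming suggestion concrete, here is a small sketch using Lucene's analysis classes alone (no full search stack); with a recent Lucene, both the tag "Garage Sales" and the query "sale" reduce to the same stem, so a simple inverted index over stemmed tag tokens would match them:

```java
// A sketch of stemming tag text with Lucene's EnglishAnalyzer; the field name
// "tag" and the sample text are just examples.
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TagStemmer {
    public static void main(String[] args) throws Exception {
        try (Analyzer analyzer = new EnglishAnalyzer();
             TokenStream ts = analyzer.tokenStream("tag", "Garage Sales")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString()); // prints: garag, sale
            }
            ts.end();
        }
    }
}
```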

Automatic indexing using Hibernate Search

Is it possible to do automatic indexing at the time of inserting/updating a record in the DB using Hibernate Search, instead of doing it manually every time? Currently someone has to run the indexing and keep an eye on it, so I want my code to handle that part automatically: indexing on every change, with no need for checking.
Yes, it is very much possible. You can just use annotations on your Entities. Take a look at this guide:
http://java.dzone.com/articles/hibernate-search-mapping-entit
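For reference, here is a minimal sketch of what such an annotated entity looks like, in the style of Hibernate Search 3.x-5.x; the Product entity and its fields are just examples:

```java
// A minimal entity wired for automatic indexing: once annotated like this,
// every insert/update/delete through Hibernate also updates the Lucene index.
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;

import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;

@Entity
@Indexed // marks the entity for full-text indexing
public class Product {

    @Id
    @GeneratedValue
    private Long id;

    @Field // tokenized, searchable full-text field
    private String name;

    @Field
    private String description;

    // getters/setters omitted for brevity
}
```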
Edit:
Hi. If your Hibernate properties are correct, then once your index is built, you don't have to index your tables manually. Each insert going through an EntityManager/SessionFactory will hit Hibernate Search, and if the entity is indexed, it will also update the index. Have you configured your search properly? Take a look at the following links:
http://docs.jboss.org/seam/snapshot/en-US/html/search.html http://docs.jboss.org/hibernate/search/3.1/reference/en/html/search-configuration-event.html.
As the documentation clearly states
'By default, every time an object is inserted, updated or deleted through Hibernate, Hibernate Search updates the corresponding Lucene index. It is sometimes desirable to disable that feature if either your index is read-only or if index updates are done in a batch way (see Chapter 6, Manual indexing).'
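To make that toggle concrete, here is a hedged sketch of setting the relevant property (named hibernate.search.indexing_strategy in Hibernate Search 3.x-5.x); the persistence unit name is a placeholder:

```java
// Event-based indexing is on by default; this sketch shows where the switch lives.
import java.util.HashMap;
import java.util.Map;

import javax.persistence.EntityManagerFactory;
import javax.persistence.Persistence;

public class SearchConfig {
    public static EntityManagerFactory build() {
        Map<String, String> props = new HashMap<>();
        // "event" (the default) updates the index on every insert/update/delete;
        // "manual" disables that, leaving indexing to explicit batch jobs.
        props.put("hibernate.search.indexing_strategy", "event");
        // "my-pu" is a placeholder persistence unit name.
        return Persistence.createEntityManagerFactory("my-pu", props);
    }
}
```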

Why does RavenDB read all documents in the indexing process and not only the collections used by the index?

I have a quite large database with ~2.6 million documents, where two collections hold 1.2 million each and the rest are small collections (<1000 documents). When I create a new index for a small collection, it takes a lot of time for indexing to complete (so temp indexes are useless). It seems that RavenDB's indexing process reads every document in the DB and checks whether it should be added to the index. I think it would perform better to index only the collections used by the index.
Also, when using Smuggler to export data, if I want to export only one small collection, it reads all documents and exporting can take quite a lot of time. At the same time, a custom app which uses the RavenDB LINQ API and indexes can export the data in seconds.
Why does RavenDB behave like this? And is there maybe some configuration setting which might change this behavior?
RavenDB doesn't actually have any real concept of a "collection". All documents are pretty much the same. It simply looks at the Raven-Entity-Name metadata in each document to determine how to group things together for purposes of querying by type and displaying the "Collections" page in the management studio.
I am not sure of the specific rationale for this. I think it has something to do with the underlying ESENT tables used by the document store. Perhaps Ayende can answer better. Your particular use cases are good examples for why it might be done differently.
One thing you could try is to use multiple databases. You could put your large-quantity documents in one database, and put everything else in another. Of course, you may have problems with indexing related documents, multi-map/reduce, or other scenarios where documents of different types need to be together in the same database.
It seems that the answer to my question is coming in RavenDB 3.0. Ayende says:
In RavenDB 2.x, you still have to pay the full price for indexing everything, but that isn’t the case in RavenDB 3.0. What we have done is to effectively optimize the process so that in this case, we will preload all of the documents taking part in the relevant collection, and send them directly to be indexed.

We do this by utilizing the Raven/DocumentsByEntityName index, which has already indexed everything in the database anyway. This is a nice little feature, because it allows us to really take advantage of the work we already did long ago. Using one index to pre-populate another is a neat trick, and one that I am very happy about.
And here is full blog post: http://ayende.com/blog/165923/shiny-features-in-the-depth-new-index-optimization

Is there a set of best practices for building a Lucene index from a relational DB?

I'm looking into using Lucene and/or Solr to provide search in an RDBMS-powered web application. Unfortunately for me, all the documentation I've skimmed deals with how to get the data out of the index; I'm more concerned with how to build a useful index. Are there any "best practices" for doing this?
Will multiple applications be writing to the database? If so, it's a bit tricky; you have to have some mechanism to identify new records to feed to the Lucene indexer.
Another point to consider is do you want one index that covers all of your tables, or one index per table. In general, I recommend one index, with a field in that index to indicate which table the record came from.
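For example, here is a brief sketch of that layout with plain Lucene (5+-style API); the field names and the table/pk values are made up:

```java
// A sketch of the "one index, tagged by source table" layout: each DB row
// becomes a Lucene document carrying its source table and primary key.
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

public class RowIndexer {
    // Index one DB row; the "table" field lets results be filtered by type
    // and routed back to the right entity at query time.
    public static void indexRow(IndexWriter writer, String table, String pk, String body)
            throws Exception {
        Document doc = new Document();
        doc.add(new StringField("table", table, Store.YES)); // exact-match field
        doc.add(new StringField("pk", pk, Store.YES));       // primary key for lookups
        doc.add(new TextField("body", body, Store.NO));      // analyzed full-text field
        writer.addDocument(doc);
    }
}
```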
Hibernate has support for full text search, if you want to search persistent objects rather than unstructured documents.
There's an OpenSymphony project called Compass of which you should be aware. I have stayed away from it myself, primarily because it seems to be way more complicated than search needs to be. Also, as far as I can tell from the documentation (I confess I haven't found the time necessary to read it all), it stores Lucene segments as blobs in the database. If you're familiar with the Lucene architecture, Compass implements a Lucene Directory on top of the database. I think this is the wrong approach. I would leverage the database's built-in support for indexing and implement a Lucene IndexReader instead. The same criticism applies to distributed cache implementations, etc.
I haven't explored this at all, but take a look at LuSql.
Using Solr would be straightforward as well, but there'll be some DRY violations between the Solr schema.xml and your actual database schema. (FYI, Solr does support wildcards, though.)
We are rolling out our first application that uses Solr tonight. With Solr 1.3, they've included the DataImportHandler, which allows you to specify your database tables (they call them entities) along with their relationships. Once defined, a simple HTTP request will trigger an import of your data.
Take a look at the Solr wiki page for DataImportHandler for details.
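For example, here is a hedged sketch of triggering the import from Java; the host, port, and handler path depend entirely on how your Solr instance is configured:

```java
// A sketch of kicking off a DataImportHandler run over HTTP; the URL is a
// placeholder matching a typical default Solr setup.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class TriggerImport {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:8983/solr/dataimport?command=full-import");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // Solr responds with import status
            }
        } finally {
            conn.disconnect();
        }
    }
}
```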
As an introduction:
Brian McCallister wrote a nice blog post: Using Lucene with OJB.