Strategies for keeping a Lucene index up to date with domain model changes

I was looking to get people's thoughts on keeping a Lucene index up to date as changes are made to the domain model objects of an application.
The application in question is a Java/J2EE-based web app that uses Hibernate. The way I currently have things working is that the Hibernate-mapped model objects all implement a common "Indexable" interface that can return a set of key/value pairs to be recorded in Lucene. Whenever a CRUD operation is performed on such an object, I send it via a JMS queue to a message-driven bean (MDB) that records in Lucene the primary key of the object and the key/value pairs returned from the index() method of the Indexable object that was provided.
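Roughly, the pieces look like this; this is a simplified sketch with made-up class and method names rather than the real code:

    import java.io.Serializable;
    import java.util.HashMap;
    import java.util.Map;
    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.JMSException;
    import javax.jms.ObjectMessage;
    import javax.jms.Queue;
    import javax.jms.Session;

    // The shared interface every Hibernate-mapped model object implements.
    public interface Indexable {
        Serializable getPrimaryKey();
        Map<String, String> index();   // key/value pairs to be recorded in Lucene
    }

    // Fired from the CRUD layer after each save/update/delete.
    class IndexMessageSender {
        private final ConnectionFactory connectionFactory;
        private final Queue indexQueue;

        IndexMessageSender(ConnectionFactory connectionFactory, Queue indexQueue) {
            this.connectionFactory = connectionFactory;
            this.indexQueue = indexQueue;
        }

        void send(Indexable entity) throws JMSException {
            Connection connection = connectionFactory.createConnection();
            try {
                Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
                ObjectMessage message = session.createObjectMessage(new HashMap<String, String>(entity.index()));
                message.setStringProperty("primaryKey", entity.getPrimaryKey().toString());
                session.createProducer(indexQueue).send(message);
            } finally {
                connection.close();
            }
        }
    }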
My main worries about this scheme are that the MDB could get behind and be unable to keep up with the indexing operations coming in, or that some sort of error/exception could stop an object from being indexed. The result would be an index that is out of date for either a short or a long period of time.
Basically I'm just wondering what kinds of strategies others have come up with for this sort of thing. I'm not necessarily looking for one correct answer; I'm imagining a list of "whiteboard"-style ideas to get my brain thinking about alternatives.

Change the message: just provide the primary key and the current date, not the key/value pairs. Your MDB fetches the entity by primary key and calls index(). After indexing, you set an "updated" value in your index to the message date, and you only update the index if the message date is after the "updated" field of the indexed document. This way you can't get behind, because you always fetch the current key/value pairs first.
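A rough sketch of what that MDB could look like; EntityLoader and lastIndexedTimestamp() are placeholders for "fetch the current entity by primary key via Hibernate" and "read the stored 'updated' value for this primary key from the index":

    import java.io.IOException;
    import java.util.Map;
    import javax.jms.Message;
    import javax.jms.MessageListener;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    // Constructor wiring is shown for brevity; a real MDB would use container injection.
    public class IndexUpdateListener implements MessageListener {
        private final IndexWriter writer;
        private final EntityLoader loader;          // hypothetical: wraps the Hibernate session

        public IndexUpdateListener(IndexWriter writer, EntityLoader loader) {
            this.writer = writer;
            this.loader = loader;
        }

        @Override
        public void onMessage(Message message) {
            try {
                String primaryKey = message.getStringProperty("primaryKey");
                long messageDate = message.getJMSTimestamp();
                if (messageDate <= lastIndexedTimestamp(primaryKey)) {
                    return;                          // a newer update already made it into the index
                }
                Indexable entity = loader.load(primaryKey);   // always re-read the *current* state
                Document doc = new Document();
                doc.add(new StringField("pk", primaryKey, Store.YES));
                doc.add(new StoredField("updated", messageDate));
                for (Map.Entry<String, String> pair : entity.index().entrySet()) {
                    doc.add(new TextField(pair.getKey(), pair.getValue(), Store.NO));
                }
                writer.updateDocument(new Term("pk", primaryKey), doc);  // delete-then-add by primary key
            } catch (Exception e) {
                throw new RuntimeException(e);       // let the container redeliver the message
            }
        }

        private long lastIndexedTimestamp(String primaryKey) throws IOException {
            // Hypothetical: look up the stored "updated" value for this primary key,
            // or Long.MIN_VALUE when the document has never been indexed.
            return Long.MIN_VALUE;
        }
    }

    interface EntityLoader {                         // hypothetical abstraction over Hibernate
        Indexable load(String primaryKey);
    }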
As an alternative: have a look at http://www.compass-project.org.

The accepted answer is 8 years old now and very out of date.
The Compass Project has not been maintained for a long time; its main developer moved on to create the excellent Elasticsearch.
The modern answer is to use Hibernate Search, which can map your entities either directly to a Lucene index or to Elasticsearch.
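For reference, a minimal mapping in the Hibernate Search 5 style looks roughly like this (the entity is made up); Hibernate Search listens to Hibernate ORM events and keeps the index in sync for you, so there is no hand-rolled JMS/MDB plumbing:

    import javax.persistence.Entity;
    import javax.persistence.Id;
    import org.hibernate.search.annotations.Field;
    import org.hibernate.search.annotations.Indexed;

    @Entity
    @Indexed                // Hibernate Search keeps a Lucene (or Elasticsearch) index in sync with this entity
    public class Book {

        @Id
        private Long id;    // used as the index document id

        @Field              // this property is analyzed and indexed
        private String title;

        // getters and setters omitted
    }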

Related

Google ndb datastore new composite index issue

We tried adding a new composite index to an existing entity, but the existing data is not indexed as expected.
We worked around the issue by reading all the data and re-writing it to the datastore. After that, the data is indexed and available for querying.
Just curious: is this a temporary issue on Google's end, or is it a known limitation of ndb?
This is the expected behavior. When using Google Cloud Datastore you have to know ahead of time what your queries will be, in order to avoid having to read all the entities of your kind and write them again. From time to time I end up having to do that myself as well, either for a use case like yours or to add or remove a property.
This answer explains everything about indexing: https://stackoverflow.com/a/35744783/190908
There was a bug that affected composite indexes: it required you to index every individual property in the composite index. Since the pricing model changed, that no longer costs you more nowadays: https://code.google.com/p/googleappengine/issues/detail?id=4231

Updating Lucene payloads without a full re-index

In Lucene, I'm using payloads to store information for each token in a document (a float value in my case). From time to time, those payloads may need to be updated. If I know the docID, termID, offset, etc., is there any way for me to update the payloads in place without having to re-index the whole document?
I'm not aware of any Lucene API that supports this; even an "update" operation is executed under the hood as a "delete" followed by an "add".
A workaround that requires more storage, but reduces I/O and latency, is to store the whole source of a document either in the Lucene index itself or in a dedicated data store on the same node as the Lucene index. You could then send only the updated payload information to your application to get the document updated; the whole document still needs to be re-indexed, but it can be rebuilt from the stored source without fetching everything again.
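A rough sketch of that workaround with the current Lucene field classes (field names are made up; the analyzer configured on the writer for the "body" field would be the one that attaches your payloads):

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    // Rebuilds the whole document from a stored copy of its source and swaps it in.
    public class PayloadReindexer {

        public static void reindex(IndexWriter writer, String docId, String source) throws IOException {
            Document doc = new Document();
            doc.add(new StringField("id", docId, Store.YES));
            doc.add(new StoredField("_source", source));        // keep the source so no external fetch is needed next time
            doc.add(new TextField("body", source, Store.NO));   // re-analyzed, so payloads are rewritten
            writer.updateDocument(new Term("id", docId), doc);  // under the hood: delete the old doc, add the new one
        }
    }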
See also How to set a field to keep a row unique in lucene?

Why not assign multiple types in an ElasticSearch index for logging, rather than multiple indices?

I am currently researching some data storage strategies with ElasticSearch and wonder why, for storing logs, this page indicates:
A standard format is to assign a new index for each day.
Would it not make more sense to create one index (database) with a new type name (table) per day?
I am looking at this from the point of view of each index is tied to a different web application.
In another scenario, a web app uses one index. One of the types within that index is used for logging (what we currently do with SQL Server). Is this a good approach?
Interesting idea and, yes, you could probably do that. Why use multiple indices instead? If having control over things like shard-to-node allocation (maybe you want all of 2015 stored on one set of nodes and 2014 on another), filter cache size, and the like is important, you lose that by going to a single-index/multi-mapping approach. For very high-volume applications, that control might be significant. YMMV.
With regard to the "each index is tied to a different web application" sentiment, aliases can be (and are) used to collect multiple physical indices under a single searchable umbrella; you create one index per day/week/whatever, say logs-20150730, logs-20150731..., and assign the logs alias to all of the indices in the series. The net effect is the same as having a single "index".
The nice part of the alias approach is that purging/pruning old data is trivial: just delete an index when its contents age out of whatever your data retention policy is. With multi-mappings, you'd have to delete the requisite mapping within the index (doable, but fairly I/O-intensive, since you'd likely be shoving data around inside every shard the mapping was distributed across).
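For illustration, putting daily indices behind a single alias is one call to the _aliases endpoint; the index names and the localhost endpoint below are just examples, and any HTTP client would do:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Registers two daily indices under one "logs" alias so the application
    // keeps querying a single name.
    public class AliasSetup {
        public static void main(String[] args) throws Exception {
            String body = "{ \"actions\": ["
                    + " { \"add\": { \"index\": \"logs-20150730\", \"alias\": \"logs\" } },"
                    + " { \"add\": { \"index\": \"logs-20150731\", \"alias\": \"logs\" } }"
                    + " ] }";
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:9200/_aliases"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());
            // Pruning old data is then just: DELETE http://localhost:9200/logs-20150730
        }
    }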

How to use db indexing to create unique node (or get existing node handle) in neo4j through Java code?

I want to create nodes representing persons, but I don't want to duplicate them. For every insertion I get two entities between which I have to create a relationship, so it may happen that neither, one, or both are already in my database. So, how can I use indexing to deal with this situation?
There is a manual page on exactly this topic:
http://docs.neo4j.org/chunked/stable/transactions-unique-nodes.html
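A rough sketch of the get-or-create pattern that page describes, using the embedded Java API and a legacy index (Neo4j 2.x era; names are illustrative):

    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.Transaction;
    import org.neo4j.graphdb.index.Index;

    // Get-or-create for a person node keyed by name.
    public class UniquePersons {

        public static Node getOrCreatePerson(GraphDatabaseService graphDb, String name) {
            try (Transaction tx = graphDb.beginTx()) {
                Index<Node> people = graphDb.index().forNodes("people");
                Node candidate = graphDb.createNode();
                candidate.setProperty("name", name);
                // putIfAbsent returns the previously indexed node, or null if ours won the race.
                Node existing = people.putIfAbsent(candidate, "name", name);
                if (existing != null) {
                    candidate.delete();   // this person already exists; discard our duplicate
                    candidate = existing;
                }
                tx.success();
                return candidate;
            }
        }
    }

Call it once for each of the two entities, then create the relationship between the two returned nodes.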

Data structure for efficient access of random slices of data from an API call

We are writing a library for an API which pulls down an ordered stream of data. Through this API you can request data by slices. For instance, if I want items 15-25, I can make an API call for that.
The library we are writing will allow the client to call for any slice of data as well, but we want the library to be as efficient with these API calls as possible. So if I've already asked for items 21-30, I don't ever want to request those individual data items again. If someone asks the library for 15-25, we want to call the API only for 15-20. We will need to look up what data we already have and avoid requesting it again.
What is the most efficient data structure for storing the results of these api calls? The data sets will not be huge so search time in local memory isn't that big of a deal. We are looking for simplicity and cleanliness of code. There are several obvious answers to this problem but I'm curious if any data structure nerds out there have an elegant solution that isn't coming to mind.
For reference we are coding in Python but are really just looking for a data structure that solves this problem elegantly.
I'd use a balanced binary tree (e.g. http://pypi.python.org/pypi/bintrees/0.4.0) to map begin -> (end, data). When a new request comes in for a [b, e) range, do a search for b (followed by a move to the previous record if b != key), another search for e (also stepping back), scan all entries between the resulting keys, pull down the missing ranges, and merge all the from-cache intervals and the new data into one interval. For N intervals in the cache, you'll get an amortized O(log N) cost for each cache update.
You can also simply keep a list of (begin, end, data) tuples, ordered by begin, and use bisect_right to search. Cost: O(N), where N is the number of cached intervals, for every update in the worst case; but if clients tend to request data in increasing order, each cache update will be O(1).
Cache search itself is O(log N) in either case.
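For illustration, here is the same balanced-tree idea sketched with Java's TreeMap (a red-black tree), since the question says any language's data structure would do; the per-interval data payload is omitted to keep it short:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    // Keys are interval starts, values are exclusive interval ends.
    class RangeCache {
        private final TreeMap<Integer, Integer> cached = new TreeMap<>(); // begin -> end (exclusive)

        /** Returns the sub-ranges of [begin, end) that are not yet cached and must be fetched. */
        List<int[]> missingRanges(int begin, int end) {
            List<int[]> missing = new ArrayList<>();
            int cursor = begin;
            // Start from the interval at or before 'begin', if it overlaps the request.
            Map.Entry<Integer, Integer> prev = cached.floorEntry(begin);
            if (prev != null && prev.getValue() > cursor) {
                cursor = prev.getValue();
            }
            // Walk the cached intervals that start inside the requested range.
            for (Map.Entry<Integer, Integer> e : cached.subMap(begin, false, end, false).entrySet()) {
                if (e.getKey() > cursor) {
                    missing.add(new int[] { cursor, e.getKey() });
                }
                cursor = Math.max(cursor, e.getValue());
            }
            if (cursor < end) {
                missing.add(new int[] { cursor, end });
            }
            return missing;
        }

        /** Records [begin, end) as cached, merging it with any overlapping intervals. */
        void add(int begin, int end) {
            Map.Entry<Integer, Integer> prev = cached.floorEntry(begin);
            if (prev != null && prev.getValue() >= begin) {
                begin = prev.getKey();
                end = Math.max(end, prev.getValue());
            }
            // Absorb every interval that starts before the (possibly extended) end.
            Map.Entry<Integer, Integer> next;
            while ((next = cached.ceilingEntry(begin)) != null && next.getKey() <= end) {
                end = Math.max(end, next.getValue());
                cached.remove(next.getKey());
            }
            cached.put(begin, end);
        }
    }

With a cache holding [21, 31), missingRanges(15, 26) returns the single range [15, 21), which matches the "only fetch 15-20" case from the question.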
The canonical data structure often used to solve this problem is an interval tree. (See this Wikipedia article.) Your problem can be thought of as needing to know which intervals you've already fetched overlap with the one you're about to fetch; then cut out the parts that intersect (which is linear in the number of overlapping intervals found) and you're there. The "Augmented" tree halfway down the Wikipedia article looks simpler to implement, though, so I'd stick with that. It should be "log N" time complexity, amortized or not.