Google ndb datastore new composite index issue - app-engine-ndb

We tried adding a new composite index to an existing entity kind, but the existing data was not indexed as expected.
We worked around the issue by reading all the data and rewriting it to the datastore. After that, the data was indexed and available for querying.
Just curious: is this a temporary issue on Google's end, or is it a known limitation of ndb?

This is the expected behavior. When using Google Cloud Datastore you have to know ahead of time what your queries will be, in order to avoid having to read all the entities of your kind and write them again. From time to time I end up having to do the same thing myself, whether for your use case or to add or remove an indexed property.
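For reference, a rough sketch of that read-and-rewrite pass in Python NDB; the MyModel kind and its properties are hypothetical stand-ins for your own model:

    from google.appengine.ext import ndb

    class MyModel(ndb.Model):  # hypothetical kind; substitute your own model
        name = ndb.StringProperty()
        score = ndb.IntegerProperty()

    def reindex_all(batch_size=500):
        # Read every entity and write it back unchanged; the put() is what
        # builds the new composite index rows for pre-existing data.
        cursor, more = None, True
        while more:
            entities, cursor, more = MyModel.query().fetch_page(
                batch_size, start_cursor=cursor)
            ndb.put_multi(entities)

In practice you would run this from a task queue or the remote API shell, since a single request is unlikely to finish for a large kind.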
This answer explains everything about indexing: https://stackoverflow.com/a/35744783/190908
There was a bug that affected composite indexes: it required you to index every individual property in the composite index. Since the pricing model changed, though, this won't end up costing you more nowadays: https://code.google.com/p/googleappengine/issues/detail?id=4231

Related

Why not assign multiple types in an ElasticSearch index for logging, rather than multiple indices?

I am currently researching data storage strategies with ElasticSearch and wonder why, for storing logs, this page indicates:
A standard format is to assign a new index for each day.
Would it not make more sense to create one index (database) with a new type name (table) per day?
I am looking at this from the point of view that each index is tied to a different web application.
In another scenario, a web app uses one index. One of the types within that index is used for logging (what we currently do with SQL Server). Is this a good approach?
Interesting idea and, yes, you could probably do that. Why use multiple indices instead? If having control over things like shard-to-node allocation (maybe you want all of 2015 stored on one set of nodes, 2014 on another), filter cache size, and similar is important, you lose that by going to a single-index/multi-mapping approach. For very high volume applications, that control might be significant. YMMV.
With regard to the "each index is tied to a different web application" sentiment, aliases can be (and are) used to collect multiple physical indices under a single searchable umbrella; you create one index per day/week/whatever, say, logs-20150730, logs-20150731..., and assign the logs alias to all of the indices in the series. The net effect is the same as having a single "index".
The nice part of the alias approach is that purging/pruning old data is trivial; just delete the index when its contents age out of whatever your data retention policy is. With multi-mappings, you'd have to delete the requisite mapping within the index (doable, but pretty I/O intensive, since you'd likely be shoving stuff around inside every shard the mapping was distributed through).
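A minimal sketch of that lifecycle, assuming the official elasticsearch-py client; the index and alias names are just examples:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])

    # Create today's index and hang it on the shared alias.
    es.indices.create(index="logs-20150731")
    es.indices.put_alias(index="logs-20150731", name="logs")

    # Searches go through the alias and span every daily index at once.
    es.search(index="logs", body={"query": {"match_all": {}}})

    # Pruning old data is a single, cheap index deletion.
    es.indices.delete(index="logs-20150730")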

How to create a facet in Sitecore Content Search (Lucene) based on Real Time Data?

With the Sitecore Content Search configuration, is it possible to support the addition of a field which is populated with a value at search time rather than at index time? The population would come from an in-memory data structure, for performance.
Essentially, the values need to be updated/accessed without re-indexing. Examples of such real-time fields would be Facebook likes, in-stock status, or real-time pricing. This data would then be used for faceting, such as items within a range of Facebook likes, in-stock versus out-of-stock, or real-time price facets.
The Content Search API does the searching on an IIndexable, so I would look into that; you'd probably have to implement this interface yourself.
More info here:
http://www.sitecore.net/learn/blogs/technical-blogs/sitecore-7-development-team/posts/2013/04/sitecore-7-search-operations-explained.aspx
If you need to search on data that is not in the index, I would question whether Sitecore search is the best option here. If the data needs to be searched in real time, then maybe a database would suffice.
If the data set is large and you need real-time access, then maybe a NoSQL database such as MongoDB might be the right choice. Hope this has given you some ideas and you reach a solution.
You can leverage the Sitecore dynamic index. The idea is to query your "large" index from within your in-memory index which you'll use dynamically. The implementation is relatively easy.
More info: http://www.sitecore.net/en-gb/learn/blogs/technical-blogs/sitecore-7-development-team/posts/2013/04/sitecore-7-dynamic-indexes.aspx

Versioning data in SQL Server so user can take a certain cut of the data

I have a requirement that, in a SQL Server-backed website which is essentially a large CRUD application, the user should be able to 'go back in time' and export the data as it was at a given point in time.
My question is: what is the best strategy for this problem? Is there a systematic approach I can take and apply across all tables?
Depending on what exactly you need, this can be relatively easy or hell.
Easy: make a history table for every table and copy the data there before each update, or after each insert/update (so new rows are captured too). Never delete from the original table; use logical deletes instead.
Hard: keep a database version counter that increments on every change, and correlate every data item with a start and an end version. This requires very fancy primary key mangling.
Just to add a little comment to the previous answers: if you need to go back in time for all users, you can use database snapshots.
The simplest solution is to save a copy of each row whenever it changes. This can be done most easily with a trigger. Your UI must then provide search abilities to go back and find the data.
This does produce an explosion of data, which gets worse when tables are updated frequently, so the next step is usually some kind of date-based purge of older data.
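As a rough illustration of the trigger approach, here is a sketch that creates a history table and an AFTER UPDATE/DELETE trigger from Python via pyodbc; the Customer table and its columns are hypothetical:

    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=localhost;DATABASE=AppDb;Trusted_Connection=yes")
    cur = conn.cursor()

    # History table mirrors the live table plus an "archived at" timestamp.
    cur.execute("""
        CREATE TABLE CustomerHistory (
            Id int, Name nvarchar(100), Email nvarchar(255),
            ArchivedAt datetime2 NOT NULL DEFAULT SYSUTCDATETIME())
    """)

    # On every update or delete, copy the pre-change rows (SQL Server's
    # "deleted" pseudo-table) into the history table.
    cur.execute("""
        CREATE TRIGGER trg_Customer_History ON Customer
        AFTER UPDATE, DELETE AS
            INSERT INTO CustomerHistory (Id, Name, Email)
            SELECT Id, Name, Email FROM deleted
    """)
    conn.commit()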
An implementation you could look at is Team Foundation Server. It has the ability to perform historical queries (using the WIQL keyword ASOF). The backend is SQL Server, so there might be some clues there.

Best way to keep index real time?

I have a Solr/Lucene index file of approximately 700 GB. The documents that I need to index are being read in real time; roughly 1000 docs every 30 minutes are submitted and need to be indexed. In my scenario a script runs every 30 minutes and indexes the documents that are not yet indexed, since it is a requirement that new documents be searchable as soon as possible, but this process slows down searching.
Is this the best way to index the latest documents, or is there a better way?
First, remember that Solr is not a real-time search engine (yet). There is still work to be done.
You can use a master/slave setup, where indexing is done on the master and searching on the slave. That way, indexing does not affect search performance. After a commit is done on the master, force the slave to fetch the latest index from the master. While the new index is being replicated to the slave, the slave keeps serving queries from the previous index.
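To force that fetch rather than wait for the slave's polling interval, the replication handler accepts a fetchindex command over HTTP; a sketch, assuming Java-based replication (Solr 1.4+) with hypothetical host and core names:

    import requests

    # Ask the slave to pull the latest committed index from the master.
    # It keeps answering queries from the old index until the copy completes.
    requests.get(
        "http://slave-host:8983/solr/core0/replication",
        params={"command": "fetchindex"})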
Also, check your cache warming settings. Remember that these might slow down searches if they are too aggressive. Also check the queries launched on the newSearcher event.
You can do this with Lucene easily. Split the index into multiple parts (or, to be precise, while building indexes, create "smaller" parts). Create a searcher for each of the parts and store references to them. You can create a MultiSearcher on top of these individual parts.
Now, there will be only one index that will get the new documents. At regular intervals, add documents to this index, commit and re-open this searcher.
After the last index is updated, you can create a new multi-searcher again, using the previously opened searchers.
Thus, at any point, you will be re-opening only one searcher and that will be quite fast.
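A rough sketch of that arrangement, shown with PyLucene 3.x (where MultiSearcher still exists; it was removed in Lucene 4); the index paths are hypothetical:

    import lucene  # PyLucene 3.x
    lucene.initVM()

    def open_searchers(static_paths, live_path):
        # One read-only searcher per large, immutable part, plus one for the
        # small "live" index that receives the new documents.
        return [lucene.IndexSearcher(
                    lucene.SimpleFSDirectory(lucene.File(p)), True)
                for p in static_paths + [live_path]]

    searchers = open_searchers(["/idx/part1", "/idx/part2"], "/idx/live")
    multi = lucene.MultiSearcher(searchers)

    # After each commit to the live index: re-open only that one searcher,
    # then rebuild the (cheap) MultiSearcher over the same list.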
Check out Zoie (http://code.google.com/p/zoie/), a wrapper around Lucene that makes it real-time; the code was donated by LinkedIn.
^^ I do this with plain Lucene (not Solr), and it works really nicely; however, I'm not sure whether there is a Solr way to do it at the moment. Twitter recently went with Lucene for searching and has effectively real-time search by writing to their index on every update. Their index resides completely in memory, so updating/reading the index is of no consequence and happens instantly, and a Lucene index can always be read while being written to, as long as there is only one writer at a time.
Check out this wiki page

Strategies for keeping a Lucene Index up to date with domain model changes

I was looking to get people's thoughts on keeping a Lucene index up to date as changes are made to the domain model objects of an application.
The application in question is a Java/J2EE based web app that uses Hibernate. The way I currently have things working is that the Hibernate-mapped model objects all implement a common "Indexable" interface that can return a set of key/value pairs to be recorded in Lucene. Whenever a CRUD operation is performed on such an object, I send it via a JMS queue into a message-driven bean that records in Lucene the primary key of the object and the key/value pairs returned from the index() method of the Indexable object that was provided.
My main worries about this scheme are that the MDB could get behind and fail to keep up with the incoming indexing operations, or that some sort of error/exception could stop an object from being indexed. The result would be an out-of-date index for either a short or a long period of time.
Basically I was just wondering what kind of strategies others had come up with for this sort of thing. Not necessarily looking for one correct answer but am imagining a list of "whiteboard" sort of ideas to get my brain thinking about alternatives.
Change the message: just provide the primary key and the current date, not the key/value pairs. Your MDB fetches the entity by primary key and calls index(). After indexing, you set an "updated" value in the index document to the message date, and you only update the index if the message date is after the "updated" field of the existing document. This way you can't get behind, because you always fetch the current key/value pairs first.
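In outline, with hypothetical stand-ins for the entity lookup and the Lucene-backed store (the message carries only the key and a timestamp):

    def handle_index_message(msg, load_entity, index):
        # msg = {"pk": ..., "sent_at": ...}; no key/value pairs on the queue.
        # load_entity and index are hypothetical stand-ins for the Hibernate
        # session lookup and the Lucene index wrapper.
        doc = index.get(msg["pk"])
        if doc is not None and doc["updated"] >= msg["sent_at"]:
            return  # a newer message already indexed this entity; skip it

        entity = load_entity(msg["pk"])  # always fetches the *current* state
        fields = dict(entity.index())    # the Indexable key/value pairs
        fields["updated"] = msg["sent_at"]
        index.put(msg["pk"], fields)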
As an alternative: have a look at http://www.compass-project.org.
The accepted answer is 8 years old now and very out of date.
The Compass Project has not been maintained for a long time, as its main developer moved on to create the excellent Elasticsearch.
The modern answer to this is to use Hibernate Search, which incidentally can map to either a Lucene index directly or through Elasticsearch.