How to create a facet in Sitecore Content Search (Lucene) based on Real Time Data?

Is it possible, with Sitecore Content Search configuration, to support the addition of a field that is populated with a value at search time rather than index time? The population would come from an in-memory data structure for performance.
Essentially, the values need to be updated and accessed without re-indexing. Examples of such a real-time field would be Facebook likes, in-stock status, or real-time pricing. This data would then be used for faceting, such as items within a range of Facebook likes, in-stock versus out-of-stock, or real-time price facets.

The Content Search API does its searching on an IIndexable, so I would look into that; you'd probably have to implement this interface yourself.
More info here:
http://www.sitecore.net/learn/blogs/technical-blogs/sitecore-7-development-team/posts/2013/04/sitecore-7-search-operations-explained.aspx
If you need to search on data that is not in the index, I would question whether Sitecore search is the best option here. If the data needs to be searched in real time, then maybe a database would suffice.
If the data set is large and you need real-time access, then a NoSQL database such as MongoDB might be the right choice. Hope this has given you some ideas and that you reach a solution.
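To make the general pattern concrete, independent of any particular search API, here is a minimal Python sketch of faceting search hits against an external in-memory store at query time; all of the names (search_index, realtime_store, facet_by_stock) are hypothetical stand-ins, not part of Sitecore.

    from collections import Counter

    # Hypothetical in-memory store, refreshed by a background job;
    # maps item id -> real-time attributes (stock, likes, price, ...).
    realtime_store = {
        "item-1": {"in_stock": True, "facebook_likes": 1200},
        "item-2": {"in_stock": False, "facebook_likes": 45},
        "item-3": {"in_stock": True, "facebook_likes": 310},
    }

    def search_index(keyword):
        """Stand-in for the real index query; returns matching item ids."""
        return ["item-1", "item-2", "item-3"]

    def facet_by_stock(keyword):
        """Compute an in-stock/out-of-stock facet at search time, not index time."""
        hits = search_index(keyword)
        return dict(Counter(
            "in_stock" if realtime_store[i]["in_stock"] else "out_of_stock"
            for i in hits if i in realtime_store
        ))

    print(facet_by_stock("shoes"))  # e.g. {'in_stock': 2, 'out_of_stock': 1}

Because the facet is computed from the in-memory structure, updating a price or a like count never touches the index.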

You can leverage Sitecore dynamic indexes. The idea is to query your "large" index from within an in-memory index that you use dynamically. The implementation is relatively easy.
More info: http://www.sitecore.net/en-gb/learn/blogs/technical-blogs/sitecore-7-development-team/posts/2013/04/sitecore-7-dynamic-indexes.aspx

Related

Why use multiple ElasticSearch indices for one web application?

In asking questions about using ES for web applications, suggestions have been made to have one index for things like user profiles, another index for data, etc., and several other ones for logs.
With all of these on a cluster serving several web applications, it seems like things could get messy or disorganized.
In that case, are people using one cluster per application? I am a bit confused because when I read articles about indexing logs, they seem to refer to storing the data in multiple indices, rather than types within an index.
Secondly, why not have one index per app, with types for logs, user profiles, data, etc.?
Is there some benefit to using multiple indices rather than many types within an index for a web application?
-- UPDATE --
To add to this, the comments on this question, "Elastic search, multiple indexes vs one index and types for different data sets?", don't seem to go far enough in explaining why:
data retention: for application log/metric data, use different indexes if you require different retention periods
Is that recommended because it's just simpler to delete an entire index rather than a type within an index? Does it have to do with the way the data is stored then space recovered after deleting the data?
I found the primary reason for creating multiple indices that satisfies my quest for an answer in ElasticSearch's pagination documentation:
To understand why deep paging is problematic, let’s imagine that we are searching within a single index with five primary shards. When we request the first page of results (results 1 to 10), each shard produces its own top 10 results and returns them to the requesting node, which then sorts all 50 results in order to select the overall top 10.
Now imagine that we ask for page 1,000 (results 10,001 to 10,010). Everything works in the same way except that each shard has to produce its top 10,010 results. The requesting node then sorts through all 50,050 results and discards 50,040 of them!
You can see that, in a distributed system, the cost of sorting results grows exponentially the deeper we page. There is a good reason that web search engines don’t return more than 1,000 results for any query.
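To see what that looks like at the API level, here is a small sketch assuming a recent official elasticsearch Python client (8.x-style keyword arguments); the index name and query are placeholders:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Page 1: each of the five shards returns its top 10 hits and the
    # coordinating node merges 50 of them to pick the overall top 10.
    page_1 = es.search(index="products", query={"match_all": {}}, size=10, from_=0)

    # "Page 1,000": each shard must now build its top 10,010 hits and the
    # coordinating node sorts ~50,050 of them just to keep 10. Note that this
    # also exceeds the default index.max_result_window of 10,000 and is
    # rejected unless that setting is raised.
    page_1000 = es.search(index="products", query={"match_all": {}}, size=10, from_=10000)

When deep paging is genuinely required, search_after or a scroll is the usual workaround rather than large from values.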

Why not assign multiple types in an ElasticSearch index for logging, rather than multiple indices?

I am currently researching some data storage strategies with ElasticSearch and wonder why, for storing logs, this page indicates:
A standard format is to assign a new index for each day.
Would it not make more sense to create one index (database) with a new type name (table) per day?
I am looking at this from the point of view that each index is tied to a different web application.
In another scenario, a web app uses one index. One of the types within that index is used for logging (what we currently do with SQL Server). Is this a good approach?
Interesting idea and, yes, you could probably do that. Why use multiple indices instead? If having control over things like shard-to-node allocation (maybe you want all of 2015 stored on one set of nodes, 2014 on another), filter cache size, and similar is important, you lose that by going to a single-index/multi-mapping approach. For very high volume applications, that control might be significant. YMMV.
With regard to the "each index is tied to a different web application" sentiment, aliases can be (and are) used to collect multiple physical indices under a single searchable umbrella; you create one index per day/week/whatever, say logs-20150730, logs-20150731..., and assign the logs alias to all of the indices in the series. The net effect is the same as having a single "index".
The nice part of the alias approach is that purging/pruning old data is trivial; just delete the index when its contents age out of whatever your data retention policy is. With multi-mappings, you'd have to delete the requisite mapping within the index (do-able, but pretty I/O intrusive, since you'd likely be shoving stuff around inside every shard the mapping was distributed through).
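A rough sketch of the daily-index-plus-alias pattern with the elasticsearch Python client; the index naming scheme and the 30-day retention period are just examples:

    from datetime import date, timedelta
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")
    today = date.today()

    # Create today's physical index and attach it to the shared "logs" alias.
    index_name = f"logs-{today:%Y%m%d}"
    if not es.indices.exists(index=index_name):
        es.indices.create(index=index_name)
    es.indices.put_alias(index=index_name, name="logs")

    # Searches go through the alias, so they span every daily index at once.
    es.search(index="logs", query={"match": {"level": "ERROR"}})

    # Retention is just an index delete, e.g. drop anything older than 30 days.
    old_index = f"logs-{(today - timedelta(days=30)):%Y%m%d}"
    if es.indices.exists(index=old_index):
        es.indices.delete(index=old_index)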

Applying pagination on Keen's extraction API

I have a large number of messages in a Keen collection and want to expose them to our end users through pagination using an API. Is it possible to specify offset-like queries in Keen?
We previously had a traditional database, so we were able to support the above operations, and we are thinking of shifting to Keen because of its easier analysis capabilities.
It's not possible to paginate extractions.
We created the Extractions API to allow you to get your event data out of Keen IO any time you like. It's your data and we believe that you should always have full access to it! Think of extractions as a way to export data rather than a way to query it and you'll begin to understand how extractions are intended to be used.
Keen is great at collecting and analyzing data, but it's not great at being a database. You will struggle to provide the user experience your users deserve if you attempt to use extractions in a real-time user facing manner. Our recommendation for a use case like yours is to add a database layer that stores your entity data somewhere outside of Keen. Augment that entity data with the results of your queries from Keen and you'll be all set.
I hope this helps!
Terry's advice is sound, but if you can live with an approximation of pagination, then consider making multiple requests with non-overlapping timeframes.
For example, if you wanted to paginate over an hour's worth of data you could issue extractions over 1 minute of data at a time until you reach the desired page size. You would keep track of where you left off to load the next "page", and so forth.
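Here is a rough Python sketch of that idea; fetch_events is a hypothetical wrapper around whatever extraction call you use (the Keen client or the HTTP API), and the one-minute window and 100-event page size are arbitrary:

    from datetime import datetime, timedelta

    PAGE_SIZE = 100
    WINDOW = timedelta(minutes=1)

    def fetch_events(start, end):
        """Hypothetical wrapper: run a Keen extraction over [start, end)."""
        return []  # call the extraction API here and return a list of events

    def pages(start, end):
        """Yield approximate 'pages' by walking non-overlapping timeframes."""
        page, cursor = [], start
        while cursor < end:
            window_end = min(cursor + WINDOW, end)
            page.extend(fetch_events(cursor, window_end))
            cursor = window_end
            while len(page) >= PAGE_SIZE:
                yield page[:PAGE_SIZE]
                page = page[PAGE_SIZE:]
        if page:
            yield page

    # Example: paginate over one hour's worth of events.
    start = datetime(2016, 1, 1, 12, 0)
    for page in pages(start, start + timedelta(hours=1)):
        print(len(page))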

Redis full text search: reverse indexing or Sunspot?

I have 3.5 million records (read-only) currently stored in a MySQL DB that I want to pull into Redis for performance reasons. So far, I've managed to store things like this in Redis:
1 {"type":"Country","slug":"albania","name_fr":"Albanie","name_en":"Albania"}
2 {"type":"Country","slug":"armenia","name_fr":"Arménie","name_en":"Armenia"}
...
The key I use here is the legacy MySQL id, so with some Ruby glue, I can break as few things as possible in this existing app (and this is a serious concern here).
Now the problem is when I need to perform a search on the keyword "Armenia", inside the value part. It seems like there are only two ways out:
Either I multiply the Redis indexes:
id => JSON values (as shown above)
slug => id (reverse indexing based on the slug, which could do the basic search trick)
finally, another huge index specifically for autocomplete, as shown in this post : http://oldblog.antirez.com/post/autocomplete-with-redis.html
Or I use Sunspot or some full-text search engine (unfortunately, I currently use Thinking Sphinx, which is too tied to MySQL :-( )
So, what would you do? Do you think the MySQL-to-Redis move of a single table is even a good idea? I'm afraid of the memory footprint those gigantic Redis key/values could take on a 16 GB RAM server.
Any feedback on a similar Redis usage?
Before I start with a real answer, I wanted to mention that I don't see a good reason for you to be using Redis here. Based on the kinds of use cases you're describing, something like Elasticsearch would be more appropriate for you.
That said, if you just want to be able to search for a few different fields within your JSON, you've got two options:
Auxiliary index that points field_key -> list_of_ids (in your case, "Armenia" -> 1).
Use Lua on top of Redis with JSON encoding and decoding to get at what you want. This is way more flexible and space efficient, but will be slower as your table grows.
Again, I don't think either is appropriate for you because it doesn't sound like Redis is going to be a good choice for you, but if you must, those should work.
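For the first option, here is a minimal redis-py sketch of the auxiliary index; the key names (idx:name_en:..., idx:slug:...) are just one possible convention, and the data mirrors the question:

    import json
    import redis

    r = redis.Redis(host="localhost", port=6379, decode_responses=True)

    # Primary data, keyed by the legacy MySQL id, as in the question.
    r.set("1", json.dumps({"type": "Country", "slug": "albania",
                           "name_fr": "Albanie", "name_en": "Albania"}))
    r.set("2", json.dumps({"type": "Country", "slug": "armenia",
                           "name_fr": "Arménie", "name_en": "Armenia"}))

    # Auxiliary index: lowercased searchable term -> set of matching ids.
    r.sadd("idx:name_en:armenia", 2)
    r.sadd("idx:slug:armenia", 2)

    # Lookup: resolve the term to ids, then fetch the JSON values.
    ids = r.smembers("idx:name_en:armenia")
    records = [json.loads(r.get(i)) for i in ids]
    print(records)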
Here's my take on Redis.
Basically, I think of it as an in-memory cache that can be configured to evict the least recently used data (LRU). That is the role I made it play in my use case, and the logic may be applicable to helping you think about your use case.
I'm currently using Redis to cache results for a search engine based on some complex (slow) queries, backed by data in another DB (similar to your case). So Redis serves as a cache storage for answering queries. All queries are served either from the data in Redis or, on a cache miss, from the DB. So, note that Redis is not replacing the DB, but merely extending it as a cache in my case.
This fit my specific use case, because the addition of Redis was supposed to assist future scalability. The idea is that repeated access of recent data (in my case, if a user does a repeated query) can be served by Redis, and take some load off of the DB.
Basically, my Redis schema ended up looking somewhat like the duplication of the index you outlined above. I used sets and sorted sets to create "batches/sets" of Redis keys, each of which pointed to specific query results stored under a particular Redis key. And in the DB, I still had the complete data set and an index.
If your data set fits in RAM, you could do the "table dump" into Redis and get rid of the need for MySQL. I could see this working, as long as you plan for persistent Redis storage and for the possible growth of your data, if this "table" will grow in the future.
So, depending on your actual use case, how you see Redis fitting into your stack, and the load your DB serves, don't rule out the possibility of having to do both of the options you outlined above (which happened in my case).
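The cache layer described above boils down to the classic cache-aside pattern. A minimal redis-py sketch, where the key scheme, the 5-minute TTL, and run_slow_query are placeholders for your own setup:

    import json
    import redis

    r = redis.Redis(decode_responses=True)
    # Server-side, LRU eviction is configured in redis.conf, e.g.:
    #   maxmemory 2gb
    #   maxmemory-policy allkeys-lru

    def run_slow_query(query):
        """Placeholder for the expensive DB-backed search."""
        return [{"id": 2, "name_en": "Armenia"}]

    def cached_search(query, ttl=300):
        key = f"searchcache:{query}"
        hit = r.get(key)
        if hit is not None:                      # cache hit: serve from Redis
            return json.loads(hit)
        results = run_slow_query(query)          # cache miss: fall back to the DB
        r.set(key, json.dumps(results), ex=ttl)  # cache for the next caller
        return results

    print(cached_search("armenia"))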
Hope this helps!
Redis does provide full-text search with RediSearch.
RediSearch implements a search engine on top of Redis. It also enables more advanced features, like exact phrase matching, auto-suggestions, and numeric filtering for text queries, that are not possible or efficient with traditional Redis search approaches.
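If the RediSearch module is loaded, here is a sketch of index creation and search through redis-py's generic execute_command, using field names from the question's JSON; the FT.CREATE syntax shown is the RediSearch 2.x form, so adjust for your version:

    import redis

    r = redis.Redis(decode_responses=True)

    # Build a RediSearch index over hashes whose keys start with "country:".
    r.execute_command(
        "FT.CREATE", "idx:countries", "ON", "HASH", "PREFIX", "1", "country:",
        "SCHEMA", "name_en", "TEXT", "name_fr", "TEXT", "slug", "TAG",
    )

    # Documents are plain hashes; RediSearch indexes them automatically.
    r.hset("country:2", mapping={"type": "Country", "slug": "armenia",
                                 "name_fr": "Arménie", "name_en": "Armenia"})

    # Full-text search on the value part, with no hand-rolled reverse index.
    print(r.execute_command("FT.SEARCH", "idx:countries", "Armenia"))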

How to monitor prices of apps in App Store

Right now, I can get prices of apps using the search and lookup APIs, but I don't know how to monitor the prices. Should I check all the apps using the API every day, or even every few hours? It seems to be a huge task.
And here's another question: how can I get info for "all" the apps, since the APIs need a keyword or id parameter?
You would want to index them initially, then index the most popular keywords that you're looking for, so you can search for those and update the prices if necessary. In terms of actually scanning the entire App Store every few hours, that seems a bit much. Can you expand on what you'll be doing with this information?
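For the polling half, here is a small sketch against the public iTunes lookup endpoint using requests; the app ids, the price field, and the six-hour interval are assumptions, and at any real scale you would need batching, persistence, and rate limiting:

    import time
    import requests

    APP_IDS = ["284882215", "310633997"]   # placeholder iTunes track ids
    known_prices = {}

    def fetch_prices(app_ids):
        """Look up current prices for a batch of apps via the iTunes lookup API."""
        resp = requests.get("https://itunes.apple.com/lookup",
                            params={"id": ",".join(app_ids)})
        resp.raise_for_status()
        return {str(app["trackId"]): app.get("price")
                for app in resp.json().get("results", [])}

    while True:
        for app_id, price in fetch_prices(APP_IDS).items():
            if app_id in known_prices and known_prices[app_id] != price:
                print(f"price change for {app_id}: {known_prices[app_id]} -> {price}")
            known_prices[app_id] = price
        time.sleep(6 * 60 * 60)   # re-check every few hours, as discussed above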