Key-value design in redis - redis

I just begun work with redis. I have problem when work with it. I have a list of users.
I have a page that displays a list of users, in that page I have pagination, sorting, filter by name address... How can I design key-value redis for easy use?

Redis is not exactly suited for an SQL-alike usage. What I mean is that usually you get data the way you put data in Redis.
Having a list of users with pagination, if you don't need too much filtering, or just limited filtering, can be a good use case using a sorted set data type where you have your user IDs as values, and the unix time as score. If you need another listing sorted by a different field, you'll likely need an additional sorted set, and so forth.
As far as filtering is concerned, you may do it server-side getting ranges from the sorted set and removing the non-matching items if they are sparse. However you can see how this will not scale if your filter selects 10 elements out of millions.
So the applicability of Redis in your use case depends on the exact details, and in general it looks like you may want a database more suitable for complex queries, even if you are likely going to pay the performance price.

Related

Is it ever a good idea to store an array as a field value, or store array values as records?

In my application I've got "articles" (similar to posts/tweets/articles) that are tagged with descriptive predefined tags: i.e "difficult", "easy", "red", "blue", "business" etc
These available tags are stored in a table, call it "tags" that contains all available tags.
Each article can be tagged with multiple tags, editable through a custom admin interface.
It could be tempting to simply bundle the tags for each entity into a stringified array of the IDs of each tag and store it alongside the article record in my "articles" table:
id | title | author | tags
---+-------+--------+-------------
1 | title | TG | "[1,4,7,12]"
though I'm sure this is a bad idea for a number of reasons, is there ever a reasonable reason to do the above?
I think you should read about Database normalization and decide for yourself. In short though, there are a number of issues with your proposal, but you may decide you can live with them.
The most obvious are:
What if an additional tag is added to row(1)? Do you have to first parse, check if it's already present then update the row to be tags.append(newTag).
Worse still deleting a tag? Search tags, is present, re-create tags.
What if a tag is to change name - some moderation process, perhaps?
Worse again, what about dfferent people specifying a tag-name differently - it'd be hard to rationalise.
What if you want to query data based on tags? Your query becomes far more complex than it would need to be.
Presentation: The client has to parse the tag in order to use it. What about the separator field? Change that and all clients have to change.
In short, all of these operations become harder and more cumbersome. Normalization is designed to overcome such issues. Probably the only reason for doing what you say, IMO, is that you're capturing the data as a one-off and it's informational only - that is, makes sense to a user but not to a system per-se. This is kind of like saying it's probably best avoided (again, IMO).
It seems to me like you want to have a separate table that stores tags and holds a foreign key which relates the tag records back to their parent record in the articles table (this is referred to as "normalizing" the database structure).
Doing it like you have suggested by cramming the tags into one field may seem to make sense now, but it will prove to be difficult to maintain and difficult/time consuming to pull the values out efficiently as your application grows in size or the amount of data grows a lot larger.
I would say that there are very few reasons to do what you have suggested, given how straightforward it is to create another table and setup a relationship to link keys between the two tables to maintain referential integrity.
I totally agree that it CAN be a good idea. I am a strong advocate of storing tags in the database as a single delimited list of strings.
BUT: The reason that I agree is that I like to use Azure Search API to index these types of data, so the query to do a lookup based on tags is not done via SQL. (using the Azure search API service is not necessary, but In my experience you will get much better performance and scalability by using a search index that is outside of the database.)
If you primary query language will be SQL (relational based queries)
then you are better off creating a child table that has a row for each
tag, otherwise you will wear a performance hit when your query has to
perform a logic on each value to split it for analysis.
Tagging is a concept that we use to get around relational data or hierarchical mapping, so to get the best performance do no try to use these relational concepts to query the tags. It is often best implemented in NoSQL data storage because they don't try to use the database to process the search queries.
I encourage you to store the data as a delimited string, and use an external indexing service to provide search and insights into your data. This is a good trade off between CRUD data access performance attempts to manage the data and indexes to optimise for searching. Sure you can optimise the DB and the search queries to make it work in SQL but it can take effort to get it right.
Once your user base hits large volumes and you need to support multiple concurrent searches without affecting update performance you will find that external indexing is an awesome investment in your time now, to save you time and resources later.

Elasticsearch querying multiple types and grouped by types?

Suppose I am to search against two types [cars] and [buildings], and I would want the results to be separated. Is there a way one can group results by types?
I understand one simple way will be to query each types separately, but for other use cases one may actually need to query tens or hundreds of types together. Is there a native way or hacky way(like using sort) to achieve this?
This type of grouping behavior is (currently) not available in elasticsearch. It has been a long standing request:
https://github.com/elasticsearch/elasticsearch/issues/256
There are two approaches that can help, both of which are far from perfect, but may be good enough for some use cases.
Client side aggregation. Request a lot more results than you plan on displaying and the then bucket those.
Using multi-query. This allows you to easily pass down some number of queries in a single batch, but will have potential scaling problems if the number of queries gets to large.
This is one feature that Solr has that elasticsearch doesn't, but I have never tried it. I used a similar feature with Autonomy IDOL years back, but the performance was abysmal.
If you want the results separated in groups of documents, you're going to have to restructure your documents, since, elasticsearch is focused on finding matching documents. You might get around this by designing a document that has child documents then you can query for matches on the parent document that represents your type.
I guess there might be some common field (let's say it's [price]) if you want to search against different types. Then it would be reasonable to add some different type like [price_aggregator] and put into it fields [type] and [price]. And then you could easily build your query against just one type. This requires some additional work while indexing and more memory to store index but it's much performant when you search.

Too many fields bad for elasticsearch index?

Let say I have a thousand keys, and I would want to store the associated values. The intuitive approach seems to be something like
{
"key1":"someval",
"key2":"someotherval",
...
}
Is this a bad design pattern for elasticsearch index to have thousands of keys? Would each keys introduced this way create overhead for every documents under the index?
If you know there is an upper limit to the number of keys you'll have, a few thousand fields is not a problem.
The problem is when you have an unbounded set of keys, e.g. when the key is derived from a value, as you'll have a continuously growing mapping and thus also cluster state. It can also lead to quirky searches.
This is a common enough question/issue that I dedicated a section to it in my article on Troubleshooting Elasticsearch searches, for Beginners.
In short, thousands of fields is no problem - not having control of the mapping is.
Elasticsearch is not ideal for 1000s of key-value pattern in a document. and if you want to update them in real-time or something, then try redis or riak for that.
If you have thousands of keys in a document/record, essentially they become fields and the value become the text and indexed.
From information-retrieval perspective with large data, it is advised to use fewer big fields than numerous small fields, for faster search performance.

ElasticSearch types and indexing performance

I would like to understand the performance impact of indexing documents of multiple types to a single index where there is an imbalance in the number of items of each type (one type has millions, where another type has just thousands of documents). I have spotted issues on some of my indexes, and ruling out whether types are indexed separately within a single index (or not) would help me. Can I assume that types are indexed separately along the lines of a relational database where each table is effectively separate?
If the answer to the above is no and that types are effectively all lumped together, then I'll lay out the rest of what I'm doing to try and get some more detailed input.
The use case for this example is capturing tweets for Twitter users (call it owner for clarity). I have multi-tenant environment with one index per twitter owner. That said, focusing on a single owner:
I capture the tweets from each timeline (mentions, direct messages, my tweets, and the full 'home' timeline) into a single index, with each timeline type having a different mapping in ElasticSearch
Each tweet refers to a parent type, the user who authored the tweet (which may or may not be the owner), with a parent mapping. There is only a single 'user' type for all the timeline types
I search and facet only ever on one owner in a single query, so I don't have to concern myself searching across multiple indexes
The home timeline may capture millions of tweets, where the owner's own tweets may result in hundreds or thousands
The user documents are routinely updated with information outside of the Twitter timelines, therefore I would like to avoid (if possible) the situation where I have to keep multiple copies of the same user object in sync across multiple indexes
I have noticed a much slower response querying on the indexes with millions of documents, even when excluding the 'home timeline' type with millions of documents indexed, leaving just the types with a few thousand entries. I don't want to have to split the types into separate indexes (unless I have to), due to the parent-child relationship between a tweet and a user.
Is there a way I can understand if the issue is with the total number of documents in a specific index, something to do with the operation of 'has_child' filtered queries, some other poor design of queries or facets, or something else?
Any input would be appreciated.
EDIT
To clarify the statement that tweets are stored per timeline. This means that there is an ElasticSearch type defined for home_timeline, my_tweets_timeline, mentions_timeline, direct_messages_timeline, etc, which correspond to what you see in the standard twitter.com UI. So there is a natural split between the sets of tweets, although with some overlap too.
I have gone back in to check out the has_child queries, and this is a definite red-herring at this point. Basic queries on the larger indexes are much slower, even when querying a type with just a few thousand rows (my_tweets_timeline).
Can I assume that types are indexed separately along the lines of a relational database where each table is effectively separate?
No, types are all lumped together into one index as you guessed.
Is there a way I can understand if the issue is with the total number of documents in a specific index, something to do with the operation of 'has_child' filtered queries, some other poor design of queries or facets, or something else?
The total number of documents in the index is obviously a factor. Whether has_child queries are slow in particular is another question - try comparing the performance of has_child queries with trivial term queries for example. The has_child documentation offers one clue under "memory considerations":
With the current implementation, all _id values are loaded to memory (heap) in order to support fast lookups, so make sure there is enough memory for it.
This would imply a large amount of memory is required for any has_child query where there are millions of potential children. Make sure enough memory is available for such operations, or consider a redesign that removes the need for has_child.

Is Full text indexing for high transaction responses a good idea?

For example, in order to provide an effective way to query repsondents answers to a dynamic questionnaire, where responses are stored in a keyword/response pair.
I am aware that there may be some latency in updating the catalogue/text index as new entries are added, but this may not matter if reporting/querying is not a real time concern. (i.e. performed at some later date)
So in answer to my own question, the transactional aspect of this doesnt actually matter, does it?
I would distinguish between data consistency in selected storage and gap between data arrival and appearing in search results for the user as you might use external or even remote search solutions for your application as the index update might take some significant time depends on the case.