Redis Secondary Indexes and Performance Question

I know that Redis doesn't really have the concept of secondary indexes, but that you can use the Z* commands to simulate one. I have a question about the best way to handle the following scenario.
We are using Redis to keep track of orders. But we also want to be able to find those orders by phone number or email ID. So here is our data:
> set 123 7245551212:dlw@email.com
> set 456 7245551212:dlw@email.com
> set 789 7245559999:kdw@email.com
> zadd phone-index 0 7245551212:123:dlw@email.com
> zadd phone-index 0 7245551212:456:dlw@email.com
> zadd phone-index 0 7245559999:789:kdw@email.com
I can see all the orders for a phone number via the following (is there a better way to get the range other than adding a 'Z' to the end?):
> zrangebylex phone-index [7245551212 (7245551212Z
1) "7245551212:123:dlw#dcsg.com"
2) "7245551212:456:dlw#dcsg.com"
My question is, is this going to perform well? Or should we just create a list that is keyed by phone number, and add an order ID to that list instead?
> rpush phone:7245551212 123
> rpush phone:7245551212 456
> rpush phone:7245559999 789
> lrange phone:7245551212 0 -1
1) "123"
2) "456"
Which would be the preferred method, especially related to performance?

RE: is there a better way to get the range other than adding a 'Z' to the end?
Yes, use the next immediate character instead of adding Z:
zrangebylex phone-index [7245551212 (7245551213
But certainly the second approach offers better performance.
Using a sorted set for lexicographical indexing, you need to consider that:
The addition of elements, ZADD, is O(log(N))
The query, ZRANGEBYLEX, is O(log(N)+M) with N being the number of elements in the sorted set and M the number of elements being returned
In contrast, using lists:
The addition, RPUSH, is O(1)
The query, LRANGE, is O(S+N); since you start at index zero, it is effectively O(N), with N the number of elements returned.
You can also use sets (SADD and SMEMBERS). The difference is that lists allow duplicates and preserve insertion order, while sets ensure uniqueness but do not preserve insertion order (see the sketch below).
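To make the comparison concrete, here is a minimal Ruby sketch of the list-based index and the set-based variant, using the redis gem (the phone: prefix follows the question; the phone-set: prefix is illustrative):

require 'redis'

redis = Redis.new

# Orders themselves, keyed by order ID.
redis.set('123', '7245551212:dlw@email.com')
redis.set('456', '7245551212:dlw@email.com')

# List-based index: one list per phone number holding order IDs.
redis.rpush('phone:7245551212', '123')  # O(1) per insert
redis.rpush('phone:7245551212', '456')
redis.lrange('phone:7245551212', 0, -1) # => ["123", "456"], insertion order

# Set-based variant: also O(1) per insert, but duplicates are ignored
# and insertion order is not preserved.
redis.sadd('phone-set:7245551212', '123')
redis.sadd('phone-set:7245551212', '456')
redis.smembers('phone-set:7245551212')  # => ["123", "456"], arbitrary order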

A ZSET uses a skiplist for ordering plus a dict for member lookup. When all elements are added with the same score, the skiplist orders members lexicographically, so a lexicographical range search is still O(log(N)).
So if you don't always need range queries over phone numbers, you should use a list keyed by phone number for precise lookups. This also works for email (you can use a hash to combine these two lists). Query performance this way will be much better than with a ZSET.
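As a sketch of that suggestion - indexing each order under both its phone number and its email, with the two pushes kept in sync in a MULTI block (key names here are illustrative, not from the question):

require 'redis'

redis = Redis.new

# Index one order under both its phone number and its email address.
def index_order(redis, order_id, phone, email)
  redis.multi do |r|
    r.rpush("phone:#{phone}", order_id)
    r.rpush("email:#{email}", order_id)
  end
end

index_order(redis, '123', '7245551212', 'dlw@email.com')
index_order(redis, '456', '7245551212', 'dlw@email.com')

redis.lrange('phone:7245551212', 0, -1)     # => ["123", "456"]
redis.lrange('email:dlw@email.com', 0, -1)  # => ["123", "456"]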

Related

Redis get multiple sorted sets?

I have multiple sorted sets, which I have named by keys like:
hello:user_id:2015-01-01
hello:user_id:2015-01-02
hello:user_id:2015-01-03
hello:user_id:2015-01-04
etc.
Is it possible to get all of these sets for dates between hello:user_id:2015-01-01 and hello:user_id:2015-01-04 ?
As @zenbeni pointed out, this is possible with ZUNIONSTORE. Here is how you can run it:
ZUNIONSTORE resultzset 4 hello:user_id:2015-01-01 hello:user_id:2015-01-02 hello:user_id:2015-01-03 hello:user_id:2015-01-04
Once that runs, the result will be stored in resultzset, which you can query to get the stored values:
ZRANGE resultzset 0 -1
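If the date range is not fixed at four days, the key list can be built programmatically. A minimal Ruby sketch with the redis gem (the key pattern is taken from the question):

require 'redis'
require 'date'

redis = Redis.new

from = Date.new(2015, 1, 1)
to   = Date.new(2015, 1, 4)

# One key per day in the range, matching the question's naming scheme.
keys = (from..to).map { |d| "hello:user_id:#{d.iso8601}" }

# Union them into a scratch key, then read back the combined members.
redis.zunionstore('resultzset', keys)
redis.zrange('resultzset', 0, -1, withscores: true)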

PostgreSQL, find strings differ by n characters

Suppose I have a table like this:
id  data
1   0001
2   1000
3   2010
4   0120
5   0020
6   0002
SQL Fiddle demo.
id is the primary key; data is a fixed-length string whose characters can be 0, 1, or 2.
Is there a way to build an index so I can quickly find strings that differ by n characters from a given string? For example, for the string 0001 and n = 1, I want to get row 6.
Thanks.
There is the levenshtein() function, provided by the additional module fuzzystrmatch. It does exactly what you are asking for:
SELECT *
FROM a
WHERE levenshtein(data, '1110') = 1;
SQL Fiddle.
But it is not very fast; with big tables it is slow, because it can't use an index.
You might get somewhere with the similarity or distance operators provided by the additional module pg_trgm. Those can use a trigram index, as detailed in the linked manual pages. I did not get anywhere, though; the module uses a different definition of "similarity".
Generally the problem seems to fit in the KNN ("k nearest neighbours") search pattern.
If your case is as simple as the example in the question, you can use LIKE in combination with a trigram GIN index, which should be reasonably fast with big tables:
SELECT *
FROM a
WHERE data <> '1110'
AND (data LIKE '_110' OR
     data LIKE '1_10' OR
     data LIKE '11_0' OR
     data LIKE '111_');
Obviously, this technique quickly becomes unfeasible with longer strings and more than 1 difference.
However, since the string is so short, any query will match a rather big percentage of the base table. Therefore, index support will hardly buy you anything. Most of the time it will be faster for Postgres to scan sequentially.
I tested with 10k and 100k rows, with and without a trigram GIN index. Since ~19% of rows match the criteria for the given test case, a sequential scan is faster and levenshtein() still wins. For more selective queries matching less than around 5% of the rows (it depends), a query using an index is (much) faster.

How should I model this in Redis?

FYI: Redis n00b.
I need to store search terms in my web app.
Each term will have two attributes: "search_count" (integer) and "last_searched_at" (time)
Example I've tried:
Redis.hset("search_terms", term, {count: 1, last_searched_at: Time.now})
I can think of a few different ways to store them, but no good ways to query on the data. The report I need to generate is a "top search terms in last 30 days". In SQL this would be a where clause and an order by.
How would I do that in Redis? Should I be using a different data type?
Thanks in advance!
I would consider two sorted sets.
When a search term is submitted, get the current timestamp and:
zadd timestamps timestamp term
zincrby counts 1 term
The above two operations should be performed atomically (for example, in a MULTI/EXEC transaction).
Then to find all terms in the given time interval timestamp_from, timestamp_to:
zrangebyscore timestamps timestamp_from timestamp_to
After you get these, loop over them and fetch each term's count from counts.
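A minimal Ruby sketch of that flow (key names as above; the MULTI block provides the atomicity mentioned earlier):

require 'redis'

redis = Redis.new

# Record a search: remember when it happened and bump its count, atomically.
def record_search(redis, term)
  redis.multi do |r|
    r.zadd('timestamps', Time.now.to_i, term)
    r.zincrby('counts', 1, term)
  end
end

# Terms searched in [from, to], ordered by count descending.
def top_terms(redis, from, to, limit = 10)
  terms = redis.zrangebyscore('timestamps', from, to)
  terms.map { |t| [t, redis.zscore('counts', t).to_i] }
       .sort_by { |_, count| -count }
       .first(limit)
end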
Alternatively, I am curious whether you can use zunionstore. Here is my test in Ruby:
require 'redis'

KEYS = %w(counts timestamps results)
TERMS = %w(test0 keyword1 test0 test1 keyword1 test0 keyword0 keyword1 test0)

def redis
  @redis ||= Redis.new
end

def timestamp
  (Time.now.to_f * 1000).to_i
end

redis.del KEYS
TERMS.each {|term|
  redis.multi {|r|
    r.zadd 'timestamps', timestamp, term
    r.zincrby 'counts', 1, term
  }
  sleep rand
}
redis.zunionstore 'results', ['timestamps', 'counts'], weights: [1, 1e15]
KEYS.each {|key|
  p [key, redis.zrange(key, 0, -1, withscores: true)]
}
# top 2 terms
p redis.zrevrangebyscore 'results', '+inf', '-inf', limit: [0, 2]
EDIT: at some point you would need to clear the counts set. Something similar to what @Eli proposed (https://stackoverflow.com/a/16618932/410102).
Depends on what you want to optimize for. Assuming you want to be able to run that query very quickly and don't mind expending some memory, I'd do this as follows.
Keep a key for every second you see some search (you can go more or less granular if you like). The key should point to a hash of $search_term -> $count where $count is the number of times $search_term was seen in that second.
Keep another key for every time interval (we'll call this $time_int_key) over which you want data (in your case, this is just one key where your interval is the last 30 days). This should point to a sorted set where the items in the set are all of your search terms seen over the last 30 days, and the score they're sorted by is the number of times they were seen in the last 30 days.
Have a background worker that, every second, grabs the key for the second that occurred exactly 30 days ago and loops through the hash attached to it. For every $search_term in that key, it should subtract the $count from the score associated with that $search_term in $time_int_key.
This way, you can just use ZREVRANGE $time_int_key 0 $m to grab the m top searches ([WITHSCORES] if you want the search counts) in O(log(N)+m) time. That's more than cheap enough to run as frequently as you want in Redis, for just about any reasonable m, and it keeps the data updated in real time.
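A hedged sketch of that scheme in Ruby (the key names searches:<epoch-second> and top-searches-30d are illustrative, not from the answer):

require 'redis'

redis = Redis.new
WINDOW = 30 * 24 * 60 * 60  # 30 days in seconds

# Record one search: bump the per-second hash and the rolling 30-day zset.
def record_search(redis, term, now = Time.now.to_i)
  second_key = "searches:#{now}"
  redis.multi do |r|
    r.hincrby(second_key, term, 1)
    r.expire(second_key, WINDOW + 60)  # keep it just past the window
    r.zincrby('top-searches-30d', 1, term)
  end
end

# Background worker tick: retire the second that just left the window.
def expire_tick(redis, now = Time.now.to_i)
  old_key = "searches:#{now - WINDOW}"
  redis.hgetall(old_key).each do |term, count|
    redis.zincrby('top-searches-30d', -count.to_i, term)
  end
  redis.del(old_key)
end

# Top 10 searches right now:
redis.zrevrange('top-searches-30d', 0, 9, withscores: true)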

Redis fetch all value of list without iteration and without popping

I have a simple Redis list with the key "supplier_id".
Now all I want is to retrieve all the values of the list without actually iterating over it or popping values from it.
For example, to retrieve all the values from a list, I currently iterate over the list's length:
element = []
0.upto(redis.llen("supplier_id") - 1) do |index|
  element << redis.lindex("supplier_id", index)
end
Can this be done without the iteration, perhaps with better Redis modelling? Can anyone suggest?
To retrieve all the items of a list with Redis, you do not need to iterate and fetch each individual item. That would be really inefficient.
You just have to use the LRANGE command to retrieve all the items in one shot:
elements = redis.lrange( "supplier_id", 0, -1 )
will return all the items of the list without altering the list itself.
I'm a bit unclear on your question, but if the supplier_id is numeric, why not use a ZSET?
Add your values like so:
ZADD suppliers 1 "data for supplier 1"
ZADD suppliers 2 "data for supplier 2"
ZADD suppliers 3 "data for supplier 3"
You could then remove everything up to (but not including) supplier three like so:
ZREMRANGEBYSCORE suppliers -inf 2
or
ZREMRANGEBYSCORE suppliers -inf (3
That also gives you very fast access (by supplier id) if you just want to read from it.
Hope that helps!

How to create a sorted set with "field1 desc, field2 asc" order in Redis?

I am trying to build leaderboards in Redis and be able to get top X scores and retrieve a rank of user Y.
Sorted sets in Redis look like an easy fit, except for one problem - I need scores to be sorted not only by the actual score, but also by date (so whoever got the same score earlier ranks higher). The SQL query would be:
select * from scores order by score desc, date asc
Running zrevrange on a sorted set in Redis uses something like:
select * from scores order by score desc, key desc
Which would put users with lexicographically bigger keys above.
One solution I can think of is making some manipulations with a score field inside a sorted set to produce a combined number that consists of a score and a timestamp.
For example for a score 555 with a timestamp 111222333 the final score could be something like 555.111222333 which would put newer scores above older ones (not exactly what I need but could be adjusted further).
This would work, but only on small numbers, as a score in a sorted set has only 16 significant digits, so 10 of them will be wasted on a timestamp right away leaving not much room for an actual score.
Any ideas how to make a sorted set arrange values in a correct order? I would really want an end result to be a sorted set (to easily retrieve user's rank), even if it requires some temporary structures and sorts to build such set.
Actually, all my previous answers are terrible. Disregard all my previous answers (although I'm going to leave them around for the benefit of others).
This is how you should actually do it:
Store only the scores in the zset
Separately store a list of each time a player achieved that score.
For example:
score_key = <whatever unique key you want to use for this score>
redis('ZADD scores-sorted %s %s' %(score, score))
redis('RPUSH score-%s %s' %(score, score_key))
Then to read the scores:
top_score_keys = []
for score in redis('ZREVRANGE scores-sorted 0 10'):
    for score_key in redis('LRANGE score-%s 0 -1' %(score, )):
        top_score_keys.append(score_key)
Obviously you'd want to do some optimizations there (e.g., only reading chunks of the score-* lists instead of reading each one in full).
But this is definitely the way to do it.
User rank would be straightforward: for each user, keep track of their high score:
redis('SET highscores-%s %s' %(user_id, user_high_score))
Then determine their rank from the score's position among distinct scores plus the user's position within the same-score list (note the first term counts distinct higher scores, not players, so this is an approximation; LPOS requires Redis 6.0.6+, and user_score_key is the key that was pushed for this user's score):
user_high_score = redis('GET highscores-%s' %(user_id, ))
score_rank = int(redis('ZREVRANK scores-sorted %s' %(user_high_score, )))
score_rank += int(redis('LPOS score-%s %s' %(user_high_score, user_score_key)))
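To make the scheme concrete, here is a minimal Ruby sketch using the redis gem (key names follow the pseudocode above):

require 'redis'

redis = Redis.new

# Record a score: the zset holds only distinct scores (score == member),
# and a per-score list records who reached it, in order of arrival.
def add_score(redis, score, score_key)
  redis.multi do |r|
    r.zadd('scores-sorted', score, score)
    r.rpush("score-#{score}", score_key)
  end
end

# Read the top scores, earliest achiever first within each score.
def top_score_keys(redis, n = 10)
  redis.zrevrange('scores-sorted', 0, n - 1).flat_map do |score|
    redis.lrange("score-#{score}", 0, -1)
  end
end

add_score(redis, 555, 'alice')
add_score(redis, 555, 'bob')    # same score, later: ranks below alice
add_score(redis, 700, 'carol')

top_score_keys(redis)  # => ["carol", "alice", "bob"]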
It's not really the perfect solution, but if you use a custom epoch that is closer to the current time, you need fewer digits to represent the timestamp.
For instance, if you use January 1, 2012 as your epoch, you (currently) only need 8 digits to represent the timestamp.
Here's an example in Ruby:
(Time.now - Time.new(2012, 1, 1)).to_i
This would give you about 3 years before the timestamp would require 9 digits, at which time you could perform some maintenance to move the custom epoch forward again.
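As an illustration of packing such a timestamp into the score, here is a Ruby sketch. The layout - integer part for the game score, fractional part for seconds since the custom epoch, inverted so that earlier submissions win under descending order - is an assumption for this sketch, not from the answer:

require 'redis'

redis = Redis.new

EPOCH = Time.new(2012, 1, 1).to_i
MAX   = 10**9  # ~31 years of seconds; keeps the fractional part below 1.0

# Combined score: integer part is the game score; the fractional part
# encodes the timestamp, inverted so earlier submissions sort higher
# under ZREVRANGE. Assumes integer scores well below 10**6 so the double
# Redis uses keeps enough precision for both parts.
def combined(score, time = Time.now)
  elapsed = time.to_i - EPOCH
  score + (MAX - elapsed).to_f / MAX
end

redis.zadd('leaderboard', combined(555, Time.new(2015, 6, 1)), 'alice')
redis.zadd('leaderboard', combined(555, Time.new(2015, 6, 2)), 'bob')
redis.zadd('leaderboard', combined(700), 'carol')

# Highest score first; ties broken by earlier time.
redis.zrevrange('leaderboard', 0, -1)  # => ["carol", "alice", "bob"]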
I would, however, love to hear if anyone has a better idea, since I have the exact same problem.
(Note: this answer is almost certainly suboptimal; see https://stackoverflow.com/a/10575370/71522)
Instead of using a timestamp in the score, you could use a global counter. For example:
score_key = <whatever unique key you want to use for this score>
score_number = redis('INCR global-score-counter')
redis('ZADD sorted-scores %s.%s %s' %(score, score_number, score_key))
And to sort them in descending order, pick a large score count (1<<24, say), use that as the initial value of global-score-counter, then use DECR instead of INCR.
(this would also apply if you are using a timestamp)
Alternatively, if you are really, incredibly worried about the number of players, you could use a per-score counter:
score_key = <whatever unique key you want to use for this score>
score_number = redis('HINCRBY score-counter %s 1' %(score, ))
redis('ZADD sorted-scores %s.%s %s' %(score, score_number, score_key))
(Note: this answer is almost certainly suboptimal; see https://stackoverflow.com/a/10575370/71522)
A couple thoughts:
You could make some assumptions about the timestamps to make them smaller. For example, instead of storing Unix timestamps, you could store "number of minutes since May 13, 2012" (for example). In exchange for seven significant digits, this would let you store times for the next 19 years.
Similarly, you could reduce the number of significant digits in the scores. For example, if you expect scores to be in the 7-digit range, you could divide them by 10, 100, or 1000 when storing them in the sorted list, then use the results of the sorted list to access the actual scores, sorting those at the application level.
For example, using both of the above (in potentially buggy pseudo-code):
score_small = int(score / 1000)
time_small = int((time - 1336942269) / 60)
score_key = uuid()
redis('SET full-score-%s "%s %s"' %(score_key, score, time))
redis('ZADD sorted-scores %s.%s %s' %(score_small, time_small, score_key))
Then to load them (approximately):
top_scores = []
for score_key in redis('ZREVRANGE sorted-scores 0 10'):
    score_str, time_str = redis('GET full-score-%s' %(score_key, )).split(" ")
    top_scores.append((int(score_str), int(time_str)))
top_scores.sort()
This operation could even be done entirely inside Redis (avoiding the network overhead of the O(n) GET operations) using the EVAL command (although I don't know enough Lua to confidently provide an example implementation).
Finally, if you expect a truly huge range of scores (for example, you expect that there will be a large number of scores below 10,000 and an equally large number of scores over 1,000,000), then you could use two sorted sets: scores-below-100000 and scores-above-100000.
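A minimal sketch of that split (the threshold and key names come from the paragraph above; the routing helper is an assumption of this sketch):

require 'redis'

redis = Redis.new
THRESHOLD = 100_000

# Route each score to the sorted set covering its range.
def leaderboard_key(score)
  score < THRESHOLD ? 'scores-below-100000' : 'scores-above-100000'
end

def add_score(redis, score, member)
  redis.zadd(leaderboard_key(score), score, member)
end

# Read the top n overall: drain the high set first, then the low one.
def top(redis, n)
  high = redis.zrevrange('scores-above-100000', 0, n - 1)
  return high if high.length >= n
  high + redis.zrevrange('scores-below-100000', 0, n - high.length - 1)
end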