FYI: Redis n00b.
I need to store search terms in my web app.
Each term will have two attributes: "search_count" (integer) and "last_searched_at" (time)
Example I've tried:
Redis.hset("search_terms", term, {count: 1, last_searched_at: Time.now})
I can think of a few different ways to store them, but no good ways to query on the data. The report I need to generate is a "top search terms in last 30 days". In SQL this would be a where clause and an order by.
How would I do that in Redis? Should I be using a different data type?
Thanks in advance!
I would consider two sorted sets.
When a search term is submitted, get the current timestamp and:
zadd timestamps timestamp term
zincrby counts 1 term
The above two operations should be performed atomically (e.g. wrapped in MULTI/EXEC).
Then to find all terms in the given time interval timestamp_from, timestamp_to:
zrangebyscore timestamps timestamp_from timestamp_to
After you get these, loop over them and fetch their counts from the counts set.
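A minimal sketch of this first approach in Ruby (assuming the redis-rb gem; the helper names and the 30-day window are just illustrative):

require 'redis'

redis = Redis.new

# record a search atomically: one sorted set keyed by last-seen timestamp,
# one keyed by cumulative count
def record_search(redis, term)
  now = Time.now.to_i
  redis.multi do |r|
    r.zadd 'timestamps', now, term
    r.zincrby 'counts', 1, term
  end
end

# report: terms seen in the interval, with their counts, sorted by count
def top_terms(redis, from_ts, to_ts)
  terms = redis.zrangebyscore('timestamps', from_ts, to_ts)
  terms.map { |t| [t, redis.zscore('counts', t).to_i] }
       .sort_by { |_, count| -count }
end

record_search(redis, 'redis sorted sets')
p top_terms(redis, Time.now.to_i - 30 * 24 * 3600, Time.now.to_i)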
Alternatively, I am curious whether you can use zunionstore. Here is my test in Ruby:
require 'redis'

KEYS  = %w(counts timestamps results)
TERMS = %w(test0 keyword1 test0 test1 keyword1 test0 keyword0 keyword1 test0)

def redis
  @redis ||= Redis.new
end

def timestamp
  (Time.now.to_f * 1000).to_i
end

redis.del KEYS

TERMS.each {|term|
  redis.multi {|r|
    r.zadd 'timestamps', timestamp, term
    r.zincrby 'counts', 1, term
  }
  sleep rand
}

# weight counts so heavily that they dominate; timestamps only break ties
redis.zunionstore 'results', ['timestamps', 'counts'], weights: [1, 1e15]

KEYS.each {|key|
  p [key, redis.zrange(key, 0, -1, withscores: true)]
}

# top 2 terms
p redis.zrevrangebyscore 'results', '+inf', '-inf', limit: [0, 2]
EDIT: at some point you would need to clear the counts set. Something similar to what @Eli proposed (https://stackoverflow.com/a/16618932/410102).
Depends on what you want to optimize for. Assuming you want to be able to run that query very quickly and don't mind expending some memory, I'd do this as follows.
Keep a key for every second you see some search (you can go more or less granular if you like). The key should point to a hash of $search_term -> $count where $count is the number of times $search_term was seen in that second.
Keep another key for every time interval (we'll call this $time_int_key) over which you want data (in your case, this is just one key where your interval is the last 30 days). This should point to a sorted set where the items in the set are all of your search terms seen over the last 30 days, and the score they're sorted by is the number of times they were seen in the last 30 days.
Have a background worker that every second grabs the key for the second that occurred exactly 30 days ago and loops through the hash attached to it. For every $search_term in that key, it should subtract the $count from the score associated with that $search_term in $time_int_key.
This way, you can just use ZREVRANGE $time_int_key 0 $m-1 to grab the top m searches (add WITHSCORES if you want the number of times each was searched) in O(log(N)+m) time. That's more than cheap enough to be able to run as frequently as you want in Redis for just about any reasonable m and to always have that data updated in real time.
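A hedged Ruby sketch of this scheme (assuming redis-rb; the key names, helper names and bucket TTL are made up):

require 'redis'

redis  = Redis.new
WINDOW = 30 * 24 * 3600   # 30 days in seconds

# on every search: bump the per-second hash and the rolling 30-day sorted set
def log_search(redis, term, now = Time.now.to_i)
  redis.multi do |r|
    r.hincrby "searches:#{now}", term, 1
    r.expire  "searches:#{now}", WINDOW + 60   # let Redis discard old buckets eventually
    r.zincrby 'top:last30days', 1, term
  end
end

# background worker, run every second: age out the bucket that just fell
# outside the window by subtracting its counts from the rolling set
def expire_old_bucket(redis, now = Time.now.to_i)
  bucket = "searches:#{now - WINDOW}"
  redis.hgetall(bucket).each do |term, count|
    redis.zincrby 'top:last30days', -count.to_i, term
  end
  redis.del bucket
end

# top 10 searches in the last 30 days, with counts
p redis.zrevrange('top:last30days', 0, 9, withscores: true)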
Related
I have implemented a leaderboard using sorted sets in Redis. I want users with the same score to be ordered chronologically, i.e., the user who came first should be ranked higher. Currently Redis orders ties lexicographically. Is there a way to override that? Mobile numbers are being used as members in the sorted set.
One solution I thought of is prepending a timestamp to the mobile numbers and maintaining a hash that maps each mobile number to its timestamp.
$redis.hset('mobile_time', '1234567890', "#{Time.now.strftime('%y%m%d%H%M%S')}")
pref = $redis.hget('mobile_time', '1234567890')
$redis.zadd('myleaderboard', 400, "#{pref}1234567890")
That way I can get the rank for a given user at any instant by prepending the prefix from the hash.
Now this is not exactly what I want: it returns the opposite ordering. A user who comes earlier will be placed below a user who comes later (both with the same score).
Key for user1 = 2012101219532301234567890, score: 400
Key for user2 = 2012101212532609313123523, score: 400 (3 seconds later)
If I use zrevrangebyscore, user2 will be placed higher than user1.
However, there's a way to get the desired rank:
users_with_higher_score_count = $redis.zcount("mysset", "(400", "+inf")
users_with_same_score = $redis.zrangebyscore("mysset", "400", "400")
Now I have the list users_with_same_score in the correct order. Looking at the index, I can calculate the rank of the user.
To get the leaderboard, I can fetch members in intervals of 50 and order them in Ruby code, but that doesn't seem like a good approach.
I want to know if there's a better approach, or any improvements that can be made to the solution I proposed.
Thanks in advance for your help.
P.S. Scores are in multiples of 50
The score in a sorted set is a double-precision floating-point number, so possibly a better solution would be to store the Redis score as highscore.timestamp.
e.g. (pseudocode)
highscore = 100
timestamp = now()
redis.zadd('myleaderboard', highscore + '.' + timestamp, playerId)
This would mean that multiple players who achieved the same high score will also be sorted based on the time they achieved it, as in the following:
For player 1...
redis.zadd('myleaderboard', '100.1362345366', "Charles")
For player 2...
redis.zadd('myleaderboard', '100.1362345399', "Babbage")
See this question for more detail: Unique scoring for redis leaderboard
The external weights feature of the SORT command is your saviour here:
SORT mylist BY weight_*
http://redis.io/commands/sort
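A small illustration of the mechanics with redis-rb (the players list and the weight_* keys are made up for the example):

require 'redis'

redis = Redis.new

redis.rpush 'players', %w[alice bob carol]
redis.mset  'weight_alice', 2, 'weight_bob', 1, 'weight_carol', 3

# sort the list by the external weight_* keys instead of by the values themselves
p redis.sort('players', by: 'weight_*')                  # => ["bob", "alice", "carol"]
p redis.sort('players', by: 'weight_*', order: 'DESC')   # highest weight first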
If you are displaying the leaderboard in descending order of score, then I don't think the above solution will work. Instead of just appending the timestamp to the score, you should append Long.MAX_VALUE - System.nanoTime(). So your final scoring code should look like:
highscore = 100
timestamp = Long.MAX_VALUE - System.nanoTime();
redis.zadd('myleaderboard', highscore + '.' + timestamp, playerId);
Now you will get the correct order when you call redis.zrevrange('myleaderboard', startIndex, endIndex)
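A rough, hedged Ruby adaptation of the same idea with redis-rb: it packs seconds against an arbitrary ceiling rather than Long.MAX_VALUE minus nanoseconds, since a double score can't hold that many digits; the constant and key names are made up.

require 'redis'

redis = Redis.new

TS_CEILING = 10_000_000_000   # any constant comfortably above current Unix time

def composite_score(high_score, at = Time.now)
  # earlier time => larger fractional part => higher rank among equal high scores
  # note: for very large high scores the tie-break loses resolution (a double has ~15-16 significant digits)
  high_score + (TS_CEILING - at.to_i) / TS_CEILING.to_f
end

redis.zadd 'myleaderboard', composite_score(100, Time.at(1_362_345_366)), 'Charles'
redis.zadd 'myleaderboard', composite_score(100, Time.at(1_362_345_399)), 'Babbage'

# Charles reached 100 first, so he now appears above Babbage
p redis.zrevrange('myleaderboard', 0, -1, withscores: true)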
I'm trying to group a series of records in ActiveRecord so I can do some calculations to normalize the quantity attribute of each record. For example:
A user enters a date and a quantity. Dates are not unique, so I may have 10-20 quantities for each date. I need to work with only the totals for each day, not every individual record, because after determining the highest and lowest values I convert each one by basically dividing by n, which is usually 10.
This is what I'm doing right now:
def heat_map(project, word_count, n_div)
  return "freezing" if word_count == 0
  words = project.words
  counts = words.map(&:quantity)
  max = counts.max
  min = counts.min
  return "max" if word_count == max
  return "min" if word_count == min
  break_point = (max - min).to_f / n_div.to_f
  heat_index = (((word_count - min).to_f) / break_point).to_i
end
This works great if I display a table of all the word counts, but I'm trying to apply the heat map to a calendar that displays running totals for each day. This obviously doesn't total the days, so I end up with numbers that are out of the normal scale.
I can't figure out a way to group the word counts and total them by day before I do the normalization. I tried doing a group_by and then adding the map call, but I got an undefined method error. Any ideas? I'm also open to better/cleaner ways of normalizing the word counts.
Hard to answer without knowing a bit more about your models. So I'm going to assume that the date you're interested in is just the created_at date in the words table. I'm assuming that you have a field in your words table called word where you store the actual word.
I'm also assuming that you might have multiple entries for the same word (possibly with different quantities) in the one day.
So, this will give you an ordered hash of counts of words per day:
project.words.group('DATE(created_at)').group('word').sum('quantity')
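And if you only need day-level totals (ignoring which word each quantity came from), a variation of the same query should give you numbers you can normalize directly; this is a guess based on the same assumed models:

# per-day totals: { "2012-10-01" => 340, "2012-10-02" => 512, ... }
daily_totals = project.words.group('DATE(created_at)').sum('quantity')

# normalize against these day-level values instead of the per-record quantities
counts = daily_totals.values
max    = counts.max
min    = counts.min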
If those guesses make no sense, then perhaps you can give a bit more detail about the structure of your models.
I am trying to build leaderboards in Redis and be able to get top X scores and retrieve a rank of user Y.
Sorted sets in Redis look like an easy fit, except for one problem: I need scores to be sorted not only by the actual score, but also by date (so whoever got the same score earlier will be on top). The SQL query would be:
select * from scores order by score desc, date asc
Running zrevrange on a sorted set in Redis uses something like:
select * from scores order by score desc, key desc
Which would put users with lexicographically bigger keys above.
One solution I can think of is making some manipulations with a score field inside a sorted set to produce a combined number that consists of a score and a timestamp.
For example for a score 555 with a timestamp 111222333 the final score could be something like 555.111222333 which would put newer scores above older ones (not exactly what I need but could be adjusted further).
This would work, but only on small numbers, as a score in a sorted set has only 16 significant digits, so 10 of them will be wasted on a timestamp right away leaving not much room for an actual score.
Any ideas how to make a sorted set arrange values in a correct order? I would really want an end result to be a sorted set (to easily retrieve user's rank), even if it requires some temporary structures and sorts to build such set.
Actually, all my previous answers are terrible. Disregard all my previous answers (although I'm going to leave them around for the benefit of others).
This is how you should actually do it:
Store only the scores in the zset
Separately store a list of each time a player achieved that score.
For example:
score_key = <whatever unique key you want to use for this score>
redis('ZADD scores-sorted %s %s' %(score, score))
redis('RPUSH score-%s %s' %(score, score_key))
Then to read the scores:
top_score_keys = []
for score in redis('ZREVRANGE scores-sorted 0 10'):
    for score_key in redis('LRANGE score-%s 0 -1' %(score, )):
        top_score_keys.append(score_key)
Obviously you'd want to do some optimizations there (e.g., only reading chunks of the score-* list instead of reading the entire thing).
But this is definitely the way to do it.
User rank would be straightforward: for each user, keep track of their high score:
redis('SET highscores-%s %s' %(user_id, user_high_score))
Then determine their rank using:
user_high_score = redis('GET highscores-%s' %(user_id, ))
score_rank = int(redis('ZREVRANK scores-sorted %s' %(user_high_score, )))
score_rank += int(redis('LPOS score-%s %s' %(user_high_score, score_key)))  # position of this user's score_key in the per-score list (LPOS needs Redis 6.0.6+)
It's not really the perfect solution, but if you use a custom epoch that is closer to the current time, then you need fewer digits to represent it.
For instance, if you use January 1, 2012 as your epoch, you would (currently) only need 8 digits to represent the timestamp.
Here's an example in Ruby:
(Time.now - Time.new(2012, 1, 1, 0, 0, 0)).to_i
This would give you about 3 years before the timestamp would require 9 digits, at which time you could perform some maintenance to move the custom epoch forward again.
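In Ruby, packing that shortened timestamp into the fractional part of a score might look something like this (a hedged sketch with redis-rb; the key, member and zero-padding width are just illustrative choices):

require 'redis'

redis = Redis.new

CUSTOM_EPOCH = Time.new(2012, 1, 1)

def shortened_timestamp(at = Time.now)
  (at - CUSTOM_EPOCH).to_i   # seconds since the custom epoch: fewer digits than a full Unix timestamp
end

score  = 555
member = 'player:42'         # illustrative member key

# zero-pad to a fixed width so the fractional parts stay comparable as the digit count grows
payload = "#{score}.#{shortened_timestamp.to_s.rjust(9, '0')}".to_f

redis.zadd 'scores', payload, member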
I would however love to hear if anyone has a better idea, since I have the exact same problem.
(Note: this answer is almost certainly suboptimal; see https://stackoverflow.com/a/10575370/71522)
Instead of using a timestamp in the score, you could use a global counter. For example:
score_key = <whatever unique key you want to use for this score>
score_number = redis('INCR global-score-counter')
redis('ZADD sorted-scores %s.%s %s' %(score, score_number, score_key))
And to sort them in descending order, pick a large score count (1<<24, say), use that as the initial value of global-score-counter, then use DECR instead of INCR.
(this would also apply if you are using a timestamp)
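A rough Ruby rendering of the descending-counter variant (redis-rb; score and score_key are placeholders, and the packing assumes the counter keeps a fixed digit width):

require 'redis'

redis = Redis.new

score     = 555
score_key = 'player:42'   # whatever unique key you use for this score entry

redis.setnx 'global-score-counter', 1 << 24         # initialise the counter once
score_number = redis.decr('global-score-counter')   # earlier entries get larger numbers

# earlier submissions get a larger fractional part, so among equal scores
# they come out first under ZREVRANGE
redis.zadd 'sorted-scores', "#{score}.#{score_number}".to_f, score_key
p redis.zrevrange('sorted-scores', 0, 9, withscores: true)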
Alternately, if you are really worried about the number of players, you could use a per-score counter:
score_key = <whatever unique key you want to use for this score>
score_number = redis('HINCRBY score-counter %s 1' %(score, ))
redis('ZADD sorted-scores %s.%s %s' %(score, score_number, score_key))
(Note: this answer is almost certainly suboptimal; see https://stackoverflow.com/a/10575370/71522)
A couple thoughts:
You could make some assumptions about the timestamps to make them smaller. For example, instead of storing Unix timestamps, you could store "number of minutes since May 13, 2012" (for example). In exchange for seven significant digits, this would let you store times for the next 19 years.
Similarly, you could reduce the number of significant digits in the scores. For example, if you expect scores to be in the 7-digit range, you could divide them by 10, 100, or 1000 when storing them in the sorted list, then use the results of the sorted list to access the actual scores, sorting those at the application level.
For example, using both of the above (in potentially buggy pseudo-code):
score_small = int(score / 1000)
time_small = int((time - 1336942269) / 60)
score_key = uuid()
redis('SET full-score-%s "%s %s"' %(score_key, score, time))
redis('ZADD sorted-scores %s.%s %s' %(score_small, time_small, score_key))
Then to load them (approximately):
top_scores = []
for score_key in redis('ZRANGE sorted-scores 0 10'):
    score_str, time_str = redis('GET full-score-%s' %(score_key, )).split(" ")
    top_scores.append((int(score_str), int(time_str)))
top_scores.sort()
This operation could even be done entirely inside Redis (avoiding the network overhead of the O(n) GET operations) using the EVAL command (although I don't know enough Lua to confidently provide an example implementation).
Finally, if you expect a truly huge range of scores (for example, you expect that there will be a large number of scores below 10,000 and an equally large number of scores over 1,000,000), then you could use two sorted sets: scores-below-100000 and scores-above-100000.
In my postgres database, I have the following relationships (simplified for the sake of this question):
Objects (currently has about 250,000 records)
-------
n_id
n_store_object_id (references store.n_id, 1-to-1 relationship, some objects don't have store records)
n_media_id (references media.n_id, 1-to-1 relationship, some objects don't have media records)
Store (currently has about 100,000 records)
-----
n_id
t_name,
t_description,
n_status,
t_tag
Media
-----
n_id
t_media_path
So far, so good. When I need to query the data, I run this (note the limit 2 at the end, as part of the requirement):
select
o.n_id,
s.t_name,
s.t_description,
me.t_media_path
from
objects o
join store s on (o.n_store_object_id = s.n_id and s.n_status > 0 and s.t_tag is not null)
join media me on o.n_media_id = me.n_id
limit
2
This works fine and gives me two entries back, as expected. The execution time on this is about 20 ms - just fine.
Now I need to get 2 random entries every time the query runs. I thought I'd add order by random(), like so:
select
o.n_id,
s.t_name,
s.t_description,
me.t_media_path
from
objects o
join store s on (o.n_store_object_id = s.n_id and s.n_status > 0 and s.t_tag is not null)
join media me on o.n_media_id = me.n_id
order by
random()
limit
2
While this gives the right results, the execution time is now about 2,500 ms (over 2 seconds). This is clearly not acceptable, as it's one of a number of queries to be run to get data for a page in a web app.
So, the question is: how can I get random entries, as above, but still keep the execution time within some reasonable amount of time (i.e. under 100 ms is acceptable for my purpose)?
Of course it needs to sort the whole result set by the random criterion before returning the first rows. Maybe you can work around that by using random() in an OFFSET instead?
Here's some previous work done on the topic which may prove helpful:
http://blog.rhodiumtoad.org.uk/2009/03/08/selecting-random-rows-from-a-table/
I'm thinking you'll be better off selecting random objects first, then performing the join to those objects after they're selected. I.e., query once to select random objects, then query again to join just those objects that were selected.
It seems like your problem is this: You have a table with 250,000 rows and need two random rows. Thus, you have to generate 250,000 random numbers and then sort the rows by their numbers. Two seconds to do this seems pretty fast to me.
The only real way to speed up the selection is to avoid generating 250,000 random numbers and instead look up rows through an index.
I think you'd have to change the table schema to optimize for this case. How about something like:
1) Create a new column with a sequence starting at 1.
2) Every row will then have a number.
3) Create an index on: number % 1000
4) Query for rows where number % 1000 is equal to a random number between 0 and 999 (this should hit the index and load a random portion of your database).
5) You can probably then add RANDOM() to your ORDER BY clause and it will then just sort that chunk of your database and be 1,000x faster.
6) Then select the first two of those rows.
If this still isn't random enough (since rows will always be paired having the same "hash"), you could probably do a union of two random rows, or have an OR clause in the query and generate two random keys.
Hopefully something along these lines could be very fast and decently random.
Hi, I have indexed simple documents with 2 fields:
1. profileId as a long
2. profileAttribute as a long
I need to know how many profileIds have a certain set of attributes.
For example, I index:
doc1: profileId = 1, profileAttribute = 55
doc2: profileId = 1, profileAttribute = 57
doc3: profileId = 2, profileAttribute = 55
I want to know how many profiles have both attribute 55 and attribute 57. In this example the answer is 1, because only profileId 1 has both attributes.
Thanks in advance for your help.
You can search for profileAttribute:(57 OR 55) and then iterate over the results and put their profileId property in a set in order to count the total number of unique profileIds.
But you need to know that Lucene will perform poorly at this compared to, say, an RDBMS. This is because Lucene is an inverted index, meaning it is very good at retrieving the top documents that match a query. It is, however, not very good at iterating over the stored fields of a large number of documents.
However, if profileId is single-valued and indexed, you can get its values using Lucene's FieldCache, which saves you costly disk accesses. The drawback is that this FieldCache will use a lot of memory (depending on the size of your index) and take time to load every time you (re-)open your index.
If changing the index format is acceptable, this solution can be improved by making profileIds unique; your index would then have the following format:
doc1: profileId: [1], profileAttribute: [55, 57]
doc2: profileId: [2], profileAttribute: [55]
The difference is that profileIds are unique and profileAttribute is now a multi-valued field. To count the number of profileIds for a given set of profileAttribute values, you now only need to query for those values as a conjunction (e.g. profileAttribute:(+55 +57)) and use a TotalHitCountCollector.