redis - compare 10 million sets with each other - redis

This is the setup that I have going on:
[domain:id] => [keyword_id, keyword_id2, keyword_id3....]
....
What I want to do is for each domain, find other similar domains that contain similar keywords. The way I "measure" similarity between domain:1 and domain:2, for example, is by dividing intersection(domain:1, domain:2) by union(domain:1, domain:2).
The problem is that I have about 5 million domains with each having about a couple hundred keywords on average. Doing this similarity calculation in a nested loop comparing each domain with every other domain would take years on the hardware I have now. I tested this just for one domain:
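keys = redis.keys("domain:*");
foreach (keys as key) {
    // cardinality of the intersection and of the union with domain:1
    long inter = sinterstore("inter_temp", "domain:1", key);
    long union = sunionstore("union_temp", "domain:1", key);
    // cast before dividing, otherwise long / long truncates to 0
    float similarity = (float) inter / union;
    if (similarity > 0.1) {
        similar_domains.add(key);
    }
}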
...
^ and computing similar domains for just this one domain took about 2 minutes. Doing this for 5 million domains would have taken years.
So what could I do? I have no problem moving this program to the most expensive Amazon EC2 instance for an hour once a week to compute it all, and send it back to my host, but would that even help or do I simply have too much data?

Instead of comparing the domains one by one, can't you create a batch of, say, 100 and pass all the keys in that batch to Redis, which will do the union/intersection for you?
For example
SADD domain:1 a b c d e f
SADD domain:2 a c e
SADD domain:3 c e f h
SUNIONSTORE destination domain:1 domain:2 domain:3
will result in the following keys [a, b, c, d, e, f, h] in the destination set
and
SINTERSTORE destination domain:1 domain:2 domain:3
will result in the following keys [c, e] in the destination set
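As a rough illustration of the example above, here is a minimal Python sketch that reproduces the two results with plain in-memory sets (no Redis involved) and computes the intersection-over-union ratio from the question. The `jaccard` helper and the `domains` dictionary are names made up for this sketch. It uses the identity |A ∪ B| = |A| + |B| - |A ∩ B|, so the union never has to be materialized.

```python
def jaccard(a, b):
    """Intersection size divided by union size for two sets."""
    inter = len(a & b)
    # |A ∪ B| = |A| + |B| - |A ∩ B|, so no union set is built
    union = len(a) + len(b) - inter
    return inter / union if union else 0.0

# In-memory stand-ins for the three example Redis sets
domains = {
    "domain:1": {"a", "b", "c", "d", "e", "f"},
    "domain:2": {"a", "c", "e"},
    "domain:3": {"c", "e", "f", "h"},
}

# Equivalent of SUNIONSTORE / SINTERSTORE over all three keys
union_all = set.union(*domains.values())
inter_all = set.intersection(*domains.values())

print(sorted(union_all))   # ['a', 'b', 'c', 'd', 'e', 'f', 'h']
print(sorted(inter_all))   # ['c', 'e']
print(jaccard(domains["domain:1"], domains["domain:2"]))  # 0.5
```

Note that a multi-key union or intersection describes the whole batch at once; the pairwise ratio from the question still needs the two individual sets.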

finding largest number of candidate keys that a relation has?

I am trying to solve this question which has to do with candidate keys in a relation.
This is the question:
Consider table R with attributes A, B, C, D, and E. What is the largest number of
candidate keys that R could simultaneously have?
The answer is 10, but I have no clue how it was derived, nor how the word "simultaneously" comes into play when calculating the answer.
The candidate keys must be sets that are not subsets of one another.
For example {A,B} and {A,B,C} can't be candidate keys simultaneously, because {A,B} is a subset of {A,B,C}.
Combinations of 2 attributes or of 3 attributes generate the maximum number of simultaneous candidate keys.
See how the 3 attributes sets are actually complements of the 2 attributes sets, e.g. {C,D,E} is the complement of {A,B}.
     2-attribute sets   3-attribute sets
 1.  {A,B}          -   {C,D,E}
 2.  {A,C}          -   {B,D,E}
 3.  {A,D}          -   {B,C,E}
 4.  {A,E}          -   {B,C,D}
 5.  {B,C}          -   {A,D,E}
 6.  {B,D}          -   {A,C,E}
 7.  {B,E}          -   {A,C,D}
 8.  {C,D}          -   {A,B,E}
 9.  {C,E}          -   {A,B,D}
10.  {D,E}          -   {A,B,C}
If I took sets of a single attribute I would have only 5 options
{A},{B},{C},{D},{E}
Any set with more than 1 element will contain one of the above and therefore will not qualify.
If I took sets of 4 attributes I would have only 5 options
{A,B,C,D},{A,B,C,E},{A,B,D,E},{A,C,D,E},{B,C,D,E}
Any set with more than 4 elements will contain one of the above and therefore will not qualify.
Any set with fewer than 4 elements will be contained in one of the above and therefore will not qualify.
etc.
For 5 attributes, it is probably best to do this by brute force. Understanding the ideas is more important than the calculation (DuDu/David gives a good example of 10 candidate keys, showing that a set of 10 keys is possible, so the maximum is at least this large).
What is the idea? A candidate key is a combination of attributes that is unique. So, if A is unique, then A with any other column is also unique. One set of candidate keys is simply:
A
B
C
D
E
If each of these is unique, then any combination of attributes is going to contain at least one of them and the combination will also be unique. Hence, the uniqueness of these five would imply the uniqueness of any other combination.
5 is not the largest number of candidate keys with this property.
It gets a bit more complicated. If {A, B, C, D, E} is unique (and no subset is a candidate key), then there is exactly 1 candidate key. Rearranging the columns doesn't change the set (sets are unordered).
One thing we might postulate is that the biggest set of candidate keys has keys all of the same length. This is in fact true. Why? Well, if we have a set of keys that are of different lengths, we can lengthen the shorter ones by adding arbitrary attributes and still have a maximal set.
So, you only need to consider subsets of exactly 1, 2, 3, 4, and 5 attributes. When you work it out, you will find that the maximum numbers are:
5 10 10 5 1
You can add a "1" to the beginning and you may recognize the pattern. This is a row from Pascal's Triangle. This observation (well, and the related proof) actually makes it easy to determine the maximum value for any given n.
Incidentally, the sets of length 3 are:
A B C
A B D
A B E
A C D
A C E
A D E
B C D
B C E
B D E
C D E
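The counts quoted above can be checked with a few lines of Python. This is only a sketch; `ATTRS`, `is_antichain` and `counts` are names invented here. It enumerates all subsets of each size with `itertools.combinations`, confirms that subsets of the same size never contain one another (so they can all be candidate keys at once), and collects how many there are per size.

```python
from itertools import combinations

ATTRS = "ABCDE"

def is_antichain(keys):
    """True if no set in `keys` is a proper subset of another one."""
    return not any(a < b or b < a for a, b in combinations(keys, 2))

counts = []
for k in range(1, len(ATTRS) + 1):
    keys = [frozenset(c) for c in combinations(ATTRS, k)]
    # same-size subsets can never be proper subsets of each other
    assert is_antichain(keys)
    counts.append(len(keys))

print(counts)  # [5, 10, 10, 5, 1]

# Mixing sizes can break the rule: {A,B} is a subset of {A,B,C}
print(is_antichain([frozenset("AB"), frozenset("ABC")]))  # False
```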

Checking if IP falls within a range with Redis

I am interested in using Redis to check if an IP address (converted into an integer) falls within a range of IPs. It is very likely that the ranges will overlap.
I have found this question/answer, although I am not able to fully understand the logic behind it.
Thank you for your help!
EDIT - Since I got a downvote (a comment to explain why would be nice), I've removed some clutter from my answer.
@DidierSpezia's answer in your linked question is a good answer, but it is not trivial (and is expensive) to build and maintain, especially if you are adding/removing ranges.
I have an answer that is easier to maintain, but it could get slow and memory-expensive to compute with many ranges, as it requires cloning a set of all ranges.
You need to save all ranges twice, in two sorted sets. The score of each range will be its lower border in one set and its upper border in the other.
Going with the sets in @DidierSpezia's example:
A 2-8
B 4-6
C 2-9
D 7-10
Your two sets will be:
ZADD ranges:low 2 "2-8" 4 "4-6" 2 "2-9" 7 "7-10"
ZADD ranges:high 8 "2-8" 6 "4-6" 9 "2-9" 10 "7-10"
To query which ranges a value belongs to, you need to trim the ranges whose lower border is higher than the queried value, and trim the ranges whose upper border is lower than it.
The most efficient way I can think of is cloning one of the sets, trimming one of its sides by the rules given above, changing the scores of the ranges to reflect the other border, and then trimming the second side.
Here's how to find the ranges 5 belongs to:
ZUNIONSTORE tmp 1 ranges:low
ZREMRANGEBYSCORE tmp (5 +inf
ZINTERSTORE tmp 2 tmp ranges:high WEIGHTS 0 1
ZREMRANGEBYSCORE tmp -inf (5
ZRANGE tmp 0 -1
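To make the five commands easier to follow, here is a small pure-Python sketch that mimics them step by step on dictionaries instead of Redis sorted sets. The function name `ranges_containing` and the dictionaries `low` and `high` are illustrative stand-ins, not part of any Redis client API.

```python
def ranges_containing(value, low, high):
    """Mimic the sorted-set commands above on plain dicts (member -> score)."""
    # ZUNIONSTORE tmp 1 ranges:low  (clone the lower-border set)
    tmp = dict(low)
    # ZREMRANGEBYSCORE tmp (value +inf  (drop ranges starting after value)
    tmp = {m: s for m, s in tmp.items() if s <= value}
    # ZINTERSTORE tmp 2 tmp ranges:high WEIGHTS 0 1  (swap in upper borders)
    tmp = {m: high[m] for m in tmp if m in high}
    # ZREMRANGEBYSCORE tmp -inf (value  (drop ranges ending before value)
    tmp = {m: s for m, s in tmp.items() if s >= value}
    # ZRANGE tmp 0 -1  (ascending by score, ties broken by member)
    return sorted(tmp, key=lambda m: (tmp[m], m))

low = {"2-8": 2, "4-6": 4, "2-9": 2, "7-10": 7}
high = {"2-8": 8, "4-6": 6, "2-9": 9, "7-10": 10}

print(ranges_containing(5, low, high))  # ['4-6', '2-8', '2-9']
```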
In this discussion, Dvir Volk and @antirez suggested using a sorted set in which each entry represents a range and has the following form:
Member = "min-max" range
Score = max value
For example:
ZADD z 10 "0-10"
ZADD z 20 "10-20"
ZADD z 100 "50-100"
And in order to check if a value falls within a range, you can use ZRANGEBYSCORE and parse the member returned.
For example, to check value 5:
ZRANGEBYSCORE z 5 +inf LIMIT 0 1
This will return the "0-10" member, and you only need to parse the string and validate that your value lies between the two borders.
To check value 25:
ZRANGEBYSCORE z 25 +inf LIMIT 0 1
This will return "50-100", but the value is not within that range.
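The lookup-then-validate logic can be sketched in Python with a sorted list and `bisect` standing in for the sorted set. `find_range` and `entries` are hypothetical names, and the parsing assumes members of the form "min-max" with non-negative integers. Because only the first candidate is inspected, this sketch (like the single ZRANGEBYSCORE call) is reliable only when the ranges do not overlap.

```python
import bisect

def find_range(value, entries):
    """entries: list of (score, member) sorted by score, where score = max."""
    scores = [score for score, _ in entries]
    # first entry with score >= value, like ZRANGEBYSCORE z value +inf LIMIT 0 1
    i = bisect.bisect_left(scores, value)
    if i == len(entries):
        return None
    member = entries[i][1]
    # parse "min-max" and validate that the value really is inside
    lo, hi = (int(part) for part in member.split("-"))
    return member if lo <= value <= hi else None

entries = [(10, "0-10"), (20, "10-20"), (100, "50-100")]

print(find_range(5, entries))   # 0-10
print(find_range(25, entries))  # None
```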

How do I represent this data using Redis?

I want to be able to store data such as "store x is open between 9am and 5pm on Monday but it's only open during 9am and 12pm on Saturday"
What's the best way to store this using redis?
I would later like to query it using something like this. Show me all stores that are open on Saturday at 10:30am
In Redis, like most if not all other NoSQL databases, you want to store your data in the manner that's most suitable for answering the query. There are quite a few ways you can represent this data and answer the query; choosing between them requires knowledge about the other access patterns that you need to support.
However, in the context of this specific question alone, the simplest way of doing that IMO is to use two Sorted Sets for each day of the week. Assuming that stores are open continuously and at most once each day (i.e. no siestas), the members of these Sorted Sets should be the store ids and the scores their opening hours: the first Sorted Set's scores will denote the time that the store opens, whereas the second's will denote the time it closes. For example:
ZADD monday:open 9 store:x
ZADD monday:close 17 store:x
ZADD saturday:open 9 store:x
ZADD saturday:close 12 store:x
Once you have all the Sorted Sets in place, answering the query requires two calls to ZRANGEBYSCORE and intersecting the results. The snippet below demonstrates how to do it using Lua, since doing this in a server-side script will be more efficient than moving the entire thing to the client in most cases.
Note: an alternative approach to doing the intersect in Lua is actually storing the temporary results in Redis' Sets and calling SINTER.
-- helper function to make a "set" out of a table
local function makeset(t)
  local r = {}
  for _, v in ipairs(t) do r[v] = true end
  return(r)
end

-- get opening and closing hours for a given day
local ot = redis.call('ZRANGEBYSCORE', KEYS[1], '-inf', ARGV[1])
local ct = redis.call('ZRANGEBYSCORE', KEYS[2], '(' .. ARGV[1], '+inf')

-- convert to sets and choose the smaller set as s1
local s1 = {}
local s2 = {}
if #ot < #ct then
  s1 = makeset(ot)
  s2 = makeset(ct)
else
  s1 = makeset(ct)
  s2 = makeset(ot)
end

-- intersect s1 and s2
local t = {}
for k in pairs(s1) do
  t[k] = s2[k]
end

-- prepare a response table
local r = {}
for k in pairs(t) do
  r[#r+1] = k
end
return(r)
Run this script by passing to it the two keys and the hour, like so:
redis-cli --eval storehours.lua saturday:open saturday:close , 10.5
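For readers who want to check the logic outside Redis, the same query can be sketched in Python on plain dictionaries. `open_stores`, `saturday_open` and `saturday_close` are names made up for this sketch, and `store:y` is a hypothetical second store added only so the result changes with the hour. As in the script, a store counts as open when its opening score is at or before the hour and its closing score is strictly after it.

```python
def open_stores(opening, closing, hour):
    """Stores whose opening score <= hour and whose closing score > hour."""
    # ZRANGEBYSCORE <day>:open -inf hour
    opened = {store for store, t in opening.items() if t <= hour}
    # ZRANGEBYSCORE <day>:close (hour +inf
    not_closed = {store for store, t in closing.items() if t > hour}
    return sorted(opened & not_closed)

# store:x comes from the example above; store:y is a hypothetical addition
saturday_open = {"store:x": 9, "store:y": 11}
saturday_close = {"store:x": 12, "store:y": 18}

print(open_stores(saturday_open, saturday_close, 10.5))  # ['store:x']
print(open_stores(saturday_open, saturday_close, 12.5))  # ['store:y']
```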

Identifying graphs in heap of connected nodes -- how is this called?

I have a SQL table with three columns X, Y, Z. I need to split it in groups in such a way that all records with same value of X or Y or Z are assigned to the same group. I need to make sure that the records with same value X or Y or Z are never split across multiple groups.
If you think of records as nodes and values of X, Y, Z as edges, this problem is the same as finding all graphs where the nodes in each graph will be connected directly or indirectly via X, Y, or Z-edge, but each graph will have no edges in common with other graphs (otherwise it would be part of the same graph).
A few years ago I knew what this was called and even remembered the algorithm, but now it escapes me. Please tell me what this problem is called so I can Google for a solution. If you know a good algorithm, please point me to it. If you have a SQL implementation, I will marry you :)
Example:
X    Y    Z    BUCKET
---  ---  ---  ------
1    34   56   1
54   43   45   2
1    12   22   1
2    34   11   1
The last row is in bucket 1 because of the value of Y=34 which is the same as of the first row, which is in bucket 1.
It looks less like a graph and more like a simplicial complex.
But if we treat this complex as its skeletal graph (the numbers are treated as vertices, and a row in the table means that all three of its vertices are connected by edges), then we may just use any algorithm to find the connected components of this graph. I'm not sure whether there is a feasible way to do this in SQL though; perhaps it would be more prudent to use a graph database somehow.
However, for this specific problem there may be some easy solution attainable in SQL which I didn't look for.
To find how many nodes are in each group x:
select x, count(x)
from mytable
group by x
Or to find the list of distinct x values:
select distinct x from mytable;
Why don't you initially GROUP BY one of the columns (say X), make buckets, then do the same for Y and Z, each time merging the buckets from the previous step whenever a new group links them?
Repeat the process for X, Y, and Z until the buckets stop changing.
Are you working for linked-in or facebook? :)
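Outside SQL, the connected-components idea mentioned in the answers above can be sketched with a union-find (disjoint-set) structure in Python. This is only one possible approach, not something the answers prescribe. `assign_buckets` is a name invented here, and values are matched per column, so X=34 and Y=34 are treated as unrelated.

```python
def assign_buckets(rows):
    """Give rows sharing a value in the same column the same bucket number."""
    parent = list(range(len(rows)))

    def find(i):
        # follow parent links to the root, halving the path on the way
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    first_row = {}  # (column index, value) -> first row seen with that value
    for i, row in enumerate(rows):
        for col, value in enumerate(row):
            key = (col, value)
            if key in first_row:
                union(i, first_row[key])
            else:
                first_row[key] = i

    labels = {}  # root row -> bucket number, numbered in order of appearance
    return [labels.setdefault(find(i), len(labels) + 1) for i in range(len(rows))]

rows = [(1, 34, 56), (54, 43, 45), (1, 12, 22), (2, 34, 11)]
print(assign_buckets(rows))  # [1, 2, 1, 1]
```

The output matches the BUCKET column of the example table in the question.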

Algorithm - combine multiple lists, resulting in unique list and retaining order

I want to combine multiple lists of items into a single list, retaining the overall order requirements. i.e.:
1: A C E
2: D E
3: B A D
result: B A C D E
Above, starting with list 1, we have A C E. From list 2 we then know that D must come before E, and from list 3 we know that B must come before A, and that D must come after B and A.
If there are conflicting orderings, the first ordering should be used. i.e.
1: A C E
2: B D E
3: F D B
result: A C F B D E
List 3 conflicts with list 2 (B D vs D B), therefore the requirements of list 2 will be used.
If ordering requirements mean an item must come before or after another, it doesn't matter if it comes immediately before or after, or at the start or end of the list, as long as overall ordering is maintained.
This is being developed using VB.Net, so a LINQy solution (or any .Net solution) would be nice - otherwise pointers for an approach would be good.
Edit: Edited to make example 2 make sense (a last minute change had made it invalid)
The keyword you are probably interested in is "Topological sorting". The solution based on that would look as follows:
1. Create an empty directed graph.
2. Process the sequences in order; for each two consecutive elements X,Y in a sequence, add an edge X->Y to the graph, unless this would form a cycle.
3. Perform a topological sort on the vertices of the graph. The resulting sequence should satisfy your requirements.
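The steps above could be sketched as follows. Python is used here only for illustration (the question asks about .NET, but the structure carries over). `merge_ordered` and `has_path` are names invented for this sketch, and the tie-break between items that are ready at the same time (earliest first appearance wins) is an assumption of mine, not part of the answer.

```python
import heapq

def has_path(graph, src, dst):
    """Depth-first search: is dst reachable from src?"""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return False

def merge_ordered(lists):
    graph, first_seen = {}, {}
    for seq in lists:
        for item in seq:
            first_seen.setdefault(item, len(first_seen))
            graph.setdefault(item, set())
        for x, y in zip(seq, seq[1:]):
            # skip an edge that would close a cycle: earlier lists win
            if x != y and not has_path(graph, y, x):
                graph[x].add(y)

    # Kahn's algorithm; ties go to the item that appeared first (assumption)
    indegree = {node: 0 for node in graph}
    for successors in graph.values():
        for s in successors:
            indegree[s] += 1
    ready = [(first_seen[n], n) for n in graph if indegree[n] == 0]
    heapq.heapify(ready)
    result = []
    while ready:
        _, node = heapq.heappop(ready)
        result.append(node)
        for s in graph[node]:
            indegree[s] -= 1
            if indegree[s] == 0:
                heapq.heappush(ready, (first_seen[s], s))
    return result

print(merge_ordered([["A", "C", "E"], ["D", "E"], ["B", "A", "D"]]))
# ['B', 'A', 'C', 'D', 'E']
```

A topological order is generally not unique, so for other inputs the result may differ from a hand-built answer while still respecting every edge that was kept.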