Efficiently find mapping - SQL

Assume that we have two data sets A and B that have an m-to-n relationship.
A = { k1, k2, k3, ..., kn }
B = { g1, g2, g3, ..., gm }
All the elements in both sets are alphanumeric.
Now, tuples (one element each from set A and set B) are stored in a table T.
For example:
(k1, g2)
(k2, g4)
(k1, g3)
(k4, g2)
...
...
..
(kn, gm)
The challenge is to find out, in the most efficient way, which 'm' elements in set A map to which 'n' elements in set B.
For example, say we have the tuples below:
(k1, g1)
(k1, g2)
(k3, g1)
(k3, g2)
(k5, g1)
(k5, g2)
the output I need is (k1, k3, k5) -> (g1, g2).
As the mapping is m to n, a simple SELECT won't work. Please let me know if you need further clarification.
Since this information is already in a database, I would prefer to get there with some SQL.
Help much appreciated.
Thanks in advance...

You can often solve problems like this with an aggregate and a GROUP BY clause.
For example, if your table name is T and its columns are item1 and item2, then in MySQL:
select T.item1, group_concat(T.item2 order by T.item2 separator ', ') from T group by T.item1
gives you which item2 values each item1 maps to. Then do it again with item1 and item2 swapped to find which item1 values each item2 maps to.
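Pushing the same idea one step further gives the grouping you asked for directly. This is only a sketch, assuming MySQL's GROUP_CONCAT and columns named item1/item2: first build each item1's ordered set of item2 values, then collect the item1's that share exactly the same set.
-- step 1 (inner query): each key's full, ordered set of mapped values
-- step 2 (outer query): group together the keys that share the same set
SELECT GROUP_CONCAT(item1 ORDER BY item1) AS a_elements,
       g_set                              AS b_elements
FROM (
    SELECT item1,
           GROUP_CONCAT(item2 ORDER BY item2) AS g_set
    FROM T
    GROUP BY item1
) AS per_key
GROUP BY g_set;
For the sample tuples above this returns a single row: k1,k3,k5 | g1,g2.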

Store targets as collections that handle logic operation

I think my title is somewhat unclear, but I don't know how else to phrase it.
My problem is:
We have users that belong to groups. There are many types of groups, and every user belongs to exactly one group of each type.
Example: with group types A, B and C, containing respectively the groups (A1, A2, A3), (B1, B2) and (C1, C2, C3),
every user must have a list of groups like [A1, B1, C1] or [A1, B2, C3], but never [A1, A2, B1] or [A1, C2].
We have messages that target certain groups, but not just as a simple union; the targeting can involve more complex collection operations.
Example: we can have a message intended for [A1, B1, C3], [A1, *, *], [A1|A2, *, *] or even ([A1, B1, C2] | [A2, B2, C1])
(* = any group of the type, | = or).
Messages are stored in a SQL DB, and users can retrieve all messages intended for their groups.
How should I store the messages and write my query to reproduce this behavior?
An option could be to encode both the user groups and the message targets in a (big) integer built on the powers of 2, and then base your query on a bitwise AND between user group code and message target code.
The idea is, group 1 is 1, group 2 is 2, group 3 is 4 and so on.
Level 1:
Assumptions:
you know in advance how many group types you have, and you have very few of them
you don't have more than 64 groups per type (assuming you work with 64-bit integers)
the message has only one target: A1|A2,B..,C... is ok, A*,B...,C... is ok, (A1,B1,C1)|(A2,B2,C2) is not.
Solution:
Encode each user group as the corresponding power of 2
Encode each message target as the sum of the allowed values: if groups 1 and 3 are allowed (A1|A3) the code will be 1+4=5, if all groups are allowed (A*) the code will be 2**64-1
you will have a User table and a Message table, and both will have one field for each group type code
The query will be WHERE (u.g1 & m.g1) * (u.g2 & m.g2) * ... * (u.gN & m.gN) <> 0
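As a concrete sketch of Level 1 with three group types (table and column names here are assumptions, not from the original):
-- one code column per group type, on both sides
CREATE TABLE Users    (id INT, g1 BIGINT, g2 BIGINT, g3 BIGINT);
CREATE TABLE Messages (id INT, g1 BIGINT, g2 BIGINT, g3 BIGINT);

-- messages visible to user 42: every group type must share at least one bit
SELECT m.*
FROM Messages m
JOIN Users u ON u.id = 42
WHERE (u.g1 & m.g1) * (u.g2 & m.g2) * (u.g3 & m.g3) <> 0;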
Level 2:
Assumptions:
you have some more group types, and/or you don't know in advance how many there are or how they are composed
you don't have more than 64 groups in total (e.g. 10 for the first type, 12 for the second, ...)
the message still has only one target as above
Solution:
encode each user group and each message target as a single integer, taking care of the offsets: if the first type has 10 groups they will be encoded from 1 to 1023 (2**10-1); then if the second type has 12 groups they will go from 1024 (2**10) to 4194303 (2**(10+12)-1), and so on
you will still have a User table and a Message table, and both will have one single field for the cumulative code
you will need to define a function which is able to check the user group against the message target separately for each range; this can be difficult to do in SQL, and depends on which engine you are using
The following is a Python implementation of both the encoding and the check:
class IdEncoder:
    def __init__(self, sizes):
        self.sizes = sizes
        self.grouplimits = {}
        offset = 0
        for i, size in enumerate(sizes):
            self.grouplimits[i] = (2**offset, 2**(offset + size) - 1)
            offset += size

    def encode(self, vals):
        n = 0
        for i, val in enumerate(vals):
            if val == '*':
                g = self.grouplimits[i][1] - self.grouplimits[i][0] + 1
            else:
                svals = val.split('|')
                g = 0
                for sval in svals:
                    g += 2**(int(sval)-1)
                if i > 0:
                    g *= self.grouplimits[i][0]
            print(g)
            n += g
        return n

    def check(self, user, message):
        res = False
        for i, size in enumerate(self.sizes):
            if user % 2**size & message % 2**size == 0:
                break
            if i < len(self.sizes) - 1:
                user >>= size
                message >>= size
            else:
                res = True
        return res
c = IdEncoder([10,12,10])
m3 = c.encode(['1|2','*','*'])
u1 = c.encode(['1','1','1'])
c.check(u1, m3)   # True
u2 = c.encode(['4','1','1'])
c.check(u2, m3)   # False
Level 3:
Assumptions:
you adopt one of the above solutions, but you need multiple targets for each message
Solution:
You will need a third table, MessageTarget, containing the target code fields as above and a FK linking to the message
The query will search for all the MessageTarget rows compatible with the User group code(s) and show the related Message data
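A sketch of that layout, reusing the Level 1 per-type code columns (again, table and column names are assumptions):
-- each message can now have any number of target rows
CREATE TABLE MessageTarget (
    message_id INT,                 -- FK to the Messages table
    g1 BIGINT, g2 BIGINT, g3 BIGINT
);

-- messages with at least one target compatible with user 42's group codes
SELECT DISTINCT m.*
FROM Messages m
JOIN MessageTarget t ON t.message_id = m.id
JOIN Users u ON u.id = 42
WHERE (u.g1 & t.g1) * (u.g2 & t.g2) * (u.g3 & t.g3) <> 0;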
So you have 3 main tables:
Messages
Users
Groups
You then create 2 relationship tables:
Message-Group
User-Group
If you want to limit users to have access to just "their" messages then you join:
User > User-Group > Message-Group > Message
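As a minimal sketch of that join chain (table and column names assumed):
SELECT DISTINCT m.*
FROM Users u
JOIN UserGroup    ug ON ug.user_id  = u.id
JOIN MessageGroup mg ON mg.group_id = ug.group_id
JOIN Messages     m  ON m.id        = mg.message_id
WHERE u.id = 42;   -- the requesting user
Note that this models plain union-style targeting; the wildcard and OR combinations from the question would need either one Message-Group row per allowed group or an encoding approach like the one described above.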

F# Deedle and Multi Index

I have recently started to learn F# for data science (coming from simple C# and Python), and I am starting to get used to the power of the functional-first paradigm for scientific work.
However, I am still confused about how to treat a problem I could easily solve using pandas in Python. It is related to multi-index time series / data frames. I have looked at Deedle extensively, but I am still not sure whether it can help me achieve a table like this:
Column Index 1:       A              ||       B
Column Index 2:    A1    |    A2     ||    B1    |    B2
Column Index 3:  p1  p2  |  p1  p2   ||  p1  p2  |  p1  p2
Row Index:
date1            0.5 2.  |  2.  0.5  ||  3.  0.  |  2.  3.
date2            ...
The idea is to be able to sum all p1 series when Index1 = A, and so on.
I did not find an example of such a thing using Deedle.
If it is not available, what structure would you recommend for my data?
Thanks for helping a newbie who is in love with F#.
In Deedle, you can create a frame or a series with a hierarchical index by using a tuple as the key:
let ts =
  series
    [ ("A", "A1", "p1") => 0.5
      ("A", "A1", "p2") => 2.
      ("A", "A2", "p3") => 2.
      ("A", "A2", "p4") => 0.5 ]
Deedle does have some special handling for this. For example, it will output the data as:
A A1 p1 -> 0.5
     p2 -> 2
  A2 p3 -> 2
     p4 -> 0.5
To apply aggregation over a part of the hierarchy, you can use the applyLevel function:
ts |> Series.applyLevel (fun (l1, l2, l3) -> l1) Stats.mean
ts |> Series.applyLevel (fun (l1, l2, l3) -> l1, l2) Stats.mean
The first argument is a function that gets the tuple of keys and selects the part of the key you want to group by - so the two calls above aggregate over the top level and the top two levels, respectively.
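For the original question (summing all p1 series when Index1 = A), the same mechanism works. One sketch, assuming the sample series above, is to aggregate by the first and third levels and then read off the ("A", "p1") entry:
// the ("A", "p1") key of the result holds the sum of every p1 value under A
ts |> Series.applyLevel (fun (l1, _, l3) -> l1, l3) Stats.sum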

Find continuity of elements in Pig

How can I find the continuity (run length) of a field and its starting position?
The input is like
A-1
B-2
B-3
B-4
C-5
C-6
The output I want is
A,1,1
B,3,2
C,2,5
Thanks.
Assuming you do not have discontinuous data with respect to a value, you can get the desired results by first grouping on value and using COUNT and MIN to get continuous_counts and start_index respectively.
A = LOAD 'data' USING PigStorage('-') AS (value:chararray, index:int);
B = FOREACH (GROUP A BY value) GENERATE
        group AS value,
        COUNT(A) AS continuous_counts,
        MIN(A.index) AS start_index;
STORE B INTO 'output' USING PigStorage(',');
If your data does have the possibility of being discontinuous, the solution is no longer trivial in native Pig and you might need to write a UDF for that purpose.
Group by value and count the number of rows to get continuous_counts, i.e.
A,1
B,3
C,2
Get the first row (lowest index) for each value, i.e.
A,1
B,2
C,5
Join the above two relations and get the desired output.
A = LOAD 'data.txt' USING PigStorage('-') AS (value:chararray, index:int);
B = GROUP A BY value;
C = FOREACH B GENERATE group AS value, COUNT(A.value) AS continuous_counts;
D = FOREACH B {
        ordered = ORDER A BY index;
        first = LIMIT ordered 1;
        GENERATE FLATTEN(first) AS (value:chararray, index:int);
}
E = JOIN C BY value, D BY value;
F = FOREACH E GENERATE C::value, C::continuous_counts, D::index;
DUMP F;
DUMP F;

Find pairs with small difference in value

In a single table I need to find the pairs of rows for which the values of a certain column differ by at most a given amount. For example, given the following table and a maximum difference of 0.5:
val
---
1
1.2
1.3
4
4.5
6
The desired result would be:
val1 | val2
-----+-----
1 | 1.2
1 | 1.3
1.2 | 1.3
4 | 4.5
The main problem is that my table is gigantic and a cross product is not possible in reasonable time, i.e. this does not work:
SELECT t1.val, t2.val
FROM table t1, table t2
WHERE abs(t1.val - t2.val) <= 0.5
Is there a way to do this? I read up on window functions, so at least I know it is possible to compute, for each entry, the difference in value to the previous entry, obtaining for the example above:
val | diff
----+-----
1 | 0
1.2 | 0.2
1.3 | 0.1
4 | 2.7
4.5 | 0.5
6 | 1.5
From here on I would need to find the ranges where the running sum of diff does not exceed the given maximum. Is this possible? Are there more reasonable approaches?
I'm using Spark.
Thank you.
EDIT: As pointed out, my query would also include symmetric pairs as well as pairs where the values are equal. Sorry for the ambiguity.
However, this is not the point. My problem is the join: the dataset is too large for a cartesian product, and I am looking for a solution which avoids one.
Also, the dataset I'm dealing with has about 1,000,000 tuples. I am not sure what execution time to expect, but it was suggested that there must be a solution which avoids a cartesian product over the data.
Thank you.
What you tried is close. Just a few modifications needed:
select t1.val,t2.val
from tbl t1
join tbl t2 on t2.val-t1.val<=0.5 and t1.val<t2.val
You can generate virtual time-based window:
import org.apache.spark.sql.functions._
import spark.implicits._ // Where spark is an instance of SparkSession
val df = Seq(1.0, 1.2, 1.3, 4.0, 4.5, 6).toDF("val")
val w = window(
  $"val".cast("timestamp"), "1000 milliseconds", "500 milliseconds"
).cast("struct<start:double,end:double>").alias("window")
val windowed = df.select(w, $"val")
Join, filter, and remove duplicates:
val result = windowed.alias("left")
  .join(windowed.alias("right"), "window")
  .where(abs($"left.val" - $"right.val") <= 0.5 && $"left.val" < $"right.val")
  .drop("window").distinct
Result:
result.show
// +---+---+
// |val|val|
// +---+---+
// |1.0|1.2|
// |1.2|1.3|
// |4.0|4.5|
// |1.0|1.3|
// +---+---+
One thing I have been advised to do is to add a bucket column, so that any two possibly matching tuples must be either in the same bucket or in adjacent buckets. That way I can join (equijoin) the table with itself on the bucket and extract the pairs from the result where the condition does indeed hold. I'm not sure if it is a good solution and I have not yet been able to verify it.
/* max difference cannot span more than 2 buckets */
spark.sql("set max_diff=0.001")
var dec_count = 3
var bucket_size = scala.math.pow(10,-1 * dec_count)
var songs_buckets = songs.orderBy(col("artist_familiarity")).withColumn("bucket", round(col("artist_familiarity"), dec_count))
/*
tuples in adjacent buckets can have very close `artist_familiarity`.
add id to avoid duplicate pairs or tuples paired with themselves.
*/
songs_buckets = songs_buckets.withColumn("bucket2", $"bucket" - bucket_size).withColumn("id", monotonically_increasing_id())
songs_buckets.createOrReplaceTempView("songs_buckets")
var tmp = sql("SELECT s1.title as t1, s2.title as t2, s1.artist_familiarity as f1, s2.artist_familiarity as f2, s1.id as id1, s2.id as id2 FROM songs_buckets s1 JOIN songs_buckets s2 ON s1.bucket = s2.bucket OR s1.bucket = s2.bucket2")
tmp.createOrReplaceTempView("tmp")
var result = sql("SELECT t1, t2 FROM tmp WHERE id1 < id2 and f2 - f1 <= ${max_diff}")
result.show()
I haven't bothered to change the variable names back to the example in the question. It displays the first 20 rows of the result after about 12 seconds. I'm not sure if this has something to do with lazy evaluation, because it won't display the count of the result, but it's the best thing I could make work.

Using Pig, how do I parse and compare a grouped item

I have
A B
a, d
a, e
a, y
z, v
z, k
z, o
and so on.
Column B is of type chararray and contains key-value pairs separated by &.
For example - d = 'abc=1&c=1&p=success'
What I want to figure out --
Suppose -
d = 'abc=1&c=1&xyz=23423423'
e = 'xyz=1&it=ssd'
y = 'abc=1&c=1&p=success'
For every 'a' I want to figure out whether its column B values contain the same value of abc, together with c=1 and p=success. I also want to extract the values of abc and c from d and y.
For instance lets take the above example -
d contains abc=1 and c=1
y contains abc=1 and p= success
So this satisfies what I am looking for, i.e. for a given 'a' I have the same value of abc together with c=1 and p=success.
I started with grouping my data :
grouped = GROUP data BY A;
which gives me
a, (a,d)(a,e)(a,y)
z, (z,v)(z,k)(z,o)
But after this I am clueless on how to compare data within each group so that the above condition is satisfied.
Any help on this is appreciated.
Please let me know if you want me to clarify further on my question.
Since you are only concerned with some of the fields in the query string (I assume that's what it is), you will want to split the data with a FOREACH and STRSPLIT (or TOKENIZE), and flatten it so you have something that looks like
(a, b) where b is now a single key/value from the query string, e.g. abc=1.
Filter out the key/value pairs you don't care about, join the survivors back together, and then group by the combined key/value string. That will give you, for each combined string, the list of every a whose b values contain abc=X, c=1 and p=success - roughly as sketched below.
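A rough Pig sketch of that plan (relation and field names are assumptions; it uses TOKENIZE with an explicit '&' delimiter instead of STRSPLIT to get a bag of key/value tokens, and the BagToString built-in from recent Pig versions to re-join them):
data   = LOAD 'input' USING PigStorage(',') AS (a:chararray, b:chararray);

-- one row per (a, key=value) pair
pairs  = FOREACH data GENERATE a, FLATTEN(TOKENIZE(b, '&')) AS kv:chararray;

-- keep only the fields we care about
wanted = FILTER pairs BY kv MATCHES 'abc=.*' OR kv == 'c=1' OR kv == 'p=success';

-- rebuild one ordered, combined string of surviving pairs per a
per_a  = FOREACH (GROUP wanted BY a) {
             srt = ORDER wanted BY kv;
             GENERATE group AS a, BagToString(srt.kv, '&') AS combined;
}

-- a's that end up with the same combined string satisfy the same abc/c/p condition
answer = FOREACH (GROUP per_a BY combined) GENERATE group AS combined, per_a.a AS a_list;
DUMP answer;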