In a single table I need to find those pairs of rows for which the values of a certain column differ by at most a given amount. For example, given the following table and a maximum difference of 0.5:
val
---
1
1.2
1.3
4
4.5
6
The desired result would be:
val1 | val2
-----+-----
1 | 1.2
1 | 1.3
1.2 | 1.3
4 | 4.5
The main problem is that my table is gigantic and a cross product is not possible in reasonable time, i.e. this does not work:
SELECT t1.val, t2.val
FROM table t1, table t2
WHERE abs(t1.val - t2.val) <= 0.5
Is there a way to do this? I read up on window functions, so at least I know it is possible to compute, for each entry, the difference in value to the previous entry, obtaining for the example above:
val | diff
----+-----
1 | 0
1.2 | 0.2
1.3 | 0.1
4 | 2.7
4.5 | 0.5
6 | 1.5
From here on I need to find the ranges where the sum of diff does not exceed the given maximum. Is this possible? Are there more reasonable approaches?
I'm using Spark.
Thank you.
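For reference, the per-entry difference shown above can be computed with a lag window function; here is a minimal sketch in Spark (assuming spark is an existing SparkSession, and using the example values):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, lag, lit}
import spark.implicits._ // Where spark is an instance of SparkSession
val df = Seq(1.0, 1.2, 1.3, 4.0, 4.5, 6.0).toDF("val")
// lag(val, 1) over an ordering by val gives the previous value;
// the first row has no predecessor, so its diff defaults to 0.
val w = Window.orderBy($"val")
val withDiff = df.withColumn("diff", coalesce($"val" - lag($"val", 1).over(w), lit(0.0)))
withDiff.show()
// Note: a window without partitionBy moves all rows to a single partition,
// so this is fine for checking the logic but not for the full dataset.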
EDIT: As pointed out, my query would also include symmetric pairs as well as pairs where the values are equal. Sorry for the ambiguity.
However, this is not the point. My problem is the join. The dataset is too large for a Cartesian product. I am looking for a solution which avoids using one.
Also, the dataset I'm dealing with has 1,000,000 tuples. I am not sure what execution time to expect, but it was suggested that there must be a solution which avoids a Cartesian product on the data.
Thank you.
What you tried is close. Just a few modifications needed:
select t1.val,t2.val
from tbl t1
join tbl t2 on t2.val-t1.val<=0.5 and t1.val<t2.val
You can generate a virtual time-based window:
import org.apache.spark.sql.functions._
import spark.implicits._ // Where spark is an instance of SparkSession
val df = Seq(1.0, 1.2, 1.3, 4.0, 4.5, 6).toDF("val")
val w = window(
  $"val".cast("timestamp"), "1000 milliseconds", "500 milliseconds"
).cast("struct<start:double,end:double>").alias("window")
val windowed = df.select(w, $"val")
Join, filter, and remove duplicates:
val result = windowed.alias("left")
.join(windowed.alias("right"), "window")
.where(abs($"left.val" - $"right.val") <= 0.5 && $"left.val" < $"right.val")
.drop("window").distinct
Result:
result.show
// +---+---+
// |val|val|
// +---+---+
// |1.0|1.2|
// |1.2|1.3|
// |4.0|4.5|
// |1.0|1.3|
// +---+---+
One thing I have been advised to do is to add a bucket column so that any two possibly matching tuples must be either in the same bucket or in adjacent buckets. Then I can join (equijoin) the table with itself on the bucket and extract the tuples from the result where the condition does indeed hold. I'm not sure if it is a good solution and I have not yet been able to verify it.
/* max difference cannot span more than 2 buckets */
spark.sql("set max_diff=0.001")
var dec_count = 3
var bucket_size = scala.math.pow(10,-1 * dec_count)
var songs_buckets = songs.orderBy(col("artist_familiarity")).withColumn("bucket", round(col("artist_familiarity"), dec_count))
/*
tuples in adjacent buckets can have very close `artist_familiarity`.
add id to avoid duplicate pairs or tuples paired with themselves.
*/
songs_buckets = songs_buckets.withColumn("bucket2", $"bucket" - bucket_size).withColumn("id", monotonically_increasing_id())
songs_buckets.createOrReplaceTempView("songs_buckets")
var tmp = sql("SELECT s1.title as t1, s2.title as t2, s1.artist_familiarity as f1, s2.artist_familiarity as f2, s1.id as id1, s2.id as id2 FROM songs_buckets s1 JOIN songs_buckets s2 ON s1.bucket = s2.bucket OR s1.bucket = s2.bucket2")
tmp.createOrReplaceTempView("tmp")
var result = sql("SELECT t1, t2 FROM tmp WHERE id1 < id2 and f2 - f1 <= ${max_diff}")
result.show()
I haven't bothered to change the variable names back to the example in the question. It displays the first 20 rows of the result after about 12 seconds. I'm not sure whether this has something to do with lazy evaluation, because it won't display the count of the result, but it's the best thing I could make work.
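If you want to rule out lazy evaluation as the reason for the timing, one option (just a sketch, reusing the result DataFrame from the code above) is to materialize the whole result once before measuring:
// Cache the result and force full evaluation with count(), which has to
// execute the complete join rather than just the first 20 rows that show() needs.
result.cache()
val n = result.count()
println(s"result rows: $n")
result.show() // now served from the cached data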
I have a table whose structure looks like the following:
k | i | p | v
Notice that the key (k) is not unique and there are no constraints at all. Each key can have multiple attributes (i = 0, 1, 2, ...) which can be of different types (p) and have different values (v). One attribute type may also appear multiple times (p(i-1) = p(i)).
What I want to do is pick certain attribute types and their corresponding values and place them in the same row. For example I want to have:
k | attr_name1 | attr_name2
I have managed to make a query that does this and works for all keys (k) for which attr_name1 and attr_name2 appear in the column p of the initial table:
SELECT DISTINCT ON (key) fn.k AS key, fn.v AS attr_name1, a.v AS attr_name2
FROM Table fn
LEFT JOIN Table a ON fn.k = a.k
AND a.p = 'attr_name2'
WHERE fn.p = 'attr_name1'
I would like, however, to take into account the case where a certain key has no attribute named attr_name1 and insert a NULL value into the corresponding column of the new table. I am not sure how to achieve that. I have no issue using multiple queries or intermediate tables etc, but there are quite a lot of rows in the table and I need something that scales to millions of rows.
Any help would be appreciated.
Example:
k i p v
1 0 a 10
1 1 b 12
1 2 c 34
1 3 d 44
1 4 e 09
2 0 a 11
2 1 b 13
2 2 d 22
2 3 f 34
Would turn into (assuming I am only interested in columns a, b, c):
k a b c
1 10 12 34
2 11 13 NULL
I would use conditional aggregation. That is, an aggregate function around a CASE expression.
SELECT
k,
MAX(CASE WHEN p='a' THEN v END) AS a,
MAX(CASE WHEN p='b' THEN v END) AS b,
MAX(CASE WHEN p='c' THEN v END) AS c
FROM
your_table
GROUP BY
k
This presumes that (k, p) is unique. If there are duplicates, it will simply return the single v with the highest value for each (k, p).
As a general rule this kind of pivoting makes the data harder to process in SQL. This is often done for display purposes because humans find this easier to read. However, from a software engineering perspective, such formatting should not be done in the data layer; be careful that by doing this you don't actually make your future life harder.
I have one table like this:
id status time days ...
1 optimal 60 21
2 optimal 50 21
3 no solution 60 30
4 optimal 21 31
5 no solution 34 12
.
.
.
There are many more rows and columns.
I need to make a query that will return which columns have different information, given two IDs.
Rephrasing it: I'll provide two IDs, for example 1 and 5, and I need to know whether these two rows have any columns with different values. In this case, the result should be something like:
id status time days
1 optimal 60 21
5 no solution 34 12
If I provide IDs 1 and 2, for example, the result should be:
id time
1 60
2 50
The output format doesn't need to be like this; it only needs to show clearly which columns are different and their values.
I can tell you off the bat that processing this data in some sort of programming language will greatly help you out in terms of simplicity and readability for this type of solution, but here is a thread on how it can be done in SQL:
Compare two rows and identify columns whose values are different
If you are looking for a solution in R, here is mine:
df <- read.csv(file = "sf.csv", header = TRUE)
diff.eval <- function(first.id, second.id, eval.df) {
res <- eval.df[c(first.id, second.id), ]
cols <- colnames(eval.df)
for (col in cols) {
if (res[1, col] == res[2, col]) {
res[, col] <- NULL
}
}
return(res)
}
print(diff.eval(1, 5, df))
print(diff.eval(1, 2, df))
You just need to create a data frame out of the table. For convenience I created a .csv locally and imported the data into a data frame.
I have two datasets, data1 and data2:
data data1;
input sn id $;
datalines;
1 a
2 a
3 a
;
run;
data data2;
input id $ sales x $;
datalines;
a 10 x
a 20 y
a 30 z
a 40 q
;
run;
I am merging them with the code below:
data join;
merge data1(in=a) data2(in=b);
by id;
if a and b;
run;
Result (I was expecting an inner-join result, which is not the case):
1 a 10 x
2 a 20 y
3 a 30 z
3 a 40 q
For comparison, here is the proc sql inner join:
proc sql;
select data1.id, sn, sales, x from data2 inner join data1 on data1.id = data2.id;
quit;
Result: (As expected from an inner join)
a 1 10 x
a 1 20 y
a 1 30 z
a 1 40 q
a 2 10 x
a 2 20 y
a 2 30 z
a 2 40 q
a 3 10 x
a 3 20 y
a 3 30 z
a 3 40 q
I want to understand the concept and the step-by-step working of the merge statement in SAS with in=, and how it produces the above result.
PS: I have read this, and it says
An obvious use for these variables is to control what kind of 'merge'
will occur, using if statements. For example, if
ThisRecordIsFromYourData and ThisRecordIsFromOtherData; will make SAS
only include rows that match on the by variables from both input data
sets (like an inner join).
which, I guess, is not always the case (it is not always like an inner join).
Basically, this is a result of the difference in how the SAS data step and SQL process their respective join/merges.
SQL creates a separate record for each possible combination of keys. This is a Cartesian Product (at the key level).
The SAS data step, however, processes merges very differently. MERGE is really nothing more than a special case of SET. It still processes rows iteratively, one at a time - it never goes back, and never has more than one row from any dataset in the PDV at once. Thus, it cannot create a Cartesian product in its normal process - that would require random access, which the SAS data step doesn't do normally.
What it does:
For each unique BY value:
- Take the next record from the left-side dataset, if one exists with that BY value
- Take the next record from the right-side dataset, if one exists with that BY value
- Output a row
- Continue until both datasets are exhausted for that BY value
With BY values that yield unique records per value on either side (or both), it is effectively identical to SQL. However, with BY values that yield duplicates on BOTH sides, you get what you have there: a side-by-side merge, and if one side runs out before the other, the values from the last row of the shorter dataset (for that BY value) are more-or-less copied down. (They're actually RETAINED, so if you overwrite them with changes, they will not reset on new records from the longer dataset.)
So, if left has 3 records and right has 4 records for key value a, like in your example, then you get data from the following records (assuming you don't alter the data after):
left right
1 1
2 2
3 3
3 4
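To make the pairing concrete, here is a rough illustration in Scala (not SAS) of the per-BY-group behaviour described above, assuming both sides have at least one record for the BY value:
// Simplified model of one BY group in a SAS merge: step through both sides
// together; once a side is exhausted, keep re-using its last record (it is retained).
def mergeOneByGroup[L, R](left: Seq[L], right: Seq[R]): Seq[(L, R)] = {
  val n = math.max(left.size, right.size)
  (0 until n).map { i =>
    (left(math.min(i, left.size - 1)), right(math.min(i, right.size - 1)))
  }
}
// 3 left records vs 4 right records for BY value "a":
mergeOneByGroup(Seq(1, 2, 3), Seq(10, 20, 30, 40))
// => Vector((1,10), (2,20), (3,30), (3,40))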
I have a table which contains the edges from node x to node y in a graph.
n1 | n2
-------
a | a
a | b
a | c
b | b
b | d
b | c
d | e
I would like to create a (materialized) view that, for each pair of nodes, gives the number of hops on the shortest path from node x to node y:
n1 | n2 | c
-----------
a | a | 0
a | b | 1
a | c | 1
a | d | 2
a | e | 3
b | b | 0
b | d | 1
b | c | 1
b | e | 2
d | e | 1
How should I model my tables and views to facilitate this? I guess I need some kind of recursion, but I believe that is pretty difficult to accomplish in SQL. I would like to avoid, for example, clients having to fire 10 queries if the path happens to contain 10 nodes/hops.
This works for me, but it's kinda ugly:
WITH RECURSIVE paths (n1, n2, distance) AS (
SELECT
nodes.n1,
nodes.n2,
1
FROM
nodes
WHERE
nodes.n1 <> nodes.n2
UNION ALL
SELECT
paths.n1,
nodes.n2,
paths.distance + 1
FROM
paths
JOIN nodes
ON
paths.n2 = nodes.n1
WHERE
nodes.n1 <> nodes.n2
)
SELECT
paths.n1,
paths.n2,
min(distance)
FROM
paths
GROUP BY
1, 2
UNION
SELECT
nodes.n1,
nodes.n2,
0
FROM
nodes
WHERE
nodes.n1 = nodes.n2
Also, I am not sure how well it will perform on larger datasets. As suggested by Mark Mann, you may want to use a graph library instead, e.g. pygraph.
EDIT: here's a sample with pygraph
from pygraph.algorithms.minmax import shortest_path
from pygraph.classes.digraph import digraph
g = digraph()
g.add_node('a')
g.add_node('b')
g.add_node('c')
g.add_node('d')
g.add_node('e')
g.add_edge(('a', 'a'))
g.add_edge(('a', 'b'))
g.add_edge(('a', 'c'))
g.add_edge(('b', 'b'))
g.add_edge(('b', 'd'))
g.add_edge(('b', 'c'))
g.add_edge(('d', 'e'))
for source in g.nodes():
    tree, distances = shortest_path(g, source)
    for target, distance in distances.iteritems():
        if distance == 0 and not g.has_edge((source, target)):
            continue
        print source, target, distance
Excluding the graph building time, this takes 0.3ms while the SQL version takes 0.5ms.
Expanding on Mark's answer, there are some very reasonable approaches to explore a graph in SQL as well. In fact, they'll be faster than the dedicated libraries in Perl or Python, in that DB indexes will spare you the need to explore the graph.
The most efficient such index (if the graph is not constantly changing) is a nested-tree variation called the GRIPP index. (The linked paper mentions other approaches.)
If your graph is constantly changing, you might want to adapt the nested intervals approach to graphs, in a similar manner to how GRIPP extends nested sets, or simply use floats instead of integers (don't forget to normalize them by casting to numeric and back to float if you do).
Rather than computing these values on the fly, why not create a real table with all interesting pairs along with the shortest-path value? Then, whenever data is inserted, deleted or updated in your data table, you can recalculate all of the shortest-path information. (Perl's Graph module is particularly well-suited to this task, and Perl's DBI interface makes the code straightforward.)
By using an external process, you can also limit the number of recalculations. Using PostgreSQL triggers would cause recalculations to occur on every insert, update and delete, but if you knew you were going to be adding twenty pairs of points, you could wait until your inserts were completed before doing the calculations.
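The answer above points to Perl's Graph module for the recalculation step; purely as an illustration of what that step involves (sketched here in Scala over an in-memory edge list rather than a database table), the hop counts for all reachable pairs can be recomputed with a breadth-first search from each node and then bulk-loaded into the precomputed table:
// All-pairs hop counts by BFS over an edge list; self-loops are ignored
// and every node gets distance 0 to itself (filter those out if unwanted).
def allPairsHops(edges: Seq[(String, String)]): Map[(String, String), Int] = {
  val adj = edges.filter { case (a, b) => a != b }
    .groupBy(_._1).map { case (k, v) => k -> v.map(_._2).distinct }
  val nodes = (edges.map(_._1) ++ edges.map(_._2)).distinct
  nodes.flatMap { source =>
    var dist = Map(source -> 0)
    var frontier = Seq(source)
    var level = 0
    while (frontier.nonEmpty) {
      level += 1
      frontier = frontier.flatMap(n => adj.getOrElse(n, Nil)).distinct.filterNot(dist.contains)
      frontier.foreach(m => dist += (m -> level))
    }
    dist.map { case (target, d) => (source, target) -> d }
  }.toMap
}
val edges = Seq("a" -> "a", "a" -> "b", "a" -> "c", "b" -> "b", "b" -> "d", "b" -> "c", "d" -> "e")
allPairsHops(edges) // e.g. ("a","e") -> 3, ("b","e") -> 2, ("d","e") -> 1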
I am new to R and am trying to find a better solution for accomplishing this fairly simple task efficiently.
I have a data.frame M with 100,000 rows (and many columns, of which 2 are relevant to this problem; I'll call them M1 and M2). I have another data.frame with a column V1 of about 10,000 elements that is essential to this task. My task is this:
For each element in V1, find where it occurs in M2 and pull out the corresponding M1. I am able to do this using a for loop, and it is terribly slow! I am used to MATLAB and Perl, and this is taking forever in R! Surely there's a better way. I would appreciate any valuable suggestions for accomplishing this task...
for (x in 1:length(V$V1)) {
  start[x] = M$M1[M$M2 == V$V1[x]]
}
Only one element will match, so I can use the logical condition to directly get the element into the start vector. How can I vectorize this?
Thank you!
Here is another solution, using the same example as @aix.
M[match(V$V1, M$M2),]
To benchmark performance, we can use the R package rbenchmark.
library(rbenchmark)
f_ramnath = function() M[match(V$V1, M$M2),]
f_aix = function() merge(V, M, by.x='V1', by.y='M2', sort=F)
f_chase = function() M[M$M2 %in% V$V1,] # modified to return full data frame
benchmark(f_ramnath(), f_aix(), f_chase(), replications = 10000)
test replications elapsed relative
2 f_aix() 10000 12.907 7.068456
3 f_chase() 10000 2.010 1.100767
1 f_ramnath() 10000 1.826 1.000000
Another option is to use the %in% operator:
> set.seed(1)
> M <- data.frame(M1 = sample(1:20, 15, FALSE), M2 = sample(1:20, 15, FALSE))
> V <- data.frame(V1 = sample(1:20, 10, FALSE))
> M$M1[M$M2 %in% V$V1]
[1] 6 8 11 9 19 1 3 5
Sounds like you're looking for merge:
> M <- data.frame(M1=c(1,2,3,4,10,3,15), M2=c(15,6,7,8,-1,12,5))
> V <- data.frame(V1=c(-1,12,5,7))
> merge(V, M, by.x='V1', by.y='M2', sort=F)
V1 M1
1 -1 10
2 12 3
3 5 15
4 7 3
If V$V1 might contain values not present in M$M2, you may want to specify all.x=T. This will fill in the missing values with NAs instead of omitting them from the result.