One-dimensional earth mover's distance in BigQuery/SQL

Let P and Q be two finite probability distributions on integers, with support between 0 and some large integer N. The one-dimensional earth mover's distance between P and Q is the minimum cost you have to pay to transform P into Q, given that it costs r*|n-m| to "move" a probability mass r from integer n to another integer m.
There is a simple algorithm to compute this. In pseudocode:
previous = 0
sum = 0
for i from 0 to N:
    previous = P(i) - Q(i) + previous
    sum = sum + abs(previous)  // abs = absolute value
return sum
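For example, if P puts all its mass on 0 and Q puts all its mass on 5, then previous equals 1 for i = 0 through 4 and 0 from i = 5 onward, so the loop returns 5, which is exactly the cost of moving probability 1 a distance of 5.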
Now, suppose you have two tables that each contain a probability distribution. Column n contains integers, and column p contains the corresponding probability. The tables are correct (all probabilities are between 0 and 1, and their sum is 1). I want to compute the earth mover's distance between these two tables in BigQuery (Standard SQL).
Is it possible? I feel like one would need to use analytical functions, but I don't have much experience with them, so I don't know how to get there.
What if N (the maximum integer) is very large, but my tables are not? Can we adapt the solution to avoid doing a computation for each integer i?

Hopefully I fully understand your problem. This seems to be what you're looking for:
WITH Aggr AS (
  SELECT
    rp.n AS n,
    SUM(rp.p - rq.p)
      OVER (ORDER BY rp.n ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS emd
  FROM P rp
  LEFT JOIN Q rq
    ON rp.n = rq.n
)
SELECT SUM(ABS(a.emd)) AS total_emd
FROM Aggr a;
Regarding question #2, note that we only scan what's actually in the tables, regardless of N, assuming a one-to-one match for every n in P with an n in Q.

I adapted Michael's answer to fix its issues; here's the solution I ended up with. Suppose the integers are stored in column i and the probabilities in column p. First I join the two tables, then I compute the EMD contribution for each i using a window, and finally I sum all the absolute values.
WITH
joined_table AS (
  SELECT
    IFNULL(table1.i, table2.i) AS i,
    IFNULL(table1.p, 0) AS p,
    IFNULL(table2.p, 0) AS q
  FROM table1
  FULL OUTER JOIN table2
    ON table1.i = table2.i
),
aggr AS (
  SELECT
    -- cumulative difference of the two distributions up to i, times the gap to the next support point
    (SUM(p - q) OVER win) * ((LEAD(i) OVER (ORDER BY i)) - i) AS emd
  FROM joined_table
  WINDOW win AS (
    ORDER BY i
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  )
)
SELECT SUM(ABS(emd)) AS total_emd
FROM aggr
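As a quick sanity check (not part of the original answer), the same query can be run against two tiny inline distributions: table1 puts all its mass on 0 and table2 puts all its mass on 5, so the expected distance is 5.
WITH
table1 AS (SELECT 0 AS i, 1.0 AS p),
table2 AS (SELECT 5 AS i, 1.0 AS p),
joined_table AS (
  SELECT
    IFNULL(table1.i, table2.i) AS i,
    IFNULL(table1.p, 0) AS p,
    IFNULL(table2.p, 0) AS q
  FROM table1
  FULL OUTER JOIN table2
    ON table1.i = table2.i
),
aggr AS (
  SELECT
    (SUM(p - q) OVER win) * ((LEAD(i) OVER (ORDER BY i)) - i) AS emd
  FROM joined_table
  WINDOW win AS (ORDER BY i ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
)
SELECT SUM(ABS(emd)) AS total_emd  -- returns 5.0
FROM aggr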

Related

Calculate a specific moving average using sql query

Consider that I have a table with one column "A" and I would like to create another column called "B" such that
B[i] = 0.2*A[i] + 0.8*B[i-1]
where B[0]=0.
My problem is that I cannot use the OVER() function because I want to use the values in B while I am trying to construct B. Any idea would be appreciated. Thanks
This is a rather complex mathematical exercise. You want to accumulate exponentially decreasing amounts from previous rows.
It is a little confusing because the amount going in on each row is 20%, but that is just a factor in the formula.
In any case, this seems to do what you want:
select x.*,
       sum(power(0.8, -n) * a * 0.2) over (order by id) / power(0.8, -n) as b
from (select t.*,
             row_number() over (order by id) - 1 as n
      from t
     ) x;
Here is a db<>fiddle using Postgres.
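To see why this works, unroll the recurrence: B[i] = 0.2*A[i] + 0.2*0.8*A[i-1] + 0.2*0.8^2*A[i-2] + ..., which factors as 0.8^n times the sum, over the rows so far, of 0.2*A/0.8^m (where m is the zero-based row number of each earlier row and n that of the current row). The windowed SUM computes exactly that inner sum, and dividing by power(0.8, -n) multiplies it back by 0.8^n, which recovers B[i].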

Get distance and duration to closest matrix in SQL

I have logic to find the most optimised way to perform deliveries.
Let's say I have locations A, B and C. I need the distance and duration from A to B, B to A, A to C, C to A, B to C and C to B.
I know how to come up with that query. An example result would be NewMatrix in the fiddle.
http://sqlfiddle.com/#!6/9cce7/1
I have a table where I store the current matrix we have based on past deliveries (AppMatrix in the fiddle above).
So I need to look up distance and duration in this table, finding the closest matching origin and destination. I have created the following function, which works just fine to get my answer:
SELECT TOP 1 Distance, ([Time]/60) as Duration FROM [AppMatrix]
ORDER BY ABS([OriginSiteLat] - @OriginLat) + ABS([OriginSiteLng] - @OriginLong)
        ,ABS([DestSiteLat] - @DestinationLat) + ABS([DestSiteLng] - @DestinationLong)
The problem is slowness, since I need to perform this call for each pair (I can have 700 different deliveries in a day, and 700*700 = 490,000 pairs); it's just too slow - it takes a few hours to return the result.
I'm working on how best to limit the data, but any advice on how to optimize performance is appreciated. Maybe advice on how to use spatial features here would help.
This is my current code :
SELECT * FROM CT as A
INNER JOIN CT AS B ON A.Latitude <> B.Latitude AND A.Longitude<>B.Longitude
CROSS APPLY [dbo].[ufn_ClosestLocation](A.Latitude,A.Longitude, B.Latitude, B.Longitude) R

How to overcome the limitation of `PERCENTILE_CONT` that the argument should be constant?

I want to find POINTCOUNT values that cut the input set ADS.PREDICTOR into equally large groups. The parameter POINTCOUNT can have a different value for different predictors, so I don't want to hard-code it.
Unfortunately the code below fails with ORA-30496: Argument should be a constant... How can I overcome this (except for 300 lines of code with hard-coded threshold quantiles, of course)?
define POINTCOUNT=300;
select *
from (
  select
    percentile_disc(MYQUANTILE)
      within group (order by PREDICTOR) as THRESHOLD
  from ADS
  inner join (
    select (LEVEL - 1)/(&POINTCOUNT - 1) as MYQUANTILE
    from dual
    connect by LEVEL <= &POINTCOUNT
  )
  on 1=1
)
group by THRESHOLD
I want to draw a ROC curve. The curve will be plotted in Excel as a linear interpolation between pairs of points (X, Y) calculated in Oracle.
Each point (X, Y) is calculated using a threshold value.
I will get the best approximation of the ROC curve for a given number of pairs of points if the distance between adjacent points (X, Y) is uniform.
If I cut the domain of the predicted values into N values that separate the 1/Nth quantiles, I should get a fairly good set of threshold values.
PERCENTILE_CONT() only requires that the percentile value be constant within each group. You do not have a group by in your subquery, so I think this might fix your problem:
select MYQUANTILE,
       percentile_disc(MYQUANTILE) within group (order by PREDICTOR) as THRESHOLD
from ADS cross join
     (select (LEVEL - 1)/(&POINTCOUNT - 1) as MYQUANTILE
      from dual
      connect by LEVEL <= &POINTCOUNT
     )
GROUP BY MYQUANTILE;
Also, note that CROSS JOIN is the same as INNER JOIN . . . ON 1=1.
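For a small &POINTCOUNT the generator's output is easy to see: with &POINTCOUNT = 5, the CONNECT BY subquery produces MYQUANTILE values 0, 0.25, 0.5, 0.75 and 1, so the outer query returns one THRESHOLD per evenly spaced quantile, endpoints included.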

Select random non-repeating rows per user

The use case is that I have a products table and a user_match_product table. For a specific user, I want to select X random products for which that user has no match.
The naive way to do that would be something like
SELECT * FROM products WHERE id NOT IN (SELECT p_id FROM user_match_product WHERE u_id = 123) ORDER BY random() LIMIT X
but that will become a performance bottleneck when having millions of rows.
I thought of some possible solutions which I will present here now. I would love to hear about your solutions for that problem or suggestions regarding my solutions.
Solution 1: Trust the randomness
Based on the fact that the product ids are monotonically increasing, one could optimistically generate X*C random numbers R_i (i between 1 and X*C) in the range [min_id, max_id], and hope that a select like the following will return X elements.
SELECT * FROM products p1 WHERE p1.id IN (R_1, R_2, ..., R_XC) AND NOT EXISTS (SELECT * FROM user_match_product WHERE u_id = 123 AND p_id = p1.id) LIMIT X
Advantages
If the random number generator is good, this will probably work very well within O(1)
Old and newly added products have the same probability of being chosen
Disadvantages
If the number of matches is near to the number of products, the collision probability might be very high.
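(Roughly speaking, if the user has already matched a fraction f of the products, each random candidate survives with probability about 1 - f, so C needs to be on the order of 1/(1 - f) for the X*C draws to yield X unmatched products in expectation, ignoring duplicate draws and gaps in the id sequence.)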
Solution 2: Block-wise PRNG
One could create a permutation function permutate(seed, start, end, value) for the domain [START, END] that uses a seed for randomness. At time t0 a user A has 0 matched products and observes that E0 products exist. The first block for the user A at t0 is for the domain [1, E0]. The user remembers a counter C which initially is 0.
To select X products the user A first has to create the permutations P_i like
P_i = permutate(seed, START, END, C + i)
The following has to hold for the function:
permutate(seed, start, end, value) is an element of [start, end]
value is an element of [start, end]
The following query will return X non-repeating elements.
SELECT * FROM products WHERE id IN (P_1, ..., P_X)
When C reaches END, the next block is allocated by using END + 1 as the new START, the current count of products E1 as new END. The seed and C stay the same.
Advantages
No collisions possible
Guaranteed O(1)
Disadvantages
The current block has to be finished before new products can be selected
I'd go with approach #1.
You can get a first estimate of C by counting the user's rows in user_match_product (assumed unique). If he already possesses half the possible products, selecting twice the number of random products seems a good heuristic.
You can also have a last-ditch correction that verifies that the number of extracted products is actually X. If it was, say, X/3, you'd need to run the same extraction two more times (avoiding already-generated random product IDs), and increase that user's C constant by a factor of three.
Also, knowing what the range of product IDs is, you could select random numbers in that range that do not appear in user_match_product (i.e. your first stage query is only against user_match_product) which is bound to have a (much?) lower cardinality than products. Then, those IDs that pass the test can be safely selected from products.
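A rough sketch of that two-stage idea, in Postgres syntax (the user id 123 comes from the question; X = 10, the candidate count of 30 and the id-bounds subquery are illustrative assumptions):
WITH bounds AS (
  SELECT min(id) AS min_id, max(id) AS max_id FROM products
),
candidates AS (
  -- stage 1: random ids tested only against the (small) user_match_product table
  SELECT r.id
  FROM (
    SELECT DISTINCT (min_id + floor(random() * (max_id - min_id + 1)))::int AS id
    FROM bounds, generate_series(1, 30)
  ) r
  WHERE NOT EXISTS (
    SELECT 1 FROM user_match_product ump
    WHERE ump.u_id = 123 AND ump.p_id = r.id
  )
)
-- stage 2: only the surviving ids are looked up in products
SELECT p.*
FROM products p
JOIN candidates c ON c.id = p.id
LIMIT 10;
Candidate ids that happen to fall in gaps of the products table simply drop out of the join, which is another reason to over-generate candidates.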
If you want to choose X products that the user doesn't have, the first thing that comes to mind is to enumerate the products and use order by rand() (or the equivalent, depending on the database). This is your first solution:
SELECT p.*
FROM products p
WHERE NOT EXISTS (SELECT 1 FROM user_match_product ump WHERE ump.p_id = p.id and ump.u_id = 123)
ORDER BY random()
LIMIT X;
A simple way to make this more efficient is to choose an arbitrary subset. You can actually do this using random() as well, but in the where clause:
SELECT p.*
FROM products p
WHERE random() < Y AND
NOT EXISTS (SELECT 1 FROM user_match_product ump WHERE ump.p_id = p.id and ump.u_id = 123)
ORDER BY random()
LIMIT X;
The question is: what is "Y"? Well, let's say the number of products is P and the user has U. Then, if we choose a random set of (X + U) products, we can definitely get X products the user does not have. This suggests that the expression random() < (X + U) / P would be sufficient. Alas, the vagaries of random numbers say that sometimes we would get enough and sometimes not enough. Let's add a factor such as 3 to be safe. This is actually really, really safe for most values of X, U, and P.
The idea is a query such as this:
SELECT p.*
FROM Products p CROSS JOIN
     (SELECT COUNT(*) as p FROM Products) v1 CROSS JOIN
     (SELECT COUNT(*) as u FROM User_Match_Product WHERE u_id = 123) v2
WHERE random() < 3.0 * (v2.u + X) / v1.p AND
      NOT EXISTS (SELECT 1 FROM User_Match_Product ump WHERE ump.p_id = p.id and ump.u_id = 123)
ORDER BY random()
LIMIT X;
Note that these calculations require a small amount of time with appropriate indexes on Products and User_Match_Product.
So, if you have 1,000,000 products, a typical user has 20, and you want to recommend 10 more, then the expression works out to (20 + 10)*3/1000000 --> 90/1000000. This query will scan the products table, pull out roughly 90 rows at random, then sort them and choose an appropriate 10 rows. Sorting 90 rows is, essentially, constant time relative to the original operation.
For many purposes, the cost of the table scan is acceptable. It sure beats the cost of sorting all the data, for instance.
The alternative approach is to load all products for a user into the application. Then pull a random product out and compare to the list:
select p.id
from Products p cross join
(select min(id) as minid, max(id) as maxid from Products) v1
where p.id >= minid + random() * (maxid - minid)
order by p.id
limit 1;
(Note the calculation can be done outside the query so you can just plug in a constant.)
Many query optimizers will resolve this query in constant time by doing an index scan. You can then check in the application whether the user already has the product. This would then run about X times for the user, providing O(1) performance per pick. However, this has rather bad worst-case performance: if there are not X available products, it will run indefinitely. Of course, additional logic can fix this problem.

Filter out deviating records with SQL

We have a set of data for which we need the average of a column. A select avg(x) from y does the trick. However, we need a more accurate figure.
I figured there must be a way of filtering out records that have values that are either too high or too low (spikes), so that we can exclude them when calculating the average.
There are three types of average, and what you are originally using is the mean - the sum of all the values divided by the number of values.
You might find it more useful to get the mode - the most frequently occurring value:
select name,
       (select top 1 h.run_duration
        from sysjobhistory h
        where h.step_id = 0
          and h.job_id = j.job_id
        group by h.run_duration
        order by count(*) desc) run_duration
from sysjobs j
If you do want to get rid of any values more than one standard deviation from the mean, you could find the average and the standard deviation in a subquery, eliminate the values that fall outside the range average +/- standard deviation, and then take a further average of the remaining values - but you start running the risk of producing meaningless figures:
select oh.job_id, avg(oh.run_duration)
from sysjobhistory oh
inner join (select job_id, avg(h.run_duration) avgduration,
                   stdev(h.run_duration) stdev_duration
            from sysjobhistory h
            group by job_id) as m on m.job_id = oh.job_id
where oh.step_id = 0
  and abs(oh.run_duration - m.avgduration) < m.stdev_duration
group by oh.job_id
In SQL Server there's also the STDEV function, so maybe that can be of some help...
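Applied to the question's own table y and column x, the same pattern might look like this minimal sketch (assuming SQL Server, since STDEV and the sysjob examples above are T-SQL):
SELECT AVG(y.x) AS trimmed_avg          -- average recomputed over the non-spike rows only
FROM y
CROSS JOIN (SELECT AVG(x) AS avg_x, STDEV(x) AS stdev_x FROM y) s   -- overall mean and standard deviation
WHERE ABS(y.x - s.avg_x) <= s.stdev_x   -- keep rows within one standard deviation of the mean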