Get distance and duration to closest matrix in SQL - sql-server-2012

I have logic to find the most optimised way to perform deliveries.
Let's say I have locations A, B and C. I need the distance and duration from A to B, B to A, A to C, C to A, B to C and C to B.
I know how to produce that query; an example result would be NewMatrix in the fiddle:
http://sqlfiddle.com/#!6/9cce7/1
I have a table where I store the current matrix we have, based on past deliveries (AppMatrix in the fiddle above).
So I need to look up distance and duration in this table by finding the closest matching origin and destination. I have created the following function, which works perfectly to get my answer:
SELECT TOP 1 Distance, ([Time]/60) as Duration FROM [AppMatrix]
ORDER BY ABS([OriginSiteLat] - @OriginLat) + ABS([OriginSiteLng] - @OriginLong)
,ABS([DestSiteLat] - @DestinationLat) + ABS([DestSiteLng] - @DestinationLong)
The problem is slowness. I need to perform this lookup for every origin/destination pair (700 different deliveries in a day means 700*700 = 490,000 lookups), and it is just too slow - it takes a few hours to return results.
I'm working on how best to limit the data, but any advice on how to optimize performance is appreciated. Maybe advice on how to use spatial types here would help.
This is my current code:
SELECT * FROM CT AS A
INNER JOIN CT AS B ON A.Latitude <> B.Latitude AND A.Longitude <> B.Longitude
CROSS APPLY [dbo].[ufn_ClosestLocation](A.Latitude, A.Longitude, B.Latitude, B.Longitude) R
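One thing worth trying (a rough sketch, not tested against your schema; the geography columns, index names and the 5000 m radius below are assumptions to adapt) is to store each origin and destination as a geography point, index them, and let an index-friendly radius filter shrink the candidate set before ranking by combined distance:
-- One-off schema change: persist the points and index them.
ALTER TABLE AppMatrix ADD OriginGeo geography, DestGeo geography;

UPDATE AppMatrix
SET OriginGeo = geography::Point([OriginSiteLat], [OriginSiteLng], 4326),
    DestGeo   = geography::Point([DestSiteLat], [DestSiteLng], 4326);

CREATE SPATIAL INDEX IX_AppMatrix_OriginGeo ON AppMatrix (OriginGeo);
CREATE SPATIAL INDEX IX_AppMatrix_DestGeo ON AppMatrix (DestGeo);

-- Inside the lookup: pre-filter with the indexes, then rank the few surviving rows.
DECLARE @Origin geography = geography::Point(@OriginLat, @OriginLong, 4326);
DECLARE @Dest   geography = geography::Point(@DestinationLat, @DestinationLong, 4326);

SELECT TOP 1 Distance, ([Time]/60) AS Duration
FROM AppMatrix
WHERE OriginGeo.STDistance(@Origin) < 5000   -- metres; tune to how close a "match" must be
  AND DestGeo.STDistance(@Dest) < 5000
ORDER BY OriginGeo.STDistance(@Origin) + DestGeo.STDistance(@Dest);
The radius predicates are what let the spatial indexes cut the table down; the final ORDER BY then only ranks the survivors instead of the whole table. If some lookups find no row inside the radius, you can fall back to the original ABS-based ordering for those cases.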

Related

Nested SQL evaluation question with unnest

This may be a basic question, but I just couldn't figure it out. Sample data and the query can be found here (under the "First-touch" tab).
I'll skip the marketing terminology here but basically what the query does is attributing credits/points to placements (ads) based on certain rule. Here, the rule is "first-touch", which means the credit goes to the first ad user interacted with - could be view or click. The "FLOODLIGHT" here means the user takes action to actually buy the product (conversion).
As you can see in the sample data, user 1 has one conversion and the first ad is placement 22 (first-touch), so 22 gets 1 point. User 2 has two conversions and the first ad of each is 11, so 11 gets 2 points.
The logic is quite simple here, but I had a difficult time understanding the query itself. What's the point of comparing prev_conversion_event.event_time < conversion_event.event_time? Aren't they essentially the same? I mean, both of them come from UNNEST(t.*_paths.events), and attributed_event.event_time also comes from the same place.
What do prev_conversion_event.event_time, conversion_event.event_time, and attributed_event.event_time evaluate to in this scenario, anyway? I'm just confused as hell here. I'd much appreciate the help!
For convenience I'm pasting the sample data, the query and output below:
Sample data
Output
/* Substitute *_paths for the specific paths table that you want to query. */
SELECT
  (
    SELECT
      attributed_event_metadata.placement_id
    FROM (
      SELECT
        AS STRUCT attributed_event.placement_id,
        ROW_NUMBER() OVER(ORDER BY attributed_event.event_time ASC) AS rank
      FROM
        UNNEST(t.*_paths.events) AS attributed_event
      WHERE
        attributed_event.event_type != "FLOODLIGHT"
        AND attributed_event.event_time < conversion_event.event_time
        AND attributed_event.event_time > (
          SELECT
            IFNULL( (
              SELECT
                MAX(prev_conversion_event.event_time) AS event_time
              FROM
                UNNEST(t.*_paths.events) AS prev_conversion_event
              WHERE
                prev_conversion_event.event_type = "FLOODLIGHT"
                AND prev_conversion_event.event_time < conversion_event.event_time),
            0)) ) AS attributed_event_metadata
    WHERE
      attributed_event_metadata.rank = 1) AS placement_id,
  COUNT(*) AS credit
FROM
  adh.*_paths AS t,
  UNNEST(*_paths.events) AS conversion_event
WHERE
  conversion_event.event_type = "FLOODLIGHT"
GROUP BY
  placement_id
HAVING
  placement_id IS NOT NULL
ORDER BY
  credit DESC
It is a quite convoluted query, to be fair. I think I know what you are asking; please correct me if that's not the case.
What's the point of comparing prev_conversion_event.event_time < conversion_event.event_time?
You are doing something like "I want all the events from this UNNEST, and for every event, I want to know which events preceded it".
Say you have [A, B, C, D] and they are ordered in succession (A happened before B, A and B happened before C, and so on). The result of that unnesting and joining over that condition will get you something like [A:(NULL), B:(A), C:(A, B), D:(A, B, C)] (excuse the notation, hope it is not confusing), where each key:value pair is Event:(Predecessors). Note that A has no events before it, but B has A, and so on.
Now you have a nice table with every conversion event joined to the events that happened before it.
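To make that concrete, here is a toy example with made-up inline data (not the real *_paths tables; event times are plain integers just for illustration). For each FLOODLIGHT conversion it keeps only the attributed events that fall after the previous conversion and before the current one - which is exactly the window the two time comparisons define:
WITH sample_path AS (
  SELECT [
    STRUCT(1 AS event_time, "CLICK" AS event_type, 11 AS placement_id),
    STRUCT(2 AS event_time, "VIEW" AS event_type, 22 AS placement_id),
    STRUCT(3 AS event_time, "FLOODLIGHT" AS event_type, CAST(NULL AS INT64) AS placement_id),
    STRUCT(4 AS event_time, "CLICK" AS event_type, 33 AS placement_id),
    STRUCT(5 AS event_time, "FLOODLIGHT" AS event_type, CAST(NULL AS INT64) AS placement_id)
  ] AS events
)
SELECT
  conversion_event.event_time AS conversion_time,
  attributed_event.event_time AS attributed_time,
  attributed_event.placement_id
FROM sample_path AS t,
  UNNEST(t.events) AS conversion_event,
  UNNEST(t.events) AS attributed_event
WHERE conversion_event.event_type = "FLOODLIGHT"
  AND attributed_event.event_type != "FLOODLIGHT"
  AND attributed_event.event_time < conversion_event.event_time
  AND attributed_event.event_time > IFNULL(
        (SELECT MAX(prev_conversion_event.event_time)
         FROM UNNEST(t.events) AS prev_conversion_event
         WHERE prev_conversion_event.event_type = "FLOODLIGHT"
           AND prev_conversion_event.event_time < conversion_event.event_time),
        0)
ORDER BY conversion_time, attributed_time
The first conversion (time 3) is paired with events 1 and 2 only; the second (time 5) is paired with event 4 only, because events 1 and 2 happened before the previous FLOODLIGHT. So prev_conversion_event.event_time is the time of the latest earlier conversion (or 0 if there is none), conversion_event.event_time is the conversion being credited, and attributed_event.event_time is the candidate ad event that must fall strictly between the two. The ROW_NUMBER() ... rank = 1 part of the real query then keeps only the earliest event in each window (first-touch).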

One-dimensional earth mover's distance in BigQuery/SQL

Let P and Q be two finite probability distributions on integers, with support between 0 and some large integer N. The one-dimensional earth mover's distance between P and Q is the minimum cost you have to pay to transform P into Q, considering that it costs r*|n-m| to "move" a probability r associated to integer n to another integer m.
There is a simple algorithm to compute this. In pseudocode:
previous = 0
sum = 0
for i from 0 to N:
    previous = P(i) - Q(i) + previous
    sum = sum + abs(previous)  // abs = absolute value
return sum
Now, suppose you have two tables that each contain a probability distribution. Column n contains integers, and column p contains the corresponding probability. The tables are well-formed (all probabilities are between 0 and 1, and their sum is 1). I want to compute the earth mover's distance between these two tables in BigQuery (Standard SQL).
Is it possible? I feel like one would need to use analytical functions, but I don't have much experience with them, so I don't know how to get there.
What if N (the maximum integer) is very large, but my tables are not? Can we adapt the solution to avoid doing a computation for each integer i?
Hopefully I fully understand your problem. This seems to be what you're looking for:
WITH Aggr AS (
SELECT rp.n AS n, SUM(rp.p - rq.p)
OVER(ORDER BY rp.n ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS emd
FROM P rp
LEFT JOIN Q rq
ON rp.n = rq.n
) SELECT SUM(ABS(a.emd)) AS total_emd
FROM Aggr a;
WRT question #2, note that we only scan what's actually in the tables, regardless of N, assuming a one-to-one match for every n in P with an n in Q.
I adapted Michael's answer to fix its issues; here's the solution I ended up with. Suppose the integers are stored in column i and the probabilities in column p. First I join the two tables with a full outer join, then for each gap between consecutive support points I multiply the cumulative difference by the gap's width using window functions, and finally I sum all the absolute values.
WITH
joined_table AS (
  SELECT
    IFNULL(table1.i, table2.i) AS i,
    IFNULL(table1.p, 0) AS p,
    IFNULL(table2.p, 0) AS q
  FROM table1
  FULL OUTER JOIN table2
  ON table1.i = table2.i
),
aggr AS (
  SELECT
    -- cumulative difference up to the previous support point (it stays constant across the gap),
    -- multiplied by the width of the gap to the current support point
    SUM(p - q) OVER (ORDER BY i ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
      * (i - LAG(i, 1) OVER (ORDER BY i)) AS emd
  FROM joined_table
)
SELECT SUM(ABS(emd)) AS total_emd
FROM aggr
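As a sanity check, here is a self-contained version with made-up inline data (table1/table2 are defined as CTEs only for this example): P puts 0.5 on 0 and 0.5 on 2, Q puts all its mass on 1, so moving 0.5 one step from each side should give a total EMD of 1.0.
WITH
table1 AS (SELECT 0 AS i, 0.5 AS p UNION ALL SELECT 2, 0.5),
table2 AS (SELECT 1 AS i, 1.0 AS p),
joined_table AS (
  SELECT
    IFNULL(table1.i, table2.i) AS i,
    IFNULL(table1.p, 0) AS p,
    IFNULL(table2.p, 0) AS q
  FROM table1
  FULL OUTER JOIN table2
  ON table1.i = table2.i
),
aggr AS (
  SELECT
    SUM(p - q) OVER (ORDER BY i ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
      * (i - LAG(i, 1) OVER (ORDER BY i)) AS emd
  FROM joined_table
)
SELECT SUM(ABS(emd)) AS total_emd  -- returns 1.0
FROM aggr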

select query showing decimal places on some fields but not others

I have two tables, A & B.
Table A has a column called Nominal which is a float.
Table B has a column called Units which is also a float.
I have a simple select query that highlights any differences between Nominals in table A & Units in table B.
select coalesce(A.Id, B.Id) Id, A.Nominal, B.Units, isnull(A.Nominal, 0) - isnull(B.Units, 0) Diff
from tblA A full outer join tblB B
on A.Id = B.Id
where isnull(A.Nominal, 0) - isnull(B.Units, 0) <> 0
This query works. However, this morning I have a slight problem.
The query is showing one line as having a difference:
Id Nominal Units Diff
FJLK 100000 100000 1.4515E-11
So obviously one or both of the figures are not exactly 100,000. However, when I run a select query on each table individually for this Id, both of them return 100,000. I can't see which one has decimal places - why is this? Is this some sort of default display in SQL Server?
You will find this kind of behavior in Excel as well.
It's the standard (scientific) notation for representing very small numbers: the 1.4515E-11 you got is the same as 1.4515 * 10^(-11), a tiny residue left over because float is an approximate data type.
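Since float is approximate, an exact <> 0 comparison will flag these tiny residues. One workaround is to compare against a small tolerance instead (a sketch of the original query with that change; the 1e-6 threshold is an assumption - tune it to your data, or ROUND both sides to a fixed number of decimals):
select coalesce(A.Id, B.Id) Id, A.Nominal, B.Units, isnull(A.Nominal, 0) - isnull(B.Units, 0) Diff
from tblA A full outer join tblB B
on A.Id = B.Id
where abs(isnull(A.Nominal, 0) - isnull(B.Units, 0)) > 1e-6   -- ignore float rounding noise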

Best practice for calculating user rates and more

I am building an application that shares some stuff...
Each object can be rated with 1..5 stars. For each object I keep the number of rates per star, so I can calculate the average rate.
So per Obj I have: Avg rate and total rate.
I need to get the top 10 rated objects - so I can do it using AvgRate+TotalRate (the objects whose values are in the top 10).
I want to have an SQL table on the server like this:
ObjId (index), totalRate, AvgRate...
Is it possible to have this table sorted so that I can get the top 10 as the first 10 rows?
How can I query the top 10 with the calculation I want?
Also - I need to get the top 10 per user. Per user I have all the objects he shared, so I can have all of the rates of these objects - with all the info per object as mentioned before.
I need to know how to calculate a user rate, and also how to get the top 10 quickly.
Any ideas?
Later edit: Sorry, I didn't understand your question when writing this answer; I'm going to leave it here for others.
What's your formula for TotalRate? And what do you mean by "so I can do it using AvgRate+TotalRate"? Why are you adding an average to TotalRate - whatever that is?
Best practice is to always compute the sums/averages incrementally.
I would model Obj like this:
A - total number of rates received
B - total sum of points received
C - average (float: B/A)
D - foreign key to user (author/owner of Obj)
When an object receives a rate X, you then recompute A = A + 1, B = B + X, C = B/A.
Pre-compute the aggregate sums/averages in the same manner. If an Obj belongs to a user, add the same fields (A, B, C) to the User model/table, and when an Obj receives a rate X, also update the A, B, C values for user D (the owner of the Obj). Then, when selecting the top 10 users, you do not need to join with the Obj table (which may get huge); you only select users - descending by the B or C column, limit 10.
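A minimal T-SQL sketch of both pieces, assuming a hypothetical schema Obj(ObjId, RateCount, RateSum, AvgRate, UserId) and Users(UserId, RateCount, RateSum, AvgRate) - the names are illustrative, not from the question:
DECLARE @id int = 42, @x int = 5;   -- hypothetical: the rated object and the new star value

-- New rate arrives: update the object's running aggregates in place.
UPDATE Obj
SET RateCount = RateCount + 1,
    RateSum   = RateSum + @x,
    AvgRate   = CAST(RateSum + @x AS float) / (RateCount + 1)   -- column references use pre-update values
WHERE ObjId = @id;

-- Keep the owner's aggregates in sync so the per-user top 10 needs no join.
UPDATE u
SET u.RateCount = u.RateCount + 1,
    u.RateSum   = u.RateSum + @x,
    u.AvgRate   = CAST(u.RateSum + @x AS float) / (u.RateCount + 1)
FROM Users u
JOIN Obj o ON o.UserId = u.UserId
WHERE o.ObjId = @id;

-- Top 10 objects / users, ranked by average rate, ties broken by number of rates.
SELECT TOP 10 ObjId, AvgRate, RateCount FROM Obj ORDER BY AvgRate DESC, RateCount DESC;
SELECT TOP 10 UserId, AvgRate, RateCount FROM Users ORDER BY AvgRate DESC, RateCount DESC;
With the aggregates maintained incrementally, an ordinary index on (AvgRate DESC, RateCount DESC) is enough to make the top-10 queries cheap; no sorted table is needed.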

Select pair of rows that obey a rule

I have a big table (1M rows) with the following columns:
source, dest, distance.
Each row defines a link (from A to B).
I need to find the distance between a pair of nodes using another node.
An example:
If I want to find the distance between A and B,
and I find a node x for which I have:
x -> A
x -> B
I can add these distances and get the distance between A and B.
My question:
How can I find all such nodes x and get their distances to A and B?
My purpose is to select the minimum total distance.
P.S.: A and B are just one pair (I need to do this for 100K pairs).
Thanks!
As Andomar said, you'll need Dijkstra's algorithm; here's a link to that algorithm in T-SQL: T-SQL Dijkstra's Algorithm
Assuming you want to get the path from A to B with many intermediate steps, it is impossible to do in plain SQL for an indefinite number of steps. Simply put, it lacks the expressive power; see http://en.wikipedia.org/wiki/Expressive_power#Expressive_power_in_database_theory . As Andomar said, load the data into a process and use Dijkstra's algorithm.
This sounds like the traveling salesman problem.
From a SQL syntax standpoint: CONNECT BY PRIOR would build the tree you're after using START WITH, and you can limit the number of layers it traverses; however, doing so will not guarantee the minimum.
I may get downvoted for this, but I find this an interesting problem. I wish that this could be a more open discussion, as I think I could learn a lot from this.
It seems like it should be possible to achieve this by doing multiple select statements - something like SELECT TOP 1 dest FROM mytable WHERE source = 'A' ORDER BY distance ASC. Wrapping something like this in a while loop, and replacing 'A' with an id variable, would do the trick, no?
For example (A is source, B is final destination):
DECLARE @var_id varchar(10) = 'A';
WHILE @var_id <> 'B'
BEGIN
    SELECT TOP 1 @var_id = dest FROM mytable WHERE source = @var_id ORDER BY distance ASC;
    SELECT @var_id;
END
Wouldn't something like this work? (The code is sloppy, but the idea seems sound.) Comments are more than welcome.
Join the table to itself, with destination joined to source, and add the distances from the two links. Insert the result as a new link (the left side's source, the right side's destination, and the total distance) if that pair isn't already in the table. If the pair is in the table but with a longer total distance, update the existing row with the shorter distance.
Repeat this until no new links are added and no rows are updated with a shorter distance. Your table then contains a link for every possible combination of source and destination, with the minimum distance between them. It would be interesting to see how many repetitions this takes.
This will not track the intermediate path between source and destination but only provides the shortest distance.
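A rough T-SQL sketch of that loop, assuming a hypothetical links(source, dest, distance) table with a unique key on (source, dest) - adapt names and types to the real table:
DECLARE @changed int = 1;

WHILE @changed > 0
BEGIN
    -- Shorten existing links where a two-hop route is cheaper.
    UPDATE l
    SET l.distance = s.total
    FROM links l
    JOIN (
        SELECT a.source, b.dest, MIN(a.distance + b.distance) AS total
        FROM links a
        JOIN links b ON a.dest = b.source
        GROUP BY a.source, b.dest
    ) s ON s.source = l.source AND s.dest = l.dest
    WHERE s.total < l.distance;

    SET @changed = @@ROWCOUNT;

    -- Add two-hop combinations that are not in the table yet.
    INSERT INTO links (source, dest, distance)
    SELECT a.source, b.dest, MIN(a.distance + b.distance)
    FROM links a
    JOIN links b ON a.dest = b.source
    LEFT JOIN links l ON l.source = a.source AND l.dest = b.dest
    WHERE l.source IS NULL
    GROUP BY a.source, b.dest;

    SET @changed = @changed + @@ROWCOUNT;
END
Each pass relaxes every pair through one more intermediate hop, so the loop stops after at most as many passes as the longest shortest path needs; as noted above, it yields only the distances, not the routes.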
IIUC this should do it, but I'm not sure whether it is really viable (performance-wise) due to the large number of rows involved and the self-join:
SELECT
  t1.source AS A,
  t1.dest AS x,
  t2.dest AS B,
  t1.distance + t2.distance AS total_distance
FROM
  big_table AS t1
INNER JOIN
  big_table AS t2 ON t1.dest = t2.source
WHERE
  t1.source = 'insert source (A) here' AND
  t2.dest = 'insert destination (B) here'
ORDER BY
  total_distance ASC
LIMIT
  1
The above snippet will work for the case in which you have two rows in the form A->x and x->B, but not for other combinations (e.g. A->x and B->x). Extending it to cover all four combinations should be trivial (e.g. create a view that duplicates each row and swaps source and dest).
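For instance, a sketch of such a view, assuming distances are symmetric so each stored row can be used in either direction:
CREATE VIEW big_table_bidir AS
SELECT source, dest, distance FROM big_table
UNION ALL
SELECT dest AS source, source AS dest, distance FROM big_table;
Pointing the query above at big_table_bidir instead of big_table then also catches the A->x / B->x style pairs.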