Select pairs of rows that obey a rule - SQL

I have a big table (1M rows) with the following columns:
source, dest, distance.
Each row defines a link (from A to B).
I need to find the distance between a pair of nodes via another node.
An example:
If I want to find the distance between A and B,
and I find a node x such that I have:
x -> A
x -> B
I can add these distances and get the distance between A and B.
My question:
How can I find all the nodes (such as x) and get their distances to (A and B)?
My purpose is to select the min value of distance.
P.S.: A and B are just one connection (I need to do this for 100K connections).
Thanks!

As Andomar said, you'll need Dijkstra's algorithm; here's a link to an implementation in T-SQL: T-SQL Dijkstra's Algorithm

Assuming you want to get the path from A to B with possibly many intermediate steps, it is impossible to do this in plain SQL for an indefinite number of steps. Simply put, it lacks the expressive power; see http://en.wikipedia.org/wiki/Expressive_power#Expressive_power_in_database_theory . As Andomar said, load the data into a program and use Dijkstra's algorithm.

This sounds like the traveling salesman problem.
From a SQL syntax standpoint: CONNECT BY PRIOR would build the tree you're after, using START WITH and limiting the number of layers it can traverse; however, doing so will not guarantee the minimum.
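For illustration, a hedged Oracle-style sketch of that idea, assuming the table is named big_table with the question's source/dest/distance columns (an assumption, not the asker's actual schema):
-- LEVEL <= 2 limits the traversal to two hops (A -> x -> B);
-- PRIOR distance only adds the immediate parent's distance, so this
-- only sums correctly for two-hop paths.
SELECT source, dest, NVL(PRIOR distance, 0) + distance AS total_distance, LEVEL
FROM   big_table
WHERE  dest = 'B'
START WITH source = 'A'
CONNECT BY PRIOR dest = source AND LEVEL <= 2
ORDER BY total_distance;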

I may get downvoted for this, but I find this an interesting problem. I wish that this could be a more open discussion, as I think I could learn a lot from this.
It seems like it should be possible to achieve this by doing multiple select statements - something like SELECT dest FROM mytable WHERE source = 'A' ORDER BY distance ASC LIMIT 1. Wrapping something like this in a while loop, and replacing 'A' with a variable holding the current node, would do the trick, no?
For example (A is source, B is final destination):
DECLARE var_id VARCHAR(16);
SET var_id = 'A';
WHILE var_id != 'B'
BEGIN
    -- hop to the nearest neighbour of the current node
    SELECT dest INTO var_id FROM mytable WHERE source = var_id ORDER BY distance ASC LIMIT 1;
    SELECT var_id;
END
Wouldn't something like this work? (The code is sloppy, but the idea seems sound.) Comments are more than welcome.

Join the table to itself with destination joined to source. Add the distances from the two links. Insert that as a new link, with the left side's source, the right side's destination, and the total distance, if that pair isn't already in the table. If the pair is already in the table but the new total distance is shorter, update the existing row with the shorter distance.
Repeat this until you get no new links added to the table and no updates with a shorter distance. Your table now contains a link for every possible combination of source and destination with the minimum distance between them. It would be interesting to see how many repetitions this would take.
This will not track the intermediate path between source and destination but only provides the shortest distance.
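A hedged sketch of one such pass, assuming MySQL-flavoured SQL and a table named links with the question's source/dest/distance columns (the table name is an assumption):
-- Add any two-hop routes that don't yet exist as direct links
INSERT INTO links (source, dest, distance)
SELECT a.source, b.dest, MIN(a.distance + b.distance)
FROM links AS a
JOIN links AS b ON b.source = a.dest AND b.dest <> a.source
LEFT JOIN links AS c ON c.source = a.source AND c.dest = b.dest
WHERE c.source IS NULL
GROUP BY a.source, b.dest;

-- Shorten existing links where some two-hop route is cheaper
UPDATE links AS c
JOIN (SELECT a.source, b.dest, MIN(a.distance + b.distance) AS d
      FROM links AS a
      JOIN links AS b ON b.source = a.dest
      GROUP BY a.source, b.dest) AS x
  ON x.source = c.source AND x.dest = c.dest
SET c.distance = x.d
WHERE x.d < c.distance;

-- Repeat both statements until neither affects any rows.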

IIUC this should do it, but I'm not sure whether it's really viable (performance-wise) given the large number of rows involved and the self-join:
SELECT
    t1.source AS A,
    t1.dest AS x,
    t2.dest AS B,
    t1.distance + t2.distance AS total_distance
FROM
    big_table AS t1
INNER JOIN
    big_table AS t2 ON t2.source = t1.dest
WHERE
    t1.source = 'insert source (A) here' AND
    t2.dest = 'insert destination (B) here'
ORDER BY
    total_distance ASC
LIMIT 1
The above snippet will work for the case in which you have two rows of the form A->x and x->B, but not for other combinations (e.g. A->x and B->x). Extending it to cover all four combinations should be trivial (e.g. create a view that duplicates each row and swaps source and dest).
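A hedged sketch of such a view, keeping the assumed big_table name:
-- every link becomes usable in both directions
CREATE VIEW big_table_bidir AS
SELECT source, dest, distance FROM big_table
UNION ALL
SELECT dest AS source, source AS dest, distance FROM big_table;
Querying big_table_bidir instead of big_table in the snippet above then covers links in either direction.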


Optimize a PowerBI data load containing a Nearest Neighbor calculation

I have a PowerBI report that shows metrics and visuals for a large amount of quote data extracted by an API, roughly 400k records a week. These quotes only contain latitude and longitude points for location, but shareholders need to slice views by our service areas. We have a fact table of areas with IDs and geography polygons that I am able to reference.
Currently, the report uses a gnarly custom SQL query that pulls this data from the transactional database, transforms it, and finds the nearest area through a cross apply method.
Here's an example of the code:
-- step 1 : get quotes from the first table
SELECT Col1, Col2...
INTO #AllQuotes
FROM Quotes1
LEFT JOIN (FactTables)
INNER JOIN([filtering self join])
WHERE expression
-- Step 2 : insert quotes from a separate table into our first temp table to get a table with all quote data
INSERT INTO #AllQuotes
SELECT Col1, Col2
FROM Quotes2
LEFT JOIN(Fact Tables)
INNER JOIN([filtering self join])
WHERE expression
-- Step 3 : Use CROSS APPLY to check the distance of every quote from every area, only selecting the shortest distance
SELECT *
FROM (SELECT *
      FROM #AllQuotes AS t
      CROSS APPLY (SELECT TOP 1 a.AreaName,
                          a.AreaPoly.STDistance(geography::STGeomFromText('POINT('+ cast(t.PickLongitudeTemp as VARCHAR(20)) +' '+ cast(t.PickLatitudeTemp as VARCHAR(20)) +')', 4326).MakeValid()) AS 'DistanceToZone'
                   FROM Area AS a
                   WHERE (a.AreaPoly.STIsValid() = 1)
                     AND (a.AreaPoly.STDistance(geography::STGeomFromText('POINT('+ cast(t.PickLongitudeTemp as VARCHAR(20)) +' '+ cast(t.PickLatitudeTemp as VARCHAR(20)) +')', 4326).MakeValid()) IS NOT NULL)
                   ORDER BY a.AreaPoly.STDistance(geography::STGeomFromText('POINT('+ cast(t.PickLongitudeTemp as VARCHAR(20)) +' '+ cast(t.PickLatitudeTemp as VARCHAR(20)) +')', 4326).MakeValid()) ASC
                  ) AS t2
     ) AS llz;
This is obviously very computationally expensive and is making the PowerBI mashup engine work in overdrive. We are starting to have issues with CPU load on our database due to poor data load optimization. PowerBI rebuilds its data model every refresh and its query engine is not the strongest at using complex queries. Compounding this with the large amount of data, it quickly becomes a real issue with our stability.
Our database doesn't have a schema that is conducive to efficient analytics queries, no transformation happens as data is loaded, and a separate process hits a maps API to associate addresses with lat/longs. In order to produce reports with any value, I need to perform a lot of transformations within the query or within the loading process. This isn't the best thing to do, I know, but it's what I got working and what could provide value.
I decided to try to move the query into something server side so that PowerBI only needed to load an already transformed and prepped dataset. With views I was able to get a dataset of all of my quotes and their lat/longs.
Now how would I go about running step 3? I have a few ideas:
Use a nested view
Refactor every temp table into a monolith of CTEs that then get transformed by a final view
Research a new method for solving a Lat/Long to Polygon matching problem.
I would like to have a final table that PowerBI can import with a simple SELECT * FROM #AllQuotes so that the mashup engine has to do less work constructing the data model. This would also allow me to implement incremental refresh and be able to only import a day's worth of data as time goes on rather than the full dataset.
Any solutions or ideas on how to match Lat/Long points to a list of geography Polygons in a PBI friendly way would be greatly appreciated.
Can't say I'm a spatial expert, but I don't think you are really using your index. STDistance has to run against every combination of quote/area and then sort to find the smallest distance, so you need to reduce the number of areas each quote is compared against.
If you review your data, I'd guess you'd find something like 30% of quotes are within 5,000 meters of an area, and 80% are within 10,000 meters.
With that in mind, I think we can add some queries to find those close matches first. This should be able to use your spatial indexes efficiently, as it can first filter down to only close matches, reducing the number of times you have to calculate the distance from a quote to each area.
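If a spatial index doesn't already exist on the area polygons, a hedged sketch of creating one (this assumes Area has a clustered primary key, which spatial indexes require):
-- index the geography column so STIntersects/STDistance can seek
-- instead of scanning every area for every quote
CREATE SPATIAL INDEX SIX_Area_AreaPoly
    ON Area (AreaPoly)
    USING GEOGRAPHY_AUTO_GRID;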
Conceptual Code Approach: First Find Quick Matches within Predefined Distance(s)
/*First identify matches within a small distance, like 5,000 meters*/
UPDATE A
SET NearestAreaID = C.AreaID
FROM #Quote AS A
CROSS APPLY (SELECT QuoteGeogPoint = geography::Point(A.PickLatitudeTemp, A.PickLongitudeTemp, 4326)) AS B
CROSS APPLY ( SELECT TOP(1) DTA.AreaID
              FROM Area AS DTA
              /*STBuffer creates a circle of 5,000 meters around the quote location;
                STIntersects keeps only areas that intersect that circle*/
              WHERE B.QuoteGeogPoint.STBuffer(5000).STIntersects(DTA.AreaPoly) = 1
              ORDER BY DTA.AreaPoly.STDistance(B.QuoteGeogPoint)
            ) AS C
/*Could run the above query again for a medium distance, say 10,000 meters*/
UPDATE A
SET NearestAreaID = C.AreaID
FROM #Quote AS A
CROSS APPLY (SELECT QuoteGeogPoint = geography::Point(A.PickLatitudeTemp, A.PickLongitudeTemp, 4326)) AS B
CROSS APPLY ( SELECT TOP(1) DTA.AreaID
              FROM Area AS DTA
              WHERE B.QuoteGeogPoint.STBuffer(10000).STIntersects(DTA.AreaPoly) = 1
              ORDER BY DTA.AreaPoly.STDistance(B.QuoteGeogPoint)
            ) AS C
WHERE A.NearestAreaID IS NULL /*No match found yet*/
Match Quotes Regardless of Area Distance
Once you've found the easy matches, use this script (your current step 3) to clean up any stragglers:
/*Find matches for any quotes that didn't match within the defined distances*/
UPDATE A
SET NearestAreaID = C.AreaID
FROM #Quote AS A
CROSS APPLY (SELECT QuoteGeogPoint = geography::Point(A.PickLatitudeTemp, A.PickLongitudeTemp, 4326)) AS B
CROSS APPLY ( SELECT TOP(1) DTA.AreaID
              FROM Area AS DTA
              ORDER BY DTA.AreaPoly.STDistance(B.QuoteGeogPoint)
            ) AS C
WHERE A.NearestAreaID IS NULL /*No match already found*/

SQL: Reduce resultset to X rows?

I have the following MYSQL table:
measuredata:
- ID (bigint)
- timestamp
- entityid
- value (double)
The table contains >1 billion entries. I want to be able to visualize any time window, from one day up to many years. There is a measurement value roughly every minute in the DB.
So the number of entries for a time window can vary widely, say from a few hundred to several thousand or even millions.
Those values are meant to be visualized in a graphical chart on a webpage.
If the chart is, let's say, 800px wide, it does not make sense to fetch thousands of rows from the database when the time window is quite big; I cannot show more than 800 values on the chart anyhow.
So, is there a way to reduce the resultset directly on DB-side?
I know "average" and "sum" etc. as aggregate function. But how can I i.e. aggregate 100k rows from a big time-window to lets say 800 final rows?
Just getting those 100k rows and letting the chart do the magic is not the preferred option; transfer size is one reason why this is not an option.
Isn't there something on DB side I can use?
Something like avg() to shrink X rows to Y averaged rows?
Or some simple magic to just skip every n-th row to shrink X to Y?
update:
Although I'm using MySQL right now, I'm not tied to it. If PostgreSQL, for instance, provides a feature that could solve the issue, I'm willing to switch DBs.
update2:
I maybe found a possible solution: https://mike.depalatis.net/blog/postgres-time-series-database.html
See section "Data aggregation".
The key is not to use a Unix timestamp but a date, "trunc" it, average the values, and group by the truncated date. Could work for me, but would require a rework of my table structure. Hmm... maybe there's more ... still researching ...
update3:
Inspired by update 2, I came up with this query:
SELECT (`timestamp` - (`timestamp` % 86400)) as aggtimestamp, `entity`, `value` FROM `measuredata` WHERE `entity` = 38 AND timestamp > UNIX_TIMESTAMP('2019-01-25') group by aggtimestamp
Works, but my DB/index/structure seems not really optimized for this: the query for the last year took ~75 sec (slow test machine) and returns only one value per day. This can be combined with avg(value), but that further increases the query time (~82 sec). I will see if it's possible to optimize this further. But I now have an idea of how "downsampling" data works, especially aggregation in combination with GROUP BY.
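For reference, a hedged sketch of that avg(value) combination, reusing the table and column names from the query above:
SELECT (`timestamp` - (`timestamp` % 86400)) AS aggtimestamp,
       `entity`,
       AVG(`value`) AS avg_value          -- one averaged value per day bucket
FROM `measuredata`
WHERE `entity` = 38
  AND `timestamp` > UNIX_TIMESTAMP('2019-01-25')
GROUP BY aggtimestamp, `entity`;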
There is probably no efficient way to do this. But, if you want, you can break the rows into equal sized groups and then fetch, say, the first row from each group. Here is one method:
select md.*
from (select md.*,
             row_number() over (partition by tile order by timestamp) as seqnum
      from (select md.*, ntile(800) over (order by timestamp) as tile
            from measuredata md
            where . . . -- your filtering conditions here
           ) md
     ) md
where seqnum = 1;

Closest position between randomly moving objects

I have a large database table that contains grid references (X and Y) associated with various objects (each with a unique object identifier) as they move over time. The objects move at approximately constant speed but in random directions.
The table looks something like this:
CREATE TABLE positions (
objectId INTEGER,
x_coord INTEGER,
y_coord INTEGER,
posTime TIMESTAMP);
I want to find which two objects got closest to each other and at what time.
Finding the distance between two fixes is relatively easy – simple Pythagoras for the differences between the X and Y values should do the trick.
The first problem seems to be one of volume. The grid itself is large, 100,000 possible X co-ordinates and a similar number of Y co-ordinates. For any given time period the table might contain 10,000 grid reference positions for 1000 different objects – 10 million rows in total.
That’s not in itself a large number, but I can’t think of a way of avoiding doing a ‘product query’ to compare every fix to every other fix. Doing this with 10 million rows will produce 100 million million results.
The next issue is that I’m not just interested in the closest two fixes to each other, I’m interested in the closest two fixes from different objects.
Another issue is that I need to match time as well as position – I’m not just interested in two objects that have visited the same grid square, they need to have done so at the same time.
The other point (may not be relevant) is that the items are unlikely to ever occupy exactly the same location at the same time.
I've got as far as a simple product query with a few sample rows, but I'm not sure of my next steps. I'm beginning to think this isn't something I can pull off with a single SQL query (please prove me wrong) and I'm likely to have to extract the data and subject it to some procedural programming.
Any suggestions?
I'm not sure which SE forum this is best suited for: database SQL? Programming? Maths?
UPDATE - Another issue to add to the complexity: the timestamping for each object and position is irregular; one item might have a position recorded at 14:10:00 and another at 14:10:01. If these two positions are right next to each other and one second apart, then they may actually represent the closest approach even though the times don't match!
In order to reduce the number of tested combinations, you should segregate them by postime using subqueries. It's also recommended to create an index on postime to increase performance.
create index ix1_time on positions (postime);
Since you didn't mention any specific database I assumed PostgreSQL since it's easy to use (for me). The solution should look like:
with t as (
  select distinct postime as pt from positions
)
select t.pt, y.aid, y.bid, y.dist
from t
cross join lateral (
  select
    a.objectid as aid,
    b.objectid as bid,
    a.x_coord + a.y_coord + b.x_coord + b.y_coord as dist -- fix here!
  from positions a
  join positions b
    on b.postime = a.postime
   and b.objectid <> a.objectid
  where a.postime = t.pt
  order by dist
  limit 1
) y;
This SQL should compare each of the 10,000 objects against each other by postime. It will test 10 million combinations for each distinct postime value, but not against other postime values.
Please note: I used a.x_coord + a.y_coord + b.x_coord + b.y_coord as the distance formula. I leave the correct one for you to implement here.
In total it will compute 10 million comparisons x 1000 time values: a total of 10 billion comparisons. It will return the closest two points for each postime, that is, a total of 1000 rows.

Sum two counts in a new column without repeating the code

I have a possibly stupid question.
Look at this query:
select count(a) as A, count(b) as b, count(a)+count(b) as C
From X
How can I sum up the two columns without repeating the code?
Something like:
select count(a) as A, count(b) as b, A+B as C
From X
For the sake of completeness, using a CTE:
WITH V AS (
SELECT COUNT(a) as A, COUNT(b) as B
FROM X
)
SELECT A, B, A + B as C
FROM V
This can easily be handled by making the engine perform only two aggregate functions and a scalar computation. Try this.
SELECT A, B, A + B as C
FROM (
SELECT COUNT(a) as A, COUNT(b) as B
FROM X
) T
You may get the two individual counts from the same table and then get the summation of those counts, like below:
SELECT
(SELECT COUNT(a) FROM X )+
(SELECT COUNT(b) FROM X )
AS C
Let's agree on one point: SQL is not an Object-Oriented language. In fact, when we think of computer languages, we are thinking of procedural languages (you use the language to describe step by step how you want the data to be manipulated). SQL is declarative (you describe the desired result and the system works out how to get it).
When you program in a procedural languages your main concerns are: 1) is this the best algorithm to arrive at the correct result? and 2) do these steps correctly implement the algorithm?
When you program in a declarative language your main concern is: is this the best description of the desired result?
In SQL, most of your effort will go into correctly forming the filtering criteria (the where clause) and the join criteria (any on clauses). Once that is done correctly, you're pretty much just down to aggregating and formatting (if applicable).
The first query you show is perfectly formed. You want the number of all the non-null values in A, the number of all the non-null values in B, and the total of both of those amounts. In some systems, you can even use the second form you show, which does nothing more than abstract away the count(x) text. This is convenient in that if you should have to change a count(x) to sum(x), you only have to make a change in one place rather than two, but it doesn't change the description of the data -- and that is important.
Using a CTE or nested query may allow you to mimic the abstraction not available in some systems, but be careful making cosmetic changes -- changes that do not alter the description of the data. If you look at the execution plan of the two queries as you show them, the CTE and the subquery, in most systems they will probably all be identical. In other words, you've painted your car a different color, but it's still the same car.
But since it now takes two distinct steps in 4 or 5 lines to express what originally took only one step in one line, it's rather difficult to defend the notion that you have made an improvement. In fact, I'll bet you can come up with a lot more bullet points explaining why it would be better if you had started with the CTE or subquery and should change them to your original query than the other way around.
I'm not saying that what you are doing is wrong. But in the real world, we are generally short of the spare time to spend on strictly cosmetic changes.

SQL Cross Apply Performance Issues

My database has a directory of about 2,000 locations scattered throughout the United States with zipcode information (which I have tied to lon/lat coordinates).
I also have a table function which takes two parameters (ZipCode & Miles) to return a list of neighboring zip codes (excluding the same zip code searched)
For each location I am trying to get the neighboring location ids. So if location #4 has three nearby locations, the output should look like:
4 5
4 24
4 137
That is, locations 5, 24, and 137 are within X miles of location 4.
I originally tried to use a cross apply with my function as follows:
SELECT A.SL_STORENUM,A.Sl_Zip,Q.SL_STORENUM FROM tbl_store_locations AS A
CROSS APPLY (SELECT SL_StoreNum FROM tbl_store_locations WHERE SL_Zip in (select zipnum from udf_GetLongLatDist(A.Sl_Zip,7))) AS Q
WHERE A.SL_StoreNum='04'
However, that ran for over 20 minutes with no results, so I canceled it. I did try hardcoding the zipcode, and it immediately returned a list:
SELECT A.SL_STORENUM,A.Sl_Zip,Q.SL_STORENUM FROM tbl_store_locations AS A
CROSS APPLY (SELECT SL_StoreNum FROM tbl_store_locations WHERE SL_Zip in (select zipnum from udf_GetLongLatDist('12345',7))) AS Q
WHERE A.SL_StoreNum='04'
What is the most efficient way of accomplishing this listing of nearby locations? Keeping in mind while I used "04" as an example here, I want to run the analysis for 2,000 locations.
The "udf_GetLongLatDist" is a function which uses some math to calculate distance between two geographic coordinates and returns a list of zipcodes with a distance of > 0. Nothing fancy within it.
When you use the function you probably have to calculate every single possible distance for each row; that is why it takes so long. Since the actual physical locations don't generally move, what we always did was precalculate the distance from each zipcode to every other zipcode (and update only once a month or so when we added new possible zipcodes). Once the distances are precalculated, all you have to do is run a query like:
select zip2 from zipprecalc where zip1 = '12345' and distance <=10
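A hedged sketch of the precalculation itself, assuming a hypothetical ZipCodes(Zip, Lat, Lon) lookup table (STDistance on geography points returns meters, hence the division to get miles):
-- build the pairwise distance table once, refresh monthly
SELECT z1.Zip AS zip1,
       z2.Zip AS zip2,
       geography::Point(z1.Lat, z1.Lon, 4326).STDistance(geography::Point(z2.Lat, z2.Lon, 4326)) / 1609.344 AS distance
INTO   zipprecalc
FROM   ZipCodes AS z1
JOIN   ZipCodes AS z2
  ON   z2.Zip <> z1.Zip;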
We have something similar and optimized it by only calculating the distance for other zipcodes whose latitude is within a bounded range. So if you want other zips within #miles, you use a WHERE clause like:
where latitude >= #targetLat - (#miles/69.2) and latitude <= #targetLat + (#miles/69.2)
Then you are only calculating the great circle distance of a much smaller subset of other zip code rows. We found this fast enough in our use to not require precalculating.
The same thing can't be done for longitude because the distance represented by a degree of longitude varies between the equator and the poles.
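Put together, a hedged sketch of that approach, with T-SQL variables standing in for the #targetLat/#miles placeholders and a hypothetical ZipCodes(Zip, Lat, Lon) table:
DECLARE @targetLat FLOAT = 40.71, @targetLon FLOAT = -74.00, @miles FLOAT = 7;

SELECT z.Zip
FROM   ZipCodes AS z
WHERE  z.Lat >= @targetLat - (@miles / 69.2)   -- cheap latitude band first
  AND  z.Lat <= @targetLat + (@miles / 69.2)
  AND  geography::Point(@targetLat, @targetLon, 4326).STDistance(geography::Point(z.Lat, z.Lon, 4326)) <= @miles * 1609.344;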
Other answers here involve re-working the algorithm. I personally advise the pre-calculated map of all zipcodes against each other. It should be possible to embed such optimisations in your existing udf, to minimise code-changes.
A refactoring of the query, however, could be as follows...
SELECT
A.SL_STORENUM, A.Sl_Zip, C.SL_STORENUM
FROM
tbl_store_locations AS A
CROSS APPLY
dbo.udf_GetLongLatDist(A.Sl_Zip,7) AS B
INNER JOIN
tbl_store_locations AS C
ON C.SL_Zip = B.zipnum
WHERE
A.SL_StoreNum='04'
Also, the performance of the CROSS APPLY will benefit greatly if you can ensure that the udf is INLINE rather than MULTI-STATEMENT. This allows the udf to be expanded inline (macro-like) for a much cleaner execution plan.
Doing so would also allow you to return additional fields from the udf. The optimiser can then include or exclude those fields from the plan depending on whether you actually use them. Such an example would be to include the SL_StoreNum if it's easily accessible from the query in the udf, and so remove the need for the last join...
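For illustration, a hedged sketch of the inline (single-statement) shape; the function name, the hypothetical ZipCodes(Zip, Lat, Lon) table, and the geography-based distance check are assumptions rather than the actual udf:
CREATE FUNCTION dbo.udf_GetLongLatDist_Inline (@Zip VARCHAR(10), @Miles FLOAT)
RETURNS TABLE
AS
RETURN
(
    -- a single SELECT, so the optimiser can expand it into the calling query
    SELECT z2.Zip AS zipnum
    FROM   ZipCodes AS z1
    JOIN   ZipCodes AS z2
      ON   z2.Zip <> z1.Zip
    WHERE  z1.Zip = @Zip
      AND  geography::Point(z1.Lat, z1.Lon, 4326).STDistance(geography::Point(z2.Lat, z2.Lon, 4326)) <= @Miles * 1609.344
);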