Optimize a PowerBI data load containing a Nearest Neighbor calculation (SQL)

I have a PowerBI report that shows metrics and visuals for a large amount of quote data extracted by an API, roughly 400k records a week. These quotes only contain latitude and longitude points for location, but stakeholders need to slice views by our service areas. We have a fact table of areas, with IDs and geography polygons, that I am able to reference.
Currently, the report uses a gnarly custom SQL query that pulls this data from the transactional database, transforms it, and finds the nearest area through a cross apply method.
Here's an example of the code:
-- Step 1: get quotes from the first table
SELECT Col1, Col2...
INTO #AllQuotes
FROM Quotes1
LEFT JOIN (FactTables)
INNER JOIN ([filtering self join])
WHERE expression

-- Step 2: insert quotes from a separate table into our first temp table to get a table with all quote data
INSERT INTO #AllQuotes
SELECT Col1, Col2
FROM Quotes2
LEFT JOIN (Fact Tables)
INNER JOIN ([filtering self join])
WHERE expression

-- Step 3: use CROSS APPLY to check the distance of every quote from every area, only keeping the shortest distance
SELECT *
FROM (SELECT *
      FROM #AllQuotes AS t
      CROSS APPLY (SELECT TOP 1
                          a.AreaName,
                          a.AreaPoly.STDistance(geography::STGeomFromText('POINT(' + CAST(t.PickLongitudeTemp AS VARCHAR(20)) + ' ' + CAST(t.PickLatitudeTemp AS VARCHAR(20)) + ')', 4326).MakeValid()) AS DistanceToZone
                   FROM Area AS a
                   WHERE a.AreaPoly.STIsValid() = 1
                     AND a.AreaPoly.STDistance(geography::STGeomFromText('POINT(' + CAST(t.PickLongitudeTemp AS VARCHAR(20)) + ' ' + CAST(t.PickLatitudeTemp AS VARCHAR(20)) + ')', 4326).MakeValid()) IS NOT NULL
                   ORDER BY a.AreaPoly.STDistance(geography::STGeomFromText('POINT(' + CAST(t.PickLongitudeTemp AS VARCHAR(20)) + ' ' + CAST(t.PickLatitudeTemp AS VARCHAR(20)) + ')', 4326).MakeValid()) ASC
                  ) AS t2
     ) AS llz;
This is obviously very computationally expensive and makes the PowerBI mashup engine work in overdrive. We are starting to have issues with CPU load on our database due to poor data-load optimization. PowerBI rebuilds its data model on every refresh, and its query engine does not handle complex source queries well. Combined with the large amount of data, this quickly becomes a real issue for our stability.
Our database doesn't have a schema that is conducive to efficient analytics queries, there is no transformation happening as data is loaded, and there is a separate process that hits a maps API to associate addresses with lat/longs. In order to produce reports with any value, I need to perform a lot of transformations within the query or within the loading process. This isn't the best thing to do, I know, but it's what I got working and what could provide value.
I decided to try to move the query into something server side so that PowerBI only needed to load an already transformed and prepped dataset. With views I was able to get a dataset of all of my quotes and their lat/longs.
Now how would I go about running step 3? I have a few ideas:
Use a nested view
Refactor every temp table into a monolith of CTEs that then get transformed by a final view
Research a new method for solving a Lat/Long to Polygon matching problem.
I would like to have a final table that PowerBI can import with a simple SELECT * FROM #AllQuotes so that the mashup engine has to do less work constructing the data model. This would also allow me to implement incremental refresh and be able to only import a day's worth of data as time goes on rather than the full dataset.
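For illustration, this is roughly the shape I'm aiming for; a minimal sketch with placeholder names (dbo.QuoteWithArea and its columns are made up, not our real schema):
-- hypothetical pre-computed reporting table, rebuilt or appended to by a scheduled job,
-- so that PowerBI only ever runs a trivial SELECT against it
CREATE TABLE dbo.QuoteWithArea
(
    QuoteId        int          NOT NULL PRIMARY KEY,
    QuoteDate      date         NOT NULL,   -- filter column for incremental refresh
    PickLatitude   decimal(9,6) NOT NULL,
    PickLongitude  decimal(9,6) NOT NULL,
    NearestAreaID  int          NULL,
    DistanceToZone float        NULL
);

-- the PowerBI source query then reduces to something like this, with the date filter
-- driven by the RangeStart/RangeEnd incremental-refresh parameters in Power Query
SELECT QuoteId, QuoteDate, PickLatitude, PickLongitude, NearestAreaID, DistanceToZone
FROM dbo.QuoteWithArea
WHERE QuoteDate >= '20240101' AND QuoteDate < '20240102';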
Any solutions or ideas on how to match Lat/Long points to a list of geography Polygons in a PBI friendly way would be greatly appreciated.

Can't say I'm a spatial expert, but I don't think you are really using your index. STDistance has to run against every combination of quote and area and then sort to find the smallest distance, so you need to reduce the number of areas each quote is compared against.
If you review your data, I'd guess you'll find something like 30% of quotes are within 5,000 meters of their nearest area, and 80% are within 10,000 meters.
With that in mind, I think we can add some queries to find those close matches first. This should be able to use your spatial indexes efficiently, since it can first filter down to only close matches, reducing the number of times you have to calculate the distance from a quote to each area.
Conceptual Code Approach: First Find Quick Matches within Predefined Distance(s)
/* First identify matches within a small distance, e.g. 5,000 meters */
UPDATE A
SET NearestAreaID = C.AreaID
FROM #Quote AS A
CROSS APPLY (SELECT QuoteGeogPoint = geography::Point(A.PickLatitudeTemp, A.PickLongitudeTemp, 4326)) AS B
CROSS APPLY (SELECT TOP(1) DTA.AreaID
             FROM Area AS DTA
             /* STBuffer creates a circle of 5,000 meters around the quote location;
                STIntersects matches only areas that intersect that circle */
             WHERE B.QuoteGeogPoint.STBuffer(5000).STIntersects(DTA.AreaPoly)
             ORDER BY DTA.AreaPoly.STDistance(B.QuoteGeogPoint)
            ) AS C;

/* Could run the same query again for a medium distance, say 10,000 meters */
UPDATE A
SET NearestAreaID = C.AreaID
FROM #Quote AS A
CROSS APPLY (SELECT QuoteGeogPoint = geography::Point(A.PickLatitudeTemp, A.PickLongitudeTemp, 4326)) AS B
CROSS APPLY (SELECT TOP(1) DTA.AreaID
             FROM Area AS DTA
             WHERE B.QuoteGeogPoint.STBuffer(10000).STIntersects(DTA.AreaPoly)
             ORDER BY DTA.AreaPoly.STDistance(B.QuoteGeogPoint)
            ) AS C
WHERE A.NearestAreaID IS NULL; /* only quotes not matched in the first pass */
Match Quotes Regardless of Area Distance
Once you've found the easy matches, use this script (essentially your current step 3) to clean up any stragglers.
/* Find matches for any quotes that didn't match within the predefined distances */
UPDATE A
SET NearestAreaID = C.AreaID
FROM #Quote AS A
CROSS APPLY (SELECT QuoteGeogPoint = geography::Point(A.PickLatitudeTemp, A.PickLongitudeTemp, 4326)) AS B
CROSS APPLY (SELECT TOP(1) DTA.AreaID
             FROM Area AS DTA
             ORDER BY DTA.AreaPoly.STDistance(B.QuoteGeogPoint)
            ) AS C
WHERE A.NearestAreaID IS NULL; /* no match found in the earlier passes */
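For the STIntersects pre-filter above to pay off, Area.AreaPoly needs a spatial index. A minimal sketch, assuming Area already has a clustered primary key (required for a spatial index) and adjusting names to your schema:
-- spatial index on the area polygons; grid settings are left at defaults and may need tuning
CREATE SPATIAL INDEX SIX_Area_AreaPoly
ON Area (AreaPoly)
USING GEOGRAPHY_AUTO_GRID;   -- SQL Server 2012+; on older versions use GEOGRAPHY_GRID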


Query times out after 6 hours, how to optimize it?

I have two tables, shapes and squares, that I'm joining based on intersections of GEOGRAPHY columns.
The shapes table contains travel routes for vehicles:
shape_key (STRING): identifier for the shape
shape_lines (ARRAY<GEOGRAPHY>): consecutive line segments making up the shape
shape_geography (GEOGRAPHY): the union of all shape_lines
shape_length_km (FLOAT64): length of the shape in kilometers
Rows: 65k
Size: 718 MB
We keep shape_lines separated out in an ARRAY because shapes sometimes double back on themselves, and we want to keep those line segments separate instead of deduplicating them.
The squares table contains a grid of 1×1 km squares:
square_key (INT64): identifier of the grid square
square_geography (GEOGRAPHY): four-cornered polygon describing the grid square
Rows: 102k
Size: 15 MB
The shapes represent travel routes for vehicles. For each shape, we have computed emissions of harmful substances in a separate table. The aim is to calculate the emissions per grid square, assuming that they are evenly distributed along the route. To that end, we need to know what portion of the route shape intersects with each grid cell.
Here's the query to compute that:
SELECT
shape_key,
square_key,
SAFE_DIVIDE(
(
SELECT SUM(ST_LENGTH(ST_INTERSECTION(line, square_geography))) / 1000
FROM UNNEST(shape_lines) AS line
),
shape_length_km)
AS square_portion
FROM
shapes,
squares
WHERE
ST_INTERSECTS(shape_geography, square_geography)
Sadly, this query times out after 6 hours instead of producing a useful result.
In the worst case, the query can produce 6.6 billion rows, but that will not happen in practice. I estimate that each shape typically intersects maybe 50 grid squares, so the output should be around 65k * 50 = 3.3M rows; nothing that BigQuery shouldn't be able to handle.
I have considered the geographic join optimizations performed by BigQuery:
Spatial JOINs are joins of two tables with a predicate geographic function in the WHERE clause.
Check. I even rewrote my INNER JOIN to the equivalent "comma" join shown above.
Spatial joins perform better when your geography data is persisted.
Check. Both shape_geography and square_geography come straight from existing tables.
BigQuery implements optimized spatial JOINs for INNER JOIN and CROSS JOIN operators with the following standard SQL predicate functions: [...] ST_Intersects
Check. Just a single ST_Intersects call, no other conditions.
Spatial joins are not optimized: for LEFT, RIGHT or FULL OUTER joins; in cases involving ANTI joins; when the spatial predicate is negated.
Check. None of these cases apply.
So I think BigQuery should be able to optimize this join using whatever spatial indexing data structures it uses.
I have also considered the advice about cross joins:
Avoid joins that generate more outputs than inputs.
This query definitely generates more outputs than inputs; that's in its nature and cannot be avoided.
When a CROSS JOIN is required, pre-aggregate your data.
To avoid performance issues associated with joins that generate more outputs than inputs:
Use a GROUP BY clause to pre-aggregate the data.
Check. I already pre-aggregated the emissions data grouped by shapes, so that each shape in the shapes table is unique and distinct.
Use a window function. Window functions are often more efficient than using a cross join. For more information, see analytic functions.
I don't think it's possible to use a window function for this query.
I suspect that BigQuery allocates resources based on the number of input rows, not on the size of the intermediate tables or output. That would explain the pathological behaviour I'm seeing.
How can I make this query run in reasonable time?
I think the squares got inverted, resulting in almost-full Earth polygons:
select st_area(square_geography), * from `open-transport-data.public.squares`
Prints results like 5.1E14, which is roughly the full globe's area, so any line intersects almost all the squares. See the BigQuery docs for details: https://cloud.google.com/bigquery/docs/gis-data#polygon_orientation
You can invert them by running ST_GeogFromText(wkt, FALSE), which chooses the smaller polygon, ignoring polygon orientation. This works reasonably fast:
SELECT
  shape_key,
  square_key,
  SAFE_DIVIDE(
    (
      SELECT SUM(ST_LENGTH(ST_INTERSECTION(line, square_geography))) / 1000
      FROM UNNEST(shape_lines) AS line
    ),
    shape_length_km) AS square_portion
FROM
  `open-transport-data.public.shapes`,
  (SELECT
     square_key,
     ST_GEOGFROMTEXT(ST_ASTEXT(square_geography), FALSE) AS square_geography
   FROM `open-transport-data.public.squares`) squares
WHERE
  ST_INTERSECTS(shape_geography, square_geography)
Below would definitely not fit the comments format so I have to post this as an answer ...
I did three adjustments to your query:
using JOIN ... ON instead of CROSS JOIN ... WHERE (a sketch of this rewrite follows the list)
commenting out the square_portion calculation
using a destination table with the Allow Large Results option
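Assuming the same table and column names as in the question, the JOIN rewrite would look roughly like this (with the square_portion calculation commented out, as noted above):
SELECT
  shape_key,
  square_key
  -- , SAFE_DIVIDE(
  --     (SELECT SUM(ST_LENGTH(ST_INTERSECTION(line, squares.square_geography))) / 1000
  --      FROM UNNEST(shape_lines) AS line),
  --     shape_length_km) AS square_portion
FROM `open-transport-data.public.shapes` AS shapes
JOIN `open-transport-data.public.squares` AS squares
  ON ST_INTERSECTS(shapes.shape_geography, squares.square_geography)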
Even though you expected just 3.3M rows in the output, in reality it is about 6.6B (6,591,549,944) rows, as my experiment showed.
Note the warning about the Billing Tier, so you'd be better off using Reservations if available.
Obviously, un-commenting the square_portion calculation will increase slot usage, so you might need to revisit your requirements/expectations.

Detect loops in hierarchy using SQL Server query [duplicate]

I have parent child data in Excel which gets loaded into a 3rd party system running MS SQL Server. The data represents a directed (hopefully) acyclic graph. 3rd party means I don't have a completely free hand in the schema. The Excel data is a concatenation of other files, and the possibility exists that in the cross-references between the various files someone has caused a loop - i.e. X is a child of Y (X->Y), then elsewhere (Y->A->B->X). I can write VB, VBA etc. on the Excel side or on the SQL Server db. The Excel file is almost 30k rows, so I'm worried about a combinatorial explosion as the data is set to grow. Some of the techniques, like creating a table with all the paths, might therefore be pretty unwieldy. I'm thinking of simply writing a program that, for each root, does a tree traversal to each leaf and flags it if the depth gets greater than some nominal value.
Better suggestions or pointers to previous discussion welcomed.
You can use a recursive CTE to detect loops:
with prev as (
    select RowId, 1 as GenerationsRemoved
    from YourTable
    union all
    select YourTable.RowId, prev.GenerationsRemoved + 1
    from prev
    inner join YourTable
        on YourTable.ParentRowId = prev.RowId
        and prev.GenerationsRemoved < 55
)
select *
from prev
where GenerationsRemoved > 50
This does require you to specify a maximum recursion level: in this case the CTE recurses to a depth of 55, and any row that appears more than 50 generations deep is selected as erroneous.
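As a usage sketch, with hypothetical data in which rows 1, 2 and 3 form a parent cycle:
create table YourTable (RowId int, ParentRowId int);
insert into YourTable values
    (1, 3),    -- 1's parent is 3
    (2, 1),    -- 2's parent is 1
    (3, 2),    -- 3's parent is 2  => 1 -> 2 -> 3 -> 1 is a loop
    (4, NULL); -- 4 is a clean root
Running the CTE above against this data keeps returning rows 1, 2 and 3 at GenerationsRemoved values past 50, while row 4 never appears beyond depth 1, so the looping rows are easy to pick out.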

SQL Cross Apply Performance Issues

My database has a directory of about 2,000 locations scattered throughout the United States with zipcode information (which I have tied to lon/lat coordinates).
I also have a table function which takes two parameters (ZipCode & Miles) to return a list of neighboring zip codes (excluding the same zip code searched)
For each location I am trying to get the neighboring location ids. So if location #4 has three nearby locations, the output should look like:
4 5
4 24
4 137
That is, locations 5, 24, and 137 are within X miles of location 4.
I originally tried to use a cross apply with my function as follows:
SELECT A.SL_STORENUM, A.Sl_Zip, Q.SL_STORENUM
FROM tbl_store_locations AS A
CROSS APPLY (SELECT SL_StoreNum
             FROM tbl_store_locations
             WHERE SL_Zip IN (SELECT zipnum FROM udf_GetLongLatDist(A.Sl_Zip, 7))) AS Q
WHERE A.SL_StoreNum = '04'
However, that ran for over 20 minutes with no results, so I canceled it. I did try hardcoding in the zip code, and it immediately returned a list:
SELECT A.SL_STORENUM, A.Sl_Zip, Q.SL_STORENUM
FROM tbl_store_locations AS A
CROSS APPLY (SELECT SL_StoreNum
             FROM tbl_store_locations
             WHERE SL_Zip IN (SELECT zipnum FROM udf_GetLongLatDist('12345', 7))) AS Q
WHERE A.SL_StoreNum = '04'
What is the most efficient way of accomplishing this listing of nearby locations? Keep in mind that while I used "04" as an example here, I want to run the analysis for all 2,000 locations.
The "udf_GetLongLatDist" is a function which uses some math to calculate distance between two geographic coordinates and returns a list of zipcodes with a distance of > 0. Nothing fancy within it.
When you use the function you probably have to calculate every single possible distance for each row, which is why it takes so long. Since the actual physical locations don't generally move, what we always did was precalculate the distance from each zip code to every other zip code (and update it only once a month or so, when we added new possible zip codes). Once the distances are precalculated, all you have to do is run a query like
select zip2 from zipprecalc where zip1 = '12345' and distance <=10
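A sketch of how such a zipprecalc table could be built on SQL Server 2008+, assuming a hypothetical ZipCodes(zip, latitude, longitude) lookup table (this produces every ordered pair, so it's a sizable one-off build):
-- precalculate the distance (in miles) between every pair of zip codes;
-- rerun monthly or whenever new zip codes are added
SELECT z1.zip AS zip1,
       z2.zip AS zip2,
       geography::Point(z1.latitude, z1.longitude, 4326)
           .STDistance(geography::Point(z2.latitude, z2.longitude, 4326)) / 1609.344 AS distance
INTO zipprecalc
FROM ZipCodes AS z1
JOIN ZipCodes AS z2 ON z2.zip <> z1.zip;

CREATE CLUSTERED INDEX IX_zipprecalc ON zipprecalc (zip1, distance);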
We have something similar and optimized it by only calculating the distance of other zip codes whose latitude is within a bounded range. So if you want other zips within @miles, you use something like
where latitude >= @targetLat - (@miles / 69.2) and latitude <= @targetLat + (@miles / 69.2)
Then you are only calculating the great-circle distance of a much smaller subset of other zip code rows. We found this fast enough in our use to not require precalculating.
The same thing can't be done for longitude, because the distance represented by a degree of longitude varies between the equator and the poles.
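A minimal sketch of that bounding filter combined with a great-circle check, assuming the same hypothetical ZipCodes(zip, latitude, longitude) table as above:
DECLARE @targetLat float = 40.7128,
        @targetLon float = -74.0060,
        @miles     float = 7;

SELECT z.zip
FROM ZipCodes AS z
CROSS APPLY (SELECT cosval = COS(RADIANS(@targetLat)) * COS(RADIANS(z.latitude))
                           * COS(RADIANS(z.longitude) - RADIANS(@targetLon))
                           + SIN(RADIANS(@targetLat)) * SIN(RADIANS(z.latitude))) AS c
WHERE z.latitude BETWEEN @targetLat - (@miles / 69.2) AND @targetLat + (@miles / 69.2)
  -- great-circle distance in miles, computed only for the reduced latitude band;
  -- the CASE clamps rounding noise so ACOS never sees a value outside [-1, 1]
  AND 3959 * ACOS(CASE WHEN c.cosval > 1 THEN 1
                       WHEN c.cosval < -1 THEN -1
                       ELSE c.cosval END) <= @miles;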
Other answers here involve re-working the algorithm. I personally advise the pre-calculated map of all zipcodes against each other. It should be possible to embed such optimisations in your existing udf, to minimise code-changes.
A refactoring of the query, however, could be as follows...
SELECT
A.SL_STORENUM, A.Sl_Zip, C.SL_STORENUM
FROM
tbl_store_locations AS A
CROSS APPLY
dbo.udf_GetLongLatDist(A.Sl_Zip,7) AS B
INNER JOIN
tbl_store_locations AS C
ON C.SL_Zip = B.zipnum
WHERE
A.SL_StoreNum='04'
Also, the performance of the CROSS APPLY will benefit greatly if you can ensure that the udf is INLINE rather than MULTI-STATEMENT. This allows the udf to be expanded inline (macro like) for a much cleaner execution plan.
Doing so would also allow you to return additional fields from the udf. The optimiser can then include or exclude those fields from the plan depending on whether you actually use them. Such an example would be to include the SL_StoreNum if it's easily accessible from the query in the udf, and so remove the need for the last join...
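To illustrate the inline vs. multi-statement distinction, an inline TVF has this single-statement shape; the body below is only a placeholder, since the question doesn't show udf_GetLongLatDist's internals, and the ZipCodes table is assumed:
-- inline table-valued function: a single RETURN (SELECT ...), which the optimizer
-- can expand into the calling query like a parameterized view
CREATE FUNCTION dbo.udf_GetLongLatDist_Inline (@zip varchar(10), @miles float)
RETURNS TABLE
AS
RETURN
(
    SELECT z2.zip AS zipnum
    FROM ZipCodes AS z1
    JOIN ZipCodes AS z2 ON z2.zip <> z1.zip
    WHERE z1.zip = @zip
      -- the real distance math from the original udf would go here
      AND z2.latitude BETWEEN z1.latitude - (@miles / 69.2)
                          AND z1.latitude + (@miles / 69.2)
);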

Select pair of rows that obey a rule

I have a big table (1M rows) with the following columns:
source, dest, distance.
Each row defines a link (from A to B).
I need to find the distance between a pair of nodes using another node.
An example:
If want to find the distance between A and B,
If I find a node x and have:
x -> A
x -> B
I can add these distances and have the distance between A and B.
My question:
How can I find all the nodes (such as x) and get their distances to (A and B)?
My purpose is to select the min value of distance.
P.S.: A and B are just one connection (I need to do it for 100K connections).
Thanks!
As Andomar said, you'll need Dijkstra's algorithm; here's a link to that algorithm in T-SQL: T-SQL Dijkstra's Algorithm
Assuming you want to get the path from A to B with many intermediate steps, it is impossible to do in plain SQL for an indefinite number of steps. Simply put, it lacks the expressive power; see http://en.wikipedia.org/wiki/Expressive_power#Expressive_power_in_database_theory . As Andomar said, load the data into a process and use Dijkstra's algorithm.
This sounds like the traveling salesman problem.
From a SQL syntax standpoint: CONNECT BY PRIOR would build the tree you're after, using START WITH, and limit the number of layers it can traverse; however, doing so will not guarantee the minimum.
I may get downvoted for this, but I find this an interesting problem. I wish that this could be a more open discussion, as I think I could learn a lot from this.
It seems like it should be possible to achieve this by doing multiple select statements - something like SELECT id FROM mytable WHERE source="A" ORDER BY distance ASC LIMIT 1. Wrapping something like this in a while loop, and replacing "A" with an id variable, would do the trick, no?
For example (A is source, B is final destination):
DECLARE var_id as INT
WHILE var_id != 'B'
BEGIN
SELECT id INTO var_id FROM mytable WHERE source="A" ORDER BY distance ASC LIMIT 1
SELECT var_id
END
Wouldn't something like this work? (The code is sloppy, but the idea seems sound.) Comments are more than welcome.
Join the table to itself with destination joined to source. Add the distance from the two links. Insert that as a new link with left side source, right side destination and total distance if that isn't already in the table. If that is in the table but with a shorter total distance then update the existing row with the shorter distance.
Repeat this until you get no new links added to the table and no updates with a shorter distance. Your table now contains a link for every possible combination of source and destination with the minimum distance between them. It would be interesting to see how many repetitions this would take.
This will not track the intermediate path between source and destination but only provides the shortest distance.
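A rough sketch of that iteration, assuming the table is named links(source, dest, distance) (names are placeholders; this is the brute-force closure, not a tuned implementation):
DECLARE @changed int = 1;
WHILE @changed > 0
BEGIN
    -- shorten existing links where some two-hop path is cheaper
    UPDATE l
    SET l.distance = p.total
    FROM links AS l
    JOIN (SELECT a.source, b.dest, MIN(a.distance + b.distance) AS total
          FROM links AS a
          JOIN links AS b ON b.source = a.dest
          GROUP BY a.source, b.dest) AS p
      ON p.source = l.source AND p.dest = l.dest
    WHERE p.total < l.distance;

    SET @changed = @@ROWCOUNT;

    -- add two-hop combinations that don't exist as direct links yet
    INSERT INTO links (source, dest, distance)
    SELECT a.source, b.dest, MIN(a.distance + b.distance)
    FROM links AS a
    JOIN links AS b ON b.source = a.dest
    WHERE a.source <> b.dest
      AND NOT EXISTS (SELECT 1 FROM links AS l
                      WHERE l.source = a.source AND l.dest = b.dest)
    GROUP BY a.source, b.dest;

    SET @changed = @changed + @@ROWCOUNT;
END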
IIUC this should do, but I'm not sure it is really viable (performance-wise) due to the large number of rows involved and to the CROSS JOIN:
SELECT
    t1.source AS A,
    t1.dest AS x,
    t2.dest AS B,
    t1.distance + t2.distance AS total_distance
FROM
    big_table AS t1
CROSS JOIN
    big_table AS t2
WHERE
    t1.dest = t2.source
    AND t1.source = 'insert source (A) here'
    AND t2.dest = 'insert destination (B) here'
ORDER BY
    total_distance ASC
LIMIT
    1
The above snippet will work for the case in which you have two rows of the form A->x and x->B, but not for other combinations (e.g. A->x and B->x). Extending it to cover all four combinations should be trivial (e.g. create a view that duplicates each row and swaps source and dest).
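For example, a minimal sketch of that swapped-and-duplicated view, assuming the same big_table(source, dest, distance) columns:
-- treat every link as bidirectional by unioning each row with its reverse
CREATE VIEW big_table_undirected AS
SELECT source, dest, distance FROM big_table
UNION ALL
SELECT dest AS source, source AS dest, distance FROM big_table;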

SQL - Getting the max effective date less than a date in another table

I'm currently working on a conversion script to transfer a bunch of old data out of a SQL Server 2000 database and into SQL Server 2008. One of the things I'm trying to accomplish during this conversion is to eliminate all of the composite keys and replace them with a "proper" primary key. Obviously, when I transfer the data I need to inject the foreign key values into the new table structures.
I'm currently stuck with one data set though and I can't seem to get my head around it in a set-based fashion. The two tables with which I am working are called Charge and Law. They have a 1:1 relationship and "link" on three columns. The first two are an equal link on the LawSource and LawStatute columns, but the third column is causing me problems. The ChargeDate column should link to the LawDate column where LawDate <= ChargeDate.
My current query is returning more than one row (in some cases) for a given Charge because the Law may have more than one LawDate that is less than or equal to the ChargeDate.
Here's what I currently have:
select LawId
from Law a
join Charge b on b.LawSource = a.LawSource
and b.LawStatute = a.LawStatute
and b.ChargeDate >= a.LawDate
Is there any way I can rewrite this to get the most recent entry in the Law table whose date is the same as (or earlier than) the ChargeDate?
This would be easier in SQL Server 2008 with the windowing (OVER ... PARTITION BY) functions (so it should be easier for you in the future).
The usual caveats of "I don't have your schema, so this isn't tested" apply, but I think it should do what you need.
select
    l.LawID
from
    Law l
    join (
        select
            a.LawSource,
            a.LawStatute,
            max(a.LawDate) as LawDate
        from
            Law a
            join Charge b on b.LawSource = a.LawSource
                and b.LawStatute = a.LawStatute
                and b.ChargeDate >= a.LawDate
        group by
            a.LawSource, a.LawStatute
    ) d on l.LawSource = d.LawSource
       and l.LawStatute = d.LawStatute
       and l.LawDate = d.LawDate
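For reference, a sketch of the SQL Server 2008 windowing approach the answer alludes to; it assumes Charge has a single-column key (ChargeId here is hypothetical):
select LawId
from (
    select a.LawId,
           row_number() over (partition by b.ChargeId          -- one row kept per charge
                              order by a.LawDate desc) as rn   -- latest LawDate on or before the ChargeDate
    from Law a
    join Charge b on b.LawSource = a.LawSource
        and b.LawStatute = a.LawStatute
        and b.ChargeDate >= a.LawDate
) ranked
where rn = 1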
If performance is not an issue, cross apply provides a very readable way:
select *
from Charge c
cross apply
(
    select top 1 *
    from Law
    where LawSource = c.LawSource
      and LawStatute = c.LawStatute
      and LawDate <= c.ChargeDate
    order by
        LawDate desc
) l
For each Charge, this looks up the row in the Law table with the latest LawDate that is not after the ChargeDate.
To include rows from Charge without a matching Law, change cross apply to outer apply.