Query Optimization Problems (spatial) - sql

I have two datasets with spatial data.
Dataset 1 has approximately 15,000,000 records.
Dataset 2 has approximately 16,000,000 records.
Both are using the data type geography (GPS coordinates) and all records are points.
Both tables have spatial indexes with cells_per_object = 1 and the levels are (HIGH, HIGH, HIGH, HIGH)
All points are located in a, globally speaking, small area (1 U.S. state). The points are spread out enough to warrant using geography rather than a projection to geometry.
DECLARE @g GEOGRAPHY
SET @g = (SELECT TOP 1 GPSPoint FROM Dataset1)
EXEC sp_help_spatial_geography_index 'Dataset1', 'Dataset1_SpatialIndex', 0, @g
Shows
propvalue-propname
1-Total_Number_Of_ObjectCells_In_Level0_For_QuerySample
28178-Total_Number_Of_ObjectCells_In_Level1_In_Index
1-Total_Number_Of_ObjectCells_In_Level4_For_QuerySample
14923330-Total_Number_Of_ObjectCells_In_Level4_In_Index
1-Total_Number_Of_Intersecting_ObjectCells_In_Level1_In_Index
1-Total_Number_Of_Intersecting_ObjectCells_In_Level4_For_QuerySample
14923330-Total_Number_Of_Intersecting_ObjectCells_In_Level4_In_Index
1-Total_Number_Of_Border_ObjectCells_In_Level0_For_QuerySample
28177-Total_Number_Of_Border_ObjectCells_In_Level1_In_Index
740-Number_Of_Rows_Selected_By_Primary_Filter
0-Number_Of_Rows_Selected_By_Internal_Filter
740-Number_Of_Times_Secondary_Filter_Is_Called
1-Number_Of_Rows_Output
99.99504-Percentage_Of_Rows_NotSelected_By_Primary_Filter
0-Percentage_Of_Primary_Filter_Rows_Selected_By_Internal_Filter
0-Internal_Filter_Efficiency
0.135135-Primary_Filter_Efficiency
Which means that the query
DECLARE @g GEOGRAPHY
SET @g = (SELECT TOP 1 GPSPoint FROM Dataset1)
SELECT TOP 1
*
FROM
Dataset2 D
WHERE
@g.Filter(D.GPSPoint.STBuffer(1)) = 1
Takes almost an hour to complete.
I have also tried doing
WITH TABLE1 AS (
SELECT
A.RecordID,
B.RecordID,
RANK() OVER (PARTITION BY A.RecordID ORDER BY A.GPSPoint.STDistance(B.GPSPoint) ASC) AS 'Ranking'
FROM
Dataset1 A
INNER JOIN
Dataset2 B
ON
B.GPSPoint.Filter(A.GPSPoint.STBuffer(1)) = 1
AND A.GPSPoint.STDistance(B.GPSPoint) <= 50
)
SELECT
*
FROM
TABLE1
WHERE
Ranking = 1
Which ends up being about 1,000 times faster, but at that rate what I am trying to do would still need a query running for six months. I honestly do not know what to do at this point. The end goal is a nearest-neighbor search: for every record in Dataset1, find the closest point in Dataset2. Done like this, it seems impossible.
Does anyone have any ideas where I could improve the efficiency of this process?

Try this: It is based on recommendations on MSDN.
SELECT TOP(1)
A.RecordID,
B.RecordID,
A.GPSPoint.STDistance(B.GPSPoint) AS Distance
FROM
Dataset1 A
INNER JOIN
Dataset2 B
ON
A.GPSPoint.STDistance(B.GPSPoint) <= 50
AND B.GPSPoint IS NOT NULL
ORDER BY A.GPSPoint.STDistance(B.GPSPoint) ASC
Note that I have removed the predicate below. Try the query above first, then add it back and see how it affects index usage.
AND B.GPSPoint.Filter(A.GPSPoint.STBuffer(1)) = 1
-- or try: AND B.GPSPoint.STIntersects(A.GPSPoint.STBuffer(1)) = 1
The following requirements must be met for a Nearest Neighbor query to use a spatial index:
A spatial index must be present on one of the spatial columns and the STDistance() method must use that column in the WHERE and ORDER BY clauses.
The TOP clause cannot contain a PERCENT statement.
The WHERE clause must contain an STDistance() method.
If there are multiple predicates in the WHERE clause then the predicate containing STDistance() method must be connected by an AND conjunction to the other predicates. The STDistance() method cannot be in an optional part of the WHERE clause.
The first expression in the ORDER BY clause must use the STDistance() method.
Sort order for the first STDistance() expression in the ORDER BY clause must be ASC.
All the rows for which STDistance returns NULL must be filtered out.

Related

How to pull rows from a SQL table until quotas for multiple columns are met?

I've been able to find a few examples of questions similar to this one, but most only involve a single column being checked.
SQL Select until Quantity Met
Select rows until condition met
I have a large table representing facilities, with columns for each type of resource available and the number of those specific resources available per facility. I want this stored procedure to be able to take integer values in as multiple parameters (representing each of these columns) and a Lat/Lon. Then it should iterate over the table sorted by distance, and return all rows (facilities) until the required quantity of available resources (specified by the parameters) are met.
Data source example:
Id  Lat     Long  Resource1  Resource2  ...
1   50.123  4.23  5          12         ...
2   61.234  5.34  0          9          ...
3   50.634  4.67  21         18         ...
Result Wanted:
@latQuery = 50.634
@LongQuery = 4.67
@res1Query = 10
@res2Query = 20
Id  Lat     Long  Resource1  Resource2  ...
3   50.634  4.67  21         18         ...
1   50.123  4.23  5          12         ...
The result includes every row needed until each quota is met individually. The result is also sorted by distance to the requested lat/lon.
I'm able to sort the results by distance, and sum the total running values as suggested in other threads, but I'm having some trouble with the logic comparing the running values with the quota provided in the params.
First I have some CTEs to get most recent edits, order by distance and then sum the running totals
WITH cte1 AS (SELECT
@origin.STDistance(geography::Point(Facility.Lat, Facility.Long, 4326)) AS distance,
Facility.Resource1 as res1,
Facility.Resource2 as res2
-- ...etc
FROM Facility
),
cte2 AS (SELECT
distance,
res1,
SUM(res1) OVER (ORDER BY distance) AS totRes1,
res2,
SUM(res2) OVER (ORDER BY distance) AS totRes2
-- ...etc, there's 15-20 columns here
FROM cte1
)
Next, with the results of that CTE, I need to pull rows until all quotas are met. This is where I'm having issues: it works for one row, but my logic with all the ANDs isn't exactly right.
SELECT * FROM cte2 WHERE (
(totRes1 <= @res1Query OR (totRes1 > @res1Query AND totRes1 - res1 <= @totRes1)) AND
(totRes2 <= @res2Query OR (totRes2 > @res2Query AND totRes2 - res2 <= @totRes2)) AND
-- ... I also feel like this method of pulling the next row once it's over may be convoluted as well?
)
As-is right now, it's mostly returning nothing, and I'm guessing it's because it's too strict? Essentially, I want to be able to let the total values go past the required values until they are all past the required values, and then return that list.
Has anyone come across a better method of searching using separate quotas for multiple columns?
See my update in the answers/comments
I think you are massively over-complicating this. This does not need any joins, just some running sum calculations, and the right OR logic.
The key to solving this is that you need all rows where the running sum up to the previous row is still below the requirement for at least one resource. In other words, you keep every row up to and including the first row at which the last outstanding requirement is met or exceeded.
To do this you can subtract the current row's value from the running sum.
You could utilize a ROWS specification of ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING. But then you need to deal with NULL on the first row.
In any event, even a regular running sum should always use ROWS UNBOUNDED PRECEDING, because the default is RANGE UNBOUNDED PRECEDING, which is subtly different and can cause incorrect results, as well as being slower.
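To make the ROWS versus RANGE point concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for SQL Server (an assumption on my part; it needs SQLite 3.25 or later for window functions, and the table `t` and its values are made up). SQLite has the same default frame, so two rows tied on `distance` both get the sum of the whole tie group unless ROWS is specified. The extra `id` in the ROWS version is only there to make the order of the tied rows deterministic for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, distance REAL, res INTEGER)")
# Hypothetical data: ids 2 and 3 are tied on distance.
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(1, 1.0, 5), (2, 2.0, 10), (3, 2.0, 20), (4, 3.0, 1)],
)

rows = conn.execute(
    """
    SELECT id,
           -- default frame: RANGE UNBOUNDED PRECEDING (tied rows are summed together)
           SUM(res) OVER (ORDER BY distance) AS range_sum,
           -- explicit ROWS frame: a true row-by-row running sum
           SUM(res) OVER (ORDER BY distance, id ROWS UNBOUNDED PRECEDING) AS rows_sum
    FROM t
    ORDER BY distance, id
    """
).fetchall()

for row in rows:
    print(row)
```

For id 2 the default frame already includes id 3's value, so any "running total minus current row" logic built on it would be off for tied rows.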
You can also factor out the distance calculation into a CROSS APPLY (VALUES ...), avoiding the need for lots of CTEs or derived tables. You now only need one level of derivation.
DECLARE @origin geography = geography::Point(@latQuery, @LongQuery, 4326);
SELECT
f.Id,
f.Lat,
f.Long,
f.Resource1,
f.Resource2
FROM (
SELECT f.*,
SumRes1 = SUM(f.Resource1) OVER (ORDER BY v1.Distance ROWS UNBOUNDED PRECEDING) - f.Resource1,
SumRes2 = SUM(f.Resource2) OVER (ORDER BY v1.Distance ROWS UNBOUNDED PRECEDING) - f.Resource2
FROM Facility f
CROSS APPLY (VALUES(
@origin.STDistance(geography::Point(f.Lat, f.Long, 4326))
)) v1(Distance)
) f
WHERE (
f.SumRes1 < @res1Query
OR f.SumRes2 < @res2Query
);
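If you want to try the same logic without a SQL Server instance, the following is a rough sketch in Python with sqlite3 (SQLite 3.25+). SQLite has no geography type, so I am assuming a precomputed `Distance` column with made-up values, and `facilities_for` is just a hypothetical helper name. The running-sum-minus-current-row expression and the OR filter are the same idea as in the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Facility (Id INTEGER, Distance REAL, Resource1 INTEGER, Resource2 INTEGER)"
)
# Rows from the question; the Distance values are invented for illustration.
conn.executemany(
    "INSERT INTO Facility VALUES (?, ?, ?, ?)",
    [(3, 0.0, 21, 18), (1, 60.0, 5, 12), (2, 1200.0, 0, 9)],
)


def facilities_for(res1_quota, res2_quota):
    """Return facility Ids, nearest first, until both quotas are covered."""
    sql = """
        SELECT Id FROM (
            SELECT f.*,
                   -- resources available from all *closer* facilities
                   SUM(Resource1) OVER (ORDER BY Distance ROWS UNBOUNDED PRECEDING)
                       - Resource1 AS SumRes1,
                   SUM(Resource2) OVER (ORDER BY Distance ROWS UNBOUNDED PRECEDING)
                       - Resource2 AS SumRes2
            FROM Facility AS f
        )
        -- keep a row while at least one quota was still unmet before it
        WHERE SumRes1 < :q1 OR SumRes2 < :q2
        ORDER BY Distance
    """
    return [r[0] for r in conn.execute(sql, {"q1": res1_quota, "q2": res2_quota})]


print(facilities_for(10, 20))
print(facilities_for(10, 0))
```

With quotas of 10 and 20 this returns facilities 3 and 1, matching the wanted result in the question. A quota of 0 never satisfies `SumRes < 0`, so resources that are not being searched for drop out of the filter without special handling.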
Was able to figure out the problem on my own here. The primary issue I was running into was that I was comparing 25 different columns' running totals versus the 25 stored proc parameters (quotas of resources required by the search).
Changing the lines such as these
(totRes1 <= @res1Query OR (totRes1 > @res1Query AND totRes1 - res1 <= @totRes1)) AND --...
to
(totRes1 <= @res1Query OR (totRes1 > @res1Query AND totRes1 - res1 <= @totRes1) OR @res1Query = 0) AND --...
(adding in the OR @res1Query = 0) solved my issue.
In other words, the search is often for only one or two columns (types of resources), leaving the others at zero. The way my logic was set up caused it to skip over lots of rows, because it instantly marked them as having met the quota (value less than or equal to the quota). As @A Neon Tetra suggested, I was pretty close to it already.
Update:
First attempt didn't exactly fix my own issue. Posting the stripped down version of my code that is now working for me.
DECLARE @Lat AS DECIMAL(12,6)
DECLARE @Lon AS DECIMAL(12,6)
DECLARE @res1Query AS INT
DECLARE @res2Query AS INT
-- repeat for Resource 3 through 25, etc...
DECLARE @origin geography = geography::Point(@Lat, @Lon, 4326);
-- CTE to be able to expose distance
WITH cte AS (SELECT TOP(99999) -- --> this is hacky, it won't let me order by distance unless I'm selecting TOP(x) or some other fn?
dbo.Facility.FacilityID,
dbo.Facility.Lat,
dbo.Facility.Lon,
@origin.STDistance(geography::Point(dbo.Facility.Lat, dbo.Facility.Lon, 4326))
AS distance,
dbo.Facility.Resource1 AS res1,
dbo.Facility.Resource2 AS res2,
-- repeat for Resource 3 through 25, etc...
FROM dbo.Facility
ORDER BY distance),
-- second CTE - has access to distance so we can keep track of a running total ordered by distance
---> have to separate into two since you can't reference the same alias (distance) again within the same SELECT
fullCTE AS (SELECT
FacilityID,
Lat,
Lon,
distance,
res1,
SUM(res1) OVER (ORDER BY distance) AS totRes1,
res2,
SUM(res2) OVER (ORDER BY distance) AS totRes2,
-- repeat for Resource 3 through 25, etc...
FROM cte)
SELECT * -- Customize what you're pulling here for your output as needed
FROM dbo.Facility INNER JOIN fullCTE ON (fullCTE.FacilityID = dbo.Facility.FacilityID)
WHERE EXISTS
(SELECT
FacilityID
FROM fullCTE WHERE (
FacilityID = dbo.Facility.FacilityID AND
-- Keep pulling rows until all conditions are met, as opposed to pulling rows while they're under the quota
NOT (
((totRes1 - res1 >= @res1Query AND @res1Query <> 0) OR (@res1Query = 0)) AND
((totRes2 - res2 >= @res2Query AND @res2Query <> 0) OR (@res2Query = 0)) AND
-- repeat for Resource 3 through 25, etc...
)
)
)

Athena/Presto | Can't match ID row on self join

I'm trying to get the bi-grams on a string column.
I've followed the approach here but Athena/Presto is giving me errors at the final steps.
Source code so far
with word_list as (
SELECT
transaction_id,
words,
n,
regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)') as f70,
f70_remittance_info
FROM exploration_transaction
cross join unnest(regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)')) with ordinality AS t (words, n)
where cardinality((regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)'))) > 1
and f70_remittance_info is not null
limit 50 )
select wl1.f70, wl1.n, wl1.words, wl2.f70, wl2.n, wl2.words
from word_list wl1
join word_list wl2
on wl1.transaction_id = wl2.transaction_id
The specific issue I'm having is on the very last line: when I try to self-join on the transaction ids, it always returns zero rows. It does work if I join only on wl1.n = wl2.n-1 (the position in the array), which is useless if I can't constrain it to the same id.
Athena doesn't support the ngrams function by presto, so I'm left with this approach.
Any clues why this isn't working?
Thanks!
This is speculation. But I note that your CTE is using limit with no order by. That means that an arbitrary set of rows is being returned.
Although some databases materialize CTEs, many do not. They run the code independently each time it is referenced. My guess is that the code is run independently and the arbitrary set of 50 rows has no transaction ids in common.
One solution would be to add order by transaction_id in the subquery, so that both references to the CTE pick the same 50 rows.
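For what it's worth, here is a small sketch of the intended bigram self-join once both sides are guaranteed to see the same rows. It uses Python with sqlite3 rather than Athena (my assumption, purely so it can be run locally): the tokenizing is done with `re.findall` as a stand-in for `regexp_extract_all ... with ordinality`, the word list is materialized once as a real table, and the sample strings are invented.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE word_list (transaction_id INTEGER, n INTEGER, word TEXT)")

# Hypothetical remittance strings keyed by transaction id.
transactions = {1: "invoice payment march", 2: "rent"}
for tid, text in transactions.items():
    words = re.findall(r"[a-zA-Z]+", text)
    # n is the 1-based position of the word, like WITH ORDINALITY
    conn.executemany(
        "INSERT INTO word_list VALUES (?, ?, ?)",
        [(tid, n, w) for n, w in enumerate(words, start=1)],
    )

bigrams = conn.execute(
    """
    SELECT wl1.transaction_id, wl1.word, wl2.word
    FROM word_list AS wl1
    JOIN word_list AS wl2
      ON wl1.transaction_id = wl2.transaction_id  -- same transaction
     AND wl1.n = wl2.n - 1                        -- adjacent positions
    ORDER BY wl1.transaction_id, wl1.n
    """
).fetchall()

print(bigrams)
```

Because the table is built once, the two aliases cannot disagree about which transactions exist, which is the situation the deterministic order by is meant to restore for the CTE.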

Nested subquery in Access alias causing "enter parameter value"

I'm using Access (I normally use SQL Server) for a little job, and I'm getting "enter parameter value" for Night.NightId in the statement below that has a subquery within a subquery. I expect it would work if I wasn't nesting it two levels deep, but I can't think of a way around it (query ideas welcome).
The scenario is pretty simple, there's a Night table with a one-to-many relationship to a Score table - each night normally has 10 scores. Each score has a bit field IsDouble which is normally true for two of the scores.
I want to list all of the nights, with a number next to each representing how many of the top 2 scores were marked IsDouble (would be 0, 1 or 2).
Here's the SQL, I've tried lots of combinations of adding aliases to the column and the tables, but I've taken them out for simplicity below:
select Night.*
,
( select sum(IIF(IsDouble,1,0)) from
(SELECT top 2 * from Score where NightId=Night.NightId order by Score desc, IsDouble asc, ID)
) as TopTwoMarkedAsDoubles
from Night
This is a bit of speculation. However, some databases have issues with correlation conditions in multiply nested subqueries. MS Access might have this problem.
If so, you can solve this by using aggregation with a where clause that chooses the top two values:
select s.nightid,
sum(IIF(IsDouble, 1, 0)) as TopTwoMarkedAsDoubles
from Score as s
where s.id in (select top 2 s2.id
from score as s2
where s2.nightid = s.nightid
order by s2.score desc, s2.IsDouble asc, s2.id
)
group by s.nightid;
If this works, it is a simple matter to join Night back in to get the additional columns.
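A rough way to check the logic outside Access is the sketch below, in Python with sqlite3. This is an assumption-heavy translation: `LIMIT 2` stands in for `TOP 2`, `CASE` stands in for `IIF`, and the `Name` column and sample rows are invented. The correlation is only one level deep, and Night is joined back in with a left join so nights without scores show 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Night (NightId INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Score (ID INTEGER PRIMARY KEY, NightId INTEGER,
                        Score INTEGER, IsDouble INTEGER);
    INSERT INTO Night VALUES (1, 'Mon'), (2, 'Tue'), (3, 'Wed');
    INSERT INTO Score VALUES
        (1, 1, 90, 1), (2, 1, 80, 0), (3, 1, 70, 1),
        (4, 2, 50, 1), (5, 2, 40, 1), (6, 2, 30, 0);
    """
)

rows = conn.execute(
    """
    SELECT n.NightId, n.Name,
           COALESCE(t.TopTwoMarkedAsDoubles, 0) AS TopTwoMarkedAsDoubles
    FROM Night AS n
    LEFT JOIN (
        SELECT s.NightId,
               SUM(CASE WHEN s.IsDouble THEN 1 ELSE 0 END) AS TopTwoMarkedAsDoubles
        FROM Score AS s
        -- keep only the two best scores of each night (single-level correlation)
        WHERE s.ID IN (SELECT s2.ID
                       FROM Score AS s2
                       WHERE s2.NightId = s.NightId
                       ORDER BY s2.Score DESC, s2.IsDouble ASC, s2.ID
                       LIMIT 2)
        GROUP BY s.NightId
    ) AS t ON t.NightId = n.NightId
    ORDER BY n.NightId
    """
).fetchall()

print(rows)
```

Night 1 has three scores but only the top two are counted, so its third score being a double does not affect the result.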
Your subquery can only see one level above it, so Night.NightId is unknown to it, which is why you are being prompted to enter a value. You can use a GROUP BY to get the value you want for each NightId, then correlate that back to the original Night table.
Select *
From Night
left join (
Select N.NightId
, sum(IIF(S.IsDouble,1,0)) as [Number of Doubles]
from Night N
inner join Score S
on S.NightId = N.NightId
group by N.NightId) NightsWithScores
on Night.NightId = NightsWithScores.NightId
Because of the IIF(S.IsDouble,1,0), I don't see the point in using top.

select outliers based on sigma and standard deviation in sql

The sample data is like this.
I want to select the outliers outside of 4 sigma for each class.
I tried
select value,class,AVG(value) as mean, STDEV(value) as st, size from Data
having value<mean-2*st OR value>mean+2*st group by class
It does not seem to work. Should I use a having or a where clause here?
The result I want is the whole 3rd and 8th rows.
When the condition you are looking at is a property of the row, use where i.e. where class = 1 (all rows with class 1) or where size > 2 (all rows with size > 2). When the condition is a property of a set of rows you use group by ... having e.g. group by class having avg(value) > 2 (all classes with average value > 2).
In this case you want where but there is a complication. You don't have enough information in each row alone to write the necessary where clause, so you will have to get it through a subquery.
Ultimately you want something like SELECT value, class, size FROM Data WHERE value < mean - 2 *st OR value > mean + 2*st; however you need a subquery to get mean and st.
One way to do this is:
SELECT value, Data.class, size, mean, st FROM Data
INNER JOIN (
SELECT class, AVG(value) AS mean, STDEV(value) AS st
FROM Data GROUP BY class
) AS stats ON stats.class = Data.class
WHERE value < mean - 2 * st OR value > mean + 2 * st;
This creates a subquery which gets your means and standard deviations for each class, joins those numbers to the rows with matching classes, and then applies your outlier check.
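Here is a runnable sketch of the same join-to-stats pattern in Python with sqlite3. SQLite has no built-in STDEV, so I am registering one myself with `create_aggregate`, backed by `statistics.stdev` (sample standard deviation, which I am assuming matches what STDEV means in your database). The table contents are made up.

```python
import sqlite3
import statistics


class SampleStdev:
    """User-defined aggregate: sample standard deviation, NULLs ignored."""

    def __init__(self):
        self.values = []

    def step(self, value):
        if value is not None:
            self.values.append(value)

    def finalize(self):
        # stdev needs at least two points; return NULL otherwise
        return statistics.stdev(self.values) if len(self.values) > 1 else None


conn = sqlite3.connect(":memory:")
conn.create_aggregate("STDEV", 1, SampleStdev)
conn.execute("CREATE TABLE Data (value REAL, class TEXT)")
# Hypothetical sample: class A contains one obvious outlier (50).
conn.executemany(
    "INSERT INTO Data VALUES (?, ?)",
    [(v, "A") for v in (10, 11, 9, 10, 12, 10, 11, 50)]
    + [(v, "B") for v in (5, 5, 6, 5, 6, 5, 5, 6)],
)

outliers = conn.execute(
    """
    SELECT d.value, d.class
    FROM Data AS d
    INNER JOIN (
        SELECT class, AVG(value) AS mean, STDEV(value) AS st
        FROM Data
        GROUP BY class
    ) AS stats ON stats.class = d.class
    WHERE d.value < stats.mean - 2 * stats.st
       OR d.value > stats.mean + 2 * stats.st
    """
).fetchall()

print(outliers)
```

Note that the outlier itself inflates the mean and standard deviation of its class, so with very small groups a point has to be quite extreme before it falls outside the band.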

Find related "ordered pairs" in SQL

Let's say I have a table format that looks exactly like this:
I'd like to write a query that locates the maximum station for a given frame and output case (results are grouped by frame & output case) but also return the ordered P (& eventually V2, V3, T, M2 & M3) that would be associated with the maximum station. The desired query is shown below:
I can't for the life of me figure this out. I've posted a copy of the access database to my google drive: https://drive.google.com/folderview?id=0B9VpkDoFQISJOFcwS2RMSGJ5RVk&usp=sharing
select x.*, t.p
from (select frame, outputcase, max(station) as max_station
from tbl
group by frame, outputcase) x
inner join tbl t
on x.frame = t.frame
and x.outputcase = t.outputcase
and x.max_station = t.station
order by x.frame, x.outputcase;
Just as a note to avoid confusion: in that second column (t.p), t is the table alias and p is the column name.
The subquery, which I've assigned an alias of x, finds the max(station) for each unique combination of (frame, outputcase). That is what you want, but the problem does not stop there, you also want column p. The reason that couldn't be selected in the same query is because you would have had to group by it, and you don't want the max(station) for each combination of (frame, outputcase, p). You want the max(station) for each combination of (frame, outputcase).
Because we couldn't get column p in that first step, we have to join back to the original table using the value we obtained (which I've assigned an alias, max_station), and the obvious join conditions of frame and outputcase. So we join back to the original table on those 3 things, 2 of which are fields on the actual table, one of which was calculated in the subquery (max_station).
Because we've joined back to the original table, we can then select column p from the original table.
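The same aggregate-then-join-back pattern can be tried on a toy table. The sketch below uses Python with sqlite3 instead of Access, and the rows in `tbl` are invented; only the column names frame, outputcase, station and p come from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (frame INTEGER, outputcase TEXT, station REAL, p REAL)")
# Hypothetical element forces: two stations per (frame, outputcase).
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?, ?)",
    [
        (1, "DEAD", 0.0, -10.0), (1, "DEAD", 5.0, -12.5),
        (1, "LIVE", 0.0, -3.0), (1, "LIVE", 5.0, -4.0),
        (2, "DEAD", 0.0, -7.0), (2, "DEAD", 8.0, -9.0),
    ],
)

rows = conn.execute(
    """
    SELECT x.frame, x.outputcase, x.max_station, t.p
    FROM (SELECT frame, outputcase, MAX(station) AS max_station
          FROM tbl
          GROUP BY frame, outputcase) AS x
    INNER JOIN tbl AS t
       ON x.frame = t.frame
      AND x.outputcase = t.outputcase
      AND x.max_station = t.station   -- pick the row that holds the maximum
    ORDER BY x.frame, x.outputcase
    """
).fetchall()

for row in rows:
    print(row)
```

Each output row carries the p value from the row that actually holds the maximum station for that frame and output case. If two rows share the same maximum station within a group, both come back.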
The query below takes a while to run, but it returns the desired result:
SELECT t1.*
FROM [Element Forces - Frames] as t1
WHERE t1.Station In (SELECT TOP 1 t2.Station
FROM [Element Forces - Frames] as t2
WHERE t2.Frame = t1.Frame
ORDER BY t2.Station DESC)
ORDER BY t1.Frame ASC, t1.OutputCase ASC;
I still want to thank everyone who posted answers. I'm sure it's just syntax errors on my part that I was struggling with.