We have a table of Customers, with each one's location stored in a Geography column, and a table of Branch Offices, also with each one's location in a Geography column (we populate the Geography columns from latitude and longitude columns).
We need to run a query (view) that's intended to show the closest branch office to each customer, based on the Geography columns. It runs fine with a couple of thousand customers, but we just received a big job that needs to run with 700,000 customers, and it takes hours. Can anyone suggest any ways to speed up this SQL?
WITH CLOSEST AS (
SELECT *, ROW_NUMBER()
OVER (
PARTITION BY CustNum
ORDER BY Miles
) AS RowNo
FROM
(
SELECT
CustNum,
BranchNum,
CONVERT(DECIMAL(10, 6), (BranchLoc.STDistance(CustLoc)) / 1609.344) AS Miles
FROM
Branch_Locations
CROSS JOIN
Cust_Locations
) AS T
)
SELECT TOP 100 PERCENT CustNum, BranchNum, Miles, RowNo FROM CLOSEST WHERE RowNo = 1 ORDER BY CustNum, MILES
Could there be a way to put the distance comparison into the JOIN? Nothing comes to mind so far...
Thanks for any suggestions!
So, what you're doing here is calculating the distance from each point to each other point, then ranking. SQL Server Spatial is actually set up in such a way that this is entirely unnecessary.
The first thing you want to do is create a spatial index on each table; documentation on how to do this can be found here. Don't worry too much about the specific parameters at this point; while you can definitely improve performance by tuning them, simply having a spatial index at all will make a drastic difference.
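For example, a minimal sketch using the table and column names from the question (the index names are made up, and the default tessellation settings are fine to start with; note that each table needs a primary key before a spatial index can be created):
CREATE SPATIAL INDEX SIdx_Cust_Locations_CustLoc
    ON Cust_Locations (CustLoc);

CREATE SPATIAL INDEX SIdx_Branch_Locations_BranchLoc
    ON Branch_Locations (BranchLoc);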
The second thing you want to do is to make sure the spatial index is being used; documentation on how to make sure this happens can be found here. Make sure that you filter out any null spatial information!
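One quick sanity check, using the hypothetical index name from the sketch above: hint the spatial index on a simple distance filter. If SQL Server cannot use the index for that predicate, the hinted query fails to produce a plan instead of silently scanning (the sample point and the 100-mile radius are only for illustration):
SELECT COUNT(*)
FROM Cust_Locations C WITH (INDEX (SIdx_Cust_Locations_CustLoc))
WHERE C.CustLoc IS NOT NULL
  AND C.CustLoc.STDistance(geography::Point(40.0, -75.0, 4326)) < 160934; -- ~100 miles in meters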
So, what this has given us so far is a way to take one point and find the closest point to it in another table; but this is SQL Server, so we want to do this set-based!
My recommendation is to use a little a priori knowledge and write the query around it.
WITH CLOSEST AS (
    SELECT
        C.CustNum,
        B.BranchNum,
        B.BranchLoc.STDistance(C.CustLoc) / 1609.344 AS Miles,
        ROW_NUMBER() OVER (PARTITION BY C.CustNum ORDER BY B.BranchLoc.STDistance(C.CustLoc) ASC) AS RowNo
    FROM
        Branch_Locations B
    INNER JOIN
        Cust_Locations C
    ON
        B.BranchLoc.STDistance(C.CustLoc) < 160934.4 --100 miles (expressed in meters so the spatial index stays usable) seems a reasonable maximum search distance to me
    WHERE
        B.BranchLoc IS NOT NULL
        AND C.CustLoc IS NOT NULL
)
SELECT
    CustNum,
    BranchNum,
    Miles,
    RowNo
FROM
    CLOSEST
WHERE
    RowNo = 1
ORDER BY
    CustNum,
    Miles
There are other techniques that you can use, such as my response here; but at the end of the day the most important takeaway is to create spatial indexes and make sure they are used.
Related
I have a table of existing customers and another table of potential customers. I want to return a list of potential customers, rank-ordered by the number of existing purchasers' radii that they appear in.
There are many rows in the potential customers table for each existing customer, and the radius around a given existing customer can encompass multiple potential customers. I want to return a list of potential customers ordered by the count of the existing customer radii that they fall within.
SELECT pur.contact_id AS purchaser, count(pot.*) AS nearby_potential_customers
FROM purchasers_geocoded pur, potential_customers_geocoded pot
WHERE ST_DWithin(pur.geom,pot.geom,1000)
GROUP BY purchaser;
Does anyone have advice on how to proceed?
EDIT:
With some help, I wrote this query, which seems to do the job, but I'm verifying now.
WITH prequalified_leads_table AS (
SELECT *
FROM nearby_potential_customers
WHERE market_val > 80000
AND market_val < 120000
)
, proximate_to_existing AS (
SELECT pot.prop_id AS prequalified_leads
FROM purchasers_geocoded pur, prequalified_leads_table pot
WHERE ST_DWithin(pot.geom,pur.geom,100)
)
SELECT prequalified_leads, count(prequalified_leads)
FROM proximate_to_existing
GROUP BY prequalified_leads
ORDER BY count(*) DESC;
I want to return a list of potential customers ordered by the count of the existing customer radii that they fall within.
Your query tried the opposite of your statement, counting potential customers around existing ones.
Inverting that, and after adding some tweaks:
SELECT pot.contact_id AS potential_customer
, rank() OVER (ORDER BY pur.nearby_customers DESC
, pot.contact_id) AS rnk
, pur.nearby_customers
FROM potential_customers_geocoded pot
LEFT JOIN LATERAL (
SELECT count(*) AS nearby_customers
FROM purchasers_geocoded pur
WHERE ST_DWithin(pur.geom, pot.geom, 1000)
) pur ON true
ORDER BY 2;
I suggest a subquery with LEFT JOIN LATERAL ... ON true to get counts. Should make use of the spatial index that you undoubtedly have:
CREATE INDEX ON purchasers_geocoded USING gist (geom);
Thereby retaining rows with 0 nearby customers in the result - your original join style would exclude those. Related:
What is the difference between LATERAL and a subquery in PostgreSQL?
Then ORDER BY the resulting nearby_customers in the outer query (not: nearby_potential_customers).
It's not clear whether you want to add an actual rank. Use the window function rank() if so. While I was at it, I made the rank deterministic by breaking ties with an additional ORDER BY expression: pot.contact_id. Otherwise, peers are returned in arbitrary order, which can change with every execution.
ORDER BY 2 is short syntax for "order by the 2nd output column". See:
Select first row in each GROUP BY group?
Related:
How do I query all rows within a 5-mile radius of my coordinates?
I would like to create a view that returns information about articles whose condition on individual warehouses fell below 20% compared to the previous day.
My table structure is as follows:
I have no idea how to create such a view. Any help or suggestion is welcome. Thanks in advance!
Your question is a bit vague. For instance, what if data for a day is missing? You also mention "warehouses", but there is no such field in the data. Similarly, "condition" is a bit hard to follow. That said, let me assume that you mean the previous day present in the data for each individual article, and that you are interested in quantities that dropped by more than 20%.
select t.*
from (select t.*,
lag(t.quantity) over (partition by articlename order by dateadd) as prev_quantity
from t
) t
where t.quantity < t.prev_quantity * (1 - 0.2);
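Since you asked for a view, wrapping that query is all that's left; a minimal sketch, with a placeholder view name and t standing in for your actual table:
create view articles_with_quantity_drop as
select t.*
from (select t.*,
             lag(t.quantity) over (partition by articlename order by dateadd) as prev_quantity
      from t
     ) t
where t.quantity < t.prev_quantity * (1 - 0.2);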
I have a table that has 14,091 rows (2 columns, let's say first name, last name). I then have a calendar table that has 553 rows of just dates (first of each month). I do a cross join in order to get every combination of first name, last name, & first of month because this is my requirement. This takes just over a minute.
Is there anything I can do about this to make it faster or can a cross join never get any faster like I suspect?
People Table
first_name varchar2(100)
last_name varchar2(1000)
Dates Table
dt DateTime
select a.first_name, a.last_name, b.dt
from people a, dates b
It will be slow, as it is making all possible combinations: 14091 * 553. It is not going to get fast unless you add an index or use an inner join.
Yeah. It takes over a minute. Let's get this clear: you're talking about 14091 * 553 rows - that is 7,792,323, or roughly 7.8 million rows. And you're loading them into a DataTable (which is not known for performance).
Want to see slow? Put them into a grid. THEN you see slow.
The requirements make no sense in a table. None. Absolutely none.
And no, there is no way to speed up the loading of 7.8 million rows into a data structure that is not meant to hold that amount of data.
Currently I am working on generating demographics for a database, and we have added a geography column to one of the tables. For the demographics I have to produce the max, min and avg of columns, among other things.
Using
select MIN(Location) FROM SpatialTable
didn't work, as the geography datatype is not comparable.
So I used the following query:
SELECT Location
FROM SpatialTable
WHERE Location.Lat IN (SELECT MIN(Location.Lat)
FROM SpatialTable
WHERE Location.Long IN (SELECT MIN(Location.Long)
FROM SpatialTable))
which basically selects the records with the minimum longitude and then, among those records, selects the one with the minimum latitude. But this can also be done the other way round, where the MIN latitude is selected first and among those the MIN longitude, like this:
SELECT
Location
FROM
SpatialTable
WHERE
Location.Long IN
(SELECT MIN(Location.Long)
FROM SpatialTable
WHERE Location.Lat IN (SELECT MIN(Location.Lat) FROM SpatialTable))
which may produce a different result.
Is there a precise way to compare geography data? I am using SQL Server 2008 R2, and my table has one Location column of geography type and an identity column.
To determine the minimum of a geography type, first you have to define what you mean by minimum. What is the minimum of a geography? It's like asking
what is the minimum of a dog?
How can one geography be less or more than another? Is London less than Paris*? Answer this, and you'll have your answer. At a guess, I'd say your answer may be the STDistance function.
*No, it's greater. Any fule knows that
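For example, if "minimum" turns out to mean "the row closest to some reference point", a sketch along these lines would do it (the reference point and SRID below are made up for illustration):
DECLARE @Ref geography = geography::Point(51.5074, -0.1278, 4326); -- London

SELECT TOP (1) Location
FROM SpatialTable
ORDER BY Location.STDistance(@Ref);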
Regardless of the geography type, you can use the ROW_NUMBER() function to get the row with the min (or max) value by whatever criteria you define.
SELECT x.* FROM
(
SELECT *
, ROW_NUMBER() OVER (ORDER BY Location.Lat, Location.Long) RN
FROM SpatialTable
) x
WHERE x.RN = 1
I have a table with close to 30 million records and just a few columns. One of the columns, 'Born', has no more than 30 different values, and there is an index defined on it. I need to be able to filter on that column and efficiently page through the results.
For now I have this (example where the year I'm searching for is '1970'; it is a parameter in my stored procedure):
WITH PersonSubset as
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Born asc) AS Row
FROM Person WITH (INDEX(IX_Person_Born))
WHERE Born = '1970'
)
SELECT *, (SELECT count(*) FROM PersonSubset) AS TotalPeople
FROM PersonSubset
WHERE Row BETWEEN 0 AND 30
Every query of that sort (only Born parameter used) returns just over 1 million results.
I've noticed the biggest overhead is on the count used to return the total results. If I remove (SELECT count(*) FROM PersonSubset) AS TotalPeople from the select clause the whole thing speeds up a lot.
Is there a way to speed up the count in that query? What I care about is getting the paged results back along with the total count.
Updated following discussion in comments
The cause of the problem here is very low cardinality of the IX_Person_Born index.
SQL indexes are very good at quickly narrowing down values, but they have problems when you have lots of records with the same value.
You can think of it as like the index of a phone book - if you want to find "Smith, John" you first find that there are lots of names that begin with S, and then pages and pages of people called Smith, and then lots of Johns. You end up scanning the book.
This is compounded because the index in the phone book is clustered - the records are sorted by surname. If instead you want to find everyone called "John" you'll be doing a lot of looking up.
Here there are 30 million records but only 30 different values, which means that the best possible index is still returning around 1 million records - at that sort of scale it might as well be a table-scan. Each of those 1 million results is not the actual record - it's a lookup from the index to the table (the page number in the phone book analogy), which makes it even slower.
A high cardinality index (say for full date of birth), rather than year would be much quicker.
This is a general problem for all OLTP relational databases: low cardinality + huge datasets = slow queries because index-trees don't help much.
In short: there's no significantly quicker way to get the count using T-SQL and indexes.
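If you want to confirm where the time goes, turning on the I/O and timing statistics for the bare count makes the roughly one-million-row index range scan visible (standard session settings, nothing specific to this schema):
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT count(*) FROM Person WHERE Born = '1970';

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;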
You have a couple of options:
1. Data Aggregation
Either OLAP/Cube rollups or do it yourself:
select Born, count(*)
from Person
group by Born
The pro is that cube lookups or checking your cache is very fast. The problem is that the data will get out of date and you need some way to account for that.
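A do-it-yourself rollup can be as simple as a small summary table that a scheduled job refreshes; the table name and the datatype of Born below are assumptions:
CREATE TABLE dbo.PersonCountsByBorn (Born varchar(10) NOT NULL PRIMARY KEY, NumRows int NOT NULL);

-- refresh on a schedule, accepting that the counts lag slightly behind reality
TRUNCATE TABLE dbo.PersonCountsByBorn;
INSERT INTO dbo.PersonCountsByBorn (Born, NumRows)
SELECT Born, COUNT(*) FROM dbo.Person GROUP BY Born;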
2. Parallel Queries
Split into two queries:
SELECT count(*)
FROM Person
WHERE Born = '1970'
SELECT TOP 30 *
FROM Person
WHERE Born = '1970'
Then run these either in parallel server side, or add it to the user interface.
3. No-SQL
This problem is one of the big advantages no-SQL solutions have over traditional relational databases. In a no-SQL system the Person table is federated (or sharded) across lots of cheap servers. When a user searches every server is checked at the same time.
At this point a technology change is probably out, but it may be worth investigating so I've included it.
I have had similar problems in the past with databases of this kind of size, and (depending on context) I've used both options 1 and 2. If the total here is for paging then I'd probably go with option 2 and AJAX call to get the count.
DECLARE @TotalPeople int
--does this query run fast enough? If not, there is no hope for a combo query.
SET @TotalPeople = (SELECT count(*) FROM Person WHERE Born = '1970')
WITH PersonSubset as
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Born asc) AS Row
FROM Person WITH (INDEX(IX_Person_Born))
WHERE Born = '1970'
)
SELECT *, @TotalPeople as TotalPeople
FROM PersonSubset
WHERE Row BETWEEN 0 AND 30
You usually can't take a slow query, combine it with a fast query, and wind up with a fast query.
One of the column 'Born' have not more than 30 different values and there is an index defined on it.
Either SQL Server isn't using the index or statistics, or the index and statistics aren't helpful enough.
Here is a desperate measure that will force SQL Server's hand (at the potential cost of making writes more expensive - measure that - and of blocking schema changes to the Person table while the view exists).
CREATE VIEW dbo.BornCounts WITH SCHEMABINDING
AS
SELECT Born, COUNT_BIG(*) as NumRows
FROM dbo.Person
GROUP BY Born
GO
CREATE UNIQUE CLUSTERED INDEX BornCountsIndex ON BornCounts(Born)
By putting a clustered index on a view, you make it a system maintained copy. The size of this copy is much smaller than 30 Million rows, and it has the exact information you're looking for. I did not have to change the query to get it to use the view, but you're free to use the view's name in the query if you like.
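If the optimizer doesn't pick the view up automatically (automatic indexed-view matching depends on edition and settings), referencing it directly with the NOEXPAND hint is an option:
SELECT NumRows
FROM dbo.BornCounts WITH (NOEXPAND)
WHERE Born = '1970';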
WITH PersonSubset as
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Born asc) AS Row
FROM Person WITH (INDEX(IX_Person_Born))
WHERE Born = '1970'
)
SELECT *, MAX(Row) OVER () AS TotalPeople
FROM PersonSubset
WHERE Row BETWEEN 0 AND 30
Why not do it like that?
Here is a novel approach using system DMVs, provided you can get by with a "good enough" count, you don't mind creating a filtered index for every distinct value of [Born], and you don't mind feeling a little bit dirty inside.
Create a filtered index for each year:
--pick a column to index, it doesn't matter which.
CREATE INDEX IX_Person_filt_1970 on Person ( id ) WHERE Born = '1970'
CREATE INDEX IX_Person_filt_1971 on Person ( id ) WHERE Born = '1971'
CREATE INDEX IX_Person_filt_1972 on Person ( id ) WHERE Born = '1972'
Then use the [rows] column from sys.partitions to get a rowcount.
WITH PersonSubset as
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Born asc) AS Row
FROM Person WITH (INDEX(IX_Person_Born))
WHERE Born = '1970'
)
SELECT *,
(
SELECT sum(rows)
FROM sys.partitions p
inner join sys.indexes i on p.object_id = i.object_id and p.index_id =i.index_id
inner join sys.tables t on t.object_id = i.object_id
WHERE t.name ='Person'
and i.name = 'IX_Person_filt_' + '1970' --or build this name from the stored procedure's year parameter
) AS TotalPeople
FROM PersonSubset
WHERE Row BETWEEN 0 AND 30
sys.partitions isn't guaranteed to be accurate in 100% of cases (usually it is exact or really close), and this approach won't work if you need to filter on anything but [Born].
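If you want to sanity-check how close the DMV number is for a given year, a one-off comparison like this (reusing the table and index names above) does the job:
SELECT
    (SELECT count(*) FROM Person WHERE Born = '1970') AS exact_count,
    (SELECT sum(p.rows)
     FROM sys.partitions p
     INNER JOIN sys.indexes i ON p.object_id = i.object_id AND p.index_id = i.index_id
     WHERE i.object_id = OBJECT_ID('Person')
       AND i.name = 'IX_Person_filt_1970') AS dmv_count;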