Comparing Geographic datatypes in SQL Server - sql

Currently I am working on generating demographics of a database and we have added geography datatype in one of the tables. For demographics I have to produce max, min and avg of columns with other things.
Using
select MIN(Location) FROM SpatialTable
didn't work as geography datatype is incomparable.
So I used following query :
SELECT Location
FROM SpatialTable
WHERE Location.Lat IN (SELECT MIN(Location.Lat)
FROM SpatialTable
WHERE Location.Long IN (SELECT MIN(Location.Long)
FROM SpatialTable))
which basically selects the records with minimum Longitudes and then among those records it selects the one with minimum Latitude. But this can also be done other way round in which first MIN latitude is selected and among them MIN longitude is selected, like this:
SELECT
Location
FROM
SpatialTable
WHERE
Location.Long IN
(SELECT MIN(Location.Long)
FROM SpatialTable
WHERE Location.Lat IN (SELECT MIN(Location.Lat) FROM SpatialTable))
which may produce different result.
Is there a precise way to compare geographic data? I am using SQL Server 2008 R2 and my table has one Location column of geography type and an identity column.

To determine the minimum of a geography type, first you have to define what you mean by minimum. What is the minimum of a geography? It's like asking
what is the minimum of a dog?
How can one geography be less or more than another? Is London less than Paris*? Answer this, and you'll have your answer. At a guess, I'd say your answer may be the STDistance function.
*No, it's greater. Any fule knows that

Independent of the geography question, you can use the ROW_NUMBER() function to get the row with the min (or max) value under whatever custom ordering you choose.
SELECT x.* FROM
(
SELECT *
, ROW_NUMBER() OVER (ORDER BY Location.Lat, Location.Long) RN
FROM SpatialTable
) x
WHERE x.RN = 1
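To illustrate that the "minimum" depends entirely on the ordering you choose, here is a minimal sketch using Python's sqlite3 module. SQLite has no geography type, so plain lat and lon columns stand in for Location.Lat and Location.Long, and ORDER BY ... LIMIT 1 plays the role of the RN = 1 filter. The table contents are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SpatialTable (id INTEGER PRIMARY KEY, lat REAL, lon REAL)")
conn.executemany(
    "INSERT INTO SpatialTable (lat, lon) VALUES (?, ?)",
    [(51.5, -0.1), (48.9, 2.4), (52.5, -1.9)],  # hypothetical points
)

def first_row(order_by):
    # order_by is a fixed, trusted string in this sketch, never user input
    return conn.execute(
        f"SELECT id, lat, lon FROM SpatialTable ORDER BY {order_by} LIMIT 1"
    ).fetchone()

min_by_lat_then_lon = first_row("lat, lon")  # lowest latitude wins
min_by_lon_then_lat = first_row("lon, lat")  # lowest longitude wins
print(min_by_lat_then_lon)
print(min_by_lon_then_lat)
```

The two orderings pick different rows, which is why the definition of "minimum" has to be settled before any query can be called correct.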

Related

INSERT INTO two columns from a SELECT query

I have a table called VIEWS with Id, Day, Month, name of video, name of browser... but I'm interested only in Id, Day and Month.
The ID can be duplicate because the user (ID) can watch a video multiple days in multiple months.
This is the query for the minimum date and the maximum date.
SELECT ID, CONCAT(MIN(DAY), '/', MIN(MONTH)) AS MIN_DATE,
CONCAT(MAX(DAY), '/', MAX(MONTH)) AS MAX_DATE
FROM Views
GROUP BY ID
I want to insert the two columns from this select (MIN_DATE and MAX_DATE) into two new columns using INSERT INTO.
What should the INSERT INTO query look like?
To do what you are trying to do (there are some issues with your solution, please read my comments below), first you need to add the new columns to the table.
ALTER TABLE Views ADD MIN_DATE VARCHAR(10)
ALTER TABLE Views ADD MAX_DATE VARCHAR(10)
Then you need to UPDATE your new columns (not INSERT, because you don't want new rows). Determine the min/max for each ID, then join the result back to the table to be able to update each row. You can't update directly from a GROUP BY as rows are grouped and lose their original row.
;WITH MinMax AS
(
SELECT
ID,
CONCAT(MIN(V.DAY), '/', MIN(V.MONTH)) AS MIN_DATE,
CONCAT(MAX(V.DAY), '/', MAX(V.MONTH)) AS MAX_DATE
FROM
Views AS V
GROUP BY
ID
)
UPDATE V SET
MIN_DATE = M.MIN_DATE,
MAX_DATE = M.MAX_DATE
FROM
MinMax AS M
INNER JOIN Views AS V ON M.ID = V.ID
The problems that I see with this design are:
Storing aggregated columns: you usually want to do this only for performance reasons (which I believe is not the case here), as querying the aggregated (grouped) rows is faster because there are fewer rows to read. The problem is that you will have to update the grouped values each time one of the original rows is updated, which adds extra processing time. Another option would be to update the aggregated values periodically, but you will have to accept that for a period of time the grouped values do not really represent the tracking table.
Keeping aggregated columns on the same table as the data they are aggregating: this is a normalization problem. Updating or inserting a row will trigger an update of all rows with the same ID, as the min/max values might have changed. Also, the min/max values will always be repeated on all rows that belong to the same ID, which wastes space. If you have to save aggregated data, save it in a different table, which still has the problems I listed in the previous point.
Using a text data type to store dates: you always want to work with dates using a proper DATETIME data type. This will not only let you use date functions like DATEADD or DATEDIFF, but also save space (varchars that store dates need more bytes than DATETIME). I don't see the year part in your query; it should be considered when computing a min/max (this might depend on what you are storing in this table).
Computing the min/max incorrectly: If you have the following rows:
ID DAY MONTH
1 5 1
1 3 2
The current result of your query would be 3/1 as MIN_DATE and 5/2 as MAX_DATE, which I believe is not what you are trying to find. The lowest here should be the 5th of January and the highest the 3rd of February. This is a consequence of storing date parts as independent values and not the whole date as a DATETIME.
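The two-row example above can be reproduced with a small sketch. It uses Python's sqlite3 module rather than SQL Server, so `||` replaces CONCAT, and the MONTH * 100 + DAY key is only an illustrative assumption that works within a single year; a real DATETIME column is still the better fix.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Views (ID INTEGER, DAY INTEGER, MONTH INTEGER)")
conn.executemany("INSERT INTO Views VALUES (?, ?, ?)", [(1, 5, 1), (1, 3, 2)])

# Independent MIN/MAX per column mixes parts of different rows.
independent = conn.execute(
    "SELECT MIN(DAY) || '/' || MIN(MONTH), MAX(DAY) || '/' || MAX(MONTH) "
    "FROM Views GROUP BY ID"
).fetchone()

# Comparing month and day as one sortable key keeps each date intact.
combined = conn.execute(
    "SELECT MIN(MONTH * 100 + DAY), MAX(MONTH * 100 + DAY) "
    "FROM Views GROUP BY ID"
).fetchone()

def as_day_month(key):
    # Turn the combined key back into the day/month text format.
    return f"{key % 100}/{key // 100}"

min_date, max_date = (as_day_month(k) for k in combined)
print(independent)         # the mixed-up result
print(min_date, max_date)  # 5th of January, 3rd of February
```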
What you usually want to do for this scenario is to group directly on the query that needs the data grouped, so you will do the GROUP BY on the SELECT that needs the min/max. Having an index by ID would make the grouping very fast. Thus, you save the storage space you would use to keep the aggregated values and also the result is always the real grouped result at the time that you are querying.
It would be something like the following:
;WITH MinMax AS
(
SELECT
ID,
CONCAT(MIN(V.DAY), '/', MIN(V.MONTH)) AS MIN_DATE, -- Date problem (varchar + min/max computed separately)
CONCAT(MAX(V.DAY), '/', MAX(V.MONTH)) AS MAX_DATE -- Date problem (varchar + min/max computed separately)
FROM
Views AS V
GROUP BY
ID
)
SELECT
V.*,
M.MIN_DATE,
M.MAX_DATE
FROM
MinMax AS M
INNER JOIN Views AS V ON M.ID = V.ID

SQL to find best row in group based on multiple columns?

Let's say I have an Oracle table with measurements in different categories:
CREATE TABLE measurements (
category CHAR(8),
value NUMBER,
error NUMBER,
created DATE
)
Now I want to find the "best" row in each category, where "best" is defined like this:
It has the lowest error.
If there are multiple measurements with the same error, the one that was created most recently is considered to be the best.
This is a variation of the greatest N per group problem, but including two columns instead of one. How can I express this in SQL?
Use ROW_NUMBER:
WITH cte AS (
SELECT m.*, ROW_NUMBER() OVER (PARTITION BY category ORDER BY error, created DESC) rn
FROM measurements m
)
SELECT category, value, error, created
FROM cte
WHERE rn = 1;
For a brief explanation, the PARTITION BY clause instructs the DB to generate a separate row number for each group of records in the same category. The ORDER BY clause places those records with the smallest error first. Should two or more records in the same category be tied with the lowest error, then the next sorting level would place the record with the most recent creation date first.
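A quick way to check the tie-breaking behaviour is to run the same query on a few rows. The sketch below uses Python's sqlite3 module instead of Oracle (it assumes the bundled SQLite is 3.25 or newer, which is when window functions were added), stores created as ISO date text, and uses made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measurements (category TEXT, value REAL, error REAL, created TEXT)"
)
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?, ?)",
    [
        ("a", 1.0, 0.5, "2020-01-01"),
        ("a", 2.0, 0.1, "2020-01-02"),  # ties on error with the next row
        ("a", 3.0, 0.1, "2020-01-03"),  # newer, so it should win the tie
        ("b", 4.0, 0.3, "2020-01-01"),
    ],
)

best = conn.execute(
    """
    WITH cte AS (
        SELECT m.*,
               ROW_NUMBER() OVER (
                   PARTITION BY category ORDER BY error, created DESC
               ) AS rn
        FROM measurements m
    )
    SELECT category, value, error, created
    FROM cte
    WHERE rn = 1
    ORDER BY category
    """
).fetchall()
print(best)
```

Category a has two rows with the lowest error, and the more recently created one is returned.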

Any suggestions to speed up slow geography query?

We have a table of Customers, with each one's location as a Geography column, and a table of Branch Offices also with each one's location as a Geography column (we populate the Geography columns from latitude and longitude columns)
We need to run a query (view) that's intended to show the closest branch office to each customer, based on Geography columns, and it runs fine with a couple of thousand customers. We just received a big job that needs to run with 700,000 customers and it takes hours to run. Can anyone suggest any ways to speed up this SQL?
WITH CLOSEST AS (
SELECT *, ROW_NUMBER()
OVER (
PARTITION BY CustNum
ORDER BY Miles
) AS RowNo
FROM
(
SELECT
CustNum,
BranchNum,
CONVERT(DECIMAL(10, 6), (BranchLoc.STDistance(CustLoc)) / 1609.344) AS Miles
FROM
Branch_Locations
CROSS JOIN
Cust_Locations
) AS T
)
SELECT TOP 100 PERCENT CustNum, BranchNum, Miles, RowNo FROM CLOSEST WHERE RowNo = 1 ORDER BY CustNum, MILES
Could there be a way to put the distance comparison into the JOIN? Nothing comes to mind so far...
Thanks for any suggestions!
So, what you're doing here is calculating the distance from each point to each other point, then ranking. SQL Server Spatial is actually set up in such a way that this is entirely unnecessary.
The first thing you want to do is make a spatial index on each table; documentation on how to do this can be found here. Don't worry too much about the specific parameters here: while you can definitely improve performance by adjusting them, having a spatial index at all will drastically improve performance.
The second thing you want to do is to make sure the spatial index is being used; documentation on how to make sure this happens can be found here. Make sure that you filter out any null spatial information!
So, what this has told us so far is a way to take a point and find the closest point in another long list of points; but this is SQL Server, and we want to do this set-based!
My recommendation is to use a little a priori knowledge and write a query using that.
WITH CLOSEST AS (
SELECT
C.CustNum,
B.BranchNum,
B.BranchLoc.STDistance(C.CustLoc)/1609.344 AS Miles,
ROW_NUMBER() OVER (PARTITION BY C.CustNum ORDER BY B.BranchLoc.STDistance(C.CustLoc) ASC) AS RowNo
FROM
Branch_Locations B
INNER JOIN
Cust_Locations C
ON
B.BranchLoc.STDistance(C.CustLoc)/1609.344 < 100 --100 miles as a maximum search distance is a reasonable number to me
WHERE
B.BranchLoc IS NOT NULL
AND C.CustLoc IS NOT NULL
)
SELECT
CustNum,
BranchNum,
Miles,
RowNo
FROM
CLOSEST
WHERE
RowNo = 1
ORDER BY
CustNum,
MILES
There are other techniques that you can use, such as my response here, however at the end of the day the most important takeaway is to create spatial indexes and make sure they are used.
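One consequence of the maximum search distance is worth spelling out: a customer with no branch inside the cutoff gets no match at all. The sketch below shows only the result semantics in plain Python; it does not reproduce the spatial index speedup. The haversine formula on a spherical Earth is an assumption that only approximates what STDistance computes, and the names and coordinates are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean radius, spherical approximation

def haversine_miles(lat1, lon1, lat2, lon2):
    # Great-circle distance between two lat/lon points given in degrees.
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def closest_branches(customers, branches, max_miles=100.0):
    # For each customer, keep only branches inside the cutoff, then take the nearest.
    result = {}
    for cust, (clat, clon) in customers.items():
        in_range = [
            (dist, name)
            for name, (blat, blon) in branches.items()
            if (dist := haversine_miles(clat, clon, blat, blon)) < max_miles
        ]
        result[cust] = min(in_range)[1] if in_range else None
    return result

branches = {"London": (51.5074, -0.1278), "Paris": (48.8566, 2.3522)}
customers = {
    "c1": (51.4543, -0.9781),   # near London
    "c2": (48.8049, 2.1204),    # near Paris
    "c3": (64.1466, -21.9426),  # far from both
}
closest = closest_branches(customers, branches)
print(closest)
```

Here c3 maps to None; in the SQL version such a customer simply drops out of the INNER JOIN, so pick the cutoff with that in mind.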

Regarding use of a query

I am working through a few SQL exercises on my own. One of the questions says:
Find the largest country (by area) in each continent, show the continent, the name and the area:
SELECT continent, name, area
FROM world x
WHERE area >= ALL
(SELECT area FROM world y
WHERE y.continent=x.continent
AND area>0)
I don't understand what is meant by world x and world y. Could anyone please explain that?
x and y are aliases. They allow you to identify the table in "WHERE y.continent=x.continent"
x and y are used as aliases (a short alternative name for reference purposes) of the table. This allows the use of the world table in two different scopes.
x and y are just aliases that are used to qualify the columns: if you had no aliases and used the same table twice, it would not be clear to which table instance a column belongs.
In your case, you are matching two instances of the same table on a column - continent - and the aliases are used to make it clear to the sql engine what is going on.
That is aliasing the table name, commonly written as:
FROM `table` AS `t`
x and y are table aliases. You use them to make the query more concise/readable and/or to use a query which selects the same table multiple times like here.
In SQL-Server 2005 and later you can use this query to get the desired result:
WITH CTE AS
(
SELECT continent, name, area,
rank=dense_rank() over(Partition By continent Order By area Desc)
From world
)
SELECT continent, name, area FROM CTE WHERE rank = 1
DENSE_RANK might return multiple countries per continent if they have the same largest area. If you just want one replace DENSE_RANK with ROW_NUMBER.
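The difference between the two ranking functions shows up as soon as there is a tie. This is a small sketch with Python's sqlite3 module (assuming a bundled SQLite of 3.25 or newer for window functions) and a made-up world table in which two countries share the largest area on one continent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE world (continent TEXT, name TEXT, area INTEGER)")
conn.executemany(
    "INSERT INTO world VALUES (?, ?, ?)",
    [("X", "A", 10), ("X", "B", 10), ("X", "C", 5), ("Y", "D", 7)],
)

def largest(rank_fn):
    # rank_fn is a fixed string ("DENSE_RANK" or "ROW_NUMBER"), not user input
    sql = f"""
        WITH cte AS (
            SELECT continent, name, area,
                   {rank_fn}() OVER (PARTITION BY continent ORDER BY area DESC) AS rnk
            FROM world
        )
        SELECT continent, name FROM cte WHERE rnk = 1
        ORDER BY continent, name
    """
    return conn.execute(sql).fetchall()

with_ties = largest("DENSE_RANK")  # both tied countries of continent X
one_each = largest("ROW_NUMBER")   # exactly one row per continent
print(with_ties)
print(one_each)
```

With ROW_NUMBER, which of the tied countries is kept is not defined unless you add a tie-breaker to the ORDER BY.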

MySQL Single Row Returned From Temporary Table

I am running the following queries against a database:
CREATE TEMPORARY TABLE med_error_third_party_tmp
SELECT `med_error_category`.description AS category, `med_error_third_party_category`.error_count AS error_count
FROM
`med_error_category` INNER JOIN `med_error_third_party_category` ON med_error_category.`id` = `med_error_third_party_category`.`category`
WHERE
year = 2003
GROUP BY `med_error_category`.id;
The only problem is that when I create the temporary table and do a SELECT * on it, it returns multiple rows, but the following query only returns one row. It seems to always return a single row unless I specify a GROUP BY, but then it returns a percentage of 1.0, as it should with a GROUP BY.
SELECT category,
error_count/SUM(error_count) AS percentage
FROM med_error_third_party_tmp;
Here are the server specs:
Server version: 5.0.77
Protocol version: 10
Server: Localhost via UNIX socket
Does anybody see what is causing this?
Standard SQL requires you to specify a GROUP BY clause if any column is not wrapped in an aggregate function (e.g. MIN, MAX, COUNT, SUM, AVG), but MySQL supports "hidden columns in the GROUP BY" -- which is why:
SELECT category,
error_count/SUM(error_count) AS percentage
FROM med_error_third_party_tmp;
...runs without error. The problem with the functionality is that because there's no GROUP BY, the SUM is the SUM of the error_count column for the entire table. But the other column values are completely arbitrary - they can't be relied upon.
This:
SELECT category,
error_count/(SELECT SUM(error_count)
FROM med_error_third_party_tmp) AS percentage
FROM med_error_third_party_tmp;
...will give you a percentage on a per row basis -- category values will be duplicated because there's no grouping.
This:
SELECT category,
SUM(error_count)/x.total AS percentage
FROM med_error_third_party_tmp
JOIN (SELECT SUM(error_count) AS total
FROM med_error_third_party_tmp) x
GROUP BY category
...will give you a percentage per category: the sum of that category's error_count values divided by the sum of the error_count values for the entire table.
Another way to do it, without the temp table as a separate item:
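To see the per-category query in action, here is a minimal sketch using Python's sqlite3 module instead of MySQL. The sample rows are made up, and the `* 1.0` is a SQLite-specific adjustment, because SQLite truncates integer division whereas MySQL's `/` does not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE med_error_third_party_tmp (category TEXT, error_count INTEGER)"
)
conn.executemany(
    "INSERT INTO med_error_third_party_tmp VALUES (?, ?)",
    [("dose", 1), ("dose", 2), ("label", 1)],  # hypothetical data, total = 4
)

percentages = conn.execute(
    """
    SELECT category,
           SUM(error_count) * 1.0 / x.total AS percentage
    FROM med_error_third_party_tmp
    JOIN (SELECT SUM(error_count) AS total
          FROM med_error_third_party_tmp) x
    GROUP BY category
    ORDER BY category
    """
).fetchall()
print(percentages)
```

Each category appears once, and the percentages add up to 1.0 across the table.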
select category, error_count/sum(error_count) "Percentage"
from (SELECT mec.description category
, metpc.error_count
FROM med_error_category mec
, med_error_third_party_category metpc
WHERE mec.id = metpc.category
AND year = 2003
GROUP BY mec.id
) AS t;
I think you will notice that the percentage is unchanging over the categories. This is probably not what you want - you probably want to group the errors by category as well.