Select outliers in table - sql

I have a table with about 100,000 names/rows that looks something like the sample below. There are about 3,000 different Refnrs, and the names are clustered geographically around each Refnr. The problem is that some names have the wrong location. I need to find the rows that don't fit in with the others. I figured I would do this by finding any Latitude OR Longitude that is too far away from the Latitude and Longitude of the rest of the rows with the same Refnr. For example, in the first Refnr two of the rows are located at Latitude 10.67xxx, and one is at Latitude 10.34xxx.
In other words, I want to compare all the names within each Refnr and pick out the ones where the 2nd decimal place differs from the rest.
Is there any way to do this so that I don't have to manually run a query 3,000 times?
Refnr  Latitude  Longitude  Name
123    10.67643  50.67523   bob
123    10.67143  50.67737   joe
123    10.34133  50.67848   al
234    11.56892  50.12324   berny
234    11.56123  50.12432   bonny
234    11.98135  50.12223   arby
567    10.22892  50.67143   nilly
567    10.22123  50.67236   tilly
567    10.22148  50.22422   billy
I need a select to give me this.
Refnr  Latitude  Longitude  Name
123    10.34133  50.67848   al
234    11.98135  50.12223   arby
567    10.22148  50.22422   billy
Thanks for the help.

Here's what is hopefully a working solution - it gives the 3 outliers from your sample data; it will be interesting to see if it works on your larger data set.
Create a CTE for each of latitude and longitude, count the number of matching values based on the first 2 decimal places only, and choose the minimum of each group - that's the group's outlier.
Join the results with the main table and filter to only the rows matching the outlier latitude or longitude.
with outlierLat as (
    select top (1) with ties refnr, Round(latitude, 2, 1) latitude
    from t
    group by refnr, Round(latitude, 2, 1)
    order by Count(*)
), outlierLong as (
    select top (1) with ties refnr, Round(Longitude, 2, 1) Longitude
    from t
    group by refnr, Round(Longitude, 2, 1)
    order by Count(*)
)
select t.*
from t
left join outlierLat lt on lt.refnr = t.refnr and Round(t.latitude, 2, 1) = lt.latitude
left join outlierLong lo on lo.refnr = t.refnr and Round(t.Longitude, 2, 1) = lo.Longitude
where lt.latitude is not null or lo.Longitude is not null
See demo Fiddle

This got overly complex, and may not be that useful. Still, it was interesting to work on.
First, set up the test data
DROP TABLE IF EXISTS #Test
GO
CREATE TABLE #Test
(
Refnr int not null
,Latitude decimal(7,5) not null
,Longitude decimal(7,5) not null
,Name varchar(100) not null
)
INSERT #Test VALUES
(123, 10.67643, 50.67523, 'bob')
,(123, 10.67143, 50.67737, 'joe')
,(123, 10.34133, 50.67848, 'al')
,(234, 11.56892, 50.12324, 'berny')
,(234, 11.56123, 50.12432, 'bonny')
,(234, 11.98135, 50.12223, 'arby')
,(567, 10.22892, 50.67143, 'nilly')
,(567, 10.22123, 50.67236, 'tilly')
,(567, 10.22148, 50.22422, 'billy')
SELECT *
from #Test
As requirements are a tad imprecise, use this to round lat, lon to the desired precision. Adjust as necessary.
DECLARE @Precision TINYINT = 1
--SELECT
--    Latitude
--    ,round(Latitude, @Precision)
--    from #Test
Then it gets messy. Problems will show up if there are multiple "outliers", by EITHER latitude OR longitude. I think this will account for all of them and remove duplicates, but further review and testing is called for.
;WITH cteGroups as (
    -- Set up groups by lat/lon proximity
    SELECT
        Refnr
        ,'Latitude' Type
        ,round(Latitude, @Precision) Proximity
        ,count(*) HowMany
    from #Test
    group by
        Refnr
        ,round(Latitude, @Precision)
    UNION ALL SELECT
        Refnr
        ,'Longitude' Type
        ,round(Longitude, @Precision) Proximity
        ,count(*) HowMany
    from #Test
    group by
        Refnr
        ,round(Longitude, @Precision)
)
,cteOutliers as (
    -- Identify outliers
    select
        Type
        ,Refnr
        ,Proximity
        ,row_number() over (partition by Type, Refnr order by HowMany desc) Ranking
    from cteGroups
)
-- Pull out all items that match with outliers
select te.*
from cteOutliers cte
inner join #Test te
    on te.Refnr = cte.Refnr
    and (   (cte.Type = 'Latitude' and round(te.Latitude, @Precision) = Proximity)
         or (cte.Type = 'Longitude' and round(te.Longitude, @Precision) = Proximity) )
where cte.Ranking > 1 -- Not in the larger groups

This averages out the center of the locations for each Refnr and looks for the rows far from it (note the subqueries are correlated on Refnr, so each row is compared against the center of its own group):
SELECT t.*
    , ABS((SELECT AVG(Latitude) FROM #Test i WHERE i.Refnr = t.Refnr) - t.Latitude)
    + ABS((SELECT AVG(Longitude) FROM #Test i WHERE i.Refnr = t.Refnr) - t.Longitude) as Awayfromhome
from #Test t
Order by Awayfromhome desc

Related

Display percentage of registered members that have not rated a Movie

I have the following three tables. See full db<>fiddle here
members
member_id  first_name  last_name
1          Roby        Dauncey
2          Isa         Garfoot
3          Sullivan    Carletto
4          Jacintha    Beacock
5          Mikey       Keat
6          Cindy       Stenett
7          Alexina     Deary
8          Perkin      Bachmann
10         Suzann      Genery
39         Horatius    Baukham
41         Bendicty    Willisch
movies
movie_id  movie_name                     movie_genre
10        The Bloody Olive               Comedy,Crime,Film-Noir
56        Attack of The Killer Tomatoes  (no genres listed)
ratings
rating_id  movie_id  member_id  rating
19         10        39         2
10         56        41         1
Now the question is:
Out of the total number registered members, how many have actually left a movie rating? Display the result as a percentage
This is what I have tried:
SELECT CONVERT(VARCHAR,(CONVERT(FLOAT,COUNT([Number of Members])) / CONVERT(FLOAT,COUNT(*)) * 100)) + '%'
AS 'Members Percentage'
FROM (
SELECT COUNT(*) AS 'Number of Members'
FROM members
WHERE member_id IN (
SELECT member_id FROM members
EXCEPT
SELECT member_id FROM ratings
)
) MembersNORatings
And my query result displays as 100%, which is obviously wrong.
Members Percentage
100%
What I figured out was that in the first line of the query, the COUNT(*) value is being treated as equivalent to the alias [Number of Members]; that's why it is showing 100%.
I thought of replacing COUNT(*) with SELECT COUNT(*) FROM members, but when I tried to run the query it showed an error saying
Incorrect Syntax near SELECT.
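That error occurs because in T-SQL a scalar subquery used as a value must be wrapped in parentheses; a minimal sketch of just that point, independent of the percentage logic:
-- a scalar subquery used as an expression needs its own parentheses
SELECT (SELECT COUNT(*) FROM members) AS TotalMembers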
What change do I need to make in my existing query in order to get the proper percentage result?
You can use a cross apply to determine, using a sub-query, whether a given member has left a rating or not (because you can't use a sub-query inside an aggregate). Then divide (ensuring you use decimal division, not integer division) to get the percentage.
select
count(*) TotalMembers
, sum(r.HasRating) TotalWithRatings
, convert(decimal(9,2), 100 * sum(r.HasRating) / (count(*) * 1.0)) PercentageWithRatings
from #members m
cross apply (
select case when exists (select 1 from #ratings r where r.member_id = m.member_id) then 1 else 0 end
) r (HasRating);
Returns:
TotalMembers  TotalWithRatings  PercentageWithRatings
50            2                 4.00
As mentioned in the comments, there are several ways to approach this. For example:
Option #1 - OUTER JOIN + DISTINCT
SELECT TotalMembers
, TotalMembersWithRatings
, CAST( 100.0 * TotalMembersWithRatings
/ NULLIF(TotalMembers, 0 )
AS DECIMAL(10,2)) AS MemberPercentage
FROM (
SELECT COUNT(DISTINCT m.member_id) AS TotalMembers
, COUNT(DISTINCT r.member_id) AS TotalMembersWithRatings
FROM members m LEFT JOIN ratings r ON r.member_id = m.member_id
) t
Option #2 - CTE + ROW_NUMBER()
WITH memberRatings AS (
SELECT member_id, ROW_NUMBER() OVER(
PARTITION BY member_id
ORDER BY member_id
) AS RowNum
FROM ratings
)
SELECT COUNT(m.member_id) AS TotalMembers
, COUNT(mr.member_id) AS TotalWithRatings
, CAST( 100.0 * COUNT(mr.member_id)
/ NULLIF(COUNT(m.member_id), 0 )
AS DECIMAL(10,2)) AS MemberPercentage
FROM members m LEFT JOIN memberRatings mr ON mr.member_id = m.member_id
AND mr.RowNum = 1
Option #3 - CROSS APPLY
SELECT
COUNT(*) TotalMembers
, SUM(r.HasRating) TotalWithRatings
, CONVERT(decimal(9,2), 100 * sum(r.HasRating) / (count(*) * 1.0)) PercentageWithRatings
FROM members m
CROSS APPLY (
SELECT CASE WHEN exists (select 1 from ratings r where r.member_id = m.member_id) THEN 1
ELSE 0
END
) r (HasRating);
Execution Plans - Take #1
There's a LOT more to analyzing execution plans than just comparing a single number. However, high level plans do provide some useful indicators.
With the small data samples provided, the plans suggest options #2 (CTE) and #3 (APPLY) are likely to be the most performant (19%), and option #1 (OUTER JOIN + DISTINCT) the least (63%), likely due to the COUNT(DISTINCT), which can often be slower than the alternatives.
Original Sample Size:
TableName  TotalRows
movies     50
members    50
ratings    50
Execution Plans - Take #2
However, populate the tables with more than a few sample rows of data and the same rough comparison produces a different result. Option #2 (CTE) still seems likely to be the least expensive query (9%), but Option #3 (APPLY) is now the most expensive (76%). You can see the majority of that cost is the index spool used due to how APPLY operates:
New Sample Size
TableName  TotalRows
movies     4105
members    29941
ratings    14866
New Execution Plans
With the increased amount of data, STATISTICS IO shows option #2 has far fewer logical reads and scans, and option #3 (APPLY) has the most. Option #1 appears to have a lower cost overall (15%), but it still has a much higher number of logical reads. (Add a non-clustered index on member_id and movie_id and the numbers, while similar, change once again.) So don't just look at a single number.
New Statistics IO
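To reproduce this kind of comparison yourself, a standard approach (not specific to these queries) is to enable the statistics output before running each option and compare the logical reads reported per table:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- run option #1, #2 or #3 here, then inspect the reads reported per table
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;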
While overall, option #2 (CTE) would seem likely to be most efficient, there are a lot of factors involved (indexes, data volume, statistics, version, etc), so you should examine the actual execution plans in your own environment.
As with most things, the answer as to which is best is: it depends.
Late to the party, but you don't need to join the tables if you only want to know how many members made a rating, not who.
What you need is
count entries in the members table
count (distinct) members in ratings
get the quota of 'rating' members (rating members divided by total members)
to get non-rating members, subtract the quota from 1.0
multiply by 100 to get the percent value
This is how you could do the calculation step by step using CTEs:
with count_members as (
select count(member_id) as member_count from members
), count_raters as (
select count(distinct member_id) as rater_count from ratings
), convert_both as (
select top 1
cast(m.member_count as decimal(10,2)) as member_count,
cast(r.rater_count as decimal(10,2)) as rater_count
from count_members as m cross join count_raters as r
), calculate_quota as (
select (rater_count / member_count) as quota from convert_both
), invert_quota as (
select (1.0 - quota) as quota from calculate_quota
)
select (quota * 100) as percentage from invert_quota;
Alternatively, that's how you could roll it all into one:
select (
(1.0 - (
cast((select count(distinct member_id) from ratings) as decimal(10,2))
/
cast((select count(member_id) from members) as decimal(10,2))
) ) * 100
) as percentage;
dbfiddle here

how to remove coordinates from geojson with less than 4 values

As the title says, I am doing a query on bikesharing data stored in BigQuery.
I am able to extract the data and arrange it in the correct order to be displayed in a path chart. In the data, there are routes with only start and end long and lat, or sometimes only a start long and lat. How do I remove anything with less than 4 points?
This is the code; I am also limited to SELECT statements only.
SELECT
routeID ,
json_extract(st_asgeojson(st_makeline( array_agg(st_geogpoint(locs.lon, locs.lat) order by locs.date))),'$.coordinates') as geo
FROM
howardcounty.routebatches
where unlockedAt between {{start_date}} and {{end_date}}
cross join UNNEST(locations) as locs
GROUP BY routeID
order by routeID
limit 10
I have also included a screenshot for clarity.
To apply a condition after a GROUP BY, use a HAVING clause. For a simple condition -- are there at least two coordinates for the route? -- this query can be used:
With dummy as (
Select 1 as routeID, [struct(current_timestamp() as date, 1 as lon, 2 as lat),struct(current_timestamp() as date, 3 as lon, 4 as lat)] as locations
Union all select 2 as routeID, [struct(current_timestamp() as date, 10 as lon, 20 as lat)]
)
SELECT
routeID , count(locs.date) as amountcoord,
json_extract(st_asgeojson(st_makeline( array_agg(st_geogpoint(locs.lon, locs.lat) order by locs.date))),'$.coordinates') as geo
FROM
#howardcounty.routebatches
dummy
#where unlockedAt between {{start_date}} and {{end_date}}
cross join UNNEST(locations) as locs
GROUP BY routeID
having count(locs.date)>1
order by routeID
limit 10
For more complex conditions, a nested select may do the job:
Select *
from (
--- your code ---
) where length(geo)-length(replace(geo,"]","")) >= 1+4
The JSON is transformed to a string in your code. If you count the ] characters, each inner coordinate array contributes one and the outer JSON array's own closing ] contributes one more, so a route with 4 points yields 1+4 of them. For example, [[1,2],[3,4],[5,6],[7,8]] contains five ]. Keeping rows where the count is at least 1+4 therefore keeps routes with 4 or more points.

SQL percentage of the total

Hi, how can I get the percentage of each record over the total?
Let's imagine I have a table with the following:
ID  code  Points
1   101   2
2   201   3
3   233   4
4   123   1
The percentage for ID 1 is 20%, for ID 2 it's 30%, and so on.
How do I get it?
There are a couple of approaches to getting that result.
You essentially need the "total" points from the whole table (or whatever subset), repeated on each row. Getting the percentage is then a simple matter of arithmetic; the exact expression depends on the datatypes and how you want the result formatted.
Here's one way (out of a couple of possible ways) to get the specified result:
SELECT t.id
, t.code
, t.points
-- , s.tot_points
, ROUND(t.points * 100.0 / s.tot_points,1) AS percentage
FROM onetable t
CROSS
JOIN ( SELECT SUM(r.points) AS tot_points
FROM onetable r
) s
ORDER BY t.id
The inline view query s runs first; it gives a single row. The join operation matches that row with every row from t, and that gives us the values we need to calculate the percentage.
Another way to get this result, without using a join operation, is to use a subquery in the SELECT list to return the total.
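A minimal sketch of that variant, using the same onetable as above:
SELECT t.id
     , t.code
     , t.points
     , ROUND(t.points * 100.0 / (SELECT SUM(r.points) FROM onetable r), 1) AS percentage
  FROM onetable t
 ORDER BY t.id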
Note that the join approach can be extended to get percentage for each "group" of records.
id type points %type
-- ---- ------ -----
1 sold 11 22%
2 sold 4 8%
3 sold 25 50%
4 bought 1 50%
5 bought 1 50%
6 sold 10 20%
To get that result, we can use the same query, but with a view query for s that returns the total per type (GROUP BY r.type), and a join operation that isn't a CROSS join but a match based on type:
SELECT t.id
, t.type
, t.points
-- , s.tot_points_by_type
, ROUND(t.points * 100.0 / s.tot_points_by_type,1) AS `%type`
FROM onetable t
JOIN ( SELECT r.type
, SUM(r.points) AS tot_points_by_type
FROM onetable r
GROUP BY r.type
) s
ON s.type = t.type
ORDER BY t.id
To get that same result with the subquery approach, it would have to be a correlated subquery, and that subquery is likely to get executed for every row in t, as sketched below.
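A sketch of that correlated form, for comparison (same onetable and type column as above):
SELECT t.id
     , t.type
     , t.points
     , ROUND(t.points * 100.0 / (SELECT SUM(r.points) FROM onetable r WHERE r.type = t.type), 1) AS `%type`
  FROM onetable t
 ORDER BY t.id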
This is why it's more natural for me to use a join operation rather than a subquery in the SELECT list, even when a subquery works the same. (The patterns we use for more complex queries, like assigning aliases to tables, qualifying all column references, and formatting the SQL, work their way back into simple queries; the rationale for these patterns is just less apparent there.)
Try it like this:
select id, code, points, (points * 100.0) / (select sum(points) from table1) as percentage
from table1
To add to a good list of responses, this should be fast performance-wise, and rather easy to understand:
DECLARE @T TABLE (ID INT, code VARCHAR(256), Points INT)
INSERT INTO @T VALUES (1,'101',2), (2,'201',3), (3,'233',4), (4,'123',1)
;WITH CTE AS
(SELECT * FROM @T)
SELECT C.*, CAST(ROUND((C.Points/B.TOTAL)*100, 2) AS DEC(32,2)) [%_of_TOTAL]
FROM CTE C
JOIN (SELECT CAST(SUM(Points) AS DEC(32,2)) TOTAL FROM CTE) B ON 1=1
Just replace the table variable with your actual table inside the CTE.

Find Segment with Longest Stay Per Booking

We have a number of bookings, and one of the requirements is that we display the Final Destination for a booking based on its segments. Our business has defined the Final Destination as the one where we have the longest stay, and the Origin as the first departure point.
Please note this is not the segment with the longest travel time, i.e. Datediff(minute, DepartDate, ArrivalDate); it is the one with the longest gap between segments.
This is a simplified version of the tables:
Create Table Segments
(
BookingID int,
SegNum int,
DepartureCity varchar(100),
DepartDate datetime,
ArrivalCity varchar(100),
ArrivalDate datetime
);
Create Table Bookings
(
BookingID int identity(1,1),
Locator varchar(10)
);
Insert into Segments values (1,2,'BRU','2010-03-06 10:40','FIH','2010-03-06 20:20:00')
Insert into Segments values (1,4,'FIH','2010-03-13 21:50:00','BRU', '2010-03-14 07:25:00')
Insert into Segments values (2,2,'BOD','2010-02-10 06:50:00','AMS','2010-02-10 08:50:00')
Insert into Segments values (2,3,'AMS','2010-02-10 10:40:00','EBB','2010-02-10 20:40:00')
Insert into Segments values (2,4,'EBB','2010-02-28 22:55:00','AMS','2010-03-01 05:35:00')
Insert into Segments values (2,5,'AMS','2010-03-01 10:25:00','BOD','2010-03-01 12:15:00')
insert into Segments values (3,2,'BRU','2010-03-09 12:10:00','IAD','2010-03-09 14:46:00')
Insert into Segments Values (3,3,'IAD','2010-03-13 17:57:00','BRU','2010-03-14 07:15:00')
insert into segments values (4,2,'BRU','2010-07-27','ADD','2010-07-28')
insert into segments values (4,4,'ADD','2010-07-28','LUN','2010-07-28')
insert into segments values (4,5,'LUN','2010-08-23','ADD','2010-08-23')
insert into segments values (4,6,'ADD','2010-08-23','BRU','2010-08-24')
Insert into Bookings values('5MVL7J')
Insert into Bookings values ('Y2IMXQ')
insert into bookings values ('YCBL5C')
Insert into bookings values ('X7THJ6')
I have created a SQL Fiddle with real data here:
SQL Fiddle Example
I have tried the following, but it doesn't appear to be correct:
SELECT Locator, fd.*
FROM Bookings ob
OUTER APPLY
(
SELECT Top 1 DepartureCity, ArrivalCity
from
(
SELECT DISTINCT
seg.segnum ,
seg.DepartureCity ,
seg.DepartDate ,
seg.ArrivalCity ,
seg.ArrivalDate,
(SELECT
DISTINCT
DATEDIFF(MINUTE , seg.ArrivalDate , s2.DepartDate)
FROM Segments s2
WHERE s2.BookingID = seg.BookingID AND s2.segnum = seg.segnum + 1) 'LengthOfStay'
FROM Bookings b(NOLOCK)
INNER JOIN Segments seg (NOLOCK) ON seg.bookingid = b.bookingid
WHERE b.Locator = ob.locator
) a
Order by a.lengthofstay desc
) FD
The results I expect are:
Locator Origin Destination
5MVL7J BRU FIH
Y2IMXQ BOD EBB
YCBL5C BRU IAD
X7THJ6 BRU LUN
I get the feeling that a CTE would be the best approach, but my attempts at this have so far failed miserably. Any help would be greatly appreciated.
I have managed to get the following query working, but it only handles one booking at a time due to the TOP 1, and I'm not sure how to tweak it:
WITH CTE AS
(
SELECT distinct s.DepartureCity, s.DepartDate, s.ArrivalCity, s.ArrivalDate, b.Locator , ROW_NUMBER() OVER (PARTITION BY b.Locator ORDER BY SegNum ASC) RN
FROM Segments s
JOIN bookings b ON s.bookingid = b.BookingID
)
SELECT C.Locator, c.DepartureCity, a.ArrivalCity
FROM
(
SELECT TOP 1 C.Locator, c.ArrivalCity, c1.DepartureCity, DATEDIFF(MINUTE,c.ArrivalDate, c1.DepartDate) 'ddiff'
FROM CTE c
JOIN cte c1 ON c1.Locator = C.Locator AND c1.rn = c.rn + 1
ORDER BY ddiff DESC
) a
JOIN CTE c ON C.Locator = a.Locator
WHERE c.rn = 1
You can try something like this:
;WITH CTE_Start AS
(
--Ordering of segments to eliminate gaps
SELECT *, ROW_NUMBER() OVER (PARTITION BY BookingID ORDER BY SegNum) RN
FROM dbo.Segments
)
, RCTE_Stay AS
(
--recursive CTE to calculate stay between segments
SELECT *, 0 AS Stay FROM CTE_Start s WHERE RN = 1
UNION ALL
SELECT sNext.*, DATEDIFF(Mi, s.ArrivalDate, sNext.DepartDate)
FROM CTE_Start sNext
INNER JOIN RCTE_Stay s ON s.RN + 1 = sNext.RN AND s.BookingID = sNext.BookingID
)
, CTE_Final AS
(
--Search for max(stay) for each bookingID
SELECT *, ROW_NUMBER() OVER (PARTITION BY BookingID ORDER BY Stay DESC) AS RN_Stay
FROM RCTE_Stay
)
--join Start and Final on RN=1 to find origin and departure
SELECT b.Locator, s.DepartureCity AS Origin, f.DepartureCity AS Destination
FROM CTE_Final f
INNER JOIN CTE_Start s ON f.BookingID = s.BookingID
INNER JOIN dbo.Bookings b ON b.BookingID = f.BookingID
WHERE s.RN = 1 AND f.RN_Stay = 1
SQLFiddle DEMO
You can use the OUTER APPLY + TOP operators to find the next segment by SegNum. After finding the gap between segments, MIN/MAX aggregate functions with an OVER clause are used as the conditions in the CASE expressions:
;WITH cte AS
(
SELECT seg.BookingID,
CASE WHEN MIN(seg.segNum) OVER(PARTITION BY seg.BookingID) = seg.segNum
THEN seg.DepartureCity END AS Origin,
CASE WHEN MAX(DATEDIFF(MINUTE, seg.ArrivalDate, o.DepartDate)) OVER(PARTITION BY seg.BookingID)
= DATEDIFF(MINUTE, seg.ArrivalDate, o.DepartDate)
THEN o.DepartureCity END AS Destination
FROM Segments seg (NOLOCK)
OUTER APPLY (
SELECT TOP 1 seg2.DepartDate, seg2.DepartureCity
FROM Segments seg2
WHERE seg.BookingID = seg2.BookingID
AND seg.SegNum < seg2.SegNum
ORDER BY seg2.SegNum ASC
) o
)
SELECT b.Locator, MAX(c.Origin) AS Origin, MAX(c.Destination) AS Destination
FROM cte c JOIN Bookings b ON c.BookingID = b.BookingID
GROUP BY b.Locator
See demo on SQLFiddle
The statement below:
;WITH DataSource AS
(
SELECT ROW_NUMBER() OVER(PARTITION BY BookingID ORDER BY DATEDIFF(SS,DepartDate,ArrivalDate) DESC) AS Row
,Segments.BookingID
,Segments.SegNum
,Segments.DepartureCity
,Segments.DepartDate
,Segments.ArrivalCity
,Segments.ArrivalDate
,DATEDIFF(SS,DepartDate,ArrivalDate) AS DiffInSeconds
FROM Segments
)
SELECT *
FROM DataSource DS
INNER JOIN Bookings B
ON DS.[BookingID] = B.[BookingID]
Will give the following output:
So, adding the following clause to the above statement:
WHERE Row = 1
will give you what you need.
A few important things:
As you can see from the screenshot below, there are two records with the same difference in seconds. If you want to show both of them (or all ties), use the RANK function instead of the ROW_NUMBER function, as sketched below.
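A minimal sketch of that swap inside the DataSource CTE (only the ranking line changes):
-- original:
ROW_NUMBER() OVER(PARTITION BY BookingID ORDER BY DATEDIFF(SS,DepartDate,ArrivalDate) DESC) AS Row
-- keeps ties:
RANK() OVER(PARTITION BY BookingID ORDER BY DATEDIFF(SS,DepartDate,ArrivalDate) DESC) AS Row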
The return type of DATEDIFF is INT, so there is a limit on the maximum difference in seconds. As the documentation puts it:
If the return value is out of range for int (-2,147,483,648 to +2,147,483,647), an error is returned. For millisecond, the maximum difference between startdate and enddate is 24 days, 20 hours, 31 minutes and 23.647 seconds. For second, the maximum difference is 68 years.

Sorting twice on same column

I have a bit of a weird question, given to me by a client.
He has a list of data, with a date between parentheses, like so:
Foo (14/08/2012)
Bar (15/08/2012)
Bar (16/09/2012)
Xyz (20/10/2012)
However, he wants the list to be displayed as follows:
Foo (14/08/2012)
Bar (16/09/2012)
Bar (15/08/2012)
Xyz (20/10/2012)
(notice that the second Bar has moved up one position)
So the logic behind it is that the list has to be sorted by date ascending, EXCEPT when two rows have the same name ('Bar'). If they have the same name, it must be sorted with the LATEST date at the top, while staying in place within the overall sort order.
Is this even remotely possible? I've experimented with a lot of ORDER BY clauses, but couldn't find the right one. Does anyone have an idea?
I should have specified that this data comes from a table in a sql server database (the Name and the date are in two different columns). So I'm looking for a SQL-query that can do the sorting I want.
(I've dumbed this example down quite a bit, so if you need more context, don't hesitate to ask)
This works, I think
declare @t table (data varchar(50), date datetime)
insert @t
values
('Foo','2012-08-14'),
('Bar','2012-08-15'),
('Bar','2012-09-16'),
('Xyz','2012-10-20')
select t.*
from @t t
inner join (select data, COUNT(*) cg, MAX(date) as mg from @t group by data) tc
on t.data = tc.data
order by case when cg>1 then mg else date end, date desc
produces
data date
---------- -----------------------
Foo 2012-08-14 00:00:00.000
Bar 2012-09-16 00:00:00.000
Bar 2012-08-15 00:00:00.000
Xyz 2012-10-20 00:00:00.000
A way with better performance than any of the other posted answers is to just do it entirely with an ORDER BY and not a JOIN or using CTE:
DECLARE @t TABLE (myData varchar(50), myDate datetime)
INSERT INTO @t VALUES
('Foo','2012-08-14'),
('Bar','2012-08-15'),
('Bar','2012-09-16'),
('Xyz','2012-10-20')
SELECT *
FROM @t t1
ORDER BY (SELECT MIN(t2.myDate) FROM @t t2 WHERE t2.myData = t1.myData), t1.myDate DESC
This does exactly what you request and will work with any indexes and much better with larger amounts of data than any of the other answers.
Additionally it's much more clear what you're actually trying to do here, rather than masking the real logic with the complexity of a join and checking the count of joined items.
This one uses analytic functions to perform the sort; it only requires one SELECT from your table.
The inner query finds the gaps where the name changes. These gaps are used to identify groups in the next query, and the outer query does the final sorting by those groups.
I have tried it here (SQL Fiddle) with extended test data.
SELECT name, dat
FROM (
SELECT name, dat, SUM(gap) over(ORDER BY dat, name) AS grp
FROM (
SELECT name, dat,
CASE WHEN LAG(name) OVER (ORDER BY dat, name) = name THEN 0 ELSE 1 END AS gap
FROM t
) x
) y
ORDER BY grp, dat DESC
Extended test-data
('Bar','2012-08-12'),
('Bar','2012-08-11'),
('Foo','2012-08-14'),
('Bar','2012-08-15'),
('Bar','2012-08-16'),
('Bar','2012-09-17'),
('Xyz','2012-10-20')
Result
Bar 2012-08-12
Bar 2012-08-11
Foo 2012-08-14
Bar 2012-09-17
Bar 2012-08-16
Bar 2012-08-15
Xyz 2012-10-20
I think this works, including the case I asked about in the comments:
declare @t table (data varchar(50), [date] datetime)
insert @t
values
('Foo','20120814'),
('Bar','20120815'),
('Bar','20120916'),
('Xyz','20121020')
; With OuterSort as (
select *,ROW_NUMBER() OVER (ORDER BY [date] asc) as rn from @t
)
--Now we need to find contiguous ranges of the same data value, and the min and max row number for such a range
, Islands as (
select data,rn as rnMin,rn as rnMax from OuterSort os where not exists (select * from OuterSort os2 where os2.data = os.data and os2.rn = os.rn - 1)
union all
select i.data,rnMin,os.rn
from
Islands i
inner join
OuterSort os
on
i.data = os.data and
i.rnMax = os.rn-1
), FullIslands as (
select
data,rnMin,MAX(rnMax) as rnMax
from Islands
group by data,rnMin
)
select
*
from
OuterSort os
inner join
FullIslands fi
on
os.rn between fi.rnMin and fi.rnMax
order by
fi.rnMin asc,os.rn desc
It works by first computing the initial ordering in the OuterSort CTE. Then, using two CTEs (Islands and FullIslands), we compute the parts of that ordering in which the same data value appears in adjacent rows. Having done that, we can compute the final ordering by any value that all adjacent values will have (such as the lowest row number of the "island" that they belong to), and then within an "island", we use the reverse of the originally computed sort order.
Note that this may, though, not be too efficient for large data sets. On the sample data it shows up as requiring 4 table scans of the base table, as well as a spool.
Try something like...
ORDER BY CASE date
WHEN '14/08/2012' THEN 1
WHEN '16/09/2012' THEN 2
WHEN '15/08/2012' THEN 3
WHEN '20/10/2012' THEN 4
END
In MySQL, you can do:
ORDER BY FIELD(date, '14/08/2012', '16/09/2012', '15/08/2012', '20/10/2012')
In Postgres, you can create a function FIELD and do:
CREATE OR REPLACE FUNCTION field(anyelement, anyarray) RETURNS numeric AS $$
SELECT
COALESCE((SELECT i
FROM generate_series(1, array_upper($2, 1)) gs(i)
WHERE $2[i] = $1),
0);
$$ LANGUAGE SQL STABLE
If you do not want to use CASE, you can try to find an implementation of the FIELD function for SQL Server.
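For example, a FIELD-style ordering can be sketched in SQL Server with a VALUES list join; the table and column names here are placeholders, and the date is assumed to be stored as a dd/mm/yyyy string as in the question's list:
SELECT t.*
FROM mytable t
LEFT JOIN (VALUES ('14/08/2012', 1),
                  ('16/09/2012', 2),
                  ('15/08/2012', 3),
                  ('20/10/2012', 4)) f(d, pos)
  ON f.d = t.[date]
ORDER BY COALESCE(f.pos, 0)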