SQL Server query grouped by max value in column - sql

Update:
Using the RANK() OVER (PARTITION BY ...) syntax available in MS SQL Server 2005 does indeed point me in the right direction, but it (or maybe I should write "I") is unable to give me the results I need without resorting to enumerating rows in code.
For example, if we select TOP (1) of rank, I get only one value, i.e., slot 1. If I use MAX(), then I get the top-ranked value for each slot... which, in my case, doesn't work, because if slot 2's top value is NULL but its next-to-max value is non-empty, that is the one I want.
So, unable to find a completely T-SQL solution, I've resorted to filtering as much as possible in SQL and then enumerating the results in code on the client side.
Original:
I've been hitting advanced T-SQL books, Stack Overflow, and Google, trying to figure out how to handle this query either by using pivots or by using analytic functions. So far, I haven't hit on the right combination.
I have schedules that are ranked (higher value, greater precedence). Each schedule has a playlist of a certain number of numbered slots with files.
What I need to do, is line up all the schedules and their associated playlists, and for each slot, grab the file from the schedule having the highest ranking value.
So, if I had a query for a specific customer with a join between the playlists and the schedules, ordered by Schedule.Rank DESC, like so:
PlaylistId  Schedule.Rank  SlotNumber  FileId
100         100            1           1001
100         100            2           NULL
100         100            3           NULL
200         80             1           1101
200         80             2           NULL
200         80             3           NULL
300         60             1           1201
300         60             2           NULL
300         60             3           2202
400         20             1           1301
400         20             2           2301
400         20             3           NULL
From this, I need to find the FileId for the MAX ranked row per slotnumber:
SlotNumber  FileId  Schedule.Rank
1           1001    100
2           2301    20
3           2202    60
Any ideas on how to do this?
Table Definitions below:
CREATE TABLE dbo.Playlists(
id int NOT NULL)
CREATE TABLE dbo.Customers(
id int NOT NULL,
name nchar(10) NULL)
CREATE TABLE dbo.Schedules(
id int NOT NULL,
rank int NOT NULL,
playlistid int NULL,
customerid int NULL)
CREATE TABLE dbo.PlaylistSlots(
id int NOT NULL,
slotnumber int NOT NULL,
playlistid int NULL,
fileid int NULL)

SELECT slotnumber, fileid, rank
FROM
(
SELECT slotnumber, fileid, Schedules.rank, RANK() OVER (PARTITION BY slotnumber ORDER BY Schedules.rank DESC) as rankfunc
FROM Schedules
INNER JOIN PlaylistSlots ON Schedules.playlistid = PlaylistSlots.playlistid
) tmp
WHERE rankfunc = 1
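As a sanity check, the derived-table approach above, plus a filter that discards empty slots before ranking (which addresses the NULL issue described in the update), can be sketched with Python's sqlite3 module; SQLite 3.25+ supports the same window functions, and the table and column names below are taken from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Schedules (playlistid INT, rank INT);
CREATE TABLE PlaylistSlots (playlistid INT, slotnumber INT, fileid INT);
INSERT INTO Schedules VALUES (100,100),(200,80),(300,60),(400,20);
INSERT INTO PlaylistSlots VALUES
 (100,1,1001),(100,2,NULL),(100,3,NULL),
 (200,1,1101),(200,2,NULL),(200,3,NULL),
 (300,1,1201),(300,2,NULL),(300,3,2202),
 (400,1,1301),(400,2,2301),(400,3,NULL);
""")

rows = conn.execute("""
SELECT slotnumber, fileid, schedrank
FROM (
    SELECT ps.slotnumber, ps.fileid, s.rank AS schedrank,
           RANK() OVER (PARTITION BY ps.slotnumber
                        ORDER BY s.rank DESC) AS rankfunc
    FROM Schedules s
    JOIN PlaylistSlots ps ON s.playlistid = ps.playlistid
    WHERE ps.fileid IS NOT NULL   -- discard empty slots BEFORE ranking
)
WHERE rankfunc = 1
ORDER BY slotnumber
""").fetchall()
print(rows)   # [(1, 1001, 100), (2, 2301, 20), (3, 2202, 60)]
```

Filtering fileid IS NOT NULL inside the derived table is what makes slot 2 fall through to the rank-20 schedule instead of returning NULL, which matches the desired output in the question.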

Have you looked at SQL Server's (2005 onwards) PARTITION and RANK features?

select a.SlotNumber, a.FileId, a.ScheduleRank
FROM intermediateTable a,
(
SELECT SlotNumber, Max(ScheduleRank) as MaxRank
FROM intermediateTable
WHERE FileId is not null
GROUP BY SlotNumber
) b
WHERE b.SlotNumber = a.SlotNumber and b.MaxRank = a.ScheduleRank
This query uses the intermediate output to build the final output.
Does this help?

Related

Select records from a specific key onwards

I have a table that has more than three trillion records
The main key of this table is guid
As below
GUID                                  Value  mid  id
0B821574-8E85-4FB7-8047-553393E385CB  4      51   15
716F74B0-80D8-4869-86B4-99FF9EB10561  0      510  153
7EBA2C31-FFC8-4071-B11A-9E2B7ED16B2B  2      5    3
85491F90-E4C6-4030-B1E5-B9CA36238AE2  1      58   7
F04FA30C-0C35-4B9F-A01C-708C0189815D  20     50   13
guid is primary key
I want to select 10 records from where the key is equal to, for example, 85491F90-E4C6-4030-B1E5-B9CA36238AE2
You can use order by and top. Assuming that guid defines the ordering of the rows:
select top (10) t.*
from mytable t
where guid >= '85491F90-E4C6-4030-B1E5-B9CA36238AE2'
order by guid
If the ordering is defined by another column, say id (which should be unique as well), then you would use a correlated subquery for filtering:
select top (10) t.*
from mytable t
where id >= (select id from mytable t1 where guid = '85491F90-E4C6-4030-B1E5-B9CA36238AE2')
order by id
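The seek (keyset) queries above can be sketched with Python's sqlite3, where TOP (10) becomes LIMIT 10. One caveat: SQLite compares these values as plain strings, while SQL Server orders uniqueidentifier values by byte groups from right to left, so which rows qualify can differ between the two engines:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (guid TEXT PRIMARY KEY, value INT)")
conn.executemany("INSERT INTO mytable VALUES (?, ?)", [
    ('0B821574-8E85-4FB7-8047-553393E385CB', 4),
    ('716F74B0-80D8-4869-86B4-99FF9EB10561', 0),
    ('7EBA2C31-FFC8-4071-B11A-9E2B7ED16B2B', 2),
    ('85491F90-E4C6-4030-B1E5-B9CA36238AE2', 1),
    ('F04FA30C-0C35-4B9F-A01C-708C0189815D', 20),
])

# Seek ("keyset") pagination: start at a known key and read N rows onward.
rows = conn.execute("""
    SELECT guid, value FROM mytable
    WHERE guid >= '85491F90-E4C6-4030-B1E5-B9CA36238AE2'
    ORDER BY guid
    LIMIT 10
""").fetchall()
print([g for g, _ in rows])
```

On this five-row sample, only the starting row and the one string-sorted after it come back.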
To read data onward, you can use OFFSET ... FETCH with ORDER BY since MS SQL Server 2012. According to learn.microsoft.com, something like this:
-- Declare and set the variables for the OFFSET and FETCH values.
DECLARE @StartingRowNumber INT = 1
, @RowCountPerPage INT = 10;
-- Create the condition to stop the transaction after all rows have been returned:
WHILE (SELECT COUNT(*) FROM mytable) >= @StartingRowNumber
BEGIN
-- Run the query until the stop condition is met:
SELECT *
FROM mytable WHERE guid >= '85491F90-E4C6-4030-B1E5-B9CA36238AE2'
ORDER BY id
OFFSET @StartingRowNumber - 1 ROWS
FETCH NEXT @RowCountPerPage ROWS ONLY;
-- Increment @StartingRowNumber value:
SET @StartingRowNumber = @StartingRowNumber + @RowCountPerPage;
CONTINUE
END;
In the real world this will not be enough, because other processes could (try to) read or write data in your table at the same time.
Please read the documentation; for example, search for "Running multiple queries in a single transaction" at https://learn.microsoft.com/en-us/sql/t-sql/queries/select-order-by-clause-transact-sql
Proper indexes on the id and guid columns must be created to ensure performance.
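The WHILE loop above can also be driven from the client; a minimal sketch with Python's sqlite3, where LIMIT ... OFFSET is SQLite's spelling of OFFSET ... FETCH, using a hypothetical 25-row table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO mytable VALUES (?)", [(i,) for i in range(1, 26)])

page_size, offset, pages = 10, 0, []
while True:
    # OFFSET/FETCH NEXT in T-SQL maps to LIMIT ... OFFSET ... in SQLite.
    rows = conn.execute(
        "SELECT id FROM mytable ORDER BY id LIMIT ? OFFSET ?",
        (page_size, offset)).fetchall()
    if not rows:
        break
    pages.append([r[0] for r in rows])
    offset += page_size

print([len(p) for p in pages])  # [10, 10, 5]
```

Note that offset paging rescans the skipped rows on every iteration, which is exactly why the seek/keyset form above scales better on large tables.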

Count and Group By Returning Different Sets - Halfway solved :/

Can anyone help? My trouble is that people might have the same ID or different ones, and have different name spellings. If I group by ID (which is not the primary key), I get a different number of rows than if I group by ID and Name. How do I group by ID alone, while still having both the ID and the Name in the select?
Create Table Client(ID Int, Name Varchar(15))
Insert Into Client VALUES(11,'Batman'),(22,'Batman'),(33,'Robin'),(44,'Joker'),(44,'The Joker'),(33,'Robin')
Select Count(ID) From Client
Select * From Client
--This returns 4 rows as it should
Select Count (ID)
From Client
Group By ID
--This returns 5 rows because Joker and The Joker have different names, but the same ID. I want to count by ID and not the name, since so many have typos.
Select Count (ID), [Name] , ID
From Client
Group By ID, [Name]
How do I do this and have it work?
Select Count (ID), [Name] , ID
From Client
Group By ID --<< Always throws an error unless I include Name, which
--returns too many rows.
It should return
Count  Name       ID
1      Batman     11
1      Batman     22
2      Joker      44  --<< Correct
2      Robin      33
And not
Count  Name       ID
1      Batman     11
1      Batman     22
2      Robin      33
1      Joker      44  --Wrong
1      The Joker  44  --Wrong
Using select count(*) from Client will tell you exactly how many records there are in your table. If your ID field were the primary key, then select count(ID) from Client would return the same number.
Your first query is a little confusing, because you're grouping by ID but not displaying the ID. So you're likely getting a row for each record, where the row value is 1.
Your 2nd query is also a bit confusing, because there's no aggregation happening (since your ID field is unique).
What specifically are you trying to obtain in your query (if anything besides just how many records you have in your table)?
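If the goal is one row per ID with a representative spelling, one option (not given in the answers above, so treat it as a suggestion) is to group by ID alone and pick a deterministic name with MIN(); in T-SQL that would be SELECT COUNT(ID), MIN(Name), ID FROM Client GROUP BY ID. A runnable sketch with Python's sqlite3 on the question's data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Client (ID INT, Name TEXT)")
conn.executemany("INSERT INTO Client VALUES (?, ?)", [
    (11, 'Batman'), (22, 'Batman'), (33, 'Robin'),
    (44, 'Joker'), (44, 'The Joker'), (33, 'Robin'),
])

# Group by ID only; MIN(Name) picks one representative spelling per ID,
# so 'Joker' / 'The Joker' collapse into a single row counted as 2.
rows = conn.execute("""
    SELECT COUNT(ID), MIN(Name), ID
    FROM Client
    GROUP BY ID
    ORDER BY ID
""").fetchall()
print(rows)
# [(1, 'Batman', 11), (1, 'Batman', 22), (2, 'Robin', 33), (2, 'Joker', 44)]
```

MIN() is arbitrary but deterministic; if the typo variants matter, a separate cleanup of Name is still needed.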

How to update each row of a table with a random row from another table

I'm building my first de-identification script, and running into issues with my approach.
I have a table dbo.pseudonyms whose firstname column is populated with 200 rows of data. Every row in this column of 200 rows has a value (none are null). This table also has an id column (int, primary key, not null) with the numbers 1-200.
What I want to do is, in one statement, re-populate my entire USERS table with firstname data randomly selected for each row from my pseudonyms table.
To generate the random number for picking I'm using ABS(Checksum(NewId())) % 200. Every time I do SELECT ABS(Checksum(NewId())) % 200 I get a numeric value in the range I'm looking for just fine, no intermittently erratic behavior.
HOWEVER, when I use this formula in the following statement:
SELECT pn.firstname
FROM DeIdentificationData.dbo.pseudonyms pn
WHERE pn.id = ABS(Checksum(NewId())) % 200
I get VERY intermittent results. I'd say about 30% of the runs return one name picked out of the table (the expected result), about 30% come back with more than one row (which is baffling, as there are no duplicate id column values), and about 30% come back with NULL (even though there are no empty rows in the firstname column).
I did look for quite a while for this specific issue, but to no avail so far. I'm assuming the issue has to do with using this formula as a pointer, but I'd be at a loss how to do this otherwise.
Thoughts?
Why your query in the question returns unexpected results:
Your original query selects from Pseudonyms. The server scans through each row of the table, picks the ID from that row, generates a random number, and compares the generated number to the ID.
When, by chance, the generated number for a particular row happens to be the same as the ID of that row, that row is returned in the result set. It is quite possible that by chance the generated number is never the same as any ID, and equally possible that it coincides with an ID several times.
A bit more detailed:
Server picks a row with ID=1.
Generates a random number, say 25. Why not? A decent random number.
Is 1 = 25 ? No => This row is not returned.
Server picks a row with ID=2.
Generates a random number, say 125. Why not? A decent random number.
Is 2 = 125 ? No => This row is not returned.
And so on...
Here is a complete solution on SQL Fiddle
Sample data
DECLARE #VarPseudonyms TABLE (ID int IDENTITY(1,1), PseudonymName varchar(50) NOT NULL);
DECLARE #VarUsers TABLE (ID int IDENTITY(1,1), UserName varchar(50) NOT NULL);
INSERT INTO #VarUsers (UserName)
SELECT TOP(1000)
'UserName' AS UserName
FROM sys.all_objects
ORDER BY sys.all_objects.object_id;
INSERT INTO #VarPseudonyms (PseudonymName)
SELECT TOP(200)
'PseudonymName'+CAST(ROW_NUMBER() OVER(ORDER BY sys.all_objects.object_id) AS varchar) AS PseudonymName
FROM sys.all_objects
ORDER BY sys.all_objects.object_id;
Table Users has 1000 rows with the same UserName for each row. Table Pseudonyms has 200 rows with different PseudonymNames:
SELECT * FROM #VarUsers;
ID UserName
-- --------
1 UserName
2 UserName
...
999 UserName
1000 UserName
SELECT * FROM #VarPseudonyms;
ID PseudonymName
-- -------------
1 PseudonymName1
2 PseudonymName2
...
199 PseudonymName199
200 PseudonymName200
First attempt
At first I tried a direct approach. For each row in Users I want to get one random row from Pseudonyms:
SELECT
U.ID
,U.UserName
,CA.PseudonymName
FROM
#VarUsers AS U
CROSS APPLY
(
SELECT TOP(1)
P.PseudonymName
FROM #VarPseudonyms AS P
ORDER BY CRYPT_GEN_RANDOM(4)
) AS CA
;
It turns out that the optimizer is too smart, and this produced some random, but identical, PseudonymName for every User, which is not what I expected:
ID UserName PseudonymName
1 UserName PseudonymName181
2 UserName PseudonymName181
...
999 UserName PseudonymName181
1000 UserName PseudonymName181
So, I tweaked this approach a bit and generated a random number for each row in Users first. Then I used the generated number to find the Pseudonym with this ID for each row in Users using CROSS APPLY.
CTE_Users has an extra column with random number from 1 to 200. In CTE_Joined we pick a row from Pseudonyms for each User.
Finally we UPDATE the original Users table.
Final solution
WITH
CTE_Users
AS
(
SELECT
U.ID
,U.UserName
,1 + 200 * (CAST(CRYPT_GEN_RANDOM(4) as int) / 4294967295.0 + 0.5) AS rnd
FROM #VarUsers AS U
)
,CTE_Joined
AS
(
SELECT
CTE_Users.ID
,CTE_Users.UserName
,CA.PseudonymName
FROM
CTE_Users
CROSS APPLY
(
SELECT P.PseudonymName
FROM #VarPseudonyms AS P
WHERE P.ID = CAST(CTE_Users.rnd AS int)
) AS CA
)
UPDATE CTE_Joined
SET UserName = PseudonymName;
Results
SELECT * FROM #VarUsers;
ID UserName
1 PseudonymName41
2 PseudonymName132
3 PseudonymName177
...
998 PseudonymName60
999 PseudonymName141
1000 PseudonymName157
SQL Fiddle
A simpler approach:
UPDATE u
SET u.FirstName = p.Name
FROM Users u
CROSS APPLY (
SELECT TOP(1) p.Name
FROM pseudonyms p
WHERE u.Id IS NOT NULL -- must be some unique identifier on Users
ORDER BY NEWID()
) p
Full example from: https://stackoverflow.com/a/36185100/6620329
Update a random Users id into UpdatedBy column of Table01
UPDATE a
SET a.UpdatedBy=b.id
FROM [dbo].[Table01] a
CROSS APPLY (
SELECT
id,
ROW_NUMBER() over(partition by 1 order by NEWID()) RN
FROM Users b
WHERE a.id != b.id
) b
WHERE RN = 1
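The core trick in all of the working approaches is to draw the random key first and only then look the row up. A client-side sketch of the same idea with Python's sqlite3 (table and column names mirror the question; the 200-row pseudonym table is fabricated):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pseudonyms (id INTEGER PRIMARY KEY, firstname TEXT NOT NULL);
CREATE TABLE users (id INTEGER PRIMARY KEY, firstname TEXT);
""")
conn.executemany("INSERT INTO pseudonyms VALUES (?, ?)",
                 [(i, f"Name{i}") for i in range(1, 201)])
conn.executemany("INSERT INTO users (id) VALUES (?)",
                 [(i,) for i in range(1, 51)])

# Draw one random pseudonym id per user row FIRST, then join on it;
# this is the same "generate, then look up" order as the CTE solution.
picks = [(random.randint(1, 200), uid)
         for (uid,) in conn.execute("SELECT id FROM users")]
conn.executemany("""
    UPDATE users
    SET firstname = (SELECT firstname FROM pseudonyms WHERE id = ?)
    WHERE id = ?
""", picks)

names = [n for (n,) in conn.execute("SELECT firstname FROM users")]
print(all(n is not None for n in names))  # True: every row got a pseudonym
```

Note that ABS(CHECKSUM(NEWID())) % 200 yields 0 through 199, so the T-SQL version also needs a + 1 to cover id 200 and avoid id 0.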

Select statement, table sample, equal distribution

Let's assume there is a SQL Server 2008 table like below, that holds 10 million rows.
One of the fields is id; since it is an identity column, it runs from 1 to 10 million.
CREATE TABLE dbo.Stats
(
id INT IDENTITY(1,1) PRIMARY KEY,
field1 INT,
field2 INT,
...
)
Is there an efficient way by doing one select statement to get a subset of this data that satisfies the following requirements:
contains a limited number of rows in the result set, e.g. 100, 200, etc.
provides an even distribution over a certain column (not random), e.g. the id column
So, in our example, if we return 100 rows, the result set would look like this:
Row 1 - 100 000
Row 2 - 200 000
Row 3 - 300 000
...
Row 100 - 10 000 000
I want to avoid using cursor and storing this in a separate table.
Not sure how efficient it's going to be, but the following query will return every 100,000th row (relative to the ordering established by id):
SELECT *
FROM (
SELECT *, ROW_NUMBER() OVER (ORDER BY id) RN
FROM Stats
) T
WHERE RN % 100000 = 0
ORDER BY id
Since it does not rely on actual id values, this will work even if you have "holes" in the sequence of id values.
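A quick way to convince yourself that the ROW_NUMBER() version tolerates holes: the sketch below (Python's sqlite3, fabricated data) deletes every 7th id and still gets picks spaced exactly 100 rows apart:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Stats (id INTEGER PRIMARY KEY)")
# Simulate an identity column with "holes": skip every id divisible by 7,
# leaving 858 of the original 1000 rows.
conn.executemany("INSERT INTO Stats VALUES (?)",
                 [(i,) for i in range(1, 1001) if i % 7 != 0])

# Number the surviving rows, then keep every 100th row number.
rows = conn.execute("""
    SELECT id FROM (
        SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS rn
        FROM Stats
    ) WHERE rn % 100 = 0
    ORDER BY id
""").fetchall()
print(len(rows))  # 8  (row numbers 100, 200, ..., 800)
```

The plain id % 100000 filter, by contrast, would drift or miss rows entirely wherever ids have been deleted.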
Something like this?
SELECT id FROM dbo.Stats WHERE id % 100000 = 0
It should work, since you are saying that id goes from 1 to 10,000,000. If the number of rows is not known, but you know the number of resulting rows you want, then just calculate that 100000 divisor like this (if you would like 100 resulting rows):
SELECT id FROM Stats WHERE id % ((SELECT COUNT(id) FROM Stats) / 100) = 0

Efficient SQL 2000 Query for Selecting Preferred Candy

(I wish I could have come up with a more descriptive title... suggest one or edit this post if you can name the type of query I'm asking about)
Database: SQL Server 2000
Sample Data (assume 500,000 rows):
Name   Candy       PreferenceFactor
Jim    Chocolate   1.0
Brad   Lemon Drop  .9
Brad   Chocolate   .1
Chris  Chocolate   .5
Chris  Candy Cane  .5
499,995 more rows...
Note that the number of rows with a given 'Name' is unbounded.
Desired Query Results:
Jim    Chocolate   1.0
Brad   Lemon Drop  .9
Chris  Chocolate   .5
~250,000 more rows...
(Since Chris has equal preference for Candy Cane and Chocolate, a consistent result is adequate).
Question:
How do I Select Name, Candy from data where each resulting row contains a unique Name such that the Candy selected has the highest PreferenceFactor for each Name. (speedy efficient answers preferred).
What indexes are required on the table? Does it make a difference if Name and Candy are integer indexes into another table (aside from requiring some joins)?
You will find that the following query outperforms every other answer given, as it works with a single scan. This simulates MS Access's First and Last aggregate functions, which is basically what you are doing.
Of course, you'll probably have foreign keys instead of names in your CandyPreference table. To answer your question, it is in fact very much best if Candy and Name are foreign keys into another table.
If there are other columns in the CandyPreferences table, then having a covering index that includes the involved columns will yield even better performance. Making the columns as small as possible will increase the rows per page and again increase performance. If you are most often doing the query with a WHERE condition to restrict rows, then an index that covers the WHERE conditions becomes important.
Peter was on the right track for this, but had some unneeded complexity.
CREATE TABLE #CandyPreference (
[Name] varchar(20),
Candy varchar(30),
PreferenceFactor decimal(11, 10)
)
INSERT #CandyPreference VALUES ('Jim', 'Chocolate', 1.0)
INSERT #CandyPreference VALUES ('Brad', 'Lemon Drop', .9)
INSERT #CandyPreference VALUES ('Brad', 'Chocolate', .1)
INSERT #CandyPreference VALUES ('Chris', 'Chocolate', .5)
INSERT #CandyPreference VALUES ('Chris', 'Candy Cane', .5)
SELECT
[Name],
Candy = Substring(PackedData, 13, 30),
PreferenceFactor = Convert(decimal(11,10), Left(PackedData, 12))
FROM (
SELECT
[Name],
PackedData = Max(Convert(char(12), PreferenceFactor) + Candy)
FROM #CandyPreference
GROUP BY [Name]
) X
DROP TABLE #CandyPreference
I actually don't recommend this method unless performance is critical. The "canonical" way to do it is OrbMan's standard Max/GROUP BY derived table, joined back to get the selected row. That method becomes difficult, though, when several columns participate in the selection of the Max and the final combination of selectors can be duplicated, that is, when there is no column to provide arbitrary uniqueness, as in the case here, where we use the candy name as a tie-breaker when the PreferenceFactor is the same.
Edit: It's probably best to give some more usage notes to help improve clarity and to help people avoid problems.
As a general rule of thumb, when trying to improve query performance, you can do a LOT of extra math if it will save you I/O. Saving an entire table seek or scan speeds up the query substantially, even with all the converts and substrings and so on.
Due to precision and sorting issues, use of a floating point data type is probably a bad idea with this method. Though unless you are dealing with extremely large or small numbers, you shouldn't be using float in your database anyway.
The best data types are those that are not packed and sort in the same order after conversion to binary or char. Datetime, smalldatetime, bigint, int, smallint, and tinyint all convert directly to binary and sort correctly because they are not packed. With binary, avoid left() and right(), use substring() to get the values reliably returned to their originals.
I took advantage of Preference having only one digit in front of the decimal point in this query, allowing conversion straight to char since there is always at least a 0 before the decimal point. If more digits are possible, you would have to decimal-align the converted number so things sort correctly. Easiest might be to multiply your Preference rating so there is no decimal portion, convert to bigint, and then convert to binary(8). In general, conversion between numbers is faster than conversion between char and another data type, especially with date math.
Watch out for nulls. If there are any, you must convert them to something and then back.
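The pack/Max/unpack pattern itself is engine-agnostic; here is a compact sketch with Python's sqlite3, where printf('%06.4f', ...) plays the role of the fixed-width Convert(char(12), ...) (the widths are my choice for this sample data, not from the answer above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CandyPreference (Name TEXT, Candy TEXT, PreferenceFactor REAL)")
conn.executemany("INSERT INTO CandyPreference VALUES (?,?,?)", [
    ('Jim', 'Chocolate', 1.0),
    ('Brad', 'Lemon Drop', 0.9),
    ('Brad', 'Chocolate', 0.1),
    ('Chris', 'Chocolate', 0.5),
    ('Chris', 'Candy Cane', 0.5),
])

# Pack the sort key (zero-padded, fixed width) in front of the payload,
# take MAX of the packed string per group, then slice the payload back out.
rows = conn.execute("""
    SELECT Name,
           substr(Packed, 7)                  AS Candy,
           CAST(substr(Packed, 1, 6) AS REAL) AS PreferenceFactor
    FROM (
        SELECT Name,
               MAX(printf('%06.4f', PreferenceFactor) || Candy) AS Packed
        FROM CandyPreference
        GROUP BY Name
    )
    ORDER BY Name
""").fetchall()
print(rows)
# [('Brad', 'Lemon Drop', 0.9), ('Chris', 'Chocolate', 0.5), ('Jim', 'Chocolate', 1.0)]
```

For Chris the tie on 0.5 is broken by the string comparison of the candy names, exactly the arbitrary-but-consistent behavior the question accepts.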
select c.Name, max(c.Candy) as Candy, max(c.PreferenceFactor) as PreferenceFactor
from Candy c
inner join (
select Name, max(PreferenceFactor) as MaxPreferenceFactor
from Candy
group by Name
) cm on c.Name = cm.Name and c.PreferenceFactor = cm.MaxPreferenceFactor
group by c.Name
order by PreferenceFactor desc, Name
I tried:
SELECT X.PersonName,
(
SELECT TOP 1 Candy
FROM CandyPreferences
WHERE PersonName=X.PersonName AND PreferenceFactor=x.HighestPreference
) AS TopCandy
FROM
(
SELECT PersonName, MAX(PreferenceFactor) AS HighestPreference
FROM CandyPreferences
GROUP BY PersonName
) AS X
This seems to work, though I can't speak to efficiency without real data and a realistic load.
I did create a primary key over PersonName and Candy, though. Using SQL Server 2008 and no additional indexes shows it using two clustered index scans though, so it could be worse.
I played with this a bit more because I needed an excuse to play with the Data Generation Plan capability of "datadude". First, I refactored the one table to have separate tables for candy names and person names. I did this mostly because it allowed me to use the test data generation without having to read the documentation. The schema became:
CREATE TABLE [Candies](
[CandyID] [int] IDENTITY(1,1) NOT NULL,
[Candy] [nvarchar](50) NOT NULL,
CONSTRAINT [PK_Candies] PRIMARY KEY CLUSTERED
(
[CandyID] ASC
),
CONSTRAINT [UC_Candies] UNIQUE NONCLUSTERED
(
[Candy] ASC
)
)
GO
CREATE TABLE [Persons](
[PersonID] [int] IDENTITY(1,1) NOT NULL,
[PersonName] [nvarchar](100) NOT NULL,
CONSTRAINT [PK_Preferences.Persons] PRIMARY KEY CLUSTERED
(
[PersonID] ASC
)
)
GO
CREATE TABLE [CandyPreferences](
[PersonID] [int] NOT NULL,
[CandyID] [int] NOT NULL,
[PrefernceFactor] [real] NOT NULL,
CONSTRAINT [PK_CandyPreferences] PRIMARY KEY CLUSTERED
(
[PersonID] ASC,
[CandyID] ASC
)
)
GO
ALTER TABLE [CandyPreferences]
WITH CHECK ADD CONSTRAINT [FK_CandyPreferences_Candies] FOREIGN KEY([CandyID])
REFERENCES [Candies] ([CandyID])
GO
ALTER TABLE [CandyPreferences]
CHECK CONSTRAINT [FK_CandyPreferences_Candies]
GO
ALTER TABLE [CandyPreferences]
WITH CHECK ADD CONSTRAINT [FK_CandyPreferences_Persons] FOREIGN KEY([PersonID])
REFERENCES [Persons] ([PersonID])
GO
ALTER TABLE [CandyPreferences]
CHECK CONSTRAINT [FK_CandyPreferences_Persons]
GO
The query became:
SELECT P.PersonName, C.Candy
FROM (
SELECT X.PersonID,
(
SELECT TOP 1 CandyID
FROM CandyPreferences
WHERE PersonID=X.PersonID AND PrefernceFactor=x.HighestPreference
) AS TopCandy
FROM
(
SELECT PersonID, MAX(PrefernceFactor) AS HighestPreference
FROM CandyPreferences
GROUP BY PersonID
) AS X
) AS Y
INNER JOIN Persons P ON Y.PersonID = P.PersonID
INNER JOIN Candies C ON Y.TopCandy = C.CandyID
With 150,000 candies, 200,000 persons, and 500,000 CandyPreferences, the query took about 12 seconds and produced 200,000 rows.
The following result surprised me. I changed the query to remove the final "pretty" joins:
SELECT X.PersonID,
(
SELECT TOP 1 CandyID
FROM CandyPreferences
WHERE PersonID=X.PersonID AND PrefernceFactor=x.HighestPreference
) AS TopCandy
FROM
(
SELECT PersonID, MAX(PrefernceFactor) AS HighestPreference
FROM CandyPreferences
GROUP BY PersonID
) AS X
This now takes two or three seconds for 200,000 rows.
Now, to be clear, nothing I've done here has been meant to improve the performance of this query: I considered 12 seconds to be a success. It now says it spends 90% of its time in a clustered index seek.
Comment on Emtucifor's solution (as I can't make regular comments):
I like this solution, but I have some comments on how it could be improved (in this specific case).
Not much can be done if you have everything in one table, but having a few tables, as in John Saunders' solution, makes things a bit different.
As we are dealing with numbers in the [CandyPreferences] table, we can use a math operation instead of concatenation to get the max value.
I suggest PreferenceFactor be decimal instead of real, as I believe we don't need the size of the real data type here; further, I would suggest decimal(n,n) with n<10, so that only the decimal part is stored, in 5 bytes. Assuming decimal(3,3) is enough (1,000 levels of preference factor), we can do simply
PackedData = Max(PreferenceFactor + CandyID)
Further, if we know we have fewer than 1,000,000 CandyIDs, we can add a cast:
PackedData = Max(Cast(PreferenceFactor + CandyID as decimal(9,3)))
allowing SQL Server to use 5 bytes in the temporary table.
Unpacking is easy and fast using the floor function.
Niikola
-- ADDED LATER ---
I tested both solutions, John's and Emtucifor's (modified to use John's structure and my suggestions). I also tested with and without joins.
Emtucifor's solution clearly wins, but the margins are not huge. It could be different if SQL Server had to perform some physical reads, but they were 0 in all cases.
Here are the queries:
SELECT
[PersonID],
CandyID = Floor(PackedData),
PreferenceFactor = Cast(PackedData-Floor(PackedData) as decimal(3,3))
FROM (
SELECT
[PersonID],
PackedData = Max(Cast([PrefernceFactor] + [CandyID] as decimal(9,3)))
FROM [z5CandyPreferences] With (NoLock)
GROUP BY [PersonID]
) X
SELECT X.PersonID,
(
SELECT TOP 1 CandyID
FROM z5CandyPreferences
WHERE PersonID=X.PersonID AND PrefernceFactor=x.HighestPreference
) AS TopCandy,
HighestPreference as PreferenceFactor
FROM
(
SELECT PersonID, MAX(PrefernceFactor) AS HighestPreference
FROM z5CandyPreferences
GROUP BY PersonID
) AS X
Select p.PersonName,
c.Candy,
y.PreferenceFactor
From z5Persons p
Inner Join (SELECT [PersonID],
CandyID = Floor(PackedData),
PreferenceFactor = Cast(PackedData-Floor(PackedData) as decimal(3,3))
FROM ( SELECT [PersonID],
PackedData = Max(Cast([PrefernceFactor] + [CandyID] as decimal(9,3)))
FROM [z5CandyPreferences] With (NoLock)
GROUP BY [PersonID]
) X
) Y on p.PersonId = Y.PersonId
Inner Join z5Candies c on c.CandyId=Y.CandyId
Select p.PersonName,
c.Candy,
y.PreferenceFactor
From z5Persons p
Inner Join (SELECT X.PersonID,
( SELECT TOP 1 cp.CandyId
FROM z5CandyPreferences cp
WHERE PersonID=X.PersonID AND cp.[PrefernceFactor]=X.HighestPreference
) CandyId,
HighestPreference as PreferenceFactor
FROM ( SELECT PersonID,
MAX(PrefernceFactor) AS HighestPreference
FROM z5CandyPreferences
GROUP BY PersonID
) AS X
) AS Y on p.PersonId = Y.PersonId
Inner Join z5Candies as c on c.CandyID=Y.CandyId
And the results:
TableName nRows
------------------ -------
z5Persons 200,000
z5Candies 150,000
z5CandyPreferences 497,445
Query Rows Affected CPU time Elapsed time
--------------------------- ------------- -------- ------------
Emtucifor (no joins) 183,289 531 ms 3,122 ms
John Saunders (no joins) 183,289 1,266 ms 2,918 ms
Emtucifor (with joins) 183,289 1,031 ms 3,990 ms
John Saunders (with joins) 183,289 2,406 ms 4,343 ms
Emtucifor (no joins)
--------------------------------------------
Table Scan count logical reads
------------------- ---------- -------------
z5CandyPreferences 1 2,022
John Saunders (no joins)
--------------------------------------------
Table Scan count logical reads
------------------- ---------- -------------
z5CandyPreferences 183,290 587,677
Emtucifor (with joins)
--------------------------------------------
Table Scan count logical reads
------------------- ---------- -------------
Worktable 0 0
z5Candies 1 526
z5CandyPreferences 1 2,022
z5Persons 1 733
John Saunders (with joins)
--------------------------------------------
Table Scan count logical reads
------------------- ---------- -------------
z5CandyPreferences 183292 587,912
z5Persons 3 802
Worktable 0 0
z5Candies 3 559
Worktable 0 0
You could use the following select statements:
select Name,Candy,PreferenceFactor
from candyTable ct
where PreferenceFactor =
(select max(PreferenceFactor)
from candyTable where ct.Name = Name)
But with this select you will get "Chris" twice in your result set.
If you want to get the most preferred candy per user, then use
select top 1 Name,Candy,PreferenceFactor
from candyTable ct
where name = @name
and PreferenceFactor=
(select max([PreferenceFactor])
from candyTable where name = @name )
I think changing Name and Candy to integer types might help you improve performance. You should also add indexes on both columns.
SELECT Name, Candy, PreferenceFactor
FROM table AS a
WHERE NOT EXISTS(SELECT * FROM table AS b
WHERE b.Name = a.Name
AND (b.PreferenceFactor > a.PreferenceFactor OR (b.PreferenceFactor = a.PreferenceFactor AND b.Candy > a.Candy)))
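This NOT EXISTS variant is easy to check mechanically; a sketch with Python's sqlite3 on the sample rows (using CandyPreference as the table name, as elsewhere in the thread):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CandyPreference (Name TEXT, Candy TEXT, PreferenceFactor REAL)")
conn.executemany("INSERT INTO CandyPreference VALUES (?,?,?)", [
    ('Jim', 'Chocolate', 1.0), ('Brad', 'Lemon Drop', 0.9),
    ('Brad', 'Chocolate', 0.1), ('Chris', 'Chocolate', 0.5),
    ('Chris', 'Candy Cane', 0.5),
])

# Keep a row only if no other row with the same Name beats it on
# (PreferenceFactor, Candy); the Candy comparison breaks the Chris tie.
rows = conn.execute("""
    SELECT Name, Candy, PreferenceFactor
    FROM CandyPreference a
    WHERE NOT EXISTS (
        SELECT 1 FROM CandyPreference b
        WHERE b.Name = a.Name
          AND (b.PreferenceFactor > a.PreferenceFactor
               OR (b.PreferenceFactor = a.PreferenceFactor
                   AND b.Candy > a.Candy))
    )
    ORDER BY Name
""").fetchall()
print(rows)
# [('Brad', 'Lemon Drop', 0.9), ('Chris', 'Chocolate', 0.5), ('Jim', 'Chocolate', 1.0)]
```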
select name, candy, max(preference)
from tablename
where candy = @candy
group by name, candy
order by name, candy
Usually, indexing is required on columns that are frequently included in the WHERE clause. In this case, I would say indexes on the name and candy columns would be the highest priority.
Whether to have lookup tables for columns usually depends on the number of repeating values within a column. Out of 250,000 rows, if there are only 50 distinct values repeating, you really need an integer reference (foreign key) there. In this case, a candy reference should be used; a name reference really depends on the number of distinct people in the database.
I changed your column Name to PersonName to avoid any common reserved word conflicts.
SELECT PersonName, MAX(Candy) AS PreferredCandy, MAX(PreferenceFactor) AS Factor
FROM CandyPreference
GROUP BY PersonName
ORDER BY Factor DESC
SELECT d.Name, a.Candy, d.MaxPref
FROM myTable a, (SELECT Name, MAX(PreferenceFactor) AS MaxPref FROM myTable GROUP BY Name) as D
WHERE a.Name = d.Name AND a.PreferenceFactor = d.MaxPref
This should give you rows with matching PrefFactor for a given Name.
(e.g. if John has a HighPref of 1 for Lemon & Chocolate).
Pardon my answer as I am writing it without SQL Query Analyzer.
Something like this would work:
select name
, candy = substring(preference,7,len(preference))
-- convert back to float/numeric
, factor = convert(float,substring(preference,1,5))/10
from (
select name,
preference = (
select top 1
-- convert from float/numeric to zero-padded fixed-width string
right('00000'+convert(varchar,convert(decimal(5,0),preferencefactor*10)),5)
+ ';' + candy
from candyTable b
where a.name = b.name
order by
preferencefactor desc
, candy
)
from (select distinct name from candyTable) a
) a
Performance should be decent with this method. Check your query plan.
TOP 1 ... ORDER BY in a correlated subquery allows us to specify arbitrary rules for which row we want returned per row in the outer query. In this case, we want the highest preference factor per name, with candy for tie-breaks.
Subqueries can only return one value, so we must combine candy and preference factor into one field. The semicolon is just for readability here, but in other cases, you might use it to parse the combined field with CHARINDEX in the outer query.
If you wanted full precision in the output, you could use this instead (assuming preferencefactor is a float):
convert(varchar,preferencefactor) + ';' + candy
And then parse it back with:
factor = convert(float,substring(preference,1,charindex(';',preference)-1))
candy = substring(preference,charindex(';',preference)+1,len(preference))
I also tested a ROW_NUMBER() version and added an additional index:
Create index IX_z5CandyPreferences On z5CandyPreferences(PersonId, PrefernceFactor, CandyID)
Response times between Emtucifor's and the ROW_NUMBER() version (with the index in place) are marginal, if any; the test should be repeated a number of times and averaged, but I do not expect any significant difference.
Here is query:
Select p.PersonName,
c.Candy,
y.PrefernceFactor
From z5Persons p
Inner Join (Select * from (Select cp.PersonId,
cp.CandyId,
cp.PrefernceFactor,
ROW_NUMBER() over (Partition by cp.PersonId Order by cp.PrefernceFactor desc, cp.CandyId ) as hp
From z5CandyPreferences cp) X
Where hp=1) Y on p.PersonId = Y.PersonId
Inner Join z5Candies c on c.CandyId=Y.CandyId
And the results with and without the new index:
| Without index | With Index
----------------------------------------------
Query (Aff.Rows 183,290) |CPU time Elapsed time | CPU time Elapsed time
-------------------------- |-------- ------------ | -------- ------------
Emtucifor (with joins) |1,031 ms 3,990 ms | 890 ms 3,758 ms
John Saunders (with joins) |2,406 ms 4,343 ms | 1,735 ms 3,414 ms
ROW_NUMBER() (with joins) |2,094 ms 4,888 ms | 953 ms 3,900 ms.
Emtucifor (with joins) Without index | With Index
-----------------------------------------------------------------------
Table |Scan count logical reads | Scan count logical reads
-------------------|---------- ------------- | ---------- -------------
Worktable | 0 0 | 0 0
z5Candies | 1 526 | 1 526
z5CandyPreferences | 1 2,022 | 1 990
z5Persons | 1 733 | 1 733
John Saunders (with joins) Without index | With Index
-----------------------------------------------------------------------
Table |Scan count logical reads | Scan count logical reads
-------------------|---------- ------------- | ---------- -------------
z5CandyPreferences | 183292 587,912 | 183,290 585,570
z5Persons | 3 802 | 1 733
Worktable | 0 0 | 0 0
z5Candies | 3 559 | 1 526
Worktable | 0 0 | - -
ROW_NUMBER() (with joins) Without index | With Index
-----------------------------------------------------------------------
Table |Scan count logical reads | Scan count logical reads
-------------------|---------- ------------- | ---------- -------------
z5CandyPreferences | 3 2233 | 1 990
z5Persons | 3 802 | 1 733
z5Candies | 3 559 | 1 526
Worktable | 0 0 | 0 0