Efficient SQL 2000 Query for Selecting Preferred Candy

(I wish I could have come up with a more descriptive title... suggest one or edit this post if you can name the type of query I'm asking about)
Database: SQL Server 2000
Sample Data (assume 500,000 rows):
Name   Candy       PreferenceFactor
-----  ----------  ----------------
Jim    Chocolate   1.0
Brad   Lemon Drop  .9
Brad   Chocolate   .1
Chris  Chocolate   .5
Chris  Candy Cane  .5
499,995 more rows...
Note that the number of rows with a given 'Name' is unbounded.
Desired Query Results:
Name   Candy       PreferenceFactor
-----  ----------  ----------------
Jim    Chocolate   1.0
Brad   Lemon Drop  .9
Chris  Chocolate   .5
~250,000 more rows...
(Since Chris has equal preference for Candy Cane and Chocolate, a consistent result is adequate).
Question:
How do I Select Name, Candy from the data such that each resulting row contains a unique Name and the Candy selected has the highest PreferenceFactor for that Name? (Speedy, efficient answers preferred.)
What indexes are required on the table? Does it make a difference if Name and Candy are integer indexes into another table (aside from requiring some joins)?

You will find that the following query outperforms every other answer given, as it works with a single scan. It simulates MS Access's First and Last aggregate functions, which is essentially what you are doing.
Of course, you'll probably have foreign keys instead of names in your CandyPreference table. To answer your question: yes, it is very much best if Candy and Name are foreign keys into other tables.
If there are other columns in the CandyPreference table, then a covering index that includes the involved columns will yield even better performance. Making the columns as small as possible increases the rows per page and again increases performance. If you most often run the query with a WHERE condition to restrict rows, then an index that covers the WHERE conditions becomes important.
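For example, on SQL 2000 (which has no INCLUDE clause, so all columns go into the index key) such a covering index might look like the following sketch; the index name is made up:
-- Covering index sketch: the GROUP BY [Name] query below can then be
-- satisfied entirely from the index leaf pages, never touching the table.
CREATE INDEX IX_CandyPreference_Cover
   ON CandyPreference ([Name], PreferenceFactor, Candy)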
Peter was on the right track for this, but had some unneeded complexity.
CREATE TABLE #CandyPreference (
   [Name] varchar(20),
   Candy varchar(30),
   PreferenceFactor decimal(11, 10)
)

INSERT #CandyPreference VALUES ('Jim', 'Chocolate', 1.0)
INSERT #CandyPreference VALUES ('Brad', 'Lemon Drop', .9)
INSERT #CandyPreference VALUES ('Brad', 'Chocolate', .1)
INSERT #CandyPreference VALUES ('Chris', 'Chocolate', .5)
INSERT #CandyPreference VALUES ('Chris', 'Candy Cane', .5)

SELECT
   [Name],
   Candy = Substring(PackedData, 13, 30),
   PreferenceFactor = Convert(decimal(11,10), Left(PackedData, 12))
FROM (
   SELECT
      [Name],
      PackedData = Max(Convert(char(12), PreferenceFactor) + Candy)
   FROM #CandyPreference
   GROUP BY [Name]
) X

DROP TABLE #CandyPreference
I actually don't recommend this method unless performance is critical. The "canonical" way to do it is OrbMan's standard Max/GROUP BY derived table, joined back to the base table to get the selected row. That method starts to become difficult, though, when several columns participate in the selection of the Max and the final combination of selectors can be duplicated, that is, when there is no column to provide arbitrary uniqueness, as in the case here where we use the candy name as the tiebreaker when the PreferenceFactor is the same.
Edit: It's probably best to give some more usage notes to help improve clarity and to help people avoid problems.
As a general rule of thumb, when trying to improve query performance, you can do a LOT of extra math if it saves you I/O. Saving an entire table seek or scan speeds up the query substantially, even with all the converts and substrings and so on.
Due to precision and sorting issues, a floating-point data type is probably a bad idea with this method. Though unless you are dealing with extremely large or small numbers, you shouldn't be using float in your database anyway.
The best data types are those that are not packed and that sort in the same order after conversion to binary or char. Datetime, smalldatetime, bigint, int, smallint, and tinyint all convert directly to binary and sort correctly because they are not packed. With binary, avoid Left() and Right(); use Substring() to get the values reliably returned to their originals.
I took advantage of PreferenceFactor having only one digit in front of the decimal point in this query, allowing conversion straight to char, since there is always at least a 0 before the decimal point. If more digits are possible, you would have to decimal-align the converted number so things sort correctly. The easiest approach might be to multiply your preference rating so there is no decimal portion, convert to bigint, and then convert to binary(8). In general, conversion between numbers is faster than conversion between char and another data type, especially with date math.
Watch out for NULLs. If there are any, you must convert them to something and then back.
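For illustration, here is a minimal sketch of the bigint/binary(8) packing described above, run against the demo table from this answer. The 10^10 multiplier (matching decimal(11,10)) and the char(30) padding are assumptions of this sketch, not code from the original answer:
SELECT
   [Name],
   Candy = RTrim(Convert(char(30), Substring(PackedData, 9, 30))),
   PreferenceFactor = Convert(decimal(11,10),
      Convert(bigint, Substring(PackedData, 1, 8)) / 10000000000.0)
FROM (
   SELECT
      [Name],
      -- scale the factor to a whole number, pack it big-endian in binary(8),
      -- then append the space-padded candy; non-negative bigints sort
      -- correctly in their binary form
      PackedData = Max(Convert(binary(8), Convert(bigint, PreferenceFactor * 10000000000))
                     + Convert(binary(30), Convert(char(30), Candy)))
   FROM #CandyPreference
   GROUP BY [Name]
) X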

select c.Name, max(c.Candy) as Candy, max(c.PreferenceFactor) as PreferenceFactor
from Candy c
inner join (
    select Name, max(PreferenceFactor) as MaxPreferenceFactor
    from Candy
    group by Name
) cm on c.Name = cm.Name and c.PreferenceFactor = cm.MaxPreferenceFactor
group by c.Name
order by PreferenceFactor desc, Name

I tried:
SELECT X.PersonName,
       (
           SELECT TOP 1 Candy
           FROM CandyPreferences
           WHERE PersonName = X.PersonName AND PreferenceFactor = X.HighestPreference
       ) AS TopCandy
FROM
(
    SELECT PersonName, MAX(PreferenceFactor) AS HighestPreference
    FROM CandyPreferences
    GROUP BY PersonName
) AS X
This seems to work, though I can't speak to efficiency without real data and a realistic load.
I did create a primary key over PersonName and Candy, though. With SQL Server 2008 and no additional indexes, the plan shows two clustered index scans, so it could be worse.
I played with this a bit more, because I needed an excuse to play with the Data Generation Plan capability of "datadude". First, I refactored the one table into separate tables for candy names and person names. I did this mostly because it allowed me to use the test data generation without having to read the documentation. The schema became:
CREATE TABLE [Candies](
[CandyID] [int] IDENTITY(1,1) NOT NULL,
[Candy] [nvarchar](50) NOT NULL,
CONSTRAINT [PK_Candies] PRIMARY KEY CLUSTERED
(
[CandyID] ASC
),
CONSTRAINT [UC_Candies] UNIQUE NONCLUSTERED
(
[Candy] ASC
)
)
GO
CREATE TABLE [Persons](
[PersonID] [int] IDENTITY(1,1) NOT NULL,
[PersonName] [nvarchar](100) NOT NULL,
CONSTRAINT [PK_Preferences.Persons] PRIMARY KEY CLUSTERED
(
[PersonID] ASC
)
)
GO
CREATE TABLE [CandyPreferences](
[PersonID] [int] NOT NULL,
[CandyID] [int] NOT NULL,
[PrefernceFactor] [real] NOT NULL,
CONSTRAINT [PK_CandyPreferences] PRIMARY KEY CLUSTERED
(
[PersonID] ASC,
[CandyID] ASC
)
)
GO
ALTER TABLE [CandyPreferences]
WITH CHECK ADD CONSTRAINT [FK_CandyPreferences_Candies] FOREIGN KEY([CandyID])
REFERENCES [Candies] ([CandyID])
GO
ALTER TABLE [CandyPreferences]
CHECK CONSTRAINT [FK_CandyPreferences_Candies]
GO
ALTER TABLE [CandyPreferences]
WITH CHECK ADD CONSTRAINT [FK_CandyPreferences_Persons] FOREIGN KEY([PersonID])
REFERENCES [Persons] ([PersonID])
GO
ALTER TABLE [CandyPreferences]
CHECK CONSTRAINT [FK_CandyPreferences_Persons]
GO
The query became:
SELECT P.PersonName, C.Candy
FROM (
SELECT X.PersonID,
(
SELECT TOP 1 CandyID
FROM CandyPreferences
WHERE PersonID=X.PersonID AND PrefernceFactor=x.HighestPreference
) AS TopCandy
FROM
(
SELECT PersonID, MAX(PrefernceFactor) AS HighestPreference
FROM CandyPreferences
GROUP BY PersonID
) AS X
) AS Y
INNER JOIN Persons P ON Y.PersonID = P.PersonID
INNER JOIN Candies C ON Y.TopCandy = C.CandyID
With 150,000 candies, 200,000 persons, and 500,000 CandyPreferences, the query took about 12 seconds and produced 200,000 rows.
The following result surprised me. I changed the query to remove the final "pretty" joins:
SELECT X.PersonID,
(
SELECT TOP 1 CandyID
FROM CandyPreferences
WHERE PersonID=X.PersonID AND PrefernceFactor=x.HighestPreference
) AS TopCandy
FROM
(
SELECT PersonID, MAX(PrefernceFactor) AS HighestPreference
FROM CandyPreferences
GROUP BY PersonID
) AS X
This now takes two or three seconds for 200,000 rows.
Now, to be clear, nothing I've done here was meant to improve the performance of this query: I considered 12 seconds a success. The plan now shows 90% of the time spent in a clustered index seek.

Comment on Emtucifor's solution (as I can't make regular comments)
I like this solution, but have some comments on how it could be improved (in this specific case).
Not much can be done if you have everything in one table, but having a few tables, as in John Saunders' solution, makes things a bit different.
Since we are dealing with numbers in the [CandyPreferences] table, we can use a math operation instead of concatenation to get the max value.
I suggest making PreferenceFactor decimal instead of real, as I believe we don't need the size of the real data type here; going further, I would suggest decimal(n,n) with n < 10, so only the decimal part is stored, in 5 bytes. Assuming decimal(3,3) is enough (1,000 levels of preference factor), we can do simply:
PackedData = Max(PreferenceFactor + CandyID)
Further, if we know we have fewer than 1,000,000 CandyIDs, we can add a cast:
PackedData = Max(Cast(PreferenceFactor + CandyID as decimal(9,3)))
allowing SQL Server to use 5 bytes per value in the temporary table.
Unpacking is easy and fast using the Floor() function.
Niikola
-- ADDED LATER ---
I tested both solutions, John's and Emtucifor's (modified to use John's structure and my suggestions). I also tested with and without joins.
Emtucifor's solution clearly wins, but the margins are not huge. It could be different if SQL Server had to perform some physical reads, but they were 0 in all cases.
Here are the queries:
SELECT
[PersonID],
CandyID = Floor(PackedData),
PreferenceFactor = Cast(PackedData-Floor(PackedData) as decimal(3,3))
FROM (
SELECT
[PersonID],
PackedData = Max(Cast([PrefernceFactor] + [CandyID] as decimal(9,3)))
FROM [z5CandyPreferences] With (NoLock)
GROUP BY [PersonID]
) X
SELECT X.PersonID,
(
SELECT TOP 1 CandyID
FROM z5CandyPreferences
WHERE PersonID=X.PersonID AND PrefernceFactor=x.HighestPreference
) AS TopCandy,
HighestPreference as PreferenceFactor
FROM
(
SELECT PersonID, MAX(PrefernceFactor) AS HighestPreference
FROM z5CandyPreferences
GROUP BY PersonID
) AS X
Select p.PersonName,
c.Candy,
y.PreferenceFactor
From z5Persons p
Inner Join (SELECT [PersonID],
CandyID = Floor(PackedData),
PreferenceFactor = Cast(PackedData-Floor(PackedData) as decimal(3,3))
FROM ( SELECT [PersonID],
PackedData = Max(Cast([PrefernceFactor] + [CandyID] as decimal(9,3)))
FROM [z5CandyPreferences] With (NoLock)
GROUP BY [PersonID]
) X
) Y on p.PersonId = Y.PersonId
Inner Join z5Candies c on c.CandyId=Y.CandyId
Select p.PersonName,
c.Candy,
y.PreferenceFactor
From z5Persons p
Inner Join (SELECT X.PersonID,
( SELECT TOP 1 cp.CandyId
FROM z5CandyPreferences cp
WHERE PersonID=X.PersonID AND cp.[PrefernceFactor]=X.HighestPreference
) CandyId,
HighestPreference as PreferenceFactor
FROM ( SELECT PersonID,
MAX(PrefernceFactor) AS HighestPreference
FROM z5CandyPreferences
GROUP BY PersonID
) AS X
) AS Y on p.PersonId = Y.PersonId
Inner Join z5Candies as c on c.CandyID=Y.CandyId
And the results:
TableName nRows
------------------ -------
z5Persons 200,000
z5Candies 150,000
z5CandyPreferences 497,445
Query Rows Affected CPU time Elapsed time
--------------------------- ------------- -------- ------------
Emtucifor (no joins) 183,289 531 ms 3,122 ms
John Saunders (no joins) 183,289 1,266 ms 2,918 ms
Emtucifor (with joins) 183,289 1,031 ms 3,990 ms
John Saunders (with joins) 183,289 2,406 ms 4,343 ms
Emtucifor (no joins)
--------------------------------------------
Table Scan count logical reads
------------------- ---------- -------------
z5CandyPreferences 1 2,022
John Saunders (no joins)
--------------------------------------------
Table Scan count logical reads
------------------- ---------- -------------
z5CandyPreferences 183,290 587,677
Emtucifor (with joins)
--------------------------------------------
Table Scan count logical reads
------------------- ---------- -------------
Worktable 0 0
z5Candies 1 526
z5CandyPreferences 1 2,022
z5Persons 1 733
John Saunders (with joins)
--------------------------------------------
Table Scan count logical reads
------------------- ---------- -------------
z5CandyPreferences 183292 587,912
z5Persons 3 802
Worktable 0 0
z5Candies 3 559
Worktable 0 0

You could use the following SELECT statement:
select Name, Candy, PreferenceFactor
from candyTable ct
where PreferenceFactor =
    (select max(PreferenceFactor)
     from candyTable where ct.Name = Name)
but with this select you will get "Chris" twice in your result set.
If you want to get the most preferred candy for a given user, then use:
select top 1 Name, Candy, PreferenceFactor
from candyTable ct
where name = #name
and PreferenceFactor =
    (select max([PreferenceFactor])
     from candyTable where name = #name)
I think changing Name and Candy to integer types might help you improve performance. You should also add indexes on both columns.
[Edit] changed ! to #

SELECT Name, Candy, PreferenceFactor
FROM table AS a
WHERE NOT EXISTS (SELECT * FROM table AS b
                  WHERE b.Name = a.Name
                  AND (b.PreferenceFactor > a.PreferenceFactor
                       OR (b.PreferenceFactor = a.PreferenceFactor AND b.Candy > a.Candy)))

select name, candy, max(preference)
from tablename
where candy = #candy
group by name, candy
order by name, candy
Usually, indexing is required on columns that are frequently included in the WHERE clause. In this case I would say indexes on the name and candy columns would be the highest priority.
Having lookup tables for columns usually depends on the number of repeating values within a column. Out of 250,000 rows, if there are only 50 distinct values repeating, you really need an integer reference (foreign key) there. In this case the candy reference should be done, and the name reference really depends on the number of distinct people within the database.
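As a sketch of those two suggestions against the one-table layout from the question (index names made up):
-- supports the correlated max(PreferenceFactor) lookup per name
create index ix_candyTable_name_pref on candyTable (name, PreferenceFactor)
-- supports filtering on candy if it becomes a foreign key lookup
create index ix_candyTable_candy on candyTable (candy)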

I changed your column Name to PersonName to avoid any common reserved word conflicts.
SELECT PersonName, MAX(Candy) AS PreferredCandy, MAX(PreferenceFactor) AS Factor
FROM CandyPreference
GROUP BY PersonName
ORDER BY Factor DESC

SELECT d.Name, a.Candy, d.MaxPref
FROM myTable a, (SELECT Name, MAX(PreferenceFactor) AS MaxPref
                 FROM myTable
                 GROUP BY Name) AS d
WHERE a.Name = d.Name AND a.PreferenceFactor = d.MaxPref
This should give you all rows whose PreferenceFactor matches the maximum for a given Name
(e.g. both rows if John has a MaxPref of 1 for Lemon & Chocolate).
Pardon my answer, as I am writing it without SQL Query Analyzer.

Something like this would work:
select name
, candy = substring(preference,7,len(preference))
-- convert back to float/numeric
, factor = convert(float,substring(preference,1,5))/10
from (
select name,
preference = (
select top 1
-- convert from float/numeric to zero-padded fixed-width string
right('00000'+convert(varchar,convert(decimal(5,0),preferencefactor*10)),5)
+ ';' + candy
from candyTable b
where a.name = b.name
order by
preferencefactor desc
, candy
)
from (select distinct name from candyTable) a
) a
Performance should be decent with this method. Check your query plan.
TOP 1 ... ORDER BY in a correlated subquery allows us to specify arbitrary rules for which row we want returned per row in the outer query. In this case, we want the highest preference factor per name, with candy for tie-breaks.
Subqueries can only return one value, so we must combine candy and preference factor into one field. The semicolon is just for readability here, but in other cases, you might use it to parse the combined field with CHARINDEX in the outer query.
If you wanted full precision in the output, you could use this instead (assuming preferencefactor is a float):
convert(varchar,preferencefactor) + ';' + candy
And then parse it back with:
factor = convert(float,substring(preference,1,charindex(';',preference)-1))
candy = substring(preference,charindex(';',preference)+1,len(preference))

I also tested a ROW_NUMBER() version and added an additional index:
Create index IX_z5CandyPreferences On z5CandyPreferences(PersonId, PrefernceFactor, CandyID)
Response times between Emtucifor's version and the ROW_NUMBER() version (with the index in place) are marginal, if any; the test should be repeated a number of times and averaged, but I don't expect any significant difference.
Here is the query:
Select p.PersonName,
       c.Candy,
       y.PrefernceFactor
From z5Persons p
Inner Join (Select * from (Select cp.PersonId,
                                  cp.CandyId,
                                  cp.PrefernceFactor,
                                  ROW_NUMBER() over (Partition by cp.PersonId
                                                     Order by cp.PrefernceFactor desc, cp.CandyId) as hp
                           From z5CandyPreferences cp) X
            Where hp = 1) Y on p.PersonId = Y.PersonId
Inner Join z5Candies c on c.CandyId = Y.CandyId
and the results with and without the new index:
                           | Without index          | With index
---------------------------|------------------------|-----------------------
Query (Aff. rows 183,290)  | CPU time  Elapsed time | CPU time  Elapsed time
---------------------------|------------------------|-----------------------
Emtucifor (with joins)     | 1,031 ms  3,990 ms     | 890 ms    3,758 ms
John Saunders (with joins) | 2,406 ms  4,343 ms     | 1,735 ms  3,414 ms
ROW_NUMBER() (with joins)  | 2,094 ms  4,888 ms     | 953 ms    3,900 ms
Emtucifor (with joins) Without index | With Index
-----------------------------------------------------------------------
Table |Scan count logical reads | Scan count logical reads
-------------------|---------- ------------- | ---------- -------------
Worktable | 0 0 | 0 0
z5Candies | 1 526 | 1 526
z5CandyPreferences | 1 2,022 | 1 990
z5Persons | 1 733 | 1 733
John Saunders (with joins) Without index | With Index
-----------------------------------------------------------------------
Table |Scan count logical reads | Scan count logical reads
-------------------|---------- ------------- | ---------- -------------
z5CandyPreferences | 183292 587,912 | 183,290 585,570
z5Persons | 3 802 | 1 733
Worktable | 0 0 | 0 0
z5Candies | 3 559 | 1 526
Worktable | 0 0 | - -
ROW_NUMBER() (with joins) Without index | With Index
-----------------------------------------------------------------------
Table |Scan count logical reads | Scan count logical reads
-------------------|---------- ------------- | ---------- -------------
z5CandyPreferences | 3 2233 | 1 990
z5Persons | 3 802 | 1 733
z5Candies | 3 559 | 1 526
Worktable | 0 0 | 0 0

Related

First name should randomly match with other FIRST name

All first names should be randomly matched with each other, and when I run the query again, the first names should be matched with different names - not the same matches as the first time.
For example I have 6 records in one table ...
First name column looks like:
JHON
LEE
SAM
HARRY
JIM
KRUK
So I want a result like:
First name1   First name2
-----------   -----------
JHON          HARRY
LEE           KRUK
HARRY         SAM
The simplest solution is to first randomly sort the records, then calculate the grouping and a sequence number within the group and then finally select out the groups as rows.
You can follow along with the logic in this fiddle: https://dbfiddle.uk/9JlK59w4
DECLARE @Sorted TABLE
(
    Id INT PRIMARY KEY,
    FirstName varchar(30),
    RowNum INT IDENTITY(1,1)
);

INSERT INTO @Sorted (Id, FirstName)
SELECT Id, FirstName
FROM People
ORDER BY NEWID();
WITH Pairs as
(
SELECT *
, (RowNum+1)/2 as PairNum
, RowNum % 2 as Ordinal
FROM @Sorted
)
SELECT
Person1.FirstName as [First name1], Person2.FirstName as [First name2]
FROM Pairs Person1
LEFT JOIN Pairs Person2 ON Person1.PairNum = Person2.PairNum AND Person2.Ordinal = 1
WHERE Person1.Ordinal = 0
ORDER BY Person1.PairNum
ORDER BY NEWID() is used here to sort the records randomly. Note that it is indeterminate and will return a new ordering with each execution. It's not very efficient, but it is suitable for our requirement.
You can't easily use CTEs to produce lists of randomly sorted records, because the result of a CTE is not cached: each time the CTE is referenced in the subsequent logic, the expression can be re-evaluated. Run this fiddle a few times and watch how it often allocates the names incorrectly: https://dbfiddle.uk/rpPdkkAG
Due to the volatility of NEWID(), this example stores the results in a table-valued variable. For a very large list of records, a temporary table might be more efficient.
PairNum uses simple divide-by-n logic to assign a group number for groups of length n.
It is necessary to add 1 to RowNum because integer math rounds down; see this in action in the fiddle.
Ordinal applies modulo to RowNum and is the value we use to differentiate between Person 1 and Person 2 in the pair. This helps keep the rest of the logic determinate.
In the final SELECT, we select first from the Pairs that have an Ordinal of 0, then join on the Pairs that have an Ordinal of 1, matching by PairNum.
You can see in the fiddle that I added a solution using groups of 3 to show how this can easily be extended to larger groupings; a sketch follows below.
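The fiddle has the exact code; below is a rough sketch of the groups-of-3 variant, extrapolated from the pairing logic above rather than copied from the fiddle:
WITH Triples AS
(
    SELECT *
         , (RowNum + 2) / 3 AS GroupNum  -- integer division by n = 3
         , RowNum % 3       AS Ordinal   -- 1, 2, 0 within each group
    FROM @Sorted
)
SELECT Person1.FirstName AS [First name1]
     , Person2.FirstName AS [First name2]
     , Person3.FirstName AS [First name3]
FROM Triples Person1
LEFT JOIN Triples Person2 ON Person1.GroupNum = Person2.GroupNum AND Person2.Ordinal = 2
LEFT JOIN Triples Person3 ON Person1.GroupNum = Person3.GroupNum AND Person3.Ordinal = 0
WHERE Person1.Ordinal = 1
ORDER BY Person1.GroupNum;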

Speeding up partitioning query on ancient SQL Server version

The Setup
I've got performance and conceptual problems with getting a query right on SQL Server 7, running on a dual-core 2 GHz machine with 2 GB RAM - no chance of getting that out of the way, as you might expect :-/.
The Situation
I'm working with a legacy database and I need to mine for data to get various insights. I've got the all_stats table that contains all the stat data for a thingy in a specific context. These contexts are grouped with the help of the group_contexts table. A simplified schema:
+--------------------------------------------------------------------+
| thingies |
+--------------------------------------------------------------------|
| id | INT PRIMARY KEY IDENTITY(1,1) |
+--------------------------------------------------------------------+
+--------------------------------------------------------------------+
| all_stats |
+--------------------------------------------------------------------+
| id | INT PRIMARY KEY IDENTITY(1,1) |
| context_id | INT FOREIGN KEY REFERENCES contexts(id) |
| value | FLOAT NULL |
| some_date | DATETIME NOT NULL |
| thingy_id | INT NOT NULL FOREIGN KEY REFERENCES thingies(id) |
+--------------------------------------------------------------------+
+--------------------------------------------------------------------+
| group_contexts |
+--------------------------------------------------------------------|
| id | INT PRIMARY KEY IDENTITY(1,1) |
| group_id | INT NOT NULL FOREIGN KEY REFERENCES groups(group_id) |
| context_id | INT NOT NULL FOREIGN KEY REFERENCES contexts(id) |
+--------------------------------------------------------------------+
+--------------------------------------------------------------------+
| contexts |
+--------------------------------------------------------------------+
| id | INT PRIMARY KEY IDENTITY(1,1) |
+--------------------------------------------------------------------+
+--------------------------------------------------------------------+
| groups |
+--------------------------------------------------------------------+
| group_id | INT PRIMARY KEY IDENTITY(1,1) |
+--------------------------------------------------------------------+
The Problem
The task is, for a given set of thingies, to find and aggregate the 3 most recent (by all_stats.some_date) stats of each thingy, for all groups the thingy has stats for. I know it sounds easy, but I can't work out how to do this properly in SQL - I'm not exactly a prodigy.
My Bad Solution (no it's really bad...)
My solution right now is to fill a temporary table with all the required data and then UNION ALL the pieces I need:
-- Before building this SQL I retrieve the relevant groups,
-- to be able to build the `UNION ALL`s at the bottom.
-- I also retrieve the thingies that are relevant in this context
-- beforehand and include their ids as a comma-separated list -
-- I said it would be awful ...
-- Creating the temp table holding all stats data rows
-- for a thingy in a specific group
CREATE TABLE #stats
(id INT PRIMARY KEY IDENTITY(1,1),
group_id INT NOT NULL,
thingy_id INT NOT NULL,
value FLOAT NOT NULL,
some_date DATETIME NOT NULL)
-- Filling the temp table
INSERT INTO #stats(group_id,thingy_id,value,some_date)
SELECT filtered.group_id, filtered.thingy_id, filtered.value, filtered.some_date
FROM
  (SELECT joined.group_id,joined.thingy_id,joined.value,joined.some_date
   FROM
    (SELECT groups.group_id,data.value,data.thingy_id,data.some_date
     FROM
      -- Getting the groups associated with the contexts
      -- of all the stats available
      (SELECT DISTINCT groupcontext.group_id
       FROM all_stats AS stat
       INNER JOIN group_contexts AS groupcontext
         ON groupcontext.context_id = stat.context_id
      ) AS groups
      INNER JOIN
      -- Joining the available groups with the actual
      -- stat data of the group for a thingy
      (SELECT groupcontext.group_id,stat.value,stat.some_date,stat.thingy_id
       FROM all_stats AS stat
       INNER JOIN group_contexts AS groupcontext
         ON groupcontext.context_id = stat.context_id
       WHERE stat.value IS NOT NULL
         AND stat.value >= 0) AS data
      ON data.group_id = groups.group_id) AS joined
  ) AS filtered
-- I already have the thingies beforehand but if it would be possible
-- to include/query for them in another way that'd be OK by me
WHERE filtered.thingy_id in (/* somewhere around 10000 thingies are available */)
-- Now I'm building the `UNION ALL`s for each thingy as well as
-- the group the stat of the thingy belongs to
-- thingy 42 {
-- Getting the average of the most recent 3 stat items
-- for a thingy with id 42 in group 982
SELECT x.group_id,x.thingy_id,AVG(x.value)
FROM
(SELECT TOP 3 s.group_id,s.thingy_id,s.value,s.some_date
FROM #stats AS s
WHERE s.group_id = 982
AND s.thingy_id = 42
ORDER BY s.some_date DESC) AS x
GROUP BY x.group_id,x.thingy_id
HAVING COUNT(*) >= 3
UNION ALL
-- Getting the average of the most recent 3 stat items
-- for a thingy with id 42 in group 314159
SELECT x.group_id,x.thingy_id,AVG(x.value)
FROM
(SELECT TOP 3 s.group_id,s.thingy_id,s.value,s.some_date
FROM #stats AS s
WHERE s.group_id = 314159
AND s.thingy_id = 42
ORDER BY s.some_date DESC) AS x
GROUP BY x.group_id,x.thingy_id
HAVING COUNT(*) >= 3
-- }
UNION ALL
-- thingy 21 {
-- Getting the average of the most recent 3 stat items
-- for a thingy with id 21 in group 982
/* you get the idea */
This works - slowly, but it works - for small data sets (say, 100 thingies with 10 stats each), but the problem domain it eventually has to handle is 10,000+ thingies with potentially hundreds of stats per thingy. As a side note, the generated SQL query is ridiculously large: a pretty small query involving, say, 350 thingies with data in 3 context groups amounts to more than 250,000 formatted lines of SQL, executing in a stunning 5 minutes.
So if anyone has an idea how to solve this I really, really would appreciate your help :-).
On your ancient SQL Server release you need to use an old-style correlated scalar subquery to get the last three rows for all thingies in a single query :-)
SELECT x.group_id,x.thingy_id,AVG(x.value)
FROM
(
SELECT s.group_id,s.thingy_id,s.value
FROM #stats AS s
where (select count(*) from #stats as s2
where s.group_id = s2.group_id
and s.thingy_id = s2.thingy_id
and s.some_date <= s2.some_date
) <= 3
) AS x
GROUP BY x.group_id,x.thingy_id
HAVING COUNT(*) >= 3
To get better performance you need to add a clustered index to the #stats table, probably on (group_id, thingy_id, some_date desc, value).
If (group_id, thingy_id, some_date) is unique, you should remove the useless id column; otherwise, ORDER BY group_id, thingy_id, some_date DESC during the INSERT/SELECT into #stats and use id instead of some_date for finding the last three rows. A sketch follows below.
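A sketch of that suggestion, with (group_id, thingy_id, some_date) assumed unique so the surrogate id is dropped; if your SQL Server 7 build rejects the DESC key, declare it ascending and read the three newest rows from the other end:
CREATE TABLE #stats
    (group_id  INT      NOT NULL,
     thingy_id INT      NOT NULL,
     value     FLOAT    NOT NULL,
     some_date DATETIME NOT NULL)

-- one clustered index doing double duty: it groups rows for the
-- correlated count and keeps them presorted by recency
CREATE CLUSTERED INDEX IX_stats_group_thingy_date
    ON #stats (group_id, thingy_id, some_date DESC, value)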

Insert multiple rows in one table based on number in another table

I am creating a database for the first time using Postgres 9.3 on MacOSX.
Let's say I have table A and table B. A starts off empty and B filled. I would like table A to contain one row per unique entry in column all_names of table B: names should contain each unique entry from all_names, and number should hold its count, as in the tables below. I am not used to the syntax yet, so I do not really know how to go about it. The birthday column is redundant.
Table A
names | number
------+--------
Carl | 3
Bill | 4
Jen | 2
Table B
all_names | birthday
-----------+------------
Carl | 17/03/1980
Carl | 22/08/1994
Carl | 04/09/1951
Bill | 02/12/2003
Bill | 11/03/1975
Bill | 04/06/1986
Bill | 08/07/2005
Jen | 05/03/2009
Jen | 01/04/1945
Would this be the correct way to go about it?
insert into a (names, number)
select b.all_names, count(b.all_names)
from b
group by b.all_names;
Answer to original question
Postgres allows set-returning functions (SRF) to multiply rows. generate_series() is your friend:
INSERT INTO b (all_names, birthday)
SELECT names, current_date -- AS birthday ??
FROM (SELECT names, generate_series(1, number) FROM a) sub;
Since the introduction of LATERAL in Postgres 9.3, you can stick to standard SQL: the SRF moves from the SELECT list to the FROM list:
INSERT INTO b (all_names, birthday)
SELECT a.names, current_date -- AS birthday ??
FROM a, generate_series(1, a.number) AS rn
LATERAL is implicit here, as explained in the manual:
LATERAL can also precede a function-call FROM item, but in this case
it is a noise word, because the function expression can refer to
earlier FROM items in any case.
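Spelled out with the explicit keyword, that FROM item would read as follows (equivalent to the query above, per the quoted manual passage):
INSERT INTO b (all_names, birthday)
SELECT a.names, current_date
FROM a, LATERAL generate_series(1, a.number) AS rn;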
Reverse operation
The above is the reverse operation (approximately) of a simple aggregate count():
INSERT INTO a (names, number)
SELECT all_names, count(*)
FROM b
GROUP BY 1;
... which fits your updated question.
Note a subtle difference between count(*) and count(all_names): the former counts all rows, no matter what, while the latter only counts rows where all_names IS NOT NULL. If your column all_names is defined NOT NULL, both return the same result, but count(*) is a bit shorter and faster.
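A quick illustration of the difference, assuming b held one extra row with a NULL name:
SELECT count(*)         AS all_rows,        -- counts every row
       count(all_names) AS non_null_names   -- skips rows with NULL all_names
FROM   b;
-- here all_rows would be non_null_names + 1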
About GROUP BY 1:
GROUP BY + CASE statement

Rewriting mysql select to reduce time and writing tmp to disk

I have a MySQL query that's taking several minutes, which isn't very good, as it's used to build a web page.
Three tables are used: poster_data contains information on individual posters; poster_categories lists all the categories (movies, art, etc.); and poster_prodcat lists the poster id and the categories it can be in, e.g. one poster would have multiple rows for, say, movies, Indiana Jones, Harrison Ford, adventure films, etc.
This is the slow query:
select *
from poster_prodcat,
poster_data,
poster_categories
where poster_data.apnumber = poster_prodcat.apnumber
and poster_categories.apcatnum = poster_prodcat.apcatnum
and poster_prodcat.apcatnum='623'
ORDER BY aptitle ASC
LIMIT 0, 32
According to the explain, it was taking a few minutes. poster_data has just over 800,000 rows, while poster_prodcat has just over 17 million. Other category queries with this select are barely noticeable, while poster_prodcat.apcatnum='623' has about 400,000 results and is writing out to disk.
Hope you find this helpful - http://pastie.org/1105206
drop table if exists poster;
create table poster
(
poster_id int unsigned not null auto_increment primary key,
name varchar(255) not null unique
)
engine = innodb;
drop table if exists category;
create table category
(
cat_id mediumint unsigned not null auto_increment primary key,
name varchar(255) not null unique
)
engine = innodb;
drop table if exists poster_category;
create table poster_category
(
cat_id mediumint unsigned not null,
poster_id int unsigned not null,
primary key (cat_id, poster_id) -- note the clustered composite index !!
)
engine = innodb;
-- FYI http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html
select count(*) from category
count(*)
========
500,000
select count(*) from poster
count(*)
========
1,000,000
select count(*) from poster_category
count(*)
========
125,675,688
select count(*) from poster_category where cat_id = 623
count(*)
========
342,820
explain
select
p.*,
c.*
from
poster_category pc
inner join category c on pc.cat_id = c.cat_id
inner join poster p on pc.poster_id = p.poster_id
where
pc.cat_id = 623
order by
p.name
limit 32;
id  select_type  table  type    possible_keys  key      key_len  ref                       rows
==  ===========  =====  ======  =============  =======  =======  ========================  ====
1   SIMPLE       c      const   PRIMARY        PRIMARY  3        const                     1
1   SIMPLE       p      index   PRIMARY        name     257      null                      32
1   SIMPLE       pc     eq_ref  PRIMARY        PRIMARY  7        const,foo_db.p.poster_id  1
select
p.*,
c.*
from
poster_category pc
inner join category c on pc.cat_id = c.cat_id
inner join poster p on pc.poster_id = p.poster_id
where
pc.cat_id = 623
order by
p.name
limit 32;
Statement:21/08/2010
0:00:00.021: Query OK
Is the query you listed how the final query will look? (That is, will they all have apcatnum = /ID/?)
where poster_data.apnumber=poster_prodcat.apnumber and poster_categories.apcatnum=poster_prodcat.apcatnum and poster_prodcat.apcatnum='623'
The condition poster_prodcat.apcatnum='623' will vastly decrease the data set MySQL has to work on, so it should be the first part of the query to be evaluated. Then go on to reorder the WHERE comparisons so that those which shrink the data set the most are evaluated first.
You may also want to try subqueries; see the sketch below. I'm not sure it will help, but MySQL probably won't fetch all 3 tables first - it will run the subquery first and then the outer one, which should minimize memory consumption while querying.
This is not an option, though, if you really want to select all columns (as you're using a * there).
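A sketch of that subquery idea, using the table and column names from the question (untested; the derived table narrows poster_prodcat before the joins):
SELECT pd.*, pc.*
FROM (SELECT apnumber, apcatnum
      FROM poster_prodcat
      WHERE apcatnum = '623') AS pp
INNER JOIN poster_data       AS pd ON pd.apnumber = pp.apnumber
INNER JOIN poster_categories AS pc ON pc.apcatnum = pp.apcatnum
ORDER BY pd.aptitle ASC
LIMIT 0, 32;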
You need to have an index on apnumber in POSTER_DATA. Scanning 841,152 records is killing the performance.
Looks like the query is using the aptitle index to get the ordering, but it is doing a full scan to filter the results. I think it might help to have a composite index across both aptitle and apnumber on poster_data. MySQL might then be able to use it both for the sort order and for the filter.
create index data_title_anum_idx on poster_data(aptitle,apnumber);

Do I need to use multiple column SQL Server index in the same order as I declare it?

When I declare a clustered index, specifying: column1, column2 and column3 in this order - do I need to use the columns in that same order?
For example, will this use the clustered index mentioned earlier to update multiple rows:
UPDATE Table1
SET SomeColumn = SomeValue -- hypothetical SET clause; the question is about the WHERE
WHERE column3 = 1
AND column2 = 1
AND column1 = 1
The order in which you declare the items in the WHERE clause, as you have stated, should not make a difference to whether the database server is able to use an index that covers those columns.
It's true that when you're checking for exact equality, that order does not matter.
But that's not to say that the order in the index does not matter -- perhaps this is what your co-worker was trying to say. For example, if I have a table:
PersonID FName LName
-------- ------- -----
1 John Smith
2 Bill Jones
3 Frank Smith
4 Jane Jackson
...
(assume a significantly large table)
and I define an index on it in the order (LName, FName), that index will necessarily perform differently than an index defined in the order (FName, LName), depending on what the query is.
For example, for the query:
SELECT * FROM People WHERE LName = 'Smith', you will most likely get a better plan for the first type of index than for the second type.
Likewise,
SELECT * FROM People WHERE FName = 'John' will perform better with the second index structure over the first.
And
SELECT * FROM People WHERE FName = 'John' AND LName = 'Smith' will perform identically no matter in which order the index columns are declared.
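In DDL terms, the two orderings discussed above would be (index names made up):
CREATE INDEX IX_People_LName_FName ON People (LName, FName); -- favors WHERE LName = ...
CREATE INDEX IX_People_FName_LName ON People (FName, LName); -- favors WHERE FName = ...
With equality predicates on both columns, either index serves the combined query equally well.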