Speeding up partitioning query on ancient SQL Server version - sql

The Setup
I've got performance and conceptional problems with getting a query right on SQL Server 7 running on a dual core 2GHz + 2GB RAM machine - no chance of getting that out of the way, as you might expect :-/.
The Situation
I'm working with a legacy database and I need to mine for data to get various insights. I've got the all_stats table that contains all the stat data for a thingy in a specific context. These contexts are grouped with the help of the group_contexts table. A simplified schema:
+--------------------------------------------------------------------+
| thingies |
+--------------------------------------------------------------------|
| id | INT PRIMARY KEY IDENTITY(1,1) |
+--------------------------------------------------------------------+
+--------------------------------------------------------------------+
| all_stats |
+--------------------------------------------------------------------+
| id | INT PRIMARY KEY IDENTITY(1,1) |
| context_id | INT FOREIGN KEY REFERENCES contexts(id) |
| value | FLOAT NULL |
| some_date | DATETIME NOT NULL |
| thingy_id | INT NOT NULL FOREIGN KEY REFERENCES thingies(id) |
+--------------------------------------------------------------------+
+--------------------------------------------------------------------+
| group_contexts |
+--------------------------------------------------------------------|
| id | INT PRIMARY KEY IDENTITY(1,1) |
| group_id | INT NOT NULL FOREIGN KEY REFERENCES groups(group_id) |
| context_id | INT NOT NULL FOREIGN KEY REFERENCES contexts(id) |
+--------------------------------------------------------------------+
+--------------------------------------------------------------------+
| contexts |
+--------------------------------------------------------------------+
| id | INT PRIMARY KEY IDENTITY(1,1) |
+--------------------------------------------------------------------+
+--------------------------------------------------------------------+
| groups |
+--------------------------------------------------------------------+
| group_id | INT PRIMARY KEY IDENTITY(1,1) |
+--------------------------------------------------------------------+
The Problem
The task is, for a given set of thingies, to find and aggregate the 3 most recent (all_stats.some_date) stats of a thingy for all groups the thingy has stats for. I know it sounds easy but I can't get around how to do this properly in SQL - I'm not exactly a prodigy.
My Bad Solution (no it's really bad...)
My solution right now is to fill a temporary table with all the required data and UNION ALLing the data I need:
-- Before I'm building this SQL I retrieve the relevant groups
-- for being able to build the `UNION ALL`s at the bottom.
-- I also retrieve the thingies that are relevant in this context
-- beforehand and include their ids as a comma separated list -
-- I said it would be awfull ...
-- Creating the temp table holding all stats data rows
-- for a thingy in a specific group
CREATE TABLE #stats
(id INT PRIMARY KEY IDENTITY(1,1),
group_id INT NOT NULL,
thingy_id INT NOT NULL,
value FLOAT NOT NULL,
some_date DATETIME NOT NULL)
-- Filling the temp table
INSERT INTO #stats(group_id,thingy_id,value,some_date)
SELECT filtered.group_id, filtered.thingy_id, filtered.some_date, filtered.value
FROM
(SELECT joined.group_id,joined.thingy_id,joined.value,joined.some_date
FROM
(SELECT groups.group_id,data.value,data.thingy_id,data.some_date
FROM
-- Getting the groups associated with the contexts
-- of all the stats available
(SELECT DISTINCT context.group_id
FROM all_stats AS stat
INNER JOIN group_contexts AS groupcontext
ON groupcontext.context_id = stat.context_id
) AS groups
INNER JOIN
-- Joining the available groups with the actual
-- stat data of the group for a thingy
(SELECT context.group_id,stat.value,stat.some_date,stat.thingy_id
FROM all_stats AS stat
INNER JOIN group_contexts AS groupcontext
ON groupcontext.context_id = stat.context_id
WHERE stat.value IS NOT NULL
AND stat.value >= 0) AS data
ON data.group_id = groups.group_id) AS joined
) AS filtered
-- I already have the thingies beforehand but if it would be possible
-- to include/query for them in another way that'd be OK by me
WHERE filtered.thingy_id in (/* somewhere around 10000 thingies are available */)
-- Now I'm building the `UNION ALL`s for each thingy as well as
-- the group the stat of the thingy belongs to
-- thingy 42 {
-- Getting the average of the most recent 3 stat items
-- for a thingy with id 42 in group 982
SELECT x.group_id,x.thingy_id,AVG(x.value)
FROM
(SELECT TOP 3 s.group_id,s.thingy_id,s.value,s.some_date
FROM #stats AS s
WHERE s.group_id = 982
AND s.thingy_id = 42
ORDER BY s.some_date DESC) AS x
GROUP BY x.group_id,x.thingy_id
HAVING COUNT(*) >= 3
UNION ALL
-- Getting the average of the most recent 3 stat items
-- for a thingy with id 42 in group 314159
SELECT x.group_id,x.thingy_id,AVG(x.value)
FROM
(SELECT TOP 3 s.group_id,s.thingy_id,s.value,s.some_date
FROM #stats AS s
WHERE s.group_id = 314159
AND s.thingy_id = 42
ORDER BY s.some_date DESC) AS x
GROUP BY x.group_id,x.thingy_id
HAVING COUNT(*) >= 3
-- }
UNION ALL
-- thingy 21 {
-- Getting the average of the most recent 3 stat items
-- for a thingy with id 21 in group 982
/* you get the idea */
This works - slowly, but it works - for small sets of data (e.g. say 100 thingies that have 10 stats attached each) but the problem domain it has to eventually work is in 10000+ thingies with potentially hundreds of stats per thingy. As a side note: the generated SQL query is ridiculously large: a pretty small query involves say 350 thingies that have data in 3 context groups and it's amounting to more than 250 000 formatted lines of SQL - executing in a stunning 5 minutes.
So if anyone has an idea how to solve this I really, really would appreciate your help :-).

On your ancient SQL Server release you need to use some old-style Scalar Subquery to get the last three rows for all thingies in a single query :-)
SELECT x.group_id,x.thingy_id,AVG(x.value)
FROM
(
SELECT s.group_id,s.thingy_id,s.value
FROM #stats AS s
where (select count(*) from #stats as s2
where s.group_id = s2.group_id
and s.thingy_id = s2.thingy_id
and s.some_date <= s2.some_date
) <= 3
) AS x
GROUP BY x.group_id,x.thingy_id
HAVING COUNT(*) >= 3
To get better performance you need to add a clustered index, probably (group_id,thingy_id,some_date desc,value) to the #stats table.
If group_id,thingy_id,some_date is unique you should remove the useless ID column, otherwise order by group_id,thingy_id,some_date desc during the Insert/Select into #stats and use ID instead of some_date for finding the last three rows.

Related

Recursively duplicating entries

I am attempting to duplicate an entry. That part isn't hard. The tricky part is: there are n entries connected with a foreign key. And for each of those entries, there are n entries connected to that. I did it manually using a lookup to duplicate and cross reference the foreign keys.
Is there some subroutine or method to duplicate an entry and search for and duplicate foreign entries? Perhaps there is a name for this type of replication I haven't stumbled on yet, is there a specific database related title for this type of operation?
PostgreSQL 8.4.13
main entry (uid is serial)
uid | title
-----+-------
1 | stuff
department (departmentid is serial, uidref is foreign key for uid above)
departmentid | uidref | title
--------------+--------+-------
100 | 1 | Foo
101 | 1 | Bar
sub_category of department (textid is serial, departmentref is foreign for departmentid above)
textid | departmentref | title
-------+---------------+----------------
1000 | 100 | Text for Foo 1
1001 | 100 | Text for Foo 2
1002 | 101 | Text for Bar 1
You can do it all in a single statement using data-modifying CTEs (requires Postgres 9.1 or later).
Your primary keys being serial columns makes it easier:
WITH m AS (
INSERT INTO main (<all columns except pk>)
SELECT <all columns except pk>
FROM main
WHERE uid = 1
RETURNING uid AS uidref -- returns new uid
)
, d AS (
INSERT INTO department (<all columns except pk>)
SELECT <all columns except pk>
FROM m
JOIN department d USING (uidref)
RETURNING departmentid AS departmentref -- returns new departmentids
)
INSERT INTO sub_category (<all columns except pk>)
SELECT <all columns except pk>
FROM d
JOIN sub_category s USING (departmentref);
Replace <all columns except pk> with your actual columns. pk is for primary key, like main.uid.
The query returns nothing. You can return pretty much anything. You just didn't specify anything.
You wouldn't call that "replication". That term usually is applied for keeping multiple database instances or objects in sync. You are just duplicating an entry - and depending objects recursively.
Aside about naming conventions:
It would get even simpler with a naming convention that labels all columns signifying "ID of table foo" with the same (descriptive) name, like foo_id. There are other naming conventions floating around, but this is the best for writing queries, IMO.

Adding string to the primary key?

I want to add some string with the primary key value while creating the table in sql?
Example:
my primary key column should automatically generate values like below:
'EMP101'
'EMP102'
'EMP103'
How to achieve it?
Try this: (For SQL Server 2012)
UPDATE MyTable
SET EMPID = CONCAT('EMP' , EMPID)
Or this: (For SQL Server < 2012)
UPDATE MyTable
SET EMPID = 'EMP' + EMPID
SQLFiddle for SQL Server 2008
SQLFiddle for SQL Server 2012
Since you want to set auto increment in VARCHAR type column you can try this table schema:
CREATE TABLE MyTable
(EMP INT NOT NULL IDENTITY(1000, 1)
,[EMPID] AS 'EMP' + CAST(EMP AS VARCHAR(10)) PERSISTED PRIMARY KEY
,EMPName VARCHAR(20))
;
INSERT INTO MyTable(EMPName) VALUES
('AA')
,('BB')
,('CC')
,('DD')
,('EE')
,('FF')
Output:
| EMP | EMPID | EMPNAME |
----------------------------
| 1000 | EMP1000 | AA |
| 1001 | EMP1001 | BB |
| 1002 | EMP1002 | CC |
| 1003 | EMP1003 | DD |
| 1004 | EMP1004 | EE |
| 1005 | EMP1005 | FF |
See this SQLFiddle
Here you can see EMPID is auto incremented column with Primary key.
Source: HOW TO SET IDENTITY KEY/AUTO INCREMENT ON VARCHAR COLUMN IN SQL SERVER (Thanks to #bvr)
What the rule of thumb is, is that never use meaningful information in primary keys (like Employee Number / Social Security number). Let that just be a plain autoincremented integer. However constant the data seems - it may change at one point (new legislation comes and all SSNs are recalculated).
it seems the only reason you are want to use a non-integer keys is that the key is generated as string concatenation with another column to make it unique.
From a best practice perspective, it is strongly recommended that integer primary keys are used, but often, this guidance is ignored.
May be going through the following posts might be of help:
Should I design a table with a primary key of varchar or int?
SQL primary key: integer vs varchar
You can achieve it at least in two ways:
Generate new id on the fly when you insert a new record
Create INSTEAD OF INSERT trigger that will do that for you
If you have a table schema like this
CREATE TABLE Table1
([emp_id] varchar(12) primary key, [name] varchar(64))
For the first scenario you can use a query
INSERT INTO Table1 (emp_id, name)
SELECT newid, 'Jhon'
FROM
(
SELECT 'EMP' + CONVERT(VARCHAR(9), COALESCE(REPLACE(MAX(emp_id), 'EMP', ''), 0) + 1) newid
FROM Table1 WITH (TABLOCKX, HOLDLOCK)
) q
Here is SQLFiddle demo
For the second scenario you can a trigger like this
CREATE TRIGGER tg_table1_insert ON Table1
INSTEAD OF INSERT AS
BEGIN
DECLARE #max INT
SET #max =
(SELECT COALESCE(REPLACE(MAX(emp_id), 'EMP', ''), 0)
FROM Table1 WITH (TABLOCKX, HOLDLOCK)
)
INSERT INTO Table1 (emp_id, name)
SELECT 'EMP' + CONVERT(VARCHAR(9), #max + ROW_NUMBER() OVER (ORDER BY (SELECT 1))), name
FROM INSERTED
END
Here is SQLFiddle demo
I am looking to do something similar but don't see an answer to my problem here.
I want a primary Key like "JonesB_01" as this is how we want our job number represented in our production system.
--ID | First_Name | Second_Name | Phone | Etc..
-- Bob Jones 9999-999-999
--ID = "Second_Name"+"F"irst Initial+"_(01-99)"
The number 01-99 has been included to allow for multiple instances of a customer with the same surname and first initial. In our industry it's not unusual for the same customer to have work done on multiple occasions but are not repeat business on an ongoing basis. I expect this convention to last a very long time. If we ever exceed it, then I can simply add a third interger.
I want this to auto populate to keep data entry as simple as possible.
I managed to get a solution to work using Excel formulars and a few helper cells but am new to SQL.
--CellA2 = JonesB_01 (=concatenate(D2+E2))
--CellB2 = "Bob"
--CellC2 = "Jones"
--CellD2 = "JonesB" (=if(B2="","",Concatenate(C2,Left(B2)))
--CellE2 = "_01" (=concatenate("_",Text(F2,"00"))
--CellF2 = "1" (=If(D2="","",Countif($D$2:$D2,D2))
Thanks.
SELECT 'EMP' || TO_CHAR(NVL(MAX(TO_NUMBER(SUBSTR(A.EMP_NO, 4,3))), '000')+1) AS NEW_EMP_NO
FROM
(SELECT 'EMP101' EMP_NO
FROM DUAL
UNION ALL
SELECT 'EMP102' EMP_NO
FROM DUAL
UNION ALL
SELECT 'EMP103' EMP_NO
FROM DUAL
) A

Speeding up this big JOIN

EDIT: there was a mistake in the following question that explains the observations. I could delete the question but this might still be useful to someone. The mistake was that the actual query running on the server was SELECT * FROM t (which was silly) when I thought it was running SELECT t.* FROM t (which makes all the difference). See tobyobrian's answer and the comments to it.
I've a too slow query in a situation with a schema as follows. Table t has data rows indexed by t_id. t adjoins tables x and y via junction tables t_x and t_y each of which contains only the foreigns keys required for the JOINs:
CREATE TABLE t (
t_id INT NOT NULL PRIMARY KEY,
data columns...
);
CREATE TABLE t_x (
t_id INT NOT NULL,
x_id INT NOT NULL,
PRIMARY KEY (t_id, x_id),
KEY (x_id)
);
CREATE TABLE t_y (
t_id INT NOT NULL,
y_id INT NOT NULL,
PRIMARY KEY (t_id, y_id),
KEY (y_id)
);
I need to export the stray rows in t, i.e. those not referenced in either junction table.
SELECT t.* FROM t
LEFT JOIN t_x ON t_x.t_id=t.t_id
LEFT JOIN t_y ON t_y.t_id=t.t_id
WHERE t_x.t_id IS NULL OR t_y.t_id IS NULL
INTO OUTFILE ...;
t has 21 M rows while t_x and t_y both have about 25 M rows. So this is naturally going to be a slow query.
I'm using MyISAM so I thought I'd try to speed it up by preloading the t_x and t_y indexes. The combined size of t_x.MYI and t_y.MYI was about 1.2 M bytes so I created a dedicated key buffer for them, assigned their PRIMARY keys to the dedicated buffer and LOAD INDEX INTO CACHE'ed them.
But as I watch the query in operation, mysqld is using about 1% CPU, the average system IO pending queue length is around 5, and mysqld's average seek size is in the 250 k range. Moreover, nearly all the IO is mysqld reading from t_x.MYI and t_x.MYD.
I don't understand:
Why mysqld is reading the .MYD files at all?
Why mysqld isn't using the preloaded the t_x and t_y indexes?
Could it have something to do with the t_x and t_y PRIMARY keys being over two columns?
EDIT: The query explained:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+---------+---------+-----------+----------+-------------+
| 1 | SIMPLE | t | ALL | NULL | NULL | NULL | NULL | 20980052 | |
| 1 | SIMPLE | t_x | ref | PRIMARY | PRIMARY | 4 | db.t.t_id | 235849 | Using index |
| 1 | SIMPLE | t_y | ref | PRIMARY | PRIMARY | 4 | db.t.t_id | 207947 | Using where |
+----+-------------+-------+------+---------------+---------+---------+-----------+----------+-------------+
Use not exists - this will be the fastest - much better than 'joins' or using 'not in' in this sitution.
SELECT t.* FROM t a
Where not exists (select 1 from t_x b
where b.t_id = a.t_id)
or not exists (select 1 from t_y c
where c.t_id = a.t_id);
I can answer part 1 of your question, and i may or may not be able to answer part two if you post the output of EXPLAIN:
In order to select t.* it needs to look in the MYD file - only the primary key is in the index, to fetch the data columns you requested it needs the rest of the columns.
That is, your query is quite probably filtering the results very quickly, its just struggling to copy all the data you wanted.
Also note that you will probably have duplicates in your output - if one row has no refs in t_x, but 3 in x_y you will have the same t.* repeated 3 times. Given we think the where clause is sufficiently efficient, and much time is spent on reading the actual data, this is quite possibly the source of your problems. try changing to select distinct and see if that helps your efficiency
This may be a bit more efficient:
SELECT *
FROM t
WHERE t.id NOT IN (
SELECT DISTINCT t_id
FROM t_x
UNION
SELECT DISTINCT t_id
FROM t_y
);

Efficient SQL 2000 Query for Selecting Preferred Candy

(I wish I could have come up with a more descriptive title... suggest one or edit this post if you can name the type of query I'm asking about)
Database: SQL Server 2000
Sample Data (assume 500,000 rows):
Name Candy PreferenceFactor
Jim Chocolate 1.0
Brad Lemon Drop .9
Brad Chocolate .1
Chris Chocolate .5
Chris Candy Cane .5
499,995 more rows...
Note that the number of rows with a given 'Name' is unbounded.
Desired Query Results:
Jim Chocolate 1.0
Brad Lemon Drop .9
Chris Chocolate .5
~250,000 more rows...
(Since Chris has equal preference for Candy Cane and Chocolate, a consistent result is adequate).
Question:
How do I Select Name, Candy from data where each resulting row contains a unique Name such that the Candy selected has the highest PreferenceFactor for each Name. (speedy efficient answers preferred).
What indexes are required on the table? Does it make a difference if Name and Candy are integer indexes into another table (aside from requiring some joins)?
You will find that the following query outperforms every other answer given, as it works with a single scan. This simulates MS Access's First and Last aggregate functions, which is basically what you are doing.
Of course, you'll probably have foreign keys instead of names in your CandyPreference table. To answer your question, it is in fact very much best if Candy and Name are foreign keys into another table.
If there are other columns in the CandyPreferences table, then having a covering index that includes the involved columns will yield even better performance. Making the columns as small as possible will increase the rows per page and again increase performance. If you are most often doing the query with a WHERE condition to restrict rows, then an index that covers the WHERE conditions becomes important.
Peter was on the right track for this, but had some unneeded complexity.
CREATE TABLE #CandyPreference (
[Name] varchar(20),
Candy varchar(30),
PreferenceFactor decimal(11, 10)
)
INSERT #CandyPreference VALUES ('Jim', 'Chocolate', 1.0)
INSERT #CandyPreference VALUES ('Brad', 'Lemon Drop', .9)
INSERT #CandyPreference VALUES ('Brad', 'Chocolate', .1)
INSERT #CandyPreference VALUES ('Chris', 'Chocolate', .5)
INSERT #CandyPreference VALUES ('Chris', 'Candy Cane', .5)
SELECT
[Name],
Candy = Substring(PackedData, 13, 30),
PreferenceFactor = Convert(decimal(11,10), Left(PackedData, 12))
FROM (
SELECT
[Name],
PackedData = Max(Convert(char(12), PreferenceFactor) + Candy)
FROM CandyPreference
GROUP BY [Name]
) X
DROP TABLE #CandyPreference
I actually don't recommend this method unless performance is critical. The "canonical" way to do it is OrbMan's standard Max/GROUP BY derived table and then a join to it to get the selected row. Though, that method starts to become difficult when there are several columns that participate in the selection of the Max, and the final combination of selectors can be duplicated, that is, when there is no column to provide arbitrary uniqueness as in the case here where we use the name if the PreferenceFactor is the same.
Edit: It's probably best to give some more usage notes to help improve clarity and to help people avoid problems.
As a general rule of thumb, when trying to improve query performance, you can do a LOT of extra math if it will save you I/O. Saving an entire table seek or scan speeds up the query substantially, even with all the converts and substrings and so on.
Due to precision and sorting issues, use of a floating point data type is probably a bad idea with this method. Though unless you are dealing with extremely large or small numbers, you shouldn't be using float in your database anyway.
The best data types are those that are not packed and sort in the same order after conversion to binary or char. Datetime, smalldatetime, bigint, int, smallint, and tinyint all convert directly to binary and sort correctly because they are not packed. With binary, avoid left() and right(), use substring() to get the values reliably returned to their originals.
I took advantage of Preference having only one digit in front of the decimal point in this query, allowing conversion straight to char since there is always at least a 0 before the decimal point. If more digits are possible, you would have to decimal-align the converted number so things sort correctly. Easiest might be to multiply your Preference rating so there is no decimal portion, convert to bigint, and then convert to binary(8). In general, conversion between numbers is faster than conversion between char and another data type, especially with date math.
Watch out for nulls. If there are any, you must convert them to something and then back.
select c.Name, max(c.Candy) as Candy, max(c.PreferenceFactor) as PreferenceFactor
from Candy c
inner join (
select Name, max(PreferenceFactor) as MaxPreferenceFactor
from Candy
group by Name
) cm on c.Name = cm.Name and c.PreferenceFactor = cm.MaxPreferenceFactor
group by c.Name
order by PreferenceFactor desc, Name
I tried:
SELECT X.PersonName,
(
SELECT TOP 1 Candy
FROM CandyPreferences
WHERE PersonName=X.PersonName AND PreferenceFactor=x.HighestPreference
) AS TopCandy
FROM
(
SELECT PersonName, MAX(PreferenceFactor) AS HighestPreference
FROM CandyPreferences
GROUP BY PersonName
) AS X
This seems to work, though I can't speak to efficiency without real data and a realistic load.
I did create a primary key over PersonName and Candy, though. Using SQL Server 2008 and no additional indexes shows it using two clustered index scans though, so it could be worse.
I played with this a bit more because I needed an excuse to play with the Data Generation Plan capability of "datadude". First, I refactored the one table to have separate tables for candy names and person names. I did this mostly because it allowed me to use the test data generation without having to read the documentation. The schema became:
CREATE TABLE [Candies](
[CandyID] [int] IDENTITY(1,1) NOT NULL,
[Candy] [nvarchar](50) NOT NULL,
CONSTRAINT [PK_Candies] PRIMARY KEY CLUSTERED
(
[CandyID] ASC
),
CONSTRAINT [UC_Candies] UNIQUE NONCLUSTERED
(
[Candy] ASC
)
)
GO
CREATE TABLE [Persons](
[PersonID] [int] IDENTITY(1,1) NOT NULL,
[PersonName] [nvarchar](100) NOT NULL,
CONSTRAINT [PK_Preferences.Persons] PRIMARY KEY CLUSTERED
(
[PersonID] ASC
)
)
GO
CREATE TABLE [CandyPreferences](
[PersonID] [int] NOT NULL,
[CandyID] [int] NOT NULL,
[PrefernceFactor] [real] NOT NULL,
CONSTRAINT [PK_CandyPreferences] PRIMARY KEY CLUSTERED
(
[PersonID] ASC,
[CandyID] ASC
)
)
GO
ALTER TABLE [CandyPreferences]
WITH CHECK ADD CONSTRAINT [FK_CandyPreferences_Candies] FOREIGN KEY([CandyID])
REFERENCES [Candies] ([CandyID])
GO
ALTER TABLE [CandyPreferences]
CHECK CONSTRAINT [FK_CandyPreferences_Candies]
GO
ALTER TABLE [CandyPreferences]
WITH CHECK ADD CONSTRAINT [FK_CandyPreferences_Persons] FOREIGN KEY([PersonID])
REFERENCES [Persons] ([PersonID])
GO
ALTER TABLE [CandyPreferences]
CHECK CONSTRAINT [FK_CandyPreferences_Persons]
GO
The query became:
SELECT P.PersonName, C.Candy
FROM (
SELECT X.PersonID,
(
SELECT TOP 1 CandyID
FROM CandyPreferences
WHERE PersonID=X.PersonID AND PrefernceFactor=x.HighestPreference
) AS TopCandy
FROM
(
SELECT PersonID, MAX(PrefernceFactor) AS HighestPreference
FROM CandyPreferences
GROUP BY PersonID
) AS X
) AS Y
INNER JOIN Persons P ON Y.PersonID = P.PersonID
INNER JOIN Candies C ON Y.TopCandy = C.CandyID
With 150,000 candies, 200,000 persons, and 500,000 CandyPreferences, the query took about 12 seconds and produced 200,000 rows.
The following result surprised me. I changed the query to remove the final "pretty" joins:
SELECT X.PersonID,
(
SELECT TOP 1 CandyID
FROM CandyPreferences
WHERE PersonID=X.PersonID AND PrefernceFactor=x.HighestPreference
) AS TopCandy
FROM
(
SELECT PersonID, MAX(PrefernceFactor) AS HighestPreference
FROM CandyPreferences
GROUP BY PersonID
) AS X
This now takes two or three seconds for 200,000 rows.
Now, to be clear, nothing I've done here has been meant to improve the performance of this query: I considered 12 seconds to be a success. It now says it spends 90% of its time in a clustered index seek.
Comment on Emtucifor solution (as I cant make regular comments)
I like this solution, but have some comments how it could be improved (in this specific case).
It can't be done much if you have everything in one table, but having few tables as in John Saunders' solution will make things a bit different.
As we are dealing with numbers in [CandyPreferences] table we can use math operation instead of concatenation to get max value.
I suggest PreferenceFactor to be decimal instead of real, as I believe we don't need here size of real data type, and even further I would suggest decimal(n,n) where n<10 to have only decimal part stored in 5 bytes. Assume decimal(3,3) is enough (1000 levels of preference factor), we can do simple
PackedData = Max(PreferenceFactor + CandyID)
Further, if we know we have less than 1,000,000 CandyIDs we can add cast as:
PackedData = Max(Cast(PreferenceFactor + CandyID as decimal(9,3)))
allowing sql server to use 5 bytes in temporary table
Unpacking is easy and fast using floor function.
Niikola
-- ADDED LATER ---
I tested both solutions, John's and Emtucifor's (modified to use John's structure and using my suggestions). I tested also with and without joins.
Emtucifor's solution clearly wins, but margins are not huge. It could be different if SQL server had to perform some Physical reads, but they were 0 in all cases.
Here are the queries:
SELECT
[PersonID],
CandyID = Floor(PackedData),
PreferenceFactor = Cast(PackedData-Floor(PackedData) as decimal(3,3))
FROM (
SELECT
[PersonID],
PackedData = Max(Cast([PrefernceFactor] + [CandyID] as decimal(9,3)))
FROM [z5CandyPreferences] With (NoLock)
GROUP BY [PersonID]
) X
SELECT X.PersonID,
(
SELECT TOP 1 CandyID
FROM z5CandyPreferences
WHERE PersonID=X.PersonID AND PrefernceFactor=x.HighestPreference
) AS TopCandy,
HighestPreference as PreferenceFactor
FROM
(
SELECT PersonID, MAX(PrefernceFactor) AS HighestPreference
FROM z5CandyPreferences
GROUP BY PersonID
) AS X
Select p.PersonName,
c.Candy,
y.PreferenceFactor
From z5Persons p
Inner Join (SELECT [PersonID],
CandyID = Floor(PackedData),
PreferenceFactor = Cast(PackedData-Floor(PackedData) as decimal(3,3))
FROM ( SELECT [PersonID],
PackedData = Max(Cast([PrefernceFactor] + [CandyID] as decimal(9,3)))
FROM [z5CandyPreferences] With (NoLock)
GROUP BY [PersonID]
) X
) Y on p.PersonId = Y.PersonId
Inner Join z5Candies c on c.CandyId=Y.CandyId
Select p.PersonName,
c.Candy,
y.PreferenceFactor
From z5Persons p
Inner Join (SELECT X.PersonID,
( SELECT TOP 1 cp.CandyId
FROM z5CandyPreferences cp
WHERE PersonID=X.PersonID AND cp.[PrefernceFactor]=X.HighestPreference
) CandyId,
HighestPreference as PreferenceFactor
FROM ( SELECT PersonID,
MAX(PrefernceFactor) AS HighestPreference
FROM z5CandyPreferences
GROUP BY PersonID
) AS X
) AS Y on p.PersonId = Y.PersonId
Inner Join z5Candies as c on c.CandyID=Y.CandyId
And the results:
TableName nRows
------------------ -------
z5Persons 200,000
z5Candies 150,000
z5CandyPreferences 497,445
Query Rows Affected CPU time Elapsed time
--------------------------- ------------- -------- ------------
Emtucifor (no joins) 183,289 531 ms 3,122 ms
John Saunders (no joins) 183,289 1,266 ms 2,918 ms
Emtucifor (with joins) 183,289 1,031 ms 3,990 ms
John Saunders (with joins) 183,289 2,406 ms 4,343 ms
Emtucifor (no joins)
--------------------------------------------
Table Scan count logical reads
------------------- ---------- -------------
z5CandyPreferences 1 2,022
John Saunders (no joins)
--------------------------------------------
Table Scan count logical reads
------------------- ---------- -------------
z5CandyPreferences 183,290 587,677
Emtucifor (with joins)
--------------------------------------------
Table Scan count logical reads
------------------- ---------- -------------
Worktable 0 0
z5Candies 1 526
z5CandyPreferences 1 2,022
z5Persons 1 733
John Saunders (with joins)
--------------------------------------------
Table Scan count logical reads
------------------- ---------- -------------
z5CandyPreferences 183292 587,912
z5Persons 3 802
Worktable 0 0
z5Candies 3 559
Worktable 0 0
you could use following select statements
select Name,Candy,PreferenceFactor
from candyTable ct
where PreferenceFactor =
(select max(PreferenceFactor)
from candyTable where ct.Name = Name)
but with this select you will get "Chris" 2 times in your result set.
if you want to get the the most preferred food by user than use
select top 1 Name,Candy,PreferenceFactor
from candyTable ct
where name = #name
and PreferenceFactor=
(select max([PreferenceFactor])
from candyTable where name = #name )
i think changing the name and candy to integer types might help you improve performance. you also should insert indexes on both columns.
[Edit] changed ! to #
SELECT Name, Candy, PreferenceFactor
FROM table AS a
WHERE NOT EXISTS(SELECT * FROM table AS b
WHERE b.Name = a.Name
AND (b.PreferenceFactor > a.PreferenceFactor OR (b.PreferenceFactor = a.PreferenceFactor AND b.Candy > a.Candy))
select name, candy, max(preference)
from tablename
where candy=#candy
order by name, candy
usually indexing is required on columns which are frequently included in where clause. In this case I would say indexing on name and candy columns would be of highest priority.
Having lookup tables for columns usually depends on number of repeating values with in columns. Out of 250,000 rows, if there are only 50 values that are repeating, you really need to have integer reference (foreign key) there. In this case, candy reference should be done and name reference really depends on the number of distinct people within the database.
I changed your column Name to PersonName to avoid any common reserved word conflicts.
SELECT PersonName, MAX(Candy) AS PreferredCandy, MAX(PreferenceFactor) AS Factor
FROM CandyPreference
GROUP BY PersonName
ORDER BY Factor DESC
SELECT d.Name, a.Candy, d.MaxPref
FROM myTable a, (SELECT Name, MAX(PreferenceFactor) AS MaxPref FROM myTable) as D
WHERE a.Name = d.Name AND a.PreferenceFactor = d.MaxPref
This should give you rows with matching PrefFactor for a given Name.
(e.g. if John as a HighPref of 1 for Lemon & Chocolate).
Pardon my answer as I am writing it without SQL Query Analyzer.
Something like this would work:
select name
, candy = substring(preference,7,len(preference))
-- convert back to float/numeric
, factor = convert(float,substring(preference,1,5))/10
from (
select name,
preference = (
select top 1
-- convert from float/numeric to zero-padded fixed-width string
right('00000'+convert(varchar,convert(decimal(5,0),preferencefactor*10)),5)
+ ';' + candy
from candyTable b
where a.name = b.name
order by
preferencefactor desc
, candy
)
from (select distinct name from candyTable) a
) a
Performance should be decent with with method. Check your query plan.
TOP 1 ... ORDER BY in a correlated subquery allows us to specify arbitrary rules for which row we want returned per row in the outer query. In this case, we want the highest preference factor per name, with candy for tie-breaks.
Subqueries can only return one value, so we must combine candy and preference factor into one field. The semicolon is just for readability here, but in other cases, you might use it to parse the combined field with CHARINDEX in the outer query.
If you wanted full precision in the output, you could use this instead (assuming preferencefactor is a float):
convert(varchar,preferencefactor) + ';' + candy
And then parse it back with:
factor = convert(float,substring(preference,1,charindex(';',preference)-1))
candy = substring(preference,charindex(';',preference)+1,len(preference))
I tested also ROW_NUMBER() version + added additional index
Create index IX_z5CandyPreferences On z5CandyPreferences(PersonId,PrefernceFactor,CandyID)
Response times between Emtucifor's and ROW_NUMBER() version (with index in place) are marginal (if any - test should be repeated number of times and take averages, but I expect there would not be any significant difference)
Here is query:
Select p.PersonName,
c.Candy,
y.PrefernceFactor
From z5Persons p
Inner Join (Select * from (Select cp.PersonId,
cp.CandyId,
cp.PrefernceFactor,
ROW_NUMBER() over (Partition by cp.PersonId Order by cp.PrefernceFactor, cp.CandyId ) as hp
From z5CandyPreferences cp) X
Where hp=1) Y on p.PersonId = Y.PersonId
Inner Join z5Candies c on c.CandyId=Y.CandyId
and results with and without new index:
| Without index | With Index
----------------------------------------------
Query (Aff.Rows 183,290) |CPU time Elapsed time | CPU time Elapsed time
-------------------------- |-------- ------------ | -------- ------------
Emtucifor (with joins) |1,031 ms 3,990 ms | 890 ms 3,758 ms
John Saunders (with joins) |2,406 ms 4,343 ms | 1,735 ms 3,414 ms
ROW_NUMBER() (with joins) |2,094 ms 4,888 ms | 953 ms 3,900 ms.
Emtucifor (with joins) Without index | With Index
-----------------------------------------------------------------------
Table |Scan count logical reads | Scan count logical reads
-------------------|---------- ------------- | ---------- -------------
Worktable | 0 0 | 0 0
z5Candies | 1 526 | 1 526
z5CandyPreferences | 1 2,022 | 1 990
z5Persons | 1 733 | 1 733
John Saunders (with joins) Without index | With Index
-----------------------------------------------------------------------
Table |Scan count logical reads | Scan count logical reads
-------------------|---------- ------------- | ---------- -------------
z5CandyPreferences | 183292 587,912 | 183,290 585,570
z5Persons | 3 802 | 1 733
Worktable | 0 0 | 0 0
z5Candies | 3 559 | 1 526
Worktable | 0 0 | - -
ROW_NUMBER() (with joins) Without index | With Index
-----------------------------------------------------------------------
Table |Scan count logical reads | Scan count logical reads
-------------------|---------- ------------- | ---------- -------------
z5CandyPreferences | 3 2233 | 1 990
z5Persons | 3 802 | 1 733
z5Candies | 3 559 | 1 526
Worktable | 0 0 | 0 0

Remove rows NOT referenced by a foreign key

This is somewhat related to this question:
I have a table with a primary key, and I have several tables that reference that primary key (using foreign keys). I need to remove rows from that table, where the primary key isn't being referenced in any of those other tables (as well as a few other constraints).
For example:
Group
groupid | groupname
1 | 'group 1'
2 | 'group 3'
3 | 'group 2'
... | '...'
Table1
tableid | groupid | data
1 | 3 | ...
... | ... | ...
Table2
tableid | groupid | data
1 | 2 | ...
... | ... | ...
and so on. Some of the rows in Group aren't referenced in any of the tables, and I need to remove those rows. In addition to this, I need to know how to find all of the tables/rows that reference a given row in Group.
I know that I can just query every table and check the groupid's, but since they are foreign keys, I imagine that there is a better way of doing it.
This is using Postgresql 8.3 by the way.
DELETE
FROM group g
WHERE NOT EXISTS
(
SELECT NULL
FROM table1 t1
WHERE t1.groupid = g.groupid
UNION ALL
SELECT NULL
FROM table1 t2
WHERE t2.groupid = g.groupid
UNION ALL
…
)
At the heart of it, SQL servers don't maintain 2-way info for constraints, so your only option is to do what the server would do internally if you were to delete the row: check every other table.
If (and be damn sure first) your constraints are simple checks and don't carry any "on delete cascade" type statements, you can attempt to delete everything from your group table. Any row that does delete would thus have nothing reference it. Otherwise, you're stuck with Quassnoi's answer.