In SQL Server 2012, suppose we have three tables, Foos, Lookup1 and Lookup2, created with the following SQL:
CREATE TABLE Foos (
Id int NOT NULL,
L1 int NOT NULL,
L2 int NOT NULL,
Value int NOT NULL,
CONSTRAINT PK_Foos PRIMARY KEY CLUSTERED (Id ASC)
);
CREATE TABLE Lookup1 (
Id int NOT NULL,
Name nvarchar(50) NOT NULL,
CONSTRAINT PK_Lookup1 PRIMARY KEY CLUSTERED (Id ASC),
CONSTRAINT IX_Lookup1 UNIQUE NONCLUSTERED (Name ASC)
);
CREATE TABLE Lookup2 (
Id int NOT NULL,
Name nvarchar(50) NOT NULL,
CONSTRAINT PK_Lookup2 PRIMARY KEY CLUSTERED (Id ASC),
CONSTRAINT IX_Lookup2 UNIQUE NONCLUSTERED (Name ASC)
);
CREATE NONCLUSTERED INDEX IX_Foos ON Foos (
L1 ASC,
L2 ASC,
Value ASC
);
ALTER TABLE Foos WITH CHECK ADD CONSTRAINT FK_Foos_Lookup1
FOREIGN KEY(L1) REFERENCES Lookup1 (Id);
ALTER TABLE Foos CHECK CONSTRAINT FK_Foos_Lookup1;
ALTER TABLE Foos WITH CHECK ADD CONSTRAINT FK_Foos_Lookup2
FOREIGN KEY(L2) REFERENCES Lookup2 (Id);
ALTER TABLE Foos CHECK CONSTRAINT FK_Foos_Lookup2;
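For a quick local repro, the tables can be populated with a few rows along these lines (the values are illustrative assumptions, not the data set from the sqlfiddle referenced below):

```sql
-- Illustrative sample rows (assumed values, not the sqlfiddle data)
INSERT INTO Lookup1 (Id, Name) VALUES (1, N'a'), (2, N'c');
INSERT INTO Lookup2 (Id, Name) VALUES (1, N'b'), (2, N'd');
INSERT INTO Foos (Id, L1, L2, Value) VALUES
    (1, 1, 1, 10),
    (2, 1, 1, 5),   -- lowest Value for (L1 = 1, L2 = 1)
    (3, 2, 2, 7);
```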
BAD PLAN:
The following query, which fetches Foos via the lookup tables:
select top(1) f.* from Foos f
join Lookup1 l1 on f.L1 = l1.Id
join Lookup2 l2 on f.L2 = l2.Id
where l1.Name = 'a' and l2.Name = 'b'
order by f.Value
does not fully utilize the IX_Foos index (it only seeks on one of the lookup columns); see http://sqlfiddle.com/#!6/cd5c1/1/0 for the plan with data.
GOOD PLAN:
However, if I rewrite the query:
declare @l1Id int = (select Id from Lookup1 where Name = 'a');
declare @l2Id int = (select Id from Lookup2 where Name = 'b');
select top(1) f.* from Foos f
where f.L1 = @l1Id and f.L2 = @l2Id
order by f.Value
it works as expected: it first looks up the ids in both lookup tables and then uses them to seek into the IX_Foos index.
Is it possible to use a hint to force SQL Server, in the first query (with joins), to look up the ids first and then use them to seek IX_Foos? Because if the Foos table is quite large, the first query (with joins) locks the whole table :(
NOTE: The inner-join query comes from LINQ. Alternatively, is it possible to force LINQ in Entity Framework to rewrite the queries using declare? Doing the lookups in separate requests would add round-trip delay in more complex queries.
NOTE2: In Oracle it works fine; it seems to be a SQL Server problem.
NOTE3: The locking issue is more apparent when adding TOP(1) to the select f.* from Foos ... (for instance, when you need only the min or max value).
UPDATE:
According to @Hoots's hint, I have changed IX_Lookup1 and IX_Lookup2:
CONSTRAINT IX_Lookup1 UNIQUE NONCLUSTERED (Name ASC, Id ASC)
CONSTRAINT IX_Lookup2 UNIQUE NONCLUSTERED (Name ASC, Id ASC)
It helps, but it is still sorting all results:
Why is it reading all 10,000 rows from Foos that match f.L1 and f.L2, instead of just taking the first row? (IX_Foos includes Value ASC, so it could find the first row without reading all 10,000 rows and sorting them.) The previous plan with declared variables uses IX_Foos, so it does not sort.
Looking at the query plans, SQL Server is using the same indexes in both versions of the SQL you've put down; it's just that the second version executes 3 separate pieces of SQL rather than 1, and so evaluates the indexes at different times.
I have checked and I think the solution is to change the indexes as below...
CONSTRAINT IX_Lookup1 UNIQUE NONCLUSTERED (Name ASC, ID ASC)
and
CONSTRAINT IX_Lookup2 UNIQUE NONCLUSTERED (Name ASC, ID ASC)
When it evaluates the index, it won't need to go off to the table data to get the ID, as the ID is now part of the index. This changes the plan to what you want, hopefully preventing the locking you're seeing, but I'm not going to guarantee that side of it, as locking isn't something I'll be able to reproduce.
UPDATE: I now see the issue...
The second piece of SQL is effectively not using set-based operations. Simplifying what you've done, you're doing...
select f.*
from Foos f
where f.L1 = 1
and f.L2 = 1
order by f.Value desc
Which only has to seek on a simple index to get the results that are already ordered.
In the first bit of SQL (as shown below) you're combining different data sets that have indexes only on the individual table items. The next two bits of SQL do the same thing with the same query plan...
select f.* -- cost 0.7099
from Foos f
join Lookup1 l1 on f.L1 = l1.Id
join Lookup2 l2 on f.L2 = l2.Id
where l1.Name = 'a' and l2.Name = 'b'
order by f.Value
select f.* -- cost 0.7099
from Foos f
inner join (SELECT l1.id l1Id, l2.id l2Id
from Lookup1 l1, Lookup2 l2
where l1.Name = 'a' and l2.Name='b') lookups on (f.L1 = lookups.l1Id and f.L2=lookups.l2Id)
order by f.Value desc
The reason I've put both down is that in the second version you can easily hint that the input is not a set but a single row, writing it as this...
select f.* -- cost 0.095
from Foos f
inner join (SELECT TOP 1 l1.id l1Id, l2.id l2Id
from Lookup1 l1, Lookup2 l2
where l1.Name = 'a' and l2.Name='b') lookups on (f.L1 = lookups.l1Id and f.L2=lookups.l2Id)
order by f.Value desc
Of course you can only do this knowing that the subquery will bring back a single record whether the TOP 1 is mentioned or not. This brings the cost down from 0.7099 to 0.095. I can only surmise that, now that the input is explicitly a single record, the optimiser knows the ordering can be handled by the index rather than having to 'manually' sort the rows.
Note: 0.7099 isn't very large for a query that runs on its own, i.e. you'll hardly notice it, but if it's part of a larger set of executions you can get the cost down if you like. I suspect the question is more about the reason why, which I believe comes down to set-based operations versus singular seeks.
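Since the question asked specifically about hints: SQL Server also accepts table hints that can nudge the access path. A sketch using FORCESEEK (available since SQL Server 2008) on the original join query; whether this actually produces the desired plan depends on the data and statistics, so treat it as an experiment rather than a fix:

```sql
-- FORCESEEK asks the optimizer to use a seek on Foos instead of a scan;
-- the resulting plan shape is not guaranteed.
select top(1) f.*
from Foos f with (forceseek)
join Lookup1 l1 on f.L1 = l1.Id
join Lookup2 l2 on f.L2 = l2.Id
where l1.Name = 'a' and l2.Name = 'b'
order by f.Value;
```

If the optimizer cannot produce a plan honoring the hint, the query fails with an error rather than silently falling back, which makes FORCESEEK easy to evaluate but risky to leave in production code.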
Try using a CTE like this:
with cte as
(select min(Value) as Value from Foos f
join Lookup1 l1 on f.L1 = l1.Id
join Lookup2 l2 on f.L2 = l2.Id
where l1.Name = 'a' and l2.Name = 'b')
select top(1) * from Foos where exists (select * from cte where cte.Value=Foos.Value)
option (recompile)
This roughly halves the logical reads from the Foos table and the execution time.
set statistics io,time on
1) your first query with the indexes suggested by @Hoots
Estimated Subtree Cost = 0.888
Table 'Foos'. Scan count 1, logical reads 59
CPU time = 15 ms, elapsed time = 151 ms.
2) this cte query with the same indexes
Estimated Subtree Cost = 0.397
Table 'Foos'. Scan count 2, logical reads 34
CPU time = 15 ms, elapsed time = 66 ms.
But for billions of rows in Foos this technique can be quite slow, since we touch the table twice instead of once as in your first query.
I have a table on my database which has about a million records. Most of the records are public, meaning all the users on the system are able to view them. However, the same table also holds private records, usually a couple of hundred for each user. I have about 1K users on the system.
Each record has 3 main columns:
ID - Enum of the record ID. Unique primary key.
UserID - Identifies the record owner. Null = General record available to everyone. ID = Private record available only for this specific user ID.
RecID - Public record ID. Unique for all public records. If a public record is changed by a user, the system duplicates this record with a new ID, but the same RecID.
Example
ID RecID UserID Comments
----------------------------------------------------------------------------
1 1000 NULL General record
2 1000 1 Modification of record ID=1, available only for userID=1
3 1001 NULL General Record
4 1002 NULL General Record
5 1001 2 Modification of record ID=3, available only for userID=2
If User 1 logs into the system, he should get the list of records 2,3,4
If User 2 logs into the system, he should get the list of records 1,4,5
If user 3 logs into the system, he should get the list of records 1,3,4
The query I'm using is as follow:
SELECT *
FROM TB_Records
WHERE UserID = @UserID
OR (RecID IS NULL AND NOT RecID IN (SELECT RecID
FROM TB_Records
WHERE UserID = @UserID))
The problem I'm having is performance. Adding sorting, filtering, and paging on top of this query results in 5-10 seconds per select. When removing the third line of the query (i.e. selecting all the records), performance is much better: 1-2 seconds.
I would like to know if there is a better way to handle such a requirement.
Thanks
This query doesn't make sense. The AND NOT part is unnecessary, because a NULL value of RecID would not do what you expect. I think you mean:
SELECT r.*
FROM TB_Records r
WHERE r.UserID = @UserID OR
(r.UserId IS NULL AND NOT r.RecID IN (SELECT r2.RecID
FROM TB_Records r2
WHERE r2.UserID = @UserID))
First, create indexes on TB_Records(UserId, RecId). That might help. Next, I would try changing this to an explicit left outer join:
select r.*
from TB_Records r left outer join
TB_Records r2
on r2.UserId = @UserId and
r2.RecId = r.RecId
where r.UserId = @UserId or r2.RecId is NULL;
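The index suggested above might be created along these lines (the index name is an assumption):

```sql
-- Composite index covering the UserId filter and the RecId self-join
-- (name is illustrative)
create index IX_TB_Records_UserId_RecId on TB_Records (UserId, RecId);
```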
EDIT:
One more attempt, with a different approach. This uses a window function to see if the user is present for a given record:
select t.*
from (select r.*,
max(case when r.UserId = @UserId then 1 else 0 end) over (partition by RecId) as HasUser
from TB_Records r
) t
where t.UserId = @UserId or HasUser = 0;
Otherwise, you should put the execution plans in the question. Sometimes a query with union all will optimize better than one with or:
select r.*
from TB_Records r
where r.UserId = @UserId
union all
select r.*
from TB_Records r left outer join
TB_Records r2
on r2.UserId = @UserId and
r2.RecId = r.RecId
where r2.RecId is NULL;
I have the following query, which retrieves 4 adverts from certain categories in a random order.
At the moment, if a user has more than 1 advert, then potentially all of those ads might be retrieved - I need to limit it so that only 1 ad per user is displayed.
Is this possible to achieve in the same query?
SELECT a.advert_id, a.title, a.url, a.user_id,
FLOOR(1 + RAND() * x.m_id) 'rand_ind'
FROM adverts AS a
INNER JOIN advert_categories AS ac
ON a.advert_id = ac.advert_id,
(
SELECT MAX(t.advert_id) - 1 'm_id'
FROM adverts t
) x
WHERE ac.category_id IN
(
SELECT category_id
FROM website_categories
WHERE website_id = '8'
)
AND a.advert_type = 'text'
GROUP BY a.advert_id
ORDER BY rand_ind
LIMIT 4
Note: The solution is the last query at the bottom of this answer.
Test Schema and Data
create table adverts (
advert_id int primary key, title varchar(20), url varchar(20), user_id int, advert_type varchar(10))
;
create table advert_categories (
advert_id int, category_id int, primary key(category_id, advert_id))
;
create table website_categories (
website_id int, category_id int, primary key(website_id, category_id))
;
insert website_categories values
(8,1),(8,3),(8,5),
(1,1),(2,3),(4,5)
;
insert adverts (advert_id, title, user_id) values
(1, 'StackExchange', 1),
(2, 'StackOverflow', 1),
(3, 'SuperUser', 1),
(4, 'ServerFault', 1),
(5, 'Programming', 1),
(6, 'C#', 2),
(7, 'Java', 2),
(8, 'Python', 2),
(9, 'Perl', 2),
(10, 'Google', 3)
;
update adverts set advert_type = 'text'
;
insert advert_categories values
(1,1),(1,3),
(2,3),(2,4),
(3,1),(3,2),(3,3),(3,4),
(4,1),
(5,4),
(6,1),(6,4),
(7,2),
(8,1),
(9,3),
(10,3),(10,5)
;
Data properties
each website can belong to multiple categories
for simplicity, all adverts are of type 'text'
each advert can belong to multiple categories. If a website has multiple categories that match an advert's categories more than once, that advert_id shows up multiple times in a straight join between the 3 tables, as the next query demonstrates.
This query joins the 3 tables together (notice that ids 1, 3 and 10 each appear twice)
select *
from website_categories wc
inner join advert_categories ac on wc.category_id = ac.category_id
inner join adverts a on a.advert_id = ac.advert_id and a.advert_type = 'text'
where wc.website_id='8'
order by a.advert_id
To make each advert show only once, this is the core query that lists all eligible ads, each exactly once
select *
from adverts a
where a.advert_type = 'text'
and exists (
select *
from website_categories wc
inner join advert_categories ac on wc.category_id = ac.category_id
where wc.website_id='8'
and a.advert_id = ac.advert_id)
The next query retrieves all the advert_id's to be shown
select advert_id, user_id
from (
select
advert_id, user_id,
@r := @r + 1 r
from (select @r:=0) r
cross join
(
# core query -- vvv
select a.advert_id, a.user_id
from adverts a
where a.advert_type = 'text'
and exists (
select *
from website_categories wc
inner join advert_categories ac on wc.category_id = ac.category_id
where wc.website_id='8'
and a.advert_id = ac.advert_id)
# core query -- ^^^
order by rand()
) EligibleAdsAndUserIDs
) RowNumbered
group by user_id
order by r
limit 2
There are 3 levels to this query
aliased EligibleAdsAndUserIDs: core query, sorted randomly using order by rand()
aliased RowNumbered: row number added to core query, using MySQL side-effecting @variables
the outermost query forces mysql to collect rows as numbered randomly in the inner queries, and group by user_id causes it to retain only the first row for each user_id. limit 2 causes the query to stop as soon as two distinct user_id's have been encountered.
This is the final query, which takes the advert_id's from the previous query and joins them back to the adverts table to retrieve the required columns. It selects ads:
only once per user_id
featuring users with more ads proportionally (statistically) to the number of eligible ads they have
Note: Point (2) works because the more ads you have, the more likely you are to hit the top placings in the row-numbering subquery
select a.advert_id, a.title, a.url, a.user_id
from
(
select advert_id
from (
select
advert_id, user_id,
@r := @r + 1 r
from (select @r:=0) r
cross join
(
# core query -- vvv
select a.advert_id, a.user_id
from adverts a
where a.advert_type = 'text'
and exists (
select *
from website_categories wc
inner join advert_categories ac on wc.category_id = ac.category_id
where wc.website_id='8'
and a.advert_id = ac.advert_id)
# core query -- ^^^
order by rand()
) EligibleAdsAndUserIDs
) RowNumbered
group by user_id
order by r
limit 2
) Top2
inner join adverts a on a.advert_id = Top2.advert_id;
I'm thinking through something but don't have MySQL available.. can you try this query to see if it works or crashes...
SELECT
PreQuery.user_id,
(select max( tmp.someRandom ) from PreQuery tmp where tmp.User_ID = PreQuery.User_ID ) MaxRandom
from
( select adverts.user_id,
rand() someRandom
from adverts, advert_categories
where adverts.advert_id = advert_categories.advert_id ) PreQuery
If the "tmp" alias is recognized as a temp buffer of the preliminary query as defined by the outer FROM clause, I might have something that will work... I think the field as a select statement from a queried FROM won't work, but if it does, I know I'll have something solid for you.
Ok, this one might make the head hurt a bit, but let's get the logic going... The innermost "CoreQuery" gets all unique, randomly ordered qualified users that have a qualifying ad based on the chosen category and type = 'text'. Since the order is random, I don't care what the assigned sequence is, and order by that. The limit 4 returns the first 4 users that qualify, regardless of one user having 1 ad vs another having 1000 ads.
Next, join to the advertisements, reversing the table / join qualifications... but by having a WHERE - IN sub-select, the sub-select runs per unique user ID qualified by the "CoreQuery" and will only be done 4 times, based on its inner limit. So even with 100 users with different advertisements, we get 4 users.
Now, the join from the CoreQuery to the adverts table is on the same qualifying user. Typically this would join ALL of that user's records against the core query... This is correct... HOWEVER, the next WHERE clause is what filters it down to only ONE ad for the given person.
The sub-select makes sure the "Advert_ID" matches the one selected in the sub-select. The sub-select is based ONLY on the current "CoreQuery.user_ID" and gets ALL the qualifying category / ads for that user (wrong... we don't want ALL ads)... So, adding an ORDER BY RAND() randomizes only this one person's ads in the result set... then limiting THAT by 1 gives only ONE of their qualified ads...
So, the CoreQuery restricts down to 4 users. Then for each qualified user ID, it gets only 1 of their qualified ads (by its inner ORDER BY RAND() and LIMIT 1)...
Although I don't have MySQL to try it on, the queries are completely legit, and I hope this works for you... man, I love brain teasers like this...
SELECT
ad1.*
from
( SELECT ad.user_id,
count(*) as UserAdCount,
RAND() as ANYRand
from
website_categories wc
inner join advert_categories ac
ON wc.category_id = ac.category_id
inner join adverts ad
ON ac.advert_id = ad.advert_id
AND ad.advert_type = 'text'
where
wc.website_id = 8
GROUP BY
1
order by
3
limit
4 ) CoreQuery,
adverts ad1
WHERE
ad1.advert_type = 'text'
AND CoreQuery.User_ID = ad1.User_ID
AND ad1.advert_id in
( select
ad2.advert_id
FROM
adverts ad2,
advert_categories ac2,
website_categories wc2
WHERE
ad2.user_id = CoreQuery.user_id
AND ad2.advert_id = ac2.advert_id
AND ac2.category_id = wc2.category_id
AND wc2.website_id = 8
ORDER BY
RAND()
LIMIT
1 )
I suggest doing the randomization in PHP. This is way faster than doing it in MySQL.
"However, when the table is large (over about 10,000 rows) this method of selecting a random row becomes increasingly slow with the size of the table and can create a great load on the server. I tested this on a table I was working that contained 2,394,968 rows. It took 717 seconds (12 minutes!) to return a random row."
http://www.greggdev.com/web/articles.php?id=6
set @userid = -1;
select * from (
    select
        a.id,
        a.title,
        case when @userid = a.userid then
            0
        else
            1
        end as isfirst,
        (@userid := a.userid) as cur_userid
    from
        adverts a
        inner join advertcategories ac on ac.advertid = a.advertid
        inner join categories c on c.categoryid = ac.categoryid
    where
        c.website = 8
    order by
        a.userid,
        rand()
) t
where isfirst = 1
limit 4
Add COUNT(a.user_id) as owned in the main select list and add HAVING owned < 2 after the GROUP BY.
http://dev.mysql.com/doc/refman/5.5/en/select.html
I think this is the way to do it: if a user has more than one advert, we will not select it.
An application of mine is trying to execute a count(*) query which returns after about 30 minutes. What's strange is that the query is very simple and the tables involved are large, but not gigantic (10,000 and 50,000 records).
The query which takes 30 minutes is:
select count(*)
from RECORD r inner join GROUP g
on g.GROUP_ID = r.GROUP_ID
where g.BATCH_ID = 1 and g.ENABLED = 'Y'
The database schema is essentially:
create table BATCH (
BATCH_ID int not null,
[other columns]...,
CONSTRAINT PK_BATCH PRIMARY KEY (BATCH_ID)
);
create table GROUP (
GROUP_ID int not null,
BATCH_ID int,
ENABLED char(1) not null,
[other columns]...,
CONSTRAINT PK_GROUP PRIMARY KEY (GROUP_ID),
CONSTRAINT FK_GROUP_BATCH_ID FOREIGN KEY (BATCH_ID)
REFERENCES BATCH (BATCH_ID),
CONSTRAINT CHK_GROUP_ENABLED CHECK(ENABLED in ('Y', 'N'))
);
create table RECORD (
GROUP_ID int not null,
RECORD_NUMBER int not null,
[other columns]...,
CONSTRAINT PK_RECORD PRIMARY KEY (GROUP_ID, RECORD_NUMBER),
CONSTRAINT FK_RECORD_GROUP_ID FOREIGN KEY (GROUP_ID)
REFERENCES GROUP (GROUP_ID)
);
create index IDX_GROUP_BATCH_ID on GROUP(BATCH_ID);
I checked whether there are any blocking locks in the database and there are none. I also ran the following pieces of the query, and all except the last two returned instantly:
select count(*) from RECORD -- 55,501
select count(*) from GROUP -- 11,693
select count(*)
from RECORD r inner join GROUP g
on g.GROUP_ID = r.GROUP_ID
-- 55,501
select count(*)
from GROUP g
where g.BATCH_ID = 1 and g.ENABLED = 'Y'
-- 3,112
select count(*)
from RECORD r inner join GROUP g
on g.GROUP_ID = r.GROUP_ID
where g.BATCH_ID = 1
-- 27,742 - took around 5 minutes to run
select count(*)
from RECORD r inner join GROUP g
on g.GROUP_ID = r.GROUP_ID
where g.ENABLED = 'Y'
-- 51,749 - took around 5 minutes to run
Can someone explain what's going on? How can I improve the query's performance? Thanks.
A coworker figured out the issue. It's because the table statistics weren't being updated and the last time the table was analyzed was a couple of months ago (when the table was essentially empty). I ran analyze table RECORD compute statistics and now the query is returning in less than a second.
I'll have to talk to the DBA about why the table statistics weren't being updated.
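For reference, the ANALYZE statement is deprecated for gathering optimizer statistics in Oracle; the recommended route is the DBMS_STATS package. A sketch, assuming RECORD lives in the current schema:

```sql
-- Preferred over ANALYZE ... COMPUTE STATISTICS for optimizer statistics
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => USER,      -- current schema
    tabname => 'RECORD',
    cascade => TRUE       -- also gather statistics on the table's indexes
  );
END;
/
```

Scheduling this (or relying on Oracle's automatic statistics collection job) would avoid the stale-statistics problem recurring.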
SELECT COUNT(*)
FROM RECORD R
LEFT OUTER JOIN GROUP G ON G.GROUP_ID = R.GROUP_ID
AND G.BATCH_ID = 1
AND G.ENABLED = 'Y'
Try that and let me know how it turns out. Not saying this IS the answer, but since I don't have access to a DB right now, I can't test it. Hope it works for ya.
An explain plan would be a good place to start.
See here:
Strange speed changes with sql query
for how to use the explain plan syntax (and query to see the result.)
If that doesn't show anything suspicious, you'll probably want to look at a trace.
I need to attach unlimited attributes to a record in a table, and I've already designed a system using #3 below with MySQL. Unfortunately I am finding that searching across a million records is getting slow. Is #2 a better approach, or is there a better way altogether? Is this a case for using a view? I'd like to keep my keys table separate so I know what attributes are being stored for each record.
1: simple:
table records: id, recordname, valname
select recordname from records where valname = 'myvalue'
2: a little more complex:
table records: id recordname
table keyvalues: id recordid keyname valname
select r.recordname
from records r
right join keyvalues kv on kv.recordid = r.id
and kv.keyname='mykey'
and kv.valname = 'myvalue'
3: most complex:
table records: id recordname
table keys: id keyname
table values: id recordid keyid valname
select r.recordname
from records r
right join keys k on k.keyname='mykey'
right join values v on v.recordid = r.id
and v.keyid = k.id
and v.valname = 'myvalue'
I would use inner joins. This will give a smaller result set and the one you want.
Did you try this query?
select r.recordname
from records r
left join values link on link.recordid = r.id and link.valname = 'myvalue'
left join keys k on k.id = link.keyid and k.keyname = 'mykey'
However, I think the real way to do it is to have 4 tables
table records: id recordname
table keys: id keyname
table values : id valuename
table joins : id recordid keyid valueid
Then (with the right indexes) you can have a query like this
select r.recordname
from joins j
left join records r on j.recordid = r.id
left join keys k on j.keyid = k.id
left join values v on j.valueid = v.id
where v.valuename = 'myvalue' and k.keyname = 'mykey'
This should be quite fast... all it has to do is find the ids in values and keys and then probe j. With the right indexes these lookups will be quick.
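The "right indexes" for the 4-table design might look like this (index names are assumptions; keys and values are quoted because KEYS and VALUES are reserved words in MySQL):

```sql
-- Illustrative indexes for the 4-table design (names are assumptions)
create index idx_keys_name   on `keys` (keyname);
create index idx_values_name on `values` (valuename);
-- composite index so the joins table can be probed by (keyid, valueid)
-- and the matching recordid read straight from the index
create index idx_joins_kv    on joins (keyid, valueid, recordid);
```

With these, the query resolves 'mykey' and 'myvalue' to ids via the two name indexes, then seeks idx_joins_kv instead of scanning joins.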