Index over multiple lookup tables in SQL Server

Index over multiple lookup tables in SQL Server - sql

In SQL Server 2012, let's have three tables: Foos, Lookup1 and Lookup2 created with the following SQL:
CREATE TABLE Foos (
Id int NOT NULL,
L1 int NOT NULL,
L2 int NOT NULL,
Value int NOT NULL,
CONSTRAINT PK_Foos PRIMARY KEY CLUSTERED (Id ASC)
);
CREATE TABLE Lookup1 (
Id int NOT NULL,
Name nvarchar(50) NOT NULL,
CONSTRAINT PK_Lookup1 PRIMARY KEY CLUSTERED (Id ASC),
CONSTRAINT IX_Lookup1 UNIQUE NONCLUSTERED (Name ASC)
);
CREATE TABLE Lookup2 (
Id int NOT NULL,
Name nvarchar(50) NOT NULL,
CONSTRAINT PK_Lookup2 PRIMARY KEY CLUSTERED (Id ASC),
CONSTRAINT IX_Lookup2 UNIQUE NONCLUSTERED (Name ASC)
);
CREATE NONCLUSTERED INDEX IX_Foos ON Foos (
L1 ASC,
L2 ASC,
Value ASC
);
ALTER TABLE Foos WITH CHECK ADD CONSTRAINT FK_Foos_Lookup1
FOREIGN KEY(L2) REFERENCES Lookup1 (Id);
ALTER TABLE Foos CHECK CONSTRAINT FK_Foos_Lookup1;
ALTER TABLE Foos WITH CHECK ADD CONSTRAINT FK_Foos_Lookup2
FOREIGN KEY(L1) REFERENCES Lookup2 (Id);
ALTER TABLE Foos CHECK CONSTRAINT FK_Foos_Lookup2;
BAD PLAN:
The following SQL query to get Foos by the lookup tables:
select top(1) f.* from Foos f
join Lookup1 l1 on f.L1 = l1.Id
join Lookup2 l2 on f.L2 = l2.Id
where l1.Name = 'a' and l2.Name = 'b'
order by f.Value
does not fully utilize the IX_Foos index, see http://sqlfiddle.com/#!6/cd5c1/1/0 and the plan with data.
(It just chooses one of the lookup tables.)
GOOD PLAN:
However if I rewrite the query:
declare #l1Id int = (select Id from Lookup1 where Name = 'a');
declare #l2Id int = (select Id from Lookup2 where Name = 'b');
select top(1) f.* from Foos f
where f.L1 = #l1Id and f.L2 = #l2Id
order by f.Value
it works as expected. It firstly lookup both lookup tables and then uses to seek the IX_Foos index.
Is it possible to use a hint to force the SQL Server in the first query (with joins) to lookup the ids first and then use it for IX_Foos?
Because if the Foos table is quite large, the first query (with joins) locks the whole table:(
NOTE: The inner join query comes from LINQ. Or is it possible to force LINQ in Entity Framework to rewrite the queries using declare. Since doing the lookup in multiple requests could have longer roundtrip delay in more complex queries.
NOTE2: In Oracle it works ok, it seems like a problem of SQL Server.
NOTE3: The locking issue is more apparent when adding TOP(1) to the select f.* from Foos .... (For instance you need to get only the min or max value.)
UPDATE:
According to the #Hoots hint, I have changed IX_Lookup1 and IX_Lookup2:
CONSTRAINT IX_Lookup1 UNIQUE NONCLUSTERED (Name ASC, Id ASC)
CONSTRAINT IX_Lookup2 UNIQUE NONCLUSTERED (Name ASC, Id ASC)
It helps, but it is still sorting all results:
Why is it taking all 10,000 rows from Foos that are matching f.L1 and f.L2, instead of just taking the first row. (The IX_Foos contains Value ASC so it could find the first row without processing all 10,000 rows and sort them.) The previous plan with declared variables is using the IX_Foos, so it is not doing the sort.

Looking at the query plans, SQL Server is using the same indexes in both versions of the SQL you've put down, it's just in the second version of sql it's executing 3 seperate pieces of SQL rather than 1 and so evaluating the indexes at different times.
I have checked and I think the solution is to change the indexes as below...
CONSTRAINT IX_Lookup1 UNIQUE NONCLUSTERED (Name ASC, ID ASC)
and
CONSTRAINT IX_Lookup2 UNIQUE NONCLUSTERED (Name ASC, ID ASC)
when it evaluates the index it won't go off and need to get the ID from the table data as it will have it in the index. This changes the plan to be what you want, hopefully preventing the locking you're seeing but I'm not going to guarantee that side of it as locking isn't something I'll be able to reproduce.
UPDATE: I now see the issue...
The second piece of SQL is effectively not using set based operations. Simplifying what you've done you're doing...
select f.*
from Foos f
where f.L1 = 1
and f.L2 = 1
order by f.Value desc
Which only has to seek on a simple index to get the results that are already ordered.
In the first bit of SQL (as shown below) you're combining different data sets that has indexes only on the individual table items. The next two bits of SQL do the same thing with the same query plan...
select f.* -- cost 0.7099
from Foos f
join Lookup1 l1 on f.L1 = l1.Id
join Lookup2 l2 on f.L2 = l2.Id
where l1.Name = 'a' and l2.Name = 'b'
order by f.Value
select f.* -- cost 0.7099
from Foos f
inner join (SELECT l1.id l1Id, l2.id l2Id
from Lookup1 l1, Lookup2 l2
where l1.Name = 'a' and l2.Name='b') lookups on (f.L1 = lookups.l1Id and f.L2=lookups.l2Id)
order by f.Value desc
The reason I've put both down is because you can hint in the second version quite easily that it's not set based but singular and write it down as this...
select f.* -- cost 0.095
from Foos f
inner join (SELECT TOP 1 l1.id l1Id, l2.id l2Id
from Lookup1 l1, Lookup2 l2
where l1.Name = 'a' and l2.Name='b') lookups on (f.L1 = lookups.l1Id and f.L2=lookups.l2Id)
order by f.Value desc
Of course you can only do this knowing that the sub query will bring back a single record whether the top 1 is mentioned or not. This then brings down the cost from 0.7099 to 0.095. I can only summise that now that there is explicitly a single record input the optimiser now knows the order of things can be dealt with by the index rather than having to 'manually' order them.
Note: 0.7099 isn't very large for a query that runs singularly i.e. you'll hardly notice but if it's part of a larger set of executions you can get the cost down if you like. I suspect the question is more about the reason why, which I believe is down to set based operations against singular seeks.

Try to use CTE like this
with cte as
(select min(Value) as Value from Foos f
join Lookup1 l1 on f.L1 = l1.Id
join Lookup2 l2 on f.L2 = l2.Id
where l1.Name = 'a' and l2.Name = 'b')
select top(1) * from Foos where exists (select * from cte where cte.Value=Foos.Value)
option (recompile)
This will twice reduce logical reads from Foos table and execution time.
set statistics io,time on
1) your first query with indexes by #Hoots
Estimated Subtree Cost = 0.888
Table 'Foos'. Scan count 1, logical reads 59
CPU time = 15 ms, elapsed time = 151 ms.
2) this cte query with the same indexes
Estimated Subtree Cost = 0.397
Table 'Foos'. Scan count 2, logical reads 34
CPU time = 15 ms, elapsed time = 66 ms.
But this technique for billions of rows in Foos can be quite slow as far as we touch this table twice instead of your first query.

Related

Wrong plan when inner-joining a view/subquery that has left join

I'm trying to build a query that inner joins a view (which exists for reusability), but apparently the fact that this view has an internal left join is somehow messing up the optimizer, and I can't really understand why (indices statistics are updated).
Below is an MCVE. It's actually very simple. You can picture it as a simple customer (B) - order (C) design where customer's address (optional) is in another table (A). And then we have a view to join the customer to it's address (vw_B).
Metadata and example data:
create table A (
id int not null,
fieldA char(10) not null,
constraint pk_A primary key (id)
);
create table B (
id int not null,
fieldB char(10) not null,
idA int,
constraint pk_B primary key (id),
constraint fk_A foreign key (idA) references A (id)
);
create view VW_B as
select b.*, a.fieldA from B
left join A on a.id = b.idA;
create table C (
id int not null,
mydate date not null,
idB int not null,
constraint pk_C primary key (id),
constraint fk_B foreign key (idB) references B (id)
);
create index ix_C on C (mydate);
insert into A (id, fieldA)
with recursive n as (
select 1 as n from rdb$database
union all
select n.n + 1 from n
where n < 10
)
select n.n, 'A' from n;
SET STATISTICS INDEX PK_A;
insert into B (id, fieldB, idA)
with recursive n as (
select 1 as n from rdb$database
union all
select n.n + 1 from n
where n < 100
)
select n.n, 'B', IIF(MOD(n.n, 5) = 0, null, MOD(n.n, 10)+1) from n;
SET STATISTICS INDEX PK_B;
SET STATISTICS INDEX FK_A;
insert into C (id, mydate, idB)
with recursive n as (
select 1 as n from rdb$database
union all
select n.n + 1 from n
where n < 1000
)
select n.n, cast('01.01.2020' as date) + 100*rand(), mod(n.n, 100)+1 from n;
SET STATISTICS INDEX PK_C;
SET STATISTICS INDEX FK_B;
SET STATISTICS INDEX IX_C;
With this design, I want to have a query that can join all tables in such a way that I can efficiently search orders by date (c.mydate) or any indexed customer information (table B). The obvious choice is an inner join between B and C, and it works fine. But if I want to add customer's address to the result, by using vw_B instead of B, the optimizer no longer selects the best plan.
Here are some queries to show this:
Manually joining all tables and filtering by date. Optimizer works fine.
select c.*, b.fieldB, a.fieldA from C
inner join B on b.id = c.idB
left join A on a.id = b.idA
where c.mydate = '01.01.2020'
PLAN JOIN (JOIN (C INDEX (IX_C), B INDEX (PK_B)), A INDEX (PK_A))
Reusing vw_B to have A table joined automatically. Optimizer selects a NATURAL plan on (VW_B B).
select c.*, b.fieldB, b.fieldA from C
inner join VW_B b on b.id = c.idB
where c.mydate = '01.01.2020'
PLAN JOIN (JOIN (B B NATURAL, B A INDEX (PK_A)), C INDEX (FK_B, IX_C))
Why does that happen? I thought these two queries should produce the exact same operation in the engine. Now, this is a very simple MVCE, and I have much more complex views that are very reusable, and with larger tables joining with those views is causing performance issues.
Do you have any suggestions to improve performance/PLAN selection, but preserving the convenience of reusability that views provide?
Server version is WI-V3.0.4.33054.

The Firebird optimizer is not intelligent enough to consider the queries equivalent.
Your query with view is equivalent to:
select c.*, b.fieldB, a.fieldA from C
inner join (B left join A on a.id = b.idA)
on b.id = c.idB
where c.mydate = '01.01.2020'
This will produce (almost) the same plan. So, the problem is not with the use of views or not itself, but with how table expressions are nested. This changes how they are evaluated by the engine, and which reordering of joins the engine thinks are possible.
As BrakNicku indicated in the comments, there is no general solution for this.

Finding an ID with the maximum value of an attribute in a group with SQL

If we have a simple table (SQLite syntax):
CREATE TABLE Receptions (
ID INTEGER PRIMARY KEY AUTOINCREMENT,
ID_Patients INTEGER NOT NULL,
ID_Doctors INTEGER NOT NULL,
StartDateTime TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP);
filled with some ids, what is the best SQL query for getting doctor with latest date for each patient? Is it possible to optimize the following example somehow to make it work faster?
SELECT ID_Patients pid, ID_Doctors did FROM Receptions INNER JOIN
(SELECT MAX(StartDateTime) maxDate, ID_Patients pid FROM Receptions GROUP BY pid) a
ON a.pid = pid AND a.maxDate = StartDateTime;
I wonder if anyone can explain how this query is executed and what data structures are created on the server side (assuming there are all required indices).

This might work faster with a correlated subquery:
SELECT r.*
FROM Receptions r
WHERE r.StartDateTime = (SELECT MAX(r2.StartDateTime)
FROM FROM Receptions r2
WHERE r2.pid = r.pid
);
For performance, you want an index on Receptions(pid, StartDateTime).

Performance of TOP(1) select on multiple tables

I'm having a performance problem with a TOP(1) (or EXISTS) select statement on a join of 2 tables.
I'm using SQL Server 2008 R2.
I have 2 tables:
CREATE TABLE Records(
Id PRIMARY KEY INT NOT NULL,
User INT NOT NULL,
RecordType INT NOT NULL)
CREATE TABLE Values(
Id PRIMARY KEY BIGINT NOT NULL,
RecordId INT NOT NULL,
Field INT NOT NULL,
Value NVARCHAR(400) NOT NULL,
CONSTRAINT FK_Values_Record FOREIGN KEY(RecordId) REFERENCES Records(Id))
with indexes:
CREATE NONCLUSTERED INDEX IDX_Records ON Records(User ASC, RecordType ASC) INCLUDE(Id)
CREATE NONCLUSTERED INDEX IDX_Values ON Values(RecordId ASC, Field ASC) INCLUDE(Value)
CREATE NONCLUSTERED INDEX IDX_ValuesByVal ON Values(Field ASC, Value ASC) INCLUDE(RecordId)
The tables contain a lot of data, around 100 million records in Records and 150 million in Values, and they are still growing. Some users have a lot of data, some only a small amount.
For some user/field combination we might have no records in the Values table, but for some other user/field we have almost as many records in the Values table as we have in the Records table for that user.
I want to write a query testing if I have any data for a user/field combination. My first try was this:
SELECT TOP(1) V.Field
FROM Records R
INNER JOIN Values V ON V.RecordId = R.Id
WHERE R.User = #User
AND R.RecordType = #RecordType
AND V.Field = #Field
The problem with this query was, that if the execution plan was not in the server's cache and the first user did not have a lot of data, the server would put an execution plan for this query that did not work well for a user with a lot of data, resulting in a timeout (more than 15 seconds). The same problem occurred for RecordTypes or Fields. So I had to hardcode the id's in the query instead of using variables.
SELECT TOP(1) V.Field
FROM Records R
INNER JOIN Values V ON V.RecordId = R.Id
WHERE R.User = 123
AND R.RecordType = 45
AND V.Field = 67
But even then the server would sometime do a a table scan instead of using the available indexes, also resulting in timeouts. So i had to add FORCESEEK to the query:
SELECT TOP(1) V.Field
FROM Records R WITH (FORCESEEK)
INNER JOIN Values V WITH (FORCESEEK) ON V.RecordId = R.Id
WHERE R.User = 123
AND R.RecordType = 45
AND V.Field = 67
But even now, the server sometimes first seeks in the Records table and then in the Values table, instead of first seeking in the Values table and then in the Records table, also resulting in timeouts. I don't know why this result in a timeout, but it does. As fields are linked to a RecordType in my model, I could remove the RecordType clause, forcing the server of first seeking in the Values table
SELECT TOP(1) V.Field
FROM Records R WITH (FORCESEEK)
INNER JOIN Values V WITH (FORCESEEK) ON V.RecordId = R.Id
WHERE R.User = 123
AND V.Field = 67
With this last change I no longer have any timeouts, but still the query take around 1 to 2 seconds, sometimes even 5 to 7 seconds.
I still don't understand why this takes this much time.
Does anyone have any ideas how to improve this query to avoid these long querytimes ?

Should not make any difference but for grins try
SELECT TOP(1) 1
FROM Records R
JOIN Values V
ON V.RecordId = R.Id
AND R.User = 123
AND R.RecordType = 45
AND V.Field = 67

Table index over a complex primary key in SQL Server

I got following tables in my database
user
status
statusToUser
statusToUser works as a link table between the other two for a many to many relationship
the table definition is the following:
User_Id
Status_Id
those columns are the primary key for the table and have a single index which holds both of them, but when running a query optimization for "missing queries" I got in the list the suggestion to add over user_id another index.
the question is do I really need another index over just that column, having already the other index?
thanks
Edit:
these are two different queries, same approach:
SELECT user_seeks * avg_total_user_cost * ( avg_user_impact * 0.01 ) AS [index_advantage] ,
dbmigs.last_user_seek ,
dbmid.[statement] AS [Database.Schema.Table] ,
dbmid.equality_columns ,
dbmid.inequality_columns ,
dbmid.included_columns ,
dbmigs.unique_compiles ,
dbmigs.user_seeks ,
dbmigs.avg_total_user_cost ,
dbmigs.avg_user_impact
FROM sys.dm_db_missing_index_group_stats AS dbmigs WITH ( NOLOCK )
INNER JOIN sys.dm_db_missing_index_groups AS dbmig WITH ( NOLOCK )
ON dbmigs.group_handle = dbmig.index_group_handle
INNER JOIN sys.dm_db_missing_index_details AS dbmid WITH ( NOLOCK )
ON dbmig.index_handle = dbmid.index_handle
WHERE dbmid.[database_id] = DB_ID()
ORDER BY index_advantage DESC ;
number 2
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
SELECT TOP 20
ROUND(s.avg_total_user_cost *
s.avg_user_impact
* (s.user_seeks + s.user_scans),0)
AS [Total Cost]
, d.[statement] AS [Table Name]
, equality_columns
, inequality_columns
, included_columns
FROM sys.dm_db_missing_index_groups g
INNER JOIN sys.dm_db_missing_index_group_stats s
ON s.group_handle = g.index_group_handle
INNER JOIN sys.dm_db_missing_index_details d
ON d.index_handle = g.index_handle
ORDER BY [Total Cost] DESC

Both fields in a junction table is foreign keys to other tables. It is usually a good idea to have a index on the foreign keys so a clustered key on (user_id, status_id) and a non clustered on (status_id, user_id) would be a good idea.
A delete in the status table or in the user table will have to check the existence of rows in statusToUser and if the only index you have is (user_id, status_id) the delete in user can use the primary key but the delete in status has to do a clustered index scan of statusToUser to verify that there are no rows in there that matches the row that is to be deleted.
The same goes for a predicates on status in queries. The primary key on (user_id, status_id) will not be of any help and you can end up with a clustered index scan instead of a potential seek or it might need to do an expensive sort operation.

Oracle sql query running for (almost) forever

An application of mine is trying to execute a count(*) query which returns after about 30 minutes. What's strange is that the query is very simple and the tables involved are large, but not gigantic (10,000 and 50,000 records).
The query which takes 30 minutes is:
select count(*)
from RECORD r inner join GROUP g
on g.GROUP_ID = r.GROUP_ID
where g.BATCH_ID = 1 and g.ENABLED = 'Y'
The database schema is essentially:
create table BATCH (
BATCH_ID int not null,
[other columns]...,
CONSTRAINT PK_BATCH PRIMARY KEY (BATCH_ID)
);
create table GROUP (
GROUP_ID int not null,
BATCH_ID int,
ENABLED char(1) not null,
[other columns]...,
CONSTRAINT PK_GROUP PRIMARY KEY (GROUP_ID),
CONSTRAINT FK_GROUP_BATCH_ID FOREIGN KEY (BATCH_ID)
REFERENCES BATCH (BATCH_ID),
CONSTRAINT CHK_GROUP_ENABLED CHECK(ENABLED in ('Y', 'N'))
);
create table RECORD (
GROUP_ID int not null,
RECORD_NUMBER int not null,
[other columns]...,
CONSTRAINT PK_RECORD PRIMARY KEY (GROUP_ID, RECORD_NUMBER),
CONSTRAINT FK_RECORD_GROUP_ID FOREIGN KEY (GROUP_ID)
REFERENCES GROUP (GROUP_ID)
);
create index IDX_GROUP_BATCH_ID on GROUP(BATCH_ID);
I checked whether there are any blocks in the database and there are none. I also ran the following pieces of the query and all except the last two returned instantly:
select count(*) from RECORD -- 55,501
select count(*) from GROUP -- 11,693
select count(*)
from RECORD r inner join GROUP g
on g.GROUP_ID = r.GROUP_ID
-- 55,501
select count(*)
from GROUP g
where g.BATCH_ID = 1 and g.ENABLED = 'Y'
-- 3,112
select count(*)
from RECORD r inner join GROUP g
on g.GROUP_ID = r.GROUP_ID
where g.BATCH_ID = 1
-- 27,742 - took around 5 minutes to run
select count(*)
from RECORD r inner join GROUP g
on g.GROUP_ID = r.GROUP_ID
where g.ENABLED = 'Y'
-- 51,749 - took around 5 minutes to run
Can someone explain what's going on? How can I improve the query's performance? Thanks.

A coworker figured out the issue. It's because the table statistics weren't being updated and the last time the table was analyzed was a couple of months ago (when the table was essentially empty). I ran analyze table RECORD compute statistics and now the query is returning in less than a second.
I'll have to talk to the DBA about why the table statistics weren't being updated.

SELECT COUNT(*)
FROM RECORD R
LEFT OUTER JOIN GROUP G ON G.GROUP_ID = R.GROUP_ID
AND G.BATCH_ID = 1
AND G.ENABLED = 'Y'
Try that and let me know how it turns out. Not saying this IS the answer, but since I don't have access to a DB right now, I can't test it. Hope it works for ya.

An explain plan would be a good place to start.
See here:
Strange speed changes with sql query
for how to use the explain plan syntax (and query to see the result.)
If that doesn't show anything suspicious, you'll probably want to look at a trace.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas