Table index over a complex primary key in SQL Server

I have the following tables in my database:
user
status
statusToUser
statusToUser is a link table between the other two, implementing a many-to-many relationship.
Its definition is:
User_Id
Status_Id
Together, those two columns form the table's primary key, backed by a single index that holds both of them. However, when I ran a "missing indexes" query, the results included a suggestion to add another index on user_id alone.
The question is: do I really need another index on just that column when the composite index already exists?
Thanks
Edit:
These are the two queries I used; they differ but take the same approach:
SELECT user_seeks * avg_total_user_cost * ( avg_user_impact * 0.01 ) AS [index_advantage] ,
dbmigs.last_user_seek ,
dbmid.[statement] AS [Database.Schema.Table] ,
dbmid.equality_columns ,
dbmid.inequality_columns ,
dbmid.included_columns ,
dbmigs.unique_compiles ,
dbmigs.user_seeks ,
dbmigs.avg_total_user_cost ,
dbmigs.avg_user_impact
FROM sys.dm_db_missing_index_group_stats AS dbmigs WITH ( NOLOCK )
INNER JOIN sys.dm_db_missing_index_groups AS dbmig WITH ( NOLOCK )
ON dbmigs.group_handle = dbmig.index_group_handle
INNER JOIN sys.dm_db_missing_index_details AS dbmid WITH ( NOLOCK )
ON dbmig.index_handle = dbmid.index_handle
WHERE dbmid.[database_id] = DB_ID()
ORDER BY index_advantage DESC ;
Query 2:
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
SELECT TOP 20
ROUND(s.avg_total_user_cost *
s.avg_user_impact
* (s.user_seeks + s.user_scans),0)
AS [Total Cost]
, d.[statement] AS [Table Name]
, equality_columns
, inequality_columns
, included_columns
FROM sys.dm_db_missing_index_groups g
INNER JOIN sys.dm_db_missing_index_group_stats s
ON s.group_handle = g.index_group_handle
INNER JOIN sys.dm_db_missing_index_details d
ON d.index_handle = g.index_handle
ORDER BY [Total Cost] DESC

Both columns in a junction table are foreign keys to other tables. It is usually a good idea to have an index on each foreign key, so a clustered key on (user_id, status_id) plus a nonclustered index on (status_id, user_id) would be a good choice.
A delete in the status table or in the user table has to check for matching rows in statusToUser. If the only index you have is (user_id, status_id), the delete in user can use the primary key, but the delete in status has to do a clustered index scan of statusToUser to verify that no rows there match the row being deleted.
The same goes for predicates on status_id in queries. The primary key on (user_id, status_id) will not help, and you can end up with a clustered index scan instead of a seek, or with an expensive sort operation.
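The effect of column order in a composite index can be observed on a small scale. The following is only a sketch: it uses SQLite (through Python's sqlite3 module) as a stand-in for SQL Server, so the plan wording and the optimizer differ. It assumes, as this answer does, that the key is (user_id, status_id). The index name ix_status_user and the sample data are made up for illustration.

```python
import sqlite3

def plan(conn, sql):
    """Return the detail strings of SQLite's EXPLAIN QUERY PLAN for sql."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE statusToUser ("
    " user_id INTEGER NOT NULL,"
    " status_id INTEGER NOT NULL,"
    " PRIMARY KEY (user_id, status_id))"
)
# Hypothetical sample data: 100 users, each with 5 statuses.
conn.executemany(
    "INSERT INTO statusToUser VALUES (?, ?)",
    [(u, s) for u in range(100) for s in range(5)],
)

# Filter on the SECOND key column only.
query = "SELECT user_id FROM statusToUser WHERE status_id = 3"

plan_before = plan(conn, query)  # only the (user_id, status_id) key exists
conn.execute("CREATE INDEX ix_status_user ON statusToUser (status_id, user_id)")
plan_after = plan(conn, query)   # reversed index is now available

print(plan_before)
print(plan_after)
```

With only the primary key in place, SQLite reports a scan; once the reversed index exists, it reports a search on ix_status_user. SQL Server's showplan uses different terms (scan vs. seek), but the underlying reason is the same: a composite index is ordered by its leading column.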

Related

Optimize delete based on "count" of elements in table with foreign key

I have two tables: "tracks" (the track headers) and "track_points" (the points in each track).
Schema:
CREATE TABLE tracks(
id INTEGER PRIMARY KEY ASC,
start_time TEXT NOT NULL
);
CREATE TABLE track_points (
id INTEGER PRIMARY KEY AUTOINCREMENT,
data BLOB,
track_id INTEGER NOT NULL,
FOREIGN KEY(track_id) REFERENCES tracks(id) ON DELETE CASCADE
);
CREATE INDEX track_id_idx ON track_points (track_id);
CREATE INDEX start_time_idx ON tracks (start_time);
I want to delete all "tracks" that have 0 or 1 points.
Note that a track with 0 points has no rows in "track_points".
I wrote this query:
DELETE FROM tracks WHERE tracks.id IN
(SELECT track_id FROM
(SELECT tracks.id as track_id, COUNT(track_points.id) as track_len FROM tracks
LEFT JOIN track_points ON tracks.id=track_points.track_id GROUP BY tracks.id)
WHERE track_len<=1)
It seems to work, but I wonder whether it is possible to optimize this query.
I mean its running time (currently 10 seconds on a big table on my machine).
Or maybe the SQL can be simplified (without making it slower, of course)?

You can simplify your code by removing 1 level of your subqueries, because you can achieve the same with a HAVING clause instead of an outer WHERE clause:
DELETE FROM tracks
WHERE id IN (
SELECT t.id
FROM tracks t LEFT JOIN track_points p
ON t.id = p.track_id
GROUP BY t.id
HAVING COUNT(p.id) <= 1
);
This may not make any difference in performance, but it's simpler.
The same logic could also be applied by using EXCEPT:
DELETE FROM tracks
WHERE id IN (
SELECT id FROM tracks
EXCEPT
SELECT track_id
FROM track_points
GROUP BY track_id
HAVING COUNT(*) > 1
);
What you can try is a query that does not involve the join of the 2 tables.
Aggregate only in track_points and get the track_ids with 2 or more occurrences. Then delete all the rows from tracks with ids that are not included in the result of the previous query:
DELETE FROM tracks
WHERE id NOT IN (
SELECT track_id
FROM track_points
GROUP BY track_id
HAVING COUNT(*) > 1
);
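To check that the three variants above really delete the same rows, here is a small sketch using Python's sqlite3 module and the schema from the question. The sample data is invented (track 1 has no points, track 2 has one, track 3 has two), and the test says nothing about relative speed on a large table.

```python
import sqlite3

SCHEMA = """
CREATE TABLE tracks(
    id INTEGER PRIMARY KEY ASC,
    start_time TEXT NOT NULL
);
CREATE TABLE track_points (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    data BLOB,
    track_id INTEGER NOT NULL,
    FOREIGN KEY(track_id) REFERENCES tracks(id) ON DELETE CASCADE
);
CREATE INDEX track_id_idx ON track_points (track_id);
"""

# The three DELETE statements from the answer above.
DELETE_VARIANTS = {
    "having": """
        DELETE FROM tracks WHERE id IN (
            SELECT t.id FROM tracks t LEFT JOIN track_points p
                ON t.id = p.track_id
            GROUP BY t.id HAVING COUNT(p.id) <= 1)""",
    "except": """
        DELETE FROM tracks WHERE id IN (
            SELECT id FROM tracks
            EXCEPT
            SELECT track_id FROM track_points
            GROUP BY track_id HAVING COUNT(*) > 1)""",
    "not_in": """
        DELETE FROM tracks WHERE id NOT IN (
            SELECT track_id FROM track_points
            GROUP BY track_id HAVING COUNT(*) > 1)""",
}

def run_variant(delete_sql):
    """Apply one DELETE to a fresh in-memory database and report what is left."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # required for ON DELETE CASCADE
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO tracks (id, start_time) VALUES (?, ?)",
        [(1, "t1"), (2, "t2"), (3, "t3")],
    )
    # Track 1 has no points, track 2 has one, track 3 has two.
    conn.executemany(
        "INSERT INTO track_points (track_id) VALUES (?)",
        [(2,), (3,), (3,)],
    )
    conn.execute(delete_sql)
    remaining = [r[0] for r in conn.execute("SELECT id FROM tracks ORDER BY id")]
    points = conn.execute("SELECT COUNT(*) FROM track_points").fetchone()[0]
    conn.close()
    return remaining, points

results = {name: run_variant(sql) for name, sql in DELETE_VARIANTS.items()}
print(results)
```

Each variant leaves only track 3 and its two points. Note that the NOT IN form relies on track_id being declared NOT NULL; with a nullable column, a NULL in the subquery result would make NOT IN match nothing.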

Performance of TOP(1) select on multiple tables

I'm having a performance problem with a TOP(1) (or EXISTS) select statement on a join of 2 tables.
I'm using SQL Server 2008 R2.
I have 2 tables:
CREATE TABLE Records(
Id INT NOT NULL PRIMARY KEY,
User INT NOT NULL,
RecordType INT NOT NULL)
CREATE TABLE Values(
Id BIGINT NOT NULL PRIMARY KEY,
RecordId INT NOT NULL,
Field INT NOT NULL,
Value NVARCHAR(400) NOT NULL,
CONSTRAINT FK_Values_Record FOREIGN KEY(RecordId) REFERENCES Records(Id))
with indexes:
CREATE NONCLUSTERED INDEX IDX_Records ON Records(User ASC, RecordType ASC) INCLUDE(Id)
CREATE NONCLUSTERED INDEX IDX_Values ON Values(RecordId ASC, Field ASC) INCLUDE(Value)
CREATE NONCLUSTERED INDEX IDX_ValuesByVal ON Values(Field ASC, Value ASC) INCLUDE(RecordId)
The tables contain a lot of data, around 100 million records in Records and 150 million in Values, and they are still growing. Some users have a lot of data, some only a small amount.
For some user/field combination we might have no records in the Values table, but for some other user/field we have almost as many records in the Values table as we have in the Records table for that user.
I want to write a query testing if I have any data for a user/field combination. My first try was this:
SELECT TOP(1) V.Field
FROM Records R
INNER JOIN Values V ON V.RecordId = R.Id
WHERE R.User = @User
AND R.RecordType = @RecordType
AND V.Field = @Field
The problem with this query was that if the execution plan was not in the server's cache and the first user to run it did not have a lot of data, the server would cache a plan that did not work well for a user with a lot of data, resulting in a timeout (more than 15 seconds). The same problem occurred for RecordTypes or Fields. So I had to hardcode the ids in the query instead of using variables:
SELECT TOP(1) V.Field
FROM Records R
INNER JOIN Values V ON V.RecordId = R.Id
WHERE R.User = 123
AND R.RecordType = 45
AND V.Field = 67
But even then the server would sometimes do a table scan instead of using the available indexes, also resulting in timeouts. So I had to add FORCESEEK to the query:
SELECT TOP(1) V.Field
FROM Records R WITH (FORCESEEK)
INNER JOIN Values V WITH (FORCESEEK) ON V.RecordId = R.Id
WHERE R.User = 123
AND R.RecordType = 45
AND V.Field = 67
But even now, the server sometimes seeks in the Records table first and then in the Values table, instead of seeking in the Values table first and then in the Records table, also resulting in timeouts. I don't know why this results in a timeout, but it does. As fields are linked to a RecordType in my model, I could remove the RecordType clause, forcing the server to seek in the Values table first:
SELECT TOP(1) V.Field
FROM Records R WITH (FORCESEEK)
INNER JOIN Values V WITH (FORCESEEK) ON V.RecordId = R.Id
WHERE R.User = 123
AND V.Field = 67
With this last change I no longer get timeouts, but the query still takes around 1 to 2 seconds, sometimes even 5 to 7 seconds.
I still don't understand why it takes this long.
Does anyone have any ideas on how to improve this query to avoid these long query times?

This should not make any difference, but for grins try:
SELECT TOP(1) 1
FROM Records R
JOIN Values V
ON V.RecordId = R.Id
AND R.User = 123
AND R.RecordType = 45
AND V.Field = 67

Index over multiple lookup tables in SQL Server

In SQL Server 2012, let's have three tables: Foos, Lookup1 and Lookup2 created with the following SQL:
CREATE TABLE Foos (
Id int NOT NULL,
L1 int NOT NULL,
L2 int NOT NULL,
Value int NOT NULL,
CONSTRAINT PK_Foos PRIMARY KEY CLUSTERED (Id ASC)
);
CREATE TABLE Lookup1 (
Id int NOT NULL,
Name nvarchar(50) NOT NULL,
CONSTRAINT PK_Lookup1 PRIMARY KEY CLUSTERED (Id ASC),
CONSTRAINT IX_Lookup1 UNIQUE NONCLUSTERED (Name ASC)
);
CREATE TABLE Lookup2 (
Id int NOT NULL,
Name nvarchar(50) NOT NULL,
CONSTRAINT PK_Lookup2 PRIMARY KEY CLUSTERED (Id ASC),
CONSTRAINT IX_Lookup2 UNIQUE NONCLUSTERED (Name ASC)
);
CREATE NONCLUSTERED INDEX IX_Foos ON Foos (
L1 ASC,
L2 ASC,
Value ASC
);
ALTER TABLE Foos WITH CHECK ADD CONSTRAINT FK_Foos_Lookup1
FOREIGN KEY(L2) REFERENCES Lookup1 (Id);
ALTER TABLE Foos CHECK CONSTRAINT FK_Foos_Lookup1;
ALTER TABLE Foos WITH CHECK ADD CONSTRAINT FK_Foos_Lookup2
FOREIGN KEY(L1) REFERENCES Lookup2 (Id);
ALTER TABLE Foos CHECK CONSTRAINT FK_Foos_Lookup2;
BAD PLAN:
The following SQL query to get Foos by the lookup tables:
select top(1) f.* from Foos f
join Lookup1 l1 on f.L1 = l1.Id
join Lookup2 l2 on f.L2 = l2.Id
where l1.Name = 'a' and l2.Name = 'b'
order by f.Value
does not fully utilize the IX_Foos index, see http://sqlfiddle.com/#!6/cd5c1/1/0 and the plan with data.
(It just chooses one of the lookup tables.)
GOOD PLAN:
However if I rewrite the query:
declare @l1Id int = (select Id from Lookup1 where Name = 'a');
declare @l2Id int = (select Id from Lookup2 where Name = 'b');
select top(1) f.* from Foos f
where f.L1 = @l1Id and f.L2 = @l2Id
order by f.Value
it works as expected. It first looks up the ids in both lookup tables and then uses them to seek into the IX_Foos index.
Is it possible to use a hint to force SQL Server, in the first query (with joins), to look up the ids first and then use them for IX_Foos?
Because if the Foos table is quite large, the first query (with joins) locks the whole table :(
NOTE: The inner join query comes from LINQ. Alternatively, is it possible to force LINQ in Entity Framework to rewrite the queries using declare? Doing the lookups in multiple requests could add round-trip delay in more complex queries.
NOTE2: In Oracle it works fine, so it seems to be a SQL Server problem.
NOTE3: The locking issue is more apparent when adding TOP(1) to the select f.* from Foos .... (For instance, when you need only the min or max value.)
UPDATE:
Following @Hoots's hint, I have changed IX_Lookup1 and IX_Lookup2:
CONSTRAINT IX_Lookup1 UNIQUE NONCLUSTERED (Name ASC, Id ASC)
CONSTRAINT IX_Lookup2 UNIQUE NONCLUSTERED (Name ASC, Id ASC)
It helps, but it is still sorting all results.
Why is it taking all 10,000 rows from Foos that match f.L1 and f.L2, instead of just taking the first row? (IX_Foos contains Value ASC, so it could find the first row without processing and sorting all 10,000 rows.) The previous plan with declared variables uses IX_Foos, so it does not do the sort.

Looking at the query plans, SQL Server is using the same indexes in both versions of the SQL you've posted; it's just that the second version executes 3 separate pieces of SQL rather than 1, and so evaluates the indexes at different times.
I have checked, and I think the solution is to change the indexes as below:
CONSTRAINT IX_Lookup1 UNIQUE NONCLUSTERED (Name ASC, ID ASC)
and
CONSTRAINT IX_Lookup2 UNIQUE NONCLUSTERED (Name ASC, ID ASC)
With that change, when it evaluates the index it won't need to go to the table data to get the ID, because the ID is already in the index. This changes the plan to what you want, hopefully preventing the locking you're seeing, but I can't guarantee that part, as locking isn't something I'm able to reproduce.
UPDATE: I now see the issue...
The second piece of SQL is effectively not using set-based operations. Simplifying what you've done, you're running:
select f.*
from Foos f
where f.L1 = 1
and f.L2 = 1
order by f.Value desc
Which only has to seek on a simple index to get the results that are already ordered.
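The claim that an index on (L1, L2, Value) returns rows already ordered by Value can be checked with a small sketch. It is written against SQLite through Python's sqlite3 module, not SQL Server, so it only shows the general principle. LIMIT 1 stands in for TOP(1), and the sample data is invented.

```python
import sqlite3

def plan(conn, sql):
    """Return the detail strings of SQLite's EXPLAIN QUERY PLAN for sql."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

def needs_sort(plan_details):
    """True if the plan contains an explicit sort step for ORDER BY."""
    return any("TEMP B-TREE" in detail for detail in plan_details)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Foos ("
    " Id INTEGER PRIMARY KEY,"
    " L1 INT NOT NULL,"
    " L2 INT NOT NULL,"
    " Value INT NOT NULL)"
)
# Hypothetical sample data.
conn.executemany(
    "INSERT INTO Foos (L1, L2, Value) VALUES (?, ?, ?)",
    [(i % 3, i % 5, i) for i in range(1000)],
)

# Equality on both leading index columns, ordered by the third.
query = "SELECT * FROM Foos WHERE L1 = 1 AND L2 = 1 ORDER BY Value LIMIT 1"

plan_without_index = plan(conn, query)
conn.execute("CREATE INDEX IX_Foos ON Foos (L1, L2, Value)")
plan_with_index = plan(conn, query)

print(plan_without_index, needs_sort(plan_without_index))
print(plan_with_index, needs_sort(plan_with_index))
```

Without the index the plan includes a sort step; with it, the equality predicates on L1 and L2 pin down a range of the index that is already in Value order, so the first row can be read directly.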
In the first piece of SQL (shown below) you're combining different data sets that have indexes only on the individual tables. The next two pieces of SQL do the same thing with the same query plan:
select f.* -- cost 0.7099
from Foos f
join Lookup1 l1 on f.L1 = l1.Id
join Lookup2 l2 on f.L2 = l2.Id
where l1.Name = 'a' and l2.Name = 'b'
order by f.Value
select f.* -- cost 0.7099
from Foos f
inner join (SELECT l1.id l1Id, l2.id l2Id
from Lookup1 l1, Lookup2 l2
where l1.Name = 'a' and l2.Name='b') lookups on (f.L1 = lookups.l1Id and f.L2=lookups.l2Id)
order by f.Value desc
The reason I've shown both is that in the second version you can quite easily hint that the lookup is not set-based but returns a single row, by writing it like this:
select f.* -- cost 0.095
from Foos f
inner join (SELECT TOP 1 l1.id l1Id, l2.id l2Id
from Lookup1 l1, Lookup2 l2
where l1.Name = 'a' and l2.Name='b') lookups on (f.L1 = lookups.l1Id and f.L2=lookups.l2Id)
order by f.Value desc
Of course, you can only do this if you know that the subquery will bring back a single record whether or not the TOP 1 is there. This brings the cost down from 0.7099 to 0.095. I can only surmise that, now that the input is explicitly a single record, the optimiser knows the ordering can be handled by the index rather than by 'manually' sorting the rows.
Note: 0.7099 isn't very large for a query that runs on its own, i.e. you'll hardly notice it, but if it's part of a larger set of executions you can get the cost down if you like. I suspect the question is more about the reason why, which I believe comes down to set-based operations versus singular seeks.

Try using a CTE like this:
with cte as
(select min(Value) as Value from Foos f
join Lookup1 l1 on f.L1 = l1.Id
join Lookup2 l2 on f.L2 = l2.Id
where l1.Name = 'a' and l2.Name = 'b')
select top(1) * from Foos where exists (select * from cte where cte.Value=Foos.Value)
option (recompile)
This roughly halves the logical reads from the Foos table and the execution time.
set statistics io,time on
1) your first query with indexes by @Hoots
Estimated Subtree Cost = 0.888
Table 'Foos'. Scan count 1, logical reads 59
CPU time = 15 ms, elapsed time = 151 ms.
2) this cte query with the same indexes
Estimated Subtree Cost = 0.397
Table 'Foos'. Scan count 2, logical reads 34
CPU time = 15 ms, elapsed time = 66 ms.
But with billions of rows in Foos this technique can be quite slow, because it touches the table twice, whereas your first query touches it once.

Oracle sql query running for (almost) forever

An application of mine is trying to execute a count(*) query which returns after about 30 minutes. What's strange is that the query is very simple and the tables involved are large, but not gigantic (10,000 and 50,000 records).
The query which takes 30 minutes is:
select count(*)
from RECORD r inner join GROUP g
on g.GROUP_ID = r.GROUP_ID
where g.BATCH_ID = 1 and g.ENABLED = 'Y'
The database schema is essentially:
create table BATCH (
BATCH_ID int not null,
[other columns]...,
CONSTRAINT PK_BATCH PRIMARY KEY (BATCH_ID)
);
create table GROUP (
GROUP_ID int not null,
BATCH_ID int,
ENABLED char(1) not null,
[other columns]...,
CONSTRAINT PK_GROUP PRIMARY KEY (GROUP_ID),
CONSTRAINT FK_GROUP_BATCH_ID FOREIGN KEY (BATCH_ID)
REFERENCES BATCH (BATCH_ID),
CONSTRAINT CHK_GROUP_ENABLED CHECK(ENABLED in ('Y', 'N'))
);
create table RECORD (
GROUP_ID int not null,
RECORD_NUMBER int not null,
[other columns]...,
CONSTRAINT PK_RECORD PRIMARY KEY (GROUP_ID, RECORD_NUMBER),
CONSTRAINT FK_RECORD_GROUP_ID FOREIGN KEY (GROUP_ID)
REFERENCES GROUP (GROUP_ID)
);
create index IDX_GROUP_BATCH_ID on GROUP(BATCH_ID);
I checked whether there are any blocks in the database and there are none. I also ran the following pieces of the query and all except the last two returned instantly:
select count(*) from RECORD -- 55,501
select count(*) from GROUP -- 11,693
select count(*)
from RECORD r inner join GROUP g
on g.GROUP_ID = r.GROUP_ID
-- 55,501
select count(*)
from GROUP g
where g.BATCH_ID = 1 and g.ENABLED = 'Y'
-- 3,112
select count(*)
from RECORD r inner join GROUP g
on g.GROUP_ID = r.GROUP_ID
where g.BATCH_ID = 1
-- 27,742 - took around 5 minutes to run
select count(*)
from RECORD r inner join GROUP g
on g.GROUP_ID = r.GROUP_ID
where g.ENABLED = 'Y'
-- 51,749 - took around 5 minutes to run
Can someone explain what's going on? How can I improve the query's performance? Thanks.

A coworker figured out the issue. It's because the table statistics weren't being updated and the last time the table was analyzed was a couple of months ago (when the table was essentially empty). I ran analyze table RECORD compute statistics and now the query is returning in less than a second.
I'll have to talk to the DBA about why the table statistics weren't being updated.

SELECT COUNT(*)
FROM RECORD R
LEFT OUTER JOIN GROUP G ON G.GROUP_ID = R.GROUP_ID
AND G.BATCH_ID = 1
AND G.ENABLED = 'Y'
Try that and let me know how it turns out. Not saying this IS the answer, but since I don't have access to a DB right now, I can't test it. Hope it works for ya.

An explain plan would be a good place to start.
See here:
Strange speed changes with sql query
for how to use the explain plan syntax (and query to see the result.)
If that doesn't show anything suspicious, you'll probably want to look at a trace.

Sql Server query performance?

I have contacts that can be in more than one group and have more than one request. I need to simply get contacts for a specific group that have no specific requests.
How do I improve the performance of this query:
SELECT top 1 con_name ,
con_id
FROM tbl_group_to_contact gc
INNER JOIN tbl_contact c ON gc.con_id = c.id
WHERE group_id = '81'
AND NOT c.id IN ( SELECT con_id
FROM tbl_request_to_contact
WHERE request_id = '124' )
When I run that query and look at the execution plan, it shows that this subquery:
SELECT con_id
FROM tbl_request_to_contact
WHERE request_id = '124'
is expensive, and that it uses an index seek.
|--Top(TOP EXPRESSION:((1)))
|--Nested Loops(Left Anti Semi Join, OUTER REFERENCES:([c].[id]))
|--Nested Loops(Inner Join, OUTER REFERENCES:([gc].[con_id], [Expr1006]) WITH UNORDERED PREFETCH)
| |--Clustered Index Scan(OBJECT:([db_newsletter].[dbo].[tbl_group_to_contact].[PK_tbl_group_to_contact_1] AS [gc]), WHERE:([db_newsletter].[dbo].[tbl_group_to_contact].[group_id] as [gc].[group_id]=(81)) ORDERED FORWARD)
| |--Clustered Index Seek(OBJECT:([db_newsletter].[dbo].[tbl_contact].[PK_tbl_contact] AS [c]), SEEK:([c].[id]=[db_newsletter].[dbo].[tbl_group_to_contact].[con_id] as [gc].[con_id]) ORDERED FORWARD)
|--Top(TOP EXPRESSION:((1)))
|--Clustered Index Seek(OBJECT:([db_newsletter].[dbo].[tbl_request_to_contact].[PK_tbl_request_to_contact] AS [cc]), SEEK:([cc].[request_id]=(124)), WHERE:([db_newsletter].[dbo].[tbl_contact].[id] as [c].[id]=[db_newsletter].[dbo].[tbl_request_to_contact].[con_id] as [cc].[con_id]) ORDERED FORWARD)

Your query is ok, just create the following indexes:
tbl_request_to_contact (request_id, con_id)
tbl_group_to_contact (group_id, con_id)
Since the tables seem to be the link tables, you want to make these composites the primary keys:
ALTER TABLE tbl_request_to_contact ADD CONSTRAINT pk_rc PRIMARY KEY (request_id, con_id)
ALTER TABLE tbl_group_to_contact ADD CONSTRAINT pk_gc PRIMARY KEY (group_id, con_id)
, making sure that request_id and group_id go first.
Also, if your request_id and group_id are integers, pass the integers as the parameters, not strings:
SELECT con_name, con_id
FROM tbl_group_to_contact gc
JOIN tbl_contact c
ON c.id = gc.con_id
WHERE group_id = 81
AND c.id NOT IN
(
SELECT con_id
FROM tbl_request_to_contact
WHERE request_id = 124
)
, or an implicit conversion may occur rendering the indexes unusable.
Update:
From your plan I see that the index on tbl_group_to_contact is missing. A full table scan is required to filter the groups.
Create the index:
CREATE UNIQUE INDEX ux_gc ON tbl_group_to_contact (group_id, con_id)

You may want to try running the SQL Server Database Tuning Advisor.

I agree with @Quassnoi about the indexes. Plus, you can use a left join to show only contacts who don't have requests. This can perform better than a sub-query.
What is the request_id = '124' for? Do other request ids not matter?
SELECT con_name ,
gc.con_id
FROM tbl_group_to_contact gc
INNER JOIN tbl_contact c ON gc.con_id = c.id
LEFT JOIN tbl_request_to_contact rtc ON gc.con_id = rtc.con_id
WHERE group_id = '81' and rtc.request_id IS NULL
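Regarding the question about request_id = '124': the left join as written answers a different question than the original NOT IN. The sketch below (Python's sqlite3 module, invented sample rows, integer ids assumed) compares the two, plus a third form that puts the request filter in the ON clause. It is about result semantics only and says nothing about performance on SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl_contact (id INTEGER PRIMARY KEY, con_name TEXT NOT NULL);
CREATE TABLE tbl_group_to_contact (
    group_id INTEGER NOT NULL, con_id INTEGER NOT NULL,
    PRIMARY KEY (group_id, con_id));
CREATE TABLE tbl_request_to_contact (
    request_id INTEGER NOT NULL, con_id INTEGER NOT NULL,
    PRIMARY KEY (request_id, con_id));
-- Hypothetical data: contact 1 has request 124, contact 2 only has
-- request 999, contact 3 has no requests at all.
INSERT INTO tbl_contact VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
INSERT INTO tbl_group_to_contact VALUES (81, 1), (81, 2), (81, 3);
INSERT INTO tbl_request_to_contact VALUES (124, 1), (999, 2);
""")

def ids(sql):
    """Run a query and return the first column as a list."""
    return [row[0] for row in conn.execute(sql)]

# Original form: contacts in group 81 without request 124.
not_in = ids("""
    SELECT c.id FROM tbl_group_to_contact gc
    JOIN tbl_contact c ON c.id = gc.con_id
    WHERE gc.group_id = 81
      AND c.id NOT IN (SELECT con_id FROM tbl_request_to_contact
                       WHERE request_id = 124)
    ORDER BY c.id""")

# Left join without a request filter: contacts with no requests at all.
left_join_unfiltered = ids("""
    SELECT c.id FROM tbl_group_to_contact gc
    JOIN tbl_contact c ON c.id = gc.con_id
    LEFT JOIN tbl_request_to_contact rtc ON rtc.con_id = gc.con_id
    WHERE gc.group_id = 81 AND rtc.request_id IS NULL
    ORDER BY c.id""")

# Left join with the filter in the ON clause: same result as NOT IN.
left_join_filtered = ids("""
    SELECT c.id FROM tbl_group_to_contact gc
    JOIN tbl_contact c ON c.id = gc.con_id
    LEFT JOIN tbl_request_to_contact rtc
        ON rtc.con_id = gc.con_id AND rtc.request_id = 124
    WHERE gc.group_id = 81 AND rtc.request_id IS NULL
    ORDER BY c.id""")

print(not_in, left_join_unfiltered, left_join_filtered)
```

Contact 2, who has a different request but not request 124, is returned by the NOT IN query and by the filtered left join, but not by the unfiltered one. So if the goal is "no request 124", the request_id condition belongs in the ON clause.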
