All developers know that IN and DISTINCT cause problems in every SQL query. A former colleague, who no longer works at my company, wrote the query below. Please take a look at the code. How can I tune this query for better performance?
SELECT xxx
, COUNT(DISTINCT Id) Count
FROM Test (NOLOCK)
WHERE IsDeleted = 0
AND xxx IN
(
SELECT CAST(value AS INT)
FROM STRING_SPLIT(@ProductIds, ',')
)
GROUP BY xxx
"All developers know that IN and DISTINCT cause problems in every SQL query."
This is not necessarily true. They can hurt performance, but sometimes they are necessary.
The IN is probably not a big deal. It gets evaluated once. If you have another way to pass in a list -- say using a temporary table -- that is better.
The COUNT(DISTINCT id) is suspicious. I would expect id to already be unique. If so, then just use COUNT(*).
The WITH (NOLOCK) is not recommended unless you really know what you are doing. Working with data that might be inconsistent is dangerous.
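Putting those three points together, a minimal sketch of how the query might be rewritten (the temp-table name is mine, and it assumes Id really is unique so that COUNT(*) is equivalent):
-- Load the ID list into an indexed temp table once, instead of splitting a string inside IN
CREATE TABLE #ProductIdList (ProductId INT PRIMARY KEY);

INSERT INTO #ProductIdList (ProductId)
SELECT DISTINCT CAST(value AS INT)
FROM STRING_SPLIT(@ProductIds, ',');

-- NOLOCK removed; COUNT(*) used on the assumption that Id is already unique
SELECT t.xxx,
       COUNT(*) AS [Count]
FROM Test AS t
JOIN #ProductIdList AS p
    ON p.ProductId = t.xxx
WHERE t.IsDeleted = 0
GROUP BY t.xxx;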
I have used Sentry One Plan Explorer to help find the tuning points of queries I am having performance issues with:
https://www.sentryone.com/plan-explorer
First you need to decide what good performance is in your environment, then find the worst parts of the query and optimize those first.
Last, consider how you are storing your data and look for places where it makes sense to add an index.
It would be better to create an index on the xxx column.
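For example, a hedged sketch of such an index (the filtered predicate and included column are my guesses at what the query above needs):
-- Supports the IsDeleted = 0 filter and the GROUP BY / IN lookup on xxx
CREATE NONCLUSTERED INDEX IX_Test_xxx
    ON dbo.Test (xxx)
    INCLUDE (Id)
    WHERE IsDeleted = 0;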
So, I have an SQL query for MSSQL looking like this (simplified for readability):
SELECT ...
FROM (
SELECT ..., ROUND(SUM(TOTAL_TIME)/86400.0,2) ...
FROM MY_DATA
WHERE STATUS NOT IN (107)
GROUP BY ...
) q
WHERE q.Tdays > 0
GROUP BY ...
It works fine, but I need a comparison against another table in the inner query, so I added a left join and said comparison:
SELECT ...
FROM (
SELECT ..., ROUND(SUM(TOTAL_TIME)/86400.0,2) ...
FROM MY_DATA
LEFT JOIN OTHER_TABLE ON MY_DATA.ID=OTHER_TABLE.ID -- new JOIN
WHERE STATUS NOT IN (107) AND (DEPARTMENT_ID='SP' OR DEPARTMENT_ID='BL') -- new AND branch
GROUP BY ...
) q
WHERE q.Tdays > 0
GROUP BY ...
This query works, but is A LOT slower than the previous one. The weird thing is, commenting out the new AND branch of the WHERE clause while leaving the JOIN as it is makes it fast again. As if it's not joining another table that is slowing the query down, but the actual string comparisons... I am lost as to why this is so slow, or how I could speed it up... any advice would be appreciated!
Use an INNER JOIN. The outer join is being undone by the WHERE clause anyway:
SELECT ..., ROUND(SUM(TOTAL_TIME)/86400.0,2) ...
FROM MY_DATA d INNER JOIN
OTHER_TABLE ot
ON d.ID = ot.ID -- new JOIN
WHERE d.STATUS NOT IN (107) AND DEPARTMENT_ID IN ('SP', 'BL') -- new AND branch
GROUP BY ...
(The IN shouldn't make a difference; it is just easier to write.)
Next, if this still has slow performance, look at the execution plan. Poor performance at that point usually means SQL Server is making a bad decision, probably about the JOIN algorithm. Normally, I fix this by forbidding nested loop joins, but there might be other solutions as well.
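For what it's worth, a minimal sketch of forbidding nested loop joins at the query level (whether this helps depends entirely on the plan, so treat it as an illustration rather than a fix):
SELECT ..., ROUND(SUM(TOTAL_TIME)/86400.0, 2) ...
FROM MY_DATA d
INNER JOIN OTHER_TABLE ot
    ON d.ID = ot.ID
WHERE d.STATUS NOT IN (107)
  AND DEPARTMENT_ID IN ('SP', 'BL')
GROUP BY ...
OPTION (HASH JOIN, MERGE JOIN)  -- only hash or merge joins may be used, never nested loops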
It's hard to say definitively what will or won't speed things up without seeing the execution plan. Also, understanding how fast you need it to be affects what steps you might want to (or not want to) consider taking.
What follows is admittedly somewhat vague, but these are a few things that came to mind when I thought about this. Take a look at the execution plan as Philip Couling suggested in that good link to get an idea where the pain points are, and of course, take these suggestions with a grain of salt.
You might consider adding some indexes to either or both of the tables. The execution plan might even give you suggestions on what could be useful, but off the top of my head, something on OTHER_TABLE.DEPARTMENT_ID probably wouldn't hurt.
You might be able to build potential new indexes as Filtered Indexes if you know those hard-coded search terms (like STATUS and DEPARTMENT_ID) are always going to be the same.
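For instance, a rough sketch of both ideas (the index names and key columns are guesses, since the real schema was simplified away):
-- Plain index to support the join and the department filter
CREATE NONCLUSTERED INDEX IX_OtherTable_Dept
    ON OTHER_TABLE (DEPARTMENT_ID, ID);

-- Filtered variant, if 'SP' and 'BL' are the only values ever searched
CREATE NONCLUSTERED INDEX IX_OtherTable_Dept_Filtered
    ON OTHER_TABLE (ID)
    INCLUDE (DEPARTMENT_ID)
    WHERE DEPARTMENT_ID IN ('SP', 'BL');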
You could pre-calculate some of this information if it's not changing so rapidly that you need to query it fresh on every call. This comes back to how fast you need it to go, because for just about any query, you can add columns or pre-populated lookup tables to avoid doing work at run time. For example, you could make a new bit field like IsNewOrBranch or IsStatusNot107 (both somewhat egregious steps, but things which could work). Or it might mean pre-aggregating the data from the inner query ahead of time.
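As a small illustration of the bit-field idea, a persisted computed column might look like this (the column name is invented, and you would still need an index on it to benefit):
-- Pre-computes the STATUS test so the runtime predicate becomes a simple equality
ALTER TABLE MY_DATA
    ADD IsStatusNot107 AS CASE WHEN STATUS <> 107 THEN 1 ELSE 0 END PERSISTED;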
I know you simplified the query for our benefit, but that also makes it a little hard to know what's going on with the subquery, and the subsequent GROUP BY being performed against that subquery. There might be a way to avoid having to do two group bys.
Along the same vein, you might also look into splitting what you're doing into separate statements if SQL is having a difficult time figuring out how best to return the data. For example, you might populate a temp table or table variable with the results of your inner query, then perform your subsequent GROUP BY on that. While this approach isn't always useful, there are many times where trying to cram all the work into a single query will actually end up being worse than several individual, simple, optimized steps would be.
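A rough two-step version of the query from the question, with the columns still elided (the temp-table name is mine):
-- Step 1: materialise the inner aggregate once
SELECT ..., ROUND(SUM(TOTAL_TIME)/86400.0, 2) AS Tdays
INTO #InnerTotals
FROM MY_DATA d
INNER JOIN OTHER_TABLE ot ON d.ID = ot.ID
WHERE d.STATUS NOT IN (107)
  AND DEPARTMENT_ID IN ('SP', 'BL')
GROUP BY ...

-- Step 2: run the outer aggregation against the small intermediate result
SELECT ...
FROM #InnerTotals
WHERE Tdays > 0
GROUP BY ...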
And as Gordon Linoff suggested, there are a number of query hints which could be used to coax the execution plan into doing things a specific way. But be careful, often that way lies madness.
Your SQL is fine, and restricting your data with an additional AND clause should usually not make it slower.
As it happens, choosing a fast execution path is a hard problem, and SQL Server sometimes (albeit seldom) gets it wrong.
What you can do to help SQL Server find the best execution path is to:
make sure the statistics on your tables are up-to-date and
make sure that there is an "obviously suitable" index that SQL Server can use. SQL Server Management Studio will usually give you suggestions on missing indexes when you select the "show actual execution plan" option.
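A minimal sketch of both steps, reusing the table names from the question (the index definition is only a guess at what an "obviously suitable" index might look like here):
-- Refresh statistics so the optimizer has accurate row-count estimates
UPDATE STATISTICS MY_DATA;
UPDATE STATISTICS OTHER_TABLE;

-- One candidate index for the new join and filter
CREATE NONCLUSTERED INDEX IX_OtherTable_ID_Dept
    ON OTHER_TABLE (ID, DEPARTMENT_ID);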
I am re-iterating the question asked by Mongus Pong, "Why would using a temp table be faster than a nested query?", which doesn't have an answer that works for me.
Most of us at some point find that when a nested query reaches a certain complexity it needs to be broken into temp tables to keep it performant. It is absurd that this could ever be the most practical way forward, and it means these processes can no longer be made into a view. And often 3rd party BI apps will only play nicely with views, so this is crucial.
I am convinced there must be a simple query plan setting to make the engine just spool each subquery in turn, working from the inside out. No second-guessing how it can make the subquery more selective (which it sometimes does very successfully) and no possibility of correlated subqueries. Just the stack of data the programmer intended to be returned by the self-contained code between the brackets.
It is common for me to find that simply changing from a subquery to a #table takes the time from 120 seconds to 5. Essentially the optimiser is making a major mistake somewhere. Sure, there may be very time consuming ways I could coax the optimiser to look at tables in the right order but even this offers no guarantees. I'm not asking for the ideal 2 second execute time here, just the speed that temp tabling offers me within the flexibility of a view.
I've never posted on here before but I have been writing SQL for years and have read the comments of other experienced people who've also just come to accept this problem and now I would just like the appropriate genius to step forward and say the special hint is X...
There are a few possible explanations as to why you see this behavior. Some common ones are
The subquery or CTE may be being repeatedly re-evaluated.
Materialising partial results into a #temp table may force a more optimum join order for that part of the plan by removing some possible options from the equation.
Materialising partial results into a #temp table may improve the rest of the plan by correcting poor cardinality estimates.
The most reliable method is simply to use a #temp table and materialize it yourself.
Failing that, regarding point 1, see "Provide a hint to force intermediate materialization of CTEs or derived tables". The use of TOP(large_number) ... ORDER BY can often encourage the result to be spooled rather than repeatedly re-evaluated.
Even if that works however there are no statistics on the spool.
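A rough sketch of the trick on an invented table (the names here are placeholders, not from the question):
-- TOP ... ORDER BY in the derived table can encourage an eager spool,
-- so the inner result is computed once rather than re-evaluated per outer row
SELECT *
FROM (
    SELECT TOP (2147483647) CustomerId, SUM(Amount) AS Total
    FROM dbo.Orders
    GROUP BY CustomerId
    ORDER BY CustomerId
) AS spooled
WHERE spooled.Total > 0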
For points 2 and 3 you would need to analyse why you weren't getting the desired plan. Possibly rewriting the query to use sargable predicates, or updating statistics might get a better plan. Failing that you could try using query hints to get the desired plan.
I do not believe there is a query hint that instructs the engine to spool each subquery in turn.
There is the OPTION (FORCE ORDER) query hint, which forces the engine to perform the JOINs in the order specified and could potentially coax it into achieving that result in some instances. This hint will sometimes produce a more efficient plan for a complex query when the engine keeps insisting on a sub-optimal plan. Of course, the optimizer should usually be trusted to determine the best plan.
Ideally there would be a query hint that would allow you to designate a CTE or subquery as "materialized" or "anonymous temp table", but there is not.
Another option (for future readers of this article) is to use a user-defined function. Multi-statement functions (as described in How to Share Data between Stored Procedures) appear to force SQL Server to materialize the results of your subquery. In addition, they allow you to specify primary keys and indexes on the resulting table to help the query optimizer. This function can then be used in a SELECT statement as part of your view. For example:
CREATE FUNCTION SalesByStore (@storeid varchar(30))
RETURNS @t TABLE (title varchar(80) NOT NULL PRIMARY KEY,
qty smallint NOT NULL) AS
BEGIN
INSERT @t (title, qty)
SELECT t.title, s.qty
FROM sales s
JOIN titles t ON t.title_id = s.title_id
WHERE s.stor_id = @storeid
RETURN
END
GO
CREATE VIEW SalesData AS
SELECT * FROM SalesByStore('6380')
Having run into this problem, I found out that (in my case) SQL Server was evaluating the conditions in incorrect order, because I had an index that could be used (IDX_CreatedOn on TableFoo).
SELECT bar.*
FROM
(SELECT * FROM TableFoo WHERE Deleted = 1) foo
JOIN TableBar bar ON (bar.FooId = foo.Id)
WHERE
foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE())
I managed to work around it by forcing the subquery to use another index (i.e. one that would be used when the subquery was executed without the parent query). In my case I switched to PK, which was meaningless for the query, but allowed the conditions from the subquery to be evaluated first.
SELECT bar.*
FROM
(SELECT * FROM TableFoo WITH (INDEX([PK_Id])) WHERE Deleted = 1) foo
JOIN TableBar bar ON (bar.FooId = foo.Id)
WHERE
foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE())
Filtering by the Deleted column was really simple and filtering the few results by CreatedOn afterwards was even easier. I was able to figure it out by comparing the Actual Execution Plan of the subquery and the parent query.
A more hacky solution (and not really recommended) is to force the subquery to get executed first by limiting the results using TOP, however this could lead to weird problems in the future if the results of the subquery exceed the limit (you could always set the limit to something ridiculous). Unfortunately TOP 100 PERCENT can't be used for this purpose since SQL Server just ignores it.
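A sketch of that hack, reusing the tables from the example above (the TOP value is arbitrary and would need to comfortably exceed any realistic row count):
SELECT bar.*
FROM
(SELECT TOP 1000000 * FROM TableFoo WHERE Deleted = 1) foo  -- TOP encourages the Deleted filter to run first
JOIN TableBar bar ON (bar.FooId = foo.Id)
WHERE
foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE())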
Recently, I came across a pattern (not sure, it could be an anti-pattern) for sorting data in a SELECT query. The pattern is a more verbose and non-declarative way of ordering data: dump the relevant data from the actual table into a temporary table and then apply ORDER BY on a field of the temporary table. I guess the only reason someone would do that is to improve performance (which I doubt) and no other benefit.
For example, let's say there is a user table. The table might contain millions of rows. We want to retrieve all the users whose first name starts with 'G', sorted by first name. The natural and more declarative way to implement a SQL query for this scenario is:
More natural and declarative way
SELECT * FROM Users
WHERE NAME LIKE 'G%'
ORDER BY Name
Verbose way
SELECT * INTO TempTable
FROM Users
WHERE NAME LIKE 'G%'
SELECT * FROM TempTable
ORDER BY Name
With that context, I have a few questions:
Will there be any performance difference between the two ways if there is no index on the first name field? If yes, which one would be better?
Will there be any performance difference between the two ways if there is an index on the first name field? If yes, which one would be better?
Should the SQL Server optimizer not generate the same execution plan for both ways?
Is there any benefit to the verbose way from any other perspective, like locking/blocking?
Thanks in advance.
Regularly: an anti-pattern used by people without an idea of what they are doing.
SOMETIMES: OK, because SQL Server has a problem that is not resolvable otherwise - I have not seen that one in years, though.
It makes things slower because it forces the tempdb table to be fully populated FIRST, while otherwise the query could POSSIBLY be resolved more efficiently.
The last time I saw that was about 3 years ago. We got it 3 times as fast by not trying to be smart with a tempdb table ;)
Answers:
1: No, it still needs a table scan, obviously.
2: Possibly - it depends on the data volume, but an index seek would return the data in order already (as the index is ordered by its content).
3: No, obviously. Query plan optimization is statement by statement. By cutting the execution in two, the query optimizer CANNOT merge the join into the first statement.
4: Only if you run into a query optimizer issue or a limitation on how many tables you can join - not in this degenerate case (degenerate in a technical sense, i.e. very simplistic). But if you need to join MANY tables it may be better to go with an interim step.
If the field you want to order by is not indexed, you could put everything into a temp table, index it, and then do the ordering - it might be faster. You would have to test to make sure.
There is never any benefit of the second approach that I can think of.
It means that, if the data is available pre-ordered, SQL Server can't take advantage of this, and it adds an unnecessary blocking operator and an additional sort to the plan.
In the case that the data is not available pre-ordered SQL Server will sort it in a work table either in memory or tempdb anyway and adding an explicit #temp table just adds an unnecessary additional step.
Edit
I suppose one case where the second approach could give an apparent benefit might be if the presence of the ORDER BY caused SQL Server to choose a different plan that turned out to be sub-optimal. In that case, I would resolve it in a different way, either by improving statistics or by using hints/query rewrites to avoid the undesired plan.
I have a question regarding SQL performance and was hoping someone would have the answer.
I have the database table tbl_users and I want to get the total number of users I have. I could write it as SELECT COUNT(*) FROM tbl_users. I presume such a query would have performance implications if I had a handful of users vs. several million of them. (So, assumption #1 is that the more rows I have, the more resources this query will consume.)
In this particular case I need to run this query at a relatively high frequency and each time I need to get up-to-date data (so, caching is not an option).
Assuming my assumption #1 is correct, I then thought of structuring it the following way:
create tbl_stats with a field userCounter
each time there is an insert in tbl_users, userCounter is updated +1
each time I need to get my user count, I can pull that one field from tbl_stats
Now, I realize that by doing it this way, the data in userCounter is technically a duplicate, which is bad form.
So, will my first query (assuming millions of rows of data) consume that many resources to warrant me to implement my alternative design? If so (or if possibly yes), then is my alternative design consistent with best practices?
If your table is indexed, which it almost certainly will be, then the performance of select count(*) probably will not be as bad as you might anticipate - even if you have millions of rows.
But, if it does become a concern, then rather than roll your own solution, look into using an indexed view.
I have a database table with almost 5 million records, the following query returns in less than a second
select count(userID) from tblUsers
This query returns in 2 seconds
select count(*) from tblUsers
I'd personally just go with select count(userID) rather than creating a duplicate field
I think this is one of those scenarios where you really need to measure the performance to make a good decision. I would wager that a simple COUNT(*) isn't going to create enough latency that you would need to implement your proposed work-around.
If you are worried, I would encapsulate your COUNT(*) in a function or stored procedure so you can quickly swap it out later if performance does become a problem.
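For example, a minimal sketch of such a wrapper (the function name is invented for illustration):
-- Callers depend on this function rather than on how the count is produced,
-- so the body can later be swapped for a counter table or an indexed view
CREATE FUNCTION dbo.GetUserCount()
RETURNS INT
AS
BEGIN
    RETURN (SELECT COUNT(*) FROM dbo.tbl_users);
END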
On some systems you can ask the system to maintain the counts for you. For example, in SQL Server you can have an indexed view on the count:
create view vwCountUsers
with schemabinding
as
select count_big(*) as [count]
from dbo.tbl_users;
create unique clustered index cdxCountUsers on vwCountUsers ([count]);
The system will maintain the count for you, and it will always be available at nearly no cost.
If you have a desperate need and a real business case for up to the minute accurate counts, then the trigger would be the way to go. Just make sure it caters for all multi-user issues such as concurrency and transactions.
It could become a bottleneck: instead of 5 transactions being able to insert into the table concurrently, they will queue up waiting to update the userCounter table, and you may even get deadlocks.
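For completeness, a minimal sketch of such a trigger, assuming the table and column names from the question and handling multi-row inserts and deletes:
CREATE TRIGGER trg_tbl_users_count
ON tbl_users
AFTER INSERT, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    -- inserted/deleted may contain many rows, so adjust by the difference
    UPDATE tbl_stats
    SET userCounter = userCounter
                    + (SELECT COUNT(*) FROM inserted)
                    - (SELECT COUNT(*) FROM deleted);
END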
There are other options for less accurate counts, but if you want accurate then there are very few other choices, but I'll try to think of some:
You could partition the data and in userCounter store a count by day. If the data only gets added for the current day, do a select sum(dailycount) from counter + select count(*) from table where {date=today}
You could at least use the nolock or readpast options to lessen resource usage:
select * from tbl with (readpast)
select * from tbl with (nolock)
There are some things it makes sense to precalculate for performance reasons (complex calculations over years of data). That's why data warehouses exist much of the time: to speed up reporting. SELECT COUNT(*) is generally not one of them if you have any indexing on the table at all. There are far worse performance problems to solve than that. I get 1 second to return the count on a table with 13 million rows.
I'm all about writing code that is more likely to perform well than the alternative (avoiding correlated subqueries, using set-based operations instead of cursors, having sargable where clauses), but this is a micro-optimization that should not be addressed until there is a real performance problem.
The below MSSQL 2005 query is very slow. I feel like there ought to be a way to speed it up, but am unsure how. Note that I edited the inner join to use select statements to make it more obvious (to people reading this question) what is going on, though this has no impact on speed (the execution plan is probably the same either way). Interestingly, I never actually use KeywordValueGroups for anything more than a count, but I'm not sure if there is a way to capitalize on this.
select top 1 cde.processPath as 'keywordValue', count(*) as 'total'
from dbo.ClientDefinitionEntry AS cde INNER JOIN dbo.KeywordValueGroups AS kvg
ON cde.keywordGroupId = kvg.keywordValueGrpId
where kvg.[name] = @definitionName
group by cde.processPath
order by total desc
Edit: Apparently, people keep complaining about my use of subqueries. In fact, it makes no difference. I added them right before posting this question to make it easier to see what is going on. But they only made things more confusing, so I changed it not to use them.
Edit: Indexes in use:
ClientDefinitionEntry:
IX_ClientDefinitionEntry |nonclustered located on PRIMARY|clientId, keywordGroupId
KeyWordValueGroups
IX_KeywordValueGroups |nonclustered located on PRIMARY|keywordValueGrpId
IX_KeywordValueGroups_2 |nonclustered located on PRIMARY|version
IX_KeywordValueGroups_Name |nonclustered located on PRIMARY|name
What does the execution plan look like?
By having a look at it, you'll learn which part of the query takes the most time / resources.
Do you have indexes on the columns you filter on? Do you have indexes on the columns you use for joining? Do you have indexes on the columns you use for sorting?
Once you've taken a look at this, and the query is still slow, you can take a look at how your database / table is fragmented (dbcc showcontig) and see if it is necessary to rebuild the indexes.
It might be helpful to have a maintenance plan which rebuilds your indexes on a regular basis.
Run the query with this option on:
SET SHOWPLAN_TEXT ON
And add the result to the question.
Also check if your statistics are up to date:
SELECT
object_name = Object_Name(ind.object_id),
IndexName = ind.name,
StatisticsDate = STATS_DATE(ind.object_id, ind.index_id)
FROM SYS.INDEXES ind
order by STATS_DATE(ind.object_id, ind.index_id) desc
And information about indexes, table definitions and foreign keys would be helpful.
I'd make sure you had the following indexes.
The ID on KeywordValueGroups.
The Name on KeywordValueGroups.
The ID on ClientDefinitionEntry with an INCLUDE for the processPath.
CREATE INDEX [IX_ClientDefinitionEntry_Id_ProcessPath] ON [dbo].[ClientDefinitionEntry] ( [keywordGroupId] ASC ) INCLUDE ( [processPath]) ON [PRIMARY]
CREATE INDEX [IX_KeywordValueGroups_Id] ON [dbo].[KeywordValueGroups] ( [keywordValueGrpId] ASC )
CREATE INDEX [IX_KeywordValueGroups_Name] ON [dbo].[KeywordValueGroups] ( [name] ASC )
I'd also change the query to the following.
select top 1
cde.processPath as 'keywordValue',
count(*) as 'total'
from
dbo.ClientDefinitionEntry AS cde
INNER JOIN
dbo.KeywordValueGroups AS kvg
ON
cde.keywordGroupId = kvg.keywordValueGrpId
where
kvg.[name] = @definitionName
group by
processPath
order by
total desc
There really isn't enough information to know for sure. If you are having performance problems in that query, then the tables must have a non-trivial amount of data and you must be missing important indexes.
Which indexes will definitely help depends deeply on how large the tables are, and to a lesser extent on the distribution of values in the KeywordGroupId and KeywordValueGrpId fields.
Lacking any other information, I would say that you want to make sure that dbo.KeywordValueGroups.[name] is indexed, as well as dbo.ClientDefinitionEntry.[keywordGroupId].
Because of the way the query is written, an index on dbo.KeywordValueGroups.[keywordValueGrpId] alone cannot help, but a composite index on [name], [keywordValueGrpId] probably will. If you have that index, you don't need a dedicated index on [name].
Based on gut feeling alone, I might hazard that the index on [name] is a must, and that cde.keywordGroupId is likely important. Whether the composite index on [name], [keywordValueGrpId] would help depends on how many records there are with the same [name].
The only way to know for sure is to add the indexes and see what happens.
You also need to think about how often this query runs (so, how important is it to make it fast), and how often the underlying data changes. Depending on your particular circumstances, the increase in speed might not justify the added cost of maintaining the indexes.
Not sure how many records we are talking about but this:
order by total desc
is on a calculated column meaning every calculation on every row will have to be done before the ordering can be done. Likely this is one of the things slowing it down, but I see no way out of that particular problem. Not a problem if you only have a few records after the join, but could be if there are lots of them.
I'd concentrate on indexing first. We often forget that when we create foreign keys they are not automatically indexed. Check to see if both parts of the join are indexed.
Since you are passing a value in a parameter, you might also have a parameter sniffing problem. Google this for techniques to fix that.
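One common technique for that (an assumption on my part, not something the poster confirmed) is OPTION (RECOMPILE), which compiles the plan for the actual parameter value each time:
select top 1 cde.processPath as 'keywordValue', count(*) as 'total'
from dbo.ClientDefinitionEntry AS cde
INNER JOIN dbo.KeywordValueGroups AS kvg
ON cde.keywordGroupId = kvg.keywordValueGrpId
where kvg.[name] = @definitionName
group by cde.processPath
order by total desc
option (recompile)  -- plan built for the current value of @definitionName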