I recently started doing some performance tuning on a client's stored procedures and bumped into this chunk of code, but I couldn't find a way to make it run more efficiently.
declare @StationListCount int;
select @StationListCount = count(*) from @StationList;
declare @FleetsCnt int;
select @FleetsCnt = COUNT(*) from @FleetIds;
declare @StationCnt int;
select @StationCnt = COUNT(*) from @StationIds;
declare @VehiclesCnt int;
select @VehiclesCnt = COUNT(*) from @VehicleIds;
declare @TrIds table (VehicleId bigint, TrId bigint, InRange bit);
insert into @TrIds (VehicleId, TrId, InRange)
select t.VehicleID, t.FuelTransactionId, 1
from dbo.FuelTransaction t
join dbo.Fleet f on f.FleetID = t.FleetID and f.CompanyID = @ActorCompanyID
where t.TransactionTime >= @From and (@To is null or t.TransactionTime < @To)
and (@StationListCount = 0 or exists (select ID from @StationList where t.FuelStationID = ID))
and (@FleetsCnt = 0 or exists (select ID from @FleetIds where ID = t.FleetID))
and (@StationCnt = 0 or exists (select ID from @StationIds where ID = t.FuelStationID))
and (@VehiclesCnt = 0 or exists (select ID from @VehicleIds where ID = t.VehicleID))
and t.VehicleID is not null
The INSERT statement slows down the whole procedure and accounts for 99% of its cost.
I am not sure, but I think the nested loops in the execution plan correspond to the queries inside the WHERE clause.
I would very much appreciate any help I can get on this.
Thank you!
There are a couple of things that you should go over to see the performance differences. First of all, as the previous answer suggests, you should avoid count(*)-style aggregates as much as possible. If the table is big, the cost of these counts grows with its size. You could even think of storing those counts in a separate table with proper index constraints.
I also suggest splitting the SELECT statement into multiple statements, because when you combine so many NULL checks and OR/AND conditions, your indexes may be bypassed and the query cost can increase a lot. Sometimes, using UNIONs or separate branches can perform far better than such combined conditions.
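As a rough illustration of the splitting idea, here is one way to branch on just the station-list filter (the other optional filters would be branched the same way; all names are taken from the question, so treat this as a sketch rather than a drop-in replacement):
IF @StationListCount = 0
    INSERT INTO @TrIds (VehicleId, TrId, InRange)
    SELECT t.VehicleID, t.FuelTransactionId, 1
    FROM dbo.FuelTransaction t
    JOIN dbo.Fleet f ON f.FleetID = t.FleetID AND f.CompanyID = @ActorCompanyID
    WHERE t.TransactionTime >= @From AND (@To IS NULL OR t.TransactionTime < @To)
      AND t.VehicleID IS NOT NULL;
ELSE
    INSERT INTO @TrIds (VehicleId, TrId, InRange)
    SELECT t.VehicleID, t.FuelTransactionId, 1
    FROM dbo.FuelTransaction t
    JOIN dbo.Fleet f ON f.FleetID = t.FleetID AND f.CompanyID = @ActorCompanyID
    WHERE t.TransactionTime >= @From AND (@To IS NULL OR t.TransactionTime < @To)
      AND EXISTS (SELECT 1 FROM @StationList s WHERE s.ID = t.FuelStationID)
      AND t.VehicleID IS NOT NULL;
Each branch gets its own plan, so the empty-list case no longer carries the EXISTS check at all.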
You should try all of these and see what fits your needs.
Hope it helps.
The insert only uses one table for the vehicle ID, so joining the other tables isn't required.
I don't see the declarations of the table variables (@StationList, @FleetIds, etc.), but (assuming the IDs in them are unique) consider communicating this information to the optimizer; in other words, add primary key constraints to them.
Also, add OPTION (RECOMPILE) to the end of the query.
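A minimal sketch of both suggestions, assuming the lists are table variables with a single unique bigint ID column (their declarations are not shown in the question, so the types are a guess):
DECLARE @StationList TABLE (ID bigint NOT NULL PRIMARY KEY);
DECLARE @FleetIds    TABLE (ID bigint NOT NULL PRIMARY KEY);
DECLARE @StationIds  TABLE (ID bigint NOT NULL PRIMARY KEY);
DECLARE @VehicleIds  TABLE (ID bigint NOT NULL PRIMARY KEY);

-- ...populate the lists, then run the insert with a recompile hint
-- so the optimizer sees the real row counts:
INSERT INTO @TrIds (VehicleId, TrId, InRange)
SELECT t.VehicleID, t.FuelTransactionId, 1
FROM dbo.FuelTransaction t
JOIN dbo.Fleet f ON f.FleetID = t.FleetID AND f.CompanyID = @ActorCompanyID
WHERE t.TransactionTime >= @From AND (@To IS NULL OR t.TransactionTime < @To)
  AND (@StationListCount = 0 OR EXISTS (SELECT 1 FROM @StationList WHERE ID = t.FuelStationID))
  AND (@FleetsCnt = 0 OR EXISTS (SELECT 1 FROM @FleetIds WHERE ID = t.FleetID))
  AND (@StationCnt = 0 OR EXISTS (SELECT 1 FROM @StationIds WHERE ID = t.FuelStationID))
  AND (@VehiclesCnt = 0 OR EXISTS (SELECT 1 FROM @VehicleIds WHERE ID = t.VehicleID))
  AND t.VehicleID IS NOT NULL
OPTION (RECOMPILE);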
I have a SELECT query on a view that contains 500,000+ rows. Let's keep it simple:
SELECT * FROM dbo.Document WHERE MemberID = 578310
The query runs fast, ~0s
Let's rewrite it to work with the set of values, which reflects my needs more:
SELECT * FROM dbo.Document WHERE MemberID IN (578310)
This is just as fast, ~0s
But now the set of IDs needs to be variable; let's define it as:
DECLARE @AuthorizedMembers TABLE
(
MemberID BIGINT NOT NULL PRIMARY KEY, --primary key
UNIQUE NONCLUSTERED (MemberID) -- and index, as if it could help...
);
INSERT INTO @AuthorizedMembers SELECT 578310
The set contains the same single value, but it is a table variable now. The performance of such a query drops to 2s, and in more complicated queries it goes as high as 25s and more, while with a fixed ID it stays around ~0s.
SELECT *
FROM dbo.Document
WHERE MemberID IN (SELECT MemberID FROM @AuthorizedMembers)
is just as bad as:
SELECT *
FROM dbo.Document
WHERE EXISTS (SELECT MemberID
FROM @AuthorizedMembers
WHERE [@AuthorizedMembers].MemberID = Document.MemberID)
or as bad as this:
SELECT *
FROM dbo.Document
INNER JOIN @AuthorizedMembers AS AM ON AM.MemberID = Document.MemberID
The performance is the same for all of the above and always much worse than with a fixed value.
Dynamic SQL helps easily: building an nvarchar list like (id1,id2,id3) and concatenating it into a fixed query keeps my query times at ~0s. But I would like to avoid dynamic SQL as much as possible, and if I do use it, I would like to keep the string the same regardless of the values (using parameters, which the above method does not allow).
Any ideas on how to get the table variable's performance close to that of a fixed list of values, or how to avoid building different dynamic SQL for each run?
P.S. I have tried the above with a user-defined type, with the same results.
Edit:
The results with a temporary table, defined as:
CREATE TABLE #AuthorizedMembers
(
MemberID BIGINT NOT NULL PRIMARY KEY
);
INSERT INTO #AuthorizedMembers SELECT 578310
have improved the execution time up to 3 times (13s -> 4s), which is still significantly slower than dynamic SQL (<1s).
Your options:
Use a temporary table instead of a TABLE variable
If you insist on using a TABLE variable, add OPTION(RECOMPILE) at the end of your query
Explanation:
When the compiler compiles your statement, the TABLE variable has no rows in it and therefore doesn't have the proper cardinalities. This results in an inefficient execution plan. OPTION(RECOMPILE) forces the statement to be recompiled when it is run. At that point the TABLE variable has rows in it and the compiler has better cardinalities to produce an execution plan.
The general rule of thumb is to use temporary tables when operating on large datasets and table variables for small datasets with frequent updates. Personally I only very rarely use TABLE variables because they generally perform poorly.
I can recommend this answer on the question "What's the difference between temporary tables and table variables in SQL Server?" if you want an in-depth analysis on the differences.
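Applied to the query from the question, the second option is a one-line change (a sketch):
SELECT *
FROM dbo.Document
WHERE MemberID IN (SELECT MemberID FROM @AuthorizedMembers)
OPTION (RECOMPILE);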
For the DB gurus out there, I was wondering if there is any functional/performance difference between joining to the results of a SELECT statement and joining to a previously filled table variable. I'm working in SQL Server 2008 R2.
Example (TSQL):
-- Create a test table
DROP TABLE [dbo].[TestTable]
CREATE TABLE [dbo].[TestTable](
[id] [int] NOT NULL,
[value] [varchar](max) NULL
) ON [PRIMARY]
-- Populate the test table with a few rows
INSERT INTO [dbo].[TestTable]
SELECT 1123, 'test1'
INSERT INTO [dbo].[TestTable]
SELECT 2234, 'test2'
INSERT INTO [dbo].[TestTable]
SELECT 3345, 'test3'
-- Create a reference table
DROP TABLE [dbo].[TestRefTable]
CREATE TABLE [dbo].[TestRefTable](
[id] [int] NOT NULL,
[refvalue] [varchar](max) NULL
) ON [PRIMARY]
-- Populate the reference table with a few rows
INSERT INTO [dbo].[TestRefTable]
SELECT 1123, 'ref1'
INSERT INTO [dbo].[TestRefTable]
SELECT 2234, 'ref2'
-- Scenario 1: Insert matching results into its own table variable, then Join
-- Create a table variable
DECLARE @subset TABLE ([id] INT NOT NULL, [refvalue] VARCHAR(MAX))
INSERT INTO @subset
SELECT * FROM [dbo].[TestRefTable]
WHERE [dbo].[TestRefTable].[id] = 1123
SELECT t.*, s.*
FROM [dbo].[TestTable] t
JOIN @subset s
ON t.id = s.id
-- Scenario 2: Join directly to SELECT results
SELECT t.*, s.*
FROM [dbo].TestTable t
JOIN (SELECT * FROM [dbo].[TestRefTable] WHERE id = 1123) s
ON t.id = s.id
In the "real" world, the tables and table variable are pre-defined. What I'm looking at is being able to have the matched reference rows available for further operations, but I'm concerned that the extra steps will slow the query down. Are there technical reasons as to why one would be faster than the other? What sort of performance difference may be seen between the two approaches? I realize it is difficult (if not impossible) to give a definitive answer, just looking for some advice for this scenario.
The database engine has an optimizer to figure out the best way to execute a query. There is more under the hood than you probably imagine. For instance, when SQL Server is doing a join, it has a choice of at least four join algorithms:
Nested Loop
Index Lookup (a nested loops join driven by an index seek)
Merge Join
Hash Join
(not to mention the multi-threaded versions of these.)
It is not important that you understand how each of these works. You just need to understand two things: different algorithms are best under different circumstances and SQL Server does its best to choose the best algorithm.
The choice of join algorithm is only one thing the optimizer does. It also has to figure out the ordering of the joins, the best way to aggregate results, whether a sort is needed for an order by, how to access the data (via indexes or directly), and much more.
When you break the query apart, you are making an assumption about optimization. In your case, you are making the assumption that the first best thing is to do a select on a particular table. You might be right. If so, your result with multiple queries should be about as fast as using a single query. Well, maybe not. When in a single query, SQL Server does not have to buffer all the results at once; it can stream results from one place to another. It may also be able to take advantage of parallelism in a way that splitting the query prevents.
In general, the SQL Server optimizer is pretty good, so you are best letting the optimizer do the query all in one go. There are definitely exceptions, where the optimizer may not choose the best execution path. Sometimes fixing this is as easy as being sure that statistics are up-to-date on tables. Other times, you can add optimizer hints. And other times you can restructure the query, as you have done.
For instance, one place where loading data into a local table is useful is when the table comes from a different server. The optimizer may not have full information about the size of the table to make the best decisions.
In other words, keep the query as one statement. If you need to improve it, then focus on optimization after it works. You generally won't have to spend much time on optimization, because the engine is pretty good at it.
This would give the same result?
SELECT t.*, s.*
FROM dbo.TestTable AS t
JOIN dbo.TestRefTable AS s ON t.id = s.id AND s.id = 1123
Basically, this joins the records from TestTable directly to the TestRefTable rows with id = 1123, in a single step.
Joining to table variables will also result in bad cardinality estimates by the optimizer. Table variables are always assumed by the optimizer to contain only a single row, and the more rows they actually have, the worse that estimate becomes. This causes the optimizer to assume the wrong number of rows not only for the table itself, but also in other places: operators that then join to that result can get wrong estimates of the number of executions for that operation.
Personally I think table parameters should be used for getting data into and out of the server conveniently from client apps (C# .NET apps make good use of them), or for passing data between stored procs, but they should not be used too much within the proc itself. The importance of getting rid of them within the proc code increases with the expected number of rows to be carried by the parameter.
Sub-selects will perform better, or immediately copying into a temp table will work well. There is overhead for copying into the temp table, but the more rows you have, the more that overhead is worth it, because the optimizer's estimates get worse and worse.
In general a derived table in the query is probably going to be faster than joining to a table variable, because it can make use of indexes, and those are not available in table variables. However, temp tables can also have indexes created, and that might remove the potential performance difference.
Also, if the number of table variable records is expected to be small, then indexes won't make a great deal of difference anyway, so there would be little or no difference.
As always, you need to test on your own system, as the number of records, table design, and index design have a great deal to do with what works best.
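A short sketch of the temp-table variant, reusing the names from the example above; the primary key gives the optimizer an index and statistics to work with:
CREATE TABLE #subset ([id] INT NOT NULL PRIMARY KEY, [refvalue] VARCHAR(MAX) NULL);

INSERT INTO #subset ([id], [refvalue])
SELECT [id], [refvalue]
FROM [dbo].[TestRefTable]
WHERE [id] = 1123;

SELECT t.*, s.*
FROM [dbo].[TestTable] t
JOIN #subset s ON t.id = s.id;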
I'd expect the direct table join to be faster than joining the table to a table variable, and to use fewer resources.
Given the example queries below (Simplified examples only)
DECLARE @DT int; SET @DT=20110717; -- yes this is an INT
WITH LargeData AS (
SELECT * -- This is a MASSIVE table indexed on dt field
FROM mydata
WHERE dt=@DT
), Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM LargeData
)
SELECT * FROM Ordered
and ...
DECLARE @DT int; SET @DT=20110717;
BEGIN TRY DROP TABLE #LargeData END TRY BEGIN CATCH END CATCH; -- dump any possible table.
SELECT * -- This is a MASSIVE table indexed on dt field
INTO #LargeData -- put smaller results into temp
FROM mydata
WHERE dt=@DT;
WITH Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM #LargeData
)
SELECT * FROM Ordered
Both produce the same results: a limited, ranked list of values based on a field's data.
When these queries get considerably more complicated (many more tables, lots of criteria, multiple levels of "with" table aliases, etc...) the bottom query executes MUCH faster than the top one, sometimes on the order of 20x-100x faster.
The Question is...
Is there some kind of query HINT or other SQL option that would tell SQL Server to perform the same kind of optimization automatically, or another format of this that would involve a cleaner approach (trying to keep the format as much like query 1 as possible)?
Note that the "ranking" secondary query is just fluff for this example; the actual operations performed really don't matter too much.
This is sort of what I was hoping for (or something similar, but I hope the idea is clear). Remember, the query below does not actually work.
DECLARE @DT int; SET @DT=20110717;
WITH LargeData AS (
SELECT * -- This is a MASSIVE table indexed on dt field
FROM mydata
WHERE dt=@DT
OPTION (USE_TEMP_OR_HARDENED_OR_SOMETHING) -- EXAMPLE ONLY
), Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM LargeData
)
SELECT * FROM Ordered
EDIT: Important follow up information!
If in your sub query you add
TOP 999999999 -- improves speed dramatically
Your query will behave in a similar fashion to using a temp table as in the previous query. I found the execution times improved in almost exactly the same fashion, which is far simpler than using a temp table and is basically what I was looking for.
However
TOP 100 PERCENT -- does NOT improve speed
Does NOT perform in the same fashion (you must use a static number, e.g. TOP 999999999).
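Applied to the simplified example above, the change looks like this (a sketch only; as noted, it is the literal TOP that changes the plan shape):
DECLARE @DT int; SET @DT=20110717;
WITH LargeData AS (
    SELECT TOP 999999999 *   -- static TOP, not TOP 100 PERCENT
    FROM mydata
    WHERE dt=@DT
), Ordered AS (
    SELECT TOP 10 *
    , ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
    FROM LargeData
)
SELECT * FROM Ordered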
Explanation:
From what I can tell from the actual execution plans of the query in both formats (the original one with normal CTEs and the one with TOP 999999999 in each sub query):
The normal query joins everything together as if all the tables were in one massive query, which is what is expected. The filtering criteria are applied almost at the join points in the plan, which means many more rows are being evaluated and joined together all at once.
In the version with TOP 999999999, the actual execution plan clearly separates the sub queries from the main query in order to apply the TOP statement's action, thus forcing creation of an in-memory "Bitmap" of the sub query that is then joined to the main query. This appears to do exactly what I wanted, and in fact it may even be more efficient, since servers with large amounts of RAM will be able to do the query execution entirely in memory without any disk IO. In my case we have 280 GB of RAM, so far more than could ever really be used.
Not only can you use indexes on temp tables, but they also allow the use of statistics and hints. I can find no reference to being able to use statistics in the documentation on CTEs, and it says specifically that you can't use hints.
Temp tables are often the most performant way to go when you have a large data set and the choice is between temp tables and table variables, even when you don't use indexes (possibly because they use statistics to develop the plan), and I would suspect the implementation of the CTE is more like the table variable than the temp table.
I think the best thing to do, though, is to see how the execution plans differ, to determine if it is something that can be fixed.
What exactly is your objection to using the temp table when you know it performs better?
The problem is that in the first query the SQL Server query optimizer is able to generate a query plan. In the second query, a good query plan can't be generated because you're inserting the values into a new temporary table. My guess is that there is a full table scan going on somewhere that you're not seeing.
What you may want to do in the second query is insert the values into the #LargeData temporary table like you already do and then create a non-clustered index on the "valuefield" column. This might help to improve your performance.
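For example (the index name is made up):
CREATE NONCLUSTERED INDEX IX_LargeData_valuefield
    ON #LargeData (valuefield DESC);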
It is quite possible that SQL is optimizing for the wrong value of the parameters.
There are a couple of options
Try using OPTION (RECOMPILE). There is a cost to this, as it recompiles the query every time, but if different plans are needed it might be worth it.
You could also try using OPTION (OPTIMIZE FOR @DT = SomeRepresentativeValue). The problem with this is that you might pick the wrong value.
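Applied to the simplified query from the question, the two hints would look roughly like this (the OPTIMIZE FOR value is just a placeholder):
DECLARE @DT int; SET @DT=20110717;
WITH LargeData AS (
    SELECT * FROM mydata WHERE dt=@DT
), Ordered AS (
    SELECT TOP 10 *
    , ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
    FROM LargeData
)
SELECT * FROM Ordered
OPTION (RECOMPILE);
-- or, instead of RECOMPILE:
-- OPTION (OPTIMIZE FOR (@DT = 20110717));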
See I Smell a Parameter! from The SQL Server Query Optimization Team blog
What are the alternatives to using cursors in SQL Server?
I already know a trick that involves using the ROW_NUMBER() function to number the rows so that I can loop over them one by one. Any other ideas?
When I don't want to complicate things with SQL cursors I often populate temporary tables or table variables, then do a while loop to go through them.
For example:
declare @someresults table (
    id int,
    somevalue varchar(10)
)
insert into @someresults
select
    id,
    somevalue
from
    whatevertable
declare @currentid int
declare @currentvalue varchar(10)
while exists(select 1 from @someresults)
begin
    select top 1 @currentid = id, @currentvalue = somevalue from @someresults
    --work with those values here
    delete from @someresults where id = @currentid
end
Several options:
Best is to re-analyze the problem from a mathematical, set-based perspective. If this can be done, it will most likely provide the best solution in both clarity and performance.
Second, use a table variable to store only the keys. Insert the keys into this table variable using a recursive common table expression if possible, or failing that, use a T-SQL programming loop (a WHILE loop or some other constructed iterative loop), and then, once the table variable has all the key values in it, join it to the real tables in the appropriate way to accomplish whatever your real SQL design goal happens to be. Use only the keys as you recursively or iteratively build the table variable, to keep it as narrow as possible during the expensive construction phase. A skeletal example follows this list.
Third, use a temporary table (on disk) in a similar way to the above. This is a better choice when you need the table to contain more than a few columns and/or a very large (> 1M) number of rows, or if you need the temp table to have more than a primary key index.
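A skeletal example of the second option, storing only the keys and then joining back (the table and column names here are made up for illustration):
DECLARE @Keys TABLE (ID int NOT NULL PRIMARY KEY);

INSERT INTO @Keys (ID)
SELECT ID
FROM dbo.SourceTable         -- hypothetical source table
WHERE SomeFilter = 1;        -- hypothetical filter

-- then join the narrow key list back to the real tables
SELECT s.*
FROM dbo.SourceTable s
JOIN @Keys k ON k.ID = s.ID;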
All,
I am seeing some really weird behavior, in terms of performance, between running a query with a variable whose value is set at the beginning and running it with the value as a constant in the query.
What I am seeing is that
DECLARE @ID BIGINT
SET @ID = 5
SELECT * FROM tblEmployee WHERE ID = @ID
runs much faster than when I run
SELECT * FROM tblEmployee WHERE ID = 5
This is obviously a simpler version of the actual query, but does anyone know of issues with the way SQL Server 2005 parses queries that would explain this behavior? My original query goes from 13 seconds to 8 minutes between the two approaches.
Thanks,
Ashish
Are you sure it's that way around?
Normally the parameterised query will be slower because SQL Server doesn't know in advance what the parameter will be. A constant can be optimised right away.
One thing to note here about datatypes, though: what does this do?
SELECT * FROM tblEmployee WHERE ID = CAST(5 as bigint)
Also, reverse the execution order. We saw something odd the other day and the plans changed when we changed order.
Another way: mask the ID to remove "parameter sniffing" effects from the first query. Any difference?
DECLARE @ID BIGINT
SET @ID = 5
DECLARE @MaskedID BIGINT
SET @MaskedID = @ID
SELECT * FROM tblEmployee WHERE ID = @MaskedID
Finally, add OPTION (RECOMPILE) to each query. It means the plan is discarded and not re-used so it compiles differently.
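For example (a sketch):
SELECT * FROM tblEmployee WHERE ID = @ID OPTION (RECOMPILE)
SELECT * FROM tblEmployee WHERE ID = 5 OPTION (RECOMPILE)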
Have you checked the query plans for each? That's always the first thing I do when I'm trying to analyze a performance issue.
If values get cached, you could be drawing an unwarranted conclusion that one approach is faster than another. Is there always this difference?
From what I understand it's to do with cached query plans.
When you run SELECT * FROM A WHERE B = @C, it's one query plan regardless of the value of @C. So if you run it 10 times with different values for @C, it's a single query plan.
When you run:
Select * from A Where B = 1 it creates a query plan
Select * from A Where B = 2 creates another
Select * from A Where B = 3 creates another
etc.
All this does is eat up memory.
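For example, explicitly parameterising through sp_executesql keeps it at one cached plan no matter which value you pass (a sketch, using the A/B/@C names from above):
DECLARE @C int
SET @C = 1
EXEC sp_executesql
    N'SELECT * FROM A WHERE B = @C',  -- same plan reused for every value of @C
    N'@C int',
    @C = @C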
Google query plan caching and literals, and I'm sure you'll turn up detailed explanations.