SQL query performance/SELECT TOP X behavior - sql

I was wondering why this query performs so slowly. If anyone could walk me through how it's processed, that would be great. The DB being queried has over 500 million rows. Is this query really so poorly written that a TOP 10 takes so long it may as well never finish? How might I improve the query, assuming I still want to query data by month and year?
SELECT TOP 10 *
FROM ADB.dbo.Stuff tt
WHERE MONTH(tt.SomeDate) = 5
AND YEAR(tt.SomeDate) = 2011
Does SELECT TOP 10 not just stop after 10 results have been found? Or does it take so long because it hasn't yet found rows matching my conditions while going through the 500M+ rows?
Thanks and sorry for such a simple question.

It has to scan the entire table because MONTH(column) and YEAR(column) are not sargable, and you haven't told SQL Server which rows you mean by TOP. While it's true that SQL Server may be able to short-circuit once it's found your 10 rows, it may be so far into the scan when that happens that the difference to you is minimal. This is especially true if zero rows, or fewer than 10 rows, match your WHERE clause.
A much better WHERE clause would be:
WHERE SomeDate >= '20110501' AND SomeDate < '20110601';
If you don't want to construct the strings, you can pass those in as parameters / variables and do this:
DECLARE @year INT;
DECLARE @month INT;
SET @year = 2011;
SET @month = 5;
...
WHERE SomeDate >= DATEADD(MONTH, @month-1, DATEADD(YEAR, @year-1900, '19000101'))
AND SomeDate < DATEADD(MONTH, @month, DATEADD(YEAR, @year-1900, '19000101'));
In either case, if there is an index on SomeDate, it can be used and a table scan can be avoided. You want to avoid a table scan on a table with 500 million rows, even if you're only looking for 10 rows, and even if short-circuiting might happen.
Even without a table scan, however, this query is still going to be inefficient. Do you really need all of the columns? Unless the index on SomeDate covers every column you select, the seek will still have to do a lookup into the clustered index (or heap) to retrieve the rest of them. If you don't need those columns, don't include them.
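For example, if the query only needed two columns, a covering index could satisfy the whole thing with no lookups at all. A minimal sketch, assuming hypothetical columns ColA and ColB are the ones you actually need:
-- Hypothetical covering index: seeks on the SomeDate range and
-- already carries the two columns the query selects.
CREATE NONCLUSTERED INDEX IX_Stuff_SomeDate
ON dbo.Stuff (SomeDate)
INCLUDE (ColA, ColB);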
And as bluefeet pointed out, TOP 10 makes no sense if you haven't told SQL Server which 10 you mean, and you do that using ORDER BY. If the ORDER BY can use a suitable index, you may even avoid the costly sort operator you might think you're avoiding by leaving ORDER BY out.
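Putting the pieces together, the rewritten query might look something like this (a sketch; ColA and ColB are placeholder column names, and the ORDER BY column is whichever one defines "top" for you):
SELECT TOP (10) tt.ColA, tt.ColB, tt.SomeDate  -- only the columns you need
FROM dbo.Stuff AS tt
WHERE tt.SomeDate >= '20110501'
AND tt.SomeDate < '20110601'
ORDER BY tt.SomeDate;  -- tells SQL Server which 10 rows you mean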

Why is a SELECT slower when you use variables in the WHERE clause?

I have the following query that takes approximately 2 minutes to return an output.
DECLARE @today DATETIME
SET @today = GETDATE()
SELECT con.*
FROM GAS_consumptions con
WHERE con.createdDate >= @today
I noticed that if I use this, it takes less than a second
SELECT con.*
FROM GAS_consumptions con
WHERE con.createdDate >= '2015-06-22'
Why does that happen? I need the first query since the date can vary from day to day.
When you use a literal value in a query SQL Server will generate a query plan that is optimised for that specific value.
For example, if you have a non-clustered index on createdDate, the query optimiser can get a good estimate of the number of rows with createdDate >= '2015-06-22'. If this is a small proportion of the rows in the table, the optimal plan will find the matching records in the createdDate index, then look up the rest of the selected columns for the matching rows from the table.
When you use a variable in the WHERE clause, SQL Server generates a single query plan that will be used for all possible values of @today. The plan is intended to be optimal for an arbitrary value of @today, but a single plan can only be optimal for a certain number of rows being selected.
Say the estimate for the average number of rows selected is half the number of rows in the table. In that case it is estimated to be more efficient to scan the entire table to filter the records, rather than filtering the createdDate index and then having to do a large number of lookups on the table to get the rest of the selected columns.
The issue is that SQL Server is using a single query plan for queries that can have radically different row counts. The reason a single query plan is used for all values of @today is that it is often more expensive to compile an optimal plan than to just run a sub-optimal one. In your example this obviously isn't the case.
There are ways to change this behaviour:
1) You can get SQL Server to generate a single plan, but optimise it for a predetermined value of @today:
DECLARE @today DATETIME
SET @today = GETDATE()
SELECT con.*
FROM GAS_consumptions con
WHERE con.createdDate >= @today
OPTION (OPTIMIZE FOR (@today = '2015-06-22'))
This should produce the same plan as using the con.createdDate >= '2015-06-22' predicate, and could be a nice solution if your application is always going to be querying for records after a predetermined date.
2) Specifying the RECOMPILE option will cause a new query plan to be generated every time the query is run, which allows a plan optimised for the specific @today value.
DECLARE @today DATETIME
SET @today = GETDATE()
SELECT con.*
FROM GAS_consumptions con
WHERE con.createdDate >= @today
OPTION (RECOMPILE)
This should produce the same plan as using the con.createdDate >= '2015-06-22' predicate.
WARNING: RECOMPILING QUERIES CAN ALSO BE SLOW. USING THE RECOMPILE OPTION CAN DEGRADE PERFORMANCE. USE AS A LAST RESORT. ETC.
This is just a workaround. Apply this command:
UPDATE STATISTICS GAS_consumptions WITH FULLSCAN, NORECOMPUTE
Run some queries for a few minutes.
Measure again.
Keep in mind that NORECOMPUTE means you will need to run this command on a regular basis, because it stops SQL Server from automatically updating statistics on the table in the future.
In a similar case I had, this was "the solution". I don't know why it happens.
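If you do use NORECOMPUTE, note that you can undo it later. A sketch of re-enabling automatic statistics updates on the table:
-- Turns automatic statistics updates back on for the table
-- (alternatively, re-run UPDATE STATISTICS without NORECOMPUTE).
EXEC sp_autostats 'GAS_consumptions', 'ON';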

Efficient SQL view with automatic date range based on Getdate

So I am trying to create a view that automatically pulls data from the last twelve months (starting from the end of the last month.)
When I run this with a where clause like the following :
WHERE Visit_Date between 'Dec 1 2012' and 'Dec 1 2013'
It will run in about a minute.
I have some calculations that will automatically create those dates. But when I use them in the where clause the query is still running after 15 minutes.
WHERE Visit_Date between DATEADD(mm,-12,DATEADD(mm,DATEDIFF(mm,12,GETDATE()),0))
and Dateadd(dd,-1,DATEADD(mm,DATEDIFF(mm,12,GETDATE()),0))
The query is running on a table with 50+ million records. I am sure there is a more efficient way to do this. I am guessing it's running the GETDATE() calculations for every row, which is obviously not ideal.
Any suggestions?
Please keep in mind that I am creating a view, and I don't usually write stored procedures or dynamic SQL.
I think @Andriy is probably right (I've also blogged about it) that this is due to a cardinality estimation bug (the dates are reversed when making estimates). You can see more details in KB #2481274, Connect #630583 and the question @Andriy pointed out:
Query runs slow with date expression, but fast with string literal
For something you only change once a month, I think you could consider creating a job that alters the view at the beginning of every month, hard-coding the date range into the view. This might not be a terrible alternative to enabling trace flag 4199, which may or may not be a permanent fix, and may or may not cause other issues if you turn it on globally (as opposed to just for the session that runs this query - again no guarantee it will always make this fast). Here is the kind of process I am thinking about:
CREATE PROCEDURE dbo.AlterThatView
AS
BEGIN
SET NOCOUNT ON;
DECLARE @end DATE = DATEADD(DAY, 1-DAY(GETDATE()), GETDATE());
DECLARE @start DATE = DATEADD(MONTH, -12, @end);
DECLARE @sql NVARCHAR(MAX) = N'ALTER VIEW dbo.ViewName
AS
SELECT ...
WHERE Visit_Date >= ''' + CONVERT(CHAR(8), @start, 112) + '''
AND Visit_Date < ''' + CONVERT(CHAR(8), @end, 112) + ''';';
EXEC sp_executesql @sql;
END
GO
Just create a job that runs, say, a minute after midnight on every 1st of the month, and calls this procedure. You may want to have the job run a SELECT from the view once too.
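If you haven't set up an Agent job in T-SQL before, the sketch below shows the general shape using the msdb job procedures (the job, schedule and database names are placeholders; you can do the same thing through the SSMS Agent UI instead):
-- Sketch: run dbo.AlterThatView at 00:01 on the 1st of every month.
EXEC msdb.dbo.sp_add_job
    @job_name = N'Refresh ViewName date range';
EXEC msdb.dbo.sp_add_jobstep
    @job_name = N'Refresh ViewName date range',
    @step_name = N'Alter the view',
    @subsystem = N'TSQL',
    @database_name = N'YourDatabase',  -- placeholder
    @command = N'EXEC dbo.AlterThatView;';
EXEC msdb.dbo.sp_add_schedule
    @schedule_name = N'Monthly, 1st at 00:01',
    @freq_type = 16,               -- monthly
    @freq_interval = 1,            -- on day 1
    @freq_recurrence_factor = 1,   -- every month
    @active_start_time = 100;      -- 00:01:00 (HHMMSS)
EXEC msdb.dbo.sp_attach_schedule
    @job_name = N'Refresh ViewName date range',
    @schedule_name = N'Monthly, 1st at 00:01';
EXEC msdb.dbo.sp_add_jobserver
    @job_name = N'Refresh ViewName date range';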
Please don't use BETWEEN for date range queries and stop using lazy shorthand for dateparts.
Since you can return rows in < 1 minute from a table with 50M+ rows, I'm guessing you have an index on the Visit_Date column. In your first query, the query plan generator does a seek on the index because it has a rough idea of how many rows will be returned: it knows the date boundaries. It then determines that an index seek is the best plan of action.
In your second query, it doesn't know, or doesn't know as accurately, how many rows might be returned, so it is probably deciding to do an index or table scan instead of a seek.
One option you could consider is using an Index Hint on the query. If this is not production code and just, perhaps, an ad-hoc query you perform on occasion, an index hint is safe. The problem is that if the index gets dropped, or the name changes, the query will fail. So keep that in mind.
Something else to keep in mind is that if you provide an index hint, SQL Server will use that index. If the span of time between your start and end date is such that a larger percentage of your table gets returned, a seek may not be as efficient as a scan (which is why a scan is sometimes selected by SQL Server).
Your best friend here is analyzing the estimated query plan that is generated. You can get this in SSMS. I would experiment with a few approaches until you can get an index seek (not a scan) being performed on your query.
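For reference, a table-level index hint looks something like this. This is only a sketch: dbo.Visits and IX_Visit_Date are hypothetical table and index names standing in for whatever the view actually reads.
-- Forces the optimizer to use a specific index (hypothetical names).
SELECT v.*
FROM dbo.Visits AS v WITH (INDEX (IX_Visit_Date))
WHERE v.Visit_Date >= DATEADD(mm, DATEDIFF(mm, 0, GETDATE()) - 12, 0)  -- first of the month, 12 months back
AND v.Visit_Date < DATEADD(mm, DATEDIFF(mm, 0, GETDATE()), 0);         -- first of the current month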

MSSQL Speed Up Query 365 Millon Rows

I have roughly 365 million rows in my table, and each day we add an additional million rows. After the data is a year old, it gets moved to a different table that archives our data.
I have a PK clustered index on DataCollectionID.
I have one other index: a unique nonclustered index on AssetID, DataPointID, and DatapointDate.
I need to run multiple SELECT queries against the table pretty quickly... here is my SELECT query:
SELECT [DataPointID]
,[SourceTag]
,[DatapointDate]
,[DataPointValue]
FROM DataCollection
Where
DatapointDate >= '2012-09-07' AND
DatapointDate < '2012-09-08' AND
DataPointID = 1100
ORDER BY DatapointDate
This query should return 8,640 rows, which it does, but it takes 00:00:08 (8 seconds) to execute. Even if I only ask for the TOP 10 it still takes 8 seconds. Can someone please help me speed this up?
I think a more effective index to help this query would be on DataPointID, DataPointDate, in that order. This will allow the optimizer to quickly narrow down the field with an equality operator on the first index column, then find the date range within that set.
There are some good examples of indexes and similar queries here:
http://sqlserverpedia.com/wiki/Index_Selectivity_and_Column_Order
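A sketch of that index for the table in the question (the index name is arbitrary; the INCLUDE list covers the remaining selected columns so no key lookups are needed):
-- Equality column first, then the range column, then the covered columns.
CREATE NONCLUSTERED INDEX IX_DataCollection_DataPointID_Date
ON dbo.DataCollection (DataPointID, DatapointDate)
INCLUDE (SourceTag, DataPointValue);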
If this is dynamic SQL, you should put it into a stored procedure, and remember to use SET NOCOUNT ON.
Otherwise it sounds like a hardware problem: in which case more memory might help.
You need a better covering index, something like:
create index _idx ON DataCollection ( DataPointDate, DataPointId )
include ( SourceTag, DataPointValue )
You generally want the most selective (i.e. most unique) column at the front of the index, so this may be DatapointDate or DataPointID depending on your data.
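If you want a rough feel for which column that is, comparing distinct counts is a quick sketch (a higher distinct count relative to total rows usually means more selective):
-- Rough selectivity check on the two candidate leading columns.
SELECT COUNT(*) AS total_rows,
COUNT(DISTINCT DataPointID) AS distinct_ids,
COUNT(DISTINCT DatapointDate) AS distinct_dates
FROM dbo.DataCollection;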

Hard-Coding date string much faster than DateTime in SELECT?

I have a very large table (15 million rows; it is an audit table).
I need to run a query that checks for occurrences in the audit table that are after a certain date and meet certain criteria (I am looking for audit records that took place on the current day only).
When I run:
SELECT Field1, Field2 FROM AUDIT_TABLE WHERE AUDIT_DATE >= '8/9/12'
The results come back fairly quick (a few seconds, not bad for 15M rows)
When I run:
SELECT Field1, Field2 FROM AUDIT_TABLE WHERE AUDIT_DATE >= @DateTime
It takes 11-15 seconds and does a full table scan.
The actual field I am querying against is a DATETIME type, and the index is also on that field.
Sounds like you are stuck with a bad plan, probably because someone used a parameter at some point that selected enough of the table that a table scan was the most efficient way for that parameter value. Try running the query once this way:
SELECT ... FROM AUDIT_TABLE WHERE AUDIT_DATE >= @DateTime OPTION (RECOMPILE);
And then change your code this way:
SELECT ... FROM dbo.AUDIT_TABLE WHERE AUDIT_DATE >= @DateTime;
Using the dbo. prefix will at the very least prevent different users with different schemas from polluting the plan cache with different versions of the plan. It will also disassociate future queries from the bad plan that is stored.
If you are going to vary between selecting recent rows (small %) and a lot of rows, I would probably just leave the OPTION (RECOMPILE) on there. Paying the minor CPU penalty in recompilation every time is going to be cheaper than getting stuck with a bad plan for most of your queries.
Another trick I've seen used to bypass parameter sniffing:
ALTER PROCEDURE dbo.whatever
@DateTime DATETIME
AS
BEGIN
SET NOCOUNT ON;
DECLARE @dt DATETIME;
SET @dt = @DateTime;
SELECT ... WHERE AUDIT_DATE >= @dt;
END
GO
It's kind of a dirty and unintuitive trick, but because the optimizer can't sniff the value of a local variable, it builds a plan for a "typical" value based on the column's statistics instead of whatever value happened to be passed in first.

SQL "WITH" Performance and Temp Table (possible "Query Hint" to simplify)

Given the example queries below (Simplified examples only)
DECLARE @DT int; SET @DT=20110717; -- yes this is an INT
WITH LargeData AS (
SELECT * -- This is a MASSIVE table indexed on dt field
FROM mydata
WHERE dt=@DT
), Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM LargeData
)
SELECT * FROM Ordered
and ...
DECLARE @DT int; SET @DT=20110717;
BEGIN TRY DROP TABLE #LargeData END TRY BEGIN CATCH END CATCH; -- dump any possible table.
SELECT * -- This is a MASSIVE table indexed on dt field
INTO #LargeData -- put smaller results into temp
FROM mydata
WHERE dt=@DT;
WITH Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM #LargeData
)
SELECT * FROM Ordered
Both produce the same results: a limited and ranked list of values based on a field's data.
When these queries get considerably more complicated (many more tables, lots of criteria, multiple levels of WITH table aliases, etc.) the bottom query executes MUCH faster than the top one, sometimes on the order of 20x-100x faster.
The Question is...
Is there some kind of query HINT or other SQL option that would tell SQL Server to perform the same kind of optimization automatically, or another format that would involve a cleaner approach (trying to keep the format as much like query 1 as possible)?
Note that the "ranking" and secondary queries are just fluff for this example; the actual operations performed really don't matter too much.
This is sort of what I was hoping for (or something similar, but I hope the idea is clear). Remember, the query below does not actually work.
DECLARE @DT int; SET @DT=20110717;
WITH LargeData AS (
SELECT * -- This is a MASSIVE table indexed on dt field
FROM mydata
WHERE dt=@DT
**OPTION (USE_TEMP_OR_HARDENED_OR_SOMETHING) -- EXAMPLE ONLY**
), Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM LargeData
)
SELECT * FROM Ordered
EDIT: Important follow up information!
If in your sub query you add
TOP 999999999 -- improves speed dramatically
Your query will behave in a similar fashion to using a temp table in the previous query. I found the execution times improved in almost exactly the same way, WHICH IS FAR SIMPLER than using a temp table and is basically what I was looking for.
However
TOP 100 PERCENT -- does NOT improve speed
Does NOT perform the same way (you must use the static-number style TOP 999999999).
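For concreteness, this is the shape of the first query with that workaround applied (same tables and columns as the original example):
DECLARE @DT int; SET @DT=20110717;
WITH LargeData AS (
SELECT TOP 999999999 *  -- static-number TOP added to the sub query
FROM mydata
WHERE dt=@DT
), Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM LargeData
)
SELECT * FROM Ordered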
Explanation:
From what I can tell from the actual execution plans of the query in both formats (the original one with normal CTEs, and the one where each sub query has TOP 999999999):
The normal query joins everything together as if all the tables are in one massive query, which is what is expected. The filtering criteria are applied almost at the join points in the plan, which means many more rows are being evaluated and joined together all at once.
In the version with TOP 999999999, the actual execution plan clearly separates the sub queries from the main query in order to apply the TOP statement's action, thus forcing creation of an in-memory "Bitmap" of the sub query that is then joined to the main query. This appears to do exactly what I wanted, and in fact it may even be more efficient, since servers with large amounts of RAM will be able to do the query execution entirely in memory without any disk IO. In my case we have 280 GB of RAM, so far more than could ever really be used.
Not only can you use indexes on temp tables, but temp tables also allow the use of statistics and hints. I can find no reference to being able to use statistics in the documentation on CTEs, and it specifically says you can't use hints.
Temp tables are often the more performant choice for a large data set when the choice is between temp tables and table variables, even when you don't use indexes (possibly because temp tables have statistics to help develop the plan), and I suspect the implementation of a CTE is more like the table variable than the temp table.
I think the best thing to do, though, is see how the execution plans differ, to determine whether it is something that can be fixed.
What exactly is your objection to using the temp table when you know it performs better?
The problem is that in the first query the SQL Server query optimizer is able to generate a plan for the whole statement. In the second query it can't generate a good plan up front because you're inserting the values into a new temporary table. My guess is there is a full table scan going on somewhere that you're not seeing.
What you may want to do in the second query is insert the values into the #LargeData temporary table like you already do, and then create a nonclustered index on the valuefield column. This might help to improve your performance.
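Something along these lines (a sketch; the index name is arbitrary):
-- Index the temp table on the ranking column before the final SELECT.
CREATE NONCLUSTERED INDEX IX_LargeData_valuefield
ON #LargeData (valuefield DESC);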
It is quite possible that SQL is optimizing for the wrong value of the parameters.
There are a couple of options
1) Try using OPTION (RECOMPILE). There is a cost to this, as it recompiles the query every time, but if different plans are needed it might be worth it.
2) You could also try using OPTION (OPTIMIZE FOR (@DT = SomeRepresentativeValue)). The problem with this is that you might pick the wrong value. Both options are sketched below the link.
See I Smell a Parameter! from The SQL Server Query Optimization Team blog
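Applied to the example query, the two options would look something like this (a sketch; 20110717 is just a stand-in for whatever representative value you pick):
DECLARE @DT int; SET @DT=20110717;
WITH LargeData AS (
SELECT *
FROM mydata
WHERE dt=@DT
), Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM LargeData
)
SELECT * FROM Ordered
OPTION (RECOMPILE);  -- or: OPTION (OPTIMIZE FOR (@DT = 20110717));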