I have a very large table (15 million rows; it is an audit table).
I need to run a query that checks for occurrences in the audit table that are after a certain date and meet certain criteria (I am looking for audit records that took place on the current day only).
When I run:
SELECT Field1, Field2 FROM AUDIT_TABLE WHERE AUDIT_DATE >= '8/9/12'
The results come back fairly quickly (a few seconds, which is not bad for 15M rows).
When I run:
SELECT Field1, Field2 FROM AUDIT_TABLE WHERE AUDIT_DATE >= @DateTime
It takes 11-15 seconds and does a full table scan.
The actual field I am querying against is a DATETIME type, and the index is also on that field.
Sounds like you are stuck with a bad plan, probably because at some point someone used a parameter value that selected enough of the table that a table scan was the most efficient approach for that value. Try running the query once this way:
SELECT ... FROM AUDIT_TABLE WHERE AUDIT_DATE >= @DateTime OPTION (RECOMPILE);
And then change your code this way:
SELECT ... FROM dbo.AUDIT_TABLE WHERE AUDIT_DATE >= @DateTime;
Using the dbo. prefix will, at the very least, prevent different users with different default schemas from polluting the plan cache with different versions of the plan. It will also disassociate future queries from the bad plan that is currently cached.
If you are going to vary between selecting recent rows (a small percentage) and a lot of rows, I would probably just leave the OPTION (RECOMPILE) on there. Paying the minor CPU penalty of a recompile every time is going to be cheaper than getting stuck with a bad plan for most of your queries.
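Putting the two suggestions together, the query would look something like this (a sketch using the column names from your question; Field1/Field2 stand in for whatever you actually select):

SELECT Field1, Field2
FROM dbo.AUDIT_TABLE
WHERE AUDIT_DATE >= @DateTime
OPTION (RECOMPILE);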
Another trick I've seen used to bypass parameter sniffing:
ALTER PROCEDURE dbo.whatever
    @DateTime DATETIME
AS
BEGIN
    SET NOCOUNT ON;

    -- Copy the parameter into a local variable before using it in the query
    DECLARE @dt DATETIME;
    SET @dt = @DateTime;

    SELECT ... WHERE AUDIT_DATE >= @dt;
END
GO
It's kind of a dirty and unintuitive trick, but because the optimizer cannot sniff the value of a local variable, it builds the plan for a "typical" value rather than being skewed by whichever parameter value happened to be passed in first.
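If you are on SQL Server 2008 or later, a query hint gives the same "optimize for a typical value" behaviour without the intermediate variable (a sketch, not part of the original trick):

SELECT ... FROM dbo.AUDIT_TABLE
WHERE AUDIT_DATE >= @DateTime
OPTION (OPTIMIZE FOR UNKNOWN);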
Related
I have the following query that takes approximately 2 minutes to return an output.
DECLARE @today DATETIME
SET @today = GETDATE()
SELECT con.*
FROM GAS_consumptions con
WHERE con.createdDate >= @today
I noticed that if I use this, it takes less than a second
SELECT con.*
FROM GAS_consumptions con
WHERE con.createdDate >= '2015-06-22'
Why does that happen? I need the first query since the date can vary from day to day.
When you use a literal value in a query, SQL Server will generate a query plan that is optimised for that specific value.
For example, if you have a non-clustered index on createdDate, the query optimiser can get a good estimate of the number of rows with createdDate >= '2015-06-22'. If this is a small proportion of the rows in the table, the optimal plan will find the matching records in the createdDate index, then look up the rest of the selected columns for the matching rows from the table.
When you use a variable in the WHERE clause, SQL Server generates a single query plan that will be used for all possible values of @today. The plan generated is intended to be optimal for an arbitrary value of @today, but a single plan can only be optimal for a certain number of rows being selected.
Say the estimate for the average number of rows selected is half the number of rows in the table. In that case it is estimated to be more efficient to scan the entire table and filter the records, rather than filtering the createdDate index and then having to do a large number of lookups on the table to get the rest of the selected columns.
The issue is that SQL Server is using a single query plan for queries that can have radically different row counts. The reason a single plan is used for all values of @today is that it is often more expensive to compile an optimal plan each time than to just run a sub-optimal one. In your example this obviously isn't the case.
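One way to see the difference for yourself (a sketch, assuming you can run both forms in SSMS): compare the estimated versus actual row counts in the execution plan, or the logical reads reported by STATISTICS IO, for the literal and the variable versions.

SET STATISTICS IO ON;

-- Literal: the optimizer estimates from the histogram for this specific value
SELECT con.* FROM GAS_consumptions con WHERE con.createdDate >= '2015-06-22';

-- Variable: the optimizer uses a generic estimate for an unknown value
DECLARE @today DATETIME = GETDATE();
SELECT con.* FROM GAS_consumptions con WHERE con.createdDate >= @today;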
There are ways to change this behaviour:
1) You can get SQL Server to generate a single plan but optimise it for a pre-determined value of @today:
DECLARE @today DATETIME
SET @today = GETDATE()
SELECT con.*
FROM GAS_consumptions con
WHERE con.createdDate >= @today
OPTION (OPTIMIZE FOR (@today = '2015-06-22'))
This should produce the same plan as using the con.createdDate >= '2015-06-22' predicate, and could provide a nice solution if your application is always going to be querying for records after a predetermined date.
2) Specifying the RECOMPILE option will cause a new query plan to be generated every time the query is run, which allows a plan optimised for the specific value of @today.
DECLARE @today DATETIME
SET @today = GETDATE()
SELECT con.*
FROM GAS_consumptions con
WHERE con.createdDate >= @today
OPTION (RECOMPILE)
This should produce the same query as using the con.createdDate >= '2015-06-22' predicate.
WARNING: RECOMPILING QUERIES CAN ALSO BE SLOW. USING THE RECOMPILE OPTION CAN DEGRADE PERFORMANCE. USE AS A LAST RESORT. ETC.
This is just a workaround. Apply this command:
UPDATE STATISTICS GAS_consumptions WITH FULLSCAN, NORECOMPUTE
Run some queries for a few minutes.
Measure again.
Keep in mind that NORECOMPUTE will require you to run this command on a regular basis, because it stops SQL Server from automatically updating statistics on the table in the future.
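If you would rather not stop SQL Server from maintaining the statistics automatically, the same refresh without that side effect would simply be (a sketch, dropping the NORECOMPUTE option):

UPDATE STATISTICS GAS_consumptions WITH FULLSCAN;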
In a similar case I had, this was "the solution". I don't know why it happens.
I've got a stored procedure that searches for rows in a table based on a given text. TableWithText has a SomeText column of type nvarchar and also a CreateDate DateTime column populated with the date and time that the row was created.
The body of the stored procedure is this:
SELECT TableWithTextID, SomeOtherColumn
FROM TableWithText
WHERE SomeText = @inputText
The value of SomeText for each row is guaranteed to be unique, although no such constraint is imposed. Therefore this statement is expected to return only one row.
However the table has some 500,000 rows. Given that I know when the row I'm looking for was entered (down to the minute), if I add
AND CreateDate >= @CreateDate
to the stored procedure, will the MS SQL query optimizer reduce the set of candidate rows to those created after @CreateDate before it searches for the input text?
The best thing to do is to review the execution plan and see what the optimizer is telling you. You might think there is a problem just by looking at the query and the number of rows, but the actual cost might be quite low.
If you already have an index on CreateDate, then add this to the where clause and it should take advantage of that.
Otherwise, you would be better off indexing the SomeText field if this is something that is run a lot and you are noticing full table scans when executing it. I'm guessing it's used in other queries too, given that it's a unique value?
Yes, potentially, but only if you add an index on the CreateDate column (or an index on SomeText, CreateDate)
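For example, a composite index like the one mentioned might look like this (a sketch; the index name is made up, and the INCLUDE just covers the other column the procedure selects):

CREATE NONCLUSTERED INDEX IX_TableWithText_SomeText_CreateDate
ON dbo.TableWithText (SomeText, CreateDate)
INCLUDE (SomeOtherColumn);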
So I am trying to create a view that automatically pulls data from the last twelve months (starting from the end of the last month.)
When I run this with a where clause like the following :
WHERE Visit_Date between 'Dec 1 2012' and 'Dec 1 2013'
It will run in a ~1min.
I have some calculations that will automatically create those dates. But when I use them in the where clause the query is still running after 15 minutes.
WHERE Visit_Date between DATEADD(mm,-12,DATEADD(mm,DATEDIFF(mm,12,GETDATE()),0))
and Dateadd(dd,-1,DATEADD(mm,DATEDIFF(mm,12,GETDATE()),0))
The query is running on a table with 50+ million records. I am sure there is a more efficient way to do this. I am guessing what is happening is that it's running through the GETDATE() calculations for every row, which is obviously not ideal.
Any suggestions?
Please keep in mind that I am creating a view, and I don't usually write stored procedures or dynamic SQL.
I think @Andriy is probably right (I've also blogged about it) that this is due to a cardinality estimation bug (the dates are reversed when making estimates). You can see more details in KB #2481274, Connect #630583 and the question @Andriy pointed out:
Query runs slow with date expression, but fast with string literal
For something you only change once a month, I think you could consider creating a job that alters the view at the beginning of every month, hard-coding the date range into the view. This might not be a terrible alternative to enabling trace flag 4199, which may or may not be a permanent fix, and which may or may not cause other issues if you turn it on globally (as opposed to just for the session that runs this query; again, there is no guarantee it will always make this fast). Here is the kind of process I am thinking about:
CREATE PROCEDURE dbo.AlterThatView
AS
BEGIN
SET NOCOUNT ON;
-- First day of the current month
DECLARE @end DATE = DATEADD(DAY, 1 - DAY(GETDATE()), GETDATE());
-- First day of the month 12 months earlier
DECLARE @start DATE = DATEADD(MONTH, -12, @end);
DECLARE @sql NVARCHAR(MAX) = N'ALTER VIEW dbo.ViewName
AS
SELECT ...
WHERE Visit_Date >= ''' + CONVERT(CHAR(8), @start, 112) + '''
AND Visit_Date < ''' + CONVERT(CHAR(8), @end, 112) + ''';';
EXEC sp_executesql @sql;
END
GO
Just create a job that runs, say, a minute after midnight on the 1st of every month and calls this procedure. You may want to have the job run a SELECT from the view once, too.
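If it helps, the monthly job itself can be created with the msdb stored procedures (a sketch; the job and database names are placeholders, and you could just as easily click this together in SSMS):

EXEC msdb.dbo.sp_add_job @job_name = N'Refresh ViewName date range';
EXEC msdb.dbo.sp_add_jobstep @job_name = N'Refresh ViewName date range',
    @step_name = N'Alter view', @subsystem = N'TSQL',
    @database_name = N'YourDatabase', @command = N'EXEC dbo.AlterThatView;';
-- freq_type 16 = monthly, on day 1, every 1 month, starting at 00:01
EXEC msdb.dbo.sp_add_jobschedule @job_name = N'Refresh ViewName date range',
    @name = N'Monthly', @freq_type = 16, @freq_interval = 1,
    @freq_recurrence_factor = 1, @active_start_time = 000100;
EXEC msdb.dbo.sp_add_jobserver @job_name = N'Refresh ViewName date range';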
Please don't use BETWEEN for date range queries and stop using lazy shorthand for dateparts.
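For example, the same rolling window written as an open-ended range with full datepart names (a sketch of the questioner's calculation rewritten; the assumption is that the goal is the 12 full months ending with last month):

WHERE Visit_Date >= DATEADD(MONTH, -12, DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()), 0))
  AND Visit_Date <  DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()), 0)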
Since you can return rows in under a minute from a table with 50M+ rows, I'm guessing you have an index on the Visit_Date column. In your first query, the query optimizer does a seek on that index because, knowing the date boundaries, it has a good idea of how many rows will be returned, and it determines that an index seek is the best plan of action.
In your second query, it doesn't know, or doesn't know as accurately, how many rows might be returned, so it probably decides to do an index or table scan instead of a seek.
One option you could consider is using an Index Hint on the query. If this is not production code and just, perhaps, an ad-hoc query you perform on occasion, an index hint is safe. The problem is that if the index gets dropped, or the name changes, the query will fail. So keep that in mind.
Something else to keep in mind is that if you provide an index hint, SQL Server will use that index. If the span of time between your start and end dates is such that a large percentage of the table gets returned, a seek may not be as efficient as a scan (which is why SQL Server sometimes chooses a scan).
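For reference, a hint looks like this (a sketch; the table and index names here are hypothetical, since the question doesn't give them, and the query breaks if that index is ever dropped or renamed):

SELECT ...
FROM dbo.Visits WITH (INDEX (IX_Visits_Visit_Date))
WHERE Visit_Date >= '20121201' AND Visit_Date < '20131201';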
Your best friend here is analyzing the estimated query plan that is generated. You can get this in SSMS. I would experiment with a few approaches until you can get an index seek (not a scan) being performed on your query.
I am using SQL Server 2008-R2, but I'd be interested in a more general answer too ...
I have a table with hundreds of millions of rows, each with a "DateModified" field (datetime2(7))
Now I very frequently poll that table with a query like
select * from table where DateModified > @P1
Here the parameter is always recent (like within the last few minutes) and there are probably only a few records matching that value.
It occurs to me that I am maintaining a big index on the whole table, when I will never use the index for many of those values ... so this sounds like a perfect use of a filtered index, where I only index the rows that I would possibly be querying against...
But in this case what could the filter look like? My only idea was to Filter on
where DateModified > [yesterday]
where [yesterday] is a date literal, but then I'd have to re-define the filter periodically or the advantage of the filter would diminish over time.
On a whim I tried DateModified > DATEADD(d,-1,GETDATE()) but that gave a nondescript error ... I wasn't sure how that would be possible.
Is there some other way to accomplish this?
Then finally, if there is a way to do this, should I expect the stats to be wildly wrong in my situation, and would that affect my query performance?
My concern about the stats comes from this article.
I'm trying to propagate changes from one system to another with some disconnected data ... if you'd like to suggest a completely alternate approach to polling "DateModified", I'd be happy to consider it.
Had a similar requirement a while back and found that functions aren't allowed in the filter.
What you can do is script out the index and schedule it to be rebuilt in a job during off-peak (maybe nightly) hours. This will also take care of the stats issue, because the statistics are recreated every time the index is created.
Here's an example of what we wound up doing:
CREATE TABLE FilteredTest (
TestDate datetime
);
Then just run this on a schedule to recreate the index with only the newest rows:
DECLARE @sql varchar(8000) = '
IF EXISTS (SELECT 1 FROM sys.indexes WHERE name = ''IX_FilteredTest_TestDate'')
DROP INDEX IX_FilteredTest_TestDate ON FilteredTest;
CREATE NONCLUSTERED INDEX IX_FilteredTest_TestDate ON FilteredTest (
TestDate
)
WHERE TestDate > ''' + CONVERT(varchar(25), DATEADD(d,-1,GETDATE()), 121) + ''';';
EXEC (@sql);
I can't figure out why this query would be so slow with variables versus without them. I read somewhere that I need to enable "Dynamic Parameters" but I cannot find where to do this.
DECLARE
@BeginDate AS DATETIME
,@EndDate AS DATETIME
SELECT
@BeginDate = '2010-05-20'
,@EndDate = '2010-05-25'
-- Fix date range to include time values
SET @BeginDate = CONVERT(VARCHAR(10), ISNULL(@BeginDate, '01/01/1990'), 101) + ' 00:00'
SET @EndDate = CONVERT(VARCHAR(10), ISNULL(@EndDate, '12/31/2099'), 101) + ' 23:59'
SELECT
*
FROM
claim c
WHERE
(c.Received_Date BETWEEN @BeginDate AND @EndDate) --this is much slower
--(c.Received_Date BETWEEN '2010-05-20' AND '2010-05-25') --this is much faster
What datatype is "c.Received_Date"?
If it isn't datetime, then the column will be converted to datetime, because @BeginDate/@EndDate are datetime. This is known as data type precedence. This includes the case where the column is smalldatetime (as per the link), because datetime has almost the highest precedence.
With constants, the optimiser will use the column datatype
The conversion means no index seeks can be used in the plan, which is the cause of the slowdown.
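A minimal sketch of that fix, assuming Received_Date really is smalldatetime: declare the variables with the column's own type (or cast them) so the conversion happens on the variable side rather than the column side.

DECLARE @BeginDate smalldatetime, @EndDate smalldatetime;
SET @BeginDate = '2010-05-20';
SET @EndDate = '2010-05-25';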
Edit, after seeing query plans
For the literals, SQL Server worked out that a seek followed by a bookmark lookup is best, because the values are literals.
Generally, bookmark lookups are expensive (and incidentally one reason why we use covering indexes) for more than a handful of rows.
For the query using variables, it took the general case, because if the values change it can reuse the plan. The general case is to avoid the bookmark lookups, and in this case you get a PK (clustered index) scan.
Read more about why bookmark lookups are usually a bad thing on Simple-talk
In this case, you could try an index hint to force the seek, but if the range is too wide it will be really slow. Or you could remove SELECT * (bad practice anyway), replace it with SELECT col1, col2, etc., and use a covering index.
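For example, a covering index for a narrowed column list might look like this (a sketch; col1 and col2 are placeholders for whichever claim columns you actually need):

CREATE NONCLUSTERED INDEX IX_claim_Received_Date
ON dbo.claim (Received_Date)
INCLUDE (col1, col2);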
SET STATISTICS IO ON
SET STATISTICS TIME ON
What are the number of scans and logical reads?
You've mentioned that the same query on the same data on a different server runs fast.
Is the hardware identical, or at least reasonably similar?
processors - same number?
Is any processor hyperthreaded?
Is the physical layout of the disks the same (disk speed, separate spindles for data, log, tempdb)?
This behavior can often be caused by out-of-date statistics.
use dbfoo;
go
exec sp_updatestats
go
Lastly, compare SQL settings on each box:
exec sp_configure 'show advanced options', '1'
go
RECONFIGURE
exec sp_configure;
go