Adding inner query is not changing the execution plan - sql

Consider the following queries.
select * from contact where firstname like '%some%'
select * from
(select * from contact) as t1
where firstname like '%some%'
The execution plans for both queries are the same, and they execute in the same time. I was expecting the second query to have a different plan and execute more slowly, as it has to select all the data from contact and then apply the filter. It looks like I was wrong.
I am wondering how this is happening?
Database server: SQL Server 2005

The "query optimiser" is what's happening. When you run a query, SQL Server uses a cost-based optimiser to identify what is likely to be the best way to fulfil that request (i.e. it's execution plan). Think about it as a route map from Place A to Place B. There may be many different ways to get from A to B, some will be quicker than others. SQL Server will workout different routes to achieve the end goal of returning the data that satisfies the query and go with one that has an acceptable cost. Note, it doesn't necessarily analyse EVERY possible way, as that would be unnecessarily expensive.
In your case, the optimiser has worked out that those 2 queries can be collapsed down to the same thing, hence you get the same plan.
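If you want to see this for yourself, you can ask SQL Server for the estimated plans of both statements and compare them; they should come back identical. A minimal sketch (SHOWPLAN_XML returns the plan instead of executing the query):
SET SHOWPLAN_XML ON;
GO
SELECT * FROM contact WHERE firstname LIKE '%some%';
GO
SELECT * FROM (SELECT * FROM contact) AS t1 WHERE firstname LIKE '%some%';
GO
SET SHOWPLAN_XML OFF;
GO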

Related

DISTINCT with HASH MATCH (Flow Distinct) in SQL Server

Recently, while working in SQL Server, I noticed something interesting: removing the DISTINCT keyword actually decreased my query performance and increased my search time.
I had read that DISTINCT can make queries slow, so I removed it to make mine faster, but that made my query even slower.
Experimenting further, I found that when I add DISTINCT, SQL Server actually does a HASH MATCH (Flow Distinct), and this reduces the time; parallelism is even added along with the HASH MATCH.
My query looks like this:
SELECT DISTINCT TOP 5000
[A].[Row Id], [A].[Account], ... other columns
FROM
[Archival System].[dbo].[Activity] A
WHERE
([A].[Row Id] LIKE N'%{search_term}%'
OR [A].[Account] LIKE '%{search_term}%'
OR ... other conditions)
When I remove DISTINCT from this query, it becomes slower.
Here I have used TOP without ORDER BY, as a search string is not going to repeat again, and if it does, that would be rare. Hence I am avoiding ORDER BY to gain a little more performance.
Below are the execution plans (with the query code) for both versions (previous and later); the only change between them is the DISTINCT keyword.
Previous : https://www.brentozar.com/pastetheplan/?id=HyV7FKiU9
Later: https://www.brentozar.com/pastetheplan/?id=S1bFV9jUq
Can anybody tell me what this HASH MATCH (Flow Distinct) does, and why it only appears when I add DISTINCT?
And is it reliable to depend on that? Now that I am using DISTINCT, will it continue to return results in the same time as it does now?
Or is there a better way to improve my query's search time?
I am using SQL Server 2012 Enterprise edition.
Thanks in advance.

Strange performance of SELECT COUNT(1)

I have a select query with some complex joins and where conditions, and it takes ~9 seconds to execute.
Now, the strange thing is if I wrap the query with select count(1) the execution time will increase dramatically.
SELECT COUNT(1) FROM
(
SELECT .... -- initial query, executes ~9s
) AS t
-- executes 1min
That's very strange to me, since I would expect the opposite result - the SQL Server engine should be smart enough to optimize the inner query's execution (for instance, not executing nested queries in the select clause, etc.).
And that's what the execution plan comparison shows! It says the split should be 74% to 26% (the former being the initial query and the latter the one wrapped with select count(1)).
But that's not what really happens.
Idk if I should post the query itself, since it's rather large (if you need it then just let me know in comments).
Thank you!
When you use count(1) you no longer need all the columns.
This means that SQL Server can consider different execution plans using narrower indexes that do not cover all the columns used in the SELECT list of the original query.
Generally this should of course lead to a leaner, faster execution plan; however, it looks like in this case you were unlucky and it didn't.
You will probably find a node with a large discrepancy between actual and estimated rows - this kind of thing propagates up the plan and can lead to suboptimal choices of strategy for other subtrees (e.g. suboptimal join orderings or algorithms).
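One way to find such a node is to capture the actual execution plan, which records estimated and actual row counts for every operator, and, if stale statistics turn out to be the culprit, refresh them. A minimal sketch - the table name is just a placeholder for whichever table shows the discrepancy:
SET STATISTICS XML ON;
-- run the SELECT COUNT(1) wrapper here and compare estimated vs actual rows per operator
SET STATISTICS XML OFF;
-- if the estimates are badly off because of stale statistics, refreshing them may help
UPDATE STATISTICS dbo.YourTable WITH FULLSCAN;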

Using temp table for sorting data in SQL Server

Recently, I came across a pattern (not sure, it could be an anti-pattern) for sorting data in a SELECT query. The pattern is a more verbose and non-declarative way of ordering data: dump the relevant data from the actual table into a temporary table and then apply ORDER BY on a field of the temporary table. I guess the only reason someone would do that is to improve performance (which I doubt), with no other benefit.
For example, let's say there is a Users table that might contain millions of rows. We want to retrieve all the users whose first name starts with 'G', sorted by first name. The natural and more declarative way to write a SQL query for this scenario is:
More natural and declarative way
SELECT * FROM Users
WHERE NAME LIKE 'G%'
ORDER BY Name
Verbose way
SELECT * INTO TempTable
FROM Users
WHERE NAME LIKE 'G%'
SELECT * FROM TempTable
ORDER BY Name
With that context, I have a few questions:
Will there be any performance difference between the two ways if there is no index on the first name field? If yes, which one would be better?
Will there be any performance difference between the two ways if there is an index on the first name field? If yes, which one would be better?
Shouldn't the SQL Server optimizer generate the same execution plan for both ways?
Is there any benefit to writing it the verbose way from any other perspective, like locking/blocking?
Thanks in advance.
Regularly: an anti-pattern used by people without an idea of what they are doing.
SOMETIMES: OK, because SQL Server has a problem that is not resolvable otherwise - I have not seen that one in years, though.
It makes things slower because it forces the tempdb table to be fully populated FIRST, while otherwise the query could POSSIBLY be resolved more efficiently.
The last time I saw that was about 3 years ago. We made it 3 times as fast by not trying to be smart with a tempdb table ;)
Answers:
1: No, it still needs a table scan, obviously.
2: Possibly - it depends on the amount of data, but an index seek would return the data already in order (as the index is ordered by its content).
3: No. Obviously. Query plan optimization is statement by statement. By cutting the execution into 2 statements, the query optimizer CANNOT merge them into one plan.
4: Only if you run into a query optimizer issue or a limitation on how many tables you can join - not in this degenerate case (degenerate in the technical sense, i.e. very simplistic). But if you need to join MANY, MANY tables it may be better to go with an interim step.
If the field you want to order by is not indexed, you could put everything into a temp table, index it, and then do the ordering; it might be faster. You would have to test to make sure.
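For the question's own example, that temp-table-plus-index pattern would look roughly like this - a sketch only, with an illustrative index name, and worth benchmarking against the single-statement version before adopting:
SELECT * INTO #SortedUsers
FROM Users
WHERE Name LIKE 'G%'
CREATE CLUSTERED INDEX IX_SortedUsers_Name ON #SortedUsers (Name)
-- with the clustered index on Name, the final SELECT can return rows in order without a separate sort
SELECT * FROM #SortedUsers
ORDER BY Name
DROP TABLE #SortedUsers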
There is never any benefit of the second approach that I can think of.
It means that if the data is available pre-ordered, SQL Server can't take advantage of this, and an unnecessary blocking operator and an additional sort are added to the plan.
In the case that the data is not available pre-ordered, SQL Server will sort it in a work table, either in memory or in tempdb, anyway, and adding an explicit #temp table just adds an unnecessary additional step.
Edit
I suppose one case where the second approach could give an apparent benefit might be if the presence of the ORDER BY caused SQL Server to choose a different plan that turned out to be suboptimal. In that case I would resolve it in a different way, either by improving statistics or by using hints/a query rewrite to avoid the undesired plan.
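For instance, if the ORDER BY pushed the optimizer onto a bad plan for the Users example above, an index hint is one way to steer it back without introducing a #temp table - a sketch only, where IX_Users_Name is a hypothetical index on the Name column:
SELECT * FROM Users WITH (INDEX (IX_Users_Name)) -- IX_Users_Name is hypothetical
WHERE Name LIKE 'G%'
ORDER BY Name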

Slow SQL query involving CONTAINS and OR

We're having a problem we were hoping the good folks of Stack Overflow could help us with. We're running SQL Server 2008 R2 and are having problems with a query that takes a very long time to run on a moderate set of data, about 100,000 rows. We're using CONTAINS to search through XML files and LIKE on another column to support leading wildcards.
We’ve reproduced the problem with the following small query that takes about 35 seconds to run:
SELECT something FROM table1
WHERE (CONTAINS(TextColumn, '"WhatEver"') OR
DescriptionColumn LIKE '%WhatEver%')
Query plan:
If we modify the query above to use UNION instead, the running time drops from 35 seconds to under 1 second. We would like to avoid using this approach to solve the issue.
SELECT something FROM table1 WHERE (CONTAINS(TextColumn, '"WhatEver"'))
UNION
(SELECT something FROM table1 WHERE (DescriptionColumn LIKE '%WhatEver%'))
Query plan:
The column that we're searching through with CONTAINS is of type image and consists of XML files anywhere from 1 KB to 20 KB in size.
We have no good theories as to why the first query is so slow so we were hoping someone here would have something wise to say on the matter. The query plans don’t show anything out of the ordinary as far as we can tell. We've also rebuilt the indexes and statistics.
Is there anything blatantly obvious we’re overlooking here?
Thanks in advance for your time!
Why are you using DescriptionColumn LIKE '%WhatEver%' instead of CONTAINS(DescriptionColumn, '"WhatEver"')?
CONTAINS is obviously a full-text predicate and will use the SQL Server Full-Text engine to filter the search results, whereas LIKE is a "normal" SQL Server keyword, so SQL Server will not use the Full-Text engine to assist with this query. In this case, because the LIKE term begins with a wildcard, SQL Server will be unable to use any indexes to help with the query either, which will most likely result in a table scan and/or poorer performance than using the Full-Text engine.
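Assuming DescriptionColumn is (or can be) covered by the full-text index, that suggestion would turn the original query into something like this:
SELECT something FROM table1
WHERE CONTAINS(TextColumn, '"WhatEver"')
OR CONTAINS(DescriptionColumn, '"WhatEver"')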
It's difficult to tell without an execution plan; however, my guess as to what's happening would be:
The UNION variation of the query is performing a table scan against table1 - the table scan is not fast, but because there are relatively few rows in the table it does not perform that slowly (compared to a 35 s benchmark).
In the OR variation of the query, SQL Server first uses the Full-Text engine to filter based on the CONTAINS and then goes on to perform a RID lookup on each matching row in the result to filter based on the LIKE predicate. However, for some reason SQL Server has massively underestimated the number of rows (this can happen with certain types of predicate) and so goes on to perform several thousand RID lookups, which ends up being incredibly slow (a table scan would have been much quicker).
To really understand what's going on you need to get a query plan.
Did you guys try this:
SELECT *
FROM table
WHERE CONTAINS((column1, column2, column3), '"*keyword*"')
Instead of this:
SELECT *
FROM table
WHERE CONTAINS(column1, '"*keyword*"')
OR CONTAINS(column2, '"*keyword*"')
OR CONTAINS(column3, '"*keyword*"')
The first one is a lot faster.
I just ran into this. This is reportedly a bug in SQL Server 2008 R2:
http://www.arcomit.co.uk/support/kb.aspx?kbid=000060
Your approach of using a UNION of two selects instead of an OR is the workaround they recommend in that article.

Need tips for optimizing SQL Query using a JOIN

The query I'm writing runs fine when looking at the past few days; once I go over a week it crawls (~20 min). I am joining 3 tables together. I was wondering what things I should look for to make this run faster. I don't really know what other information is needed for the post.
EDIT: More info: db is Sybase 10. Query:
SELECT a.id, a.date, a.time, a.signal, a.noise,
b.signal_strength, b.base_id, b.firmware,
a.site, b.active, a.table_key_id
FROM adminuser.station AS a
JOIN adminuser.base AS b
ON a.id = b.base_id
WHERE a.site = 1234 AND a.date >= '2009-03-20'
I also took out the 3rd JOIN and it still runs extremely slow. Should I try another JOIN method?
I don't know Sybase 10 that well, but try running that query for, say, a 10-day period, and then 10 times, once for each day in the period, and compare the times. If the time in the first case is much higher, you've probably hit the database cache limits.
The solution is then to simply run queries for shorter periods in a loop (in the program, not in SQL). It works especially well if table A is partitioned by date.
You can get a lot of information (assuming you're using MSSQL here) by running your query in SQL Server Management Studio with the Include Actual Execution Plan option set (in the Query menu).
This will show you a diagram of the steps that SQL Server performs in order to execute the query - with relative costs against each step.
The next step is to rework the query a little (try doing it a different way) then run the new version and the old version at the same time. You will get two execution plans, with relative costs not only against each step, but against the two versions of the query! So you can tell objectively if you are making progress.
I do this all the time when debugging/optimizing queries.
Make sure you have indexes on the foreign keys.
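For the join above, that means the base_id column on adminuser.base. A sketch - the index name is illustrative, and the exact options may vary slightly on Sybase:
CREATE INDEX idx_base_base_id ON adminuser.base (base_id)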
It sounds more like you have a memory leak or aren't closing database connections in your client code than that there's anything wrong with the query.
[edit]
Never mind: you mean querying over a date range rather than the duration the server has been active. I'll leave this up to help others avoid the same confusion.
Also, it would help if you could post the SQL query, even if you need to obfuscate it somewhat first, and it's a good bet to check whether there's an index on your date column and how many records are returned by the longer range.
You may want to look into using a PARTITION for the date ranges, if your DB supports it. I've heard this can help significantly.
Grab the book "Professional SQL Server 2005 Performance Tuning"; it's pretty great.
You didn't mention your database. If it's not SQL Server, the specifics of how to get the data might be different, but the advice is fundamentally the same.
Look at indexing, for sure, but the first thing to do is to follow Blorgbeard's advice and scan for execution plans using Management Studio (again, if you are running SQL Server).
What I'm guessing you'll see is that for small date ranges, the optimizer picks a reasonable query plan, but that when the date range is large, it picks something completely different, likely involving either table scans or index scans, and possibly joins that lead to very large temporary recordsets. The execution plan analyzer will reveal all of this.
A scan means that the optimizer thinks that grinding over the whole table or the whole index is cheaper for what you are trying to do than seeking specific values.
What you eventually want to do is set up your indexes and the syntax of your query such that you keep index seeks in the query plan regardless of the date range, or, failing that, ensure that the scans you do require are filtered as well as you can manage, to minimize temporary recordset size and thereby avoid excessive reads and I/O.
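If you are running SQL Server, one sketch of the kind of index that tends to keep those seeks in the plan for this query is a covering index on the filter columns - the name and column list below are illustrative (taken from the query posted above), and INCLUDE is SQL Server syntax:
CREATE INDEX IX_station_site_date
ON adminuser.station (site, date)
INCLUDE (id, time, signal, noise, table_key_id)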
SELECT
a.id, a.date, a.time, a.signal, a.noise,a.site, b.active, a.table_key_id,
b.signal_strength, b.base_id, b.firmware
FROM
( SELECT * FROM adminuser.station
WHERE site = 1234 AND date >= '2009-03-20') AS a
JOIN
adminuser.base AS b
ON
a.id = b.base_id
I kind of rewrote the query so as to first filter the desired rows and then perform the join, rather than performing the join and then filtering the result.
Rather than pulling * from the subquery, you can select just the columns you want, which might help a little.
Maybe this will be of a little help in speeding things up.
While this is valid in MySQL, I am not sure about the Sybase syntax, though.