DISTINCT with HASH MATCH (Flow Distinct) in SQL Server

Recently, while working in SQL Server, I noticed something interesting: removing the DISTINCT keyword actually decreased my query performance and increased my search time.
I had read that DISTINCT can make queries slow, so I removed it to make them faster, but that made my query even slower.
Experimenting further, I found that when I add DISTINCT, SQL Server uses a HASH MATCH (Flow Distinct) operator, which reduces the time, and parallelism is also added along with the HASH MATCH.
My query looks like this:
SELECT DISTINCT TOP 5000
[A].[Row Id], [A].[Account], ... other columns
FROM
[Archival System].[dbo].[Activity] A
WHERE
([A].[Row Id] LIKE N'%{search_term}%'
OR [A].[Account] LIKE '%{search_term}%'
OR ... other conditions)
When I remove DISTINCT from this query, it becomes slower.
I have used TOP without ORDER BY because a given search string is not going to repeat, and if it does that would be rare, so I am avoiding ORDER BY to gain a little more performance.
Below are the execution plans for both queries (before and after); the only change between them is the DISTINCT keyword.
Previous : https://www.brentozar.com/pastetheplan/?id=HyV7FKiU9
Later: https://www.brentozar.com/pastetheplan/?id=S1bFV9jUq
Can anybody tell me what this HASH MATCH (Flow Distinct) does, and why it only appears when I add DISTINCT?
Also, is it reliable to depend on? Now that I am using DISTINCT, will the query keep returning results as quickly as it does now?
Or is there a better way to improve my query's search time?
I am using SQL Server 2012 Enterprise edition.
Thanks in advance.

Related

Simple SQL query performance puzzles me

A quick note: We're running SQL Server 2012 in house, but the problem seems to also occur in 2008 and 2008 R2, and perhaps older versions as well.
I've been investigating a performance issue in some code of ours, and I've tracked down the problem to the following very simple query:
SELECT min(document_id)
FROM document
WHERE document_id IN
(SELECT TOP 5000 document_id FROM document WHERE document_id > 442684)
I've noticed that this query takes an absurdly long time (between 18s and 70s depending on the resources of the machine running it) to return when that final value (after greater-than) is roughly 442000 or larger. Anything below that, the query returns nearly instantly.
I've since tweaked the query to look like this:
SELECT min(t.document_id)
FROM (SELECT TOP 5000 document_id FROM document WHERE document_id > 442684) t
This returns immediately for every greater-than value I've tested with.
I've solved the performance issue at hand, so I'm largely happy, but I'm still puzzling over why the original query performed so poorly for 442000 and why it runs quickly for virtually any value below that (400000, 350000, etc).
Can anyone explain this?
EDIT: Fixed the 2nd query to be min instead of max (that was a typo)
The secret to understanding performance in SQL Server (and other databases) is the execution plan. You would need to look at the execution plan for the queries to understand what is going on.
The first version of your query has a join operation. IN with a subquery is another way to express a JOIN. SQL Server has several ways to implement joins, such as hash match, merge join, and nested loops (with or without index lookups). The optimizer chooses the one that it thinks is the best.
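For illustration only (this rewrite is not from the original post), the IN can be expressed as an equivalent EXISTS, which makes the join the optimizer has to perform more visible; document_id is assumed to be the key of document:
-- Semantically equivalent to the IN version (a semi-join).
SELECT MIN(d.document_id)
FROM document AS d
WHERE EXISTS
      (SELECT 1
       FROM (SELECT TOP 5000 document_id
             FROM document
             WHERE document_id > 442684) AS t
       WHERE t.document_id = d.document_id);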
Without seeing the execution plans, my best guess is that the optimizer changes its mind as to the best algorithm to use for the IN. In my experience, this usually means that it switched to a nested-loop algorithm from a more reasonable one.
Change your query to
select min(document_id) from document where document_id > 442684
The SELECT ... IN (SELECT TOP 5000 ...) pattern is a bad idea in SQL -- it can expand to 5,000 separate tests. I don't know why the optimizer worked well in the MIN() case (the derived-table version).

Solution for speeding up a slow SELECT DISTINCT query in Postgres

The query is basically:
SELECT DISTINCT "my_table"."foo" from "my_table" WHERE...
Assume that I'm 100% certain the DISTINCT portion of the query is the reason it runs slowly. I've omitted the rest of the query to avoid confusion, since the DISTINCT portion's slowness is what I'm primarily concerned with (DISTINCT is always a source of slowness).
The table in question has 2.5 million rows of data. The DISTINCT is needed for purposes not listed here (because I don't want back a modified query, but rather just general information about making distinct queries run faster at the DBMS level, if possible).
How can I make DISTINCT run quicker (using Postgres 9, specifically) without altering the SQL (i.e., I can't alter the SQL coming in, but I do have access to optimize something at the DB level)?
Oftentimes, you can make such queries run faster by replacing the DISTINCT with a GROUP BY instead:
select my_table.foo
from my_table
where [whatever where conditions you want]
group by foo;
Your DISTINCT is causing it to sort the output rows in order to find duplicates. If you put an index on the column(s) selected by the query, the database may be able to read them out in index order and save the sort step. A lot will depend on the details of the query and the tables involved; your saying you "know the problem is with the DISTINCT" really limits the scope of available answers.
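A minimal sketch of that index suggestion, assuming the column really is named foo (the index name is made up):
-- Hypothetical index so the planner can read foo in index order
-- instead of sorting the whole result to de-duplicate it.
CREATE INDEX my_table_foo_idx ON my_table (foo);
-- Then compare the plan before and after:
EXPLAIN ANALYZE SELECT DISTINCT foo FROM my_table;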
You can try increasing the work_mem setting. Depending on the size of your dataset, it can cause the query plan to switch to hash aggregates, which are usually faster.
But before setting it too high globally, first read up on it. You can easily blow up your server, because the max_connections setting acts as a multiplier to this number.
This means that if you were to set work_mem = 128MB and you set max_connections = 100 (the default), you should have more than 12.8GB of RAM. You're essentially telling the server that it can use that much for performing queries (not even considering any other memory use by Postgres or otherwise).
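As a rough sketch, you can raise work_mem for a single session (or a specific role) first, to see whether the plan switches to a hash aggregate before touching the global setting; the 64MB value and the role name below are just examples:
-- Session-level only; the value is illustrative.
SET work_mem = '64MB';
EXPLAIN ANALYZE SELECT DISTINCT foo FROM my_table;
-- Or persist it for one role instead of the whole server (role name is hypothetical):
ALTER ROLE reporting_user SET work_mem = '64MB';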

Slow SQL query involving CONTAINS and OR

We're having a problem we were hoping the good folks of Stack Overflow could help us with. We're running SQL Server 2008 R2 and are having problems with a query that takes a very long time to run on a moderate set of data, about 100,000 rows. We're using CONTAINS to search through XML files and LIKE on another column to support leading wildcards.
We’ve reproduced the problem with the following small query that takes about 35 seconds to run:
SELECT something FROM table1
WHERE (CONTAINS(TextColumn, '"WhatEver"') OR
DescriptionColumn LIKE '%WhatEver%')
Query plan: (plan image not included)
If we modify the query above to use UNION instead, the running time drops from 35 seconds to under 1 second. We would like to avoid using this approach to solve the issue.
SELECT something FROM table1 WHERE CONTAINS(TextColumn, '"WhatEver"')
UNION
SELECT something FROM table1 WHERE DescriptionColumn LIKE '%WhatEver%'
Query plan: (plan image not included)
The column we're using CONTAINS to search through is of type image and contains XML files anywhere from 1 KB to 20 KB in size.
We have no good theories as to why the first query is so slow so we were hoping someone here would have something wise to say on the matter. The query plans don’t show anything out of the ordinary as far as we can tell. We've also rebuilt the indexes and statistics.
Is there anything blatantly obvious we’re overlooking here?
Thanks in advance for your time!
Why are you using DescriptionColumn LIKE '%WhatEver%' instead of CONTAINS(DescriptionColumn, '"WhatEver"')?
CONTAINS is a Full-Text predicate and will use the SQL Server Full-Text engine to filter the search results. LIKE, however, is a "normal" SQL Server keyword, so SQL Server will not use the Full-Text engine to assist with that part of the query. In this case, because the LIKE term begins with a wildcard, SQL Server will also be unable to use any indexes to help with the query, which will most likely result in a table scan and/or poorer performance than using the Full-Text engine.
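A sketch of what that suggestion would look like, assuming DescriptionColumn can be covered by a full-text index (note that CONTAINS does word-based matching, so results can differ from a substring LIKE):
-- Both predicates handled by the full-text engine; no leading-wildcard LIKE.
SELECT something
FROM table1
WHERE CONTAINS(TextColumn, '"WhatEver"')
   OR CONTAINS(DescriptionColumn, '"WhatEver"')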
It's impossible to tell for sure without an execution plan, but my guess at what's happening would be:
The UNION variation of the query is performing a table scan against table1. The table scan is not fast, but because there are relatively few rows in the table it is not performing that slowly (compared to a 35 s benchmark).
In the OR variation of the query, SQL Server first uses the Full-Text engine to filter based on the CONTAINS and then performs a RID lookup on each matching row to filter based on the LIKE predicate. For some reason, however, SQL Server has massively underestimated the number of rows (this can happen with certain types of predicate) and so goes on to perform several thousand RID lookups, which ends up being incredibly slow (a table scan would have been much quicker).
To really understand what's going on you need to get a query plan.
Did you guys try this:
SELECT *
FROM table
WHERE CONTAINS((column1, column2, column3), '"*keyword*"')
Instead of this:
SELECT *
FROM table
WHERE CONTAINS(column1, '"*keyword*"')
OR CONTAINS(column2, '"*keyword*"')
OR CONTAINS(column3, '"*keyword*"')
The first one is a lot faster.
I just ran into this. This is reportedly a bug in SQL Server 2008 R2:
http://www.arcomit.co.uk/support/kb.aspx?kbid=000060
Your approach of using a UNION of two selects instead of an OR is the workaround they recommend in that article.

Adding inner query is not changing the execution plan

Consider the following queries.
select * from contact where firstname like '%some%'
select * from
(select * from contact) as t1
where firstname like '%some%'
The execution plans for both queries are the same, and they execute in the same time. But I was expecting the second query to have a different plan and execute more slowly, as it has to select all data from contact and then apply the filter. It looks like I was wrong.
I am wondering how this is happening?
Database Server : SQL server 2005
The "query optimiser" is what's happening. When you run a query, SQL Server uses a cost-based optimiser to identify what is likely to be the best way to fulfil that request (i.e. it's execution plan). Think about it as a route map from Place A to Place B. There may be many different ways to get from A to B, some will be quicker than others. SQL Server will workout different routes to achieve the end goal of returning the data that satisfies the query and go with one that has an acceptable cost. Note, it doesn't necessarily analyse EVERY possible way, as that would be unnecessarily expensive.
In your case, the optimiser has worked out that those 2 queries can be collapsed down to the same thing, hence you get the same plan.
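If you want to see for yourself that the two statements produce the same plan, one way (in SQL Server Management Studio or sqlcmd) is to request the estimated plan without executing the queries:
-- Return the estimated XML plan instead of running the statements.
SET SHOWPLAN_XML ON;
GO
SELECT * FROM contact WHERE firstname LIKE '%some%';
SELECT * FROM (SELECT * FROM contact) AS t1 WHERE firstname LIKE '%some%';
GO
SET SHOWPLAN_XML OFF;
GO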

Need tips for optimizing SQL Query using a JOIN

The Query I'm writing runs fine when looking at the past few days, once I go over a week it crawls (~20min). I am joining 3 tables together. I was wondering what things I should look for to make this run faster. I don't really know what other information is needed for the post.
EDIT: More info: db is Sybase 10. Query:
SELECT a.id, a.date, a.time, a.signal, a.noise,
b.signal_strength, b.base_id, b.firmware,
a.site, b.active, a.table_key_id
FROM adminuser.station AS a
JOIN adminuser.base AS b
ON a.id = b.base_id
WHERE a.site = 1234 AND a.date >= '2009-03-20'
I also took out the 3rd JOIN and it still runs extremely slow. Should I try another JOIN method?
I don't know Sybase 10 that well, but try running that query for, say, a 10-day period and then 10 times, once for each day in the period, and compare the times. If the time in the first case is much higher, you've probably hit the database cache limits.
The solution is then to simply run queries for shorter periods in a loop (in the program, not in SQL). It works especially well if table A is partitioned by date.
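The per-day statement that the application loop would issue might look roughly like this (the loop and the @day parameter live in the client code; exact date functions depend on your Sybase version, so treat this as a sketch):
-- Issued once per day from the application, with @day bound by the client.
SELECT a.id, a.date, a.time, a.signal, a.noise,
       b.signal_strength, b.base_id, b.firmware,
       a.site, b.active, a.table_key_id
FROM adminuser.station AS a
JOIN adminuser.base AS b ON a.id = b.base_id
WHERE a.site = 1234
  AND a.date >= @day
  AND a.date <  dateadd(day, 1, @day)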
You can get a lot of information (assuming you're using MSSQL here) by running your query in SQL Server Management Studio with the Include Actual Execution Plan option set (in the Query menu).
This will show you a diagram of the steps that SQL Server performs in order to execute the query, with relative costs against each step.
The next step is to rework the query a little (try doing it a different way) then run the new version and the old version at the same time. You will get two execution plans, with relative costs not only against each step, but against the two versions of the query! So you can tell objectively if you are making progress.
I do this all the time when debugging/optimizing queries.
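If you prefer numbers to diagrams, a small addition (SQL Server syntax; Sybase has equivalent SET options) is to turn on I/O and timing statistics while comparing the two versions:
-- Show logical reads and elapsed time for each statement that follows.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- run the old version of the query here
-- run the reworked version of the query here
SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;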
Make sure you have indexes on the foreign keys.
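For example (the index name is made up; adjust to your schema):
-- Hypothetical index supporting the join column on the base table.
CREATE INDEX idx_base_base_id ON adminuser.base (base_id)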
It sounds more like you have a memory leak or aren't closing database connections in your client code than that there's anything wrong with the query.
[edit]
Never mind: you mean querying over a date range rather than the duration the server has been active. I'll leave this up to help others avoid the same confusion.
Also, it would help if you could post the SQL query, even if you need to obfuscate it a little first. It's a good bet to check whether there's an index on your date column, and how many records are returned by the longer range.
You may want to look into using a PARTITION for the date ranges, if your DB supports it. I've heard this can help significantly.
Grab the book "Professional SQL Server 2005 Performance Tuning"; it's pretty great.
You didn't mention your database. If it's not SQL Server, the specifics of how to get the data might be different, but the advice is fundamentally the same.
Look at indexing, for sure, but the first thing to do is to follow Blorgbeard's advice and scan for execution plans using Management Studio (again, if you are running SQL Server).
What I'm guessing you'll see is that for small date ranges, the optimizer picks a reasonable query plan, but that when the date range is large, it picks something completely different, likely involving either table scans or index scans, and possibly joins that lead to very large temporary recordsets. The execution plan analyzer will reveal all of this.
A scan means that the optimizer thinks that grinding over the whole table or the whole index is cheaper for what you are trying to do than seeking specific values.
What you eventually want to do is get indexes and the syntax of your query set up such that you keep index seeks in the query plan for your query regardless of the date range, or, failing that, that the scans you require are filtered as well as you can manage to minimize temporary recordset size and thereby avoid excessive reads and I/O.
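A sketch of the kind of index that keeps the station filter as a seek regardless of the date range (the index name is illustrative):
-- Composite index: equality seek on site, then a range scan on date.
CREATE INDEX idx_station_site_date ON adminuser.station (site, date)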
SELECT
a.id, a.date, a.time, a.signal, a.noise, a.site, b.active, a.table_key_id,
b.signal_strength, b.base_id, b.firmware
FROM
( SELECT * FROM adminuser.station
WHERE site = 1234 AND date >= '2009-03-20') AS a
JOIN
adminuser.base AS b
ON
a.id = b.base_id
I kind of rewrote the query so as to first filter the desired rows and then perform the join, rather than perform the join and then filter the result.
Rather than pulling * from the sub-query, you can select just the columns you want, which might be a little helpful.
Maybe this will be of a little help in speeding things up.
While this is valid in MySQL, I am not sure of the Sybase syntax, though.