Simple SQL query performance puzzles me

Simple SQL query performance puzzles me - sql

A quick note: We're running SQL Server 2012 in house, but the problem seems to also occur in 2008 and 2008 R2, and perhaps older versions as well.
I've been investigating a performance issue in some code of ours, and I've tracked down the problem to the following very simple query:
SELECT min(document_id)
FROM document
WHERE document_id IN
(SELECT TOP 5000 document_id FROM document WHERE document_id > 442684)
I've noticed that this query takes an absurdly long time (between 18s and 70s depending on the resources of the machine running it) to return when that final value (after greater-than) is roughly 442000 or larger. Anything below that, the query returns nearly instantly.
I've since tweaked the query to look like this:
SELECT min(t.document_id)
FROM (SELECT TOP 5000 document_id FROM document WHERE document_id > 442684) t
This returns immediately for all values of > that I've tested with.
I've solved the performance issue at hand, so I'm largely happy, but I'm still puzzling over why the original query performed so poorly for 442000 and why it runs quickly for virtually any value below that (400000, 350000, etc).
Can anyone explain this?
EDIT: Fixed the 2nd query to be min instead of max (that was a typo)

The secret to understanding performance in SQL Server (and other databases) is the execution plan. You would need to look at the execution plan for the queries to understand what is going on.
The first version of your query has a join operation. IN with a subquery is another way to express a JOIN. SQL Server has several ways to implement joins, such as hash-matches, merge-sort, nested-loop, and index-lookup operations. The optimizer chooses the one that it thinks is the best.
Without seeing the execution plans, my best guess is that the optimizer changes its mind as to the best algorithm to use for the in. In my experience, this usually means that it switched to a nested-loop algorithm from a more reasonable one.

Change your query to
select min(document_id) from document where document_id > 442684
The select in (select top 5000) is a bad idea in sql -- it can expand to 5000 if tests. Don't know why the optimizer worked well in the max() case

Related

DISTINCT with HASH MATCH (Flow Distinct) in SQL Server

Recently, while working in SQL Server, I got an interesting thing that removing DISTINCT keyword actually decreased my query performance and increased my search time.
I read that DISTINCT can make queries slow so I removed it to make them faster, but that made my query even slower.
Further experimenting, I got that when I add DISTINCT, SQL Server actually do something HASH MATCH (Flow Distinct) and this reduces the time, and even parallelism is added with HASH MATCH.
My query looks like this:
SELECT DISTINCT TOP 5000
[A].[Row Id], [A].[Account], ... other columns
FROM
[Archival System].[dbo].[Activity] A
WHERE
([A].[Row Id] LIKE N'%{search_term}%'
OR [A].[Account] LIKE '%{search_term}%'
OR ... others conditions)
When I remove DISTINCT from this query, it becomes slower.
Here I have added TOP without ORDER BY as I a search string is not going to repeat again and if it occurs it would be rare. Hence I am avoiding ORDER BY to get a little more performance.
Below are my query execution plans with the query code of both my queries (previous and later) and the only change in query is just the DISTINCT keyword
Previous : https://www.brentozar.com/pastetheplan/?id=HyV7FKiU9
Later: https://www.brentozar.com/pastetheplan/?id=S1bFV9jUq
Can anybody tell me what this HASH MATCH (Flow Distinct) does, and why it just works when I add distinct?
And is it reliable to depend on that? As I am using DISTINCT now does it will continue to return results with the same time as it is returning now.
OR is there a better way to improve my query search time.
I am using SQL Server 2012 Enterprise edition.
Thanks in advance.

Optimization of large SQL Select query in odbc environment, comparison operators

I'm currently designing a database application that executes SQL statements on a SQL Server linked to the PCs via ODBC drivers (SQL Native Client v10, Local Network, Network Latency >1ms, executed from withing a MS Access 2003 Environment).
I'm dealing with a peculiar select query that is executed often and has to iterate through an indexed table with about 1.5 million entries. Currently the query structure is this:
SELECT *
FROM table1
WHERE field1 = value1
AND field2 = value2
AND textfield1 LIKE '* value3 *'
AND (field3 = value3 OR field4 = value4 OR field5 = value5)
ORDER BY indexedField1 DESC
(Simplified for reading comprehension and understandability, the real query can have up to 4 bracketed AND connected OR blocks, and up to a total of 31 AND connected statements).
Currently this query takes about ~2s every time it gets executed. It returns somewhere between 1.000 and 15.000 records in usual production. I'm looking for a way to make it execute faster or to restructure it in a way to make it work faster.
Coworkers of mine have hinted at the fact that using LIKE operators might be performance inefficient and that restructuring the OR statements in brackets could bring additional performance.
Edit: Additional relevant information: the table that is being pulled from is VERY active, there is an entry roughly every 1-5 minutes into it.
So the final question is:
Given my parameters outlined above, is this version of the query the most simplistic I can get it.
Can I do something to otherwise speed up the query or the execution time thereof.

General query optimisation is beyond the scope of a single answer, though there may be some help to be had on http://dba.stackexchange.com. However, you should learn how to read a query plan and figure our your bottlenecks before you start optimising.
(The way I'd do that would be to take a few typical queries and look at their estimated execution plan through a tool like SQL Server Management Studio. You may have to try to dig out the real SQL Server query that's resulted from your Access query, which your DBA might be able to help with. I'm assuming that your Access query is actually being translated into a SQL Server query and run natively on the server; if it's not then that will be your big problem!)
I'm going to assume you've indexed every column used in every predicate in your WHERE clause and still have a problem. If that's the case, the suspect is likely to be:
AND textfield1 LIKE '* value3 *'
Because that can't use an index. (It's not SARGable, because it has a wildcard at the beginning, so an index won't be any help.)
If you can't rearrange your search or pre-calculate this particular predicate, then you basically have the problem that Full-Text Searching was designed to solve by tokenising and pre-indexing the words in your text, and that will probably be the best solution.

Strange performance of SELECT COUNT(1)

I have a select query with some complex joins and where conditions and it takes ~9seconds to execute.
Now, the strange thing is if I wrap the query with select count(1) the execution time will increase dramatically.
SELECT COUNT(1) FROM
(
SELECT .... -- initial query, executes ~9s
)
-- executes 1min
That's very strange to me, since I would expect an opposite result - the sql-server engine should be smart enough to optimize the inner query execution (for instance, do not execute nested queries in the select clause, etc).
And that's what execution plans comparison shows! It says it should be 74% to 26% (the former is initial query and latter is wrapped with select count(1)).
But that's not what really happens.
Idk if I should post the query itself, since it's rather large (if you need it then just let me know in comments).
Thaks you!)

When you use count(1) you no longer need all the columns.
This means that SQL Server can consider different execution plans using narrower indexes that do not cover all the columns used in the SELECT list of the original query.
Generally this should of course lead to a leaner, faster, execution plan however looks like in this case you were unlucky and it didn't.
Probably you will find a node with a large discrepancy between actual and estimated rows - this kind of thing will propagate up in the plan and can lead to sub optimal choices of strategy for other sub trees (e.g. sub optimal join orderings or algorithms)

Can I trust Execution plans?

I have these Queries:
With CTE(comno) as
(select distinct comno=ErpEnterpriseId from company)
select id=Row_number() over(order by comno),comno from cte
select comno=ErpEnterpriseId,RowNo=Row_number() over (order by erpEnterpriseId) from company group by ErpEnterpriseId
SELECT erpEnterpriseId, ROW_NUMBER() OVER(ORDER BY erpEnterpriseId) AS RowNo
FROM
(
SELECT DISTINCT erpEnterpriseId
FROM Company
) x
All three of them returns identical cost and actual execution plans..why and how so ?

It's all down to the query optimizer - that will by trying to optimize the query you enter into the most efficient execution plan (i.e several different queries could be optimized down to the SAME statement that is estimated to be most efficient).
The main thing you should do when trying to optimise a query and find which one performs the best, is to just try them and compare performance. Run an SQL profiler trace to see what the duration/reads is for each version. I usually run each version of a query 3 times to get an average to compare. Each time, clearing the execution plan and data cache down to prevent skewed results.
It's worth having a read of this MSDN article on the optimizer.

Simple, the optimizer is probably turning all your statements into the same statement.

Just like in English, in which there are many ways to say the same thing, all three of those queries are asking for the same data. The SQL Engine (the query optimizer) knows that and is smart enough to know what you are asking.
Even more appropriately, the engine has information that you don't (or likely don't know) - how the data is organized and indexed. It uses this information to make it's own decision about what the BEST way to get the data is, and that's what it is doing.
Although there are ways to override the optimizer, unless you really know what you are doing, you will probably only hurt performance. So your best option is to write the queries in whatever way make most sense to you (or other humans) for readability and maintainability.

Need tips for optimizing SQL Query using a JOIN

The Query I'm writing runs fine when looking at the past few days, once I go over a week it crawls (~20min). I am joining 3 tables together. I was wondering what things I should look for to make this run faster. I don't really know what other information is needed for the post.
EDIT: More info: db is Sybase 10. Query:
SELECT a.id, a.date, a.time, a.signal, a.noise,
b.signal_strength, b.base_id, b.firmware,
a.site, b.active, a.table_key_id
FROM adminuser.station AS a
JOIN adminuser.base AS b
ON a.id = b.base_id
WHERE a.site = 1234 AND a.date >= '2009-03-20'
I also took out the 3rd JOIN and it still runs extremely slow. Should I try another JOIN method?

I don't know Sybase 10 that well, but try running that query for say 10-day period and then 10 times, for each day in a period respectively and compare times. If the time in the first case is much higher, you've probably hit the database cache limits.
The solution is than to simply run queries for shorter periods in a loop (in program, not SQL). It works especially well if table A is partitioned by date.

You can get a lot of information (assuming you're using MSSQL here) by running your query in SQL Server Management Studio with the Include Actual Execution Plan option set (in the Query menu).
This will show you a diagram of the steps that SQLServer performs in order to execute the query - with relative costs against each step.
The next step is to rework the query a little (try doing it a different way) then run the new version and the old version at the same time. You will get two execution plans, with relative costs not only against each step, but against the two versions of the query! So you can tell objectively if you are making progress.
I do this all the time when debugging/optimizing queries.

Make sure you have indexes on the foreign keys.

It sounds more like you have a memory leak or aren't closing database connections in your client code than that there's anything wrong with the query.
[edit]
Nevermind: you mean quering over a date range rather than the duration the server has been active. I'll leave this up to help others avoid the same confusion.
Also, it would help if you could post the sql query, even if you need to obfuscate it some first, and it's a good bet to check if there's an index on your date column and the number of records returned by the longer range.

You may want to look into using a PARTITION for the date ranges, if your DB supports it. I've heard this can help significantly.

Grab the book "Professional SQL Server 2005 Performance Tuning" its pretty great.

You didn't mention your database. If it's not SQL Server, the specifics of how to get the data might be different, but the advice is fundamentally the same.
Look at indexing, for sure, but the first thing to do is to follow Blorgbeard's advice and scan for execution plans using Management Studio (again, if you are running SQL Server).
What I'm guessing you'll see is that for small date ranges, the optimizer picks a reasonable query plan, but that when the date range is large, it picks something completely different, likely involving either table scans or index scans, and possibly joins that lead to very large temporary recordsets. The execution plan analyzer will reveal all of this.
A scan means that the optimizer thinks that grinding over the whole table or the whole index is cheaper for what you are trying to do than seeking specific values.
What you eventually want to do is get indexes and the syntax of your query set up such that you keep index seeks in the query plan for your query regardless of the date range, or, failing that, that the scans you require are filtered as well as you can manage to minimize temporary recordset size and thereby avoid excessive reads and I/O.

SELECT
a.id, a.date, a.time, a.signal, a.noise,a.site, b.active, a.table_key_id,
b.signal_strength, b.base_id, b.firmware
FROM
( SELECT * FROM adminuser.station
WHERE site = 1234 AND date >= '2009-03-20') AS a
JOIN
adminuser.base AS b
ON
a.id = b.base_id
Kind of rewrote the query, so as to first filter the desired rows then perform a join rather than perform a join then filter the result.
Rather than pulling * from the sub-query you can just select the columns you want, which might be little helpful.
May be this will of little help, in speeding things.
While this is valid in MySql, I am not sure of the sysbase syntax though.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas