I have a fairly generic select query. When I select the top 1245 results of a particular result set, it runs in under a second, as expected. However, if I run it for 1246, it runs continuously as if on an infinite loop. I've checked the formatting of rows 1245 and 1246, the data for which appears completely fine. I can also run the same query on a separate group of users numbering over 2300, which again, runs almost instantly, which makes me think it's not in relation to memory issues.
As a quick example of the query formatting:
SELECT TOP 1246 a.id,
(SELECT TOP 1 col_1 FROM table_1 t INNER JOIN table_2 c ON t.id=c.id WHERE t.id=a.id) AS [columnAlias]
FROM table_3 a
Open to any ideas on troubleshooting.
If I can provide anything else that might help, just ask.
The difference in performance is probably due to changes in the execution plan. You might want to check that statistics are up-to-date.
Second, your query really makes no sense, because the subquery has no relationship to the outer query. So, you might as well accept that you are going to get a single value and move it to the from clause:
SELECT TOP 1246 a.id, col_1 AS [columnAlias]
FROM table_3 a CROSS JOIN
(SELECT TOP 1 col_1 FROM table_1 t INNER JOIN table_2 c ON t.id=c.id);
Finally, if you have some other intention with your query, you should ask another question. If you revise your question, you may invalidate this answer which draws downvotes.
"Running continuously on an infinite loop" when you get to a specific record makes me suspicious that you are encountering a deadlocking situation. There isn't anything obvious in what you posted that would cause it, so I'd suspect there is a context where this is running, e.g. it's part of a several step transaction, that could be the cause.
Related
I am trying to optimize a query that uses an inner join, and I am puzzled by the difference of performance between two very similar queries.
I hope to shed some light on this.
The tables look like this:
Aggregates:
+-recid(key)-+-avg---+
+------------+-------+
History:
+-recid(key)-+-value-+
+------------+-------+
The aim is to get, for a given key (let's assume 1234), avg and value.
I have tried two queries who seem very similar to me:
SELECT a.avg, b.value FROM aggregates a, history b
WHERE a.recid = b.recid
AND a.recid = 1234
Takes 5 seconds to run
But,
SELECT a.avg, b.value FROM aggregates a, history b
WHERE a.recid = 1234
AND b.recid = 1234
runs in less than a second.
Those two queries give the very same result. I would like to understand the huge difference in performance (the end game being a better understanding, to achieve a better performance for this query!)
First, learn to use proper explicit JOIN syntax:
SELECT a.avg, h.value
FROM aggregates a JOIN
history h
ON a.recid = h.recid
WHERE a.recid = 1234;
This doesn't affect performance, but it is the correct, modern syntax.
Assuming you have indexes on aggregates(recid) and history(recid), then the two versions should have very similar execution plans in almost any database I can think of. Those two indexes would be recommended for queries like this.
One possibility is cold versus warm cache. The first time you run a query, the data needs to be loaded into memory. That can take longer. For proper timing, you need to take this into account.
Finally, if you really want to understand the difference, then you need to look at the execution plan. Most databases provide a simple way to "explain" how the query is run.
Not sure absolutely but it could be that your second query execution plan is already cached and thus DB optimizier didn't needed to bring one. BTW, your first query should be changes as below to use ANSI style JOIN syntax
SELECT a.avg, b.value FROM aggregates a
JOIN history b ON a.recid = b.recid
WHERE a.recid = 1234
The second query might be performing a cross join then filtering on the results though it would have to be a very old version of Oracle to be that dumb. But you need to look at the query plan to find out. If they consistently exhibit different performance then I guarantee the query plans will be different.
I have the following SQL:
IF EXISTS
(
SELECT
1
FROM
SomeTable T1
WHERE
SomeField = 1
AND SomeOtherField = 1
AND NOT EXISTS(SELECT 1 FROM SomeOtherTable T2 WHERE T2.KeyField = T1.KeyField)
)
RAISERROR ('Blech.', 16, 1)
The SomeTable table has around 200,000 rows, and the SomeOtherTable table has about the same.
If I execute the inner SQL (the SELECT), it executes in sub-second time, returning no rows. But, if I execute the entire script (IF...RAISERROR) then it takes well over an hour. Why?
Now, obviously, the execution plan is different - I can see that in Enterprise Manager - but again, why?
I could probably do something like SELECT #num = COUNT(*) WHERE ... and then IF #num > 0 RAISERROR but... I think that's missing the point somewhat. You can only code around a bug (and it sure looks like a bug to me) if you know that it exists.
EDIT:
I should mention that I already tried re-jigging the query into an OUTER JOIN as per #Bohemian's answer, but this made no difference to the execution time.
EDIT 2:
I've attached the query plan for the inner SELECT statement:
... and the query plan for the whole IF...RAISERROR block:
Obviously these show the real table/field names, but apart from that the query is exactly as shown above.
The IF does not magically turn off optimizations or damage the plan. The optimizer just noticed that EXISTS only needs one row at most (like a TOP 1). This is called a "row goal" and it normally happens when you do paging. But also with EXISTS, IN, NOT IN and such things.
My guess: if you write TOP 1 to the original query you get the same (bad) plan.
The optimizer tries to be smart here and only produce the first row using much cheaper operations. Unfortunately, it misestimates cardinality. It guesses that the query will produce lots of rows although in reality it produces none. If it estimated correctly you'd just get a more efficient plan, or it would not do the transformation at all.
I suggest the following steps:
fix the plan by reviewing indexes and statistics
if this didn't help, change the query to IF (SELECT COUNT(*) FROM ...) > 0 which will give the original plan because the optimizer does not have a row goal.
It's probably because the optimizer can figure out how to turn your query into a more efficient query, but somehow the IF prevents that. Only an EXPLAIN will tell you why the query is taking so long, but I can tell you how to make this whole thing more efficient... Indtead of using a correlated subquery, which is incredibly inefficient - you get "n" subqueries run for "n" rows in the main table - use a JOIN.
Try this:
IF EXISTS (
SELECT 1
FROM SomeTable T1
LEFT JOIN SomeOtherTable T2 ON T2.KeyField = T1.KeyField
WHERE SomeField = 1
AND SomeOtherField = 1
AND T2.KeyField IS NULL
) RAISERROR ('Blech.', 16, 1)
The "trick" here is to use s LEFT JOIN and filter out all joined rows by testing for a null in the WHERE clause, which is executed after the join is made.
Please try SELECT TOP 1 KeyField. Using primary key will work faster in my guess.
NOTE: I posted this as answer as I couldn't comment.
Is the following the most efficient in SQL to achieve its result:
SELECT *
FROM Customers
WHERE Customer_ID NOT IN (SELECT Cust_ID FROM SUBSCRIBERS)
Could some use of joins be better and achieve the same result?
Any mature enough SQL database should be able to execute that just as effectively as the equivalent JOIN. Use whatever is more readable to you.
One reason why you might prefer to use a JOIN rather than NOT IN is that if the Values in the NOT IN clause contain any NULLs you will always get back no results. If you do use NOT IN remember to always consider whether the sub query might bring back a NULL value!
RE: Question in Comments
'x' NOT IN (NULL,'a','b')
≡ 'x' <> NULL and 'x' <> 'a' and 'x' <>
'b'
≡ Unknown and True and True
≡ Unknown
Maybe try this
Select cust.*
From dbo.Customers cust
Left Join dbo.Subscribers subs on cust.Customer_ID = subs.Customer_ID
Where subs.Customer_Id Is Null
SELECT Customers.*
FROM Customers
WHERE NOT EXISTS (
SELECT *
FROM SUBSCRIBERS AS s
JOIN s.Cust_ID = Customers.Customer_ID)
When using “NOT IN”, the query performs nested full table scans, whereas for “NOT EXISTS”, the query can use an index within the sub-query.
If you want to know which is more effective, you should try looking at the estimated query plans, or the actual query plans after execution. It'll tell you the costs of the queries (I find CPU and IO cost to be interesting). I wouldn't be surprised much if there's little to no difference, but you never know. I've seen certain queries use multiple cores on our database server, while a rewritten version of that same query would only use one core (needless to say, the query that used all 4 cores was a good 3 times faster). Never really quite put my finger on why that is, but if you're working with large result sets, such differences can occur without your knowing about it.
Which of these queries is more efficient, and would a modern DBMS (like SQL Server) make the changes under the hood to make them equal?
SELECT DISTINCT S#
FROM shipments
WHERE P# IN (SELECT P#
FROM parts
WHERE color = ‘Red’)
vs.
SELECT DISTINCT S#
FROM shipments, parts
WHERE shipments.P# = parts.P#
AND parts.color = ‘Red’
The best way to satiate your curiosity about this kind of thing is to fire up Management Studio and look at the Execution Plan. You'll also want to look at SQL Profiler as well. As one of my professors said: "the compiler is the final authority." A similar ethos holds when you want to know the performance profile of your queries in SQL Server - just look.
Starting here, this answer has been updated
The actual comparison might be very revealing. For example, in testing that I just did, I found that either approach might yield the fastest time depending on the nature of the query. For example, a query of the form:
Select F1, F2, F3 From Table1 Where F4='X' And UID in (Select UID From Table2)
yielded a table scan on Table1 and a mere index scan on table 2 followed by a right semi join.
A query of the form:
Select A.F1, A.F2, A.F3 From Table1 A inner join Table2 B on (A.UID=B.UID)
Where A.Gender='M'
yielded the same execution plan with one caveat: the hash match was a simple right join this time. So that is the first thing to note: the execution plans were not dramatically different.
These are not duplicate queries though since the second one may return multiple, identical records (one for each record in table 2). The surprising thing here was the performance: the subquery was far faster than the inner join. With datasets in the low thousands (thank you Red Gate SQL Data Generator) the inner join was 40 times slower. I was fairly stunned.
Ok, how about a real apples to apples? This is the matching inner join - note the extra step to winnow out the duplicates:
Select Distinct A.F1, A.F2, A.F3 From Table1 A inner join Table2 B
on (A.UID=B.UID)
Where A.Gender='M'
The execution plan does change in that there is an extra step - a sort after the inner join. Oddly enough, though, the time drops dramatically such that the two queries are almost identical (on two out of five trials the inner join is very slightly faster). Now, I can imagine the first inner join (without the "distinct") being somewhat longer just due to the fact that more data is being forwarded to the query window - but it was only twice as much (two Table2 records for every Table1 record). I have no good explanation why the first inner join was so much slower.
When you add a predicate to the search on table 2 using a subquery:
Select F1, F2, F3 From Table1 Where F4='X' And UID in
(Select UID From Table2 Where F1='Y')
then the Index Scan is changed to a Clustered Index Scan (which makes sense since the UID field has its own index in the tables I am using) and the percentage of time it takes goes up. A Stream Aggregate operation is also added. Sure enough, this does slow the query down. However, plan caching obviously kicks in as the first run of the query shows a much greater effect than subsequent runs.
When you add a predicate using the inner join, the entire plan changes pretty dramatically (left as an exercise to the reader - this post is long enough). The performance, again, is pretty much the same as that of the subquery - as long as the "Distinct" is included. Similar to the first example, omitting distinct led to a significant increase in time to completion.
One last thing: someone suggested (and your question now includes) a query of the form:
Select Distinct F1, F2, F3 From table1, table2
Where (table1.UID=table2.UID) AND table1.F4='X' And table2.F1='Y'
The execution plan for this query is similar to that of the inner join (there is a sort after the original table scan on table2 and a merge join rather than a hash join of the two tables). The performance of the two is comparable as well. I may need a larger dataset to tease out difference but, so far, I'm not seeing any advantage to this construct or the "Exists" construct.
With all of this being said - your results may vary. I came nowhere near covering the full range of queries that you may run into when I was doing the above tests. As I said at the beginning, the tools included with SQL Server are your friends: use them.
So: why choose one over the other? It really comes down to your personal preferences since there appears to be no advantage for an inner join to a subquery in terms of time complexity across the range of examples I tests.
In most classic query cases I use an inner join just because I "grew up" with them. I do use subqueries, however, in two situations. First, some queries are simply easier to understand using a subquery: the relationship between the tables is manifest. The second and most important reason, though, is that I am often in a position of dynamically generating SQL from within my application and subqueries are almost always easier to generate automatically from within code.
So, the takeaway is simply that the best solution is the one that makes your development the most efficient.
Using IN is more readable, and I recommend using ANSI-92 over ANSI-89 join syntax:
SELECT DISTINCT S#
FROM SHIPMENTS s
JOIN PARTS p ON p.p# = s.p#
AND p.color = 'Red'
Check your explain plans to see which is better, because it depends on data and table setup.
If you aren't selecting anything from the table I would use an EXISTS clause.
SELECT DISTINCT S#
FROM shipments a
WHERE EXISTS (SELECT 1
FROM parts b
WHERE b.color = ‘Red’
AND a.P# = b.P#)
This will optimize out to be the same as the second one you posted.
SELECT DISTINCT S#
FROM shipments,parts
WHERE shipments.P# = parts.P# and parts.color = ‘Red’;
Using IN forces SQL Server to not use indexing on that column, and subqueries are usually slower
I've got a simple query (postgresql if that matters) that retrieves all items for some_user excluding the ones she has on her wishlist:
select i.*
from core_item i
left outer join core_item_in_basket b on (i.id=b.item_id and b.user_id=__some_user__)
where b.on_wishlist is null;
The above query runs in ~50000ms (yep, the number is correct).
If I remove the "b.on_wishlist is null" condition or make it "b.on_wishlist is not null", the query runs in some 50ms (quite a change).
The query has more joins and conditions but this is irrelevant as only this one slows it down.
Some info on the database size:
core_items has ~ 10.000 records
core_user has ~5.000 records
core_item_in_basket has ~2.000
records (of which some 50% has
on_wishlist = true, the rest is null)
I don't have any indexes (except for ids and foreign keys) on those two tables.
The question is: what should I do to make this run faster? I've got a few ideas myself to check out this evening, but I'd like you guys to help if possible, as well.
Thanks!
try using not exists:
select i.*
from core_item i
where not exists (select * from core_item_in_basket b where i.id=b.item_id and b.user_id=__some_user__)
Sorry for adding 2nd answer, but stackoverflow doesn't let me format comments properly, and since formatting is essential, I have to post answer.
Couple of options:
CREATE INDEX q ON core_item_in_basket (user_id, item_id) WHERE on_wishlist is null;
same index, but change order of columns in it.
SELECT i.* FROM core_item i WHERE i.id not in (select item_id FROM core_item_in_basket WHERE on_wishlist is null AND user_id = __some_user__); (this query can benefit from index from point #1, but will not benefit from index #2.
SELECT * from core_item where id in (select id from core_item EXCEPT select item_id FROM core_item_in_basket WHERE on_wishlist is null AND user_id = __some_user__);
Let us know the results :)
You might want to explain more about the purpose of this query - as some techniques make and some don't make sense, depending on use case.
How often are you running it?
Is it run for only 1 user, or you run it for all users in some kind of loop?
Do: explain analyze and put the output on explain.depesz.com so you will see why it is so slow.
Have you tried adding an index on on_wishlist?
It seems that this column needs to be checked for every row in the query. If your tables are that big, this might have quite a significant impact on the query speed.
As you put the on_wishlist condition in the where clause, which will cause it (depending on the what the query planer decides) to be evaluated after the join has been performed, that comparison has to be done for potentially every row resulting from the join. Both the core_items and core_item_in_basket tables are pretty big, and you don't have an index for that column, so there is very little for the query optimizer to do, which probably leads to the excessive query time.
The size of core_user should have no influence (as it is not referenced in the query).