Mystery query fail: Why did this create a massive output? - sql

I was attempting to do some basic Venn Diagram subtraction to compare a temp table to some live data, and see how they were different.
This query blew up to well north of 15 million returned rows, and I noticed it was duplicating (by 10,000x or more) a known unique field, a Globally Unique Identifier, which is how I could verify that rows were being duplicated and that something had gone very wrong with my query. I was expecting to get at most 200 rows returned:
select a.*
from TableOfLiveData a
inner join #TempDataToBeSubtracted b
on a.GUID <> b.guidTemp --I suspect the issue is here
where {here was a limiting condition that should have reduced my live
data to a "pre-join" count(*) of 20,000 at most...}
After I hit Execute the query ran much longer than expected and I could see that millions of rows were being returned before I had to cancel out.
Let me know what the obvious thing is!?!?
Edit: FYI, if the where clause were not included, I would expect a VAST number of rows returned...

Although the intent of your query is clear, the <> join condition effectively gives you a Cartesian product (n x m rows): each live row is paired with every temp row whose GUID differs, which is why your unique field appeared duplicated. The where clause is only applied after that colossal intermediate result has been produced, so the query is very, very slow.
A better approach is to do an outer join on the key columns, and then discard all successful matches by filtering for the missed ones:
select a.*
from TableOfLiveData a
left join #TempDataToBeSubtracted b on b.guidTemp = a.GUID
where a.field1 = 3
and a.field2 = 1515
and b.guidTemp is null -- only returns rows that *don't* match
This works because when an outer join misses, you still get the row from the main table, and all columns from the joined table are null.
Creating an index on (field1, field2) will improve performance.
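For example (a sketch only; the index name is my own invention, so adjust it to your naming convention):
-- supports the field1/field2 filter in the query above
CREATE NONCLUSTERED INDEX IX_TableOfLiveData_field1_field2
ON TableOfLiveData (field1, field2);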

Thank you @Lamak and @MartinSmith for your comments that solved this problem.
By using a 'not equals' in my ON clause, I was selecting, for every row in LiveTable, every row in my #TempTable with a different GUID, so each live row came back not just once as I intended, but roughly once per entry in my #TempTable, multiplying my results by about 20,000 in this case (the cardinality of the #TempTable).
To fix this, I did a simple subquery on my #TempTable using a NOT IN statement, as recommended in the comments. This query finished in under a minute and returned under 100 rows, which was much more in line with my expectation:
select a.*
from TableOfLiveData a
where a.GUID not in (select b.guidTemp from #TempDataToBeSubtracted b)
and {subsequent constraint statement not relevant to question}
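For reference, an equivalent NOT EXISTS formulation (just a sketch, reusing the same placeholder constraint) avoids the surprises NOT IN can cause if guidTemp could ever be NULL:
select a.*
from TableOfLiveData a
where not exists (select 1
                  from #TempDataToBeSubtracted b
                  where b.guidTemp = a.GUID)
and {subsequent constraint statement not relevant to question}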

Related

Understanding when to use a subquery over a join

I seem to be missing something. Most articles I read say you should use a join instead of a sub-select. However, running a quick experiment myself shows a big win for the sub-query when it comes to execution time.
Trying to get all first names of people that have made a bid (I presume the tables speak for themselves) results in the following.
This join takes 10 seconds
select U.firstname
from Bid B
inner join [User] U on U.userName = B.[user]
This query with sub-query takes 3 seconds
select firstname
from [User]
where userName in (select [user] from bid)
Why is my experiment not in line with what I keep reading everywhere, or am I missing something?
Experimenting further, I found that execution times are the same after adding DISTINCT to both queries.
They're not the same thing. In the query with joins you can potentially multiply rows or have rows entirely removed from the results.
Inner Join removes rows on non-matched keys. It also multiplies rows for any matched keys that repeat in either one or both of the tables being joined. Inner Join therefore goes through the additional step of multiplying and removing rows.
The subquery you used is a plain SELECT. Since there are no filters in a WHERE clause, it is as fast as a simple SELECT, and since there are no joins, you get results as fast as they can be selected.
Some may argue that outer joins return NULLs similarly to sub-queries, but they can still multiply rows. Hence, sub-queries and joins are not the same thing.
In the queries you provided, you want to use the 2nd query (the one with the subquery) since it doesn't multiply or remove rows.
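To illustrate with the tables from the question (a sketch; note that DISTINCT on firstname alone can still merge two different users who happen to share a first name):
-- Join version: DISTINCT suppresses the row multiplication, which is why
-- the timings converge once DISTINCT is added to both queries
select distinct U.firstname
from Bid B
inner join [User] U on U.userName = B.[user]

-- Semi-join version: equivalent to the IN subquery and never multiplies rows
select U.firstname
from [User] U
where exists (select 1 from Bid B where B.[user] = U.userName)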
Good Read for Subquery vs Inner Join
https://www.essentialsql.com/subquery-versus-inner-join/

SQL Join between tables with conditions

I'm thinking about which is the best way (considering execution time) to do a join between 2 or more tables with some conditions. I came up with these three ways:
FIRST WAY:
select * from
TABLE A inner join TABLE B on A.KEY = B.KEY
where
B.PARAM=VALUE
SECOND WAY
select * from
TABLE A inner join TABLE B on A.KEY = B.KEY
and B.PARAM=VALUE
THIRD WAY
select * from
TABLE A inner join (Select * from TABLE B where B.PARAM=VALUE) J ON A.KEY=J.KEY
Consider that the tables have more than 1 million rows.
What is your opinion? Which is the right way, if one exists?
Usually, putting the condition in the where clause or in the join condition makes no noticeable difference for inner joins.
If you are using outer joins, putting the condition in the where clause improves query time, because rows that don't meet the condition are removed from the result set, so the result set becomes smaller.
But if you put the condition in the join clause of a left outer join, no rows are removed and the result set is bigger than when the condition is in the where clause.
For clarification, follow the example below.
create table A
(
ano NUMBER,
aname VARCHAR2(10),
rdate DATE
);
----A data
insert into A
select 1,'Amand',to_date('20130101','yyyymmdd') from dual;
commit;
insert into A
select 2,'Alex',to_date('20130101','yyyymmdd') from dual;
commit;
insert into A
select 3,'Angel',to_date('20130201','yyyymmdd') from dual;
commit;
create table B
(
bno NUMBER,
bname VARCHAR2(10),
rdate DATE
);
insert into B
select 3,'BOB',to_date('20130201','yyyymmdd') from dual;
commit;
insert into B
select 2,'Br',to_date('20130101','yyyymmdd') from dual;
commit;
insert into B
select 1,'Bn',to_date('20130101','yyyymmdd') from dual;
commit;
First of all, we have a normal query which joins the two tables:
select * from a inner join b on a.ano=b.bno
The result set has 3 records.
Now please run the queries below:
select * from a inner join b on a.ano=b.bno and a.rdate=to_date('20130101','yyyymmdd')
select * from a inner join b on a.ano=b.bno where a.rdate=to_date('20130101','yyyymmdd')
As you can see, the row counts of the two results above are the same, and in my experience there is no noticeable performance difference even for large volumes of data.
Now please run the queries below:
select * from a left outer join b on a.ano=b.bno and a.rdate=to_date('20130101','yyyymmdd')
In this case, the count of output records will be equal to the number of records in table A.
select * from a left outer join b on a.ano=b.bno where a.rdate=to_date('20130101','yyyymmdd')
In this case, records of A which don't meet the condition are removed from the result set, and as I said the result set has fewer records (in this case 2 records).
From the above examples we can draw the following conclusions:
1. For inner joins, there is no special difference between putting the condition in the where clause or in the join clause, but try to order the tables in the from clause so that the intermediate results have the minimum row counts:
(http://www.dba-oracle.com/art_dbazine_oracle10g_dynamic_sampling_hint.htm)
2. For outer joins, whenever you don't care about the exact result row count (i.e., you don't mind losing records of table A that have no paired records in table B, for which table B's fields would be null in the result set), put the condition in the where clause; it removes the rows that don't meet the condition and improves query time by decreasing the result row count.
But in special cases you HAVE TO put the condition in the join clause: for example, if you want your result row count to equal table A's row count (a common requirement in ETL processes), you HAVE TO put the condition in the join clause.
3. Avoiding subqueries is recommended by lots of reliable resources and expert programmers. A subquery usually increases query time, so use one only when its result data set is small.
I hope this will be useful:)
1M rows really isn't that much, especially if you have sensible indexes. I'd start off by making your queries as readable and maintainable as possible, and only start optimizing if you notice a performance problem with the query (and as Gordon Linoff said in his comment, it's doubtful there would even be a difference between the three).
It may be a matter of taste, but to me, the third way seems clumsy, so I'd cross it out. Personally, I prefer using JOIN syntax for the joining logic (i.e., how A and B's rows are matched) and WHERE for filtering (i.e., once matched, which rows interest me), so I'd go for the first way. But again, it really boils down to personal taste and preferences.
You need to look at the execution plans for the queries to judge which is the most computationally efficient. As pointed out in the comments, you may find they are equivalent. Here is some information on Oracle execution plans. Depending on what editor/IDE you use, there may be a shortcut for this, e.g. F5 in PL/SQL Developer.

SQL query either runs endlessly or returns no values

having trouble with a multi-table query today. I tried writing it myself and it didn't seem to work, so I selected all of the columns in the Management Studio Design view. The code SHOULD work but alas it doesn't. If I run this query, it seems to just keep going and going. I left my desk for a minute and when I came back and stopped the query, it had returned something like 2,000,000 rows (there are only about 120,000 in the PODetail table!!):
SELECT PODetail.OrderNum, PODetail.VendorNum, vw_orderHistory.Weight, vw_orderHistory.StdSqft, vw_orderHistory.ReqDate, vw_orderHistory.City,
vw_orderHistory.State, FB_FreightVend.Miles, FB_FreightVend.RateperLoad
FROM PODetail CROSS JOIN
vw_orderHistory CROSS JOIN
FB_FreightVend
ORDER BY ReqDate
Not only that, but it seems that every record had an OrderNum of 0 which shouldn't be the case. So I tried to exclude it...
SELECT PODetail.OrderNum, PODetail.VendorNum, vw_orderHistory.Weight, vw_orderHistory.StdSqft, vw_orderHistory.ReqDate, vw_orderHistory.City,
vw_orderHistory.State, FB_FreightVend.Miles, FB_FreightVend.RateperLoad
FROM PODetail CROSS JOIN
vw_orderHistory CROSS JOIN
FB_FreightVend
WHERE PODetail.OrderNum <> 0
ORDER BY ReqDate
While it executes successfully (no errors), it also returns no records whatsoever. What's going on here? I'm also curious about the query's CROSS JOIN. When I tried writing this myself, I first used "WHERE PODetail.OrderNum = vw_orderHistory.OrderNum" to join those tables but I got the same no results issue. When I tried using JOIN, I got errors regarding "multi-part identifier could not be bound."
A cross join returns a zillion records: the product of the number of records in each table. That might be 10,000 * 100,000 * 100, which is a very big number.
The one caveat is when a table is empty. Then the row count for that table is 0, and 0 times anything is 0, so no rows are returned. And no rows can be returned quite quickly.
I think you need to learn what join really does in SQL. Then you need to reimplement this with the correct join conditions. Not only will the query run faster, but it will return accurate results.
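For illustration only, here's a sketch of what that might look like; the join keys are assumptions (the question only hints that OrderNum relates PODetail to vw_orderHistory, and I'm guessing FB_FreightVend is related by VendorNum), so substitute the real keys:
SELECT PODetail.OrderNum, PODetail.VendorNum, vw_orderHistory.Weight, vw_orderHistory.StdSqft, vw_orderHistory.ReqDate, vw_orderHistory.City,
vw_orderHistory.State, FB_FreightVend.Miles, FB_FreightVend.RateperLoad
FROM PODetail
INNER JOIN vw_orderHistory ON vw_orderHistory.OrderNum = PODetail.OrderNum -- assumed key
INNER JOIN FB_FreightVend ON FB_FreightVend.VendorNum = PODetail.VendorNum -- assumed key
ORDER BY vw_orderHistory.ReqDate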
Do not use cross joins, especially on large tables. The link below will help.
http://www.codinghorror.com/blog/2007/10/a-visual-explanation-of-sql-joins.html
Also, "multi-part identifier could not be bound" means the column might not exist as referenced. Verify that the column exists, its datatype, and the table name or alias you qualify it with in the join.
With the <> 0 condition, all non-corresponding values from PODetail will be omitted.
Use (OrderNum <> 0 or OrderNum is null)
Avoid CROSS JOINS like the plague. Explicitly define your Order, PO and VendorFreight JOINS.

Sql query optimization using IN over INNER JOIN

Given:
Table y
id int clustered index
name nvarchar(25)
Table anothertable
id int clustered Index
name nvarchar(25)
Table someFunction
does some math then returns a valid ID
Compare:
SELECT y.name
FROM y
WHERE dbo.SomeFunction(y.id) IN (SELECT anotherTable.id
FROM AnotherTable)
vs:
SELECT y.name
FROM y
JOIN AnotherTable ON dbo.SomeFunction(y.id) ON anotherTable.id
Question:
While timing these two queries I found that on large data sets the first query using IN is much faster than the second query using an INNER JOIN. I do not understand why; can someone help explain please?
Execution Plan
Generally speaking IN is different from JOIN in that a JOIN can return additional rows where a row has more than one match in the JOIN-ed table.
From your estimated execution plan though it can be seen that in this case the 2 queries are semantically the same
SELECT
A.Col1
,dbo.Foo(A.Col1)
,MAX(A.Col2)
FROM A
WHERE dbo.Foo(A.Col1) IN (SELECT Col1 FROM B)
GROUP BY
A.Col1,
dbo.Foo(A.Col1)
versus
SELECT
A.Col1
,dbo.Foo(A.Col1)
,MAX(A.Col2)
FROM A
JOIN B ON dbo.Foo(A.Col1) = B.Col1
GROUP BY
A.Col1,
dbo.Foo(A.Col1)
Even if duplicates are introduced by the JOIN then they will be removed by the GROUP BY as it only references columns from the left hand table. Additionally these duplicate rows will not alter the result as MAX(A.Col2) will not change. This would not be the case for all aggregates however. If you were to use SUM(A.Col2) (or AVG or COUNT) then the presence of the duplicates would change the result.
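A quick illustration of that point with hypothetical values:
-- If the join duplicates a row with Col2 = 7 three times, MAX is unaffected
-- but SUM and COUNT change
SELECT MAX(v.Col2) AS max_unchanged, -- 7 either way
       SUM(v.Col2) AS sum_inflated,  -- 21 instead of 7
       COUNT(*)    AS row_count      -- 3 instead of 1
FROM (VALUES (7), (7), (7)) AS v(Col2)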
It seems that SQL Server doesn't have any logic to differentiate between aggregates such as MAX and those such as SUM and so quite possibly it is expanding out all the duplicates then aggregating them later and simply doing a lot more work.
The estimated number of rows being aggregated is 2893.54 for IN vs 28271800 for JOIN but these estimates won't necessarily be very reliable as the join predicate is unsargable.
Your second query is a bit funny - can you try this one instead??
SELECT y.name
FROM dbo.y
INNER JOIN dbo.AnotherTable a ON a.id = dbo.SomeFunction(y.id)
Does that make any difference?
Otherwise: look at the execution plans! And possibly post them here. Without knowing a lot more about your tables (amount and distribution of data, etc.) and your system (RAM, disk, etc.), it's really, really hard to give a "globally" valid statement.
Well, for one thing: get rid of the scalar UDF that is implied by dbo.SomeFunction(y.id). That will kill your performance real good. Even if you replace it with a one-row inline table-valued function it will be better.
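A minimal sketch of that inline TVF rewrite (the function name and body are placeholders, since the question doesn't show what dbo.SomeFunction actually computes):
-- Hypothetical inline table-valued function; replace the placeholder expression
-- with the real calculation
CREATE FUNCTION dbo.SomeFunction_itvf (@id int)
RETURNS TABLE
AS
RETURN (SELECT @id AS result); -- placeholder for the real math
GO
SELECT y.name
FROM dbo.y
CROSS APPLY dbo.SomeFunction_itvf(y.id) AS f
WHERE f.result IN (SELECT id FROM dbo.AnotherTable)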
As for your actual question, I have found similar results in other situations and have been similarly perplexed. The optimizer just treats them differently; I'll be interested to see what answers others provide.

Slow query with left outer join and is null condition

I've got a simple query (postgresql if that matters) that retrieves all items for some_user excluding the ones she has on her wishlist:
select i.*
from core_item i
left outer join core_item_in_basket b on (i.id=b.item_id and b.user_id=__some_user__)
where b.on_wishlist is null;
The above query runs in ~50000ms (yep, the number is correct).
If I remove the "b.on_wishlist is null" condition or make it "b.on_wishlist is not null", the query runs in some 50ms (quite a change).
The query has more joins and conditions but this is irrelevant as only this one slows it down.
Some info on the database size:
core_items has ~10,000 records
core_user has ~5,000 records
core_item_in_basket has ~2,000 records (of which some 50% have on_wishlist = true, the rest is null)
I don't have any indexes (except for ids and foreign keys) on those two tables.
The question is: what should I do to make this run faster? I've got a few ideas myself to check out this evening, but I'd like you guys to help if possible, as well.
Thanks!
try using not exists:
select i.*
from core_item i
where not exists (select * from core_item_in_basket b where i.id=b.item_id and b.user_id=__some_user__)
Sorry for adding a second answer, but Stack Overflow doesn't let me format comments properly, and since formatting is essential, I have to post an answer.
Couple of options:
CREATE INDEX q ON core_item_in_basket (user_id, item_id) WHERE on_wishlist is null;
The same index, but with the order of the columns reversed: (item_id, user_id).
SELECT i.* FROM core_item i WHERE i.id not in (select item_id FROM core_item_in_basket WHERE on_wishlist is null AND user_id = __some_user__); (this query can benefit from the index from point #1, but will not benefit from index #2).
SELECT * from core_item where id in (select id from core_item EXCEPT select item_id FROM core_item_in_basket WHERE on_wishlist is null AND user_id = __some_user__);
Let us know the results :)
You might want to explain more about the purpose of this query, as some techniques make sense and some don't, depending on the use case.
How often are you running it?
Is it run for only 1 user, or you run it for all users in some kind of loop?
Do: explain analyze and put the output on explain.depesz.com so you will see why it is so slow.
Have you tried adding an index on on_wishlist?
It seems that this column needs to be checked for every row in the query. If your tables are that big, this might have quite a significant impact on the query speed.
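For reference, that could be as simple as (the index name is arbitrary):
CREATE INDEX core_item_in_basket_on_wishlist_idx
ON core_item_in_basket (on_wishlist);
A composite or partial index like the one suggested in the other answer may help even more, since on_wishlist on its own is not very selective.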
Since you put the on_wishlist condition in the where clause, it will (depending on what the query planner decides) be evaluated after the join has been performed, so that comparison has to be done for potentially every row resulting from the join. Both the core_items and core_item_in_basket tables are pretty big, and you don't have an index on that column, so there is very little for the query optimizer to work with, which probably leads to the excessive query time.
The size of core_user should have no influence (as it is not referenced in the query).