Should I avoid IN() because slower than EXISTS() [duplicate] - sql

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
SQL Server IN vs. EXISTS Performance
Should I avoid IN() because slower than EXISTS()?
SELECT * FROM TABLE1 t1 WHERE EXISTS (SELECT 1 FROM TABLE2 t2 WHERE t1.ID = t2.ID)
VS
SELECT * FROM TABLE1 t1 WHERE t1.ID IN(SELECT t2.ID FROM TABLE2 t2)
From my investigation, I set SHOWPLAN_ALL. I get the same execution plan and estimation cost. The index(pk) is used, seek on both query. No difference.
What are other scenarios or other cases to make big difference result from both query? Is optimizer so optimization for me to get same execution plan?

Do neither. Do this:
SELECT DISTINCT T1.*
FROM TABLE1 t1
JOIN TABLE2 t2 ON t1.ID = t2.ID;
This will out perform anything else by orders of magnitude.

Both queries will produce the same execution plan (assuming no indexes were created): two table scans and one nested loop (join).
The join, suggested by Bohemian, will do a Hash Match instead of the loop, which I've always heard (and here is a proof: Link) is the worst kind of join.
Among IN and EXIST (your actuall question), EXISTS returs better performance (take a lok at: Link)

If your table T2 has a lot of records, EXISTS is the better approach hands down, because when your database find a record that match your requirement, the condition will be evaluated to true and it stopped the scan from T2. However, in the IN clause, you're scanning your Table2 for every row in table1.
IN is better than Exists when you have a bunch of values, or few values in the subquery.
Expandad a little my answer, based on Ask Tom answer:
In a Select with in, for example:
Select * from T1 where x in ( select y from T2 )
is usually processed as:
select *
from t1, ( select distinct y from t2 ) t2
where t1.x = t2.y;
The subquery is evaluated, distinct'ed, indexed (or hashed or sorted) and then joined to the original table (typically).
In an exist query like:
select * from t1 where exists ( select null from t2 where y = x )
That is processed more like:
for x in ( select * from t1 )
loop
if ( exists ( select null from t2 where y = x.x )
then
OUTPUT THE RECORD
end if
end loop
It always results in a full scan of T1 whereas the first query can make use of an index on T1(x).
When is where exists appropriate and in appropriate?
Use EXISTS when... Subquery T2 is huge and takes a long time and T1 is relatively small and executing (select null from t2 where y = x.x ) is very very fast
Use IN when... The result of the subquery is small -- then IN is typicaly more appropriate.
If both the subquery and the outer table are huge -- either might work as well as the other -- depends on the indexes and other factors.

Related

SQL SELECT compare values from two tables (without UNION ALL)

I have table T1:
ID IMPACT
1 3
I have table T2
PRIORITY URGENCY
1 2
I need to do the SELECT from T1 table.
I would like to get all the rows from T1 where IMPACT is greater than PRIORITY from T2.
I am working in some IBM application where it is only possible to start with SQL statement after the WHERE clause from the first table T1.
So query (unfortunately) must always start with "SELECT * FROM T1 WHERE..."
This cannot be changed (please have that in mind).
This means that I cannot use some JOIN or UNION ALL statement after the "FROM T1" part because I can start to write SQL query only after the WHERE clause.
SELECT * FROM T1
WHERE
IMPACT> SELECT PRIORITY FROM T2 WHERE URGENCY=2
But I am getting an error for this statement.
Please is it possible to write SQL query starting with:
SELECT * FROM T1
WHERE
You want a subquery, so all you need are parentheses:
SELECT *
FROM T1
WHERE IMPACT > (SELECT T2.PRIORITY FROM T2 WHERE T2.URGENCY = 2)
This assumes that the subquery returns one row (or zero rows, in which case nothing is returned). If the subquery can return more than one row, you should ask another question and be very explicit about what you want done.
One reasonable interpretation (for more than one row) is:
SELECT *
FROM T1
WHERE IMPACT > (SELECT MAX(T2.PRIORITY) FROM T2 WHERE T2.URGENCY = 2)
I would use exists:
select t1.*
from t1
where exists (select 1 from t2 where t1.IMPACT > t2.PRIORITY);

SQL Query Performance Join with condition

calling all sql experts. I have the following select statement:
SELECT 1
FROM table1 t1
JOIN table2 t2 ON t1.id = t2.id
WHERE t1.field = xyz
I'm a little bit worried about the performance here. Is the where clause evaluated before or after the join? If its evaluated after, is there way to first evaluate the where clause?
The whole table could easily contain more than a million entries but after the where clause it may be only 1-10 entries left so in my opinion it really is a big performance difference depending on when the where clause is evaluated.
Thanks in advance.
Dimi
You could rewrite your query like this:
SELECT 1
FROM (SELECT * FROM table1 WHERE field = xyz) t1
JOIN table2 t2 ON t1.id = t2.id
But depending on the database product the optimiser might still decide that the best way to do this is to JOIN table1 to table2 and then apply the constraint.
For this query:
SELECT 1
FROM table1 t1 JOIN
table2 t2
ON t1.id = t2.id
WHERE t1.field = xyz;
The optimal indexes are table1(field, id), table2(id).
How the query is executed depends on the optimizer. It is tasked with choosing the based execution plan, given the table statistics and environment.
Each DBMS has its own query optimizer. So by logic of things in case like yours WHERE will be executed first and then JOINpart of the query
As mentioned in the comments and other answers with performance the answer is always "it depends" depending on your dbms and the indexing of the base tables the query may be fine as is and the optimizer may evaluate the where first. Or the join may be efficient anyway if the indexes cover the join requirements.
Alternatively you can force the behavior you require by reducing the dataset of t1 before you do the join using a nested select as Richard suggested or adding the t1.field = xyz to the join for example
ON t1.field = xyz AND t1.id = t2.id
personally if i needed to reduce the dataset before the join I would use a cte
With T1 AS
(
SELECT * FROM table1
WHERE T1.Field = 'xyz'
)
SELECT 1
FROM T1
JOIN Table2 T2
ON T1.Id = T2.Id

Multiple Indexes or Single One in a comparison between two large tables

I'm going to compare two tables on Oracle with about 10 million records in each one.
t1 (anumber, bnumber, cdate, ctime, duration)
t2 (fcode, anumber, bnumber, mdate, mtime, odate, otime, duration)
Rows in these tables are the information of calls from a number to the other for a specific month (august 2012).
For example (12345,9876,120821,120000,68) indicates a call from anumber=12345 to bnumber=9876 in date=2012/08/21 and time=12:08:21 which lasted for 68 seconds.
I want to find records that don't exists in one of these tables but exists in the other. My comparison query is like this
select t1.*
from table1 t1
where not exists(select t1.* from table2 t2
where t1.anumber = t2.anumber
and t1.cdate = t2.mdate
and t1.duration = t2.duration);
and my questions are:
Which kind of indexes is better to use? Multiple index on columns (anumber,cdate,duration) or single index on each of them?
Considering that the third column is duration of a call which could have a wide range, is it worth to create an index on it? doesn't it slower down my query?
What is the fastest way to find the differences between these table?
Is it better to loop through dates and execute my query with (cdate='A DATE MONTH') added to the where clause?
Compared to the above query how much slower is this one:
select t1.*
from table1 t1
where not exists (select t1.*
from table2 t2
where t1.anumber = t2.anumber
and t1.bnumber like '%t2.bnumber%'
and t1.cdate = t2.mdate
and t1.duration = t2.duration);
select * from t1
minus
select * from t2
don't use indexes, you want to scan all 10 million rows in both tables, therefore a TABLE_ACCESS_FULL is rather in this case.
try this way:
select t1.*
from table1 t1
where (t1.anumber, t1.date, t1.duration) not in (select t2.anumber, t2.date, t2.duration
from table2 t2);
see the explain, if good then don't creat indexes, or create like this
create index idx_anum_dat_dur on table2(anumber, date, duration)
query performance depends on the result of which will return, you must to see the explain and try different variants
Your query is equivalient to:
select t1.*
from table1 t1
WHERE t1.bnumber like '%t2.bnumber%' -- << like 'Literal' !!!
AND NOT EXISTS (
select *
from table2 t2
where t2.anumber = t1.anumber
and t2.cdate = t1.mdate
and t2.duration = t1.duration
);
(table references inside quotes are not expanded! Is this the OP's intention ??? )

Most Efficiently Written Query in SQL

I have two tables, Table1 and Table2 and am trying to select values from Table1 based on values in Table2. I am currently writing my query as follows:
SELECT Value From Table1
WHERE
(Key1 in
(SELECT KEY1 FROM Table2 WHERE Foo = Bar))
AND
(Key2 in
(SELECT KEY2 FROM Table2 WHERE Foo = Bar))
This seems a very inefficent way to code the query, is there a better way to write this?
It depends on how the table(s) are indexed. And it depends on what SQL implementation you're using (SQL Server? MySq1? Oracle? MS Access? something else?). It also depends on table size (if the table(s) are small, a table scan may be faster than something more advanced). It matters, too, whether or not the indices are covering indices (meaning that the test can be satisfied with data in the index itself, rather than requiring an additional look-aside to fetch the corresponding data page.) Unless you look at the execution plan, you can't really say that technique X is "better" than technique Y.
However, in general, for this case, you're better off using correlated subqueries, thus:
select *
from table1 t1
where exists( select *
from table2 t2
where t2.key1 = t1.key1
)
and exists( select *
from table2 t2
where t2.key2 = t1.key2
)
A join is a possibility, too:
select t1.*
from table1 t1
join table2 t2a = t2a.key1 = t1.key1 ...
join table2 t2b = t2b.key2 = t1.key2 ...
though that will give you 1 row for every matching combination, though that can be alleviated by using the distinct keyword. It should be noted that a join is not necessarily more efficient than other techniques. Especially, if you have to use distinct as that requires additional work to ensure distinctness.

SQL Optimizer / Execution Plan - Repeated Subqueries

NOTE: I'm currently running my queries in question on a sqlite3 DB, though answers from expertise in any other DBMS will be welcome insight...
I was wondering if the query optimizer makes any attempt to identify repeated queries/subqueries and run them only once if so.
Here is my example query:
SELECT *
FROM table1 AS t1
WHERE t1.fk_id =
(
SELECT t2.fk_id
FROM table2 AS t2
WHERE t2.id = 1111
)
OR t1.fk_id =
(
SELECT local_id
FROM ID_MAP
WHERE remote_id =
(
SELECT t2.fk_id
FROM table2 AS t2
WHERE t2.id = 1111
)
);
Will the nested query
SELECT t2.fk_id
FROM table2 AS t2
WHERE t2.id = 1111
be run only once (and its results cached for further access) ?
Its not a big deal in this example, since its a simple query that executes only twice, however I need it to run
about 4-5 more times (x2, twice for each child record, so 8-10 really) in my actual program (its grabbing all child records (table1)
associated to a parent record (table2), bound by a foreign key. Its also checking an id mapping table to make sure it queries
for both a locally generated id, as well as the real/updated/new key).
I really appreciate any help with this, thank you.
SQLite has a very simple query optimizer, and does not even try to detect identical subqueries:
> create table t(x);
> explain query plan
select * from t
where x in (select x from t) or
x in (select x from t);
0|0|0|SCAN TABLE t (~500000 rows)
0|0|0|EXECUTE LIST SUBQUERY 1
1|0|0|SCAN TABLE t (~1000000 rows)
0|0|0|EXECUTE LIST SUBQUERY 2
2|0|0|SCAN TABLE t (~1000000 rows)
The same applies to CTEs and views; if the performance actually matters, your best bet is to create a temporary table for the result of the subquery.
As you asked for insight from other DBs....
In Oracle DBMS, any independent subquery will be executed only once.
SELECT t2.fk_id
FROM table2 AS t2
WHERE t2.id = 1111 -- The result will be the same for any row in t1.
Dependant subqueries will need to executed repeatedly, of course.
Example of dependent subquery:
SELECT t2.fk_id
FROM table2 AS t2
WHERE t2.id = t1.t2_id -- t1.t2_id will have different values for different rows in t1.