Performance of "NOT IN" in SQL query

I'm quite new to SQL query analysis. Recently I stumbled upon a performance issue with one of my queries, and I'm wondering whether my thought process is correct here and why the Query Optimizer works the way it does in this case.
I'm on SQL Server 2012.
I've got a SQL query that looks like this:
SELECT * FROM T1
WHERE Id NOT IN
(SELECT DISTINCT T1_Id from T2);
It takes around 30 seconds to run on my test server.
While trying to understand what was taking so long, I rewrote it using a temp table, like this:
SELECT DISTINCT T1_Id
INTO #temp from T2;
SELECT * FROM T1
WHERE Id NOT IN
(SELECT T1_Id from #temp);
It runs a hundred times faster than the first one.
Some info about the tables:
T2 has around 1 million rows, and there are around 1000 distinct values of T1_Id there. T1 has a little over 1000 rows. Initially the only index on T2 was a clustered index on a column other than T1_Id, so T1_Id wasn't indexed at all.
Looking at the execution plans, I saw that for the first query there were as many index scans as there are distinct T1_id values, so basically SQL Server performs about 1000 index scans in this case.
That made me realize that adding a non-clustered index on T1_Id might be a good idea (the index should've been there from the start, admittedly), and adding the index indeed made the original query run much faster, since it now does nonclustered index seeks.
What I'm looking for is to understand the Query Optimizer's behavior for the original query - does it look reasonable? Are there any ways to make it work similarly to the temporary-table variant I posted here, rather than doing multiple scans? Am I just misunderstanding something?
Thanks in advance for any links to similar discussions, as I haven't really found anything useful.

NOT IN is intuitive but slow. This construct will generally run quicker:
select * from t1
where id in
(select id from t1
except
select t1_id from t2);

The actual performance will likely vary from the estimates, but neither of your queries will outperform this query, which is the de facto standard approach:
SELECT T1.* FROM T1
LEFT JOIN T2 ON T1.Id = T2.T1_Id
WHERE T2.T1_Id IS NULL
This uses a proper join, which will perform very well (assuming the foreign key column is indexed), and, because it's a left (outer) join, the WHERE condition selects only those rows from T1 that don't join (all columns of the right-side table are NULL when the join misses).
Note also that DISTINCT is not required, since there is only ever one row returned from T1 for missed joins.
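For comparison (this is just an additional sketch, not part of the answer above), a NOT EXISTS version is the other standard anti-join form; SQL Server typically gives it a plan similar to the LEFT JOIN variant, and unlike NOT IN it isn't derailed by NULLs in T2.T1_Id:
SELECT T1.*
FROM T1
WHERE NOT EXISTS (
    SELECT 1
    FROM T2
    WHERE T2.T1_Id = T1.Id
);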

The SQL Server optimizer needs to understand the size of tables for some of its decisions.
When doing a NOT IN with a subquery, those estimates may not be entirely accurate. When the subquery is actually materialized into a temp table, the row count is highly accurate.
I think the first query would be faster with an index on
Table2(t1_id)
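As a sketch of that suggestion (the index name is made up; the table and column names are taken from the question):
CREATE NONCLUSTERED INDEX IX_T2_T1_Id
    ON T2 (T1_Id);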

This is just a guess, but hopefully an educated one...
The DBMS probably concluded that searching a large table a small number of times is faster than searching a small table a large number of times. That's why you got ~1000 searches on T2 instead of ~1,000,000 searches on T1.
When you added an index on T2.T1_Id, that turned ~1000 table scans (or full clustered index scans if the table is clustered) into ~1000 index seeks, which made things much faster, as you already noted.
I'm not sure why it didn't attempt a hash join (or a merge join after the index was added) - perhaps it had stale statistics and badly overestimated the number of distinct values?
One more thing: is there a FOREIGN KEY on T2.T1_Id referencing T1.Id? I know Oracle can use FKs to improve the accuracy of cost estimates (in this case, it could infer that the cardinality of T2.T1_Id cannot be greater than that of T1.Id). If SQL Server does something similar and the FK is missing (or untrusted), that could contribute to SQL Server thinking there are more distinct values than there really are.
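For illustration (the constraint name is hypothetical, and this assumes T2.T1_Id really is meant to reference T1.Id), a trusted foreign key could be added like this:
ALTER TABLE T2 WITH CHECK
    ADD CONSTRAINT FK_T2_T1 FOREIGN KEY (T1_Id) REFERENCES T1 (Id);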
(BTW, it would have helped if you posted the actual query plans and the database structure.)

Related

How to reduce # of columns SQL has to look through while joining 2 tables?

I'm joining two tables together using an inner join, but given that these tables are billions of rows long, I was hoping to speed up my query and find a way to reduce the columns SQL has to comb through. Is there a way to, in a join, only have SQL search through certain columns? I understand you can do it through SELECT, but rather than selecting columns from the join, I was hoping to reduce the number of columns being searched.
Ex)
SELECT *
FROM table1 t1
JOIN table2 t2
ON t1.suite = t2.suite
AND t1.region = t2.region
Currently table1 and table2 both have over 20 columns, but I only need the 3 columns from each table.
I'm using Presto, btw. Thanks and stay safe :)
If you create an index on each table with both suite and region in the same index, plus an INCLUDE clause for any additional result columns you need, SQL Server can complete the query using only the indexes. This is called a covering index, and it helps the query's performance because more "rows" (index entries) fit in an 8 KB page than entire real rows, which in turn reduces the total number of page reads needed to complete the query.
Be aware, though, that you pay for this with extra work at INSERT/UPDATE/DELETE time to keep the indexes up to date, extra storage for the indexes, and extra cache RAM use if any part of the indexes ends up in the buffer cache. With potentially billions of index entries, that cost could be significant, and it may outweigh the gains for this one query, or may require updates to your server capacity planning.
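As a rough sketch of such covering indexes (SQL Server syntax; the index names and included columns are placeholders, since the real column names weren't given):
CREATE NONCLUSTERED INDEX IX_table1_suite_region
    ON table1 (suite, region)
    INCLUDE (col_1, col_2, col_3);  -- the three columns you actually need from table1

CREATE NONCLUSTERED INDEX IX_table2_suite_region
    ON table2 (suite, region)
    INCLUDE (col_1, col_2, col_3);  -- likewise for table2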
SELECT T1.COL_1,T1.COL_2,T1.COL_3,T2.COL_1,T2.COL_2,T2.COL_3
FROM TABLE_1 T1
JOIN TABLE_2 T2 ON t1.suite = t2.suite AND t1.region = t2.region
And most books about the SQL language carry a WARNING: "Do not use * in production code."

Materialized View Performance of Exists vs In

I did some googling and couldn't find a clear answer to an Oracle performance question. Maybe we can document it here. I am building an MV that is pretty simple but on fairly large tables. The query, like many things, can be written in more than one way. In my case, when written as a SELECT statement the two solutions have similar costs / execution plans, but when placed inside a CREATE MATERIALIZED VIEW the execution time changes drastically. Any insight into why?
Tab1 is approx 40M records.
Tab2 is approx 8M records.
field1 is the primary key on Tab1; it is not a PK or unique on Tab2, but Tab2 does have an index on this field.
field2 is not a key, nor is it indexed, on either table (boo).
Queries are:
Q1:
SELECT
T1.Several_Fields
FROM
SCHEMA1.tab1 T1
WHERE T1.field2 like 'EXAMPLE%'
AND T1.field1 not in (
SELECT T2.field1
FROM SCHEMA1.tab2 T2
)
;
Q2:
SELECT
T1.Several_Fields
FROM
SCHEMA1.tab1 T1
WHERE T1.field2 like 'EXAMPLE%'
AND not exists (
SELECT 1
FROM SCHEMA1.tab2 T2
WHERE T1.field1 = T2.field1
)
;
The two queries, as SELECT statements, run similarly in time, and the explain plan has them both using an index scan rather than full table scans, as I would expect. What is unexpected is that Q2 runs vastly faster (47 seconds vs. 81 days, per v$session_longops) when run inside an MV creation like:
CREATE MATERIALIZED VIEW SCHEMA1.mv_blah as
(
Q1 or Q2
);
Does anyone have any insight? Is there a rule here to avoid IN, where possible, for MViews only? I know the tricks for choosing between IN and EXISTS when indexes don't exist between the tables, but this one has me baffled. This is running against an Oracle 11g database.
This looks like a known bug. If you have access to My Oracle Support, look at Slow Create/Refresh of Materialized View Based on NOT IN Definition Query (Doc ID 1591851.1); or, less usefully if you don't, a summary of the problem is available.
The contents of the MOS version can't be reproduced here, of course, but suffice it to say that the only workaround is what you're already doing with NOT EXISTS. It's fixed in 12c, which doesn't help you much.
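So, as a sketch, the workaround is just your Q2 wrapped in the MV definition (names taken from the question):
CREATE MATERIALIZED VIEW SCHEMA1.mv_blah AS
SELECT
T1.Several_Fields
FROM
SCHEMA1.tab1 T1
WHERE T1.field2 like 'EXAMPLE%'
AND not exists (
SELECT 1
FROM SCHEMA1.tab2 T2
WHERE T1.field1 = T2.field1
);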

Is this index defined correctly for this join usage? (Postgres)

select
*
from
tbl1 as a
inner join
tbl2 as b on
a.id=b.id
left join
tbl3 as c on
b.id=c.parent_id and
c.some_col=2 and
c.attribute_id=3
In the example above:
If I want optimal performance on the join, should I set up the index on tbl3 like so?
parent_id,
some_col,
attribute_id
The answer depends on the chosen join type.
If PostgreSQL chooses a nested loop or a merge outer join, your index is perfect.
If PostgreSQL chooses a hash outer join, the index won't help at all. In that case you need an index on (some_col, attribute_id).
Work with EXPLAIN to make the best choice for your case.
Note: If one of the conditions on some_col and attribute_id is not selective (doesn't filter out a significant number of rows), it is often better to omit that column in the index. In that case, it is better to get the benefit of a smaller index and more HOT updates.
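A sketch of both variants (the index names are made up):
-- good for a nested loop or merge join on the join condition
CREATE INDEX tbl3_parent_some_attr_idx ON tbl3 (parent_id, some_col, attribute_id);

-- good for a hash join, where only the filter conditions can use an index
CREATE INDEX tbl3_some_attr_idx ON tbl3 (some_col, attribute_id);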
My answer is "Maybe". I am speaking from experience with SQL Server, so someone please correct me if I am wrong and it is different in Postgres.
Your index looks fine for the most part. An issue that may arise is using the SELECT *. If tbl3 has more columns than what is defined in your index and you are querying those fields, they won't be in your index and the engine will have to do additional lookups outside that index.
Another thing would be based on the cardinality of your fields, meaning which are the most selective. If parent_id has a high cardinality, meaning very few duplicates, it could cause more reads against the index. However, if your lowest cardinality field is first and the db can quickly filter out huge chunks of data, that might be more efficient.
I have seen both work very well in SQL Server. SQL Server has even recommended indexes, I apply them, and then it recommends a different one based on field cardinality. Again, I am not familiar with the Postgres engine and I am just assuming these topics apply across both. If all else fails, create 3 indexes with different column order and see which one the engine likes the best.

Inner joins involving three tables

I have a SELECT statement that has three inner joins involving two tables.
Apart from creating indexes on the columns referenced in the ON and WHERE clauses, are there other things I can do to optimize the joins, such as rewriting the query?
SELECT
...
FROM
my_table AS t1
INNER JOIN
my_table AS t2
ON
t2.id = t1.id
INNER JOIN
other_table AS t3
ON
t2.id = t3.id
WHERE
...
You can tune the PostgreSQL config, run VACUUM ANALYZE, and apply all the general optimizations.
If this is not enough and you can spend a few days, you may write code to create a materialized view as described in the PostgreSQL wiki.
You likely have an error in your example, because you're selecting the same record from my_table twice; you could really just do:
SELECT
...
FROM
my_table AS t1
INNER JOIN
other_table AS t3
ON
t1.id = t3.id
WHERE
...
Because in your example code, t1 will always be t2.
But let's assume you mean ON t2.idX = t1.id; then, to answer your question, you can't get much better performance than what you have. You could index the columns, or go further and define them as foreign key relationships (which wouldn't do much for performance compared with simply indexing them).
You might instead look at restricting your WHERE clause; perhaps that is where your indexing would be as (if not more) beneficial.
You could write your query using WHERE EXISTS (if you don't need to select data from all the tables) rather than INNER JOINs, but the performance will be almost identical (except when this is itself inside a nested query), as it still needs to find the records; see the sketch below.
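A sketch of that EXISTS form, assuming you only need columns from my_table (the ... placeholders are as in your example):
SELECT
    ...
FROM
    my_table AS t1
WHERE
    EXISTS (
        SELECT 1
        FROM other_table AS t3
        WHERE t3.id = t1.id
    )
    AND ...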
In PostgreSQL, most of your tuning will not be on the actual query. The goal is to help the optimizer figure out how best to execute your declarative query, not to specify how to do it from your program. That isn't to say that queries can't sometimes be optimized themselves, or that they might not need to be, but this one doesn't have any of the problem areas I am aware of, unless you are retrieving a lot more records than you need to (which I have seen happen occasionally).
The first thing to do is run VACUUM ANALYZE to make sure you have up-to-date statistics. Then use EXPLAIN ANALYZE to compare expected query performance with actual performance. From that point, we'd look at indexes, etc. There isn't anything in this query that needs to be optimized at the query level. However, without the actual filters in your WHERE clause and the actual output of EXPLAIN ANALYZE, there isn't much more that can be suggested.
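As a minimal sketch of those two steps (table names taken from the question, ... placeholders as in the original query):
VACUUM ANALYZE my_table;
VACUUM ANALYZE other_table;

EXPLAIN ANALYZE
SELECT ...
FROM my_table AS t1
INNER JOIN other_table AS t3 ON t1.id = t3.id
WHERE ...;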
Typically you tweak the db to choose a better query plan rather than specifying it in your query. That's usually the PostgreSQL way. The comment is of course qualified by noting there are exceptions.

Slow query with unexpected index scan

I have this query:
SELECT *
FROM sample
INNER JOIN test ON sample.sample_number = test.sample_number
INNER JOIN result ON test.test_number = result.test_number
WHERE sampled_date BETWEEN '2010-03-17 09:00' AND '2010-03-17 12:00'
The biggest table here is RESULT, which contains 11.1M records. The other two tables have about 1M each.
This query runs slowly (more than 10 minutes) and returns about 800 records. The execution plan shows a clustered index scan (over its PRIMARY KEY, result.result_number, which doesn't actually take part in the query) over all 11M records.
RESULT.TEST_NUMBER is a clustered primary key.
If I change 2010-03-17 09:00 to 2010-03-17 10:00, I get about 40 records. It executes in 300 ms, and the plan shows an index seek (over the result.test_number index).
If I replace * in the SELECT clause with result.test_number (covered by an index), then everything becomes fast in the first case too. This points to HDD I/O issues, but doesn't explain the change of plan.
So, any ideas?
UPDATE:
sampled_date is in the sample table and is covered by an index.
Other fields from this query: test.sample_number is covered by an index, and so is result.test_number.
UPDATE 2:
Obviously SQL Server, for some reason, doesn't want to use the index.
I did a small experiment: I removed the INNER JOIN with result, selected all the test.test_number values, and after that did
SELECT * FROM RESULT WHERE TEST_NUMBER IN (...)
This, of course, works fast. But I can't see what the difference is, or why the query optimizer chooses such an inappropriate way to select data in the first case.
UPDATE 3:
After backing up the database and restoring it under a new name, both queries work fast as expected, even over much larger ranges...
So, are there any special commands to clean or optimize, whatever, that could be relevant to this? :-(
A couple things to try:
Update statistics
Add hints to the query about what index to use (in SQL Server you might say WITH (INDEX(myindex)) after specifying a table)
EDIT: You noted that copying the database made it work, which tells me that the index statistics were out of date. You can update them with something like UPDATE STATISTICS mytable on a regular basis.
Use EXEC sp_updatestats to update the whole database; a sketch of both follows below.
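A minimal sketch of both suggestions, using the tables from the question (the index name in the hint is a placeholder for whatever index covers result.test_number):
-- refresh statistics on the tables involved
UPDATE STATISTICS sample;
UPDATE STATISTICS test;
UPDATE STATISTICS result;
-- or update the whole database
EXEC sp_updatestats;

-- hint a specific index if the optimizer still refuses to use it
SELECT *
FROM sample
INNER JOIN test ON sample.sample_number = test.sample_number
INNER JOIN result WITH (INDEX(IX_result_test_number)) ON test.test_number = result.test_number
WHERE sampled_date BETWEEN '2010-03-17 09:00' AND '2010-03-17 12:00'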
The first thing I would do is specify the exact columns I want and see if the problem persists. I doubt you need all the columns from all three tables.
It sounds like it has trouble getting all the rows out of the result table. How big is a row? Look at how big all the data in the table is and divide it by the number of rows. Right-click on the table -> Properties..., Storage tab.
Try putting the WHERE clause into a subquery to force it to do that first?
SELECT *
FROM
(SELECT * FROM sample
WHERE sampled_date
BETWEEN '2010-03-17 09:00' AND '2010-03-17 12:00') s
INNER JOIN test ON s.sample_number = test.sample_number
INNER JOIN result ON test.test_number = result.test_number
Or this might work better if you expect a small number of samples:
SELECT *
FROM sample
INNER JOIN test ON sample.sample_number = test.sample_number
INNER JOIN result ON test.test_number = result.test_number
WHERE sample.sample_ID in (
SELECT sample_ID
FROM sample
WHERE sampled_date BETWEEN '2010-03-17 09:00' AND '2010-03-17 12:00'
)
If you do a SELECT *, you want all the data from the table. The data for the table is in the clustered index - the leaf nodes of the clustered index are the data pages.
So if you want all of those data pages anyway, and since you're joining 1 mio. rows to 11 mio. rows (1 out of 11 isn't very selective for SQL Server), using an index to find the rows and then doing bookmark lookups into the actual data pages for each row found might just not be very efficient, so SQL Server uses the clustered index scan instead.
So to make a long story short: only select those rows you really need! You thus give SQL Server a chance to use an index, do a seek there, and find the necessary data.
If you only select three or four columns, then the chances that SQL Server will find and use an index that contains those columns are just so much higher than if you ask for all the data from all the tables involved.
Another option would be to express a subquery, e.g. using a Common Table Expression, that grabs data from the two smaller tables, reduces that number of rows even more, and joins the hopefully quite small result against the main table. If you have a small result set of only 40 or 800 rows (rather than two tables with 1 mio. rows each), then SQL Server might be more inclined to use a Clustered Index Seek and do bookmark lookups on 40 or 800 rows, rather than doing a full Clustered Index Scan. A sketch of this idea follows.
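A sketch of that idea, using the tables from the question (the result columns listed are placeholders; pick only the ones you actually need):
;WITH small_set AS (
    SELECT test.test_number
    FROM sample
    INNER JOIN test ON sample.sample_number = test.sample_number
    WHERE sampled_date BETWEEN '2010-03-17 09:00' AND '2010-03-17 12:00'
)
SELECT result.test_number, result.result_value  -- placeholder column names
FROM small_set
INNER JOIN result ON result.test_number = small_set.test_number;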