I did some googling and couldn't find a clear answer to an Oracle performance question; maybe we can document it here. I am building a materialized view (MV) that is pretty simple but sits on fairly large tables. Like many queries, it can be written more than one way. In my case, two solutions have similar costs and execution plans when written as plain SELECT statements, but when placed inside a CREATE MATERIALIZED VIEW the execution time changes drastically. Any insight into why?
Tab1 is approx. 40M records.
Tab2 is approx. 8M records.
field1 is the primary key on Tab1; it is not a PK or unique on Tab2, but Tab2 does have an index on this field.
field2 is not a key, nor is it indexed on either table (boo).
Queries are:
Q1:
SELECT
    T1.Several_Fields
FROM
    SCHEMA1.tab1 T1
WHERE T1.field2 LIKE 'EXAMPLE%'
  AND T1.field1 NOT IN (
        SELECT T2.field1
        FROM SCHEMA1.tab2 T2
      )
;
Q2:
SELECT
    T1.Several_Fields
FROM
    SCHEMA1.tab1 T1
WHERE T1.field2 LIKE 'EXAMPLE%'
  AND NOT EXISTS (
        SELECT 1
        FROM SCHEMA1.tab2 T2
        WHERE T1.field1 = T2.field1
      )
;
The two queries run similarly in time as SELECT statements, and the explain plan has them both utilizing the index scan rather than full table scans, as I would expect. What is unexpected is that Q2 runs vastly faster (47 seconds vs. 81 days per v$session_longops) when run in an MV creation like:
CREATE MATERIALIZED VIEW SCHEMA1.mv_blah as
(
Q1 or Q2
);
Does anyone have any insight? Is there a rule here to avoid IN, if possible, for MVs only? I know of the tricks between IN and EXISTS when indexes do not exist between the tables, but this one had me baffled. This is running against an Oracle 11g database.
This looks like a known bug. If you have access to My Oracle Support, look at Slow Create/Refresh of Materialized View Based on NOT IN Definition Query (Doc ID 1591851.1); if you don't, a (less useful) summary of the problem is available.
The contents of the MOS version can't be reproduced here of course, but suffice to say that the only workaround is what you're already doing with not exists. It's fixed in 12c, which doesn't help you much.
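For reference, a sketch of that workaround, i.e. the Q2 form placed inside the MV definition, using the table and column names from the question (Several_Fields stands in for the real column list):
CREATE MATERIALIZED VIEW SCHEMA1.mv_blah AS
SELECT T1.Several_Fields
FROM SCHEMA1.tab1 T1
WHERE T1.field2 LIKE 'EXAMPLE%'
  AND NOT EXISTS (
        SELECT 1
        FROM SCHEMA1.tab2 T2
        WHERE T1.field1 = T2.field1
      );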
I have a SQL query in Athena that is very slow when the LIKE operator's value comes from another table:
SELECT * FROM table1 t1
WHERE t1.value LIKE (SELECT concat('%', t2.value, '%') AS val
                     FROM table2 t2
                     WHERE t2.id = 1
                     LIMIT 1);
The above query is very slow.
When I use something like the query below, it works super fast:
SELECT * FROM table1 t1
WHERE t1.value LIKE '%somevalue%';
In my scenario the LIKE value is not fixed; it can change over time, which is why I need to get the value from another table.
Please suggest the fastest way.
"Slow" is a relative term, but a query that joins two tables will always be slower than a query that doesn't. A query that compares against a pattern that needs to be looked up in another table at query time will always be slower than a query that uses a static pattern.
Does that mean that the query with the lookup is slow? Perhaps, but you have to base that on what you're actually asking the query engine to do.
Let's dissect what your query is doing:
The outer query looks for all columns of all rows of the first table where one of the columns contains a particular string.
That string is dynamically looked up by scanning every row in the second table looking for a row with a particular value for the id column.
In other words, the query with the static pattern scans only the first table, but the query with the lookup scans both tables. That's always going to be slower, because it's doing a lot more work. How much more work? That depends on the sizes of the tables. You haven't specified the running times of the queries or the sizes of the tables, so it's hard to know.
You don't provide enough context in your question to answer more precisely than this. We can only respond with generalities like: if it's slow, don't use LIKE, that's a slow operation; don't use a subquery that has to read the whole second table, that's slow.
I have found another method to do the same thing, and it's super fast in Athena:
SELECT * FROM table1 t1
WHERE POSITION(
        (SELECT concat('%', t2.value, '%') AS val
         FROM table2 t2
         WHERE t2.id = 1
         LIMIT 1)
      IN t1.value) > 0;
I'm quite new to SQL query analysis. Recently I stumbled upon a performance issue with one of my queries, and I'm wondering whether my thought process is correct here and why the Query Optimizer works the way it does in this case.
I'm on SQL Server 2012.
I've got a SQL query that looks like this:
SELECT * FROM T1
WHERE Id NOT IN
(SELECT DISTINCT T1_Id from T2);
It takes around 30 seconds to run on my test server.
While trying to understand what is taking so long I rewrote it using a temp table, like this:
SELECT DISTINCT T1_Id
INTO #temp from T2;
SELECT * FROM T1
WHERE Id NOT IN
(SELECT T1_Id from #temp);
It runs a hundred times faster than the first one.
Some info about the tables:
T2 has around 1 million rows, and there are around 1000 distinct values of T1_id there. T1 has around 1000+ rows. Initially I only had a clustered index on T2 on a column other than T1_Id, so T1_id wasn't indexed at all.
Looking at the execution plans, I saw that for the first query there were as many index scans as there are distinct T1_id values, so basically SQL Server performs about 1000 index scans in this case.
That made me realize that adding a non-clustered index on T1_id may be a good idea (the index should've been there from the start, admittedly), and adding an index indeed made the original query run much faster since now it does nonclustered index seeks.
What I'm looking for is to understand the Query optimizer behavior for the original query - does it look reasonable? Are there any ways to make it work in a way similar to the temporary table variant that I posted here rather than doing multiple scans? Am I just misunderstanding something here?
Thanks in advance for any links to the similar discussion as I haven't really found anything useful.
NOT IN is intuitive but slow. This construct will generally run quicker:
WHERE Id IN
    (SELECT Id FROM T1
     EXCEPT
     SELECT T1_Id FROM T2)
The actual performance will likely vary from the estimates, but neither of your queries will out-perform this query, which is the de facto standard approach:
SELECT T1.* FROM T1
LEFT JOIN T2 ON T1.Id = T2.T1_Id
WHERE T2.T1_Id IS NULL
This uses a proper join, which will perform very well (assuming the foreign key column is indexed), and being a left (outer) join, the WHERE condition selects only those rows from T1 that don't join (all columns of the right-side table are NULL when the join misses).
Note also that DISTINCT is not required, since there is only ever one row returned from T1 for missed joins.
The SQL Server optimizer needs to understand the size of the tables for some of its decisions.
When doing a NOT IN with a subquery, those estimates may not be entirely accurate. When the table is actually materialized, the count would be highly accurate.
I think the first query would be faster with an index on T2(T1_Id).
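Something like this sketch, with the index name made up for illustration:
CREATE NONCLUSTERED INDEX IX_T2_T1_Id ON T2 (T1_Id);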
This is just a guess, but hopefully an educated one...
The DBMS probably concluded that searching a large table a small number of times is faster than searching a small table a large number of times. That's why you had ~1000 searches on T2 instead of ~1000000 searches on T1.
When you added an index on T2.T1_Id, that turned ~1000 table scans (or full clustered index scans if the table is clustered) into ~1000 index seeks, which made things much faster, as you already noted.
I'm not sure why it didn't attempt a hash join (or a merge join after the index was added) - perhaps it had stale statistics and badly overestimated the number of distinct values?
One more thing: is there a FOREIGN KEY on T2.T1_Id referencing T1.Id? I know Oracle can use FKs to improve the accuracy of cost estimates (in this case, it could infer that the cardinality of T2.T1_Id cannot be greater than that of T1.Id). If MS SQL Server does something similar and the FK is missing (or is untrusted), that could contribute to MS SQL Server thinking there are more distinct values than there really are.
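A sketch of adding (and trusting) such a foreign key in SQL Server, with the constraint name made up for illustration:
ALTER TABLE T2 WITH CHECK
    ADD CONSTRAINT FK_T2_T1 FOREIGN KEY (T1_Id) REFERENCES T1 (Id);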
(BTW, it would have helped if you posted the actual query plans and the database structure.)
I'm experiencing some weird behavior with SELECT statements in SQLite. There is one table with 3 million records. E.g.
SELECT * FROM table1 WHERE cond1;
reduces the output to 10000 records and finishes instantly. Same with
SELECT * FROM table1 WHERE cond1 ORDER BY col1;
But
SELECT * FROM table1 WHERE cond1 AND cond2 ORDER BY col1;
seems to take forever. The CPU is working for about 2 seconds and after that there is only I/O. CPU does nothing, memory is free.
What am I doing wrong?
I hope it's not a newbie question and all I have to do is use an index (but why?).
Thanks for the help!
More concretely:
the table structure:
0|url|TEXT|0||1
1|date|DATE|0||1
2|md5sum|TEXT|0||0
3|size|INTEGER|0||0
4|archive|TEXT|0||0
5|numScripts|INTEGER|0||0
6|numScriptBytes|INTEGER|0||0
7|numLinesBehaviour|INTEGER|0||0
8|state|TEXT|0||0
the statement:
SELECT * FROM t1 WHERE md5sum LIKE '00%' AND state = 'okay' ORDER BY md5sum;
There is no connection between md5sum and state.
I haven't created any indexes.
What I also forgot to mention: the problem occurs only when the statement includes two or more string comparisons AND ordering. So
SELECT * FROM t1 WHERE md5sum LIKE '00%' AND state = 'okay';
also works fine.
Update 2:
An obvious workaround:
CREATE TABLE temp (url TEXT, date DATE, ...
INSERT INTO temp SELECT * FROM t1 WHERE state = 'okay' AND md5sum LIKE '00%';
SELECT * FROM temp ORDER BY md5sum;
But, damn, there must be an easier way.
I haven't created any indexes.
That implies that the DBMS will have to inspect every row of your table just to make the selection.
ORDER BY md5sum;
That implies that the DBMS has to sort (typically an N log(N) operation) the result set.
Adding indexes may help, either by making the check of your condition cheaper, or by making the sorting unneeded (and maybe both).
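For example, a composite index covering both the equality filter and the ORDER BY column might let SQLite satisfy the ordering straight from the index (a sketch only; the index name is made up):
CREATE INDEX idx_t1_state_md5sum ON t1 (state, md5sum);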
UPDATE (added):
Since md5sum is part of both the selection condition and the ORDER BY expression, you might try to fool the query-plan generator by adding a bogus term to the sorting expression:
SELECT * FROM t1
WHERE md5sum LIKE '00%' AND state = 'okay'
ORDER BY md5sum, state
;
No guarantees, YMMV.
I have a query of the form:
select m.id from mytable m
left outer join othertable o on o.m_id = m.id
and o.col1 is not null and o.col2 is not null and o.col3 is not null
where o.id is null
The query returns a few hundred records, although the tables have millions of rows, and it takes forever to run (around an hour).
When I check my index statistics using:
select * from pg_stat_all_indexes
where schemaname <> 'pg_catalog' and (indexrelname like 'othertable_%' or indexrelname like 'mytable_%')
I see that only the index for othertable.m_id is being used, and that the indexes for col1..3 are not being used at all. Why is this?
I've read in a few places that PG has traditionally not been able to index NULL values. However, I've read this has supposedly changed since PG 8.3? I'm currently using PostgreSQL 8.4 on Ubuntu 10.04. Do I need to make a "partial" or "functional" index specifically to speed up IS NOT NULL queries, or is it already indexing NULLs and I'm just misunderstanding the problem?
You could try a partial index:
CREATE INDEX idx_partial ON othertable (m_id)
WHERE (col1 is not null and col2 is not null and col3 is not null);
From the docs: http://www.postgresql.org/docs/current/interactive/indexes-partial.html
Partial indexes aren't going to help you here as they'll only find the records you don't want. You want to create an index that contains the records you do want.
CREATE INDEX findDaNulls ON othertable ((COALESCE(col1,col2,col3,'Empty')))
WHERE col1 IS NULL AND col2 IS NULL AND col3 IS NULL;
SELECT *
FROM mytable m
JOIN othertable o ON m.id = o.m_id
WHERE COALESCE(col1,col2,col3,'Empty') = 'Empty';
BTW, searching for NULLs from a left join generally isn't as fast as using EXISTS or NOT EXISTS in Postgres.
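For reference, a sketch of the NOT EXISTS form of the query from the question (same table and column names assumed):
SELECT m.id
FROM mytable m
WHERE NOT EXISTS (
    SELECT 1
    FROM othertable o
    WHERE o.m_id = m.id
      AND o.col1 IS NOT NULL
      AND o.col2 IS NOT NULL
      AND o.col3 IS NOT NULL
);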
A single index on m_id, col1, col2 and col3 would be my first thought for this query.
And use EXPLAIN on this query to see how it is executed and what takes so much time. You could show us the results to help you out.
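A sketch of both suggestions, with the index name made up for illustration:
CREATE INDEX idx_othertable_m_id_cols ON othertable (m_id, col1, col2, col3);
EXPLAIN
SELECT m.id
FROM mytable m
LEFT OUTER JOIN othertable o ON o.m_id = m.id
    AND o.col1 IS NOT NULL AND o.col2 IS NOT NULL AND o.col3 IS NOT NULL
WHERE o.id IS NULL;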
A partial index seems the right way here:
If you have a table that contains both billed and unbilled orders, where the unbilled orders take up a small fraction of the total table and yet those are the most-accessed rows, you can improve performance by creating an index on just the unbilled rows.
Perhaps those nullable columns (col1, col2, col3) act in your scenario as some kind of flag to distinguish a subclass of records in your table (for example, some sort of logical deletion)? In that case, besides the partial-index solution, you might prefer to rethink your design and put them in different physical tables (perhaps using inheritance), one for the "live" records and one for the "historical" records, and access the full set (only when needed) through a view.
Did you try to create a combined index on othertable(m_id, col1, col2, col3)?
You should also check the execution plan (using EXPLAIN) rather than checking the system tables for the index usage.
PostgreSQL 9.0 (currently in beta) will be able to use an index for an IS NULL condition. That feature got postponed.
First, I know that the SQL statement to update table_a using values from table_b is of the form:
Oracle:
UPDATE table_a
SET (col1, col2) = (SELECT cola, colb
FROM table_b
WHERE table_a.key = table_b.key)
WHERE EXISTS (SELECT *
FROM table_b
WHERE table_a.key = table_b.key)
MySQL:
UPDATE table_a
INNER JOIN table_b ON table_a.key = table_b.key
SET table_a.col1 = table_b.cola,
table_a.col2 = table_b.colb
What I understand is the database engine will go through records in table_a and update them with values from matching records in table_b.
So, if I have 10 million records in table_a and only 10 records in table_b:
Does that mean that the engine will do 10 million iterations through table_a just to update 10 records? Are Oracle/MySQL/etc. smart enough to do only 10 iterations through table_b?
Is there a way to force the engine to actually iterate through records in table_b instead of table_a to do the update? Is there an alternative syntax for the sql statement?
Assume that table_a.key and table_b.key are indexed.
Either engine should be smart enough to optimize the query based on the fact that there are only ten rows in table_b. How the engine determines what to do is based on factors like indexes and statistics.
If the "key" column is the primary key and/or is indexed, the engine will have to do very little work to run this query. It will basically already sort of "know" where the matching rows are, and look them up very quickly. It won't have to "iterate" at all.
If there is no index on the key column, the engine will have to do a "table scan" (roughly the equivalent of "iterate") to find the right values and match them up. This means it will have to scan through 10 million rows.
Do a little reading on what's called an Execution Plan. This is basically an explanation of what work the engine had to do in order to run your query (some databases show it in text only, some have the option of seeing it graphically). Learning how to interpret an Execution Plan will give you great insight into adding indexes to your tables and optimizing your queries.
Look these up if they don't work (it's been a while), but it's something like:
In MySQL, put the word "EXPLAIN" in front of your SELECT statement
In Oracle, run "SET AUTOTRACE ON" before you run your SELECT statement
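For example, a rough sketch using the statements from the question (note that EXPLAIN on an UPDATE needs a reasonably recent MySQL version; on older versions, EXPLAIN the equivalent SELECT instead):
-- MySQL:
EXPLAIN
UPDATE table_a
INNER JOIN table_b ON table_a.key = table_b.key
SET table_a.col1 = table_b.cola,
    table_a.col2 = table_b.colb;
-- Oracle (SQL*Plus):
SET AUTOTRACE ON
-- ...then run the UPDATE (or its SELECT equivalent) as usual.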
I think the first (Oracle) query would be better written with a JOIN instead of a WHERE EXISTS. The engine may be smart enough to optimize it properly either way. Once you get the hang of interpreting an execution plan, you can run it both ways and see for yourself. :)
Okay, I know answering your own question is usually frowned upon, but I already accepted another answer and won't unaccept it, so, meh, here it is...
I've discovered a much better alternative that I'd like to share with anyone who encounters the same scenario: the MERGE statement.
Apparently, newer Oracle versions introduced this MERGE statement, which simply blows UPDATE away! Not only is the performance so much better in most cases, the syntax is so simple and makes so much sense that I feel stupid for having used the UPDATE statement! Here it comes...
MERGE INTO table_a
USING table_b
ON (table_a.key = table_b.key)
WHEN MATCHED THEN UPDATE SET
table_a.col1 = table_b.cola,
table_a.col2 = table_b.colb;
And what's more, I can also extend the statement to include an INSERT action when table_a does not have matching records for some of the records in table_b:
MERGE INTO table_a
USING table_b
ON (table_a.key = table_b.key)
WHEN MATCHED THEN UPDATE SET
table_a.col1 = table_b.cola,
table_a.col2 = table_b.colb
WHEN NOT MATCHED THEN INSERT
(key, col1, col2)
VALUES (table_b.key, table_b.cola, table_b.colb);
This new statement type made my day the day I discovered it :)