Teiid not performing optimal join

For our Teiid Spring Boot project we use a row filter in a WHERE clause to determine which results a user gets.
Example:
SELECT * FROM very_large_table WHERE id IN ('01', '03')
We want the context in the IN clause to be dynamic like so:
SELECT * FROM very_large_table WHERE id IN (SELECT other_id from very_small_table)
The problem is that Teiid now fetches all the data from very_large_table and only then applies the WHERE filter, which makes the query 10-20 times slower. very_small_table holds only about 1-10 records, based on the user context we get from Java.
very_large_table is in an Oracle database and very_small_table is on the Teiid pod/container. Somehow I can't force Teiid to ship the data to Oracle and do the filtering there.
Things that I have tried:
I have specified the foreign data wrapper as follows:
CREATE FOREIGN DATA WRAPPER "oracle_override" TYPE "oracle" OPTIONS (EnableDependentsJoins 'true');
CREATE SERVER server_name FOREIGN DATA WRAPPER "oracle_override";
I also tried an EXISTS statement, and a JOIN clause instead of the WHERE clause, to see whether pushdown would happen. Join hints don't seem to matter either.
Sadly, the performance impact at the moment is so high that we can't reach our performance targets.

Are cardinalities set on very_small_table and very_large_table? If not, the planner will assume a default plan.
You can also use a dependent join hint:
SELECT * FROM very_large_table WHERE id IN /*+ dj */ (SELECT other_id from very_small_table)
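To make the goal of a dependent join concrete, here is a minimal sketch of the idea in Python. It is not Teiid's implementation: two in-memory SQLite databases are hypothetical stand-ins for the Teiid-side table and the Oracle table, and dependent_fetch is a made-up helper name. The small side is read first, and its values are then shipped to the large source as an IN list so the filtering happens there.

```python
import sqlite3

# Two separate in-memory databases stand in for the two sources
# (hypothetical stand-ins; the real setup is Teiid plus Oracle).
small_src = sqlite3.connect(":memory:")
large_src = sqlite3.connect(":memory:")

small_src.execute("CREATE TABLE very_small_table (other_id TEXT)")
small_src.executemany("INSERT INTO very_small_table VALUES (?)", [("01",), ("03",)])

large_src.execute("CREATE TABLE very_large_table (id TEXT, payload TEXT)")
large_src.executemany(
    "INSERT INTO very_large_table VALUES (?, ?)",
    [(f"{i:02d}", f"row {i}") for i in range(1, 101)],
)


def dependent_fetch(small, large):
    """Read the small side first, then filter the large side at its source."""
    # Step 1: collect the distinct key values from the small (independent) side.
    ids = [r[0] for r in small.execute("SELECT DISTINCT other_id FROM very_small_table")]
    if not ids:
        return []
    # Step 2: send those values to the large source as bound IN-list parameters,
    # so only matching rows ever leave that source.
    placeholders = ", ".join("?" for _ in ids)
    sql = (
        "SELECT id, payload FROM very_large_table "
        f"WHERE id IN ({placeholders}) ORDER BY id"
    )
    return large.execute(sql, ids).fetchall()


rows = dependent_fetch(small_src, large_src)
print(rows)  # [('01', 'row 1'), ('03', 'row 3')]
```

Only 2 of the 100 rows cross the boundary between the two sources here, which is the effect the hint is meant to achieve.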

Often, exists performs better than in:
SELECT vlt.*
FROM very_large_table vlt
WHERE EXISTS (SELECT 1 FROM very_small_table vst WHERE vst.other_id = vlt.id);
However, this might end up scanning the large table.
If id is unique in vlt and there are no duplicates in vst, then a JOIN might optimize better:
select vlt.*
from very_small_table vst join
very_large_table vlt
on vst.other_id = vlt.id;

Related

How can I set a limitation for one column on all SQL selects?

We have a database that holds data for numerous customers. We want to give customers access to the database, but only to the data that belongs to them. Parsing the select to then insert in the where clause "and Company.Name = 'Acme'" strikes me as weak because SQL selects can be very complex and handling 100% of all cases may be difficult.
Is there some way to do the equivalent of (I know this is not valid SQL):
select * from * where Company.Name = 'Acme' and (passed_in_select)
You can nest a full select in as an inner part of a large select. Is there some way to do the above? This way it's a very simple restriction on the select and that is likely to work 100% of the time.

Oracle Database has a built-in solution for this called "Virtual Private Database":
https://docs.oracle.com/cd/B28359_01/network.111/b28531/vpd.htm
For other databases, check whether a similar built-in solution exists.
But there is also a very simple solution using the WITH clause:
WITH
tab_a__ AS (SELECT * FROM tab_a WHERE comp='xy'),
tab_b__ AS (SELECT * FROM tab_b WHERE comp='xy')
SELECT ... -- original select
You just have to find all the tables used in the select, append __ to their names, and add the corresponding CTEs to the WITH clause.
Notes: Some databases do not support the WITH clause even though it is part of the SQL standard. Some databases have an alias length limit that you could exceed by adding the suffix.
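A small sketch of the WITH-clause approach, run against SQLite from Python. The table contents and the join are made up for illustration, and the "original select" is assumed to have already been rewritten to use the suffixed names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tab_a (id INTEGER, comp TEXT);
CREATE TABLE tab_b (a_id INTEGER, comp TEXT, note TEXT);
INSERT INTO tab_a VALUES (1, 'xy'), (2, 'other');
INSERT INTO tab_b VALUES (1, 'xy', 'ours'), (2, 'other', 'theirs');
""")

# Each CTE exposes only the rows of one company; the user's select
# reads from the suffixed names instead of the base tables.
restricted = """
WITH
  tab_a__ AS (SELECT * FROM tab_a WHERE comp = :comp),
  tab_b__ AS (SELECT * FROM tab_b WHERE comp = :comp)
SELECT a.id, b.note
FROM tab_a__ a JOIN tab_b__ b ON b.a_id = a.id
ORDER BY a.id
"""

visible = conn.execute(restricted, {"comp": "xy"}).fetchall()
print(visible)  # [(1, 'ours')]
```

Note that this only restricts what the rewritten names expose. A select that still references tab_a directly is not restricted, so the renaming step has to catch every table.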

select * from
(
select * from table_a
) outer_table_a
where outer_table_a.col_a = 'test'
I do this sort of thing often, especially when I want to perform some aggregation on the data in the inner query (sum, max, etc.). I do this with SQL Server; I do not know if it is valid with other DBMSs, but I would be surprised if it were not.
I don't know if I would rely on this approach to effectively grant permissions. Perhaps views would allow you to lock things down a bit tighter. It sounds like you're planning to tack something on dynamically to a query that you may not have written? In that case whoever writes that query could transform your column of interest, which would result in visibility over things you didn't intend, like:
select * from
(
select 'test' as col_a, launch_codes from table_a
) outer_table_a
where outer_table_a.col_a = 'test'

How to eliminate duplicate of subquery in ".. where X in (S) or Y in (S)"?

I have a query where I need to get rows from a table where any of two foreign keys exists in another query. Here is the simplified SQL:
Select MainID From MainTable Where
Key1 In (Select SubID From SubTable Where UserID=@UserID) Or
Key2 In (Select SubID From SubTable Where UserID=@UserID)
As you can see, the sub-query is duplicated. Is the SQL compiler intelligent enough to recognize this and run the sub-query once only or does it run twice?
Is there a better way I can write this SQL?
Update: I should have mentioned this originally - SubID is the primary key on SubTable.

You would replace the IN clause with an EXISTS clause:
Select MainID From MainTable
Where Exists
(
Select *
From SubTable
Where UserID = @UserID
And SubID in (MainTable.Key1, MainTable.Key2)
);

You can use a common table expression:
with subid_data as (
Select SubID
From SubTable
Where UserID=@UserID
)
Select MainID
From MainTable
Where Key1 In (select SubID from subid_data)
Or Key2 In (select SubID from subid_data);

I don't think the compiler is intelligent enough to do the table scan or index seek only once.
If you have a complicated WHERE clause, you can push the sub-query results into a temp table.
Then use the temp table in the WHERE clause, which will perform better.
SELECT SubID
INTO #SubTable
FROM SubTable
WHERE UserID = @UserID;
SELECT MainID
FROM MainTable M
WHERE EXISTS (SELECT 1
FROM #SubTable S
WHERE M.Key1 = S.SubID)
OR EXISTS (SELECT 1
FROM #SubTable S
WHERE M.Key2 = S.SubID)

Please try the following query:
Select MainID
From MainTable m
Where exists
( select 1 from SubTable s Where s.UserID=@UserID and s.SubID in (m.Key1,m.Key2))

tldr; both the original and the following JOIN proposal, with less "looks redundant", should generate equivalent query plans. View the actual query plans if there are any doubts as to how SQL Server is [currently] treating a query. (See IN vs. JOIN vs. EXISTS for a taste of the magic.)
Is the SQL compiler intelligent enough to recognize this and run the sub-query once only or does it run twice?
Yes, SQL Server is smart enough to handle this. It does not need to "run twice" (nit: the subquery does not "run" at all in a procedural sense). That is, there is no mandated explicit materialization stage - much less two. The JOIN transformation below shows why such is not required.
Since these are independent (or non-correlated) sub-queries [1], which do not depend on the outer query, they can - and I dare say will - be optimized, as they can be freely and easily moved under Relational Algebra (RA) rules.
As you can see, the sub-query is duplicated .. Is there a better way I can write this SQL?
However it still "looks redundant" visually because it is written so. SQL Server doesn't care - but a human might. Thus the following is how I would write it and what I consider "better".
I am a big fan of using JOINs over subqueries; once a JOIN approach is adopted it often "fits better" with RA. This simple transformation to a JOIN is possible because of the non-correlated nature of the original subqueries - the [SQL Server] query planner is capable of doing such RA rewrites internally; view the actual query plans to see what differences there are, if any.
Rewriting the query would then be:
Select MainID
From MainTable
Join (
Select Distinct SubID -- SubId must be unique from select
From SubTable
Where UserID=@UserID
) t
-- Joining on "A or B" may indicate an ARC relationship
-- but this obtains the original results
On Key1 = t.SubID
Or Key2 = t.SubID
The DISTINCT is added to the derived table query because of the unknown (to me) multiplicity of SubId column - it can be treated as a redundant qualifier by SQL Server if SubId is bound by a Unique Constraint so it's either required or "free". See IN vs. JOIN with large rowsets for why it matters that the joined table keys are unique.
Note: SQL Server does not necessarily rewrite an IN to the join as shown above, as discussed in IN vs. JOIN vs. EXISTS; but the fundamental concept of being able to move the RA operation (and being able to treat the query as a what and not a how) is still used.
[1] Some of the answers change the original subquery to a dependent/correlated subquery which is going the wrong way. It may still result in a respectable (or even equivalent) query plan as SQL Server will try to "undo" the changes - but that's going a step away from a clean RA model and JOINs! (And if SQL Server can't "undo" the added correlation then the query will be far inferior.)
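The variants in this thread can be compared on a small data set. The sketch below uses SQLite from Python rather than SQL Server, with made-up rows, so it says nothing about query plans; it only checks which rows come back. One thing it shows: when Key1 and Key2 of the same row match two different SubIDs, the JOIN ... ON ... OR ... form returns that MainID twice, so a DISTINCT on the outer select is needed to reproduce the IN/OR result exactly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SubTable (SubID INTEGER PRIMARY KEY, UserID INTEGER);
CREATE TABLE MainTable (MainID INTEGER PRIMARY KEY, Key1 INTEGER, Key2 INTEGER);
INSERT INTO SubTable VALUES (1, 7), (2, 7), (3, 8);
INSERT INTO MainTable VALUES
  (10, 1, 9),   -- matches through Key1
  (11, 9, 2),   -- matches through Key2
  (12, 1, 2),   -- matches through both keys, with different SubIDs
  (13, 3, 9),   -- SubID 3 belongs to another user
  (14, 9, 9);   -- no match
""")
params = {"user_id": 7}


def ids(sql):
    return [row[0] for row in conn.execute(sql, params)]


in_or = ids("""
    SELECT MainID FROM MainTable
    WHERE Key1 IN (SELECT SubID FROM SubTable WHERE UserID = :user_id)
       OR Key2 IN (SELECT SubID FROM SubTable WHERE UserID = :user_id)
    ORDER BY MainID""")

exists_form = ids("""
    SELECT MainID FROM MainTable m
    WHERE EXISTS (SELECT 1 FROM SubTable s
                  WHERE s.UserID = :user_id AND s.SubID IN (m.Key1, m.Key2))
    ORDER BY MainID""")

join_form = ids("""
    SELECT MainID FROM MainTable m
    JOIN (SELECT DISTINCT SubID FROM SubTable WHERE UserID = :user_id) t
      ON m.Key1 = t.SubID OR m.Key2 = t.SubID
    ORDER BY MainID""")

print(in_or)        # [10, 11, 12]
print(exists_form)  # [10, 11, 12]
print(join_form)    # [10, 11, 12, 12]
```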

Performance of nested select

I know this is a common question and I have read several other posts and papers but I could not find one that takes into account indexed fields and the volume of records that both queries could return.
My question is simple really. Which of the two is recommended here written in an SQL-like syntax (in terms of performance).
First query:
Select *
from someTable s
where s.someTable_id in
(Select someTable_id
from otherTable o
where o.indexedField = 123)
Second query:
Select *
from someTable s
where someTable_id in
(Select someTable_id
from otherTable o
where o.someIndexedField = s.someIndexedField
and o.anotherIndexedField = 123)
My understanding is that the second query will query the database for every tuple that the outer query will return where the first query will evaluate the inner select first and then apply the filter to the outer query.
Now the second query may query the database superfast considering that the someIndexedField field is indexed but say that we have thousands or millions of records wouldn't it be faster to use the first query?
Note: In an Oracle database.

In MySQL, if the nested selects are over the same table, the execution time of the query can be hell.
A good way to improve performance in MySQL is to create a temporary table for the nested select and run the main select against that table.
For example:
Select *
from someTable s1
where s1.someTable_id in
(Select someTable_id
from someTable s2
where s2.Field = 123);
This can perform better as:
create temporary table temp_table as (
Select someTable_id
from someTable s2
where s2.Field = 123
);
Select *
from someTable s1
where s1.someTable_id in
(Select someTable_id
from temp_table s2);
I'm not sure about performance for a large amount of data.

About the first query:
"the first query will evaluate the inner select first and then apply the filter to the outer query."
It's not that simple.
In SQL it is mostly NOT possible to tell what will be executed first and what will be executed later, because SQL is a declarative language.
Your "nested selects" are nested only visually, not technically.
Example 1 - in "someTable" you have 10 rows, in "otherTable" 10000 rows.
In most cases the database optimizer will read "someTable" first and then check "otherTable" for a match. It may or may not use indexes for that, depending on the situation; my feeling is that in this case it will use the "indexedField" index.
Example 2 - in "someTable" you have 10000 rows, in "otherTable" 10 rows.
In most cases the database optimizer will read all rows from "otherTable" into memory, filter them by 123, and then find the matches through the someTable PK (someTable_id) index. As a result, no indexes from "otherTable" will be used.
About the second query:
It is completely different from the first, so I don't know how to compare them:
The first query links the two tables by one pair: s.someTable_id = o.someTable_id
The second query links the two tables by two pairs: s.someTable_id = o.someTable_id AND o.someIndexedField = s.someIndexedField.
The common practice for linking two tables is your first query.
But o.someTable_id should be indexed.
So the common rules are:
all PKs should be indexed (they are indexed by default)
all columns used for filtering (such as those in the WHERE part) should be indexed
all columns used to match rows between tables (including IN, JOIN, etc.) are also used for filtering, so they should be indexed too.
The DB engine will choose the best order of operations itself (or run them in parallel). In most cases you cannot determine this.
Use Oracle EXPLAIN PLAN (similar tools exist for most DBs) to compare the execution plans of different queries on real data.
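As an illustration of comparing plans rather than guessing, here is a sketch that uses SQLite's EXPLAIN QUERY PLAN from Python as a stand-in for Oracle's EXPLAIN PLAN. The schema, index name, and data are made up, and the plan text differs between engines and versions, so the point is the workflow: run both forms, confirm they return the same rows, then read the plans.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE someTable (someTable_id INTEGER PRIMARY KEY, someIndexedField INTEGER);
CREATE TABLE otherTable (someTable_id INTEGER, indexedField INTEGER);
CREATE INDEX ix_other_indexed ON otherTable (indexedField);
""")
conn.executemany(
    "INSERT INTO someTable VALUES (?, ?)", [(i, i % 5) for i in range(1, 1001)]
)
conn.executemany(
    "INSERT INTO otherTable VALUES (?, ?)",
    [(i, 123 if i % 100 == 0 else 0) for i in range(1, 1001)],
)

in_query = """
    SELECT someTable_id FROM someTable s
    WHERE s.someTable_id IN
      (SELECT someTable_id FROM otherTable o WHERE o.indexedField = 123)
    ORDER BY someTable_id"""

exists_query = """
    SELECT someTable_id FROM someTable s
    WHERE EXISTS
      (SELECT 1 FROM otherTable o
       WHERE o.someTable_id = s.someTable_id AND o.indexedField = 123)
    ORDER BY someTable_id"""


def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is the readable step text.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


in_rows = conn.execute(in_query).fetchall()
exists_rows = conn.execute(exists_query).fetchall()
in_plan = plan(in_query)
exists_plan = plan(exists_query)

for step in in_plan:
    print("IN     :", step)
for step in exists_plan:
    print("EXISTS :", step)
```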

When I used
where not exists (select VAL_ID FROM #newVals where VAL_ID = OLDPAR.VAL_ID) directly, it took 20 seconds. When I added the temp table it took 0 seconds. I don't understand why. (As a C++ developer I just imagine that internally it loops over the values.)
-- Temp table with an index gives me a big speedup
declare @newValID table (VAL_ID int INDEX IX1 CLUSTERED);
insert into @newValID select VAL_ID FROM #newVals
insert into #deleteValues
select OLDPAR.VAL_ID
from #oldVal AS OLDPAR
where
not exists (select VAL_ID from @newValID where VAL_ID=OLDPAR.VAL_ID)
or exists (select VAL_ID from #VaIdInternals where VAL_ID=OLDPAR.VAL_ID);

WHERE and JOIN order of operation

My question is similar to this SQL order of operations but with a little twist, so I think it's fair to ask.
I'm using Teradata. And I have 2 tables: table1, table2.
table1 has only an id column.
table2 has the following columns: id, val
I might be wrong but I think these two statements give the same results.
Statement 1.
SELECT table1.id, table2.val
FROM table1
INNER JOIN table2
ON table1.id = table2.id
WHERE table2.val<100
Statement 2.
SELECT table1.id, table3.val
FROM table1
INNER JOIN (
SELECT *
FROM table2
WHERE val<100
) table3
ON table1.id=table3.id
My question is, will the query optimizer be smart enough to
- execute the WHERE clause first then JOIN later in Statement 1
- know that table 3 isn't actually needed in Statement 2
I'm pretty new to SQL, so please educate me if I'm misunderstanding anything.

This depends on many things (table size, indexes, key distribution, etc.), so you should just check the execution plan.
You don't say which database, but here are some ways:
MySql EXPLAIN
SQL Server SET SHOWPLAN_ALL (Transact-SQL)
Oracle EXPLAIN PLAN
what is explain in teradata?
Teradata Capture and compare plans faster with Visual Explain and XML plan logging

Depending on the availability of statistics and indexes for the tables in question, the query rewrite mechanism in the optimizer may or may not opt to scan Table2 for records where val < 100 before scanning Table1.
In certain situations, based on data demographics, joins, indexing and statistics you may find that the optimizer is not eliminating records in the query plan when you feel that it should. Even if you have a derived table such as the one in your example. You can force the optimizer to process a derived table by simply placing a GROUP BY in your derived table. The optimizer is then obligated to resolve the GROUP BY aggregate before it can consider resolving the join between the two tables in your example.
SELECT table1.id, table3.val
FROM table1
INNER JOIN (
SELECT table2.id, table2.val
FROM table2
WHERE val<100
GROUP BY 1,2
) table3
ON table1.id=table3.id
This is not to say that your standard approach should be to use this throughout your code. It is typically one of my last resorts when a query plan simply doesn't eliminate extraneous records early enough and results in too much data being scanned and carried around through the various SPOOL files. It is simply a technique to keep in your toolkit for when you encounter such a situation.
The query rewrite mechanism is continually being updated from one release to the next and the details about how it works can be found in the SQL Transaction Processing Manual for Teradata 13.0.

Unless I'm missing something, Why do you even need Table1??
Just query Table2
Select id, val
From table2
WHERE val<100
Or are you using the rows in table1 as a filter? I.e., does table1 only contain a subset of the ids in table2?
If so, then this will work as well ...
Select id, val
From table2
Where val<100
And id In (Select id
From table1)
But to answer your question: yes, the query optimizer should be intelligent enough to figure out the best order in which to execute the steps necessary to translate your logical instructions into a physical result. It uses the stored statistics that the database maintains on each table to determine what to do (what type of join logic to use, for example), as well as what order to perform the operations in, so as to minimize disk I/O and processing costs.

Q1. execute the WHERE clause first then JOIN later in Statement 1
The thing is, if you switch the order of inner join, i.e. table2 INNER JOIN table1, then I guess WHERE clause can be processed before JOIN operation, during the preparation phase. However, I guess even if you don't change the original query, the optimizer should be able to switch their order, if it thinks the join operation will be too expensive with fetching the whole row, so it will apply WHERE first. Just my guess.
Q2. know that table 3 isn't actually needed in Statement 2
Teradata will interpret your second query in such a way that the derived table is necessary, so it will still process the operations involving table3.

SQL - table alias scope

I've just learned ( yesterday ) to use "exists" instead of "in".
BAD
select * from table where nameid in (
select nameid from othertable where otherdesc = 'SomeDesc' )
GOOD
select * from table t where exists (
select nameid from othertable o where t.nameid = o.nameid and otherdesc = 'SomeDesc' )
And I have some questions about this:
1) The explanation as I understood was: "The reason why this is better is because only the matching values will be returned instead of building a massive list of possible results". Does that mean that while the first subquery might return 900 results the second will return only 1 ( yes or no )?
2) In the past I have had the RDBMS complain: "only the first 1000 rows might be retrieved". Would this second approach solve that problem?
3) What is the scope of the alias in the second subquery?... does the alias only lives in the parenthesis?
for example
select * from table t where exists (
select nameid from othertable o where t.nameid = o.nameid and otherdesc = 'SomeDesc' )
AND
select nameid from othertable o where t.nameid = o.nameid and otherdesc = 'SomeOtherDesc' )
That is, if I use the same alias ( o for table othertable ) In the second "exist" will it present any problem with the first exists? or are they totally independent?
Is this something Oracle only related or it is valid for most RDBMS?
Thanks a lot

It's specific to each DBMS and depends on the query optimizer. Some optimizers detect IN clause and translate it.
In all DBMSes I tested, alias is only valid inside the ( )
BTW, you can rewrite the query as:
select t.*
from table t
join othertable o on t.nameid = o.nameid
and o.otherdesc in ('SomeDesc','SomeOtherDesc');
And, to answer your questions:
1) Yes
2) Yes
3) Yes

You are treading into complicated territory, known as 'correlated sub-queries'. Since we don't have detailed information about your tables and the key structures, some of the answers can only be 'maybe'.
In your initial IN query, the notation would be valid whether or not OtherTable contains a column NameID (and, indeed, whether OtherDesc exists as a column in Table or OtherTable - which is not clear in any of your examples, but presumably is a column of OtherTable). This behaviour is what makes a correlated sub-query into a correlated sub-query. It is also a routine source of angst for people when they first run into it - invariably by accident. Since the SQL standard mandates the behaviour of interpreting a name in the sub-query as referring to a column in the outer query if there is no column with the relevant name in the tables mentioned in the sub-query but there is a column with the relevant name in the tables mentioned in the outer (main) query, no product that wants to claim conformance to (this bit of) the SQL standard will do anything different.
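The name-resolution rule described above is easy to see on a toy schema. The sketch below uses SQLite from Python; main_table stands in for the question's "table", and othertable is deliberately created without a nameid column (both layouts are assumptions for illustration). The unqualified nameid inside the subquery silently resolves to the outer table, so every outer row matches; qualifying the column with the inner alias turns the mistake into an error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE main_table (nameid INTEGER);
CREATE TABLE othertable (other_id INTEGER, otherdesc TEXT);  -- no nameid column
INSERT INTO main_table VALUES (1), (2), (3);
INSERT INTO othertable VALUES (99, 'SomeDesc');
""")

# Unqualified: "nameid" in the subquery is taken from main_table (outer query),
# which makes the subquery correlated and the IN test trivially true.
unqualified = conn.execute("""
    SELECT nameid FROM main_table
    WHERE nameid IN (SELECT nameid FROM othertable WHERE otherdesc = 'SomeDesc')
    ORDER BY nameid
""").fetchall()

# Qualified: o.nameid can only mean othertable.nameid, which does not exist.
try:
    conn.execute("""
        SELECT t.nameid FROM main_table t
        WHERE t.nameid IN
          (SELECT o.nameid FROM othertable o WHERE o.otherdesc = 'SomeDesc')
    """)
    qualified_error = None
except sqlite3.OperationalError as exc:
    qualified_error = str(exc)

print(unqualified)      # [(1,), (2,), (3,)]
print(qualified_error)  # the engine reports the missing column
```

Qualifying every column with its table alias is a cheap way to avoid running into this by accident.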
The answer to your Q1 is "it depends", but given plausible assumptions (NameID exists as a column in both tables; OtherDesc only exists in OtherTable), the results should be the same in terms of the data set returned, but may not be equivalent in terms of performance.
The answer to your Q2 is that in the past, you were using an inferior if not defective DBMS. If it supported EXISTS, then the DBMS might still complain about the cardinality of the result.
The answer to your Q3 as applied to the first EXISTS query is "t is available as an alias throughout the statement, but o is only available as an alias inside the parentheses". As applied to your second example box - with AND connecting two sub-selects (the second of which is missing the open parenthesis when I'm looking at it), then "t is available as an alias throughout the statement and refers to the same table, but there are two different aliases both labelled 'o', one for each sub-query". Note that the query might return no data if OtherDesc is unique for a given NameID value in OtherTable; otherwise, it requires two rows in OtherTable with the same NameID and the two OtherDesc values for each row in Table with that NameID value.
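A quick check of the alias-scope point, using SQLite from Python with made-up rows (main_table again stands in for "table", and the second sub-select is written with its own EXISTS). Both subqueries use the alias o without interfering with each other, and only a NameID that has both descriptions in othertable is returned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE main_table (nameid INTEGER);
CREATE TABLE othertable (nameid INTEGER, otherdesc TEXT);
INSERT INTO main_table VALUES (1), (2), (3);
INSERT INTO othertable VALUES
  (1, 'SomeDesc'), (1, 'SomeOtherDesc'),  -- nameid 1 has both descriptions
  (2, 'SomeDesc');                        -- nameid 2 has only one
""")

# "t" is visible throughout the statement; each "o" exists only inside
# its own pair of parentheses, so reusing the name is not a conflict.
both = conn.execute("""
    SELECT t.nameid FROM main_table t
    WHERE EXISTS (SELECT 1 FROM othertable o
                  WHERE o.nameid = t.nameid AND o.otherdesc = 'SomeDesc')
      AND EXISTS (SELECT 1 FROM othertable o
                  WHERE o.nameid = t.nameid AND o.otherdesc = 'SomeOtherDesc')
    ORDER BY t.nameid
""").fetchall()

print(both)  # [(1,)]
```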

1) Oracle-specific: When you write a query using the IN clause, you're telling the rule-based optimizer that you want the inner query to drive the outer query. When you write EXISTS in a where clause, you're telling the optimizer that you want the outer query to be run first, using each value to fetch a value from the inner query. See "Difference between IN and EXISTS in subqueries".
2) Probably.
3) Alias declared inside subquery lives inside subquery. By the way, I don't think your example with 2 ANDed subqueries is valid SQL. Did you mean UNION instead of AND?

Personally I would use a join, rather than a subquery for this.
SELECT t.*
FROM yourTable t
INNER JOIN otherTable ot
ON (t.nameid = ot.nameid AND ot.otherdesc = 'SomeDesc')

It is difficult to generalize that EXISTS is always better than IN. Logically, if that were the case, the SQL community would have replaced IN with EXISTS...
Also, please note that IN and EXISTS are not the same; the results may differ between the two...
With IN, there is usually one full table scan of the inner table, and NULLs are not removed (so if the inner table contains NULLs, IN will not filter them out by default)... EXISTS, by contrast, removes NULLs, and in the case of a correlated subquery it runs the inner query for every row of the outer query.
Assuming there are no NULLs and it is a simple query (with no correlation), EXISTS might perform better if the row you are looking for is not the last row. If it happens to be the last row, EXISTS may need to scan to the end just like IN, so the performance is similar...
But IN and EXISTS are not interchangeable...
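The NULL difference is clearest in the negated forms. The sketch below (SQLite from Python, with made-up tables outer_t and inner_t) shows IN and EXISTS agreeing on the positive test, while NOT IN returns nothing as soon as the inner column contains a NULL and NOT EXISTS still returns the unmatched rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE outer_t (x INTEGER);
CREATE TABLE inner_t (y INTEGER);
INSERT INTO outer_t VALUES (1), (2), (3);
INSERT INTO inner_t VALUES (1), (NULL);
""")


def xs(sql):
    return [row[0] for row in conn.execute(sql)]


in_rows = xs("SELECT x FROM outer_t WHERE x IN (SELECT y FROM inner_t) ORDER BY x")
exists_rows = xs("""
    SELECT x FROM outer_t o
    WHERE EXISTS (SELECT 1 FROM inner_t i WHERE i.y = o.x) ORDER BY x""")

# x NOT IN (1, NULL) is never TRUE: for x = 2 it evaluates to NULL (unknown),
# so the row is filtered out.
not_in_rows = xs("SELECT x FROM outer_t WHERE x NOT IN (SELECT y FROM inner_t) ORDER BY x")

# NOT EXISTS only asks whether a matching row was found; i.y = o.x is never
# TRUE for the NULL row, so the NULL simply does not match anything.
not_exists_rows = xs("""
    SELECT x FROM outer_t o
    WHERE NOT EXISTS (SELECT 1 FROM inner_t i WHERE i.y = o.x) ORDER BY x""")

print(in_rows, exists_rows)          # [1] [1]
print(not_in_rows, not_exists_rows)  # [] [2, 3]
```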
