Oracle semi-join with multiple tables in SQL subquery - sql

This question is about how to work around the apparent Oracle limitation on semi-joins with multiple tables in the subquery. I have the following two UPDATE statements.
Update 1:
UPDATE
    (SELECT a.flag update_column
     FROM a, b
     WHERE a.id = b.id AND
           EXISTS (SELECT NULL
                   FROM c
                   WHERE c.id2 = b.id2 AND
                         c.time BETWEEN start_in AND end_in) AND
           EXISTS (SELECT NULL
                   FROM TABLE(update_in) d
                   WHERE b.time BETWEEN d.start_time AND d.end_time))
SET update_column = 'F'
The execution plan indicates that this correctly performs two semi-joins, and the update executes in seconds. These need to be semi-joins because c.id2 is not a unique foreign key on b.id2, unlike b.id and a.id. And update_in doesn't have any constraints at all, since it's an array.
Update 2:
UPDATE
    (SELECT a.flag update_column
     FROM a, b
     WHERE a.id = b.id AND
           EXISTS (SELECT NULL
                   FROM c, TABLE(update_in) d
                   WHERE c.id2 = b.id2 AND
                         c.time > d.time AND
                         b.time BETWEEN d.start_time AND d.end_time))
SET update_column = 'F'
This does not do a semi-join; based on the Oracle documentation, I believe that's because the EXISTS subquery contains two tables. Due to the sizes of the tables, and partitioning, this update takes hours. However, there is no way to relate d.time to the associated d.start_time and d.end_time other than their being on the same row. And the reason we pass in the update_in array and join it here is that running this query in a loop for each time/start_time/end_time combination also gave poor performance.
Is there a reason other than the two tables that the semi-join might not be working? If not, is there a way around this limitation? Is there some simple solution I am missing that could make these criteria work without putting two tables in the subquery?

As Bob suggests, you can use a Global Temporary Table (GTT) with the same structure as your update_in array. The key difference is that you can create indexes on the GTT, and if you populate it with representative sample data, you can also gather statistics on it, so the optimizer is better able to choose an optimal query plan.
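A minimal sketch of that approach (the GTT's columns mirror the update_in record type; all names here are assumptions, since the original type definition isn't shown):

```sql
-- Hypothetical names: a GTT shaped like the update_in array's records.
CREATE GLOBAL TEMPORARY TABLE update_in_gtt (
    time       TIMESTAMP,
    start_time TIMESTAMP,
    end_time   TIMESTAMP
) ON COMMIT PRESERVE ROWS;

CREATE INDEX update_in_gtt_ix ON update_in_gtt (start_time, end_time);

-- Populate it from the collection before running the UPDATE,
-- then reference update_in_gtt in place of TABLE(update_in) d.
INSERT INTO update_in_gtt (time, start_time, end_time)
SELECT d.time, d.start_time, d.end_time
FROM TABLE(update_in) d;
```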
That said, there are also some other notable differences between your two queries:
In the first EXISTS clause of your first query you refer to two columns, start_in and end_in, that have no table references. My guess is that they are either columns in table a or b, or variables within the current scope of your SQL statement; it's not clear which.
In your second query you refer to column d.time; however, you don't use that column in the first query.
Does updating your second query to the following improve its performance?
UPDATE
    (SELECT a.flag update_column
     FROM a, b
     WHERE a.id = b.id AND
           EXISTS (SELECT NULL
                   FROM c, TABLE(update_in) d
                   WHERE c.id2 = b.id2 AND
                         c.time BETWEEN start_in AND end_in AND
                         c.time > d.time AND
                         b.time BETWEEN d.start_time AND d.end_time))
SET update_column = 'F'

Related

Optimize Join GreenPlum

I have table A with 20 million records. There is table B with 200,000 records.
I want to do a join like:
select *
from tableA a
left join tableB b
on ((a.name1 = b.name1 OR a.name1 = b.name2) OR a.id = b.id)
and a.time > b.time
;
This is very time consuming.
I am using GreenPlum so I cannot make use of indexes.
How can I optimize this?
The number of rows in table B is incremental and will keep increasing.
Greenplum does support indexes. However, this query is tricky: no matter which distribution column you choose, there is no way to co-locate the join, for the following reasons.
a.time or b.time is a bad candidate for distribution, since it is compared with a ">" operator rather than equality.
You could distribute tableA by (name1, id) and tableB by (name1, name2, id), but to check whether a.time > b.time is satisfied you still need to see all tuples.
I'm afraid this query is not a very MPP-friendly one.
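One common workaround (a sketch using the question's table names) is to split the OR conditions into separate equality joins combined with UNION, so each branch can be hashed and co-located on its own key. Note this reproduces only the matched rows of the original LEFT JOIN; unmatched tableA rows would need an extra NOT EXISTS branch.

```sql
-- Each branch joins on a single equality predicate; UNION removes
-- rows matched by more than one branch, approximating the OR semantics.
SELECT a.*, b.*
FROM tableA a JOIN tableB b ON a.name1 = b.name1 AND a.time > b.time
UNION
SELECT a.*, b.*
FROM tableA a JOIN tableB b ON a.name1 = b.name2 AND a.time > b.time
UNION
SELECT a.*, b.*
FROM tableA a JOIN tableB b ON a.id = b.id AND a.time > b.time;
```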

In SQL is there a way to use select * on a join?

Using Snowflake, I have 2 tables, one with many columns and the other with a few. Trying to select * on their join, I get the following error:
SQL compilation error: duplicate column name
which makes sense, because my joining columns are in both tables. I could probably use a select with column names instead of *, but is there a way I could avoid that? Or at least have the query infer the column names dynamically from any table it gets?
I am quite sure Snowflake will let you select everything from both halves of two or more tables via
SELECT a.*, b.*
FROM table_a AS a
JOIN table_b AS b
ON a.x = b.x
What you will not be able to do is refer to the name of the duplicated column indirectly in ORDER BY (or GROUP BY); thus this will not work:
SELECT a.*, b.*
FROM table_a AS a
JOIN table_b AS b
ON a.x = b.x
ORDER BY x
Even though some databases know that, because you have JOIN ON a.x = b.x, there is only one x, Snowflake will not allow it (well, it didn't last time I tried).
But with the above you can use the alias-qualified name or the output column position, so both of the following will work:
SELECT a.*, b.*
FROM table_a AS a
JOIN table_b AS b
ON a.x = b.x
ORDER BY a.x
SELECT a.*, b.*
FROM table_a AS a
JOIN table_b AS b
ON a.x = b.x
ORDER BY 1 -- assuming x is the first column
In general the * and a.* forms are super convenient, but they are actually bad for performance.
When selecting * you are at risk of getting the columns back in a different order if the table has been recreated, which makes the code unstable to read. This also impacts VIEWs.
It also means all the metadata for the table needs to be loaded to know the complete shape of the output; whereas if you want only x, y, z, the query plan can be compiled faster, even after a column w is later added to the table.
Lastly, if you write SELECT * FROM table in a sub-select and only a subset of those columns is needed, the execution compiler has to prune the unused ones. And if every column is attached to a correctly aliased table, then when a second table later adds a column of the same name, the naked columns don't suddenly become ambiguous; that error only surfaces when the SQL is next run, which might be in an "annual report" that doesn't happen that often. Wow, what a long use-aliases rant.
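To illustrate that last point with hypothetical tables, the unqualified form breaks only when the schema later changes:

```sql
-- Works today: only orders has a status column.
SELECT status
FROM orders o JOIN customers c ON o.cust_id = c.id;

-- If customers later gains its own status column, the query above starts
-- failing with an ambiguity error, while the aliased form keeps working:
SELECT o.status
FROM orders o JOIN customers c ON o.cust_id = c.id;
```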
You can prefix the name of the column with the name of the table:
select table_a.id, table_b.name from table_a join table_b using (id)
The same works in combination with *:
select table_a.id, table_b.* from table_a join table_b using (id)
It works in "join" and "where" parts of the statement as well
select table_a.id, table_b.* from table_a join table_b
on table_a.id = table_b.id where table_b.name LIKE 'b%'
You can use table aliases to make the statement shorter:
select a.id, b.* from table_a a join table_b b
on a.id = b.id
Aliases can also be applied to columns, for use in subqueries, client software and (depending on the SQL server) in other parts of the statement, for example ORDER BY:
select a.id as a_id, b.* from table_a a join table_b b
on a.id = b.id order by a_id
If you're after a result that includes all the distinct non-join columns from each table, with the join columns included in the output only once (they will be identical for an inner join), you can use NATURAL JOIN.
e.g.
select * from d1 natural inner join d2 order by id;
See examples: https://docs.snowflake.com/en/sql-reference/constructs/join.html#examples
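Recent Snowflake versions also support an EXCLUDE modifier on the star, which drops the duplicate join column from one side without listing every column (hedged: check that your account's version has it; x is the shared join column from the earlier examples):

```sql
-- b.* EXCLUDE (x) returns all of b's columns except the duplicate x.
SELECT a.*, b.* EXCLUDE (x)
FROM table_a AS a
JOIN table_b AS b
  ON a.x = b.x;
```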

Database UPDATE SET when joining tables and the value to update is not unique due to cartesian product

How does PostgreSQL behave (and perhaps this is covered by standard SQL) when the record used by UPDATE ... SET
is not unique because the join produces a cartesian product?
Imagine for some reason the b table contains:
(1,1),(1,1),(1,2)
What will the value used for the update be (or does the database make a cartesian product, or create records, or something else)?
UPDATE table_a a
SET b_value = b.value
FROM (SELECT id, value FROM mdm.table_b) AS b
WHERE b.a_id = a.id;
Your query is not stable. Bad things might happen.
The documentation is clear about that:
When using FROM you should ensure that the join produces at most one output row for each row to be modified. In other words, a target row shouldn't join to more than one row from the other table(s). If it does, then only one of the join rows will be used to update the target row, but which one will be used is not readily predictable.
Then:
Because of this indeterminacy, referencing other tables only within sub-selects is safer, though often harder to read and slower than using a join.
If you were to follow the documentation's advice, you could phrase the query as:
update table_a a
set b_value = (select max(b.value) from table_b b where b.a_id = a.id)
where exists (select 1 from table_b b where b.a_id = a.id)
The aggregate function in the subquery ensures that a single row will be returned (you could as well use min()). You can also express this with from:
update table_a a
set b_value = b.value
from (select a_id, max(value) as value from table_b group by a_id) as b
where b.a_id = a.id;
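In PostgreSQL specifically, DISTINCT ON is another way to guarantee at most one source row per a_id; the ORDER BY decides deterministically which row wins (here the largest value, matching the max() variant above):

```sql
-- DISTINCT ON (a_id) keeps the first row per a_id in ORDER BY order.
UPDATE table_a a
SET b_value = b.value
FROM (SELECT DISTINCT ON (a_id) a_id, value
      FROM table_b
      ORDER BY a_id, value DESC) AS b
WHERE b.a_id = a.id;
```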

Deleting rows from a join of two tables and rownum condition in Oracle DB

Learning PL/SQL with Oracle DB and trying to accomplish the following:
I have two tables a and b. I am joining them on id, add several conditions and then try removing resulting rows only from table a in a batch size of 1000. Base query looks like this:
DELETE (SELECT *
FROM SCHEMA.TABLEA a
INNER JOIN SCHEMA.TABLEB b ON a.b_id = b.id
WHERE par=0 AND ROWNUM <= 1000);
This obviously doesn’t work as I am trying to manipulate a view: “data manipulation operation not legal on this view”
How can I rewrite this?
You can only delete from a table, not from a join, so you need to handle the join condition in a WHERE clause.
Your DELETE statement could be, e.g.:
DELETE FROM SCHEMA.TABLEA a
WHERE a.b_id IN (SELECT b.id FROM SCHEMA.TABLEB b)
AND par = 0 AND ROWNUM <= 1000
You can write a simple query which checks whether the rows in TABLEA that are required to be deleted exist in TABLEB.
DELETE
FROM schema.tablea a
WHERE par = 0
AND EXISTS (SELECT 1 FROM schema.tableb b WHERE a.b_id = b.id)
AND rownum <= 1000;
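To work through the whole table in batches of 1000 (the question's goal), that statement can be wrapped in a simple PL/SQL loop; a sketch, committing after each batch:

```sql
BEGIN
  LOOP
    DELETE FROM schema.tablea a
     WHERE par = 0
       AND EXISTS (SELECT 1 FROM schema.tableb b WHERE a.b_id = b.id)
       AND ROWNUM <= 1000;
    EXIT WHEN SQL%ROWCOUNT = 0;  -- stop once nothing is left to delete
    COMMIT;
  END LOOP;
  COMMIT;
END;
/
```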

SQL query on large tables fast at first then slow

The query below returns the initial results fast and then becomes extremely slow.
SELECT A.Id
, B.Date1
FROM A
LEFT OUTER JOIN B
ON A.Id = B.Id AND A.Flag = 'Y'
AND (B.Date1 IS NOT NULL AND A.Date >= B.Date2 AND A.Date < B.Date1)
Table A has 24 million records and Table B has 500 thousand records.
Index for Table A is on columns: Id and Date
Index for Table B is on columns: Id, Date2, Date1 - Date1 is nullable - index is unique
The first 11 million records are returned quite fast, and then it suddenly becomes extremely slow. The execution plan shows the indexes are used.
However, when I remove the condition A.Date < B.Date1, the query becomes fast again.
Do you know what should be done to improve the performance? Thanks
UPDATE:
I updated the query to show that I need fields of Table B in the result. You might wonder why I used a left join when I have the condition "B.Date1 is not null"; that's because I have posted a simplified query. My performance issue occurs even with this simplified version.
You can maybe try using EXISTS. It should be faster, as it stops looking for further rows once a match is found, unlike a JOIN, where all the rows have to be fetched and joined.
select id
from a
where flag = 'Y'
and exists (
select 1
from b
where a.id = b.id
and a.date >= b.date2
and a.date < b.date1
and date1 is not null
);
Generally, what I've noticed with SQL performance is that the shape of the data you are joining matters: for instance, ONE-to-ONE relationships are much faster than ONE-to-MANY relationships.
I've seen a ONE-to-MANY join from a table with 3,000 items to a table with 30,000 items easily take 11-15 seconds, even with LIMIT. But the same query, redesigned with all ONE-to-ONE relationships, would take less than 1 second.
So my suggestion to speed up your query.
According to Left Outer Join (desc), "LEFT JOIN and LEFT OUTER JOIN are the same", so it doesn't matter which one you use.
But ideally you should use INNER, because in your question you stated B.Date1 IS NOT NULL.
Based on parent columns in join selection (desc), you can use a parent column in a SELECT within a JOIN, provided the derived table is LATERAL:
SELECT a.Id
FROM A a
INNER JOIN LATERAL (
    SELECT b.Id, COUNT(1) AS `TotalLinks`
    FROM B b
    WHERE b.Date1 IS NOT NULL
      AND a.Date >= b.Date2
      AND a.Date < b.Date1
    GROUP BY b.Id
) AS `ab` ON a.Id = ab.Id
WHERE a.Flag = 'Y' AND ab.TotalLinks > 0
LIMIT 0, 500
Also, try to LIMIT the data you want; this reduces the filtering the database needs to do.