BigQuery job suddenly started failing today due to a correlated subquery - google-bigquery

My BigQuery job, which ran fine until yesterday, started failing with the error below.
Error: Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN
Query:
with result as (
select
*
from
(
select * from `project.dataset_stage.=non_split_daily_temp`
union all
select * from `project.dataset_stage.split_daily_temp`
)
)
select
*
from
result final
where
not (
exists
(
select
1
from
`project.dataset.bqt_sls_cust_xref` target
where
final.sls_dte = target.sls_dte and
final.rgs_id = target.rgs_id
) and
unlinked = 'Y' and
cardmatched = 'Y'
)
Can someone please help with this? I would like to know why it suddenly broke and how to fix it permanently.

Thank you for the suggestion.
We figured out the cause of the issue. The explanation is as follows.
When one writes a correlated subquery like
select T2.col, (select count(*) from T1 where T1.col = T2.col) from T2
the SQL text technically implies that the subquery must be re-executed for every row of T2.
If T2 has a billion rows, T1 would have to be scanned a billion times. That would take forever and the query would never finish.
The cost of executing the query drops from O(size T1 * size T2) to O(size T1 + size T2) if it is implemented as below:
-- count(T1.col) rather than count(*), so rows of T2 without a match count as 0
select any_value(T2.col), count(T1.col) from
T2 left join T1 on T1.col = T2.col
group by T2.primary_key
BigQuery errors out if it can't find a way to optimize a correlated subquery into linear cost O(size T1 + size T2).
We have plenty of patterns that we recognize and rewrite for correlated subqueries, but apparently the new view definition made the subquery too complex, and the query optimizer was unable to find a way to run it with a linear-complexity algorithm.
Google will probably fix the issue by identifying a better algorithm.
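The rewrite described above can be checked on toy data. The following is a minimal sketch, assuming SQLite (through Python's `sqlite3` module) as a stand-in for BigQuery; the table contents are made up, and `MAX` stands in for `any_value`, which older SQLite versions lack. It only shows that the two query shapes return the same rows, not how BigQuery's optimizer behaves.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T1 (col INTEGER);
    CREATE TABLE T2 (primary_key INTEGER PRIMARY KEY, col INTEGER);
    INSERT INTO T1 VALUES (1), (1), (2);
    INSERT INTO T2 VALUES (10, 1), (20, 2), (30, 3);
""")

# Correlated form: the subquery is logically evaluated once per row of T2.
correlated = conn.execute("""
    SELECT T2.col, (SELECT COUNT(*) FROM T1 WHERE T1.col = T2.col)
    FROM T2
    ORDER BY T2.primary_key
""").fetchall()

# De-correlated form: one LEFT JOIN plus GROUP BY.
# COUNT(T1.col) ignores the NULLs produced for unmatched rows of T2.
decorrelated = conn.execute("""
    SELECT MAX(T2.col), COUNT(T1.col)
    FROM T2 LEFT JOIN T1 ON T1.col = T2.col
    GROUP BY T2.primary_key
    ORDER BY T2.primary_key
""").fetchall()

print(correlated)
print(decorrelated)
```

Both queries return one row per row of T2, including a count of 0 for the value that has no match in T1.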

I don't know why it suddenly broke, but seemingly, your query can be rewritten with OUTER JOIN:
with result as (
select
*
from
(
select * from `project.dataset_stage.=non_split_daily_temp`
union all
select * from `project.dataset_stage.split_daily_temp`
)
)
select
*
from
result final LEFT OUTER JOIN `project.dataset.bqt_sls_cust_xref` target
ON final.sls_dte = target.sls_dte and
final.str_id = target.str_id and
final.rgs_id = target.rgs_id
where
target.<id_column> IS NULL AND -- No match found, equivalent to NOT EXISTS (<correlated sub-query>)
was_unlinked_run = 'Y' and
is_card_matched = 'Y'
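To see that the LEFT JOIN ... IS NULL pattern matches NOT EXISTS, here is a small sketch on made-up data, again assuming SQLite via Python's `sqlite3` rather than BigQuery. The table names `staging` and `xref` are hypothetical stand-ins for the real tables, and only the two join columns are modelled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging (sls_dte TEXT, rgs_id INTEGER);
    CREATE TABLE xref (sls_dte TEXT, rgs_id INTEGER);
    INSERT INTO staging VALUES ('2021-01-01', 1), ('2021-01-01', 2), ('2021-01-02', 1);
    INSERT INTO xref VALUES ('2021-01-01', 1);
""")

# Correlated NOT EXISTS: keep staging rows with no matching xref row.
not_exists_rows = conn.execute("""
    SELECT f.sls_dte, f.rgs_id
    FROM staging f
    WHERE NOT EXISTS (
        SELECT 1 FROM xref t
        WHERE f.sls_dte = t.sls_dte AND f.rgs_id = t.rgs_id
    )
    ORDER BY f.sls_dte, f.rgs_id
""").fetchall()

# Anti-join: unmatched rows carry NULLs in every xref column after the LEFT JOIN.
anti_join_rows = conn.execute("""
    SELECT f.sls_dte, f.rgs_id
    FROM staging f
    LEFT JOIN xref t ON f.sls_dte = t.sls_dte AND f.rgs_id = t.rgs_id
    WHERE t.rgs_id IS NULL
    ORDER BY f.sls_dte, f.rgs_id
""").fetchall()

print(not_exists_rows)
print(anti_join_rows)
```

The IS NULL test is applied to a join key here, which is safe because a matched row always has a non-NULL key; testing a nullable non-key column could misclassify matched rows.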

Related

SQL takes very long to execute

This SQL statement left joins two tables, both with approx. 10.000 rows (table1 = 20 columns, table2 = 50+ columns), and it takes 60+ seconds to execute. Is there a way to make it faster?
SELECT
t.*, k.*
FROM
table1 AS t
LEFT JOIN
table2 AS k ON t.key_Table1 = k.Key_Table2
WHERE
((t.Time) = (SELECT MAX(t2.Time) FROM table1 AS t2
WHERE t2.key2_Table1 = t.key2_Table1))
ORDER BY
t.Time;
The ideal execution time would be < 5 seconds, since Excel query does it in 8 secs, and it is very surprising that Excel query would work faster than a SQL Server Express query.
You can also rewrite your query like this:
select *
from table2 as k
join (
select *, row_number() over (partition by key2_Table1 order by Time desc) rn
from table1
) t
on t.rn = 1
and t.key_Table1 = k.Key_Table2
Note that you need an index on the Key_Table2, Time, and key_Table1 columns if you don't already have one.
Another improvement would be to select only the columns you need instead of select *.
The optimizer is determining that a merge join is best, but if both tables have 10,000 rows and they aren't joined on indexed columns, forcing the optimizer's hand with a hash join hint may improve performance.
The syntax would be to change LEFT JOIN to LEFT HASH JOIN:
https://learn.microsoft.com/en-us/previous-versions/sql/sql-server-2008/ms191426(v=sql.100)
https://learn.microsoft.com/en-us/sql/relational-databases/performance/joins?view=sql-server-ver15
https://learn.microsoft.com/en-us/sql/t-sql/queries/hints-transact-sql-join?view=sql-server-ver15
I would recommend rewriting the query using outer apply:
SELECT t.*, k.*
FROM table1 t OUTER APPLY
(SELECT TOP (1) k.*
FROM table2 k
WHERE t.key_Table1 = k.Key_Table2
ORDER BY Time DESC
) k
ORDER BY t.Time;
And for this query, you want an index on table2(Key_Table2, time desc).
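The "latest row per group" rewrites above can be compared on a toy table. This is a sketch only, assuming SQLite 3.25 or newer (for window functions) via Python's `sqlite3` instead of SQL Server; the sample rows are invented and only table1 is modelled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (key_Table1 INTEGER, key2_Table1 TEXT, Time TEXT);
    INSERT INTO table1 VALUES
        (1, 'a', '2021-01-01'),
        (2, 'a', '2021-01-03'),
        (3, 'b', '2021-01-02');
""")

# Original shape: correlated MAX subquery evaluated per outer row.
correlated_max = conn.execute("""
    SELECT t.key_Table1, t.key2_Table1, t.Time
    FROM table1 AS t
    WHERE t.Time = (SELECT MAX(t2.Time) FROM table1 AS t2
                    WHERE t2.key2_Table1 = t.key2_Table1)
    ORDER BY t.Time
""").fetchall()

# Window-function shape: rank rows per key2_Table1 and keep the newest one.
windowed = conn.execute("""
    SELECT key_Table1, key2_Table1, Time
    FROM (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY key2_Table1 ORDER BY Time DESC) AS rn
        FROM table1
    ) AS ranked
    WHERE rn = 1
    ORDER BY Time
""").fetchall()

print(correlated_max)
print(windowed)
```

The two forms agree when each group has a single latest Time. With ties, the correlated form returns every tied row while ROW_NUMBER keeps exactly one.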

Get the latest entry time using SQL if your result returns two different times: should I use cross or outer apply?

So I want to use datediff for two tables that I'm doing a join on. The problem is if I filter by a unique value, it returns two rows of result. For example:
select *
from [internalaudit]..ReprocessTracker with (nolock)
where packageID = '1983446'
It returns two rows, because it was repackaged twice, by two different workers.
User RepackageTime
KimVilder 2021-06-10
DanielaS 2021-06-05
I want to use the latest repackagetime of that unique packageID and then do a datediff with another time record when I do a join with a different table.
Is there a way to filter so I can get the latest RepackageTime entry?
There are numerous ways you can accomplish this, if I understand your goal - proper example data and tables would be a help here.
One way is using apply and selecting the max date for each packageId
select DateDiff(datepart, t.datecolumn, r.RepackageTime)...
from othertable t
cross apply (
select Max(RepackageTime) as RepackageTime
from internalaudit.dbo.ReprocessTracker r
where r.packageId=t.packageId
)r
select *
from Othertable t1
join (
select *
from [internalaudit]..ReprocessTracker t2
where packageID = '1983446'
order by RepackageTime desc -- latest entry first
limit 1
) t2
on t1.id = t2.id
If you are using SQL Server, use top 1 instead of limit 1.
Also, unless you have a solid reason to use the nolock hint, avoid it.
To generalize the query above:
select *
from Othertable t1
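cross join lateral ( -- a correlated derived table needs LATERAL (CROSS APPLY in SQL Server)
select *
from [internalaudit]..ReprocessTracker t2
where t1.packageID = t2.packageID
order by t2.RepackageTime desc -- latest entry first
limit 1
) t2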
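As a runnable illustration of "take the latest RepackageTime per package, then compute a date difference", here is a sketch assuming SQLite via Python's `sqlite3`. The `OtherTable` schema, its `ShippedTime` column, the `UserName` column name, and the shipped date are all hypothetical; SQLite has no DATEDIFF, so `julianday` subtraction is used in its place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ReprocessTracker (packageID TEXT, UserName TEXT, RepackageTime TEXT);
    INSERT INTO ReprocessTracker VALUES
        ('1983446', 'KimVilder', '2021-06-10'),
        ('1983446', 'DanielaS',  '2021-06-05');
    -- Hypothetical second table holding the other timestamp to compare against.
    CREATE TABLE OtherTable (packageID TEXT, ShippedTime TEXT);
    INSERT INTO OtherTable VALUES ('1983446', '2021-06-01');
""")

# Collapse the tracker to one row per package (the latest time), then join.
rows = conn.execute("""
    SELECT t.packageID,
           r.LatestRepackage,
           CAST(julianday(r.LatestRepackage) - julianday(t.ShippedTime) AS INTEGER) AS days_between
    FROM OtherTable t
    JOIN (
        SELECT packageID, MAX(RepackageTime) AS LatestRepackage
        FROM ReprocessTracker
        GROUP BY packageID
    ) r ON r.packageID = t.packageID
""").fetchall()

print(rows)
```

Aggregating before the join guarantees one tracker row per package, so the join cannot duplicate rows of the other table.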

Performance of JOIN then UNION vs. UNION then JOIN

I have a SQL query along the following lines:
WITH a AS (
SELECT *
FROM table1
INNER JOIN table3 ON table1.id = table3.id
WHERE table1.condition = 'something'
),
b AS (
SELECT *
FROM table2
INNER JOIN table3 ON table2.id = table3.id
WHERE table2.condition = 'something else'
),
combined AS (
SELECT *
FROM a
UNION
SELECT *
FROM b
)
SELECT *
FROM combined
I rewrote this as:
WITH a AS (
SELECT *
FROM table1
WHERE table1.condition = 'something'
),
b AS (
SELECT *
FROM table2
WHERE table2.condition = 'something else'
),
combined AS (
SELECT *
FROM (
SELECT *
FROM a
UNION
SELECT *
FROM b
) unioned
INNER JOIN table3 ON unioned.id = table3.id
)
SELECT *
FROM combined
I expected that this might be more performant, since it's only doing the JOIN once, or at the very least that it would have no effect on execution time. I was surprised to find that the query now takes almost twice as long to run.
This is no problem since it worked perfectly well before, I only really rewrote it out of my own personal style preference anyway so I'll stick with the original. But I'm no expert when it comes to databases/SQL, so I was interested to know if anyone can share any insights as to why this second approach is so much less performant?
If it makes a difference, it's a Redshift database, table1 and table2 are both around ~250 million rows, table3 is ~1 million rows, and combined has less than 1000 rows.
The SQL optimizer has more information on "bare" tables than on "computed" tables. So, it is easier to optimize the two CTEs.
In a database that uses indexes, this might affect index usage. In Redshift, this might incur additional data movement.
In this particular case, though, I suspect the issue has to do with filtering via the JOIN operation. The UNION incurs overhead to remove duplicates. When the JOIN filters rows before the UNION, duplicate removal has fewer rows to process than when the filtering happens afterwards.
In addition, the UNION may affect where the data is located, so the second version might require additional data movement.
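The point about duplicate removal can be illustrated by counting how many rows reach the UNION in each form. This is a toy sketch assuming SQLite via Python's `sqlite3`; it says nothing about Redshift's planner or data movement. The column name `cond` and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, cond TEXT);
    CREATE TABLE table2 (id INTEGER, cond TEXT);
    CREATE TABLE table3 (id INTEGER);
    INSERT INTO table1 VALUES (1, 'something'), (2, 'something'), (3, 'something'), (4, 'other');
    INSERT INTO table2 VALUES (1, 'something else'), (5, 'something else'), (6, 'something else');
    INSERT INTO table3 VALUES (1);
""")

# Form 1: join each branch to table3 first, then UNION.
join_then_union = conn.execute("""
    SELECT table1.id, table1.cond FROM table1
    JOIN table3 ON table1.id = table3.id WHERE table1.cond = 'something'
    UNION
    SELECT table2.id, table2.cond FROM table2
    JOIN table3 ON table2.id = table3.id WHERE table2.cond = 'something else'
    ORDER BY 1, 2
""").fetchall()

# Form 2: UNION the filtered tables first, then join once.
union_then_join = conn.execute("""
    SELECT u.id, u.cond
    FROM (
        SELECT id, cond FROM table1 WHERE cond = 'something'
        UNION
        SELECT id, cond FROM table2 WHERE cond = 'something else'
    ) u
    JOIN table3 ON u.id = table3.id
    ORDER BY 1, 2
""").fetchall()

# Rows that the duplicate-removal step has to process in each form.
rows_into_union_form1 = conn.execute("""
    SELECT COUNT(*) FROM (
        SELECT table1.id FROM table1
        JOIN table3 ON table1.id = table3.id WHERE table1.cond = 'something'
        UNION ALL
        SELECT table2.id FROM table2
        JOIN table3 ON table2.id = table3.id WHERE table2.cond = 'something else'
    )
""").fetchone()[0]
rows_into_union_form2 = conn.execute("""
    SELECT COUNT(*) FROM (
        SELECT id FROM table1 WHERE cond = 'something'
        UNION ALL
        SELECT id FROM table2 WHERE cond = 'something else'
    )
""").fetchone()[0]

print(join_then_union, union_then_join)
print(rows_into_union_form1, rows_into_union_form2)
```

On this data both forms return the same rows, but the second form de-duplicates three times as many rows. With a selective join against a small table3, that gap grows with the size of table1 and table2.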

SQL SELECT compare values from two tables (without UNION ALL)

I have table T1:
ID IMPACT
1 3
I have table T2
PRIORITY URGENCY
1 2
I need to do the SELECT from T1 table.
I would like to get all the rows from T1 where IMPACT is greater than PRIORITY from T2.
I am working in some IBM application where it is only possible to start with SQL statement after the WHERE clause from the first table T1.
So query (unfortunately) must always start with "SELECT * FROM T1 WHERE..."
This cannot be changed (please have that in mind).
This means that I cannot use some JOIN or UNION ALL statement after the "FROM T1" part because I can start to write SQL query only after the WHERE clause.
SELECT * FROM T1
WHERE
IMPACT> SELECT PRIORITY FROM T2 WHERE URGENCY=2
But I am getting an error for this statement.
Is it possible to write a SQL query starting with:
SELECT * FROM T1
WHERE
You want a subquery, so all you need are parentheses:
SELECT *
FROM T1
WHERE IMPACT > (SELECT T2.PRIORITY FROM T2 WHERE T2.URGENCY = 2)
This assumes that the subquery returns one row (or zero rows, in which case nothing is returned). If the subquery can return more than one row, you should ask another question and be very explicit about what you want done.
One reasonable interpretation (for more than one row) is:
SELECT *
FROM T1
WHERE IMPACT > (SELECT MAX(T2.PRIORITY) FROM T2 WHERE T2.URGENCY = 2)
I would use exists:
select t1.*
from t1
where exists (select 1 from t2 where t1.IMPACT > t2.PRIORITY);
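The MAX and EXISTS answers are not interchangeable once T2 has several matching rows: MAX means "greater than every priority", EXISTS means "greater than at least one". A small sketch, assuming SQLite via Python's `sqlite3` and invented rows; the `URGENCY = 2` filter is added to the EXISTS form here so both queries look at the same T2 rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T1 (ID INTEGER, IMPACT INTEGER);
    CREATE TABLE T2 (PRIORITY INTEGER, URGENCY INTEGER);
    INSERT INTO T1 VALUES (1, 3), (2, 2), (3, 1);
    INSERT INTO T2 VALUES (1, 2), (2, 2), (5, 1);
""")

# Scalar subquery with MAX: IMPACT must exceed every PRIORITY with URGENCY = 2.
greater_than_all = conn.execute("""
    SELECT ID FROM T1
    WHERE IMPACT > (SELECT MAX(T2.PRIORITY) FROM T2 WHERE T2.URGENCY = 2)
    ORDER BY ID
""").fetchall()

# EXISTS: IMPACT must exceed at least one PRIORITY with URGENCY = 2.
greater_than_any = conn.execute("""
    SELECT ID FROM T1
    WHERE EXISTS (SELECT 1 FROM T2
                  WHERE T1.IMPACT > T2.PRIORITY AND T2.URGENCY = 2)
    ORDER BY ID
""").fetchall()

print(greater_than_all)
print(greater_than_any)
```

Both queries still start with "SELECT ... FROM T1 WHERE", so either fits the constraint in the question; which one is right depends on the intended comparison.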

Should I avoid IN() because slower than EXISTS() [duplicate]

Possible Duplicate:
SQL Server IN vs. EXISTS Performance
Should I avoid IN() because slower than EXISTS()?
SELECT * FROM TABLE1 t1 WHERE EXISTS (SELECT 1 FROM TABLE2 t2 WHERE t1.ID = t2.ID)
VS
SELECT * FROM TABLE1 t1 WHERE t1.ID IN(SELECT t2.ID FROM TABLE2 t2)
From my investigation, I set SHOWPLAN_ALL. I get the same execution plan and estimated cost. The index (PK) is used, with a seek in both queries. No difference.
In what other scenarios would the two queries differ significantly? Is the optimizer smart enough that I always get the same execution plan?
Do neither. Do this:
SELECT DISTINCT T1.*
FROM TABLE1 t1
JOIN TABLE2 t2 ON t1.ID = t2.ID;
This will outperform anything else by orders of magnitude.
Both queries will produce the same execution plan (assuming no indexes were created): two table scans and one nested loop (join).
The join, suggested by Bohemian, will do a Hash Match instead of the loop, which I've always heard (and here is a proof: Link) is the worst kind of join.
Between IN and EXISTS (your actual question), EXISTS gives better performance (take a look at: Link)
If your table T2 has a lot of records, EXISTS is the better approach hands down, because when your database finds a record that matches your requirement, the condition evaluates to true and the scan of T2 stops. With the IN clause, however, you're scanning Table2 for every row in Table1.
IN is better than EXISTS when you have a bunch of values, or few values in the subquery.
I expanded my answer a little, based on an Ask Tom answer:
In a SELECT with IN, for example:
Select * from T1 where x in ( select y from T2 )
is usually processed as:
select *
from t1, ( select distinct y from t2 ) t2
where t1.x = t2.y;
The subquery is evaluated, distinct'ed, indexed (or hashed or sorted) and then joined to the original table (typically).
In an exist query like:
select * from t1 where exists ( select null from t2 where y = x )
That is processed more like:
for x in ( select * from t1 )
loop
if ( exists ( select null from t2 where y = x.x ) )
then
OUTPUT THE RECORD
end if
end loop
It always results in a full scan of T1 whereas the first query can make use of an index on T1(x).
When is WHERE EXISTS appropriate and when is IN appropriate?
Use EXISTS when the subquery on T2 is huge and takes a long time, T1 is relatively small, and executing (select null from t2 where y = x.x) is very, very fast.
Use IN when the result of the subquery is small; then IN is typically more appropriate.
If both the subquery and the outer table are huge -- either might work as well as the other -- depends on the indexes and other factors.
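The three query shapes discussed in this thread (EXISTS, IN, and DISTINCT with a JOIN) can be checked for equivalent results on toy data. This is a sketch assuming SQLite via Python's `sqlite3`, with invented rows and a hypothetical `name` column. It compares results only and makes no claim about which plan SQL Server or Oracle would choose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TABLE1 (ID INTEGER, name TEXT);
    CREATE TABLE TABLE2 (ID INTEGER);
    INSERT INTO TABLE1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO TABLE2 VALUES (2), (2), (3), (4);
""")

exists_rows = conn.execute("""
    SELECT * FROM TABLE1 t1
    WHERE EXISTS (SELECT 1 FROM TABLE2 t2 WHERE t1.ID = t2.ID)
    ORDER BY t1.ID
""").fetchall()

in_rows = conn.execute("""
    SELECT * FROM TABLE1 t1
    WHERE t1.ID IN (SELECT t2.ID FROM TABLE2 t2)
    ORDER BY t1.ID
""").fetchall()

# The plain join would return ID 2 twice; DISTINCT removes that duplicate.
distinct_join_rows = conn.execute("""
    SELECT DISTINCT t1.*
    FROM TABLE1 t1
    JOIN TABLE2 t2 ON t1.ID = t2.ID
    ORDER BY t1.ID
""").fetchall()

print(exists_rows)
print(in_rows)
print(distinct_join_rows)
```

Note that the DISTINCT JOIN form would also collapse rows that are genuine duplicates in TABLE1, which EXISTS and IN preserve, so it is only equivalent when TABLE1 rows are unique.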