Performance of JOIN then UNION vs. UNION then JOIN - sql

I have a SQL query along the following lines:
WITH a AS (
SELECT *
FROM table1
INNER JOIN table3 ON table1.id = table3.id
WHERE table1.condition = 'something'
),
b AS (
SELECT *
FROM table2
INNER JOIN table3 ON table2.id = table3.id
WHERE table2.condition = 'something else'
),
combined AS (
SELECT *
FROM a
UNION
SELECT *
FROM b
)
SELECT *
FROM combined
I rewrote this as:
WITH a AS (
SELECT *
FROM table1
WHERE table1.condition = 'something'
),
b AS (
SELECT *
FROM table2
WHERE table2.condition = 'something else'
),
combined AS (
SELECT *
FROM (
SELECT *
FROM a
UNION
SELECT *
FROM b
) u
INNER JOIN table3 ON u.id = table3.id
)
SELECT *
FROM combined
I expected that this might be more performant, since it's only doing the JOIN once, or at the very least that it would have no effect on execution time. I was surprised to find that the query now takes almost twice as long to run.
This is no problem, since the original worked perfectly well and I only rewrote it out of personal style preference, so I'll stick with the original. But I'm no expert when it comes to databases/SQL, so I'd be interested to know if anyone can share any insight into why the second approach is so much less performant.
If it makes a difference, it's a Redshift database; table1 and table2 each have around 250 million rows, table3 has about 1 million rows, and combined has fewer than 1000 rows.

The SQL optimizer has more information on "bare" tables than on "computed" tables. So the two CTEs in the first version, which join base tables directly, are easier to optimize.
In a database that uses indexes, this might affect index usage. In Redshift, this might incur additional data movement.
In this particular case, though, I suspect the issue might have to do with filtering via the JOIN operation. The UNION incurs overhead to remove duplicates. When the JOIN filters rows before the UNION, duplicate removal has fewer rows to process than when the filtering happens afterwards.
In addition, the UNION may affect where the data is located, so the second version might require additional data movement.
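The duplicate-removal point can be illustrated with a minimal sketch. It uses SQLite from Python rather than Redshift, with made-up rows and a column named cond standing in for the question's condition column, so it only shows row counts and result equivalence, not Redshift's actual plan or data movement.

```python
import sqlite3

# Hypothetical toy data; table names follow the question, the rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, cond TEXT);
    CREATE TABLE table2 (id INTEGER, cond TEXT);
    CREATE TABLE table3 (id INTEGER);
    INSERT INTO table1 VALUES (1, 'something'), (2, 'something'),
                              (3, 'something'), (4, 'other');
    INSERT INTO table2 VALUES (2, 'something else'), (5, 'something else'),
                              (6, 'something else');
    INSERT INTO table3 VALUES (2), (5);
""")

# Version 1: JOIN inside each branch, then UNION.
branches_joined = """
    SELECT t1.id FROM table1 t1 JOIN table3 t3 ON t1.id = t3.id
    WHERE t1.cond = 'something'
    {op}
    SELECT t2.id FROM table2 t2 JOIN table3 t3 ON t2.id = t3.id
    WHERE t2.cond = 'something else'
"""

# Version 2: UNION the filtered tables first, JOIN afterwards.
branches_plain = """
    SELECT id FROM table1 WHERE cond = 'something'
    {op}
    SELECT id FROM table2 WHERE cond = 'something else'
"""

rows_v1 = sorted(conn.execute(branches_joined.format(op="UNION")).fetchall())
rows_v2 = sorted(conn.execute(
    "SELECT u.id FROM (" + branches_plain.format(op="UNION") + ") u "
    "JOIN table3 t3 ON u.id = t3.id"
).fetchall())

# UNION ALL keeps duplicates, so counting its output gives the number of
# rows the UNION has to de-duplicate in each version.
union_input_v1 = conn.execute(
    "SELECT COUNT(*) FROM (" + branches_joined.format(op="UNION ALL") + ")"
).fetchone()[0]
union_input_v2 = conn.execute(
    "SELECT COUNT(*) FROM (" + branches_plain.format(op="UNION ALL") + ")"
).fetchone()[0]

print(rows_v1 == rows_v2, union_input_v1, union_input_v2)
```

Both versions return the same ids, but in this toy data the UNION in the second version has twice as many input rows to de-duplicate. With 250 million-row inputs, that difference is where the extra cost would plausibly come from.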

Related

Is it better to call table once in a CTE, then call the CTE multiple times or call the 1 table multiple times in different CTEs

Sample Query 1
WITH sample_1 AS (
SELECT * FROM table_1
),
transform_1 AS (
SELECT * FROM sample_1 JOIN table_2 on ..
),
transform_2 AS (
SELECT * FROM sample_1 JOIN table_3 on ..
)
SELECT * FROM transform_1 JOIN transform_2
Sample Query 2
WITH
transform_1 AS (
SELECT * FROM table_1 JOIN table_2 on ..
),
transform_2 AS (
SELECT * FROM table_1 JOIN table_3 on ..
)
SELECT * FROM transform_1 JOIN transform_2
I'm trying to make my code more efficient and easier to read.
Some general guidelines I would follow: use a CTE if at least one of the following is true:
- It encapsulates some meaningful application logic, such as predicates, joins and aggregations, especially if this logic is used more than once in the query (for example: sales from last year).
- It contains SQL operations with the potential to be shared between multiple parts of the query, such as doing a scan, join or aggregation just once for multiple consumers.
Without any predicates, it is unlikely the above CTEs will get any benefit, but let's assume (using sample query 1) transform_1 has a predicate
WHERE sample_1.a in (10, 20)
and transform_2 has a predicate
WHERE sample_1.a in (20, 30)
That would allow us to make a CTE for sample_1 with a predicate
WHERE table_1.a in (10, 20, 30)
That could result in a better performing plan. So, this would be an example of my second bullet above. I've seen this kind of optimization with the ORCA optimizer in Greenplum (based on Postgres), but am not sure whether the Postgres planner will also choose the same optimization.
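As a minimal sketch of that merged-predicate idea, the following runs both shapes in SQLite from Python and checks that they return the same rows. The join conditions (on id) and the x and y columns are assumptions, since the original queries elide them, and SQLite's planner says nothing about what ORCA or the Postgres planner would do.

```python
import sqlite3

# Hypothetical schema and rows; only the table names and column "a" come from the text.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_1 (id INTEGER, a INTEGER);
    CREATE TABLE table_2 (id INTEGER, x TEXT);
    CREATE TABLE table_3 (id INTEGER, y TEXT);
    INSERT INTO table_1 VALUES (1, 10), (2, 20), (3, 30), (4, 40);
    INSERT INTO table_2 VALUES (1, 'x1'), (2, 'x2'), (3, 'x3'), (4, 'x4');
    INSERT INTO table_3 VALUES (1, 'y1'), (2, 'y2'), (3, 'y3'), (4, 'y4');
""")

# Shared CTE: table_1 is filtered once with the union of both consumers' predicates.
shared_cte = """
    WITH sample_1 AS (
        SELECT id, a FROM table_1 WHERE a IN (10, 20, 30)
    ),
    transform_1 AS (
        SELECT s.id, t2.x FROM sample_1 s JOIN table_2 t2 ON s.id = t2.id
        WHERE s.a IN (10, 20)
    ),
    transform_2 AS (
        SELECT s.id, t3.y FROM sample_1 s JOIN table_3 t3 ON s.id = t3.id
        WHERE s.a IN (20, 30)
    )
    SELECT transform_1.id, x, y
    FROM transform_1 JOIN transform_2 ON transform_1.id = transform_2.id
"""

# Separate access: each consumer reads table_1 directly with its own predicate.
separate = """
    WITH transform_1 AS (
        SELECT t1.id, t2.x FROM table_1 t1 JOIN table_2 t2 ON t1.id = t2.id
        WHERE t1.a IN (10, 20)
    ),
    transform_2 AS (
        SELECT t1.id, t3.y FROM table_1 t1 JOIN table_3 t3 ON t1.id = t3.id
        WHERE t1.a IN (20, 30)
    )
    SELECT transform_1.id, x, y
    FROM transform_1 JOIN transform_2 ON transform_1.id = transform_2.id
"""

shared_rows = conn.execute(shared_cte).fetchall()
separate_rows = conn.execute(separate).fetchall()
print(shared_rows, separate_rows)
```

The results are identical either way; whether the shared form is faster depends on whether the engine actually evaluates sample_1 once and reuses it.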

SQL join performance operation order

I am trying to come up with how to order a join query to improve its performance.
Let's say we have two tables to join, to which some filters must be applied.
Is it the same to do:
table1_result = select * from table1 where field1 = 'A';
table2_result = select * from table2 where field1 = 'A';
result = select * from table1_result as one inner join table2_result as two on one.field1 = two.field1;
to doing this:
result = select * from table1 as one inner join table2 as two on one.field1 = two.field1
where one.field1 = 'A' and two.field1 = 'A';
or even doing this:
result = select * from table1 as one inner join table2 as two on one.field1 = two.field1 and one.field1 = 'A';
Thank you so much!!
Some common optimization techniques to improve your queries:
Index the columns used in joins. If they are foreign keys, databases like MySQL normally index them already.
Index the columns used in conditions or the WHERE clause.
Avoid * and explicitly select only the columns that you really need.
In most cases the order of the joins won't matter, because DB engines are intelligent enough to decide that themselves.
So it's better to analyze the structure of both joined tables and have indexes in place.
And if anyone is further interested in how changing the order of conditions can help performance, I have a detailed answer in "mysql Slow query issue".
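For the question's three formulations, a minimal sketch in SQLite from Python (with invented rows and hypothetical v and w columns) suggests they are logically equivalent for an INNER JOIN. It also shows that the equivalence does not carry over to outer joins, where moving a filter between ON and WHERE changes the result.

```python
import sqlite3

# Hypothetical rows; table and column names follow the question, v and w are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (field1 TEXT, v INTEGER);
    CREATE TABLE table2 (field1 TEXT, w INTEGER);
    INSERT INTO table1 VALUES ('A', 1), ('A', 2), ('B', 3);
    INSERT INTO table2 VALUES ('A', 10), ('B', 20), ('C', 30);
""")

def run(sql):
    """Execute a query and return its rows in a stable order."""
    return sorted(conn.execute(sql).fetchall())

# 1) Filter each table first, then join the filtered results.
prefiltered = run("""
    SELECT one.v, two.w
    FROM (SELECT * FROM table1 WHERE field1 = 'A') AS one
    INNER JOIN (SELECT * FROM table2 WHERE field1 = 'A') AS two
        ON one.field1 = two.field1
""")

# 2) Join, then filter in the WHERE clause.
where_filtered = run("""
    SELECT one.v, two.w
    FROM table1 AS one INNER JOIN table2 AS two ON one.field1 = two.field1
    WHERE one.field1 = 'A' AND two.field1 = 'A'
""")

# 3) Put the filter in the ON clause.
on_filtered = run("""
    SELECT one.v, two.w
    FROM table1 AS one INNER JOIN table2 AS two
        ON one.field1 = two.field1 AND one.field1 = 'A'
""")

# With a LEFT JOIN the placement matters: a filter in ON keeps unmatched
# left rows (padded with NULL), while a filter in WHERE removes them.
left_on = conn.execute("""
    SELECT one.v, two.w
    FROM table1 AS one LEFT JOIN table2 AS two
        ON one.field1 = two.field1 AND two.field1 = 'A'
""").fetchall()
left_where = conn.execute("""
    SELECT one.v, two.w
    FROM table1 AS one LEFT JOIN table2 AS two ON one.field1 = two.field1
    WHERE two.field1 = 'A'
""").fetchall()

print(prefiltered == where_filtered == on_filtered, len(left_on), len(left_where))
```

Since the inner-join variants are equivalent, an optimizer is free to pick the same plan for all of them, which is why indexing usually matters more than how the filters are written.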

BigQuery job suddenly started failing today due to correlated subquery

My BigQuery job, which was executing fine until yesterday, started failing with the error below.
Error:- Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN
Query:-
with result as (
select
*
from
(
select * from `project.dataset_stage.=non_split_daily_temp`
union all
select * from `project.dataset_stage.split_daily_temp`
)
)
select
*
from
result final
where
not (
exists
(
select
1
from
`project.dataset.bqt_sls_cust_xref` target
where
final.sls_dte = target.sls_dte and
final.rgs_id = target.rgs_id
) and
unlinked = 'Y' and
cardmatched = 'Y'
)
Can someone please assist me with this? I would like to know the reason for the sudden breakage and how to fix the issue permanently.
Thank you for the suggestion.
We figured out the cause of the issue; the explanation we received is below.
When one writes a correlated subquery like
select T2.col, (select count(*) from T1 where T1.col = T2.col) from T2
Technically, the SQL text implies that the subquery needs to be re-executed for every row of T2.
If T2 has a billion rows, then we would need to scan T1 a billion times. That would take forever and the query would never finish.
The cost of executing the query drops from O(size T1 * size T2) to O(size T1 + size T2) if it is implemented as below:
-- count(T1.col) rather than count(*), so that rows of T2 with no match in T1 count as 0
select any_value(t.col), count(T1.col) from
T2 t left join T1 on T1.col = t.col
group by t.primary_key
BigQuery errors out if it can't find a way to optimize a correlated subquery into linear cost O(size T1 + size T2).
We have plenty of patterns that we recognize and rewrite for correlated subqueries, but apparently the new view definition made the subquery too complex, and the query optimizer was unable to find a way to run it with a linear-complexity algorithm.
Google will probably fix the issue by identifying a better algorithm.
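The rewrite described above can be checked on a toy example. This is a minimal sketch using SQLite from Python, not BigQuery: the rows are invented, and MIN() stands in for any_value(), which may not be available in SQLite. It also shows why the join form should count a T1 column rather than use count(*) if it is to match the correlated subquery for unmatched rows.

```python
import sqlite3

# Hypothetical rows; T1, T2, col and primary_key follow the names in the text.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T1 (col TEXT);
    CREATE TABLE T2 (primary_key INTEGER, col TEXT);
    INSERT INTO T1 VALUES ('a'), ('a'), ('b');
    INSERT INTO T2 VALUES (1, 'a'), (2, 'b'), (3, 'c');
""")

# Correlated form: conceptually re-runs the inner count for every row of T2.
correlated = conn.execute("""
    SELECT T2.col, (SELECT COUNT(*) FROM T1 WHERE T1.col = T2.col)
    FROM T2
    ORDER BY T2.primary_key
""").fetchall()

# De-correlated form: one LEFT JOIN plus GROUP BY.
# COUNT(T1.col) ignores the NULLs produced for unmatched rows.
joined = conn.execute("""
    SELECT MIN(t.col), COUNT(T1.col)
    FROM T2 t LEFT JOIN T1 ON T1.col = t.col
    GROUP BY t.primary_key
    ORDER BY t.primary_key
""").fetchall()

# Same join with COUNT(*): the unmatched row 'c' is counted as 1, not 0.
joined_count_star = conn.execute("""
    SELECT MIN(t.col), COUNT(*)
    FROM T2 t LEFT JOIN T1 ON T1.col = t.col
    GROUP BY t.primary_key
    ORDER BY t.primary_key
""").fetchall()

print(correlated, joined, joined_count_star)
```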
I don't know why it suddenly broke, but seemingly, your query can be rewritten with OUTER JOIN:
with result as (
select
*
from
(
select * from `project.dataset_stage.=non_split_daily_temp`
union all
select * from `project.dataset_stage.split_daily_temp`
)
)
select
*
from
result final LEFT OUTER JOIN `project.dataset.bqt_sls_cust_xref` target
ON final.sls_dte = target.sls_dte and
final.str_id = target.str_id and
final.rgs_id = target.rgs_id
where
target.<id_column> IS NULL AND -- No match found; equivalent to NOT EXISTS (<correlated sub-query>)
was_unlinked_run = 'Y' and
is_card_matched = 'Y'
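The equivalence between NOT EXISTS and a LEFT OUTER JOIN with an IS NULL check can be sketched as follows. This uses SQLite from Python with hypothetical tables staged and xref and invented rows, not the real BigQuery tables. The equivalence relies on the column tested for NULL never being NULL in the target table itself.

```python
import sqlite3

# Hypothetical stand-ins for the staged union ("final") and the xref target table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staged (sls_dte TEXT, rgs_id INTEGER);
    CREATE TABLE xref (sls_dte TEXT, rgs_id INTEGER NOT NULL);
    INSERT INTO staged VALUES ('2020-01-01', 1), ('2020-01-01', 2), ('2020-01-02', 1);
    INSERT INTO xref VALUES ('2020-01-01', 1);
""")

# Correlated anti-join: keep staged rows with no matching xref row.
not_exists_rows = conn.execute("""
    SELECT f.sls_dte, f.rgs_id
    FROM staged f
    WHERE NOT EXISTS (
        SELECT 1 FROM xref t
        WHERE f.sls_dte = t.sls_dte AND f.rgs_id = t.rgs_id
    )
    ORDER BY f.sls_dte, f.rgs_id
""").fetchall()

# Join form: unmatched staged rows get NULLs for every xref column,
# so testing a NOT NULL xref column for NULL selects exactly those rows.
left_join_rows = conn.execute("""
    SELECT f.sls_dte, f.rgs_id
    FROM staged f
    LEFT OUTER JOIN xref t
        ON f.sls_dte = t.sls_dte AND f.rgs_id = t.rgs_id
    WHERE t.rgs_id IS NULL
    ORDER BY f.sls_dte, f.rgs_id
""").fetchall()

print(not_exists_rows, left_join_rows)
```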

Can I select several tables in the same WITH query?

I have a long query with a with structure. At the end of it, I'd like to output two tables. Is this possible?
(The tables and queries are in snowflake SQL by the way.)
The code looks like this:
with table_a as (
select id,
product_a
from x.x ),
table_b as (
select id,
product_b
from x.y ),
table_c as (
..... many more alias tables and subqueries here .....
)
select * from table_g where z = 3 ;
But for the very last row, I'd like to query table_g twice, once with z = 3 and once with another condition, so I get two tables as the result. Is there a way of doing that (ending with two queries rather than just one) or do I have to re-run the whole code for each table I want as output?
One query = one result set. That's just the way that RDBMSs work.
A CTE (WITH statement) is just syntactic sugar for a subquery.
For instance, a query similar to yours:
with table_a as (
select id,
product_a
from x.x ),
table_b as (
select id,
product_b
from x.y ),
table_c as (
select id,
product_c
from x.z )
select *
from table_a
inner join table_b on table_a.id = table_b.id
inner join table_c on table_b.id = table_c.id;
Is 100% identical to:
select *
from
(select id, product_a from x.x) table_a
inner join (select id, product_b from x.y) table_b
on table_a.id = table_b.id
inner join (select id, product_c from x.z) table_c
on table_b.id = table_c.id
The CTE version doesn't give you any extra features that aren't available in the non-cte version (with the exception of a recursive cte) and the execution path will be 100% the same (EDIT: Please see Simon's answer and comment below where he notes that Snowflake may materialize the derived table defined by the CTE so that it only has to perform that step once should the CTE be referenced multiple times in the main query). As such there is still no way to get a second result set from the single query.
While the two forms are the same syntactically, they don't necessarily have the same execution plan.
One case is when one of the stages in the CTE is expensive and is reused many times by other CTEs or joins. Under Snowflake, when such a stage is written as a CTE, I have witnessed it running the "expensive" part only a single time, which can be good. For example:
WITH expensive_select AS (
SELECT a.a, b.b, c.c
FROM table_a AS a
JOIN table_b AS b
JOIN table_c AS c
WHERE complex_filters
), do_some_thing_with_results AS (
SELECT a, stuff
FROM expensive_select
WHERE filters_1
), do_some_agregation AS (
SELECT a, SUM(b) as sum_b
FROM expensive_select
WHERE filters_2
GROUP BY a
)
SELECT a.a
,a.b
,b.stuff
,c.sum_b
FROM expensive_select AS a
LEFT JOIN do_some_thing_with_results AS b ON a.a = b.a
LEFT JOIN do_some_agregation AS c ON a.a = c.a;
This was originally unrolled, and the expensive part was some views where the date range filter applied at the top level was not getting pushed down (due to window functions), which resulted in full table scans, multiple times. By pushing the filters into the CTE, the cost was paid once. (In our case, putting date range filters in the CTE made Snowflake notice the filters and push them down into the view. Things can change, though: a few weeks later the original code ran as well as the modified version, so they "fixed" something.)
In other cases, the different paths that used the CTE each used smaller subsets of its results, so using the CTE reduced remote IO and improved performance, although there were then more stalls in the execution plan.
I also use CTEs like this to make the code easier to read, by giving the CTE a meaningful name and then aliasing it to something short for use. Really love that.

SQL query: how to translate IN() into a JOIN?

I have a lot of SQL queries like this:
SELECT o.Id, o.attrib1, o.attrib2
FROM table1 o
WHERE o.Id IN (
SELECT DISTINCT Id
FROM table1
, table2
, table3
WHERE ...
)
These queries have to run on different database engines (MySql, Oracle, DB2, MS-Sql, Hypersonic), so I can only use common SQL syntax.
Here I read, that with MySql the IN statement isn't optimized and it's really slow, so I want to switch this into a JOIN.
I tried:
SELECT o.Id, o.attrib1, o.attrib2
FROM table1 o, table2, table3
WHERE ...
But this does not take into account the DISTINCT keyword.
Question: How do I get rid of the duplicate rows using the JOIN approach?
To write this with a JOIN you can use an inner select and join with that:
SELECT o.Id, o.attrib1, o.attrib2 FROM table1 o
JOIN (
SELECT DISTINCT Id FROM table1, table2, table3 WHERE ...
) T1
ON o.id = T1.Id
I'm not sure this will be much faster, but maybe... you can try it for yourself.
In general restricting yourself only to SQL that will work on multiple databases is not going to result in the best performance.
"But this does not take into account the DISTINCT keyword."
You do not need the DISTINCT in the sub-query. The IN will return one row in the outer query regardless of whether it matches one row or one hundred rows in the sub-query. So, if you want to improve the performance of the query, junking that DISTINCT would be a good start.
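A minimal sketch of this point, using SQLite from Python with invented rows and a simplified two-table version of the query: IN gives the same result with or without DISTINCT in the sub-query, whereas a plain JOIN repeats the outer row once per match and does need de-duplication.

```python
import sqlite3

# Hypothetical rows; table2 deliberately contains a duplicate Id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (Id INTEGER, attrib1 TEXT);
    CREATE TABLE table2 (Id INTEGER);
    INSERT INTO table1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO table2 VALUES (1), (1), (2);
""")

def ids(sql):
    """Run a query and return the first column of each row, sorted."""
    return sorted(row[0] for row in conn.execute(sql))

# IN is a membership test, so duplicates in the sub-query cannot multiply rows.
in_with_distinct = ids(
    "SELECT o.Id FROM table1 o WHERE o.Id IN (SELECT DISTINCT Id FROM table2)")
in_without_distinct = ids(
    "SELECT o.Id FROM table1 o WHERE o.Id IN (SELECT Id FROM table2)")

# A plain JOIN returns one output row per matching pair, so Id 1 appears twice.
plain_join = ids(
    "SELECT o.Id FROM table1 o JOIN table2 t ON o.Id = t.Id")

# Joining against a de-duplicated derived table restores the IN semantics.
join_on_distinct = ids(
    "SELECT o.Id FROM table1 o "
    "JOIN (SELECT DISTINCT Id FROM table2) t ON o.Id = t.Id")

print(in_with_distinct, in_without_distinct, plain_join, join_on_distinct)
```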
One way of tuning IN clauses is to rewrite them using EXISTS instead. Depending on the distribution of data this may be a lot more efficient, or it may be slower. With tuning, the benchmark is king.
SELECT o.Id, o.attrib1, o.attrib2
FROM table1 o
WHERE EXISTS (
SELECT Id FROM table1 t1, table2 t2, table3 t3 WHERE ...
AND ( t1.id = o.id
or t2.id = o.id
or t3.id = o.id
)
)
Not knowing your business logic the precise formulation of that additional filter may be wrong.
Incidentally, I notice that you have table1 in both the outer query and the sub-query. If that is not a mistake in transcribing your actual SQL to here, you may want to consider whether that makes sense. It would be better to avoid querying that table twice; using EXISTS may make it easier to avoid the double hit.
SELECT DISTINCT o.Id, o.attrib1, o.attrib2
FROM table1 o, table2, table3
WHERE ...
Though if you need to support a number of different database back ends you probably want to give each its own set of repository classes in your data layer, so you can optimize your queries for each. This also gives you the power to persist in other types of databases, or xml, or web services, or whatever should the need arise down the road.
I'm not sure I really understand what your problem is. Why don't you try this:
SELECT distinct o.Id, o.attrib1, o.attrib2
FROM
table1 o
, table o1
, table o2
...
where
o1.id1 = o.id
or o2.id = o.id