Converting Nested Subqueries into Mini Queries (SQL)

I have a lot of trouble reading nested subqueries; I personally prefer to write several mini queries and work from there. I understand that more advanced SQL users find it more efficient to write nested subqueries.
For instance, in the following query:
select distinct b.table_b, a.*
from table_a a
inner join table_c b
on a.id_1 = b.id_1
inner join ( select a.id_1, max(a.var_1) as max_var_1
             from table_a a
             group by a.id_1 ) c
on a.id_1 = c.id_1 and a.var_1 = c.max_var_1
Problem: I am trying to turn this into several different queries:
#PART 1 :
create table table_1 as select distinct b.table_b, a.*
from table_a a
inner join table_c b
on a.id_1 = b.id_1
#PART 2:
create table table_2 as select a.id_1, max(a.var_1) as max_var_1 from table_a a
group by a.id_1
#PART 3 (final result: final_table)
create table final_table as select a.*
from table_1 a
inner join table_2 b
on a.id_1 = b.id_1 and a.var_1 = b.max_var_1
My Question: Can someone please tell me if this is correct? Is this how the above nested subquery can be converted into 3 mini queries?

Materializing a subquery into a separate table is only worthwhile when you use its result multiple times. Even then, if the subquery returns many records, inserting them into a table is not recommended: a plain SELECT only reads data from disk, while an INSERT also writes to disk, so inserting many records can take considerably longer than selecting them.
P.S. When a subquery result does need to be materialized, "create temporary table" is the form most often used.
Another good option is a CTE (Common Table Expression). With a CTE, many databases keep the result of the SELECT in memory and execute the subquery only once; if the CTE is then referenced multiple times, the database reuses the cached result instead of re-executing it.

For your query, only #PART 2 is actually worth separating; the other two tables are unnecessary. But for better performance, I recommend writing the query without inserting into tables at all, using a CTE. For example:
with sub_query as (
    select id_1,
           max(var_1) as max_var_1
    from table_a
    group by id_1
)
select distinct b.table_b, a.*
from table_a a
inner join table_c b
    on a.id_1 = b.id_1
inner join sub_query c
    on a.id_1 = c.id_1 and a.var_1 = c.max_var_1;
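If you do want to materialize the aggregate (per the P.S. about temporary tables), a minimal sketch, assuming your database supports CREATE TEMPORARY TABLE ... AS and using the question's table and column names:

```sql
-- materialize only the aggregate, which is the one part worth separating
create temporary table max_vars as
select id_1, max(var_1) as max_var_1
from table_a
group by id_1;

-- then join against the temporary table instead of a nested subquery
select distinct b.table_b, a.*
from table_a a
inner join table_c b
    on a.id_1 = b.id_1
inner join max_vars c
    on a.id_1 = c.id_1 and a.var_1 = c.max_var_1;
```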

Related

Improving performance of a full outer join in Redshift

I need to complete a full outer join using two columns, date_local and hour_local, but am concerned about query performance when using an outer join as opposed to another type of join.
Below is the query using outer join:
SELECT *
FROM TABLE_A
FULL OUTER JOIN TABLE_B
USING (DATE_LOCAL, HOUR_LOCAL)
Would the following query perform better than the query above:
WITH JOIN_VALS AS (
    SELECT DATE_LOCAL, HOUR_LOCAL
    FROM TABLE_A
    UNION
    SELECT DATE_LOCAL, HOUR_LOCAL
    FROM TABLE_B
)
SELECT JV.DATE_LOCAL
     , JV.HOUR_LOCAL
     , TA.PLANNED
     , TB.ACTUAL
FROM JOIN_VALS JV
LEFT JOIN TABLE_A TA
    ON JV.DATE_LOCAL = TA.DATE_LOCAL
    AND JV.HOUR_LOCAL = TA.HOUR_LOCAL
LEFT JOIN TABLE_B TB
    ON JV.DATE_LOCAL = TB.DATE_LOCAL
    AND JV.HOUR_LOCAL = TB.HOUR_LOCAL;
Wondering if I get any performance improvement by isolating the unique join values first, rather than finding them during the outer join.
UNION can be expensive, and I don't think you will see any benefit from this construct in Redshift; more likely a performance loss. Redshift is a columnar database and sees no benefit from peeling off these columns first.
The big cost will come if the matches between the two tables on these two columns are many-to-many: that leads to additional row creation, which can make the query slow.
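Before choosing either form, it may be worth checking that many-to-many risk directly; a sketch using the question's table and column names:

```sql
-- find (DATE_LOCAL, HOUR_LOCAL) pairs that appear more than once;
-- if both tables have duplicates for the same pair, the join fans out
SELECT DATE_LOCAL, HOUR_LOCAL, COUNT(*) AS n
FROM TABLE_A
GROUP BY DATE_LOCAL, HOUR_LOCAL
HAVING COUNT(*) > 1;
```

Run the same check against TABLE_B; only pairs duplicated in both tables multiply rows in the join.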

Comparing efficiency of Hive queries with different join orders

Consider the following two queries in Hive:
SELECT
*
FROM
A
INNER JOIN
B
INNER JOIN
C
ON
A.COL = B.COL
AND A.COL = C.COL
and
SELECT
*
FROM
A
INNER JOIN
B
ON
A.COL = B.COL
INNER JOIN
C
ON
A.COL = C.COL
Question: Are the two queries computationally the same or different? In other words, to get the fastest results, should I prefer one over the other, or does it not matter? Thanks.
On Hive 1.2 (also tested on Hive 2.3), both on Tez, the optimizer is intelligent enough to derive the ON condition for the join with table B, and it performs two INNER JOINs, each with its own correct ON condition.
I checked with a simple query:
with A as (
select stack(3,1,2,3) as id
),
B as (
select stack(3,1,2,3) as id
),
C as (
select stack(3,1,2,3) as id
)
select * from A
inner join B
inner join C
ON A.id = B.id AND A.id = C.id
The EXPLAIN command shows that both joins are executed as a map-join on a single mapper, and each join has its own join condition. This is the explain output:
Map 1
File Output Operator [FS_17]
Map Join Operator [MAPJOIN_27] (rows=1 width=12)
Conds:FIL_24.col0=RS_12.col0(Inner),FIL_24.col0=RS_14.col0(Inner),HybridGraceHashJoin:true,Output:["_col0","_col1","_col2"]
At first I thought the first query would produce a CROSS join with table B, with the join to C then reducing the dataset, but both queries work the same (same plan, same execution), thanks to the optimizer.
I also tested the same with map-join switched off (set hive.auto.convert.join=false;) and again got exactly the same plan for both queries. I did not test it on really big tables, so you had better double-check.
So, computationally both are the same on Hive 1.2 and Hive 2.3, for both map-join and merge join on the reducer.
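To double-check this on your own cluster and data, you can compare the plans directly (A, B, C are the tables from the question):

```sql
-- run EXPLAIN on each variant and compare the outputs;
-- identical plans mean identical computation
EXPLAIN
SELECT *
FROM A
INNER JOIN B ON A.COL = B.COL
INNER JOIN C ON A.COL = C.COL;
```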

Can I select several tables in the same WITH query?

I have a long query with a with structure. At the end of it, I'd like to output two tables. Is this possible?
(The tables and queries are in snowflake SQL by the way.)
The code looks like this:
with table_a as (
select id,
product_a
from x.x ),
table_b as (
select id,
product_b
from x.y ),
table_c as (
..... many more alias tables and subqueries here .....
)
select * from table_g where z = 3 ;
But for the very last line, I'd like to query table_g twice: once with z = 3 and once with another condition, so that I get two tables as the result. Is there a way of doing that (ending with two result sets rather than just one), or do I have to re-run the whole code for each table I want as output?
One query = One result set. That's just the way that RDBMS's work.
A CTE (WITH statement) is just syntactic sugar for a subquery.
For instance, a query similar to yours:
with table_a as (
select id,
product_a
from x.x ),
table_b as (
select id,
product_b
from x.y ),
table_c as (
select id,
product_c
from x.z )
select *
from table_a
inner join table_b on table_a.id = table_b.id
inner join table_c on table_b.id = table_c.id;
Is 100% identical to:
select *
from
(select id, product_a from x.x) table_a
inner join (select id, product_b from x.y) table_b
on table_a.id = table_b.id
inner join (select id, product_c from x.z) table_c
on table_b.id = table_c.id
The CTE version doesn't give you any extra features that aren't available in the non-CTE version (with the exception of a recursive CTE), and the execution path will be 100% the same. (EDIT: Please see Simon's answer and comment below, where he notes that Snowflake may materialize the derived table defined by the CTE so that it only performs that step once if the CTE is referenced multiple times in the main query.) As such, there is still no way to get a second result set from a single query.
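One common workaround, if a single combined result set is acceptable, is to end the WITH chain with a UNION ALL of the two filtered queries plus a label column. A sketch, where `z = 4` is a hypothetical stand-in for the second condition:

```sql
-- replace the final "select * from table_g where z = 3;" with:
select 'z_eq_3' as result_set, *
from table_g
where z = 3
union all
select 'other' as result_set, *
from table_g
where z = 4;  -- hypothetical second condition
```

The consumer can then split the rows on the result_set column.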
While they are semantically the same, they don't necessarily get the same execution plan.
The first case is when one of the stages in the CTE is expensive and is reused by other CTEs or joined to many times. Under Snowflake, when these are written as a CTE, I have witnessed it running the "expensive" part only a single time, which can be good. For example, like this:
WITH expensive_select AS (
    SELECT a.a, b.b, c.c
    FROM table_a AS a
    JOIN table_b AS b
    JOIN table_c AS c
    WHERE complex_filters
), do_some_thing_with_results AS (
    SELECT stuff
    FROM expensive_select
    WHERE filters_1
), do_some_aggregation AS (
    SELECT a, SUM(b) AS sum_b
    FROM expensive_select
    WHERE filters_2
    GROUP BY a
)
SELECT a.a
     , a.b
     , b.stuff
     , c.sum_b
FROM expensive_select AS a
LEFT JOIN do_some_thing_with_results AS b ON a.a = b.a
LEFT JOIN do_some_aggregation AS c ON a.a = c.a;
This was originally unrolled, and the expensive part was some VIEWs where the date-range filter applied at the top level was not getting pushed down (due to window functions), resulting in full table scans, multiple times. Pushing the filters into the CTE meant the cost was paid once. (In our case, putting the date-range filters in the CTE made Snowflake notice them and push them down into the view. Things can change, though: a few weeks later the original code ran as well as the modified version, so they "fixed" something.)
In other cases, the different paths that used the CTE consumed smaller subsets of the results, so the CTE reduced remote IO and improved performance, even though there were then more stalls in the execution plan.
I also use CTEs like this to make the code easier to read: give the CTE a meaningful name, but alias it to something short at the point of use. Really love that.

BigQuery Full outer join producing "left join" results

I have 2 tables, both of which contain distinct id values. Some of the id values might occur in both tables and some are unique to each table. Table1 has 10,910 rows and Table2 has 11,304 rows
When running an inner join query:
SELECT COUNT(DISTINCT a.id)
FROM table1 a
JOIN table2 b on a.id = b.id
I get a count of 10,896, i.e. 10,896 ids shared across both tables.
However, when I run a FULL OUTER JOIN on the 2 tables like this:
SELECT COUNT(DISTINCT a.id)
FROM table1 a
FULL OUTER JOIN EACH table2 b on a.id = b.id
I get a total of 10,896, but I was expecting all 10,910 ids from table1.
I am wondering if there is an issue with my query syntax.
As you are using EACH - it looks like you are running your queries in Legacy SQL mode.
In BigQuery Legacy SQL, the COUNT(DISTINCT) function is probabilistic: it gives a statistical approximation and is not guaranteed to be exact.
You can use the EXACT_COUNT_DISTINCT() function instead; this one gives you an exact number but is a little more expensive on the back-end.
An even better option: just use Standard SQL.
For your specific query, you only need to remove the EACH keyword and it should work like a charm:
#standardSQL
SELECT COUNT(DISTINCT a.id)
FROM table1 a
JOIN table2 b on a.id = b.id
and
#standardSQL
SELECT COUNT(DISTINCT a.id)
FROM table1 a
FULL OUTER JOIN table2 b on a.id = b.id
I added the original query as a subquery and counted the ids, which produced the expected results. Still a little strange, but it works.
SELECT EXACT_COUNT_DISTINCT(a_id)
FROM (
  SELECT a.id AS a_id,
         b.id AS b_id
  FROM table1 a
  FULL OUTER JOIN EACH table2 b ON a.id = b.id
)
That is because in both cases you count the number of non-NULL values from table a by using COUNT(DISTINCT a.id).
Use COUNT(*) and it should work.
You will have to add COALESCE: in a full outer join, a.id is NULL for rows that exist only in table2, so COUNT(DISTINCT a.id) misses them.
SELECT COUNT(DISTINCT coalesce(a.id,b.id))
FROM table1 a
FULL OUTER JOIN EACH table2 b on a.id = b.id
This query now takes full advantage of the full outer join :)

Refactor SQL statement with lots of common INNER JOINS and columns

I need suggestions on how to refactor the following SQL expression. As you can see, all the selected columns except col_N are the same, and all the inner joins except the last one in the two sub-queries are the same. This is just a snippet of my code, so I am not including the WHERE clause from my query. FYI: this is part of a stored procedure used by an SSRS report, and performance is a big concern for me due to thousands of records:
SELECT col_A
, col_B, col_C,...
, '' As[col_N]
FROM table_A
INNER JOIN table_B
INNER JOIN table_C
INNER JOIN table_D1
UNION
SELECT col_A
, col_B, col_C,...
, (select E.field_2 from table_E AS E where D2.field_1 = E.field_1 AND A.field_1 = E.field_2) AS [col_N]
FROM table_A as A
INNER JOIN table_B
INNER JOIN table_C
INNER JOIN table_D2 as D2
Jean's first suggestion of creating a view by joining A, B and C worked. I created a temp table by joining ABC and then used it, achieving a significant performance improvement (query time cut in half for a couple thousand records)!
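The refactor described above can be sketched as follows. The temp-table syntax is SQL Server style, matching the stored-procedure/SSRS context, and the join keys are hypothetical placeholders, since the snippet above omits its ON clauses:

```sql
-- materialize the shared A-B-C join once
SELECT A.col_A, A.col_B, A.col_C, A.field_1   -- plus the other shared columns
INTO #abc
FROM table_A AS A
INNER JOIN table_B AS B ON A.b_id = B.id      -- hypothetical key
INNER JOIN table_C AS C ON A.c_id = C.id;     -- hypothetical key

-- then each branch of the UNION only adds its own last join
SELECT col_A, col_B, col_C, '' AS [col_N]
FROM #abc
INNER JOIN table_D1 ON ...                    -- original join condition
UNION
SELECT abc.col_A, abc.col_B, abc.col_C,
       (SELECT E.field_2 FROM table_E AS E
        WHERE D2.field_1 = E.field_1
          AND abc.field_1 = E.field_2) AS [col_N]
FROM #abc AS abc
INNER JOIN table_D2 AS D2 ON ...;             -- original join condition
```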