Is Common Table Expression evil in this case? - apache-spark-sql

I have seen many SQL statements like the following in our data warehouse daily pipelines.
WITH a AS (
    SELECT
        ...
    FROM
        t1 JOIN t2
    WHERE
        ...
), b AS (
    SELECT
        ...
    FROM
        t3 JOIN t4 ON ...
        JOIN a ...
    WHERE
        ...
), c AS (
    SELECT
        ...
    FROM
        t5 JOIN b ON ...
), d AS (
    SELECT
        ...
    FROM
        t6 JOIN b ON ...
)
INSERT INTO TABLE ...
SELECT
    ...
FROM
    c JOIN d ON ...
If I remember correctly, Common Table Expressions are just syntactic sugar; they don't perform any real optimization. If that's the case, how many times are t1 and t2 actually joined in that query?
Is there a way to force materialization of a CTE and reuse it? Thanks.
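For what it's worth, one generic way to make sure a shared step runs only once is to write it out to a temporary table and point the later steps at that table. Below is a minimal sketch, assuming SQLite (through Python's sqlite3 module) as a stand-in for the warehouse engine; the columns of t1 and t2 and the name a_materialized are made up for illustration. Spark SQL has a CACHE TABLE ... AS SELECT statement that plays a similar role, but whether a plain CTE is recomputed depends on the engine and version, so it is worth checking the plan with EXPLAIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER, x INTEGER);
    CREATE TABLE t2 (id INTEGER, y INTEGER);
    INSERT INTO t1 VALUES (1, 10), (2, 20), (3, 30);
    INSERT INTO t2 VALUES (1, 100), (2, 200), (4, 400);
""")

# Variant 1: the shared step "a" is a CTE that is referenced twice.
cte_rows = conn.execute("""
    WITH a AS (
        SELECT t1.id, t1.x + t2.y AS total
        FROM t1 JOIN t2 ON t1.id = t2.id
    )
    SELECT c.id, c.total, d.total
    FROM a AS c JOIN a AS d ON c.id = d.id
    ORDER BY c.id
""").fetchall()

# Variant 2: run the shared step once, store it, and reuse the stored rows.
conn.execute("""
    CREATE TEMP TABLE a_materialized AS
    SELECT t1.id, t1.x + t2.y AS total
    FROM t1 JOIN t2 ON t1.id = t2.id
""")
temp_rows = conn.execute("""
    SELECT c.id, c.total, d.total
    FROM a_materialized AS c JOIN a_materialized AS d ON c.id = d.id
    ORDER BY c.id
""").fetchall()

print(cte_rows)   # same rows either way; only the amount of work may differ
print(temp_rows)
```

Both variants return the same rows; the second one simply guarantees that the t1/t2 join is evaluated a single time.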

Related

Any way to simplify my code? Left joining multiple select statements

I figured out a way to combine 6 different select statements into one very long row, which my outer select statement can then filter. It works just like I need it to; however, I feel like I have a ton of redundant code. Is there any way to simplify my code without changing the functionality at all?
SELECT * FROM
(
SELECT row1 FROM db1
JOIN db2 ON ...
JOIN db3 ON ...
WHERE ...) t1
LEFT JOIN
(
SELECT value_to_join FROM db4 v1, db1
JOIN db2 ON ...
JOIN db3 ON ...
WHERE ...) t2
ON t1.other_value = t2.other_value
LEFT JOIN
(
SELECT value_to_join FROM db4 v2, db1
JOIN db2 ON ...
JOIN db3 ON ...
WHERE ...) t3
ON t1.other_value = t3.other_value
My output is a row from the first select statement joined with 5 different values from db4. These 5 values can only be joined with db1 when I join the other tables (db2, db3) because there is no common column to join on.
Some more information: This format of left join is used up to t6, with the ON being t1.value = tn.value with n increasing respectively. The join statements in each subquery are the same in all 6, so I'm assuming there has to be a way to simplify that. The '...' is just a mess of code that comes after each clause.
If your RDBMS supports a reasonably recent SQL standard (SQL3 / SQL:1999 or later), you can use a CTE to achieve this:
WITH myquery (value,…) AS (
SELECT * FROM db4, db1
JOIN db2 ON ...
JOIN db3 ON ...
WHERE ...)
SELECT * FROM
(
SELECT * FROM db1
JOIN db2 ON ...
JOIN db3 ON ...
WHERE ...) t1
LEFT JOIN myquery t2
ON t1.value = t2.value
LEFT JOIN myquery t3
ON t1.value = t3.value
…
However, you will need to replace "value,…" and "SELECT *" in the first query with the exact list of columns you want.
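To make the idea concrete, here is a small runnable sketch of a CTE that is defined once and joined twice under different aliases. It assumes SQLite via Python's sqlite3 module; the tables items and attrs, their columns, and the CTE name item_attrs are hypothetical stand-ins for db1 to db4, since the real join conditions are elided in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (item_id INTEGER, name TEXT);
    CREATE TABLE attrs (item_id INTEGER, kind TEXT, val TEXT);
    INSERT INTO items VALUES (1, 'bolt'), (2, 'nut');
    INSERT INTO attrs VALUES
        (1, 'color', 'grey'), (1, 'size', 'M8'), (2, 'color', 'black');
""")

rows = conn.execute("""
    -- The repeated join is written once, with an explicit column list.
    WITH item_attrs (item_id, kind, val) AS (
        SELECT i.item_id, a.kind, a.val
        FROM items i JOIN attrs a ON a.item_id = i.item_id
    )
    SELECT i.name, c.val AS color, s.val AS item_size
    FROM items i
    LEFT JOIN item_attrs c ON c.item_id = i.item_id AND c.kind = 'color'
    LEFT JOIN item_attrs s ON s.item_id = i.item_id AND s.kind = 'size'
    ORDER BY i.item_id
""").fetchall()

print(rows)  # [('bolt', 'grey', 'M8'), ('nut', 'black', None)]
```

Each extra value only costs one more LEFT JOIN line against the CTE instead of another copy of the whole subquery.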

SQL merge results from multiple joined tables into a single view field (without union)

Is there a way to merge/combine the results from multiple joined tables into a single field?
Two key details:
Must not use a Union.
Must be able to generate a View with the code.
We have a nested table structure and there's quite a bit of conditional logic, so I'd like to avoid writing the same large query multiple times (hence why I'm trying to avoid a Union).
select [Combine Id results from all 4 tables]
from Table1 tbl1
inner join Table2 tbl2 on tbl1.Id = tbl2.ParentId
inner join Table3 tbl3 on tbl2.Id = tbl3.ParentId
inner join Table4 tbl4 on tbl3.Id = tbl4.ParentId
So if the tables contain the following data:
Table 1
Id, ParentId
1, 1
Table 2
Id, ParentId
2, 1
Table 3
Id, ParentId
3, 2
Table 4
Id, ParentId
4, 3
Is it possible to produce the following single-field output (list of integers) using the join structure from my original query?:
Id
1
2
3
4
I think apply does what you want:
select v.id
from Table1 tbl1 inner join
Table2 tbl2
on tbl1.Id = tbl2.ParentId inner join
Table3 tbl3
on tbl2.Id = tbl3.ParentId inner join
Table4 tbl4
on tbl3.Id = tbl4.ParentId cross apply
(values (tbl1.id), (tbl2.id), (tbl3.id), (tbl4.id)) v(id);
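CROSS APPLY is specific to SQL Server and a few other engines. As a rough illustration of the same "one joined row becomes several output rows" idea, the sketch below swaps in a different technique: a cross join against a small numbers list plus a CASE expression. It assumes SQLite via Python's sqlite3 module; the names nums and combined_id are made up for the example, and the data is the sample data from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (Id INTEGER, ParentId INTEGER);
    CREATE TABLE Table2 (Id INTEGER, ParentId INTEGER);
    CREATE TABLE Table3 (Id INTEGER, ParentId INTEGER);
    CREATE TABLE Table4 (Id INTEGER, ParentId INTEGER);
    INSERT INTO Table1 VALUES (1, 1);
    INSERT INTO Table2 VALUES (2, 1);
    INSERT INTO Table3 VALUES (3, 2);
    INSERT INTO Table4 VALUES (4, 3);
""")

ids = [row[0] for row in conn.execute("""
    -- nums has one row per table whose Id should appear in the output.
    WITH nums (n) AS (VALUES (1), (2), (3), (4))
    SELECT CASE nums.n
               WHEN 1 THEN tbl1.Id
               WHEN 2 THEN tbl2.Id
               WHEN 3 THEN tbl3.Id
               ELSE tbl4.Id
           END AS combined_id
    FROM Table1 tbl1
    INNER JOIN Table2 tbl2 ON tbl1.Id = tbl2.ParentId
    INNER JOIN Table3 tbl3 ON tbl2.Id = tbl3.ParentId
    INNER JOIN Table4 tbl4 ON tbl3.Id = tbl4.ParentId
    CROSS JOIN nums
    ORDER BY combined_id
""")]

print(ids)  # [1, 2, 3, 4]
```

The join chain is still written only once and no UNION is involved, so the query could back a view in the same way.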

Which performs first WHERE clause or JOIN clause

Which clause is performed first in a SELECT statement?
I have a doubt about SELECT queries on this point.
Consider the example below:
SELECT *
FROM #temp A
INNER JOIN #temp B ON A.id = B.id
INNER JOIN #temp C ON B.id = C.id
WHERE A.Name = 'Acb' AND B.Name = C.Name
Does it first check the WHERE clause and then perform the INNER JOIN,
or does it first JOIN and then check the condition?
If it performs the JOIN first and then the WHERE condition, how can it apply further WHERE conditions to the different JOINs?
The conceptual order of query processing is:
1. FROM
2. WHERE
3. GROUP BY
4. HAVING
5. SELECT
6. ORDER BY
But this is just a conceptual order. In fact the engine may decide to rearrange clauses. Here is proof. Let's make 2 tables with 1000000 rows each:
CREATE TABLE test1 (id INT IDENTITY(1, 1), name VARCHAR(10))
CREATE TABLE test2 (id INT IDENTITY(1, 1), name VARCHAR(10))
;WITH cte AS(SELECT -1 + ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) d FROM
(VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t1(n) CROSS JOIN
(VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t2(n) CROSS JOIN
(VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t3(n) CROSS JOIN
(VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t4(n) CROSS JOIN
(VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t5(n) CROSS JOIN
(VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t6(n))
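INSERT INTO test1(name) SELECT 'a' FROM cte
-- Populate test2 as well, so that both tables hold 1000000 rows
INSERT INTO test2(name) SELECT name FROM test1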
Now run 2 queries:
SELECT * FROM dbo.test1 t1
JOIN dbo.test2 t2 ON t2.id = t1.id AND t2.id = 100
WHERE t1.id > 1
SELECT * FROM dbo.test1 t1
JOIN dbo.test2 t2 ON t2.id = t1.id
WHERE t1.id = 1
Notice that the first query will filter most rows out in the join condition, but the second query filters in the where condition. Look at the produced plans:
1 TableScan - Predicate:[Test].[dbo].[test2].[id] as [t2].[id]=(100)
2 TableScan - Predicate:[Test].[dbo].[test2].[id] as [t2].[id]=(1)
This means that in the first query the optimizer decided to evaluate the join condition first in order to filter out rows, while in the second query it evaluated the WHERE clause first.
Logical order of query processing phases is:
FROM - Including JOINs
WHERE
GROUP BY
HAVING
SELECT
ORDER BY
You can have as many conditions as you like, on your JOINs as well as in your WHERE clause. For example:
Select * from #temp A
INNER JOIN #temp B ON A.id = B.id AND .... AND ...
INNER JOIN #temp C ON B.id = C.id AND .... AND ...
Where A.Name = 'Acb'
AND B.Name = C.Name
AND ....
You can refer to this join optimization example:
SELECT * FROM T1 INNER JOIN T2 ON P1(T1,T2)
INNER JOIN T3 ON P2(T2,T3)
WHERE P(T1,T2,T3)
The nested-loop join algorithm would execute this query in the following manner:
FOR each row t1 in T1 {
  FOR each row t2 in T2 such that P1(t1,t2) {
    FOR each row t3 in T3 such that P2(t2,t3) {
      IF P(t1,t2,t3) {
        t:=t1||t2||t3; OUTPUT t;
      }
    }
  }
}
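The pseudocode above translates almost line for line into a runnable sketch. This is only an illustration of the nested-loop idea, not how any particular engine implements it; rows are modelled as (id, name) tuples and the function name nested_loop_join is made up for the example. The predicates mirror the #temp query from the question.

```python
def nested_loop_join(T1, T2, T3, P1, P2, P):
    """Yield concatenated rows t1||t2||t3 that satisfy P1, P2 and P."""
    for t1 in T1:
        for t2 in T2:
            if not P1(t1, t2):        # join condition between T1 and T2
                continue
            for t3 in T3:
                if not P2(t2, t3):    # join condition between T2 and T3
                    continue
                if P(t1, t2, t3):     # WHERE condition on the combined row
                    yield t1 + t2 + t3


# Sample data standing in for #temp: (id, name)
temp = [(1, "Acb"), (2, "Xyz")]

result = list(nested_loop_join(
    temp, temp, temp,
    P1=lambda a, b: a[0] == b[0],                            # A.id = B.id
    P2=lambda b, c: b[0] == c[0],                            # B.id = C.id
    P=lambda a, b, c: a[1] == "Acb" and b[1] == c[1],        # WHERE clause
))

print(result)  # [(1, 'Acb', 1, 'Acb', 1, 'Acb')]
```

Note that the WHERE predicate is only tested on rows that already passed both join conditions, which is why it can reference columns from all three inputs.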
You can also refer to MSDN:
The rows selected by a query are filtered first by the FROM clause
join conditions, then the WHERE clause search conditions, and then the
HAVING clause search conditions. Inner joins can be specified in
either the FROM or WHERE clause without affecting the final result.
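The claim that an inner-join filter gives the same result whether it sits in the ON clause or in the WHERE clause is easy to check. Here is a minimal sketch, assuming SQLite via Python's sqlite3 module rather than SQL Server, with much smaller tables than in the earlier example. The plan output is printed but not interpreted here, since its exact text depends on the SQLite version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE test1 (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE test2 (id INTEGER PRIMARY KEY, name TEXT);
""")
conn.executemany("INSERT INTO test1 VALUES (?, ?)",
                 [(i, "a") for i in range(1, 201)])
conn.executemany("INSERT INTO test2 VALUES (?, ?)",
                 [(i, "b") for i in range(1, 201)])

sql_filter_in_on = """
    SELECT t1.id, t2.id FROM test1 t1
    JOIN test2 t2 ON t2.id = t1.id AND t2.id = 100
"""
sql_filter_in_where = """
    SELECT t1.id, t2.id FROM test1 t1
    JOIN test2 t2 ON t2.id = t1.id
    WHERE t2.id = 100
"""

filter_in_on = conn.execute(sql_filter_in_on).fetchall()
filter_in_where = conn.execute(sql_filter_in_where).fetchall()
print(filter_in_on, filter_in_where)  # identical result sets

# Show how the engine chose to execute each form.
for sql in (sql_filter_in_on, sql_filter_in_where):
    for plan_row in conn.execute("EXPLAIN QUERY PLAN " + sql):
        print(plan_row)
```

This equivalence holds for inner joins only; with outer joins, moving a condition between ON and WHERE can change the result.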
You can also run SET SHOWPLAN_ALL ON before executing your queries to display their execution plans, so that you can compare the performance of the two.
If you come to this site for the question about logical query processing, you really need to read this article on ITProToday by Itzik Ben-Gan.
Figure 3: Logical query processing order of query clauses
1 FROM
2 WHERE
3 GROUP BY
4 HAVING
5 SELECT
5.1 SELECT list
5.2 DISTINCT
6 ORDER BY
7 TOP / OFFSET-FETCH

SQL alternative to sub-query in SELECT Item list

I have RDBMS tables and queries that work perfectly. I have offloaded the data from the RDBMS to Hive tables. To run the existing queries on Hive, we first need to make them compatible with Hive.
Take the examples below, which have sub-queries in the SELECT item list. They are syntactically valid and work fine on the RDBMS, but they will not work on Hive: as per the Hive manual, Hive supports subqueries only in the FROM and WHERE clauses.
Example 1 :
SELECT t1.one
,(SELECT t2.two
FROM TEST2 t2
WHERE t1.one=t2.two) t21
,(SELECT t3.three
FROM TEST3 t3
WHERE t1.one=t3.three) t31
FROM TEST1 t1 ;
Example 2:
SELECT a.*
, CASE
WHEN EXISTS
(SELECT 1
FROM tblOrder O
INNER JOIN tblProduct P
ON O.Product_id = P.Product_id
WHERE O.customer_id = C.customer_id
AND P.Product_Type IN (2, 5, 6, 9)
)
THEN 1
ELSE 0
END AS My_Custom_Indicator
FROM tblCustomer C
INNER JOIN tblOtherStuff S
ON C.CustomerID = S.CustomerID ;
Example 3 :
Select component_location_id, component_type_code,
( select clv.LOCATION_VALUE
from stg_dev.component_location_values clv
where identifier_code = 'AXLE'
and component_location_id = cl.component_location_id ) as AXLE,
( select clv.LOCATION_VALUE
from stg_dev.component_location_values clv
where identifier_code = 'SIDE'
and component_location_id = cl.component_location_id ) as SIDE
from stg_dev.component_locations cl ;
I want to know the possible alternatives to sub-queries in the SELECT item list that would make these queries compatible with Hive, so that I can transform the existing queries into Hive format.
Any help and guidance is highly appreciated !
The query you provided could be transformed to a simple query with LEFT JOINs.
SELECT
t1.one, t2.two AS t21, t3.three AS t31
FROM
TEST1 t1
LEFT JOIN TEST2 t2
ON t1.one = t2.two
LEFT JOIN TEST3 t3
ON t1.one = t3.three
Since the subqueries contain no other restrictions, the joins will return the same data, provided that each subquery returns at most one row for each row in TEST1.
Please note that your original query could not handle 1..n relationships. In most DBMSs, a subquery in the SELECT list must return a result set with exactly one column and at most one row.
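The equivalence can be checked on a toy data set. The sketch below assumes SQLite via Python's sqlite3 module (SQLite does accept subqueries in the SELECT list, which makes the comparison possible) and uses made-up sample values. It also runs an INNER JOIN variant to show how that form differs when a row has no match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TEST1 (one INTEGER);
    CREATE TABLE TEST2 (two INTEGER);
    CREATE TABLE TEST3 (three INTEGER);
    INSERT INTO TEST1 VALUES (1), (2), (3);
    INSERT INTO TEST2 VALUES (1), (2);
    INSERT INTO TEST3 VALUES (1), (3);
""")

# Original form: scalar subqueries in the SELECT list.
subquery_rows = conn.execute("""
    SELECT t1.one,
           (SELECT t2.two FROM TEST2 t2 WHERE t1.one = t2.two) AS t21,
           (SELECT t3.three FROM TEST3 t3 WHERE t1.one = t3.three) AS t31
    FROM TEST1 t1
    ORDER BY t1.one
""").fetchall()

# Rewritten form: LEFT JOINs keep unmatched TEST1 rows and fill in NULL.
left_join_rows = conn.execute("""
    SELECT t1.one, t2.two AS t21, t3.three AS t31
    FROM TEST1 t1
    LEFT JOIN TEST2 t2 ON t1.one = t2.two
    LEFT JOIN TEST3 t3 ON t1.one = t3.three
    ORDER BY t1.one
""").fetchall()

# For contrast: INNER JOINs drop TEST1 rows that lack a match.
inner_join_rows = conn.execute("""
    SELECT t1.one, t2.two AS t21, t3.three AS t31
    FROM TEST1 t1
    INNER JOIN TEST2 t2 ON t1.one = t2.two
    INNER JOIN TEST3 t3 ON t1.one = t3.three
    ORDER BY t1.one
""").fetchall()

print(subquery_rows)    # [(1, 1, 1), (2, 2, None), (3, None, 3)]
print(left_join_rows)   # same as above
print(inner_join_rows)  # [(1, 1, 1)]
```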
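Based on the Hive manual (note that these inner-join forms, unlike the original subqueries, drop TEST1 rows that have no match in TEST2 or TEST3):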
SELECT t1.one, t2.two, t3.three
FROM TEST1 t1,TEST2 t2, TEST3 t3
WHERE t1.one=t2.two AND t1.one=t3.three;
SELECT t1.one,t2.two,t3.three FROM TEST1 t1 INNER
JOIN TEST2 t2 ON t1.one=t2.two INNER JOIN TEST3 t3
ON t1.one=t3.three WHERE t1.one=t2.two AND t1.one=t3.three;
SELECT t1.one,t2.two as t21,t3.three as t31 FROM TEST1 t1
INNER JOIN TEST2 t2 ON t1.one=t2.two
INNER JOIN TEST3 t3 ON t1.one=t3.three

LEFT JOIN on 3 tables to get a value

I'm trying to create a new interface for a database, but I don't know how to do what I want.
I have 3 tables :
- table1(id1, time, ...)
id11 ..
id12 ..
id13 ..
- table2(id2, price, ...)
id21 ..
id22 ..
id23 ..
- table1_table2(#id1, #id2, value)
id11, id22, 6
id11, id23, 10
id13, id22, 5
So I want to have something like this :
id11, id21, 0
id11, id22, 6
id11, id23, 10
id12, id21, 0
id12, id22, 0
id12, id23, 0
id13, id21, 0
id13, id22, 5
id13, id23, 0
I've tried lots of queries, but nothing has worked.
Please, help me ^^
EDIT : I'm using Access ( :'( ) 2007, and apparently, it doesn't support CROSS JOIN...
I tried to use this : http://blog.jooq.org/2014/02/12/no-cross-join-in-ms-access/
but still have a syntax error on the JOIN or the FROM..
EDIT 2 : Here is my query (I'm French, so please don't mind the names ^^)
SELECT Chantier.id_chantier, Indicateur.id_indicateur, Indicateur_chantier.valeur
FROM ((Chantier INNER JOIN Indicateur ON (Chantier.id_chantier*0 = Indicateur.id_indicateur*0))
LEFT JOIN Indicateur_chantier ON ( (Chantier.id_chantier = Indicateur_chantier.id_chantier)
AND (Indicateur.id_indicateur = Indicateur_chantier.id_indicateur) ) )
You should first cross join table1 and table2 to produce their Cartesian product, and then left join table1_table2 to get the values where they exist:
SELECT t1.id1,t2.id2,ISNULL(t12.value,0)
FROM table1 t1
CROSS JOIN table2 t2
LEFT JOIN table1_table2 t12 on t12.id1=t1.id1 and t12.id2=t2.id2
Finally use ISNULL to replace null values with zeros.
The answer may vary by database; this works in SQL Server. You need a CROSS JOIN to get every combination of table1 and table2, then a LEFT JOIN to return the pairs that have values:
SELECT a.id1, b.id2, COALESCE(c.value,0)
FROM table1 a
CROSS JOIN table2 b
LEFT JOIN table1_table2 c
ON a.id1 = c.id1
AND b.id2 = c.id2
Pairs without values would return NULL, so you can use COALESCE() to return 0 instead.
In your question you say that Access "doesn't support CROSS JOIN". While it is true that Access SQL does not support
... FROM tableX CROSS JOIN tableY ...
you can perform a cross join in Access by simply using
... FROM tableX, tableY ...
In your case,
SELECT
crossjoin.id1,
crossjoin.id2,
Nz(table1_table2.value, 0) AS [value]
FROM
(
SELECT table1.id1, table2.id2
FROM table1, table2
) AS crossjoin
LEFT JOIN
table1_table2
ON table1_table2.id1 = crossjoin.id1
AND table1_table2.id2 = crossjoin.id2
ORDER BY crossjoin.id1, crossjoin.id2
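The cross-join-then-left-join approach from the answers above can be verified against the sample data in the question. This is a minimal sketch, assuming SQLite via Python's sqlite3 module instead of Access or SQL Server, so it uses COALESCE rather than Nz or ISNULL; only the id and value columns from the question are modelled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id1 TEXT);
    CREATE TABLE table2 (id2 TEXT);
    CREATE TABLE table1_table2 (id1 TEXT, id2 TEXT, value INTEGER);
    INSERT INTO table1 VALUES ('id11'), ('id12'), ('id13');
    INSERT INTO table2 VALUES ('id21'), ('id22'), ('id23');
    INSERT INTO table1_table2 VALUES
        ('id11', 'id22', 6), ('id11', 'id23', 10), ('id13', 'id22', 5);
""")

rows = conn.execute("""
    SELECT a.id1, b.id2, COALESCE(c.value, 0)
    FROM table1 a
    CROSS JOIN table2 b              -- every (id1, id2) combination
    LEFT JOIN table1_table2 c        -- attach a value where one exists
           ON a.id1 = c.id1 AND b.id2 = c.id2
    ORDER BY a.id1, b.id2
""").fetchall()

for row in rows:
    print(row)
```

The nine printed rows match the desired output listed in the question, with 0 filled in for the pairs that have no entry in table1_table2.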