Query design for nested statements and CTEs

I have a query that sequentially joins 6 tables from their original data sources. Nested, it's a mess:
SELECT . . .
FROM
(
    SELECT . . .
    FROM
    (
        SELECT . . .
        FROM (. . .)
        INNER JOIN . . .
    )
    INNER JOIN . . .
)
I switched to CTE definitions, and each definition is one join on a previous definition, with the final query at the end providing the result:
WITH
Table1 (field1, field2) AS
(
    SELECT . . .
    FROM . . .
    INNER JOIN . . .
),
Table2 (field2, field3) AS
(
    SELECT . . .
    FROM Table1
    INNER JOIN . . .
),
. . .
SELECT . . .
FROM Table6
This is a lot more readable, and dependencies flow downward in logical order. However, this doesn't seem like the intended use of CTEs (and also why I'm not using Views), since each definition is really only referenced once in order.
Is there any guidance out there on how to construct sequentially nested joins like this that is both readable and logical in structure?

I don't think there is anything wrong with utilizing CTEs to create temporary views.
In a larger shop, there are defined roles that separate the responsibilities of DBAs and developers. The CREATE statement, in general, falls victim to this bureaucracy; hence, no view. A CTE is a very good compromise.
If the views are not really reusable anyway, keeping the logic inline with the SQL makes it more readable.
A CTE is a lot more readable and intuitive than a subquery (even with just one level). If your subqueries are not correlated, I would suggest converting all of your subqueries to CTEs.
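For illustration, a minimal sketch of that conversion, using hypothetical table and column names:
-- Subquery form
SELECT o.id, o.total
FROM (SELECT id, total FROM orders WHERE total > 100) AS o;

-- Equivalent CTE form
WITH big_orders AS
(
    SELECT id, total FROM orders WHERE total > 100
)
SELECT id, total FROM big_orders;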
Recursion is the "killer" app for CTEs, but that doesn't mean you shouldn't use CTEs otherwise.
The only con I can think of is that (depending on your database engine) a CTE might confuse the optimizer or prevent it from doing what it's supposed to do. Optimizers are smart enough to rewrite subqueries for you.
Now, let us discuss abuse of CTEs: be careful that you don't substitute your application-developer knowledge for the database engine's optimization. A lot of smart developers (smarter than us) designed this software. Write the query without CTEs or subqueries as much as possible and let the database do the work. For example, I often see developers DISTINCT/WHERE every key in a subquery before doing their join. You may think you're doing the right thing, but you're not.
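A sketch of that anti-pattern versus the plain join (hypothetical tables; customer_id is assumed unique in customers, so both queries return the same rows):
-- Over-engineered: hand-filtering and de-duplicating keys before the join
SELECT o.*
FROM orders o
INNER JOIN (SELECT DISTINCT customer_id FROM customers) c
    ON o.customer_id = c.customer_id;

-- Simpler: write the join directly and let the optimizer do the work
SELECT o.*
FROM orders o
INNER JOIN customers c
    ON o.customer_id = c.customer_id;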
With regard to your question: most people here intend to solve concrete problems rather than discuss something theoretical, hence you get people scratching their heads over what you are after. You did imply it in your text, but perhaps be more forceful about it.

Maybe I didn't understand the question, but what's wrong with:
select * from table1 t1, table2 t2, table3 t3, table4 t4, table5 t5, table6 t6
where t1.id = t2.id and t2.id = t3.id and t3.id = t4.id
and t4.id = t5.id and t5.id = t6.id
or the same using table1 t1 INNER JOIN table2 t2 ON t1.id = t2.id ....

Why didn't you just join your tables like this?
select *
from Table1 as t1
inner join Table2 as t2 on t2.<column> = t1.<column>
inner join Table3 as t3 on t3.<column> = t2.<column>
inner join Table4 as t4 on t4.<column> = t3.<column>
inner join Table5 as t5 on t5.<column> = t4.<column>
inner join Table6 as t6 on t6.<column> = t5.<column>

Related

SQL Query Performance Join with condition

Calling all SQL experts. I have the following select statement:
SELECT 1
FROM table1 t1
JOIN table2 t2 ON t1.id = t2.id
WHERE t1.field = xyz
I'm a little bit worried about the performance here. Is the WHERE clause evaluated before or after the join? If it's evaluated after, is there a way to have the WHERE clause evaluated first?
The whole table could easily contain more than a million entries, but after the WHERE clause only 1-10 entries may be left, so in my opinion it makes a big performance difference when the WHERE clause is evaluated.
Thanks in advance.
Dimi
You could rewrite your query like this:
SELECT 1
FROM (SELECT * FROM table1 WHERE field = xyz) t1
JOIN table2 t2 ON t1.id = t2.id
But depending on the database product the optimiser might still decide that the best way to do this is to JOIN table1 to table2 and then apply the constraint.
For this query:
SELECT 1
FROM table1 t1 JOIN
table2 t2
ON t1.id = t2.id
WHERE t1.field = xyz;
The optimal indexes are table1(field, id), table2(id).
How the query is executed depends on the optimizer. It is tasked with choosing the best execution plan, given the table statistics and environment.
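As a sketch, those indexes would be created like this (the index names are made up):
CREATE INDEX idx_table1_field_id ON table1 (field, id);
CREATE INDEX idx_table2_id ON table2 (id);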
Each DBMS has its own query optimizer, so by the logic of things, in a case like yours the WHERE should be evaluated first and then the JOIN part of the query.
As mentioned in the comments and other answers, with performance the answer is always "it depends". Depending on your DBMS and the indexing of the base tables, the query may be fine as is and the optimizer may evaluate the WHERE first; or the join may be efficient anyway if the indexes cover the join requirements.
Alternatively, you can force the behavior you require by reducing the dataset of t1 before you do the join, either by using a nested select as Richard suggested, or by adding the t1.field = xyz condition to the join, for example:
ON t1.field = xyz AND t1.id = t2.id
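Spelled out, that variant would look something like:
SELECT 1
FROM table1 t1
JOIN table2 t2
    ON t1.field = xyz AND t1.id = t2.id
For an INNER JOIN this is semantically identical to filtering in the WHERE clause; for an OUTER JOIN the placement of the condition would change the result.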
Personally, if I needed to reduce the dataset before the join, I would use a CTE:
WITH T1 AS
(
    SELECT * FROM table1
    WHERE field = 'xyz'
)
SELECT 1
FROM T1
JOIN table2 T2
    ON T1.id = T2.id

Optimization of DB2 query which uses joins and takes 1.5 hours to execute

When I run a SELECT statement on my view, it takes around 1.5 hours to run. What can I do to optimize it?
Below is a sample of what my view's structure looks like:
CREATE VIEW SCHEMANAME.VIEWNAME
{
COL, COL1, COL2, COL3 }
AS SELECT
COST.ETA,
CASE
WHEN VOL.CURR IS NOT NULL
THEN COALESCE {VOL.COMM,0}
END CASE,
CASE
WHEN...
END CASE
FROM TABLE1 t1 inner join TABLE2 t2 ON t1.ETA=t2.ETA
INNER JOIN TABLE3 t3 on t2.ETA=t3.ETA
LEFT OUTER JOIN TABLE4 t4 on t2.ETA=t4.ETA
This is your query:
SELECT COST.ETA,
(CASE WHEN VOL.CURR IS NOT NULL THEN COALESCE {VOL.COMM,0}
END) as ??,
. . .
FROM TABLE1 t1 inner join
TABLE2 t2
ON t1.ETA = t2.ETA INNER JOIN
TABLE3 t3
on t2.ETA = t3.ETA LEFT OUTER JOIN
TABLE4 t4
on t2.ETA = t4.ETA;
First, I will note the fact that the select clause references tables that are not in the from clause. I assume this is a typo.
Second, you should be able to use indexes to improve this query: table1(eta), table2(eta), table3(eta), and table4(eta).
Third, I am highly suspicious when I see the same column used for joining so many tables. I suspect you might have Cartesian products occurring, because there are multiple rows for any given eta in several tables. If that is the case, you need to fix the query to better reflect what you really need, and you should ask another question with sample data and desired results, because your query is probably not correct.
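A quick way to test that suspicion is to count duplicates of the join key per table (a hypothetical diagnostic; repeat for each table in the join):
SELECT eta, COUNT(*) AS cnt
FROM table2
GROUP BY eta
HAVING COUNT(*) > 1;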

Postgres/netezza multiple join from multiple tables

Hi, I have a problem when migrating from Oracle to Netezza: Netezza seems to have a problem if multiple tables are declared before JOINs are used. How could I write this join differently?
INSERT INTO ...
SELECT ...
FROM table1 t1, table2 t2 -- here seems to be the problem, as Postgres doesn't allow two tables in the FROM clause if there are JOINs involved
JOIN table3 t3 ON t2.column = t3.column
JOIN table4 t4 ON t2.column = t4.column
LEFT OUTER JOIN table5 t5 ON (t4.column = t5.column AND t4.column = t2.column AND t4.column = t3.column)
WHERE ...;
You simply should not mix old-style (implicit) and new-style (explicit) joins. In fact, a simple rule is simply to avoid commas in the from clause.
I imagine the problem you have is a scoping problem for the table aliases. I know this happens in MySQL, but because I never use commas in from clauses, I am not aware of how it affects other databases. I think the part of the from clause after the comma is parsed as a unit, and the aliases defined before it are not known during that parsing stage.
In any case, whatever the problem, the simple solution is to replace the comma with CROSS JOIN:
INSERT INTO ...
SELECT ...
FROM table1 t1 CROSS JOIN table2 t2 -- the comma is replaced with an explicit CROSS JOIN
JOIN table3 t3 ON t2.column = t3.column
JOIN table4 t4 ON t2.column = t4.column
LEFT OUTER JOIN table5 t5 ON (t4.column = t5.column AND t4.column = t2.column AND t4.column = t3.column)
WHERE ...;
This should work in all the databases you mention -- and more.

Hive - grab from two tables without join

In MySQL, I can select from two tables without a join, like so:
SELECT t1.value, t2.value FROM t1, t2 WHERE (t1.value = t2.value);
Hive, on the other hand, will accept "FROM t1 join t2" but not "FROM t1, t2".
Does anyone have any ideas about how to optimize a query like
SELECT t1.value, t2.value FROM t1 join t2 WHERE (t1.value = t2.value);
in any other way?
(Also, why does switching from "select from t1 join t2" to "select from t1, t2" in MySQL optimize queries anyway?)
Why don't you want to use a join? Selecting from two tables and requiring some equalities between them results in an inner join.
Also, with the join you are using, you are creating the Cartesian product of both tables and then keeping only those records where t1.value = t2.value. Directly using an inner join is more efficient:
SELECT t1.value, t2.value FROM t1 JOIN t2 ON t1.value=t2.value;
If one of your tables is remarkably small, you can do a map-side join. The small table is cached in memory while the larger one is streamed through, and no reduce step is necessary. To activate map-side joins you have to execute set hive.auto.convert.join=true; before running the query. The threshold for the maximum table size in bytes for map-side joins is set by the property hive.mapjoin.smalltable.filesize.
(Source: Edward Capriolo, Dean Wampler, and Jason Rutherglen. Programming Hive. O'Reilly, 2012.)
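Putting that together, a sketch of the session settings (25000000 bytes is the commonly cited default threshold; verify it for your Hive version):
-- enable automatic conversion of common joins to map-side joins
SET hive.auto.convert.join=true;
-- tables below this size in bytes qualify as the "small" side
SET hive.mapjoin.smalltable.filesize=25000000;

SELECT t1.value, t2.value FROM t1 JOIN t2 ON t1.value = t2.value;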

How can I speed up MySQL query with multiple joins

Here is my issue: I am selecting and doing multiple joins to get the correct items. It pulls in a fair number of rows, above 100,000, and the query takes more than 5 minutes when the date range is set to one year.
I don't know if it's possible, but I am afraid that a user might extend the date range to, say, ten years and crash it.
Does anyone know how I can speed this up? Here is the query:
SELECT DISTINCT t1.first_name, t1.last_name, t1.email
FROM table1 AS t1
INNER JOIN table2 AS t2 ON t1.CU_id = t2.O_cid
INNER JOIN table3 AS t3 ON t2.O_ref = t3.I_oref
INNER JOIN table4 AS t4 ON t3.I_pid = t4.P_id
INNER JOIN table5 AS t5 ON t4.P_cat = t5.C_id
WHERE t1.subscribe =1
AND t1.Cdate >= $startDate
AND t1.Cdate <= $endDate
AND t5.store =2
I am not the greatest with MySQL, so any help would be appreciated!
Thanks in advance!
UPDATE
Here is the EXPLAIN you asked for:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE t5 ref PRIMARY,C_store_type,C_id,C_store_type_2 C_store_type_2 1 const 101 Using temporary
1 SIMPLE t4 ref PRIMARY,P_cat P_cat 5 alphacom.t5.C_id 326 Using where
1 SIMPLE t3 ref I_pid,I_oref I_pid 4 alphacom.t4.P_id 31
1 SIMPLE t2 eq_ref O_ref,O_cid O_ref 28 alphacom.t3.I_oref 1
1 SIMPLE t1 eq_ref PRIMARY PRIMARY 4 alphacom.t2.O_cid 1 Using where
Also, I added an index to table5 and table4 because their rows don't really change; however, the other tables get around 500-1000 new entries a month. I heard you should add an index to a table that gets that many new entries... is this true?
I'd try the following:
First, ensure there are indexes on the following tables and columns (each set of columns in parentheses should be a separate index):
table1: (subscribe, CDate) and (CU_id)
table2: (O_cid) and (O_ref)
table3: (I_oref) and (I_pid)
table4: (P_id) and (P_cat)
table5: (C_id, store)
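As a sketch, those would be created like so (index names are made up):
CREATE INDEX ix_t1_subscribe_cdate ON table1 (subscribe, CDate);
CREATE INDEX ix_t1_cu_id ON table1 (CU_id);
CREATE INDEX ix_t2_o_cid ON table2 (O_cid);
CREATE INDEX ix_t2_o_ref ON table2 (O_ref);
CREATE INDEX ix_t3_i_oref ON table3 (I_oref);
CREATE INDEX ix_t3_i_pid ON table3 (I_pid);
CREATE INDEX ix_t4_p_id ON table4 (P_id);
CREATE INDEX ix_t4_p_cat ON table4 (P_cat);
CREATE INDEX ix_t5_c_id_store ON table5 (C_id, store);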
Second, if adding the above indexes didn't improve things as much as you'd like, try rewriting the query as:
SELECT DISTINCT t1.first_name, t1.last_name, t1.email FROM
(SELECT CU_id, first_name, last_name, email
 FROM table1
 WHERE subscribe = 1 AND
       CDate >= $startDate AND
       CDate <= $endDate) AS t1
INNER JOIN table2 AS t2
    ON t1.CU_id = t2.O_cid
INNER JOIN table3 AS t3
    ON t2.O_ref = t3.I_oref
INNER JOIN table4 AS t4
    ON t3.I_pid = t4.P_id
INNER JOIN (SELECT C_id FROM table5 WHERE store = 2) AS t5
    ON t4.P_cat = t5.C_id
I'm hoping here that the first sub-select would cut down significantly on the number of rows to be considered for joining, hopefully making the subsequent joins do less work. Ditto the reasoning behind the second sub-select on table5.
In any case, mess with it. I mean, ultimately it's just a SELECT - you can't really hurt anything with it. Examine the plans that are generated by each different permutation and try to figure out what's good or bad about each.
Share and enjoy.
Make sure your date columns and all the columns you are joining on are indexed.
Using a non-equality (range) operator on your dates means every row has to be checked, which is inherently slower than an equality comparison.
Also, using DISTINCT adds an extra comparison step to the logic your optimizer is running behind the scenes. Eliminate it if possible.
Well, first, make a subquery to decimate table1 down to just the records you actually want to go to all the trouble of joining:
SELECT DISTINCT t1.first_name, t1.last_name, t1.email
FROM (
SELECT first_name, last_name, email, CU_id FROM table1 WHERE
table1.subscribe = 1
AND table1.Cdate >= $startDate
AND table1.Cdate <= $endDate
) AS t1
INNER JOIN table2 AS t2 ON t1.CU_id = t2.O_cid
INNER JOIN table3 AS t3 ON t2.O_ref = t3.I_oref
INNER JOIN table4 AS t4 ON t3.I_pid = t4.P_id
INNER JOIN table5 AS t5 ON t4.P_cat = t5.C_id
WHERE t5.store = 2
Then start looking at modifying the directionality of the joins.
Additionally, if t5.store is only very rarely 2, then flip this idea around: construct the t5 subquery, then join it back and back and back.
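A sketch of that flipped form, starting from the t5 filter and joining outward (same tables and columns as above):
SELECT DISTINCT t1.first_name, t1.last_name, t1.email
FROM (SELECT C_id FROM table5 WHERE store = 2) AS t5
INNER JOIN table4 AS t4 ON t4.P_cat = t5.C_id
INNER JOIN table3 AS t3 ON t3.I_pid = t4.P_id
INNER JOIN table2 AS t2 ON t2.O_ref = t3.I_oref
INNER JOIN table1 AS t1 ON t1.CU_id = t2.O_cid
WHERE t1.subscribe = 1
  AND t1.Cdate >= $startDate
  AND t1.Cdate <= $endDate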
At present, your query is returning all matching rows on table2-table5, just to establish whether t5.store = 2. If any of table2-table5 have a significantly higher row count than table1, this may be greatly increasing the number of rows processed - consequently, the following query may perform significantly better:
SELECT DISTINCT t1.first_name, t1.last_name, t1.email
FROM table1 AS t1
WHERE t1.subscribe =1
AND t1.Cdate >= $startDate
AND t1.Cdate <= $endDate
AND EXISTS
(SELECT NULL FROM table2 AS t2
INNER JOIN table3 AS t3 ON t2.O_ref = t3.I_oref
INNER JOIN table4 AS t4 ON t3.I_pid = t4.P_id
INNER JOIN table5 AS t5 ON t4.P_cat = t5.C_id AND t5.store =2
WHERE t1.CU_id = t2.O_cid);
Try adding indexes on the fields that you join on. It may or may not improve performance.
It also depends on the storage engine you are using. If you are using InnoDB, check your configuration parameters. I faced a similar problem, as InnoDB's default configuration doesn't scale as well as MyISAM's defaults.
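For InnoDB, the parameter that usually matters most is the buffer pool size. An illustrative my.cnf fragment (the 2G value is an example; size it to your RAM and working set):
[mysqld]
# cache as much of the working set in memory as the machine can afford
innodb_buffer_pool_size = 2G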
As everyone says, make sure you have indexes.
You can also check whether your server is set up properly, so that it can hold more of, or maybe the entire, dataset in memory.
Without an EXPLAIN, there's not much to go by. Also keep in mind that MySQL will look at your JOIN and iterate through all possible join orders before executing the query, which can take time. Once you have the optimal JOIN order from the EXPLAIN, you can try to force this order in your query, eliminating this step for the optimizer.
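In MySQL, one way to pin the join order is the STRAIGHT_JOIN modifier, which forces tables to be joined in the order they are listed. A sketch against the query above (only worth doing once EXPLAIN has shown you a good order):
SELECT DISTINCT STRAIGHT_JOIN t1.first_name, t1.last_name, t1.email
FROM table1 AS t1
INNER JOIN table2 AS t2 ON t1.CU_id = t2.O_cid
INNER JOIN table3 AS t3 ON t2.O_ref = t3.I_oref
INNER JOIN table4 AS t4 ON t3.I_pid = t4.P_id
INNER JOIN table5 AS t5 ON t4.P_cat = t5.C_id
WHERE t1.subscribe = 1
  AND t1.Cdate >= $startDate
  AND t1.Cdate <= $endDate
  AND t5.store = 2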
It sounds like you should think about delivering subsets (paging) or limit the results some other way unless there is a reason that the users need every row possible all at once. Typically 100K rows is more than the average person can digest.