Hive - grab from two tables without join - optimization

In MySQL, I can select from two tables without a join, like so:
SELECT t1.value, t2.value FROM t1, t2 WHERE (t1.value = t2.value);
Hive, on the other hand, will accept "FROM t1 join t2" but not "FROM t1, t2".
Does anyone have any ideas about how to optimize a query like
SELECT t1.value, t2.value FROM t1 join t2 WHERE (t1.value = t2.value);
in any other way?
(Also, why does switching from "select from t1 join t2" to "select from t1, t2" in MySQL optimize queries anyway?)

Why don't you want to use a join? Selecting from two tables and requiring some equalities between them results in an inner join.
Also, with the join you are using, you are creating the cartesian product of both tables and then keeping only those records where t1.value=t2.value. Directly using an inner join would be more efficient:
SELECT t1.value, t2.value FROM t1 JOIN t2 ON t1.value=t2.value;
If one of your tables is remarkably small, you could do a map-side join. The small table is cached in memory while the larger one is streamed through, so no reduce step is necessary. To activate the map-side join, execute set hive.auto.convert.join=true; before executing the query. The threshold for the maximum table size in bytes for map-side joins is set in the property hive.mapjoin.smalltable.filesize.
(Source: Edward Capriolo, Dean Wampler, and Jason Rutherglen. Programming Hive.
O’Reilly, 2012.)
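To make the map-side join idea concrete, here is a minimal sketch in plain Python of the underlying pattern: the small table is loaded into an in-memory dictionary and the large table is streamed past it once. It only illustrates the concept and is not how Hive implements it; the function name map_side_join and the sample rows are made up for the example.

```python
def map_side_join(small_rows, large_rows):
    """Join two iterables of (key, payload) rows on key.

    Hypothetical helper for illustration only: small_rows is held in
    memory and large_rows is streamed once, so no shuffle or reduce
    step is needed.
    """
    # Build phase: cache the small table in memory, grouped by key.
    lookup = {}
    for key, payload in small_rows:
        lookup.setdefault(key, []).append(payload)

    # Probe phase: stream the large table and emit every matching pair.
    for key, payload in large_rows:
        for small_payload in lookup.get(key, ()):
            yield key, small_payload, payload


# Made-up sample data: (value, label) rows.
small = [(1, "a"), (2, "b")]
large = [(1, "x"), (1, "y"), (3, "z")]

joined = list(map_side_join(small, large))
print(joined)
```

The memory cost is driven by the small side only, which is why the approach depends on one table fitting under a size threshold.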

Related

what operation does "select from table1, table2 " imply? [duplicate]

I know different joins, but I wanted to know which of them is being used when we run queries like this:
select * from table1 t1, table2 t2
Is it a full outer join or a natural join, for example?
Also, does it have the same meaning across different databases?
UPDATE: What if we add a WHERE clause? Will it always be an inner join?
The comma in the from clause -- by itself -- is equivalent to cross join in almost all databases. So:
from table1 t1, table2 t2
is functionally equivalent to:
from table1 t1 cross join table2 t2
They are not exactly equivalent, because the scoping rules within the from clause are slightly different. So:
from table1 t1, table2 t2 join
table3 t3
on t1.x = t3.x
generates an error, whereas the equivalent query with cross join works.
In general, a condition in the WHERE clause that relates the two tables turns the cross join into an INNER JOIN. However, some databases have extended the syntax to support outer joins in the WHERE clause.
I can think of one exception where the comma does not mean CROSS JOIN. Google's BigQuery originally used the comma for UNION ALL. However, that is only in Legacy SQL and they have removed that in Standard SQL.
Commas in the FROM clause have been out of fashion since the 1900s. They are the "original" form of joining tables in SQL, but explicit JOIN syntax is much better.
To me, they also mean someone who learned SQL decades ago and refused to learn about outer joins, or someone who has learned SQL from ancient materials -- and doesn't know a lot of other things that SQL does.
This is a CROSS JOIN (cartesian product). So both of the following queries are equivalent:
SELECT * FROM table1, table2 -- implicit CROSS JOIN
SELECT * FROM table1 CROSS JOIN table2 -- explicit CROSS JOIN
Concerning the UPDATE:
A WHERE clause that compares the two tables turns the general CROSS JOIN into an INNER JOIN. An INNER JOIN can be written in three ways:
SELECT * FROM table1, table2 WHERE table1.id = table2.id -- implicit CROSS JOIN notation
SELECT * FROM table1 CROSS JOIN table2 WHERE table1.id = table2.id -- really unusual!: explicit CROSS JOIN notation
SELECT * FROM table1 INNER JOIN table2 ON (table1.id = table2.id) -- explicit INNER JOIN NOTATION
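As a quick check of the three notations, the sketch below runs them against an in-memory SQLite database and compares the results. SQLite is used only as a convenient stand-in (other engines may differ in details), and the column names a and b and the sample rows are made up.

```python
import sqlite3

# In-memory SQLite database with made-up sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, a TEXT);
    CREATE TABLE table2 (id INTEGER, b TEXT);
    INSERT INTO table1 VALUES (1, 'a1'), (2, 'a2'), (3, 'a3');
    INSERT INTO table2 VALUES (2, 'b2'), (3, 'b3'), (4, 'b4');
""")

queries = {
    "comma": "SELECT * FROM table1, table2 WHERE table1.id = table2.id",
    "cross join": "SELECT * FROM table1 CROSS JOIN table2 WHERE table1.id = table2.id",
    "inner join": "SELECT * FROM table1 INNER JOIN table2 ON (table1.id = table2.id)",
}

# Sort each result so the comparison does not depend on row order.
results = {}
for name, sql in queries.items():
    results[name] = sorted(conn.execute(sql).fetchall())

# Without a WHERE clause the comma yields the full cartesian product.
cross_count = conn.execute("SELECT COUNT(*) FROM table1, table2").fetchone()[0]

for name, rows in results.items():
    print(name, rows)
print("cartesian product rows:", cross_count)
```

All three notations return the same two rows here, while the bare comma without a WHERE clause returns 3 x 3 = 9 rows.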

SQL Query Performance Join with condition

Calling all SQL experts. I have the following select statement:
SELECT 1
FROM table1 t1
JOIN table2 t2 ON t1.id = t2.id
WHERE t1.field = xyz
I'm a little bit worried about the performance here. Is the where clause evaluated before or after the join? If it's evaluated after, is there a way to evaluate the where clause first?
The whole table could easily contain more than a million entries, but after the where clause only 1-10 entries may be left, so in my opinion it makes a big performance difference when the where clause is evaluated.
Thanks in advance.
Dimi
You could rewrite your query like this:
SELECT 1
FROM (SELECT * FROM table1 WHERE field = xyz) t1
JOIN table2 t2 ON t1.id = t2.id
But depending on the database product the optimiser might still decide that the best way to do this is to JOIN table1 to table2 and then apply the constraint.
For this query:
SELECT 1
FROM table1 t1 JOIN
table2 t2
ON t1.id = t2.id
WHERE t1.field = xyz;
The optimal indexes are table1(field, id), table2(id).
How the query is executed depends on the optimizer. It is tasked with choosing the best execution plan, given the table statistics and environment.
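As a rough way to try the suggested indexes, the sketch below creates them in an in-memory SQLite database and prints the plan the optimizer picks. SQLite stands in for whatever DBMS is actually in use, EXPLAIN QUERY PLAN is SQLite-specific, and the index names and generated rows are made up; other engines expose their plans differently and may choose differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, field TEXT);
    CREATE TABLE table2 (id INTEGER);
    -- Index names are made up; the column lists follow the answer above.
    CREATE INDEX idx_table1_field_id ON table1 (field, id);
    CREATE INDEX idx_table2_id ON table2 (id);
""")

# Made-up data: 1000 rows per table, 10 of which have field = 'xyz'.
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [(i, "xyz" if i % 100 == 0 else "other") for i in range(1000)],
)
conn.executemany("INSERT INTO table2 VALUES (?)", [(i,) for i in range(1000)])

query = """
    SELECT 1
    FROM table1 t1
    JOIN table2 t2 ON t1.id = t2.id
    WHERE t1.field = 'xyz'
"""
rows = conn.execute(query).fetchall()

# The last column of each EXPLAIN QUERY PLAN row describes one plan step.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]

print(len(rows), "matching rows")
for step in plan:
    print(step)
```

Reading the printed steps shows whether the engine searches by index or scans a table; the exact wording varies between SQLite versions.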
Each DBMS has its own query optimizer. In a case like yours, it would be logical for the WHERE to be applied first and then the JOIN part of the query.
As mentioned in the comments and other answers, with performance the answer is always "it depends". Depending on your DBMS and the indexing of the base tables, the query may be fine as is, and the optimizer may evaluate the WHERE first. Or the join may be efficient anyway if the indexes cover the join requirements.
Alternatively, you can try to get the behavior you require by reducing the dataset of t1 before you do the join, using a nested select as Richard suggested, or by adding t1.field = xyz to the join condition, for example:
ON t1.field = xyz AND t1.id = t2.id
Personally, if I needed to reduce the dataset before the join, I would use a CTE:
With T1 AS
(
SELECT * FROM table1
WHERE Field = 'xyz'
)
SELECT 1
FROM T1
JOIN Table2 T2
ON T1.Id = T2.Id
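The rewrites discussed in this thread (filter in WHERE, filter in a derived table, filter in the ON clause, filter in a CTE) are logically equivalent for an inner join. The sketch below checks that on an in-memory SQLite database with made-up rows; it selects t1.id instead of 1 so the results can be compared. It says nothing about which form a given optimizer executes faster, and for outer joins moving a condition into ON can change the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, field TEXT);
    CREATE TABLE table2 (id INTEGER);
    INSERT INTO table1 VALUES (1, 'xyz'), (2, 'abc'), (3, 'xyz'), (4, 'xyz');
    INSERT INTO table2 VALUES (1), (2), (3);
""")

variants = {
    "where": """
        SELECT t1.id FROM table1 t1
        JOIN table2 t2 ON t1.id = t2.id
        WHERE t1.field = 'xyz'
    """,
    "derived table": """
        SELECT t1.id
        FROM (SELECT * FROM table1 WHERE field = 'xyz') t1
        JOIN table2 t2 ON t1.id = t2.id
    """,
    "on": """
        SELECT t1.id FROM table1 t1
        JOIN table2 t2 ON t1.field = 'xyz' AND t1.id = t2.id
    """,
    "cte": """
        WITH t1 AS (SELECT * FROM table1 WHERE field = 'xyz')
        SELECT t1.id FROM t1
        JOIN table2 t2 ON t1.id = t2.id
    """,
}

# Sort each result so the comparison does not depend on row order.
results = {}
for name, sql in variants.items():
    results[name] = sorted(conn.execute(sql).fetchall())

for name, rows in results.items():
    print(name, rows)
```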

In joining 2 tables, do the WHERE clauses reduce the table sizes before or after the join occurs?

For example, does the first query get processed differently than the second query?
Query 1
SELECT t1.var1, t2.var2 FROM table1 t1
INNER JOIN table2 t2
ON t1.key = t2.key
WHERE t2.ID = 'ABCD'
Query 2
SELECT t1.var1, t2.var2 FROM table1 t1
INNER JOIN (
SELECT var2, key from table2
WHERE ID = 'ABCD'
) t2
ON t1.key = t2.key
At a glance, it seems as if the second query would be more efficient: table2 is reduced before the join begins, whereas the first query appears to join the tables first and reduce later. I'm using Teradata, if it matters.
It depends on the vendor, version and configuration.
An older Teradata version or a legacy configuration might spool the subquery as a first stage for Query 2, leading to worse performance than Query 1, depending on the tables' primary indexes and the join algorithm.
I would suggest to avoid this kind of "optimization".
P.S.
Check whether you get the same execution plan for both queries or different execution plans.
Check the query log for AMPCPUTime (for a start).

Optimization of DB2 query which uses joins and takes 1.5 hours to execute

When I run a SELECT statement on my view, it takes around 1.5 hours to run. What can I do to optimize it?
Below is a sample of what my view looks like:
CREATE VIEW SCHEMANAME.VIEWNAME
(
COL, COL1, COL2, COL3 )
AS SELECT
COST.ETA,
CASE
WHEN VOL.CURR IS NOT NULL
THEN COALESCE(VOL.COMM, 0)
END,
CASE
WHEN...
END
FROM TABLE1 t1 inner join TABLE2 t2 ON t1.ETA=t2.ETA
INNER JOIN TABLE3 t3 on t2.ETA=t3.ETA
LEFT OUTER JOIN TABLE4 t4 on t2.ETA=t4.ETA
This is your query:
SELECT COST.ETA,
(CASE WHEN VOL.CURR IS NOT NULL THEN COALESCE(VOL.COMM, 0)
END) as ??,
. . .
FROM TABLE1 t1 inner join
TABLE2 t2
ON t1.ETA = t2.ETA INNER JOIN
TABLE3 t3
on t2.ETA = t3.ETA LEFT OUTER JOIN
TABLE4 t4
on t2.ETA = t4.ETA;
First, I will ignore the fact that the select clause references tables that are not in the from clause. I assume this is a typo.
Second, you should be able to use indexes to improve this query: table1(eta), table2(eta), table3(eta), and table4(eta).
Third, I am highly suspicious of seeing the same column used for joining so many tables. I suspect that you might have cartesian products occurring, because there are multiple rows for any given eta value in several tables. If that is the case, you need to fix the query to better reflect what you really need. If so, ask another question with sample data and desired results, because your query is probably not correct.
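The row multiplication described in the third point can be reproduced on a tiny scale. The sketch below uses an in-memory SQLite database with made-up data (a column v and a handful of rows sharing one eta value) to show how the joined row count becomes the product of the per-table counts, plus a simple query for spotting non-unique join keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (eta INTEGER, v TEXT);
    CREATE TABLE table2 (eta INTEGER, v TEXT);
    CREATE TABLE table3 (eta INTEGER, v TEXT);
    -- Every row shares eta = 1, so the join key is not unique anywhere.
    INSERT INTO table1 VALUES (1, 'a'), (1, 'b');
    INSERT INTO table2 VALUES (1, 'c'), (1, 'd'), (1, 'e');
    INSERT INTO table3 VALUES (1, 'f'), (1, 'g');
""")

# 2 rows x 3 rows x 2 rows for the same eta value.
joined_rows = conn.execute("""
    SELECT COUNT(*)
    FROM table1 t1
    INNER JOIN table2 t2 ON t1.eta = t2.eta
    INNER JOIN table3 t3 ON t2.eta = t3.eta
""").fetchone()[0]

# Per table, list eta values that occur more than once.
dupes = {}
for table in ("table1", "table2", "table3"):
    dupes[table] = conn.execute(
        f"SELECT eta, COUNT(*) FROM {table} GROUP BY eta HAVING COUNT(*) > 1"
    ).fetchall()

print("joined rows:", joined_rows)
print("duplicate keys:", dupes)
```

Seven input rows produce twelve joined rows; with thousands of duplicates per key the same effect could plausibly explain a very long-running view.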

Query design for nested statements and CTEs

I have a query that sequentially joins 6 tables from their original data sources. Nested, it's a mess:
SELECT
FROM
(
SELECT
FROM
(
SELECT
FROM
(. . .)
INNER JOIN
)
INNER JOIN
)
I switched to CTE definitions, and each definition is one join on a previous definition, with the final query at the end providing the result:
WITH
Table1 (field1, field2) AS
(
SELECT
FROM
INNER JOIN
),
Table2 (field2, field3) AS
(
SELECT
FROM Table1
INNER JOIN
), . . .
SELECT
FROM Table6
This is a lot more readable, and dependencies flow downward in logical order. However, this doesn't seem like the intended use of CTEs (and also why I'm not using Views), since each definition is really only referenced once in order.
Is there any guidance out there on how to construct sequentially nested joins like this that is both readable and logical in structure?
I don't think there is anything wrong in utilizing CTE to create temporary views.
In a larger shop, there are defined roles that separate the responsibilities of DBAs and developers. The CREATE statement, in general, will be the victim of this bureaucracy. Hence, no view. A CTE is a very good compromise.
If the views are not really reusable anyway, keeping them with the SQL makes it more readable.
A CTE is a lot more readable and intuitive than subqueries (even with just one level). If your subqueries are not correlated, I would suggest converting all of them to CTEs.
Recursion is the "killer" app for CTE, but it doesn't mean that you shouldn't use CTE, otherwise.
The only con that I can think of is that (depending on your database engine) it might confuse the optimizer or prevent it from doing what it's supposed to do. Optimizers are smart enough to rewrite subqueries for you.
Now, let us discuss abuse of CTEs. Be careful that you don't substitute your application developer knowledge for database engine optimization. There are a lot of smart developers (smarter than us) who designed this software; just write the query without CTEs or subqueries as much as possible and let the DB do the work. For example, I often see developers DISTINCT/WHERE every key in a subquery before doing their join. You may think you're doing the right thing, but you're not.
With regards to your question, most people intend to solve problems and not discuss something theoretical. Hence, you get people scratching their heads on what you are after. I wouldn't say you didn't imply that in your text, but perhaps be more forceful.
Maybe I didn't understand the question, but what's wrong with:
select * from table1 t1, table2 t2, table3 t3, table4 t4, table5 t5, table6 t6
where t1.id = t2.id and t2.id = t3.id and t3.id = t4.id
and t4.id = t5.id and t5.id = t6.id
or the same using table1 t1 INNER JOIN table2 t2 ON t1.id = t2.id ....
Why didn't you just join your tables like this?
select *
from Table1 as t1
inner join Table2 as t2 on t2.<column> = t1.<column>
inner join Table3 as t3 on t3.<column> = t2.<column>
inner join Table4 as t4 on t4.<column> = t3.<column>
inner join Table5 as t5 on t5.<column> = t4.<column>
inner join Table6 as t6 on t6.<column> = t5.<column>
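To compare the two styles discussed in this thread, the sketch below builds a chain of CTEs (one join per definition) and a single flat join over the same tables, then checks that both return the same rows. It uses an in-memory SQLite database with three made-up tables instead of six to stay short; the names source1..source3, step1 and step2 are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source1 (id INTEGER, field1 TEXT);
    CREATE TABLE source2 (id INTEGER, field2 TEXT);
    CREATE TABLE source3 (id INTEGER, field3 TEXT);
    INSERT INTO source1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO source2 VALUES (1, 'd'), (2, 'e');
    INSERT INTO source3 VALUES (2, 'f'), (3, 'g');
""")

# Style 1: each CTE adds one join on top of the previous definition.
cte_chain = """
    WITH
    step1 (id, field1, field2) AS (
        SELECT s1.id, s1.field1, s2.field2
        FROM source1 s1
        INNER JOIN source2 s2 ON s2.id = s1.id
    ),
    step2 (id, field1, field2, field3) AS (
        SELECT step1.id, step1.field1, step1.field2, s3.field3
        FROM step1
        INNER JOIN source3 s3 ON s3.id = step1.id
    )
    SELECT * FROM step2
"""

# Style 2: one flat query that joins the tables directly.
flat_join = """
    SELECT s1.id, s1.field1, s2.field2, s3.field3
    FROM source1 s1
    INNER JOIN source2 s2 ON s2.id = s1.id
    INNER JOIN source3 s3 ON s3.id = s2.id
"""

cte_rows = sorted(conn.execute(cte_chain).fetchall())
flat_rows = sorted(conn.execute(flat_join).fetchall())
print(cte_rows)
print(flat_rows)
```

When each step is a plain inner join with no intermediate aggregation or filtering, the flat form expresses the same result; the CTE chain mainly earns its keep when the steps do more than join.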