Hive star schema query with 'SELECT *' in sub-selects - hive

I came across a previously implement query in Hive that I am trying to grok, and was wondering if someone could explain what the advantages (or lack there of) of the query pattern used. The query structure is a star schema that sub-selects the join tables in this manner:
SELECT
a.key
a.field1
b.field2
c.field3
d.field4
FROM first a
JOIN ( SELECT * FROM second ) b ON a.key = b.key
JOIN ( SELECT * from third ) c ON a.key = c.key
JOIN ( SELECT * from fourth ) d ON a.kay = d.key
SORT BY a.key DESC;
The thing that is perplexing me is why would you sub-select the join tables (note the SELECT * with no WHERE) rather than join directly to them. Before I go changing legacy code queries (for other reasons), I wanted to understand what might be the goals of this approach. The query was written in the time of Hive 0.10, but we are up to Hive 0.13 now. Could this be a legacy work around for something?

There is no any advantage in subqueries with select all and with no WHERE clause. But it maybe useful if applicable to limit columns and rows in subqueries so that their datasets will fit in memory and map-only join will work. Three subqueries in your example can be pre-calculated in parallel if hive.exec.parallel=true.
Also SORT BY a.key DESC seems useless in this query.

my best guess would be that it remained from the testing phase. to keep the where clauses in their own spaces. Performance wise there is no improvement.

Related

Oracle: Use only few tables in WHERE clause but mentioned more tables in 'FROM' in a jon SQL

What will happen in an Oracle SQL join if I don't use all the tables in the WHERE clause that were mentioned in the FROM clause?
Example:
SELECT A.*
FROM A, B, C, D
WHERE A.col1 = B.col1;
Here I didn't use the C and D tables in the WHERE clause, even though I mentioned them in FROM. Is this OK? Are there any adverse performance issues?
It is poor practice to use that syntax at all. The FROM A,B,C,D syntax has been obsolete since 1992... more than 30 YEARS now. There's no excuse anymore. Instead, every join should always use the JOIN keyword, and specify any join conditions in the ON clause. The better way to write the query looks like this:
SELECT A.*
FROM A
INNER JOIN B ON A.col1 = B.col1
CROSS JOIN C
CROSS JOIN D;
Now we can also see what happens in the question. The query will still run if you fail to specify any conditions for certain tables, but it has the effect of using a CROSS JOIN: the results will include every possible combination of rows from every included relation (where the "A,B" part counts as one relation). If each of the three parts of those joins (A&B, C, D) have just 100 rows, the result set will have 1,000,000 rows (100 * 100 * 100). This is rarely going to give the results you expect or intend, and it's especially suspect when the SELECT clause isn't looking at any of the fields from the uncorrelated tables.
Any table lacking join definition will result in a Cartesian product - every row in the intermediate rowset before the join will match every row in the target table. So if you have 10,000 rows and it joins without any join predicate to a table of 10,000 rows, you will get 100,000,000 rows as a result. There are only a few rare circumstances where this is what you want. At very large volumes it can cause havoc for the database, and DBAs are likely to lock your account.
If you don't want to use a table, exclude it entirely from your SQL. If you can't for reason due to some constraint we don't know about, then include the proper join predicates to every table in your WHERE clause and simply don't list any of their columns in your SELECT clause. If there's a cost to the join and you don't need anything from it and again for some very strange reason can't leave the table out completely from your SQL (this does occasionally happen in reusable code), then you can disable the joins by making the predicates always false. Remember to use outer joins if you do this.
Native Oracle method:
WITH data AS (SELECT ROWNUM col FROM dual CONNECT BY LEVEL < 10) -- test data
SELECT A.*
FROM data a,
data b,
data c,
data d
WHERE a.col = b.col
AND DECODE('Y','Y',NULL,a.col) = c.col(+)
AND DECODE('Y','Y',NULL,a.col) = d.col(+)
ANSI style:
WITH data AS (SELECT ROWNUM col FROM dual CONNECT BY LEVEL < 10)
SELECT A.*
FROM data a
INNER JOIN data b ON a.col = b.col
LEFT OUTER JOIN data c ON DECODE('Y','Y',NULL,a.col) = b.col
LEFT OUTER JOIN data d ON DECODE('Y','Y',NULL,a.col) = d.col
You can plug in a variable for the first Y that you set to Y or N (e.g. var_disable_join). This will bypass the join and avoid both the associated performance penalty and the Cartesian product effect. But again, I want to reiterate, this is an advanced hack and is probably NOT what you need. Simply leaving out the unwanted tables it the right approach 95% of the time.

Oracle SQL Query Filter in JOIN ON vs WHERE

For inner joins, is there any difference in performance to apply a filter in the JOIN ON clause or the WHERE clause? Which is going to be more efficient, or will the optimizer render them equal?
JOIN ON
SELECT u.name
FROM users u
JOIN departments d
ON u.department_id = d.id
AND d.name = 'IT'
VS
WHERE
SELECT u.name
FROM users u
JOIN departments d
ON u.department_id = d.id
WHERE d.name = 'IT'
Oracle 11gR2
There should be no difference. The optimizer should generate the same plan in both cases and should be able to apply the predicate before, after, or during the join in either case based on what is the most efficient approach for that particular query.
Of course, the fact that the optimizer can do something, in general, is no guarantee that the optimizer will actually do something in a particular query. As queries get more complicated, it becomes impossible to exhaustively consider every possible query plan which means that even with perfect information and perfect code, the optimizer may not have time to do everything that you'd like it to do. You'd need to take a look at the actual plans generated for the two queries to see if they are actually identical.
I prefer putting the filter criteria in the where clause.
With data warehouse queries, putting the filter criteria in the join seems to cause the query to last significantly longer.
For example, I have Table1 indexed by field Date, and Table2 partitioned by field Partition. Table2 is the biggest table in the query and is in another database server. I use driving_site hint to tell the optimizer to use Table2 partitions.
select /*+driving_site(b)*/ a.key, sum(b.money) money
from schema.table1 a
join schema2.table2#dblink b
on a.key = b.key
where b.partition = to_number(to_char(:i,'yyyymm'))
and a.date = :i
group by a.key`
If I do the query this way, it takes about 30 - 40 seconds to return the results.
If I don't do the query this way, it takes about 10 minutes until I cancel the execution with no results.

SQL Server : does order of full outer join matter?

I have 4 full-outer joins in my query and its really slow, So does the order of FULL OUTER JOIN make a difference in performance / result ?
FULL OUTER JOIN = ⋈
Then,
I have a situation : A ⋈ B ⋈ C ⋈ D
All joins occur on a key common to all k contained in all A,B,C,D
Then:
Will changing the order of ⋈ joins make a difference to performance ?
Will changing the order of ⋈ change the result ?
I feel that it should not affect the result, but will it affect the performance or not I am not sure !
Update:
Will SQL Server automatically rearrange the joins for better performance assuming the result set will be independent of the order ?
No, rearranging the JOIN orders should not affect the performance. MSSQL (as with other DBMS) has a query optimizer whose job it is to find the most efficient query plan for any given query. Generally, these do a pretty good job - so you're unlikely to beat the optimizer easily.
That said, they do get it wrong occasionally. That's where reading an execution plan comes into play. You can add JOIN hints to tell MSSQL how to join your tables (at which point, ordering does matter). You'd generally order from smallest to largest table (though, with a FULL JOIN, it's not likely to matter very much) and follow the rules of thumb for join types.
Since you're doing FULL JOINS, you're basically reading the entirety of 4 tables off disk. That's likely to be very expensive. You may want to re-examine the problem, and see if it can be accomplished in a different way.
Will changing the order of ⋈ change the result ?
No, the order of the FULL JOIN does not matter, the result will be the same. Notice however, that you can't use something like this (the following may give different results depending on the order of joins):
SELECT
COALESCE(a.id, b.id, c.id, d.id) AS id, --- Key columns used in FULL JOIN
a.*, b.*, c.*, d.* --- other columns
FROM a
FULL JOIN b
ON b.id = a.id
FULL JOIN c
ON c.id = a.id
FULL JOIN d
ON d.id = a.id ;
You have to use something like this (no difference in results whatever the order of joins):
SELECT
COALESCE(a.id, b.id, c.id, d.id) AS id,
a.*, b.*, c.*, d.*
FROM a
FULL JOIN b
ON b.id = a.id
FULL JOIN c
ON c.id = COALESCE(a.id, b.id)
FULL JOIN d
ON d.id = COALESCE(a.id, b.id, c.id) ;
Will changing the order of ⋈ joins make a difference to performance?
Taking into consideration that the second and third joins have to be done on the COALESCE() of the columns and not the columns themselves, I think only testing with large enough tables will show if the indexes can be used effectively.
Changing the order of a Full outer join shouldn't affect performance or results. The only thing that will be affected based on order of a Full Outer Join is the default order of the columns produced if using a SELECT *. You may be having performance issues simply from trying to do multiple joins with large tables. If there is no where clause to limit the tables, you could be going through hundreds of thousands of results.

SQL (any) Request for insight on a query optimization

I have a particularly slow query due to the vast amount of information being joined together. However I needed to add a where clause in the shape of id in (select id from table).
I want to know if there is any gain from the following, and more pressing, will it even give the desired results.
select a.* from a where a.id in (select id from b where b.id = a.id)
as an alternative to:
select a.* from a where a.id in (select id from b)
Update:
MySQL
Can't be more specific sorry
table a is effectively a join between 7 different tables.
use of * is for examples
Edit, b doesn't get selected
Your question was about the difference between these two:
select a.* from a where a.id in (select id from b where b.id = a.id)
select a.* from a where a.id in (select id from b)
The former is a correlated subquery. It may cause MySQL to execute the subquery for each row of a.
The latter is a non-correlated subquery. MySQL should be able to execute it once and cache the results for comparison against each row of a.
I would use the latter.
Both queries you list are the equivalent of:
select a.*
from a
inner join b on b.id = a.id
Almost all optimizers will execute them in the same way.
You could post a real execution plan, and someone here might give you a way to speed it up. It helps if you specify what database server you are using.
YMMV, but I've often found using EXISTS instead of IN makes queries run faster.
SELECT a.* FROM a WHERE EXISTS (SELECT 1 FROM b WHERE b.id = a.id)
Of course, without seeing the rest of the query and the context, this may not make the query any faster.
JOINing may be a more preferable option, but if a.id appears more than once in the id column of b, you would have to throw a DISTINCT in there, and you more than likely go backwards in terms of optimization.
I would never use a subquery like this. A join would be much faster.
select a.*
from a
join b on a.id = b.id
Of course don't use select * either (especially never use it when doing a join as at least one field is repeated) and it wastes network resources to send unnneeded data.
Have you looked at the execution plan?
How about
select a.*
from a
inner join b
on a.id = b.id
presumably the id fields are primary keys?
Select a.* from a
inner join (Select distinct id from b) c
on a.ID = c.AssetID
I tried all 3 versions and they ran about the same. The execution plan was the same (inner join, IN (with and without where clause in subquery), Exists)
Since you are not selecting any other fields from B, I prefer to use the Where IN(Select...) Anyone would look at the query and know what you are trying to do (Only show in a if in b.).
your problem is most likely in the seven tables within "a"
make the FROM table contain the "a.id"
make the next join: inner join b on a.id = b.id
then join in the other six tables.
you really need to show the entire query, list all indexes, and approximate row counts of each table if you want real help

Is there something wrong with joins that don't use the JOIN keyword in SQL or MySQL?

When I started writing database queries I didn't know the JOIN keyword yet and naturally I just extended what I already knew and wrote queries like this:
SELECT a.someRow, b.someRow
FROM tableA AS a, tableB AS b
WHERE a.ID=b.ID AND b.ID= $someVar
Now that I know that this is the same as an INNER JOIN I find all these queries in my code and ask myself if I should rewrite them. Is there something smelly about them or are they just fine?
My answer summary: There is nothing wrong with this query BUT using the keywords will most probably make the code more readable/maintainable.
My conclusion: I will not change my old queries but I will correct my writing style and use the keywords in the future.
Filtering joins solely using WHERE can be extremely inefficient in some common scenarios. For example:
SELECT * FROM people p, companies c
WHERE p.companyID = c.id AND p.firstName = 'Daniel'
Most databases will execute this query quite literally, first taking the Cartesian product of the people and companies tables and then filtering by those which have matching companyID and id fields. While the fully-unconstrained product does not exist anywhere but in memory and then only for a moment, its calculation does take some time.
A better approach is to group the constraints with the JOINs where relevant. This is not only subjectively easier to read but also far more efficient. Thusly:
SELECT * FROM people p JOIN companies c ON p.companyID = c.id
WHERE p.firstName = 'Daniel'
It's a little longer, but the database is able to look at the ON clause and use it to compute the fully-constrained JOIN directly, rather than starting with everything and then limiting down. This is faster to compute (especially with large data sets and/or many-table joins) and requires less memory.
I change every query I see which uses the "comma JOIN" syntax. In my opinion, the only purpose for its existence is conciseness. Considering the performance impact, I don't think this is a compelling reason.
The more verbose INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, FULL OUTER JOIN are from the ANSI SQL/92 syntax for joining. For me, this verbosity makes the join more clear to the developer/DBA of what the intent is with the join.
In SQL Server there are always query plans to check, a text output can be made as follows:
SET SHOWPLAN_ALL ON
GO
DECLARE #TABLE_A TABLE
(
ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
Data VARCHAR(10) NOT NULL
)
INSERT INTO #TABLE_A
SELECT 'ABC' UNION
SELECT 'DEF' UNION
SELECT 'GHI' UNION
SELECT 'JKL'
DECLARE #TABLE_B TABLE
(
ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
Data VARCHAR(10) NOT NULL
)
INSERT INTO #TABLE_B
SELECT 'ABC' UNION
SELECT 'DEF' UNION
SELECT 'GHI' UNION
SELECT 'JKL'
SELECT A.Data, B.Data
FROM
#TABLE_A AS A, #TABLE_B AS B
WHERE
A.ID = B.ID
SELECT A.Data, B.Data
FROM
#TABLE_A AS A
INNER JOIN #TABLE_B AS B ON A.ID = B.ID
Now I'll omit the plan for the table variable creates, the plan for both queries is identical though:
SELECT A.Data, B.Data FROM #TABLE_A AS A, #TABLE_B AS B WHERE A.ID = B.ID
|--Nested Loops(Inner Join, OUTER REFERENCES:([A].[ID]))
|--Clustered Index Scan(OBJECT:(#TABLE_A AS [A]))
|--Clustered Index Seek(OBJECT:(#TABLE_B AS [B]), SEEK:([B].[ID]=#TABLE_A.[ID] as [A].[ID]) ORDERED FORWARD)
SELECT A.Data, B.Data FROM #TABLE_A AS A INNER JOIN #TABLE_B AS B ON A.ID = B.ID
|--Nested Loops(Inner Join, OUTER REFERENCES:([A].[ID]))
|--Clustered Index Scan(OBJECT:(#TABLE_A AS [A]))
|--Clustered Index Seek(OBJECT:(#TABLE_B AS [B]), SEEK:([B].[ID]=#TABLE_A.[ID] as [A].[ID]) ORDERED FORWARD)
So, short answer - No need to rewrite, unless you spend a long time trying to read them each time you maintain them?
It's more of a syntax choice. I prefer grouping my join conditions with my joins, hence I use the INNER JOIN syntax
SELECT a.someRow, b.someRow
FROM tableA AS a
INNER JOIN tableB AS b
ON a.ID = b.ID
WHERE b.ID = ?
(? being a placeholder)
Nothing is wrong with the syntax in your example. The 'INNER JOIN' syntax is generally termed 'ANSI' syntax, and came after the style illustrated in your example. It exists to clarify the type/direction/constituents of the join, but is not generally functionally different than what you have.
Support for 'ANSI' joins is per-database platform, but it's more or less universal these days.
As a side note, one addition with the 'ANSI' syntax was the 'FULL OUTER JOIN' or 'FULL JOIN'.
Hope this helps.
In general:
Use the JOIN keyword to link (ie. "join") primary keys and foreign keys.
Use the WHERE clause to limit your result set to only the records you are interested in.
The one problem that can arise is when you try to mix the old "comma-style" join with SQL-92 joins in the same query, for example if you need one inner join and another outer join.
SELECT *
FROM table1 AS a, table2 AS b
LEFT OUTER JOIN table3 AS c ON a.column1 = c.column1
WHERE a.column2 = b.column2;
The problem is that recent SQL standards say that the JOIN is evaluated before the comma-join. So the reference to "a" in the ON clause gives an error, because the correlation name hasn't been defined yet as that ON clause is being evaluated. This is a very confusing error to get.
The solution is to not mix the two styles of joins. You can continue to use comma-style in your old code, but if you write a new query, convert all the joins to SQL-92 style.
SELECT *
FROM table1 AS a
INNER JOIN table2 AS b ON a.column2 = b.column2
LEFT OUTER JOIN table3 AS c ON a.column1 = c.column1;
Another thing to consider in the old join syntax is that is is very easy to get a cartesion join by accident since there is no on clause. If the Distinct keyword is in the query and it uses the old style joins, convert it to an ANSI standard join and see if you still need the distinct. If you are fixing accidental cartesion joins this way, you can improve performance tremendously by rewriting to specify the join and the join fields.
I avoid implicit joins; when the query is really large, they make the code hard to decipher
With explicit joins, and good formatting, the code is more readable and understandable without need for comments.
It also depends on whether you are just doing inner joins this way or outer joins as well. For instance, the MS SQL Server syntax for outer joins in the WHERE clause (=* and *=) can give different results than the OUTER JOIN syntax and is no longer supported (http://msdn.microsoft.com/en-us/library/ms178653(SQL.90).aspx) in SQL Server 2005.
And what about performances ???
As a matter of fact, performances is a very important problem in RDBMS.
So the question is what is the most performant... Using JOIN or having joined table in the WHERE clause ?
Because optimizer (or planer as they said in PG...) ordinary does a good job, the two execution plans are the same, so the performances while excuting the query will be the same...
But devil are hidden in some details....
All optimizers have a limited time or a limited amount of work to find the best plan... And when the limit is reached, the result is the best plan among all computed plans, and not the better of all possible plans !
Now the question is do I loose time when I use WHERE clause instead of JOINs for joining tables ?
And the answer is YES !
YES, because the relational engine use relational algebrae that knows only JOIN operator, not pseudo joins made in the WHERE clause. So the first thing that the optimizer do (in fact the parser or the algrebriser) is to rewrite the query... and this loose some chances to have the best of all plans !
I have seen this problem twice, in my long RDBMS career (40 years...)