Is the GROUP BY clause applied after the WHERE clause in Hive?

Suppose I have the following SQL:
select user_group, count(*)
from table
where user_group is not null
group by user_group
Suppose further that 99% of the data has null user_group.
Will this discard the rows with null before the GROUP BY, or will one poor reducer end up with 99% of the rows that are later discarded?
I hope it is the former. That would make more sense.
Bonus points if you say what will happen by Hive version. We are using 0.11 and migrating to 0.13.
Bonus points if you can point to any documentation that confirms.

Sequence
FROM & JOINs determine & filter rows
WHERE more filters on the rows
GROUP BY combines those rows into groups
HAVING filters groups
SELECT
ORDER BY arranges the remaining rows/groups
The first step is always the FROM clause. In your case, this is pretty straightforward, because there's only one table and there are no joins to worry about. In a query with joins, these are evaluated in this first step. The joins are assembled to decide which rows to retrieve, with the ON clause conditions being the criteria for deciding which rows to join from each table. The result of the FROM clause is an intermediate result. You could think of this as a temporary table, consisting of combined rows which satisfy all the join conditions. (In your case the temporary table isn't actually built, because the optimizer knows it can just access your table directly without joining to any others.)
The next step is the WHERE clause. In a query with a WHERE clause, each row in the intermediate result is evaluated against the WHERE conditions and either discarded or retained. So the rows with a null user_group are discarded before they reach the GROUP BY clause.
Next comes the GROUP BY. If there's a GROUP BY clause, the intermediate result is now partitioned into groups, one group for every combination of values in the columns in the GROUP BY clause.
Now comes the HAVING clause. The HAVING clause operates once on each group, and all rows from groups which do not satisfy the HAVING clause are eliminated.
Next comes the SELECT. From the rows of the new intermediate result produced by the GROUP BY and HAVING clauses, the SELECT now assembles the columns it needs.
Finally, the last step is the ORDER BY clause.
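The logical order described above can be checked with a small sketch. This uses Python's built-in sqlite3 module rather than Hive, and the table name `events` and its rows are made up for illustration. It only shows the logical semantics (no NULL group appears in the output); it says nothing about how Hive distributes rows to reducers. In Hive itself, running EXPLAIN on the query is the usual way to see where the filter sits in the plan.

```python
import sqlite3

# In-memory SQLite database; `events` and its contents are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_group TEXT)")

# 99 rows with a NULL user_group, plus a handful of real groups.
rows = [(None,)] * 99 + [("a",), ("a",), ("b",)]
conn.executemany("INSERT INTO events VALUES (?)", rows)

# Same shape as the query in the question: filter first, then group.
filtered = conn.execute(
    """
    SELECT user_group, COUNT(*)
    FROM events
    WHERE user_group IS NOT NULL
    GROUP BY user_group
    ORDER BY user_group
    """
).fetchall()

# Without the WHERE clause the NULLs form a group of their own.
unfiltered = conn.execute(
    """
    SELECT user_group, COUNT(*)
    FROM events
    GROUP BY user_group
    ORDER BY user_group
    """
).fetchall()

print(filtered)
print(unfiltered)
```

With the WHERE clause the NULL rows never reach the grouping step, so only the groups `a` and `b` come back.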

This query discards the rows with NULL before the GROUP BY operation.
This link may be useful:
http://dev.hortonworks.com.s3.amazonaws.com/HDPDocuments/HDP2/HDP-2.2.0/bk_dataintegration/content/hive-013-feature-subqueries-in-where-clauses.html

Related

Row Order in SQL

I wanted to know if the row order returned by a query mattered?
I'm not using a SQL service yet, just working with plain tables and Excel.
For example, if I do a left join on two tables, my take is that all the rows from the left (first-mentioned) table will come first in my resulting table, whether or not they have matches in the right one. But a classmate ordered the results so that the rows with matches came first and the ones without, with null values, came at the end.
SQL tables represent unordered sets. SQL results sets are unordered unless you explicitly have an ORDER BY for the outermost SELECT.
This is always true and is a fundamental part of the language. Your class should have covered this on day 1.
The results from a query without an ORDER BY may look like they are in a particular order. However, you should not depend on that -- or, you depend on that at your peril. The rule is simple: without an ORDER BY, you do not know the ordering of the result set.
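As a rough illustration of the left join from the question, here is a sketch using Python's built-in sqlite3 module. The tables `students` and `grades` and their rows are invented for the example. Both orderings the question mentions are legitimate; the point is that each one has to be requested with an explicit ORDER BY.

```python
import sqlite3

# Hypothetical tables standing in for the "left" and "right" tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE students (id INTEGER, name TEXT);
    CREATE TABLE grades (student_id INTEGER, grade TEXT);
    INSERT INTO students VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
    INSERT INTO grades VALUES (1, 'A'), (3, 'B');
    """
)

# Ordering 1: follow the left table's key.
by_left = conn.execute(
    """
    SELECT s.name, g.grade
    FROM students s
    LEFT JOIN grades g ON g.student_id = s.id
    ORDER BY s.id
    """
).fetchall()

# Ordering 2: matched rows first, unmatched (NULL) rows last.
# `g.student_id IS NULL` is 0 for matched rows and 1 for unmatched ones.
matched_first = conn.execute(
    """
    SELECT s.name, g.grade
    FROM students s
    LEFT JOIN grades g ON g.student_id = s.id
    ORDER BY g.student_id IS NULL, s.id
    """
).fetchall()

print(by_left)
print(matched_first)
```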

Are big-query results always ordered, that is: using OFFSET makes sense to skip rows?

In other words does a select query order results every time, so these 2 will always produce unique values:
select *
from bigquery-public-data.crypto_ethereum.balances
limit 10 OFFSET 100
select *
from bigquery-public-data.crypto_ethereum.balances
limit 10 OFFSET 2000
Assuming of course the table has unique values... I am just curious whether, without an "order" clause, the table is always deterministic/consecutive, or whether the results can contain duplicates if they are indeed returned at random. Thanks!
I am just curious whether, without an "order" clause, the table is always deterministic/consecutive, or whether the results can contain duplicates if they are indeed returned at random.
No. SQL tables represent unordered sets of rows. There is no inherent ordering of the rows. Unless an order by clause is specified, there is no guarantee that two consecutive executions of the same query will yield an identical result. The database is free to return the rows in whatever order it likes.
As a consequence, the results of a query with a row-limiting clause but no order by clause are not deterministic. Do add an order by clause to these queries, or you will sooner or later run into surprising and hard-to-debug behavior.
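A minimal sketch of paging with an explicit order by, using Python's built-in sqlite3 module instead of BigQuery. The local `balances` table, its columns, and its rows are a made-up stand-in for the public dataset in the question, and the `page` helper is hypothetical. Ordering by a unique column is what makes the pages repeatable and non-overlapping.

```python
import sqlite3

# Hypothetical local stand-in for the table in the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE balances (address TEXT PRIMARY KEY, eth_balance INTEGER)")
conn.executemany(
    "INSERT INTO balances VALUES (?, ?)",
    [(f"addr{i:04d}", i * 10) for i in range(50)],
)


def page(offset, limit=10):
    """Return one page of addresses, ordered by the unique address column."""
    return conn.execute(
        "SELECT address FROM balances ORDER BY address LIMIT ? OFFSET ?",
        (limit, offset),
    ).fetchall()


first = page(0)
second = page(10)

# With a total order on a unique key, pages do not overlap
# and re-running the same page gives the same rows.
print(set(first).isdisjoint(second))
print(page(0) == first)
```

If the sort column is not unique, ties can still be returned in any order, so adding a unique tiebreaker column to the order by is the safer design.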

exclude a column from group by statement

I would like to exclude a column from group by statement, because it results in some redundant records. Are there any recommendations?
I use Oracle and have a complex query that joins 6 tables together. I want to use an SQL aggregate function (count) without getting duplicate results.
You can't.
When using aggregate functions every column/column expression which is not an aggregate must be in the GROUP BY.
This is completely logical. If you're not aggregating the column, then excluding it from the GROUP BY would force Oracle to choose an arbitrary value, which is not very useful.
If you don't want this column in your GROUP BY then you must decide what aggregation to apply to this column in order to return the appropriate data for your situation. You can't hand this responsibility off to the database engine.
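To make the "decide what aggregation to apply" advice concrete, here is a small sketch using Python's built-in sqlite3 module rather than Oracle. The `orders` table and its data are invented, and MAX is just one possible choice of aggregate for the column that is dropped from the GROUP BY.

```python
import sqlite3

# Hypothetical data: two orders for alice, one for bob.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (customer TEXT, order_date TEXT);
    INSERT INTO orders VALUES
        ('alice', '2020-01-01'),
        ('alice', '2020-02-01'),
        ('bob',   '2020-01-15');
    """
)

# Grouping by order_date as well yields one row per (customer, order_date),
# which is the kind of "redundant" output the question describes.
per_date = conn.execute(
    "SELECT customer, order_date, COUNT(*) FROM orders "
    "GROUP BY customer, order_date ORDER BY customer, order_date"
).fetchall()

# Aggregating order_date instead collapses each customer to a single row.
per_customer = conn.execute(
    "SELECT customer, MAX(order_date), COUNT(*) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()

print(per_date)
print(per_customer)
```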

In SQL, does the COUNT(*) happen after the JOIN?

I have a query like follows :
SELECT LOCATION_CODE AS "Location",
COUNT(prha.authorization_status) AS "Reqn Lines Count Approved" ,
FROM tabl t1
JOIN table t1 ... etc
JOIN
My question is: suppose that I want to tally up both the counts of something and the "opposite" counts (i.e. counting the nulls and zeros), all within one query.
So I was wondering if this is possible? Or does the COUNT(*) function only occur after we use the JOINs? Thanks.
I'm not sure that I completely understand what you're asking, but recent versions of Oracle do not technically have to perform joins at all if they would not affect the required result.
If you were counting records from a table, and joined to a table against which there was a foreign key constraint, then the optimiser can infer that the join is not required and can omit it.
Furthermore, I seem to recall that the optimiser can also perform aggregations prior to joins as well in some circumstances, if it would be more efficient to do so (for example, if joining between a DW fact table and dimension table, grouping at the atomic level of the dimension and selecting many dimension columns -- the aggregation can be performed on the fact table prior to the join to the dimension, in order to reduce the size of the sort needed on the aggregation).
So while under normal circumstances the join is going to be executed first, in some cases it will not.
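On the "both counts in one query" part of the question, one common approach is conditional aggregation. The sketch below uses Python's built-in sqlite3 module rather than Oracle, and the tables `locations` and `req_lines`, their columns, and their rows are assumptions made up for the example, not the asker's schema. It relies on the fact that COUNT(column) skips NULLs, while a SUM over a CASE expression can count them.

```python
import sqlite3

# Hypothetical schema loosely modelled on the column names in the question.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE locations (id INTEGER, location_code TEXT);
    CREATE TABLE req_lines (location_id INTEGER, authorization_status TEXT);
    INSERT INTO locations VALUES (1, 'NYC'), (2, 'LON');
    INSERT INTO req_lines VALUES
        (1, 'APPROVED'), (1, NULL), (1, 'APPROVED'), (2, NULL);
    """
)

counts = conn.execute(
    """
    SELECT l.location_code,
           COUNT(r.authorization_status) AS with_status,
           SUM(CASE WHEN r.authorization_status IS NULL THEN 1 ELSE 0 END)
               AS without_status
    FROM locations l
    JOIN req_lines r ON r.location_id = l.id
    GROUP BY l.location_code
    ORDER BY l.location_code
    """
).fetchall()

print(counts)
```

Logically, both aggregates are computed over the joined rows, whatever physical plan the optimiser ends up choosing.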

SQL - Using MAX in a WHERE clause

Assume value is an int and the following query is valid:
SELECT blah
FROM table
WHERE attribute = value
Though MAX(expression) returns int, the following is not valid:
SELECT blah
FROM table
WHERE attribute = MAX(expression)
Of course the desired effect can be achieved using a subquery, but my question is why SQL was designed this way. Is there some reason why this sort of thing is not allowed? Students coming from programming languages where you can always replace a value of some data type with a function call that returns that type find this issue confusing. Is there an explanation one can give them rather than just saying "that's the way it is"?
It's just because of the order of operations of a query.
FROM clause
WHERE clause
GROUP BY clause
HAVING clause
SELECT clause
ORDER BY clause
WHERE just filters the rows returned by FROM. An aggregate function like MAX() can't have a result returned because it hasn't even been applied to anything.
That's also the reason why you can't use aliases defined in the SELECT clause in a WHERE clause, but you can use aliases defined in the FROM clause.
A where clause checks every row to see if it matches the conditions specified.
A max computes a single value from a row set. If you put a max, or any other aggregate function, into a where clause, how can SQL Server figure out which rows the max function can use before the where clause has finished filtering?
This deals with the order that SQL Server processes commands in. It runs the WHERE clause before a GROUP BY or any aggregate. Since a where clause runs first, SQL Server can't tell if a row will be included in an aggregate until it processes the where. That is what the HAVING clause is for. HAVING runs after the GROUP BY and the WHERE and can include MAX since you have already filtered out the rows you don't want to use. See http://www.bennadel.com/blog/70-SQL-Query-Order-of-Operations.htm for a good explanation of the order in which SQL commands run.
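The WHERE-then-GROUP BY-then-HAVING order can be seen in a small sketch. This uses Python's built-in sqlite3 module rather than SQL Server, and the `scores` table and its rows are made up for the example.

```python
import sqlite3

# Hypothetical data for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE scores (player TEXT, points INTEGER);
    INSERT INTO scores VALUES
        ('ann', 5), ('ann', 12), ('bob', 7), ('bob', 3), ('cat', 20);
    """
)

# WHERE drops individual rows first (bob's 3 is removed),
# GROUP BY forms one group per player,
# HAVING then filters whole groups using the aggregate.
rows = conn.execute(
    """
    SELECT player, MAX(points)
    FROM scores
    WHERE points > 4
    GROUP BY player
    HAVING MAX(points) >= 10
    ORDER BY player
    """
).fetchall()

print(rows)
```

Here bob survives the WHERE filter with one row, but his group is removed by HAVING because its maximum is below 10.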
Maybe this will work:
SELECT blah
FROM table
WHERE attribute = (SELECT MAX(expression) FROM table1)
The WHERE clause is specifically designed to test conditions against raw data (individual rows of the table). However, MAX is an aggregate function over multiple rows of data. Basically, without a sub-select, the WHERE clause knows nothing about any rows in the table except for the current row. So how can you determine the maximum value over a whole bunch of rows when you don't even know what those rows are?
Yes, it's a little bit of a simplification, especially when dealing with joins, but the same principle applies. WHERE is always row-by-row, so that's all it really knows about.
Even if you have a GROUP BY clause, the WHERE clause still only processes one row at a time in the raw data before grouping. It doesn't know the value of a column in any other rows, so it has no way of knowing which row has the maximum value.
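A quick way to see both behaviours side by side is the sketch below, which uses Python's built-in sqlite3 module as a stand-in engine. The `items` table and its rows are invented, and the exact error message differs between database systems. The bare aggregate in WHERE is rejected, while the sub-select version is accepted because the subquery computes the maximum over its own row set.

```python
import sqlite3

# Hypothetical table mirroring the `blah` / `attribute` names in the question.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE items (blah TEXT, attribute INTEGER);
    INSERT INTO items VALUES ('x', 1), ('y', 3), ('z', 3);
    """
)

# An aggregate directly in WHERE is rejected by the engine.
try:
    conn.execute("SELECT blah FROM items WHERE attribute = MAX(attribute)")
    rejected = False
except sqlite3.Error as exc:
    rejected = True
    print("rejected:", exc)

# The subquery is evaluated as its own query, so WHERE only has to
# compare each row's attribute against a single value.
top = conn.execute(
    """
    SELECT blah FROM items
    WHERE attribute = (SELECT MAX(attribute) FROM items)
    ORDER BY blah
    """
).fetchall()

print(top)
```

Note that the subquery form returns every row that ties for the maximum.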
Assuming this is MS SQL Server, the following would work.
SELECT TOP 1 blah
FROM table
ORDER BY expression DESC