HiveQL - first_value of multiple columns over window

I am looking to retrieve the first row and last row over a window in HiveQL.
I know there are a couple ways to do this:
Use FIRST_VALUE and LAST_VALUE on the columns I am interested in.
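SELECT customer,
FIRST_VALUE(product) over (W) AS product_first,
FIRST_VALUE(time) over (W) AS time_first,
LAST_VALUE(product) over (W) AS product_last,
LAST_VALUE(time) over (W) AS time_last
FROM table
-- Explicit frame: the default frame stops at the current row, so LAST_VALUE would not see the end of the partition
WINDOW W AS (PARTITION BY customer ORDER BY COST
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)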
Calculate ROW_NUMBER() of each row and use a where clause for row_number=1.
WITH table_wRN AS
(
SELECT *,
row_number() over (partition by customer order by cost ASC) rn_B,
row_number() over (partition by customer order by cost DESC) rn_E
FROM table
),
table_first_last AS
(
SELECT *
FROM table_wRN
WHERE (rn_E=1 OR rn_B=1)
)
SELECT table_first.customer,
table_first.product, table_first.time,
table_last.product, table_last.time
FROM table_first_last AS table_first
JOIN table_first_last AS table_last
ON table_first.customer = table_last.customer
-- Filter after the join: one side keeps the first row, the other keeps the last row
WHERE table_first.rn_B = 1 AND table_last.rn_E = 1
My questions:
Does anyone know which of these two is more efficient?
Intuitively, I think the first one should be faster because there is no need for a sub-query or a CTE.
Experimentally, I feel the second is faster but this could be because I am running first_value on a number of columns.
Is there a way to apply first_value and retrieve multiple columns in one shot?
I am looking to reduce the number of times the window is evaluated (something like caching the window).
Example of pseudo-code:
FIRST_VALUE(product,time) OVER (W) AS product_first, time_first
Thank you!

I am almost certain that the first would be more efficient. I mean two window functions versus two window functions, filtering and two joins?
Once you multiply the number of columns, then there might be an issue of which is faster. That said, look at the execution plan. I would expect that all window functions using the same window frame specification would use the same "windows" processing, with just tweaks for each value.
Hive does not have very good support for complex data types such as structs and arrays. In databases that do, it is easy enough to return several columns at once as a single complex value.
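To make the comparison concrete, here is a minimal sketch that runs both approaches on a tiny data set and checks that they return the same first and last rows. It uses Python's built-in sqlite3 module as a stand-in for Hive (a reasonably recent SQLite with window function support is assumed), and the purchases table, its purchase_time column and the sample rows are hypothetical. It says nothing about which approach is faster in Hive; for that, compare the execution plans.

```python
import sqlite3

# Hypothetical sample data; SQLite stands in for Hive here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (customer TEXT, product TEXT, purchase_time TEXT, cost REAL)"
)
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?, ?)",
    [
        ("alice", "pen", "09:00", 2.0),
        ("alice", "book", "10:00", 15.0),
        ("alice", "lamp", "11:00", 40.0),
        ("bob", "mug", "08:30", 7.0),
        ("bob", "desk", "12:15", 120.0),
    ],
)

# Approach 1: one named window shared by every FIRST_VALUE / LAST_VALUE call.
# The explicit frame lets LAST_VALUE see the whole partition.
window_sql = """
SELECT DISTINCT customer,
       FIRST_VALUE(product) OVER w, FIRST_VALUE(purchase_time) OVER w,
       LAST_VALUE(product)  OVER w, LAST_VALUE(purchase_time)  OVER w
FROM purchases
WINDOW w AS (PARTITION BY customer ORDER BY cost
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
ORDER BY customer
"""

# Approach 2: number the rows from both ends, then self-join first to last.
row_number_sql = """
WITH numbered AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY customer ORDER BY cost ASC)  AS rn_b,
           ROW_NUMBER() OVER (PARTITION BY customer ORDER BY cost DESC) AS rn_e
    FROM purchases
)
SELECT f.customer, f.product, f.purchase_time, l.product, l.purchase_time
FROM numbered AS f
JOIN numbered AS l ON f.customer = l.customer
WHERE f.rn_b = 1 AND l.rn_e = 1
ORDER BY f.customer
"""

window_rows = conn.execute(window_sql).fetchall()
row_number_rows = conn.execute(row_number_sql).fetchall()
print(window_rows)
print(row_number_rows)
```

Both queries return one row per customer with the cheapest and the most expensive purchase side by side.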

Related

Why isn't FIRST_VALUE and LAST_VALUE an aggregation function in SQL?

Is there any special reason that SQL only implements FIRST_VALUE and LAST_VALUE as a windowing function instead of an aggregation function? I find it quite common to encounter problems such as "find the item with highest price in each category". While other languages (such as python) provide MIN/MAX functions with keywords such that
max(item_names, key=lambda x: revenue[x])
is possible, in SQL the only way to tackle this problem seems to be:
WITH temp as(
SELECT *, FIRST_VALUE(item_name) OVER(PARTITION BY category ORDER BY revenue) as fv
FROM catalog)
SELECT category, MAX(fv) -- MIN(fv) also OK
FROM temp
GROUP BY category;
Is there a special reason that there is no "aggregation version" of FIRST_VALUE such that
SELECT category, FIRST_VALUE(item_name, revenue)
FROM catalog
GROUP BY
category
or is it just the way it is?
That’s just the way it is, as far as I’m concerned. I suspect the only real answer would be “because it’s not in the SQL spec” and the only people who could really answer as to why it’s not in the spec are the people who write it. Questions of the form “what was (name of relevant external authority) thinking when they mandated that (name of product) should operate like this” are actually typically off topic here because very few people can reliably and factually answer. I don’t even like my own answer here, as it feels like an extended comment on a question that cannot realistically be answered.
Aggregate functions work on sets of data, and while some of them (such as median) might require an implied ordering operation, the functions are always about the column they’re operating on, not “give me the value of this column based on the ordering of that column”.
There are plenty of window/analytic functions that don’t have a corresponding aggregate version, and window functions have a different end-use intent than aggregation. You could conceive that some of them perform aggregation and then join the aggregation result back to the main data in order to relate the agg result to the particular row, but I wouldn’t assume the two facilities (agg vs window) are related at all.
As far as I understand the Python (not a Python dev), it is not doing any aggregation: it's searching a list of item_name strings, looking each one up in a dictionary that returns the revenue for that item, and returning the item_name that has the largest revenue. There wasn't any grouping there. It's much more like a SELECT TOP 1 item_name ORDER BY revenue and is only really good for returning a single item, rather than a load of items that are all maxes within their group, unless it's used within a loop that processes a different list of item names each time.
I know your question wasn't exactly about this particular SQL query but it may be helpful for you if I mention a couple of things on it. I'm not really sure what:
WITH temp as(
SELECT *, FIRST_VALUE(item_name) OVER(PARTITION BY category ORDER BY revenue) as fv
FROM catalog
)
SELECT category, MAX(fv) -- MIN(fv) also OK
FROM temp
GROUP BY category;
Gives you over something like:
SELECT DISTINCT category, FIRST_VALUE(item_name) OVER(PARTITION BY category ORDER BY revenue) as fv
FROM catalog
The analytic/window function will produce the same value for every row in a category (the partition), so all the extra GROUP BY really does is collapse the repeated values. That can be done more simply by selecting the values you want and using DISTINCT to remove the duplicates (one of the few cases where I would advocate it).
In the more general sense of "I want the entire most X row as determined by highest/lowest Y" we typically use row number for that:
WITH temp as(
SELECT *, ROW_NUMBER() OVER(PARTITION BY category ORDER BY revenue) as rn
FROM catalog)
SELECT *
FROM temp
WHERE rn = 1;
Though I find it more compact/readable to dispense with the CTE and just use a subquery, but YMMV.
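The two patterns discussed above can be tried side by side. This is a small sketch using Python's sqlite3 module rather than any particular server; the catalog rows are made up, and ORDER BY revenue DESC is an assumption chosen so that the highest-revenue item in each category wins.

```python
import sqlite3

# Hypothetical catalog data for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE catalog (category TEXT, item_name TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO catalog VALUES (?, ?, ?)",
    [
        ("fruit", "apple", 30.0),
        ("fruit", "mango", 55.0),
        ("tools", "hammer", 80.0),
        ("tools", "saw", 65.0),
    ],
)

# ROW_NUMBER in a subquery: keeps the entire top row of each category.
top_rows = conn.execute(
    """
    SELECT category, item_name, revenue
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY category ORDER BY revenue DESC) AS rn
        FROM catalog
    ) AS ranked
    WHERE rn = 1
    ORDER BY category
    """
).fetchall()

# DISTINCT + FIRST_VALUE: returns only the one column asked for.
top_names = conn.execute(
    """
    SELECT DISTINCT category,
           FIRST_VALUE(item_name) OVER (PARTITION BY category ORDER BY revenue DESC) AS fv
    FROM catalog
    ORDER BY category
    """
).fetchall()

print(top_rows)
print(top_names)
```

The ROW_NUMBER form is the one to reach for when more than one column of the winning row is needed.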

Window functions in view called with "where" gives bad execution plan

I have a view that uses LAG.
CREATE VIEW V_ImportedReadingDay2
AS
SELECT
ID,
PlacementID,
LAG(Reading, 1) OVER (PARTITION BY MeterNumber ORDER BY Date) AS Val
FROM dbo.ImportedReadingDay
If I query the view with a WHERE clause, the execution plan is much worse than when I run the underlying query directly with the same filter.
SELECT
ID,
PlacementID,
LAG(Reading, 1) OVER (PARTITION BY MeterNumber ORDER BY Date) AS Val
FROM dbo.ImportedReadingDay
WHERE (PlacementID = 12404)
SELECT *
FROM V_ImportedReadingDay2
WHERE (PlacementID = 12404)
This is a known problem. You can google the problem.
I have found two solutions. Either use a table valued function or move the LAG outside of the view.
BUT I'd like to know if there are any other solutions, since neither of these works for me: I have to use the view in client software.
Your two queries aren't logically the same. So, of course, they don't get the same execution plan.
Consider these queries:
select name,LAG(column_id) OVER (ORDER BY system_type_id) as cid
from sys.columns
where name='name'
select * from (
select name,LAG(column_id) OVER (ORDER BY system_type_id) as cid
from sys.columns
) t
where name='name'
Because of the logical processing order of queries, the WHERE clause is processed before the SELECT clause. So, for the first query, we first filter the sys.columns table to only retrieve rows with a particular name, and then we apply the LAG() function just on this filtered set (so, the lagged value will definitely come from another row which matches the filter).
For the second query, we first (logically) process the subquery. We're performing the LAG() function across the whole set of rows (because the subquery doesn't have any filters/WHERE clause) and then (in the outer query) we're filtering the set of rows. Importantly, that means that the lagged value may have been pulled from a row which doesn't match the final filter.
Well, when you use a view, it's similar to my second query. The value of Val retrieved when you use your view is not guaranteed to be from a row with a PlacementID equal to 12404.
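The difference between filtering before and after the window function is easy to reproduce. The sketch below uses Python's sqlite3 module instead of SQL Server, with a hypothetical readings table and view whose names and rows are invented for illustration. Only the results are compared, not the execution plans.

```python
import sqlite3

# Hypothetical stand-in for dbo.ImportedReadingDay: one meter, two placements.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings ("
    "id INTEGER, placement_id INTEGER, meter_number TEXT, reading_date TEXT, reading INTEGER)"
)
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?, ?, ?)",
    [
        (1, 10, "M1", "2024-01-01", 100),
        (2, 20, "M1", "2024-01-02", 110),
        (3, 10, "M1", "2024-01-03", 125),
    ],
)
conn.execute(
    """
    CREATE VIEW v_readings AS
    SELECT id, placement_id,
           LAG(reading, 1) OVER (PARTITION BY meter_number ORDER BY reading_date) AS val
    FROM readings
    """
)

# WHERE runs before LAG: the previous row is the previous row with placement 10.
filter_first = conn.execute(
    """
    SELECT id, LAG(reading, 1) OVER (PARTITION BY meter_number ORDER BY reading_date) AS val
    FROM readings
    WHERE placement_id = 10
    ORDER BY id
    """
).fetchall()

# LAG runs inside the view over every row; the filter is applied afterwards.
filter_last = conn.execute(
    "SELECT id, val FROM v_readings WHERE placement_id = 10 ORDER BY id"
).fetchall()

print(filter_first)  # row 3 lags to row 1 (reading 100)
print(filter_last)   # row 3 lags to row 2 (reading 110), which has placement 20
```

Because the two queries can return different values, the engine cannot simply push the filter below the window function when the view is used.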
This was a simplified view just for this example.
In the real one I partition the LAG.
I found out that partitioning the LAG by the same column that is used in the WHERE clause (in this case PlacementID) solved the performance issue.
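Here is a small sketch of why that helps, again with Python's sqlite3 module and invented table, view and row names. Once the window is partitioned by the filtered column, rows from other placements can no longer influence the result, so filtering before or after the LAG gives the same rows. Whether a given engine then actually pushes the filter into the view is up to its optimizer and is not shown here.

```python
import sqlite3

# Hypothetical readings; two placements interleaved on the same meter.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings ("
    "id INTEGER, placement_id INTEGER, meter_number TEXT, reading_date TEXT, reading INTEGER)"
)
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?, ?, ?)",
    [
        (1, 10, "M1", "2024-01-01", 100),
        (2, 20, "M1", "2024-01-02", 110),
        (3, 10, "M1", "2024-01-03", 125),
    ],
)

# The view partitions by the same column the caller filters on.
conn.execute(
    """
    CREATE VIEW v_readings_by_placement AS
    SELECT id, placement_id,
           LAG(reading, 1) OVER (PARTITION BY placement_id ORDER BY reading_date) AS val
    FROM readings
    """
)

through_view = conn.execute(
    "SELECT id, val FROM v_readings_by_placement WHERE placement_id = 10 ORDER BY id"
).fetchall()

direct = conn.execute(
    """
    SELECT id, LAG(reading, 1) OVER (PARTITION BY placement_id ORDER BY reading_date) AS val
    FROM readings
    WHERE placement_id = 10
    ORDER BY id
    """
).fetchall()

print(through_view)
print(direct)
```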

Does ROW_NUMBER queried data require to fetch the entire result set?

I'm trying to write a query that avoids OFFSET (I have just learnt that OFFSET fetches all the skipped rows, which causes performance overhead) by using the ROW_NUMBER window function instead. For instance:
SELECT id
FROM(
SELECT id, ROW_NUMBER() over (order by id) rn
FROM users) sq
WHERE rn > 1000
Does it require all rows to be fetched, as it would with OFFSET 1000? I mean, does it make sense to use such a query instead of
SELECT id
FROM users
OFFSET 1000
? Do I get performance improvement on large amount of data?
Check out the window function docs. Window functions operate on the result set, after the fetch:
Window functions are permitted only in the SELECT list and the ORDER
BY clause of the query. They are forbidden elsewhere, such as in GROUP
BY, HAVING and WHERE clauses. This is because they logically execute
after the processing of those clauses. Also, window functions execute
after regular aggregate functions. This means it is valid to include
an aggregate function call in the arguments of a window function, but
not vice versa.
Does it make sense to use the row_number() query? Well, it produces the same result set. However, the query basically has to assign row_number() to all the rows in order to find the ones that meet the requirement.
The second query, however, is lacking an order by. When using offset, you should have an order by:
SELECT id
FROM users u
ORDER BY id
OFFSET 1000
I would imagine that this is more efficient than using row_number(), but actual timings would demonstrate that.
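As a quick check that the two forms return the same rows once both are ordered, here is a sketch using Python's sqlite3 module with a hypothetical users table of 20 ids and an offset of 15. SQLite requires a LIMIT before OFFSET, so LIMIT -1 (no limit) is used; in PostgreSQL the bare OFFSET shown above is enough. The sketch compares results only and makes no claim about timings.

```python
import sqlite3

# Hypothetical users table with ids 1..20.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO users (id) VALUES (?)", [(i,) for i in range(1, 21)])

# ROW_NUMBER form: every row is numbered, then the first 15 are filtered out.
via_row_number = [
    row[0]
    for row in conn.execute(
        """
        SELECT id
        FROM (SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS rn FROM users) AS sq
        WHERE rn > 15
        ORDER BY id
        """
    )
]

# OFFSET form with an explicit ORDER BY (LIMIT -1 is SQLite syntax for "no limit").
via_offset = [
    row[0]
    for row in conn.execute("SELECT id FROM users ORDER BY id LIMIT -1 OFFSET 15")
]

print(via_row_number)
print(via_offset)
```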

Calculating SQL Server ROW_NUMBER() OVER() for a derived table

In some other databases (e.g. DB2, or Oracle with ROWNUM), I can omit the ORDER BY clause in a ranking function's OVER() clause. For instance:
ROW_NUMBER() OVER()
This is particularly useful when used with ordered derived tables, such as:
SELECT t.*, ROW_NUMBER() OVER()
FROM (
SELECT ...
ORDER BY
) t
How can this be emulated in SQL Server? I've found people using this trick, but that's wrong, as it will behave non-deterministically with respect to the order from the derived table:
-- This order here ---------------------vvvvvvvv
SELECT t.*, ROW_NUMBER() OVER(ORDER BY (SELECT 1))
FROM (
SELECT TOP 100 PERCENT ...
-- vvvvv ----redefines this order here
ORDER BY
) t
A concrete example (as can be seen on SQLFiddle):
SELECT v, ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
FROM (
SELECT TOP 100 PERCENT 1 UNION ALL
SELECT TOP 100 PERCENT 2 UNION ALL
SELECT TOP 100 PERCENT 3 UNION ALL
SELECT TOP 100 PERCENT 4
-- This descending order is not maintained in the outer query
ORDER BY 1 DESC
) t(v)
Also, I cannot reuse any expression from the derived table to reproduce the ORDER BY clause in my case, as the derived table might not be available as it may be provided by some external logic.
So how can I do it? Can I do it at all?
The Row_Number() OVER (ORDER BY (SELECT 1)) trick should NOT be seen as a way to avoid changing the order of underlying data. It is only a means to avoid causing the server to perform an additional and unneeded sort (it may still perform the sort but it's going to cost the minimum amount possible when compared to sorting by a column).
All queries in SQL server ABSOLUTELY MUST have an ORDER BY clause in the outermost query for the results to be reliably ordered in a guaranteed way.
The concept of "retaining original order" does not exist in relational databases. Tables and queries must always be considered unordered until and unless an ORDER BY clause is specified in the outermost query.
You could try the same unordered query 100,000 times and always receive it with the same ordering, and thus come to believe you can rely on said ordering. But that would be a mistake, because one day, something will change and it will not have the order you expect. One example is when a database is upgraded to a new version of SQL Server--this has caused many a query to change its ordering. But it doesn't have to be that big a change. Something as little as adding or removing an index can cause differences. And more: Installing a service pack. Partitioning a table. Creating an indexed view that includes the table in question. Reaching some tipping point where a scan is chosen instead of a seek. And so on.
Do not rely on results to be ordered unless you have said "Server, ORDER BY".

What is the result set ordering when using window functions that have `order by` components?

I'm working on a query on the SEDE:
select top 20
row_number() over(order by "percentage approved" desc, approved desc),
row_number() over(order by "total edits" asc),
*
from editors
where "total edits" > 30
What is the ordering of the result set, taking into account the two window functions?
I suspect it's undefined but couldn't find a definitive answer. OTOH, results from queries with one such window function were ordered according to the over(order by ...) clause.
The results can be returned in any order.
Now, they will often be returned in the same order as specified in the OVER clause, but this is just because SQL Server is likely to pick a query plan that sorts the rows to calculate the aggregate. This is by no means guaranteed, as it could pick a different query plan at any time, especially as you make your query more complex which extends the space of possible query plans.
The ordering of the result set of ANY SQL Server query that doesn't have an explicit ORDER BY is undefined.
This includes when you have window functions within the query, or an ORDER BY in a subquery. The result order will depend on a lot of factors, none of which are guaranteed unless you specify an ORDER BY.
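A minimal sketch of the fix, using Python's sqlite3 module rather than SQL Server and an invented editors table: the two window functions each carry their own ORDER BY, and the outer ORDER BY is the only thing that decides the order of the returned rows. The column names and rows are hypothetical.

```python
import sqlite3

# Hypothetical editors data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE editors (name TEXT, pct_approved REAL, total_edits INTEGER)")
conn.executemany(
    "INSERT INTO editors VALUES (?, ?, ?)",
    [("ann", 90.0, 50), ("ben", 75.0, 40), ("cy", 95.0, 120)],
)

ranked = conn.execute(
    """
    SELECT name,
           ROW_NUMBER() OVER (ORDER BY pct_approved DESC) AS rank_by_approval,
           ROW_NUMBER() OVER (ORDER BY total_edits ASC)   AS rank_by_edits
    FROM editors
    WHERE total_edits > 30
    ORDER BY rank_by_approval  -- the outer ORDER BY fixes the result order
    """
).fetchall()

print(ranked)
```

Swapping the outer ORDER BY to rank_by_edits reorders the rows without changing either computed rank.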