Impala max() over a window clause - sql

I have a query that looks like this:
SELECT name,
time,
MAX(number) OVER (PARTITION BY name
ORDER BY time
ROWS BETWEEN 10 PRECEDING AND CURRENT ROW)
FROM some_table
For some reason, aggregating over a fixed window isn't implemented for MAX(), as I get the following error:
'max(number)' is only supported with an UNBOUNDED PRECEDING start bound
(Replacing MAX with SUM works as one would expect.)
Is there a workaround for this? I would also appreciate a rough explanation for why this works for SUM or COUNT but not MAX or MIN.
I'm currently using Impala 2.7.0.

I ran into the same problem.
Try Hive instead of Impala. It doesn't have the issue.
Vincent
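If switching engines is not an option, one possible workaround (a sketch, not from the original answers; the table and column names follow the question) is to number the rows per name, self-join over a bounded row-number range, and take the MAX over the joined rows:
WITH numbered AS (
  SELECT name, time, number,
         ROW_NUMBER() OVER (PARTITION BY name ORDER BY time) AS rn
  FROM some_table
)
SELECT a.name,
       a.time,
       MAX(b.number) AS max_number   -- max over the current row and the 10 preceding rows
FROM numbered a
JOIN numbered b
  ON a.name = b.name
 AND b.rn BETWEEN a.rn - 10 AND a.rn
GROUP BY a.name, a.time, a.rn;
As for why SUM and COUNT work: they can be updated incrementally as the frame slides (the value leaving the frame is simply subtracted), whereas MAX and MIN have no inverse operation, so a bounded start would force a rescan of the frame - presumably why Impala only implements the UNBOUNDED PRECEDING case for them.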

Related

ERROR: Aggregate window functions with an ORDER BY clause require a frame clause

I am getting an 'ERROR: Aggregate window functions with an ORDER BY clause require a frame clause' message when entering the following query on Redshift. Please help - I am trying to view the growth of members from day 1 until today. Thanks.
select date(timestampregistered), count(distinct(memberid)),
(SUM(count(distinct(memberid))) OVER (ORDER BY date(timestampregistered)))
AS total_users
from table
order by date(timestampregistered);
You have a couple of things going on. First, you seem to be missing a GROUP BY clause, which is needed for the proper operation of COUNT() by date.
Next, you need to specify the range of "counts" you want to SUM() over. Specifically, you want to sum the counts for previous dates up to and including the current row's date, but not later dates.
select date(timestampregistered), count(distinct(memberid)),
(SUM(count(distinct(memberid))) OVER (ORDER BY date(timestampregistered) ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW))
AS total_users
from table
group by date(timestampregistered)
order by date(timestampregistered);

When can aggregate functions be nested in standard SQL?

I know it wasn't allowed in SQL-92. But since then it may have changed, particularly when there's a window applied. Can you explain the changes and give the version (or versions if there were more) in which they were introduced?
Examples
Is SUM(COUNT(votes.option_id)) OVER() valid syntax per standard SQL:2016 (or earlier)?
This is my comment (unanswered, and probably unlikely to be answered in such an old question) on Why can you nest aggregate functions when using a window function in PostgreSQL?.
The Calculating Running Total (SQL) kata at Codewars has this one as its most upvoted solution (it runs on PostgreSQL 13.0, a highly standard-compliant engine, so the code is likely to be standard):
SELECT
CREATED_AT::DATE AS DATE,
COUNT(CREATED_AT) AS COUNT,
SUM(COUNT(CREATED_AT)) OVER (ORDER BY CREATED_AT::DATE ROWS UNBOUNDED PRECEDING)::INT AS TOTAL
FROM
POSTS
GROUP BY
CREATED_AT::DATE
(Which could be simplified to:
SELECT
created_at::DATE date,
COUNT(*) COUNT,
SUM(COUNT(*)) OVER (ORDER BY created_at::DATE)::INT total
FROM posts
GROUP BY created_at::DATE
I assume the ::s are a new syntax for casting I didn't know of. And that casting from TIMESTAMP to DATE is now allowed (in SQL-92 it wasn't).)
As this SO answer explains, Oracle Database allows it even without a window, pulling in the GROUP BY from context. I don't know if the standard allows it.
You already noticed the difference yourself: It's all about the window. COUNT(*) without an OVER clause for instance is an aggregation function. COUNT(*) with an OVER clause is a window function.
By using aggregation functions you condense the rows you get after the FROM and WHERE clauses are applied, either down to the groups specified in GROUP BY or down to a single row in the absence of a GROUP BY clause.
Window functions, aka analytic functions, are applied afterwards. They don't change the number of result rows, but merely add information by looking at all or some rows (the window) of the selected data.
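A minimal illustration of that difference (the table name orders is just a placeholder):
-- Aggregate: collapses all rows into a single result row.
SELECT COUNT(*) AS total_orders
FROM orders;
-- Window: keeps every row and attaches the total as an extra column.
SELECT id,
       COUNT(*) OVER () AS total_orders
FROM orders;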
In
SELECT
options.id,
options.option_text,
COUNT(votes.option_id) as vote_count,
COUNT(votes.option_id) / SUM(COUNT(votes.option_id)) OVER() * 100.0 as vote_percentage
FROM options
LEFT JOIN votes on options.id = votes.option_id
GROUP BY options.id;
we first join votes to options and then count the votes per option by aggregating the joined rows down to one result row per option (GROUP BY options.id). We count on a non-nullable column of the votes table (COUNT(votes.option_id)), so we get a zero count in case there are no votes, because in an outer-joined row this column is set to null.
After aggregating all rows and thus getting one row per option, we apply a window function (SUM() OVER) to this result set. We apply the analytic SUM to the vote count (SUM(COUNT(votes.option_id))) while looking at the whole result set (empty OVER clause), thus getting the same total vote count in every row. We use this value for a calculation: the option's vote count divided by the total vote count times 100, which is the option's percentage of the total votes.
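The same evaluation order can be made explicit by splitting the query into two steps - aggregate first, window afterwards. This is just a sketch of an equivalent form using the same tables (option_text is added to the GROUP BY so the derived table is valid without relying on a primary key):
-- Step 1 (inner query): aggregate the joined rows to one row per option.
-- Step 2 (outer query): apply the window SUM over those aggregated rows.
SELECT id,
       option_text,
       vote_count,
       vote_count / SUM(vote_count) OVER () * 100.0 AS vote_percentage
FROM (
    SELECT options.id,
           options.option_text,
           COUNT(votes.option_id) AS vote_count
    FROM options
    LEFT JOIN votes ON options.id = votes.option_id
    GROUP BY options.id, options.option_text
) AS per_option;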
The PostgreSQL query is very similar. We select the number of posts per date (COUNT(created_at) is effectively nothing more than a plain COUNT(*) here) along with a running total of these counts (by using a window that looks at all rows up to the current row).
So, while this looks like we are nesting two aggregate functions, this is not really the case, because SUM OVER is not considered an aggregation function but an analytic/window function.
Oracle does allow applying an aggregate function directly to another, thus invoking a final aggregation on a previously grouped-by aggregation. This allows us to get one result row of, say, the average of sums without having to write a subquery for it. This is not compliant with the SQL standard, however, and is quite unpopular even among Oracle developers.
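For illustration, the Oracle-only form and its standard-compliant rewrite look roughly like this (a sketch using the classic EMP table):
-- Oracle: the inner SUM(sal) is grouped by deptno, the outer AVG then
-- aggregates those per-department sums into a single result row.
SELECT AVG(SUM(sal))
FROM emp
GROUP BY deptno;
-- Standard-compliant equivalent using a derived table:
SELECT AVG(dept_total)
FROM (SELECT SUM(sal) AS dept_total
      FROM emp
      GROUP BY deptno) t;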

Adding a Simple Subtraction Formula to a Case When Statement

I'm writing a SQL script where I'd like to add a subtraction formula. My problem is that when I add this CASE statement, my script will not run. I read that you need to add parentheses around the formula, which I did, and when I do this without the CASE WHEN it works great. Can you just not use formulas within a case statement?
In my statement below, I have a column, TotalWeightLoss, which is a cumulative total of the weight lost by a person. So what I am trying to do is see the monthly weight lost instead of a cumulative total.
SELECT *
,case when rownmbr=1 then TotalWeightLoss else (TotalWeightLoss - LAG(TotalWeightLoss) OVER (PARTITION BY AccountNumber ORDER BY ProcessDate, ProcessDate)) AS AmountLost
from cte;"))
Thanks!
Besides the missing END: assuming that rownmbr is based on a ROW_NUMBER calculation, this can be simplified to
TotalWeightLoss
- LAG(TotalWeightLoss,1,0) -- LAG supports a default for a missing value
OVER (PARTITION BY AccountNumber
ORDER BY ProcessDate) AS AmountLost
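Put back into a full statement, the corrected query would look roughly like this (a sketch; the CTE and column names are taken from the question):
-- LAG(..., 1, 0) returns 0 for the first row per account, so no CASE is needed.
SELECT *,
       TotalWeightLoss
       - LAG(TotalWeightLoss, 1, 0) OVER (PARTITION BY AccountNumber
                                          ORDER BY ProcessDate) AS AmountLost
FROM cte;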

Redshift SQL - Running Sum using Unbounded Preceding and Following

When we use a window function to calculate a running sum, like SUM(sales) over (partition by dept order by date), if we don't specify the range/window, is the default setting between unbounded preceding and current row, basically from the first row up to the current row?
According to this doc it seems to be the case, but I wanted to double check.
Thanks!
The problem you are running into is 'what does the database engine assume in ambiguous circumstances?' I've run into this exact case before when porting from SQL Server to Redshift - SQL Server assumes that if you order but don't specify a frame, you want unbounded preceding to current row. Other DBs do not make the same assumption - if you don't specify a frame it will be unbounded preceding to unbounded following, and yet others will throw an error if you specify an ORDER BY but don't specify a frame. Bottom line - don't let the DB engine guess what you want; be specific.
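For example, spelling the frame out removes the ambiguity entirely (a sketch; the table name sales_table is hypothetical, the columns follow the question):
-- Explicit frame: running sum per dept in date order, first row through current row.
SELECT dept,
       date,
       sales,
       SUM(sales) OVER (PARTITION BY dept
                        ORDER BY date
                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_sales
FROM sales_table;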
Gordon is correct that this is based on rows, not ranges. If you want a running sum by date (not row), you can group by date and run the window function - windows execute after group by in a single query.

Cumulative count for calculating daily frequency using SQL query (in Amazon Redshift)

I have a dataset that contains 'UI' (unique id), time, and frequency (the frequency for a given value in the UI column), as shown here:
What I would like is to add a new column named 'daily_frequency' which simply counts each unique value in the UI column for a given day sequentially, as I show in the image below.
For example, if UI=114737 and it is repeated 2 times in one day, we should have 1 and 2 in the daily_frequency column.
I could do that with Python and the pandas package using the groupby and cumcount methods as follows ...
df['daily_frequency'] = df.groupby(['UI','day']).cumcount()+1
However, for some reason, I must do this via SQL queries (Amazon Redshift).
I think you want a running count, which could be calculated as:
COUNT(*) OVER (PARTITION BY ui, TRUNC(time) ORDER BY time
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS daily_frequency
Although Salman's answer seems to be correct, I think ROW_NUMBER() is simpler:
ROW_NUMBER() OVER (PARTITION BY ui, time::date
ORDER BY time
) AS daily_frequency
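Either expression goes inside an ordinary SELECT, e.g. (a sketch; the table name my_table is hypothetical):
-- ROW_NUMBER() restarts at 1 for each (ui, day) partition, giving the running daily count.
SELECT ui,
       time,
       ROW_NUMBER() OVER (PARTITION BY ui, time::date
                          ORDER BY time) AS daily_frequency
FROM my_table;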