Redshift SQL - Running Sum using Unbounded Preceding and Following - sql

When we use a window function to calculate a running sum, like SUM(sales) OVER (PARTITION BY dept ORDER BY date), and we don't specify the range/window, is the default frame BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, basically from the first row until the current row?
According to this doc it seems to be the case, but I wanted to double check.
Thanks!

The problem you are running into is 'what does the database engine assume in ambiguous circumstances?' I've run into this exact case before when porting from SQL Server to Redshift - SQL Server assumes that if you order but don't specify a frame, you want UNBOUNDED PRECEDING to CURRENT ROW. Other databases do not make the same assumption - if you don't specify a frame, the window is UNBOUNDED PRECEDING to UNBOUNDED FOLLOWING - and yet others will throw an error if you specify an "order by" but don't specify a frame. Bottom line - don't let the DB engine guess what you want; be specific.
Gordon is correct that this is based on rows, not ranges. If you want a running sum by date (not row), you can group by date and run the window function - window functions are evaluated after GROUP BY in a single query.
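For instance, a minimal sketch of spelling the frame out explicitly (the table and column names here are illustrative, not from the question):
-- Running sum per department, with the frame stated explicitly so the
-- engine never has to guess (illustrative table/column names).
SELECT dept,
       sale_date,
       sales,
       SUM(sales) OVER (PARTITION BY dept
                        ORDER BY sale_date
                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_sum
FROM daily_sales;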

Related

Window function - N preceding and unbounded question

Say I create a window function and specify:
ROWS BETWEEN 10 PRECEDING AND CURRENT ROW
How does the window function treat the first 9 rows? Does it only calculate up to however many rows above it are available?
I couldn't find this in SQL Server's documentation, but I could find it in Postgres, and I believe it is standardised [1]:
In any case, the distance to the end of the frame is limited by the distance to the end of the partition, so that for rows near the partition ends the frame might contain fewer rows than elsewhere.
(My emphasis)
[1] Have also searched MySQL documentation to no avail; this Q is just tagged sql so it should be based on the standard, but I can't find any downloadable drafts of those at the moment either.
It does the computation considering the 10 rows prior to the current row and the current row, for the given partition window. For example, if you want to sum up a number based on the last 3 years and the current year, you can do SUM(amount) OVER (ORDER BY year ASC ROWS BETWEEN 3 PRECEDING AND CURRENT ROW).
To answer your question "Does it only calculate up to however many rows above it are available?" - yes, it considers only those rows which are available.
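As a quick sketch of that behaviour (illustrative table and column names), a COUNT(*) over the same frame shows how many rows actually made it into each window:
-- Near the start of the partition the frame simply holds fewer rows:
-- row 1 sees only itself, row 2 sees two rows, and from row 11 onward
-- the frame holds the full 11 rows (10 preceding + current).
SELECT order_date,
       amount,
       SUM(amount) OVER (ORDER BY order_date
                         ROWS BETWEEN 10 PRECEDING AND CURRENT ROW) AS rolling_sum,
       COUNT(*) OVER (ORDER BY order_date
                      ROWS BETWEEN 10 PRECEDING AND CURRENT ROW) AS rows_in_frame
FROM orders;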

Get last value of a column in SQL Server

I want to get the last value of a column (it is not an identity column) and increment it by the corresponding generated row number.
Select isnull(LAST_VALUE(ColumnA) over(order by ColumnA), 0) +
ROW_NUMBER() OVER (ORDER BY ColumnA)
from myTable
I am calling my stored procedure recursively, which is why I thought of this logic, but it is not working.
I basically wanted 1-9 for the first run, 10-19 for the second run (if it is called recursively 2 times), and so on.
Total stab in the dark, but I suspect "not working" means "returning the current row's value." Don't forget that an OVER clause defaults to the window RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW when one isn't explicitly specified and there is an ORDER BY (see SELECT - OVER Clause (Transact-SQL) - ORDER BY).
ORDER BY
ORDER BY *order_by_expression* [COLLATE *collation_name*] [ASC|DESC]
Defines the logical order of the rows within each partition of the result set. That is, it specifies the logical order in which the window function calculation is performed.
If it is not specified, the default order is ASC and window function will use all rows in partition.
If it is specified, and a ROWS/RANGE is not specified, then default RANGE UNBOUNDED PRECEDING AND CURRENT ROW is used as default for window frame by the functions that can accept optional ROWS/RANGE specification (for example min or max).
As you haven't defined the window, that's what your LAST_VALUE function is using. Define that you want the whole lot for the partition:
SELECT ISNULL(LAST_VALUE(ColumnA) OVER (ORDER BY ColumnA ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), 0) +
ROW_NUMBER() OVER (ORDER BY ColumnA)
FROM dbo.myTable;
Though what Gordon says in their comment is the real solution:
You should be using an identity column or sequence.
This type of solution can (and will) end up suffering from race conditions, as well as end up reusing "identities" when it shouldn't.
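For reference, a minimal sketch of the identity-based approach Gordon suggests (hypothetical table definition, SQL Server syntax):
-- Let the engine assign the numbers instead of computing them yourself;
-- this avoids the race conditions described above.
CREATE TABLE dbo.myTable
(
    ColumnA INT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    Payload NVARCHAR(100) NOT NULL
);

INSERT INTO dbo.myTable (Payload)
VALUES (N'first row'), (N'second row');   -- ColumnA becomes 1, 2, ... automatically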

Cumulative count for calculating daily frequency using SQL query (in Amazon Redshift)

I have a dataset that contains 'UI' (unique id), time, and frequency (the frequency for a given value in the UI column), as shown here:
What I would like is to add a new column named 'daily_frequency' which simply counts each unique value in the UI column for a given day sequentially, as I show in the image below.
For example, if UI=114737 is repeated 2 times in one day, we should have 1 and 2 in the daily_frequency column.
I could do that with Python and the Pandas package, using the groupby and cumcount methods as follows ...
df['daily_frequency'] = df.groupby(['UI','day']).cumcount()+1
However, for some reason, I must do this via SQL queries (Amazon Redshift).
I think you want a running count, which could be calculated as:
COUNT(*) OVER (PARTITION BY ui, TRUNC(time) ORDER BY time
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS daily_frequency
Although Salman's answer seems to be correct, I think ROW_NUMBER() is simpler:
ROW_NUMBER() OVER (PARTITION BY ui, time::date
                   ORDER BY time
                   ) AS daily_frequency
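Put together as a complete query (hypothetical table name; the ui and time columns are from the question), both forms produce the same 1, 2, 3, ... counter per UI per day:
-- Redshift: number each UI's rows within a calendar day, in time order.
SELECT ui,
       time,
       COUNT(*) OVER (PARTITION BY ui, TRUNC(time)
                      ORDER BY time
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS daily_frequency_running_count,
       ROW_NUMBER() OVER (PARTITION BY ui, time::date
                          ORDER BY time) AS daily_frequency_row_number
FROM events_table
ORDER BY ui, time;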

RANGE BETWEEN function with Timestamp values in SQL Impala

I am trying to calculate a moving number of events (impressions) that happen per minute. How can I use a RANGE BETWEEN frame with timestamp values to define the 1-minute interval?
I have something like this:
count(impression) over (partition by user
ORDER BY trunc(cast(entrytime as TIMESTAMP), "MI")
RANGE BETWEEN interval 1 minutes Preceding
and interval 1 minutes Following) as densityperminute
but this doesn't seem to work. Any ideas on how to fix this?
I believe that's not supported, unfortunately. From the documentation for 6.1:
Currently, Impala supports only some combinations of arguments to the
RANGE clause:
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (the default when
ORDER BY is specified and the window clause is omitted)
RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
Source
(Forgive me for answering an old question, but I'm currently looking into this for a school project and this came up in my search!)
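One possible workaround, not from the original thread, is to sidestep RANGE entirely: collapse the impressions to one row per user per minute first, then apply a ROWS frame over those per-minute rows. A rough sketch (the table name is hypothetical, the column names follow the question, and it assumes every minute in the interval of interest actually has a row - otherwise the ROWS frame and the intended one-minute RANGE diverge):
WITH per_minute AS (
    -- Step 1: count impressions per user per minute bucket.
    SELECT user,
           TRUNC(CAST(entrytime AS TIMESTAMP), 'MI') AS minute_bucket,
           COUNT(impression) AS impressions_in_minute
    FROM impressions
    GROUP BY user, TRUNC(CAST(entrytime AS TIMESTAMP), 'MI')
)
-- Step 2: a ROWS frame of one minute-row on each side approximates
-- RANGE BETWEEN INTERVAL 1 MINUTE PRECEDING AND INTERVAL 1 MINUTE FOLLOWING.
SELECT user,
       minute_bucket,
       SUM(impressions_in_minute) OVER (PARTITION BY user
                                        ORDER BY minute_bucket
                                        ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS densityperminute
FROM per_minute;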

Impala max() over a window clause

I have a query that looks like this:
SELECT name,
time,
MAX(number) OVER (PARTITION BY name
ORDER BY time
ROWS BETWEEN 10 PRECEDING AND CURRENT ROW)
FROM some_table
For some reason, aggregating over a fixed window isn't implemented for MAX(), as I get the following error:
'max(number)' is only supported with an UNBOUNDED PRECEDING start bound
(Replacing MAX with SUM works as one would expect.)
Is there a workaround for this? I would also appreciate a rough explanation for why this works for SUM or COUNT but not MAX or MIN.
I'm currently using Impala 2.7.0.
I ran into the same problem.
Try Hive instead of Impala. It doesn't have the issue.
Vincent
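A likely reason for the asymmetry: a sliding SUM or COUNT can be maintained incrementally by adding the value that enters the frame and subtracting the one that leaves it, whereas a sliding MAX or MIN cannot "subtract" a departing value (if the departing row held the maximum, the whole frame has to be rescanned), and Impala appears to implement only the running, UNBOUNDED PRECEDING case for MIN/MAX. If switching engines isn't an option, one possible workaround that stays in Impala (a sketch, not from the thread) is to express the moving maximum as a self-join on row numbers:
WITH numbered AS (
    SELECT name,
           time,
           number,
           ROW_NUMBER() OVER (PARTITION BY name ORDER BY time) AS rn
    FROM some_table
)
-- For each row, join to the 10 previous rows plus itself and take the max.
SELECT cur.name,
       cur.time,
       MAX(prev.number) AS max_number_last_10
FROM numbered cur
JOIN numbered prev
  ON  prev.name = cur.name
  AND prev.rn BETWEEN cur.rn - 10 AND cur.rn
GROUP BY cur.name, cur.time, cur.rn;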