Window function - N preceding and unbounded question - sql

Say I create a window function and specify:
ROWS BETWEEN 10 PRECEDING AND CURRENT ROW
How does the window function treat the first 9 rows? Does it only calculate up to however many rows above it are available?

I couldn't find this in SQL Server's documentation, but I could find it in the Postgres documentation, and I believe it is standardised¹:
In any case, the distance to the end of the frame is limited by the distance to the end of the partition, so that for rows near the partition ends the frame might contain fewer rows than elsewhere.
(My emphasis)
¹ I have also searched the MySQL documentation to no avail. This question is just tagged sql, so the answer should be based on the standard, but I can't find any downloadable drafts of that at the moment either.

It does the computation considering the 10 rows prior to the current row plus the current row itself, within the given partition window. For example, if you want to sum an amount over the last 3 years plus the current year, you can write SUM(amount) OVER (ORDER BY year ASC ROWS BETWEEN 3 PRECEDING AND CURRENT ROW).
To answer your question "Does it only calculate up to however many rows above it are available?": yes, it considers only those rows that are available.
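You can see the shrinking frame directly by counting the rows in it. A minimal sketch (runs on Postgres, since that is the documentation quoted above; the generate_series values are just illustrative):

-- rows_in_frame is 1 for the first row, 2 for the second, and only
-- reaches the full 11 (10 preceding + the current row) from row 11 onwards.
SELECT n,
       COUNT(*) OVER (ORDER BY n ROWS BETWEEN 10 PRECEDING AND CURRENT ROW) AS rows_in_frame,
       SUM(n)   OVER (ORDER BY n ROWS BETWEEN 10 PRECEDING AND CURRENT ROW) AS running_sum
FROM generate_series(1, 15) AS t(n);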

Related

SQL LAG function

I tried using the LAG function to calculate the value of the previous week, but there are gaps in the data because certain weeks are missing.
This is the table:
The problem is that the LAG function takes the previous week found in the table. But I would like it to be zero when the previous week in the table is not the immediately preceding week.
This is what I would like it to be:
I'm open to any solutions.
Thank you in advance
Your example data is baffling. You have multiple rows per time frame. The first column looks like a string, which doesn't really make sense for the comparison.
So, let me answer based on a simpler data model. The answer is to use RANGE. If you had an integer column that specified the time frame:
ordering  sales
1         10
2         20
3         30
5         50
Then you would phrase this as:
select max(sales) over (order by ordering range between 1 preceding and 1 preceding)
This would return the value from the "previous" row as defined by the first column. The value would be in a separate column, not a separate row.
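Putting that together with the requirement that a missing week yield zero, a sketch (assuming a database that supports RANGE with a numeric offset, such as Postgres; the names sales_by_week, week_num, and sales are hypothetical, not from the original post):

-- prev_week_sales is 0 whenever week_num - 1 has no row: the RANGE
-- frame is then empty, MAX() returns NULL, and COALESCE turns it into 0.
SELECT week_num,
       sales,
       COALESCE(MAX(sales) OVER (ORDER BY week_num
                                 RANGE BETWEEN 1 PRECEDING AND 1 PRECEDING),
                0) AS prev_week_sales
FROM sales_by_week;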

Redshift - How to SUM number over last 4 weeks as a window function per row?

Is it possible to SUM a number over a specific time period in Amazon Redshift with a window function?
As an example I'm counting login numbers for different companies per day.
What I now want, per row, is a sum of the logins over the last 4 weeks (relative to the date of the row). The field I'm searching for is marked yellow in the screenshot.
Thanks in advance for your help.
If you have data for each day, then you can use rows:
select t.*,
       sum(logs) over (partition by company
                       order by date
                       rows between 27 preceding and current row
                      ) as logins_4_weeks
from t;
Redshift does not yet support range for the window frame, so this is your best bet.
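If some days can be missing, one workaround is to densify the data against a calendar first, so that the ROWS frame really does cover 28 calendar days. A sketch under that assumption (calendar(day) is a hypothetical pre-built date table, and logins(company, date, logins) stands in for the asker's table):

-- Left-join every company against every calendar day so the window
-- always sees exactly one row per day (0 logins where none exist).
with daily as (
    select c.day,
           co.company,
           coalesce(l.logins, 0) as logins
    from calendar c
    cross join (select distinct company from logins) co
    left join logins l
           on l.date = c.day and l.company = co.company
)
select day,
       company,
       sum(logins) over (partition by company
                         order by day
                         rows between 27 preceding and current row
                        ) as logins_4_weeks
from daily;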

Redshift SQL - Running Sum using Unbounded Preceding and Following

When we use a window function to calculate a running sum, like SUM(sales) OVER (PARTITION BY dept ORDER BY date), and we don't specify the frame, is the default BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, i.e. from the first row up to the current row?
According to this doc it seems to be the case, but I wanted to double check.
Thanks!
The problem you are running into is: what does the database engine assume in ambiguous circumstances? I've run into this exact case before when porting from SQL Server to Redshift. SQL Server assumes that if you order but don't specify a frame, you want UNBOUNDED PRECEDING to CURRENT ROW. Other databases don't make the same assumption: if you don't specify a frame, some will use UNBOUNDED PRECEDING to UNBOUNDED FOLLOWING, and yet others will throw an error if you specify an ORDER BY but no frame. Bottom line: don't let the DB engine guess what you want; be specific.
Gordon is correct that this is based on rows, not ranges. If you want a running sum by date (not by row), you can group by date and then apply the window function; window functions execute after GROUP BY within a single query.
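For example, spelling the frame out removes the guesswork entirely. A sketch using the column names from the question (sales_table is a hypothetical table name):

-- Explicit frame: no engine-specific default to worry about.
SELECT dept,
       date,
       sales,
       SUM(sales) OVER (PARTITION BY dept
                        ORDER BY date
                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
                       ) AS running_sum
FROM sales_table;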

Calculate averages of previous 7 rows SQL

Consider the following result set returned from a stored procedure:
The goal with the IHD column is to do a calculation over the previous 6 rows (days) to determine an IHD value from within the stored procedure.
In this case, only from row 7 onwards will there be an IHD value, since the calculation needs to take into consideration the previous 6 days' closing balances plus the current day (day 7) and calculate an average. Basically, it needs to use rows 1 to 7 for row 7's IHD value. Then, to calculate row 8's IHD value, it needs to use rows 2 to 8.
I have had a look at the SQL LAG function, but it only returns the value of a single previous row, and I am not quite sure whether I could successfully use it in a self-referencing CTE where an average over more than one previous row is required.
How should I approach this scenario?
Use ROWS BETWEEN. Without consumable sample data and expected results I can only give pseudo-SQL, but this will put you on the right path:
AVG({Your Column}) OVER ([PARTITION BY {Other Column}] ORDER BY {Column To Order BY}
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
Obviously replace the parts in braces ({}) and remove the parts in brackets ([]) if not required.
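To also satisfy the requirement that rows 1 to 6 carry no IHD value, one option is to gate the average on how many rows have been seen so far. A sketch (balances, close_date, and closing_balance are hypothetical names standing in for the stored procedure's result set):

-- Emit the average only once a full 7-row window exists; earlier rows get NULL.
SELECT close_date,
       closing_balance,
       CASE WHEN ROW_NUMBER() OVER (ORDER BY close_date) >= 7
            THEN AVG(closing_balance) OVER (ORDER BY close_date
                                            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
       END AS ihd
FROM balances;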

How can I calculate moving sum / average on Google BigQuery?

Analyzing trends in data series with too much volatility is hard. In many cases it is useful to apply smoothing techniques such as moving averages or moving sums. There are a lot of tools for this type of operation, but when we are talking about millions of rows it is useful to do it directly in a cloud environment such as Google BigQuery.
My question is: how can I calculate a moving sum/average on Google BigQuery?
Below is a figure of the moving average I want to achieve:
Below is for BigQuery Standard SQL
#standardSQL
SELECT
  pickup_date,
  number_of_trip,
  AVG(number_of_trip) OVER (ORDER BY day RANGE BETWEEN 6 PRECEDING AND CURRENT ROW) AS mov_avg_7d,
  AVG(number_of_trip) OVER (ORDER BY day RANGE BETWEEN 27 PRECEDING AND CURRENT ROW) AS mov_avg_28d
FROM (
  SELECT
    DATE(pickup_datetime) AS pickup_date,
    UNIX_DATE(DATE(pickup_datetime)) AS day,
    COUNT(*) AS number_of_trip
  FROM `nyc-tlc.yellow.trips`
  GROUP BY 1, 2
)
WHERE pickup_date > '2013-01-01'
At first glance this answer looks very similar to the OP's answer, so just a few comments about how it is different:
First (and least important): it is written in BigQuery Standard SQL, which the BigQuery team highly recommends using unless one has a really good reason to use Legacy SQL, for example because of table range snapshots or something else very specific to Legacy SQL.
Secondly, and most important: using OVER with ROWS in this context is not the best option, because it counts rows and not days. So if any given day happens to be missing, the calculation will span the last 8 and 29 days respectively (instead of 7 and 28).
In such cases one should use OVER with RANGE.
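A minimal illustration of the difference (a hypothetical four-day table, not from the original answer, where 2021-01-03 is missing):

#standardSQL
-- For 2021-01-05, the ROWS frame pulls in 2021-01-02 (a 4-day span),
-- while the RANGE frame covers exactly the last 3 calendar days, of
-- which the missing 2021-01-03 simply contributes nothing.
WITH trips AS (
  SELECT DATE '2021-01-01' AS d, 10 AS n UNION ALL
  SELECT DATE '2021-01-02', 20 UNION ALL
  SELECT DATE '2021-01-04', 40 UNION ALL
  SELECT DATE '2021-01-05', 50
)
SELECT d,
       SUM(n) OVER (ORDER BY d ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS rows_sum,
       SUM(n) OVER (ORDER BY UNIX_DATE(d) RANGE BETWEEN 2 PRECEDING AND CURRENT ROW) AS range_sum
FROM trips;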
I spent a lot of time researching this without success, so I thought it would be worth sharing with more people.
Solution: I used BigQuery's analytic functions, OVER with ROWS (https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#analytic-function-syntax). Below is an example of a 7-day moving average and a 28-day moving average of taxi trips, using public data available in BigQuery:
SELECT
  pickup_date,
  number_of_trip,
  AVG(number_of_trip) OVER (ORDER BY pickup_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS mov_avg_7d,
  AVG(number_of_trip) OVER (ORDER BY pickup_date ROWS BETWEEN 27 PRECEDING AND CURRENT ROW) AS mov_avg_28d
FROM (
  SELECT
    DATE(pickup_datetime) AS pickup_date,
    COUNT(*) AS number_of_trip
  FROM [nyc-tlc:yellow.trips]
  GROUP EACH BY 1
  ORDER BY 1)
WHERE pickup_date > '2013-01-01'
Be careful with anti-patterns! There are many posts online that suggest solutions using JOIN or even CROSS JOIN to achieve the same result. However, these methods are anti-patterns according to the BigQuery documentation (https://cloud.google.com/bigquery/docs/best-practices-performance-patterns). That means that for large amounts of data, performance will be an issue if you solve the problem by brute force.