Analyzing trends in a data series with too much volatility is hard. In many cases it is useful to apply smoothing techniques such as moving averages or moving sums. There are a lot of tools for this type of operation, but when we are talking about millions of rows it is useful to do it directly in a cloud environment such as Google BigQuery.
My question is: how can I calculate a moving sum/average in Google BigQuery?
Below is a figure of the moving average I want to achieve:
Below is for BigQuery Standard SQL
#standardSQL
SELECT
  pickup_date,
  number_of_trip,
  AVG(number_of_trip) OVER (ORDER BY day RANGE BETWEEN 6 PRECEDING AND CURRENT ROW) AS mov_avg_7d,
  AVG(number_of_trip) OVER (ORDER BY day RANGE BETWEEN 27 PRECEDING AND CURRENT ROW) AS mov_avg_28d
FROM (
  SELECT
    DATE(pickup_datetime) AS pickup_date,
    UNIX_DATE(DATE(pickup_datetime)) AS day,
    COUNT(*) AS number_of_trip
  FROM `nyc-tlc.yellow.trips`
  GROUP BY 1, 2
)
WHERE pickup_date > '2013-01-01'
At first glance this answer looks very similar to the OP's answer, so just a few comments about how this answer is different:
First (and least important): it is written in BigQuery Standard SQL, which the BigQuery team highly recommends using, unless one has a really good reason to use Legacy SQL, for example because of table-decorator range snapshots or something else very specific to Legacy SQL.
Secondly, and most important: using OVER with ROWS in this context is not the best option, because it counts rows and not days, so if, by chance, any given day is missing from the data, the calculation will span the last 8 and 29 days respectively (instead of 7 and 28).
In such cases one should use OVER with RANGE, as the sketch below illustrates.
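A minimal sketch of the difference (the three-row series here is made-up data with a gap at day 3; ROWS looks back two rows, RANGE looks back two days):
#standardSQL
WITH t AS (
  SELECT 1 AS day, 10 AS trips UNION ALL
  SELECT 2, 20 UNION ALL
  SELECT 4, 40  -- day 3 is missing
)
SELECT
  day,
  trips,
  SUM(trips) OVER (ORDER BY day ROWS  BETWEEN 2 PRECEDING AND CURRENT ROW) AS rows_sum,   -- day 4: 10+20+40 = 70
  SUM(trips) OVER (ORDER BY day RANGE BETWEEN 2 PRECEDING AND CURRENT ROW) AS range_sum  -- day 4: 20+40 = 60 (days 2-4 only)
FROM t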
I spent a lot of time researching this answer without success, so I thought it would be worth sharing it with more people.
Solution: To arrive at the answer I used BigQuery's analytic functions, OVER with ROWS (https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#analytic-function-syntax). Below is an example of a 7-day and a 28-day moving average of taxi trips, using public data available in BigQuery:
SELECT
  pickup_date,
  number_of_trip,
  AVG(number_of_trip) OVER (ORDER BY pickup_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS mov_avg_7d,
  AVG(number_of_trip) OVER (ORDER BY pickup_date ROWS BETWEEN 27 PRECEDING AND CURRENT ROW) AS mov_avg_28d
FROM (
  SELECT
    DATE(pickup_datetime) AS pickup_date,
    COUNT(*) AS number_of_trip
  FROM [nyc-tlc:yellow.trips]
  GROUP EACH BY 1
  ORDER BY 1
)
WHERE pickup_date > '2013-01-01'
Be careful with anti-patterns! There are many posts online that suggest solutions using JOIN or even CROSS JOIN to achieve the same result. However, these methods are anti-patterns according to the BigQuery documentation (https://cloud.google.com/bigquery/docs/best-practices-performance-patterns). That means that for large amounts of data, performance will be an issue if you solve the problem with brute force.
Related
Say I create a window function and specify:
ROWS BETWEEN 10 PRECEDING AND CURRENT ROW
How does the window function treat the first 9 rows? Does it only calculate up to however many rows above it are available?
I couldn't find this documented in SQL Server's documentation, but I could find it in Postgres, and I believe it is standardised[1]:
In any case, the distance to the end of the frame is limited by the distance to the end of the partition, so that for rows near the partition ends the frame might contain fewer rows than elsewhere.
(My emphasis)
[1] I have also searched the MySQL documentation to no avail. This Q is just tagged sql, so it should be based on the standard, but I can't find any downloadable drafts of the standard at the moment either.
It does the computation considering the 10 rows prior to the current row plus the current row, for the given partition window. For example, if you want to sum an amount based on the last 3 years and the current year, you can do SUM(amount) OVER (ORDER BY year ASC ROWS BETWEEN 3 PRECEDING AND CURRENT ROW).
To answer your question "Does it only calculate up to however many rows above it are available?": yes, it considers only those rows which are available.
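A small demo of this behavior (a sketch in generic SQL; the VALUES derived table is an assumption and its syntax varies slightly between RDBMSs):
-- Near the start of the partition there are fewer than 10 preceding rows,
-- so the frame simply contains whatever rows exist.
SELECT n,
       SUM(n) OVER (ORDER BY n ROWS BETWEEN 10 PRECEDING AND CURRENT ROW) AS running_sum
FROM (VALUES (1), (2), (3)) AS t(n);
-- n = 1 -> running_sum = 1 (frame holds 1 row)
-- n = 2 -> running_sum = 3 (frame holds 2 rows)
-- n = 3 -> running_sum = 6 (frame holds 3 rows)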
I have two years' worth of data that I'm summing up. For instance, the table columns are:
Date | Ingredient_cost_Amount | Cost_Share_amount
I'm looking at two years' worth of data, for 2012 and 2013.
I want to roll up all the totals so I have only two rows: one row for 2012 and one row for 2013. How do I write a SQL statement that will look at the dates but display only the 4-digit year instead of the 8-digit daily date? I suspect the sum piece of it will be taken care of by summing the columns with calculations, so I'm really looking for help on how to transpose a daily date to a 4-digit year.
Help is greatly appreciated.
SELECT DATEPART(year, [Date]) AS [Year],
       SUM(Ingredient_cost_Amount) AS Ingredient_Total,
       SUM(Cost_Share_amount) AS Cost_Share_Total
FROM #table
GROUP BY DATEPART(year, [Date])
Define a range/grouping table.
Something similar to the following should work in most RDBMSs:
SELECT Grouping.id, SUM(Ingredient.ingredient_cost_amount) AS Ingredient_Cost_Amount,
SUM(Ingredient.cost_share_amount) AS Cost_Share_Amount
FROM (VALUES (2013, DATE('2013-01-01'), DATE('2014-01-01')),
(2012, DATE('2012-01-01'), DATE('2013-01-01'))) Grouping(id, gStart, gEnd)
JOIN Ingredient
ON Ingredient.date >= Grouping.gStart
AND Ingredient.date < Grouping.gEnd
GROUP BY Grouping.id
(DATE() and related conversion functions are heavily DB-dependent. Some RDBMSs don't support using VALUES this way, although there are other ways to create the virtual grouping table.)
See this blog post for why I used an exclusive upper bound for the range.
Using a range table this way will potentially allow the db to use indices to help with the aggregation. How much this helps depends on a bunch of other factors, like the specific RDBMS used.
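For example, a supporting index might look like the following (a sketch; the index name is made up, and the INCLUDE clause is SQL Server specific, so adjust for your RDBMS):
-- Date-leading index so each group's range predicate can become an index seek
CREATE INDEX IX_Ingredient_Date
    ON Ingredient ([date])
    INCLUDE (ingredient_cost_amount, cost_share_amount);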
Let's say I have a table UserActivity in SQL Server 2012 with two columns:
ActivityDateTime
UserID
I want to calculate the number of distinct users with any activity in a 30-day period (my monthly active users) on a daily basis. (So I have a 30-day window that increments one day at a time.) How do I do this efficiently using window functions in SQL Server?
The output would look like this:
Date,NumberActiveUsersInPrevious30Days
01-01-2010,13567
01-02-2010,14780
01-03-2010,13490
01-04-2010,15231
01-05-2010,15321
01-06-2010,14513
...
SQL Server doesn't support COUNT(DISTINCT ...) OVER (), nor does it support a numeric offset (such as 30 PRECEDING) in conjunction with RANGE.
I wouldn't bother trying to coerce window functions into doing this. Because of the COUNT(DISTINCT UserID) requirement, it is always going to have to re-examine the entire 30-day window for each date.
You can create a calendar table with a row for each date and use
SELECT C.Date,
NumberActiveUsersInPrevious30Days
FROM Calendar C
CROSS APPLY (SELECT COUNT(DISTINCT UserID)
FROM UserActivity
WHERE ActivityDateTime >= DATEADD(DAY, -30, C.[Date])
AND ActivityDateTime < C.[Date]) CA(NumberActiveUsersInPrevious30Days)
WHERE C.Date BETWEEN '2010-01-01' AND '2010-01-06'
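If you don't already have a Calendar table, one hypothetical way to populate it is with a recursive CTE (SQL Server syntax; the date range here is an assumption, so widen it to cover your data):
-- Build one row per date and materialize it as the Calendar table
WITH Dates AS (
    SELECT CAST('2010-01-01' AS date) AS [Date]
    UNION ALL
    SELECT DATEADD(DAY, 1, [Date]) FROM Dates WHERE [Date] < '2010-12-31'
)
SELECT [Date]
INTO Calendar
FROM Dates
OPTION (MAXRECURSION 366);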
Option 1: For (while) loop through each day, selecting 30 days backward for each day (obviously quite slow).
Option 2: A separate table with a row for each day and join on the original table (again quite slow).
Option 3: Recursive CTEs or stored procs (still not doing much better).
Option 4: For (while) loop in combination with cursors (efficient, but requires some advanced SQL knowledge). With this solution you will step through each day and each row in order and keep track of the average (you'll need some sort of wrap-around array to know what value to subtract when a day moves out of range).
Option 5: Option 4's approach implemented in a general-purpose / scripting programming language (C++ / Java / PHP) (easy to do with basic knowledge of one of those languages, efficient).
Some related questions.
Finding the start and end time for adjacent records that have the same value?
I have a table that contains heart rate readings (in beats per minute) and a datetime field. (Actually the fields are heartrate_id, heartrate, and datetime.) The data are generated by a device that records the heart rate and time every 6 seconds. Sometimes the heart rate monitor will give false readings and the recorded beats per minute will "stick" for a period of time. By "sticks", I mean the beats-per-minute value will be identical in adjacent times.
Basically I need to find all the records where the heart rate is the same (e.g. 5 beats per minute, 100 beats per minute, etc.), but only on adjacent records. If the device records 25 beats per minute for 3 consecutive readings (or 100 consecutive readings), I need to locate these events. The results need to have the heart rate, the time the heart rate started, and the time the heart rate ended; ideally the results would look more or less like this:
heartrate  starttime  endtime
---------  ---------  --------
1.00       21:12:00   21:12:24
35.00      07:00:12   07:00:36
I've tried several different approaches but so far I'm striking out. Any help would be greatly appreciated!
EDIT:
Upon review, none of my original work on this answer was very good. This actually belongs to the class of problems known as gaps-and-islands, and this revised answer will use information I've gleaned from similar questions/learned since first answering this question.
It turns out this query can be done a lot more simply than I originally thought:
WITH Grouped_Run AS (
    SELECT heartRate, dateTime,
           ROW_NUMBER() OVER (ORDER BY dateTime) -
           ROW_NUMBER() OVER (PARTITION BY heartRate ORDER BY dateTime) AS groupingId
    FROM HeartRate
)
SELECT heartRate, MIN(dateTime), MAX(dateTime)
FROM Grouped_Run
GROUP BY heartRate, groupingId
HAVING COUNT(*) > 2
SQL Fiddle Demo
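To experiment with the query, here is some hypothetical sample data matching the question's description (the schema and values are assumptions based on the field names given):
-- Five "stuck" readings at 1.00, six seconds apart, then one normal reading
CREATE TABLE HeartRate (heartRate DECIMAL(5,2), dateTime DATETIME);
INSERT INTO HeartRate (heartRate, dateTime) VALUES
    (1.00,  '2013-01-01 21:12:00'),
    (1.00,  '2013-01-01 21:12:06'),
    (1.00,  '2013-01-01 21:12:12'),
    (1.00,  '2013-01-01 21:12:18'),
    (1.00,  '2013-01-01 21:12:24'),
    (70.00, '2013-01-01 21:12:30');
-- The query above returns one row, (1.00, 21:12:00, 21:12:24); the single
-- 70.00 reading is excluded by HAVING COUNT(*) > 2.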
So what's happening here? One of the defining features of gaps-and-islands problems is the need to identify "groups" of consecutive values (or the lack thereof). Sequences are often generated to solve this, exploiting an often-overlooked fact: subtracting two sequences that advance in step yields a constant value within each run.
For example, imagine the following sequences, and the subtraction (the values in the rows are unimportant):
position  positionInGroup  subtraction
======================================
1         1                0
2         2                0
3         3                0
4         1                3
5         2                3
6         1                5
7         4                3
8         5                3
position is a simple sequence generated over all records.
positionInGroup is a simple sequence generated for each set of records sharing the same value. In this case there are actually 3 different values, first appearing at position = 1, 4, and 6; positions 7 and 8 return to the first value, so their positionInGroup continues that set at 4 and 5.
subtraction is the result of the difference between the other two columns. Note that its values may repeat for different groups (here, 3 appears both for the run at positions 4-5 and the run at positions 7-8)!
One of the key properties the sequences must share is that they must be generated over the rows of data in the same order, or this breaks.
So how does the SQL accomplish this? Through ROW_NUMBER(), a function that generates a sequence of numbers over a "window" of records:
ROW_NUMBER() OVER(ORDER BY dateTime)
will generate the position sequence.
ROW_NUMBER() OVER(PARTITION BY heartRate ORDER BY dateTime)
will generate the positionInGroup sequence, with each heartRate being a different group.
In most queries of this type, the actual values of the two sequences are unimportant; it's the subtraction (which yields the sequence group) that matters, so we just need the result of the subtraction.
We'll also need the heartRate and the times in which they occurred to provide the answer.
The original question asked for the start and end times of each of the "runs" of stuck heartbeats. That's a standard MIN(...)/MAX(...), which means a GROUP BY. We need to use both the original heartRate column (because that's a non-aggregated column) and our generated groupingId (which identifies the current "run" per stuck value).
Part of the question asked for only runs that repeated three or more times. The HAVING COUNT(*) > 2 is an instruction to ignore runs of length 2 or less; it counts rows per-group.
I recommend Itzik Ben-Gan's articles on interval packing, which apply to your adjacency problem:
tsql-challenge-packing-date-and-time-intervals
solutions-to-packing-date-and-time-intervals-puzzle
Suppose I have a table which has all the billing records. Now I want to see the sales trend for a user-given time duration, grouped by each 3 days. What should be the SQL query for this?
Please help, otherwise I am gone ...
I can only give a vague suggestion given the question as asked, but you may want to have a derived column with a standardised date (as per the MS date format, just a number per day) on which you can use integer division by 3, so that all days in the same 3-day period get the same value. You can then group and aggregate over this column to get the totals for each 3-day period. Obviously, to display the date nicely you would have to multiply back and convert the column as well.
Again I'm not sure of the specifics, but I think this general idea could be used to get a result (it may well not be the best way, so it would help to add more detail to the question).
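Here is a minimal sketch of that idea, assuming a hypothetical Billing(bill_date, amount) table and SQL Server-style date functions:
-- Integer division of a day number by 3 assigns every 3 consecutive days
-- to the same bucket; multiplying back gives a displayable period start.
WITH Buckets AS (
    SELECT amount,
           DATEDIFF(DAY, '2000-01-01', bill_date) / 3 AS bucket
    FROM Billing
)
SELECT DATEADD(DAY, bucket * 3, '2000-01-01') AS period_start,
       SUM(amount) AS total_sales
FROM Buckets
GROUP BY bucket
ORDER BY period_start;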