Related
I need some help with this problem.
Assuming I have following table:
contract_id
tariff_id
product_category
date (DD.MM.YYYY)
month (YYYYMM)
123456
ABC
small
01.01.2021
202101
123456
ABC
medium
01.02.2021
202102
123456
DEF
small
01.03.2021
202103
123456
DEF
small
01.04.2021
202104
123456
ABC
big
01.05.2021
202105
123456
DEF
small
01.06.2021
202106
123456
DEF
medium
02.06.2021
202106
123456
DEF
medium
01.07.2021
202107
The table is partitioned by month.
This is a part of my table containing multiple contract_ids.
I'm trying to figure out for every contract_id, since when it has its most recent tariff_id and since when it has the product_category_id='small' (if it doesn't have small as product category, the value should then be Null).
The results will be written into a table which gets updated every month.
So for the table above my latest results should look like this:
contract_id
same_tariff_id_since
product_category_small_since
123456
01.06.2021
NULL
I'm using Hive.
So far, I could only come up with this solution for same_tariff_id_since:
The problem is that it gives me absolute min(date) for the tariff_id and not the min(date) since the most recent tariff_id.
I think the code for product_category_small_since will have mostly the same logic.
My current code is:
SELECT q2.contract_id
, q3.tariff_id
, q2.date
FROM (
SELECT contract_id
, max(date_2) AS date
FROM (
SELECT contract_id
, date
, min(date) OVER (PARTITION BY tariff_id ORDER BY date) AS date_2
FROM given_table
)q1
WHERE date=date_2
GROUP BY contract_id
)q2
JOIN given_table AS q3
ON q2.contract_id=q3.contract_id
AND q2.date=q3.date
Thanks in advance.
One approach for solving this type of query is to do a grouping of the sequences you want to track. For the tariff_id sequence grouping, you want a new "sequence grouping id" for each time that the tariff id changes for a given contract id. Since the product_category can change independently, you need to do a sequence grouping id for that change as well.
Here's code to accomplish the task. This only returns the latest version of each contract and the specific columns you described in your latest results table. This was done against PostgreSQL 9.6, but the syntax and data types can probably be modified to be compatible with Hive.
https://www.db-fiddle.com/f/qSk3Mb9Xfp1NDo5VeA1qHh/8
select q2.contract_id
, to_char(min(q2."date (DD.MM.YYYY)")
over (partition by q2.contract_id, q2.contract_tariff_sequence_id), 'DD.MM.YYYY') as same_tariff_id_since
, to_char(min(case when q2.product_category = 'small' then q2."date (DD.MM.YYYY)" else null end)
over (partition by q2.contract_id, q2.contract_product_category_sequence_id), 'DD.MM.YYYY') as product_category_small_since
from(
select q1.*
, sum(case when q1.tariff_id = q1.prior_tariff_id then 0 else 1 end)
over (partition by q1.contract_id order by q1."date (DD.MM.YYYY)" rows unbounded preceding) as contract_tariff_sequence_id
, sum(case when q1.product_category = q1.prior_product_category then 0 else 1 end)
over (partition by q1.contract_id order by q1."date (DD.MM.YYYY)" rows unbounded preceding) as contract_product_category_sequence_id
from (
select *
, lag(tariff_id) over (partition by contract_id order by "date (DD.MM.YYYY)") as prior_tariff_id
, lag(product_category) over (partition by contract_id order by "date (DD.MM.YYYY)") as prior_product_category
, row_number() over (partition by contract_id order by "date (DD.MM.YYYY)" desc) latest_record_per_contract
from contract_tariffs
) q1
) q2
where latest_record_per_contract = 1
If you want to see all the rows and columns so you can examine how this works with the sequence grouping ids etc., you can modify the outer query slightly:
https://www.db-fiddle.com/f/qSk3Mb9Xfp1NDo5VeA1qHh/10
If this works for you, please mark as correct answer.
I have the table structure as below
product_id
Period
Sales
Profit
x1
L13
$100
$10
x1
L26
$200
$20
x1
L52
$300
$30
x2
L13
$500
$110
x2
L26
$600
$120
x2
L52
$700
$130
I want to pivot the period column over and have the sales value and profit in those columns. I need a table like below.
product_id
SALES_L13
SALES_L26
SALES_L52
PROFIT_L13
PROFIT_L26
PROFIT_L52
x1
$100
$200
$300
$10
$20
$30
x2
$500
$600
$700
$110
$120
$130
I am using the snowflake to write the queries. I tried using the pivot function of snowflake but there I can only specify one aggregation function.
Can anyone help as how I can achieve this solution ?
Any help is appreciated.
Thanks
How about we stack sales and profit before we pivot? I'll leave it up to you to fix the column names that I messed up.
with cte (product_id, period, amount) as
(select product_id, period||'_profit', profit from t
union all
select product_id, period||'_sales', sales from t)
select *
from cte
pivot(max(amount) for period in ('L13_sales','L26_sales','L52_sales','L13_profit','L26_profit','L52_profit'))
as p (product_id,L13_sales,L26_sales,L52_sales,L13_profit,L26_profit,L52_profit);
If you wish to pivot period twice for sales and profit, you'll need to duplicate the column so you have one for each instance of pivot. Obviously, this will create nulls due to duplicate column still being present after the first pivot. To handle that, we can use max in the final select. Here's what the implementation looks like
select product_id,
max(L13_sales) as L13_sales,
max(L26_sales) as L26_sales,
max(L52_sales) as L52_sales,
max(L13_profit) as L13_profit,
max(L26_profit) as L26_profit,
max(L52_profit) as L52_profit
from (select *, period as period2 from t) t
pivot(max(sales) for period in ('L13','L26','L52'))
pivot(max(profit) for period2 in ('L13','L26','L52'))
as p (product_id, L13_sales,L26_sales,L52_sales,L13_profit,L26_profit,L52_profit)
group by product_id;
At this point, it's an eye soar. You might as well use conditional aggregation or better yet, handle pivoting inside the reporting application. A more compact alternative of conditional aggregation uses decode
select product_id,
max(decode(period,'L13',sales)) as L13_sales,
max(decode(period,'L26',sales)) as L26_sales,
max(decode(period,'L52',sales)) as L52_sales,
max(decode(period,'L13',profit)) as L13_profit,
max(decode(period,'L26',profit)) as L26_profit,
max(decode(period,'L52',profit)) as L52_profit
from t
group by product_id;
Using conditional aggregation:
SELECT product_id
,SUM(CASE WHEN Period = 'L13' THEN Sales END) AS SALES_L13
,SUM(CASE WHEN Period = 'L26' THEN Sales END) AS SALES_L26
,SUM(CASE WHEN Period = 'L52' THEN Sales END) AS SALES_L52
,SUM(CASE WHEN Period = 'L13' THEN Profit END) AS PROFIT_L52
,SUM(CASE WHEN Period = 'L26' THEN Profit END) AS PROFIT_L52
,SUM(CASE WHEN Period = 'L52' THEN Profit END) AS PROFIT_L52
FROM tab
GROUP BY product_id
I'm not 100% happy with this answer ... pretty sure someone can improve on this approach.
Basically PIVOTING an ARRAY ... the list of aggregation functions available to an ARRAY is not huge ... there's just one ARRAY_AGG. And PIVOT only supposed to support AVG, COUNT, MAX, MIN, and SUM. So this shouldn't work ... it does as I think PIVOT just requires an aggregation of some sorts.
I'd recommend aggregating your metrics PRIOR to constructing the ARRAY ... but does let you pivot multiple Metrics at once - which from reading Stack Overflow shouldn't be possible!
Copy|Paste|Run| .. and IMPROVE please :-)
WITH CTE AS( SELECT 'X1' PRODUCT_ID,'L13' PERIOD,100 SALES,10 PROFIT
UNION SELECT 'X1' PRODUCT_ID,'L26' PERIOD,200 SALES,20 PROFIT
UNION SELECT 'X1' PRODUCT_ID,'L52' PERIOD,300 SALES,30 PROFIT
UNION SELECT 'X2' PRODUCT_ID,'L13' PERIOD,500 SALES,110 PROFIT
UNION SELECT 'X2' PRODUCT_ID,'L26' PERIOD,600 SALES,120 PROFIT
UNION SELECT 'X2' PRODUCT_ID,'L52' PERIOD,700 SALES,130 PROFIT)
SELECT
PRODUCT_ID
,"'L13'"[0][0] SALES_L13
,"'L13'"[0][1] PROFIT_L13
,"'L26'"[0][0] SALES_L26
,"'L26'"[0][1] PROFIT_L26
,"'L52'"[0][0] SALES_L52
,"'L52'"[0][1] PROFIT_L52
FROM
(SELECT * FROM
(
SELECT PRODUCT_ID, PERIOD,ARRAY_CONSTRUCT(SALES,PROFIT) S FROM CTE)
PIVOT (ARRAY_AGG(S) FOR PERIOD IN ('L13','L26','L52')
)
)
Example with aggregations (added 1700,1130 to L52 X2)
WITH CTE AS(
SELECT 'X1' PRODUCT_ID,'L13' PERIOD,100 SALES,10 PROFIT
UNION SELECT 'X1' PRODUCT_ID,'L26' PERIOD,200 SALES,20 PROFIT
UNION SELECT 'X1' PRODUCT_ID,'L52' PERIOD,300 SALES,30 PROFIT
UNION SELECT 'X2' PRODUCT_ID,'L13' PERIOD,500 SALES,110 PROFIT
UNION SELECT 'X2' PRODUCT_ID,'L26' PERIOD,600 SALES,120 PROFIT
UNION SELECT 'X2' PRODUCT_ID,'L52' PERIOD,700 SALES,130 PROFIT
UNION SELECT 'X2' PRODUCT_ID,'L52' PERIOD,1700 SALES,1130 PROFIT)
SELECT
PRODUCT_ID
,"'L13'"[0][0] SALES_L13
,"'L13'"[0][1] PROFIT_L13
,"'L26'"[0][0] SALES_L26
,"'L26'"[0][1] PROFIT_L26
,"'L52'"[0][0] SALES_L52
,"'L52'"[0][1] PROFIT_L52
FROM
(SELECT * FROM
(
SELECT PRODUCT_ID, PERIOD,ARRAY_CONSTRUCT(SUM(SALES),SUM(PROFIT)) S FROM CTE GROUP BY 1,2)
PIVOT (ARRAY_AGG(S) FOR PERIOD IN ('L13','L26','L52')
)
)
Heres an alternative form using OBJECT_AGG with LATERAL FLATTEN that avoids the potential support issue of PIVOT with ARRAY_AGG proposed by Adrian White.
This should work for any aggregates on multiple input columns included within the initial ARRAY_CONSTRUCT in the OBJ_TALL CTE. I expect that the conditional aggregation option with CASE statements would be faster but you'd need to test at scale to see.
-- OBJECT FORM USING LATERAL FLATTEN
WITH CTE AS(
SELECT 'X1' PRODUCT_ID,'L13' PERIOD,100 SALES,10 PROFIT
UNION SELECT 'X1' PRODUCT_ID,'L26' PERIOD,200 SALES,20 PROFIT
UNION SELECT 'X1' PRODUCT_ID,'L52' PERIOD,300 SALES,30 PROFIT
UNION SELECT 'X2' PRODUCT_ID,'L13' PERIOD,500 SALES,110 PROFIT
UNION SELECT 'X2' PRODUCT_ID,'L26' PERIOD,600 SALES,120 PROFIT
UNION SELECT 'X2' PRODUCT_ID,'L52' PERIOD,700 SALES,130 PROFIT
UNION SELECT 'X2' PRODUCT_ID,'L52' PERIOD,1700 SALES,1130 PROFIT)
,OBJ_TALL AS ( SELECT PRODUCT_ID,
OBJECT_CONSTRUCT(PERIOD,
ARRAY_CONSTRUCT( SUM(SALES)
,SUM(PROFIT)
)
) S
FROM CTE
GROUP BY PRODUCT_ID, PERIOD)
SELECT * FROM OBJ_TALL;
,OBJ_WIDE AS ( SELECT PRODUCT_ID, OBJECT_AGG(KEY,VALUE) OA
FROM OBJ_TALL, LATERAL FLATTEN(INPUT => S)
GROUP BY PRODUCT_ID)
-- SELECT * FROM OBJ_WIDE;
SELECT
PRODUCT_ID
,OA:L13[0] SALES_L13
,OA:L13[1] PROFIT_L13
,OA:L26[0] SALES_L26
,OA:L26[1] PROFIT_L26
,OA:L52[0] SALES_L52
,OA:L52[1] PROFIT_L52
FROM OBJ_WIDE
ORDER BY 1;
For easy comparison to the above, heres Adrians ARRAY_AGG and PIVOT version reformatted using CTE's.
-- ARRAY FORM - RE-WRITTEN WITH CTES FOR CLARITY AND COMPARISON TO OBJECT FORM
WITH CTE AS(
SELECT 'X1' PRODUCT_ID,'L13' PERIOD,100 SALES,10 PROFIT
UNION SELECT 'X1' PRODUCT_ID,'L26' PERIOD,200 SALES,20 PROFIT
UNION SELECT 'X1' PRODUCT_ID,'L52' PERIOD,300 SALES,30 PROFIT
UNION SELECT 'X2' PRODUCT_ID,'L13' PERIOD,500 SALES,110 PROFIT
UNION SELECT 'X2' PRODUCT_ID,'L26' PERIOD,600 SALES,120 PROFIT
UNION SELECT 'X2' PRODUCT_ID,'L52' PERIOD,700 SALES,130 PROFIT
UNION SELECT 'X2' PRODUCT_ID,'L52' PERIOD,1700 SALES,1130 PROFIT)
,ARR_TALL AS (SELECT PRODUCT_ID,
PERIOD,
ARRAY_CONSTRUCT( SUM(SALES)
,SUM(PROFIT)
) S
FROM CTE GROUP BY 1,2)
,ARR_WIDE AS (SELECT *
FROM ARR_TALL PIVOT (ARRAY_AGG(S) FOR PERIOD IN ('L13','L26','L52') ) )
SELECT
PRODUCT_ID
,"'L13'"[0][0] SALES_L13
,"'L13'"[0][1] PROFIT_L13
,"'L26'"[0][0] SALES_L26
,"'L26'"[0][1] PROFIT_L26
,"'L52'"[0][0] SALES_L52
,"'L52'"[0][1] PROFIT_L52
FROM ARR_WIDE
ORDER BY 1;
I believe you can only have one pivot at one time but you can check by running the first code below. Then you can run separately only with one pivot to see if it is working fine. Unfortunately, if multiple pivots are not allowed i.e first code then you can use the third code i.e case when method OR use union first to combine them i.e (Phil Culson method from above).
select *
from [table name]
pivot(sum(amount) for PERIOD in (L13, L26, L52)),
pivot(sum(profit) for PERIOD in (L13, L26, L52))
order by product_id;
if the above one doesn't work try with one for example:
https://count.co/sql-resources/snowflake/pivot-tables
select *
from [table name]
pivot(sum(amount) for PERIOD in (L13, L26, L52))
order by product_id;
Otherwise you will have to apply the manual case when logic:
select
product_id,
sum(case when Period = 'L13' then Sales end) as sales_l13,
sum(case when Period = 'L26' then Sales end) as sales_l26,
sum(case when Period = 'L52' then Sales end) as sales_l52,
sum(case when Period = 'L13' then Profit end) as profi_l13,
sum(case when Period = 'L26' then Profit end) as profit_l26,
sum(case when Period = 'L52' then Profit end) as profit_l52
from [table name]
group by 1
I have a table that stores the VIN numbers and delivery dates of vehicles based on a code. I want to be able to get one row with three columns of data.
I have tried the following
SELECT DISTINCT VIN, MAX(TRANSACTION_DATE) AS DELIVERY_DATE
FROM "TABLE"
WHERE DELIVERY_TYPE ='025'
AND VIN IN ('XYZ')
GROUP BY VIN
UNION ALL
SELECT VIN, MAX(TRANSACTION_DATE) AS OTHER_DELIVERY_DATE
FROM "TABLE"
WHERE DELIVERY_TYPE !='025'
AND VIN IN ('XYZ')
GROUP BY VIN;
When I run this I get
VIN DELIVERY_DATE
XYZ 26-dec-18
XYZ 01-MAY-19
current data format in table:
VIN TRANSACTION_DATE
XYZ 26-DEC-18
XYZ 01-MAY-19
Required format:
VIN DELIVERY_DATE OTHER_DELIVERY DATE
XYZ 26-DEC-18 01-MAY-19
use conditional aggregation
SELECT VIN,
MAX (CASE WHEN DELIVERY_TYPE ='025' AND
VIN IN ('XYZ') then TRANSACTION_DATE end) AS DELIVERY_DATE
MAX(CASE WHEN DELIVERY_TYPE !='025' AND
VIN IN ('XYZ') then TRANSACTION_DATE end) AS OTHER_DELIVERY
FROM "TABLE"
GROUP BY VIN
Just use conditional aggregation:
SELECT VIN,
MAX(CASE WHEN DELIVERY_TYPE = 25 THEN TRANSACTION_DATE END) AS DELIVERY_DATE,
MAX(CASE WHEN DELIVERY_TYPE <> 25 THEN TRANSACTION_DATE END) AS TRANSACTION_DATE
FROM TABLE
WHERE VIN IN ('XYZ')
GROUP BY VIN;
Note that SELECT DISTINCT is almost never used with GROUP BY.
You can use CROSS APPLY
DECLARE #Cars TABLE (VIN VARCHAR(100), DELIVERY_TYPE VARCHAR(3), TRANSACTION_DATE DATE)
INSERT INTO #Cars
(VIN, DELIVERY_TYPE , TRANSACTION_DATE)
VALUES
('XYZ', '025', '20181226'), ('XYZ', '030', '20190319')
I needed above code to be able to run without a table and data, all you need is this:
SELECT DISTINCT C.VIN, DD.DELIVERY_DATE, TD.TRANSACTION_DATE
FROM #Cars C
CROSS APPLY (SELECT MAX(TRANSACTION_DATE) DELIVERY_DATE FROM #Cars D WHERE D.DELIVERY_TYPE = '025' AND D.VIN = C.VIN) DD
CROSS APPLY (SELECT MAX(TRANSACTION_DATE) TRANSACTION_DATE FROM #Cars D WHERE D.DELIVERY_TYPE = '025' AND D.VIN = C.VIN) TD
If you need to transpond not two but a lot more columns, I'd suggest using PIVOT TABLE as more appropriate, but for two columns either CROSS APPLY or conditional aggregation will do the trick.
I am trying to get an aggregate result using a subquery from a table against each row of another table in hive. I understand that hive does not support subquery in SELECT clause so I'm trying to use the subquery in FROM clause, but it seems that hive does not support correlated subqueries as well.
Here's the example: table A contains data of accounts transactions with columns of dates(d1 and d2) and a currency column along with other columns, what I want to do is get the sum of exchange rate values in table B(which contains currency rates for each day of the year) between dates d1 and d2 for each account. I'm trying something like this:
SELECT
account_no, currn, balance,
trans_date as d2, last_trans_date as d1, exchng_rt
FROM
acc AS A,
(SELECT sum(rate) exchng_rt
FROM currency
WHERE curr_type = A.currn
AND banking_date BETWEEN A.d1 AND A.d2) AS B
Here is sample, the table A has account transactions and dates like:
account balance trans_date last_trans_date currency
abc 100 20-12-2016 20-11-2016 USD
abc 200 25-12-2016 20-12-2016 USD
def 500 15-11-2015 10-11-2015 AUD
def 600 20-11-2015 15-11-2015 AUD
and the table B is something like:
curr_type rate banking_date
USD 50.9 01-01-2016
USD 50.2 02-01-2016
USD 50.5 03-01-2016
AUD 50.9 01-01-2016
AUD 50.2 02-01-2016
AUD 50.5 03-01-2016 and so on...
so table contains daily rates of currencies for each type of currency
I think you can do what you want using JOIN and GROUP BY:
SELECT a.account_no, a.currn, a.balance, a.trans_date as d2, a.last_trans_date as d1,
SUM(rate) as exchng_rt
FROM acc a LEFT JOIN
currency c
ON c.curr_type = a.currn and banking_date between A.d1 and A.d2
GROUP BY a.account_no, a.currn, a.balance, a.trans_date, a.last_trans_date;
You should specify the filter after joining the two tables, something like the following:
SELECT A.account_no,
A.currn,
A.balance,
A.trans_date as d2,
A.last_trans_date as d1,
B.exchng_rt
FROM acc as A
JOIN (SELECT sum(rate) as exchng_rt,
curr_type,
banking_date
FROM currency group by curr_type,
banking_date ) as B
ON A.currn = curr_type
WHERE B.banking_date between A.d1 and A.d2</code>
How do you create a moving average in SQL?
Current table:
Date Clicks
2012-05-01 2,230
2012-05-02 3,150
2012-05-03 5,520
2012-05-04 1,330
2012-05-05 2,260
2012-05-06 3,540
2012-05-07 2,330
Desired table or output:
Date Clicks 3 day Moving Average
2012-05-01 2,230
2012-05-02 3,150
2012-05-03 5,520 4,360
2012-05-04 1,330 3,330
2012-05-05 2,260 3,120
2012-05-06 3,540 3,320
2012-05-07 2,330 3,010
This is an Evergreen Joe Celko question.
I ignore which DBMS platform is used. But in any case Joe was able to answer more than 10 years ago with standard SQL.
Joe Celko SQL Puzzles and Answers citation:
"That last update attempt suggests that we could use the predicate to
construct a query that would give us a moving average:"
SELECT S1.sample_time, AVG(S2.load) AS avg_prev_hour_load
FROM Samples AS S1, Samples AS S2
WHERE S2.sample_time
BETWEEN (S1.sample_time - INTERVAL 1 HOUR)
AND S1.sample_time
GROUP BY S1.sample_time;
Is the extra column or the query approach better? The query is
technically better because the UPDATE approach will denormalize the
database. However, if the historical data being recorded is not going
to change and computing the moving average is expensive, you might
consider using the column approach.
MS SQL Example:
CREATE TABLE #TestDW
( Date1 datetime,
LoadValue Numeric(13,6)
);
INSERT INTO #TestDW VALUES('2012-06-09' , '3.540' );
INSERT INTO #TestDW VALUES('2012-06-08' , '2.260' );
INSERT INTO #TestDW VALUES('2012-06-07' , '1.330' );
INSERT INTO #TestDW VALUES('2012-06-06' , '5.520' );
INSERT INTO #TestDW VALUES('2012-06-05' , '3.150' );
INSERT INTO #TestDW VALUES('2012-06-04' , '2.230' );
SQL Puzzle query:
SELECT S1.date1, AVG(S2.LoadValue) AS avg_prev_3_days
FROM #TestDW AS S1, #TestDW AS S2
WHERE S2.date1
BETWEEN DATEADD(d, -2, S1.date1 )
AND S1.date1
GROUP BY S1.date1
order by 1;
One way to do this is to join on the same table a few times.
select
(Current.Clicks
+ isnull(P1.Clicks, 0)
+ isnull(P2.Clicks, 0)
+ isnull(P3.Clicks, 0)) / 4 as MovingAvg3
from
MyTable as Current
left join MyTable as P1 on P1.Date = DateAdd(day, -1, Current.Date)
left join MyTable as P2 on P2.Date = DateAdd(day, -2, Current.Date)
left join MyTable as P3 on P3.Date = DateAdd(day, -3, Current.Date)
Adjust the DateAdd component of the ON-Clauses to match whether you want your moving average to be strictly from the past-through-now or days-ago through days-ahead.
This works nicely for situations where you need a moving average over only a few data points.
This is not an optimal solution for moving averages with more than a few data points.
select t2.date, round(sum(ct.clicks)/3) as avg_clicks
from
(select date from clickstable) as t2,
(select date, clicks from clickstable) as ct
where datediff(t2.date, ct.date) between 0 and 2
group by t2.date
Example here.
Obviously you can change the interval to whatever you need. You could also use count() instead of a magic number to make it easier to change, but that will also slow it down.
General template for rolling averages that scales well for large data sets
WITH moving_avg AS (
SELECT 0 AS [lag] UNION ALL
SELECT 1 AS [lag] UNION ALL
SELECT 2 AS [lag] UNION ALL
SELECT 3 AS [lag] --ETC
)
SELECT
DATEADD(day,[lag],[date]) AS [reference_date],
[otherkey1],[otherkey2],[otherkey3],
AVG([value1]) AS [avg_value1],
AVG([value2]) AS [avg_value2]
FROM [data_table]
CROSS JOIN moving_avg
GROUP BY [otherkey1],[otherkey2],[otherkey3],DATEADD(day,[lag],[date])
ORDER BY [otherkey1],[otherkey2],[otherkey3],[reference_date];
And for weighted rolling averages:
WITH weighted_avg AS (
SELECT 0 AS [lag], 1.0 AS [weight] UNION ALL
SELECT 1 AS [lag], 0.6 AS [weight] UNION ALL
SELECT 2 AS [lag], 0.3 AS [weight] UNION ALL
SELECT 3 AS [lag], 0.1 AS [weight] --ETC
)
SELECT
DATEADD(day,[lag],[date]) AS [reference_date],
[otherkey1],[otherkey2],[otherkey3],
AVG([value1] * [weight]) / AVG([weight]) AS [wavg_value1],
AVG([value2] * [weight]) / AVG([weight]) AS [wavg_value2]
FROM [data_table]
CROSS JOIN weighted_avg
GROUP BY [otherkey1],[otherkey2],[otherkey3],DATEADD(day,[lag],[date])
ORDER BY [otherkey1],[otherkey2],[otherkey3],[reference_date];
select *
, (select avg(c2.clicks) from #clicks_table c2
where c2.date between dateadd(dd, -2, c1.date) and c1.date) mov_avg
from #clicks_table c1
Use a different join predicate:
SELECT current.date
,avg(periods.clicks)
FROM current left outer join current as periods
ON current.date BETWEEN dateadd(d,-2, periods.date) AND periods.date
GROUP BY current.date HAVING COUNT(*) >= 3
The having statement will prevent any dates without at least N values from being returned.
assume x is the value to be averaged and xDate is the date value:
SELECT avg(x) from myTable WHERE xDate BETWEEN dateadd(d, -2, xDate) and xDate
In hive, maybe you could try
select date, clicks, avg(clicks) over (order by date rows between 2 preceding and current row) as moving_avg from clicktable;
For the purpose, I'd like to create an auxiliary/dimensional date table like
create table date_dim(date date, date_1 date, dates_2 date, dates_3 dates ...)
while date is the key, date_1 for this day, date_2 contains this day and the day before; date_3...
Then you can do the equal join in hive.
Using a view like:
select date, date from date_dim
union all
select date, date_add(date, -1) from date_dim
union all
select date, date_add(date, -2) from date_dim
union all
select date, date_add(date, -3) from date_dim
NOTE: THIS IS NOT AN ANSWER but an enhanced code sample of Diego Scaravaggi's answer. I am posting it as answer as the comment section is insufficient. Note that I have parameter-ized the period for Moving aveage.
declare #p int = 3
declare #t table(d int, bal float)
insert into #t values
(1,94),
(2,99),
(3,76),
(4,74),
(5,48),
(6,55),
(7,90),
(8,77),
(9,16),
(10,19),
(11,66),
(12,47)
select a.d, avg(b.bal)
from
#t a
left join #t b on b.d between a.d-(#p-1) and a.d
group by a.d
--#p1 is period of moving average, #01 is offset
declare #p1 as int
declare #o1 as int
set #p1 = 5;
set #o1 = 3;
with np as(
select *, rank() over(partition by cmdty, tenor order by markdt) as r
from p_prices p1
where
1=1
)
, x1 as (
select s1.*, avg(s2.val) as avgval from np s1
inner join np s2
on s1.cmdty = s2.cmdty and s1.tenor = s2.tenor
and s2.r between s1.r - (#p1 - 1) - (#o1) and s1.r - (#o1)
group by s1.cmdty, s1.tenor, s1.markdt, s1.val, s1.r
)
I'm not sure that your expected result (output) shows classic "simple moving (rolling) average" for 3 days. Because, for example, the first triple of numbers by definition gives:
ThreeDaysMovingAverage = (2.230 + 3.150 + 5.520) / 3 = 3.6333333
but you expect 4.360 and it's confusing.
Nevertheless, I suggest the following solution, which uses window-function AVG. This approach is much more efficient (clear and less resource-intensive) than SELF-JOIN introduced in other answers (and I'm surprised that no one has given a better solution).
-- Oracle-SQL dialect
with
data_table as (
select date '2012-05-01' AS dt, 2.230 AS clicks from dual union all
select date '2012-05-02' AS dt, 3.150 AS clicks from dual union all
select date '2012-05-03' AS dt, 5.520 AS clicks from dual union all
select date '2012-05-04' AS dt, 1.330 AS clicks from dual union all
select date '2012-05-05' AS dt, 2.260 AS clicks from dual union all
select date '2012-05-06' AS dt, 3.540 AS clicks from dual union all
select date '2012-05-07' AS dt, 2.330 AS clicks from dual
),
param as (select 3 days from dual)
select
dt AS "Date",
clicks AS "Clicks",
case when rownum >= p.days then
avg(clicks) over (order by dt
rows between p.days - 1 preceding and current row)
end
AS "3 day Moving Average"
from data_table t, param p;
You see that AVG is wrapped with case when rownum >= p.days then to force NULLs in first rows, where "3 day Moving Average" is meaningless.
We can apply Joe Celko's "dirty" left outer join method (as cited above by Diego Scaravaggi) to answer the question as it was asked.
declare #ClicksTable table ([Date] date, Clicks int)
insert into #ClicksTable
select '2012-05-01', 2230 union all
select '2012-05-02', 3150 union all
select '2012-05-03', 5520 union all
select '2012-05-04', 1330 union all
select '2012-05-05', 2260 union all
select '2012-05-06', 3540 union all
select '2012-05-07', 2330
This query:
SELECT
T1.[Date],
T1.Clicks,
-- AVG ignores NULL values so we have to explicitly NULLify
-- the days when we don't have a full 3-day sample
CASE WHEN count(T2.[Date]) < 3 THEN NULL
ELSE AVG(T2.Clicks)
END AS [3-Day Moving Average]
FROM #ClicksTable T1
LEFT OUTER JOIN #ClicksTable T2
ON T2.[Date] BETWEEN DATEADD(d, -2, T1.[Date]) AND T1.[Date]
GROUP BY T1.[Date]
Generates the requested output:
Date Clicks 3-Day Moving Average
2012-05-01 2,230
2012-05-02 3,150
2012-05-03 5,520 4,360
2012-05-04 1,330 3,330
2012-05-05 2,260 3,120
2012-05-06 3,540 3,320
2012-05-07 2,330 3,010