I want to calculate a forward-looking rolling average over a 4-day window. Please find the details below.
Create table stock(day int, time String, cost float);
Insert into stock values(1,"8 AM",3.1);
Insert into stock values(1,"9 AM",3.2);
Insert into stock values(1,"10 AM",4.5);
Insert into stock values(1,"11 AM",5.5);
Insert into stock values(2,"8 AM",5.1);
Insert into stock values(2,"9 AM",2.2);
Insert into stock values(2,"10 AM",1.5);
Insert into stock values(2,"11 AM",6.5);
Insert into stock values(3,"8 AM",8.1);
Insert into stock values(3,"9 AM",3.2);
Insert into stock values(3,"10 AM",2.5);
Insert into stock values(3,"11 AM",4.5);
Insert into stock values(4,"8 AM",3.1);
Insert into stock values(4,"9 AM",1.2);
Insert into stock values(4,"10 AM",0.5);
Insert into stock values(4,"11 AM",1.5);
I wrote the query below:
select day, cost,
       sum(cost) over (order by day range between current row and 4 following),
       avg(cost) over (order by day range between current row and 4 following)
from stock;
As you can see, I get 4 records for each day, and I need to calculate the rolling average over a 4-day window, which is why I wrote the window query above. Since I have data for only 4 days, each day containing 4 records, the sum for the first day is the total of all 16 records. Accordingly, the first record has a sum of 56.20, which is correct, and the average should be 56.20/4 (as there are 4 days), but it is computing 56.20/16, as there are 16 records in total. How do I fix the average part of this?
Thanks
Raj
Is this what you want?
select s.*,
       avg(cost) over (order by day range between current row and 4 following)
from stock s;
EDIT:
What you seem to want is:
select s.*,
       (sum(cost) over (order by day range between current row and 3 following) /
        count(distinct day) over (order by day range between current row and 3 following)
       )
from stock s;
But Hive does not support count(distinct) as a window function. You can use a subquery to work around this:
select s.*,
       (sum(cost) over (order by day range between current row and 3 following) /
        sum(case when seqnum = 1 then 1 else 0 end) over (order by day range between current row and 3 following)
       )
from (select st.*,
             row_number() over (partition by day order by time) as seqnum
      from stock st
     ) s;
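For the sample data above, a quick sanity check: day 1's forward window covers all four days, so the sum is 56.20, seqnum = 1 occurs on four rows, and the average comes out to 56.20 / 4 = 14.05. If you prefer, here is a sketch of an alternative that avoids the seqnum trick by pre-aggregating to one row per day and averaging the daily totals; it assumes your Hive version accepts the same RANGE frame you already used:
-- Pre-aggregate to one row per day, then average the daily totals over
-- the 4-day forward window; join back to stock to keep every time slot.
select s.day, s.time, s.cost, d.four_day_avg
from stock s
join (select day,
             avg(day_total) over (order by day
                                  range between current row and 3 following) as four_day_avg
      from (select day, sum(cost) as day_total
            from stock
            group by day
           ) daily
     ) d
  on s.day = d.day;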
SELECT CUST_ID, CONTACTS,
Sum("CONTACTS") Over (PARTITION by "CUST_ID" Order By "end_Period" ROWS UNBOUNDED PRECEDING) as RunningContacts,
"SALES",
Sum("SALES") Over (PARTITION by "CUST_ID" Order By "end_Period" ROWS UNBOUNDED PRECEDING) as RunningSales,
end_Period
FROM Table2
I currently create the Running Growth column in Excel; the formula is (New RunningSales - Previous RunningSales) / Previous RunningSales. How can I compute the same thing in SQL?
Any help here is appreciated.
Are you looking for this?
select x.*,
       RunningSales / (RunningSales - SALES) - 1 as RunningGrowth
from (< your query here >) x;
A derived table can hold a query that aggregates sales by period, and you can join it to itself to compare each period to the prior period.
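A hypothetical sketch of that self-join, where period_totals stands for a derived table (or CTE) built from your aggregation query, and the prior-period match on end_Period - 1 is an assumption to adjust to however your periods are keyed:
-- period_totals: one row per CUST_ID and end_Period with its RunningSales.
-- Growth = (current running total - prior running total) / prior running total.
SELECT cur.CUST_ID,
       cur.end_Period,
       cur.RunningSales,
       (cur.RunningSales - prev.RunningSales) / prev.RunningSales AS RunningGrowth
FROM period_totals cur
LEFT JOIN period_totals prev
       ON prev.CUST_ID = cur.CUST_ID
      AND prev.end_Period = cur.end_Period - 1;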
I have data at the Cust_ID and Week_Date level with sales_dollars. Each Week_Date represents the sales_dollars for that week. I want to sum the dollars for each Cust_ID over the next 8 weeks, with the final data still at the Cust_ID and Week_Date level. Is that possible using SQL?
You seem to want a window function:
select t.*,
       sum(sales_dollars) over (partition by cust_id
                                order by week_date
                                rows between current row and 7 following
                               ) as sales_8weeks
from t;
This assumes that all weeks have data. If you want to define the window by date range instead, you can, but the exact syntax varies by database.
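For example, a sketch assuming PostgreSQL 11 or later, where RANGE frames accept interval offsets (the 49-day bound is the current week plus the 7 weeks that follow):
-- Date-based frame: the current week_date plus anything up to 49 days later,
-- i.e. 8 calendar weeks of weekly data, even if some weeks are missing.
select t.*,
       sum(sales_dollars) over (partition by cust_id
                                order by week_date
                                range between current row
                                          and interval '49 days' following
                               ) as sales_8weeks
from t;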
I'm calculating a moving average of the last 100 sales of a particular item. I'd like to know if user X has spent more than 5 times everyone else combined, on that item in the last 100 sales window.
--how much has the current row user spent on this item over the last 100 sales?
SUM(saleprice) OVER(PARTITION BY item, user ORDER BY saledate ROWS BETWEEN 100 PRECEDING AND CURRENT ROW)
--pseudocode: how much has everyone else, excluding this user, spent on that item over the last 100 sales?
SUM(saleprice) OVER(PARTITION BY item ORDER BY saledate ROWS BETWEEN 100 PRECEDING AND CURRENT ROW WHERE preceding_row.user <> current_row.user)
Ultimately, I don't want the purchases made by my big spender to be factored into the total spend by the little spenders. Is there a technique that can exclude rows from a window, if they don't meet some comparison criteria versus the current row? (in my case, don't sum the saleprice from the preceding row if it bears the same user as the current row)
The first one looks fine to me, except you're counting 101 sales (100 preceding AND the current row).
--how much has the current row user spent on this item over the last 100 sales?
SUM(saleprice)
OVER (
PARTITION BY item, user
ORDER BY saledate
ROWS BETWEEN 100 PRECEDING AND 1 PRECEDING -- 100 excluding this sale
ROWS BETWEEN 99 PRECEDING AND CURRENT ROW -- 100 including this sale
)
(Just use one of the two suggested ROWS BETWEEN clauses)
In the second expression, you can't add a WHERE clause. You can change the aggregation, the partition and the sorting, but I can't see how that would help you. I think you need a correlated sub-query and/or use of OUTER APPLY...
SELECT
*,
SUM(saleprice)
OVER (
PARTITION BY item, user
ORDER BY saledate
ROWS BETWEEN 99 PRECEDING AND CURRENT ROW -- 100 including this sale
)
AS user_total_100_purchases_to_date,
others_sales_top_100_total.saleprice
FROM
sales_data
OUTER APPLY
(
SELECT
SUM(saleprice) AS saleprice
FROM
(
SELECT TOP(100) saleprice
FROM sales_data others_sales
WHERE others_sales.user <> sales_data.user
AND others_sales.item = sales_data.item
AND others_sales.saledate <= sales_data.saledate
ORDER BY others_sales.saledate DESC
)
AS others_sales_top_100
)
AS others_sales_top_100_total
EDIT: Another way to look at it, to make things more consistent:
SELECT
*,
usr_last100_saletotal,
all_last100_saletotal,
CASE WHEN usr_last100_saletotal > all_last100_saletotal * 0.8
THEN 'user spent 80%, or more, of last 100 sales'
ELSE 'user spent under 80% of last 100 sales'
END
AS spend_note
FROM
sales_data
OUTER APPLY
(
SELECT
SUM(CASE top100.user WHEN sales_data.user THEN top100.saleprice END) AS usr_last100_saletotal,
SUM( top100.saleprice ) AS all_last100_saletotal
FROM
(
SELECT TOP(100) user, saleprice
FROM sales_data AS others_sales
WHERE others_sales.item = sales_data.item
AND others_sales.saledate <= sales_data.saledate
ORDER BY others_sales.saledate DESC
)
AS top100
)
AS top100_summary
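As a follow-up, the question's actual test (has the user spent more than 5 times everyone else combined?) can be written against those same two totals, since everyone else's spend is just the overall total minus the user's own total. A sketch of the CASE expression to drop into the outer SELECT in place of the 80% version (big_spender_flag is a made-up alias):
-- Everyone else's spend over the window = all_last100_saletotal - usr_last100_saletotal.
CASE WHEN usr_last100_saletotal >
          5 * (all_last100_saletotal - usr_last100_saletotal)
     THEN 'user spent more than 5x everyone else in the last 100 sales'
     ELSE 'user did not'
END AS big_spender_flag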
I am trying to calculate the max of a value over a relative date range. Suppose I have these columns: Date, Week, Category, Value. Note: The Week column is the Monday of the week of the corresponding Date.
I want to produce a table which gives the MAX value within the last two weeks for each Date, Week, Category combination so that the output produces the following: Date, Week, Category, Value, 2WeeksPriorMAX.
How would I go about writing that query? I don't think the following would work:
SELECT Date, Week, Value,
MAX(Value) OVER (PARTITION BY Category
ORDER BY Week
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as 2WeeksPriorMAX
The above query doesn't account for cases where there are missing values for a given Category, Week combination within the last 2 weeks, and therefore it would span further than 2 weeks when it analyzes the 2 preceding rows.
Left joining or using a lateral join/subquery might be expensive. You can do this with window functions, but you need to have a bit more logic:
select t.*,
(case when lag(date, 1) over (partition by category order by date) < date - interval '2 week'
then value
when lag(date, 2) over (partition by category order by date) < date - interval '2 week'
then max(value) over (partition by category order by date rows between 1 preceding and current row)
else max(value) over (partition by category order by date rows between 2 preceding and current row)
end) as TwoWeekMax
from t;
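If your database supports RANGE frames with interval offsets (PostgreSQL 11+, for instance; this is an assumption about your platform), a simpler sketch frames directly on the date and avoids the lag() branching:
-- Take the max over every row of the same category dated within the
-- previous two weeks, inclusive of the current row.
select t.*,
       max(value) over (partition by category
                        order by date
                        range between interval '2 weeks' preceding and current row
                       ) as TwoWeekMax
from t;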
I am trying to build a view that generates a moving 13-week average over the past year.
My source data includes a date, customer ID, and an integer value; basically, for each week within the previous 52 weeks, I want to average the 13 prior values (including the current one). When I'm finished, I'd like to have a table with a date, each customer ID, and the trailing 13-week average for that customer.
After upgrading Postgres to 9.1, the window functions worked great for this:
SELECT vs.weekending,
cs.slinkcust AS customer,
cs.slinkid AS id,
round(avg(vs.maxattached) OVER (PARTITION BY cs.slinkid ORDER BY vs.weekending DESC ROWS BETWEEN 0 PRECEDING AND 12 FOLLOWING), 2) AS rolling_conc_avg,
round(avg(vs.totsessions) OVER (PARTITION BY cs.slinkid ORDER BY vs.weekending DESC ROWS BETWEEN 0 PRECEDING AND 12 FOLLOWING), 2) AS rolling_sess_avg,
dense_rank() OVER (ORDER BY vs.weekending) AS week_number
FROM cfg_slink cs
JOIN view_statslink vs ON cs.slinkid = vs.id
WHERE vs.weekending >= (now() - '364 days'::interval) AND cs.disabled = 0
GROUP BY vs.weekending, cs.slinkid, vs.maxattached, vs.totsessions
ORDER BY vs.weekending DESC, cs.slinkcust;
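Equivalently, the descending sort with a FOLLOWING frame can be rewritten as an ascending sort with a PRECEDING frame, which some find easier to read; a sketch of just one of the window expressions:
-- Same 13-week trailing frame: the current week plus the 12 weeks before it.
round(avg(vs.maxattached) OVER (PARTITION BY cs.slinkid
                                ORDER BY vs.weekending
                                ROWS BETWEEN 12 PRECEDING AND CURRENT ROW), 2) AS rolling_conc_avg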