Cumulative product using input from multiple columns - sql

I have a combination of some daily return estimates and month-to-day (MTD) returns, which are issued weekly. I want to combine these two data series to get a daily estimated MTD value.
I have tried to summarize what I would like to achieve below.
I have all the columns except MTD_estimate, which I would like to derive from DailyReturnEstimate and MTD. When an MTD value exists, it should use that value; otherwise, it should do a cumulative product of the returns. My code looks as follows:
select *, exp(sum(log(1+DailyReturnEstimate)) OVER (ORDER BY dates) )-1 as Cumu_DailyReturn from TestTbl
My problem is that I am not sure how to use the MTD value, when it is present, in the cumulative product.
I am using Microsoft SQL Server 2012. I have made a small data example below:
CREATE TABLE TestTbl (
id integer PRIMARY KEY,
dates date,
DailyReturnEstimate float,
MTD float
);
INSERT INTO TestTbl
(id, Dates, DailyReturnEstimate, MTD) VALUES
(1, '2020-01-01', -0.01, NULL ),
(2, '2020-01-02', 0.005 , NULL ),
(3, '2020-01-03', 0.012 , NULL ),
(4, '2020-01-04', -0.007 , NULL ),
(5, '2020-01-05', 0.021 , 0.016 ),
(6, '2020-01-06', 0.001 , NULL );

This is a bit tricky, but the idea is to set up separate groups based on where mtd is already defined. Then do the calculation only within those groups:
select t.*,
exp(sum(log(1+coalesce(mtd, DailyReturnEstimate))) OVER (partition by grp ORDER BY dates) )-1 as Cumu_DailyReturn
from (select t.*,
count(mtd) over (order by dates) as grp
from testtbl t
) t;
Here is a db<>fiddle.
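For the sample data, the inner count(mtd) over (order by dates) gives grp = 0 for the first four rows and grp = 1 from 2020-01-05 onward, so the cumulative product restarts at the row where MTD is known: row 5 returns the MTD itself (0.016) and row 6 returns (1 + 0.016) * (1 + 0.001) - 1 ≈ 0.0170.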

Divide monthly spend into daily spend in BigQuery

I have monthly data in BigQuery in the following form:
CREATE TABLE if not EXISTS spend (
id int,
created_at DATE,
value float
);
INSERT INTO spend VALUES
(1, '2020-01-01', 100),
(2, '2020-02-01', 200),
(3, '2020-03-01', 100),
(4, '2020-04-01', 100),
(5, '2020-05-01', 50);
I would like a query to translate it into daily data in the following way:
One row per day.
The value of each day should be the monthly value divided by the number of days of the month.
What's the simplest way of doing this in BigQuery?
You can make use of GENERATE_DATE_ARRAY() in order to get an array between the desired dates (in your case, between 2020-01-01 and 2020-05-31) and create a calendar table, and then divide the value of a given month among the days in the month :)
Try this and let me know if it worked:
with calendar_table as (
select
calendar_date
from
unnest(generate_date_array('2020-01-01', '2020-05-31', interval 1 day)) as calendar_date
),
final as (
select
ct.calendar_date,
s.value,
s.value / extract(day from last_day(ct.calendar_date)) as daily_value
from
spend as s
cross join
calendar_table as ct
where
format_date('%Y-%m', date(ct.calendar_date)) = format_date('%Y-%m', date(s.created_at))
)
select * from final
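For example, the January row (value 100) becomes 31 daily rows of 100 / 31 ≈ 3.23 each, and the February row (value 200) becomes 29 rows of about 6.90 each (2020 is a leap year).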
My recommendation is to do this "locally". That is, run generate_date_array() for each row in the original table. This is much faster than a join across rows. BigQuery also makes this easy with the last_day() function:
select t.id, day_date,
       t.value / extract(day from last_day(t.created_at)) as daily_value
from `table` t cross join
     unnest(generate_date_array(t.created_at,
                                last_day(t.created_at, month))) as day_date;
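Note that this second query generates days from created_at through the end of its month, so it assumes each monthly row is dated the first of the month (as in the sample data); a mid-month created_at would leave the earlier days of that month without rows.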

Get first value outside where window with lag function

When using the lag function on time series in SQL Server, I always struggle with the first value in a time series.
Assume this trivial example
CREATE TABLE demo
([id] int, [time] date, [content] int)
;
INSERT INTO demo (id, time, content) VALUES
(1, '2021-05-31', cast(rand()*1000 as int)),
(2, '2021-06-01', cast(rand()*1000 as int)),
(3, '2021-06-02',cast(rand()*1000 as int)),
(4, '2021-06-03', cast(rand()*1000 as int)),
(5, '2021-06-04', cast(rand()*1000 as int)),
(6, '2021-06-05', cast(rand()*1000 as int)),
(7, '2021-06-06', cast(rand()*1000 as int)),
(8, '2021-06-07', cast(rand()*1000 as int)),
(9, '2021-06-08', cast(rand()*1000 as int));
I want to get all values and their previous value in June, so something like this
select content, lag(content, 1, null) over (order by time)
from demo
where time >= '2021-06-01'
So far so good; however, the first entry will result in null for the previous value.
Of course there are many solutions for filling the null value, e.g. subselecting a larger range, but for very large tables I think there should be a more elegant solution to this.
Sometimes I do stuff like this
select content, lag(content, 1,
(select content from demo d1 join
(select max(time) maxtime from demo where time < '2021-06-01') d2 on d1.time = d2.maxtime
)) over (order by time)
from demo
where time >= '2021-06-01'
Is there something more efficient? (Note: of course for this trivial example it doesn't make a difference, but for partitioned tables with 500'000'000 entries, one should find the most efficient solution.)
Check out the fiddle
The key idea is to use a subquery:
select d.*
from (select time, content, lag(content) over (order by time) as prev_content
from demo
) d
where time >= '2021-06-01';
This is probably going to scan the entire table. However, you can create an index demo(time, content) to help the lag().
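A minimal sketch of that index (the name is just illustrative):
-- covering index on (time, content), as suggested above; the index name is illustrative
CREATE INDEX ix_demo_time_content ON demo (time, content);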
Next, you can optimize this if you have a reasonable lookback period. For instance, if there are records every month, just go back one month in the subquery:
select d.*
from (select time, content, lag(content) over (order by time) as prev_content
from demo
where time >= '2021-05-01'
) d
where time >= '2021-06-01';
This can also be very important if your data is partitioned -- as large tables are wont to be.
For this particular case, going by your comments, you may first compute the lag over the entire unfiltered table, then subquery that based on date:
WITH cte AS (
SELECT time, content, LAG(content) OVER (ORDER BY time) lag_content
FROM demo
)
SELECT content, lag_content
FROM cte
WHERE time >= '2021-06-01';
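With the sample data, the 2021-06-01 row now gets the 2021-05-31 content as its lag_content instead of NULL, because the window function is evaluated over the whole table before the date filter is applied.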
What would you like the null value to be? In the example below I've filled it with content - 1.
SELECT
content,
coalesce(LAG(content, 1, NULL) OVER(
ORDER BY
time
), content-1) lag_content
FROM
demo
WHERE
time >= '2021-06-01'
Output:
content lag_content
-------------------
2 1
3 2
4 3
5 4
6 5
7 6
8 7
9 8
Try it out here: dbfiddle

Sum and Running Sum, Distinct and Running Distinct

I want to calculate sum, running sum, distinct, running distinct - preferably all in one query.
http://sqlfiddle.com/#!18/65eff/1
create table test (store int, day varchar(10), food varchar(10), quantity int)
insert into test select 101, '2021-01-01', 'rice', 1
insert into test select 101, '2021-01-01', 'rice', 1
insert into test select 101, '2021-01-01', 'rice', 2
insert into test select 101, '2021-01-01', 'fruit', 2
insert into test select 101, '2021-01-01', 'water', 3
insert into test select 101, '2021-01-01', 'fruit', 1
insert into test select 101, '2021-01-01', 'salt', 2
insert into test select 101, '2021-01-02', 'rice', 1
insert into test select 101, '2021-01-02', 'rice', 2
insert into test select 101, '2021-01-02', 'fruit', 1
insert into test select 101, '2021-01-02', 'pepper', 4
Uniques (distinct) & Total (sum) are simple:
select store, day, count(distinct food) as uniques, sum(quantity) as total
from test
group by store, day
But I want the output to be:
store  day         uniques  run_uniques  total  run_total
-----  ----------  -------  -----------  -----  ---------
101    2021-01-01  4        4            12     12
101    2021-01-02  3        5            10     22
I tried a self-join with t.day >= prev.day to get cumulative/running data, but it's causing double-counting.
First off: always store data in the correct data type; day should be a date column.
Calculating a running sum of sum(quantity) aggregate is quite simple, you just nest it inside a window function: SUM(SUM(...)) OVER (...).
Calculating the running number of unique food per store is more complicated, because you want the rolling number of unique items before grouping, and there is no COUNT(DISTINCT ...) window function in SQL Server (which is what I'm using).
So I've gone with calculating a ROW_NUMBER() for each store and food across all days; then we just sum up the number of times we get 1, i.e. the first time we've seen that food.
SELECT
t.store,
t.day,
uniques = COUNT(DISTINCT t.food),
run_uniques = SUM(SUM(CASE WHEN t.rn = 1 THEN 1 ELSE 0 END))
OVER (PARTITION BY t.store ORDER BY t.day ROWS UNBOUNDED PRECEDING),
total = SUM(t.quantity),
run_total = SUM(SUM(t.quantity))
OVER (PARTITION BY t.store ORDER BY t.day ROWS UNBOUNDED PRECEDING)
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY store, food ORDER BY day) rn
FROM test
) t
GROUP BY t.store, t.day;
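For the sample data, rn = 1 marks the first appearance of each (store, food) pair: rice, fruit, water and salt on 2021-01-01 and pepper on 2021-01-02, which is what makes run_uniques come out as 4 and then 5.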

Find average time difference between stages

In SQL Server 2012, I have a database table called Stages that has 3 columns:
AccountID
StageNum
StartTime
I am trying to find out how long it usually takes between each stage, e.g. stage 2 usually takes 3 days to complete. Is this possible? Is it possible to skip weekends too?
Any SQL would be helpful!
Thank you
The simplest method is:
select AccountID,
       ( datediff(minute, min(StartTime), max(StartTime)) /
         nullif(60.0 * (count(*) - 1), 0)
       ) as avg_hours
from Stages
group by AccountID;
The nullif() prevents division by zero. The idea is simple . . . take the total amount of time and divide by one less than the number of stages.
I would try using LEAD and AVG as follows:
CREATE TABLE Tab1(
AccountID INT, StageNum INT, StartTime DATETIME
)
INSERT INTO Tab1 VALUES(1, 1, '2018-01-01 07:00:00.000'), (1, 2, '2018-01-03 12:54:00.000'), (1, 3, '2018-02-01 12:00:00.000')
INSERT INTO Tab1 VALUES(2, 1, '2018-03-01 00:00:00.000'), (2, 2, '2018-04-03 12:54:00.000'), (2, 3, '2018-08-01 12:00:00.000')
WITH cte AS(
SELECT *
,LEAD(StartTime) OVER (PARTITION BY t.AccountID ORDER BY t.StageNum) NextStart
,DATEDIFF(MINUTE, StartTime, LEAD(StartTime) OVER (PARTITION BY t.AccountID ORDER BY t.StageNum))/60.0 TimeSpanHours
FROM Tab1 t
)
SELECT AccountID, AVG(TimeSpanHours) AvgTimeSpanHours
FROM cte
GROUP BY AccountID
ORDER BY AccountID
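For account 1 in the sample data, the two non-null spans are 53.9 and 695.1 hours (the last stage has no LEAD value, so AVG ignores it), giving AvgTimeSpanHours = 374.5.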

Get average of last 7 days

I'm attacking a problem where I have a value for a range of dates. I would like to consolidate the rows in my table by averaging them and reassigning the date column to be relative to the last 7 days. My SQL experience is lacking, and I could use some help. Thanks for giving this a look!
E.g.
7 rows with dates and values.
UniqueId Date Value
........ .... .....
a 2014-03-20 2
a 2014-03-21 2
a 2014-03-22 3
a 2014-03-23 5
a 2014-03-24 1
a 2014-03-25 0
a 2014-03-26 1
Resulting row
UniqueId Date AvgValue
........ .... ........
a 2014-03-26 2
First off, I am not even sure this is possible. I'm trying to attack a problem with the data at hand. I thought maybe using a framing window with a partition to roll the dates into one date with the averaged result, but I'm not exactly sure how to say that in SQL.
I am taking the following as sample data:
CREATE TABLE some_data1 (unique_id text, date date, value integer);
INSERT INTO some_data1 (unique_id, date, value) VALUES
( 'a', '2014-03-20', 2),
( 'a', '2014-03-21', 2),
( 'a', '2014-03-22', 3),
( 'a', '2014-03-23', 5),
( 'a', '2014-03-24', 1),
( 'a', '2014-03-25', 0),
( 'a', '2014-03-26', 1),
( 'b', '2014-03-01', 1),
( 'b', '2014-03-02', 1),
( 'b', '2014-03-03', 1),
( 'b', '2014-03-04', 1),
( 'b', '2014-03-05', 1),
( 'b', '2014-03-06', 1),
( 'b', '2014-03-07', 1)
OPTION A: using a CTE (WITH)
with cte as (
select unique_id
,max(date) date
from some_data1
group by unique_id
)
select max(sd.unique_id),max(sd.date),avg(sd.value)
from some_data1 sd inner join cte using(unique_id)
where sd.date <=cte.date
group by cte.unique_id
limit 7
> SQLFIDDLE DEMO
OPTION B: works in both PostgreSQL and MySQL
select max(sd.unique_id)
,max(sd.date)
,avg(sd.value)
from (
select unique_id
,max(date) date
from some_data1
group by unique_id
) cte inner join some_data1 sd using(unique_id)
where sd.date <=cte.date
group by cte.unique_id
limit 7
> SQLFIDDLE DEMO
Maybe something along the lines of SELECT AVG(Value) AS AvgValue FROM tableName WHERE Date BETWEEN dateStart AND dateEnd. That will get you the average between those dates, and since you already have dateEnd you could use that result to create the row you're looking for.
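A minimal concrete version of that suggestion, using the some_data1 sample table defined above and hard-coding the 7-day window ending 2014-03-26:
-- average of the last 7 days for unique_id 'a', per the suggestion above
select unique_id, max(date) as date, avg(value) as avg_value
from some_data1
where unique_id = 'a'
  and date between '2014-03-20' and '2014-03-26'
group by unique_id;
For the 'a' rows this returns 2014-03-26 with an average of 2, matching the resulting row the question asks for.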
For PostgreSQL a window function might be what you want:
DROP TABLE IF EXISTS some_data;
CREATE TABLE some_data (unique_id text, date date, value integer);
INSERT INTO some_data (unique_id, date, value) VALUES
( 'a', '2014-03-20', 2),
( 'a', '2014-03-21', 2),
( 'a', '2014-03-22', 3),
( 'a', '2014-03-23', 5),
( 'a', '2014-03-24', 1),
( 'a', '2014-03-25', 0),
( 'a', '2014-03-26', 1),
( 'a', '2014-03-27', 3);
WITH avgs AS (
SELECT unique_id, date,
avg(value) OVER w AS week_avg,
count(value) OVER w AS num_days
FROM some_data
WINDOW w AS (
PARTITION BY unique_id
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW))
SELECT unique_id, date, week_avg
FROM avgs
WHERE num_days=7
Result:
unique_id | date | week_avg
-----------+------------+--------------------
a | 2014-03-26 | 2.0000000000000000
a | 2014-03-27 | 2.1428571428571429
Questions include:
What happens if a day from the preceding six days is missing? Do we want to add it and count it as zero?
What happens if you add a day? Is the result of the code above what you want (a rolling 7-day average)?
For SQL Server, you can follow the approach below. Try this:
1. For the weekly average of values
SET DATEFIRST 4
;WITH CTE AS
(
SELECT *,
DATEPART(WEEK,[DATE])WK,
--Find last day in that week
ROW_NUMBER() OVER(PARTITION BY UNIQUEID,DATEPART(WEEK,[DATE]) ORDER BY [DATE] DESC) RNO,
-- Find average value of that week
AVG(VALUE) OVER(PARTITION BY UNIQUEID,DATEPART(WEEK,[DATE])) AVGVALUE
FROM DATETAB
)
SELECT UNIQUEID,[DATE],AVGVALUE
FROM CTE
WHERE RNO=1
Click here to view result
2. For the average of the last 7 days' values
DECLARE @DATE DATE = '2014-03-26'
;WITH CTE AS
(
SELECT UNIQUEID,[DATE],VALUE,@DATE CURRENTDATE
FROM DATETAB
WHERE [DATE] BETWEEN DATEADD(DAY,-7,@DATE) AND @DATE
)
)
SELECT UNIQUEID,CURRENTDATE [DATE],AVG(VALUE) AVGVALUE
FROM CTE
GROUP BY UNIQUEID,CURRENTDATE
Click here to view result