Divide monthly spend into daily spend in BigQuery

I have monthly data in BigQuery in the following form:
CREATE TABLE if not EXISTS spend (
id int,
created_at DATE,
value float
);
INSERT INTO spend VALUES
(1, '2020-01-01', 100),
(2, '2020-02-01', 200),
(3, '2020-03-01', 100),
(4, '2020-04-01', 100),
(5, '2020-05-01', 50);
I would like a query that translates it into daily data in the following way:
One row per day.
The value of each day should be the monthly value divided by the number of days of the month.
What's the simplest way of doing this in BigQuery?

You can use GENERATE_DATE_ARRAY() to build an array of dates between the desired bounds (in your case, 2020-01-01 and 2020-05-31), turn it into a calendar table, and then divide each month's value among the days of that month :)
Try this and let me know if it worked:
with calendar_table as (
  select
    calendar_date
  from
    unnest(generate_date_array('2020-01-01', '2020-05-31', interval 1 day)) as calendar_date
),

final as (
  select
    ct.calendar_date,
    s.value,
    s.value / extract(day from last_day(ct.calendar_date)) as daily_value
  from
    spend as s
  cross join
    calendar_table as ct
  where
    format_date('%Y-%m', date(ct.calendar_date)) = format_date('%Y-%m', date(s.created_at))
)

select * from final
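A minor variation on the same idea (a sketch, assuming, as in your sample data, that created_at is always the first day of the month): join the calendar to spend on the truncated month instead of comparing formatted strings:
with calendar_table as (
  select calendar_date
  from unnest(generate_date_array('2020-01-01', '2020-05-31', interval 1 day)) as calendar_date
)
select
  -- one row per calendar day, carrying that month's spend split evenly
  ct.calendar_date,
  s.value / extract(day from last_day(ct.calendar_date)) as daily_value
from calendar_table as ct
join spend as s
  on date_trunc(ct.calendar_date, month) = s.created_at
order by ct.calendar_date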

My recommendation is to do this "locally". That is, run generate_date_array() for each row in the original table. This is much faster than a join across rows. BigQuery also makes this easy with the last_day() function:
select t.id, d as date,
       t.value / extract(day from last_day(t.created_at)) as daily_value
from spend t cross join
     unnest(generate_date_array(t.created_at, last_day(t.created_at, month))) as d;
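As a quick sanity check (a sketch against the spend table from the question), you can sum the generated daily values back up by month and confirm they match the original monthly amounts:
-- each month's reallocated daily values should add back up to the original value
select
  date_trunc(d, month) as month_start,
  round(sum(t.value / extract(day from last_day(t.created_at))), 6) as reallocated_total
from spend t
cross join unnest(generate_date_array(t.created_at, last_day(t.created_at))) as d
group by month_start
order by month_start;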

Related

Get first value outside where window with lag function

When using the lag function on time series in SQL Server, I always struggle with the first value in a time series.
Assume this trivial example
CREATE TABLE demo
([id] int, [time] date, [content] int)
;
INSERT INTO demo (id, time, content) VALUES
(1, '2021-05-31', cast(rand()*1000 as int)),
(2, '2021-06-01', cast(rand()*1000 as int)),
(3, '2021-06-02',cast(rand()*1000 as int)),
(4, '2021-06-03', cast(rand()*1000 as int)),
(5, '2021-06-04', cast(rand()*1000 as int)),
(6, '2021-06-05', cast(rand()*1000 as int)),
(7, '2021-06-06', cast(rand()*1000 as int)),
(8, '2021-06-07', cast(rand()*1000 as int)),
(9, '2021-06-08', cast(rand()*1000 as int));
I want to get all values and their previous value in June, so something like this
select content, lag(content, 1, null) over (order by time)
from demo
where time >= '2021-06-01'
so far so good, however, the first entry will result in null for the previous value.
Of course there are many solutions on how to fill the null value, e.g. subselecting a larger range etc. but for very large tables I somehow think there should be an elegant solution to this.
Sometimes I do stuff like this
select content, lag(content, 1,
(select content from demo d1 join
(select max(time) maxtime from demo where time < '2021-06-01') d2 on d1.time = d2.maxtime
)) over (order by time)
from demo
where time >= '2021-06-01'
Is there something more efficient? (Note: of course for this trivial example it doesn't make a difference, but for partitioned tables with 500'000'000 entries one should find the most efficient solution.)
Check out the fiddle
The key idea is to use a subquery:
select d.*
from (select time, content, lag(content) over (order by time) as prev_content
      from demo
     ) d
where d.time >= '2021-06-01';
This is probably going to scan the entire table. However, you can create an index demo(time, content) to help the lag().
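For example (a sketch; the index name is just illustrative), that index could be created as:
CREATE INDEX ix_demo_time_content ON demo ([time], [content]);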
Next, you can optimize this if you have a reasonable lookback period. For instance, if there are records every month, just go back one month in the subquery:
select d.*
from (select time, content, lag(content) over (order by time) as prev_content
      from demo
      where time >= '2021-05-01'
     ) d
where d.time >= '2021-06-01';
This can also be very important if your data is partitioned -- as large tables are wont to be.
For this particular case, going by your comments, you may first compute the lag over the entire unfiltered table, then subquery that based on date:
WITH cte AS (
SELECT time, content, LAG(content) OVER (ORDER BY time) lag_content
FROM demo
)
SELECT content, lag_content
FROM cte
WHERE time >= '2021-06-01';
What would you like the missing previous value to be? In the example below I've used content - 1 as the fallback.
SELECT
    content,
    COALESCE(LAG(content, 1, NULL) OVER (ORDER BY time), content - 1) AS lag_content
FROM demo
WHERE time >= '2021-06-01'
Output:
content lag_content
-------------------
2 1
3 2
4 3
5 4
6 5
7 6
8 7
9 8
Try it out here: dbfiddle

Count the number of minutes with datediff and substring

I have data like this
availabilities
[{"starts_at":"09:00","ends_at":"17:00"}]
I have the query below and it works:
select COALESCE(availabilities,'Total') as availabilities,
SUM(DATEDIFF(minute,start_at,end_at)) as 'Total Available Hours in Minutes'
from (
select cast(availabilities as NVARCHAR) as availabilities,
cast(SUBSTRING(availabilities,16,5) as time) as start_at,
cast(SUBSTRING(availabilities,34,5) as time) as end_at
from alfy.dbo.daily_availabilities
)x
GROUP by ROLLUP(availabilities);
Result
availabilities Total Available Hours in Minutes
[{"starts_at":"09:00","ends_at":"17:00"}] 480
What if the data looks like the rows below?
availabilities
[{"starts_at":"10:00","ends_at":"13:30"},{"starts_at":"14:00","ends_at":"18:00"}]
[{"starts_at":"09:00","ends_at":"12:30"},{"starts_at":"13:00","ends_at":"15:30"},{"starts_at":"16:00","ends_at":"18:00"}]
How do I count the total number of minutes across two or more time ranges?
Since you have JSON data use OPENJSON (Transact-SQL) to parse it, e.g.:
create table dbo.daily_availabilities (
id int,
availabilities nvarchar(max) --JSON
);
insert dbo.daily_availabilities (id, availabilities) values
(1, N'[{"starts_at":"09:00","ends_at":"17:00"}]'),
(2, N'[{"starts_at":"10:00","ends_at":"13:30"},{"starts_at":"14:00","ends_at":"18:00"}]'),
(3, N'[{"starts_at":"09:00","ends_at":"12:30"},{"starts_at":"13:00","ends_at":"15:30"},{"starts_at":"16:00","ends_at":"18:00"}]');
select id, sum(datediff(mi, starts_at, ends_at)) as total_minutes
from dbo.daily_availabilities
cross apply openjson(availabilities) with (
starts_at time,
ends_at time
) av
group by id
id  total_minutes
--  -------------
1   480
2   450
3   480
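If you also want a 'Total' row like the ROLLUP in your original query, the same idea extends to grouping by ROLLUP(id) (a sketch along those lines):
select
    coalesce(cast(id as nvarchar(10)), 'Total') as id,
    sum(datediff(mi, starts_at, ends_at)) as total_minutes
from dbo.daily_availabilities
cross apply openjson(availabilities) with (
    starts_at time,
    ends_at time
) av
group by rollup(id);  -- the extra NULL group produced by ROLLUP becomes the 'Total' row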

Cumulative product using input from multiple columns

I have a combination of some daily return estimates and month-to-day (MTD) returns, which are issued weekly. I want to combine these two data series to get a daily estimated MTD value.
I have tried to summarize what I would like to achieve below.
I have all the columns except MTD_estimate, which I would like to derive from DailyReturnEstimate and MTD. If an MTD value exists, it should be used; otherwise the cumulative product of the returns should be used. My code looks as follows:
select *, exp(sum(log(1+DailyReturnEstimate)) OVER (ORDER BY dates) )-1 as Cumu_DailyReturn from TestTbl
My problem is that I am not sure how to use the MTD value, when present, in the cumulative product.
I am using Microsoft SQL Server 2012. I have made a small data example below:
CREATE TABLE TestTbl (
id integer PRIMARY KEY,
dates date,
DailyReturnEstimate float,
MTD float
);
INSERT INTO TestTbl
(id, Dates, DailyReturnEstimate, MTD) VALUES
(1, '2020-01-01', -0.01, NULL ),
(2, '2020-01-02', 0.005 , NULL ),
(3, '2020-01-03', 0.012 , NULL ),
(4, '2020-01-04', -0.007 , NULL ),
(5, '2020-01-05', 0.021 , 0.016 ),
(6, '2020-01-06', 0.001 , NULL );
This is a bit tricky, but the idea is to set up separate groups based on where mtd is already defined. Then do the calculation only within those groups:
select t.*,
exp(sum(log(1+coalesce(mtd, DailyReturnEstimate))) OVER (partition by grp ORDER BY dates) )-1 as Cumu_DailyReturn
from (select t.*,
count(mtd) over (order by dates) as grp
from testtbl t
) t;
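For the sample data, count(mtd) over (order by dates) is 0 for the first four rows and becomes 1 on the 2020-01-05 row (the first non-null MTD), so that row opens a new group whose running product restarts from the published MTD value.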
Here is a db<>fiddle.

PostgreSQL difference between values with different time stamps in same table

I'm relatively new to working with PostgreSQL and I could use some help with this.
Suppose I have a table in which forecast values (let's say temperature) are stored, indicated by a dump_date_time. This dump_date_time is the date_time at which the values were stored in the table. The forecast temperatures are also indicated by the date_time to which the forecast applies. Let's say a forecast is published every 6 hours.
Example:
At 06:00 today the temperature for tomorrow at 16:00 is published and stored in the table. Then at 12:00 today the temperature for tomorrow at 16:00 is published and also stored in the table. I now have two forecasts for the same date_time (16:00 tomorrow) which are published at two different times (06:00 and 12:00 today), indicated by the dump_date_time.
All these values are stored in the same table, with three columns: dump_date_time, date_time and value. My goal is to SELECT from this table the difference between the temperatures of the two forecasts. How do I do this?
One option uses a join:
select date_time, t1.value - t2.value as value_diff
from mytable t1
inner join mytable t2 using (date_time)
where t1.dump_date_time = '2020-01-01 06:00:00'::timestamp
and t2.dump_date_time = '2020-01-01 12:00:00'::timestamp
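If you would rather not hard-code the two dump times, one option (a sketch, assuming the same three-column table) is to compare each forecast with the previous forecast for the same date_time using lag():
select
  date_time,
  dump_date_time,
  -- difference to the previous forecast published for the same target date_time
  value - lag(value) over (partition by date_time order by dump_date_time) as value_diff
from mytable;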
Something like:
create table forecast(dump_date_time timestamptz, date_time timestamptz, value numeric);
insert into forecast values ('09/24/2020 06:00', '09/25/2020 16:00', 50), ('09/24/2020 12:00', '09/25/2020 16:00', 52);
select max(value) - min(value) from forecast where date_time = '09/25/2020 16:00';
?column?
----------
2
--Specifying dump_date_time range
select
max(value) - min(value)
from
forecast
where
date_time = '09/25/2020 16:00'
and
dump_date_time <@
tstzrange(current_date + interval '6 hours',
current_date + interval '12 hours', '[]');
?column?
----------
2
This is a very simple case. If you need something else you will need to provide more information.
UPDATE
Added an example that uses a timestamptz range to select dump_date_time within a given range.

list of all users who watched at least a movie every week in this month

create table active_users(
user_id numeric,
movie_streamed date
)
insert into active_users values (1,'2020-01-2'::date);
insert into active_users values (1,'2020-01-9'::date);
insert into active_users values (1,'2020-01-16'::date);
insert into active_users values (1,'2020-01-23'::date);
insert into active_users values (1,'2020-01-30'::date);
insert into active_users values (2,'2020-01-14'::date);
insert into active_users values (2,'2020-01-16'::date);
Hi all,
I am looking for a query that returns the users who watched at least one movie every week of this month (using the test data above). Every record contains the user_id and the date on which that user watched a movie. I want a generic answer, not one that assumes every month has 4 weeks, because some months span 5 weeks.
You can use generate_series(1, 5), counting from 1 up to 5, since a month can span up to 5 different weeks, even if some of them are incomplete, as you already mentioned.
The trick is to compare against the distinct count of week start dates that fall within the current month:
SELECT u.user_id
FROM active_users u
JOIN generate_series( 1, 5 ) g
ON date_trunc('week', movie_streamed)
= date_trunc('week', current_date) + interval '7' day * (g-1)
GROUP BY u.user_id
HAVING COUNT(DISTINCT date_trunc('week', movie_streamed)) =
(
SELECT COUNT(*)
FROM generate_series( 1, 5 ) g
WHERE to_char(current_date,'yyyymm')
= to_char(date_trunc('week', current_date)
+ interval '7' day * (g-1),'yyyymm')
);
Demo
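Note that the query is anchored to current_date, so it checks the weeks of the month in which it is run; to try it against the January 2020 sample rows, replace current_date with a fixed date such as date '2020-01-15'.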