Adding set lists of future dates to rows in a SQL query - sql

So I am doing a cohort analysis for customers, where a cohort is a group of people who started using the product in the same month. I then keep track of each cohort's total use for every subsequent month up till present time.
For example, the first "cohort month" is January 2012, then I have "use months" January 12, Feb 12, March 12, ..., March 17(current month). One column is "cohort month", and another is "use month". This process repeats for every subsequent cohort month. The table looks like:
Jan 12 | Jan 12
Jan 12 | Feb 12
...
Jan 12 | Mar 17
Feb 12 | Feb 12
Feb 12 | Mar 12
...
Feb 12 | Mar 17
...
Feb 17 | Feb 17
Feb 17 | Mar 17
Mar 17 | Mar 17
The problem arises because I want to do forecasting for one year out for both existing and future cohorts.
That means for the Jan 12 cohort, I want to do prediction for April 17 to Mar 18.
I also want to do predictions for the April 17 cohort (which doesn't exist yet) from April 17 to Mar 18. And so on till predictions for the Mar 18 cohort in Mar 18.
I can handle the predictions, don't worry about that.
My issue is that I cannot figure out how to add this list of (April 17 ... Mar 18) to the "use month" column before every cohort switches.
I also need to add cohorts April 17 to Mar 18, and have the applicable parts of this list of (April 17 ... Mar 18) for each of these future cohorts.
So I want the table to look like:
Jan 12 | Jan 12
Jan 12 | Feb 12
...
Jan 12 | Mar 17
Jan 12 | Apr 17
..
Jan 12 | Mar 18
Feb 12 | Feb 12
Feb 12 | Mar 12
...
Feb 12 | Mar 17
Feb 12 | Apr 17
...
Feb 12 | Mar 18
...
...
Feb 17 | Feb 17
Feb 17 | Mar 17
...
Feb 17 | Mar 18
Mar 17 | Mar 17
...
Mar 17 | Mar 18
I know the first solution that comes to mind is to create a list of all dates from Jan 12 to Mar 18, cross join it to itself, and then left outer join it to the current table I have (where cohort / use months range from Jan 12 to Mar 17). However, this is not scalable.
Is there a way I can just iteratively add in this list of the months of the next year?
I am using HP Vertica, could use Presto or Hive if absolutely necessary

I think you should use the query here below to create a temporary table out of nothing, and join it with the rest of your query. You can't do anything in a procedural manner in SQL, I'm afraid. You won't be able to get away without a CROSS JOIN. But here, you limit the CROSS JOIN to the generation of the first-of-month pairs that you need.
Here goes:
WITH
-- create a list of integers from 0 to 100 using the TIMESERIES clause
i(i) AS (
SELECT dt::DATE - '2000-01-01'::DATE
FROM (
SELECT '2000-01-01'::DATE + 0
UNION ALL SELECT '2000-01-01'::DATE + 100
) d(d)
TIMESERIES dt AS '1 day' OVER(ORDER BY d::TIMESTAMP)
)
,
-- limits are Jan-2012 to the first of the current month plus one year
month_limits(month_limit) AS (
SELECT '2012-01-01'::DATE
UNION ALL SELECT ADD_MONTHS(TRUNC(CURRENT_DATE,'MONTH'),12)
)
-- create the list of possible months as a CROSS JOIN of the i table
-- containing the integers and the month_limits table, using ADD_MONTHS()
-- and the smallest and greatest month of the month limits
,month_list AS (
SELECT
ADD_MONTHS(MIN(month_limit),i) AS month_first
FROM month_limits CROSS JOIN i
GROUP BY i
HAVING ADD_MONTHS(MIN(month_limit),i) <= (
SELECT MAX(month_limit) FROM month_limits
)
)
-- finally, CROSS JOIN the obtained month list with itself with the
-- filters needed.
SELECT
cohort.month_first AS cohort_month
, use.month_first AS use_month
FROM month_list AS cohort
CROSS JOIN month_list AS use
WHERE use.month_first >= cohort.month_first
ORDER BY 1,2
;
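For readers who want to check the shape of the result outside the database, here is a minimal Python sketch of the same idea: build the list of month starts once, then keep only the pairs where the use month is not before the cohort month. The function names month_range and cohort_use_pairs are made up for this illustration, and the end month is fixed to Mar 18 from the question rather than derived from the current date as the query does.

```python
from datetime import date

def month_range(first, last):
    """Return the first day of every month from first to last, inclusive."""
    months = []
    y, m = first.year, first.month
    while (y, m) <= (last.year, last.month):
        months.append(date(y, m, 1))
        # Roll over to January of the next year after December.
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    return months

def cohort_use_pairs(first, last):
    """Pair every cohort month with each use month on or after it."""
    months = month_range(first, last)
    return [(cohort, use) for cohort in months for use in months if use >= cohort]

# Assumption: limits taken from the question (Jan 12 up to Mar 18).
pairs = cohort_use_pairs(date(2012, 1, 1), date(2018, 3, 1))
print(len(pairs), pairs[0], pairs[-1])
```

With 75 months in the range, the filter leaves 75 * 76 / 2 pairs, so the self cross join stays small even though it is still a cross join.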

Related

SQL select multiple columns according to current month

I want to select the columns for the previous months from a table. For example, if the current month is April, I want to select the columns Jan, Feb and Mar. I tried using CASE, but the problem is that a single CASE expression can only return one column, and once one condition matches, the subsequent WHEN branches are ignored.
EDIT
Sample Data
var_d | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec
A | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1
O/P for April
var_d | Jan | Feb | Mar
A | 1 | 1 | 1
Likewise in May we need April's data too

Calculate running sum of previous 3 months from monthly aggregated data

I have a dataset that I have aggregated at monthly level. The next part needs me to take, for every block of 3 months, the sum of the data at monthly level.
So essentially my input data (after aggregated to monthly level) looks like:
month | year | status | count_id
08 | 2021 | stat_1 | 1
09 | 2021 | stat_1 | 3
10 | 2021 | stat_1 | 5
11 | 2021 | stat_1 | 10
12 | 2021 | stat_1 | 10
01 | 2022 | stat_1 | 5
02 | 2022 | stat_1 | 20
and then my output data to look like:
month | year | status | count_id | 3m_sum
08 | 2021 | stat_1 | 1 | 1
09 | 2021 | stat_1 | 3 | 4
10 | 2021 | stat_1 | 5 | 8
11 | 2021 | stat_1 | 10 | 18
12 | 2021 | stat_1 | 10 | 25
01 | 2022 | stat_1 | 5 | 25
02 | 2022 | stat_1 | 20 | 35
i.e. 3m_sum for Feb = Feb + Jan + Dec. I tried to do this using a self join and wrote a query along the lines of
WITH CTE AS(
SELECT date_part('month',date_col) as month
,date_part('year',date_col) as year
,status
,count(distinct id) as count_id
FROM (date_col, status, transaction_id) as a
)
SELECT a.month, a.year, a.status, sum(b.count_id) as 3m_sum
from cte as a
left join cte as b on a.status = b.status
and b.month >= a.month - 2 and b.month <= a.month
group by 1,2,3
This query NEARLY works. Where it falls apart is in Jan and Feb. My data is from August 2021 to Apr 2022. That means the value for Jan should be Nov + Dec + Jan. Similarly, for Feb it should be Dec + Jan + Feb.
Because I am joining on the month number alone, Aug to Nov all compare as greater than Jan/Feb, so the join picks up the wrong rows and the sum is incorrect.
How can I adjust this bit to give the correct sum?
I did think of using a LAG function, but (even though I'm 99% sure a month won't ever be missed) I can't guarantee we will never have a month with 0 values, in which case LAG would be summing the wrong rows.
I also tried doing the same join at individual date level (without aggregating in my nested query), but this gave vastly different numbers: I want the sum of the monthly aggregates, and I think summing individual rows duplicated a lot of what the COUNT DISTINCT was removing.
You can use a SUM with a window frame of 2 PRECEDING. To ensure you don't miss rows, use a calendar table and left-join all the results to it.
SELECT *,
SUM(a.count_id) OVER (ORDER BY c.year, c.month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
FROM Calendar c
LEFT JOIN a ON a.year = c.year AND a.month = c.month
WHERE c.year >= 2021 AND c.year <= 2022;
db<>fiddle
You could also use LAG but you would need it twice.
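To see why the calendar join matters, here is a rough Python sketch of the same computation: every month in the range gets a slot, and the sum covers the current slot plus the two before it. The function name rolling_3m and the (year, month) tuple keys are assumptions for this illustration, and a missing month is treated as 0 here, whereas the SQL above leaves it NULL and lets SUM skip it.

```python
def rolling_3m(counts, start, end):
    """counts: {(year, month): count}; start and end: (year, month), inclusive."""
    # Build the calendar so that months without data still occupy a slot.
    calendar = []
    y, m = start
    while (y, m) <= end:
        calendar.append((y, m))
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    values = [counts.get(ym, 0) for ym in calendar]
    # Equivalent of ROWS BETWEEN 2 PRECEDING AND CURRENT ROW.
    return {ym: sum(values[max(0, i - 2): i + 1]) for i, ym in enumerate(calendar)}

counts = {(2021, 8): 1, (2021, 9): 3, (2021, 10): 5, (2021, 11): 10,
          (2021, 12): 10, (2022, 1): 5, (2022, 2): 20}
sums = rolling_3m(counts, (2021, 8), (2022, 2))
print(sums)
```

Because the calendar is ordered by (year, month), the window crosses the year boundary naturally, which is what the month-only join in the question could not do.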
It should be @Charlieface's answer; the only difference is that I get one result that differs from your expected result table (9 rather than 8 for October 2021, since 1 + 3 + 5 = 9):
WITH
-- your input - and I avoid keywords like "MONTH" or "YEAR"
-- and also identifiers starting with digits are forbidden -
indata(mm,yy,status,count_id,sum_3m) AS (
SELECT 08,2021,'stat_1',1,1
UNION ALL SELECT 09,2021,'stat_1',3,4
UNION ALL SELECT 10,2021,'stat_1',5,8
UNION ALL SELECT 11,2021,'stat_1',10,18
UNION ALL SELECT 12,2021,'stat_1',10,25
UNION ALL SELECT 01,2022,'stat_1',5,25
UNION ALL SELECT 02,2022,'stat_1',20,35
)
SELECT
*
, SUM(count_id) OVER(
ORDER BY yy,mm
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) AS sum_3m_calc
FROM indata;
-- out mm | yy | status | count_id | sum_3m | sum_3m_calc
-- out ----+------+--------+----------+--------+-------------
-- out 8 | 2021 | stat_1 | 1 | 1 | 1
-- out 9 | 2021 | stat_1 | 3 | 4 | 4
-- out 10 | 2021 | stat_1 | 5 | 8 | 9
-- out 11 | 2021 | stat_1 | 10 | 18 | 18
-- out 12 | 2021 | stat_1 | 10 | 25 | 25
-- out 1 | 2022 | stat_1 | 5 | 25 | 25
-- out 2 | 2022 | stat_1 | 20 | 35 | 35

Select row with most recent date per location and increment recent date by 1 for each row by location using MariaDB

I have a locations table with a Date column. I have to find the most recent date for each locationID group; e.g. locationID 1 has the most recent date '31 May 2022'. After finding that date, I have to add 14 days to it and store the result in the NewDate column, then add one more day for each further row in the same locationID group.
My table is:
id locationID Date NewDate
1 1 31 May 2022
2 1 16 May 2022
3 1 28 Apr 2021
4 2 29 Mar 2022
5 2 22 Feb 2022
6 3 14 Jun 2022
7 3 27 Oct 2021
8 4 01 Feb 2022
9 4 04 May 2022
10 4 14 Jun 2021
11 5 01 Jun 2022
12 5 29 May 2022
13 5 20 Sep 2022
14 5 11 Aug 2022
15 5 03 Aug 2022
Answer should be as below:
For e.g. for locationID = 1
id locationID Date NewDate
1 1 31 May 2022 14 Jun 2022 // Recent Date + 14 Days - 31 May + 14 Days
2 1 16 May 2022 15 Jun 2022 // Recent Date + 15 Days - 31 May + 15 Days
3 1 28 Apr 2021 16 Jun 2022 // Recent Date + 16 Days - 31 May + 16 Days
I have come across a few similar posts and found the most recent date like this:
SELECT L.*
FROM Locations L
INNER JOIN
(SELECT locationID, MAX(Date) AS MAXdate
FROM Locations
GROUP BY locationID) groupedL
ON L.locationID = groupedL.locationID
AND L.Date = groupedL.MAXdate
Using the code above I am able to find the most recent date per location, but how do I add and increment the required days and store the result in the NewDate column? I am new to MariaDB, so please suggest links to similar posts, reference documents or blogs. Should I write a function to perform this logic and call it to store the required dates in the NewDate column? I am not sure, please advise. Thank you.
RESULT SHOULD LOOK LIKE BELOW:
id locationID Date NewDate
1 1 31 May 2022 14 Jun 2022 // Recent Date for locationid 1 + 14 Days - 31 May + 14 Days
2 1 16 May 2022 15 Jun 2022 // Recent Date for locationid 1 + 15 Days - 31 May + 15 Days
3 1 28 Apr 2021 16 Jun 2022 // Recent Date for locationid 1 + 16 Days - 31 May + 16 Days
4 2 29 Mar 2022 12 APR 2022 // Recent Date for locationid 2 + 14 Days
5 2 22 Feb 2022 13 APR 2022 // Recent Date for locationid 2 + 15 Days
6 3 14 Jun 2022 28 JUN 2022 // Recent Date for locationid 3 + 14 Days
7 3 27 Oct 2021 29 JUN 2022 // Recent Date for locationid 3 + 15 Days
8 4 01 Feb 2022 18 MAY 2022 // Recent Date for locationid 4 + 14 Days
9 4 04 May 2022 19 MAY 2022 // Recent Date for locationid 4 + 15 Days
10 4 14 Jun 2021 20 MAY 2022 // Recent Date for locationid 4 + 16 Days
11 5 01 Jun 2022 04 OCT 2022 // Recent Date for locationid 5 + 14 Days
12 5 29 May 2022 05 OCT 2022 // Recent Date for locationid 5 + 15 Days
13 5 20 Sep 2022 06 OCT 2022 // Recent Date for locationid 5 + 16 Days
14 5 11 Aug 2022 07 OCT 2022 // Recent Date for locationid 5 + 17 Days
15 5 03 Aug 2022 08 OCT 2022 // Recent Date for locationid 5 + 18 Days
You can use a cte:
with cte as (
select l1.*, l2.m, (select sum(l4.id < l1.id and l4.locationid = l1.locationid) from locations l4) inc from locations l1
join (select l3.locationid, max(l3.dt) m from locations l3 group by l3.locationid) l2 on l1.locationid = l2.locationid
)
select c.id, c.locationid, c.dt, c.m + interval 14 + c.inc day from cte c
You could use analytic window functions and update the original table by joining to a sub-query (works for MariaDB):
update t
join (
select Id,
Date_Add(First_Value(date) over(partition by locationId order by date desc),
interval (13 + row_number() over(partition by locationId order by date desc)) day
) NewDate
from t
)nd on t.id = nd.id
set t.Newdate = nd.NewDate;
See DB<>Fiddle example
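The window-function logic can be checked outside the database with a small Python sketch. The function name assign_new_dates and the tuple layout are assumptions made for this illustration; rows are ranked by date descending within each location, as in the ROW_NUMBER() call above.

```python
from collections import defaultdict
from datetime import date, timedelta

def assign_new_dates(rows):
    """rows: list of (id, location_id, date). Returns {id: new_date}."""
    by_location = defaultdict(list)
    for row_id, location_id, d in rows:
        by_location[location_id].append((d, row_id))
    new_dates = {}
    for entries in by_location.values():
        entries.sort(reverse=True)      # most recent date first
        most_recent = entries[0][0]     # FIRST_VALUE(date) over the partition
        for rank, (_, row_id) in enumerate(entries, start=1):
            # 13 + row number: the first row gets +14 days, the next +15, ...
            new_dates[row_id] = most_recent + timedelta(days=13 + rank)
    return new_dates

rows = [(1, 1, date(2022, 5, 31)), (2, 1, date(2022, 5, 16)), (3, 1, date(2021, 4, 28))]
new_dates = assign_new_dates(rows)
print(new_dates)
```

Note that ranking by date descending can differ from the id order used in the expected result table: for locationID 4, id 9 (04 May 2022) is the most recent row and would receive the +14 value rather than id 8.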

big query SQL - repeatedly/recursively change a row's column in the select statement based on the values in previous row

I have table like below
customer | date | end date
1 | jan 1 2021 | jan 30 2021
1 | jan 2 2021 | jan 31 2021
1 | jan 3 2021 | feb 1 2021
1 | jan 27 2021 | feb 26 2021
1 | feb 3 2021 | mar 5 2021
2 | jan 2 2021 | jan 31 2021
2 | jan 10 2021 | feb 9 2021
2 | feb 10 2021 | mar 12 2021
Now, I want to update the value in the 'end date' column of a row based on the previous row's 'end date' and the current row's 'date'.
If the date in the current row < end date of the previous row, I want to set the end date of the current row = (end date of the previous row).
I want to do this repeatedly for all the rows (grouped by customer).
I want the output as below. I just need it in a SELECT statement rather than updating/inserting into a table.
Note - below, once the second row's end date is updated with the value from the first row (jan 30 2021), the third row's date (jan 3 2021) is evaluated against the updated value in the second row (jan 30 2021), not against the second row's value before the update (jan 31 2021).
customer | date | end date
1 | jan 1 2021 | jan 30 2021
1 | jan 2 2021 | jan 30 2021 [updated because current date < previous end date]
1 | jan 3 2021 | jan 30 2021 [updated because current date < previous end date]
1 | jan 27 2021 | jan 30 2021 [updated because current date < previous end date]
1 | feb 3 2021 | mar 5 2021
2 | jan 2 2021 | jan 31 2021
2 | jan 10 2021 | jan 31 2021 [updated because current date < previous end date]
2 | feb 10 2021 | mar 12 2021
I think I would go this way. I read the data source twice so that the operation can be performed in a SELECT, without updating or inserting into the table.
input table:
1|2021-01-01|2021-01-30
1|2021-01-02|2021-01-31
1|2021-01-03|2021-02-01
1|2021-01-27|2021-02-26
1|2021-02-03|2021-03-05
2|2021-01-02|2021-01-31
2|2021-01-10|2021-02-09
2|2021-02-10|2021-03-12
code:
with num_raw_data as (
SELECT row_number() over(partition by customer)as num, customer,date_init,date_end
FROM `project-id.data-set.table`
), analyzed_data as(
select r.num,
r.customer,
r.date_init,
r.date_end,
case when date_init<(select date_end from num_raw_data where num=r.num-1 and customer=r.customer and EXTRACT(month FROM r.date_init)=EXTRACT(month FROM date_init)) then 1 else 0 end validation
from num_raw_data r
)
select customer,
date_init,
case when validation !=0 then (select MIN(date_end) from analyzed_data where validation=0 and customer=ad.customer and date_init<ad.date_end) else date_end end as date_end
from analyzed_data ad
order by customer,num
output:
1|2021-01-01|2021-01-30
1|2021-01-02|2021-01-30
1|2021-01-03|2021-01-30
1|2021-01-27|2021-01-30
1|2021-02-03|2021-03-05
2|2021-01-02|2021-01-31
2|2021-01-10|2021-01-31
2|2021-02-10|2021-03-12
I use the validation column from analyzed_data to know where I should be looking for changes. I'm not sure if it's fast (probably not), but it works for the scenario in your question.
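As a reference for what the result should be, the rule from the question can be written as a plain sequential pass. This is a Python sketch, not the query above; the function name propagate_end_dates is an assumption for this illustration, and the rows are assumed to be already ordered by customer and date.

```python
from datetime import date

def propagate_end_dates(rows):
    """rows: list of (customer, start, end), ordered by customer and start."""
    result = []
    prev_customer, prev_end = None, None
    for customer, start, end in rows:
        # Compare against the previous row's already-updated end date.
        if customer == prev_customer and start < prev_end:
            end = prev_end
        result.append((customer, start, end))
        prev_customer, prev_end = customer, end
    return result

rows = [
    (1, date(2021, 1, 1), date(2021, 1, 30)),
    (1, date(2021, 1, 2), date(2021, 1, 31)),
    (1, date(2021, 1, 3), date(2021, 2, 1)),
    (1, date(2021, 1, 27), date(2021, 2, 26)),
    (1, date(2021, 2, 3), date(2021, 3, 5)),
    (2, date(2021, 1, 2), date(2021, 1, 31)),
    (2, date(2021, 1, 10), date(2021, 2, 9)),
    (2, date(2021, 2, 10), date(2021, 3, 12)),
]
for row in propagate_end_dates(rows):
    print(row)
```

Carrying prev_end forward after the update is what makes the third row compare against jan 30 rather than jan 31, which is the behaviour the question's note asks for.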

Subtraction of inventory from Demand in BigQuery everday and adding new inventory

Here's what my data looks like:
date | sku | inventory_added | demand
22nd Nov 2021 | XYZ | 70 | 18
23rd Nov 2021 | XYZ | 0 | 18
24th Nov 2021 | XYZ | 0 | 50
25th Nov 2021 | XYZ | 0 | 15
26th Nov 2021 | XYZ | 80 | 30
27th Nov 2021 | XYZ | 0 | 20
28th Nov 2021 | XYZ | 0 | 15
29th Nov 2021 | XYZ | 0 | 20
30th Nov 2021 | XYZ | 0 | 10
1st Dec 2021 | XYZ | 100 | 40
2nd Dec 2021 | XYZ | 0 | 10
I want to create a new column named solution using BigQuery SQL. In the 1st row, i.e. 22nd Nov 2021, the formula should be inventory_added - demand.
This gives 52 as the 1st row's value for solution.
What I am not able to do starts from the 2nd row:
The next row will be 52 (remaining inventory from previous day) + 0 (inventory_added on 23rd Nov 2021) - 18 (demand on 23rd Nov 2021). This is equal to 34.
Similarly, going to the next row, i.e. 24th Nov 2021:
the value in solution will be 34 + 0 - 50 = -16. Since it is negative, it should be set to 0.
I tried this - MAX(solutions, 0).
The result should look like this:
date | sku | inventory_added | demand | solution
22nd Nov 2021 | XYZ | 70 | 18 | 52
23rd Nov 2021 | XYZ | 0 | 18 | 34
24th Nov 2021 | XYZ | 0 | 50 | 0
25th Nov 2021 | XYZ | 0 | 15 | 0
26th Nov 2021 | XYZ | 80 | 30 | 50
27th Nov 2021 | XYZ | 0 | 20 | 30
28th Nov 2021 | XYZ | 0 | 15 | 15
29th Nov 2021 | XYZ | 0 | 20 | 0
30th Nov 2021 | XYZ | 0 | 10 | 0
1st Dec 2021 | XYZ | 100 | 40 | 60
2nd Dec 2021 | XYZ | 0 | 10 | 50
I am not sure if this can be accomplished by BigQuery, but all suggestions are welcome.
Thanks!
Without the condition "if it is negative, it should be put as 0", you may use the window (in BigQuery terms, analytic) variant of the SUM() function:
SELECT *,
SUM(inventory_added - demand) OVER (PARTITION BY sku ORDER BY date) AS solution
FROM source_table
With this condition the computation becomes iterative, and you must use a recursive CTE (if available in BigQuery) or an iterative stored procedure.
I see that recursive CTE is not available in BigQuery ... Can you provide a pseudo code may as a starting point for stored procedures? – Shantanu Jain
CREATE PROCEDURE procname()
BEGIN
CREATE temptable;
OPEN CURSOR FOR SELECT * FROM datatable ORDER BY date;
SET @solution = 0;
FETCH CURSOR INTO @date, @sku, @inventory_added, @demand;
LOOP
    SET @solution = GREATEST(@solution + @inventory_added - @demand, 0);
    INSERT INTO temptable VALUES (@date, @sku, @inventory_added, @demand, @solution);
    FETCH CURSOR INTO @date, @sku, @inventory_added, @demand;
UNTIL NO_ROWS_IN_CURSOR END LOOP;
SELECT * FROM temptable;
DROP temptable;
END
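The loop in the pseudocode above boils down to a running balance that is floored at zero after every row. A minimal Python sketch of just that step, assuming the rows are already ordered by sku and date (the function name running_inventory and the tuple layout are made up for this illustration; the reset per sku is an addition that the single-sku pseudocode does not need):

```python
def running_inventory(rows):
    """rows: list of (sku, inventory_added, demand), ordered by sku and date."""
    balances = []
    balance, prev_sku = 0, None
    for sku, added, demand in rows:
        if sku != prev_sku:
            balance = 0  # start over for each sku
        # Carry the previous balance forward, never letting it go negative.
        balance = max(balance + added - demand, 0)
        balances.append(balance)
        prev_sku = sku
    return balances

rows = [("XYZ", 70, 18), ("XYZ", 0, 18), ("XYZ", 0, 50), ("XYZ", 0, 15),
        ("XYZ", 80, 30), ("XYZ", 0, 20), ("XYZ", 0, 15), ("XYZ", 0, 20),
        ("XYZ", 0, 10), ("XYZ", 100, 40), ("XYZ", 0, 10)]
print(running_inventory(rows))
```

The floor is what prevents a plain windowed SUM from working: once a day is clamped to 0, every later value depends on the clamped balance rather than on the raw cumulative total.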
As an option, consider using the recently introduced FOR...IN loop:
declare result int64;
declare prev_sku string;
create temp table results as (select *, 0 as solution from your_table where false);
set (result, prev_sku) = (0, '');
for record in (select *, parse_date('%d %B %Y', regexp_replace(date, r'(\d*)(\w*)( \w{3} \d{4})', r'\1 \3')) dt from your_table order by sku, dt) do
if record.sku != prev_sku then set result = 0; end if;
set result = result + record.inventory_added - record.demand;
if result < 0 then set result = 0; end if;
insert into results values(record.date, record.sku, record.inventory_added, record.demand, result);
set prev_sku = record.sku;
end for;
select * from results
order by sku, parse_date('%d %B %Y', regexp_replace(date, r'(\d*)(\w*)( \w{3} \d{4})', r'\1 \3'));
Applied to the sample data in your question, this delivers the expected result.
Note: this is obviously going to be extremely slow (like any cursor-based solution), so while it is fine for learning, I don't think it is appropriate for real production use.