Using a window function in BigQuery to create a running sum of active quarters

I am working to enhance a dataset by creating a column that would allow me to track how many active quarters a given company has had for a given row. A company is "active" if they recognize revenue within that quarter.
Each row of my dataset represents one month's performance for a single company.
I have been able to use a WINDOW function to create a running sum for active months successfully:
COUNTIF(Revenue IS NOT NULL) OVER
(partition by Company_Name ORDER BY month_end ASC ROWS BETWEEN unbounded preceding and current row) AS cumulative_active_months
I am now struggling to convert my logic to count the quarters rather than the months.
This is a rough idea of what my table currently looks like.
Row   Month   Month_end    Fiscal_Quarter   Company_Name   Revenue   Active month count
----- ------- ------------ ---------------- -------------- --------- --------------------
1     Jul     2016-07-31   FY17-Q2          Foo            x,xxx     1
2     Jul     2016-07-31   FY17-Q2          Bar            xxx,xxx   1
3     Aug     2016-08-31   FY17-Q2          Foo            xx,xxx    2
4     Aug     2016-08-31   FY17-Q2          Bar            xxx       2
5     Sep     2016-09-30   FY17-Q2          Foo            xx        3
6     Sep     2016-09-30   FY17-Q2          Bar            x,xxx     3
7     Oct     2016-10-31   FY17-Q3          Foo            xx        4
8     Oct     2016-10-31   FY17-Q3          Bar            Null      3
This is what I'd ideally like my table to look like.
Row   Month   Month_end    Fiscal_Quarter   Company_Name   Revenue   Active month count   Active quarter count
----- ------- ------------ ---------------- -------------- --------- -------------------- ----------------------
1     Jul     2016-07-31   FY17-Q2          Foo            x,xxx     1                    1
2     Jul     2016-07-31   FY17-Q2          Bar            xxx,xxx   1                    1
3     Aug     2016-08-31   FY17-Q2          Foo            xx,xxx    2                    1
4     Aug     2016-08-31   FY17-Q2          Bar            xxx       2                    1
5     Sep     2016-09-30   FY17-Q2          Foo            xx        3                    1
6     Sep     2016-09-30   FY17-Q2          Bar            x,xxx     3                    1
7     Oct     2016-10-31   FY17-Q3          Foo            xx        4                    2
8     Oct     2016-10-31   FY17-Q3          Bar            Null      3                    1

If this is counting active months:
COUNTIF(Revenue IS NOT NULL) OVER (PARTITION BY Company_Name ORDER BY month_end ASC) AS cumulative_active_months
Then the corresponding count for quarters would use COUNT(DISTINCT):
COUNT(DISTINCT CASE WHEN Revenue IS NOT NULL THEN Fiscal_Quarter END) OVER (PARTITION BY Company_Name ORDER BY month_end ASC) AS cumulative_active_quarters
Unfortunately, BigQuery does not support DISTINCT in a window function that also has an ORDER BY, so you can use a subquery and a cumulative sum instead:
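select t.* except (seqnum),
       countif(seqnum = 1) over (partition by company_name order by month_end) as cnt
from (select t.*,
             -- Number the revenue months within each company and quarter. Putting
             -- (revenue is null) in the partition keeps leading null months from
             -- pushing the first revenue month past seqnum = 1.
             (case when revenue is not null
                   then row_number() over (partition by Company_Name, Fiscal_Quarter, revenue is null
                                           order by month_end)
                   else 0
              end) as seqnum
      from t
     ) t;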
Note: This does not count the current quarter until there is revenue, which I think makes sense.
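The subquery-plus-cumulative-sum idea can be tried outside BigQuery. Below is a minimal sketch in Python using the standard sqlite3 module (assuming a bundled SQLite of 3.25 or later, which is needed for window functions). SQLite has no COUNTIF, so SUM(CASE ...) stands in for it. The revenue figures are placeholders, and the company "Baz" is a hypothetical addition that is not in the question: its quarter starts with a null month, which is why this variant also partitions by whether revenue is null.

```python
import sqlite3

# (month_end, fiscal_quarter, company_name, revenue); revenue values are placeholders.
rows = [
    ("2016-07-31", "FY17-Q2", "Foo", 1000),
    ("2016-07-31", "FY17-Q2", "Bar", 100000),
    ("2016-08-31", "FY17-Q2", "Foo", 10000),
    ("2016-08-31", "FY17-Q2", "Bar", 100),
    ("2016-09-30", "FY17-Q2", "Foo", 10),
    ("2016-09-30", "FY17-Q2", "Bar", 1000),
    ("2016-10-31", "FY17-Q3", "Foo", 10),
    ("2016-10-31", "FY17-Q3", "Bar", None),
    # Hypothetical company (not in the question): first month of the quarter is null.
    ("2016-07-31", "FY17-Q2", "Baz", None),
    ("2016-08-31", "FY17-Q2", "Baz", 50),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (month_end TEXT, fiscal_quarter TEXT, company_name TEXT, revenue REAL)"
)
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", rows)

query = """
SELECT company_name, month_end,
       -- Running count of rows flagged as the first revenue month of a quarter.
       SUM(CASE WHEN seqnum = 1 THEN 1 ELSE 0 END) OVER (
           PARTITION BY company_name ORDER BY month_end
       ) AS cumulative_active_quarters
FROM (
    SELECT t.*,
           CASE WHEN revenue IS NOT NULL
                THEN ROW_NUMBER() OVER (
                         PARTITION BY company_name, fiscal_quarter, revenue IS NULL
                         ORDER BY month_end)
                ELSE 0
           END AS seqnum
    FROM t
) AS flagged
ORDER BY company_name, month_end
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Foo reaches 2 in October, Bar stays at 1 because its October revenue is null, and Baz goes from 0 to 1 once revenue appears in August.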

Related

SQL: Find number of active "events" each month

Background
I have an SQL table that contains all events, with each event containing a unique identifier.
As you can see, for some IDs the "event" stretches across multiple months. What I'm trying to find is the number of "active events" per month.
For example, event ID 342 is active in both Jan and Feb, so it should count towards the final count of both months.
Example dataset
ID    Start Date    End Date
---   -----------   -----------
342   01 Jan 2022   12 Feb 2022
231   12 Feb 2022   26 Feb 2022
123   20 Jan 2022   10 Apr 2022
Desired output:
Month   Active events
-----   -------------
Jan     2
Feb     3
Mar     1
Apr     1
By the way, I'm using Alibaba's ODPS SQL, not MySQL or Postgres, so I'd appreciate it if the solution could be SQL-system agnostic. Thanks!
Here is an example in MySQL 8, using a recursive CTE to construct the list of months. It would be more efficient to use a calendar table.
If you are not using MySQL, you will need to modify the syntax of the query.
create table dataset(
  ID int, Start_date Date, End_date Date);
insert into dataset values
  (342,'2022-01-01','2022-02-12'),
  (231,'2022-02-12','2022-02-26'),
  (123,'2022-01-20','2022-04-10');
/*
Desired output:
Month  Active events
Jan    2
Feb    3
Mar    1
Apr    1
*/
select
  min(month(Start_date)),
  max(month(End_date))
from dataset;
min(month(Start_date)) | max(month(End_date))
---------------------: | -------------------:
                     1 |                    4
-- Build the list of month numbers recursively, then count the events
-- whose date range covers each month. Comparing month numbers only
-- works while every date falls in the same calendar year.
with recursive m as
(select min(month(Start_date)) mon from dataset
 union all
 select mon + 1 from m
 where mon < (select max(month(End_date)) from dataset)
)
select
  mon "month",
  count(id) "Count"
from m
left join dataset
  on month(Start_date) <= mon
 and month(End_date) >= mon
group by mon
order by mon;
month | Count
----: | ----:
    1 |     2
    2 |     3
    3 |     1
    4 |     1
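Since the question asks for something portable, here is a hedged sketch of the same month-list idea that compares full dates rather than month numbers, so it should also hold when events cross a year boundary. It runs on SQLite through Python's sqlite3 module; the date functions (date(), strftime()) are SQLite-specific and would need translating for ODPS. The rows are the three events from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dataset (id INTEGER, start_date TEXT, end_date TEXT);
INSERT INTO dataset VALUES
  (342, '2022-01-01', '2022-02-12'),
  (231, '2022-02-12', '2022-02-26'),
  (123, '2022-01-20', '2022-04-10');
""")

query = """
WITH RECURSIVE months(month_start) AS (
    -- First day of the earliest month that has an event.
    SELECT date(MIN(start_date), 'start of month') FROM dataset
    UNION ALL
    -- Keep adding months until the month of the latest end date.
    SELECT date(month_start, '+1 month') FROM months
    WHERE month_start < (SELECT date(MAX(end_date), 'start of month') FROM dataset)
)
SELECT strftime('%Y-%m', m.month_start) AS month,
       COUNT(d.id) AS active_events
FROM months AS m
LEFT JOIN dataset AS d
  -- An event is active in a month if it starts before the next month
  -- begins and ends on or after the first day of this month.
  ON d.start_date < date(m.month_start, '+1 month')
 AND d.end_date >= m.month_start
GROUP BY m.month_start
ORDER BY m.month_start
"""
counts = conn.execute(query).fetchall()
print(counts)
# [('2022-01', 2), ('2022-02', 3), ('2022-03', 1), ('2022-04', 1)]
```

A permanent calendar table with one row per month could replace the recursive CTE; the join condition would stay the same.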

SQL: Select only users who are new in 2021

If we have a table as follows:
User_ID   Order_date   Order_ID
-------   ----------   --------
1         2020-02-02   23
2         2021-03-03   45
1         2021-02-02   13
3         2019-05-23   34
3         2021-01-31   56
How can I select only the users whose first order is in the year 2021 (in this case, only User 2)?
You can use aggregation:
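select user_id
from t
group by user_id
having min(order_date) >= '2021-01-01'
   and min(order_date) < '2022-01-01';
This checks that the earliest order date falls on or after January 1, 2021 and before January 1, 2022, that is, within 2021. The upper bound only matters once the table contains orders from later years.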
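To see the aggregation at work on the sample rows, here is a small sketch using Python's sqlite3 module. The table name t follows the answer; the upper bound on the date is included so that users whose first order falls after 2021 would not qualify.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (user_id INTEGER, order_date TEXT, order_id INTEGER);
INSERT INTO t VALUES
  (1, '2020-02-02', 23),
  (2, '2021-03-03', 45),
  (1, '2021-02-02', 13),
  (3, '2019-05-23', 34),
  (3, '2021-01-31', 56);
""")

# Keep only users whose earliest order date lies within 2021.
new_users = [
    row[0]
    for row in conn.execute("""
        SELECT user_id
        FROM t
        GROUP BY user_id
        HAVING MIN(order_date) >= '2021-01-01'
           AND MIN(order_date) < '2022-01-01'
        ORDER BY user_id
    """)
]
print(new_users)  # [2]
```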

How to get last quarterly and last half yearly average of balance for each month in hive?

I have a table with the columns cust_id, year_, month_, monthly_txn, monthly_bal. For each month, I need to calculate avg(monthly_txn) and variance(monthly_bal) over the previous three months and the previous six months. I have a query that returns the average and variance for the last three and six months, but only for the latest month, not for each month. I am not good with analytic functions in Hive.
SELECT cust_id, avg(monthly_txn) y, variance(monthly_bal) x
FROM (
  SELECT cust_id, monthly_txn, monthly_bal,
         -- newest month first, so r <= 3 keeps the latest three months
         row_number() over (partition by cust_id order by year_ desc, month_ desc) r
  FROM mytable) b
WHERE r <= 3
GROUP BY cust_id
But I want something like below.
input:
cust_id  year_  month_  monthly_txn  monthly_bal
1        2018   1       456          8979289
1        2018   2       675          4567
1        2018   3       645          4890
1        2017   1       342          44522
1        2017   2       378          9898900
1        2017   3       4536         234492358
1        2017   4       3535         789
1        2017   5       4561         345
1        2017   6       598          334
Expected output:
For txn, the quarterly and half-yearly averages would look like this; the same applies to the variance.
cust_id  year_  month_  monthly_txn  monthly_bal  q_avg_txn            h_avg_txn
1        2018   1       456          8979289      avg(456,598,4561)    avg(456,598,4561,3535,4536,378)
1        2018   2       675          4567         avg(675,456,598)     avg(675,456,3535,4561,598,4536)
1        2018   3       645          4890         avg(645,675,456)     avg(645,675,456,3535,4561,598)
1        2017   1       342          44522        avg(342)             avg(342)
1        2017   2       378          9898900      avg(378,342)         avg(378,342)
1        2017   3       4536         234492358    avg(4536,378,342)    avg(4536,378,342)
1        2017   4       3535         789          avg(3535,4536,378)   avg(3535,4536,378,342)
1        2017   5       4561         345          avg(4561,3535,4536)  avg(4561,3535,4536,342,378)
1        2017   6       598          334          avg(598,4561,3535)   avg(598,4561,3535,4536,342,378)
Use analytic functions with a ROWS ... PRECEDING window frame to get the quarterly and half-yearly values, and then use a subquery to get the results.
See also: What is ROWS UNBOUNDED PRECEDING used for in Teradata?
If you have data for every month of interest (i.e., no gaps), then this should work:
select t.*,
       -- the question asks for the average of monthly_txn and the variance of monthly_bal
       avg(monthly_txn) over (partition by cust_id
                              order by year_, month_
                              rows between 2 preceding and current row
                             ) as avg_3,
       avg(monthly_txn) over (partition by cust_id
                              order by year_, month_
                              rows between 5 preceding and current row
                             ) as avg_6,
       variance(monthly_bal) over (partition by cust_id
                                   order by year_, month_
                                   rows between 2 preceding and current row
                                  ) as variance_3,
       variance(monthly_bal) over (partition by cust_id
                                   order by year_, month_
                                   rows between 5 preceding and current row
                                  ) as variance_6
from mytable t;
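To make the frame semantics concrete, here is a rough pure-Python sketch of what "rows between N preceding and current row" computes for one customer. The monthly values are the ones used in the question's expected-output table, and pvariance (population variance) is an assumption about what Hive's variance() returns. The helper name rolling is made up for this illustration.

```python
from statistics import mean, pvariance

# (year_, month_, monthly_txn, monthly_bal) for cust_id 1,
# using the values from the expected-output table in the question.
rows = [
    (2017, 1, 342, 44522),
    (2017, 2, 378, 9898900),
    (2017, 3, 4536, 234492358),
    (2017, 4, 3535, 789),
    (2017, 5, 4561, 345),
    (2017, 6, 598, 334),
    (2018, 1, 456, 8979289),
    (2018, 2, 675, 4567),
    (2018, 3, 645, 4890),
]


def rolling(rows, size):
    """Yield (year, month, avg_txn, var_bal) over the current row plus
    up to size - 1 preceding rows, ordered by year and month."""
    ordered = sorted(rows)  # tuples sort by year_, then month_
    for i, (year, month, _txn, _bal) in enumerate(ordered):
        window = ordered[max(0, i - size + 1): i + 1]
        yield (
            year,
            month,
            mean([r[2] for r in window]),
            pvariance([r[3] for r in window]),  # assumed equivalent of Hive variance()
        )


quarterly = list(rolling(rows, 3))    # rows between 2 preceding and current row
half_yearly = list(rolling(rows, 6))  # rows between 5 preceding and current row

for year, month, avg_txn, var_bal in quarterly:
    print(year, month, avg_txn, var_bal)
```

Like the SQL, this counts rows rather than calendar months, so a missing month silently widens the period the window covers.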

Extracting financial years overlapping with tenancy periods

Given these tenancy contracts:
        2012        2013        2014        2015        2016
YR      |           |           |           |           |
FIN_YR     | 2012-2013 | 2013-2014 | 2014-2015 | 2015-2016 |
        ____________________________________________________
1       ----------------++++--------------------------------
2       -----+++++++++++++++++++++++++++++++++++++++--------
4       -----------------------------++++++++++++++++++-----
which lasted over these dates:
TENANCY_ID FROM       TO
---------- ---------- ----------
1          2013-05-02 2013-08-12
2          2012-06-22 2015-09-01
4          2014-06-03 2015-11-15
I want to produce a long table like:
TENANCY_ID Financial_Year
---------- --------------
1          2013-2014
2          2012-2013
2          2013-2014
2          2014-2015
2          2015-2016
4          2014-2015
4          2015-2016
where Financial_Year shows the financial years (1 Apr - 31 Mar) over which each tenancy, at least partly, lasted.
If relevant, db2, otherwise a generic solution would be fine.
Sorry, I haven't got DB2 at hand, so here's an example in Oracle:
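with financial_years as (
    select to_char(r) || '-' || to_char(r + 1) as year,
           to_date('01.04.' || to_char(r), 'dd.mm.yyyy') as date_begin,
           to_date('31.03.' || to_char(r + 1) || ' 23:59:59', 'dd.mm.yyyy hh24:mi:ss') as date_end
    from t_fin_year -- helper table with one row per starting year (assumed here to be column r, INT)
)
select y.year,
       t.id
from t_tenancy t
join financial_years y
  -- two ranges overlap when each one starts before the other ends;
  -- FROM and TO are reserved words in Oracle, so they are quoted
  on t."FROM" <= y.date_end
 and t."TO" >= y.date_begin
order by t.id, y.year;
The main idea is to join financial years with tenancies via dates: a tenancy belongs to a financial year if the two date ranges overlap, that is, if the tenancy starts on or before the year's end and ends on or after the year's start. Checking only whether the year's start or end falls inside the tenancy would miss a tenancy that lies entirely within one financial year, such as tenancy 1.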
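As a cross-check that needs no helper table, the financial years touched by a tenancy can also be derived directly from its start and end dates. The following Python sketch uses the three tenancies from the question; the function names are made up for this illustration, and the 1 Apr to 31 Mar boundary is taken from the question.

```python
from datetime import date

# TENANCY_ID -> (FROM, TO), as listed in the question.
tenancies = {
    1: (date(2013, 5, 2), date(2013, 8, 12)),
    2: (date(2012, 6, 22), date(2015, 9, 1)),
    4: (date(2014, 6, 3), date(2015, 11, 15)),
}


def fin_year_start(d):
    """Calendar year in which the financial year containing d begins
    (financial years run from 1 Apr to 31 Mar)."""
    return d.year if d.month >= 4 else d.year - 1


def financial_years(start, end):
    """Labels of every financial year that the period start..end touches."""
    first, last = fin_year_start(start), fin_year_start(end)
    return [f"{y}-{y + 1}" for y in range(first, last + 1)]


result = [
    (tenancy_id, label)
    for tenancy_id, (start, end) in sorted(tenancies.items())
    for label in financial_years(start, end)
]
for row in result:
    print(row)
```

This yields the seven rows of the desired long table, including tenancy 1, which sits entirely inside 2013-2014.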

How to replace all values in grouped column except first row

I have a table like this:
ID  Region  CreatedDate  Value
--------------------------------
1   USA     2016-01-01   5
2   USA     2016-02-02   10
3   Canada  2016-02-02   2
4   USA     2016-02-03   7
5   Canada  2016-03-03   3
6   Canada  2016-03-04   10
7   USA     2016-03-04   1
8   Cuba    2016-01-01   4
I need to sum the Value column, grouped by Region and by the year and month of CreatedDate. The result will be:
Region  Year  Month  SumOfValue
--------------------------------
USA     2016  1      5
USA     2016  2      17
USA     2016  3      1
Canada  2016  2      2
Canada  2016  3      13
Cuba    2016  1      4
But I want to replace every repeated value in the Region column with an empty string, keeping it only in the first row where it appears. The final result must be:
Region  Year  Month  SumOfValue
--------------------------------
USA     2016  1      5
        2016  2      17
        2016  3      1
Canada  2016  2      2
        2016  3      13
Cuba    2016  1      4
Thank you for any solution. It would be a bonus if the solution also replaced repeated values in the Year column.
You need to use SUM and GROUP BY to get the SumOfValue. For the formatting, you can use ROW_NUMBER:
WITH Cte AS(
    SELECT
        Region,
        [Year] = YEAR(CreatedDate),
        [Month] = MONTH(CreatedDate),
        SumOfValue = SUM(Value),
        Rn = ROW_NUMBER() OVER(PARTITION BY Region ORDER BY YEAR(CreatedDate), MONTH(CreatedDate))
    FROM #tbl
    GROUP BY
        Region, YEAR(CreatedDate), MONTH(CreatedDate)
)
SELECT
    Region = CASE WHEN Rn = 1 THEN c.Region ELSE '' END,
    [Year],
    [Month],
    SumOfValue
FROM Cte c
ORDER BY
    c.Region, Rn
Although this can be done in TSQL, I suggest you do the formatting on the application side.
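For what the application-side formatting might look like, here is a minimal Python sketch under the assumption that the rows have already been fetched. It sums by region, year and month, keeps regions in the order they first appear in the data (as in the question's desired output), and blanks the region on every row after the first one in each group.

```python
from datetime import date

# (Region, CreatedDate, Value), as in the question's table.
rows = [
    ("USA", date(2016, 1, 1), 5),
    ("USA", date(2016, 2, 2), 10),
    ("Canada", date(2016, 2, 2), 2),
    ("USA", date(2016, 2, 3), 7),
    ("Canada", date(2016, 3, 3), 3),
    ("Canada", date(2016, 3, 4), 10),
    ("USA", date(2016, 3, 4), 1),
    ("Cuba", date(2016, 1, 1), 4),
]

# Sum Value per (region, year, month).
totals = {}
for region, created, value in rows:
    key = (region, created.year, created.month)
    totals[key] = totals.get(key, 0) + value

# Remember the order in which each region first appears.
region_order = {}
for region, _created, _value in rows:
    region_order.setdefault(region, len(region_order))

# Emit rows grouped by region, showing the region name only once.
report = []
previous_region = None
for region, year, month in sorted(
    totals, key=lambda k: (region_order[k[0]], k[1], k[2])
):
    shown = "" if region == previous_region else region
    report.append((shown, year, month, totals[(region, year, month)]))
    previous_region = region

for line in report:
    print(line)
```

Blanking the Year column as well would follow the same pattern, comparing the (region, year) pair against the previous row.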
Query that follows the same order as the OP.