SQL count distinct over partition by cumulatively - sql

I am using AWS Athena (Presto based) and I have this table named base:
id  category  year  month
-------------------------
1   a         2021  6
1   b         2022  8
1   a         2022  11
2   a         2022  1
2   a         2022  4
2   b         2022  6
I would like to craft a query that counts the distinct values of the categories per id, cumulatively per month and year, but retaining the original columns:
id  category  year  month  sumC
-------------------------------
1   a         2021  6      1
1   b         2022  8      2
1   a         2022  11     2
2   a         2022  1      1
2   a         2022  4      1
2   b         2022  6      2
I've tried doing the following query with no success:
SELECT id,
category,
year,
month,
COUNT(category) OVER (PARTITION BY id ORDER BY year, month) AS sumC FROM base;
This results in 1, 2, 3, 1, 2, 3 which is not what I'm looking for. I'd rather need something like a COUNT(DISTINCT) inside a window function, though it's not supported as a construct.
I also tried the DENSE_RANK trick:
DENSE_RANK() OVER (PARTITION BY id ORDER BY category)
+ DENSE_RANK() OVER (PARTITION BY id ORDER BY category DESC)
- 1 as sumC
Though, because there is no ordering between year and month, it just results in 2, 2, 2, 2, 2, 2.
Any help is appreciated!

One option is:
- creating a new column that flags when each "category" is seen for the first time (partitioning on "id", "category" and ordering on "year", "month")
- computing a running sum over this column, partitioning on "id" only and ordering on "year", "month"
WITH cte AS (
    SELECT *,
           -- 1 on the first row of each (id, category) pair, 0 afterwards
           CASE WHEN ROW_NUMBER() OVER(
                         PARTITION BY id, category
                         ORDER BY year, month) = 1
                THEN 1
                ELSE 0
           END AS rn1
    FROM base
)
SELECT id,
       category,
       year,
       month,
       -- running sum of first appearances = cumulative distinct count
       SUM(rn1) OVER(
           PARTITION BY id
           ORDER BY year, month
       ) AS sumC
FROM cte
ORDER BY id, year, month
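As a quick sanity check, the two steps above can be run against the sample rows from the question. This is only a sketch: it assumes an in-memory SQLite database (3.25 or newer, for window functions) as a stand-in for Athena, and the column name first_seen is an illustrative choice, not something from the original answer.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for Athena for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE base (id INTEGER, category TEXT, year INTEGER, month INTEGER)"
)
conn.executemany(
    "INSERT INTO base VALUES (?, ?, ?, ?)",
    [
        (1, "a", 2021, 6),
        (1, "b", 2022, 8),
        (1, "a", 2022, 11),
        (2, "a", 2022, 1),
        (2, "a", 2022, 4),
        (2, "b", 2022, 6),
    ],
)

query = """
WITH cte AS (
    SELECT *,
           -- first_seen (hypothetical name): 1 on the first row of each
           -- (id, category) pair in chronological order, 0 afterwards
           CASE WHEN ROW_NUMBER() OVER (
                         PARTITION BY id, category
                         ORDER BY year, month) = 1
                THEN 1 ELSE 0
           END AS first_seen
    FROM base
)
SELECT id, category, year, month,
       -- running total of first appearances per id
       SUM(first_seen) OVER (PARTITION BY id ORDER BY year, month) AS sumC
FROM cte
ORDER BY id, year, month
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

On the sample data the sumC column comes out as 1, 2, 2 for id 1 and 1, 1, 2 for id 2, which matches the desired output in the question.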

Related

Find records 1 year older/newer than dates given in the same column, for each ID#

I need to find if a customer has a subscription the previous year and the following year, and how many subscriptions were new or were canceled the following year.
Sample data:
ID  Subscription year
---------------------
1   2010
1   2011
1   2019
2   2011
2   2012
3   2010
Thinking of this approach: subtracting and adding 1 to the subscription year and seeing if the customer ID has another row that corresponds (ex. if no rows for year+1, customer had canceled the next year). Hoping for something like this:
ID  Subscription year  SubscribedPreviousYear  SubscribedNextYear
-----------------------------------------------------------------
1   2010               F                       T
1   2011               T                       F
1   2019               F                       F
2   2011               F                       T
2   2012               T                       F
3   2010               F                       F
Then counting the F's in SubscribedPreviousYear as new subscriptions (they are counted as new if customer did not have one the immediate previous year, even for existing customers) and F's in SubscribedNextYear as canceled subscriptions, to get something like this:
Year  New (# F's in SubscribedPreviousYear)  Canceled (# F's in SubscribedNextYear)
-----------------------------------------------------------------------------------
2010  2                                      1
2011  1                                      1
2012  0                                      1
2019  1                                      1
I had tried this code, modified from a similar MySQL question, but got 'F' for all rows.
select
t1.Id, cast(t1.year as date),
IIF((select count(*) from table t2
where t1.Id=t2.Id and
datediff(y, t2.year, t1.year)=1) <1, 'T','F')
as SubscribedPreviousYear
from table t;
I would use LEAD() and LAG() here:
SELECT id, year,
CASE WHEN LAG(year) OVER (PARTITION BY id ORDER BY year) = year - 1
THEN 'T' ELSE 'F' END AS SubscribedPreviousYear,
CASE WHEN LEAD(year) OVER (PARTITION BY id ORDER BY year) = year + 1
THEN 'T' ELSE 'F' END AS SubscribedNextYear
FROM yourTable
ORDER BY id, year;
To get the final result, we can aggregate by year:
WITH cte AS (
SELECT *,
LAG(year) OVER (PARTITION BY id ORDER BY year) AS year_lag,
LEAD(year) OVER (PARTITION BY id ORDER BY year) AS year_lead
FROM yourTable
)
SELECT year,
COUNT(CASE WHEN year_lag IS NULL OR year != year_lag + 1 THEN 1 END) AS [New],
COUNT(CASE WHEN year_lead IS NULL OR year != year_lead - 1 THEN 1 END) AS Cancelled
FROM cte
GROUP BY year
ORDER BY year;
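One detail worth checking in the aggregate is NULL handling: LAG() and LEAD() return NULL on the first and last row of each id, and a comparison against NULL is never true, so those rows have to be counted explicitly. The sketch below runs the idea on the sample data, assuming an in-memory SQLite database as a stand-in for SQL Server; the table name subs and the aliases new_subs and canceled_subs are illustrative.

```python
import sqlite3

# Assumption: SQLite (3.25+) stands in for SQL Server; "subs" is a made-up name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subs (id INTEGER, year INTEGER)")
conn.executemany(
    "INSERT INTO subs VALUES (?, ?)",
    [(1, 2010), (1, 2011), (1, 2019), (2, 2011), (2, 2012), (3, 2010)],
)

query = """
WITH cte AS (
    SELECT id, year,
           LAG(year)  OVER (PARTITION BY id ORDER BY year) AS year_lag,
           LEAD(year) OVER (PARTITION BY id ORDER BY year) AS year_lead
    FROM subs
)
SELECT year,
       -- a row counts as "new" unless the previous row is exactly year - 1;
       -- a NULL year_lag falls into the ELSE branch and is counted
       SUM(CASE WHEN year_lag  = year - 1 THEN 0 ELSE 1 END) AS new_subs,
       -- same idea for cancellations, looking at the following row
       SUM(CASE WHEN year_lead = year + 1 THEN 0 ELSE 1 END) AS canceled_subs
FROM cte
GROUP BY year
ORDER BY year
"""

summary = conn.execute(query).fetchall()
for row in summary:
    print(row)
```

With the sample rows this yields (2010, 2, 1), (2011, 1, 1), (2012, 0, 1) and (2019, 1, 1), the same counts as the expected summary table in the question.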

Selecting records that have low numbers consecutively

I have a table as following (using bigquery):
id   year  month  day  rating
-----------------------------
111  2020  11     30   4
111  2020  12     01   4
112  2020  11     30   5
113  2020  11     30   5
Is there a way in which I can select ids that have ratings that are consecutively (two or more consecutive records) low (low as in both records' ratings less than 4.5)?
For example, my desired output is:
id   year  month  day  rating
-----------------------------
111  2020  11     30   4
111  2020  12     01   4
If you want all rows, then you need to look at both the previous rating and the next rating:
SELECT t.*
FROM (SELECT t.*,
LAG(rating) OVER (PARTITION BY id ORDER BY year, month, day ASC) AS prev_rating,
LEAD(rating) OVER (PARTITION BY id ORDER BY year, month, day ASC) AS next_rating,
FROM dataset.table t
) t
WHERE (rating < 4.5 and prev_rating < 4.5) OR
(rating < 4.5 and next_rating < 4.5)
Below is for BigQuery Standard SQL
select * except(grp, seq_len)
from (
select *, sum(1) over(partition by id, grp) seq_len
from (
select *,
countif(rating >= 4.5) over(partition by id order by year, month, day) grp
from `project.dataset.table`
)
where rating < 4.5
)
where seq_len > 1
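The grouping idea in the last query (a running count of "not low" rows acts as a label for each run of low rows) can be tried on the sample data. This is a sketch under assumptions: in-memory SQLite replaces BigQuery, so countif(...) is written as SUM(CASE ...), and the table name ratings is made up. The run length is counted per id and grp so that runs belonging to different ids are never merged.

```python
import sqlite3

# Assumption: SQLite (3.25+) stands in for BigQuery; "ratings" is a made-up name.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ratings (id INTEGER, year INTEGER, month INTEGER, "
    "day INTEGER, rating REAL)"
)
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?, ?, ?)",
    [
        (111, 2020, 11, 30, 4),
        (111, 2020, 12, 1, 4),
        (112, 2020, 11, 30, 5),
        (113, 2020, 11, 30, 5),
    ],
)

query = """
SELECT id, year, month, day, rating
FROM (
    -- after dropping the high rows, count how many low rows share a run
    SELECT *, COUNT(*) OVER (PARTITION BY id, grp) AS seq_len
    FROM (
        -- grp increases by 1 at every rating >= 4.5, so consecutive
        -- low rows between two high rows share the same grp value
        SELECT *,
               SUM(CASE WHEN rating >= 4.5 THEN 1 ELSE 0 END)
                   OVER (PARTITION BY id ORDER BY year, month, day) AS grp
        FROM ratings
    )
    WHERE rating < 4.5
)
WHERE seq_len > 1
ORDER BY id, year, month, day
"""

low_runs = conn.execute(query).fetchall()
for row in low_runs:
    print(row)
```

Only the two rows for id 111 are returned, since 112 and 113 have no low ratings at all.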

Bigquery - How to Calculate the sum of two continuous rows

How can I get the sum of each pair of consecutive rows? For instance, if I have 5 rows in total, I should get 3 rows as a result.
Below is my table:
2020-08-01 1
2020-08-02 3
2020-08-03 4
2020-08-04 2
2020-08-05 4
I want to achieve this:
4
6
4
August 1 and 2 = 4
August 3 and 4 = 6
August 5 = 4
You could use ROW_NUMBER here:
WITH cte AS (
SELECT dt, val, ROW_NUMBER() OVER (ORDER BY dt) rn
FROM yourTable
)
SELECT SUM(val)
FROM cte
GROUP BY FLOOR((rn - 1) / 2)
ORDER BY MIN(dt);
Here is a demo link, shown in SQL Server, though the same logic should also work in BigQuery:
Demo
Below is for Bigquery Standard SQL
#standardSQL
SELECT SUM(value) AS value,
STRING_AGG(FORMAT_DATE('%B %d', day), ' and ') || ' = ' || CAST(SUM(value) AS STRING) AS calc
FROM (
SELECT day, value, DIV(ROW_NUMBER() OVER(ORDER BY day) - 1, 2) grp
FROM `project.dataset.table` t
)
GROUP BY grp
ORDER BY grp
You can test and play with the above using the sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT DATE '2020-08-01' day, 1 value UNION ALL
SELECT '2020-08-02', 3 UNION ALL
SELECT '2020-08-03', 4 UNION ALL
SELECT '2020-08-04', 2 UNION ALL
SELECT '2020-08-05', 4
)
SELECT SUM(value) AS value,
STRING_AGG(FORMAT_DATE('%B %d', day), ' and ') || ' = ' || CAST(SUM(value) AS STRING) AS calc
FROM (
SELECT day, value, DIV(ROW_NUMBER() OVER(ORDER BY day) - 1, 2) grp
FROM `project.dataset.table` t
)
GROUP BY grp
ORDER BY grp
with output
Row value calc
1 4 August 01 and August 02 = 4
2 6 August 03 and August 04 = 6
3 4 August 05 = 4
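For comparison, the ROW_NUMBER pairing used in both answers can be reproduced outside BigQuery. The sketch below assumes an in-memory SQLite database and a made-up table name daily; SQLite divides two integers as integers, so (rn - 1) / 2 plays the role of FLOOR((rn - 1) / 2) or DIV(rn - 1, 2).

```python
import sqlite3

# Assumption: SQLite (3.25+) stands in for BigQuery; "daily" is a made-up name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily (dt TEXT, val INTEGER)")
conn.executemany(
    "INSERT INTO daily VALUES (?, ?)",
    [
        ("2020-08-01", 1),
        ("2020-08-02", 3),
        ("2020-08-03", 4),
        ("2020-08-04", 2),
        ("2020-08-05", 4),
    ],
)

query = """
WITH cte AS (
    SELECT dt, val, ROW_NUMBER() OVER (ORDER BY dt) AS rn
    FROM daily
)
SELECT MIN(dt) AS first_day, SUM(val) AS total
FROM cte
-- integer division: rows 1-2 map to 0, rows 3-4 to 1, row 5 to 2
GROUP BY (rn - 1) / 2
ORDER BY first_day
"""

pairs = conn.execute(query).fetchall()
for row in pairs:
    print(row)
```

The totals come out as 4, 6 and 4, with the unpaired fifth row forming its own group.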

Grouping data on SQL Server

I have this table in SQL Server:
Year Month Quantity
----------------------------
2015 January 10
2015 February 20
2015 March 30
2014 November 40
2014 August 50
How can I add two more columns, one that numbers each distinct year as a group and one that numbers the months within each year sequentially, as in this example?
Year Month Quantity Group Subgroup
------------------------------------------------
2015 January 10 1 1
2015 February 20 1 2
2015 March 30 1 3
2014 November 40 2 1
2014 August 50 2 2
You can use DENSE_RANK to calculate the groups for you:
SELECT t1.*, DENSE_RANK() OVER (ORDER BY Year DESC) AS [Group],
DENSE_RANK() OVER (PARTITION BY Year ORDER BY DATEPART(month, Month + ' 01 2010')) AS [SubGroup]
FROM t1
ORDER BY 4, 5
See this fiddle.
To associate group and subgroup with a number you can do this:
WITH RankedTable AS (
SELECT year, month, quantity,
ROW_NUMBER() OVER (partition by year order by Month) AS rn
FROM yourtable)
SELECT year, month, quantity,
SUM (CASE WHEN rn = 1 THEN 1 ELSE 0 END) OVER (ORDER BY YEAR) as year_group,
rn AS subgroup
FROM RankedTable
Here the ROW_NUMBER() OVER clause calculates the rank of each month within its year,
and SUM() ... OVER calculates a running sum over the rows with rank 1, which yields the group number of the year.
SQL Fiddle

Moving average of 2 columns

Hello, I have a problem. I know how to calculate a moving average over the last 3 months using Oracle analytic functions, but my situation is a little different:
Month  ProductType  Sales  Average (HAVE TO FIND THIS)
------------------------------------------------------
1      A            10
1      B            12
1      C            17
2      A            21
3      C            2
3      B            21
4      B            23
5
6
7
8
9
So we have sales for each month and each product type... I need to calculate the moving average of the last 3 months and the particular product.
example:
For month 4 and Product B it would be (21+0+12)/3
Any ideas?
Another option is to use the windowing clause of analytic functions
with my_data as (
select 1 as month, 'A' as product, 10 as sales from dual union all
select 1 as month, 'B' as product, 12 as sales from dual union all
select 1 as month, 'C' as product, 17 as sales from dual union all
select 2 as month, 'A' as product, 21 as sales from dual union all
select 3 as month, 'C' as product, 2 as sales from dual union all
select 3 as month, 'B' as product, 21 as sales from dual union all
select 4 as month, 'B' as product, 23 as sales from dual
)
select
month,
product,
sales,
nvl(sum(sales)
over (partition by product order by month
range between 3 preceding and 1 preceding),0)/3 as average_sales
from my_data
order by month, product
SELECT month,
productType,
sales,
(lag(sales, 3) over (partition by productType order by month) +
lag(sales, 2) over (partition by productType order by month) +
lag(sales, 1) over (partition by productType order by month)) / 3 AS moving_avg
FROM your_table_name
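The two answers behave differently when a product has no row for some month: a RANGE frame looks at month values, while LAG() looks at preceding rows regardless of which month they belong to. The sketch below compares them on the sample data. It assumes an in-memory SQLite database (3.28 or newer, for RANGE frames with offsets) in place of Oracle, uses COALESCE instead of nvl, and the names sales_data, range_avg and lag_avg are illustrative.

```python
import sqlite3

# Assumption: SQLite (3.28+) stands in for Oracle; names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_data (month INTEGER, product TEXT, sales INTEGER)"
)
conn.executemany(
    "INSERT INTO sales_data VALUES (?, ?, ?)",
    [
        (1, "A", 10), (1, "B", 12), (1, "C", 17),
        (2, "A", 21),
        (3, "C", 2), (3, "B", 21),
        (4, "B", 23),
    ],
)

query = """
SELECT month, product, sales,
       -- sums the rows whose month is 1 to 3 less than the current month;
       -- months without a row simply contribute nothing
       COALESCE(SUM(sales) OVER (PARTITION BY product ORDER BY month
                                 RANGE BETWEEN 3 PRECEDING AND 1 PRECEDING),
                0) / 3.0 AS range_avg,
       -- looks back 1 to 3 rows; NULL if fewer than 3 earlier rows exist
       (LAG(sales, 3) OVER w + LAG(sales, 2) OVER w + LAG(sales, 1) OVER w)
           / 3.0 AS lag_avg
FROM sales_data
WINDOW w AS (PARTITION BY product ORDER BY month)
ORDER BY month, product
"""

rows = conn.execute(query).fetchall()
by_key = {(m, p): (range_avg, lag_avg) for m, p, _, range_avg, lag_avg in rows}
print(by_key[(4, "B")])
```

For month 4 and product B the RANGE version gives (12 + 21) / 3 = 11.0, matching the (21+0+12)/3 from the question, while the LAG version gives NULL because B has only two earlier rows.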