How to change rank according to data changes - sql

I want a window function that ranks the months by their values. In this example, 2018-12 should be rank 1, 2019-01 rank 2, and so on.
I also want the rank counter to reset when a new cohort starts. In this case the rank for cohort 2 should start from 1 again and follow the same pattern as cohort 1.
SELECT *,
rank() over (partition by cohort, month order by month asc)
FROM (
SELECT 1 as cohort, id, date_trunc('month',start_date) as month
FROM _analysis.terms
WHERE holiday=FALSE and id >= 125
UNION SELECT 2, id, date_trunc('month', start_date) FROM _analysis.terms
WHERE holiday=FALSE and id >= 126
ORDER BY cohort, id, month
)
ORDER BY cohort, id, month

This may help
USE AdventureWorks2014
GO
SELECT SalesOrderID,OrderDate
,DENSE_RANK() OVER(ORDER BY MONTH(OrderDate) ASC) [Rank]
FROM [Sales].[SalesOrderHeader]
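To get the per-cohort reset the question asks for, one option is to partition by cohort only and order by month. Partitioning by both cohort and month puts every month in its own partition, so every row gets rank 1. The sketch below uses Python's sqlite3 module as a stand-in database; the `terms` table contents are made-up sample rows, not the asker's data.

```python
import sqlite3

# Hypothetical sample data: (cohort, id, month). Not taken from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE terms (cohort INTEGER, id INTEGER, month TEXT)")
conn.executemany(
    "INSERT INTO terms VALUES (?, ?, ?)",
    [
        (1, 125, "2018-12-01"),
        (1, 126, "2019-01-01"),
        (1, 127, "2019-01-01"),
        (1, 128, "2019-02-01"),
        (2, 126, "2019-01-01"),
        (2, 127, "2019-01-01"),
        (2, 128, "2019-02-01"),
    ],
)

# DENSE_RANK restarts at 1 in each cohort and gives equal months the same rank.
cohort_ranks = conn.execute(
    """
    SELECT cohort, id, month,
           DENSE_RANK() OVER (PARTITION BY cohort ORDER BY month) AS month_rank
    FROM terms
    ORDER BY cohort, id
    """
).fetchall()

for row in cohort_ranks:
    print(row)
```

DENSE_RANK is used rather than RANK so that the rank after a tie is not skipped.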

Related

How to differentiate iterations using a date field in BigQuery

I have a process that occurs every 30 days but can take a few days to complete.
How can I differentiate between the iterations so that I can sum the output of each one?
For example, the output I expect is:
Name Date amount iteration (optional)
Sophia Liu 2016-01-01 4 1
Sophia Liu 2016-02-01 5 2
Nikki Leith 2016-01-02 5 1
Nikki Leith 2016-02-01 10 2
I tried using the lag function on the date field and taking the difference between that column and the date column.
WITH base AS
(SELECT 'Sophia Liu' as name, DATE '2016-01-01' as date, 3 as amount
UNION ALL SELECT 'Sophia Liu', DATE '2016-01-02', 1
UNION ALL SELECT 'Sophia Liu', DATE '2016-02-01', 3
UNION ALL SELECT 'Sophia Liu', DATE '2016-02-02', 2
UNION ALL SELECT 'Nikki Leith', DATE '2016-01-02', 5
UNION ALL SELECT 'Nikki Leith', DATE '2016-02-01', 5
UNION ALL SELECT 'Nikki Leith', DATE '2016-02-02', 3
UNION ALL SELECT 'Nikki Leith', DATE '2016-02-03', 1
UNION ALL SELECT 'Nikki Leith', DATE '2016-02-04', 1)
select
name
,date
,lag(date) over (partition by name order by date) as lag_func
,date_diff(date,lag(date) over (partition by name order by date),day) date_differacne
,case when date_diff(date,lag(date) over (partition by name order by date),day) >= 10
or date_diff(date,lag(date) over (partition by name order by date),day) is null then true else false end as new_iteration
,amount
from base
Edited answer
After your clarification, and after looking at what's actually in your SQL code, I'm guessing you are looking for a solution to what's called a gaps-and-islands problem. That is, you want to identify the "islands" of activity and sum the amount for each iteration, or island. Taking your example, you can first flag the start of a new iteration (the row that follows a "gap") and then use a running sum of that flag to create a unique iteration ("island") identifier for each user. You can then use that identifier to perform a SUM().
-- continues the WITH clause from the question: WITH base AS (...),
gaps as (
select
name,
date,
amount,
if(date_diff(date, lag(date,1) over(partition by name order by date), DAY) >= 10, 1, 0) new_iteration
from base
),
islands as (
select
*,
1 + sum(new_iteration) over(partition by name order by date) iteration_id
from gaps
)
select
*,
sum(amount) over(partition by name, iteration_id) iteration_amount
from islands
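The gaps-and-islands steps above can be tried end to end with the sample rows from the question. This is a minimal sketch using Python's sqlite3 module rather than BigQuery, so `julianday()` stands in for `date_diff()`, and the column name `event_date` is an assumption chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE base (name TEXT, event_date TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO base VALUES (?, ?, ?)",
    [
        ("Sophia Liu", "2016-01-01", 3),
        ("Sophia Liu", "2016-01-02", 1),
        ("Sophia Liu", "2016-02-01", 3),
        ("Sophia Liu", "2016-02-02", 2),
        ("Nikki Leith", "2016-01-02", 5),
        ("Nikki Leith", "2016-02-01", 5),
        ("Nikki Leith", "2016-02-02", 3),
        ("Nikki Leith", "2016-02-03", 1),
        ("Nikki Leith", "2016-02-04", 1),
    ],
)

iteration_totals = conn.execute(
    """
    WITH gaps AS (
        -- Flag a row as the start of a new iteration when it comes
        -- 10 or more days after the previous row for the same name.
        SELECT name, event_date, amount,
               CASE WHEN julianday(event_date)
                         - julianday(LAG(event_date) OVER (
                               PARTITION BY name ORDER BY event_date)) >= 10
                    THEN 1 ELSE 0 END AS new_iteration
        FROM base
    ),
    islands AS (
        -- A running sum of the flags numbers the iterations 1, 2, ...
        SELECT *,
               1 + SUM(new_iteration) OVER (
                       PARTITION BY name ORDER BY event_date) AS iteration_id
        FROM gaps
    )
    SELECT name, MIN(event_date) AS start_date,
           SUM(amount) AS amount, iteration_id
    FROM islands
    GROUP BY name, iteration_id
    ORDER BY name, iteration_id
    """
).fetchall()

for row in iteration_totals:
    print(row)
```

Grouping by name and iteration_id collapses each island to one row, which matches the shape of the expected output in the question.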
Previous answer
Sounds like you just need a RANK() to count the iterations in your window functions. Depending on your need you can then sum cumulative or total amounts in a similar window function. Something like this:
select
name
,date
,rank() over (partition by name order by date) as iteration
,sum(amount) over (partition by name order by date) as cumulative_amount
,sum(amount) over (partition by name) as total_amount
,amount
from base

RANK() over (PARTITION BY) To show only TOP 3 rows for each month

I have a question about ranking (I'm using pgAdmin for my SQL code).
I managed to get my sum of sales in descending order, ranked 1 to 3, for the month of April.
How can I show only ranks 1 to 3 for each of April, May and June?
The result should contain only 9 rows.
SELECT restaurant_id,
EXTRACT(year FROM submitted_on) AS year,
EXTRACT(month FROM submitted_on) AS month,
SUM(total_amount),
RANK() OVER (PARTITION BY(extract(month from submitted_on))
ORDER BY SUM(total_amount) DESC) rank
FROM orders
WHERE submitted_on::date BETWEEN '2021-04-01' AND '2021-06-30'
GROUP BY restaurant_id, year, month
If you want exactly 3 rows per month, use ROW_NUMBER() instead of RANK(), because RANK() can return more than 3 rows when totals tie. Then wrap the query in a subquery and filter on the result:
select t.* from (
SELECT restaurant_id,
EXTRACT(year FROM submitted_on) AS year,
EXTRACT(month FROM submitted_on) AS month,
SUM(total_amount),
ROW_NUMBER() OVER (PARTITION BY(extract(month from submitted_on))
ORDER BY SUM(total_amount) DESC) rank
FROM orders
WHERE submitted_on::date BETWEEN '2021-04-01' AND '2021-06-30'
GROUP BY restaurant_id, year, month
) t
where rank <=3;
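The difference between ROW_NUMBER() and RANK() only shows up when totals tie. The following sketch, run through Python's sqlite3 module, uses a hypothetical pre-aggregated `monthly_sales` table (not the asker's `orders` table) in which three restaurants tie for second place in April.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE monthly_sales (month INTEGER, restaurant_id TEXT, total INTEGER)"
)
conn.executemany(
    "INSERT INTO monthly_sales VALUES (?, ?, ?)",
    [
        (4, "r1", 100), (4, "r2", 90), (4, "r3", 90), (4, "r4", 90),
        (5, "r1", 50), (5, "r2", 40), (5, "r3", 30), (5, "r4", 20),
    ],
)

# Compute both numberings once, then filter each in the outer query.
numbered = """
    SELECT month, restaurant_id, total,
           ROW_NUMBER() OVER (PARTITION BY month
                              ORDER BY total DESC, restaurant_id) AS rn,
           RANK() OVER (PARTITION BY month ORDER BY total DESC) AS rnk
    FROM monthly_sales
"""

top3_row_number = conn.execute(
    f"SELECT month, restaurant_id FROM ({numbered}) t WHERE rn <= 3 "
    "ORDER BY month, rn"
).fetchall()
top3_rank = conn.execute(
    f"SELECT month, restaurant_id FROM ({numbered}) t WHERE rnk <= 3 "
    "ORDER BY month, rnk, restaurant_id"
).fetchall()

print(len(top3_row_number), "rows with ROW_NUMBER")
print(len(top3_rank), "rows with RANK")
```

The extra `restaurant_id` in the ROW_NUMBER ordering is a tie-breaker added for the example so the result is deterministic.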

Sum of unique customers in rolling trailing 30d window displayed by week

I'm working in SQL Workbench.
I'd like to track every time a unique customer clicks the new feature in trailing 30 days, displayed week over week. An example of the data output would be as follows:
Week 51: Reflects usage through the end of week 51 (Dec 20th) - 30 days. aka Nov 20-Dec 20th
Week 52: Reflects usage through the end of week 52 (Dec 31st) - 30 days. aka Dec 1 - Dec 31st.
Say there are 22MM unique customer clicks that occurred from Nov 20-Dec 20th. Week 51 data = 22MM.
Say there are 25MM unique customer clicks that occurred from Dec 1-Dec 31st. Week 52 data = 25MM. The customer uniqueness is only relevant to that particular week. Aka, if a customer clicks twice in Week 51 they're only counted once. If they click once in Week 51 and once in Week 52, they are counted once in each week.
Here is what I have so far:
select
min_e_date
,sum(count(*)) over (order by min_e_date rows between unbounded preceding and current row) as running_distinct_customers
from (select customer_id, min(DATE_TRUNC('week', event_date)) as min_e_date
from final
group by 1
) c
group by
min_e_date
I don't think a rolling count is the right way to go. As I add in additional parameters (country, subscription), the rolling count doesn't distinguish between them - the figures just get added to the prior row.
Any suggestions are appreciated!
edit Additional data below. Data collection begins on 11/23. No data precedes that date.
You can get the count of distinct customers per week like so:
select date_trunc('week', event_date) as week_start,
count(distinct customer_id) cnt
from final
group by 1
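Now if you want a rolling sum of that count (say, the current week and the three preceding weeks), you can use window functions. Note that this adds up the weekly distinct counts, so a customer who is active in several of those weeks is counted once per week: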
select date_trunc('week', event_date) as week_start,
count(distinct customer_id) cnt,
sum(count(distinct customer_id)) over(
order by date_trunc('week', event_date)
range between interval '3 week' preceding and current row
) as rolling_cnt
from final
group by 1
Rolling distinct counts are quite difficult in Redshift. One method is a self-join and aggregation:
select t.date,
count(distinct case when tprev.date >= t.date - interval '6 day' then tprev.customer_id end) as trailing_7,
count(distinct tprev.customer_id) as trailing_30
from t join
t tprev
on tprev.date >= t.date - interval '29 day' and
tprev.date <= t.date
group by t.date;
If you can get this to work, you can just select every 7th row to get the weekly values.
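The self-join idea can be checked on a small data set. The sketch below runs in SQLite through Python's sqlite3 module, so the date arithmetic uses SQLite's `date()` function instead of `interval`. The `events` table, its rows, and the 3-day window (shortened from 30 days to keep the example small) are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (customer_id TEXT, event_date TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        ("c1", "2020-12-01"),
        ("c1", "2020-12-02"),
        ("c2", "2020-12-02"),
        ("c3", "2020-12-05"),
        ("c1", "2020-12-05"),
    ],
)

# For each date that has events, join back to every event in the
# trailing 3-day window (the date itself and the two days before it)
# and count each customer once.
trailing_counts = conn.execute(
    """
    SELECT d.event_date,
           COUNT(DISTINCT prev.customer_id) AS trailing_customers
    FROM (SELECT DISTINCT event_date FROM events) d
    JOIN events prev
      ON prev.event_date <= d.event_date
     AND prev.event_date >= date(d.event_date, '-2 days')
    GROUP BY d.event_date
    ORDER BY d.event_date
    """
).fetchall()

for row in trailing_counts:
    print(row)
```

Customer c1 appears on 2020-12-01 and 2020-12-02 but is counted once in the window ending 2020-12-02, which is the behaviour a plain rolling sum of daily counts does not give.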
EDIT:
An entirely different approach is to use aggregation and keep track of when customers enter and end time periods of being counted. This is a pain with two different time frames. Here is what it looks like for one.
The idea is to:
1). Create an enter/exit record for each record being counted. The "exit" is n days after the enter.
2). Summarize these into periods of activity for each customer. So, there is one record with an enter and exit date. This is a type of gaps-and-islands problem.
3). Unpivot this result to count +1 for a customer being counted and -1 for a customer not being counted.
4). Do a cumulative sum of this count.
The code looks something like this:
with cd as (
select customer_id, date,
lead(date) over (partition by customer_id order by date) as next_date,
sum(sum(inc)) over (partition by customer_id order by date) as cnt
from ((select t.customer_id, t.date, 1 as inc
from t
) union all
(select t.customer_id, t.date + interval '7 day', -1
from t
)
) tt
group by customer_id, date
),
cd2 as (
select customer_id, min(date) as enter_date, max(date) as exit_date
from (select cd.*,
sum(case when cnt = 0 then 1 else 0 end) over (partition by customer_id order by date) as grp
from (select cd.*,
lag(cnt) over (partition by customer_id order by date) as prev_cnt
from cd
) cd
) cd
group by customer_id, grp
having max(cnt) > 0
)
select dte, sum(sum(inc)) over (order by dte)
from ((select customer_id, enter_date as dte, 1 as inc
from cd2
) union all
(select customer_id, exit_date as dte, -1 as inc
from cd2
)
) cd2
group by dte;
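The enter/exit bookkeeping may be easier to follow outside SQL. The following Python sketch is an illustration of the same idea, not a translation of the query: the function name `active_customer_counts`, the 7-day window, and the sample events are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date, timedelta


def active_customer_counts(events, window_days=7):
    """Return (date, customers counted) at every date where the count changes.

    events is an iterable of (customer_id, date) pairs. A customer is
    counted from each event date until window_days later.
    """
    dates_by_customer = defaultdict(list)
    for customer_id, event_date in events:
        dates_by_customer[customer_id].append(event_date)

    # +1 when a customer's period of activity starts, -1 when it ends.
    deltas = defaultdict(int)
    for dates in dates_by_customer.values():
        dates.sort()
        start = dates[0]
        end = start + timedelta(days=window_days)
        for d in dates[1:]:
            if d <= end:
                # Overlapping windows merge into one period (one island).
                end = max(end, d + timedelta(days=window_days))
            else:
                deltas[start] += 1
                deltas[end] -= 1
                start, end = d, d + timedelta(days=window_days)
        deltas[start] += 1
        deltas[end] -= 1

    # Cumulative sum over the change points.
    running = 0
    result = []
    for d in sorted(deltas):
        running += deltas[d]
        result.append((d, running))
    return result


sample_events = [
    ("c1", date(2021, 1, 1)),
    ("c1", date(2021, 1, 5)),
    ("c2", date(2021, 1, 3)),
]
counts = active_customer_counts(sample_events)
for change_date, active in counts:
    print(change_date, active)
```

Merging each customer's overlapping windows first is what keeps a customer from being counted twice while two of their windows overlap.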

How can I count users in a month that were not present in the month before?

I am trying to count unique users on a monthly basis that were not present in the previous month. So if a user has a record for January and then another one for February, then I would only count January for that user.
user_id time
a1 1/2/17
a1 2/10/17
a2 2/18/17
a4 2/5/17
a5 3/25/17
My results should look like this
Month User Count
January 1
February 2
March 1
I'm not really familiar with BigQuery, but here's how I would solve the problem using TSQL. I imagine that you'd be able to use similar logic in BigQuery.
1). Order the data by user_id first, and then time. In TSQL, you can accomplish this with the following and store it in a common table expression, which you will query in the step after this.
;WITH cte AS
(
select ROW_NUMBER() OVER (PARTITION BY [user_id] ORDER BY [time]) AS rn,*
from dbo.employees
)
2). Next query for only the rows with rn = 1 (the first occurrence for a particular user) and group by the month.
select DATENAME(month, [time]) AS [Month], count(*) AS user_count
from cte
where rn = 1
group by DATENAME(month, [time])
This is assuming that 2017 is the only year you're dealing with. If you're dealing with more than one year, you probably want step #2 to look something like this:
select year([time]) as [year], DATENAME(month, [time]) AS [month],
count(*) AS user_count
from cte
where rn = 1
group by year([time]), DATENAME(month, [time])
First aggregate by the user id and the month. Then use lag() to see if the user was present in the previous month:
with du as (
select date_trunc(time, month) as yyyymm, user_id
from t
group by date_trunc(time, month), user_id
)
select yyyymm, count(*)
from (select du.*,
lag(yyyymm) over (partition by user_id order by yyyymm) as prev_yyyymm
from du
) du
where prev_yyyymm is null or
prev_yyyymm < date_sub(yyyymm, interval 1 month)
group by yyyymm;
Note: This uses the date functions, but similar functions exist for timestamp.
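The lag() approach can be verified against the sample rows from the question. This is a sketch in SQLite via Python's sqlite3 module, not BigQuery, so `date(..., 'start of month')` stands in for `date_trunc`. The column name `day` and the ISO date format are assumptions made for the example. A user is kept for a month when there is no earlier month for that user, or when the earlier month is more than one month back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (user_id TEXT, day TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [
        ("a1", "2017-01-02"),
        ("a1", "2017-02-10"),
        ("a2", "2017-02-18"),
        ("a4", "2017-02-05"),
        ("a5", "2017-03-25"),
    ],
)

new_user_counts = conn.execute(
    """
    WITH du AS (
        -- One row per user and month.
        SELECT DISTINCT user_id, date(day, 'start of month') AS yyyymm
        FROM t
    ),
    flagged AS (
        SELECT du.*,
               LAG(yyyymm) OVER (PARTITION BY user_id
                                 ORDER BY yyyymm) AS prev_yyyymm
        FROM du
    )
    SELECT yyyymm, COUNT(*) AS user_count
    FROM flagged
    WHERE prev_yyyymm IS NULL
       OR prev_yyyymm < date(yyyymm, '-1 month')
    GROUP BY yyyymm
    ORDER BY yyyymm
    """
).fetchall()

for row in new_user_counts:
    print(row)
```

User a1 is counted in January but not in February, because a1's previous month is exactly one month earlier.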
As I understand the question, a user should be excluded from a given month only if the same user was present in the previous month. If the user was present a few months earlier but not in the previous month, the user should be counted.
If this is correct, try the following for BigQuery Standard SQL:
#standardSQL
SELECT Year, Month, COUNT(DISTINCT user_id) AS User_Count
FROM (
SELECT *,
DATE_DIFF(time, LAG(time) OVER(PARTITION BY user_id ORDER BY time), MONTH) AS flag
FROM (
SELECT
user_id,
DATE_TRUNC(PARSE_DATE('%x', time), MONTH) AS time,
EXTRACT(YEAR FROM PARSE_DATE('%x', time)) AS Year,
FORMAT_DATE('%B', PARSE_DATE('%x', time)) AS Month
FROM yourTable
GROUP BY 1, 2, 3, 4
)
)
WHERE IFNULL(flag, 0) <> 1
GROUP BY Year, Month, time
ORDER BY time
You can test it with the dummy data from your question:
#standardSQL
WITH yourTable AS (
SELECT 'a1' AS user_id, '1/2/17' AS time UNION ALL
SELECT 'a1', '2/10/17' UNION ALL
SELECT 'a2', '2/18/17' UNION ALL
SELECT 'a4', '2/5/17' UNION ALL
SELECT 'a5', '3/25/17'
)
SELECT Year, Month, COUNT(DISTINCT user_id) AS User_Count
FROM (
SELECT *,
DATE_DIFF(time, LAG(time) OVER(PARTITION BY user_id ORDER BY time), MONTH) AS flag
FROM (
SELECT
user_id,
DATE_TRUNC(PARSE_DATE('%x', time), MONTH) AS time,
EXTRACT(YEAR FROM PARSE_DATE('%x', time)) AS Year,
FORMAT_DATE('%B', PARSE_DATE('%x', time)) AS Month
FROM yourTable
GROUP BY 1, 2, 3, 4
)
)
WHERE IFNULL(flag, 0) <> 1
GROUP BY Year, Month, time
ORDER BY time
The output is
Year Month User_Count
2017 January 1
2017 February 2
2017 March 1
Try this query:
SELECT
t1.d,
count(DISTINCT t1.user_id)
FROM
(
SELECT
EXTRACT(MONTH FROM time) AS d,
--EXTRACT(MONTH FROM time)-1 AS d2,
user_id
FROM nbitra.tmp
) t1
LEFT JOIN
(
SELECT
EXTRACT(MONTH FROM time) AS d,
user_id
FROM nbitra.tmp
) t2
ON t1.user_id = t2.user_id AND t1.d = t2.d+1 --Same user in the previous month
WHERE
(
t2.user_id IS NULL --User has no record in the previous month (this also covers January)
)
GROUP BY t1.d;

Retrieve the most recent 5 days of forecast for each city with the latest issue date

I need to retrieve the most recent 5 days of forecast info for each city.
My table looks like below
The real problem is with the issue date.
A city may have several forecasts for the same forecast date, each with a distinct issue date.
I need to retrieve the 5 most recent records for each city, using the latest issue date for each forecast date.
I have tried something like the query below, but it does not give the expected result:
SELECT * FROM(
SELECT
ROW_NUMBER () OVER (PARTITION BY CITY_ID ORDER BY FORECAST_DATE DESC, ISSUE_DATE DESC) AS rn,
CITY_ID, FORECAST_DATE, ISSUE_DATE
FROM
FORECAST
GROUP BY FORECAST_DATE
) WHERE rn <= 5
Any suggestion or advice will be helpful
This will get the latest issued forecast per day over the most recent 5 days for each city:
SELECT *
FROM (
SELECT f.*,
DENSE_RANK() OVER ( PARTITION BY city_id ORDER BY forecast_date DESC )
AS forecast_rank,
ROW_NUMBER() OVER ( PARTITION BY city_id, forecast_date ORDER BY issue_date DESC )
AS issue_rn
FROM Forecast f
)
WHERE forecast_rank <= 5
AND issue_rn = 1;
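The two window functions do different jobs: DENSE_RANK picks the 5 most recent forecast dates per city, and ROW_NUMBER picks the latest issue within each forecast date. The sketch below runs the same query in SQLite through Python's sqlite3 module; the rows in `forecast` are invented sample data, since the original table is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE forecast (city_id INTEGER, forecast_date TEXT, issue_date TEXT)"
)
conn.executemany(
    "INSERT INTO forecast VALUES (?, ?, ?)",
    [
        (1, "2020-01-01", "2019-12-31"),
        (1, "2020-01-02", "2020-01-01"),
        (1, "2020-01-03", "2020-01-01"),
        (1, "2020-01-04", "2020-01-01"),
        (1, "2020-01-05", "2020-01-01"),
        (1, "2020-01-06", "2020-01-01"),
        (1, "2020-01-06", "2020-01-02"),  # re-issued forecast for the same day
        (2, "2020-01-05", "2020-01-01"),
        (2, "2020-01-05", "2020-01-03"),
    ],
)

latest = conn.execute(
    """
    SELECT city_id, forecast_date, issue_date
    FROM (
        SELECT f.*,
               DENSE_RANK() OVER (PARTITION BY city_id
                                  ORDER BY forecast_date DESC) AS forecast_rank,
               ROW_NUMBER() OVER (PARTITION BY city_id, forecast_date
                                  ORDER BY issue_date DESC) AS issue_rn
        FROM forecast f
    ) ranked
    WHERE forecast_rank <= 5
      AND issue_rn = 1
    ORDER BY city_id, forecast_date
    """
).fetchall()

for row in latest:
    print(row)
```

City 1 has six forecast dates, so the oldest one drops out, and the forecast for 2020-01-06 is represented by its later issue only.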
Partition by works like group by but for the function only.
Try
with CTE as
(
select t1.*,
row_number() over (partition by city_id, forecast_date order by issue_date desc) as r_ord
from Forecast t1
)
select CTE.*
from CTE
where r_ord <= 5
Try this
SELECT * FROM(
SELECT
ROW_NUMBER () OVER (PARTITION BY CITY_ID, FORECAST_DATE order by ISSUE_DATE DESC) AS rn,
CITY_ID, FORECAST_DATE, ISSUE_DATE
FROM
FORECAST
) WHERE rn <= 5