How can I count users in a month that were not present in the month before? - sql

I am trying to count unique users on a monthly basis that were not present in the previous month. So if a user has a record for January and then another one for February, then I would only count January for that user.
user_id time
a1 1/2/17
a1 2/10/17
a2 2/18/17
a4 2/5/17
a5 3/25/17
My results should look like this
Month User Count
January 1
February 2
March 1

I'm not really familiar with BigQuery, but here's how I would solve the problem using TSQL. I imagine that you'd be able to use similar logic in BigQuery.
1). Order the data by user_id first, and then time. In TSQL, you can accomplish this with the following and store it in a common table expression, which you will query in the step after this.
;WITH cte AS
(
select ROW_NUMBER() OVER (PARTITION BY [user_id] ORDER BY [time]) AS rn,*
from dbo.employees
)
2). Next query for only the rows with rn = 1 (the first occurrence for a particular user) and group by the month.
select DATENAME(month, [time]) AS [Month], count(*) AS user_count
from cte
where rn = 1
group by DATENAME(month, [time])
This is assuming that 2017 is the only year you're dealing with. If you're dealing with more than one year, you probably want step #2 to look something like this:
select year([time]) as [year], DATENAME(month, [time]) AS [month],
count(*) AS user_count
from cte
where rn = 1
group by year([time]), DATENAME(month, [time])

First aggregate by the user id and the month. Then use lag() to see if the user was present in the previous month:
with du as (
select date_trunc(time, month) as yyyymm, user_id
from t
group by date_trunc(time, month)
)
select yyyymm, count(*)
from (select du.*,
lag(yyyymm) over (partition by user_id order by yyyymm) as prev_yyyymm
from du
) du
where prev_yyyymm is not null or
prev_yyyymm < date_add(yyyymm, interval 1 month)
group by yyyymm;
Note: This uses the date functions, but similar functions exist for timestamp.

The way I understood question is - to exclude user to be counted in given month only if same user presented in previous month. But if same user present in few months before given, but not in previous - user should be counted.
If this is correct - Try below for BigQuery Standard SQL
#standardSQL
SELECT Year, Month, COUNT(DISTINCT user_id) AS User_Count
FROM (
SELECT *,
DATE_DIFF(time, LAG(time) OVER(PARTITION BY user_id ORDER BY time), MONTH) AS flag
FROM (
SELECT
user_id,
DATE_TRUNC(PARSE_DATE('%x', time), MONTH) AS time,
EXTRACT(YEAR FROM PARSE_DATE('%x', time)) AS Year,
FORMAT_DATE('%B', PARSE_DATE('%x', time)) AS Month
FROM yourTable
GROUP BY 1, 2, 3, 4
)
)
WHERE IFNULL(flag, 0) <> 1
GROUP BY Year, Month, time
ORDER BY time
you can test / play with above using below example with dummy data from your question
#standardSQL
WITH yourTable AS (
SELECT 'a1' AS user_id, '1/2/17' AS time UNION ALL
SELECT 'a1', '2/10/17' UNION ALL
SELECT 'a2', '2/18/17' UNION ALL
SELECT 'a4', '2/5/17' UNION ALL
SELECT 'a5', '3/25/17'
)
SELECT Year, Month, COUNT(DISTINCT user_id) AS User_Count
FROM (
SELECT *,
DATE_DIFF(time, LAG(time) OVER(PARTITION BY user_id ORDER BY time), MONTH) AS flag
FROM (
SELECT
user_id,
DATE_TRUNC(PARSE_DATE('%x', time), MONTH) AS time,
EXTRACT(YEAR FROM PARSE_DATE('%x', time)) AS Year,
FORMAT_DATE('%B', PARSE_DATE('%x', time)) AS Month
FROM yourTable
GROUP BY 1, 2, 3, 4
)
)
WHERE IFNULL(flag, 0) <> 1
GROUP BY Year, Month, time
ORDER BY time
The output is
Year Month User_Count
2017 January 1
2017 February 2
2017 March 1

Try this query:
SELECT
t1.d,
count(DISTINCT t1.user_id)
FROM
(
SELECT
EXTRACT(MONTH FROM time) AS d,
--EXTRACT(MONTH FROM time)-1 AS d2,
user_id
FROM nbitra.tmp
) t1
LEFT JOIN
(
SELECT
EXTRACT(MONTH FROM time) AS d,
user_id
FROM nbitra.tmp
) t2
ON t1.d = t2.d+1
WHERE
(
t1.user_id <> t2.user_id --User is in previous month
OR t2.user_id IS NULL --To handle january, since there is no previous month to compare to
)
GROUP BY t1.d;

Related

How to Count Entries on Same Day and Sum Amount based on the Count?

I am attempting to produce Table2 below - which essentially counts the rows that have the same day and adds up the "amount" column for the rows that are on the same day.
I found a solution online that can count entries from the same day, which works:
SELECT
DATE_TRUNC('day', datetime) AS date,
COUNT(datetime) AS date1
FROM Table1
GROUP BY DATE_TRUNC('day', datetime);
It is partially what I am looking for, but I am having difficulty trying to display all the column names.
In my attempt, I have all the columns I want but the Accumulated Count is not accurate since it counts the rows with unique IDs (because I put "id" in GROUP BY):
SELECT *, count(id) OVER(ORDER BY DateTime) as accumulated_count,
SUM(Amount) OVER(ORDER BY DateTime) AS Accumulated_Amount
FROM Table1
GROUP BY date(datetime), id
I've been working on this for days and seemingly have come across every possible outcome that is not what I am looking for. Does anyone have an idea as to what I'm missing here?
Cumulative sum and count should be calculated for each day
with Table1 (id,datetime,client,product,amount) as(values
(1 ,to_timestamp('2020-07-08 07:30:10','YYYY-MM-DD HH24:MI:SS'),'Tom','Bill Payment',24),
(2 ,to_timestamp('2020-07-08 07:50:30','YYYY-MM-DD HH24:MI:SS'),'Tom','Bill Payment',27),
(3 ,to_timestamp('2020-07-09 08:20:10','YYYY-MM-DD HH24:MI:SS'),'Tom','Bill Payment',37)
)
SELECT
Table1.*,
count(*) over (partition by DATE_TRUNC('day', datetime)
order by datetime asc ) accumulated_count,
sum(amount) over (partition by DATE_TRUNC('day', datetime) order by datetime asc) accumulated_sum
FROM Table1;
Not to familiar with postgresql but this does what you ask fror.
with data (id,date_time,client,product,amount) as(
select 1 ,to_timestamp('Jul 08 2020, 07:30:10','Mon DD YYYY, HH24:MI:SS'),'Tom','Bill',24 Union all
select 2 ,to_timestamp('Jul 08 2020, 07:50:30','Mon DD YYYY, HH24:MI:SS'),'Tom','Bill',27 Union all
select 3 ,to_timestamp('Jul 09 2020, 08:20:10','Mon DD YYYY, HH24:MI:SS'),'Tom','Bill',37
)
select d.id,d.date_time,d.client,d.product,d.amount,
(select count(*) from data d1
where d1.date_time <= d.date_time and date(d1.date_time) = date(d.date_time) ) acc_count,
(select sum(amount) from data d1
where d1.date_time <= d.date_time and date(d1.date_time) = date(d.date_time) ) acc_amount
from data d

Sum of unique customers in rolling trailing 30d window displayed by week

I'm working in SQL Workbench.
I'd like to track every time a unique customer clicks the new feature in trailing 30 days, displayed week over week. An example of the data output would be as follows:
Week 51: Reflects usage through the end of week 51 (Dec 20th) - 30 days. aka Nov 20-Dec 20th
Week 52: Reflects usage through the end of week 52 (Dec 31st) - 30 days. aka Dec 1 - Dec 31st.
Say there are 22MM unique customer clicks that occurred from Nov 20-Dec 20th. Week 51 data = 22MM.
Say there are 25MM unique customer clicks that occurred from Dec 1-Dec 31st. Week 52 data = 25MM. The customer uniqueness is only relevant to that particular week. Aka, if a customer clicks twice in Week 51 they're only counted once. If they click once in Week 51 and once in Week 52, they are counted once in each week.
Here is what I have so far:
select
min_e_date
,sum(count(*)) over (order by min_e_date rows between unbounded preceding and current row) as running_distinct_customers
from (select customer_id, min(DATE_TRUNC('week', event_date)) as min_e_date
from final
group by 1
) c
group by
min_e_date
I don't think a rolling count is the right way to go. As I add in additional parameters (country, subscription), the rolling count doesn't distinguish between them - the figures just get added to the prior row.
Any suggestions are appreciated!
edit Additional data below. Data collection begins on 11/23. No data precedes that date.
You can get the count of distinct customers per week like so:
select date_trunc('week', event_date) as week_start,
count(distinct customer_id) cnt
from final
group by 1
Now if you want a rolling sum of that count(say, the current week and the three preceding weeks), you can use window functions:
select date_trunc('week', event_date) as week_start,
count(distinct customer_id) cnt,
sum(count(distinct customer_id)) over(
order by date_trunc('week', event_date)
range between 3 week preceding and current row
) as rolling_cnt
from final
group by 1
Rolling distinct counts are quite difficult in RedShift. One method is a self-join and aggregation:
select t.date,
count(distinct case when tprev.date >= t.date - interval '6 day' then customer_id end) as trailing_7,
count(distinct customer_id) as trailing_30
from t join
t tprev
on tprev.date >= t.date - interval '29 day' and
tprev.date <= t.date
group by t.date;
If you can get this to work, you can just select every 7th row to get the weekly values.
EDIT:
An entirely different approach is to use aggregation and keep track of when customers enter and end time periods of being counted. This is a pain with two different time frames. Here is what it looks like for one.
The idea is to
Create an enter/exit record for each record being counted. The "exit" is n days after the enter.
Summarize these into periods of activity for each customer. So, there is one record with an enter and exit date. This is a type of gaps-and-islands problem.
Unpivot this result to count +1 for a customer being counted and -1 for a customer not being counted.
Do a cumulative sum of this count.
The code looks something like this:
with cd as (
select customer_id, date,
lead(date) over (partition by customer_id order by date) as next_date,
sum(sum(inc)) over (partition by customer_id order by date) as cnt
from ((select t.customer_id, t.date, 1 as inc
from t
) union all
(select t.customer_id, t.date + interval '7 day', -1
from t
)
) tt
),
cd2 as (
select customer_id, min(date) as enter_date, max(date) as exit_date
from (select cd.*,
sum(case when cnt = 0 then 1 else 0 end) over (partition by customer_id order by date) as grp
from (select cd.*,
lag(cnt) over (partition by customer_id order by date) as prev_cnt
from cd
) cd
) cd
group by customer_id, grp
having max(cnt) > 0
)
select dte, sum(sum(inc)) over (order by dte)
from ((select customer_id, enter_date as dte, 1 as inc
from cd2
) union all
(select customer_id, exit_date as dte, -1 as inc
from cd2
)
) cd2
group by dte;

Get last data recorded of the date and group it by month

tbl_totalMonth has id,time, date and kwh column.
I want to get the last recorded data of the months and group it per month so the result would be the name of the month and kwh.
the result should be something like this:
month | kwh
------------
January | 150
February | 400
the query I tried: (but it returns the max kwh not the last kwh recorded)
SELECT DATENAME(MONTH, a.date) as monthly, max(a.kwh) as kwh
from tbl_totalMonth a
WHERE date > = DATEADD(yy,DATEDIFF(yy,0, GETDATE() -1 ),0)
group by DATENAME(MONTH, a.date)
I suspect you need something quite different:
select *
from (
select *
, row_number() over(partition by month(a.date), year(a.date) order by a.date DESC) as rn
from tbl_totalMonth a
WHERE date > = DATEADD(yy,DATEDIFF(yy,0, GETDATE() -1 ),0)
) d
where rn = 1
To get "the last kwh recorded (per month)" you need to use row_number() which - per month - will order the rows (descending) and give each one a row number. When that number is 1 you have "the most recent" row for that month, and you won't need group by at all.
You could use group by and month
select datename(month, date), sum(kwh)
from tbl_totalMonth
where date = (select max(date) from tbl_totalMonth )
group by datename(month, date)
if you need only the last row for each month then youn should use
select datename(month, date), khw
from tbl_totalMonth a
inner join (
select max(date) as max_date
from tbl_totalMonth
group by month(date)) t on t.max_date = a.date

Filter rows in PostgreSQL based on values of consecutive rows in one column

So I'm working with the following postgresql table:
10 rows from PostGreSQL table
For each business_id, I want to filter out those businesses where the review_count isn't above a specific review_count threshold for 2 consecutive months (or rows). Depending on the city the business_id is in, the threshold will be different (so for example, in the screenshot above, we can assume rows with city = Charlotte has a review_count threshold of >= 2, and those with city = Las Vegas has a review_count threshold of >= 3. If a business_id does not have at least one instance of consecutive months with review_counts above the specified threshold, I want to filter it out.
I want this query to return only the business_ids that meet this condition (as well as all the other columns in the table that go along with that business_id). The composite primary key on this table is (business_id, year, month).
Some months, as you may notice, are missing from the data (month 9 of the second business_id). If that is the case, I do NOT want to count 2 rows as 'consecutive months'. For example, for the business in Las Vegas, I do NOT want to consider month 8 to 10 as 'consecutive months', even though they appear in consecutive rows.
I've tried something like this, but have kind of run into a wall and don't think its getting me far:
SELECT *
FROM us_business_monthly_review_growth
WHERE business_id IN (SELECT DISTINCT(business_id)
FROM us_business_monthly_review_growth
GROUP BY business_id, year, month
HAVING (city = 'Las Vegas'
AND (CASE WHEN COUNT(review_count >= 2 * 2.21) >= 2))
OR (city = 'Charlotte' AND (CASE WHEN COUNT(review_count >= 2 * 1.95) >= 2))
I'm new to Postgre and StackOverflow, so if you have any feedback on the way I asked this question please don't hesitate to let me know! =)
UPDATE:
Thanks to some help from #Gordon Linoff, I found the following solution:
SELECT *
FROM us_businesses_monthly_growth_and_avg
WHERE business_id IN (SELECT distinct(business_id)
FROM (SELECT *,
lag(year) OVER (PARTITION BY business_id ORDER BY year, month) AS prev_year,
lag(month) OVER (PARTITION BY business_id ORDER BY year, month) AS prev_month,
lag(review_count) OVER (PARTITION BY business_id ORDER BY year, month) AS prev_review_count
FROM us_businesses_monthly_growth_and_avg
) AS usga
WHERE (city = 'Charlotte' AND review_count >= 4 * 1.95 AND prev_review_count >= 4 * 1.95 AND (YEAR * 12 + month) = (prev_year * 12 + prev_month) + 1)
OR (city = 'Las Vegas' AND review_count >= 4 * 3.31 AND prev_review_count >= 4 * 3.31 AND (YEAR * 12 + month) = (prev_year * 12 + prev_month) + 1);
You can do this with lag():
select distinct business_id
from (select t.*,
lag(year) over (partition by business_id order by year, month) as prev_year,
lag(month) over (partition by business_id order by year, month) as prev_month,
lag(rating) over (partition by business_id order by year, month) as prev_rating
from us_business_monthly_review_growth t
) t
where rating >= $threshhold and prev_rating >= $threshhold and
(year * 12 + month) = (prev_year * 12 + prev_month) + 1;
The only trick is setting the threshold value. I have no idea how you plan on doing that.
Please try...
SELECT business_id
FROM
(
SELECT business_id AS business_id,
LAG( business_id, -1 ) OVER ( ORDER BY business_id, year, month ) AS lag_in_business_id,
city,
LAG( year, -1 ) OVER ( ORDER BY business_id, year, month ) * 12 + LAG( month, -1 ) OVER ( ORDER BY business_id, year, month ) AS diffInDates,
review_count AS review_count
FROM us_business_monthly_review_growth
order BY business_id,
year,
month
) tempTable
JOIN tblCityThresholds ON tblCityThresholds.city = tempTable.city
WHERE business_id = lag_in_business_id
AND diffInDates = 1
AND tblCityThresholds.threshold <= review_count
GROUP BY business_id;
In formulating this answer I first used the following code to test that LAG() performed as hoped...
SELECT business_id,
LAG( business_id, 1 ) OVER ( ORDER BY business_id, year, month ) AS lag_in_business_id,
year,
month,
LAG( year, 1 ) OVER ( ORDER BY business_id, year, month ) * 12 + LAG( month, 1 ) OVER ( ORDER BY business_id, year, month ) AS diffInDates
FROM mytable
ORDER BY business_id,
year,
month;
Here I was trying to get LAG() to refer to values on the next row, but the output showed that it was referring to the previous row in that comparison. Unfortunately I wanted to compare current values with the next one to see if the next record had the same business_id, etc. So I changed the 1 in LAG() to `-1', giving me...
SELECT business_id,
LAG( business_id, -1 ) OVER ( ORDER BY business_id, year, month ) AS lag_in_business_id,
year,
month,
LAG( year, -1 ) OVER ( ORDER BY business_id, year, month ) * 12 + LAG( month, -1 ) OVER ( ORDER BY business_id, year, month ) AS diffInDates
FROM mytable
ORDER BY business_id,
year,
month;
As this gave me the desired results I added city, to allow a JOIN between the results and an assumed table holding the details of each city and its corresponding threshold. I chose the name tblCityThresholds as a suggestion since I am not sure what you have / would call it. This completed the inner SELECT statement.
I then joined the results of the inner SELECT statement to tblCityThresholds and refined the output as per your criteria. Note : It is assumed that the city field will always have a corresponding entry in tblCityThresholds;
I then used GROUP BY to ensure no repetition of a business_id.
If you have any questions or comments, then please feel free to post a Comment accordingly.
Further Reading
https://www.postgresql.org/docs/8.4/static/functions-window.html (in regards LAG())

Google BigQuery: Rolling Count Distinct

I have a table with is simply a list of dates and user IDs (not aggregated).
We define a metric called active users for a given date by counting the distinct number of IDs that appear in the previous 45 days.
I am trying to run a query in BigQuery that, for each day, returns the day plus the number of active users for that day (count distinct user from 45 days ago until today).
I have experimented with window functions, but can't figure out how to define a range based on the date values in a column. Instead, I believe the following query would work in a database like MySQL, but does not in BigQuery.
SELECT
day,
(SELECT
COUNT(DISTINCT visid)
FROM daily_users
WHERE day BETWEEN DATE_ADD(t.day, -45, "DAY") AND t.day
) AS active_users
FROM daily_users AS t
GROUP BY 1
This doesn't work in BigQuery: "Subselect not allowed in SELECT clause."
How to do this in BigQuery?
BigQuery documentation claims that count(distinct) works as a window function. However, that doesn't help you, because you are not looking for a traditional window frame.
One method would adds a record for each date after a visit:
select theday, count(distinct visid)
from (select date_add(u.day, n.n, "day") as theday, u.visid
from daily_users u cross join
(select 1 as n union all select 2 union all . . .
select 45
) n
) u
group by theday;
Note: there may be simpler ways to generate a series of 45 integers in BigQuery.
Below should work with BigQuery
#legacySQL
SELECT day, active_users FROM (
SELECT
day,
COUNT(DISTINCT id)
OVER (ORDER BY ts RANGE BETWEEN 45*24*3600 PRECEDING AND CURRENT ROW) AS active_users
FROM (
SELECT day, id, TIMESTAMP_TO_SEC(TIMESTAMP(day)) AS ts
FROM daily_users
)
) GROUP BY 1, 2 ORDER BY 1
Above assumes that day field is represented as '2016-01-10' format.
If it is not a case , you should adjust TIMESTAMP_TO_SEC(TIMESTAMP(day)) in most inner select
Also please take a look at COUNT(DISTINC) specifics in BigQuery
Update for BigQuery Standard SQL
#standardSQL
SELECT
day,
(SELECT COUNT(DISTINCT id) FROM UNNEST(active_users) id) AS active_users
FROM (
SELECT
day,
ARRAY_AGG(id)
OVER (ORDER BY ts RANGE BETWEEN 3888000 PRECEDING AND CURRENT ROW) AS active_users
FROM (
SELECT day, id, UNIX_DATE(PARSE_DATE('%Y-%m-%d', day)) * 24 * 3600 AS ts
FROM daily_users
)
)
GROUP BY 1, 2
ORDER BY 1
You can test / play with it using below dummy sample
#standardSQL
WITH daily_users AS (
SELECT 1 AS id, '2016-01-10' AS day UNION ALL
SELECT 2 AS id, '2016-01-10' AS day UNION ALL
SELECT 1 AS id, '2016-01-11' AS day UNION ALL
SELECT 3 AS id, '2016-01-11' AS day UNION ALL
SELECT 1 AS id, '2016-01-12' AS day UNION ALL
SELECT 1 AS id, '2016-01-12' AS day UNION ALL
SELECT 1 AS id, '2016-01-12' AS day UNION ALL
SELECT 1 AS id, '2016-01-13' AS day
)
SELECT
day,
(SELECT COUNT(DISTINCT id) FROM UNNEST(active_users) id) AS active_users
FROM (
SELECT
day,
ARRAY_AGG(id)
OVER (ORDER BY ts RANGE BETWEEN 86400 PRECEDING AND CURRENT ROW) AS active_users
FROM (
SELECT day, id, UNIX_DATE(PARSE_DATE('%Y-%m-%d', day)) * 24 * 3600 AS ts
FROM daily_users
)
)
GROUP BY 1, 2
ORDER BY 1