SQL counting distinct users over a growing timeframe - sql

I don't think I titled this properly, but in essence I want a cumulative count of distinct users: each period's count should include every distinct user seen so far, not only the users active in that period. As an example, say we have a dataset of user purchases over time:
Date | User
-----------------
2/3/22 | A
2/4/22 | B
2/22/22 | C
3/2/22 | A
3/4/22 | D
3/15/22 | A
4/30/22 | B
Generally, if I were to count distincts grouped by months as would be normal we would get:
Date | Count
-----------------
2/1/22 | 3
3/1/22 | 2
4/1/22 | 1
But what I'm really wanting to see would be how the total number of distinct users increases over the time period.
Date | Count
-----------------
2/1/22 | 3
3/1/22 | 4
4/1/22 | 4
As such it would be 3 distinct users for the first month. Then 4 for the second month considering the total number of distinct users grew by one with the addition of "D" while "A" isn't counted because it was already recognized as a distinct user in the previous month. The third month would then still be 4 because no new distinct user performed an action that month.
Any help would be greatly appreciated (even if it is just a better title so that it reaches more people more appropriately haha)

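Here's a solution based on a running sum in Postgres that should translate well to Vertica. row_number() marks each user's first purchase (rn = 1), the inner count() tallies those first purchases per month, and the window sum() accumulates them over time.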
select date_trunc('month', "Date") as "Date"
,sum(count(case rn when 1 then 1 end)) over (order by date_trunc('month', "Date")) as "Count"
from (
select "Date"
,"User"
,row_number() over(partition by "User" order by "Date") as rn
from t
) t
group by date_trunc('month', "Date")
order by "Date"
Date | Count
-----------------
2022-02-01 00:00:00 | 3
2022-03-01 00:00:00 | 4
2022-04-01 00:00:00 | 4
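The same idea can be tried locally. The following is a minimal sketch using Python's built-in sqlite3 module, not the original Postgres query: the table and column names (purchases, purchase_date, user_name) are made up for illustration, the dates are rewritten in ISO format, strftime() stands in for date_trunc(), and it assumes a SQLite build with window-function support (3.25 or later).

```python
import sqlite3

# Sample purchases from the question. Dates are rewritten as ISO strings
# (an assumption of this sketch) so that SQLite's strftime() can parse them.
PURCHASES = [
    ("2022-02-03", "A"), ("2022-02-04", "B"), ("2022-02-22", "C"),
    ("2022-03-02", "A"), ("2022-03-04", "D"), ("2022-03-15", "A"),
    ("2022-04-30", "B"),
]

QUERY = """
WITH ranked AS (
    -- Number each user's purchases chronologically; rn = 1 is the first one.
    SELECT strftime('%Y-%m-01', purchase_date) AS month,
           ROW_NUMBER() OVER (PARTITION BY user_name
                              ORDER BY purchase_date) AS rn
    FROM purchases
),
monthly AS (
    -- Count first purchases only, i.e. users who are new in that month.
    SELECT month, COUNT(CASE WHEN rn = 1 THEN 1 END) AS new_users
    FROM ranked
    GROUP BY month
)
-- Running sum of new users = distinct users seen so far.
SELECT month, SUM(new_users) OVER (ORDER BY month) AS distinct_users
FROM monthly
ORDER BY month
"""


def running_distinct_users(rows):
    """Return (month, cumulative distinct users) tuples for the given rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE purchases (purchase_date TEXT, user_name TEXT)")
    conn.executemany("INSERT INTO purchases VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


print(running_distinct_users(PURCHASES))
```

Months in which nobody new shows up (April here) still appear, because every purchase row is grouped and only the rn = 1 rows are counted.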

Related

Running sum of unique users in redshift

I have a table with as follows with user visits by day -
| date | user_id |
|:-------- |:-------- |
| 01/31/23 | a |
| 01/31/23 | a |
| 01/31/23 | b |
| 01/30/23 | c |
| 01/30/23 | a |
| 01/29/23 | c |
| 01/28/23 | d |
| 01/28/23 | e |
| 01/01/23 | a |
| 12/31/22 | c |
I am looking to get a running total of unique user_ids for the last 30 days. Here is the expected output -
| date | distinct_users|
|:-------- |:-------- |
| 01/31/23 | 5 |
| 01/30/23 | 4 |
.
.
.
Here is the query I tried -
SELECT date
, SUM(COUNT(DISTINCT user_id)) over (order by date rows between 30 preceding and current row) AS unique_users
FROM mytable
GROUP BY date
ORDER BY date DESC
The problem I am running into is that this query is not counting unique user_ids - for instance, the result I am getting for 01/31/23 is 9 instead of 5, because user_id 'a' is counted every time it occurs.
Thank you, appreciate your help!
Not the most performant approach, but you could use a correlated subquery to find the distinct count of users over a window of the past 30 days:
SELECT
    t1.date,
    (SELECT COUNT(DISTINCT t2.user_id)
     FROM mytable t2
     WHERE t2.date BETWEEN t1.date - INTERVAL '30 day' AND t1.date) AS distinct_users
FROM (SELECT DISTINCT date FROM mytable) t1  -- one output row per date, not per visit
ORDER BY t1.date;
There are a few things going on here. First, window functions run after GROUP BY and aggregation, so COUNT(DISTINCT user_id) gives the count of user_ids for each date and only then does the window function sum those per-date counts. Second, a window function set up like this works over the past 30 rows, not the past 30 days, so you will need to fill in missing dates to use it.
As to how to do this - I can only think of the "expand the data so each date and user_id has a row" method. This will require a CTE to generate the last 2 years of dates plus 30 days so that the look-back window works for the first dates. Then window over the past 30 days for each user_id and date to see which rows have an occurrence of this user_id within the past 30 days, setting the value to NULL if the user_id does not occur within the window. Then count the non-NULL user_id values, grouping by just date, to get the number of unique user_ids for that date.
This means expanding the data significantly, but I see no other way to get truly unique user_ids over the past 30 days. I can help code this up if you need, but it will look something like:
WITH RECURSIVE CTE to generate the needed dates,
CTE to cross join these dates with a distinct set of all the user_ids in user for the past 2 years,
CTE to join the date/user_id data set with the table of real data for past 2 years and 30 days and window back counting non-NULL user_ids, partition by date and user_id, order by date, and setting any zero counts to NULL with a DECODE() or CASE statement,
SELECT, grouping by just date count the user_ids by date;
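To make the difference between "sum of per-day distinct counts" and "distinct users within the window" concrete, here is a small sketch using Python's built-in sqlite3 module rather than Redshift. The column name visit_date is an assumption (the question calls it date), dates are rewritten in ISO format, and the second query uses a correlated subquery over distinct dates instead of the expand-the-data approach outlined above. It assumes a SQLite build with window-function support (3.25 or later).

```python
import sqlite3

# Sample visits from the question, with dates rewritten as ISO strings.
VISITS = [
    ("2023-01-31", "a"), ("2023-01-31", "a"), ("2023-01-31", "b"),
    ("2023-01-30", "c"), ("2023-01-30", "a"), ("2023-01-29", "c"),
    ("2023-01-28", "d"), ("2023-01-28", "e"), ("2023-01-01", "a"),
    ("2022-12-31", "c"),
]

# What the question's query effectively computes: per-day distinct counts,
# summed over the previous 30 *rows* (users are re-counted on every day).
SUM_OF_DAILY_COUNTS = """
WITH daily AS (
    SELECT visit_date, COUNT(DISTINCT user_id) AS n
    FROM mytable
    GROUP BY visit_date
)
SELECT visit_date,
       SUM(n) OVER (ORDER BY visit_date
                    ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) AS total
FROM daily
ORDER BY visit_date DESC
"""

# Distinct users over the trailing 30 *days*, one output row per date.
DISTINCT_IN_WINDOW = """
SELECT d.visit_date,
       (SELECT COUNT(DISTINCT m.user_id)
        FROM mytable m
        WHERE m.visit_date BETWEEN date(d.visit_date, '-30 days')
                               AND d.visit_date) AS distinct_users
FROM (SELECT DISTINCT visit_date FROM mytable) d
ORDER BY d.visit_date DESC
"""


def run(query, rows):
    """Load rows into an in-memory table and return the query result."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mytable (visit_date TEXT, user_id TEXT)")
    conn.executemany("INSERT INTO mytable VALUES (?, ?)", rows)
    result = conn.execute(query).fetchall()
    conn.close()
    return result


print(run(SUM_OF_DAILY_COUNTS, VISITS)[0])
print(run(DISTINCT_IN_WINDOW, VISITS)[:2])
```

On this sample the first query reports 9 for 01/31/23, the figure the asker saw, while the second reports 5 and then 4 for the two most recent dates, matching the expected output.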

Querying the retention rate on multiple days with SQL

Given a simple data model that consists of a user table and a check_in table with a date field, I want to calculate the retention rate of my users. So for example, for all users with one or more check-ins, I want the percentage of users who checked in on their 2nd day, on their 3rd day and so on.
My SQL skills are pretty basic as it's not a tool that I use that often in my day-to-day work, and I know that this is beyond the types of queries I am used to. I've been looking into pivot tables to achieve this but I am unsure if this is the correct path.
Edit:
The user table does not have a registration date. One can assume it only contains the ID for this example.
Here is some sample data for the check_in table:
| user_id | date |
=====================================
| 1 | 2020-09-02 13:00:00 |
-------------------------------------
| 4 | 2020-09-04 12:00:00 |
-------------------------------------
| 1 | 2020-09-04 13:00:00 |
-------------------------------------
| 4 | 2020-09-04 11:00:00 |
-------------------------------------
| ... |
-------------------------------------
And the expected output of the query would be something like this:
| day_0 | day_1 | day_2 | day_3 |
=================================
| 70% | 67 % | 44% | 32% |
---------------------------------
Please note that I've used random numbers for this output just to illustrate the format.
Oh, I see. Assuming you mean days between checkins for users -- and users might have none -- then just use aggregation and window functions:
select sum( (ci.date = ci.min_date)::int )::numeric / u.num_users as day_0,
       sum( (ci.date = ci.min_date + interval '1 day')::int )::numeric / u.num_users as day_1,
       sum( (ci.date = ci.min_date + interval '2 day')::int )::numeric / u.num_users as day_2
from (select u.*, count(*) over () as num_users
      from users u
     ) u left join
     (select ci.user_id, ci.date::date as date,
             -- each user's first check-in date, repeated on every row for that user
             min(min(ci.date::date)) over (partition by ci.user_id) as min_date
      from checkins ci
      group by ci.user_id, ci.date::date
     ) ci
     on ci.user_id = u.id  -- assumes the key column of users is named id
group by u.num_users;
Note that this aggregates the checkins table by user id and date. This ensures that there is only one row per user per date, so a user with several check-ins on the same day is counted once.

Query and return user requests if dates are consecutive

I am attempting to group records together by consecutive dates in the request_date column and user field but only return if the count is equal or above a certain number, say 3.
At the moment the Columns I have would be
user_id | request_date |
--------|--------------|
3 | 2019-01-01 |
5 | 2019-05-08 |
3 | 2019-01-02 |
4 | 2019-08-09 |
3 | 2019-01-03 |
the query would ideally return something along the lines of:
user_id: 3
num_of_reqs: 3
first_date: 2019-01-01
last_date: 2019-01-03
any insight would be appreciated.
You can use window functions. In particular, subtracting an increasing sequence number from the date column gives a constant value for rows whose dates are consecutive, so that difference can be used as a grouping key.
Something like this:
select user_id, count(*) as num_of_reqs,
       min(request_date) as first_date, max(request_date) as last_date
from (select t.*,
             row_number() over (partition by user_id order by request_date) as seqnum
      from t
     ) t
group by user_id, (request_date - seqnum)
If you want to limit to a particular number, then add a having clause:
having count(*) >= 3
for instance.
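As a quick check of the date-minus-sequence trick on the sample data, here is a sketch using Python's built-in sqlite3 module. The table name requests is made up for illustration, and because SQLite has no date type, the date is converted to an integer day number with julianday() before the row number is subtracted. It assumes a SQLite build with window-function support (3.25 or later) and at most one request per user per day.

```python
import sqlite3

# Sample rows from the question.
REQUESTS = [
    (3, "2019-01-01"), (5, "2019-05-08"), (3, "2019-01-02"),
    (4, "2019-08-09"), (3, "2019-01-03"),
]

QUERY = """
WITH numbered AS (
    SELECT user_id, request_date,
           -- Day number minus row number stays constant within a run of
           -- consecutive dates, so it identifies each run ("island").
           CAST(julianday(request_date) AS INTEGER)
             - ROW_NUMBER() OVER (PARTITION BY user_id
                                  ORDER BY request_date) AS grp
    FROM requests
)
SELECT user_id,
       COUNT(*)          AS num_of_reqs,
       MIN(request_date) AS first_date,
       MAX(request_date) AS last_date
FROM numbered
GROUP BY user_id, grp
HAVING COUNT(*) >= ?
ORDER BY user_id, first_date
"""


def consecutive_runs(rows, min_length):
    """Return runs of consecutive request dates with at least min_length rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE requests (user_id INTEGER, request_date TEXT)")
    conn.executemany("INSERT INTO requests VALUES (?, ?)", rows)
    result = conn.execute(QUERY, (min_length,)).fetchall()
    conn.close()
    return result


print(consecutive_runs(REQUESTS, 3))
```

With a minimum length of 3, only user 3's run from 2019-01-01 to 2019-01-03 is returned; lowering the threshold to 1 also returns the single-day runs of users 4 and 5.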

How to get sum of one day and sum of last three days in single query?

Suppose I have a statistical table like this:
date | stats
-------------
10/1 | 2
10/1 | 3
10/1 | 2
10/2 | 1
10/3 | 3
10/3 | 2
10/4 | 1
10/4 | 1
What I want is three columns:
Date
sum(stats) of Date
sum(stats) of last three days before Date
I know I can use a window function to handle the 2nd column, but I cannot handle the 2nd and 3rd at the same time.
What should I do to achieve this?
Thanks!
You can use aggregation and window functions:
select date, sum(stats) as day_stats,
sum(sum(stats)) over (order by date rows between 3 preceding and 1 preceding) as day_stats_3
from t
group by date
order by date;
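You can also use a correlated subquery. Note that, as written, the 3-day window here covers the current date and the two days before it: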
SELECT s.date,sum(s.stats) as today_sum,
(SELECT sum(t.stats) FROM YourTable t
where t.date between s.date - 2 and s.date) as sum_3days
FROM YourTable s
GROUP BY s.date
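For a quick sanity check of the window-function version, here is a sketch using Python's built-in sqlite3 module. The year 2023 and the column name stat_date are assumptions added for illustration (the question gives only month/day). As in the answer above, the frame counts rows, so it equals "the previous three days" only when every date has at least one row. It assumes a SQLite build with window-function support (3.25 or later).

```python
import sqlite3

# Sample data from the question; the year is hypothetical.
STATS = [
    ("2023-10-01", 2), ("2023-10-01", 3), ("2023-10-01", 2),
    ("2023-10-02", 1),
    ("2023-10-03", 3), ("2023-10-03", 2),
    ("2023-10-04", 1), ("2023-10-04", 1),
]

QUERY = """
WITH daily AS (
    -- First collapse to one row per date.
    SELECT stat_date, SUM(stats) AS day_stats
    FROM t
    GROUP BY stat_date
)
SELECT stat_date,
       day_stats,
       -- Then sum the three preceding rows, excluding the current one.
       SUM(day_stats) OVER (ORDER BY stat_date
                            ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING)
           AS prev_3_days
FROM daily
ORDER BY stat_date
"""


def daily_and_previous_three(rows):
    """Return (date, day total, total of the previous three rows) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (stat_date TEXT, stats INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


for row in daily_and_previous_three(STATS):
    print(row)
```

The first date has no preceding rows, so its prev_3_days value comes back as NULL (None in Python); wrap the window sum in COALESCE(..., 0) if a zero is preferred.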

How do I get a trailing week count for every day over a given period (on Postgres)?

Say I’ve got an events table with just the columns id and occurred (which is just a datetime).
I want to get, for every day in a given period, the number of events in the previous week. So, let’s say the period was Jan 1 through April 1. I’d want the results of this query to look like:
_______________
|count | date |
|------|------|
| 3 | 1/1 |
| 2 | 1/2 |
| 0 | 1/3 |
| 4 | 1/4 |
---------------
Where count is, for that date, the number of events that happened in the week prior. So, the 3 count for 1/1 is how many events happened between Dec 25th and Jan 1.
I could do this easily enough in code:
for (date in 1/1 to 4/1) {
start_date = date - 7 days
db.query('SELECT COUNT(1) FROM events WHERE occurred > start_date AND occurred < date')
}
Unfortunately, this would result in over a hundred separate queries. I’d like to figure out how to do this in one query.
Hmm, you can generate all the dates in the period using generate_series(). Then join in the data and do a moving sum over the current day and the six days before it:
select dd.dte,
sum(coalesce(e.cnt, 0)) over (order by dd.dte rows between 6 preceding and current row) as cnt_7day
from generate_series('2015-01-01'::timestamp, '2015-04-01'::timestamp, '1 day'::interval) dd(dte) left join
(select date_trunc('day', occurred) as dte, count(*) as cnt
from events e
group by date_trunc('day', occurred)
) e
on e.dte = dd.dte
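A runnable sketch of the same approach, using Python's built-in sqlite3 module: since SQLite has no generate_series() by default, a recursive CTE generates the calendar instead. The event timestamps and the ten-day period below are invented purely for illustration, and a SQLite build with window-function support (3.25 or later) is assumed.

```python
import sqlite3

# Hypothetical events; only the occurred column matters here.
EVENTS = [
    ("2015-01-01 09:00:00",), ("2015-01-01 17:30:00",),
    ("2015-01-03 12:00:00",), ("2015-01-09 08:00:00",),
]

QUERY = """
WITH RECURSIVE days(dte) AS (
    -- Calendar of every day in the period, including days with no events.
    SELECT :start
    UNION ALL
    SELECT date(dte, '+1 day') FROM days WHERE dte < :end
),
daily AS (
    SELECT date(occurred) AS dte, COUNT(*) AS cnt
    FROM events
    GROUP BY date(occurred)
)
SELECT days.dte,
       -- Current day plus the six days before it; missing days count as 0.
       SUM(COALESCE(daily.cnt, 0)) OVER (
           ORDER BY days.dte
           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS events_7d
FROM days
LEFT JOIN daily ON daily.dte = days.dte
ORDER BY days.dte
"""


def trailing_week_counts(rows, start, end):
    """Return (day, events in the trailing 7 days) for each day in [start, end]."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (occurred TEXT)")
    conn.executemany("INSERT INTO events VALUES (?)", rows)
    result = conn.execute(QUERY, {"start": start, "end": end}).fetchall()
    conn.close()
    return result


for row in trailing_week_counts(EVENTS, "2015-01-01", "2015-01-10"):
    print(row)
```

Because the calendar starts at the first day of the period, the earliest six days only see events from inside the period. To count events from the week before the start as well, generate the calendar from seven days earlier and filter the final result.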