Count first occurring record per time period - sql

In my table trips, I have two columns: created_at and user_id.
Each user can take many trips. My goal is to count, per year-month, how many users took their very first trip in that month. I understand that the min() function should be applied here.
In a previous query, I counted all unique users per year-month:
SELECT to_char(created_at, 'YYYY-MM') as yyyymm, COUNT(DISTINCT user_id)
FROM trips
GROUP BY yyyymm
ORDER BY yyyymm;
Where in the above query should min() be integrated? In other words, instead of counting all unique user IDs per month, I only need to count each user once, in the month of that user's first trip.
The sample input would look like:
> routes
user_id created_at
1 1 2015-08-07 07:18:21
2 2 2015-05-06 20:43:52
3 3 2015-05-06 20:53:54
4 1 2015-03-30 20:09:07
5 2 2015-10-01 18:28:32
6 3 2015-08-07 07:29:29
7 1 2015-08-28 13:45:44
8 2 2015-08-07 07:37:31
9 3 2015-03-30 20:14:04
10 1 2015-08-07 07:08:50
And the output would be:
count Y-m
1 0 2015-01
2 0 2015-02
3 2 2015-03
4 0 2015-04
5 1 2015-05
This is because the first occurrences of user_id 1 and 3 were in March, and the first occurrence of user_id 2 was in May.

You can do this with two levels of aggregation: get the earliest created_at per user_id, then count those first trips per year-month.
SELECT to_char(first_time, 'YYYY-MM') AS yyyymm, COUNT(*)
FROM (
    -- earliest trip per user
    SELECT user_id, MIN(created_at) AS first_time
    FROM trips
    GROUP BY user_id
) t
GROUP BY to_char(first_time, 'YYYY-MM')
ORDER BY yyyymm;
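To check the two-level aggregation against the sample rows from the question, here is a minimal sketch that runs the same query shape in an in-memory SQLite database. SQLite is used only so the example is self-contained; `strftime('%Y-%m', ...)` stands in for PostgreSQL's `to_char(..., 'YYYY-MM')`, and the name `first_trips_per_month` is made up for this illustration.

```python
import sqlite3

# Sample rows copied from the question: (user_id, created_at).
rows = [
    (1, "2015-08-07 07:18:21"),
    (2, "2015-05-06 20:43:52"),
    (3, "2015-05-06 20:53:54"),
    (1, "2015-03-30 20:09:07"),
    (2, "2015-10-01 18:28:32"),
    (3, "2015-08-07 07:29:29"),
    (1, "2015-08-28 13:45:44"),
    (2, "2015-08-07 07:37:31"),
    (3, "2015-03-30 20:14:04"),
    (1, "2015-08-07 07:08:50"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (user_id INTEGER, created_at TEXT)")
conn.executemany("INSERT INTO trips VALUES (?, ?)", rows)

# Inner query: earliest trip per user. Outer query: count those per month.
# strftime is SQLite's stand-in for PostgreSQL's to_char (an assumption
# made only to keep this sketch runnable without a Postgres server).
query = """
SELECT strftime('%Y-%m', first_time) AS yyyymm, COUNT(*) AS first_trips
FROM (
    SELECT user_id, MIN(created_at) AS first_time
    FROM trips
    GROUP BY user_id
) AS t
GROUP BY yyyymm
ORDER BY yyyymm
"""
first_trips_per_month = conn.execute(query).fetchall()
print(first_trips_per_month)  # [('2015-03', 2), ('2015-05', 1)]
```

Months with no first trips (January, February, April in the expected output) do not appear in this result. Producing those zero rows would need a separate list of months to left join against, for example one built with generate_series in PostgreSQL.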

Related

Find most visited Hotel by month in PostgreSQL

I have a table of customers who stayed in a hotel for one or more months. I need to find the 3 most visited hotels for each month. If a customer stayed in a hotel across three months, the stay counts once for each of those three months. To be more precise, this is my table hotel:
id usr_id srch_ci srch_co hotel_id
1 13 2021-10-01 2021-11-22 200
2 12 2021-10-11 2021-10-22 300
3 11 2021-10-28 2021-11-05 200
4 10 2021-10-28 2021-12-03 100
Result should look like below:
mnth hotel_id rnk visits
2021-10 200 1 2
2021-10 100 2 1
2021-10 300 2 1
2021-11 200 1 2
2021-11 100 2 1
2021-12 100 1 1
As we can see above, usr_id = 10 stayed in hotel_id = 100 across 3 different months, so that stay counts as 1 visit for hotel 100 in each of those 3 months. In 2021-12 only usr_id = 10 was staying anywhere, which is why hotel_id = 100 is ranked 1st for 2021-12.
I solved the problem using the generate_series function in Postgres, which was what I was looking for. This link helped me: Splitting single row into multiple rows based on date
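SELECT hotel_id, mnth, visits,
       -- RANK() lets tied hotels share a rank, as in the expected result;
       -- ROW_NUMBER() would number the tied hotels 2 and 3 instead
       RANK() OVER (PARTITION BY mnth ORDER BY visits DESC) AS rnk
FROM (
    SELECT hotel_id, to_char(live_mnth, 'YYYY-MM') AS mnth, count(*) AS visits
    FROM (
        -- one row per calendar month touched by each stay
        SELECT id, usr_id, hotel_id, date_in, date_out,
               generate_series(date_in, date_out, '1 MONTH')::DATE AS live_mnth
        FROM (
            -- truncate check-in and check-out to the first day of their month
            SELECT *, TO_CHAR(srch_ci, 'yyyy-mm-01')::date AS date_in,
                   TO_CHAR(srch_co, 'yyyy-mm-01')::date AS date_out
            FROM hotels
        ) s
    ) s
    GROUP BY hotel_id, to_char(live_mnth, 'YYYY-MM')
) t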
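The month expansion and per-month ranking can also be traced outside the database. The following is a minimal sketch in plain Python over the four sample stays from the question; the helper `months_between` and the variable names are made up for this illustration, and ties share a rank (as in the expected result, where hotels 100 and 300 are both ranked 2 for 2021-10).

```python
from collections import Counter
from datetime import date

# Sample stays from the question: (id, usr_id, srch_ci, srch_co, hotel_id).
stays = [
    (1, 13, date(2021, 10, 1), date(2021, 11, 22), 200),
    (2, 12, date(2021, 10, 11), date(2021, 10, 22), 300),
    (3, 11, date(2021, 10, 28), date(2021, 11, 5), 200),
    (4, 10, date(2021, 10, 28), date(2021, 12, 3), 100),
]

def months_between(start, end):
    """Yield 'YYYY-MM' for every calendar month touched by [start, end]."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield f"{year:04d}-{month:02d}"
        month += 1
        if month > 12:
            year, month = year + 1, 1

# Count one visit per (month, hotel) for each stay touching that month.
visits = Counter()
for _id, _usr_id, check_in, check_out, hotel_id in stays:
    for mnth in months_between(check_in, check_out):
        visits[(mnth, hotel_id)] += 1

# Rank hotels within each month by visits, descending; ties share a rank.
ranked = []
for mnth in sorted({m for m, _ in visits}):
    month_rows = sorted(
        ((h, v) for (m, h), v in visits.items() if m == mnth),
        key=lambda hv: (-hv[1], hv[0]),
    )
    distinct_counts = sorted({v for _, v in month_rows}, reverse=True)
    for hotel_id, v in month_rows:
        ranked.append((mnth, hotel_id, distinct_counts.index(v) + 1, v))

for row in ranked:
    print(row)
```

To keep only the top 3 hotels per month, filter the rows on the rank column (the third tuple element here, or `rnk <= 3` in an outer SQL query).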

SQL: How to group rows with the condition that sum of fields is limited to a certain value?

This is my table:
id user_id date balance
1 1 2016-05-10 10
2 1 2016-05-10 30
3 2 2017-04-24 5
4 2 2017-04-27 10
5 3 2017-11-10 40
I want to group the rows by user_id and sum the balance, but so that the sum of each group is at most 30. I also need to display the minimum date in each group. It should look like this:
id balance date_start
1-1 10 2016-05-10
1-2 30 2016-05-10
2-1 15 2017-04-24
Excuse my language. Thanks.
You should be able to do this with GROUP BY and HAVING. Here is an example that may help with your case:
SELECT user_id, SUM(balance) AS balance, MIN(date) AS date_start
FROM your_table
GROUP BY user_id
HAVING SUM(balance) <= 30;
This does it in one query, but note that it only keeps users whose total balance is at most 30; it does not split a user's rows into several groups such as 1-1 and 1-2. Be careful when running aggregate queries like this on very large tables.
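The expected output in the question (ids such as 1-1 and 1-2) suggests that each user's rows should be split into consecutive groups whose sum stays at or below 30, which a plain GROUP BY user_id cannot express. Below is a minimal sketch of one possible reading, done in Python rather than SQL: walk each user's rows in date order, start a new group whenever adding the next balance would exceed the limit, and drop rows that exceed the limit on their own. The greedy rule and the function name `group_with_cap` are assumptions, not something the question states.

```python
# Sample rows from the question: (id, user_id, date, balance).
rows = [
    (1, 1, "2016-05-10", 10),
    (2, 1, "2016-05-10", 30),
    (3, 2, "2017-04-24", 5),
    (4, 2, "2017-04-27", 10),
    (5, 3, "2017-11-10", 40),
]

def group_with_cap(rows, limit):
    """Greedily pack each user's rows into groups with sum <= limit."""
    groups = []  # each entry: [user_id, sequence_number, balance, date_start]
    for _id, user_id, day, balance in sorted(rows, key=lambda r: (r[1], r[2], r[0])):
        if balance > limit:
            continue  # assumed: a row that exceeds the limit alone is dropped
        # The in-progress group is the last one, if it belongs to this user.
        last = groups[-1] if groups and groups[-1][0] == user_id else None
        if last is not None and last[2] + balance <= limit:
            last[2] += balance  # still fits: extend the current group
        else:
            seq = last[1] + 1 if last is not None else 1
            groups.append([user_id, seq, balance, day])  # start a new group
    return [(f"{u}-{s}", b, d) for u, s, b, d in groups]

result = group_with_cap(rows, 30)
for row in result:
    print(row)
# ('1-1', 10, '2016-05-10')
# ('1-2', 30, '2016-05-10')
# ('2-1', 15, '2017-04-24')
```

Because the rows are processed in date order, the first date stored for each group is also its minimum date. Doing the same in SQL would likely need a recursive query or a procedural loop, since the group boundaries depend on a running sum that resets.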

Count over relative dates in Amazon Redshift

I have 2 tables on Amazon Redshift and they look as follows:
Tablename: Groups
GroupID Created
1 2016-08-04
2 2017-05-24
3 2017-06-12
Tablename: GroupActivities
GroupID CreationTime ActivityType
1 2016-08-13 Assign
1 2016-09-13 Assign
2 2017-05-25 Create
2 2017-05-27 Assign
3 2017-06-24 Create
3 2017-06-28 Assign
I would like to count the number of activities within each 30 day period from group creation. For example, I would like the output to be the following:
GroupID Period ActivityCount
1 Period1 1
1 Period2 1
2 Period1 2
3 Period1 2
I could do this if the dates were not relative, but I'm not sure how to achieve this when the dates are relative. Any help would be much appreciated.
TIA.
Join the tables on the group ID, use integer division of the date difference to identify the period, and aggregate:
select
g.groupid
,'Period'||((a.creationtime::date-g.created::date)/30+1)::varchar as period
,count(1) as activity_count
from groups g
join groupactivities a
on g.groupid=a.groupid
group by 1,2
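The period arithmetic can be checked by hand on the sample data. This minimal sketch in Python applies the same rule, integer-dividing the number of days since group creation by 30 and adding 1; the variable names are made up for this illustration.

```python
from collections import Counter
from datetime import date

# Sample data from the question.
groups = {
    1: date(2016, 8, 4),
    2: date(2017, 5, 24),
    3: date(2017, 6, 12),
}
activities = [
    (1, date(2016, 8, 13), "Assign"),
    (1, date(2016, 9, 13), "Assign"),
    (2, date(2017, 5, 25), "Create"),
    (2, date(2017, 5, 27), "Assign"),
    (3, date(2017, 6, 24), "Create"),
    (3, date(2017, 6, 28), "Assign"),
]

counts = Counter()
for group_id, creation_time, _activity_type in activities:
    days_since_created = (creation_time - groups[group_id]).days
    # Days 0-29 fall in Period1, days 30-59 in Period2, and so on.
    period = f"Period{days_since_created // 30 + 1}"
    counts[(group_id, period)] += 1

activity_counts = sorted(counts.items())
for (group_id, period), n in activity_counts:
    print(group_id, period, n)
# 1 Period1 1
# 1 Period2 1
# 2 Period1 2
# 3 Period1 2
```

For example, group 1's second activity happens 40 days after creation, and 40 // 30 + 1 = 2, so it lands in Period2.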

Find the monthly count, for every month from the start date till member is cancelled

Problem: Monthly distinct count of members from the first date of the gene reading, till the member is cancelled.
Members can have more than one reading per month. They can continue to have as many readings as they want.
Example:
member_id date gene_a_measurement_done gene_b_measurement_done
5557153 1/1/2010 y
5557153 2/1/2010 y
222458 2/1/2010 y y
222458 1/1/2011 y
707222 1/1/2011 y
Another table has members cancellation date:
member_id status date
5557153 Cancelled 5/1/2011
222458 Cancelled 12/1/9999
707222 Cancelled 12/1/9999
Expected result :
month distinct_count_of_member_with_gene_a_measurement distinct_count_of_member_with_gene_b_measurement
1/1/10 1 0
2/1/10 2 2
3/1/10 2 2
4/1/10 2 2
5/1/10 1 1
6/1/10 1 1
7/1/10 1 1
8/1/10 1 1
9/1/10 1 1
10/1/10 1 1
11/1/10 1 1
12/1/10 1 1
1/1/11 2 1
Query tried:
SELECT
sub.last_day,
sum(sub.distinct_count_of_member_with_gene_a_measurement) as distinct_count_of_member_with_gene_a_measurement,
sum(sub.distinct_count_of_member_with_gene_b_measurement) as distinct_count_of_member_with_gene_b_measurement
FROM
(SELECT last_day(date) as last_day,
COUNT(DISTINCT member_id) as distinct_count_of_member_with_gene_a_measurement,
null as distinct_count_of_member_with_gene_b_measurement
FROM measurement
WHERE gene_a_measurement_done is not null
GROUP BY last_day(date)
UNION ALL
SELECT last_day(date) as last_day,
null as distinct_count_of_member_with_gene_a_measurement,
COUNT(DISTINCT member_id) as distinct_count_of_member_with_gene_b_measurement
FROM measurement
WHERE gene_b_measurement_done is not null
GROUP BY last_day(date)) as sub
GROUP BY sub.last_day
The above query only gives the distinct count of members for the months in which a measurement was done, and I am not sure how best to take the cancellation date into account. (Inner join with the member_status table on member_id, with a condition to filter out cancelled members?)

How do I create a frequency distribution?

I'm trying to create a frequency distribution to show how many customers have transacted 1x, 2x, 3x, etc.
I have a table transactions with a column user_id. Each row represents one transaction, so if a user_id shows up in multiple rows, that user has made multiple transactions.
Now I'd like to get a list that looks something like this:
Tra. | Freq.
0 | 345
1 | 543
2 | 45
3 | 20
4 | 0
5 | 3
etc
Currently I have this, but it just shows a list of users and how many transactions they have had.
SELECT user_id, COUNT(user_id) as number_of_transactions
FROM transactions
GROUP BY user_id
ORDER BY number_of_transactions DESC;
I did some digging, and it was suggested that generate_series might help, but I'm stuck and don't know how to move forward.
Use the first result as input to an outer query where you apply the count again, but this time grouping on number_of_transactions:
SELECT number_of_transactions, COUNT(*) AS freq
FROM (
SELECT user_id, COUNT(user_id) as number_of_transactions
FROM transactions
GROUP BY user_id
) A
GROUP BY number_of_transactions;
This would transform a result like:
user_id number_of_transactions
----------- ----------------------
1 2
2 1
3 2
4 4
to this:
number_of_transactions freq
---------------------- -----------
1 1
2 2
4 1
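The same "count, then count the counts" idea can be sketched in a few lines of Python with collections.Counter. The list of user IDs below is made-up data chosen to reproduce the example result above.

```python
from collections import Counter

# One entry per transaction row (hypothetical data matching the example:
# user 1 has 2 transactions, user 2 has 1, user 3 has 2, user 4 has 4).
transaction_user_ids = [1, 1, 2, 3, 3, 4, 4, 4, 4]

# First level: number of transactions per user.
per_user = Counter(transaction_user_ids)

# Second level: how many users share each transaction count.
frequency = sorted(Counter(per_user.values()).items())

print(frequency)  # [(1, 1), (2, 2), (4, 1)]
```

As with the SQL version, counts that no user has (such as 3 here) do not appear, and users with zero transactions cannot be counted from the transactions table alone. Filling those rows would need a list of all users and a series of expected counts to join against.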