Find max per id after counting and grouping - sql

In my table I have bus rides taken in different networks - each record represents one ride.
My goal is to find, for each network, the maximum number of rides taken in a single day and the day on which that maximum occurred. This requires first counting the rides per day in each network and then taking the maximum count per network. The result should have three columns:
YMD, max_count, network_id
I have tried to use the query below but I am not sure where or how to include the max() function. Any suggestions?
SELECT DISTINCT ON (network_id)
network_id, count(*), to_char(start_time, 'YYYY-MM-DD') as YMD
FROM routes
ORDER BY network_id, count DESC, YMD;

I'd use an aggregate query to count the number of rides a day, and then a windowing rank call to find the date with the most rides:
SELECT network_id, cnt, ymd
FROM (SELECT network_id,
ymd,
cnt,
RANK() OVER (PARTITION BY network_id ORDER BY cnt DESC) AS rk
FROM (SELECT network_id,
TO_CHAR(start_time, 'YYYY-MM-DD') AS ymd,
COUNT(*) AS cnt
FROM routes
GROUP BY network_id, TO_CHAR(start_time, 'YYYY-MM-DD')
) t
) s
WHERE rk = 1
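A minimal sketch of the count-then-rank approach above, run against SQLite through Python's sqlite3 module so it can be tried without a Postgres server. Assumptions not in the original: the sample rows, the helper name busiest_day_per_network, and strftime standing in for TO_CHAR. Window functions need SQLite 3.25 or newer.

```python
import sqlite3

def busiest_day_per_network(rows):
    """rows: iterable of (network_id, start_time) with ISO-8601 start_time strings."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE routes (network_id INTEGER, start_time TEXT)")
    con.executemany("INSERT INTO routes VALUES (?, ?)", rows)
    query = """
        SELECT network_id, cnt, ymd
        FROM (SELECT network_id, ymd, cnt,
                     RANK() OVER (PARTITION BY network_id ORDER BY cnt DESC) AS rk
              FROM (SELECT network_id,
                           strftime('%Y-%m-%d', start_time) AS ymd,  -- SQLite stand-in for TO_CHAR
                           COUNT(*) AS cnt
                    FROM routes
                    GROUP BY network_id, ymd) t
             ) s
        WHERE rk = 1
        ORDER BY network_id, ymd
    """
    result = con.execute(query).fetchall()
    con.close()
    return result

# Hypothetical sample data: network 1 has a clear busiest day, network 2 has a tie.
sample = [
    (1, "2020-01-01 08:00:00"), (1, "2020-01-01 09:00:00"), (1, "2020-01-02 08:00:00"),
    (2, "2020-01-03 10:00:00"), (2, "2020-01-04 11:00:00"),
]
print(busiest_day_per_network(sample))
```

Because RANK() gives tied days the same rank, network 2 comes back with both of its days. If exactly one row per network is wanted, ROW_NUMBER() with an explicit tie-breaker in the ORDER BY would be the usual substitute.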

Related

Rolling NEW active users in SQL (BigQuery)

I have already computed rolling active users (on a weekly basis) as follow:
SELECT
DATE_TRUNC(EXTRACT(DATE FROM tracks.timestamp), WEEK),
COUNT(DISTINCT tracks.user_id)
FROM `company.dataset.tracks` AS tracks
WHERE tracks.timestamp > TIMESTAMP('2020-01-01')
AND tracks.event = 'activation_event'
GROUP BY 1
ORDER BY 1
I am interested in knowing the number of distinct users who performed the activation event for the 1st time on a rolling weekly basis.
If I follow you correctly, you can use two levels of aggregation:
select
date_trunc(date(activation_timestamp), week) activation_week,
count(*) cnt_active_users
from (
select min(timestamp) activation_timestamp
from `company.dataset.tracks` t
where event = 'activation_event'
group by user_id
) t
where activation_timestamp > timestamp('2020-01-01')
group by activation_week
order by activation_week
The subquery computes the timestamp of the first activation event per user; the outer query then counts the number of such first events per week.
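To see the two levels of aggregation in action, here is a small sketch using SQLite via Python's sqlite3 module as a stand-in for BigQuery. The table contents are made up, and the expression date(ts, '-6 days', 'weekday 0') is an assumed SQLite equivalent of DATE_TRUNC(..., WEEK) with Sunday-based weeks.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tracks (user_id TEXT, event TEXT, timestamp TEXT)")
con.executemany("INSERT INTO tracks VALUES (?, ?, ?)", [
    ("a", "activation_event", "2020-01-06 10:00:00"),  # first activation of user a
    ("a", "activation_event", "2020-01-14 10:00:00"),  # repeat, must not count again
    ("b", "activation_event", "2020-01-08 09:00:00"),
    ("c", "activation_event", "2020-01-13 12:00:00"),
    ("c", "other_event",      "2020-01-02 12:00:00"),  # ignored: different event
])

new_users_per_week = con.execute("""
    SELECT date(activation_timestamp, '-6 days', 'weekday 0') AS activation_week,
           COUNT(*) AS cnt_new_users
    FROM (SELECT user_id, MIN(timestamp) AS activation_timestamp  -- inner level: first event per user
          FROM tracks
          WHERE event = 'activation_event'
          GROUP BY user_id) first_events
    WHERE activation_timestamp > '2020-01-01'
    GROUP BY activation_week                                      -- outer level: count per week
    ORDER BY activation_week
""").fetchall()
print(new_users_per_week)
```

User a activates twice but is counted only in the week of the first activation, which is the point of taking MIN(timestamp) per user before grouping by week.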
If you want both the actives and starts in the same query:
SELECT week, COUNT(*) as users_in_week,
COUNTIF(seqnum = 1) as new_users
FROM (SELECT DATE_TRUNC(EXTRACT(DATE FROM t.timestamp), WEEK) as week,
t.user_id, COUNT(*) as cnt,
ROW_NUMBER() OVER (PARTITION BY t.user_id ORDER BY MIN(t.timestamp)) as seqnum
FROM `company.dataset.tracks` t
WHERE t.event = 'activation_event'
GROUP BY 1, 2
) t
WHERE week > DATE '2020-01-01' -- the outer query only sees week, not tracks.timestamp
GROUP BY 1
ORDER BY 1;

Month over Month percent change in user registrations

I am trying to write a query to find the month-over-month percent change in user registrations.
Users table has the logs for user registrations
user_id - pk, integer
created_at - account created date, varchar
activated_at - account activated date, varchar
state - active or pending, varchar
I found the number of users for each year and month. How do I find month over month percent change in user registration? I think I need a window function?
SELECT
EXTRACT(month from created_at::timestamp) as created_month
,EXTRACT(year from created_at::timestamp) as created_year
,count(distinct user_id) as number_of_registration
FROM users
GROUP BY 1,2
ORDER BY 1,2
This is the output of above query:
Then I wrote this to find the difference in user registrations from the previous year.
SELECT
*
,number_of_registration - lag(number_of_registration) over (partition by created_month) as difference_in_previous_year
FROM (
SELECT
EXTRACT(month from created_at::timestamp) as created_month
,EXTRACT(year from created_at::timestamp) as created_year
,count( user_id) as number_of_registration
FROM users as u
GROUP BY 1,2
ORDER BY 1,2) as temp
The output is this:
You want an order by clause that contains created_year.
number_of_registration
- lag(number_of_registration) over (partition by created_month order by created_year) as difference_in_previous_year
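As an illustration of why the ORDER BY matters, the sketch below runs the corrected window expression on SQLite through Python's sqlite3 module (3.25 or newer). The sample registrations and the CTE name monthly are made up for the example, and strftime replaces EXTRACT.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, created_at TEXT)")
con.executemany("INSERT INTO users (created_at) VALUES (?)", [
    ("2013-01-05",), ("2013-01-20",),                              # Jan 2013: 2
    ("2013-02-10",), ("2013-02-11",), ("2013-02-12",),             # Feb 2013: 3
    ("2014-01-03",), ("2014-01-04",), ("2014-01-05",),
    ("2014-01-06",), ("2014-01-07",),                              # Jan 2014: 5
    ("2014-02-01",),                                               # Feb 2014: 1
])

year_over_year = con.execute("""
    WITH monthly AS (
        SELECT CAST(strftime('%Y', created_at) AS INTEGER) AS created_year,
               CAST(strftime('%m', created_at) AS INTEGER) AS created_month,
               COUNT(*) AS number_of_registration
        FROM users
        GROUP BY created_year, created_month
    )
    SELECT created_year, created_month, number_of_registration,
           number_of_registration
             - LAG(number_of_registration) OVER (PARTITION BY created_month
                                                 ORDER BY created_year) AS difference_in_previous_year
    FROM monthly
    ORDER BY created_year, created_month
""").fetchall()
print(year_over_year)
```

Within each month partition the rows are ordered by year, so LAG picks up the same month of the previous year; the first year has no predecessor and yields NULL (None in Python).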
Note that you don't actually need a subquery for this. You can do:
select
extract(year from created_at) as created_year,
extract(month from created_at) as created_month,
count(*) as number_of_registration,
count(*) - lag(count(*)) over(partition by extract(month from created_at) order by extract(year from created_at)) as difference_in_previous_year
from users as u
group by created_year, created_month
order by created_year, created_month
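I used count(*) instead of count(user_id) because I assume that user_id is not nullable (in which case count(*) is equivalent, and more efficient). The query also drops the cast to timestamp, which is only safe if created_at is actually stored as a date or timestamp; with a varchar column as described in the question, keep the ::timestamp cast.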
These queries work as long as you have data for every month. If you have gaps, then the problem should be addressed differently - but this is not the question you asked here.
I can get the registrations for each year as two derived tables and join them, but this is not very efficient:
SELECT
t1.created_year as year_2013
,t2.created_year as year_2014
,t1.created_month as month_of_year
,t1.number_of_registration_2013
,t2.number_of_registration_2014
,100.0 * (t2.number_of_registration_2014 - t1.number_of_registration_2013) / t1.number_of_registration_2013 as percent_change_in_previous_year_month -- 100.0 avoids integer division
FROM
(select
extract(year from created_at) as created_year
,extract(month from created_at) as created_month
,count(*) as number_of_registration_2013
from users
where extract(year from created_at) = '2013'
group by 1,2) t1
inner join
(select
extract(year from created_at) as created_year
,extract(month from created_at) as created_month
,count(*) as number_of_registration_2014
from users
where extract(year from created_at) = '2014'
group by 1,2) t2
on t1.created_month = t2.created_month
First off, why are you using strings to hold date/time values? Your first step should be to define created_at and activated_at as proper timestamps. The query below assumes this correction. If you do not correct it, cast the string to a timestamp wherever created_at is used (the CTE generating the date range and the join condition). But keep in mind that if you leave the columns as text, you will at some point get a conversion exception.
To calculate month-over-month use the formula "100*(Nt - Nl)/Nl" where Nt is the number of users this month and Nl is the number of users last month. There are 2 potential issues:
There are gaps in the data.
Nl is 0 (would incur divide by 0 exception)
The following handles both by first generating every month from the earliest date to the latest date, then outer joining the monthly counts to the generated months. When Nl = 0 the query returns NULL, indicating that the percent change could not be calculated.
with full_range(the_month) as
(select generate_series(low_month, high_month, interval '1 month')
from (select min(date_trunc('month',created_at)) low_month
, max(date_trunc('month',created_at)) high_month
from users
) m
)
select to_char(the_month,'yyyy-mm')
, users_this_month
, case when users_last_month = 0
then null::float
else round((100.00*(users_this_month-users_last_month)/users_last_month),2)
end percent_change
from (
select the_month, users_this_month , lag(users_this_month) over(order by the_month) users_last_month
from ( select f.the_month, count(u.created_at) users_this_month
from full_range f
left join users u on date_trunc('month',u.created_at) = f.the_month
group by f.the_month
) mc
) pc
order by the_month;
NOTE: There are several places where the above can be shortened, but the longer form is intentional, to show how the final values are derived.
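The gap-filling idea can be tried out with the following sketch, which ports the query to SQLite via Python's sqlite3 module. This is an approximation under stated assumptions: SQLite has no generate_series built in, so a recursive CTE (the hypothetical bounds and full_range below) generates the months, and the sample data is invented with a deliberate gap in March.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, created_at TEXT)")
con.executemany("INSERT INTO users (created_at) VALUES (?)", [
    ("2013-01-05",), ("2013-01-20",),                   # January: 2 users
    ("2013-02-10",), ("2013-02-11",), ("2013-02-12",),  # February: 3 users
    # March: no registrations (gap in the data)
    ("2013-04-02",),                                    # April: 1 user
])

percent_changes = con.execute("""
    WITH RECURSIVE bounds(low_month, high_month) AS (
        SELECT MIN(date(created_at, 'start of month')),
               MAX(date(created_at, 'start of month'))
        FROM users
    ),
    full_range(the_month) AS (            -- every month between the bounds, gaps included
        SELECT low_month FROM bounds
        UNION ALL
        SELECT date(the_month, '+1 month') FROM full_range, bounds
        WHERE the_month < high_month
    ),
    mc AS (                               -- monthly counts; empty months count 0
        SELECT f.the_month, COUNT(u.created_at) AS users_this_month
        FROM full_range f
        LEFT JOIN users u ON date(u.created_at, 'start of month') = f.the_month
        GROUP BY f.the_month
    ),
    pc AS (                               -- attach the previous month's count
        SELECT the_month, users_this_month,
               LAG(users_this_month) OVER (ORDER BY the_month) AS users_last_month
        FROM mc
    )
    SELECT strftime('%Y-%m', the_month), users_this_month,
           CASE WHEN users_last_month = 0 THEN NULL
                ELSE ROUND(100.0 * (users_this_month - users_last_month) / users_last_month, 2)
           END AS percent_change
    FROM pc
    ORDER BY the_month
""").fetchall()
print(percent_changes)
```

March shows up with 0 users and a change of -100, and April returns NULL because the previous month's count is 0. The first month is also NULL since it has no predecessor.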

Count occurrences in a row using aggregate functions

Consider the following relation
The column measured_at holds thousands of different timestamps, and the column cell_id holds the number of the cell tower used at each timestamp. For each day in measured_at, I want to find which cell tower has the most occurrences (i.e. was used the most on that day; the time of day is irrelevant, only the date matters). This can probably be done using window functions, but I want to do it using only aggregate functions and simple queries.
an output should look like for example:
cell_id measured_at
27997442 2015-12-22
for the above example because on 22-12-2015 tower number 27997442 has been used the most.
You can use aggregation and distinct on. To get the counts:
select date_trunc('day', measured_at) as dte, cell_id, count(*) as cnt
from t
group by dte, cell_id
And then extend this for only one value:
select distinct on (date_trunc('day', measured_at)) date_trunc('day', measured_at) as dte, cell_id, count(*) as cnt
from t
group by dte, cell_id
order by date_trunc('day', measured_at), count(*) desc;
Of course, you can use window functions as well -- and that is a better approach if you want to get ties as well:
select dte, cell_id, cnt
from (select date_trunc('day', measured_at) as dte, cell_id, count(*) as cnt,
rank() over (partition by date_trunc('day', measured_at) order by count(*) desc) as seqnum
from t
group by dte, cell_id
) dc
where seqnum = 1;

Unique values per time period

In my table trips , I have two columns: created_at and user_id
My goal is to count unique user_ids per month with a query in postgres. So far, I have written this - but it returns an error
SELECT user_id,
to_char(created_at, 'YYYY-MM') as t COUNT(*)
FROM (SELECT DISTINCT user_id
FROM trips) group by t;
How should I change this query?
The query is much simpler than that:
SELECT to_char(created_at, 'YYYY-MM') as yyyymm, COUNT(DISTINCT user_id)
FROM trips
GROUP BY yyyymm
ORDER BY yyyymm;
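For a quick check of the COUNT(DISTINCT ...) approach, here is a sketch on SQLite through Python's sqlite3 module. The rows are invented, and strftime('%Y-%m', ...) is the assumed SQLite counterpart of to_char(created_at, 'YYYY-MM').

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trips (user_id INTEGER, created_at TEXT)")
con.executemany("INSERT INTO trips VALUES (?, ?)", [
    (1, "2020-01-03 08:00:00"),
    (1, "2020-01-15 18:30:00"),  # same user, same month: counted once
    (2, "2020-01-20 07:45:00"),
    (1, "2020-02-02 09:10:00"),  # same user, new month: counted again
])

unique_users_per_month = con.execute("""
    SELECT strftime('%Y-%m', created_at) AS yyyymm,
           COUNT(DISTINCT user_id) AS unique_users
    FROM trips
    GROUP BY yyyymm
    ORDER BY yyyymm
""").fetchall()
print(unique_users_per_month)
```

User 1 has two trips in January but contributes only once to that month's count, and is counted again in February.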

Selecting Max from subquery Oracle

I’m using Oracle and trying to find the maximum transaction count (and associated date) for each station.
This is the code I have but it returns each transaction count and date for each station rather than just the maximum. If I take the date part out of the outer query it returns just the maximum transaction count for each station, but I need to know the date of when it happened. Does anyone know how to get it to work?
Thanks!
SELECT STATION_ID, STATION_NAME, MAX(COUNTTRAN), TESTDATE
FROM
(
SELECT COUNT(TRANSACTION_ID) AS COUNTTRAN, STATION_ID,
STATION_NAME, TO_CHAR(TRANSACTION_DATE, 'HH24') AS TESTDATE
FROM STATION_TRANSACTIONS
WHERE COUNTRY = 'GB'
GROUP BY STATION_ID, STATION_NAME, TO_CHAR(TRANSACTION_DATE, 'HH24')
)
GROUP BY STATION_ID, STATION_NAME, TESTDATE
ORDER BY MAX(COUNTTRAN) DESC
This image shows the results I currently get vs the ones I want:
What your query does is this:
Subquery: Get one record per station_id, station_name and date. Count the transactions for each such combination.
Main query: Get one record per station_id, station_name and date. (We already did that, so it doesn't change anything.)
Order the records by transaction count.
This is not what you want. What you want is one result row per station_id, station_name, so in your main query you should have grouped by these only, excluding the date:
select
station_id,
station_name,
max(counttran) as maxcount,
max(testdate) keep (dense_rank last order by counttran) as maxcountdate
from
(
select
count(transaction_id) as counttran,
station_id,
station_name,
to_char(transaction_date, 'hh24') as testdate
from station_transactions
where country = 'GB'
group by station_id, station_name, to_char(transaction_date, 'hh24')
)
group by station_id, station_name;
An alternative is not to group again in the main query, because the subquery already produces the desired records and you only want to remove the others. You can do this by ranking the records in the subquery, i.e. giving them row numbers, with #1 for the best record per station (the one with the highest count). Then dismiss all others and you are done:
select station_id, station_name, counttran, testdate
from
(
select
count(transaction_id) as counttran,
row_number() over(partition by station_id order by count(transaction_id) desc) as rn,
station_id,
station_name,
to_char(transaction_date, 'hh24') as testdate
from station_transactions
where country = 'GB'
group by station_id, station_name, to_char(transaction_date, 'hh24')
)
where rn = 1;
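The row-numbering variant can be sketched on SQLite via Python's sqlite3 module (3.25 or newer) instead of Oracle. Everything below is illustrative: the sample transactions are invented, strftime('%H', ...) stands in for TO_CHAR(..., 'HH24'), the aggregation is split into a CTE named hourly, and the extra testdate tie-breaker in the ORDER BY is an addition not present in the answer, used only to make the result deterministic.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE station_transactions (
    transaction_id INTEGER PRIMARY KEY, station_id INTEGER, station_name TEXT,
    country TEXT, transaction_date TEXT)""")
con.executemany(
    "INSERT INTO station_transactions (station_id, station_name, country, transaction_date) "
    "VALUES (?, ?, ?, ?)", [
        (1, "A", "GB", "2020-01-01 08:05:00"),
        (1, "A", "GB", "2020-01-02 08:40:00"),  # hour 08 occurs twice for station 1
        (1, "A", "GB", "2020-01-01 09:10:00"),
        (2, "B", "GB", "2020-01-01 10:00:00"),
        (2, "B", "GB", "2020-01-01 11:00:00"),  # tie between hours 10 and 11
        (3, "C", "FR", "2020-01-01 12:00:00"),  # filtered out: not GB
    ])

busiest_hour_per_station = con.execute("""
    WITH hourly AS (
        SELECT station_id, station_name,
               strftime('%H', transaction_date) AS testdate,
               COUNT(transaction_id) AS counttran
        FROM station_transactions
        WHERE country = 'GB'
        GROUP BY station_id, station_name, testdate
    )
    SELECT station_id, station_name, counttran, testdate
    FROM (SELECT hourly.*,
                 ROW_NUMBER() OVER (PARTITION BY station_id
                                    ORDER BY counttran DESC, testdate) AS rn
          FROM hourly)
    WHERE rn = 1
    ORDER BY station_id
""").fetchall()
print(busiest_hour_per_station)
```

Unlike RANK(), ROW_NUMBER() returns exactly one row per station even when two hours tie, so station B yields a single row chosen by the tie-breaker.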