I am trying to figure out how to combine a GROUP BY with a window function.
I have a table with these fields:
timestamp track_name
where timestamp represents the time the track was played.
How can I get the track that was played the most each day?
I would need a count of track_name grouped by date(timestamp) and track_name,
but I am not sure how to group by that count and then pick the highest one for each day.
Thanks!
You can certainly mix analytic functions with a GROUP BY query. In the query below, I aggregate by date and track name, but also generate a rank based on the aggregated count. For each date, the track having the highest rank is retained. Note that I use RANK rather than ROW_NUMBER, since the former can handle the possibility of two or more tracks being tied for the most number of plays on a single day.
WITH cte AS (
SELECT DATE(timestamp) AS dt, track_name,
RANK() OVER (PARTITION BY DATE(timestamp) ORDER BY COUNT(*) DESC) rnk
FROM yourTable
GROUP BY DATE(timestamp), track_name
)
SELECT dt, track_name
FROM cte
WHERE rnk = 1;
This works because analytic functions are evaluated after GROUP BY has already finished. So, the COUNT(*) is available to be used in the call to RANK().
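To try the query above without a database server, here is a minimal sketch that runs it through Python's built-in sqlite3 module. It assumes SQLite 3.25 or newer (the first release with window functions); the table name plays and the sample rows are hypothetical stand-ins for yourTable.

```python
import sqlite3

def top_tracks_per_day(rows):
    """rows: iterable of (timestamp, track_name). Returns [(day, track_name), ...]."""
    con = sqlite3.connect(":memory:")
    # Hypothetical table standing in for yourTable.
    con.execute("CREATE TABLE plays (timestamp TEXT, track_name TEXT)")
    con.executemany("INSERT INTO plays VALUES (?, ?)", rows)
    query = """
        WITH cte AS (
            SELECT DATE(timestamp) AS dt, track_name,
                   -- RANK() is computed over the already aggregated rows
                   RANK() OVER (PARTITION BY DATE(timestamp)
                                ORDER BY COUNT(*) DESC) AS rnk
            FROM plays
            GROUP BY DATE(timestamp), track_name
        )
        SELECT dt, track_name
        FROM cte
        WHERE rnk = 1
        ORDER BY dt, track_name
    """
    result = con.execute(query).fetchall()
    con.close()
    return result

if __name__ == "__main__":
    sample = [
        ("2024-01-01 08:00:00", "A"),
        ("2024-01-01 09:00:00", "A"),
        ("2024-01-01 10:00:00", "B"),
        ("2024-01-02 08:00:00", "B"),
        ("2024-01-02 09:00:00", "C"),
    ]
    for row in top_tracks_per_day(sample):
        print(row)
```

On the second sample day both tracks are played once, so RANK() keeps both rows; ROW_NUMBER() would keep only one of them.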
This post is similar to this thread in that I have multiple observations per group. However, I want to randomly select only one of them. I am also working on Oracle 10g.
There are multiple rows per person_id in table df. I want to order each group of person_ids by dbms_random.value() and select the first observation from each group. To do so, I tried:
select
person_id, purchase_date
from
df
where
row_number() over (partition by person_id order by dbms_random.value()) = 1
The query returns:
ORA-30483: window functions are not allowed here
30483. 00000 - "window functions are not allowed here"
*Cause: Window functions are allowed only in the SELECT list of a query. And, window function cannot be an argument to another window or group function.
Use a subquery:
select person_id, purchase_date
from (select df.*,
row_number() over (partition by person_id order by dbms_random.value()) as seqnum
from df
) df
where seqnum = 1;
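The subquery pattern can be checked outside Oracle as well. The sketch below is not Oracle code: it assumes SQLite 3.25+ through Python's sqlite3 module and swaps dbms_random.value() for SQLite's random(). The sample rows are made up.

```python
import sqlite3

def random_purchase_per_person(rows):
    """rows: iterable of (person_id, purchase_date). Returns one random row per person."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE df (person_id INTEGER, purchase_date TEXT)")
    con.executemany("INSERT INTO df VALUES (?, ?)", rows)
    query = """
        SELECT person_id, purchase_date
        FROM (SELECT df.*,
                     -- random() replaces Oracle's dbms_random.value() here
                     ROW_NUMBER() OVER (PARTITION BY person_id
                                        ORDER BY random()) AS seqnum
              FROM df) AS t
        WHERE seqnum = 1
        ORDER BY person_id
    """
    result = con.execute(query).fetchall()
    con.close()
    return result

if __name__ == "__main__":
    sample = [
        (1, "2020-01-01"), (1, "2020-02-01"), (1, "2020-03-01"),
        (2, "2020-04-01"), (2, "2020-05-01"),
    ]
    print(random_purchase_per_person(sample))
```

The window function is computed in the inner SELECT list, where it is allowed, and the outer WHERE only filters on the resulting column.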
Another option is to use a WITH ... AS clause:
WITH t AS
(
SELECT df.*,
ROW_NUMBER() OVER (PARTITION BY person_id ORDER BY dbms_random.value()) AS rn
FROM df
)
SELECT person_id, purchase_date
FROM t
WHERE rn = 1
Aggregate queries (using GROUP BY and aggregate functions) are often much faster than equivalent analytic functions that do the same job. So, if you have a lot of data to process, or if the data is not excessively large but you must run this query often, you may want a more efficient query that uses aggregation instead of analytic functions.
Here is one possible approach:
select person_id,
max(purchase_date) keep (dense_rank first order by dbms_random.value())
as random_purchase_date
from df
group by person_id
;
The query below uses a dense_rank window function and gives me the output I want: for each signup, the transactions are ordered by date ascending and the earliest one is assigned rank 1:
select p.billing_cycle_in_months, avg(t.days)
from (
select *,
datediff(day,transaction_settled_at, transaction_refunded_at) as days,
dense_rank() over (partition by signup_id order by transaction_settled_at asc) as rank
from transactions
) t
join signups s on s.signup_id = t.signup_id
join plans p on p.id = s.plan_id
where datediff(year,s.started_at, current_date) > 1 and t.rank = 1
group by p.billing_cycle_in_months
Would I get essentially the same result using the row_number window function over the same column (transaction_settled_at asc)?
Basically, I want to rank the earliest date as 1 and then group by billing cycle. I just wanted to clarify whether row_number would give me the same result in this case.
Thanks
In your query, the difference between using dense_rank() and row_number() is that the former allows top ties, while the latter does not.
So if two (or more) records have the same, earliest, transaction_settled_at for a given signup_id, then condition dense_rank() ... = 1 will keep them both, while row_number() will select an undefined record out of the two.
If there is no risk of ties, both functions will produce the same result set in your context.
To reduce the possibility of ties, you can also add more sorting criteria to the order by clause of the window function:
dense_rank() over (
partition by signup_id
order by transaction_settled_at, some_other_column desc, some_more_column
)
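The tie behaviour described above can be seen on a few rows. This is a minimal sketch assuming SQLite 3.25+ via Python's sqlite3 module; the amount column and the sample data are hypothetical and only there to tell the tied rows apart.

```python
import sqlite3

def earliest_rows(rows, func):
    """Keep the rows ranked 1 per signup_id, using dense_rank or row_number."""
    if func not in ("dense_rank", "row_number"):
        raise ValueError("func must be 'dense_rank' or 'row_number'")
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE transactions "
        "(signup_id INTEGER, transaction_settled_at TEXT, amount INTEGER)"
    )
    con.executemany("INSERT INTO transactions VALUES (?, ?, ?)", rows)
    # func is validated above, so interpolating it into the SQL is safe.
    query = f"""
        SELECT signup_id, transaction_settled_at, amount
        FROM (SELECT t.*,
                     {func}() OVER (PARTITION BY signup_id
                                    ORDER BY transaction_settled_at) AS rnk
              FROM transactions AS t) AS ranked
        WHERE rnk = 1
    """
    result = con.execute(query).fetchall()
    con.close()
    return result

if __name__ == "__main__":
    sample = [
        (1, "2024-01-01", 10),  # tied for earliest within signup 1
        (1, "2024-01-01", 20),  # tied for earliest within signup 1
        (1, "2024-02-01", 30),
        (2, "2024-01-05", 40),
    ]
    print(len(earliest_rows(sample, "dense_rank")))
    print(len(earliest_rows(sample, "row_number")))
```

With dense_rank both tied rows of signup 1 survive the filter, so any average computed afterwards includes both; with row_number only one of them does, and which one is not defined.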
The query below works as intended, but you guys can sometimes do magic when it comes to optimization. Is this all right, or can it be done in a better or faster way?
WITH last_events AS (
SELECT DISTINCT ON (type, adid)
type,
adid,
value,
created_at
FROM public.adid
ORDER BY type, adid, created_at DESC
)
SELECT
adid.type,
adid.adid,
count(*) as count,
sum(adid.value) as summary,
le.created_at
FROM public.adid
JOIN last_events le ON le.type = adid.type AND le.adid = adid.adid
GROUP BY adid.type, adid.adid, le.created_at
ORDER BY summary DESC, le.created_at DESC;
I believe that certain parts of your solution are unnecessary. The CTE returns the max created_at per (type, adid) group. The main query computes the number of rows and the sum of value per (type, adid) group. Therefore, it can be written like this:
SELECT
adid.type,
adid.adid,
count(*) as count,
sum(adid.value) as summary,
max(adid.created_at) max_created_at
FROM public.adid
GROUP BY adid.type, adid.adid
ORDER BY summary DESC, max_created_at DESC;
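As a quick check of the single-aggregation version, here is a sketch that runs it on a few made-up rows. It assumes SQLite via Python's sqlite3 module rather than PostgreSQL, so the public schema prefix is dropped; the sample data is hypothetical.

```python
import sqlite3

def summarize(rows):
    """rows: iterable of (type, adid, value, created_at)."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE adid (type TEXT, adid TEXT, value INTEGER, created_at TEXT)"
    )
    con.executemany("INSERT INTO adid VALUES (?, ?, ?, ?)", rows)
    query = """
        SELECT adid.type,
               adid.adid,
               count(*) AS count,
               sum(adid.value) AS summary,
               max(adid.created_at) AS max_created_at
        FROM adid
        GROUP BY adid.type, adid.adid
        ORDER BY summary DESC, max_created_at DESC
    """
    result = con.execute(query).fetchall()
    con.close()
    return result

if __name__ == "__main__":
    sample = [
        ("click", "a1", 5, "2024-01-01"),
        ("click", "a1", 7, "2024-01-03"),
        ("view", "a2", 20, "2024-01-02"),
    ]
    for row in summarize(sample):
        print(row)
```

Each (type, adid) group yields one row carrying its row count, its sum of value and its latest created_at, which is the same information the CTE plus join produced.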
If you are interested in other columns corresponding to the row with highest created_at then you can use one of the classical greatest-per-group approaches. One that I prefer is to use GROUP BY to find the greatest value (very similar to your approach):
SELECT
adid.type,
adid.adid,
t.count,
t.summary,
t.max_created_at,
adid.value
FROM public.adid
JOIN (
SELECT
adid.type,
adid.adid,
count(*) as count,
sum(adid.value) as summary,
max(adid.created_at) max_created_at
FROM public.adid
GROUP BY adid.type, adid.adid
) t ON t.type = adid.type and
t.adid = adid.adid and
t.max_created_at = adid.created_at
ORDER BY t.summary DESC, t.max_created_at DESC;
I believe it is better like this since my solution has just one aggregation. Your solution uses DISTINCT ON (which is a hidden aggregation) and another GROUP BY in the outer query.
Another option for finding the greatest-per-group row is a window function. However, I think aggregation is a much better solution for your problem since you need several aggregated values. Moreover, GROUP BY seems to perform better than window functions in certain cases.
I have a table with 23 records. I am trying to get the total count of records and also the last record in a single query, something like this:
select count(*) ,(m order by createdDate) from music m ;
Is there any way to pull out only the last record together with the total count in PostgreSQL?
This can be done using window functions
select *
from (
select m.*,
row_number() over (order by createddate desc) as rn,
count(*) over () as total_count
from music m
) t
where rn = 1;
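A minimal sketch of this approach, assuming SQLite 3.25+ via Python's sqlite3 module in place of PostgreSQL. The id and title columns are hypothetical, since the question does not show the layout of music.

```python
import sqlite3

def latest_with_total(rows):
    """rows: iterable of (id, title, createddate). Returns the newest row plus the total count."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE music (id INTEGER, title TEXT, createddate TEXT)")
    con.executemany("INSERT INTO music VALUES (?, ?, ?)", rows)
    query = """
        SELECT id, title, createddate, total_count
        FROM (SELECT m.*,
                     ROW_NUMBER() OVER (ORDER BY createddate DESC) AS rn,
                     -- an empty OVER () makes the whole table one window
                     COUNT(*) OVER () AS total_count
              FROM music AS m) AS t
        WHERE rn = 1
    """
    result = con.execute(query).fetchone()
    con.close()
    return result

if __name__ == "__main__":
    sample = [
        (1, "a", "2024-01-01"),
        (2, "b", "2024-02-01"),
        (3, "c", "2024-03-01"),
    ]
    print(latest_with_total(sample))
```

Because the count is a window function and not a plain aggregate, the rows are not collapsed, so the newest row can still be selected afterwards.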
Another option would be to use a scalar sub-query and combine it with a limit clause:
select *,
(select count(*) from music) as total_count
from music
order by createddate desc
limit 1;
Depending on the indexes, your memory configuration and the table definition, this might be faster than the two window functions.
No, it's not possible to do what is being asked. SQL does not work that way: the moment you ask for a count(), SQL changes the level of your data to an aggregation. The only way to do what you are asking is to do the count() and the ORDER BY in a separate query.
Another solution using windowing functions and no subquery:
SELECT DISTINCT count(*) OVER w, last_value(m) OVER w
FROM music m
WINDOW w AS (ORDER BY createddate RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);
The point here is that last_value applies on partitions defined by windows and not on groups defined by GROUP BY.
I did not run any tests, but I suspect my solution is the least efficient of the three posted so far. It is, however, the closest to your example query.
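Here is a sketch of the named-window idea, assuming SQLite 3.25+ via Python's sqlite3 module. SQLite cannot return a whole row from last_value, so a hypothetical title column stands in for last_value(m), and the window is ordered ascending by createddate so that the last value in the frame is the newest record.

```python
import sqlite3

def count_and_last_title(rows):
    """rows: iterable of (title, createddate). Returns [(total_count, last_title)]."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE music (title TEXT, createddate TEXT)")
    con.executemany("INSERT INTO music VALUES (?, ?)", rows)
    query = """
        SELECT DISTINCT
               COUNT(*) OVER w AS total_count,
               LAST_VALUE(title) OVER w AS last_title
        FROM music
        -- the frame spans the whole table, so every row sees the same values
        WINDOW w AS (ORDER BY createddate
                     RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
    """
    result = con.execute(query).fetchall()
    con.close()
    return result

if __name__ == "__main__":
    sample = [("a", "2024-01-01"), ("c", "2024-03-01"), ("b", "2024-02-01")]
    print(count_and_last_title(sample))
```

Every row computes identical values over the full frame, which is why DISTINCT is needed to collapse the output to a single row.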
I have the following table structure, with daily-hourly data:
time_of_ocurrence(timestamp); particles(numeric)
"2012-11-01 00:30:00";191.3
"2012-11-01 01:30:00";46
...
"2013-01-01 02:30:00";319.6
How do I select the DAILY max and THE HOUR in which this max occurs?
I've tried
SELECT date_trunc('hour', time_of_ocurrence) as hora,
MAX(particles)
from my_table WHERE time_of_ocurrence > '2013-09-01'
GROUP BY hora ORDER BY hora
But it doesn't work:
"2013-09-01 00:00:00";34.35
"2013-09-01 01:00:00";33.13
"2013-09-01 02:00:00";33.09
"2013-09-01 03:00:00";28.08
My result should be in this format instead (one max per day, showing the hour):
"2013-09-01 05:00:00";100.35
"2013-09-02 03:30:00";80.13
How can I do that? Thanks!
This type of question has come up on StackOverflow frequently, and these questions are categorized with the greatest-n-per-group tag, if you want to see other solutions.
edit: I changed the following code to group by day instead of by hour.
Here's one solution:
SELECT t.*
FROM (
SELECT date_trunc('day', time_of_ocurrence) as hora, MAX(particles) AS particles
FROM my_table
GROUP BY hora
) AS _max
INNER JOIN my_table AS t
ON _max.hora = date_trunc('day', t.time_of_ocurrence)
AND _max.particles = t.particles
WHERE time_of_ocurrence > '2013-09-01'
ORDER BY time_of_ocurrence;
This might also show more than one result per day, if more than one row has the max value.
Another solution using window functions that does not show such duplicates:
SELECT * FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY date_trunc('day', time_of_ocurrence)
ORDER BY particles DESC) AS _rn
FROM my_table
) AS _max
WHERE _rn = 1
ORDER BY time_of_ocurrence;
If multiple rows have the same max, one row will nevertheless be numbered row 1. If you need specific control over which row is numbered 1, add a unique column to the ORDER BY of the window definition to break such ties.
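The ROW_NUMBER approach with an explicit tie-breaker can be sketched as follows. This assumes SQLite 3.25+ via Python's sqlite3 module, so date(...) replaces PostgreSQL's date_trunc('day', ...); the sample readings are made up.

```python
import sqlite3

def daily_peak(rows):
    """rows: iterable of (time_of_ocurrence, particles). Returns one peak row per day."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE my_table (time_of_ocurrence TEXT, particles REAL)")
    con.executemany("INSERT INTO my_table VALUES (?, ?)", rows)
    query = """
        SELECT time_of_ocurrence, particles
        FROM (SELECT *,
                     ROW_NUMBER() OVER (
                         PARTITION BY date(time_of_ocurrence)
                         -- second sort key: the earliest reading wins a tie
                         ORDER BY particles DESC, time_of_ocurrence) AS rn
              FROM my_table) AS ranked
        WHERE rn = 1
        ORDER BY time_of_ocurrence
    """
    result = con.execute(query).fetchall()
    con.close()
    return result

if __name__ == "__main__":
    sample = [
        ("2013-09-01 00:30:00", 10.0),
        ("2013-09-01 05:30:00", 100.35),
        ("2013-09-01 07:30:00", 100.35),  # same max, later in the day
        ("2013-09-02 03:30:00", 80.13),
        ("2013-09-02 04:30:00", 12.0),
    ]
    for row in daily_peak(sample):
        print(row)
```

Because time_of_ocurrence is part of the ordering, the result is deterministic even when two readings on the same day share the maximum.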
Use window functions:
select distinct
date_trunc('day',time_of_ocurrence) as day,
max(particles) over (partition by date_trunc('day',time_of_ocurrence)) as particles_max_of_day,
first_value(date_trunc('hour',time_of_ocurrence)) over (partition by date_trunc('day',time_of_ocurrence) order by particles desc)
from my_table
order by 1
One edge case here is when the same MAX number of particles shows up on the same day, but in different hours. This version would pick one of them arbitrarily. If you prefer one over the other (always the earlier one, for example), you can add that to the order by clause:
first_value(date_trunc('hour',time_of_ocurrence)) over (partition by date_trunc('day',time_of_ocurrence) order by particles desc, time_of_ocurrence)