DISTINCT ON to find min and max times - sql

I have tried using DISTINCT ON with PostgreSQL to achieve the following:
Let's say I have a table that looks like this:
id time price
1 12:00 10
1 13:00 20
1 14:00 30
And my goal is to create a table with only 1 row per id, that shows a column of the minimum time price and the maximum time price. Something that looks like this:
id min_time_price max_time_price
1 10 30
I tried using DISTINCT ON (id) but can't really get it.
Would love some help, Thank you!

Here is one method:
select t.id, tmin.price as min_time_price, tmax.price as max_time_price
from (select t.id, min(time) as min_time, max(time) as max_time
from t
group by t.id
) t join
t tmin
on t.id = tmin.id and t.min_time = tmin.time join
t tmax
on t.id = tmax.id and t.max_time = tmax.time;
You can also use aggregation. Postgres doesn't have first()/last() aggregation functions, but arrays are handy:
select t.id,
(array_agg(price order by time asc))[1] as min_time_price,
(array_agg(price order by time desc))[1] as max_time_price
from t
group by id;
Or using first_value() with opposite sort orders:
select distinct t.id,
first_value(price) over (partition by id order by time) as min_time_price,
first_value(price) over (partition by id order by time desc) as max_time_price
from t
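As a quick check of the first method above (grouped subquery joined back to the table twice), here is a minimal sketch that runs it against the question's sample rows. It uses Python's built-in sqlite3 rather than PostgreSQL, and the rows for id 2 are made up for illustration, so treat it as a sanity test of the logic and not of Postgres-specific syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, time TEXT, price INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        (1, "12:00", 10), (1, "13:00", 20), (1, "14:00", 30),  # rows from the question
        (2, "09:00", 7), (2, "11:00", 5),                      # hypothetical extra id
    ],
)

# One row per id with its earliest and latest time, joined back to fetch the prices.
query = """
SELECT b.id, tmin.price AS min_time_price, tmax.price AS max_time_price
FROM (SELECT id, MIN(time) AS min_time, MAX(time) AS max_time
      FROM t
      GROUP BY id) AS b
JOIN t AS tmin ON tmin.id = b.id AND tmin.time = b.min_time
JOIN t AS tmax ON tmax.id = b.id AND tmax.time = b.max_time
ORDER BY b.id
"""
rows = conn.execute(query).fetchall()
print(rows)  # [(1, 10, 30), (2, 7, 5)]
```

Note that id 2 returns 7 and 5: the columns hold the price at the earliest and latest time, not the smallest and largest price.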

Related

SQL count new values only with partition by - running count with no duplicates

Based on the table below, in Presto, I need a column counting all new 'rid' values. What I have managed so far is the same as what I can achieve with PARTITION BY, but it's not exactly what I'm looking for (db<>fiddle demo).
The goal is to have many grouping counts, but I think this should describe the problem sufficiently.
I need the data truncated by day, with a column for new users each day, as shown in the example below. In simple words: if a value repeats, don't count it again. I've tried to relate this to the relational division problem, but I'm just stuck.
You could use row_number() to rank the records of each rid by time; then you can aggregate and count in only the top record per group.
select
date_trunc('day', t.time) dy,
count(*) rid_count,
sum(case when t.rn = 1 then 1 else 0 end) new_rid_count
from (
select
t.*,
row_number() over(partition by t.rid order by t.time) rn
from mytable t
) t
group by date_trunc('day', t.time)
I think of this as two levels of aggregation. The inner one to get the earliest date. The outer to aggregate:
select first_day, count(*)
from (select rid, cast(date_trunc('day', min(time)) as date) as first_day
from orders o
group by rid
) r
group by 1
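To make the two-level aggregation concrete, here is a small sketch run through Python's sqlite3. The table contents are invented (the question's sample table did not survive), and SQLite's date() stands in for Presto's date_trunc('day', ...), so this only illustrates the shape of the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (rid INTEGER, time TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [
        (1, "2020-01-01 08:00:00"),
        (2, "2020-01-01 09:30:00"),
        (1, "2020-01-02 10:00:00"),  # rid 1 repeats, so it is not new on day 2
        (3, "2020-01-02 11:00:00"),
        (2, "2020-01-03 12:00:00"),  # day 3 has only a repeat visit
    ],
)

# Inner level: first day each rid was seen. Outer level: count rids per first day.
query = """
SELECT first_day, COUNT(*) AS new_rid_count
FROM (SELECT rid, date(MIN(time)) AS first_day
      FROM mytable
      GROUP BY rid) AS r
GROUP BY first_day
ORDER BY first_day
"""
new_per_day = conn.execute(query).fetchall()
print(new_per_day)  # [('2020-01-01', 2), ('2020-01-02', 1)]
```

One difference from the row_number() version: a day with no new rid (2020-01-03 here) produces no row at all instead of a row with a count of 0.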

Add columns to SQL query and filter by min(date) and sum(price)

I am trying to generate a list of users whose first purchase was in December 2018 and who have spent over 100 dollars since then, in SQL. I'm able to generate the list of users, but I'm unable to determine what their first purchase was or to pull in other variables. The issue appears to be that the columns I'm trying to include are neither grouped nor aggregated. I'm new to SQL, so I'm hoping someone can point me in the right direction.
Here's my code to generate the list I want to add more columns to:
select billing_address.name, contact_email, min(processed_at) as First_Purchase_Date, sum(total_price) as Total_Revenue
FROM (
SELECT *, ROW_NUMBER() OVER(PARTITION BY id) AS instance
FROM `table.orders`
) orders -- identify duplicate rows
WHERE instance = 1
group by contact_email, billing_address.name
having min(processed_at) between '2019-01-01 00:00:00 UTC' and '2019-02-01 00:00:00 UTC' and sum(total_price) > 100
order by sum(total_price) desc
Is there some way I can modify this to pull each user's purchase from this list into a separate row and include more columns? So I'd pull in each user (and ALL of their purchases) who has a min(processed_at) in December 2018 AND their sum(total_price) > 100? something like this:
SELECT contact_email, billing_address, line_items, min(processed_at), sum(total_price) OVER (PARTITION BY contact_email)
FROM (
SELECT *, ROW_NUMBER() OVER(PARTITION BY id) AS instance
FROM `table.orders`
) orders -- identify duplicate rows
WHERE instance = 1
However, the sum(total_price) doesn't work in this case and I can't filter by min(processed_at). Can someone guide me in the right direction?
I think you should use window functions instead of aggregation. You can compute the date of the first purchase and the total amount spent on the fly in a subquery, without aggregating (your original group by columns become the partition columns of the window functions). Then you can use this information to filter in the outer query.
This should get you close to what you want:
select o.*
from (
select
o.*,
min(processed_at) over(partition by contact_email, billing_address) min_processed_at,
sum(total_price) over(partition by contact_email, billing_address) sum_total_price
from (
select
o.*,
row_number() over(partition by id) instance
from orders o
) o
where instance = 1
) o
where
min_processed_at between '2019-01-01 00:00:00 UTC' and '2019-02-01 00:00:00 UTC'
and sum_total_price > 100
Your question was a bit unclear as you did not provide much detail about your input tables or your expected output, so this is a guess.
The following query gets all transactions from users who meet the criteria:
-- BigQuery StandardSQL
with ordered_orders as (
--rank each ID by processed_at date first to last
select *, row_number() over(partition by id order by processed_at asc) as rn
from `table.orders`
),
first_criteria as (
-- select IDs where first processed_at date is in 2018-12
select id, processed_at as first_order_date
from ordered_orders
where rn = 1
and extract(year from processed_at) = 2018
and extract(month from processed_at) = 12
),
second_criteria as (
-- further select IDs who meet first criteria and have a total of > 100
select id, sum(total_price) as total_revenue
from ordered_orders
inner join first_criteria using(id)
group by id
having total_revenue > 100
),
orders_with_criteria as (
-- get all orders for users who meet both criteria
select ordered_orders.* except(rn), first_order_date, total_revenue
from ordered_orders
inner join first_criteria using(id)
inner join second_criteria using(id)
)
-- select any fields you want
select * from orders_with_criteria
I prefer liberal use of CTEs in cases like this to keep the logic clear.
I also wouldn't be surprised if this query doesn't work as you intend. I think it is highly doubtful that the ID column in your orders table refers to the customer id, which is what you/we are partitioning on. Depending on who set up your tables, id probably refers to the order id. If you have a customer_id (or account #, etc), then I would use that instead of id in the query.
No need to use row_number() in BigQuery for this:
SELECT billing_address.name, contact_email,
MIN(processed_at) as First_Purchase_Date,
SUM(total_price) as Total_Revenue,
ARRAY_AGG(o ORDER BY processed_at LIMIT 1)[OFFSET(0)] as first_order
FROM `table.orders` o
WHERE instance = 1
GROUP BY contact_email, billing_address.name
HAVING MIN(processed_at) >= '2019-01-01 00:00:00 UTC' AND
MIN(processed_at) < '2019-02-01 00:00:00 UTC' AND
SUM(total_price) > 100
ORDER BY SUM(total_price) desc;
This returns the entire first order as a struct. You can select specific columns, if you prefer.
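For comparison with the answers above, here is a minimal sketch of the same requirement (all orders of customers whose first purchase falls in December 2018 and whose total exceeds 100). It swaps in a plain aggregate subquery joined back to the orders, runs on Python's sqlite3 instead of BigQuery, and uses an invented, simplified orders table keyed by contact_email, so every column and row here is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER, contact_email TEXT, processed_at TEXT, total_price REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "a@example.com", "2018-12-05 10:00:00", 60.0),
        (2, "a@example.com", "2019-02-10 10:00:00", 70.0),   # a: first order in Dec 2018, total 130
        (3, "b@example.com", "2018-12-20 10:00:00", 40.0),   # b: first order in Dec 2018, total 40
        (4, "c@example.com", "2018-11-01 10:00:00", 500.0),  # c: first order is before Dec 2018
        (5, "c@example.com", "2018-12-15 10:00:00", 50.0),
    ],
)

# Step 1: find qualifying customers. Step 2: join back to list every order they made.
query = """
WITH qualifying AS (
    SELECT contact_email,
           MIN(processed_at) AS first_purchase,
           SUM(total_price)  AS total_revenue
    FROM orders
    GROUP BY contact_email
    HAVING MIN(processed_at) >= '2018-12-01 00:00:00'
       AND MIN(processed_at) <  '2019-01-01 00:00:00'
       AND SUM(total_price) > 100
)
SELECT o.id, o.contact_email, o.total_price, q.first_purchase, q.total_revenue
FROM orders AS o
JOIN qualifying AS q ON q.contact_email = o.contact_email
ORDER BY o.id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Only customer a qualifies, so both of a's orders come back, including the one from February 2019. The half-open date range (>= start, < next month) avoids the boundary overlap that BETWEEN would introduce.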

Alternative for window function? (code example)

Hope you can give me a hand. I'm looking for a different way (more "classic", maybe) to achieve the same result as this query:
WITH a AS (
SELECT dev_id,
time_stamp,
LEAD(time_stamp) OVER (PARTITION BY dev_id ORDER BY time_stamp) as next_t
FROM my_table
WHERE month = 'July'
AND app_id = 1
AND event_id = 4
),
b AS (
SELECT
dev_id,
DATE_DIFF('second', time_stamp, next_t) as diff
FROM a
)
SELECT
dev_id,
AVG(diff) as AVG_pu
FROM b
GROUP BY 1
The final output is just an AVG function based on the past results from the subquery a and b.
I was thinking about using another subquery and an INNER JOIN ON dev_id, but I'm not sure on how to do it exactly. Any help will be highly appreciated!
If you want the average time, then the simplest method is:
SELECT dev_id,
CAST(DATE_DIFF('second', MIN(time_stamp), MAX(time_stamp)) AS DOUBLE) / NULLIF(COUNT(*) - 1, 0) as avg_diff
FROM my_table
WHERE month = 'July' AND
app_id = 1 AND
event_id = 4
GROUP BY dev_id;
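That is, the average difference is the total span divided by the number of gaps, which is one less than the number of rows. The consecutive differences telescope, so their sum is just MAX(time_stamp) - MIN(time_stamp). No window functions are needed.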
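Here is a small sketch that checks the span-over-gaps shortcut against a direct average of consecutive differences. It runs on Python's sqlite3 with a made-up table that stores timestamps as integer seconds in a column named ts (an assumption, used in place of DATE_DIFF on real timestamps).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (dev_id INTEGER, ts INTEGER)")  # ts: seconds, hypothetical
events = [(1, 100), (1, 130), (1, 190), (1, 200), (2, 50)]
conn.executemany("INSERT INTO my_table VALUES (?, ?)", events)

# Total span divided by the number of gaps (rows - 1); NULL when a device has one row.
query = """
SELECT dev_id,
       (MAX(ts) - MIN(ts)) * 1.0 / NULLIF(COUNT(*) - 1, 0) AS avg_diff
FROM my_table
GROUP BY dev_id
ORDER BY dev_id
"""
by_span = dict(conn.execute(query).fetchall())

# Cross-check for device 1: average the consecutive gaps directly.
ts = sorted(t for d, t in events if d == 1)
gaps = [b - a for a, b in zip(ts, ts[1:])]
by_gaps = sum(gaps) / len(gaps)

print(by_span[1], by_gaps, by_span[2])
```

Device 2 has a single event, so there are no gaps and the result is NULL (None in Python), which matches what AVG over the LEAD() differences would return.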

The Maximum value of two columns with group by

I have a table that contains the following data:
TRIP TRIP_DATE TRIP_TIME
A 2018-08-08 11:00
A 2018-08-09 11:00
A 2018-08-08 23:00
A 2018-08-20 11:00
A 2018-08-20 14:00
I want the select statement to retrieve the number of trips (a count) together with the latest date and time.
Basically the output should be like this:
TRIPS MAX(TRIP_DATE) TRIP_TIME
5 2018-08-20 14:00
This is tricky. I think I would do:
select cnt, trip_date, trip_time
from (select t.*,
row_number() over (partition by trip order by trip_date desc, trip_time desc) as seqnum,
count(*) over (partition by trip) as cnt
from t
) t
where seqnum = 1;
You can use the following using GROUP BY:
SELECT TRIP, COUNT(TRIP) AS cnt, MAX(CONCAT(TRIP_DATE, ' ', TRIP_TIME)) AS maxDateTime
FROM table_name
GROUP BY TRIP
To combine the DATE and TIME value you can use one of the following:
using CONCAT_WS: CONCAT_WS(' ', TRIP_DATE, TRIP_TIME)
using CONCAT: CONCAT(TRIP_DATE, ' ', TRIP_TIME)
You can use the above query as a sub-query to get the DATE and TIME as separate values:
SELECT TRIP, cnt, DATE(maxDateTime), TIME_FORMAT(TIME(maxDateTime), '%H:%i') FROM (
SELECT TRIP, COUNT(TRIP) AS cnt, MAX(CONCAT(TRIP_DATE, ' ', TRIP_TIME)) AS maxDateTime
FROM table_name
GROUP BY TRIP
)t;
Note: I recommend splitting the DATE and TIME values on the application side. I would also store the DATE and TIME values in one column as DATETIME instead of in separate columns.
demos: https://www.db-fiddle.com/f/xcMdmivjJa29rDhHxkUmuJ/2
You can use the row_number() function:
select t.*
from (select t.*, row_number() over (partition by trip order by trip_date desc, trip_time desc) seq
from table_name t
) t
where seq = 1;
I would go with this (assuming you wanted the MAX Trip_Time as well; it's a little difficult to tell from your example):
SELECT COUNT(TRIP) AS Trips,
MAX(TRIP_DATE) AS MAX_TRIP_DATE,
MAX(TRIP_TIME) AS TRIP_TIME
FROM myTable
GROUP BY TRIP
You have the option of using either an analytic function or a group function here.
Both will do the job. Looking at the final output, I believe MAX with GROUP BY is more suitable.
There is no hard and fast rule, but personally I prefer grouping when the final result needs to be collapsed to one row per group.
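The answers above differ in one important way, which this sketch tries to show on the question's own sample rows: taking MAX of the combined date and time is not the same as taking MAX of each column independently. It runs on Python's sqlite3, so `||` concatenation stands in for MySQL's CONCAT, and the table name trips is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (trip TEXT, trip_date TEXT, trip_time TEXT)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [
        ("A", "2018-08-08", "11:00"),
        ("A", "2018-08-09", "11:00"),
        ("A", "2018-08-08", "23:00"),
        ("A", "2018-08-20", "11:00"),
        ("A", "2018-08-20", "14:00"),
    ],
)

# MAX over the combined value keeps date and time from the same row.
combined = conn.execute("""
    SELECT trip, COUNT(*) AS cnt, MAX(trip_date || ' ' || trip_time) AS latest
    FROM trips
    GROUP BY trip
""").fetchall()

# MAX over each column separately can mix values from different rows.
independent = conn.execute("""
    SELECT trip, COUNT(*) AS cnt, MAX(trip_date) AS max_date, MAX(trip_time) AS max_time
    FROM trips
    GROUP BY trip
""").fetchall()

print(combined)     # [('A', 5, '2018-08-20 14:00')]
print(independent)  # [('A', 5, '2018-08-20', '23:00')]
```

With this data the independent version pairs 2018-08-20 with 23:00, a time that belongs to the 2018-08-08 row, so it does not match the expected output of 14:00.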

SQL Aggregates OVER and PARTITION

All,
This is my first post on Stackoverflow, so go easy...
I am using SQL Server 2008.
I am fairly new to writing SQL queries, and I have a problem that I thought was pretty simple, but I've been fighting for 2 days. I have a set of data that looks like this:
UserId Duration(Seconds) Month
1 45 January
1 90 January
1 50 February
1 42 February
2 80 January
2 110 February
3 45 January
3 62 January
3 56 January
3 60 February
Now, what I want is to write a single query that gives me the average for a particular user and compares it against all users' average for that month. So the resulting dataset after a query for user #1 would look like this:
UserId Duration(seconds) OrganizationDuration(Seconds) Month
1 67.5 63 January
1 46 65.5 February
I've been batting around different subqueries and group by scenarios and nothing ever seems to work. Lately, I've been trying OVER and PARTITION BY, but with no success there either. My latest query looks like this:
select Userid,
AVG(duration) OVER () as OrgAverage,
AVG(duration) as UserAverage,
DATENAME(mm,MONTH(StartDate)) as Month
from table.name
where YEAR(StartDate)=2014
AND userid=119
GROUP BY MONTH(StartDate), UserId
This query bombs out with a "'Duration' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause" error.
Please keep in mind I'm dealing with a very large amount of data. I think I can make it work with CASE statements, but I'm looking for a cleaner, more efficient way to write the query if possible.
Thank you!
You are joining two queries together here:
Per-User average per month
All Organisation average per month
If you are only going to return data for one user at a time then an inline select may give you joy:
SELECT AVG(a.duration) AS UserAverage,
(SELECT AVG(b.Duration) FROM tbl b WHERE MONTH(b.StartDate) = MONTH(a.StartDate)) AS OrgAverage
...
FROM tbl a
WHERE userid = 119
GROUP BY MONTH(StartDate), UserId
Note: comparing on MONTH() may be slow; you may be better off using a CTE (Common Table Expression).
You are missing the PARTITION BY clause in the AVG function:
OVER (PARTITION BY MONTH(StartDate))
Please try this. It works fine for me.
WITH C1
AS
(
SELECT
AVG(Duration) AS TotalAvg,
[Month]
FROM [dbo].[Test]
GROUP BY [Month]
),
C2
AS
(
SELECT Distinct UserID,
AVG(Duration) OVER(PARTITION BY UserID, [Month]) AS DetailedAvg, -- no ORDER BY: SQL Server 2008 does not allow it for windowed aggregates
[Month]
FROM [dbo].[Test]
)
SELECT C2.*, C1.TotalAvg
FROM C2 c2
INNER JOIN C1 c1 ON c1.[Month] = c2.[Month]
ORDER BY c2.UserID, c2.[Month] desc;
I was able to get it done using a self join. There's probably a better way.
Select UserId, AVG(t1.Duration) as Duration, t2.duration as OrgDur, t1.Month
from #temp t1
inner join (Select Distinct MONTH, AVG(Duration) over (partition by Month) as duration
from #temp) t2 on t2.Month = t1.Month
group by t1.Month, t1.UserId, t2.Duration
order by t1.UserId, Month desc
Here's a version using a CTE, which is probably a better solution and definitely easier to read:
With MonthlyAverage
as
(
Select MONTH, AVG(Duration) as OrgDur
from #temp
group by Month
)
Select UserId, AVG(t1.Duration) as Duration, m.OrgDur, t1.Month
from #temp t1
inner join MonthlyAverage m on m.Month = t1.Month
group by UserId, t1.Month, m.OrgDur
You can try the query below, which needs less code.
SELECT Distinct UserID,
AVG(Duration) OVER(PARTITION BY [Month]) AS TotalAvg,
AVG(Duration) OVER(PARTITION BY UserID, [Month]) AS DetailedAvg,
[Month]
FROM [dbo].[Test]
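To check the CTE approach against the numbers in the question, here is a minimal sketch using Python's sqlite3 and the question's sample data. The table and column names (test, user_id, duration, month) are simplified stand-ins, and SQLite's AVG always returns a float, whereas SQL Server's AVG over an integer column returns an integer unless you cast first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (user_id INTEGER, duration INTEGER, month TEXT)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?, ?)",
    [
        (1, 45, "January"), (1, 90, "January"), (1, 50, "February"), (1, 42, "February"),
        (2, 80, "January"), (2, 110, "February"),
        (3, 45, "January"), (3, 62, "January"), (3, 56, "January"), (3, 60, "February"),
    ],
)

# Organisation-wide average per month in a CTE, joined to the per-user average.
query = """
WITH monthly_average AS (
    SELECT month, AVG(duration) AS org_dur
    FROM test
    GROUP BY month
)
SELECT t.user_id, AVG(t.duration) AS duration, m.org_dur, t.month
FROM test AS t
JOIN monthly_average AS m ON m.month = t.month
WHERE t.user_id = ?
GROUP BY t.user_id, t.month, m.org_dur
ORDER BY t.month DESC
"""
rows = conn.execute(query, (1,)).fetchall()
print(rows)  # [(1, 67.5, 63.0, 'January'), (1, 46.0, 65.5, 'February')]
```

The user filter sits in the outer query only, so the CTE still averages over every user, which is what produces 63 and 65.5 for the organisation columns.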