How to retrieve other columns when performing an aggregate function? - sql

I've been trying to retrieve other columns from a table on which I'm running an aggregate function to get the minimum number by date. This is an example of the data:
id resource date quality ask ask_volume
1 1 2020-06-08 10:50 0 6.9 5102
2 1 2020-06-08 10:50 1 6.8 2943
3 1 2020-06-08 10:50 2 6.9 25338
4 1 2020-06-08 10:50 3 7.0 69720
5 1 2020-06-08 10:50 4 7.0 9778
6 1 2020-06-08 10:50 5 7.0 297435
7 1 2020-06-08 10:40 0 6.6 611
8 1 2020-06-08 10:40 1 6.6 4331
9 1 2020-06-08 10:40 2 6.7 1000
10 1 2020-06-08 10:40 3 7.0 69720
11 1 2020-06-08 10:40 4 7.0 9778
12 1 2020-06-08 10:40 5 7.0 297435
...
This is the desired result I'm trying to get, so I can perform a weighted average on it:
date ask ask_volume
2020-06-08 10:50 6.8 2943
2020-06-08 10:40 6.6 4331
...
Though both quality 0 and quality 1 have the same ask, quality 1 shall be chosen because its ask_volume is higher.
I have tried the classic:
SELECT date, min(ask) FROM table GROUP BY date;
But adding ask_volume to the column list will force me to add it to the GROUP BY as well, messing up the result.
The problems are:
How can I get the corresponding ask_volume of the minimum ask displayed in the result?
And, if there are two records with the same ask value on the same date, how can I get ask_volume to show the one with the highest value?
I use PostgreSQL, but SQL from a different database will help me get the idea as well.

In standard SQL, you would use window functions:
select *
from (
select t.*, row_number() over(partition by date order by ask, ask_volume desc) rn
from mytable t
) t
where rn = 1
In Postgres this is better suited for distinct on:
select distinct on (date) *
from mytable
order by ask, ask_volume desc

You can do what you want with distinct on:
select distinct on (date) t.*
from t
order by date, ask, ask_volume desc;
I find your date column confusing. It has a time component, so the name is misleading.

Other answers are simpler and better, but here is an alternative to get around your aggregation problem. You could use a subquery to keep only the rows with the min ask per date, and then take the max ask_volume among those rows.
select date, ask, max(ask_volume) as ask_volume
from t
where (date, ask) in (select date, min(ask)
from t
group by date)
group by date, ask;

DISTINCT ON has already been suggested, but in imperfect ways. (The currently accepted answer is incorrect.) That's how you do it:
SELECT DISTINCT ON (date) *
FROM tbl
ORDER BY date, ask, ask_volume DESC NULLS LAST;
Most importantly, leading expressions in ORDER BY must be in the set of expressions in DISTINCT ON. In other words, for the simple case, date must be the first ORDER BY expression.
As long as null values have not been ruled out (with a NOT NULL constraint), you must add NULLS LAST, or null values sort first in descending order.
Detailed explanation:
Select first row in each GROUP BY group?
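To check the tie-breaking rule on the sample rows from the question, here is a minimal sketch. It uses Python's sqlite3 module as a stand-in for PostgreSQL (an assumption made only so the example is self-contained), and because SQLite has no DISTINCT ON it uses the row_number() form. It needs SQLite 3.25 or later for window functions. The table name mytable follows the answers above.

```python
import sqlite3

# In-memory copy of the question's sample rows (SQLite stands in for PostgreSQL).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (id INTEGER, resource INTEGER, date TEXT,"
    " quality INTEGER, ask REAL, ask_volume INTEGER)"
)
rows = [
    (1, 1, "2020-06-08 10:50", 0, 6.9, 5102),
    (2, 1, "2020-06-08 10:50", 1, 6.8, 2943),
    (3, 1, "2020-06-08 10:50", 2, 6.9, 25338),
    (4, 1, "2020-06-08 10:50", 3, 7.0, 69720),
    (5, 1, "2020-06-08 10:50", 4, 7.0, 9778),
    (6, 1, "2020-06-08 10:50", 5, 7.0, 297435),
    (7, 1, "2020-06-08 10:40", 0, 6.6, 611),
    (8, 1, "2020-06-08 10:40", 1, 6.6, 4331),
    (9, 1, "2020-06-08 10:40", 2, 6.7, 1000),
    (10, 1, "2020-06-08 10:40", 3, 7.0, 69720),
    (11, 1, "2020-06-08 10:40", 4, 7.0, 9778),
    (12, 1, "2020-06-08 10:40", 5, 7.0, 297435),
]
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?, ?, ?, ?)", rows)

# Rank rows within each date: lowest ask first, and among equal asks
# the highest ask_volume first. Row number 1 is the row we want.
query = """
SELECT date, ask, ask_volume
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (PARTITION BY date
                              ORDER BY ask, ask_volume DESC) AS rn
    FROM mytable t
) ranked
WHERE rn = 1
ORDER BY date DESC
"""
best_per_date = conn.execute(query).fetchall()
for row in best_per_date:
    print(row)
```

For 10:40 both quality 0 and quality 1 have ask 6.6, and the descending sort on ask_volume is what picks the 4331 row.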

redshift cumulative count records via SQL

I've been struggling to find an answer for this question. I think this question is similar to what I'm looking for, but when I tried this it didn't work.
Because there's no new unique user_id added between 02-20 and 02-27, the cumulative count will be the same. Then for 02-27, there is a unique user_id which hasn't appeared on any previous dates (6)
Here's my input
date user_id
2020-02-20 1
2020-02-20 2
2020-02-20 3
2020-02-20 4
2020-02-20 4
2020-02-20 5
2020-02-21 1
2020-02-22 2
2020-02-23 3
2020-02-24 4
2020-02-25 4
2020-02-27 6
Output table:
date daily_cumulative_count
2020-02-20 5
2020-02-21 5
2020-02-22 5
2020-02-23 5
2020-02-24 5
2020-02-25 5
2020-02-27 6
This is what I tried, and the result is not quite what I want:
select
stat_date,count(DISTINCT user_id),
sum(count(DISTINCT user_id)) over (order by stat_date rows unbounded preceding) as cumulative_signups
from data_engineer_interview
group by stat_date
order by stat_date
It returns this instead:
date,count,cumulative_sum
2020-02-20,5,5
2020-02-21,1,6
2020-02-22,1,7
2020-02-23,1,8
2020-02-24,1,9
2020-02-25,1,10
2020-02-27,1,11
The problem with this task is that it could be done by comparing each row uniquely with all previous rows to see if there is a match in user_id. Since you are using Redshift I'll assume that your data table could be very large so attacking the problem this way will bog down in some form of a loop join.
You want to think about the problem differently to avoid this looping issue. If you derive a dataset with id and first_date_of_id you can then just do a cumulative sum sorted by date. Like this
select first_date,
sum(count(*)) over (order by first_date rows unbounded preceding) as daily_cumulative_count
from (select user_id, min("date") as first_date
from data_engineer_interview
group by user_id) f
group by first_date
order by first_date;
This is untested and won't produce the full list of dates that you have in your example output but rather only the dates where new ids show up. If this is an issue it is simple to add in the additional dates with no count change.
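Here is a small sketch of the first-appearance idea that also fills in the dates with no new ids. It runs the SQL through Python's sqlite3 module rather than Redshift (an assumption made so it is self-contained; it needs SQLite 3.25 or later for window functions). The CTE names first_seen and new_per_day are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data_engineer_interview (date TEXT, user_id INTEGER)")
rows = [
    ("2020-02-20", 1), ("2020-02-20", 2), ("2020-02-20", 3),
    ("2020-02-20", 4), ("2020-02-20", 4), ("2020-02-20", 5),
    ("2020-02-21", 1), ("2020-02-22", 2), ("2020-02-23", 3),
    ("2020-02-24", 4), ("2020-02-25", 4), ("2020-02-27", 6),
]
conn.executemany("INSERT INTO data_engineer_interview VALUES (?, ?)", rows)

query = """
WITH first_seen AS (
    -- one row per user: the first date that user appears
    SELECT user_id, MIN(date) AS first_date
    FROM data_engineer_interview
    GROUP BY user_id
),
new_per_day AS (
    -- every date in the table, with the number of users first seen that day
    -- (0 for dates where nobody new shows up)
    SELECT d.date, COUNT(f.user_id) AS new_users
    FROM (SELECT DISTINCT date FROM data_engineer_interview) d
    LEFT JOIN first_seen f ON f.first_date = d.date
    GROUP BY d.date
)
SELECT date,
       SUM(new_users) OVER (ORDER BY date ROWS UNBOUNDED PRECEDING)
           AS daily_cumulative_count
FROM new_per_day
ORDER BY date
"""
cumulative = conn.execute(query).fetchall()
for row in cumulative:
    print(row)
```

Each user is counted once, on the first date it appears, so repeated ids on later dates no longer inflate the running total.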
We can do this via a correlated subquery followed by aggregation:
WITH cte AS (
SELECT
date,
CASE WHEN EXISTS (
SELECT 1
FROM data_engineer_interview d2
WHERE d2.date < d1.date AND
d2.user_id = d1.user_id
) THEN 0 ELSE 1 END AS flag
FROM (SELECT DISTINCT date, user_id FROM data_engineer_interview) d1
)
SELECT date,
SUM(SUM(flag)) OVER (ORDER BY date ROWS UNBOUNDED PRECEDING) AS daily_cumulative_count
FROM cte
GROUP BY date
ORDER BY date;

SQL Troubleshooting Help on Table Structure

I'm attempting to calculate average number of days between a customer's 1st and 3rd purchase, but struggling to get the data ordered in a way that will allow me to calculate.
I currently have the below data table. (Note: Order sequence number refers to the number order for that customer.)
Order Date | Customer Number | Order Sequence Number
2020-09-20 | 1 | 1
2021-01-20 | 1 | 2
2021-01-21 | 1 | 3
2020-10-01 | 2 | 1
2020-08-06 | 3 | 1
2020-09-06 | 3 | 2
2020-09-09 | 3 | 3
I've been trying to get the data to look like the following table. [To then be able to calculate datediff on the last two columns.]
Customer Number | Order Count | First Order Date | Third Order Date
1 | 3 | 2020-09-20 | 2021-01-21
2 | 1 | 2020-10-01 | Null
3 | 3 | 2020-08-06 | 2020-09-09
I've completely messed up the code, but here's what I've been trying.
CREATE TABLE X2 as
SELECT
customer_number,
max(order_sequence_number) as order_count,
CASE
WHEN order_sequence_number = 1 then order_date
ELSE null
END as first_order_date,
CASE
WHEN order_sequence_number = 3 then order_date
ELSE null
END as third_order_date
FROM X1
GROUP BY customer_number;
Can someone please tell me what I'm missing? Thanks in advance!
You are on the right track but you need aggregation functions:
SELECT customer_number,
max(order_sequence_number) as order_count,
MAX(CASE WHEN order_sequence_number = 1 THEN order_date END) as first_order_date,
MAX(CASE WHEN order_sequence_number = 3 THEN order_date END) as third_order_date
FROM X1
GROUP BY customer_number;
To get the difference in days, you would just subtract the two expressions using whatever date arithmetic is supported in your database.
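As a rough illustration of the date arithmetic step, here is a sketch that runs the conditional aggregation on the sample rows and then computes the gap in days. It assumes SQLite (through Python's sqlite3 module), where julianday() is one way to subtract dates; other databases have their own functions such as DATEDIFF. The CTE name pivoted and the column days_1st_to_3rd are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE X1 (order_date TEXT, customer_number INTEGER,"
    " order_sequence_number INTEGER)"
)
orders = [
    ("2020-09-20", 1, 1), ("2021-01-20", 1, 2), ("2021-01-21", 1, 3),
    ("2020-10-01", 2, 1),
    ("2020-08-06", 3, 1), ("2020-09-06", 3, 2), ("2020-09-09", 3, 3),
]
conn.executemany("INSERT INTO X1 VALUES (?, ?, ?)", orders)

query = """
WITH pivoted AS (
    -- one row per customer; MAX() picks the single non-null date per CASE
    SELECT customer_number,
           MAX(order_sequence_number) AS order_count,
           MAX(CASE WHEN order_sequence_number = 1 THEN order_date END)
               AS first_order_date,
           MAX(CASE WHEN order_sequence_number = 3 THEN order_date END)
               AS third_order_date
    FROM X1
    GROUP BY customer_number
)
SELECT customer_number, order_count, first_order_date, third_order_date,
       -- SQLite-specific date subtraction; NULL when there is no third order
       CAST(julianday(third_order_date) - julianday(first_order_date)
            AS INTEGER) AS days_1st_to_3rd
FROM pivoted
ORDER BY customer_number
"""
per_customer = conn.execute(query).fetchall()

# Average only over customers who actually reached a third order.
gaps = [row[4] for row in per_customer if row[4] is not None]
average_days = sum(gaps) / len(gaps)
print(per_customer)
print(average_days)
```

Customer 2 has no third order, so its gap is NULL and it is left out of the average.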

Is there a method to write a SQL query that returns cumulative results based on the count of another column?

I have a query where I am counting the total number of new users signed up to a particular service each day since the service started.
So far I have:
SELECT DISTINCT CONVERT(DATE, Account_Created) AS Date_Created,
COUNT(ID) OVER (PARTITION BY CONVERT(DATE, Account_Created)) AS New_Users
FROM My_Table.dbo.NewAccts
ORDER BY Date_Created
This returns:
Date_Created | New_Users
--------------------------
2020-01-01 1
2020-01-03 3
2020-01-04 2
2020-01-06 5
2020-01-07 9
What I would like is to return a third column with a cumulative total for each day starting from the beginning until the present. So the first day there was only one new user. On January 3rd, three new users signed up for a total of four since the beginning--so on and so forth.
Date_Created | New_Users | Cumulative_Tot
------------------------------------------
2020-01-01 1 1
2020-01-03 3 4
2020-01-04 2 6
2020-01-06 5 11
2020-01-07 9 20
My thought process was to involve the ROW_NUMBER() function so that I can separate and add each consecutive row together, though I am not sure if that is correct. My feeling is that I am probably thinking about this too hard and the logic is simply just escaping me at the moment. Thank you for any help.
As a starter: I would recommend aggregation rather than DISTINCT and a window count. This makes the intent clearer, and is likely more efficient.
Then, you can make use of a window sum to compute the cumulative count.
SELECT
CONVERT(DATE, Account_Created) AS Date_Created,
COUNT(*) AS New_Users,
SUM(COUNT(*)) OVER(ORDER BY CONVERT(DATE, Account_Created)) Cumulative_New_Users
FROM My_Table.dbo.NewAccts
GROUP BY CONVERT(DATE, Account_Created)
ORDER BY Date_Created
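To see the aggregate-then-window-sum pattern run end to end, here is a sketch using Python's sqlite3 module instead of SQL Server (an assumption for the sake of a self-contained example; date() replaces CONVERT(DATE, ...), and SQLite 3.25 or later is needed). The raw account rows are hypothetical, generated only to reproduce the daily counts shown in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE NewAccts (ID INTEGER PRIMARY KEY, Account_Created TEXT)")

# Hypothetical raw rows: one account per sign-up, matching the question's counts.
signups_per_day = {
    "2020-01-01": 1,
    "2020-01-03": 3,
    "2020-01-04": 2,
    "2020-01-06": 5,
    "2020-01-07": 9,
}
accounts = [
    (day + " 09:00:00",)
    for day, count in signups_per_day.items()
    for _ in range(count)
]
conn.executemany("INSERT INTO NewAccts (Account_Created) VALUES (?)", accounts)

query = """
WITH daily AS (
    -- step 1: plain aggregation, one row per calendar day
    SELECT date(Account_Created) AS Date_Created, COUNT(*) AS New_Users
    FROM NewAccts
    GROUP BY date(Account_Created)
)
-- step 2: running total over the daily rows
SELECT Date_Created, New_Users,
       SUM(New_Users) OVER (ORDER BY Date_Created ROWS UNBOUNDED PRECEDING)
           AS Cumulative_Tot
FROM daily
ORDER BY Date_Created
"""
totals = conn.execute(query).fetchall()
for row in totals:
    print(row)
```

Splitting the work into a daily CTE and an outer window sum is equivalent to nesting SUM(COUNT(*)) OVER (...) in a single query; it is just easier to read step by step.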

Get MAX count but keep the repeated calculated value if highest

I have the following table. I am using SQL Server 2008.
BayNo FixDateTime FixType
1 04/05/2015 16:15:00 tyre change
1 12/05/2015 00:15:00 oil change
1 12/05/2015 08:15:00 engine tuning
1 04/05/2016 08:11:00 car tuning
2 13/05/2015 19:30:00 puncture
2 14/05/2015 08:00:00 light repair
2 15/05/2015 10:30:00 super op
2 20/05/2015 12:30:00 wiper change
2 12/05/2016 09:30:00 denting
2 12/05/2016 10:30:00 wiper repair
2 12/06/2016 10:30:00 exhaust repair
4 12/05/2016 05:30:00 stereo unlock
4 17/05/2016 15:05:00 door handle repair
I need to find, for each bay number, the highest number of fixes made on any single day, and if that highest count is repeated on other days, those rows should also appear in the result set.
So I would like to see the result set as follows:
BayNo FixDateTime noOfFixes
1 12/05/2015 00:15:00 2
2 12/05/2016 09:30:00 2
4 12/05/2016 05:30:00 1
4 17/05/2016 15:05:00 1
I managed to get the counts for each, but I'm struggling to get the max and keep the repeated highest value. Can someone help, please?
Use window functions.
Get the count for each day by bayno and also find the min fixdatetime for each day per bayno.
Then use dense_rank to compute the highest ranked row for each bayno based on the number of fixes.
Finally get the highest ranked rows.
select distinct bayno,minfixdatetime,no_of_fixes
from (
select bayno,minfixdatetime,no_of_fixes
,dense_rank() over(partition by bayno order by no_of_fixes desc) rnk
from (
select t.*,
count(*) over(partition by bayno,cast(fixdatetime as date)) no_of_fixes,
min(fixdatetime) over(partition by bayno,cast(fixdatetime as date)) minfixdatetime
from tablename t
) x
) y
where rnk = 1
You are looking for rank() or dense_rank(). I would write the query like this:
select bayno, thedate, numFixes
from (select bayno, cast(fixdatetime as date) as thedate,
count(*) as numFixes,
rank() over (partition by bayno order by count(*) desc) as seqnum
from t
group by bayno, cast(fixdatetime as date)
) b
where seqnum = 1;
Note that this returns the date in question. The date does not have a time component.
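For anyone who wants to try the ranking approach on the sample data, here is a sketch using Python's sqlite3 module in place of SQL Server 2008 (an assumption; it needs SQLite 3.25 or later). The table name fixes is made up, and the dd/mm/yyyy values from the question are stored as ISO strings so that date() can read them. It also keeps the earliest fix time of each day, so the output lines up with the result set in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fixes (bayno INTEGER, fixdatetime TEXT, fixtype TEXT)")
rows = [
    (1, "2015-05-04 16:15:00", "tyre change"),
    (1, "2015-05-12 00:15:00", "oil change"),
    (1, "2015-05-12 08:15:00", "engine tuning"),
    (1, "2016-05-04 08:11:00", "car tuning"),
    (2, "2015-05-13 19:30:00", "puncture"),
    (2, "2015-05-14 08:00:00", "light repair"),
    (2, "2015-05-15 10:30:00", "super op"),
    (2, "2015-05-20 12:30:00", "wiper change"),
    (2, "2016-05-12 09:30:00", "denting"),
    (2, "2016-05-12 10:30:00", "wiper repair"),
    (2, "2016-06-12 10:30:00", "exhaust repair"),
    (4, "2016-05-12 05:30:00", "stereo unlock"),
    (4, "2016-05-17 15:05:00", "door handle repair"),
]
conn.executemany("INSERT INTO fixes VALUES (?, ?, ?)", rows)

query = """
WITH per_day AS (
    -- fixes per bay per calendar day, plus the first fix time of that day
    SELECT bayno,
           MIN(fixdatetime) AS minfixdatetime,
           COUNT(*) AS no_of_fixes
    FROM fixes
    GROUP BY bayno, date(fixdatetime)
),
ranked AS (
    -- RANK() gives every tied day the same rank, so ties are all kept
    SELECT per_day.*,
           RANK() OVER (PARTITION BY bayno ORDER BY no_of_fixes DESC) AS rnk
    FROM per_day
)
SELECT bayno, minfixdatetime, no_of_fixes
FROM ranked
WHERE rnk = 1
ORDER BY bayno, minfixdatetime
"""
busiest_days = conn.execute(query).fetchall()
for row in busiest_days:
    print(row)
```

Bay 4 shows up twice because both of its days tie at one fix, which is the "repeated value" behaviour asked for.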

SQL Calculating time from last transaction for each ID

Hello I'm stuck trying to calculate the difference in time between each transaction for each ID.
The data looks like
Customer_ID | Transaction_Time
1 00:30
1 00:35
1 00:37
1 00:38
2 00:20
2 00:21
2 00:23
I'm trying to get the result to look something like
Customer_ID | Time_diff
1 5
1 2
1 1
2 1
2 2
I would really appreciate any help.
Thanks
Most databases support the LAG() function. However, the date/time functions can depend on the database. Here is an example for SQL Server:
select t.*
from (select t.*,
datediff(second,
lag(transaction_time) over (partition by customer_id order by transaction_time),
transaction_time
) as diff
from t
) t
where diff is not null;
The logic would be similar in most databases, although the function for calculating the time difference varies.
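As one example of how the time-difference part changes between databases, here is the same LAG() logic as a sketch for SQLite, run through Python's sqlite3 module (SQLite 3.25 or later). It assumes the Transaction_Time values are hours:minutes and attaches a made-up date (2020-01-01) to them; the difference is computed in minutes via strftime('%s', ...), which is SQLite-specific.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (customer_id INTEGER, transaction_time TEXT)")
rows = [
    (1, "2020-01-01 00:30:00"),
    (1, "2020-01-01 00:35:00"),
    (1, "2020-01-01 00:37:00"),
    (1, "2020-01-01 00:38:00"),
    (2, "2020-01-01 00:20:00"),
    (2, "2020-01-01 00:21:00"),
    (2, "2020-01-01 00:23:00"),
]
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

query = """
SELECT customer_id, diff_minutes
FROM (
    SELECT customer_id,
           transaction_time,
           -- seconds since epoch for this row minus the previous row of the
           -- same customer, converted to whole minutes; NULL for the first row
           (CAST(strftime('%s', transaction_time) AS INTEGER)
            - CAST(strftime('%s',
                   LAG(transaction_time) OVER (PARTITION BY customer_id
                                               ORDER BY transaction_time))
                   AS INTEGER)) / 60 AS diff_minutes
    FROM t
) d
WHERE diff_minutes IS NOT NULL
ORDER BY customer_id, transaction_time
"""
time_diffs = conn.execute(query).fetchall()
for row in time_diffs:
    print(row)
```

The first transaction of each customer has no previous row, so LAG() returns NULL there and the outer filter drops it.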