Checking conditions per group, and ranking most recent row? - sql

I'm handling a table like so:
Name   | Status | Date
Alfred | 1      | Jan 1 2023
Alfred | 2      | Jan 2 2023
Alfred | 3      | Jan 2 2023
Alfred | 4      | Jan 3 2023
Bob    | 1      | Jan 1 2023
Bob    | 3      | Jan 2 2023
Carl   | 1      | Jan 5 2023
Dan    | 1      | Jan 8 2023
Dan    | 2      | Jan 9 2023
I'm trying to set up a query so I can handle the following:
I'd like to pull the most recent status per Name,
SELECT MAX(Date), Status, Name
FROM test_table
GROUP BY Status, Name
Additionally, I'd like the same query to pull whether the user has ever had a status of 2, regardless of whether the most recent one is 2 or not:
WITH has_2_table AS (
SELECT DISTINCT Name, TRUE as has_2
FROM test_table
WHERE Status = 2 )
And then maybe joining the above via a left join on Name?
But having these as two separate queries and joining them feels clunky to me, especially since I'd like to add additional columns and other checks. Is there a better way to set this up in one single query, or is this the most efficient way?

You said, "I'd like to add additional columns," so I interpret that to mean you would like to select the entire most recent record and add an 'ever-2' column.
You can either do this by joining two queries, or use window functions. Not knowing Snowflake Cloud Data, I cannot tell you which is more efficient.
Join 2 Queries
Select A.*, Coalesce(B.Ever2, 'No') as Ever2
From (
    Select * From test_table x
    Where date = (Select max(date) From test_table y
                  Where x.name = y.name)
) A Left Outer Join (
    Select name, 'Yes' as Ever2 From test_table
    Where status = 2
    Group By name
) B On A.name = B.name
The first subquery can also be written as an Inner Join if correlated subqueries are implemented badly on your platform.
Use of Window Functions
Select * From (
    Select row_number() Over (Partition by name Order by date desc, status desc) as bestrow,
           A.*,
           Coalesce(max(Case When status = 2 Then 'Yes' End)
                      Over (Partition By name), 'No') as Ever2
    From test_table A
)
Where bestrow = 1
This second query type always reads and sorts the entire test_table so it might not be the most efficient.
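For reference, here's a runnable sketch of this window-function variant, using SQLite (3.25+) as a stand-in for Snowflake, with ISO dates so string comparison sorts correctly; table and column names follow the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE test_table (name TEXT, status INTEGER, date TEXT);
INSERT INTO test_table VALUES
  ('Alfred', 1, '2023-01-01'), ('Alfred', 2, '2023-01-02'),
  ('Alfred', 3, '2023-01-02'), ('Alfred', 4, '2023-01-03'),
  ('Bob', 1, '2023-01-01'),    ('Bob', 3, '2023-01-02'),
  ('Carl', 1, '2023-01-05'),
  ('Dan', 1, '2023-01-08'),    ('Dan', 2, '2023-01-09');
""")

rows = conn.execute("""
SELECT name, date, status, ever2 FROM (
  SELECT a.*,
         -- rank each person's rows, most recent first
         ROW_NUMBER() OVER (PARTITION BY name
                            ORDER BY date DESC, status DESC) AS bestrow,
         -- 'Yes' if any row for this name has status 2, else 'No'
         COALESCE(MAX(CASE WHEN status = 2 THEN 'Yes' END)
                    OVER (PARTITION BY name), 'No') AS ever2
  FROM test_table a
)
WHERE bestrow = 1
ORDER BY name
""").fetchall()

for r in rows:
    print(r)
```

Note that no frame clause is needed on the `MAX` window: with `PARTITION BY` alone and no `ORDER BY`, the window already spans the whole partition.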

Given that you have a different partitioning on the two aggregations, you could try going with window functions instead:
SELECT DISTINCT Name,
MAX(Date) OVER(
PARTITION BY Name, Status
) AS lastdate,
MAX(CASE WHEN Status = 2 THEN 1 ELSE 0 END) OVER(
PARTITION BY Name
) AS status2
FROM tab

I'd like to pull the most recent status per name […] Additionally I'd like in the same query to be able to pull if the user has ever had a status of 2.
Snowflake has sophisticated aggregate functions.
Using group by, we can get the latest status with arrays and check for a given status with boolean aggregation:
select name, max(date) max_date,
get(array_agg(status) within group (order by date desc), 0) last_status,
boolor_agg(status = 2) has_status2
from mytable
group by name
We could also use window functions and qualify:
select name, date as max_date,
       status as last_status,
       boolor_agg(status = 2) over(partition by name) has_status2
from mytable
qualify row_number() over(partition by name order by date desc) = 1
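`boolor_agg`, `array_agg ... within group`, and `qualify` are Snowflake-specific. For comparison, here's a portable sketch of the same idea that runs on SQLite (3.25+): `ROW_NUMBER` in a derived table replaces `qualify`, and `MAX` over a 0/1 flag replaces `boolor_agg` (ISO dates assumed so strings sort correctly):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytable (name TEXT, status INTEGER, date TEXT);
INSERT INTO mytable VALUES
  ('Alfred', 1, '2023-01-01'), ('Alfred', 2, '2023-01-02'),
  ('Alfred', 3, '2023-01-02'), ('Alfred', 4, '2023-01-03'),
  ('Bob', 1, '2023-01-01'),    ('Bob', 3, '2023-01-02'),
  ('Carl', 1, '2023-01-05'),
  ('Dan', 1, '2023-01-08'),    ('Dan', 2, '2023-01-09');
""")

rows = conn.execute("""
SELECT name, date AS max_date, status AS last_status, has_status2 FROM (
  SELECT name, date, status,
         -- rn = 1 marks each person's most recent row
         ROW_NUMBER() OVER (PARTITION BY name ORDER BY date DESC) AS rn,
         -- (status = 2) evaluates to 0/1; MAX gives 1 if it ever held
         MAX(status = 2) OVER (PARTITION BY name) AS has_status2
  FROM mytable
)
WHERE rn = 1
ORDER BY name
""").fetchall()

for r in rows:
    print(r)
```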

Related

Use window functions to select the value from a column based on the sum of another column, in an aggregate query

Consider this data (View on DB Fiddle):
id | dept | value
1  | A    | 5
1  | A    | 5
1  | B    | 7
1  | C    | 5
2  | A    | 5
2  | A    | 5
2  | B    | 15
2  | A    | 2
The base query I am running is pretty simple. Just get the total value by id and the most frequent dept.
SELECT
id,
MODE() WITHIN GROUP(ORDER BY dept) AS dept_freq,
SUM(value) AS value
FROM test
GROUP BY id
;
id | dept_freq | value
1  | A         | 22
2  | A         | 27
But I also need to get, for each id, the dept that concentrates the greatest value (so the greatest sum of value by id and dept, not the highest individual value in the original table).
Is there any way to use window functions to achieve that and do it directly in the base query above?
The expected output for this particular example would be:
id | dept_freq | dept_value | value
1  | A         | A          | 22
2  | A         | B          | 27
I could achieve that with the query below and then joining that with the results of the base query above
SELECT * FROM(
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY id ORDER BY value DESC) as row
FROM (
SELECT id, dept, SUM(value) AS value
FROM test
GROUP BY id, dept
) AS alias1
) AS alias2
WHERE alias2.row = 1
;
id | dept | value | row
1  | A    | 10    | 1
2  | B    | 15    | 1
But it is not easy to read/maintain and seems also pretty inefficient. So I thought it should be possible to achieve this using window functions directly in the base query, and that may also help Postgres come up with a better query plan that does fewer passes over the data. But none of my attempts using over partition and filter worked.
step-by-step demo:db<>fiddle
You can fetch the dept with the highest summed value using the first_value() window function over the per-(id, dept) sums. Adding this before your mode() grouping should do it:
SELECT
    id,
    highest_value_dept,
    MODE() WITHIN GROUP(ORDER BY dept) AS dept_freq,
    SUM(value) as value
FROM (
    SELECT
        id,
        dept,
        value,
        FIRST_VALUE(dept) OVER (PARTITION BY id ORDER BY dept_sum DESC) as highest_value_dept
    FROM (
        SELECT *, SUM(value) OVER (PARTITION BY id, dept) AS dept_sum
        FROM test
    ) s0
) s
GROUP BY 1,2
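As a runnable check of the idea (a sketch in SQLite 3.25+, which lacks MODE() WITHIN GROUP, so only the dept-with-greatest-sum part is shown): the per-(id, dept) sums are computed first so that first_value() picks the dept with the greatest total rather than the highest individual row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE test (id INTEGER, dept TEXT, value INTEGER);
INSERT INTO test VALUES
  (1,'A',5),(1,'A',5),(1,'B',7),(1,'C',5),
  (2,'A',5),(2,'A',5),(2,'B',15),(2,'A',2);
""")

rows = conn.execute("""
SELECT id, dept_value, SUM(value) AS value
FROM (
  SELECT id, value,
         -- dept whose total value is greatest within this id
         FIRST_VALUE(dept) OVER (PARTITION BY id
                                 ORDER BY dept_sum DESC) AS dept_value
  FROM (
    -- total value per (id, dept), attached to every row
    SELECT *, SUM(value) OVER (PARTITION BY id, dept) AS dept_sum
    FROM test
  )
)
GROUP BY id, dept_value
ORDER BY id
""").fetchall()

for r in rows:
    print(r)
```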

Retrieve records from versioned table

This SQL case has been troubling me for a while and I wanted to ask here what other folks think.
I have a user table; users own vehicles, but the same vehicle may be owned by multiple users over time. Another column, effective_date, tells from what day the ownership is effective. Two drivers don't own the same vehicle at the same time, but records are versioned, meaning we can check who owned a vehicle 2 years ago, or 5 years ago, using the effective date.
Table has following columns,
id, version, name, vehicle_id, effective_date. Every change to this table is versioned
Now there is another table called accident which tells what accident happened with which vehicle and when; it is not versioned.
It has id, description, vehicle_id, acc_date.
Now I am trying to select all accidents and who caused each one. An inner join alone doesn't work here. What I do is select all rows from the accident table and run a subquery for each row to find the user's id and version that was responsible. This is super slow, and I am looking for a more performant way of organizing the data or constructing the query. Right now it runs a subquery for every row it selects from the accident table, because each row has a different accident date. I am OK doing a few queries if there is no easy way of doing it within a single query.
Example
user table
id | version | name | vehicle_id | effective_date
1  | 1       | A    | 1          | 01/10/2021
1  | 2       | A    | 2          | 02/10/2021
2  | 1       | B    | 1          | 03/10/2021
2  | 2       | B    | 2          | 04/10/2021
accident:
id | description | vehicle_id | acc_date
1  | hit1        | 1          | 03/5/2021
2  | hit2        | 1          | 03/15/2021
Result:
user_id | user_version | acc_id | vehicle_id | acc_date
1       | 1            | 1      | 1          | 03/5/2021
2       | 1            | 2      | 1          | 03/15/2021
thanks for your help
To get the latest user at the time of the accident you can use ROW_NUMBER() sorting by descending effective_date. With this ordering the first user listed for each accident is the responsible one.
For example:
select *
from (
    select u.id as user_id, u.version as user_version,
           a.id as acc_id, a.vehicle_id, a.acc_date,
           row_number() over(partition by a.id
                             order by u.effective_date desc) as rn
    from user u
    join accident a on a.vehicle_id = u.vehicle_id
    where u.effective_date <= a.acc_date
) x
where rn = 1
Note that the partition must be per accident (a.id), not per vehicle: a vehicle can have several accidents, and each needs its own latest owner.
Select user_id, user_version,
       acc_id, vehicle_id, acc_date
From (
    Select row_number() Over
             (Partition by b.id
              Order by a.effective_date desc) sn,
           a.id as user_id, a.version as user_version,
           b.id as acc_id, a.vehicle_id, acc_date
    From user a
    Inner Join accident b
      on a.vehicle_id = b.vehicle_id
     and a.effective_date <= b.acc_date
) a
where sn = 1
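Here's a runnable sketch of the per-accident ranking on the question's data (SQLite; the table is named users here since user is a reserved word in many engines, and dates are rewritten as ISO strings so they compare correctly):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER, version INTEGER, name TEXT,
                    vehicle_id INTEGER, effective_date TEXT);
INSERT INTO users VALUES
  (1,1,'A',1,'2021-01-10'), (1,2,'A',2,'2021-02-10'),
  (2,1,'B',1,'2021-03-10'), (2,2,'B',2,'2021-04-10');
CREATE TABLE accident (id INTEGER, description TEXT,
                       vehicle_id INTEGER, acc_date TEXT);
INSERT INTO accident VALUES
  (1,'hit1',1,'2021-03-05'), (2,'hit2',1,'2021-03-15');
""")

rows = conn.execute("""
SELECT user_id, user_version, acc_id, vehicle_id, acc_date FROM (
  SELECT u.id AS user_id, u.version AS user_version,
         a.id AS acc_id, a.vehicle_id, a.acc_date,
         -- per accident, rank candidate owners by most recent effective_date
         ROW_NUMBER() OVER (PARTITION BY a.id
                            ORDER BY u.effective_date DESC) AS rn
  FROM users u
  JOIN accident a ON a.vehicle_id = u.vehicle_id
                 AND u.effective_date <= a.acc_date
)
WHERE rn = 1
ORDER BY acc_id
""").fetchall()

for r in rows:
    print(r)
```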

Calculate distinct totals over time

I have the following data:
UniqueID SenderID EntryID Date
1 1 1 2015-09-17
2 1 1 2015-09-23
3 2 1 2015-09-17
4 2 1 2015-09-17
5 3 1 2015-09-17
6 4 1 2015-09-19
7 3 1 2015-09-20
What I require is the following:
3 2015-09-17
4 2015-09-19
4 2015-09-20
4 2015-09-23
Where the first column is the total of unique entries up to that date. So, for example, the entry on 23/9 for Sender 1 and Entry 1 does not increase the total column because there is a duplicate from 17/9.
How can I do this efficiently, ideally without joining the table against itself, since what you end up with is a very large query which is not practical. I have done something similar in Postgres with OVER(), but unfortunately that isn't available in this setup.
I could also do this in code, which I have, but then it has to be calculated outside of the db system and imported back in. With millions of rows this process takes days, and I ideally only have hours.
OVER is ANSI standard functionality available in most databases. What you are counting are starts for users, and you can readily do this with a cumulative sum:
select startdate,
sum(count(*)) over (order by startdate) as CumulativeUniqueCount
from (select senderid, min(date) as startdate
from table t
group by senderid
) t
group by startdate
order by startdate;
This should work in any database that supports window functions, such as Oracle, SQL Server 2012+, Postgres, Teradata, DB2, Hive, Redshift, to name a few.
EDIT:
You need a left join to get all the dates in the data:
select d.date,
sum(count(d.date)) over (order by d.date) as CumulativeUniqueCount
from (select distinct date from table t) d left join
(select senderid, min(date) as startdate
from table t
group by senderid
) t
on t.startdate = d.date
group by d.date
order by d.date;
Credit to Gordon Linoff for the basic query. However, it will not return rows for dates that don't increase the cumulative sum.
To get those extra rows, you need to include an additional subquery that lists all the distinct dates from the table. And then you left join with Gordon's query + a few minor tweaks to get the desired result:
select d.SomeDate,
sum(count(t.SenderId)) over (order by d.SomeDate)
from (select distinct SomeDate
from SomeTable) d
left join (select SenderId, min(somedate) as MinDate
from SomeTable
group by SenderId) t
on d.SomeDate = t.MinDate
group by d.SomeDate
order by d.SomeDate;
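The left-join version can be checked on the question's data. This sketch (SQLite 3.25+) wraps the grouped counts in a derived table and takes the running SUM over it, which is equivalent to the sum(count(...)) over (...) form above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entries (uniqueid INTEGER, senderid INTEGER,
                      entryid INTEGER, date TEXT);
INSERT INTO entries VALUES
  (1,1,1,'2015-09-17'), (2,1,1,'2015-09-23'),
  (3,2,1,'2015-09-17'), (4,2,1,'2015-09-17'),
  (5,3,1,'2015-09-17'), (6,4,1,'2015-09-19'),
  (7,3,1,'2015-09-20');
""")

rows = conn.execute("""
SELECT date, SUM(cnt) OVER (ORDER BY date) AS cumulative
FROM (
  -- one row per distinct date, counting senders whose first entry is that date
  SELECT d.date, COUNT(t.senderid) AS cnt
  FROM (SELECT DISTINCT date FROM entries) d
  LEFT JOIN (SELECT senderid, MIN(date) AS startdate
             FROM entries GROUP BY senderid) t
    ON t.startdate = d.date
  GROUP BY d.date
)
ORDER BY date
""").fetchall()

for r in rows:
    print(r)
```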

Finding the first occurrence of an element in a SQL database

I have a table with a column for customer names, a column for purchase amount, and a column for the date of the purchase. Is there an easy way I can find how much first time customers spent on each day?
So I have
Name | Purchase Amount | Date
Joe 10 9/1/2014
Tom 27 9/1/2014
Dave 36 9/1/2014
Tom 7 9/2/2014
Diane 10 9/3/2014
Larry 12 9/3/2014
Dave 14 9/5/2014
Jerry 16 9/6/2014
And I would like something like
Date | Total first Time Purchase
9/1/2014 73
9/3/2014 22
9/6/2014 16
Can anyone help me out with this?
The following is standard SQL and works on nearly all DBMSs:
select date,
sum(purchaseamount) as total_first_time_purchase
from (
select date,
purchaseamount,
row_number() over (partition by name order by date) as rn
from the_table
) t
where rn = 1
group by date;
The derived table (the inner select) picks out all "first time" purchases, and the outer query aggregates them by date.
The two key concepts here are aggregates and sub-queries; the details of which DBMS you're using may change the exact implementation, but the basic concept is the same:
1. For each name, determine their first date
2. Using the results of 1, find each person's first-day purchase amount
3. Using the results of 2, sum the amounts for each date
In SQL Server, it could look like this:
select Date, [totalFirstTimePurchases] = sum(PurchaseAmount)
from (
select t.Date, t.PurchaseAmount, t.Name
from table1 t
join (
select Name, [firstDate] = min(Date)
from table1
group by Name
) f on t.Name=f.Name and t.Date=f.firstDate
) ftp
group by Date
If you are using SQL Server you can accomplish this with either sub-queries or CTEs (Common Table Expressions). Since there is already an answer with sub-queries, here is the CTE version.
The following will identify each row that is a first-time purchase and then sum those values grouped by date:
;WITH cte
AS (
SELECT [Name]
,PurchaseAmount
,[date]
,ROW_NUMBER() OVER (
PARTITION BY [Name] ORDER BY [date] --start at 1 for each name at the earliest date and count up, reset every time the name changes
) AS rn
FROM yourTableName
)
SELECT [date]
,sum(PurchaseAmount) AS TotalFirstTimePurchases
FROM cte
WHERE rn = 1
GROUP BY [date]
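All of the answers above implement the same plan. Here is a runnable sketch of the row_number variant on the question's data (SQLite 3.25+, ISO dates, hypothetical table name purchases):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE purchases (name TEXT, amount INTEGER, date TEXT);
INSERT INTO purchases VALUES
  ('Joe',10,'2014-09-01'),  ('Tom',27,'2014-09-01'),
  ('Dave',36,'2014-09-01'), ('Tom',7,'2014-09-02'),
  ('Diane',10,'2014-09-03'),('Larry',12,'2014-09-03'),
  ('Dave',14,'2014-09-05'), ('Jerry',16,'2014-09-06');
""")

rows = conn.execute("""
SELECT date, SUM(amount) AS total_first_time_purchase
FROM (
  -- rn = 1 marks each customer's first purchase
  SELECT date, amount,
         ROW_NUMBER() OVER (PARTITION BY name ORDER BY date) AS rn
  FROM purchases
)
WHERE rn = 1
GROUP BY date
ORDER BY date
""").fetchall()

for r in rows:
    print(r)
```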

Selecting 5 Most Recent Records Of Each Group

The below statement retrieves the top 2 records within each group in SQL Server. It works correctly, however as you can see it doesn't scale at all. I mean that if I wanted to retrieve the top 5 or 10 records instead of just 2, you can see how this query statement would grow very quickly.
How can I convert this query into something that returns the same records, but that I can quickly change it to return the top 5 or 10 records within each group instead, rather than just 2? (i.e. I want to just tell it to return the top 5 within each group, rather than having 5 unions as the below format would require)
Thanks!
WITH tSub
as (SELECT CustomerID,
TransactionTypeID,
Max(EventDate) as EventDate,
Max(TransactionID) as TransactionID
FROM Transactions
WHERE ParentTransactionID is NULL
Group By CustomerID,
TransactionTypeID)
SELECT *
from tSub
UNION
SELECT t.CustomerID,
t.TransactionTypeID,
Max(t.EventDate) as EventDate,
Max(t.TransactionID) as TransactionID
FROM Transactions t
WHERE t.TransactionID NOT IN (SELECT tSub.TransactionID
FROM tSub)
and ParentTransactionID is NULL
Group By CustomerID,
TransactionTypeID
Use PARTITION BY to solve this type of problem:
select <columns>
from (select <columns>,
             ROW_NUMBER() over (PARTITION by <GroupColumn>
                                order by <OrderColumn> desc) as rownum
      from YourTable) ut
where ut.rownum <= 5
This partitions the result on your grouping column, orders each partition by the EventDate column descending, and then keeps the entries having rownum <= 5. Change the value 5 to get the top n most recent entries of each group.
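A runnable sketch of the pattern (SQLite 3.25+, hypothetical sample data), with the cutoff as a bind parameter so "top 2" becomes "top n" by changing one value:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (transactionid INTEGER, customerid INTEGER,
                           transactiontypeid INTEGER, eventdate TEXT,
                           parenttransactionid INTEGER);
INSERT INTO transactions VALUES
  (1,1,1,'2024-01-01',NULL), (2,1,1,'2024-01-02',NULL),
  (3,1,1,'2024-01-03',NULL), (4,1,1,'2024-01-04',NULL),
  (5,2,1,'2024-01-01',NULL), (6,2,1,'2024-01-02',NULL);
""")

n = 3  # top n most recent rows per (customer, transaction type)
rows = conn.execute("""
SELECT customerid, transactiontypeid, eventdate, transactionid
FROM (
  SELECT *,
         -- rank rows within each group, most recent first
         ROW_NUMBER() OVER (PARTITION BY customerid, transactiontypeid
                            ORDER BY eventdate DESC) AS rownum
  FROM transactions
  WHERE parenttransactionid IS NULL
)
WHERE rownum <= ?
ORDER BY customerid, eventdate DESC
""", (n,)).fetchall()

for r in rows:
    print(r)
```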