SQL GROUP BY and retrieve last child records

I'm writing a DB view that pulls data from several tables. The goal is to determine the latest status of each company, which is the record (within each company_id group) with the highest vetting_event_type_position.
Essentially, I'm trying to grab the latest record for each company. I'm not a SQL guru at all; I understand that I need GROUP BY to collapse the related records, but I can't get that to work.
Current results
company_id | name | ... | vetting_event_type_position
-----------------------------------------------------
1 | ABC | ... | 1
1 | ABC | ... | 2
1 | ABC | ... | 3
2 | CBS | ... | 1
2 | CBS | ... | 2
3 | HBO | ... | 1
DESIRED results
company_id | name | ... | vetting_event_type_position
-----------------------------------------------------
1 | ABC | ... | 3
2 | CBS | ... | 2
3 | HBO | ... | 1
SQL Code
SELECT
companies.id as company_id,
companies.name as name,
companies.uuid as uuid,
companies.company_type as company_type,
companies.description as overview,
practice_areas.id as practice_area_id,
practice_areas.name as practice_area_name,
companies.created_at as created_at,
companies.updated_at as updated_at,
companies.created_by as created_by,
companies.updated_by as updated_by,
vettings.id as vetting_id,
vettings.name as vetting_name,
vetting_event_types.name as vetting_event_status,
vetting_events.id as vetting_event_id,
vetting_event_types.position as vetting_event_type_position
FROM
vettings
LEFT OUTER JOIN vetting_events ON (vettings.id = vetting_events.vetting_id)
LEFT OUTER JOIN vetting_event_types ON (vetting_events.vetting_event_type_id = vetting_event_types.id)
RIGHT OUTER JOIN companies ON (companies.id = vettings.company_id)
LEFT OUTER JOIN practice_areas ON (companies.practice_area_id = practice_areas.id)
LEFT OUTER JOIN dispositions ON (companies.disposition_id = dispositions.id)
ORDER BY
name, vetting_name, vetting_event_type_position
;
Associations among tables
companies has_many vettings
vettings has_many vetting_events
vetting_events belongs_to vetting_event_types
or put another way...
companies -> vettings -> vetting_events <- vetting_event_types
I am trying to retrieve the company record with the highest vetting_event_types.position value for each group.

SELECT company_id
,name
,uuid
,company_type
,overview
,practice_area_id
,practice_area_name
,created_at
,updated_at
,created_by
,updated_by
,vetting_id
,vetting_name
,vetting_event_status
,vetting_event_id
,vetting_event_type_position
FROM (
SELECT
companies.id as company_id,
companies.name as name,
companies.uuid as uuid,
companies.company_type as company_type,
companies.description as overview,
practice_areas.id as practice_area_id,
practice_areas.name as practice_area_name,
companies.created_at as created_at,
companies.updated_at as updated_at,
companies.created_by as created_by,
companies.updated_by as updated_by,
vettings.id as vetting_id,
vettings.name as vetting_name,
vetting_event_types.name as vetting_event_status,
vetting_events.id as vetting_event_id,
vetting_event_types.position as vetting_event_type_position,
ROW_NUMBER() OVER (PARTITION BY companies.id ORDER BY vetting_event_types.position DESC) rn
FROM vettings
LEFT OUTER JOIN vetting_events ON (vettings.id = vetting_events.vetting_id)
LEFT OUTER JOIN vetting_event_types ON (vetting_events.vetting_event_type_id = vetting_event_types.id)
RIGHT OUTER JOIN companies ON (companies.id = vettings.company_id)
LEFT OUTER JOIN practice_areas ON (companies.practice_area_id = practice_areas.id)
LEFT OUTER JOIN dispositions ON (companies.disposition_id = dispositions.id)
) A
WHERE A.rn = 1
ORDER BY name, vetting_name, vetting_event_type_position

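You can use the ROW_NUMBER analytic function: number the rows within each company from the highest position down, then keep only the first row of each group.
Select * from (
Select ...,
Row_number() over (partition by company_id order by vetting_event_type_position desc) as seq
From ...) T
Where seq = 1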
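
As a minimal, runnable sketch of the ROW_NUMBER approach, the snippet below uses Python's built-in sqlite3 module. The single company_events table is a hypothetical stand-in for the joined view in the question, loaded with the sample rows from "Current results"; window functions assume SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical flattened table standing in for the multi-table join in the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE company_events (
        company_id INTEGER,
        name TEXT,
        vetting_event_type_position INTEGER
    );
    INSERT INTO company_events VALUES
        (1, 'ABC', 1), (1, 'ABC', 2), (1, 'ABC', 3),
        (2, 'CBS', 1), (2, 'CBS', 2),
        (3, 'HBO', 1);
""")

# Number the rows of each company from the highest position down,
# then keep only the row numbered 1.
latest_rows = conn.execute("""
    SELECT company_id, name, vetting_event_type_position
    FROM (
        SELECT ce.*,
               ROW_NUMBER() OVER (
                   PARTITION BY company_id
                   ORDER BY vetting_event_type_position DESC
               ) AS rn
        FROM company_events ce
    ) AS ranked
    WHERE rn = 1
    ORDER BY company_id
""").fetchall()

for row in latest_rows:
    print(row)
```

With the sample data this yields one row per company, matching the "DESIRED results" table in the question.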

Related

SUM CASE when DISTINCT?

Joining two tables and grouping, we're trying to get the sum of a user's value but only include a user's value once if that user is represented in a grouping multiple times.
Some sample tables:
user table:
| id | net_worth |
------------------
| 1 | 100 |
| 2 | 1000 |
visit table:
| id | location | user_id |
-----------------------------
| 1 | mcdonalds | 1 |
| 2 | mcdonalds | 1 |
| 3 | mcdonalds | 2 |
| 4 | subway | 1 |
We want to find the total net worth of users visiting each location. User 1 visited McDonalds twice, but we don't want to double count their net worth. Ideally we can use a SUM but only add in the net worth value if that user hasn't already been counted for at that location. Something like this:
-- NOTE: Hypothetical query
SELECT
location,
SUM(CASE WHEN DISTINCT user.id then user.net_worth ELSE 0 END) as total_net_worth
FROM visit
JOIN user on user.id = visit.user_id
GROUP BY 1;
The ideal output being:
| location | total_net_worth |
-------------------------------
| mcdonalds | 1100 |
| subway | 100 |
This particular database is Redshift/PostgreSQL, but it would be interesting if there is a generic SQL solution. Is something like the above possible?
You don't want to count duplicate entries in the visit table, so select the distinct (location, user_id) pairs first. Note that user is a reserved word in PostgreSQL, so the table name must be quoted.
SELECT
v.location,
SUM(u.net_worth) as total_net_worth
FROM (SELECT DISTINCT location, user_id FROM visit) v
JOIN "user" u on u.id = v.user_id
GROUP BY v.location
ORDER BY v.location;
You can use a window function to get the unique users, then join that to the user table:
select v.location, sum(u.net_worth)
from "user" u
join (
select location, user_id,
row_number() over (partition by location, user_id order by id) as rn
from visit
) v on v.user_id = u.id and v.rn = 1
group by v.location;
The above is standard ANSI SQL. In Postgres, this can also be expressed using distinct on ():
select v.location, sum(u.net_worth)
from "user" u
join (
select distinct on (user_id, location) *
from visit
order by user_id, location, id
) v on v.user_id = u.id
group by v.location;
You can join the user table to the distinct location and user_id combinations, as in the generic SQL below.
SELECT v.location, SUM(u.net_worth)
FROM (SELECT location, user_id FROM visit GROUP BY location, user_id) v
JOIN "user" u on u.id = v.user_id
GROUP BY v.location;
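
To see why the de-duplication step matters, here is a small sketch using Python's built-in sqlite3 module and the sample rows from the question. It compares a naive join with the "distinct pairs first" version; the variable names are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE "user" (id INTEGER PRIMARY KEY, net_worth INTEGER);
    CREATE TABLE visit (id INTEGER PRIMARY KEY, location TEXT, user_id INTEGER);
    INSERT INTO "user" VALUES (1, 100), (2, 1000);
    INSERT INTO visit VALUES
        (1, 'mcdonalds', 1), (2, 'mcdonalds', 1),
        (3, 'mcdonalds', 2), (4, 'subway', 1);
""")

# Naive join: user 1 is counted once per visit, so mcdonalds is inflated.
naive = conn.execute("""
    SELECT v.location, SUM(u.net_worth)
    FROM visit v
    JOIN "user" u ON u.id = v.user_id
    GROUP BY v.location
    ORDER BY v.location
""").fetchall()

# De-duplicate (location, user_id) pairs before joining and summing.
deduped = conn.execute("""
    SELECT v.location, SUM(u.net_worth)
    FROM (SELECT DISTINCT location, user_id FROM visit) v
    JOIN "user" u ON u.id = v.user_id
    GROUP BY v.location
    ORDER BY v.location
""").fetchall()

print("naive:  ", naive)
print("deduped:", deduped)
```

The naive query reports 1200 for mcdonalds because user 1 appears twice; the de-duplicated query reports the 1100 the question asks for.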

select single row from foreign table in left join

I want to fetch only the first row from the foreign table for each matching foreign key, but I don't know how to select just that first row.
events table
id | name
----------------
1 | john
----------------
2 | Cat
event_attendee table
id | event_id | type
--------------------------
1 | 1 | User
--------------------------
2 | 1 | Local
--------------------------
3 | 1 | User
--------------------------
4 | 2 | User
--------------------------
5 | 2 | User
I want this result
id | name | event_id | type
------------------------------------
1 | John | 1 | User
------------------------------------
2 | Cat | 2 | User
Tried
select
a.*,
b.*
from
events as a
left join (
select
distinct
event_attendee.events_id,
event_attendee.type
from
event_attendee
left join events on
event_attendees.events_id = events.id
where
events.id = event_attendees.events_id
limit 1
) as b on
a.id = b.events_id
Problem
It only works for the first row; for the second row, the type comes back empty.
id | name | type
------------------------------------
1 | John | User
------------------------------------
2 | Cat |
You can do this using a lateral join. In Postgres, the syntax is:
select e.*, ea.*
from events e left join lateral
(select ea.event_id, ea.type
from event_attendee ea
where ea.event_id = e.id
order by ea.id
limit 1
) ea
on 1=1;
However, distinct on is a way to do this with no subqueries:
select distinct on (e.id) e.*, ea.*
from events e join
event_attendee ea
on ea.event_id = e.id
order by e.id, ea.id;
I would expect the lateral join to work better on larger tables, particularly with the correct indexes.
This is easy with a cross apply (SQL Server syntax):
select *
from events e
cross apply (
select top (1) event_Id, Type
from event_attendee ea
where ea.event_id=e.id
order by id
)x
Edit: an alternative, more portable method:
select e.*,ea.event_Id, (select type from event_attendee ea2 where ea2.id=ea.id ) Type
from (
select Min(id) Id, event_id
from event_attendee
group by event_id
)ea
join events e on e.id=ea.event_id
Another way is to rank the attendees within each event and keep only the first-ranked row:
select
t_.id, t_.name, t_.type
from
(
select a.*, b.type,
rank() OVER (PARTITION BY a.id ORDER BY b.id asc) rank_
from events a
left join event_attendee b
on
a.id = b.event_id
) t_
where
t_.rank_ = 1
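
For engines without LATERAL, CROSS APPLY, or DISTINCT ON, the same "first child row per parent" result can be sketched with a correlated MIN(id) subquery. The example below uses Python's built-in sqlite3 module with the sample data from the question; the third event ('Empty') is a hypothetical addition to show that a LEFT JOIN keeps events with no attendees.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE event_attendee (
        id INTEGER PRIMARY KEY, event_id INTEGER, type TEXT
    );
    -- Event 3 is hypothetical: it has no attendees at all.
    INSERT INTO events VALUES (1, 'John'), (2, 'Cat'), (3, 'Empty');
    INSERT INTO event_attendee VALUES
        (1, 1, 'User'), (2, 1, 'Local'), (3, 1, 'User'),
        (4, 2, 'User'), (5, 2, 'User');
""")

# Join each event only to its attendee row with the smallest id.
first_attendees = conn.execute("""
    SELECT e.id, e.name, ea.type
    FROM events e
    LEFT JOIN event_attendee ea
           ON ea.id = (SELECT MIN(ea2.id)
                       FROM event_attendee ea2
                       WHERE ea2.event_id = e.id)
    ORDER BY e.id
""").fetchall()

for row in first_attendees:
    print(row)
```

Unlike the LIMIT 1 in the question's derived table, which is applied once to the whole subquery, the correlated subquery here is evaluated per event, so every event gets its own first attendee.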

Left join command is not showing all results

I have a table RESTAURANT:
Id | Name
------------------
0 | 'McDonalds'
1 | 'Burger King'
2 | 'Starbucks'
3 | 'Pans'
And a table ORDER:
Id | ResId | Client
--------------------
0 | 1 | 'Peter'
1 | 2 | 'John'
2 | 2 | 'Peter'
Where 'ResId' is a foreign key from RESTAURANT.Id.
I want to select the number of orders per restaurant:
Expected result:
Restaurant | Number of orders
----------------------------------
'McDonalds' | 0
'Burger King' | 1
'Starbucks' | 2
'Pans' | 0
Actual result:
Restaurant | Number of orders
----------------------------------
'McDonalds' | 0
'Burger King' | 1
'Starbucks' | 2
Command used:
select r.Name, count(o.ResId)
from RESTAURANT r
left join ORDER o on r.Id like o.ResId
group by o.ResId;
Just fix the group by clause:
select r.name, count(o.id) as cnt_orders
from restaurants r
left join orders o on r.id = o.resid
group by r.id, r.name;
That way, the SELECT and GROUP BY clauses are consistent; I also added the restaurant id to the group, so potential restaurants having the same name are not aggregated together. The count is taken on a column of orders rather than on *, so a restaurant without orders reports 0 instead of 1. I also changed like to =: this is more efficient, and does not alter the logic.
You could also phrase this with a subquery, so there is no need for outer aggregation. I would prefer:
select r.*,
(select count(*) from orders o where o.resid = r.id) as cnt_orders
from restaurants r
Your query should be generating an error because the select columns and the group by columns are incompatible. Just aggregate by the unaggregated columns in the select:
select r.Name, count(o.ResId)
from RESTAURANT r left join
ORDER o
on r.Id = o.ResId
group by r.Name;
Notes:
You might want to include r.id in the GROUP BY (and SELECT) in case restaurants can have the same name.
Note the use of = instead of LIKE. The ids look like numbers, so you should use number operations. LIKE is a string operation.
ORDER is a bad name for a table because it is a SQL keyword; it has to be quoted or escaped wherever it is used.
As a general rule, in a LEFT JOIN, you don't want the aggregation keys to be from the second table, because those values could be NULL.
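
The points above about grouping keys and NULLs can be checked with a short sketch using Python's built-in sqlite3 module. The tables are named restaurants and orders (as in the first answer) to avoid the ORDER keyword, and the query template is illustrative only. It compares COUNT(*) with COUNT(o.id) on the same LEFT JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE restaurants (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, resid INTEGER, client TEXT);
    INSERT INTO restaurants VALUES
        (0, 'McDonalds'), (1, 'Burger King'), (2, 'Starbucks'), (3, 'Pans');
    INSERT INTO orders VALUES (0, 1, 'Peter'), (1, 2, 'John'), (2, 2, 'Peter');
""")

# Group by columns of the LEFT (preserved) table, never by o.resid.
query = """
    SELECT r.name, {count_expr}
    FROM restaurants r
    LEFT JOIN orders o ON r.id = o.resid
    GROUP BY r.id, r.name
    ORDER BY r.id
"""

# COUNT(*) counts the NULL-extended row that the LEFT JOIN produces
# for a restaurant with no orders, so it reports 1 instead of 0.
count_star = conn.execute(query.format(count_expr="COUNT(*)")).fetchall()

# COUNT(o.id) ignores NULLs, so restaurants without orders report 0.
count_col = conn.execute(query.format(count_expr="COUNT(o.id)")).fetchall()

print("COUNT(*):   ", count_star)
print("COUNT(o.id):", count_col)
```

Only the COUNT(o.id) variant reproduces the expected result from the question, including the zero rows for McDonalds and Pans.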

Select from multiple table, eliminating duplicates values

I have these tables and values:
Person
---------------
ID | CREATED_BY
---------------
1  |
2  |
3  |
4  |
Account
-----------------------
ID | TYPE | DATA
-----------------------
1  | T1   | USEFUL DATA
2  | T2   |
3  | T3   |
4  | T2   |
Person_account_link
--------------------------
ID | PERSON_ID | ACCOUNT_ID
--------------------------
1 | 1 | 1
2 | 1 | 2
3 | 2 | 3
4 | 3 | 4
I want to select all persons that have a T1 account and get its data column; the other persons should still appear in the result, but without any account information.
(Note that person 1 has two accounts, account 1 and account 2, but only one row must be displayed: the T1 account takes priority if it exists, otherwise NULL.)
The result should be :
Table1
-----------------------------------------------------
PERSON_ID | ACCOUNT_ID | ACCOUNT_TYPE | ACCOUNT_DATA
-----------------------------------------------------
1 | 1 | T1 | USEFUL DATA
2 | NULL | NULL | NULL
3 | NULL | NULL | NULL
4 | NULL | NULL | NULL
You can do conditional aggregation :
SELECT p.id,
MAX(CASE WHEN a.type = 'T1' THEN a.id END) AS ACCOUNT_ID,
MAX(CASE WHEN a.type = 'T1' THEN 'T1' END) AS ACCOUNT_TYPE,
MAX(CASE WHEN a.type = 'T1' THEN a.data END) AS ACCOUNT_DATA
FROM person p LEFT JOIN
Person_account_link pl
ON p.id = pl.person_id LEFT JOIN
account a
ON pl.account_id = a.id
GROUP BY p.id;
You would need an outer join, starting with Person and then to the other two tables. I would also aggregate with group by and min to tackle the situation where a person would have two or more T1 accounts. In that case one of the data is taken (the min of them):
select p.id person_id,
min(a.id) account_id,
min(a.type) account_type,
min(a.data) account_data
from Person p
left join Person_account_link pa on p.id = pa.person_id
left join Account a on pa.account_id = a.id and a.type = 'T1'
group by p.id
In Postgres, I like to use the FILTER keyword. In addition, the Person table is not needed if you only want persons with an account. If you want all persons:
SELECT p.id,
MAX(a.id) FILTER (WHERE a.type = 'T1') as account_id,
MAX(a.type) FILTER (WHERE a.type = 'T1') as account_type,
MAX(a.data) FILTER (WHERE a.type = 'T1') as account_data
FROM Person p LEFT JOIN
Person_account_link pl
ON pl.person_id = p.id LEFT JOIN
account a
ON pl.account_id = a.id
GROUP BY p.id;
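
A minimal sketch of the conditional-aggregation approach, run through Python's built-in sqlite3 module with the sample data from the question. It uses the portable MAX(CASE WHEN ...) form rather than FILTER; the lower-case table names and the omission of the empty CREATED_BY column are simplifications made here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY);
    CREATE TABLE account (id INTEGER PRIMARY KEY, type TEXT, data TEXT);
    CREATE TABLE person_account_link (
        id INTEGER PRIMARY KEY, person_id INTEGER, account_id INTEGER
    );
    INSERT INTO person VALUES (1), (2), (3), (4);
    INSERT INTO account VALUES
        (1, 'T1', 'USEFUL DATA'), (2, 'T2', NULL), (3, 'T3', NULL), (4, 'T2', NULL);
    INSERT INTO person_account_link VALUES
        (1, 1, 1), (2, 1, 2), (3, 2, 3), (4, 3, 4);
""")

# One row per person; the CASE expressions yield NULL for non-T1 accounts,
# and MAX ignores NULLs, so only T1 account values survive the aggregation.
person_rows = conn.execute("""
    SELECT p.id,
           MAX(CASE WHEN a.type = 'T1' THEN a.id   END) AS account_id,
           MAX(CASE WHEN a.type = 'T1' THEN a.type END) AS account_type,
           MAX(CASE WHEN a.type = 'T1' THEN a.data END) AS account_data
    FROM person p
    LEFT JOIN person_account_link pl ON pl.person_id = p.id
    LEFT JOIN account a ON a.id = pl.account_id
    GROUP BY p.id
    ORDER BY p.id
""").fetchall()

for row in person_rows:
    print(row)
```

Person 1 keeps the T1 account even though a T2 account is also linked, and persons 2 to 4 come back with NULL account columns, as in the expected result.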

Joining tables based on the maximum value

Here's a simplified example of what I'm talking about:
Table: students
_____________
| id | name |
|----+------|
| 1  | Jim  |
| 2  | Joe  |
| 3  | Jay  |
|____|______|
Table: exam_results
____________________________________
| id | student_id | score | date   |
|----+------------+-------+--------|
| 1  | 1          | 73    | 8/1/09 |
| 2  | 1          | 67    | 9/2/09 |
| 3  | 1          | 93    | 1/3/09 |
| 4  | 2          | 27    | 4/9/09 |
| 5  | 2          | 17    | 8/9/09 |
| 6  | 3          | 100   | 1/6/09 |
|____|____________|_______|________|
Assume, for the sake of this question, that every student has at least one exam result recorded.
How would you select each student along with their highest score? Edit: ...AND the other fields in that record?
Expected output:
_________________________
| name | score | date |
|------+-------|--------|
| Jim | 93 | 1/3/09 |
| Joe | 27 | 4/9/09 |
| Jay | 100 | 1/6/09 |
|______|_______|________|
Answers using all types of DBMS are welcome.
Answering the EDITED question (i.e. to get associated columns as well).
In SQL Server 2005+, the best approach would be to use a ranking/window function in conjunction with a CTE, like this:
with exam_data as
(
select r.student_id, r.score, r.date,
row_number() over(partition by r.student_id order by r.score desc) as rn
from exam_results r
)
select s.name, d.score, d.date, d.student_id
from students s
join exam_data d
on s.id = d.student_id
where d.rn = 1;
For an ANSI-SQL compliant solution, a subquery and self-join will work, like this:
select s.name, r.student_id, r.score, r.date
from (
select r.student_id, max(r.score) as max_score
from exam_results r
group by r.student_id
) d
join exam_results r
on r.student_id = d.student_id
and r.score = d.max_score
join students s
on s.id = r.student_id;
This last query assumes there are no duplicate student_id/max_score combinations. If there are, and you want to de-duplicate them, you need another subquery that joins on something deterministic to decide which record to pull. For example, assuming a student cannot have multiple records with the same date, you could break a tie by taking the most recent max_score:
select s.name, r3.student_id, r3.score, r3.date, r3.other_column_a, ...
from (
select r2.student_id, r2.score as max_score, max(r2.date) as max_score_max_date
from (
select r1.student_id, max(r1.score) as max_score
from exam_results r1
group by r1.student_id
) d
join exam_results r2
on r2.student_id = d.student_id
and r2.score = d.max_score
group by r2.student_id, r2.score
) r
join exam_results r3
on r3.student_id = r.student_id
and r3.score = r.max_score
and r3.date = r.max_score_max_date
join students s
on s.id = r3.student_id;
EDIT: Added proper de-duplicating query thanks to Mark's good catch in comments
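
The tie problem described above can be reproduced with a short sketch using Python's built-in sqlite3 module. The data follows the test bed further down in this thread, where Jim has two results with a score of 93. Dates are stored as ISO strings and the column is named exam_date; both are choices made here for SQLite, not part of the original schema. Window functions assume SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE exam_results (
        id INTEGER PRIMARY KEY, student_id INTEGER, score INTEGER, exam_date TEXT
    );
    INSERT INTO students VALUES (1, 'Jim'), (2, 'Joe'), (3, 'Jay');
    INSERT INTO exam_results VALUES
        (1, 1, 73,  '2009-08-01'),
        (2, 1, 93,  '2009-09-02'),
        (3, 1, 93,  '2009-01-03'),
        (4, 2, 27,  '2009-04-09'),
        (5, 2, 17,  '2009-08-09'),
        (6, 3, 100, '2009-01-06');
""")

# Joining back on MAX(score) returns every row that ties for the maximum.
joined_on_max = conn.execute("""
    SELECT s.name, r.score, r.exam_date
    FROM (SELECT student_id, MAX(score) AS max_score
          FROM exam_results
          GROUP BY student_id) d
    JOIN exam_results r
      ON r.student_id = d.student_id AND r.score = d.max_score
    JOIN students s ON s.id = r.student_id
    ORDER BY s.id, r.exam_date
""").fetchall()

# ROW_NUMBER with a tie-breaker (most recent date) keeps exactly one row.
ranked = conn.execute("""
    SELECT s.name, d.score, d.exam_date
    FROM students s
    JOIN (SELECT student_id, score, exam_date,
                 ROW_NUMBER() OVER (
                     PARTITION BY student_id
                     ORDER BY score DESC, exam_date DESC
                 ) AS rn
          FROM exam_results) d
      ON d.student_id = s.id
    WHERE d.rn = 1
    ORDER BY s.id
""").fetchall()

print("join on max:", joined_on_max)
print("row_number: ", ranked)
```

The join on MAX(score) returns Jim twice, while the ROW_NUMBER version returns a single row per student because the ORDER BY inside the window makes the choice deterministic.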
SELECT s.name,
COALESCE(MAX(er.score), 0) AS high_score
FROM STUDENTS s
LEFT JOIN EXAM_RESULTS er ON er.student_id = s.id
GROUP BY s.name
Try this (it returns only the highest score, not the other fields of that record):
Select students.name, max(exam_results.score) As Score from students
INNER JOIN
exam_results
ON students.id = exam_results.student_id
GROUP BY
students.name
With Oracle's analytic functions this is easy:
SELECT DISTINCT
students.name
,FIRST_VALUE(exam_results.score)
OVER (PARTITION BY students.id
ORDER BY exam_results.score DESC) AS score
,FIRST_VALUE(exam_results.date)
OVER (PARTITION BY students.id
ORDER BY exam_results.score DESC) AS date
FROM students, exam_results
WHERE students.id = exam_results.student_id;
Select S.Name, T.Score, er.date
from Students S inner join
(Select Student_ID, Max(Score) as Score from Exam_Results
Group by Student_ID) T
On S.id = T.Student_ID inner join Exam_Results er
On er.Student_ID = T.Student_ID And er.Score = T.Score
Using MS SQL Server:
SELECT name, score, date FROM exam_results
JOIN students ON student_id = students.id
JOIN (SELECT DISTINCT student_id FROM exam_results) T1
ON exam_results.student_id = T1.student_id
WHERE exam_results.id = (
SELECT TOP(1) id FROM exam_results T2
WHERE exam_results.student_id = T2.student_id
ORDER BY score DESC, date ASC)
If there is a tied score, the oldest date is returned (change date ASC to date DESC to return the most recent instead).
Output:
Jim 93 2009-01-03 00:00:00.000
Joe 27 2009-04-09 00:00:00.000
Jay 100 2009-01-06 00:00:00.000
Test bed:
CREATE TABLE students(id int , name nvarchar(20) );
CREATE TABLE exam_results(id int , student_id int , score int, date datetime);
INSERT INTO students
VALUES
(1,'Jim'),(2,'Joe'),(3,'Jay')
INSERT INTO exam_results VALUES
(1, 1, 73, '8/1/09'),
(2, 1, 93, '9/2/09'),
(3, 1, 93, '1/3/09'),
(4, 2, 27, '4/9/09'),
(5, 2, 17, '8/9/09'),
(6, 3, 100, '1/6/09')
SELECT name, score, date FROM exam_results
JOIN students ON student_id = students.id
JOIN (SELECT DISTINCT student_id FROM exam_results) T1
ON exam_results.student_id = T1.student_id
WHERE exam_results.id = (
SELECT TOP(1) id FROM exam_results T2
WHERE exam_results.student_id = T2.student_id
ORDER BY score DESC, date ASC)
On MySQL, I think you can remove the TOP(1) and add LIMIT 1 at the end of the subquery instead. I have not tested this, though.