Getting the most recent record of a group - sql

I'm trying to find the most recent record of a group after doing an inner join.
Say I have the following two tables:
dateCreated | id
2011-12-27 | 1
2011-12-15 | 2
2011-12-17 | 6
2011-12-26 | 15
2011-12-15 | 18
2011-12-07 | 22
2011-12-09 | 23
2011-12-27 | 24
code | id
EFG | 1
ABC | 2
BCD | 6
BCD | 15
ABC | 18
BCD | 22
EFG | 23
EFG | 24
I want to display only the most recent of the groupings:
So the result would be:
dateCreated | code
2011-12-27 | EFG
2011-12-15 | ABC
2011-12-26 | BCD
I know this can be achieved using the max and group by functions, but I can't seem to get the desired result.

I think this should get you there:
select max(a.dateCreated) as dateCreated
, b.code
from table1 a
join table2 b on a.id = b.id
group by b.code

Assuming your tables are called a and b, try this:
select max(a.dateCreated) as dateCreated, b.code
from a join b on a.id = b.id
group by b.code

You can use analytic functions for this. This way, you still get exactly one row for every code, even if two rows share the same latest dateCreated (this may or may not be what you actually want as a result):
SELECT Code, dateCreated
FROM ( SELECT T2.Code, T1.dateCreated, ROW_NUMBER() OVER(PARTITION BY T2.Code ORDER BY T1.dateCreated DESC) Corr
FROM Table1 T1
INNER JOIN Table2 T2
ON T1.id = T2.id) A
WHERE Corr = 1
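For anyone who wants to check the max / group by answers against the sample data, here is a minimal sketch using Python's built-in sqlite3 module. The table names table1 and table2 are taken from the first answer (the question does not name them), and the ORDER BY is only there to make the output order predictable.

```python
import sqlite3

# In-memory database loaded with the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (dateCreated TEXT, id INTEGER);
CREATE TABLE table2 (code TEXT, id INTEGER);
INSERT INTO table1 VALUES
  ('2011-12-27', 1), ('2011-12-15', 2), ('2011-12-17', 6), ('2011-12-26', 15),
  ('2011-12-15', 18), ('2011-12-07', 22), ('2011-12-09', 23), ('2011-12-27', 24);
INSERT INTO table2 VALUES
  ('EFG', 1), ('ABC', 2), ('BCD', 6), ('BCD', 15),
  ('ABC', 18), ('BCD', 22), ('EFG', 23), ('EFG', 24);
""")

# Join first, then take the latest date within each code.
latest_per_code = conn.execute("""
SELECT MAX(a.dateCreated) AS dateCreated, b.code
FROM table1 a
JOIN table2 b ON a.id = b.id
GROUP BY b.code
ORDER BY b.code
""").fetchall()

for row in latest_per_code:
    print(row)
```

ISO-formatted date strings sort correctly as text, which is why MAX works on a TEXT column here; with a real DATE column the query is the same.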

Related

Remove duplicates using multiple criteria in SQL

I've tried all afternoon to dedup a table that looks like this:
ID1 | ID2 | Date | Time |Status | Price
----+-----+------------+-----------------+--------+-------
01 | A | 01/01/2022 | 10:41:47.000000 | DDD | 55
01 | B | 02/01/2022 | 16:22:31.000000 | DDD | 53
02 | C | 01/01/2022 | 08:54:03.000000 | AAA | 72
02 | D | 03/01/2022 | 11:12:35.000000 | DDD |
03 | E | 01/01/2022 | 17:15:41.000000 | DDD | 67
03 | F | 01/01/2022 | 19:27:22.000000 | DDD | 69
03 | G | 02/01/2022 | 06:45:52.000000 | DDD | 78
Basically, I need to dedup based on two conditions:
Status: where AAA > BBB > CCC > DDD. So, pick the highest one.
When the Status is the same given the same ID1, pick the latest one based on Date and Time.
The final table should look like:
ID1 | ID2 | Date | Time |Status | Price
----+-----+------------+-----------------+--------+-------
01 | B | 02/01/2022 | 16:22:31.000000 | DDD | 53
02 | C | 01/01/2022 | 08:54:03.000000 | AAA | 72
03 | G | 02/01/2022 | 06:45:52.000000 | DDD | 78
Is there a way to do this in Redshift SQL / PostgreSQL?
I tried variations of this, but every time it fails because it demands that I add all columns to the group by, which defeats the purpose:
select a.id1,
b.id2,
b.date,
b.time,
b.status,
b.price,
case when (status = 'AAA') then 4
when (status = 'BBB') then 3
when (status= 'CCC') then 2
when (status = 'DDD') then 1
when (status = 'EEE') then 0
else null end as row_order
from table1 a
left join table2 b
on a.id1=b.id1
group by id1
having row_order = max(row_order)
and date=max(date)
and time=max(time)
Any help at all is appreciated!
Windowing functions are good at this:
SELECT ID1, ID2, Date, Time, Status, Price
FROM (
SELECT *,
row_number() OVER (PARTITION BY ID1 ORDER BY Status, Date DESC, Time DESC) rn
FROM MyTable
) t
WHERE rn = 1
See it work here:
https://dbfiddle.uk/uAvDz1Qn
You can use ROW_NUMBER() like so:
with cte as (
select a.id1,
b.id2,
b.date,
b.time,
b.status,
b.price,
ROW_NUMBER() OVER (PARTITION BY a.id1 ORDER BY b.status ASC, b.date DESC, b.time DESC) RN
from table1 a
left join table2 b on a.id1=b.id1
)
select * from cte where rn = 1
This is a typical top-1-per-group problem. The canonical solution indeed involves window functions, as demonstrated by Joel Coehoorn and Aaron Dietz.
But Postgres has a specific extension, called distinct on, which is built exactly for the purpose of solving top-1-per-group problems. The syntax is neater, and you benefit from built-in optimizations:
select distinct on (id1) t.*
from mytable t
order by id1, status, "Date" desc, "Time" desc
Here is a demo on DB Fiddle based on that of Joel Coehoorn.
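The answers above order by Status directly, which works because AAA, BBB, CCC, DDD happen to sort alphabetically in priority order. If the status codes did not sort that way, an explicit rank could be used instead. Below is a minimal sketch of that idea, run through Python's sqlite3 so it can be checked locally (it needs an SQLite build with window functions, 3.25 or later). The column names d and t and the ISO date format are my assumptions, chosen so the dates sort correctly as text; the sample dates are read as day/month/year.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytable (id1 TEXT, id2 TEXT, d TEXT, t TEXT, status TEXT, price INTEGER);
INSERT INTO mytable VALUES
  ('01', 'A', '2022-01-01', '10:41:47', 'DDD', 55),
  ('01', 'B', '2022-01-02', '16:22:31', 'DDD', 53),
  ('02', 'C', '2022-01-01', '08:54:03', 'AAA', 72),
  ('02', 'D', '2022-01-03', '11:12:35', 'DDD', NULL),
  ('03', 'E', '2022-01-01', '17:15:41', 'DDD', 67),
  ('03', 'F', '2022-01-01', '19:27:22', 'DDD', 69),
  ('03', 'G', '2022-01-02', '06:45:52', 'DDD', 78);
""")

# Rank rows per id1: best status first (explicit CASE rank), then latest date and time.
deduped = conn.execute("""
SELECT id1, id2, d, t, status, price
FROM (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY id1
           ORDER BY CASE status
                      WHEN 'AAA' THEN 1
                      WHEN 'BBB' THEN 2
                      WHEN 'CCC' THEN 3
                      WHEN 'DDD' THEN 4
                      ELSE 5
                    END,
                    d DESC, t DESC
         ) AS rn
  FROM mytable
) ranked
WHERE rn = 1
ORDER BY id1
""").fetchall()

for row in deduped:
    print(row)
```

The same ORDER BY expression should also work inside the distinct on variant in Postgres, since only the sort key changes.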

LEFT JOIN ON most recent date in Google BigQuery

I've got two tables, both with timestamps and some more data:
Table A
| name | timestamp | a_data |
| ---- | ------------------- | ------ |
| 1 | 2018-01-01 11:10:00 | a |
| 2 | 2018-01-01 12:20:00 | b |
| 3 | 2018-01-01 13:30:00 | c |
Table B
| name | timestamp | b_data |
| ---- | ------------------- | ------ |
| 1 | 2018-01-01 11:00:00 | w |
| 2 | 2018-01-01 12:00:00 | x |
| 3 | 2018-01-01 13:00:00 | y |
| 3 | 2018-01-01 13:10:00 | y |
| 3 | 2018-01-01 13:10:00 | z |
What I want to do is
For each row in Table A LEFT JOIN the most recent record in Table B that predates it.
When there is more than one possibility take the last one
Target Result
| name | timestamp | a_data | b_data |
| ---- | ------------------- | ------ | ------ |
| 1 | 2018-01-01 11:10:00 | a | w |
| 2 | 2018-01-01 12:20:00 | b | x |
| 3 | 2018-01-01 13:30:00 | c | z | <-- note z, not y
I think this involves a subquery, but I cannot get this to work in Big Query. What I have so far:
SELECT a.a_data, b.b_data
FROM `table_a` AS a
LEFT JOIN `table_b` AS b
ON a.name = b.name
WHERE a.timestamp = (
SELECT max(timestamp) from `table_b` as sub
WHERE sub.name = b.name
AND sub.timestamp < a.timestamp
)
On my actual dataset, which is a very small test set (under 2 MB), the query runs but never completes. Any pointers much appreciated 👍🏻
You can try to use a select subquery.
SELECT a.*,(
SELECT MAX(b.b_data)
FROM `table_b` AS b
WHERE
a.name = b.name
and
b.timestamp < a.timestamp
) b_data
FROM `table_a` AS a
EDIT
Or you can try to use ROW_NUMBER window function in a subquery.
SELECT name,timestamp,a_data , b_data
FROM (
SELECT a.*,b.b_data,ROW_NUMBER() OVER(PARTITION BY a.name ORDER BY b.timestamp desc,b.name desc) rn
FROM `table_a` AS a
LEFT JOIN `table_b` AS b ON a.name = b.name AND b.timestamp < a.timestamp
) t1
WHERE rn = 1
Below is for BigQuery Standard SQL and does not require specifying all columns on both sides, only name and timestamp. So it will work for any number of columns in both tables (assuming the two tables share no column names other than the two mentioned above):
#standardSQL
SELECT a.*, b.* EXCEPT (name, timestamp)
FROM (
SELECT
ANY_VALUE(a) a,
ARRAY_AGG(b ORDER BY b.timestamp DESC LIMIT 1)[SAFE_OFFSET(0)] b
FROM `project.dataset.table_a` a
LEFT JOIN `project.dataset.table_b` b
USING (name)
WHERE a.timestamp > b.timestamp
GROUP BY TO_JSON_STRING(a)
)
In BigQuery, arrays are often an efficient way to solve such problems:
SELECT a.a_data, b.b_data
FROM `table_a` a LEFT JOIN
(SELECT b.name,
ARRAY_AGG(b.b_data ORDER BY b.timestamp DESC LIMIT 1)[OFFSET(0)] as b_data
FROM `table_b` b
GROUP BY b.name
) b
ON a.name = b.name;
This is a common case where you can't just group by and take the minimum. I suggest the following:
SELECT *
FROM table_a as a inner join (SELECT name, min(timestamp) as timestamp
FROM table_b group by 1) as b
on (a.timestamp = b.timestamp and a.name = b.name)
This way you limit it only to the minimum present in Table b, as you specified.
You can also achieve that in a more readable way using the WITH statement:
WITH min_b as (
SELECT name,
min(timestamp) as timestamp
FROM table_b group by 1
)
SELECT *
FROM table_a as a inner join min_b
on (a.timestamp = min_b.timestamp and a.name = min_b.name)
Let me know if it worked!
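To compare the answers against the target result, here is a minimal sketch of the ROW_NUMBER approach run through Python's sqlite3 rather than BigQuery (so the syntax is plain SQL, and it needs SQLite 3.25 or later for window functions). Several things here are my assumptions: the column is called ts instead of timestamp, the partition is on both a.name and a.ts so that table A may hold several rows per name, ties on the B timestamp are broken by taking the larger b_data (one way to read "take the last one"), and the row with name 4 is a hypothetical extra row added to show that unmatched A rows survive the LEFT JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_a (name INTEGER, ts TEXT, a_data TEXT);
CREATE TABLE table_b (name INTEGER, ts TEXT, b_data TEXT);
INSERT INTO table_a VALUES
  (1, '2018-01-01 11:10:00', 'a'),
  (2, '2018-01-01 12:20:00', 'b'),
  (3, '2018-01-01 13:30:00', 'c'),
  (4, '2018-01-01 14:00:00', 'd');  -- hypothetical row with no match in table_b
INSERT INTO table_b VALUES
  (1, '2018-01-01 11:00:00', 'w'),
  (2, '2018-01-01 12:00:00', 'x'),
  (3, '2018-01-01 13:00:00', 'y'),
  (3, '2018-01-01 13:10:00', 'y'),
  (3, '2018-01-01 13:10:00', 'z');
""")

# For each A row, number the earlier B rows from newest to oldest and keep the first.
joined = conn.execute("""
SELECT name, ts, a_data, b_data
FROM (
  SELECT a.name, a.ts, a.a_data, b.b_data,
         ROW_NUMBER() OVER (
           PARTITION BY a.name, a.ts
           ORDER BY b.ts DESC, b.b_data DESC
         ) AS rn
  FROM table_a a
  LEFT JOIN table_b b
    ON b.name = a.name AND b.ts < a.ts
) numbered
WHERE rn = 1
ORDER BY name
""").fetchall()

for row in joined:
    print(row)
```

Keeping the timestamp condition in the ON clause rather than in WHERE is what preserves A rows that have no earlier B row.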

SQL JOIN same table

I have one table with "ID", "Sequence", "Status":
ID | Seq | Status
======================
10 | 001 | 010
10 | 002 | test
10 | 003 | 005
11 | 001 | 010
11 | 002 | 338
The result from my query should give me the complete table plus an extra column with the status for the highest sequence for the respective ID:
ID | Seq | Status | LStatus
======================
10 | 001 | 010 | 005
10 | 002 | test | 005
10 | 003 | 005 | 005
11 | 001 | 010 | 338
11 | 002 | 338 | 338
I have no clue how to do it. I started with something like this:
SELECT a.*, b.status as lstatus
FROM table a
left join (select top 1 b.status from table b order by b.seq DESC)
on a.id = b.id
Hope you can help me :)
Thanks in advance!!!
You should use a grouped max in a subquery and join it to the base table:
SELECT a.*, t.lstatus
FROM my_table a
INNER join (select b.id,
max(b.status) lstatus
from my_table b
group by b.id) t on t.id = a.id
For numeric values only:
SELECT a.*, t.lstatus
FROM my_table a
INNER join (select b.id,
max(b.status) lstatus
from my_table b
where IsNumeric(b.status) = True
group by b.id) t on t.id = a.id
Try the below.
with status as (
select distinct(id), status from table order by seq desc
)
select a.*, s.status as LStatus
from table a, status s
where a.id = s.id;
You can try using ROW_NUMBER() to assign number to each row within the ids ordered by seq. Join that back to your table where the id matches and the rownumber rn = 1.
DEMO
SELECT
a.*
, b.status
FROM table a
JOIN (SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY seq DESC) AS rn
FROM dbo.Table) b ON a.id = b.id AND b.rn = 1
You can use a correlated subquery here, ordering by seq so that the status of the highest sequence is picked:
select *,
(select top 1 Status from table where id = t.id order by seq desc) as Lstatus
from table t;
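Here is a minimal sketch of the ROW_NUMBER join from the answer above, run on the sample data through Python's sqlite3 (SQLite 3.25 or later for window functions). The table name my_table is borrowed from one of the answers, and Seq and Status are stored as text to keep the leading zeros; both are assumptions on my part.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE my_table (id INTEGER, seq TEXT, status TEXT);
INSERT INTO my_table VALUES
  (10, '001', '010'),
  (10, '002', 'test'),
  (10, '003', '005'),
  (11, '001', '010'),
  (11, '002', '338');
""")

# Subquery b keeps, per id, the row with the highest seq (rn = 1);
# joining it back attaches that row's status to every row of the same id.
with_last_status = conn.execute("""
SELECT a.id, a.seq, a.status, b.status AS lstatus
FROM my_table a
JOIN (
  SELECT id, status,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY seq DESC) AS rn
  FROM my_table
) b ON b.id = a.id AND b.rn = 1
ORDER BY a.id, a.seq
""").fetchall()

for row in with_last_status:
    print(row)
```

Note that the ordering is on seq, not on status: taking the maximum status would return 'test' for ID 10 instead of '005'.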

Find the first instance of a null value in a group as long as no non null comes after - Teradata SQL

I am trying to find the very first row where a certain field is null but the caveat is there cannot be a non-null coming after. If there isn't a null value or a non-null comes after the null then I do not want to return that one at all. I am using Teradata SQL and the following mock dataset should illustrate what I am looking for.
ID | Date | Field_Of_Interest
A | 1/1/2015 | 1
A | 2/1/2015 | 1
A | 3/1/2015 |
A | 4/1/2015 |
A | 5/1/2015 |
B | 1/1/2015 | 1
B | 2/1/2015 | 1
B | 3/1/2015 |
B | 4/1/2015 | 1
B | 5/1/2015 |
C | 1/1/2015 | 1
C | 2/1/2015 | 1
C | 3/1/2015 | 1
C | 4/1/2015 | 1
C | 5/1/2015 | 1
D | 1/1/2015 | 1
D | 2/1/2015 | 1
D | 3/1/2015 |
D | 4/1/2015 |
D | 5/1/2015 | 1
Desired Result:
ID | Date
A | 3/1/2015
B | 5/1/2015
Since C and D have a non-null for the last record I do not want them at all.
Where I run into trouble are situations like B or D where I can't just take the minimum of the date field where Field_Of_Interest is null. Another thought I had was to find the min where null and the max where not null and if the date for the min was greater than that of the max use that. The problem there is in B where a non-null came after a null and then it went back to null.
Any ideas?
Does this give you what you want?
SELECT
T1.ID,
MIN(T1.some_date) AS some_date
FROM
My_Table T1
WHERE
T1.some_column IS NULL AND
NOT EXISTS (SELECT * FROM My_Table T2 WHERE T2.ID = T1.ID AND T2.some_date > T1.some_date AND T2.some_column IS NOT NULL)
GROUP BY
T1.ID
Alternatively:
SELECT
T1.id,
MIN(T1.some_date) AS some_date
FROM
My_Table T1
LEFT OUTER JOIN My_Table T2 ON
T2.id = T1.id AND
T2.some_date > T1.some_date AND
T2.some_column IS NOT NULL
WHERE
T1.some_column IS NULL AND
T2.id IS NULL
GROUP BY
T1.id
You can do this with a difference of row number or using subqueries. The latter method results in a query like this:
select id, min(date)
from t
where t.field_of_interest is null and
not exists (select 1
from t t2
where t2.id = t.id and t2.date > t.date and
t2.field_of_interest is not null
)
group by id;
You can get the expected result with a single table access using Windowed Aggregate Functions. Depending on the actual data/query this might be more efficient.
SELECT ID, MIN(dt)
FROM
(
SELECT *
FROM tab
QUALIFY
-- returns NULL until the first row with a value in Field_Of_Interest
MIN(Field_Of_Interest)
OVER (PARTITION BY ID
ORDER BY dt DESC
ROWS UNBOUNDED PRECEDING) IS NULL
) AS dt
GROUP BY 1
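The NOT EXISTS version can be checked against the mock dataset with a minimal sketch like the one below. It uses Python's sqlite3 instead of Teradata, so QUALIFY is not available and only the subquery approach is shown. The table name tab and column name dt follow the last answer; reading the dates as month/day/year and storing them as ISO strings is my assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab (id TEXT, dt TEXT, field_of_interest INTEGER)")

# One list per ID, one entry per month from January to May 2015 (None = NULL).
sample = {
    "A": [1, 1, None, None, None],
    "B": [1, 1, None, 1, None],
    "C": [1, 1, 1, 1, 1],
    "D": [1, 1, None, None, 1],
}
for group_id, values in sample.items():
    for month, value in enumerate(values, start=1):
        conn.execute(
            "INSERT INTO tab VALUES (?, ?, ?)",
            (group_id, f"2015-{month:02d}-01", value),
        )

# Keep only NULL rows that have no later non-NULL row in the same ID,
# then take the earliest such row per ID.
first_trailing_null = conn.execute("""
SELECT t1.id, MIN(t1.dt) AS dt
FROM tab t1
WHERE t1.field_of_interest IS NULL
  AND NOT EXISTS (
        SELECT 1
        FROM tab t2
        WHERE t2.id = t1.id
          AND t2.dt > t1.dt
          AND t2.field_of_interest IS NOT NULL
      )
GROUP BY t1.id
ORDER BY t1.id
""").fetchall()

for row in first_trailing_null:
    print(row)
```

C and D drop out entirely because none of their NULL rows (if any) survive the NOT EXISTS filter, so there is nothing left to group.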

Transact-SQL: Display two SUM() from different tables?

This is a simplified example of what I want:
Table 1 :
CODE | VALUE
A | 10
A | 20
B | 10
C | 20
Table 2 :
CODE | VALUE2
A | 25
B | 10
B | 10
D | 20
And this is what I want :
CODE | SUM(VALUE) | SUM(VALUE2)
A | 30 | 25
B | 10 | 20
C | 20 | NULL
D | NULL | 20
I tried naively :
SELECT T1.CODE, SUM(VALUE), SUM(VALUE2)
FROM table1 T1
LEFT OUTER JOIN table2 T2
ON T1.CODE = T2.CODE
GROUP BY T.CODE
But the results are wrong and I don't know what to do... Can someone explain how to resolve this problem and write a proper query?
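Maybe something like this? Note the union all: a plain union would remove duplicate rows (such as the two identical B rows in Table 2) before they are summed.
select code, sum(v1), sum(v2)
from (select code, value v1, null v2
from table1
union all
select code, null v1, value2 v2
from table2) t
group by code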
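The difference between UNION and UNION ALL matters for this data, because Table 2 contains two identical rows for code B. Here is a minimal sketch that runs both variants on the sample tables through Python's sqlite3 (not SQL Server, so treat it as an illustration of the set semantics only; the helper name sums_by_code is hypothetical).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (code TEXT, value INTEGER);
CREATE TABLE table2 (code TEXT, value2 INTEGER);
INSERT INTO table1 VALUES ('A', 10), ('A', 20), ('B', 10), ('C', 20);
INSERT INTO table2 VALUES ('A', 25), ('B', 10), ('B', 10), ('D', 20);
""")


def sums_by_code(set_operator):
    """Stack both tables with the given set operator, then sum per code."""
    query = f"""
    SELECT code, SUM(v1), SUM(v2)
    FROM (
      SELECT code, value AS v1, NULL AS v2 FROM table1
      {set_operator}
      SELECT code, NULL AS v1, value2 AS v2 FROM table2
    ) AS t
    GROUP BY code
    ORDER BY code
    """
    return conn.execute(query).fetchall()


with_union_all = sums_by_code("UNION ALL")  # keeps both ('B', NULL, 10) rows
with_union = sums_by_code("UNION")          # collapses them into one row

print(with_union_all)
print(with_union)
```

With UNION ALL the result matches the table the question asks for; with UNION the VALUE2 sum for B comes out as 10 instead of 20.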