most efficient way to select duplicate rows with max timestamp

most efficient way to select duplicate rows with max timestamp - sql

Suppose I have a table called t, which is like
id content time
1 'a' 100
1 'a' 101
1 'b' 102
2 'c' 200
2 'c' 201
id are duplicate, and for the same id, content could also be duplicate. Now I want to select for each id the rows with max timestamp, which would be
id content time
1 'b' 102
2 'c' 201
And this is my current solution:
select t1.id, t1.content, t1.time
from (
select id, content, time from t
) as t1
right join (
select id, max(time) as time from t group by id
) as t2
on t1.id = t2.id and t1.time = t2.time;
But this looks inefficient to me. Because theoretically when select id, max(time) as time from t group by id is executed, the rows I want have already been located. The right join brings extra O(n^2) time cost, which seems unnecessary.
So is there any more efficient way to do it, or anything that I missunderstand?

Use DISTINCT ON:
SELECT DISTINCT ON (id) id, content, time
FROM yourTable
ORDER BY id, time DESC;
On Postgres, this is usually the most performant way to write your query, and it should outperform ROW_NUMBER and other approaches.
The following index might speed up this query:
CREATE INDEX idx ON yourTable (id, time DESC, content);
This index, if used, would let Postgres rapidly find, for each id, the record having the latest time. This index also covers the content column.

Try this
SELECT a.id, a.content, a.time FROM t AS a
INNER JOIN (
SELECT a.content, MAX(a.time) AS time FROM t
GROUP BY a.content
) AS b ON a.content = b.content AND a.time = b.time

Related

sql: first row after the last row with a property

I would like to write a query that returns the first row immediately after the last row with a given property (ordered by id). Id's may not be consecutive.
Ideally it would look something like this:
...
JOIN (select max(id) id from my_table where CONDITION) m
JOIN (select min(id) from my_table where id > m.id) n
However, I can not use identifier m in the second subselect.
It is possible to use nested queries in nested queries, but is there an easier way?
Thank you.

You could use lead() to get the next id before applying the condition:
select t.*
from my_table t join
(select max(next_id) as max_next_id
from (select t.*, lead(id) over (order by id) as next_id
from my_table t
) t
where <condition>
) tt
on t.id = tt.max_next_id;
You could also do:
select t.*
from my_table t
where t.id > (select max(t2.id) from my_table t2 where <condition>)
order by t2.id asc
fetch first 1 row only;

I am not sure how this is getting woven into the rest of your query, so I have used a CTE
WITH max_next AS (
SELECT r.id as max_id
,r.next_id
FROM (
SELECT m.id
,m.next_id
,ROW_NUMBER() OVER (ORDER BY m.id DESC) AS rn
FROM (
SELECT n.* -- to provide data to satisfy CONDITIONS
,LEAD(n.id) OVER(ORDER BY n.id) as next_id
FROM my_table AS n
) AS m
WHERE CONDITIONS
) AS r
WHERE r.rn = 1
)
I would also shrink the n.* to the columns needed by CONDITIONS to a, not be implicit as the * slows the compile time down (or historically has) as all meta data needs to be read to understand what columns is in the ANY, and the while the compile can also prune not used columns, it's faster if you just ask for what you want (in best case just a compile time savings, worse case, it read all the data when you only need x number of columns read)
And borrowing from Gordon solution, the ROW_NUMBER part could be simpler
WITH max_next AS (
SELECT m.id
,m.next_id
--, plus what ever other things you want from m
FROM (
SELECT n.* -- to satisfy CONDITIONS needs
,LEAD(n.id) OVER(ORDER BY n.id) as next_id
FROM my_table AS n
) AS m
WHERE CONDITIONS
ORDER BY m.id DESC LIMIT 1
)
So for an example for #PIG,
WITH my_table AS (
SELECT column1 AS id
,column2 AS con1
,column3 AS other
FROM VALUES (1,'a',123),(2,'b',234),(3,'a',345),(5,'b',456),(7,'a',567),(10,'c',678)
)
SELECT m.id
,m.next_id
,m.other
FROM (
SELECT n.* -- to satisfy CONDITIONS needs
,LEAD(n.id) OVER(ORDER BY n.id) as next_id
FROM my_table AS n
) AS m
WHERE m.con1 = 'b'
ORDER BY m.id DESC LIMIT 1;
gives 5, 7, 456 which is the last 'b' and the new row, and an extra value on my_table for entertainment purposes (and run on Snowflake to, which means I fixed the prior SQL also.)

This should work, it's pretty straightforward (easy), and it's good that you know records may not be stored in a ordered/consecutive fashion.
SELECT *
FROM my_table
WHERE id = (
SELECT min(id)
FROM my_table
WHERE id > (
SELECT max(id)
FROM my_table
WHERE CONDITION));

How to return two values from PostgreSQL subquery?

I have a problem where I need to get the last item across various tables in PostgreSQL.
The following code works and returns me the type of the latest update and when it was last updated.
The problem is, this query needs to be used as a subquery, so I want to select both the type and the last updated value from this query and PostgreSQL does not seem to like this... (Subquery must return only one column)
Any suggestions?
SELECT last.type, last.max FROM (
SELECT MAX(a.updated_at), 'a' AS type FROM table_a a WHERE a.ref = 5 UNION
SELECT MAX(b.updated_at), 'b' AS type FROM table_b b WHERE b.ref = 5
) AS last ORDER BY max LIMIT 1
Query is used like this inside of a CTE;
WITH sql_query as (
SELECT id, name, address, (...other columns),
last.type, last.max FROM (
SELECT MAX(a.updated_at), 'a' AS type FROM table_a a WHERE a.ref = 5 UNION
SELECT MAX(b.updated_at), 'b' AS type FROM table_b b WHERE b.ref = 5
) AS last ORDER BY max LIMIT 1
FROM table_c
WHERE table_c.fk_id = 1
)

The inherent problem is that SQL (all SQL not just Postgres) requires that a subquery used within a select clause can only return a single value. If you think about that restriction for a while it does makes sense. The select clause is returning rows and a certain number of columns, each row.column location is a single position within a grid. You can bend that rule a bit by putting concatenations into a single position (or a single "complex type" like a JSON value) but it remains a single position in that grid regardless.
Here however you do want 2 separate columns AND you need to return both columns from the same row, so instead of LIMIT 1 I suggest using ROW_NUMBER() instead to facilitate this:
WITH LastVals as (
SELECT type
, max_date
, row_number() over(order by max_date DESC) as rn
FROM (
SELECT MAX(a.updated_at) AS max_date, 'a' AS type FROM table_a a WHERE a.ref = 5
UNION ALL
SELECT MAX(b.updated_at) AS max_date, 'b' AS type FROM table_b b WHERE b.ref = 5
)
)
, sql_query as (
SELECT id
, name, address, (...other columns)
, (select type from lastVals where rn = 1) as last_type
, (select max_date from lastVals where rn = 1) as last_date
FROM table_c
WHERE table_c.fk_id = 1
)
----
By the way in your subquery you should use UNION ALL with type being a constant like 'a' or 'b' then even if MAX(a.updated_at) was identical for 2 or more tables, the rows would still be unique because of the difference in type. UNION will attempt to remove duplicate rows but here it just isn't going to help, so avoid that wasted effort by using UNION ALL.
----
For another way to skin this cat, consider using a LEFT JOIN instead
SELECT id
, name, address, (...other columns)
, lastVals.type
, LastVals.last_date
FROM table_c
WHERE table_c.fk_id = 1
LEFT JOIN (
SELECT type
, last_date
, row_number() over(order by last_date DESC) as rn
FROM (
SELECT MAX(a.updated_at) AS last_date, 'a' AS type FROM table_a a WHERE a.ref = 5
UNION ALL
SELECT MAX(b.updated_at) AS last_date, 'b' AS type FROM table_b b WHERE b.ref = 5
)
) LastVals ON LastVals.rn = 1

SQL Case depending on previous status of record

I have a table containing status of a records. Something like this:
ID STATUS TIMESTAMP
1 I 01-01-2016
1 A 01-03-2016
1 P 01-04-2016
2 I 01-01-2016
2 P 01-02-2016
3 P 01-01-2016
I want to make a case where I take the newest version of each row, and for all P that has at some point been an I, they should be cased as a 'G' instead of P.
When I try to do something like
Select case when ID in (select ID from TABLE where ID = 'I') else ID END as status)
From TABLE
where ID in (select max(ID) from TABLE)
I get an error that this isn't possible using IN when casing.
So my question is, how do I do it then?
Want to end up with:
ID STATUS TIMESTAMP
1 G 01-04-2016
2 G 01-02-2016
3 P 01-01-2016
DBMS is IBM DB2

Have a derived table which returns each id with its newest timestamp. Join with that result:
select t1.ID, t1.STATUS, t1.TIMESTAMP
from tablename t1
join (select id, max(timestamp) as max_timestamp
from tablename
group by id) t2
ON t1.id = t2.id and t1.TIMESTAMP = t2.max_timestamp
Will return both rows in case of a tie (two rows with same newest timestamp.)
Note that ANSI SQL has TIMESTAMP as reserved word, so you may need to delimit it as "TIMESTAMP".

You can do this by using a common table expression find all IDs that have had a status of 'I', and then using an outer join with your table to determine which IDs have had a status of 'I' at some point.
To get the final result (with only the newest record) you can use the row_number() OLAP function and select only the "newest" record (this is shown in the ranked common table expression below:
with irecs (ID) as (
select distinct
ID
from
TABLE
where
status = 'I'
),
ranked as (
select
rownumber() over (partition by t.ID order by t.timestamp desc) as rn,
t.id,
case when i.id is null then t.status else 'G' end as status,
t.timestamp
from
TABLE t
left outer join irecs i
on t.id = i.id
)
select
id,
status,
timestamp
from
ranked
where
rn = 1;

other solution
with youtableranked as (
select f1.id,
case (select count(*) from yourtable f2 where f2.ID=f1.ID and f2."TIMESTAMP"<f1."TIMESTAMP" and f2.STATUS='I')>0 then 'G' else f1.STATUS end as STATUS,
rownumber() over(partition by f1.id order by f1.TIMESTAMP desc, rrn(f1) desc) rang,
f1."TIMESTAMP"
from yourtable f1
)
select * from youtableranked f0
where f0.rang=1
ANSI SQL has TIMESTAMP as reserved word, so you may need to delimit it as "TIMESTAMP"

try this
select distinct f1.id, f4.*
from yourtable f1
inner join lateral
(
select
case (select count(*) from yourtable f3 where f3.ID=f2.ID and f3."TIMESTAMP"<f2."TIMESTAMP" and f3.STATUS='I')>0 then 'G' else f2.STATUS end as STATUS,
f2."TIMESTAMP"
from yourtable f2 where f2.ID=f3.ID
order by f2."TIMESTAMP" desc, rrn(f2) desc
fetch first rows only
) f4 on 1=1
rrn(f2) order is for same last date
ANSI SQL has TIMESTAMP as reserved word, so you may need to delimit it as "TIMESTAMP"

Case on union of multiple unions and issue with alias

I have 2 series of unions which I wish to join by another union. In the first one, I have 3 Selects and in the second one I have 2 different Selects.
Select id, min(value)
from table1 t1
join (Select id, value
Union
Select id, value
Union
Select id, value) as foo
on foo.id=t1.id
Group by id
Select id, max(value)
from table1 t1
join (Select id, value
Union
Select id, value) as bar
on bar.id=t1.id
Group by id
I tried to do a union between these two, but it made things pretty complicated. My biggest issue is with my alias. My second is with the case linked to my value columns, which I wish to name value.
Select (alias).id,
Case
When foo.value= 0 or bar.value=1 THEN 1
Else 0
End as value
from table1 t1
Join (Select id, min(value)
from table1 t1
join (Select id, value
Union
Select id, value
Union
Select id, value) as foo
on foo.id=t1.id
Group by id
UNION
Select id, max(value)
from table1 t1
join (Select id, value
Union
Select id, value) as bar
on bar.id=t1.id
Group by id) as (alias)
on ??.id=??.id
I wrote my case the way I think it should be written, but normally, when there are more than one column with the same name, SQL states it as ambiguous. I am still unsure if I should use UNION or INTERSECT, but I assume either of them would be done the same way. How should I deal with this?

I'm reading this right, you probably want something like this
SELECT ...
FROM ( ... union #1 ) AS u1
JOIN (... union #2 ) AS u2 ON u1.id = u2.id

TSQL: Return row(s) with earliest dates

Given 2 tables called "table1" and "table1_hist" that structurally resemble this:
TABLE1
id status date_this_status
1 open 2008-12-12
2 closed 2009-01-01
3 pending 2009-05-05
4 pending 2009-05-06
5 open 2009-06-01
TABLE1_hist
id status date_this_status
2 open 2008-12-24
2 pending 2008-12-26
3 open 2009-04-24
4 open 2009-05-04
With table1 being the current status and table1_hist being a history table of table1, how can I return the rows for each id that has the earliest date. In other words, for each id, I need to know it's earliest status and date.
EXAMPLE:
For id 1 earliest status and date is open and 2008-12-12.
For id 2 earliest status and date is open and 2008-12-24.
I've tried using MIN(datetime), unions, dynamic SQL, etc. I've just reached tsql writers block today and I'm stuck.
Edited to add: Ugh. This is for a SQL2000 database, so Alex Martelli's answer won't work. ROW_NUMBER wasn't introduced until SQL2005.

SQL Server 2005 and later support an interesting (relatively recent) aspect of SQL Standards, "ranking/windowing functions", allowing:
WITH AllRows AS (
SELECT id, status, date_this_status,
ROW_NUMBER() OVER(PARTITION BY id ORDER BY date_this_status ASC) AS row,
FROM (SELECT * FROM Table1 UNION SELECT * FROM Table1_hist) Both_tables
)
SELECT id, status, date_this_status
FROM AllRows
WHERE row = 1
ORDER BY id;
where I'm also using the nice (and equally "new") WITH syntax to avoid nesting the sub-query in the main SELECT.
This article shows how one could hack the equivalent of ROW_NUMBER (and also RANK and DENSE_RANK, the other two "new" ranking/windowing functions) in SQL Server 2000 -- but that's not necessarily pretty nor especially well-performing, alas.

The following code sample is completely self-sufficient, just copy and paste it into a management studio query and hit F5 =)
DECLARE #TABLE1 TABLE
(
id INT,
status VARCHAR(50),
date_this_status DATETIME
)
DECLARE #TABLE1_hist TABLE
(
id INT,
status VARCHAR(50),
date_this_status DATETIME
)
--TABLE1
INSERT #TABLE1
SELECT 1, 'open', '2008-12-12' UNION ALL
SELECT 2, 'closed', '2009-01-01' UNION ALL
SELECT 3, 'pending', '2009-05-05' UNION ALL
SELECT 4, 'pending', '2009-05-06' UNION ALL
SELECT 5, 'open', '2009-06-01'
--TABLE1_hist
INSERT #TABLE1_hist
SELECT 2, 'open', '2008-12-24' UNION ALL
SELECT 2, 'pending', '2008-12-26' UNION ALL
SELECT 3, 'open', '2009-04-24' UNION ALL
SELECT 4, 'open', '2009-05-04'
SELECT x.id,
ISNULL(y.[status], x.[status]) AS [status],
ISNULL(y.date_this_status, x.date_this_status) AS date_this_status
FROM #TABLE1 x
LEFT JOIN (
SELECT a.*
FROM #TABLE1_hist a
INNER JOIN (
SELECT id,
MIN(date_this_status) AS date_this_status
FROM #TABLE1_hist
GROUP BY id
) b
ON a.id = b.id
AND a.date_this_status = b.date_this_status
) y
ON x.id = y.id

SELECT id,
status,
date_this_status
FROM ( SELECT *
FROM Table1
UNION
SELECT *
from TABLE1_hist
) a
WHERE date_this_status = ( SELECT MIN(date_this_status)
FROM ( SELECT *
FROM Table1
UNION
SELECT *
from TABLE1_hist
) t
WHERE id = a.id
)
This is a bit ugly, but seems to work in MS SQL Server 2005.

You can do this with an exclusive self join. Join on the history table, and then another time on all earlier history entries. In the where statement, you specify that there are not allowed to be any earlier entries.
select t1.id,
isnull(hist.status, t1.status),
isnull(hist.date_this_status, t1.date_this_status)
from table1 t1
left join (
select h1.id, h1.status, h1.date_this_status
from table1_hist h1
left join table1_hist h2
on h2.id = h1.id
and h2.date_this_status < h1.date_this_status
where h2.date_this_status is null
) hist on hist.id = t1.id
A bit of a mind-binder, but fairly flexible and efficient!
This assumes there are no two history entries with the exact same date. If there are, write the self join like:
left join table1_hist h2
on h2.id = h1.id
and (
h2.date_this_status < h1.date_this_status
or (h2.date_this_status = h1.date_this_status and h2.id < h1.id)
)

If I understand the OP correctly, a given ID may appear in TABLE1 or TABLE1_HISTORY or both.
In your result set, you want back each distinct ID and the oldest status/date associated with that ID, regardless which table the oldest one happens to be in.
So, look in BOTH tables and return any record where there is no record in either table for it's ID that has a smaller date_this_status.
Try this:
SELECT ID, status, date_this_status FROM table1 ta WHERE
NOT EXISTS(SELECT null FROM table1 tb WHERE
tb.id = ta.id
AND tb.date_this_status < ta.date_this_status)
AND NOT EXISTS(SELECT null FROM table1_history tbh WHERE
tbh.id = ta.id
AND tbh.date_this_status < ta.date_this_status)
UNION ALL
SELECT ID, status, date_this_status FROM table1_history tah WHERE
NOT EXISTS(SELECT null FROM table1 tb WHERE
tb.id = tah.id
AND tb.date_this_status < tah.date_this_status)
AND NOT EXISTS(SELECT null FROM table1_history tbh WHERE
tbh.id = tah.id
AND tbh.date_this_status < tah.date_this_status)
Three underlying assumptions here:
Every ID you want back will have at least one record in at least one of the tables.
There won't be multiple records for the same ID in the same table with the same date_this_status value (can be mitigated by using DISTINCT)
There won't be records for the same ID in the other table with the same date_this_status value (can be mitigated by using UNION instead of UNION ALL)
There are two slight optimizations we can make:
If an ID has a record in TABLE1_HISTORY, it will always be older than the record in TABLE1 for that ID.
TABLE1 will never contain multiple records for the same ID (but the history table may).
So:
SELECT ID, status, date_this_status FROM table1 ta WHERE
NOT EXISTS(SELECT null FROM table1_history tbh WHERE
tbh.id = ta.id
)
UNION ALL
SELECT ID, status, date_this_status FROM table1_history tah WHERE
NOT EXISTS(SELECT null FROM table1_history tbh WHERE
tbh.id = tah.id
AND tbh.date_this_status < tah.date_this_status)

If that is the actual structure of your tables, you can't get a 100% accurate answer, the issue being that you can have 2 different statuses for the same (earliest) date for any given record and you would not know which one was entered first, because you don't have a primary key on the history table

Ignoring the "two tables" issues for a moment, I'd use the following logic...
SELECT
id, status, date
FROM
Table1_hist AS [data]
WHERE
[data].date = (SELECT MIN(date) FROM Table1_hist WHERE id = [data].id)
(EDIT: As per BlackTigerX's comment, this assumes no id can have more than one status with the same datetime.)
The simple way to extrapolate this to two tables is to use breitak67's answer. Replace all instances of "my_table" with subqueries that UNION the two tables together. A potential issue here is that of performance, as you may find that indexes become unusable.
One method of speeding this up could be to use implied knowledge:
1. The main table always has a record for each id.
2. The history table doesn't always have a record.
3. Any record in the history table is always 'older' than the one in main table.
SELECT
[main].id,
ISNULL([hist].status, [main].status),
ISNULL([hist].date, [main].date)
FROM
Table1 AS [main]
LEFT JOIN
(
SELECT
id, status, date
FROM
Table1_hist AS [data]
WHERE
[data].date = (SELECT MIN(date) FROM Table1_hist WHERE id = [data].id)
)
AS [hist]
ON [hist].id = [main].id
Find the oldest status for each id in the history table. (Can use its indexes)
LEFT JOIN that to the main table (which always has exactly one record for each id)
If [hist] contains a value, it's the older by definition
If the [hist] doesn't have a value, use the [main] value

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

most efficient way to select duplicate rows with max timestamp - sql

Try this SELECT a.id, a.content, a.time FROM t AS a INNER JOIN ( SELECT a.content, MAX(a.time) AS time FROM t GROUP BY a.content ) AS b ON a.content = b.content AND a.time = b.time

Related

sql: first row after the last row with a property

How to return two values from PostgreSQL subquery?

SQL Case depending on previous status of record

Case on union of multiple unions and issue with alias

TSQL: Return row(s) with earliest dates

Categories

Resources