Select latest revision of each row in a table - sql

I have table structures that include a composite primary key of id & revision where both are integers.
I need a query that will return the latest revision of each row. If I understood this answer correctly then the following would have worked on an Oracle DB.
SELECT Id, Title
FROM ( SELECT Id, Revision, MAX(Revision) OVER (PARTITION BY Id) LatestRevision FROM Task )
WHERE Revision = LatestRevision
I am using SQL Server (2005) and need a performant query to do the same.

I think this should work (I didn't test it)...
SELECT T.ID,
T.Title
FROM Task AS T
INNER JOIN
(
SELECT ID,
MAX(Revision) AS Revision
FROM Task
GROUP BY ID
) AS sub
ON T.ID = sub.ID
AND T.Revision = sub.Revision
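For anyone who wants to try the GROUP BY join without a SQL Server instance, here is a minimal sketch that runs the same pattern against an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in here, and the sample rows are made up for illustration; the Task table and its columns follow the question.

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in for SQL Server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Task (
        Id       INTEGER NOT NULL,
        Revision INTEGER NOT NULL,
        Title    TEXT,
        PRIMARY KEY (Id, Revision)
    );
    -- Hypothetical sample rows, not taken from the question.
    INSERT INTO Task VALUES (1, 1, 'A'), (1, 2, 'AA'),
                            (2, 1, 'B'), (2, 2, 'BB'), (2, 3, 'BBB'),
                            (3, 1, 'C');
""")

# Join each row to the highest revision recorded for its Id.
LATEST_REVISION_SQL = """
    SELECT T.Id, T.Revision, T.Title
    FROM Task AS T
    INNER JOIN (
        SELECT Id, MAX(Revision) AS Revision
        FROM Task
        GROUP BY Id
    ) AS sub
        ON T.Id = sub.Id
       AND T.Revision = sub.Revision
    ORDER BY T.Id
"""

latest = conn.execute(LATEST_REVISION_SQL).fetchall()
print(latest)
```

Because (Id, Revision) is the primary key, the join can match at most one row per Id, so no tie handling is needed.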

See this post by ayende for an evaluation of the best strategies.

I would try to create a subquery like this:
SELECT T.Id, T.Title
FROM Task T, (SELECT ID, MAX(Revision) MaxRev FROM Task GROUP BY ID) LatestT
WHERE T.Revision = LatestT.MaxRev AND T.ID = LatestT.ID
Another option is to "cheat": create a trigger that flags a row as the latest revision whenever a new revision is inserted,
then add that flag column to the index. (I would restrict the table to inserts only.)
An index on (ID, Revision DESC) could also help performance.

The query you posted will work in SQL 2005 (in compatibility mode 90) with the syntax errors corrected:
SELECT t1.Id, t1.Title
FROM ( SELECT Id, Revision, MAX(Revision) OVER (PARTITION BY Id) LatestRevision FROM Task ) AS x
JOIN Task as t1
ON t1.Revision = x.LatestRevision
AND t1.id = x.id

try this:
DECLARE @YourTable table(RowID int, Revision int, Title varchar(10))
INSERT INTO @YourTable VALUES (1,1,'A')
INSERT INTO @YourTable VALUES (2,1,'B')
INSERT INTO @YourTable VALUES (2,2,'BB')
INSERT INTO @YourTable VALUES (3,1,'C')
INSERT INTO @YourTable VALUES (4,1,'D')
INSERT INTO @YourTable VALUES (1,2,'AA')
INSERT INTO @YourTable VALUES (2,3,'BBB')
INSERT INTO @YourTable VALUES (5,1,'E')
INSERT INTO @YourTable VALUES (5,2,'EE')
INSERT INTO @YourTable VALUES (4,2,'DD')
INSERT INTO @YourTable VALUES (4,3,'DDD')
INSERT INTO @YourTable VALUES (6,1,'F')
;WITH YourTableRank AS
(
SELECT
RowID,Revision,Title, ROW_NUMBER() OVER(PARTITION BY RowID ORDER BY RowID,Revision DESC) AS Rank
FROM @YourTable
)
SELECT
RowID, Revision, Title
FROM YourTableRank
WHERE Rank=1
OUTPUT:
RowID       Revision    Title
----------- ----------- ----------
1           2           AA
2           3           BBB
3           1           C
4           3           DDD
5           2           EE
6           1           F
(6 row(s) affected)

Related

Latest entry in a group SQL Server

Given the sample data below, I need a list of the ids whose latest entry is Rejected. Thus, I need to see id 2 because its latest is 6/4/2020 and that is Rejected. I do not want to see id 1 as its latest entry is Requested.
CREATE TABLE #temp
(
id int,
mydate datetime,
status VARCHAR(20)
)
INSERT INTO #temp VALUES (1, '6/1/2020', 'Rejected')
INSERT INTO #temp VALUES (1, '6/2/2020', 'Requested')
INSERT INTO #temp VALUES (1, '6/3/2020', 'Rejected')
INSERT INTO #temp VALUES (1, '6/4/2020', 'Requested')
INSERT INTO #temp VALUES (2, '6/1/2020', 'Requested')
INSERT INTO #temp VALUES (2, '6/2/2020', 'Requested')
INSERT INTO #temp VALUES (2, '6/3/2020', 'Requested')
INSERT INTO #temp VALUES (2, '6/4/2020', 'Rejected')
SELECT * FROM #temp
This is my pathetic attempt so far
SELECT id, MAX(mydate)
FROM #temp
WHERE status = 'Rejected'
GROUP BY id
But this will only bring back the latest date in each group. I need a list where the latest entry was Rejected. I expect the answer to be embarrassingly simple but I'm having a heck of a time with this.
Thanks
Carl
You can get this using row_number() function as shown below.
;WITH cte
AS (
SELECT Id
,mydate
,STATUS
,ROW_NUMBER() OVER (
PARTITION BY Id ORDER BY mydate desc
) row_num
FROM #temp
)
SELECT *
FROM cte
WHERE row_num = 1
AND STATUS = 'Rejected'
Here is the live db<>fiddle demo.
One method uses aggregation and having:
select id
from #temp
group by id
having max(case when status = 'Rejected' then mydate end) = max(mydate);
This is almost a direct translation of your question: the latest date for 'Rejected' is the latest date for a given id.
More traditional methods use a correlated subquery:
select t.*
from #temp t
where t.mydate = (select max(t2.mydate)
from #temp t2
where t2.id = t.id
) and
t.status = 'Rejected';
Or window functions:
select t.*
from (select t.*,
row_number() over (partition by id order by mydate desc) as seqnum
from #temp t
) t
where t.seqnum = 1 and t.status = 'Rejected';
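As a quick check of the aggregation-with-HAVING idea, here is a small sketch run against in-memory SQLite via Python's sqlite3 module. Two things are assumptions for the sake of the demo: the table is called status_log instead of #temp, and the dates are stored as ISO text (YYYY-MM-DD) so that MAX() compares them chronologically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# status_log is a hypothetical name standing in for the #temp table.
conn.execute("CREATE TABLE status_log (id INTEGER, mydate TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO status_log VALUES (?, ?, ?)",
    [
        (1, "2020-06-01", "Rejected"),
        (1, "2020-06-02", "Requested"),
        (1, "2020-06-03", "Rejected"),
        (1, "2020-06-04", "Requested"),
        (2, "2020-06-01", "Requested"),
        (2, "2020-06-02", "Requested"),
        (2, "2020-06-03", "Requested"),
        (2, "2020-06-04", "Rejected"),
    ],
)

# Keep an id only when its latest 'Rejected' date is also its latest date overall.
LATEST_REJECTED_SQL = """
    SELECT id
    FROM status_log
    GROUP BY id
    HAVING MAX(CASE WHEN status = 'Rejected' THEN mydate END) = MAX(mydate)
    ORDER BY id
"""

rejected_ids = [row[0] for row in conn.execute(LATEST_REJECTED_SQL)]
print(rejected_ids)
```

Only id 2 comes back: id 1 has a Rejected row on the 3rd, but its latest row (the 4th) is Requested.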

How To Create Duplicate Records depending on Column which indicates on Repetition

I have a table of aggregated records, and I need to split them according to a specific column ('Shares Bought' in the example below), as follows:
Original Table:
Requested Table:
Needless to say, there are more records like this in the table, so I need an automated query (not manual insertions),
and there are more attributes that I will need to duplicate (like the field 'Date').
You would first need to generate rows with an increasing row_number and then join them to your table.
Eg:
create table t(rowid int, name varchar(100),shares_bought int, date_val date)
insert into t
select *
from (values (1,'Dan',2,'2018-08-23')
,(2,'Mirko',1,'2018-08-25')
,(3,'Shuli',3,'2018-05-14')
,(4,'Regina',1,'2018-01-19')
)t(x,y,z,a);
with generate_data
as (select top (select max(shares_bought) from t)
row_number() over(order by (select null)) as rnk /* This would generate rows starting from 1,2,3 etc*/
from sys.objects a
cross join sys.objects b
)
select row_number() over(order by t.rowid) as rowid,t.name,1 as shares_bought,t.date_val
from t
join generate_data gd
on gd.rnk <=t.shares_bought /* generate rows up and until the number of shares bought*/
order by 1
Here is a db fiddle link
https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=5736255585c3ab2c2964c655bec9e08b
declare @t table (rowid int, name varchar(100), sb int, dt date);
insert into @t values
(1, 'Dan', 2, '20180823'),
(2, 'Mirco', 1, '20180825'),
(3, 'Shuli', 3, '20180514'),
(4, 'Regina', 1, '20180119');
with nums as
(
select n
from (values(1), (2), (3), (4)) v(n)
)
select t.*
from @t t
cross apply (select top (t.sb) *
from nums) a;
Use a table of numbers instead of the CTE nums, or extend the CTE so that it contains at least as many values as the largest number in the Shares Bought column.
Another option is to use a recursive CTE:
with t as (
select 1 as RowId, Name, ShareBought, Date
from table
union all
select RowId+1, Name, ShareBought, Date
from t
where RowId < ShareBought
)
select row_number() over (order by name) as RowId,
Name, 1 as ShareBought, Date
from t;
If ShareBought can exceed the default recursion limit of 100, add the option (maxrecursion 0) query hint to lift that limit.
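To see the recursive expansion in action, here is a rough sketch using SQLite through Python's sqlite3 module as a stand-in for SQL Server. The table name purchases and the column names are hypothetical; the sample rows mirror the ones used in the answers above. Note that SQLite spells it WITH RECURSIVE and has no maxrecursion hint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# purchases is a hypothetical table name; the rows mirror the sample data above.
conn.execute(
    "CREATE TABLE purchases (name TEXT, shares_bought INTEGER, date_val TEXT)"
)
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [
        ("Dan", 2, "2018-08-23"),
        ("Mirko", 1, "2018-08-25"),
        ("Shuli", 3, "2018-05-14"),
        ("Regina", 1, "2018-01-19"),
    ],
)

# The anchor emits copy number 1 of every row; the recursive member keeps
# adding copies while the copy number is still below shares_bought.
EXPAND_SQL = """
    WITH RECURSIVE expanded(n, name, shares_bought, date_val) AS (
        SELECT 1, name, shares_bought, date_val
        FROM purchases
        WHERE shares_bought >= 1
        UNION ALL
        SELECT n + 1, name, shares_bought, date_val
        FROM expanded
        WHERE n < shares_bought
    )
    SELECT name, 1 AS shares_bought, date_val
    FROM expanded
    ORDER BY name, n
"""

expanded = conn.execute(EXPAND_SQL).fetchall()
for row in expanded:
    print(row)
```

The strict comparison (n < shares_bought) matters: with <= every row would get one copy too many.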

finding duplicates and removing but keeping one value [duplicate]

This question already has answers here:
Delete duplicate records in SQL Server?
(10 answers)
Closed 9 years ago.
I currently have a URL redirect table in my database that contains ~8000 rows and ~6000 of them are duplicates.
I was wondering whether there is a way to delete these duplicates based on a certain column's value. I am looking to use my "old_url" column to find duplicates, and so far I have used
SELECT old_url
,DuplicateCount = COUNT(1)
FROM tbl_ecom_url_redirect
GROUP BY old_url
HAVING COUNT(1) > 1 -- more than one value
ORDER BY COUNT(1) DESC -- sort by most duplicates
However, I'm not sure how to remove them now, as I don't want to lose every single one, just the duplicates. The rows are almost a complete match, except that sometimes the new_url is different and the url_id (GUID) is different each time.
In my opinion ranking functions and a CTE are the easiest approach:
WITH CTE AS
(
SELECT old_url
,Num = ROW_NUMBER()OVER(PARTITION BY old_url ORDER BY DateColumn ASC)
FROM tbl_ecom_url_redirect
)
DELETE FROM CTE WHERE Num > 1
Change ORDER BY DateColumn ASC as needed to determine which records are deleted and which record is kept. In this case I delete all newer duplicates.
If your table has a primary key then this is easy:
BEGIN TRAN
CREATE TABLE #T(Id INT, OldUrl VARCHAR(MAX))
INSERT INTO #T VALUES
(1, 'foo'),
(2, 'moo'),
(3, 'foo'),
(4, 'moo'),
(5, 'foo'),
(6, 'zoo'),
(7, 'foo')
DELETE FROM #T WHERE Id NOT IN (
SELECT MIN(Id)
FROM #T
GROUP BY OldUrl
HAVING COUNT(OldUrl) = 1
UNION
SELECT MIN(Id)
FROM #T
GROUP BY OldUrl
HAVING COUNT(OldUrl) > 1)
SELECT * FROM #T
DROP TABLE #T
ROLLBACK
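For a quick experiment with the primary-key approach, here is a minimal sketch against in-memory SQLite via Python's sqlite3 module (a stand-in for SQL Server; the table name redirects is hypothetical). It uses a single MIN(id) ... GROUP BY subquery, which should be equivalent to the UNION of the two HAVING branches above as long as the URL column is never NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# redirects is a hypothetical stand-in for the #T table in the answer above.
conn.execute("CREATE TABLE redirects (id INTEGER PRIMARY KEY, old_url TEXT)")
conn.executemany(
    "INSERT INTO redirects VALUES (?, ?)",
    [(1, "foo"), (2, "moo"), (3, "foo"), (4, "moo"),
     (5, "foo"), (6, "zoo"), (7, "foo")],
)

# Keep the lowest id for each old_url and delete every other row.
cursor = conn.execute("""
    DELETE FROM redirects
    WHERE id NOT IN (
        SELECT MIN(id)
        FROM redirects
        GROUP BY old_url
    )
""")
deleted = cursor.rowcount

remaining = conn.execute(
    "SELECT id, old_url FROM redirects ORDER BY id"
).fetchall()
print(deleted, remaining)
```

Swapping MIN(id) for MAX(id) would keep the most recently inserted row of each group instead, assuming ids grow over time.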
This is a sample that deletes duplicate records identified by a GUID; I hope it helps.
DECLARE @t1 TABLE
(
DupID UNIQUEIDENTIFIER,
DupRecords NVARCHAR(255)
)
INSERT INTO @t1 VALUES
(NEWID(),'A1'),
(NEWID(),'A1'),
(NEWID(),'A2'),
(NEWID(),'A1'),
(NEWID(),'A3')
Now @t1 contains duplicated records, each with its own GUID.
;WITH CTE AS(
SELECT DupID,DupRecords, Rn = ROW_NUMBER()
OVER (PARTITION BY DupRecords ORDER BY DupRecords)
FROM @t1
)
DELETE FROM @t1 WHERE DupID IN (SELECT DupID FROM CTE WHERE RN>1)
The query above deletes the duplicated records from @t1; ROW_NUMBER() is used to number the records within each group of duplicates.
SELECT * FROM @t1

How to select top 3 values from each group in a table with SQL which have duplicates [duplicate]

This question already has answers here:
Select top 10 records for each category
(14 answers)
Closed 5 years ago.
Assume we have a table which has two columns, one column contains the names of some people and the other column contains some values related to each person. One person can have more than one value. Each value has a numeric type. The question is we want to select the top 3 values for each person from the table. If one person has less than 3 values, we select all the values for that person.
The issue can be solved, if there are no duplicates in the table, by the query provided in the article "Select top 3 values from each group in a table with SQL". But if there are duplicates, what is the solution?
For example, if for one name John, he has 5 values related to him. They are 20,7,7,7,4. I need to return the name/value pairs as below order by value descending for each name:
+------+-------+
| name | value |
+------+-------+
| John |    20 |
| John |     7 |
| John |     7 |
+------+-------+
Only 3 rows should be returned for John even though there are three 7s for John.
In many modern DBMS (e.g. Postgres, Oracle, SQL-Server, DB2 and many others), the following will work just fine. It uses a CTE and the ranking function ROW_NUMBER(), which has been part of the SQL standard since SQL:2003:
WITH cte AS
( SELECT name, value,
ROW_NUMBER() OVER (PARTITION BY name
ORDER BY value DESC
)
AS rn
FROM t
)
SELECT name, value, rn
FROM cte
WHERE rn <= 3
ORDER BY name, rn ;
Without CTE, only ROW_NUMBER():
SELECT name, value, rn
FROM
( SELECT name, value,
ROW_NUMBER() OVER (PARTITION BY name
ORDER BY value DESC
)
AS rn
FROM t
) tmp
WHERE rn <= 3
ORDER BY name, rn ;
Tested in: Postgres, Oracle, SQL-Server
In MySQL (before version 8.0) and other DBMS that do not have ranking functions, one has to use either derived tables, correlated subqueries or self-joins with GROUP BY.
The (tid) is assumed to be the primary key of the table:
SELECT t.tid, t.name, t.value, -- self join and GROUP BY
COUNT(*) AS rn
FROM t
JOIN t AS t2
ON t2.name = t.name
AND ( t2.value > t.value
OR t2.value = t.value
AND t2.tid <= t.tid
)
GROUP BY t.tid, t.name, t.value
HAVING COUNT(*) <= 3
ORDER BY name, rn ;
SELECT t.tid, t.name, t.value, rn
FROM
( SELECT t.tid, t.name, t.value,
( SELECT COUNT(*) -- inline, correlated subquery
FROM t AS t2
WHERE t2.name = t.name
AND ( t2.value > t.value
OR t2.value = t.value
AND t2.tid <= t.tid
)
) AS rn
FROM t
) AS t
WHERE rn <= 3
ORDER BY name, rn ;
Tested in MySQL
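Here is a small sketch of the correlated-subquery variant, run against in-memory SQLite through Python's sqlite3 module so it can be tried anywhere. The tid column is the assumed primary key mentioned above, and the Jane rows are hypothetical additions to the John example from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# tid is the assumed primary key; it breaks ties between equal values.
conn.execute(
    "CREATE TABLE t (tid INTEGER PRIMARY KEY, name TEXT, value INTEGER)"
)
conn.executemany(
    "INSERT INTO t (name, value) VALUES (?, ?)",
    [("John", 20), ("John", 7), ("John", 7), ("John", 7), ("John", 4),
     ("Jane", 30), ("Jane", 21)],  # Jane rows are made up for the demo
)

# rn counts the rows of the same name that sort at or before the current row:
# higher values first, and for equal values the lower tid first.
TOP3_SQL = """
    SELECT name, value
    FROM (
        SELECT t.name, t.value,
               (SELECT COUNT(*)
                FROM t AS t2
                WHERE t2.name = t.name
                  AND (t2.value > t.value
                       OR (t2.value = t.value AND t2.tid <= t.tid))) AS rn
        FROM t
    ) AS ranked
    WHERE rn <= 3
    ORDER BY name, rn
"""

top3 = conn.execute(TOP3_SQL).fetchall()
for row in top3:
    print(row)
```

The tie-breaker on tid is what keeps John at exactly three rows even though he has three 7s; without it, all three 7s would share the same count.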
I was going to downvote the question. However, I realized that it might really be asking for a cross-database solution.
Assuming you are looking for a database independent way to do this, the only way I can think of uses correlated subqueries (or non-equijoins). Here is an example:
select distinct t.personid, val, rank
from (select t.*,
(select COUNT(distinct val) from t t2 where t2.personid = t.personid and t2.val >= t.val
) as rank
from t
) t
where rank in (1, 2, 3)
However, each database that you mention (and I note, Hadoop is not a database) has a better way of doing this. Unfortunately, none of them are standard SQL.
Here is an example of it working in SQL Server:
with t as (
select 1 as personid, 5 as val union all
select 1 as personid, 6 as val union all
select 1 as personid, 6 as val union all
select 1 as personid, 7 as val union all
select 1 as personid, 8 as val
)
select distinct t.personid, val, rank
from (select t.*,
(select COUNT(distinct val) from t t2 where t2.personid = t.personid and t2.val >= t.val
) as rank
from t
) t
where rank in (1, 2, 3);
Using GROUP_CONCAT and FIND_IN_SET you can do that. Check SQLFIDDLE.
SELECT *
FROM tbl t
WHERE FIND_IN_SET(t.value,(SELECT
SUBSTRING_INDEX(GROUP_CONCAT(t1.value ORDER BY VALUE DESC),',',3)
FROM tbl t1
WHERE t1.name = t.name
GROUP BY t1.name)) > 0
ORDER BY t.name,t.value desc
If your result set is not too large, you can write a stored procedure (or an anonymous PL/SQL block) that iterates over the result set and finds the biggest three values with a simple comparison algorithm.
Try this -
CREATE TABLE #list ([name] [varchar](100) NOT NULL, [value] [int] NOT NULL)
INSERT INTO #list VALUES ('John', 20), ('John', 7), ('John', 7), ('John', 7), ('John', 4);
WITH cte
AS (
SELECT NAME
,value
,ROW_NUMBER() OVER (
PARTITION BY NAME ORDER BY (value) DESC
) RN
FROM #list
)
SELECT NAME
,value
FROM cte
WHERE RN < 4
ORDER BY value DESC
This works for MS SQL. It should be workable in any other SQL dialect that can assign row numbers in a GROUP BY or OVER clause (or equivalent).
if object_id('tempdb..#Data') is not null drop table #Data;
GO
create table #data (name varchar(25), value integer);
GO
set nocount on;
insert into #data values ('John', 20);
insert into #data values ('John', 7);
insert into #data values ('John', 7);
insert into #data values ('John', 7);
insert into #data values ('John', 5);
insert into #data values ('Jack', 5);
insert into #data values ('Jane', 30);
insert into #data values ('Jane', 21);
insert into #data values ('John', 5);
insert into #data values ('John', -1);
insert into #data values ('John', -1);
insert into #data values ('Jane', 18);
set nocount off;
GO
with D as (
SELECT
name
,Value
,row_number() over (partition by name order by value desc) rn
From
#Data
)
SELECT Name, Value
FROM D
WHERE RN <= 3
order by Name, Value Desc
Name Value
Jack 5
Jane 30
Jane 21
Jane 18
John 20
John 7
John 7

Delete records which are considered duplicates based on same value on a column and keep the newest

I would like to delete records which are considered duplicates based on them having the same value in a certain column and keep one which is considered the newest based on InsertedDate in my example below. I would like a solution which doesn't use a cursor but is set based. Goal: delete all duplicates and keep the newest.
The ddl below creates some duplicates. The records which need to be deleted are: John1 & John2 because they have the same ID as John3 and John3 is the newest record.
Also record John5 needs to be deleted because there's another record with ID = 3 and is newer (John6).
Create table dbo.TestTable (ID int, InsertedDate DateTime, Name varchar(50))
Insert into dbo.TestTable Select 1, '07/01/2009', 'John1'
Insert into dbo.TestTable Select 1, '07/02/2009', 'John2'
Insert into dbo.TestTable Select 1, '07/03/2009', 'John3'
Insert into dbo.TestTable Select 2, '07/03/2009', 'John4'
Insert into dbo.TestTable Select 3, '07/05/2009', 'John5'
Insert into dbo.TestTable Select 3, '07/06/2009', 'John6'
Just as an academic exercise:
with cte as (
select *, row_number() over (partition by ID order by InsertedDate desc) as rn
from TestTable)
delete from cte
where rn <> 1;
Most of the time the solution proposed by Sam performs much better.
This works:
delete t
from TestTable t
left join
(
select id, InsertedDate = max(InsertedDate) from TestTable
group by id
) as sub on sub.id = t.id and sub.InsertedDate = t.InsertedDate
where sub.id is null
If you have to deal with ties it gets a tiny bit trickier.
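One possible way to handle ties is to break them on some unique column. The sketch below does this in SQLite (via Python's sqlite3 module, as a stand-in for SQL Server) using SQLite's built-in rowid; in SQL Server you would need a real unique column for the same purpose. The John7 row is a hypothetical addition that ties with John6, and the dates are stored as ISO text so they compare chronologically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TestTable (ID INTEGER, InsertedDate TEXT, Name TEXT)")
conn.executemany(
    "INSERT INTO TestTable VALUES (?, ?, ?)",
    [
        (1, "2009-07-01", "John1"),
        (1, "2009-07-02", "John2"),
        (1, "2009-07-03", "John3"),
        (2, "2009-07-03", "John4"),
        (3, "2009-07-05", "John5"),
        (3, "2009-07-06", "John6"),
        (3, "2009-07-06", "John7"),  # hypothetical row tying with John6
    ],
)

# Delete a row when another row with the same ID is newer, or equally new
# but inserted later (higher rowid). Exactly one row per ID survives.
conn.execute("""
    DELETE FROM TestTable
    WHERE EXISTS (
        SELECT 1
        FROM TestTable AS newer
        WHERE newer.ID = TestTable.ID
          AND (newer.InsertedDate > TestTable.InsertedDate
               OR (newer.InsertedDate = TestTable.InsertedDate
                   AND newer.rowid > TestTable.rowid))
    )
""")

survivors = conn.execute(
    "SELECT ID, Name FROM TestTable ORDER BY ID"
).fetchall()
print(survivors)
```

Without the rowid comparison, both John6 and John7 would survive, since neither is strictly newer than the other.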