Get the latest entry of each duplicate on Postgres - sql

I am using Postgres v12, and I have a table with duplicated rows. I need to retrieve only the last entry for each duplicate, ignoring entries which have no duplicate.
This table has the following columns:
id (unique)
request_id (where to find the duplicates)
created_at (where to see which entry is the latest)
id  request_id  created_at
1   a           2020.06.06
2   a           2020.05.05
3   b           2020.04.04
4   b           2020.03.03
5   c           2020.04.04
6   c           2020.03.03
7   d           2020.03.03
The query should retrieve the rows with id 1, 3 and 5, since they are the latest entry (by created_at) of each duplicate. The row with id 7 has no duplicate, so it is ignored.
I have tried the solution proposed here: https://www.geeksengine.com/article/get-single-record-from-duplicates.html but those queries do not work on Postgres v12. I get the error "column must appear in the group by clause", which is another problem discussed here: must appear in the GROUP BY clause or be used in an aggregate function
I have been searching for a solution for days to this problem, but I am not an SQL expert. I would appreciate any help very much.

Here is one way using window functions:
select * from (
    select *
         , row_number() over (partition by request_id order by created_at desc) as rn
         , count(*) over (partition by request_id) as cn
    from tablename
) t where cn > 1 and rn = 1
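To try the query against the sample rows from the question, here is a minimal sketch that runs it through Python's built-in sqlite3 module. SQLite is only a stand-in for Postgres here (an assumption made so the example is self-contained), and it needs SQLite 3.25 or newer for window functions. The dates are stored as text in the question's format, which happens to sort correctly.

```python
import sqlite3

# In-memory SQLite stands in for Postgres (assumption for this demo only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tablename (id INTEGER PRIMARY KEY, request_id TEXT, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO tablename VALUES (?, ?, ?)",
    [
        (1, "a", "2020.06.06"),
        (2, "a", "2020.05.05"),
        (3, "b", "2020.04.04"),
        (4, "b", "2020.03.03"),
        (5, "c", "2020.04.04"),
        (6, "c", "2020.03.03"),
        (7, "d", "2020.03.03"),
    ],
)

query = """
SELECT id, request_id, created_at
FROM (
    SELECT *,
           row_number() OVER (PARTITION BY request_id ORDER BY created_at DESC) AS rn,
           count(*)     OVER (PARTITION BY request_id) AS cn
    FROM tablename
) t
WHERE cn > 1   -- keep only request_ids that occur more than once
  AND rn = 1   -- and of those, only the newest row
ORDER BY id
"""
latest_duplicates = conn.execute(query).fetchall()
print(latest_duplicates)
```

rn picks the newest row per request_id, while cn filters out request_ids that appear only once, which is why the row with id 7 is dropped.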

Related

Identify duplicate fields in a table

I'm trying to identify specific fields that are duplicated in a table in a mariadb-10.4.20 Joomla database. I would like to identify all rows that have a specific field duplicated, then ultimately be able to remove those duplicates, leaving just the one with the highest ID.
This table contains the IDs, titles and aliases for the articles in a joomla website. The script I'm building (in perl) will use this information to print the primary title alias and create redirects for any others.
I was previously using "group by" but it appears there's been a change recently in how it's used, and now it doesn't work properly. I don't understand the new format, and I'm not even sure it was previously working fully.
Here's a basic query that shows there are two of the same articles with different IDs:
MariaDB [mydb]> select id,alias,title from db1_content where title = "article title";
+--------+--------------+--------------+
| id     | alias        | title        |
+--------+--------------+--------------+
| 299959 | unique-title | Unique Title |
| 300026 | unique-title | Unique Title |
+--------+--------------+--------------+
Here's an attempt at trying to use "group by" but it returns no results.
MariaDB [mydb]> select id,title,count(title) from db1_content group by id,title having count(title) > 1;
Empty set (0.230 sec)
If I run the same query without the id field, then it does return a list of all titles that are duplicated, along with the number of occurrences of each title.
That's not exactly what I want, though. I need it to print the id, alias and title fields so I can reference them in my perl script to subsequently perform another query to ultimately delete the duplicates and create links to be used in RewriteRules.
What am I doing wrong?
Since MariaDB cannot currently delete from a CTE, you could use a derived table to generate row numbers for each title ordered by id descending, JOIN that to your main table and then delete any row which has a row number greater than 1. For example:
DELETE db1 FROM db1_content db1
JOIN (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY title ORDER BY id DESC) AS rn
    FROM db1_content
) dbr ON db1.id = dbr.id
WHERE dbr.rn > 1
If you don't want to actually delete the records using SQL, you can just select the ones that need to be deleted by using a CTE:
WITH rns AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY title ORDER BY id DESC) AS rn
    FROM db1_content
)
SELECT id, alias, title
FROM rns
WHERE rn > 1
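As a rough way to check which rows the row-number filter picks, the sketch below replays the idea in SQLite through Python's sqlite3 module. This is an assumption-laden stand-in for MariaDB: SQLite has no multi-table DELETE ... JOIN, so the delete is written as DELETE ... WHERE id IN (subquery) instead. The third row (id 300100) is hypothetical, added only to show that non-duplicates survive.

```python
import sqlite3

# SQLite (3.25+) stands in for MariaDB; assumption for this demo only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE db1_content (id INTEGER PRIMARY KEY, alias TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO db1_content VALUES (?, ?, ?)",
    [
        (299959, "unique-title", "Unique Title"),
        (300026, "unique-title", "Unique Title"),
        (300100, "other-title", "Other Title"),  # hypothetical non-duplicate row
    ],
)

# Number the rows of each title, highest id first, so rn = 1 is the keeper.
ranked = """
SELECT id, alias, title,
       ROW_NUMBER() OVER (PARTITION BY title ORDER BY id DESC) AS rn
FROM db1_content
"""

# Rows that would be deleted (everything except the highest id per title).
to_delete = conn.execute(
    f"SELECT id, alias, title FROM ({ranked}) WHERE rn > 1"
).fetchall()

# SQLite-friendly form of the delete; not the MariaDB JOIN syntax.
conn.execute(f"DELETE FROM db1_content WHERE id IN (SELECT id FROM ({ranked}) WHERE rn > 1)")
remaining = [row[0] for row in conn.execute("SELECT id FROM db1_content ORDER BY id")]

print(to_delete)
print(remaining)
```

Selecting the rn > 1 rows before deleting gives the id, alias and title values a script would need for building redirects.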

Delete Occurrence of Unique ID from SQL Server Table [duplicate]

I have a SQL Server Table where I have a Column that contains a unique ID. I also have another column called Level, every time a new occurrence of a unique ID enters the table the Level will increase.
ID Level DateTime Symbol Exchange
XRP/USD_FTXSPOT 1 2022-01-04 17:03:24.027 XRP/USD FTX
XRP/USD_FTXSPOT 2 2022-01-04 17:03:31.147 XRP/USD FTX
Therefore it would look something like this. The more recent the row entered the higher the level gets basically.
What I am trying to do is be able to say once a new row is entered for a unique ID, remove all previous occurrences based on its Level. Meaning, remove all rows where the level is < the greatest.
SELECT * FROM
Thursday_crypto JOIN
(
SELECT ID, MAX(Level) Level
FROM Thursday_crypto
GROUP BY ID
) max_date ON Thursday_crypto.ID = max_date.ID AND Thursday_crypto.Level = max_date.Level
I have this which basically returns the rows where each unique ID has its highest Level. But I am wondering how I can alter this to then remove all rows not within this selection. I want to reduce the size of the table, so I guess my main goal is to remove all rows not within this selection.
You can calculate a row_number based on the ID and the level.
Then remove the dups based on the row_number.
WITH CTE_DATA AS (
SELECT [RowNum] = ROW_NUMBER() OVER (PARTITION BY ID ORDER BY Level DESC)
FROM Thursday_crypto
)
DELETE
FROM CTE_DATA
WHERE RowNum > 1
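A different way to express the same cleanup, closer to the MAX(Level) query in the question, is a correlated subquery that deletes every row whose Level is below the maximum for its ID. The sketch below tries it in SQLite via Python's sqlite3 module, as a stand-in for SQL Server (an assumption for the demo). The BTC/USD_FTXSPOT row and its values are hypothetical.

```python
import sqlite3

# SQLite stands in for SQL Server; assumption for this demo only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Thursday_crypto "
    "(ID TEXT, Level INTEGER, DateTime TEXT, Symbol TEXT, Exchange TEXT)"
)
conn.executemany(
    "INSERT INTO Thursday_crypto VALUES (?, ?, ?, ?, ?)",
    [
        ("XRP/USD_FTXSPOT", 1, "2022-01-04 17:03:24.027", "XRP/USD", "FTX"),
        ("XRP/USD_FTXSPOT", 2, "2022-01-04 17:03:31.147", "XRP/USD", "FTX"),
        # hypothetical row for an ID that occurs only once
        ("BTC/USD_FTXSPOT", 1, "2022-01-04 17:03:40.000", "BTC/USD", "FTX"),
    ],
)

# Delete every row that is not at the highest Level for its ID.
conn.execute(
    """
    DELETE FROM Thursday_crypto
    WHERE Level < (
        SELECT MAX(t2.Level)
        FROM Thursday_crypto AS t2
        WHERE t2.ID = Thursday_crypto.ID
    )
    """
)
remaining = conn.execute(
    "SELECT ID, Level FROM Thursday_crypto ORDER BY ID"
).fetchall()
print(remaining)
```

Unlike the ROW_NUMBER version, this keeps all rows that tie for the highest Level of an ID, which only matters if Level is not unique per ID.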

Group BY Statement error to get unique records

I am new to SQL Server (I used to work with MySQL) and am trying to get records from a table using GROUP BY.
The table structure is given below:
SELECT S1.ID,S1.Template_ID,S1.Assigned_By,S1.Assignees,S1.Active FROM "Schedule" AS S1;
Output:
ID Template_ID Assigned_By Assignees Active
2 25 1 3 1
3 25 5 6 1
6 26 5 6 1
I need to get the values of all columns using the Group By statement below
SELECT Template_ID FROM "Schedule" WHERE "Assignees" IN(6, 3) GROUP BY "Template_ID";
Output:
Template_ID
25
26
I tried the following code to fetch the table using Group By, but it's fetching all the rows.
SELECT S1.ID,S1.Template_ID,S1.Assigned_By,S1.Assignees,S1.Active FROM "Schedule" AS S1 INNER JOIN(SELECT Template_ID FROM "Schedule" WHERE "Assignees" IN(6, 3) GROUP BY "Template_ID") AS S2 ON S2.Template_ID=S1.Template_ID
My Output Should be like,
ID Template_ID Assigned_By Assignees Active
2 25 1 3 1
6 26 5 6 1
I was wondering whether I can get ID of the column as well? I use the ID for editing the records in the web.
The query doesn't work as expected in MySQL either, except by accident.
Selecting nonaggregated columns that are not in the GROUP BY clause isn't part of the SQL standard, and MySQL 5.7 and later reject it unless the default ONLY_FULL_GROUP_BY mode is disabled.
In earlier versions the result is non-deterministic. As the MySQL documentation puts it:
The server is free to choose any value from each group, so unless they are the same, the values chosen are nondeterministic. Furthermore, the selection of values from each group cannot be influenced by adding an ORDER BY clause.
This means there was no way to know which rows would be returned by this query:
SELECT S1.ID,S1.Template_ID,S1.Assigned_By,S1.Assignees,S1.Active
FROM "Schedule" AS S1
GROUP BY Template_ID;
To get deterministic results you need a way to rank rows with the ranking functions introduced in MySQL 8, like ROW_NUMBER(). These have been available in SQL Server since SQL Server 2005. The syntax is the same for both databases:
WITH ranked AS
(
    SELECT
        ID, Template_ID, Assigned_By, Assignees, Active,
        ROW_NUMBER() OVER (PARTITION BY Template_ID ORDER BY ID) AS RN
    FROM Schedule
    WHERE Assignees IN (6, 3)
)
SELECT ID, Template_ID, Assigned_By, Assignees, Active
FROM ranked
WHERE RN = 1
PARTITION BY Template_ID splits the result rows based on their Template_ID value into separate partitions. Within that partition, the rows are ordered based on the ORDER BY clause. Finally, ROW_NUMBER calculates a row number for each ordered partition row.
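To see the partitioning and numbering at work on the three rows from the question, here is a minimal sketch using Python's sqlite3 module. SQLite (3.25 or newer) is an assumed stand-in for SQL Server and MySQL 8; the query is the ROW_NUMBER() pattern described above, written out in full.

```python
import sqlite3

# SQLite stands in for SQL Server / MySQL 8; assumption for this demo only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Schedule "
    "(ID INTEGER, Template_ID INTEGER, Assigned_By INTEGER, Assignees INTEGER, Active INTEGER)"
)
conn.executemany(
    "INSERT INTO Schedule VALUES (?, ?, ?, ?, ?)",
    [
        (2, 25, 1, 3, 1),
        (3, 25, 5, 6, 1),
        (6, 26, 5, 6, 1),
    ],
)

query = """
WITH ranked AS (
    SELECT ID, Template_ID, Assigned_By, Assignees, Active,
           -- numbering restarts at 1 for every Template_ID, lowest ID first
           ROW_NUMBER() OVER (PARTITION BY Template_ID ORDER BY ID) AS RN
    FROM Schedule
    WHERE Assignees IN (6, 3)
)
SELECT ID, Template_ID, Assigned_By, Assignees, Active
FROM ranked
WHERE RN = 1
ORDER BY ID
"""
first_per_template = conn.execute(query).fetchall()
print(first_per_template)
```

Changing ORDER BY ID to ORDER BY ID DESC inside the OVER clause would return the row with the highest ID per template instead.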

How to distinguish rows in a database table on the basis of two or more columns while returning all columns in sql server

I want to select distinct rows based on the values of two or more columns of a table while still returning all of the table's columns.
Ex: I have this table
DB Table
I want the result filtered on the basis of type and Number only. In the table above, type and Number are the same for the first and second rows, so the duplicate should be suppressed in the result:
txn item Discrip Category type Number Mode
60 2 Loyalty L 6174 XXXXXXX1390 0
60 4 Visa C 1600 XXXXXXXXXXXX4108 1
I have tried a subquery, but without success. Please suggest what to try.
Thanks
You can do what you want with row_number():
select t.*
from (select t.*,
row_number() over (partition by type, number order by item) as seqnum
from t
) t
where seqnum = 1;

Getting the min() of a count(*) column

I have a table called Vehicle_Location containing the columns (and more):
ID NUMBER(10)
SEQUENCE_NUMBER NUMBER(10)
TIME DATE
and I'm trying to get the min/max/avg number of records per day per id.
So far, I have
select id, to_char(time), count(*) as c
from vehicle_location
group by id, to_char(time), min having id = 16
which gives me:
ID TO_CHAR(TIME) COUNT(*)
---------------------- ------------- ----------------------
16 11-05-31 159
16 11-05-23 127
16 11-06-03 56
So I'd like to get the min/max/avg of the count(*) column. I am using Oracle as my RDBMS.
I don't have an Oracle installation to test on, but you should be able to just wrap the aggregates around your SELECT as a subquery/derived table/inline view.
So it would be (UNTESTED!!):
SELECT
    AVG(s.c)
  , MIN(s.c)
  , MAX(s.c)
  , s.ID
FROM
    --Note this is just your query (Oracle does not accept AS before a table alias, so the alias is a bare s)
    (select id, to_char(time), count(*) as c from vehicle_location group by id, to_char(time), min having id = 16) s
GROUP BY s.ID
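The aggregate-over-a-subquery pattern can be tried without an Oracle instance. The sketch below uses Python's sqlite3 module as an assumed stand-in, with a handful of hypothetical rows (not the asker's data) and SQLite's date() in place of Oracle's to_char(). The inner query counts records per id per day, and the outer query takes MIN/MAX/AVG over those counts.

```python
import sqlite3

# SQLite stands in for Oracle; the rows below are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vehicle_location (id INTEGER, sequence_number INTEGER, time TEXT)"
)
conn.executemany(
    "INSERT INTO vehicle_location VALUES (?, ?, ?)",
    [
        (16, 1, "2011-05-31 08:00:00"),
        (16, 2, "2011-05-31 09:00:00"),
        (16, 3, "2011-05-31 10:00:00"),
        (16, 4, "2011-05-23 08:00:00"),
        (16, 5, "2011-06-03 08:00:00"),
        (16, 6, "2011-06-03 09:00:00"),
        (17, 1, "2011-05-31 08:00:00"),
    ],
)

query = """
SELECT s.id, MIN(s.c), MAX(s.c), AVG(s.c)
FROM (
    -- inner query: number of records per id per day
    SELECT id, date(time) AS day, COUNT(*) AS c
    FROM vehicle_location
    WHERE id = 16
    GROUP BY id, date(time)
) s
GROUP BY s.id
"""
stats = conn.execute(query).fetchone()
print(stats)
```

The per-day counts for id 16 in this sample are 3, 1 and 2, so the outer aggregates operate on three rows rather than on the raw records.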
Here's some reading on it:
http://www.devshed.com/c/a/Oracle/Inserting-SubQueries-in-SELECT-Statements-in-Oracle/3/
EDIT: Though normally it is a bad idea to select both the MIN and MAX in a single query.
EDIT2: The min/max issue is related to how some RDBMS (including oracle) handle aggregations on indexed columns. It may not affect this particular query but the premise is that it's easy to use the index to find either the MIN or the MAX but not both at the same time because any index may not be used effectively.
Here's some reading on it:
http://momendba.blogspot.com/2008/07/min-and-max-functions-in-single-query.html