PostgreSQL: delete rows that are duplicates on some (not all) columns

Table columns: col_pk, col1, col2, col3, col4, col_date_updated
Some rows in this table have duplicate values in col2 and col3.
For each (col2, col3) pair I want to keep only the row with the latest (max) col_date_updated and delete the rest.
E.g.:
col_pk, col1, col2, col3, col4, col_date_updated
1, A, hello, now, 200.00, 2017-12-12 15:09:44.437546
2, B, hello, now, 490.00, 2017-12-12 15:09:42.437065
3, C, hi, now, 300.00, 2017-12-12 15:09:41.436617
4, D, hello, now, 250.00, 2017-12-12 15:09:45.436617
5, E, hi, now, 250.00, 2017-12-12 10:09:41.436617
Expected Result:
col_pk, col1, col2, col3, col4, col_date_updated
3, C, hi, now, 300.00, 2017-12-12 15:09:41.436617
4, D, hello, now, 250.00, 2017-12-12 15:09:45.436617

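Check this.
SELECT DISTINCT ON (col2, col3) t.*
FROM yourtable t
ORDER BY col2, col3, col_date_updated DESC;
DISTINCT ON (col2, col3) keeps one row per (col2, col3) pair, and the ORDER BY decides which one. PostgreSQL requires the ORDER BY to start with the DISTINCT ON expressions; col_date_updated DESC then puts the latest row first in each group.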
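To make the "first row per group after sorting" idea concrete, here is a minimal sketch in plain Python rather than SQL. It only mimics the behaviour on the question's sample rows; the names `rows`, `group_key` and `kept_pks` are made up for the illustration.

```python
from itertools import groupby

# Sample rows from the question:
# (col_pk, col1, col2, col3, col4, col_date_updated)
rows = [
    (1, "A", "hello", "now", 200.00, "2017-12-12 15:09:44.437546"),
    (2, "B", "hello", "now", 490.00, "2017-12-12 15:09:42.437065"),
    (3, "C", "hi", "now", 300.00, "2017-12-12 15:09:41.436617"),
    (4, "D", "hello", "now", 250.00, "2017-12-12 15:09:45.436617"),
    (5, "E", "hi", "now", 250.00, "2017-12-12 10:09:41.436617"),
]


def group_key(row):
    """The columns that must be unique: (col2, col3)."""
    return (row[2], row[3])


# Sort newest first, then do a stable sort by group, so every group
# ends up contiguous with its newest row in front.
ordered = sorted(rows, key=lambda r: r[5], reverse=True)
ordered = sorted(ordered, key=group_key)

# Keep only the first row of each group.
latest = [next(group) for _, group in groupby(ordered, key=group_key)]
kept_pks = sorted(r[0] for r in latest)
print(kept_pks)
```

The ISO-style timestamps compare correctly as strings, which is why no date parsing is needed here.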

If you just want to select to get your expected output, then ROW_NUMBER comes in handy:
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY col2, col3
ORDER BY col_date_updated DESC) rn
FROM yourTable
)
SELECT col_pk, col1, col2, col3, col4, col_date_updated
FROM cte
WHERE rn = 1;
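If you instead want to delete the other records, the same CTE can be reused, but it has to be written in front of the DELETE (WITH cte AS (...) DELETE ...), because a CTE only exists for the statement it is attached to: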
DELETE FROM yourTable WHERE col_pk IN (SELECT col_pk FROM cte WHERE rn > 1);
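A runnable sketch of the CTE-plus-DELETE form, assuming Python's sqlite3 module (SQLite 3.25 or later for window functions) as a stand-in for PostgreSQL. The table holds only the columns the query needs, filled with the question's sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE yourTable (col_pk INTEGER PRIMARY KEY, col2 TEXT,"
    " col3 TEXT, col_date_updated TEXT)"
)
conn.executemany(
    "INSERT INTO yourTable VALUES (?, ?, ?, ?)",
    [
        (1, "hello", "now", "2017-12-12 15:09:44.437546"),
        (2, "hello", "now", "2017-12-12 15:09:42.437065"),
        (3, "hi", "now", "2017-12-12 15:09:41.436617"),
        (4, "hello", "now", "2017-12-12 15:09:45.436617"),
        (5, "hi", "now", "2017-12-12 10:09:41.436617"),
    ],
)

# The CTE is part of the DELETE statement itself.
conn.execute(
    """
    WITH cte AS (
        SELECT col_pk,
               ROW_NUMBER() OVER (PARTITION BY col2, col3
                                  ORDER BY col_date_updated DESC) AS rn
        FROM yourTable
    )
    DELETE FROM yourTable
    WHERE col_pk IN (SELECT col_pk FROM cte WHERE rn > 1)
    """
)

remaining = [r[0] for r in conn.execute(
    "SELECT col_pk FROM yourTable ORDER BY col_pk")]
print(remaining)
```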

You could try something like this.
SELECT t.*
FROM yourtable t
WHERE col_date_updated IN (SELECT MAX (col_date_updated)
FROM yourtable i
WHERE t.col2 = i.col2 AND t.col3 = i.col3);
If you wish to delete the other records, you can use this:
DELETE
FROM yourtable t
WHERE col_date_updated NOT IN (SELECT MAX (col_date_updated)
FROM yourtable i
WHERE t.col2 = i.col2 AND t.col3 = i.col3);

If you want to suppress all but the most recent rows for any {col2,col3}:
SELECT *
FROM thetable zt
WHERE NOT EXISTS (
-- If a record exists with the same col2,col3,
-- but a more recent date than zt.col_date_updated
-- then zt.* cannot be the most recent one
SELECT *
FROM thetable nx
WHERE nx.col2 = zt.col2 -- same value
AND nx.col3 = zt.col3 -- same value
AND nx.col_date_updated > zt.col_date_updated -- more recent
);
If you want to physically delete all but the most recent rows for the same {col2,col3}:
DELETE
FROM thetable zt
WHERE EXISTS (
-- If a record exists with the same col2,col3,
-- but a more recent date than zt.col_date_updated
-- then zt.* cannot be the most recent one
-- and we can delete zt.
SELECT *
FROM thetable nx
WHERE nx.col2 = zt.col2 -- same value
AND nx.col3 = zt.col3 -- same value
AND nx.col_date_updated > zt.col_date_updated -- more recent
);
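The SELECT with NOT EXISTS and the DELETE with EXISTS are complements: the rows the first one returns are exactly the rows the second one leaves behind. A small sketch to check that, assuming Python's sqlite3 as a stand-in database and the question's sample data. The outer table is referenced by its full name instead of the alias zt, only to stay compatible with older SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE thetable (col_pk INTEGER PRIMARY KEY, col2 TEXT,"
    " col3 TEXT, col_date_updated TEXT)"
)
conn.executemany(
    "INSERT INTO thetable VALUES (?, ?, ?, ?)",
    [
        (1, "hello", "now", "2017-12-12 15:09:44.437546"),
        (2, "hello", "now", "2017-12-12 15:09:42.437065"),
        (3, "hi", "now", "2017-12-12 15:09:41.436617"),
        (4, "hello", "now", "2017-12-12 15:09:45.436617"),
        (5, "hi", "now", "2017-12-12 10:09:41.436617"),
    ],
)

# True for a row when a more recent row with the same col2, col3 exists.
newer_exists = """
    EXISTS (SELECT 1
            FROM thetable nx
            WHERE nx.col2 = thetable.col2
              AND nx.col3 = thetable.col3
              AND nx.col_date_updated > thetable.col_date_updated)
"""

kept_by_select = [r[0] for r in conn.execute(
    f"SELECT col_pk FROM thetable WHERE NOT {newer_exists} ORDER BY col_pk")]

conn.execute(f"DELETE FROM thetable WHERE {newer_exists}")
left_after_delete = [r[0] for r in conn.execute(
    "SELECT col_pk FROM thetable ORDER BY col_pk")]
print(kept_by_select, left_after_delete)
```

If two rows in a group share the same latest timestamp, neither has a strictly newer partner, so both survive.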

This should be a fast way. The SELECT lists the rows that would be removed (everything except the latest row per col2, col3):
SELECT * FROM tablename WHERE col_pk IN
(SELECT col_pk FROM
(SELECT col_pk, ROW_NUMBER() OVER (PARTITION BY col2, col3 ORDER BY col_date_updated DESC) AS rnum
FROM tablename) t
WHERE t.rnum > 1);
If you want to delete them:
DELETE FROM tablename WHERE col_pk IN
(SELECT col_pk FROM
(SELECT col_pk, ROW_NUMBER() OVER (PARTITION BY col2, col3 ORDER BY col_date_updated DESC) AS rnum
FROM tablename) t
WHERE t.rnum > 1);
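Which row ROW_NUMBER() numbers as 1 depends on the sort direction inside OVER (...), and that decides which row survives the delete. A sketch comparing both directions, assuming Python's sqlite3 (SQLite 3.25+) as a stand-in database and the question's sample data. The helper `flagged_rows` is made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tablename (col_pk INTEGER PRIMARY KEY, col2 TEXT,"
    " col3 TEXT, col_date_updated TEXT)"
)
conn.executemany(
    "INSERT INTO tablename VALUES (?, ?, ?, ?)",
    [
        (1, "hello", "now", "2017-12-12 15:09:44.437546"),
        (2, "hello", "now", "2017-12-12 15:09:42.437065"),
        (3, "hi", "now", "2017-12-12 15:09:41.436617"),
        (4, "hello", "now", "2017-12-12 15:09:45.436617"),
        (5, "hi", "now", "2017-12-12 10:09:41.436617"),
    ],
)


def flagged_rows(direction):
    """Return the col_pk values with rnum > 1 for a sort direction."""
    if direction not in ("ASC", "DESC"):
        raise ValueError("direction must be ASC or DESC")
    sql = f"""
        SELECT col_pk FROM (
            SELECT col_pk,
                   ROW_NUMBER() OVER (PARTITION BY col2, col3
                                      ORDER BY col_date_updated {direction}) AS rnum
            FROM tablename
        ) t
        WHERE t.rnum > 1
        ORDER BY col_pk
    """
    return [r[0] for r in conn.execute(sql)]


flagged_asc = flagged_rows("ASC")    # oldest row per group is numbered 1
flagged_desc = flagged_rows("DESC")  # latest row per group is numbered 1
print(flagged_asc, flagged_desc)
```

Only the descending order leaves rows 3 and 4 unflagged, which is the expected result from the question.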

Related

SQL with having statement now want complete rows

Here is a mock table
MYTABLE ROWS
PKEY 1,2,3,4,5,6
COL1 a,b,b,c,d,d
COL2 55,44,33,88,22,33
I want to know which rows have duplicated COL1 values:
select col1, count(*)
from MYTABLE
group by col1
having count(*) > 1
This returns :
b,2
d,2
I now want all the rows that contain b and d. Normally I would use a WHERE ... IN statement, but with the count column in the result I am not sure what kind of statement to use.
Maybe you need:
select * from MYTABLE
where col1 in
(
select col1
from MYTABLE
group by col1
having count(*) > 1
)
Use a CTE and a windowed aggregate:
WITH CTE AS(
SELECT Pkey,
Col1,
Col2,
COUNT(1) OVER (PARTITION BY Col1) AS C
FROM dbo.YourTable)
SELECT PKey,
Col1,
Col2
FROM CTE
WHERE C > 1;
There are lots of ways to solve this; here's another:
select * from MYTABLE
join
(
select col1, count(*) as cnt
from MYTABLE
group by col1
having count(*) > 1
) s on s.col1 = mytable.col1;
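The three approaches in this thread (IN with a grouped subquery, a windowed COUNT, and a join to the grouped subquery) should return the same rows. A quick check on the mock table, assuming Python's sqlite3 (SQLite 3.25+) as a stand-in database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (pkey INTEGER PRIMARY KEY, col1 TEXT, col2 INTEGER)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [(1, "a", 55), (2, "b", 44), (3, "b", 33), (4, "c", 88), (5, "d", 22), (6, "d", 33)],
)

# 1) WHERE ... IN with the GROUP BY / HAVING query as a subquery.
via_in = [r[0] for r in conn.execute("""
    SELECT pkey FROM mytable
    WHERE col1 IN (SELECT col1 FROM mytable GROUP BY col1 HAVING COUNT(*) > 1)
    ORDER BY pkey
""")]

# 2) Windowed COUNT, filtered in an outer query.
via_window = [r[0] for r in conn.execute("""
    SELECT pkey FROM (
        SELECT pkey, COUNT(*) OVER (PARTITION BY col1) AS c FROM mytable
    ) counted
    WHERE c > 1
    ORDER BY pkey
""")]

# 3) Join back to the grouped duplicates.
via_join = [r[0] for r in conn.execute("""
    SELECT m.pkey
    FROM mytable m
    JOIN (SELECT col1, COUNT(*) AS cnt
          FROM mytable GROUP BY col1 HAVING COUNT(*) > 1) s
      ON s.col1 = m.col1
    ORDER BY m.pkey
""")]

print(via_in, via_window, via_join)
```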

Redshift sample from table based on count of another table

I have TableA of, say, 3000 rows (could be any number < 10000). I need to create TableX with 10000 rows. So I need to select random 10000 - (number of rows in TableA) from TableB (and add in TableA as well) to create TableX. Any ideas please?
Something like this (which obviously wouldn't work):
Create table TableX as
select * from TableA
union
select * from TableB limit (10000 - count(*) from TableA);
You could use union all and window functions. You did not list the table columns, so I assumed col1 and col2 (table1 and table2 below stand for your TableA and TableB):
insert into tableX (col1, col2)
select col1, col2 from table1
union all
select t2.col1, t2.col2
from (select t2.*, row_number() over(order by random()) as rn from table2 t2) t2
inner join (select count(*) cnt from table1) t1 on t2.rn <= 10000 - t1.cnt
The first query in union all selects all rows from table1. The second query assigns random row numbers to rows in table2, and then selects as many rows as needed to reach a total of 10000.
Actually it might be simpler to select all rows from both tables, then order by and limit in the outer query:
insert into tableX (col1, col2)
select col1, col2
from (
select col1, col2, 't1' which from table1
union all
select col1, col2, 't2' from table2
) t
order by which, random()
limit 10000
Another option keeps the target row count as a parameter in a CTE:
with inparms as (
select 10000 as target_rows
), acount as (
select count(*) as acount, inparms.target_rows
from tablea
cross join inparms
), btag as (
select b.*, 'tableb' as tabsource,
row_number() over (order by random()) as rnum
from tableb b
)
select a.*, 'tablea', row_number() over (order by 1) as rnum
from tablea a
union all
select b.*
from btag b
join acount a on b.rnum <= a.target_rows - a.acount
;
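The "order by source table, then randomly, then limit" idea from this thread can be tried out on a small scale. This is only a sketch: it assumes Python's sqlite3 instead of Redshift, a target of 10 rows instead of 10000, and two made-up columns (col1, col2), with col2 used here to tag which table a row came from.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablea (col1 INTEGER, col2 TEXT)")
conn.execute("CREATE TABLE tableb (col1 INTEGER, col2 TEXT)")
conn.execute("CREATE TABLE tablex (col1 INTEGER, col2 TEXT)")
conn.executemany("INSERT INTO tablea VALUES (?, ?)", [(i, "a") for i in range(3)])
conn.executemany("INSERT INTO tableb VALUES (?, ?)", [(i, "b") for i in range(100, 120)])

TARGET_ROWS = 10  # scaled-down stand-in for the 10000 in the question

# All tablea rows sort first (src = 0); tableb rows follow in random
# order, and LIMIT cuts the combined list at the target size.
conn.execute(
    """
    INSERT INTO tablex (col1, col2)
    SELECT col1, col2
    FROM (
        SELECT col1, col2, 0 AS src FROM tablea
        UNION ALL
        SELECT col1, col2, 1 AS src FROM tableb
    ) t
    ORDER BY src, random()
    LIMIT ?
    """,
    (TARGET_ROWS,),
)

total = conn.execute("SELECT COUNT(*) FROM tablex").fetchone()[0]
from_a = conn.execute("SELECT COUNT(*) FROM tablex WHERE col2 = 'a'").fetchone()[0]
from_b = conn.execute("SELECT COUNT(*) FROM tablex WHERE col2 = 'b'").fetchone()[0]
print(total, from_a, from_b)
```

Which seven tableb rows are picked changes from run to run; only the counts are fixed.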

How to select non-distinct rows with a distinct on multiple columns

I have found many answers on selecting non-distinct rows where they group by a single column, for example, e-mail. However, there seems to have been an issue in our system where we are getting some duplicate data whereby everything is the same except the identity column.
SELECT DISTINCT
COLUMN1,
COLUMN2,
COLUMN3,
...
COLUMN14
FROM TABLE1
How can I get the non-distinct rows from the query above? Ideally it would include the identity column as currently that is obviously missing from the distinct query.
select COLUMN1,COLUMN2,COLUMN3
from TABLE_NAME
group by COLUMN1,COLUMN2,COLUMN3
having COUNT(*) > 1
With _cte (col1, col2, col3, cnt) As
(
Select col1, col2, col3, Count(*)
From mySchema.myTable
Group By Col1, Col2, Col3
Having Count(*) > 1
)
Select t.*
From _Cte As c
Join mySchema.myTable As t
On c.col1 = t.col1
And c.col2 = t.col2
And c.col3 = t.col3
SELECT * FROM
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY COL 1, COL 2, .... COL N ORDER BY COL M
) RN
FROM TABLE_NAME
)T
WHERE T.RN>1
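The answers above return different things: the GROUP BY / HAVING query gives the duplicated value combinations, the join back to the table gives every copy with its identity column, and ROW_NUMBER() with RN > 1 gives only the extra copies. A sketch of the last two, assuming Python's sqlite3 (SQLite 3.25+) as a stand-in database, three data columns instead of fourteen, and an identity column named id:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table1 (id INTEGER PRIMARY KEY, column1 TEXT,"
    " column2 TEXT, column3 TEXT)"
)
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?)",
    [
        (1, "x", "y", "z"),
        (2, "x", "y", "z"),
        (3, "p", "q", "r"),
        (4, "x", "y", "z"),
        (5, "m", "n", "o"),
    ],
)

# Every row whose column values occur more than once, with its id.
all_copies = [r[0] for r in conn.execute("""
    SELECT t.id
    FROM table1 t
    JOIN (SELECT column1, column2, column3
          FROM table1
          GROUP BY column1, column2, column3
          HAVING COUNT(*) > 1) d
      ON d.column1 = t.column1
     AND d.column2 = t.column2
     AND d.column3 = t.column3
    ORDER BY t.id
""")]

# Only the second and later copies (the first copy has rn = 1).
extra_copies = [r[0] for r in conn.execute("""
    SELECT id FROM (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY column1, column2, column3
                                  ORDER BY id) AS rn
        FROM table1
    ) numbered
    WHERE rn > 1
    ORDER BY id
""")]

print(all_copies, extra_copies)
```

The second form is the one to use when the goal is to delete duplicates while keeping one row of each.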

Get the values from the Max date after union

This query works, but I'd like to see if there's a more optimized or shorter way to get the same result. I'd like to retrieve all the data with the maximum date from the union of 3 tables: TABLE_0501, TABLE_0502, TABLE_0503. Whichever table has the latest bill_date, I want to retrieve the rows for that bill_date. It will always have more than one row returned for the same PID and bill_date.
SELECT T1.PID, T1.BILL_DATE, T2.COL3, T2.COL4, T2.COL5
FROM
(
SELECT T.PID, MAX(T.BILL_DATE) AS BILL_DATE
FROM
(
SELECT DISTINCT PID, BILL_DATE, COL3, COL4, COL5
FROM TABLE_0501
GROUP BY 1,2,3,4,5
UNION ALL
SELECT DISTINCT PID, BILL_DATE, COL3, COL4, COL5
FROM TABLE_0502
GROUP BY 1,2,3,4,5
UNION ALL
SELECT DISTINCT PID, BILL_DATE, COL3, COL4, COL5
FROM TABLE_0503
GROUP BY 1,2,3,4,5
) T
GROUP BY 1
) T1
INNER JOIN
( SELECT DISTINCT PID, BILL_DATE, COL3, COL4, COL5
FROM TABLE_0501
GROUP BY 1,2,3,4,5
UNION ALL
SELECT DISTINCT PID, BILL_DATE, COL3, COL4, COL5
FROM TABLE_0502
GROUP BY 1,2,3,4,5
UNION ALL
SELECT DISTINCT PID, BILL_DATE, COL3, COL4, COL5
FROM TABLE_0503
GROUP BY 1,2,3,4,5
) T2
ON T1.PID = T2.PID
AND T1.BILL_DATE = T2.BILL_DATE
Yes, a QUALIFY clause comes in handy here.
SELECT * FROM
(
SELECT PID, BILL_DATE, COL3, COL4, COL5
FROM TABLE_0501
GROUP BY 1,2,3,4,5
UNION ALL
SELECT PID, BILL_DATE, COL3, COL4, COL5
FROM TABLE_0502
GROUP BY 1,2,3,4,5
UNION ALL
SELECT PID, BILL_DATE, COL3, COL4, COL5
FROM TABLE_0503
GROUP BY 1,2,3,4,5
) T
QUALIFY RANK() OVER (PARTITION BY PID ORDER BY BILL_DATE DESC) = 1;
Within each PID group, RANK() orders the rows from the latest BILL_DATE to the earliest. QUALIFY ... = 1 then keeps only the rows with the latest BILL_DATE; because RANK() gives tied rows the same rank, every row sharing that date is returned.
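QUALIFY is not available in every database. On engines without it, the same filter can be written as a subquery that computes the rank and an outer WHERE on it. A sketch of that swapped-in form, assuming Python's sqlite3 (SQLite 3.25+), made-up sample rows, and only one extra column (COL3) to keep it short:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for name in ("table_0501", "table_0502", "table_0503"):
    conn.execute(f"CREATE TABLE {name} (pid INTEGER, bill_date TEXT, col3 TEXT)")

conn.executemany("INSERT INTO table_0501 VALUES (?, ?, ?)",
                 [(1, "2020-01-01", "a"), (2, "2020-03-01", "b")])
conn.executemany("INSERT INTO table_0502 VALUES (?, ?, ?)",
                 [(1, "2020-02-01", "c"), (1, "2020-02-01", "d")])
conn.executemany("INSERT INTO table_0503 VALUES (?, ?, ?)",
                 [(2, "2020-01-15", "e")])

# RANK() gives every row with the latest bill_date of a pid the rank 1,
# so ties on the date are all kept by the outer filter.
latest_rows = conn.execute("""
    SELECT pid, bill_date, col3
    FROM (
        SELECT pid, bill_date, col3,
               RANK() OVER (PARTITION BY pid ORDER BY bill_date DESC) AS rnk
        FROM (
            SELECT pid, bill_date, col3 FROM table_0501
            UNION ALL
            SELECT pid, bill_date, col3 FROM table_0502
            UNION ALL
            SELECT pid, bill_date, col3 FROM table_0503
        ) t
    ) ranked
    WHERE rnk = 1
    ORDER BY pid, col3
""").fetchall()

print(latest_rows)
```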

SQL query to simulate distinct

SELECT DISTINCT col1, col2 FROM table t ORDER BY col1;
This gives me the distinct combinations of col1 and col2. Is there an alternative way of writing the Oracle SQL query to get the unique combinations of col1 and col2 without using the keyword DISTINCT?
Use the UNIQUE keyword which is a synonym for DISTINCT:
SELECT UNIQUE col1, col2 FROM table t ORDER BY col1;
I don't see why you would want to, but you could do:
SELECT col1, col2 FROM table_t GROUP BY col1, col2 ORDER BY col1
Another - yet overly complex and somewhat useless - solution:
select *
from (
select col1,
col2,
row_number() over (partition by col1, col2 order by col1, col2) as rn
from the_table
)
where rn = 1
order by col1
select col1, col2
from table
group by col1, col2
order by col1
or a less elegant way:
select col1,col2 from table
UNION
select col1,col2 from table
order by col1;
or an even less elegant way:
select a.col1, a.col2
from (select col1, col2 from table
UNION
select NULL, NULL from dual) a
where a.col1 is not null
order by a.col1
Yet another ...
select
col1,
col2
from
table t1
where
not exists (select *
from table t2
where t2.col1 = t1.col1 and
t2.col2 = t1.col2 and
t2.rowid > t1.rowid)
order by
col1;
Variations on the UNION solution by @aF.:
INTERSECT:
SELECT col1, col2 FROM tableX
INTERSECT
SELECT col1, col2 FROM tableX
ORDER BY col1;
MINUS:
SELECT col1, col2 FROM tableX
MINUS
SELECT col1, col2 FROM tableX WHERE 0 = 1
ORDER BY col1;
MINUS (2nd version; it returns one row fewer than the other versions if there is a (NULL, NULL) group):
SELECT col1, col2 FROM tableX
MINUS
SELECT NULL, NULL FROM dual
ORDER BY col1;
Another ...
select col1,
col2
from (
select col1,
col2,
rowid,
min(rowid) over (partition by col1, col2) min_rowid
from table)
where rowid = min_rowid
order by col1;
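To check that several of these rewrites really return the same rows as DISTINCT, here is a small sketch. It assumes Python's sqlite3 (SQLite 3.25+) rather than Oracle, so MINUS is spelled EXCEPT and SQLite's implicit rowid plays the role of Oracle's ROWID. The table contents are made up and contain no NULLs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablex (col1 INTEGER, col2 TEXT)")
conn.executemany(
    "INSERT INTO tablex VALUES (?, ?)",
    [(1, "a"), (1, "a"), (2, "b"), (1, "c"), (2, "b")],
)

queries = {
    "distinct": "SELECT DISTINCT col1, col2 FROM tablex ORDER BY col1, col2",
    "group_by": "SELECT col1, col2 FROM tablex GROUP BY col1, col2 ORDER BY col1, col2",
    "union": "SELECT col1, col2 FROM tablex UNION "
             "SELECT col1, col2 FROM tablex ORDER BY col1, col2",
    "intersect": "SELECT col1, col2 FROM tablex INTERSECT "
                 "SELECT col1, col2 FROM tablex ORDER BY col1, col2",
    # EXCEPT is SQLite's spelling of Oracle's MINUS.
    "except": "SELECT col1, col2 FROM tablex EXCEPT "
              "SELECT col1, col2 FROM tablex WHERE 0 = 1 ORDER BY col1, col2",
    "min_rowid": "SELECT col1, col2 FROM ("
                 " SELECT col1, col2, rowid AS rid,"
                 " MIN(rowid) OVER (PARTITION BY col1, col2) AS min_rid"
                 " FROM tablex) x "
                 "WHERE rid = min_rid ORDER BY col1, col2",
}

results = {name: conn.execute(sql).fetchall() for name, sql in queries.items()}
for name, rows in results.items():
    print(name, rows)
```

The set operators remove duplicates as a side effect, which is why combining the table with itself (or subtracting nothing) behaves like DISTINCT.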