Delete oldest entries with two duplicate columns from a table - SQL

SELECT column1, column2, count(*) as duplicate
FROM table
GROUP BY column1, column2 HAVING count(*)> 1 ;
ID column1 column2 timestamp
abc 123 1 2020-02-03 19:36:27
xyz 123 1 2020-02-02 15:36:27
The combination of column1 and column2 is supposed to be unique, but the table contains duplicate entries.
The query above returns the combinations that have duplicates. We want to delete the oldest entries, based on another column, timestamp.

One method is to delete every row for which a newer row with the same column1 and column2 exists:
delete from t
where t.timestamp < (select max(t2.timestamp)
from t t2
where t2.column1 = t.column1 and t2.column2 = t.column2
);

-- DELETE <alias> FROM ... JOIN form, as accepted by SQL Server and MySQL 8+
DELETE a
FROM table a
JOIN (
SELECT id, row_number() OVER (PARTITION BY column1, column2 ORDER BY timestamp DESC) AS rownum
FROM table ) b
ON a.id = b.id
WHERE b.rownum > 1
You can use the row_number function to get an ordered ranking of the results. Partitioning by column1 and column2 restarts the row number at each change in those values. Ordering by your timestamp descending starts the count with the newest record, so deleting every row where rownum > 1 keeps only the newest record. If you wanted to keep the newest 3 instead, you would simply change rownum > 1 to rownum > 3.
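To try the row_number idea without a database server, here is a minimal sketch using Python's built-in sqlite3 module. It assumes an SQLite build with window functions (3.25 or later). The table name t, the column name ts and the extra row "def" are made up for the illustration. SQLite has no DELETE ... JOIN, so the ranked ids are fed through an IN subquery instead.

```python
import sqlite3

def keep_newest(rows):
    """Delete all but the newest row per (column1, column2); return surviving ids."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE t (id TEXT PRIMARY KEY, column1 INTEGER, column2 INTEGER, ts TEXT)"
    )
    con.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", rows)
    # Rank rows inside each (column1, column2) group, newest first,
    # then delete everything ranked below the first position.
    con.execute("""
        DELETE FROM t
        WHERE id IN (
            SELECT id FROM (
                SELECT id,
                       ROW_NUMBER() OVER (
                           PARTITION BY column1, column2 ORDER BY ts DESC
                       ) AS rownum
                FROM t
            )
            WHERE rownum > 1
        )
    """)
    survivors = [r[0] for r in con.execute("SELECT id FROM t ORDER BY id")]
    con.close()
    return survivors

rows = [
    ("abc", 123, 1, "2020-02-03 19:36:27"),
    ("xyz", 123, 1, "2020-02-02 15:36:27"),
    ("def", 456, 2, "2020-01-01 00:00:00"),  # hypothetical row without a duplicate
]
print(keep_newest(rows))  # ['abc', 'def']
```

The older duplicate (xyz) is removed, while the row without a duplicate is left alone.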

Related

SQL Inner join with sum and null value

The table below is an extract of a larger set of data
In my scenario, Column 2 is NULL for the "parent" record (Column 1 = AB1 and Column 2 is NULL). The two "child" records have AB1 in Column 2, which matches the AB1 in Column 1 of the parent. I want to sum the values in Column 3 for all records whose Column 2 holds the same identifier (AB1); up to this point the sum is 29, and I can get it with a SUM and a GROUP BY on AB1. My issue arises when I need to add the value of 10 from Column 3 of the parent record, where Column 2 is NULL and Column 1 is AB1. The common identifier is AB1, but for the parent record it is in Column 1 instead of Column 2. I need SQL that returns a total sum of 39.
Edit:
Thanks for the prompt responses, and my apologies: I think my question was not clear enough. I am using MS SQL Server Management Studio.
The goal of the query is to sum the amounts in Column 3, grouping the records that share the same identifier in Column 2 (AB1), then to find that same identifier in Column 1 (AB1) and add that record's value to the total sum.
The query below does the grouping by Column 2 correctly: for example, if I have 10 records with the identifier AB1, it returns one row with the sum of the amounts in Column 3. The issue is that I also need to add to that sum the record where the identifier AB1 is in Column 1.
select t1.Column1 , round(sum (t1.Column3),2) as Total from table t1, table t2 where
and t1.Column2 = t2. Column1 group by t1. Column2
Basically, this table stores transactions. The initial "parent" transaction has its identifier in Column 1 (AB1), and all "child" transactions linked to that parent carry the same identifier (AB1) in Column 2. Column 1 is a unique identifier and does not repeat. For a "parent" transaction Column 2 is NULL, but the parent's identifier (AB1) can be repeated many times in Column 2, depending on how many "child" transactions are linked to it.
Oracle
The WITH clause is here just to generate sample data and, as such, is not part of the answer.
I don't know what the expected result is, but the totals can be calculated using Union All (without an Inner Join):
WITH
tbl AS
(
Select 'AB1' "COL_1", Null "COL_2", 10 "COL_3" From Dual Union All
Select 'CD2' "COL_1", 'AB1' "COL_2", 15 "COL_3" From Dual Union All
Select 'EF3' "COL_1", 'AB1' "COL_2", 14 "COL_3" From Dual
)
SELECT
ID, Sum(TOTAL) "TOTAL"
FROM
(
SELECT COL_1 "ID", Sum(COL_3) "TOTAL" FROM tbl GROUP BY COL_1 UNION ALL
SELECT COL_2 "ID", Sum(COL_3) "TOTAL" FROM tbl GROUP BY COL_2
)
WHERE ID Is Not Null
GROUP BY ID
ORDER BY ID
--
-- R e s u l t
--
ID TOTAL
--- ----------
AB1 39
CD2 15
EF3 14
That is a Sum() with Group By aggregation, but the Sum() analytic function combined with the DISTINCT keyword gives the same result.
SELECT DISTINCT
ID, Sum(TOTAL) OVER(PARTITION BY ID ORDER BY ID) "TOTAL"
FROM
(
SELECT COL_1 "ID", Sum(COL_3) "TOTAL" FROM tbl GROUP BY COL_1 UNION ALL
SELECT COL_2 "ID", Sum(COL_3) "TOTAL" FROM tbl GROUP BY COL_2
)
WHERE ID Is Not Null
--
-- R e s u l t
--
ID TOTAL
--- ----------
AB1 39
CD2 15
EF3 14
And if you need an Inner Join, the answer is below. Note that the result contains only the IDs that actually have children; that is because of the Inner Join.
SELECT
t1.COL_1 "ID",
Max(t1.COL_3) + Sum(t2.COL_3) "TOTAL"
FROM
tbl t1
INNER JOIN
tbl t2 ON (t2.COL_2 = t1.COL_1)
GROUP BY t1.COL_1
ORDER BY t1.COL_1
--
-- R e s u l t
--
ID TOTAL
--- ----------
AB1 39
select sum(Column3)
from TheTable
where 'AB1' in (Column1, Column2);
will sum the value of Column3 for the parent (Column1 = 'AB1') and the children (Column2 = 'AB1').
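A quick way to check this is a small sqlite3 sketch in Python. The three AB1 rows mirror the sample data used in the Oracle answer above; the GH4 row and the helper name family_total are made up for the illustration.

```python
import sqlite3

def family_total(rows, parent_id):
    """Sum Column3 for the parent row and its direct children."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE TheTable (Column1 TEXT, Column2 TEXT, Column3 INTEGER)")
    con.executemany("INSERT INTO TheTable VALUES (?, ?, ?)", rows)
    # The IN list matches the identifier in either column, so the parent
    # (Column1) and its children (Column2) are picked up by one predicate.
    total = con.execute(
        "SELECT SUM(Column3) FROM TheTable WHERE ? IN (Column1, Column2)",
        (parent_id,),
    ).fetchone()[0]
    con.close()
    return total

rows = [
    ("AB1", None, 10),   # parent
    ("CD2", "AB1", 15),  # child
    ("EF3", "AB1", 14),  # child
    ("GH4", None, 7),    # hypothetical unrelated parent
]
print(family_total(rows, "AB1"))  # 39
```

A NULL in Column2 does not break the IN test for the parent row, because the match on Column1 already makes the predicate true.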
If the parent-child hierarchy has more than two levels, and you want to sum Column3 for grandchildren, great-grandchildren, and so on, you can use a hierarchical query (also known as a recursive query). The exact syntax depends on your database; this is for PostgreSQL:
with recursive Hier(Column1, Column2, Column3) as
(
select Column1, Column2, Column3
from TheTable
where Column1 = 'AB1'
union all
select t.Column1, t.Column2, t.Column3
from TheTable t
join Hier h on t.Column2 = h.Column1
)
select sum(Column3)
from Hier;
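The same recursive query can be tried in SQLite, which also accepts WITH RECURSIVE. This is only a sketch: the grandchild row GH4, the unrelated row ZZ9 and the helper name subtree_total are made up for the illustration.

```python
import sqlite3

def subtree_total(rows, root):
    """Sum Column3 over the root row and all of its descendants."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE TheTable (Column1 TEXT, Column2 TEXT, Column3 INTEGER)")
    con.executemany("INSERT INTO TheTable VALUES (?, ?, ?)", rows)
    total = con.execute("""
        WITH RECURSIVE Hier(Column1, Column2, Column3) AS (
            -- anchor: the root row itself
            SELECT Column1, Column2, Column3 FROM TheTable WHERE Column1 = ?
            UNION ALL
            -- recursive step: rows whose parent is already in Hier
            SELECT t.Column1, t.Column2, t.Column3
            FROM TheTable t
            JOIN Hier h ON t.Column2 = h.Column1
        )
        SELECT SUM(Column3) FROM Hier
    """, (root,)).fetchone()[0]
    con.close()
    return total

rows = [
    ("AB1", None, 10),
    ("CD2", "AB1", 15),
    ("EF3", "AB1", 14),
    ("GH4", "CD2", 5),    # hypothetical grandchild of AB1
    ("ZZ9", None, 100),   # hypothetical unrelated parent
]
print(subtree_total(rows, "AB1"))  # 44
```

Starting from CD2 instead returns 20, the child plus its own child.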
You can split the data into two sets and then union them together. From there it is a simple sum with a group by.
To do this we are simply saying:
take Column1 as the Parent if Column2 IS null
take Column2 as the Parent if Column2 IS not null
Use Union All rather than Union, so that two rows with the same Parent and the same Column3 value are both kept.
Select Column1 as Parent, Column3
from TheTable
where Column2 IS null
Union All
Select Column2 as Parent, Column3
from TheTable
where Column2 IS not null
From there you can use this as a CTE:
WITH data AS
(
Select Column1 as Parent, Column3
from TheTable
where Column2 IS null
Union All -- plain Union would drop rows that share Parent and Column3
Select Column2 as Parent, Column3
from TheTable
where Column2 IS not null)
Select Parent, Sum(Column3) as SumColumn3
from data
Group by Parent
The result will be:
Parent SumColumn3
AB1 39
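One thing worth checking with this split-and-sum approach is the choice between UNION and UNION ALL. The sketch below (Python with sqlite3) runs the CTE both ways on made-up data in which two children of AB1 happen to carry the same amount; the data and the helper name totals are assumptions for the illustration.

```python
import sqlite3

def totals(rows, set_operator):
    """Run the split-and-sum CTE with the given set operator."""
    if set_operator not in ("UNION", "UNION ALL"):
        raise ValueError("set_operator must be 'UNION' or 'UNION ALL'")
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE TheTable (Column1 TEXT, Column2 TEXT, Column3 INTEGER)")
    con.executemany("INSERT INTO TheTable VALUES (?, ?, ?)", rows)
    result = con.execute(f"""
        WITH data AS (
            SELECT Column1 AS Parent, Column3 FROM TheTable WHERE Column2 IS NULL
            {set_operator}
            SELECT Column2 AS Parent, Column3 FROM TheTable WHERE Column2 IS NOT NULL
        )
        SELECT Parent, SUM(Column3) AS SumColumn3
        FROM data
        GROUP BY Parent
        ORDER BY Parent
    """).fetchall()
    con.close()
    return result

rows = [
    ("AB1", None, 10),
    ("CD2", "AB1", 15),
    ("EF3", "AB1", 15),  # hypothetical: same amount as CD2
]
print(totals(rows, "UNION ALL"))  # [('AB1', 40)]
print(totals(rows, "UNION"))      # [('AB1', 25)]
```

With UNION the two identical (AB1, 15) rows collapse into one before the sum, so one child's amount is silently lost.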

Recursive Lag Column Calculation in SQL

I am trying to write a procedure that inserts calculated table data into another table.
The problem I have is that I need each row's calculated column to be influenced by the result of the previous row's calculated column. I tried to lag the calculation itself but this does not work!
Such as:
(Max is a function I created that returns the highest of two values)
Id Product Model Column1 Column2
1 A 1 5 =MAX(Column1*2, Lag(Column2))
2 A 2 2 =MAX(Column1*2, Lag(Column2))
3 B 1 3 =MAX(Column1*2, Lag(Column2))
If I try the above in SQL:
SELECT
Column1,
MyMAX(Column1 * 2, LAG(Column2, 1, 0) OVER (PARTITION BY Product ORDER BY Model ASC)) As Column2
FROM Source
...it says column2 is unknown.
Output I get if I LAG the Column2 calculation:
Select Column1, MyMAX(Column1 * 2, LAG(Column1 * 2, 1, 0) OVER (PARTITION BY Product ORDER BY Model ASC)) As Column2
Id Column1 Column2
1 5 10
2 2 10
3 3 6
Why 6 on row 3? Because 3*2 > 2*2.
Output that I want:
Id Column1 Column2
1 5 10
2 2 10
3 3 10
Why 10 on row 3? Because previous result of 10 > 3*2
The problem is I can't lag the result of Column2 - I can only lag other columns or calculations of them!
Is there a technique of achieving this with LAG or must I use Recursive CTE? I read that LAG succeeds CTE so I assumed it would be possible. If not, what would this 'CTE' look like?
Edit: Or alternatively - what else could I do to resolve this calculation?
Edit
In hindsight, this problem is a running partitioned maximum over Column1 * 2. It can be done as simply as
SELECT Id, Column1, Model, Product,
MAX(Column1 * 2) OVER (Partition BY Model, Product Order BY ID ASC) AS Column2
FROM Table1;
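Here is a minimal sqlite3 sketch of the running-maximum idea, run against the three sample rows from the question. It assumes no partitioning at all, since that is what reproduces the asker's desired output of 10 on every row; add a PARTITION BY if the maximum should reset per group. The helper name running_max is made up.

```python
import sqlite3

def running_max(rows):
    """Return (Id, Column1, running maximum of Column1 * 2) ordered by Id."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE Source (Id INTEGER, Product TEXT, Model INTEGER, Column1 INTEGER)"
    )
    con.executemany("INSERT INTO Source VALUES (?, ?, ?, ?)", rows)
    # With ORDER BY and no explicit frame, the window runs from the first
    # row up to the current row, which makes MAX() a running maximum.
    result = con.execute("""
        SELECT Id, Column1,
               MAX(Column1 * 2) OVER (ORDER BY Id) AS Column2
        FROM Source
        ORDER BY Id
    """).fetchall()
    con.close()
    return result

sample = [(1, "A", 1, 5), (2, "A", 2, 2), (3, "B", 1, 3)]
print(running_max(sample))  # [(1, 5, 10), (2, 2, 10), (3, 3, 10)]
```

No LAG and no recursion are needed, because "the larger of my own value and the previous result" is exactly a maximum over all rows seen so far.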
Original Answer
Here's a way to do this with a recursive CTE, without LAG at all, by joining on incrementing row numbers. I haven't assumed that your Id is contiguous, so I have added an additional ROW_NUMBER(). You haven't mentioned any partitioning, so I haven't applied any. The query simply starts at the first row, and then projects the greater of the current Column1 * 2 and the preceding Column2.
WITH IncrementingRowNums AS
(
SELECT Id, Column1, Column1 * 2 AS Column2,
ROW_NUMBER() OVER (Order BY ID ASC) AS RowNum
FROM Table1
),
lagged AS
(
SELECT Id, Column1, Column2, RowNum
FROM IncrementingRowNums
WHERE RowNum = 1
UNION ALL
SELECT i.Id, i.Column1,
CASE WHEN (i.Column2 > l.Column2)
THEN i.Column2
ELSE l.Column2
END,
i.RowNum
FROM IncrementingRowNums i
INNER JOIN lagged l
ON i.RowNum = l.RowNum + 1
)
SELECT Id, Column1, Column2
FROM lagged;
Edit, Re Partitions
Partitioning is much the same: carry the Model and Product columns through, partition by them in the row numbering (so that it starts back at 1 each time the Product or Model changes), and include them in the CTE JOIN condition and in the final ordering.
WITH IncrementingRowNums AS
(
SELECT Id, Column1, Column1 * 2 AS Column2, Model, Product,
ROW_NUMBER() OVER (Partition BY Model, Product Order BY ID ASC) AS RowNum
FROM Table1
),
lagged AS
(
SELECT Id, Column1, Column2, Model, Product, RowNum
FROM IncrementingRowNums
WHERE RowNum = 1
UNION ALL
SELECT i.Id, i.Column1,
CASE WHEN (i.Column2 > l.Column2)
THEN i.Column2
ELSE l.Column2
END,
i.Model, i.Product,
i.RowNum
FROM IncrementingRowNums i
INNER JOIN lagged l
ON i.RowNum = l.RowNum + 1
AND i.Model = l.Model AND i.Product = l.Product
)
SELECT Id, Column1, Column2, Model, Product
FROM lagged
ORDER BY Model, Product, Id;

SQL Server : how to find ids where columns have different values

I have a table like this:
Column1 Column2
---------------
1 1
1 2
1 3
1 4
2 1
2 1
2 1
2 1
In column1 there are 2 different ids, and column2 holds the values for each id from column1.
How can I get the ids from column1 for which the values in column2 are not all the same? In this instance the output should be 1, because id 1 has different values in column2, whereas id 2 has all 1's in column2.
Just use group by and having:
select column1
from table t
group by column1
having min(column2) <> max(column2);
Note: you could also use count(distinct), but that has more overhead than min() and max().
Similar logic can be used if the second column could be NULL. That doesn't appear in the sample data so it doesn't seem worth including it in the logic unless the OP specifically says this is a possibility.
Try like this:
select Column1
from yourTable
group by Column1
having count(DISTINCT column2) > 1;
I would think something like this should do the job:
SELECT t.column1 FROM table t
GROUP BY t.column1
HAVING COUNT(DISTINCT t.column2) > 1
This approach will handle the case where a null is an acceptable value in column2.
select column1
from
(
select distinct column1, column2
from yourTable
) t
group by column1
having count(*) > 1
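To see how the three variants differ once NULLs are involved, here is a small sqlite3 sketch in Python. Ids 1 and 2 come from the question; id 3, which has one value and one NULL, is made up for the illustration, as is the helper name differing_ids.

```python
import sqlite3

def differing_ids(rows):
    """Run the three HAVING variants and return the ids each one reports."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE yourTable (column1 INTEGER, column2 INTEGER)")
    con.executemany("INSERT INTO yourTable VALUES (?, ?)", rows)
    queries = {
        "min_max": """
            SELECT column1 FROM yourTable
            GROUP BY column1
            HAVING MIN(column2) <> MAX(column2)
            ORDER BY column1""",
        "count_distinct": """
            SELECT column1 FROM yourTable
            GROUP BY column1
            HAVING COUNT(DISTINCT column2) > 1
            ORDER BY column1""",
        "distinct_pairs": """
            SELECT column1
            FROM (SELECT DISTINCT column1, column2 FROM yourTable) t
            GROUP BY column1
            HAVING COUNT(*) > 1
            ORDER BY column1""",
    }
    result = {name: [r[0] for r in con.execute(sql)] for name, sql in queries.items()}
    con.close()
    return result

rows = [(1, 1), (1, 2), (1, 3), (1, 4),
        (2, 1), (2, 1), (2, 1), (2, 1),
        (3, 1), (3, None)]  # hypothetical id with one value and one NULL
print(differing_ids(rows))
# {'min_max': [1], 'count_distinct': [1], 'distinct_pairs': [1, 3]}
```

MIN, MAX and COUNT(DISTINCT ...) all ignore NULLs, so only the DISTINCT subquery treats NULL as a value of its own and reports id 3.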

Sampling unique set of records in Oracle table

I have an Oracle table from which I need to select a given percentage of records for each unique combination of a given set of columns.
For example,
SELECT distinct column1, column2, Column3 from TableX;
gives me all the unique combinations in that table. I need a given percentage of the rows from each such combination. Currently I am using the following query to accomplish this, which is lengthy and slow.
SELECT *
FROM tableX Sample ( 3 )
WHERE Column1 = 'value1' and
Column2 = 'value2' and
Column3 = 'value3'
UNION
SELECT *
FROM tableX Sample ( 3 )
WHERE Column1 = 'value1' and
Column2 = 'value2' and
Column3 = 'value4'
UNION
…
…
SELECT *
FROM tableX Sample ( 3 )
WHERE Column1 = 'valueP' and
Column2 = 'valueQ' and
Column3 = 'valueR'
Here each combination of values is one of the unique combinations in that table (obtained from the first query).
How can I shorten the query and make it faster?
Here is one approach:
select t.*
from (select t.*,
row_number() over (partition by column1, column2, column3 order by dbms_random.value
) as seqnum,
count(*) over (partition by column1, column2, column3) as totcnt
from tablex t
) t
where seqnum / totcnt <= 0.10 -- or whatever your threshold is
It uses row_number() to assign a sequential number to rows in each group, in a random order. The where clause chooses the proportion that you want.
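The per-group sampling can be sketched outside Oracle too. The example below uses Python's sqlite3 module, with SQLite's random() standing in for dbms_random.value. The data, the id column and the helper name sample_per_group are made up. The threshold is written as an integer comparison because SQLite would otherwise do integer division on seqnum / totcnt.

```python
import sqlite3

def sample_per_group(rows, percent):
    """Pick `percent` % of the rows of every (column1, column2, column3) group."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE tablex (id INTEGER, column1 TEXT, column2 TEXT, column3 TEXT)"
    )
    con.executemany("INSERT INTO tablex VALUES (?, ?, ?, ?)", rows)
    picked = con.execute("""
        SELECT column1, column2, column3, COUNT(*)
        FROM (
            SELECT t.*,
                   -- random position of the row inside its group
                   ROW_NUMBER() OVER (
                       PARTITION BY column1, column2, column3 ORDER BY random()
                   ) AS seqnum,
                   -- size of the group
                   COUNT(*) OVER (
                       PARTITION BY column1, column2, column3
                   ) AS totcnt
            FROM tablex t
        )
        WHERE seqnum * 100 <= totcnt * ?   -- same as seqnum / totcnt <= percent / 100
        GROUP BY column1, column2, column3
    """, (percent,)).fetchall()
    con.close()
    return {(c1, c2, c3): n for c1, c2, c3, n in picked}

rows = [(i, "a", "x", "p") for i in range(50)] + \
       [(50 + i, "b", "y", "q") for i in range(20)]
print(sample_per_group(rows, 10))  # {('a', 'x', 'p'): 5, ('b', 'y', 'q'): 2}
```

Which rows are picked changes from run to run, but the number picked per group is fixed by the group size and the percentage.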

How do I commit/execute deletion row by row in SQL (MSSQL)

I have a simple but long table (a few million rows).
The table contains many paired rows, of which I need to delete one.
The row data is not distinct!
There are also single rows (which have no paired row).
The pairs are defined by crossed values in two columns, concatenated with a 3rd column.
I would like to keep only one row for each data identifier.
Therefore, I need myTable to shrink immediately whenever the condition is met.
I tried:
myIndexColumn = Column1 + Column2 + Column3
myReversedIndexColumn = Column2 + Column1 + Column3
CREATE NONCLUSTERED INDEX myIndex1 ON myDB.dbo.myTable (
myIndexColumn ASC
)
CREATE NONCLUSTERED INDEX myIndex2 ON myDB.dbo.myTable (
myReversedIndexColumn ASC
)
DELETE FROM myDB.dbo.myTable
WHERE myIndexColumn in (SELECT myReversedIndexColumn FROM myDB.dbo.myTable)
The problem is that both rows of each pair are deleted, instead of one row being left.
Obviously, that is because the DELETE commits its changes only after the entire statement has run.
If I could persuade MS SQL 2008 R2 Express edition to commit the DELETE as soon as the condition is met, the SELECT clause would return a shorter list for each row tested.
How do I do that?
To avoid deleting the rows where column1 = column2:
DELETE FROM myDB.dbo.myTable
WHERE myIndexColumn in (SELECT myReversedIndexColumn FROM myDB.dbo.myTable)
AND column1 <> column2
To remove the duplicates where column1 = column2:
;with cte as
(
select *,
row_number() over (
partition by Column1 + Column2 + Column3
order by (SELECT 1)
) rn
from yourtable
where column1 = column2
)
delete cte where rn > 1
The CTE can be used to delete all duplicates too
;with cte as
(
select *,
row_number() over (
partition by
CASE WHEN Column1 > Column2 THEN Column2 ELSE Column1 END +
CASE WHEN Column1 > Column2 THEN Column1 ELSE Column2 END +
Column3
order by (SELECT 1)
) rn
from yourtable
)
delete cte where rn > 1
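The normalised-pair idea can be tried in SQLite as a rough sketch. SQLite cannot delete through a CTE the way SQL Server can, so this version collects the rowids to delete in a subquery instead. It also partitions by the three values separately rather than by their concatenation, which avoids 'AB' + 'C' colliding with 'A' + 'BC'. The table contents and the helper name dedupe_pairs are made up.

```python
import sqlite3

def dedupe_pairs(rows):
    """Keep one row per unordered (Column1, Column2) pair plus Column3."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE myTable (Column1 TEXT, Column2 TEXT, Column3 TEXT)")
    con.executemany("INSERT INTO myTable VALUES (?, ?, ?)", rows)
    con.execute("""
        DELETE FROM myTable
        WHERE rowid IN (
            SELECT rid FROM (
                SELECT rowid AS rid,
                       ROW_NUMBER() OVER (
                           PARTITION BY
                               -- smaller value first, larger second, so that
                               -- (A, B) and (B, A) land in the same partition
                               CASE WHEN Column1 > Column2 THEN Column2 ELSE Column1 END,
                               CASE WHEN Column1 > Column2 THEN Column1 ELSE Column2 END,
                               Column3
                           ORDER BY rowid
                       ) AS rn
                FROM myTable
            )
            WHERE rn > 1
        )
    """)
    remaining = con.execute(
        "SELECT Column1, Column2, Column3 FROM myTable ORDER BY rowid"
    ).fetchall()
    con.close()
    return remaining

rows = [
    ("A", "B", "X"),
    ("B", "A", "X"),  # reversed pair of the first row
    ("C", "C", "Y"),
    ("C", "C", "Y"),  # exact duplicate
    ("D", "E", "Z"),  # single row, no pair
]
print(dedupe_pairs(rows))  # [('A', 'B', 'X'), ('C', 'C', 'Y'), ('D', 'E', 'Z')]
```

Everything happens in a single statement, so there is no need to commit row by row: the ranking is computed once, and only rows ranked second or later in their group are removed.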