Delete almost duplicates in a table - sql

I have a table where rows 1 and 2 are near-duplicates: they hold the same values but in reverse order. How can I delete these duplicates?
+--------+-------+
| COL_1 | COL_# |
+--------+-------+
| a1 | b1 |
| b1 | a1 | <- same as the 1st row but in reversed order, needs to be removed
| a2 | b2 |
| a3 | b3 |
| b3 | a3 | <- also a duplicate of the row above; one of these 2 rows needs to be removed
+--------+-------+
expected result:
+--------+-------+
| COL_1 | COL_# |
+--------+-------+
| b1 | a1 |
| a2 | b2 |
| a3 | b3 |
+--------+-------+
or
+--------+-------+
| COL_1 | COL_# |
+--------+-------+
| a1 | b1 |
| a2 | b2 |
| a3 | b3 |
+--------+-------+

Do you mean something like this:
DELETE FROM example e
WHERE e.COL_1 > e.COL_#
AND EXISTS (
select 1
from example f
where f.COL_1 = e.COL_# and f.COL_# = e.COL_1 )

If col_1 and col_# cannot be null, you can use LEAST and GREATEST to keep one row per combination:
delete from tbl
where rowid not in
(
select min(rowid) -- one rowid per combination to keep
from tbl
group by least(col_1, col_#), greatest(col_1, col_#)
);
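The LEAST/GREATEST idea can be tried out locally with a minimal sketch. It uses Python's sqlite3 module rather than Oracle, so it makes some assumptions: SQLite's two-argument scalar min() and max() stand in for LEAST and GREATEST, and the names tbl, col_1, col_2 and dedupe_reversed_pairs are hypothetical.

```python
import sqlite3

def dedupe_reversed_pairs(rows):
    """Keep one row per unordered (col_1, col_2) pair and return what remains."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table; NOT NULL mirrors the "cannot be null" precondition.
    conn.execute("CREATE TABLE tbl (col_1 TEXT NOT NULL, col_2 TEXT NOT NULL)")
    conn.executemany("INSERT INTO tbl VALUES (?, ?)", rows)
    # SQLite's scalar min(x, y) / max(x, y) play the role of LEAST / GREATEST.
    conn.execute("""
        DELETE FROM tbl
        WHERE rowid NOT IN (
            SELECT MIN(rowid)              -- one rowid per combination to keep
            FROM tbl
            GROUP BY min(col_1, col_2), max(col_1, col_2)
        )
    """)
    remaining = conn.execute(
        "SELECT col_1, col_2 FROM tbl ORDER BY rowid").fetchall()
    conn.close()
    return remaining

if __name__ == "__main__":
    sample = [("a1", "b1"), ("b1", "a1"), ("a2", "b2"), ("a3", "b3"), ("b3", "a3")]
    print(dedupe_reversed_pairs(sample))
```

Because the smallest rowid in each group survives, the first-inserted row of every pair is the one that stays.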

I think it is a bit complicated to achieve it with a delete query.
Maybe it is OK for you to have such a select query first:
select A,B from (
select A,
B,
ROW_NUMBER() over (PARTITION BY ORA_HASH(A) * ORA_HASH(B) ORDER BY A) as RANK
FROM <your_table_name>
) where RANK = 1;
You could save the result of this query as a new table with CREATE TABLE AS SELECT ...
And then you simply DROP your old table.

One possible solution is:
DELETE FROM T
WHERE (COL_1, "COL_#") IN (SELECT COL_1, "COL_#"
FROM (SELECT ROWNUM AS RN, t1.COL_1, t1."COL_#"
FROM T t1
INNER JOIN T t2
ON t2."COL_#" = t1.COL_1 AND
t2.COL_1 = t1."COL_#")
WHERE RN / 2 <> TRUNC(RN / 2));
Note that COL_# is quoted here. Oracle accepts # in an unquoted identifier but discourages its use, so quoting the name is the safer choice.

This doesn't seem so complicated:
delete t
where not (col1 < col2 or
not exists (select 1
from t t2
where t2.col1 = t.col2 and
t2.col2 = t.col1
)
);
Or:
delete t
where col1 > col2 and
exists (select 1
from t t2
where t2.col1 = t.col2 and
t2.col2 = t.col1
);
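The second form of this delete can be checked with a small sketch, again assuming Python's sqlite3 in place of Oracle. The table name t matches the answer; the function name and the extra row ('z9', 'a0') are hypothetical and were added only to show that a row with no reversed partner is kept even when col1 > col2.

```python
import sqlite3

def delete_reversed_partner(rows):
    """Delete the 'larger first' row of each reversed pair; return what remains."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (col1 TEXT, col2 TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    # A row is removed only if col1 > col2 AND its mirror image exists.
    conn.execute("""
        DELETE FROM t
        WHERE col1 > col2
          AND EXISTS (SELECT 1
                      FROM t AS t2
                      WHERE t2.col1 = t.col2
                        AND t2.col2 = t.col1)
    """)
    remaining = conn.execute("SELECT col1, col2 FROM t ORDER BY rowid").fetchall()
    conn.close()
    return remaining

if __name__ == "__main__":
    sample = [("a1", "b1"), ("b1", "a1"), ("a2", "b2"),
              ("a3", "b3"), ("b3", "a3"), ("z9", "a0")]
    print(delete_reversed_partner(sample))
```

Unlike the grouping approach, this one never touches exact duplicates such as two identical ('a1', 'b1') rows; it only removes mirrored rows.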

Related

SQL query for many to many exclusive IN query

I have a table Table1 with columns A and B (many to many table).
|---------------------|------------------|
| ColumnA | ColumnB |
|---------------------|------------------|
| a1 | b1 |
|---------------------|------------------|
| a1 | b2 |
|---------------------|------------------|
| a2 | b1 |
|---------------------|------------------|
| a2 | b3 |
|---------------------|------------------|
| a3 | b2 |
|---------------------|------------------|
I want the list of As whose Bs are ALL contained in a given list of Bs.
So, from the above table, if the list is [b1, b2]
Expected [a1, a3]
Not including a2, as it is also associated with b3.
You can use aggregation and having:
select a
from ab
group by a
having sum(case when b not in ('b1', 'b2') then 1 else 0 end) = 0;
The having clause is checking the number of rows that are not in the list. The = 0 says there are none.
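As a quick check of the HAVING approach against the sample data from the question, here is a minimal sketch using Python's sqlite3. The table and column names ab, a and b follow the answer's query; the function name and the parameterised IN list are assumptions made for illustration.

```python
import sqlite3

def as_with_only_allowed_bs(pairs, allowed):
    """Return every a whose associated b values all appear in `allowed`."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ab (a TEXT, b TEXT)")
    conn.executemany("INSERT INTO ab VALUES (?, ?)", pairs)
    placeholders = ", ".join("?" for _ in allowed)
    # Count the rows per a that fall outside the list; keep groups with none.
    query = f"""
        SELECT a
        FROM ab
        GROUP BY a
        HAVING SUM(CASE WHEN b NOT IN ({placeholders}) THEN 1 ELSE 0 END) = 0
        ORDER BY a
    """
    result = [row[0] for row in conn.execute(query, list(allowed))]
    conn.close()
    return result

if __name__ == "__main__":
    table1 = [("a1", "b1"), ("a1", "b2"), ("a2", "b1"), ("a2", "b3"), ("a3", "b2")]
    print(as_with_only_allowed_bs(table1, ["b1", "b2"]))
```

With the question's data and the list [b1, b2], this returns a1 and a3; a2 is excluded because of its b3 row.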
Assuming there are not any nulls in ColumnB you can use NOT EXISTS:
select t.*
from tablename t
where not exists (select 1 from tablename where ColumnA = t.ColumnA and ColumnB not in ('b1', 'b2'))
If you want only the distinct values of ColumnA:
select distinct t.ColumnA
from tablename t
where not exists (select 1 from tablename where ColumnA = t.ColumnA and ColumnB not in ('b1', 'b2'))

Conditional Joins in SQL Server - How do I make SQL check whether the conditions are met first, then do the JOIN afterwards

I am trying to do fuzzy string matching to get as many matches as I can. First I create the Levenshtein distance function (http://www.kodyaz.com/articles/fuzzy-string-matching-using-levenshtein-distance-sql-server.aspx) and store it as "distance" in my dbo schema.
My first table (t1) looks like this:
Name | Synonym
A | A1
A | A2
A | A3
B | B1
B | B2
My second table (t2) looks like this:
The ID field may closely resemble Name and Synonym
ID | Description
A | XXX
B | YYY
My goal is to left join on either the Name or one of its Synonyms whenever the distance between the two strings from each table (t1 and t2) is smaller than 2.
Here is my current work:
SELECT *
FROM (
SELECT t2.ID, ROW_NUMBER() over (partition by ID order by ID) as rn
FROM table1 as t1
LEFT JOIN table2 as t2
ON (upper(trim(t1.Name)) = upper(trim(t2.ID)) OR upper(trim(t1.Synonym)) = upper(trim(t2.ID)))
WHERE (dbo.distance(t1.Name,t2.ID)<=2 OR dbo.distance(t1.Synonym,t2.ID)<=2)
) temp
WHERE rn=1
Ideally, as long as the distance is smaller than 2, the join should still happen.
Adding that condition should produce more matches; however, it doesn't.
Am I missing anything here?
I was wondering if my problem comes from this:
My intention is to check whether the conditions are met and, if so, do the join.
But my code probably tells SQL to "join first" and then filter afterwards.
Is there a way to have it check whether the condition qualifies and do the join "after"?
I have tried the DIFFERENCE function just for demo purposes. It compares the SOUNDEX values of two strings and returns an integer from 4 (most similar) down to 0 (least similar). You can try similar logic using your distance function.
DECLARE @table1 TABLE(Name varchar(10), synon varchar(10))
DECLARE @table2 TABLE(Name varchar(10), synon varchar(10))
INSERT INTO @table1
VALUES ('A','A1'),('A','A2'),('A','A3'),('B','B1'),('B','B2'),('B','B3')
INSERT INTO @table2
VALUES ('A','A1'),('A','A2'),('C','C1'),('B','B2'),('B','B3')
SELECT t1.name, t1.synon, t2.Name,t2.synon
FROM @table1 as t1
CROSS APPLY (SELECT T2.Name, t2.synon FROM @table2 as t2 WHERE DIFFERENCE(t2.Name,t1.Name) = 4 OR DIFFERENCE(t2.synon,t1.synon) = 4) as t2
+------+-------+------+-------+
| name | synon | Name | synon |
+------+-------+------+-------+
| A | A1 | A | A1 |
| A | A2 | A | A1 |
| A | A3 | A | A1 |
| A | A1 | A | A2 |
| A | A2 | A | A2 |
| A | A3 | A | A2 |
| B | B1 | B | B2 |
| B | B2 | B | B2 |
| B | B3 | B | B2 |
| B | B1 | B | B3 |
| B | B2 | B | B3 |
| B | B3 | B | B3 |
+------+-------+------+-------+
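One way to read the suggestion to "try similar logic using your distance function" is to make the distance test itself the join condition, instead of joining on equality and filtering afterwards. The sketch below is only an illustration under several assumptions: it runs on Python's sqlite3 rather than SQL Server, it registers a plain Python Levenshtein implementation under the name distance as a stand-in for dbo.distance, and the sample names (Apple, Aple and so on) are invented because one- and two-character values are all within distance 2 of each other.

```python
import sqlite3

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def fuzzy_matched_ids(t1_rows, t2_rows, max_distance=2):
    """Return the t2 ids that fuzzily match a Name or Synonym in t1."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical stand-in for the SQL Server function dbo.distance.
    conn.create_function("distance", 2, levenshtein)
    conn.execute("CREATE TABLE t1 (name TEXT, synonym TEXT)")
    conn.execute("CREATE TABLE t2 (id TEXT, description TEXT)")
    conn.executemany("INSERT INTO t1 VALUES (?, ?)", t1_rows)
    conn.executemany("INSERT INTO t2 VALUES (?, ?)", t2_rows)
    # The distance test IS the join condition; no equality test comes first.
    rows = conn.execute("""
        SELECT DISTINCT t2.id
        FROM t1
        JOIN t2
          ON distance(upper(trim(t1.name)), upper(trim(t2.id))) <= ?
          OR distance(upper(trim(t1.synonym)), upper(trim(t2.id))) <= ?
        ORDER BY t2.id
    """, (max_distance, max_distance)).fetchall()
    conn.close()
    return [r[0] for r in rows]

if __name__ == "__main__":
    t1 = [("Apple", "Appel"), ("Banana", "Bananna")]
    t2 = [("Aple", "XXX"), ("Bananas", "YYY"), ("Cherry", "ZZZ")]
    print(fuzzy_matched_ids(t1, t2))
```

In the original query the ON clause demands exact equality, so the WHERE clause can only remove rows from an already exact match set; moving the distance test into the ON clause is what lets near matches through.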

Join on two tables gives duplicate results

I have a table with data that I want to join onto another table. The problem is that the join can happen on two columns of the same table: I want the first join to be used and, if that fails, I want the second join to give me a valid result.
Base table:
| ID1 | ID2 | Value |
| a1 | a2 | val_1 |
| b1 | b2 | val_2 |
| c1 | c2 | val_3 |
join Table:
| ID1 | ID2 | Join_Value |
| | a2 | join_val_1 |
| b1 | | join_val_2 |
| c1 | c2 | join_val_3 |
What I tried was this:
select base.id1, base.id2, Value, isnull(j1.Join_value,j2.Join_value) Join_Value from base
left join Join j1 on j1.id1 = base.id1
left join Join j2 on j2.id2 = base.id2
The Result is this:
| ID1 | ID2 | Value | Join_Value |
| a1 | a2 | val_1 | join_val_1 |
| b1 | b2 | val_2 | join_val_2 |
| c1 | c2 | val_3 | join_val_3 |
| c1 | c2 | val_3 | join_val_3 |
What I want is this:
| ID1 | ID2 | Value | Join_Value |
| a1 | a2 | val_1 | join_val_1 |
| b1 | b2 | val_2 | join_val_2 |
| c1 | c2 | val_3 | join_val_3 |
I hope I made my problem clear.
You don't need to join the same table twice. Just specify the condition in the ON
select b.ID1, b.ID2, b.[Value], j.Join_Value
from [base] b
inner join [join] j on b.ID1 = j.ID1
or (
j.ID1 = ''
and b.ID2 = j.ID2
)
You are going to get duplicate rows for the c1 and c2 rows because they match on both of your Join table joins (j1 and j2).
A quick fix is to add a DISTINCT to your query:
select DISTINCT base.id1, base.id2, Value, isnull(j1.Join_value,j2.Join_value) Join_Value
from base
left join [Join] j1 on j1.id1 = base.id1
left join [Join] j2 on j2.id2 = base.id2
A better fix, depending on your DBMS, is to use a window function:
select id1, id2, Value, Join_Value
FROM (
select base.id1, base.id2, Value, isnull(j1.Join_value,j2.Join_value) Join_Value,
ROW_NUMBER() OVER(
PARTITION BY base.id1, base.id2 -- Group rows based on (id1, id2) combination
ORDER BY j1.id1 -- If more than one row, give priority to row with "id1" value
) AS RowNum
from base
left join [Join] j1 on j1.id1 = base.id1
left join [Join] j2 on j2.id2 = base.id2
) src
WHERE RowNum = 1 -- Only return one row
This will make sure you always get at most one row per (id1, id2) combination.
Try:
select *
from base b
join [join] j on b.id1 = j.id1 or b.id2 = j.id2
First, your version does exactly what you want.
Second, for more control over the matching, you can use a lateral join. This allows you to choose only one matching row -- say the one where both ids match:
select b.id1, b.id2, b.value, jt.join_value
from base b cross apply
(select top (1) jt.*
from jointable jt
where b.id1 = jt.id1 or
b.id2 = jt.id2
order by (case when b.id1 = jt.id1 then 1 else 0 end) +
(case when b.id2 = jt.id2 then 1 else 0 end) desc
) jt ;

SQL Server select column names from multiple tables

I have three tables in SQL Server with following structure:
col1 col2 a1 a2 ... an,
col1 col2 b1 b2 ... bn,
col1 col2 c1 c2 ... cn
The first two columns, col1 and col2, are the same in every table; however, the tables have different lengths.
I need to select the column names of the tables, and the result I'm trying to achieve is the following:
col1, col2, a1, b1, c1, a2, b2, c2 ...
Is there a way to do it?
It's possible, but the results from the three tables are combined into a single column.
For example
SELECT A.col1 +'/' +B.col1 +'/' + C.col1 As Col1 ,
A.col2 +'/' +B.col2 +'/' + C.col2 As col2 ,a1, b1, c1, a2, b2, c2 ,
* FROM A
INNER JOIN B
ON A.ID =B.ID
INNER JOIN C
ON C.ID = B.ID
SQL-Server is not the right tool to create a generic resultset. The engine needs to know what's coming out in advance. Well, you might try to find a solution with dynamic SQL...
I want to suggest two different approaches.
Both would work with any number of tables, as long as all of them have the columns col1 and col2 with appropriate types.
Let's create a simple mockup scenario first:
DECLARE @mockup1 TABLE(col1 INT,col2 INT,SomeMore1 VARCHAR(100),SomeMore2 VARCHAR(100));
INSERT INTO @mockup1 VALUES(1,1,'blah 1.1','blub 1.1')
,(1,2,'blah 1.2','blub 1.2')
,(1,100,'not in t2','not in t2');
DECLARE @mockup2 TABLE(col1 INT,col2 INT,OtherType1 INT,OtherType2 DATETIME);
INSERT INTO @mockup2 VALUES(1,1,101,GETDATE())
,(1,2,102,GETDATE()+1)
,(1,200,200,GETDATE()+200);
--You can add as many tables as you need
A very pragmatic approach:
Try this simple FULL OUTER JOIN:
SELECT *
FROM @mockup1 m1
FULL OUTER JOIN @mockup2 m2 ON m1.col1=m2.col1 AND m1.col2=m2.col2
--add more tables here
The result
+------+------+-----------+-----------+------+------+------------+-------------------------+
| col1 | col2 | SomeMore1 | SomeMore2 | col1 | col2 | OtherType1 | OtherType2 |
+------+------+-----------+-----------+------+------+------------+-------------------------+
| 1 | 1 | blah 1.1 | blub 1.1 | 1 | 1 | 101 | 2019-03-08 10:53:20.257 |
+------+------+-----------+-----------+------+------+------------+-------------------------+
| 1 | 2 | blah 1.2 | blub 1.2 | 1 | 2 | 102 | 2019-03-09 10:53:20.257 |
+------+------+-----------+-----------+------+------+------------+-------------------------+
| 1 | 100 | not in t2 | not in t2 | NULL | NULL | NULL | NULL |
+------+------+-----------+-----------+------+------+------------+-------------------------+
| NULL | NULL | NULL | NULL | 1 | 200 | 200 | 2019-09-24 10:53:20.257 |
+------+------+-----------+-----------+------+------+------------+-------------------------+
But you will have to deal with non-unique column names... (This is the moment, where a dynamically created statement can help).
A generic approach using container type XML
Whenever you do not know the result in advance, you can pack the result in a container. This allows a clear structure on the side of your RDBMS and shifts the troubles how to deal with this set to the consumer.
The cte will read all existing pairs of col1 and col2
Each table's row(s) for the pair of values is inserted as XML
Pairs not existing in any of the tables show up as NULL
Try this out
WITH AllDistinctCol1Col2Values AS
(
SELECT col1,col2 FROM @mockup1
UNION ALL
SELECT col1,col2 FROM @mockup2
--add all your tables here
)
SELECT col1,col2
,(SELECT * FROM @mockup1 x WHERE c1c2.col1=x.col1 AND c1c2.col2=x.col2 FOR XML PATH('row'),TYPE) AS Content1
,(SELECT * FROM @mockup2 x WHERE c1c2.col1=x.col1 AND c1c2.col2=x.col2 FOR XML PATH('row'),TYPE) AS Content2
FROM AllDistinctCol1Col2Values c1c2
GROUP BY col1,col2;
The result
+------+------+-----------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+
| col1 | col2 | Content1 | Content2 |
+------+------+-----------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+
| 1 | 1 | <row><col1>1</col1><col2>1</col2><SomeMore1>blah 1.1</SomeMore1><SomeMore2>blub 1.1</SomeMore2></row> | <row><col1>1</col1><col2>1</col2><OtherType1>101</OtherType1><OtherType2>2019-03-08T11:03:49.877</OtherType2></row> |
+------+------+-----------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+
| 1 | 2 | <row><col1>1</col1><col2>2</col2><SomeMore1>blah 1.2</SomeMore1><SomeMore2>blub 1.2</SomeMore2></row> | <row><col1>1</col1><col2>2</col2><OtherType1>102</OtherType1><OtherType2>2019-03-09T11:03:49.877</OtherType2></row> |
+------+------+-----------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+
| 1 | 100 | <row><col1>1</col1><col2>100</col2><SomeMore1>not in t2</SomeMore1><SomeMore2>not in t2</SomeMore2></row> | NULL |
+------+------+-----------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+
| 1 | 200 | NULL | <row><col1>1</col1><col2>200</col2><OtherType1>200</OtherType1><OtherType2>2019-09-24T11:03:49.877</OtherType2></row> |
+------+------+-----------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+
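For the dynamic SQL route mentioned in this answer, the column order the question asks for (col1, col2, a1, b1, c1, a2, b2, c2 ...) has to be computed before a statement can be built. A minimal sketch of that one step in Python follows; the function name and the idea of passing each table's extra columns as a list are assumptions, and actually reading the names from the database catalog is left out.

```python
from itertools import zip_longest

def interleave_columns(shared, *per_table):
    """Shared columns first, then the tables' extra columns taken round-robin."""
    ordered = list(shared)
    # zip_longest pads the shorter lists with None, which is skipped below.
    for group in zip_longest(*per_table):
        ordered.extend(name for name in group if name is not None)
    return ordered

if __name__ == "__main__":
    columns = interleave_columns(
        ["col1", "col2"],
        ["a1", "a2", "a3"],   # extra columns of the first table
        ["b1", "b2"],         # extra columns of the second table
        ["c1", "c2"],         # extra columns of the third table
    )
    # The joined list could serve as the SELECT list of a generated statement.
    print(", ".join(columns))
```

Tables of different widths are handled by simply running out of names early, so the longer tables' remaining columns end up at the tail of the list.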

Sql Server get first matching value

I have two tables History and Historyvalues:
History
HID(uniqueidentifier) | Version(int)
a1 | 1
a2 | 2
a3 | 3
a4 | 4
Historyvalues
HVID(uniqueidentifier) | HID(uniqueidentifier) | ControlID(uniqueidentifier) | Value(string)
b1 | a1 | c1 | value1
b2 | a2 | c1 | value2
b3 | a2 | c2 | value3
Now I need a query that returns the last history value of each control as of a specific Version, for example:
Get the last values as of Version 3 -> receiving ->
HVID | ControlID | Value
b2 | c1 | value2
b3 | c2 | value3
I tried something like this:
Select HVID, ControlId, max(Version), Value from
(
Select HVID, ControlId, Version, Value
from History inner JOIN
Historyvalues ON History.HID = Historyvalues.HID
where Version <= 3
) as a
group by ControlId
order by Version desc
but this does not work.
Are there any ideas?
Thank you very much for your help.
Best regards
This returns the latest version of each control up to your specific Version (WHERE t1.Version <= 3).
Query:
SELECT HVID, ControlId, Version, Value
FROM
(
SELECT t2.HVID, t2.ControlId, t1.Version, t2.Value,
ROW_NUMBER() OVER(PARTITION BY t2.ControlId ORDER BY t1.Version DESC) as rnk
FROM History t1
JOIN Historyvalues t2
ON t1.HID = t2.HID
WHERE t1.Version <= 3
) AS a
WHERE a.rnk = 1
ORDER BY a.Version desc
Result:
| HVID | CONTROLID | VERSION | VALUE |
|------|-----------|---------|--------|
| b2 | c1 | 2 | value2 |
| b3 | c2 | 2 | value3 |
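The ROW_NUMBER query can be reproduced against the question's sample data with a small sketch. It assumes Python's sqlite3 with an SQLite build recent enough to support window functions (3.25 or later), short text ids in place of uniqueidentifier values, and a hypothetical function name; the final ordering is by ControlID so that the output is deterministic.

```python
import sqlite3

def latest_values_up_to(version):
    """Latest Historyvalues row per control among versions <= `version`."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE History (HID TEXT, Version INTEGER)")
    conn.execute(
        "CREATE TABLE Historyvalues (HVID TEXT, HID TEXT, ControlID TEXT, Value TEXT)")
    conn.executemany("INSERT INTO History VALUES (?, ?)",
                     [("a1", 1), ("a2", 2), ("a3", 3), ("a4", 4)])
    conn.executemany("INSERT INTO Historyvalues VALUES (?, ?, ?, ?)", [
        ("b1", "a1", "c1", "value1"),
        ("b2", "a2", "c1", "value2"),
        ("b3", "a2", "c2", "value3"),
    ])
    # Number the rows of each control from newest to oldest, keep number 1.
    rows = conn.execute("""
        SELECT HVID, ControlID, Version, Value
        FROM (
            SELECT t2.HVID, t2.ControlID, t1.Version, t2.Value,
                   ROW_NUMBER() OVER (PARTITION BY t2.ControlID
                                      ORDER BY t1.Version DESC) AS rnk
            FROM History t1
            JOIN Historyvalues t2 ON t1.HID = t2.HID
            WHERE t1.Version <= ?
        ) AS a
        WHERE a.rnk = 1
        ORDER BY a.ControlID
    """, (version,)).fetchall()
    conn.close()
    return rows

if __name__ == "__main__":
    for row in latest_values_up_to(3):
        print(row)
```

Asking for version 1 instead of 3 returns only the b1 row, since control c2 has no value that early.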
Here is another solution. Note that it assumes HVID values grow with Version, because it picks the row with the largest HVID per control:
Select Historyvalues.HVID,Historyvalues.ControlID,Historyvalues.Value
from Historyvalues
inner join History on Historyvalues.hid=History.hid
where Historyvalues.hvid in (
select MAX(Historyvalues.hvid) from Historyvalues
inner join History on Historyvalues.hid=History.hid
where History.Version <= 3
group by ControlID)