DELETE WITH INTERSECT - sql

I have two tables with the same number of columns and no primary keys (I know, this is not my fault). I need to delete all rows from table A that also exist in table B (a row counts as a match when all 30 columns are equal).
The most immediate approach I thought of is an INNER JOIN. But writing conditions for all columns (and worrying about NULLs) is not elegant (maybe because my tables are not elegant either).
I want to use INTERSECT, but I don't know how to do it. This is my first question.
I tried (SQL Fiddle):
create table #A (value int, username varchar(20))
create table #B (value int, username varchar(20))
insert into #A values (1, 'User 1'), (2, 'User 2'), (3, 'User 3'), (4, 'User 4')
insert into #B values (2, 'User 2'), (4, 'User 4'), (5, 'User 5')
DELETE #A
FROM (SELECT * FROM #A INTERSECT SELECT * from #B) A
But all rows were deleted from table #A.
This led me to my second question: why does the command DELETE #A FROM #B delete all rows from table #A?

Try this:
DELETE a
FROM #A a
WHERE EXISTS (SELECT a.* INTERSECT SELECT * FROM #B)
This deletes each row of #A for which the INTERSECT of that row with #B is non-empty, that is, each row that has an identical row in #B. INTERSECT treats NULLs as equal, so no per-column NULL handling is needed.
This is based on Paul White's blog post using INTERSECT for inequality checking.
SQL Fiddle
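To illustrate the NULL handling, here is a small sketch. The sample rows containing NULL are my own and not from the question, and temp tables are assumed for the setup.

```sql
-- Hypothetical sample data: one matching row pair contains a NULL.
CREATE TABLE #A (value int, username varchar(20));
CREATE TABLE #B (value int, username varchar(20));
INSERT INTO #A VALUES (1, 'User 1'), (2, NULL);
INSERT INTO #B VALUES (2, NULL), (5, 'User 5');

-- An equality join finds no match for (2, NULL), because NULL = NULL is not true.
SELECT COUNT(*) AS join_matches
FROM #A AS a
JOIN #B AS b ON a.value = b.value AND a.username = b.username;

-- INTERSECT compares whole rows and treats NULLs as equal, so (2, NULL) is deleted.
DELETE a
FROM #A AS a
WHERE EXISTS (SELECT a.* INTERSECT SELECT * FROM #B);

-- Only (1, 'User 1') should remain.
SELECT * FROM #A;

DROP TABLE #A;
DROP TABLE #B;
```

The point of the comparison is that the join version would need an extra IS NULL branch per nullable column, while the INTERSECT version does not.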

To answer your first question you can delete based on join:
delete a
from #a a
join #b b on a.value = b.value and a.username = b.username
The second case is really strange. I remember a similar case here and many complaints about this behaviour. I will try to find that question.

You can use Giorgi's answer to delete the rows you need.
As for the question regarding why all rows were deleted, that's because there is no limiting condition. Your FROM clause gets a table to process, but there is no WHERE clause to prevent certain rows from being deleted from #A.
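As far as I understand it, when the DELETE target is not tied to the FROM source by any condition, every row of the target pairs with every row of the source, much like a cross join. The sketch below uses the question's sample data (temp tables assumed) to show that effect and then a correlated alternative.

```sql
CREATE TABLE #A (value int, username varchar(20));
CREATE TABLE #B (value int, username varchar(20));
INSERT INTO #A VALUES (1, 'User 1'), (2, 'User 2'), (3, 'User 3'), (4, 'User 4');
INSERT INTO #B VALUES (2, 'User 2'), (4, 'User 4'), (5, 'User 5');

-- No condition links #A to the derived table, so each of the 4 rows in #A
-- pairs with each of the 2 intersect rows: 8 pairs covering all 4 rows of #A.
SELECT COUNT(*) AS pairs, COUNT(DISTINCT a.value) AS a_rows_covered
FROM #A AS a
CROSS JOIN (SELECT * FROM #A INTERSECT SELECT * FROM #B) AS i;

-- Adding a join condition restricts the delete to the matching rows.
-- Note: this equality join would still miss rows whose columns contain NULL.
DELETE a
FROM #A AS a
JOIN (SELECT * FROM #A INTERSECT SELECT * FROM #B) AS i
  ON i.value = a.value AND i.username = a.username;

-- (1, 'User 1') and (3, 'User 3') should remain.
SELECT * FROM #A;

DROP TABLE #A;
DROP TABLE #B;
```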

1. Create a table (T) defining the primary keys.
2. Insert all records from A into T (I will assume there are no duplicates in A).
3. Try to insert all records from B into T.
3A. If an insert fails, delete that record from B (it already exists).
4. Drop T (you really shouldn't!).

Giorgi's answer explicitly compares all columns, which you wanted to avoid.
It is possible to write code that doesn't list all columns explicitly.
EXCEPT produces the result set that you need, but I don't know a good way to use this result set to DELETE the original rows from A without a primary key. So the solution below saves this intermediate result in a temporary table using SELECT * INTO, then deletes everything from A and copies the temporary result back into A. Wrap it in a transaction.
-- generate the final result set that we want to have and save it in a temporary table
SELECT *
INTO #t
FROM
(
SELECT * FROM #A
EXCEPT
SELECT * FROM #B
) AS E;
-- copy temporary result back into A
DELETE FROM #A;
INSERT INTO #A
SELECT * FROM #t;
DROP TABLE #t;
-- check the result
SELECT * FROM #A;
Result set:
value  username
1      User 1
3      User 3
The good side of this solution is that it uses * instead of the full list of columns. Of course, you can list all columns explicitly as well. It will still be easier to write and handle, than writing comparisons of all columns and taking care of possible NULLs.
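The "wrap it in a transaction" advice could look roughly like the sketch below. It assumes SQL Server 2012 or later (for THROW) and uses temp tables with the question's sample data. One thing to keep in mind is that EXCEPT returns distinct rows, so exact duplicate rows in A would be collapsed into one by this approach.

```sql
CREATE TABLE #A (value int, username varchar(20));
CREATE TABLE #B (value int, username varchar(20));
INSERT INTO #A VALUES (1, 'User 1'), (2, 'User 2'), (3, 'User 3'), (4, 'User 4');
INSERT INTO #B VALUES (2, 'User 2'), (4, 'User 4'), (5, 'User 5');

BEGIN TRY
    BEGIN TRANSACTION;

    -- Rows of #A that have no identical row in #B.
    SELECT *
    INTO #t
    FROM (SELECT * FROM #A EXCEPT SELECT * FROM #B) AS E;

    -- Replace the contents of #A with the saved result.
    DELETE FROM #A;
    INSERT INTO #A SELECT * FROM #t;

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    -- Undo the DELETE if anything failed, then re-raise the error.
    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
    THROW;
END CATCH;

IF OBJECT_ID('tempdb..#t') IS NOT NULL DROP TABLE #t;

-- (1, 'User 1') and (3, 'User 3') should remain.
SELECT * FROM #A;
```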

Related

compare primary/alias groups across two tables

Gday,
We have two tables with exactly the same structure. There are two columns, "PrimaryAddress" and "AliasAddress". These are for email addresses and aliases. We want to find any records that need to be added to either side to keep the records in sync. The catch is that the primary name in one table might be listed as an alias in the other. The good news is that an address won't appear twice in the "AliasAddress" column.
TABLE A
PrimaryAddress~~~~~AliasAdress
chris#work~~~~~~~~~chris#home
chris#work~~~~~~~~~c#work
chris#work~~~~~~~~~theboss#work
chris#work~~~~~~~~~thatguy#aol
bob#test~~~~~~~~~~~test1#test
bob#test~~~~~~~~~~~charles#work
bob#test~~~~~~~~~~~chuck#aol
sally#mars~~~~~~~~~sally#nasa
sally#mars~~~~~~~~~sally#gmail
TABLE B
PrimaryAddress~~~~~AliasAdress
chris#home~~~~~~~~~chris#work
chris#home~~~~~~~~~c#work
chris#home~~~~~~~~~theboss#work
chris#home~~~~~~~~~thatguy#aol
bob#test~~~~~~~~~~~test1#test
bob#test~~~~~~~~~~~charles#work
sally#nasa~~~~~~~~~sally#mars
sally#nasa~~~~~~~~~sally#gmail
sally#nasa~~~~~~~~~ripley#nostromo
The expected result is to return the following missing records from both tables:
bob#test~~~~~~~~~~~chuck#aol
sally#nasa~~~~~~~~~ripley#nostromo
Note that the chris#* block is a total match because the sum of all the aliases (plus primary) is still the same regardless of which address is considered primary. It doesn't matter which address is primary as long as the sum of the entire primary group contains all entries in both tables.
I don't mind if this is run in two passes, A->B and B->A, but I just can't get my head around a solution.
Any help appreciated :)
drop TABLE #TABLEA
CREATE TABLE #TABLEA
([PrimaryAddress] varchar(10), [AliasAdress] varchar(12))
;
INSERT INTO #TABLEA
([PrimaryAddress], [AliasAdress])
VALUES
('chris#work', 'chris#home'),
('chris#work', 'c#work'),
('chris#work', 'theboss#work'),
('chris#work', 'thatguy#aol'),
('bob#test', 'test1#test'),
('bob#test', 'charles#work'),
('bob#test', 'chuck#aol'),
('sally#mars', 'sally#nasa'),
('sally#mars', 'sally#gmail')
;
drop TABLE #TABLEB
CREATE TABLE #TABLEB
([PrimaryAddress] varchar(10), [AliasAdress] varchar(15))
;
INSERT INTO #TABLEB
([PrimaryAddress], [AliasAdress])
VALUES
('chris#home', 'chris#work'),
('chris#home', 'c#work'),
('chris#home', 'theboss#work'),
('chris#home', 'thatguy#aol'),
('bob#test', 'test1#test'),
('bob#test', 'charles#work'),
('sally#nasa', 'sally#mars'),
('sally#nasa', 'sally#gmail'),
('sally#nasa', 'ripley#nostromo')
;
Try the following:
select a.PrimaryAddress,a.AliasAdress from #TABLEA a left join #TABLEB b on a.AliasAdress=b.AliasAdress or b.PrimaryAddress=a.AliasAdress
where b.PrimaryAddress is null
union all
select a.PrimaryAddress,a.AliasAdress from #TABLEB a left join #TABLEA b on a.AliasAdress=b.AliasAdress or b.PrimaryAddress=a.AliasAdress
where b.PrimaryAddress is null
So you want to compare tables A and B and find rows which are unique to either table. How about an outer join, followed by looking for NULL values:
SELECT ta.*, tb.*
FROM table_a ta
FULL OUTER JOIN table_b tb ON tb.PrimaryAddress = ta.PrimaryAddress
AND tb.AliasAddress = ta.AliasAddress
WHERE ta.PrimaryAddress IS NULL
OR tb.PrimaryAddress IS NULL
If I understand the question correctly, this should return what you ask for.
Here's how I did it, with a bit of throwing-hands-up-in-the-air at the end.
Step one, identify the sets of items to be compared. This is:
For a “primary” value, all values found in Alias
Including the “primary” value as well (to cover that nasa/nostromo case)
A set in a table (A or B) is identified by its primary value. What really makes it hard is that the primary value is not shared across the two tables (sally#mars, sally#nasa). So we can compare sets, but we have to be able to “go back” to the primary on each table separately (e.g. the stand-out from table B may be sally#nasa / ripley#nostromo, but we have to add sally#mars / ripley#nostromo to table A).
Major problems arise if, in a table, a primary value appears as an alias for a different primary value (e.g. in table A, chris#work appears as an alias for bob#test). For the sake of sanity, I am going to assume this will not happen… but if it does, the problem becomes even harder.
This query adds to B the items that are in A but missing from B, where the PrimaryAddress is the same for both A and B:
;WITH setA (SetId, FullSet)
as (-- Complete sets in A
select PrimaryAddress, AliasAdress
from A
union select PrimaryAddress, PrimaryAddress
from A
)
,setB (SetId, FullSet)
as (-- Complete sets in B
select PrimaryAddress, AliasAdress
from B
union select PrimaryAddress, PrimaryAddress
from B
)
,NotInB (Missing)
as (-- What's in A that's not in B
select FullSet
from setA
except select FullSet -- This is the secret sauce. Definitely worth your time to read up on how EXCEPT works.
from setB
)
-- Take the missing values plus their primaries from A and load them into B
INSERT B (PrimaryAddress, AliasAdress)
select A.PrimaryAddress, nB.Missing
from NotInB nB
inner join A
on A.AliasAdress = nb.Missing
Run it again with the tables reversed (from “NotInB” on) to do the same for A.
HOWEVER
Doing so with your sample data for "in B not in A" will add (sally#nasa, ripley#nostromo) to A, and as that’s a different primary, it’d create a new set, and so does not solve the problem. It gets ugly quickly. Talking it out from here:
Takes two passes, one for A not in B, one for B not in A
For each pass, have to do two checks
First check is what’s above: what’s in A not in B where primary addresses match, and add it
Second check is ugly: what’s in A not in B where the primary address from A is NOT a primary address in B and, thus, must be an alias. Here, find A’s primary address in B’s alias list, get the primary address used for this set in B, and create the row(s) in B using those values.
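A detection-only sketch of the set idea above, which reports addresses present on one side and absent on the other without inserting anything. It assumes temp tables shaped like the question's #TABLEA and #TABLEB, with a shortened subset of the sample data; the CTE and column names (setA, setB, Member) are mine. It only checks that each address appears somewhere on the other side, not that the grouping is identical.

```sql
CREATE TABLE #TABLEA (PrimaryAddress varchar(10), AliasAdress varchar(15));
CREATE TABLE #TABLEB (PrimaryAddress varchar(10), AliasAdress varchar(15));
INSERT INTO #TABLEA VALUES
    ('bob#test', 'test1#test'), ('bob#test', 'chuck#aol'),
    ('sally#mars', 'sally#nasa'), ('sally#mars', 'sally#gmail');
INSERT INTO #TABLEB VALUES
    ('bob#test', 'test1#test'),
    ('sally#nasa', 'sally#mars'), ('sally#nasa', 'sally#gmail'),
    ('sally#nasa', 'ripley#nostromo');

WITH setA (PrimaryAddress, Member) AS (
    -- Every alias, plus the primary itself as a member of its own set.
    SELECT PrimaryAddress, AliasAdress FROM #TABLEA
    UNION
    SELECT PrimaryAddress, PrimaryAddress FROM #TABLEA
),
setB (PrimaryAddress, Member) AS (
    SELECT PrimaryAddress, AliasAdress FROM #TABLEB
    UNION
    SELECT PrimaryAddress, PrimaryAddress FROM #TABLEB
)
-- Addresses known to A but unknown to B.
SELECT 'Missing from B' AS Reason, a.PrimaryAddress, a.Member
FROM setA AS a
WHERE NOT EXISTS (SELECT 1 FROM setB AS b WHERE b.Member = a.Member)
UNION ALL
-- Addresses known to B but unknown to A.
SELECT 'Missing from A', b.PrimaryAddress, b.Member
FROM setB AS b
WHERE NOT EXISTS (SELECT 1 FROM setA AS a WHERE a.Member = b.Member);

DROP TABLE #TABLEA;
DROP TABLE #TABLEB;
```

With this subset, the rows I would expect back are bob#test / chuck#aol (missing from B) and sally#nasa / ripley#nostromo (missing from A), which matches the two records the question asks for.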
OK, this is how we did it. As it was becoming a pain, we ran a procedure that added the primary address of each entry as an alias (xx#xx -> xx#xx) so that all addresses were listed as aliases for each user. This is similar to what @Phillip Kelly did above. Then we ran the following code (it's messy, but it works, and in one pass too):
SELECT 'Missing from B:' as Reason, TableA.[primary] as APrimary, TableA.[alias] as AAlias, TableB.[primary] as BPrimary,TableB.[alias] as BAlias into #A FROM dbo.TableA LEFT OUTER JOIN TableB ON TableB.alias = TableA.alias
SELECT 'Missing from A:' as Reason,TableA.[primary] as APrimary, TableA.[alias] as AAlias, TableB.[primary] as BPrimary,TableB.[alias] as BAlias into #B FROM dbo.TableB LEFT OUTER JOIN TableA ON TableA.alias = TableB.alias
select * from #A
select * from #B
UPDATE #A
SET #A.APrimary = #B.BPrimary
FROM #B INNER JOIN #A ON #A.APrimary = #B.BPrimary
WHERE #A.BPrimary IS NULL
UPDATE #B
SET #B.BPrimary = #A.APrimary
FROM #B INNER JOIN #A ON #B.BPrimary = #A.BPrimary
WHERE #B.APrimary IS NULL
select * from #A
select * from #B
select * into #result from (
select Reason, BPrimary as [primary], BAlias as [alias] from #B where APrimary IS NULL
union
select Reason, APrimary as [primary], AAlias as [alias] from #A where BPrimary IS NULL
) as tmp
select * from #result
drop table #A
drop table #B
drop table #result
GO

SQL column duplicate value count

I have a Student table. Currently it has many columns like ID, StudentName, FatherName, NIC, MotherName, No_Of_Childrens, Occupation etc.
I want to check the NIC field at insert time. If it is a duplicate, then count the duplicated NIC and add the count to the No_Of_Childrens column.
What is the best way to do that in SQL Server?
It sounds like you want an UPSERT. The most concise way to accomplish that in SQL (that I know) is through a MERGE operation.
declare @students table
(
NIC int
,No_Of_Childrens int
);
--set up some test data to get us started
insert into @students
select 12345,1
union select 12346,2
union select 12347,2;
--show before
select * from @students;
declare @incomingrow table(NIC int,childcount int);
insert into @incomingrow values (12345,2);
MERGE
--the table we want to change
@students AS target
USING
--the incoming data
@incomingrow AS source
ON
--if these fields match, then the "when matched" section happens.
--else the "when not matched".
target.NIC = source.NIC
WHEN MATCHED THEN
--this statement will happen when you find a match.
--in our case, we increment the child count.
UPDATE SET No_Of_Childrens = target.No_Of_Childrens + source.childcount
WHEN NOT MATCHED THEN
--this statement will happen when you do *not* find a match.
--in our case, we insert a new row with a child count of 0.
INSERT (NIC,No_Of_Childrens) values(source.NIC,0);
--show the results *after* the merge
select * from @students;

Multiple row insert or select if exists

CREATE TABLE object (
object_id serial,
object_attribute_1 integer,
object_attribute_2 VARCHAR(255)
)
-- primary key object_id
-- btree index on object_attribute_1, object_attribute_2
Here is what I currently have:
SELECT * FROM object
WHERE (object_attribute_1=100 AND object_attribute_2='Some String') OR
(object_attribute_1=200 AND object_attribute_2='Some other String') OR
(..another row..) OR
(..another row..)
When the query returns, I check for what is missing (thus, does not exist in the database).
Then I will make a multi-row insert:
INSERT INTO object (object_attribute_1, object_attribute_2)
VALUES (info, info), (info, info),(info, info)
Then I will select what I just inserted
SELECT ... WHERE (condition) OR (condition) OR ...
And at last, I will merge the two selects on the client side.
Is there a way to combine these 3 queries into one single query, where I provide all the data, INSERT the records that do not already exist, and then do a SELECT at the end?
Your suspicion was well founded. Do it all in a single statement using a data-modifying CTE (Postgres 9.1+):
WITH list(object_attribute_1, object_attribute_2) AS (
VALUES
(100, 'Some String')
, (200, 'Some other String')
, .....
)
, ins AS (
INSERT INTO object (object_attribute_1, object_attribute_2)
SELECT l.*
FROM list l
LEFT JOIN object o1 USING (object_attribute_1, object_attribute_2)
WHERE o1.object_attribute_1 IS NULL
RETURNING *
)
SELECT * FROM ins -- newly inserted rows
UNION ALL -- append pre-existing rows
SELECT o.*
FROM list l
JOIN object o USING (object_attribute_1, object_attribute_2);
Note, there is a tiny time frame for a race condition. So this might break if many clients try it at the same time. If you are working under heavy concurrent load, consider this related answer, in particular the part on locking or serializable transaction isolation:
Postgresql batch insert or ignore
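A different technique from the one above, offered only as a sketch: on Postgres 9.5 or later, INSERT ... ON CONFLICT DO NOTHING lets the database skip duplicates itself. It requires a unique constraint on the two attribute columns, which is an assumption here, since the question only mentions a btree index.

```sql
-- Assumed schema: same columns as the question, plus a UNIQUE constraint.
CREATE TABLE object (
    object_id serial PRIMARY KEY,
    object_attribute_1 integer,
    object_attribute_2 varchar(255),
    UNIQUE (object_attribute_1, object_attribute_2)
);

WITH list(object_attribute_1, object_attribute_2) AS (
    VALUES
      (100, 'Some String')
    , (200, 'Some other String')
)
, ins AS (
    -- Rows that already exist are skipped instead of raising an error.
    INSERT INTO object (object_attribute_1, object_attribute_2)
    SELECT * FROM list
    ON CONFLICT (object_attribute_1, object_attribute_2) DO NOTHING
    RETURNING *
)
SELECT * FROM ins            -- newly inserted rows
UNION ALL                    -- append rows that existed before this statement
SELECT o.*
FROM list l
JOIN object o USING (object_attribute_1, object_attribute_2);
```

This avoids duplicate-key failures under concurrency, but a row inserted by another transaction that has not yet committed can still be missing from both branches of the result, so the locking advice in the linked answer remains relevant.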

How does transact sql know which table I'm referencing in this subquery?

This is a question about documentation on how t-sql decides which "column" is in scope for subqueries. I tried google-ing which turned up this link but it didn't explain it.
Here's a runnable example. The update statement sets the only entry in #a.a to null. Presumably this is because the subquery reference to alias a resolves to table #b which has no rows that match value 1, thus returning null to the outer update query.
if object_id('tempdb..#a') is not null
drop table #a
if object_id('tempdb..#b') is not null
drop table #b
create table #a (a int)
create table #b (a int)
insert into #a values (1)
insert into #b values (2)
update a
set a = (select a from #b as a where a.a = 1)
from #a as a
Is there documentation that indicates this design choice? It is otherwise ambiguous, because if I change the update statement to use a different alias, the final value in #a.a is 2:
update aa
set a = (select a from #b as a where aa.a = 1)
from #a as aa
This reference might do a better job of explaining it.
The idea is quite simple. A table alias is interpreted as the "first" table definition, starting with the current level of the subquery and then moving outward. A table alias in a subquery cannot be used in an outer query, so references can only move "inward".
In your example:
update a
set a = (select a from #b as a where a.a = 1)
from #a as a
The a.a is referring to column a of table a. In the subquery itself, table a is defined as #b. That is the reference.
In this query:
update aa
set a = (select a from #b as a where aa.a = 1)
from #a as aa;
The table alias is aa. This is not defined in the subquery. It is defined at the next level out, so it refers to #a.
In general, don't give different tables the same alias in a query (with the exception of aliases on subqueries that are essentially just a filtered/selected version of a specific table). That can just lead to confusion.
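A minimal sketch of that advice, using the question's tables with distinct aliases (tgt and src are names I picked for illustration), so that each reference can only resolve one way:

```sql
CREATE TABLE #a (a int);
CREATE TABLE #b (a int);
INSERT INTO #a VALUES (1);
INSERT INTO #b VALUES (2);

-- tgt is defined only in the outer query and src only in the subquery,
-- so tgt.a is unambiguously #a.a and src.a is unambiguously #b.a.
UPDATE tgt
SET a = (SELECT src.a FROM #b AS src WHERE tgt.a = 1)
FROM #a AS tgt;

-- The single row in #a should now hold 2.
SELECT a FROM #a;

DROP TABLE #a;
DROP TABLE #b;
```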
In your first example there is no relationship between the outer and inner query, and so you are setting the value of column 'a' to the result of the inner query for every row in table #a. The inner query returns null, as there are no rows in #b which have the value of 1, so the column a in #a is set to null.
In your second example, you are still not providing a relationship between the inner and outer query. All the inner query is doing is selecting every value from #b, because for every row in #b, the value of #a.a is 1. You might just as well have (select a from #b) as your inner query.
The reason that #a.a gets set to 2 is that there is only one row in the #b table, and its value is 2. If there were multiple rows in #b, the subquery would return more than one value, and SQL Server would raise an error instead of picking one of them, so the update would not execute.
Either way these are not very good pieces of SQL IMHO.

T-SQL cursor and update

I use a cursor to iterate through quite a big table. For each row I check if value from one column exists in other.
If the value exists, I would like to increase value column in that other table.
If not, I would like to insert there new row with value set to 1.
I check "if exists" by:
IF (SELECT COUNT(*) FROM otherTabe WHERE... > 1)
BEGIN
...
END
ELSE
BEGIN
...
END
I don't know how to get that row which was found and update value. I don't want to make another select.
How can I do this efficiently?
I assume that the method of checking described above isn't good for this case.
Depending on the size of your data and the actual condition, you have two basic approaches:
1) use MERGE
MERGE TOP (...) INTO table1
USING table2 ON table1.column = table2.column
WHEN MATCHED
THEN UPDATE SET table1.counter += 1
WHEN NOT MATCHED BY TARGET
THEN INSERT (...) VALUES (...);
The TOP is needed because when you're doing a huge update like this (you mention the table is 'big'; big is relative, but let's assume truly big, 100MM+ rows) you have to batch the updates. Otherwise you'll overwhelm the transaction log with one single gigantic transaction.
2) Use a cursor, as you are trying. Your original question can be easily solved: simply always update and then check the count of rows updated:
UPDATE table
SET column += 1
WHERE ...;
IF @@ROWCOUNT = 0
BEGIN
-- no match, insert new value
INSERT INTO table (...) VALUES (...);
END
Note that this approach is dangerous though because of race conditions: there is nothing to prevent another thread from inserting the value concurrently, so you may end up with either duplicates or a constraint violation error (preferably the latter...).
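One commonly suggested way to narrow that race, sketched here under assumptions: run the update and the insert in one transaction and take a key-range lock with the UPDLOCK and SERIALIZABLE table hints. The table #counters and the variable @value are hypothetical stand-ins; a temp table is used only so the sketch runs on its own, whereas the real target would be a permanent table shared between sessions.

```sql
-- Hypothetical counter table keyed by the value being counted.
CREATE TABLE #counters (value int NOT NULL PRIMARY KEY, cnt int NOT NULL);
DECLARE @value int = 42;

BEGIN TRANSACTION;

-- The hints hold a lock on the key (or the gap where it would be) until commit,
-- so a concurrent session cannot insert the same value in between.
UPDATE #counters WITH (UPDLOCK, SERIALIZABLE)
SET cnt += 1
WHERE value = @value;

IF @@ROWCOUNT = 0
BEGIN
    -- No existing row was updated, so create it with a count of 1.
    INSERT INTO #counters (value, cnt) VALUES (@value, 1);
END

COMMIT TRANSACTION;

SELECT * FROM #counters;
DROP TABLE #counters;
```

A primary key or unique constraint on the counted value is still worth having, so that any remaining race surfaces as an error instead of a silent duplicate.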
This is just pseudocode because I have no idea of your table structure, but I think you will understand. Basically, update the columns you want, then insert the columns you need. A cursor operation sounds unnecessary.
Update OtherTable
Set ColumnToIncrease = ColumnToIncrease + 1
FROM CurrentTable Where ColumnToCheckValue is not null
Insert Into OtherTable (ColumnToIncrease, Field1, Field2,...)
SELECT
1,
?
?
FROM CurrentTable Where ColumnToCheckValue is not null
Without a sample, I think this is the best I can do. Bottom line: you don't need a cursor. UPDATE where a match exists (INNER JOIN) and INSERT where one does not.
UPDATE otherTable
SET IncrementingColumn = IncrementingColumn + 1
FROM thisTable INNER JOIN otherTable ON thisTable.ID = otherTable.ID
INSERT INTO otherTable
(
ID
, IncrementingColumn
)
SELECT ID, 1
FROM thisTable
WHERE NOT EXISTS (SELECT *
FROM otherTable
WHERE thisTable.ID = otherTable.ID)
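One caveat with the set-based version: an UPDATE with a join changes each target row only once, even if several source rows match it, whereas the cursor would increment once per source row. If the source can contain the same ID more than once, aggregating first keeps the counts in line. A sketch with hypothetical temp tables and data:

```sql
-- Hypothetical stand-ins for thisTable and otherTable.
CREATE TABLE #thisTable (ID int NOT NULL);
CREATE TABLE #otherTable (ID int NOT NULL PRIMARY KEY, IncrementingColumn int NOT NULL);
INSERT INTO #thisTable VALUES (1), (1), (2), (3), (3), (3);
INSERT INTO #otherTable VALUES (1, 5);

-- Existing IDs: add the number of matching source rows, not just 1.
WITH counts AS (
    SELECT ID, COUNT(*) AS n
    FROM #thisTable
    GROUP BY ID
)
UPDATE o
SET IncrementingColumn = o.IncrementingColumn + c.n
FROM #otherTable AS o
INNER JOIN counts AS c ON c.ID = o.ID;

-- New IDs: insert one row per ID, starting at the number of occurrences.
-- This runs after the UPDATE, so freshly inserted rows are not incremented twice.
INSERT INTO #otherTable (ID, IncrementingColumn)
SELECT t.ID, COUNT(*)
FROM #thisTable AS t
WHERE NOT EXISTS (SELECT * FROM #otherTable AS o WHERE o.ID = t.ID)
GROUP BY t.ID;

-- Expected rows: (1, 7), (2, 1), (3, 3).
SELECT * FROM #otherTable ORDER BY ID;

DROP TABLE #thisTable;
DROP TABLE #otherTable;
```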
I think you'd be better off using a view for this -- then it's always up to date, no risk of mistakenly double/triple/etc counting:
CREATE VIEW vw_value_count AS
SELECT st.value,
COUNT(*) AS numValue
FROM SOME_TABLE st
GROUP BY st.value
But if you still want to use the INSERT/UPDATE approach:
IF EXISTS(SELECT NULL
FROM SOMETABLE WHERE ... > 1)
BEGIN
UPDATE TABLE
SET count = count + 1
WHERE value = @value
END
ELSE
BEGIN
INSERT INTO TABLE
(value, count)
VALUES
(@value, 1)
END
What about an UPDATE statement with an inner join to perform the +1, and an INSERT of the selected rows that do not exist in the first table?
Provide the table schemas and the columns you want to check and update so I can help.
Regards.