INSERT INTO SELECT with LEFT JOIN not preventing duplicates for simultaneous hits

I have this SQL query that inserts records from one table into another without duplicates.
It works fine if I call it from a single instance of my application. But in production the application is horizontally scaled, with more than one instance calling the query below at the same time, and that is producing duplicate records. Is there any way to fix this query so that it tolerates simultaneous calls?
INSERT INTO table1 (col1, col2)
SELECT DISTINCT TOP 10
t2.col1,
t2.col2
FROM
table2 t2
LEFT JOIN
table1 t1 ON t2.col1 = t1.col1
AND t2.col2 = t1.col2
WHERE
t1.col1 IS NULL

The corrective action here depends on the behavior you want. If you intend to allow just a single instance of your application to execute this query at a time, then you need to create a critical section that only one instance is allowed to enter. Since you are already using SQL Server, you could implement this by forcing each instance to acquire a lock on a certain table. Only the instance that gets the lock will execute the query, and the others will drop off.
If, on the other hand, you really want each instance to execute the query, then you should use a serializable transaction. A serializable transaction makes the check for missing rows and the insert behave as if the instances had run one after another, so two or more instances cannot interleave and insert the same rows.
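One possible way to sketch such a critical section is an application lock taken with sp_getapplock, which is swapped in here in place of a lock on a real table. The resource name 'table1_sync' is a made-up name for this example, and the zero timeout (give up immediately if another instance holds the lock) is an assumption about the desired behavior.

```sql
-- Sketch: only the instance that obtains the application lock runs the insert.
-- 'table1_sync' is a hypothetical resource name chosen for this example.
BEGIN TRANSACTION;

DECLARE @lockResult int;

EXEC @lockResult = sp_getapplock
    @Resource    = 'table1_sync',
    @LockMode    = 'Exclusive',
    @LockOwner   = 'Transaction',
    @LockTimeout = 0;            -- do not wait; fail at once if the lock is busy

IF @lockResult >= 0              -- 0 or 1 means the lock was granted
BEGIN
    INSERT INTO table1 (col1, col2)
    SELECT DISTINCT TOP 10 t2.col1, t2.col2
    FROM table2 t2
    LEFT JOIN table1 t1
        ON t2.col1 = t1.col1
       AND t2.col2 = t1.col2
    WHERE t1.col1 IS NULL;
END

-- Committing releases a transaction-owned application lock.
COMMIT TRANSACTION;
```

Instances that get a negative return value simply skip the insert, which matches the "drop off" behavior described above.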
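A minimal sketch of the serializable variant, reusing the query from the question. The UPDLOCK hint on table1 is an addition that is not part of the original answer. It is meant to make concurrent sessions queue up instead of deadlocking, but a session can still be picked as a deadlock victim (error 1205), so the caller should be prepared to retry.

```sql
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

BEGIN TRANSACTION;

INSERT INTO table1 (col1, col2)
SELECT DISTINCT TOP 10 t2.col1, t2.col2
FROM table2 t2
LEFT JOIN table1 t1 WITH (UPDLOCK)   -- assumption: added to reduce deadlocks
    ON t2.col1 = t1.col1
   AND t2.col2 = t1.col2
WHERE t1.col1 IS NULL;

COMMIT TRANSACTION;
```

The locks taken by the read are held until COMMIT, so a second instance cannot insert the same (col1, col2) pairs in between. An index on table1 (col1, col2) should keep the locked range narrow; without one, the lock may cover far more of the table.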

Related

Use SQL to detect changes between tables

I want to create a SQL script that compares the same fields in two different tables, which may be on two different servers. I want to use the script to check that when a field is updated in one table/server, it is also updated in the other. Any ideas on how to approach this?

The first thing to make sure of is that your servers are linked, otherwise you won't easily be able to compare the two. If the servers are linked and the tables have the same structure, you can use an EXCEPT query to identify the changes, e.g.
select * from [server1].[db].[schema].[table]
except
select * from [server2].[db].[schema].[table]
This query returns all rows from the table on server1 that don't appear on server2. From here you can either wrap it in a count or insert/update the missing/changed rows from one table into the other.
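EXCEPT is one-directional, so rows that exist only on server2 are not reported by the query above. A sketch that runs the comparison both ways and labels each row with its origin is shown here; the found_in column is a name made up for this example.

```sql
-- Rows on server1 that are missing from (or different on) server2
SELECT 'server1 only' AS found_in, d.*
FROM (
    SELECT * FROM [server1].[db].[schema].[table]
    EXCEPT
    SELECT * FROM [server2].[db].[schema].[table]
) AS d
UNION ALL
-- Rows on server2 that are missing from (or different on) server1
SELECT 'server2 only' AS found_in, d.*
FROM (
    SELECT * FROM [server2].[db].[schema].[table]
    EXCEPT
    SELECT * FROM [server1].[db].[schema].[table]
) AS d;
```

A changed row shows up twice, once per side, while a row that exists on only one server shows up once.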
Identifying whether the rows have changed or been inserted will rely on using a primary key, with that you can join one table to another and identify what needs updating using a query like so:
select *
from [server1].[db].[schema].[table] t1
inner join [server2].[db].[schema].[table] t2 on t1.id = t2.id
where ( t1.col1 <> t2.col1 or t1.col2 <> t2.col2 ... )
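One caveat with the <> comparison is that it evaluates to unknown when either side is NULL, so a change from NULL to a value (or back) is not reported. A possible NULL-safe variant, sketched here with the same placeholder names and only two compared columns, relies on EXCEPT treating NULLs as equal:

```sql
SELECT t1.id
FROM [server1].[db].[schema].[table] t1
INNER JOIN [server2].[db].[schema].[table] t2 ON t1.id = t2.id
WHERE EXISTS (
    -- Yields a row only when the two column lists differ, NULLs included
    SELECT t1.col1, t1.col2
    EXCEPT
    SELECT t2.col1, t2.col2
);
```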
Another way of tracking changes is to use a DML trigger and have this propagate changes from one table to another.
I was working on a SQL Server auditing tool that uses these principles. Have a look through the code if you like; it's not 100% working: https://github.com/simtbak/panko/blob/main/archive/Panko%20v003.sql

Trying to use cursor on one database using select from another db

So I'm trying to wrap my head around cursors. I have a task to transfer data from one database to another, but they have slightly different schemas. Let's say I have TableOne (Id, Name, Gold) and TableTwo (Id, Name, Lvl). I want to take all records from TableTwo and insert them into TableOne, but there can be duplicated data in the Name column. So if a record from TableTwo already exists in TableOne (comparing on the Name column), I want to skip it; if it doesn't, I want to create a record in TableOne with a unique Id.
I was thinking about looping over each record in TableTwo and, for every record, checking whether it exists in TableOne. So how do I make this check without calling the other database every time? I wanted to first select all records from TableOne, save them into a variable, and check against this variable inside the loop. Is this even possible in SQL? I'm not very familiar with SQL, so a code sample would help a lot.
I'm using Microsoft SQL Server Management Studio, if that matters. And of course, TableOne and TableTwo exist in different databases.

Try this:
INSERT INTO table1 (id, name, gold)
SELECT id, name, lvl FROM table2
WHERE table2.name NOT IN (SELECT t1.name FROM table1 t1)
If you want to generate a new id for every row, you can try:
INSERT INTO table1 (id, name, gold)
-- ISNULL covers the case where table1 is still empty and MAX returns NULL
SELECT ISNULL((SELECT MAX(m.id) FROM table1 m), 0) + ROW_NUMBER() OVER (ORDER BY t2.id), name, lvl FROM table2 t2
WHERE t2.name NOT IN (SELECT t1.name FROM table1 t1)

It is possible, yes, but I would not recommend it. Looping (which is essentially what a cursor does) is usually not advisable in SQL when a set-based operation will do.
At a high level, you probably want to join the two tables together (the fact that they're in different databases shouldn't make a difference). You mention one table has duplicates. You can eliminate those in a number of ways, such as using a GROUP BY or a ROW_NUMBER. Both approaches require you to understand which rows you want to "pick" and which ones you want to "ignore". You could also do what another user posted in a comment, where you do an existence check against the target table using a correlated subquery. That essentially means that if the target table already contains rows matching the ones you're trying to insert, none of those duplicates will be put in.
As far as cursors are concerned, to do something like this you'd be doing essentially the same thing, except that on each pass of the cursor you would be temporarily assigning and using variables instead of columns. This approach is sometimes called RBAR (for "Row by Agonizing Row"). On every pass of the cursor or loop, it has to re-open the table, figure out what data it needs, then operate on it. Even if that's efficient and it's only pulling back one row, there's still a lot of overhead in doing that query. So while, yes, you can force SQL to do what you've described, the database engine already has an operation for this (joins) which does it far faster than any loop you could conceivably write.
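The set-based approach described above could look roughly like the sketch below. DbOne, DbTwo and the dbo schema are hypothetical names, Id is assumed not to be an IDENTITY column, and Gold is filled with 0 because TableTwo has no matching column. Adjust these to the real schema.

```sql
INSERT INTO DbOne.dbo.TableOne (Id, Name, Gold)
SELECT
    -- Continue numbering after the current highest Id (0 if the table is empty)
    ISNULL((SELECT MAX(Id) FROM DbOne.dbo.TableOne), 0)
        + ROW_NUMBER() OVER (ORDER BY src.Id),
    src.Name,
    0                                     -- assumed default for Gold
FROM (
    -- Number the rows per Name so that only one row per Name is kept
    SELECT Id, Name,
           ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Id) AS rn
    FROM DbTwo.dbo.TableTwo
) AS src
WHERE src.rn = 1
  AND NOT EXISTS (
        -- Skip names that are already present in the target table
        SELECT 1
        FROM DbOne.dbo.TableOne AS t1
        WHERE t1.Name = src.Name
  );
```

The three-part names (database.schema.table) are what let a single statement read from one database and write to the other on the same server.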

T-Sql Algorithm Question

I have a T-SQL statement as follows:
Insert into Table1
Select * From Table2
I want to know the order of execution. Does the insert wait for the select statement to finish before starting, or does it start as soon as the select statement begins returning rows and keep consuming new records as the select produces them?
This is a plain stored procedure and no transactions are used.

What you have there is effectively a single statement. It will only insert into Table1 the records that were present in Table2 when you began the insert. Otherwise the properties of ACID would not apply and you'd have issues with isolation (what if the sp is run twice concurrently?) and durability. SQL Server will enforce this via locking.

To echo @CodeByMoonlight's answer, and to address your comment there: the physical considerations (including the specifics of locking) are always subordinate to the logical instructions specified by the query.
In the processing of an INSERT ... SELECT statement, logically speaking the SELECT is carried out to produce a resultset, and then the rows of this resultset are INSERTed. The fact that in this case the source table and the target table are the same table is irrelevant. I'm fairly sure that specifying NOLOCK or TABLOCK would in any case apply only to the SELECT, if that's where you position them.
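As a small illustration of that logical ordering, here is a sketch using a throwaway table variable (@T is a made-up name). The SELECT only sees the rows that existed when the statement started, so the table is doubled once rather than growing without end.

```sql
DECLARE @T TABLE (n int);

INSERT INTO @T (n) VALUES (1), (2), (3);

-- Source and target are the same table
INSERT INTO @T (n)
SELECT n FROM @T;

SELECT COUNT(*) AS row_count FROM @T;   -- 6
```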
Consider as another example this statement, which makes no sense if you read it in an 'imperative' way:
UPDATE SomeTable
SET Column1 = Column2, Column2 = Column1
With an imperative, rather than set-based, understanding, this statement might look as if it will result in Column1 and Column2 having the same value for all rows. But it doesn't - it in fact swaps the values in Column1 and Column2. Only by understanding that the logical instructions of the query dictate what actually happens can this be seen.
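The swap is easy to verify with a table variable. The sketch below uses made-up sample values:

```sql
DECLARE @SomeTable TABLE (Column1 int, Column2 int);

INSERT INTO @SomeTable (Column1, Column2) VALUES (1, 2), (3, 4);

-- Both assignments read the values the row had before the update
UPDATE @SomeTable
SET Column1 = Column2, Column2 = Column1;

SELECT Column1, Column2 FROM @SomeTable;   -- rows (2, 1) and (4, 3)
```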

Query times out in SP, but runs fine in query analyzer

I seem to be facing a strange issue in SQL 2008.
I have a query which runs fine and fast from query analyser, but times out if run through a stored procedure! The SP just starts with this query and has no other code before this query
SELECT col1,col2 FROM TBL1 (nolock)
INNER JOIN TBL2 (nolock)
ON tbl1.col=LEFT(tbl2.col1,LEN(tbl2.col1)-2) AND tbl1.col2=RIGHT(tbl2.col1,2)
AND tbl1.col4=2233
AND tbl1.date1 BETWEEN tbl2.date1 and isnull(tbl2.date2,getdate())
Please note that tbl1 is actually a view, where the col and col2 are coming via a self join. Also as per business requirement, tbl2.col1 needs to have concatenated value. If required to solve this issue, I can modify my view though.

As a side issue, please note that if you can make some assumptions about string lengths, your join expression can be simplified (and possibly get better performance because one side is now using equality):
SELECT tbl1.col1, tbl1.col2
FROM
TBL1
INNER JOIN TBL2
ON tbl1.col + tbl1.col2 = tbl2.col1
AND tbl1.col4=2233
AND tbl1.date1 BETWEEN tbl2.date1 AND Coalesce(tbl2.date2, GetDate())
Also, if you're looking for the best performance, try this:
ALTER TABLE TBL2 ADD LeftPart AS (LEFT(col1, LEN(col1)-2));
ALTER TABLE TBL2 ADD RightPart AS (RIGHT(col1, 2));
CREATE NONCLUSTERED INDEX IX_TBL2_Parts ON TBL2 (LeftPart, RightPart);
Now you can just join like so:
SELECT tbl1.col1, tbl1.col2
FROM
TBL1
INNER JOIN TBL2
ON tbl1.col = tbl2.LeftPart
AND tbl1.col2 = tbl2.RightPart
AND tbl1.col4=2233
AND tbl1.date1 BETWEEN tbl2.date1 AND Coalesce(tbl2.date2, GetDate())
Even better, change your database design to actually store the TBL2.col1 data in two columns. You're violating first normal form by putting two distinct pieces of data in one column, and now, as you're discovering, you're paying for it throughout your application in terms of performance, development & maintenance time, query complexity, and so on.
You could even reverse my scheme so that the LeftPart and RightPart columns are real, and you create a new calculated column that has the Col1 name, with an index to materialize the values and make them searchable. Finally, if absolutely required, you could rename the table, create a view on the table using the old name, and then put INSTEAD OF triggers on the view to intercept data operations against the table and translate them into the correct schema.
Update
By the way, if you have any influence on table design you may want to consider using an "open ended date" value of '99991231' or some such for tbl2.date2 rather than NULLs. That Coalesce can kill performance, sometimes forcing a scan when a seek would have been possible.
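A rough sketch of that change, assuming tbl2.date2 is a datetime column and that treating a missing end date as "never ends" is acceptable. Note that this is not identical to the original GetDate() upper bound, since rows with a future tbl1.date1 can now match.

```sql
-- One-off conversion of the existing NULLs to the sentinel value
UPDATE TBL2
SET date2 = '99991231'
WHERE date2 IS NULL;

-- The join no longer needs to wrap date2 in a function
SELECT tbl1.col1, tbl1.col2
FROM TBL1
INNER JOIN TBL2
    ON tbl1.col + tbl1.col2 = tbl2.col1
   AND tbl1.col4 = 2233
   AND tbl1.date1 BETWEEN tbl2.date1 AND tbl2.date2;
```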

I've run into similar situations in the past that were caused by parameter sniffing. You might try the method discussed in the article above and see if it makes a difference.
What happens is that when you run the stored procedure for the first time, SQL Server caches its execution plan and reuses it going forward. If you later run the stored proc with parameters for which this execution plan is not optimal, you can see the behavior you are describing.
You can also use the query hint recompile to make sure that it uses a new execution plan every time it executes. To do this you would add OPTION(RECOMPILE) to the end of your query:
SELECT id, name
FROM tableName
WHERE id BETWEEN @min AND @max
OPTION (RECOMPILE);
This link goes over several solutions for the parameter sniffing problem.
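One of the commonly cited alternatives to recompiling on every call is the OPTIMIZE FOR UNKNOWN hint (available from SQL Server 2008), which asks the optimizer to plan for average statistics rather than the first sniffed values. The procedure name dbo.GetRange and its parameters are hypothetical, chosen to match the query above.

```sql
CREATE PROCEDURE dbo.GetRange
    @min int,
    @max int
AS
BEGIN
    SET NOCOUNT ON;

    SELECT id, name
    FROM tableName
    WHERE id BETWEEN @min AND @max
    -- Plan for "typical" values instead of the ones passed on the first call
    OPTION (OPTIMIZE FOR UNKNOWN);
END
```

Whether this or OPTION (RECOMPILE) is the better fit depends on how often the procedure runs and how skewed the data is, so it is worth testing both.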

SQL Server locks - avoid insertion of duplicate entries

After reading a lot of articles and many answers related to the above subject, I am still wondering how the SQL Server database engine works in the following example:
Let's assume that we have a table named t3:
create table t3 (a int , b int);
create index test on t3 (a);
and a query as follows:
INSERT INTO T3
SELECT -86,-86
WHERE NOT EXISTS (SELECT 1 FROM t3 where t3.a=-86);
The query inserts a line in the table t3 after verifying that the row does not already exist based on the column "a".
Many articles and answers indicate that using the above query there is no way that a row will be inserted twice.
For the execution of the above query, I assume that the database engine works as follows:
The subquery is executed first.
The database engine sets a shared (S) lock on a range.
The data is read.
The shared lock is released. According to MSDN, a shared lock is released as soon as the data has been read.
If a row does not exist, it inserts a new line in the table.
The new line is locked with an exclusive (X) lock.
Now consider the following scenario:
The above query is executed by processor A (SPID 1).
The same query is executed by processor B (SPID 2).
[SPID 1] The database engine sets a shared (S) lock.
[SPID 1] The subquery reads the data. No rows are returned.
[SPID 1] The shared lock is released.
[SPID 2] The database engine sets a shared (S) lock.
[SPID 2] The subquery reads the data. No rows are returned.
[SPID 2] The shared lock is released.
Both processes proceed with a row insertion (and we get a duplicate entry).
Am I missing something? Is the above way a correct way for avoiding duplicate entries?
A safe way to avoid duplicate entries is using the code below, but I am just wondering whether the above method is correct.
begin tran
if not exists (SELECT 1 FROM t3 with (updlock) where t3.a=-86)
begin
INSERT INTO T3
SELECT -86,-86
end
commit

If you just have a unique constraint on the column, you'll never have duplicates.
The technique you've outlined merely saves you from having to catch an error or exception when the second "simultaneous" operation fails.
I'd like to add that relying on "outer" code (even T-SQL) to enforce your database consistency is not a great idea. In all cases, using declarative referential integrity at the table level is important for the database to ensure consistency and matching expectations, regardless of whether application code is written well or not. As in security, you need to utilize a strategy of defense in depth - constraints, unique indexes, triggers, stored procedures, and views can all assist in making a multi-layered approach to ensure the database presents a consistent and reliable interface to the application or system.
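A minimal sketch of that combination for the t3 table from the question: the unique constraint is the real guarantee, and the NOT EXISTS check plus the CATCH block only keep the losing session from surfacing an error. The constraint name UQ_t3_a is made up for this example.

```sql
ALTER TABLE t3 ADD CONSTRAINT UQ_t3_a UNIQUE (a);

BEGIN TRY
    INSERT INTO t3 (a, b)
    SELECT -86, -86
    WHERE NOT EXISTS (SELECT 1 FROM t3 WHERE t3.a = -86);
END TRY
BEGIN CATCH
    -- 2627: unique constraint violation; 2601: duplicate key in a unique index.
    -- Those mean another session won the race, so they are ignored here.
    IF ERROR_NUMBER() NOT IN (2627, 2601)
    BEGIN
        DECLARE @msg nvarchar(2048) = ERROR_MESSAGE();
        RAISERROR(@msg, 16, 1);   -- re-raise anything unexpected
    END
END CATCH;
```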

To keep locks between multiple statements, they have to be wrapped in a transaction. In your example:
If not exists (SELECT 1 FROM t3 with (updlock) where t3.a=-86)
INSERT INTO T3 SELECT -86,-86
The update lock can be released before the insert is executed. This would work reliably:
begin transaction
If not exists (SELECT 1 FROM t3 with (updlock) where t3.a=-86)
INSERT INTO T3 SELECT -86,-86
commit transaction
Single statements are always wrapped in a transaction, so this would work too:
INSERT INTO T3 SELECT -86,-86
WHERE NOT EXISTS (SELECT 1 FROM t3 with (updlock) where t3.a=-86)
(This is assuming you have "implicit transactions" turned off, like the default SQL Server setting.)
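One caveat worth hedging: as far as I know, UPDLOCK on its own only locks rows that actually exist, so when no row with a = -86 is present there may be nothing to lock and two sessions can still both pass the check. A commonly suggested variant adds HOLDLOCK (serializable semantics for that table reference) so that the empty key range is locked as well. This is a sketch of that variant, not part of the original answer:

```sql
BEGIN TRANSACTION;

INSERT INTO t3 (a, b)
SELECT -86, -86
WHERE NOT EXISTS (
    -- UPDLOCK + HOLDLOCK: lock the key range for a = -86 until commit,
    -- even when no matching row exists yet
    SELECT 1
    FROM t3 WITH (UPDLOCK, HOLDLOCK)
    WHERE t3.a = -86
);

COMMIT TRANSACTION;
```

The index on column a from the question should keep the range lock narrow; without an index the lock is likely to cover much more of the table.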
