SQL Server.
I have a proc that takes a user-defined table type (READONLY) containing about 7,500 records. Using that table parameter, I run about 15 different delete statements:
delete from table1
where id in (select id from @table)

delete from table2
where id in (select id from @table)

delete from table3
where id in (select id from @table)

delete from table4
where id in (select id from @table)

....
This operation, as expected, takes a while (about 7-10 minutes). The id columns are indexed. However, I suspect there is a more efficient way to do this. I know deletes are traditionally slower, but I wasn't expecting them to be this slow.
Is there a better way to do this?
You can test/try "EXISTS" instead of "IN". I really don't like IN clauses for anything besides casual lookup queries. (Some people will argue about IN until they are blue in the face.)
Delete deleteAlias
from table1 deleteAlias
where exists ( select null from @table vart where vart.Id = deleteAlias.Id )
You can populate a #temp table instead of a @variableTable. Again, over the years, this has been trial and error. @variable vs. #temp, most of the time, doesn't make that big of a difference. But in about 4 situations I've had, going to a #temp table made a big impact.
You can also experiment with putting an index on the #temp table (on the "joining" column, 'Id' in this example).
IF OBJECT_ID('tempdb..#Holder') IS NOT NULL
begin
drop table #Holder
end
CREATE TABLE #Holder
(ID INT )
/* simulate your insert */
INSERT INTO #HOLDER (ID)
select 1 union all select 2 union all select 3 union all select 4
/* optional: create an index on the "join" column of the #temp table;
   a clustered index is also worth trying:
   CREATE CLUSTERED INDEX IDX_TempHolder_ID ON #Holder (ID) */
CREATE INDEX IDX_TempHolder_ID ON #Holder (ID)
Delete deleteAlias
from table1 deleteAlias
where exists ( select null from #Holder holder where holder.Id = deleteAlias.Id )
IF OBJECT_ID('tempdb..#Holder') IS NOT NULL
begin
drop table #Holder
end
IMHO, there is no clear-cut answer; sometimes you've got to experiment a little.
And "how your tempdb is set up" is a huge fork in the road that can affect #temp table performance. But try the suggestions above first.
And one last experiment
Delete deleteAlias
from table1 deleteAlias
where exists ( select 1 from @table vart where vart.Id = deleteAlias.Id )
Change the null to "1"... once, I saw this affect something. Weird, right?
Related
The below SQL Server code successfully calculates the monthly pay for all employees and inserts it, along with their StaffID, into Tablepayroll.
INSERT INTO Tablepayroll (StaffID,Totalpaid)
(SELECT Tabletimelog.StaffID , Tabletimelog.hoursworked * Tablestaff.hourlypay
FROM Tabletimelog
JOIN Tablestaff ON
Tabletimelog.StaffID = Tablestaff.StaffID)
However, I also want to insert a batchID so that you can identify each time the above insert has been run and which records it inserted at that time. That means all staff payroll calculated at the same time would share the same batchID number, and each subsequent batchID should just increase by 1.
I think that SELECT MAX(batch_id) + 1 would work, but I don't know how to include it in the INSERT statement.
You can use a subquery to find the latest batch_id in your current table, like this:
INSERT INTO TablePayroll (StaffID, TotalPaid, batch_id)
SELECT T1.StaffID
, T1.HoursWorked * T2.HourlyPay
, ISNULL((SELECT MAX(batch_id) FROM TablePayRoll), 0) + 1 AS batch_id
FROM TableTimeLog AS T1
INNER JOIN TableStaff AS T2
ON T1.StaffID = T2.StaffID;
As you can see, I just add 1 to the current MAX(batch_id) and that's it.
By the way, learn to use aliases. It will make your life easier.
Yet another solution would be having your batch_id as a GUID, so you wouldn't have to create sequences or get MAX(batch_id) from the current table.
DECLARE @batch_id UNIQUEIDENTIFIER = NEWID();
INSERT INTO TablePayroll (StaffID, TotalPaid, batch_id)
SELECT T1.StaffID, T1.HoursWorked * T2.HourlyPay, @batch_id
FROM TableTimeLog AS T1
INNER JOIN TableStaff AS T2
ON T1.StaffID = T2.StaffID;
Updated
First of all, obtaining the maximum value in a large table (based on the name of the table, it must be big) can be very expensive, especially if there is no index on the batch_id column.
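A minimal sketch of such an index (the name is invented; adjust to your conventions):

-- Hypothetical index so that MAX(batch_id) becomes a cheap index seek instead of a scan
CREATE NONCLUSTERED INDEX IX_TablePayroll_batch_id
    ON dbo.TablePayroll (batch_id);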
Secondly, pay attention: your SELECT MAX(batch_id) + 1 solution may behave incorrectly when you have concurrent inserts. The solution from @EvaldasBuinauskas, without an open transaction and the right isolation level, can also produce the same batch_id if you run two inserts at the same time in parallel.
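If you do stick with MAX(batch_id) + 1, one common mitigation (a sketch, not a guarantee for every workload) is to read the maximum under UPDLOCK/HOLDLOCK inside the same transaction as the insert, so parallel callers serialize:

BEGIN TRANSACTION;

DECLARE @batch_id INT;

-- UPDLOCK + HOLDLOCK makes concurrent readers of the maximum wait,
-- so two parallel inserts cannot both compute the same batch_id
SELECT @batch_id = ISNULL(MAX(batch_id), 0) + 1
FROM dbo.TablePayroll WITH (UPDLOCK, HOLDLOCK);

INSERT INTO dbo.TablePayroll (StaffID, TotalPaid, batch_id)
SELECT T1.StaffID, T1.HoursWorked * T2.HourlyPay, @batch_id
FROM dbo.TableTimeLog AS T1
INNER JOIN dbo.TableStaff AS T2
    ON T1.StaffID = T2.StaffID;

COMMIT TRANSACTION;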
If your SQL Server is version 2012 or higher, you can try a SEQUENCE. This at least ensures that there are no duplicate batch_id values.
Creating SEQUENCE:
CREATE SEQUENCE dbo.BatchID
START WITH 1
INCREMENT BY 1 ;
-- DROP SEQUENCE dbo.BatchID
GO
And using it:
DECLARE @BatchID INT
SET @BatchID = NEXT VALUE FOR dbo.BatchID;

INSERT INTO Tablepayroll (StaffID, Totalpaid, batch_id)
(SELECT Tabletimelog.StaffID, Tabletimelog.hoursworked * Tablestaff.hourlypay, @BatchID
FROM Tabletimelog
JOIN Tablestaff ON Tabletimelog.StaffID = Tablestaff.StaffID)
An alternative to SEQUENCE may be an additional table:
CREATE TABLE dbo.Batch (
ID INT NOT NULL IDENTITY
CONSTRAINT PK_Batch PRIMARY KEY CLUSTERED
,DT DATETIME
CONSTRAINT DF_Batch_DT DEFAULT GETDATE()
);
This solution works even on older versions of the server.
DECLARE @BatchID INT

INSERT INTO dbo.Batch (DT)
VALUES (GETDATE());

SET @BatchID = SCOPE_IDENTITY();

INSERT INTO Tablepayroll (StaffID, Totalpaid, batch_id)
(SELECT Tabletimelog.StaffID, Tabletimelog.hoursworked * Tablestaff.hourlypay, @BatchID
FROM Tabletimelog ...
And yes, none of these solutions guarantees the absence of holes in the numbering. Gaps can happen when a transaction rolls back (after a deadlock, for example).
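For example, with the SEQUENCE approach, a rollback still consumes the value, leaving a permanent gap. A minimal repro sketch (the returned numbers are illustrative):

BEGIN TRANSACTION;
DECLARE @BatchID INT;
SET @BatchID = NEXT VALUE FOR dbo.BatchID;   -- suppose this returns 5
ROLLBACK TRANSACTION;                        -- the sequence does NOT roll back

DECLARE @NextID INT;
SET @NextID = NEXT VALUE FOR dbo.BatchID;    -- returns 6: batch 5 is a permanent hole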
I have some Delete statements in a stored procedure to delete some child records in other tables, and eventually delete the ID passed to the stored procedure.
I'm concerned about what happens if one of the SELECT statements used with the DELETE returns nothing: will the DELETE remove anything from that table?
Example
DELETE FROM [tblPurchases]
WHERE [ID] IN (SELECT [ID] FROM #PurchaseIDs)
In the case from your example, when SELECT [ID] FROM #PurchaseIDs returns nothing, nothing will be deleted from tblPurchases, because the condition ID IN (empty set) will never be met.
By the way, you can easily check this yourself, for example like this:
declare @t1 table (ID int)
insert into @t1 (ID)
select 1
union all
select 2
union all
select 3

declare @t2 table (ID int)
insert into @t2 (ID)
select 1

delete from @t1 where ID in (select ID from @t2 where ID > 1)

select * from @t1
The answer is no, nothing will be deleted. Nothing is "in" an empty collection, so no rows match the WHERE clause.
Here is my query:
SELECT final.* into #FinalTemp from
(
select * from #temp1
UNION
select * from #temp2
UNION
select * from #temp3
UNION
select * from #temp4
)final
But only one of these temp tables exists at any given time, so how do I check whether each #temp table exists and include it in the union if it does, or skip it otherwise?
You can't have a UNION or a query over a non-existent object, because the whole batch is compiled to a query plan just before execution.
So there is no way to refer to a non-existent table in the same batch.
The pattern you have to use looks like this; dynamic SQL runs as a separate batch:
IF OBJECT_ID('tempdb..#temp1') IS NOT NULL
    EXEC ('SELECT * FROM #temp1')
ELSE IF OBJECT_ID('tempdb..#temp2') IS NOT NULL
    EXEC ('SELECT * FROM #temp2')
ELSE IF OBJECT_ID('tempdb..#temp3') IS NOT NULL
    EXEC ('SELECT * FROM #temp3')
...
Would you not be better off creating #FinalTemp as an explicit temp table at the top of your query, and then replacing your existing population steps, which I assume look like this:
SELECT * INTO #temp1 FROM ... /* Rest of Query */
With:
INSERT INTO #FinalTemp (Columns...)
SELECT * FROM ... /* Rest of Query */
And then you don't have to do this final union step at all. Or, if you do need 4 separate temp tables (perhaps for multi-step operations on each), define each of them at the start of your query, and then they will all exist when you perform the union (see the sketch below).
Now, given you've said only one will be populated (so the others will be empty), it's probably moot, but I always tend to use UNION ALL to combine disjoint tables, unless you're implicitly relying on UNION's duplicate-removal feature.
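Putting those pieces together, here's a rough sketch of the pre-created-tables approach (column definitions invented for illustration):

-- Create all the temp tables up front so every reference compiles,
-- even if only one of them ends up populated
CREATE TABLE #temp1 (ID int, Val varchar(20));
CREATE TABLE #temp2 (ID int, Val varchar(20));
CREATE TABLE #temp3 (ID int, Val varchar(20));
CREATE TABLE #temp4 (ID int, Val varchar(20));

-- ... population steps here: only one table actually gets rows ...

SELECT final.* INTO #FinalTemp
FROM (
    SELECT * FROM #temp1
    UNION ALL
    SELECT * FROM #temp2
    UNION ALL
    SELECT * FROM #temp3
    UNION ALL
    SELECT * FROM #temp4
) final;

Since every table now exists at compile time, the batch compiles no matter which one was populated.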
You can declare Temp Tables using the same syntax as you do for real tables:
CREATE TABLE #FinalTemp (
ColumnA int not null primary key,
ColumnB varchar(20) not null,
ColumnC decimal(19,5) null
)
Or, as you've also alluded to, you can use table variables rather than temp tables:
declare @FinalTemp table (
ColumnA int not null primary key,
ColumnB varchar(20) not null,
ColumnC decimal(19,5) null
)
The predominant difference (as far as I'm concerned) is that table variables follow the same scoping rules as other variables: they're not available inside a called stored procedure, and they're cleaned up between batches. A small illustration follows.
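A sketch of that scoping difference (procedure and table names invented):

-- Hypothetical proc that reads a #temp table created by its caller
CREATE PROCEDURE dbo.usp_ReadScratch AS
    SELECT ID FROM #Scratch;
GO

CREATE TABLE #Scratch (ID int);
INSERT INTO #Scratch VALUES (1);

EXEC dbo.usp_ReadScratch;   -- works: the caller's #temp table is visible inside the proc

DECLARE @Scratch table (ID int);
-- dbo.usp_ReadScratch could never reference @Scratch: table variables are
-- scoped to the batch or procedure that declares them, and this one
-- disappears as soon as the batch ends.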
I am attempting to insert many records using T-SQL's MERGE statement, but my query fails to INSERT when there are duplicate records in the source table. The failure is caused by:
The target table has a Primary Key based on two columns
The source table may contain duplicate records that violate the target table's Primary Key constraint ("Violation of PRIMARY KEY constraint" is thrown)
I'm looking for a way to change my MERGE statement so that it either ignores duplicate records within the source table and/or will try/catch the INSERT statement to catch exceptions that may occur (i.e. all other INSERT statements will run regardless of the few bad eggs that may occur) - or, maybe, there's a better way to go about this problem?
Here's a query example of what I'm trying to explain. The example below will add 100k records to a temp table and then will attempt to insert those records in the target table -
EDIT
In my original post I only included two fields in the example tables, which led SO friends to suggest a DISTINCT solution to avoid duplicates in the MERGE statement. I should have mentioned that in my real-world problem the tables have 15 fields, and two of those 15 fields form a CLUSTERED PRIMARY KEY. So the DISTINCT keyword doesn't work, because I need to SELECT all 15 fields but ignore duplicates based on just two of them.
I have updated the query below to include one more field, col4. I need to include col4 in the MERGE, but I only need to make sure that the combination of col2 and col3 is unique.
-- Create the source table
CREATE TABLE #tmp (
col2 datetime NOT NULL,
col3 int NOT NULL,
col4 int
)
GO
-- Add a bunch of test data to the source table
-- For testing purposes, allow duplicate records to be added to this table
DECLARE @loopCount int = 100000
DECLARE @loopCounter int = 0
DECLARE @randDateOffset int
DECLARE @col2 datetime
DECLARE @col3 int
DECLARE @col4 int

WHILE (@loopCounter) < @loopCount
BEGIN
    SET @randDateOffset = RAND() * 100000
    SET @col2 = DATEADD(MI,@randDateOffset,GETDATE())
    SET @col3 = RAND() * 1000
    SET @col4 = RAND() * 10

    INSERT INTO #tmp
    (col2,col3,col4)
    VALUES
    (@col2,@col3,@col4);

    SET @loopCounter = @loopCounter + 1
END
END
-- Insert the source data into the target table
-- How do we make sure we don't attempt to INSERT a duplicate record? Or how can we
-- catch exceptions? Or?
MERGE INTO dbo.tbl1 AS tbl
USING (SELECT * FROM #tmp) AS src
ON (tbl.col2 = src.col2 AND tbl.col3 = src.col3)
WHEN NOT MATCHED THEN
INSERT (col2,col3,col4)
VALUES (src.col2,src.col3,src.col4);
GO
Solved to your new specification, only inserting the highest value of col4. This time I used a GROUP BY to prevent duplicate rows:
MERGE INTO dbo.tbl1 AS tbl
USING (SELECT col2,col3, max(col4) col4 FROM #tmp group by col2,col3) AS src
ON (tbl.col2 = src.col2 AND tbl.col3 = src.col3)
WHEN NOT MATCHED THEN
INSERT (col2,col3,col4)
VALUES (src.col2,src.col3,src.col4);
Given the source has duplicates and you aren't using MERGE fully, I'd use an INSERT.
INSERT dbo.tbl1 (col2,col3)
SELECT DISTINCT col2,col3
FROM #tmp src
WHERE NOT EXISTS (
SELECT *
FROM dbo.tbl1 tbl
WHERE tbl.col2 = src.col2 AND tbl.col3 = src.col3)
The reason MERGE fails is that the source isn't checked row by row: all non-matches are found first, and then it tries to INSERT all of them. It doesn't check for rows within the same batch that duplicate each other.
This reminds me a bit of the "Halloween problem", where the early data changes of an atomic operation affect its later data changes: it isn't correct.
Instead of GROUP BY you can use an analytic function, allowing you to select a specific record in the set of duplicate records to merge.
MERGE INTO dbo.tbl1 AS tbl
USING (
    SELECT *
    FROM (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY col2, col3 ORDER BY col4 DESC) AS Rn
        FROM #tmp
    ) t
    WHERE Rn = 1 -- choose exactly one record per (col2, col3); here, the one with the highest col4
) AS src
ON (tbl.col2 = src.col2 AND tbl.col3 = src.col3)
WHEN NOT MATCHED THEN
    INSERT (col2,col3,col4)
    VALUES (src.col2,src.col3,src.col4);
I want to delete many rows with the same set of field values from several (6) tables. I could do this by repeating the same subquery in a DELETE against every table (Solution 1), which is redundant because the subquery is identical every time; so instead I want to store the result of the subquery in a temporary table and delete the matching rows from each table (Solution 2). Which solution is the better one?
First solution:
DELETE FROM dbo.SubProtocols
WHERE ProtocolID IN (
    SELECT ProtocolID
    FROM dbo.Protocols
    WHERE WorkplaceID = @WorkplaceID
)

DELETE FROM dbo.ProtocolHeaders
WHERE ProtocolID IN (
    SELECT ProtocolID
    FROM dbo.Protocols
    WHERE WorkplaceID = @WorkplaceID
)

-- ...

DELETE FROM dbo.Protocols
WHERE WorkplaceID = @WorkplaceID
Second Solution:
DECLARE @Protocols table(ProtocolID int NOT NULL)

INSERT INTO @Protocols
SELECT ProtocolID
FROM dbo.Protocols
WHERE WorkplaceID = @WorkplaceID

DELETE FROM dbo.SubProtocols
WHERE ProtocolID IN (
    SELECT ProtocolID
    FROM @Protocols
)

DELETE FROM dbo.ProtocolHeaders
WHERE ProtocolID IN (
    SELECT ProtocolID
    FROM @Protocols
)

-- ...

DELETE FROM dbo.Protocols
WHERE WorkplaceID = @WorkplaceID
Is it possible to do Solution 2 without the subquery? Say, something like WHERE ProtocolID IN @Protocols (but syntactically correct)?
I am using Microsoft SQL Server 2005.
While you can avoid the subquery in SQL Server with a join, like so:
delete from sp
from subprotocols sp
inner join protocols p on
    sp.protocolid = p.protocolid
    and p.workplaceid = @workplaceid
You'll find that this doesn't really gain you any performance over either of your approaches. Generally, SQL Server 2005 optimizes a subquery like yours into an inner join anyway, since it doesn't depend on each outer row. Also, SQL Server will probably cache the subquery result in your case, so shoving it into a temp table is most likely unnecessary.
The first way, though, is susceptible to changes in Protocols between the deletes, where the second one isn't. Just something to think about.
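If that consistency matters and you keep Solution 1, a sketch: run the deletes inside one transaction at a strict isolation level, so the set of matching Protocols can't drift between statements:

-- SERIALIZABLE holds range locks until commit, so no new matching rows
-- can appear between the child deletes and the final delete
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRANSACTION;

DELETE sp
FROM dbo.SubProtocols sp
INNER JOIN dbo.Protocols p ON sp.ProtocolID = p.ProtocolID
WHERE p.WorkplaceID = @WorkplaceID;

DELETE ph
FROM dbo.ProtocolHeaders ph
INNER JOIN dbo.Protocols p ON ph.ProtocolID = p.ProtocolID
WHERE p.WorkplaceID = @WorkplaceID;

-- ... remaining child tables ...

DELETE FROM dbo.Protocols
WHERE WorkplaceID = @WorkplaceID;

COMMIT TRANSACTION;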
You can try this:
DELETE FROM dbo.ProtocolHeaders
FROM dbo.ProtocolHeaders INNER JOIN
dbo.Protocols ON ProtocolHeaders.ProtocolID = Protocols.ProtocolID
WHERE Protocols.WorkplaceID = @WorkplaceID
DELETE ... FROM is a T-SQL extension to the standard SQL DELETE that provides an alternative to using a subquery. From the help:
D. Using DELETE based on a subquery and using the Transact-SQL extension
The following example shows the Transact-SQL extension used to delete records from a base table that is based on a join or correlated subquery. The first DELETE statement shows the SQL-2003-compatible subquery solution, and the second DELETE statement shows the Transact-SQL extension. Both queries remove rows from the SalesPersonQuotaHistory table based on the year-to-date sales stored in the SalesPerson table.
-- SQL-2003 Standard subquery
USE AdventureWorks;
GO
DELETE FROM Sales.SalesPersonQuotaHistory
WHERE SalesPersonID IN
(SELECT SalesPersonID
FROM Sales.SalesPerson
WHERE SalesYTD > 2500000.00);
GO
-- Transact-SQL extension
USE AdventureWorks;
GO
DELETE FROM Sales.SalesPersonQuotaHistory
FROM Sales.SalesPersonQuotaHistory AS spqh
INNER JOIN Sales.SalesPerson AS sp
ON spqh.SalesPersonID = sp.SalesPersonID
WHERE sp.SalesYTD > 2500000.00;
GO
You would want, in your second solution, something like
-- untested!
DELETE FROM
dbo.SubProtocols -- ProtocolHeaders, etc.
FROM
dbo.SubProtocols
INNER JOIN @Protocols AS p ON SubProtocols.ProtocolID = p.ProtocolID
(Note the alias on @Protocols: you can't qualify columns with the variable name directly.)
However!!
Is it not possible to alter your design so that all the subsidiary protocol tables have a FOREIGN KEY with ON DELETE CASCADE pointing to the main Protocols table? Then you could just DELETE from Protocols and the rest would be taken care of...
edit to add:
If you already have FOREIGN KEYs set up, you would need to use DDL to alter them (I think a drop and recreate is required) in order to turn ON DELETE CASCADE on. Once that is in place, a DELETE from the main table will automatically DELETE the related records from the child tables.
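A sketch of that drop-and-recreate, with an invented constraint name (look up your real ones in sys.foreign_keys):

-- Hypothetical existing constraint; repeat for each child table
ALTER TABLE dbo.SubProtocols
    DROP CONSTRAINT FK_SubProtocols_Protocols;

ALTER TABLE dbo.SubProtocols
    ADD CONSTRAINT FK_SubProtocols_Protocols
    FOREIGN KEY (ProtocolID) REFERENCES dbo.Protocols (ProtocolID)
    ON DELETE CASCADE;

-- Now a single statement removes the protocols and all their child rows
DELETE FROM dbo.Protocols WHERE WorkplaceID = @WorkplaceID;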
Without the temp table you risk deleting different rows in the second delete; and the temp-table version takes three statements to do it.
You could instead delete from the first table and use the OUTPUT ... INTO clause to capture all the deleted IDs into a temp table, then join that temp table to delete from the second table. This guarantees you delete exactly the same keys, with only two statements.
declare @x table (RowID int identity(1,1) primary key, ValueData varchar(3))
declare @y table (RowID int identity(1,1) primary key, ValueData varchar(3))
declare @temp table (RowID int)

insert into @x values ('aaa')
insert into @x values ('bab')
insert into @x values ('aac')
insert into @x values ('bad')
insert into @x values ('aae')
insert into @x values ('baf')
insert into @x values ('aag')

insert into @y values ('aaa')
insert into @y values ('bab')
insert into @y values ('aac')
insert into @y values ('bad')
insert into @y values ('aae')
insert into @y values ('baf')
insert into @y values ('aag')

DELETE @x
OUTPUT DELETED.RowID
INTO @temp
WHERE ValueData like 'a%'

DELETE y
FROM @y y
INNER JOIN @temp t ON y.RowID = t.RowID

select * from @x
select * from @y
SELECT OUTPUT:
RowID ValueData
----------- ---------
2 bab
4 bad
6 baf
(3 row(s) affected)
RowID ValueData
----------- ---------
2 bab
4 bad
6 baf
(3 row(s) affected)