SQL: copying into a table in chunks?

My task is to create an index on a large table (~370 GB) in SQL Server. The plan is to:
1) create a new table with the same columns,
2) create a clustered index in the new table on three columns, and
3) copy the original data into the new table in small chunks (grouped by the three columns).
I can do 1) and 2) in SQL with the following script:
SELECT TOP 0 *
INTO js_sample_indexed
FROM dbo.js_sample
CREATE CLUSTERED INDEX domain_event_platform_idx
ON dbo.js_sample_indexed (domain ASC, event_type ASC, platform ASC)
GO
But I am stuck on the third step. Presumably there are thousands of distinct values of the index key; for example, one value might be ('Amazon', 'search', 'mobile').
So I might need to put a WHERE clause in a loop, updating the selection condition on each iteration.
But I'm stuck on how to store and retrieve the values in each column (e.g. 'domain') using SQL.
I don't know whether I've phrased this question clearly, but any comments would be helpful. Thanks!

I am assuming that there is an identity field of some sort (a sequentially numbered field used as an index) on the table. For this example, I will call this field ID. If this is true, then a simple looping construct will do what you need.
DECLARE @MinID int, @MaxID int, @Step int = 10000 -- Move 10k records per loop
SELECT @MinID = MIN(ID), @MaxID = MAX(ID)
FROM MyTableToCopyFrom
WHILE @MinID <= @MaxID
BEGIN
    INSERT INTO MyTableToCopyTo (Field1, Field2, Field3, Field4)
    SELECT Field1, Field2, Field3, Field4
    FROM MyTableToCopyFrom
    WHERE ID >= @MinID
      AND ID < @MinID + @Step
    SET @MinID = @MinID + @Step
END

So I came up with an answer after some reading and asking. Here is the code:
USE jumpshot_data
GO
DROP TABLE dbo.js_full_indexed_1
-- create a new table with existing structure
SELECT TOP 0 *
INTO dbo.js_full_indexed_1
FROM dbo.js_test
CREATE CLUSTERED INDEX domain_event_platform_idx
ON dbo.js_full_indexed_1 (domain ASC, event_type ASC, platform ASC)
GO
CREATE NONCLUSTERED INDEX device_id_idx
ON js_full_indexed_1 (device_id ASC);
-- using cursor to loop through meta-data table, and insert by chunk into the new table
DECLARE @event_type varchar(50)
DECLARE @platform varchar(50)
DECLARE @domain varchar(50)
DECLARE SelectionCursor CURSOR LOCAL FOR
SELECT * FROM dbo.js_index_info
OPEN SelectionCursor
FETCH NEXT FROM SelectionCursor INTO @event_type, @platform, @domain
WHILE (@@FETCH_STATUS = 0)
BEGIN
    -- operation at each row
    INSERT INTO dbo.js_full_indexed_1
    SELECT *
    FROM dbo.js_test
    WHERE event_type = @event_type AND domain = @domain AND platform = @platform
    -- loop condition
    FETCH NEXT FROM SelectionCursor INTO @event_type, @platform, @domain
END
CLOSE SelectionCursor
DEALLOCATE SelectionCursor
GO
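The meta-data table dbo.js_index_info referenced above is not shown being built; presumably it holds one row per distinct key combination, along the lines of this sketch (column order matching the cursor's FETCH list):
-- one row per distinct (event_type, platform, domain) combination
SELECT DISTINCT event_type, platform, domain
INTO dbo.js_index_info
FROM dbo.js_test
GO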

Related

Copying data from one table to another using Insert Into

I have two tables. They both have identical structures except table2 has an additional column. I currently copy data from table1 into table2 using a stored proc, as shown below.
However, due to the sheer number of records (20 million+) and the structure of the stored proc, this currently takes a couple of hours to run.
Does anyone have any suggestions on how to optimize the code?
CREATE PROCEDURE dbo.insert_period #period INT AS
DECLARE @batchsize INT
DECLARE @start INT
DECLARE @numberofrows INT
SELECT @numberofrows = COUNT(*) FROM daily_table
SET @batchsize = 150000
SET @start = 1
WHILE @start < @numberofrows
BEGIN
    INSERT INTO dbo.main_table WITH (TABLOCK) (
        col1,
        col2,
        ....,
        col26,
        time_period
    )
    SELECT *, @period FROM dbo.daily_table
    ORDER BY id
    OFFSET @start ROWS
    FETCH NEXT @batchsize ROWS ONLY
    SET @start += @batchsize + 1
END
The id that I am using here is not unique. The table itself does not have any keys or unique IDs.
First I would like to point out that the logic in your insert is flawed.
With @start starting at 1, you're always skipping the first row of the source table. Then adding 1 to it at the end of your loop causes it to skip another row on each subsequent pass. A corrected version is sketched below.
If you're set on using batched inserts, I suggest you read up on how they work over on MSSQLTips.
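A minimal sketch of that correction (only the changed lines differ; the INSERT body stays the same):
SET @start = 0 -- OFFSET 0 ROWS keeps the first row
WHILE @start < @numberofrows
BEGIN
    -- ... same INSERT ... SELECT ... OFFSET @start ROWS FETCH NEXT @batchsize ROWS ONLY ...
    SET @start += @batchsize -- advance by exactly one batch; no extra +1
END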
To help you with performance I would suggest taking a look at the following:
SELECT *
Remove the SELECT * and replace with the column names. This will help the optimizer get you a better query plan. Further reading on why SELECT * is bad can be found in this SO Question.
ORDER BY
That ORDER BY is probably slowing you down, although without seeing your query plan we cannot know for sure. Each time your loop executes, it queries the source table and has to sort all those records. Sorting 20+ million records that many times is a lot of work. Take a look at my simplified example below.
CREATE TABLE #Test (Id INT);
INSERT INTO #Test VALUES (1), (2), (3), (4), (5);
DECLARE @batchsize INT;
DECLARE @start INT;
DECLARE @numberofrows INT;
SELECT @numberofrows = COUNT(*) FROM #Test;
SET @batchsize = 2;
SET @start = 0;
WHILE @start < @numberofrows
BEGIN
    SELECT
        *
        , 10
    FROM
        #Test
    ORDER BY
        Id OFFSET @start ROWS FETCH NEXT @batchsize ROWS ONLY;
    SET @start += @batchsize;
END;
Below is a portion of the query plan produced by the sample. Notice the Sort operation (highlighted in yellow in the plan screenshot); its cost accounts for 78% of the query plan.
If we add an index on the Id column of the source table, the data is already sorted and we can eliminate the sort. Now when the loop runs it doesn't have to do any sorting.
CREATE INDEX ix_Test ON #Test (Id)
Other Options to Research
Columnstore Indexes (see the sketch below)
Batch Mode in RowStore
Parallel Inserts
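As one example of the columnstore option, a sketch reusing dbo.main_table from the question (whether it helps depends on the workload):
-- replaces the table's rowstore storage with a clustered columnstore index
CREATE CLUSTERED COLUMNSTORE INDEX cci_main_table ON dbo.main_table;
A columnstore target also enables batch-mode execution for the large scans these loops perform.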
You copy the table row by row; that's why it takes so long. The simplest way to achieve what you want is an INSERT combined with a SELECT statement. This way, you insert the data in one batch.
CREATE TABLE dbo.daily_table (id INT PRIMARY KEY IDENTITY,
value1 NVARCHAR(100) NULL,
value2 NVARCHAR(100) NULL);
GO
CREATE TABLE dbo.main_table (id INT PRIMARY KEY IDENTITY,
value1 NVARCHAR(100) NULL,
value2 NVARCHAR(100) NULL,
value3 NVARCHAR(100) NULL);
GO
INSERT INTO dbo.daily_table (value1, value2)
VALUES('1', '2');
-- Insert with Select
INSERT INTO dbo.main_table (value1, value2)
SELECT value1, value2
FROM dbo.daily_table;
Also, it's better not to use an asterisk in your 'SELECT' statement since the result could be unpredictable.

Generating dummy data from existing data set is slow using cursor

I'm trying to generate dummy data from the existing data I have in the tables. All I want is to increase the number of records in Table1 to a specified amount N. The other tables should grow based on the foreign key references.
The tables have one-to-many relationships. For one record in Table1, I can have multiple entries in Table2, and in Table3 I can have many records based on the IDs of the second table.
Since the IDs are primary keys, I capture them either by
SET @NEWLY_INSERTED_ID = SCOPE_IDENTITY()
after inserting into Table1, using it in the insert for Table2, or by inserting them into a temp table and joining to achieve the same result for Table3.
Here's the approach I'm taking with the CURSOR.
DECLARE @MyId AS INT;
DECLARE @myCursor AS CURSOR;
DECLARE @DESIRED_ROW_COUNT INT = 70000
DECLARE @ROWS_INSERTED INT = 0
DECLARE @CURRENT_ROW_COUNT INT = 0
DECLARE @NEWLY_INSERTED_ID INT
DECLARE @LANGUAGE_PAIR_IDS TABLE ( LangugePairId INT, NewId INT, SourceLanguage varchar(100), TargetLangauge varchar(100) )
WHILE (@ROWS_INSERTED < @DESIRED_ROW_COUNT)
BEGIN
    SET @myCursor = CURSOR FOR
        SELECT Id FROM MyTable
    SET @CURRENT_ROW_COUNT = (SELECT COUNT(ID) FROM MyTable)
    OPEN @myCursor;
    FETCH NEXT FROM @myCursor INTO @MyId;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        IF ((@CURRENT_ROW_COUNT < @DESIRED_ROW_COUNT) AND (@ROWS_INSERTED < @DESIRED_ROW_COUNT))
        BEGIN
            INSERT INTO [dbo].[MyTable]
                ([Column1]
                ,[Column2]
                ,[Column3]
                )
            SELECT
                convert(numeric(9,0), rand() * 899999999) + 100000000
                ,[Column2]
                ,[Column3]
            FROM MyTable
            WHERE Id = @MyId
            SET @NEWLY_INSERTED_ID = SCOPE_IDENTITY()
            INSERT INTO [dbo].[Language]
                ([MyTable1Id]
                ,[Target]
                ,[Source])
            OUTPUT inserted.Id, inserted.MyTable1Id, inserted.Source, inserted.[Target]
                INTO @LANGUAGE_PAIR_IDS (LangugePairId, NewId, SourceLanguage, TargetLangauge)
            SELECT
                @NEWLY_INSERTED_ID
                ,[Target]
                ,[Source]
            FROM [dbo].[Language]
            WHERE MyTable1Id = @MyId
            ORDER BY Id
            DECLARE @tbl AS TABLE (newLanguageId INT, oldLanguageId INT, sourceLanguage VARCHAR(100), targetLanguage VARCHAR(100))
            INSERT INTO @tbl (newLanguageId, oldLanguageId, sourceLanguage, targetLanguage)
            SELECT 0, id, [Source], [Target] FROM [dbo].[Language] WHERE MyTable1Id = @MyId ORDER BY Id
            UPDATE t
            SET t.newLanguageId = lp.LangugePairId
            FROM @tbl t
            JOIN @LANGUAGE_PAIR_IDS lp
                ON t.sourceLanguage = lp.SourceLanguage
                AND t.targetLanguage = lp.TargetLangauge
            INSERT INTO [dbo].[Manager]
                ([LanguagePairId]
                ,[UserId]
                ,[MyDate])
            SELECT
                tbl.newLanguageId
                ,m.[UserId]
                ,m.[MyDate]
            FROM Manager m
            INNER JOIN @tbl tbl
                ON m.LanguagePairId = tbl.oldLanguageId
            WHERE m.LanguagePairId IN (SELECT Id FROM [dbo].[Language] WHERE MyTable1Id = @MyId) -- returns the old language pair ids
            SET @ROWS_INSERTED += 1
            SET @CURRENT_ROW_COUNT += 1
        END
        ELSE
        BEGIN
            PRINT 'REACHED EXIT'
            SET @ROWS_INSERTED = @DESIRED_ROW_COUNT
            BREAK
        END
        FETCH NEXT FROM @myCursor INTO @MyId;
    END
    CLOSE @myCursor
    DEALLOCATE @myCursor
END
The above code works! It generates the data I need. However, it's very, very slow. Just to give some comparison: the initial load of data was ~60,000 records for Table1, ~74,000 for Table2 and ~3,400 for Table3.
I tried to insert 9,000 rows into Table1. With the above code, it took 17:05:01 to complete.
Any suggestion on how I can optimize the query to run a little faster? My goal is to insert 1-2 million records into Table1 without having to wait for days. I'm not tied to the CURSOR; I'm OK with achieving the same result in any other way possible.
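One set-based alternative, not from the original thread but a sketch reusing the hypothetical MyTable/Language names above: MERGE with an always-false match condition inserts every copy in one statement while capturing the old-to-new ID mapping, because OUTPUT on a MERGE (unlike on a plain INSERT) may reference source columns. Child tables can then be populated with joins through the mapping instead of per-row SCOPE_IDENTITY() calls.
DECLARE @map TABLE (OldId INT, NewId INT);
-- copy every MyTable row once, capturing old id -> new id
MERGE [dbo].[MyTable] AS tgt
USING (SELECT Id, [Column2], [Column3] FROM [dbo].[MyTable]) AS src
    ON 1 = 0 -- never matches, so every source row is inserted
WHEN NOT MATCHED THEN
    INSERT ([Column1], [Column2], [Column3])
    VALUES (CONVERT(NUMERIC(9,0), RAND(CHECKSUM(NEWID())) * 899999999) + 100000000,
            src.[Column2], src.[Column3])
OUTPUT src.Id, inserted.Id INTO @map (OldId, NewId);
-- child rows follow through the mapping, one set-based insert per table
INSERT INTO [dbo].[Language] ([MyTable1Id], [Target], [Source])
SELECT map.NewId, l.[Target], l.[Source]
FROM [dbo].[Language] AS l
JOIN @map AS map ON l.MyTable1Id = map.OldId;
Repeating this (or cross joining the source with a numbers table) multiplies the data up to the desired row count.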

Using a temp table with a stored procedure to cycle through IDs [duplicate]

How can one call a stored procedure for each row in a table, where the columns of a row are input parameters to the sp without using a Cursor?
Generally speaking I always look for a set-based approach (sometimes at the expense of changing the schema).
However, this snippet does have its place:
-- Declare & init (2008 syntax)
DECLARE @CustomerID INT = 0
-- Iterate over all customers
WHILE (1 = 1)
BEGIN
    -- Get next customerId
    SELECT TOP 1 @CustomerID = CustomerID
    FROM Sales.Customer
    WHERE CustomerID > @CustomerID
    ORDER BY CustomerID
    -- Exit loop if no more customers
    IF @@ROWCOUNT = 0 BREAK;
    -- call your sproc
    EXEC dbo.YOURSPROC @CustomerID
END
You could do something like this: order your table by e.g. CustomerID (using the AdventureWorks Sales.Customer sample table), and iterate over those customers using a WHILE loop:
-- define the last customer ID handled
DECLARE @LastCustomerID INT
SET @LastCustomerID = 0
-- define the customer ID to be handled now
DECLARE @CustomerIDToHandle INT
-- select the next customer to handle
SELECT TOP 1 @CustomerIDToHandle = CustomerID
FROM Sales.Customer
WHERE CustomerID > @LastCustomerID
ORDER BY CustomerID
-- as long as we have customers......
WHILE @CustomerIDToHandle IS NOT NULL
BEGIN
    -- call your sproc
    -- set the last customer handled to the one we just handled
    SET @LastCustomerID = @CustomerIDToHandle
    SET @CustomerIDToHandle = NULL
    -- select the next customer to handle
    SELECT TOP 1 @CustomerIDToHandle = CustomerID
    FROM Sales.Customer
    WHERE CustomerID > @LastCustomerID
    ORDER BY CustomerID
END
That should work with any table as long as you can define some kind of an ORDER BY on some column.
DECLARE @SQL varchar(max) = ''
-- MyTable has fields fld1 & fld2
SELECT @SQL = @SQL + 'exec myproc ' + convert(varchar(10), fld1) + ','
             + convert(varchar(10), fld2) + ';'
FROM MyTable
EXEC (@SQL)
Ok, so I would never put such code into production, but it does satisfy your requirements.
I'd use the accepted answer, but another possibility is to use a table variable to hold a numbered set of values (in this case just the ID field of a table) and loop through those by Row Number with a JOIN to the table to retrieve whatever you need for the action within the loop.
DECLARE @RowCnt int; SET @RowCnt = 0 -- Loop Counter
-- Use a table variable to hold numbered rows containing MyTable's ID values
DECLARE @tblLoop TABLE (RowNum int IDENTITY (1, 1) PRIMARY KEY NOT NULL,
                        ID INT)
INSERT INTO @tblLoop (ID) SELECT ID FROM MyTable
-- Vars to use within the loop
DECLARE @Code NVarChar(10); DECLARE @Name NVarChar(100);
WHILE @RowCnt < (SELECT COUNT(RowNum) FROM @tblLoop)
BEGIN
    SET @RowCnt = @RowCnt + 1
    -- Do what you want here with the data stored in @tblLoop for the given RowNum
    SELECT @Code = Code, @Name = LongName
    FROM MyTable INNER JOIN @tblLoop tL ON MyTable.ID = tL.ID
    WHERE tL.RowNum = @RowCnt
    PRINT Convert(NVarChar(10), @RowCnt) + ' ' + @Code + ' ' + @Name
END
Marc's answer is good (I'd comment on it if I could work out how to!)
Just thought I'd point out that it may be better to change the loop so the SELECT only exists once (in a real case where I needed to do this, the SELECT was quite complex, and writing it twice was a risky maintenance issue).
-- define the last customer ID handled
DECLARE @LastCustomerID INT
SET @LastCustomerID = 0
-- define the customer ID to be handled now
DECLARE @CustomerIDToHandle INT
SET @CustomerIDToHandle = 1
-- as long as we have customers......
WHILE @LastCustomerID <> @CustomerIDToHandle
BEGIN
    SET @LastCustomerID = @CustomerIDToHandle
    -- select the next customer to handle
    SELECT TOP 1 @CustomerIDToHandle = CustomerID
    FROM Sales.Customer
    WHERE CustomerID > @LastCustomerID
    ORDER BY CustomerID
    IF @CustomerIDToHandle <> @LastCustomerID
    BEGIN
        -- call your sproc
    END
END
If you can turn the stored procedure into a function that returns a table, then you can use cross-apply.
For example, say you have a table of customers, and you want to compute the sum of their orders, you would create a function that took a CustomerID and returned the sum.
And you could do this:
SELECT CustomerID, CustomerSum.Total
FROM Customers
CROSS APPLY ufn_ComputeCustomerTotal(Customers.CustomerID) AS CustomerSum
Where the function would look like:
CREATE FUNCTION ufn_ComputeCustomerTotal
(
    @CustomerID INT
)
RETURNS TABLE
AS
RETURN
(
    SELECT SUM(CustomerOrder.Amount) AS Total
    FROM CustomerOrder
    WHERE CustomerID = @CustomerID
)
Obviously, the example above could be done without a user defined function in a single query.
The drawback is that functions are very limited - many of the features of a stored procedure are not available in a user-defined function, and converting a stored procedure to a function does not always work.
For SQL Server 2005 onwards, you can do this with CROSS APPLY and a table-valued function.
Using CROSS APPLY in SQL Server 2005
Just for clarity, I'm referring to those cases where the stored procedure can be converted into a table valued function.
This is a variation on the answers already provided, but should perform better because it doesn't require ORDER BY, COUNT or MIN/MAX. The only disadvantage with this approach is that you have to create a temp table to hold all the IDs (the assumption is that you have gaps in your list of CustomerIDs).
That said, I agree with @Mark Powell that, generally speaking, a set-based approach should still be better.
DECLARE @tmp TABLE (Id INT IDENTITY(1,1) PRIMARY KEY NOT NULL, CustomerID INT NOT NULL)
DECLARE @CustomerId INT
DECLARE @Id INT = 0
INSERT INTO @tmp SELECT CustomerId FROM Sales.Customer
WHILE (1=1)
BEGIN
    SELECT @CustomerId = CustomerId, @Id = Id
    FROM @tmp
    WHERE Id = @Id + 1
    IF @@ROWCOUNT = 0 BREAK;
    -- call your sproc
    EXEC dbo.YOURSPROC @CustomerId;
END
This is a variation of n3rds' solution above. No sorting with ORDER BY is needed, as MIN() is used.
Remember that CustomerID (or whatever other numerical column you use for progress) must have a unique constraint. Furthermore, to make it as fast as possible, CustomerID must be indexed.
-- Declare & init
DECLARE @CustomerID INT = (SELECT MIN(CustomerID) FROM Sales.Customer); -- First ID
DECLARE @Data1 VARCHAR(200);
DECLARE @Data2 VARCHAR(200);
-- Iterate over all customers
WHILE @CustomerID IS NOT NULL
BEGIN
    -- Get data based on ID
    SELECT @Data1 = Data1, @Data2 = Data2
    FROM Sales.Customer
    WHERE CustomerID = @CustomerID;
    -- call your sproc
    EXEC dbo.YOURSPROC @Data1, @Data2
    -- Get next customerId
    SELECT @CustomerID = MIN(CustomerID)
    FROM Sales.Customer
    WHERE CustomerID > @CustomerID
END
I use this approach on some varchars I need to loop over, by putting them in a temporary table first to give them an ID.
If you don't want to use a cursor, I think you'll have to do it externally (get the table, then run a foreach statement and call the sp each time).
It is the same as using a cursor, only outside SQL.
Why won't you use a cursor?
I usually do it this way when it's quite a few rows:
Select all sproc parameters in a dataset with SQL Management Studio
Right-click -> Copy
Paste into Excel
Create single-row SQL statements with a formula like '="EXEC schema.mysproc @param=" & A2' in a new Excel column (where A2 is the Excel column containing the parameter)
Copy the list of Excel statements into a new query in SQL Management Studio and execute.
Done.
(On larger datasets I'd use one of the solutions mentioned above, though.)
DELIMITER //
CREATE PROCEDURE setFakeUsers (OUT output VARCHAR(100))
BEGIN
    -- define the last game ID handled and the bot user ID
    SET @LastGameID = 0;
    SET @userID = 0;
    SET @CurrentGameID = NULL;
    -- select the next game to handle
    SELECT @CurrentGameID := id
    FROM online_games
    WHERE id > @LastGameID
    ORDER BY id LIMIT 0,1;
    -- as long as we have games......
    WHILE (@CurrentGameID IS NOT NULL)
    DO
        -- set the last game handled to the one we just handled
        SET @LastGameID = @CurrentGameID;
        SET @CurrentGameID = NULL;
        -- select a random bot
        SELECT @userID := userID
        FROM users
        WHERE FIND_IN_SET('bot', baseInfo)
        ORDER BY RAND() LIMIT 0,1;
        -- update the game (using the ID saved before it was reset)
        UPDATE online_games SET userID = @userID WHERE id = @LastGameID;
        -- select the next game to handle
        SELECT @CurrentGameID := id
        FROM online_games
        WHERE id > @LastGameID
        ORDER BY id LIMIT 0,1;
    END WHILE;
    SET output = "done";
END;//
CALL setFakeUsers(@status);
SELECT @status;
A better solution for this is to
Copy/paste the code of the stored procedure
Join that code with the table for which you want to run it (once for each row)
This way you get a clean, table-formatted output. If you run the SP for every row, you get a separate query result for each iteration, which is ugly.
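A minimal illustration of that inlining, with hypothetical names (a per-customer total that might otherwise live in a procedure):
-- instead of: EXEC dbo.GetCustomerTotal @CustomerID (once per row)
-- inline the procedure's body and join it to the driving table:
SELECT c.CustomerID, SUM(o.Amount) AS Total
FROM Sales.Customer AS c
JOIN CustomerOrder AS o ON o.CustomerID = c.CustomerID
GROUP BY c.CustomerID;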
In case the order is important
--declare counter
DECLARE @CurrentRowNum BIGINT = 0;
--Iterate over all rows in [DataTable]
WHILE (1 = 1)
BEGIN
    --Get next row by row number
    SELECT TOP 1 @CurrentRowNum = extendedData.RowNum
    --here you can also store other values
    --for later use
    --,@MyVariable = extendedData.Value
    FROM (
        SELECT
            data.*
            ,ROW_NUMBER() OVER(ORDER BY (SELECT 0)) RowNum -- replace (SELECT 0) with a real ordering column if the order matters
        FROM [DataTable] data
    ) extendedData
    WHERE extendedData.RowNum > @CurrentRowNum
    ORDER BY extendedData.RowNum
    --Exit loop if no more rows
    IF @@ROWCOUNT = 0 BREAK;
    --call your sproc
    --EXEC dbo.YOURSPROC @MyVariable
END
I had some production code that could only handle 20 employees at a time; below is the framework for that code, with the production-specific parts removed.
ALTER PROCEDURE GetEmployees
    @ClientId varchar(50)
AS
BEGIN
    DECLARE @EEList TABLE (employeeId varchar(50));
    DECLARE @EE20 TABLE (employeeId varchar(50));
    INSERT INTO @EEList SELECT employeeId FROM Employee WHERE (ClientId = @ClientId);
    -- Do 20 at a time
    WHILE (SELECT COUNT(*) FROM @EEList) > 0
    BEGIN
        INSERT INTO @EE20 SELECT TOP 20 employeeId FROM @EEList;
        -- Call sp here
        DELETE FROM @EEList WHERE employeeId IN (SELECT employeeId FROM @EE20)
        DELETE FROM @EE20;
    END;
    RETURN
END
I had a situation where I needed to perform a series of operations on a result set (table). The operations are all set operations, so it's not an issue, but...
I needed to do this in multiple places. So putting the relevant pieces in a table type, then populating a table variable with each result set, allows me to call the sp and repeat the operations each time I need to.
While this does not address the exact question he asks, it does address how to perform an operation on all rows of a table without using a cursor.
@Johannes offers no insight into his motivation, so this may or may not help him.
My research led me to this well-written article, which served as the basis for my solution:
https://codingsight.com/passing-data-table-as-parameter-to-stored-procedures/
Here is the setup
drop type if exists cpRootMapType
go
create type cpRootMapType as table(
    RootId1 int
    , RootId2 int
)
go
drop procedure if exists spMapRoot2toRoot1
go
create procedure spMapRoot2toRoot1
(
    @map cpRootMapType readonly
)
as
update lt set root = m.RootId1
from linkTable lt
join @map m on lt.root = m.RootId2
update c set root = m.RootId1
from comments c
join @map m on c.root = m.RootId2
-- ever growing list of places this map would need to be applied....
-- now consolidated into one place
Here is the implementation:
... populate #matches
declare @map cpRootMapType
insert @map select rootid1, rootid2 from #matches
exec spMapRoot2toRoot1 @map
I like to do something similar to this (though it is still very similar to using a cursor)
-- Table variable to hold list of things that need looping
DECLARE @holdStuff TABLE (
    id INT IDENTITY(1,1) ,
    isIterated BIT DEFAULT 0 ,
    someInt INT ,
    someBool BIT ,
    otherStuff VARCHAR(200)
)
-- Populate @holdStuff with... stuff
INSERT INTO @holdStuff (
    someInt ,
    someBool ,
    otherStuff
)
SELECT
    1 , -- someInt - int
    1 , -- someBool - bit
    'I like turtles' -- otherStuff - varchar(200)
UNION ALL
SELECT
    42 , -- someInt - int
    0 , -- someBool - bit
    'something profound' -- otherStuff - varchar(200)
-- Loop tracking variables
DECLARE @tableCount INT
SET @tableCount = (SELECT COUNT(1) FROM @holdStuff)
DECLARE @loopCount INT
SET @loopCount = 1
-- While loop variables
DECLARE @id INT
DECLARE @someInt INT
DECLARE @someBool BIT
DECLARE @otherStuff VARCHAR(200)
-- Loop through items in @holdStuff
WHILE (@loopCount <= @tableCount)
BEGIN
    -- Increment the loopCount variable
    SET @loopCount = @loopCount + 1
    -- Grab the top unprocessed record
    SELECT TOP 1
        @id = id ,
        @someInt = someInt ,
        @someBool = someBool ,
        @otherStuff = otherStuff
    FROM @holdStuff
    WHERE isIterated = 0
    -- Mark the grabbed record as iterated
    UPDATE @holdStuff
    SET isIterated = 1
    WHERE id = @id
    -- Execute your stored procedure
    EXEC someRandomSp @someInt, @someBool, @otherStuff
END
Note that you don't need the identity or the isIterated column on your temp/variable table; I just prefer to do it this way so I don't have to delete the top record from the collection as I iterate through the loop.

SQL: efficiently append incremental number to string, avoiding duplicates

I have a set of records (table [#tmp_origin]) containing duplicate entries in a string field ([Names]). I would like to insert the whole content of [#tmp_origin] into the destination table [#tmp_destination], which does NOT allow duplicates and may already contain items.
If a string in the origin table does not exist in the destination table, it is simply inserted into the destination table as is.
If an entry already exists in the destination table with the same value as an entry in the origin table, a string-ified incremental number must be appended to the string before it is inserted into the destination table.
The process of moving data in this way has been implemented with a cursor, in this sample script:
-- create initial situation (origin and destination table, both containing items)
-- Begin
CREATE TABLE [#tmp_origin] ([Names] VARCHAR(10))
CREATE TABLE [#tmp_destination] ([Names] VARCHAR(10))
CREATE UNIQUE INDEX [IX_UniqueName] ON [#tmp_destination]([Names] ASC)
INSERT INTO [#tmp_origin]([Names]) VALUES ('a')
INSERT INTO [#tmp_origin]([Names]) VALUES ('a')
INSERT INTO [#tmp_origin]([Names]) VALUES ('b')
INSERT INTO [#tmp_origin]([Names]) VALUES ('c')
INSERT INTO [#tmp_destination]([Names]) VALUES ('a')
INSERT INTO [#tmp_destination]([Names]) VALUES ('a_1')
INSERT INTO [#tmp_destination]([Names]) VALUES ('b')
-- create initial situation - End
DECLARE @Name VARCHAR(10)
DECLARE NamesCursor CURSOR LOCAL FORWARD_ONLY FAST_FORWARD READ_ONLY FOR
SELECT [Names]
FROM [#tmp_origin];
OPEN NamesCursor;
FETCH NEXT FROM NamesCursor INTO @Name;
WHILE @@FETCH_STATUS = 0
BEGIN
    DECLARE @finalName VARCHAR(10)
    SET @finalName = @Name
    DECLARE @counter INT
    SET @counter = 1
    WHILE (1=1)
    BEGIN
        IF NOT EXISTS (SELECT * FROM [#tmp_destination] WHERE [Names] = @finalName)
            BREAK;
        SET @finalName = @Name + '_' + CAST(@counter AS VARCHAR)
        SET @counter = @counter + 1
    END
    INSERT INTO [#tmp_destination] ([Names])
    VALUES (@finalName)
    FETCH NEXT FROM NamesCursor INTO @Name;
END
CLOSE NamesCursor;
DEALLOCATE NamesCursor;
SELECT *
FROM [#tmp_destination]
/*
Expected result:
a
a_1
a_2
a_3
b
b_1
c
*/
DROP TABLE [#tmp_origin]
DROP TABLE [#tmp_destination]
This works correctly, but its performance drastically degrades as the number of items to insert increases.
Any idea how to speed it up?
Thanks!
Using a windowing function allows the duplicates to be numbered. You can also get the count from the destination table (this will need a WHERE condition to strip off the suffix you've added):
select orig.names,
       row_number() over (partition by orig.names order by orig.names) as rowNo,
       dest.cnt
from #tmp_origin orig
cross apply (select count(1) as cnt from #tmp_destination where names = orig.names) as dest
An insert can be built from the above (the new suffix is rowNo + dest.cnt - 1, if greater than zero), as sketched below.
Suggest you refactor the destination temporary table to include the name and suffix as separate columns (this might mean having a new intermediate stage), because that will make the matching logic much simpler.
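A sketch of that insert under the same assumptions, applying the rowNo + cnt - 1 formula (names already carrying a '_n' suffix in the destination still need the stripping logic mentioned above):
insert into [#tmp_destination] ([Names])
select orig.names
       + case when orig.rowNo + dest.cnt - 1 > 0
              then '_' + cast(orig.rowNo + dest.cnt - 1 as varchar(10))
              else '' end
from (
    select names,
           row_number() over (partition by names order by names) as rowNo
    from [#tmp_origin]
) orig
cross apply (select count(1) as cnt
             from [#tmp_destination]
             where names = orig.names) as dest;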
Something like this:
insert [#tmp_destination]
select case when row_number() over (partition by Names order by Names) > 1
            then Names + '_' + convert(varchar(10), row_number() over (partition by Names order by Names))
            else Names
       end
from [#tmp_origin]
I wouldn't use a cursor in that case. Instead, I would build the query using ROW_NUMBER(). This way you add a counter in your original table, and then use this counter to append to your [Names]:
SELECT [Names], ROW_NUMBER() OVER (PARTITION BY [Names] ORDER BY [Names]) - 1 AS [counter]
INTO #tmp_origin_with_counter
FROM #tmp_origin
SELECT CONCAT([Names], IIF([counter] = 0, '', '_'+ CAST([counter] AS NVARCHAR)))
INTO #tmp_destination
FROM #tmp_origin_with_counter

Slow MSSQL stored procedure processing Excel files with only 30,000 rows

I have a web application with an interface that users can upload files on. The data from the Excel file is collected, concatenated and passed to a stored procedure, which processes it and returns data.
A brief explanation of the stored procedure:
The stored procedure collects the string, breaks it down using a delimiter and stores it in a temp table variable.
Another process runs through the temp table, where a count is done to find the exact match count and approximate match count, comparing each string in the temp table against a view which contains all the names to compare against.
An exact match is where the exact string is found in the view, for example (Bobby Bolonski).
An approximate match is done using a Levenshtein distance algorithm database function with a frequency of 2.
The results (name, exact match count and approximate match count) are stored in the final temp table.
A SELECT statement is run on the last temp table to return all the data to the application.
My problem is that when I pass huge files, like an Excel file with 27,000 names, it takes about 2 hours to process and return data from the database.
I have checked both the server where the application is and the server where the database is.
On the application server, both memory and CPU usage are less than 15%.
On the database server, both memory and CPU usage are also less than 15%.
I am looking for advice on what improvements I can make to speed up the process.
Below is a copy of the stored procedure, as it is doing all the work and returning the results to the web application.
CREATE PROCEDURE [dbo].[FindMatch]
    @fullname varchar(max), @frequency int,
    @delimeter varchar(max) AS
set @frequency = 2
declare @transID bigint
SELECT @transID = ABS(CAST(CAST(NEWID() AS VARBINARY(5)) AS bigint))
DECLARE @exactMatch int = 99
DECLARE @approximateMatch int = 99
declare @name varchar(50)
DECLARE @TEMP1 TABLE (fullname varchar(max), approxMatch varchar(max), exactmatch varchar(max))
DECLARE @ID varchar(max)
--declare a temp table
DECLARE @TEMP TABLE (ID int, fullname varchar(max), approxMatch varchar(max), exactmatch varchar(max))
--split the input and store the result in the @TEMP table
insert into @TEMP (ID, fullname) select * from fnSplitTest(@fullname, @delimeter)
--loop through the @TEMP table
WHILE EXISTS (SELECT ID FROM @TEMP)
BEGIN
    SELECT TOP 1 @ID = ID FROM @TEMP
    select @name = fullname from @TEMP where id = @ID
    --get the exact match count of the current row from the @TEMP table
    select @exactMatch = count(1) from getalldata
    where replace(name, ',', '') COLLATE Latin1_general_CI_AI = @name COLLATE Latin1_general_CI_AI
    --declare temp table @TEMP3
    DECLARE @TEMP3 TABLE (name varchar(max))
    --insert into @TEMP3 only the names that sound similar to the search name, so as not to loop over all the data in the view
    INSERT INTO @TEMP3 (name)
    select name from getalldata where SOUNDEX(name) LIKE SOUNDEX(@name)
    --get the approximate count using the [DamLev] function.
    --this function uses the Damerau-Levenshtein distance algorithm to calculate the distance between the search string
    --and the names inserted into @TEMP3 above. Uses frequency 2 so as to eliminate all the others
    select @approximateMatch = count(1) from @TEMP3
    where dbo.[DamLev](replace(name, ',', ''), @name, @frequency) <= @frequency
      and dbo.[DamLev](replace(name, ',', ''), @name, @frequency) > 0
      and name != @name
    --insert the results into @TEMP1 at the end of every loop
    insert into @TEMP1 (fullname, approxMatch, exactmatch) values (@name, @approximateMatch, @exactMatch)
    insert into FileUploadNameInsert (name) values (@name + ' ' + cast(@approximateMatch as varchar) + ' ' + cast(@exactMatch as varchar) + ', ' + cast(@transID as varchar))
    DELETE FROM @TEMP WHERE ID = @ID
    delete from @TEMP3
END
--Return all the data stored in @TEMP1
select fullname, exactmatch, approxMatch, @transID as transactionID from @TEMP1
GO
In my opinion:
Use OPENROWSET to read the records directly into a pre-defined, properly indexed table in your database.
Then perform your operations on this table at the back end, using pre-defined stored procedures.
It should take around 15 minutes for 30,000 rows.
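A minimal sketch of that OPENROWSET load, assuming the ACE OLE DB provider is installed, ad hoc distributed queries are enabled, and hypothetical file, sheet and staging-table names:
-- read the spreadsheet straight into an indexed staging table
INSERT INTO dbo.NamesStaging (fullname)
SELECT [Name]
FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0',
                'Excel 12.0;Database=C:\uploads\names.xlsx;HDR=YES',
                'SELECT * FROM [Sheet1$]');
With the names in a real table, the matching can then be done as set-based joins against the view instead of a row-by-row loop.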