Fastest way to loop thru a SQL Query

Fastest way to loop thru a SQL Query - sql

What is the fastest way to loop thru a Query in T-SQL .
1) Cursors or
2) Temp tables with Key added or
any thing else.

The fastest way to "loop" thru a query is to just not do it. In SQL, you should be thinking set-based instead of loop-based. You should probably evaluate your query, ask why you need to loop, and look for ways to do it as a set.
With that said, using the FAST_FORWARD option on your cursors will help speed things along.

For your stated goal, something like this is actually a better bet - avoids the "looping" issue entirely.
declare #table table
(
ID int
)
insert into #table select 1 union select 2 union select 3 union select 4 union select 5
declare #concat varchar(256)
-- Add comma if it is not the first item in the list
select #concat = isnull(#concat + ', ', '') + ltrim(rtrim(str(ID))) from #table order by ID desc
-- or do whatever you want with the concatenated value now...
print #concat

Depends on what you're trying to do. Some tasks are better suited for cursors, some for temp tables. That's why they both exist.

I don't think you need a cursor for that (your comment about concat) if I understand what you're going for.
Here's one of mine that grabs all the phone numbers for a contact and plops them in a field and returns it.
DECLARE #numbers VARCHAR(255)
SELECT #numbers = COALESCE(#numbers + ' | ','') + PHONE_NUMB FROM my_table (NOLOCK)
WHERE CONTACT_ID=#contact_id RETURN #numbers

Cursors are usually resource hogs especially as your table size grows. So if your table size is small I would be okay with recommending a cursor, however, a larger table would probably do better with an external or temporary table.

Do you want to loop through query output inside stored procedure OR from C# code?
Generally speaking, you should avoid looping through query output one row at a time. SQL is meant for set based operations so see if you can solve your problem using set based approach.

Depending on the size of your result set - Table variables are in memory and require no disk read, can be treated just like a table (set operations) and are very fast until result set gets to large for memory (which then requires swap file writes).

Here's a shortcut to get a comma-delimited string of a single field from a query that returns a number of rows. Pretty quick compared to the alternatives of cursors, etc., and it can be part of a subquery (i.e., get some things, and in one column, the ids of all the things related to each thing in some other table):
SELECT
COALESCE(
REPLACE(
REPLACE(
REPLACE(
(SELECT MyField AS 'c' FROM [mytable] FOR XML PATH('')),'</c><c>',','),
'<c>',''),
'</c>',''),
'')
AS MyFieldCSV
Caveat: it won't play nice if your column contains characters that FOR XML PATH will escape.

Cursor is not good avoid cursor and use while loop in place of cursor
Temp table with key added is the best way to use looping.
i have to manipulate more than 1000000 rows in the table and
for cursor take 2 min because of complex logic.
but when convert cursor in to while loop it will take
25 seconds only. so that's big diffrence in performace.

Related

How to get a list of IDs from a parameter which sometimes includes the IDs already, but sometimes include another sql query

I have developed a SQL query in SSMS-2017 like this:
DECLARE #property NVARCHAR(MAX) = #p;
SET #property = REPLACE(#property, '''', '');
DECLARE #propList TABLE (hproperty NUMERIC(18, 0));
IF CHARINDEX('SELECT', #property) > 0 OR CHARINDEX('select', #property) > 0
BEGIN
INSERT INTO #propList
EXECUTE sp_executesql #property;
END;
ELSE
BEGIN
DECLARE #x TABLE (val NUMERIC(18, 0));
INSERT INTO #x
SELECT CONVERT(NUMERIC(18, 0), strval)
FROM dbo.StringSplit(#property, ',');
INSERT INTO #propList
SELECT val
FROM #x;
END;
SELECT ...columns...
FROM ...tables and joins...
WHERE ...filters...
AND HMY IN (SELECT hproperty FROM #propList)
The issue is, it is possible that the value of the parameter #p can be a list of IDs (Example: 1,2,3,4) or a direct select query (Example: Select ID from mytable where code='A123').
The code is working well as shown above. However it causes a problem in our system (as we use Yardi7-Voyager), and we need to leave only the select statement as a query. To manage it, I was planning to create a function and use it in the where clause like:
WHERE HMY IN (SELECT myFunction(#p))
However I could not manage it as I see I cannot execute a dynamic query in an SQL Function. Then I am stacked. Any idea at this point to handle this issue will be so appreciated.

Others have pointed out that the best fix for this would be a design change, and I agree with them. However, I'd also like to treat your question as academic and answer it in case any future readers ever have the same question in a use case where a design change wouldn't be possible/desirable.
I can think of two ways you might be able to do what you're attempting in a single select, as long as there are no other restrictions on what you can do that you haven't mentioned yet. To keep this brief, I'm just going to give you psuedo-code that can be adapted to your situation as well as those of future readers:
OPENQUERY (or OPENROWSET)
You can incorporate your code above into a stored procedure instead of a function, since stored procedures DO allow dynamic sql, unlike functions. Then the SELECT query in your app would be a SELECT from OPENQUERY(Execute Your Stored Prodedure).
UNION ALL possibilities.
I'm about 99% sure no one would ever want to use this, but I'm mentioning it to be as academically complete as I know how to be.
The second possibility would only work if there are a limited, known, number of possible queries that might be supported by your application. For instance, you can only get your Properties from either TableA, filtered by column1, or from TableB, filtered by Column2 and/or Column3.
Could be more than these possibilities, but it has to be a limited, known quantity, and the more possibilities, the more complex and lengthy the code will get.
But if that's the case, you can simply SELECT from a UNION ALL of every possible scenario, and make it so that only one of the SELECTs in the UNION ALL will return results.
For example:
SELECT ... FROM TableA WHERE Column1=fnGetValue(#p, 'Column1')
AND CHARINDEX('SELECT', #property) > 0
AND CHARINDEX('TableA', #property) > 0
AND CHARINDEX('Column1', #property) > 0
AND (Whatever other filters are needed to uniquely identify this case)
UNION ALL
SELECT
...
Note that fnGetValue() isn't a built-in function. You'd have to write it. It would parse the string in #p, find the location of 'Column1=', and return whatever value comes after it.
At the end of your UNION ALL, you'd need to add a last UNION ALL to a query that will handle the case where the user passed a comma-separated string instead of a query, but that's easy, because all the steps in your code where you populated table variables are unnecessary. You can simply do the final query like this:
WHERE NOT CHARINDEX('SELECT', #p) > 0
AND HMY IN (SELECT strval FROM dbo.StringSplit(#p, ','))
I'm pretty sure this possibility is way more work than its worth, but it is an example of how, in general, dynamic SQL can be replaced with regular SQL that simply covers every possible option you wanted the dynamic sql to be able to handle.

Optimizing stored procedure with multiple "LIKE"s

I am passing in a comma-delimited list of values that I need to compare to the database
Here is an example of the values I'm passing in:
#orgList = "1123, 223%, 54%"
To use the wildcard I think I have to do LIKE but the query runs a long time and only returns 14 rows (the results are correct, but it's just taking forever, probably because I'm using the join incorrectly)
Can I make it better?
This is what I do now:
declare #tempTable Table (SearchOrg nvarchar(max) )
insert into #tempTable
select * from dbo.udf_split(#orgList) as split
-- this splits the values at the comma and puts them in a temp table
-- then I do a join on the main table and the temp table to do a like on it....
-- but I think it's not right because it's too long.
select something
from maintable gt
join #tempTable tt on gt.org like tt.SearchOrg
where
AYEAR= ISNULL(#year, ayear)
and (AYEAR >= ISNULL(#yearR1, ayear) and ayear <= ISNULL(#yearr2, ayear))
and adate = ISNULL(#Date, adate)
and (adate >= ISNULL(#dateR1, adate) and adate <= ISNULL(#DateR2 , adate))
The final result would be all rows where the maintable.org is 1123, or starts with 223 or starts with 554
The reason for my date craziness is because sometimes the stored procedure only checks for a year, sometimes for a year range, sometimes for a specific date and sometimes for a date range... everything that's not used in passed in as null.
Maybe the problem is there?

Try something like this:
Declare #tempTable Table
(
-- Since the column is a varchar(10), you don't want to use nvarchar here.
SearchOrg varchar(20)
);
INSERT INTO #tempTable
SELECT * FROM dbo.udf_split(#orgList);
SELECT
something
FROM
maintable gt
WHERE
some where statements go here
And
Exists
(
SELECT 1
FROM #tempTable tt
WHERE gt.org Like tt.SearchOrg
)

Such a dynamic query with optional filters and LIKE driven by a table (!) are very hard to optimize because almost nothing is statically known. The optimizer has to create a very general plan.
You can do two things to speed this up by orders of magnitute:
Play with OPTION (RECOMPILE). If the compile times are acceptable this will at least deal with all the optional filters (but not with the LIKE table).
Do code generation and EXEC sp_executesql the code. Build a query with all LIKE clauses inlined into the SQL so that it looks like this: WHERE a LIKE #like0 OR a LIKE #like1 ... (not sure if you need OR or AND). This allows the optimizer to get rid of the join and just execute a normal predicate).

Your query may be difficult to optimize. Part of the question is what is in the where clause. You probably want to filter these first, and then do the join using like. Or, you can try to make the join faster, and then do a full table scan on the results.
SQL Server should optimize a like statement of the form 'abc%' -- that is, where the wildcard is at the end. (See here, for example.) So, you can start with an index on maintable.org. Fortunately, your examples meet this criteria. However, if you have '%abc' -- the wildcard comes first -- then the optimization won't work.
For the index to work best, it might also need to take into account the conditions in the where clause. In other words, adding the index is suggestive, but the rest of the query may preclude the use of the index.
And, let me add, the best solution for these types of searches is to use the full text search capability in SQL Server (see here).

Matching sub string in a column

First I apologize for the poor formatting here.
Second I should say up front that changing the table schema is not an option.
So I have a table defined as follows:
Pin varchar
OfferCode varchar
Pin will contain data such as:
abc,
abc123
OfferCode will contain data such as:
123
123~124~125
I need a query to check for a count of a Pin/OfferCode combination and when I say OfferCode, I mean an individual item delimited by the tilde.
For example if there is one row that looks like abc, 123 and another that looks like abc,123~124, and I search for a count of Pin=abc,OfferCode=123 I wand to get a count = 2.
Obviously I can do a similar query to this:
SELECT count(1) from MyTable (nolock) where OfferCode like '%' + #OfferCode + '%' and Pin = #Pin
using like here is very expensive and I'm hoping there may be a more efficient way.
I'm also looking into using a split string solution. I have a Table-valued function SplitString(string,delim) that will return table OutParam, but I'm not quite sure how to apply this to a table column vs a string. Would this even be worth wile pursuing? It seems like it would be much more expensive, but I'm unable to get a working solution to compare to the like solution.

Your like/% solution is open to a bug if you had offer codes other than 3 digits (if there was offer code 123 and 1234, searching for like '%123%' would return both, which is wrong). You can use your string function this way:
SELECT Pin, count(1)
FROM MyTable (nolock)
CROSS APPLY SplitString(OfferCode,'~') OutParam
WHERE OutParam.Value = #OfferCode and Pin = #Pin
GROUP BY Pin
If you have a relatively small table you can probably get away with this. If you are working with a large number of rows or encountering performance problems, it would be more effective to normalize it as RedFilter suggested.

using like here is very expensive and I'm hoping there may be a more efficient way
The efficient way is to normalize the schema and put each OfferCode in its own row.
Then your query is more like (although you may need to use an intersection table depending on your schema):
select count(*)
from MyTable
where OfferCode = #OfferCode
and Pin = #Pin

Here is one way to use like for this problem, which is standard for getting exact matches when searching delimited strings while avoiding the '%123%' matches '123' and '1234' problem:
-- Create some test data
declare #table table (
Pin varchar(10) not null
, OfferCode varchar(100) not null
)
insert into #table select 'abc', '123'
insert into #table select 'abc', '123~124'
-- Mock some proc params
declare #Pin varchar(10) = 'abc'
declare #OfferCode varchar(10) = '123'
-- Run the actual query
select count(*) as Matches
from #table
where Pin = #Pin
-- Append delimiters to find exact matches
and '~' + OfferCode + '~' like '%~' + #OfferCode + '~%'
As you can see, we're adding the delimiters to the searched string, and also the search string in order to find matches, thus avoiding the bugs mentioned by other answers.
I highly doubt that a string splitting function will yield better performance over like, but it may be worth a test or two using some of the more recently suggested methods. If you still have unacceptable performance, you have a few options:
Updated:
Try an index on OfferCode (or on a computed persisted column of '~' + OfferCode + '~'). Contrary to the myth that SQL Server won't use an index with like and wildcards, this might actually help.
Check out full text search.
Create a normalized version of this table using a string splitter. Use this table to run your counts. Update this table according to some schedule or event (trigger, etc.).
If you have some standard search terms, pre-calculate the counts for these and store them on some regular basis.

Actually, the LIKE condition is going to have much less cost than doing any sort of string manipulation and comparison.
http://www.simple-talk.com/sql/performance/the-seven-sins-against-tsql-performance/

Conditional Joins - Dynamic SQL

The DBA here at work is trying to turn my straightforward stored procs into a dynamic sql monstrosity. Admittedly, my stored procedure might not be as fast as they'd like, but I can't help but believe there's an adequate way to do what is basically a conditional join.
Here's an example of my stored proc:
SELECT
*
FROM
table
WHERE
(
#Filter IS NULL OR table.FilterField IN
(SELECT Value FROM dbo.udfGetTableFromStringList(#Filter, ','))
)
The UDF turns a comma delimited list of filters (for example, bank names) into a table.
Obviously, having the filter condition in the where clause isn't ideal. Any suggestions of a better way to conditionally join based on a stored proc parameter are welcome. Outside of that, does anyone have any suggestions for or against the dynamic sql approach?
Thanks

You could INNER JOIN on the table returned from the UDF instead of using it in an IN clause
Your UDF might be something like
CREATE FUNCTION [dbo].[csl_to_table] (#list varchar(8000) )
RETURNS #list_table TABLE ([id] INT)
AS
BEGIN
DECLARE #index INT,
#start_index INT,
#id INT
SELECT #index = 1
SELECT #start_index = 1
WHILE #index <= DATALENGTH(#list)
BEGIN
IF SUBSTRING(#list,#index,1) = ','
BEGIN
SELECT #id = CAST(SUBSTRING(#list, #start_index, #index - #start_index ) AS INT)
INSERT #list_table ([id]) VALUES (#id)
SELECT #start_index = #index + 1
END
SELECT #index = #index + 1
END
SELECT #id = CAST(SUBSTRING(#list, #start_index, #index - #start_index ) AS INT)
INSERT #list_table ([id]) VALUES (#id)
RETURN
END
and then INNER JOIN on the ids in the returned table. This UDF assumes that you're passing in INTs in your comma separated list
EDIT:
In order to handle a null or no value being passed in for #filter, the most straightforward way that I can see would be to execute a different query within the sproc based on the #filter value. I'm not certain how this affects the cached execution plan (will update if someone can confirm) or if the end result would be faster than your original sproc, I think that the answer here would lie in testing.

Looks like the rewrite of the code is being addressed in another answer, but a good argument against dynamic SQL in a stored procedure is that it breaks the ownership chain.
That is, when you call a stored procedure normally, it executes under the permissions of the stored procedure owner EXCEPT when executing dynamic SQL with the execute command,for the context of the dynamic SQL it reverts back to the permissions of the caller, which may be undesirable depending on your security model.
In the end, you are probably better off compromising and rewriting it to address the concerns of the DBA while avoiding dynamic SQL.

I am not sure I understand your aversion to dynamic SQL. Perhaps it is that your UDF has nicely abstracted away some of the messyness of the problem, and you feel dynamic SQL will bring that back. Well, consider that most if not all DAL or ORM tools will rely extensively on dynamic SQL, and I think your problem could be restated as "how can I nicely abstract away the messyness of dynamic SQL".
For my part, dynamic SQL gives me exactly the query I want, and subsequently the performance and behavior I am looking for.

I don't see anything wrong with your approach. Rewriting it to use dynamic SQL to execute two different queries based on whether #Filter is null seems silly to me, honestly.
The only potential downside I can see of what you have is that it could cause some difficulty in determining a good execution plan. But if the performance is good enough as it is, there's no reason to change it.

No matter what you do (and the answers here all have good points), be sure to compare the performance and execution plans of each option.
Sometimes, hand optimization is simply pointless if it impacts your code maintainability and really produces no difference in how the code executes.
I would first simply look at changing the IN to a simple LEFT JOIN with NULL check (this doesn't get rid of your udf, but it should only get called once):
SELECT *
FROM table
LEFT JOIN dbo.udfGetTableFromStringList(#Filter, ',') AS filter
ON table.FilterField = filter.Value
WHERE #Filter IS NULL
OR filter.Value IS NOT NULL

It appears that you are trying to write a a single query to deal with two scenarios:
1. #filter = "x,y,z"
2. #filter IS NULL
To optimise scenario 2, I would INNER JOIN on the UDF, rather than use an IN clause...
SELECT * FROM table
INNER JOIN dbo.udfGetTableFromStringList(#Filter, ',') AS filter
ON table.FilterField = filter.Value
To optimise for scenario 2, I would NOT try to adapt the existing query, instead I would deliberately keep those cases separate, either an IF statement or a UNION and simulate the IF with a WHERE clause...
TSQL IF
IF (#filter IS NULL)
SELECT * FROM table
ELSE
SELECT * FROM table
INNER JOIN dbo.udfGetTableFromStringList(#Filter, ',') AS filter
ON table.FilterField = filter.Value
UNION to Simulate IF
SELECT * FROM table
INNER JOIN dbo.udfGetTableFromStringList(#Filter, ',') AS filter
ON table.FilterField = filter.Value
UNION ALL
SELECT * FROM table WHERE #filter IS NULL
The advantage of such designs is that each case is simple, and determining which is simple is it self simple. Combining the two into a single query, however, leads to compromises such as LEFT JOINs and so introduces significant performance loss to each.

Fastest way to remove non-numeric characters from a VARCHAR in SQL Server

I'm writing an import utility that is using phone numbers as a unique key within the import.
I need to check that the phone number does not already exist in my DB. The problem is that phone numbers in the DB could have things like dashes and parenthesis and possibly other things. I wrote a function to remove these things, the problem is that it is slow and with thousands of records in my DB and thousands of records to import at once, this process can be unacceptably slow. I've already made the phone number column an index.
I tried using the script from this post:
T-SQL trim &nbsp (and other non-alphanumeric characters)
But that didn't speed it up any.
Is there a faster way to remove non-numeric characters? Something that can perform well when 10,000 to 100,000 records have to be compared.
Whatever is done needs to perform fast.
Update
Given what people responded with, I think I'm going to have to clean the fields before I run the import utility.
To answer the question of what I'm writing the import utility in, it is a C# app. I'm comparing BIGINT to BIGINT now, with no need to alter DB data and I'm still taking a performance hit with a very small set of data (about 2000 records).
Could comparing BIGINT to BIGINT be slowing things down?
I've optimized the code side of my app as much as I can (removed regexes, removed unneccessary DB calls). Although I can't isolate SQL as the source of the problem anymore, I still feel like it is.

I saw this solution with T-SQL code and PATINDEX. I like it :-)
CREATE Function [fnRemoveNonNumericCharacters](#strText VARCHAR(1000))
RETURNS VARCHAR(1000)
AS
BEGIN
WHILE PATINDEX('%[^0-9]%', #strText) > 0
BEGIN
SET #strText = STUFF(#strText, PATINDEX('%[^0-9]%', #strText), 1, '')
END
RETURN #strText
END

replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(string,'a',''),'b',''),'c',''),'d',''),'e',''),'f',''),'g',''),'h',''),'i',''),'j',''),'k',''),'l',''),'m',''),'n',''),'o',''),'p',''),'q',''),'r',''),'s',''),'t',''),'u',''),'v',''),'w',''),'x',''),'y',''),'z',''),'A',''),'B',''),'C',''),'D',''),'E',''),'F',''),'G',''),'H',''),'I',''),'J',''),'K',''),'L',''),'M',''),'N',''),'O',''),'P',''),'Q',''),'R',''),'S',''),'T',''),'U',''),'V',''),'W',''),'X',''),'Y',''),'Z','')*1 AS string,
:)

In case you didn't want to create a function, or you needed just a single inline call in T-SQL, you could try:
set #Phone = REPLACE(REPLACE(REPLACE(REPLACE(#Phone,'(',''),' ',''),'-',''),')','')
Of course this is specific to removing phone number formatting, not a generic remove all special characters from string function.

I may misunderstand, but you've got two sets of data to remove the strings from one for current data in the database and then a new set whenever you import.
For updating the existing records, I would just use SQL, that only has to happen once.
However, SQL isn't optimized for this sort of operation, since you said you are writing an import utility, I would do those updates in the context of the import utility itself, not in SQL. This would be much better performance wise. What are you writing the utility in?
Also, I may be completely misunderstanding the process, so I apologize if off-base.
Edit:
For the initial update, if you are using SQL Server 2005, you could try a CLR function. Here's a quick one using regex. Not sure how the performance would compare, I've never used this myself except for a quick test right now.
using System;
using System.Data;
using System.Text.RegularExpressions;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
public partial class UserDefinedFunctions
{
[Microsoft.SqlServer.Server.SqlFunction]
public static SqlString StripNonNumeric(SqlString input)
{
Regex regEx = new Regex(#"\D");
return regEx.Replace(input.Value, "");
}
};
After this is deployed, to update you could just use:
UPDATE table SET phoneNumber = dbo.StripNonNumeric(phoneNumber)

Simple function:
CREATE FUNCTION [dbo].[RemoveAlphaCharacters](#InputString VARCHAR(1000))
RETURNS VARCHAR(1000)
AS
BEGIN
WHILE PATINDEX('%[^0-9]%',#InputString)>0
SET #InputString = STUFF(#InputString,PATINDEX('%[^0-9]%',#InputString),1,'')
RETURN #InputString
END
GO

create function dbo.RemoveNonNumericChar(#str varchar(500))
returns varchar(500)
begin
declare #startingIndex int
set #startingIndex=0
while 1=1
begin
set #startingIndex= patindex('%[^0-9]%',#str)
if #startingIndex <> 0
begin
set #str = replace(#str,substring(#str,#startingIndex,1),'')
end
else break;
end
return #str
end
go
select dbo.RemoveNonNumericChar('aisdfhoiqwei352345234##$%^$#345345%^##$^')

From SQL Server 2017 the native TRANSLATE function is available.
If you have a known list of all characters to remove then you can simply use the following (to first convert all bad characters to a single known bad character and then to strip that specific character out with a REPLACE)
DECLARE #BadCharacters VARCHAR(256) = 'abcdefghijklmnopqrstuvwxyz';
SELECT REPLACE(
TRANSLATE(YourColumn,
#BadCharacters,
REPLICATE(LEFT(#BadCharacters,1),LEN(#BadCharacters))),
LEFT(#BadCharacters,1),
'')
FROM #YourTable
If the list of possible "bad" characters is too extensive to enumerate all in advance then you can use a double TRANSLATE
DECLARE #CharactersToKeep VARCHAR(30) = '0123456789',
#ExampleBadCharacter CHAR(1) = CHAR(26);
SELECT REPLACE(TRANSLATE(YourColumn, bad_chars, REPLICATE(#ExampleBadCharacter, LEN(bad_chars + 'X') - 1)), #ExampleBadCharacter, '')
FROM #YourTable
CROSS APPLY (SELECT REPLACE(
TRANSLATE(YourColumn,
#CharactersToKeep,
REPLICATE(LEFT(#CharactersToKeep, 1), LEN(#CharactersToKeep))),
LEFT(#CharactersToKeep, 1),
'')) ca(bad_chars)

can you remove them in a nightly process, storing them in a separate field, then do an update on changed records right before you run the process?
Or on the insert/update, store the "numeric" format, to reference later. A trigger would be an easy way to do it.

I would try Scott's CLR function first but add a WHERE clause to reduce the number of records updated.
UPDATE table SET phoneNumber = dbo.StripNonNumeric(phoneNumber)
WHERE phonenumber like '%[^0-9]%'
If you know that the great majority of your records have non-numeric characters it might not help though.

I know it is late to the game, but here is a function that I created for T-SQL that quickly removes non-numeric characters. Of note, I have a schema "String" that I put utility functions for strings into...
CREATE FUNCTION String.ComparablePhone( #string nvarchar(32) ) RETURNS bigint AS
BEGIN
DECLARE #out bigint;
-- 1. table of unique characters to be kept
DECLARE #keepers table ( chr nchar(1) not null primary key );
INSERT INTO #keepers ( chr ) VALUES (N'0'),(N'1'),(N'2'),(N'3'),(N'4'),(N'5'),(N'6'),(N'7'),(N'8'),(N'9');
-- 2. Identify the characters in the string to remove
WITH found ( id, position ) AS
(
SELECT
ROW_NUMBER() OVER (ORDER BY (n1+n10) DESC), -- since we are using stuff, for the position to continue to be accurate, start from the greatest position and work towards the smallest
(n1+n10)
FROM
(SELECT 0 AS n1 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9) AS d1,
(SELECT 0 AS n10 UNION SELECT 10 UNION SELECT 20 UNION SELECT 30) AS d10
WHERE
(n1+n10) BETWEEN 1 AND len(#string)
AND substring(#string, (n1+n10), 1) NOT IN (SELECT chr FROM #keepers)
)
-- 3. Use stuff to snuff out the identified characters
SELECT
#string = stuff( #string, position, 1, '' )
FROM
found
ORDER BY
id ASC; -- important to process the removals in order, see ROW_NUMBER() above
-- 4. Try and convert the results to a bigint
IF len(#string) = 0
RETURN NULL; -- an empty string converts to 0
RETURN convert(bigint,#string);
END
Then to use it to compare for inserting, something like this;
INSERT INTO Contacts ( phone, first_name, last_name )
SELECT i.phone, i.first_name, i.last_name
FROM Imported AS i
LEFT JOIN Contacts AS c ON String.ComparablePhone(c.phone) = String.ComparablePhone(i.phone)
WHERE c.phone IS NULL -- Exclude those that already exist

Working with varchars is fundamentally slow and inefficient compared to working with numerics, for obvious reasons. The functions you link to in the original post will indeed be quite slow, as they loop through each character in the string to determine whether or not it's a number. Do that for thousands of records and the process is bound to be slow. This is the perfect job for Regular Expressions, but they're not natively supported in SQL Server. You can add support using a CLR function, but it's hard to say how slow this will be without trying it I would definitely expect it to be significantly faster than looping through each character of each phone number, however!
Once you get the phone numbers formatted in your database so that they're only numbers, you could switch to a numeric type in SQL which would yield lightning-fast comparisons against other numeric types. You might find that, depending on how fast your new data is coming in, doing the trimming and conversion to numeric on the database side is plenty fast enough once what you're comparing to is properly formatted, but if possible, you would be better off writing an import utility in a .NET language that would take care of these formatting issues before hitting the database.
Either way though, you're going to have a big problem regarding optional formatting. Even if your numbers are guaranteed to be only North American in origin, some people will put the 1 in front of a fully area-code qualified phone number and others will not, which will cause the potential for multiple entries of the same phone number. Furthermore, depending on what your data represents, some people will be using their home phone number which might have several people living there, so a unique constraint on it would only allow one database member per household. Some would use their work number and have the same problem, and some would or wouldn't include the extension which would cause artificial uniqueness potential again.
All of that may or may not impact you, depending on your particular data and usages, but it's important to keep in mind!

I'd use an Inline Function from performance perspective, see below:
Note that symbols like '+','-' etc will not be removed
CREATE FUNCTION [dbo].[UDF_RemoveNumericStringsFromString]
(
#str varchar(100)
)
RETURNS TABLE AS RETURN
WITH Tally (n) as
(
-- 100 rows
SELECT TOP (Len(#Str)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(n)
CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(n)
)
SELECT OutStr = STUFF(
(SELECT SUBSTRING(#Str, n,1) st
FROM Tally
WHERE ISNUMERIC(SUBSTRING(#Str, n,1)) = 1
FOR XML PATH(''),type).value('.', 'varchar(100)'),1,0,'')
GO
/*Use it*/
SELECT OutStr
FROM dbo.UDF_RemoveNumericStringsFromString('fjkfhk759734977fwe9794t23')
/*Result set
759734977979423 */
You can define it with more than 100 characters...

"Although I can't isolate SQL as the source of the problem anymore, I still feel like it is."
Fire up SQL Profiler and take a look. Take the resulting queries and check their execution plans to make sure that index is being used.

Thousands of records against thousands of records is not normally a problem. I've used SSIS to import millions of records with de-duping like this.
I would clean up the database to remove the non-numeric characters in the first place and keep them out.

Looking for a super simple solution:
SUBSTRING([Phone], CHARINDEX('(', [Phone], 1)+1, 3)
+ SUBSTRING([Phone], CHARINDEX(')', [Phone], 1)+1, 3)
+ SUBSTRING([Phone], CHARINDEX('-', [Phone], 1)+1, 4) AS Phone

I would recommend enforcing a strict format for phone numbers in the database. I use the following format. (Assuming US phone numbers)
Database: 5555555555x555
Display: (555) 555-5555 ext 555
Input: 10 digits or more digits embedded in any string. (Regex replacing removes all non-numeric characters)

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Fastest way to loop thru a SQL Query - sql

What is the fastest way to loop thru a Query in T-SQL . 1) Cursors or 2) Temp tables with Key added or any thing else.

Depends on what you're trying to do. Some tasks are better suited for cursors, some for temp tables. That's why they both exist.

Cursors are usually resource hogs especially as your table size grows. So if your table size is small I would be okay with recommending a cursor, however, a larger table would probably do better with an external or temporary table.

Do you want to loop through query output inside stored procedure OR from C# code? Generally speaking, you should avoid looping through query output one row at a time. SQL is meant for set based operations so see if you can solve your problem using set based approach.

Depending on the size of your result set - Table variables are in memory and require no disk read, can be treated just like a table (set operations) and are very fast until result set gets to large for memory (which then requires swap file writes).

Related

How to get a list of IDs from a parameter which sometimes includes the IDs already, but sometimes include another sql query

Optimizing stored procedure with multiple "LIKE"s

Matching sub string in a column

Conditional Joins - Dynamic SQL

Fastest way to remove non-numeric characters from a VARCHAR in SQL Server

Categories

Resources