Trying to generate a subset from an IN statement - sql

I am writing a solution for a user that matches a list of phone numbers they enter against a customer database.
The user needs to enter a comma separated list of phone numbers (integers), and the query needs to tell the user which phone numbers from their list are NOT in the database.
The only way I could think to do this is by first creating a subset, NUMBER_LIST, containing all of the user's phone numbers, which I can then join against and exclude from what I bring back from my customer database.
WITH NUMBER_LIST AS (
  SELECT INTEGERS
  FROM (
    SELECT level - 1 + 8000000000 INTEGERS
    FROM dual
    CONNECT BY level <= 8009999999 - 8000000000 + 1
  )
  WHERE INTEGERS IN (8001231001, 8001231003, 8001231234, 8001231235, ...up to 1000 phone numbers)
)
The problem is that the above code works fine to create my subset for numbers between 800-000-0000 and 800-999-9999. The phone numbers in my list and customer database can be ANY range (not just 800 numbers); I did this just as a test. It takes about 6 seconds to generate the subset from that query. If I change the CONNECT BY LEVEL to include all numbers from 100-000-0000 to 999-999-9999, the query runs out of memory creating a subset that large (and I believe it is ridiculously overkill to create a huge list and then break it down using my IN statement).
The problem is creating the initial subset. I can handle the rest of the query, but I need to be able to generate the subset of numbers to query against my customer database from my IN statement.
Few things to remember:
I don't have the ability to load the numbers in a temporary table first. The user will be entering the "IN(...,...,...)" statement themselves.
This needs to be a single statement, no extra functions or variable declarations
The database is Oracle 10g, and I am using SQL Developer to create the query.
The user understands that they can only enter 1000 numbers into the IN statement. This needs to be robust enough to select any 1000 numbers from the entire area code range.
The end result is to get a list of phone numbers that ARE NOT in the database. A simple NOT IN... will not work, because that will bring back which numbers are in the database, but not in my list.
How can I make this work for all numbers between 1000000000-9999999999 (that is, all U.S. 10-digit phone number possibilities)? I may be going about it completely wrong by generating my initial HUGE list and then excluding everything other than my IN statement, but I'm not sure where to go from here.
Thanks so much for your help in advance. I've learned so much from all of you.

You could use the following:
SELECT *
FROM (SELECT regexp_substr(&x, '[^,]+', 1, LEVEL) phone_number
      FROM dual
      CONNECT BY LEVEL <= length(&x) - length(REPLACE(&x, ',', '')) + 1)
WHERE phone_number NOT IN (SELECT phone_table.phone_number
                           FROM phone_table)
The inner query builds a list of the individual phone numbers from the comma-separated input; the outer NOT IN then returns the ones that are missing from phone_table.

This problem is very closely related to the 'how do I bind an IN list' problem, which has come up here a few times. I posted an answer to Dynamic query with HibernateCritera API & Oracle - performance in the past.
Something like this should do what you want:
create table phone_nums (phone varchar2(10));
insert into phone_nums values ('12345');
insert into phone_nums values ('23456');
with bound_inlist
as
(
  select substr(txt,
                instr(txt, ',', 1, level) + 1,
                instr(txt, ',', 1, level + 1) - instr(txt, ',', 1, level) - 1) as token
  from (select ',' || :txt || ',' txt from dual)
  connect by level <= length(:txt) - length(replace(:txt, ',', '')) + 1
)
select *
from bound_inlist a
where not exists (select null from phone_nums where phone = token);
Here the list of comma-separated phone numbers is bound into the query, so you are using bind variables correctly, and you will probably be able to enter an effectively unlimited number of phone numbers to check in one go (although I would test both the 4000 and 32767 character boundaries to be sure).

You say you can't use temp tables or procs or custom functions -- it would be a simple task if you could.
What's the client tool being used to submit this query? Is there a reason why you can't query all phone numbers from the database and do the compare on the client?

If you are constrained to the point that it MUST be solved with IN (n1,n2,n3,...,n1000), then your approach would appear to be the only solution.
As you mentioned though, that's a big list you're creating up front.
Are you able to adapt your approach slightly?
WITH NUMBER_LIST AS (
  SELECT n1 AS phone FROM DUAL
  UNION ALL SELECT n2 FROM DUAL
  UNION ALL SELECT n3 FROM DUAL
  ...
  UNION ALL SELECT n1000 FROM DUAL
)
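A sketch of how that CTE might then drive the final anti-join (the customers table and phone_number column here are illustrative stand-ins for your customer database):
WITH NUMBER_LIST AS (
  SELECT 8001231001 AS phone FROM DUAL
  UNION ALL SELECT 8001231003 FROM DUAL
  UNION ALL SELECT 8001231234 FROM DUAL
)
SELECT nl.phone
FROM NUMBER_LIST nl
WHERE NOT EXISTS (SELECT 1
                  FROM customers c
                  WHERE c.phone_number = nl.phone);
This returns exactly the numbers from the user's list that are NOT in the database, which is the required direction of the comparison.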

Related

Validate exist mobile number as per starting series

I have more than 50K mobile numbers, and I need to validate each one against the Indian mobile number series, i.e. determine whether or not it is a valid number. I have downloaded the Indian mobile number series from Wikipedia and stored them in a column named Series in another table. Now I want to validate all the numbers in one go; please suggest a standard query which is fast and has a good execution plan.
For example, the series are: 6000,6001,6002,9977,9947
Below are the mobile numbers: 1241124154,6011101101,8414141401,6014141410,9947256585
Please note that the above numbers are randomly entered; they are not related to the numbers I have in my records. Any resemblance to an existing number is just coincidence.
Given that you can determine valid phone matches using known prefixes of the numbers, you should be able to just index the phone number column and then run something like:
SELECT *
FROM yourTable
WHERE phone LIKE '6000%'
   OR phone LIKE '6002%'
   OR phone LIKE '9977%'
   OR phone LIKE '9947%';
If you have many possible phone prefixes to check, then I suggest the following approach. First, create a new column based on the phone number which contains only the prefix. You may do this in your current table, or you may create a temporary table if you don't want to/can't change your current schema. Next, create a new table which contains just a single column. Populate this table with your 4000 actual valid phone prefixes, and then index this phone column. Now, the following query should be very fast:
SELECT t1.phone
FROM yourTable t1
WHERE EXISTS (SELECT 1 FROM prefixes t2 WHERE t2.prefix = t1.prefix);
Your SQL database should be able to use the index to satisfy the WHERE clause, and make the query execute quickly.
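The setup described above might look roughly like this (a sketch; table, column, and index names are assumptions, and string functions vary slightly by database):
-- derived column holding just the leading four digits
ALTER TABLE yourTable ADD prefix VARCHAR(4);
UPDATE yourTable SET prefix = SUBSTRING(phone, 1, 4);
CREATE INDEX idx_yourtable_prefix ON yourTable (prefix);

-- one row per valid series, indexed via the primary key
CREATE TABLE prefixes (prefix VARCHAR(4) PRIMARY KEY);
INSERT INTO prefixes (prefix)
VALUES ('6000'), ('6001'), ('6002'), ('9977'), ('9947');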
Use something like this: if your mobile master table is named tblmobile and the series are in a tblmobileseries table, then below is a sample query for your solution (SQL Server syntax, using + for concatenation)
SELECT tm.mobileno,
       (CASE WHEN EXISTS (SELECT 1
                          FROM tblmobileseries tms
                          WHERE tm.mobileno LIKE tms.series + '%')
             THEN 'VALID'
             ELSE 'Not Valid'
        END) AS IsValid
FROM tblmobile tm
SELECT M.Mobile,
       CASE WHEN EXISTS (SELECT 1 FROM SeriesT T WHERE LEFT(M.Mobile, 4) = T.Series)
            THEN 'Valid'
            ELSE 'Invalid'
       END AS Result
FROM DM M
DM - table for mobile numbers
SeriesT - table for mobile number series

Is there a way to query a ranged Expression in DB?

Our application is a mainframe application on an IBM iSeries with a DB2 database. Some of our table values have a range.
Ex: 100;105;108;110:160;180
-- UPDATE --
The above data is from a single row (a single column, to be precise). There would be multiple values in the same format on various rows.
In this case, individual values are delimited by a ";", but 110:160 is a range: it includes all the values from 110 to 160. Now, for the individual values we were obviously using LIKE statements, e.g. if I have to query for 105.
The challenge here is querying for 125, which is technically not present in the database; logically, however, I need to retrieve that record.
The system (application) somehow is able to accomplish this, and I am not sure how. I am not a mainframe developer; I just had to query the database to retrieve a specific record for some of the automation that we work on.
As a workaround, I could think of two things:
Expand the ranges and store them in a temp database programmatically.
Ex: 110:160 would be expanded to 110;111;112..160 (yes, it's tedious)
Reduce the number of records by filtering on certain unique columns (the ones without ranges), then programmatically apply logic to identify the right record
As both are workarounds, I was curious as to how the system does it. (I reached out to the devs of the app; so far, no luck.) So is there a direct approach to achieve this? Could it be a stored procedure?
If I got your question right, your example values are not in a single row but in multiple rows; otherwise some preprocessing has to be done.
I would deconstruct the combined value into its components with SQL, like:
with temp(id, text, value1, value2) as (
  select id, text
        ,case when posstr(id, ':') > 0
              then substr(id, 1, posstr(id, ':') - 1)
              else id
         end as value1
        ,case when posstr(id, ':') > 0
              then substr(id, posstr(id, ':') + 1, length(id))
              else id
         end as value2
  from testrange
)
select * from temp
where 125 between value1 and value2
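If the values do arrive as a single delimited row, the preprocessing could be a recursive CTE that first splits on the ";" delimiter. A sketch, assuming the combined string sits in a column called txt (an assumption) of testrange:
with split (token, rest) as (
  select varchar(substr(txt || ';', 1, locate(';', txt || ';') - 1), 50),
         varchar(substr(txt || ';', locate(';', txt || ';') + 1), 50)
  from testrange
  union all
  select varchar(substr(rest, 1, locate(';', rest) - 1), 50),
         varchar(substr(rest, locate(';', rest) + 1), 50)
  from split
  where length(rest) > 0
)
select token from split;
Each resulting token ('100', '110:160', ...) can then be fed through the value1/value2 derivation above.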

Suggestions for Querying Database for Names

I have an Oracle database that, like many, has a table containing biographical information. On which, I would like to search by name in a "natural" way.
The table has forename and surname fields and, currently, I am using something like this:
select id, forename, surname
from mytable
where upper(forename) like '%JOHN%'
and upper(surname) like '%SMITH%';
This works, but it can be very slow because the indices on this table obviously can't account for the preceding wildcard. Also, users will usually be searching for people based on what they tell them over the phone -- including a huge number of non-English names -- so it would be nice to also do some phonetic analysis.
As such, I have been experimenting with Oracle Text:
create index forenameFTX on mytable(forename) indextype is ctxsys.context;
create index surnameFTX on mytable(surname) indextype is ctxsys.context;
select score(1)+score(2) relevance,
id,
forename,
surname
from mytable
where contains(forename,'!%john%',1) > 0
and contains(surname,'!%smith%',2) > 0
order by relevance desc;
This has the advantage of using the Soundex algorithm as well as full text indices, so it should be a little more efficient. (Although, my anecdotal results show it to be pretty slow!) The only apprehensions I have about this are:
Firstly, the text indices need to be refreshed in some meaningful way. Using on commit would be too slow and might interfere with how the frontend software -- which is out of my control -- interacts with the database; so this requires some thought...
The results that are returned by Oracle aren't exactly very naturally sorted; I'm not really sure about this score function. For example, my development data is showing "Jonathan Peter Jason Smith" at the top -- fine -- but also "Jane Margaret Simpson" at the same level as "John Terrance Smith"
I'm thinking that removing the preceding wildcard might improve performance without degrading the results as, in real life, you would never search for a chunk in the middle of a name. However, otherwise, I'm open to ideas... This scenario must have been implemented ad nauseam! Can anyone suggest a better approach to what I'm doing/considering now?
Thanks :)
I have come up with a solution which works pretty well, following the suggestions in the comments. Particularly, @X-Zero's suggestion of creating a table of Soundexes: In my case, I can create new tables, but altering the existing schema is not allowed!
So, my process is as follows:
Create a new table with columns: ID, token, sound and position; with the primary key over (ID, sound, position) and an additional index over (ID, sound).
Go through each person in the biographical table:
Concatenate their forename and surname.
Change the codepage to us7ascii, so accented characters are normalised. This is because the Soundex algorithm doesn't work with accented characters.
Convert all non-alphabetic characters into whitespace and consider this the boundary between tokens.
Tokenise this string and insert into the table the token (in lowercase), the Soundex of the token and the position the token comes in the original string; associate this with ID.
Like so:
declare
  nameString varchar2(82);
  token      varchar2(40);
  posn       integer;
  cursor myNames is
    select id,
           forename || ' ' || surname person_name
    from mypeople;
begin
  for person in myNames
  loop
    nameString := trim(
                    utl_i18n.escape_reference(
                      regexp_replace(
                        regexp_replace(person.person_name, '[^[:alpha:]]', ' '),
                        '\s+', ' '),
                      'us7ascii')
                  ) || ' ';
    posn := 1;
    while nameString is not null
    loop
      token := substr(nameString, 1, instr(nameString, ' ') - 1);
      insert into personsearch values (person.id, lower(token), soundex(token), posn);
      nameString := substr(nameString, instr(nameString, ' ') + 1);
      posn := posn + 1;
    end loop;
  end loop;
end;
/
So, for example, "Siân O'Conner" gets tokenised into "sian" (position 1), "o" (position 2) and "conner" (position 3) and those three entries, with their Soundex, get inserted into personsearch along with their ID.
To search, we do the same process: tokenise the search criteria and then return results where the Soundexes and relative positions match. We order by the position and then the Levenshtein distance (ld) from the original search for each token, in turn.
This query, for example, will search against two tokens (i.e., pre-tokenised search string):
with searchcriteria as (
  select 'john' token1,
         'smith' token2
  from dual)
select alpha.id,
       mypeople.forename || ' ' || mypeople.surname
from personsearch alpha
join mypeople
  on mypeople.id = alpha.id
join personsearch beta
  on beta.id = alpha.id
 and beta.position > alpha.position
join searchcriteria
  on 1 = 1
where alpha.sound = soundex(searchcriteria.token1)
and beta.sound = soundex(searchcriteria.token2)
order by alpha.position,
         ld(alpha.token, searchcriteria.token1),
         beta.position,
         ld(beta.token, searchcriteria.token2),
         alpha.id;
To search against an arbitrary number of tokens, we would need to use dynamic SQL: joining the search table as many times as there are tokens, where the position field in the joined table must be greater than the position of the previously joined table... I plan to write a function to do this -- as well as the search string tokenisation -- which will return a table of IDs. However, I just post this here so you get the idea :)
As I say, this works pretty well: It returns good results pretty quickly. Even searching for "John Smith", once cached by the server, runs in less than 0.2s; returning over 200 rows... I'm pretty pleased with it and will be looking to put it into production. The only issues are:
The precalculation of tokens takes a while, but it's a one-off process, so not too much of a problem. A related problem, however, is that a trigger needs to be put on the mypeople table to insert/update/delete tokens in the search table whenever the corresponding operation is performed on mypeople. This may slow down the system; but as this should only happen during a few periods in a year, perhaps a better solution would be to rebuild the search table on a scheduled basis.
No stemming is being done, so the Soundex algorithm only matches on full tokens. For example, a search for "chris" will not return any "christopher"s. A possible solution to this is to only store the Soundex of the stem of the token, but calculating the stem is not a simple problem! This will be a future upgrade, possibly using the hyphenation engine used by TeX...
Anyway, hope that helps :) Comments welcome!
EDIT My full solution (write up and implementation) is now here, using Metaphone and the Damerau-Levenshtein Distance.

How to make result set from ('1','2','3')?

I have a question: how can I make a result set from only a list of values? For example, I have values like: ('1','2','3')
And I want to write SQL that returns such a table:
1
2
3
Thanks.
[Edit]
Sorry for the wrong question.
Actually the list contains strings, not integers.
I currently need something like ('aa','bb','cc').
[/Edit]
If you want to write a SQL statement which takes a comma-separated list and generates an arbitrary number of actual rows, the only real way would be to use a table function, which calls a PL/SQL function that splits the input string and returns the elements as separate rows.
Check out this link for an intro to table-functions.
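A minimal sketch of that approach using a pipelined function (the type and function names here are illustrative, and regexp_substr needs 10g or later):
CREATE OR REPLACE TYPE str_tab AS TABLE OF VARCHAR2(4000);
/
CREATE OR REPLACE FUNCTION split_list(p_list IN VARCHAR2)
  RETURN str_tab PIPELINED
IS
BEGIN
  -- one PIPE ROW per comma-separated token
  FOR i IN 1 .. length(p_list) - length(replace(p_list, ',')) + 1 LOOP
    PIPE ROW (regexp_substr(p_list, '[^,]+', 1, i));
  END LOOP;
  RETURN;
END;
/
SELECT column_value FROM TABLE(split_list('aa,bb,cc'));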
Alternatively, if you can construct the SQL statement programmatically in your client you can do:
SELECT 'aa' FROM DUAL
UNION
SELECT 'bb' FROM DUAL
UNION
SELECT 'cc' FROM DUAL
The best way I've found is using XML.
SELECT items.extract('/l/text()').getStringVal() item
FROM TABLE(xmlSequence(
       EXTRACT(XMLType('<all><l>' ||
                 REPLACE('aa,bb,cc', ',', '</l><l>') || '</l></all>')
              ,'/all/l'))) items;
Wish I could take credit but alas : http://pbarut.blogspot.com/2006/10/binding-list-variable.html.
Basically what it does is convert the list into an XML document, then parse it back out.
The easiest way is to abuse a table that is guaranteed to have enough rows.
-- for Oracle
select rownum from tab where rownum < 4;
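Since the question actually asks for strings, one variation on the same trick maps the generated row numbers onto values with DECODE:
select decode(rownum, 1, 'aa', 2, 'bb', 3, 'cc') as val
from tab
where rownum <= 3;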
If that is not possible, check out Oracle Row Generator Techniques.
I like this one (requires 10g):
select integer_value
from dual
where 1=2
model
dimension by ( 0 as key )
measures ( 0 as integer_value )
rules upsert ( integer_value[ for key from 1 to 10 increment 1 ] = cv(key) )
;
One trick I've used in various database systems (not just SQL databases) is actually to have a table which just contains the first 100 or 1000 integers. Such a table is very easy to create programmatically, and your query then becomes:
SELECT value FROM numbers WHERE value < 4 ORDER BY value
You can use the table for lots of similar purposes.
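For example, in Oracle the table could be created and populated once like this (names are illustrative):
CREATE TABLE numbers (value NUMBER PRIMARY KEY);

INSERT INTO numbers (value)
SELECT level FROM dual CONNECT BY level <= 1000;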

Fastest way to remove non-numeric characters from a VARCHAR in SQL Server

I'm writing an import utility that is using phone numbers as a unique key within the import.
I need to check that the phone number does not already exist in my DB. The problem is that phone numbers in the DB could contain things like dashes and parentheses and possibly other things. I wrote a function to remove these things, but it is slow, and with thousands of records in my DB and thousands of records to import at once, this process can be unacceptably slow. I've already indexed the phone number column.
I tried using the script from this post:
T-SQL trim &nbsp (and other non-alphanumeric characters)
But that didn't speed it up any.
Is there a faster way to remove non-numeric characters? Something that can perform well when 10,000 to 100,000 records have to be compared.
Whatever is done needs to perform fast.
Update
Given what people responded with, I think I'm going to have to clean the fields before I run the import utility.
To answer the question of what I'm writing the import utility in, it is a C# app. I'm comparing BIGINT to BIGINT now, with no need to alter DB data and I'm still taking a performance hit with a very small set of data (about 2000 records).
Could comparing BIGINT to BIGINT be slowing things down?
I've optimized the code side of my app as much as I can (removed regexes, removed unnecessary DB calls). Although I can't isolate SQL as the source of the problem anymore, I still feel like it is.
I saw this solution with T-SQL code and PATINDEX. I like it :-)
CREATE FUNCTION [fnRemoveNonNumericCharacters](@strText VARCHAR(1000))
RETURNS VARCHAR(1000)
AS
BEGIN
    WHILE PATINDEX('%[^0-9]%', @strText) > 0
    BEGIN
        SET @strText = STUFF(@strText, PATINDEX('%[^0-9]%', @strText), 1, '')
    END
    RETURN @strText
END
replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(
replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(
replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(
replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(
string
,'a',''),'b',''),'c',''),'d',''),'e',''),'f',''),'g',''),'h',''),'i',''),'j',''),'k',''),'l',''),'m','')
,'n',''),'o',''),'p',''),'q',''),'r',''),'s',''),'t',''),'u',''),'v',''),'w',''),'x',''),'y',''),'z','')
,'A',''),'B',''),'C',''),'D',''),'E',''),'F',''),'G',''),'H',''),'I',''),'J',''),'K',''),'L',''),'M','')
,'N',''),'O',''),'P',''),'Q',''),'R',''),'S',''),'T',''),'U',''),'V',''),'W',''),'X',''),'Y',''),'Z','')
* 1 AS string
:)
In case you didn't want to create a function, or you needed just a single inline call in T-SQL, you could try:
set @Phone = REPLACE(REPLACE(REPLACE(REPLACE(@Phone, '(', ''), ' ', ''), '-', ''), ')', '')
Of course this is specific to removing phone number formatting, not a generic remove all special characters from string function.
I may misunderstand, but you've got two sets of data to remove the strings from: the current data in the database, and each new set whenever you import.
For updating the existing records, I would just use SQL; that only has to happen once.
However, SQL isn't optimized for this sort of operation. Since you said you are writing an import utility, I would do those updates in the context of the import utility itself, not in SQL. This would be much better performance-wise. What are you writing the utility in?
Also, I may be completely misunderstanding the process, so I apologize if off-base.
Edit:
For the initial update, if you are using SQL Server 2005, you could try a CLR function. Here's a quick one using regex. Not sure how the performance would compare; I've never used this myself except for a quick test just now.
using System;
using System.Data;
using System.Text.RegularExpressions;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
public partial class UserDefinedFunctions
{
[Microsoft.SqlServer.Server.SqlFunction]
public static SqlString StripNonNumeric(SqlString input)
{
Regex regEx = new Regex(@"\D");
return regEx.Replace(input.Value, "");
}
};
After this is deployed, to update you could just use:
UPDATE table SET phoneNumber = dbo.StripNonNumeric(phoneNumber)
Simple function:
CREATE FUNCTION [dbo].[RemoveAlphaCharacters](@InputString VARCHAR(1000))
RETURNS VARCHAR(1000)
AS
BEGIN
    WHILE PATINDEX('%[^0-9]%', @InputString) > 0
        SET @InputString = STUFF(@InputString, PATINDEX('%[^0-9]%', @InputString), 1, '')
    RETURN @InputString
END
GO
create function dbo.RemoveNonNumericChar(@str varchar(500))
returns varchar(500)
begin
    declare @startingIndex int
    set @startingIndex = 0
    while 1 = 1
    begin
        set @startingIndex = patindex('%[^0-9]%', @str)
        if @startingIndex <> 0
        begin
            set @str = replace(@str, substring(@str, @startingIndex, 1), '')
        end
        else break;
    end
    return @str
end
go
select dbo.RemoveNonNumericChar('aisdfhoiqwei352345234##$%^$#345345%^##$^')
From SQL Server 2017 the native TRANSLATE function is available.
If you have a known list of all characters to remove then you can simply use the following (to first convert all bad characters to a single known bad character and then to strip that specific character out with a REPLACE)
DECLARE @BadCharacters VARCHAR(256) = 'abcdefghijklmnopqrstuvwxyz';

SELECT REPLACE(
         TRANSLATE(YourColumn,
                   @BadCharacters,
                   REPLICATE(LEFT(@BadCharacters, 1), LEN(@BadCharacters))),
         LEFT(@BadCharacters, 1),
         '')
FROM #YourTable
If the list of possible "bad" characters is too extensive to enumerate all in advance then you can use a double TRANSLATE
DECLARE @CharactersToKeep VARCHAR(30) = '0123456789',
        @ExampleBadCharacter CHAR(1) = CHAR(26);

SELECT REPLACE(TRANSLATE(YourColumn, bad_chars, REPLICATE(@ExampleBadCharacter, LEN(bad_chars + 'X') - 1)), @ExampleBadCharacter, '')
FROM #YourTable
CROSS APPLY (SELECT REPLACE(
                      TRANSLATE(YourColumn,
                                @CharactersToKeep,
                                REPLICATE(LEFT(@CharactersToKeep, 1), LEN(@CharactersToKeep))),
                      LEFT(@CharactersToKeep, 1),
                      '')) ca(bad_chars)
Can you remove them in a nightly process, storing them in a separate field, then do an update on changed records right before you run the process?
Or on the insert/update, store the "numeric" format, to reference later. A trigger would be an easy way to do it.
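A rough sketch of such a trigger, reusing the CLR StripNonNumeric function from the earlier answer (the Contacts table and its id/phoneNumber/phoneNumeric columns are assumptions):
CREATE TRIGGER trg_CleanPhone ON dbo.Contacts
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- keep a digits-only copy of the phone number in sync with every write
    UPDATE c
    SET c.phoneNumeric = dbo.StripNonNumeric(c.phoneNumber)
    FROM dbo.Contacts AS c
    JOIN inserted AS i ON i.id = c.id;
END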
I would try Scott's CLR function first but add a WHERE clause to reduce the number of records updated.
UPDATE table SET phoneNumber = dbo.StripNonNumeric(phoneNumber)
WHERE phonenumber like '%[^0-9]%'
If you know that the great majority of your records have non-numeric characters it might not help though.
I know it is late to the game, but here is a function that I created for T-SQL that quickly removes non-numeric characters. Of note, I have a schema "String" that I put utility functions for strings into...
CREATE FUNCTION String.ComparablePhone( @string nvarchar(32) ) RETURNS bigint AS
BEGIN
    DECLARE @out bigint;

    -- 1. table of unique characters to be kept
    DECLARE @keepers table ( chr nchar(1) not null primary key );
    INSERT INTO @keepers ( chr ) VALUES (N'0'),(N'1'),(N'2'),(N'3'),(N'4'),(N'5'),(N'6'),(N'7'),(N'8'),(N'9');

    -- 2. Identify the characters in the string to remove
    WITH found ( id, position ) AS
    (
        SELECT
            ROW_NUMBER() OVER (ORDER BY (n1+n10) DESC), -- since we are using stuff, for the position to continue to be accurate, start from the greatest position and work towards the smallest
            (n1+n10)
        FROM
            (SELECT 0 AS n1 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9) AS d1,
            (SELECT 0 AS n10 UNION SELECT 10 UNION SELECT 20 UNION SELECT 30) AS d10
        WHERE
            (n1+n10) BETWEEN 1 AND len(@string)
            AND substring(@string, (n1+n10), 1) NOT IN (SELECT chr FROM @keepers)
    )
    -- 3. Use stuff to snuff out the identified characters
    SELECT
        @string = stuff( @string, position, 1, '' )
    FROM
        found
    ORDER BY
        id ASC; -- important to process the removals in order, see ROW_NUMBER() above

    -- 4. Try and convert the results to a bigint
    IF len(@string) = 0
        RETURN NULL; -- an empty string converts to 0

    RETURN convert(bigint, @string);
END
Then to use it to compare for inserting, something like this:
INSERT INTO Contacts ( phone, first_name, last_name )
SELECT i.phone, i.first_name, i.last_name
FROM Imported AS i
LEFT JOIN Contacts AS c ON String.ComparablePhone(c.phone) = String.ComparablePhone(i.phone)
WHERE c.phone IS NULL -- Exclude those that already exist
Working with varchars is fundamentally slow and inefficient compared to working with numerics, for obvious reasons. The functions you link to in the original post will indeed be quite slow, as they loop through each character in the string to determine whether or not it's a number. Do that for thousands of records and the process is bound to be slow. This is the perfect job for Regular Expressions, but they're not natively supported in SQL Server. You can add support using a CLR function, but it's hard to say how slow this will be without trying it. I would definitely expect it to be significantly faster than looping through each character of each phone number, however!
Once you get the phone numbers formatted in your database so that they're only numbers, you could switch to a numeric type in SQL which would yield lightning-fast comparisons against other numeric types. You might find that, depending on how fast your new data is coming in, doing the trimming and conversion to numeric on the database side is plenty fast enough once what you're comparing to is properly formatted, but if possible, you would be better off writing an import utility in a .NET language that would take care of these formatting issues before hitting the database.
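A hedged sketch of that database-side switch (table and column names are illustrative, and it assumes the values have already been stripped to digits only):
ALTER TABLE dbo.Contacts ADD phoneNumeric BIGINT;
GO

UPDATE dbo.Contacts
SET phoneNumeric = CONVERT(BIGINT, phoneNumber)
WHERE phoneNumber NOT LIKE '%[^0-9]%'  -- digits only
  AND LEN(phoneNumber) > 0;

CREATE INDEX IX_Contacts_phoneNumeric ON dbo.Contacts (phoneNumeric);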
Either way though, you're going to have a big problem regarding optional formatting. Even if your numbers are guaranteed to be only North American in origin, some people will put the 1 in front of a fully area-code qualified phone number and others will not, which will cause the potential for multiple entries of the same phone number. Furthermore, depending on what your data represents, some people will be using their home phone number which might have several people living there, so a unique constraint on it would only allow one database member per household. Some would use their work number and have the same problem, and some would or wouldn't include the extension which would cause artificial uniqueness potential again.
All of that may or may not impact you, depending on your particular data and usages, but it's important to keep in mind!
I'd use an inline function from a performance perspective; see below.
Note that symbols like '+', '-', etc. will not be removed:
CREATE FUNCTION [dbo].[UDF_RemoveNumericStringsFromString]
(
    @str varchar(100)
)
RETURNS TABLE AS RETURN
WITH Tally (n) AS
(
    -- 100 rows
    SELECT TOP (LEN(@str)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) a(n)
    CROSS JOIN (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(n)
)
SELECT OutStr = STUFF(
    (SELECT SUBSTRING(@str, n, 1) st
     FROM Tally
     WHERE ISNUMERIC(SUBSTRING(@str, n, 1)) = 1
     FOR XML PATH(''), TYPE).value('.', 'varchar(100)'), 1, 0, '')
GO

/*Use it*/
SELECT OutStr
FROM dbo.UDF_RemoveNumericStringsFromString('fjkfhk759734977fwe9794t23')

/*Result set
759734977979423 */
You can define it with more than 100 characters...
"Although I can't isolate SQL as the source of the problem anymore, I still feel like it is."
Fire up SQL Profiler and take a look. Take the resulting queries and check their execution plans to make sure that index is being used.
Thousands of records against thousands of records is not normally a problem. I've used SSIS to import millions of records with de-duping like this.
I would clean up the database to remove the non-numeric characters in the first place and keep them out.
Looking for a super simple solution:
SUBSTRING([Phone], CHARINDEX('(', [Phone], 1) + 1, 3)
+ SUBSTRING([Phone], CHARINDEX(')', [Phone], 1) + 1, 3)
+ SUBSTRING([Phone], CHARINDEX('-', [Phone], 1) + 1, 4) AS Phone
Note that this assumes the stored numbers all follow one fixed (AAA)BBB-CCCC layout.
I would recommend enforcing a strict format for phone numbers in the database. I use the following format. (Assuming US phone numbers)
Database: 5555555555x555
Display: (555) 555-5555 ext 555
Input: 10 or more digits embedded in any string (regex replacement removes all non-numeric characters).