Extract phone number from noised string - sql

I have a column in a table that contains random data along with phone numbers in different formats. The column may contain
Name
Phone
Email
HTML tags
Addresses (with numbers)
Examples:
1) Call back from +79005346546, Conversation started<br>Phone: +79005346546<br>Called twice Came from google.com<br>IP: 77.106.46.202 the web page address is xxx.com utm_medium: cpc<br>utm_campaign: 32587871<br>utm_content: 5283041 79005346546
2) John Smith
3) xxx@yyy.com
4) John Smith 8 999 888 77 77
How a phone number is written also varies. It may look like 8 927 410 00 22, 8(927)410-00-22, +7(927)410-00-22, +7 (927) 410-00-22, (927)410 00 22, 927 410 00 22, 9(2741) 0 0 0-22 and so on.
The common rule here is that the phone number format contains 10-11 digits.
My best guess is to use regular expressions: first remove email addresses from the string (since they can contain phone numbers, like 79990001122@gmail.com), then use a regular expression to extract the phone number, knowing it is 10 or 11 digits in a row delimited by characters like (, ), +, - and so on. (I don't think anyone would use . as a phone digit delimiter, so we don't need to worry about IP addresses like 77.106.46.202 in the first sample.)
So the question is how to get phone numbers from these values.
The final values I want to get from the four examples above are:
1) 79005346546 79005346546 79005346546
2)
3)
4) 89998887777
The server is Microsoft SQL Server 2014 - 12.0.2000.8 (X64) Standard Edition (64-bit)

UPDATED (20200226)
There were a couple comments that a CLR/regex solution could be faster than the ngram8k solution I posted. I've heard this for six years but every single time, without exception, the test harness tells a different story. I already posted in the earlier comments instructions to get the Microsoft© MDQ family of CLR Regex running in just a few minutes. They were developed, tested and tuned by Microsoft and ship with Master Data Services/Data Quality Services. I've used them for years, they're good.
RegexReplace/RegexSplit vs PatExtract8k/DigitsOnlyEE: 1,000,000 rows
Obviously you don't want functions in your WHERE clause but, since my Regex is rusty AF, I needed to. To level the playing field I did the same with DigitsOnlyEE in the N-Gram solution's WHERE clause.
SET NOCOUNT ON;
DBCC FREEPROCCACHE WITH NO_INFOMSGS;
DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS;
SET STATISTICS TIME ON;
DECLARE
@newData BIT = 0,
@string VARCHAR(8000) = '1) Call back from +79005346546, Conversation started<br>Phone: +79005346546<br>Called twice Came from google.com<br>IP: 77.106.46.202 the web page address is xxx.com utm_medium: cpc<br>utm_campaign: 32587871<br>utm_content: 5283041 79005346546 ',
@pattern VARCHAR(50) = '[^0-9()+.-]',
@srchLen INT = 11;
IF @newData = 1
BEGIN
IF OBJECT_ID('tempdb..#strings','U') IS NOT NULL DROP TABLE #strings;
SELECT
StringId = IDENTITY(INT,1,1),
String = REPLICATE(@string,ABS(CHECKSUM(NEWID())%3)+1)
INTO #strings
FROM dbo.rangeAB(1,1000000,1,1) AS r;
END
PRINT CHAR(10)+'Regex/CLR version Serial'+CHAR(10)+REPLICATE('-',90);
SELECT regex.NewString
FROM #strings AS s
CROSS APPLY
(
SELECT STRING_AGG(clr.RegexReplace(f.Token,'[^0-9]','',0),' ')
FROM clr.RegexSplit(s.string,@pattern,N'[0-9()+.-]',0) AS f
WHERE f.IsValid = 1
AND LEN(clr.RegexReplace(f.Token,'[^0-9]','',0)) = @srchLen
) AS regex(NewString);
PRINT CHAR(10)+'NGrams version Serial'+CHAR(10)+REPLICATE('-',90);
SELECT ngramsStuff.NewString
FROM #strings AS s
CROSS APPLY
(
SELECT STRING_AGG(ee.digitsOnly,' ')
FROM samd.patExtract8K(@string,@pattern) AS pe
CROSS APPLY samd.digitsOnlyEE(pe.item) AS ee
WHERE LEN(ee.digitsOnly) = @srchLen
) AS ngramsStuff(NewString)
OPTION (MAXDOP 1);
SET STATISTICS TIME OFF;
GO
Test Results
Regex/CLR version Serial
------------------------------------------------------------------------------------------
SQL Server Execution Times: CPU time = 19918 ms, elapsed time = 12355 ms.
NGrams version Serial
------------------------------------------------------------------------------------------
SQL Server Execution Times: CPU time = 844 ms, elapsed time = 971 ms.
NGrams8k is very fast, does not require you to compile a new assembly, learn a new programming language, Enable CLR functions, etc... No issues with garbage collection. Even the CLR N-GRAMs function that ships with MDS/DQS can't touch NGrams8k for performance (see the comments under my article).
END OF UPDATE
First grab a copy of ngrams8k and use it to build PatExtract8k (DDL below at the bottom of this post.) Next a quick warm-up:
DECLARE
@string VARCHAR(8000) = 'Call me later at 222-3333 or tomorrow at 312.555.2222,
(313)555-6789, or at 1+800-555-4444 before noon. Thanks!',
@pattern VARCHAR(50) = '%[^0-9()+.-]%';
SELECT pe.itemNumber, pe.itemIndex, pe.itemLength, pe.item
FROM samd.patExtract8K(@string,@pattern) AS pe
WHERE pe.itemLength > 1;
Returns:
ItemNumber ItemIndex ItemLength Item
----------- ----------- ----------- ----------------
1 18 8 222-3333
2 42 12 312.555.2222
3 91 13 (313)555-6789
4 112 14 1+800-555-4444
Note that the function returns each item's ordinal number, its position in the string, its length, and the item itself. The first three attributes can be leveraged for further processing, which brings us to your post. Note my comments:
-- First for some easily consumable sample data.
DECLARE @things TABLE (StringId INT IDENTITY, String VARCHAR(8000));
INSERT @things (String)
VALUES
('Call back from +79005346546, Conversation started<br>Phone: +79005346546<br>Called twice Came from google.com<br>IP: 77.106.46.202 the web page address is xxx.com utm_medium: cpc<br>utm_campaign: 32587871<br>utm_content: 5283041 79005346546 '),
('John Smith'),
('xxx@yyy.com'),
('John Smith 8 999 888 77 77');
DECLARE @SrchLen INT = 11;
SELECT
StringId = t.StringId,
ItemIndex = pe.itemIndex,
ItemLength = @SrchLen,
Item = i2.Item
FROM @things AS t
CROSS APPLY samd.patExtract8K(t.String,'[^0-9 ]') AS pe
CROSS APPLY (VALUES(PATINDEX('%'+REPLICATE('[0-9]',@SrchLen), pe.item))) AS i(Idx)
CROSS APPLY (VALUES(SUBSTRING(pe.Item,NULLIF(i.Idx,0),@SrchLen))) AS ns(NewString)
CROSS APPLY (VALUES(ISNULL(ns.NewString, REPLACE(pe.item,' ','')))) AS i2(Item)
WHERE pe.itemLength >= @SrchLen;
Returns:
StringId ItemIndex ItemLength Item
----------- -------------------- ----------- -----------
1 17 11 79005346546
1 62 11 79005346546
1 221 11 79005346546
4 11 11 89998887777
Next, we can keep the rows that have no phone number (via an outer join) and concatenate each row's numbers into a single string like this:
WITH t AS
(
SELECT i2.Item, t.StringId
FROM @things AS t
CROSS APPLY samd.patExtract8K(t.String,'[^0-9 ]') AS pe
CROSS APPLY (VALUES(PATINDEX('%'+REPLICATE('[0-9]',@SrchLen), pe.item))) AS i(Idx)
CROSS APPLY (VALUES(SUBSTRING(pe.Item,NULLIF(i.Idx,0),@SrchLen))) AS ns(NewString)
CROSS APPLY (VALUES(ISNULL(ns.NewString, REPLACE(pe.item,' ','')))) AS i2(Item)
WHERE pe.itemLength >= @SrchLen
)
SELECT
StringId = t2.StringId,
NewString = ISNULL((
SELECT t.item+' '
FROM t
WHERE t.StringId = t2.StringId
FOR XML PATH('')),'')
FROM @things AS t2
LEFT JOIN t AS t1 ON t2.StringId = t1.StringId
GROUP BY t2.StringId;
Returns:
StringId NewString
--------- --------------------------------------
1 79005346546 79005346546 79005346546
2
3
4 89998887777
I wish I had a little more time for additional details but this took a little longer than planned. Any questions welcome.
Patextract:
CREATE FUNCTION samd.patExtract8K
(
@string VARCHAR(8000),
@pattern VARCHAR(50)
)
/*****************************************************************************************
[Description]:
This can be considered a T-SQL inline table valued function (iTVF) equivalent of
Microsoft's mdq.RegexExtract except that:
1. It includes each matching substring's position in the string
2. It accepts varchar(8000) instead of nvarchar(4000) for the input string, varchar(50)
instead of nvarchar(4000) for the pattern
3. The mask parameter is not required and therefore does not exist.
4. You have to specify what text we're searching for as an exclusion; e.g. for numeric
characters you should search for '[^0-9]' instead of '[0-9]'.
5. There is no parameter for naming a "capture group". Using the variable below, both
the following queries will return the same result:
DECLARE @string nvarchar(4000) = N'123 Main Street';
SELECT item FROM samd.patExtract8K(@string, '[^0-9]');
SELECT clr.RegexExtract(@string, N'(?<number>(\d+))(?<street>(.*))', N'number', 1);
Alternatively, you can think of patExtract8K as Chris Morris' PatternSplitCM (found here:
http://www.sqlservercentral.com/articles/String+Manipulation/94365/) but only returns the
rows where [matched]=0. The key benefit is that it performs substantially better
because you are only returning the number of rows required instead of returning twice as
many rows then filtering out half of them. Furthermore, because we're
The following two sets of queries return the same result:
DECLARE @string varchar(100) = 'xx123xx555xx999';
BEGIN
-- QUERY #1
-- patExtract8K
SELECT ps.itemNumber, ps.item
FROM samd.patExtract8K(@string, '[^0-9]') ps;
-- patternSplitCM
SELECT itemNumber = row_number() over (order by ps.itemNumber), ps.item
FROM dbo.patternSplitCM(@string, '[^0-9]') ps
WHERE [matched] = 0;
-- QUERY #2
SELECT ps.itemNumber, ps.item
FROM samd.patExtract8K(@string, '[0-9]') ps;
SELECT itemNumber = row_number() over (order by itemNumber), item
FROM dbo.patternSplitCM(@string, '[0-9]')
WHERE [matched] = 0;
END;
[Compatibility]:
SQL Server 2008+
[Syntax]:
--===== Autonomous
SELECT pe.ItemNumber, pe.ItemIndex, pe.ItemLength, pe.Item
FROM samd.patExtract8K(@string,@pattern) pe;
--===== Against a table using APPLY
SELECT t.someString, pe.ItemIndex, pe.ItemLength, pe.Item
FROM samd.SomeTable t
CROSS APPLY samd.patExtract8K(t.someString, @pattern) pe;
[Parameters]:
@string = varchar(8000); the input string
@pattern = varchar(50); pattern to search for
[Returns]:
itemNumber = bigint; the instance or ordinal position of the matched substring
itemIndex = bigint; the location of the matched substring inside the input string
itemLength = int; the length of the matched substring
item = varchar(8000); the returned text
[Developer Notes]:
1. Requires NGrams8k
2. patExtract8K does not return any rows on NULL or empty strings. Consider using
OUTER APPLY or append the function with the code below to force the function to return
a row on empty or NULL inputs:
UNION ALL SELECT 1, 0, NULL, @string WHERE nullif(@string,'') IS NULL;
3. patExtract8K is not case sensitive; use a case sensitive collation for
case-sensitive comparisons
4. patExtract8K is deterministic. For more about deterministic functions see:
https://msdn.microsoft.com/en-us/library/ms178091.aspx
5. patExtract8K performs substantially better with a parallel execution plan, often
2-3 times faster. For queries that leverage patextract8K that are not getting a
parallel execution plan you should consider performance testing using Traceflag 8649
in Development environments and Adam Machanic's make_parallel in production.
[Examples]:
--===== (1) Basic extract all groups of numbers:
WITH temp(id, txt) as
(
SELECT * FROM (values
(1, 'hello 123 fff 1234567 and today;""o999999999 tester 44444444444444 done'),
(2, 'syat 123 ff tyui( 1234567 and today 999999999 tester 777777 done'),
(3, '&**OOOOO=+ + + // ==?76543// and today !!222222\\\tester{}))22222444 done'))t(x,xx)
)
SELECT
[temp.id] = t.id,
pe.itemNumber,
pe.itemIndex,
pe.itemLength,
pe.item
FROM temp AS t
CROSS APPLY samd.patExtract8K(t.txt, '[^0-9]') AS pe;
-----------------------------------------------------------------------------------------
Revision History:
Rev 00 - 20170801 - Initial Development - Alan Burstein
Rev 01 - 20180619 - Complete re-write - Alan Burstein
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT itemNumber = ROW_NUMBER() OVER (ORDER BY f.position),
itemIndex = f.position,
itemLength = itemLen.l,
item = SUBSTRING(f.token, 1, itemLen.l)
FROM
(
SELECT ng.position, SUBSTRING(@string,ng.position,DATALENGTH(@string))
FROM samd.NGrams8k(@string, 1) AS ng
WHERE PATINDEX(@pattern, ng.token) < --<< this token does NOT match the pattern
ABS(SIGN(ng.position-1)-1) + --<< are you the first row? OR
PATINDEX(@pattern,SUBSTRING(@string,ng.position-1,1)) --<< always 0 for 1st row
) AS f(position, token)
CROSS APPLY (VALUES(ISNULL(NULLIF(PATINDEX('%'+@pattern+'%',f.token),0),
DATALENGTH(@string)+2-f.position)-1)) AS itemLen(l);
GO

The following isn't a direct answer to the question but shows how it can be done in PostgreSQL, which has a mature regular expression replace function. I would expect the solution to be adaptable to SQL Server using some kind of CLR integration library, but I'm not experienced in that...
SQL
SELECT REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE(phoneNumber, '((([0-9])[ ()+-]*){10,11})([^0-9]|$)', '`\1¬','g'),
'(^|¬)[^`¬]*(`|$)', ',', 'g'),
'(^,|,$|[^0-9,])', '', 'g')
FROM tbl;
Online Demo
db-fiddle.uk demo: https://dbfiddle.uk/?rdbms=postgres_12&fiddle=b12d9f9779b686fd0c4aa84956595f70
Explanation
The innermost REGEXP_REPLACE locates groups of either 10 or 11 digits, each of which may have any number of space, bracket, plus or minus characters after it. The group must either be followed by a non-digit character or the end of the line. For each located group, a single ` is inserted before the group of digits and a single ¬ is appended after it. You might need to adjust these characters to something rarer - they shouldn't appear anywhere else in the text.
The middle REGEXP_REPLACE replaces each block of text that isn't between a pair of marker characters with a single comma.
The outermost REGEXP_REPLACE removes any commas at the start or end of the string and also removes anything that isn't a digit or comma.
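To make the three steps concrete, here is a small PostgreSQL sketch that exposes the intermediate strings for one input. The CTE names (src, step1, step2) and the output column names are made up for illustration; the sample value is row 4 from the question, and the regular expressions are copied unchanged from the query above.

```sql
-- PostgreSQL sketch: run the three REGEXP_REPLACE passes one at a time.
WITH src(phoneNumber) AS (
    VALUES ('John Smith 8 999 888 77 77')
),
step1 AS (
    -- Pass 1: wrap each 10-11 digit group in ` ... ¬ markers.
    SELECT REGEXP_REPLACE(phoneNumber,
               '((([0-9])[ ()+-]*){10,11})([^0-9]|$)', '`\1¬', 'g') AS marked
    FROM src
),
step2 AS (
    -- Pass 2: collapse everything outside the markers into a comma.
    SELECT marked,
           REGEXP_REPLACE(marked, '(^|¬)[^`¬]*(`|$)', ',', 'g') AS separated
    FROM step1
)
-- Pass 3: drop leading/trailing commas and anything that is not a digit or comma.
SELECT marked,
       separated,
       REGEXP_REPLACE(separated, '(^,|,$|[^0-9,])', '', 'g') AS phones
FROM step2;
-- marked:    John Smith `8 999 888 77 77¬
-- separated: ,8 999 888 77 77,
-- phones:    89998887777
```

Looking at the marked and separated columns is also a quick way to check whether the two marker characters collide with anything in your own data.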

Related

Regex replace or LISTAGG on SQL server

I need to "translate" this statement to SQL server
regexp_replace(Main.LOCK, '\/\/.*', '') TARGET
I need to get rid of these characters because before (or after, depending on how you look at it) I use this one
LISTAGG(stock.LOCATION_NO, '//') WITHIN GROUP (ORDER BY isnull(QTY_OH,0)+isnull(QTY_TR,0) - isnull(QTY_RS, 0) desc) LOCK
Neither regexp_replace nor LISTAGG can be used in SQL Server.
What I'm trying to do (and it worked very well in Oracle) is to get the TARGET value that contains the Main.LOCK with the MAXIMUM value of
isnull(QTY_OH,0)+isnull(QTY_TR,0) - isnull(QTY_RS, 0)
Now I can't translate it properly to SQL Server.
Also, the errors I get are:
Msg 195, Level 15, State 10, Line 12
'regexp_replace' is not a recognized built-in function name.
Msg 10757, Level 15, State 1, Line 49
The function 'LISTAGG' may not have a WITHIN GROUP clause.
Can anyone help here?
SQL Server ver 18.8
Warehouse Ver 13.0
Microsoft SQL Server 2016 (SP2-GDR) (KB4583460) - 13.0.5103.6 (X64)
Nov 1 2020 00:13:28
Copyright (c) Microsoft Corporation
Standard Edition (64-bit) on Windows Server 2016 Standard 10.0 (Build 14393: ) (Hypervisor)
The regexp_replace() is doing something pretty simple. It is taking the portion of the string before '//', if that is there.
In SQL Server 2017 and later, you can use string_agg() and left():
string_agg(left(main.lock,
charindex('//', main.lock + '//')- 1
), '//'
) within group (order by coalesce(qty_oh, 0) + coalesce(qty_tr, 0) - coalesce(qty_rs, 0) desc)
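To see what the left()/charindex() part does on its own, here is a minimal sketch against made-up values. The column name lock_value and the sample strings are assumptions for illustration, not data from the question.

```sql
-- Keep only the text before the first '//' (or the whole string when there is no '//').
-- Appending '//' before searching guarantees that CHARINDEX always finds a match.
SELECT v.lock_value,
       target_value = LEFT(v.lock_value, CHARINDEX('//', v.lock_value + '//') - 1)
FROM (VALUES ('A-01//B-07//C-12'),  -- several locations -> 'A-01'
             ('A-01'),              -- single location   -> 'A-01'
             (''))                  -- empty string      -> ''
     AS v(lock_value);
```

This part uses only LEFT and CHARINDEX, so it does not depend on STRING_AGG being available.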
SQL Server has no built-in Regex functions but they are available via CLR. The good news is you don't need Regex in SQL Server. Everything I used to do with RegEx I now handle using NGrams8k. It's easy and performs much better. I've built a few functions using NGrams8K that would be helpful for this problem and many others. First we have PatReplace8K, second is Translate8K (updated code for both below.) A third option is PatExtract8K (follow the link for the code).
Examples of each performing the text transformation. With each function I'm just removing the Alpha characters and the numbers from 0-5 from "SomeString":
--==== Sample Data
DECLARE @table TABLE (SomeId INT IDENTITY, SomeString VARCHAR(40));
INSERT @table(SomeString) VALUES(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID());
--==== Using Patreplace8k
SELECT t.SomeString, f.NewString
FROM @table AS t
CROSS APPLY samd.patReplace8K(t.SomeString,'[0-5A-F-]','') AS f
--==== Using Translate8K
SELECT t.SomeString, f.NewString
FROM @table AS t
CROSS APPLY samd.Translate8K(t.SomeString,'[012345ABCDEF-]','') AS f
--==== samd.patExtract8K
SELECT t.SomeString,
NewString = STRING_AGG(f.item,'') WITHIN GROUP (ORDER BY f.ItemNumber)
FROM @table AS t
CROSS APPLY samd.patExtract8K(t.SomeString,'[0-5A-F-]') AS f
GROUP BY t.SomeString;
Each returns:
SomeString NewString
---------------------------------------- -----------------
0818BEF3-E0B3-4B3B-AA97-649E43EB16AF 8897696
3077EE8B-9E92-4337-9E2F-97DABE2E4623 7789979976
6BCD8194-F993-42DB-AF4A-D8289F8F8DA3 6899988988
C1F152DF-8B6F-4C14-AF6F-AC8869099FDB 866886999
F877D888-245E-4CEB-84B7-1CFF6E03B974 87788887697
To perform your string aggregation you can use XML PATH() or STRING_AGG. Here's an example using both techniques and PatReplace8k:
SELECT NewString = STUFF((
SELECT '//'+NewString
FROM @table AS t
CROSS APPLY samd.patReplace8K(t.SomeString,'[0-5A-F-]','') AS f
ORDER BY f.NewString
FOR XML PATH('')),1,2,'');
SELECT STRING_AGG(f.NewString,'//') WITHIN GROUP (ORDER BY f.NewString)
FROM @table AS t
CROSS APPLY samd.patReplace8K(t.SomeString,'[0-5A-F-]','') AS f;
In each case I get what I want:
NewString
----------------------------------------------------
6967899797//777689886//868796//8887789//88989
Translate Function:
CREATE OR ALTER FUNCTION samd.Translate8K
(
@string VARCHAR(8000), -- Input
@pattern VARCHAR(100), -- characters to replace
@key VARCHAR(100) -- replacement characters
)
/*
Purpose:
Standard Translate function - the fastest UDF version in the game. Enjoy.
For more about TRANSLATE see: https://www.w3schools.com/sql/func_sqlserver_translate.asp
Requires:
NGrams8K; get you some here:
https://www.sqlservercentral.com/articles/nasty-fast-n-grams-part-1-character-level-unigrams
Designed By Alan Burstein; May, 2021
*/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT NewString = ISNULL(REPLACE(tx.NewString,CHAR(0),''),@string)
FROM
(
SELECT STRING_AGG(CAST(t.tKey+ng.Token AS CHAR(1)),'')
WITHIN GROUP (ORDER BY ng.Position)
FROM samd.ngrams8k(@string,1) AS ng
CROSS APPLY (VALUES(@key+REPLICATE(CHAR(0),100))) AS tx(NewKey)
CROSS APPLY (VALUES(CHARINDEX(ng.Token,@pattern))) AS pos(N)
CROSS APPLY (VALUES(SUBSTRING(tx.NewKey,pos.N,1))) AS t(tKey)
) AS tx(NewString);
Patreplace8k:
CREATE OR ALTER FUNCTION [samd].[patReplace8K]
(
@string VARCHAR(8000),
@pattern VARCHAR(50),
@replace VARCHAR(20)
)
/*****************************************************************************************
[Purpose]:
Given a string (@string), a pattern (@pattern), and a replacement character (@replace)
patReplace8K will replace any character in @string that matches the @Pattern parameter
with the character, @replace.
[Author]:
Alan Burstein
[Compatibility]:
SQL Server 2008+
[Syntax]:
--===== Basic Syntax Example
SELECT pr.NewString
FROM samd.patReplace8K(@String,@Pattern,@Replace) AS pr;
[Developer Notes]:
1. @Pattern IS case sensitive but can be easily modified to make it case insensitive
2. There is no need to include the "%" before and/or after your pattern since we
are evaluating each character individually
3. Certain special characters, such as "$" and "%" need to be escaped with a "/"
like so: [/$/%]
4. Functions that use samd.ngrams8k will see huge performance gains when the optimizer
generates a parallel execution plan. One way to get a parallel query plan (if the
optimizer does not choose one) is to use make_parallel by Adam Machanic found here:
sqlblog.com/blogs/adam_machanic/archive/2013/07/11/next-level-parallel-plan-porcing.aspx
As is the case with functions which leverage samd.ngrams or samd.ngrams8k,
samd.patReplace8K is almost always dramatically faster with a parallel execution
plan. On my PC (8 logical CPU, 64GB RAM, SQL 2019) samd.patReplace8K is about 4X
faster when executed using all 8 of my logical CPUs.
5. samd.patReplace8K is deterministic. For more about deterministic functions see:
https://msdn.microsoft.com/en-us/library/ms178091.aspx
[Examples]:
--===== 1. Replace numeric characters with a "*"
SELECT pr.NewString
FROM samd.patReplace8K('My phone number is 555-2211','[0-9]','*') AS pr;
--==== 2. Using against a table
DECLARE @table TABLE(OldString varchar(60));
INSERT @table VALUES ('Call me at 555-222-6666'), ('phone number: (312)555-2323'),
('He can be reached at 444.665.4466 on Monday.');
SELECT t.OldString, pr.NewString
FROM @table AS t
CROSS APPLY samd.patReplace8K(t.oldstring,'[0-9]','*') AS pr;
[Revision History]:
-----------------------------------------------------------------------------------------
Rev 00 - 20141027 Initial Development - Alan Burstein
Rev 01 - 20141029 - Redesigned based on the dbo.STRIP_NUM_EE by Eirikur Eiriksson
(see: http://www.sqlservercentral.com/Forums/Topic1585850-391-2.aspx)
- change how the cte tally table is created
- put the include/exclude logic in a CASE statement instead of a WHERE clause
- Added Latin1_General_BIN Collation
- Add code to use the pattern as a parameter. - Alan Burstein
Rev 02 - 20141106 - Added final performance enhancement (more kudos to Eirikur Eiriksson)
- Put 0 = PATINDEX filter logic into the WHERE clause
Rev 03 - 20150516 - Updated to deal with special XML characters - Alan Burstein
Rev 04 - 20170320 - changed #replace from char(1) to varchar(1) for whitespace handling
- Alan Burstein
Rev 05 - 20200515 - Complete rewrite using samd.NGrams
- changed PATINDEX(...)=0 to: PATINDEX()&0x01=0;
- Changed CASE statement to IIF; Dropped collation - Alan Burstein
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT NewString = STRING_AGG(IIF(PATINDEX(@pattern,col.Token)&0x01=0,col.Token,
@replace),'') WITHIN GROUP (ORDER BY ng.position)
FROM samd.NGrams8K(@string,1) AS ng
CROSS APPLY (VALUES(ng.token)) AS col(Token);

Filter IDs with just numbers excluding letters

So I have results that begin with 2 letters followed by 3 numbers, for example:
ID_Sample
AB001
BC003
AB100
BC400
How can I do a query that ignores the letters and just looks up the numbers to do a filter? For example:
WHERE ID_Sample >= 100
I tried using a "Replace" to get rid of known letters, but I figured there might be a better way. For example:
Select
Replace(id_sample,'AB','')
Choosing the 3 numerals on the right would work too.
For your sample data, you can just start at the third character and convert to a number:
where try_convert(int, stuff(ID_Sample, 1, 2, '')) >= 100
Or, if you know that the number is 3 characters:
where try_convert(int, right(ID_Sample, 3)) >= 100
+1 for Gordon's answer. This is a fun problem that you can solve using TRANSLATE if you're using SQL 2017+.
First, in case you've never used it, Per BOL TRANSLATE:
Returns the string provided as a first argument after some characters
specified in the second argument are translated into a destination set
of characters specified in the third argument.
This:
SELECT TRANSLATE('123AABBCC!!!','ABC','XYZ');
Returns: 123XXYYZZ!!!
Here's the solution using TRANSLATE:
-- Sample Data
DECLARE @t TABLE (ID_Sample CHAR(6));
INSERT @t (ID_Sample) VALUES ('AB001'),('BC003'),('AB100'),('BC400'),('CC555');
-- Solution
SELECT
ID_Sample = t.ID_Sample,
ID_Sample_Int = s.NewString
FROM @t AS t
CROSS JOIN (VALUES('ABCDEFGHIJKLMNOPQRSTUVWXYZ', REPLICATE(0,26))) AS f(S1,S2)
CROSS APPLY (VALUES(TRY_CAST(TRANSLATE(t.ID_Sample,f.S1,f.S2) AS INT))) AS s(NewString)
WHERE s.NewString >= 100;
Without the WHERE clause filter you get:
ID_Sample ID_Sample_Int
--------- -------------
AB001 1
BC003 3
AB100 100
BC400 400
CC555 555
... the WHERE clause filters out the first two rows.
Check these methods. Unit tested as well!
Declare @Table as table(ID_Sample varchar(20))
set nocount on
Insert into @Table (ID_Sample)
Values('AB001'),('BC003'),('AB100'),('BC400')
--substring_method
select * from @Table
where try_cast(substring(ID_Sample,3,3) as int) >= 100
--right_method
select * from @Table
where try_cast(right(ID_Sample,3) as int) >= 100
--stuff_method
select * from @Table
where try_cast(stuff(ID_Sample,1,2,'') as int) >= 100
--replace_method
select * from @Table
where try_cast(replace(ID_Sample,left(ID_Sample,2),'') as int) >= 100

Remove all non numeric characters in sql SELECT

I want to remove all non-numeric characters when I call the query in SQL.
I have a function and in function, I do it so:
Declare @KeepValues as varchar(50)
Set @KeepValues = '%[^0-9]%'
While PatIndex(@KeepValues, @Temp) > 0
Set @Temp = Stuff(@Temp, PatIndex(@KeepValues, @Temp), 1, '')
But now I want to do it with query (select).
I tried so but this doesn't work
select substring(AdrTelefon1, PatIndex('%[^0-9]%', AdrTelefon1), 2000) from test
EDIT
I have it!
Select query to remove non-numeric characters
It does not work correctly
SELECT LEFT(SUBSTRING(AdrTelefon1, PATINDEX('%[0-9]%', AdrTelefon1), 8000),
PATINDEX('%[^0-9]%', SUBSTRING(AdrTelefon1, PATINDEX('%[0-9]%', AdrTelefon1), 8000) + 'X') -1) from test
I have 04532/97 and after this query, I have 04532 BUT I NEED 0453297
Some time ago I solved that problem using the below function
create function dbo.[fnrReplacetor](@strtext varchar(2000))
returns varchar(2000)
as
begin
declare @i int = 32, @rplc varchar(1) = '';
while @i < 256
begin
if (@i < 48 or @i > 57) and CHARINDEX(char(@i),@strtext) > 0
begin
--° #176 ~ 0 --¹ #185 ~ 1 --² #178 ~ 2 --³ #179 ~ 3
set @rplc = case @i
when 176 then '0'
when 185 then '1'
when 178 then '2'
when 179 then '3'
else '' end;
set @strtext = REPLACE(@strtext,CHAR(@i),@rplc);
end
set @i = @i + 1;
end
return @strtext;
end
GO
select dbo.[fnrReplacetor]('12345/97')
Note that it will also treat the characters °, ¹, ², ³ as numeric and replace them with 0, 1, 2, 3.
I put it in a function so I could readily reuse it; in my scenario I needed to fix many columns in many tables at once.
update t
set t.myColumn = dbo.[fnrReplacetor](t.myColumn)
from test t
where t.myColumn is not null
or just
select dbo.[fnrReplacetor](t.myColumn) as [Only Digits]
from test t
where t.myColumn is not null
Obs: this is not the fastest way but a thorough one.
Edit
A non UDF solution must be use REPLACE but since regex is not that great in SQL you can end doing something nasty like the below example:
declare #test as table (myColumn varchar(50))
insert into #test values ('123/45'),('123-4.5')
Select replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(myColumn,'a',''),'b',''),'c',''),'d',''),'e',''),'f',''),'g',''),'h',''),'i',''),'j',''),'k',''),'l',''),'m',''),'n',''),'o',''),'p',''),'q',''),'r',''),'s',''),'t',''),'u',''),'v',''),'w',''),'x',''),'y',''),'z',''),'A',''),'B',''),'C',''),'D',''),'E',''),'F',''),'G',''),'H',''),'I',''),'J',''),'K',''),'L',''),'M',''),'N',''),'O',''),'P',''),'Q',''),'R',''),'S',''),'T',''),'U',''),'V',''),'W',''),'X',''),'Y',''),'Z',''),'.',''),'-',''),'/','')
from #test
@Emma W.
I agree with the others... you actually should use a function for this. Here's a very high-performance function that will work for 2008 and above. It includes full documentation and usage examples.
As a bit of a sidebar, any function that contains the word BEGIN is either a slow, performance hogging, scalar function or mTVF (multi-statement Table Valued Function). Most savvy DBAs won't allow either but may not know the difference between those two and an iTVF (inline Table Valued Function), like the one below.
CREATE OR ALTER FUNCTION [dbo].[DigitsOnly]
/**********************************************************************************************************************
Purpose:
Given a VARCHAR(8000) or less string, return only the numeric digits from the string.
Programmer's Notes:
1. This is an iTVF (Inline Table Valued Function) that will be used as an iSF (Inline Scalar Function) in that it
returns a single value in the returned table and should normally be used in the FROM clause as with any other iTVF.
2. The main performance enhancement is using a WHERE clause calculation to prevent the relatively expensive XML PATH
concatenation of empty strings normally determined by a CASE statement in the XML "loop".
3. Another performance enhancement is not making this function a generic function that could handle a pattern. That
allows us to use all integer math to do the comparison using the high speed ASCII function to convert characters to
their numeric equivalent. ASCII characters 48 through 57 are the digit characters of 0 through 9 in most languages.
4. Last but not least, added another of Eirikur's later optimizations using 0x7FFF which he says is a "simple trick to
shift all the negative values to the top of the range so a single operator can be applied, which is a lot less
expensive than using between".
-----------------------------------------------------------------------------------------------------------------------
Kudos:
1. Hats off to Eirikur Eiriksson for the ASCII conversion idea and for the reminders that dedicated functions will
always be faster than generic functions and that integer math beats the tar out of character comparisons that use
LIKE or PATINDEX.
2. Hats off to all of the good people that submitted and tested their code on the following thread. It's this type of
participation and interest that makes code better. You've just gotta love this community.
http://www.sqlservercentral.com/Forums/Topic1585850-391-2.aspx#bm1629360
-----------------------------------------------------------------------------------------------------------------------
Usage Example:
--===== CROSS APPLY example
SELECT ca.DigitsOnly
FROM dbo.SomeTable st
CROSS APPLY dbo.DigitsOnly(st.SomeVarcharCol) ca
;
-----------------------------------------------------------------------------------------------------------------------
Test Harness:
--===== Create the 1 Million row test table
DROP TABLE IF EXISTS #TestTable
;
SELECT TOP 1000000
Txt = ISNULL(CONVERT(VARCHAR(36),NEWID()),'')
INTO #TestTable
FROM sys.all_columns ac1
CROSS JOIN sys.all_columns ac2
;
ALTER TABLE #TestTable
ADD PRIMARY KEY CLUSTERED (Txt)
;
GO
--===== CROSS APPLY example.
-- This takes ~ 1 second to execute.
DROP TABLE IF EXISTS #Results;
SELECT tt.Txt, ca.DigitsOnly
INTO #Results
FROM #TestTable tt
CROSS APPLY dbo.DigitsOnly(Txt) ca
;
GO
--===== Return the results for manual verification.
SELECT * FROM #Results
;
-----------------------------------------------------------------------------------------------------------------------
Revision History:
Rev 00 - 28 Oct 2014 - Eirikur Eiriksson
- Initial creation and unit/performance tests.
Rev 01 - 29 Oct 2014 - Jeff Moden
- Performance enhancement and unit/performance tests.
Rev 02 - 30 Oct 2014 - Eirikur Eiriksson
- Additional Performance enhancement
Rev 03 - 01 Sep 2014 - Jeff Moden
- Formalize the code and add the documentation that appears in the flower box of this code.
***********************************************************************************************************************/
--======= Declare the I/O for this function
(@pString VARCHAR(8000))
RETURNS TABLE WITH SCHEMABINDING AS
RETURN WITH
E1(N) AS (SELECT N FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) AS E0(N))
,Tally(N) AS (SELECT TOP (LEN(@pString)) (ROW_NUMBER() OVER (ORDER BY (SELECT 1))) FROM E1 a,E1 b,E1 c,E1 d)
SELECT DigitsOnly =
(
SELECT SUBSTRING(@pString,N,1)
FROM Tally
WHERE ((ASCII(SUBSTRING(@pString,N,1)) - 48) & 0x7FFF) < 10
FOR XML PATH('')
)
;
GO
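The WHERE clause in the function above tests for a digit with a single comparison. A small sketch of why that works, using a few hand-picked sample characters (the column names Diff, Masked and IsDigit are only illustrative): subtracting 48 maps '0' to '9' onto 0 to 9, characters above '9' land at 10 or more, and characters below '0' go negative, which the bitwise AND with 0x7FFF turns into a large positive number.

```sql
-- Illustrative only: shows how ((ASCII(ch) - 48) & 0x7FFF) < 10 isolates digits.
SELECT c.Ch,
       Diff    = ASCII(c.Ch) - 48,               -- '0'..'9' become 0..9
       Masked  = (ASCII(c.Ch) - 48) & 0x7FFF,    -- negatives become large positives
       IsDigit = CASE WHEN ((ASCII(c.Ch) - 48) & 0x7FFF) < 10 THEN 1 ELSE 0 END
FROM (VALUES ('0'),('9'),(' '),('+'),('A')) AS c(Ch);
-- '0' -> Diff   0, Masked     0, IsDigit 1
-- '9' -> Diff   9, Masked     9, IsDigit 1
-- ' ' -> Diff -16, Masked 32752, IsDigit 0
-- '+' -> Diff  -5, Masked 32763, IsDigit 0
-- 'A' -> Diff  17, Masked    17, IsDigit 0
```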
If you're really up against a wall and cannot use a function of any type because of "Rules" that have no exceptions (a really bad idea), then post back and we can show you how to convert it into inline code with a little help from you.
Whatever you do, don't use a WHILE loop for this task... it'll kill you both performance-wise and resource-usage-wise.
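For what it's worth, here is a rough sketch of what the inline (no function) form might look like. It is not the author's own conversion, just the same tally-table logic moved into a CROSS APPLY. The table variable @Sample and its column Txt are made up for the demo, and an ORDER BY has been added so the digits are concatenated in string order.

```sql
-- Hypothetical sample data; in practice this would be your own table and column.
DECLARE @Sample TABLE (Txt VARCHAR(8000));
INSERT @Sample (Txt)
VALUES ('John Smith 8 999 888 77 77'), ('John Smith'), ('+7 (927) 410-00-22');

WITH E1(N) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) AS E0(N))
SELECT s.Txt, ca.DigitsOnly
FROM @Sample s
CROSS APPLY
(
    SELECT DigitsOnly =
    (
        SELECT SUBSTRING(s.Txt, t.N, 1)
        FROM ( -- one row per character position, up to 10,000
               SELECT TOP (LEN(s.Txt)) ROW_NUMBER() OVER (ORDER BY (SELECT 1))
               FROM E1 a, E1 b, E1 c, E1 d
             ) t(N)
        WHERE ((ASCII(SUBSTRING(s.Txt, t.N, 1)) - 48) & 0x7FFF) < 10  -- keep digits only
        ORDER BY t.N
        FOR XML PATH('')
    )
) ca;
-- 'John Smith 8 999 888 77 77' -> 89998887777
-- 'John Smith'                 -> NULL (no digits)
-- '+7 (927) 410-00-22'         -> 79274100022
```

Rows with no digits come back as NULL, so a filter such as LEN(ca.DigitsOnly) IN (10, 11) would drop them along with anything that is not 10 or 11 digits long.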

create sql view from comma separated values

T-sql question:
I need help to build a join from 2 tables, where on one of the tables I have aggregated data (comma separated values).
I have a table - Users where I have 3 columns: UserId, DefaultLanguage and OtherLanguages.
The table looks like this:
UserId | DefaultLanguage | OtherLanguages
---------------------------------------------
1 | en | NULL
2 | en | it, fr
3 | fr | en, it
4 | en | sp
and so on.
I have another table where I have the association between language code (en, fr, ro, it, sp) and language name:
LangCode | LanguageName
-------------------------
en | English
fr | French
it | Italian
sp | Spanish
and so on.
I want to create a view like this:
UserId | DefaultLanguage | OtherLanguages
---------------------------------------------
1 | English | NULL
2 | English | Italian, French
3 | French | English, Italian
4 | English | Spanish
and so on.
In short, I need a view where the language code is replaced by language name.
Any help, please?
There are several solutions. Of course, you could also recreate the tables and change the data structure.
1. If all the language codes are 2 characters long:
select t1.UserId, t2.LanguageName,
ISNULL( t3.LanguageName, '') + ISNULL(', '+t4.LanguageName, '') + ISNULL( ', '+t5.LanguageName, '') OtherLanguages
from Table1 t1
inner join Table2 t2 on t1.DefaultLanguage = t2.LangCode
left join Table2 t3 on Left(t1.OtherLanguages,2) = t3.LangCode
left join Table2 t4 on CASE WHEN len(Replace(t1.OtherLanguages, ' ', '')) > 3 THEN
SUBSTRING( Replace(t1.OtherLanguages, ' ', ''), 4, 2) ELSE null END = t4.LangCode
left join Table2 t5 on CASE WHEN len(Replace(t1.OtherLanguages, ' ', '')) > 6 THEN
SUBSTRING( Replace(t1.OtherLanguages, ' ', ''), 7, 2) ELSE null END = t5.LangCode
2. Use a user-defined function:
CREATE FUNCTION [dbo].[func_GetLanguageName] (@pLanguageList varchar(max))
RETURNS varchar(max) AS
BEGIN
    Declare @aLanguageList varchar(max) = @pLanguageList
    Declare @aLangCode varchar(max) = null
    Declare @aReturnName varchar(max) = null
    WHILE LEN(@aLanguageList) > 0
    BEGIN
        IF PATINDEX('%,%',@aLanguageList) > 0
        BEGIN
            SET @aLangCode = RTRIM(LTRIM(SUBSTRING(@aLanguageList, 0, PATINDEX('%,%',@aLanguageList))))
            SET @aLanguageList = LTRIM(SUBSTRING(@aLanguageList, LEN(@aLangCode + ',') + 1,LEN(@aLanguageList)))
        END
        ELSE
        BEGIN
            SET @aLangCode = @aLanguageList
            SET @aLanguageList = NULL
        END
        Select @aReturnName = ISNULL( @aReturnName + ', ' , '') + LanguageName from Table2 where LangCode=@aLangCode
    END
    RETURN(@aReturnName)
END
and use select
select UserId, dbo.func_GetLanguageName(DefaultLanguage) DefaultLanguage, dbo.func_GetLanguageName(OtherLanguages) OtherLanguages from Table1
Best practice would dictate not to have this type of comma-delimited data in a column...
Since you stated in comments that the schema cannot be changed, the next best thing is a function. This can be used in-line in a select query.
SQL is notoriously slow with string manipulation. Here is an interesting article on the topic. There are many SQL "string split" functions out there. They all generally split a comma-delimited string and return a table.
For this specific use case, you actually need a scalar-valued function (a function which returns one value) rather than a table-valued function (one which returns a table of values).
Below is such a function, modified to return a scalar value in place of the original comma-delimited string of language codes.
The comments explain what is happening line by line.
The gist is that you loop through the input string keeping track of the last comma location, extract each code, look up the full language in the languages table, and then return the output as a comma-delimited string.
Language codes to languages function:
Create Function [dbo].fn_languageCodeToFull
( @Input Varchar(100) )
Returns Varchar(1000)
As
Begin
    -- To address null input, based on the example you provided, we set the output to NULL if there is no input
    If @Input = '' Or @Input Is Null
        Return Null
    Declare
        @CodeLength int, -- constant for code length to avoid hardcoded "magic numbers"
        @Output varchar(1000), -- will contain the final comma delimited string of full languages
        @LastIndex int, -- tracks the location of the input we are searching as we loop over the string
        @CurrentCode varchar(2), -- for code readability, we extract each language code to this variable
        @CurrentLanguage varchar(50), -- for code readability, we store the full language in this variable
        @IndexIncrement int -- constant to increment the search index by 1 at each iteration
                            -- ensuring the loop moves forward
    Set @LastIndex = 0 -- seed the index, so we begin to search at 0 index
    Set @CodeLength = 2 -- ISO language codes are always 2 characters in length
    Set @Output = '' -- seed with empty string to avoid NULL when concatenating
    Set @IndexIncrement = 1 -- again avoiding hardcoded values...
    -- We will loop until we have gone to or beyond the length of the input string
    While @LastIndex < len(@Input)
    Begin
        -- Set the index of each comma (charindex is 1-based)
        Set @LastIndex = CHARINDEX(',', @Input, @LastIndex)
        -- When we get to the last item, CharIndex will return 0 when it does not find a comma.
        -- To pull the last item, we will artificially set @LastIndex to be 1 greater than the input string
        -- This will allow the code following this line to be unaltered for this scenario
        If @LastIndex = 0 set @LastIndex = len(@Input) + 1 -- account for 1-based index of substring
        -- Extract the code prior to the current comma that charindex has identified
        Set @CurrentCode = substring(@Input, @LastIndex - @CodeLength, @CodeLength)
        -- Do a lookup to get the language for the current code
        Set @CurrentLanguage = (Select LanguageName From languages Where code = @CurrentCode)
        -- Only add comma after first language to ensure no extra comma will be present in Output
        If @LastIndex > 3 Set @Output = @Output + ','
        -- Here we build the Output string with the language
        Set @Output = @Output + @CurrentLanguage
        -- Finally, we increment @LastIndex by 1 to avoid loop on first instance of comma
        Set @LastIndex = @LastIndex + @IndexIncrement
    End
    Return @Output
End
Then your view would simply do something like:
Sample view using the function:
Create View vw_UserLanguages
As
Select
    UserId,
    dbo.fn_languageCodeToFull(DefaultLanguage) as DefaultLanguage,
    dbo.fn_languageCodeToFull(OtherLanguages) as OtherLanguages -- no trailing comma before From
From UserLanguageCodes -- you do not provide a name so I made one up
Note that the function will work whether there are commas or not, so there is no need to join the Languages table here as you can just have the function do all the work in this case.
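If the row-by-row scalar function turns out to be too slow, the same mapping can also be written set-based. The following is only a sketch under stated assumptions, not part of the answer above: the table variables @Users and @Languages stand in for the real tables, the codes are assumed to contain no XML-special characters (such as & or <), and the original order is recovered with CHARINDEX, which relies on every code being two characters long.

```sql
-- Table variables stand in for the real tables (names are assumptions).
DECLARE @Users TABLE (UserId INT, DefaultLanguage VARCHAR(2), OtherLanguages VARCHAR(100));
DECLARE @Languages TABLE (LangCode VARCHAR(2), LanguageName VARCHAR(50));

INSERT @Users VALUES (1,'en',NULL), (2,'en','it, fr'), (3,'fr','en, it'), (4,'en','sp');
INSERT @Languages VALUES ('en','English'), ('fr','French'), ('it','Italian'), ('sp','Spanish');

SELECT u.UserId,
       DefaultLanguage = d.LanguageName,
       OtherLanguages  = STUFF((
           SELECT ', ' + l.LanguageName
           FROM ( -- turn 'it, fr' into <c>it</c><c> fr</c> so it can be shredded
                  SELECT x = CAST('<c>' + REPLACE(u.OtherLanguages, ',', '</c><c>') + '</c>' AS XML)
                ) s
           CROSS APPLY s.x.nodes('/c') AS n(c)
           JOIN @Languages l
             ON l.LangCode = LTRIM(RTRIM(n.c.value('.', 'VARCHAR(10)')))
           ORDER BY CHARINDEX(l.LangCode, u.OtherLanguages)  -- keep the original order
           FOR XML PATH('')
       ), 1, 2, '')                                          -- strip the leading ', '
FROM @Users u
JOIN @Languages d ON d.LangCode = u.DefaultLanguage
ORDER BY u.UserId;
-- 1 | English | NULL
-- 2 | English | Italian, French
-- 3 | French  | English, Italian
-- 4 | English | Spanish
```

The SELECT could be used as the body of the view once the table variables are replaced by the real tables.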
One quick and dirty solution would be to use nested REPLACE calls, but that can produce a complex, long-winded statement, especially if you have more than five languages.
As an example:
SELECT [UserId],[DefaultLanguage],
CASE
WHEN [OtherLanguages] IS NULL THEN ''
ELSE REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE([OtherLanguages],
'en','English'),
'fr','French'),
'it','Italian'),
'ro','Romulan'), --Probably not the intended language ;-)
'sp','Spanish')
END as [OtherLanguages]
FROM YourTable
Personally, I'd create a scalar function, again using the REPLACE command, but you can then check the number of languages present and add a counter so that you're not doing unnecessary lookups.
SELECT [UserId],[DefaultLanguage],
CASE
WHEN [OtherLanguages] IS NULL THEN ''
WHEN [OtherLanguages] = '' THEN ''
ELSE dbo.do_function_name([OtherLanguages]) -- scalar functions must be called with their schema name
END as [OtherLanguages]
FROM YourTable
It might not be good practice, but there are times when it is more efficient to store multiple values in a single field. Just accept that when you do, it will slow down the way you handle that data.

Update Where Values don't match Regex format in management studio 2012

I need to validate UK phone numbers, essentially removing any which are invalid. I'd love to validate them at the front end but that isn't possible sadly.
Having spent a long time looking, I found a wonderful answer here: Validate a UK phone number
providing a possible regex of: /^(?0( *\d)?){9,10}$/
and then from this question (and others): Validate telephone number in SQL Server 2000
the suggestion of using xp_pcre (http://www.codeproject.com/Articles/4733/xp-pcre-Regular-Expressions-in-T-SQL) to enable and use the regex; however, this isn't compatible with 64-bit.
So my question is how do I do an update statement on fields where the values don't match a regex format.
Here's one solution using standard SQL functionality (i.e. without CLR functions):
create function dbo.PhoneNoIsValid
(@number nvarchar(20))
returns bit
begin
    --use an innocent until proven guilty approach
    --once proven guilty, skip further checks by adding
    --`if @isValid = 1 and` before further checks
    declare @isValid bit = 1
    --no strict rules around spaces; they are allowed but
    --don't add anything
    --by removing them we simplify the patterns we need to check
    set @number = REPLACE(@number,' ','')
    --aside from spaces, only numbers, brackets, and the plus
    --sign are valid chars
    --(LIKE has no default escape character, so no backslashes are needed in the set)
    if @number like '%[^+()0-9]%'
        set @isValid = 0
    --min length of a valid phone number is 11 chars
    if @isValid = 1 and LEN(@number) < 11
        set @isValid = 0
    --the area code (minus leading zero or similar) plus the
    --local code are only numbers (and spaces; removed earlier)
    --so we can check for invalid chars.
    if @isValid = 1 and SUBSTRING(@number,LEN(@number)-9,10) like '%[^0-9]%'
        set @isValid = 0
    --now we've validated the last bit, remove it so we can
    --focus on the first bit
    if @isValid = 1
        set @number = SUBSTRING(@number,1, LEN(@number)-10)
    --given we're using a UK number there are limited options;
    --so simplest to just enumerate these and check against
    --each valid option
    if @number not in ('0','0044','+44','+44(0)','0044(0)')
        set @isValid = 0
    --that's all the checks I can think of; at this stage the
    --number's valid or has been proven invalid.
    return (@isValid)
end
Example Usage:
declare @sampleData table
(
    phoneNo nvarchar(20)
    , isValid bit default(1)
)
insert @sampleData
(phoneNo)
values ('0044 1234 567890')
, ('+44 1234 567891')
, ('+44 (0)1234 567892')
, ('0044 (0)1234 567892')
, ('01234 567893')
, ('00441234567890')
, ('+441234567891')
, ('+44(0)1234567892')
, ('0044(0)1234567892')
, ('01234567893')
insert @sampleData
(isValid, phoneNo)
values (0,'0044 1234 56780')
, (0,'+44 1234 56781')
, (0,'+44 (0)1234 56782')
, (0,'044 (0)1234 567893')
, (0,'1234 567894')
, (0,'234567895')
, (0,'0044123456786')
, (0,'+44123456787')
, (0,'+44(0)123456788')
, (0,'044(0)1234567899')
, (0,'1234567810')
, (0,'234567811')
--select * from @sampleData
--demo
select *
from @sampleData
where dbo.PhoneNoIsValid(phoneNo) != isValid --show where I've got something wrong
--update statement
update @sampleData
set phoneNo = ''
where dbo.PhoneNoIsValid(phoneNo)= 0
select isValid, COUNT(1) from @sampleData group by isValid order by isValid
select isValid, COUNT(1) from @sampleData where phoneNo = '' group by isValid order by isValid
NB: I've assumed that when you say "valid UK phone number" you mean a phone number that's valid for a phone in the UK; as opposed to a number that's valid to call from the UK (i.e. this would show US phone numbers as invalid).
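If calling a scalar function once per row proves slow on a large table, the same rules can be folded into a single set-based UPDATE. This is only a sketch of that idea, with a made-up table variable @phones standing in for the real table. The CASE is used so the length check is evaluated before LEFT is asked for a negative length.

```sql
-- Hypothetical sample rows; the column name phoneNo mirrors the example above.
DECLARE @phones TABLE (phoneNo NVARCHAR(20));
INSERT @phones (phoneNo)
VALUES ('+44 (0)1234 567892'), ('01234567893'), ('1234 567894'), ('+44 1234 56781');

UPDATE p
SET    phoneNo = ''
FROM   @phones p
CROSS APPLY (SELECT n = REPLACE(p.phoneNo, ' ', '')) c        -- spaces carry no meaning
WHERE  CASE
         WHEN LEN(c.n) < 11                  THEN 0           -- prefix plus 10 digits at minimum
         WHEN c.n LIKE '%[^+()0-9]%'         THEN 0           -- only + ( ) and digits allowed
         WHEN RIGHT(c.n, 10) LIKE '%[^0-9]%' THEN 0           -- last 10 characters must be digits
         WHEN LEFT(c.n, LEN(c.n) - 10) IN ('0','0044','+44','+44(0)','0044(0)') THEN 1
         ELSE 0                                               -- unknown prefix (or NULL input)
       END = 0;

SELECT phoneNo FROM @phones;
-- '+44 (0)1234 567892' and '01234567893' are kept;
-- '1234 567894' and '+44 1234 56781' are blanked.
```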