Remove all non-numeric characters in a SQL SELECT

I want to remove all non-numeric characters when I call the query in SQL.
I have a function, and inside it I do this:
Declare @KeepValues as varchar(50)
Set @KeepValues = '%[^0-9]%'
While PatIndex(@KeepValues, @Temp) > 0
Set @Temp = Stuff(@Temp, PatIndex(@KeepValues, @Temp), 1, '')
But now I want to do it in a query (SELECT).
I tried this, but it doesn't work:
select substring(AdrTelefon1, PatIndex('%[^0-9]%', AdrTelefon1), 2000) from test
EDIT
I have it!
Select query to remove non-numeric characters
It does not work correctly
SELECT LEFT(SUBSTRING(AdrTelefon1, PATINDEX('%[0-9]%', AdrTelefon1), 8000),
PATINDEX('%[^0-9]%', SUBSTRING(AdrTelefon1, PATINDEX('%[0-9]%', AdrTelefon1), 8000) + 'X') -1) from test
I have 04532/97 and after this query I get 04532, but I need 0453297.

Some time ago I solved that problem using the below function
create function dbo.[fnrReplacetor](@strtext varchar(2000))
returns varchar(2000)
as
begin
declare @i int = 32, @rplc varchar(1) = '';
while @i < 256
begin
if (@i < 48 or @i > 57) and CHARINDEX(char(@i),@strtext) > 0
begin
--° #176 ~ 0 --¹ #185 ~ 1 --² #178 ~ 2 --³ #179 ~ 3
set @rplc = case @i
when 176 then '0'
when 185 then '1'
when 178 then '2'
when 179 then '3'
else '' end;
set @strtext = REPLACE(@strtext,CHAR(@i),@rplc);
end
set @i = @i + 1;
end
return @strtext;
end
GO
select dbo.[fnrReplacetor]('12345/97')
Note that it will also treat the characters °, ¹, ², ³ as numeric and replace them with 0, 1, 2, 3.
I put it in a function so I could readily reuse it; in my scenario I needed to fix many columns in many tables at once.
update t
set t.myColumn = dbo.[fnrReplacetor](t.myColumn)
from test t
where t.myColumn is not null
or just
select dbo.[fnrReplacetor](t.myColumn) as [Only Digits]
from test t
where t.myColumn is not null
Obs: this is not the fastest way, but it is a thorough one.
Edit
A non UDF solution must be use REPLACE but since regex is not that great in SQL you can end doing something nasty like the below example:
declare #test as table (myColumn varchar(50))
insert into #test values ('123/45'),('123-4.5')
Select replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(myColumn,'a',''),'b',''),'c',''),'d',''),'e',''),'f',''),'g',''),'h',''),'i',''),'j',''),'k',''),'l',''),'m',''),'n',''),'o',''),'p',''),'q',''),'r',''),'s',''),'t',''),'u',''),'v',''),'w',''),'x',''),'y',''),'z',''),'A',''),'B',''),'C',''),'D',''),'E',''),'F',''),'G',''),'H',''),'I',''),'J',''),'K',''),'L',''),'M',''),'N',''),'O',''),'P',''),'Q',''),'R',''),'S',''),'T',''),'U',''),'V',''),'W',''),'X',''),'Y',''),'Z',''),'.',''),'-',''),'/','')
from #test
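On SQL Server 2017 or later, the nested REPLACE calls can usually be collapsed with the built-in TRANSLATE function. The following is only a sketch, and it assumes you know the set of unwanted characters up front (here `.`, space, `/`, `-`, `(` and `)`): TRANSLATE maps each of them to one junk character, and a single REPLACE then removes that character. The table variable and sample rows mirror the illustrative ones above.

```sql
-- Sketch only: TRANSLATE requires SQL Server 2017 or later.
DECLARE @test TABLE (myColumn varchar(50));
INSERT INTO @test VALUES ('123/45'), ('123-4.5');

-- The 2nd and 3rd arguments of TRANSLATE must have the same length (6 here).
-- Every unwanted character is mapped to '/', which REPLACE then strips out.
SELECT myColumn,
       REPLACE(TRANSLATE(myColumn, '. /-()', '//////'), '/', '') AS OnlyDigits
FROM @test;
-- Both sample rows come back as 12345.
```

Unlike the function-based answers, this only removes the characters you list, so it suits columns whose noise is predictable (phone separators, for example).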

@Emma W.
I agree with the others... you actually should use a function for this. Here's a very high-performance function whose logic will work on 2008 and above (the CREATE OR ALTER and DROP TABLE IF EXISTS statements shown need SQL Server 2016 SP1 or later; on older versions use plain CREATE FUNCTION and an OBJECT_ID check). It includes full documentation and usage examples.
As a bit of a sidebar, any function that contains the word BEGIN is either a slow, performance-hogging scalar function or an mTVF (multi-statement Table Valued Function). Most savvy DBAs won't allow either, but they may not know the difference between those two and an iTVF (inline Table Valued Function), like the one below.
CREATE OR ALTER FUNCTION [dbo].[DigitsOnly]
/**********************************************************************************************************************
Purpose:
Given a VARCHAR(8000) or less string, return only the numeric digits from the string.
Programmer's Notes:
1. This is an iTVF (Inline Table Valued Function) that will be used as an iSF (Inline Scalar Function) in that it
returns a single value in the returned table and should normally be used in the FROM clause as with any other iTVF.
2. The main performance enhancement is using a WHERE clause calculation to prevent the relatively expensive XML PATH
concatenation of empty strings normally determined by a CASE statement in the XML "loop".
3. Another performance enhancement is not making this function a generic function that could handle a pattern. That
allows us to use all integer math to do the comparison using the high speed ASCII function to convert characters to
their numeric equivalent. ASCII characters 48 through 57 are the digit characters of 0 through 9 in most languages.
4. Last but not least, added another of Eirikur's later optimizations using 0x7FFF which he says is a "simple trick to
shift all the negative values to the top of the range so a single operator can be applied, which is a lot less
expensive than using between".
-----------------------------------------------------------------------------------------------------------------------
Kudos:
1. Hats off to Eirikur Eiriksson for the ASCII conversion idea and for the reminders that dedicated functions will
always be faster than generic functions and that integer math beats the tar out of character comparisons that use
LIKE or PATINDEX.
2. Hats off to all of the good people that submitted and tested their code on the following thread. It's this type of
participation and interest that makes code better. You've just gotta love this community.
http://www.sqlservercentral.com/Forums/Topic1585850-391-2.aspx#bm1629360
-----------------------------------------------------------------------------------------------------------------------
Usage Example:
--===== CROSS APPLY example
SELECT ca.DigitsOnly
FROM dbo.SomeTable st
CROSS APPLY dbo.DigitsOnly(st.SomeVarcharCol) ca
;
-----------------------------------------------------------------------------------------------------------------------
Test Harness:
--===== Create the 1 Million row test table
DROP TABLE IF EXISTS #TestTable
;
SELECT TOP 1000000
Txt = ISNULL(CONVERT(VARCHAR(36),NEWID()),'')
INTO #TestTable
FROM sys.all_columns ac1
CROSS JOIN sys.all_columns ac2
;
ALTER TABLE #TestTable
ADD PRIMARY KEY CLUSTERED (Txt)
;
GO
--===== CROSS APPLY example.
-- This takes ~ 1 second to execute.
DROP TABLE IF EXISTS #Results;
SELECT tt.Txt, ca.DigitsOnly
INTO #Results
FROM #TestTable tt
CROSS APPLY dbo.DigitsOnly(Txt) ca
;
GO
--===== Return the results for manual verification.
SELECT * FROM #Results
;
-----------------------------------------------------------------------------------------------------------------------
Revision History:
Rev 00 - 28 Oct 2014 - Eirikur Eiriksson
- Initial creation and unit/performance tests.
Rev 01 - 29 Oct 2014 - Jeff Moden
- Performance enhancement and unit/performance tests.
Rev 02 - 30 Oct 2014 - Eirikur Eiriksson
- Additional Performance enhancement
Rev 03 - 01 Sep 2014 - Jeff Moden
- Formalize the code and add the documentation that appears in the flower box of this code.
***********************************************************************************************************************/
--======= Declare the I/O for this function
(@pString VARCHAR(8000))
RETURNS TABLE WITH SCHEMABINDING AS
RETURN WITH
E1(N) AS (SELECT N FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) AS E0(N))
,Tally(N) AS (SELECT TOP (LEN(@pString)) (ROW_NUMBER() OVER (ORDER BY (SELECT 1))) FROM E1 a,E1 b,E1 c,E1 d)
SELECT DigitsOnly =
(
SELECT SUBSTRING(@pString,N,1)
FROM Tally
WHERE ((ASCII(SUBSTRING(@pString,N,1)) - 48) & 0x7FFF) < 10
FOR XML PATH('')
)
;
GO
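To see why the single `< 10` comparison in the WHERE clause is enough, it can help to run the same expression on a few characters around the digit range. This is just an illustrative query, not part of the function: characters below '0' give a negative offset, and masking with 0x7FFF turns that into a large positive number, so only 0 through 9 pass the test.

```sql
-- Illustration of the 0x7FFF masking trick described in note 4 of the header.
SELECT v.c                                   AS Chr,
       ASCII(v.c) - 48                       AS AsciiMinus48,
       (ASCII(v.c) - 48) & 0x7FFF            AS Masked,
       CASE WHEN ((ASCII(v.c) - 48) & 0x7FFF) < 10
            THEN 1 ELSE 0 END                AS IsDigit
FROM (VALUES ('/'), ('0'), ('9'), (':')) AS v(c);
-- '/' : AsciiMinus48 = -1, Masked = 32767, IsDigit = 0
-- '0' : AsciiMinus48 =  0, Masked = 0,     IsDigit = 1
-- '9' : AsciiMinus48 =  9, Masked = 9,     IsDigit = 1
-- ':' : AsciiMinus48 = 10, Masked = 10,    IsDigit = 0
```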
If you're really up against a wall and cannot use a function of any type because of "Rules" that have no exceptions (a really bad idea), then post back and we can show you how to convert it into inline code with a little help from you.
Whatever you do, don't use a WHILE loop for this task... it'll kill you, performance- and resource-usage-wise.
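If a function really is off the table, the same tally-table idea can be written inline in the SELECT. The following is a minimal sketch rather than the tuned code above: it assumes the `test` table and `AdrTelefon1` column from the question (mocked here with a table variable), builds only a 100-row tally, and swaps the ASCII arithmetic for a plain `LIKE '[0-9]'` test for readability.

```sql
-- Mock of the question's table; the sample values are illustrative.
DECLARE @test TABLE (AdrTelefon1 varchar(50));
INSERT INTO @test VALUES ('04532/97'), ('+49 (0) 4532-97'), ('none');

WITH E1(N) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) AS v(N)),
     Tally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
                  FROM E1 a CROSS JOIN E1 b)          -- numbers 1..100
SELECT t.AdrTelefon1,
       DigitsOnly = (SELECT SUBSTRING(t.AdrTelefon1, ty.N, 1)
                     FROM Tally ty
                     WHERE ty.N <= LEN(t.AdrTelefon1)
                       AND SUBSTRING(t.AdrTelefon1, ty.N, 1) LIKE '[0-9]'
                     ORDER BY ty.N                    -- keep the digits in string order
                     FOR XML PATH(''))
FROM @test t;
-- '04532/97' gives 0453297; a value with no digits gives NULL.
```

For columns longer than 100 characters, add another `CROSS JOIN E1` to the tally.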

Related

Regex replace or LISTAGG on SQL server

I need to "translate" this statement to SQL Server:
regexp_replace(Main.LOCK, '\/\/.*', '') TARGET
I need to strip those characters because, one step earlier, I build the string with this:
LISTAGG(stock.LOCATION_NO, '//') WITHIN GROUP (ORDER BY isnull(QTY_OH,0)+isnull(QTY_TR,0) - isnull(QTY_RS, 0) desc) LOCK
Neither regexp_replace nor LISTAGG can be used in SQL Server.
What I'm trying to do (and it worked very well in Oracle) is to get the TARGET value: the entry in Main.LOCK with the MAXIMUM value of
isnull(QTY_OH,0)+isnull(QTY_TR,0) - isnull(QTY_RS, 0)
I can't translate it properly to SQL Server.
The errors I get are:
Msg 195, Level 15, State 10, Line 12
'regexp_replace' is not a recognized built-in function name.
Msg 10757, Level 15, State 1, Line 49
The function 'LISTAGG' may not have a WITHIN GROUP clause.
Can anyone help here?
SQL Server ver 18.8
Warehouse Ver 13.0
Microsoft SQL Server 2016 (SP2-GDR) (KB4583460) - 13.0.5103.6 (X64)
Nov 1 2020 00:13:28
Copyright (c) Microsoft Corporation
Standard Edition (64-bit) on Windows Server 2016 Standard 10.0 (Build 14393: ) (Hypervisor)
The regexp_replace() call is doing something pretty simple: it takes the portion of the string before '//', if one is present.
In more recent versions of SQL Server (2017 and later), you can use string_agg() and left():
string_agg(left(main.lock,
charindex('//', main.lock + '//')- 1
), '//'
) within group (order by coalesce(qty_oh, 0) + coalesce(qty_tr, 0) - coalesce(qty_rs, 0) desc)
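Since the goal described in the question is only the single location with the highest quantity, the aggregate-then-trim step may not be needed at all on SQL Server 2016, which lacks string_agg(). A hedged sketch using TOP (1) inside CROSS APPLY follows; the table variable, the PART_NO grouping column and the sample rows are assumptions for illustration, because the question does not show the full query.

```sql
-- Hypothetical stand-in for the stock table; PART_NO is an assumed grouping key.
DECLARE @stock TABLE (PART_NO varchar(20), LOCATION_NO varchar(20),
                      QTY_OH int, QTY_TR int, QTY_RS int);
INSERT INTO @stock VALUES
    ('P1', 'A-01', 5,    NULL, 1),     -- net quantity 4
    ('P1', 'B-07', 9,    2,    0),     -- net quantity 11, highest for P1
    ('P2', 'C-03', NULL, 3,    NULL);  -- net quantity 3, only row for P2

SELECT p.PART_NO, best.LOCATION_NO AS TARGET
FROM (SELECT DISTINCT PART_NO FROM @stock) AS p
CROSS APPLY
(
    -- Pick the one location with the largest net quantity for this part.
    SELECT TOP (1) s.LOCATION_NO
    FROM @stock AS s
    WHERE s.PART_NO = p.PART_NO
    ORDER BY COALESCE(s.QTY_OH, 0) + COALESCE(s.QTY_TR, 0) - COALESCE(s.QTY_RS, 0) DESC
) AS best;
-- P1 -> B-07, P2 -> C-03
```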
SQL Server has no built-in Regex functions but they are available via CLR. The good news is you don't need Regex in SQL Server. Everything I used to do with RegEx I now handle using NGrams8k. It's easy and performs much better. I've built a few functions using NGrams8K that would be helpful for this problem and many others. First we have PatReplace8K, second is Translate8K (updated code for both below.) A third option is PatExtract8K (follow the link for the code).
Examples of each performing the text transformation. With each function I'm just removing the Alpha characters and the numbers from 0-5 from "SomeString":
--==== Sample Data
DECLARE @table TABLE (SomeId INT IDENTITY, SomeString VARCHAR(40));
INSERT @table(SomeString) VALUES(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID());
--==== Using Patreplace8k
SELECT t.SomeString, f.NewString
FROM @table AS t
CROSS APPLY samd.patReplace8K(t.SomeString,'[0-5A-F-]','') AS f
--==== Using Translate8K
SELECT t.SomeString, f.NewString
FROM @table AS t
CROSS APPLY samd.Translate8K(t.SomeString,'[012345ABCDEF-]','') AS f
--==== samd.patExtract8K
SELECT t.SomeString,
NewString = STRING_AGG(f.item,'') WITHIN GROUP (ORDER BY f.ItemNumber)
FROM @table AS t
CROSS APPLY samd.patExtract8K(t.SomeString,'[0-5A-F-]') AS f
GROUP BY t.SomeString;
Each Return:
SomeString NewString
---------------------------------------- -----------------
0818BEF3-E0B3-4B3B-AA97-649E43EB16AF 8897696
3077EE8B-9E92-4337-9E2F-97DABE2E4623 7789979976
6BCD8194-F993-42DB-AF4A-D8289F8F8DA3 6899988988
C1F152DF-8B6F-4C14-AF6F-AC8869099FDB 866886999
F877D888-245E-4CEB-84B7-1CFF6E03B974 87788887697
To perform your string aggregation you can use XML PATH() or STRING_AGG. Here's an example using both techniques and PatReplace8k:
SELECT NewString = STUFF((
SELECT '//'+NewString
FROM @table AS t
CROSS APPLY samd.patReplace8K(t.SomeString,'[0-5A-F-]','') AS f
ORDER BY f.NewString
FOR XML PATH('')),1,2,'');
SELECT STRING_AGG(f.NewString,'//') WITHIN GROUP (ORDER BY f.NewString)
FROM @table AS t
CROSS APPLY samd.patReplace8K(t.SomeString,'[0-5A-F-]','') AS f;
In each case I get what I want:
NewString
----------------------------------------------------
6967899797//777689886//868796//8887789//88989
Translate Function:
CREATE OR ALTER FUNCTION samd.Translate8K
(
@string VARCHAR(8000), -- Input
@pattern VARCHAR(100), -- characters to replace
@key VARCHAR(100) -- replacement characters
)
/*
Purpose:
Standard Translate function - the fastest UDF version in the game. Enjoy.
For more about TRANSLATE see: https://www.w3schools.com/sql/func_sqlserver_translate.asp
Requires:
NGrams8K; get you some here:
https://www.sqlservercentral.com/articles/nasty-fast-n-grams-part-1-character-level-unigrams
Designed By Alan Burstein; May, 2021
*/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT NewString = ISNULL(REPLACE(tx.NewString,CHAR(0),''),@string)
FROM
(
SELECT STRING_AGG(CAST(t.tKey+ng.Token AS CHAR(1)),'')
WITHIN GROUP (ORDER BY ng.Position)
FROM samd.ngrams8k(@string,1) AS ng
CROSS APPLY (VALUES(@key+REPLICATE(CHAR(0),100))) AS tx(NewKey)
CROSS APPLY (VALUES(CHARINDEX(ng.Token,@pattern))) AS pos(N)
CROSS APPLY (VALUES(SUBSTRING(tx.NewKey,pos.N,1))) AS t(tKey)
) AS tx(NewString);
Patreplace8k:
CREATE OR ALTER FUNCTION [samd].[patReplace8K]
(
@string VARCHAR(8000),
@pattern VARCHAR(50),
@replace VARCHAR(20)
)
/*****************************************************************************************
[Purpose]:
Given a string (@string), a pattern (@pattern), and a replacement character (@replace)
patReplace8K will replace any character in @string that matches the @Pattern parameter
with the character, @replace.
[Author]:
Alan Burstein
[Compatibility]:
SQL Server 2008+
[Syntax]:
--===== Basic Syntax Example
SELECT pr.NewString
FROM samd.patReplace8K(@String,@Pattern,@Replace) AS pr;
[Developer Notes]:
1. @Pattern IS case sensitive but can be easily modified to make it case insensitive
2. There is no need to include the "%" before and/or after your pattern since we
are evaluating each character individually
3. Certain special characters, such as "$" and "%" need to be escaped with a "/"
like so: [/$/%]
4. Functions that use samd.ngrams8k will see huge performance gains when the optimizer
generates a parallel execution plan. One way to get a parallel query plan (if the
optimizer does not choose one) is to use make_parallel by Adam Machanic found here:
sqlblog.com/blogs/adam_machanic/archive/2013/07/11/next-level-parallel-plan-porcing.aspx
As is the case with functions which leverage samd.ngrams or samd.ngrams8k,
samd.patReplace8K is almost always dramatically faster with a parallel execution
plan. On my PC (8 logical CPU, 64GB RAM, SQL 2019) samd.patReplace8K is about 4X
faster when executed using all 8 of my logical CPUs.
5. samd.patReplace8K is deterministic. For more about deterministic functions see:
https://msdn.microsoft.com/en-us/library/ms178091.aspx
[Examples]:
--===== 1. Replace numeric characters with a "*"
SELECT pr.NewString
FROM samd.patReplace8K('My phone number is 555-2211','[0-9]','*') AS pr;
--==== 2. Using against a table
DECLARE @table TABLE(OldString varchar(60));
INSERT @table VALUES ('Call me at 555-222-6666'), ('phone number: (312)555-2323'),
('He can be reached at 444.665.4466 on Monday.');
SELECT t.OldString, pr.NewString
FROM @table AS t
CROSS APPLY samd.patReplace8K(t.oldstring,'[0-9]','*') AS pr;
[Revision History]:
-----------------------------------------------------------------------------------------
Rev 00 - 20141027 Initial Development - Alan Burstein
Rev 01 - 20141029 - Redesigned based on the dbo.STRIP_NUM_EE by Eirikur Eiriksson
(see: http://www.sqlservercentral.com/Forums/Topic1585850-391-2.aspx)
- change how the cte tally table is created
- put the include/exclude logic in a CASE statement instead of a WHERE clause
- Added Latin1_General_BIN Collation
- Add code to use the pattern as a parameter. - Alan Burstein
Rev 02 - 20141106 - Added final performance enhancement (more kudos to Eirikur Eiriksson)
- Put 0 = PATINDEX filter logic into the WHERE clause
Rev 03 - 20150516 - Updated to deal with special XML characters - Alan Burstein
Rev 04 - 20170320 - changed @replace from char(1) to varchar(1) for whitespace handling
- Alan Burstein
Rev 05 - 20200515 - Complete rewrite using samd.NGrams
- changed PATINDEX(...)=0 to: PATINDEX()&0x01=0;
- Changed CASE statement to IIF; Dropped collation - Alan Burstein
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT NewString = STRING_AGG(IIF(PATINDEX(@pattern,col.Token)&0x01=0,col.Token,
@replace),'') WITHIN GROUP (ORDER BY ng.position)
FROM samd.NGrams8K(@string,1) AS ng
CROSS APPLY (VALUES(ng.token)) AS col(Token);

Is there any way to loop a string in SQL Server?

I am trying to loop over a varchar in SQL Server. One of the columns has the format
"F1 100 F2 400 F3 600"
What I need is to take the numbers and divide them by 10: "F1 10 F2 40 F3 60". For the moment I have a stored procedure which calls this function:
ALTER FUNCTION [name_offunction]
(@Chain varchar(120))
RETURNS varchar(120)
AS
BEGIN
DECLARE @Result varchar(120), @Pos int, @Concat varchar(120)
WHILE LEN(@Chain) > 0
BEGIN
SET @Pos = CHARINDEX(' ', @Chain)
SET @Result = CASE
WHEN SUBSTRING(@Chain, 1, @Pos-1) LIKE '%[^A-Z]%'
THEN SUBSTRING(@Chain, 1, @Pos-1)
WHEN SUBSTRING(@Chain, 1, @Pos-1) NOT LIKE '%[^A-Z]%'
THEN CAST(CAST(SUBSTRING(@Chain, 1, @Pos-1) / 10 AS INT)AS CHAR)
END
SET @Chain = REPLACE(@Chain, SUBSTRING(@Chain, 1, @Pos), '')
SET @Concat += @Result + ' '
END
RETURN @Concat
END
We seem to have two problems here. First, you want to loop in SQL, but SQL is a set-based language. This means that it performs well at set-based operations but poorly at iterative ones, such as a loop.
Second, you have what appears to be delimited data, and you want to change that delimited data in some way and then reconstruct it into a delimited string. Storing delimited data in a database is always a design flaw, and you should really be fixing said design.
I would therefore propose you move to an inline table-valued function instead of a scalar function.
First, as it appears that the ordinal position of the values is important, we can't use SQL Server's built-in STRING_SPLIT, as it is documented not to guarantee that the order of the values will be the same. I am therefore going to use DelimitedSplit8K_LEAD, which gives the ordinal position.
Then we can use TRY_CONVERT to check whether the value is an int (I assume this is the correct data type) and, if it is, divide it by 10. Finally, we can reconstruct the data using STRING_AGG.
Outside of a function this would look like this:
DECLARE @Chain varchar(120) = 'F1 100 F2 400 F3 600';
SELECT STRING_AGG(COALESCE(CONVERT(varchar(10),TRY_CONVERT(int,DS.item)/10),DS.item),' ') WITHIN GROUP (ORDER BY DS.ItemNumber)
FROM dbo.DelimitedSplit8K_LEAD(@Chain,' ') DS;
As a function, you could therefore do this:
CREATE FUNCTION dbo.YourFunction (@Chain varchar(120))
RETURNS TABLE AS
RETURN
SELECT STRING_AGG(COALESCE(CONVERT(varchar(10),TRY_CONVERT(int,DS.item)/10),DS.item),' ') WITHIN GROUP (ORDER BY DS.ItemNumber) AS NewChain
FROM dbo.DelimitedSplit8K_LEAD(@Chain,' ') DS;
GO
And call it as such:
SELECT YF.NewChain
FROM dbo.YourTable YT
CROSS APPLY dbo.YourFunction (YT.Chain) YF;
Note that STRING_AGG was introduced in SQL Server 2017; if you're using an older version (you don't say in the question), you'll need to use the "old" FOR XML PATH solution.
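For versions before SQL Server 2017, the same query can be sketched with FOR XML PATH in place of STRING_AGG. This assumes dbo.DelimitedSplit8K_LEAD is installed and returns ItemNumber and Item columns, as the community version does; TRY_CONVERT and the LEAD-based splitter still need SQL Server 2012 or later.

```sql
DECLARE @Chain varchar(120) = 'F1 100 F2 400 F3 600';

-- Rebuild the string token by token; STUFF removes the leading space.
SELECT STUFF((
           SELECT ' ' + COALESCE(CONVERT(varchar(10), TRY_CONVERT(int, DS.Item) / 10), DS.Item)
           FROM dbo.DelimitedSplit8K_LEAD(@Chain, ' ') AS DS
           ORDER BY DS.ItemNumber            -- keep the original token order
           FOR XML PATH(''), TYPE
       ).value('.', 'varchar(120)'), 1, 1, '') AS NewChain;
-- For the sample string this should give: F1 10 F2 40 F3 60
```

The `TYPE` directive with `.value()` avoids XML entity escaping if a token ever contains characters such as `&` or `<`.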

Extract phone number from noised string

I have a column in a table that contains random data along with phone numbers in different formats. The column may contain
Name
Phone
Email
HTML tags
Addresses (with numbers)
Examples:
1) Call back from +79005346546, Conversation started<br>Phone: +79005346546<br>Called twice Came from google.com<br>IP: 77.106.46.202 the web page address is xxx.com utm_medium: cpc<br>utm_campaign: 32587871<br>utm_content: 5283041 79005346546
2) John Smith
3) xxx@yyy.com
4) John Smith 8 999 888 77 77
How a phone number is written also varies. It may be like 8 927 410 00 22, 8(927)410-00-22, +7(927)410-00-22, +7 (927) 410-00-22, (927)410 00 22, 927 410 00 22, 9(2741) 0 0 0-22 and so on.
The common rule here is that the phone number format contains 10-11 digits.
My best guess is to use regular expressions: first remove email addresses from the string (since they can contain phone numbers, like 79990001122@gmail.com), then use a regular expression to extract the phone number, knowing it is 10 or 11 digits in a row delimited by characters like (, ), +, - and so on (I don't think anyone would use . as a phone digit delimiter, so we don't want to treat IP addresses like 77.106.46.202 in the first sample as phone numbers).
So the question is how to get phone numbers from these values.
The final values I want to get from the four examples above are:
1) 79005346546 79005346546 79005346546
2)
3)
4) 89998887777
The server is Microsoft SQL Server 2014 - 12.0.2000.8 (X64) Standard Edition (64-bit)
UPDATED (20200226)
There were a couple comments that a CLR/regex solution could be faster than the ngram8k solution I posted. I've heard this for six years but every single time, without exception, the test harness tells a different story. I already posted in the earlier comments instructions to get the Microsoft© MDQ family of CLR Regex running in just a few minutes. They were developed, tested and tuned by Microsoft and ship with Master Data Services/Data Quality Services. I've used them for years, they're good.
RegexReplace/RegexSplit vs PatExtract8k/DigitsOnlyEE: 1,000,000 rows
Obviously you don't want functions in your WHERE clause but, since my Regex is rusty AF, I needed to. To level the playing field I did the same with DigitsOnlyEE in the N-Gram solution's WHERE clause.
SET NOCOUNT ON;
DBCC FREEPROCCACHE WITH NO_INFOMSGS;
DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS;
SET STATISTICS TIME ON;
DECLARE
@newData BIT = 0,
@string VARCHAR(8000) = '1) Call back from +79005346546, Conversation started<br>Phone: +79005346546<br>Called twice Came from google.com<br>IP: 77.106.46.202 the web page address is xxx.com utm_medium: cpc<br>utm_campaign: 32587871<br>utm_content: 5283041 79005346546 ',
@pattern VARCHAR(50) = '[^0-9()+.-]',
@srchLen INT = 11;
IF @newData = 1
BEGIN
IF OBJECT_ID('tempdb..#strings','U') IS NOT NULL DROP TABLE #strings;
SELECT
StringId = IDENTITY(INT,1,1),
String = REPLICATE(@string,ABS(CHECKSUM(NEWID())%3)+1)
INTO #strings
FROM dbo.rangeAB(1,1000000,1,1) AS r;
END
PRINT CHAR(10)+'Regex/CLR version Serial'+CHAR(10)+REPLICATE('-',90);
SELECT regex.NewString
FROM #strings AS s
CROSS APPLY
(
SELECT STRING_AGG(clr.RegexReplace(f.Token,'[^0-9]','',0),' ')
FROM clr.RegexSplit(s.string,@pattern,N'[0-9()+.-]',0) AS f
WHERE f.IsValid = 1
AND LEN(clr.RegexReplace(f.Token,'[^0-9]','',0)) = @srchLen
) AS regex(NewString);
PRINT CHAR(10)+'NGrams version Serial'+CHAR(10)+REPLICATE('-',90);
SELECT ngramsStuff.NewString
FROM #strings AS s
CROSS APPLY
(
SELECT STRING_AGG(ee.digitsOnly,' ')
FROM samd.patExtract8K(@string,@pattern) AS pe
CROSS APPLY samd.digitsOnlyEE(pe.item) AS ee
WHERE LEN(ee.digitsOnly) = @srchLen
) AS ngramsStuff(NewString)
OPTION (MAXDOP 1);
SET STATISTICS TIME OFF;
GO
Test Results
Regex/CLR version Serial
------------------------------------------------------------------------------------------
SQL Server Execution Times: CPU time = 19918 ms, elapsed time = 12355 ms.
NGrams version Serial
------------------------------------------------------------------------------------------
SQL Server Execution Times: CPU time = 844 ms, elapsed time = 971 ms.
NGrams8k is very fast, does not require you to compile a new assembly, learn a new programming language, Enable CLR functions, etc... No issues with garbage collection. Even the CLR N-GRAMs function that ships with MDS/DQS can't touch NGrams8k for performance (see the comments under my article).
END OF UPDATE
First grab a copy of ngrams8k and use it to build PatExtract8k (DDL below at the bottom of this post.) Next a quick warm-up:
DECLARE
@string VARCHAR(8000) = 'Call me later at 222-3333 or tomorrow at 312.555.2222,
(313)555-6789, or at 1+800-555-4444 before noon. Thanks!',
@pattern VARCHAR(50) = '%[^0-9()+.-]%';
SELECT pe.itemNumber, pe.itemIndex, pe.itemLength, pe.item
FROM samd.patExtract8K(@string,@pattern) AS pe
WHERE pe.itemLength > 1;
Returns:
ItemNumber ItemIndex ItemLength Item
----------- ----------- ----------- ----------------
1 18 8 222-3333
2 42 12 312.555.2222
3 91 13 (313)555-6789
4 112 14 1+800-555-4444
Note that the function returns the matched pattern, position in the string, Item Length and the item. The first three attributes can be leveraged for further processing which brings us to your post. Note my comments:
-- First for some easily consumable sample data.
DECLARE @things TABLE (StringId INT IDENTITY, String VARCHAR(8000));
INSERT @things (String)
VALUES
('Call back from +79005346546, Conversation started<br>Phone: +79005346546<br>Called twice Came from google.com<br>IP: 77.106.46.202 the web page address is xxx.com utm_medium: cpc<br>utm_campaign: 32587871<br>utm_content: 5283041 79005346546 '),
('John Smith'),
('xxx@yyy.com'),
('John Smith 8 999 888 77 77');
DECLARE @SrchLen INT = 11;
SELECT
StringId = t.StringId,
ItemIndex = pe.itemIndex,
ItemLength = @SrchLen,
Item = i2.Item
FROM @things AS t
CROSS APPLY samd.patExtract8K(t.String,'[^0-9 ]') AS pe
CROSS APPLY (VALUES(PATINDEX('%'+REPLICATE('[0-9]',@SrchLen), pe.item))) AS i(Idx)
CROSS APPLY (VALUES(SUBSTRING(pe.Item,NULLIF(i.Idx,0),11))) AS ns(NewString)
CROSS APPLY (VALUES(ISNULL(ns.NewString, REPLACE(pe.item,' ','')))) AS i2(Item)
WHERE pe.itemLength >= @SrchLen;
Returns:
StringId ItemIndex ItemLength Item
----------- -------------------- ----------- -----------
1 17 11 79005346546
1 62 11 79005346546
1 221 11 79005346546
4 11 11 89998887777
Next we can handle outer rows like so and row-to-column concatenation like this:
WITH t AS
(
SELECT i2.Item, t.StringId
FROM @things AS t
CROSS APPLY samd.patExtract8K(t.String,'[^0-9 ]') AS pe
CROSS APPLY (VALUES(PATINDEX('%'+REPLICATE('[0-9]',@SrchLen), pe.item))) AS i(Idx)
CROSS APPLY (VALUES(SUBSTRING(pe.Item,NULLIF(i.Idx,0),11))) AS ns(NewString)
CROSS APPLY (VALUES(ISNULL(ns.NewString, REPLACE(pe.item,' ','')))) AS i2(Item)
WHERE pe.itemLength >= @SrchLen
)
SELECT
StringId = t2.StringId,
NewString = ISNULL((
SELECT t.item+' '
FROM t
WHERE t.StringId = t2.StringId
FOR XML PATH('')),'')
FROM @things AS t2
LEFT JOIN t AS t1 ON t2.StringId = t1.StringId
GROUP BY t2.StringId;
Returns:
StringId NewString
--------- --------------------------------------
1 79005346546 79005346546 79005346546
2
3
4 89998887777
I wish I had a little more time for additional details, but this took a little longer than planned. Any questions welcome.
Patextract:
CREATE FUNCTION samd.patExtract8K
(
@string VARCHAR(8000),
@pattern VARCHAR(50)
)
/*****************************************************************************************
[Description]:
This can be considered a T-SQL inline table valued function (iTVF) equivalent of
Microsoft's mdq.RegexExtract except that:
1. It includes each matching substring's position in the string
2. It accepts varchar(8000) instead of nvarchar(4000) for the input string, varchar(50)
instead of nvarchar(4000) for the pattern
3. The mask parameter is not required and therefore does not exist.
4. You have to specify what text we're searching for as an exclusion; e.g. for numeric
characters you should search for '[^0-9]' instead of '[0-9]'.
5. There is no parameter for naming a "capture group". Using the variable below, both
the following queries will return the same result:
DECLARE @string nvarchar(4000) = N'123 Main Street';
SELECT item FROM samd.patExtract8K(@string, '[^0-9]');
SELECT clr.RegexExtract(@string, N'(?<number>(\d+))(?<street>(.*))', N'number', 1);
Alternatively, you can think of patExtract8K as Chris Morris' PatternSplitCM (found here:
http://www.sqlservercentral.com/articles/String+Manipulation/94365/) but only returns the
rows where [matched]=0. The key benefit is that it performs substantially better
because you are only returning the number of rows required instead of returning twice as
many rows then filtering out half of them. Furthermore, because we're
The following two sets of queries return the same result:
DECLARE @string varchar(100) = 'xx123xx555xx999';
BEGIN
-- QUERY #1
-- patExtract8K
SELECT ps.itemNumber, ps.item
FROM samd.patExtract8K(@string, '[^0-9]') ps;
-- patternSplitCM
SELECT itemNumber = row_number() over (order by ps.itemNumber), ps.item
FROM dbo.patternSplitCM(@string, '[^0-9]') ps
WHERE [matched] = 0;
-- QUERY #2
SELECT ps.itemNumber, ps.item
FROM samd.patExtract8K(@string, '[0-9]') ps;
SELECT itemNumber = row_number() over (order by itemNumber), item
FROM dbo.patternSplitCM(@string, '[0-9]')
WHERE [matched] = 0;
END;
[Compatibility]:
SQL Server 2008+
[Syntax]:
--===== Autonomous
SELECT pe.ItemNumber, pe.ItemIndex, pe.ItemLength, pe.Item
FROM samd.patExtract8K(@string,@pattern) pe;
--===== Against a table using APPLY
SELECT t.someString, pe.ItemIndex, pe.ItemLength, pe.Item
FROM samd.SomeTable t
CROSS APPLY samd.patExtract8K(t.someString, @pattern) pe;
[Parameters]:
@string = varchar(8000); the input string
@pattern = varchar(50); pattern to search for
[Returns]:
itemNumber = bigint; the instance or ordinal position of the matched substring
itemIndex = bigint; the location of the matched substring inside the input string
itemLength = int; the length of the matched substring
item = varchar(8000); the returned text
[Developer Notes]:
1. Requires NGrams8k
2. patExtract8K does not return any rows on NULL or empty strings. Consider using
OUTER APPLY or append the function with the code below to force the function to return
a row on empty or NULL inputs:
UNION ALL SELECT 1, 0, NULL, @string WHERE nullif(@string,'') IS NULL;
3. patExtract8K is not case sensitive; use a case sensitive collation for
case-sensitive comparisons
4. patExtract8K is deterministic. For more about deterministic functions see:
https://msdn.microsoft.com/en-us/library/ms178091.aspx
5. patExtract8K performs substantially better with a parallel execution plan, often
2-3 times faster. For queries that leverage patextract8K that are not getting a
parallel execution plan you should consider performance testing using Traceflag 8649
in Development environments and Adam Machanic's make_parallel in production.
[Examples]:
--===== (1) Basic extract all groups of numbers:
WITH temp(id, txt) as
(
SELECT * FROM (values
(1, 'hello 123 fff 1234567 and today;""o999999999 tester 44444444444444 done'),
(2, 'syat 123 ff tyui( 1234567 and today 999999999 tester 777777 done'),
(3, '&**OOOOO=+ + + // ==?76543// and today !!222222\\\tester{}))22222444 done'))t(x,xx)
)
SELECT
[temp.id] = t.id,
pe.itemNumber,
pe.itemIndex,
pe.itemLength,
pe.item
FROM temp AS t
CROSS APPLY samd.patExtract8K(t.txt, '[^0-9]') AS pe;
-----------------------------------------------------------------------------------------
Revision History:
Rev 00 - 20170801 - Initial Development - Alan Burstein
Rev 01 - 20180619 - Complete re-write - Alan Burstein
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT itemNumber = ROW_NUMBER() OVER (ORDER BY f.position),
itemIndex = f.position,
itemLength = itemLen.l,
item = SUBSTRING(f.token, 1, itemLen.l)
FROM
(
SELECT ng.position, SUBSTRING(@string,ng.position,DATALENGTH(@string))
FROM samd.NGrams8k(@string, 1) AS ng
WHERE PATINDEX(@pattern, ng.token) < --<< this token does NOT match the pattern
ABS(SIGN(ng.position-1)-1) + --<< are you the first row? OR
PATINDEX(@pattern,SUBSTRING(@string,ng.position-1,1)) --<< always 0 for 1st row
) AS f(position, token)
CROSS APPLY (VALUES(ISNULL(NULLIF(PATINDEX('%'+@pattern+'%',f.token),0),
DATALENGTH(@string)+2-f.position)-1)) AS itemLen(l);
GO
The following isn't a direct answer to the question, but it shows how this can be done in PostgreSQL, which has a mature regular expression replace function. I would expect the solution to be adaptable to SQL Server using some kind of CLR integration library, but I'm not experienced in that...
SQL
SELECT REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE(phoneNumber, '((([0-9])[ ()+-]*){10,11})([^0-9]|$)', '`\1¬','g'),
'(^|¬)[^`¬]*(`|$)', ',', 'g'),
'(^,|,$|[^0-9,])', '', 'g')
FROM tbl;
Online Demo
db-fiddle.uk demo: https://dbfiddle.uk/?rdbms=postgres_12&fiddle=b12d9f9779b686fd0c4aa84956595f70
Explanation
The innermost REGEXP_REPLACE locates groups of either 10 or 11 digits, each of which may be followed by any number of space, bracket, plus or minus characters. The group must be followed by either a non-digit character or the end of the line. For each located group, a single ` is inserted before the group of digits and a single ¬ is inserted after it. You might need to change these marker characters to something rarer, since they must not appear anywhere else in the text.
The middle REGEXP_REPLACE replaces each block of text that isn't between a pair of marker characters with a single comma.
The outermost REGEXP_REPLACE removes any commas at the start or end of the string and also removes anything that isn't a digit or comma.
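To make the three passes easier to follow, here is a sketch of the same chain in Python, using re.sub with the patterns copied from the query above. The function name extract_phone_numbers and the sample strings are made up for illustration, and Python's regex engine is not identical to PostgreSQL's, so treat this as an approximation rather than a verified port.

```python
import re

def extract_phone_numbers(text):
    # Pass 1 (innermost): wrap each run of 10 or 11 digits in ` ... ¬ markers.
    # The character that follows the run is consumed and not put back.
    marked = re.sub(r"((([0-9])[ ()+-]*){10,11})([^0-9]|$)", "`\\1¬", text)
    # Pass 2 (middle): replace everything outside a marker pair with one comma.
    separated = re.sub(r"(^|¬)[^`¬]*(`|$)", ",", marked)
    # Pass 3 (outermost): drop leading/trailing commas and anything that is
    # neither a digit nor a comma.
    return re.sub(r"(^,|,$|[^0-9,])", "", separated)

print(extract_phone_numbers("Call (555) 123-4567 or 07700 900123 now"))
```

Each intermediate string can be printed to see the markers appear and disappear between passes.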

Looking for a scalar function to find the last occurrence of a character in a string

Table FOO has a column FILEPATH of type VARCHAR(512). Its entries are absolute paths:
FILEPATH
------------------------------------------------------------
file://very/long/file/path/with/many/slashes/in/it/foo.xml
file://even/longer/file/path/with/more/slashes/in/it/baz.xml
file://something/completely/different/foo.xml
file://short/path/foobar.xml
There's ~50k records in this table and I want to know all distinct filenames, not the file paths:
foo.xml
baz.xml
foobar.xml
This looks easy, but I couldn't find a DB2 scalar function that lets me search for the last occurrence of a character in a string. Am I overlooking something?
I could do this with a recursive query, but that seems like overkill for such a simple task and (no surprise) is extremely slow:
WITH PATHFRAGMENTS (POS, PATHFRAGMENT) AS (
SELECT
1,
FILEPATH
FROM FOO
UNION ALL
SELECT
POSITION('/', PATHFRAGMENT, OCTETS) AS POS,
SUBSTR(PATHFRAGMENT, POSITION('/', PATHFRAGMENT, OCTETS)+1) AS PATHFRAGMENT
FROM PATHFRAGMENTS
)
SELECT DISTINCT PATHFRAGMENT FROM PATHFRAGMENTS WHERE POS = 0
I think what you're looking for is the LOCATE_IN_STRING() scalar function. This is what Info Center has to say if you use a negative start value:
If the value of the integer is less than zero, the search begins at
LENGTH(source-string) + start + 1 and continues for each position to
the beginning of the string.
Combine that with the LENGTH() and RIGHT() scalar functions, and you can get what you want:
SELECT
RIGHT(
FILEPATH
,LENGTH(FILEPATH) - LOCATE_IN_STRING(FILEPATH,'/',-1)
)
FROM FOO
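The arithmetic in that expression (string length minus the 1-based position of the last slash gives the number of characters to keep from the right) can be checked with a small Python sketch. The helper name file_name is hypothetical; str.rfind stands in for LOCATE_IN_STRING with a start value of -1.

```python
def file_name(path, sep="/"):
    # rfind is 0-based and returns -1 when sep is absent, so adding 1 yields
    # the 1-based position of the last separator, or 0 if there is none.
    last_pos = path.rfind(sep) + 1
    # Equivalent of RIGHT(path, LENGTH(path) - last_pos).
    keep = len(path) - last_pos
    return path[len(path) - keep:]

paths = [
    "file://very/long/file/path/with/many/slashes/in/it/foo.xml",
    "file://even/longer/file/path/with/more/slashes/in/it/baz.xml",
    "file://something/completely/different/foo.xml",
    "file://short/path/foobar.xml",
]
# A set plays the role of SELECT DISTINCT.
print(sorted({file_name(p) for p in paths}))
```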
One way to do this is to take advantage of the power of DB2's XQuery engine. The following worked for me (and it was fast):
SELECT DISTINCT XMLCAST(
XMLQuery('tokenize($P, ''/'')[last()]' PASSING FILEPATH AS "P")
AS VARCHAR(512) )
FROM FOO
Here I use tokenize to split the file path into a sequence of tokens and then select the last of these tokens. The rest is only conversion from SQL to XML types and back again.
I know the OP's problem has already been solved, but I decided to post the following anyway in the hope that it helps others who, like me, land here.
I came across this thread while searching for a solution to a similar problem: it had exactly the same requirement, but for a different kind of database that also lacks the REVERSE function.
In my case this was an OpenEdge (Progress) database, which has a slightly different syntax. It does provide the INSTR function that most Oracle-style databases offer.
So I came up with the following code:
SELECT
SUBSTRING(
foo.filepath,
INSTR(foo.filepath, '/',1, LENGTH(foo.filepath) - LENGTH( REPLACE( foo.filepath, '/', '')))+1,
LENGTH(foo.filepath))
FROM foo
However, in my specific situation (the OpenEdge (Progress) database) this did not produce the desired behaviour, because replacing the character with an empty string gave the same length as the original string. This doesn't make much sense to me, but I was able to work around the problem with the code below, which replaces each slash with two characters and measures how much longer the string becomes:
SELECT
SUBSTRING(
foo.filepath,
INSTR(foo.filepath, '/',1, LENGTH( REPLACE( foo.filepath, '/', 'XX')) - LENGTH(foo.filepath))+1,
LENGTH(foo.filepath))
FROM foo
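Both queries rest on the same idea: count the slashes, then ask INSTR for the occurrence with that number. A Python sketch of that logic follows; the helpers instr, count_by_removal and count_by_doubling are illustrative names, not functions of any database.

```python
def instr(text, sub, start=1, occurrence=1):
    """1-based position of the given occurrence of sub, or 0 if not found."""
    if occurrence < 1:
        return 0
    pos = start - 1  # convert to a 0-based search offset
    for _ in range(occurrence):
        pos = text.find(sub, pos)
        if pos == -1:
            return 0
        pos += 1  # 1-based position of this hit, also the next search offset
    return pos

def count_by_removal(text, ch):
    # First query: original length minus length with ch removed.
    return len(text) - len(text.replace(ch, ""))

def count_by_doubling(text, ch):
    # Workaround query: each ch becomes two characters, so the growth
    # in length equals the number of occurrences.
    return len(text.replace(ch, "XX")) - len(text)

path = "file://short/path/foobar.xml"
last_slash = instr(path, "/", 1, count_by_doubling(path, "/"))
print(path[last_slash:])  # everything after the last slash
```

The two counting functions agree only when the searched value is a single character, which is the case for a path separator.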
I realize that this code won't solve the problem in T-SQL, because T-SQL has no equivalent of the INSTR function that offers the occurrence parameter.
Just to be thorough, I'll add the code needed to create this scalar function so it can be used the same way as in the examples above.
-- Drop the function if it already exists
IF OBJECT_ID('INSTR', 'FN') IS NOT NULL
DROP FUNCTION INSTR
GO
-- User-defined function to implement Oracle INSTR in SQL Server
CREATE FUNCTION INSTR (@str VARCHAR(8000), @substr VARCHAR(255), @start INT, @occurrence INT)
RETURNS INT
AS
BEGIN
DECLARE @found INT = @occurrence,
@pos INT = @start;
WHILE 1=1
BEGIN
-- Find the next occurrence
SET @pos = CHARINDEX(@substr, @str, @pos);
-- Nothing found
IF @pos IS NULL OR @pos = 0
RETURN @pos;
-- The required occurrence found
IF @found = 1
BREAK;
-- Prepare to find another one occurrence
SET @found = @found - 1;
SET @pos = @pos + 1;
END
RETURN @pos;
END
GO
To state the obvious: when the REVERSE function is available, you do not need to create this scalar function and can get the required result like this:
SELECT
SUBSTRING(
foo.filepath,
LEN(foo.filepath) - CHARINDEX('/', REVERSE(foo.filepath))+2,
LEN(foo.filepath))
FROM foo
You could just do it in a single statement:
select distinct reverse(substring(reverse(FILEPATH), 1, charindex('/', reverse(FILEPATH))-1))
from filetable

How can I visualize the value of nvarchar(max), with max>65535, from SQL Server database? [duplicate]

Possible Duplicate:
SQL Server Management Studio: Increase number of characters displayed in result set
Update: Note: that discussion contains INCORRECT answer marked as answer.
SSMS from SQL Server 2008 R2 displays a maximum of 8192 characters in "Results to text" mode and 65535 in "Results to grid" mode. "Results to file" output is also truncated.
How can I see the selected value of bigger size fast and cheap?
Update:
I saw the previous discussion, and its best answer, "create your own front-end app", is not really an answer.
I am not planning to compete with DBMS client tools vendors.
I just need to see the value fast, dirty or cheap, be it tools or not tools.
I just cannot believe that in order to see a single value I should create client applications and there is no trick or way around.
Why don't you just return the data as XML, which has no such size limitations? You can do this with cast( COLUMN_NAME as XML ).
Quick and dirty, I like that. Of course it can be done from inside Management Studio, you just have to be a little creative. The idea is simple: can't display the whole string? Chop it up and display more rows.
Here is a function that takes a varchar input and outputs a table with chunks of the specified size. You can then CROSS APPLY this function to the original table and get what you need.
Function:
create function Splitter( @string varchar(max), @pieceSize int )
returns @t table( S varchar(8000) )
as
begin
if ( @string is null or len(@string) = 0 )
return
set @pieceSize = isnull(@pieceSize, 1000)
-- fall back to the maximum for out-of-range sizes (0 would divide by zero below)
if (@pieceSize <= 0 or @pieceSize > 8000)
set @pieceSize = 8000
declare @i int = 0
declare @len int = len(@string)
-- emit every full-size piece
while ( @i < @len / @pieceSize )
begin
insert into @t(S) values(substring( @string, @i * @pieceSize + 1, @pieceSize))
set @i = @i + 1
end
-- emit the shorter final piece, which starts right after the last full one
if (@len % @pieceSize) != 0
insert into @t(S) values(substring( @string, @i * @pieceSize + 1, @len % @pieceSize ))
return
end
Usage example:
select t.ID, t.Col1, t.Col2, t.Col3, pieces.S
from dbo.MyTable as t
cross apply dbo.Splitter( t.MybigStringCol, 1000 ) as pieces
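The chunking idea can be sanity-checked outside the database with a short Python sketch. The name split_into_pieces is hypothetical; the defaults mirror the intent of the T-SQL function (1000 when no size is given, 8000 for out-of-range sizes).

```python
def split_into_pieces(text, piece_size=1000):
    """Split text into consecutive pieces of at most piece_size characters."""
    if not text:
        return []
    if piece_size is None:
        piece_size = 1000
    if piece_size <= 0 or piece_size > 8000:
        piece_size = 8000
    # Full pieces first; the last slice is shorter when the length is not
    # an exact multiple of piece_size.
    return [text[i:i + piece_size] for i in range(0, len(text), piece_size)]

pieces = split_into_pieces("x" * 2500, 1000)
print([len(p) for p in pieces])
```

Joining the pieces back together should always reproduce the original string, which is a quick way to confirm that no characters are lost or repeated at the piece boundaries.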
That is the problem I tackle in sqlise, a PowerShell module of the SQLPSX CodePlex project (sorry, I'm only allowed to use one hyperlink; please google for it).
PowerShell ISE is the Integrated Scripting Environment, which is part of PowerShell V2.
SQLPSX is a collection of PowerShell modules for managing and querying MS SQL Server (with minimal support for Oracle too).
The normal output pane of ISE has some bad truncation/wrapping behaviour, but it is possible to send output to an editor pane.
When you run a query that fetches a single row of a one-column result set and use either 'inline' or 'isetab' as the output format, you get the complete varchar(max), text or CLOB value (yes, this works for Oracle too).
If you query a single row with such columns, the result depends on embedded linefeeds; a width of 10000 characters per line is currently set. But this is a scripting language, so you can modify that yourself.
If you prefer a pure T-SQL solution, you can look at the source of my project Extreme T-SQL Script http://etsql.codeplex.com/. With the scripts print_string_in_lines.sql and sp_gentextupdate.sql you have the tools to generate update scripts that set fields to their current content. SQL Server 2008 is required, as I use varchar(max) internally.
BTW, I don't have access to SQL Server 2008 R2. I thought the limit was still about 4000 characters per text column.
I hope that helps
Bernd
Select
CASE WHEN T.TheSegment = 1 Then Cast(T.SomeID as varchar(50))
Else ''
End as The_ID
, T.ChoppedField
From (
Select SomeID
, 1 as TheSegment
, substring(SomeBigField, 1, 8000) as ChoppedField
from sometable
Union All
Select SomeID
, 2
, substring(SomeBigField, 8001, 8000) -- third argument is a length, not an end position
from sometable
) as t
order by t.SomeID, t.TheSegment;
Rinse and repeat if necessary on the unions or feel free to get recursive; not sure how much more than 16000 characters you feel like reading. About as cheap as it gets.
Often these large fields contain formatting characters, so the suggestions to create your own form and use some type of rich-text control are valid.
You can see it by viewing it in your front-end application. SSMS is not designed to be a general data viewer.