Remove phone numbers from text in SQL Server - sql

In a text column in SQL Server, there are personal phone numbers, and I want to replace each digit of those numbers with #. Please see the examples below:
'07555815825'
'CALL ME ON 07585815826'
'TEXT 07545815826 TEST'
'TEXT 07545815826 TEST its 2020'
I have tried cross apply and string split but cannot get the desired results.
Below are the desired results:
'###########'
'CALL ME ON ###########'
'TEXT ########### TEST'
'TEXT ########### TEST its 2020'

Your sample data has exactly 11-digit phone numbers and only one phone number per text. With these constraints, you can use stuff():
select v.text,
stuff(text, patindex('%[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%', text + '00000000000'), 11, 'XXXXXXXXXXX')
from (values ('07555815825'),
('CALL ME ON 07585815826'),
('TEXT 07545815826 TEST'),
('TEXT 07545815826 TEST its 2020')
) v(text);
Note: This also handles texts that have no phone numbers.
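One caveat worth checking on your own data: as far as I know, STUFF() returns NULL when the start position lies beyond the end of the string, and padding the text with zeros can let a short trailing number such as 2020 complete an 11-digit match. A minimal sketch that sidesteps both by testing the PATINDEX() result first; the column alias txt, the extra 'no number here' row and the alias masked are made up for illustration:

```sql
SELECT v.txt,
       CASE WHEN p.pos > 0
            THEN STUFF(v.txt, p.pos, 11, REPLICATE('#', 11))  -- mask the 11 digits
            ELSE v.txt                                        -- no phone number: keep the text
       END AS masked
FROM (VALUES ('07555815825'),
             ('CALL ME ON 07585815826'),
             ('no number here'),
             ('TEXT 07545815826 TEST its 2020')
     ) AS v(txt)
CROSS APPLY (SELECT PATINDEX('%' + REPLICATE('[0-9]', 11) + '%', v.txt) AS pos) AS p;
```

Like the query above, this still assumes at most one 11-digit number per row.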

You cannot do this directly with native T-SQL syntax because SQL Server has no built-in regex feature, but you can combine other built-in functions to get something similar to regex.
Here is a function I found on Stack Overflow when I had the same need; you can add it to your solution:
CREATE FUNCTION dbo.fn_ReplaceWithPattern
(
    @InputString VARCHAR(MAX),
    @Pattern VARCHAR(MAX),
    @ReplaceString VARCHAR(MAX),
    @ReplaceLength INT = 1
)
RETURNS VARCHAR(MAX)
BEGIN
    DECLARE @Index INT;
    set @Index = patindex(@Pattern, @InputString);
    while @Index > 0
    begin
        --replace matching character at index
        set @InputString = stuff(@InputString, patindex(@Pattern, @InputString), @ReplaceLength, @ReplaceString);
        set @Index = patindex(@Pattern, @InputString);
    end;
    return @InputString;
END;
To use it:
declare @textString VARCHAR(250) = '07555815825
CALL ME ON 07585815826
TEXT 07545815826 TEST
TEXT 07545815826 TEST its 2020'
--select Replace(@textString, Substring(@textString, PatIndex('%[0-9]%', @textString), 1), '')
select dbo.fn_ReplaceWithPattern(@textString,'%[0-9]%','',1)
If your data is stored in a table and you need to update its values:
update TableA
set ColumnX = dbo.fn_ReplaceWithPattern(ColumnX,'%[0-9]%','',1)
If you only need it in a select statement, use this syntax:
select dbo.fn_ReplaceWithPattern(ColumnX,'%[0-9]%','',1) as SanitizedValue
from TableA
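Note that the calls above pass '' as the replacement and '%[0-9]%' as the pattern, so every digit (including the 2020) is removed rather than masked. A hedged sketch of how the same function might be called to mask only runs of 11 digits with #; it assumes dbo.fn_ReplaceWithPattern already exists as defined above, and the variable @phonePattern and the sample text are invented for illustration:

```sql
-- Assumes dbo.fn_ReplaceWithPattern from the answer above has been created.
DECLARE @phonePattern VARCHAR(200) = '%' + REPLICATE('[0-9]', 11) + '%';

SELECT dbo.fn_ReplaceWithPattern(
           'CALL 07585815826 OR 07545815826, its 2020',
           @phonePattern,
           REPLICATE('#', 11),   -- replacement text
           11                    -- number of characters to overwrite per match
       ) AS Masked;
```

Because the replacement contains no digits, the loop inside the function ends once no 11-digit run is left, and the four-digit 2020 is never touched. A run longer than 11 digits would only be partly masked, so this is a sketch for fixed-length numbers only.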

Related

SQL Server 2012 string functions

I have a field that can vary in length of the format CxxRyyy where x and y are numeric. I want to choose xx and yyy. For instance, if the field value is C1R12, then I want to get 1 and 12. if I use substring and charindex then I have to use a length, but I would like to use a position like
SUBSTRING(WPLocationNew, CHARINDEX('C',WPLocationNew,1)+1, CHARINDEX('R',WPLocationNew,1)-1)
or
SUBSTRING(WPLocationNew, CHARINDEX('C',WPLocationNew,1)+1, LEN(WPLocationNew) - CHARINDEX('R',WPLocationNew,1))
to get x, but I know that doesn't work. I feel like there is a fairly simple solution, but I am not coming up with it yet. Any suggestions?
If these are cell references and will always be in the form C{1-5 digits}R{1-5 digits} you can do this:
DECLARE @t TABLE(Original varchar(32));
INSERT @t(Original) VALUES ('C14R4535'),('C1R12'),('C57R123');
;WITH src AS
(
    SELECT Original, c = REPLACE(REPLACE(Original,'C',''),'R','.')
    FROM @t
)
SELECT Original, C = PARSENAME(c,2), R = PARSENAME(c,1)
FROM src;
Output
Original  C   R
C14R4535  14  4535
C1R12     1   12
C57R123   57  123
If you need to protect against other formats, you can add
FROM @t WHERE Original LIKE 'C%[0-9]%R%[0-9]%'
AND PATINDEX('%[^C^R^0-9]%', Original) = 0
It appears that you are attempting to parse an Excel cell reference. Those are predictably structured or I wouldn't suggest such an embarrassing hack as this.
Basically, take advantage of the fact that a try_cast in SQL ignores spaces when converting strings to numbers.
declare @val as varchar(20) = 'C1R12'
declare @newval as varchar(20)
declare @c as smallint
declare @r as smallint
--replace the C with 5 spaces
set @newval = replace(@val,'C',space(5))
--replace the R with 5 spaces
set @newval = replace(@newval,'R',space(5))
--take a look at the intermediate result, which is '     1     12'
select @newval
set @c = try_cast(left(@newval,11) as smallint)
set @r = try_cast(right(@newval,6) as smallint)
--take a look at the results... two smallint, 1 and 12
select @c, @r
That can all be accomplished in one line for each element (a line for column and a line for row) but I wanted you to be able to understand what was happening so this example goes through the steps individually.
Here's yet another way:
declare @val as varchar(20) = 'C12R345'
declare @c as varchar(5)
declare @r as varchar(5)
set @c = SUBSTRING(@val, patindex('C%', @val)+1,(patindex('%R%', @val)-1)-patindex('C%', @val) )
set @r = SUBSTRING(@val, patindex('%R%', @val)+1, LEN(@val) -patindex('%R%', @val))
select cast(@c as int) as 'C', cast(@r as int) as 'R'
There are lots of different ways to approach string parsing. Here's just one possible idea:
declare @s varchar(10) = 'C01R002';
select
rtrim( left(replace(stuff(@s, 1, 1, ''), 'R', space(10)), 10)) as c,
ltrim(right(replace(substring(@s, 2, 10), 'R', space(10)), 10)) as r
Strip out the 'C' and then replace the 'R' with enough spaces (ten here) so that the left and right sides can be extracted using a fixed length and then easily trimmed back.
stuff() and substring() as used above are just different ways to accomplish exactly the same thing. One advantage here is that it uses fairly portable string functions, and it's conceivable that this is somewhat faster. This is also done inline and without multiple steps.
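Going back to the SUBSTRING/CHARINDEX attempt in the question: the third argument of SUBSTRING is a length, not an end position, so the length has to be computed from the position of the R. A minimal sketch, assuming the value always starts with C and always contains an R; the variable names @loc and @rPos are invented for illustration:

```sql
DECLARE @loc  VARCHAR(20) = 'C14R4535';        -- hypothetical sample value
DECLARE @rPos INT = CHARINDEX('R', @loc);      -- position of the R

SELECT SUBSTRING(@loc, 2, @rPos - 2)         AS c,  -- digits between C and R
       SUBSTRING(@loc, @rPos + 1, LEN(@loc)) AS r;  -- everything after R
```

Passing LEN(@loc) as the last length is deliberately too long; SUBSTRING simply stops at the end of the string. If the R is missing, @rPos is 0 and the first SUBSTRING gets a negative length, so a guard would be needed for untrusted data.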

How to identify and redact all instances of a matching pattern in T-SQL

I have a requirement to run a function over certain fields to identify and redact any numbers which are 5 digits or longer, ensuring all but the last 4 digits are replaced with *
For example: "Some text with 12345 and 1234 and 12345678" would become "Some text with *2345 and 1234 and ****5678"
I've used PATINDEX to identify the starting character of the pattern:
PATINDEX('%[0-9][0-9][0-9][0-9][0-9]%', TEST_TEXT)
I can recursively call that to get the starting character of all the occurrences, but I'm struggling with the actual redaction.
Does anyone have any pointers on how this can be done? I know to use REPLACE to insert the *s where they need to be, it's just the identification of what I should actually be replacing I'm struggling with.
Could do it on a program, but I need it to be T-SQL (can be a function if needed).
Any tips greatly appreciated!
You can do this using the built in functions of SQL Server. All of which used in this example are present in SQL Server 2008 and higher.
DECLARE @String VARCHAR(500) = 'Example Input: 1234567890, 1234, 12345, 123456, 1234567, 123asd456'
DECLARE @StartPos INT = 1, @EndPos INT = 1;
DECLARE @Input VARCHAR(500) = ISNULL(@String, '') + ' '; --Sets the input and adds a control character at the end to make the loop easier.
DECLARE @OutputString VARCHAR(500) = ''; --Initialize an empty string to avoid NULL concatenation
WHILE (@StartPos <> 0)
BEGIN
    SET @StartPos = PATINDEX('%[0-9][0-9][0-9][0-9][0-9]%', @Input);
    IF @StartPos <> 0
    BEGIN
        SET @OutputString += SUBSTRING(@Input, 1, @StartPos - 1); --Separate all contents before the first occurrence of our filter
        SET @Input = SUBSTRING(@Input, @StartPos, 500); --Cut the entire string to the end. The last value must be at least the original string length to simply cut it all.
        SET @EndPos = (PATINDEX('%[0-9][0-9][0-9][0-9][^0-9]%', @Input)); --First occurrence of 4 digits followed by a non-digit.
        SET @Input = STUFF(@Input, 1, (@EndPos - 1), REPLICATE('*', (@EndPos - 1))); --@EndPos - 1 gives us the number of chars we want to replace.
    END
END
SET @OutputString += @Input; --Append the last element
SET @OutputString = LEFT(@OutputString, LEN(@OutputString)) --LEN ignores trailing spaces, so this drops the control character
SELECT @OutputString;
Which outputs the following:
Example Input: ******7890, 1234, *2345, **3456, ***4567, 123asd456
This entire code could also be made as a function since it only requires an input text.
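As a rough illustration of that last remark, the loop could be wrapped in a scalar function along these lines. This is only a sketch under stated assumptions: the name dbo.fn_RedactLongNumbers is invented, inputs are assumed to be shorter than 8000 characters, NULL comes back as an empty string, and a '.' sentinel is used instead of a space so that trailing spaces in the input survive:

```sql
CREATE FUNCTION dbo.fn_RedactLongNumbers (@String VARCHAR(7999))
RETURNS VARCHAR(8000)
AS
BEGIN
    -- '.' is a sentinel so the last digit run is always followed by a non-digit.
    DECLARE @Input    VARCHAR(8000) = ISNULL(@String, '') + '.';
    DECLARE @Output   VARCHAR(8000) = '';
    DECLARE @StartPos INT = PATINDEX('%[0-9][0-9][0-9][0-9][0-9]%', @Input);
    DECLARE @EndPos   INT;

    WHILE @StartPos > 0
    BEGIN
        -- Copy everything before the run of 5 or more digits unchanged.
        SET @Output += LEFT(@Input, @StartPos - 1);
        SET @Input = SUBSTRING(@Input, @StartPos, 8000);

        -- @Input now starts with the run; find its last four digits.
        SET @EndPos = PATINDEX('%[0-9][0-9][0-9][0-9][^0-9]%', @Input);
        SET @Output += REPLICATE('*', @EndPos - 1) + SUBSTRING(@Input, @EndPos, 4);

        -- Continue with whatever follows the run.
        SET @Input = SUBSTRING(@Input, @EndPos + 4, 8000);
        SET @StartPos = PATINDEX('%[0-9][0-9][0-9][0-9][0-9]%', @Input);
    END;

    SET @Output += @Input;
    -- Drop the sentinel again.
    RETURN LEFT(@Output, LEN(@Output) - 1);
END;
GO

SELECT dbo.fn_RedactLongNumbers('Some text with 12345 and 1234 and 12345678') AS Redacted;
```

Scalar functions with loops can be slow over large tables, so it may be worth testing this on a realistic data volume before relying on it.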
A dirty solution with a recursive CTE:
DECLARE
@tags nvarchar(max) = N'Some text with 12345 and 1234 and 12345678';
WITH Process (s, i)
as
(
SELECT @tags, PATINDEX('%[0-9][0-9][0-9][0-9][0-9]%', @tags)
UNION ALL
SELECT value, PATINDEX('%[0-9][0-9][0-9][0-9][0-9]%', value)
FROM
-- replace only the first digit of the run with *, then look for the next run of 5 digits
(SELECT SUBSTRING(s,0,i)+'*'+SUBSTRING(s,i+1,len(s)) value
FROM Process
WHERE i >0) calc
)
SELECT * FROM Process
WHERE i=0
I think a better solution is to add a CLR function to SQL Server to handle regular expressions.
sql-clr/RegEx
Here is an option using DelimitedSplit8K_LEAD, which can be found at https://www.sqlservercentral.com/articles/reaping-the-benefits-of-the-window-functions-in-t-sql-2. This is an extension of Jeff Moden's splitter that is even a little bit faster than the original. The big advantage this splitter has over most of the others is that it returns the ordinal position of each element. One caveat is that I am splitting on a space, based on your sample data. If you have numbers crammed in the middle of other characters, this will ignore them. That may be good or bad depending on your specific requirements.
declare @Something varchar(100) = 'Some text with 12345 and 1234 and 12345678';
with MyCTE as
(
select x.ItemNumber
, Result = isnull(case when TRY_CONVERT(bigint, x.Item) is not null then isnull(replicate('*', len(convert(varchar(20), TRY_CONVERT(bigint, x.Item))) - 4), '') + right(convert(varchar(20), TRY_CONVERT(bigint, x.Item)), 4) end, x.Item)
from dbo.DelimitedSplit8K_LEAD(@Something, ' ') x
)
select Output = stuff((select ' ' + Result
from MyCTE
order by ItemNumber
FOR XML PATH('')), 1, 1, '')
This produces: Some text with *2345 and 1234 and ****5678

Replace every alpha character with itself + wildcard in string SQL Server

My goal is to create a query that will search for results related to a specific keyword.
Say in a database we had the word cat.
Regardless of whether the user types C a t, C.A.T. or Cat, I want to find a result related to the search. All that matters is that the alphanumeric characters are in the correct sequence.
Say in the database we have these 4 records
cat
c/a/t
c.a.t
c. at
If the user types in C#$*(&A T I'd like to get all 4 results.
What I have written so far in my query is a function that strips any non-alphanumeric characters from the input string.
What can I do to replace each alphanumeric character with itself and add a wildcard at the end?
For every alpha character my input would look similar to this
C%[^a-zA-Z0-9]%A%[^a-zA-Z0-9]%T%[^a-zA-Z0-9]%
Actually, that search string will return only one record from this table: the row with 'c.a.t '.
This is because the expression C%[^a-zA-Z0-9]%A does not mean there can't be any alpha-numeric chars between C and A.
What it actually means is there should be at least one non alpha-numeric value between C and A.
Moreover, it will return incorrect values as well - a value like 'c u a s e t ' will be returned.
You need to change your where clause to something like this:
WHERE column LIKE '%C%A%T%'
AND column NOT LIKE '%C%[a-zA-Z0-9]%A%[a-zA-Z0-9]%T%'
This way, if the letters c, a and t appear in the correct order, the first condition will resolve to true, and if there are no other alphanumeric characters between c, a, and t, the second condition will resolve to true.
Here is a test script, where you can see for yourself what I mean:
DECLARE @T AS TABLE
(
a varchar(20)
)
INSERT INTO @T VALUES
('cat'),
('c/a/t'),
('c.a.t '),
('c. at'),
('c u a s e t ')
-- Incorrect where clause
SELECT *
FROM @T
WHERE a LIKE 'C%[^a-zA-Z0-9]%A%[^a-zA-Z0-9]%T%[^a-zA-Z0-9]%'
-- correct where clause
SELECT *
FROM @T
WHERE a LIKE '%C%A%T%'
AND a NOT LIKE '%C%[a-zA-Z0-9]%A%[a-zA-Z0-9]%T%'
And since I had some spare time, here is a script to create both the like and the not like patterns from the input string:
DECLARE @INPUT varchar(100) = '#*# c %^&# a ^&*$&* t (*&(%!##$'
DECLARE @Index int = 1,
@CurrentChar char(1),
@Like varchar(100),
@NotLike varchar(100) = '%'
-- <= so that the last character of the input is inspected as well
WHILE @Index <= LEN(@INPUT)
BEGIN
    SET @CurrentChar = SUBSTRING(@INPUT, @Index, 1)
    IF PATINDEX('%[^a-zA-Z0-9]%', @CurrentChar) = 0
    BEGIN
        SET @NotLike = @NotLike + @CurrentChar + '%[a-zA-Z0-9]%'
    END
    SET @Index = @Index + 1
END
SELECT @NotLike = LEFT(@NotLike, LEN(@NotLike) - 12),
@Like = REPLACE(@NotLike, '%[a-zA-Z0-9]%', '%')
SELECT *
FROM @T
WHERE a LIKE @Like
AND a NOT LIKE @NotLike
You can loop through your (cleaned) search string and append the expression you want to each letter. In my example @builtString should be what you want to use further on, if I understood correctly.
declare @cleanSearch as nvarchar(10) = 'CAT'
declare @builtString as nvarchar(100) = ''
WHILE LEN(@cleanSearch) > 0 -- loop until you deplete the search string
BEGIN
    SET @builtString = @builtString + substring(@cleanSearch,1,1) + '%[^a-zA-Z0-9]%' -- append the letter plus the wildcard pattern
    SET @cleanSearch = right(@cleanSearch, len(@cleanSearch) - 1) -- remove first letter of the search string
END
SELECT @builtString --will look like C%[^a-zA-Z0-9]%A%[^a-zA-Z0-9]%T%[^a-zA-Z0-9]%
SELECT @cleanSearch --@cleanSearch is now empty

Trim Only the comma separated Numbers without trimming the character appended Numbers

I have a column '<InstructorID>' which may contain data like "79,instr1,inst2,13" and so on.
The following code gives me result like this "791213"
declare @InstructorID varchar(100)
set @InstructorID= (select InstructorID from CourseSession where CourseSessionNum=262)
WHILE PATINDEX('%[^0-9]%', @InstructorID) > 0
BEGIN
    SET @InstructorID = STUFF(@InstructorID, PATINDEX('%[^0-9]%', @InstructorID), 1, '')
END
select @InstructorID
I need the output to be like this: "79,13"
i.e. those numbers attached to characters should not appear in the output.
P.S: I need to achieve this using SQL only. Unfortunately I'm unable to use regex, which would have made this task much easier.
I agree with others that your problem would seem to be indicative of a mistake in your data design.
However, accepting that you cannot change the design, the following would allow you to achieve what you are looking for:
DECLARE @InstructorID VARCHAR(100)
DECLARE @Part VARCHAR(100)
DECLARE @Pos INT
DECLARE @Return VARCHAR(100)
SET @InstructorID = '79,instr1,inst2,13'
SET @Return = ''
-- Continue until InstructorID is empty
WHILE (LEN(@InstructorID) > 0)
BEGIN
    -- Get the position of the next comma, and set to the end of InstructorID if there are no more
    SET @Pos = CHARINDEX(',', @InstructorID)
    IF (@Pos = 0)
        SET @Pos = LEN(@InstructorID)
    -- Get the next part of the text (including its comma) and shorten InstructorID
    SET @Part = SUBSTRING(@InstructorID, 1, @Pos)
    SET @InstructorID = RIGHT(@InstructorID, LEN(@InstructorID) - @Pos)
    -- Check that the part is numeric
    IF (ISNUMERIC(@Part) = 1)
        SET @Return = @Return + @Part
END
-- Trim trailing comma (if any)
IF (RIGHT(@Return, 1) = ',')
    SET @Return = LEFT(@Return, LEN(@Return) - 1)
PRINT @Return
Essentially, this loops through @InstructorID, extracting the parts of text between commas.
If the part is numeric then it adds it to the output text. I am PRINTing the text but you could SELECT it or use it however you wish.
Obviously, where I have SET @InstructorID = xyz, you should change this to your SELECT statement.
This code can be placed into a UDF if preferred, although as I say, your data format seems less than ideal.
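On newer versions there may be a set-based alternative to the loop. The sketch below assumes SQL Server 2017 or later (STRING_SPLIT needs compatibility level 130, STRING_AGG arrived in 2017), and the alias NumericParts is made up. Neither function promises to preserve the original order of the parts in this form, so treat the ordering as an assumption to verify:

```sql
DECLARE @InstructorID VARCHAR(100) = '79,instr1,inst2,13';

SELECT STRING_AGG(s.value, ',') AS NumericParts
FROM STRING_SPLIT(@InstructorID, ',') AS s
WHERE s.value <> ''                 -- skip empty parts
  AND s.value NOT LIKE '%[^0-9]%';  -- keep parts made only of digits
```

The NOT LIKE '%[^0-9]%' test is stricter than ISNUMERIC, which also accepts things like currency symbols and decimal points.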

How to count instances of character in SQL Column

I have an sql column that is a string of 100 'Y' or 'N' characters. For example:
YYNYNYYNNNYYNY...
What is the easiest way to get the count of all 'Y' symbols in each row.
This snippet works in the specific situation where you have a boolean: it answers "how many non-Ns are there?".
SELECT LEN(REPLACE(col, 'N', ''))
If, in a different situation, you were actually trying to count the occurrences of a certain character (for example 'Y') in any given string, use this:
SELECT LEN(col) - LEN(REPLACE(col, 'Y', ''))
In SQL Server:
SELECT LEN(REPLACE(myColumn, 'N', ''))
FROM ...
This gave me accurate results every time...
This is in my Stripes field...
Yellow, Yellow, Yellow, Yellow, Yellow, Yellow, Black, Yellow, Yellow, Red, Yellow, Yellow, Yellow, Black
11 Yellows
2 Black
1 Red
SELECT (LEN(Stripes) - LEN(REPLACE(Stripes, 'Red', ''))) / LEN('Red')
FROM t_Contacts
DECLARE @StringToFind VARCHAR(100) = 'Text To Count'
SELECT (LEN([Field To Search]) - LEN(REPLACE([Field To Search],@StringToFind,'')))/COALESCE(NULLIF(LEN(@StringToFind), 0), 1) --protect division from zero
FROM [Table To Search]
This will return the number of occurrences of N:
select ColumnName, LEN(ColumnName)- LEN(REPLACE(ColumnName, 'N', ''))
from Table
The easiest way is by using Oracle function:
SELECT REGEXP_COUNT(COLUMN_NAME,'CONDITION') FROM TABLE_NAME
Maybe something like this...
SELECT
LEN(REPLACE(ColumnName, 'N', '')) as NumberOfYs
FROM
SomeTable
The solutions below count the occurrences of a character in a string; the first one has a limitation:
1) Using SELECT LEN(REPLACE(myColumn, 'N', '')) is limited and gives the wrong output in the cases below:
SELECT LEN(REPLACE('YYNYNYYNNNYYNY', 'N', ''));
--8 --Correct
SELECT LEN(REPLACE('123a123a12', 'a', ''));
--8 --Wrong
SELECT LEN(REPLACE('123a123a12', '1', ''));
--7 --Wrong
2) For correct output, try the solution below:
Create the function (modify it as required) and call it like this:
select dbo.vj_count_char_from_string('123a123a12','2');
--3 --Correct
select dbo.vj_count_char_from_string('123a123a12','a');
--2 --Correct
-- ================================================
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
-- =============================================
-- Author: VIKRAM JAIN
-- Create date: 20 MARCH 2019
-- Description: Count char from string
-- =============================================
create FUNCTION vj_count_char_from_string
(
@string nvarchar(500),
@find_char char(1)
)
RETURNS integer
AS
BEGIN
    -- Declare the return variable here
    DECLARE @total_char int; DECLARE @position INT;
    SET @total_char=0; set @position = 1;
    -- Add the T-SQL statements to compute the return value here
    if LEN(@string)>0
    BEGIN
        -- walk every position, including the last character
        WHILE @position <= LEN(@string)
        BEGIN
            if SUBSTRING(@string, @position, 1) = @find_char
            BEGIN
                SET @total_char+= 1;
            END
            SET @position+= 1;
        END
    END;
    -- Return the result of the function
    RETURN @total_char;
END
GO
try this
declare @v varchar(250) = 'test.a,1 ;hheuw-20;'
-- LF ;
select len(replace(@v,';','11'))-len(@v)
If you want to count the number of instances of strings with more than a single character, you can either use the previous solution with regex, or this solution, which uses STRING_SPLIT (introduced in SQL Server 2016). You'll also need compatibility level 130 or higher.
ALTER DATABASE [database_name] SET COMPATIBILITY_LEVEL = 130
--some data
DECLARE @table TABLE (col varchar(500))
INSERT INTO @table SELECT 'whaCHAR(10)teverCHAR(10)whateverCHAR(10)'
INSERT INTO @table SELECT 'whaCHAR(10)teverwhateverCHAR(10)'
INSERT INTO @table SELECT 'whaCHAR(10)teverCHAR(10)whateverCHAR(10)~'
--string to find
DECLARE @string varchar(100) = 'CHAR(10)'
--select
SELECT
col
, (SELECT COUNT(*) - 1 FROM STRING_SPLIT (REPLACE(REPLACE(col, '~', ''), @string, '~'), '~')) AS 'NumberOfBreaks'
FROM @table
The second answer provided by nickf is very clever. However, it only works for a character length of the target sub-string of 1 and ignores spaces. Specifically, there were two leading spaces in my data, which SQL helpfully removes (I didn't know this) when all the characters on the right-hand-side are removed. Which meant that
" John Smith"
generated 12 using Nickf's method, whereas:
" Joe Bloggs, John Smith"
generated 10, and
" Joe Bloggs, John Smith, John Smith"
Generated 20.
I've therefore modified the solution slightly to the following, which works for me:
Select (len(replace(Sales_Reps,' ',''))- len(replace((replace(Sales_Reps, ' ','')),'JohnSmith','')))/9 as Count_JS
I'm sure someone can think of a better way of doing it!
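The behaviour described above most likely comes from LEN(), which does not count trailing spaces. One possible workaround, shown here only as a sketch, is to compare DATALENGTH() values instead. This assumes a varchar value in a single-byte collation, where DATALENGTH equals the number of characters (for nvarchar the difference would have to be divided by 2); the variable @s and the column aliases are invented for the example:

```sql
DECLARE @s VARCHAR(50) = 'a b ';   -- two spaces, one of them trailing

SELECT LEN(@s) - LEN(REPLACE(@s, ' ', ''))               AS len_based,         -- 3 - 2 = 1
       DATALENGTH(@s) - DATALENGTH(REPLACE(@s, ' ', '')) AS datalength_based;  -- 4 - 2 = 2
```

Only the DATALENGTH version sees both spaces, because LEN('a b ') already ignores the trailing one.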
You can also Try This
-- DECLARE field because your table type may be text
DECLARE @mmRxClaim nvarchar(MAX)
-- Getting Value from table
SELECT top (1) @mmRxClaim = mRxClaim FROM RxClaim WHERE rxclaimid_PK =362
-- Main String Value
SELECT @mmRxClaim AS MainStringValue
-- Count Multiple Character for this number of space will be number of character
SELECT LEN(@mmRxClaim) - LEN(REPLACE(@mmRxClaim, 'GS', ' ')) AS CountMultipleCharacter
-- Count Single Character for this number of space will be one
SELECT LEN(@mmRxClaim) - LEN(REPLACE(@mmRxClaim, 'G', '')) AS CountSingleCharacter
If you need to count a character in a string that contains more than two kinds of characters, then instead of 'N' you can use some operator or regex that matches all the characters except the one you need.
SELECT LEN(REPLACE(col, 'N', ''))
Try this:
SELECT COUNT(DECODE(SUBSTR(UPPER(:main_string),rownum,LENGTH(:search_char)),UPPER(:search_char),1)) search_char_count
FROM DUAL
connect by rownum <= length(:main_string);
It determines the number of single character occurrences as well as the sub-string occurrences in main string.
Here's what I used in Oracle SQL to see if someone was passing a correctly formatted phone number:
WHERE REPLACE(TRANSLATE('555-555-1212','0123456789-','00000000000'),'0','') IS NULL AND
LENGTH(REPLACE(TRANSLATE('555-555-1212','0123456789','0000000000'),'0','')) = 2
The first part checks to see if the phone number has only numbers and the hyphen and the second part checks to see that the phone number has only two hyphens.
For example, to count the instances of the character a in a SQL column called name (the empty quotes '' mean that a is replaced with nothing):
select len(name)- len(replace(name,'a','')) from TESTING
select len('YYNYNYYNNNYYNY')- len(replace('YYNYNYYNNNYYNY','y',''))
DECLARE @char NVARCHAR(50);
DECLARE @counter INT = 0;
DECLARE @i INT = 1;
DECLARE @search NVARCHAR(10) = 'Y'
SET @char = N'YYNYNYYNNNYYNY';
WHILE @i <= LEN(@char)
BEGIN
    IF SUBSTRING(@char, @i, 1) = @search
        SET @counter += 1;
    SET @i += 1;
END;
SELECT @counter;