Strip out all non alpanumeric characters and non set punctation - sql

Hi I've taken a function to strip out alphanumeric characters and some punctuation characters from a string.
ALTER FUNCTION [dbo].[Remove_Non_Alphanumeric]
(#String_Parameter VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
DECLARE #Alphanumeric_Characters VARCHAR(289) = '%[^A-Za-z0-9| |?|,|&|\|/|.|'+CHAR(13)+'|'+CHAR(10)+'|(|)|]|[|-]%';
WHILE PATINDEX(#Alphanumeric_Characters, #String_Parameter) > 0
BEGIN
SELECT #String_Parameter = STUFF(#String_Parameter, PATINDEX(#Alphanumeric_Characters, #String_Parameter), 1, '');
END
RETURN #String_Parameter;
END
This mostly seems to work. However there are odd passages which have some characters returned with a ? and odder still some bullet points are changed to ? and others are left. Despite both not being in a my list of acceptable characters.

Related

Regex in LIKE Clause that accepts only Alphanumeric and dashes

Below is my SQL function script that will help identify the alphanumeric value and dashes (-):
CREATE FUNCTION [dbo].[FN_VALIDATE_ALPHANUMERIC_AND_DASHES](#TX_INPUT VARCHAR(1000))RETURNS BIT AS
BEGIN
DECLARE #bitInputVal AS BIT = 0
DECLARE #InputText VARCHAR(1000)
SET #InputText = LTRIM(RTRIM(ISNULL(#TX_INPUT,'')))
IF #InputText <> ''
BEGIN
SET #bitInputVal = CASE
WHEN #InputText LIKE '%[A-Za-z0-9-]%' THEN 1
ELSE 0
END
END
RETURN #bitInputVal
END
I have problem which I try this query:
SELECT dbo.FN_VALIDATE_CLAIMANT_REF_NO('AbcdefgH-1234*') it gives me a result of 1 though the character * is not included in the regex and should return 0 instead.
What I want to achieve is to explicitly verify if the string consist of alphanumeric (alphabets and numbers) and dashes only.
Please note that there are no limitation in length of the characters, only check if the string consist of alphanumeric and dashes.
You are testing for "whether any alphabet, number or dash is present in the string"
You should instead test whether any character other than alphabet, number or dash is present.
SET #bitInputVal = CASE WHEN #InputText LIKE '%[^A-Za-z0-9-]%' THEN 0 ELSE 1 END

Update or replace a special character text in my database field

I am facing to an issue that I cannot update or replace some characters in my database.
Here is how this text look like in my column when I retrieve it:
As you can see, there is an unknown characters between 'master' and 'degree' which I cannot even paste it here.
I also tried to update and replace it with below code (I cannot paste that two vertical lines here since they are not supported in any browser and I am not sure what they are, Please see the picture above to see what is in my SQL statement).
begin transaction
update gm_desc set projdesc=replace(projdesc,'%â s%','') where projdesc like '%âs%' and proposalno = '15-149-01'
You can see the real SQL Statement here:
I tried to update, or replace it but I cannot do it. The update statement successfully works but I still see that weird special charters. I would be appreciate to help me.
Here's a scalar-valued function which removes all non-alphanumeric characters (preserves spaces) from a string.
Hopefully it helps!
dbfiddle
create function dbo.get_alphanumeric_str
(
#string varchar(max)
)
returns varchar(max)
as
begin
declare #ret varchar(max);
with nums as (
select 1 as n
union all select n+1 from nums
where n < 256
)
select #ret = replace(stuff(
(
select '' + substring(#string, nums.n, 1)
from nums
where patindex('%[^0-9A-Za-z ]%', substring(#string, nums.n,1)) = 0
for xml path('')
), 1, 0, ''
), ' ', ' ')
option (MAXRECURSION 256)
return #ret;
end
Usage
select dbo.get_alphanumeric_str('Helloᶄ âWorld 1234⅊⅐')
Returns: Hello World 1234
How it works
The nums CTE is just to get a list of numbers (you can set the maximum of 256 to a higher value if your strings are longer; n.b. option (MAXRECURSION n) is for this CTE but has to be placed at the query)
The stuff essentially iterates through the string, using the list of numbers above and extracts a substring of length 1; each of these chars are checked if they match the [^0-9A-Za-z ] regex group (0-9 all digits, A-Za-z all letters both lower and upper case, and a single space character)
If they match, patindex() should return 0; i.e. index zero.
Use replace(string, ' ', ' ') for the space character as the xml path returns a special encoding, see this question.
Use a binary collation for accented characters; see this answer

Identify Hidden Characters

In my SQL tables I have text which has hidden characters which is only visible when I copy and paste it in notepad++.
How to find those rows which has hidden characters using SQL Server queries?
I have tried comparing the lengths using datalength and len
it did not work.
DATALENGTH(name) AS BinaryLength != LEN(name)
I want the row which has hidden characters.
On the assumption that this is being caused by control characters. Some of which are invisible. But also include tabs, newlines and spaces. An example to illustrate and how to get them to appear.
--DROP TABLE #SillyTemp
DECLARE #InvisibleChar1 NCHAR(1) = NCHAR(28), #InvisibleChar2 NCHAR(1) = NCHAR(30), #NonControlChar NCHAR(1) = NCHAR(33);
DECLARE #InputString NVARCHAR(500) = N'Some |' + #InvisibleChar1 +'| random string |' + #InvisibleChar2 + '|' + '; Thank god Finally a normal character |' + #NonControlChar + '|';
SELECT #InputString AS OhNoInvisibleCharacters
DECLARE #ControlCharRange NVARCHAR(50) = N'%[' + NCHAR(1) + '-' + NCHAR(31) + ']%';
CREATE TABLE #SillyTemp
(
input nvarchar(500)
)
INSERT INTO #SillyTemp(input)
VALUES (#InputString),(N'A normal string')
SELECT #ControlCharRange;
SELECT input FROM #SillyTemp AS #SI WHERE input LIKE #ControlCharRange;
This produces 3 results. A string with invisiblechars within them like such:
Some || random string ||; Thank god Finally a normal character |!|
Note, the are actually invisible inside SQL. But stackoverflow shows them as such. The output in SQL Server is simply.
Some || random string ||; Thank god Finally a normal character |!|
But these characters still have a corresponding (N)CHAR(X) value. (N)CHAR(0) is a NULL character and is highly unlikely to be in a string, in my setup to detect them it also provides some problems in building a range. (N)CHAR(32) is the ' ' space character.
The way the [X-Y] string operator works is also based on the (N)CHAR numbers. Therefore we can make a range of [NCHAR(1)-NCHAR(31)]
The last select goes through the temporary table, one which has invisible characters. Since we're looking for any NCHARS between 1 and 31, only those with invisible characters (and often invalid characters or tabs/newlines) satisfy the where condition. Thus only they get returned. In this case only the 'faulty' string gets returned in my select statement.

Find and Replace credit card numbers

We have a large database with a lot of data in it. I found out recently our sales and shipping department have been using a part of the application to store clients credit card numbers in the open. We've put a stop to it, but now there are thousands of rows with the numbers.
We're trying to figure out how to scan certain columns for 16 digits in a row (or dash separation) and replace them with X's.
It's not a simple UPDATE statement because the card numbers are stored among large amounts of text. So far I've been unable to figure out if SQL Server is capable of regex (it would seem not).
All else fails i will do this through PHP since that is what i'm best at... but it'll be painful.
Sounds like you need to use PATINDEX with a WHERE LIKE clause.
Something like this. Create a stored proc with something similar, then call it with a bunch of different parameters (make #pattern & #patternlength the params) that you have identified, until you've replaced all of the instances.
declare #pattern varchar(100), #patternlength int
set #pattern = '[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]'
set #patternlength = 19
update tableName
set fieldName =
LEFT(fieldName, patindex('%'+ #pattern + '%', fieldName)-1)
+ 'XXXX-XXXX-XXXX-XXXX'
+ SUBSTRING(fieldName, PATINDEX('%'+ #pattern + '%', fieldName)+#patternlength, LEN(fieldName))
from tableName
where fieldName like '%'+ #pattern + '%'
The trick is just finding the appropriate patterns, and setting the appropriate #patternlength value (not the length of #pattern as that won't work!)
I think you are better off doing this programatically, especially since you mentioned the data can be in a couple of different formats. Do keep in mind that not all credit card numbers are 16 digits long (Amex is 15, Visa is 13 or 16, etc).
The ability to check for various regexes and validate code will probably be best served at a cleanup job level, if possible.
Improvised Sean's answer.
The following will find all the occurrences of #maskPattern in #text and replace them with 'x'.
Example, If #maskPattern = XXXX-XXXX-XXXX-XXXX, it will find this pattern in #text and replace all occurrences with XXXX-XXXX-XXXX-XXXX. If it does not find any occurrence, it will leave the text as is.
This stored procedure can also be manipulated to only mask 3/4th of the beginning of the maskPattern. Cheers!
ALTER PROCEDURE [dbo].[SP_MaskCharacters] #text nvarchar(max),
#maskPattern nvarchar(500)
AS
BEGIN
DECLARE #numPattern nvarchar(max) = REPLACE(#maskPattern, 'x', '[0-9]')
DECLARE #patternLength int = LEN(#maskPattern)
WHILE (#text IS NOT NULL)
BEGIN
IF PATINDEX('%' + #numPattern + '%', #text) = 0 BREAK;
SET #text =
LEFT(#text, PATINDEX('%' + #numPattern + '%', #text)-1) --Get beginning chars of the input text until first occurance of pattern is found
+ #maskPattern --Append aasking pattern
+ SUBSTRING(#text, PATINDEX('%' + #numPattern + '%', #text) + #patternLength, LEN(#text)) -- Get & append rest of the text found after masking attern
END
SELECT #text
END
I faced this situation recently. Using Patindex and Stuff should help, but you would need to repeat for CC numbers with different number of digits separately.
-- For 16 digits CC numbers
UPDATE table
SET columnname = Stuff (columnname, Patindex(
'%[3-6][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%'
, columnname), 16, '################')
WHERE Patindex(
'%[3-6][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%'
, columnname) > 0
You can use patindex. It won't be pretty and there might be a more concise way to write it. But you can use sets ie [0-9]
patindex: http://msdn.microsoft.com/en-us/library/ms188395.aspx
similar question: SQL Server Regular expressions in T-SQL
For anyone finding this question who does want to use PHP, here's a function I use that takes a credit card number (all digits, with dashes, or with spaces) and replaces all but the first and last 4 digits with 'X'.
To accept credit card numbers with dashes as well, use this regex pattern instead:
$cc_regex_pattern = '/(\d{4})(-)?(\d{4})(-)?(\d{4})(-)?(\d{4})/'
and remove the preprocessing of the cc number that removes the dashes:
$compressed_cc_number = preg_replace('/(\ |-)/', '', $credit_card_number);
and so the replacement string becomes (because we've changed the index of patterns - note the $7):
$cc_regex_replacement = '$1' . $cc_middle_pattern . '$7';
or if you want, simply replace the whole cc number, like in the original question:
$cc_regex_replacement = 'XXXX$2XXXX$4XXXX$6XXXX';
Here's the original function for credit card numbers with or without spaces or dashes, which obfuscates and removes any dashes:
/**
* #param integer|string $credit_card_number
* #return mixed
*/
static function obfuscate_credit_card($credit_card_number)
{
$compressed_cc_number = preg_replace('/(\ |-)/', '', $credit_card_number);
$cc_length = strlen($compressed_cc_number);
$cc_middle_length = $cc_length >= 9 ? $cc_length - 8 : 0;
//create middle pattern
$cc_middle_pattern = '';
for ($i = 0; $i < $cc_middle_length; $i++) {
$cc_middle_pattern .= 'X';
}
//replace cc middle digits with middle pattern
$cc_regex_pattern = '/(\d{4})(\d+)(\d{4})/';
$cc_regex_replacement = '$1' . $cc_middle_pattern . '$3';
$obfuscated_cc = preg_replace($cc_regex_pattern, $cc_regex_replacement, $compressed_cc_number);
return $obfuscated_cc;
}

Other approach for handling this TSQL text manipulation

I have this following data:
0297144600-4799 0297485500-5599
The 0297485500-5599 based on observation always on position 31 char from the left which this is an easy approach.
But I would like to do is to anticipate just in case the data is like this below which means the position is no longer valid:
0297144600-4799 0297485500-5599 0297485600-5699
As you can see, I guess the first approach will the split by 1 blank space (" ") but due to number of space is unknown (varies) how do I take this approach then? Is there any method to find the space in between and shrink into 1 blank space (" ").
BTW ... it needs to be done in TSQL (Ms SQL 2005) unfortunately cause it's for SSIS :(
I am open with your idea/suggestion.
Thanks
I have updated my answer a bit, now that I know the number pattern will not always match. This code assumes the sequences will begin and end with a number and be separated by any number of spaces.
DECLARE #input nvarchar -- max in parens
DECLARE #pattern nvarchar -- max in parens
DECLARE #answer nvarchar -- max in parens
DECLARE #pos int
SET #input = ' 0297144623423400-4799 5615618131201561561 0297485600-5699 '
-- Make sure our search string has whitespace at the end for our pattern to match
SET #input = #input + ' '
-- Find anything that starts and ends with a number
WHILE PATINDEX('%[0-9]%[0-9] %', #input) > 0
BEGIN
-- Trim off the leading whitespace
SET #input = LTRIM(#input)
-- Find the end of the sequence by finding a space
SET #pos = PATINDEX('% %', #input)
-- Get the result out now that we know where it is
SET #answer = SUBSTRING(#input, 0, #pos)
SELECT [Result] = #answer
-- Remove the result off the front of the string so we can continue parsing
SET #input = SUBSTRING(#input, LEN(#answer) + 1, 8096)
END
Assuming you're processing one line at a time, you can also try this:
DECLARE #InputString nvarchar(max)
SET #InputString = '0297144600-4799 0297485500-5599 0297485600-5699'
BEGIN
WHILE CHARINDEX(' ',#InputString) > 0 -- Checking for double spaces
SET #InputString =
REPLACE(#InputString,' ',' ') -- Replace 2 spaces with 1 space
END
PRINT #InputString
(taken directly from SQLUSA, fnRemoveMultipleSpaces1)