Is there any way to look for and extract a string consisting of 2 characters followed by any 8 digit number in SQL Server? - sql

I am trying to create a query to extract a 10-character code from a larger string. The code consists of two characters followed by 8 digits, e.g. 'EL12345678'.
I was previously using the query below, where the variable @prefix can be any two characters. However, I have come across cases where these characters also appear elsewhere in the string, causing it to extract the wrong code.
SELECT SUBSTRING(message_key, (SELECT CHARINDEX(@prefix, message_key)), 10) AS pcn,
Message_ID
FROM MQ
WHERE Message_Status != 'processed'
AND Message_Status != 'bad'
AND message_status != 'new'
AND Message_Time > DATEADD(DAY, -@days, dbo.dateonlyVB())
AND Message_MethodName = CASE WHEN @prefix = 'DN' THEN 'SaveJob' ELSE 'savedetails' END;
I tried to use patindex with some wildcards to specify that the variable must be followed by a number, but this didn't seem to work when I tried it.
I expect it to extract a string like 'EL12345678' from a larger string that can be 300+ characters long. However, my query currently sometimes extracts strings like 'elValvef":' instead.
Any help with this at all would be greatly appreciated!

You can use patindex():
select substring(largerstring,
patindex('%__[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%', largerstring
), 10
)
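A minimal sketch of how this pattern could be tied to a known prefix instead of any two characters. The variable names and the sample message are made up for illustration, and it assumes the prefix contains no LIKE wildcard characters (%, _ or [). PATINDEX returns 0 when nothing matches, and SUBSTRING starting at position 0 would return the start of the string, so the CASE returns NULL in that case:

```sql
-- Hypothetical sample values, not taken from the real MQ table
DECLARE @prefix  CHAR(2)      = 'EL';
DECLARE @message VARCHAR(400) = '{"elValvef":"x","ref":"EL12345678","n":1}';

-- Build '%EL[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%'
DECLARE @pattern VARCHAR(100) = '%' + @prefix + REPLICATE('[0-9]', 8) + '%';

SELECT CASE WHEN PATINDEX(@pattern, @message) > 0
            THEN SUBSTRING(@message, PATINDEX(@pattern, @message), 10)
       END AS pcn;  -- NULL when the string holds no prefix + 8 digits
```

With the sample value above, the 'el' in elValvef is skipped because it is not followed by eight digits, so the code after "ref" is the one extracted.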

UPDATED BASED ON OP COMMENTS
As Gordon showed, you can do this:
SELECT item = SUBSTRING(@string,PATINDEX('%'+@prefix+'[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%', @string),10)
You can also use NGrams8k for this kind of thing:
DECLARE @string VARCHAR(1000) = 'ABCXXXAB12345678 blah blah', @prefix VARCHAR(2) = 'AB';
SELECT item = ng.token
FROM dbo.NGrams8k(@string,10) AS ng
WHERE PATINDEX('%'+@prefix+'[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%',ng.token) = 1;
OLD:
This string would consist of two characters followed by 8 digits
e.g.'EL12345678'
If it's that simple you can just do this:
DECLARE @string VARCHAR(1000) = 'EL12345678';
SELECT SUBSTRING(@string,3,8)
--Returns: 12345678

Related

Allow leading zeros in a LIKE statement

I have a SQL query which includes the line:
WHERE
[TraceableItem].[IdentificationNo] LIKE N'015933%'
I would like this to match the following numbers:
015933
00015933
000000000015933
But not allow any non-zero characters. How could I do this?
--Some test data
DECLARE @sample TABLE
(
number_as_string VARCHAR(20)
)
INSERT INTO @sample
VALUES
('015933') -- okay
,('00015933') -- okay
,('000000000015933')-- okay
,(' 00015933') -- don't return as this doesn't start with a 0
,('25') -- don't return, wrong number
,('string') -- don't return as it's a string
,('st15933') -- don't return as it starts with a string.
,('001000015933') -- don't return as this is the number 1000015933
SELECT
*
FROM
@sample as s
WHERE
--only consider rows that are a number
--stops CONVERT exception being thrown on lines that do not convert
ISNUMERIC(s.number_as_string) = 1
AND
--Convert to INT wipes out the leading 0's, but also spaces
CONVERT(INT,s.number_as_string) LIKE '15933%'
AND
--must start with a number, i.e. check it doesn't start with a space.
--LEFT(s.number_as_string,1) NOT LIKE '[^0-9]'
--This version is easier to read as it's not a double NOT logic like the version above
--Thanks to @Robert Kock
LEFT(s.number_as_string,1) BETWEEN '0' AND '9'
Gives the result
number_as_string
----------------
015933
00015933
000000000015933
You probably want to first convert to int and back to string as suggested by Neeraj Agarwal. But then take the left five characters and compare for exact equality to '15933'
where '15933' = left(convert(varchar(50),convert(int,
TraceableItem.IdentificationNo
)),5)
You can see it at work in the sample below, where it captures everything you desire and a little more, but doesn't capture the case presented by Harry Adams in the comments to Neeraj.
select *
from (values
('015933'),
('00015933'),
('000000000015933'),
('0001593399'),
('15933'),
('001000015933')
) vals (v)
where '15933' = left(convert(varchar(50),convert(int, v)),5)
I don't like converting to a number for this purpose. But one method is to "trim" the leading zeros away. For an exact match:
where replace(ltrim(replace([TraceableItem].[IdentificationNo], '0', ' ')), ' ', '0') = '15933'
For LIKE:
where replace(ltrim(replace([TraceableItem].[IdentificationNo], '0', ' ')), ' ', '0') LIKE '15933%'
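To make the replace/ltrim/replace trick easier to follow, here is a small sketch that shows the intermediate step next to the final value. The sample values and column aliases are invented for illustration, and it assumes the stored values contain no spaces of their own (any such space would be turned into a 0):

```sql
SELECT v,
       -- step 1: turn every 0 into a space
       REPLACE(v, '0', ' ') AS zeros_as_spaces,
       -- step 2: LTRIM drops the leading spaces (the former leading zeros),
       -- step 3: the remaining spaces are turned back into 0
       REPLACE(LTRIM(REPLACE(v, '0', ' ')), ' ', '0') AS trimmed
FROM (VALUES ('00015933'), ('0001059330'), ('15933')) AS t (v);
-- trimmed: '15933', '1059330', '15933'
```

Zeros in the middle or at the end of the value survive because LTRIM only touches the left-hand side.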
You can also express this with LIKE/NOT LIKE:
where [TraceableItem].[IdentificationNo] like '%15933%' and
[TraceableItem].[IdentificationNo] not like '%[^0]%15933%'
You can use cast to convert to an int and back to a character string provided the string consists of digits only, e.g.:
select cast(cast('00015933' as int) as varchar(24))

Simple Explanation for PATINDEX

I have been reading up on PATINDEX, trying to understand what it does and why you would use it. I understand that when using the wildcards it returns an INT giving the position where the character(s) first appear. So:
SELECT PATINDEX('%b%', '123b') -- returns 4
However I am looking to see if someone can explain the reason as to why you would use this in a simple(ish) way. I have read some other forums but it just is not sinking in to be honest.
Are you asking for realistic use-cases? I can think of two, real-life use-cases that I've had at work where PATINDEX() was my best option.
I had to import a text-file and parse it for INSERT INTO later on. But these files sometimes had numbers in this format: 00000-59. If you try CAST('00000-59' AS INT) you'll get an error. So I needed code that would parse 00000-59 to -59 but also 00000159 to 159 etc. The - could be anywhere, or it could simply not be there at all. This is what I did:
DECLARE @my_var VARCHAR(255) = '00000-59', @my_int INT
SET @my_var = STUFF(@my_var, 1, PATINDEX('%[^0]%', @my_var)-1, '')
SET @my_int = CAST(@my_var AS INT)
[^0] in this case means "any character that isn't a 0". So PATINDEX() tells me when the 0's end, regardless of whether that's because of a - or a number.
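As a rough illustration of the same idea applied to several inputs at once, the sketch below lists the position PATINDEX reports and the string that remains after STUFF removes the characters before it. The sample values and column aliases are assumptions for the example only:

```sql
SELECT v,
       PATINDEX('%[^0]%', v) AS first_non_zero,
       -- delete the (first_non_zero - 1) characters in front of that position
       STUFF(v, 1, PATINDEX('%[^0]%', v) - 1, '') AS stripped
FROM (VALUES ('00000-59'), ('00000159'), ('42')) AS t (v);
-- first_non_zero: 6, 6, 1
-- stripped:       '-59', '159', '42'
```

Note that a value made up only of zeros has no non-zero character, so PATINDEX returns 0 and that case would need separate handling.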
The second use-case I've had was checking whether an IBAN number was correct. In order to do that, any letters in the IBAN need to be changed to a corresponding number (A=10, B=11, etc...). I did something like this (incomplete but you get the idea):
SET @i = PATINDEX('%[^0-9]%', @IBAN)
WHILE @i <> 0 BEGIN
SET @num = UNICODE(SUBSTRING(@IBAN, @i, 1))-55
SET @IBAN = STUFF(@IBAN, @i, 1, CAST(@num AS VARCHAR(2)))
SET @i = PATINDEX('%[^0-9]%', @IBAN)
END
So again, I'm not concerned with finding exactly the letter A or B etc. I'm just finding anything that isn't a number and converting it.
PATINDEX is roughly equivalent to CHARINDEX except that it returns the position of a pattern instead of a literal substring. Examples:
Check if a string contains at least one digit:
SELECT PATINDEX('%[0-9]%', 'Hello') -- 0
SELECT PATINDEX('%[0-9]%', 'H3110') -- 2
Extract numeric portion from a string:
SELECT SUBSTRING('12345', PATINDEX('%[0-9]%', '12345'), 100) -- 12345
SELECT SUBSTRING('x2345', PATINDEX('%[0-9]%', 'x2345'), 100) -- 2345
SELECT SUBSTRING('xx345', PATINDEX('%[0-9]%', 'xx345'), 100) -- 345
Quoted from PATINDEX (Transact-SQL)
The following example uses % and _ wildcards to find the position at
which the pattern 'en', followed by any one character and 'ure' starts
in the specified string (index starts at 1):
SELECT PATINDEX('%en_ure%', 'please ensure the door is locked');
Here is the result set.
8
You'd use the PATINDEX function when you want to know at which character position a pattern begins in an expression of a valid text or character data type.

find if there is a 6 digit number within a string

How would you advise to find out in Sql Server 2010/2012 if a query contains a substring equal to a 6 digits number?
e.g. "agh123456 dfsdfdf" matches the requirements
"x123 ddd456" doesn't match the requirements because the 6 digits are not consecutive
"lm123" doesn't match the requirements because only 3 digits are found (out of the required 6)
The problem I encountered so far: is that SUBSTRING as a function requires parameters (position where the number presumably starts and this is random)
while PATINDEX returns the location of a pattern in a string, but we don't know the exact pattern (it can be any 6 digit number)
Any pointers or advice, much appreciated.
Thank you
You can use the LIKE operator:
SELECT *
FROM MyTable
WHERE Mycolumn LIKE '%[0-9][0-9][0-9][0-9][0-9][0-9]%'
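The same pattern also works with PATINDEX when the number itself needs to be pulled out, which addresses the concern that the exact pattern is unknown: the pattern only describes "six digits in a row", not a particular number. A minimal sketch using the three strings from the question (the variable and alias names are invented for the example):

```sql
-- Builds '%[0-9][0-9][0-9][0-9][0-9][0-9]%'
DECLARE @pattern VARCHAR(50) = '%' + REPLICATE('[0-9]', 6) + '%';

SELECT s,
       CASE WHEN PATINDEX(@pattern, s) > 0
            THEN SUBSTRING(s, PATINDEX(@pattern, s), 6)
       END AS six_digits
FROM (VALUES ('agh123456 dfsdfdf'), ('x123 ddd456'), ('lm123')) AS t (s);
-- six_digits: '123456', NULL, NULL
```

A run of more than six digits would also match, returning its first six digits, so an exact-length check would need extra conditions.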
This should also work, provided you don't have a string like this, where a shorter run of digits comes before the 6-digit number:
abc123 abc123456
Try this
DECLARE @str varchar(max) = 'abcxyz123456'
SELECT ISNUMERIC(SUBSTRING(@str,(SELECT PATINDEX('%[0-9]%',@str)),6))
If you want to select all rows in the table and mask the first 6-digit substring in each row:
DECLARE @mask varchar(max) = '######'
DECLARE @pattern varchar(max) = '%'+REPLACE(@mask,'#','[0-9]')+'%'
SELECT
ISNULL(STUFF(col1,PATINDEX(@pattern,col1),LEN(@mask),@mask),col1)
FROM Table1

Find and Replace credit card numbers

We have a large database with a lot of data in it. I found out recently our sales and shipping department have been using a part of the application to store clients credit card numbers in the open. We've put a stop to it, but now there are thousands of rows with the numbers.
We're trying to figure out how to scan certain columns for 16 digits in a row (or dash separation) and replace them with X's.
It's not a simple UPDATE statement because the card numbers are stored among large amounts of text. So far I've been unable to figure out if SQL Server is capable of regex (it would seem not).
If all else fails I will do this through PHP, since that is what I'm best at... but it'll be painful.
Sounds like you need to use PATINDEX with a WHERE LIKE clause.
Something like this. Create a stored proc with something similar, then call it with a bunch of different parameters (make @pattern & @patternlength the params) that you have identified, until you've replaced all of the instances.
declare @pattern varchar(100), @patternlength int
set @pattern = '[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]'
set @patternlength = 19
update tableName
set fieldName =
LEFT(fieldName, patindex('%'+ @pattern + '%', fieldName)-1)
+ 'XXXX-XXXX-XXXX-XXXX'
+ SUBSTRING(fieldName, PATINDEX('%'+ @pattern + '%', fieldName)+@patternlength, LEN(fieldName))
from tableName
where fieldName like '%'+ @pattern + '%'
The trick is just finding the appropriate patterns, and setting the appropriate @patternlength value (not the length of @pattern as that won't work!)
I think you are better off doing this programmatically, especially since you mentioned the data can be in a couple of different formats. Do keep in mind that not all credit card numbers are 16 digits long (Amex is 15, Visa is 13 or 16, etc).
The ability to check for various regexes and validate code will probably be best served at a cleanup job level, if possible.
Improved on Sean's answer.
The following finds every occurrence of the digit layout described by @maskPattern in @text and replaces it with the mask itself.
For example, if @maskPattern = XXXX-XXXX-XXXX-XXXX, it finds digits in this layout in @text and replaces all occurrences with XXXX-XXXX-XXXX-XXXX. If it does not find any occurrence, it leaves the text as is.
This stored procedure can also be adapted to mask only the first three quarters of the maskPattern. Cheers!
ALTER PROCEDURE [dbo].[SP_MaskCharacters] @text nvarchar(max),
@maskPattern nvarchar(500)
AS
BEGIN
DECLARE @numPattern nvarchar(max) = REPLACE(@maskPattern, 'x', '[0-9]')
DECLARE @patternLength int = LEN(@maskPattern)
WHILE (@text IS NOT NULL)
BEGIN
IF PATINDEX('%' + @numPattern + '%', @text) = 0 BREAK;
SET @text =
LEFT(@text, PATINDEX('%' + @numPattern + '%', @text)-1) --Get beginning chars of the input text until first occurrence of pattern is found
+ @maskPattern --Append masking pattern
+ SUBSTRING(@text, PATINDEX('%' + @numPattern + '%', @text) + @patternLength, LEN(@text)) -- Get & append rest of the text found after masking pattern
END
SELECT @text
END
I faced this situation recently. Using Patindex and Stuff should help, but you would need to repeat for CC numbers with different number of digits separately.
-- For 16 digits CC numbers
UPDATE table
SET columnname = Stuff (columnname, Patindex(
'%[3-6][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%'
, columnname), 16, '################')
WHERE Patindex(
'%[3-6][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%'
, columnname) > 0
You can use patindex. It won't be pretty and there might be a more concise way to write it. But you can use sets ie [0-9]
patindex: http://msdn.microsoft.com/en-us/library/ms188395.aspx
similar question: SQL Server Regular expressions in T-SQL
For anyone finding this question who does want to use PHP, here's a function I use that takes a credit card number (all digits, with dashes, or with spaces) and replaces all but the first and last 4 digits with 'X'.
To accept credit card numbers with dashes as well, use this regex pattern instead:
$cc_regex_pattern = '/(\d{4})(-)?(\d{4})(-)?(\d{4})(-)?(\d{4})/';
and remove the preprocessing of the cc number that removes the dashes:
$compressed_cc_number = preg_replace('/(\ |-)/', '', $credit_card_number);
and so the replacement string becomes (because we've changed the index of patterns - note the $7):
$cc_regex_replacement = '$1' . $cc_middle_pattern . '$7';
or if you want, simply replace the whole cc number, like in the original question:
$cc_regex_replacement = 'XXXX$2XXXX$4XXXX$6XXXX';
Here's the original function for credit card numbers with or without spaces or dashes, which obfuscates and removes any dashes:
/**
* @param integer|string $credit_card_number
* @return mixed
*/
static function obfuscate_credit_card($credit_card_number)
{
$compressed_cc_number = preg_replace('/(\ |-)/', '', $credit_card_number);
$cc_length = strlen($compressed_cc_number);
$cc_middle_length = $cc_length >= 9 ? $cc_length - 8 : 0;
//create middle pattern
$cc_middle_pattern = '';
for ($i = 0; $i < $cc_middle_length; $i++) {
$cc_middle_pattern .= 'X';
}
//replace cc middle digits with middle pattern
$cc_regex_pattern = '/(\d{4})(\d+)(\d{4})/';
$cc_regex_replacement = '$1' . $cc_middle_pattern . '$3';
$obfuscated_cc = preg_replace($cc_regex_pattern, $cc_regex_replacement, $compressed_cc_number);
return $obfuscated_cc;
}

A SQL Problem. Check condition within a comma separated value

I have a variable
DECLARE @AssignOn nvarchar(20)='0,2,5'
I want to check a condition like this
DECLARE @index int
SET DATEFIRST 7
SELECT @index=DATEPART(DW, GETDATE())-1
IF(CONVERT(nvarchar(2),@index) IN @AssignOn)
IN cannot be used here. Is there any other way to do this inline?
You can use CharIndex to find if you have a match. It returns a non zero value if the first string appears in the second.
IF(CHARINDEX(CONVERT(nvarchar(2),@index), @AssignOn) > 0)
The easiest way to do this is to search for the substring ',needle,' in the csv list string. However, this doesn't work correctly for the first and last elements. This can be overcome by concatenating a comma onto each side of the csv list string.
An example in SQL might be:
SELECT
CHARINDEX(','+ NEEDLE +',', ','+ HAYSTACK +',')
FROM table;
Or using LIKE:
SELECT *
FROM table
WHERE ','+ HAYSTACK +',' LIKE '%,'+ NEEDLE +',%';
IF CHARINDEX(','+CONVERT(nvarchar(2),@index)+',', ','+@AssignOn+',') <> 0
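A short sketch of why the surrounding commas matter. The list value '0,2,15' is a made-up example (the original list only holds single digits, where the problem does not show up); it demonstrates a plain CHARINDEX reporting a match for 1 because it finds the 1 inside 15:

```sql
DECLARE @AssignOn nvarchar(20) = N'0,2,15';  -- hypothetical list with a two-digit entry
DECLARE @index int = 1;

SELECT
    -- plain search: finds '1' inside '15', so it reports a false match
    CASE WHEN CHARINDEX(CONVERT(nvarchar(2), @index), @AssignOn) > 0
         THEN 1 ELSE 0 END AS naive_match,
    -- delimited search: looks for ',1,' inside ',0,2,15,'
    CASE WHEN CHARINDEX(',' + CONVERT(nvarchar(2), @index) + ',', ',' + @AssignOn + ',') > 0
         THEN 1 ELSE 0 END AS delimited_match;
-- naive_match = 1, delimited_match = 0
```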
As you actually define the values in the code you could instead;
DECLARE @AssignOn TABLE (value int)
INSERT @AssignOn VALUES (0),(2),(5)
... @index IN (SELECT value FROM @AssignOn)