Simple Explanation for PATINDEX - sql

I have have been reading up on PATINDEX attempting to understand what and why. I understand the when using the wildcards it will return an INT as to where that character(s) appears/starts. So:
SELECT PATINDEX('%b%', '123b') -- returns 4
However I am looking to see if someone can explain the reason as to why you would use this in a simple(ish) way. I have read some other forums but it just is not sinking in to be honest.

Are you asking for realistic use-cases? I can think of two, real-life use-cases that I've had at work where PATINDEX() was my best option.
I had to import a text-file and parse it for INSERT INTO later on. But these files sometimes had numbers in this format: 00000-59. If you try CAST('00000-59' AS INT) you'll get an error. So I needed code that would parse 00000-59 to -59 but also 00000159 to 159 etc. The - could be anywhere, or it could simply not be there at all. This is what I did:
DECLARE #my_var VARCHAR(255) = '00000-59', #my_int INT
SET #my_var = STUFF(#my_var, 1, PATINDEX('%[^0]%', #my_var)-1, '')
SET #my_int = CAST(#my_var AS INT)
[^0] in this case means "any character that isn't a 0". So PATINDEX() tells me when the 0's end, regardless of whether that's because of a - or a number.
The second use-case I've had was checking whether an IBAN number was correct. In order to do that, any letters in the IBAN need to be changed to a corresponding number (A=10, B=11, etc...). I did something like this (incomplete but you get the idea):
SET #i = PATINDEX('%[^0-9]%', #IBAN)
WHILE #i <> 0 BEGIN
SET #num = UNICODE(SUBSTRING(#IBAN, #i, 1))-55
SET #IBAN = STUFF(#IBAN, #i, 1, CAST(#num AS VARCHAR(2))
SET #i = PATINDEX('%[^0-9]%', #IBAN)
END
So again, I'm not concerned with finding exactly the letter A or B etc. I'm just finding anything that isn't a number and converting it.

PATINDEX is roughly equivalent to CHARINDEX except that it returns the position of a pattern instead of single character. Examples:
Check if a string contains at least one digit:
SELECT PATINDEX('%[0-9]%', 'Hello') -- 0
SELECT PATINDEX('%[0-9]%', 'H3110') -- 2
Extract numeric portion from a string:
SELECT SUBSTRING('12345', PATINDEX('%[0-9]%', '12345'), 100) -- 12345
SELECT SUBSTRING('x2345', PATINDEX('%[0-9]%', 'x2345'), 100) -- 2345
SELECT SUBSTRING('xx345', PATINDEX('%[0-9]%', 'xx345'), 100) -- 345

Quoted from PATINDEX (Transact-SQL)
The following example uses % and _ wildcards to find the position at
which the pattern 'en', followed by any one character and 'ure' starts
in the specified string (index starts at 1):
SELECT PATINDEX('%en_ure%', 'please ensure the door is locked');
Here is the result set.
8
You'd use the PATINDEX function when you want to know at which character position a pattern begins in an expression of a valid text or character data type.

Related

Joining on numeric part of string

It's been a while...I'd like to get your advice on the most efficient way to join on only the number part of a field that may be prefixed and/or suffixed with up to 2 letters. Here's a simplified snippet of what I'm trying to do:
SELECT a, b, c
FROM table 1 t1
LEFT JOIN table 2 t2 ON t1.PolicyCode = t2.sPolicyID,
Where t2.sPolicyID could begin and/or end with up to 2 letters. Some examples: TG73100, S7286674, 2344506R, etc. We only want to join to just its numeric part in between the letters, i.e. 73100, 7286674 or 2344506 from the examples.
Could someone please advise on a simple way of doing this?
Here is one way:
LEFT JOIN table 2 t2 ON t1.PolicyCode =
LEFT(SUBSTRING(t2.sPolicyID, PATINDEX('%[0-9]%', t2.sPolicyID), 50),
PATINDEX('%[^0-9]%',
SUBSTRING(t2.sPolicyID, PATINDEX('%[0-9]%', t2.sPolicyID), 50)
+ 'a') -1)
To break this down, there are 4 main parts.
1: Find the position of the first number with PATINDEX:
DECLARE #spolicyID VARCHAR(20) = 'xx123123xx'
SELECT PATINDEX('%[0-9]%', #spolicyID)
--Returns 3
2: Use SUBSTRING() to cut off everything before the first letter:
DECLARE #spolicyID VARCHAR(20) = 'xx123123xx'
SELECT SUBSTRING(#spolicyID, PATINDEX('%[0-9]%', #spolicyID), 50)
--Returns 123123xx
If we hardcoded the 3 that we know is returned from the first part, it would look like this:
DECLARE #spolicyID VARCHAR(20) = 'xx123123xx'
SELECT SUBSTRING(#spolicyID, 3), 50)
--50 is the number of characters to extract, set to something
--higher than the max string length to be safe
Of course, we don't want to hardcode it since it can change, but that makes seeing the different functions a bit easier.
3: Find the position of the next letter using PATINDEX again:
DECLARE #spolicyID VARCHAR(20) = 'xx123123xx'
SELECT PATINDEX('%[^0-9]%', SUBSTRING(#spolicyID, PATINDEX('%[0-9]%', #spolicyID), 50) + 'a')
--Returns 7 since it is looking at 123123xx
--The first x is in the 7th position
Note that we added an a onto the string. This is because if we had a string with no letters at the end, it would throw an error as the length 0 would be returned to SUBSTRING. You could add any letter or letters to the end and it would work, we are just making sure there is at least one. Try removing the + 'a' and using a string like xx123123 to see the error.
If we hardcoded the 123123xx from step 2 it would look like this (again just for easy example):
DECLARE #spolicyID VARCHAR(20) = 'xx123123xx'
SELECT PATINDEX('%[^0-9]%', '123123xx' + 'a')
4: Use LEFT() to return everything before the trailing letters, leaving us with only the numbers in between:
DECLARE #spolicyID VARCHAR(20) = 'xx123123xx'
LEFT(SUBSTRING(#spolicyID, PATINDEX('%[0-9]%', #spolicyID), 50),PATINDEX('%[^0-9]%', SUBSTRING(#spolicyID, PATINDEX('%[0-9]%', #spolicyID), 50) + 'a') -1)
--Need to add `-1` because step 3 PATINDEX returns 7
--as the position of first trailing letter, and
--we want the 6 characters before that
And again hardcoded from step 2 and 3 for easy viewing:
DECLARE #spolicyID VARCHAR(20) = 'xx123123xx'
LEFT('123123xx', 7-1)

SQL Server : How to test if a string has only digit characters

I am working in SQL Server 2008. I am trying to test whether a string (varchar) has only digit characters (0-9). I know that the IS_NUMERIC function can give spurious results. (My data can possibly have $ signs, which should not pass the test.) So, I'm avoiding that function.
I already have a test to see if a string has any non-digit characters, i.e.,
some_column LIKE '%[^0123456789]%'
I would think that the only-digits test would be something similar, but I'm drawing a blank. Any ideas?
Use Not Like
where some_column NOT LIKE '%[^0-9]%'
Demo
declare #str varchar(50)='50'--'asdarew345'
select 1 where #str NOT LIKE '%[^0-9]%'
There is a system function called ISNUMERIC for SQL 2008 and up. An example:
SELECT myCol
FROM mTable
WHERE ISNUMERIC(myCol)<> 1;
I did a couple of quick tests and also looked further into the docs:
ISNUMERIC returns 1 when the input expression evaluates to a valid numeric data type; otherwise it returns 0.
Which means it is fairly predictable for example
-9879210433 would pass but 987921-0433 does not.
$9879210433 would pass but 9879210$433 does not.
So using this information you can weed out based on the list of valid currency symbols and + & - characters.
Solution:
where some_column NOT LIKE '%[^0-9]%'
Is correct.
Just one important note: Add validation for when the string column = '' (empty string). This scenario will return that '' is a valid number as well.
Method that will work. The way it is used above will not work.
declare #str varchar(50)='79136'
select
case
when #str LIKE replicate('[0-9]',LEN(#str)) then 1
else 0
end
declare #str2 varchar(50)='79D136'
select
case
when #str2 LIKE replicate('[0-9]',LEN(#str)) then 1
else 0
end
DECLARE #x int=1
declare #exit bit=1
WHILE #x<=len('123c') AND #exit=1
BEGIN
IF ascii(SUBSTRING('123c',#x,1)) BETWEEN 48 AND 57
BEGIN
set #x=#x+1
END
ELSE
BEGIN
SET #exit=0
PRINT 'string is not all numeric -:('
END
END
I was attempting to find strings with numbers ONLY, no punctuation or anything else. I finally found an answer that would work here.
Using PATINDEX('%[^0-9]%', some_column) = 0 allowed me to filter out everything but actual number strings.
The selected answer does not work.
declare #str varchar(50)='79D136'
select 1 where #str NOT LIKE '%[^0-9]%'
I don't have a solution but know of this potential pitfall. The same goes if you substitute the letter 'D' for 'E' which is scientific notation.

Looking for a scalar function to find the last occurrence of a character in a string

Table FOO has a column FILEPATH of type VARCHAR(512). Its entries are absolute paths:
FILEPATH
------------------------------------------------------------
file://very/long/file/path/with/many/slashes/in/it/foo.xml
file://even/longer/file/path/with/more/slashes/in/it/baz.xml
file://something/completely/different/foo.xml
file://short/path/foobar.xml
There's ~50k records in this table and I want to know all distinct filenames, not the file paths:
foo.xml
baz.xml
foobar.xml
This looks easy, but I couldn't find a DB2 scalar function that allows me to search for the last occurrence of a character in a string. Am I overseeing something?
I could do this with a recursive query, but this appears to be overkill for such a simple task and (oh wonder) is extremely slow:
WITH PATHFRAGMENTS (POS, PATHFRAGMENT) AS (
SELECT
1,
FILEPATH
FROM FOO
UNION ALL
SELECT
POSITION('/', PATHFRAGMENT, OCTETS) AS POS,
SUBSTR(PATHFRAGMENT, POSITION('/', PATHFRAGMENT, OCTETS)+1) AS PATHFRAGMENT
FROM PATHFRAGMENTS
)
SELECT DISTINCT PATHFRAGMENT FROM PATHFRAGMENTS WHERE POS = 0
I think what you're looking for is the LOCATE_IN_STRING() scalar function. This is what Info Center has to say if you use a negative start value:
If the value of the integer is less than zero, the search begins at
LENGTH(source-string) + start + 1 and continues for each position to
the beginning of the string.
Combine that with the LENGTH() and RIGHT() scalar functions, and you can get what you want:
SELECT
RIGHT(
FILEPATH
,LENGTH(FILEPATH) - LOCATE_IN_STRING(FILEPATH,'/',-1)
)
FROM FOO
One way to do this is by taking advantage of the power of DB2s XQuery engine. The following worked for me (and fast):
SELECT DISTINCT XMLCAST(
XMLQuery('tokenize($P, ''/'')[last()]' PASSING FILEPATH AS "P")
AS VARCHAR(512) )
FROM FOO
Here I use tokenize to split the file path into a sequence of tokens and then select the last of these tokens. The rest is only conversion from SQL to XML types and back again.
I know that the problem from the OP was already solved but I decided to post the following anyway to hopefully help others like me that land here.
I came across this thread while searching for a solution to my similar problem which had the exact same requirement but was for a different kind of database that was also lacking the REVERSE function.
In my case this was for a OpenEdge (Progress) database, which has a slightly different syntax. This made the INSTR function available to me that most Oracle typed databases offer.
So I came up with the following code:
SELECT
SUBSTRING(
foo.filepath,
INSTR(foo.filepath, '/',1, LENGTH(foo.filepath) - LENGTH( REPLACE( foo.filepath, '/', '')))+1,
LENGTH(foo.filepath))
FROM foo
However, for my specific situation (being the OpenEdge (Progress) database) this did not result into the desired behaviour because replacing the character with an empty char gave the same length as the original string. This doesn't make much sense to me but I was able to bypass the problem with the code below:
SELECT
SUBSTRING(
foo.filepath,
INSTR(foo.filepath, '/',1, LENGTH( REPLACE( foo.filepath, '/', 'XX')) - LENGTH(foo.filepath))+1,
LENGTH(foo.filepath))
FROM foo
Now I understand that this code won't solve the problem for T-SQL because there is no alternative to the INSTR function that offers the Occurence property.
Just to be thorough I'll add the code needed to create this scalar function so it can be used the same way like I did in the above examples.
-- Drop the function if it already exists
IF OBJECT_ID('INSTR', 'FN') IS NOT NULL
DROP FUNCTION INSTR
GO
-- User-defined function to implement Oracle INSTR in SQL Server
CREATE FUNCTION INSTR (#str VARCHAR(8000), #substr VARCHAR(255), #start INT, #occurrence INT)
RETURNS INT
AS
BEGIN
DECLARE #found INT = #occurrence,
#pos INT = #start;
WHILE 1=1
BEGIN
-- Find the next occurrence
SET #pos = CHARINDEX(#substr, #str, #pos);
-- Nothing found
IF #pos IS NULL OR #pos = 0
RETURN #pos;
-- The required occurrence found
IF #found = 1
BREAK;
-- Prepare to find another one occurrence
SET #found = #found - 1;
SET #pos = #pos + 1;
END
RETURN #pos;
END
GO
To avoid the obvious, when the REVERSE function is available you do not need to create this scalar function and you can just get the required result like this:
SELECT
SUBSTRING(
foo.filepath,
LEN(foo.filepath) - CHARINDEX('\', REVERSE(foo.filepath))+2,
LEN(foo.filepath))
FROM foo
You could just do it in a single statement:
select distinct reverse(substring(reverse(FILEPATH), 1, charindex('/', reverse(FILEPATH))-1))
from filetable

Find and Replace credit card numbers

We have a large database with a lot of data in it. I found out recently our sales and shipping department have been using a part of the application to store clients credit card numbers in the open. We've put a stop to it, but now there are thousands of rows with the numbers.
We're trying to figure out how to scan certain columns for 16 digits in a row (or dash separation) and replace them with X's.
It's not a simple UPDATE statement because the card numbers are stored among large amounts of text. So far I've been unable to figure out if SQL Server is capable of regex (it would seem not).
All else fails i will do this through PHP since that is what i'm best at... but it'll be painful.
Sounds like you need to use PATINDEX with a WHERE LIKE clause.
Something like this. Create a stored proc with something similar, then call it with a bunch of different parameters (make #pattern & #patternlength the params) that you have identified, until you've replaced all of the instances.
declare #pattern varchar(100), #patternlength int
set #pattern = '[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]'
set #patternlength = 19
update tableName
set fieldName =
LEFT(fieldName, patindex('%'+ #pattern + '%', fieldName)-1)
+ 'XXXX-XXXX-XXXX-XXXX'
+ SUBSTRING(fieldName, PATINDEX('%'+ #pattern + '%', fieldName)+#patternlength, LEN(fieldName))
from tableName
where fieldName like '%'+ #pattern + '%'
The trick is just finding the appropriate patterns, and setting the appropriate #patternlength value (not the length of #pattern as that won't work!)
I think you are better off doing this programatically, especially since you mentioned the data can be in a couple of different formats. Do keep in mind that not all credit card numbers are 16 digits long (Amex is 15, Visa is 13 or 16, etc).
The ability to check for various regexes and validate code will probably be best served at a cleanup job level, if possible.
Improvised Sean's answer.
The following will find all the occurrences of #maskPattern in #text and replace them with 'x'.
Example, If #maskPattern = XXXX-XXXX-XXXX-XXXX, it will find this pattern in #text and replace all occurrences with XXXX-XXXX-XXXX-XXXX. If it does not find any occurrence, it will leave the text as is.
This stored procedure can also be manipulated to only mask 3/4th of the beginning of the maskPattern. Cheers!
ALTER PROCEDURE [dbo].[SP_MaskCharacters] #text nvarchar(max),
#maskPattern nvarchar(500)
AS
BEGIN
DECLARE #numPattern nvarchar(max) = REPLACE(#maskPattern, 'x', '[0-9]')
DECLARE #patternLength int = LEN(#maskPattern)
WHILE (#text IS NOT NULL)
BEGIN
IF PATINDEX('%' + #numPattern + '%', #text) = 0 BREAK;
SET #text =
LEFT(#text, PATINDEX('%' + #numPattern + '%', #text)-1) --Get beginning chars of the input text until first occurance of pattern is found
+ #maskPattern --Append aasking pattern
+ SUBSTRING(#text, PATINDEX('%' + #numPattern + '%', #text) + #patternLength, LEN(#text)) -- Get & append rest of the text found after masking attern
END
SELECT #text
END
I faced this situation recently. Using Patindex and Stuff should help, but you would need to repeat for CC numbers with different number of digits separately.
-- For 16 digits CC numbers
UPDATE table
SET columnname = Stuff (columnname, Patindex(
'%[3-6][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%'
, columnname), 16, '################')
WHERE Patindex(
'%[3-6][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%'
, columnname) > 0
You can use patindex. It won't be pretty and there might be a more concise way to write it. But you can use sets ie [0-9]
patindex: http://msdn.microsoft.com/en-us/library/ms188395.aspx
similar question: SQL Server Regular expressions in T-SQL
For anyone finding this question who does want to use PHP, here's a function I use that takes a credit card number (all digits, with dashes, or with spaces) and replaces all but the first and last 4 digits with 'X'.
To accept credit card numbers with dashes as well, use this regex pattern instead:
$cc_regex_pattern = '/(\d{4})(-)?(\d{4})(-)?(\d{4})(-)?(\d{4})/'
and remove the preprocessing of the cc number that removes the dashes:
$compressed_cc_number = preg_replace('/(\ |-)/', '', $credit_card_number);
and so the replacement string becomes (because we've changed the index of patterns - note the $7):
$cc_regex_replacement = '$1' . $cc_middle_pattern . '$7';
or if you want, simply replace the whole cc number, like in the original question:
$cc_regex_replacement = 'XXXX$2XXXX$4XXXX$6XXXX';
Here's the original function for credit card numbers with or without spaces or dashes, which obfuscates and removes any dashes:
/**
* #param integer|string $credit_card_number
* #return mixed
*/
static function obfuscate_credit_card($credit_card_number)
{
$compressed_cc_number = preg_replace('/(\ |-)/', '', $credit_card_number);
$cc_length = strlen($compressed_cc_number);
$cc_middle_length = $cc_length >= 9 ? $cc_length - 8 : 0;
//create middle pattern
$cc_middle_pattern = '';
for ($i = 0; $i < $cc_middle_length; $i++) {
$cc_middle_pattern .= 'X';
}
//replace cc middle digits with middle pattern
$cc_regex_pattern = '/(\d{4})(\d+)(\d{4})/';
$cc_regex_replacement = '$1' . $cc_middle_pattern . '$3';
$obfuscated_cc = preg_replace($cc_regex_pattern, $cc_regex_replacement, $compressed_cc_number);
return $obfuscated_cc;
}

SQL Server substring breaking on words, not characters

I'd like to show no more than n characters of a text field in search results to give the user an idea of the content. However, I can't find a way to easily break on words, so I wind up with a partial word at the break.
When I want to show: "This student has not submitted his last few assignments", the system might show: "This student has not submitted his last few assig"
I'd prefer that the system show up to the n character limit where words are preserved, so I'd like to see:
"This student has not submitted his last few"
Is there a nearest word function that I could write in T-SQL, or should I do that when I get the results back into ASP or .NET?
If you must do it in T-SQL:
DECLARE #t VARCHAR(100)
SET #t = 'This student has not submitted his last few assignments'
SELECT LEFT(LEFT(#t, 50), LEN(LEFT(#t, 50)) - CHARINDEX(' ', REVERSE(LEFT(#t, 50))))
It will not be catastrophically slow, but it will definitely be slower than doing it in the presentation layer.
Other than that — just cutting off the word and appending an ellipsis for longer strings is no bad option either. This way at least all truncated strings have the same length, which might come in handy if you are formatting for a fixed-width output.
I agree with doing this outside of the database that way other applications with different length restrictions can make their own decisions on what to show/hide. Perhaps that can be a parameter to the database call though.
Here's a quick stab at a solution:
DECLARE #OriginalData NVARCHAR(MAX)
,#ReversedData NVARCHAR(MAX)
,#MaxLength INT
,#DelimiterPosition INT ;
SELECT #OriginalData = 'This student has not submitted his last few assignments'
,#MaxLength = 45;
SET #ReversedData = REVERSE(
LEFT(#OriginalData, #MaxLength)
);
SET #DelimiterPosition = CHARINDEX(' ', #ReversedData);
PRINT LEFT(#OriginalData, #MaxLength - #DelimiterPosition);
/*
This student has not submitted his last few assignments
1234567890123456789012345678901234567890123456789012345
*/
I recommend doing that kind of logic outside database. With C# it could look similar to this:
static string Cut(string s, int length)
{
if (s.Length <= length)
{
return s;
}
while (s[length] != ' ')
{
length--;
}
return s.Substring(0, length).Trim();
}
Of cause you could do this with T-SQL, but that is bad idea (bad performance etc.). If you really need to put it inside DB I would use CLR-based stored procedure instead.
I'd like to add to the solutions already offered that word breaking logic is a lot more complicated than it seems on the surface. To do it well you are going to need to define a number of rules for what constitutes a word. Consider the following:
Spaces - No brainer.
Hyphens - Well that depends. In Over-exposed proably, in re-animated probably not. Then what about dates such as 01-02-1985?
Periods - No brainer. Oh wait, what about the one in myemail#myisp.com or $79.95?
Commas - In numbers such as 1,239 no, but in sentences yes.
Apostrophes - In O'Reily no, in SQL is an 'Enterprise' Database tool yes.
Do special characters alone constitute words?: In Item 1 : Buy TP is the colon counted as a word?
I found an answer on this site and modified it:
the cast (150) must be greater than the number of characters you're returning (100)
LEFT (Cast(myTextField As varchar(150)),
CHARINDEX(' ', CAST(flag_myTextField AS VARCHAR(150)), 100) ) AS myTextField_short
I'm not sure how fast this will run, but it will work....
DECLARE #Max int
SET #Max=??
SELECT
REVERSE(RIGHT(REVERSE(LEFT(YourColumnHere,#Max)),#Max- CHARINDEX(' ',REVERSE(LEFT(YourColumnHere,#Max)))))
FROM YourTable
WHERE X=Y
I wouldn't advice to do that either, but if you must, you can do something like this:
DECLARE #text nvarchar(max);
DECLARE #end_char int;
SELECT #text = 'This student has not submitted his last few assignments', #end_char = 50 ;
WHILE #end_char > 0 AND SUBSTRING( #text, #end_char+1, 1 ) <> ' '
SET #end_char = #end_char - 1
SELECT #text = SUBSTRING( #text, 1, #end_char ) ;
SELECT #text