Joining on numeric part of string - sql

It's been a while...I'd like to get your advice on the most efficient way to join on only the number part of a field that may be prefixed and/or suffixed with up to 2 letters. Here's a simplified snippet of what I'm trying to do:
SELECT a, b, c
FROM table 1 t1
LEFT JOIN table 2 t2 ON t1.PolicyCode = t2.sPolicyID,
Where t2.sPolicyID could begin and/or end with up to 2 letters. Some examples: TG73100, S7286674, 2344506R, etc. We only want to join to just its numeric part in between the letters, i.e. 73100, 7286674 or 2344506 from the examples.
Could someone please advise on a simple way of doing this?

Here is one way:
LEFT JOIN table 2 t2 ON t1.PolicyCode =
LEFT(SUBSTRING(t2.sPolicyID, PATINDEX('%[0-9]%', t2.sPolicyID), 50),
PATINDEX('%[^0-9]%',
SUBSTRING(t2.sPolicyID, PATINDEX('%[0-9]%', t2.sPolicyID), 50)
+ 'a') -1)
To break this down, there are 4 main parts.
1: Find the position of the first number with PATINDEX:
DECLARE @spolicyID VARCHAR(20) = 'xx123123xx'
SELECT PATINDEX('%[0-9]%', @spolicyID)
--Returns 3
2: Use SUBSTRING() to cut off everything before the first number:
DECLARE @spolicyID VARCHAR(20) = 'xx123123xx'
SELECT SUBSTRING(@spolicyID, PATINDEX('%[0-9]%', @spolicyID), 50)
--Returns 123123xx
If we hardcoded the 3 that we know is returned from the first part, it would look like this:
DECLARE @spolicyID VARCHAR(20) = 'xx123123xx'
SELECT SUBSTRING(@spolicyID, 3, 50)
--50 is the number of characters to extract, set to something
--higher than the max string length to be safe
Of course, we don't want to hardcode it since it can change, but that makes seeing the different functions a bit easier.
3: Find the position of the next letter using PATINDEX again:
DECLARE @spolicyID VARCHAR(20) = 'xx123123xx'
SELECT PATINDEX('%[^0-9]%', SUBSTRING(@spolicyID, PATINDEX('%[0-9]%', @spolicyID), 50) + 'a')
--Returns 7 since it is looking at 123123xx
--The first x is in the 7th position
Note that we appended an a to the string. Without it, a string with no letters at the end would make PATINDEX return 0, so a length of -1 would be passed to LEFT() and it would throw an error. You could add any letter or letters to the end and it would work; we are just making sure there is at least one non-digit to find. Try removing the + 'a' and using a string like xx123123 to see the error.
If we hardcoded the 123123xx from step 2 it would look like this (again just for easy example):
DECLARE @spolicyID VARCHAR(20) = 'xx123123xx'
SELECT PATINDEX('%[^0-9]%', '123123xx' + 'a')
4: Use LEFT() to return everything before the trailing letters, leaving us with only the numbers in between:
DECLARE @spolicyID VARCHAR(20) = 'xx123123xx'
SELECT LEFT(SUBSTRING(@spolicyID, PATINDEX('%[0-9]%', @spolicyID), 50),PATINDEX('%[^0-9]%', SUBSTRING(@spolicyID, PATINDEX('%[0-9]%', @spolicyID), 50) + 'a') -1)
--Need to add `-1` because step 3 PATINDEX returns 7
--as the position of first trailing letter, and
--we want the 6 characters before that
And again hardcoded from step 2 and 3 for easy viewing:
DECLARE @spolicyID VARCHAR(20) = 'xx123123xx'
SELECT LEFT('123123xx', 7-1)
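Putting the four steps back into a join, here is a minimal sketch you can run as-is. The table variables, their contents and the aliases d, x, FromFirstDigit and NumericPart are made up for illustration; only PolicyCode and sPolicyID come from the question. CROSS APPLY is used just to avoid typing the SUBSTRING expression twice.

```sql
-- Hypothetical sample data standing in for the two real tables
DECLARE @t1 TABLE (PolicyCode VARCHAR(20));
DECLARE @t2 TABLE (sPolicyID VARCHAR(20));
INSERT INTO @t1 (PolicyCode) VALUES ('73100'), ('7286674'), ('2344506'), ('999');
INSERT INTO @t2 (sPolicyID) VALUES ('TG73100'), ('S7286674'), ('2344506R');

SELECT t1.PolicyCode, x.sPolicyID
FROM @t1 t1
LEFT JOIN (
    SELECT t2.sPolicyID,
           -- Step 3 and 4: keep everything before the first trailing non-digit
           LEFT(d.FromFirstDigit,
                PATINDEX('%[^0-9]%', d.FromFirstDigit + 'a') - 1) AS NumericPart
    FROM @t2 t2
    -- Step 1 and 2: cut off everything before the first digit
    CROSS APPLY (SELECT SUBSTRING(t2.sPolicyID,
                                  PATINDEX('%[0-9]%', t2.sPolicyID), 50) AS FromFirstDigit) d
) x ON t1.PolicyCode = x.NumericPart;
```

The 999 row has no counterpart in @t2, so it comes back with NULL for sPolicyID. Keep in mind that a join on a computed expression like this cannot use an index on sPolicyID, so on large tables it may be worth storing the numeric part in its own column.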

Related

SQL Server 2012 string functions

I have a field that can vary in length of the format CxxRyyy where x and y are numeric. I want to choose xx and yyy. For instance, if the field value is C1R12, then I want to get 1 and 12. if I use substring and charindex then I have to use a length, but I would like to use a position like
SUBSTRING(WPLocationNew, CHARINDEX('C',WPLocationNew,1)+1, CHARINDEX('R',WPLocationNew,1)-1)
or
SUBSTRING(WPLocationNew, CHARINDEX('C',WPLocationNew,1)+1, LEN(WPLocationNew) - CHARINDEX('R',WPLocationNew,1))
to get x, but I know that doesn't work. I feel like there is a fairly simple solution, but I am not coming up with it yet. Any suggestions?
If these are cell references and will always be in the form C{1-5 digits}R{1-5 digits} you can do this:
DECLARE @t TABLE(Original varchar(32));
INSERT @t(Original) VALUES ('C14R4535'),('C1R12'),('C57R123');
;WITH src AS
(
SELECT Original, c = REPLACE(REPLACE(Original,'C',''),'R','.')
FROM @t
)
SELECT Original, C = PARSENAME(c,2), R = PARSENAME(c,1)
FROM src;
Output
Original   C    R
C14R4535   14   4535
C1R12      1    12
C57R123    57   123
If you need to protect against other formats, you can add
FROM @t WHERE Original LIKE 'C%[0-9]%R%[0-9]%'
AND PATINDEX('%[^C^R^0-9]%', Original) = 0
It appears that you are attempting to parse an Excel cell reference. Those are predictably structured or I wouldn't suggest such an embarrassing hack as this.
Basically, take advantage of the fact that a try_cast in SQL ignores leading and trailing spaces when converting strings to numbers.
declare @val as varchar(20) = 'C1R12'
declare @newval as varchar(20)
declare @c as smallint
declare @r as smallint
--replace the C with 5 spaces
set @newval = replace(@val,'C','     ')
--replace the R with 5 spaces
set @newval = replace(@newval,'R','     ')
--take a look at the intermediate result, which is '     1     12'
select @newval
set @c = try_cast(left(@newval,11) as smallint)
set @r = try_cast(right(@newval,6) as smallint)
--take a look at the results... two smallint, 1 and 12
select @c, @r
That can all be accomplished in one line for each element (a line for column and a line for row) but I wanted you to be able to understand what was happening so this example goes through the steps individually.
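As a rough sketch of that one-line-per-element form, under the same assumption of at most 5 digits on each side. SPACE(5) is used here instead of a quoted literal only so the number of spaces is easy to see.

```sql
DECLARE @val VARCHAR(20) = 'C1R12';

SELECT
    -- column: the first 11 characters hold 5 spaces, up to 5 digits, then padding
    TRY_CAST(LEFT(REPLACE(REPLACE(@val, 'C', SPACE(5)), 'R', SPACE(5)), 11) AS SMALLINT) AS c,
    -- row: the last 6 characters hold padding followed by up to 5 digits
    TRY_CAST(RIGHT(REPLACE(REPLACE(@val, 'C', SPACE(5)), 'R', SPACE(5)), 6) AS SMALLINT) AS r;
-- c = 1, r = 12
```

Note that SMALLINT tops out at 32767, so switch to INT if your 5-digit values can be larger than that.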
Here's yet another way:
declare @val as varchar(20) = 'C12R345'
declare @c as varchar(5)
declare @r as varchar(5)
set @c = SUBSTRING(@val, patindex('C%', @val)+1,(patindex('%R%', @val)-1)-patindex('C%', @val) )
set @r = SUBSTRING(@val, patindex('%R%', @val)+1, LEN(@val) -patindex('%R%', @val))
select cast(@c as int) as 'C', cast(@r as int) as 'R'
There are lots of different ways to approach string parsing. Here's just one possible idea:
declare @s varchar(10) = 'C01R002';
select
rtrim( left(replace(stuff(@s, 1, 1, ''), 'R', '          '), 10)) as c,
ltrim(right(replace(substring(@s, 2, 10), 'R', '          '), 10)) as r
Strip out the 'C' and then replace the 'R' with enough spaces (ten here) so that the left and right sides can be extracted using a fixed length and then easily trimmed back.
stuff() and substring() as used above are just different ways to accomplish exactly the same thing. One advantage here is that it does use fairly portable string functions and it's conceivable that this is somewhat faster. This is also done inline and without multiple steps.

Extract number between two substrings in sql

I had a previous question and it got me started but now I'm needing help completing this. Previous question = How to search a string and return only numeric value?
Basically I have a table with one of the columns containing a very long XML string. There's a number I want to extract near the end. A sample of the number would be this...
<SendDocument DocumentID="1234567">true</SendDocument>
So I want to use substrings to find the first part = true so that I'm only left with the number.
What I've tried so far is this:
SELECT SUBSTRING(xml_column, CHARINDEX('>true</SendDocument>', xml_column) - CHARINDEX('<SendDocument',xml_column) +10087,9)
The above gives me the results but it's far from being correct. My concern is that, what if the number grows from 7 digits to 8 digits, or 9 or 10?
In the previous question I was helped with this:
SELECT SUBSTRING(cip_msg, CHARINDEX('<SendDocument',cip_msg)+26,7)
and that's how I got started, but I wanted to alter it so that I could subtract the last portion and just be left with the numbers.
So again, first part of the string that contains the digits, find the two substrings around the digits and remove them and retrieve just the digits no matter the length.
Thank you all
You should be able to setup your SUBSTRING() so that both the starting and ending positions are variable. That way the length of the number itself doesn't matter.
From the sound of it, the starting position you want is right after the <SendDocument DocumentID=" text.
The starting position would be:
CHARINDEX('<SendDocument DocumentID="', xml_column) + 26
((adding 26 because CHARINDEX gives you the position at the beginning of the string you are searching for, and that string is 26 characters long))
Length would be:
CHARINDEX('">true</SendDocument>', xml_column) - (CHARINDEX('<SendDocument DocumentID="', xml_column) + 26)
((Position of the ending text minus the starting position))
So, how about something along the lines of:
SELECT SUBSTRING(xml_column, CHARINDEX('<SendDocument DocumentID="', xml_column) + 26, CHARINDEX('">true</SendDocument>', xml_column) - (CHARINDEX('<SendDocument DocumentID="', xml_column) + 26))
Have you tried working directly with the xml type? Like below:
DECLARE @TempXmlTable TABLE
(XmlElement xml )
INSERT INTO @TempXmlTable
select Convert(xml,'<SendDocument DocumentID="1234567">true</SendDocument>')
SELECT
element.value('./@DocumentID', 'varchar(50)') as DocumentID
FROM
@TempXmlTable CROSS APPLY
XmlElement.nodes('//.') AS DocumentID(element)
WHERE element.value('./@DocumentID', 'varchar(50)') is not null
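A slightly narrower variant of the same idea, as a sketch: instead of visiting every node and filtering out the NULLs, point nodes() straight at the SendDocument elements. The @doc variable, its sample content and the d(n) alias are assumptions for illustration; with a real table you would CROSS APPLY the xml column's nodes() method the same way as above.

```sql
-- Sample document, assumed to be well-formed XML
DECLARE @doc XML = '<SomeTag Attrib1="5">
<SendDocument DocumentID="1234567">true</SendDocument>
<SendDocument DocumentID="1234568">true</SendDocument>
</SomeTag>';

-- One row per SendDocument element, reading its DocumentID attribute
SELECT d.n.value('@DocumentID', 'varchar(50)') AS DocumentID
FROM @doc.nodes('//SendDocument') AS d(n);
```

This returns one row per element (1234567 and 1234568 here), regardless of how many digits each ID has.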
If you just want to work with this as a string you can do the following:
DECLARE @SearchString varchar(max) = '<SendDocument DocumentID="1234567">true</SendDocument>'
DECLARE @Start int = (select CHARINDEX('DocumentID="',@SearchString)) + 12 -- 12 Character search pattern
DECLARE @End int = (select CHARINDEX('">', @SearchString)) - @Start --Find End Characters and subtract start position
SELECT SUBSTRING(@SearchString,@Start,@End)
Below is an extended version that parses an XML document string. In this example I create a copy of the PL/SQL function INSTR, which MS SQL Server does not have by default. The function lets me search a string from a designated starting position. I also split a sample XML string into lines stored in a table variable and only look at the lines that match my search criteria, because there may be many elements containing DocumentID and I want to find all of them. See below:
IF EXISTS (select * from sys.objects where name = 'INSTR' and type = 'FN')
DROP FUNCTION [dbo].[INSTR]
GO
CREATE FUNCTION [dbo].[INSTR] (@String VARCHAR(8000), @SearchStr VARCHAR(255), @Start INT, @Occurrence INT)
RETURNS INT
AS
BEGIN
DECLARE @Found INT = @Occurrence,
@Position INT = @Start;
WHILE 1=1
BEGIN
-- Find the next occurrence
SET @Position = CHARINDEX(@SearchStr, @String, @Position);
-- Nothing found
IF @Position IS NULL OR @Position = 0
RETURN @Position;
-- The required occurrence found
IF @Found = 1
BREAK;
-- Prepare to find another occurrence
SET @Found = @Found - 1;
SET @Position = @Position + 1;
END
RETURN @Position;
END
GO
--Assuming well-formatted xml
DECLARE @XmlStringDocument varchar(max) = '<SomeTag Attrib1="5">
<SendDocument DocumentID="1234567">true</SendDocument>
<SendDocument DocumentID="1234568">true</SendDocument>
</SomeTag>'
--Split Lines on this element tag
DECLARE @SplitOn nvarchar(25) = '</SendDocument>'
--Let's hold all lines in a table variable
DECLARE @XmlStringLines TABLE
(
Value nvarchar(100)
)
While (Charindex(@SplitOn,@XmlStringDocument)>0)
Begin
Insert Into @XmlStringLines (Value)
Select
Value = ltrim(rtrim(Substring(@XmlStringDocument,1,Charindex(@SplitOn,@XmlStringDocument)-1)))
Set @XmlStringDocument = Substring(@XmlStringDocument,Charindex(@SplitOn,@XmlStringDocument)+len(@SplitOn),len(@XmlStringDocument))
End
Insert Into @XmlStringLines (Value)
Select Value = ltrim(rtrim(@XmlStringDocument))
--Now we have a table with multiple lines; find all Document IDs
SELECT
StartPosition = CHARINDEX('DocumentID="',Value) + 12,
--Now let's use the INSTR function to find the first instance of '">' after our search string
EndPosition = dbo.INSTR(Value,'">',( CHARINDEX('DocumentID="',Value)) + 12,1),
--Now that we know the start and end let's use substring
Value = SUBSTRING(Value,(
-- Start Position
CHARINDEX('DocumentID="',Value)) + 12,
--End Position Minus Start Position
dbo.INSTR(Value,'">',( CHARINDEX('DocumentID="',Value)) + 12,1) - (CHARINDEX('DocumentID="',Value) + 12))
FROM
@XmlStringLines
WHERE Value like '%DocumentID%' --Only care about lines with a document id

Replace every alpha character with itself + wildcard in string SQL Server

My goal is to create a query that will search for results related to a specific keyword.
Say in a database we had the word cat.
Regardless of whether the user types C a t, C.A.T. or Cat, I want to find a result related to the search. All that matters is that the alphanumeric characters are in the correct sequence.
Say in the database we have these 4 records
cat
c/a/t
c.a.t
c. at
If the user types in C#$*(&A T I'd like to get all 4 results.
What I have written so far in my query is a function that strips any non-alphanumeric characters from the input string.
What can I do to replace each alphanumeric character with itself and add a wildcard at the end?
For every alpha character my input would look similar to this
C%[^a-zA-Z0-9]%A%[^a-zA-Z0-9]%T%[^a-zA-Z0-9]%
Actually, that search string will return only one record from this table: the row with 'c.a.t '.
This is because the expression C%[^a-zA-Z0-9]%A does not mean there can't be any alpha-numeric chars between C and A.
What it actually means is there should be at least one non alpha-numeric value between C and A.
Moreover, it will return incorrect values as well - a value like 'c u a s e t ' will be returned.
You need to change your where clause to something like this:
WHERE column LIKE '%C%A%T%'
AND column NOT LIKE '%C%[a-zA-Z0-9]%A%[a-zA-Z0-9]%T%'
This way, if c, a and t appear in the correct order, the first condition resolves to true, and if there are no other alpha-numeric chars between c, a, and t, the second condition resolves to true as well.
Here is a test script, where you can see for yourself what I mean:
DECLARE @T AS TABLE
(
a varchar(20)
)
INSERT INTO @T VALUES
('cat'),
('c/a/t'),
('c.a.t '),
('c. at'),
('c u a s e t ')
-- Incorrect where clause
SELECT *
FROM @T
WHERE a LIKE 'C%[^a-zA-Z0-9]%A%[^a-zA-Z0-9]%T%[^a-zA-Z0-9]%'
-- correct where clause
SELECT *
FROM @T
WHERE a LIKE '%C%A%T%'
AND a NOT LIKE '%C%[a-zA-Z0-9]%A%[a-zA-Z0-9]%T%'
And since I had some spare time, here is a script to create both the like and the not like patterns from the input string:
DECLARE @INPUT varchar(100) = '#*# c %^&# a ^&*$&* t (*&(%!##$'
DECLARE @Index int = 1,
@CurrentChar char(1),
@Like varchar(100),
@NotLike varchar(100) = '%'
-- <= so that the last character of the input is examined as well
WHILE @Index <= LEN(@INPUT)
BEGIN
SET @CurrentChar = SUBSTRING(@INPUT, @Index, 1)
-- Keep only alphanumeric characters
IF PATINDEX('%[^a-zA-Z0-9]%', @CurrentChar) = 0
BEGIN
SET @NotLike = @NotLike + @CurrentChar + '%[a-zA-Z0-9]%'
END
SET @Index = @Index + 1
END
-- Drop the trailing '[a-zA-Z0-9]%' (12 characters), then derive the LIKE pattern
SELECT @NotLike = LEFT(@NotLike, LEN(@NotLike) - 12),
@Like = REPLACE(@NotLike, '%[a-zA-Z0-9]%', '%')
SELECT *
FROM @T
WHERE a LIKE @Like
AND a NOT LIKE @NotLike
You can loop through your (cleaned) search string and append the expression you want to each letter. In my example @builtString should be what you would like to use further on, if I understood correctly.
declare @cleanSearch as nvarchar(10) = 'CAT'
declare @builtString as nvarchar(100) = ''
WHILE LEN(@cleanSearch) > 0 -- loop until you deplete the search string
BEGIN
SET @builtString = @builtString + substring(@cleanSearch,1,1) + '%[^a-zA-Z0-9]%' -- append the letter plus the LIKE pattern
SET @cleanSearch = right(@cleanSearch, len(@cleanSearch) - 1) -- remove first letter of the search string
END
SELECT @builtString --will look like C%[^a-zA-Z0-9]%A%[^a-zA-Z0-9]%T%[^a-zA-Z0-9]%
SELECT @cleanSearch --@cleanSearch is now empty

Simple Explanation for PATINDEX

I have been reading up on PATINDEX, attempting to understand the what and why. I understand that when using the wildcards it will return an INT indicating where the character(s) appear or start. So:
SELECT PATINDEX('%b%', '123b') -- returns 4
However I am looking to see if someone can explain the reason as to why you would use this in a simple(ish) way. I have read some other forums but it just is not sinking in to be honest.
Are you asking for realistic use-cases? I can think of two, real-life use-cases that I've had at work where PATINDEX() was my best option.
I had to import a text-file and parse it for INSERT INTO later on. But these files sometimes had numbers in this format: 00000-59. If you try CAST('00000-59' AS INT) you'll get an error. So I needed code that would parse 00000-59 to -59 but also 00000159 to 159 etc. The - could be anywhere, or it could simply not be there at all. This is what I did:
DECLARE @my_var VARCHAR(255) = '00000-59', @my_int INT
SET @my_var = STUFF(@my_var, 1, PATINDEX('%[^0]%', @my_var)-1, '')
SET @my_int = CAST(@my_var AS INT)
[^0] in this case means "any character that isn't a 0". So PATINDEX() tells me when the 0's end, regardless of whether that's because of a - or a number.
The second use-case I've had was checking whether an IBAN number was correct. In order to do that, any letters in the IBAN need to be changed to a corresponding number (A=10, B=11, etc...). I did something like this (incomplete but you get the idea):
SET @i = PATINDEX('%[^0-9]%', @IBAN)
WHILE @i <> 0 BEGIN
SET @num = UNICODE(SUBSTRING(@IBAN, @i, 1))-55
SET @IBAN = STUFF(@IBAN, @i, 1, CAST(@num AS VARCHAR(2)))
SET @i = PATINDEX('%[^0-9]%', @IBAN)
END
So again, I'm not concerned with finding exactly the letter A or B etc. I'm just finding anything that isn't a number and converting it.
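For completeness, here is a self-contained sketch of that loop with the declarations filled in. The sample IBAN and the variable sizes are assumptions for illustration, and it only covers the letter-to-number step, not the rest of an IBAN check. It also assumes the value contains nothing but digits and letters; any other character would never turn into digits, so the loop would not terminate.

```sql
-- Sample value, assumed for illustration; digits and letters only
DECLARE @IBAN VARCHAR(64) = 'GB82WEST12345698765432';
DECLARE @num INT;
DECLARE @i INT = PATINDEX('%[^0-9]%', @IBAN);   -- position of the first non-digit

WHILE @i <> 0
BEGIN
    -- 'A' is code 65, so subtracting 55 maps A=10, B=11, ... Z=35
    SET @num = UNICODE(UPPER(SUBSTRING(@IBAN, @i, 1))) - 55;
    -- Swap the single letter for its two-digit number
    SET @IBAN = STUFF(@IBAN, @i, 1, CAST(@num AS VARCHAR(2)));
    SET @i = PATINDEX('%[^0-9]%', @IBAN);
END

SELECT @IBAN;   -- 1611823214282912345698765432
```

The UPPER() call is an addition so that lowercase letters map to the same numbers.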
PATINDEX is roughly equivalent to CHARINDEX except that it returns the position of a pattern (with wildcards) instead of a literal substring. Examples:
Check if a string contains at least one digit:
SELECT PATINDEX('%[0-9]%', 'Hello') -- 0
SELECT PATINDEX('%[0-9]%', 'H3110') -- 2
Extract numeric portion from a string:
SELECT SUBSTRING('12345', PATINDEX('%[0-9]%', '12345'), 100) -- 12345
SELECT SUBSTRING('x2345', PATINDEX('%[0-9]%', 'x2345'), 100) -- 2345
SELECT SUBSTRING('xx345', PATINDEX('%[0-9]%', 'xx345'), 100) -- 345
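To make the difference from CHARINDEX concrete, here is a small side-by-side sketch (the sample strings are made up). CHARINDEX treats its first argument as plain text, so a character class such as [0-9] is only understood by PATINDEX.

```sql
-- Same answer when searching for a plain character
SELECT CHARINDEX('b', '123b');        -- 4
SELECT PATINDEX('%b%', '123b');       -- 4

-- Different answers when searching for "any digit"
SELECT CHARINDEX('[0-9]', 'abc7');    -- 0, looks for the literal text [0-9]
SELECT PATINDEX('%[0-9]%', 'abc7');   -- 4, position of the first digit
```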
Quoted from PATINDEX (Transact-SQL)
The following example uses % and _ wildcards to find the position at
which the pattern 'en', followed by any one character and 'ure' starts
in the specified string (index starts at 1):
SELECT PATINDEX('%en_ure%', 'please ensure the door is locked');
Here is the result set.
8
You'd use the PATINDEX function when you want to know at which character position a pattern begins in an expression of a valid text or character data type.

Uppercase first two characters in a column in a db table

I've got a column in a database table (SQL Server 2005) that contains data like this:
TQ7394
SZ910284
T r1534
su8472
I would like to update this column so that the first two characters are uppercase. I would also like to remove any spaces between the first two characters. So T q1234 would become TQ1234.
The solution should be able to cope with multiple spaces between the first two characters.
Is this possible in T-SQL? How about in ANSI-92? I'm always interested in seeing how this is done in other db's too, so feel free to post answers for PostgreSQL, MySQL, et al.
Here is a solution:
EDIT: Updated to support replacement of multiple spaces between the first and the second non-space characters
/* TEST TABLE */
DECLARE @T AS TABLE(code Varchar(20))
INSERT INTO @T SELECT 'ab1234x1' UNION SELECT ' ab1234x2'
UNION SELECT ' ab1234x3' UNION SELECT 'a b1234x4'
UNION SELECT 'a b1234x5' UNION SELECT 'a b1234x6'
UNION SELECT 'ab 1234x7' UNION SELECT 'ab 1234x8'
SELECT * FROM @T
/* INPUT
code
--------------------
ab1234x3
ab1234x2
a b1234x6
a b1234x5
a b1234x4
ab 1234x8
ab 1234x7
ab1234x1
*/
/* START PROCESSING SECTION */
DECLARE @s Varchar(20)
DECLARE @firstChar INT
DECLARE @secondChar INT
UPDATE @T SET
@firstChar = PATINDEX('%[^ ]%',code)
,@secondChar = @firstChar + PATINDEX('%[^ ]%', STUFF(code,1, @firstChar,'' ) )
,@s = STUFF(
code,
1,
@secondChar,
REPLACE(LEFT(code,
@secondChar
),' ','')
)
,@s = STUFF(
@s,
1,
2,
UPPER(LEFT(@s,2))
)
,code = @s
/* END PROCESSING SECTION */
SELECT * FROM @T
/* OUTPUT
code
--------------------
AB1234x3
AB1234x2
AB1234x6
AB1234x5
AB1234x4
AB 1234x8
AB 1234x7
AB1234x1
*/
UPDATE YourTable
SET YourColumn = UPPER(
SUBSTRING(
REPLACE(YourColumn, ' ', ''), 1, 2
)
)
+
SUBSTRING(YourColumn, 3, LEN(YourColumn))
UPPER isn't going to hurt any numbers, so if the examples you gave are completely representative, there's not really any harm in doing:
UPDATE tbl
SET col = REPLACE(UPPER(col), ' ', '')
The sample data only has spaces and lowercase letters at the start. If this holds true for the real data then simply:
UPPER(REPLACE(YourColumn, ' ', ''))
For a more specific answer I'd politely ask you to expand on your spec, otherwise I'd have to code around all the other possibilities (e.g. values of less than three characters) without knowing if I was overengineering my solution to handle data that wouldn't actually arise in reality :)
As ever, once you've fixed the data, put in a database constraint to ensure the bad data does not reoccur e.g.
ALTER TABLE YourTable ADD
CONSTRAINT YourColumn__char_pos_1_uppercase_letter
CHECK (ASCII(SUBSTRING(YourColumn, 1, 1)) BETWEEN ASCII('A') AND ASCII('Z'));
ALTER TABLE YourTable ADD
CONSTRAINT YourColumn__char_pos_2_uppercase_letter
CHECK (ASCII(SUBSTRING(YourColumn, 2, 1)) BETWEEN ASCII('A') AND ASCII('Z'));
@huo73: yours doesn't work for me on SQL Server 2008: I get 'TRr1534' instead of 'TR1534'.
update Table set Column = case when len(rtrim(substring (Column , 1 , 2))) < 2
then UPPER(substring (Column , 1 , 1) + substring (Column , 3 , 1)) + substring(Column , 4, len(Column))
else UPPER(substring (Column , 1 , 2)) + substring(Column , 3, len(Column)) end
This works on the fact that if there is a space then the trim of that part of string would yield length less than 2 so we split the string in three and use upper on the 1st and 3rd char. In all other cases we can split the string in 2 parts and use upper to make the first two chars to upper case.
If you are doing an UPDATE, I would do it in 2 steps; first get rid of the space (RTRIM on a SUBSTRING), and second do the UPPER on the first 2 chars:
-- uses a fixed column length - 20-odd in this case
UPDATE FOO
SET bar = RTRIM(SUBSTRING(bar, 1, 2)) + SUBSTRING(bar, 3, 20)
UPDATE FOO
SET bar = UPPER(SUBSTRING(bar, 1, 2)) + SUBSTRING(bar, 3, 20)
If you need it in a SELECT (i.e. inline), then I'd be tempted to write a scalar UDF
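A rough sketch of what such a scalar UDF might look like. The function name dbo.NormalizeCodePrefix and the VARCHAR(50) size are assumptions, not anything from the question. It also strips leading spaces, which goes slightly beyond what was asked, and it avoids DECLARE with an initial value so that it should also be accepted by SQL Server 2005.

```sql
-- Hypothetical helper: uppercase the first two non-space characters
-- and remove any spaces in front of or between them
CREATE FUNCTION dbo.NormalizeCodePrefix (@code VARCHAR(50))
RETURNS VARCHAR(50)
AS
BEGIN
    DECLARE @s VARCHAR(50), @rest VARCHAR(50);
    -- Drop leading spaces so the first character is a real one
    SET @s = LTRIM(@code);
    -- Everything after the first character, minus the spaces that follow it
    SET @rest = LTRIM(SUBSTRING(@s, 2, 50));
    RETURN UPPER(LEFT(@s, 1)) + UPPER(LEFT(@rest, 1)) + SUBSTRING(@rest, 2, 50);
END
GO
SELECT dbo.NormalizeCodePrefix('T  r1534');   -- TR1534
SELECT dbo.NormalizeCodePrefix('su8472');     -- SU8472
```

Scalar UDFs are evaluated row by row, so for a one-off UPDATE over a big table the inline expressions above are likely to be quicker.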