SQL Server extract integers from string using regular expression

I have a string (a UNC file path) from which I need to extract some integers that are embedded in the string in a semi-predictable way.
Example strings:
\\servername\folder1\FTP\folder2\512/862450_FileBundle.zip
--OR-- : \\servername\folder1\FTP\folder2\512\862450_FileBundle.zip
--OR-- : servername/folder1/FTP/folder2/512/862450_FileBundle.zip
The following regular expression will match any integer value that is bounded by a forward slash or backslash: (\/|\\)\d+(\/|\\)
So the regex above would match "\512\", "\512/", "/512/", or even "/512\".
I have tried the following SQL and other variations without success:
DECLARE @testString varchar(50) = '\\servername\folder1\FTP\folder2\512/862450_FileBundle.zip'
SELECT PATINDEX('%(\/|\\)\d+(\/|\\)%', @testString)
I'm not terribly familiar with REGEX and SQL so I'm not even sure this is possible.

SQL Server's built-in pattern matching is not as powerful as regular expressions. You can search for the pattern:
[/\\][0-9]%[/\\]
That is, a slash followed by a digit, followed by any string, followed by a slash. This will match any characters after the first digit, but your examples have nothing of the form /1abc/.
If this is sufficient, then this does the trick:
select v.*,
left(v2.str2, patindex('%[/\\]%', v2.str2) - 1)
from (values ('\\servername\folder1\FTP\folder2\512/862450_FileBundle.zip')) v(str) cross apply
(values (stuff(v.str, 1, patindex('%[/\\][0-9]%[/\\]%', v.str), ''))) v2(str2)
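If the "nothing of the form /1abc/" assumption does not hold, one possible guard is to check that the extracted segment contains only digits. The following is a minimal sketch of that idea, not part of the original answer; the third sample path and the names seg, token and extracted_int are made up for illustration.

```sql
-- Sketch: same extraction as above, plus a digits-only check on the result.
-- The third path is a hypothetical example with a folder named '1abc'.
SELECT v.str,
       CASE WHEN seg.token NOT LIKE '%[^0-9]%' THEN seg.token END AS extracted_int
FROM (VALUES ('\\servername\folder1\FTP\folder2\512/862450_FileBundle.zip'),
             ('servername/folder1/FTP/folder2/512/862450_FileBundle.zip'),
             ('\\servername\folder1\1abc\file.zip')) v(str)
-- Remove everything up to and including the slash that precedes the first digit
CROSS APPLY (VALUES (STUFF(v.str, 1, PATINDEX('%[/\\][0-9]%[/\\]%', v.str), ''))) v2(str2)
-- Keep the text before the next slash
CROSS APPLY (VALUES (LEFT(v2.str2, PATINDEX('%[/\\]%', v2.str2) - 1))) seg(token);
```

With these three rows the first two should yield 512 and the third NULL. Note that only the first slash-plus-digit segment is examined, so a path where a folder like \1abc\ comes before \512\ would also yield NULL, and a value with no slash at all would still raise an error in LEFT.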

Other than writing a UDF to loop through the characters, the only thing I can think of is brute force approach...
(The User Defined Function might be your least worst option.)
https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=face1befe5e7c74f457846fc37eca649
SELECT
*,
SUBSTRING(test.unc_file_path, headMatch.pos+1, headMatch.chars)
FROM
test
OUTER APPLY
(
SELECT
MIN(pos), MIN(chars)
FROM
(
SELECT
PATINDEX('%' + head + body + tail + '%', test.unc_file_path) AS pos, chars
FROM
(
SELECT '\'
UNION ALL SELECT '/'
)
head(head)
CROSS JOIN
(
SELECT 1, '[0-9]'
UNION ALL SELECT 2, '[0-9][0-9]'
UNION ALL SELECT 3, '[0-9][0-9][0-9]'
UNION ALL SELECT 4, '[0-9][0-9][0-9][0-9]'
UNION ALL SELECT 5, '[0-9][0-9][0-9][0-9][0-9]'
)
body(chars, body)
CROSS JOIN
(
SELECT '\'
UNION ALL SELECT '/'
)
tail(tail)
)
match
WHERE
pos > 0
)
headMatch(pos, chars)

Related

Is it possible to find the first occurrence of a string that's NOT within a set of delimiters in SQL Server 2016+?

I have a column in a SQL Server table that has strings of varying lengths. I need to find the position of the first occurrence of the string , -- that's not enclosed in single quotes or square brackets.
For example, in the following two strings, I've bolded the portion I would like to get the position of. Notice that in the first string, the first time , -- appears on its own (without being between single quote or square bracket delimiters) is at position 14, and in the second string it's at position 18.
'a, --'[, --]**, --**[, --]
[a, --b]aaaaaaa_ **, --**', --'
Also I should mention that , -- itself could appear multiple times in the string.
Here's a simple query that shows the strings and my desired output.
SELECT
t.string, t.desired_pos
FROM
(VALUES (N'''a, --''[, --], --[, --]', 14),
(N'[a, --b]aaaaaaa_ , --'', --''', 18)) t(string, desired_pos)
Is there any way to accomplish this using a SELECT query (or multiple) without using a function?
Thank you in advance!
I've tried variations of SUBSTRING, CHARINDEX, and even some CROSS APPLYs but I can't seem to get the result I'm looking for.

Before I write down my solution, I must warn you: DON'T USE IT. Use a function, or do this in some other language. This code is probably buggy.
It doesn't handle things like escaped quotes, etc.
The idea is to first remove the content inside brackets [] and quotes '' and then just do a "simple" charindex.
To remove the delimited content, I'm using a recursive CTE that loops over every pair of matching delimiters and replaces their content with placeholder characters.
One important point is that quotes and brackets might be embedded in each other, so you have to try both variants and choose the one that starts earliest.
WITH CTE AS (
SELECT *
FROM
(VALUES (N'''a, --''[, --], --[, --]', 14),
(N'[a, --b]aaaaaaa_ , --'', --''', 18)) t(string, desired_pos)
)
, cte2 AS (
select x.start
, x.finish
, case when x.start > 0 THEN STUFF(string, x.start, x.finish - x.start + 1, REPLICATE('a', x.finish - x.start + 1)) ELSE string END AS newString
, 1 as level
, string as orig
, desired_pos
from cte
CROSS APPLY (
SELECT *
, ROW_NUMBER() OVER(ORDER BY case when start > 0 THEN 0 ELSE 1 END, start) AS sortorder
FROM (
SELECT charindex('[', string) AS start
, charindex(']', string) AS finish
UNION ALL
SELECT charindex('''', string) AS startQ
, charindex('''', string, charindex('''', string) + 1) AS finishQ
) x
) x
WHERE x.sortorder = 1
UNION ALL
select x.start
, x.finish
, STUFF(newString, x.start, x.finish - x.start + 1, REPLICATE('a', x.finish - x.start + 1))
, level + 1 as level -- increment so the final "order by level desc" picks the last iteration
, orig
, desired_pos
from cte2
CROSS APPLY (
SELECT *
, ROW_NUMBER() OVER(ORDER BY case when start > 0 THEN 0 ELSE 1 END, start) AS sortorder
FROM (
SELECT charindex('[', newString) AS start
, charindex(']', newString) AS finish
UNION ALL
SELECT charindex('''', newString) AS startQ
, charindex('''', newString, charindex('''', newString) + 1) AS finishQ
) x
) x
WHERE x.sortorder = 1
AND x.start > 0
AND cte2.start > 0 -- Must have been a match
)
SELECT PATINDEX('%, --%', newString), *
from (
select *, row_number() over(partition by orig order by level desc) AS sort
from cte2
) x
where x.sort = 1

Try this approach: replace the delimited substrings you don't want with a placeholder string of the same length, then look for the position of the string you are interested in.
SELECT string, desired_pos,
CHARINDEX(', --', REPLACE(REPLACE(string, ''', --''', '******'), '[, --]', '******')
) start_index
FROM (VALUES (N''', --''[, --], --[, --]', 13),
(N'[, --]aaaaaaa_ , --'', --''', 16)) t(string, desired_pos)

I don't know if a C# solution makes sense here, but this class for CSV is a nice little parser: TextFieldParser
Then you just define the delimiters etc., and assuming the input is escaped consistently, all is good.

I'm late to the game here, but this kind of thing is simple in SQL Server when leveraging NGrams8k. You don't need regex, a CLR, or C#. Furthermore, NGrams8k will be the fastest by far; in 8 years nobody has produced anything remotely as fast. This code will also be faster and far less complex than a recursive CTE solution (which is almost always slow in SQL Server).
;--==== Sample Data
DECLARE @T Table (String VARCHAR(100))
INSERT @T
VALUES (N'''a, --''[, --], --[, --]'),
(N'[a, --b]aaaaaaa_ , --'', --''');
;--==== Solution
SELECT
t.String, ng.Position
FROM @T AS t
CROSS APPLY (VALUES(REPLACE(t.String,'[',CHAR(1)))) AS f(S)
CROSS APPLY samd.NGrams8k(f.S,4) AS ng
CROSS APPLY (VALUES(SUBSTRING(f.S,ng.Position-2,7))) AS g(String)
WHERE ng.Token = ', --'
AND g.String NOT LIKE '%''%''%'
AND g.String NOT LIKE '%'+CHAR(1)+'%]%';
Results:
String                        Position
----------------------------- --------------------
'a, --'[, --], --[, --]       14
[a, --b]aaaaaaa_ , --', --'   18

Remove Characters in a String in SQL

I have a column u_manualdoc which contains values like CGY DR# 7405. I want to remove the CGY DR#.
Here's the code:
select u_manualdoc, cardcode, cardname from ODLN
I want only the 7405 number. Thanks!

Try this:
--sample data you provided in comments
declare @tbl table(codes varchar(20))
insert into @tbl values
('CGY PST - 58277') , ('CGY RMC PST # 58083'), ('CGY DR # 7443'), ('CSI # 1304'), ('PO# 0568 , 0570'), ('CGY DR# 7446')
--actual query that you can apply to your table
select SUBSTRING(codes, PATINDEX('%[0-9]%', codes), len(codes)) from @tbl
The key point here is to use patindex, which searches for a pattern and returns the index where that pattern first occurs. I specified %[0-9]%, which matches any digit, so it returns the position of the first digit. That position is the starting point we pass to substring. The third parameter of substring is the length; since we want the rest of the string, passing len(codes) makes sure that we get all of it :)
Applying to your naming:
select SUBSTRING(u_manualdoc, PATINDEX('%[0-9]%', u_manualdoc), len(u_manualdoc)),
cardcode,
cardname
from ODLN
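The query above returns everything from the first digit onward, so a value like 'PO# 0568 , 0570' comes back as '0568 , 0570'. If only the first run of digits is wanted, one possible variation is sketched below. It is an assumption about the requirement, not part of the original answer; the 'NO DIGITS' row and the names rest and first_number are made up for illustration.

```sql
-- Sketch: keep only the first run of digits in each value.
DECLARE @tbl TABLE (codes varchar(20));
INSERT INTO @tbl VALUES
    ('CGY PST - 58277'), ('PO# 0568 , 0570'), ('CGY DR# 7446'), ('NO DIGITS');

SELECT t.codes,
       -- Appending 'x' guarantees a non-digit exists, so PATINDEX never returns 0 here
       LEFT(r.rest, PATINDEX('%[^0-9]%', r.rest + 'x') - 1) AS first_number
FROM @tbl AS t
-- NULLIF turns "no digit found" (0) into NULL, which propagates to a NULL result
CROSS APPLY (VALUES (SUBSTRING(t.codes,
                               NULLIF(PATINDEX('%[0-9]%', t.codes), 0),
                               LEN(t.codes)))) AS r(rest);
```

For the four sample rows this should give 58277, 0568, 7446 and NULL respectively.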

You should use the string functions charindex, len and substring to get it.
See the code below.
select SUBSTRING(u_manualdoc,CHARINDEX('#',u_manualdoc)+1,LEN(u_manualdoc)- CHARINDEX('#',u_manualdoc))
from ODLN

EDIT
In addition to the other answers, you can use this simple method:
select
substring(
u_manualdoc,
len(u_manualdoc) - patindex('%[^0-9]%', reverse(u_manualdoc)) + 2,
len(u_manualdoc)
),
cardcode, cardname
from ODLN
In this example, patindex finds the first non-digit (as specified by [^0-9]) from the right side of the string, and that position is used to compute the starting point of the substring.
This will work on all of your sample strings (including 'PO# 0568 , 0570 CGY DR# 7446').
Or use SQL Server Regex, which lets you use more powerful regular expressions within your queries.
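One edge case of the reverse/patindex method is a value made up entirely of digits: patindex then returns 0 and the computed start position falls past the end of the string. A minimal sketch of one way around that, assuming you want the trailing run of digits, is to append a non-digit sentinel to the reversed string and use RIGHT. The sample values and the column name trailing_digits are made up for illustration.

```sql
-- Sketch: trailing digits via REVERSE, with a sentinel so all-digit values work too.
SELECT s.val,
       -- 'x' is appended after reversing, so a non-digit is always found
       RIGHT(s.val, PATINDEX('%[^0-9]%', REVERSE(s.val) + 'x') - 1) AS trailing_digits
FROM (VALUES ('CGY DR# 7405'), ('7405'), ('AB#')) AS s(val);
```

The first two rows should return 7405, and the last row an empty string because it does not end in a digit.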

TRY THIS
DECLARE @table TABLE(DirtyCol VARCHAR(100));
INSERT INTO @table
VALUES('AB ABCDE # 123'), ('ABCDE# 123'), ('AB: ABC# 123 AB: ABC# 123'), ('AB#'), ('AB # 1 000 000'), ('AB # 1`234`567'), ('AB # (9)(876)(543)');
WITH tally
AS (
SELECT TOP (100) N = ROW_NUMBER() OVER(ORDER BY @@spid)
FROM sys.all_columns),
data
AS (
SELECT DirtyCol,
Col
FROM @table
CROSS APPLY
(
SELECT
(
SELECT C+''
FROM
(
SELECT N,
SUBSTRING(DirtyCol, N, 1) C
FROM tally
WHERE N <= DATALENGTH(DirtyCol)
) [1]
WHERE C BETWEEN '0' AND '9'
ORDER BY N FOR XML PATH('')
)
) p(Col)
WHERE p.Col IS NOT NULL)
SELECT DirtyCol,
CAST(Col AS INT) IntCol
FROM data;

I want to remove part of string from a string

Thank you in advance.
I want to remove the part of the string after the ., including the . itself, but the string can be of any length.
1)Example:
Input:- SCC0204.X and FRK0005.X and RF0023.X and ADF1010.A and HGT9010.V
Output: SCC0204 and FRK0005 and RF0023 and ADF1010.A and HGT9010.V
I tried using charindex, but as the length keeps changing I wasn't able to do it. I only want to trim the values ending with X.
Any help will be greatly appreciated.

Assuming there is only one dot
UPDATE your_table
SET column_name = left(column_name, charindex('.', column_name) - 1)
For SELECT
select left(column_name, charindex('.', column_name) - 1) AS col
from your_table
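Note that LEFT raises an error when charindex returns 0, that is, when a value contains no dot at all. A common way to guard against that is to append the delimiter to the string being searched. The following is a minimal sketch with made-up sample values; like the statements above, it cuts at the first dot regardless of what follows it.

```sql
-- Sketch: appending '.' guarantees CHARINDEX finds a dot, so LEFT never gets -1.
SELECT c.code,
       LEFT(c.code, CHARINDEX('.', c.code + '.') - 1) AS before_dot
FROM (VALUES ('SCC0204.X'), ('FRK0005'), ('')) AS c(code);
```

Here 'SCC0204.X' should become 'SCC0204', while 'FRK0005' and the empty string come back unchanged.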

Hope this helps. The code only trims the string when the value has a dot "." in it and ends with .X
;WITH cte_TestData(Code) AS
(
SELECT 'SCC0204.X' UNION ALL
SELECT 'FRK0005.X' UNION ALL
SELECT 'RF0023.X' UNION ALL
SELECT 'ADF1010.A' UNION ALL
SELECT 'HGT9010.V' UNION ALL
SELECT 'SCC0204' UNION ALL
SELECT 'FRK0005'
)
SELECT CASE
WHEN CHARINDEX('.', Code) > 0 AND RIGHT(Code,2) = '.X'
THEN SUBSTRING(Code, 1, CHARINDEX('.', Code) - 1)
ELSE Code
END
FROM cte_TestData
If the only criterion is to remove .X, then this should probably also work
;WITH cte_TestData(Code) AS
(
SELECT 'SCC0204.X' UNION ALL
SELECT 'FRK0005.X' UNION ALL
SELECT 'RF0023.X' UNION ALL
SELECT 'ADF1010.A' UNION ALL
SELECT 'HGT9010.V' UNION ALL
SELECT 'SCC0204' UNION ALL
SELECT 'FRK0005'
)
SELECT REPLACE (Code,'.X','')
FROM cte_TestData

Use LEFT String function :
DECLARE @String VARCHAR(100) = 'SCC0204.XXXXX'
SELECT LEFT(@String,CHARINDEX('.', @String) - 1)

I think your best bet here is to create a function that parses the string and uses regex. I hope this old post helps:
Perform regex (replace) in an SQL query
However, if the value you need to trim is constantly ".X", then you should use
select replace(string, '.x', '')

Please check the below code. I think this will help you.
DECLARE @String VARCHAR(100) = 'SCC0204.X'
IF (SELECT RIGHT(@String,2)) ='.X'
SELECT LEFT(@String,CHARINDEX('.', @String) - 1)
ELSE
SELECT @String

Update: I just missed one of the comments where the OP clarifies the requirement. What I put together below is how you would deal with a requirement to remove everything after the first dot on strings ending with X. I leave this here for reference.
;WITH cte_TestData(Code) AS
(
SELECT 'SCC0204.X' UNION ALL -- ends with '.X'
SELECT 'FRK.000.X' UNION ALL -- ends with '.X', contains multiple dots
SELECT 'RF0023.AX' UNION ALL -- ends with '.AX'
SELECT 'ADF1010.A' UNION ALL -- ends with '.A'
SELECT 'HGT9010.V' UNION ALL -- ends with '.V'
SELECT 'SCC0204.XF' UNION ALL -- ends with '.XF'
SELECT 'FRK0005' UNION ALL -- totally clean
SELECT 'ABCX' -- ends with 'X', not dots
)
SELECT
orig_string = code,
newstring =
SUBSTRING
(
code, 1,
CASE
WHEN code LIKE '%X'
THEN ISNULL(NULLIF(CHARINDEX('.',code)-1, -1), LEN(code))
ELSE LEN(code)
END
)
FROM cte_TestData;
FYI: on SQL Server 2012+ you could simplify this code like this:
SELECT
orig_string = code,
newstring =
SUBSTRING(code, 1,IIF(code LIKE '%X', ISNULL(NULLIF(CHARINDEX('.',code)-1, -1), LEN(code)), LEN(code)))
FROM cte_TestData;

With SUBSTRING you can achieve your requirements with the code below.
SELECT SUBSTRING(column_name, 0, CHARINDEX('.', column_name)) AS col
FROM your_table
If you want to remove a fixed .X from the string, you can also use the REPLACE function.
SELECT REPLACE(column_name, '.X', '') AS col
FROM your_table

How to group through a string part?

I've a table which contains logs from a web portal, it contains the url visited, the request duration, the referer...
One of these columns is the path info and contains strings like following:
/admin/
/export/
/project2/
/project1/news
/project1/users
/user/id/1
/user/id/1/history
/user/id/2
/forum/topic/14/post/456
I would like to calculate some stats based on this column with SQL queries, so I would like to know how I can aggregate on the first part of the path info.
It would let me count the number of URLs starting with /admin/, /export/, /project1/, /project2/, /user/, /forum/, ...
Doing it in a programming language would be easy with regex, but I read that no similar function exists in SQL Server.

I would use CHARINDEX() to find the first occurrence of "/" AFTER the leading '/' character, so anything after that second slash is stripped off.
select
LEFT( pathInfo, CHARINDEX( '/', pathInfo, 2 )) as RootLevelPath,
count(*) as Hits
from
temp
group by
LEFT( pathInfo, CHARINDEX( '/', pathInfo, 2 ))
Working result from SQLFiddle
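The query above assumes every path has a second slash. For a path such as /admin (no trailing slash), CHARINDEX returns 0 and the group key becomes an empty string. One possible guard, sketched here with made-up sample rows and a hypothetical alias r, is to append a slash before searching and cutting, which also puts /admin and /admin/ in the same group.

```sql
-- Sketch: normalise each path with a trailing '/' so a second slash always exists.
SELECT r.RootLevelPath,
       COUNT(*) AS Hits
FROM (VALUES ('/admin/'), ('/admin'), ('/project1/news'),
             ('/project1/users'), ('/user/id/1')) AS p(pathInfo)
CROSS APPLY (VALUES (LEFT(p.pathInfo + '/',
                          CHARINDEX('/', p.pathInfo + '/', 2)))) AS r(RootLevelPath)
GROUP BY r.RootLevelPath;
```

With these rows the groups should be /admin/ (2 hits), /project1/ (2 hits) and /user/ (1 hit).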

DRapp's is perfect for grouping on the first fragment of the URL. If you need to group by other levels it might get unwieldy to manage the nested LEFT/CHARINDEX statements.
Here's one way to group by a parameterized level:
declare @t table (pathId int identity(1,1) primary key, somePath varchar(100));
insert into @t
select '/admin/' union all
select '/export/' union all
select '/project2/' union all
select '/project1/news' union all
select '/project1/users' union all
select '/user/id/1' union all
select '/user/id/1/history' union all
select '/user/id/2' union all
select '/forum/topic/14/post/456' union all
select '/forum/topic/14/post/789' union all
select '/forum/topic/14/post/789'
declare @level int = 1;
;with fragments as
( select pathId,
[n] = x.query('.'),
[Fragment] = x.value('.', 'varchar(100)')
from ( select PathId,
cast('<r>' + replace(stuff(somePath, 1, 1, ''), '/', '</r><r>') + '</r>' as xml)
.query('r[position()<=sql:variable("@level")]')
from @t
) d (PathId, X)
)
select count(*), [path] = max(r.v)
from fragments f
cross
apply ( select '/' + p.n.value('.', 'varchar(100)')
from fragments
cross
apply n.nodes('r')p(n)
where PathId = f.PathId
for xml path('')
) r(v)
group
by fragment;

Can the Select list in a SQL Statement use Regular Expressions

I have a SQL statement,
select ColumnName from Table
And I get this result,
Error 192.168.1.67 UserName 0bce6c62-1efb-416d-bce5-71c3c8247b75 An existing ....
So anyway the field has a lot of stuff in it, I just want to get out the 'UserName'.
Can I use a regex for that?
I mean it would be kind of like this,
select SUBSTRING(ColumnName, 0, 5) from Table
Except the SUBSTRING would be replaced with a regex of some kind. I am comfortable with regex, but I am not sure how to apply it in this case, or even if you can.
If I could get this working it would be great because I plan to pull the data into a temporary table, and do some quite complicated things matching it with other tables etc. If I can get this all working it would save me writing a C# app to do it with.
Thanks.

No, out of the box, SQL Server doesn't support regular expressions.
You could retrofit those by means of a SQL-CLR assembly that you deploy into SQL Server.

I think you should use SUBSTRING anyway. Regular expressions are more flexible but also carry a large processing overhead, which becomes even worse if you have to process large recordsets.
You have to decide whether you need that flexibility in the first place.
If so you should read about it here:
http://msdn.microsoft.com/en-us/magazine/cc163473.aspx
Using T-SQL only can look like that:
SELECT 'Error 192.168.1.67 XUserNameX 0bce6c62-1efb-416d-bce5-71c3c8247b75 An existing' expr
INTO log_table
GO
WITH
split1 (expr, cstart, cend)
AS (
SELECT
expr, 1, 0
FROM
log_table a
), split2 (expr, cstart, cend, div)
AS (
SELECT
a.expr, a.cend + 1, CHARINDEX(' ', a.expr, a.cend + 1), 1
FROM
split1 a
UNION ALL
SELECT
a.expr, a.cend + 1, CHARINDEX(' ', a.expr, a.cend + 1), div+1
FROM
split2 a
WHERE
a.cend > 1
), substrings(expr, div)
AS (
SELECT
SUBSTRING(expr, cstart, cend - cstart), div
FROM
split2
)
SELECT expr from
substrings a
where
a.div = 3

UPDATE
"we cannot tell where the start of the username is. Unless we can say 'find me the start character after the second space'"
That is fairly straightforward:
Filter out strings that have fewer than two spaces (alternatively, keep strings that have three or more words);
Find the position after the first space (alternatively, the beginning of the second word);
Find the position after the first space after the first space (alternatively, the beginning of the third word);
Determine the length of the third word using the position of the next space (or the end of the string if there are only three words);
Use the above values with the SUBSTRING() function to return the third word.
Example:
WITH MyTable (ColumnName)
AS
(
SELECT NULL
UNION ALL
SELECT ''
UNION ALL
SELECT 'One.'
UNION ALL
SELECT 'Two words.'
UNION ALL
SELECT 'Three word sentence.'
UNION ALL
SELECT 'Sentence containing four words.'
UNION ALL
SELECT 'Five words in this sentence.'
UNION ALL
SELECT 'Sentence containing more than five words.'
),
AtLeastThreeWords (ColumnName, pos_word_2_start)
AS
(
SELECT M1.ColumnName, CHARINDEX(' ', M1.ColumnName) + LEN(' ') + 1
FROM MyTable AS M1
WHERE LEN(M1.ColumnName) - LEN(REPLACE(M1.ColumnName, ' ', '')) >= 2
),
MyTable2 (ColumnName, pos_word_3_start)
AS
(
SELECT M1.ColumnName,
CHARINDEX(' ', M1.ColumnName, pos_word_2_start) + LEN(' ') + 1
FROM AtLeastThreeWords AS M1
),
MyTable3 (ColumnName, pos_word_3_start, pos_word_3_end)
AS
(
SELECT M1.ColumnName, M1.pos_word_3_start,
CHARINDEX(' ', M1.ColumnName, pos_word_3_start) + LEN(' ')
FROM MyTable2 AS M1
),
MyTable4 (ColumnName, pos_word_3_start, word_3_length)
AS
(
SELECT M1.ColumnName, M1.pos_word_3_start,
CASE
WHEN pos_word_3_start < pos_word_3_end
THEN pos_word_3_end - pos_word_3_start
ELSE LEN(M1.ColumnName) - pos_word_3_start + 1
END
FROM MyTable3 AS M1
)
SELECT M1.ColumnName,
SUBSTRING(M1.ColumnName, pos_word_3_start, word_3_length)
AS word_3
FROM MyTable4 AS M1;
ORIGINAL ANSWER:
Is the problem that the position and/or length of the username value may not be constant in the data but always follows the string 'username '? If so, you can use CHARINDEX with SUBSTRING e.g.
WITH MyTable (ColumnName)
AS
(
SELECT 'Error 192.168.1.67 UserName 0bce6c62-1efb-416d-bce5-71c3c8247b75 An existing ....'
UNION ALL
SELECT 'Username onedaywhen is invalid'
),
MyTable1 (ColumnName, pos1)
AS
(
SELECT M1.ColumnName, CHARINDEX('UserName ', M1.ColumnName) + LEN('UserName ') + 1
FROM MyTable AS M1
),
MyTable2 (ColumnName, pos1, pos2)
AS
(
SELECT M1.ColumnName, M1.pos1,
CHARINDEX(' ', M1.ColumnName, pos1) - M1.pos1
FROM MyTable1 AS M1
)
SELECT SUBSTRING(M1.ColumnName, M1.pos1, M1.pos2)
FROM MyTable2 AS M1;
...though you'd need to make it more robust e.g. when there is no trailing space after the username value etc.
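As one possible way to cover those two cases, the sketch below appends a space before searching for the end of the value and uses NULLIF so that rows without the marker return NULL. The second and third sample rows and the names p and user_name_value are made up for illustration.

```sql
-- Sketch: tolerate a value at the very end of the string and rows without 'UserName '.
WITH MyTable (ColumnName) AS
(
    SELECT 'Error 192.168.1.67 UserName 0bce6c62-1efb-416d-bce5-71c3c8247b75 An existing ....'
    UNION ALL SELECT 'Invalid UserName onedaywhen'   -- no trailing space after the value
    UNION ALL SELECT 'No marker in this row'         -- marker missing entirely
)
SELECT M.ColumnName,
       SUBSTRING(M.ColumnName,
                 p.pos1,
                 -- searching in ColumnName + ' ' guarantees a closing space is found
                 CHARINDEX(' ', M.ColumnName + ' ', p.pos1) - p.pos1) AS user_name_value
FROM MyTable AS M
-- LEN ignores trailing spaces, so LEN('UserName ') is 8 and the extra + 1 skips the space
CROSS APPLY (VALUES (NULLIF(CHARINDEX('UserName ', M.ColumnName), 0)
                     + LEN('UserName ') + 1)) AS p(pos1);
```

The three rows should return the GUID, onedaywhen and NULL respectively.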
