Splitting a String by character and parsing it into multiple columns in another table - sql

I am looking to take a directory path string and parse information out of it into existing columns on another table. This is for the purpose of creating a staging table for reporting. It will be parsing many directory paths if the ProjectName is applicable to the change in structure.
Data Example:
Table1_Column1
ProjectName\123456_ProjectShortName\Release_1\Iteration\etc
Expected Output:
Table2_Column1, Table2_Column2
123456 ProjectShortName
I've figured out how to parse some strings by character, but it seems a bit clunky and inefficient. Is there a better structure to go about this? To add some more to it, this is just one column I need to manipulate before shifting it over; three other columns are being moved directly to the staging table based on the ProjectName.
Is it better to just create a UDF to split then call it within the job that will move the data or is there another way?

Here's a method without a UDF.
It uses charindex and substring to get the parts from that path string.
An example using a table variable:
declare @T table (Table1_Column1 varchar(100));
insert into @T values
('ProjectName\123456_ProjectShortName\Release_1\Iteration\etc'),
('OtherProjectName\789012_OtherProjectShortName\Release_2\Iteration\xxx');
select
case
when FirstBackslashPos > 0 and FirstUnderscorePos > 0
then substring(Col1,FirstBackslashPos+1,FirstUnderscorePos-FirstBackslashPos-1)
end as Table1_Column1,
case
when FirstUnderscorePos > 0 and SecondBackslashPos > 0
then substring(Col1,FirstUnderscorePos+1,SecondBackslashPos-FirstUnderscorePos-1)
end as Table1_Column2
from (
select
Table1_Column1 as Col1,
charindex('\',Table1_Column1) as FirstBackslashPos,
charindex('_',Table1_Column1) as FirstUnderscorePos,
charindex('\',Table1_Column1,charindex('\',Table1_Column1)+1) as SecondBackslashPos
from @T
) q;
If you want to calculate only one into a variable:
declare @ProjectPath varchar(100);
set @ProjectPath = 'ProjectName\123456_ProjectShortName\Release_1\Iteration\etc';
declare @FirstBackslashPos int = charindex('\',@ProjectPath);
declare @FirstUnderscorePos int = charindex('_',@ProjectPath,@FirstBackslashPos);
declare @SecondBackslashPos int = charindex('\',@ProjectPath,@FirstBackslashPos+1);
declare @ProjectNumber varchar(30) = case when @FirstBackslashPos > 0 and @FirstUnderscorePos > 0 then substring(@ProjectPath,@FirstBackslashPos+1,@FirstUnderscorePos-@FirstBackslashPos-1) end;
declare @ProjectShortName varchar(30) = case when @FirstUnderscorePos > 0 and @SecondBackslashPos > 0 then substring(@ProjectPath,@FirstUnderscorePos+1,@SecondBackslashPos-@FirstUnderscorePos-1) end;
select @ProjectNumber as ProjectNumber, @ProjectShortName as ProjectShortName;
But i.m.h.o. it might be worth the effort to add some CLR that brings true regex matching to SQL Server, since CHARINDEX and PATINDEX are not as flexible as regex.
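Since the goal in the question is to load a staging table, the same charindex/substring idea can be sketched as a single set-based insert. This is only a sketch under assumptions: the table variables @Source and @Staging and their column names are made up for illustration and stand in for the real tables, and rows that do not follow the Name\Number_ShortName\... layout are simply skipped.

```sql
-- Hypothetical source and staging tables; names and sizes are assumptions.
declare @Source table (ProjectPath varchar(100));
declare @Staging table (ProjectNumber varchar(30), ProjectShortName varchar(30));

insert into @Source values
('ProjectName\123456_ProjectShortName\Release_1\Iteration\etc'),
('OtherProjectName\789012_OtherProjectShortName\Release_2\Iteration\xxx');

insert into @Staging (ProjectNumber, ProjectShortName)
select v.ProjectNumber, v.ProjectShortName
from @Source s
cross apply (select charindex('\', s.ProjectPath) as Pos) b1              -- first backslash
cross apply (select charindex('_', s.ProjectPath, b1.Pos + 1) as Pos) u1  -- first underscore after it
cross apply (select charindex('\', s.ProjectPath, b1.Pos + 1) as Pos) b2  -- second backslash
cross apply (
    -- The CASE guards keep substring from ever receiving a negative length.
    select
        case when b1.Pos > 0 and u1.Pos > 0 and b2.Pos > u1.Pos
             then substring(s.ProjectPath, b1.Pos + 1, u1.Pos - b1.Pos - 1) end as ProjectNumber,
        case when b1.Pos > 0 and u1.Pos > 0 and b2.Pos > u1.Pos
             then substring(s.ProjectPath, u1.Pos + 1, b2.Pos - u1.Pos - 1) end as ProjectShortName
) v
where v.ProjectNumber is not null;  -- skip paths that do not match the layout

select ProjectNumber, ProjectShortName from @Staging;
```

Computing each position once in a cross apply keeps the expressions readable, and the other columns that are moved over unchanged could be added to the same insert.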

The following is a SUPER fast Parser but it is limited to 8K bytes. Notice the Returned Sequence Number... Perhaps you can key off of that, because I am still not clear on the logic for why column1 is 123456 and not ProjectName.
Declare @String varchar(max) = 'ProjectName\123456_ProjectShortName\Release_1\Iteration\etc'
Select * from [dbo].[udf-Str-Parse-8K](@String,'\')
Returns
RetSeq RetVal
1 ProjectName
2 123456_ProjectShortName
3 Release_1
4 Iteration
5 etc
The UDF if needed
CREATE FUNCTION [dbo].[udf-Str-Parse-8K] (@String varchar(max),@Delimiter varchar(10))
Returns Table
As
Return (
with cte1(N) As (Select 1 From (Values(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) N(N)),
cte2(N) As (Select Top (IsNull(DataLength(@String),0)) Row_Number() over (Order By (Select NULL)) From (Select N=1 From cte1 a,cte1 b,cte1 c,cte1 d) A ),
cte3(N) As (Select 1 Union All Select t.N+DataLength(@Delimiter) From cte2 t Where Substring(@String,t.N,DataLength(@Delimiter)) = @Delimiter),
cte4(N,L) As (Select S.N,IsNull(NullIf(CharIndex(@Delimiter,@String,s.N),0)-S.N,8000) From cte3 S)
Select RetSeq = Row_Number() over (Order By A.N)
,RetVal = Substring(@String, A.N, A.L)
From cte4 A
);
--Much faster than str-Parse, but limited to 8K
--Select * from [dbo].[udf-Str-Parse-8K]('Dog,Cat,House,Car',',')
--Select * from [dbo].[udf-Str-Parse-8K]('John||Cappelletti||was||here','||')
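To key off the sequence number as suggested, one possible follow-up is to take only the second path segment and split it on the underscore. This is a sketch that assumes the [dbo].[udf-Str-Parse-8K] function above has been created and that the second segment always has the Number_ShortName shape; the output column names are made up for illustration.

```sql
Declare @String varchar(max) = 'ProjectName\123456_ProjectShortName\Release_1\Iteration\etc';

-- RetSeq = 2 is the second path segment, e.g. 123456_ProjectShortName.
-- NullIf turns a missing underscore into NULL so Left never gets a negative length.
Select ProjectNumber    = Left(P.RetVal, NullIf(CharIndex('_', P.RetVal), 0) - 1)
      ,ProjectShortName = Substring(P.RetVal, CharIndex('_', P.RetVal) + 1, 8000)
From  [dbo].[udf-Str-Parse-8K](@String,'\') P
Where P.RetSeq = 2
  And CharIndex('_', P.RetVal) > 0;
```

Against a table, the same select could sit behind a CROSS APPLY on the path column instead of a variable.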

Related

T-SQL - Count unique characters in a variable

Goal: To count # of distinct characters in a variable the fastest way possible.
DECLARE @String1 NVARCHAR(4000) = N'1A^' ; --> output = 3
DECLARE @String2 NVARCHAR(4000) = N'11' ; --> output = 1
DECLARE @String3 NVARCHAR(4000) = N'*' ; --> output = 1
DECLARE @String4 NVARCHAR(4000) = N'*A-zz' ; --> output = 4
I've found some posts in regards to distinct characters in a column, grouped by characters, and etc, but not one for this scenario.
Using NGrams8K as a base, you can change the input parameter to a nvarchar(4000) and tweak the DATALENGTH, making NGramsN4K. Then you can use that to split the string into individual characters and count them:
SELECT COUNT(DISTINCT NG.token) AS DistinctCharacters
FROM dbo.NGramsN4k(@String1,1) NG;
Altered NGrams8K:
IF OBJECT_ID('dbo.NGramsN4k','IF') IS NOT NULL DROP FUNCTION dbo.NGramsN4k;
GO
CREATE FUNCTION dbo.NGramsN4k
(
@string nvarchar(4000), -- Input string
@N int -- requested token size
)
/****************************************************************************************
Purpose:
A character-level N-Grams function that outputs a contiguous stream of @N-sized tokens
based on an input string (@string). Accepts strings up to 8000 varchar characters long.
For more information about N-Grams see: http://en.wikipedia.org/wiki/N-gram.
Compatibility:
SQL Server 2008+, Azure SQL Database
Syntax:
--===== Autonomous
SELECT position, token FROM dbo.NGrams8k(@string,@N);
--===== Against a table using APPLY
SELECT s.SomeID, ng.position, ng.token
FROM dbo.SomeTable s
CROSS APPLY dbo.NGrams8K(s.SomeValue,@N) ng;
Parameters:
@string = The input string to split into tokens.
@N = The size of each token returned.
Returns:
Position = bigint; the position of the token in the input string
token = varchar(8000); a @N-sized character-level N-Gram token
Developer Notes:
1. NGrams8k is not case sensitive
2. Many functions that use NGrams8k will see a huge performance gain when the optimizer
creates a parallel execution plan. One way to get a parallel query plan (if the
optimizer does not chose one) is to use make_parallel by Adam Machanic which can be
found here:
sqlblog.com/blogs/adam_machanic/archive/2013/07/11/next-level-parallel-plan-porcing.aspx
3. When @N is less than 1 or greater than the datalength of the input string then no
tokens (rows) are returned. If either @string or @N are NULL no rows are returned.
This is a debatable topic but the thinking behind this decision is that: because you
can't split 'xxx' into 4-grams, you can't split a NULL value into unigrams and you
can't turn anything into NULL-grams, no rows should be returned.
For people who would prefer that a NULL input forces the function to return a single
NULL output you could add this code to the end of the function:
UNION ALL
SELECT 1, NULL
WHERE NOT(@N > 0 AND @N <= DATALENGTH(@string)) OR (@N IS NULL OR @string IS NULL)
4. NGrams8k can also be used as a tally table with the position column being your "N"
row. To do so use REPLICATE to create an imaginary string, then use NGrams8k to split
it into unigrams then only return the position column. NGrams8k will get you up to
8000 numbers. There will be no performance penalty for sorting by position in
ascending order but there is for sorting in descending order. To get the numbers in
descending order without forcing a sort in the query plan use the following formula:
N = <highest number>-position+1.
Pseudo Tally Table Examples:
--===== (1) Get the numbers 1 to 100 in ascending order:
SELECT N = position
FROM dbo.NGrams8k(REPLICATE(0,100),1);
--===== (2) Get the numbers 1 to 100 in descending order:
DECLARE @maxN int = 100;
SELECT N = @maxN-position+1
FROM dbo.NGrams8k(REPLICATE(0,@maxN),1)
ORDER BY position;
5. NGrams8k is deterministic. For more about deterministic functions see:
https://msdn.microsoft.com/en-us/library/ms178091.aspx
Usage Examples:
--===== Turn the string, 'abcd' into unigrams, bigrams and trigrams
SELECT position, token FROM dbo.NGrams8k('abcd',1); -- unigrams (@N=1)
SELECT position, token FROM dbo.NGrams8k('abcd',2); -- bigrams (@N=2)
SELECT position, token FROM dbo.NGrams8k('abcd',3); -- trigrams (@N=3)
--===== How many times the substring "AB" appears in each record
DECLARE @table TABLE(stringID int identity primary key, string varchar(100));
INSERT @table(string) VALUES ('AB123AB'),('123ABABAB'),('!AB!AB!'),('AB-AB-AB-AB-AB');
SELECT string, occurances = COUNT(*)
FROM @table t
CROSS APPLY dbo.NGrams8k(t.string,2) ng
WHERE ng.token = 'AB'
GROUP BY string;
----------------------------------------------------------------------------------------
Revision History:
Rev 00 - 20140310 - Initial Development - Alan Burstein
Rev 01 - 20150522 - Removed DQS N-Grams functionality, improved iTally logic. Also Added
conversion to bigint in the TOP logic to remove implicit conversion
to bigint - Alan Burstein
Rev 03 - 20150909 - Added logic to only return values if @N is greater than 0 and less
than the length of @string. Updated comment section. - Alan Burstein
Rev 04 - 20151029 - Added ISNULL logic to the TOP clause for the @string and @N
parameters to prevent a NULL string or NULL @N from causing "an
improper value" being passed to the TOP clause. - Alan Burstein
****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH
L1(N) AS
(
SELECT 1
FROM (VALUES -- 90 NULL values used to create the CTE Tally table
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)
) t(N)
),
iTally(N) AS -- my cte Tally table
(
SELECT TOP(ABS(CONVERT(BIGINT,((DATALENGTH(ISNULL(@string,N''))/2)-(ISNULL(@N,1)-1)),0)))
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) -- Order by a constant to avoid a sort
FROM L1 a CROSS JOIN L1 b -- cartesian product for 8100 rows (90^2)
)
SELECT
position = N, -- position of the token in the string(s)
token = SUBSTRING(@string,CAST(N AS int),@N) -- the @N-Sized token
FROM iTally
WHERE @N > 0 AND @N <= (DATALENGTH(@string)/2); -- Protection against bad parameter values
Here is another alternative using the power of the tally table. It has been called the "Swiss Army Knife of T-SQL". I keep a tally table as a view on my system which makes it insanely fast.
create View [dbo].[cteTally] as
WITH
E1(N) AS (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
select N from cteTally
Now we can use that tally anytime we need it, like for this exercise.
declare @Something table
(
String1 nvarchar(4000)
)
insert @Something values
(N'1A^')
, (N'11')
, (N'*')
, (N'*A-zz')
select count(distinct substring(s.String1, t.N, 1))
, s.String1
from @Something s
join cteTally t on t.N <= len(s.String1)
group by s.String1
To be honest I don't know whether this would be any faster than Larnu's usage of NGrams, but testing on a large table would be fun to see.
----- EDIT -----
Thanks to Shnugo for the idea. Using a cross apply to a correlated subquery here is actually quite an improvement.
select count(distinct substring(s.String1, A.N, 1))
, s.String1
from @Something s
CROSS APPLY (SELECT TOP(LEN(s.String1)) t.N FROM cteTally t) A(N)
group by s.String1
The reason this is so much faster is that this is no longer using a triangular join which can really be painfully slow. I did also switch out the view with an indexed physical tally table. The improvement there was noticeable on larger datasets but not nearly as big as using the cross apply.
If you want to read more about triangular joins and why we should avoid them Jeff Moden has a great article on the topic. https://www.sqlservercentral.com/articles/hidden-rbar-triangular-joins
Grab a copy of NGrams8k and you can do this:
DECLARE @String1 NVARCHAR(4000) = N'1A^' ; --> output = 3
DECLARE @String2 NVARCHAR(4000) = N'11' ; --> output = 1
DECLARE @String3 NVARCHAR(4000) = N'*' ; --> output = 1
DECLARE @String4 NVARCHAR(4000) = N'*A-zz' ; --> output = 4
SELECT s.String, Total = COUNT(DISTINCT ng.token)
FROM (VALUES(@String1),(@String2),(@String3),(@String4)) AS s(String)
CROSS APPLY dbo.NGrams8k(s.String,1) AS ng
GROUP BY s.String;
Returns:
String Total
-------- -----------
* 1
*A-zz 4
11 1
1A^ 3
UPDATED
Just a quick update based on @Larnu's post and comments. I did not notice that the OP was dealing with Unicode, e.g. NVARCHAR. I created an NVARCHAR(4000) version here - similar to what @Larnu posted above. I just updated the return token to use Latin1_General_BIN collation.
SUBSTRING(@string COLLATE Latin1_General_BIN,CAST(N AS int),@N)
This returns the correct answer:
DECLARE @String5 NVARCHAR(4000) = N'ᡣᓡ'; --> output = 2
SELECT COUNT(DISTINCT ng.token)
FROM dbo.NGramsN4k(@String5,1) AS ng;
Without the collation in place you can use what Larnu posted and get the right answer like this:
DECLARE @String5 NVARCHAR(4000) = N'ᡣᓡ'; --> output = 2
SELECT COUNT(DISTINCT UNICODE(ng.token))
FROM dbo.NGramsN4k(@String5,1) AS ng;
Here's my updated NGramsN4K function:
ALTER FUNCTION dbo.NGramsN4K
(
@string nvarchar(4000), -- Input string
@N int -- requested token size
)
/****************************************************************************************
Purpose:
A character-level N-Grams function that outputs a contiguous stream of @N-sized tokens
based on an input string (@string). Accepts strings up to 4000 nvarchar characters long.
For more information about N-Grams see: http://en.wikipedia.org/wiki/N-gram.
Compatibility:
SQL Server 2008+, Azure SQL Database
Syntax:
--===== Autonomous
SELECT position, token FROM dbo.NGramsN4K(@string,@N);
--===== Against a table using APPLY
SELECT s.SomeID, ng.position, ng.token
FROM dbo.SomeTable s
CROSS APPLY dbo.NGramsN4K(s.SomeValue,@N) ng;
Parameters:
@string = The input string to split into tokens.
@N = The size of each token returned.
Returns:
Position = bigint; the position of the token in the input string
token = nvarchar(4000); a @N-sized character-level N-Gram token
Developer Notes:
1. NGramsN4K is not case sensitive
2. Many functions that use NGramsN4K will see a huge performance gain when the optimizer
creates a parallel execution plan. One way to get a parallel query plan (if the
optimizer does not chose one) is to use make_parallel by Adam Machanic which can be
found here:
sqlblog.com/blogs/adam_machanic/archive/2013/07/11/next-level-parallel-plan-porcing.aspx
3. When @N is less than 1 or greater than the datalength of the input string then no
tokens (rows) are returned. If either @string or @N are NULL no rows are returned.
This is a debatable topic but the thinking behind this decision is that: because you
can't split 'xxx' into 4-grams, you can't split a NULL value into unigrams and you
can't turn anything into NULL-grams, no rows should be returned.
For people who would prefer that a NULL input forces the function to return a single
NULL output you could add this code to the end of the function:
UNION ALL
SELECT 1, NULL
WHERE NOT(@N > 0 AND @N <= DATALENGTH(@string)) OR (@N IS NULL OR @string IS NULL);
4. NGramsN4K is deterministic. For more about deterministic functions see:
https://msdn.microsoft.com/en-us/library/ms178091.aspx
Usage Examples:
--===== Turn the string, 'abcd' into unigrams, bigrams and trigrams
SELECT position, token FROM dbo.NGramsN4K('abcd',1); -- unigrams (@N=1)
SELECT position, token FROM dbo.NGramsN4K('abcd',2); -- bigrams (@N=2)
SELECT position, token FROM dbo.NGramsN4K('abcd',3); -- trigrams (@N=3)
--===== How many times the substring "AB" appears in each record
DECLARE @table TABLE(stringID int identity primary key, string nvarchar(100));
INSERT @table(string) VALUES ('AB123AB'),('123ABABAB'),('!AB!AB!'),('AB-AB-AB-AB-AB');
SELECT string, occurances = COUNT(*)
FROM @table t
CROSS APPLY dbo.NGramsN4K(t.string,2) ng
WHERE ng.token = 'AB'
GROUP BY string;
------------------------------------------------------------------------------------------
Revision History:
Rev 00 - 20170324 - Initial Development - Alan Burstein
Rev 01 - 20191108 - Added Latin1_General_BIN collation to token output - Alan Burstein
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH
L1(N) AS
(
SELECT 1 FROM (VALUES -- 64 dummy values to CROSS join for 4096 rows
($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),
($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),
($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),
($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($)) t(N)
),
iTally(N) AS
(
SELECT
TOP (ABS(CONVERT(BIGINT,((DATALENGTH(ISNULL(@string,''))/2)-(ISNULL(@N,1)-1)),0)))
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) -- Order by a constant to avoid a sort
FROM L1 a CROSS JOIN L1 b -- cartesian product for 4096 rows (64^2)
)
SELECT
position = N, -- position of the token in the string(s)
token = SUBSTRING(@string COLLATE Latin1_General_BIN,CAST(N AS int),@N) -- the @N-Sized token
FROM iTally
WHERE @N > 0 -- Protection against bad parameter values:
AND @N <= (ABS(CONVERT(BIGINT,((DATALENGTH(ISNULL(@string,''))/2)-(ISNULL(@N,1)-1)),0)));
You can do this natively in SQL Server using a CTE and some string manipulation:
DECLARE @TestString NVARCHAR(4000);
SET @TestString = N'*A-zz';
WITH letters AS
(
SELECT 1 AS Pos,
@TestString AS Stri,
MAX(LEN(@TestString)) AS MaxPos,
SUBSTRING(@TestString, 1, 1) AS [Char]
UNION ALL
SELECT Pos + 1,
@TestString,
MaxPos,
SUBSTRING(@TestString, Pos + 1, 1) AS [Char]
FROM letters
WHERE Pos + 1 <= MaxPos
)
SELECT COUNT(*) AS LetterCount
FROM (
SELECT UPPER([Char]) AS [Char]
FROM letters
GROUP BY [Char]
) a
Example outputs:
SET @TestString = N'*A-zz';
{execute code}
LetterCount = 4
SET @TestString = N'1A^';
{execute code}
LetterCount = 3
SET @TestString = N'1';
{execute code}
LetterCount = 1
SET @TestString = N'*';
{execute code}
LetterCount = 1
CREATE TABLE #STRINGS(
STRING1 NVARCHAR(4000)
)
INSERT INTO #STRINGS (
STRING1
)
VALUES
(N'1A^'),(N'11'),(N'*'),(N'*A-zz')
;WITH CTE_T AS (
SELECT DISTINCT
S.STRING1
,SUBSTRING(S.STRING1, V.number + 1, 1) AS Val
FROM
#STRINGS S
INNER JOIN
[master]..spt_values V
ON V.number < LEN(S.STRING1)
WHERE
V.[type] = 'P'
)
SELECT
T.STRING1
,COUNT(1) AS CNT
FROM
CTE_T T
GROUP BY
T.STRING1

Sql Multiple Replace based on query

I have been trying to set up a SQL function to build descriptions with "tags". For example, I would want to start with a description:
"This is [length] ft. long and [height] ft. high"
And modify the description with data from a related table, to end up with:
"This is 75 ft. long and 20 ft. high"
I could do this easily with REPLACE functions if we had a set number of tags, but I want these tags to be user defined, and each description may or may not have specific tags in it. Would there be any better way to get this other than using a cursor to go through the string once for each available tag? Does SQL have any built in functionality to do a multiple replace? something like:
Replace(description,(select tag, replacement from tags))
I actually recommend doing this in application code. But, you can do it using a recursive CTE:
with t as (
select t.*, row_number() over (order by t.tag) as seqnum
from tags t
),
cte as (
select replace(@description, t.tag, t.replacement) as d, t.seqnum
from t
where seqnum = 1
union all
select replace(d, t.tag, t.replacement), t.seqnum
from cte join
t
on t.seqnum = cte.seqnum + 1
)
select top 1 cte.*
from cte
order by seqnum desc;
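To try the recursive approach end to end, here is a self-contained sketch with sample data. The variable @description, the table variable @tags and the alias names are assumptions for illustration, and the explicit casts to varchar(8000) are only there so the anchor and recursive parts of the CTE are certain to have the same type.

```sql
-- Sample data; variable and table names are assumptions for illustration.
declare @description varchar(200) = 'This is [length] ft. long and [height] ft. high';
declare @tags table (tag varchar(50), replacement varchar(50));
insert into @tags values ('[length]', '75'), ('[height]', '20');

with numbered as (
    -- Give every tag a sequence number so the replacements can be chained.
    select tg.tag, tg.replacement, row_number() over (order by tg.tag) as seqnum
    from @tags tg
),
cte as (
    -- Anchor: apply the first tag to the original description.
    select cast(replace(@description, n.tag, n.replacement) as varchar(8000)) as d, n.seqnum
    from numbered n
    where n.seqnum = 1
    union all
    -- Recursive step: apply the next tag to the previous result.
    select cast(replace(cte.d, n.tag, n.replacement) as varchar(8000)), n.seqnum
    from cte
    join numbered n on n.seqnum = cte.seqnum + 1
)
select top 1 d as NewDescription
from cte
order by seqnum desc;
```

Each tag adds one level of recursion, so with more than 100 tags the query would also need option (maxrecursion ...) to get past SQL Server's default limit.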
Try below query :
SELECT REPLACE(DESCRIPTION,'[length]',( SELECT replacement FROM tags WHERE tag
= '[length]') )
I agree with Gordon that this is best handled in your application code.
If for whatever reason that option is not available however, and if you don't want to use recursion as per Gordon's answer, you could use a tally table approach to swap out your values.
You will need to test the performance of the for xml being executed for each value though...
Assuming you have a table of Tag replacement values:
create table TagReplacementTable(Tag nvarchar(50), Replacement nvarchar(50));
insert into TagReplacementTable values('[test]',999)
,('[length]',75)
,('[height]',20)
,('[other length]',40)
,('[other height]',50);
You can create an inline table function that will work through your Descriptions and replace the necessary parts using TagReplacementTable as reference:
create function dbo.Tag_Replace(@str nvarchar(4000)
,@tagstart nvarchar(1)
,@tagend nvarchar(1)
)
returns table
as
return
(
with n(n) as (select n from (values(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) n(n))
-- Select the same number of rows as characters in @str as incremental row numbers.
-- Cross joins increase exponentially to a max possible 10,000 rows to cover largest @str length.
,t(t) as (select top (select len(@str) a) row_number() over (order by (select null)) from n n1,n n2,n n3,n n4)
-- Return the position of every value that starts or ends a part of the description.
-- This will be the first character (t='f'), the start of any tag (t='s') and the end of any tag (t='e').
,s(s,t) as (select 1, 'f'
union all select t+1, 's' from t where substring(@str,t,1) = @tagstart
union all select t+1, 'e' from t where substring(@str,t,1) = @tagend
)
-- Return the start and length of every value, to use in the SUBSTRING function.
-- ISNULL/NULLIF combo handles the last value where there is no delimiter at the end of the string.
-- Using the t value we can determine which CHARINDEX to look for.
,l(t,s,l) as (select t,s,isnull(nullif(charindex(case t when 'f' then @tagstart when 's' then @tagend when 'e' then @tagstart end,@str,s),0)-s,4000) from s)
-- Each element of the string is returned in an ordered list along with its t value.
-- Where this t value is 's' this means the value is a tag, so append the start and end identifiers and join to the TagReplacementTable.
-- Where no replacement is found, simply return the part of the Description.
-- Finally, concatenate into one string value.
select (select isnull(r.Replacement,k.Item)
from(select row_number() over(order by s) as ItemNumber
,case when l.t = 's' then '[' else '' end
+ substring(@str,s,l)
+ case when l.t = 's' then ']' else '' end as Item
,t
from l
) k
left join TagReplacementTable r
on(k.Item = r.Tag)
order by k.ItemNumber
for xml path('')
) as NewString
);
And then outer apply to the results of the function to do replacements on all your Description values:
declare @t table (Descr nvarchar(100));
insert into @t values('This is [length] ft. long and [height] ft. high'),('[test] This is [other length] ft. long and [other height] ft. high');
select *
from @t t
outer apply dbo.Tag_Replace(t.Descr,'[',']') r;
Output:
+--------------------------------------------------------------------+-----------------------------------------+
| Descr | NewString |
+--------------------------------------------------------------------+-----------------------------------------+
| This is [length] ft. long and [height] ft. high | This is 75 ft. long and 20 ft. high |
| [test] This is [other length] ft. long and [other height] ft. high | 999 This is 40 ft. long and 50 ft. high |
+--------------------------------------------------------------------+-----------------------------------------+
I would not iterate through an individual string, but instead run the update on the entire column of strings. I'm not sure if that was your intent but this would be much quicker than one string at a time.
Test Data:
Create TABLE #strs ( mystr VARCHAR(MAX) )
Create TABLE #rpls (i INT IDENTITY(1,1) NOT NULL, src VARCHAR(MAX) , Trg VARCHAR(MAX) )
INSERT INTO #strs
( mystr )
SELECT 'hello ##color## world'
UNION ALL SELECT 'see jack ##verboftheday##! ##verboftheday## Jack, ##verboftheday##!'
UNION ALL SELECT 'on ##Date##, the ##color## StockMarket was ##MarketDirection##!'
INSERT INTO #rpls ( src ,Trg )
SELECT '##Color##', 'Blue'
UNION ALL SELECT '##verboftheday##' , 'run'
UNION ALL SELECT '##Date##' , CONVERT(VARCHAR(MAX), GETDATE(), 9)
UNION ALL SELECT '##MarketDirection##' , 'UP'
then a loop like this:
DECLARE @i INTEGER = 0
DECLARE @count INTEGER
SELECT @count = COUNT(*)
FROM #rpls R
WHILE @i < @count
BEGIN
SELECT @i += 1
UPDATE #strs
SET mystr = REPLACE(mystr, ( SELECT R.src
FROM #rpls R
WHERE i = @i ), ( SELECT R.Trg
FROM #rpls R
WHERE i = @i ))
END
SELECT *
FROM #strs S
Yielding the following
hello Blue world
see jack run! run Jack, run!
on May 19 2017 9:48:02:390AM, the Blue StockMarket was UP!
I found someone wanting to do something similar here with a set number of options:
SELECT @target = REPLACE(@target, invalidChar, '-')
FROM (VALUES ('~'),(''''),('!'),('@'),('#')) AS T(invalidChar)
I could modify it as such:
declare @target as varchar(max) = 'This is [length] ft. long and [height] ft. high'
select @target = REPLACE(@target,'[' + tag + ']',replacement)
from tags
It then runs the replace once for every record returned in the select statement.
(I originally had added this to my question, but it sounds like it is better protocol to add it as an answer.)

Generate a list with string prefix in SQL with fixed length

I just want to generate a list like this
XY0001
XY0002
XY0003
The prefix is same for all rows. Need fixed length (6 in this example)
Looking for an easy way to produce such list to put it into temp table.
MS SQL
for a very small number this would do:
DECLARE @TempList TABLE (Name VARCHAR(100));
insert into @TempList Values ('XY0001')
insert into @TempList Values ('XY0002')
insert into @TempList Values ('XY0003')
insert into @TempList Values ('XY0004')
select * from @TempList
You can use an ad-hoc tally table
If 2012+
DECLARE @TempList TABLE (Name VARCHAR(100));
Select Name = 'XY'+format(N,'0000')
From (Select Top 9999 N=Row_Number() Over (Order By (Select NULL)) From master..spt_values N1,master..spt_values N2) A
Order by N
Returns
Name
XY0001
XY0002
XY0003
XY0004
...
XY9997
XY9998
XY9999
If not
DECLARE @TempList TABLE (Name VARCHAR(100));
Select Name = 'XY'+right('00000'+cast(N as varchar(25)),4)
From (Select Top 9999 N=Row_Number() Over (Order By (Select NULL)) From master..spt_values N1,master..spt_values N2) A
Order by N
I like to use recursive CTEs for this.
declare @max_number int = 1000;
with num as (
select 1 as n
union all
select n + 1
from num
where n < @max_number
)
select 'XY' + right('0000' + cast(n as varchar(10)), 4)
from num
option (maxrecursion 0);
The recursive CTE gives you the numbers, and right('0000' + ..., 4) does the left-padding with 0's to ensure you get 0001 instead of 1.
This approach will support a variable number of outputs. Though as you alluded to in your question, this is overkill if you only want a few.
(You'll need to test this out for boundary cases. I haven't tested this exact code sample.)
There is likely a limit to how far this scales because it uses recursion. SQL Server stops a recursive CTE after 100 levels by default, which is why the maxrecursion hint is needed for longer lists.
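Since the question asks for the list to end up in a temp table, here is a sketch that combines an inline tally with zero-padding and an insert. The variables @Prefix, @Count and @Digits and the table name #TempList are assumptions for illustration; the inline tally below tops out at 10,000 rows.

```sql
declare @Prefix varchar(10) = 'XY';  -- assumed prefix
declare @Count  int = 25;            -- assumed number of rows to generate
declare @Digits int = 4;             -- assumed width of the numeric part

create table #TempList (Name varchar(100) not null);

with E1(N) as (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) v(N)),  -- 10 rows
     E4(N) as (select 1 from E1 a cross join E1 b cross join E1 c cross join E1 d),   -- 10,000 rows max
     Tally(N) as (select top (@Count) row_number() over (order by (select null)) from E4)
insert into #TempList (Name)
-- Left-pad the number with zeros, then prepend the prefix.
select @Prefix + right(replicate('0', @Digits) + cast(N as varchar(10)), @Digits)
from Tally;

select Name from #TempList order by Name;

drop table #TempList;
```

Keeping the width in one variable makes the fixed length explicit: the prefix length plus @Digits gives the total length of every value, as long as @Count fits in @Digits digits.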

How I can replace odd patterns inside a string?

I'm in the process of creating a temporary procedure in SQL because I have a table value that is written in markdown and needs to appear as rendered HTML in the web browser (markdown to HTML conversion).
String of the column currently look like this:
Questions about **general computing hardware and software** are off-topic for Stack Overflow unless they directly involve tools used primarily for programming. You may be able to get help on [Super User](http://superuser.com/about)
I'm currently working with bold and italic text. This means (in the case of bold text) I need to replace the odd-numbered occurrences of the pattern ** with <b> and the even-numbered occurrences with </b>.
I saw replace(), but it performs the replacement on all occurrences of the pattern in the string.
So how can I replace a substring only when its occurrence is odd, or only when it is even?
Update: Some people wondered what schema I'm using, so just take a look here.
One more extra if you want: The markdown style hyperlink to html hyperlink doesn't look so simple.
Using the STUFF function and a simple WHILE loop:
CREATE FUNCTION dbo.fn_OddEvenReplace(@text nvarchar(500),
@textToReplace nvarchar(10),
@oddText nvarchar(10),
@evenText nvarchar(500))
RETURNS varchar(max)
AS
BEGIN
DECLARE @counter tinyint
SET @counter = 1
DECLARE @switchText nvarchar(10)
WHILE CHARINDEX(@textToReplace, @text, 1) > 0
BEGIN
SELECT @text = STUFF(@text,
CHARINDEX(@textToReplace, @text, 1),
LEN(@textToReplace),
IIF(@counter%2=0,@evenText,@oddText)),
@counter = @counter + 1
END
RETURN @text
END
And you can use it like this:
SELECT dbo.fn_OddEvenReplace(column, '**', '<b>', '</b>')
FROM table
UPDATE:
This is re-written as an SP:
CREATE PROC dbo.##sp_OddEvenReplace @text nvarchar(500),
@textToReplace nvarchar(10),
@oddText nvarchar(10),
@evenText nvarchar(10),
@returnText nvarchar(500) output
AS
BEGIN
DECLARE @counter tinyint
SET @counter = 1
DECLARE @switchText nvarchar(10)
WHILE CHARINDEX(@textToReplace, @text, 1) > 0
BEGIN
SELECT @text = STUFF(@text,
CHARINDEX(@textToReplace, @text, 1),
LEN(@textToReplace),
IIF(@counter%2=0,@evenText,@oddText)),
@counter = @counter + 1
END
SET @returnText = @text
END
GO
And to execute:
DECLARE @returnText nvarchar(500)
EXEC dbo.##sp_OddEvenReplace '**a** **b** **c**', '**', '<b>', '</b>', @returnText output
SELECT @returnText
As per OP's request I have modified my earlier answer to perform as a temporary stored procedure. I have left my earlier answer as I believe the usage against a table of strings to be useful also.
If a Tally (or Numbers) table is known to already exist with at least 8000 values, then the marked section of the CTE can be omitted and the CTE reference tally replaced with the name of the existing Tally table.
create procedure #HtmlTagExpander(
@InString varchar(8000)
,@OutString varchar(8000) output
)as
begin
declare @Delimiter char(2) = '**';
create table #t(
StartLocation int not null
,EndLocation int not null
,constraint PK unique clustered (StartLocation desc)
);
with
-- vvv Only needed in absence of Tally table vvv
E1(N) as (
select 1 from (values
(1),(1),(1),(1),(1),
(1),(1),(1),(1),(1)
) E1(N)
), --10E+1 or 10 rows
E2(N) as (select 1 from E1 a cross join E1 b), --10E+2 or 100 rows
E4(N) As (select 1 from E2 a cross join E2 b), --10E+4 or 10,000 rows max
tally(N) as (select row_number() over (order by (select null)) from E4),
-- ^^^ Only needed in absence of Tally table ^^^
Delimiter as (
select len(#Delimiter) as Length,
len(#Delimiter)-1 as Offset
),
cteTally(N) AS (
select top (isnull(datalength(@InString),0))
row_number() over (order by (select null))
from tally
),
cteStart(N1) AS (
select
t.N
from cteTally t cross join Delimiter
where substring(@InString, t.N, Delimiter.Length) = @Delimiter
),
cteValues as (
select
TagNumber = row_number() over(order by N1)
,Location = N1
from cteStart
),
HtmlTagSpotter as (
select
TagNumber
,Location
from cteValues
),
tags as (
select
Location = f.Location
,IsOpen = cast((TagNumber % 2) as bit)
,Occurrence = TagNumber
from HtmlTagSpotter f
)
insert #t(StartLocation,EndLocation)
select
prev.Location
,data.Location
from tags data
join tags prev
on prev.Occurrence = data.Occurrence - 1
and prev.IsOpen = 1;
set @OutString = @InString;
update this
set @OutString = stuff(stuff(@OutString,this.EndLocation, 2,'</b>')
,this.StartLocation,2,'<b>')
from #t this with (tablockx)
option (maxdop 1);
end
go
Invoked like this:
declare @InString varchar(8000)
,@OutString varchar(8000);
set @InString = 'Questions about **general computing hardware and software** are off-topic **for Stack Overflow.';
exec #HtmlTagExpander @InString,@OutString out; select @OutString;
set @InString = 'Questions **about** general computing hardware and software **are off-topic** for Stack Overflow.';
exec #HtmlTagExpander @InString,@OutString out; select @OutString;
go
drop procedure #HtmlTagExpander;
go
It yields as output:
Questions about <b>general computing hardware and software</b> are off-topic **for Stack Overflow.
Questions <b>about</b> general computing hardware and software <b>are off-topic</b> for Stack Overflow.
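The nested STUFF calls in the procedure replace the closing delimiter first and the opening delimiter second. A minimal illustration on one hard-coded pair (the sample string and its positions are made up for this sketch) shows that replacing the later position first leaves the earlier position valid:

```sql
-- 'a **b** c': the opening '**' starts at position 3, the closing '**' at position 6.
DECLARE @s varchar(100) = 'a **b** c';

SELECT STUFF(                       -- second: swap the opening '**' at 3 for '<b>'
         STUFF(@s, 6, 2, '</b>')    -- first: swap the closing '**' at 6 for '</b>'
       , 3, 2, '<b>') AS Expanded;  -- a <b>b</b> c
```

Replacing in the opposite order would shift the closing delimiter one character to the right, because '<b>' is longer than '**'. This is presumably also why #t is clustered on StartLocation in descending order.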
One option is to use a Regular Expression as it makes replacing such patterns very simple. RegEx functions are not built into SQL Server so you need to use SQL CLR, either compiled by you or from an existing library.
For this example I will use the SQL# (SQLsharp) library (which I am the author of) but the RegEx functions are available in the Free version.
SELECT SQL#.RegEx_Replace
(
N'Questions about **general computing hardware and software** are off-topic\
for Stack Overflow unless **they** directly involve tools used primarily for\
**programming. You may be able to get help on [Super User]\
(https://superuser.com/about)', -- @ExpressionToValidate
N'\*\*([^\*]*)\*\*', -- @RegularExpression
N'<b>$1</b>', -- @Replacement
-1, -- @Count (-1 = all)
1, -- @StartAt
'IgnoreCase' -- @RegEx options
);
The above pattern \*\*([^\*]*)\*\* just looks for anything surrounded by double-asterisks. In this case you don't need to worry about odd / even. It also means that you won't get a poorly-formed <b>-only tag if for some reason there is an extra ** in the string. I added two additional test cases to the original string: a complete set of ** around the word they and an unmatched set of ** just before the word programming. The output is:
Questions about <b>general computing hardware and software</b> are off-topicfor Stack Overflow unless <b>they</b> directly involve tools used primarily for **programming. You may be able to get help on [Super User](https://superuser.com/about)
which renders as:
Questions about general computing hardware and software are off-topicfor Stack Overflow unless they directly involve tools used primarily for **programming. You may be able to get help on Super User
This solution makes use of techniques described by Jeff Moden in this article on the Running Sum problem in SQL. This solution is lengthy, but by making use of the Quirky Update in SQL Server over a clustered index, holds the promise of being much more efficient over large data sets than cursor-based solutions.
Update - amended below to operate off a table of strings
Assuming the existence of a tally table created like this (with at least 8000 rows):
create table dbo.tally (
N int not null
,unique clustered (N desc)
);
go
with
E1(N) as (
select 1 from (values
(1),(1),(1),(1),(1),
(1),(1),(1),(1),(1)
) E1(N)
), --10E+1 or 10 rows
E2(N) as (select 1 from E1 a cross join E1 b), --10E+2 or 100 rows
E4(N) As (select 1 from E2 a cross join E2 b) --10E+4 or 10,000 rows max
insert dbo.tally(N)
select row_number() over (order by (select null)) from E4;
go
and a HtmlTagSpotter function defined like this:
create function dbo.HtmlTagSPotter(
@pString varchar(8000)
,@pDelimiter char(2))
returns table with schemabinding as
return
WITH
Delimiter as (
select len(@pDelimiter) as Length,
len(@pDelimiter)-1 as Offset
),
cteTally(N) AS (
select top (isnull(datalength(@pString),0))
row_number() over (order by (select null))
from dbo.tally
),
cteStart(N1) AS (--==== Returns starting position of each "delimiter" )
select
t.N
from cteTally t cross join Delimiter
where substring(@pString, t.N, Delimiter.Length) = @pDelimiter
),
cteValues as (
select
ItemNumber = row_number() over(order by N1)
,Location = N1
from cteStart
)
select
ItemNumber
,Location
from cteValues
go
then running the following SQL will perform the required substitution. Note that the inner join at the end prevents any trailing "odd" tags from being converted:
create table #t(
ItemNo int not null
,Item varchar(8000) null
,StartLocation int not null
,EndLocation int not null
,constraint PK unique clustered (ItemNo,StartLocation desc)
);
with data(i,s) as ( select i,s from (values
(1,'Questions about **general computing hardware and software** are off-topic **for Stack Overflow.')
,(2,'Questions **about **general computing hardware and software** are off-topic **for Stack Overflow.')
--....,....1....,....2....,....3....,....4....,....5....,....6....,....7....,....8....,....9....,....0
)data(i,s)
),
tags as (
select
ItemNo = data.i
,Item = data.s
,Location = f.Location
,IsOpen = cast((f.ItemNumber % 2) as bit)
,Occurrence = f.ItemNumber
from data
cross apply dbo.HtmlTagSPotter(data.s,'**') f
)
insert #t(ItemNo,Item,StartLocation,EndLocation)
select
data.ItemNo
,data.Item
,prev.Location
,data.Location
from tags data
join tags prev
on prev.ItemNo = data.ItemNo
and prev.Occurrence = data.Occurrence - 1
and prev.IsOpen = 1
union all
select
i,s,8001,8002
from data
;
declare @ItemNo int
,@ThisString varchar(8000);
declare @s varchar(8000);
update this
set @s = this.Item = case when this.StartLocation > 8000
then this.Item
else stuff(stuff(@s,this.EndLocation, 2,'</b>')
,this.StartLocation,2,'<b>')
end
from #t this with (tablockx)
option (maxdop 1);
select
Item
from (
select
Item
,ROW_NUMBER() over (partition by ItemNo order by StartLocation) as rn
from #t
) t
where rn = 1
go
yielding:
Item
------------------------------------------------------------------------------------------------------------
Questions about <b>general computing hardware and software</b> are off-topic **for Stack Overflow.
Questions <b>about </b>general computing hardware and software<b> are off-topic </b>for Stack Overflow.
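The pairing rule that leaves an unmatched trailing delimiter alone can be seen in isolation. The sketch below hard-codes the delimiter positions for the made-up string 'a **b** c **d' (3, 6 and 11) instead of calling the function, and applies the same self-join as the script above:

```sql
-- Delimiter positions in 'a **b** c **d': 3 and 6 form a pair, 11 has no partner.
WITH tags(Occurrence, Location) AS (
    SELECT Occurrence, Location
    FROM (VALUES (1, 3), (2, 6), (3, 11)) v(Occurrence, Location)
)
SELECT opener.Location AS StartLocation,
       closer.Location AS EndLocation
FROM tags AS closer
JOIN tags AS opener
  ON opener.Occurrence = closer.Occurrence - 1   -- the delimiter immediately before
 AND opener.Occurrence % 2 = 1;                  -- only odd occurrences open a tag
-- One row comes back (3, 6); the delimiter at 11 never appears as a closer.
```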

Generating an n-gram table with an SQL query

I'm trying to implement a fuzzy search with JavaScript client side, to search a largish db (300 items roughly) of records contained in an SQL database. My constraint is that it is not possible to perform a live query on the database- I must generate "indexes" as flat files during a nightly batch job. And so, starting with a db that looks like this:
ID. NAME
1. The Rain Man
2. The Electric Slide
3. Transformers
I need to create within a single query something like this:
Trigram ID
the 1
the 2
he_ 1
he_ 2
e_r 1
_ra 1
rai 1
ain 1
in_ 1
n_m 1
_ma 1
man 1
e_e 2
_el 2
ele 2
lec 2
Etc etc, typos notwithstanding. The rules here are that 'n' is the length of the strings in the first column; that only a-z and _ are valid characters, with any other character either normalized to lower case or mapped to _; and that a GROUP BY n-gram clause may be applied to the table. Thus, I would hope to gain a table that would allow me to quickly look up a particular n-gram and get a list of all the IDs of rows which contain that sequence. I'm not a clever enough SQL cookie to figure this problem out. Can you?
I created a T-SQL NGrams function that works quite nicely; see the comments section for examples of how to use it.
CREATE FUNCTION dbo.nGrams8K
(
@string VARCHAR(8000),
@n TINYINT,
@pad BIT
)
/*
Created by: Alan Burstein
Created on: 3/10/2014
Updated on: 5/20/2014 changed the logic to use an "inline tally table"
9/10/2014 Added some more code examples in the comment section
9/30/2014 Added more code examples
10/27/2014 Small bug fix regarding padding
Use: Outputs a stream of tokens based on an input string.
Works just like mdq.nGrams; see http://msdn.microsoft.com/en-us/library/ff487027(v=sql.105).aspx.
n-gram defined:
In the fields of computational linguistics and probability,
an n-gram is a contiguous sequence of n items from a given
sequence of text or speech. The items can be phonemes, syllables,
letters, words or base pairs according to the application.
To better understand N-Grams see: http://en.wikipedia.org/wiki/N-gram
*/
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
WITH
E1(n) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t(n)),
E2(n) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b),
iTally(n) AS
(
SELECT TOP (LEN(@string)+@n) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM E2 a CROSS JOIN E2 b
),
NewString(NewString) AS
(
SELECT REPLICATE(CASE @pad WHEN 0 THEN '' ELSE ' ' END,@n-1)+@string+
REPLICATE(CASE @pad WHEN 0 THEN '' ELSE ' ' END,@n-1)
)
SELECT TOP ((@n)+LEN(@string))
n AS [sequence],
SUBSTRING(NewString,n,@n) AS token
FROM iTally
CROSS APPLY NewString
WHERE n < ((@n)+LEN(@string));
/*
------------------------------------------------------------
-- (1) Basic Use
-------------------------------------------------------------
;-- (A)basic "string to table":
SELECT [sequence], token
FROM dbo.nGrams8K('abcdefg',1,1);
-- (b) create "bi-grams" (pad bit off)
SELECT [sequence], token
FROM dbo.nGrams8K('abcdefg',2,0);
-- (c) create "tri-grams" (pad bit on)
SELECT [sequence], token
FROM dbo.nGrams8K('abcdefg',3,1);
-- (d) filter for only "tri-grams"
SELECT [sequence], token
FROM dbo.nGrams8K('abcdefg',3,1)
WHERE len(ltrim(token)) = 3;
-- note the query plan for each. The power is coming from an index
-- also note how many rows are produced: len(@string+(@n-1))
-- lastly, you can trim as needed when padding=1
------------------------------------------------------------
-- (2) With a variable
------------------------------------------------------------
-- note, in this example I am getting only the stuff that has three letters
DECLARE @string varchar(20) = 'abcdefg',
@tokenLen tinyint = 3;
SELECT [sequence], token
FROM dbo.nGrams8K(@string,@tokenLen,1)
WHERE len(ltrim(token)) = @tokenLen;
GO
------------------------------------------------------------
-- (3) An on-the-fly alphabet (this will come in handy in a moment)
------------------------------------------------------------
DECLARE @alphabet VARCHAR(26)='ABCDEFGHIJKLMNOPQRSTUVWXYZ';
SELECT [sequence], token
FROM dbo.nGrams8K(@alphabet,1,0);
GO
------------------------------------------------------------
-- (4) Character Count
------------------------------------------------------------
DECLARE @string VARCHAR(100)='The quick green fox jumps over the lazy dog and the lazy dog just laid there.',
@alphabet VARCHAR(26)='ABCDEFGHIJKLMNOPQRSTUVWXYZ';
SELECT a.token, COUNT(b.token) ttl
FROM dbo.nGrams8K(@alphabet,1,0) a
LEFT JOIN dbo.nGrams8K(@string,1,0) b ON a.token=b.token
GROUP BY a.token
ORDER BY a.token;
GO
------------------------------------------------------------
-- (5) Locate the start position of a search pattern
------------------------------------------------------------
;-- (A) note these queries:
DECLARE @string varchar(100)='THE QUICK Green FOX JUMPED OVER THE LAZY DOGS BACK';
-- (i)
SELECT * FROM dbo.nGrams8K(@string,1,0) a;
-- (ii) note this query:
SELECT * FROM dbo.nGrams8K(@string,1,0) a WHERE [token]=' ';
-- (B) and now the word count (@string included for presentation)
SELECT @string AS string,
count(*)+1 AS words
FROM dbo.nGrams8K(@string,1,0) a
WHERE [token]=' '
GO
------------------------------------------------------------
-- (6) search for the number of occurrences of a word
------------------------------------------------------------
DECLARE @string VARCHAR(100)='The quick green fox jumps over the lazy dog and the lazy dog just laid there.',
@alphabet VARCHAR(26)='ABCDEFGHIJKLMNOPQRSTUVWXYZ',
@searchString VARCHAR(100)='The';
-- (6a) by location
SELECT sequence-(LEN(@searchString)) AS location,
token AS searchString
FROM dbo.nGrams8K(@string,LEN(@searchString+' ')+1,0) b
WHERE token=@searchString;
-- (6b) get total
SELECT @string AS string,
@searchString AS searchString,
COUNT(*) AS ttl
FROM dbo.nGrams8K(@string,LEN(@searchString+' ')+1,0) b
WHERE token=@searchString;
------------------------------------------------------------
-- (7) Special SubstringBefore and SubstringAfter
------------------------------------------------------------
-- (7a) SubstringBeforeSSI (note: SSI = substringIndex)
ALTER FUNCTION dbo.SubstringBeforeSSI
(
@string varchar(1000),
@substring varchar(100),
@substring_index tinyint
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
WITH get_pos AS
(
SELECT rn = row_number() over (order by sequence), substring_index = sequence
FROM dbo.nGrams8K(@string,len(@substring),1)
WHERE token=@substring
)
SELECT newstring = substring(@string,1,substring_index-len(@substring))
FROM get_pos
WHERE rn=@substring_index;
GO
DECLARE @string varchar(1000)='10.0.1600.22',
@searchPattern varchar(100)='.',
@substring_index tinyint = 3;
SELECT * FROM dbo.SubstringBeforeSSI(@string,@searchPattern,@substring_index);
GO
-- (7b) SubstringAfterSSI (note: SSI = substringIndex)
ALTER FUNCTION dbo.SubstringAfterSSI
(
@string varchar(1000),
@substring varchar(100),
@substring_index tinyint
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
WITH get_pos AS
(
SELECT rn = row_number() over (order by sequence), substring_index = sequence
FROM dbo.nGrams8K(@string,len(@substring),1)
WHERE token=@substring
)
SELECT newstring = substring(@string,substring_index+1,8000)
FROM get_pos
WHERE rn=@substring_index;
GO
DECLARE @string varchar(1000)='<notes id="1">blah, blah, blah</notes><notes id="2">More Notes</notes>',
@searchPattern varchar(100)='</notes>',
@substring_index tinyint = 1;
SELECT @string, *
FROM dbo.SubstringAfterSSI(@string,@searchPattern,@substring_index);
------------------------------------------------------------
-- (8) Strip non-numeric characters from a string
------------------------------------------------------------
-- (8a) create the function
ALTER FUNCTION StripNonNumeric_itvf(@OriginalText VARCHAR(8000))
RETURNS TABLE
--WITH SCHEMABINDING
AS
return
WITH ngrams AS
(
SELECT n = [sequence], c = token
FROM dbo.nGrams8K(@OriginalText,1,1)
),
clean_txt(CleanedText) AS
(
SELECT c+''
FROM ngrams
WHERE ascii(substring(@OriginalText,n,1)) BETWEEN 48 AND 57
FOR XML PATH('')
)
SELECT CleanedText
FROM clean_txt;
GO
-- (8b) use against a value or variable
SELECT CleanedText
FROM dbo.StripNonNumeric_itvf('value123');
-- (8c) use against a table
-- test harness:
IF OBJECT_ID('tempdb..#strings') IS NOT NULL DROP TABLE #strings;
WITH strings AS
(
SELECT TOP (100000) string = newid()
FROM sys.all_columns a CROSS JOIN sys.all_columns b
)
SELECT *
INTO #strings
FROM strings;
GO
-- query (returns 100K rows every 3 seconds on my pc):
SELECT CleanedText
FROM #strings
CROSS APPLY dbo.StripNonNumeric_itvf(string);
------------------------------------------------------------
-- (9) A couple complex String Algorithms
------------------------------------------------------------
-- (9a) hamming distance between two strings:
DECLARE @string1 varchar(8000) = 'xxxxyyyzzz',
@string2 varchar(8000) = 'xxxxyyzzzz';
SELECT string1 = @string1,
string2 = @string2,
hamming_distance = count(*)
FROM dbo.nGrams8K(@string1,1,0) s1
CROSS APPLY dbo.nGrams8K(@string2,1,0) s2
WHERE s1.sequence = s2.sequence
AND s1.token <> s2.token
GO
-- (9b) inner join between 2 strings
--(can be used to speed up other string metrics such as the longest common subsequence)
DECLARE @string1 varchar(100)='xxxx123yyyy456zzzz',
@string2 varchar(100)='xx789yy000zz';
WITH
s1(string1) AS
(
SELECT [token]+''
FROM dbo.nGrams8K(@string1,1,0)
WHERE charindex([token],@string2)<>0
ORDER BY [sequence]
FOR XML PATH('')
),
s2(string2) AS
(
SELECT [token]+''
FROM dbo.nGrams8K(@string2,1,0)
WHERE charindex([token],@string1)<>0
ORDER BY [sequence]
FOR XML PATH('')
)
SELECT string1, string2
FROM s1
CROSS APPLY s2;
------------------------------------------------------------
-- (10) Advanced Substring Metrics
------------------------------------------------------------
-- (10a) Identify common substrings and their location
DECLARE @string1 varchar(100) = 'xxx yyy zzz',
@string2 varchar(100) = 'xx yyy zz';
-- (i) review the two strings
SELECT str1 = @string1,
str2 = @string2;
-- (ii) the results
WITH
iTally AS
(
SELECT n
FROM dbo.tally t
WHERE n<= len(@string1)
),
distinct_tokens AS
(
SELECT ng1 = ng1.token, ng2 = ng2.token --= ltrim(ng1.token), ng2 = ltrim(ng2.token)
FROM itally
CROSS APPLY dbo.nGrams8K(@string1,n,1) ng1
CROSS APPLY dbo.nGrams8K(@string2,n,1) ng2
WHERE ng1.token=ng2.token
)
SELECT ss_txt = ng1,
ss_len = len(ng1),
str1_loc = charindex(ng1,@string1),
str2_loc = charindex(ng2,@string2)
FROM distinct_tokens
WHERE ng1<>'' AND charindex(ng1,@string1)+charindex(ng2,@string2)<>0
GROUP BY ng1, ng2
ORDER BY charindex(ng1,@string1), charindex(ng2,@string2), len(ng1);
-- (10b) Longest common substring function
-- (i) function
IF EXISTS
( SELECT * FROM INFORMATION_SCHEMA.ROUTINES
WHERE ROUTINE_SCHEMA='dbo' AND ROUTINE_NAME = 'lcss')
DROP FUNCTION dbo.lcss;
GO
CREATE FUNCTION dbo.lcss(@string1 varchar(100), @string2 varchar(100))
RETURNS TABLE
AS
RETURN
SELECT TOP (1) with ties token
FROM dbo.tally
CROSS APPLY dbo.nGrams8K(@string1,n,1)
WHERE n <= len(@string1)
AND charindex(token, @string2) > 0
ORDER BY len(token) DESC;
GO
-- (ii) example of use
DECLARE @string1 varchar(100) = '000xxxyyyzzz',
@string2 varchar(100) = '999xxyyyzaa';
SELECT string1 = @string1,
string2 = @string2,
token
FROM dbo.lcss(@string1, @string2);
*/
GO
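The function above is not shown applied to the question's data, so here is a minimal sketch of how that might look. It assumes dbo.nGrams8K exists as defined above. The inline VALUES list (aliased src, with columns ID and Name) is a placeholder for the real table. Only spaces are mapped to underscores; other characters outside a-z would need further REPLACE calls.

```sql
-- Sketch: build the trigram/ID lookup with dbo.nGrams8K (defined above).
-- The inline VALUES list stands in for the real source table.
SELECT ng.token AS Trigram,
       src.ID
FROM (VALUES (1, 'The Rain Man'),
             (2, 'The Electric Slide'),
             (3, 'Transformers')) AS src(ID, Name)
CROSS APPLY dbo.nGrams8K(REPLACE(LOWER(src.Name), ' ', '_'), 3, 0) AS ng
WHERE DATALENGTH(ng.token) = 3      -- drop the short tokens produced at the end of the string
GROUP BY ng.token, src.ID           -- one row per distinct trigram per ID
ORDER BY ng.token, src.ID;
```

With padding switched off, the function still returns a few tokens shorter than n at the tail of each string, which is why the DATALENGTH filter is there.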
You'd have to repeat this statement:
insert into trigram_table ( Trigram, ID )
select substr( translate( lower( Name ), ' ', '_' ), :X, :N ),
ID
from db_table
for all :X from 1 to Len(Name) + 1 - :N
You will also have to extend the translate function for all the other special characters you'd want to convert to an underscore. Right now it's just translating a blank into an underscore.
For performance you could do the translate and lower functions on the Trigram column in a last pass on the trigram_table so you're not doing those functions for each :X.
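Instead of repeating the INSERT once per :X, the start positions can come from a set of numbers joined to the source rows. The sketch below is a T-SQL adaptation rather than the dialect used above: SUBSTRING and REPLACE stand in for substr and translate, table variables stand in for db_table and trigram_table, and the numbers CTE (1 to 100) assumes names of at most 100 characters.

```sql
-- Stand-ins for the answer's db_table and trigram_table (placeholders).
DECLARE @db_table TABLE (ID int, Name varchar(100));
DECLARE @trigram_table TABLE (Trigram varchar(10), ID int);
DECLARE @N int = 3;   -- n-gram length (:N in the answer)

INSERT INTO @db_table (ID, Name)
VALUES (1, 'The Rain Man'), (2, 'The Electric Slide'), (3, 'Transformers');

WITH E1(N) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) v(N)),
     numbers(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
                    FROM E1 a CROSS JOIN E1 b)          -- 1..100
INSERT INTO @trigram_table (Trigram, ID)
SELECT DISTINCT
       SUBSTRING(REPLACE(LOWER(d.Name), ' ', '_'), x.N, @N),
       d.ID
FROM @db_table AS d
JOIN numbers AS x
  ON x.N <= LEN(REPLACE(d.Name, ' ', '_')) + 1 - @N;    -- every start position :X

SELECT Trigram, ID FROM @trigram_table ORDER BY Trigram, ID;
```

The DISTINCT keeps one row per trigram per ID, which matches the grouped lookup table the question asks for.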