T-SQL - Count unique characters in a variable - sql

Goal: To count # of distinct characters in a variable the fastest way possible.
DECLARE #String1 NVARCHAR(4000) = N'1A^' ; --> output = 3
DECLARE #String2 NVARCHAR(4000) = N'11' ; --> output = 1
DECLARE #String3 NVARCHAR(4000) = N'*' ; --> output = 1
DECLARE #String4 NVARCHAR(4000) = N'*A-zz' ; --> output = 4
I've found some posts in regards to distinct characters in a column, grouped by characters, and etc, but not one for this scenario.

Using NGrams8K as a base, you can change the input parameter to a nvarchar(4000) and tweak the DATALENGTH, making NGramsN4K. Then you can use that to split the string into individual characters and count them:
SELECT COUNT(DISTINCT NG.token) AS DistinctCharacters
FROM dbo.NGramsN4k(#String1,1) NG;
Altered NGrams8K:
IF OBJECT_ID('dbo.NGramsN4k','IF') IS NOT NULL DROP FUNCTION dbo.NGramsN4k;
GO
CREATE FUNCTION dbo.NGramsN4k
(
#string nvarchar(4000), -- Input string
#N int -- requested token size
)
/****************************************************************************************
Purpose:
A character-level N-Grams function that outputs a contiguous stream of #N-sized tokens
based on an input string (#string). Accepts strings up to 8000 varchar characters long.
For more information about N-Grams see: http://en.wikipedia.org/wiki/N-gram.
Compatibility:
SQL Server 2008+, Azure SQL Database
Syntax:
--===== Autonomous
SELECT position, token FROM dbo.NGrams8k(#string,#N);
--===== Against a table using APPLY
SELECT s.SomeID, ng.position, ng.token
FROM dbo.SomeTable s
CROSS APPLY dbo.NGrams8K(s.SomeValue,#N) ng;
Parameters:
#string = The input string to split into tokens.
#N = The size of each token returned.
Returns:
Position = bigint; the position of the token in the input string
token = varchar(8000); a #N-sized character-level N-Gram token
Developer Notes:
1. NGrams8k is not case sensitive
2. Many functions that use NGrams8k will see a huge performance gain when the optimizer
creates a parallel execution plan. One way to get a parallel query plan (if the
optimizer does not chose one) is to use make_parallel by Adam Machanic which can be
found here:
sqlblog.com/blogs/adam_machanic/archive/2013/07/11/next-level-parallel-plan-porcing.aspx
3. When #N is less than 1 or greater than the datalength of the input string then no
tokens (rows) are returned. If either #string or #N are NULL no rows are returned.
This is a debatable topic but the thinking behind this decision is that: because you
can't split 'xxx' into 4-grams, you can't split a NULL value into unigrams and you
can't turn anything into NULL-grams, no rows should be returned.
For people who would prefer that a NULL input forces the function to return a single
NULL output you could add this code to the end of the function:
UNION ALL
SELECT 1, NULL
WHERE NOT(#N > 0 AND #N <= DATALENGTH(#string)) OR (#N IS NULL OR #string IS NULL)
4. NGrams8k can also be used as a tally table with the position column being your "N"
row. To do so use REPLICATE to create an imaginary string, then use NGrams8k to split
it into unigrams then only return the position column. NGrams8k will get you up to
8000 numbers. There will be no performance penalty for sorting by position in
ascending order but there is for sorting in descending order. To get the numbers in
descending order without forcing a sort in the query plan use the following formula:
N = <highest number>-position+1.
Pseudo Tally Table Examples:
--===== (1) Get the numbers 1 to 100 in ascending order:
SELECT N = position
FROM dbo.NGrams8k(REPLICATE(0,100),1);
--===== (2) Get the numbers 1 to 100 in descending order:
DECLARE #maxN int = 100;
SELECT N = #maxN-position+1
FROM dbo.NGrams8k(REPLICATE(0,#maxN),1)
ORDER BY position;
5. NGrams8k is deterministic. For more about deterministic functions see:
https://msdn.microsoft.com/en-us/library/ms178091.aspx
Usage Examples:
--===== Turn the string, 'abcd' into unigrams, bigrams and trigrams
SELECT position, token FROM dbo.NGrams8k('abcd',1); -- unigrams (#N=1)
SELECT position, token FROM dbo.NGrams8k('abcd',2); -- bigrams (#N=2)
SELECT position, token FROM dbo.NGrams8k('abcd',3); -- trigrams (#N=3)
--===== How many times the substring "AB" appears in each record
DECLARE #table TABLE(stringID int identity primary key, string varchar(100));
INSERT #table(string) VALUES ('AB123AB'),('123ABABAB'),('!AB!AB!'),('AB-AB-AB-AB-AB');
SELECT string, occurances = COUNT(*)
FROM #table t
CROSS APPLY dbo.NGrams8k(t.string,2) ng
WHERE ng.token = 'AB'
GROUP BY string;
----------------------------------------------------------------------------------------
Revision History:
Rev 00 - 20140310 - Initial Development - Alan Burstein
Rev 01 - 20150522 - Removed DQS N-Grams functionality, improved iTally logic. Also Added
conversion to bigint in the TOP logic to remove implicit conversion
to bigint - Alan Burstein
Rev 03 - 20150909 - Added logic to only return values if #N is greater than 0 and less
than the length of #string. Updated comment section. - Alan Burstein
Rev 04 - 20151029 - Added ISNULL logic to the TOP clause for the #string and #N
parameters to prevent a NULL string or NULL #N from causing "an
improper value" being passed to the TOP clause. - Alan Burstein
****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH
L1(N) AS
(
SELECT 1
FROM (VALUES -- 90 NULL values used to create the CTE Tally table
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)
) t(N)
),
iTally(N) AS -- my cte Tally table
(
SELECT TOP(ABS(CONVERT(BIGINT,((DATALENGTH(ISNULL(#string,N''))/2)-(ISNULL(#N,1)-1)),0)))
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) -- Order by a constant to avoid a sort
FROM L1 a CROSS JOIN L1 b -- cartesian product for 8100 rows (90^2)
)
SELECT
position = N, -- position of the token in the string(s)
token = SUBSTRING(#string,CAST(N AS int),#N) -- the #N-Sized token
FROM iTally
WHERE #N > 0 AND #N <= (DATALENGTH(#string)/2); -- Protection against bad parameter values

Here is another alternative using the power of the tally table. It has been called the "Swiss Army Knife of T-SQL". I keep a tally table as a view on my system which makes it insanely fast.
create View [dbo].[cteTally] as
WITH
E1(N) AS (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
select N from cteTally
Now we can use that tally anytime we need it, like for this exercise.
declare #Something table
(
String1 nvarchar(4000)
)
insert #Something values
(N'1A^')
, (N'11')
, (N'*')
, (N'*A-zz')
select count(distinct substring(s.String1, t.N, 1))
, s.String1
from #Something s
join cteTally t on t.N <= len(s.String1)
group by s.String1
To be honest I don't know this would be any faster than Larnu's usage of NGrams but testing on a large table would be fun to see.
----- EDIT -----
Thanks to Shnugo for the idea. Using a cross apply to a correlated subquery here is actually quite an improvement.
select count(distinct substring(s.String1, A.N, 1))
, s.String1
from #Something s
CROSS APPLY (SELECT TOP(LEN(s.String1)) t.N FROM cteTally t) A(N)
group by s.String1
The reason this is so much faster is that this is no longer using a triangular join which can really be painfully slow. I did also switch out the view with an indexed physical tally table. The improvement there was noticeable on larger datasets but not nearly as big as using the cross apply.
If you want to read more about triangular joins and why we should avoid them Jeff Moden has a great article on the topic. https://www.sqlservercentral.com/articles/hidden-rbar-triangular-joins

Grab a copy of NGrams8k and you can do this:
DECLARE #String1 NVARCHAR(4000) = N'1A^' ; --> output = 3
DECLARE #String2 NVARCHAR(4000) = N'11' ; --> output = 1
DECLARE #String3 NVARCHAR(4000) = N'*' ; --> output = 1
DECLARE #String4 NVARCHAR(4000) = N'*A-zz' ; --> output = 4
SELECT s.String, Total = COUNT(DISTINCT ng.token)
FROM (VALUES(#String1),(#String2),(#String3),(#String4)) AS s(String)
CROSS APPLY dbo.NGrams8k(s.String,1) AS ng
GROUP BY s.String;
Returns:
String Total
-------- -----------
* 1
*A-zz 4
11 1
1A^ 3
UPDATED
Just a quick update based on #Larnu's post and comments. I did not notice that the OP was dealing with Unicode e.g. NVARCHAR. I created an NVARCHAR(4000) version here - similar to what #Larnu posted above. I just updated the return token to use Latin1_General_BIN collation.
SUBSTRING(#string COLLATE Latin1_General_BIN,CAST(N AS int),#N)
This returns the correct answer:
DECLARE #String5 NVARCHAR(4000) = N'ᡣᓡ'; --> output = 2
SELECT COUNT(DISTINCT ng.token)
FROM dbo.NGramsN4k(#String5,1) AS ng;
Without the collation in place you can use the what Larnu posted and get the right answer like this:
DECLARE #String5 NVARCHAR(4000) = N'ᡣᓡ'; --> output = 2
SELECT COUNT(DISTINCT UNICODE(ng.token))
FROM dbo.NGramsN4k(#String5,1) AS ng;
Here's my updated NGramsN4K function:
ALTER FUNCTION dbo.NGramsN4K
(
#string nvarchar(4000), -- Input string
#N int -- requested token size
)
/****************************************************************************************
Purpose:
A character-level N-Grams function that outputs a contiguous stream of #N-sized tokens
based on an input string (#string). Accepts strings up to 4000 nvarchar characters long.
For more information about N-Grams see: http://en.wikipedia.org/wiki/N-gram.
Compatibility:
SQL Server 2008+, Azure SQL Database
Syntax:
--===== Autonomous
SELECT position, token FROM dbo.NGramsN4K(#string,#N);
--===== Against a table using APPLY
SELECT s.SomeID, ng.position, ng.token
FROM dbo.SomeTable s
CROSS APPLY dbo.NGramsN4K(s.SomeValue,#N) ng;
Parameters:
#string = The input string to split into tokens.
#N = The size of each token returned.
Returns:
Position = bigint; the position of the token in the input string
token = nvarchar(4000); a #N-sized character-level N-Gram token
Developer Notes:
1. NGramsN4K is not case sensitive
2. Many functions that use NGramsN4K will see a huge performance gain when the optimizer
creates a parallel execution plan. One way to get a parallel query plan (if the
optimizer does not chose one) is to use make_parallel by Adam Machanic which can be
found here:
sqlblog.com/blogs/adam_machanic/archive/2013/07/11/next-level-parallel-plan-porcing.aspx
3. When #N is less than 1 or greater than the datalength of the input string then no
tokens (rows) are returned. If either #string or #N are NULL no rows are returned.
This is a debatable topic but the thinking behind this decision is that: because you
can't split 'xxx' into 4-grams, you can't split a NULL value into unigrams and you
can't turn anything into NULL-grams, no rows should be returned.
For people who would prefer that a NULL input forces the function to return a single
NULL output you could add this code to the end of the function:
UNION ALL
SELECT 1, NULL
WHERE NOT(#N > 0 AND #N <= DATALENGTH(#string)) OR (#N IS NULL OR #string IS NULL);
4. NGramsN4K is deterministic. For more about deterministic functions see:
https://msdn.microsoft.com/en-us/library/ms178091.aspx
Usage Examples:
--===== Turn the string, 'abcd' into unigrams, bigrams and trigrams
SELECT position, token FROM dbo.NGramsN4K('abcd',1); -- unigrams (#N=1)
SELECT position, token FROM dbo.NGramsN4K('abcd',2); -- bigrams (#N=2)
SELECT position, token FROM dbo.NGramsN4K('abcd',3); -- trigrams (#N=3)
--===== How many times the substring "AB" appears in each record
DECLARE #table TABLE(stringID int identity primary key, string nvarchar(100));
INSERT #table(string) VALUES ('AB123AB'),('123ABABAB'),('!AB!AB!'),('AB-AB-AB-AB-AB');
SELECT string, occurances = COUNT(*)
FROM #table t
CROSS APPLY dbo.NGramsN4K(t.string,2) ng
WHERE ng.token = 'AB'
GROUP BY string;
------------------------------------------------------------------------------------------
Revision History:
Rev 00 - 20170324 - Initial Development - Alan Burstein
Rev 01 - 20191108 - Added Latin1_General_BIN collation to token output - Alan Burstein
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH
L1(N) AS
(
SELECT 1 FROM (VALUES -- 64 dummy values to CROSS join for 4096 rows
($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),
($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),
($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),
($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($)) t(N)
),
iTally(N) AS
(
SELECT
TOP (ABS(CONVERT(BIGINT,((DATALENGTH(ISNULL(#string,''))/2)-(ISNULL(#N,1)-1)),0)))
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) -- Order by a constant to avoid a sort
FROM L1 a CROSS JOIN L1 b -- cartesian product for 4096 rows (16^2)
)
SELECT
position = N, -- position of the token in the string(s)
token = SUBSTRING(#string COLLATE Latin1_General_BIN,CAST(N AS int),#N) -- the #N-Sized token
FROM iTally
WHERE #N > 0 -- Protection against bad parameter values:
AND #N <= (ABS(CONVERT(BIGINT,((DATALENGTH(ISNULL(#string,''))/2)-(ISNULL(#N,1)-1)),0)));

You can do this natively in SQL Server using CTE and some string manipuation:
DECLARE #TestString NVARCHAR(4000);
SET #TestString = N'*A-zz';
WITH letters AS
(
SELECT 1 AS Pos,
#TestString AS Stri,
MAX(LEN(#TestString)) AS MaxPos,
SUBSTRING(#TestString, 1, 1) AS [Char]
UNION ALL
SELECT Pos + 1,
#TestString,
MaxPos,
SUBSTRING(#TestString, Pos + 1, 1) AS [Char]
FROM letters
WHERE Pos + 1 <= MaxPos
)
SELECT COUNT(*) AS LetterCount
FROM (
SELECT UPPER([Char]) AS [Char]
FROM letters
GROUP BY [Char]
) a
Example outputs:
SET #TestString = N'*A-zz';
{execute code}
LetterCount = 4
SET #TestString = N'1A^';
{execute code}
LetterCount = 3
SET #TestString = N'1';
{execute code}
LetterCount = 1
SET #TestString = N'*';
{execute code}
LetterCount = 1

CREATE TABLE #STRINGS(
STRING1 NVARCHAR(4000)
)
INSERT INTO #STRINGS (
STRING1
)
VALUES
(N'1A^'),(N'11'),(N'*'),(N'*A-zz')
;WITH CTE_T AS (
SELECT DISTINCT
S.STRING1
,SUBSTRING(S.STRING1, V.number + 1, 1) AS Val
FROM
#STRINGS S
INNER JOIN
[master]..spt_values V
ON V.number < LEN(S.STRING1)
WHERE
V.[type] = 'P'
)
SELECT
T.STRING1
,COUNT(1) AS CNT
FROM
CTE_T T
GROUP BY
T.STRING1

Related

Replace consecutive duplicate letters with a single letter

How do I create a function in SQL Server 2017 that identifies when a string contains duplicate consecutive letters (a-z) and replaces those duplicate letters with a single instance of that letter?
Here are some examples of what should happen:
CompanyAAABCD -> CompanyABCD
CommpanyABYTTT -> CompanyABYT
Company11111 -> Company11111
alter function fn_RemoveDuplicateChar(#name varchar(200))
RETURNS VARCHAR(200)
as
begin
declare #strPosition int=1;
declare #strlen int=0;
declare #finalstr varchar(200)='';
declare #str varchar(200)='';
declare #fstr varchar(200)='';
select #strlen = (select len(#name))
while #strPosition<=#strlen
begin
select #fstr = SUBSTRING(#name, #strPosition, 1)
select #str = SUBSTRING(#finalstr, len(#finalstr), 1)
If #fstr <> #str or ( ISNUMERIC(#fstr)=1 and ISNUMERIC(#str)=1)
set #finalstr = #finalstr + #fstr
set #strPosition =#strPosition+1
end
return (select #finalstr)
end
go
select dbo.fn_RemoveDuplicateChar('CompanyAAABCD')
select dbo.fn_RemoveDuplicateChar('CommpanyABYTTT')
select dbo.fn_RemoveDuplicateChar('Company11111')
If you just wanted a single round of replacement (i.e. aaabbbb becomes aabb) then you could use this:
CREATE OR ALTER FUNCTION dbo.RemoveDuplicates (#value varchar(200))
RETURNS VARCHAR(200)
WITH SCHEMABINDING
AS
BEGIN
DECLARE #result varchar(200) = #value;
DECLARE #i int = 65;
-- a-z is ASCII 65-90
WHILE #i < 90
BEGIN
SET #result = REPLACE(#result, CHAR(#i) + CHAR(#i), CHAR(#i));
SET #i += 1
END;
RETURN #result;
END;
GO
But it seems you need a recursive replacement, so that every character that has the same before it is removed.
So we can use this version, which is similar to the other answer.
CREATE OR ALTER FUNCTION dbo.RemoveDuplicates (#value varchar(200))
RETURNS varchar(200)
WITH SCHEMABINDING
AS
BEGIN
DECLARE #c char(1);
DECLARE #cLast char(1) = LEFT(#value, 1);
DECLARE #result varchar(200) = #cLast;
DECLARE #strlen int = LEN(#value);
DECLARE #i int = 2;
WHILE (#i < #strlen)
BEGIN
SET #c = SUBSTRING(#value, #i, 1);
IF (#c <> #cLast)
SET #result += #c;
SET #i += 1
END;
RETURN #result;
END;
GO
I rewrote this as an inline Table-Valued Function, and found it significantly faster. Here are two versions of that, depending whether you can use STRING_AGG
CREATE OR ALTER FUNCTION dbo.RemoveDuplicatesXML (#value varchar(200))
RETURNS TABLE
WITH SCHEMABINDING
AS RETURN
(
WITH L1 AS (SELECT n FROM (VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) v(n)),
L2 AS (SELECT 1 n FROM L1 A CROSS JOIN L1 B),
Nums AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) rn FROM L2),
Chars AS (SELECT TOP(LEN(#value)) rn FROM Nums)
SELECT (
SELECT SUBSTRING(#value, rn, 1)
FROM Chars
WHERE rn = 1 OR SUBSTRING(#value, rn - 1, 1) <> SUBSTRING(#value, rn, 1)
ORDER BY rn
FOR XML PATH(''), TYPE
).value('text()[1]','nvarchar(max)') Result
);
GO
CREATE OR ALTER FUNCTION dbo.RemoveDuplicatesAGG (#value varchar(200))
RETURNS TABLE
WITH SCHEMABINDING
AS RETURN
(
WITH L1 AS (SELECT n FROM (VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) v(n)),
L2 AS (SELECT 1 n FROM L1 A CROSS JOIN L1 B),
Nums AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) rn FROM L2),
Chars AS (SELECT TOP(LEN(#value)) rn FROM Nums)
SELECT STRING_AGG(SUBSTRING(#value, rn, 1), '') WITHIN GROUP (ORDER BY rn) Result
FROM Chars
WHERE rn = 1 OR SUBSTRING(#value, rn - 1, 1) <> SUBSTRING(#value, rn, 1)
);
GO
This utilizes Itzik Ben-Gan's famous inline tally-table method to break out the string into single characters. You will need another CROSS JOIN or more (1) if you have more than 256 characters.
You have two methods to use this, the performance should be identical
Either as a scalar subquery
SELECT (SELECT * FROM RemoveDuplicatesAGG(t.MyString) Result
FROM myTable t
Or as an APPLY
SELECT d.Result
FROM myTable t
CROSS APPLY RemoveDuplicatesAGG(t.MyString) d
I know I'm a little late here but if performance is important then you can use the fastest "de-duplicator" in the game (the function, removeDupesExcept8K, is at the end of this post.) It takes an input string and a pattern representing what you want deduplicated; in the example below I'm saying "deduplicate anything that's not between A to Z.
DECLARE #string VARCHAR(8000) = 'AAABBBCCC999';
SELECT rd.NewString FROM samd.removeDupesExcept8K(#string, '[^A-Z]') AS rd;
Returns: ABC999
Let's compare fn_RemoveDuplicateChar from B.Muthamizhselvi above to the one at the end of the post.
Performance test:
--==== Test Data
SELECT TOP(10000)
ID = IDENTITY(INT,1,1),
String = REPLACE(REPLACE(REPLACE(NEWID(),'A',0),'B',0),'-','AAA')
INTO #strings
FROM sys.all_columns, sys.all_columns b;
GO
--==== Performance Test
PRINT CHAR(13)+'dbo.fn_RemoveDuplicateChar'+CHAR(13)+REPLICATE('-',90);
GO
DECLARE #st DATETIME = GETDATE(), #x VARCHAR(100);
SELECT #x = dbo.fn_RemoveDuplicateChar(s.String)
FROM #strings AS s
PRINT DATEDIFF(MS,#st,GETDATE());
GO 3
PRINT CHAR(13)+'samd.removeDupChar8K - Serial'+CHAR(13)+REPLICATE('-',90);
GO
DECLARE #st DATETIME = GETDATE(), #x VARCHAR(100);
SELECT #x = rd.NewString
FROM #strings AS s
CROSS APPLY samd.removeDupesExcept8K(s.String,'[^A-Z]') AS rd
OPTION (MAXDOP 1);
PRINT DATEDIFF(MS,#st,GETDATE());
GO 3
PRINT CHAR(13)+'samd.removeDupChar8K - Parallel'+CHAR(13)+REPLICATE('-',90);
GO
DECLARE #st DATETIME = GETDATE(), #x VARCHAR(100);
SELECT #x = rd.NewString
FROM #strings AS s
CROSS APPLY samd.removeDupesExcept8K(s.String,'[^A-Z]') AS rd
OPTION (QUERYTRACEON 8649);
PRINT DATEDIFF(MS,#st,GETDATE());
GO 3
As you'll see below, removeDupesExcept8K is twice as fast with a serial execution plan (one CPU) and more than 10X faster with a parallel plan. No need to test fn_RemoveDuplicateChar with a parallel plan, scalar UDFs can't go parallel unless inlined.
Test Results:
dbo.fn_RemoveDuplicateChar
------------------------------------------------------------------------------------------
Beginning execution loop
1110
1106
1093
Batch execution completed 3 times.
samd.removeDupChar8K - Serial
------------------------------------------------------------------------------------------
Beginning execution loop
563
560
593
Batch execution completed 3 times.
samd.removeDupChar8K - Parallel
------------------------------------------------------------------------------------------
Beginning execution loop
91
91
93
Batch execution completed 3 times.
The Function
IF OBJECT_ID('samd.removeDupesExcept8K') IS NOT NULL DROP FUNCTION samd.removeDupesExcept8K;
GO
CREATE FUNCTION samd.removeDupesExcept8K(#string varchar(8000), #preserved varchar(50))
/*****************************************************************************************
[Purpose]:
A purely set-based inline table valued function (iTVF) that accepts and input strings
(#string) and a pattern (#preserved) and removes all duplicate characters in #string that
do not match the #preserved pattern.
[Author]:
Alan Burstein
[Compatibility]:
SQL Server 2008+
[Syntax]:
--===== Autonomous use
SELECT rd.newString
FROM samd.removeDupesExcept8K(#string, #preserved) AS rd;
--===== Use against a table
SELECT st.SomeColumn1, rd.newString
FROM SomeTable AS st
CROSS
APPLY samd.removeDupesExcept8K(st.SomeColumn1, #preserved) AS rd;
Parameters:
#string = varchar(8000); Input string to be "cleaned"
#preserved = varchar(50); the pattern to preserve. For example, when #preserved='[0-9]'
only non-numeric characters will be removed
[Return Types]:
Inline Table Valued Function returns:
newString = varchar(8000); the string with duplicate characters removed
[Developer Notes]:
1. Requires NGrams8K. The code for NGrams8K can be found here:
http://www.sqlservercentral.com/articles/Tally+Table/142316/
2. This function is what is referred to as an "inline" scalar UDF." Technically it's an
inline table valued function (iTVF) but performs the same task as a scalar valued user
defined function (UDF); the difference is that it requires the APPLY table operator
to accept column values as a parameter. For more about "inline" scalar UDFs see this
article by SQL MVP Jeff Moden: http://www.sqlservercentral.com/articles/T-SQL/91724/
and for more about how to use APPLY see the this article by SQL MVP Paul White:
http://www.sqlservercentral.com/articles/APPLY/69953/.
Note the above syntax example and usage examples below to better understand how to
use the function. Although the function is slightly more complicated to use than a
scalar UDF it will yield notably better performance for many reasons. For example,
unlike a scalar UDFs or multi-line table valued functions, the inline scalar UDF does
not restrict the query optimizer's ability generate a parallel query execution plan.
3. removeDupesExcept8K is deterministic; for more about deterministic and nondeterministic
functions see https://msdn.microsoft.com/en-us/library/ms178091.aspx
[Examples]:
--===== 1. Examples...
DECLARE #string varchar(8000) = '!!!aa###bb!!!';
BEGIN
--===== 1.1. Remove all duplicate characters
SELECT f.newString
FROM samd.removeDupesExcept8K(#string,'') f; -- Returns: !a#b!
--===== 1.2. Remove all non-alphabetical duplicates
SELECT f.newString
FROM samd.removeDupesExcept8K(#string,'[a-z]') f; -- Returns: !aa#bb!
--===== 1.3. Remove only alphabetical duplicates
SELECT f.newString
FROM samd.removeDupesExcept8K(#string,'[^a-z]') f; -- Returns: !!!a###b!!!
END
---------------------------------------------------------------------------------------
[Revision History]:
Rev 00 - 20160720 - Initial Creation - Alan Burstein
****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT newString =
(
SELECT ng.token+''
FROM samd.NGrams8K(#string,1) AS ng
WHERE ng.token <> SUBSTRING(#string, ng.position+1,1) -- exclude chars = the next char
OR ng.token LIKE #preserved -- preserve characters that match the #preserved pattern
ORDER BY ng.position
FOR XML PATH(''),TYPE
).value('(text())[1]','varchar(8000)'); -- using Wayne Sheffield’s concatenation logic

SQL Query to replace characters by getting record list from other table

I wrote the below query to take an id as input, get DocumentID from Attachment table and then use that id to get Document name from Document table. Once i get document name i am removing anything but the character a-z and numbers. The below Query is working fine if only one Document id is being returned based on Entity id, how can i make it work if one entity id returns more than one Document ID. I also need to return all those new names as well.
ALTER PROCEDURE [dbo].[NormalizeDocumentFileName1]
-- Add the parameters for the stored procedure here
#id nvarchar(16),
#temp varchar(50) OUTPUT
AS
BEGIN
  Select #temp=Document.TheName from Document where id = (Select DocumentId from Attachment where EntityId = #id)
  Declare #KeepValues as varchar(50)
  Set #KeepValues = '%[^a-z0-9-_.]%'
  While PatIndex(#KeepValues, #temp) > 0
  Set #temp = Stuff(#temp, PatIndex(#KeepValues, #temp), 1, '')
   
END
Personally, I would go with a very different approach with this. I'm going to make use of Alan Burstein's NGrams8K.
You want to avoid the WHILE loop, it'll perform awfully, and go with a dataset approach. I'm going to use a Function instead:
CREATE FUNCTION NormalizeDocumentFileName (#FileName varchar(50) )
RETURNS TABLE
AS RETURN
WITH Tokens AS (
SELECT *
FROM dbo.NGrams8k (#FileName,1) --If you didn't create the function on the dbo schema, you'll need to change it.
WHERE token NOT LIKE '%[^a-z0-9-_.]%')
SELECT CONVERT(varchar(50),(SELECT Token + ''
FROM Tokens
ORDER BY Position
FOR XML PATH(''))) AS NormalFileName;
GO
Then you can do something as simple as:
SELECT D.YourColumn, NDFN.NormalFileName
FROM Document D
CROSS APPLY NormalizeDocumentFileName(D.TheName) NDFN;
Another set-based function for this type of thing is PatExclude8K the function does the same thing as what Larnu put together and is reusable. You would have to use the link to get the T-SQL code to create the function. The function works like this:
DECLARE #string varchar(50) = '$$$123___!!!555.ABC???';
SELECT * FROM dbo.patexclude8k(#string, '[^A-Za-z0-9-_.]');
Returns:
NewString
------------
123___555.ABC
Note that what LARNU put together will return entity references for XML characters such as "&", ">", etc. But it will perform better than Patexclude. If you don't expect to deal special XML characters you can use a slightly modified version that will perform relatively the same - here it is:
CREATE FUNCTION dbo.PatExclude8K_NXP
(
#String VARCHAR(8000),
#Pattern VARCHAR(50)
)
/*******************************************************************************
Purpose:
Given a string (#String) and a pattern (#Pattern) of characters to remove,
remove the patterned characters from the string.
Usage:
--===== Basic Syntax Example
SELECT CleanedString
FROM dbo.PatExclude8K_NXP(#String,#Pattern);
--===== Remove all but Alpha characters
SELECT CleanedString
FROM dbo.SomeTable st
CROSS APPLY dbo.PatExclude8K(st.SomeString,'%[^A-Za-z]%');
--===== Remove all but Numeric digits
SELECT CleanedString
FROM dbo.SomeTable st
CROSS APPLY dbo.PatExclude8K(st.SomeString,'%[^0-9]%');
Programmer Notes:
1. #Pattern is case sensitive (the function can be easily modified to make it so)
2. There is no need to include the "%" before and/or after your pattern since since we
are evaluating each character individually
Revision History:
Rev 00 - 20180508 Initial Development - Alan Burstein
*******************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
WITH
E1(N) AS (SELECT N FROM (VALUES (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) AS X(N)),
itally(N) AS
(
SELECT TOP(CONVERT(INT,LEN(#String),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM E1 T1 CROSS JOIN E1 T2 CROSS JOIN E1 T3 CROSS JOIN E1 T4
)
SELECT NewString =
(
SELECT SUBSTRING(#String,N,1)
FROM iTally
WHERE 0 = PATINDEX(#Pattern,SUBSTRING(#String COLLATE Latin1_General_BIN,N,1))
FOR XML PATH('')
);
Lastly, both NGrams8K and PatExclude perform quite a bit better when the optimizer chooses a parallel execution plan. To force a parallel plan you can use Make_parallel by Adam Machanic. Using Larnu's solution as an example you would force a parallel plan like so:
SELECT D.YourColumn, NDFN.NormalFileName
FROM Document D
CROSS APPLY NormalizeDocumentFileName(D.TheName) NDFN;
CROSS APPLY dbo.make_parallel();
Hmmm. You can dispense with the while loop and use a recursive CTE:
with cte as (
Select d.TheName, 0 as lev, d.TheName as orig_TheName
from Document d
where d.id = (Select DocumentId from Attachment where EntityId = #id)
union all
select Stuff(cte.thename, PatIndex(#KeepValues, cte.thename), 1, '') as DocumentId lev + 1, cte.orig_TheName
from cte
where PatIndex(#KeepValues, cte.thename) > 0
)
select theName
from (select theName, max(lev) over (partition by orig_thename) as max_lev
from cte
) x
where lev = max_lev

How to get all the two characters long substrings separated by dot (.) from an email address in SQL server. I want Scalar function

I have one email column that is having values like this 'claudio.passerini#uni.re.dit.mn.us'. I want to take two characters strings between dot (to check for the countries and states codes).
i want result like this
col1=re,mn,us
Solution
To do exactly what you've asked; i.e. pull back just the 2 char codes from within the email address's domain, you could use a function such as this:
create function dbo.fn_Get2AlphaCharCodesFromEmail
(
#email nvarchar(254) --max length of an email is 254: http://stackoverflow.com/questions/386294/what-is-the-maximum-length-of-a-valid-email-address
) returns nvarchar(254)
as
begin
declare #result nvarchar(254) = null
, #maxLen int = 254
;with cte(i, remainder,result) as
(
select cast(0 as int)
, cast('.' + substring(#email,charindex('#',#email)+1,#maxLen) + '.' as nvarchar(254))
, cast(null as nvarchar(254))
union all
select cast(i+1 as int)
, cast(substring(remainder,patindex('%.[A-Z][A-Z].%',remainder)+3,#maxLen)as nvarchar(254))
, cast(coalesce(result + ',','') + substring(remainder,patindex('%.[A-Z][A-Z].%',remainder)+1,2) as nvarchar(254))
from cte
where patindex('%.[A-Z][A-Z].%',remainder) > 0
)
select top 1 #result = result from cte order by i desc;
Return #result;
end
go
--demo
select dbo.fn_Get2AlphaCharCodesFromEmail ('claudio.passerini#uni.re.dit.mn.us')
--returns: re,mn,us
select dbo.fn_Get2AlphaCharCodesFromEmail ('claudio.passerini#uni.123.dit.mnx.usx')
--returns: NULL
Explanation
Create a function called fn_Get2AlphaCharCodesFromEmail in the schema dbo which takes a single parameter, #email which is a string of up to 254 characters, and returns a string of up to 254 characters.
create function dbo.fn_Get2AlphaCharCodesFromEmail
(
#email nvarchar(254)
) returns nvarchar(254)
as
begin
--... code that does the work goes here
end
declare the variables we'll be using later on.
#result holds the value we'll be returning from the function
#maxLen records the maximum length of an email; this makes it slightly easier should this length ever need to change; though not entirely simple since we have to specify the 254 length in our column & variable definitions later on anyway.
declare #result nvarchar(254) = null
, #maxLen int = 254
Now comes the interesting bit. We create a common table expression with 3 columns:
i is used to record which iteration each record was produced in; the highest value of i is the last record to be created.
remainder is used to hold the yet-to-be processed characters from the email.
result is used to record the 2 char codes; each new row adds another value to this column's comma separated values.
;with cte(i, remainder,result) as
(
--code to iterate through the email string, breaking it down, goes here
)
this gives us our first row in the cte "table".
The cast statements throughout this part are to ensure we have a consistent data type, as data types in a CTE are implicit, and not always correct
we initialise i (i.e. the first column) with value 0 to say that this is our first row (we could choose pretty much any value here; it doesn't matter
we initialise remainder (i.e. 2nd column) as the part of the email address which follows the # character; i.e. the email's domain.
we initialise result (i.e. 3rd column) as null; as we've not yet found a result (i.e. a 2 char string within the email's domain)
there is no from component as we're just getting data from the #email variable; no tables/views/etc are required.
select cast(0 as int)
, cast('.' + substring(#email,charindex('#',#email)+1,#maxLen) + '.' as nvarchar(254))
, cast(null as nvarchar(254))
union all is used to combing the first result(s) with the results of the next (recurring) statement. NB: The CTE code before this statement is run once to give initial values; the code after is run once for each new set of rows generated.
union all
The recurring code in the CTE is applied to new rows in the CTE until no new rows are generated.
i takes the value of the previous iteration's row's i incremented by 1.
select cast(i+1 as int)
remainder takes the previous iteration's remainder, and removes everything before (and including) the next 2 character code (result).
patindex('%.[A-Z][A-Z].%',remainder) returns a number giving the location of the a string containing a dot followed by 2 letters followed by a dot, occurring anywhere in the input string
, cast(substring(remainder,patindex('%.[A-Z][A-Z].%',remainder)+3,#maxLen)as nvarchar(254))
result uses the same logic as remainder, only it takes the 2 characters found, rather than everything after them. These characters are added on to the end of the previous iteartion's row's result value, separated by a comma.
, cast(coalesce(result + ',','') + substring(remainder,patindex('%.[A-Z][A-Z].%',remainder)+1,2) as nvarchar(254))
the from cte part just says that we're referencing the same "table" we're creating; i.e. this is how the recursion occurs
from cte
the where statement is used to prevent infinite recursion; i.e. once there are no more 2 char codes left in the remainder, stop looking.
where patindex('%.[A-Z][A-Z].%',remainder) > 0
Once we've found all the 2 char codes in the string, we know that the last row's result will contain the complete set; as such we assign this single row's value to the #result variable.
select top 1 #result = result
the from statement shows we're referencing the data we created in our with cte statement
from cte
the order by is used to determine which record comes first (i.e. which record is the top 1 record). We want it to be the last row generated by the CTE. Since we've been incrementing i by 1 each time, this last record will have the highest value of i, so by sorting by i desc (descending) that last generated row will be the row we get.
order by i desc;
Finally, we return the result generated above.
Return #result;
Alternative Approach
However, if you're trying to extract information from your emails, I'd recommend an alternate approach... have a list of values that you're looking for, and compare your email with that, without having to break apart the email address (beyond splitting on the # to ensure you're only checking the email's domain).
declare #countryCodes table (code nchar(2), name nvarchar(64)) --you'd use a real table for this; I'm just using a table variable so this demo's throwaway code
insert into #countryCodes (code, name)
values
('es','Spain')
,('fr','France')
,('uk','United Kingdom')
,('us','USA')
--etc.
--check a single mail
declare #mail nvarchar(256) = 'claudio.passerini#uni.re.dit.mn.us'
if exists (select top 1 1 from #countryCodes where '.' + substring(#mail,charindex('#',#mail)+1,256) + '.' like '%.' + code + '.%')
begin
select name from #countryCodes where '.' + substring(#mail,charindex('#',#mail)+1,256) + '.' like '%.' + code + '.%'
end
else
begin
select 'no results found'
end
--check a bunch of mails
declare #emailsToCheck table (email nvarchar(256))
insert into #emailsToCheck (email)
values
('claudio.passerini#uni.re.dit.mn.us')
,('someone#someplace.co.uk')
,('cant.see.me#never.never.land')
,('some.fr.address.hidden#france.not.in.this.bit')
select e.email, c.name
from #emailsToCheck e
left outer join #countryCodes c
on '.' + substring(email,charindex('#',email)+1,256) + '.' like '%.' + code + '.%'
order by e.email, c.name
If yo want individual columns you will need to pivot your data after splitting out your strings with a table valued function as per Marc's answer. If you are happy having them in rows, you can just use the select statement inside the brackets.
Query to get the data
declare #t table (Email nvarchar(50));
insert into #t values('claudio.passerini#uni.re.dit.mn.us'),('claudio.passerini#uni.ry.dit.mn.urg'),('claudio.passerini#uni.rn.dit.mn.uk');
select Email
,[1]
,[2]
,[3]
,[4]
,[5]
,[6]
from(
select t.Email
,s.Item
,row_number() over (partition by t.Email order by s.Item) as rn
from #t t
cross apply dbo.DelimitedSplit8K(t.Email,'.') s
where len(s.Item) = 2
) a
pivot
(
max(Item) for rn in([1],[2],[3],[4],[5],[6])
) pvt
Table valued function to split out the strings, courtesy of Jeff Moden
http://www.sqlservercentral.com/articles/Tally+Table/72993/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
ALTER FUNCTION [dbo].[DelimitedSplit8K]
--===== Define I/O parameters
(#pString VARCHAR(8000), #pDelimiter CHAR(1))
--WARNING!!! DO NOT USE MAX DATA-TYPES HERE! IT WILL KILL PERFORMANCE!
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
--===== "Inline" CTE Driven "Tally Table" produces values from 1 up to 10,000...
-- enough to cover VARCHAR(8000)
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
-- for both a performance gain and prevention of accidental "overruns"
SELECT TOP (ISNULL(DATALENGTH(#pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
),
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
SELECT 1 UNION ALL
SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(#pString,t.N,1) = #pDelimiter
),
cteLen(N1,L1) AS(--==== Return start and length (for use in substring)
SELECT s.N1,
ISNULL(NULLIF(CHARINDEX(#pDelimiter,#pString,s.N1),0)-s.N1,8000)
FROM cteStart s
)
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
Item = SUBSTRING(#pString, l.N1, l.L1)
FROM cteLen l
You can create your own function to split strings.
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE FUNCTION [dbo].[fnSplitString]
(
#string NVARCHAR(MAX),
#delimiter CHAR(1)
)
RETURNS #output TABLE(splitdata NVARCHAR(MAX)
)
BEGIN
set #delimiter = coalesce(#delimiter, dbo.cSeparador());
DECLARE #start INT, #end INT
SELECT #start = 1, #end = CHARINDEX(#delimiter, #string)
WHILE #start < LEN(#string) + 1 BEGIN
IF #end = 0
SET #end = LEN(#string) + 1
INSERT INTO #output (splitdata)
VALUES(SUBSTRING(#string, #start, #end - #start))
SET #start = #end + 1
SET #end = CHARINDEX(#delimiter, #string, #start)
END
RETURN
END
Using this function you can get all your country&state codes :
select splitdata from dbo.fnSplitString('claudio.passerini#uni.re.dit.mn.us', '.')
where len(splitdata) = 2
You can modify that query to concatenate the result on a single string :
SELECT
STUFF((SELECT ',' + splitdata
FROM dbo.fnSplitString('claudio.passerini#uni.re.dit.mn.us', '.')
WHERE len(splitdata) = 2
FOR XML PATH('')), 1, 1, '')
Here is how you put it into an scalar function :
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE FUNCTION [dbo].[fnCountryCodes](#email nvarchar(max)) returns nvarchar(max)
AS
BEGIN
RETURN (SELECT
STUFF((SELECT ',' + splitdata
FROM dbo.fnSplitString(#email, '.')
WHERE len(splitdata) = 2
FOR XML PATH('')), 1, 1, ''));
END
You call it like this :
select dbo.fnCountryCodes('claudio.passerini#uni.re.dit.mn.us')
Alternatively you can create a table-valued function that returns all the 2 characters long substrings from the domain of a mail address :
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE FUNCTION [dbo].[fnCountryCodes] (#email NVARCHAR(MAX))
RETURNS #output TABLE(subdomain1 nvarchar(2), subdomain2 nvarchar(2), subdomain3 nvarchar(2), subdomain4 nvarchar(2), subdomain5 nvarchar(2))
as
BEGIN
DECLARE #subdomain1 nvarchar(2);
DECLARE #subdomain2 nvarchar(2);
DECLARE #subdomain3 nvarchar(2);
DECLARE #subdomain4 nvarchar(2);
DECLARE #subdomain5 nvarchar(2);
DECLARE CURSOR_SUBDOMAINS CURSOR FOR select splitdata from dbo.fnSplitString(#email, '.') where len(splitdata) = 2;
OPEN CURSOR_SUBDOMAINS;
FETCH NEXT FROM CURSOR_SUBDOMAINS INTO #subdomain1;
FETCH NEXT FROM CURSOR_SUBDOMAINS INTO #subdomain2;
FETCH NEXT FROM CURSOR_SUBDOMAINS INTO #subdomain3;
FETCH NEXT FROM CURSOR_SUBDOMAINS INTO #subdomain4;
FETCH NEXT FROM CURSOR_SUBDOMAINS INTO #subdomain5;
CLOSE CURSOR_SUBDOMAINS;
DEALLOCATE CURSOR_SUBDOMAINS;
INSERT INTO #output (subdomain1, subdomain2, subdomain3, subdomain4, subdomain5)
values (#subdomain1, #subdomain2, #subdomain3, #subdomain4, #subdomain5)
RETURN
END
You use it like that :
select * from dbo.fnCountryCodes('claudio.passerini#uni.re.dit.mn.us')

How I can replace odd patterns inside a string?

I'm in the process of creating a temporary procedure in SQL because I have a value of a table which is written in markdown, so it appear as rendered HTML in the web browser (markdown to HTML conversion).
String of the column currently look like this:
Questions about **general computing hardware and software** are off-topic for Stack Overflow unless they directly involve tools used primarily for programming. You may be able to get help on [Super User](http://superuser.com/about)
I'm currently working with bold and italic text. This mean (in the case of bold text) I will need to replace odd N times the pattern**with<b>and even times with</b>.
I saw replace() but it perform the replacement on all the patterns of the string.
So How I can replace a sub-string only if it is odd or only it is even?
Update: Some peoples wonder what schemas I'm using so just take a look here.
One more extra if you want: The markdown style hyperlink to html hyperlink doesn't look so simple.
Using theSTUFFfunction and a simpleWHILEloop:
CREATE FUNCTION dbo.fn_OddEvenReplace(#text nvarchar(500),
#textToReplace nvarchar(10),
#oddText nvarchar(10),
#evenText nvarchar(500))
RETURNS varchar(max)
AS
BEGIN
DECLARE #counter tinyint
SET #counter = 1
DECLARE #switchText nvarchar(10)
WHILE CHARINDEX(#textToReplace, #text, 1) > 0
BEGIN
SELECT #text = STUFF(#text,
CHARINDEX(#textToReplace, #text, 1),
LEN(#textToReplace),
IIF(#counter%2=0,#evenText,#oddText)),
#counter = #counter + 1
END
RETURN #text
END
And you can use it like this:
SELECT dbo.fn_OddEvenReplace(column, '**', '<b>', '</b>')
FROM table
UPDATE:
This is re-written as an SP:
CREATE PROC dbo.##sp_OddEvenReplace #text nvarchar(500),
#textToReplace nvarchar(10),
#oddText nvarchar(10),
#evenText nvarchar(10),
#returnText nvarchar(500) output
AS
BEGIN
DECLARE #counter tinyint
SET #counter = 1
DECLARE #switchText nvarchar(10)
WHILE CHARINDEX(#textToReplace, #text, 1) > 0
BEGIN
SELECT #text = STUFF(#text,
CHARINDEX(#textToReplace, #text, 1),
LEN(#textToReplace),
IIF(#counter%2=0,#evenText,#oddText)),
#counter = #counter + 1
END
SET #returnText = #text
END
GO
And to execute:
DECLARE #returnText nvarchar(500)
EXEC dbo.##sp_OddEvenReplace '**a** **b** **c**', '**', '<b>', '</b>', #returnText output
SELECT #returnText
As per OP's request I have modified my earlier answer to perform as a temporary stored procedure. I have left my earlier answer as I believe the usage against a table of strings to be useful also.
If a Tally (or Numbers) table is known to already exist with at least 8000 values, then the marked section of the CTE can be omitted and the CTE reference tally replaced with the name of the existing Tally table.
create procedure #HtmlTagExpander(
#InString varchar(8000)
,#OutString varchar(8000) output
)as
begin
declare #Delimiter char(2) = '**';
create table #t(
StartLocation int not null
,EndLocation int not null
,constraint PK unique clustered (StartLocation desc)
);
with
-- vvv Only needed in absence of Tally table vvv
E1(N) as (
select 1 from (values
(1),(1),(1),(1),(1),
(1),(1),(1),(1),(1)
) E1(N)
), --10E+1 or 10 rows
E2(N) as (select 1 from E1 a cross join E1 b), --10E+2 or 100 rows
E4(N) As (select 1 from E2 a cross join E2 b), --10E+4 or 10,000 rows max
tally(N) as (select row_number() over (order by (select null)) from E4),
-- ^^^ Only needed in absence of Tally table ^^^
Delimiter as (
select len(#Delimiter) as Length,
len(#Delimiter)-1 as Offset
),
cteTally(N) AS (
select top (isnull(datalength(#InString),0))
row_number() over (order by (select null))
from tally
),
cteStart(N1) AS
select
t.N
from cteTally t cross join Delimiter
where substring(#InString, t.N, Delimiter.Length) = #Delimiter
),
cteValues as (
select
TagNumber = row_number() over(order by N1)
,Location = N1
from cteStart
),
HtmlTagSpotter as (
select
TagNumber
,Location
from cteValues
),
tags as (
select
Location = f.Location
,IsOpen = cast((TagNumber % 2) as bit)
,Occurrence = TagNumber
from HtmlTagSpotter f
)
insert #t(StartLocation,EndLocation)
select
prev.Location
,data.Location
from tags data
join tags prev
on prev.Occurrence = data.Occurrence - 1
and prev.IsOpen = 1;
set #outString = #Instring;
update this
set #outString = stuff(stuff(#outString,this.EndLocation, 2,'</b>')
,this.StartLocation,2,'<b>')
from #t this with (tablockx)
option (maxdop 1);
end
go
Invoked like this:
declare #InString varchar(8000)
,#OutString varchar(8000);
set #inString = 'Questions about **general computing hardware and software** are off-topic **for Stack Overflow.';
exec #HtmlTagExpander #InString,#OutString out; select #OutString;
set #inString = 'Questions **about** general computing hardware and software **are off-topic** for Stack Overflow.';
exec #HtmlTagExpander #InString,#OutString out; select #OutString;
go
drop procedure #HtmlTagExpander;
go
It yields as output:
Questions about <b>general computing hardware and software</b> are off-topic **for Stack Overflow.
Questions <b>about</b> general computing hardware and software <b>are off-topic</b> for Stack Overflow.
One option is to use a Regular Expression as it makes replacing such patterns very simple. RegEx functions are not built into SQL Server so you need to use SQL CLR, either compiled by you or from an existing library.
For this example I will use the SQL# (SQLsharp) library (which I am the author of) but the RegEx functions are available in the Free version.
SELECT SQL#.RegEx_Replace
(
N'Questions about **general computing hardware and software** are off-topic\
for Stack Overflow unless **they** directly involve tools used primarily for\
**programming. You may be able to get help on [Super User]\
(https://superuser.com/about)', -- #ExpressionToValidate
N'\*\*([^\*]*)\*\*', -- #RegularExpression
N'<b>$1</b>', -- #Replacement
-1, -- #Count (-1 = all)
1, - #StartAt
'IgnoreCase' -- #RegEx options
);
The above pattern \*\*([^\*]*)\*\* just looks for anything surrounded by double-asterisks. In this case you don't need to worry about odd / even. It also means that you won't get a poorly-formed <b>-only tag if for some reason there is an extra ** in the string. I added two additional test cases to the original string: a complete set of ** around the word they and an unmatched set of ** just before the word programming. The output is:
Questions about <b>general computing hardware and software</b> are off-topicfor Stack Overflow unless <b>they</b> directly involve tools used primarily for **programming. You may be able to get help on [Super User](https://superuser.com/about)
which renders as:
Questions about general computing hardware and software are off-topicfor Stack Overflow unless they directly involve tools used primarily for **programming. You may be able to get help on Super User
This solution makes use of techniques described by Jeff Moden in this article on the Running Sum problem in SQL. This solution is lengthy, but by making use of the Quirky Update in SQL Server over a clustered index, holds the promise of being much more efficient over large data sets than cursor-based solutions.
Update - amended below to operate off a table of strings
Assuming the existence of a tally table created like this (with at least 8000 rows):
create table dbo.tally (
N int not null
,unique clustered (N desc)
);
go
with
E1(N) as (
select 1 from (values
(1),(1),(1),(1),(1),
(1),(1),(1),(1),(1)
) E1(N)
), --10E+1 or 10 rows
E2(N) as (select 1 from E1 a cross join E1 b), --10E+2 or 100 rows
E4(N) As (select 1 from E2 a cross join E2 b) --10E+4 or 10,000 rows max
insert dbo.tally(N)
select row_number() over (order by (select null)) from E4;
go
and a HtmlTagSpotter function defined like this:
create function dbo.HtmlTagSPotter(
#pString varchar(8000)
,#pDelimiter char(2))
returns table with schemabinding as
return
WITH
Delimiter as (
select len(#pDelimiter) as Length,
len(#pDelimiter)-1 as Offset
),
cteTally(N) AS (
select top (isnull(datalength(#pstring),0))
row_number() over (order by (select null))
from dbo.tally
),
cteStart(N1) AS (--==== Returns starting position of each "delimiter" )
select
t.N
from cteTally t cross join Delimiter
where substring(#pString, t.N, Delimiter.Length) = #pDelimiter
),
cteValues as (
select
ItemNumber = row_number() over(order by N1)
,Location = N1
from cteStart
)
select
ItemNumber
,Location
from cteValues
go
then running the following SQL will perform the required substitution. Note that the inner join at the end prevents any trailing "odd" tags from being converted:
create table #t(
ItemNo int not null
,Item varchar(8000) null
,StartLocation int not null
,EndLocation int not null
,constraint PK unique clustered (ItemNo,StartLocation desc)
);
with data(i,s) as ( select i,s from (values
(1,'Questions about **general computing hardware and software** are off-topic **for Stack Overflow.')
,(2,'Questions **about **general computing hardware and software** are off-topic **for Stack Overflow.')
--....,....1....,....2....,....3....,....4....,....5....,....6....,....7....,....8....,....9....,....0
)data(i,s)
),
tags as (
select
ItemNo = data.i
,Item = data.s
,Location = f.Location
,IsOpen = cast((TagNumber % 2) as bit)
,Occurrence = TagNumber
from data
cross apply dbo.HtmlTagSPotter(data.s,'**') f
)
insert #t(ItemNo,Item,StartLocation,EndLocation)
select
data.ItemNo
,data.Item
,prev.Location
,data.Location
from tags data
join tags prev
on prev.ItemNo = data.ItemNo
and prev.Occurrence = data.Occurrence - 1
and prev.IsOpen = 1
union all
select
i,s,8001,8002
from data
;
declare #ItemNo int
,#ThisStting varchar(8000);
declare #s varchar(8000);
update this
set #s = this.Item = case when this.StartLocation > 8000
then this.Item
else stuff(stuff(#s,this.EndLocation, 2,'</b>')
,this.StartLocation,2,'<b>')
end
from #t this with (tablockx)
option (maxdop 1);
select
Item
from (
select
Item
,ROW_NUMBER() over (partition by ItemNo order by StartLocation) as rn
from #t
) t
where rn = 1
go
yielding:
Item
------------------------------------------------------------------------------------------------------------
Questions about <b>general computing hardware and software</b> are off-topic **for Stack Overflow.
Questions <b>about </b>general computing hardware and software<b> are off-topic </b>for Stack Overflow.

Generating an n-gram table with an SQL query

I'm trying to implement a fuzzy search with JavaScript client side, to search a largish db (300 items roughly) of records contained in an SQL database. My constraint is that it is not possible to perform a live query on the database- I must generate "indexes" as flat files during a nightly batch job. And so, starting with a db that looks like this:
ID. NAME
1. The Rain Man
2. The Electric Slide
3. Transformers
I need to create within a single query something like this:
Trigram ID
the 1
the 2
he_ 1
he_ 2
e_r 1
_ra 1
rai 1
ain 1
in_ 1
n_m 1
_ma 1
man 1
e_e 2
_el 2
ele 2
lec 2
Etc etc, typos not withstanding. The rules here are that ''n' is the length of the strings in the first column, that only a-z and _ are valid characters, any other character being normalized to Lower case, or mapped to _, that a group by n-gram clause may be applied to the table. Thus, I would hope to gain a table that would allow me to quickly look up a particular n-gram and get a list of all the Ids of rows which contain that sequence. I'm not a clever enough SQL cookie to figure this problem out. Can you?
I created an T-SQL NGrams that works quite nicely; note the comments section for examples of how to use
CREATE FUNCTION dbo.nGrams8K
(
#string VARCHAR(8000),
#n TINYINT,
#pad BIT
)
/*
Created by: Alan Burstein
Created on: 3/10/2014
Updated on: 5/20/2014 changed the logic to use an "inline tally table"
9/10/2014 Added some more code examples in the comment section
9/30/2014 Added more code examples
10/27/2014 Small bug fix regarding padding
Use: Outputs a stream of tokens based on an input string.
Works just like mdq.nGrams; see http://msdn.microsoft.com/en-us/library/ff487027(v=sql.105).aspx.
n-gram defined:
In the fields of computational linguistics and probability,
an n-gram is a contiguous sequence of n items from a given
sequence of text or speech. The items can be phonemes, syllables,
letters, words or base pairs according to the application.
To better understand N-Grams see: http://en.wikipedia.org/wiki/N-gram
*/
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
WITH
E1(n) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t(n)),
E2(n) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b),
iTally(n) AS
(
SELECT TOP (LEN(#string)+#n) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM E2 a CROSS JOIN E2 b
),
NewString(NewString) AS
(
SELECT REPLICATE(CASE #pad WHEN 0 THEN '' ELSE ' ' END,#n-1)+#string+
REPLICATE(CASE #pad WHEN 0 THEN '' ELSE ' ' END,#n-1)
)
SELECT TOP ((#n)+LEN(#string))
n AS [sequence],
SUBSTRING(NewString,n,#n) AS token
FROM iTally
CROSS APPLY NewString
WHERE n < ((#n)+LEN(#string));
/*
------------------------------------------------------------
-- (1) Basic Use
-------------------------------------------------------------
;-- (A)basic "string to table":
SELECT [sequence], token
FROM dbo.nGrams8K('abcdefg',1,1);
-- (b) create "bi-grams" (pad bit off)
SELECT [sequence], token
FROM dbo.nGrams8K('abcdefg',2,0);
-- (c) create "tri-grams" (pad bit on)
SELECT [sequence], token
FROM dbo.nGrams8K('abcdefg',3,1);
-- (d) filter for only "tri-grams"
SELECT [sequence], token
FROM dbo.nGrams8K('abcdefg',3,1)
WHERE len(ltrim(token)) = 3;
-- note the query plan for each. The power is coming from an index
-- also note how many rows are produced: len(#string+(#n-1))
-- lastly, you can trim as needed when padding=1
------------------------------------------------------------
-- (2) With a variable
------------------------------------------------------------
-- note, in this example I am getting only the stuff that has three letters
DECLARE #string varchar(20) = 'abcdefg',
#tokenLen tinyint = 3;
SELECT [sequence], token
FROM dbo.nGrams8K('abcdefg',3,1)
WHERE len(ltrim(token)) = 3;
GO
------------------------------------------------------------
-- (3) An on-the-fly alphabet (this will come in handy in a moment)
------------------------------------------------------------
DECLARE #alphabet VARCHAR(26)='ABCDEFGHIJKLMNOPQRSTUVWXYZ';
SELECT [sequence], token
FROM dbo.nGrams8K(#alphabet,1,0);
GO
------------------------------------------------------------
-- (4) Character Count
------------------------------------------------------------
DECLARE #string VARCHAR(100)='The quick green fox jumps over the lazy dog and the lazy dog just laid there.',
#alphabet VARCHAR(26)='ABCDEFGHIJKLMNOPQRSTUVWXYZ';
SELECT a.token, COUNT(b.token) ttl
FROM dbo.nGrams8K(#alphabet,1,0) a
LEFT JOIN dbo.nGrams8K(#string,1,0) b ON a.token=b.token
GROUP BY a.token
ORDER BY a.token;
GO
------------------------------------------------------------
-- (5) Locate the start position of a search pattern
------------------------------------------------------------
;-- (A) note these queries:
DECLARE #string varchar(100)='THE QUICK Green FOX JUMPED OVER THE LAZY DOGS BACK';
-- (i)
SELECT * FROM dbo.nGrams8K(#string,1,0) a;
-- (ii) note this query:
SELECT * FROM dbo.nGrams8K(#string,1,0) a WHERE [token]=' ';
-- (B) and now the word count (#string included for presentation)
SELECT #string AS string,
count(*)+1 AS words
FROM dbo.nGrams8K(#string,1,0) a
WHERE [token]=' '
GO
------------------------------------------------------------
-- (6) search for the number of occurances of a word
------------------------------------------------------------
DECLARE #string VARCHAR(100)='The quick green fox jumps over the lazy dog and the lazy dog just laid there.',
#alphabet VARCHAR(26)='ABCDEFGHIJKLMNOPQRSTUVWXYZ',
#searchString VARCHAR(100)='The';
-- (5a) by location
SELECT sequence-(LEN(#searchstring)) AS location,
token AS searchString
FROM dbo.nGrams8K(#string,LEN(#searchstring+' ')+1,0) b
WHERE token=#searchString;
-- (2b) get total
SELECT #string AS string,
#searchString AS searchString,
COUNT(*) AS ttl
FROM dbo.nGrams8K(#string,LEN(#searchstring+' ')+1,0) b
WHERE token=#searchString;
------------------------------------------------------------
-- (7) Special SubstringBefore and SubstringAfter
------------------------------------------------------------
-- (7a) SubstringBeforeSSI (note: SSI = substringIndex)
ALTER FUNCTION dbo.SubstringBeforeSSI
(
#string varchar(1000),
#substring varchar(100),
#substring_index tinyint
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
WITH get_pos AS
(
SELECT rn = row_number() over (order by sequence), substring_index = sequence
FROM dbo.nGrams8K(#string,len(#substring),1)
WHERE token=#substring
)
SELECT newstring = substring(#string,1,substring_index-len(#substring))
FROM get_pos
WHERE rn=#substring_index;
GO
DECLARE #string varchar(1000)='10.0.1600.22',
#searchPattern varchar(100)='.',
#substring_index tinyint = 3;
SELECT * FROM dbo.SubstringBeforeSSI(#string,#searchPattern,#substring_index);
GO
-- (7b) SubstringBeforeSSI (note: SSI = substringIndex)
ALTER FUNCTION dbo.SubstringAfterSSI
(
#string varchar(1000),
#substring varchar(100),
#substring_index tinyint
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
WITH get_pos AS
(
SELECT rn = row_number() over (order by sequence), substring_index = sequence
FROM dbo.nGrams8K(#string,len(#substring),1)
WHERE token=#substring
)
SELECT newstring = substring(#string,substring_index+1,8000)
FROM get_pos
WHERE rn=#substring_index;
GO
DECLARE #string varchar(1000)='<notes id="1">blah, blah, blah</notes><notes id="2">More Notes</notes>',
#searchPattern varchar(100)='</notes>',
#substring_index tinyint = 1;
SELECT #string, *
FROM dbo.SubstringAfterSSI(#string,#searchPattern,#substring_index);
------------------------------------------------------------
-- (8) Strip non-numeric characters from a string
------------------------------------------------------------
-- (8a) create the function
ALTER FUNCTION StripNonNumeric_itvf(#OriginalText VARCHAR(8000))
RETURNS TABLE
--WITH SCHEMABINDING
AS
return
WITH ngrams AS
(
SELECT n = [sequence], c = token
FROM dbo.nGrams8K(#OriginalText,1,1)
),
clean_txt(CleanedText) AS
(
SELECT c+''
FROM ngrams
WHERE ascii(substring(#OriginalText,n,1)) BETWEEN 48 AND 57
FOR XML PATH('')
)
SELECT CleanedText
FROM clean_txt;
GO
-- (8b) use against a value or variable
SELECT CleanedText
FROM dbo.StripNonNumeric_itvf('value123');
-- (8c) use against a table
-- test harness:
IF OBJECT_ID('tempdb..#strings') IS NOT NULL DROP TABLE #strings;
WITH strings AS
(
SELECT TOP (100000) string = newid()
FROM sys.all_columns a CROSS JOIN sys.all_columns b
)
SELECT *
INTO #strings
FROM strings;
GO
-- query (returns 100K rows every 3 seconds on my pc):
SELECT CleanedText
FROM #strings
CROSS APPLY dbo.StripNonNumeric_itvf(string);
------------------------------------------------------------
-- (9) A couple complex String Algorithms
------------------------------------------------------------
-- (9a) hamming distance between two strings:
DECLARE #string1 varchar(8000) = 'xxxxyyyzzz',
#string2 varchar(8000) = 'xxxxyyzzzz';
SELECT string1 = #string1,
string2 = #string2,
hamming_distance = count(*)
FROM dbo.nGrams8K(#string1,1,0) s1
CROSS APPLY dbo.nGrams8K(#string2,1,0) s2
WHERE s1.sequence = s2.sequence
AND s1.token <> s2.token
GO
-- (9b) inner join between 2 strings
--(can be used to speed up other string metrics such as the longest common subsequence)
DECLARE #string1 varchar(100)='xxxx123yyyy456zzzz',
#string2 varchar(100)='xx789yy000zz';
WITH
s1(string1) AS
(
SELECT [token]+''
FROM dbo.nGrams8K(#string1,1,0)
WHERE charindex([token],#string2)<>0
ORDER BY [sequence]
FOR XML PATH('')
),
s2(string2) AS
(
SELECT [token]+''
FROM dbo.nGrams8K(#string2,1,0)
WHERE charindex([token],#string1)<>0
ORDER BY [sequence]
FOR XML PATH('')
)
SELECT string1, string2
FROM s1
CROSS APPLY s2;
------------------------------------------------------------
-- (10) Advanced Substring Metrics
------------------------------------------------------------
-- (10a) Identify common substrings and their location
DECLARE #string1 varchar(100) = 'xxx yyy zzz',
#string2 varchar(100) = 'xx yyy zz';
-- (i) review the two strings
SELECT str1 = #string1,
str2 = #string2;
-- (ii) the results
WITH
iTally AS
(
SELECT n
FROM dbo.tally t
WHERE n<= len(#string1)
),
distinct_tokens AS
(
SELECT ng1 = ng1.token, ng2 = ng2.token --= ltrim(ng1.token), ng2 = ltrim(ng2.token)
FROM itally
CROSS APPLY dbo.nGrams8K(#string1,n,1) ng1
CROSS APPLY dbo.nGrams8K(#string2,n,1) ng2
WHERE ng1.token=ng2.token
)
SELECT ss_txt = ng1,
ss_len = len(ng1),
str1_loc = charindex(ng1,#string1),
str2_loc = charindex(ng2,#string2)
FROM distinct_tokens
WHERE ng1<>'' AND charindex(ng1,#string1)+charindex(ng2,#string2)<>0
GROUP BY ng1, ng2
ORDER BY charindex(ng1,#string1), charindex(ng2,#string2), len(ng1);
-- (10b) Longest common substring function
-- (i) function
IF EXISTS
( SELECT * FROM INFORMATION_SCHEMA.ROUTINES
WHERE ROUTINE_SCHEMA='dbo' AND ROUTINE_NAME = 'lcss')
DROP FUNCTION dbo.lcss;
GO
CREATE FUNCTION dbo.lcss(#string1 varchar(100), #string2 varchar(100))
RETURNS TABLE
AS
RETURN
SELECT TOP (1) with ties token
FROM dbo.tally
CROSS APPLY dbo.nGrams8K(#string1,n,1)
WHERE n <= len(#string1)
AND charindex(token, #string2) > 0
ORDER BY len(token) DESC;
GO
-- (ii) example of use
DECLARE #string1 varchar(100) = '000xxxyyyzzz',
#string2 varchar(100) = '999xxyyyzaa';
SELECT string1 = #string1,
string2 = #string2,
token
FROM dbo.lcss(#string1, #string2);
*/
GO
You'd have to repeat this statement:
insert into trigram_table ( Trigram, ID )
select substr( translate( lower( Name ), ' ', '_' ), :X, :N ),
ID
from db_table
for all :X from 1 to Len(Name) + 1 - :N
You will also have to extend the translate function for all the other special characters you'd want to convert to an underscore. Right now it's just translating a blank into an underscore.
For performance you could do the translate and lower functions on the Trigram column in a last pass on the trigram_table so you're not doing those functions for each :X.