How I can replace odd patterns inside a string? - sql

I'm in the process of creating a temporary procedure in SQL because I have a value of a table which is written in markdown, so it appear as rendered HTML in the web browser (markdown to HTML conversion).
String of the column currently look like this:
Questions about **general computing hardware and software** are off-topic for Stack Overflow unless they directly involve tools used primarily for programming. You may be able to get help on [Super User](http://superuser.com/about)
I'm currently working with bold and italic text. This mean (in the case of bold text) I will need to replace odd N times the pattern**with<b>and even times with</b>.
I saw replace() but it perform the replacement on all the patterns of the string.
So How I can replace a sub-string only if it is odd or only it is even?
Update: Some peoples wonder what schemas I'm using so just take a look here.
One more extra if you want: The markdown style hyperlink to html hyperlink doesn't look so simple.

Using theSTUFFfunction and a simpleWHILEloop:
CREATE FUNCTION dbo.fn_OddEvenReplace(#text nvarchar(500),
#textToReplace nvarchar(10),
#oddText nvarchar(10),
#evenText nvarchar(500))
RETURNS varchar(max)
AS
BEGIN
DECLARE #counter tinyint
SET #counter = 1
DECLARE #switchText nvarchar(10)
WHILE CHARINDEX(#textToReplace, #text, 1) > 0
BEGIN
SELECT #text = STUFF(#text,
CHARINDEX(#textToReplace, #text, 1),
LEN(#textToReplace),
IIF(#counter%2=0,#evenText,#oddText)),
#counter = #counter + 1
END
RETURN #text
END
And you can use it like this:
SELECT dbo.fn_OddEvenReplace(column, '**', '<b>', '</b>')
FROM table
UPDATE:
This is re-written as an SP:
CREATE PROC dbo.##sp_OddEvenReplace #text nvarchar(500),
#textToReplace nvarchar(10),
#oddText nvarchar(10),
#evenText nvarchar(10),
#returnText nvarchar(500) output
AS
BEGIN
DECLARE #counter tinyint
SET #counter = 1
DECLARE #switchText nvarchar(10)
WHILE CHARINDEX(#textToReplace, #text, 1) > 0
BEGIN
SELECT #text = STUFF(#text,
CHARINDEX(#textToReplace, #text, 1),
LEN(#textToReplace),
IIF(#counter%2=0,#evenText,#oddText)),
#counter = #counter + 1
END
SET #returnText = #text
END
GO
And to execute:
DECLARE #returnText nvarchar(500)
EXEC dbo.##sp_OddEvenReplace '**a** **b** **c**', '**', '<b>', '</b>', #returnText output
SELECT #returnText

As per OP's request I have modified my earlier answer to perform as a temporary stored procedure. I have left my earlier answer as I believe the usage against a table of strings to be useful also.
If a Tally (or Numbers) table is known to already exist with at least 8000 values, then the marked section of the CTE can be omitted and the CTE reference tally replaced with the name of the existing Tally table.
create procedure #HtmlTagExpander(
#InString varchar(8000)
,#OutString varchar(8000) output
)as
begin
declare #Delimiter char(2) = '**';
create table #t(
StartLocation int not null
,EndLocation int not null
,constraint PK unique clustered (StartLocation desc)
);
with
-- vvv Only needed in absence of Tally table vvv
E1(N) as (
select 1 from (values
(1),(1),(1),(1),(1),
(1),(1),(1),(1),(1)
) E1(N)
), --10E+1 or 10 rows
E2(N) as (select 1 from E1 a cross join E1 b), --10E+2 or 100 rows
E4(N) As (select 1 from E2 a cross join E2 b), --10E+4 or 10,000 rows max
tally(N) as (select row_number() over (order by (select null)) from E4),
-- ^^^ Only needed in absence of Tally table ^^^
Delimiter as (
select len(#Delimiter) as Length,
len(#Delimiter)-1 as Offset
),
cteTally(N) AS (
select top (isnull(datalength(#InString),0))
row_number() over (order by (select null))
from tally
),
cteStart(N1) AS
select
t.N
from cteTally t cross join Delimiter
where substring(#InString, t.N, Delimiter.Length) = #Delimiter
),
cteValues as (
select
TagNumber = row_number() over(order by N1)
,Location = N1
from cteStart
),
HtmlTagSpotter as (
select
TagNumber
,Location
from cteValues
),
tags as (
select
Location = f.Location
,IsOpen = cast((TagNumber % 2) as bit)
,Occurrence = TagNumber
from HtmlTagSpotter f
)
insert #t(StartLocation,EndLocation)
select
prev.Location
,data.Location
from tags data
join tags prev
on prev.Occurrence = data.Occurrence - 1
and prev.IsOpen = 1;
set #outString = #Instring;
update this
set #outString = stuff(stuff(#outString,this.EndLocation, 2,'</b>')
,this.StartLocation,2,'<b>')
from #t this with (tablockx)
option (maxdop 1);
end
go
Invoked like this:
declare #InString varchar(8000)
,#OutString varchar(8000);
set #inString = 'Questions about **general computing hardware and software** are off-topic **for Stack Overflow.';
exec #HtmlTagExpander #InString,#OutString out; select #OutString;
set #inString = 'Questions **about** general computing hardware and software **are off-topic** for Stack Overflow.';
exec #HtmlTagExpander #InString,#OutString out; select #OutString;
go
drop procedure #HtmlTagExpander;
go
It yields as output:
Questions about <b>general computing hardware and software</b> are off-topic **for Stack Overflow.
Questions <b>about</b> general computing hardware and software <b>are off-topic</b> for Stack Overflow.

One option is to use a Regular Expression as it makes replacing such patterns very simple. RegEx functions are not built into SQL Server so you need to use SQL CLR, either compiled by you or from an existing library.
For this example I will use the SQL# (SQLsharp) library (which I am the author of) but the RegEx functions are available in the Free version.
SELECT SQL#.RegEx_Replace
(
N'Questions about **general computing hardware and software** are off-topic\
for Stack Overflow unless **they** directly involve tools used primarily for\
**programming. You may be able to get help on [Super User]\
(https://superuser.com/about)', -- #ExpressionToValidate
N'\*\*([^\*]*)\*\*', -- #RegularExpression
N'<b>$1</b>', -- #Replacement
-1, -- #Count (-1 = all)
1, - #StartAt
'IgnoreCase' -- #RegEx options
);
The above pattern \*\*([^\*]*)\*\* just looks for anything surrounded by double-asterisks. In this case you don't need to worry about odd / even. It also means that you won't get a poorly-formed <b>-only tag if for some reason there is an extra ** in the string. I added two additional test cases to the original string: a complete set of ** around the word they and an unmatched set of ** just before the word programming. The output is:
Questions about <b>general computing hardware and software</b> are off-topicfor Stack Overflow unless <b>they</b> directly involve tools used primarily for **programming. You may be able to get help on [Super User](https://superuser.com/about)
which renders as:
Questions about general computing hardware and software are off-topicfor Stack Overflow unless they directly involve tools used primarily for **programming. You may be able to get help on Super User

This solution makes use of techniques described by Jeff Moden in this article on the Running Sum problem in SQL. This solution is lengthy, but by making use of the Quirky Update in SQL Server over a clustered index, holds the promise of being much more efficient over large data sets than cursor-based solutions.
Update - amended below to operate off a table of strings
Assuming the existence of a tally table created like this (with at least 8000 rows):
create table dbo.tally (
N int not null
,unique clustered (N desc)
);
go
with
E1(N) as (
select 1 from (values
(1),(1),(1),(1),(1),
(1),(1),(1),(1),(1)
) E1(N)
), --10E+1 or 10 rows
E2(N) as (select 1 from E1 a cross join E1 b), --10E+2 or 100 rows
E4(N) As (select 1 from E2 a cross join E2 b) --10E+4 or 10,000 rows max
insert dbo.tally(N)
select row_number() over (order by (select null)) from E4;
go
and a HtmlTagSpotter function defined like this:
create function dbo.HtmlTagSPotter(
#pString varchar(8000)
,#pDelimiter char(2))
returns table with schemabinding as
return
WITH
Delimiter as (
select len(#pDelimiter) as Length,
len(#pDelimiter)-1 as Offset
),
cteTally(N) AS (
select top (isnull(datalength(#pstring),0))
row_number() over (order by (select null))
from dbo.tally
),
cteStart(N1) AS (--==== Returns starting position of each "delimiter" )
select
t.N
from cteTally t cross join Delimiter
where substring(#pString, t.N, Delimiter.Length) = #pDelimiter
),
cteValues as (
select
ItemNumber = row_number() over(order by N1)
,Location = N1
from cteStart
)
select
ItemNumber
,Location
from cteValues
go
then running the following SQL will perform the required substitution. Note that the inner join at the end prevents any trailing "odd" tags from being converted:
create table #t(
ItemNo int not null
,Item varchar(8000) null
,StartLocation int not null
,EndLocation int not null
,constraint PK unique clustered (ItemNo,StartLocation desc)
);
with data(i,s) as ( select i,s from (values
(1,'Questions about **general computing hardware and software** are off-topic **for Stack Overflow.')
,(2,'Questions **about **general computing hardware and software** are off-topic **for Stack Overflow.')
--....,....1....,....2....,....3....,....4....,....5....,....6....,....7....,....8....,....9....,....0
)data(i,s)
),
tags as (
select
ItemNo = data.i
,Item = data.s
,Location = f.Location
,IsOpen = cast((TagNumber % 2) as bit)
,Occurrence = TagNumber
from data
cross apply dbo.HtmlTagSPotter(data.s,'**') f
)
insert #t(ItemNo,Item,StartLocation,EndLocation)
select
data.ItemNo
,data.Item
,prev.Location
,data.Location
from tags data
join tags prev
on prev.ItemNo = data.ItemNo
and prev.Occurrence = data.Occurrence - 1
and prev.IsOpen = 1
union all
select
i,s,8001,8002
from data
;
declare #ItemNo int
,#ThisStting varchar(8000);
declare #s varchar(8000);
update this
set #s = this.Item = case when this.StartLocation > 8000
then this.Item
else stuff(stuff(#s,this.EndLocation, 2,'</b>')
,this.StartLocation,2,'<b>')
end
from #t this with (tablockx)
option (maxdop 1);
select
Item
from (
select
Item
,ROW_NUMBER() over (partition by ItemNo order by StartLocation) as rn
from #t
) t
where rn = 1
go
yielding:
Item
------------------------------------------------------------------------------------------------------------
Questions about <b>general computing hardware and software</b> are off-topic **for Stack Overflow.
Questions <b>about </b>general computing hardware and software<b> are off-topic </b>for Stack Overflow.

Related

Replace consecutive duplicate letters with a single letter

How do I create a function in SQL Server 2017 that identifies when a string contains duplicate consecutive letters (a-z) and replaces those duplicate letters with a single instance of that letter?
Here are some examples of what should happen:
CompanyAAABCD -> CompanyABCD
CommpanyABYTTT -> CompanyABYT
Company11111 -> Company11111
alter function fn_RemoveDuplicateChar(#name varchar(200))
RETURNS VARCHAR(200)
as
begin
declare #strPosition int=1;
declare #strlen int=0;
declare #finalstr varchar(200)='';
declare #str varchar(200)='';
declare #fstr varchar(200)='';
select #strlen = (select len(#name))
while #strPosition<=#strlen
begin
select #fstr = SUBSTRING(#name, #strPosition, 1)
select #str = SUBSTRING(#finalstr, len(#finalstr), 1)
If #fstr <> #str or ( ISNUMERIC(#fstr)=1 and ISNUMERIC(#str)=1)
set #finalstr = #finalstr + #fstr
set #strPosition =#strPosition+1
end
return (select #finalstr)
end
go
select dbo.fn_RemoveDuplicateChar('CompanyAAABCD')
select dbo.fn_RemoveDuplicateChar('CommpanyABYTTT')
select dbo.fn_RemoveDuplicateChar('Company11111')
If you just wanted a single round of replacement (i.e. aaabbbb becomes aabb) then you could use this:
CREATE OR ALTER FUNCTION dbo.RemoveDuplicates (#value varchar(200))
RETURNS VARCHAR(200)
WITH SCHEMABINDING
AS
BEGIN
DECLARE #result varchar(200) = #value;
DECLARE #i int = 65;
-- a-z is ASCII 65-90
WHILE #i < 90
BEGIN
SET #result = REPLACE(#result, CHAR(#i) + CHAR(#i), CHAR(#i));
SET #i += 1
END;
RETURN #result;
END;
GO
But it seems you need a recursive replacement, so that every character that has the same before it is removed.
So we can use this version, which is similar to the other answer.
CREATE OR ALTER FUNCTION dbo.RemoveDuplicates (#value varchar(200))
RETURNS varchar(200)
WITH SCHEMABINDING
AS
BEGIN
DECLARE #c char(1);
DECLARE #cLast char(1) = LEFT(#value, 1);
DECLARE #result varchar(200) = #cLast;
DECLARE #strlen int = LEN(#value);
DECLARE #i int = 2;
WHILE (#i < #strlen)
BEGIN
SET #c = SUBSTRING(#value, #i, 1);
IF (#c <> #cLast)
SET #result += #c;
SET #i += 1
END;
RETURN #result;
END;
GO
I rewrote this as an inline Table-Valued Function, and found it significantly faster. Here are two versions of that, depending whether you can use STRING_AGG
CREATE OR ALTER FUNCTION dbo.RemoveDuplicatesXML (#value varchar(200))
RETURNS TABLE
WITH SCHEMABINDING
AS RETURN
(
WITH L1 AS (SELECT n FROM (VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) v(n)),
L2 AS (SELECT 1 n FROM L1 A CROSS JOIN L1 B),
Nums AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) rn FROM L2),
Chars AS (SELECT TOP(LEN(#value)) rn FROM Nums)
SELECT (
SELECT SUBSTRING(#value, rn, 1)
FROM Chars
WHERE rn = 1 OR SUBSTRING(#value, rn - 1, 1) <> SUBSTRING(#value, rn, 1)
ORDER BY rn
FOR XML PATH(''), TYPE
).value('text()[1]','nvarchar(max)') Result
);
GO
CREATE OR ALTER FUNCTION dbo.RemoveDuplicatesAGG (#value varchar(200))
RETURNS TABLE
WITH SCHEMABINDING
AS RETURN
(
WITH L1 AS (SELECT n FROM (VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) v(n)),
L2 AS (SELECT 1 n FROM L1 A CROSS JOIN L1 B),
Nums AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) rn FROM L2),
Chars AS (SELECT TOP(LEN(#value)) rn FROM Nums)
SELECT STRING_AGG(SUBSTRING(#value, rn, 1), '') WITHIN GROUP (ORDER BY rn) Result
FROM Chars
WHERE rn = 1 OR SUBSTRING(#value, rn - 1, 1) <> SUBSTRING(#value, rn, 1)
);
GO
This utilizes Itzik Ben-Gan's famous inline tally-table method to break out the string into single characters. You will need another CROSS JOIN or more (1) if you have more than 256 characters.
You have two methods to use this, the performance should be identical
Either as a scalar subquery
SELECT (SELECT * FROM RemoveDuplicatesAGG(t.MyString) Result
FROM myTable t
Or as an APPLY
SELECT d.Result
FROM myTable t
CROSS APPLY RemoveDuplicatesAGG(t.MyString) d
I know I'm a little late here but if performance is important then you can use the fastest "de-duplicator" in the game (the function, removeDupesExcept8K, is at the end of this post.) It takes an input string and a pattern representing what you want deduplicated; in the example below I'm saying "deduplicate anything that's not between A to Z.
DECLARE #string VARCHAR(8000) = 'AAABBBCCC999';
SELECT rd.NewString FROM samd.removeDupesExcept8K(#string, '[^A-Z]') AS rd;
Returns: ABC999
Let's compare fn_RemoveDuplicateChar from B.Muthamizhselvi above to the one at the end of the post.
Performance test:
--==== Test Data
SELECT TOP(10000)
ID = IDENTITY(INT,1,1),
String = REPLACE(REPLACE(REPLACE(NEWID(),'A',0),'B',0),'-','AAA')
INTO #strings
FROM sys.all_columns, sys.all_columns b;
GO
--==== Performance Test
PRINT CHAR(13)+'dbo.fn_RemoveDuplicateChar'+CHAR(13)+REPLICATE('-',90);
GO
DECLARE #st DATETIME = GETDATE(), #x VARCHAR(100);
SELECT #x = dbo.fn_RemoveDuplicateChar(s.String)
FROM #strings AS s
PRINT DATEDIFF(MS,#st,GETDATE());
GO 3
PRINT CHAR(13)+'samd.removeDupChar8K - Serial'+CHAR(13)+REPLICATE('-',90);
GO
DECLARE #st DATETIME = GETDATE(), #x VARCHAR(100);
SELECT #x = rd.NewString
FROM #strings AS s
CROSS APPLY samd.removeDupesExcept8K(s.String,'[^A-Z]') AS rd
OPTION (MAXDOP 1);
PRINT DATEDIFF(MS,#st,GETDATE());
GO 3
PRINT CHAR(13)+'samd.removeDupChar8K - Parallel'+CHAR(13)+REPLICATE('-',90);
GO
DECLARE #st DATETIME = GETDATE(), #x VARCHAR(100);
SELECT #x = rd.NewString
FROM #strings AS s
CROSS APPLY samd.removeDupesExcept8K(s.String,'[^A-Z]') AS rd
OPTION (QUERYTRACEON 8649);
PRINT DATEDIFF(MS,#st,GETDATE());
GO 3
As you'll see below, removeDupesExcept8K is twice as fast with a serial execution plan (one CPU) and more than 10X faster with a parallel plan. No need to test fn_RemoveDuplicateChar with a parallel plan, scalar UDFs can't go parallel unless inlined.
Test Results:
dbo.fn_RemoveDuplicateChar
------------------------------------------------------------------------------------------
Beginning execution loop
1110
1106
1093
Batch execution completed 3 times.
samd.removeDupChar8K - Serial
------------------------------------------------------------------------------------------
Beginning execution loop
563
560
593
Batch execution completed 3 times.
samd.removeDupChar8K - Parallel
------------------------------------------------------------------------------------------
Beginning execution loop
91
91
93
Batch execution completed 3 times.
The Function
IF OBJECT_ID('samd.removeDupesExcept8K') IS NOT NULL DROP FUNCTION samd.removeDupesExcept8K;
GO
CREATE FUNCTION samd.removeDupesExcept8K(#string varchar(8000), #preserved varchar(50))
/*****************************************************************************************
[Purpose]:
A purely set-based inline table valued function (iTVF) that accepts and input strings
(#string) and a pattern (#preserved) and removes all duplicate characters in #string that
do not match the #preserved pattern.
[Author]:
Alan Burstein
[Compatibility]:
SQL Server 2008+
[Syntax]:
--===== Autonomous use
SELECT rd.newString
FROM samd.removeDupesExcept8K(#string, #preserved) AS rd;
--===== Use against a table
SELECT st.SomeColumn1, rd.newString
FROM SomeTable AS st
CROSS
APPLY samd.removeDupesExcept8K(st.SomeColumn1, #preserved) AS rd;
Parameters:
#string = varchar(8000); Input string to be "cleaned"
#preserved = varchar(50); the pattern to preserve. For example, when #preserved='[0-9]'
only non-numeric characters will be removed
[Return Types]:
Inline Table Valued Function returns:
newString = varchar(8000); the string with duplicate characters removed
[Developer Notes]:
1. Requires NGrams8K. The code for NGrams8K can be found here:
http://www.sqlservercentral.com/articles/Tally+Table/142316/
2. This function is what is referred to as an "inline" scalar UDF." Technically it's an
inline table valued function (iTVF) but performs the same task as a scalar valued user
defined function (UDF); the difference is that it requires the APPLY table operator
to accept column values as a parameter. For more about "inline" scalar UDFs see this
article by SQL MVP Jeff Moden: http://www.sqlservercentral.com/articles/T-SQL/91724/
and for more about how to use APPLY see the this article by SQL MVP Paul White:
http://www.sqlservercentral.com/articles/APPLY/69953/.
Note the above syntax example and usage examples below to better understand how to
use the function. Although the function is slightly more complicated to use than a
scalar UDF it will yield notably better performance for many reasons. For example,
unlike a scalar UDFs or multi-line table valued functions, the inline scalar UDF does
not restrict the query optimizer's ability generate a parallel query execution plan.
3. removeDupesExcept8K is deterministic; for more about deterministic and nondeterministic
functions see https://msdn.microsoft.com/en-us/library/ms178091.aspx
[Examples]:
--===== 1. Examples...
DECLARE #string varchar(8000) = '!!!aa###bb!!!';
BEGIN
--===== 1.1. Remove all duplicate characters
SELECT f.newString
FROM samd.removeDupesExcept8K(#string,'') f; -- Returns: !a#b!
--===== 1.2. Remove all non-alphabetical duplicates
SELECT f.newString
FROM samd.removeDupesExcept8K(#string,'[a-z]') f; -- Returns: !aa#bb!
--===== 1.3. Remove only alphabetical duplicates
SELECT f.newString
FROM samd.removeDupesExcept8K(#string,'[^a-z]') f; -- Returns: !!!a###b!!!
END
---------------------------------------------------------------------------------------
[Revision History]:
Rev 00 - 20160720 - Initial Creation - Alan Burstein
****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT newString =
(
SELECT ng.token+''
FROM samd.NGrams8K(#string,1) AS ng
WHERE ng.token <> SUBSTRING(#string, ng.position+1,1) -- exclude chars = the next char
OR ng.token LIKE #preserved -- preserve characters that match the #preserved pattern
ORDER BY ng.position
FOR XML PATH(''),TYPE
).value('(text())[1]','varchar(8000)'); -- using Wayne Sheffield’s concatenation logic

T-SQL - Count unique characters in a variable

Goal: To count # of distinct characters in a variable the fastest way possible.
DECLARE #String1 NVARCHAR(4000) = N'1A^' ; --> output = 3
DECLARE #String2 NVARCHAR(4000) = N'11' ; --> output = 1
DECLARE #String3 NVARCHAR(4000) = N'*' ; --> output = 1
DECLARE #String4 NVARCHAR(4000) = N'*A-zz' ; --> output = 4
I've found some posts in regards to distinct characters in a column, grouped by characters, and etc, but not one for this scenario.
Using NGrams8K as a base, you can change the input parameter to a nvarchar(4000) and tweak the DATALENGTH, making NGramsN4K. Then you can use that to split the string into individual characters and count them:
SELECT COUNT(DISTINCT NG.token) AS DistinctCharacters
FROM dbo.NGramsN4k(#String1,1) NG;
Altered NGrams8K:
IF OBJECT_ID('dbo.NGramsN4k','IF') IS NOT NULL DROP FUNCTION dbo.NGramsN4k;
GO
CREATE FUNCTION dbo.NGramsN4k
(
#string nvarchar(4000), -- Input string
#N int -- requested token size
)
/****************************************************************************************
Purpose:
A character-level N-Grams function that outputs a contiguous stream of #N-sized tokens
based on an input string (#string). Accepts strings up to 8000 varchar characters long.
For more information about N-Grams see: http://en.wikipedia.org/wiki/N-gram.
Compatibility:
SQL Server 2008+, Azure SQL Database
Syntax:
--===== Autonomous
SELECT position, token FROM dbo.NGrams8k(#string,#N);
--===== Against a table using APPLY
SELECT s.SomeID, ng.position, ng.token
FROM dbo.SomeTable s
CROSS APPLY dbo.NGrams8K(s.SomeValue,#N) ng;
Parameters:
#string = The input string to split into tokens.
#N = The size of each token returned.
Returns:
Position = bigint; the position of the token in the input string
token = varchar(8000); a #N-sized character-level N-Gram token
Developer Notes:
1. NGrams8k is not case sensitive
2. Many functions that use NGrams8k will see a huge performance gain when the optimizer
creates a parallel execution plan. One way to get a parallel query plan (if the
optimizer does not chose one) is to use make_parallel by Adam Machanic which can be
found here:
sqlblog.com/blogs/adam_machanic/archive/2013/07/11/next-level-parallel-plan-porcing.aspx
3. When #N is less than 1 or greater than the datalength of the input string then no
tokens (rows) are returned. If either #string or #N are NULL no rows are returned.
This is a debatable topic but the thinking behind this decision is that: because you
can't split 'xxx' into 4-grams, you can't split a NULL value into unigrams and you
can't turn anything into NULL-grams, no rows should be returned.
For people who would prefer that a NULL input forces the function to return a single
NULL output you could add this code to the end of the function:
UNION ALL
SELECT 1, NULL
WHERE NOT(#N > 0 AND #N <= DATALENGTH(#string)) OR (#N IS NULL OR #string IS NULL)
4. NGrams8k can also be used as a tally table with the position column being your "N"
row. To do so use REPLICATE to create an imaginary string, then use NGrams8k to split
it into unigrams then only return the position column. NGrams8k will get you up to
8000 numbers. There will be no performance penalty for sorting by position in
ascending order but there is for sorting in descending order. To get the numbers in
descending order without forcing a sort in the query plan use the following formula:
N = <highest number>-position+1.
Pseudo Tally Table Examples:
--===== (1) Get the numbers 1 to 100 in ascending order:
SELECT N = position
FROM dbo.NGrams8k(REPLICATE(0,100),1);
--===== (2) Get the numbers 1 to 100 in descending order:
DECLARE #maxN int = 100;
SELECT N = #maxN-position+1
FROM dbo.NGrams8k(REPLICATE(0,#maxN),1)
ORDER BY position;
5. NGrams8k is deterministic. For more about deterministic functions see:
https://msdn.microsoft.com/en-us/library/ms178091.aspx
Usage Examples:
--===== Turn the string, 'abcd' into unigrams, bigrams and trigrams
SELECT position, token FROM dbo.NGrams8k('abcd',1); -- unigrams (#N=1)
SELECT position, token FROM dbo.NGrams8k('abcd',2); -- bigrams (#N=2)
SELECT position, token FROM dbo.NGrams8k('abcd',3); -- trigrams (#N=3)
--===== How many times the substring "AB" appears in each record
DECLARE #table TABLE(stringID int identity primary key, string varchar(100));
INSERT #table(string) VALUES ('AB123AB'),('123ABABAB'),('!AB!AB!'),('AB-AB-AB-AB-AB');
SELECT string, occurances = COUNT(*)
FROM #table t
CROSS APPLY dbo.NGrams8k(t.string,2) ng
WHERE ng.token = 'AB'
GROUP BY string;
----------------------------------------------------------------------------------------
Revision History:
Rev 00 - 20140310 - Initial Development - Alan Burstein
Rev 01 - 20150522 - Removed DQS N-Grams functionality, improved iTally logic. Also Added
conversion to bigint in the TOP logic to remove implicit conversion
to bigint - Alan Burstein
Rev 03 - 20150909 - Added logic to only return values if #N is greater than 0 and less
than the length of #string. Updated comment section. - Alan Burstein
Rev 04 - 20151029 - Added ISNULL logic to the TOP clause for the #string and #N
parameters to prevent a NULL string or NULL #N from causing "an
improper value" being passed to the TOP clause. - Alan Burstein
****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH
L1(N) AS
(
SELECT 1
FROM (VALUES -- 90 NULL values used to create the CTE Tally table
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)
) t(N)
),
iTally(N) AS -- my cte Tally table
(
SELECT TOP(ABS(CONVERT(BIGINT,((DATALENGTH(ISNULL(#string,N''))/2)-(ISNULL(#N,1)-1)),0)))
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) -- Order by a constant to avoid a sort
FROM L1 a CROSS JOIN L1 b -- cartesian product for 8100 rows (90^2)
)
SELECT
position = N, -- position of the token in the string(s)
token = SUBSTRING(#string,CAST(N AS int),#N) -- the #N-Sized token
FROM iTally
WHERE #N > 0 AND #N <= (DATALENGTH(#string)/2); -- Protection against bad parameter values
Here is another alternative using the power of the tally table. It has been called the "Swiss Army Knife of T-SQL". I keep a tally table as a view on my system which makes it insanely fast.
create View [dbo].[cteTally] as
WITH
E1(N) AS (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
select N from cteTally
Now we can use that tally anytime we need it, like for this exercise.
declare #Something table
(
String1 nvarchar(4000)
)
insert #Something values
(N'1A^')
, (N'11')
, (N'*')
, (N'*A-zz')
select count(distinct substring(s.String1, t.N, 1))
, s.String1
from #Something s
join cteTally t on t.N <= len(s.String1)
group by s.String1
To be honest I don't know this would be any faster than Larnu's usage of NGrams but testing on a large table would be fun to see.
----- EDIT -----
Thanks to Shnugo for the idea. Using a cross apply to a correlated subquery here is actually quite an improvement.
select count(distinct substring(s.String1, A.N, 1))
, s.String1
from #Something s
CROSS APPLY (SELECT TOP(LEN(s.String1)) t.N FROM cteTally t) A(N)
group by s.String1
The reason this is so much faster is that this is no longer using a triangular join which can really be painfully slow. I did also switch out the view with an indexed physical tally table. The improvement there was noticeable on larger datasets but not nearly as big as using the cross apply.
If you want to read more about triangular joins and why we should avoid them Jeff Moden has a great article on the topic. https://www.sqlservercentral.com/articles/hidden-rbar-triangular-joins
Grab a copy of NGrams8k and you can do this:
DECLARE #String1 NVARCHAR(4000) = N'1A^' ; --> output = 3
DECLARE #String2 NVARCHAR(4000) = N'11' ; --> output = 1
DECLARE #String3 NVARCHAR(4000) = N'*' ; --> output = 1
DECLARE #String4 NVARCHAR(4000) = N'*A-zz' ; --> output = 4
SELECT s.String, Total = COUNT(DISTINCT ng.token)
FROM (VALUES(#String1),(#String2),(#String3),(#String4)) AS s(String)
CROSS APPLY dbo.NGrams8k(s.String,1) AS ng
GROUP BY s.String;
Returns:
String Total
-------- -----------
* 1
*A-zz 4
11 1
1A^ 3
UPDATED
Just a quick update based on #Larnu's post and comments. I did not notice that the OP was dealing with Unicode e.g. NVARCHAR. I created an NVARCHAR(4000) version here - similar to what #Larnu posted above. I just updated the return token to use Latin1_General_BIN collation.
SUBSTRING(#string COLLATE Latin1_General_BIN,CAST(N AS int),#N)
This returns the correct answer:
DECLARE #String5 NVARCHAR(4000) = N'ᡣᓡ'; --> output = 2
SELECT COUNT(DISTINCT ng.token)
FROM dbo.NGramsN4k(#String5,1) AS ng;
Without the collation in place you can use the what Larnu posted and get the right answer like this:
DECLARE #String5 NVARCHAR(4000) = N'ᡣᓡ'; --> output = 2
SELECT COUNT(DISTINCT UNICODE(ng.token))
FROM dbo.NGramsN4k(#String5,1) AS ng;
Here's my updated NGramsN4K function:
ALTER FUNCTION dbo.NGramsN4K
(
#string nvarchar(4000), -- Input string
#N int -- requested token size
)
/****************************************************************************************
Purpose:
A character-level N-Grams function that outputs a contiguous stream of #N-sized tokens
based on an input string (#string). Accepts strings up to 4000 nvarchar characters long.
For more information about N-Grams see: http://en.wikipedia.org/wiki/N-gram.
Compatibility:
SQL Server 2008+, Azure SQL Database
Syntax:
--===== Autonomous
SELECT position, token FROM dbo.NGramsN4K(#string,#N);
--===== Against a table using APPLY
SELECT s.SomeID, ng.position, ng.token
FROM dbo.SomeTable s
CROSS APPLY dbo.NGramsN4K(s.SomeValue,#N) ng;
Parameters:
#string = The input string to split into tokens.
#N = The size of each token returned.
Returns:
Position = bigint; the position of the token in the input string
token = nvarchar(4000); a #N-sized character-level N-Gram token
Developer Notes:
1. NGramsN4K is not case sensitive
2. Many functions that use NGramsN4K will see a huge performance gain when the optimizer
creates a parallel execution plan. One way to get a parallel query plan (if the
optimizer does not chose one) is to use make_parallel by Adam Machanic which can be
found here:
sqlblog.com/blogs/adam_machanic/archive/2013/07/11/next-level-parallel-plan-porcing.aspx
3. When #N is less than 1 or greater than the datalength of the input string then no
tokens (rows) are returned. If either #string or #N are NULL no rows are returned.
This is a debatable topic but the thinking behind this decision is that: because you
can't split 'xxx' into 4-grams, you can't split a NULL value into unigrams and you
can't turn anything into NULL-grams, no rows should be returned.
For people who would prefer that a NULL input forces the function to return a single
NULL output you could add this code to the end of the function:
UNION ALL
SELECT 1, NULL
WHERE NOT(#N > 0 AND #N <= DATALENGTH(#string)) OR (#N IS NULL OR #string IS NULL);
4. NGramsN4K is deterministic. For more about deterministic functions see:
https://msdn.microsoft.com/en-us/library/ms178091.aspx
Usage Examples:
--===== Turn the string, 'abcd' into unigrams, bigrams and trigrams
SELECT position, token FROM dbo.NGramsN4K('abcd',1); -- unigrams (#N=1)
SELECT position, token FROM dbo.NGramsN4K('abcd',2); -- bigrams (#N=2)
SELECT position, token FROM dbo.NGramsN4K('abcd',3); -- trigrams (#N=3)
--===== How many times the substring "AB" appears in each record
DECLARE #table TABLE(stringID int identity primary key, string nvarchar(100));
INSERT #table(string) VALUES ('AB123AB'),('123ABABAB'),('!AB!AB!'),('AB-AB-AB-AB-AB');
SELECT string, occurances = COUNT(*)
FROM #table t
CROSS APPLY dbo.NGramsN4K(t.string,2) ng
WHERE ng.token = 'AB'
GROUP BY string;
------------------------------------------------------------------------------------------
Revision History:
Rev 00 - 20170324 - Initial Development - Alan Burstein
Rev 01 - 20191108 - Added Latin1_General_BIN collation to token output - Alan Burstein
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH
L1(N) AS
(
SELECT 1 FROM (VALUES -- 64 dummy values to CROSS join for 4096 rows
($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),
($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),
($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),
($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($)) t(N)
),
iTally(N) AS
(
SELECT
TOP (ABS(CONVERT(BIGINT,((DATALENGTH(ISNULL(#string,''))/2)-(ISNULL(#N,1)-1)),0)))
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) -- Order by a constant to avoid a sort
FROM L1 a CROSS JOIN L1 b -- cartesian product for 4096 rows (16^2)
)
SELECT
position = N, -- position of the token in the string(s)
token = SUBSTRING(#string COLLATE Latin1_General_BIN,CAST(N AS int),#N) -- the #N-Sized token
FROM iTally
WHERE #N > 0 -- Protection against bad parameter values:
AND #N <= (ABS(CONVERT(BIGINT,((DATALENGTH(ISNULL(#string,''))/2)-(ISNULL(#N,1)-1)),0)));
You can do this natively in SQL Server using CTE and some string manipuation:
DECLARE #TestString NVARCHAR(4000);
SET #TestString = N'*A-zz';
WITH letters AS
(
SELECT 1 AS Pos,
#TestString AS Stri,
MAX(LEN(#TestString)) AS MaxPos,
SUBSTRING(#TestString, 1, 1) AS [Char]
UNION ALL
SELECT Pos + 1,
#TestString,
MaxPos,
SUBSTRING(#TestString, Pos + 1, 1) AS [Char]
FROM letters
WHERE Pos + 1 <= MaxPos
)
SELECT COUNT(*) AS LetterCount
FROM (
SELECT UPPER([Char]) AS [Char]
FROM letters
GROUP BY [Char]
) a
Example outputs:
SET #TestString = N'*A-zz';
{execute code}
LetterCount = 4
SET #TestString = N'1A^';
{execute code}
LetterCount = 3
SET #TestString = N'1';
{execute code}
LetterCount = 1
SET #TestString = N'*';
{execute code}
LetterCount = 1
CREATE TABLE #STRINGS(
STRING1 NVARCHAR(4000)
)
INSERT INTO #STRINGS (
STRING1
)
VALUES
(N'1A^'),(N'11'),(N'*'),(N'*A-zz')
;WITH CTE_T AS (
SELECT DISTINCT
S.STRING1
,SUBSTRING(S.STRING1, V.number + 1, 1) AS Val
FROM
#STRINGS S
INNER JOIN
[master]..spt_values V
ON V.number < LEN(S.STRING1)
WHERE
V.[type] = 'P'
)
SELECT
T.STRING1
,COUNT(1) AS CNT
FROM
CTE_T T
GROUP BY
T.STRING1

Splitting a String by character and parsing it into multiple columns in another table

I am looking to take a string of directory path and parse information out of it into existing columns on another table. This is for the purpose of creating a staging table for reporting. It will be parsing many directory paths if the ProjectName is applicable to the change in structure.
Data Example:
Table1_Column1
ProjectName\123456_ProjectShortName\Release_1\Iteration\etc
Expected Output:
Table2_Column1, Table2_Column2
123456 ProjectShortName
I've figured out how to parse some strings by character but it seems a bit clunky and inefficient. Is there a better structure to go about this? To add some more to it, this is just one column I need to manipulate before shifting it over there are three other columns that are being directly shifted to the staging table based on the ProjectName.
Is it better to just create a UDF to split then call it within the job that will move the data or is there another way?
Here's a method without a UDF.
It uses charindex and substring to get the parts from that path string.
An example using a table variable:
declare #T table (Table1_Column1 varchar(100));
insert into #T values
('ProjectName\123456_ProjectShortName\Release_1\Iteration\etc'),
('OtherProjectName\789012_OtherProjectShortName\Release_2\Iteration\xxx');
select
case
when FirstBackslashPos > 0 and FirstUnderscorePos > 0
then substring(Col1,FirstBackslashPos+1,FirstUnderscorePos-FirstBackslashPos-1)
end as Table1_Column1,
case
when FirstUnderscorePos > 0 and SecondBackslashPos > 0
then substring(Col1,FirstUnderscorePos+1,SecondBackslashPos-FirstUnderscorePos-1)
end as Table1_Column2
from (
select
Table1_Column1 as Col1,
charindex('\',Table1_Column1) as FirstBackslashPos,
charindex('_',Table1_Column1) as FirstUnderscorePos,
charindex('\',Table1_Column1,charindex('\',Table1_Column1)+1) as SecondBackslashPos
from #T
) q;
If you want to calculate only one into a variable
declare #ProjectPath varchar(100);
set #ProjectPath = 'ProjectName\123456_ProjectShortName\Release_1\Iteration\etc';
declare #FirstBackslashPos int = charindex('\',#ProjectPath);
declare #FirstUnderscorePos int = charindex('_',#ProjectPath,#FirstBackslashPos);
declare #SecondBackslashPos int = charindex('\',#ProjectPath,#FirstBackslashPos+1);
declare #ProjectNumber varchar(30) = case when #FirstBackslashPos > 0 and #FirstUnderscorePos > 0 then substring(#ProjectPath,#FirstBackslashPos+1,#FirstUnderscorePos-#FirstBackslashPos-1)end;
declare #ProjectShortName varchar(30) = case when #FirstUnderscorePos > 0 and #SecondBackslashPos > 0 then substring(#ProjectPath,#FirstUnderscorePos+1,#SecondBackslashPos-#FirstUnderscorePos-1) end;
select #ProjectNumber as ProjectNumber, #ProjectShortName as ProjectShortName;
But i.m.h.o. it might be worth the effort to add some CLR that brings true regex matching to the SQL server. Since CHARINDEX and PATINDEX are not as flexible as regex.
The following is a SUPER fast Parser but it is limited to 8K bytes. Notice the Returned Sequence Number... Perhaps you can key off of that because I am still not clear on the logic for why column1 is 123456 and not ProjectName
Declare #String varchar(max) = 'ProjectName\123456_ProjectShortName\Release_1\Iteration\etc'
Select * from [dbo].[udf-Str-Parse-8K](#String,'\')
Returns
RetSeq RetVal
1 ProjectName
2 123456_ProjectShortName
3 Release_1
4 Iteration
5 etc
The UDF if needed
CREATE FUNCTION [dbo].[udf-Str-Parse-8K] (#String varchar(max),#Delimiter varchar(10))
Returns Table
As
Return (
with cte1(N) As (Select 1 From (Values(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) N(N)),
cte2(N) As (Select Top (IsNull(DataLength(#String),0)) Row_Number() over (Order By (Select NULL)) From (Select N=1 From cte1 a,cte1 b,cte1 c,cte1 d) A ),
cte3(N) As (Select 1 Union All Select t.N+DataLength(#Delimiter) From cte2 t Where Substring(#String,t.N,DataLength(#Delimiter)) = #Delimiter),
cte4(N,L) As (Select S.N,IsNull(NullIf(CharIndex(#Delimiter,#String,s.N),0)-S.N,8000) From cte3 S)
Select RetSeq = Row_Number() over (Order By A.N)
,RetVal = Substring(#String, A.N, A.L)
From cte4 A
);
--Much faster than str-Parse, but limited to 8K
--Select * from [dbo].[udf-Str-Parse-8K]('Dog,Cat,House,Car',',')
--Select * from [dbo].[udf-Str-Parse-8K]('John||Cappelletti||was||here','||')

Generating an n-gram table with an SQL query

I'm trying to implement a fuzzy search with JavaScript client side, to search a largish db (300 items roughly) of records contained in an SQL database. My constraint is that it is not possible to perform a live query on the database- I must generate "indexes" as flat files during a nightly batch job. And so, starting with a db that looks like this:
ID. NAME
1. The Rain Man
2. The Electric Slide
3. Transformers
I need to create within a single query something like this:
Trigram ID
the 1
the 2
he_ 1
he_ 2
e_r 1
_ra 1
rai 1
ain 1
in_ 1
n_m 1
_ma 1
man 1
e_e 2
_el 2
ele 2
lec 2
Etc etc, typos not withstanding. The rules here are that ''n' is the length of the strings in the first column, that only a-z and _ are valid characters, any other character being normalized to Lower case, or mapped to _, that a group by n-gram clause may be applied to the table. Thus, I would hope to gain a table that would allow me to quickly look up a particular n-gram and get a list of all the Ids of rows which contain that sequence. I'm not a clever enough SQL cookie to figure this problem out. Can you?
I created an T-SQL NGrams that works quite nicely; note the comments section for examples of how to use
CREATE FUNCTION dbo.nGrams8K
(
#string VARCHAR(8000),
#n TINYINT,
#pad BIT
)
/*
Created by: Alan Burstein
Created on: 3/10/2014
Updated on: 5/20/2014 changed the logic to use an "inline tally table"
9/10/2014 Added some more code examples in the comment section
9/30/2014 Added more code examples
10/27/2014 Small bug fix regarding padding
Use: Outputs a stream of tokens based on an input string.
Works just like mdq.nGrams; see http://msdn.microsoft.com/en-us/library/ff487027(v=sql.105).aspx.
n-gram defined:
In the fields of computational linguistics and probability,
an n-gram is a contiguous sequence of n items from a given
sequence of text or speech. The items can be phonemes, syllables,
letters, words or base pairs according to the application.
To better understand N-Grams see: http://en.wikipedia.org/wiki/N-gram
*/
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
WITH
E1(n) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t(n)),
E2(n) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b),
iTally(n) AS
(
SELECT TOP (LEN(#string)+#n) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM E2 a CROSS JOIN E2 b
),
NewString(NewString) AS
(
SELECT REPLICATE(CASE #pad WHEN 0 THEN '' ELSE ' ' END,#n-1)+#string+
REPLICATE(CASE #pad WHEN 0 THEN '' ELSE ' ' END,#n-1)
)
SELECT TOP ((#n)+LEN(#string))
n AS [sequence],
SUBSTRING(NewString,n,#n) AS token
FROM iTally
CROSS APPLY NewString
WHERE n < ((#n)+LEN(#string));
/*
------------------------------------------------------------
-- (1) Basic Use
-------------------------------------------------------------
;-- (A)basic "string to table":
SELECT [sequence], token
FROM dbo.nGrams8K('abcdefg',1,1);
-- (b) create "bi-grams" (pad bit off)
SELECT [sequence], token
FROM dbo.nGrams8K('abcdefg',2,0);
-- (c) create "tri-grams" (pad bit on)
SELECT [sequence], token
FROM dbo.nGrams8K('abcdefg',3,1);
-- (d) filter for only "tri-grams"
SELECT [sequence], token
FROM dbo.nGrams8K('abcdefg',3,1)
WHERE len(ltrim(token)) = 3;
-- note the query plan for each. The power is coming from an index
-- also note how many rows are produced: len(#string+(#n-1))
-- lastly, you can trim as needed when padding=1
------------------------------------------------------------
-- (2) With a variable
------------------------------------------------------------
-- note, in this example I am getting only the stuff that has three letters
DECLARE #string varchar(20) = 'abcdefg',
#tokenLen tinyint = 3;
SELECT [sequence], token
FROM dbo.nGrams8K('abcdefg',3,1)
WHERE len(ltrim(token)) = 3;
GO
------------------------------------------------------------
-- (3) An on-the-fly alphabet (this will come in handy in a moment)
------------------------------------------------------------
DECLARE #alphabet VARCHAR(26)='ABCDEFGHIJKLMNOPQRSTUVWXYZ';
SELECT [sequence], token
FROM dbo.nGrams8K(#alphabet,1,0);
GO
------------------------------------------------------------
-- (4) Character Count
------------------------------------------------------------
DECLARE #string VARCHAR(100)='The quick green fox jumps over the lazy dog and the lazy dog just laid there.',
#alphabet VARCHAR(26)='ABCDEFGHIJKLMNOPQRSTUVWXYZ';
SELECT a.token, COUNT(b.token) ttl
FROM dbo.nGrams8K(#alphabet,1,0) a
LEFT JOIN dbo.nGrams8K(#string,1,0) b ON a.token=b.token
GROUP BY a.token
ORDER BY a.token;
GO
------------------------------------------------------------
-- (5) Locate the start position of a search pattern
------------------------------------------------------------
;-- (A) note these queries:
DECLARE #string varchar(100)='THE QUICK Green FOX JUMPED OVER THE LAZY DOGS BACK';
-- (i)
SELECT * FROM dbo.nGrams8K(#string,1,0) a;
-- (ii) note this query:
SELECT * FROM dbo.nGrams8K(#string,1,0) a WHERE [token]=' ';
-- (B) and now the word count (#string included for presentation)
SELECT #string AS string,
count(*)+1 AS words
FROM dbo.nGrams8K(#string,1,0) a
WHERE [token]=' '
GO
------------------------------------------------------------
-- (6) search for the number of occurances of a word
------------------------------------------------------------
DECLARE #string VARCHAR(100)='The quick green fox jumps over the lazy dog and the lazy dog just laid there.',
#alphabet VARCHAR(26)='ABCDEFGHIJKLMNOPQRSTUVWXYZ',
#searchString VARCHAR(100)='The';
-- (5a) by location
SELECT sequence-(LEN(#searchstring)) AS location,
token AS searchString
FROM dbo.nGrams8K(#string,LEN(#searchstring+' ')+1,0) b
WHERE token=#searchString;
-- (2b) get total
SELECT #string AS string,
#searchString AS searchString,
COUNT(*) AS ttl
FROM dbo.nGrams8K(#string,LEN(#searchstring+' ')+1,0) b
WHERE token=#searchString;
------------------------------------------------------------
-- (7) Special SubstringBefore and SubstringAfter
------------------------------------------------------------
-- (7a) SubstringBeforeSSI (note: SSI = substringIndex)
ALTER FUNCTION dbo.SubstringBeforeSSI
(
#string varchar(1000),
#substring varchar(100),
#substring_index tinyint
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
WITH get_pos AS
(
SELECT rn = row_number() over (order by sequence), substring_index = sequence
FROM dbo.nGrams8K(#string,len(#substring),1)
WHERE token=#substring
)
SELECT newstring = substring(#string,1,substring_index-len(#substring))
FROM get_pos
WHERE rn=#substring_index;
GO
DECLARE #string varchar(1000)='10.0.1600.22',
#searchPattern varchar(100)='.',
#substring_index tinyint = 3;
SELECT * FROM dbo.SubstringBeforeSSI(#string,#searchPattern,#substring_index);
GO
-- (7b) SubstringBeforeSSI (note: SSI = substringIndex)
ALTER FUNCTION dbo.SubstringAfterSSI
(
#string varchar(1000),
#substring varchar(100),
#substring_index tinyint
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
WITH get_pos AS
(
SELECT rn = row_number() over (order by sequence), substring_index = sequence
FROM dbo.nGrams8K(#string,len(#substring),1)
WHERE token=#substring
)
SELECT newstring = substring(#string,substring_index+1,8000)
FROM get_pos
WHERE rn=#substring_index;
GO
DECLARE #string varchar(1000)='<notes id="1">blah, blah, blah</notes><notes id="2">More Notes</notes>',
#searchPattern varchar(100)='</notes>',
#substring_index tinyint = 1;
SELECT #string, *
FROM dbo.SubstringAfterSSI(#string,#searchPattern,#substring_index);
------------------------------------------------------------
-- (8) Strip non-numeric characters from a string
------------------------------------------------------------
-- (8a) create the function
ALTER FUNCTION StripNonNumeric_itvf(#OriginalText VARCHAR(8000))
RETURNS TABLE
--WITH SCHEMABINDING
AS
return
WITH ngrams AS
(
SELECT n = [sequence], c = token
FROM dbo.nGrams8K(#OriginalText,1,1)
),
clean_txt(CleanedText) AS
(
SELECT c+''
FROM ngrams
WHERE ascii(substring(#OriginalText,n,1)) BETWEEN 48 AND 57
FOR XML PATH('')
)
SELECT CleanedText
FROM clean_txt;
GO
-- (8b) use against a value or variable
SELECT CleanedText
FROM dbo.StripNonNumeric_itvf('value123');
-- (8c) use against a table
-- test harness:
IF OBJECT_ID('tempdb..#strings') IS NOT NULL DROP TABLE #strings;
WITH strings AS
(
SELECT TOP (100000) string = newid()
FROM sys.all_columns a CROSS JOIN sys.all_columns b
)
SELECT *
INTO #strings
FROM strings;
GO
-- query (returns 100K rows every 3 seconds on my pc):
SELECT CleanedText
FROM #strings
CROSS APPLY dbo.StripNonNumeric_itvf(string);
------------------------------------------------------------
-- (9) A couple complex String Algorithms
------------------------------------------------------------
-- (9a) hamming distance between two strings:
DECLARE #string1 varchar(8000) = 'xxxxyyyzzz',
#string2 varchar(8000) = 'xxxxyyzzzz';
SELECT string1 = #string1,
string2 = #string2,
hamming_distance = count(*)
FROM dbo.nGrams8K(#string1,1,0) s1
CROSS APPLY dbo.nGrams8K(#string2,1,0) s2
WHERE s1.sequence = s2.sequence
AND s1.token <> s2.token
GO
-- (9b) inner join between 2 strings
--(can be used to speed up other string metrics such as the longest common subsequence)
DECLARE #string1 varchar(100)='xxxx123yyyy456zzzz',
#string2 varchar(100)='xx789yy000zz';
WITH
s1(string1) AS
(
SELECT [token]+''
FROM dbo.nGrams8K(#string1,1,0)
WHERE charindex([token],#string2)<>0
ORDER BY [sequence]
FOR XML PATH('')
),
s2(string2) AS
(
SELECT [token]+''
FROM dbo.nGrams8K(#string2,1,0)
WHERE charindex([token],#string1)<>0
ORDER BY [sequence]
FOR XML PATH('')
)
SELECT string1, string2
FROM s1
CROSS APPLY s2;
------------------------------------------------------------
-- (10) Advanced Substring Metrics
------------------------------------------------------------
-- (10a) Identify common substrings and their location
DECLARE #string1 varchar(100) = 'xxx yyy zzz',
#string2 varchar(100) = 'xx yyy zz';
-- (i) review the two strings
SELECT str1 = #string1,
str2 = #string2;
-- (ii) the results
WITH
iTally AS
(
SELECT n
FROM dbo.tally t
WHERE n<= len(#string1)
),
distinct_tokens AS
(
SELECT ng1 = ng1.token, ng2 = ng2.token --= ltrim(ng1.token), ng2 = ltrim(ng2.token)
FROM itally
CROSS APPLY dbo.nGrams8K(#string1,n,1) ng1
CROSS APPLY dbo.nGrams8K(#string2,n,1) ng2
WHERE ng1.token=ng2.token
)
SELECT ss_txt = ng1,
ss_len = len(ng1),
str1_loc = charindex(ng1,#string1),
str2_loc = charindex(ng2,#string2)
FROM distinct_tokens
WHERE ng1<>'' AND charindex(ng1,#string1)+charindex(ng2,#string2)<>0
GROUP BY ng1, ng2
ORDER BY charindex(ng1,#string1), charindex(ng2,#string2), len(ng1);
-- (10b) Longest common substring function
-- (i) function
IF EXISTS
( SELECT * FROM INFORMATION_SCHEMA.ROUTINES
WHERE ROUTINE_SCHEMA='dbo' AND ROUTINE_NAME = 'lcss')
DROP FUNCTION dbo.lcss;
GO
CREATE FUNCTION dbo.lcss(#string1 varchar(100), #string2 varchar(100))
RETURNS TABLE
AS
RETURN
SELECT TOP (1) with ties token
FROM dbo.tally
CROSS APPLY dbo.nGrams8K(#string1,n,1)
WHERE n <= len(#string1)
AND charindex(token, #string2) > 0
ORDER BY len(token) DESC;
GO
-- (ii) example of use
DECLARE #string1 varchar(100) = '000xxxyyyzzz',
#string2 varchar(100) = '999xxyyyzaa';
SELECT string1 = #string1,
string2 = #string2,
token
FROM dbo.lcss(#string1, #string2);
*/
GO
You'd have to repeat this statement:
insert into trigram_table ( Trigram, ID )
select substr( translate( lower( Name ), ' ', '_' ), :X, :N ),
ID
from db_table
for all :X from 1 to Len(Name) + 1 - :N
You will also have to extend the translate function for all the other special characters you'd want to convert to an underscore. Right now it's just translating a blank into an underscore.
For performance you could do the translate and lower functions on the Trigram column in a last pass on the trigram_table so you're not doing those functions for each :X.

SQL query to match keywords?

I have a table with a column as nvarchar(max) with text extracted from word documents in it. How can I create a select query that I'll pass another a list of keywords as parameter and return the rows ordered by the number of matches?
Maybe it is possible with full text search?
Yes, possible with full text search, and likely the best answer. For a straight T-SQL solution, you could use a split function and join, e.g. assuming a table of numbers called dbo.Numbers (you may need to decide on a different upper limit):
SET NOCOUNT ON;
DECLARE #UpperLimit INT;
SET #UpperLimit = 200000;
WITH n AS
(
SELECT
rn = ROW_NUMBER() OVER
(ORDER BY s1.[object_id])
FROM sys.objects AS s1
CROSS JOIN sys.objects AS s2
CROSS JOIN sys.objects AS s3
)
SELECT [Number] = rn - 1
INTO dbo.Numbers
FROM n
WHERE rn <= #UpperLimit + 1;
CREATE UNIQUE CLUSTERED INDEX n ON dbo.Numbers([Number]);
And a splitting function that uses that table of numbers:
CREATE FUNCTION dbo.SplitStrings
(
#List NVARCHAR(MAX)
)
RETURNS TABLE
AS
RETURN
(
SELECT DISTINCT
[Value] = LTRIM(RTRIM(
SUBSTRING(#List, [Number],
CHARINDEX(N',', #List + N',', [Number]) - [Number])))
FROM
dbo.Numbers
WHERE
Number <= LEN(#List)
AND SUBSTRING(N',' + #List, [Number], 1) = N','
);
GO
Then you can simply say:
SELECT key, NvarcharColumn /*, other cols */
FROM dbo.table AS outerT
WHERE EXISTS
(
SELECT 1
FROM dbo.table AS t
INNER JOIN dbo.SplitStrings(N'list,of,words') AS s
ON t.NvarcharColumn LIKE '%' + s.Item + '%'
WHERE t.key = outerT.key
);
As a procedure:
CREATE PROCEDURE dbo.Search
#List NVARCHAR(MAX)
AS
BEGIN
SET NOCOUNT ON;
SELECT key, NvarcharColumn /*, other cols */
FROM dbo.table AS outerT
WHERE EXISTS
(
SELECT 1
FROM dbo.table AS t
INNER JOIN dbo.SplitStrings(#List) AS s
ON t.NvarcharColumn LIKE '%' + s.Item + '%'
WHERE t.key = outerT.key
);
END
GO
Then you can just pass in #List (e.g. EXEC dbo.Search #List = N'foo,bar,splunge') from C#.
This won't be super fast, but I'm sure it will be quicker than pulling all the data out into C# and double-nested loop it manually.
how to ... return the rows ordered by the number of [full-text] matches
I have not used it myself but believe SQL Server 2008 supports weighting the CONTAINSTABLE matches which might be of help to you:
http://msdn.microsoft.com/en-us/library/ms189760.aspx
If you don't have an engine that returns results weighted by the number of hits ...
You could write a UDF that takes two inputs and returns an integer: the big textvalue is the first input and the words you're looking for as a comma-delimited string is the second. The function returns an integer representing either the number of distinct looked-for words that were actually found at least once in the text, or the total number of times the looked-for words were found. Implementation --how to weight-- is up to you. Maybe, for example, you'd want to arrange the looked-for words in most-important to least-important order, and give an important word hit more weight than a less important word hit.
You could then use your full text search engine to find all records that contain at least one of the words (you'd OR them), and you'd run this result set through your UDF scalar function:
pseudo code
select title, weightfunction(summary, 'word1,word2,word3....wordN')
from docs
where summary contains ( word1 or word2 or word3 ... or wordN)
order by weightfunction(summary, 'word1,word2,word3....wordN') desc