How can I pivot these key+values rows into a table of complete entries?

Maybe I demand too much from SQL but I feel like this should be possible. I start with a list of key-value pairs, like this:
'0:First, 1:Second, 2:Third, 3:Fourth'
etc. I can split this up pretty easily with a two-step parse that gets me a table like:
EntryNumber PairNumber Item
0 0 0
1 0 First
2 1 1
3 1 Second
etc.
Now, in the simple case of splitting the pairs into a pair of columns, it's fairly easy. I'm interested in the more advanced case where I might have multiple values per entry, like:
'0:First:Fishing, 1:Second:Camping, 2:Third:Hiking'
and such.
In that generic case, I'd like to find a way to take my 3-column result table and somehow pivot it to have one row per entry and one column per value-part.
So I want to turn this:
EntryNumber PairNumber Item
0 0 0
1 0 First
2 0 Fishing
3 1 1
4 1 Second
5 1 Camping
Into this:
Entry [1] [2] [3]
0 0 First Fishing
1 1 Second Camping
Is that just too much for SQL to handle, or is there a way? Pivots (even tricky dynamic pivots) seem like an answer, but I can't figure out how to get that to work.

No, in SQL you can't infer columns dynamically based on the data found during the same query.
Even using the PIVOT feature in Microsoft SQL Server, you must know the columns when you write the query, and you have to hard-code them.
You're going through a lot of work to avoid storing the data in a normalized relational form.
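For illustration, a static pivot over the question's 3-column result might look like the sketch below (the ParsedPairs table name and the ValuePart rank column are assumed for the example); the point is that the IN list has to be spelled out by hand:
-- Sketch only: the [1],[2],[3] column list is hard-coded, it cannot be
-- inferred from the data without building the query dynamically.
SELECT PairNumber, [1], [2], [3]
FROM (
    SELECT PairNumber, Item,
           ROW_NUMBER() OVER (PARTITION BY PairNumber ORDER BY EntryNumber) AS ValuePart
    FROM ParsedPairs   -- assumed name for the 3-column parse result
) AS src
PIVOT (MAX(Item) FOR ValuePart IN ([1], [2], [3])) AS pvt;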

Alright, I found a way to accomplish what I was after. Strap in, this is going to get bumpy.
So the basic problem is to take a string with two kinds of delimiters: entries and values. Each entry represents a set of values, and I wanted to turn the string into a table with one column for each value per entry. I tried to make this a UDF, but the necessity for a temporary table and dynamic SQL meant it had to be a stored procedure.
CREATE PROCEDURE [dbo].[ParseValueList]
(
    @parseString varchar(8000),
    @itemDelimiter CHAR(1),
    @valueDelimiter CHAR(1)
)
AS
BEGIN
SET NOCOUNT ON;
IF object_id('tempdb..#ParsedValues') IS NOT NULL
BEGIN
    DROP TABLE #ParsedValues
END
CREATE TABLE #ParsedValues
(
    EntryID int,
    [Rank] int,
    [Value] varchar(200)
)
So that's just basic setup, establishing the temp table to hold my intermediate results.
;WITH
E1(N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1),--Brute forces 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --Uses a cross join to generate 100 rows (10 * 10)
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --Uses a cross join to generate 10,000 rows (100 * 100)
cteTally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY N) FROM E4)
That beautiful piece of SQL comes from SQL Server Central's Forums and is credited to "a guru." It's a great little 10,000-row tally table, perfect for string splitting.
INSERT INTO #ParsedValues
SELECT ItemNumber AS EntryID,
       ROW_NUMBER() OVER (PARTITION BY ItemNumber ORDER BY ItemNumber) AS [Rank],
       SUBSTRING(Items.Item, T1.N, CHARINDEX(@valueDelimiter, Items.Item + @valueDelimiter, T1.N) - T1.N) AS [Value]
FROM(
    SELECT ROW_NUMBER() OVER (ORDER BY T2.N) AS ItemNumber,
           SUBSTRING(@parseString, T2.N, CHARINDEX(@itemDelimiter, @parseString + @itemDelimiter, T2.N) - T2.N) AS Item
    FROM cteTally T2
    WHERE T2.N < LEN(@parseString) + 2 --Ensures we cut out once the entire string is done
      AND SUBSTRING(@itemDelimiter + @parseString, T2.N, 1) = @itemDelimiter
) AS Items, cteTally T1
WHERE T1.N < LEN(@parseString) + 2 --Ensures we cut out once the entire string is done
  AND SUBSTRING(@valueDelimiter + Items.Item, T1.N, 1) = @valueDelimiter
Ok, this is the first really dense, meaty part. The inner select breaks up my string along the item delimiter (the comma), using the guru's string splitting method. That table is then passed up to the outer select, which does the same thing, but this time applying the value delimiter (the colon) to each row. The inner RowNumber (EntryID) and the outer RowNumber over Partition (Rank) are key to the pivot. EntryID shows which Item the values belong to, and Rank shows the ordinal of the values.
DECLARE @columns varchar(200)
DECLARE @columnNames varchar(2000)
DECLARE @query varchar(8000)
SELECT @columns = COALESCE(@columns + ',[' + CAST([Rank] AS varchar) + ']', '[' + CAST([Rank] AS varchar) + ']'),
       @columnNames = COALESCE(@columnNames + ',[' + CAST([Rank] AS varchar) + '] AS Value' + CAST([Rank] AS varchar),
                               '[' + CAST([Rank] AS varchar) + '] AS Value' + CAST([Rank] AS varchar))
FROM (SELECT DISTINCT [Rank] FROM #ParsedValues) AS Ranks
SET @query = '
SELECT ' + @columnNames + '
FROM #ParsedValues
PIVOT
(
    MAX([Value]) FOR [Rank]
    IN (' + @columns + ')
) AS pvt'
EXECUTE(@query)
DROP TABLE #ParsedValues
END
And at last, the dynamic SQL that makes it possible. By getting a list of distinct Ranks, we set up our column list. This is then written into the dynamic pivot, which tilts the values over and slots each value into the proper column, each with a generic "Value#" heading.
Thus, by calling EXEC ParseValueList with a properly formatted string of values, we can break it up into a table to use however we need. It works (but is probably overkill) for simple key:value pairs, and scales up to a fair number of columns (about 50 at most, I think, but that would be really silly).
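For example, a call using the sample string from the question (and the comma/colon delimiters used throughout) would look something like this:
EXEC dbo.ParseValueList
    @parseString = '0:First:Fishing, 1:Second:Camping, 2:Third:Hiking',
    @itemDelimiter = ',',
    @valueDelimiter = ':';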
Anyway, hope that helps anyone having a similar issue.
(Yeah, it probably could have been done in something like SQLCLR as well, but I find a great joy in solving problems with pure SQL.)

Though probably not optimal, here's a more condensed solution.
DECLARE @DATA varchar(max);
SET @DATA = '0:First:Fishing, 1:Second:Camping, 2:Third:Hiking';
SELECT
    DENSE_RANK() OVER (ORDER BY [Data].[row]) AS [Entry]
    , [Data].[row].value('(./B/text())[1]', 'int') as "[1]"
    , [Data].[row].value('(./B/text())[2]', 'varchar(64)') as "[2]"
    , [Data].[row].value('(./B/text())[3]', 'varchar(64)') as "[3]"
FROM
(
    SELECT
        CONVERT(XML, '<A><B>' + REPLACE(REPLACE(@DATA , ',', '</B></A><A><B>'), ':', '</B><B>') + '</B></A>').query('.')
) AS [T]([c])
CROSS APPLY [T].[c].nodes('/A') AS [Data]([row]);

Hope it's not too late.
You can use the RANK function to determine the position of each Item per PairNumber, and then use PIVOT:
SELECT PairNumber, [1] ,[2] ,[3]
FROM
(
SELECT PairNumber, Item, RANK() OVER (PARTITION BY PairNumber order by EntryNumber) as RANKing
from tabla) T
PIVOT
(MAX(Item)
FOR RANKing in ([1],[2],[3])
)as PVT

Related

Order Concatenated field

I have a field which is a concatenation of single letters. I am trying to order these strings within a view. These values can't be hard coded as there are too many. Is someone able to provide some guidance on the function to use to achieve the desired output below? I am using MSSQL.
Current output
CustID | Code
123 | BCA
Desired output
CustID | Code
123 | ABC
I have tried using a UDF
CREATE FUNCTION [dbo].[Alphaorder] (@str VARCHAR(50))
returns VARCHAR(50)
BEGIN
    DECLARE @len INT,
            @cnt INT = 1,
            @str1 VARCHAR(50) = '',
            @output VARCHAR(50) = ''
    SELECT @len = Len(@str)
    WHILE @cnt <= @len
    BEGIN
        SELECT @str1 += Substring(@str, @cnt, 1) + ','
        SET @cnt += 1
    END
    SELECT @str1 = LEFT(@str1, Len(@str1) - 1)
    SELECT @output += Sp_data
    FROM (SELECT Split.a.value('.', 'VARCHAR(100)') Sp_data
          FROM (SELECT Cast ('<M>' + Replace(@str1, ',', '</M><M>') + '</M>' AS XML) AS Data) AS A
          CROSS APPLY Data.nodes ('/M') AS Split(a)) A
    ORDER BY Sp_data
    RETURN @output
END
This works when calling one field, i.e.
Select CustID, dbo.alphaorder(Code)
from dbo.source
where custid = 123
However, when I try to apply this to top(10) I receive the error
"Invalid length parameter passed to the LEFT or SUBSTRING function."
Keeping in mind my source has ~4 million records, is this still the best solution?
Unfortunately I am not able to normalize the data into a separate table with records for each Code.
This doesn't rely on an id column to join with itself, and performance is almost as fast
as the answer by @Shnugo:
SELECT
    CustID,
    (
        SELECT chr
        FROM (SELECT TOP(LEN(Code))
                     SUBSTRING(Code, ROW_NUMBER() OVER(ORDER BY (SELECT NULL)), 1)
              FROM sys.messages) A(Chr)
        ORDER BY chr
        FOR XML PATH(''), type
    ).value('.', 'varchar(max)') AS CODE
FROM source t
First of all: Avoid loops...
You can try this:
DECLARE @tbl TABLE(ID INT IDENTITY, YourString VARCHAR(100));
INSERT INTO @tbl VALUES ('ABC')
                       ,('JSKEzXO')
                       ,('QKEvYUJMKRC');
--the cte will create a list of all your strings separated in single characters.
--You can check the output with a simple SELECT * FROM SeparatedCharacters instead of the actual SELECT
WITH SeparatedCharacters AS
(
    SELECT *
    FROM @tbl
    CROSS APPLY
        (SELECT TOP(LEN(YourString)) ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) FROM master..spt_values) A(Nmbr)
    CROSS APPLY
        (SELECT SUBSTRING(YourString,Nmbr,1)) B(Chr)
)
SELECT ID,YourString
      ,(
        SELECT Chr As [*]
        FROM SeparatedCharacters sc1
        WHERE sc1.ID=t.ID
        ORDER BY sc1.Chr
        FOR XML PATH(''),TYPE
       ).value('.','nvarchar(max)') AS Sorted
FROM @tbl t;
The result
ID YourString Sorted
1 ABC ABC
2 JSKEzXO EJKOSXz
3 QKEvYUJMKRC CEJKKMQRUvY
The idea in short
The trick is the first CROSS APPLY. This will create a tally on-the-fly. You will get a resultset with numbers from 1 to n where n is the length of the current string.
The second apply uses this number to get each character one-by-one using SUBSTRING().
The outer SELECT reads from the original table, which means one row per ID, and uses a correlated sub-query to fetch all related characters. They are sorted and re-concatenated using FOR XML. You might add DISTINCT in order to avoid repeating characters.
That's it :-)
Hint: SQL-Server 2017+
With version v2017 there's the new function STRING_AGG(). This would make the re-concatenation very easy:
WITH SeparatedCharacters AS
(
    SELECT *
    FROM @tbl
    CROSS APPLY
        (SELECT TOP(LEN(YourString)) ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) FROM master..spt_values) A(Nmbr)
    CROSS APPLY
        (SELECT SUBSTRING(YourString,Nmbr,1)) B(Chr)
)
SELECT ID,YourString
      ,STRING_AGG(sc.Chr,'') WITHIN GROUP(ORDER BY sc.Chr) AS Sorted
FROM SeparatedCharacters sc
GROUP BY ID,YourString;
Considering your table has a good number of rows (~4 million), I would suggest creating a persisted computed column in the table to store these values; calculating them at run time in a view will lead to performance problems.
If you are not able to normalize, add this as a denormalized column to the existing table.
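As a rough sketch (assuming the dbo.source table and the Alphaorder function above; the CodeSorted column name is just for illustration, and SQL Server only accepts a PERSISTED computed column over a user-defined function when that function is deterministic and created WITH SCHEMABINDING):
-- computed once per row and stored with the table, instead of per query
ALTER TABLE dbo.source
    ADD CodeSorted AS dbo.Alphaorder(Code) PERSISTED;

SELECT CustID, CodeSorted
FROM dbo.source
WHERE CustID = 123;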
I think the error you are getting could be due to empty codes.
IF LEN(@str) = 0
BEGIN
    SET @output = ''
END
ELSE
BEGIN
    ... EXISTING CODE BLOCK ...
END
I suggest splitting the string into its characters using a string-split SQL function (dbo.SPLIT in the code below), then concatenating the string back, this time ordered alphabetically.
Are you using SQL Server 2017? With SQL Server 2017 you can use the STRING_AGG string aggregation function to concatenate the split characters in an ordered way, as follows:
select
t.CustId, string_agg(strval, '') within GROUP (order by strval)
from CharacterTable t
cross apply dbo.SPLIT(t.code) s
where strval is not null
group by CustId
order by CustId
If you are not working on SQL Server 2017, you can follow the structure below, which uses SQL's FOR XML PATH for the concatenation:
select
CustId,
STUFF(
(
SELECT
'' + strval
from CharacterTable ct
cross apply dbo.SPLIT(t.code) s
where strval is not null
and t.CustId = ct.CustId
order by strval
FOR XML PATH('')
), 1, 0, ''
) As concatenated_string
from CharacterTable t
order by CustId

Efficient way to merge alternating values from two columns into one column in SQL Server

I have two columns in a table. I want to merge them into a single column, but the merge should be done taking alternating values from each column.
For example:
Column A --> value (1,2,3)
Column B --> value (A,B,C)
Required result - (1,A,2,B,3,C)
It should be done without loops.
You need to make use of the UNION and get a little creative with how you choose to alternate. My solution ended up looking like this.
SELECT ColumnA
FROM Table
WHERE ColumnA%2=1
UNION
SELECT ColumnB
FROM TABLE
WHERE ColumnA%2=0
If you have an ID/PK column that could just as easily be used, I just didn't want to assume anything about your table.
EDIT:
If your table contains duplicates that you wish to keep, use UNION ALL instead of UNION
Try this:
SELECT [value]
FROM [Table]
UNPIVOT
(
[value] FOR [Column] IN ([Column_A], [Column_B])
) UNPVT
If you have SQL Server 2017 or higher you can use:
SELECT QUOTENAME(STRING_AGG (cast(a as varchar(1)) + ',' + b, ','), '()')
FROM test;
In older versions, depending on how much data you have in your tables you can also try:
SELECT QUOTENAME(STUFF(
(SELECT ',' + cast(a as varchar(1)) + ',' + b
FROM test
FOR XML PATH('')), 1, 1,''), '()')
Here you can try a sample
http://sqlfiddle.com/#!18/6c9af/5
with data as (
    select *, row_number() over (order by colA) as rn
    from t
)
select rn,
       case rn % 2 when 1 then colA else colB end as alternating
from data;
The following SQL uses an undocumented aggregate concatenation technique. This is described in Inside Microsoft SQL Server 2008 T-SQL Programming on page 33.
declare @x varchar(max) = '';
declare @t table (a varchar(10), b varchar(10));
insert into @t values (1,'A'), (2,'B'), (3,'C');
select @x = @x + a + ',' + b + ','
from @t;
select '(' + LEFT(@x, LEN(@x) - 1) + ')';

How to modify T-SQL to process multiple records, not just one

I am working on a function to remove/replace special characters in a string from a column named "Title". Currently I am testing the code one record at a time. I would like to test the code against all the records in the table, but I do not know how to modify the current T-SQL to process all the records rather than just one at a time. I would appreciate it if someone could show me how, or what type of modifications I need to make to process all records.
This is the code as I have it right now:
DECLARE @str VARCHAR(400);
DECLARE @expres VARCHAR(50) = '%[~,@,#,$,%,&,*,(,),.,!,´,:]%'
SET @str = (SELECT REPLACE(REPLACE(LOWER([a].[Title]), CHAR(9), ''), ' ', '_') FROM [dbo].[a] WHERE [a].[ID] = '43948')
WHILE PATINDEX(@expres, @str) > 0
    SET @str = REPLACE(REPLACE(@str, SUBSTRING(@str, PATINDEX(@expres, @str), 1), ''), '-', ' ')
SELECT @str COLLATE SQL_Latin1_General_CP1251_CS_AS
For a Title containing the value: Schöne Wiege Meiner Leiden, the output after the code is applied would be: schone_wiege_meiner_leiden
I would like to make the code process multiple records rather than one at a time as is done currently by specifying the ID. I want to process records in bulk.
I hope I can get some help, thank you in advance for your help.
Code example taken from: remove special characters from string in sql server
There is no need for a loop here. You can instead use a tally table, and this can become a set-based inline table-valued function quite easily. Performance-wise it will blow the doors off a loop-based scalar function.
I keep a tally table as a view in my system. Here is the code for the tally table.
create View [dbo].[cteTally] as
WITH
E1(N) AS (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
select N from cteTally
Now comes the fun part: using this to parse strings and all kinds of other things. It has been dubbed the Swiss Army knife of T-SQL. Any time you start thinking loop, try to think about using a tally table instead. Here is how this function might look.
create function RemoveValuesFromString
(
    @SearchVal nvarchar(max)
    , @CharsToRemove nvarchar(max)
) returns table as
RETURN
with MyValues as
(
    select substring(@SearchVal, N, 1) as MyChar
        , t.N
    from cteTally t
    where N <= len(@SearchVal)
      and charindex(substring(@SearchVal, N, 1), @CharsToRemove) = 0
)
select distinct MyResult = STUFF((select MyChar + ''
                                  from MyValues mv2
                                  order by mv2.N
                                  FOR XML PATH(''),TYPE).value('.','NVARCHAR(MAX)'), 1, 0, '')
from MyValues mv
;
Here is an example of how you might be able to use this. I am using a table variable here but this could be any table or whatever.
declare @SomeTable table
(
    SomeTableID int identity primary key clustered
    , SomeString varchar(max)
)
insert @SomeTable
select 'This coffee cost $32.!!! This is a# tot$#a%l r)*i-p~!`of^%f' union all
select 'This &that'
select *
from @SomeTable st
cross apply dbo.RemoveValuesFromString(st.SomeString, '%[~,##$%&*()!´:]%`^-') x

SQL Server 2012 T-SQL count number of words between elements of two sets

I have two sets of elements, let's say they are these words:
set 1: "nuclear", "fission", "dirty" and
set 2: "device", "explosive"
In my database, I have a text column (Description) which contains a sentence or two. I would like to find any records where Description contains both an element from set 1 followed by an element from set 2, where the two elements are separated by four words or less. For simplicity, counting (spaces-1) will count words between the two elements.
I'd prefer it if a solution didn't require the installation of anything like CLR functions for regular expression. Rather, if this could be done with a user-defined table function, it would make deployment simpler.
Does this sound possible?
It is possible, but I do not think it will perform well with millions of rows.
I have a solution here that handles about 10 000 rows in 2 sec and 100 000 rows in about 20 sec on our server. It also requires the famous DelimitedSplit8K SQL table function from SQLServerCentral:
DECLARE @set1 VARCHAR(MAX) = 'nuclear, fission, dirty';
DECLARE @set2 VARCHAR(MAX) = 'device, explosive';
WITH GetDistances AS
(
    SELECT
        DubID = ROW_NUMBER() OVER (PARTITION BY ID ORDER BY ID)
        , Distance = dbo.[cf_ValueSetDistance](s.Description, @set1, @set2)
        , s.ID
        , s.Description
    FROM #sentences s
    JOIN dbo.cf_DelimitedSplit8K(@set1, ',') s1 ON s.Description LIKE '%' + RTRIM(LTRIM(s1.Item)) + '%'
    JOIN dbo.cf_DelimitedSplit8K(@set2, ',') s2 ON s.Description LIKE '%' + RTRIM(LTRIM(s2.Item)) + '%'
) SELECT Distance, ID, Description FROM GetDistances WHERE DubID = 1 AND Distance BETWEEN 1 AND 4;
--10 000 rows: 2sec
--100 000 rows: 20sec
Test data generator
--DROP TABLE #sentences
CREATE TABLE #sentences
(
ID INT IDENTITY(1,1) PRIMARY KEY
, Description VARCHAR(100)
);
GO
--CREATE 10000 random sentences that are 100 chars long
SET NOCOUNT ON;
WHILE((SELECT COUNT(*) FROM #sentences) < 10000)
BEGIN
DECLARE @randomWord VARCHAR(100) = '';
SELECT TOP 100 @randomWord = @randomWord + ' ' + Item FROM dbo.cf_DelimitedSplit8K('nuclear fission dirty device explosive On the other hand, we denounce with righteous indignation and dislike men who are so beguiled and demoralized by the charms of pleasure of the moment, so blinded by desire, that they cannot foresee the pain and trouble that are bound to ensue; and equal blame belongs to those who fail in their duty through weakness of will, which is the same as saying through shrinking from toil and pain. These cases are perfectly simple and easy to distinguish. In a free hour, when our power of choice is untrammelled and when nothing prevents our being able to do what we like best, every pleasure is to be welcomed and every pain avoided. But in certain circumstances and owing to the claims of duty or the obligations of business it will frequently occur that pleasures have to be repudiated and annoyances accepted. The wise man therefore always holds in these matters to this principle of selection: he rejects pleasures to secure other greater pleasures, or else he endures pains to avoid worse pains', ' ') ORDER BY NEWID();
INSERT INTO #sentences
SELECT @randomWord
END
SET NOCOUNT OFF;
Function 1 - cf_ValueSetDistance
CREATE FUNCTION [dbo].[cf_ValueSetDistance]
(
    @value VARCHAR(MAX)
    , @compareSet1 VARCHAR(MAX)
    , @compareSet2 VARCHAR(MAX)
)
RETURNS INT
AS
BEGIN
    SET @value = REPLACE(REPLACE(REPLACE(@value, '.', ''), ',', ''), '?', '');
    DECLARE @distance INT;
    DECLARE @sentence TABLE( WordIndex INT, Word VARCHAR(MAX) );
    DECLARE @set1 TABLE( Word VARCHAR(MAX) );
    DECLARE @set2 TABLE( Word VARCHAR(MAX) );
    INSERT INTO @sentence
    SELECT ItemNumber, RTRIM(LTRIM(Item)) FROM dbo.cf_DelimitedSplit8K(@value, ' ')
    INSERT INTO @set1
    SELECT RTRIM(LTRIM(Item)) FROM dbo.cf_DelimitedSplit8K(@compareSet1, ',')
    IF(EXISTS(SELECT 1 FROM @sentence s JOIN @set1 s1 ON s.Word = s1.Word))
    BEGIN
        INSERT INTO @set2
        SELECT RTRIM(LTRIM(Item)) FROM dbo.cf_DelimitedSplit8K(@compareSet2, ',');
        IF(EXISTS(SELECT 1 FROM @sentence s JOIN @set2 s2 ON s.Word = s2.Word))
        BEGIN
            WITH Set1 AS (
                SELECT s.WordIndex, s.Word FROM @sentence s
                JOIN @set1 s1 ON s1.Word = s.Word
            ), Set2 AS
            (
                SELECT s.WordIndex, s.Word FROM @sentence s
                JOIN @set2 s2 ON s2.Word = s.Word
            )
            SELECT @distance = MIN(ABS(s2.WordIndex - s1.WordIndex)) FROM Set1 s1, Set2 s2
        END
    END
    RETURN @distance;
END
Function 2 - DelimitedSplit8K
(No need to even try to understand this code, this is an extremely fast function for splitting a string to a table, written by several talented people):
CREATE FUNCTION [dbo].[cf_DelimitedSplit8K]
(@pString VARCHAR(8000), @pDelimiter CHAR(1))
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
--===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...
-- enough to cover NVARCHAR(4000)
WITH E1(N) AS (
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
    -- for both a performance gain and prevention of accidental "overruns"
    SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
),
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
    SELECT 1 UNION ALL
    SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter
),
cteLen(N1,L1) AS(--==== Return start and length (for use in substring)
    SELECT s.N1,
           ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000)
    FROM cteStart s
)
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
       Item = SUBSTRING(@pString, l.N1, l.L1)
FROM cteLen l;
I don't know anything about performance, but this could be done with CROSS APPLY and two table variables.
--initialize word set data
DECLARE @set1 TABLE (wordFromSet varchar(n))
DECLARE @set2 TABLE (wordFromSet varchar(n))
INSERT INTO @set1 SELECT 'nuclear' UNION SELECT 'fission' UNION SELECT 'dirty'
INSERT INTO @set2 SELECT 'device' UNION SELECT 'explosive'
SELECT *
FROM MyTable m
CROSS APPLY
(
    SELECT wordFromSet
        , LEN(SUBSTRING(m.Description, 1, CHARINDEX(wordFromSet, m.Description))) - LEN(REPLACE(SUBSTRING(m.Description, 1, CHARINDEX(wordFromSet, m.Description)), ' ', '')) AS WordPosition
    FROM @set1
    WHERE m.Description LIKE '%' + wordFromSet + '%'
) w1
CROSS APPLY
(
    SELECT wordFromSet
        , LEN(SUBSTRING(m.Description, 1, CHARINDEX(wordFromSet, m.Description))) - LEN(REPLACE(SUBSTRING(m.Description, 1, CHARINDEX(wordFromSet, m.Description)), ' ', '')) AS WordPosition
    FROM @set2
    WHERE m.Description LIKE '%' + wordFromSet + '%'
) w2
WHERE w2.WordPosition - w1.WordPosition <= treshold
Essentially it will only return rows from MyTable that contain at least one word from each set; for those rows it determines each word's position by taking the difference in length between the substring that ends at the word's position and the same substring with the spaces removed.
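To see that position arithmetic in isolation, here is a small standalone sketch with made-up values:
DECLARE @Description varchar(100) = 'a dirty explosive device was found';
DECLARE @word varchar(20) = 'device';
-- length of the prefix ending at the word, minus the same prefix with spaces removed,
-- equals the number of spaces (i.e. whole words) that precede the word
SELECT LEN(SUBSTRING(@Description, 1, CHARINDEX(@word, @Description)))
     - LEN(REPLACE(SUBSTRING(@Description, 1, CHARINDEX(@word, @Description)), ' ', '')) AS WordPosition;
-- returns 3: three words precede 'device', so it is the fourth word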
I am adding a new answer, even though my old one has been accepted and I can see you went for the "FULL TEXT INDEX".
I have looked at the answer @Louis gave, and I think it was clever to use CROSS APPLY. His answer beats the performance of mine. The only problem is that his code will only compare from the first instance of a word. This made me want to try to combine his answer with the split function I used (DelimitedSplit8K from SQLServerCentral).
This results in a remarkable performance boost; I have tested this on 1 million rows, and the result was almost instant:
My old answer: 5 min
@Louis' answer: 2 min
New answer: 3 sec
This does not beat the "FULL TEXT INDEX" performance-wise, but it at least supports the word-search combination specification you provided in a relatively effective way.
DECLARE @set1 TABLE (Word VARCHAR(50))
DECLARE @set2 TABLE (Word VARCHAR(50))
INSERT INTO @set1 SELECT 'nuclear' UNION SELECT 'fission' UNION SELECT 'dirty'
INSERT INTO @set2 SELECT 'device' UNION SELECT 'explosive'
SELECT * FROM #sentences s
CROSS APPLY
(
    SELECT * FROM @set1 s1
    JOIN dbo.cf_DelimitedSplit8K(s.Description, ' ') split ON split.Item = s1.Word
) s1
CROSS APPLY
(
    SELECT * FROM @set2 s2
    JOIN dbo.cf_DelimitedSplit8K(s.Description, ' ') split ON split.Item = s2.Word
) s2
WHERE ABS(s1.ItemNumber - s2.ItemNumber) <= 4;
Look at my old answer for the code of the dbo.cf_DelimitedSplit8K function.

replace value in varchar(max) field with join

I have a table that contains text field with placeholders. Something like this:
Row Notes
1. This is some notes ##placeholder130## this ##myPlaceholder##, #oneMore#. End.
2. Second row...just a ##test#.
(This table contains about 1-5k rows on average. Average number of placeholders in one row is 5-15).
Now, I have a lookup table that looks like this:
Name Value
placeholder130 Dog
myPlaceholder Cat
oneMore Cow
test Horse
(Lookup table will contain anywhere from 10k to 100k records)
I need to find the fastest way to join those placeholders from strings to a lookup table and replace with value. So, my result should look like this (1st row):
This is some notes Dog this Cat, Cow. End.
What I came up with was to split each row into multiple rows, one per placeholder, join them to the lookup table, and then concatenate the records back into the original row with the new values, but it takes around 10-30 seconds on average.
You could try to split the string using a numbers table and rebuild it with for xml path.
select (
select coalesce(L.Value, T.Value)
from Numbers as N
cross apply (select substring(Notes.notes, N.Number, charindex('##', Notes.notes + '##', N.Number) - N.Number)) as T(Value)
left outer join Lookup as L
on L.Name = T.Value
where N.Number <= len(notes) and
substring('##' + notes, Number, 2) = '##'
order by N.Number
for xml path(''), type
).value('text()[1]', 'varchar(max)')
from Notes
SQL Fiddle
I borrowed the string splitting from this blog post by Aaron Bertrand
SQL Server is not very fast with string manipulation, so this is probably best done client-side. Have the client load the entire lookup table, and replace the notes as they arrive.
Having said that, it can of course be done in SQL. Here's a solution with a recursive CTE. It performs one lookup per recursion step:
; with Repl as
(
select row_number() over (order by l.name) rn
, Name
, Value
from Lookup l
)
, Recurse as
(
select Notes
, 0 as rn
from Notes
union all
select replace(Notes, '##' + l.name + '##', l.value)
, r.rn + 1
from Recurse r
join Repl l
on l.rn = r.rn + 1
)
select *
from Recurse
where rn =
(
select count(*)
from Lookup
)
option (maxrecursion 0)
Example at SQL Fiddle.
Another option is a while loop to keep replacing lookups until no more are found:
declare @notes table (notes varchar(max))
insert @notes
select Notes
from Notes
while 1=1
begin
    update n
    set Notes = replace(n.Notes, '##' + l.name + '##', l.value)
    from @notes n
    outer apply
    (
        select top 1 Name
            , Value
        from Lookup l
        where n.Notes like '%##' + l.name + '##%'
    ) l
    where l.name is not null
    if @@rowcount = 0
        break
end
select *
from @notes
Example at SQL Fiddle.
I second the comment that T-SQL is just not suited for this operation, but if you must do it in the database, here is an example using a function to manage the multiple replace statements.
Since you have a relatively small number of tokens in each note (5-15) and a very large number of lookup values (10k-100k), my function first extracts potential tokens from the input and uses that set to join to your lookup (dbo.[Lookup] below). It would be far too much work to look for an occurrence of every one of your tokens in each note.
I did a bit of perf testing using 50k tokens and 5k notes and this function runs really well, completing in <2 seconds (on my laptop). Please report back how this strategy performs for you.
note: In your example data the token format was not consistent (##_#, ##_##, #_#), I am guessing this was simply a typo and assume all tokens take the form of ##TokenName##.
--setup
if object_id('dbo.[Lookup]') is not null
drop table dbo.[Lookup];
go
if object_id('dbo.fn_ReplaceLookups') is not null
drop function dbo.fn_ReplaceLookups;
go
create table dbo.[Lookup] (LookupName varchar(100) primary key, LookupValue varchar(100));
insert into dbo.[Lookup]
select '##placeholder130##','Dog' union all
select '##myPlaceholder##','Cat' union all
select '##oneMore##','Cow' union all
select '##test##','Horse';
go
create function [dbo].[fn_ReplaceLookups](@input varchar(max))
returns varchar(max)
as
begin
    declare @xml xml;
    select @xml = cast(('<r><i>'+replace(@input,'##' ,'</i><i>')+'</i></r>') as xml);
    --extract the potential tokens
    declare @LookupsInString table (LookupName varchar(100) primary key);
    insert into @LookupsInString
    select distinct '##'+v+'##'
    from ( select [v] = r.n.value('(./text())[1]', 'varchar(100)'),
                  [r] = row_number() over (order by n)
           from @xml.nodes('r/i') r(n)
         )d(v,r)
    where r%2=0;
    --tokenize the input
    select @input = replace(@input, l.LookupName, l.LookupValue)
    from dbo.[Lookup] l
    join @LookupsInString lis on
        l.LookupName = lis.LookupName;
    return @input;
end
go
return
--usage
declare @Notes table ([Id] int primary key, notes varchar(100));
insert into @Notes
select 1, 'This is some notes ##placeholder130## this ##myPlaceholder##, ##oneMore##. End.' union all
select 2, 'Second row...just a ##test##.';
select *,
       dbo.fn_ReplaceLookups(notes)
from @Notes;
Returns:
Tokenized
--------------------------------------------------------
This is some notes Dog this Cat, Cow. End.
Second row...just a Horse.
Try this
;WITH CTE (org, calc, [Notes], [level]) AS
(
SELECT [Notes], [Notes], CONVERT(varchar(MAX),[Notes]), 0 FROM PlaceholderTable
UNION ALL
SELECT CTE.org, CTE.[Notes],
CONVERT(varchar(MAX), REPLACE(CTE.[Notes],'##' + T.[Name] + '##', T.[Value])), CTE.[level] + 1
FROM CTE
INNER JOIN LookupTable T ON CTE.[Notes] LIKE '%##' + T.[Name] + '##%'
)
SELECT DISTINCT org, [Notes], level FROM CTE
WHERE [level] = (SELECT MAX(level) FROM CTE c WHERE CTE.org = c.org)
SQL FIDDLE DEMO
Check the devioblog post for reference.
To get speed, you can preprocess the note templates into a more efficient form. This will be a sequence of fragments, with each ending in a substitution. The substitution might be NULL for the last fragment.
Notes
Id FragSeq Text SubsId
1 1 'This is some notes ' 1
1 2 ' this ' 2
1 3 ', ' 3
1 4 '. End.' null
2 1 'Second row...just a ' 4
2 2 '.' null
Subs
Id Name Value
1 'placeholder130' 'Dog'
2 'myPlaceholder' 'Cat'
3 'oneMore' 'Cow'
4 'test' 'Horse'
Now we can do the substitutions with a simple join.
SELECT Notes.Text + COALESCE(Subs.Value, '')
FROM Notes LEFT JOIN Subs
    ON SubsId = Subs.Id
WHERE Notes.Id = ?
ORDER BY FragSeq
This produces a list of fragments with substitutions complete. I am not an MSSQL user, but in most dialects of SQL you can concatenate these fragments in a variable quite easily:
DECLARE @Note VARCHAR(8000)
SELECT @Note = COALESCE(@Note, '') + Notes.Text + COALESCE(Subs.Value, '')
FROM Notes LEFT JOIN Subs
    ON SubsId = Subs.Id
WHERE Notes.Id = ?
ORDER BY FragSeq
Pre-processing a note template into fragments will be straightforward using the string splitting techniques of other posts.
Unfortunately I'm not at a location where I can test this, but it ought to work fine.
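For what it's worth, here is an untested sketch of that pre-processing step, using the same XML-splitting trick as other answers. It assumes every token uses the ##name## form, that the note text needs no XML escaping, and that .nodes() returns pieces in document order; it produces each fragment's text together with the token name that follows it (join that name to Subs to get the SubsId):
DECLARE @note varchar(max) = 'This is some notes ##placeholder130## this ##myPlaceholder##, ##oneMore##. End.';
DECLARE @x xml = CAST('<r><p>' + REPLACE(@note, '##', '</p><p>') + '</p></r>' AS xml);
;WITH Pieces AS
(
    SELECT Seq = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),   -- relies on document order
           Txt = p.n.value('(./text())[1]', 'varchar(max)')
    FROM @x.nodes('r/p') p(n)
)
SELECT FragSeq  = (f.Seq + 1) / 2,        -- 1, 2, 3, ... per fragment
       [Text]   = ISNULL(f.Txt, ''),      -- literal text before the token
       SubsName = t.Txt                   -- token name; NULL after the last fragment
FROM Pieces f
LEFT JOIN Pieces t ON t.Seq = f.Seq + 1
WHERE f.Seq % 2 = 1;                      -- odd pieces are text, even pieces are token names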
I really don't know how it will perform with 10k+ lookups.
How does the old dynamic SQL perform?
DECLARE @sqlCommand NVARCHAR(MAX)
SELECT @sqlCommand = N'PlaceholderTable.[Notes]'
SELECT @sqlCommand = 'REPLACE( ' + @sqlCommand +
                     ', ''##' + LookupTable.[Name] + '##'', ''' +
                     LookupTable.[Value] + ''')'
FROM LookupTable
SELECT @sqlCommand = 'SELECT *, ' + @sqlCommand + ' FROM PlaceholderTable'
EXECUTE sp_executesql @sqlCommand
Fiddle demo
And now for some recursive CTE.
If your indexes are correctly set up, this one should be very fast or very slow. SQL Server always surprises me with performance extremes when it comes to the r-CTE...
;WITH T AS (
SELECT
Row,
StartIdx = 1, -- 1 as first starting index
EndIdx = CAST(patindex('%##%', Notes) as int), -- first ending index
Result = substring(Notes, 1, patindex('%##%', Notes) - 1)
-- (first) temp result bounded by indexes
FROM PlaceholderTable -- **this is your source table**
UNION ALL
SELECT
pt.Row,
StartIdx = newstartidx, -- starting index (calculated in calc1)
EndIdx = EndIdx + CAST(newendidx as int) + 1, -- ending index (calculated in calc4 + total offset)
Result = Result + CAST(ISNULL(newtokensub, newtoken) as nvarchar(max))
-- temp result taken from subquery or original
FROM
T
JOIN PlaceholderTable pt -- **this is your source table**
ON pt.Row = T.Row
CROSS APPLY(
SELECT newstartidx = EndIdx + 2 -- new starting index moved by 2 from last end ('##')
) calc1
CROSS APPLY(
SELECT newtxt = substring(pt.Notes, newstartidx, len(pt.Notes))
-- current piece of txt we work on
) calc2
CROSS APPLY(
SELECT patidx = patindex('%##%', newtxt) -- current index of '##'
) calc3
CROSS APPLY(
SELECT newendidx = CASE
WHEN patidx = 0 THEN len(newtxt) + 1
ELSE patidx END -- if last piece of txt, end with its length
) calc4
CROSS APPLY(
SELECT newtoken = substring(pt.Notes, newstartidx, newendidx - 1)
-- get the new token
) calc5
OUTER APPLY(
SELECT newtokensub = Value
FROM LookupTable
WHERE Name = newtoken -- substitute the token if you can find it in **your lookup table**
) calc6
WHERE newstartidx + len(newtxt) - 1 <= len(pt.Notes)
-- continue while {new starting index} + {length of txt we work on} does not exceed total length
)
,lastProcessed AS (
SELECT
Row,
Result,
rn = row_number() over(partition by Row order by StartIdx desc)
FROM T
) -- enumerate all (including intermediate) results
SELECT *
FROM lastProcessed
WHERE rn = 1 -- filter out intermediate results (display only last ones)