Separate words by group wise for each row in SQL - sql

I have a string something like
No People,Day,side view,looking at camera,snow,mountain,tranquil scene,tranquility,Night,walking,water,Two Person,looking Down
And I have a table Group_words
Group Category
---------------------------------------------------------------------------------------------------------------------------------------------------------------- --------------------
No People,One Person,Two Person,Three Person,Four Person,five person,medium group of people,large group of people,unrecognizable person,real people People
Day,dusk,night,dawn,sunset,sunrise Weather
looking at camera,looking way,looking sideways,looking down,looking up View Angle
I want to check every comma separated word with table Group_words and find the wrong combination.
For the above string result should be : "No People,Day,side view,looking at camera,snow,mountain,tranquil scene,tranquility,walking,water"
Night is removed because Day is available in the string.
Two Person is removed because No People is available in the string.
looking Down is removed because looking at camera is available in the string.
I know its to complicated but simply I want to remove the not matching words from sting which is available into table Group_words.

Wow, you should be re-designing your tables. Anyway, here is my attempt using Jeff Moden's DelimitedSplit8k.
I believe you now have this function since I answered one of your previous questions that also uses this function.
First, you want to split your #string input into separate rows. You should also split the Group_Words table.
After that you do a LEFT JOIN to get the matching categories. Then you eliminate the invalid words.
See it in action here: SQL Fiddle
DECLARE #string VARCHAR(8000)
SET #string = 'No People,Day,side view,looking at camera,snow,mountain,tranquil scene,tranquility,Night,walking,water,Two Person,looking Down'
-- Split #string variable
DECLARE #tbl_string AS TABLE(ItemNumber INT, Item VARCHAR(8000))
INSERT INTO #tbl_string
SELECT
ItemNumber, LTRIM(RTRIM(Item))
FROM dbo.DelimitedSplit8K(#string, ',')
-- Normalize Group_Words
DECLARE #tbl_grouping AS TABLE(Category VARCHAR(20), ItemNumber INT, Item VARCHAR(8000))
INSERT INTO #tbl_grouping
SELECT
w.Category, s.ItemNumber, LTRIM(RTRIM(s.Item))
FROM Group_Words w
CROSS APPLY dbo.DelimitedSplit8K(w.[Group], ',')s
;WITH Cte AS(
SELECT
s.ItemNumber,
s.Item,
g.category,
RN = ROW_NUMBER() OVER(PARTITION BY g.Category ORDER BY s.ItemNumber)
FROM #tbl_string s
LEFT JOIN #tbl_grouping g
ON g.Item = s.Item
)
SELECT STUFF((
SELECT ',' + Item
FROM Cte
WHERE
RN = 1
OR Category IS NULL
ORDER BY ItemNumber
FOR XML PATH(''), TYPE).value('.', 'VARCHAR(MAX)'),
1, 1, '')
OUTPUT:
| |
|--------------------------------------------------------------------------------------------------|
| No People,Day,side view,looking at camera,snow,mountain,tranquil scene,tranquility,walking,water |
If your #string input has more than 8000 characters, the DelimitedSplit8K will slow down. You can use other splitters instead. Here is one taken for Sir Aaron Bertrands's article.
CREATE FUNCTION dbo.SplitStrings_XML
(
#List NVARCHAR(MAX),
#Delimiter NVARCHAR(255)
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
(
SELECT Item = y.i.value('(./text())[1]', 'nvarchar(4000)')
FROM
(
SELECT x = CONVERT(XML, '<i>'
+ REPLACE(#List, #Delimiter, '</i><i>')
+ '</i>').query('.')
) AS a CROSS APPLY x.nodes('i') AS y(i)
);
GO

Related

Loop and insert more than one comma separated List in SQL

I wish to loop through two comma-separated values and perform an insert
As an example lets consider two variables
Declare #Qid= 1,4,6,7,8 #Answers = 4,4,3,2,3
set #pos = 0
set #len = 0
WHILE CHARINDEX(',', #Answers, #pos+1)>0
BEGIN
set #len = CHARINDEX(',', #Answers, #pos+1) - #pos
set #value = SUBSTRING(#Answers, #pos, #len)
insert into table values(#fdid,#Qid,#fusid, #value) -- i need Qid also
set #pos = CHARINDEX(',', #Answers, #pos+#len) +1
END
Using this loop I am able to extract #Answers and can perform insert. But I wish to extract #Qid and insert inside the loop.
edit
for more clarity it is a feedback module. my result table have Qid and Answer field. Answers are ratings (1 to 5). The values we get in variables #Qid and #Answers are sequential. which means 1st answer will be for 1st question and so on.
edit
as per Shnugo's Answer
Declare #Qid varchar(100)= '1,4,6,7,8', #Answers varchar(100)= '4,4,3,2,3'
DECLARE #tbl TABLE(ID INT IDENTITY, Questions VARCHAR(100),Answers VARCHAR(100));
INSERT INTO #tbl VALUES(#Qid,#Answers)
INSERT INTO table(FeedbackId,QuestionId,FeedbackUserId,Answer)
SELECT 1,
A.qXml.value('(/x[sql:column("B.QuestionCount")])[1]','int') AS QuestionNumber,3
,A.aXml.value('(/x[sql:column("B.QuestionCount")])[1]','int') AS AnwerNumber
FROM #tbl t
CROSS APPLY(SELECT CAST('<x>' + REPLACE(#Qid,',','</x><x>') + '</x>' AS XML)
,CAST('<x>' + REPLACE(#Answers,',','</x><x>') + '</x>' AS XML)) A(qXml,aXml)
CROSS APPLY(SELECT TOP(A.qXml.value('count(/x)','int')) ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) FROM master..spt_values) B(QuestionCount)
I'd prefer Zhorov's JSON answer (needs v2016+).
If you use a SQL-Server below 2016 you might use this position-safe XML-based solution:
A mockup table to simulate your issue with two different rows.
DECLARE #tbl TABLE(ID INT IDENTITY, Questions VARCHAR(100),Answers VARCHAR(100));
INSERT INTO #tbl VALUES('1,4,6,7,8','4,4,3,2,3')
,('1,2,3','4,5,6');
--The query
SELECT t.*
,A.qXml.value('(/x[sql:column("B.QuestionCount")])[1]','int') AS QuestionNumber
,A.aXml.value('(/x[sql:column("B.QuestionCount")])[1]','int') AS AnwerNumber
FROM #tbl t
CROSS APPLY(SELECT CAST('<x>' + REPLACE(t.Questions,',','</x><x>') + '</x>' AS XML)
,CAST('<x>' + REPLACE(t.Answers,',','</x><x>') + '</x>' AS XML)) A(qXml,aXml)
CROSS APPLY(SELECT TOP(A.qXml.value('count(/x)','int')) ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) FROM master..spt_values) B(QuestionCount);
The idea in short:
We need a CROSS APPLY and some string methods to transform something like 1,2,3 to an xml like <x>1</x><x>2</x><x>3</x>.
Now we can use value() with XQuery count() to find the actual count of questions.
We need one more CROSS APPLY with a computed TOP() clause to get a set of running number from 1 to n with n=countOfQuestions. I do this against master..spt_values. This is just a well-filled standard table... We do not need the values, just any set to create the counter...
Finally we can use .value() in connection with sql:column() in order to fetch the question and the corresponding answer by their positions.
UPDATE: Non-tabular data
If you do not get these CSV parameters as a table you can use this:
Declare #Qid varchar(100)= '1,4,6,7,8', #Answers varchar(100)= '4,4,3,2,3'
--INSERT INTO table(FeedbackId,QuestionId,FeedbackUserId,Answer)
SELECT 1
,A.qXml.value('(/x[sql:column("B.QuestionCount")])[1]','int') AS QuestionNumber
,3
,A.aXml.value('(/x[sql:column("B.QuestionCount")])[1]','int') AS AnwerNumber
FROM (SELECT CAST('<x>' + REPLACE(#Qid,',','</x><x>') + '</x>' AS XML)
,CAST('<x>' + REPLACE(#Answers,',','</x><x>') + '</x>' AS XML)) A(qXml,aXml)
CROSS APPLY(SELECT TOP(A.qXml.value('count(/x)','int')) ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) FROM master..spt_values) B(QuestionCount);
If you use SQL Server 2016 or higher, you may try to use the next JSON-based approach to map questions and answers by their positions in the input strings. You need to transform the input strings into valid JSON arrays and then use OPENJSON() with default schema to parse the arrays. The result is a table, with columns key, value and type and the key column holds the index of the element in the specified array.
Note, that STRING_SPLIT() function does not guarantee the order of the rows and the output rows might be in any order.
Statement:
DECLARE #Qid nvarchar(max) = N'1,4,6,7,8'
DECLARE #Answers nvarchar(max) = N'4,4,3,2,3'
-- Build your INSERT statement as you expect
-- INSERT INTO Table ...
SELECT j1.[value] AS Qid, j2.[value] AS Answers
FROM OPENJSON(CONCAT(N'[', #Qid, N']')) j1
JOIN OPENJSON(CONCAT(N'[', #Answers, N']')) j2 ON j1.[key] = j2.[key]
Result from the SELECT statement:
Qid Answers
1 4
4 4
6 3
7 2
8 3
You have not described relationship of question and its answers. I feel its one to one relationship and for that, I have given the answer.
declare #Qid varchar(200)= '1,4,6,7,8' , #Answers varchar(200) = '4,4,3,2,3'
;with cte
as(
select id, data qid from dbo.Split (#qid, ',')
),
cte1 as
(
select id, data ansid from dbo.Split (#answers, ',')
)
--insert into tablename
select
qid, ansid from cte join cte1 on cte.id = cte1.id
Result will be:
qid ansid
1 4
4 4
6 3
7 2
8 3
See other option for later version of sqlserver : Split function equivalent in T-SQL?

Order Concatenated field

I have a field which is a concatenation of single letters. I am trying to order these strings within a view. These values can't be hard coded as there are too many. Is someone able to provide some guidance on the function to use to achieve the desired output below? I am using MSSQL.
Current output
CustID | Code
123 | BCA
Desired output
CustID | Code
123 | ABC
I have tried using a UDF
CREATE FUNCTION [dbo].[Alphaorder] (#str VARCHAR(50))
returns VARCHAR(50)
BEGIN
DECLARE #len INT,
#cnt INT =1,
#str1 VARCHAR(50)='',
#output VARCHAR(50)=''
SELECT #len = Len(#str)
WHILE #cnt <= #len
BEGIN
SELECT #str1 += Substring(#str, #cnt, 1) + ','
SET #cnt+=1
END
SELECT #str1 = LEFT(#str1, Len(#str1) - 1)
SELECT #output += Sp_data
FROM (SELECT Split.a.value('.', 'VARCHAR(100)') Sp_data
FROM (SELECT Cast ('<M>' + Replace(#str1, ',', '</M><M>') + '</M>' AS XML) AS Data) AS A
CROSS APPLY Data.nodes ('/M') AS Split(a)) A
ORDER BY Sp_data
RETURN #output
END
This works when calling one field
ie.
Select CustID, dbo.alphaorder(Code)
from dbo.source
where custid = 123
however when i try to apply this to top(10) i receive the error
"Invalid length parameter passed to the LEFT or SUBSTRING function."
Keeping in mind my source has ~4million records, is this still the best solution?
Unfortunately i am not able to normalize the data into a separate table with records for each Code.
This doesn't rely on a id column to join with itself, performance is almost as fast
as the answer by #Shnugo:
SELECT
CustID,
(
SELECT
chr
FROM
(SELECT TOP(LEN(Code))
SUBSTRING(Code,ROW_NUMBER() OVER(ORDER BY (SELECT NULL)),1)
FROM sys.messages) A(Chr)
ORDER by chr
FOR XML PATH(''), type).value('.', 'varchar(max)'
) As CODE
FROM
source t
First of all: Avoid loops...
You can try this:
DECLARE #tbl TABLE(ID INT IDENTITY, YourString VARCHAR(100));
INSERT INTO #tbl VALUES ('ABC')
,('JSKEzXO')
,('QKEvYUJMKRC');
--the cte will create a list of all your strings separated in single characters.
--You can check the output with a simple SELECT * FROM SeparatedCharacters instead of the actual SELECT
WITH SeparatedCharacters AS
(
SELECT *
FROM #tbl
CROSS APPLY
(SELECT TOP(LEN(YourString)) ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) FROM master..spt_values) A(Nmbr)
CROSS APPLY
(SELECT SUBSTRING(YourString,Nmbr,1))B(Chr)
)
SELECT ID,YourString
,(
SELECT Chr As [*]
FROM SeparatedCharacters sc1
WHERE sc1.ID=t.ID
ORDER BY sc1.Chr
FOR XML PATH(''),TYPE
).value('.','nvarchar(max)') AS Sorted
FROM #tbl t;
The result
ID YourString Sorted
1 ABC ABC
2 JSKEzXO EJKOSXz
3 QKEvYUJMKRC CEJKKMQRUvY
The idea in short
The trick is the first CROSS APPLY. This will create a tally on-the-fly. You will get a resultset with numbers from 1 to n where n is the length of the current string.
The second apply uses this number to get each character one-by-one using SUBSTRING().
The outer SELECT calls from the orginal table, which means one-row-per-ID and use a correalted sub-query to fetch all related characters. They will be sorted and re-concatenated using FOR XML. You might add DISTINCT in order to avoid repeating characters.
That's it :-)
Hint: SQL-Server 2017+
With version v2017 there's the new function STRING_AGG(). This would make the re-concatenation very easy:
WITH SeparatedCharacters AS
(
SELECT *
FROM #tbl
CROSS APPLY
(SELECT TOP(LEN(YourString)) ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) FROM master..spt_values) A(Nmbr)
CROSS APPLY
(SELECT SUBSTRING(YourString,Nmbr,1))B(Chr)
)
SELECT ID,YourString
,STRING_AGG(sc.Chr,'') WITHIN GROUP(ORDER BY sc.Chr) AS Sorted
FROM SeparatedCharacters sc
GROUP BY ID,YourString;
Considering your table having good amount of rows (~4 Million), I would suggest you to create a persisted calculated field in the table, to store these values. As calculating these values at run time in a view, will lead to performance problems.
If you are not able to normalize, add this as a denormalized column to the existing table.
I think the error you are getting could be due to empty codes.
If LEN(#str) = 0
BEGIN
SET #output = ''
END
ELSE
BEGIN
... EXISTING CODE BLOCK ...
END
I can suggest to split string into its characters using referred SQL function.
Then you can concatenate string back, this time ordered alphabetically.
Are you using SQL Server 2017? Because with SQL Server 2017, you can use SQL String_Agg string aggregation function to concatenate characters splitted in an ordered way as follows
select
t.CustId, string_agg(strval, '') within GROUP (order by strval)
from CharacterTable t
cross apply dbo.SPLIT(t.code) s
where strval is not null
group by CustId
order by CustId
If you are not working on SQL2017, then you can follow below structure using SQL XML PATH for concatenation in SQL
select
CustId,
STUFF(
(
SELECT
'' + strval
from CharacterTable ct
cross apply dbo.SPLIT(t.code) s
where strval is not null
and t.CustId = ct.CustId
order by strval
FOR XML PATH('')
), 1, 0, ''
) As concatenated_string
from CharacterTable t
order by CustId

How to SORT in order as entered in SQL Server?

I'm using SQL Server and I'm trying to find results but I would like to get the results in the same order as I had input the conditions.
My code:
SELECT
AccountNumber, EndDate
FROM
Accounts
WHERE
AccountNumber IN (212345, 312345, 145687, 658975, 256987, 365874, 568974, 124578, 125689) -- I would like the results to be in the same order as these numbers.
Here is an in-line approach
Example
Declare #List varchar(max)='212345, 312345, 145687, 658975, 256987, 365874, 568974, 124578, 125689'
Select A.AccountNumber
,A.EndDate
From Accounts A
Join (
Select RetSeq = Row_Number() over (Order By (Select null))
,RetVal = v.value('(./text())[1]', 'int')
From (values (convert(xml,'<x>' + replace(#List,',','</x><x>')+'</x>'))) x(n)
Cross Apply n.nodes('x') node(v)
) B on A.AccountNumber = B.RetVal
Order By B.RetSeq
EDIT - the subquery Returns
RetSeq RetVal
1 212345
2 312345
3 145687
4 658975
5 256987
6 365874
7 568974
8 124578
9 125689
You can replace IN with a JOIN, and set a field for ordering, like this:
SELECT AccountNumber , EndDate
FROM Accounts a
JOIN (
SELECT 212345 AS Number, 1 AS SeqOrder
UNION ALL
SELECT 312345 AS Number, 2 AS SeqOrder
UNION ALL
SELECT 145687 AS Number, 3 AS SeqOrder
UNION ALL
... -- and so on
) AS inlist ON inlist.Number = a.AccountNumber
ORDER BY inlist.SeqOrder
I will offer one more approach I just found out, but this needs v2016. Regrettfully the developers forgot to include the index into the resultset of STRING_SPLIT(), but this would work and is documented:
A solution via FROM OPENJSON():
DECLARE #str VARCHAR(100) = 'val1,val2,val3';
SELECT *
FROM OPENJSON('["' + REPLACE(#str,',','","') + '"]');
The result
key value type
0 val1 1
1 val2 1
2 val3 1
The documentation tells clearly:
When OPENJSON parses a JSON array, the function returns the indexes of the elements in the JSON text as keys.
This is not an answer, just some test-code to check John Cappelletti's approach.
DECLARE #tbl TABLE(ID INT IDENTITY,SomeGuid UNIQUEIDENTIFIER);
--Create more than 6 mio rows with an running number and a changing Guid
WITH tally AS (SELECT ROW_NUMBER()OVER(ORDER BY (SELECT NULL)) AS Nmbr
FROM master..spt_values v1
CROSS JOIN master..spt_values v2)
INSERT INTO #tbl
SELECT NEWID() from tally;
SELECT COUNT(*) FROM #tbl; --6.325.225 on my machine
--Create an XML with nothing more than a list of GUIDs in the order of the table's ID
DECLARE #xml XML=
(SELECT SomeGuid FRom #tbl ORDER BY ID FOR XML PATH(''),ROOT('root'),TYPE);
--Create one invalid entry
UPDATE #tbl SET SomeGuid = NEWID() WHERE ID=10000;
--Read all GUIDs out of the XML and number them
DECLARE #tbl2 TABLE(Position INT,TheGuid UNIQUEIDENTIFIER);
INSERT INTO #tbl2(Position,TheGuid)
SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL))
,g.value(N'text()[1]',N'uniqueidentifier')
FROM #xml.nodes(N'/root/SomeGuid') AS A(g);
--then JOIN them via "Position" and check,
--if there are rows, where not the same values get into the same row.
SELECT *
FROM #tbl t
INNER JOIN #tbl2 t2 ON t2.Position=t.ID
WHERE t.SomeGuid<>t2.TheGuid;
At least in this simple case I always get exactly only the one record back which was invalidated...
Okay, after some re-thinking I'll offer the ultimative XML based type-safe and sort-safe splitter:
Declare #List varchar(max)='212345, 312345, 145687, 658975, 256987, 365874, 568974, 124578, 125689';
DECLARE #delimiter VARCHAR(10)=', ';
WITH Casted AS
(
SELECT (LEN(#List)-LEN(REPLACE(#List,#delimiter,'')))/LEN(REPLACE(#delimiter,' ','.')) + 1 AS ElementCount
,CAST('<x>' + REPLACE((SELECT #List AS [*] FOR XML PATH('')),#delimiter,'</x><x>')+'</x>' AS XML) AS ListXml
)
,Tally(Nmbr) As
(
SELECT TOP((SELECT ElementCount FROM Casted)) ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) FROM master..spt_values v1 CROSS JOIN master..spt_values v2
)
SELECT Tally.Nmbr AS Position
,(SELECT ListXml.value('(/x[sql:column("Tally.Nmbr")])[1]','int') FROM Casted) AS Item
FROM Tally;
The trick is to create a list of running numbers with the fitting number of element (a number's table was even better) and to pick the elements according to their position.
Hint: This is rather slow...
UPDATE: even better:
WITH Casted AS
(
SELECT (LEN(#List)-LEN(REPLACE(#List,#delimiter,'')))/LEN(REPLACE(#delimiter,' ','.')) + 1 AS ElementCount
,CAST('<x>' + REPLACE((SELECT #List AS [*] FOR XML PATH('')),#delimiter,'</x><x>')+'</x>' AS XML)
.query('
for $x in /x
return <x p="{count(/x[. << $x])}">{$x/text()[1]}</x>
') AS ListXml
)
SELECT x.value('#p','int') AS Position
,x.value('text()[1]','int') AS Item
FROM Casted
CROSS APPLY Casted.ListXml.nodes('/x') AS A(x);
Elements are create as
<x p="99">TheValue</x>
Regrettfully the XQuery function position() is not available to retrieve the value. But you can use the trick to count all elements before a given node. this is scaling badly, as this count must be performed over and over. The more elements the worse it goes...
UPDATE2: With a known count of elements one might use this (much better performance)
Use XQuery to iterate a literally given list:
WITH Casted AS
(
SELECT (LEN(#List)-LEN(REPLACE(#List,#delimiter,'')))/LEN(REPLACE(#delimiter,' ','.')) + 1 AS ElementCount
,CAST('<x>' + REPLACE((SELECT #List AS [*] FOR XML PATH('')),#delimiter,'</x><x>')+'</x>' AS XML)
.query('
for $i in (1,2,3,4,5,6,7,8,9)
return <x p="{$i}">{/x[$i]/text()[1]}</x>
') AS ListXml
)
SELECT x.value('#p','int') AS Position
,x.value('text()[1]','int') AS Item
FROM Casted
CROSS APPLY Casted.ListXml.nodes('/x') AS A(x);
In Azure SQL, there is now extended version of STRING_SPLIT which also can return the order of items if the third optional argument enable_ordinal is set to 1.
Then this simple task is finally easy:
DECLARE #string AS varchar(200) = 'a/b/c/d/e'
DECLARE #position AS int = 3
SELECT value FROM STRING_SPLIT(#string, '/', 1) WHERE ordinal = #position
Unfortunately not available in SQL Server 2019, only in Azure for now, lets hope it will be in SQL Server 2022.

Dynamic String replacement in a list for SQL 2005

I have a column that can have results like the following:
ER
ER,ER,ER
ER,ER,OR,OR
OR
OR,OR
OR,OR,OR,ER,ER
I am looking for a way to replace any a string such as "ER,ER,ER,OR,OR" to just "ER,OR". No matter how many times ER or OR show up I just want each displayed only once. Thank you
Create a split function (or search for hundreds of variations on this site):
CREATE FUNCTION [dbo].[SplitStrings_XML]
(
#List NVARCHAR(MAX),
#Delimiter NVARCHAR(255)
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
(
SELECT Item = y.i.value(N'./text()[1]', N'nvarchar(4000)')
FROM
(
SELECT x = CONVERT(XML, '<i>'
+ REPLACE(#List, #Delimiter, '</i><i>') + '</i>').query('.')
) AS a
CROSS APPLY x.nodes('i') AS y(i)
);
GO
Then you can do this:
DECLARE #x TABLE(foo VARCHAR(32));
INSERT #x SELECT 'ER'
UNION ALL SELECT 'ER,ER,ER'
UNION ALL SELECT 'ER,ER,OR,OR'
UNION ALL SELECT 'OR'
UNION ALL SELECT 'OR,OR'
UNION ALL SELECT 'OR,OR,OR,ER,ER';
;WITH g AS
(
SELECT x.foo, s.Item
FROM #x AS x
CROSS APPLY dbo.SplitStrings_XML(x.foo, ',') AS s
GROUP BY x.foo, s.Item
)
SELECT original = x.foo, new = STUFF((SELECT ',' + Item FROM g
WHERE foo = x.foo GROUP BY Item
FOR XML PATH(''),
TYPE).value(N'./text()[1]', N'nvarchar(max)'), 1, 1, '')
FROM #x AS x;
Results are almost exactly the way you want, except order from the initial string is not preserved:
original new
----------------- ----------
ER ER
ER,ER,ER ER
ER,ER,OR,OR ER,OR
OR OR
OR,OR OR
OR,OR,OR,ER,ER ER,OR
A quick and dirty way would be like this... though it's limited to the example you provide in the question only. Conceptually there are better ways as provided in the comment to your question by KM.
update mytable
set mycolumn =
case when (mycolumn like '%er%' and not mycolumn like '%or%') then 'ER'
when (mycolumn like '%or%' and not mycolumn like '%er%') then 'OR'
when (mycolumn like '%er%' and mycolumn like '%or%') then 'ER,OR' end

TSQL - Querying a table column to pull out popular words for a tag cloud

Just an exploratory question to see if anyone has done this or if, in fact it is at all possible.
We all know what a tag cloud is, and usually, a tag cloud is created by someone assigning tags. Is it possible, within the current features of SQL Server to create this automatically, maybe via trigger when a table has a record added or updated, by looking at the data within a certain column and getting popular words?
It is similar to this question: How can I get the most popular words in a table via mysql?. But, that is MySQL not MSSQL.
Thanks in advance.
James
Here is a good bit on parsing delimited string into rows:
http://anyrest.wordpress.com/2010/08/13/converting-parsing-delimited-string-column-in-sql-to-rows/
http://www.sqlteam.com/article/parsing-csv-values-into-multiple-rows
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=50648
T-SQL: Opposite to string concatenation - how to split string into multiple records
If you want to parse all words, you can use the space ' ' as your delimiter, Then you get a row for each word.
Next you would simply select the result set GROUPing by the word and aggregating the COUNT
Order your results and you're there.
IMO, the design approach is what makes this difficult. Just because you allow users to assign tags does not mean the tags must be stored as a single delimited list of words. You can normalize the structure into something like:
Create Table Posts ( Id ... not null primary key )
Create Table Tags( Id ... not null primary key, Name ... not null Unique )
Create Table PostTags
( PostId ... not null References Posts( Id )
, TagId ... not null References Tags( Id ) )
Now your question becomes trivial:
Select T.Id, T.Name, Count(*) As TagCount
From PostTags As PT
Join Tags As T
On T.Id = PT.TagId
Group By T.Id, T.Name
Order By Count(*) Desc
If you insist on storing tags as delimited values, then only solution is to split the values on their delimiter by writing a custom Split function and then do your count. At the bottom is an example of a Split function. With it your query would look something like (using a comma delimiter):
Select Tag.Value, Count(*) As TagCount
From Posts As P
Cross Apply dbo.Split( P.Tags, ',' ) As Tag
Group By Tag.Value
Order By Count(*) Desc
Split Function:
Create Function [dbo].[Split]
(
#DelimitedList nvarchar(max)
, #Delimiter nvarchar(2) = ','
)
RETURNS TABLE
AS
RETURN
(
With CorrectedList As
(
Select Case When Left(#DelimitedList, DataLength(#Delimiter)/2) <> #Delimiter Then #Delimiter Else '' End
+ #DelimitedList
+ Case When Right(#DelimitedList, DataLength(#Delimiter)/2) <> #Delimiter Then #Delimiter Else '' End
As List
, DataLength(#Delimiter)/2 As DelimiterLen
)
, Numbers As
(
Select TOP (Coalesce(Len(#DelimitedList),1)) Row_Number() Over ( Order By c1.object_id ) As Value
From sys.objects As c1
Cross Join sys.columns As c2
)
Select CharIndex(#Delimiter, CL.list, N.Value) + CL.DelimiterLen As Position
, Substring (
CL.List
, CharIndex(#Delimiter, CL.list, N.Value) + CL.DelimiterLen
, Case
When CharIndex(#Delimiter, CL.list, N.Value + 1)
- CharIndex(#Delimiter, CL.list, N.Value)
- CL.DelimiterLen < 0 Then Len(CL.List)
Else CharIndex(#Delimiter, CL.list, N.Value + 1)
- CharIndex(#Delimiter, CL.list, N.Value)
- CL.DelimiterLen
End
) As Value
From CorrectedList As CL
Cross Join Numbers As N
Where N.Value < Len(CL.List)
And Substring(CL.List, N.Value, CL.DelimiterLen) = #Delimiter
)
Word or Tag clouds need two fields: a string and a value of how many times that word or string appeared in your collection. You can then pass the results into a tag cloud tool that will display the data as you require.
Not to take away from the previous answers, as they do answer the original challenge. However, I have a simpler solution using two functions (similar to #Thomas answer), one of which uses regex to "clean" the words.
The two functions are:
dbo.fnStripChars(a, b) --use regex 'b' to cleanse a string 'a'
dbo.fnMakeTableFromList(a, b) --convert a single field 'a' into a tabled list, delimited by 'b'
I then apply them into a single SQL statement, using the TOP n feature to give me the top 10 words I want to pass onto PowerBI or some other graphical tool, for actually displaying a word or tag cloud.
SELECT TOP 10 b.[words], b.[total]
FROM
(SELECT a.[words], count(*) AS [total]
FROM
(SELECT upper(l.item) AS [words]
FROM dbo.MyTableWithWords AS c
CROSS APPLY POTS.fnMakeTableFromList([POTS].fnStripChars(c.myColumnThatHasTheWords,'[^a-zA-Z ]'),' ') AS l) AS a
GROUP BY a.[words]) AS b
ORDER BY 2 DESC
As you can see, the regex is [^a-zA-Z ], which is to give me only alphabetical characters and spaces. The space is then used as a delimiter to the make table function to separate each word individually. I apply a count(*), to give me the number of times that word appears, hence then I have everything I need to give me the TOP 10 results.
Note that CROSS APPLY is important here so I get only data with actual "words" in each record found. Otherwise it will go through every record with or without words to extract from the column I want.
fnStripChars()
FUNCTION [dbo].[fnStripChars]
(
#String NVARCHAR(4000),
#MatchExpression VARCHAR(255)
)
RETURNS NVARCHAR(MAX)
AS
BEGIN
SET #MatchExpression = '%' + #MatchExpression + '%'
WHILE PatIndex(#MatchExpression, #String) > 0
SET #String = Stuff(#String, PatIndex(#MatchExpression, #String), 1, '')
RETURN #String
END
fnMakeTableFromList()
FUNCTION [dbo].[fnMakeTableFromList](
#List VARCHAR(MAX),
#Delimiter CHAR(1))
RETURNS TABLE
AS
RETURN (SELECT Item = CONVERT(VARCHAR, Item)
FROM (SELECT Item = x.i.value('(./text())[1]','varchar(max)')
FROM (SELECT [XML] = CONVERT(XML,'<i>' + REPLACE(#List,#Delimiter,'</i><i>') + '</i>').query('.')) AS a
CROSS APPLY [XML].nodes('i') AS x(i)) AS y
WHERE Item IS NOT NULL);
I've tested this with over 400K records and it's able to come back with my results in under 60 seconds. I think that's reasonable.