TSQL - Querying a table column to pull out popular words for a tag cloud - sql

Just an exploratory question to see whether anyone has done this, or whether it is possible at all.
We all know what a tag cloud is, and usually, a tag cloud is created by someone assigning tags. Is it possible, within the current features of SQL Server to create this automatically, maybe via trigger when a table has a record added or updated, by looking at the data within a certain column and getting popular words?
It is similar to this question: How can I get the most popular words in a table via mysql?. But, that is MySQL not MSSQL.
Thanks in advance.
James

Here is a good bit on parsing delimited string into rows:
http://anyrest.wordpress.com/2010/08/13/converting-parsing-delimited-string-column-in-sql-to-rows/
http://www.sqlteam.com/article/parsing-csv-values-into-multiple-rows
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=50648
T-SQL: Opposite to string concatenation - how to split string into multiple records
If you want to parse all words, you can use the space ' ' as your delimiter; then you get a row for each word.
Next, select from that result set, GROUP BY the word, and aggregate with COUNT.
Order the results by the count and you're there.
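As a rough sketch of the split, group, and count steps: on SQL Server 2016 or later (database compatibility level 130 or higher), the built-in STRING_SPLIT function can stand in for a hand-written splitter. The table variable, its columns, and the sample rows below are made up for illustration.

```sql
-- Hypothetical sample data; substitute your own table and column.
DECLARE @Posts TABLE (Id int IDENTITY(1,1) PRIMARY KEY, Body nvarchar(4000));
INSERT INTO @Posts (Body) VALUES
    (N'sql server tag cloud'),
    (N'tag cloud from sql text');

-- One row per word, then count the occurrences of each word.
SELECT TOP (10)
       LOWER(s.value) AS Word,
       COUNT(*)       AS WordCount
FROM @Posts AS p
CROSS APPLY STRING_SPLIT(p.Body, N' ') AS s
WHERE s.value <> N''              -- skip empty tokens caused by repeated spaces
GROUP BY LOWER(s.value)
ORDER BY WordCount DESC, Word;
```

Punctuation is not stripped in this sketch, so "cloud" and "cloud," would be counted as different words.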

IMO, the design approach is what makes this difficult. Just because you allow users to assign tags does not mean the tags must be stored as a single delimited list of words. You can normalize the structure into something like:
Create Table Posts ( Id ... not null primary key )
Create Table Tags( Id ... not null primary key, Name ... not null Unique )
Create Table PostTags
( PostId ... not null References Posts( Id )
, TagId ... not null References Tags( Id ) )
Now your question becomes trivial:
Select T.Id, T.Name, Count(*) As TagCount
From PostTags As PT
Join Tags As T
On T.Id = PT.TagId
Group By T.Id, T.Name
Order By Count(*) Desc
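To make the normalized design concrete, here is a small sketch using table variables. The column types and sample rows are assumptions, since the table definitions leave the types elided, and table variables cannot carry the foreign keys that real tables would have.

```sql
-- Assumed types (int keys, nvarchar(50) names); the definitions above leave these unspecified.
DECLARE @Tags TABLE (Id int NOT NULL PRIMARY KEY, Name nvarchar(50) NOT NULL UNIQUE);
DECLARE @PostTags TABLE (PostId int NOT NULL, TagId int NOT NULL, PRIMARY KEY (PostId, TagId));

INSERT INTO @Tags (Id, Name) VALUES (1, N'sql'), (2, N'tsql'), (3, N'tag-cloud');
INSERT INTO @PostTags (PostId, TagId) VALUES (10, 1), (10, 3), (11, 1), (12, 1), (12, 2);

-- Same shape as the counting query: one row per tag with its usage count.
SELECT T.Id, T.Name, COUNT(*) AS TagCount
FROM @PostTags AS PT
JOIN @Tags AS T
  ON T.Id = PT.TagId
GROUP BY T.Id, T.Name
ORDER BY COUNT(*) DESC;
```

With this sample data the tag sql comes back first with a count of 3.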
If you insist on storing tags as delimited values, then the only solution is to split the values on their delimiter by writing a custom Split function and then do your count. At the bottom is an example of a Split function. With it, your query would look something like this (using a comma delimiter):
Select Tag.Value, Count(*) As TagCount
From Posts As P
Cross Apply dbo.Split( P.Tags, ',' ) As Tag
Group By Tag.Value
Order By Count(*) Desc
Split Function:
Create Function [dbo].[Split]
(
@DelimitedList nvarchar(max)
, @Delimiter nvarchar(2) = ','
)
RETURNS TABLE
AS
RETURN
(
With CorrectedList As
(
Select Case When Left(@DelimitedList, DataLength(@Delimiter)/2) <> @Delimiter Then @Delimiter Else '' End
+ @DelimitedList
+ Case When Right(@DelimitedList, DataLength(@Delimiter)/2) <> @Delimiter Then @Delimiter Else '' End
As List
, DataLength(@Delimiter)/2 As DelimiterLen
)
, Numbers As
(
Select TOP (Coalesce(Len(@DelimitedList),1)) Row_Number() Over ( Order By c1.object_id ) As Value
From sys.objects As c1
Cross Join sys.columns As c2
)
Select CharIndex(@Delimiter, CL.list, N.Value) + CL.DelimiterLen As Position
, Substring (
CL.List
, CharIndex(@Delimiter, CL.list, N.Value) + CL.DelimiterLen
, Case
When CharIndex(@Delimiter, CL.list, N.Value + 1)
- CharIndex(@Delimiter, CL.list, N.Value)
- CL.DelimiterLen < 0 Then Len(CL.List)
Else CharIndex(@Delimiter, CL.list, N.Value + 1)
- CharIndex(@Delimiter, CL.list, N.Value)
- CL.DelimiterLen
End
) As Value
From CorrectedList As CL
Cross Join Numbers As N
Where N.Value < Len(CL.List)
And Substring(CL.List, N.Value, CL.DelimiterLen) = @Delimiter
)

Word or Tag clouds need two fields: a string and a value of how many times that word or string appeared in your collection. You can then pass the results into a tag cloud tool that will display the data as you require.
Not to take away from the previous answers, as they do answer the original challenge. However, I have a simpler solution using two functions (similar to @Thomas's answer), one of which uses a PATINDEX wildcard pattern (regex-like, though not a true regular expression) to "clean" the words.
The two functions are:
dbo.fnStripChars(a, b) --use regex 'b' to cleanse a string 'a'
dbo.fnMakeTableFromList(a, b) --convert a single field 'a' into a tabled list, delimited by 'b'
I then apply them into a single SQL statement, using the TOP n feature to give me the top 10 words I want to pass onto PowerBI or some other graphical tool, for actually displaying a word or tag cloud.
SELECT TOP 10 b.[words], b.[total]
FROM
(SELECT a.[words], count(*) AS [total]
FROM
(SELECT upper(l.item) AS [words]
FROM dbo.MyTableWithWords AS c
CROSS APPLY dbo.fnMakeTableFromList(dbo.fnStripChars(c.myColumnThatHasTheWords,'[^a-zA-Z ]'),' ') AS l) AS a
GROUP BY a.[words]) AS b
ORDER BY 2 DESC
As you can see, the pattern is [^a-zA-Z ], which matches any character that is not a letter or a space; stripping those characters leaves only alphabetical characters and spaces. The space is then used as the delimiter for the make-table function to separate the individual words. Applying count(*) gives the number of times each word appears, which is everything needed to produce the TOP 10 results.
Note that CROSS APPLY is important here so I get only data with actual "words" in each record found. Otherwise it will go through every record with or without words to extract from the column I want.
fnStripChars()
CREATE FUNCTION [dbo].[fnStripChars]
(
@String NVARCHAR(4000),
@MatchExpression VARCHAR(255)
)
RETURNS NVARCHAR(MAX)
AS
BEGIN
SET @MatchExpression = '%' + @MatchExpression + '%'
WHILE PatIndex(@MatchExpression, @String) > 0
SET @String = Stuff(@String, PatIndex(@MatchExpression, @String), 1, '')
RETURN @String
END
fnMakeTableFromList()
CREATE FUNCTION [dbo].[fnMakeTableFromList](
@List VARCHAR(MAX),
@Delimiter CHAR(1))
RETURNS TABLE
AS
RETURN (SELECT Item = CONVERT(VARCHAR, Item)
FROM (SELECT Item = x.i.value('(./text())[1]','varchar(max)')
FROM (SELECT [XML] = CONVERT(XML,'<i>' + REPLACE(@List,@Delimiter,'</i><i>') + '</i>').query('.')) AS a
CROSS APPLY [XML].nodes('i') AS x(i)) AS y
WHERE Item IS NOT NULL);
I've tested this with over 400K records and it's able to come back with my results in under 60 seconds. I think that's reasonable.

Related

SQL: Filter on Any Data in Delimited Column

I want to include records that only have any value in a specific location within a delimited string. For example, in the strings below, I want to only include the one in which there is data in column5. So, only Example B qualifies:
Example A: PV1|column1data||column3data|column4data||||column8data
Example B: PV1|column1data||column3data||column5data|||column8data
If your data always has the same format of 9 items and 8 separators:
select * from tab where col not like '%|%|%|%|%||%|%|%'
That said, there may be better options available in your RDBMS if you'd let us know which you're using. Also, storing multiple items in one column is one of the worst anti-patterns in a relational database.
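To see the fixed-width pattern at work, here is a small check against the two sample strings from the question. It uses a table variable in place of the real table, so the names are placeholders.

```sql
-- Placeholder table holding the two example strings from the question.
DECLARE @tab TABLE (col varchar(200));
INSERT INTO @tab (col) VALUES
    ('PV1|column1data||column3data|column4data||||column8data'),  -- Example A: column5 is empty
    ('PV1|column1data||column3data||column5data|||column8data');  -- Example B: column5 has data

-- The pattern matches strings whose 5th and 6th pipes are adjacent (empty column5),
-- so NOT LIKE keeps only the rows where column5 holds a value.
SELECT col
FROM @tab
WHERE col NOT LIKE '%|%|%|%|%||%|%|%';
```

Only Example B is returned. This relies on every value containing exactly eight pipes; with more pipes, a % could absorb one and the pattern would no longer pin down column5.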
Update for jagged data
For an unknown number of elements in your data, this could work in SQL server (depending upon whitespace and ANSI settings):
select *
from tab
where col not like replicate('%|', 5) + replicate('|%', len(col) - len(replace(col, '|', '')) - 5)
Here we're calculating the number of elements for each value and dynamically creating the like pattern to work as it does above. Other databases should have similar functions allowing much the same logic. I'm not sure how much better the answers can get without having more info.
The LIKE operator with a leading wildcard cannot use an index seek, so it becomes visibly slow once you hit a few hundred thousand records. This answer isn't super quick, but it will do the job and doesn't require SQL Server CLR. If your chosen system is Oracle, you will need 11g to use the same syntax.
Presuming you have a table...
create table udata (
ID int primary key identity(1,1)
, string varchar(2000) not null
);
You'll want a variable for this for any future changes, but also for script flexibility.
declare @delimiter varchar(1) = '|'
Then set everything up in a CTE.
;with parser as (
select
ID
, 0 as colNum
, substring(d.string, endPos + (2 * delimLen), len(d.string)) as string
, startPos
, endPos
from udata d
cross apply (
select
len(@delimiter) as startPos
, case charindex(@delimiter,d.string) when 0 then len(d.string) + len(@delimiter) else charindex(@delimiter,d.string) end - len(@delimiter) as endPos
, len(@delimiter) as delimLen
) p
where id between 2000 and 10000
union all
select
ID
, colNum + 1 as colNum
, substring(d.string, p.endPos + (2 * delimLen), len(d.string)) as string
, d.endPos + (2 * delimLen) as startPos
, d.endPos + (delimLen) + p.endPos as endPos
from parser d
cross apply (
select
len(@delimiter) as startPos
, case charindex(@delimiter,d.string) when 0 then len(d.string) + len(@delimiter) else charindex(@delimiter,d.string) end - len(@delimiter) as endPos
, len(@delimiter) as delimLen
) p
where string != ''
), selector as (
select u.id, p.colNum, substring(u.string, p.startPos, p.endPos - p.startPos + len(@delimiter)) as colVal--,u.string, p.startPos, p.endPos
from udata u
inner join parser p
on p.ID = u.ID
)
What this does is first mark the locations of the beginning and end of each column value; the selector then slices each value out of the source string. Note the where clause in the first part of the recursive query: where id between 2000 and 10000. You will want to limit your records here, for example to the types of records you are looking for.
Finally, select out your columns in a pivot for ease of reading:
select *
from selector
pivot (
max(colVal) for colNum in ([1],[2],[3],[4],[5],[6],[7],[8])
) pv
The original rows can instead be returned using your original criteria like this:
select *
from udata u
where exists (
select top 1 1
from selector s
where s.colNum = 5
and s.colVal =''
and s.ID = u.ID
)
My test data contains ~160K rows, and the exists query, without the ID limits in the CTE, took 18 seconds to run on laptop hardware. Still, this should do the trick.

Select statement that concatenates the first character after every '/' character in a column

So I am trying to write a query which, among other things, brings back the first character in a Varchar field, then returns the first character which appears after each / character throughout the rest of the field.
The field I am referring to will contain a group of last names, separated by a '/'. For example: Fischer-Costello/Korbell/Morrison/Pearson
For the above example, I would want my select statement to return: FKMP.
So far, I have only been able to get my code to return the first character + the first character after the FIRST (and only the first) '/' character.
So for the above example input, my select statement would return: FK
Here is the code that I have written so far:
select rp.CONTACT_ID, ra.TRADE_REP, c.FIRST_NAME, c.LAST_NAME,
UPPER(LEFT(FIRST_NAME, 1)) + SUBSTRING(c.first_name,CHARINDEX('/',c.first_name)+1,1) as al_1,
UPPER(LEFT(LAST_NAME, 1)) + SUBSTRING(c.LAST_name,CHARINDEX('/',c.LAST_name)+1,1) as al_2
from dbo.REP_ALIAS ra
inner join dbo.REP_PROFILE rp on rp.CONTACT_ID = ra.CONTACT_ID
inner join dbo.CONTACT c on rp.CONTACT_ID = c.CONTACT_ID
where
rp.CRD_NUMBER is null and
ra.TRADE_REP like '%DNK%' and
(c.LAST_NAME like '%/%' or c.FIRST_NAME like '%/%') and
ra.TRADE_FIRM in
(
'xxxxxxx',
'xxxxxxx'
)
If you read the code, it's obvious that I am attempting to perform the same concatenation on the first_name column as well. However, I realize that a solution which will work for the Last_name column (used in my example), will also work for the first_name column.
Thank you.
Some default values
DECLARE @List VARCHAR(50) = 'Fischer-Costello/Korbell/Morrison/Pearson'
DECLARE @SplitOn CHAR(1) = '/'
This area just splits the string into a list
DECLARE @RtnValue table
(
Id int identity(1,1),
Value nvarchar(4000)
)
While (Charindex(@SplitOn, @List)>0)
Begin
Insert Into @RtnValue (value)
Select
Value = ltrim(rtrim(Substring(@List,1,Charindex(@SplitOn,@List)-1)))
Set @List = Substring(@List,Charindex(@SplitOn,@List)+len(@SplitOn+',')-1,len(@List))
End
Insert Into @RtnValue (Value)
Select Value = ltrim(rtrim(@List))
Now let's grab the first character of each name and stuff it back into a single variable
SELECT STUFF((SELECT SUBSTRING(VALUE,1,1) FROM @RtnValue FOR XML PATH('')),1,0,'') AS Value
Outputs:
Value
FKMP
Here is another way to do this that would be a lot faster than looping. What you need is a set-based splitter. Jeff Moden at SQL Server Central has one that is awesome. Here is a link to the article: http://www.sqlservercentral.com/articles/Tally+Table/72993/
I know you have to sign up for an account to view it, but it is free, and the logic in that article will change the way you look at data. You might also be able to find his code posted if you search for DelimitedSplit8K.
At any rate, here is how you could implement this type of splitter.
declare @Table table(ID int identity, SomeValue varchar(50))
insert @Table
select 'Fischer-Costello/Korbell/Morrison/Pearson'
select ID, STUFF((select '' + left(x.Item, 1)
from @Table t2
cross apply dbo.DelimitedSplit8K(SomeValue, '/') x
where t2.ID = t1.ID
for xml path('')), 1, 0 , '') as MyResult
from @Table t1
group by t1.ID
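On newer versions there is a built-in route as well. The sketch below assumes SQL Server 2022 or Azure SQL Database, where STRING_SPLIT accepts a third enable_ordinal argument and exposes an ordinal column, together with STRING_AGG (available from SQL Server 2017). Neither feature existed when this thread was written, and the variable name is made up.

```sql
-- Assumes SQL Server 2022 / Azure SQL: STRING_SPLIT(..., 1) returns an ordinal column.
DECLARE @Names varchar(200) = 'Fischer-Costello/Korbell/Morrison/Pearson';

-- Take the first letter of each name and glue the letters back together in their original order.
SELECT STRING_AGG(UPPER(LEFT(s.value, 1)), '')
           WITHIN GROUP (ORDER BY s.ordinal) AS Initials
FROM STRING_SPLIT(@Names, '/', 1) AS s;
```

The ordinal column matters here: without it, STRING_SPLIT gives no guarantee about row order, so the initials could come back shuffled.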

Extracting pipe delimited field into rows

I have a tbl with a field with values that are pipe delimited, and I need them extracted as rows.
Sample data
select distinct [PROV_KEY],
[NTWK_CDS]
FROM [SPOCK].[US\AC39169].[WellPointExtract_ERR]
where [PROV_KEY] = '447358B0A8E1C0F1B7AEB1ED07EC2F25'
--results
PROV_KEY NTWK_CDS
447358B0A8E1C0F1B7AEB1ED07EC2F25 |GA_HMO|GA_OPN|GA_PPO|GA_BD|GA_MCPPO|GA_HDPPO|
And I would like:
PROV_KEY NTWK_CDS
447358B0A8E1C0F1B7AEB1ED07EC2F25 GA_HMO
447358B0A8E1C0F1B7AEB1ED07EC2F25 GA_OPN
447358B0A8E1C0F1B7AEB1ED07EC2F25 GA_PPO
I tried the following but I'm only getting the first set of values:
select distinct [PROV_KEY],
substring([NTWK_CDS], 1,
CHARINDEX('|',[NTWK_CDS], CHARINDEX('|',[NTWK_CDS])+1))
FROM [SPOCK].[US\AC39169].[WellPointExtract_ERR]
where [PROV_KEY] = '447358B0A8E1C0F1B7AEB1ED07EC2F25'
This is a standard string splitting problem and there are many solutions out there. However, most still feel like a workaround, as SQL Server versions before 2016 (which added STRING_SPLIT) do not have a split function built in.
You can start your research here: http://www.sommarskog.se/arrays-in-sql.html
The crucial operation you need to perform is a split. There are lots of solutions to this problem (see here for some), and people favor different ones depending on both situation and personal preference. Once you've done the split, though, you can JOIN or APPLY against the results to get the desired output.
I personally prefer using a SQLCLR function for this purpose since the performance is generally much better; but the number of options out there is staggering.
You can use splitting function
CREATE FUNCTION dbo.SplitStrings_CTE(@List nvarchar (1000), @Delimiter nvarchar(1 ))
RETURNS @returns TABLE(val nvarchar(100), [level] int, PRIMARY KEY CLUSTERED([level]))
AS
BEGIN
;WITH cte AS
(
SELECT SUBSTRING(@List, 0, CHARINDEX(@Delimiter , @List)) AS val ,
CAST(STUFF(@List + @Delimiter, 1, CHARINDEX(@Delimiter, @List),'') AS nvarchar (1000)) AS stval,
1 AS [level]
UNION ALL
SELECT SUBSTRING(stval, 0, CHARINDEX(@Delimiter, stval)),
CAST(STUFF(stval, 1 , CHARINDEX(@Delimiter ,stval), '') AS nvarchar(1000)),
[level] + 1
FROM cte
WHERE stval != ''
)
INSERT @returns
SELECT REPLACE(val ,' ' ,'') AS val, [level]
FROM cte
RETURN
END
Hence, your SELECT statement will be
SELECT *
FROM dbo.test82 t CROSS APPLY dbo.SplitStrings_CTE(t.NTWK_CDS, '|') o
WHERE o.val != ''
Demo on SQLFiddle

SQL: Find rows where Column contains all of the given words

I have some column EntityName, and I want users to be able to search names by entering words separated by spaces. The space is implicitly considered an 'AND' operator, meaning that the returned rows must contain all of the words specified, though not necessarily in the given order.
For example, if we have rows like these:
abba nina pretty balerina
acdc you shook me all night long
sth you are me
dream theater it's all about you
when the user enters: me you, or you me (the results must be equivalent), the result has rows 2 and 3.
I know I can go like:
WHERE Col1 LIKE '%' + word1 + '%'
AND Col1 LIKE '%' + word2 + '%'
but I wanted to know if there's some more optimal solution.
The CONTAINS would require a full text index, which (for various reasons) is not an option.
Maybe Sql2008 has some built-in, semi-hidden solution for these cases?
The only thing I can think of is to write a CLR function that does the LIKE comparisons. This should be many times faster.
Update: Now that I think about it, it makes sense CLR would not help. Two other ideas:
1 - Try indexing Col1 and do this:
WHERE (Col1 LIKE word1 + '%' or Col1 LIKE '%' + word1 + '%')
AND (Col1 LIKE word2 + '%' or Col1 LIKE '%' + word2 + '%')
Depending on the most common searches (starts with vs. substring), this may offer an improvement.
2 - Add your own full text indexing table where each word is a row in the table. Then you can index properly.
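The second idea can be sketched as a small word index plus a "matches all words" query. Everything here (the table variables, column names, and the use of STRING_SPLIT, which needs SQL Server 2016 or later) is an assumption for illustration. Note that it matches whole words, which is not the same as the substring matching that LIKE '%word%' performs.

```sql
-- Hypothetical data taken from the example rows in the question.
DECLARE @Entity TABLE (Id int NOT NULL PRIMARY KEY, EntityName varchar(200) NOT NULL);
INSERT INTO @Entity (Id, EntityName) VALUES
    (1, 'abba nina pretty balerina'),
    (2, 'acdc you shook me all night long'),
    (3, 'sth you are me'),
    (4, 'dream theater it''s all about you');

-- Word index: one row per distinct (word, entity) pair, keyed so lookups by word can seek.
DECLARE @EntityWord TABLE (Word varchar(100) NOT NULL, EntityId int NOT NULL,
                           PRIMARY KEY (Word, EntityId));
INSERT INTO @EntityWord (Word, EntityId)
SELECT DISTINCT s.value, e.Id
FROM @Entity AS e
CROSS APPLY STRING_SPLIT(e.EntityName, ' ') AS s
WHERE s.value <> '';

-- The user's search terms, one per row.
DECLARE @Search TABLE (Word varchar(100) NOT NULL PRIMARY KEY);
INSERT INTO @Search (Word) VALUES ('me'), ('you');

-- Keep entities that matched every search word.
SELECT e.Id, e.EntityName
FROM @Entity AS e
JOIN @EntityWord AS w ON w.EntityId = e.Id
JOIN @Search AS s ON s.Word = w.Word
GROUP BY e.Id, e.EntityName
HAVING COUNT(*) = (SELECT COUNT(*) FROM @Search);
```

For the search "me you" this returns rows 2 and 3, as the question expects. In a real schema the index table would need to be maintained whenever EntityName changes, for example by a trigger.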
Function
CREATE FUNCTION [dbo].[fnSplit] ( @sep CHAR(1), @str VARCHAR(512) )
RETURNS TABLE AS
RETURN (
WITH Pieces(pn, start, stop) AS (
SELECT 1, 1, CHARINDEX(@sep, @str)
UNION ALL
SELECT pn + 1, stop + 1, CHARINDEX(@sep, @str, stop + 1)
FROM Pieces
WHERE stop > 0
)
SELECT
pn AS Id,
SUBSTRING(@str, start, CASE WHEN stop > 0 THEN stop - start ELSE 512 END) AS Data
FROM
Pieces
)
Query
DECLARE @FilterTable TABLE (Data VARCHAR(512))
INSERT INTO @FilterTable (Data)
SELECT DISTINCT S.Data
FROM fnSplit(' ', 'word1 word2 word3') S -- Contains words
SELECT DISTINCT
T.*
FROM
MyTable T
INNER JOIN @FilterTable F1 ON T.Col1 LIKE '%' + F1.Data + '%'
LEFT JOIN @FilterTable F2 ON T.Col1 NOT LIKE '%' + F2.Data + '%'
WHERE
F2.Data IS NULL
Source: SQL SELECT WHERE field contains words
http://msdn.microsoft.com/en-us/magazine/cc163473.aspx
You're going to end up with a full table scan anyway.
The collation can make a big difference apparently. Kalen Delaney in the book "Microsoft SQL Server 2008 Internals" says:
Collation can make a huge difference
when SQL Server has to look at almost
all characters in the strings. For
instance, look at the following:
SELECT COUNT(*) FROM tbl WHERE longcol LIKE '%abc%'
This may execute 10 times faster or more with a binary collation than a nonbinary Windows collation. And with varchar data, this executes up to seven or eight times faster with a SQL collation than with a Windows collation.
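As an illustration of what switching the collation looks like in a query, here is a minimal sketch. The table variable and rows are invented, Latin1_General_BIN2 is just one binary collation you might pick, and no timing is claimed here.

```sql
-- Invented sample rows; tbl and longcol follow the names used in the quote.
DECLARE @tbl TABLE (longcol varchar(200));
INSERT INTO @tbl (longcol) VALUES ('xxabcxx'), ('xxABCxx'), ('nothing here');

-- Force a binary comparison for this predicate only.
SELECT COUNT(*) AS BinaryMatches
FROM @tbl
WHERE longcol COLLATE Latin1_General_BIN2 LIKE '%abc%';
```

A binary collation compares code points, so the match becomes case-sensitive: only 'xxabcxx' is counted here. That change in semantics has to be acceptable before trading it for speed.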
WITH Tokens AS(SELECT 'you' AS Token UNION ALL SELECT 'me')
SELECT ...
FROM YourTable AS t
WHERE (SELECT COUNT(*) FROM Tokens WHERE t.Col1 LIKE '%'+Tokens.Token+'%')
=
(SELECT COUNT(*) FROM Tokens) ;
This should ideally be done with the help of Full text search as mentioned above.
BUT,
If you don't have full text configured for your DB, here is a performance intensive solution for doing a prioritized string search.
-- table to search in
drop table if exists dbo.myTable;
go
CREATE TABLE dbo.myTable
(
myTableId int NOT NULL IDENTITY (1, 1),
code varchar(200) NOT NULL,
description varchar(200) NOT NULL -- this column contains the values we are going to search in
) ON [PRIMARY]
GO
-- function to split space separated search string into individual words
drop function if exists [dbo].[fnSplit];
go
CREATE FUNCTION [dbo].[fnSplit] (@StringInput nvarchar(max),
@Delimiter nvarchar(1))
RETURNS @OutputTable TABLE (
id nvarchar(1000)
)
AS
BEGIN
DECLARE @String nvarchar(100);
WHILE LEN(@StringInput) > 0
BEGIN
SET @String = LEFT(@StringInput, ISNULL(NULLIF(CHARINDEX(@Delimiter, @StringInput) - 1, -1),
LEN(@StringInput)));
SET @StringInput = SUBSTRING(@StringInput,
ISNULL(NULLIF(CHARINDEX(@Delimiter, @StringInput), 0), LEN(@StringInput)) + 1,
LEN(@StringInput));
INSERT INTO @OutputTable (id)
VALUES (@String);
END;
RETURN;
END;
GO
-- this is the search script which can be optionally converted to a stored procedure /function
declare @search varchar(max) = 'infection upper acute genito'; -- enter your search string here
-- the searched string above should give rows containing the following
-- infection in upper side with acute genitointestinal tract
-- acute infection in upper teeth
-- acute genitointestinal pain
if (len(trim(@search)) = 0) -- if search string is empty, just return records ordered alphabetically
begin
select 1 as Priority ,myTableid, code, Description from myTable order by Description
return;
end
declare @splitTable Table(
wordRank int Identity(1,1), -- individual words are assigned priority order (in order of occurrence/position)
word varchar(200)
)
declare @nonWordTable Table( -- table to trim out auxiliary verbs, prepositions etc. from the search
id varchar(200)
)
insert into @nonWordTable values
('of'),
('with'),
('at'),
('in'),
('for'),
('on'),
('by'),
('like'),
('up'),
('off'),
('near'),
('is'),
('are'),
(','),
(':'),
(';')
insert into @splitTable
select id from dbo.fnSplit(@search,' '); -- this function gives you a table with rows containing all the space separated words of the search; in this example the output will be -
-- id
-------------
-- infection
-- upper
-- acute
-- genito
delete s from @splitTable s join @nonWordTable n on s.word = n.id; -- trimming out non-words here
declare @countOfSearchStrings int = (select count(word) from @splitTable); -- count of space separated words for search
declare @highestPriority int = POWER(@countOfSearchStrings,3);
with plainMatches as
(
select myTableid, @highestPriority as Priority from myTable where Description like @search -- exact matches have highest priority
union
select myTableid, @highestPriority-1 as Priority from myTable where Description like @search + '%' -- then with something at the end
union
select myTableid, @highestPriority-2 as Priority from myTable where Description like '%' + @search -- then with something at the beginning
union
select myTableid, @highestPriority-3 as Priority from myTable where Description like '%' + @search + '%' -- then if the word falls somewhere in between
),
),
splitWordMatches as( -- give each searched word a rank based on its position in the searched string
-- and calculate its char index in the field to search
select myTable.myTableid, (@countOfSearchStrings - s.wordRank) as Priority, s.word,
wordIndex = CHARINDEX(s.word, myTable.Description) from myTable join @splitTable s on myTable.Description like '%'+ s.word + '%'
-- and not exists(select myTableid from plainMatches p where p.myTableId = myTable.myTableId) -- need not look into myTables that have already been found in plainmatches as they are highest ranked
-- this one takes a long time though, so commenting it, will have no impact on the result
),
matchingRowsWithAllWords as (
select myTableid, count(myTableid) as myTableCount from splitWordMatches group by(myTableid) having count(myTableid) = @countOfSearchStrings
)
, -- trim off the CTE here if you don't care about the ordering of words to be considered for priority
wordIndexRatings as( -- reverse the char indexes retrieved above so that words occurring earlier have a higher weight
-- and then normalize them to sequential values
select s.myTableid, Priority, word, ROW_NUMBER() over (partition by s.myTableid order by wordindex desc) as comparativeWordIndex
from splitWordMatches s join matchingRowsWithAllWords m on s.myTableId = m.myTableId
)
,
wordIndexSequenceRatings as ( -- need to do this to ensure that if the same set of words from search string is found in two rows,
-- their sequence in the field value is taken into account for higher priority
select w.myTableid, w.word, (w.Priority + w.comparativeWordIndex + coalesce(sequncedPriority ,0)) as Priority
from wordIndexRatings w left join
(
select w1.myTableid, w1.priority, w1.word, w1.comparativeWordIndex, count(w1.myTableid) as sequncedPriority
from wordIndexRatings w1 join wordIndexRatings w2 on w1.myTableId = w2.myTableId and w1.Priority > w2.Priority and w1.comparativeWordIndex>w2.comparativeWordIndex
group by w1.myTableid, w1.priority,w1.word, w1.comparativeWordIndex
)
sequencedPriority on w.myTableId = sequencedPriority.myTableId and w.Priority = sequencedPriority.Priority
),
prioritizedSplitWordMatches as ( -- this calculates the cumulative priority for a field value
select w1.myTableId, sum(w1.Priority) as OverallPriority from wordIndexSequenceRatings w1 join wordIndexSequenceRatings w2 on w1.myTableId = w2.myTableId
where w1.word <> w2.word group by w1.myTableid
),
completeSet as (
select myTableid, priority from plainMatches -- get plain matches which should be highest ranked
union
select myTableid, OverallPriority as priority from prioritizedSplitWordMatches -- get ranked split word matches (which are ordered based on word rank in search string and sequence)
),
maximizedCompleteSet as( -- set the priority of a field value = maximum priority for that field value
select myTableid, max(priority) as Priority from completeSet group by myTableId
)
select priority, myTable.myTableid , code, Description from maximizedCompleteSet m join myTable on m.myTableId = myTable.myTableId
order by Priority desc, Description -- order by priority desc to get highest rated items on top
--offset 0 rows fetch next 50 rows only -- optional paging

Replace with wildcard, in SQL

I know MS T-SQL does not support regular expression, but I need similar functionality. Here's what I'm trying to do:
I have a varchar table field which stores a breadcrumb, like this:
/ID1:Category1/ID2:Category2/ID3:Category3/
Each Category name is preceded by its Category ID, separated by a colon. I'd like to select and display these breadcrumbs but I want to remove the Category IDs and colons, like this:
/Category1/Category2/Category3/
Everything between the leading slash (/) up to and including the colon (:) should be stripped out.
I don't have the option of extracting the data, manipulating it externally, and re-inserting back into the table; so I'm trying to accomplish this in a SELECT statement.
I also can't resort to using a cursor to loop through each row and clean each field with a nested loop, due to the number of rows returned in the SELECT.
Can this be done?
Thanks all - Jay
I think your best bet is going to be to use a recursive user-defined function (UDF). I've included some code here that you can use to pass in a string to achieve the results you're looking for.
CREATE FUNCTION ufn_StripIDsFromBreadcrumb (@cIndex int, @breadcrumb varchar(max), @theString varchar(max))
RETURNS varchar(max)
AS
BEGIN
DECLARE @nextColon int
DECLARE @nextSlash int
SET @nextColon = CHARINDEX(':', @theString, @cIndex)
SET @nextSlash = CHARINDEX('/', @theString, @nextColon)
SET @breadcrumb = @breadcrumb + SUBSTRING(@theString, @nextColon + 1, @nextSlash - @nextColon)
IF @nextSlash != LEN(@theString)
BEGIN
exec @breadcrumb = ufn_StripIDsFromBreadcrumb @cIndex = @nextSlash, @breadcrumb = @breadcrumb, @theString = @theString
END
RETURN @breadcrumb
END
You could then execute it with:
DECLARE @myString varchar(max)
EXEC @myString = ufn_StripIDsFromBreadcrumb 1, '/', '/ID1:Category1/ID2:Category2/ID3:Category3/'
PRINT @myString
This works for SQL Server 2005 and up.
create table strings (
string varchar(1000)
)
insert into strings values( '/ID1:Category1/ID2:Category2/ID3:Category3/' )
insert into strings values( '/ID4:Category4/ID5:Category5/ID8:Category6/' )
insert into strings values( '/ID7:Category7/ID8:Category8/ID9:Category9/' )
go
with
replace_with_wildcard ( restrung ) as
(
select replace( string, '', '' )
from strings
union all
select
replace( restrung, substring( restrung, patindex( '%ID%', restrung ), 4 ), '' )
from replace_with_wildcard
where patindex( '%ID%', restrung ) > 0
)
select restrung
from replace_with_wildcard
where charindex( ':', restrung ) = 0
order by restrung
drop table strings
You might be able to do this using a Split function. The following split function relies on the existence of a Numbers table which literally contains a sequential list of numbers like so:
Create Table dbo.Numbers( Value int not null primary key clustered )
GO
With Nums As
(
Select ROW_NUMBER() OVER( Order By o.object_id ) As Num
From sys.objects as o
cross join sys.objects as o2
)
Insert dbo.Numbers( Value )
Select Num
From Nums
Where Num Between 1 And 10000
GO
Create Function [dbo].[udf_Split] (@DelimitedList nvarchar(max), @Delimiter nvarchar(2) = ',')
Returns @SplitResults TABLE (Position int NOT NULL PRIMARY KEY, Value nvarchar(max))
AS
/*
PURPOSE: to split the @DelimitedList based on the @Delimiter
DESIGN NOTES:
1. In general the contents of the next item is: NextDelimiterPosition - CurrentStartPosition
2. CurrentStartPosition =
CharIndex(@Delimiter, A.list, N.Value) = Current Delimiter position
+ Len(@Delimiter) + The number of delimiter characters
+ 1 + 1 since the text of the item starts after the delimiter
3. We need to calculate the delimiter length because the LEN function excludes trailing spaces. Thus
if a delimiter of ", " (a comma followed by a space) is used, the LEN function will return 1.
4. The DataLength function returns the number of bytes in the string. However, since we're using
an nvarchar for the delimiter, the number of bytes will double the number of characters.
*/
Begin
Declare @DelimiterLength int
Set @DelimiterLength = DataLength(@Delimiter) / 2
If Left(@DelimitedList, @DelimiterLength) <> @Delimiter
Set @DelimitedList = @Delimiter + @DelimitedList
If Right(@DelimitedList, @DelimiterLength) <> @Delimiter
Set @DelimitedList = @DelimitedList + @Delimiter
Insert @SplitResults(Position, Value)
Select CharIndex(@Delimiter, A.list, N.Value) + @DelimiterLength
, Substring (
A.List
, CharIndex(@Delimiter, A.list, N.Value) + @DelimiterLength
, CharIndex(@Delimiter, A.list, N.Value + 1)
- ( CharIndex(@Delimiter, A.list, N.Value) + @DelimiterLength )
)
From dbo.Numbers As N
Cross Join (Select @DelimitedList As list) As A
Where N.Value > 0
And N.Value < LEN(A.list)
And Substring(A.list, N.Value, @DelimiterLength) = @Delimiter
Order By N.Value
Return
End
You then might be able to run a query like so where you strip out the prefixes:
Select Table, Substring(S.Value, CharIndex(':', S.Value) + 1, Len(S.Value))
From Table
Cross Apply dbo.udf_Split(Table.ListColumn, '/') As S
This would give you values like:
Category1
Category2
Category3
You could then use FOR XML PATH to combine them again:
Select Table.PK
, Stuff( (
Select '/' + Substring(S1.Value, CharIndex(':', S1.Value) + 1, Len(S1.Value))
From Table As Table1
Cross Apply dbo.udf_Split(Table1.ListColumn, '/') As S1
Where Table1.PK = Table.PK
Order By S1.Position
For Xml Path('')
), 1, 1, '') As BreadCrumb
From Table
For SQL Server 2005+, you can get regex support by:
Enabling CLR (doesn't require instance restart)
Uploading your CLR functionality (in this case, regex replace)
Using native TSQL, you'll need to define REPLACE statements for everything you want to remove:
SELECT REPLACE(
REPLACE(
REPLACE('/ID1:Category1/ID2:Category2/ID3:Category3/', 'ID1:', ''),
'ID2:', ''),
'ID3:', '')
Regex or otherwise, you need to be sure these patterns don't appear in the actual data.
You can use SQL CLR. Here's an MSDN article:
declare @test1 nvarchar(max)
set @test1='/ID1:Category1/ID2:Category2/ID3:Category3/'
while(CHARINDEX('ID',@test1)<>0)
Begin
select @test1=REPLACE(@test1,SUBSTRING(@test1,CHARINDEX('ID',@test1),CHARINDEX(':',@test1)-
CHARINDEX('ID',@test1)+1),'')
End
select #test1