Loop through words in a string - sql

I've seen a lot of javascript stack questions around this, but nothing specifically related to SQL Server.
What I need to do is accept a string value and do a loop over the string multiple times to get back all the individual words so I can do calculations on them.
A rough idea is shown below. Anyone know how this can be achieved?
Declare #String nvarchar(50) = 'Mary had a little lamb'
Declare #word nvarchar(50)
Start Loop 1 to 5
Set #word = 'Mary' (1)
Set #word = 'had' (2)
Set #word = 'a' (3)
Set #word = 'little' (4)
Set #word = 'lamb' (5)
End Loop

This looks to do the job.
DECLARE #tags NVARCHAR(400) = 'mary had a little lamb'
SELECT value
FROM STRING_SPLIT(#tags, ' ')

You can split the parameter and write it in a temp table, and then use the values from the temp table. Using loops in SQL is not the best idea as they really affects your query performance.
Declare #String nvarchar(50) = 'Mary had a little lamb'
INSERT INTO #tempTable(colName)
SELECT colName
FROM dbo.Split(#String , ' ')

Related

SQL Server update statement performance

I have problem in optimizing an SQL query to do some data cleansing.
In fact, I have a table which is a sort of referential of a multiple special characters and word. Let's call it ABNORMAL(ID,PATTERN)
I have also another table INDIVIDUALS containing a column (NAME) which I want to clean by removing from it all characters that exist in the table ABNORMAL.
Currently, I have tried to use update statements, but I'm not sure if there is a better way to do this.
Approach one
Use a while loop to build a replace containing all characters from ABNORMALS by a blank '' and do one update using the built-in REPLACE
DECLARE #REPLACE_EXPRESSION nvarchar(max) ='REPLACE(NAME,'''','''')'
DECLARE #i int = 1
DECLARE #nbr int = (SELECT COUNT(*) FROM ABNORMAL)
-- CURRENT_CHARAC
DECLARE #CURRENT_CHARAC nvarchar(max)
-- NEW REPLACE EXPRESSION TO IMBRICATE INTO THE REPLACE EXPRESSION VARIABLE
DECLARE #CURR_REP NVARCHAR(max)
-- STRING TO BUILD AN SQL QUERY CONTAINING THE REPLACE EXPRESSION
DECLARE #UPDATE_QUERY nvarchar(max)
WHILE #i < #nbr
BEGIN
SELECT #CURRENT_CHARAC=PATTERN FROM CLEANSING_STG_PRISM_FRA_REF_UNSIGNIFICANT_VALUES WHERE ID_PATTERN=#i ;
SET #REPLACE_EXPRESSION = REPLACE(#REPLACE_EXPRESSION ,'NAME','REPLACE(NAME,'+''''+#CURRENT_CHARAC+''''+','''')')
set #i=#i+1
END
SET #UPDATE_QUERY = 'UPDATE INDIVIDUAL SET NAME ='+ #REPLACE_EXPRESSION
EXEC sp_executesql #UPDATE_QUERY
Approach two
Use a while loop to select every character in abnormal and do an update using replace containing the characters to remove:
DECLARE #i int = 1
DECLARE #nbr int = (SELECT COUNT(*) FROM ABNORMAL)
-- CURRENT_CHARAC
DECLARE #CURRENT_CHARAC nvarchar(max)
-- STRING TO BUILD AN SQL QUERY CONTAINING THE REPLACE EXPRESSION
DECLARE #UPDATE_QUERY nvarchar(max)
WHILE #i < #nbr
BEGIN
SELECT #CURRENT_CHARAC=PATTERN FROM CLEANSING_STG_PRISM_FRA_REF_UNSIGNIFICANT_VALUES WHERE ID_PATTERN=#i ;
UPDATE INDIVIDUAL
SET NAME = REPLACE(NAME,#CURRENT_CHARAC,'')
SET #i=#i+1
END
I already tested both approaches for 2 millions records, and I found that the first approach is faster than the second. I would know if you have already done something similar and new (better) ideas to try.
If you are using SQL Server 2017 you could use TRANSLATE and avoid dynamic SQL:
SELECT i.*
, REPLACE(TRANSLATE(i.NAME, f, REPLICATE('!', s.l)), '!', '') AS cleansed
FROM INDIVIDUALS i
OUTER APPLY (SELECT STRING_AGG(PATTERN, '') AS f
,LEN(STRING_AGG(PATTERN,'')) AS l
FROM ABNORMAL) AS s
DBFiddle Demo
Anyway 1st approach is better becasue you do one UPDATE, with second approach you remove characters one character at time (so you will have multiple UPDATE).
I would also track transaction log growth with both approaches.
If there's not too many characters that to be cleaned, then this trick might work.
Basically, you build 1 big update statement with a replace for each value in the table with the characters to be removed.
Example code:
Test data (using temp tables)
create table #ABNORMAL_CHARACTERS (id int identity(1,1), chr varchar(30));
insert into #ABNORMAL_CHARACTERS (chr) values ('!'),('&'),('#');
create table #INDIVIDUAL (id int identity(1,1), name varchar(30));
insert into #INDIVIDUAL (name) values ('test 1 &'),('test !&#2'),('test 3');
Code:
declare #FieldName varchar(30) = 'name';
declare #Replaces varchar(max) = #FieldName;
declare #UpdateSQL varchar(max);
select #Replaces = concat('replace('+#Replaces+', ', ''''+chr+''','''')') from #ABNORMAL_CHARACTERS order by id;
set #UpdateSQL = 'update #INDIVIDUAL
set name = '+#Replaces + '
where exists (select 1 from #ABNORMAL_CHARACTERS where charindex(chr,name)>0)';
exec (#UpdateSQL);
select * from #INDIVIDUAL;
A test here on rextester
And if you would have a UDF that can do a regex replace.
For example here
Then the #Replaces variable could be simplified with only 1 RegexReplace function and a pattern.

Creating multiple UDFs in one batch - SQL Server

I'm asking this question for SQL Server 2008 R2
I'd like to know if there is a way to create multiple functions in a single batch statement.
I've made the following code as an example; suppose I want to take a character string and rearrange its letters in alphabetical order. So, 'Hello' would become 'eHllo'
CREATE FUNCTION char_split (#string varchar(max))
RETURNS #characters TABLE
(
chars varchar(2)
)
AS
BEGIN
DECLARE #length int,
#K int
SET #length = len(#string)
SET #K = 1
WHILE #K < #length+1
BEGIN
INSERT INTO #characters
SELECT SUBSTRING(#string,#K,1)
SET #K = #K+1
END
RETURN
END
CREATE FUNCTION rearrange (#string varchar(max))
RETURNS varchar(max)
AS
BEGIN
DECLARE #SplitData TABLE (
chars varchar(2)
)
INSERT INTO #SplitData SELECT * FROM char_split(#string)
DECLARE #Output varchar(max)
SELECT #Output = coalesce(#Output,' ') + cast(chars as varchar(10))
from #SplitData
order by chars asc
RETURN #Output
END
declare #string varchar(max)
set #string = 'Hello'
select dbo.rearrange(#string)
When I try running this code, I get this error:
'CREATE FUNCTION' must be the first statement in a query batch.
I tried enclosing each function in a BEGIN END block, but no luck. Any advice?
Just use a GO statement between the definition of the UDFs
Not doable. SImple like that.
YOu can make it is one statement using a GO between them.
But as the GO is a batch delimiter.... this means you send multiple batches, which is explicitly NOT Wanted in your question.
So, no - it is not possible to do that in one batch as the error clearly indicates.

4000 character limit in LIKE statement

I have been getting an error in a previously working stored procedure called by an SSRS report and I have traced it down to a LIKE statement in a scalar function that is called by the stored procedure, in combination with a 7000+ NVARCHAR(MAX) string. It is something similar to:
Msg 8152, Level 16, State 10, Line 14
String or binary data would be truncated.
I can reproduce it with the following code:
DECLARE #name1 NVARCHAR(MAX) = ''
DECLARE #name2 NVARCHAR(MAX) = ''
DECLARE #count INT = 4001
WHILE #count > 0
BEGIN
SET #name1 = #name1 + 'a'
SET #name2 = #name2 + 'a'
SET #count = #count - 1
END
SELECT LEN(#name1)
IF #name1 LIKE #name2
PRINT 'OK'
What's the deal? Is there anyway around this limitation, or is it there for good reason? Thanks.
You can also reproduce it without the terrible loop:
DECLARE #name1 NVARCHAR(MAX), #name2 NVARCHAR(MAX);
SET #name1 = REPLICATE(CONVERT(NVARCHAR(MAX), N'a'), 4000);
SET #name2 = #name1;
IF #name1 LIKE #name2
PRINT 'OK';
SELECT #name1 += N'a', #name2 += N'a';
IF #name1 LIKE #name2
PRINT 'OK';
Result:
OK
Msg 8152, Level 16, State 10, Line 30
String or binary data would be truncated.
In any case, the reason is clearly stated in the documentation for LIKE (emphasis mine):
match_expression [ NOT ] LIKE pattern [ ESCAPE escape_character ]
...
pattern
Is the specific string of characters to search for in match_expression, and can include the following valid wildcard characters. pattern can be a maximum of 8,000 bytes.
And 8,000 bytes is used up by 4,000 Unicode characters.
I would suggest that comparing the first 4,000 characters is probably sufficient:
WHERE column LIKE LEFT(#param, 4000) + '%';
I can't envision any scenario where you want to compare the whole thing; how many strings contain the same first 4000 characters but then character 4001 is different? If that really is a requirement, I guess you could go to the great lengths identified in the Connect item David pointed out.
A simpler (though probably much more computationally expensive) workaround might be:
IF CONVERT(VARBINARY(MAX), #name1) = CONVERT(VARBINARY(MAX), #name2)
PRINT 'OK';
I suggest that it would be far better to fix the design and stop identifying rows by comparing large strings. Is there really no other way to identify the row you're after? This is like finding your car in the parking lot by testing the DNA of all the Dunkin Donuts cups in all the cup holders, rather than just checking the license plate.
I have the same problem right now, and I do believe my situation -where you want to compare two strings with more than 4000 characters- is a possible situation :-).
In my situation, I'm collecting a lot of data from different tables in a NVARCHAR(MAX) field in a specific table, to be able to search on that data using FullText. Keeping that table in sync, is done using the MERGE statement, converting everything to NVARCHAR(MAX).
So my MERGE statement would look like this:
MERGE MyFullTextTable AS target
USING (
SELECT --Various stuff from various tables, casting it as NVARCHAR(MAX)
...
) AS source (IndexColumn, FullTextColumn)
ON (target.IndexColumn = source.IndexColumn)
WHEN MATCHED AND source.FullTextColumn NOT LIKE target.FullTextColumn THEN
UPDATE SET FullTextColumn = source.FullTextColumn
WHEN NOT MATCHED THEN
INSERT (IndexColumn, FullTextColumn)
VALUES (source.IndexColumn, source.FullTextColumn)
OUTPUT -- Some stuff
This would produce errors because of the LIKE comparison when the FullText-data is bigger than 4000 characters.
So I created a function that does the comparison. Allthough it's not bullet proof, it works for me. You could also split data in blocks of 4000 characters, and compare each block, but for me (for now) comparing the first 4000 characters in combination with the length, is enough ...
So the Merge-statement would look like:
MERGE MyFullTextTable AS target
USING (
SELECT --Various stuff from various tables, casting it as NVARCHAR(MAX)
...
) AS source (IndexColumn, FullTextColumn)
ON (target.IndexColumn = source.IndexColumn)
WHEN MATCHED AND udfCompareTwoTexts(source.FullTextColumn, target.FullTextColumn) = 1 THEN
UPDATE SET FullTextColumn = source.FullTextColumn
WHEN NOT MATCHED THEN
INSERT (IndexColumn, FullTextColumn)
VALUES (source.IndexColumn, source.FullTextColumn)
OUTPUT -- Some stuff
And the function looks like:
ALTER FUNCTION udfCompareTwoTexts
(
#Value1 AS NVARCHAR(MAX),
#Value2 AS NVARCHAR(MAX)
)
RETURNS BIT
AS
BEGIN
DECLARE #ReturnValue AS BIT = 0
IF LEN(#Value1) > 4000 OR LEN(#Value2) > 4000
BEGIN
IF LEN(#Value1) = LEN(#Value2) AND LEFT(#Value1, 4000) LIKE LEFT(#Value2, 4000)
SET #ReturnValue = 1
ELSE
SET #ReturnValue = 0
END
ELSE
BEGIN
IF #Value1 LIKE #Value2
SET #ReturnValue = 1
ELSE
SET #ReturnValue = 0
END
RETURN #ReturnValue;
END
GO

Compare comma delimited strings in SQL

I have search requests that come in a CDL ("1,2,3,4"),("1,5"). I need to compare that to another CDL and return back all records that have a match. The kicker is the position of each number isn't always the same.
I've got something almost working except for instances where I'm trying to match ("2,5") to ("2,4,5"). Obviously the strings aren't equal but I need to return that match, because it has all the values in the first CDL.
My SQL Fiddle should make it fairly clear...
Any help would be much appreciated.
Oh and I saw this one is similar, but that seems a little drastic and over my head but I'll see if I can try to understand it.
Edit
So I just did a replace to change ("2,5") to ("%2%5%") and changed the were to use LIKE. From what I can initially tell it seems to be working.. SQL Fiddle Any reason I shouldn't do this or maybe I'm crazy and it doesn't work at all?
Just one step further, get closer to your answer.
SQL FIDDLE DEMO
SELECT P.*
FROM Product P
CROSS APPLY(
SELECT *
FROM ShelfSearch SS
WHERE Patindex(char(37)+replace(ss.shelflist, ',',char(37))+char(37),p.shelflist) > 0
)Shelfs
You can convert the lists to a table with the following function:
CREATE FUNCTION DelimitedStringToTable (#cdl varchar(8000), #delimiter varchar(1))
RETURNS #list table (
Token varchar(1000)
)
AS
BEGIN
DECLARE #token varchar(1000)
DECLARE #position int
SET #cdl = #cdl + #delimiter -- tack on delimiter to end
SET #position = CHARINDEX(#delimiter, #cdl, 1)
WHILE #position > 0
BEGIN
SET #token = LEFT(#cdl, #position - 1)
INSERT INTO #list (Token) VALUES (#token)
SET #cdl = RIGHT(#cdl, DATALENGTH(#cdl) - #position)
SET #position = CHARINDEX(#delimiter, #cdl, 1)
END
RETURN
END
Then you can use something like this to find all the matches:
SELECT list1.*
FROM DelimitedStringToTable('1,2,3', ',') list1
INNER JOIN DelimitedStringToTable('2', ',') list2 ON list1.Token = list2.Token

SQL Server: How do you remove punctuation from a field?

Any one know a good way to remove punctuation from a field in SQL Server?
I'm thinking
UPDATE tblMyTable SET FieldName = REPLACE(REPLACE(REPLACE(FieldName,',',''),'.',''),'''' ,'')
but it seems a bit tedious when I intend on removing a large number of different characters for example: !##$%^&*()<>:"
Thanks in advance
Ideally, you would do this in an application language such as C# + LINQ as mentioned above.
If you wanted to do it purely in T-SQL though, one way make things neater would be to firstly create a table that held all the punctuation you wanted to removed.
CREATE TABLE Punctuation
(
Symbol VARCHAR(1) NOT NULL
)
INSERT INTO Punctuation (Symbol) VALUES('''')
INSERT INTO Punctuation (Symbol) VALUES('-')
INSERT INTO Punctuation (Symbol) VALUES('.')
Next, you could create a function in SQL to remove all the punctuation symbols from an input string.
CREATE FUNCTION dbo.fn_RemovePunctuation
(
#InputString VARCHAR(500)
)
RETURNS VARCHAR(500)
AS
BEGIN
SELECT
#InputString = REPLACE(#InputString, P.Symbol, '')
FROM
Punctuation P
RETURN #InputString
END
GO
Then you can just call the function in your UPDATE statement
UPDATE tblMyTable SET FieldName = dbo.fn_RemovePunctuation(FieldName)
I wanted to avoid creating a table and wanted to remove everything except letters and digits.
DECLARE #p int
DECLARE #Result Varchar(250)
DECLARE #BadChars Varchar(12)
SELECT #BadChars = '%[^a-z0-9]%'
-- to leave spaces - SELECT #BadChars = '%[^a-z0-9] %'
SET #Result = #InStr
SET #P =PatIndex(#BadChars,#Result)
WHILE #p > 0 BEGIN
SELECT #Result = Left(#Result,#p-1) + Substring(#Result,#p+1,250)
SET #P =PatIndex(#BadChars,#Result)
END
I am proposing 2 solutions
Solution 1: Make a noise table and replace the noises with blank spaces
e.g.
DECLARE #String VARCHAR(MAX)
DECLARE #Noise TABLE(Noise VARCHAR(100),ReplaceChars VARCHAR(10))
SET #String = 'hello! how * > are % u (: . I am ok :). Oh nice!'
INSERT INTO #Noise(Noise,ReplaceChars)
SELECT '!',SPACE(1) UNION ALL SELECT '#',SPACE(1) UNION ALL
SELECT '#',SPACE(1) UNION ALL SELECT '$',SPACE(1) UNION ALL
SELECT '%',SPACE(1) UNION ALL SELECT '^',SPACE(1) UNION ALL
SELECT '&',SPACE(1) UNION ALL SELECT '*',SPACE(1) UNION ALL
SELECT '(',SPACE(1) UNION ALL SELECT ')',SPACE(1) UNION ALL
SELECT '{',SPACE(1) UNION ALL SELECT '}',SPACE(1) UNION ALL
SELECT '<',SPACE(1) UNION ALL SELECT '>',SPACE(1) UNION ALL
SELECT ':',SPACE(1)
SELECT #String = REPLACE(#String, Noise, ReplaceChars) FROM #Noise
SELECT #String Data
Solution 2: With a number table
DECLARE #String VARCHAR(MAX)
SET #String = 'hello! & how * > are % u (: . I am ok :). Oh nice!'
;with numbercte as
(
select 1 as rn
union all
select rn+1 from numbercte where rn<LEN(#String)
)
select REPLACE(FilteredData,' ',SPACE(1)) Data from
(select SUBSTRING(#String,rn,1)
from numbercte
where SUBSTRING(#String,rn,1) not in('!','*','>','<','%','(',')',':','!','&','#','#','$')
for xml path(''))X(FilteredData)
Output(Both the cases)
Data
hello how are u . I am ok . Oh nice
Note- I have just put some of the noises. You may need to put the noises that u need.
Hope this helps
You can use regular expressions in SQL Server - here is an article based on SQL 2005:
http://msdn.microsoft.com/en-us/magazine/cc163473.aspx
I'd wrap it in a simple scalar UDF so all string cleaning is in one place if it's needed again.
Then you can use it on INSERT too...
I took Ken MC's solution and made it into an function which can replace all punctuation with a given string:
----------------------------------------------------------------------------------------------------------------
-- This function replaces all punctuation in the given string with the "replaceWith" string
----------------------------------------------------------------------------------------------------------------
IF object_id('[dbo].[fnReplacePunctuation]') IS NOT NULL
BEGIN
DROP FUNCTION [dbo].[fnReplacePunctuation];
END;
GO
CREATE FUNCTION [dbo].[fnReplacePunctuation] (#string NVARCHAR(MAX), #replaceWith NVARCHAR(max))
RETURNS NVARCHAR(MAX)
BEGIN
DECLARE #Result Varchar(max) = #string;
DECLARE #BadChars Varchar(12) = '%[^a-z0-9]%'; -- to leave spaces - SELECT #BadChars = '%[^a-z0-9] %'
DECLARE #p int = PatIndex(#BadChars,#Result);
DECLARE #searchFrom INT;
DECLARE #indexOfPunct INT = #p;
WHILE #indexOfPunct > 0 BEGIN
SET #searchFrom = LEN(#Result) - #p;
SET #Result = Left(#Result, #p-1) + #replaceWith + Substring(#Result, #p+1,LEN(#Result));
SET #IndexOfPunct = PatIndex(#BadChars, substring(#Result, (LEN(#Result) - #SearchFrom)+1, LEN(#Result)));
SET #p = (LEN(#Result) - #searchFrom) + #indexOfPunct;
END
RETURN #Result;
END;
GO
-- example:
SELECT dbo.fnReplacePunctuation('This is, only, a tést-really..', '');
Output:
Thisisonlyatéstreally
If it's a one-off thing, I would use a C# + LINQ snippet in LINQPad to do the job with regular expressions.
It is quick and easy and you don't have to go through the process of setting up a CLR stored procedure and then cleaning up after yourself.
Can't you use PATINDEX to only include NUMBERS and LETTERS instead of trying to guess what punctuation might be in the field? (Not trying to be snarky, if I had the code ready, I'd share it...but this is what I'm looking for).
Seems like you need to create a custom function in order to avoid a giant list of replace functions in your queries - here's a good example:
http://www.codeproject.com/KB/database/SQLPhoneNumbersPart_2.aspx?display=Print