How to replace all special characters in string - sql

I have a table with the following columns:
dbo.SomeInfo
- Id
- Name
- InfoCode
Now I need to update the above table's InfoCode as
Update dbo.SomeInfo
Set InfoCode= REPLACE(Replace(RTRIM(LOWER(Name)),' ','-'),':','')
This replaces all spaces with - and lowercases the name.
When I do check the InfoCode, I see there are Names with some special characters like
Cathe Friedrich''s Low Impact
coffeyfit-cardio-box-&-burn
Jillian Michaels: Cardio
Then I am manually writing the update sql against this as
Update dbo.SomeInfo
SET InfoCode= 'cathe-friedrichs-low-impact'
where Name ='Cathe Friedrich''s Low Impact '
Now, this solution is not realistic for me. I checked the following links related to Regex & others around it.
UPDATE and REPLACE part of a string
https://www.codeproject.com/Questions/456246/replace-special-characters-in-sql
But none of them is hitting the requirement.
What I need is this: any character other than [a-z0-9] should be replaced with -, and there should be no consecutive -- in InfoCode.
The above Update sql has set some values of InfoCode as the-dancer's-workout®----starter-package
Some Names have value as
Sleek Technique™
The Dancer's-workout®
How can I write Update sql that could handle all such special characters?

Using NGrams8K you could split the string into characters and then rather than replacing every non-acceptable character, retain only certain ones:
SELECT (SELECT '' + CASE WHEN N.token COLLATE Latin1_General_BIN LIKE '[A-Za-z0-9]' THEN token ELSE '-' END
FROM dbo.NGrams8k(V.S,1) N
ORDER BY position
FOR XML PATH(''))
FROM (VALUES('Sleek Technique™'),('The Dancer''s-workout®'))V(S);
I use COLLATE here as on my default collation in my instance the '™' is ignored, therefore I use a binary collation. You may want to use COLLATE to switch the string back to its original collation outside of the subquery.

This approach is fully inlinable:
First we need a mock-up table with some test data:
DECLARE @SomeInfo TABLE (Id INT IDENTITY, InfoCode VARCHAR(100));
INSERT INTO @SomeInfo (InfoCode) VALUES
('Cathe Friedrich''s Low Impact')
,('coffeyfit-cardio-box-&-burn')
,('Jillian Michaels: Cardio')
,('Sleek Technique™')
,('The Dancer''s-workout®');
--This is the query
WITH cte AS
(
SELECT 1 AS position
,si.Id
,LOWER(si.InfoCode) AS SourceText
,SUBSTRING(LOWER(si.InfoCode),1,1) AS OneChar
FROM @SomeInfo si
UNION ALL
SELECT cte.position +1
,cte.Id
,cte.SourceText
,SUBSTRING(LOWER(cte.SourceText),cte.position+1,1) AS OneChar
FROM cte
WHERE position < DATALENGTH(SourceText)
)
,Cleaned AS
(
SELECT cte.Id
,(
SELECT CASE WHEN ASCII(cte2.OneChar) BETWEEN 65 AND 90 --A-Z
OR ASCII(cte2.OneChar) BETWEEN 97 AND 122--a-z
OR ASCII(cte2.OneChar) BETWEEN 48 AND 57 --0-9
--You can easily add more ranges
THEN cte2.OneChar ELSE '-'
--You can easily nest another CASE to deal with special characters like the single quote in your examples...
END
FROM cte AS cte2
WHERE cte2.Id=cte.Id
ORDER BY cte2.position
FOR XML PATH('')
) AS normalised
FROM cte
GROUP BY cte.Id
)
,NoDoubleHyphens AS
(
SELECT REPLACE(REPLACE(REPLACE(normalised,'-','<>'),'><',''),'<>','-') AS normalised2
FROM Cleaned
)
SELECT CASE WHEN RIGHT(normalised2,1)='-' THEN SUBSTRING(normalised2,1,LEN(normalised2)-1) ELSE normalised2 END AS FinalResult
FROM NoDoubleHyphens;
The first CTE recursively (well, rather iteratively) traverses the string character by character and returns a very slim set with one row per character.
The second CTE then groups by Id. This allows for a correlated sub-query, where the actual check is performed using ASCII ranges. FOR XML PATH('') is used to re-concatenate the string. With SQL Server 2017+ I'd suggest using STRING_AGG() instead.
The third CTE uses a well-known trick to get rid of multiple occurrences of a character. Take any two characters which will never occur in your string; I use < and >. A string like a--b---c will come back as a<><>b<><><>c. After replacing >< with nothing we get a<>b<>c, and the last REPLACE turns each remaining <> back into a single hyphen.
The final SELECT cuts away a trailing hyphen. If needed, you can add similar logic to get rid of a leading hyphen. With v2017+ there is TRIM('-' FROM ...) to make this easier.
The result
cathe-friedrich-s-low-impact
coffeyfit-cardio-box-burn
jillian-michaels-cardio
sleek-technique
the-dancer-s-workout
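As a quick check of the hyphen-collapsing step on its own, here is a minimal sketch that applies the same three nested REPLACE calls to the a--b---c example from the explanation. The variable name @s is only for illustration, and the trick assumes that < and > never occur in the input.

```sql
-- Hypothetical test value; any string without '<' or '>' works.
DECLARE @s VARCHAR(100) = 'a--b---c';

SELECT REPLACE(                      -- 3) each remaining '<>' becomes one hyphen
         REPLACE(                    -- 2) drop every '><' between adjacent pairs
           REPLACE(@s, '-', '<>')    -- 1) every hyphen becomes '<>'
         , '><', '')
       , '<>', '-') AS collapsed;    -- a-b-c
```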

You can create a User-Defined-Function for something like that.
Then use the UDF in the update.
CREATE FUNCTION [dbo].LowerDashString (@str varchar(255))
RETURNS varchar(255)
AS
BEGIN
DECLARE @result varchar(255);
DECLARE @chr varchar(1);
DECLARE @pos int;
SET @result = '';
SET @pos = 1;
-- lowercase the input and remove the single-quotes
SET @str = REPLACE(LOWER(@str),'''','');
-- loop through the characters
-- while replacing anything that's not a letter or digit with a dash
WHILE @pos <= LEN(@str)
BEGIN
SET @chr = SUBSTRING(@str, @pos, 1)
IF @chr LIKE '[a-z0-9]' SET @result += @chr;
ELSE SET @result += '-';
SET @pos += 1;
END;
-- SET @result = TRIM('-' FROM @result); -- SqlServer 2017 and beyond
-- multiple dashes to one dash
WHILE @result LIKE '%--%' SET @result = REPLACE(@result,'--','-');
RETURN @result;
END;
GO
Example snippet using the function:
-- using a table variable for demonstration purposes
declare @SomeInfo table (Id int primary key identity(1,1) not null, InfoCode varchar(100) not null);
-- sample data
insert into @SomeInfo (InfoCode) values
('Cathe Friedrich''s Low Impact'),
('coffeyfit-cardio-box-&-burn'),
('Jillian Michaels: Cardio'),
('Sleek Technique™'),
('The Dancer''s-workout®');
update @SomeInfo
set InfoCode = dbo.LowerDashString(InfoCode)
where (InfoCode LIKE '%[^A-Z-]%' OR InfoCode != LOWER(InfoCode));
select *
from @SomeInfo;
Result:
Id InfoCode
-- -----------------------------
1 cathe-friedrichs-low-impact
2 coffeyfit-cardio-box-burn
3 jillian-michaels-cardio
4 sleek-technique-
5 the-dancers-workout-

Related

How to select specific row in SQL from a bad designed schema?

I have a string in a column of a db schema I did not design, like this:
numbers column
--------------------
First: 1,2,33,34,43,5
Second: 1,2,3,4,5
Although I know this is not best practice, I still want to select the row whose list contains the value '3' itself, not '33', '34' or '43'.
How could I select only second row?
SELECT *
FROM tblNumbers
WHERE numbers like '%,3,%' OR numbers like '3,%' OR numbers like '%,3'
This query selected both rows. How can I get just the second row?
Thanks.
You should be storing the values in a separate table, with one row per column and per number.
Sometimes, though, we are stuck with other people's bad data structures. If so, you can do what you want in this rather cumbersome way:
where replace(replace(numbers, '{', ','), '}', ',') like '%,3,%'
That is, put the delimiters around all the numbers in numbers.
Let me repeat, though: the proper way to store this data is using a separate table. If you need to store multiple values in a column like this, then do some research on XML and JSON formats (which are supported only in the most recent version of SQL Server).
EDIT:
Exactly the same idea applies, the code is just simpler:
where ',' + numbers + ',' like '%,3,%'
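A minimal sketch of the delimiter-wrapping idea, using a table variable as a stand-in for tblNumbers (the label column and the third and fourth rows are made up for illustration). Wrapping the column in commas means a 3 at the start, in the middle, or at the end of the list all look like ,3, to LIKE. It assumes the lists contain no spaces around the commas.

```sql
-- Stand-in for tblNumbers; 'label' and the last two rows are hypothetical.
DECLARE @tblNumbers TABLE (label VARCHAR(10), numbers VARCHAR(100));
INSERT INTO @tblNumbers (label, numbers) VALUES
 ('First',  '1,2,33,34,43,5')
,('Second', '1,2,3,4,5')
,('Third',  '3')
,('Fourth', '13,3');

SELECT label, numbers
FROM @tblNumbers
WHERE ',' + numbers + ',' LIKE '%,3,%';
-- returns Second, Third and Fourth, but not First
```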
Did you try it like this?
SELECT *
FROM tblNumbers
WHERE number = '3' OR ReportedGMY = '3'
if you are storing numbers as integers
SELECT *
FROM tblNumbers
WHERE number = '3'
if you are storing as string
SELECT *
FROM tblNumbers
WHERE number like '3'
It is bad practice to save comma-separated values in a column, and it should be avoided as much as possible. If you really need to do it, it can be done using a user-defined function.
CREATE FUNCTION dbo.HasDigit (@String VARCHAR(MAX), @DigitToCheck INT, @Delimiter VARCHAR(10))
RETURNS BIT
AS
BEGIN
DECLARE @DelimiterPosition INT
DECLARE @Digit INT
DECLARE @ContainsDigit BIT = 0
-- append a trailing delimiter so the last value in the list is checked too
SET @String = @String + @Delimiter
WHILE CHARINDEX(@Delimiter, @String) > 0
BEGIN
SELECT @DelimiterPosition = CHARINDEX(@Delimiter, @String)
SELECT @Digit = CAST(SUBSTRING(@String, 1, @DelimiterPosition - 1) AS INT)
IF(@Digit = @DigitToCheck)
BEGIN
SET @ContainsDigit = 1
END
SELECT @String = SUBSTRING(@String, @DelimiterPosition + 1, LEN(@String) - @DelimiterPosition)
END
RETURN @ContainsDigit
END;
GO
CREATE TABLE TEST (
Numbers VARCHAR(MAX),
COLUMNNAME VARCHAR(MAX)
)
GO
INSERT INTO TEST VALUES('First:', '1,2,33,34,43,5')
INSERT INTO TEST VALUES('Second:', ' 1,2,3,4,5')
GO
SELECT * FROM TEST WHERE dbo.HasDigit(COLUMNNAME, 3, ',') = 1
Output:
--Numbers COLUMNNAME
--------- ----------------
--Second: 1,2,3,4,5
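On SQL Server 2016 or later (database compatibility level 130 or higher), the built-in STRING_SPLIT function can stand in for a hand-written loop. This is an alternative sketch rather than the function above; the table variable and its column names are placeholders.

```sql
-- Placeholder table; column names are chosen for readability only.
DECLARE @Test TABLE (Label VARCHAR(20), Numbers VARCHAR(MAX));
INSERT INTO @Test (Label, Numbers) VALUES
 ('First:',  '1,2,33,34,43,5')
,('Second:', ' 1,2,3,4,5');

SELECT t.Label, t.Numbers
FROM @Test AS t
WHERE EXISTS (
    SELECT 1
    FROM STRING_SPLIT(t.Numbers, ',') AS s
    -- TRY_CAST tolerates stray spaces and yields NULL for non-numeric items
    WHERE TRY_CAST(s.value AS INT) = 3
);
-- returns only the 'Second:' row
```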

SQL Server 2012: Remove text from end of string

I'm new to SQL so please forgive me if I use incorrect terminology and my question sounds confused.
I've been tasked with writing a stored procedure which will be sent 3 variables as strings (varchar I think). I need to take two of the variables and remove text from the end of the variable and only from the end.
The strings/text I need to remove from the end of the variables are
co
corp
corporation
company
lp
llc
ltd
limited
For example this string
Global Widgets LLC
would become
Global Widgets
However it should only apply once so
Global Widgets Corporation LLC
Should become
Global Widgets Corporation
I then need to use the altered variables to do a SQL query.
This is to be used as a backup for an integration piece we have which makes a callout to another system. The other system takes the same variables and uses Regex to remove the strings from the end of variables.
I've tried different combinations of PATINDEX, SUBSTRING, REPLACE, STUFF but cannot seem to come up with something that will do the job.
===============================================================
Edit: I want to thank everyone for the answers provided so far, but I left out some information that I didn't think was important but judging by the answers seems like it would affect the processing.
My proc will start something like
ALTER PROC [dbo].[USP_MyDatabaseTable] @variableToBeAltered nvarchar(50)
AS
I will then need to remove all , and . characters. I've already figured out how to do this. I will then need to do the processing on @variableToBeAltered (technically there will be two variables) to remove the strings I listed previously. I must then remove all spaces from @variableToBeAltered. (Again, I figured that part out.) Then finally I will use @variableToBeAltered in my SQL query, something like
SELECT [field1] AS myField
,[field2] AS myOtherField
FROM [MyData].[dbo].[MyDatabaseTable]
WHERE [field1] = (@variableToBeAltered);
I hope this information is more useful.
I'd keep all of your suffixes in a table to make this a little easier. You can then perform code like this either within a query or against a variable.
DECLARE @company_name VARCHAR(50) = 'Global Widgets Corporation LLC'
DECLARE @Suffixes TABLE (suffix VARCHAR(20))
INSERT INTO @Suffixes (suffix) VALUES ('LLC'), ('CO'), ('CORP'), ('CORPORATION'), ('COMPANY'), ('LP'), ('LTD'), ('LIMITED')
SELECT @company_name = SUBSTRING(@company_name, 1, LEN(@company_name) - LEN(suffix))
FROM @Suffixes
WHERE @company_name LIKE '%' + suffix
SELECT @company_name
The keys here are that you are only matching with strings that end in the suffix and it uses SUBSTRING rather than REPLACE to avoid accidentally removing copies of any of the suffixes from the middle of the string.
The @Suffixes table is a table variable here, but it makes more sense for you to just create it and fill it as a permanent table.
The query will just find the one row (if any) that matches its suffix with the end of your string. If a match is found then the variable will be set to a substring with the length of the suffix removed from the end. There will usually be a trailing space left behind; SQL Server ignores trailing spaces in = comparisons and in LEN(), but wrap the result in RTRIM() if you need it gone.
There are still a couple of potential issues to be aware of though...
First, if you have a company name like "Watco" then the "co" would be a false positive here. I'm not sure what can be done about that other than maybe making your suffixes include a leading space.
Second, if one suffix ends with one of your other suffixes then the ordering that they get applied could be a problem. You could get around this by only applying the row with the greatest length for suffix, but it gets a little more complicated, so I've left that out for now.
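Both caveats can be sketched in one query: requiring a space before the suffix avoids the "Watco" false positive, and TOP (1) with ORDER BY LEN(suffix) DESC applies only the longest matching suffix. This is an illustrative variation on the query above, not a tested drop-in; it assumes a case-insensitive collation (the SQL Server default) and that , and . have already been stripped as the question describes.

```sql
DECLARE @company_name VARCHAR(50) = 'Global Widgets Corporation LLC';
DECLARE @Suffixes TABLE (suffix VARCHAR(20));
INSERT INTO @Suffixes (suffix) VALUES
 ('LLC'), ('CO'), ('CORP'), ('CORPORATION'), ('COMPANY'), ('LP'), ('LTD'), ('LIMITED');

-- '% ' + suffix only matches when the suffix is a separate last word.
-- The extra -1 removes the space in front of the suffix as well.
SELECT TOP (1)
       @company_name = LEFT(@company_name, LEN(@company_name) - LEN(suffix) - 1)
FROM @Suffixes
WHERE @company_name LIKE '% ' + suffix
ORDER BY LEN(suffix) DESC;

SELECT @company_name AS cleaned;  -- Global Widgets Corporation
```

If no suffix matches (for example 'Watco'), the SELECT assigns nothing and the variable keeps its original value.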
Building on the answer given by Tom H, but applying across the entire table:
set nocount on;
declare @suffixes table(tag nvarchar(20));
insert into @suffixes values('co');
insert into @suffixes values('corp');
insert into @suffixes values('corporation');
insert into @suffixes values('company');
insert into @suffixes values('lp');
insert into @suffixes values('llc');
insert into @suffixes values('ltd');
insert into @suffixes values('limited');
declare @companynames table(entry nvarchar(100),processed bit default 0);
insert into @companynames values('somecompany llc',0);
insert into @companynames values('business2 co',0);
insert into @companynames values('business3',0);
insert into @companynames values('business4 lpx',0);
while exists(select * from @companynames where processed = 0)
begin
declare @currentcompanyname nvarchar(100) = (select top 1 entry from @companynames where processed = 0);
update @companynames set processed = 1 where entry = @currentcompanyname;
update @companynames
set entry = SUBSTRING(entry, 1, LEN(entry) - LEN(tag))
from @suffixes
where entry like '%' + tag
end
select * from @companynames
You can use a query like below:
-- Assuming that you can maintain all patterns in a table or a temp table
CREATE TABLE tbl(pattern varchar(100))
INSERT INTO tbl values
('co'),('llc'),('beta')
--@a stores the string you need to manipulate, @lw & @b are variables to aid
DECLARE @a nvarchar(100), @b nvarchar(100), @lw varchar(100)
SET @a='alpha beta gamma'
SET @b=''
-- @t is a flag
DECLARE @t int
SET @t=0
-- Below is a loop
WHILE(@t=0 OR LEN(@a)=0 )
BEGIN
-- Store the current last word in the @lw variable
SET @lw=reverse(substring(reverse(@a),1, charindex(' ', reverse(@a)) -1))
-- check if the word is in pattern dictionary. If yes, then Voila!
SELECT @t=1 FROM tbl WHERE @lw like pattern
-- remove the last word from @a
SET @a=LEFT(@a,LEN(@a)-LEN(@lw))
IF (@t<>1)
BEGIN
-- all words which were not pattern are joined back onto this stack
SET @b=CONCAT(@lw,@b)
END
END
-- get back the remaining word
SET @a=CONCAT(@a,@b)
SELECT @a
drop table tbl
Do note that this method overcomes Tom's problem of
if you have a company name like "Watco" then the "co" would be a false positive here. I'm not sure what can be done about that other than maybe making your suffixes include a leading space.
Use the REPLACE function in SQL Server 2012:
declare @var1 nvarchar(20) = 'ACME LLC'
declare @var2 nvarchar(20) = 'LLC'
SELECT CASE
WHEN ((PATINDEX('%'+@var2+'%',@var1) <= (LEN(@var1)-LEN(@var2)))
Or (SUBSTRING(@var1,PATINDEX('%'+@var2+'%',@var1)-1,1) <> SPACE(1)))
THEN @var1
ELSE
REPLACE(@var1,@var2,'')
END
Here is another way to overcome the 'Runco Co' situation.
declare @var1 nvarchar(20) = REVERSE('Runco Co')
declare @var2 nvarchar(20) = REVERSE('Co')
Select REVERSE(
CASE WHEN(CHARINDEX(' ',@var1) > LEN(@var2)) THEN
SUBSTRING(@var1,PATINDEX('%'+@var2+'%',@var1)+LEN(@var2),LEN(@var1)-LEN(@var2))
ELSE
@var1
END
)

LIKE operator for sequence of Numbers

I am trying to use wildcard expression to fetch data related to a sequence of numbers. Can I know how to use a series of numbers inside wildcard expression LIKE [0-10].
here is my query:
select grade from table where grade LIKE [1-12]?
output: is 1 and 2
I referred to t-SQL book and they talk about LIKE N[1-12]. What's the difference between LIKE [1-12] and N[1-12]?
I can use between 1 and 12 to fetch my data. But I am just curious how to use a wildcard for series of numbers with LIKE operator?
In SQL Server, like has three wildcards. Underscore '_' represents any single character. % represents zero or more characters. And square brackets.
The expression between the square brackets represents one single character. So,
x like '[abc]'
matches "a", "b", or "c" -- and nothing else. The following matches any digit:
x like '[0123456789]'
This, however, starts to get cumbersome to type out. So, SQL Server offers the shorthand:
x like '[0-9]'
This just means any character from the range starting with 0 and ending at 9.
You could match any hex character with:
x like '[0-9ABCDEF]'
So, additional characters are allowed in the range.
When you write
x like '[1-12]'
You are saying x like the range of characters from 1 to 1, plus the character 2. This is more easily written as:
x like '[12]'
In any case, you shouldn't store numeric values as strings, and you shouldn't use like on numbers. It is much better to write:
grade between 1 and 12
Or something like that.
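To see the difference in practice, here is a small sketch against a made-up table variable (@grades and its rows are illustrative only). It also covers the N prefix the question asks about: N only marks the literal as Unicode (nvarchar) and does not change how the bracket expression is read.

```sql
-- Illustrative data; grades stored as strings, as in the question.
DECLARE @grades TABLE (grade VARCHAR(2));
INSERT INTO @grades (grade) VALUES ('1'), ('2'), ('3'), ('10'), ('12');

-- '[1-12]' is ONE character: the range 1..1 plus the literal 2
SELECT grade FROM @grades WHERE grade LIKE '[1-12]';   -- rows 1 and 2
SELECT grade FROM @grades WHERE grade LIKE N'[1-12]';  -- same rows; N only makes the pattern nvarchar

-- Numeric comparison instead of pattern matching
SELECT grade FROM @grades
WHERE TRY_CAST(grade AS INT) BETWEEN 1 AND 12;         -- all five rows
```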
But if you already have a column with a sequence of numbers and don't know the size, what I've done was this function:
CREATE FUNCTION Keep_Only_Int (@X VARCHAR(MAX)) RETURNS BIGINT AS BEGIN
IF @X IS NULL RETURN NULL
DECLARE @T AS INT = LEN(@X), @I AS INT = 0, @J AS CHAR(1), @RET AS VARCHAR(50) = ''
WHILE @I < @T BEGIN
SET @I += 1
SET @J = SUBSTRING(@X, @I, 1)
IF ASCII(@J) BETWEEN 48 AND 57 --Numbers; the ASCII check is needed because ¹, ² and ³ would return true in a LIKE
SET @RET += @J
END
IF LEN(@RET) > 19 RETURN NULL --Bigger than bigint
RETURN NULLIF(@RET, '')
END
An example of usage:
create table #a (content varchar(100))
insert #a values ('My number is 123, whatever')
insert #a values ('My number is 1234, whatever')
insert #a values ('My number is ¹²³4, whatever') --> Special numbers
insert #a values ('My number is one, whatever') --> No number
insert #a values ('My number is 1234567890123456789, whatever')
insert #a values ('My number is 12345678901234567890, whatever')--> This is too big!
select *
, dbo.Keep_Only_Int(content)
from #a
The function already converts the field to BIGINT, so you can use a BETWEEN predicate:
select *
from #a
where dbo.Keep_Only_Int(content) between 1 and 2000
It is not optimized for performance; if your table is very large, I'd recommend writing dedicated code for it.

Replacing characters in a string based on rows in a table sql

I need to replace a list of characters in a string with some mapped characters.
I have a table 'dbo.CharacterMappings' with 2 columns: 'CharacterToFilter' and 'ReplacementCharacter'.
Say that there are 3 records in this table:
Filter Replacement
$ s
# a
0 o
How would I replace all of the filter characters in a string based on these mappings?
i.e. 'Hell0 c#t$' needs to become 'Hello cats'.
I cant really think of any way of doing this without resorting to a table variable and then looping through it. I.e. have a table variable with a 'count' column then use a loop to select 1 row at a time based on this column. Then I can use the REPLACE function to update the characters one at a time.
Edit: I should note that I always want to strip out these characters (I don't need to worry about $5 -> s5 for example).
declare @s varchar(50)= 'Hell0 c#t$'
select @s = REPLACE(@s, CharacterToFilter, ReplacementCharacter)
from CharacterMappings
select @s
You could create a function:
CREATE FUNCTION [dbo].[ReplaceAll]
(
@text varchar(8000)
)
RETURNS VARCHAR(8000)
AS
BEGIN
-- column names as given in the question
SELECT @text =
REPLACE(@text, cm.CharacterToFilter, cm.ReplacementCharacter)
FROM CharacterMappings cm;
RETURN @text
END
Then this
select dbo.[ReplaceAll]('Hell0 c#t$');
returns Hello cats
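On SQL Server 2017 or later, the built-in TRANSLATE function offers another route for single-character mappings like these. The sketch below is a different technique from the REPLACE-per-row approach above: it builds the two character lists with STRING_AGG and applies them in one pass. The table variable stands in for dbo.CharacterMappings, and the CHAR(1) column types and primary key are assumptions.

```sql
-- Stand-in for dbo.CharacterMappings; types and key are assumed.
DECLARE @CharacterMappings TABLE (
    CharacterToFilter    CHAR(1) PRIMARY KEY,
    ReplacementCharacter CHAR(1) NOT NULL
);
INSERT INTO @CharacterMappings VALUES ('$', 's'), ('#', 'a'), ('0', 'o');

DECLARE @s VARCHAR(50) = 'Hell0 c#t$';
DECLARE @from VARCHAR(100), @to VARCHAR(100);

-- Both lists use the same ORDER BY so position i in @from maps to position i in @to.
SELECT @from = STRING_AGG(CharacterToFilter, '')    WITHIN GROUP (ORDER BY CharacterToFilter),
       @to   = STRING_AGG(ReplacementCharacter, '') WITHIN GROUP (ORDER BY CharacterToFilter)
FROM @CharacterMappings;

SELECT TRANSLATE(@s, @from, @to) AS cleaned;  -- Hello cats
```

Unlike chained REPLACE calls, TRANSLATE replaces each character at most once, so a replacement character that is itself in the filter list is not replaced again. It only handles one-to-one character mappings, and an empty mapping table leaves @from NULL, which makes the result NULL.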

Table variable row limitation?

I have in my application a user defined function which takes a comma separated list as an argument. It splits the items and plugs them in to a table variable and returns the result.
This function works well, except that when the items in the comma separated list exceed 1000, it ignores the remainder. That is to say, if I plug in 1239, the first 1000 rows will be returned and the remaining 239 are entirely ignored. There are no errors when this occurs.
I can't help but feel that this is due to some sort of limitation that I should know about, but I can't seem to find any information about it. Is it a limitation on the amount of rows that can be stored in a table variable? Or am I missing something in the actual code itself? Can anyone assist? Going squirrely-eyed over here.
ALTER FUNCTION [dbo].[ufnConvertArrayToIntTable] (@IntArray VARCHAR(8000))
RETURNS @retIntTable TABLE
(
ID int
)
AS
BEGIN
DECLARE @Delimiter char(1)
SET @Delimiter = ','
DECLARE @Item varchar(8)
IF CHARINDEX(@Delimiter,@IntArray,0) <> 0
BEGIN
WHILE CHARINDEX(@Delimiter,@IntArray,0) <> 0
BEGIN
SELECT
@Item = RTRIM(LTRIM(SUBSTRING(@IntArray,1,CHARINDEX(@Delimiter,@IntArray,0)-1))),
@IntArray = RTRIM(LTRIM(SUBSTRING(@IntArray,CHARINDEX(@Delimiter,@IntArray,0)+1,LEN(@IntArray))))
IF LEN(@Item) > 0
INSERT INTO @retIntTable SELECT @Item
END
IF LEN(@IntArray) > 0
INSERT INTO @retIntTable SELECT @IntArray
END
ELSE
BEGIN
IF LEN(@IntArray) > 0
INSERT INTO @retIntTable SELECT @IntArray
END
RETURN
END;
You define your input variable as varchar(8000) and your @Item variable is varchar(8). Are your items typically 8 characters each? Is the string you send in with over 1000 items more than 8000 characters? Try changing your input to varchar(max) instead.
Are all of your comma-separated values 8 chars long? If so, then the input parameter will only be able to hold 888 of them (8000 / 9, including the comma).
It's because your input parameter is limited to 8000 characters.
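The silent truncation can be reproduced without the function. This sketch assumes items of 8 characters plus a comma, as discussed above; the variable names are illustrative. Assigning a longer string to a VARCHAR(8000) variable cuts it off without any error, which is the same thing that happens to the @IntArray parameter.

```sql
-- 1239 items of 8 digits plus a comma each: 1239 * 9 = 11151 characters
DECLARE @list VARCHAR(MAX) = REPLICATE(CAST('12345678,' AS VARCHAR(MAX)), 1239);

-- Same declared type as the function's @IntArray parameter
DECLARE @asParameter VARCHAR(8000) = @list;

SELECT LEN(@list)            AS full_length,      -- 11151
       LEN(@asParameter)     AS received_length,  -- 8000
       LEN(@asParameter) / 9 AS whole_items;      -- 888
```

Declaring the parameter as VARCHAR(MAX) removes the 8000-character ceiling.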
You might try calling the function using substring... Maybe:
WHERE
[myField] IN(Select ID from [dbo].[ufnConvertArrayToIntTable](substring(@inputarray, 1, 4000)))
OR
[myField] IN(Select ID from [dbo].[ufnConvertArrayToIntTable](substring(@inputarray, 4001, 8000)))
...