How to remove non alphanumeric characters in SQL without creating a function? - sql

I'm trying to remove non alphanumeric characters in multiple columns in a table, and have no permission to create functions nor temporary functions. I'm wonder whether anyone here have any experiences removing non alphanumeric characters without creating any functions at all? Thanks. I'm using MS SQL Server Management Studio v17.9.1

If you have to use a single SELECT query like #Forty3 mentioned then the multiple REPLACEs like #Gordon-Linoff said is probably best (but definitely not ideal).
If you can update the data or use T-SQL, then you could do something like this from https://searchsqlserver.techtarget.com/tip/Replacing-non-alphanumeric-characters-in-strings-using-T-SQL:
while ##rowcount > 0
update user_list_original
set fname = replace(fname, substring(fname, patindex('%[^a-zA-Z ]%', fname), 1), '')
where patindex('%[^a-zA-Z ]%', fname) <> 0

Here is a starting point - you will need to adjust it to accommodate all of the columns which require cleansing:
;WITH allcharcte ( id, textcol1, textcol2, textcol1where, textcol2where )
AS (SELECT id,
CAST(textcol1 AS NVARCHAR(255)),
CAST(textcol2 AS NVARCHAR(255)),
-- Start the process of looking for non-alphanumeric chars in each
-- of the text columns. The returned value from PATINDEX is the position
-- of the non-alphanumeric char and is stored in the *where columns
-- of the CTE.
PATINDEX(N'%[^0-9A-Z]%', textcol1),
PATINDEX(N'%[^0-9A-Z]%', textcol2)
FROM #temp
UNION ALL
-- This is the recursive part. It works through the rows which have been
-- returned thus far processing them for use in the next iteration
SELECT prev.id,
-- If the *where column relevant for each of the columns is NOT NULL
-- and NOT ZERO, then use the STUFF command to replace the char
-- at that location with an empty string
CASE ISNULL(prev.textcol1where, 0)
WHEN 0 THEN CAST(prev.textcol1 AS NVARCHAR(255))
ELSE CAST(STUFF(prev.textcol1, prev.textcol1where, 1, N'') AS NVARCHAR(255))
END,
CASE ISNULL(prev.textcol2where, 0)
WHEN 0 THEN CAST(prev.textcol2 AS NVARCHAR(255))
ELSE CAST(STUFF(prev.textcol2, prev.textcol2where, 1, N'') AS NVARCHAR(255))
END,
-- We now check for the existence of the next non-alphanumeric
-- character AFTER we replace the most recent finding
ISNULL(PATINDEX(N'%[^0-9A-Z]%', STUFF(prev.textcol1, prev.textcol1where, 1, N'')), 0),
ISNULL(PATINDEX(N'%[^0-9A-Z]%', STUFF(prev.textcol2, prev.textcol2where, 1, N'')), 0)
FROM allcharcte prev
WHERE ISNULL(prev.textcol1where, 0) > 0
OR ISNULL(prev.textcol2where, 0) > 0)
SELECT *
FROM allcharcte
WHERE textcol1where = 0
AND textcol2where = 0
Essentially, it is a recursive CTE which will repeatedly replace any non-alphanumeric character (found via the PATINDEX(N'%[^0-9A-Z]%', <column>)) with an empty string (via the STUFF(<column>, <where>, N'')). By replicating the blocks, you should be able to adapt it to any number of columns.
EDIT: If you anticipate having more than 100 instances of non-alphanumeric characters to strip out of any one column, you will need to adjust the MAXRECURSION property ahead of the call.

Related

How to ignore specific string value when using pattern and patindex function in SQL Server Query?

I have this query here.
WITH Cte_Reverse
AS (
SELECT CASE PATINDEX('%[^0-9.- ]%', REVERSE(EmailName))
WHEN 0
THEN REVERSE(EmailName)
ELSE left(REVERSE(EmailName), PATINDEX('%[^0-9.- ]%', REVERSE(EmailName)) - 1)
END AS Platform_Campaign_ID,
EmailName
FROM [Arrakis].[xtemp].[Stage_SendJobs_Marketing]
)
SELECT REVERSE(Platform_Campaign_ID) AS Platform_Campaign_ID, EmailName
FROM Cte_Reverse
WHERE REVERSE(Platform_Campaign_ID) <> '2020'
AND REVERSE(Platform_Campaign_ID) <> ''
AND LEN(REVERSE(Platform_Campaign_ID)) = 4;
It is working for the most part, below is a screenshot of the result set.
The query I posted above extracts the 4 numbers to the right out of the initial value that is set for the column I am extracting out of. But I am unable to figure out how I can also have the query ignore cases when the right most value is -v2, -v1, etc. essentially anything with -v and whatever number version it is.
If you want four digits, then one method is:
select substring(emailname, patindex('%[0-9][0-9][0-9][0-9]%', emailname), 4)

Update or replace a special character text in my database field

I am facing to an issue that I cannot update or replace some characters in my database.
Here is how this text look like in my column when I retrieve it:
As you can see, there is an unknown characters between 'master' and 'degree' which I cannot even paste it here.
I also tried to update and replace it with below code (I cannot paste that two vertical lines here since they are not supported in any browser and I am not sure what they are, Please see the picture above to see what is in my SQL statement).
begin transaction
update gm_desc set projdesc=replace(projdesc,'%â s%','') where projdesc like '%âs%' and proposalno = '15-149-01'
You can see the real SQL Statement here:
I tried to update, or replace it but I cannot do it. The update statement successfully works but I still see that weird special charters. I would be appreciate to help me.
Here's a scalar-valued function which removes all non-alphanumeric characters (preserves spaces) from a string.
Hopefully it helps!
dbfiddle
create function dbo.get_alphanumeric_str
(
#string varchar(max)
)
returns varchar(max)
as
begin
declare #ret varchar(max);
with nums as (
select 1 as n
union all select n+1 from nums
where n < 256
)
select #ret = replace(stuff(
(
select '' + substring(#string, nums.n, 1)
from nums
where patindex('%[^0-9A-Za-z ]%', substring(#string, nums.n,1)) = 0
for xml path('')
), 1, 0, ''
), ' ', ' ')
option (MAXRECURSION 256)
return #ret;
end
Usage
select dbo.get_alphanumeric_str('Helloᶄ âWorld 1234⅊⅐')
Returns: Hello World 1234
How it works
The nums CTE is just to get a list of numbers (you can set the maximum of 256 to a higher value if your strings are longer; n.b. option (MAXRECURSION n) is for this CTE but has to be placed at the query)
The stuff essentially iterates through the string, using the list of numbers above and extracts a substring of length 1; each of these chars are checked if they match the [^0-9A-Za-z ] regex group (0-9 all digits, A-Za-z all letters both lower and upper case, and a single space character)
If they match, patindex() should return 0; i.e. index zero.
Use replace(string, ' ', ' ') for the space character as the xml path returns a special encoding, see this question.
Use a binary collation for accented characters; see this answer

Simple Explanation for PATINDEX

I have have been reading up on PATINDEX attempting to understand what and why. I understand the when using the wildcards it will return an INT as to where that character(s) appears/starts. So:
SELECT PATINDEX('%b%', '123b') -- returns 4
However I am looking to see if someone can explain the reason as to why you would use this in a simple(ish) way. I have read some other forums but it just is not sinking in to be honest.
Are you asking for realistic use-cases? I can think of two, real-life use-cases that I've had at work where PATINDEX() was my best option.
I had to import a text-file and parse it for INSERT INTO later on. But these files sometimes had numbers in this format: 00000-59. If you try CAST('00000-59' AS INT) you'll get an error. So I needed code that would parse 00000-59 to -59 but also 00000159 to 159 etc. The - could be anywhere, or it could simply not be there at all. This is what I did:
DECLARE #my_var VARCHAR(255) = '00000-59', #my_int INT
SET #my_var = STUFF(#my_var, 1, PATINDEX('%[^0]%', #my_var)-1, '')
SET #my_int = CAST(#my_var AS INT)
[^0] in this case means "any character that isn't a 0". So PATINDEX() tells me when the 0's end, regardless of whether that's because of a - or a number.
The second use-case I've had was checking whether an IBAN number was correct. In order to do that, any letters in the IBAN need to be changed to a corresponding number (A=10, B=11, etc...). I did something like this (incomplete but you get the idea):
SET #i = PATINDEX('%[^0-9]%', #IBAN)
WHILE #i <> 0 BEGIN
SET #num = UNICODE(SUBSTRING(#IBAN, #i, 1))-55
SET #IBAN = STUFF(#IBAN, #i, 1, CAST(#num AS VARCHAR(2))
SET #i = PATINDEX('%[^0-9]%', #IBAN)
END
So again, I'm not concerned with finding exactly the letter A or B etc. I'm just finding anything that isn't a number and converting it.
PATINDEX is roughly equivalent to CHARINDEX except that it returns the position of a pattern instead of single character. Examples:
Check if a string contains at least one digit:
SELECT PATINDEX('%[0-9]%', 'Hello') -- 0
SELECT PATINDEX('%[0-9]%', 'H3110') -- 2
Extract numeric portion from a string:
SELECT SUBSTRING('12345', PATINDEX('%[0-9]%', '12345'), 100) -- 12345
SELECT SUBSTRING('x2345', PATINDEX('%[0-9]%', 'x2345'), 100) -- 2345
SELECT SUBSTRING('xx345', PATINDEX('%[0-9]%', 'xx345'), 100) -- 345
Quoted from PATINDEX (Transact-SQL)
The following example uses % and _ wildcards to find the position at
which the pattern 'en', followed by any one character and 'ure' starts
in the specified string (index starts at 1):
SELECT PATINDEX('%en_ure%', 'please ensure the door is locked');
Here is the result set.
8
You'd use the PATINDEX function when you want to know at which character position a pattern begins in an expression of a valid text or character data type.

Selecting Strings With Alphabetized Characters - In SQL Server 2008 R2

This is a recreational pursuit, and is not homework. If you value academic challenges, please read on.
A radio quiz show had a segment requesting listeners to call in with words that have their characters in alphabetical order, e.g. "aim", "abbot", "celt", "deft", etc. I got these few examples by a quick Notepad++ (NPP) inspection of a Scrabble dictionary word list.
I'm looking for an elegant way in T-SQL to determine if a word qulifies for the list, i.e. all its letters are in alpha order, case insensitive.
It seemed to me that there should be some kind of T-SQL algorithm possible that will do a SELECT on a table of English words and return the complete list of all words in the Srcabble dictionary that meets the spec. I've spent considerable time looking at regex strings, but haven't hit on anything that comes even remotely close. I've thought about the obvious looping scenario, but abandoned it for now as "inelegant". I'm looking for your ideas that will obtain the qualifying word list,
preferably using
- a REGEX expression
- a tally-table-based approach
- a scalar UDF that returns 1 if the input word meets the requirement, else 0.
- Other, only limited by your creativity.
But preferably NOT using
- a looping structure
- a recursive solution
- a CLR solution
Assumptions/observations:
1. A "word" is defined here as two or more characters. My dictionary shows 55 2-character words, of which only 28 qualify.
2. No word will have more than two concecutive characters that are identical. (If you find one, please point it out.)
3. At 21 characters, "electroencephalograms" is the longest word in my Scrabble dictionary
(though why that word is in the Scrabble dictionary escapes me--the board is only a 15-by-15 grid.)
Consider 21 as the upper limit on word length.
4. All words LIKE 'Z%' can be dismissed because all you can create is {'Z','ZZ', ... , 'ZZZ...Z'}.
5. As the dictionary's words' initial character proceedes through the alphabet, fewer words will qualify.
6. As the word lengths get longer, fewer words will qualify.
7. I suspect that there will be less than 0.2% of my dictionary's 60,387 words that will qualify.
For example, I've tried NPP regex searches like "^a[a-z][b-z][b-z][c-z][c-z][d-z][d-z][e-z]" for 9-letter words starting with "a", but the character-by-character alphabetic enforcement is not handled properly. This search will return "abilities" which fails the test with the "i" that follows the "l".
There's several free Scrabble word lists available on the web, but Phil Factor gives a really interesting treatment of T-SQL/Scrabble considerations at https://www.simple-talk.com/sql/t-sql-programming/the-sql-of-scrabble-and-rapping/ which is where I got my word list.
Care to give it a shot?
Split the word into individual characters using a numbers table. Use the numbers as one set of indices. Use ROW_NUMBER to create another set. Compare the two sets of indices to see if they match for every character to see if they match. If they do, the letters in the word are in the alphabetical order.
DECLARE #Word varchar(100) = 'abbot';
WITH indexed AS (
SELECT
Index1 = n.Number,
Index2 = ROW_NUMBER() OVER (ORDER BY x.Letter, n.Number),
x.Letter
FROM
dbo.Numbers AS n
CROSS APPLY
(SELECT SUBSTRING(#Word, n.Number, 1)) AS x (Letter)
WHERE
n.Number BETWEEN 1 AND LEN(#Word)
)
SELECT
Conclusion = CASE COUNT(NULLIF(Index1, Index2))
WHEN 0 THEN 'Alphabetical'
ELSE 'Not alphabetical'
END
FROM
indexed
;
The NULLIF(Index, Index2) expression does the comparison: it returns a NULL if the the arguments are equal, otherwise it returns the value of Index1. If all indices match, all the results will be NULL and COUNT will return 0, which means the order of letters in the word was alphabetical.
I did something similar to Andriy. I created a numbers table with value 1-21. I use it to create one set of data with the individual letters order by the index and the a second set ordered alphabetically. Joined the sets to each other on the letter and numbers. I then count nulls. Anything over 0 means it is not in order.
DECLARE #word VARCHAR(21)
SET #word = 'abbot'
SELECT Count(1)
FROM (SELECT Substring(#word, number, 1) AS Letter,
Row_number() OVER ( ORDER BY number) AS letterNum
FROM numbers
WHERE number <= CONVERT(INT, Len(#word))) a
LEFT OUTER JOIN (SELECT Substring(#word, number, 1) AS letter,
Row_number() OVER ( ORDER BY Substring(#word, number, 1)) AS letterNum
FROM numbers
WHERE number <= CONVERT(INT, Len(#word))) b
ON a.letternum = b.letternum
AND a.letter = b.letter
WHERE b.letter IS NULL
Interesting idea...
Here's my take on it. This returns a list of words that are in order, but you could easily return 1 instead.
DECLARE #WORDS TABLE (VAL VARCHAR(MAX))
INSERT INTO #WORDS (VAL)
VALUES ('AIM'), ('ABBOT'), ('CELT'), ('DAVID')
;WITH CHARS
AS
(
SELECT VAL AS SOURCEWORD, UPPER(VAL) AS EVALWORD, ASCII(LEFT(UPPER(VAL),1)) AS ASCIICODE, RIGHT(VAL,LEN(UPPER(VAL))-1) AS REMAINS, 1 AS ROWID, 1 AS INORDER, LEN(VAL) AS WORDLENGTH
FROM #WORDS
UNION ALL
SELECT SOURCEWORD, REMAINS, ASCII(LEFT(REMAINS,1)), RIGHT(REMAINS,LEN(REMAINS)-1), ROWID+1, INORDER+CASE WHEN ASCII(LEFT(REMAINS,1)) >= ASCIICODE THEN 1 ELSE 0 END AS INORDER, WORDLENGTH
FROM CHARS
WHERE LEN(REMAINS)>=1
),
ONLYINORDER
AS
(
SELECT *
FROM CHARS
WHERE ROWID=WORDLENGTH AND INORDER=WORDLENGTH
)
SELECT SOURCEWORD
FROM ONLYINORDER
Here it is as a UDF:
CREATE FUNCTION dbo.AlphabetSoup (#Word VARCHAR(MAX))
RETURNS BIT
AS
BEGIN
SET #WORD = UPPER(#WORD)
DECLARE #RESULT INT
;WITH CHARS
AS
(
SELECT #WORD AS SOURCEWORD,
#WORD AS EVALWORD,
ASCII(LEFT(#WORD,1)) AS ASCIICODE,
RIGHT(#WORD,LEN(#WORD)-1) AS REMAINS,
1 AS ROWID,
1 AS INORDER,
LEN(#WORD) AS WORDLENGTH
UNION ALL
SELECT SOURCEWORD,
REMAINS,
ASCII(LEFT(REMAINS,1)),
RIGHT(REMAINS,LEN(REMAINS)-1),
ROWID+1,
INORDER+CASE WHEN ASCII(LEFT(REMAINS,1)) >= ASCIICODE THEN 1 ELSE 0 END AS INORDER,
WORDLENGTH
FROM CHARS
WHERE LEN(REMAINS)>=1
),
ONLYINORDER
AS
(
SELECT 1 AS RESULT
FROM CHARS
WHERE ROWID=WORDLENGTH AND INORDER=WORDLENGTH
UNION
SELECT 0
FROM CHARS
WHERE NOT (ROWID=WORDLENGTH AND INORDER=WORDLENGTH)
)
SELECT #RESULT = RESULT FROM ONLYINORDER
RETURN #RESULT
END

Remove leading zeros

Given data in a column which look like this:
00001 00
00026 00
I need to use SQL to remove anything after the space and all leading zeros from the values so that the final output will be:
1
26
How can I best do this?
Btw I'm using DB2
This was tested on DB2 for Linux/Unix/Windows and z/OS.
You can use the LOCATE() function in DB2 to find the character position of the first space in a string, and then send that to SUBSTR() as the end location (minus one) to get only the first number of the string. Casting to INT will get rid of the leading zeros, but if you need it in string form, you can CAST again to CHAR.
SELECT CAST(SUBSTR(col, 1, LOCATE(' ', col) - 1) AS INT)
FROM tab
In DB2 (Express-C 9.7.5) you can use the SQL standard TRIM() function:
db2 => CREATE TABLE tbl (vc VARCHAR(64))
DB20000I The SQL command completed successfully.
db2 => INSERT INTO tbl (vc) VALUES ('00001 00'), ('00026 00')
DB20000I The SQL command completed successfully.
db2 => SELECT TRIM(TRIM('0' FROM vc)) AS trimmed FROM tbl
TRIMMED
----------------------------------------------------------------
1
26
2 record(s) selected.
The inner TRIM() removes leading and trailing zero characters, while the outer trim removes spaces.
This worked for me on the AS400 DB2.
The "L" stands for Leading.
You can also use "T" for Trailing.
I am assuming the field type is currently VARCHAR, do you need to store things other than INTs?
If the field type was INT, they would be removed automatically.
Alternatively, to select the values:
SELECT (CAST(CAST Col1 AS int) AS varchar) AS Col1
I found this thread for some reason and find it odd that no one actually answered the question. It seems that the goal is to return a left adjusted field:
SELECT
TRIM(L '0' FROM SUBSTR(trim(col) || ' ',1,LOCATE(' ',trim(col) || ' ') - 1))
FROM tab
One option is implicit casting: SELECT SUBSTR(column, 1, 5) + 0 AS column_as_number ...
That assumes that the structure is nnnnn nn, ie exactly 5 characters, a space and two more characters.
Explicit casting, ie SUBSTR(column,1,5)::INT is also a possibility, but exact syntax depends on the RDBMS in question.
Use the following to achieve this when the space location is variable, or even when it's fixed and you want to make a more robust query (in case it moves later):
SELECT CAST(SUBSTR(LTRIM('00123 45'), 1, CASE WHEN LOCATE(' ', LTRIM('00123 45')) <= 1 THEN LEN('00123 45') ELSE LOCATE(' ', LTRIM('00123 45')) - 1 END) AS BIGINT)
If you know the column will always contain a blank space after the start:
SELECT CAST(LOCATE(LTRIM('00123 45'), 1, LOCATE(' ', LTRIM('00123 45')) - 1) AS BIGINT)
both of these result in:
123
so your query would
SELECT CAST(SUBSTR(LTRIM(myCol1), 1, CASE WHEN LOCATE(' ', LTRIM(myCol1)) <= 1 THEN LEN(myCol1) ELSE LOCATE(' ', LTRIM(myCol1)) - 1 END) AS BIGINT)
FROM myTable1
This removes any content after the first space character (ignoring leading spaces), and then converts the remainder to a 64bit integer which will then remove all leading zeroes.
If you want to keep all the numbers and just remove the leading zeroes and any spaces you can use:
SELECT CAST(REPLACE('00123 45', ' ', '') AS BIGINT)
While my answer might seem quite verbose compared to simply SELECT CAST(SUBSTR(myCol1, 1, 5) AS BIGINT) FROM myTable1 but it allows for the space character to not always be there, situations where the myCol1 value is not of the form nnnnn nn if the string is nn nn then the convert to int will fail.
Remember to be careful if you use the TRIM function to remove the leading zeroes, and actually in all situations you will need to test your code with data like 00120 00 and see if it returns 12 instead of the correct value of 120.