SQL Server String parsing for special characters - sql

I need a solution (a T-SQL function or procedure) that parses a SQL Server varchar(max) and eliminates all special characters and accents.
The resulting string will be written to a CSV file by an AWK script that breaks on special characters such as '&', '%' and '\', and on accented characters that turn into unknown characters on conversion (like the ç in français). That is why I need this parser.
Thank you

You can try this:
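CREATE TABLE dbo.Bad_ASCII_Characters (ascii_char CHAR(1) NOT NULL)
DECLARE @i INT
SET @i = 1
-- Collect every character except space, 0-9, A-Z and a-z
WHILE @i <= 255
BEGIN
IF (@i <> 32) AND
(@i NOT BETWEEN 48 AND 57) AND
(@i NOT BETWEEN 65 AND 90) AND
(@i NOT BETWEEN 97 AND 122)
BEGIN
INSERT INTO dbo.Bad_ASCII_Characters (ascii_char) VALUES(CHAR(@i))
END
SET @i = @i + 1
END
DECLARE @row_count INT
SET @row_count = 1
-- Each pass removes one bad character per row, so repeat until no row changes
WHILE (@row_count > 0)
BEGIN
UPDATE T
-- Binary collation so that removing an accented letter never removes its plain counterpart
SET my_column = REPLACE(T.my_column COLLATE Latin1_General_BIN, BAC.ascii_char, '')
FROM My_Table T
INNER JOIN dbo.Bad_ASCII_Characters BAC ON
-- CHARINDEX rather than LIKE: '%', '_' and '[' are in the table and LIKE would treat them as wildcards
CHARINDEX(BAC.ascii_char, T.my_column COLLATE Latin1_General_BIN) > 0
SET @row_count = @@ROWCOUNT
END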
I haven't tested it, so you might need to tweak it a bit. You can either generate the table on the fly each time, or you can leave it out there and if your requirements change slightly (for example, you find some characters that it will parse correctly) then you can just change the data in the table.
The WHILE loop around the update is in case some columns contain multiple special characters. If your table is very large you might see some performance issues here.
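For a single string rather than a whole table, the same idea can be sketched with PATINDEX and STUFF. This is only a minimal sketch: the variable names and the sample value are made up for illustration, and the binary collation is an assumption so that the a-z range is compared by character code (under many non-binary collations, accented letters such as ç sort between a and z and would be kept).

```sql
-- Hypothetical example: @s and @p are illustrative names, not part of the answer above
DECLARE @s VARCHAR(MAX), @p INT
SET @s = 'fran' + CHAR(231) + 'ais & 100%\x'   -- CHAR(231) is ç on a code page 1252 database

-- Position of the first character that is not a letter, digit or space (0 if none)
SET @p = PATINDEX('%[^a-zA-Z0-9 ]%', @s COLLATE Latin1_General_BIN)
WHILE @p > 0
BEGIN
    SET @s = STUFF(@s, @p, 1, '')               -- delete that one character
    SET @p = PATINDEX('%[^a-zA-Z0-9 ]%', @s COLLATE Latin1_General_BIN)
END

SELECT @s AS cleaned
```

With this sample the loop removes the ç, the '&', the '%' and the '\' and leaves the letters, digits and spaces in place. The same loop could be wrapped in a scalar function if it needs to be called per row.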

If I got you right:
SELECT REPLACE('abc&de','&','_')

Related

Replace every alpha character with itself + wildcard in string SQL Server

My goal is to create a query that will search for results related to a specific keyword.
Say in a database we had the word cat.
Regardless of whether the user types C a t, C.A.T. or Cat, I want to find a result related to the search. All that matters is that the alphanumeric characters are in the correct sequence.
Say in the database we have these 4 records
cat
c/a/t
c.a.t
c. at
If the user types in C#$*(&A T I'd like to get all 4 results.
What I have written so far in my query is a function that strips any non-alphanumeric characters from the input string.
What can I do to replace each alphanumeric character with itself and add a wildcard at the end?
For every alpha character my input would look similar to this
C%[^a-zA-Z0-9]%A%[^a-zA-Z0-9]%T%[^a-zA-Z0-9]%
Actually, that search string will return only one record from this table: the row with 'c.a.t '.
This is because the expression C%[^a-zA-Z0-9]%A does not mean there can't be any alpha-numeric chars between C and A.
What it actually means is there should be at least one non alpha-numeric value between C and A.
Moreover, it will return incorrect values as well - a value like 'c u a s e t ' will be returned.
You need to change your where clause to something like this:
WHERE column LIKE '%C%A%T%'
AND column NOT LIKE '%C%[a-zA-Z0-9]%A%[a-zA-Z0-9]%T%'
This way, if you have cat in the correct order, the first row will resolve to true, and if there are no other alpha-numeric chars between c, a, and t the second row will resolve to true.
Here is a test script, where you can see for yourself what I mean:
DECLARE @T AS TABLE
(
a varchar(20)
)
INSERT INTO @T VALUES
('cat'),
('c/a/t'),
('c.a.t '),
('c. at'),
('c u a s e t ')
-- Incorrect where clause
SELECT *
FROM @T
WHERE a LIKE 'C%[^a-zA-Z0-9]%A%[^a-zA-Z0-9]%T%[^a-zA-Z0-9]%'
-- correct where clause
SELECT *
FROM @T
WHERE a LIKE '%C%A%T%'
AND a NOT LIKE '%C%[a-zA-Z0-9]%A%[a-zA-Z0-9]%T%'
You can also see it in action in this link.
And since I had some spare time, here is a script to create both the like and the not like patterns from the input string:
DECLARE @Input varchar(100) = '#*# c %^&# a ^&*$&* t (*&(%!##$'
DECLARE @Index int = 1,
@CurrentChar char(1),
@Like varchar(100),
@NotLike varchar(100) = '%'
-- <= so that the last character of the input is examined too
WHILE @Index <= LEN(@Input)
BEGIN
SET @CurrentChar = SUBSTRING(@Input, @Index, 1)
-- Keep only alphanumeric characters, each followed by the "another alphanumeric" pattern
IF PATINDEX('%[^a-zA-Z0-9]%', @CurrentChar) = 0
BEGIN
SET @NotLike = @NotLike + @CurrentChar + '%[a-zA-Z0-9]%'
END
SET @Index = @Index + 1
END
-- Drop the last '[a-zA-Z0-9]%' (12 characters), keeping the '%' in front of it
SELECT @NotLike = LEFT(@NotLike, LEN(@NotLike) - 12),
@Like = REPLACE(@NotLike, '%[a-zA-Z0-9]%', '%')
SELECT *
FROM @T
WHERE a LIKE @Like
AND a NOT LIKE @NotLike
You can loop through your (cleaned) search string and append the expression you would like to each letter. In my example @builtString should be what you would like to use further on, if I understood correctly.
declare @cleanSearch as nvarchar(10) = 'CAT'
declare @builtString as nvarchar(100) = ''
WHILE LEN(@cleanSearch) > 0 -- loop until you deplete the search string
BEGIN
SET @builtString = @builtString + substring(@cleanSearch,1,1) + '%[^a-zA-Z0-9]%' -- append the letter plus the LIKE pattern
SET @cleanSearch = right(@cleanSearch, len(@cleanSearch) - 1) -- remove first letter of the search string
END
SELECT @builtString --will look like C%[^a-zA-Z0-9]%A%[^a-zA-Z0-9]%T%[^a-zA-Z0-9]%
SELECT @cleanSearch --@cleanSearch is now empty

How to replace non-standard Unicode with Space or Tab using SQL

I have a SQL2008 R2 database with several fields having the data type [IMAGE] the values in the field are actually BLOBs representing varied formats of mostly text. The Binary Data is created by HP’s Service Manager where they are used internally to populate tables and arrays in the GUI. I am using BIRT (4.2) the Eclipse-based reporting tool, to harvest data and create reports.
While it is possible to convert the IMAGE to table arrays, performance issues preclude that in many cases. I am trying to create a fully SQL-based solution to translate and dissect the IMAGE into readable, usable text. The characters I care about are mostly within the first 127 Unicode code points, and all within the first 255. There is a bunch of garbage outside of this range that is presumably used for formatting in the GUI.
I am looking for a SQL solution that replaces values outside of basic Unicode (127 or 255) with a space or tab. My attempts to use replace() failed, as it only seems to recognize the basic Unicode characters. My ideal solution would replace blocks of garbage outside of a given Unicode range with a single tab (and be as simple as the existing solutions below).
I have one solution that converts it to a string with some garbage left in it.
select
-- Raw is an image, limited options for cast, so cast it as varbinary
-- Default characters converted is 30 so set to (8000)
-- then cast varbinary to varchar (so a person can read it)
-- substring ignores the first 9 characters after casting
substring (cast (cast (Table.a as varbinary (8000))as varchar(8000)), 9, 7991)as 'SubstringCastCast'
from dbo.Table
I have a screenshot of the data preview but insufficient reputation to post it, and it does not transfer well via copy and paste.
I have another solution where I find and extract the one piece that I need (i.e. IM0012001234)
select
-- Extract the 12 digit ticket number
substring (CastCast,
-- Find start of Ticket number
charindex('IM',CastCast)
, 12) as 'ETicket'
--Create data set with string that contains ticket, so I can extract it above
from(
select
-- use cast to get a small data set with the ticket number in it
cast (cast (Table.a as varbinary (200))as varchar(200)) as 'CastCast'
from dbo.Table
)InnerQ
I have written a function that strips out anything other than A-Z a-z 0-9... maybe this can help (tweak to suit your needs, you can put in ELSE ' ' to put in a space where the characters are unrecognised):
CREATE FUNCTION [dbo].[StripPunctuation]
(
@String VARCHAR(255)
)
RETURNS VARCHAR(255) AS
/*
$ Description: Strips out all non alpha-numeric
$ characters from a string
$
*/
BEGIN
DECLARE @i INT
DECLARE @Char CHAR(1)
DECLARE @Wk VARCHAR(255)
-- Only copy 0-9, a-z, A-Z.
SET @Wk = ''
SET @i = 1
WHILE @i <= LEN(@String)
BEGIN
SET @Char = SUBSTRING(@String, @i, 1)
IF (ASCII(@Char) > 47) AND (ASCII(@Char) < 58)
SET @Wk = @Wk + @Char
IF (ASCII(@Char) > 64) AND (ASCII(@Char) < 91)
SET @Wk = @Wk + @Char
IF (ASCII(@Char) > 96) AND (ASCII(@Char) < 123)
SET @Wk = @Wk + @Char
SET @i = @i +1
END
RETURN @Wk
END
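Since the question asks for each block of garbage to become a single tab, the function above could be adapted along the following lines. This is a hedged sketch rather than a drop-in solution: the function name dbo.CollapseNonPrintable is hypothetical, "readable" is assumed to mean the printable ASCII range 32 to 126, and a single-byte code page is assumed so that DATALENGTH equals the character count.

```sql
-- Hypothetical helper; the name, signature and the 32-126 range are assumptions
CREATE FUNCTION dbo.CollapseNonPrintable (@String VARCHAR(8000))
RETURNS VARCHAR(8000) AS
BEGIN
    DECLARE @i INT, @Len INT, @Code INT, @InGarbage BIT, @Wk VARCHAR(8000)
    SET @i = 1
    SET @Len = DATALENGTH(@String)   -- unlike LEN, DATALENGTH counts trailing spaces
    SET @InGarbage = 0
    SET @Wk = ''
    WHILE @i <= @Len
    BEGIN
        SET @Code = ASCII(SUBSTRING(@String, @i, 1))
        IF @Code BETWEEN 32 AND 126
        BEGIN
            -- Readable character: copy it and leave any garbage run
            SET @Wk = @Wk + SUBSTRING(@String, @i, 1)
            SET @InGarbage = 0
        END
        ELSE IF @InGarbage = 0
        BEGIN
            -- First character of a garbage run: emit one tab for the whole run
            SET @Wk = @Wk + CHAR(9)
            SET @InGarbage = 1
        END
        SET @i = @i + 1
    END
    RETURN @Wk
END
```

It would be applied to the already cast value, for example dbo.CollapseNonPrintable(CAST(CAST(a AS VARBINARY(8000)) AS VARCHAR(8000))). A character-by-character loop in a scalar function is slow on large tables, so this suits spot extraction more than bulk reporting.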

Extract a number from String in SQL

I have the following string:
"FLEETWOOD DESIGNS 535353110XXXXX" (The X's are actually numbers I just wanted to hide them here)
Does anyone know how I can search through strings in SQL and extract numbers that are longer than, say, 10 characters?
This is quite an old post, but this might help someone else. I was searching for a user-defined function in SQL Server to extract only the numbers from a given string and, surprisingly, I could not find exactly what I was looking for.
Here is the code of a function to "Extract a number from string in SQL" (valid for SQL Server). It is taken from the fantastic blog of Pinal Dave; I have modified it only to return NULL if a NULL value is passed to the function.
CREATE FUNCTION [dbo].[ExtractInteger](@String VARCHAR(2000))
RETURNS VARCHAR(1000)
AS
BEGIN
DECLARE @Count INT
DECLARE @IntNumbers VARCHAR(1000)
-- SUBSTRING positions are 1-based, so start at the first character
SET @Count = 1
SET @IntNumbers = ''
IF @String IS NULL
RETURN NULL;
WHILE @Count <= LEN(@String)
BEGIN
-- Keep the character only if it is a digit
IF SUBSTRING(@String,@Count,1) >= '0' AND SUBSTRING(@String,@Count,1) <= '9'
BEGIN
SET @IntNumbers = @IntNumbers + SUBSTRING(@String,@Count,1)
END
SET @Count = @Count + 1
END
RETURN @IntNumbers
END
Tests
select '"' + dbo.ExtractInteger('1a2b3c4d5e6f7g8h9i') + '"'
GO
select '"' + dbo.ExtractInteger('abcdefghi') + '"'
GO
select '"' + dbo.ExtractInteger(NULL) + '"'
GO
select '"' + dbo.ExtractInteger('') + '"'
GO
Results
"123456789"
""
NULL
""
You don't mention the DB engine, so we don't know what features are available.
If regular expressions are available, then a pattern like \d{10,} would match numbers with 10 or more digits.
In MySQL, REGEXP can only return true or false (1 or 0), so you'd have to use some ugly hack like
SELECT
LEAST(
INSTR(field,'0'),
INSTR(field,'1'),
INSTR(field,'2'),
INSTR(field,'3'),
INSTR(field,'4'),
INSTR(field,'5'),
INSTR(field,'6'),
INSTR(field,'7'),
INSTR(field,'8'),
INSTR(field,'9')
) AS startPos,
REVERSE(field) AS backward,
LEAST(
INSTR(backward,'0'),
INSTR(backward,'1'),
INSTR(backward,'2'),
INSTR(backward,'3'),
INSTR(backward,'4'),
INSTR(backward,'5'),
INSTR(backward,'6'),
INSTR(backward,'7'),
INSTR(backward,'8'),
INSTR(backward,'9')
) AS endPos,
SUBSTRING(field, startPos, endPos - startPos + 1)
FROM tab
WHERE(field REGEXP '[0-9]{10,}')
but this isn't perfect: it would extract the wrong substring for a string like "ABC 9 A 1234567891", not to mention that it is probably so slow that it would be faster to go through the data by hand.
SELECT SUBSTRING('FLEETWOOD DESIGNS 535353110XXXXX', 19, 14)
The third argument of SUBSTRING is a length, not an end index. You could also use LEN() to get the length of the string itself: if you know the length of the serial number, subtract it from LEN() and add 1 to get the start index of the substring.
It could be done like this
Declare @X varchar(100)
Select @X= 'Here is where15234Numbers'
--
Select @X= SubString(@X,PATINDEX('%[0-9]%',@X),Len(@X))
Select @X= SubString(@X,0,PATINDEX('%[^0-9]%',@X))
--// show result
Select @X
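To honour the "longer than 10 characters" part of the question in T-SQL, the PATINDEX idea can be extended to look for ten digits in a row. The following is a sketch under assumptions: the variable names are illustrative, the sample digits 12345 stand in for the hidden X's, and only the first qualifying run is returned.

```sql
-- Illustrative names and sample data; not taken from the answers above
DECLARE @S VARCHAR(200), @Start INT, @End INT, @Rest VARCHAR(200)
SET @S = 'FLEETWOOD DESIGNS 53535311012345'

-- Start of the first run of at least 10 consecutive digits (0 if there is none)
SET @Start = PATINDEX('%' + REPLICATE('[0-9]', 10) + '%', @S)

IF @Start > 0
BEGIN
    SET @Rest = SUBSTRING(@S, @Start, LEN(@S))
    -- Appending 'x' guarantees a non-digit exists, even when the run ends the string
    SET @End = PATINDEX('%[^0-9]%', @Rest + 'x')
    SELECT LEFT(@Rest, @End - 1) AS LongNumber
END
```

Shorter digit runs earlier in the string are skipped, because the pattern only matches where ten digits follow each other directly.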

Other approach for handling this TSQL text manipulation

I have this following data:
0297144600-4799 0297485500-5599
Based on observation, the 0297485500-5599 always starts at position 31 from the left, which makes for an easy approach.
But I would like to anticipate data like the following, where that position is no longer valid:
0297144600-4799 0297485500-5599 0297485600-5699
As you can see, my first idea would be to split on a single blank space (" "), but since the number of spaces is unknown (it varies), how do I take this approach? Is there a method to find the spaces in between and shrink them into one blank space (" ")?
BTW, it needs to be done in T-SQL (MS SQL 2005), unfortunately, because it's for SSIS :(
I am open to your ideas and suggestions.
Thanks
I have updated my answer a bit, now that I know the number pattern will not always match. This code assumes the sequences will begin and end with a number and be separated by any number of spaces.
DECLARE @input nvarchar(max)
DECLARE @pattern nvarchar(max)
DECLARE @answer nvarchar(max)
DECLARE @pos int
SET @input = ' 0297144623423400-4799 5615618131201561561 0297485600-5699 '
-- Make sure our search string has whitespace at the end for our pattern to match
SET @input = @input + ' '
-- Find anything that starts and ends with a number
WHILE PATINDEX('%[0-9]%[0-9] %', @input) > 0
BEGIN
-- Trim off the leading whitespace
SET @input = LTRIM(@input)
-- Find the end of the sequence by finding a space
SET @pos = PATINDEX('% %', @input)
-- Get the result out now that we know where it is
SET @answer = SUBSTRING(@input, 0, @pos)
SELECT [Result] = @answer
-- Remove the result off the front of the string so we can continue parsing
SET @input = SUBSTRING(@input, LEN(@answer) + 1, 8096)
END
Assuming you're processing one line at a time, you can also try this:
DECLARE @InputString nvarchar(max)
SET @InputString = '0297144600-4799      0297485500-5599    0297485600-5699'
BEGIN
WHILE CHARINDEX('  ',@InputString) > 0 -- Checking for double spaces
SET @InputString =
REPLACE(@InputString,'  ',' ') -- Replace 2 spaces with 1 space
END
PRINT @InputString
(taken directly from SQLUSA, fnRemoveMultipleSpaces1)
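Once the spaces are collapsed, the values can be split on the single remaining space. The sketch below combines both steps and avoids syntax newer than SQL Server 2005; the table variable @Parts and its column names are hypothetical, and each value is assumed to fit in 100 characters.

```sql
-- Illustrative sketch; @Parts, PartNo and Part are made-up names
DECLARE @InputString NVARCHAR(MAX), @Pos INT
DECLARE @Parts TABLE (PartNo INT IDENTITY(1,1), Part NVARCHAR(100))
SET @InputString = N'0297144600-4799     0297485500-5599   0297485600-5699'

-- Step 1: collapse runs of spaces into a single space
WHILE CHARINDEX(N'  ', @InputString) > 0
    SET @InputString = REPLACE(@InputString, N'  ', N' ')

-- Step 2: trim, then add one trailing space so every value ends with a delimiter
SET @InputString = LTRIM(RTRIM(@InputString)) + N' '
SET @Pos = CHARINDEX(N' ', @InputString)
WHILE @Pos > 1
BEGIN
    INSERT INTO @Parts (Part) VALUES (LEFT(@InputString, @Pos - 1))
    -- Drop the value just read together with its delimiter
    SET @InputString = SUBSTRING(@InputString, @Pos + 1, LEN(@InputString))
    SET @Pos = CHARINDEX(N' ', @InputString)
END

SELECT PartNo, Part FROM @Parts ORDER BY PartNo
```

For the sample string this yields three rows, one per value, in their original order. PartNo then gives a stable way to pick the second or third value regardless of how many spaces separated them.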

How can I find Unicode/non-ASCII characters in an NTEXT field in a SQL Server 2005 table?

I have a table with a couple thousand rows. The description and summary fields are NTEXT, and sometimes have non-ASCII chars in them. How can I locate all of the rows with non ASCII characters?
I have sometimes been using this "cast" statement to find "strange" chars
select
*
from
<Table>
where
<Field> != cast(<Field> as varchar(1000))
First build a string with all the characters you're not interested in (the example uses the 0x20 - 0x7F range, or 7 bits without the control characters.) Each character is prefixed with |, for use in the escape clause later.
-- Start with tab, line feed, carriage return
declare @str varchar(1024)
set @str = '|' + char(9) + '|' + char(10) + '|' + char(13)
-- Add all normal ASCII characters (32 -> 127)
declare @i int
set @i = 32
while @i <= 127
begin
-- Uses | to escape, could be any character
set @str = @str + '|' + char(@i)
set @i = @i + 1
end
The next snippet searches for any character that is not in the list. The % matches 0 or more characters. The [] matches one of the characters inside the [], for example [abc] would match either a, b or c. The ^ negates the list, for example [^abc] would match anything that's not a, b, or c.
select *
from yourtable
where yourfield like '%[^' + @str + ']%' escape '|'
The escape character is required because otherwise searching for characters like ], % or _ would mess up the LIKE expression.
Hope this is useful, and thanks to JohnFX's comment on the other answer.
Here ya go:
SELECT *
FROM Objects
WHERE
ObjectKey LIKE '%[^0-9a-zA-Z !"#$%&''()*+,\-./:;<=>?@\[\^_`{|}~\]\\]%' ESCAPE '\'
It's probably not the best solution, but maybe a query like:
SELECT *
FROM yourTable
WHERE yourTable.yourColumn LIKE '%[^0-9a-zA-Z]%'
Replace the "0-9a-zA-Z" expression with something that captures the full ASCII set (or a subset that your data contains).
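One way to capture the printable ASCII set in a single range is to compare under a binary collation, where a range is evaluated by character code. The sketch below is an illustration under assumptions: the table variable @Demo and its sample rows are made up, tab, line feed and carriage return are treated as acceptable, and an NTEXT column would first need CAST(... AS NVARCHAR(MAX)).

```sql
-- Made-up demo data; @Demo, id and txt are illustrative names
DECLARE @Demo TABLE (id INT, txt NVARCHAR(100))
INSERT INTO @Demo VALUES (1, N'plain ascii text')
INSERT INTO @Demo VALUES (2, N'fullwidth comma' + NCHAR(65292))
INSERT INTO @Demo VALUES (3, N'caf' + NCHAR(233))

-- [^ -~...] matches any character outside space (32) through tilde (126),
-- other than the tab, line feed and carriage return appended to the class
SELECT id, txt
FROM @Demo
WHERE txt COLLATE Latin1_General_BIN
      LIKE N'%[^ -~' + NCHAR(9) + NCHAR(10) + NCHAR(13) + N']%'
```

With these rows the query should return ids 2 and 3 and skip the plain ASCII row. The binary collation matters: under a linguistic collation the range would follow sort order rather than character codes.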
Technically, I believe that an NCHAR(1) is a valid ASCII character if and only if UNICODE(@NChar) < 256 and ASCII(@NChar) = UNICODE(@NChar), though that may not be exactly what you intended. Therefore this would be a correct solution:
;With cteNumbers as
(
Select ROW_NUMBER() Over(Order By c1.object_id) as N
From sys.system_columns c1, sys.system_columns c2
)
Select Distinct RowID
From YourTable t
Join cteNumbers n ON n.N <= Len(CAST(TXT As NVarchar(MAX)))
Where UNICODE(Substring(TXT, n.N, 1)) > 255
OR UNICODE(Substring(TXT, n.N, 1)) <> ASCII(Substring(TXT, n.N, 1))
This should also be very fast.
I started with @CC1960's solution but found an interesting use case that caused it to fail. It seems that SQL Server will equate certain Unicode characters to their non-Unicode approximations. For example, SQL Server considers the Unicode character "fullwidth comma" (http://www.fileformat.info/info/unicode/char/ff0c/index.htm) the same as a standard ASCII comma when compared in a WHERE clause.
To get around this, have SQL Server compare the strings as binary. But remember, nvarchar and varchar binaries don't match up (16-bit vs 8-bit), so you need to convert your varchar back up to nvarchar again before doing the binary comparison:
select *
from my_table
where CONVERT(binary(5000),my_table.my_column) != CONVERT(binary(5000),CONVERT(nvarchar(1000),CONVERT(varchar(1000),my_table.my_column)))
If you are looking for a specific unicode character, you might use something like below.
select Fieldname from
(
select Fieldname,
REPLACE(Fieldname COLLATE Latin1_General_BIN,
NCHAR(65533) COLLATE Latin1_General_BIN,
'CustomText123') replacedcol
from table
) results where results.replacedcol like '%CustomText123%'
My previous answer was confusing UNICODE/non-UNICODE data. Here is a solution that should work for all situations, although I'm still running into some anomalies. It seems like certain non-ASCII unicode characters for superscript characters are being confused with the actual number character. You might be able to play around with collations to get around that.
Hopefully you already have a numbers table in your database (they can be very useful), but just in case I've included the code to partially fill that as well.
You also might need to play around with the numeric range, since unicode characters can go beyond 255.
CREATE TABLE dbo.Numbers
(
number INT NOT NULL,
CONSTRAINT PK_Numbers PRIMARY KEY CLUSTERED (number)
)
GO
DECLARE @i INT
SET @i = 0
WHILE @i < 1000
BEGIN
INSERT INTO dbo.Numbers (number) VALUES (@i)
SET @i = @i + 1
END
GO
SELECT *,
T.ID, N.number, N'%' + NCHAR(N.number) + N'%'
FROM
dbo.Numbers N
INNER JOIN dbo.My_Table T ON
T.description LIKE N'%' + NCHAR(N.number) + N'%' OR
T.summary LIKE N'%' + NCHAR(N.number) + N'%'
and t.id = 1
WHERE
N.number BETWEEN 127 AND 255
ORDER BY
T.id, N.number
GO
-- This is a very, very inefficient way of doing it but should be OK for
-- small tables. It uses an auxiliary table of numbers as per Itzik Ben-Gan and simply
-- looks for characters with bit 7 set.
SELECT *
FROM yourTable as t
WHERE EXISTS ( SELECT *
FROM msdb..Nums as NaturalNumbers
WHERE NaturalNumbers.n <= LEN(t.string_column) -- <= so the last character is checked too
AND ASCII(SUBSTRING(t.string_column, NaturalNumbers.n, 1)) > 127)