Perform string comparison ignoring the diacritics - sql

I'm trying to search Arabic text in SQL Server and need to ignore the Arabic diacritics.
So I'm using the Arabic_100_CI_AI collation, but it doesn't work.
For example, the query below should return 1, but it returns no rows:
select 1
where (N'مُحَمَّد' Collate Arabic_100_CI_AI) = (N'محمّد' Collate Arabic_100_CI_AI)
What is the problem and how can I perform diacritics insensitive comparison in Arabic text?

It seems the AI flag does NOT work for Arabic. You can build your own Unicode normalization function.
CREATE FUNCTION [dbo].[NormalizeUnicode]
(
    -- Text from which the Arabic combining marks are stripped
    @unicodeWord nvarchar(max)
)
RETURNS nvarchar(max)
AS
BEGIN
    DECLARE @Result nvarchar(max);
    DECLARE @l int;
    DECLARE @i int;
    -- LEN ignores trailing spaces, so append a sentinel character before measuring
    SET @l = LEN(@unicodeWord + N'-') - 1;
    SET @i = 1;
    SET @Result = N'';
    WHILE (@i <= @l)
    BEGIN
        DECLARE @c nvarchar(1);
        SET @c = SUBSTRING(@unicodeWord, @i, 1);
        -- 0x064B to 0x065F and 0x0670 are combining characters
        -- You may need to perform tests for this character range
        IF NOT (UNICODE(@c) BETWEEN 0x064B AND 0x065F OR UNICODE(@c) = 0x0670)
            SET @Result = @Result + @c;
        SET @i = @i + 1;
    END
    -- Return the result of the function
    RETURN @Result;
END
The following test should work correctly:
select 1
where dbo.NormalizeUnicode(N'بِسمِ اللہِ الرَّحمٰنِ الرَّحیم') = dbo.NormalizeUnicode(N'بسم اللہ الرحمن الرحیم');
Notes:
You may experience slow performance with this solution
The character range I've used in the function is NOT thoroughly tested.
For a complete reference on Arabic Unicode Character Set, see this document http://www.unicode.org/charts/PDF/U0600.pdf

Your use of the collation is correct, but if you look carefully at the two Arabic words in your query, they are different strings even though their meaning is the same, so the comparison fails and you get no result:
N'مُحَمَّد' and N'محمّد'
I am pretty sure that if you inspect their code points with the UNICODE() function, the results will differ.
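One way to see the difference is to list the code point at each position of both strings side by side. This is only a sketch: the variable names and the 50-row numbers list built from sys.all_objects are assumptions made for illustration, not part of the original answer.

```sql
DECLARE @a nvarchar(50) = N'مُحَمَّد';
DECLARE @b nvarchar(50) = N'محمّد';

-- Small numbers list; 50 positions is more than enough for these samples.
WITH n AS (
    SELECT TOP (50) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS i
    FROM sys.all_objects
)
SELECT n.i AS pos,
       UNICODE(SUBSTRING(@a, n.i, 1)) AS code_point_a,  -- NULL once past the end of @a
       UNICODE(SUBSTRING(@b, n.i, 1)) AS code_point_b   -- NULL once past the end of @b
FROM n
WHERE n.i <= LEN(@a) OR n.i <= LEN(@b)
ORDER BY n.i;
```

Any row where the two columns differ, or where one side is NULL, marks a position at which the strings are not the same sequence of characters.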
If you try the below query, it will succeed
select 1
where N'مُحَمَّد' Collate Arabic_100_CI_AI like '%%'
See this post for a better explanation
Treating certain Arabic characters as identical

Related

Data length issue from SQL Server to Oracle with non-English characters

We have two applications. One uses SQL Server as its backend and the other uses Oracle.
In the first application the user enters some information; the second application reads that data from SQL Server and inserts it into Oracle.
The problem is that the user can enter text in any language. The following tables show sample data.
Table in SQL Server
For instance, the user has entered Chinese characters in the address field and the length is 10.
Oracle Table
The address is not inserted here because its length comes to 12 in Oracle, where each special character counts as a length of 3.
I want to take a substring of a string that mixes English and non-English characters. How can I achieve that? I have written a function which returns the number of special characters.
How can I get only 5 characters from @nstring?
Try to define the column in Oracle as VARCHAR2(10 CHAR). That changes the length semantics from bytes to characters, so the column will accept 10 characters rather than just 10 bytes, which might be too short if there are special characters in the string.
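As a rough illustration of that suggestion on the Oracle side (this is Oracle SQL, not T-SQL), the column could be switched to character semantics as below. The table name customer_address and the column name address are hypothetical, since the original tables are not shown.

```sql
-- Oracle SQL; customer_address and address are hypothetical names.
ALTER TABLE customer_address MODIFY (address VARCHAR2(10 CHAR));

-- LENGTH counts characters, LENGTHB counts bytes in the database character set,
-- so comparing the two shows how much multi-byte characters add.
SELECT address,
       LENGTH(address)  AS char_len,
       LENGTHB(address) AS byte_len
FROM customer_address;
```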
declare @nstring NVARCHAR(MAX) = N'理,551'
declare @lenSQL int = len(@nstring)
-- Oracle length = SQL Server length plus 2 extra for each non-English character (3 instead of 1)
declare @oracleLen int = @lenSQL + (2 * [dbo].[CountNonEnglishfromString](@nstring))
declare @OracleMaxLen int = 5; -- change as per len required
declare @newString nvarchar(max);
if (@OracleMaxLen < @oracleLen)
begin
    declare @olen int = 0
    declare @count int = 1;
    WHILE (@count <= @lenSQL)
    BEGIN
        -- take exactly one character at position @count
        declare @ch nvarchar(1) = SUBSTRING(@nstring, @count, 1);
        declare @isSpecialChar int = [dbo].[CountNonEnglishfromString](@ch)
        if (@isSpecialChar = 1)
        begin
            set @olen = @olen + 3;
        end
        else
        begin
            set @olen = @olen + 1;
        end
        -- stop before the next character would exceed the Oracle limit
        if (@OracleMaxLen < @olen)
        begin
            break
        end
        set @newString = CONCAT(@newString, @ch)
        set @count = @count + 1
    End
End
else
begin
    set @newString = @nstring
end
select isnull(@newString,'') as 'new string';

How to check Palindrome in SQL Server

To check palindrome I am using REVERSE function of SQL Server.
I wanted to check how reverse function works with this sample code:
declare @string nvarchar
set @string = 'szaaa'
SELECT REVERSE(@string)
But the output was 's' instead of 'aaazs', which I expected. How should I capture the reverse? Is there any better way to find a palindrome?
In SQL Server, always use lengths with the character types:
declare @string nvarchar(255);
set @string = 'szaaa';
SELECT REVERSE(@string);
The default length varies by context. In this case, the default length is "1", so the string variable only holds one character.
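A quick way to see the two defaults side by side is the sketch below. The variable name @short is just for illustration.

```sql
-- No length given in a DECLARE: the variable is nvarchar(1), so the value is cut to one character.
DECLARE @short nvarchar = N'szaaa';
SELECT @short AS value_kept, LEN(@short) AS chars_kept;          -- 's', 1

-- No length given in CAST/CONVERT: the default is 30 characters instead.
SELECT LEN(CAST(REPLICATE(N'a', 40) AS nvarchar)) AS chars_kept;  -- 30
```

In both cases the truncation is silent, which is why spelling out the length is the safer habit.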
To check for a palindrome, you can use a CASE expression:
DECLARE @string NVARCHAR(255);
SET @string = 'szaaa';
SELECT CASE WHEN @string = REVERSE(@string) THEN 'Is palindrome'
            ELSE 'Is not palindrome'
       END

Creating multiple UDFs in one batch - SQL Server

I'm asking this question for SQL Server 2008 R2
I'd like to know if there is a way to create multiple functions in a single batch statement.
I've made the following code as an example; suppose I want to take a character string and rearrange its letters in alphabetical order. So, 'Hello' would become 'eHllo'
CREATE FUNCTION char_split (@string varchar(max))
RETURNS @characters TABLE
(
    chars varchar(2)
)
AS
BEGIN
    DECLARE @length int,
            @K int
    SET @length = len(@string)
    SET @K = 1
    WHILE @K < @length+1
    BEGIN
        INSERT INTO @characters
        SELECT SUBSTRING(@string,@K,1)
        SET @K = @K+1
    END
    RETURN
END
CREATE FUNCTION rearrange (@string varchar(max))
RETURNS varchar(max)
AS
BEGIN
    DECLARE @SplitData TABLE (
        chars varchar(2)
    )
    INSERT INTO @SplitData SELECT * FROM char_split(@string)
    DECLARE @Output varchar(max)
    SELECT @Output = coalesce(@Output,' ') + cast(chars as varchar(10))
    from @SplitData
    order by chars asc
    RETURN @Output
END
declare @string varchar(max)
set @string = 'Hello'
select dbo.rearrange(@string)
When I try running this code, I get this error:
'CREATE FUNCTION' must be the first statement in a query batch.
I tried enclosing each function in a BEGIN END block, but no luck. Any advice?
Just use a GO statement between the definitions of the UDFs.
Not doable. Simple as that.
You can put them in one script using a GO between them.
But as GO is a batch delimiter, this means you send multiple batches, which is explicitly NOT wanted in your question.
So, no, it is not possible to do that in one batch, as the error clearly indicates.
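For reference, a script that separates the definitions with GO might look like the sketch below. The functions add_one and add_two are hypothetical and exist only to keep the example short. GO is a separator understood by client tools such as SSMS and sqlcmd, so each CREATE FUNCTION still reaches the server as the first statement of its own batch.

```sql
-- add_one and add_two are hypothetical functions used only for illustration.
CREATE FUNCTION dbo.add_one (@x int)
RETURNS int
AS
BEGIN
    RETURN @x + 1
END
GO

-- Second batch: this function can already call the one created above.
CREATE FUNCTION dbo.add_two (@x int)
RETURNS int
AS
BEGIN
    RETURN dbo.add_one(dbo.add_one(@x))
END
GO

SELECT dbo.add_two(1);  -- 3
```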

4000 character limit in LIKE statement

I have been getting an error in a previously working stored procedure called by an SSRS report and I have traced it down to a LIKE statement in a scalar function that is called by the stored procedure, in combination with a 7000+ NVARCHAR(MAX) string. It is something similar to:
Msg 8152, Level 16, State 10, Line 14
String or binary data would be truncated.
I can reproduce it with the following code:
DECLARE @name1 NVARCHAR(MAX) = ''
DECLARE @name2 NVARCHAR(MAX) = ''
DECLARE @count INT = 4001
WHILE @count > 0
BEGIN
    SET @name1 = @name1 + 'a'
    SET @name2 = @name2 + 'a'
    SET @count = @count - 1
END
SELECT LEN(@name1)
IF @name1 LIKE @name2
    PRINT 'OK'
What's the deal? Is there any way around this limitation, or is it there for good reason? Thanks.
You can also reproduce it without the terrible loop:
DECLARE @name1 NVARCHAR(MAX), @name2 NVARCHAR(MAX);
SET @name1 = REPLICATE(CONVERT(NVARCHAR(MAX), N'a'), 4000);
SET @name2 = @name1;
IF @name1 LIKE @name2
    PRINT 'OK';
SELECT @name1 += N'a', @name2 += N'a';
IF @name1 LIKE @name2
    PRINT 'OK';
Result:
OK
Msg 8152, Level 16, State 10, Line 30
String or binary data would be truncated.
In any case, the reason is clearly stated in the documentation for LIKE (emphasis mine):
match_expression [ NOT ] LIKE pattern [ ESCAPE escape_character ]
...
pattern
Is the specific string of characters to search for in match_expression, and can include the following valid wildcard characters. pattern can be a maximum of 8,000 bytes.
And 8,000 bytes is used up by 4,000 Unicode characters.
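The byte arithmetic can be checked directly by comparing LEN (characters) with DATALENGTH (bytes). This is a small sketch; the variable name @s is arbitrary.

```sql
DECLARE @s NVARCHAR(MAX) = REPLICATE(CONVERT(NVARCHAR(MAX), N'a'), 4000);
SELECT LEN(@s) AS characters, DATALENGTH(@s) AS bytes;   -- 4000, 8000

-- One more character pushes the value past the 8,000-byte pattern limit.
SET @s += N'a';
SELECT LEN(@s) AS characters, DATALENGTH(@s) AS bytes;   -- 4001, 8002
```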
I would suggest that comparing the first 4,000 characters is probably sufficient:
WHERE column LIKE LEFT(@param, 4000) + '%';
I can't envision any scenario where you want to compare the whole thing; how many strings contain the same first 4000 characters but then character 4001 is different? If that really is a requirement, I guess you could go to the great lengths identified in the Connect item David pointed out.
A simpler (though probably much more computationally expensive) workaround might be:
IF CONVERT(VARBINARY(MAX), @name1) = CONVERT(VARBINARY(MAX), @name2)
    PRINT 'OK';
I suggest that it would be far better to fix the design and stop identifying rows by comparing large strings. Is there really no other way to identify the row you're after? This is like finding your car in the parking lot by testing the DNA of all the Dunkin Donuts cups in all the cup holders, rather than just checking the license plate.
I have the same problem right now, and I do believe my situation -where you want to compare two strings with more than 4000 characters- is a possible situation :-).
In my situation, I'm collecting a lot of data from different tables in a NVARCHAR(MAX) field in a specific table, to be able to search on that data using FullText. Keeping that table in sync, is done using the MERGE statement, converting everything to NVARCHAR(MAX).
So my MERGE statement would look like this:
MERGE MyFullTextTable AS target
USING (
SELECT --Various stuff from various tables, casting it as NVARCHAR(MAX)
...
) AS source (IndexColumn, FullTextColumn)
ON (target.IndexColumn = source.IndexColumn)
WHEN MATCHED AND source.FullTextColumn NOT LIKE target.FullTextColumn THEN
UPDATE SET FullTextColumn = source.FullTextColumn
WHEN NOT MATCHED THEN
INSERT (IndexColumn, FullTextColumn)
VALUES (source.IndexColumn, source.FullTextColumn)
OUTPUT -- Some stuff
This would produce errors because of the LIKE comparison when the FullText-data is bigger than 4000 characters.
So I created a function that does the comparison. Although it's not bulletproof, it works for me. You could also split the data into blocks of 4000 characters and compare each block, but for me (for now) comparing the first 4000 characters in combination with the length is enough ...
So the Merge-statement would look like:
MERGE MyFullTextTable AS target
USING (
SELECT --Various stuff from various tables, casting it as NVARCHAR(MAX)
...
) AS source (IndexColumn, FullTextColumn)
ON (target.IndexColumn = source.IndexColumn)
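-- The function returns 1 when the texts match, so update only when it returns 0 (dbo schema assumed)
WHEN MATCHED AND dbo.udfCompareTwoTexts(source.FullTextColumn, target.FullTextColumn) = 0 THEN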
UPDATE SET FullTextColumn = source.FullTextColumn
WHEN NOT MATCHED THEN
INSERT (IndexColumn, FullTextColumn)
VALUES (source.IndexColumn, source.FullTextColumn)
OUTPUT -- Some stuff
And the function looks like:
CREATE FUNCTION udfCompareTwoTexts
(
    @Value1 AS NVARCHAR(MAX),
    @Value2 AS NVARCHAR(MAX)
)
RETURNS BIT
AS
BEGIN
    DECLARE @ReturnValue AS BIT = 0
    IF LEN(@Value1) > 4000 OR LEN(@Value2) > 4000
    BEGIN
        -- Too long for LIKE: compare the lengths and the first 4000 characters only
        IF LEN(@Value1) = LEN(@Value2) AND LEFT(@Value1, 4000) LIKE LEFT(@Value2, 4000)
            SET @ReturnValue = 1
        ELSE
            SET @ReturnValue = 0
    END
    ELSE
    BEGIN
        IF @Value1 LIKE @Value2
            SET @ReturnValue = 1
        ELSE
            SET @ReturnValue = 0
    END
    RETURN @ReturnValue;
END
GO
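The block-by-block variant mentioned above could be sketched roughly as follows. The name dbo.udfCompareInBlocks is hypothetical, and the sketch keeps LIKE for each block, so wildcard characters (%, _, [) in the second value are still interpreted as patterns, just as in the function above.

```sql
CREATE FUNCTION dbo.udfCompareInBlocks  -- hypothetical name
(
    @Value1 NVARCHAR(MAX),
    @Value2 NVARCHAR(MAX)
)
RETURNS BIT
AS
BEGIN
    -- Treat NULL as "no match" so the loop below never sees it.
    IF @Value1 IS NULL OR @Value2 IS NULL
        RETURN 0;

    -- Values of different size cannot match block for block.
    IF DATALENGTH(@Value1) <> DATALENGTH(@Value2)
        RETURN 0;

    DECLARE @pos BIGINT = 1;
    -- NVARCHAR stores 2 bytes per character; DATALENGTH also counts trailing spaces.
    DECLARE @len BIGINT = DATALENGTH(@Value1) / 2;

    WHILE @pos <= @len
    BEGIN
        -- Each block is at most 4000 characters, which stays within the 8,000-byte pattern limit.
        IF SUBSTRING(@Value1, @pos, 4000) NOT LIKE SUBSTRING(@Value2, @pos, 4000)
            RETURN 0;
        SET @pos = @pos + 4000;
    END

    RETURN 1;
END
GO
```

It returns 1 only when every 4000-character block matches, so it can stand in wherever the comparison of the first block plus the length is not strict enough.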

SQL Server - Filter field contents to numbers only

How can I copy the value of a field, but only its numbers?
I am creating a computed column for fulltext search, and I want to copy the values from my Phone Number fields (which are varchar) into it, but not with their formatting - numbers only. What is the command that would do this in my computed column formula?
Thank you!
You are going to have to write a user defined function to do this. There are several ways to do this, here is one that I found with some quick Googling.
CREATE FUNCTION dbo.RemoveChars(@Input varchar(1000))
RETURNS VARCHAR(1000)
BEGIN
    DECLARE @pos INT
    -- position of the first character that is not a digit (0 when none is left)
    SET @pos = PATINDEX('%[^0-9]%',@Input)
    WHILE @pos > 0
    BEGIN
        SET @Input = STUFF(@Input,@pos,1,'')
        SET @pos = PATINDEX('%[^0-9]%',@Input)
    END
    RETURN @Input
END
Warning: I wouldn't put this in a WHERE condition on a large table, or in a SELECT that returns millions of rows, but it will work.
Ultimately you are probably better stripping the non-numeric characters out in the UI of your app than in DB code.
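If the function is used for the computed column the question asks about, the definition might look like the sketch below. YourTable, Phone, and PhoneDigits are hypothetical names, and the GO is needed so the SELECT is compiled only after the new column exists.

```sql
-- YourTable, Phone and PhoneDigits are hypothetical names; dbo.RemoveChars is the function above.
ALTER TABLE YourTable
    ADD PhoneDigits AS dbo.RemoveChars(Phone);
GO

SELECT Phone, PhoneDigits
FROM YourTable;
```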
Assuming there are only a couple of non-number characters, nested REPLACE calls do the trick:
select replace(replace(replace(col1,'-',''),'(',''),')','')
from YourTable
You can check whether you caught all characters like this:
select col1
from YourTable
where col1 like '%[^0-9()-]%'
(This example returns the rows that still contain a character other than -, (, ), and digits.)
I'd create a user-defined function that you could use in your select and where criteria, maybe something like this:
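-- dbo.DigitsOnly and its varchar(50) parameter are hypothetical; the original snippet showed only the body
CREATE FUNCTION dbo.DigitsOnly (@input varchar(50))
RETURNS varchar(50)
AS
BEGIN
    DECLARE @position int, @result varchar(50)
    SET @position = 1
    SET @result = ''
    WHILE @position <= DATALENGTH(@input)
    BEGIN
        -- ASCII codes 48 to 57 are the digits 0 to 9
        IF ASCII(SUBSTRING(@input, @position, 1)) BETWEEN 48 AND 57
        BEGIN
            SET @result = @result + SUBSTRING(@input, @position, 1)
        END
        SET @position = @position + 1
    END
    RETURN @result
END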
Best of luck!
I realize this is a somewhat older question, but there is no need to resort to looping for this. These days we should also try to avoid scalar functions when possible, as they are not good for performance. We can leverage an inline table-valued function in conjunction with the light, regular-expression-like pattern support that we have in SQL Server. This article from Jeff Moden explains this in more detail from the perspective of why IsNumeric does not really work. http://www.sqlservercentral.com/articles/ISNUMERIC()/71512/
The gist of it is this nifty function he put together.
CREATE FUNCTION dbo.IsAllDigits
/********************************************************************
Purpose:
This function will return a 1 if the string parameter contains only
numeric digits and will return a 0 in all other cases. Use it in
a FROM clause along with CROSS APPLY when used against a table.
--Jeff Moden
********************************************************************/
--===== Declare the I/O parameters
        (@MyString VARCHAR(8000))
RETURNS TABLE AS
 RETURN (
         SELECT CASE
                WHEN @MyString NOT LIKE '%[^0-9]%'
                THEN 1
                ELSE 0
                END AS IsAllDigits
        )
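As the header comment suggests, the function is meant to be used with CROSS APPLY. Here is a minimal usage sketch, assuming the function above has been created; the inline VALUES list of sample phone strings is made up for illustration.

```sql
-- The sample values are illustrative only.
SELECT v.phone, d.IsAllDigits
FROM (VALUES ('5551234567'), ('(555) 123-4567'), ('555 1234')) AS v(phone)
CROSS APPLY dbo.IsAllDigits(v.phone) AS d;
-- '5551234567'     -> 1
-- '(555) 123-4567' -> 0
-- '555 1234'       -> 0
```

Note that this only reports whether a value is all digits. It does not strip the formatting, so it complements the removal functions above rather than replacing them.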