Is it possible to compare two columns in Microsoft SQL Server so that the comparison skips punctuation marks and other characters like %, ' etc.?

I have two columns having data like below.
Column1
AMC Standard, School
Column2
AMC Standard School.
I need to compare these two columns so that only the words are compared and not any additional characters. In the example above, Column1 and Column2 should match, but because of the comma "," and the period "." a simple comparison of Column1 and Column2 reports a mismatch.

You can replace the characters that should not take part in the comparison (in your case , and .) with an empty string and then compare the results. Something like this:
SELECT 1 WHERE REPLACE(REPLACE('AMC Standard, School', ',', ''), '.', '') = REPLACE(REPLACE('AMC Standard School.', ',', ''), '.', '')
Based on jarlh's comments: you should (if possible) update the columns and remove the punctuation marks if they are not used in any comparison or display.
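When more than a couple of characters have to be ignored, nesting REPLACE calls gets unwieldy. A minimal sketch of one alternative, assuming SQL Server 2017 or later (for TRANSLATE) and assuming the placeholder character ~ never occurs in the data; the table variable and its single row are hypothetical and only there to make the snippet self-contained:

```sql
-- Hypothetical sample data; the column names follow the question.
DECLARE @t TABLE (Column1 varchar(100), Column2 varchar(100));
INSERT INTO @t VALUES ('AMC Standard, School', 'AMC Standard School.');

-- TRANSLATE maps every unwanted character ( , . % ' ) to the placeholder ~,
-- then a single REPLACE strips the placeholder from both sides.
SELECT Column1, Column2
FROM @t
WHERE REPLACE(TRANSLATE(Column1, ',.%''', '~~~~'), '~', '')
    = REPLACE(TRANSLATE(Column2, ',.%''', '~~~~'), '~', '');
```

To ignore further characters, add them to the second argument of TRANSLATE and lengthen the third argument by the same number of placeholders; both arguments must have the same length.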

One option is to use SQL Server's SOUNDEX() and DIFFERENCE() functions (https://msdn.microsoft.com/en-us/library/ms187384.aspx and https://msdn.microsoft.com/en-us/library/ms188753.aspx respectively)
DECLARE @val1 varchar(50) = 'AMC Standard, School'
, @val2 varchar(50) = 'AMC Standard School.'
;
SELECT @val1
, @val2
, SoundEx(@val1)
, SoundEx(@val2)
, Difference(@val1, @val2) -- Difference() computes the SoundEx() of each argument itself
;
The return value of Difference() is between 0 and 4, with a higher number signifying a closer match.
IMPORTANT NOTE: This type of comparison is not as exacting as a method that cleans up your data beforehand as in those scenarios you can use an exact (a=a) comparison, whereas this method looks for similar values.

Try it like this:
DECLARE @column1 VARCHAR(100)='AMC Standard, School (Near to ABC Building)'
DECLARE @column2 VARCHAR(100)='AMC Standard, School (Opposite KFC)'
SELECT 'MATCHED' AS COLUMN_COMPARE
WHERE replace(replace(replace(@column1, ',', ''), '.', ''), substring(@column1, CHARINDEX('(', @column1), CHARINDEX(')', @column1) - CHARINDEX('(', @column1) + 1), '') = replace(replace(replace(@column2, ',', ''), '.', ''), substring(@column2, CHARINDEX('(', @column2), CHARINDEX(')', @column2) - CHARINDEX('(', @column2) + 1), '')

Related

How many ways can you generate an error converting varchar to numeric that won't be caught by ISNUMERIC()?

I am in the process of loading a bunch of tables into SQL Server and converting them from varchar to specific data types (int, date, etc.). One frustration is how many different ways there are to break the conversion from string to numeric (int, decimal, etc) and that there is not an easy diagnostic tool to find the offending rows (besides ISNUMERIC() which doesn't work all the time).
Here is my list of ways to break the conversion that won't get caught by ISNUMERIC().
The string contains scientific notation (ie 3.55E-10)
The string contains a blank ('')
The string contains a non-alphanumeric symbol ('$', '-', ',')
Here's what I'm currently using to compensate:
SELECT
CASE
WHEN [MyColumn] IN ('','-') THEN NULL -- deals with blanks
WHEN [MyColumn] LIKE '%E%' THEN CONVERT(DECIMAL(20, 4), CONVERT(FLOAT(53), [MyColumn])) -- deals with scientific notation
ELSE CAST(REPLACE(REPLACE([MyColumn] , '$', ''), '-', '') AS DECIMAL(20, 4))
END [MyColumn] -- deals with special characters
FROM
MyTable
Does anyone else have others? Or good ways to diagnose?
Don't use ISNUMERIC(). If you are on 2012+ then you could use TRY_CAST or TRY_CONVERT.
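As a rough sketch of how TRY_CONVERT can double as the diagnostic tool the question asks for (SQL Server 2012+): it returns NULL instead of raising an error, so filtering on that NULL lists the rows that would break the load. The table variable and its values are made up for illustration, and the target type decimal(20, 4) is taken from the question.

```sql
-- Hypothetical sample data standing in for the real staging table.
DECLARE @MyTable TABLE (MyColumn varchar(50));
INSERT INTO @MyTable VALUES
    ('12.5'), ('3.55E-10'), (''), ('$'), ('-'), ('1,000'), ('abc');

-- Rows that hold a value but cannot be converted to the target type.
SELECT MyColumn
FROM @MyTable
WHERE MyColumn IS NOT NULL
  AND TRY_CONVERT(decimal(20, 4), MyColumn) IS NULL;
```

The same query, with the target type swapped, can be run once per destination type (int, date, and so on) before the real conversion.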
If you are on older versions, you could use some syntax like this:
SELECT *
FROM #TableA
WHERE ColA NOT LIKE '%[^0-9]%'
You can try to use NOT LIKE '%[^0-9]%' instead of ISNUMERIC()
SELECT col, CASE WHEN col NOT LIKE '%[^0-9]%' and col<>''
THEN 1
ELSE 0
END
FROM T
You can use NOT LIKE to exclude anything that isn't a digit... and REPLACE for commas and periods. Naturally, you can add other nested REPLACE functions for values you want to accept.
declare @var varchar(64) = '55,5646'
SELECT
CASE
WHEN replace(replace(#var,'.',''),',','') NOT LIKE '%[^0-9]%'
THEN 1
ELSE 0
END
This allows you to accept decimals for your decimal / numeric / float conversions.

Does the SQL CASE statement treat variables differently from columns?

I have the following code in a stored procedure and am trying to conditionally format a calculated number based on its length (if the number is less than 4 digits, pad with leading zeros). However, my case statement is not working. The "formattedNumber2" result is the one I'm looking for.
I'm assuming the case statement treats the variable strangely, but I also don't know of a way around this.
DECLARE @Number int = 5
SELECT
CASE
WHEN (LEN(CONVERT(VARCHAR, @Number)) > 4)
THEN @Number
ELSE RIGHT('0000' + CAST(@Number AS VARCHAR(4)), 4)
END AS formattedNumber,
LEN(CONVERT(VARCHAR, @Number)) AS numberLength,
RIGHT('0000' + CAST(@Number AS VARCHAR(4)), 4) AS formattedNumber2
I get the following results when I run the query:
formattedNumber numberLength formattedNumber2
-------------------------------------------------
5 1 0005
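The problem is that the branches of your CASE return different data types, integer and string. A CASE expression returns a single type, the one with the highest data type precedence among its branches; int outranks varchar, so the padded string is converted back to an integer and the leading zeros are lost. Convert the integer branch to a string as well:
CASE WHEN (LEN(convert(VARCHAR, @Number)) > 4) THEN convert(VARCHAR, @Number)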
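For reference, a minimal sketch of the complete expression with both branches returning a string, so the CASE as a whole is typed as varchar. The explicit varchar(10) length is my own choice for illustration, not something from the question:

```sql
DECLARE @Number int = 5;

SELECT
    CASE
        -- Both branches now yield varchar, so no implicit conversion back to int occurs.
        WHEN LEN(CONVERT(varchar(10), @Number)) > 4
            THEN CONVERT(varchar(10), @Number)
        ELSE RIGHT('0000' + CONVERT(varchar(10), @Number), 4)
    END AS formattedNumber;
```

With @Number = 5 this yields 0005, matching the formattedNumber2 column from the question.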
This can be done a lot more easily with format() since version 2012.
format(n, '0000')
And that would also handle negative values, which your current approach apparently doesn't.
Prior to 2012 it can be handled with basically replicate() and + (string concatenation).
isnull(replicate('-', -sign(n)), '')
+ isnull(replicate('0', 4 - len(cast(abs(n) AS varchar(10)))), '')
+ cast(abs(n) AS varchar(10))
(It targets integer values, choose a larger length for the varchar casts for bigint.)

Get index of two consecutive upper case characters

I am trying to separate a city/state/zip field into the city, state, and zip. Normally I would do this with charindex of ',' to get the city and state, and isnumeric and right() for the zip.
This will work fine for the zip, but most of the rows in the data I am working with now have no commas: City ST Zip. Is there a way to identify the index of two upper case characters?
If not, does anybody have a better idea than just a case statement checking for each state individually?
EDIT: I found the PATINDEX/COLLATE option to work fairly intermittently. See my answer below.
PATINDEX should work for you:
PATINDEX('% [A-Z][A-Z] %', A COLLATE Latin1_general_cs_as)
So your full extract would be something like:
WITH CTE AS
( SELECT i = PATINDEX('% [A-Z][A-Z] %', A COLLATE Latin1_general_cs_as) + 1,
A
FROM (VALUES
('City ST Zip'),
('Another City ST Zip'),
('City, with comma ST Zip')
) t (A)
)
SELECT City = LEFT(A, i - 2),
State = SUBSTRING(A, i, 2),
Zip = SUBSTRING(A, i + 3, LEN(A))
FROM CTE;
The reason why PATINDEX appears to work intermittently is that you cannot use a character range (i.e. A-Z) to accomplish a case-sensitive search, even if using a case-sensitive collation. The issue is that character ranges work like sorting, and case-sensitive sorting groups the upper-case letters with their lower-case equivalents, just like it would be ordered in a dictionary. Range sorting is really: a,A,b,B,c,C,d,D,etc. Or, depending on the collation, it might be: A,a,B,b,C,c,D,d,etc (there are 31 Collations that sort upper-case first). When doing this in a case-sensitive collation, that merely groups all A entries together, separate from the a entries, whereas in a case-insensitive sort they would be intermixed.
But if you specify each of the letters individually (hence not using a range), then it will work as expected:
PATINDEX(N'%[ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]%',
[CityStZip] COLLATE Latin1_General_100_CS_AS)
The reason that PATINDEX and LIKE (both of which allow for a single character class of [A-Z]) work this way is that the [start-end] syntax is not a Regular Expression. Many people claim that PATINDEX and LIKE support "limited" RegEx due to supporting this syntax, but that is not true. It is merely a very similar (and a confusingly similar) syntax to RegEx where [A-Z] would normally not include any lower-case matches.
Of course, if you are guaranteed to only be searching on the US-English letters of A-Z, then a binary collation (i.e. one ending in _BIN2; don't use ones ending in _BIN as they have been deprecated since SQL Server 2005 was introduced, I believe) should work.
PATINDEX(N'%[A-Z][A-Z]%', [CityStZip] COLLATE Latin1_General_100_BIN2)
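To see the binary collation in action, here is a small sketch with two made-up address strings that differ only in case. Under Latin1_General_100_BIN2 the range [A-Z] is compared by code point, so it covers only the upper-case ASCII letters; the sample values and the StatePos alias are assumptions for illustration.

```sql
-- Hypothetical sample rows: the same address in mixed case and in lower case.
SELECT A,
       -- Position of the space before two consecutive upper-case letters, 0 if none.
       PATINDEX('% [A-Z][A-Z] %', A COLLATE Latin1_General_100_BIN2) AS StatePos
FROM (VALUES
        ('Springfield IL 62704'),
        ('springfield il 62704')
     ) AS t (A);
```

The first row should report a nonzero position and the second row 0, since "il" no longer falls inside the range.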
For more details about case-sensitive matching, especially in regards to including Unicode / NVARCHAR data, please see my related answer on DBA.StackExchange:
How to find values with multiple consecutive upper case characters
If you have zip code and state at the end of the string, then this might work:
select right(address, 5) as zip,
left(right(address, 8), 2) as state,
left(address, len(address) - 9) as city
You can start by removing the commas and double spaces from the address.
If you have a table of states(which you should) with a column of the abbreviations you can do things like this:
SELECT a.* FROM Addresses a
INNER JOIN States s ON
a.CityStateZip Like '% ' + s.UpperCaseAbbreviation + ' %' --space on either side of abbreviation
You can make it work for both commas and spaces:
SELECT a.* FROM Addresses a
INNER JOIN States s ON
Replace(a.CityStateZip, ',' , ' ') Like '% ' + s.UpperCaseAbbreviation + ' %'
I found the PATINDEX/COLLATE option to work fairly intermittently. Here is what I ended up doing:
--get rid of the sparsely used commas
--get rid of the duplicate spaces
update MyTable set
CityStZip=
replace(
replace(
replace(CityStZip,'  ',' '),
'  ',' '),
',','')
select
--check if state and zip are there and then grab the city
case when isNumeric(right(CityStZip,1))=1
then left(CityStZip,len(CityStZip)-charindex(' ',reverse(CityStZip),
charindex(' ',reverse(CityStZip))+1)+1)
--no zip. check for state
when left(right(CityStZip,3),1) = ' '
then left(CityStZip,len(CityStZip)-charIndex(' ',reverse(CityStZip)))
else CityStZip
end as City,
--check if zip is there and then grab the state
case when isNumeric(right(CityStZip,1))=1
then substring(CityStZip,
len(CityStZip)-charindex(' ',reverse(CityStZip),
charindex(' ',reverse(CityStZip))+1)+2,
2)
--no zip. check if 3rd to last char is a space and grab the last two chars
when left(right(CityStZip,3),1) = ' '
then right(CityStZip,2)
end as [State],
--grab everything after the last space if the last character is numeric
case when isNumeric(right(CityStZip,1))=1
then substring(CityStZip,
len(CityStZip)-charindex(' ',reverse(CityStZip))+1,
charindex(' ',reverse(CityStZip)))
end as Zip
from MyTable

Extract float from String/Text SQL Server

I have a data field that is supposed to hold floating-point values (prices); however, the DB designers messed up and now I have to perform aggregate functions on that field. 80% of the time the data is in the correct format, e.g. '80.50', but sometimes it is saved as '$80.50' or '$80.50 per sqm'.
The data field is nvarchar. What I need to do is extract the floating-point number from the nvarchar. I came across this: Article on SQL Authority
This, however, solves half my problem, or compounds it, some might say. That function just returns the digits in a string. That is, '$80.50 per m2' will return 80502. Obviously that won't work. I tried to change the Regex from =>
PATINDEX('%[^0-9]%', @strAlphaNumeric) to=>
PATINDEX('%[^0-9].[^0-9]%', @strAlphaNumeric)
but that doesn't work. Any help would be appreciated.
This will do what you need, tested on (http://sqlfiddle.com/#!6/6ef8e/53)
DECLARE @data varchar(max) = '$70.23 per m2'
-- The appended 'x' guarantees a non-numeric character is found, and the -1 drops it,
-- so values with nothing after the number (e.g. '80.50') are handled too.
Select LEFT(SubString(@data, PatIndex('%[0-9.-]%', @data),
len(@data) - PatIndex('%[0-9.-]%', @data) +1
),
PatIndex('%[^0-9.-]%', SubString(@data, PatIndex('%[0-9.-]%', @data),
len(@data) - PatIndex('%[0-9.-]%', @data) +1) + 'x') - 1
)
But as jpw already mentioned a regular expression over a CLR would be better
This should work too, but it assumes that the float numbers are followed by a white space in case there's text after.
-- sample data
DECLARE @tab TABLE (strAlphaNumeric NVARCHAR(30))
INSERT @tab VALUES ('80.50'),('$80.50'),('$80.50 per sqm')
-- actual query
SELECT
strAlphaNumeric AS Original,
CAST (
SUBSTRING(stralphanumeric, PATINDEX('%[0-9]%', strAlphaNumeric),
CASE WHEN PATINDEX('%[ ]%', strAlphaNumeric) = 0
THEN LEN(stralphanumeric)
ELSE
PATINDEX('%[ ]%', strAlphaNumeric) - PATINDEX('%[0-9]%', strAlphaNumeric)
END
)
AS FLOAT) AS CastToFloat
FROM @tab
From the sample data above it generates:
Original CastToFloat
------------------------------ ----------------------
80.50 80,5
$80.50 80,5
$80.50 per sqm 80,5
If you want something more robust you might want to consider writing a CLR function to do regex parsing instead, as described in this MSDN article: Regular Expressions Make Pattern Matching And Data Extraction Easier
Inspired by @deterministicFail, I thought of a way to extract only the numeric part (although it's not 100% yet):
DECLARE @NUMBERS TABLE (
Val VARCHAR(20)
)
INSERT INTO @NUMBERS VALUES
('$70.23 per m2'),
('$81.23'),
('181.93 per m2'),
('1211.21'),
(' There are 4 tokens'),
(' No numbers '),
(''),
(' ')
select
CASE
WHEN ISNUMERIC(RTRIM(LEFT(RIGHT(RTRIM(LTRIM(n.Val)), 1+LEN(RTRIM(LTRIM(n.Val)))-PatIndex('%[0-9.-]%', RTRIM(LTRIM(n.Val)))), LEN(RIGHT(RTRIM(LTRIM(n.Val)), 1+LEN(RTRIM(LTRIM(n.Val)))-PatIndex('%[0-9.-]%', RTRIM(LTRIM(n.Val)))))- PATINDEX('%[^0-9.-]%',RIGHT(RTRIM(LTRIM(n.Val)), 1+LEN(RTRIM(LTRIM(n.Val)))-PatIndex('%[0-9.-]%', RTRIM(LTRIM(n.Val))))))))=1 THEN
RTRIM(LEFT(RIGHT(RTRIM(LTRIM(n.Val)), 1+LEN(RTRIM(LTRIM(n.Val)))-PatIndex('%[0-9.-]%', RTRIM(LTRIM(n.Val)))), LEN(RIGHT(RTRIM(LTRIM(n.Val)), 1+LEN(RTRIM(LTRIM(n.Val)))-PatIndex('%[0-9.-]%', RTRIM(LTRIM(n.Val)))))- PATINDEX('%[^0-9.-]%',RIGHT(RTRIM(LTRIM(n.Val)), 1+LEN(RTRIM(LTRIM(n.Val)))-PatIndex('%[0-9.-]%', RTRIM(LTRIM(n.Val)))))))
ELSE '0.0'
END
FROM @NUMBERS n

Selecting financial values from db stored as text

I have some financial values stored as text in a MySQL db. The significance of "financial" is that negative numbers are stored enclosed in parentheses. Is there a way to automatically get the numeric value associated with that text? (For example, '5' should be returned as 5 and '(5)' should be returned as -5.)
You probably know that enclosing negative values in parentheses is a presentational issue and should not even be in the database to begin with. There are numeric data types for financial values that perfectly cover the negative range and are easy to select/manipulate/aggregate.
Now you are stuck with something horrible along the lines of this:
SELECT
CASE WHEN LEFT(val, 1) = '('
THEN -1 * CAST( REPLACE(REPLACE(val, '(', ''), ')', '') AS DECIMAL(10,4))
ELSE CAST( val AS DECIMAL(10,4) )
END AS num_val
FROM
val_table
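Once the text is converted this way, the expression can be fed straight into an aggregate. A small MySQL sketch, using an inline derived table of made-up values in place of val_table (the alias samples and the values themselves are assumptions for illustration):

```sql
-- Hypothetical sample values: positive, parenthesised negative, and a decimal.
SELECT SUM(
         CASE WHEN val LIKE '(%)'
              -- Strip both parentheses, then negate the result.
              THEN -1 * CAST(REPLACE(REPLACE(val, '(', ''), ')', '') AS DECIMAL(10,4))
              ELSE CAST(val AS DECIMAL(10,4))
         END
       ) AS total
FROM (
    SELECT '5' AS val
    UNION ALL SELECT '(5)'
    UNION ALL SELECT '12.50'
) AS samples;
```

If the conversion is needed in more than one query, wrapping the CASE expression in a view over the table keeps it in one place.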