Problem: Given a number, how can I find out which values it is made up of?
"Sunday = 1", "Monday = 2", "Tuesday = 4", "Wednesday = 8", "Thursday = 16", "Friday = 32", "Saturday = 64"
For example: the number 109 would signify Sunday, Tuesday, Wednesday, Friday, Saturday (1 + 4 + 8 + 32 + 64 = 109).
You can do something like this:
CREATE FUNCTION dbo.Int2BinaryToWeekDay (@i INT) RETURNS NVARCHAR(2048) AS BEGIN
RETURN
-- Each day is one bit; @i & <flag> is non-zero when that day's bit is set.
CASE WHEN @i & 64 > 0 THEN 'Saturday,' ELSE '' END +
CASE WHEN @i & 32 > 0 THEN 'Friday,' ELSE '' END +
CASE WHEN @i & 16 > 0 THEN 'Thursday,' ELSE '' END +
CASE WHEN @i & 8 > 0 THEN 'Wednesday,' ELSE '' END +
CASE WHEN @i & 4 > 0 THEN 'Tuesday,' ELSE '' END +
CASE WHEN @i & 2 > 0 THEN 'Monday,' ELSE '' END +
CASE WHEN @i & 1 > 0 THEN 'Sunday,' ELSE '' END
END;
GO
Then call it like this:
SELECT dbo.Int2BinaryToWeekDay(109)
This looks like a bit-flag design. You need to use the bitwise & operator to get the desired output.
Decimal = Binary
109 = 1101101
001 = 0000001
------&------
0000001 = 1
109 = 1101101
002 = 0000010
------&------
0000000 = 0
109 = 1101101
004 = 0000100
------&------
0000100 = 4
SQL Server has built-in bitwise operators. You can use bitwise & for this, like so:
DECLARE @InputNum INT = 109
SELECT ISNULL(STUFF(CASE WHEN @InputNum & 1 > 0 THEN ', SUN' ELSE '' END +
CASE WHEN @InputNum & 2 > 0 THEN ', MON' ELSE '' END +
CASE WHEN @InputNum & 4 > 0 THEN ', TUE' ELSE '' END +
CASE WHEN @InputNum & 8 > 0 THEN ', WED' ELSE '' END +
CASE WHEN @InputNum & 16 > 0 THEN ', THU' ELSE '' END +
CASE WHEN @InputNum & 32 > 0 THEN ', FRI' ELSE '' END +
CASE WHEN @InputNum & 64 > 0 THEN ', SAT' ELSE '' END,1,2,''),'')
Check the MS documentation for a more detailed explanation of bitwise operators.
So, I'm looking at implementing Fuzzy logic matching in my company and having trouble getting good results. For starters, I'm trying to match up Company names with those on a list supplied by other companies.
My first attempt was to use soundex, but it looks like soundex only compares the first few sounds in the company name, so longer company names were too easily confused for one another.
I'm now working on my second attempt using the Levenshtein distance comparison. It looks promising, especially if I remove the punctuation first. However, I'm still having trouble finding duplicates without too many false positives.
One of the issues I have is companies such as widgetsco vs widgets inc. If I compare only a substring the length of the shorter name, I also pick up things like BBC University and CBC University campus. I suspect that a score using a combination of distance and longest common substring may be the solution.
Has anyone managed to build an algorithm that does such a matching with limited false positives?
We have had good results on name and address matching using a Metaphone function created by Lawrence Philips. It works in a similar way to Soundex, but creates a sound/consonant pattern for the whole value. You may find this useful in conjunction with some other techniques, especially if you can strip some of the fluff like 'co.' and 'inc.' as mentioned in other comments:
create function [dbo].[Metaphone](@str as nvarchar(70), @KeepNumeric as bit = 0)
returns nvarchar(25)
/*
Metaphone Algorithm
Created by Lawrence Philips.
Metaphone presented in article in "Computer Language" December 1990 issue.
*********** BEGIN METAPHONE RULES ***********
Lawrence Philips' RULES follow:
The 16 consonant sounds:
|--- ZERO represents "th"
|
B X S K J T F H L M N P R 0 W Y
Drop vowels
Exceptions:
Beginning of word: "ae-", "gn", "kn-", "pn-", "wr-" ----> drop first letter
Beginning of word: "wh-" ----> change to "w"
Beginning of word: "x" ----> change to "s"
Beginning of word: vowel or "H" + vowel ----> Keep it
Transformations:
B ----> B unless at the end of word after "m", as in "dumb", "McComb"
C ----> X (sh) if "-cia-" or "-ch-"
S if "-ci-", "-ce-", or "-cy-"
SILENT if "-sci-", "-sce-", or "-scy-"
K otherwise
K "-sch-"
D ----> J if in "-dge-", "-dgy-", or "-dgi-"
T otherwise
F ----> F
G ----> SILENT if "-gh-" and not at end or before a vowel
"-gn" or "-gned"
"-dge-" etc., as in above rule
J if "gi", "ge", "gy" if not double "gg"
K otherwise
H ----> SILENT if after vowel and no vowel follows
or "-ch-", "-sh-", "-ph-", "-th-", "-gh-"
H otherwise
J ----> J
K ----> SILENT if after "c"
K otherwise
L ----> L
M ----> M
N ----> N
P ----> F if before "h"
P otherwise
Q ----> K
R ----> R
S ----> X (sh) if "sh" or "-sio-" or "-sia-"
S otherwise
T ----> X (sh) if "-tia-" or "-tio-"
0 (th) if "th"
SILENT if "-tch-"
T otherwise
V ----> F
W ----> SILENT if not followed by a vowel
W if followed by a vowel
X ----> KS
Y ----> SILENT if not followed by a vowel
Y if followed by a vowel
Z ----> S
*/
as
begin
declare @Result varchar(25)
,@str3 char(3)
,@str2 char(2)
,@str1 char(1)
,@strp char(1)
,@strLen tinyint
,@cnt tinyint
set @strLen = len(@str)
set @cnt = 0
set @Result = ''
-- Preserve first 5 numeric values when required
if @KeepNumeric = 1
begin
set @Result = case when isnumeric(substring(@str,1,1)) = 1
then case when isnumeric(substring(@str,2,1)) = 1
then case when isnumeric(substring(@str,3,1)) = 1
then case when isnumeric(substring(@str,4,1)) = 1
then case when isnumeric(substring(@str,5,1)) = 1
then left(@str,5)
else left(@str,4)
end
else left(@str,3)
end
else left(@str,2)
end
else left(@str,1)
end
else ''
end
set @str = right(@str,len(@str)-len(@Result))
end
--Process beginning exceptions
set @str2 = left(@str,2)
if @str2 = 'wh'
begin
set @str = 'w' + right(@str , @strLen - 2)
set @strLen = @strLen - 1
end
else
if @str2 in('ae', 'gn', 'kn', 'pn', 'wr')
begin
set @str = right(@str , @strLen - 1)
set @strLen = @strLen - 1
end
set @str1 = left(@str,1)
if @str1 = 'x'
set @str = 's' + right(@str , @strLen - 1)
else
if @str1 in ('a','e','i','o','u')
begin
set @str = right(@str, @strLen - 1)
set @strLen = @strLen - 1
set @Result = @Result + @str1
end
while @cnt <= @strLen
begin
set @cnt = @cnt + 1
set @str1 = substring(@str,@cnt,1)
set @strp = case when @cnt <> 0
then substring(@str,(@cnt-1),1)
else ' '
end
-- Check if the current character is the same as the previous character.
-- If we are keeping numbers, only compare non-numeric characters.
if case when @KeepNumeric = 1 and @strp = @str1 and isnumeric(@str1) = 0 then 1
when @KeepNumeric = 0 and @strp = @str1 then 1
else 0
end = 1
continue -- Skip this loop
set @str2 = substring(@str,@cnt,2)
set @Result = case when @KeepNumeric = 1 and isnumeric(@str1) = 1
then @Result + @str1
when @str1 in('f','j','l','m','n','r')
then @Result + @str1
when @str1 = 'q'
then @Result + 'k'
when @str1 = 'v'
then @Result + 'f'
when @str1 = 'x'
then @Result + 'ks'
when @str1 = 'z'
then @Result + 's'
when @str1 = 'b'
then case when @cnt = @strLen
then case when substring(@str,(@cnt - 1),1) <> 'm'
then @Result + 'b'
else @Result
end
else @Result + 'b'
end
when @str1 = 'c'
then case when @str2 = 'ch' or substring(@str,@cnt,3) = 'cia'
then @Result + 'x'
else case when @str2 in('ci','ce','cy') and @strp <> 's'
then @Result + 's'
else @Result + 'k'
end
end
when @str1 = 'd'
then case when substring(@str,@cnt,3) in ('dge','dgy','dgi')
then @Result + 'j'
else @Result + 't'
end
when @str1 = 'g'
then case when substring(@str,(@cnt - 1),3) not in ('dge','dgy','dgi','dha','dhe','dhi','dho','dhu')
then case when @str2 in('gi', 'ge','gy')
then @Result + 'j'
else case when @str2 <> 'gn' or (@str2 <> 'gh' and @cnt+1 <> @strLen)
then @Result + 'k'
else @Result
end
end
else @Result
end
when @str1 = 'h'
then case when @strp not in ('a','e','i','o','u') and @str2 not in ('ha','he','hi','ho','hu')
then case when @strp not in ('c','s','p','t','g')
then @Result + 'h'
else @Result
end
else @Result
end
when @str1 = 'k'
then case when @strp <> 'c'
then @Result + 'k'
else @Result
end
when @str1 = 'p'
then case when @str2 = 'ph'
then @Result + 'f'
else @Result + 'p'
end
when @str1 = 's'
then case when substring(@str,@cnt,3) in ('sia','sio') or @str2 = 'sh'
then @Result + 'x'
else @Result + 's'
end
when @str1 = 't'
then case when substring(@str,@cnt,3) in ('tia','tio')
then @Result + 'x'
else case when @str2 = 'th'
then @Result + '0'
else case when substring(@str,@cnt,3) <> 'tch'
then @Result + 't'
else @Result
end
end
end
when @str1 = 'w'
then case when @str2 not in('wa','we','wi','wo','wu')
then @Result + 'w'
else @Result
end
when @str1 = 'y'
then case when @str2 not in('ya','ye','yi','yo','yu')
then @Result + 'y'
else @Result
end
else @Result
end
end
return @Result
end
You want to use something like Levenshtein Distance or another string comparison algorithm. You may want to take a look at this project on Codeplex.
http://fuzzystring.codeplex.com/
Are you using Access? If so, consider the '*' character, without the quotes. If you're using SQL Server, use the '%' character. However, this really isn't fuzzy logic, it's really the Like operator. If you really need fuzzy logic, export your data-set to Excel and load the AddIn from the URL below.
https://www.microsoft.com/en-us/download/details.aspx?id=15011
Read the instructions very carefully. It definitely works, and it works great, but you need to follow the instructions, and it's not completely intuitive. The first time I tried it, I didn't follow the instructions, and I wasted a lot of time trying to get it to work. Eventually I figured it out, and it worked great!!
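To illustrate the point about LIKE made above, here is a minimal sketch. The table variable @names and its company names are made up for the example. The % wildcard only matches a fixed pattern, so it finds names that share a prefix but not misspellings, which is why it isn't really fuzzy matching.

```sql
-- Hypothetical sample data, for illustration only.
DECLARE @names TABLE (name NVARCHAR(100));
INSERT INTO @names (name)
VALUES (N'Widgets Co'), (N'Widgets Inc'), (N'Widgits Inc');

-- % matches any run of characters, so this returns the first two rows
-- but not the misspelled 'Widgits Inc'.
SELECT name
FROM @names
WHERE name LIKE N'Widgets%';
```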
I found success implementing a function I found here on Stack Overflow that would find the percentage of strings that match. You can then adjust tolerance till you get an appropriate amount of matches/mismatches. The function implementation will be listed below, but the gist is including something like this in your query.
DECLARE @tolerance DEC(18, 2) = 50;
WHERE dbo.GetPercentageOfTwoStringMatching(first_table.name, second_table.name) > @tolerance
Credit for the following percent matching function goes to Dragos Durlut, Dec 15 '11.
The credit for the LEVENSHTEIN function was included in the code by Dragos Durlut.
T-SQL Get percentage of character match of 2 strings
CREATE FUNCTION [dbo].[GetPercentageOfTwoStringMatching]
(
@string1 NVARCHAR(100)
,@string2 NVARCHAR(100)
)
RETURNS INT
AS
BEGIN
DECLARE @levenShteinNumber INT
DECLARE @string1Length INT = LEN(@string1)
, @string2Length INT = LEN(@string2)
DECLARE @maxLengthNumber INT = CASE WHEN @string1Length > @string2Length THEN @string1Length ELSE @string2Length END
SELECT @levenShteinNumber = [dbo].[LEVENSHTEIN] ( @string1 ,@string2)
DECLARE @percentageOfBadCharacters INT = @levenShteinNumber * 100 / @maxLengthNumber
DECLARE @percentageOfGoodCharacters INT = 100 - @percentageOfBadCharacters
-- Return the result of the function
RETURN @percentageOfGoodCharacters
END
-- =============================================
-- Create date: 2011.12.14
-- Description: http://blog.sendreallybigfiles.com/2009/06/improved-t-sql-levenshtein-distance.html
-- =============================================
CREATE FUNCTION [dbo].[LEVENSHTEIN](@left VARCHAR(100),
@right VARCHAR(100))
returns INT
AS
BEGIN
DECLARE @difference INT,
@lenRight INT,
@lenLeft INT,
@leftIndex INT,
@rightIndex INT,
@left_char CHAR(1),
@right_char CHAR(1),
@compareLength INT
SET @lenLeft = LEN(@left)
SET @lenRight = LEN(@right)
SET @difference = 0
IF @lenLeft = 0
BEGIN
SET @difference = @lenRight
GOTO done
END
IF @lenRight = 0
BEGIN
SET @difference = @lenLeft
GOTO done
END
GOTO comparison
COMPARISON:
IF ( @lenLeft >= @lenRight )
SET @compareLength = @lenLeft
ELSE
SET @compareLength = @lenRight
SET @rightIndex = 1
SET @leftIndex = 1
WHILE @leftIndex <= @compareLength
BEGIN
SET @left_char = substring(@left, @leftIndex, 1)
SET @right_char = substring(@right, @rightIndex, 1)
IF @left_char <> @right_char
BEGIN -- Would an insertion make them re-align?
IF( @left_char = substring(@right, @rightIndex + 1, 1) )
SET @rightIndex = @rightIndex + 1
-- Would a deletion make them re-align?
ELSE IF( substring(@left, @leftIndex + 1, 1) = @right_char )
SET @leftIndex = @leftIndex + 1
SET @difference = @difference + 1
END
SET @leftIndex = @leftIndex + 1
SET @rightIndex = @rightIndex + 1
END
GOTO done
DONE:
RETURN @difference
END
Note: If you need to compare two or more fields (which I don't think you do) you can add another call to the function in the WHERE clause with a minimum tolerance. I also found success averaging the percentMatching and comparing it against a tolerance.
DECLARE @tolerance DEC(18, 2) = 25;
--could have multiple different tolerances for each field (weighting some fields as more important to be matching)
DECLARE @avg_tolerance DEC(18, 2) = 50;
WHERE dbo.GetPercentageOfTwoStringMatching(first_table.name, second_table.name) > @tolerance
AND dbo.GetPercentageOfTwoStringMatching(first_table.address, second_table.address) > @tolerance
AND (dbo.GetPercentageOfTwoStringMatching(first_table.name, second_table.name)
+ dbo.GetPercentageOfTwoStringMatching(first_table.address, second_table.address)
) / 2 > @avg_tolerance
The benefit of this solution is that the tolerance variables can be specific per field (weighting the importance of certain fields matching) and the average can ensure general matching across all fields.
Firstly, I suggest you make sure that you can't match on any other attribute and that company names are all you have (because fuzzy matching is bound to give you some false positives). If you want to go ahead with fuzzy matching, you could use the following steps:
Remove all stop words from the text, for example Co, Inc, etc.
If your database is very large, make use of an indexing method such as blocking or sorted neighbourhood indexing.
Finally compute the fuzzy score using the Levenshtein distance. You could use the token_set_ratio or partial_ratio functions in Fuzzywuzzy.
Also, I found the following video which aims to solve the same problem: https://www.youtube.com/watch?v=NRAqIjXaZvw
The Nanonets blog also contains several resources on the subject that could potentially be helpful.
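As a rough illustration of the first step above (stripping stop words before scoring), here is a minimal T-SQL sketch. The sample name and the two-entry stop-word list ('co', 'inc') are assumptions made for the example; a real list would be longer and is probably better kept in a table.

```sql
-- Hypothetical input; the stop-word list ('co', 'inc') is an assumption.
DECLARE @name NVARCHAR(100) = N'Widgets Co., Inc.';

-- Lower-case the name and drop the punctuation.
DECLARE @clean NVARCHAR(102) = LOWER(REPLACE(REPLACE(@name, N'.', N''), N',', N''));

-- Pad with spaces so that only whole words are removed
-- (e.g. the 'co' inside 'costco' is left alone).
SET @clean = N' ' + @clean + N' ';
SET @clean = REPLACE(@clean, N' inc ', N' ');
SET @clean = REPLACE(@clean, N' co ', N' ');

SELECT LTRIM(RTRIM(@clean)) AS normalized_name;  -- returns 'widgets'
```

The normalized value can then be passed to whichever distance function you settle on, so that suffixes like these no longer inflate the distance between otherwise identical names.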
I have written the following query:
SUBSTRING((SELECT DB_NAME()), 1, 1)
I wish to convert the char which this query returns into a binary string like "11001101".
What is the correct way to do it?
Thanks
You could use ASCII() to convert the character to a decimal integer and then use the script given on this answer to convert that to a "binary" string
You will possibly end up with something like this:
DECLARE @i INT = ASCII(SUBSTRING((DB_NAME()),1,1))
-- Test each bit from the most significant (128) down to the least significant (1).
SELECT
CASE WHEN @i & 128 > 0 THEN '1' ELSE '0' END +
CASE WHEN @i & 64 > 0 THEN '1' ELSE '0' END +
CASE WHEN @i & 32 > 0 THEN '1' ELSE '0' END +
CASE WHEN @i & 16 > 0 THEN '1' ELSE '0' END +
CASE WHEN @i & 8 > 0 THEN '1' ELSE '0' END +
CASE WHEN @i & 4 > 0 THEN '1' ELSE '0' END +
CASE WHEN @i & 2 > 0 THEN '1' ELSE '0' END +
CASE WHEN @i & 1 > 0 THEN '1' ELSE '0' END
I need to find how many bits are set (true) in my binary value.
example:
input: 0001101 output:3
input: 1111001 output:5
While both answers work, both have issues. A loop is not optimal and destroys the value. Neither solution can be used in a select statement.
A possibly better solution is to mask each bit and add up the results, as follows:
select @counter = 0
+ case when @BinaryVariable2 & 1 = 1 then 1 else 0 end
+ case when @BinaryVariable2 & 2 = 2 then 1 else 0 end
+ case when @BinaryVariable2 & 4 = 4 then 1 else 0 end
+ case when @BinaryVariable2 & 8 = 8 then 1 else 0 end
+ case when @BinaryVariable2 & 16 = 16 then 1 else 0 end
+ case when @BinaryVariable2 & 32 = 32 then 1 else 0 end
+ case when @BinaryVariable2 & 64 = 64 then 1 else 0 end
+ case when @BinaryVariable2 & 128 = 128 then 1 else 0 end
+ case when @BinaryVariable2 & 256 = 256 then 1 else 0 end
+ case when @BinaryVariable2 & 512 = 512 then 1 else 0 end
This can be used in a select and update statement. It is also an order of magnitude faster. (on my server about 50 times)
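As a sketch of what using this in a select over a column might look like, the example below applies the same masking to the two inputs from the question (0001101 = 13 and 1111001 = 121). The table variable @flags and its column names are hypothetical, and only the lowest 7 bits are checked.

```sql
-- Hypothetical table; 13 = 0001101 and 121 = 1111001 in binary.
DECLARE @flags TABLE (id INT, flags INT);
INSERT INTO @flags (id, flags) VALUES (1, 13), (2, 121);

-- One CASE per bit; each contributes 1 when that bit is set.
SELECT id,
       flags,
       0
       + CASE WHEN flags & 1  = 1  THEN 1 ELSE 0 END
       + CASE WHEN flags & 2  = 2  THEN 1 ELSE 0 END
       + CASE WHEN flags & 4  = 4  THEN 1 ELSE 0 END
       + CASE WHEN flags & 8  = 8  THEN 1 ELSE 0 END
       + CASE WHEN flags & 16 = 16 THEN 1 ELSE 0 END
       + CASE WHEN flags & 32 = 32 THEN 1 ELSE 0 END
       + CASE WHEN flags & 64 = 64 THEN 1 ELSE 0 END AS bits_set  -- 3 for id 1, 5 for id 2
FROM @flags;
```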
To help, you might want to use the following generator code:
declare @x int = 1, @c int = 0
print ' @counter = 0 ' /*CHANGE field/parameter name */
while @c < 10 /* change to how many bits you want to see */
begin
print ' + case when @BinaryVariable2 & ' + cast(@x as varchar) + ' = ' + cast(@x as varchar) + ' then 1 else 0 end ' /* CHANGE the variable/field name */
select @x *=2, @c +=1
end
As a further note: if you use a bigint or go beyond 32 bits, it is necessary to cast as follows:
print ' + case when @Missing & cast(' + cast(@x as varchar) + ' as bigint) = ' + cast(@x as varchar) + ' then 1 else 0 end '
Enjoy
DECLARE @BinaryVariable2 VARBINARY(10);
SET @BinaryVariable2 = 60; -- binary value is 111100
DECLARE @counter int = 0
WHILE @BinaryVariable2 > 0
SELECT @counter +=@BinaryVariable2 % 2, @BinaryVariable2 /= 2
SELECT @counter
Result:
4
I've left various debug selects in.
begin
declare @bin as varbinary(20);
declare @bitsSet as int;
set @bitsSet = 0;
-- Cast to bigint first so the bytes hold the integer's bits rather than the numeric storage format.
set @bin = convert(varbinary(20), cast(876876876876 as bigint));
declare @i as int;
set @i = 1; -- SUBSTRING is 1-based, so start at the first byte
select DATALENGTH(@bin), 'Len';
while @i <= DATALENGTH(@bin)
begin
declare @bit as varbinary(1);
set @bit = SUBSTRING(@bin, @i, 1);
select @bit, 'Bit';
declare @power as int
set @power = 0;
while @power < 8
begin
declare @powerOf2 as int;
set @powerOf2 = POWER(2, @power);
if @powerOf2 <> 0
set @bitsSet = @bitsSet + (@bit & @powerOf2) / @powerOf2; -- edited to add the divisor
select @power, @powerOf2;
set @power = @power + 1;
end;
select @bitsSet;
set @i = @i + 1;
end;
select @bitsSet, 'End'
end;
Cheers -
You can handle an arbitrary length binary value by using a recursive CTE to split the data into a table of 1-byte values and counting all of the bits that are true in each byte of that table...
DECLARE @data Varbinary(MAX) = Convert(Varbinary(MAX), N'We can count bits of very large varbinary values without a loop or number table if you like...');
WITH each ( byte, pos ) AS (
SELECT Substring(@data, Len(@data), 1), Len(@data)-1 WHERE Len(@data) > 0
UNION ALL
SELECT Substring(@data, pos, 1), pos-1 FROM each WHERE pos > 0
)
SELECT Count(*) AS [True Bits]
FROM each
CROSS JOIN (VALUES (1),(2),(4),(8), (16),(32),(64),(128)) [bit](flag)
WHERE each.byte & [bit].flag = [bit].flag
OPTION (MAXRECURSION 0);
From SQL Server 2022 you can just use SELECT BIT_COUNT(input)
expression_value can be any integer or binary expression that isn't a large object (LOB).
For integer expressions the result can depend on the datatype. e.g. -1 as smallint has a binary representation of 1111111111111111 (two's complement) and will have more bits set for int datatype.
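A short sketch of the datatype point, assuming SQL Server 2022 or later, where BIT_COUNT is available. The column aliases are only for illustration.

```sql
-- Requires SQL Server 2022 (16.x) or later.
SELECT BIT_COUNT(13)                   AS bits_in_13,       -- 13 = 1101, so 3 bits are set
       BIT_COUNT(CAST(-1 AS SMALLINT)) AS minus1_smallint,  -- all 16 bits set
       BIT_COUNT(CAST(-1 AS INT))      AS minus1_int;       -- all 32 bits set
```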
I need to write this query in SQL Server:
IF isFloat(@value) = 1
BEGIN
PRINT 'this is float number'
END
ELSE
BEGIN
PRINT 'this is integer number'
END
Please help me out with this, thanks.
declare @value float = 1
IF FLOOR(@value) <> CEILING(@value)
BEGIN
PRINT 'this is float number'
END
ELSE
BEGIN
PRINT 'this is integer number'
END
Martin, under certain circumstances your solution gives an incorrect result if you encounter a value of 1234.0, for example. Your code determines that 1234.0 is an integer, which is incorrect.
This is a more accurate snippet:
if cast(cast(123456.0 as integer) as varchar(255)) <> cast(123456.0 as varchar(255))
begin
print 'non integer'
end
else
begin
print 'integer'
end
Regards,
Nico
DECLARE @value FLOAT = 1.50
IF CONVERT(int, @value) - @value <> 0
BEGIN
PRINT 'this is float number'
END
ELSE
BEGIN
PRINT 'this is integer number'
END
See whether the code below helps. Of the values below, only 9, 2147483647 and 1234567 qualify as integers. You could wrap this logic in a function and reuse it.
CREATE TABLE MY_TABLE(MY_FIELD VARCHAR(50))
INSERT INTO MY_TABLE
VALUES('9.123'),('1234567'),('9'),('2147483647'),('2147483647.01'),('2147483648'), ('2147483648ABCD'),('214,7483,648')
SELECT *
FROM MY_TABLE
WHERE CHARINDEX('.',MY_FIELD) = 0 AND CHARINDEX(',',MY_FIELD) = 0
AND ISNUMERIC(MY_FIELD) = 1 AND CONVERT(FLOAT,MY_FIELD) / 2147483647 <= 1
DROP TABLE MY_TABLE
OR
DECLARE @num VARCHAR(100)
SET @num = '2147483648AS'
IF ISNUMERIC(@num) = 1 AND @num NOT LIKE '%.%' AND @num NOT LIKE '%,%'
BEGIN
IF CONVERT(FLOAT,@num) / 2147483647 <= 1
PRINT 'INTEGER'
ELSE
PRINT 'NOT-INTEGER'
END
ELSE
PRINT 'NOT-INTEGER'