I don't quite understand why I get @res = 1 when PRINT of the same expression returns 0.57. I need to return the numeric result from my UDF.
DECLARE @text1 VARCHAR(255) = 'some text'
DECLARE @text2 VARCHAR(255) = 'same another text'
DECLARE @res AS NUMERIC
DECLARE @i INT = 0
DECLARE @exist_counter INT = 0
WHILE @i < (LEN(@text1) - 2)
BEGIN
SET @i = @i + 1
IF CHARINDEX(SUBSTRING(@text1, @i, 3), @text2) > 0
BEGIN
SET @exist_counter = @exist_counter + 1
--print @exist_counter
--print SUBSTRING(@text1,@i,3)
END
END
PRINT @i
PRINT @exist_counter
PRINT cast(@exist_counter AS NUMERIC) / cast(nullif(@i, 0) AS NUMERIC)
SET @res = cast(@exist_counter AS NUMERIC) / cast(nullif(@i, 0) AS NUMERIC)
PRINT @res
From this post
Numeric data types that have fixed precision and scale
So you should set the scale explicitly, like this:
declare @res as numeric(18, 2)
Why?
As @HABO's comment says, "The default precision is 18." and "The default scale is 0", where scale is "The number of decimal digits that are stored to the right of the decimal point." With a scale of 0, the value 0.57 is rounded to 1 when it is assigned to @res.
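To see the effect of the scale outside SQL Server, here is a minimal Python sketch of the same calculation. It assumes that storing a value in a NUMERIC with scale 0 behaves like rounding to zero decimal places, which is modelled here with `Decimal.quantize`. The name `trigram_hits` is made up for the illustration, and Python's `in` is case-sensitive whereas CHARINDEX follows the collation.

```python
from decimal import Decimal, ROUND_HALF_UP


def trigram_hits(text1, text2):
    """Count how many 3-character windows of text1 occur somewhere in text2."""
    total = max(len(text1) - 2, 0)
    hits = sum(1 for i in range(total) if text1[i:i + 3] in text2)
    return hits, total


hits, total = trigram_hits("some text", "same another text")
ratio = Decimal(hits) / Decimal(total)

# Assumption: NUMERIC(18, 0) is modelled as rounding to 0 decimal places.
as_scale_0 = ratio.quantize(Decimal("1"), rounding=ROUND_HALF_UP)
# Assumption: NUMERIC(18, 2) is modelled as rounding to 2 decimal places.
as_scale_2 = ratio.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

print(hits, total)   # 4 7
print(as_scale_0)    # 1
print(as_scale_2)    # 0.57
```

The ratio itself never changes; only the number of decimal digits the target type can store does.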
I need to generate a consecutive sequence of varchar(5) codes (always exactly 5 characters), each one derived from the PREVIOUS code.
For example:
'00000', '00001', '00002'...'00009', '0000A', '0000B'...'0000Z', '00010','00011'...'ZZZZZ'.
So if I have @PREVIOUS_CODE = '00000', @NEXT_CODE will be '00001'.
If I have @PREVIOUS_CODE = '00009', @NEXT_CODE will be '0000A'.
If I have @PREVIOUS_CODE = '0000Z', @NEXT_CODE will be '00010'.
So I need something like this:
USE [DATABASE]
GO
CREATE PROCEDURE [dbo].[spGetNextCode]
@PREVIOUS_CODE VARCHAR(5)
AS
DECLARE @NEXT_CODE VARCHAR(5)
DO STUFF
...
SELECT @NEXT_CODE AS NEXT_CODE
GO
Any Help?
Just keep an integer counter in the same table and convert it. I'm using the following SQL Server function in one of my applications:
CREATE FUNCTION [dbo].[GetAlphanumericCode]
(
@number BIGINT,
@leadingzeroes INT = 0
)
RETURNS varchar(255)
AS
BEGIN
DECLARE @charPool varchar(36)
SET @charPool = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'
DECLARE @result varchar(255)
IF @number < 0
RETURN ''
IF @number = 0
SET @result = '0'
ELSE BEGIN
SET @result = ''
WHILE (@number > 0)
BEGIN
SET @result = substring(@charPool, @number % 36 + 1, 1) + @result
SET @number = @number / 36
END
END
IF @leadingzeroes > 0 AND len(@result) < @leadingzeroes
SET @result = right(replicate('0', @leadingzeroes) + @result, @leadingzeroes)
RETURN @result
END
It should be a trivial task to rewrite it as a stored procedure.
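If you would rather derive the next code directly from the previous one, the idea can be sketched in Python: treat the code as a base-36 number, add one, and convert back. This is only an illustration of the arithmetic, not the T-SQL procedure itself; `to_base36` and `next_code` are hypothetical names, and raising an error after 'ZZZZZ' is an assumption about what should happen when the sequence runs out.

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def to_base36(number, width):
    """Encode a non-negative integer in base 36, left-padded with zeroes."""
    if number < 0:
        raise ValueError("number must be non-negative")
    digits = ""
    while number > 0:
        number, remainder = divmod(number, 36)
        digits = ALPHABET[remainder] + digits
    return digits.rjust(width, "0")


def next_code(previous_code, width=5):
    """Return the code that follows previous_code in the base-36 sequence."""
    value = int(previous_code, 36) + 1  # int() can parse base 36 directly
    code = to_base36(value, width)
    if len(code) > width:
        # Assumption: running past 'ZZZZZ' is treated as an error.
        raise ValueError("no code left after " + previous_code)
    return code


print(next_code("00000"))  # 00001
print(next_code("00009"))  # 0000A
print(next_code("0000Z"))  # 00010
```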
So, I'm looking at implementing Fuzzy logic matching in my company and having trouble getting good results. For starters, I'm trying to match up Company names with those on a list supplied by other companies.
My first attempt was to use Soundex, but it looks like Soundex only compares the first few sounds in the company name, so longer company names were too easily confused with one another.
I'm now working on my second attempt, using Levenshtein distance. It looks promising, especially if I remove the punctuation first. However, I'm still having trouble finding duplicates without too many false positives.
One of the issues I have is companies such as widgetsco vs widgets inc. If I compare only a substring the length of the shorter name, I also pick up things like BBC University and CBC University campus. I suspect that a score combining distance and longest common substring may be the solution.
Has anyone managed to build an algorithm that does such a matching with limited false positives?
We have had good results on name and address matching using a Metaphone function created by Lawrence Philips. It works in a similar way to Soundex, but creates a sound/consonant pattern for the whole value. You may find this useful in conjunction with some other techniques, especially if you can strip some of the fluff like 'co.' and 'inc.' as mentioned in other comments:
create function [dbo].[Metaphone](#str as nvarchar(70), #KeepNumeric as bit = 0)
returns nvarchar(25)
/*
Metaphone Algorithm
Created by Lawrence Philips.
Metaphone presented in article in "Computer Language" December 1990 issue.
*********** BEGIN METAPHONE RULES ***********
Lawrence Philips' RULES follow:
The 16 consonant sounds:
|--- ZERO represents "th"
|
B X S K J T F H L M N P R 0 W Y
Drop vowels
Exceptions:
Beginning of word: "ae-", "gn", "kn-", "pn-", "wr-" ----> drop first letter
Beginning of word: "wh-" ----> change to "w"
Beginning of word: "x" ----> change to "s"
Beginning of word: vowel or "H" + vowel ----> Keep it
Transformations:
B ----> B unless at the end of word after "m", as in "dumb", "McComb"
C ----> X (sh) if "-cia-" or "-ch-"
S if "-ci-", "-ce-", or "-cy-"
SILENT if "-sci-", "-sce-", or "-scy-"
K otherwise
K "-sch-"
D ----> J if in "-dge-", "-dgy-", or "-dgi-"
T otherwise
F ----> F
G ----> SILENT if "-gh-" and not at end or before a vowel
"-gn" or "-gned"
"-dge-" etc., as in above rule
J if "gi", "ge", "gy" if not double "gg"
K otherwise
H ----> SILENT if after vowel and no vowel follows
or "-ch-", "-sh-", "-ph-", "-th-", "-gh-"
H otherwise
J ----> J
K ----> SILENT if after "c"
K otherwise
L ----> L
M ----> M
N ----> N
P ----> F if before "h"
P otherwise
Q ----> K
R ----> R
S ----> X (sh) if "sh" or "-sio-" or "-sia-"
S otherwise
T ----> X (sh) if "-tia-" or "-tio-"
0 (th) if "th"
SILENT if "-tch-"
T otherwise
V ----> F
W ----> SILENT if not followed by a vowel
W if followed by a vowel
X ----> KS
Y ----> SILENT if not followed by a vowel
Y if followed by a vowel
Z ----> S
*/
as
begin
declare #Result varchar(25)
,#str3 char(3)
,#str2 char(2)
,#str1 char(1)
,#strp char(1)
,#strLen tinyint
,#cnt tinyint
set #strLen = len(#str)
set #cnt = 0
set #Result = ''
-- Preserve first 5 numeric values when required
if #KeepNumeric = 1
begin
set #Result = case when isnumeric(substring(#str,1,1)) = 1
then case when isnumeric(substring(#str,2,1)) = 1
then case when isnumeric(substring(#str,3,1)) = 1
then case when isnumeric(substring(#str,4,1)) = 1
then case when isnumeric(substring(#str,5,1)) = 1
then left(#str,5)
else left(#str,4)
end
else left(#str,3)
end
else left(#str,2)
end
else left(#str,1)
end
else ''
end
set #str = right(#str,len(#str)-len(#Result))
end
--Process beginning exceptions
set #str2 = left(#str,2)
if #str2 = 'wh'
begin
set #str = 'w' + right(#str , #strLen - 2)
set #strLen = #strLen - 1
end
else
if #str2 in('ae', 'gn', 'kn', 'pn', 'wr')
begin
set #str = right(#str , #strLen - 1)
set #strLen = #strLen - 1
end
set #str1 = left(#str,1)
if #str1 = 'x'
set #str = 's' + right(#str , #strLen - 1)
else
if #str1 in ('a','e','i','o','u')
begin
set #str = right(#str, #strLen - 1)
set #strLen = #strLen - 1
set #Result = #Result + #str1
end
while #cnt <= #strLen
begin
set #cnt = #cnt + 1
set #str1 = substring(#str,#cnt,1)
set #strp = case when #cnt <> 0
then substring(#str,(#cnt-1),1)
else ' '
end
-- Check if the current character is the same as the previous character.
-- If we are keeping numbers, only compare non-numeric characters.
if case when #KeepNumeric = 1 and #strp = #str1 and isnumeric(#str1) = 0 then 1
when #KeepNumeric = 0 and #strp = #str1 then 1
else 0
end = 1
continue -- Skip this loop
set #str2 = substring(#str,#cnt,2)
set #Result = case when #KeepNumeric = 1 and isnumeric(#str1) = 1
then #Result + #str1
when #str1 in('f','j','l','m','n','r')
then #Result + #str1
when #str1 = 'q'
then #Result + 'k'
when #str1 = 'v'
then #Result + 'f'
when #str1 = 'x'
then #Result + 'ks'
when #str1 = 'z'
then #Result + 's'
when #str1 = 'b'
then case when #cnt = #strLen
then case when substring(#str,(#cnt - 1),1) <> 'm'
then #Result + 'b'
else #Result
end
else #Result + 'b'
end
when #str1 = 'c'
then case when #str2 = 'ch' or substring(#str,#cnt,3) = 'cia'
then #Result + 'x'
else case when #str2 in('ci','ce','cy') and #strp <> 's'
then #Result + 's'
else #Result + 'k'
end
end
when #str1 = 'd'
then case when substring(#str,#cnt,3) in ('dge','dgy','dgi')
then #Result + 'j'
else #Result + 't'
end
when #str1 = 'g'
then case when substring(#str,(#cnt - 1),3) not in ('dge','dgy','dgi','dha','dhe','dhi','dho','dhu')
then case when #str2 in('gi', 'ge','gy')
then #Result + 'j'
else case when #str2 <> 'gn' or (#str2 <> 'gh' and #cnt+1 <> #strLen)
then #Result + 'k'
else #Result
end
end
else #Result
end
when #str1 = 'h'
then case when #strp not in ('a','e','i','o','u') and #str2 not in ('ha','he','hi','ho','hu')
then case when #strp not in ('c','s','p','t','g')
then #Result + 'h'
else #Result
end
else #Result
end
when #str1 = 'k'
then case when #strp <> 'c'
then #Result + 'k'
else #Result
end
when #str1 = 'p'
then case when #str2 = 'ph'
then #Result + 'f'
else #Result + 'p'
end
when #str1 = 's'
then case when substring(#str,#cnt,3) in ('sia','sio') or #str2 = 'sh'
then #Result + 'x'
else #Result + 's'
end
when #str1 = 't'
then case when substring(#str,#cnt,3) in ('tia','tio')
then #Result + 'x'
else case when #str2 = 'th'
then #Result + '0'
else case when substring(#str,#cnt,3) <> 'tch'
then #Result + 't'
else #Result
end
end
end
when #str1 = 'w'
then case when #str2 not in('wa','we','wi','wo','wu')
then #Result + 'w'
else #Result
end
when #str1 = 'y'
then case when #str2 not in('ya','ye','yi','yo','yu')
then #Result + 'y'
else #Result
end
else #Result
end
end
return #Result
end
You want to use something like Levenshtein Distance or another string comparison algorithm. You may want to take a look at this project on Codeplex.
http://fuzzystring.codeplex.com/
Are you using Access? If so, consider the '*' character, without the quotes. If you're using SQL Server, use the '%' character. However, this really isn't fuzzy logic, it's really the Like operator. If you really need fuzzy logic, export your data-set to Excel and load the AddIn from the URL below.
https://www.microsoft.com/en-us/download/details.aspx?id=15011
Read the instructions very carefully. It definitely works, and it works great, but you need to follow the instructions, and it's not completely intuitive. The first time I tried it, I didn't follow the instructions, and I wasted a lot of time trying to get it to work. Eventually I figured it out, and it worked great!!
I found success implementing a function I found here on Stack Overflow that would find the percentage of strings that match. You can then adjust tolerance till you get an appropriate amount of matches/mismatches. The function implementation will be listed below, but the gist is including something like this in your query.
DECLARE @tolerance DEC(18, 2) = 50;
WHERE dbo.GetPercentageOfTwoStringMatching(first_table.name, second_table.name) > @tolerance
Credit for the following percent matching function goes to Dragos Durlut, Dec 15 '11.
The credit for the LEVENSHTEIN function was included in the code by Dragos Durlut.
T-SQL Get percentage of character match of 2 strings
CREATE FUNCTION [dbo].[GetPercentageOfTwoStringMatching]
(
@string1 NVARCHAR(100)
,@string2 NVARCHAR(100)
)
RETURNS INT
AS
BEGIN
DECLARE @levenShteinNumber INT
DECLARE @string1Length INT = LEN(@string1)
, @string2Length INT = LEN(@string2)
DECLARE @maxLengthNumber INT = CASE WHEN @string1Length > @string2Length THEN @string1Length ELSE @string2Length END
SELECT @levenShteinNumber = [dbo].[LEVENSHTEIN] ( @string1 ,@string2)
DECLARE @percentageOfBadCharacters INT = @levenShteinNumber * 100 / @maxLengthNumber
DECLARE @percentageOfGoodCharacters INT = 100 - @percentageOfBadCharacters
-- Return the result of the function
RETURN @percentageOfGoodCharacters
END
-- =============================================
-- Create date: 2011.12.14
-- Description: http://blog.sendreallybigfiles.com/2009/06/improved-t-sql-levenshtein-distance.html
-- =============================================
CREATE FUNCTION [dbo].[LEVENSHTEIN](@left VARCHAR(100),
@right VARCHAR(100))
returns INT
AS
BEGIN
DECLARE @difference INT,
@lenRight INT,
@lenLeft INT,
@leftIndex INT,
@rightIndex INT,
@left_char CHAR(1),
@right_char CHAR(1),
@compareLength INT
SET @lenLeft = LEN(@left)
SET @lenRight = LEN(@right)
SET @difference = 0
IF @lenLeft = 0
BEGIN
SET @difference = @lenRight
GOTO done
END
IF @lenRight = 0
BEGIN
SET @difference = @lenLeft
GOTO done
END
GOTO comparison
COMPARISON:
IF ( @lenLeft >= @lenRight )
SET @compareLength = @lenLeft
ELSE
SET @compareLength = @lenRight
SET @rightIndex = 1
SET @leftIndex = 1
WHILE @leftIndex <= @compareLength
BEGIN
SET @left_char = substring(@left, @leftIndex, 1)
SET @right_char = substring(@right, @rightIndex, 1)
IF @left_char <> @right_char
BEGIN -- Would an insertion make them re-align?
IF( @left_char = substring(@right, @rightIndex + 1, 1) )
SET @rightIndex = @rightIndex + 1
-- Would an deletion make them re-align?
ELSE IF( substring(@left, @leftIndex + 1, 1) = @right_char )
SET @leftIndex = @leftIndex + 1
SET @difference = @difference + 1
END
SET @leftIndex = @leftIndex + 1
SET @rightIndex = @rightIndex + 1
END
GOTO done
DONE:
RETURN @difference
END
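For comparison, here is a small Python sketch of the textbook dynamic-programming Levenshtein distance together with the same percentage formula. The T-SQL function above appears to be a single-pass approximation with one character of look-ahead, so the two may disagree on some inputs. The names `levenshtein` and `percent_match` are illustrative, and returning 100 for two empty strings is an assumption (the T-SQL version would divide by zero there).

```python
def levenshtein(left, right):
    """Classic edit distance: insertions, deletions and substitutions cost 1."""
    previous = list(range(len(right) + 1))
    for i, left_char in enumerate(left, start=1):
        current = [i]
        for j, right_char in enumerate(right, start=1):
            cost = 0 if left_char == right_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def percent_match(string1, string2):
    """100 minus the share of 'bad' characters, using integer division."""
    longest = max(len(string1), len(string2))
    if longest == 0:
        return 100  # assumption: two empty strings count as a full match
    return 100 - levenshtein(string1, string2) * 100 // longest


print(levenshtein("kitten", "sitting"))    # 3
print(percent_match("kitten", "sitting"))  # 58
```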
Note: If you need to compare two or more fields (which I don't think you do) you can add another call to the function in the WHERE clause with a minimum tolerance. I also found success averaging the percentMatching and comparing it against a tolerance.
DECLARE @tolerance DEC(18, 2) = 25;
--could have multiple different tolerances for each field (weighting some fields as more important to be matching)
DECLARE @avg_tolerance DEC(18, 2) = 50;
WHERE dbo.GetPercentageOfTwoStringMatching(first_table.name, second_table.name) > @tolerance
AND dbo.GetPercentageOfTwoStringMatching(first_table.address, second_table.address) > @tolerance
AND (dbo.GetPercentageOfTwoStringMatching(first_table.name, second_table.name)
+ dbo.GetPercentageOfTwoStringMatching(first_table.address, second_table.address)
) / 2 > @avg_tolerance
The benefit of this solution is that the tolerance variables can be set per field (weighting the importance of certain fields matching), while the average ensures a reasonable overall match across all fields.
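The combined check can be sketched in Python as a plain function over already computed percentage scores. `is_match` and its default thresholds (25 per field, 50 on average) are illustrative values taken from the snippet above, not part of any library; integer division is used for the average to mirror how T-SQL divides two INT values.

```python
def is_match(scores, field_min=25, average_min=50):
    """scores maps a field name to its percentage match (0 to 100)."""
    if not scores:
        return False
    # Every individual field must clear the per-field tolerance.
    every_field_ok = all(score > field_min for score in scores.values())
    # Integer division mirrors the INT / INT average in the T-SQL snippet.
    average_ok = sum(scores.values()) // len(scores) > average_min
    return every_field_ok and average_ok


print(is_match({"name": 80, "address": 30}))  # True
print(is_match({"name": 90, "address": 20}))  # False
print(is_match({"name": 40, "address": 40}))  # False
```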
Firstly, I suggest you make sure that you can't match on any other attribute and that company names are all you have (because fuzzy matching is bound to give you some false positives). If you want to go ahead with fuzzy matching, you could use the following steps:
Remove all stop words from the text, for example Co, Inc, etc.
If your database is very large, make use of an indexing method such as blocking or sorted neighbourhood indexing.
Finally compute the fuzzy score using the Levenshtein distance. You could use the token_set_ratio or partial_ratio functions in Fuzzywuzzy.
Also, I found the following video which aims to solve the same problem: https://www.youtube.com/watch?v=NRAqIjXaZvw
The Nanonets blog also contains several resources on the subject that could potentially be helpful.
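As a rough illustration of the stop-word step followed by a fuzzy score, here is a short Python sketch. It uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure, which is not the same metric as Levenshtein distance or the Fuzzywuzzy ratios. The stop-word list and the function names `normalize` and `similarity` are assumptions made for the example.

```python
import re
from difflib import SequenceMatcher

# Assumption: a tiny, illustrative list of company-name "fluff" words.
STOP_WORDS = {"co", "inc", "ltd", "llc", "corp"}


def normalize(name):
    """Lower-case, replace punctuation with spaces, and drop stop words."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(word for word in words if word not in STOP_WORDS)


def similarity(name_a, name_b):
    """Similarity in [0, 1] between the normalized names."""
    return SequenceMatcher(None, normalize(name_a), normalize(name_b)).ratio()


print(normalize("Widgets, Inc."))                  # widgets
print(similarity("Widgets Inc.", "WIDGETS CO"))    # 1.0
print(similarity("BBC University", "CBC University campus") < 1.0)  # True
```

Note that a suffix glued to the name, as in widgetsco, is not removed by this word-based filter and would need separate handling.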
I want to find a credit card number in a SQL string.
For example:
DECLARE @value1 NVARCHAR(MAX) = 'The payment is the place 1234567812345678'
DECLARE @value2 NVARCHAR(MAX) = 'The payment is the place 123456aa7812345678'
DECLARE @value3 NVARCHAR(MAX) = 'The payment1234567812345678is the place'
The result should be:
@value1Result 1234567812345678
@value2Result NULL
@value3Result 1234567812345678
The 16 digits must be contiguous, with no spaces between them.
How can I do this in a SQL script or a function?
Edit:
What if I want to find both credit card numbers in a value like this?
@value4 = 'card 1 is : 4034349183539301 and the other one is 3456123485697865'
How should I implement the script?
You can use PATINDEX:
PATINDEX('%[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%', yourStr)
If the result is 0, the string doesn't contain 16 consecutive digits; otherwise it does, and the result is the position where they start.
It can be used within a WHERE clause or a SELECT statement, depending on your needs.
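For reference, the same "16 digits in a row" test can be sketched with a regular expression in Python. Unlike the PATINDEX pattern, this version also rejects runs longer than 16 digits and returns every match, which covers the two-card case from the edit. `find_card_numbers` is a made-up helper name, and no card-number validation (such as a checksum) is attempted.

```python
import re

# Exactly 16 ASCII digits, not preceded or followed by another digit.
CARD_PATTERN = re.compile(r"(?<![0-9])[0-9]{16}(?![0-9])")


def find_card_numbers(text):
    """Return every run of exactly 16 consecutive digits found in text."""
    return CARD_PATTERN.findall(text)


print(find_card_numbers("The payment is the place 1234567812345678"))
# ['1234567812345678']
print(find_card_numbers("The payment is the place 123456aa7812345678"))
# []
print(find_card_numbers(
    "card 1 is : 4034349183539301 and the other one is 3456123485697865"))
# ['4034349183539301', '3456123485697865']
```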
You can write as:
SELECT case when Len(LEFT(subsrt, PATINDEX('%[^0-9]%', subsrt + 't') - 1)) = 16
then LEFT(subsrt, PATINDEX('%[^0-9]%', subsrt + 't') - 1)
else ''
end
FROM (
SELECT subsrt = SUBSTRING(string, pos, LEN(string))
FROM (
SELECT string, pos = PATINDEX('%[0-9]%', string)
FROM table1
) d
) t
DECLARE @value1 NVARCHAR(MAX) = 'card 1 is : 4034349183539301 and the other one is 3456123485697865'
DECLARE @Length INT
,@Count INT
,@Candidate NCHAR(1)
,@cNum INT
,@result VARCHAR(16)
SELECT @Count = 1
SELECT @cNum = 0
SELECT @result = ''
SELECT @Length = LEN(@value1)
-- Run one position past the end so a digit run that ends the string is also checked
WHILE @Count <= @Length + 1
BEGIN
SELECT @Candidate = SUBSTRING(@value1, @Count, 1)
IF @Candidate LIKE '[0-9]'
BEGIN
SET @cNum = @cNum + 1
-- Only the first 16 digits fit in @result; @cNum still counts the full run
IF @cNum <= 16
SET @result = @result + @Candidate
END
ELSE
BEGIN
-- The digit run has ended: report it only if it was exactly 16 digits long
IF @cNum = 16
BEGIN
SELECT @result 'Credit Number'
END
SET @cNum = 0
SET @result = ''
END
SET @Count = @Count + 1
END
There you go kind sir.
DECLARE
@value3 NVARCHAR(MAX) = 'The payment1234567812345678is the place',
@MaxCount int,
@Count int,
@Numbers NVARCHAR(100)
SELECT @Count = 1
SELECT @Numbers = ''
SELECT @MaxCount = LEN(@value3)
WHILE @Count <= @MaxCount
BEGIN
IF (UNICODE(SUBSTRING(@value3,@Count,1)) >= 48 AND UNICODE(SUBSTRING(@value3,@Count,1)) <=57)
SELECT @Numbers = @Numbers + SUBSTRING(@value3,@Count,1)
SELECT @Count = @Count + 1
END
PRINT @Numbers
You can make this as a function if you are planning to use it a lot.
I need to cast an INT value as HEX (not to hex).
For example, given the value 1234, I want to transform it to x'1234'.
My first inclination was to use the hex function, but that does not produce the desired results:
hex(1234) = x'04D2'
I need a function or algorithm such that
my_function(1234) = x'1234'
EDIT: Thanks to Lennart's answer, I learned that this would be the equivalent of HEXTORAW or VARCHAR_BIT_FORMAT, which exist on DB2 for LUW but not on DB2 for z/OS.
Not sure I understand your question, is this in the ballpark?
with t (s) as (values ('1234'),(x'F0F0F0F0F0F0F0F0F0F0F0F0F0'))
select s
, case when translate(s, '', '0123456789') = ''
then hextoraw(s)
else s
end
from t;
S 2
------------- --------------------------
1234 x'1234'
ððððððððððððð ððððððððððððð
It is possible to do this transformation using the EBCDIC_CHR function. This solution assumes your system character encoding is EBCDIC.
See this thread for discussion, and UDF: http://www.idug.org/p/fo/st/thread=43924
This user defined function is from that thread. It will receive a varchar input and for each pair of values convert them to raw using EBCDIC_CHR, concatenating it all together.
CREATE FUNCTION UDFUNC.HEX2RAW(INSTR VARCHAR(1024))
RETURNS VARCHAR(2048)
DETERMINISTIC
NO EXTERNAL ACTION
CONTAINS SQL
BEGIN
DECLARE invalid_hexval CONDITION FOR SQLSTATE '22007';
DECLARE VALIDSTR VARCHAR(1024) default '';
DECLARE OUTSTR VARCHAR(2048) DEFAULT '';
DECLARE HEXCHR CHAR(16) DEFAULT '0123456789ABCDEF';
DECLARE MODVAL INT DEFAULT 0;
DECLARE LENSTR INT;
DECLARE I INT DEFAULT 0;
DECLARE J INT DEFAULT 0;
IF INSTR IS NULL THEN
RETURN NULL;
END IF;
set VALIDSTR = TRANSLATE(INSTR,'',HEXCHR);
IF (VALIDSTR <> '') THEN
SIGNAL invalid_hexval SET MESSAGE_TEXT = 'Not Hex: [' || CAST(INSTR AS VARCHAR(59))||']';
END IF;
SET MODVAL = MOD(LENGTH(INSTR),2);
IF MODVAL <> 0 THEN
SET INSTR = CONCAT('0',INSTR);
END IF;
SET LENSTR = LENGTH(INSTR);
WHILE I < LENSTR DO
SET J = 16 * (POSSTR(HEXCHR,SUBSTR(INSTR,I+1,1))-1);
SET I = I + 1;
SET J = J + (POSSTR(HEXCHR,SUBSTR(INSTR,I+1,1))-1);
SET I = I + 1;
SET OUTSTR = CONCAT(OUTSTR,EBCDIC_CHR(J));
END WHILE;
RETURN OUTSTR;
END#
Function usage:
SELECT UDFUNC.HEX2RAW('1234')
, HEX(UDFUNC.HEX2RAW('1234'))
, UDFUNC.HEX2RAW('1234567')
, HEX(UDFUNC.HEX2RAW('1234567'))
FROM SYSIBM.SYSDUMMY1;
results in:
1 2 3 4
-- ---- ---- --------
1234 áÅ 01234567
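To make the intended transformation concrete, here is a tiny Python sketch of the same idea: read the decimal digits of the number as hex digits and pack each pair into one byte. It mirrors the UDF's behaviour of left-padding odd-length input with a zero, but it is only an illustration and has nothing to do with DB2 or EBCDIC; `hex_to_raw` is a hypothetical name.

```python
def hex_to_raw(value):
    """Interpret the digits of value as hex digits and return the raw bytes."""
    text = str(value)
    if len(text) % 2:
        text = "0" + text  # same padding rule as the UDF above
    # bytes.fromhex raises ValueError if text contains non-hex characters.
    return bytes.fromhex(text)


print(hex_to_raw(1234))                      # b'\x124'
print(hex_to_raw(1234).hex())                # 1234
print(hex_to_raw("1234567").hex().upper())   # 01234567
```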