Optimizing Levenshtein distance algorithm

I have a stored procedure that uses Levenshtein distance to determine the result closest to what the user typed. The only thing really affecting the speed is the function that calculates the Levenshtein distance for all the records before selecting the record with the lowest distance (I've verified this by putting a 0 in place of the call to the Levenshtein function). The table has 1.5 million records, so even the slightest adjustment may shave off a few seconds. Right now the entire thing runs over 10 minutes. Here's the method I'm using:
ALTER function dbo.Levenshtein
(
    @Source nvarchar(200),
    @Target nvarchar(200)
)
RETURNS int
AS
BEGIN
    DECLARE @Source_len int, @Target_len int, @i int, @j int, @Source_char nchar, @Dist int, @Dist_temp int, @Distv0 varbinary(8000), @Distv1 varbinary(8000)
    SELECT @Source_len = LEN(@Source), @Target_len = LEN(@Target), @Distv1 = 0x0000, @j = 1, @i = 1, @Dist = 0
    WHILE @j <= @Target_len
    BEGIN
        SELECT @Distv1 = @Distv1 + CAST(@j AS binary(2)), @j = @j + 1
    END
    WHILE @i <= @Source_len
    BEGIN
        SELECT @Source_char = SUBSTRING(@Source, @i, 1), @Dist = @i, @Distv0 = CAST(@i AS binary(2)), @j = 1
        WHILE @j <= @Target_len
        BEGIN
            SET @Dist = @Dist + 1
            SET @Dist_temp = CAST(SUBSTRING(@Distv1, @j+@j-1, 2) AS int) +
                CASE WHEN @Source_char = SUBSTRING(@Target, @j, 1) THEN 0 ELSE 1 END
            IF @Dist > @Dist_temp
            BEGIN
                SET @Dist = @Dist_temp
            END
            SET @Dist_temp = CAST(SUBSTRING(@Distv1, @j+@j+1, 2) AS int)+1
            IF @Dist > @Dist_temp SET @Dist = @Dist_temp
            BEGIN
                SELECT @Distv0 = @Distv0 + CAST(@Dist AS binary(2)), @j = @j + 1
            END
        END
        SELECT @Distv1 = @Distv0, @i = @i + 1
    END
    RETURN @Dist
END
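For readers who find the varbinary bookkeeping hard to follow, the function above appears to be the usual two-row dynamic-programming form of Levenshtein distance. A minimal Python sketch of that recurrence follows; the name `levenshtein` and the list-based rows are illustrative assumptions, not a translation of the T-SQL line by line.

```python
def levenshtein(source, target):
    """Two-row dynamic-programming edit distance (insert, delete, substitute)."""
    # prev[j] holds the distance between the first i-1 chars of source
    # and the first j chars of target.
    prev = list(range(len(target) + 1))
    for i, s_ch in enumerate(source, start=1):
        curr = [i]  # distance from i source chars to an empty target
        for j, t_ch in enumerate(target, start=1):
            cost = 0 if s_ch == t_ch else 1
            curr.append(min(curr[j - 1] + 1,      # insertion
                            prev[j] + 1,          # deletion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]
```

Each call costs time proportional to the product of the two string lengths, so running it against 1.5 million rows repeats that work for every row.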
Where should I go from here?

The way I've done this in the past is to store the "database" (actually a dictionary of words for a spelling corrector) as a trie.
Then I used a branch-and-bound routine to look up the nearest matching entries. For small distances, the time it takes is exponential in the distance. For large distances, it is linear in the size of the dictionary, just as you are seeing now.
Branch-and-bound is basically a depth-first tree walk of the trie, but with an error budget. At each node, you keep track of the current Levenshtein distance, and if it exceeds the budget, you prune that branch of the tree.
First you do the walk with a budget of zero. That will only find exact matches. If you don't find a match, you walk it again with a budget of one, which finds matches at a distance of 1. If you don't find any, you do it with a budget of 2, and so on. This sounds inefficient, but since each walk takes so much more time than the previous one, the total time is dominated by the last walk that you make.
Added: outline of code (pardon my C):
// dumb version of trie node, indexed by letter. You can improve.
typedef struct tnodeTag {
    struct tnodeTag* p[128]; // one child per ASCII letter, NULL if absent
    int is_word;             // assumed flag: nonzero if a word ends at this node
} tnode;
tnode* top; // the top of the trie
void walk(tnode* p, char* s, int budget){
    int i;
    char t;
    // no such path in the trie, or the error budget is used up: prune
    if (p == NULL || budget < 0) return;
    if (*s == 0 && p->is_word){
        // print the current trie path
    }
    if (*s != 0){
        // try deleting this letter
        walk(p, s+1, budget-1);
        // try swapping two adjacent letters
        if (s[1]){
            t = s[0]; s[0] = s[1]; s[1] = t;
            walk(p, s, budget-1);
            t = s[0]; s[0] = s[1]; s[1] = t;
        }
    }
    for (i = 0; i < 128; i++){
        if (*s != 0){
            // exact match costs nothing; replacing this character costs 1
            walk(p->p[i], s+1, (i == *s) ? budget : budget-1);
        }
        // try inserting this letter
        walk(p->p[i], s, budget-1);
    }
}
Basically, you simulate deleting a letter by skipping it and searching at the same node. You simulate inserting a letter by descending the trie without advancing s. You simulate replacing a letter by acting as if the letter matched, even though it doesn't. When you get the hang of it, you can add other possible mismatches, like replacing 0 with O and 1 with L or I - dumb stuff like that.
You probably want to add a character array argument to represent the current word you are finding in the trie.
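The search described above can be sketched in Python roughly as follows. This is a hedged illustration, not the answerer's code: the nested-dict trie, the `END` marker, and the names `build_trie`, `walk` and `nearest` are all assumptions, and the adjacent-swap move is left out so that the budget corresponds to plain Levenshtein distance.

```python
END = ""  # assumed end-of-word marker; never collides with a real character

def build_trie(words):
    """Build a nested-dict trie; the END key marks nodes where a word ends."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def walk(node, s, budget, path, found):
    """Depth-first walk that prunes any branch exceeding the error budget."""
    if budget < 0:
        return
    if not s and END in node:
        found.add(path)  # path is the trie word reached so far
    if s:
        # delete the current query letter: skip it, stay at the same node
        walk(node, s[1:], budget - 1, path, found)
    for ch, child in node.items():
        if ch == END:
            continue
        if s:
            # exact match is free, replacing the letter costs 1
            walk(child, s[1:], budget - (ch != s[0]), path + ch, found)
        # insert a letter: descend the trie without advancing the query
        walk(child, s, budget - 1, path + ch, found)

def nearest(root, query, max_budget=3):
    """Retry with budgets 0, 1, 2, ... and stop at the first one that matches."""
    for budget in range(max_budget + 1):
        found = set()
        walk(root, query, budget, "", found)
        if found:
            return budget, sorted(found)
    return None, []
```

For example, with the words hello, help, held and world, `nearest(root, "helo")` stops at a budget of 1 and returns the three words that are one edit away.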

Related

Replace the multiple values between 2 characters in azure sql

In Azure SQL, I'm attempting to delete any text that appears between the < and > characters in a column of my table.
Sample text:
The best part is that. < br >Note:< br >< u> reading
:< /u> < span style="font-family: calibri,sans-serif; font-size: 11pt;"> moral stories from an early age
< b>not only helps your child.< /b>< br>< u>in
learning important: < /u>< /span>< span style="font-family: calibri;
">life lessons but it also helps, in language development.< /span>< ./span>
Output:
The best part is that. reading: moral stories from an early age not only helps your child in learning important: life lessons but it also helps in language development.
I tried the query below, but it works only for short comment text:
SELECT [Comments],REPLACE([Comments], SUBSTRING([Comments], CHARINDEX('<', [Comments]), CHARINDEX('>', [Comments]) - CHARINDEX('<', [Comments]) + 1),'') AS result
FROM table

I created an input table named check_1 and inserted the sample data into it.
The following query removes only the first occurrence of the pattern:
SELECT [Comments],REPLACE([Comments], SUBSTRING([Comments], CHARINDEX('<', [Comments]), CHARINDEX('>', [Comments]) - CHARINDEX('<', [Comments]) + 1),'') AS result
FROM check_1
To remove every substring that begins with '<' and ends with '>', create a user-defined function with a WHILE loop:
CREATE FUNCTION [dbo].[udf_removetags] (@input_text VARCHAR(MAX)) RETURNS VARCHAR(MAX)
AS
BEGIN
    DECLARE @pos_1 INT
    DECLARE @pos_n INT
    DECLARE @Length INT
    SET @pos_1 = CHARINDEX('<',@input_text)
    SET @pos_n = CHARINDEX('>',@input_text,CHARINDEX('<',@input_text))
    SET @Length = (@pos_n - @pos_1) + 1
    WHILE @pos_1 > 0 AND @pos_n > 0 AND @Length > 0
    BEGIN
        SET @input_text = replace(@input_text,substring(@input_text,@pos_1,@Length),'')
        SET @pos_1 = CHARINDEX('<',@input_text)
        SET @pos_n = CHARINDEX('>',@input_text,CHARINDEX('<',@input_text))
        SET @Length = (@pos_n - @pos_1) + 1
    END
    RETURN @input_text
END
select [dbo].[udf_removetags](comments) as result from check_1
Output String:
The best part is that. Note: reading : moral stories from an early age not only helps your child.in learning important: life lessons but it also helps, in language development.
You can also use STUFF (see the Microsoft documentation on STUFF) in place of the REPLACE + SUBSTRING combination.
In the user-defined function, replace the line
SET @input_text = replace(@input_text,substring(@input_text,@pos_1,@Length),'')
with
SET @input_text = STUFF(@input_text,@pos_1,@Length,'')
The result will be the same.

According to https://learn.microsoft.com/../azure/../regexp_replace Azure supports REGEXP_REPLACE.
This means it should be possible to replace all '<...>' by '' via
select regexp_replace(comments, '<[^>]*>', '') from mytable;
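To see that the loop-based function and the regular expression remove the same spans, here is a small Python sketch of both approaches. The function names are made up for illustration, and Python's `re` module stands in for the database's regex engine, which may differ in details.

```python
import re

def strip_tags_regex(text):
    """Remove every '<...>' span with the same pattern as the SQL example."""
    return re.sub(r"<[^>]*>", "", text)

def strip_tags_loop(text):
    """Remove '<...>' spans one at a time, like the WHILE loop in the UDF."""
    start = text.find("<")
    while start != -1:
        end = text.find(">", start)
        if end == -1:
            break  # an unmatched '<' is left untouched
        text = text[:start] + text[end + 1:]
        start = text.find("<")
    return text
```

Both functions leave the surrounding whitespace in place, so the cleaned text can contain doubled spaces where a tag used to be.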

SQL count number of time series events, with some start or stop entries missing

I have some start/stop events and I need to count the number of total events but sometimes a start or stop is missing, for example:
Time Event
10:50 START
10:52 STOP
10:59 START
11:01 STOP
11:45 STOP
Count(Event) Where Event='START'
Would return 2, I also need to count the missing START value, so the result should be 3. Any ideas on how this could be done? Thanks!

Two constraints must be met to enable event counting:
1. Two START-STOP periods cannot overlap.
2. A consecutive, chronologically ordered START and STOP pair cannot originate from two different events, namely START + (missing STOP) and (missing START) + STOP.
If these conditions are met, a simple state machine can be implemented to detect the "missing" events. Such row-by-row logic can (almost always) be implemented using the cursor syntax.
N.B. To exemplify the generality of the cursor method, you can also see other answers I made: A (update columns) and B (a tedious algo). The code structures are highly similar.
Test Dataset
use [testdb];
if OBJECT_ID('testdb..test') is not null
drop table testdb..test;
create table test (
[time] varchar(50),
[event] varchar(50),
);
insert into test ([time], [event])
values ('10:50', 'START'),('10:52', 'STOP'),('10:59', 'START'),
('11:01', 'STOP'),('11:45', 'STOP'),('11:50', 'STOP'),('11:55', 'START');
select * from test;
Code
/* cursor variables */
-- storage for each row
declare @time varchar(50),
        @event varchar(50),
        @state int = 0, -- state variable
        @count int = 0; -- event count
-- open a cursor ordered by [time]
declare cur CURSOR local
for select [time], [event]
    from test
    order by [time]
open cur;
/* main loop */
while 1=1 BEGIN
    /* fetch next row and check termination condition */
    fetch next from cur
    into @time, @event;
    -- termination condition
    if @@FETCH_STATUS <> 0 begin
        -- check unfinished START before exit
        if @state = 1
            set @count += 1;
        -- exit loop
        break;
    end
    /* program body */
    -- case 1. state = 0 (clear state)
    if @state = 0 begin
        -- 1-1. normal case -> go to state 1
        if @event = 'START'
            set @state = 1;
        -- 1-2. a STOP without START -> keep state 0 and count++
        else if @event = 'STOP'
            set @count += 1;
        -- guard
        else
            print '[Error] Bad event name: ' + @event
    end
    -- case 2. state = 1 (start is found)
    else if @state = 1 begin
        -- 2-1. normal case -> go to state 0 and count++
        if @event = 'STOP' begin
            set @count += 1;
            set @state = 0;
        end
        -- 2-2. a START without STOP -> keep state 1 and count++
        else if @event = 'START'
            set @count += 1;
        -- guard
        else
            print '[Error] Bad event name: ' + @event
    end
END
-- cleanup
close cur;
deallocate cur;
Result
print @count; -- correct answer: 5
Tested on SQL Server 2017 (linux docker image, latest version).

Well, you could count each start and then each "stop" where the preceding event is not a start (this includes a stop on the very first row, where there is no preceding event):
select count(*)
from (select t.*,
             lag(event) over (order by time) as prev_event
      from t
     ) t
where event = 'start' or
      (event = 'stop' and (prev_event is null or prev_event <> 'start'));
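Both counting rules from this thread can be checked against each other outside the database. The sketch below assumes the events are already sorted by time and are spelled exactly START or STOP; the function names are illustrative only.

```python
def count_events_lag(events):
    """Count every START, plus every STOP not directly preceded by a START."""
    count = 0
    prev = None
    for event in events:
        if event == "START" or (event == "STOP" and prev != "START"):
            count += 1
        prev = event
    return count

def count_events_state_machine(events):
    """Same count via the two-state machine used by the cursor answer."""
    count = 0
    started = False
    for event in events:
        if event == "START":
            if started:
                count += 1  # the previous START never received its STOP
            started = True
        elif event == "STOP":
            count += 1      # closes a START, or stands in for a missing one
            started = False
        else:
            raise ValueError("Bad event name: " + event)
    if started:
        count += 1          # unfinished START at the end of the data
    return count
```

On the five rows from the question both functions give 3, and on the seven-row test dataset both give 5.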

SQL Where with Binary(n) column

I have a stored procedure:
ALTER PROCEDURE [dbo].[spUpdateOrInsertNotification]
    @ContentJsonHash BINARY(32)
AS
    DECLARE @NotificationId INT;
    SET @NotificationId = (SELECT @NotificationId
                           FROM dbo.tblNotifications n
                           WHERE n.ContentJsonHash = @ContentJsonHash);
    IF @NotificationId IS NOT NULL
    BEGIN
        -- Increment Count
    END
    ELSE
    BEGIN
        -- Insert new row.
    END
It's supposed to check whether the hash already exists and, if it does, increment the count for that row; otherwise it inserts a new row. However, it never finds the hash and the corresponding NotificationId. NotificationId is always NULL.
If I run it twice, passing it the same data (a C# byte[32] array), it never finds the same NotificationId, and I end up with duplicate entries being inserted.
e.g.
NotificationId | ContentJsonHash
9 0xB966C33517993003D789EDF78DA20C4C491617F8F42F76F48E572ACF8EDFAC2A
10 0xB966C33517993003D789EDF78DA20C4C491617F8F42F76F48E572ACF8EDFAC2A
Can I not do comparisons on Binary(n) fields like this: WHERE n.ContentJsonHash = @ContentJsonHash?
The C# code:
using (var conn = new SqlConnection(Sql.ConnectionString))
{
    await conn.OpenAsync();
    using (var cmd = new SqlCommand(Sql.SqlUpdateOrInsertNotification, conn))
    {
        cmd.CommandType = CommandType.StoredProcedure;
        cmd.Parameters.AddWithValue("@Source", notificationMessage.Source);
        cmd.Parameters.AddWithValue("@Sender", notificationMessage.Sender);
        cmd.Parameters.AddWithValue("@NotificationType", notificationMessage.NotificationType);
        cmd.Parameters.AddWithValue("@ReceivedTimestamp", notificationMessage.Timestamp);
        cmd.Parameters.AddWithValue("@ContentJSon", notificationMessage.NotificationContent);
        cmd.Parameters.AddWithValue("@ContentJsonHash", notificationMessage.ContentHashBytes);
        await cmd.ExecuteNonQueryAsync();
    }
}
I've also tried calling the stored procedure from SQL like this:
exec dbo.spUpdateOrInsertNotification 'foo', 'bar', 0,
'2017-12-05 15:23:41.207', '{}',
0xB966C33517993003D789EDF78DA20C4C491617F8F42F76F48E572ACF8EDFAC2A
Calling this twice returns 2 rows :(
I can do this, which works, hard coding the binary field I want to check
select *
from dbo.tblNotifications
where ContentJsonhash = 0xB966C33517993003D789EDF78DA20C4C491617F8F42F76F48E572ACF8EDFAC2A

Binary comparisons can be tricky. If you are using a true binary column, I believe length also comes into play. So even if the bytes are the same, the comparison would be false if the lengths differ. An easy way around that is to convert both sides to strings (style 2 produces two hex characters per byte, so 32 bytes need varchar(64)):
alter procedure [dbo].[spUpdateOrInsertNotification]
    @ContentJsonHash BINARY(32)
AS
    DECLARE @NotificationId INT;
    SET @NotificationId = (SELECT NotificationId
                           FROM dbo.tblNotifications n
                           WHERE convert(varchar(64), n.ContentJsonHash, 2) = convert(varchar(64), @ContentJsonHash, 2));
    IF @NotificationId IS NOT NULL
    BEGIN
        -- Increment Count
    END
    ELSE
    BEGIN
        -- Insert new row.
    END

I had an @ where I shouldn't have had one, so the query selected the variable itself instead of the column.
SET @NotificationId = (SELECT @NotificationId
                       FROM dbo.tblNotifications n
                       WHERE convert(varchar(64), n.ContentJsonHash, 2) = convert(varchar(64), @ContentJsonHash, 2));
Should be
SET @NotificationId = (SELECT NotificationId
                       FROM dbo.tblNotifications n
                       WHERE convert(varchar(64), n.ContentJsonHash, 2) = convert(varchar(64), @ContentJsonHash, 2));
I feel so stupid for not noticing this sooner :(

SQL server update fields from a concatenated string in a field

I'm fairly new to SQL code that writes to the database. I've been trying to work out this piece of code on my own, but I'm not having a lot of luck, especially since I really don't know how to test it without actually writing to the DB.
I have a database with 5 UDF fields. 'UDF1-UDF5'. The operators at my facility are supposed to scan a bar code into a specific bar code field which splits up the bar code into the five fields (they are all char(30) fields ). Unfortunately what is happening is that they are scanning directly into the UDF1 field, so the entire barcode string is all in one field. (I don't have control over this software) I am trying to write a script which will parse the DB, split these fields into separate variables and update the DB. I could use a little assistance because I think I need Dynamic SQL to do this and I don't know much about it. Here is a little more info about the system.
The barcode field looks like this:
%2S12345%1%1%0%10%
where the '%' characters begin and end the bar code and separate the fields. The first character of the first UDF field, '2', is a check digit and is always the same.
The first field is always either 5 or 6 characters (excluding the check digit); the rest are either 1 or 2 digits. I also need code that won't break if the bar code only has the first three fields. There is not a lot of consistency here: some of the bar codes are truncated.
Questions:
1. As far as I know, the only way to break apart concatenated text is SUBSTRING(), which is position based, so I would need an additional 5 variables to get the length of each field and a way to query that information. Is there an easier way?
2. At some point I have to conditionally set the variables, and I can't seem to get SET commands to work. I understand why something like this doesn't work, but I don't know any other way of doing it:
DECLARE @BASEID CHAR(30), @LOTID CHAR(30), @SPLITID CHAR(30), @SUBID CHAR(30), @SEQUENCENO CHAR(30), @BASELEN INT
SET @BASELEN =
    CASE WHEN(
        SELECT ISNUMERIC(SUBSTRING(R.UDF1,3,1))
        FROM VISION17SLITTER.DBO.ROLLINFO R
        WHERE R.UDF1 LIKE '[%]%'
    ) = 1
    THEN 5
    ELSE 6
    END
3. Once I can get the variable set, I assume that a simple conditional update statement would work, but if there is anything else I should know before trying this I would appreciate the advice.
Thanks again,
Dan

Consider the following:
Declare @YourTable table (ID int,BarCode varchar(100))
Insert Into @YourTable values
(1,'%2S12345%1%1%0%10%'),
(2,'%ABC1234%2%3%4%50%')
Select A.ID
      ,A.BarCode
      ,B.*
From @YourTable A
Cross Apply (
    Select Pos1 = xDim.value('/x[1]','varchar(max)')
          ,Pos2 = xDim.value('/x[2]','varchar(max)')
          ,Pos3 = xDim.value('/x[3]','varchar(max)')
          ,Pos4 = xDim.value('/x[4]','varchar(max)')
          ,Pos5 = xDim.value('/x[5]','varchar(max)')
          ,Pos6 = xDim.value('/x[6]','varchar(max)')
          ,Pos7 = xDim.value('/x[7]','varchar(max)')
          ,Pos8 = xDim.value('/x[8]','varchar(max)')
          ,Pos9 = xDim.value('/x[9]','varchar(max)')
    From (Select Cast('<x>' + Replace(A.BarCode,'%','</x><x>')+'</x>' as XML) as xDim) A
) B
Returns
Now, you may notice Pos1 and Pos7 are blank. This is due to the fact that your string begins and ends with the delimiter. If you want to skip them, tailor the CROSS APPLY as such:
Select Pos1 = xDim.value('/x[2]','varchar(max)')
,Pos2 = xDim.value('/x[3]','varchar(max)')
,Pos3 = xDim.value('/x[4]','varchar(max)')
,Pos4 = xDim.value('/x[5]','varchar(max)')
,Pos5 = xDim.value('/x[6]','varchar(max)')
From (Select Cast('<x>' + Replace(A.BarCode,'%','</x><x>')+'</x>' as XML) as xDim) A
Which Returns
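The reason the first and last positions come back blank is easiest to see with a plain string split. The Python sketch below mirrors the delimiter replacement in spirit only; `split_positions` is a hypothetical helper, not part of the SQL answer.

```python
def split_positions(barcode, delimiter="%"):
    """Split on the delimiter, keeping the empty pieces at both ends."""
    return barcode.split(delimiter)

parts = split_positions("%2S12345%1%1%0%10%")
# A leading and a trailing delimiter each produce one empty piece,
# which is why the first and last positions are blank.
fields = parts[1:-1]  # skip the two empty end pieces
```

Shifting the position indexes by one, as in the tailored CROSS APPLY, has the same effect as slicing off the first piece here.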

Create a function to split your string
CREATE FUNCTION [dbo].[fnSplitString]
(
    @string NVARCHAR(MAX),
    @delimiter CHAR(1)
)
RETURNS @output TABLE(splitdata NVARCHAR(MAX)
)
BEGIN
    DECLARE @start INT, @end INT
    SELECT @start = 1, @end = CHARINDEX(@delimiter, @string)
    WHILE @start < LEN(@string) + 1 BEGIN
        IF @end = 0
            SET @end = LEN(@string) + 1
        INSERT INTO @output (splitdata)
        VALUES(SUBSTRING(@string, @start, @end - @start))
        SET @start = @end + 1
        SET @end = CHARINDEX(@delimiter, @string, @start)
    END
    RETURN
END
Then remove the first two control characters (the leading '%' and the check digit) and the trailing '%', and invoke the function like this:
select * from dbo.fnSplitString('S12345%1%1%0%10','%')
The function will then return a table with the following values:
splitdata
=========
S12345
1
1
0
10
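The question also asks for code that survives truncated bar codes. One way to prototype that logic before touching the database is sketched below in Python. The field order (BASEID, LOTID, SPLITID, SUBID, SEQUENCENO) is taken from the variable names in the question and is an assumption about how the pieces map; `parse_barcode` is a hypothetical helper.

```python
FIELD_NAMES = ("BASEID", "LOTID", "SPLITID", "SUBID", "SEQUENCENO")

def parse_barcode(raw):
    """Split a scanned bar code into five named fields, padding with None."""
    # char(30) columns are space padded, so trim spaces before the delimiters
    core = raw.strip().strip("%")
    parts = core.split("%") if core else []
    if parts:
        parts[0] = parts[0][1:]  # drop the leading check digit
    # truncated codes have fewer than five pieces: pad, then cut to five
    parts = (parts + [None] * len(FIELD_NAMES))[:len(FIELD_NAMES)]
    return dict(zip(FIELD_NAMES, parts))
```

A code that stops after the third field simply yields None for SUBID and SEQUENCENO, which an UPDATE could map to NULL or to an empty string as the application prefers.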

SQL Server 2008 math fail

After hunting around on various forums for almost an hour, I've come to the conclusion that SQL server is slightly stupid about simple arithmetic.
I am attempting to utilize a function which, until recently seemed to work just fine. Upon changing out some of the values for a different set of information on the form in use, I get the odd behavior ahead.
The problem is that it is giving me the incorrect result as based on an excel spreadsheet formula.
The formula looks like this:
=IF(D8=0,0,(((D8*C12-C16)*(100-C13)/100+C16)/D8)+(C18*D8))
My SQL looks like this:
(((@DaysBilled * @ContractRate - @ActualPlanDed) * (100 - @InsCover) / 100 + @ActualPlanDed) / @DaysBilled) + (@CoPay * @DaysBilled)
Filling the variables with the given data looks like this:
(((11 * 433 - 15) * (100 - 344) / 100 + 15) / 11) + (15 * 11)
Even stranger, if I use the numbers above (adding .00 to the end of each value) manually in the server environment, it gives me -11405.1200000000
With the values I am giving, it should come out 166.36. Unfortunately, I am getting -886.83
Here is the entire function and how it is called:
ALTER FUNCTION Liability
(
    @ClientGUID CHAR(32),
    @RecordGUID CHAR(32),
    @Type CHAR(3)
)
RETURNS DECIMAL(18,2) AS
BEGIN
    DECLARE @ReturnValue decimal(18,2);
    DECLARE @DaysBilled int;
    DECLARE @ContractRate decimal(18,2);
    DECLARE @ActualPlanDed decimal(18,2);
    DECLARE @InsCover decimal(18,2);
    DECLARE @CoPay decimal(18,2);
    IF (@Type = 'RTC')
    BEGIN
        SELECT @DaysBilled = RTCDaysBilled,
               @ContractRate = CAST(REPLACE(REPLACE(ContractRateRTC, ' ',''),'$', '') AS DECIMAL(6,2)),
               @ActualPlanDed = RTCActualPlanDed,
               @InsCover = InsRTCCover,
               @CoPay = RTCCoPay
        FROM AccountReconciliation1
        WHERE @ClientGUID = tr_42b478f615484162b2391ef0b2c35ddc
          AND @RecordGUID = tr_abb4effa0d9c4fe98c78cb4d2e21ba5d
    END
    IF (@Type = 'PHP')
    BEGIN
        SELECT @DaysBilled = PHPDaysBilled,
               @ContractRate = CAST(REPLACE(REPLACE(ContractRatePHP, ' ',''),'$', '') AS DECIMAL(6,2)),
               @ActualPlanDed = PHPActualPlanDed,
               @InsCover = InsPHPCover,
               @CoPay = PHPCoPay
        FROM AccountReconciliation1
        WHERE @ClientGUID = tr_42b478f615484162b2391ef0b2c35ddc
          AND @RecordGUID = tr_abb4effa0d9c4fe98c78cb4d2e21ba5d
    END
    IF (@Type = 'IOP')
    BEGIN
        SELECT @DaysBilled = IOPDaysBilled,
               @ContractRate = CAST(REPLACE(REPLACE(ContractRateIOP, ' ',''),'$', '') AS DECIMAL(6,2)),
               @ActualPlanDed = IOPActualPlanDed,
               @InsCover = InsIOPCover,
               @CoPay = IOPCoPay
        FROM AccountReconciliation1
        WHERE @ClientGUID = tr_42b478f615484162b2391ef0b2c35ddc
          AND @RecordGUID = tr_abb4effa0d9c4fe98c78cb4d2e21ba5d
    END
    IF (@DaysBilled <> 0)
    BEGIN
        SET @ReturnValue = (((@DaysBilled * @ContractRate - @ActualPlanDed)
                             *
                             (100 - @InsCover) / 100 + @ActualPlanDed)
                            /
                            @DaysBilled
                           )
                           +
                           (@CoPay * @DaysBilled)
    END
    ELSE
    BEGIN
        SET @ReturnValue = 0;
    END
    RETURN @ReturnValue;
END
It is called by running a select statement from our front-end, but the result is the same as calling the function from within management studio:
SELECT dbo.Liability('ClientID','RecordID','PHP') AS Liability
I have been reading about how a unary minus tends to break SQL's math handling, but I'm not entirely sure how to counteract it.
One last stupid trick with this function: It must remain a function. I cannot convert it into a stored procedure because it must be used with our front-end, which cannot utilize stored procedures.
Does SQL server even care about the parentheses? Or is it just ignoring them?

The calculation is correct; the result differs, of course, depending on whether you use float values or integers.
For (((11 * 433 - 15) * (100 - 344) / 100 + 15) / 11) + (15 * 11)
a value around -886.xx, depending on where integers or floats are used, is correct.
What makes you believe it should be 166.36?
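The arithmetic can be replayed outside SQL Server to confirm this. The Python sketch below evaluates the same expression once with floating-point division and once with integer division that truncates toward zero, as T-SQL does for int operands; the function and parameter names are illustrative assumptions.

```python
def liability(days_billed, contract_rate, plan_ded, ins_cover, co_pay):
    """Evaluate the formula from the question with floating-point division."""
    if days_billed == 0:
        return 0.0
    patient_share = (days_billed * contract_rate - plan_ded) * (100 - ins_cover) / 100
    return (patient_share + plan_ded) / days_billed + co_pay * days_billed

def trunc_div(a, b):
    """Integer division that truncates toward zero (Python's // floors instead)."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q

def liability_int(days_billed, contract_rate, plan_ded, ins_cover, co_pay):
    """Same formula when every operand is an integer."""
    if days_billed == 0:
        return 0
    patient_share = trunc_div((days_billed * contract_rate - plan_ded) * (100 - ins_cover), 100)
    return trunc_div(patient_share + plan_ded, days_billed) + co_pay * days_billed

float_result = round(liability(11, 433, 15, 344, 15), 2)  # -886.83
int_result = liability_int(11, 433, 15, 344, 15)          # -886
```

As a side observation that the thread itself does not make, the same formula gives 166.36 when the coverage value is 100 instead of 344, so the inputs may be worth double-checking.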
