Finding the smallest DFA that separates two words without using brute force search? - finite-automata

Given two strings a and b, I want to build a minimum-size DFA that accepts a and rejects b. One way to do this is brute-force search: enumerate DFAs in order of increasing size until one separates the two strings.
Is there an alternative? If so, what is it?
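To make the brute-force baseline concrete, here is a minimal Python sketch (the encoding and names are my own, not from any particular reference): it enumerates every transition table and accepting set for n = 1, 2, ... states over the alphabet of the two words and returns the first DFA that accepts a and rejects b.

```python
from itertools import product

def accepts(delta, accept, word):
    """Run a DFA given as (transition table, accepting-state set) from state 0."""
    state = 0
    for ch in word:
        state = delta[(state, ch)]
    return state in accept

def smallest_separating_dfa(a, b):
    """Brute force: enumerate all DFAs with n = 1, 2, ... states over the
    alphabet of the two words; return the first that accepts a, rejects b.
    Assumes a != b (otherwise no separating DFA exists and this never stops)."""
    alphabet = sorted(set(a) | set(b))
    n = 1
    while True:
        keys = [(q, ch) for q in range(n) for ch in alphabet]
        # Every way to pick a target state for each (state, symbol) pair ...
        for targets in product(range(n), repeat=len(keys)):
            delta = dict(zip(keys, targets))
            # ... combined with every choice of accepting states.
            for bits in product((False, True), repeat=n):
                accept = {q for q in range(n) if bits[q]}
                if accepts(delta, accept, a) and not accepts(delta, accept, b):
                    return delta, accept
        n += 1  # no separator with n states; try n + 1
```

The cost grows roughly like n^(n·|Σ|)·2^n per state count n, which is exactly why an alternative to brute force would be interesting.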

Related

Is it possible to have more than one minimal DFA for a regular language?

If we create two DFA's for a language L, say DFA A and DFA B, and then minimise both, we get their corresponding equivalent minimal DFA's. Is it always the case that both minimal DFA's have the same number of states?
I designed two DFAs for a language containing strings with 1 as their second-to-last symbol (the alphabet is {0,1}). One has three states and one has four. I am unable to minimise either of them.
The minimal deterministic finite automaton is unique up to isomorphism.
Isomorphism effectively means "equal shape". In other words, there is only one minimal DFA, and you can name the states however you want; this renaming technically creates a new automaton, but all of these possible renamings of the states are isomorphic to each other: the shape is the same, just the representation differs.
Ignoring the isomorphism, the minimal DFA is unique.
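To see this in practice, here is a hedged Python sketch of Moore-style partition refinement (the identifiers are my own); it merges states that no suffix can distinguish, and every correct DFA for a given language collapses to the same block structure.

```python
def minimize(states, alphabet, delta, accept):
    """Moore-style partition refinement: merge states that cannot be
    distinguished by any suffix. delta maps (state, symbol) -> state."""
    # Start with the accepting / non-accepting split.
    partition = [b for b in (set(accept), set(states) - set(accept)) if b]
    changed = True
    while changed:
        changed = False
        new_partition = []
        for block in partition:
            # Group states by which block each symbol sends them to.
            groups = {}
            for q in block:
                key = tuple(
                    next(i for i, blk in enumerate(partition) if delta[(q, ch)] in blk)
                    for ch in alphabet
                )
                groups.setdefault(key, set()).add(q)
            new_partition.extend(groups.values())
            if len(groups) > 1:
                changed = True
        partition = new_partition
    return partition  # each block is one state of the minimal DFA
```

Applied to any correct DFA for "strings whose second-to-last symbol is 1", this refinement stabilises at four blocks, one for each possible pair of last two symbols. So a three-state automaton cannot recognise that language, and a correct four-state one is already minimal; that is consistent with being unable to shrink either of yours.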

When clustering with OpenRefine, is there a way to "exclude" a string from a cluster? Right now it feels like it either clusters everything or nothing.

When using the clustering function in OpenRefine, you can select the "Merge?" option to merge the strings that were grouped together by the method of your choice. But what if the method clusters most of them correctly, except for one string that I manually identify as not belonging in the cluster? Is there a way to exclude that specific string from the rest of the cluster?
Unfortunately there is not currently a way of excluding or selecting a subset of terms from a cluster. The only two options I can think of are:
a) Modify the clustering algorithm you are using to try to get better clustering which doesn't include the incorrect terms.
b) Go to 'browse cluster' and mark the rows with the terms you don't want to have in the cluster (e.g. by flagging the rows), exclude the flagged rows in a facet, and re-cluster; this will then not include any of the terms you didn't want.

Simulate random vector conditionally on a subdomain

Suppose I have a bivariate random vector which I can simulate from, taking values in a given domain; for the sake of simplicity, let's suppose that it takes values in the whole of $\mathbb{R}^2$.
Suppose now that the probability that my random vector falls in a given subdomain (e.g. $[0,1]^2$) is very small.
To simulate values conditionally on being in this subdomain, the easy technique "simulate unconditionally and discard if it's not in the subdomain" won't be very efficient.
Is there a generic way to simulate conditionally on being in a subdomain that would be more efficient than this easy trick?
I have access to a random number generator for my bivariate law, but I don't have access to the law itself (no expression for the density, CDF, or anything else).
Maybe this is not the right place to post this?
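For reference, here is a minimal Python sketch of the naive rejection approach being described; the sampler `draw` below is a hypothetical stand-in for whatever black-box generator is available, and any better method would have to beat this baseline.

```python
import random

def draw():
    """Hypothetical stand-in for the black-box bivariate sampler
    (here: two independent standard Gaussians)."""
    return random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)

def draw_conditional(in_subdomain, max_tries=1_000_000):
    """Naive rejection: resample until the draw lands in the subdomain.
    The expected number of tries is 1/p, where p = P(X in subdomain),
    which is exactly why this is slow when p is very small."""
    for _ in range(max_tries):
        x = draw()
        if in_subdomain(x):
            return x
    raise RuntimeError("subdomain probability too small for naive rejection")

# Conditioning on [0,1]^2:
sample = draw_conditional(lambda v: 0.0 <= v[0] <= 1.0 and 0.0 <= v[1] <= 1.0)
```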

Karp-Rabin algorithm

While taking 6.006 (Introduction to Algorithms), provided by MIT OCW, I came across the Rabin-Karp algorithm; the image below is from that course.
Can anyone help me understand why the first rs() == rt() check is required? If it's used, then shouldn't we also check first by brute force whether the strings are equal and then move ahead? And why are we not considering equality of the strings when hashing starts from t[0] and we then try to find further matches?
In the image, rs() is the hash value, and rs.skip[arg] removes the first character of that string, assuming it is 'arg'.
Can anyone help me understand why the first rs() == rt() check is required?
I assume you mean the one right before the range loop. If the strings have the same length, then the range loop will not run (empty range). The check is necessary to cover that case.
If it's used, then shouldn't we also check first by brute force whether the strings are equal and then move ahead?
I'm not sure what you mean here. The posted code leaves a blank (with ...) after matching hashes are found. Let's not forget that at that point we must compare the strings to confirm we have really found a match. And it's up to the (not shown) implementation to continue searching until the end or not.
And why are we not considering equality of the strings when hashing starts from t[0] and we then try to find further matches?
I really don't get this part. Note that the first two loops populate the rolling hashes for the input strings. Then comes a check for a match at this point, and then the loop that updates the rolling hashes pairwise and compares them. The entire t is checked, from start to end.
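To make the structure concrete, here is a hedged Python sketch with the same shape (a simplified polynomial rolling hash of my own, not the course's rolling-hash class): the hashes for s and for the first window of t are built, compared once before the loop, and then the window slides.

```python
def rabin_karp(s, t):
    """Find the offsets of s in t with a polynomial rolling hash.
    Simplified stand-in for the course's rolling-hash data structure."""
    n, m = len(s), len(t)
    if n == 0 or m < n:
        return []
    base, mod = 256, (1 << 61) - 1      # hash parameters (arbitrary choice)
    pow_top = pow(base, n - 1, mod)     # base^(n-1), used to drop a character

    hs = ht = 0
    for i in range(n):                  # the first two loops: build both hashes
        hs = (hs * base + ord(s[i])) % mod
        ht = (ht * base + ord(t[i])) % mod

    matches = []
    if hs == ht and t[:n] == s:         # the check *before* the loop:
        matches.append(0)               # covers a match at offset 0
                                        # (and the n == m case entirely)
    for i in range(n, m):               # slide the window by one character
        ht = (ht - ord(t[i - n]) * pow_top) % mod   # drop t[i-n] (the "skip")
        ht = (ht * base + ord(t[i])) % mod          # append t[i]
        if hs == ht and t[i - n + 1 : i + 1] == s:  # confirm: hashes can collide
            matches.append(i - n + 1)
    return matches
```

Note that when len(s) == len(t), the sliding loop body never runs, so the single comparison before it is the only chance to detect a match, which is the case the answer points out.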

VB.NET Comparing files with Levenshtein algorithm

I'd like to use the Levenshtein algorithm to compare two files in VB.NET. I know I can use an MD5 hash to determine if they're different, but I want to know HOW MUCH different the two files are. The files I'm working with are both around 250 megs. I've experimented with different ways of doing this and realized I really can't load both files into memory (all kinds of string-related issues), so I figured I'd just stream the bytes I need as I go. Fine. But the implementations of the Levenshtein algorithm that I've found all allocate a matrix of size length1 × length2, which in this case is impossible to work with. I've heard there's a way to do this with just two vectors instead of the whole matrix.
How can I compute Levenshtein distance of two large files without declaring a matrix that's the product of their file sizes?
Note that the values in each row of the Levenshtein matrix depend only on the values in the row above it. This means that you only need two one-dimensional arrays: one contains the values of the current row; the other is populated with the new values that you can compute from the current row. Then you swap their roles (the "new" row becomes the "current" row and vice versa) and continue.
Note that this approach only lets you compute the Levenshtein distance (which seems to be what you want); it cannot tell you which operations must be done in order to transform one string into the other. There is a very clever modification of the algorithm (Hirschberg's algorithm, if I remember the name correctly) that lets you reconstruct the edit operations without using O(nm) memory, but I've forgotten how it works.
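For clarity, here is a minimal sketch of the two-row scheme in Python (the question is about VB.NET, but the structure translates one-to-one; in the real setting a and b would be streamed or memory-mapped buffers rather than in-memory sequences).

```python
def levenshtein(a, b):
    """Two-row Levenshtein distance: O(len(b)) memory instead of a full
    len(a) x len(b) matrix. a and b can be any indexable sequences."""
    prev = list(range(len(b) + 1))   # distances from the empty prefix of a
    curr = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr[0] = i                  # delete the first i elements of a
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution (or match)
        prev, curr = curr, prev      # swap the two rows instead of copying
    return prev[len(b)]

# e.g. levenshtein(b"kitten", b"sitting") == 3
```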