Performance characteristics of a non-minimal DFA - finite-automata

Much has been written about the performance of algorithms for minimizing DFAs. That's frustrating my Google-fu because that's not what I'm looking for.
Can we say anything generally about the performance characteristics of a non-minimal DFA? My intuition is that the run time of a non-minimal DFA will still be O(n) with respect to the length of the input. It seems that minimization would only affect the number of states and hence the storage requirements. Is this correct?
Can we refine the generalizations if we know something about the construction of the NFA from which the DFA was derived? For example, say the NFA was constructed entirely by applying concatenation, union, and Kleene star operations to primitive automatons that match either one input symbol or epsilon. With no way to remove a transition or create an arbitrary transition, I don't think it's possible to have any dead states. What generalizations can we make about the DFAs constructed from these NFAs? I'm interested in both theoretical and empirical answers.

Regarding your first question, on the runtime of the non-optimal DFA: purely theoretically, your intuition that it should still run in O(n) is correct. However, imagine (as an example) the following pseudo-code for the Kleene-star operator:
# given that the Kleene-star operator starts at some index i
while i < len(string) and string[i] == 'r':
    accepting = True
    i += 1
while i < len(string) and string[i] == 'r':   # identical to the loop above: a redundant state
    accepting = True
    i += 1
# from here the testing of the input string continues at index i
As you can see, the two while-loops are identical and could be understood as a redundant state. However, splitting the loop in two can (among other things) reduce your branch-prediction accuracy and therefore increase the overall runtime (see Mysticial's brilliant explanation of branch prediction for more details here).
Many other, similar "practical" arguments can be made for why a non-optimal DFA will be slower. Among them: higher memory usage, as you mentioned (and in many cases more memory means slower, since memory is, by comparison, a slow part of the computer); more "ifs", since each additional state requires input checks for its successors; and possibly more loops (as in the example), which make the algorithm slower not only because of branch prediction but simply because some programming languages are just very slow on loops.
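For contrast, a table-driven simulation does a constant amount of work per input symbol no matter how many states the transition table contains, which is why the asymptotic bound stays O(n) even for a non-minimal DFA. A minimal sketch in Python (the states and transitions are made up for illustration; states 1 and 2 are deliberately equivalent, mirroring the duplicated loop above):
# A redundant DFA for r*: states 0, 1 and 2 are all accepting and equivalent.
DELTA = {
    (0, 'r'): 1,
    (1, 'r'): 2,   # detour through an extra, equivalent state
    (2, 'r'): 2,
}
ACCEPTING = {0, 1, 2}

def accepts(string, start=0):
    state = start
    for symbol in string:                      # one table lookup per symbol
        state = DELTA.get((state, symbol))
        if state is None:                      # no transition: reject early
            return False
    return state in ACCEPTING

print(accepts("rrr"))   # True
print(accepts("rxr"))   # False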
Regarding your second question, I am not sure what you mean here. After all, if you do the conversion properly, you should derive a pretty optimal DFA in the first place.
EDIT:
In the discussion, the idea came up that several non-minimal DFAs constructed from the same NFA could have different efficiencies (by whatever measure is chosen), not in the implementation, but in the structure of the DFA.
This is not possible, because the minimal DFA is unique (up to renaming of states). This is an outline of a proof:
Assume that our procedure for creating and minimizing a DFA is optimal.
When applying the procedure, we start by constructing a DFA. In this step we can create arbitrarily many equivalent states, all connected to the graph of the NFA in some way.
In the next step we eliminate all unreachable states. This makes no difference to performance, since an unreachable state corresponds to "dead code" that is never executed.
In the final step, we minimize the DFA by grouping equivalent states. This is where it becomes interesting, for the idea is that we could do this grouping in different ways, resulting in different DFAs with different performance. However, the only "choice" we have is assigning a state to a different group.
So, for argument's sake, assume we could do that.
But by the idea behind the minimization algorithm, we can only group equivalent states. So if we had different choices for grouping a particular state, then by transitivity of equivalence not only would that state be equivalent to both groups, the groups would be equivalent to each other as well. If we could group differently, the algorithm would not be optimal, because it should have merged all the states of those groups into one group in the first place.
Therefore, the assumption that there can be different minimizations has to be wrong.
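To make the grouping step concrete, here is a small sketch of the usual partition-refinement idea (in Python, with made-up state names): start from the split into accepting and non-accepting states, and keep splitting a group whenever two of its states disagree, for some input symbol, about which group they move to. When no further split is possible, each remaining group is one state of the minimal DFA, and there is no freedom left in how states are grouped:
# Sketch of partition refinement (Moore-style minimisation).
# delta maps (state, symbol) -> state; the automaton is assumed complete.
def minimise(states, alphabet, delta, accepting):
    # Initial partition: accepting vs. non-accepting states.
    partition = [set(accepting), set(states) - set(accepting)]
    partition = [group for group in partition if group]

    changed = True
    while changed:
        changed = False
        new_partition = []
        for group in partition:
            # Two states stay together only if, for every symbol, their
            # successors land in the same group of the current partition.
            def signature(s):
                return tuple(
                    next(i for i, g in enumerate(partition) if delta[(s, a)] in g)
                    for a in alphabet
                )
            buckets = {}
            for s in group:
                buckets.setdefault(signature(s), set()).add(s)
            if len(buckets) > 1:
                changed = True
            new_partition.extend(buckets.values())
        partition = new_partition
    return partition   # each set is one state of the minimal DFA

# A redundant three-state DFA for r* (like the one sketched earlier)
# collapses into a single group:
states = [0, 1, 2]
delta = {(0, 'r'): 1, (1, 'r'): 2, (2, 'r'): 2}
print(minimise(states, ['r'], delta, accepting={0, 1, 2}))   # [{0, 1, 2}]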

The reasoning is that the "runtime" for input acceptance will be the same, since usually one character of the input is consumed per transition; I have never heard the notion of "runtime" (in the sense of asymptotic runtime complexity) used in the context of DFAs. Minimization aims at minimizing the number of states (i.e., at optimizing the "implementation size") of the DFA.

Related

How does a nested if statement affect actual runtime before f(n) simplifies to O(g(n))?

I am aware that constant coefficients and constants are simply ignored when calculating runtime complexity of an algorithm. However, I would still like to know whether an if statement nested in a while or for loop adds to the total actual runtime of an algorithm, f(n).
This picture is from an intro to theoretical computer science lecture I am currently studying, and the algorithm in question counts the number of 'a's in any input string. The lecturer counts the nested if statement as one of the timesteps that affect the total runtime, but I am unsure whether this is correct. I am aware that the entire algorithm simplifies to O(g(n)) where g(n) = n, but I would like to know definitively whether f(n) itself equals 2n + a or n + a. Understanding this is important to me, since I believe that knowing the exact actual runtime, f(n), before simplifying it to O(g(n)) reduces mistakes when calculating the runtime of more complicated algorithms. I would appreciate your insight.
Youtube clip: https://www.youtube.com/watch?v=5Bbxqv73EbU&list=PLAwxTw4SYaPl4bx7Pck4JWjy1WVbrDx0U&index=35
Knowing the actual runtime, as you say, before calculating the time complexity in big-O is not important. In fact, as you continue studying, you will find that in many cases it is ambiguous, annoying or very, very difficult to find an exact number of steps that an algorithm will execute. It often comes down to definition, and depending on how you see things, you can come up with different answers.
Time complexity, on the other hand, is a useful and often easier expression to find. I believe this is the very point the video is trying to make. But to answer your question: yes, in this case the if statement is definitely a step that the algorithm has to make. It only compares one character, so it is clearly a constant-time operation. The author considers this comparison to take 1 step, and since it will execute n times, the total number of times this line of "code" is executed is n. So yes, you can see the whole algorithm as taking 2n + a steps.
However, what if we are working on a computer where we can't just compare a character in a single step, but need to copy the character variable to a special register first and then do the comparison? Perhaps on that computer we would need to see that line as taking 2 steps, so 2n in total. Then the overall number of steps would be 3n + a, yet the time complexity is still O(n). When we study complexity theory, we don't want to go down to that level of counting, because different ways of counting simply give different results.
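To see where a count like 2n + a comes from, here is a rough sketch in Python. The per-operation charges are assumptions that follow the convention above: one step for the loop control and one for the nested comparison per character, plus a constant amount of setup and wrap-up:
def count_a_with_steps(s):
    steps = 1            # constant setup: initialise the counter
    count = 0
    for ch in s:
        steps += 1       # loop control / reading the next character: 1 step
        steps += 1       # the nested if comparison: 1 step
        if ch == 'a':
            count += 1   # not charged here; another convention might count it too
    steps += 1           # constant wrap-up: returning the result
    return count, steps

print(count_a_with_steps("banana"))   # (3, 14), i.e. f(n) = 2n + 2 for n = 6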
You will soon learn to automatically filter out the constants and lower-order terms and identify the variables that actually contribute to the time complexity. When you study different algorithms, you will find that as the input grows, those differences become negligible.

Compound "OR" evaluation in DB2

I've searched the forums and found a few related threads but no definitive answer.
(case
when field1 LIKE '%T001%' OR field1 LIKE '%T201%' OR field1 LIKE '%T301%'...
In the above statement, if field1 is "like" t001, will the others even be evaluated?
(case
when (field1 LIKE '%T001%' OR field1 LIKE '%T201%' OR field1 LIKE '%T301%')...
Does adding parentheses as shown above change the evaluation?
In general, databases short-circuit boolean operations. That is, they stop at the first value that defines the result -- the first "true" for OR, the first "false" for AND.
That said, there are no guarantees, nor are there guarantees about the order of evaluation. So DB2 could decide to test the middle one, then the last, and then the first. Still, these conditions are essentially equivalent, so I would expect the ordering to be either first to last or last to first.
Remember: SQL is a descriptive language, not a procedural language. A SQL query describes the result set, but not the steps used to generate it.
You don't know.
SQL is a declarative language, not an imperative one. You describe what you want, the engine provides it. The database engine will decide in which sequence it will evaluate those predicates, and you don't have control over that.
If you get the execution plan today it may show one sequence of steps, but if you get it tomorrow it may show something different.
Not strictly answering your question, but if you have many of these matches, a simpler, possibly faster, and easier to maintain solution would be to use REGEXP_LIKE. The example you've posted could be written like this:
CASE WHEN REGEXP_LIKE(field1, '.*T(0|2|3)01.*') ...
Just as an indication of how it really works in this simple case.
select
  field1
  , case
      when field1 LIKE '%T001%' OR RAISE_ERROR('75000', '%T201%') LIKE '%T201%' then 1 else 0
    end as expr
from
(
  values
    'abcT001xyz'
  --, '***'
) t (field1);
The query returns the expected result for the statement above as-is.
But if you uncomment the commented-out line, you get SQLCODE=-438.
This means that the second OR operand is not evaluated if the first one returns true.
Note that this is just a demonstration. There is no guarantee that it will work this way in every case.
Just to add to some points made about the difference between so-called procedural languages on the one hand, and SQL (which is variously described as a declarative or descriptive language) on the other.
SQL defines a set of relatively high-level operators for working with arrays of data. They are "high-level" in the sense that they work with arrays in a concise fashion that has not been typical of general-purpose or procedural languages. Like all operators (whether array-based or not), there is typically more than one algorithm capable of implementing an operator.
In contrast to "general purpose" programming languages to which the vast majority of programmers are exposed, the existence of these array operators - in particular, the ability to combine them algebraically into an expression which defines a composite operation (or a query), and the absence of any explicit algorithms for iteration - was once a defining feature of SQL.
The distinction is less sharp nowadays, with a resurgent interest in functional languages and features, but most still view SQL as a beast of its own kind amongst commercially-popular tooling.
It is often said that in SQL you define what results you want, not how to get them. But this is true for every language. It is true even for machine-code operators, if you account for how the implementation in circuitry can be varied, and does vary, between CPU designs. It is certainly true for all compiled languages, where compilers employ many different machine-code algorithms to actually implement the operations specified in the source code, loop unrolling being one example.
The feature which continues to distinguish SQL (and the relational databases which implement it), is that the algorithm which implements an operation is determined at the time of each execution, and it does this not just by algebraic manipulation of queries (which is not dissimilar to what compilers do), but also by continuously generating and analysing statistics about the data being operated upon and the consequences of previous executions.
Put another way, the database execution engine is engaged in a constant search for the best practical algorithms (and combinations thereof) to implement its overall workload. It is capable of accommodating not just past experience, but of reacting to changes (such as in data volumes, in the degree of concurrency and transactional conflict, or in systemic constraints like available memory or overall workload).
The upshot of all this is that there is a specific order of evaluation in SQL, just like any other language. It is this order which defines a correct result. But unless written in so-called RBAR style (and even then, but to a more limited extent...), the database engine has tremendous latitude to implement shortcuts and performance optimisations, provided these do not change the final result.
Many operators fall into a class where it is possible to determine the result in many cases without evaluating all operands. I'm not sure what the formal word is to describe this property - partial evaluativity, maybe - but casually it is called short-circuiting. The OR operator has this property.
Another property of the OR operation is that it is associative and commutative. That is, the order in which a series of them is applied does not matter: it behaves like the addition operator, where you can add numbers in any order without affecting the result.
With a series of OR conditions, these are capable of being reordered and partially evaluated, provided the evaluation of any particular operand does not cause side-effects or depend on hidden variables. It is therefore quite likely that the database engine may reorder or partially evaluate them.
If the operands do cause side-effects or depend on hidden variables (functions which get the current date or time being a prime example of the latter), these often cause problems in queries - either because the database engine does not realise they have side-effects or hidden variables, or because the database does realise it but doesn't handle the case in the way the programmer expects. In such cases, a query may have to be completely rewritten (typically, cracked into multiple statements) to force a specific evaluation order or guarantee full evaluation.
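The same interaction between short-circuiting and side-effects is easy to observe in a general-purpose language. A small Python sketch (the function names are made up) of a side-effecting operand making the evaluation order visible:
calls = []

def noisy(name, result):
    # A predicate with a side-effect: it records that it was evaluated.
    calls.append(name)
    return result

# 'or' short-circuits: once the first operand is true, the second never runs.
print(noisy("first", True) or noisy("second", True), calls)    # True ['first']

# With a false first operand, the second operand must be evaluated.
calls.clear()
print(noisy("first", False) or noisy("second", True), calls)   # True ['first', 'second']
Unlike SQL, Python guarantees left-to-right evaluation of or, so the behaviour above is deterministic; a database engine is free to reorder or skip operands, which is exactly why side-effecting operands are risky in queries.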

Optimising table assignment to guests for an event based on a criteria

66 guests at an event, 8 tables. Each table has a "theme". We want to optimize various criteria: e.g., a balanced number of men and women at each table, people getting to discuss the topic they selected, etc.
I formulated this as a gradient-free optimisation problem: I wrote a function that calculates the goodness of the arrangement (i.e., cost of difference of men women, cost of non-preferred theme, etc.) and I am basically randomly perturbing the arrangement by swapping tables and keeping the "best so far" arrangement. This seems to work, but cannot guarantee optimality.
I am wondering if there is a more principled way to go about this. There (intuitively) seems to be no useful gradient in the operation of "swapping" people between tables, so random search is the best I came up with. However, brute-forcing by evaluating all possibilities seems infeasible; with 66 people there are factorial(66) possible orders, which is a ridiculously large number (about 10^92 according to Python). Since swapping two people at the same table gives the same arrangement, there are actually fewer distinct arrangements, which I think can be calculated by dividing out the repeats, e.g. fact(66)/(fact(number of people at table 1) * fact(number of people at table 2) * ...), which in my problem still comes out to about 10^53 possible arrangements, far too many to enumerate.
But is there something better that I can do than random search? I thought about evolutionary search but I don't know if it would provide any advantages.
Currently I am swapping a random number of people on each evaluation and keeping it only if it gives a better value. The random number of people is selected from an exponential distribution to make it more probable to swap 1 person than 6, for example, to make small steps on average but to keep the possibility of "jumping" a bit further in the search.
I don't know how to prove it but I have a feeling this is an NP-hard problem; if that's the case, how could it be reformulated for a standard solver?
Update: I have been comparing random search with a random "greedy search" and a "simulated annealing"-inspired approach, where the probability of keeping a swap depends on the measured improvement and anneals over time. So far, the greedy search surprisingly outperforms the probabilistic approach by a wide margin. Adding the annealing schedule seems to help.
What I am confused by is exactly how to think about the "space" of the domain. I realize that it is a discrete space, and that distances are best described in terms of something like Levenshtein edit distance, but I can't see how I could "map" it to some gradient-friendly continuous space. Possibly, if I relax the exact number of people per table and make it continuous, but strongly penalize deviations from the number that I want at each table, this would make the association matrix more "flexible" and possibly map better to a gradient space? Not sure. A seating assignment could then be a probability spread over more than one table.
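For reference, here is a minimal sketch (in Python) of the swap-and-anneal loop described above. The cost function, the cooling schedule, and the step counts are all placeholders; my_cost_function is assumed to be the hand-written "badness" score over an assignment (a list mapping each guest to a table index):
import math
import random

def anneal(assignment, cost, n_steps=100_000, t_start=1.0, t_end=0.01):
    current = list(assignment)
    current_cost = cost(current)
    best, best_cost = list(current), current_cost
    for step in range(n_steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / n_steps)
        # Swap a small random number of guest pairs (mostly 1, occasionally more).
        candidate = list(current)
        for _ in range(1 + min(int(random.expovariate(1.0)), 5)):
            i, j = random.sample(range(len(candidate)), 2)
            candidate[i], candidate[j] = candidate[j], candidate[i]
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature anneals towards zero.
        if delta <= 0 or random.random() < math.exp(-delta / t):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
    return best, best_cost

# Hypothetical usage: 66 guests spread over 8 tables.
start = [g % 8 for g in range(66)]
# best, best_cost = anneal(start, my_cost_function)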

multicollinearity for one-hot encoding

Do we always need to remove a column for one-hot encoding to prevent multicollinearity?
In the solution here (https://www.kaggle.com/omarelgabry/titanic/a-journey-through-titanic/comments#138896) it mentions
#Kevin Chang You need to delete one column of the dummy variables to avoid the state of multicollinearity. It's a state of very high correlations among the columns (independent variables), meaning that one can be predicted from the others. It is therefore a type of disturbance in the data, and if it is present, the statistical conclusions made about the data may not be reliable.
In the solutions here, there is not catering for multicollinearity
https://www.kaggle.com/sharmasanthosh/allstate-claims-severity/exploratory-study-on-ml-algorithms
May I know whether it is a must, or in what situations we need to cater for that?
If I have to answer your question "Do we always need to remove a column for one-hot encoding to prevent multicollinearity?", the answer is yes.
The common way to prevent multicollinearity is to remove highly correlated predictors from the model. If you have two or more factors with a high VIF, remove one from the model. Because they supply redundant information, removing one of the correlated factors usually doesn't reduce the R-squared.
Or you could use Partial Least Squares Regression (PLS) or Principal Components Analysis, regression methods that cut the number of predictors to a smaller set of uncorrelated components.
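For linear models, the practical fix is simply to drop one level per categorical feature when encoding. A small sketch in Python using pandas (the column name and values are made up):
import pandas as pd

# Hypothetical toy data with one categorical column.
df = pd.DataFrame({"embarked": ["S", "C", "Q", "S", "C"]})

# Full one-hot encoding: the three dummy columns always sum to 1, so together
# with an intercept they are perfectly collinear (the "dummy variable trap").
full = pd.get_dummies(df["embarked"], prefix="embarked")

# Dropping one level removes that exact linear dependence.
reduced = pd.get_dummies(df["embarked"], prefix="embarked", drop_first=True)

print(full.columns.tolist())      # ['embarked_C', 'embarked_Q', 'embarked_S']
print(reduced.columns.tolist())   # ['embarked_Q', 'embarked_S']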

NFA with half the strings in {0,1}^n

If there is an NFA M whose language L(M) is a subset of {0,1}*, how do you prove that determining whether L(M) contains fewer than half of the strings in {0,1}^n, for n >= 0, is NP-hard?
First, you have to decide whether the problem you are proposing is actually solvable.
Assuming it is indeed solvable by an NFA, it is certainly also solvable by a corresponding Turing machine (TM).
Let L(TM) = L(M).
Then there exists a deterministic Turing machine that can verify solutions for the given problem. Hence, the problem is in NP.
As for your question, determining whether L(M) has fewer than half of the strings in {0,1}^n for n >= 0 is a decidable problem.
We can then prove it to be NP-hard by giving a polynomial-time reduction from another problem that is already known to be NP-hard.
The data required to formulate such an algorithm is missing here.