How does a nested if statement affect actual runtime before f(n) simplifies to O(g(n))? - time-complexity

I am aware that constant coefficients and constants are simply ignored when calculating runtime complexity of an algorithm. However, I would still like to know whether an if statement nested in a while or for loop adds to the total actual runtime of an algorithm, f(n).
This picture is from an intro to theoretical computer science lecture I am currently studying, and the algorithm in question counts the number of 'a's in any input string. The lecturer counts the nested if statement as one of the timesteps that affect total runtime, but I am unsure whether this is correct. I am aware that the entire algorithm simplifies to O(g(n)) where g(n) = n, but I would like to know definitively whether f(n) itself equals 2n + a or n + a. Understanding this is important to me, since I believe that knowing the exact actual runtime, f(n), before simplifying it to O(g(n)) reduces mistakes when calculating runtime for more complicated algorithms. I would appreciate your insight.
Youtube clip: https://www.youtube.com/watch?v=5Bbxqv73EbU&list=PLAwxTw4SYaPl4bx7Pck4JWjy1WVbrDx0U&index=35

Knowing the actual runtime, as you call it, before calculating the big-O time complexity is not important. In fact, as you continue studying, you will find that in many cases it is ambiguous, annoying, or very, very difficult to find the exact number of steps an algorithm will execute. It often comes down to definitions, and depending on how you see things, you can come up with different answers.
Time complexity, on the other hand, is a useful and often easier expression to find. I believe this is the very point the video is trying to make. But to answer your question: yes, in this case the if statement is definitely a step the algorithm has to make. It only compares one character, so it is clearly a constant-time operation. The author considers this comparison to take 1 step, and since it executes n times, this line of "code" contributes n steps in total. So yes, you can see the whole algorithm as taking 2n + a steps.
However, what if we are working on a computer where we can't compare a character in a single step, but instead need to copy the character variable to a special register first and then do the comparison? On this computer we would need to see that line as taking 2 steps, so 2n in total. The overall number of steps would then be 3n + a, yet the time complexity is still O(n). When we study complexity theory, we don't want to go down to that level of counting, because different ways of counting will simply give different results.
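As a concrete sketch of that step-counting, here is a Python rendering of the count-the-'a's algorithm (not the lecturer's exact pseudocode; the step annotations follow the convention of one step per operation):

```python
def count_a(s):
    count = 0           # constant setup: part of the "+ a" term
    for ch in s:        # the loop body runs n times
        if ch == 'a':   # 1 comparison step per character -> n steps
            count += 1  # up to 1 increment step per character
    return count        # constant finish: part of the "+ a" term

print(count_a("banana"))  # 3
```

Counting one step per iteration for the loop advance and one for the comparison gives the 2n + a form; a machine that needs an extra register copy per comparison would give 3n + a, exactly as described above.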
You will soon learn to automatically filter out the constants and terms and identify the variables that contribute to the time complexity. When you study different algorithms, you find that as the input grows, those differences become negligible.

Related

Optimising table assignment to guests for an event based on a criteria

66 guests at an event, 8 tables. Each table has a "theme". We want to optimize various criteria: e.g., even number of men/women at the table, people get to discuss the topic they selected, etc.
I formulated this as a gradient-free optimisation problem: I wrote a function that calculates the goodness of the arrangement (i.e., cost of difference of men women, cost of non-preferred theme, etc.) and I am basically randomly perturbing the arrangement by swapping tables and keeping the "best so far" arrangement. This seems to work, but cannot guarantee optimality.
I am wondering if there is a more principled way to go about this. There (intuitively) seems to be no useful gradient in the operation of "swapping" people between tables, so random search is the best I came up with. However, brute-forcing by evaluating all possibilities seems to be difficult; if there are 66 people, there are factorial(66) possible orders, which is a ridiculously large number (10^92 according to Python). Since swapping two people at the same table is the same, there are actually fewer, which I think can be calculated by dividing out the repeats, e.g. fact(66)/(fact(number of people at table 1) * fact(number of people at table 2) * ...), which in my problem still comes out to about 10^53 possible arrangements, way too many to consider.
But is there something better that I can do than random search? I thought about evolutionary search but I don't know if it would provide any advantages.
Currently I am swapping a random number of people on each evaluation and keeping it only if it gives a better value. The random number of people is selected from an exponential distribution to make it more probable to swap 1 person than 6, for example, to make small steps on average but to keep the possibility of "jumping" a bit further in the search.
I don't know how to prove it but I have a feeling this is an NP-hard problem; if that's the case, how could it be reformulated for a standard solver?
Update: I have been comparing random search with a random "greedy search" and a "simulated annealing"-inspired approach where I have a probability of keeping swaps based on the measured improvement factor, that anneals over time. So far the greedy search surprisingly strongly outperforms the probabilistic approach. Adding the annealing schedule seems to help.
What I am confused by is exactly how to think about the "space" of the domain. I realize that it is a discrete space, and that distances are best described in terms of Levenshtein edit distance, but I can't think of how I could "map" it to some gradient-friendly continuous space. Possibly if I relax the exact number of people per table and make it continuous, but strongly penalize deviations from the number I want at each table -- this would make the association matrix more "flexible" and possibly map better to a gradient space? Not sure. A seating assignment could be a probability spread over more than one table.
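For reference, the greedy/annealed swap loop described in the update could be sketched as follows (a minimal sketch, assuming an assignment is a list mapping each person to a table index and `cost` is the hand-written goodness function; it swaps one pair per iteration rather than an exponentially distributed number, for brevity):

```python
import math
import random

def anneal(assignment, cost, n_iters=10_000, t0=1.0, t_min=1e-3):
    """Simulated annealing over random seat swaps.
    `assignment` maps person index -> table index; lower `cost` is better."""
    current, current_cost = assignment[:], cost(assignment)
    best, best_cost = current[:], current_cost
    for i in range(n_iters):
        t = max(t_min, t0 * (1 - i / n_iters))        # linear cooling schedule
        a, b = random.sample(range(len(current)), 2)  # pick two people to swap
        candidate = current[:]
        candidate[a], candidate[b] = candidate[b], candidate[a]
        c = cost(candidate)
        # always accept improvements; accept worsenings with Boltzmann probability
        if c < current_cost or random.random() < math.exp((current_cost - c) / t):
            current, current_cost = candidate, c
            if c < best_cost:
                best, best_cost = candidate[:], c
    return best, best_cost
```

Accepting a worsening swap with probability exp(-delta/t) is the standard Metropolis criterion; making t0 very small degenerates this into the pure greedy search being compared against.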

Relationship between code optimization and data compression

I searched the Internet about this question and found that some researchers have used data compression algorithms, such as Huffman coding, for compiler optimization.
My question is more general:
Can we consider code optimization as lossy type of compression?
At a concrete level, it's apples and oranges. But at an abstract level it's an interesting question.
Data compression deals with redundancy, which is the difference between data and information.
It seeks to reduce needless redundancy by revising the coding of the information.
Often this coding works by taking a common substring and making a code that refers to it, rather than repeating the substring.
Compiler optimization (for speed) seeks to reduce needless cycles.
One way: if the result of some computation is needed twice or more, make sure it is saved in some address or register (memoizing) so it can be reused with fewer cycles.
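A tiny Python illustration of that save-and-reuse idea (the cached Fibonacci here is just a stand-in for any computation whose result is needed more than once):

```python
from functools import lru_cache

calls = 0

@lru_cache(maxsize=None)
def fib(n):
    """Naive recursion, but each distinct result is computed only once."""
    global calls
    calls += 1
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(30))   # 832040
print(calls)     # 31 calls; the uncached version would make ~2.7 million
```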
Another form of encoding numbers is so-called "unary notation" where there is only one digit, and numbers are represented by repeating it. For example, the numbers "three" and "four" are "111" and "1111", which takes N digits.
This code is optimized by switching to binary, as in "011" and "100", which takes log(N) digits (base 2, of course).
A programming analogy to this is the difference between linear and binary search.
Linear search takes O(N) comparisons.
Each comparison can yield either a lot of information or very little - on average much less than a bit.
Binary search takes O(log(N)) comparisons, with each comparison yielding one bit.
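That gap in information per comparison shows up directly in comparison counts. Here is a small Python illustration that counts comparisons rather than measuring time:

```python
def linear_comparisons(sorted_list, target):
    """Count the equality comparisons a linear search makes."""
    comps = 0
    for x in sorted_list:
        comps += 1
        if x == target:
            break
    return comps

def binary_comparisons(sorted_list, target):
    """Count the three-way comparisons a binary search makes."""
    lo, hi, comps = 0, len(sorted_list) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        comps += 1
        if sorted_list[mid] == target:
            break
        elif sorted_list[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return comps

data = list(range(1024))
print(linear_comparisons(data, 1000))  # 1001
print(binary_comparisons(data, 1000)) # 10, i.e. about log2(1024)
```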
With some thinking, it should be possible to find other parallels.

NFA with half the strings in {0,1}^n

Given an NFA M whose language L(M) is a subset of {0,1}*, how do you prove that determining whether L(M) contains fewer than half of the strings in {0,1}^n, for n >= 0, is NP-hard?
First, you have to decide whether the problem you are proposing is actually solvable.
Assuming that it is indeed solvable by an NFA, then it is certainly solvable by a corresponding Turing machine (TM).
Let L(TM) = L(M)
Then there exists a deterministic Turing machine that can verify the solutions for the given set of problems. Hence, the problem is in NP.
As per your question: determining whether L(M) has fewer than half the strings in {0,1}^n for n >= 0 is decidable and can be reduced to a problem in P.
Therefore, we can prove it to be NP-Hard by taking an algorithm that can change it to another problem that is already proved NP-Hard in polynomial time.
Required data missing to formulate the algorithm.

Performance characteristics of a non-minimal DFA

Much has been written about the performance of algorithms for minimizing DFAs. That's frustrating my Google-fu because that's not what I'm looking for.
Can we say anything generally about the performance characteristics of a non-minimal DFA? My intuition is that the run time of a non-minimal DFA will still be O(n) with respect to the length of the input. It seems that minimization would only affect the number of states and hence the storage requirements. Is this correct?
Can we refine the generalizations if we know something about the construction of the NFA from which the DFA was derived? For example, say the NFA was constructed entirely by applying concatenation, union, and Kleene star operations to primitive automatons that match either one input symbol or epsilon. With no way to remove a transition or create an arbitrary transition, I don't think it's possible to have any dead states. What generalizations can we make about the DFAs constructed from these NFAs? I'm interested in both theoretical and empirical answers.
Regarding your first question, on the runtime of the non-optimal DFA: purely theoretically, your intuition that it should still run in O(n) is correct. However, imagine (as an example) the following pseudo-code for the Kleene star operator:
// given that the Kleene-star operator starts at i = something
while string[i] == 'r':
    accepting = true;
    i++;
while string[i] == 'r':
    accepting = true;
    i++;
// here the testing of the input string can continue from i+1
As you can see, the two while-loops are identical and could be understood as a redundant state. However, "splitting" while loops will decrease (among other things) your branch-prediction accuracy and therefore the overall runtime (see Mysticial's brilliant explanation of branch prediction here for more details).
Many other, similar "practical" arguments can be made for why a non-optimal DFA will be slower; among them, as you mentioned, higher memory usage (and in many cases more memory means slower, for memory is, by comparison, a slower part of the computer); more "ifs", since each additional state requires input checking for its successors; and possibly more loops (as in the example), which would make the algorithm slower not only because of branch prediction, but simply because some programming languages are just very slow on loops.
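To see why the state count doesn't change the asymptotic runtime, here is a minimal sketch of a table-driven DFA simulator in Python (the transition table and states are made up for illustration): the inner loop performs exactly one table lookup per input character, regardless of how many states the table contains.

```python
def run_dfa(transitions, start, accepting, s):
    """Simulate a table-driven DFA: one lookup per input character,
    so the runtime is O(len(s)) no matter how many states exist."""
    state = start
    for ch in s:
        state = transitions[(state, ch)]
    return state in accepting

# A deliberately non-minimal DFA for the language r*: states 0 and 1
# both accept and behave identically, mirroring the duplicated loops.
transitions = {(0, 'r'): 1, (1, 'r'): 1}
print(run_dfa(transitions, 0, {0, 1}, "rrr"))  # True
```

(For brevity, a character outside the alphabet raises a KeyError here rather than moving to an explicit dead state.)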
Regarding your second question: here I am not sure what you mean. After all, if you do the conversion properly, you should derive a pretty optimal DFA in the first place.
EDIT:
In the discussion the idea came up that there can be several non-minimal DFAs constructed from one NFA that would have different efficiencies (in whatever measure chosen), not in the implementation, but in the structure of the DFA.
This is not possible, for there is only one optimal DFA. This is the outline of a proof for this:
Assume that our procedure for creating and minimizing a DFA is optimal.
When applying the procedure, we start by constructing a DFA. In this step, we can create arbitrarily many equivalent states. These states are all connected to the graph of the NFA in some way.
In the next step we eliminate all non-reachable states. This makes no difference to performance, for an unreachable state corresponds to "dead code", never to be executed.
In the final step, we minimize the DFA by grouping equivalent states. This is where it becomes interesting, for the idea is that we could do this in different ways, resulting in different DFAs with different performance. However, the only "choice" we have is assigning a state to a different group.
So, for argument's sake, assume we could do that.
But, by the idea behind the minimization algorithm, we can only group equivalent states. So if we have different choices of grouping a particular state, by transitivity of equivalence, not only would the state be equivalent to both groups, but the groups would be equivalent, too. So if we could group differently, the algorithm would not be optimal, for it would have grouped all states in the groups into one group in the first place.
Therefore, the assumption that there can be different minimizations has to be wrong.
As for the reasoning that the "runtime" for input acceptance will be the same: usually exactly one character of the input is consumed per step, and I have never heard the notion of "runtime" (in the sense of asymptotic runtime complexity) used in the context of DFAs. Minimization aims at minimizing the number of states (i.e., optimizing the "implementation size") of the DFA.

Measuring the complexity of SQL statements

The complexity of methods in most programming languages can be measured in cyclomatic complexity with static source code analyzers. Is there a similar metric for measuring the complexity of a SQL query?
It is simple enough to measure the time it takes a query to return, but what if I just want to be able to quantify how complicated a query is?
[Edit/Note]
While getting the execution plan is useful, that is not necessarily what I am trying to identify in this case. I am not looking for how difficult it is for the server to execute the query, I am looking for a metric that identifies how difficult it was for the developer to write the query, and how likely it is to contain a defect.
[Edit/Note 2]
Admittedly, there are times when measuring complexity is not useful, but there are also times when it is. For a further discussion on that topic, see this question.
Common measures of software complexity include Cyclomatic Complexity (a measure of how complicated the control flow is) and Halstead complexity (a measure of how complex the arithmetic is).
The "control flow" in a SQL query is best related to the "and" and "or" operators in the query.
The "computational complexity" is best related to operators such as SUM or implicit JOINS.
Once you've decided how to categorize each unit of syntax of a SQL query as to whether it is "control flow" or "computation", you can straightforwardly compute Cyclomatic or Halstead measures.
What the SQL optimizer does to queries is, I think, absolutely irrelevant. The purpose of complexity measures is to characterize how hard it is for a person to understand the query, not how efficiently it can be evaluated.
Similarly, what the DDL says, or whether views are involved or not, shouldn't be included in such complexity measures. The assumption behind these metrics is that the complexity of the machinery inside a used abstraction isn't interesting when you simply invoke it, because presumably that abstraction does something well understood by the coder. This is why Halstead and Cyclomatic measures don't include called subroutines in their counting, and I think you can make a good case that views and DDL information are such "invoked" abstractions.
Finally, how perfectly right or how perfectly wrong these complexity numbers are doesn't matter much, as long they reflect some truth about complexity and you can compare them relative to one another. That way you can choose which SQL fragments are the most complex, thus sort them all, and focus your testing attention on the most complicated ones.
I'm not sure retrieving the query plans will answer the question: the query plans hide part of the complexity of the computation performed on the data before it is returned (or used in a filter), and they require a significant database to be relevant. In fact, complexity and length of execution are somewhat opposed; something like "Good, Fast, Cheap - pick any two".
Ultimately it's about the chances of making a mistake, or not understanding the code I've written?
Something like:
number of tables times (1
+1 per join expression (+1 per outer join?)
+1 per predicate after WHERE or HAVING
+1 per GROUP BY expression
+1 per UNION or INTERSECT
+1 per function call
+1 per CASE expression
)
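That formula could be turned into a function over pre-counted features like so (a sketch; the weights are the ones proposed above, the feature names are placeholders, and counting them from real SQL is left to a parser):

```python
def query_complexity(num_tables, joins=0, outer_joins=0, predicates=0,
                     group_by=0, unions=0, functions=0, cases=0):
    """Heuristic score: tables * (1 + weighted feature counts),
    following the formula sketched above."""
    return num_tables * (1 + joins + outer_joins + predicates
                         + group_by + unions + functions + cases)

# e.g. 3 tables, 2 joins, 4 WHERE predicates, 1 GROUP BY expression
print(query_complexity(3, joins=2, predicates=4, group_by=1))  # 3 * 8 = 24
```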
Please feel free to try my script that gives an overview of the stored procedure size, the number of object dependencies and the number of parameters -
Calculate TSQL Stored Procedure Complexity
SQL queries are declarative rather than procedural: they don't specify how to accomplish their goal. The SQL engine will create a procedural plan of attack, and that might be a good place to look for complexity. Try examining the output of the EXPLAIN (or EXPLAIN PLAN) statement; it will be a crude description of the steps the engine will use to execute your query.
Well, I don't know of any tool that does such a thing, but it seems to me that what makes a query more complicated could be measured by:
the number of joins
the number of where conditions
the number of functions
the number of subqueries
the number of casts to different datatypes
the number of case statements
the number of loops or cursors
the number of steps in a transaction
However, while it is true that the more complex queries might appear to be the ones with the most possible defects, I find that the simple ones are very likely to contain defects too, as they are more likely to be written by someone who doesn't understand the data model; they may appear to work correctly but in fact return the wrong data. So I'm not sure such a metric would tell you much.
In the absence of any tools that will do this, a pragmatic approach would be to ensure that the queries being analysed are consistently formatted and to then count the lines of code.
Alternatively use the size of the queries in bytes when saved to file (being careful that all queries are saved using the same character encoding).
Not brilliant but a reasonable proxy for complexity in the absence of anything else I think.
In programming languages we have several methods to compute the time complexity or space complexity.
We could do something similar for SQL: in a procedure, count the number of lines and loops as in a programming language. But whereas a program's runtime usually depends only on its input, in SQL it also depends on the data in the tables/views being operated on, plus the overhead complexity of the query itself.
Like a simple row-by-row query:

Select * from table;
-- this depends entirely on the number of records, say n, hence O(n)

Select max(input) from table;
-- here max adds an extra overhead to each row, therefore t*O(n), where t is the evaluation time of max
Here is an idea for a simple algorithm to compute a complexity score related to readability of the query:
Apply a simple lexer on the query (like ones used for syntax coloring in text editors or here on SO) to split the query in tokens and give each token a class:
SQL keywords
SQL function names
string literals with character escapes
string literals without character escape
string literals which are dates or date+time
numeric literals
comma
parenthesis
SQL comments (--, /* ... */)
quoted user words
non quoted user words: everything else
Give a score to each token, using different weights for each class (and different weights for particular SQL keywords).
Add the scores of each token.
Done.
This should work quite well, since, for example, counting subqueries amounts to counting the number of SELECT and FROM keywords.
By using this algorithm with different weight tables you can even measure the complexity in different dimensions: for example, to make nuanced comparisons between queries, or to score higher the queries which use keywords or functions specific to a particular SQL engine (e.g., GROUP_CONCAT on MySQL).
The algorithm can also be tweaked to take into account the case of SQL keywords (increase complexity if they are not consistently upper case) or the indentation (carriage returns, position of keywords on a line).
Note: I was inspired by the redcalx answer that suggested applying a standard formatter and counting lines of code. My solution is simpler, however, as it doesn't need to build a full AST (abstract syntax tree).
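A minimal sketch of such a lexer-based scorer (the token classes are a subset of the list above, and every weight here is made up for illustration):

```python
import re

# Token classes with illustrative (made-up) weights.
TOKEN_SPEC = [
    ("comment", r"--[^\n]*|/\*.*?\*/",   0),
    ("string",  r"'(?:[^']|'')*'",        1),
    ("number",  r"\b\d+(?:\.\d+)?\b",     1),
    ("keyword", r"\b(?:SELECT|FROM|WHERE|JOIN|GROUP|BY|HAVING|UNION|CASE|WHEN|AND|OR)\b", 3),
    ("word",    r"[A-Za-z_][A-Za-z_0-9]*", 1),
    ("paren",   r"[()]",                   1),
    ("comma",   r",",                      1),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p, _ in TOKEN_SPEC),
                    re.IGNORECASE | re.DOTALL)
WEIGHTS = {n: w for n, _, w in TOKEN_SPEC}

def sql_complexity(query):
    """Sum per-token weights; subqueries naturally score higher because
    each extra SELECT/FROM contributes keyword weight."""
    return sum(WEIGHTS[m.lastgroup] for m in MASTER.finditer(query))

print(sql_complexity("SELECT a FROM t"))                                # 8
print(sql_complexity("SELECT a FROM t WHERE b IN (SELECT b FROM u)"))   # 23
```

Swapping in a different WEIGHTS table changes the dimension being measured, e.g., raising the weight of engine-specific function names.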
Toad has a built-in feature for measuring McCabe cyclomatic complexity on SQL:
https://blog.toadworld.com/what-is-mccabe-cyclomatic-complexity
Well, if you're using SQL Server, I would say that you should look at the cost of the query in the execution plan (specifically the subtree cost).
Here is a link that goes over some of the things you should look at in the execution plan.
Depending on your RDBMS, there might be query plan tools that can help you analyze the steps the RDBMS will take in fetching your query.
SQL Server Management Studio Express has a built-in query execution plan. Pervasive PSQL has its Query Plan Finder. DB2 has similar tools (forgot what they're called).
A good question. The problem is that for a SQL query like:
SELECT * FROM foo;
the complexity may depend on what "foo" is and on the database implementation. For a function like:
int f( int n ) {
    if ( n == 42 ) {
        return 0;
    }
    else {
        return n;
    }
}
there is no such dependency.
However, I think it should be possible to come up with some useful metrics for a SELECT, even if they are not very exact, and I'll be interested to see what answers this gets.
It's reasonable enough to consider the complexity to be what it would be if you coded the query yourself.
If the table has N rows, then:
A simple SELECT would be O(N)
An ORDER BY is O(N log N)
A JOIN is O(N*M)
A DROP TABLE is O(1)
A SELECT DISTINCT is O(N^2)
A Query1 NOT IN/IN Query2 would be O( O1(N) * O2(N) )