problem size = 1 million
algorithm running time = N^2
operations per second = 10^9
The table in my algorithms book says it takes "hours" to complete, however I thought based off the information that it would take "minutes". My thought process was...
( 1 million )^2 / ( 10^9 ) = 1000 seconds which is less than an hour. Where did I go wrong? Thank you.
The table that you mentioned is most likely just giving a rough estimate, in the granularity of seconds/hours/days/years. The purpose of such a table might just be to convey a feeling about what O(N^2) actually means: Sorting a telephone book with 10000000 entries with an O(N^2) algorithm? Not a good idea.
This is affirmed by the fact that the asymptotic running time, when it is given in O-notation, omits any constant factor. So an algorithm in O(N^2) might actually perform, for example, 7.2 * N^2 operations to complete its task. And there you have 7200 seconds - that is, 2 hours.
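The arithmetic in the question and in this answer can be checked with a minimal sketch. The function name `estimated_seconds` and its parameters are illustrative only, and the factor 7.2 is the hypothetical constant used above, not a value taken from the book's table.

```python
def estimated_seconds(n, ops_per_second=1e9, constant_factor=1.0):
    """Estimate the running time of an algorithm doing constant_factor * n^2 operations."""
    operations = constant_factor * n ** 2
    return operations / ops_per_second


n = 1_000_000
# Bare N^2, as computed in the question.
print(estimated_seconds(n))
# Hypothetical 7.2 * N^2, the example constant from the answer.
print(estimated_seconds(n, constant_factor=7.2))
```

With a constant factor of 1 the estimate stays under an hour; a single-digit constant is enough to push it into the "hours" column.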
In Big-O Notation, O(N) and O(2N) describe the same complexity. That is to say, the growth rate of the time or space complexity for an algorithm at O(2N) is essentially equal to O(N). This can be seen especially when compared to an algorithm with a complexity like O(N^2) given an extremely large value for N. O(N) increases linearly while O(N^2) increases quadratically.
So I understand why O(N) and O(2N) are considered to be equal, but I'm still uncertain about treating these two as completely equal. In a program where your number of inputs N is 1 million or more, it seems to me like halving the time complexity would actually save quite a lot of time because the program would have potentially millions fewer actions to execute.
I'm thinking of a program that contains two for-loops. Each for-loop iterates over the entire length of a very large array of N elements. This program would have a complexity of O(2N). O(2N) reduces to O(N), but I feel like an implementation that only requires one for-loop instead of two would make it a faster program (even if a single for-loop implementation sacrificed some functionality for the sake of speed, for example).
My question:
If you had an algorithm with time complexity O(2N), would optimizing it to have O(N) time complexity make it twice as fast?
To put it another way, is it ever significantly beneficial to optimize an O(2N) algorithm down to O(N)? I imagine there would be some increase in the speed of the program, or would the increase be so insignificant that it isn't worth the effort since O(2N) == O(N)?
Time complexity is not the same as speed. For a given size of data, a program with O(N) might be slower, faster or the same speed as O(2N). Also, for a given size of data O(N) might be slower, faster or the same speed as O(N^2).
So if Big-O doesn't mean anything, why are we talking about it anyway?
Big-O notation describes the behaviour of a program as the size of data increases. This behaviour is always relative. In other words, Big-O tells you the shape of the asymptotic curve, but not its scale or dimension.
Let's say you have a program A that is O(N). This means that processing time will be linearly proportional to data size (ignoring real-world complications like cache sizes that might make the run-time more like piecewise-linear):
for 1000 rows it will take 3 seconds
for 2000 rows it will take 6 seconds
for 3000 rows it will take 9 seconds
And for another program B which is also O(N):
for 1000 rows it will take 1 second
for 2000 rows it will take 2 seconds
for 3000 rows it will take 3 seconds
Obviously, the second program is 3 times faster per row, even though they are both O(N). Intuitively, this tells you that both programs go through every row and spend some fixed time on processing it. The difference in time between 1000 and 2000 rows is the same as the difference between 2000 and 3000 rows - this means that the time grows linearly; in other words, the time needed for one record does not depend on the total number of records. This is equivalent to a program doing some kind of for-loop, for example when calculating a sum of numbers.
And, since the programs are different and do different things, it doesn't make any sense to compare 1 second of program A's time to 1 second of program B's time anyway. You would be comparing apples and oranges. That's why we don't care about the constant factor and we say that O(3n) is equivalent to O(n).
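The two-loops-versus-one-loop case from the question can be sketched as follows. Both functions are hypothetical illustrations (finding the minimum and maximum of a list); they count element visits rather than measure time, so they show only the growth pattern, not real speed.

```python
def min_max_two_passes(values):
    """Two separate loops over the data: 2N element visits."""
    visits = 0
    smallest = values[0]
    for v in values:
        visits += 1
        if v < smallest:
            smallest = v
    largest = values[0]
    for v in values:
        visits += 1
        if v > largest:
            largest = v
    return smallest, largest, visits


def min_max_one_pass(values):
    """One loop over the data: N element visits, with more work per visit."""
    visits = 0
    smallest = largest = values[0]
    for v in values:
        visits += 1
        if v < smallest:
            smallest = v
        elif v > largest:
            largest = v
    return smallest, largest, visits


for n in (1000, 2000, 4000):
    data = list(range(n, 0, -1))
    _, _, two = min_max_two_passes(data)
    _, _, one = min_max_one_pass(data)
    # Both visit counts double whenever n doubles.
    print(n, two, one)
```

The one-pass version visits half as many elements, but it does more work per element, so whether it is actually twice as fast is something only a measurement can tell. What Big-O captures is that both counts scale the same way as n grows.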
Now imagine a third program C, which is O(N^2).
for 1000 rows it will take 1 second
for 2000 rows it will take 4 seconds
for 3000 rows it will take 9 seconds
The difference in time here between 3000 and 2000 is bigger than the difference between 2000 and 1000. The more data there is, the bigger the increase. This is equivalent to a program doing a for-loop inside a for-loop, for example when searching for pairs in data.
When your data is small, you might not care about a 1-2 second difference. If you compare programs A and C just from the above timings and without understanding the underlying behaviour, you might be tempted to say that C is faster, or at least no slower. But look what happens with more records:
for 10000 rows program A will take 30 seconds
for 10000 rows program C will take 100 seconds
for 20000 rows program A will take 60 seconds
for 20000 rows program C will take 400 seconds
What started as similar performance on the same data quickly diverges: C is already almost 7 times slower at 20000 rows, and the gap doubles every time the data doubles. There is no way in this world that running C on a faster CPU could ever keep up with A, and the bigger the data, the more this is true.
The thing that makes all the difference is scalability. This means answering questions like how big a machine we are going to need in a year's time, when the database has grown to twice its size. With O(N), you are generally OK - you can buy more servers, more memory, use replication etc. With O(N^2) you are generally OK up to a certain size, at which point buying any number of new machines will not be enough to solve your problems any more and you will need to find a different approach in software, or run it on massively parallel hardware such as GPU clusters. With O(2^N) you are pretty much fucked unless you can somehow limit the maximum size of the data to something which is still usable.
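A small sketch makes the divergence easy to explore. The two functions below are simple models fitted to the 1000 to 3000 row timings listed earlier for programs A and C (3 seconds per 1000 rows, and 1 second at 1000 rows growing with the square); they are assumptions for illustration, not measurements.

```python
def time_a(rows):
    """Program A, O(N): 3 seconds per 1000 rows."""
    return 3.0 * rows / 1000


def time_c(rows):
    """Program C, O(N^2): 1 second at 1000 rows, growing with the square."""
    return (rows / 1000) ** 2


for rows in (1000, 3000, 100_000, 1_000_000):
    a, c = time_a(rows), time_c(rows)
    print(f"{rows:>9} rows: A = {a:8.0f} s, C = {c:10.0f} s, C/A = {c / a:6.1f}")
```

Under this model the ratio C/A is simply rows / 3000, so it grows without bound. Making C's machine k times faster only moves the break-even point from 3000 rows to 3000 * k rows; beyond that, A wins again.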
Note that the above examples are theoretical and intentionally simplified; as @PeterCordes pointed out, the times on a real CPU might be different because of caching, branch misprediction, data alignment issues, vector operations and a million other implementation-specific details. Please see his links in the comments below.
Suppose I have a query (it joins multiple tables), and assume it is tuned and optimized. The query runs against target tables holding N records, returns R records, and takes time T. Now the load gradually increases: the target tables hold N2 records, the query returns R2 records, and it takes time T2. Assuming I have allocated enough memory to Oracle, L2/L1 will be close to T2/T1, where L1 and L2 are the loads before and after; that is, a proportional increase in load results in a proportional increase in execution time. For this question, let's say L2 = 5L1, meaning the load has increased five times. Then the time taken by this query would also be five times as long, or a little more, right?
So, to reduce the proportional growth in time, do we have options in Oracle, like the parallel hint? In Java we split the job across multiple threads, and with twice the load and twice the worker threads we get almost the same completion time. So as the load increases we add worker threads and handle the scaling issue reasonably well. Is such a thing possible in Oracle, or does Oracle take care of it in the back end and scale by splitting the load internally into parallel processing? I have multi-core processors here. I will experiment with it, but an expert opinion would help.
No. Query algorithms do not necessarily grow linearly.
You should probably learn something about algorithms and complexity. But many algorithms used in a database are super-linear. For instance, ordering a set of rows has a complexity of O(n log n), meaning that if you double the data size, the time taken for sorting more than doubles.
This is also true of index lookups and various join algorithms.
On the other hand, if your query is looking up a few rows using a b-tree index, then the complexity is O(log n) -- this is sublinear. So index lookups grow more slowly than the size of the data.
So, in general you cannot assume that increasing the size of data by a factor of n has a linear effect on the time.
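To get a feel for these growth rates, here is a minimal sketch that compares how much the work grows when the data doubles. The cost functions are textbook operation-count models (log n, n, n log n), labelled with the operations discussed above; they are assumptions for illustration and say nothing about what Oracle actually does for a given query.

```python
import math


def growth_when_doubling(cost, n):
    """Return cost(2n) / cost(n): how much the work grows when the data doubles."""
    return cost(2 * n) / cost(n)


# Idealised operation counts, not measured query times.
costs = {
    "index lookup, log n": lambda n: math.log2(n),
    "full scan, n": lambda n: n,
    "sort, n log n": lambda n: n * math.log2(n),
}

for name, cost in costs.items():
    for n in (1_000, 1_000_000):
        print(f"{name:22} n={n:>9}: x{growth_when_doubling(cost, n):.3f}")
```

Doubling the data barely changes the cost of a single index lookup, exactly doubles a full scan, and slightly more than doubles a sort, which is why a 5x load does not have to mean a 5x run time in either direction.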
I have seen this question, but really, it's only about MySQL. Is there any sql database out there, that does not create an index for a unique constraint?
In one sense, no one can give you a definitive answer. As we speak, someone could be creating that very thing. But it's a fair bet that any DBMS you've heard of or are likely to hear of will use indexes to enforce uniqueness, because that's what the science dictates.
DBMSs use indexes for this because searching them is quick. The index uses some kind of structure that supports a binary search, providing O(log N) time complexity.
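As one concrete data point, SQLite behaves this way. SQLite is not mentioned in the question; it is used here only because it ships with Python. The table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; the UNIQUE constraint is on the email column.
conn.execute("CREATE TABLE users (name TEXT, email TEXT UNIQUE)")

# PRAGMA index_list reports the indexes on a table. In each result row,
# column 1 is the index name and column 2 is 1 for a unique index.
indexes = conn.execute("PRAGMA index_list(users)").fetchall()
for row in indexes:
    print(row[1], "unique" if row[2] else "non-unique")

# A second row with the same email violates the constraint.
conn.execute("INSERT INTO users VALUES ('a', 'a@example.com')")
try:
    conn.execute("INSERT INTO users VALUES ('b', 'a@example.com')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)
```

No index was created explicitly, yet the table ends up with an automatically generated unique index backing the constraint.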
Consider what the system would have to do without such a structure.
for each row to be inserted
    scan all rows in table
    error if found
In the best case -- when there's no error -- each inserted row would cause a scan of the entire table. That's O(nm) complexity for m inserts into an n-row table, a/k/a quadratic time when m is comparable to n.
Suppose for example you're inserting 10,000 rows into a 10,000-row table. You're looking at 100,000,000 = 10,000 * 10,000 comparisons! A binary search, by contrast, requires ~13 comparisons for 10,000 rows, and ~14 for 20,000. Because we're inserting into the same table we're comparing to, the number of comparisons on average will be about 14, so the total number of comparisons is about 140,000 = 14 * 10,000, or 0.14% of the work.
Databases are all about scale, and quadratic time even at modest scale is infeasible.
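The comparison counts can be reproduced with a rough sketch. It assumes a sorted Python list as a stand-in for an index and counts binary-search probes; a real DBMS would use a b-tree or similar, so treat the numbers as order-of-magnitude only. All function names are illustrative.

```python
def scan_comparisons(table_size, inserts):
    """Comparisons when every insert scans the whole, growing table."""
    total = 0
    for i in range(inserts):
        # No duplicate is found, so every existing row is checked.
        total += table_size + i
    return total


def binary_search_steps(sorted_keys, key):
    """Count the probes a lower-bound binary search makes for key."""
    lo, hi, steps = 0, len(sorted_keys), 0
    while lo < hi:
        mid = (lo + hi) // 2
        steps += 1
        if sorted_keys[mid] < key:
            lo = mid + 1
        else:
            hi = mid
    return steps


def indexed_comparisons(table_size, inserts):
    """Probes when every insert binary-searches a sorted list of keys."""
    keys = list(range(table_size))      # existing keys, already sorted
    total = 0
    for i in range(inserts):
        new_key = table_size + i        # never a duplicate
        total += binary_search_steps(keys, new_key)
        keys.append(new_key)            # list stays sorted: new keys are the largest
    return total


print(scan_comparisons(10_000, 10_000))
print(indexed_comparisons(10_000, 10_000))
```

The scan count comes out somewhat higher than the 100,000,000 quoted above because this sketch lets the table grow as rows are inserted. Either way, the indexed version does roughly a thousand times less work.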
On an ordinary machine I have handy, a simple program to compare two unsorted arrays of 10,000 integers takes 0.1 seconds. As we might expect, 100,000 integers takes 10 seconds, 100 times longer. At 1,000,000 integers, we could expect 1000 seconds, or about 17 minutes. A cool billion would take a million times longer, until sometime in the year 2042.
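These extrapolations follow from scaling the one measured point quadratically. A minimal sketch, assuming the 0.1 second measurement at 10,000 integers and a pure n^2 growth law (the function name and the year length are illustrative choices):

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def extrapolate_quadratic(n, base_n=10_000, base_seconds=0.1):
    """Scale a measured time to size n, assuming time grows with n^2."""
    return base_seconds * (n / base_n) ** 2


for n in (100_000, 1_000_000, 1_000_000_000):
    seconds = extrapolate_quadratic(n)
    print(f"{n:>13,} integers: {seconds:,.0f} s = {seconds / SECONDS_PER_YEAR:.2f} years")
```

For a billion integers the model gives about a billion seconds, which is a little over three decades.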
Rob Pike likes to say, "Fancy algorithms are slow when n is small, and n is usually small." It's true. But rule #5 is just as important: "Data dominates."
I have a simple LP with linear constraints. There are many decision variables, roughly 24 million. I have been using lpSolve in R to play with small samples, but this solver isn't scaling well. Are there ways to get an approximate solution to the LP?
Edit:
The problem is a scheduling problem. There are 1 million people who need to be scheduled into one of 24 hours, hence 24 million decision variables. There is a reward $R_{ij}$ for scheduling person $i$ into hour $j$. The constraint is that each person needs to be scheduled into some hour, but each hour only has a finite number of appointment slots $c$.
One good way to approach LPs/IPs with a massive number of variables and constraints is to look for ways to group the decision variables in some logical way. Since you have only given a sketch of your problem, here's a solution idea.
Approach 1: Group people into smaller batches
Instead of 1M people, think of them as 100 units of 10K people each. So now you only have 2400 (24 x 100) variables. This will get you part of the way there, and note that this won't be the optimal solution, but a good approximation. You can of course make 1000 batches of 1000 people and get a more fine-grained solution. You get the idea.
Approach 2: Grouping into cohorts, based on the Costs
Take a look at your R_ij's. Presumably you don't have a million different costs. There will typically be only a few unique cost values. The idea is to group many people with the same cost structure into one 'cohort'. Now you solve a much smaller problem - which cohorts go into which hour.
Again, once you get the idea you can make it very tractable.
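The cohort-building step might look like the sketch below. The data is synthetic and hypothetical (100,000 people drawn from 50 reward profiles), and the sketch only performs the grouping and counts the resulting variables; it does not solve the LP.

```python
import random
from collections import Counter

HOURS = 24


def build_cohorts(rewards):
    """Group people whose 24-hour reward rows are identical.

    rewards: list of tuples, one tuple of HOURS rewards per person.
    Returns a Counter mapping each distinct reward row to its head count.
    """
    return Counter(rewards)


# Hypothetical data: 100,000 people drawn from only 50 distinct reward profiles.
random.seed(0)
profiles = [tuple(random.randint(1, 10) for _ in range(HOURS)) for _ in range(50)]
people = [random.choice(profiles) for _ in range(100_000)]

cohorts = build_cohorts(people)
print("variables before grouping:", len(people) * HOURS)
print("variables after grouping: ", len(cohorts) * HOURS)
```

In the reduced model, each variable would be the number of people from one cohort assigned to one hour, and each cohort's head count becomes the right-hand side of its "everyone gets scheduled" constraint.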
Update Based on OP's comment:
By its very nature, making these groups is an approximation technique. There is no guarantee that the optimal solution will be obtained. However, the whole idea of careful grouping (by looking at cohorts with identical or very similar cost structures) is to get solutions as close to the optimal as possible, with far less computational effort.
I should have also added that when scaling (grouping is just one way to scale-down the problem size), the other constants should also be scaled. That is, c_j should also be in the same units (10K).
If persons A,B,C cannot be fit into time slot j, then the model will squeeze in as many of those as possible in the lowest cost time slot, and move the others to other slots where the cost is slightly higher, but they can be accommodated.
Hope that helps you get going in the right direction.
Assuming you have a lot of duplicate people, you are now using way too many variables.
Suppose you only have 1000 different kinds of people and that some of these occur 2000 times whilst others occur 500 times.
Then you just have to optimize the fraction of people that you allocate to each hour. (Note that you do have to adjust the objective functions and constraints a bit by adding 2000 or 500 as a constant.)
The good news is that this should give you the optimal solution with just a 'few' variables, but depending on your problem you will probably need to round the results to get whole people as an outcome.
I saw something from an "execution plan" article:
10 rows fetched in 0.0003s (0.7344s)
(the link: http://explainextended.com/2009/09/18/not-in-vs-not-exists-vs-left-join-is-null-mysql/ )
How come there are 2 durations shown? Also, what if I don't have a large data set yet? For example, if I have only 20, 50, or even just 100 records, can I really measure how much faster one SQL statement is than another in a real-life situation? In other words, do there need to be at least hundreds of thousands of records, or even a million, to accurately compare the performance of those 2 SQL statements?
For your first question:
X row(s) fetched in Y s (Z s)
X = number of rows (of course);
Y = time it took the MySQL server to execute the query (parse, retrieve, send);
Z = time the resultset spent in transit from the server to the client;
(Source: http://forums.mysql.com/read.php?108,51989,210628#msg-210628)
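If you want to collect these numbers from saved client output, a small parser is enough. This is a sketch that assumes the status line has exactly the shape quoted in the question; the regular expression and the neutral field names `first` and `second` are my own, and which duration means what is as described in the answer above.

```python
import re

# Matches lines such as "10 rows fetched in 0.0003s (0.7344s)".
STATUS_LINE = re.compile(
    r"(?P<rows>\d+) rows? fetched in (?P<first>[\d.]+)s \((?P<second>[\d.]+)s\)"
)


def parse_status(line):
    """Return (row count, first duration, second duration) from a status line."""
    match = STATUS_LINE.search(line)
    if match is None:
        raise ValueError(f"unrecognised status line: {line!r}")
    return int(match["rows"]), float(match["first"]), float(match["second"])


rows, first, second = parse_status("10 rows fetched in 0.0003s (0.7344s)")
print(rows, first, second)
```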
For the second question, you will never ever know how the query performs unless you test with a realistic number of records. Here is a good example of how to benchmark correctly: http://www.mysqlperformanceblog.com/2010/04/21/mysql-5-5-4-in-tpcc-like-workload/
That blog in general as well as the book "High Performance MySQL" is a goldmine.
The best way to test and compare the performance of operations is often (if not always!) to work with a realistic set of data.
If you plan on having millions of rows when your application is in production, then you should test with millions of rows right now, and not only a dozen!
A couple of tips:
While benchmarking, use select SQL_NO_CACHE ..., instead of select ...
This will prevent MySQL from using its query cache (which would make the first query take a normal amount of time, and re-executing it several times a lot faster)
Learn how to use EXPLAIN, and understand its output
Read the Chapter 7. Optimization section of the manual ;-)
Generally when there are 2 times shown, one is CPU time and one is wall-clock time. I cannot recall which is which, but it appears that the first is the CPU time and the second is elapsed time.