Time complexity of HashSet

I was discussing with a friend a HashSet design that uses the mod function as the hashing function.
The time complexity of such an implementation appears to be O(N/K), where N is the total number of items stored in the set and K is the total number of buckets. This assumes that the items are distributed evenly among the buckets, so the average bucket size is N/K.
I confused myself because I believe the time complexity should be O(N), since time complexity is the worst-case performance. Here the worst case could be that all N items go to the same bucket and the value we are looking for is at the end of that bucket. Please help me here.

You're right that the worst case is all items going into one bucket; the items being evenly distributed is the best case. That said, O(N/k) is the same as O(N) if k is held constant, since constant factors can be neglected. I would not expect k to be part of the input to a lookup anyway. If k can vary, then the bound is different, but the worst case is still O(N).
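For illustration, here is a minimal sketch of the kind of mod-based set the question describes (a hypothetical ModHashSet class, not the real java.util.HashSet): a lookup scans exactly one bucket, which on average holds about N/k items but in the worst case holds all N of them.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of a mod-based set with separate chaining; this is NOT
// the real java.util.HashSet, just an illustration of the design in the question.
class ModHashSet {
    private final List<List<Integer>> buckets = new ArrayList<>();
    private final int k;

    ModHashSet(int k) {
        this.k = k;
        for (int i = 0; i < k; i++) {
            buckets.add(new ArrayList<>());
        }
    }

    void add(int value) {
        List<Integer> bucket = buckets.get(Math.floorMod(value, k));
        if (!bucket.contains(value)) {   // linear scan of one bucket
            bucket.add(value);
        }
    }

    // Evenly distributed items: scans about N/k entries.
    // Worst case (e.g. every key added is a multiple of k): one bucket
    // holds all N items and the scan is O(N).
    boolean contains(int value) {
        return buckets.get(Math.floorMod(value, k)).contains(value);
    }
}
```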


Does optimizing an algorithm from O(2N) down to O(N) make it twice as fast? [duplicate]

In Big-O notation, O(N) and O(2N) describe the same complexity. That is to say, the growth rate of the time or space complexity of an algorithm at O(2N) is essentially equal to that at O(N). This is especially clear when either is compared to an algorithm with a complexity like O(N^2) for an extremely large value of N: O(N) increases linearly while O(N^2) increases quadratically.
So I understand why O(N) and O(2N) are considered equal, but I'm still uncertain about treating the two as completely equal. In a program where the number of inputs N is 1 million or more, it seems to me like halving the time complexity would actually save quite a lot of time, because the program would potentially have millions fewer actions to execute.
I'm thinking of a program that contains two for-loops. Each for-loop iterates over the entire length of a very large array of N elements. This program would have a complexity of O(2N). O(2N) reduces to O(N), but I feel like an implementation that only requires one for-loop instead of two would make for a faster program (even if a single-loop implementation sacrificed some functionality for the sake of speed, for example).
My question:
If you had an algorithm with time complexity O(2N), would optimizing it to have O(N) time complexity make it twice as fast?
To put it another way, is it ever significantly beneficial to optimize an O(2N) algorithm down to O(N)? I imagine there would be some increase in the speed of the program, or would the increase be so insignificant that it isn't worth the effort since O(2N) == O(N)?
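For concreteness, the kind of code I have in mind looks roughly like this (hypothetical helper methods, just to illustrate the two-loop versus one-loop idea):

```java
// Two separate passes over the array (roughly "2N" iterations) versus one
// fused pass (N iterations). Both are O(N); fusing only halves the number
// of loop iterations, i.e. it changes the constant factor.
static long twoPasses(int[] data) {
    long sum = 0;
    for (int x : data) {              // first pass: sum
        sum += x;
    }
    long max = Long.MIN_VALUE;
    for (int x : data) {              // second pass: max
        max = Math.max(max, x);
    }
    return sum + max;
}

static long onePass(int[] data) {
    long sum = 0;
    long max = Long.MIN_VALUE;
    for (int x : data) {              // one pass does both jobs
        sum += x;
        max = Math.max(max, x);
    }
    return sum + max;
}
```

Both versions do the same asymptotic amount of work, but the fused version walks the array only once.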
Time complexity is not the same as speed. For a given size of data, a program with O(N) might be slower, faster or the same speed as one with O(2N). Also, for a given size of data, O(N) might be slower, faster or the same speed as O(N^2).
So if Big-O doesn't mean anything, why are we talking about it anyway?
Big-O notation describes the behaviour of a program as the size of the data increases. This behaviour is always relative. In other words, Big-O tells you the shape of the asymptotic curve, but not its scale or dimension.
Let's say you have a program A that is O(N). This means that processing time will be linearly proportional to data size (ignoring real-world complications like cache sizes that might make the run-time more like piecewise-linear):
for 1000 rows it will take 3 seconds
for 2000 rows it will take 6 seconds
for 3000 rows it will take 9 seconds
And for another program B which is also O(N):
for 1000 rows it will take 1 second
for 2000 rows it will take 2 seconds
for 3000 rows it will take 3 seconds
Obviously, the second program is 3 times faster per row, even though both have O(N). Intuitively, this tells you that both programs go through every row and spend some fixed time processing it. The difference in time from 2000 to 1000 rows is the same as the difference from 3000 to 2000 - this means that the time grows linearly; in other words, the time needed for one record does not depend on the total number of records. This is equivalent to the program doing some kind of for-loop, as for example when calculating a sum of numbers.
And, since the programs are different and do different things, it doesn't make any sense to compare 1 second of program A's time to 1 second of program B's time anyway. You would be comparing apples and oranges. That's why we don't care about the constant factor and we say that O(3n) is equivalent to O(n).
Now imagine a third program C, which is O(N^2).
for 1000 rows it will take 1 second
for 2000 rows it will take 4 seconds
for 3000 rows it will take 9 seconds
The difference in time here between 3000 and 2000 is bigger than the difference between 2000 and 1000. The more data there is, the bigger the increase. This is equivalent to a program doing a for loop inside a for loop - as, for example, when searching for pairs in the data.
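A sketch of that nested-loop shape (a hypothetical function, not taken from any real program):

```java
// The "for loop inside a for loop" shape of program C: counting pairs of
// equal values compares every element against every later element, so the
// work grows as O(N^2).
static int countEqualPairs(int[] rows) {
    int pairs = 0;
    for (int i = 0; i < rows.length; i++) {
        for (int j = i + 1; j < rows.length; j++) {
            if (rows[i] == rows[j]) {
                pairs++;
            }
        }
    }
    return pairs;
}
```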
When your data is small, you might not care about a 1-2 second difference. If you compare programs A and C just from the above timings, without understanding the underlying behaviour, you might be tempted to say that A is faster. But look what happens with more records:
for 10000 rows program A will take 30 seconds
for 10000 rows program C will take 1000 seconds
for 20000 rows program A will take 60 seconds
for 20000 rows program C will take 4000 seconds
What initially looked like comparable performance on the same data quickly becomes painfully different - by a factor of almost 100x. There is no way that running C on a faster CPU could ever keep up with A, and the bigger the data, the more this is true.

The thing that makes all the difference is scalability. This means answering questions like how big a machine we are going to need in a year's time when the database has grown to twice its size. With O(N), you are generally OK - you can buy more servers, more memory, use replication etc. With O(N^2) you are generally OK up to a certain size, at which point buying any number of new machines will not be enough to solve your problems any more, and you will need to find a different approach in software, or run it on massively parallel hardware such as GPU clusters. With O(2^N) you are pretty much fucked unless you can somehow limit the maximum size of the data to something that is still usable.
Note that the above examples are theoretical and intentionally simplified; as @PeterCordes pointed out, the times on a real CPU might be different because of caching, branch misprediction, data alignment issues, vector operations and a million other implementation-specific details. Please see his links in the comments below.

Oracle SQL or PLSQL scale with load

Suppose I have a query (it has joins on multiple tables) and assume it is tuned and optimized. This query runs on the target database/tables with N1 records, returns R1 records and takes time T1. Now the load gradually increases: say the target records become N2, the result becomes R2 and the time taken becomes T2. Assuming that I have allocated enough memory to Oracle, the ratio of loads L2/L1 should be close to the ratio of times T2/T1; that is, a proportional increase in the load results in a proportional increase in execution time. For this question let's say L2 = 5*L1, meaning the load has increased to 5 times its original size. Then the time taken by this query would also be 5 times as long, or a little more, right?

So, to reduce this proportional growth in time, do we have options in Oracle, like the parallel hint etc.? In Java we split the job across multiple threads, and with 2 times the load and 2 times the worker threads we complete in almost the same time. So with increasing load we increase the worker threads and handle the scaling reasonably well. Is such a thing possible in Oracle, or does Oracle take care of this in the back end and scale by splitting the load internally into parallel processing? I have multi-core processors here. I will experiment with it, but if expert opinion is available it will help.
No. Query algorithms do not necessarily grow linearly.
You should probably learn something about algorithms and complexity. Many algorithms used in a database are super-linear. For instance, sorting a set of rows has a complexity of O(n log n), meaning that if you double the data size, the time taken for sorting more than doubles.
This is also true of index lookups and various join algorithms.
On the other hand, if your query is looking up a few rows using a b-tree index, then the complexity is O(log n) -- this is sublinear. So index lookups grow more slowly than the size of the data.
So, in general you cannot assume that increasing the size of data by a factor of n has a linear effect on the time.
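As a rough back-of-the-envelope illustration, assuming (purely hypothetically) that the query's cost is dominated by an O(n log n) step such as a sort: a 5x increase in rows produces more than a 5x increase in work, while an O(log n) index lookup barely changes.

```java
// Compares how much the work grows when the row count increases 5x,
// for an O(n log n) cost model versus an O(log n) cost model.
// The row counts are made up for this example.
public class GrowthRatio {
    public static void main(String[] args) {
        double n1 = 1_000_000;
        double n2 = 5 * n1;

        double sortRatio   = (n2 * Math.log(n2)) / (n1 * Math.log(n1));
        double lookupRatio = Math.log(n2) / Math.log(n1);

        System.out.printf("O(n log n) work grows by %.2fx%n", sortRatio);   // ~5.58x
        System.out.printf("O(log n) work grows by %.2fx%n", lookupRatio);   // ~1.12x
    }
}
```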

Why is the worst-case time complexity of hash table insertion not N log N

Looking at the fundamental structure of a hash table, we know that it resizes with respect to the load factor or some other deterministic parameter. I get that if the resizing limit is reached during an insertion, we need to create a bigger hash table and insert everything there. Here is the thing which I don't get.
Let's consider a hash table where each bucket contains an AVL tree (a balanced BST). If my hash function returns the same index for every key, then I would store everything in the same AVL tree. I know that this would be a really bad hash function and would not be used, but I'm considering the worst-case scenario here. So after some time, let's say the resizing factor has been reached. In order to resize, I create a new hash table and try to insert every element from my previous table. Since the hash function maps everything back into one AVL tree, I would need to insert all N elements into the same AVL tree. N insertions into an AVL tree take O(N log N). So why is the worst case of insertion for hash tables considered O(N)?
Here is the proof that adding N elements into an AVL tree takes O(N log N):
Running time of adding N elements into an empty AVL tree
In short: it depends on how the bucket is implemented. With a linked list, it can be done in O(n) under certain conditions. For an implementation with AVL trees as buckets, this can indeed, in the worst case, result in O(n log n). In order to calculate the time complexity, the implementation of the buckets has to be known.
Frequently a bucket is not implemented with an AVL tree, or a tree in general, but with a linked list. If there is a reference to the last entry of the list, appending can be done in O(1). Otherwise we can still reach O(1) by prepending to the linked list (in that case the buckets store data in reverse insertion order).
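A tiny sketch of the prepend idea, using a hand-rolled singly linked list and made-up names rather than any library class:

```java
// Separate chaining with a hand-rolled singly linked list per bucket:
// inserting at the head of a bucket is O(1), no reference to the last
// node needed. The new node becomes the bucket head, so entries end up
// in reverse insertion order.
class ChainedTable {
    private static final class Node {
        final Object key;
        final Node next;
        Node(Object key, Node next) { this.key = key; this.next = next; }
    }

    private final Node[] buckets;

    ChainedTable(int capacity) {
        buckets = new Node[capacity];
    }

    void insert(Object key) {
        int i = Math.floorMod(key.hashCode(), buckets.length);
        buckets[i] = new Node(key, buckets[i]);   // O(1) prepend
    }
}
```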
The idea of using a linked list is that a dictionary with a reasonable hashing function should result in few collisions. Frequently a bucket has zero or one elements, and sometimes two or three, but not many more. In that case, a simple data structure can be faster, since a simpler data structure usually requires fewer cycles per iteration.
Some hash tables use open addressing, where buckets are not separate data structures; if a bucket is already taken, the next free bucket is used. In that case, a search iterates over the used buckets until it has found a matching entry or has reached an empty bucket.
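And a sketch of such an open-addressing search with linear probing (again, the method and array layout are hypothetical):

```java
// Open addressing with linear probing: step to the next slot until the
// key is found or an empty slot proves it is absent.
static boolean probeContains(Object[] slots, Object key) {
    int start = Math.floorMod(key.hashCode(), slots.length);
    for (int step = 0; step < slots.length; step++) {
        Object slot = slots[(start + step) % slots.length];
        if (slot == null) {
            return false;          // empty slot: the key cannot be in the table
        }
        if (slot.equals(key)) {
            return true;           // matching entry found
        }
    }
    return false;                  // table is full and the key is not present
}
```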
The Wikipedia article on Hash tables discusses how the buckets can be implemented.

Given an array of N integers how to find the largest element which appears an even number of times in the array with minimum time complexity

You are given an array of N integers. You are asked to find the largest element which appears an even number of times in the array. What is the time complexity of your algorithm? Can you do this without sorting the entire array?
You could do it in O(n log n) with a table lookup method. For each element in the list, look it up in the table: if it is missing, insert a key-value pair with the key being the element and the value being the number of appearances (starting at one); if it is present, increment the count. At the end, just loop through the table in O(n) and look for the largest key with an even count.
In theory, for an ideal hash table, a lookup operation is O(1). So you can find and/or insert all n elements in O(n) time, making the total complexity O(n). However, in practice you will have trouble with space allocation (you need much more space than the data set size) and with collisions (which is why you need that extra space). This makes the O(1) lookup difficult to achieve; in the worst-case scenario it can be as much as O(n) (though this is also unlikely), making the total complexity O(n^2).
Instead, you can be safer with a tree-based table - that is, the keys are stored in a binary tree. Lookup and insertion operations are both O(log n) in this case, provided that the tree is balanced; there is a wide range of tree structures to help ensure this, e.g. red-black trees, AVL trees, splay trees, B-trees etc. (Google is your friend). This makes the total complexity a guaranteed O(n log n).
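For example, here is a sketch of the tree-based variant using Java's TreeMap (the method name is made up for this illustration): counting is O(log n) per element, and the final scan visits the keys from largest to smallest.

```java
import java.util.Map;
import java.util.TreeMap;

// Count occurrences in a TreeMap (O(log n) per update), then walk the keys
// in descending order and return the first one with an even count.
// Overall O(n log n); returns null if no element occurs an even number of times.
static Integer largestWithEvenCount(int[] values) {
    TreeMap<Integer, Integer> counts = new TreeMap<>();
    for (int v : values) {
        counts.merge(v, 1, Integer::sum);   // increment, starting at one
    }
    for (Map.Entry<Integer, Integer> e : counts.descendingMap().entrySet()) {
        if (e.getValue() % 2 == 0) {
            return e.getKey();
        }
    }
    return null;
}
```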

Sorting array where many of the keys are identical - time complexity analysis

The question goes like this:
Given an array of n elements where n^(2001/2002) of the elements are the same, the worst-case time complexity of sorting the array (with RAM model assumptions) will be:
So, I thought to use a selection algorithm in order to find the n^(2001/2002)-th smallest element, call it P. This should take O(n). Next, I take every element which doesn't equal P and put it in another array. In total I will have k = n - n^(2001/2002) elements. Sorting this array will cost O(k log k), which equals O(n log n). Finally, I find the largest element which is smaller than P and the smallest element which is bigger than P, and I can sort the array.
All of this takes O(n log n).
Note: if , then we can reduce the time to O(n).
I have two questions: is my analysis correct, and is there any way to reduce the time complexity? Also, what are the RAM model assumptions?
Thanks!
Your analysis is wrong - there is no guarantee that the n^(2001/2002)th-smallest element is actually one of the duplicates.
n^(2001/2002) duplicates simply don't constitute enough of the input to make things easier, at least in theory. Sorting the input is still at least as hard as sorting the n - n^(2001/2002) = Θ(n) other elements, and under standard comparison-sort assumptions in the RAM model, that takes Ω(n log n) worst-case time.
(For practical input sizes, n^(2001/2002) duplicates would be at least 98% of the input, so isolating the duplicates and sorting the rest would be both easy and highly efficient. This is one of those cases where the asymptotic analysis doesn't capture the behavior we care about in practice.)