How is a hash map stored? - optimization

I have an upcoming interview and was looking through some technical interview questions when I came across this one. It asks for the time complexity of the insertion and deletion functions of a hash map. The consensus seems to be that the time complexity is O(1) if the hash map is distributed evenly, but O(n) if all the entries end up in the same bucket.
I guess my question is how exactly are hash maps stored in memory? How would these 2 cases happen?

One answer on your linked page is:
insertion always would be O(1) if even not properly distributed (if we make linked list on collision) but Deletion would be O(n) in worst case.
This is not a good answer. A general statement of the time complexity of a hash map looks much like the table in the Wikipedia article on hash tables:
Time complexity in big O notation

              Average    Worst case
    Space     O(n)       O(n)
    Search    O(1)       O(n)
    Insert    O(1)       O(n)
    Delete    O(1)       O(n)
To address your question of how hash maps are stored in memory: there are a number of "buckets" that each store values in the average case, but a bucket must be expanded into some kind of list when a hash collision occurs. Good explanations of hash tables are the Wikipedia article, this SO question and this C++ example.
The time complexity table above looks the way it does because, in the average case, a hash map just looks up and stores single values directly, but collisions make everything O(n) in the worst case, where all your elements share one bucket and the behaviour degrades to that of whatever list implementation you chose for the collision case.
Note that there are specialized implementations that address the worst cases here, also described in the Wikipedia article, but each of them has other disadvantages, so you'll have to choose the best one for your use case.
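To make the bucket idea concrete, here is a minimal separate-chaining sketch in Python (the class and method names are just for illustration, not any particular library's API):

    # Minimal separate-chaining hash map sketch (illustrative, not production code).
    class ChainedHashMap:
        def __init__(self, capacity=8):
            self.buckets = [[] for _ in range(capacity)]   # each bucket is a small list

        def _bucket(self, key):
            return self.buckets[hash(key) % len(self.buckets)]

        def insert(self, key, value):
            bucket = self._bucket(key)
            for i, (k, _) in enumerate(bucket):
                if k == key:                    # key already present: overwrite
                    bucket[i] = (key, value)
                    return
            bucket.append((key, value))         # average case: O(1)

        def search(self, key):
            for k, v in self._bucket(key):      # worst case: every key in one bucket -> O(n)
                if k == key:
                    return v
            return None

        def delete(self, key):
            bucket = self._bucket(key)
            for i, (k, _) in enumerate(bucket):
                if k == key:
                    del bucket[i]
                    return True
            return False

If the hash function spreads keys evenly, each bucket stays tiny and all operations are effectively O(1); if every key hashes to the same bucket, each operation walks one long list and costs O(n).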

Related

What is the relationship between time complexity and the number of steps in an algorithm?

For large values of n, an algorithm that takes 20000n^2 steps has better time complexity (takes less time) than one that takes 0.001n^5 steps
I believe this statement is true. But, why?
If there are more steps, wouldn't that take more time?
Computational complexity is considered in the asymptotic sense because the important question is usually how an algorithm scales. Even in your clear-cut case, the n^5 algorithm begins to take longer at around 272 items (the crossover point where 20000n^2 = 0.001n^5), which isn't very many; plotting both functions, for example on Wolfram Alpha, shows this clearly.
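If you'd rather compute the crossover than eyeball a plot, a quick throwaway sketch will do (illustrative only):

    # Find where 0.001 * n**5 overtakes 20000 * n**2.
    # Setting 20000*n**2 == 0.001*n**5 gives n**3 == 2e7, i.e. n is roughly 272.
    n = 1
    while 0.001 * n**5 <= 20000 * n**2:
        n += 1
    print(n)   # prints 272: beyond this point the n^5 algorithm is slower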
Quoting from the wikipedia article linked above:
Usually asymptotic estimates are used because different implementations of the same algorithm may differ in efficiency. However the efficiencies of any two "reasonable" implementations of a given algorithm are related by a constant multiplicative factor called a hidden constant.
All that said, if you have two comparable algorithms, the asymptotically better one carries a significant constant factor, and you're only going to process 10 items, then it may well be a good idea to choose the asymptotically worse one. Some common libraries even switch algorithms depending on the size of the data being processed; such an approach is called a hybrid algorithm, and Python's sort implementation, Timsort, uses it to combine insertion sort and merge sort.
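As a rough illustration of the hybrid idea (a sketch only; CPython's real Timsort is considerably more sophisticated, and the cutoff value here is arbitrary):

    # Toy hybrid sort: insertion sort for small inputs, merge sort otherwise.
    CUTOFF = 32   # arbitrary threshold chosen for illustration

    def insertion_sort(xs):
        for i in range(1, len(xs)):
            x, j = xs[i], i - 1
            while j >= 0 and xs[j] > x:
                xs[j + 1] = xs[j]
                j -= 1
            xs[j + 1] = x
        return xs

    def hybrid_sort(xs):
        if len(xs) <= CUTOFF:                    # small input: low-overhead O(n^2) sort
            return insertion_sort(xs)
        mid = len(xs) // 2
        left, right = hybrid_sort(xs[:mid]), hybrid_sort(xs[mid:])
        merged, i, j = [], 0, 0                  # merge step of merge sort
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i]); i += 1
            else:
                merged.append(right[j]); j += 1
        return merged + left[i:] + right[j:]

The point is not this particular cutoff but the principle: for small inputs the constant factors of the simple algorithm win, so the library picks whichever is cheaper for the size at hand.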

To what extent shall we optimize time complexity?

Theory vs practice here.
Regarding time complexity, I have a conceptual question that we didn't get to explore more deeply in class.
Here it is:
There's a barbaric brute-force algorithm, O(n^3)... and we got it down to O(n), which was considered good enough. If we dig deeper, it is actually O(n) + O(n), i.e. two separate iterations over the input. I came up with another way which was actually O(n/2). But those two algorithms are considered the same, since both are O(n): as n approaches infinity the difference doesn't matter, so further optimization was deemed unnecessary once we reached O(n).
My question is:
In reality, in practice, we always have a finite number of inputs (admittedly occasionally in the trillions). So following the time complexity logic, O(n/2) is four times as fast as O(2n). So if we can make it faster, why not?
Time complexity is not everything. As you already noticed, the Big-Oh can hide a lot and also assumes that all operations cost the same.
In practice you should always try to find a fast (ideally the fastest) solution to your problem. Sometimes this means using an algorithm with a worse complexity but good constants, if you know that your problem instances are always small. Depending on your use case, you may also want to implement optimizations that exploit hardware properties, such as cache-friendly memory access.
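To make the O(n) + O(n) versus O(n/2) discussion concrete, here is a hypothetical example: finding both the minimum and maximum of a list in two separate passes versus one pairwise pass. Both are O(n); the second simply does fewer comparisons per element.

    # Hypothetical illustration: find both min and max of a non-empty list.

    def min_max_two_passes(xs):
        # Two separate O(n) iterations: "O(n) + O(n)", still O(n) overall.
        return min(xs), max(xs)

    def min_max_pairwise(xs):
        # Process elements in pairs: about 3 comparisons per 2 elements instead of 4,
        # roughly the "half the iterations" idea -- asymptotically still O(n).
        it = iter(xs)
        lo = hi = next(it)
        for a in it:
            b = next(it, a)                      # pair up; reuse a if the length is odd
            small, big = (a, b) if a <= b else (b, a)
            if small < lo: lo = small
            if big > hi: hi = big
        return lo, hi

Both functions have the same complexity class, but on a machine with real costs the constant-factor difference can still be worth measuring.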

Implement an iterator on a binary heap

I am looking for a way to implement an iterator on binary heaps (maximum or minimum).
That is, calling its nextNode() function for the i-th time returns the i-th largest (or smallest) element in the heap.
Note that this operation happens without actually extracting the heap’s root!
My initial thoughts were:
Actually extract i elements, push them into a stack, and then insert them back into the heap after getting the i-th value. This takes O(i*log(n)) for each function call.
Keep an auxiliary sorted data structure, which allows looking up the next value in O(1); however, updates would take O(n).
I understand these approaches eliminate the benefits of using heaps, so I’m looking for a better approach.
It's not clear what the use-case for this is, so it's hard to say what would make a solution viable, or better than any other solution.
That said, I suggest a small alteration to the general "extract and sort" ideas already thrown around: If we're fine making changes to the data structure, we can do our sorting in place.
The basic implementation suggested on Wikipedia is a partially sorted list under the hood. We can pay a (hopefully) one-time O(n log(n)) cost to sort our heap the first time next() is called, after which next is O(1). Critically, a fully sorted list is still a valid heap.
Furthermore, if you consider the heapsort algorithm, you can skip its first stage (building the heap), because you already start with a valid heap.
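Here is a rough sketch of that idea in Python using the standard heapq module (the lazy sort-on-first-call behaviour and the names are my own, for illustration):

    import heapq

    class HeapIterator:
        """Min-heap that can be iterated in sorted order without extracting the root."""

        def __init__(self, items):
            self.heap = list(items)
            heapq.heapify(self.heap)      # O(n) heap construction
            self._sorted = False
            self._pos = 0

        def push(self, item):
            heapq.heappush(self.heap, item)
            self._sorted = False          # a new item may change the iteration order
            self._pos = 0                 # design choice: restart iteration after changes

        def next_node(self):
            if not self._sorted:
                self.heap.sort()          # one-time O(n log n); a sorted list is still a valid min-heap
                self._sorted = True
            value = self.heap[self._pos]  # O(1) per call; the root is never extracted
            self._pos += 1
            return value

Because the sorted list still satisfies the heap property, push() and any other heap operations keep working between iterations; the cost is simply that the next call to next_node() re-sorts.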

Optimizing a genetic algorithm?

I've been playing with parallel processing of genetic algorithms to improve performance but I was wondering what some other commonly used techniques are to optimize a genetic algorithm?
Since the same fitness values are frequently recalculated (the diversity of the population decreases as the algorithm runs, so the same chromosomes keep reappearing), a good strategy to improve the performance of a GA is to reduce the time needed to calculate the fitness.
Details depend on the implementation, but previously calculated fitness values can often be saved efficiently in a hash table. This kind of optimization can cut computation time significantly (e.g. "Improving Genetic Algorithms Performance by Hashing Fitness Values" by Richard J. Povinelli and Xin Feng reports that applying hashing to a GA can improve performance by over 50% for complex real-world problems).
A key point is collision management: you can simply overwrite the existing element of the hash table, or adopt some sort of probing scheme (e.g. linear probing).
In the latter case, as collisions mount, the efficiency of the hash table degrades to that of a linear search. When the cumulative number of collisions exceeds the size of the hash table, a rehash should be performed: you have to create a larger hash table and copy the elements from the smaller hash table to the larger one.
The copy step could be omitted: the diversity decreases as the GA runs, so many of the eliminated elements will not be used and the most frequently used chromosome values will be quickly recalculated (the hash table will fill up again with the most used key element values).
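A minimal sketch of the overwrite-on-collision variant might look like this (the key scheme and the table size are assumptions for illustration, not taken from the paper):

    # Fixed-size fitness cache with overwrite-on-collision (illustrative sketch).
    class FitnessCache:
        def __init__(self, size=4096):
            self.size = size
            self.slots = [None] * size            # each slot holds (genes, fitness) or None

        def get_or_compute(self, chromosome, fitness_fn):
            genes = tuple(chromosome)
            slot = hash(genes) % self.size
            entry = self.slots[slot]
            if entry is not None and entry[0] == genes:
                return entry[1]                   # cache hit: fitness is not recomputed
            fitness = fitness_fn(chromosome)      # cache miss: evaluate once...
            self.slots[slot] = (genes, fitness)   # ...and overwrite any colliding entry
            return fitness

Overwriting keeps the table at a fixed size; the trade-off is that a collision evicts a previously cached value, which matters less as the population converges on a small set of chromosomes.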
One thing I have done is to limit the number of fitness calculations. For example, where the fitness landscape is not noisy, i.e. where recalculating the fitness would give the same answer every time, don't recalculate; simply cache the answer.
Another approach is to use a memory operator. The operator maintains a 'memory' of solutions and ensures that the best solution in that memory is included in the GA population if it is better than the best in the population. The memory is kept up to date with good solutions during the GA run. This approach can reduce the number of fitness calculations required and improve performance (see the sketch after the links below).
I have examples of some of this stuff here:
http://johnnewcombe.net/blog/gaf-part-8/
http://johnnewcombe.net/blog/gaf-part-3/
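For the memory-operator idea, a hypothetical sketch (assuming individuals are stored as (fitness, genes) pairs, so the comparisons here need no extra fitness evaluations) could look like:

    # Hypothetical sketch of a memory operator; individuals are (fitness, genes) pairs.
    def apply_memory(population, memory, memory_size=10):
        # Keep the memory up to date with the best solutions seen so far.
        memory = sorted(memory + population, reverse=True)[:memory_size]
        best_remembered = memory[0]
        if best_remembered > max(population):
            # Re-inject the remembered best in place of the current worst individual.
            worst = min(range(len(population)), key=lambda i: population[i])
            population[worst] = best_remembered
        return population, memory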
This is a very broad question; I suggest using the R galgo package for this purpose.

Need Help Studying Running Times

At the moment, I'm studying for a final exam for a Computer Science course. One of the questions will most likely be about how to combine running times, so I'll give an example.
I was wondering: if I created a program that preprocessed inputs using Insertion Sort and then searched for a value "X" using Binary Search, how would I combine the running times to find the best, worst, and average case time complexities of the overall program?
For example...
Insertion Sort
    Worst Case:   O(n^2)
    Best Case:    O(n)
    Average Case: O(n^2)

Binary Search
    Worst Case:   O(log n)
    Best Case:    O(1)
    Average Case: O(log n)
Would the worst case be O(n^2 + log n), or would it be O(n^2), or neither?
Would the best case be O(n)?
Would the average case be O(n log n), O(n + log n), O(log n), O(n^2 + log n), or none of these?
I tend to over-think solutions, so if I can get any guidance on combining running times, it would be much appreciated.
Thank you very much.
You usually don't "combine" (as in add) the running times to determine the overall efficiency class; rather, you take the term that dominates (takes the longest) for each of the worst, average, and best cases.
So if you're going to perform insertion sort and then do a binary search to find an element X in an array, the worst case is O(n^2) and the best case is O(n), both dominated by the insertion sort since it takes the longest.
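To see why, you can write the two phases out explicitly: the total cost is the sum of the phases, and the dominant term wins. A rough sketch (illustrative only, using Python's bisect module for the search):

    import bisect

    def preprocess_and_search(values, x):
        # Phase 1: insertion sort -- worst/average O(n^2), best O(n) for already-sorted input.
        for i in range(1, len(values)):
            v, j = values[i], i - 1
            while j >= 0 and values[j] > v:
                values[j + 1] = values[j]
                j -= 1
            values[j + 1] = v
        # Phase 2: binary search -- O(log n).
        # Total (worst): O(n^2) + O(log n) = O(n^2), since n^2 dominates log n.
        # Total (best):  O(n) + O(log n) = O(n).
        i = bisect.bisect_left(values, x)
        return i if i < len(values) and values[i] == x else -1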
Based on my limited study (we haven't reached amortization yet, so that may be where Jim has the rest correct), you basically go by whichever part of the overall algorithm is slowest.
This seems to be a good book on the subject of Algorithms (I haven't got much to compare to):
http://www.amazon.com/Introduction-Algorithms-Third-Thomas-Cormen/dp/0262033844/ref=sr_1_1?ie=UTF8&qid=1303528736&sr=8-1
MIT also has a full course on algorithms on their site; here is the link for that too:
http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-046j-introduction-to-algorithms-sma-5503-fall-2005/
I've actually found it helpful; it might not answer your question specifically, but I think seeing some of the topics explained a few times will make you more confident.