What benefit does a balanced search tree provide over a sorted key-value pair array?

public class Entry {
    int key;
    String value;
}
If you have an array of Entry.
Entry[]
You can do a binary search on this array to find, insert, or remove an Entry, all in O(log n). You can also do a range search in O(log n).
And this is very simple.
What does a comparatively complicated data structure like a red-black balanced search tree give me over a simple sorted key-value array?

If the data is immutable, the tree has no benefit over the array.
The array's main benefit is locality of reference: the data sits close together in memory, so the CPU can cache it effectively.
Because the array is sorted, search is O(log n) in both structures.
Things change once you add or remove items.
For a small number of elements the array is better (faster), again because of the locality of reference.
For a larger number of items a red-black tree (or another self-balancing tree) will perform better, because the array has to shift its elements on every update:
insert and delete take O(log n) to find the position, plus roughly n/2 element moves for the shift, i.e. O(n) overall.
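To make the trade-off concrete, here is a minimal Java sketch (the sorted-array side is a hypothetical, fixed-capacity illustration, not a production map) contrasting an insert that must shift the tail of a sorted array with an insert into java.util.TreeMap, a red-black tree:

import java.util.Arrays;
import java.util.TreeMap;

public class InsertComparison {
    // Hypothetical sorted-array map with a fixed capacity (growth handling omitted).
    static int[] keys = new int[16];
    static String[] values = new String[16];
    static int size = 0;

    static void sortedArrayPut(int key, String value) {
        int pos = Arrays.binarySearch(keys, 0, size, key);   // O(log n) find
        if (pos >= 0) { values[pos] = value; return; }       // key already present
        int insertAt = -pos - 1;
        // The shift below is the O(n) cost the answer above refers to.
        System.arraycopy(keys, insertAt, keys, insertAt + 1, size - insertAt);
        System.arraycopy(values, insertAt, values, insertAt + 1, size - insertAt);
        keys[insertAt] = key;
        values[insertAt] = value;
        size++;
    }

    public static void main(String[] args) {
        sortedArrayPut(3, "c");
        sortedArrayPut(1, "a");                  // shifts the entry for key 3 to the right

        // A red-black tree does the same job in O(log n) with no shifting.
        TreeMap<Integer, String> tree = new TreeMap<>();
        tree.put(3, "c");
        tree.put(1, "a");
        System.out.println(tree.firstKey());     // prints 1
    }
}

Both finds are O(log n); the difference is the System.arraycopy shift, which touches O(n) entries, versus the tree's pointer updates and rebalancing, which touch O(log n) nodes.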

Related

Data structure with quick min, delete, insert, search for big compute job

I'm looking for a data structure that will let me perform the operations I need efficiently. I expect to traverse a loop between 10^11 and 10^13 times, so Ω(n) operations are right out. (I'll try to trim n down so it can fit in cache, but it won't be small.) Each time through the loop I will call
Min exactly once
Delete exactly once (delete the minimum, if that helps)
Insert 0 to 2 times, with an average of somewhat more than 1
Search once for each insert
I only care about average or amortized performance, not worst-case. (The calculation will take ages, it's no concern if bits of the calculation stall from time to time.) The data will not be adversarial.
What kinds of structures should I consider? Maybe there's some kind of heap modified to have quick search?
A balanced tree is quite a good data structure for this usage. All of the specified operations run in O(log n). You can also write an optimized tree implementation so that the minimum is retrieved in O(1) (by keeping an iterator/pointer to the minimum node, and possibly a copy of its value for faster fetches). The resulting running time of the algorithm will be O(m log n), where m is the number of iterations and n the number of items in the data structure.
This is the optimal algorithmic complexity. Indeed, if each iteration could be done in (amortized) O(1), then each of the four operations would have to run in (amortized) O(1) as well. Suppose a data structure S could be built with these properties. One could then write the following algorithm (in Python):
def superSort(input):
    s = S()
    inputSize = len(input)
    for i in range(inputSize):
        s.insert(input[i])
    output = list()
    for i in range(inputSize):
        output.append(s.getMin())
        s.deleteMin()
    return output
With such an S, superSort would sort n items in (amortized) O(n) time. However, comparison-based sorting is proven to require Ω(n log n) time. Thus S cannot exist, and at least one of the four operations must cost at least Ω(log n) (amortized).
Note that naive binary tree implementations are often pretty inefficient. There is a lot of optimization you can perform to make them much faster. For example, you can pack several keys per node (see B-trees), store the nodes in an array (assuming the number of items is bounded), use relaxed balancing, possibly based on randomization (see treaps), use small references (e.g. 16-bit or 32-bit indices rather than 64-bit pointers), etc. You can start with a naive AVL tree or a splay tree.
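As a rough illustration of the "nodes in an array with small indices" idea, here is a hedged Java sketch. It is a plain BST with insertion only; balancing, deletion and pool growth are deliberately omitted, since the point is just the memory layout:

public class IndexBst {
    // Node fields live in parallel int arrays; a 32-bit index replaces a 64-bit pointer.
    static final int NIL = -1;
    final int[] key, left, right;
    int root = NIL, used = 0;

    IndexBst(int capacity) {
        key = new int[capacity];
        left = new int[capacity];
        right = new int[capacity];
        java.util.Arrays.fill(left, NIL);
        java.util.Arrays.fill(right, NIL);
    }

    void insert(int k) {
        int node = used++;                       // allocate from the pool, no per-node allocation
        key[node] = k;
        if (root == NIL) { root = node; return; }
        int cur = root;
        while (true) {
            if (k < key[cur]) {
                if (left[cur] == NIL) { left[cur] = node; return; }
                cur = left[cur];
            } else {
                if (right[cur] == NIL) { right[cur] = node; return; }
                cur = right[cur];
            }
        }
    }

    int min() {                                  // leftmost key; assumes the tree is non-empty
        int cur = root;
        while (left[cur] != NIL) cur = left[cur];
        return key[cur];
    }
}

All node fields sit in three contiguous int arrays, which is friendlier to the cache and halves the per-link memory cost compared with 64-bit pointers.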
My suggested data structure requires more work to implement, but it does achieve the desired results.
A data structure with {insert, delete, findMin, search} operations can be implemented using an AVL tree, which ensures that each operation runs in O(log n) while findMin runs in O(1).
I'll dive a bit into the implementation:
The tree holds a pointer to the minimum node, which is updated on each insertion and deletion, so findMin takes O(1).
insert works as in every AVL tree and takes O(log n) (using the balance factor and rotations to rebalance the tree). After you insert an element, you update the minimum pointer by walking all the way to the left from the root, which also takes O(log n) since the tree height is O(log n).
Likewise, after a delete you update the minimum pointer in the same fashion, so it takes O(log n).
Finally, search also takes O(log n).
If more assumptions were given, e.g. that the inserted elements lie within a certain range of the minimum, then you could also give each node successor and predecessor pointers. These can be updated in O(log n) during insertions and deletions and then followed in O(1), without traversing the whole tree, so searching near recently inserted elements becomes faster.
The successor of an inserted node is found by going to its right child and then all the way to the left. If there is no right child, climb up through the parents as long as the current node is not the left child of its parent; the parent reached when this stops is the successor.
The predecessor is found in the mirror-image way.
In C++ a node would look something like this:
template <class Key, class Value>
class AvlNode {
private:
    Key key;
    Value value;
    int Height;
    int BF;                // balance factor
    AvlNode* Left;
    AvlNode* Right;
    AvlNode* Parent;
    AvlNode* Succ;
    AvlNode* Pred;
public:
    ...
};
While the tree would look something like this:
template <class Key, class Value>
class AVL {
private:
    int NumOfKeys;
    int Height;
    AvlNode<Key, Value>* Minimum;   // pointer to the minimum node, for O(1) findMin
    AvlNode<Key, Value>* Root;
    static void swapLL(AVL<Key, Value>* avl, AvlNode<Key, Value>* root);
    static void swapLR(AVL<Key, Value>* avl, AvlNode<Key, Value>* root);
    static void swapRL(AVL<Key, Value>* avl, AvlNode<Key, Value>* root);
    static void swapRR(AVL<Key, Value>* avl, AvlNode<Key, Value>* root);
public:
    ...
};
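If you would rather not hand-roll the AVL machinery, the same minimum-pointer bookkeeping can be sketched on top of java.util.TreeMap (a red-black tree, so the O(log n) bounds still hold). This is a hedged illustration of the idea, not the answer's exact structure:

import java.util.TreeMap;

public class MinTrackingMap<K extends Comparable<K>, V> {
    private final TreeMap<K, V> tree = new TreeMap<>();
    private K min;                              // cached minimum key, for O(1) findMin

    public void insert(K key, V value) {        // O(log n)
        tree.put(key, value);
        if (min == null || key.compareTo(min) < 0) min = key;   // O(1) pointer update
    }

    public K findMin() {                        // O(1)
        return min;
    }

    public void deleteMin() {                   // O(log n)
        if (min == null) return;
        tree.remove(min);
        min = tree.isEmpty() ? null : tree.firstKey();   // walk to the leftmost key: O(log n)
    }

    public boolean search(K key) {              // O(log n)
        return tree.containsKey(key);
    }
}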
From what you told us, I think I would use an open-addressed hash table for search and a heap to keep track of the minimum.
In the heap, instead of storing values, you would store indexes/pointers to the items in the hash table. That way when you delete min from the heap, you can follow the pointer to find the item you need to delete from the hash table.
The total memory overhead will be 3 or 4 words per item -- about the same as a balanced tree, but the implementation is simpler and faster.
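A hedged Java approximation of this idea, using the standard-library HashMap and PriorityQueue instead of a hand-rolled open-addressed table plus index heap (so the index-following detail is simplified away):

import java.util.HashMap;
import java.util.PriorityQueue;

public class HashPlusHeap<V> {
    private final HashMap<Long, V> table = new HashMap<>();        // search: O(1) on average
    private final PriorityQueue<Long> heap = new PriorityQueue<>(); // min: O(1), delete-min: O(log n)

    public void insert(long key, V value) {
        if (table.put(key, value) == null) {    // only push keys that are actually new
            heap.add(key);
        }
    }

    public boolean search(long key) {
        return table.containsKey(key);
    }

    public Long min() {
        return heap.peek();
    }

    public V deleteMin() {
        Long k = heap.poll();                               // remove from the heap...
        return (k == null) ? null : table.remove(k);        // ...then from the hash table
    }
}

As noted above, a real implementation would store heap entries as indexes into the open-addressed table to avoid the boxing and the extra hash lookup shown here.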

Why is the worst-case insertion time complexity of a hash table not N log N

Looking at the fundamental structure of a hash table, we know that it resizes with respect to the load factor or some other deterministic parameter. I get that if the resizing limit is reached during an insertion, we need to create a bigger hash table and insert everything there. Here is the thing I don't get.
Let's consider a hash table where each bucket contains an AVL-balanced BST. If my hash function returned the same index for every key, then I would store everything in the same AVL tree. I know that this would be a really bad hash function and would never be used, but I'm considering the worst-case scenario here. So after some time, let's say the resizing threshold has been reached. In order to resize, I create a new hash table and try to insert every element from the previous table. Since the hash function maps everything back into one AVL tree, I would need to insert all N elements into the same AVL tree. N insertions into an AVL tree take N log N. So why is the worst case of insertion for hash tables considered O(N)?
Here is the proof that adding N elements into an AVL tree is N log N:
Running time of adding N elements into an empty AVL tree
In short: it depends on how the bucket is implemented. With a linked list, insertion can be done in O(n) under certain conditions. For an implementation with AVL trees as buckets, the worst case can indeed be O(n log n). In order to state the time complexity, the implementation of the buckets has to be known.
Frequently a bucket is not implemented with an AVL tree, or a tree in general, but with a linked list. If there is a reference to the last entry of the list, appending can be done in O(1). Otherwise we can still reach O(1) by prepending to the linked list (in that case the bucket stores its data in reverse insertion order).
The idea behind using a linked list is that a dictionary with a reasonable hash function should produce few collisions. Frequently a bucket has zero or one element, sometimes two or three, but rarely much more. In that case a simple data structure can be faster, since a simpler data structure usually requires fewer cycles per iteration.
Some hash tables use open addressing, where buckets are not separate data structures; if a bucket is already taken, the next free bucket is used. In that case a search iterates over the used buckets until it finds a matching entry or reaches an empty bucket.
The Wikipedia article on Hash tables discusses how the buckets can be implemented.
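To make the "prepend in O(1)" point concrete, here is a minimal, hedged Java sketch of a separate-chaining table (fixed capacity, no resizing, and put assumes the key is not already present; illustration only):

public class ChainedHashTable<K, V> {
    private static class Node<K, V> {
        final K key;
        final V value;
        final Node<K, V> next;
        Node(K key, V value, Node<K, V> next) { this.key = key; this.value = value; this.next = next; }
    }

    @SuppressWarnings("unchecked")
    private final Node<K, V>[] buckets = new Node[64];    // fixed size; resizing omitted

    private int index(K key) {
        return (key.hashCode() & 0x7fffffff) % buckets.length;
    }

    public void put(K key, V value) {
        int i = index(key);
        // Prepend the new node: O(1) no matter how long the chain already is.
        buckets[i] = new Node<>(key, value, buckets[i]);
    }

    public V get(K key) {
        for (Node<K, V> n = buckets[index(key)]; n != null; n = n.next) {
            if (n.key.equals(key)) return n.value;         // chain scan: O(chain length)
        }
        return null;
    }
}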

Given an array of N integers how to find the largest element which appears an even number of times in the array with minimum time complexity

You are given an array of N integers. You are asked to find the largest element which appears an even number of times in the array. What is the time complexity of your algorithm? Can you do this without sorting the entire array?
You could do it in O(n log n) with a table-lookup method. For each element in the list, look it up in the table. If it is missing, insert a key-value pair with the element as the key and its number of appearances (starting at one) as the value; if it is present, increment the count. At the end, loop through the table in O(n) and look for the largest key with an even count.
In theory, for an ideal hash table, a lookup operation is O(1), so you can find and/or insert all n elements in O(n) time, making the total complexity O(n). However, in practice you will have trouble with space allocation (you need much more space than the data set size) and with collisions (which is why you need that space). This makes the O(1) lookup difficult to achieve; in the worst-case scenario a lookup can cost as much as O(n) (though this is unlikely), making the total complexity O(n^2).
Instead, you can play it safer with a tree-based table, that is, one whose keys are stored in a binary search tree. Lookup and insertion are O(log n) in this case, provided the tree is balanced; there is a wide range of tree structures to help ensure this, e.g. red-black trees, AVL trees, splay trees, B-trees, etc. (Google is your friend). This makes the total complexity a guaranteed O(n log n).
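A hedged Java sketch of the counting approach (a HashMap here, so O(n) on average; swap in a TreeMap for the guaranteed O(n log n) variant described above):

import java.util.HashMap;
import java.util.Map;

public class LargestEvenCount {
    // Returns the largest value that appears an even number of times, or null if none does.
    static Integer largestEvenCount(int[] a) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int x : a) {
            counts.merge(x, 1, Integer::sum);              // increment, starting at one
        }
        Integer best = null;
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            if (e.getValue() % 2 == 0 && (best == null || e.getKey() > best)) {
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(largestEvenCount(new int[]{5, 3, 5, 2, 3, 3}));   // prints 5
    }
}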

Associative arrays in oracle

Associative arrays, as I understand them, store key-value pairs and are of variable length: we can add any number of key-value pairs to an associative array.
I also read that you should use a WHILE loop to traverse a sparse associative array and a FOR loop to traverse a dense associative array.
How can an associative array be sparse when it is dynamic and we are just adding values to it?
Associative arrays are sparse because they are stored in the order of the hash of their key, not in the order they were inserted. An array is dense because elements are always appended to the end as they are added.
When you perform operations like insert on an array, you are actually creating a new array and appending values. This makes inserts "expensive": they require more CPU time to find the insertion point and more memory to store the intermediate copies while the insertion takes place. With an associative array, insertion (as long as it doesn't expand the array beyond the hash key size) is fast, taking a predictably small amount of CPU and memory.
The other thing that is expensive with arrays is looking up a specific value by its key. With an associative array you can quickly look up any element (or know immediately that there is no element with that key), while with an array you have to test every index to know where, or whether, an element exists. On small sets this might not seem like a big deal, but these problems only get worse as your sets grow.
Don't think associative arrays are the best and only way, though. They get their speed by using more memory. Also, iterating over all keys in an associative array (depending on the implementation) can be slower than iterating through a dense array. As always, try to choose the best tool for the job.
Associative arrays are dense or sparse depending on how you index them.
If you index with a primary key, a PLS_INTEGER, or something else that packs the data densely, then the associative array is dense, and it will be fast to fetch data.
Whereas if you index by some VARCHAR2 column, or another key that cannot be enumerated easily, then that associative array is sparse.

Design a highly optimized datastructure to perform three operations insert, delete and getRandom

I just had a software interview. One of the questions was to design any data structure with three methods, insert, delete and getRandom, in a highly optimized way. The interviewer asked me to think of a combination of data structures to design a new one. Insert can be designed any way, but for getRandom and delete I need to get the position of a specific element. He gave me a hint to think about the data structure which takes minimum time for sorting.
Any answer or discussion is welcome.
Let t be the type of the elements you want to store in the data structure.
Have an extensible array elements containing all the elements in no particular order. Have a hash table indices that maps elements of type t to their position in elements.
Inserting e means:
add e at the end of elements (i.e. push_back) and note its position i;
insert the mapping (e, i) into indices.
Deleting e means:
find the position i of e in elements thanks to indices;
overwrite e with the last element f of elements and shrink elements by one;
update indices: remove the mapping for e and change f's mapping from the last position to i.
Drawing one element at random (leaving it in the data structure, i.e. it's a peek, not a pop) is simply drawing an integer i uniformly in [0, elements.size()) and returning elements[i].
Assuming the hash table is well suited to your elements of type t, all three operations are O(1).
Be careful about the corner cases where there are 0 or 1 elements in the data structure.
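A hedged Java sketch of this recipe (ArrayList plus HashMap; it assumes distinct elements and a non-empty structure for getRandom, matching the corner-case caveat above):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Random;

public class RandomizedSet<T> {
    private final ArrayList<T> elements = new ArrayList<>();      // elements, in no particular order
    private final HashMap<T, Integer> indices = new HashMap<>();  // element -> position in elements
    private final Random rng = new Random();

    public void insert(T e) {
        if (indices.containsKey(e)) return;      // ignore duplicates
        indices.put(e, elements.size());
        elements.add(e);                         // push_back
    }

    public void delete(T e) {
        Integer i = indices.remove(e);           // drop e's mapping
        if (i == null) return;                   // not present
        int last = elements.size() - 1;
        T f = elements.get(last);
        elements.set(i, f);                      // overwrite e with the last element f
        elements.remove(last);                   // shrink the array by one
        if (i != last) indices.put(f, i);        // move f's mapping to position i
    }

    public T getRandom() {                       // peek, not pop; assumes the set is non-empty
        return elements.get(rng.nextInt(elements.size()));
    }
}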
A tree might work well here. Insert and delete are O(log n), and getRandom could also be O(log n): start at the root and at each junction choose a child at random (weighted by the number of leaves under each child) until you reach a leaf.
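A hedged Java sketch of this weighted random descent, on a size-augmented BST that stores a value at every node rather than only at the leaves; balancing and deletion are omitted, so the O(log n) bounds assume the tree stays reasonably balanced:

import java.util.Random;

public class SizeAugmentedBst {
    private static class Node {
        final int key;
        int size = 1;                // number of keys in this subtree
        Node left, right;
        Node(int key) { this.key = key; }
    }

    private Node root;
    private final Random rng = new Random();

    public void insert(int key) {
        root = insert(root, key);
    }

    private Node insert(Node n, int key) {
        if (n == null) return new Node(key);
        if (key < n.key) n.left = insert(n.left, key);
        else n.right = insert(n.right, key);
        n.size++;                    // every node on the path gains one descendant
        return n;
    }

    // Pick a uniformly random stored key by descending with subtree-size weights.
    public int getRandom() {         // assumes the tree is non-empty
        Node n = root;
        int r = rng.nextInt(n.size); // target rank in [0, size)
        while (true) {
            int leftSize = (n.left == null) ? 0 : n.left.size;
            if (r < leftSize) {
                n = n.left;
            } else if (r == leftSize) {
                return n.key;        // the current node holds the r-th smallest key
            } else {
                r -= leftSize + 1;   // skip the left subtree and the current node
                n = n.right;
            }
        }
    }
}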
The data structure that takes the least time to sort is a sorted array.
get_random() can be done by picking a random index, which is O(1), and search is a binary search in O(log n).
insert() and delete(), however, involve adding or removing the element in question and then shifting the rest of the array to keep it sorted, which is O(n), i.e. horrendous.
I think the hint was poor. You may have been in a bad interview.
What I feel is that you can use some balanced variant of a tree, like a red-black tree. This will give O(log n) insertion and deletion time.
For getting a random element, you could keep an additional hash table to track the elements that are in the tree structure.
It might be a heap (data structure).