Why is Hash Table insertion time complexity worst case is not N log N - time-complexity

Looking at the fundamental structure of hash table. We know that it resizes WRT load factor or some other deterministic parameter. I get that if the resizing limit is reached within an insertion we need to create a bigger hash table and insert everything there. Here is the thing which I don't get.
Let's consider a hash table where each bucket contains an AVL - balanced BST. If my hash function returns the same index for every key then I would store everything in the same AVL tree. I know that this hash function would be a really bad function and would not be used but I'm doing a worst case scenario here. So after some time let's say that resizing factor has been reached. So in order to resize I created a new hash table and tried to insert every old elements in my previous table. Since the hash function mapped everything back into one AVL tree, I would need to insert all the N elements into the same AVL. N insertion on an AVL tree is N logN. So why is the worst case of insertion for hash tables considered O(N)?
Here is the proof of adding N elements into Avl three is N logN:
Running time of adding N elements into an empty AVL tree

In short: it depends on how the bucket is implemented. With a linked list, it can be done in O(n) under certain conditions. For an implementation with AVL trees as buckets, this can indeed, wost case, result in O(n log n). In order to calculate the time complexity, the implementation of the buckets should be known.
Frequently a bucket is not implemented with an AVL tree, or a tree in general, but with a linked list. If there is a reference to the last entry of the list, appending can be done in O(1). Otherwise we can still reach O(1) by prepending the linked list (in that case the buckets store data in reversed insertion order).
The idea of using a linked list, is that a dictionary that uses a reasonable hashing function should result in few collisions. Frequently a bucket has zero, or one elements, and sometimes two or three, but not much more. In that case, a simple datastructure can be faster, since a simpler data structure usually requires less cycles per iteration.
Some hash tables use open addressing where buckets are not separated data structures, but in case the bucket is already taken, the next free bucket is used. In that case, a search will thus iterate over the used buckets until it has found a matching entry, or it has reached an empty bucket.
The Wikipedia article on Hash tables discusses how the buckets can be implemented.

Related

What is indexing? Why don't we use hashing for everything?

Going over some interview info about data structures etc.
So, as I understand, arrays are O(1) for indexing, which I believe means finding the specific element contained at space x in the array. Just want to confirm this as I am second guessing myself.
Also, hash maps are O(1) for indexing, searching, insertion and deletion. Does that not kind of make any data structure question pointless, since a hash map will always be the best solution?
Thanks
Well indexing is not only about arrays,
according to this - indexing is creating tables (indexes) that point to the location of folders, files and records. Depending on the purpose, indexing identifies the location of resources based on file names, key data fields in a database record, text within a file or unique attributes in a graphics or video file.
For your second question hash maps are not absolute or best data structures for various reasons, mainly:
Collisions
Hash function calculation time
Extra memory used
Also there's lots of Data Structure questions where hashmaps are not superior:
Data structure for finding k-th minimum element and supporting updates (Hashmap would be like bruteforce because it does not keep elements sorted, so we need something like Balanced binary search tree)
Data structure for finding if word is in dictionary (Sure hashmap works but Trie is so much faster & less memory)
Data structure for finding minimum element in any range of an array with updates (Once again hashmap is just too slow for this, we need something like segment tree)
...

Neo4j scalability and indexing

An argument in favor of graph dbms with native storage over relational dbms made by neo4j (also in the neo4j graph databases book) is that "index-free adjacency" is the most efficient means of processing data in a graph (due to the 'clustering' of the data/nodes in a graph-based model).
Based on some benchmarking I've performed, where 3 nodes are sequentially connected (A->B<-C) and given the id of A I'm querying for C, the scalability is clearly O(n) when testing the same query on databases with 1M, 5M, 10M and 20M nodes - which is reasonable (with my limited understanding) considering I am not limiting my query to 1 node only hence all nodes need to be checked for matching. HOWEVER, when I index the queried node property, the execution time for the same query, is relatively constant.
Figure shows execution time by database node size before and after indexing. Orange plot is O(N) reference line, while the blue plot is the observed execution times.
Based on these results I'm trying to figure out where the advantage of index-free adjacency comes in. Is this advantageous when querying with a limit of 1 for deep(er) links? E.g. depth of 4 in A->B->C->D->E, and querying for E given A. Because in this case we know that there is only one match for A (hence no need to brute force through all the other nodes not part of this sub-network).
As this is highly dependent on the query, I'm listing an example of the Cypher query below for reference (where I'm matching entity labeled node with id of 1, and returning the associated node (B in the above example) and the secondary-linked node (C in the above example)):
MATCH (:entity{id:1})-[:LINK]->(result_assoc:assoc)<-[:LINK]-(result_entity:entity) RETURN result_entity, result_assoc
UPDATE / ADDITIONAL INFORMATION
This source states: "The key message of index-free adjacency is, that the complexity to traverse the whole graph is O(n), where n is the number of nodes. In contrast, using any index will have complexity O(n log n).". This statement explains the O(n) results before indexing. I guess the O(1) performance after indexing is identical to a hash list performance(?). Not sure why using any other index the complexity is O(n log n) if even using a hash list the worst case is O(n).
From my understanding, the index-free aspect is only pertinent for adjacent nodes (that's why it's called index-free adjacency). What your plots are demonstrating, is that when you find A, the additional time to find C is negligible, and the question of whether to use an index or not, is only to find the initial queried node A.
To find A without an index it takes O(n), because it has to scan through all the nodes in the database, but with an index, it's effectively like a hashlist, and takes O(1) (no clue why the book says O(n log n) either).
Beyond that, finding the adjacent nodes are not that hard for Neo4j, because they are linked to A, whereas in RM the linkage is not as explicit - thus a join, which is expensive, and then scan/filter is required. So to truly see the advantage, one should compare the performance of graph DBs and RM DBs, by varying the depth of the relations/links. It would also be interesting to see how a query would perform when the neighbours of the entity nodes are increased (ie., the graph becomes denser) - does Neo4j rely on the graphs never being too dense? Otherwise the problem of looking through the neighbours to find the right one repeats itself.

Given an array of N integers how to find the largest element which appears an even number of times in the array with minimum time complexity

You are given an array of N integers. You are asked to find the largest element which appears an even number of times in the array. What is the time complexity of your algorithm? Can you do this without sorting the entire array?
You could do it in O(n log n) with a table lookup method. For each element in the list, look it up in the table. If it is missing, insert a key-value pair with the key being the element and the value as the number of appearances (starting at one); if it is present, increment the appearances. At the end just loop through the table in O(n) and look for the largest key with an even value.
In theory for an ideal hash-table, a lookup operation is O(1). So you can find and/or insert all n elements in O(n) time, making the total complexity O(n). However, in practice you will have trouble with space allocation (need much more space than data set size) and collisions (why you need it). This makes the O(1) lookup very difficult to achieve; in the worst case scenario it can be as much as O(n) (though also unlikely) - making the total complexity O(n^2).
Instead you can be more secure with a tree-based table - that is, the keys are stored in a binary tree. Lookup and insertion operations are all O(log n) in this case, provided that the tree is balanced; there are a wide range of tree structures to help ensure this e.g. Red-Black trees, AVL, splay, B-trees etc (Google is your friend). This will make the total complexity a guaranteed O(n log n).

Design a highly optimized datastructure to perform three operations insert, delete and getRandom

I just had a software interview. One of the questions was to design any datastructure with three methods insert, delete and getRandom in a highly optimized way. The interviewer asked me to think of a combination of datastructures to design a new one. Insert can be designed anyway but for random and delete i need to get the position of specific element. He gave me a hint to think about the datastructure which takes minimum time for sorting.
Any answer or discussion is welcomed....
Let t be the type of the elements you want to store in the datastructure.
Have an extensible array elements containing all the elements in no particular order. Have a hashtable indices that maps elements of type t to their position in elements.
Inserting e means
add e at the end of elements (i.e. push_back), get its position i
insert the mapping (e,i) into `indices
deleting e means
find the position i of e in elements thanks to indices
overwrite e with the last element f of elements
update indices: remove the mapping (f,indices.size()) and insert (f,i)
drawing one element at random (leaving it in the datastructure, i.e. it's peek, not pop) is simply drawing an integer i in [0,elements.size()[ and returning elements[i].
Assuming the hashtable is well suited for your elements of type t, all three operations are O(1).
Be careful about the cases where there are 0 or 1 element in the datastructure.
A tree might work well here. Order log(n) insert and delete, and choose random could also be log(n): start at the root node and at each junction choose a child at random (weighted by the total number of leaf nodes per child) until you reach a leaf.
The data structure which takes the least time for sorting is sorted array.
get_random() is binary search, so O(log n).
insert() and delete() involve adding/removing the element in question and then resorting, which is O(n log n), e.g. horrendous.
I think his hint was poor. You may have been in a bad interview.
What I feel is that you can use some balaced version of tree like Red-Black trees. This will give O(log n) insertion and deletion time.
For getting random element, may be you can have a additional hash table to keep track of elements which are in the tree structure.
It might be Heap (data structure)

LIST alternative in redis

Redis.io
The main features of Redis Lists from the point of view of time
complexity is the support for constant time insertion and deletion of
elements near the head and tail, even with many millions of inserted
items. Accessing elements is very fast near the extremes of the list
but is slow if you try accessing the middle of a very big list, as it
is an O(N) operation.
what is the LIST alternative when the data is too high and writes are lesser than Reads
This is something I'd definitely benchmark before doing, but if you're really hitting a performance issue accessing items in the middle of the list, there are a couple of alternatives that really depend on your use case.
Don't make a list so big, age out/trim pieces that don't matter any more.
Memoize hot sections of the list. If a particular paginated range is being requested much more often than others, make that it's own list. Check if it exists already, and if it doesn't create a subset of your list in the paginated range.
Bucket your list from the beginning into "manageable sizes" (for whatever your definition of managable is). If a list is purely additive (no removal from the list), you could use the modulus index of an item as part of the key so that your list is stored in smaller buckets. Ex: key = "your_key_name_" + index % 100000