I've been using SINTER, which computes the intersection of unordered integer sets. Is there any faster way of doing the intersection, given that I wouldn't mind sorting beforehand (or performing any other preprocessing)?
EDIT:
Found some info [here][1]
EDIT2:
Bounty for specific answer: is zinterstore faster than sinter? Benchmarking would be cool too.
Fast answer
In theory, the intersection of sets can be computed in O(N), where N is the cardinality of the smallest set.
Use a SET (SINTER/SINTERSTORE) if you have sparse data or need to keep RAM usage low (O(N*M)), and use a bitmap (SETBIT/BITOP) in all other cases (O(N)), as in the link from your first edit.
BITSET
The Redis bit operations have O(N) complexity, where N is the size of the smallest key. The bit operations also have excellent raw execution speed because they scan contiguous memory and are CPU-cache friendly (look at the bitops.c sources). So this can be the absolute winner if your data is not sparse, or if memory is not a concern for you (here is more about strings in Redis).
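As a concrete illustration, here is a minimal sketch using redis-py (the key names and the assumption of small, non-negative integers are mine, not from the question):

```python
import redis

r = redis.Redis()

def store_as_bitmap(key, ints):
    # Map each integer i to bit i of a Redis string (assumes small,
    # non-negative integers; a value of 10^9 would allocate ~125 MB).
    pipe = r.pipeline(transaction=False)
    for i in ints:
        pipe.setbit(key, i, 1)
    pipe.execute()

store_as_bitmap('set:a', [1, 5, 9, 12])
store_as_bitmap('set:b', [5, 9, 21])

# BITOP AND computes the intersection of the bitmaps server-side.
r.bitop('AND', 'set:a_and_b', 'set:a', 'set:b')
print(r.getbit('set:a_and_b', 5))  # 1 -> 5 is in both sets
print(r.getbit('set:a_and_b', 1))  # 0 -> 1 is only in set:a
```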
ZSET vs SET (zinterstore vs sinter)
Do not use a ZSET (ZINTERSTORE) if you have a plain list of integers and just want to intersect them. A sorted set in Redis is a complex structure whose entries are stored in the ziplist or skiplist internal encodings. The skiplist stores the sorted scores, while the members are kept in an accompanying dict. ZSET intersection is therefore always more expensive than SET intersection:
ZSET intersection: O(N * K) + O(M * log(M)) worst case with N being the smallest input sorted set, K being the number of input sorted sets and M being the number of elements in the resulting sorted set.
SET intersection: O(N * M) worst case, where N is the cardinality of the smallest set and M is the number of sets. This is effectively the theoretical minimum.
SET uses the dict / intset data structures to store data, and in your case (unordered integer sets) the intset would be used. The intset is the most memory-efficient structure in Redis, and it has the best read speed compared to the ziplist (more about this data structure's internals here).
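To address the bounty question directly, a rough benchmark sketch along these lines (key names and sizes are arbitrary choices of mine; absolute numbers will depend on your machine and Redis version) should show SINTERSTORE beating ZINTERSTORE on plain integer sets:

```python
import time
import redis

r = redis.Redis()
r.delete('s1', 's2', 'z1', 'z2', 'sdest', 'zdest')

nums = list(range(100_000))
r.sadd('s1', *nums)
r.sadd('s2', *nums[::2])
r.zadd('z1', {n: 0 for n in nums})       # all scores 0: order is irrelevant here
r.zadd('z2', {n: 0 for n in nums[::2]})

t0 = time.perf_counter()
r.sinterstore('sdest', ['s1', 's2'])
print('SINTERSTORE:', time.perf_counter() - t0)

t0 = time.perf_counter()
r.zinterstore('zdest', ['z1', 'z2'])
print('ZINTERSTORE:', time.perf_counter() - t0)
```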
Related
An argument in favor of graph DBMSs with native storage over relational DBMSs, made by Neo4j (also in the Neo4j graph databases book), is that "index-free adjacency" is the most efficient means of processing data in a graph (due to the 'clustering' of the data/nodes in a graph-based model).
Based on some benchmarking I've performed, where three nodes are sequentially connected (A->B<-C) and, given the id of A, I query for C, the scalability is clearly O(n) when testing the same query on databases with 1M, 5M, 10M and 20M nodes. This is reasonable (with my limited understanding), considering I am not limiting my query to one node only, hence all nodes need to be checked for a match. HOWEVER, when I index the queried node property, the execution time for the same query is relatively constant.
[Figure: execution time by database node size, before and after indexing. The orange plot is an O(n) reference line; the blue plot shows the observed execution times.]
Based on these results I'm trying to figure out where the advantage of index-free adjacency comes in. Is this advantageous when querying with a limit of 1 for deep(er) links? E.g. depth of 4 in A->B->C->D->E, and querying for E given A. Because in this case we know that there is only one match for A (hence no need to brute force through all the other nodes not part of this sub-network).
As this is highly dependent on the query, I'm listing an example of the Cypher query below for reference (I match an entity-labeled node with id 1, and return the associated node (B in the above example) and the secondary-linked node (C in the above example)):
MATCH (:entity{id:1})-[:LINK]->(result_assoc:assoc)<-[:LINK]-(result_entity:entity) RETURN result_entity, result_assoc
UPDATE / ADDITIONAL INFORMATION
This source states: "The key message of index-free adjacency is, that the complexity to traverse the whole graph is O(n), where n is the number of nodes. In contrast, using any index will have complexity O(n log n)." This statement explains the O(n) results before indexing. I guess the O(1) performance after indexing is comparable to hash-table performance(?). I'm not sure why the complexity with any other index would be O(n log n), given that even a hash table has a worst case of only O(n).
From my understanding, the index-free aspect is only pertinent for adjacent nodes (that's why it's called index-free adjacency). What your plots demonstrate is that once you find A, the additional time to find C is negligible; whether to use an index or not only matters for finding the initial queried node A.
Finding A without an index takes O(n), because the database has to scan through all the nodes; with an index, the lookup is effectively a hash-table lookup and takes O(1) (no clue why the book says O(n log n) either).
Beyond that, finding the adjacent nodes is not hard for Neo4j, because they are linked directly to A, whereas in a relational DB the linkage is not as explicit - a join, which is expensive, is required, followed by a scan/filter. So to truly see the advantage, one should compare the performance of graph DBs and relational DBs while varying the depth of the relations/links. It would also be interesting to see how a query performs when the number of neighbours of the entity nodes is increased (i.e., the graph becomes denser) - does Neo4j rely on the graph never being too dense? Otherwise the problem of looking through the neighbours to find the right one repeats itself.
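For what it's worth, here is a hypothetical sketch of such a depth-varying experiment using the official neo4j Python driver (the URI, credentials, and the entity label / LINK relationship type mirror the question's example and would need to match your own schema):

```python
import time
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def timed_query(depth):
    # Fixed-length variable-hop pattern of the question's shape, with
    # LIMIT 1 so the traversal can stop at the first match.
    query = (
        f"MATCH (:entity {{id: 1}})-[:LINK*{depth}]-(e) "
        f"RETURN e LIMIT 1"
    )
    with driver.session() as session:
        t0 = time.perf_counter()
        session.run(query).consume()  # consume() drains the result stream
        return time.perf_counter() - t0

for depth in (1, 2, 4, 8):
    print(depth, timed_query(depth))

driver.close()
```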
You are given an array of N integers. You are asked to find the largest element which appears an even number of times in the array. What is the time complexity of your algorithm? Can you do this without sorting the entire array?
You could do it in O(n log n) with a table lookup method. For each element in the list, look it up in the table. If it is missing, insert a key-value pair with the key being the element and the value as the number of appearances (starting at one); if it is present, increment the appearances. At the end just loop through the table in O(n) and look for the largest key with an even value.
In theory, for an ideal hash table, a lookup operation is O(1). So you can find and/or insert all n elements in O(n) time, making the total complexity O(n). However, in practice you will have trouble with space allocation (you need much more space than the data set size) and with collisions (which is why you need that extra space). This makes true O(1) lookup difficult to achieve; in the worst case a lookup can cost as much as O(n) (though that is unlikely), making the total complexity O(n^2).
Instead, you can be safer with a tree-based table - that is, the keys are stored in a binary tree. Lookup and insertion operations are then O(log n), provided that the tree is balanced; there is a wide range of tree structures to help ensure this, e.g. red-black trees, AVL trees, splay trees, B-trees, etc. (Google is your friend). This makes the total complexity a guaranteed O(n log n).
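A minimal Python sketch of the hash-table variant (a dict plays the role of the table; swapping in a balanced-tree map would give the guaranteed O(n log n) bound instead):

```python
def largest_even_count(arr):
    # Pass 1: count occurrences of each element.
    counts = {}
    for x in arr:
        counts[x] = counts.get(x, 0) + 1
    # Pass 2: keep the largest key whose count is even.
    best = None
    for value, count in counts.items():
        if count % 2 == 0 and (best is None or value > best):
            best = value
    return best

print(largest_even_count([5, 3, 5, 2, 2, 9]))  # 5 (appears twice)
```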
The question goes like this:
Given an array of n elements in which n^(2001/2002) of the elements are equal, what is the worst-case time complexity of sorting the array (under RAM model assumptions)?
So, I thought to use a selection algorithm to find the n^(2001/2002)-th smallest element; call it P. This should take O(n). Next, I take every element that doesn't equal P and put it in another array. In total I will have k = n - n^(2001/2002) elements. Sorting this array costs O(k log k), which equals O(n log n). Finally, I find the largest element smaller than P and the smallest element bigger than P, and I can assemble the sorted array.
All of it takes O(n log n).
Note: if , then we can reduce the time to O(n).
I have two questions: is my analysis correct? Is there any way to reduce the time complexity? Also, what are the RAM model assumptions?
Thanks!
Your analysis is wrong - there is no guarantee that the n^(2001/2002)th-smallest element is actually one of the duplicates.
n^(2001/2002) duplicates simply don't constitute enough of the input to make things easier, at least in theory. Sorting the input is still at least as hard as sorting the n - n^(2001/2002) = Θ(n) other elements, and under standard comparison sort assumptions in the RAM model, that takes Ω(n log n) worst-case time.
(For practical input sizes, n^(2001/2002) duplicates would be at least 98% of the input, so isolating the duplicates and sorting the rest would be both easy and highly efficient. This is one of those cases where the asymptotic analysis doesn't capture the behavior we care about in practice.)
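A sketch of that practical approach in Python (assuming a single dominant duplicate value, as in the question):

```python
import bisect
from collections import Counter

def sort_with_heavy_duplicates(arr):
    # Find the dominant value; with n^(2001/2002) copies it is by far
    # the most common element for realistic n.
    dup, dup_count = Counter(arr).most_common(1)[0]
    # Sort only the remaining n - n^(2001/2002) elements...
    rest = sorted(x for x in arr if x != dup)
    # ...then splice the duplicates back in at their sorted position.
    i = bisect.bisect_left(rest, dup)
    return rest[:i] + [dup] * dup_count + rest[i:]

print(sort_with_heavy_duplicates([7, 1, 7, 7, 9, 7, 3]))  # [1, 3, 7, 7, 7, 7, 9]
```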
I have read online that Redis can tell whether an element is a member of a set in O(1) time. I want to know how Redis does this. What algorithm does Redis use to achieve this?
A Redis Set is implemented internally in one of two ways: an intset or a hashtable. The intset is a special optimization for integer-only sets and uses the intsetSearch function to search the set. This function, however, uses a binary search, so it's actually technically O(log N). However, since the cardinality of intsets is capped at a constant (the set-max-intset-entries configuration directive), we can assume O(1) accurately reflects the complexity here.
hashtable is used for a lot of things in Redis, including the implementation of Sets. It uses a hash function on the key to map it into a table (array) of entries - checking whether the hashed key value is in the array is straightforwardly done in O(1) in dictFind. The elements under each hashed key are stored as a linked list, so in principle you're talking O(N) to traverse it, but given the hash function's extremely low probability of collisions (hmm, need some sort of citation here?), these lists are extremely short, so we can safely assume the lookup is effectively O(1).
Because of the above, SISMEMBER's claim of being O(1) in terms of computational complexity is valid.
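You can observe both encodings from a client; a small sketch with redis-py (hypothetical key names; recent Redis versions may also report listpack for small non-integer sets):

```python
import redis

r = redis.Redis()
r.delete('nums', 'words')

r.sadd('nums', 1, 2, 3)
print(r.object('encoding', 'nums'))   # b'intset' -> binary search, capped size

r.sadd('words', 'a', 'b', 'c')
print(r.object('encoding', 'words'))  # b'hashtable' (or b'listpack' when small)

print(r.sismember('nums', 2))         # True, in effectively O(1)
```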
I have a use case where I know for a fact that some sets I have materialized in my Redis store are disjoint. Some of my sets are quite large; as a result, their SUNION or SUNIONSTORE takes quite a long time. Does Redis provide any functionality for handling such unions?
Alternatively, if there is a way to add elements to a set in Redis without checking for uniqueness before each insert, it could solve my issue.
Actually, there is no need for such a feature, because of the relative cost of the operations.
When you build Redis objects (such as sets or lists), the cost is not dominated by the data structure management (hash table or linked lists), because the amortized complexity of individual insertion operations is O(1). The cost is dominated by the allocation and initialization of all the items (i.e. the set objects or the list objects). When you retrieve those objects, the cost is dominated by the allocation and formatting of the output buffer, not by the access paths in the data structure.
So bypassing the uniqueness property of the sets does not bring a significant optimization.
To optimize a SUNION command when the sets are disjoint, the best approach is to replace it with a pipeline of several SMEMBERS commands to retrieve the individual sets (and build the union on the client side).
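A sketch of that client-side union with a redis-py pipeline (key names are placeholders):

```python
import redis

r = redis.Redis()

def union_of_disjoint_sets(keys):
    # One round trip: queue one SMEMBERS per set, then merge client-side.
    # No uniqueness checks are needed since the sets are known to be disjoint.
    pipe = r.pipeline(transaction=False)
    for key in keys:
        pipe.smembers(key)
    union = set()
    for members in pipe.execute():
        union.update(members)
    return union

result = union_of_disjoint_sets(['set:a', 'set:b', 'set:c'])
```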
Optimizing a SUNIONSTORE is not really possible, since disjoint sets are the worst case for its performance. The performance is dominated by the number of resulting items, so the fewer items the sets have in common, the longer the response time.