I plan to make a class that represents a strict partially ordered set, and I assume the most natural way to model its interface is as a binary relation. This gives functions like:
bool test(elementA, elementB); //return true if elementA < elementB
void set(elementA, elementB); //declare that elementA < elementB
void clear(elementA, elementB); //forget that elementA < elementB
and possibly functions like:
void applyTransitivity(); //if test(a,b) and test(b, c), then set(a, c)
bool checkIrreflexivity(); //return true if for no a, a < a
bool checkAsymmetry(); //return true if for no a and b, a < b and b < a
The naive implementation would be to have a list of pairs such that (a, b) indicates a < b. However, it's probably not optimal. For example, test would be linear time. Perhaps it could be better done as a hash map of lists.
Ideally, though, the in memory representation would by its nature enforce applyTransitivity to always be "in effect" and not permit the creation of edges that cause reflexivity or symmetry. In other words, the degrees of freedom of the data structure represent the degrees of freedom of a strict poset. Is there a known way to do this? Or, more realistically, is there a means of checking for being cyclical, and maintaining transitivity that is amortized and iterative with each call to set and clear, so that the cost of enforcing the correctness is low. Is there a working implementation?
Okay, let's talk about achieving bare metal-scraping micro-efficiency, and you can choose how deep down that abyss you want to go. At this architectural level, there are no data structures like hash maps and lists, there aren't even data types, just bits and bytes in memory.
As an aside, you'll also find a lot of info on representations here by looking into common representations of DAGs. However, most of the common reps are designed more for convenience than efficiency.
Here, we want the data for a to be fused with that adjacency data into a single memory block. So you want to store the 'list', so to speak, of items that have a relation to a in a's own memory block so that we can potentially access a and all the elements related to a within a single cache line (bonus points if those related elements might also fit in the same cache line, but that's an NP-hard problem).
You can do that by storing, say, 32-bit indices in a. We can model such objects like so if we go a little higher level and use C for exemplary purposes:
struct Node
{
    // node data
    ...
    int links[]; // variable-length struct
};
This makes the Node a variable-length structure whose size and potentially even address changes, so we need an extra level of indirection to get stability and avoid invalidation, like an index to an index (if you control the memory allocator/array and it's purely contiguous), or an index to a pointer (or reference in some languages).
That makes your test function still involve a linear time search, but linear with respect to the number of elements related to a, not the number of elements total. Because we used a variable-length structure, a and its neighbor indices will potentially fit in a single cache line, and it's likely that a will already be in the cache just to make the query.
It's similar to the basic idea you had of the hash map storing lists, but without the explosion of lists overhead and without the hash lookup (which may be constant time but not nearly as fast as just accessing the connections to a from the same memory block). Most importantly, it's far more cache-friendly, and that's often going to make the difference between a few cycles and hundreds.
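To make the layout concrete, here is a minimal sketch in Python. It assumes integer payloads and a build-once graph, and `PackedGraph` is a hypothetical name, not something from a library. Python gives you no control over cache lines, so this only illustrates the shape of the data: each node's payload and links sit in one contiguous run, and a separate offset table provides the extra level of indirection.

```python
from array import array


class PackedGraph:
    """Each node's record is one contiguous run in a flat int array:
    [payload, link_count, link_0, link_1, ...]."""

    def __init__(self, nodes):
        # nodes: list of (payload, [neighbour ids]); a node's id is its position
        self._data = array("i")
        self._offset = array("i")  # indirection: node id -> start of its record
        for payload, links in nodes:
            self._offset.append(len(self._data))
            self._data.append(payload)
            self._data.append(len(links))
            self._data.extend(links)

    def payload(self, a):
        return self._data[self._offset[a]]

    def test(self, a, b):
        """Linear scan, but only over the links stored in a's own record."""
        start = self._offset[a]
        count = self._data[start + 1]
        first = start + 2
        return any(self._data[i] == b for i in range(first, first + count))


g = PackedGraph([(10, [1, 2]), (20, [2]), (30, [])])
print(g.test(0, 2), g.test(2, 0))
```

In C or C++ the same record would be the variable-length struct itself rather than a slice of a shared array.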
Now this means that you still have to roll up your sleeves and check for things like cycles yourself. If you want a data structure that more directly and conveniently models the problem, you'll find a nicer fit with graph data structures revolving around a formalization of a directed edge. However, those are much more convenient than they are efficient.
If you need the container to be generic and a can be any given type, T, then you can always wrap it (using C++ now):
template <class T>
struct Node
{
    T node_data;
    int links[1]; // VLS, not necessarily actually storing 1 element
};
And still fuse this all into one memory block this way. We need placement new here to preserve those C++ object semantics and possibly keep an eye on alignment here.
Transitivity checks always involve a search of some sort (breadth first or depth first). I don't think there's any rep that avoids this unless you want to memoize/cache a potentially massive explosion of transitive data.
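As a rough sketch of what that search looks like, here is a Python version that stores only the declared pairs and derives transitivity on demand. The class name `StrictPoset` and the choice to raise `ValueError` are my assumptions, and it uses plain dictionaries and sets rather than the packed layout described above.

```python
from collections import defaultdict


class StrictPoset:
    """Stores only declared pairs a < b; transitivity is derived by search."""

    def __init__(self):
        self._greater = defaultdict(set)  # a -> {b : a < b was declared}

    def test(self, a, b):
        """Return True if a < b follows from the declared pairs (iterative DFS)."""
        stack, seen = [a], set()
        while stack:
            node = stack.pop()
            for nxt in self._greater.get(node, ()):
                if nxt == b:
                    return True
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    def set(self, a, b):
        """Declare a < b, refusing pairs that would break irreflexivity or asymmetry."""
        if a == b or self.test(b, a):
            raise ValueError(f"{a!r} < {b!r} would create a cycle")
        self._greater[a].add(b)

    def clear(self, a, b):
        """Forget the declared pair a < b; it may still hold via another path."""
        self._greater[a].discard(b)
```

Because `set` rejects any edge whose reverse is already reachable, the stored graph stays acyclic, so irreflexivity and asymmetry never need a separate check. The price is that `test` and `set` each cost a traversal instead of a lookup.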
At this point you should have something pretty fast if you want to go this deep down the abyss and have a solution that's really hard to maintain and understand. I've unfortunately found that this doesn't impress the ladies very much as with the case of having a car that goes really fast, but it can make your software go really, really fast.
I'm looking for a data structure that will let me perform the operations I need efficiently. I expect to traverse a loop between 10^11 and 10^13 times, so Ω(n) operations are right out. (I'll try to trim n down so it can fit in cache but it won't be small.) Each time through the loop I will call
Min exactly once
Delete exactly once (delete the minimum, if that helps)
Insert 0 to 2 times, with an average of somewhat more than 1
Search once for each insert
I only care about average or amortized performance, not worst-case. (The calculation will take ages, it's no concern if bits of the calculation stall from time to time.) The data will not be adversarial.
What kinds of structures should I consider? Maybe there's some kind of heap modified to have quick search?
A balanced tree is quite a good data structure for such a usage. All the specified operations run in O(log n). I think you can write an optimized tree implementation so that the minimum can be retrieved in O(1) (by keeping an iterator to the min and possibly the value for faster fetches). The resulting time of the algorithm will be O(m log n), where m is the number of iterations and n the number of items in the data structure.
This is the optimal algorithmic complexity for a comparison-based structure. Indeed, if each iteration could be done in (amortized) O(1), each of the four operations would have to run in (amortized) O(1) too. Let's assume a data structure S can be built with such properties. One can then write the following algorithm (written in Python):
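def superSort(input):
    s = S()  # the hypothetical structure with O(1) operations
    inputSize = len(input)
    # Insert every item: n inserts
    for i in range(inputSize):
        s.insert(input[i])
    output = list()
    # Repeatedly extract the minimum: n getMin + n deleteMin
    for i in range(inputSize):
        output.append(s.getMin())
        s.deleteMin()
    return output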
superSort would then run in (amortized) O(n) time. However, any comparison-based sort has been proven to need Ω(n log n) comparisons in the worst case. Thus, S cannot exist as a comparison-based structure, and at least one of insert, getMin, and deleteMin must take Ω(log n) amortized time.
Note that naive binary tree implementations are often pretty inefficient. There are a lot of optimizations you can perform to make them much faster. For example, you can pack the nodes (see B-trees), put the nodes in an array (assuming the number of items is bounded), use a relaxed balancing possibly based on random properties (see Treaps), use small references (e.g. 16-bit or 32-bit indices rather than 64-bit pointers), etc. You can start with a naive AVL or a splay tree.
My suggested data structure requires more work to be implemented, but it does achieve the desired results;
A data structure with {insert, delete, findMin, search} operations can be implemented using an AVL tree which ensures that each operation is done in O(logn) and findMin is done in O(1).
I'm going to dive in a bit into the implementation:
The tree would contain a pointer to the minimum node which is updated on each insertion and deletion, thus findMin requires O(1).
insert is implemented as it is in every AVL tree which takes O(logn) (using the balance factor and rotations/swaps to balance the tree). After you insert an element, you would need to update the minimum node pointer by going all the way to the left from the root of the tree, which requires O(logn) as well since the tree height is O(logn).
Likewise, after using delete you would need to update the minimum pointer in same fashion, thus it requires O(logn).
Finally, search also requires O(logn).
If more assumptions were given, e.g. the inserted elements are within a certain range of the minimum, then you could also give each node in the tree successor and predecessor pointers, which can also be updated in O(logn) during insertions and deletions, and thus can be accessed in O(1) without the need to traverse over the entire tree. And searching for the inserted elements can be done faster.
The successor of an inserted node can be updated by going to the right child and then all the way to the left. But if a right child does not exist then you would need to climb up the parents as long as the current node is not the left child of its parent.
The predecessor is updated in the exact reverse way.
In C++, a node would look something like this:
template <class Key, class Value>
class AvlNode {
private:
    Key key;
    Value value;
    int Height;
    int BF; // balance factor
    AvlNode* Left;
    AvlNode* Right;
    AvlNode* Parent;
    AvlNode* Succ; // in-order successor
    AvlNode* Pred; // in-order predecessor
public:
    ...
};
While the tree would look something like this:
template <class Key, class Value>
class AVL {
private:
    int NumOfKeys;
    int Height;
    AvlNode<Key, Value> *Minimum; // kept up to date on every insert/delete
    AvlNode<Key, Value> *Root;
    // Rebalancing rotations for the four imbalance cases
    static void swapLL(AVL<Key, Value> *avl, AvlNode<Key, Value> *root);
    static void swapLR(AVL<Key, Value> *avl, AvlNode<Key, Value> *root);
    static void swapRL(AVL<Key, Value> *avl, AvlNode<Key, Value> *root);
    static void swapRR(AVL<Key, Value> *avl, AvlNode<Key, Value> *root);
public:
    ...
};
From what you told us, I think I would use an open-addressed hash table for search and a heap to keep track of the minimum.
In the heap, instead of storing values, you would store indexes/pointers to the items in the hash table. That way when you delete min from the heap, you can follow the pointer to find the item you need to delete from the hash table.
The total memory overhead will be 3 or 4 words per item -- about the same as a balanced tree, but the implementation is simpler and faster.
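A minimal sketch of that combination in Python, assuming items are hashable and comparable. `MinSet` is a hypothetical name, the built-in `set` stands in for the open-addressed hash table, and the heap stores the items themselves rather than indexes into the table, which is a simplification of what I described.

```python
import heapq


class MinSet:
    def __init__(self):
        self._members = set()  # stands in for the hash table used for search
        self._heap = []        # min-heap over the same items

    def search(self, item):
        return item in self._members

    def insert(self, item):
        """Insert item unless it is already present; return True if inserted."""
        if item in self._members:
            return False
        self._members.add(item)
        heapq.heappush(self._heap, item)
        return True

    def min(self):
        return self._heap[0]

    def delete_min(self):
        """Pop the minimum from the heap and drop it from the table too."""
        item = heapq.heappop(self._heap)
        self._members.remove(item)
        return item


s = MinSet()
for x in (5, 3, 8, 3):
    s.insert(x)
print(s.delete_min(), s.min(), s.search(3))
```

Since the only deletion is delete-min, the heap and the table never get out of sync, and no lazy-deletion bookkeeping is needed.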
Scenario
Let's say I am storing up to 5 byte arrays, each 50kB, per user.
Possible Implementations:
1) One byte array per record, indexed by secondary key.
Pros: Fast read/write.
Cons: High cardinality query (up to 5 results per query). Bad for horizontal scaling, if byte arrays are frequently accessed.
2) All byte arrays in single record in separate bins
Pros: Fast read
Neutral: Blocksize must be greater than 250kB
Cons: Slow write (one change means rewriting all byte arrays).
3) Store byte arrays in a LLIST LDT
Pros: Avoid the cons of solution (1) and (2)
Cons: LDTs are generally slow
4) Store each byte array in a separate record, keyed to a UUID. Store a UUID list in another record.
Pros: Writes to each byte array do not require rewriting all arrays. No low-cardinality concern of secondary indexes. Avoids use of LDT.
Cons: A client read is 2-stage: Get list of UUIDs from meta record, then multi-get for each UUID (very slow?)
5) Store each byte array as a separate record, using a pre-determined primary key scheme (e.g. userid_index, e.g. 123_0, 123_1, 123_2, 123_3, 123_4)
Pros: Avoid 2-stage read
Cons: Theoretical collision possibility with another user (e.g. user1_index1 and user2_index2 produce the same hash). I know this is (very, very) low-probability, but avoidance is still preferred (imagine one user being able to read the byte array of another user due to collision).
My Evaluation
For balanced read/write OR high read/low write situations, use #2 (One record, multiple bins). A rewrite is more costly, but avoids other cons (LDT penalty, 2-stage read).
For a high (re)write/low read situation, use #3 (LDT). This avoids having to rewrite all byte arrays when one of them is updated, due to the fact that records are copy-on-write.
Question
Which implementation is preferable, given the current data pattern (small quantity, large objects)? Do you agree with my evaluation (above)?
Here is some input. (I want to disclose that I do work at Aerospike).
Do avoid #3. Do not use LDT as the feature is definitely not as mature as the rest of the platform, especially when it comes to performance / reliability during cluster rebalance (migrations) situations when nodes leave/join a cluster.
I would try to stick as much as possible with basic Key/Value transactions. That should always be the fastest and most scalable. As you pointed out, option #1 would not scale. Secondary indices also do have an overhead in memory and currently do not allow for fast start (enterprise edition only anyways).
You are also correct on #2 for high write loads, especially if you are going to always update 1 bin...
So, this leaves options #4 and #5. For option #5, the collision will not happen in practice. You can go over the math, it will simply not happen. If it does, you will get famous and can publish a paper :) (there may even be a prize for having found a collision). Also, note that you have the option to store the key along with the record, which will provide you with a 'key check' on writes that should be very cheap (since records are read anyway before being written). Option #4 would work as well, it will just do an extra read (which should be super fast).
It all depends on where you want the bit of extra complexity. So you can do some simple benchmarking between the two options, if you have that luxury, before deciding.
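For anyone who wants to go over the math, here is a small sketch of the usual birthday-bound approximation. The function name is made up for illustration, and the 160-bit digest width used in the example call is an assumption on my part, so check the documentation for the digest size your version actually uses.

```python
def collision_probability(num_keys, digest_bits):
    """Birthday-bound approximation: p ~= n * (n - 1) / 2^(bits + 1).

    Only meaningful while the result is much smaller than 1.
    """
    return num_keys * (num_keys - 1) / 2 ** (digest_bits + 1)


# Assumed figures: one trillion records and a 160-bit key digest
print(collision_probability(10**12, 160))
```

Even with a trillion records the result is vanishingly small, which is why storing the key for a 'key check' is more of a belt-and-braces measure than a necessity.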
I have read online that Redis can say if an element is member of set or not in O(1) time. I want to know how Redis does this. What algorithm does Redis use to achieve this.
A Redis Set is implemented internally in one of two ways: an intset or a hashtable. The intset is a special optimization for integer-only sets and uses the intsetSearch function to search the set. This function, however, uses a binary search, so that's actually technically O(log N). However, since the cardinality of intsets is capped at a constant (the set-max-intset-entries configuration directive), we can assume O(1) accurately reflects the complexity here.
hashtable is used for a lot of things in Redis, including the implementation of Sets. It uses a hash function on the key to map it into a table (array) of entries - checking whether the hashed key value is in the array is straightforwardly done in O(1) in dictFind. The elements under each hashed key are stored as a linked list, so again you're basically talking O(N) to traverse it, but given the hash function's extremely low probability for collisions (hmm, need some sort of citation here?) these lists are extremely short, so we can safely assume it is effectively O(1).
Because of the above, SISMEMBER's claim of being O(1) in terms of computational complexity is valid.
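To illustrate the two lookup strategies, here is a toy sketch in Python. This is not Redis's code: `intset_contains` and `ChainedSet` are hypothetical names, and the hash table below never resizes, which a real implementation would do to keep the chains short.

```python
from bisect import bisect_left


def intset_contains(sorted_ints, value):
    """Binary search over a sorted integer array: O(log N), N capped in practice."""
    i = bisect_left(sorted_ints, value)
    return i < len(sorted_ints) and sorted_ints[i] == value


class ChainedSet:
    """Hash table with chaining: hash to a bucket, then scan its short chain."""

    def __init__(self, buckets=8):
        self._buckets = [[] for _ in range(buckets)]

    def _chain(self, member):
        return self._buckets[hash(member) % len(self._buckets)]

    def add(self, member):
        chain = self._chain(member)
        if member not in chain:
            chain.append(member)

    def contains(self, member):
        return member in self._chain(member)


print(intset_contains([1, 4, 9, 16], 9))
cs = ChainedSet()
cs.add("alice")
print(cs.contains("alice"), cs.contains("bob"))
```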
There is a lot available on the Net about consistent hashing, and implementations in several languages available. The Wikipedia entry for the topic references another algorithm with the same goals:
Rendezvous Hashing
This algorithm seems simpler, and doesn't need the addition of replicas/virtuals around the ring to deal with uneven loading issues. As the article mentions, it appears to run in O(n) which would be an issue for large n, but references a paper stating it can be structured to run in O(log n).
My question for people with experience in this area is, why would one choose consistent hashing over HRW, or the reverse? Are there use cases where one of these solutions is the better choice?
Many thanks.
Primarily I would say the advantage of consistent hashing is when it comes to hotspots. Depending on the implementation, it's possible to manually modify the token ranges to deal with them.
With HRW if somehow you end up with hotspots (ie caused by poor hashing algorithm choices) there isn't much you can do about it short of removing the hotspot and adding a new one which should balance the requests out.
Big advantage to HRW is when you add or remove nodes you maintain an even distribution across everything. With consistent hashes they resolve this by giving each node 200 or so virtual nodes, which also makes it difficult to manually manage ranges.
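For comparison, a minimal consistent-hash ring with virtual nodes might look like the Python sketch below. The names `HashRing` and `_hash`, the `node#i` token scheme, and the use of MD5 are all assumptions for illustration. The sorted token list is the state that has to be rebuilt whenever nodes join or leave.

```python
import bisect
import hashlib


def _hash(text):
    # Hypothetical helper: a 64-bit integer taken from an MD5 digest
    return int.from_bytes(hashlib.md5(text.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=200):
        # Every physical node owns `vnodes` tokens scattered around the ring
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._tokens = [token for token, _ in self._ring]

    def lookup(self, key):
        # First token clockwise from the key's hash, wrapping past the end
        idx = bisect.bisect_right(self._tokens, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.lookup("some-key"))
```

Lookups are a binary search over nodes times vnodes tokens, and editing individual token ranges by hand (the hotspot fix mentioned above) gets awkward once every node owns a couple of hundred of them.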
Speaking as someone who's just had to choose between the two approaches and who ultimately plumped for HRW hashing: My use case was a simple load balancing one with absolutely no reassignment requirement -- if a node died it's perfectly OK to just choose a new one and start again. No rebalancing of existing data is required.
1) Consistent Hashing requires a persistent hashmap of the nodes and vnodes (or at least a sensible implementation does, you could build all the objects on every request.... but you really don't want to!). HRW does not (it's stateless). Nothing needs to be modified when machines join or leave the cluster - there is no concurrency to worry about (except that your clients need a good view of the state of the cluster, which is the same in both cases).
2) HRW is easier to explain and understand (and the code is shorter). For example, this is a complete HRW algorithm implemented in Riverbed Stingray TrafficScript. (Note there are better hash algorithms to choose than MD5 - it's overkill for this job.)
$nodes = pool.listActiveNodes("stingray_test");
# Get the key
$key = http.getFormParam("param");
$biggest_hash = "";
$node_selected = "";
foreach ($node in $nodes) {
   $hash_comparator = string.hashMD5($node . '-' . $key);
   # If the combined hash is the biggest we've seen, we have a candidate
   if ( $hash_comparator > $biggest_hash ) {
      $biggest_hash = $hash_comparator;
      $node_selected = $node;
   }
}
connection.setPersistenceNode( $node_selected );
3) HRW provides an even distribution when you lose or gain nodes (assuming you chose a sensible hash function). Consistent Hashing doesn't guarantee that but with enough vnodes it's probably not going to be an issue
4) Consistent Hashing lookups may be faster - in normal operation they should be O(log N), where N is the number of nodes * the replica factor for vnodes. However, if you don't have a lot of nodes (I didn't) then HRW is probably going to be fast enough for you.
4.1) As you mentioned, Wikipedia mentions that there is a way to do HRW in log(N) time. I don't know how to do that! I'm happy with my O(N) time on 5 nodes.....
In the end, the simplicity and the stateless nature of HRW made the choice for me....
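For readers without a Stingray box to hand, roughly the same selection logic can be sketched in Python. `hrw_select` is a hypothetical name, and MD5 is kept only to mirror the TrafficScript above.

```python
import hashlib


def hrw_select(nodes, key):
    """Return the node whose hash, combined with the key, is the largest."""
    def score(node):
        return hashlib.md5(f"{node}-{key}".encode()).digest()
    return max(nodes, key=score)


nodes = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
print(hrw_select(nodes, "session-42"))
```

There is no ring to build or keep in sync: the only input is the current node list. If a node disappears, only the keys that had picked it move elsewhere, because every other key's winning score is unchanged.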
Assume we build an object to represent some network (social, wireless, whatever). So we have some 'node' object to represent the KIND of network, different nodes might have different behaviors and so forth. The network has a MutableList of nodes.
But each node has neighbors, and these neighbors are also nodes. So somewhere, there has to be a list, per node, of all of the neighbors of that node--or such a list has to be generated on the fly whenever it is needed. If the list of neighbors is stored in the node objects, is it cheaper to store it (a) as a list of nodes, or (b) as list of numbers that can be used to reference nodes out of the network?
Some code for clarity:
//approach (a)
class Network {
  val nodes = new MutableList[Node]
  // other stuff //
}
class Node {
  val neighbors = new MutableList[Node]
  // other stuff //
}
//approach (b)
class Network {
  val nodes = new MutableList[Node]
  val indexed_list = //(some function to get an indexed list off nodes)
  //other stuff//
}
class Node {
  val neighbors = new MutableList[Int]
  //other stuff//
}
Approach (a) seems like the easiest. My first question is whether this is costly in Scala 2.8, and the second is whether it breaks the principle of DRY?
Short answer: premature optimization is the root of etc. Use the clean reference approach. When you have performance issues there's no substitute for profiling and benchmarking.
Long answer: Scala uses the exact same reference machinery as Java, so this is really a JVM question more than a Scala question. Formally the JVM spec doesn't say one word about how references are implemented. In practice they tend to be word-sized or smaller pointers that either point to an object or index into a table that points to the object (the latter helps garbage collectors).
Either way, an array of refs is about the same size as an array of ints on a 32-bit VM, or about double on a 64-bit VM (unless compressed oops are used). That doubling might be important to you or might not.
If you go with the ref based approach, each traversal from a node to a neighbor is a reference indirection. With the int based approach, each traversal from a node to a neighbor is a lookup into a table and then a reference indirection. So the int approach is more expensive computationally. And that's assuming you put the ints into a collection that doesn't box the ints. If you do box the ints then it's just pure craziness because now you've got just as many references as the original AND you've got a table lookup.
Anyway, if you go with the reference based approach then the extra references can make a bit of extra work for a garbage collector. If the only references to nodes lie in one array then the gc will scan that pretty damn fast. If they're scattered all over in a graph then the gc will have to work harder to track them all down. That may or may not affect your needs.
From a cleanliness standpoint the ref based approach is much nicer. So go with it and then profile to see where you're spending your time. That or benchmark both approaches.
The question is - what kind of a cost? Memory-wise, the b) approach would probably end up consuming more memory, since you have both mutable lists, and boxed integers in that list, and another global structure holding all the indices. Also, it would probably be slower because you would need several levels of indirection to reach the neighbour node.
One important note - as soon as you start storing integers into mutable lists, they will undergo boxing. So, you will have a list of heap objects in both cases. To avoid this, and furthermore to conserve memory, in the b) approach you would have to keep a dynamically grown array of integers that are the indices of the neighbours.
Now, even if you modify the approach b) as suggested above, and make sure the indexed list in the Network class is really an efficient structure (a direct lookup table or a hash table), you would still pay an indirection cost to find your Node. And memory consumption would still be higher. The only benefit I see is in keeping some sort of a table of weak references if you're concerned you might run out of memory, and recreate the Node object when you need it and you cannot find it in your indexed_list which keeps a set of weak references.
This is, of course, just a hypothesis, you would have to profile/benchmark your code to see the difference.
My suggestion would be to use something like an ArrayBuffer in Node and use it to store direct references to nodes.
If memory concerns are an issue, and you want to do the b) approach together with weak references, then I would further suggest rolling in your own dynamically grown integer-array for neighbours, to avoid boxing with ArrayBuffer[Int].