What is the cost of object reference in Scala? - oop

Assume we build an object to represent some network (social, wireless, whatever). So we have some 'node' object to represent the KIND of network, different nodes might have different behaviors and so forth. The network has a MutableList of nodes.
But each node has neighbors, and these neighbors are also nodes. So somewhere, there has to be a list, per node, of all of the neighbors of that node--or such a list has to be generated on the fly whenever it is needed. If the list of neighbors is stored in the node objects, is it cheaper to store it (a) as a list of nodes, or (b) as list of numbers that can be used to reference nodes out of the network?
Some code for clarity:
import scala.collection.mutable.MutableList

// approach (a): neighbors stored as direct references to nodes
class Network {
  val nodes = new MutableList[Node]
  // other stuff
}
class Node {
  val neighbors = new MutableList[Node]
  // other stuff
}

// approach (b): neighbors stored as indices into the network's node list
class Network {
  val nodes = new MutableList[Node]
  // val indexed_list = (some function to get an indexed list of nodes)
  // other stuff
}
class Node {
  val neighbors = new MutableList[Int]
  // other stuff
}
Approach (a) seems like the easiest. My first question is whether this is costly in Scala 2.8, and the second is whether it breaks the principle of DRY?

Short answer: premature optimization is the root of etc. Use the clean reference approach. When you have performance issues there's no substitute for profiling and benchmarking.
Long answer: Scala uses the exact same reference machinery as Java, so this is really a JVM question more than a Scala question. Formally the JVM spec doesn't say one word about how references are implemented. In practice they tend to be word-sized or smaller pointers that either point to an object or index into a table that points to the object (the latter helps garbage collectors).
Either way, an array of refs is about the same size as an array of ints on a 32-bit VM, or about double on a 64-bit VM (unless compressed oops are used). That doubling might be important to you or might not.
If you go with the ref based approach, each traversal from a node to a neighbor is a reference indirection. With the int based approach, each traversal from a node to a neighbor is a lookup into a table and then a reference indirection. So the int approach is more expensive computationally. And that's assuming you put the ints into a collection that doesn't box the ints. If you do box the ints then it's just pure craziness because now you've got just as many references as the original AND you've got a table lookup.
Anyway, if you go with the reference based approach then the extra references can make a bit of extra work for a garbage collector. If the only references to nodes lie in one array then the gc will scan that pretty damn fast. If they're scattered all over in a graph then the gc will have to work harder to track them all down. That may or may not affect your needs.
From a cleanliness standpoint the ref based approach is much nicer. So go with it and then profile to see where you're spending your time. That or benchmark both approaches.
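To make the "one hop versus two hops" point concrete, here is a minimal sketch of both layouts. It is written in Python purely for illustration: the names (RefNode, IndexNode, neighbor_names_by_ref, neighbor_names_by_index) are made up for this example, and the memory and GC remarks above are JVM-specific and are not reproduced by this code.

```python
class RefNode:
    """Approach (a): each node holds direct references to its neighbors."""
    def __init__(self, name):
        self.name = name
        self.neighbors = []  # list of RefNode objects


class IndexNode:
    """Approach (b): each node holds integer indices into the network's node list."""
    def __init__(self, name):
        self.name = name
        self.neighbors = []  # list of int positions in Network.nodes


class Network:
    def __init__(self, nodes):
        self.nodes = list(nodes)


def neighbor_names_by_ref(node):
    # One step: follow the reference straight to the neighbor.
    return [neighbor.name for neighbor in node.neighbors]


def neighbor_names_by_index(network, node):
    # Two steps: look the index up in the network's table, then follow the reference.
    return [network.nodes[i].name for i in node.neighbors]


# Approach (a)
a, b, c = RefNode("a"), RefNode("b"), RefNode("c")
a.neighbors.extend([b, c])

# Approach (b)
x, y, z = IndexNode("a"), IndexNode("b"), IndexNode("c")
net = Network([x, y, z])
x.neighbors.extend([1, 2])
```

Note that the index version cannot answer a neighbor query without also being handed the network, which is part of why the reference version reads more cleanly.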

The question is - what kind of a cost? Memory-wise, the b) approach would probably end up consuming more memory, since you have both mutable lists, and boxed integers in that list, and another global structure holding all the indices. Also, it would probably be slower because you would need several levels of indirection to reach the neighbour node.
One important note - as soon as you start storing integers into mutable lists, they will undergo boxing. So, you will have a list of heap objects in both cases. To avoid this, and furthermore to conserve memory, in the b) approach you would have to keep a dynamically grown array of integers that are the indices of the neighbours.
Now, even if you modify the approach b) as suggested above, and make sure the indexed list in the Network class is really an efficient structure (a direct lookup table or a hash table), you would still pay an indirection cost to find your Node. And memory consumption would still be higher. The only benefit I see is in keeping some sort of a table of weak references if you're concerned you might run out of memory, and recreate the Node object when you need it and you cannot find it in your indexed_list which keeps a set of weak references.
This is, of course, just a hypothesis, you would have to profile/benchmark your code to see the difference.
My suggestion would be to use something like an ArrayBuffer in Node and use it to store direct references to nodes.
If memory concerns are an issue, and you want to do the b) approach together with weak references, then I would further suggest rolling in your own dynamically grown integer-array for neighbours, to avoid boxing with ArrayBuffer[Int].


OptaPlanner Constraint Streams: Count Distinct Values in Planning Entity Set

I'm looking for some help with OptaPlanner's constraint streams. The problem is a variant on job-shop scheduling, and my planning entities (CandidateAssignment) are wrapping around two decision variables: choice of robot and assigned time grain. Each CandidateAssignment also has a field (a Set) denoting which physical containers in a warehouse will be filled by assigning that task.
The constraint I'm trying to enforce is to minimize the total number of containers used by all CandidateAssignments in a solution (the goal being to guide OptaPlanner towards grouping tasks by container... there are domain-specific benefits to this in the warehouse). If each CandidateAssignment could only service a single container, this would be easy:
protected Constraint maximizeContainerCompleteness(ConstraintFactory constraintFactory) {
    return constraintFactory.forEach(CandidateAssignment.class)
            .filter(CandidateAssignment::isAssigned)
            .groupBy(CandidateAssignment::getContainerId, countDistinct())
            .penalizeConfigurable("Group by container");
}
Moving from a single ID to a collection seems less straightforward to me (i.e., if CandidateAssignment:getContainerIds returns a set of integers). Any help would be much appreciated.
EDIT: Thanks Christopher and Lukáš for the responses. Christopher's constraint matches my use case (minimize the number of containers serviced by a solution). However, this ends up being a pretty poor way to guide OptaPlanner towards (more) optimal solutions since it's operating via iterated local search. Given a candidate solution, the majority of neighbors in that solution's neighborhood will have equal value for that constraint (# unique containers used), so it doesn't have much power of discernment.
The approach I've tested with reasonable results is as follows:
protected Constraint maximizeContainerCompleteness(ConstraintFactory constraintFactory) {
    return constraintFactory.forEach(CandidateAssignment.class)
            .filter(CandidateAssignment::isAssigned)
            .join(Container.class, Joiners.filtering(
                    (candidate, container) -> candidate.getContainerIds().contains(container.getContainerId())))
            .rewardConfigurable("Group by container", (candidate, container) -> container.getPercentFilledSquared());
}
This is a modified version of Lukáš' answer. It works by prioritizing containers which are "mostly full." In the real-world use case (which I think I explained pretty poorly above), we'd like to minimize the number of containers used in a solution because it allows the warehouse to replace those containers with new ones which are "easier" to fulfill (the search space is less constrained). We're planning in a receding time horizon, and having many partially filled bins means that each planning horizon becomes increasingly more difficult to schedule. "Closing" containers by fulfilling all associated tasks means we can replace that container with a new one and start fresh.
Anyways, just a bit of context. This is a very particular use case, but if anyone else reads this and wants to know how to work with this type of constraint, hopefully that helps.
Interpreting your constraint as "Penalize by 1 for each container used", this should work:
Constraint maximizeContainerCompleteness(ConstraintFactory constraintFactory) {
    return constraintFactory.forEach(CandidateAssignment.class)
            .filter(CandidateAssignment::isAssigned)
            .flattenLast(CandidateAssignment::getContainerIds)
            .distinct()
            .penalizeConfigurable("Group by container");
}
What it does: for each assigned candidate assignment, flatten its set of container ids (resulting in a stream of non-distinct used container ids), take the distinct elements of that stream (resulting in a stream of distinct used container ids), and trigger a penalize call for each one.
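To see what that pipeline computes, here is a rough plain-collections analogue in Python. It is not OptaPlanner code and it recomputes everything from scratch instead of incrementally; the function name and the (is_assigned, container_ids) tuple shape are assumptions made for this sketch.

```python
from itertools import chain


def count_distinct_containers(assignments):
    """assignments: iterable of (is_assigned, container_ids) pairs (hypothetical shape)."""
    # filter: keep only the container-id sets of assigned candidates
    assigned_sets = (ids for is_assigned, ids in assignments if is_assigned)
    # flatten: one stream of (possibly repeated) container ids
    flattened = chain.from_iterable(assigned_sets)
    # distinct + one penalty per element: the number of unique ids
    return len(set(flattened))


assignments = [
    (True, {1, 2}),
    (True, {2, 3}),
    (False, {9}),  # unassigned, so container 9 is not counted
]
penalty = count_distinct_containers(assignments)
```

With the sample data, containers 1, 2 and 3 are each counted once, so the penalty is 3.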
Not to take away from Christopher's correct answer, but there are various ways you could do that. For example, consider conditional propagation (ifExists()):
return constraintFactory.forEach(Container.class)
        .ifExists(CandidateAssignment.class,
                Joiners.filtering((container, candidate) -> candidate.isAssigned()
                        && candidate.getContainerIds().contains(container.getId())))
        .penalizeConfigurable("Penalize assigned containers",
                container -> 1);
I have a hunch that this approach will be faster, but YMMV. I recommend you benchmark the two approaches and pick the one that performs better.
This approach also has the extra benefit that the Container instance shows up in the constraint matches, rather than some anonymous Integer.

How does Redis check if an element is a member of a set or not?

I have read online that Redis can say whether an element is a member of a set in O(1) time. I want to know how Redis does this. What algorithm does Redis use to achieve this?
A Redis Set is implemented internally in one of two ways: an intset or a hashtable. The intset is a special optimization for integer-only sets and uses the intsetSearch function to search the set. This function, however, uses a binary search, so that's technically O(log N). However, since the cardinality of intsets is capped at a constant (the set-max-intset-entries configuration directive), we can assume O(1) accurately reflects the complexity here.
The hashtable is used for a lot of things in Redis, including the implementation of Sets. It uses a hash function on the key to map it into a table (array) of entries; checking whether the hashed key value is in the array is straightforwardly done in O(1) in dictFind. The elements under each hashed key are stored as a linked list, so again you're basically talking O(N) to traverse it, but given the hash function's extremely low probability of collisions (hmm, need some sort of citation here?) these lists are extremely short, so we can safely assume it is effectively O(1).
Because of the above, SISMEMBER's claim of being O(1) in terms of computational complexity is valid.
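The two lookup strategies described above can be sketched as follows. This is an illustration in Python, not Redis's actual C code: intset_contains and ChainedHashSet are hypothetical names, and the hash set here has a fixed bucket count and never resizes, so it only stays fast while its chains remain short.

```python
from bisect import bisect_left


def intset_contains(sorted_ints, value):
    """Binary search over a sorted integer array: O(log N) comparisons."""
    pos = bisect_left(sorted_ints, value)
    return pos < len(sorted_ints) and sorted_ints[pos] == value


class ChainedHashSet:
    """Hash table with chaining: hash to a bucket, then scan that bucket's short list."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, member):
        return self.buckets[hash(member) % len(self.buckets)]

    def add(self, member):
        bucket = self._bucket(member)
        if member not in bucket:
            bucket.append(member)

    def contains(self, member):
        # Cost is proportional to the chain length, not to the size of the whole set.
        return member in self._bucket(member)


members = ChainedHashSet()
for name in ("alice", "bob", "carol"):
    members.add(name)
```

In both cases the work per membership check is bounded by something small (a capped array size, or a short chain), which is the sense in which the O(1) claim holds.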

How to implement a scalable, unordered collection in DynamoDB?

I am looking into implementing a scalable unordered collection of objects on top of Amazon DynamoDB. So far the following options have been considered:
1. Use DynamoDB document data types (map, list) and use document path to access stand-alone items. This has one obvious drawback for collection being limited to 400KB of data, meaning perhaps 1..10K objects depending on their size. Less obvious drawback is that cost of insertion of a new object into such collection is going to be huge: Amazon specifies that the write capacity will be deducted based on the total item size, not just newly added object -- therefore ~400 capacity units for inserting 1KB object when approaching the size limit. So considering this ruled out?
2. Using composite primary hash + range key, where primary hash remains the same for all objects in the collection, and range key is just something random or an atomic counter. Obvious drawback is that having identical hash key results in bad key distribution -- cardinality is low when there are collections with large number of objects. This means bad partitioning, and having a scale issue with all reads/writes on the same collection being stuck to one shard, becoming subject to 3000 reads / 1000 writes per second limitation of DynamoDB partition.
3. Using global secondary index with secondary hash + range key, where hash key remains the same for all objects belonging to the same collection, and range key is just something random or an atomic counter. Similar to above, partitioning becomes poor for the GSI, and it will become a bottleneck with too many identical hashes draining all the provisioned capacity to the index rapidly. I didn't find how the GSI is implemented exactly, thus not sure how badly it suffers from low cardinality.
The question is whether I could live with (2) or (3) and suffer from non-ideal key distribution, whether there is another way of implementing a collection that I overlooked, or whether I should consider another NoSQL database engine altogether.
This is a "shooting from the hip" answer, what you end up doing may depend on how much and what type of reading and writing you do.
Two things the dynamo docs encourage you to avoid are hot keys and, in general, scans. You noted that in cases (2) and (3), you end up with a hot key. If you expect this to scale (large collections), the hot key will probably hurt more and more, especially if this is a write-intensive application.
The docs on Query and Scan operations (http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html) say that, for a query, "you must specify the hash key attribute name and value as an equality condition." So if you want to avoid scans, this might still force your hand and put you back into that hot key situation.
Maybe one route would be to embrace doing a scan operation, but just have one table devoted to your collection. Then you could just have a fully random (well distributed) hash key and do a scan every time. This assumes you always want everything from the collection (you didn't say). This will still hurt if you scale up to a large collection, but if you always want the full set back, you'll have to deal with that pain regardless. If you just want a subset, you can add a limit parameter. This would help performance, but you will always get back the same subset (or you can use the last evaluated key and keep going). The docs also mention parallel scans.
If you are using AWS, ElastiCache/Redis might be another route to try? The first pass might code up a lot faster/cleaner than situation (1) that you mentioned.

How to represent a binary relation

I plan to make a class that represents a strict partially ordered set, and I assume the most natural way to model its interface is as a binary relation. This gives functions like:
bool test(elementA, elementB); //return true if elementA < elementB
void set(elementA, elementB); //declare that elementA < elementB
void clear(elementA, elementB); //forget that elementA < elementB
and possibly functions like:
void applyTransitivity(); //if test(a,b) and test(b, c), then set(a, c)
bool checkIrreflexivity(); //return true if for no a, a < a
bool checkAsymmetry(); //return true if for no a and b, a < b and b < a
The naive implementation would be to have a list of pairs such that (a, b) indicates a < b. However, it's probably not optimal. For example, test would be linear time. Perhaps it could be better done as a hash map of lists.
Ideally, though, the in memory representation would by its nature enforce applyTransitivity to always be "in effect" and not permit the creation of edges that cause reflexivity or symmetry. In other words, the degrees of freedom of the data structure represent the degrees of freedom of a strict poset. Is there a known way to do this? Or, more realistically, is there a means of checking for being cyclical, and maintaining transitivity that is amortized and iterative with each call to set and clear, so that the cost of enforcing the correctness is low. Is there a working implementation?
Okay, let's talk about achieving bare metal-scraping micro-efficiency, and you can choose how deep down that abyss you want to go. At this architectural level, there are no data structures like hash maps and lists, there aren't even data types, just bits and bytes in memory.
As an aside, you'll also find a lot of info on representations here by looking into common representations of DAGs. However, most of the common reps are designed more for convenience than efficiency.
Here, we want the data for a to be fused with that adjacency data into a single memory block. So you want to store the 'list', so to speak, of items that have a relation to a in a's own memory block so that we can potentially access a and all the elements related to a within a single cache line (bonus points if those related elements might also fit in the same cache line, but that's an NP-hard problem).
You can do that by storing, say, 32-bit indices in a. We can model such objects like so if we go a little higher level and use C for exemplary purposes:
struct Node
{
    // node data
    ...
    int links[]; // variable-length struct
};
This makes the Node a variable-length structure whose size and potentially even address changes, so we need an extra level of indirection to get stability and avoid invalidation, like an index to an index (if you control the memory allocator/array and it's purely contiguous), or an index to a pointer (or reference in some languages).
That makes your test function still involve a linear time search, but linear with respect to the number of elements related to a, not the number of elements total. Because we used a variable-length structure, a and its neighbor indices will potentially fit in a single cache line, and it's likely that a will already be in the cache just to make the query.
It's similar to the basic idea you had of the hash map storing lists, but without the explosion of lists overhead and without the hash lookup (which may be constant time but not nearly as fast as just accessing the connections to a from the same memory block). Most importantly, it's far more cache-friendly, and that's often going to make the difference between a few cycles and hundreds.
Now this means that you still have to roll up your sleeves and check for things like cycles yourself. If you want a data structure that more directly and conveniently models the problem, you'll find a nicer fit with graph data structures revolving around a formalization of a directed edge. However, those are much more convenient than they are efficient.
If you need the container to be generic and a can be any given type, T, then you can always wrap it (using C++ now):
template <class T>
struct Node
{
    T node_data;
    int links[1]; // VLS, not necessarily actually storing 1 element
};
And still fuse this all into one memory block this way. We need placement new here to preserve those C++ object semantics and possibly keep an eye on alignment here.
Transitivity checks always involve a search of some sort (breadth first or depth first). I don't think there's any rep that avoids this unless you want to memoize/cache a potentially massive explosion of transitive data.
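As a rough illustration of that search, here is a small Python sketch that keeps only the declared pairs and answers test with a depth-first walk. It deliberately uses a plain dict of sets (closer to the "hash map of lists" idea in the question) rather than the fused memory blocks described above, and the class name StrictPoset and its error handling are assumptions for this example. Rejecting any pair that would close a cycle keeps the relation irreflexive and asymmetric without storing the transitive closure.

```python
class StrictPoset:
    def __init__(self):
        # element -> set of elements declared directly greater than it
        self.succ = {}

    def _reachable(self, start, target):
        """Depth-first search along declared pairs."""
        stack, seen = [start], set()
        while stack:
            cur = stack.pop()
            if cur == target:
                return True
            if cur in seen:
                continue
            seen.add(cur)
            stack.extend(self.succ.get(cur, ()))
        return False

    def test(self, a, b):
        """True if a < b follows from the declared pairs by transitivity."""
        return a != b and self._reachable(a, b)

    def set(self, a, b):
        """Declare a < b, refusing pairs that would break irreflexivity or asymmetry."""
        if a == b or self._reachable(b, a):
            raise ValueError("declaring this pair would create a cycle")
        self.succ.setdefault(a, set()).add(b)

    def clear(self, a, b):
        """Forget the declared pair a < b (implied pairs may disappear with it)."""
        self.succ.get(a, set()).discard(b)


poset = StrictPoset()
poset.set(1, 2)
poset.set(2, 3)
```

Here test(1, 3) holds by transitivity even though the pair was never declared, while set(3, 1) is rejected. The price is that every test and set pays for a search, which is the trade-off described above.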
At this point you should have something pretty fast if you want to go this deep down the abyss and have a solution that's really hard to maintain and understand. I've unfortunately found that this doesn't impress the ladies very much as with the case of having a car that goes really fast, but it can make your software go really, really fast.

consistent hashing vs. rendezvous (HRW) hashing - what are the tradeoffs?

There is a lot available on the Net about consistent hashing, and implementations in several languages available. The Wikipedia entry for the topic references another algorithm with the same goals:
Rendezvous Hashing
This algorithm seems simpler, and doesn't need the addition of replicas/virtuals around the ring to deal with uneven loading issues. As the article mentions, it appears to run in O(n) which would be an issue for large n, but references a paper stating it can be structured to run in O(log n).
My question for people with experience in this area is, why would one choose consistent hashing over HRW, or the reverse? Are there use cases where one of these solutions is the better choice?
Many thanks.
Primarily I would say the advantage of consistent hashing is when it comes to hotspots. Depending on the implementation, it's possible to manually modify the token ranges to deal with them.
With HRW, if you somehow end up with hotspots (e.g. caused by poor hashing algorithm choices), there isn't much you can do about it short of removing the hotspot and adding a new one, which should balance the requests out.
Big advantage to HRW is when you add or remove nodes you maintain an even distribution across everything. With consistent hashes they resolve this by giving each node 200 or so virtual nodes, which also makes it difficult to manually manage ranges.
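For comparison, a consistent-hash ring with virtual nodes can be sketched like this. It is a minimal Python illustration, not any particular library's implementation: HashRing, ring_hash, the "node#i" vnode naming and the use of SHA-256 are all assumptions for this example.

```python
import bisect
import hashlib


def ring_hash(value):
    """Map a string to a 64-bit position on the ring."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=200):
        # Each physical node is placed on the ring at `vnodes` pseudo-random points.
        self._ring = sorted(
            (ring_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def lookup(self, key):
        # Binary search for the first point clockwise from the key, wrapping at the end.
        idx = bisect.bisect_right(self._points, ring_hash(key)) % len(self._points)
        return self._ring[idx][1]


ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.lookup("some-key")
```

The sorted list of points is the state that has to be kept and updated as nodes come and go, and it is also what makes the ranges hard to manage by hand once every node owns a couple of hundred of them.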
Speaking as someone who's just had to choose between the two approaches and who ultimately plumped for HRW hashing: My use case was a simple load balancing one with absolutely no reassignment requirement -- if a node died it's perfectly OK to just choose a new one and start again. No rebalancing of existing data is required.
1) Consistent Hashing requires a persistent hashmap of the nodes and vnodes (or at least a sensible implementation does; you could build all the objects on every request... but you really don't want to!). HRW does not (it's stateless). Nothing needs to be modified when machines join or leave the cluster, so there is no concurrency to worry about (except that your clients need a good view of the state of the cluster, which is the same in both cases).
2) HRW is easier to explain and understand (and the code is shorter). For example, this is a complete HRW algorithm implemented in Riverbed Stingray TrafficScript. (Note there are better hash algorithms to choose than MD5; it's overkill for this job.)
$nodes = pool.listActiveNodes("stingray_test");
# Get the key
$key = http.getFormParam("param");
$biggest_hash = "";
$node_selected = "";
foreach ($node in $nodes) {
    $hash_comparator = string.hashMD5($node . '-' . $key);
    # If the combined hash is the biggest we've seen, we have a candidate
    if ( $hash_comparator > $biggest_hash ) {
        $biggest_hash = $hash_comparator;
        $node_selected = $node;
    }
}
connection.setPersistenceNode( $node_selected );
3) HRW provides an even distribution when you lose or gain nodes (assuming you chose a sensible hash function). Consistent Hashing doesn't guarantee that but with enough vnodes it's probably not going to be an issue
4) Consistent Hashing may be faster: in normal operation a lookup should be O(log N), where N is the number of nodes * the replica factor for vnodes. However, if you don't have a lot of nodes (I didn't), then HRW is probably going to be fast enough for you.
4.1) As you mentioned, Wikipedia says that there is a way to do HRW in O(log N) time. I don't know how to do that! I'm happy with my O(N) time on 5 nodes.....
In the end, the simplicity and the stateless nature of HRW made the choice for me....
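For readers who don't use TrafficScript, the same idea can be sketched in Python. This is an illustration under stated assumptions, not a port of the script above: hrw_score and hrw_pick are hypothetical names, SHA-256 stands in for MD5, and the node and key names are invented sample data. The second half checks the property from point 3: when a node is removed, only the keys that lived on that node move.

```python
import hashlib


def hrw_score(node, key):
    """Combined hash of node and key, as an integer so scores compare numerically."""
    digest = hashlib.sha256(f"{node}-{key}".encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def hrw_pick(nodes, key):
    """O(N) scan: the node with the highest combined hash wins."""
    return max(nodes, key=lambda node: hrw_score(node, key))


nodes = ["node-a", "node-b", "node-c", "node-d"]
keys = [f"key-{i}" for i in range(1000)]

placement_before = {k: hrw_pick(nodes, k) for k in keys}

# Simulate node-c leaving the cluster; no stored state has to be updated.
survivors = [n for n in nodes if n != "node-c"]
placement_after = {k: hrw_pick(survivors, k) for k in keys}

moved = [k for k in keys if placement_before[k] != placement_after[k]]
```

Every key in moved was previously on node-c, because removing a node that was not the highest scorer for a key cannot change which node scores highest for it.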