OptaPlanner Constraint Streams: Count Distinct Values in Planning Entity Set

I'm looking for some help with OptaPlanner's constraint streams. The problem is a variant on job-shop scheduling, and my planning entities (CandidateAssignment) are wrapping around two decision variables: choice of robot and assigned time grain. Each CandidateAssignment also has a field (a Set) denoting which physical containers in a warehouse will be filled by assigning that task.
The constraint I'm trying to enforce is to minimize the total number of containers used by all CandidateAssignments in a solution (the goal being to guide OptaPlanner towards grouping tasks by container... there are domain-specific benefits to this in the warehouse). If each CandidateAssignment could only service a single container, this would be easy:
protected Constraint maximizeContainerCompleteness(ConstraintFactory constraintFactory) {
    return constraintFactory.forEach(CandidateAssignment.class)
            .filter(CandidateAssignment::isAssigned)
            .groupBy(CandidateAssignment::getContainerId, countDistinct())
            .penalizeConfigurable("Group by container");
}
Moving from a single ID to a collection seems less straightforward to me (i.e., if CandidateAssignment:getContainerIds returns a set of integers). Any help would be much appreciated.
EDIT: Thanks Christopher and Lukáš for the responses. Christopher's constraint matches my use case (minimize the number of containers serviced by a solution). However, this ends up being a pretty poor way to guide OptaPlanner towards (more) optimal solutions since it's operating via iterated local search. Given a candidate solution, the majority of neighbors in that solution's neighborhood will have equal value for that constraint (# unique containers used), so it doesn't have much power of discernment.
The approach I've tested with reasonable results is as follows:
protected Constraint maximizeContainerCompleteness(ConstraintFactory constraintFactory) {
    return constraintFactory.forEach(CandidateAssignment.class)
            .filter(CandidateAssignment::isAssigned)
            .join(Container.class, Joiners.filtering(
                    (candidate, container) -> candidate.getContainerIds().contains(container.getContainerId())))
            .rewardConfigurable("Group by container", (candidate, container) -> container.getPercentFilledSquared());
}
This is a modified version of Lukáš' answer. It works by prioritizing containers which are "mostly full." In the real-world use case (which I think I explained pretty poorly above), we'd like to minimize the number of containers used in a solution because it allows the warehouse to replace those containers with new ones which are "easier" to fulfill (the search space is less constrained). We're planning in a receding time horizon, and having many partially filled bins means that each planning horizon becomes increasingly more difficult to schedule. "Closing" containers by fulfilling all associated tasks means we can replace that container with a new one and start fresh.
Anyways, just a bit of context. This is a very particular use case, but if anyone else reads this and wants to know how to work with this type of constraint, hopefully that helps.
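To illustrate why the "mostly full" reward separates neighbouring solutions where a plain container count does not, here is a minimal plain-Java sketch with no OptaPlanner dependency. The class FillReward, its Container stand-in and the task counts are all hypothetical, and it assumes getPercentFilledSquared() is simply the integer fill percentage squared, which is only one possible reading of that method.

```java
import java.util.Arrays;
import java.util.List;

final class FillReward {
    /** Hypothetical stand-in for Container: assigned tasks out of the total tasks it needs. */
    static final class Container {
        final int assignedTasks;
        final int totalTasks;

        Container(int assignedTasks, int totalTasks) {
            this.assignedTasks = assignedTasks;
            this.totalTasks = totalTasks;
        }

        // Assumption: fill percentage (0..100), squared.
        int percentFilledSquared() {
            int percent = 100 * assignedTasks / totalTasks;
            return percent * percent;
        }
    }

    /** The original measure: number of containers touched by at least one assigned task. */
    static long containersUsed(List<Container> containers) {
        return containers.stream().filter(c -> c.assignedTasks > 0).count();
    }

    /** One reward per (assigned task, container) pair, mirroring the join in the constraint. */
    static long totalReward(List<Container> containers) {
        return containers.stream()
                .mapToLong(c -> (long) c.assignedTasks * c.percentFilledSquared())
                .sum();
    }

    public static void main(String[] args) {
        // The same four assigned tasks, distributed differently over two containers.
        List<Container> spread = Arrays.asList(new Container(2, 4), new Container(2, 4));
        List<Container> skewed = Arrays.asList(new Container(3, 4), new Container(1, 4));
        System.out.println(containersUsed(spread) + " vs " + containersUsed(skewed)); // 2 vs 2
        System.out.println(totalReward(spread) + " vs " + totalReward(skewed));       // 10000 vs 17500
    }
}
```

Both layouts use two containers, so the count cannot tell them apart, while the squared reward prefers the layout that is closer to closing one container.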

Interpreting your constraint as "Penalize by 1 for each container used", this should work:
Constraint maximizeContainerCompleteness(ConstraintFactory constraintFactory) {
    return constraintFactory.forEach(CandidateAssignment.class)
            .filter(CandidateAssignment::isAssigned)
            .flattenLast(CandidateAssignment::getContainerIds)
            .distinct()
            .penalizeConfigurable("Group by container");
}
What it does: for each assigned candidate assignment, flatten its set of container ids (resulting in a stream of non-distinct used container ids), take the distinct elements of that stream (resulting in a stream of distinct used container ids), and trigger a penalize call for each one.
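If it helps to see the same filter / flatten / distinct pipeline outside of OptaPlanner, here is a rough analogue using plain java.util.stream. The Assignment class is a hypothetical stand-in for CandidateAssignment, and unlike a constraint stream this recomputes everything from scratch on each call, so it only illustrates the semantics, not how the solver evaluates it.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

final class DistinctContainers {
    /** Hypothetical stand-in for the planning entity; only the two properties used above. */
    static final class Assignment {
        final boolean assigned;
        final Set<Integer> containerIds;

        Assignment(boolean assigned, Integer... containerIds) {
            this.assigned = assigned;
            this.containerIds = new HashSet<>(Arrays.asList(containerIds));
        }
    }

    static Set<Integer> usedContainerIds(List<Assignment> assignments) {
        return assignments.stream()
                .filter(a -> a.assigned)                         // like filter(isAssigned)
                .flatMap(a -> a.containerIds.stream())           // like flattenLast(getContainerIds)
                .collect(Collectors.toCollection(TreeSet::new)); // like distinct()
    }

    public static void main(String[] args) {
        List<Assignment> assignments = Arrays.asList(
                new Assignment(true, 1, 2),
                new Assignment(true, 2, 3),
                new Assignment(false, 4)); // unassigned, so container 4 is not counted
        Set<Integer> used = usedContainerIds(assignments);
        // One penalty per distinct used container id.
        System.out.println(used + " -> penalty " + used.size()); // [1, 2, 3] -> penalty 3
    }
}
```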

Not to take away from Christopher's correct answer, but there are various ways you could do that. For example, consider conditional propagation (ifExists()):
return constraintFactory.forEach(Container.class)
        .ifExists(CandidateAssignment.class,
                Joiners.filtering((container, candidate) -> candidate.isAssigned()
                        && candidate.getContainerIds().contains(container.getId())))
        .penalizeConfigurable("Penalize assigned containers",
                container -> 1);
I have a hunch that this approach will be faster, but YMMV. I recommend you benchmark the two approaches and pick the one that performs better.
This approach also has the extra benefit of the Container instance showing up in the constraint matches, rather than some anonymous Integer.

Related

Is it possible to use dynamic weighting (#ConstraintConfiguration) with an EasyScoreCalculator

I've been reading the documentation and it provides some examples for drools and constraint streams, but it doesn't explicitly say whether you can or cannot use Constraint Configuration with an EasyScoreCalculator.
As the ConstraintConfiguration is a field in the PlanningSolution class, it's available in the EasyScoreCalculator's calculateScore(Solution_ solution) method, which computes the score of the entire solution for every move.
Let me just note that the EasyScoreCalculator does not scale for bigger data sets - exactly because it computes the score of the entire solution for every move.

Neo4j - Find node by ID - How to get the ID for querying?

I want to be able to find a specific node by its ID for performance reasons (IDs are more efficient than indexes).
In order to execute the following example:
MATCH (s)
WHERE ID(s) = 65110
RETURN s
I will need the ID of the node (65110 in this case)
But how do I get it? Since the ID is auto-generated, it's impossible to find the ID without querying the graph, which kind of defeats the purpose since I will already have the node.
Am I missing something?
TL;DR: use an indexed property for lookups unless you absolutely need to optimise and can measure the difference.
Typically you use an index lookup as an entry point to the graph, that is, to obtain the node that provides the start of an edge traversal. While the pointer-like nature of Neo4j node IDs means they are theoretically faster, index lookups are also very efficient so you should not discount them on performance grounds unless you are sure it will make a measurable difference.
You should also consider that Neo4j node IDs are not stable. If you delete a node it is possible for the same ID to be re-used in future. For this reason they should really be considered an internal implementation detail and not one that should be relied on as part of your application's external interface.
That said, I have an application that stores Neo4j IDs in a Solr index for looking up nodes in bulk, but this index is considered volatile and the nodes also contain an indexed, application-generated UUID property (with a unique constraint) that serves as their main "primary key".
Further reading and discussion: https://github.com/neo4j/neo4j/issues/258

How to implement a scalable, unordered collection in DynamoDB?

I am looking into implementing a scalable unordered collection of objects on top of Amazon DynamoDB. So far the following options have been considered:
(1) Use DynamoDB document data types (map, list) and use document paths to access individual items. This has one obvious drawback: the collection is limited to 400KB of data, meaning perhaps 1..10K objects depending on their size. A less obvious drawback is that the cost of inserting a new object into such a collection is going to be huge: Amazon specifies that the write capacity will be deducted based on the total item size, not just the newly added object -- therefore ~400 capacity units for inserting a 1KB object when approaching the size limit. So I consider this ruled out.
(2) Use a composite primary hash + range key, where the hash key remains the same for all objects in the collection, and the range key is just something random or an atomic counter. The obvious drawback is that an identical hash key results in bad key distribution -- cardinality is low when there are collections with a large number of objects. This means bad partitioning and a scaling issue, with all reads/writes on the same collection stuck on one shard and subject to the 3000 reads / 1000 writes per second limit of a DynamoDB partition.
(3) Use a global secondary index with a secondary hash + range key, where the hash key remains the same for all objects belonging to the same collection, and the range key is just something random or an atomic counter. Similar to the above, partitioning becomes poor for the GSI, and it will become a bottleneck, with too many identical hashes rapidly draining all the capacity provisioned to the index. I couldn't find out exactly how the GSI is implemented, so I'm not sure how badly it suffers from low cardinality.
The question is whether I could live with (2) or (3) and suffer from non-ideal key distribution, whether there is another way of implementing a collection that I overlooked, or whether I should consider another NoSQL database engine altogether.
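The "~400 capacity units" figure for option (1) can be checked with a little arithmetic. This is a rough sketch, not an AWS API: the class WriteCost and its method are hypothetical, and it assumes the documented rule that one write capacity unit covers a standard write of up to 1 KB (rounded up) and that an update is billed on the larger of the item's size before and after the change.

```java
final class WriteCost {
    static final int BYTES_PER_WRITE_UNIT = 1024;

    /** Write capacity units for updating an item, given its size before and after the write. */
    static int writeUnits(int bytesBefore, int bytesAfter) {
        int billedBytes = Math.max(bytesBefore, bytesAfter);
        // Round up to whole units; even a tiny write costs one unit.
        return Math.max(1, (billedBytes + BYTES_PER_WRITE_UNIT - 1) / BYTES_PER_WRITE_UNIT);
    }

    public static void main(String[] args) {
        // Appending a 1KB object to a 399KB collection item.
        System.out.println(writeUnits(399 * 1024, 400 * 1024)); // 400
        // Writing the same 1KB object as its own item.
        System.out.println(writeUnits(0, 1024));                // 1
    }
}
```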
This is a "shooting from the hip" answer, what you end up doing may depend on how much and what type of reading and writing you do.
Two things the dynamo docs encourage you to avoid are hot keys and, in general, scans. You noted that in cases (2) and (3), you end up with a hot key. If you expect this to scale (large collections), the hot key will probably hurt more and more, especially if this is a write-intensive application.
The docs on Query and Scan operations (http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html) say that, for a query, "you must specify the hash key attribute name and value as an equality condition." So if you want to avoid scans, this might still force your hand and put you back into that hot key situation.
Maybe one route would be to embrace doing a scan operation, but just have one table devoted to your collection. Then you could just have a fully random (well distributed) hash key and do a scan every time. This assumes you always want everything from the collection (you didn't say). This will still hurt if you scale up to a large collection, but if you always want the full set back, you'll have to deal with that pain regardless. If you just want a subset, you can add a limit parameter. This would help performance, but you will always get back the same subset (or you can use the last evaluated key and keep going). The docs also mention parallel scans.
If you are using AWS, ElastiCache/Redis might be another route to try? The first pass might code up a lot faster/cleaner than situation (1) that you mentioned.

consistent hashing vs. rendezvous (HRW) hashing - what are the tradeoffs?

There is a lot available on the Net about consistent hashing, with implementations in several languages. The Wikipedia entry for the topic references another algorithm with the same goals:
Rendezvous Hashing
This algorithm seems simpler, and doesn't need the addition of replicas/virtuals around the ring to deal with uneven loading issues. As the article mentions, it appears to run in O(n) which would be an issue for large n, but references a paper stating it can be structured to run in O(log n).
My question for people with experience in this area is, why would one choose consistent hashing over HRW, or the reverse? Are there use cases where one of these solutions is the better choice?
Many thanks.
Primarily I would say the advantage of consistent hashing is when it comes to hotspots. Depending on the implementation it's possible to manually modify the token ranges to deal with them.
With HRW if somehow you end up with hotspots (ie caused by poor hashing algorithm choices) there isn't much you can do about it short of removing the hotspot and adding a new one which should balance the requests out.
Big advantage to HRW is when you add or remove nodes you maintain an even distribution across everything. With consistent hashes they resolve this by giving each node 200 or so virtual nodes, which also makes it difficult to manually manage ranges.
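For reference, the ring with virtual nodes mentioned above might look roughly like this in Java. This is a sketch under assumptions rather than any particular library's implementation: the class HashRing, the "node#i" token naming and the use of the first 8 bytes of an MD5 digest as the token are all choices made for illustration, and token collisions between nodes are ignored.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

final class HashRing {
    // Token -> owning node, ordered by token so we can walk "clockwise".
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    HashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    void addNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    void removeNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(node + "#" + i));
        }
    }

    /** The owner is the first token at or after the key's hash, wrapping around at the end. */
    String nodeFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("ring has no nodes");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    /** First 8 bytes of the MD5 digest, packed into a long. */
    static long hash(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xffL);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        HashRing ring = new HashRing(200);
        ring.addNode("node-a");
        ring.addNode("node-b");
        ring.addNode("node-c");
        System.out.println(ring.nodeFor("some-key"));
    }
}
```

Note the state this carries around (the TreeMap of tokens), which is also the hook for manually adjusting token ranges; the HRW approach needs neither.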
Speaking as someone who's just had to choose between the two approaches and who ultimately plumped for HRW hashing: My use case was a simple load balancing one with absolutely no reassignment requirement -- if a node died it's perfectly OK to just choose a new one and start again. No rebalancing of existing data is required.
1) Consistent hashing requires a persistent hashmap of the nodes and vnodes (or at least a sensible implementation does; you could build all the objects on every request, but you really don't want to!). HRW does not (it's stateless). Nothing needs to be modified when machines join or leave the cluster, and there is no concurrency to worry about (except that your clients need a good view of the state of the cluster, which is the same in both cases).
2) HRW is easier to explain and understand (and the code is shorter). For example, this is a complete HRW algorithm implemented in Riverbed Stingray TrafficScript. (Note there are better hash algorithms to choose than MD5; it's overkill for this job.)
$nodes = pool.listActiveNodes("stingray_test");
# Get the key
$key = http.getFormParam("param");
$biggest_hash = "";
$node_selected = "";
foreach ($node in $nodes) {
    $hash_comparator = string.hashMD5($node . '-' . $key);
    # If the combined hash is the biggest we've seen, we have a candidate
    if ( $hash_comparator > $biggest_hash ) {
        $biggest_hash = $hash_comparator;
        $node_selected = $node;
    }
}
connection.setPersistenceNode( $node_selected );
3) HRW provides an even distribution when you lose or gain nodes (assuming you chose a sensible hash function). Consistent Hashing doesn't guarantee that but with enough vnodes it's probably not going to be an issue
4) Consistent hashing may be faster: in normal operation it should be of order log(N), where N is the number of nodes * the replica factor for vnodes. However, if you don't have a lot of nodes (I didn't) then HRW is probably going to be fast enough for you.
4.1) As you mentioned, Wikipedia says there is a way to do HRW in log(N) time. I don't know how to do that! I'm happy with my O(N) time on 5 nodes.
In the end, the simplicity and the stateless nature of HRW made the choice for me....
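For readers who don't use TrafficScript, here is roughly the same O(N) HRW selection in Java. It is a sketch, not a port of any library: the class Rendezvous and the node names are made up, the node and key are joined with '-' as in the script above, and the score is the first 8 bytes of the MD5 digest compared as an unsigned number (ties simply go to the earlier node in the list).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;

final class Rendezvous {
    /** Returns the node whose combined hash with the key is the highest. */
    static String pick(List<String> nodes, String key) {
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("no nodes to choose from");
        }
        String best = null;
        long bestScore = 0;
        for (String node : nodes) {
            long score = hash(node + "-" + key);
            if (best == null || Long.compareUnsigned(score, bestScore) > 0) {
                best = node;
                bestScore = score;
            }
        }
        return best;
    }

    /** First 8 bytes of the MD5 digest, packed into a long. */
    static long hash(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xffL);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("n1", "n2", "n3", "n4", "n5");
        System.out.println(pick(nodes, "session-42"));
    }
}
```

There is no state beyond the node list itself: if a node disappears, only the keys that had picked it get a new winner, because every other key's highest-scoring node is still in the list.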

What is the cost of object reference in Scala?

Assume we build an object to represent some network (social, wireless, whatever). So we have some 'node' object to represent the KIND of network, different nodes might have different behaviors and so forth. The network has a MutableList of nodes.
But each node has neighbors, and these neighbors are also nodes. So somewhere, there has to be a list, per node, of all of the neighbors of that node--or such a list has to be generated on the fly whenever it is needed. If the list of neighbors is stored in the node objects, is it cheaper to store it (a) as a list of nodes, or (b) as list of numbers that can be used to reference nodes out of the network?
Some code for clarity:
// approach (a): each node stores references to its neighbors
class Network {
    val nodes = new MutableList[Node]
    // other stuff //
}
class Node {
    val neighbors = new MutableList[Node]
    // other stuff //
}
// approach (b): each node stores indices into the network's list
class Network {
    val nodes = new MutableList[Node]
    val indexed_list = //(some function to get an indexed list off nodes)
    // other stuff //
}
class Node {
    val neighbors = new MutableList[Int]
    // other stuff //
}
Approach (a) seems like the easiest. My first question is whether this is costly in Scala 2.8, and the second is whether it breaks the principle of DRY?
Short answer: premature optimization is the root of etc. Use the clean reference approach. When you have performance issues there's no substitute for profiling and benchmarking.
Long answer: Scala uses the exact same reference machinery as Java so this is really a JVM question more than a Scala question. Formally the JVM spec doesn't say one word about how references are implemented. In practice they tend to be word sized or smaller pointers that either point to an object or index into a table that points to the object (the latter helps garbage collectors).
Either way, an array of refs is about the same size as an array of ints on a 32 bit vm or about double on a 64bit vm (unless compressed-oops are used). That doubling might be important to you or might not.
If you go with the ref based approach, each traversal from a node to a neighbor is a reference indirection. With the int based approach, each traversal from a node to a neighbor is a lookup into a table and then a reference indirection. So the int approach is more expensive computationally. And that's assuming you put the ints into a collection that doesn't box the ints. If you do box the ints then it's just pure craziness because now you've got just as many references as the original AND you've got a table lookup.
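Since this is a JVM question, the difference between the two traversals can be sketched in Java. The class and method names here are invented for illustration, and the index variant assumes the network keeps its nodes in a plain array so that an index is just an array offset.

```java
import java.util.ArrayList;
import java.util.List;

final class NeighborLayouts {
    /** Approach (a): each node holds direct references to its neighbors. */
    static final class RefNode {
        final int id;
        final List<RefNode> neighbors = new ArrayList<>();

        RefNode(int id) {
            this.id = id;
        }
    }

    /** Approach (b): each node holds indices into a network-wide table. */
    static final class IndexNode {
        final int id;
        final int[] neighbors; // primitive array, so the indices are not boxed

        IndexNode(int id, int... neighbors) {
            this.id = id;
            this.neighbors = neighbors;
        }
    }

    static int sumNeighborIds(RefNode node) {
        int sum = 0;
        for (RefNode neighbor : node.neighbors) {
            sum += neighbor.id; // one reference indirection per neighbor
        }
        return sum;
    }

    static int sumNeighborIds(IndexNode node, IndexNode[] table) {
        int sum = 0;
        for (int index : node.neighbors) {
            sum += table[index].id; // table lookup, then the same indirection
        }
        return sum;
    }

    public static void main(String[] args) {
        RefNode a = new RefNode(0);
        a.neighbors.add(new RefNode(1));
        a.neighbors.add(new RefNode(2));
        IndexNode[] table = {new IndexNode(0, 1, 2), new IndexNode(1, 0), new IndexNode(2, 0)};
        System.out.println(sumNeighborIds(a) + " " + sumNeighborIds(table[0], table)); // 3 3
    }
}
```

Whether the extra lookup matters in practice is exactly the kind of thing to measure rather than guess.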
Anyway, if you go with the reference based approach then the extra references can make a bit of extra work for a garbage collector. If the only references to nodes lie in one array then the gc will scan that pretty damn fast. If they're scattered all over in a graph then the gc will have to work harder to track them all down. That may or may not affect your needs.
From a cleanliness standpoint the ref based approach is much nicer. So go with it and then profile to see where you're spending your time. That or benchmark both approaches.
The question is - what kind of a cost? Memory-wise, the b) approach would probably end up consuming more memory, since you have both mutable lists, and boxed integers in that list, and another global structure holding all the indices. Also, it would probably be slower because you would need several levels of indirection to reach the neighbour node.
One important note - as soon as you start storing integers into mutable lists, they will undergo boxing. So, you will have a list of heap objects in both cases. To avoid this, and furthermore to conserve memory, in the b) approach you would have to keep a dynamically grown array of integers that are the indices of the neighbours.
Now, even if you modify the approach b) as suggested above, and make sure the indexed list in the Network class is really an efficient structure (a direct lookup table or a hash table), you would still pay an indirection cost to find your Node. And memory consumption would still be higher. The only benefit I see is in keeping some sort of a table of weak references if you're concerned you might run out of memory, and recreate the Node object when you need it and you cannot find it in your indexed_list which keeps a set of weak references.
This is, of course, just a hypothesis, you would have to profile/benchmark your code to see the difference.
My suggestion would be to use something like an ArrayBuffer in Node and use it to store direct references to nodes.
If memory concerns are an issue, and you want to do the b) approach together with weak references, then I would further suggest rolling in your own dynamically grown integer-array for neighbours, to avoid boxing with ArrayBuffer[Int].
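A hand-rolled, dynamically grown integer array along those lines could be as small as the following. It is written in Java for illustration (the same structure carries over to Scala over an Array[Int]); the class name GrowableIntArray, the initial capacity of 4 and the doubling growth policy are arbitrary choices, not something taken from a library.

```java
import java.util.Arrays;

final class GrowableIntArray {
    private int[] data = new int[4]; // primitive storage, so no Integer objects are allocated
    private int size = 0;

    void add(int value) {
        if (size == data.length) {
            data = Arrays.copyOf(data, data.length * 2); // grow by doubling
        }
        data[size++] = value;
    }

    int get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return data[index];
    }

    int size() {
        return size;
    }

    public static void main(String[] args) {
        GrowableIntArray neighbors = new GrowableIntArray();
        for (int i = 0; i < 10; i++) {
            neighbors.add(i * i);
        }
        System.out.println(neighbors.size() + " " + neighbors.get(3)); // 10 9
    }
}
```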