OptaPlanner partitioned search strategy

I'm considering OptaPlanner's partitioned search feature because of a large-scale VRPTW problem I need to deal with.
As far as I know, a custom implementation of SolutionPartitioner has to be provided according to OptaPlanner's documentation. The example of a partitioner for the cloud balancing problem is straightforward, but I wonder how to partition planning entities in a VRPTW-class problem.
Should I use some kind of clustering algorithm to do cluster-based partitioning, or should I just divide the input data as in the cloud balancing example? Sometimes a lot of customers are placed in a relatively small area but many workers are scheduled to service them; on the other hand, there can be a service area where two clearly disjoint sub-areas are visible.
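To make the cluster-based idea concrete, here is a rough sketch of what I imagine a custom SolutionPartitioner could look like (the method signature is as in recent OptaPlanner versions; VehicleRoutingSolution, Vehicle and Customer stand in for my own VRPTW domain classes, and a simple longitude-band split stands in for a real clustering step):

```java
import java.util.ArrayList;
import java.util.List;

import org.optaplanner.core.api.score.director.ScoreDirector; // older 7.x releases keep this under org.optaplanner.core.impl.score.director
import org.optaplanner.core.impl.partitionedsearch.partitioner.SolutionPartitioner;

// VehicleRoutingSolution, Vehicle and Customer are stand-ins for my own VRPTW domain classes.
public class LongitudeBandSolutionPartitioner implements SolutionPartitioner<VehicleRoutingSolution> {

    private static final int PART_COUNT = 4;

    @Override
    public List<VehicleRoutingSolution> splitWorkingSolution(
            ScoreDirector<VehicleRoutingSolution> scoreDirector, Integer runnablePartThreadLimit) {
        VehicleRoutingSolution original = scoreDirector.getWorkingSolution();

        // Create empty part solutions.
        List<VehicleRoutingSolution> partList = new ArrayList<>(PART_COUNT);
        for (int i = 0; i < PART_COUNT; i++) {
            VehicleRoutingSolution part = new VehicleRoutingSolution();
            part.setCustomerList(new ArrayList<>());
            part.setVehicleList(new ArrayList<>());
            partList.add(part);
        }

        // Assign each customer to a partition by longitude band. A real implementation would
        // plug in a proper clustering step (e.g. k-means on the coordinates) here, and - like
        // the cloud balancing example - would copy the entities into the parts instead of
        // sharing the original instances, and keep any chained planning variables consistent.
        double minLon = Double.MAX_VALUE;
        double maxLon = -Double.MAX_VALUE;
        for (Customer customer : original.getCustomerList()) {
            minLon = Math.min(minLon, customer.getLongitude());
            maxLon = Math.max(maxLon, customer.getLongitude());
        }
        double bandWidth = (maxLon - minLon) / PART_COUNT + 1e-9;
        for (Customer customer : original.getCustomerList()) {
            int index = Math.min(PART_COUNT - 1, (int) ((customer.getLongitude() - minLon) / bandWidth));
            partList.get(index).getCustomerList().add(customer);
        }

        // Every partition also needs its share of vehicles, otherwise it cannot be solved.
        int i = 0;
        for (Vehicle vehicle : original.getVehicleList()) {
            partList.get(i++ % PART_COUNT).getVehicleList().add(vehicle);
        }
        return partList;
    }
}
```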
Thank you in advance!

Related

Maintain data in Google Bigtable for longer periods

We have use-cases where we would like to store a large volume of data in Google Bigtable for long periods:
- during product development
- for performance tuning
- for demos
We need to store the data but we don't really need it to be "online" all the time. The current cost bottleneck seems to be the cost of nodes which in these cases are idle for long periods.
How is Google Bigtable being used during product development? I am aware of the development mode (and the emulator) and they are fine for some use-cases but we still need the production environment for other use-cases.
Really, what would be ideal is the ability to switch "off" Bigtable (while still paying for data stored but not for nodes) and bring up the nodes when needed. I don't believe this feature exists. In its absence are there other possible workarounds/alternatives?
It's an interesting question. I've done this with smaller projects using Datastore, with much smaller sizes (~2 GB), that have hung around for years after disabling billing. Given how much it costs to do backups/restores on those projects, I could imagine this would be a cost-prohibitive solution in the Bigtable world. It is disappointing that Google hasn't provided a better solution for this. They do talk about different storage classes, so I'd imagine disabling a project would move its assets to Coldline - but that is just rank speculation on my part.

Data model design guidelines with GEODE

We are soon going to start a project with GEODE for reference data. I would like to get some guidelines for it.
As you know, in the financial reference data world there exist complex relationships between various reference data entities like Instrument, Account, Client, etc., which might be available in a database in 3NF.
If my queries are mostly read-intensive and require joins across tables (2-5 tables), what's the best way to deal with this in an in-memory data grid?
Case 1:
Separate regions for all tables in your database, and then do a similar join using OQL as you do in the database?
Even if you do so, you will have to take great care that related entities are always co-located within the same partition.
Modeling 1-to-many and many-to-many relationships using an object graph?
Case 2:
If you know what your join queries look like, create a view model per join query having equi-join characteristics.
Confusion:
(1) I have one join query requiring Employee, Department using emp.deptId = dept.deptId [OK, fantastic, one region with such a view model exists]
(2) I have another join query requiring Employee, Department, Salary, Address joins to address a different requirement.
So again I have to create a view model to address (2), which will contain Employee and Department data similar to (1). This may soon hit the memory threshold.
Changes in the database can still be propagated by event listeners, but what are the recommendations for that?
Thanks,
Dharam
I think your general question is pretty broad and there isn't just one recommended approach to cover all use cases (primarily all your analytical views/models of your data as required by your application(s)).
Such questions involve many factors, such as the size of individual data elements, the volume of data, the frequency of access or access patterns originating from the application or applications, the timely delivery of information, how accurate the data needs to be, the size of your cluster, the physical resources of each (virtual) machine, and so on. Thus, any given approach will undoubtedly require application tuning, tuning GemFire accordingly and JVM tuning regardless of your data model. Still, a carefully crafted data model can determine the extent of such tuning.
In GemFire specifically, such tuning will involve different configuration options such as, but not limited to: data management policies, eviction (overflow) and expiration (LRU, or perhaps custom) settings along with different eviction/expiration thresholds, maybe storing data in off-heap memory, employing different partition strategies (PartitionResolver), and so on and so forth.
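For example, a few of those knobs look roughly like this through the Java API (the region name, thresholds and timeouts below are purely illustrative; off-heap additionally requires the off-heap-memory-size property on the member):

```java
import org.apache.geode.cache.Cache;
import org.apache.geode.cache.EvictionAction;
import org.apache.geode.cache.EvictionAttributes;
import org.apache.geode.cache.ExpirationAction;
import org.apache.geode.cache.ExpirationAttributes;
import org.apache.geode.cache.RegionFactory;
import org.apache.geode.cache.RegionShortcut;

public class TuningSketch {

    public static void defineTunedRegion(Cache cache) {
        RegionFactory<Object, Object> rf = cache.createRegionFactory(RegionShortcut.PARTITION);

        // Eviction: overflow the least recently used entries to disk once the region
        // holds more than 100,000 entries (an LRU-entry policy; heap-based LRU via the
        // resource manager is another option).
        rf.setEvictionAttributes(
                EvictionAttributes.createLRUEntryAttributes(100_000, EvictionAction.OVERFLOW_TO_DISK));

        // Expiration: destroy entries that have not been read or written for an hour.
        // Expiration requires statistics to be enabled on the region.
        rf.setStatisticsEnabled(true);
        rf.setEntryIdleTimeout(new ExpirationAttributes(3600, ExpirationAction.DESTROY));

        // Off-heap storage: keep values outside the JVM heap; the member must also be
        // started with the off-heap-memory-size property set.
        rf.setOffHeap(true);

        rf.create("Trade");
    }
}
```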
For example, if your Address information is relatively static, unchanging (i.e. actual "reference" data) then you might consider storing Address data in a REPLICATE Region. Data that is written to frequently (typically "transactional" data) is better off in a PARTITION Region.
Of course, as you know, any PARTITION data (managed in separate Regions) you "join" in a query (using OQL) must be collocated. GemFire/Geode does not currently support distributed joins.
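For illustration only, a rough sketch of that kind of layout (the Address, Department and Employee region names and the composite EmployeeKey are made-up examples, not a recommendation for your actual model):

```java
import java.io.Serializable;

import org.apache.geode.cache.Cache;
import org.apache.geode.cache.EntryOperation;
import org.apache.geode.cache.PartitionAttributesFactory;
import org.apache.geode.cache.PartitionResolver;
import org.apache.geode.cache.RegionShortcut;

public class ReferenceDataRegions {

    // Hypothetical composite key: the routing field (deptId) travels with the key,
    // so the resolver never needs the value.
    public static class EmployeeKey implements Serializable {
        private final long empId;
        private final long deptId;
        public EmployeeKey(long empId, long deptId) { this.empId = empId; this.deptId = deptId; }
        public long getDeptId() { return deptId; }
        // equals()/hashCode() on empId + deptId omitted for brevity
    }

    // Routes every Employee entry by its department id, so employees end up in the
    // same bucket (and member) as their Department entry.
    public static class EmployeeDeptResolver implements PartitionResolver<EmployeeKey, Object> {
        @Override
        public Object getRoutingObject(EntryOperation<EmployeeKey, Object> op) {
            return op.getKey().getDeptId();
        }
        @Override
        public String getName() { return "EmployeeDeptResolver"; }
        @Override
        public void close() { }
    }

    public static void defineRegions(Cache cache) {
        // Static reference data: a full copy on every member, cheap to read and join against locally.
        cache.createRegionFactory(RegionShortcut.REPLICATE).create("Address");

        // Department is the "parent" partitioned region; it is keyed by deptId, so its
        // default routing object is already the deptId.
        cache.createRegionFactory(RegionShortcut.PARTITION).create("Department");

        // Employee is collocated with Department and routed by deptId via the resolver.
        // That is what makes an equi-join like
        //   SELECT e, d FROM /Employee e, /Department d WHERE e.deptId = d.deptId
        // answerable at all; even then the join has to run server-side (e.g. inside a
        // function), because distributed joins are not supported.
        PartitionAttributesFactory<EmployeeKey, Object> paf = new PartitionAttributesFactory<>();
        paf.setColocatedWith("/Department");
        paf.setPartitionResolver(new EmployeeDeptResolver());
        cache.<EmployeeKey, Object>createRegionFactory(RegionShortcut.PARTITION)
                .setPartitionAttributes(paf.create())
                .create("Employee");
    }
}
```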
Additionally, certain nodes could host certain Regions, thus dividing your cluster into "transactional" vs. "analytical" nodes, where the analytical-based nodes are updated from CacheListeners on Regions in transactional nodes (be careful of this), or perhaps better yet, asynchronously using an AEQ with AsyncEventListeners. AEQs can be separately made highly available and durable as well. This transactional vs analytical approach is the basis for CQRS.
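A rough sketch of that asynchronous path, with a made-up listener and queue name; the listener would rebuild whatever analytical view model you maintain:

```java
import java.util.List;

import org.apache.geode.cache.Cache;
import org.apache.geode.cache.RegionShortcut;
import org.apache.geode.cache.asyncqueue.AsyncEvent;
import org.apache.geode.cache.asyncqueue.AsyncEventListener;
import org.apache.geode.cache.asyncqueue.AsyncEventQueue;

public class AnalyticalViewUpdater implements AsyncEventListener {

    @Override
    public boolean processEvents(List<AsyncEvent> events) {
        for (AsyncEvent event : events) {
            Object key = event.getKey();
            Object newValue = event.getDeserializedValue();
            // Re-aggregate the affected view-model entry here, e.g. refresh the
            // analytical "EmployeeDepartmentView" entry for this employee.
        }
        return true; // true = the batch was handled and can be removed from the queue
    }

    @Override
    public void close() { }

    public static void wireUp(Cache cache) {
        // A persistent queue feeding the listener; it can also be made highly available.
        AsyncEventQueue queue = cache.createAsyncEventQueueFactory()
                .setPersistent(true)
                .setBatchSize(100)
                .create("employeeViewQueue", new AnalyticalViewUpdater());

        // Attach the queue to the transactional region; writes to /Employee are
        // delivered to the listener asynchronously.
        cache.createRegionFactory(RegionShortcut.PARTITION)
                .addAsyncEventQueueId(queue.getId())
                .create("Employee");
    }
}
```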
The size of your data is also impacted by the form in which it is stored, i.e. serialized vs. not serialized, and GemFire's proprietary serialization format (PDX) is quite optimal compared with Java Serialization. It all depends on how "portable" your data needs to be and whether you can keep your data in serialized form.
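Enabling PDX with the reflection-based auto-serializer and keeping values in serialized form looks roughly like this (the package pattern is just an example):

```java
import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;
import org.apache.geode.pdx.ReflectionBasedAutoSerializer;

public class PdxSetup {

    public static Cache createCache() {
        return new CacheFactory()
                // Serialize domain classes with PDX instead of Java serialization.
                .setPdxSerializer(new ReflectionBasedAutoSerializer("com.example.refdata.*"))
                // Keep values in serialized (PdxInstance) form on reads, so queries and
                // functions can access fields without deserializing whole objects.
                .setPdxReadSerialized(true)
                .create();
    }
}
```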
Also, you might consider how expensive it is to join the data on-the-fly. Meaning, if you are able to aggregate, transform and enrich data at runtime relatively cheaply (compute vs. memory/storage), then you might consider using GemFire's Function Execution service, bringing your logic to the data rather than the data to your logic (the fundamental basis of MapReduce).
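As a sketch (building on the collocated /Employee and /Department regions illustrated above, so the names are again hypothetical), a function that runs the equi-join on the data nodes instead of pulling the data out might look like this:

```java
import org.apache.geode.cache.CacheFactory;
import org.apache.geode.cache.Region;
import org.apache.geode.cache.execute.Function;
import org.apache.geode.cache.execute.FunctionContext;
import org.apache.geode.cache.execute.FunctionService;
import org.apache.geode.cache.execute.RegionFunctionContext;
import org.apache.geode.cache.query.Query;
import org.apache.geode.cache.query.SelectResults;

public class EmployeeDepartmentJoinFunction implements Function {

    @Override
    public boolean hasResult() { return true; }

    @Override
    public void execute(FunctionContext context) {
        RegionFunctionContext rfc = (RegionFunctionContext) context;
        try {
            // Equi-join over collocated partitioned regions, evaluated only against
            // the buckets hosted on this member.
            Query query = CacheFactory.getAnyInstance().getQueryService().newQuery(
                    "SELECT e, d FROM /Employee e, /Department d WHERE e.deptId = d.deptId");
            SelectResults results = (SelectResults) query.execute(rfc);
            context.getResultSender().lastResult(results.asList());
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public String getId() { return "employeeDepartmentJoin"; }

    @Override
    public boolean optimizeForWrite() { return false; }

    @Override
    public boolean isHA() { return false; }

    // Caller side (client or peer): ship the logic to the region's data nodes.
    public static Object callIt(Region<?, ?> employeeRegion) {
        return FunctionService.onRegion(employeeRegion)
                .execute(new EmployeeDepartmentJoinFunction())
                .getResult();
    }
}
```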
You should know, and I am sure you are aware, GemFire is a Key-Value store, therefore mapping a complex object graph into separate Regions is not a trivial problem. Dividing objects up by references (especially many-to-many) and knowing exactly when to eagerly vs. lazily load them is an overloaded problem, especially in a distributed, replicated data store such as GemFire where consistency and availability tradeoffs exist.
There are different APIs and frameworks to simplify persistence and querying with GemFire. One of the more notable approaches is Spring Data GemFire's extension of Spring Data Commons Repository abstraction.
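For example, with Spring Data GemFire a repository can be as small as this (entity, region and property names are illustrative, and the @Region annotation's package differs slightly between Spring Data GemFire versions):

```java
import org.springframework.data.annotation.Id;
import org.springframework.data.gemfire.mapping.annotation.Region;
import org.springframework.data.repository.CrudRepository;

// Maps this class to the /Employee region.
@Region("Employee")
class Employee {
    @Id
    Long id;
    Long deptId;
    String name;
}

interface EmployeeRepository extends CrudRepository<Employee, Long> {
    // Derived query: translated into OQL against the /Employee region.
    Iterable<Employee> findByDeptId(Long deptId);
}
```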
It also might be a matter of using the right data model for the job. If you have very complex data relationships, then perhaps creating analytical models using a graph database (such as Neo4j) would be a simpler option. Spring also provides great support for Neo4j, led by the Neo4j team.
Any design choice you make will undoubtedly involve a hybrid approach. Often the path is not clear, since it really "depends" (i.e. on the application and data access patterns, load, all that).
But one thing is for certain: make sure you have a good cursory knowledge and understanding of the underlying data store and its data management capabilities, particularly as it pertains to consistency and availability, beginning with this.
Note, there is also a GemFire Slack channel as well as an Apache DEV mailing list you can use to reach out to the GemFire experts and the community of (advanced) GemFire/Geode users if you have more specific problems as you proceed down this architectural design path.

Clustering: Cluster validation

I want to use some clustering method for a large social network dataset. The problem is how to evaluate the clustering method. Yes, I can use some external, internal and relative cluster validation methods. I used normalized mutual information (NMI) as an external validation method for cluster validation based on synthetic data. I produced some synthetic datasets by generating 5 clusters with an equal number of nodes, strongly connected links inside each cluster and weak links between clusters, to check the clustering method. Then I analysed the spectral clustering and modularity-based community detection methods on these synthetic datasets. I use the clustering with the best NMI for my real-world dataset and check the error (cost function) of my algorithm, and the result was good. Is my testing method for my cost function good, or should I also validate the clusters of my real-world dataset again?
Thanks.
Try more than one measure.
There are a dozen cluster validation measures, and it's hard to predict which one is most appropriate for a problem. The differences between them are not really understood yet, so it's best if you consult more than one.
Also note that if you don't use a normalized measure, the baseline may be really high. So the measures are mostly useful to say "result A is more similar to result B than result C", but should not be taken as an absolute measure of quality. They are a relative measure of similarity.
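If it helps, here is a rough, self-contained sketch of how NMI can be computed for two flat label assignments (normalizing by the square root of the two entropies; other normalizations are also common, and a degenerate single-cluster assignment makes the denominator zero):

```java
import java.util.HashMap;
import java.util.Map;

/** Rough sketch: normalized mutual information between two flat cluster assignments. */
public final class Nmi {

    public static double nmi(int[] labelsA, int[] labelsB) {
        int n = labelsA.length;
        Map<Integer, Integer> countA = new HashMap<>();
        Map<Integer, Integer> countB = new HashMap<>();
        Map<String, Integer> countAB = new HashMap<>();
        for (int i = 0; i < n; i++) {
            countA.merge(labelsA[i], 1, Integer::sum);
            countB.merge(labelsB[i], 1, Integer::sum);
            countAB.merge(labelsA[i] + ":" + labelsB[i], 1, Integer::sum);
        }
        // Mutual information I(A;B) = sum over (a,b) of p(a,b) * log( p(a,b) / (p(a) p(b)) )
        double mi = 0.0;
        for (Map.Entry<String, Integer> e : countAB.entrySet()) {
            String[] ab = e.getKey().split(":");
            double pab = e.getValue() / (double) n;
            double pa = countA.get(Integer.valueOf(ab[0])) / (double) n;
            double pb = countB.get(Integer.valueOf(ab[1])) / (double) n;
            mi += pab * Math.log(pab / (pa * pb));
        }
        // Normalize by sqrt(H(A) * H(B)); min, max or mean of the entropies are also used.
        return mi / Math.sqrt(entropy(countA, n) * entropy(countB, n));
    }

    private static double entropy(Map<Integer, Integer> counts, int n) {
        double h = 0.0;
        for (int c : counts.values()) {
            double p = c / (double) n;
            h -= p * Math.log(p);
        }
        return h;
    }
}
```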

Ideas for a distributed processing project?

I am looking for a project idea in distributed processing on Unix based systems. I wish to use only the C programming language. I have to finish the project in 4 months and it's a part of my course work. Can someone help me with an idea?
Cryptography problems
Distributed Ray Tracer
Chess AI (really, AI for any game)
Large Prime Number Search
Web crawler or other search mechanism
Generic Problem Solver (push out problem definition on the fly, followed by problem data).
Note on the last one:
An example would be if you have a gaming website with lots of board games that you were coming out with all the time. You don't want to have to install new clients on all your servers every time you write a new AI for a board game, so you have a program which you can send new AIs to and then after that you can just send the game data and the pushed AI will be used to solve the problem. This is best used for problems which can be broken into smaller chunks.
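As a tiny illustration of that "push the solver, then push the data" idea (in Java rather than the C the poster asked about, and with made-up class and method names), a worker could load a freshly pushed solver at runtime like this:

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Path;

// Minimal sketch: a worker node receives a new "solver" as a jar (already pushed over
// the network and saved to disk), loads it at runtime, and applies it to problem data.
// The class name com.example.GameSolver and its static solve(String) method are made up.
public class PushableSolverHost {

    public static String runPushedSolver(Path solverJar, String problemData) throws Exception {
        try (URLClassLoader loader = new URLClassLoader(
                new URL[] { solverJar.toUri().toURL() },
                PushableSolverHost.class.getClassLoader())) {
            Class<?> solverClass = loader.loadClass("com.example.GameSolver");
            Object result = solverClass.getMethod("solve", String.class).invoke(null, problemData);
            return String.valueOf(result);
        }
    }
}
```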
It is hard to answer without knowing anything about performance, the scale of the project, what you are trying to accomplish, etc. For example, is it one task or multiple tasks? Is the project just totally open?
4 months is pretty short, but maybe some kind of physics problem or math problem. Sorting or some kind of database work might be dull but beneficial.
Check out MapReduce for ideas! I was really motivated by that work, personally.
We used distributed processing here at work, but it's such a broad field... yeah.
Why not write a distributed compiler? You could then present an interface for people to compile things on the fly, and the work would be passed to your distributed compile network. Java is probably well-suited, and you'll get to do fun things, like be very mindful of security and so on.
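The on-the-fly compilation part is easy to prototype; for instance, a worker that has just received a source file could hand it to the JDK's built-in compiler (a sketch of one piece, not a full compile farm):

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

// Tiny sketch of the "compile on the fly" step: compile a source file that was
// received over the network and written to receivedSourcePath.
public class CompileWorker {

    public static boolean compile(String receivedSourcePath) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler(); // null on a JRE-only install
        return compiler != null && compiler.run(null, null, null, receivedSourcePath) == 0;
    }
}
```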
The BOINC project is always looking for help and is very interesting:
http://boinc.berkeley.edu/
If you want to leave your mark and change the way we search the web, look into B-Trees.
B-Trees and their offspring/variants are the workhorse of the internet. Google uses them extensively to index the web. Database indexes are B-Tree offspring/variants. Every LAMP system uses a database and indexes. They are also used extensively in distributed VLDBs (Very Large DataBases).
Perhaps you can improve existing distributed databases (Cassandra and HBase). These are lofty goals, but for me, this would leave a lasting mark on the way Web data is processed, indexed and stored.
Write a distributed, fault-tolerant, redundant network B+Tree or B*Tree.
Read Drozdek's book Data Structures and Algorithms in C++; it's a good survey of B-Trees.
Read about skip trees
http://www.cs.huji.ac.il/~ittaia/papers/AAY-OPODIS05.pdf
Read about Efficient B-tree Based Indexing for Cloud Data Processing
http://www.comp.nus.edu.sg/~ooibc/vldb10-cgindex.pdf
Google search "Network B+Tree"
https://www.google.com/search?rlz=1C1CHKZ_enUS431US431&sourceid=chrome&ie=UTF-8&q=Network+B%2BTree

Commercial uses for grid computing?

I keep hearing from associates about grid computing which, from what I can gather, is highly distributed stuff along the lines of SETI@home.
Is anyone working on these sort of systems for business use? My interest is in figuring out if there's a commercial reason for starting software development in this field.
- Rendering farms such as Pixar's
- Model evaluation, e.g. weather, financials, military
- Architectural engineering, e.g. earthquakes
To list a few.
Grid computing is really only needed if you have a lot of work that needs to be done, like folding proteins; otherwise a simple server farm will likely be plenty.
Obviously Google are major users of grid computing; all of their search service relies on it, and many of their other services do too.
Engines such as Bigtable are based on using lots of nodes for storage and computation. These are commercially very useful because they're a good alternative to a small number of big servers, providing better redundancy and cost-effective scaling.
The downside is that the software is fiendishly difficult to write, but Google seem to manage that one ok :)
So anything which requires big storage and/or lots of computation.
I used to work for these guys. Grid computing is used all over. Anyone who makes computer chips uses grids to test designs before getting physical silicon cut. Financial websites use grids to calculate whether you qualify for that loan. These days grids are starting to replace big iron in a lot of places, as they tend to be cheaper to maintain over the long term.