Ideal number of cores for multithreaded solving - optaplanner

I'm looking for a way to speed up the solver by adding more cores.
What is the ideal number of cores to dedicate when configuring multithreaded solving?
According to this thread: OptaPlanner, Score calculation speed will be too low
the current best practice is to not go above 4 move threads, as the solver likely won't scale past that anyway.
Are we really limited to 4 cores? If the ideal/optimal number of cores is ~4, is there any other option I can explore to scale the solver, besides partitioned search?

Multi-threaded incremental solving does not scale linearly with the number of CPU cores; beyond a certain count (8 or 16), performance falls below what 4 CPU cores can deliver.
Watch the score calculation speed and pick the number of CPU cores that gives you the best results.
There are also other multi-threading options for OptaPlanner.
Last but not least, benchmarking and tweaking the configuration can bring better results than using more cores (of course, these two are not mutually exclusive).
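For reference, the move thread count is set in the solver configuration. A minimal sketch (the moveThreadCount element is from the OptaPlanner docs; the value 4 here is just an example to start benchmarking from):

    <solver>
      <!-- Number of move threads; AUTO and NONE are also valid values. -->
      <moveThreadCount>4</moveThreadCount>
      <!-- ... solution/entity classes, score director, phases ... -->
    </solver>

Benchmark a few values (2, 4, 8) and keep whichever maximizes score calculation speed on your hardware.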

Related

How can computational requirements be compared?

Calculating the solution to an optimization problem takes a 2 GHz CPU one hour. During this process there are no background processes, no RAM is being used and the CPU is at 100% capacity.
Based on this information, can it be derived that a 1 GHz CPU will take two hours to solve the same problem?
A quick search on IPC, frequency, and chip architecture will show you this topic has been broached many times. There are many things that can determine the execution speed of a program (without even going into threading at all); the main ones that come to mind:
Instruction set - If one chip has an instruction for multiplication, then a*b is atomic. If not, you need many atomic instructions to perform such an action - a big difference in speed, which can make even a higher-frequency chip the slower one.
Cycles per second - this is the frequency of the chip.
Instructions per cycle (IPC) - what you are really interested in is IPC * frequency, not just frequency: how many atomic actions you can perform per second. After accounting for the number of atomic actions needed (see point 1), on a single-threaded application this might act as you expect (2x this => 2x faster program), though there are no guarantees.
And there are a ton of other nuanced technologies that can affect this, like branch prediction, which hit the news big time recently. For a complete understanding, a book or course might be a better resource.
So, in general, no. If you are comparing two single-core, same-architecture chips (unlikely), then maybe yes.
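To make that concrete with made-up numbers: effective single-thread throughput is roughly IPC * frequency, so a lower-clocked chip can still win:

    chip A: 2 GHz x 1 instruction/cycle = 2e9 instructions/s
    chip B: 1 GHz x 4 instructions/cycle = 4e9 instructions/s

Despite half the clock rate, chip B retires the same single-threaded instruction stream about twice as fast (ignoring memory effects, instruction mix, and so on).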

Optaplanner - large datasets with millions of rows

There are a couple of threads discussing the scalability of OptaPlanner, and I am wondering what the recommended approach is for dealing with very large datasets, i.e., millions of rows?
As the blog post below discusses, I am already using metaheuristics (Simulated Annealing + Tabu Search). The search space of the cloud balancing problem is c^p, but the feasible space is unknown/NP-complete.
http://www.optaplanner.org/blog/2014/03/27/IsTheSearchSpaceOfAnOptimizationProblemReallyThatBig.html
The problem I am trying to solve is similar to cloud balancing. The main difference is in the input data: besides a list of computers and a list of processes, there is also a big two-dimensional 'score list/table' which holds the score for each possible combination and needs to be loaded into memory.
In other words, besides the constraints between computers and processes that the planning needs to satisfy, different valid combinations yield different scores, and the higher the score the better.
It's a simple problem, but when it comes to hundreds of computers, 100k+ processes, and a score table with a million+ combinations, it needs a lot of memory. Even though I could allocate more memory to increase the heap size, the planning could become very slow and struggle, as the steps are sorted with custom planning variable/entity comparator classes.
A straightforward solution is to divide the dataset into smaller subsets, run each of them individually, and then combine the results, so that multiple machines can run at the same time, each on multiple threads. The biggest drawback of this approach is that the result produced is far from optimal.
Are there any better solutions?
The MachineReassignment example also has a big "score combination" matrix. OptaPlanner doesn't care about that - those are just problem facts, and the DRL quickly matches the combination(s) picked for an assignment. Solver.solve() itself causes no big memory consumption or performance impact.
However, loading the problem in your code (before calling Solver.solve()) does cause huge memory consumption: understand that if n = 20k, then n² = 400m, and an int takes up 4 bytes, so for 20,000 elements that matrix is 1.6 GB in its most efficient uncompressed form, int[][] (both in Java and C++!). So for 20k reserve 2 GB RAM, for 40k reserve 8 GB RAM, for 80k reserve 32 GB RAM. That scales badly.
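A quick back-of-the-envelope check of those numbers in plain Java (sizes from the paragraph above; real usage adds JVM object headers and everything else on the heap):

    // Estimates the raw size of an n x n int[][] score matrix.
    public class MatrixMemory {
        public static void main(String[] args) {
            for (int n : new int[] {20_000, 40_000, 80_000}) {
                long bytes = (long) n * n * Integer.BYTES; // 4 bytes per int
                System.out.printf("n = %,d -> %.1f GB%n", n, bytes / 1e9);
            }
        }
    }

which prints roughly 1.6 GB, 6.4 GB, and 25.6 GB - hence the 2/8/32 GB reservations above.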
As for dealing with these big problems, I use combinations of techniques such as Nearby Selection (see my blog article on that), Partitioned Search (what you described; it will be supported out of the box in 7, but I've implemented it for customers in a CustomPhase), Limited Selection Construction Heuristics (need to research that further; the plumbing is there, but it's usually overkill), ... Partitioned Search does indeed exclude optimal solutions, but above 10k planning entities the quality-vs-time trade-off clearly favors Partitioned Search, given a reasonable solving time (minutes/hours/days instead of millennia). The trick is to keep each partition big enough, above 1k entities (hence the use of Nearby Selection). Score calculation speed also matters a lot, of course.
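For reference, the out-of-the-box Partitioned Search configuration (OptaPlanner 7+) looks roughly like this; MyPartitioner is a hypothetical class implementing the SolutionPartitioner interface:

    <solver>
      <partitionedSearch>
        <!-- Splits the working solution into partitions solved in parallel. -->
        <solutionPartitionerClass>org.example.MyPartitioner</solutionPartitionerClass>
        <!-- Each partition runs these nested phases. -->
        <constructionHeuristic/>
        <localSearch/>
      </partitionedSearch>
    </solver>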

An example: Am I understanding GPU advantage correctly?

Just reading a bit about what the advantage of a GPU is, and I want to verify I understand it on a practical level. Let's say I have 10,000 arrays, each containing a billion simple equations to run. On a CPU it would need to go through every single equation, one at a time, but with a GPU I could run all 10,000 arrays as 10,000 different threads, all at the same time, so it would finish a ton faster... is this example spot on, or have I misunderstood something?
I wouldn't call it spot on, but I think you're headed in the right direction. Mainly, a GPU is optimized for graphics-related calculations. This does not, however, mean that's all it is capable of.
Without knowing how much detail you want me to go into here, I can say at the very least the concept of running things in parallel is relevant. The GPU is very good at performing many tasks simultaneously in one go (known as running in parallel). CPUs can do this too, but the GPU is specifically optimized to handle much larger numbers of specific calculations with preset data.
For example, to render every pixel on your screen requires a calculation, and the GPU will attempt to do as many of these calculations as it can all at the same time. The more powerful the GPU, the more of these it can handle at once and the faster its clock speed. The end result is a higher-end GPU can run your OS and games in 4k resolution, whereas other cards (or integrated graphics) might only be able to handle 1080p or less.
There's a lot more to this as well, but I figured you weren't looking for the insanely technical explanation.
The bottom line is this: for running a single task on one piece of data, the CPU will normally be faster, since a single CPU core is generally much faster than a single GPU core. However, GPUs typically have many more cores, so for running a single task on many pieces of data (where you have to run it once for each), the GPU will usually be faster. But these are data-driven situations, and each situation should be assessed individually to determine which to use and how to use it.
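The pattern being described is plain data parallelism, and it can be sketched on a CPU too. A rough Java illustration (parallelStream standing in for what a GPU does across thousands of hardware threads; sizes shrunk from the question's numbers so it actually finishes):

    import java.util.stream.IntStream;

    public class DataParallelSketch {
        public static void main(String[] args) {
            int arrays = 10_000;   // the question's 10,000 arrays
            int perArray = 1_000;  // shrunk from "a billion" equations
            double[][] data = new double[arrays][perArray];

            // Sequential: one array at a time, like a single CPU core.
            for (double[] row : data) {
                applyEquation(row);
            }

            // Data-parallel: the same independent work spread across cores -
            // the pattern a GPU takes to an extreme.
            IntStream.range(0, arrays).parallel()
                     .forEach(i -> applyEquation(data[i]));
        }

        static void applyEquation(double[] row) {
            for (int j = 0; j < row.length; j++) {
                row[j] = row[j] * 2.0 + 1.0; // some "simple equation"
            }
        }
    }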

MATLAB parallel computing setup

I have a quad-core computer and I use the Parallel Computing Toolbox.
I set different values for the number of workers in the parallel computing settings, for example 2, 4, 8, ...
However, no matter what I set, the average CPU usage by MATLAB is exactly 25% of the total, and none of the cores runs at 100% (all are around 10%-30%). I am using MATLAB to run an optimization problem, so I really want my quad-core computer to use all its power for the computing. Please help.
Setting a number of workers (up to 4 on a quad-core) is not enough. You also need to use a construct like parfor to signal to MATLAB which part of the calculation should be distributed among the workers.
I am curious about what kind of optimization you're running. Normally, optimization problems are very difficult to parallelize, since the result of every iteration depends on the previous one. However, if you want to e.g. try and fit multiple models to the data, or if you have to fit multiple data sets, then you can easily run these in parallel as opposed to sequentially.
Note that having many cores may not be sufficient in terms of resources - if performing the optimization on one worker uses k GB of RAM, performing it on n workers requires at least n*k GB of RAM.
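The pattern described here (independent fits running side by side instead of sequentially) is not MATLAB-specific. A minimal sketch of the same idea in Java, where fitModel is a hypothetical stand-in for one optimization run:

    import java.util.List;
    import java.util.concurrent.*;

    public class ParallelFits {
        public static void main(String[] args) throws Exception {
            List<double[]> dataSets = List.of(
                    new double[1_000], new double[1_000],
                    new double[1_000], new double[1_000]);
            // One worker per core, like a MATLAB worker pool on a quad-core.
            ExecutorService pool = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors());
            List<Future<Double>> results = pool.invokeAll(
                    dataSets.stream()
                            .map(d -> (Callable<Double>) () -> fitModel(d))
                            .toList());
            for (Future<Double> r : results) {
                System.out.println(r.get());
            }
            pool.shutdown();
        }

        // Hypothetical stand-in for one independent optimization/fit.
        static double fitModel(double[] data) {
            double s = 0;
            for (double v : data) s += v * v;
            return s;
        }
    }

As noted above, each worker needs its own copy of the data, so memory use scales with the worker count.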

FLOPS assigned to sqrt in GPU to measure performance and global efficiency

In a GPU implementation we need to estimate its performance in terms of GFLOPS. The code is very basic, but my problem is how many FLOPs I should assign to operations like "sqrt" or "mad": 1 or more?
Besides, I obtain 50 GFLOPS for my code if I count 1 FLOP for these operations, while the theoretical maximum for this GPU is 500 GFLOPS. Expressed as a percentage, I get 10%. In terms of speedup, I get 100 times. So I think it is great, but 10% seems a rather low yield; what do you think?
Thanks
The right answer is probably "it depends".
For pure comparative performance between code run on different platforms, I usually count transcendentals, sqrt, and mads as one operation each. In that sort of situation, the key performance metric is how long the code takes to run. It is almost impossible to do the comparison any other way - how would you go about comparing the "FLOP" count of a hardware instruction for a transcendental which takes 25 cycles to retire, versus a math-library-generated stanza of fmad instructions which also takes 25 cycles to complete? Counting instructions or FLOPs becomes meaningless in such a case: both performed the desired operation in the same number of clock cycles, despite a different apparent FLOP count.
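To see how much the counting convention alone moves the headline figure, a toy calculation with made-up numbers:

    kernel executes 1e9 mad instructions in 0.1 s
    mad counted as 1 FLOP:              1e9 / 0.1 = 10 GFLOPS
    mad counted as 2 FLOPs (mul + add): 2e9 / 0.1 = 20 GFLOPS

Same kernel, same wall-clock time, yet a 2x difference in reported GFLOPS - which is why elapsed time (or speedup) is the more portable metric.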
On the other hand, for profiling and performance tuning of a piece of code on given hardware, the FLOP count might be a useful metric to have. In GPUs, it is normal to look at FLOP or IOP count and memory bandwidth utilization to determine where the performance bottleneck of a given code lies. Having those numbers might point you in the direction of useful optimizations.