I would like to run ALPSO and NSGA2 from OpenMDAO using the PyOptSparse driver in parallel. The catch is that I don't want to run the model itself in parallel (which I have done frequently in OpenMDAO), I just want to run the optimization computations in parallel (e.g. distribute the calculations for swarm members of ALPSO). I have been looking through the documentation and source for all of the above mentioned codes, but I have not found a way to do this. Could someone point me in the right direction?
Note: I am currently using OpenMDAO 1.7.3, but I am open to answers involving later versions
I don't believe that those optimizers support parallel execution. It would most likely require modifications to the code in ALPSO/NSGA2, pyoptsparse, and the pyoptsparse driver to support this.
In OpenMDAO 2.2 (the latest version), we do have a simple GA driver that can run the evaluation of points in the population in parallel, so maybe that is an option. (it is pretty simple though, and only supports single objective.)
Related
I am studying about tensorflow-federated API to make federated learning with real multiple machines.
But I found the answer on this site that not support to make real multiple federated learning using multiple learning.
Are there no way to make federated learning with real multiple machines?
Even I make a network structure for federated learning with 2 clients PC and 1 server PC, Is it impossible to consist of that system using tensorflow federated API?
Or even if I apply the code, can't I make the system I want?
If you can modify the code to configure it, can you give me a tip?If not, when will there be an example to configure on a real computer?
In case you are still looking for something: If you're not bound to TensorFlow, you could have a look at PySyft, which is using PyTorch. Here is a practical example of a FL system built with one server and two Raspberry Pis as clients.
TFF is really about expressing the federated computations you wish to execute. In terms of physical deployments, TFF includes two distinct runtimes: one "reference executor" which simply interprets the syntactic artifact that TFF generates, serially, all in Python and without any fancy constructs or optimizations; another still under development, but demonstrated in the tutorials, which uses asyncio and hierarchies of executors to allow for flexible executor architectures. Both of these are really about simulation and FL research, and not about deploying to devices.
In principle, this may address your question (in particular, see tff.framework.RemoteExecutor). But I assume that you are asking more about deployment to "real" FL systems, e.g. data coming from sources that you don't control. This is really out of scope for TFF. From the FAQ:
Although we designed TFF with deployment to real devices in mind, at this stage we do not currently provide any tools for this purpose. The current release is intended for experimentation uses, such as expressing novel federated algorithms, or trying out federated learning with your own datasets, using the included simulation runtime.
We anticipate that over time the open source ecosystem around TFF will evolve to include runtimes targeting physical deployment platforms.
I am running some code on an embedded system with an extremely limited memory, and even more limited processing power.
I am using TensorFlow for this implementation.
I have never had to work in this kind of environment before.
What are some steps I can take to ensure I am being efficient as possible in my implementations/optimization?
Some ideas -
- Pruning code -
https://jacobgil.github.io/deeplearning/pruning-deep-learning
- Ensure loops are as minimal as possible (in the big O sense)
- ...
Thanks a lot.
I suggest using TensorFlow Lite.
It will enable you to compress and quantize your model to make it smaller and faster to run.
It also supports leveraging GPU and/or hardware accelerator if any of this is available to you.
https://www.tensorflow.org/lite
If you are working with TensorFlow 1.13 (the latest stable version before the 2.0 prototype), there is a pruning function from tf.contrib submodule. It contains a sparcity parameter that you can tune to determine the size of the network.
I suggest you to take a look at all the tf.contrib.model_pruning submodule here. It's plenty of functions you might need for your specific task.
After watching this question I decided to give writing a new op for TensorFlow a try.
Since the requirements of C++, Python and likely a *nix system are not my primary tools, I would like to avoid being at a point where I have to back out and make a system/tool changes just because I did not ask.
Is there a standard or preferred system and or tools used by those working or TensorFlow?
I know that recommendation questions are not allowed here; I am not asking for a personal recommendation, I am asking for the standard used by or what the TensorFlow group finds that works.
Really, anything where you can get Bazel and the required libraries up and running. But since you're starting from scratch: Ubuntu's a very safe bet and (I haven't measured this, but this is a solid estimate) probably gets the most testing and development by the tf team. But there are many options that all work -- you can develop inside a virtualenv on many environments. Things like GPU support get a little more platform-specific, and that's where Ubuntu starts to become the easiest choice if you don't have any other constraints.
The key requirements are outlined in installing Tensorflow from sources.
I want to use cachegrind to do some performance profiling on the OpenJDK JVM. (BTW, if this is Not A Good Idea, I would like to understand why.)
The problem is it keeps tripping up assertions in the JVM. So what can I do to get a run using cachegrind. Or else, please tell me why this wouldn't work. And if you can suggest an alternative to cachegrind. (Note that I have looked at and used perf. It's just that I was curious how different a tool like cachegrind/callgrind would be in terms of results.)
Valgrind and the tools based on it are typically not very good for performance profiling. Valgrind is a kind of a virtual machine that does not execute the original code directly but rather translates it. Programs often run much slower under Valgrind, and this can affect the application execution scenario.
Also Valgrind does not work well with the dynamic and self-modifying code. With respect to JVM's JIT compilation this can result in random assertion failures and JVM crashes that you apparently observe. Valgrind is known to be much more stable when Java is run with JIT disabled, i.e. -Xint. But in -Xint mode profiling does not make much sense since it does not reflect the real application performance.
perf is preferable way to do this kind of measurements. perf relies on hardware counters, it does not modify the executed code. For deeper performance analyis there is Intel VTune Amplifier. It is not free though, but a trial version is available.
I am somewhat familiar with the CUDA visual profiler and the occupancy spreadsheet, although I am probably not leveraging them as well as I could. Profiling & optimizing CUDA code is not like profiling & optimizing code that runs on a CPU. So I am hoping to learn from your experiences about how to get the most out of my code.
There was a post recently looking for the fastest possible code to identify self numbers, and I provided a CUDA implementation. I'm not satisfied that this code is as fast as it can be, but I'm at a loss as to figure out both what the right questions are and what tool I can get the answers from.
How do you identify ways to make your CUDA kernels perform faster?
If you're developing on Linux then the CUDA Visual Profiler gives you a whole load of information, knowing what to do with it can be a little tricky. On Windows you can also use the CUDA Visual Profiler, or (on Vista/7/2008) you can use Nexus which integrates nicely with Visual Studio and gives you combined host and GPU profile information.
Once you've got the data, you need to know how to interpret it. The Advanced CUDA C presentation from GTC has some useful tips. The main things to look out for are:
Optimal memory accesses: you need to know what you expect your code to do and then look for exceptions. So if you are always loading floats, and each thread loads a different float from an array, then you would expect to see only 64-byte loads (on current h/w). Any other loads are inefficient. The profiling information will probably improve in future h/w.
Minimise serialization: the "warp serialize" counter indicates that you have shared memory bank conflicts or constant serialization, the presentation goes into more detail and what to do about this as does the SDK (e.g. the reduction sample)
Overlap I/O and compute: this is where Nexus really shines (you can get the same info manually using cudaEvents), if you have a large amount of data transfer you want to overlap the compute and the I/O
Execution configuration: the occupancy calculator can help with this, but simple methods like commenting the compute to measure expected vs. measured bandwidth is really useful (and vice versa for compute throughput)
This is just a start, check out the GTC presentation and the other webinars on the NVIDIA website.
If you are using Windows... Check Nexus:
http://developer.nvidia.com/object/nexus.html
The CUDA profiler is rather crude and doesn't provide a lot of useful information. The only way to seriously micro-optimize your code (assuming you have already chosen the best possible algorithm) is to have a deep understanding of the GPU architecture, particularly with regard to using shared memory, external memory access patterns, register usage, thread occupancy, warps, etc.
Maybe you could post your kernel code here and get some feedback ?
The nVidia CUDA developer forum forum is also a good place to go for help with this kind of problem.
I hung back because I'm no CUDA expert, and the other answers are pretty good IF the code is already pretty near optimal. In my experience, that's a big IF, and there's no harm in verifying it.
To verify it, you need to find out if the code is for sure not doing anything it doesn't really have to do. Here are ways I can see to verify that:
Run the same code on the vanilla processor, and either take stackshots of it, or use a profiler such as Oprofile or RotateRight/Zoom that can give you equivalent information.
Running it on a CUDA processor, and doing the same thing, if possible.
What you're looking for are lines of code that have high occupancy on the call stack, as shown by the fraction of stack samples containing them. Those are your "bottlenecks". It does not take a very large number of samples to locate them.