Alternating Generations GA? To keep evaluators running - optimization

What I want?
I Want to be able to generate population for generation N+2, from ranked population of generation N; while the generation N+1 is still in evaluation. Once the generation N+1 finishes evaluation it will be used to generate generation N+3 (and at that point generation N+2 evaluation will be already running).
Why do I want that?
To better utilize computing resources. My evaluation threads can then run continuously with minimum synchronization and waiting. For a fast evaluation and small generations a single generation can be finished in a very quick time and evaluation thread would then need to wait for next generation; instead it can already work on something else.
Alternative thought
Alternative is to have 2 independent GAs run and generate generations while taking rounds in access to evaluation threads.
This is utilizing resources better, however it's obviously not allways what we want to do (digging two holes instead digging a single hole doesn't allways mean I can find oil faster)

Related

Recommended way of measuring execution time in Tensorflow Federated

I would like to know whether there is a recommended way of measuring execution time in Tensorflow Federated. To be more specific, if one would like to extract the execution time for each client in a certain round, e.g., for each client involved in a FedAvg round, saving the time stamp before the local training starts and the time stamp just before sending back the updates, what is the best (or just correct) strategy to do this? Furthermore, since the clients' code run in parallel, are such a time stamps untruthful (especially considering the hypothesis that different clients may be using differently sized models for local training)?
To be very practical, using tf.timestamp() at the beginning and at the end of #tf.function client_update(model, dataset, server_message, client_optimizer) -- this is probably a simplified signature -- and then subtracting such time stamps is appropriate?
I have the feeling that this is not the right way to do this given that clients run in parallel on the same machine.
Thanks to anyone can help me on that.
There are multiple potential places to measure execution time, first might be defining very specifically what is the intended measurement.
Measuring the training time of each client as proposed is a great way to get a sense of the variability among clients. This could help identify whether rounds frequently have stragglers. Using tf.timestamp() at the beginning and end of the client_update function seems reasonable. The question correctly notes that this happens in parallel, summing all of these times would be akin to CPU time.
Measuring the time it takes to complete all client training in a round would generally be the maximum of the values above. This might not be true when simulating FL in TFF, as TFF maybe decided to run some number of clients sequentially due to system resources constraints. In practice all of these clients would run in parallel.
Measuring the time it takes to complete a full round (the maximum time it takes to run a client, plus the time it takes for the server to update) could be done by moving the tf.timestamp calls to the outer training loop. This would be wrapping the call to trainer.next() in the snippet on https://www.tensorflow.org/federated. This would be most similar to elapsed real time (wall clock time).

How can computational requirements be compared?

Calculating the solution to an optimization problem takes a 2 GHz CPU one hour. During this process there are no background processes, no RAM is being used and the CPU is at 100% capacity.
Based on this information, can it be derived that a 1 GHz CPU will take two hours to solve the same problem?
A quick search of IPC, frequence, and chip architecture will show you this topic has been breached many times. There are many things that can determine the execution speed of a program (without even going into threading at all) the main ones that pop to mind:
Instruction set - If one chip has an instruction for multiplication, than a*b is atomic. If not, you will need a lot of atomic instructions to perform such an action - big difference in speed, which can prove to make even higher frequency chips slower.
Cycles per second - this is the frequency of the chip.
Instructions per cycle (IPC) - what you are really interested is IPC*frequency, not just frequency. How many atomic actions can you can perform in a second. After the amount of atomic actions (see 1), on a single threaded application this might act as you expect (x2 this => x2 faster program), though no guarantees.
and there are a ton of other nuance technologies that can affect this, like branch prediction which hit the news big time recently. For a complete understanding a book/course might be a better resource.
So, in general, no. If you are comparing two single core, same architecture chips (unlikely), then maybe yes.

Is faster code also more power efficient?

Assume I have a CPU running at a constant rate, pulling an equal amount of energy per instruction. I also have two functionally identical programs, which result in the same output, except one has been optimized to execute only 100 instructions, while the other program executes 200 instructions. Is the 100 instruction program necessarily faster than the 200 instruction program? Does a program with fewer instructions draw less power than a program with more instructions?
Things are much more complex than this.
For example execution speed is in many cases dominated by memory. As a practical example some code could process the pixels of an image first in rows and then in columns... a different code instead could be more complex but processing rows and columns at the same time.
The second version could execute more instructions because of more complex housekeeping of the data but I wouldn't be surprised if it was faster because of how memory is organized: reading an image one column at a time is going to "trash the cache" and it's very possible that despite being simple the code working that way could be a LOT slower than the more complex one doing the processing in a memory-friendly way. The simpler code may end up being "stalled" a lot waiting for the cache lines to be filled or flushed to the external memory.
This is just an example, but in reality what happens inside a CPU when code is executed is for many powerful processors today a very very complex process: instructions are exploded in micro-instructions, registers are renamed, there is speculative execution of parts of code depending on what branch predictors guess even before the program counter really reaches a certain instruction and so on. Today the only way to know for sure if something is faster or slower is in many cases just trying with real data and measure.
Is the 100 instruction program necessarily faster than the 200 instruction program?
No. Firstly, on some architectures (such as x86) different instructions can take a different number of cycles. Secondly, there are effects — such cache misses, page faults and branch mispreditictions — that complicate the picture further.
From this it follows that the answer to your headline question is "not necessarily".
Further reading.
I found a paper from 2017 comparing the energy usage, speed, and memory consumption of various programming languages. There is an obvious positive correlation between faster languages also using less energy.

dynamic optimization of running programs

I was told that running programs generate probability data used to optimize repeated instructions.
For example, if an "if-then-else" control structure has been evaluated TRUE 8/10 times, then the next time the "if-then-else" statement is being evaluated, there is an 80% chance the condition will be TRUE. This statistic is used to prompt hardware to load the appropriate data into the registers assuming the outcome will be TRUE. The intent is to speed up the process. If the statement does evaluate to TRUE, data is already loaded to the appropriate registers. If the statement evaluates to FALSE, then the other data is loaded in and simply written over what was decided "more likely".
I have a hard time understanding how the probability calculations don't out-weigh the performance cost of decisions it's trying to improve. Is this something that really happens? Does it happen at a hardware level? Is there a name for this?
I can seem to find any information about the topic.
This is done. It's called branch prediction. The cost is non-trivial, but it's handled by dedicated hardware, so the cost is almost entirely in terms of extra circuitry -- it doesn't affect the time taken to execute the code.
That means the real cost would be one of lost opportunity -- i.e., if there was some other way of designing a CPU that used that amount of circuitry for some other purpose and gained more from doing so. My immediate guess is that the answer is usually no -- branch prediction is generally quite effective in terms of return on investment.

basic operations cpu time cost

I was wondering, how to optimize loops for systems with very limited resources. Let's say, if we have a basic for loop, like ( written in javascript):
for(var i = someArr.length - 1; i > -1; i--)
{
someArr[i]
}
I honestly don't know, isn't != cheaper than > ?
I would be grateful for any resources covering computing cost in context of basic operators, like the aforementioned, >>, ~, !, and so on.
Performance on a modern CPU is far from trivial. Here are a couple of things that complicate it:
Computers are fast. Your CPU can execute upwards of 6 billion instructions per second. So even the slowest instruction can be executed millions of times per second, meaning that it only really matters if you use it very often
Modern CPU's have hundreds of instructions in flight simultaneously. They are pipelined, meaning that while one instruction is being read, another is reading from registers, a third one is executing, and a fourth one is writing back to a register. Modern CPU's have 15-20 of such stages. On top of this, they can execute 3-4 instructions at the same time on each of these stages. And they can reorder these instructions. If the multiplication unit is being used by another instruction, perhaps we can find an addition instruction to execute instead, for example. So even if you have some slow instructions mixed in, their cost can be hidden very well most of the time, by executing other instructions while waiting for the slow one to finish.
Memory is hundreds of times slower than the CPU. The instructions being executed don't really matter if their cost is dwarfed by retrieval of data from memory. And even this isn't reliable, because the CPU has its own onboard caches to attempt to hide this cost.
So the short answer is "don't try to outsmart the compiler". If you are able to choose between two equivalent expressions, the compiler is probably able to do the same, and will pick the most efficient one. The cost of an instruction varies, depending on all the above factors. Which other instructions are executing, what data is in the CPU's cache, which precise CPU model is the code running on, and so on. Code that is super efficient in one case may be very inefficient in other cases. The compiler will try to pick the most generally efficient instructions, and schedule them as well as possible. Unless you know more than the compiler about this, you're unlikely to be able to do a better job of it.
Don't try such microoptimizations unless you really know what you're doing. As the above shows, low-level performance is a ridiculously complex subject, and it's very easy to write "optimizations" that result in far slower code. Or which just sacrifice readability on something that makes no difference at all.
Further, most of your code simply doesn't have a measurable impact on performance.
People generally love quoting (or misquoting) Knuth on this subject:
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil
People often interpret this as "don't bother trying to optimize your code". If you actually read the full quote, some much more interesting consequences should become clear:
Most of the time, we should forget about microoptimizations. Most code is executed so rarely that optimizations won't matter. Keeping in mind the number of instructions a CPU can execute per second, it is obvious that a block of code has to be executed very often for optimizations in it to have any effect. So about 97% of the time, your optimizations will be a waste of time. But he also says that sometimes (3% of the time), your optimizations will matter. And obviously, looking for those 3% is a bit like looking for a needle in a haystack. If you just decide to "optimize your code" in general, you're going to waste your time on the first 97%. Instead, you need to first locate the 3% that actually need optimizing. In other words, run your code through a profiler, and let it tell you which code takes up the most CPU time. Then you know where to optimize. And then your optimizations are no longer premature.
It is extraordinarily unlikely that such micro-optimizations will make a noticeable difference to your code in any but the most extreme (real time embedded systems?) circumstances. Your time would probably be better served worrying about making your code readable and maintainable.
When in doubt, always begin by asking Donald Knuth:
http://shreevatsa.wordpress.com/2008/05/16/premature-optimization-is-the-root-of-all-evil/
Or, for a slightly less high-brow take on micro-optimization:
http://www.codinghorror.com/blog/archives/000185.html
Most comparations have same coast, because the processor simply compares it in all aspects, then after that it takes a decision based on flags generated by this previous comparation so the comparation signal doesn't matter at all. But some architectures try to accelerate this process based on the value you are comparing with, like comparations against 0.
As far as I know, bitwise operations are the cheapest operations, slightly faster than addition and subtraction. Multiplication and division operations are a little more expensive, and comparation is the highest coast operation.
That's like asking for a fish, when I would rather teach you to fish.
There are simple ways to see for yourself how long things take. My favorite is to just copy the code 10 times, and then wrap it in a loop of 10^8 times. If I run it and look at my watch, the number of seconds it takes translates to nanoseconds.
Saying don't do premature optimization is a "don't be". If you want a "do be" you could try a proactive performance tuning technique like this.
BTW my favorite way of coding your loop is:
for (i = N; --i >= 0;){...}
Premature Optimization can be dangerous the best approach would be to write your application without worrying about that and then find the slow points and optimize those. If you are really worried about this use a lower level language. An interpreted language like javascript will cost you some processing power when compared to a lower level language like C.
In this particular case, > vs = is probably not a perfomance issue. HOWEVER > is generally safer choice because prevents cases where you modified the code from running off into the weeds and stuck in a infinite loop.