I would like to know whether there is a recommended way of measuring execution time in Tensorflow Federated. To be more specific, if one would like to extract the execution time for each client in a certain round, e.g., for each client involved in a FedAvg round, saving the time stamp before the local training starts and the time stamp just before sending back the updates, what is the best (or just correct) strategy to do this? Furthermore, since the clients' code run in parallel, are such a time stamps untruthful (especially considering the hypothesis that different clients may be using differently sized models for local training)?
To be very practical, using tf.timestamp() at the beginning and at the end of #tf.function client_update(model, dataset, server_message, client_optimizer) -- this is probably a simplified signature -- and then subtracting such time stamps is appropriate?
I have the feeling that this is not the right way to do this given that clients run in parallel on the same machine.
Thanks to anyone can help me on that.
There are multiple potential places to measure execution time, first might be defining very specifically what is the intended measurement.
Measuring the training time of each client as proposed is a great way to get a sense of the variability among clients. This could help identify whether rounds frequently have stragglers. Using tf.timestamp() at the beginning and end of the client_update function seems reasonable. The question correctly notes that this happens in parallel, summing all of these times would be akin to CPU time.
Measuring the time it takes to complete all client training in a round would generally be the maximum of the values above. This might not be true when simulating FL in TFF, as TFF maybe decided to run some number of clients sequentially due to system resources constraints. In practice all of these clients would run in parallel.
Measuring the time it takes to complete a full round (the maximum time it takes to run a client, plus the time it takes for the server to update) could be done by moving the tf.timestamp calls to the outer training loop. This would be wrapping the call to trainer.next() in the snippet on https://www.tensorflow.org/federated. This would be most similar to elapsed real time (wall clock time).
Related
I want to have a function like if the calculation time get too long, we abort routing calculation and submit the best solution at the point of time. Is there such a function in optaplanner ?
For example in a GUI application, you would start solving on a background (worker) thread. In this scenario you can stop solver asynchronously by calling solver.terminateEarly() from another thread, typically the UI thread when you click a stop button.
If this is not what you're looking for, read on.
Provided that by calculation you actually mean the time spent solving, you have several options how to stop solver. Besides asynchronous termination described in the first paragraph, you can use synchronous termination:
Use time spent termination if you know how much time you want dedicate to solving beforehand.
Use unimproved time spent termination if you want to stop solving if the solution doesn't improve for a specified amount of time.
Use best score termination if you want to stop solving after a certain score has been reached.
Synchronous termination is defined before starting the solver and it's done either by XML solver configuration or using the SolverConfig API. See OptaPlanner documentation for other termination conditions.
Lastly, in case you're talking about score calculation and it takes too long to calculate score for a single move (solution change) then you're most certainly doing something wrong. For OptaPlanner to be able to search the solution space effectively, the score calculation must be fast (at least 1000 calculations per second).
For example in vehicle routing problem, driving time or road distances must be known at the time when you start solving. You shouldn't slow down score calculation with a heavy computation that can be done beforehand.
This is an example (pseudo code) of how you could simulate and render a video game.
//simulate 20ms into the future
const long delta = 20;
long simulationTime = 0;
while(true)
{
while(simulationTime < GetMilliSeconds()) //GetMilliSeconds = Wall Clock Time
{
//the frame we simulated is still in the past
input = GetUserlnput();
UpdateSimulation(delta, input);
//we are trying to catch up and eventually pass the wall clock time
simulationTime += delta;
}
//since my current simulation is in the future and
//my last simulation is in the past
//the current looking of the world has got to be somewhere inbetween
RenderGraphics(InterpolateWorldState(GetMilliSeconds() - simulationTime));
}
That's my question:
I have 40ms to go through the outer 'while true' loop (means 25FPS).
The RenderGraphics method takes 10ms. So that means I have 30ms for the inner loop. The UpdateSimulation method takes 5ms. Everything else can be ignored since it's a value under 0.1ms.
What is the maximum I can set the variable 'delta' to in order to stay in my time schedule of 40ms (outer loop)?
And why?
This largely depends on how often you want and need to update your simulation status and user input, given the constraints mentioned below. For example, if your game contains internal state based on physical behavior, you would need a smaller delta to ensure that movements and collisions, if any, are properly evaluated and reflected within the game state. Also, if your user input requires fine-grained evaluation and state update, you would also need smaller delta values. For example, a shooting game with analogue user input (e.g. mouse, joystick), would benefit from update frequencies larger than 30Hz. If your game does not need such high-frequency evaluation of input and game state, then you could get away with larger delta values, or even by simply updating your game state once any input by the player was being detected.
In your specific pseudo-code, your simulation would update according to a fixed time-slice of length delta, which requires your simulation update to be processed in less wallclock time than the wallclock time to be simulated. Otherwise, wallclock time would proceed faster, than your simulation time can be updated. This ultimately limits your delta depending on how quick any simulation update of delta simulation time can actually be computed. This relationship also depends on your use case and may not be linear or constant. For example, physics engines often would divide your delta time given internally to what update rate they can reasonably process, as longer delta times may cause numerical instabilities and harder to solve linear systems raising processing effort non-linearly. In other use cases, simulation updates may take a linear or even constant time. Even so, many (possibly external) events could cause your simulation update to be processed too slowly, if it is inherently demanding. For example, loading resources during simulation updates, your operating system deciding to lay your execution thread aside, another process run by the user, anti-virus software kicking in, low memory pressure, a slow CPU and so on. Until now, I saw mostly two strategies to evade this problem or remedy its effects. First, simply ignoring it could work if the simulation update effort is low and it is assumed that the cause of the slowdown is temporary only. This would result in more or less noticeable "slow motion" behavior of your simulation, which could - in worst case - lead to simulation time lag piling up forever. The second strategy I often saw was to simply cap the measured frame time to be simulated to some artificial value, say 1000ms. This leads to smooth behavior as soon as the cause of slow down disappears, but has the drawback that the 'capped' simulation time is 'lost', which may lead to animation hiccups if not handled or accounted for. To choose a strategy, analyzing your use case could consist of measuring the wallclock time it takes to process simulation updates of delta and x * delta time and how changing the delta time and simulation load to process actually reflects in wallclock time needed to compute it, which will hint you to what the maximum value of delta is for your specific hardware and software environment.
If I have a large number of functions in my application, Do they effect the execution speed of the application?
For example: I have 10000 functions in my application but each time that I run my application only 1 or 2 functions will work. It is not known beforehand which function(s) will be called, it depends on user given input.
Does it changes the execution speed it I have many number of functions?
The speed shouldn't be significantly affected in your case. The number of procedures defined is much less important than the computational complexity of each procedure called.
Think about it. A 2.5GHz processor can theoretically perform more than 10 billion floating point operations per second (FLOPS). The time required to load a fixed number of procedures into memory, even a million lines of code, will remain constant and fairly trivial, but if one of your procedures is complex enough, the number of operations can increase massively over a comparatively few iterations.
9,998 functions not used, but still in since they are referenced, does not affect performance unless you need to parse all code at each run.
I'm thinking the case analysis size might affect the performance. If you have 10,000 fucntions and only use about 2 each time, then you'll have about 5,000 outcomes and that means a lot of tests if it's a linear analysis or about 13 if it's binary.
I'd start with profiling the code to find the bottlenecks.
I was told that running programs generate probability data used to optimize repeated instructions.
For example, if an "if-then-else" control structure has been evaluated TRUE 8/10 times, then the next time the "if-then-else" statement is being evaluated, there is an 80% chance the condition will be TRUE. This statistic is used to prompt hardware to load the appropriate data into the registers assuming the outcome will be TRUE. The intent is to speed up the process. If the statement does evaluate to TRUE, data is already loaded to the appropriate registers. If the statement evaluates to FALSE, then the other data is loaded in and simply written over what was decided "more likely".
I have a hard time understanding how the probability calculations don't out-weigh the performance cost of decisions it's trying to improve. Is this something that really happens? Does it happen at a hardware level? Is there a name for this?
I can seem to find any information about the topic.
This is done. It's called branch prediction. The cost is non-trivial, but it's handled by dedicated hardware, so the cost is almost entirely in terms of extra circuitry -- it doesn't affect the time taken to execute the code.
That means the real cost would be one of lost opportunity -- i.e., if there was some other way of designing a CPU that used that amount of circuitry for some other purpose and gained more from doing so. My immediate guess is that the answer is usually no -- branch prediction is generally quite effective in terms of return on investment.
At work we perform demanding numerical computations.
We have a network of several Linux boxes with different processing capabilities. At any given time, there can be anywhere from zero to dozens of people connected to a given box.
I created a script to measure the MFLOPS (Million of Floating Point Operations per Second) using the Linpack Benchmark; it also provides number of cores and memory.
I would like to use this information together with the load average (obtained using the uptime command) to suggest the best computer for performing a demanding computation. In other words, its 3:00pm; I have a meeting in two hours; I need to run a demanding process: what node will get me the answer fastest?
I envision a script which will output a suggestion along the lines of:
SUGGESTED HOSTS (IN ORDER OF PREFERENCE)
HOST1.MYNETWORK
HOST2.MYNETWORK
HOST3.MYNETWORK
Such suggestion should favor fast computers (high MFLOPS) if the load average is low and, as load average increases for a given node, it should favor available nodes instead (i.e., I'd rather run in a slower computer with no users than in an eight-core with forty dudes logged in).
How should I prioritize? What algorithm (rationale) would you use? Again, what I have is:
Load Average (1min, 5min, 15min)
MFLOPS measure
Number of users logged in
RAM (installed and available)
Number of cores (important to normalize the load average)
Any thoughts? Thanks!
You don't have enough data to make an well-informed decision. It sounds as though the scheduling is very volatile: "At any given time, there can be anywhere from zero to dozens of people connected to a given box." So the current load does not necessarily reflect the future load of the machines.
To properly asses what hosts someone should use to minimize computation time would require knowing when the current jobs will terminate. If a powerful machine is about to be done doing most of its jobs, it would be a good candidate even though it currently has a high load.
If you want to guess purely on the current situation, you can do a weighed calculation to find out which hosts have the most MFLOPS available.
MFLOPS available = host's MFLOPS + (number of logical processors - load average)
Sort the hosts by MFLOPS available and suggest them in a descending order.
This formula assumes that the MFLOPS of a host is linearly related to its load average. This might not be exactly true, but it's probably fairly close.
I would favor the most recent load average since it's closer to the current/future situation, whereas, jobs from 15 minutes ago might have completed by now.
Have you considered a distributed approach to computation? Not all computations can be broken up such that more than one cpu can work on them. But perhaps your problem space can benefit from some parallelization. Have a look at Hadoop.
You don't need to know FLOPS. beowulf modules paralell computing center has I go to has the script for sure
PDC operates leading-edge, high-performance computers on a national level. PDC offers easily accessible computational resources that primarily cater to the ...