I'm wondering what APIs are available to avoid the following problem.
Casting my mind back to Operating System lectures on my old CS course, the topic was multiprocess scheduling and concurrent I/O. Here's what the lecturer gave as an example of what would happen:
Two processes, X and Y have some work to do. There's one processor/bus/whatever and the scheduler distributes timeslices between X and Y, naively, as follows:
X gets timeslice 1
Y gets timeslice 2
X gets timeslice 3
...
This was described as being "fair"; however, it seems to me grossly unfair. Consider two cases under this scheme:
If X and Y are both going to take 10 seconds each, now both will take 20 seconds.
If X requires 10 seconds and Y requires 100 seconds, then X will take 20 seconds and Y will take 110 seconds.
If the scheduler were simply "do all of X, then all of Y", then in the first case X would take 10 seconds and Y would take 20 seconds; in the second case X would take 10 and Y would take 110.
How can a system which makes nobody better off and somebody worse off be a good idea? The only argument in the "fair" system's favour is that if we did all of Y before any of X then a small job X would be delayed by a large job Y and we need to keep both jobs "responsive".
For the second case, part of me sees the natural "best" way as being to say "X is 10 times smaller, therefore absent any explicit preference, it should get 10 times as many timeslices as Y". (It's a bit like giving pedestrians right of way before cars on the grounds that they put less strain on the roads, but I digress.) Under this scheme, X finishes in 11 seconds and Y finishes in 110 seconds. Real world consequence: my mp3 loads and plays without appreciable extra delay even though a massive file copy is happening in the background.
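A toy sketch of what I mean, in Python; it uses a stride-scheduling-style allotment (tickets inversely proportional to job size, so the pass increment equals the size) with the 10 s / 100 s numbers above, purely as illustration:

# Stride-scheduling idea: tickets inversely proportional to job size,
# so the stride (pass increment) is proportional to the size itself.
# Always run the job with the lowest pass value for one 1-second slice.
def proportional_schedule(sizes):                    # sizes: {name: seconds of work}
    strides = {name: float(size) for name, size in sizes.items()}
    passes = {name: 0.0 for name in sizes}
    remaining = dict(sizes)
    t = 0
    while remaining:
        name = min(remaining, key=lambda n: passes[n])   # least weighted time so far
        t += 1                                           # one 1-second timeslice
        remaining[name] -= 1
        passes[name] += strides[name]
        if remaining[name] == 0:
            print(f"{name} finished at t={t}s")
            del remaining[name]

proportional_schedule({"X": 10, "Y": 100})           # -> X at t=11s, Y at t=110s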
Obviously there is a whole universe of strategies available and I don't want to argue the suitability of any particular one, my point is this: all such strategies require knowledge of the size of the job.
So, are there OS APIs (Linux, or even Windows) which allow one to specify hints of the amount of work an operation will take?
(NB you could claim disk I/O incorporates this implicitly but while(not_done){read_chunk();} would render it meaningless -- the kind of API I'm thinking of would specify megabytes at file open time, clock cycles at thread creation time, or something along these lines.)
If all tasks represent work that will have no value until they are run to completion, then the best approach is to run all the jobs in some sequence so as to minimize the cost of other things' (or people's) having to wait for them. In practice, many tasks represent a sequence of operations which may have some individual value, so if two tasks will take ten seconds each, having both tasks be half done at the ten-second mark may be better than having one task completed and one task not even started. This is especially true if the tasks are producing data which will be needed by a downstream process performed by another machine, and the downstream process will be able to perform useful work any time it has received more data than it has processed. It is also somewhat true if part of the work entails showing a person that something useful is actually happening. A user who watches a progress bar count up over a period of 20 seconds is less likely to get unhappy than one whose progress bar doesn't even budge for ten seconds.
In common operating systems you typically don't care about the delay of an individual task; you try to maximize throughput - in 110 seconds both X and Y will be done, period. Of course, some of the processes can be interactive, and therefore the OS accepts the extra overhead of context switches between processes to keep up the illusion of parallel computation.
As you said, any strategy that tries to minimize a task's completion time would require knowing how long it will take. That is often hard to determine if the task is anything more than just copying a file - which is why the progress bar in some applications goes to 99% and then stays there for a while doing just the last few things.
However, in real-time operating systems you often have to know a task's worst-case execution time, or some deadline by which the task must be finished - and then you are obligated to provide exactly such a "hint". The scheduler must then do somewhat smarter scheduling (even more so if locks or dependencies are involved); on multiprocessors the problem is sometimes NP-complete (in which case the scheduler uses heuristics).
I suggest you read something about RTOSes, Earliest Deadline First scheduling and Rate Monotonic scheduling.
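As a rough illustration of what a deadline hint buys the scheduler, here is a minimal Earliest Deadline First sketch in Python; the task sizes, deadlines and 1-second timeslice are invented for the example, and it ignores arrival times and preemption details that a real RTOS scheduler has to handle:

import heapq

# EDF: always run the ready task whose deadline is closest.
def edf_schedule(tasks, timeslice=1):              # tasks: [(name, work, deadline)]
    heap = [(deadline, name, work) for name, work, deadline in tasks]
    heapq.heapify(heap)
    t = 0
    while heap:
        deadline, name, work = heapq.heappop(heap)
        run = min(timeslice, work)
        t += run
        if work - run > 0:
            heapq.heappush(heap, (deadline, name, work - run))
        else:
            status = "met" if t <= deadline else "MISSED"
            print(f"{name} finished at t={t} (deadline {deadline}: {status})")

# X needs 10 units of work by t=15, Y needs 100 units by t=120.
edf_schedule([("X", 10, 15), ("Y", 100, 120)])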
The only argument in the "fair" system's favour is that if we did all of Y before any of X then a small job X would be delayed by a large job Y and we need to keep both jobs "responsive".
That's exactly the rationale. Fair scheduling is fair in that it tends to distribute computing time, and therefore delays, equally among processes asking for it.
So, are there OS APIs (Linux, or even Windows) which allow one to specify hints of the amount of work an operation will take?
Batch systems do this, but, as you concluded yourself, this requires knowledge of the task at hand. Unix/Linux has the nice command which gives a process lower priority; it's a good idea to let any long running, CPU-bound process on a multitasking machine be "nice" so it doesn't hold up short and interactive tasks. ionice does the same for IO priority.
(Also, ever since the early 1970s, Unix schedulers have dynamically raised the priority of processes that do not "eat up" their slices, so interactive processes get high CPU priority and stay responsive without CPU-bound ones holding everything up. See Thompson and Ritchie's early papers on Unix.)
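The same effect is available from inside a program too; for example, a CPU-bound Python script can lower its own priority with the standard os.nice call (Unix only) before starting the heavy work - just a sketch:

import os

# Raise our niceness to 19 (the lowest CPU priority) so interactive
# processes are not held up by the loop below. Unix only.
os.nice(19)

# ... stand-in for long-running, CPU-bound work ...
total = sum(i * i for i in range(10_000_000))
print(total)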
Related
I would like to know whether there is a recommended way of measuring execution time in TensorFlow Federated. To be more specific: if one would like to extract the execution time for each client in a certain round, e.g. for each client involved in a FedAvg round, saving the time stamp before the local training starts and the time stamp just before sending back the updates, what is the best (or simply correct) strategy to do this? Furthermore, since the clients' code runs in parallel, are such time stamps unreliable (especially considering the hypothesis that different clients may be using differently sized models for local training)?
To be very practical: is it appropriate to use tf.timestamp() at the beginning and at the end of the @tf.function client_update(model, dataset, server_message, client_optimizer) -- this is probably a simplified signature -- and then subtract the two time stamps?
I have the feeling that this is not the right way to do this given that clients run in parallel on the same machine.
Thanks to anyone who can help me with this.
There are multiple potential places to measure execution time, so the first step might be defining very specifically what the intended measurement is.
Measuring the training time of each client as proposed is a great way to get a sense of the variability among clients. This could help identify whether rounds frequently have stragglers. Using tf.timestamp() at the beginning and end of the client_update function seems reasonable. As the question correctly notes, this happens in parallel, so summing all of these times would be akin to CPU time.
Measuring the time it takes to complete all client training in a round would generally be the maximum of the values above. This might not hold when simulating FL in TFF, as TFF may decide to run some number of clients sequentially due to system resource constraints; in a real deployment all of these clients would run in parallel.
Measuring the time it takes to complete a full round (the maximum time it takes to run a client, plus the time it takes for the server to update) could be done by moving the tf.timestamp calls to the outer training loop. This would be wrapping the call to trainer.next() in the snippet on https://www.tensorflow.org/federated. This would be most similar to elapsed real time (wall clock time).
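For that wall-clock measurement, a minimal sketch; the round_fn argument stands in for something like trainer.next from that snippet, so the names in the usage comment are hypothetical:

import time

def timed_round(round_fn, *args):
    # Wall-clock (elapsed real) time of one federated round.
    start = time.monotonic()
    result = round_fn(*args)
    elapsed = time.monotonic() - start
    print(f"round took {elapsed:.3f} s (wall clock)")
    return result

# e.g. state, metrics = timed_round(trainer.next, state, client_data)   (hypothetical names)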
Calculating the solution to an optimization problem takes a 2 GHz CPU one hour. During this process there are no background processes, no RAM is being used and the CPU is at 100% capacity.
Based on this information, can it be derived that a 1 GHz CPU will take two hours to solve the same problem?
A quick search on IPC, frequency, and chip architecture will show you this topic has been broached many times. There are many things that can determine the execution speed of a program (without even going into threading at all); the main ones that come to mind:
Instruction set - if one chip has an instruction for multiplication, then a*b is atomic. If not, you will need a lot of atomic instructions to perform the same action - a big difference in speed, which can make even higher-frequency chips slower.
Cycles per second - this is the frequency of the chip.
Instructions per cycle (IPC) - what you are really interested in is IPC*frequency, not just frequency: how many atomic actions you can perform in a second. Accounting for the number of atomic actions (see point 1), on a single-threaded application this might behave as you expect (2x this => 2x faster program), though there are no guarantees; a numeric sketch follows below.
And there are a ton of other, more nuanced technologies that can affect this, like branch prediction, which hit the news in a big way recently. For a complete understanding, a book/course might be a better resource.
So, in general, no. If you are comparing two single-core, same-architecture chips (unlikely), then maybe yes.
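To make the IPC*frequency point concrete, a back-of-the-envelope sketch in Python; the instruction count and IPC figures are invented for illustration only:

# Execution time is roughly instructions / (IPC * clock frequency).
def runtime_seconds(instructions, ipc, hz):
    return instructions / (ipc * hz)

work = 7.2e12  # hypothetical instruction count: 1 hour at 2 GHz with IPC = 1

print(runtime_seconds(work, ipc=1.0, hz=2e9))  # 3600 s - the 2 GHz chip
print(runtime_seconds(work, ipc=1.0, hz=1e9))  # 7200 s - 1 GHz, same IPC
print(runtime_seconds(work, ipc=2.0, hz=1e9))  # 3600 s - 1 GHz, double the IPC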
This is an example (pseudo code) of how you could simulate and render a video game.
//simulate 20ms into the future
const long delta = 20;
long simulationTime = 0;
while(true)
{
    while(simulationTime < GetMilliSeconds()) //GetMilliSeconds = wall clock time
    {
        //the frame we simulated is still in the past
        input = GetUserInput();
        UpdateSimulation(delta, input);
        //we are trying to catch up and eventually pass the wall clock time
        simulationTime += delta;
    }
    //since my current simulation is in the future and
    //my last simulation is in the past,
    //the current look of the world has got to be somewhere in between
    RenderGraphics(InterpolateWorldState(GetMilliSeconds() - simulationTime));
}
That's my question:
I have 40ms to go through the outer 'while true' loop (which means 25 FPS).
The RenderGraphics method takes 10ms. So that means I have 30ms for the inner loop. The UpdateSimulation method takes 5ms. Everything else can be ignored since it's a value under 0.1ms.
What is the maximum I can set the variable 'delta' to in order to stay in my time schedule of 40ms (outer loop)?
And why?
This largely depends on how often you want and need to update your simulation state and user input, given the constraints mentioned below. For example, if your game contains internal state based on physical behavior, you would need a smaller delta to ensure that movements and collisions, if any, are properly evaluated and reflected within the game state. If your user input requires fine-grained evaluation and state updates, you would also need smaller delta values. For example, a shooting game with analogue user input (e.g. mouse, joystick) would benefit from update frequencies higher than 30Hz. If your game does not need such high-frequency evaluation of input and game state, then you could get away with larger delta values, or even with simply updating your game state whenever any input from the player is detected.
In your specific pseudo-code, the simulation updates in fixed time slices of length delta, which requires each simulation update to be processed in less wallclock time than the wallclock time it simulates. Otherwise, wallclock time would advance faster than your simulation time can be updated. This ultimately limits your delta, depending on how quickly a simulation update covering delta of simulation time can actually be computed. That relationship depends on your use case and may not be linear or constant. For example, physics engines often internally subdivide the delta time they are given into an update rate they can reasonably process, since longer delta times may cause numerical instabilities and harder-to-solve linear systems, raising the processing effort non-linearly. In other use cases, simulation updates may take linear or even constant time.
Even so, many (possibly external) events could cause your simulation update to be processed too slowly if it is inherently demanding: loading resources during simulation updates, the operating system setting your execution thread aside, another process run by the user, anti-virus software kicking in, memory pressure, a slow CPU, and so on.
So far I have mostly seen two strategies to avoid this problem or remedy its effects. The first is to simply ignore it, which can work if the simulation update effort is low and the cause of the slowdown is assumed to be temporary. This results in more or less noticeable "slow motion" behavior of the simulation, which could - in the worst case - lead to the simulation-time lag piling up forever. The second strategy I have often seen is to cap the measured frame time to be simulated at some artificial value, say 1000ms. This gives smooth behavior as soon as the cause of the slowdown disappears, but has the drawback that the 'capped' simulation time is 'lost', which may lead to animation hiccups if not handled or accounted for.
To choose a strategy, analyze your use case by measuring the wallclock time it takes to process simulation updates of delta and x * delta simulation time, and how changing the delta time and the simulation load is actually reflected in the wallclock time needed to compute it. That will hint at what the maximum value of delta is for your specific hardware and software environment; a rough measuring sketch follows below.
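A rough sketch of such a measurement in Python; update_simulation here is only a hypothetical stand-in for your real UpdateSimulation:

import time

def update_simulation(delta_ms):
    # Hypothetical stand-in: the real update would advance the game state
    # by delta_ms; here we just pretend it costs ~5 ms of wall time.
    time.sleep(0.005)

def measure_update_cost(delta_ms, samples=100):
    start = time.perf_counter()
    for _ in range(samples):
        update_simulation(delta_ms)
    avg_ms = (time.perf_counter() - start) * 1000 / samples
    # The loop only keeps up with real time if simulating delta_ms
    # costs less wallclock time than delta_ms itself.
    verdict = "keeps up" if avg_ms < delta_ms else "falls behind"
    print(f"delta={delta_ms} ms costs ~{avg_ms:.2f} ms per update ({verdict})")

measure_update_cost(20)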
I am reading the Concurrency Programming Guide on the iOS dev site.
When I moved to the section "Moving away from threads", Apple said:
Although threads have been around for many years and continue to have
their uses, they do not solve the general problem of executing
multiple tasks in a scalable way. With threads, the burden of creating
a scalable solution rests squarely on the shoulders of you, the
developer. You have to decide how many threads to create and adjust
that number dynamically as system conditions change. Another problem
is that your application assumes most of the costs associated with
creating and maintaining any threads it uses.
From my previous learning, the OS takes care of process and thread management, and the programmer just creates and destroys threads as desired.
Is that wrong?
No, it is not wrong. What it is saying is that when you are programming with threads, most of the time you dynamically create threads based on certain conditions that the programmer places in their code. For example, finding prime numbers can be split up across threads, but the creation and destruction of those threads is done by the programmer. You are completely correct; it is just saying what you are saying in a more descriptive and elaborate way.
Oh, and for thread management: sometimes, if the developer sees that the program will usually need to create a large number of threads, it is cheaper to spawn a pool of threads and reuse those.
Say you have 100 tasks to perform, all using independent--for the duration of the task--data. Every thread you start costs quite a bit of overhead. So if you have two cores, you only want to start two threads, because that's all that's going to run anyway. Then you have to feed tasks to each of those threads to keep them both running. If you have 100 cores, you'll launch 100 threads. It's worth the overhead to get the job done 50 times faster.
So in old-fashioned programming, you have to do two jobs. You have to find out how many cores you have, and you have to feed tasks to each of your threads so they keep running and don't waste cores. (This becomes only one job if you have >= 100 cores.)
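A sketch of those two jobs in Python, using the standard thread pool rather than anything Apple-specific; the task list is made up, and for CPU-bound Python work a ProcessPoolExecutor would actually parallelize better, but the pattern is the same:

import os
from concurrent.futures import ThreadPoolExecutor

def do_task(n):
    # Stand-in for one of the 100 independent tasks.
    return sum(i * i for i in range(n))

tasks = [100_000] * 100                  # 100 independent pieces of work

# Job 1: find out how many cores there are.
workers = os.cpu_count() or 2

# Job 2: keep exactly that many threads fed with tasks.
with ThreadPoolExecutor(max_workers=workers) as pool:
    results = list(pool.map(do_task, tasks))

print(len(results), "tasks done on", workers, "threads")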
I believe Apple is offering to take over these two awkward jobs for you.
If your jobs share data, that changes things. With two threads running, one can block the other, and even on a 2-core machine it pays to have three or more threads running. You are apt to find letting 100 threads loose at once makes sense because it improves the chances that at least two of them are not blocked. It prevents one blocked task from holding up the rest of the tasks in its thread. You pay a price in thread overhead, but get it back in high CPU usage.
So this feature is sometimes very useful and sometimes not. It helps with parallel programming, but would hinder with non-parallel concurrency (multithreading).
I want to make a comparison between RR and MLFQ in terms of waiting time, response time, turnaround time in 3 cases:
a) More CPU-bound jobs than I/O-bound jobs
b) More I/O-bound jobs than CPU-bound jobs
c) When only a few jobs need to be scheduled
Could you help me clarify this or give me some sources for reference? Thanks a lot.
There's some maths for this called "queueing theory", which can give you some equations to use.
Another way is to develop a simulation (software model) of the queue, and measure things (e.g. the distribution of response times) as you change various parameters (e.g. utilization).
The important thing to decide is the distribution of inter-arrival times of the input events (jobs to be processed): if they arrive regularly then there may typically be no queueing delay at all (assuming the system utilization is less than 100%), but if they arrive randomly (e.g. with a Poisson distribution) then there is (on average) a non-zero queue.
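For example, a minimal single-server (M/M/1-style) simulation sketch in Python; the arrival and service rates are invented, and the point is just to see how the average response time grows with utilization (for this model queueing theory predicts roughly 1/(service_rate - arrival_rate)):

import random

def simulate(arrival_rate, service_rate, n_jobs=100_000, seed=1):
    # Poisson arrivals, exponential service times, one FIFO server.
    rng = random.Random(seed)
    clock = 0.0            # time of the current arrival
    server_free_at = 0.0
    total_response = 0.0
    for _ in range(n_jobs):
        clock += rng.expovariate(arrival_rate)     # next arrival
        start = max(clock, server_free_at)         # wait if the server is busy
        server_free_at = start + rng.expovariate(service_rate)
        total_response += server_free_at - clock   # waiting + service time
    return total_response / n_jobs

for rho in (0.5, 0.8, 0.95):                       # utilization = arrival/service
    print(rho, simulate(arrival_rate=rho, service_rate=1.0))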