About number of threads - objective-c

I am reading the Concurrency Programming Guide on the iOS dev site. When I got to the section "Moving Away from Threads", Apple said:
Although threads have been around for many years and continue to have
their uses, they do not solve the general problem of executing
multiple tasks in a scalable way. With threads, the burden of creating
a scalable solution rests squarely on the shoulders of you, the
developer. You have to decide how many threads to create and adjust
that number dynamically as system conditions change. Another problem
is that your application assumes most of the costs associated with
creating and maintaining any threads it uses.
From my previous learning, the OS takes care of process and thread management, and the programmer just creates and destroys threads as desired.
Is that wrong?

No, it is not wrong. What it is saying is that when you program with threads, most of the time you create threads dynamically based on conditions the programmer places in the code. For example, finding prime numbers can be split up across threads, but the creation and destruction of those threads is done by the programmer. You are completely correct; the guide is just saying what you are saying in a more descriptive and elaborate way.
As for thread management: if the developer sees that a large number of threads will be needed most of the time, it is cheaper to spawn a pool of threads up front and reuse those.

Say you have 100 tasks to perform, all using data that is independent for the duration of the task. Every thread you start costs quite a bit of overhead. So if you have two cores, you only want to start two threads, because that's all that will actually run at once. Then you have to feed tasks to each of those threads to keep them both running. If you have 100 cores, you'll launch 100 threads; the overhead is worth it to get the job done 50 times faster.
So in old-fashioned programming, you have to do two jobs. You have to find out how many cores you have, and you have to feed tasks to each of your threads so they keep running and don't waste cores. (This becomes only one job if you have >= 100 cores.)
I believe Apple is offering to take over these two awkward jobs for you.
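For illustration, a minimal, language-agnostic sketch of those two jobs (shown in Python; run_task is a hypothetical stand-in for one of the 100 independent tasks, and processes are used instead of raw threads so CPU-bound work actually spreads across cores):

```python
import os
from concurrent.futures import ProcessPoolExecutor

def run_task(task_id):
    # stand-in for one unit of independent, CPU-bound work
    return sum(i * i for i in range(100_000)) + task_id

if __name__ == "__main__":
    workers = os.cpu_count() or 2                        # job 1: find out how many cores you have
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run_task, range(100)))   # job 2: keep every worker fed
    print(f"{len(results)} tasks finished on {workers} workers")
```

A dispatch-queue approach hands both of those decisions to the system instead.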
If your jobs share data, that changes things. With two threads running, one can block the other, and even on a 2-core machine it pays to have three or more threads running. You are apt to find letting 100 threads loose at once makes sense because it improves the chances that at least two of them are not blocked. It prevents one blocked task from holding up the rest of the tasks in its thread. You pay a price in thread overhead, but get it back in high CPU usage.
So this feature is sometimes very useful and sometimes not. It helps with parallel programming, but it can hinder non-parallel concurrency (plain multithreading).

Related

When is TensorFlow's ParameterServerStrategy preferable to its MultiWorkerMirroredStrategy?

When training a neural network across multiple servers and GPUs, I can't think of a scenario where the ParameterServerStrategy would be preferable to the MultiWorkerMirroredStrategy.
What are the ParameterServerStrategy's main use cases and why would it be better than using MultiWorkerMirroredStrategy?
MultiWorkerMirroredStrategy is intended for synchronous distributed training across multiple workers, each of which can have multiple GPUs.
ParameterServerStrategy: Supports parameter servers. It can be used for multi-GPU synchronous local training or asynchronous multi-machine training.
One of the key differences is that ParameterServerStrategy can be used for asynchronous training, while MultiWorkerMirroredStrategy is intended for synchronous distributed training. In MultiWorkerMirroredStrategy a copy of all variables in the model is kept on each device across all workers, and a communication method is needed to keep all variables in sync. In contrast, in ParameterServerStrategy each variable of the model is placed on one parameter server.
This matters because:
In synchronous training, all the workers are kept in sync in terms of training epochs and steps, so the other workers need to wait for a failed or preempted worker to restart before they can continue. If the failed or preempted worker does not restart for some reason, your workers will keep waiting.
In contrast in ParameterServerStrategy, each worker is running the same code independently, but parameter servers are running a standard server. This means that while each worker will synchronously compute a single gradient update across all GPUs, updates between workers proceed asynchronously. Operations that occur only on the first replica (such as incrementing the global step), will occur on the first replica of every worker. Hence unlike MultiWorkerMirroredStrategy, different workers are not waiting on each other.
I guess the question is: do you expect workers to fail, and will the delay in restarting them slow down training when using MultiWorkerMirroredStrategy? If that is the case, maybe ParameterServerStrategy is better.
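For reference, a hedged sketch of how the two strategies are typically constructed with the TF 2.x APIs (assuming the TF_CONFIG cluster spec is set by whatever launches your workers and parameter servers; build only one strategy per process):

```python
import tensorflow as tf

def make_strategy(kind):
    if kind == "sync":
        # MWMS: every worker holds a full replica of the variables, updated in lock-step
        return tf.distribute.MultiWorkerMirroredStrategy()
    # PSS: variables live on parameter servers; workers push updates independently
    resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
    return tf.distribute.ParameterServerStrategy(resolver)

strategy = make_strategy("sync")
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="sgd", loss="mse")
```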
EDIT: Answers to questions in comments:
So is the only benefit of PSS the fact that it resists better to
failing workers than MWMS?
Not exactly - even if workers do not fail in MWMS, the workers still need to stay in sync, so there can be network bottlenecks.
If so, then I imagine it would only be useful when training on many
workers, say 20 or more, or else the probability that a worker will
fail during training is low (and it can be avoided by saving regular
snapshots).
Maybe not, it depends on the situation. Perhaps in your scenario the probability of failure is low; in someone else's scenario there may be a higher probability. For the same number of workers, the longer a job runs, the more likely it is that a failure occurs somewhere in the middle of it. To illustrate further (with an oversimplified example): if I have the same number of nodes but they're simply slower, they could take much longer to do a job, and hence there is a greater likelihood of some kind of interruption or failure occurring during the job.
(and it can be avoided by saving regular snapshots).
Not sure I understand what you mean - if a worker fails, and you've saved a snapshot, then you haven't lost data. But the worker still needs to restart. In the interim between failure and restarting other workers may be waiting.
Isn't there a possible benefit with I/O saturation? If the updates are
asynchronous, I/O would be more spread out in time, right? But maybe
this benefit is cancelled by the fact that it uses more I/O? Could you
please detail this a bit?
I will first try to answer it from a conceptual point of view.
I would say try looking at it from a different angle - in a synchronous operation, you're waiting for something else to finish, and you may be idle till that something gives you what you need.
In contrast, in an asynchronous operation you do your own work, and when you need more you ask for it.
There is no hard and fast rule about whether synchronous operations or asynchronous operations are better. It depends on the situation.
I will now try to answer it from an optimization point of view:
Isn't there a possible benefit with I/O saturation? If the updates are
asynchronous, I/O would be more spread out in time, right? But maybe
this benefit is cancelled by the fact that it uses more I/O? Could you
please detail this a bit?
In a distributed system it is possible that your bottleneck is CPU / GPU, disk or network. Nowadays networks are really fast, and in some cases faster than disk. Depending on your workers' configuration, CPU / GPU could be the bottleneck. So it really depends on the configuration of your hardware and network.
Hence I would do some performance testing to determine where the bottlenecks in your system are, and optimize for your specific problem.
EDIT: Additional follow up questions:
One last thing: in your experience, in what use cases is PSS used? I
mean, both PSS and MWMS are obviously for use with large datasets (or
else a single machine would suffice), but what about the model? Would
PSS be better for larger models? And in your experience, is MWMS more
frequently used?
I think cost and the type of problem being worked on may influence the choice. For example, both AWS and GCP offer “spot instances” / “preemptible instances”, which are heavily discounted servers that can be taken away at any moment. In such a scenario it may make sense to use PSS - even though machine failure is unlikely, an instance may simply be taken away without notice because it is a “spot instance”. If you use PSS, then the performance impact of servers disappearing may not be as large as when using MWMS.
If you’re using dedicated instances, the instances are dedicated to you, and will not be taken away - the only risk of interruption is machine failure. In such cases MWMS may be more attractive if you can take advantage of performance optimisations or plugin architecture.

safety of using cocoa's performSelectorOnMainThread thousands of times

In my app I have a worker thread which sits around doing a lot of processing. While it's processing, it sends updates to the main thread which uses the information to update GUI elements. This is done with performSelectorOnMainThread. For simplicity in the code, there are no restrictions on these updates and they get sent at a high rate (hundreds or thousands per second), and waitUntilDone is false. The methods called simply take the variable and copy it to a private member of the view controller. Some of them update the GUI directly (because I'm lazy!). Once every few seconds, the worker thread calls performSelectorOnMainThread with waitUntilDone set to true (this is related to saving the output of the current calculation batch).
My question: is this a safe use of performSelectorOnMainThread? I ask because I recently encountered a problem where my displayed values stopped updating, despite the background thread continuing to work without issues (and produce the correct output). Since they are fed values this way, I wondered if it might have hit a limit in the number of messages. I already checked the usual suspects (overflows, leaks, etc) and everything's clean. I haven't been able to reproduce the problem, however.
For simplicity in the code, there are no restrictions on these updates
and they get sent at a high rate (hundreds or thousands per second),
and waitUntilDone is false.
Yeah. Don't do that. Not even for the sake of laziness in an internal only application.
It can cause all kinds of potential problems beyond making the main run loop unresponsive.
Foremost, it will starve your worker thread for CPU cycles as your main thread is constantly spinning trying to update the UI as rapidly as the messages arrive. Given that drawing is oft done in a secondary thread, this will likely cause yet more thread contention, slowing things down even more.
Secondly, all those messages consume resources. Potentially lots of them and potentially ones that are relatively scarce, depending on implementation details.
While there shouldn't be a limit, there may likely be a practical limit that, when exceeded, things stop working. If this is the case, it would be a bug in the system, but one that is unlikely to be fixed beyond a console log that says "Too many messages, too fast, make fewer.".
It may also be a bug in your code, though. Transfer of state between threads is an area rife with pitfalls. Are you sure your cross-thread-communication code is bullet proof? (And, if it is bulletproof, it is quite likely a huge performance cost for your thousands/sec update notifications).
It isn't hard to throttle updates. While the suggestions in the comments are all reasonable, it can be done much more easily (NSNotificationQueue is fantastic, but likely overkill unless you are updating the main thread from many different places in your computation).
create an NSDate whenever you notify the main thread and store the date in an ivar
next time you go to notify the main thread, check if more than N seconds have passed since that date
if they have, send the update and store the new date in your ivar
[bonus performance] if all that date comparison is too expensive, consider revisiting your algorithm to move the "update now" trigger to somewhere less frequent. Barring that, create an int ivar counter and only check the date every N iterations
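A minimal sketch of that throttle (shown in Python for brevity; in the Cocoa case the stored timestamp would be an NSDate ivar, notify would wrap performSelectorOnMainThread:, and the numbers here are placeholders):

```python
import time

class ProgressThrottle:
    def __init__(self, min_interval_s=0.1, check_every=1000):
        self.min_interval_s = min_interval_s
        self.check_every = check_every     # bonus: only look at the clock every N iterations
        self._last_sent = 0.0
        self._iterations = 0

    def maybe_notify(self, notify, value):
        self._iterations += 1
        if self._iterations % self.check_every:
            return                         # skip the date comparison most of the time
        now = time.monotonic()
        if now - self._last_sent >= self.min_interval_s:
            self._last_sent = now          # update the stored timestamp ("ivar")
            notify(value)                  # hand the latest value to the UI thread
```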

updating 2 800 000 records with 4 threads

I have a VB.net application with an Access database containing one table of about 2,800,000 records; each row is updated with new data daily. The machine has 64GB of RAM and an i7 3960X overclocked to 4.9GHz.
Note: data sources are local.
I wonder if using ~10 threads will finish updating the rows faster.
If it is possible, what would be the mechanism for dividing this big loop across multiple threads?
Update: Sometimes the loop has to repeat the calculation for a row depending on the results; also, the loop has exactly 63 conditions and is 242 lines of code.
Microsoft Access is not particularly good at handling many concurrent updates, compared to other database platforms.
The more your tasks need to do calculations, the more you will typically benefit from concurrency / threading. If you spin up 10 threads that do little more than send update commands to Access, it is unlikely to be much faster than it is with just one thread.
If you have to do any significant calculations between reading and writing data, threads may show a performance improvement.
I would suggest trying the following and measuring the result:
One thread to read data from Access
One thread to perform whatever calculations are needed on the data you read
One thread to update Access
You can implement this using a Producer / Consumer pattern, which is pretty easy to do with a BlockingCollection.
The nice thing about the Producer / Consumer pattern is that you can add more producer and/or consumer threads with minimal code changes to find the sweet spot.
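A minimal sketch of that pipeline (shown in Python with queue.Queue; in VB.NET the same shape uses one BlockingCollection per stage, and do_calculations / save_row are hypothetical placeholders for your 63-condition logic and the Access update):

```python
import queue
import threading

SENTINEL = object()

def do_calculations(row):      # hypothetical placeholder for the 242-line calculation
    return row

def save_row(row):             # hypothetical placeholder for the Access UPDATE
    pass

def reader(rows, calc_q):
    for row in rows:           # one thread reads rows from the database
        calc_q.put(row)
    calc_q.put(SENTINEL)

def calculator(calc_q, write_q):
    while (row := calc_q.get()) is not SENTINEL:
        write_q.put(do_calculations(row))
    write_q.put(SENTINEL)

def writer(write_q):
    while (row := write_q.get()) is not SENTINEL:
        save_row(row)          # one thread issues the updates

if __name__ == "__main__":
    calc_q, write_q = queue.Queue(maxsize=1000), queue.Queue(maxsize=1000)
    rows = range(1000)         # 2,800,000 in the real application
    threads = [threading.Thread(target=reader, args=(rows, calc_q)),
               threading.Thread(target=calculator, args=(calc_q, write_q)),
               threading.Thread(target=writer, args=(write_q,))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

The bounded queues provide backpressure, so the reader cannot run far ahead of the writer.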
Supplemental Thought
IO is probably the bottleneck of your application. Consider placing the Access file on faster storage if you can (SSD, RAID, or even a RAM disk).
Well if you're updating 2,800,000 records with 2,800,000 queries, it will definitely be slow.
Generally, it's good to avoid opening multiple connections to update your data.
You might want to show us some code of how you're currently doing it, so we could tell you what to change.
So I don't think (with the information you gave) that going multi-thread for this would be faster. Now, if you're thinking about going multi-thread because the update freezes your GUI, now that's another story.
If the processing is slow, I personally don't think it's due to your servers specs. I'd guess it's more something about the logic you used to update the data.
Don't wonder, test. Write it so you can dispatch as many threads as you like to do the work, and test it with various numbers of threads. What does the loop you are talking about look like?
With questions like "if I add more threads, will it work faster?" it is always best to test, though there are rules of thumb. If the DB is local, chances are that Oded is right.

Surely schedulers aren't this harmful? Don't we have better APIs?

I'm wondering what APIs are available to avoid the following problem.
Casting my mind back to Operating System lectures on my old CS course, the topic was multiprocess scheduling and concurrent I/O. Here's what the lecturer gave as an example of what would happen:
Two processes, X and Y have some work to do. There's one processor/bus/whatever and the scheduler distributes timeslices between X and Y, naively, as follows:
X gets timeslice 1
Y gets timeslice 2
X gets timeslice 3
...
This was described as being "fair"; however, it seems to me grossly unfair. Consider two cases under this scheme:
If X and Y are both going to take 10 seconds each, now both will take 20 seconds.
If X requires 10 seconds and Y requires 100 seconds, then X will take 20 seconds and Y will take 110 seconds.
If the scheduler was simply "do all of X then all of Y" then in the first case X would take 10 seconds and Y would take 20 seconds; in the second case X would take 10 and y would take 110.
How can a system which makes nobody better off and somebody worse off be a good idea? The only argument in the "fair" system's favour is that if we did all of Y before any of X, then a small job X would be delayed by a large job Y, and we need to keep both jobs "responsive".
For the second case, part of me sees the natural "best" way as being to say "X is 10 times smaller, therefore absent any explicit preference, it should get 10 times as many timeslices as Y". (It's a bit like giving pedestrians right of way before cars on the grounds that they put less strain on the roads, but I digress.) Under this scheme, X finishes in 11 seconds and Y finishes in 110 seconds. Real world consequence: my mp3 loads and plays without appreciable extra delay even though a massive file copy is happening in the background.
Obviously there is a whole universe of strategies available and I don't want to argue the suitability of any particular one, my point is this: all such strategies require knowledge of the size of the job.
So, are there OS APIs (Linux, or even Windows) which allow one to specify hints of the amount of work an operation will take?
(NB you could claim disk I/O incorporates this implicitly but while(not_done){read_chunk();} would render it meaningless -- the kind of API I'm thinking of would specify megabytes at file open time, clock cycles at thread creation time, or something along these lines.)
If all tasks represent work that will have no value until they are run to completion, then the best approach is to run all the jobs in some sequence so as to minimize the cost of other things (or people) having to wait for them. In practice, many tasks represent a sequence of operations which may have some individual value, so if two tasks will take ten seconds each, having both tasks be half done at the ten-second mark may be better than having one task completed and one task not even started. This is especially true if the tasks are producing data which will be needed by a downstream process performed by another machine, and the downstream process will be able to perform useful work any time it has received more data than it has processed. It is also somewhat true if part of the work entails showing a person that something useful is actually happening. A user who watches a progress bar count up over a period of 20 seconds is less likely to get unhappy than one whose progress bar doesn't even budge for ten seconds.
In common operating systems you typically don't care about the delay of an individual task; you try to maximize throughput - both X and Y will be done in 110 seconds, period. Of course, some of the processes can be interactive, and therefore the OS takes on the extra overhead of context switches between processes to keep up the illusion of computation in parallel.
As you said, any strategy that aims to minimize a task's completion time would require knowing how long the task will take. That is very often hard to find out if the task is more than just copying a file - that's why the progress bar in some applications goes to 99% and stays there for a while, doing just the few last things.
However, in real-time operating systems you often have to know a task's worst-case execution time, or a deadline by which the task must be finished - and then you are obligated to provide such a "hint". The scheduler must then do somewhat smarter scheduling (especially if locks or dependencies are involved); on multiprocessors the problem is sometimes NP-complete (in which case the scheduler uses heuristics).
I suggest you read something about RTOSes, Earliest Deadline First scheduling and Rate Monotonic scheduling.
The only argument in the "fair" system's favour is that if we did all of Y before any of X then a small job X would be delayed by a large job Y and we need to keep both jobs "responsive".
That's exactly the rationale. Fair scheduling is fair in that it tends to distribute computing time, and therefore delays, equally among processes asking for it.
So, are there OS APIs (Linux, or even Windows) which allow one to specify hints of the amount of work an operation will take?
Batch systems do this, but, as you concluded yourself, this requires knowledge of the task at hand. Unix/Linux has the nice command which gives a process lower priority; it's a good idea to let any long running, CPU-bound process on a multitasking machine be "nice" so it doesn't hold up short and interactive tasks. ionice does the same for IO priority.
(Also, ever since the early 1970s, Unix schedulers have dynamically raised the priority of processes that do not "eat up" their slices, so interactive processes get high CPU priority and stay responsive without CPU-bound ones holding everything up. See Thompson and Ritchie's early papers on Unix.)
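As a small illustration (Python, Unix-only, a hypothetical placement inside a long-running worker), the programmatic equivalent of launching the job under nice:

```python
import os

if hasattr(os, "nice"):        # os.nice is only available on Unix-like systems
    os.nice(10)                # +10 niceness: yield the CPU to interactive tasks
# ... long-running, CPU-bound work continues here ...
```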

Improve MPI program

I wrote an MPI program that seems to run OK, but I wonder about performance. The master thread needs to do MPI_Send 10 or more times, and the worker receives data 10 or more times and sends it. I wonder if this incurs a performance penalty, and whether I could transfer everything in a single struct, or what other technique I could benefit from.
Another general question: once an MPI program works more or less, what are the best optimization techniques?
It's usually the case that sending 1 large message is faster than sending 10 small messages. The time cost of sending a message is well modelled by considering a latency (how long it would take to send an empty message, which is non-zero because of the overhead of function calls, network latency, etc.) and a bandwidth (how much longer it takes to send an extra byte given that the network communication has already started). By bundling up messages into one message, you only incur the latency cost once, and this is often a win (although it's always possible to come up with cases where it isn't). The best way to know for any particular code is simply to try. Note that MPI datatypes allow you very powerful ways to describe the layout of your data in memory so that you can take it almost directly from memory to the network without having to do an intermediate copy into some buffer (so-called "marshalling" of the data).
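To make the bundling idea concrete, here is a hedged sketch using mpi4py (assumed available; the same applies to the C API). With a cost per message of roughly latency + bytes/bandwidth, one large send pays the latency term once instead of ten times:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
chunks = [np.full(100, rank, dtype="d") for _ in range(10)]   # ten small payloads

if rank == 0:
    # instead of: for c in chunks: comm.Send(c, dest=1, tag=0)
    comm.Send(np.concatenate(chunks), dest=1, tag=0)          # one message, one latency hit
elif rank == 1:
    bundle = np.empty(10 * 100, dtype="d")
    comm.Recv(bundle, source=0, tag=0)
    received = np.split(bundle, 10)                           # unpack the ten logical chunks
```

Run it with something like mpirun -n 2 python bundle_demo.py and time both variants on your own network.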
As to more general optimization questions about MPI -- without knowing more, all we can do is give you advice which is so general as to not be very useful. Minimize the amount of communications which need to be done; wherever possible, use built-in MPI tools (collectives, etc) rather than implementing your own.
One way to fully understand the performance of your MPI application is to run it within the SimGrid platform simulator. The tooling and models provided are sufficient to get realistic timing predictions for mid-range applications (say, a few dozen thousand lines of C or Fortran), and it can be combined with suitable visualization tools that help you fully understand what is going on in your application and the actual performance tradeoffs you have to consider.
For a demo, please refer to this screencast: https://www.youtube.com/watch?v=NOxFOR_t3xI