Parallel reduce algorithm implementation - objective-c

I've been investigating implementations of reduce [inject, fold, whatever you want to call it] functions in Objective-C using blocks and was wondering if there were any techniques for parallelizing the computation where the function applied is associative (e.g. sum of a collection of integers)?
i.e. is it possible to parallelize or improve on something like this on NSArray:
- (id)reduceWithBlock:(id (^)(id memo, id obj))block andAccumulator:(id)accumulator
{
    id acc = [[accumulator copy] autorelease];
    for (id obj in self) {
        acc = block(acc, obj);
    }
    return acc;
}
using Grand Central Dispatch?
EDIT: I've made a second attempt, partitioning the array into smaller chunks and reducing them in separate dispatch queues, but there's no discernible performance gain in my testing (gist here).

You can parallelize it with dispatch_apply on a global dispatch queue, but your code is not well suited to concurrent work. The accumulator object requires exclusive access and the block uses it on every iteration, so the whole computation ends up serialized behind one giant lock on the accumulator.
For example, this code does almost no work concurrently, even though it uses dispatch_apply with a global dispatch queue:
    dispatch_semaphore_t sema = dispatch_semaphore_create(1);
    dispatch_queue_t queue =
        dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    dispatch_apply([array count], queue, ^(size_t index) {
        // Every iteration waits here, so only one runs at a time.
        dispatch_semaphore_wait(sema, DISPATCH_TIME_FOREVER);
        acc = block(acc, [array objectAtIndex:index]);
        dispatch_semaphore_signal(sema);
    });
You need to split the block from the accumulator implementation to parallelize this efficiently.
(I haven't checked the algorithm in your code.)
    dispatch_queue_t result_queue = dispatch_queue_create(NULL, NULL);
Here you are using a serial queue, which executes one block at a time. You probably want a concurrent queue instead, either
    dispatch_queue_t result_queue =
        dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
or
    dispatch_queue_t result_queue = dispatch_queue_create(NULL, DISPATCH_QUEUE_CONCURRENT);
    /* DISPATCH_QUEUE_CONCURRENT is only available on OS X 10.7/iOS 4.3 or later. */
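The advice to split the per-item work from the accumulator can be sketched roughly as follows. This is a minimal illustration in plain C with POSIX threads rather than GCD and blocks, so it shows a swapped-in technique and not the answerer's code. NUM_CHUNKS, chunk_t, reduce_chunk and parallel_sum are hypothetical names. Each worker reduces its own slice into a private partial sum, so no lock is needed, and the partial results are combined serially at the end, which is only valid because addition is associative.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_CHUNKS 4  /* hypothetical chunk count; tune for your machine */

/* One unit of work: reduce values[begin, end) into a private partial sum. */
typedef struct {
    const long *values;
    size_t begin;
    size_t end;
    long partial;   /* written only by the worker that owns this chunk */
} chunk_t;

static void *reduce_chunk(void *arg)
{
    chunk_t *chunk = arg;
    long acc = 0;
    for (size_t i = chunk->begin; i < chunk->end; ++i) {
        acc += chunk->values[i];   /* no lock: acc is local to this thread */
    }
    chunk->partial = acc;
    return NULL;
}

/* Sum `count` values using NUM_CHUNKS worker threads. */
long parallel_sum(const long *values, size_t count)
{
    pthread_t threads[NUM_CHUNKS];
    chunk_t chunks[NUM_CHUNKS];

    for (size_t c = 0; c < NUM_CHUNKS; ++c) {
        chunks[c].values = values;
        chunks[c].begin = count * c / NUM_CHUNKS;
        chunks[c].end = count * (c + 1) / NUM_CHUNKS;
        chunks[c].partial = 0;
        int rc = pthread_create(&threads[c], NULL, reduce_chunk, &chunks[c]);
        assert(rc == 0);
        (void)rc;
    }

    /* Combine step: serial, but it performs only NUM_CHUNKS additions. */
    long total = 0;
    for (size_t c = 0; c < NUM_CHUNKS; ++c) {
        pthread_join(threads[c], NULL);
        total += chunks[c].partial;
    }
    return total;
}
```

Calling parallel_sum on an array holding 1 through 10 returns 55. With dispatch_apply the same shape would presumably use one iteration per chunk rather than one per element, with each iteration writing into its own slot.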

I implemented a parallel divide & conquer algorithm which works with associative functions here. Unfortunately I couldn't get any discernible speedup from it, so I'm sticking with a simple serial version for now. I believe my base case needs optimising: I read somewhere that the inequality n >= p^2 should hold, where n is the number of jobs and p the number of processors.
Obviously a lot of time is being lost on array-splitting and recursing. If anybody has suggestions, they'd be much appreciated.


How can doing tasks in multiple threads be 100 times slower than doing sequentially on the main thread?

I have this other question of mine where I asked about converting code from sequential to parallel processing using Grand Central Dispatch.
I will copy the question text to make things easy...
I have an array of NSNumbers that have to pass through 20 tests. If one test fails, then the array is invalid; if all tests pass, then the array is valid. I am trying to do it in a way that stops as soon as the first failure happens, without running the remaining tests. For example, if a failure happens on the 3rd test, the other tests should not be evaluated.
Every individual test returns YES when it fails and NO when it is ok.
I am trying to convert the code I have that is serial processing, to parallel processing with grand central dispatch, but I cannot wrap my head around it.
This is what I have.
First the definition of the tests to be done. This array is used to run the tests.
    #define TESTS @[ \
        @"averageNotOK:", \
        @"numbersOverRange:", \
        @"numbersForbidden:", \
        // ... etc etc
        ]
- (BOOL) numbersPassedAllTests:(NSArray *)numbers {
    NSInteger count = [TESTS count];
    for (int i=0; i<count; i++) {
        NSString *aMethodName = TESTS[i];
        SEL selector = NSSelectorFromString(aMethodName);
        BOOL failed = NO;
        NSMethodSignature *signature = [[self class] instanceMethodSignatureForSelector:selector];
        NSInvocation *invocation = [NSInvocation invocationWithMethodSignature:signature];
        [invocation setSelector:selector];
        [invocation setTarget:self];
        [invocation setArgument:&numbers atIndex:2];
        [invocation invoke];
        [invocation getReturnValue:&failed];
        if (failed) {
            return NO;
        }
    }
    return YES;
}
This works perfectly but performs the tests sequentially.
After working on the code with the help of a user, I got this code using Grand Central Dispatch:
- (BOOL) numbersPassedAllTests:(NSArray *)numbers {
    volatile __block int32_t hasFailed = 0;
    NSInteger count = [TESTS count];
    __block NSArray *numb = [[NSArray alloc] initWithArray:numbers];
    dispatch_apply(
        count,
        dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0),
        ^(size_t index)
        {
            // do no computation if somebody else already failed
            if(hasFailed) {
                return;
            }
            SEL selector = NSSelectorFromString(TESTS[index]);
            BOOL failed = NO;
            NSMethodSignature *signature = [[self class] instanceMethodSignatureForSelector:selector];
            NSInvocation *invocation = [NSInvocation invocationWithMethodSignature:signature];
            [invocation setSelector:selector];
            [invocation setTarget:self];
            [invocation setArgument:&numb atIndex:2];
            [invocation invoke];
            [invocation getReturnValue:&failed];
            // record the failure so the remaining iterations bail out early
            if(failed) {
                OSAtomicIncrement32(&hasFailed);
            }
        });
    return !hasFailed;
}
Activity Monitor shows what appears to be the cores being used with more intensity but this code is at least 100 times slower than the older one working sequentially!
How can that be?
If your methods that you're calling are simple, the overhead of creating all of these threads could offset any advantage gained by concurrency. As the Performing Loop Iterations Concurrently section of the Concurrency Programming Guide says:
You should make sure that your task code does a reasonable amount of work through each iteration. As with any block or function you dispatch to a queue, there is overhead to scheduling that code for execution. If each iteration of your loop performs only a small amount of work, the overhead of scheduling the code may outweigh the performance benefits you might achieve from dispatching it to a queue. If you find this is true during your testing, you can use striding to increase the amount of work performed during each loop iteration. With striding, you group together multiple iterations of your original loop into a single block and reduce the iteration count proportionately. For example, if you perform 100 iterations initially but decide to use a stride of 4, you now perform 4 loop iterations from each block and your iteration count is 25. For an example of how to implement striding, see “Improving on Loop Code.”
That link to Improving on Loop Code walks through a sample implementation of striding, whereby you balance the number of threads with the amount of work done by each. It will take some experimentation to find the right balance with your methods, so play around with different striding values until you achieve the best performance.
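The index arithmetic behind striding, as described in the quoted passage, can be sketched like this. It is a plain C illustration under assumptions, not Apple's sample code: strided_block_count, run_strided_block and count_visit are hypothetical names, and the per-iteration body is passed as a function pointer where GCD code would use a block. Each strided block would be one iteration handed to dispatch_apply.

```c
#include <assert.h>
#include <stddef.h>

/* Number of strided blocks needed to cover `count` iterations. */
size_t strided_block_count(size_t count, size_t stride)
{
    assert(stride > 0);
    return (count + stride - 1) / stride;   /* round up */
}

/*
 * Body of one strided block: runs the original loop body for every
 * index in [block * stride, min(count, (block + 1) * stride)).
 * `body` stands in for one of the original per-iteration tasks.
 */
void run_strided_block(size_t block, size_t stride, size_t count,
                       void (*body)(size_t index, void *context),
                       void *context)
{
    size_t begin = block * stride;
    size_t end = begin + stride;
    if (end > count) {
        end = count;            /* the last block may be short */
    }
    for (size_t i = begin; i < end; ++i) {
        body(i, context);
    }
}

/* Example body: counts how often each index is visited. */
void count_visit(size_t index, void *context)
{
    int *visits = context;
    visits[index] += 1;
}
```

With 100 iterations and a stride of 4, strided_block_count gives 25 blocks, matching the figures in the guide. A count that is not a multiple of the stride, such as 10 with a stride of 4, yields 3 blocks, the last of which covers only two indices.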
In my experiments with a CPU-bound process, I found that I achieved a huge gain when doing two threads, but it diminished after that point. It may vary based upon what is in your methods that you're calling.
By the way, what are these methods that you're calling doing? If you're doing anything that requires the main thread (e.g. UI updates), that will also skew the results. For the sake of comparison, I'd suggest you take your serial example and dispatch that to a background queue (as a single task), and see what sort of performance you get that way. This way you can differentiate between main vs. background queue related issues, and the too-many-threads overhead issue I discuss above.
Parallel computing only makes sense if you have enough tasks for each node to do. Otherwise, the extra overhead of setting up/managing the parallel nodes takes up more time than the problem itself.
Example of bad parallelization:
void function(){
    for(int i = 0; i < 1000000; ++i){
        for(int j = 0; j < 1000000; ++j){
            ParallelAction{ //Turns the following code into a thread to be done concurrently.
                print(i + ", " + j)
            }
        }
    }
}
Problem: every print() statement has to be turned into a thread, where a worker node has to initialize, acquire the thread, finish, and find a new thread.
Essentially, you've got 1 000 000 * 1 000 000 threads waiting for a node to work on them.
How to make the above better:
void function(){
    for(int i = 0; i < 1000000; ++i){
        ParallelAction{ //Turns the following code into a thread to be done concurrently.
            for(int j = 0; j < 1000000; ++j){
                print(i + ", " + j)
            }
        }
    }
}
This way, every node can start up, do a sizeable amount of work (print 1 000 000 things), finish up, and find a new job.
The link above discusses granularity, that is, how finely you break up a problem into pieces of parallel work.

How to implement a reentrant locking mechanism in objective-c through GCD?

I have an objective-c class with some methods, which use a GCD queue to ensure that concurrent accesses to a resource take place serially (standard way to do this).
Some of these methods need to call other methods of the same class. So the locking mechanism needs to be re-entrant. Is there a standard way to do this?
At first, I had each of these methods use
    dispatch_sync(my_queue, ^{
        // Critical section
    });
to synchronize accesses. As you know, when one of these methods calls another such method, a deadlock happens: the dispatch_sync call blocks the currently executing block until the other block has run, but that block can never run because the queue is still occupied by the blocked one. To solve this, I then used e.g. this method:
- (void) executeOnQueueSync:(dispatch_queue_t)queue : (void (^)(void))theBlock {
    if (dispatch_get_current_queue() == queue) {
        // Already on the queue: run the block directly.
        theBlock();
    } else {
        dispatch_sync(queue, theBlock);
    }
}
And in each of my methods, I use
    [self executeOnQueueSync:my_queue : ^{
        // Critical section
    }];
I do not like this solution, because for every block with a different return type, I need to write another method. Moreover, this problem looks very common to me and I think there should exist a nicer, standard solution for this.
First things first: dispatch_get_current_queue() is deprecated. The canonical approach would now be to use dispatch_queue_set_specific. One such example might look like:
typedef dispatch_queue_t dispatch_recursive_queue_t;
static const void * const RecursiveKey = (const void*)&RecursiveKey;

dispatch_recursive_queue_t dispatch_queue_create_recursive_serial(const char * name)
{
    dispatch_queue_t queue = dispatch_queue_create(name, DISPATCH_QUEUE_SERIAL);
    dispatch_queue_set_specific(queue, RecursiveKey, (__bridge void *)(queue), NULL);
    return queue;
}

void dispatch_sync_recursive(dispatch_recursive_queue_t queue, dispatch_block_t block)
{
    // Already running on this queue: call the block directly.
    if (dispatch_get_specific(RecursiveKey) == (__bridge void *)(queue))
        block();
    else
        dispatch_sync(queue, block);
}
This pattern is quite usable, but it's arguably not bulletproof, because you could create nested recursive queues with dispatch_set_target_queue, and trying to enqueue work on the outer queue from inside the inner one would deadlock, even though you are already "inside the lock" (in derision quotes because it only looks like a lock, it's actually something different: a queue — hence the question, right?) for the outer one. (You could get around that by wrapping calls to dispatch_set_target_queue and maintaining your own out-of-band targeting graph, etc., but that's left as an exercise for the reader.)
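For comparison, if an actual lock is acceptable instead of a queue, re-entrancy is available directly from a recursive mutex. The sketch below is plain C with POSIX threads and is an alternative technique, not the queue-specific pattern shown above. state_lock, counter, state_lock_init, increment and increment_twice are hypothetical names chosen for illustration.

```c
#define _XOPEN_SOURCE 700   /* exposes PTHREAD_MUTEX_RECURSIVE on strict compilers */
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t state_lock;
static int counter = 0;           /* the protected state */

/* Call once before using the lock. */
void state_lock_init(void)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    int rc = pthread_mutex_init(&state_lock, &attr);
    assert(rc == 0);
    (void)rc;
    pthread_mutexattr_destroy(&attr);
}

/* A "method" with its own critical section. */
void increment(void)
{
    pthread_mutex_lock(&state_lock);
    counter += 1;
    pthread_mutex_unlock(&state_lock);
}

/* Calls increment() while already holding the lock; this does not deadlock
 * because the same thread may lock a recursive mutex more than once. */
int increment_twice(void)
{
    pthread_mutex_lock(&state_lock);
    increment();
    increment();
    int result = counter;
    pthread_mutex_unlock(&state_lock);
    return result;
}
```

The trade-off is that a mutex blocks the calling thread, whereas the queue-based pattern also allows asynchronous submission of work.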
You go on to say:
I do not like this solution, because for every block with a different return type, I need to write another method.
The general idea of this "state-protecting serial queue" pattern is that you're protecting private state; why would you "bring your own queue" to this? If it's about multiple objects sharing the state protection, then give them an inherent way to find the queue (i.e., either push it in at init time, or put it somewhere that's mutually accessible to all interested parties). It's not clear how "bringing your own queue" would be useful here.

Concurrent drawRect:

I have a large array of objects (typically 500 - 2000) that render to the screen. Unfortunately, rendering is not exactly snappy at the moment.
Each object needs to perform some calculations which take up most of the time and finally draw itself to the screen, i.e. currently my drawRect: method looks essentially like this:
(I've left out trivial optimizations like checking bounding rects vs. dirtyRect for the sake of readability)
- (void)drawRect:(NSRect)dirtyRect
{
    for (Thing *thing in [self getThings])
    {
        [thing prepareForDrawing];
        [thing draw];
    }
}
An obvious candidate for concurrent processing, right?
I couldn't come up with a good approach to decouple preparation from the actual drawing operations, i.e. perform the pre-processing in parallel and somehow queue the drawing commands until all processing is done, then render all in one go.
However, thinking of the goodness that is GCD I came up with the following scheme.
It kind of sounds OK to me but being new to GCD and before running into weird multi-threading issues four weeks after a public release or just using a bad GCD design pattern in general I thought I'd ask for feedback.
Can anybody see a problem with this approach - potential issues, or a better solution?
- (void)drawRect:(NSRect)dirtyRect
{
    [[self getThings] enumerateObjectsWithOptions:NSEnumerationConcurrent
                                       usingBlock:^(id obj, NSUInteger idx, BOOL *stop)
    {
        // prepare concurrently
        Thing *thing = (Thing*)obj;
        [thing prepareForDrawing];
        // always draw in main thread
        dispatch_async(dispatch_get_main_queue(), ^{
            [thing draw];
        });
    }];
}
That won't work because the invocations of [thing draw] will happen outside of -drawRect: after it has completed. The graphics context will no longer be valid for drawing into that view.
Why are the "things" not prepared in advance? -drawRect: is for drawing, not computation. Any necessary expensive computation should have been done in advance.
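The two-phase structure this answer recommends, with preparation done ahead of time and drawing kept serial, might look roughly like the sketch below. It is plain C with POSIX threads and not AppKit code. thing_t, THING_COUNT, WORKER_COUNT, prepare_all and draw_all are hypothetical stand-ins, with draw_all playing the role of drawRect: and the `prepared` flag standing in for the result of prepareForDrawing.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define THING_COUNT 8   /* hypothetical; the question mentions 500 to 2000 */
#define WORKER_COUNT 2  /* hypothetical worker count */

typedef struct {
    int prepared;    /* set by the (expensive) preparation step */
    int draw_order;  /* position in which the thing was drawn */
} thing_t;

static thing_t things[THING_COUNT];

/* Worker w prepares things w, w + WORKER_COUNT, w + 2 * WORKER_COUNT, ... */
static void *prepare_worker(void *arg)
{
    size_t worker = *(const size_t *)arg;
    for (size_t i = worker; i < THING_COUNT; i += WORKER_COUNT) {
        things[i].prepared = 1;   /* stands in for prepareForDrawing */
    }
    return NULL;
}

/* Phase 1: run ahead of time, for example when the model changes. */
void prepare_all(void)
{
    pthread_t threads[WORKER_COUNT];
    size_t ids[WORKER_COUNT];
    for (size_t w = 0; w < WORKER_COUNT; ++w) {
        ids[w] = w;
        int rc = pthread_create(&threads[w], NULL, prepare_worker, &ids[w]);
        assert(rc == 0);
        (void)rc;
    }
    for (size_t w = 0; w < WORKER_COUNT; ++w) {
        pthread_join(threads[w], NULL);   /* after this, every thing is prepared */
    }
}

/* Phase 2: serial drawing only. Returns the number of things drawn. */
int draw_all(void)
{
    int drawn = 0;
    for (size_t i = 0; i < THING_COUNT; ++i) {
        assert(things[i].prepared);       /* no computation left to do here */
        things[i].draw_order = drawn++;
    }
    return drawn;
}
```

The drawing phase never leaves the calling thread, so the graphics context stays valid for every draw call, and the drawing order remains deterministic.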

Is Objective-C's NSMutableArray thread-safe?

I've been trying to fix this crash for almost a week. The application crashes without any exception or stack-trace. The application does not crash in any way while running through instruments in zombie mode.
I have a method that gets called on a different thread.
The solution that fixed the crash was replacing
    [self.mutableArray removeAllObjects];
with
    dispatch_async(dispatch_get_main_queue(), ^{
        [self.searchResult removeAllObjects];
    });
I thought it might be a timing issue, so I tried to synchronize it, but it still crashed:
    @synchronized(self)
    {
        [self.searchResult removeAllObjects];
    }
Here is the code
- (void)populateItems
{
    // Cancel if already exists
    [self.searchThread cancel];
    self.searchThread = [[NSThread alloc] initWithTarget:self
                                                selector:@selector(populateItemsinBackground)
                                                  object:nil];
    [self.searchThread start];
}
- (void)populateItemsinBackground
{
    if ([[NSThread currentThread] isCancelled])
        [NSThread exit];
    [self.mutableArray removeAllObjects];
    // Populate data here into mutable array
    for (loop here)
    {
        if ([[NSThread currentThread] isCancelled])
            [NSThread exit];
        // Add items to mutableArray
    }
}
Is this problem caused by NSMutableArray not being thread-safe?
It is not thread-safe. If you need to modify your mutable array from another thread, you should use NSLock to ensure everything goes as planned:
    NSLock *arrayLock = [[NSLock alloc] init];
    [arrayLock lock]; // NSMutableArray isn't thread-safe
    [myMutableArray addObject:@"something"];
    [myMutableArray removeObjectAtIndex:5];
    [arrayLock unlock];
As others already said, NSMutableArray is not thread-safe. In case anyone wants to achieve more than removeAllObjects in a thread-safe way, here is another solution that uses GCD instead of a lock. What you have to do is synchronize the read and update (replace/remove) actions.
First create a private concurrent queue. Barrier blocks only act as barriers on a concurrent queue you create yourself; on a global concurrent queue, dispatch_barrier_async behaves like a plain dispatch_async:
    dispatch_queue_t concurrent_queue = dispatch_queue_create(NULL, DISPATCH_QUEUE_CONCURRENT);
For read:
- (id)objectAtIndex:(NSUInteger)index {
    __block id obj;
    dispatch_sync(self.concurrent_queue, ^{
        obj = [self.searchResult objectAtIndex:index];
    });
    return obj;
}
For insert:
- (void)insertObject:(id)obj atIndex:(NSUInteger)index {
    dispatch_barrier_async(self.concurrent_queue, ^{
        [self.searchResult insertObject:obj atIndex:index];
    });
}
From Apple Doc about dispatch_barrier_async:
When the barrier block reaches the front of a private concurrent queue, it is not executed immediately. Instead, the queue waits until its currently executing blocks finish executing. At that point, the barrier block executes by itself. Any blocks submitted after the barrier block are not executed until the barrier block completes.
Similar for remove:
- (void)removeObjectAtIndex:(NSUInteger)index {
    dispatch_barrier_async(self.concurrent_queue, ^{
        [self.searchResult removeObjectAtIndex:index];
    });
}
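The same "many concurrent readers, one exclusive writer" discipline can also be expressed with a reader-writer lock. The sketch below is plain C with POSIX threads, offered as an analogous technique and not as the GCD barrier code above. safe_array_t, CAPACITY and the safe_array_* functions are hypothetical, and the fixed-size int array is an assumption made to keep the example short.

```c
#define _XOPEN_SOURCE 700   /* exposes pthread_rwlock_t on strict compilers */
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define CAPACITY 16   /* hypothetical fixed capacity */

typedef struct {
    pthread_rwlock_t lock;
    int items[CAPACITY];
    size_t count;
} safe_array_t;

void safe_array_init(safe_array_t *array)
{
    int rc = pthread_rwlock_init(&array->lock, NULL);
    assert(rc == 0);
    (void)rc;
    array->count = 0;
}

/* Write: exclusive, playing the role of a barrier block.
 * Returns 0 on success, -1 if the array is full. */
int safe_array_append(safe_array_t *array, int value)
{
    int result = -1;
    pthread_rwlock_wrlock(&array->lock);
    if (array->count < CAPACITY) {
        array->items[array->count++] = value;
        result = 0;
    }
    pthread_rwlock_unlock(&array->lock);
    return result;
}

/* Read: shared, so several readers may run at once.
 * Returns 0 and stores the item in *out, or -1 if index is out of range. */
int safe_array_get(safe_array_t *array, size_t index, int *out)
{
    int result = -1;
    pthread_rwlock_rdlock(&array->lock);
    if (index < array->count) {
        *out = array->items[index];
        result = 0;
    }
    pthread_rwlock_unlock(&array->lock);
    return result;
}

/* Write: remove every item. */
void safe_array_remove_all(safe_array_t *array)
{
    pthread_rwlock_wrlock(&array->lock);
    array->count = 0;
    pthread_rwlock_unlock(&array->lock);
}
```

Note that the range check happens inside the lock. Checking the count in one call and reading in another would leave a window in which a writer could empty the array.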
EDIT: Actually, I found another, simpler way today to synchronize access to a resource: a serial queue provided by GCD.
From Apple's Concurrency Programming Guide > Dispatch Queues:
Serial queues are useful when you want your tasks to execute in a specific order. A serial queue executes only one task at a time and always pulls tasks from the head of the queue. You might use a serial queue instead of a lock to protect a shared resource or mutable data structure. Unlike a lock, a serial queue ensures that tasks are executed in a predictable order. And as long as you submit your tasks to a serial queue asynchronously, the queue can never deadlock.
Create your serial queue:
    dispatch_queue_t myQueue = dispatch_queue_create("com.example.MyQueue", NULL);
Dispatch tasks asynchronously to the serial queue:
    dispatch_async(myQueue, ^{
        // obj must be a __block variable; it is only set once this block has run
        obj = [self.searchResult objectAtIndex:index];
    });
    dispatch_async(myQueue, ^{
        [self.searchResult removeObjectAtIndex:index];
    });
Hope it helps!
As well as NSLock, you can also use @synchronized(condition-object). You just have to make sure every access to the array is wrapped in a @synchronized block with the same object acting as the condition-object. If you only want to modify the contents of the same array instance, you can use the array itself as the condition-object. Otherwise you will have to use something else that you know will not go away; the parent object, i.e. self, is a good choice because it will always be the same one for the same array.
atomic in @property attributes only makes setting the array thread-safe, not modifying its contents, i.e. self.mutableArray = ... is thread-safe but [self.mutableArray removeObject:] is not.
    __weak typeof(self) weakSelf = self;
    @synchronized (weakSelf.mutableArray) {
        [weakSelf.mutableArray removeAllObjects];
    }
Since serial queues were mentioned: With a mutable array, just asking "is it thread safe" isn't enough. For example, making sure that removeAllObjects doesn't crash is all good and fine, but if another thread tries to process the array at the same time, it will either process the array before or after all elements are removed, and you really have to think what the behaviour should be.
Creating one class + object that is responsible for this array, creating a serial queue for it, and doing all operations through the class on that serial queue is the easiest way to get things right without making your brain hurt through synchronisation problems.
None of the NSMutablexxx classes are thread-safe. Operations including get, insert, remove, add and replace should be protected with NSLock. Apple gives a list of thread-safe and thread-unsafe classes in its Thread Safety Summary.
Almost no NSMutable class is thread-safe.

NSOperation & Singleton: Correct concurrency design

I need some advice on the design of my app. Basically, I would like to know whether it will work as I expect. Since multi-threading is quite tricky, I would like to hear from you.
My task is very simple: I have SomeBigSingletonClass, a big singleton class, which has two methods, someMethodOne and someMethodTwo.
These methods should be invoked periodically (timer based) and in separate threads.
But there should be only one instance of each running at any moment, i.e. there should be only one running someMethodOne at any time, and the same for someMethodTwo.
What I've tried
GCD - I did an implementation with GCD, but it lacks a very important feature: it does not provide a means to check whether a task is currently running, i.e. I was not able to check that there is only one running instance of, let's say, the someMethodOne method.
NSThread - It does provide good functionality, but I'm pretty sure that newer high-level technologies like NSOperation and GCD will make my code simpler to maintain. So I decided to give up on NSThread.
My Solution with NSOperation
How I plan to implement the two-thread invocation
@implementation SomeBigSingletonClass
- (id)init
{
    if (!(self = [super init])) return nil;
    // queue is an iVar
    queue = [[NSOperationQueue alloc] init];
    // As I'll have maximum two running threads
    [queue setMaxConcurrentOperationCount:2];
    return self;
}
+ (SomeBigSingletonClass *)sharedInstance
{
    static SomeBigSingletonClass *sharedInstance = nil;
    static dispatch_once_t onceToken;
    dispatch_once(&onceToken, ^{
        sharedInstance = [[SomeBigSingletonClass alloc] init];
    });
    return sharedInstance;
}
- (void)someMethodOne
{
    SomeMethodOneOperation *one = [[SomeMethodOneOperation alloc] init];
    [queue addOperation:one];
}
- (void)someMethodTwo
{
    SomeMethodTwoOperation *two = [[SomeMethodTwoOperation alloc] init];
    [queue addOperation:two];
}
@end
And finally my NSOperation inherited class will look like this
@implementation SomeMethodOneOperation
- (id)init
{
    if (![super init]) return nil;
    return self;
}
- (void)main {
    // Check if the operation is not running
    if (![self isExecuting]) {
        [[SomeBigSingletonClass sharedInstance] doMethodOneStuff];
    }
}
@end
And the same for SomeMethodTwoOperation operation class.
If you are using NSOperation, you can achieve what you want by creating your own NSOperationQueue and setting its maxConcurrentOperationCount to 1.
You could also use a @synchronized scope with your class as the lock object.
EDIT: clarification---
What I am proposing:
Queue A (1 concurrent operation--used to perform SomeMethodOneOperation and SomeMethodTwoOperation one at a time)
Queue B (n concurrent operations--used for general background operation performing)
EDIT 2: Updated code illustrating an approach that runs operation one and operation two concurrently with each other, with at most one instance of each executing at any given time.
- (void)someMethodOne
{
    static NSOperationQueue * queue = nil ;
    static dispatch_once_t onceToken ;
    dispatch_once(&onceToken, ^{
        queue = [ [ NSOperationQueue alloc ] init ] ;
        queue.maxConcurrentOperationCount = 1 ;
    });
    [ queue addOperation:[ NSBlockOperation blockOperationWithBlock:^{
        ... do method one ...
    } ] ];
}
- (void)someMethodTwo
{
    static NSOperationQueue * queue = nil ;
    static dispatch_once_t onceToken ;
    dispatch_once(&onceToken, ^{
        queue = [ [ NSOperationQueue alloc ] init ] ;
        queue.maxConcurrentOperationCount = 1 ;
    });
    [ queue addOperation:[ NSBlockOperation blockOperationWithBlock:^{
        ... do method two ...
    } ] ];
}
Per our discussion:
I pointed out that isExecuting refers only to the state of the operation being queried, not to whether any instance of that class is executing.
Therefore Deimus' solution won't work to keep multiple instances of operation one from running simultaneously, for example.
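One way to get the class-wide "is it already running?" check that isExecuting does not provide is a single shared atomic flag per method. The sketch below is plain C11 and is an assumption-laden illustration, not part of either answer. method_one_running, method_one_try_begin and method_one_end are hypothetical names. A timer callback would call method_one_try_begin, skip the tick if it returns false, and call method_one_end when the work finishes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* One flag per method that must never overlap with itself. */
static atomic_bool method_one_running = false;

/* Returns true if the caller has claimed the method and may run it,
 * false if another run is still in progress. */
bool method_one_try_begin(void)
{
    bool expected = false;
    /* Atomically flips false -> true; fails if the flag was already true. */
    return atomic_compare_exchange_strong(&method_one_running, &expected, true);
}

/* Marks the run as finished so the next caller can claim the method. */
void method_one_end(void)
{
    bool was_running = atomic_exchange(&method_one_running, false);
    assert(was_running);   /* ending a run that never began is a bug */
    (void)was_running;
}
```

Unlike a serial queue, this drops overlapping invocations instead of queuing them, which may or may not be what a periodic task wants.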
Sorry, I'm late to the party. If your methods are called back based on timers, and you want them to execute concurrently with respect to one another, but synchronous with respect to themselves, might I suggest using GCD timers.
Basically, you have two timers, one which executes methodOne, and the other executes methodTwo. Since you pass blocks to the GCD timers, you don't even have to use methods, especially if you want to make sure other code does not call those methods when they are not supposed to run.
If you schedule the timers onto a concurrent queue, then both timers could possibly be running at the same time on different threads. However, the timer itself will only run when it is scheduled. Here is an example I just hacked up... you can easily use it with a singleton...
First, a helper function to create a timer that takes a block which will be called when the timer fires. The block passes the object, so it can be referenced by the block without creating a retain cycle. If we use self as the parameter name, the code in the block can look just like other code...
static dispatch_source_t setupTimer(Foo *fooIn, NSTimeInterval timeout, void (^block)(Foo * self)) {
    // Create a timer that uses the default concurrent queue.
    // Thus, we can create multiple timers that can run concurrently.
    dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    dispatch_source_t timer = dispatch_source_create(DISPATCH_SOURCE_TYPE_TIMER, 0, 0, queue);
    uint64_t timeoutNanoSeconds = timeout * NSEC_PER_SEC;
    dispatch_source_set_timer(timer,
                              dispatch_time(DISPATCH_TIME_NOW, timeoutNanoSeconds),
                              timeoutNanoSeconds,
                              0);
    // Prevent reference cycle
    __weak Foo *weakFoo = fooIn;
    dispatch_source_set_event_handler(timer, ^{
        // It is possible that the timer is running in another thread while Foo is being
        // destroyed, so make sure it is still there.
        Foo *strongFoo = weakFoo;
        if (strongFoo) block(strongFoo);
    });
    // Dispatch sources are created suspended, so start the timer.
    dispatch_resume(timer);
    return timer;
}
Now, the basic class implementation. If you don't want to expose methodOne and methodTwo, there is no reason to even create them, especially if they are simple, as you can just put that code directly in the block.
@implementation Foo {
    dispatch_source_t timer1_;
    dispatch_source_t timer2_;
}
- (void)methodOne {
}
- (void)methodTwo {
}
- (id)initWithTimeout1:(NSTimeInterval)timeout1 timeout2:(NSTimeInterval)timeout2 {
    if (self = [super init]) {
        timer1_ = setupTimer(self, timeout1, ^(Foo *self) {
            // Do "methodOne" work in this block... or call it.
            [self methodOne];
        });
        timer2_ = setupTimer(self, timeout2, ^(Foo *self) {
            // Do "methodTwo" work in this block... or call it.
            [self methodTwo];
        });
    }
    return self;
}
- (void)dealloc {
    // Cancel the timers here, e.g. with dispatch_source_cancel.
}
@end
In response to the comments (with more detail to hopefully explain why the block will not be executed concurrently, and why missed timers are coalesced into one).
You do not need to check for it being run multiple times. Straight from the documentation...
Dispatch sources are not reentrant. Any events received while the
dispatch source is suspended or while the event handler block is
currently executing are coalesced and delivered after the dispatch
source is resumed or the event handler block has returned.
That means when a GCD dispatch_source timer block is dispatched, it will not be dispatched again until the one that is already running completes. You do nothing, and the library itself will make sure the block is not executed multiple times concurrently.
If that block takes longer than the timer interval, then the "next" timer call will wait until the one that is running completes. Also, all the events that would have been delivered are coalesced into one single event.
You can call
    unsigned long numEventsFired = dispatch_source_get_data(timer);
from within your handler to get the number of events that have fired since the last time the handler was executed (e.g., if your handler ran through 4 timer firings, this would be 4, but you would still get all these firings in this one event; you would not receive separate events for them).
For example, let's say your timer interval is 1 second, and your handler happens to take 5 seconds to run. The timer will not fire again until the current block is done. Furthermore, all those missed firings will be coalesced into one, so you will get one call into your block, not 5.
Now, having said all that, I should caution you about what I think may be a bug. Now, I rarely lay bugs at the feet of library code, but this one is repeatable, and seems to go against the documentation. So, if it's not a bug, it's an undocumented feature. However, it is easy to get around.
When using timers, I have noticed that missed timer events are indeed coalesced. That means, if your timer handler is running, and 5 timer events fired while it was running, the block will be called again immediately, once, representing those 5 missed events. However, as soon as that call is done, the block will be executed one more time, no matter how many timer events were missed before.
It's easy to identify these, though, because dispatch_source_get_data(timer) will return 0, which means that no timer events have fired since the last time the block was called.
Thus, I have grown accustomed to adding this code as the first line of my timer handlers...
if (dispatch_source_get_data(timer) == 0) return;