In order to promote good programming habits and increase the efficiency of my code (Read: "My brother and I are arguing over some code"), I propose this question to experienced programmers:
Which block of code is "better"?
For those who can't be bothered to read the code: is it worth putting a conditional inside a for-loop to reduce redundant code, rather than putting it outside and writing two for-loops? Both pieces of code work; the question is efficiency vs. readability.
- (NSInteger)eliminateGroup {
    NSMutableArray *blocksToKill = [[NSMutableArray arrayWithCapacity:rowCapacity*rowCapacity] retain];
    NSInteger numOfBlocks = (NSInteger)[self countChargeOfGroup:blocksToKill];
    Block *temp;
    NSInteger chargeTotal = 0;
    // Start paying attention here
    if (numOfBlocks > 3) {
        for (NSUInteger i = 0; i < [blocksToKill count]; i++) {
            temp = (Block *)[blocksToKill objectAtIndex:i];
            chargeTotal += temp.charge;
            [temp eliminate];
            temp.beenCounted = NO;
        }
    } else {
        for (NSUInteger i = 0; i < [blocksToKill count]; i++) {
            temp = (Block *)[blocksToKill objectAtIndex:i];
            temp.beenCounted = NO;
        }
    }
    [blocksToKill release];
    return chargeTotal;
}
Or...
- (NSInteger)eliminateGroup {
    NSMutableArray *blocksToKill = [[NSMutableArray arrayWithCapacity:rowCapacity*rowCapacity] retain];
    NSInteger numOfBlocks = (NSInteger)[self countChargeOfGroup:blocksToKill];
    Block *temp;
    NSInteger chargeTotal = 0;
    // Start paying attention here
    for (NSUInteger i = 0; i < [blocksToKill count]; i++) {
        temp = (Block *)[blocksToKill objectAtIndex:i];
        if (numOfBlocks > 3) {
            chargeTotal += temp.charge;
            [temp eliminate];
        }
        temp.beenCounted = NO;
    }
    [blocksToKill release];
    return chargeTotal;
}
Keep in mind that this is for a game. The method is called anytime the user double-taps the screen and the for loop normally runs anywhere between 1 and 15 iterations, 64 at maximum. I understand that it really doesn't matter that much, this is mainly for helping me understand exactly how costly conditional statements are. (Read: I just want to know if I'm right.)
The first code block is cleaner and more efficient, because the result of the check numOfBlocks > 3 is the same for every iteration.
The second code block avoids code duplication and might therefore pose less risk. However, it is conceptually more complicated.
The second block can be improved by adding
BOOL increaseChargeTotal = (numOfBlocks > 3);
before the loop and then using this boolean variable instead of the actual check inside the loop, emphasizing the fact that during the iteration it doesn't change.
Personally, in this case I would vote for the first option (duplicated loops) because the loop bodies are small and this shows clearly that the condition is external to the loop; also, it's more efficient and might fit the pattern "make the common case fast".
There is no way to answer this without defining your requirements for "better". Is it runtime efficiency? compiled size? code readability? code maintainability? code portability? code reusability? algorithmic provability? developer efficiency? (Please leave comments on any popular measurements I've missed.)
Sometimes absolute runtime efficiency is all that matters, but not as often as people generally imagine, as you give a nod towards in your question—but this is at least easy to test! Often it's a mix of all these concerns, and you'll have to make a subjective judgement in the end.
Every answer here is applying a personal mix of these aspects, and people often get into vigorous Holy Wars because everyone's right—in the right circumstance. These approaches are ultimately wrong. The only correct approach is to define what matters to you, and then measure against it.
All other things being equal, having two separate loops will generally be faster, because you do the test once instead of every iteration of the loop. The branch inside the loop each iteration will often slow you down significantly due to pipeline stalls and branch mispredictions; however, since the branch always goes the same way, the CPU will almost certainly predict the branch correctly for every iteration except for the first few, assuming you're using a CPU with branch prediction (I'm not sure if the ARM chip used in the iPhone has a branch predictor unit).
However, another thing to consider is code size: the two loops approach generates a lot more code, especially if the rest of the body of the loop is large. Not only does this increase the size of your program's object code, but it also hurts your instruction cache performance -- you'll get a lot more cache misses.
All things considered, unless the code is a significant bottleneck in your application, I would go with the branch inside the loop, as it leads to clearer code and doesn't violate the don't-repeat-yourself principle. If you make a change to one of the loops and forget to change the other in the two-loops version, you're in for a world of hurt.
I would go with the second option. If all of the logic in the loop was completely different, then it would make sense to make 2 for loops, but the case is that some of the logic is the same, and some is additional based upon the conditional. So the second option is cleaner.
The first option would be faster, but marginally so, and I would only use it if I found there to be a bottleneck there.
You would probably waste more time in the pointless and unnecessary [blocksToKill retain]/[blocksToKill release] at the start/end of the method than in executing a few dozen comparisons. There is no need to retain the array, since you won't need it after you return and it will never be cleaned up before then.
IMHO, code duplication is a leading cause of bugs which should be avoided whenever possible.
Adding Jens' recommendation to use fast enumeration and Antti's recommendation to use a clearly named boolean, you'd get something like:
- (NSInteger)eliminateGroup {
    NSMutableArray *blocksToKill = [NSMutableArray arrayWithCapacity:rowCapacity*rowCapacity];
    NSInteger numOfBlocks = (NSInteger)[self countChargeOfGroup:blocksToKill];
    NSInteger chargeTotal = 0;
    BOOL calculateAndEliminateBlocks = (numOfBlocks > 3);
    for (Block *block in blocksToKill) {
        if (calculateAndEliminateBlocks) {
            chargeTotal += block.charge;
            [block eliminate];
        }
        block.beenCounted = NO;
    }
    return chargeTotal;
}
If you finish your project and if your program is not running fast enough (two big ifs), then you can profile it, find the hotspots, and determine whether the few microseconds spent contemplating that branch are worth thinking about. It certainly is not worth thinking about at all now, which means that the only consideration is which version is more readable and maintainable.
My vote is strongly in favor of the second block.
The second block makes clear what the difference in logic is, and shares the same looping structure. It is both more readable and more maintainable.
The first block is an example of premature optimization.
As for using a bool to "save" all those comparisons--in this case, I don't think it will help; the machine code will likely require exactly the same number and size of instructions.
The overhead of the "if" test is a handful of CPU instructions; way less than a microsecond. Unless you think the loop is going to run hundreds of thousands of times in response to user input, that's just lost in the noise. So I would go with the second solution because the code is both smaller and easier to understand.
In either case, though, I would change the loop to be
for (temp in blocksToKill) { ... }
This is both cleaner-reading and considerably faster than manually getting each element of the array.
Readability (and thus maintainability) can and should be sacrificed in the name of performance, but when, and only when, it's been determined that performance is an issue.
The second block is more readable, and unless/until speed is an issue, it is better (in my opinion). During testing for your app, if you find out this loop is responsible for unacceptable performance, then by all means, seek to make it faster, even if it becomes harder to maintain. But don't do it until you have to.
Related
When I pass a string the Apple-style way to a function and test it a billion times, it takes ~42.001 seconds:
- (void)test:(NSString *)str {
    NSString *test = str;
    if (test) {
        return;
    }
}
NSString *value = @"Value 1";
NSLog(@"START");
for (int i = 0; i < 1e9; i++) {
    [self test:value];
}
NSLog(@"END");
But passing a pointer to the pointer instead (assuming my test function treats it as read-only), like so:
- (void)test:(NSString **)str {
    NSString *test = *str;
    if (test) {
        return;
    }
}
NSLog(@"START");
for (int i = 0; i < 1e9; i++) {
    [self test:&value];
}
NSLog(@"END");
...only takes ~26.804 seconds.
Why does Apple promote the first example as normal practice, while the latter seems to perform so differently?
I read about the toll-free bridging that Foundation applies, but if the difference is really that big, what's the added value? If a whole application could run more than 100% faster just by changing some major function arguments like this, isn't that a considerable flaw in the way Apple instructs us to build apps in Objective-C?
You wouldn't use the NSString ** syntax, because it suggests that the method you're calling can change what value points to. You should never write that unless it's really what is taking place.
The simple NSString * example may be taking longer because, in the absence of any optimization, it is probably adding and removing a strong reference to value each time the method is called and returns.
If you turn on optimization, the behavior changes. For example, when I used the -Os "Fastest, Smallest" build setting, the NSString * rendition was actually faster than the NSString ** one. And even if the performance were worse, I wouldn't write code that exposed me to all sorts of problems down the line just because it was 0.0000152 seconds faster per call. I'd find other ways to optimize the code.
To quote Donald Knuth:
Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. [Emphasis added]
The goal is always to write code whose functional intent is clear, whose type handling is safest and then, where possible, use the compiler's own internal optimization capabilities to tackle the performance issues. Only sacrifice the code readability and ease of maintenance and debugging when it's absolutely essential.
I've been using C-style enumeration in Objective-C when I need to know which index an object is at. However, Apple recommends a different style.
So, they recommend that:
for (int i = 0; i < array.count; i++)
    NSLog(@"Object at index %i is: %@", i, array[i]);
be changed to
int index = 0;
for (id eachObject in array) {
    NSLog(@"Object at index %i is: %@", index, eachObject);
    index++;
}
Is there a good reason for this other than stylistic preference?
And a follow up: How would one enumerate the letters in a String using the second kind of enumeration?
A while back someone analyzed the different enumeration options for an NSArray. The results are very interesting. It's worth noting that the tests were done on iOS4 and OSX 10.6 though.
http://darkdust.net/writings/objective-c/nsarray-enumeration-performance
In general, he shows that when dealing with something other than very small arrays, fast enumeration is better performing than block enumeration, and both are better performing than basic enumeration (array[i]).
It would be great to see these tests on iOS7!
Fast enumeration, the second option, can be better optimized by the compiler. So besides resulting in cleaner looking code that doesn't have to track the index, you also end up with faster code. So any time the second option would work, it likely should be used.
Now if you wanted to iterate over the characters in an NSString, you couldn't do that directly with fast enumeration. The only way would be to first put the characters into an NSArray, most likely using a standard for loop to iterate over the characters and add them manually. But this defeats the purpose of using fast enumeration in the first place, since you have to do a standard for loop anyway. You would only do this if you wanted to fast-enumerate the same string of text many, many times, and could build the array just once.
The advantage is simply that it's faster to write and easier to read. It's especially useful when you don't need the index (i for example) other than for accessing the object.
I've been looking for a while for a similar question, but without any success. I don't know how to optimize some code in Cocoa to use all available CPU cores (I don't want to use the GPU at the moment). Below is a simple sample of the kind of code I mean:
int limA = 1000;
int limB = 1000;
unsigned short tmp;
for (int i = 0; i < 10000; i++) {
    for (int a = 0; a < limA; a++) {
        for (int b = 0; b < limB; b++) {
            tmp = [[array objectAtIndex:(a*b)] unsignedShortValue];
            c_array[a*limB+b] += tmp;
        }
    }
}
Assume that array and c_array are properly initialized, etc. But as you can see, with this many iterations (10^10 in this case) the code takes some time to execute. I thought that it might be possible to execute this code on a few threads, but how would I synchronize c_array? What is the best way to improve the execution time of this kind of code in Objective-C? Maybe iterations 0-2499 of the outermost for loop could be executed in thread 1, 2500-4999 in thread 2, and so on? I know that this is a crude way, but I don't need "real time" performance... any ideas?
A few suggestions:
Do an initial pass over the array to extract all the shorts from their object wrappers:
short *tmp_array = calloc(limA * limB, sizeof(short));
int tmp_idx = 0;
for (NSNumber *num in array) {
    tmp_array[tmp_idx++] = [num unsignedShortValue];
}
This has several benefits. You go from 10^10 method calls to 10^6, your inner loop stops being opaque to the compiler (it can't "see through" method calls), and your inner loop gets smaller and more likely to fit in the instruction cache.
Try to linearize access patterns. Right now you're doing 'strided' access, since the index is being multiplied each time. If you can rearrange the data in tmp_array so that elements that are processed sequentially are also sequential in the array, you should get much better performance (since each access to the array is loading a full cache line, which is 64 bytes on most processors).
Getting a benefit out of parallelism is likely to be tricky. You could try replacing the outer loop with:
dispatch_apply(10000, dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^(size_t i) {
    // body of the outer loop goes here
});
and the += in the inner loop with OSAtomicAdd, but my suspicion is that your speed is going to be dominated by memory accesses anyway, and adding more processors to the mix will just lead to them stepping on each other's toes (i.e. processor 0 loads c_array[1500] so that it knows what to add tmp to, which actually loads the cache line covering [1500-1531], then processor 1 writes to c_array[1512], invalidating that entire cache line and forcing it to be re-read). Also, I'm pretty sure you'd need to store 32 bit values in c_array to do that, since you'd be using OSAtomicAdd32 (there's no OSAtomicAdd16).
At the very least, if you're going to parallelize, then you need to figure out how to divide the work into non-overlapping chunks of 32 elements of c_array (i.e. 64 bytes), so that you can avoid contention. Dividing up the ranges of the array should also let you avoid needing to use atomic add operations.
(edit)
Check out an0's answer for some practical suggestions for parallelizing this, rather than this discussion of why the naive parallelization won't work :)
First, follow @Catfish_Man's suggestion, except for the parallelism part.
For the parallelism, here are my ideas:
The outermost loop is meaningless. Just use 10000 * tmp instead of tmp.
Since the segments of the target array written to are strictly disjoint for different a values, the second level of the loop can be easily parallelized. In fact, the same applies to b, but if we also parallelized over b, the unit of work left in the body would be too small for splitting the workload to be useful.
Code:
int limA = 1000;
int limB = 1000;
short *tmp_array = calloc(limA * limB, sizeof(short));
int tmp_idx = 0;
for (NSNumber *num in array) {
    tmp_array[tmp_idx++] = [num unsignedShortValue];
}
dispatch_apply(limA, dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^(size_t a) {
    for (int b = 0; b < limB; b++) {
        c_array[a*limB+b] += 10000 * tmp_array[a*b];
    }
});
free(tmp_array);
First, follow @Catfish_Man's suggestions. Then follow @an0's suggestions. Then do this as well:
// ...
// (requires <objc/runtime.h>)
short *tmp_array = calloc(limA * limB, sizeof(short));
unsigned short (*unsignedShortValueIMP)(id, SEL) =
    (unsigned short (*)(id, SEL))class_getMethodImplementation([NSNumber class], @selector(unsignedShortValue));
void *(*objectAtIndexIMP)(id, SEL, NSUInteger) =
    (void *(*)(id, SEL, NSUInteger))class_getMethodImplementation(array.class, @selector(objectAtIndex:));
NSUInteger n = array.count;
for (NSUInteger i = 0; i < n; ++i) {
    void *obj = objectAtIndexIMP(array, @selector(objectAtIndex:), i);
    tmp_array[i] = unsignedShortValueIMP((__bridge id)obj, @selector(unsignedShortValue));
}
// ...
By lifting the IMPs out of Objective-C, you bypass all the overhead of the message dispatch machinery and allow the compiler to "see through" the calls; while these selectors are part of Foundation and can't be inlined, removing the extra levels of indirection improves the holy heck out of the branch prediction and prefetching machinery in the CPU cores. In addition, by using a raw C for loop instead of Objective-C's array enumeration, AND not forcing the opacity of objc_msgSend() on the compiler, you allow Clang's loop unwinding and vectorization optimizers to work.
@Catfish_Man may be able to tell me this is an outmoded optimization no longer worth doing, but as far as I'm aware, it's still a win for massive repetitions of calling the same methods like this.
Final note: My code assumes ARC, so uses void * and a bridge cast instead of id on the objectAtIndex: IMP to bypass the extra implicit retain/release pair. This is evil shadow hackery, disabling ARC for the file in question is a better solution, and I should be ashamed of myself.
In this simple test, after making sure that the index is valid, is it worth assigning a variable instead of calling the objectAtIndex: method twice?
NSString *s = [myArray objectAtIndex:2];
if (s) {
    Test *t = [Test initFromString:s];
}
instead of
if ([myArray objectAtIndex:2]) {
    Test *t = [Test initFromString:[myArray objectAtIndex:2]];
}
From the performance point of view it’s not worth it, unless the code lies on a really hot path (and you would know that). Sending a message is practically free and looking up an object on a given index is also too fast to care in most situations.
The change makes the code more readable, though. First, you can name the thing that you pull from the container (like testName). Second, when reading the two repeated calls to objectAtIndex: you have to check that they really are the same call. After you introduce the separate variable it's obvious; there's less cognitive load.
Is there a difference between the following two code blocks in terms of the resulting machine code when using the llvm or gcc compilers?
When is this optimization actually worthwhile, if ever?
Not optimized:
for (int i = 0; i < array.count; i++) {
    //do some work
}
Optimized:
int count = array.count;
for (int i = 0; i < count; i++) {
    //do some work
}
EDIT: I should point out that array is immutable and array.count doesn't change during the loop's execution.
You really need to check it yourself. My guess is that there is a difference in the emitted code, but it might depend on compiler and compiler options, and it certainly can depend on the definition of array.
Nearly never, on the assumption that evaluating array.count is nearly always insignificant compared with "some work". The way to measure it, though, is to use a profiler (or equivalent) and observe what proportion of your program's runtime is spent at that line of code. Provided the profiler is accurate, that's the most you could hope to gain by changing it.
Suppose array.count is something really slow that you happen to know will always return the same result, but the compiler doesn't know that. Then it might be worth manually hoisting it. strlen is the classic example. It's debatable how often strlen is actually slow in practice, but it's easy to manufacture examples likely to run slower than they need to:
char some_function(char a) {
    return (a * 2 + 1) & 0x3F;
}

for (int i = 0; i < strlen(ptr); ++i) {
    ptr[i] = some_function(ptr[i]); // strlen may be re-evaluated on every iteration
}
You and I know that some_function never returns 0, and hence the length of the string never changes. The compiler might not see the definition of some_function, and even if it does, it might not realize that its never returning zero is what keeps the length constant.
Steve Jessop's answer is a good one. I just want to add:
Personally, I always use optimized version. It's just in my set of good practices to remove every constant component out of the loop. It's not much work and it makes code cleaner. It's not "premature optimization" and it does not introduce any problems or tradeoffs. It makes debugging easier (stepping). And it could potentially make the code faster. So it's a no-brainer to me.