Is it possible to compare two Objective-C blocks by content? - objective-c

float pi = 3.14;
float (^piSquare)(void) = ^(void){ return pi * pi; };
float (^piSquare2)(void) = ^(void){ return pi * pi; };
[piSquare isEqualTo: piSquare2]; // -> want it to behave like -isEqualToString...

To expand on Laurent's answer.
A Block is a combination of implementation and data. For two blocks to be equal, they would need to have both the exact same implementation and have captured the exact same data. Comparison, thus, requires comparing both the implementation and the data.
One might think comparing the implementation would be easy. It actually isn't because of the way the compiler's optimizer works.
While comparing simple data is fairly straightforward, blocks can capture objects-- including C++ objects (which might actually work someday)-- and comparison may or may not need to take that into account. A naive implementation would simply do a byte level comparison of the captured contents. However, one might also desire to test equality of objects using the object level comparators.
Then there is the issue of __block variables. A block, itself, doesn't actually have any metadata related to __block captured variables as it doesn't need it to fulfill the requirements of said variables. Thus, comparison couldn't compare __block values without significantly changing compiler codegen.
All of this is to say that, no, it isn't currently possible to compare blocks and to outline some of the reasons why. If you feel that this would be useful, file a bug via http://bugreport.apple.com/ and provide a use case.

Putting aside issues of compiler implementation and language design, what you're asking for is provably undecidable (unless you only care about detecting 100% identical programs). Deciding if two programs compute the same function is equivalent to solving the halting problem. This is a classic consequence of Rice's Theorem: Any "interesting" property of Turing machines is undecidable, where "interesting" just means that it's true for some machines and false for others.
Just for fun, here's the proof. Assume we can create a function to decide if two blocks are equivalent, called EQ(b1, b2). Now we'll use that function to solve the halting problem. We create a new function HALT(M, I) that tells us if Turing machine M will halt on input I like so:
BOOL HALT(M,I) {
return EQ(
^(int) {return 0;},
^(int) {M(I); return 0;}
);
}
If M(I) halts then the blocks are equivalent, so HALT(M,I) returns YES. If M(I) doesn't halt then the blocks are not equivalent, so HALT(M,I) returns NO. Note that we don't have to execute the blocks -- our hypothetical EQ function can compute their equivalence just by looking at them.
We have now solved the halting problem, which we know is not possible. Therefore, EQ cannot exist.

I don't think this is possible. Blocks can be roughly seen as advanced functions (with access to global or local variables). The same way you cannot compare functions' content, you cannot compare blocks' content.
All you can do is to compare their low-level implementation, but I doubt that the compiler will guarantee that two blocks with the same content share their implementation.

Related

Is it possible to get the native CPU size of an integer in Rust?

For fun, I'm writing a bignum library in Rust. My goal (as with most bignum libraries) is to make it as efficient as I can. I'd like it to be efficient even on unusual architectures.
It seems intuitive to me that a CPU will perform arithmetic faster on integers with the native number of bits for the architecture (i.e., u64 for 64-bit machines, u16 for 16-bit machines, etc.) As such, since I want to create a library that is efficient on all architectures, I need to take the target architecture's native integer size into account. The obvious way to do this would be to use the cfg attribute target_pointer_width. For instance, to define the smallest type which will always be able to hold more than the maximum native int size:
#[cfg(target_pointer_width = "16")]
type LargeInt = u32;
#[cfg(target_pointer_width = "32")]
type LargeInt = u64;
#[cfg(target_pointer_width = "64")]
type LargeInt = u128;
However, while looking into this, I came across this comment. It gives an example of an architecture where the native int size is different from the pointer width. Thus, my solution will not work for all architectures. Another potential solution would be to write a build script which codegens a small module which defines LargeInt based on the size of a usize (which we can acquire like so: std::mem::size_of::<usize>().) However, this has the same problem as above, since usize is based on the pointer width as well. A final obvious solution is to simply keep a map of native int sizes for each architecture. However, this solution is inelegant and doesn't scale well, so I'd like to avoid it.
So, my questions: is there a way to find the target's native int size, preferably before compilation, in order to reduce runtime overhead? Is this effort even worth it? That is, is there likely to be a significant difference between using the native int size as opposed to the pointer width?
It's generally hard (or impossible) to get compilers to emit optimal code for BigNum stuff, that's why https://gmplib.org/ has its low level primitive functions (mpn_... docs) hand-written in assembly for various target architectures with tuning for different micro-architecture, e.g. https://gmplib.org/repo/gmp/file/tip/mpn/x86_64/core2/mul_basecase.asm for the general case of multi-limb * multi-limb numbers. And https://gmplib.org/repo/gmp/file/tip/mpn/x86_64/coreisbr/aors_n.asm for mpn_add_n and mpn_sub_n (Add OR Sub = aors), tuned for SandyBridge-family which doesn't have partial-flag stalls so it can loop with dec/jnz.
Understanding what kind of asm is optimal may be helpful when writing code in a higher level language. Although in practice you can't even get close to that so it sometimes makes sense to use a different technique, like only using values up to 2^30 in 32-bit integers (like CPython does internally, getting the carry-out via a right shift, see the section about Python in this). In Rust you do have access to add_overflow to get the carry-out, but using it is still hard.
For practical use, writing Rust bindings for GMP is probably your best bet, unless that already exists.
Using the largest chunks possible is very good; on all current CPUs, add reg64, reg64 has the same throughput and latency as add reg32, reg32 or reg8. So you get twice as much work done per unit. And carry propagation through 64 bits of result in 1 cycle of latency.
(There are alternate ways to store BigInteger data that can make SIMD useful; #Mysticial explains in Can long integer routines benefit from SSE?. e.g. 30 value bits per 32-bit int, allowing you to defer normalization until after a few addition steps. But every use of such numbers has to be aware of these issues so it's not an easy drop-in replacement.)
In Rust, you probably want to just use u64 regardless of the target, unless you really care about small-number (single-limb) performance on 32-bit targets. Let the compiler build u64 operations for you out of add / adc (add with carry).
The only thing that might need to be ISA-specific is if u128 is not available on some targets. You want to use 64 * 64 => 128-bit full multiply as your building block for multiplication; if the compiler can do that for you with u128 then that's great, especially if it inlines efficiently.
See also discussion in comments under the question.
One stumbling block for getting compilers to emit efficient BigInt addition loops (even inside the body of one unrolled loop) is writing an add that takes a carry input and produces a carry output. Note that x += 0xff..ff + carry=1 needs to produce a carry out even though 0xff..ff + 1 wraps to zero. So in C or Rust, x += y + carry has to check for carry out in both the y+carry and the x+= parts.
It's really hard (probably impossible) to convince compiler back-ends like LLVM to emit a chain of adc instructions. An add/adc is doable when you don't need the carry-out from adc. Or probably if the compiler is doing it for you for u128.overflowing_add
Often compilers will turn the carry flag into a 0 / 1 in a register instead of using adc. You can hopefully avoid that for at least pairs of u64 in addition by combining the input u64 values to u128 for u128.overflowing_add. That will hopefully not cost any asm instructions because a u128 already has to be stored across two separate 64-bit registers, just like two separate u64 values.
So combining up to u128 could just be a local optimization for a function that adds arrays of u64 elements, to get the compiler to suck less.
In my library ibig what I do is:
Select architecture-specific size based on target_arch.
If I don't have a value for an architecture, select 16, 32 or 64 based on target_pointer_width.
If target_pointer_width is not one of these values, use 64.

How to convert Greensock's CustomEase functions to be usable in CreateJS's Tween system?

I'm currently working on a project that does not include GSAP (Greensock's JS Tweening library), but since it's super easy to create your own Custom Easing functions with it's visual editor - I was wondering if there is a way to break down the desired ease-function so that it can be reused in a CreateJS Tween?
Example:
var myEase = CustomEase.create("myCustomEase", [
{s:0,cp:0.413,e:0.672},{s:0.672,cp:0.931,e:1.036},
{s:1.036,cp:1.141,e:1.036},{s:1.036,cp:0.931,e:0.984},
{s:0.984,cp:1.03699,e:1.004},{s:1.004,cp:0.971,e:0.988},
{s:0.988,cp:1.00499,e:1}
]);
So that it turns it into something like:
var myEase = function(t, b, c, d) {
//Some magic algorithm performed on the 7 bezier/control points above...
}
(Here is what the graph would look like for this particular easing method.)
I took the time to port and optimize the original GSAP-based CustomEase class... but due to license restrictions / legal matters (basically a grizzly bear that I do not want to poke with a stick...), posting the ported code would violate it.
However, it's fair for my own use. Therefore, I believe it's only fair that I guide you and point you to the resources that made it possible.
The original code (not directly compatible with CreateJS) can be found here:
https://github.com/art0rz/gsap-customease/blob/master/CustomEase.js (looks like the author was also asked to take down the repo on github - sorry if the rest of this post makes no sense at all!)
Note that CreateJS's easing methods only takes a "time ratio" value (not time, start, end, duration like GSAP's easing method does). That time ratio is really all you need, given it goes from 0.0 (your start value) to 1.0 (your end value).
With a little bit of effort, you can discard those parameters from the ease() method and trim down the final returned expression.
Optimizations:
I took a few extra steps to optimize the above code.
1) In the constructor, you can store the segments.length value directly as this.length in a property of the CustomEase instance to cut down a bit on the amount of accessors / property lookups in the ease() method (where qty is set).
2) There's a few redundant calculations done per Segments that can be eliminated in the ease() method. For instance, the s.cp - s.s and s.e - s.s operations can be precalculated and stored in a couple of properties in each Segments (in its constructor).
3) Finally, I'm not sure why it was designed this way, but you can unwrap the function() {...}(); that are returning the constructors for each classes. Perhaps it was used to trap the scope of some variables, but I don't see why it couldn't have wrapped the entire thing instead of encapsulating each one separately.
Need more info? Leave a comment!

Concurrent access to a single FFTSetup data structure in GCD

Is it okay to create a single FFTSetup data structure and use it to perform multiple FFT computations concurrently? Would something like the following work?
FFTSetup fftSetup = vDSP_create_fftsetup(
16, // vDSP_Length __vDSP_log2n,
kFFTRadix2 // FFTRadix __vDSP_radix
);
NSAssert(fftSetup != NULL, #"vDSP_create_fftsetup() failed to allocate storage");
for (int i = 0; i < 100; i++)
{
dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0),
^{
vDSP_fft_zrip(
fftSetup, // FFTSetup __vDSP_setup,
&(splitComplex[i]), // DSPSplitComplex *__vDSP_ioData,
1, // vDSP_Stride __vDSP_stride,
16, // vDSP_Length __vDSP_log2n,
kFFTDirection_Forward // FFTDirection __vDSP_direction
);
});
}
I suppose that the answer depends on the following considerations:
1) Does vDSP_fft_zrip() only access the data within fftSetup (or the data pointed to by it) in a "read-only" fashion? Or are there perhaps some temporary buffers (scratch space) within fftSetup that is written to by vDSP_fft_zrip() in performing its FFT computations?
2) If data like that in fftSetup is being accessed in a "read-only" fashion, is it okay for multiple processes/threads/tasks/blocks to access it simultaneously? (I am thinking of the case where it is possible for more than one process to open the same file for reading, though not necessarily for writing or appending. Is this analogy appropriate?)
On a related note, just how much memory is being taken up by the FFTSetup data structure? Is there any way to find out? (It is an opaque data type.)
You may create one FFT setup and use it repeatedly and concurrently. This is the intended use. (I am the author of the current implementations of vDSP_fft_zrip and other FFT implementations in vDSP.)
In Using Fourier Transforms we're told that the FFTSetup contains the FFT weight array which is a series of complex exponentials. The vDSP_create_fftsetup documentation says
Once prepared, the setup structure can be used repeatedly by FFT
functions (which read the data in the structure and do not alter it)
for any (power of two) length up to that specified when you created
the structure.
so
conceptually, vDSP_fft_zrip should not need to modify the weight array and so it would appear to be one of the FFT functions that do not alter the FFTSetup (I haven't seen any that do apart from create/destroy), however there are no guarantees on what the actual implementation does - it could do anything.
if vDSP_fft_zrip truly accesses its FFTSetup in a read-only fashion, then it's fine to do that from multiple threads.
As for memory usage, the FFT weight array is e^{i*k*2*M_PI/N} for k = [0..N-1], which are N complex float values, so that would 2*N*sizeof(float).
But those complex exponentials are very symmetric so who knows, under the hood the implementation could require less memory. Or more!
In your case, N = 2^16, so it wouldn't be strange to see up to 256k being used.
Where does that leave you? I think it seems reasonable that the FFTSetup be accessible from multiple threads, but it appears to be undocumented. You could be lucky. Or unlucky and unpleasantly surprised now or in a future version of the framework.
So... do you feel lucky?
I wouldn't attempt any explicit concurrency with the vDSP functions, or any other function in the Accelerate framework (of which vDSP is a part) for that matter. Why? Because Accelerate is already designed to take advantage of multiple cores, as well as specific nuances of a given processor implementation, on your behalf - see http://developer.apple.com/library/mac/#DOCUMENTATION/Darwin/Reference/ManPages/man7/vecLib.7.html. You may end up essentially re-parallelizing already parallel computations that are internal to the implementation (if not now, then possibly in a later version). The best approach to the Accelerate framework is generally to assume that it's more clever than you are and just use it in the simplest way possible, then do your performance measurements. If those measurements reflect a level of performance that is somehow insufficient for your needs, then try your own optimizations (and/or file a bug report against the Accelerate framework at http://bugreport.apple.com since the authors of that framework are always interested in knowing where or if their efforts somehow fell short of developer requirements).

Techniques for controlling program order of execution

I'm wrestling with the concept of code "order of execution" and so far my research has come up short. I'm not sure if I'm phrasing it incorrectly, it's possible there is a more appropriate term for the concept. I'd appreciate it if someone could shed some light on my various stumbling blocks below.
I understand that if you call one method after another:
[self generateGrid1];
[self generateGrid2];
Both methods are run, but generateGrid1 doesn't necessarily wait for generateGrid2. But what if I need it to? Say generateGrid1 does some complex calculations (that take an unknown amount of time) and populate an array that generateGrid2 uses for it's calculations? This needs to be done every time an event is fired, it's not just a one time initialization.
I need a way to call methods sequentially, but have some methods wait for others. I've looked into call backs, but the concept is always married to delegates in all the examples I've seen.
I'm also not sure when to make the determinate that I can't reasonably expect a line of code to be parsed in time for it to be used. For example:
int myVar = [self complexFloatCalculation];
if (myVar <= 10.0f) {} else {}
How do I determine if something will take long enough to implement checks for "Is this other thing done before I start my thing". Just trial and error?
Or maybe I'm passing a method as parameter of another method? Does it wait for the arguments to be evaluated before executing the method?
[self getNameForValue:[self getIntValue]];
I understand that if you call one method after another:
[self generateGrid1];
[self generateGrid2];
Both methods are run, but generateGrid1 doesn't necessarily wait for generateGrid2. But what if I need it to?
False. generateGrid1 will run, and then generateGrid2 will run. This sequential execution is the very basis of procedural languages.
Technically, the compiler is allowed to rearrange statements, but only if the end result would be provably indistinguishable from the original. For example, look at the following code:
int x = 3;
int y = 4;
x = x + 6;
y = y - 1;
int z = x + y;
printf("z is %d", z);
It really doesn't matter whether the x+6 or the y-1 line happens first; the code as written does not make use of either of the intermediate values other than to calculate z, and that can happen in either order. So if the compiler can for some reason generate more efficient code by rearranging those lines, it is allowed to do so.
You'd never be able to see the effects of such rearranging, though, because as soon as you try to use one of those intermediate values (say, to log it), the compiler will recognize that the value is being used, and get rid of the optimization that would break your logging.
So really, the compiler is not required to execute your code in the order provided; it is only required to generate code that is functionally identical to the code you provided. This means that you actually can see the effects of these kinds of optimizations if you attach a debugger to a program that was compiled with optimizations in place. This leads to all sorts of confusing things, because the source code the debugger is tracking does not necessarily match up line-for-line with the code the compiled code the compiler generated. This is why optimizations are almost always turned off for debug builds of a program.
Anyway, the point is that the compiler can only do these sorts of tricks when it can prove that there will be no effect. Objective-c method calls are dynamically bound, meaning that the compiler has absolutely no guarantee about what will actually happen at runtime when that method is called. Since the compiler can't make any guarantees about what will happen, the compiler will never reorder Objective-C method calls. But again, this just falls back to the same principle I stated earlier: the compiler may change order of execution, but only if it is completely imperceptible to the user.
In other words, don't worry about it. Your code will always run top-to-bottom, each statement waiting for the one before it to complete.
In general, most method calls that you see in the style you described are synchronous, that means they'll have the effect you desire, running in the order the statements were coded, where the second call will only run after the first call finishes and returns.
Also, when a method takes parameters, its parameters are evaluated before the method is called.

Does the order of expression to check in boolean statement affect performance

If I have a Boolean expression to check
(A && B)
If A is found to be false will the language bother to check B? Does this vary from language to language?
The reason I ask is that I'm wondering if it's the case that B is checked even if A is false then wouldn't
if (A) {
if(B) {
} else {
// code x
}
} else {
// code x
}
be marginally quicker than
if (A && B) {
} else {
// code x
}
This depends on the language. Most languages will implement A && B as a short-circuit operator, meaning that if A evaluates to false, B will never be evaluated. There's a detailed list on Wikipedia.
Almost every language implements something called short-circuit evaluation, which means that yes, (A && B) will not evaluate B if A is false. This also takes effect if you write:
if (A || B) {
...
}
and A is true. This is worth remembering if B may take a long time to load, but generally it's not something to worry about.
As a bit of history, in my mind this is a bit of a sore part of LISP because code like this:
(if (and (= x 5) (my-expensive-query y)) "Yes" "No")
is not made of functions, but rather so-called "special forms" (that is, "and" could not be defun'd here).
This would depend 100% on how the language compiles said code. Anything is possible :)
Is there a specific language you're wondering about?
In short, no. A double branch involves various forms of branch prediction. If A and B are simple to evaluate, it may be faster to do if (A && B) in a non-short-circuit way than if (A) if (B). In addition, you've duplicated code in the second form. This is virtually always (exception to every rule .. I guess) bad and far worse than any gain.
Secondly, this is the kind of micro-optimization that you give to the language interpreter, JIT or compiler.
Many languages (including almost all curly-brace languages, like C/C++/Java/C#) offer short-circuit boolean evaluation. In those languages, if A is false then B won't be evaluated. You'll need to see (or ask) whether this is true for your specific language, or whether there's a way to do it (VB has AndAlso, for example).
If you find your language doesn't support it, you'll also need to consider whether the cost of evaluating B is worth having to maintain two identical pieces of code -- and the potential doubling in cache footprint (not to mention all the extra branching) that'd come from doing that duplication every time.
As others have said, it depends on the language and/or compiler. For me, I don't care how fast or slow it might be to short-circuit or not, the need to duplicate code is a deal-killer.
If A and B are actually calls that have side-effects (i.e. they do more than simply return a value suitable for comparison), then I would argue that those calls should be made into variable assignments that are then used in your comparison. It doesn't matter whether or not you always require those side-effects or only require them conditionally, the code will be more readable if you don't depend on whether or not short-circuit exists.
That last bit about readability is based on my feeling that reducing the need to refer to external documentation improves readability. Reading a book with a bunch of new words that require dictionary look-ups is much more challenging than reading that same book when you already have the necessary vocabulary. In this case, short-circuit is invisible, so anybody that needs to look it up won't even know that they need to look it up.