Metal compute values aren't the same as the CPU values - objective-c

I'm trying to compute the length of each vector in an array of 3D points, and when I compare the values computed by the GPU with those computed by the CPU they aren't exactly the same; a large number of elements usually differ.
I initially used packed_float3, which produced somewhat more differences, so I switched to float3. That improved things a little, but there are still differences that I would like to fix.
The values don't differ by much (on average by -0.00000000048358334004), but when I run operations like adding or subtracting two arrays the difference doesn't occur, and I would like the length computation to match the CPU in the same way.
Here is part of the code:
main.m
- (void) lenght_function:(NSArray*) array {
_buffer[0] = [_mDevice newBufferWithLength:_sp_size_alloc options:MTLResourceStorageModeShared];
_buffer[1] = [_mDevice newBufferWithLength:_sp_size_alloc options:MTLResourceStorageModeShared];
float3 *datapt = [_buffer[0] contents];
for (unsigned long index = 0 ; index< _sp_lenght ; index++) {
datapt[index].x = (float)[array[index] getX];
datapt[index].y = (float)[array[index] getY];
datapt[index].z = (float)[array[index] getZ];
}
commandBuffer = [_mCommandQueue commandBuffer];
assert(commandBuffer != nil);
id<MTLComputeCommandEncoder> computeEncoder = [commandBuffer computeCommandEncoder];
assert(computeEncoder != nil);
[computeEncoder setComputePipelineState:_mLenghtFunctionPSO];
[computeEncoder setBuffer:_buffer[0] offset:0 atIndex:0];
[computeEncoder setBuffer:_buffer[1] offset:0 atIndex:1];
//[array1 makeData];
MTLSize gridSize = MTLSizeMake(_sp_lenght, 1, 1);
NSUInteger threadGroupSize = _mLenghtFunctionPSO.maxTotalThreadsPerThreadgroup;
if(threadGroupSize > _sp_lenght){
threadGroupSize = _sp_lenght;
}
MTLSize threadgroupsize = MTLSizeMake(threadGroupSize, 1, 1);
[computeEncoder dispatchThreads:gridSize threadsPerThreadgroup:threadgroupsize];
[computeEncoder endEncoding];
[commandBuffer commit];
[commandBuffer waitUntilCompleted];
float3 *arr1 = _buffer[0].contents;
float* result = _buffer[1].contents;
unsigned long counter = 0;
for (unsigned long index = 0; index < _sp_lenght; index++)
{
if (result[index] != sqrtf(arr1[index].x*arr1[index].x + arr1[index].y*arr1[index].y + arr1[index].z*arr1[index].z)){
counter++;
}
}
NSLog(@"ERROR counter %lu\n",counter);
}
kernel.metal
kernel void lenght(const device float3 *arr1,
device float *result,
uint index[[thread_position_in_grid]]){
result[index] = precise::sqrt(precise::pow(arr1[index].x,2) + precise::pow(arr1[index].y,2) + precise::pow(arr1[index].z,2));
}

A 32-bit float has only about 7 significant decimal digits of precision, and the difference you show is around the 9th or 10th decimal place. So what you show is actually a bit better than one can expect from 32-bit float precision. It sounds like you want 64-bit double precision, but that is not a built-in Metal datatype.
It may help if you multiply the values by say 100 or 1000 to move the decimal place up, then after your values have been added divide by that number.
Another possibility is to normalize your values first, so they are all in a range of say 0 to 1. Then you might even be able to use half precision.
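Since bit-for-bit equality is more than 32-bit floats can promise, the CPU-side check could compare with a relative tolerance instead of `!=`. The following is a minimal sketch, not part of the original code: the helper names `abs_f` and `nearly_equal` and the choice of tolerance are assumptions made for illustration.

```c
#include <float.h>
#include <stdbool.h>

/* Hypothetical helper: absolute value without pulling in libm. */
static float abs_f(float x) {
    return x < 0.0f ? -x : x;
}

/* Hypothetical helper: true when a and b differ by at most
 * `eps_multiples` times FLT_EPSILON, relative to the larger magnitude.
 * FLT_EPSILON (about 1.19e-7) is the gap between 1.0f and the next float. */
static bool nearly_equal(float a, float b, float eps_multiples) {
    float abs_a = abs_f(a);
    float abs_b = abs_f(b);
    float scale = abs_a > abs_b ? abs_a : abs_b;
    return abs_f(a - b) <= eps_multiples * FLT_EPSILON * scale;
}
```

In the verification loop, the mismatch counter would then be incremented only when `nearly_equal(result[index], expected, 4.0f)` is false; the factor 4 is an arbitrary starting point and may need tuning for your data.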

Related

Wildly varying hashing performance with CFSet and CFDictionary on OS X

When using CFSet and CFDictionary configured with custom callbacks to use integers as their keys, I've noticed some wildly varying performance of their internal hashing implementation. I'm using 64 bit integers (int64_t) with a range of roughly 1 - 1,000,000.
While profiling my application with Instruments, I noticed that every so often, a certain combination of factors would produce unusually poor performance. Looking at Instruments, CFBasicHash was taking much longer than usual.
After a bunch of investigating, I finally narrowed things down to a set of 400,000 integers that, when added to a CFSet or CFDictionary, cause terrible hashing performance.
The hashing implementation in CFBasicHash.m is beyond my understanding for a problem like this, so I was wondering if anyone had any idea why such a completely random set of integers could cause such dreadful performance.
The following test application will output an average iteration time of 37ms for adding sequential integers to a set, but an average run time of 3622ms when adding the same number of integers but from the problematic data set.
(And if you insert the same number of completely random integers, then performance is much closer to 37ms. As well, adding these problematic integers to an std::map or std::set produces acceptable performance.)
#import <Foundation/Foundation.h>
extern uint64_t dispatch_benchmark(size_t count, void (^block)(void));
int main(int argc, char *argv[]) {
@autoreleasepool {
NSString *data = [NSString stringWithContentsOfFile:@"Integers.txt" encoding:NSUTF8StringEncoding error:NULL];
NSArray *components = [data componentsSeparatedByString:@","];
NSInteger count = components.count;
int64_t *numbers = (int64_t *)malloc(sizeof(int64_t) * count);
int64_t *sequentialNumbers = (int64_t *)malloc(sizeof(int64_t) * count);
for (NSInteger c = 0; c < count; c++) {
numbers[c] = [components[c] integerValue];
sequentialNumbers[c] = c;
}
NSLog(@"Beginning test with %@ numbers...", @(count));
// Test #1 - Loading sequential integers
uint64_t t1 = dispatch_benchmark(10, ^{
CFMutableSetRef mutableSetRef = CFSetCreateMutable(NULL, 0, NULL);
for (NSInteger c = 0; c < count; c++) {
CFSetAddValue(mutableSetRef, (const void *)sequentialNumbers[c]);
}
NSLog(@"Sequential iteration completed with %@ items in set.", @(CFSetGetCount(mutableSetRef)));
CFRelease(mutableSetRef);
});
NSLog(@"Sequential Numbers Average Runtime: %llu ms", t1 / NSEC_PER_MSEC);
NSLog(@"-----");
// Test #2 - Loading data set
uint64_t t2 = dispatch_benchmark(10, ^{
CFMutableSetRef mutableSetRef = CFSetCreateMutable(NULL, 0, NULL);
for (NSInteger c = 0; c < count; c++) {
CFSetAddValue(mutableSetRef, (const void *)numbers[c]);
}
NSLog(@"Dataset iteration completed with %@ items in set.", @(CFSetGetCount(mutableSetRef)));
CFRelease(mutableSetRef);
});
NSLog(@"Dataset Average Runtime: %llu ms", t2 / NSEC_PER_MSEC);
free(sequentialNumbers);
free(numbers);
}
}
Example output:
Sequential Numbers Average Runtime: 37 ms
Dataset Average Runtime: 3622 ms
The integers are available here:
Gist (Integers.txt) or Dropbox (Integers.txt)
Can anyone help explain what is "special" about the given integers that might cause such a degradation in the hashing implementation used by CFSet and CFDictionary?

Performance of measuring text width in AppKit

Is there a way in AppKit to measure the width of a large number of NSString objects (say, a million) really fast? I have tried 3 different ways to do this:
[NSString sizeWithAttributes:]
[NSAttributedString size]
NSLayoutManager (get text width instead of height)
Here are some performance metrics
Count      sizeWithAttributes (s)   NSAttributedString (s)   NSLayoutManager (s)
1000       0.057                    0.031                    0.007
10000      0.329                    0.325                    0.064
100000     3.06                     3.14                     0.689
1000000    29.5                     31.3                     7.06
NSLayoutManager is clearly the way to go, but there are two problems:
High memory footprint (more than 1 GB according to the profiler) because of the creation of heavyweight NSTextStorage objects.
High creation time. All of the time taken is during creation of the above strings, which is a dealbreaker in itself. (Subsequently measuring NSTextStorage objects which have glyphs created and laid out only takes about 0.0002 seconds.)
7 seconds is still too slow for what I am trying to do. Is there a faster way to measure a million strings in about a second?
In case you want to play around, here is the GitHub project.
Here are some ideas I haven't tried.
Use Core Text directly. The other APIs are built on top of it.
Parallelize. All modern Macs (and even all modern iOS devices) have multiple cores. Divide up the string array into several subarrays. For each subarray, submit a block to a global GCD queue. In the block, create the necessary Core Text or NSLayoutManager objects and measure the strings in the subarray. Both APIs can be used safely this way. (Core Text) (NSLayoutManager)
Regarding “High memory footprint”: Use Local Autorelease Pool Blocks to Reduce Peak Memory Footprint.
Regarding “All of the time taken is during creation of the above strings, which is a dealbreaker in itself”: Are you saying all the time is spent in these lines:
double random = (double)arc4random_uniform(1000) / 1000;
NSString *randomNumber = [NSString stringWithFormat:@"%f", random];
Formatting a floating-point number is expensive. Is this your real use case? If you just want to format a random rational of the form n/1000 for 0 ≤ n < 1000, there are faster ways. Also, in many fonts, all digits have the same width, so that it's easy to typeset columns of numbers. If you pick such a font, you can avoid measuring the strings in the first place.
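As one illustration of the "faster ways" for a value of the form n/1000, the digits can be written directly from the integer without any floating-point formatting. This is a sketch only: `format_thousandths` is a hypothetical name, and the output is padded to six decimals on the assumption that you want the same text `%f` would produce for these values.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: writes n/1000 (0 <= n < 1000) as "0.ddd000".
 * `out` must have room for 9 bytes including the terminating NUL. */
static void format_thousandths(unsigned n, char out[9]) {
    assert(n < 1000);
    memcpy(out, "0.000000", 9);            /* template, including the NUL */
    out[2] = (char)('0' + n / 100);        /* tenths */
    out[3] = (char)('0' + (n / 10) % 10);  /* hundredths */
    out[4] = (char)('0' + n % 10);         /* thousandths */
}
```

The resulting C string could then be wrapped once with `stringWithUTF8String:` if an NSString is required.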
UPDATE
Here's the fastest code I've come up with using Core Text. The dispatched version is almost twice as fast as the single-threaded version on my Core i7 MacBook Pro. My fork of your project is here.
static CGFloat maxWidthOfStringsUsingCTFramesetter(
NSArray *strings, NSRange range) {
NSString *bigString =
[[strings subarrayWithRange:range] componentsJoinedByString:@"\n"];
NSAttributedString *richText =
[[NSAttributedString alloc]
initWithString:bigString
attributes:@{ NSFontAttributeName: (__bridge NSFont *)font }];
CGPathRef path =
CGPathCreateWithRect(CGRectMake(0, 0, CGFLOAT_MAX, CGFLOAT_MAX), NULL);
CGFloat width = 0.0;
CTFramesetterRef setter =
CTFramesetterCreateWithAttributedString(
(__bridge CFAttributedStringRef)richText);
CTFrameRef frame =
CTFramesetterCreateFrame(
setter, CFRangeMake(0, bigString.length), path, NULL);
NSArray *lines = (__bridge NSArray *)CTFrameGetLines(frame);
for (id item in lines) {
CTLineRef line = (__bridge CTLineRef)item;
width = MAX(width, CTLineGetTypographicBounds(line, NULL, NULL, NULL));
}
CFRelease(frame);
CFRelease(setter);
CFRelease(path);
return (CGFloat)width;
}
static void test_CTFramesetter() {
runTest(__func__, ^{
return maxWidthOfStringsUsingCTFramesetter(
testStrings, NSMakeRange(0, testStrings.count));
});
}
static void test_CTFramesetter_dispatched() {
runTest(__func__, ^{
dispatch_queue_t gatherQueue = dispatch_queue_create(
"test_CTFramesetter_dispatched result-gathering queue", nil);
dispatch_queue_t runQueue =
dispatch_get_global_queue(QOS_CLASS_UTILITY, 0);
dispatch_group_t group = dispatch_group_create();
__block CGFloat gatheredWidth = 0.0;
const size_t Parallelism = 16;
const size_t totalCount = testStrings.count;
// Force unsigned long to get 64-bit math to avoid overflow for
// large totalCounts.
for (unsigned long i = 0; i < Parallelism; ++i) {
NSUInteger start = (totalCount * i) / Parallelism;
NSUInteger end = (totalCount * (i + 1)) / Parallelism;
NSRange range = NSMakeRange(start, end - start);
dispatch_group_async(group, runQueue, ^{
double width =
maxWidthOfStringsUsingCTFramesetter(testStrings, range);
dispatch_sync(gatherQueue, ^{
gatheredWidth = MAX(gatheredWidth, width);
});
});
}
dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
return gatheredWidth;
});
}
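The start/end arithmetic used to split the strings across blocks can be checked in isolation. The sketch below restates it in plain C; `chunk_range` and `chunk_for` are hypothetical names introduced here, not part of the project.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type: a half-open slice [start, start + length). */
typedef struct {
    size_t start;
    size_t length;
} chunk_range;

/* Returns the i-th of `parts` slices of `total` items, using the same
 * (total * i) / parts boundaries as the dispatched test above.
 * Consecutive slices share boundaries, so there are no gaps or overlaps. */
static chunk_range chunk_for(size_t total, size_t parts, size_t i) {
    assert(parts > 0 && i < parts);
    size_t start = (total * i) / parts;
    size_t end = (total * (i + 1)) / parts;
    chunk_range r = { start, end - start };
    return r;
}
```

For example, 10 items split 4 ways gives slices of length 2, 3, 2, and 3, so the remainder is spread across the chunks rather than piled onto the last one.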

Convert really large decimal string to hex?

I've got a really large decimal number in an NSString, which is too large to fit into any variable, including NSDecimal. I was doing the math manually, but if I can't fit the number into a variable then I can't divide it. So what would be a good way to convert the string?
Example Input: 423723487924398723478243789243879243978234
Output: 4DD361F5A772159224CE9EB0C215D2915FA
I was looking at the first answer here, but it's in C# and I don't know its Objective-C equivalent.
Does anyone have any ideas that don't involve using an external library?
If this is all you need, it's not too hard to implement, especially if you're willing to use Objective-C++. By using Objective-C++, you can use a vector to manage memory, which simplifies the code.
Here's the interface we'll implement:
// NSString+BigDecimalToHex.h
@interface NSString (BigDecimalToHex)
- (NSString *)hexStringFromDecimalString;
@end
To implement it, we'll represent an arbitrary-precision non-negative integer as a vector of base-65536 digits:
// NSString+BigDecimalToHex.mm
#import "NSString+BigDecimalToHex.h"
#import <vector>
// index 0 is the least significant digit
typedef std::vector<uint16_t> BigInt;
The "hard" part is to multiply a BigInt by 10 and add a single decimal digit to it. We can very easily implement this as long multiplication with a preloaded carry:
static void insertDecimalDigit(BigInt &b, uint16_t decimalDigit) {
uint32_t carry = decimalDigit;
for (size_t i = 0; i < b.size(); ++i) {
uint32_t product = b[i] * (uint32_t)10 + carry;
b[i] = (uint16_t)product;
carry = product >> 16;
}
if (carry > 0) {
b.push_back(carry);
}
}
With that helper method, we're ready to implement the interface. First, we need to convert the decimal digit string to a BigInt by calling the helper method once for each decimal digit:
- (NSString *)hexStringFromDecimalString {
NSUInteger length = self.length;
unichar decimalCharacters[length];
[self getCharacters:decimalCharacters range:NSMakeRange(0, length)];
BigInt b;
for (NSUInteger i = 0; i < length; ++i) {
insertDecimalDigit(b, decimalCharacters[i] - '0');
}
If the input string is empty, or all zeros, then b is empty. We need to check for that:
if (b.size() == 0) {
return @"0";
}
Now we need to convert b to a hex digit string. The most significant digit of b is at the highest index. To avoid leading zeros, we'll handle that digit specially:
NSMutableString *hexString = [NSMutableString stringWithFormat:@"%X", b.back()];
Then we convert each remaining base-65536 digit to four hex digits, in order from most significant to least significant:
for (ssize_t i = b.size() - 2; i >= 0; --i) {
[hexString appendFormat:@"%04X", b[i]];
}
And then we're done:
return hexString;
}
You can find my full test program (to run as a Mac command-line program) in this gist.
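For readers who would rather avoid Objective-C++, the same multiply-by-10-and-carry idea can be sketched in plain C with a fixed-size digit array. This is an illustrative variant, not the code from the gist: `decimal_to_hex` and the `MAX_DIGITS` cap are assumptions made here, and inputs that would need more than `MAX_DIGITS` base-65536 digits are rejected.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_DIGITS 64 /* hypothetical cap: 64 base-65536 digits = 1024 bits */

/* Converts a string of ASCII decimal digits to uppercase hex in `out`.
 * Returns 0 on success, -1 on invalid input or insufficient space. */
static int decimal_to_hex(const char *decimal, char *out, size_t out_size) {
    assert(decimal != NULL && out != NULL);
    uint16_t digits[MAX_DIGITS]; /* index 0 is the least significant digit */
    size_t count = 0;

    for (const char *p = decimal; *p != '\0'; ++p) {
        if (*p < '0' || *p > '9') return -1;
        /* Multiply the big integer by 10, preloading the carry with the digit. */
        uint32_t carry = (uint32_t)(*p - '0');
        for (size_t i = 0; i < count; ++i) {
            uint32_t product = digits[i] * (uint32_t)10 + carry;
            digits[i] = (uint16_t)product;
            carry = product >> 16;
        }
        if (carry > 0) {
            if (count == MAX_DIGITS) return -1;
            digits[count++] = (uint16_t)carry;
        }
    }

    if (count == 0) { /* empty or all-zero input */
        if (out_size < 2) return -1;
        strcpy(out, "0");
        return 0;
    }

    /* Most significant digit without leading zeros, the rest as 4 hex digits. */
    int n = snprintf(out, out_size, "%X", (unsigned)digits[count - 1]);
    if (n < 0 || (size_t)n >= out_size) return -1;
    size_t used = (size_t)n;
    for (size_t i = count - 1; i-- > 0; ) {
        n = snprintf(out + used, out_size - used, "%04X", (unsigned)digits[i]);
        if (n < 0 || (size_t)n >= out_size - used) return -1;
        used += (size_t)n;
    }
    return 0;
}
```

Calling `decimal_to_hex("255", buf, sizeof buf)` with a 64-byte `buf` leaves `"FF"` in the buffer. The fixed cap trades generality for having no heap allocation; the `std::vector` version above has no such limit.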

Objective C - Matrix Multiplication Slow Performance

I have two 2-D NSMutableArrays and I am trying to do some basic matrix multiplication. I have my generic formula code below, but its performance is exceptionally slow (as expected). I have done lots of googling and have not found any simple, easy-to-understand formulas for changing the code to improve performance. Can anyone point me in the right direction of a straightforward formula/tutorial/example of how to get better performance than O(n^3) with matrix multiplication in Objective-C?
+ (NSMutableArray*)multiply:(NSMutableArray*)a1 withArray:(NSMutableArray*)a2
{
if([[a1 objectAtIndex: 0] count] != [a2 count])
{
NSLog(@"Multiplication error!");
return nil;
}
int a1_rowNum = [a1 count];
int a2_rowNum = [a2 count];
int a2_colNum = [[a2 objectAtIndex:0] count];
NSMutableArray *result = [NSMutableArray arrayWithCapacity:a1_rowNum];
for (int i = 0; i < a1_rowNum; i++) {
NSMutableArray *tempRow = [NSMutableArray arrayWithCapacity:a2_colNum];
for (int j = 0; j < a2_colNum; j++) {
double tempTotal = 0;
for (int k = 0; k < a2_rowNum; k++) {
double temp1 = [[[a1 objectAtIndex:i] objectAtIndex:k] doubleValue];
double temp2 = [[[a2 objectAtIndex:k] objectAtIndex:j] doubleValue];
tempTotal += temp1 * temp2;
}
//Stored as a string because I upload it to an online database for storage.
[tempRow addObject:[NSString stringWithFormat:@"%f",tempTotal]];
}
[result addObject:tempRow];
}
return result;
}
It will be much faster if you write it in C.
A double[] will be ridiculously fast compared to an NSArray of NSNumbers for this task. You'll have good cache coherency and minimal instructions, with no need to go through the runtime or allocate in order to write or read an element, and no need to perform reference count cycling on each element…
You need to have a look at Apple's Accelerate framework, available from iOS 4.0 onwards.
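To make the "write it in C" suggestion concrete, here is a minimal sketch of the same triple loop over flat, row-major double arrays. The function name `matmul` and the row-major layout are assumptions chosen for illustration. It is still O(n^3), but it avoids the object and runtime overhead described above.

```c
#include <assert.h>
#include <stddef.h>

/* Computes c = a * b for row-major matrices:
 * a is n x m, b is m x p, c is n x p (c must not alias a or b). */
static void matmul(const double *a, const double *b, double *c,
                   size_t n, size_t m, size_t p) {
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < p; ++j) {
            c[i * p + j] = 0.0;
        }
        /* i-k-j order walks both b and c along rows, which is cache friendly. */
        for (size_t k = 0; k < m; ++k) {
            double aik = a[i * m + k];
            for (size_t j = 0; j < p; ++j) {
                c[i * p + j] += aik * b[k * p + j];
            }
        }
    }
}
```

The values would be copied out of the NSMutableArrays into the flat buffers once, multiplied, and converted to strings only at the end for the database upload.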
You can do a lot of complex math and matrix manipulation with it and this framework is optimized to run on any iOS hardware.
Checkout:
https://developer.apple.com/performance/accelerateframework.html

Quickest way to be sure region of memory is blank (all NULL)?

If I have an unsigned char *data pointer and I want to check whether size_t length of the data at that pointer is NULL, what would be the fastest way to do that? In other words, what's the fastest way to make sure a region of memory is blank?
I am implementing in iOS, so you can assume iOS frameworks are available, if that helps. On the other hand, simple C approaches (memcmp and the like) are also OK.
Note, I am not trying to clear the memory, but rather trying to confirm that it is already clear (I am trying to find out whether there is anything at all in some bitmap data, if that helps). For example, I think the following would work, though I have not tried it yet:
- (BOOL)data:(unsigned char *)data isNullToLength:(size_t)length {
unsigned char tester[length];
memset(tester, 0, length);
if (memcmp(tester, data, length) != 0) {
return NO;
}
return YES;
}
I would rather not create a tester array, though, because the source data may be quite large and I'd rather avoid allocating memory for the test, even temporarily. But I may just be being too conservative there.
UPDATE: Some Tests
Thanks to everyone for the great responses below. I decided to create a test app to see how these performed. The answers surprised me, so I thought I'd share them. First I'll show you the version of the algorithms I used (in some cases they differ slightly from those proposed) and then I'll share some results from the field.
The Tests
First I created some sample data:
size_t length = 1024 * 768;
unsigned char *data = (unsigned char *)calloc(length, sizeof(unsigned char));
int i;
int count;
long check;
int loop = 5000;
Each test consisted of a loop run loop times. During the loop some random data was added to and removed from the data byte stream. Note that half the time there was actually no data added, so half the time the test should not find any non-zero data. Note the testZeros call is a placeholder for calls to the test routines below. A timer was started before the loop and stopped after the loop.
count = 0;
for (i=0; i<loop; i++) {
int r = random() % length;
if (random() % 2) { data[r] = 1; }
if (! testZeros(data, length)) {
count++;
}
data[r] = 0;
}
Test A: nullToLength. This was more or less my original formulation above, debugged and simplified a bit.
- (BOOL)data:(void *)data isNullToLength:(size_t)length {
void *tester = calloc(length, 1);
int test = memcmp(tester, data, length);
free(tester);
return (! test);
}
Test B: allZero. Proposal by Carrotman.
BOOL allZero (unsigned char *data, size_t length) {
bool allZero = true;
for (int i = 0; i < length; i++){
if (*data++){
allZero = false;
break;
}
}
return allZero;
}
Test C: is_all_zero. Proposed by Lundin.
BOOL is_all_zero (unsigned char *data, size_t length)
{
BOOL result = TRUE;
unsigned char* end = data + length;
unsigned char* i;
for(i=data; i<end; i++) {
if(*i > 0) {
result = FALSE;
break;
}
}
return result;
}
Test D: sumArray. This is the top answer from the nearly duplicate question, proposed by vladr.
BOOL sumArray (unsigned char *data, size_t length) {
int sum = 0;
for (int i = 0; i < length; ++i) {
sum |= data[i];
}
return (sum == 0);
}
Test E: lulz. Proposed by Steve Jessop.
BOOL lulz (unsigned char *data, size_t length) {
if (length == 0) return 1;
if (*data) return 0;
return memcmp(data, data+1, length-1) == 0;
}
Test F: NSData. This is a test using NSData object I discovered in the iOS SDK while working on all of these. It turns out Apple does have an idea of how to compare byte streams that is designed to be hardware independent.
- (BOOL)nsdTestData: (NSData *)nsdData length: (NSUInteger)length {
void *tester = calloc(length, 1);
NSData *nsdTester = [NSData dataWithBytesNoCopy:tester length:(NSUInteger)length freeWhenDone:NO];
int test = [nsdData isEqualToData:nsdTester];
free(tester);
return (test);
}
Results
So how did these approaches compare? Here are two sets of data, each representing 5000 loops through the check. First I tried this on the iPhone Simulator running on a relatively old iMac, then I tried this running on a first generation iPad.
On the iPhone 4.3 Simulator running on an iMac:
// Test A, nullToLength: 0.727 seconds
// Test F, NSData: 0.727
// Test E, lulz: 0.735
// Test C, is_all_zero: 7.340
// Test B, allZero: 8.736
// Test D, sumArray: 13.995
On a first generation iPad:
// Test A, nullToLength: 21.770 seconds
// Test F, NSData: 22.184
// Test E, lulz: 26.036
// Test C, is_all_zero: 54.747
// Test B, allZero: 63.185
// Test D, sumArray: 84.014
These are just two samples; I ran the test many times with only slightly varying results. The order of performance was always the same: A and F very close, E just behind, then C, B, and D. I'd say that A, F, and E are virtual ties. On iOS I'd prefer F because it takes advantage of Apple's protection from processor change issues, but A and E are very close. The memcmp approach clearly wins over the simple loop approach: close to ten times faster in the simulator and roughly twice as fast on the device itself. Oddly enough, D, the winning answer from the other thread, performed very poorly in this test, probably because it does not break out of the loop when it hits the first nonzero byte.
I think you should do it with an explicit loop, but just for lulz:
if (length == 0) return 1;
if (*pdata) return 0;
return memcmp(pdata, pdata+1, length-1) == 0;
Unlike memcpy, memcmp does not require that the two data sections don't overlap.
It may well be slower than the loop, though, because the un-alignedness of the input pointers means there probably isn't much the implementation of memcmp can do to optimize, plus it's comparing memory with memory rather than memory with a constant. Easy enough to profile it and find out.
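The trick works because the overlapping comparison checks that every byte equals the one before it; combined with the check that the first byte is zero, that implies every byte is zero. Here is a self-contained sketch of the snippet as a function (the name `is_all_zero_memcmp` is made up for this example):

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true when all `length` bytes at `data` are zero.
 * memcmp(data, data + 1, length - 1) compares byte i with byte i + 1
 * for every i, so a zero first byte forces all the others to be zero. */
static bool is_all_zero_memcmp(const unsigned char *data, size_t length) {
    if (length == 0) return true;
    if (data[0] != 0) return false;
    return memcmp(data, data + 1, length - 1) == 0;
}
```

No scratch buffer is allocated, which addresses the concern in the question about creating a tester array for large bitmaps.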
Not sure if it's the best, but I probably would do something like this:
bool allZero = true;
for (size_t i = 0; i < length; i++){
if (*data++){
//Roll back so data points to the non-zero char
data--;
//Do whatever is needed if it isn't zero.
allZero = false;
break;
}
}
If you've just allocated this memory, you can always call calloc rather than malloc (calloc guarantees that all the data is zeroed out). (Edit: reading your comment on the first post, you don't really need this. I'll just leave it just in case.)
If you're allocating the memory yourself, I'd suggest using the calloc() function. It's just like malloc(), except it zeros out the buffer first. It's what's used to allocate memory for Objective-C objects and is the reason that all ivars default to 0.
On the other hand, if this is a statically declared buffer, or a buffer you're not allocating yourself, memset() is the easy way to do this.
Logic to get a value, check it, and set it will be at least as expensive as just setting it. You want it to be null, so just set it to null using memset().
This would be the preferred way to do it in C:
BOOL is_all_zero (const unsigned char* data, size_t length)
{
BOOL result = TRUE;
const unsigned char* end = data + length;
const unsigned char* i;
for(i=data; i<end; i++)
{
if(*i > 0)
{
result = FALSE;
break;
}
}
return result;
}
(Though note that strictly and formally speaking, a memory cell containing a NULL pointer doesn't necessarily have to be 0, as long as a null pointer cast results in the value zero, and a cast of a zero to a pointer results in a NULL pointer. In practice, this shouldn't matter as all known compilers use 0 or (void*) 0 for NULL.)
Note the edit to the initial question above. I did some tests and it is clear that the memcmp approach or using Apple's NSData object and its isEqualToData: method are the best approaches for speed. The simple loops are clearer to me, but slower on the device.