NSMutableDictionary for huge dataset of floats - objective-c

I've got some code to convert a large (many gigabytes) XML file into another format.
Among other things, I need to store one or two gigabytes of floats in a hash table (two floats for each entry), with an int as the value's key.
Currently, I'm using NSMutableDictionary and a custom class containing the two floats:
// create the dictionary
NSMutableDictionary *points = [[NSMutableDictionary alloc] init];
// add an entry (the data is read from an XML file using libxml)
int pointId = 213453;
float x = 42.313554;
float y = -21.135213;
MyPoint *point = [[MyPoint alloc] initWithX:x Y:y];
[points setObject:point forKey:[NSNumber numberWithInt:pointId]];
[point release];
// retrieve an entry (this happens later on while parsing the same XML file)
int pointId = 213453;
float x;
float y;
MyPoint *point = [points objectForKey:[NSNumber numberWithInt:pointId]];
x = point.x;
y = point.y;
This data set is consuming about 800MB of RAM with the XML file I'm working with now, and it takes quite a long time to execute. I'd like to have better performance, but even more important I need to get the memory consumption down so I can process even larger XML files.
objc_msgSend is right up there in a profile of the code, as is +[NSNumber numberWithInt:], and I'm sure I can get the memory usage down by avoiding objects altogether, but I don't know much about C programming (this project is certainly teaching me!).
How can I replace NSMutableDictionary, NSNumber, and MyPoint with an efficient C data structure, without any third-party library dependencies?
I'd also like to be able to write this data structure to files on the disk, so I can work with a dataset that doesn't entirely fit into memory, but I can probably live without this capability.
(For those not familiar with Objective-C: the NSMutableDictionary class can only store Obj-C objects, and its keys must also be objects. NSNumber and MyPoint are dumb container classes that allow NSMutableDictionary to work with float and int values.)
EDIT:
I've tried using CFMutableDictionary to store structs, as per Apple's sample code. When the dictionary is empty, it performs great. But as the dictionary grows it gets slower and slower. About 25% of the way through parsing a file (~4 million items in the dictionary) it starts to chug, two orders of magnitude slower than earlier in the file.
NSMutableDictionary doesn't have the same performance issue. Instruments shows a lot of activity applying hashes and comparing the keys of the dictionary (the intEqual() function below). Comparing an int is fast, so something is very wrong for it to be executing so often.
Here's my code to create the dictionary:
typedef struct {
float lat;
float lon;
} AGPrimitiveCoord;
void agPrimitveCoordRelease(CFAllocatorRef allocator, const void *ptr) {
CFAllocatorDeallocate(allocator, (AGPrimitiveCoord *)ptr);
}
Boolean agPrimitveCoordEqual(const void *ptr1, const void *ptr2) {
AGPrimitiveCoord *p1 = (AGPrimitiveCoord *)ptr1;
AGPrimitiveCoord *p2 = (AGPrimitiveCoord *)ptr2;
return (fabsf(p1->lat - p2->lat) < 0.0000001 && fabsf(p1->lon - p2->lon) < 0.0000001);
}
Boolean intEqual(const void *ptr1, const void *ptr2) {
return (int)ptr1 == (int)ptr2;
}
CFHashCode intHash(const void *ptr) {
return (CFHashCode)((int)ptr);
}
// init storage dictionary
CFDictionaryKeyCallBacks intKeyCallBacks = {0, NULL, NULL, NULL, intEqual, intHash};
CFDictionaryValueCallBacks agPrimitveCoordValueCallBacks = {0, NULL /*agPrimitveCoordRetain*/, agPrimitveCoordRelease, NULL, agPrimitveCoordEqual};
temporaryNodeStore = CFDictionaryCreateMutable(NULL, 0, &intKeyCallBacks, &agPrimitveCoordValueCallBacks);
// add an item to the dictionary
- (void)parserRecordNode:(int)nodeId lat:(float)lat lon:(float)lon
{
AGPrimitiveCoord *coordPtr = (AGPrimitiveCoord *)CFAllocatorAllocate(NULL, sizeof(AGPrimitiveCoord), 0);
coordPtr->lat = lat;
coordPtr->lon = lon;
CFDictionarySetValue(temporaryNodeStore, (void *)nodeId, coordPtr);
}
EDIT 2:
The performance problem was due to the almost useless hashing implementation in Apple's sample code. I got the performance way up by using this:
CFHashCode intHash(const void *ptr) {
// hash algorithm from http://burtleburtle.net/bob/hash/integer.html
uint32_t a = abs((int)ptr);
a = (a+0x7ed55d16) + (a<<12);
a = (a^0xc761c23c) ^ (a>>19);
a = (a+0x165667b1) + (a<<5);
a = (a+0xd3a2646c) ^ (a<<9);
a = (a+0xfd7046c5) + (a<<3);
a = (a^0xb55a4f09) ^ (a>>16);
return (CFHashCode)a;
}

If you want NSMutableDictionary-like behavior but with malloc'd memory, you can drop down to CFDictionary (or in your case, CFMutableDictionary). It's actually the underpinnings of NSMutableDictionary, but it allows some customization, namely you can tell it that you're not storing objects. When you call CFDictionaryCreateMutable() you give it a struct that describes what sort of values you're handing it (it contains pointers that tell it how to retain, release, describe, hash, and compare your values). So if you want to use a struct containing two floats, and you're happy using malloc'd memory for each struct, you can malloc your struct, populate it, and hand that to the CFDictionary, and then you can write the callback functions such that they work with your particular struct. The only restriction on the keys and objects you can use CFDictionary with is they need to fit inside a void *.
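If the per-entry malloc is itself a concern, the same idea can be pushed one step further in plain C. What follows is only a minimal sketch under stated assumptions, not a drop-in replacement: an open-addressing (linear probing) table that stores the int key and both floats inline in one flat array, reusing the integer mixer quoted in the question. Every name here (PointTable, point_table_put, and so on) is hypothetical, and the 75% load limit and doubling growth are illustrative choices.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical types: one slot holds the key and both floats inline. */
typedef struct {
    int32_t key;
    float   x, y;
    uint8_t used;            /* 0 = empty slot, 1 = occupied */
} PointSlot;

typedef struct {
    PointSlot *slots;
    size_t     capacity;     /* always a power of two */
    size_t     count;        /* number of occupied slots */
} PointTable;

/* Integer mixer from the question (burtleburtle.net/bob/hash/integer.html).
   The key is cast to unsigned instead of calling abs(). */
static uint32_t mix32(uint32_t a) {
    a = (a + 0x7ed55d16u) + (a << 12);
    a = (a ^ 0xc761c23cu) ^ (a >> 19);
    a = (a + 0x165667b1u) + (a << 5);
    a = (a + 0xd3a2646cu) ^ (a << 9);
    a = (a + 0xfd7046c5u) + (a << 3);
    a = (a ^ 0xb55a4f09u) ^ (a >> 16);
    return a;
}

/* Returns 1 on success, 0 if allocation fails. */
int point_table_init(PointTable *t, size_t capacity_pow2) {
    assert(capacity_pow2 != 0 && (capacity_pow2 & (capacity_pow2 - 1)) == 0);
    t->slots = calloc(capacity_pow2, sizeof(PointSlot));
    t->capacity = capacity_pow2;
    t->count = 0;
    return t->slots != NULL;
}

void point_table_free(PointTable *t) {
    free(t->slots);
    t->slots = NULL;
    t->capacity = t->count = 0;
}

int point_table_put(PointTable *t, int32_t key, float x, float y);

/* Doubles the capacity and reinserts every occupied slot. */
static int point_table_grow(PointTable *t) {
    PointTable bigger;
    if (!point_table_init(&bigger, t->capacity * 2)) return 0;
    for (size_t i = 0; i < t->capacity; i++) {
        if (t->slots[i].used)
            point_table_put(&bigger, t->slots[i].key, t->slots[i].x, t->slots[i].y);
    }
    free(t->slots);
    *t = bigger;
    return 1;
}

/* Inserts or overwrites; returns 1 on success, 0 if growing failed. */
int point_table_put(PointTable *t, int32_t key, float x, float y) {
    /* Keep the load below 75% so probing always finds an empty slot. */
    if ((t->count + 1) * 4 > t->capacity * 3 && !point_table_grow(t)) return 0;
    size_t mask = t->capacity - 1;
    size_t i = mix32((uint32_t)key) & mask;
    while (t->slots[i].used && t->slots[i].key != key) i = (i + 1) & mask;
    if (!t->slots[i].used) {
        t->slots[i].used = 1;
        t->slots[i].key = key;
        t->count++;
    }
    t->slots[i].x = x;
    t->slots[i].y = y;
    return 1;
}

/* Returns the slot for key, or NULL if the key is absent. */
const PointSlot *point_table_get(const PointTable *t, int32_t key) {
    size_t mask = t->capacity - 1;
    size_t i = mix32((uint32_t)key) & mask;
    while (t->slots[i].used) {
        if (t->slots[i].key == key) return &t->slots[i];
        i = (i + 1) & mask;
    }
    return NULL;
}
```

Usage would be along the lines of point_table_init(&t, 1024), then point_table_put(&t, 213453, 42.313554f, -21.135213f), then point_table_get(&t, 213453). On common platforms each slot should come to 16 bytes, with no per-entry object or malloc header, and since the storage is one contiguous array it could in principle be written to disk with a single fwrite.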

For this sort of thing I would just use C++ containers std::unordered_map and std::pair. You can use them in Objective-C++. Just give your files a .mm extension instead of the usual .m extension.
Update
In your comment you said you've never done C++ before. In that case, you should either try Kevin Ballard's answer of CFDictionary, or check out the hcreate, hdestroy, and hsearch functions in the standard library.
hcreate man page

Rename your .m file to .mm and switch to using C++:
std::map<int, std::pair<float, float>> points;

Related

compare blocks and functions in objective C

As I am learning Objective-C, my understanding is new and incomplete. The concept of a block is very similar to a function. They even look almost identical:
FUNCTION named 'multiply'
#import <Foundation/Foundation.h>
int multiply (int x, int y)
{
return x * y;
}
int main(int argc, char *argv[]) {
@autoreleasepool {
int result = multiply(7, 4); // Result is 28.
NSLog(@"this is the result %d", result);
}
}
BLOCK named 'Multiply'
#import <Foundation/Foundation.h>
int (^Multiply)(int, int) = ^(int num1, int num2) {
return num1 * num2;
};
int main(int argc, char *argv[]) {
@autoreleasepool {
int result = Multiply(7, 4); // Result is 28.
NSLog(@"this is the result %d", result);
}
}
I found various statements on the web like:
"Blocks are implemented as Objective-C objects, except they can be put on the stack, so they don't necessarily have to be malloc'd (if you retain a reference to a block, it will be copied onto the heap, though). "
Ray Wenderlich says:
"Blocks are first-class functions"
I have no clue what all this means. My example shows that the same thing is accomplished as a block or a function. Can someone show an example where blocks can do something functions cannot? or vice versa?
Or is it something more subtle, like the way the variable 'result' is handled in memory?
Or is one faster or safer than the other?
Can either of them be used as a method in a class definition?
Thank you.
Blocks are Objective-C objects, and functions aren't. In practice, this means you can pass around a block from one piece of code to another like so:
NSArray *names = @[@"Bob", @"Alice"];
[names enumerateObjectsUsingBlock:^(id name, NSUInteger idx, BOOL *stop) {
NSLog(@"Hello, %@", name);
}];
In C, you can achieve similar effects by passing around pointers to functions. The main difference between doing this and using blocks, however, is that blocks can capture values. For instance, in the example above, if we wanted to use a variable greeting:
NSString *greeting = @"Hello";
NSArray *names = @[@"Bob", @"Alice"];
[names enumerateObjectsUsingBlock:^(id name, NSUInteger idx, BOOL *stop) {
NSLog(@"%@, %@", greeting, name);
}];
In this example, the compiler can see that the block depends on the local variable greeting and will "capture" the value of greeting and store it along with the block (in this case, that means retaining and storing a pointer to an NSString). Wherever the block ends up getting used (in this case, within the implementation of -[NSArray enumerateObjectsUsingBlock:]), it will have access to the greeting variable as it was at the time the block was declared. This lets you use any local variables in the scope of your block without having to worry about passing them into the block.
To do the same using function pointers in C, greeting would have to be passed in as a variable. However, this can't happen because the caller (in this case, NSArray) can't know (especially at compile time) exactly which arguments it has to pass to your function. Even if it did, you'd need to somehow pass the value of greeting to NSArray, along with every other local variable you wanted to use, which would get hairy really quickly:
void greet(NSString *greeting, NSString *name) {
NSLog(@"%@, %@", greeting, name);
}
// NSArray couldn't actually implement this
NSString *greeting = @"Hello";
NSArray *names = @[@"Bob", @"Alice"];
[names enumerateObjectsUsingFunction:greet withGreeting:greeting];
Blocks are closures -- they can capture local variables from the surrounding scope. This is the big difference between blocks (and anonymous functions in other modern languages) and functions in C.
Here's an example of a higher-order function, makeAdder, which creates and returns an "adder", a function which adds a certain base number to its argument. This base number is set by the argument to makeAdder. So makeAdder can return different "adders" with different behavior:
typedef int (^IntFunc)(int);
IntFunc makeAdder(int x) {
return ^(int y) { return x + y; };
}
IntFunc adder3 = makeAdder(3);
IntFunc adder5 = makeAdder(5);
adder3(4); // returns 7
adder5(4); // returns 9
adder3(2); // returns 5
This would not be possible to do with function pointers in C, because each function pointer must point to an actual function in the code, of which there is a finite number fixed at compile time, and each function's behavior is fixed at compile time. So the ability to create a virtually unlimited number of potential "adders" depending on a value at runtime, like makeAdder does, is not possible. You would instead need to create a structure to hold the state.
A block which does not capture local variables from the surrounding scope, like in your example, is not much different from a plain function, aside from the type.
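To make the "structure to hold the state" alternative concrete, here is a minimal sketch in plain C. The names Adder, make_adder, and adder_call are hypothetical. The captured value lives in a struct that the caller must carry around and pass explicitly, which is exactly the bookkeeping a block does for you.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a closure: the "captured" value is stored explicitly. */
typedef struct {
    int base;
} Adder;

/* Plays the role of makeAdder: records the base at creation time. */
Adder make_adder(int base) {
    Adder a = { base };
    return a;
}

/* Plays the role of invoking the block: the state has to be passed in by hand. */
int adder_call(const Adder *a, int y) {
    assert(a != NULL);
    return a->base + y;
}
```

With Adder adder3 = make_adder(3), the call adder_call(&adder3, 4) gives 7. Any API that accepts such a callback has to accept both the function pointer and the state pointer, which is why C callback APIs typically carry an extra void * context argument.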

With NSPointerArray, how to iterate over opaque pointers?

I recently discovered classes like NSMapTable and NSPointerArray, which work like the traditional collections but also let you store weak references or plain old C pointers. Unfortunately, it looks like you can't use the for...in syntax to iterate over non-NSObject pointers. For example:
typedef struct Segment {
CGPoint bottom, top;
} Segment;
...
NSPointerArray *segments = [[NSPointerArray alloc]
initWithOptions:NSPointerFunctionsOpaqueMemory];
...
Segment *s = malloc(sizeof(Segment));
[segments addPointer: s];
...
for (Segment *s in segments) { // nope...
The compiler does not like that last line. The error:
Selector element type 'Segment *' (aka 'struct Segment *') is not a valid object
So, do I need to do this?
for (int i=0, len=segments.count; i<len; i++) {
Segment *seg = [segments pointerAtIndex:i];
...
That's not the end of the world, but I just want to make sure.
(This might be more of theoretical interest.)
NSPointerArray does conform to the NSFastEnumeration protocol, it is only the
for (id object in collection) language construct that cannot be used with arbitrary pointers which
are not Objective-C pointers.
But you can get a whole bunch of pointers from the array by calling the NSFastEnumeration
method countByEnumeratingWithState:objects:count: directly. This is a bit tricky because
that method need not fill the supplied buffer (as explained here: How for in loop works internally - Objective C - Foundation).
Here is a simple example of how this would work:
__unsafe_unretained id objs[10];
// The enumeration state must be zero-initialized before the first call
NSFastEnumerationState state = { 0 };
NSUInteger count = [segments countByEnumeratingWithState:&state
objects:objs count:10];
// Now state.itemsPtr points to an array of pointers:
for (NSUInteger i = 0; i < count; i++) {
Segment *s = (__bridge Segment *)state.itemsPtr[i];
NSLog(@"%p", s);
}
So this does not help to make the code simpler and you probably want to stick with
your explicit loop.
But for large arrays it might improve the performance because the pointers are "fetched"
in batches from the array instead of each pointer separately.
The for (... in ...) syntax won't work in this case because Segment is a struct, not an Objective-C object. Your second for loop should work.

Fast way to store and retrieve pairs of numbers in Objective-C

I am implementing a queued flood fill algorithm and need to store and retrieve pairs of numbers in an NSMutableArray.
Basically, I am creating an array
m_queue = [NSMutableArray array];
then at some time I populate the array
[m_queue addObject:[NSValue valueWithCGPoint:CGPointMake(x + 1, y)]];
then I retrieve data for the next iteration and remove the value at the beginning of the array
NSValue* value = [m_queue objectAtIndex:0];
[m_queue removeObjectAtIndex:0];
CGPoint nextPoint = [value CGPointValue];
[self queueFloodFill8:nextPoint.x y:nextPoint.y];
The question is: what can I do to avoid creating a large number of CGPoint and NSValue objects?
I don't really need points, the algorithm uses pairs of integer values, so I think there might be a better way to store such pairs.
UPDATE:
I looked into implementing a C-style solution like @mattjgalloway and @CRD suggested.
I've introduced
typedef struct lookup_point_struct
{
int x;
int y;
struct lookup_point_struct* next;
} LookupPoint;
and have rewritten the code to use a linked list of such structs instead of NSMutableArray and CGPoint/NSValue.
All this made my code about 3 times faster, and memory consumption dropped significantly too.
There wouldn't really be a better Objective-C / Foundation way of doing it, apart from maybe creating your own class such as NumberPair or something which you put into the array rather than using NSValue and CGPoint. It might be slightly more memory efficient to do that and you could make NumberPair contain two integers rather than floats like you are concerned about. Something like:
@interface NumberPair : NSObject
@property (nonatomic, assign) int x;
@property (nonatomic, assign) int y;
@end
@implementation NumberPair
@synthesize x, y;
@end
...
m_queue = [NSMutableArray array];
NumberPair *newPair = [[NumberPair alloc] init];
newPair.x = 1;
newPair.y = 2;
[m_queue addObject:newPair];
...
NumberPair *nextPoint = [m_queue objectAtIndex:0];
[m_queue removeObjectAtIndex:0];
[self queueFloodFill8:nextPoint.x y:nextPoint.y];
Other than that you could do a more C-like thing of having a struct containing two integers, create a dynamically allocated array to store the structs (you'd need to know the max size of the queue or keep reallocating). Something like:
typedef struct {
int x;
int y;
} NumberPair;
NumberPair *m_queue = (NumberPair*)malloc(sizeof(NumberPair) * QUEUE_SIZE);
// ... etc
Also, you might want to check out my MJGStack class which wraps NSMutableArray to provide a stack like interface which you might be able to adjust slightly to do what you want rather than using NSMutableArray directly. Although that's not essential by any means.
How large do you expect your m_queue array to get?
If the cost of the NSMutableArray and NSValue objects (CGPoint is a struct, no real cost there) is impacting your algorithm then consider using a C-style array of structs as a circular buffer together with two indexes for front/back of the queue. You can abstract this into a queue class (or an adt using functions to save on dynamic method call overhead if you need to).
If you need to deal with an unbounded queue you can malloc & realloc the array with your queue class/adt as needed (which is essentially what NSMutableArray does behind the scenes but with more overhead for its generality).
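As a rough illustration of the circular-buffer idea, here is a minimal sketch in plain C. The names IntPair, PairQueue, and the pair_queue_* functions are hypothetical, and doubling on overflow is just one possible growth policy. A head index and a count play the role of the front/back indexes, so removing from the front is O(1) instead of shifting the whole array.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int x, y; } IntPair;

/* Hypothetical growable FIFO of int pairs backed by a circular buffer. */
typedef struct {
    IntPair *items;
    size_t   capacity;
    size_t   head;    /* index of the front element */
    size_t   count;   /* number of queued elements */
} PairQueue;

/* Returns 1 on success, 0 if allocation fails. */
int pair_queue_init(PairQueue *q, size_t capacity) {
    assert(capacity > 0);
    q->items = malloc(capacity * sizeof(IntPair));
    q->capacity = capacity;
    q->head = 0;
    q->count = 0;
    return q->items != NULL;
}

void pair_queue_free(PairQueue *q) {
    free(q->items);
    q->items = NULL;
    q->capacity = q->head = q->count = 0;
}

/* Appends at the back; doubles the buffer when full. Returns 0 on allocation failure. */
int pair_queue_push(PairQueue *q, int x, int y) {
    if (q->count == q->capacity) {
        size_t new_cap = q->capacity * 2;
        IntPair *bigger = malloc(new_cap * sizeof(IntPair));
        if (bigger == NULL) return 0;
        /* Unwrap the ring so the front element lands at index 0. */
        for (size_t i = 0; i < q->count; i++)
            bigger[i] = q->items[(q->head + i) % q->capacity];
        free(q->items);
        q->items = bigger;
        q->capacity = new_cap;
        q->head = 0;
    }
    size_t tail = (q->head + q->count) % q->capacity;
    q->items[tail].x = x;
    q->items[tail].y = y;
    q->count++;
    return 1;
}

/* Removes the front element into *out. Returns 0 if the queue is empty. */
int pair_queue_pop(PairQueue *q, IntPair *out) {
    if (q->count == 0) return 0;
    *out = q->items[q->head];
    q->head = (q->head + 1) % q->capacity;
    q->count--;
    return 1;
}
```

In the flood fill loop this would replace the addObject: / objectAtIndex:0 / removeObjectAtIndex:0 sequence with pair_queue_push(&q, x + 1, y) and pair_queue_pop(&q, &next), with no NSValue allocated per point.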

Passing and calling dynamic blocks in Objective C

As part of a unit test framework, I'm writing a function genArray that will generate NSArrays populated by a passed in generator block. So [ObjCheck genArray: genInt] would generate an NSArray of random integers, [ObjCheck genArray: genChar] would generate an NSArray of random characters, etc. In particular, I'm getting compiler errors in my implementation of genArray and genString, a wrapper around [ObjCheck genArray: genChar].
I believe Objective C can manipulate blocks this dynamically, but I don't have the syntax right.
ObjCheck.m
+ (id) genArray: (id) gen {
NSArray* arr = [NSMutableArray array];
int len = [self genInt] % 100;
int i;
for (i = 0; i < len; i++) {
id value = gen();
arr = [arr arrayByAddingObject: value];
}
return arr;
}
+ (id) genString {
NSString* s = @"";
char (^g)() = ^() {
return [ObjCheck genChar];
};
NSArray* arr = [self genArray: g];
s = [arr componentsJoinedByString: @""];
return s;
}
When I try to compile, gcc complains that it can't do gen(), because gen is not a function. This makes sense, since gen is indeed not a function but an id which must be cast to a function.
But when I rewrite the signatures to use id^() instead of id, I also get compiler errors. Can Objective C handle arbitrarily typed blocks (genArray needs this), or is that too dynamic?
Given that blocks are objects, you can cast between block types and id whenever you want, though if you cast the block to the wrong block type and call it, you're going to get unexpected results (since there's no way to dynamically check at runtime what the "real" type of the block is*).
BTW, id^() isn't a type. You're thinking of id(^)(). This may be a source of compiler error for you. You should be able to update +genArray: to use
id value = ((id(^)())(gen))();
Naturally, that's pretty ugly.
*There actually is a way, llvm inserts an obj-c type-encoded string representing the type of the block into the block's internal structure, but this is an implementation detail and would rely on you casting the block to its internal implementation structure in order to extract.
Blocks are a C-level feature, not an ObjC one - you work with them analogously to function pointers. There's an article with a very concise overview of the syntax. (And most everything else.)
In your example, I'd make the gen parameter an id (^gen)(). (Or possibly make it return a void*, using id would imply to me that gen generates ObjC objects and not completely arbitrary types.)
No matter how you declare your variables and parameters, your code won't work. There's a problem that runs through all your compiler errors and it would be a problem even if you weren't doing convoluted things with blocks.
You are trying to add chars to an NSArray. You can't do that. You will have to wrap them as some kind of Objective-C object. Since your only requirement for this example to work is that the objects can be inputs to componentsJoinedByString:, you can return single-character NSStrings from g. Then some variety of signature like id^() will work for genArray. I'm not sure how you parenthesize it. Something like this:
+ (id) genArray: (id (^)()) gen;
+ (id) genString {
...
NSString * (^g)() = ^() {
return [NSString stringWithFormat:@"%c", [ObjCheck genChar]];
};
...
}
NSString * is an id. char is not. You can pass NSString * ^() to id ^(), but you get a compiler error when you try to pass a char ^() to an id ^(). If you gave up some generality of genArray and declared it to accept char ^(), it would compile your call to genArray, but would have an error within genArray when you tried to call arrayByAddingObject and the argument isn't typed as an id.
Somebody who understands the intricacies of block syntax feel free to edit my post if I got some subtle syntax errors.
Btw, use an NSMutableArray as your local variable in genArray and add to it with addObject:. Calling arrayByAddingObject: over and over creates a new array on every iteration, which I imagine gives O(n^2) time overall. You can still declare the return type as NSArray, which is a superclass of NSMutableArray, and the callers of genArray won't know the difference.

What is the best way to define string constants in an objective-c protocol?

I have defined a protocol that all my plug-ins must implement. I would also like the plug-ins to all use certain strings, like MyPluginErrorDomain. With integers this is quite easily achieved in an enum, but I can't figure out how to do the same with strings. Normally, in classes I would define
extern NSString * const MyPluginErrorDomain;
in the .h file and in the .m file:
NSString * const MyPluginErrorDomain = @"MyPluginErrorDomain";
but that doesn't work very well in a protocol, because then each plug-in would have to provide its own implementation which defeats the purpose of having a constant.
I then tried
#define MYPLUGIN_ERROR_DOMAIN @"MyPluginErrorDomain"
but the implementing classes in the plug-in can't seem to see the #define. Who knows a good solution?
You can declare them in the header with the protocol (but outside the protocol interface itself), then define them in an implementation file for the protocol (obviously it wouldn't have an @implementation section - just your NSString definitions).
Or have a separate .h/.m pair that is just for the string constants (the protocol header can import the string constants header).
You keep the .h definition:
extern NSString * const MyPluginErrorDomain;
but put this part into a separate .m file that gets included in your framework:
NSString * const MyPluginErrorDomain = @"MyPluginErrorDomain";
So plug-ins can still implement the interface but when compiling they link or compile in your other .m file, so they will see the value of MyPluginErrorDomain.
In C++, I would declare them in a header like this:
const char * const MYPLUGIN_ERROR_DOMAIN = "MyPluginErrorDomain";
const char * const MYPLUGIN_FOO_DOMAIN = "MyPluginFooDomain";
Note that as the pointers are const, they will be local to the translation units the header is #included in, and so there will be no need to use extern to prevent multiple definition errors.
You should implement it as extern strings as in your example:
extern NSString * const MyPluginErrorDomain;
or provide extern functions which return static storage data. For example:
/* h */
extern NSString * MyPluginErrorDomain();
/* m */
NSString * MyPluginErrorDomain() {
static NSString * const s = @"MyPluginErrorDomain";
return s;
}
The reason is that strings and keys are often used and compared by pointer value or hash value, rather than true string comparison (isEqualToString:).
At the implementation level, there is a big difference between comparing pointers or hashes and comparing the strings' contents.
In code, that matters when the strings being compared are defined in multiple binaries:
Say 'MyPluginErrorDomain' and 'key' have identical string values, but are defined in different binaries (i.e. one in the plugin host, one in the plugin).
/////// Pointer comparison (NSString)
BOOL a = [MyPluginErrorDomain isEqualToString:key];
BOOL b = MyPluginErrorDomain == key;
// c may be false because a may be true, in that they represent the same character sequence, but do not point to the same object
BOOL c = a == b;
/////// Hash use (NSString)
// This is true
BOOL d = [MyPluginErrorDomain hash] == [key hash];
// This is indicative if true
BOOL e = [MyPluginErrorDomain hash] == [someOtherStringKey hash];
// because
BOOL f = [MyPluginErrorDomain isEqualToString:someOtherStringKey];
// g may be false (though the hash code is 'generally' correct)
BOOL g = e == f;
It is therefore necessary to provide the keys in many cases. It may seem like a trivial point, but it is hard to diagnose some of the problems associated with the difference.
Hash codes and pointer comparisons are used throughout Foundation and other objc technologies in the internals of dictionary storage, key value coding... If your dictionary is going straight out to xml, that's one thing, but runtime use is another and there are a few caveats in the implementation and runtime details.
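The pointer-versus-contents distinction can be shown without Foundation at all. The following plain C sketch is only an analogy, with hypothetical names: two separate arrays stand in for the same constant being defined independently in the host and in a plugin. Their contents match, but their addresses do not, so an identity check fails where a content check succeeds.

```c
#include <assert.h>
#include <string.h>

/* Stand-ins for one constant defined twice, e.g. once in the host and once in a plugin. */
const char host_domain[]   = "MyPluginErrorDomain";
const char plugin_domain[] = "MyPluginErrorDomain";

/* Identity check: analogous to comparing NSString pointers with ==. */
int same_pointer(const char *a, const char *b) {
    return a == b;
}

/* Content check: analogous to isEqualToString:. */
int same_contents(const char *a, const char *b) {
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;
}
```

Here same_contents(host_domain, plugin_domain) is true while same_pointer(host_domain, plugin_domain) is false. Exporting a single extern definition, as suggested above, is what makes both checks agree.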