Implementing -hash / -isEqual: / -isEqualTo...: for Objective-C collections - objective-c

Note: The following SO questions are related, but neither they nor the linked resources seem to fully answer my questions, particularly in relation to implementing equality tests for collections of objects.
Best practices for overriding -isEqual: and -hash
Techniques for implementing -hash on mutable Cocoa objects
Background
NSObject provides default implementations of -hash (which returns the address of the instance, like (NSUInteger)self) and -isEqual: (which returns NO unless the addresses of the receiver and the parameter are identical). These methods are designed to be overridden as necessary, but the documentation makes it clear that you should provide both or neither. Further, if -isEqual: returns YES for two objects, then the result of -hash for those objects must be the same. If not, problems can ensue when objects that should be the same — such as two string instances for which -compare: returns NSOrderedSame — are added to a Cocoa collection or compared directly.
Context
I develop CHDataStructures.framework, an open-source library of Objective-C data structures. I have implemented a number of collections, and am currently refining and enhancing their functionality. One of the features I want to add is the ability to compare collections for equality with one another.
Rather than comparing only memory addresses, these comparisons should consider the objects present in the two collections (including ordering, if applicable). This approach has quite a precedent in Cocoa, and generally uses a separate method, including the following:
-[NSArray isEqualToArray:]
-[NSDate isEqualToDate:]
-[NSDictionary isEqualToDictionary:]
-[NSNumber isEqualToNumber:]
-[NSSet isEqualToSet:]
-[NSString isEqualToString:]
-[NSValue isEqualToValue:]
I want to make my custom collections robust to tests of equality, so they may safely (and predictably) be added to other collections, and allow others (like an NSSet) to determine whether two collections are equal/equivalent/duplicates.
Problems
An -isEqualTo...: method works great on its own, but classes which define these methods usually also override -isEqual: to invoke [self isEqualTo...:] if the parameter is of the same class (or perhaps subclass) as the receiver, or [super isEqual:] otherwise. This means the class must also define -hash such that it will return the same value for disparate instances that have the same contents.
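To make the pattern described above concrete, here is a minimal sketch. ExampleList and its NSMutableArray backing store are hypothetical names made up for illustration, not part of CHDataStructures or Cocoa, and the count-based -hash is only the cheapest choice that stays consistent with -isEqual:.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical ordered collection; the class name and storage are illustrative only.
@interface ExampleList : NSObject
- (void)addObject:(id)anObject;
- (NSUInteger)count;
- (BOOL)isEqualToExampleList:(ExampleList *)other;
@end

@implementation ExampleList {
    NSMutableArray *_items;
}

- (instancetype)init {
    if ((self = [super init])) {
        _items = [[NSMutableArray alloc] init];
    }
    return self;
}

- (void)addObject:(id)anObject { [_items addObject:anObject]; }
- (NSUInteger)count { return [_items count]; }

// Typed comparison: same elements in the same order.
- (BOOL)isEqualToExampleList:(ExampleList *)other {
    if (other == nil) return NO;
    if (other == self) return YES;
    return [_items isEqualToArray:other->_items];
}

// Generic comparison: forward to the typed method when the class matches,
// otherwise fall back to NSObject's pointer comparison.
- (BOOL)isEqual:(id)object {
    if ([object isKindOfClass:[ExampleList class]]) {
        return [self isEqualToExampleList:object];
    }
    return [super isEqual:object];
}

// Equal lists have equal counts, so this always agrees with -isEqual:.
- (NSUInteger)hash { return [_items count]; }
@end
```

Two separately allocated ExampleList instances holding equal elements in the same order compare equal through -isEqual: and report the same -hash, which is all that hashing collections such as NSSet require.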
In addition, Apple's documentation for -hash stipulates the following: (emphasis mine)
"If a mutable object is added to a collection that uses hash values to determine the object's position in the collection, the value returned by the hash method of the object must not change while the object is in the collection. Therefore, either the hash method must not rely on any of the object's internal state information or you must make sure the object's internal state information does not change while the object is in the collection. Thus, for example, a mutable dictionary can be put in a hash table but you must not change it while it is in there. (Note that it can be difficult to know whether or not a given object is in a collection.)"
Edit: I definitely understand why this is necessary and totally agree with the reasoning — I mentioned it here to provide additional context, and skirted the topic of why it's the case for the sake of brevity.
All of my collections are mutable, and the hash will have to consider at least some of the contents, so the only option here is to consider it a programming error to mutate a collection stored in another collection. (My collections all adopt NSCopying, so collections like NSDictionary can successfully make a copy to use as a key, etc.)
It makes sense for me to implement -isEqual: and -hash, since (for example) an indirect user of one of my classes may not know the specific -isEqualTo...: method to call, or even care whether two objects are instances of the same class. They should be able to call -isEqual: or -hash on any variable of type id and get the expected result.
Unlike -isEqual: (which has access to two instances being compared), -hash must return a result "blindly", with access only to the data within a particular instance. Since it can't know what the hash is being used for, the result must be consistent for all possible instances that should be considered equal/identical, and must always agree with -isEqual:. (Edit: This has been debunked by the answers below, and it certainly makes life easier.) Further, writing good hash functions is non-trivial — guaranteeing uniqueness is a challenge, especially when you only have an NSUInteger (32/64 bits) in which to represent it.
Questions
Are there best practices when implementing equality comparisons and -hash for collections?
Are there any peculiarities to plan for in Objective-C and Cocoa-esque collections?
Are there any good approaches for unit testing -hash with a reasonable degree of confidence?
Any suggestions on implementing -hash to agree with -isEqual: for collections containing elements of arbitrary types? What pitfalls should I know about? (Edit: Not as problematic as I first thought. As @kperryua points out, "equal -hash values do not imply -isEqual:".)
Edit: I should have clarified that I'm not confused about how to implement -isEqual: or -isEqualTo...: for collections, that's straightforward. I think my confusion stemmed mainly from (mistakenly) thinking that -hash MUST return a different value if -isEqual: returns NO. Having done cryptography in the past, I was thinking that hashes for different values MUST be different. However, the answers below made me realize that a "good" hash function is really about minimizing bucket collisions and chaining for collections that use -hash. While unique hashes are preferable, they are not a strict requirement.

I think trying to come up with some generally useful hash function that will generate unique hash values for collections is an exercise in futility. U62's suggestion of combining the hashes of all the contents will not scale well, as it makes the hash function O(n). Hash functions should really be O(1) to ensure good performance, otherwise the purpose of the hash is defeated. (Consider the common Cocoa construct of plists, which are dictionaries containing arrays and other dictionaries, potentially ad nauseam. Attempting to take the hash of the top-level dictionary of a large plist would be excruciatingly slow if the collections' hash functions were O(n).)
My suggestion would be not to worry a great deal about a collection's hash. As you stated, -isEqual: implies equal -hash values. On the other hand, equal -hash values do not imply -isEqual:. That fact gives you a lot of leeway to create a simple hash.
If you're really worried about collisions though (and you have proof in concrete measurements of real-world situations that confirm it is something to be worried about), you could still follow U62's advice to some degree. For example, you could take the hash of, say, the first and/or last element in the collection, and combine that with, say, the -count of the collection. That should be enough to provide a decent hash.
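A rough sketch of that idea, assuming the collection can hand over its elements as an NSArray. CheapCollectionHash is a made-up helper name and the multiplier 31 is an arbitrary choice, not anything Cocoa prescribes.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: an O(1) hash for an ordered collection.
// It mixes the element count with the hashes of the first and last elements.
static NSUInteger CheapCollectionHash(NSArray *items) {
    NSUInteger hash = [items count];
    if (hash > 0) {
        // Unsigned arithmetic wraps on overflow, which is fine for hashing.
        hash = hash * 31u + [[items objectAtIndex:0] hash];
        hash = hash * 31u + [[items lastObject] hash];
    }
    return hash;
}
```

Collections with equal contents in the same order always produce the same value, so this stays consistent with an order-sensitive -isEqual:. Collections that differ only in the middle will collide, which is allowed.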
I hope that answers at least one of your questions.
As for No. 1: Implementing -isEqual: is pretty cut and dried. You enumerate the contents, and check -isEqual: on each of the elements.
There is one thing to be careful of that may affect what you decide to do for your collections' -hash functions. Clients of your collections must also understand the rules governing -isEqual: and -hash. If you use the contents' -hash in your collection's -hash, your collection will break if the contents' isEqual: and -hash don't agree. It's the client's fault, of course, but that's another argument against basing your -hash off of the collection's contents.
No. 2 is kind of vague. Not sure what you have in mind there.

Two collections should be considered equal if they contain the same elements, and further if the collections are ordered, that the elements are in the same order.
On the subject of hashes for collections, it should be enough to combine the hashes of the elements in some way (XOR them or modulo add them). Note that while the rules state that two objects that are equal according to -isEqual: need to return the same hash, the opposite does not hold: although uniqueness of hashes is desirable, it is not necessary for correctness of the solution. Thus the hash of an ordered collection need not take account of the order of the elements.
The excerpt from the Apple documentation is a necessary restriction by the way. An object could not maintain the same hash value under mutation while also ensuring that objects with the same value have the same hash. That applies for the simplest of objects as well as collections. Of course it only usually matters that an object's hash changes when it is inside a container that uses the hash to organise its elements. The upshot of all this is that mutable collections shouldn't mutate when placed inside another container, but then neither should any object that has a true hash function.
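For illustration, XOR-combining might look like the sketch below. XORCombinedHash is a hypothetical helper name, and the parameter is assumed to be anything that supports fast enumeration.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: combine the hashes of all elements with XOR.
// XOR is commutative, so the result does not depend on element order.
// Note that this visits every element, so it is O(n).
static NSUInteger XORCombinedHash(id<NSFastEnumeration> collection) {
    NSUInteger hash = 0;
    for (id element in collection) {
        hash ^= [element hash];
    }
    return hash;
}
```

One thing to keep in mind with plain XOR is that an element appearing twice cancels itself out, so a collection holding the same object twice hashes like an empty one. That is still correct, just a weaker hash.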

I have done some investigation into the NSArray and NSMutableArray default hash implementation and (unless I have misunderstood something) it seems like Apple do not follow their own rules:
If a mutable object is added to a collection that uses hash values to
determine the object's position in the collection, the value returned
by the hash method of the object must not change while the object is
in the collection. Therefore, either the hash method must not rely on
any of the object's internal state information or you must make sure
the object's internal state information does not change while the
object is in the collection. Thus, for example, a mutable dictionary
can be put in a hash table but you must not change it while it is in
there. (Note that it can be difficult to know whether or not a given
object is in a collection.)
Here is my test code
NSMutableArray* myMutableArray = [NSMutableArray arrayWithObjects:@"a", @"b", @"c", nil];
NSMutableArray* containerForMutableArray = [NSMutableArray arrayWithObject:myMutableArray];
NSUInteger hashBeforeMutation = [[containerForMutableArray objectAtIndex:0] hash];
[[containerForMutableArray objectAtIndex:0] removeObjectAtIndex:1];
NSUInteger hashAfterMutation = [[containerForMutableArray objectAtIndex:0] hash];
NSLog(@"Hash Before: %lu", (unsigned long)hashBeforeMutation);
NSLog(@"Hash After : %lu", (unsigned long)hashAfterMutation);
The output is:
Hash Before: 3
Hash After : 2
So it seems like the default implementation of the -hash method on both NSArray and NSMutableArray is the count of the array, and it doesn't care whether it's inside a collection or not.

Related

Does NSDictionary's objectForKey: rely on identity or equality?

Say I have an object called Person which has the property socialSecurityNumber, and this class overrides the isEqual: method to return true when the social security number properties are equal. And say I've put a bunch of instances of Person into an NSDictionary.
If I now instantiate a newPerson object which happens to have the same social security number as one already in the dictionary, and I do [myDictionary objectForKey:newPerson], will it use the isEqual: and return YES, or will it compare pointers and return NO?
I know I can write a simple test to find out, but I want to understand how exactly objectForKey: finds a match in a dictionary, and generally how consistent this is across Cocoa (i.e. does NSArray's indexOfObject: work the same?)
NSDictionary works like a hashtable. So it uses both -hash and -isEqual: to find the object in the dictionary corresponding to the given key.
So to answer your question for NSDictionary, this uses isEqual: and not pointer comparison. But you also should implement hash in addition to isEqual: on your Person class for this to work.
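A minimal sketch of such a Person class, assuming it is used as the dictionary key as in the question. The initializer name is made up for illustration. NSDictionary copies its keys, so the class also adopts NSCopying.

```objective-c
#import <Foundation/Foundation.h>

// Illustrative Person class; equality and hash are both based on the SSN.
@interface Person : NSObject <NSCopying>
@property (nonatomic, copy, readonly) NSString *socialSecurityNumber;
- (instancetype)initWithSocialSecurityNumber:(NSString *)ssn;
@end

@implementation Person

- (instancetype)initWithSocialSecurityNumber:(NSString *)ssn {
    if ((self = [super init])) {
        _socialSecurityNumber = [ssn copy];
    }
    return self;
}

// Two people are equal when their social security numbers match.
- (BOOL)isEqual:(id)object {
    if (object == self) return YES;
    if (![object isKindOfClass:[Person class]]) return NO;
    return [_socialSecurityNumber isEqualToString:[(Person *)object socialSecurityNumber]];
}

// Derived from the same property as -isEqual:, so equal people share a hash.
- (NSUInteger)hash {
    return [_socialSecurityNumber hash];
}

// Dictionary keys are copied on insertion.
- (id)copyWithZone:(NSZone *)zone {
    return [[[self class] allocWithZone:zone] initWithSocialSecurityNumber:_socialSecurityNumber];
}

@end
```

With this in place, looking up a dictionary with a freshly created Person that has the same social security number finds the entry stored under the original key, and indexOfObject: on an array finds the original instance as well.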
From the NSDictionary Class Reference documentation:
A key-value pair within a dictionary is called an entry. Each entry consists of one object that represents the key and a second object that is that key’s value. Within a dictionary, the keys are unique. That is, no two keys in a single dictionary are equal (as determined by isEqual:).
From the isEqual: method documentation:
If two objects are equal, they must have the same hash value. This last point is particularly important if you define isEqual: in a subclass and intend to put instances of that subclass into a collection. Make sure you also define hash in your subclass.
This behavior is consistent across the various container classes in Cocoa. For example, from the NSArray's indexOfObject: method documentation:
Starting at index 0, each element of the array is sent an isEqual: message until a match is found or the end of the array is reached. This method passes the anObject parameter to each isEqual: message. Objects are considered equal if isEqual: (declared in the NSObject protocol) returns YES.
You should always read the documentation: as pointed out by the extracts quoted above, these kinds of details are often explained in the "Discussion" or "Special Considerations" sections of the method documentation or in the "Overview" section of the class documentation itself.
how consistent this is across Cocoa (i.e. does NSArray's indexOfObject: work the same?)
It is consistent and at the same time it isn't. What I mean is that there are two methods that could be used: isEqual: and hash. You should not be too concerned about which is used when. What you should instead focus on is respecting the NSObject protocol requirements and making sure that if two objects are equal according to isEqual: they also have the same hash.
From the isEqual documentation in the NSObject Protocol Reference
If two objects are equal, they must have the same hash value. This
last point is particularly important if you define isEqual: in a
subclass and intend to put instances of that subclass into a
collection. Make sure you also define hash in your subclass.

objective-c complexity reference

For the C++ STL, there is a de facto standard location (besides the de jure standard, I mean) to find information about the complexity guarantees of standard container operations.
Is there an analogous, web-accessible document listing complexity guarantees for NSArray, NSDictionary, etc.?
For example, I cannot find a reference that gives complexity for [NSArray count]
Correct. There isn't one. C++ / the STL (based on my limited understanding) have a significant performance focus. Objective-C / Foundation basically don't.
NSArray, NSDictionary and friends are interfaces. They tell you how to use them, not how they behave. This gives them the freedom to switch implementation under the hood for performance reasons. The point is, you don't need to care, and this won't be specified in the API so you can't even if you want to ;)
For a really good read on this subject, highlighting implementation switches, and with a rough comparison between Foundation classes and STL / C data structures, check out the Ridiculous Fish (by someone on the Apple AppKit team) blog post about "Our arrays, aren't"
Is there an analogous, web-accessible document listing complexity
guarantees for NSArray, NSDictionary, etc.?
No. If you understand what the different containers do, you'll have a pretty good idea of how they behave (e.g. dictionary == map -> nearly constant-time lookups). But don't assume that you know exactly how these structures behave, because they may change their behavior based on circumstances. In other words, a class like NSArray may not be (certainly isn't) implemented as an actual array in the sense of a C-style array even though it has that same "ordered sequence of elements" behavior.
You can, of course, analyze the complexity of your own code: your own binary search through an NSArray is always going to take O(log n) operations any way you slice it. Just don't assume that inserting an element into an NSMutableArray is going to require moving all the subsequent elements, because your "array" might really be a linked list or something else.

Should I use == or [NSManagedObject isEqual:] to compare managed objects in the same context?

Let's say variable A and B hold instances of managed objects in the same managed object context. I need to make sure that they are associated with the same "record" in the persistent store. The section on Faulting and Uniquing in the Core Data Programming Guide says that:
Core Data ensures that—in a given managed object context—an entry in a persistent store is associated with only one managed object.
From this, it seems that a pointer comparison is sufficient for my purpose. Or does it ever make sense to use isEqual: to compare managed objects in the same context?
Use == to determine if two pointers point to the same object. Use -isEqual: to determine if two objects are "equal", where the notion of equality depends on the objects being compared. -isEqual: does not itself compare the values returned by -hash, but the two must agree: objects that are equal must return the same hash. I wrote previously that it seemed possible that -isEqual: might return true if two managed objects contain the same values. That's clearly not right. There are some caveats in the docs about making sure that the hash value for a mutable object doesn't change while it's in a collection, and that knowing whether a given object is in a collection can be difficult. It seems certain that the hash for a managed object doesn't depend on the data that that object contains, and much more likely that it's connected to something immutable about the object; the object's -objectID value seems a likely candidate.
Given all that, I'm changing my opinion ;-). Each record is only represented once in a given context, so == is probably safe, but -isEqual: seems to better express your intention.
Pointer comparison is fine for objects retrieved from a single managed object context, the documentation on uniquing you quote promises as much.
ObjectID should be used for testing object equality across managed object contexts.
isEqual does not do attribute tests, because it is documented to not fault the object. In fact, looking at the disassembled function it is definitely just a pointer compare.
So the semantics of the equality test for managed objects are simply "points to the same object (record) in the managed object context" and will compare false for objects in different contexts.
Warning: Since NSManagedObject isEqual compares objectIDs, a comparison can fail if one instance is using the temporary objectID and the other instance is using the permanent objectID.
Background: When an NSManagedObject is created, it is assigned a temporary objectID. It is converted into a permanent objectID when the NSManagedObject is actually persisted into the store. You can see the difference if you print the objectID:
x-coredata:///MyEntity/t03BF9735-A005-4ED9-96BA-462BD65FA25F118 (temporary ID)
x-coredata://EB8922D9-DC06-4256-A21B-DFFD47D7E6DA/MyEntity/p3 (permanent ID)
When an objectID is converted to permanent, instances of the NSManagedObject in other threads and collections are not updated. So if you put an NSManagedObject into an NSArray when it has a temporary objectID, using methods like containsObject will fail if you try to find the object with the permanent objectID. Remember containsObject uses isEqual.
Finally, a couple of useful methods are NSManagedObjectID isTemporaryID and NSManagedObjectContext obtainPermanentIDsForObjects:error:.

Immutability in Objective-C

I'm beginning an objective-c project. I have a question regarding immutability. Is it worth trying to make objects immutable whenever I can? If I update a field, I have to return a pointer to a new object and dealloc the old. If I do this often, there might be performance issues. Also, the code will probably be more verbose. There are undoubtedly other considerations. What do you think?
Edit: Let me clarify what I mean when I write "update a field". Normally, when you update a field you call a setter and just change the value of the field. If the object is immutable, the setter does not actually update the field, instead it creates a new instance, with all the fields having the same value, except for the field you are trying to update. In java:
class User{
private String firstName;
private String lastName;
public User(String fn, String ln){ firstName = fn; lastName = ln; }
public User setFirstName(String fn){ return new User(fn, lastName); }
}
Use immutable objects whenever possible, due to the performance overhead of mutable objects.
Edit: Well, usually the above should be true, but it seems there are situations where NSMutableArray performance is actually better than NSArray. Read some more about it on the Cocos2d site:
Read some more about mutability on CocoaWithLove (great weblog for Mac / iOS developers so put it in your favorites!).
I'd also like to add that a lot of objects have the -mutableCopy instance method. This is an easy way to retrieve a mutable copy of an immutable object, such as an NSArray or NSString, e.g.:
NSArray *array = [NSArray arrayWithObjects:@"apple", @"pear", @"lemon", nil];
NSMutableArray *mutableArray = [array mutableCopy];
// remember to release the mutableArray at some point
// because we've created a copy ...
Just remember in some situations a mutable object is easier to use, for example for a UITableView that makes use of a datasource that is subject to a lot of changes over time.
Whether mutable or immutable objects are best is very situation dependent, so it's best if you give a more concrete example to discuss. But here are some things to think about.
Often object properties are somehow inter-related. For instance, a Person might have a givenName and surname, but might also have a fullName that combines those two, and it might have a nameOrder that indicates which comes first. If you make Person mutable, then there can be points in time that fullName might be incorrect because you have changed the surname but not the givenName (perhaps one of them is still nil). You now need a more complex interface to protect you against this.
If other objects use this mutable Person, they have to employ KVO or notifications to find out when it has changed. The fact that interrelated fields might change independently can make this complex, and you find yourself writing code to coalesce the changes.
If some combinations of properties are illegal, mutable objects can be very hard to error check. An immutable object can do all of its checking when it is constructed.
There are some middle-grounds between mutable and immutable. In the above example of Person and various name properties, one way to simplify much of it is to let Person be mutable, but create a separate immutable Name object that contains the various parts. That way you can make sure that the entire name is mutated in an atomic way.
Immutable objects greatly simplify multi-threaded code. Mutable objects require a lot more locking and synchronization, and this can significantly hurt performance and stability. It's very easy to screw this code up. Immutable objects in comparison are trivial.
To your point about creating and throwing away objects, immutable objects also give the opportunity for sharing, which can make them very efficient if there are likely to be many objects pointing to the same data contents. For instance, in our Person example, if I make an immutable Address object, then every person who lives at the same address can share the same object. If one changes their address, this doesn't impact all the others.
As an example of the above, my code has a lot of email addresses in it. It's extremely common for the same string to show up over and over again. Making EmailAddress immutable, and only allowing it to be constructed with +emailAddressForString: allows the class to maintain a cache and this can save significant memory and time to construct and destroy string objects. But this only works because EmailAddress is immutable.
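A rough sketch of how such a caching constructor could look. Only the names EmailAddress and +emailAddressForString: come from the description above; the string property, the private initializer and the dictionary-based cache are assumptions made for illustration, and this cache never evicts anything.

```objective-c
#import <Foundation/Foundation.h>

// Immutable value object whose instances are shared through a class-level cache.
@interface EmailAddress : NSObject
@property (nonatomic, copy, readonly) NSString *string;
+ (instancetype)emailAddressForString:(NSString *)string;
@end

@implementation EmailAddress

// Not declared in the interface, so callers are steered to the caching constructor.
- (instancetype)initWithString:(NSString *)string {
    if ((self = [super init])) {
        _string = [string copy];
    }
    return self;
}

+ (instancetype)emailAddressForString:(NSString *)string {
    static NSMutableDictionary *cache = nil;
    // Serialize access so the cache can be used from several threads.
    @synchronized ([EmailAddress class]) {
        if (cache == nil) {
            cache = [[NSMutableDictionary alloc] init];
        }
        EmailAddress *address = [cache objectForKey:string];
        if (address == nil) {
            address = [[EmailAddress alloc] initWithString:string];
            [cache setObject:address forKey:string];
        }
        return address;
    }
}

@end
```

Because instances are never mutated, handing the same object to every caller that asks for the same string is safe. With a mutable EmailAddress, a change made by one holder would show up for all of them.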
Anyway, my experience is that it's often better to err towards immutable data objects for simplicity, and only make the mutable when immutability creates a performance problem. (Of course this only applies to data objects. Stateful objects are a different thing, and of course need to be mutable by their nature, but that doesn't mean that every part of them must be mutable.)
As in any other imperative language: it depends. I've seen decent boosts in code performance when we use immutable objects, but they're also usually infrequently-modified objects, ones which are read out of an archive or set by a user and then passed around to all different bits of code. It doesn't seem worth doing this for all your code, at least not to me, unless you plan on heavily leveraging multiprocessing and understand the tradeoffs you're making.
I think the bigger immutability concern is that if you've done good design to keep your data marked immutable when it is such, and mutable when it is such, then it's going to be a lot easier to take advantage of things like Grand Central Dispatch and other parallelization where you could realize far greater potential gains.
As a side note, moving to Objective C from Java, the first tip I can give you is to ditch the notion of public and private.

Is there any reason not to return a mutable object where one is not expected?

I have a number of functions similar to the following:
+ (NSArray *)arrayOfSomething
{
NSMutableArray *array = [NSMutableArray array];
// Add objects to the array
return [[array copy] autorelease];
}
My question is about the last line of this method: is it better to return the mutable object and avoid a copy operation, or to return an immutable copy? Are there any good reasons to avoid returning a mutable object where one is not expected?
(I know that it is legal to return a NSMutableArray since it is a subclass of NSArray. My question is whether or not this is a good idea.)
This is a complex topic. I think it's best to refer you to Apple's guidelines on object mutability.
Apple has this to say on the subject of using introspection to determine a returned object's mutability:
To determine whether it can change a received object, the receiver must rely on the formal type of the return value. If it receives, for instance, an array object typed as immutable, it should not attempt to mutate it. It is not an acceptable programming practice to determine if an object is mutable based on its class membership
(my emphasis)
The article goes on to give several very good reasons why you should not use introspection on a returned object to determine if you can mutate it e.g.
You read a property list from a file. When the Foundation framework processes the list it notices that various subsets of the property list are identical, so it creates a set of objects that it shares among all those subsets. Afterwards you look at the created property list objects and decide to mutate one subset. Suddenly, and without being aware of it, you’ve changed the tree in multiple places.
and
You ask NSView for its subviews (subviews method) and it returns an object that is declared to be an NSArray but which could be an NSMutableArray internally. Then you pass that array to some other code that, through introspection, determines it to be mutable and changes it. By changing this array, the code is mutating NSView’s internal data structures.
Given the above, it is perfectly acceptable for you to return the mutable array in your example (provided of course, you never mutate it yourself after having returned it, because then you would be breaking the contract).
Having said that, almost nobody has read that section of the Cocoa Objects Guide, so defensive programming would call for you to make an immutable copy and return that unless performance profiling shows that it is a problem to do that.
Short Answer: Don't do it
Long Answer: It depends. If the array is getting changed while being used by someone who expects it to be static, you can cause some baffling errors that would be a pain to track down. It would be better to just do the copy/autorelease like you've done and only come back and revisit the return type of that method if it turns out that there is a significant performance hit.
In response to the comments, I think it's unlikely that returning a mutable array would cause any trouble, but, if it does cause trouble, it could be difficult to track down exactly what the issue is. If making a copy of the mutable array turns out to be a big performance hit, it will be very easy to determine what's causing the problem. You have a choice between two very unlikely issues, one that's easy to solve, one that's very difficult.