Random Generator slightly less random - objective-c

NSPredicate *predicate = [NSPredicate predicateWithFormat:@"category == %@", selectedCategory];
NSArray *filteredArray = [self.Quotes filteredArrayUsingPredicate:predicate];
// Get total number in filtered array
int array_tot = (int)[filteredArray count];
// As a safeguard only get quote when the array has rows in it
if (array_tot > 0) {
    // Get random index
    int index = (arc4random() % array_tot);
    // Get the quote string for the index
    NSString *quote = [[filteredArray objectAtIndex:index] valueForKey:@"quote"];
    // Display quote
    self.quote_text.text = quote;
    // Update row to indicate that it has been displayed
    int quote_array_tot = (int)[self.Quotes count];
    NSString *quote1 = [[filteredArray objectAtIndex:index] valueForKey:@"quote"];
    for (int x = 0; x < quote_array_tot; x++) {
        NSString *quote2 = [[self.Quotes objectAtIndex:x] valueForKey:@"quote"];
        if ([quote1 isEqualToString:quote2]) {
            NSMutableDictionary *itemAtIndex = (NSMutableDictionary *)[self.Quotes objectAtIndex:x];
            [itemAtIndex setValue:@"DONE" forKey:@"source"];
        }
    }
}
Above is the code I use in my app to pick a random quote from one of two categories stored in a plist (as arrays, where the first entry is the category and the second is the quote). However, it seems to favor repeating quotes it has already shown. I'd prefer that it favor, though not exclusively, quotes it hasn't shown before.

Your question is an algorithm question. What you want is a sequence of numbers that seems random but is more uniform.
What you are looking for is called a low-discrepancy sequence. A simple form of this is a "shuffle bag", often used in game development.
With a shuffle bag, you generate all the indices (e.g. 0 1 2 3 4 5), shuffle them (e.g. 2 3 5 1 0 4), and then display the elements in that order. When the bag is exhausted, you generate another shuffled sequence (e.g. 4 1 0 2 3 5). Note that the same element can still appear twice in a row where two sequences meet, although this is rare. In this case the "4" repeats, because the full sequence is 2 3 5 1 0 4 4 1 0 2 3 5.
arc4random() is a good PRNG on Apple platforms, which is exactly why it doesn't give you a low-discrepancy sequence: every draw is independent of the previous ones, so repeats are expected. You can, however, use it as a primitive to generate low-discrepancy sequences or to implement a shuffle bag.
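A shuffle bag as described above could look roughly like this in plain C (which also compiles inside an Objective-C file). This is only a sketch: the ShuffleBag type, the bag_* function names and the BAG_MAX limit are made up for illustration, and rand() stands in for arc4random_uniform() so that the snippet builds anywhere. The swap at the end of bag_refill is an optional tweak that is not part of the basic technique: it suppresses the back-to-back repeat at the point where two shuffles meet, at the cost of a slightly less uniform shuffle.

```c
#include <assert.h>
#include <stdlib.h>

#define BAG_MAX 256 /* illustrative capacity limit */

typedef struct {
    int items[BAG_MAX]; /* current permutation of 0..count-1 */
    int count;          /* number of indices in the bag */
    int next;           /* position of the next index to hand out */
    int last;           /* most recently returned index, or -1 */
} ShuffleBag;

/* Refill the bag with 0..count-1 and shuffle it (Fisher-Yates). */
static void bag_refill(ShuffleBag *bag) {
    for (int i = 0; i < bag->count; i++) bag->items[i] = i;
    for (int i = bag->count - 1; i > 0; i--) {
        int j = rand() % (i + 1); /* on Apple platforms: arc4random_uniform(i + 1) */
        int tmp = bag->items[i];
        bag->items[i] = bag->items[j];
        bag->items[j] = tmp;
    }
    /* Optional: avoid an immediate repeat across the refill boundary. */
    if (bag->count > 1 && bag->items[0] == bag->last) {
        int tmp = bag->items[0];
        bag->items[0] = bag->items[bag->count - 1];
        bag->items[bag->count - 1] = tmp;
    }
    bag->next = 0;
}

static void bag_init(ShuffleBag *bag, int count) {
    assert(count > 0 && count <= BAG_MAX);
    bag->count = count;
    bag->last = -1;
    bag_refill(bag);
}

/* Returns the next index; reshuffles automatically when the bag is empty. */
static int bag_draw(ShuffleBag *bag) {
    if (bag->next == bag->count) bag_refill(bag);
    bag->last = bag->items[bag->next++];
    return bag->last;
}
```

In the quote app you would call bag_init once with the number of quotes in the selected category, then use bag_draw() wherever the question's code computes the random index.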

Related

Get sequence of random numbers' pairs (Objective-c)

Good morning, I'm trying to generate a sequence of N pairs of numbers, for example 1-0, 2-4, 4-3. Each number must be between 0 and 8, and no pair may appear more than once.
I don't want this: 1-3 1-3
I found that if a and b are the numbers, (a+b)+(a-b) has to be different for all couples of numbers.
So I tried to do that, but the loop never ends.
Would you please correct my code or write me another one? I need it as soon as possible.
NSNumber*number1;
int risultato;
int riga;
int colonna;
NSMutableArray*array=[NSMutableArray array];
NSMutableArray*righe=[NSMutableArray array];
NSMutableArray*colonne=[NSMutableArray array];
for(int i=0; i<27; i++)
{
riga=arc4random()%9;
colonna=arc4random()%9;
risultato=(riga+colonna)+(riga-colonna);
number1=[NSNumber numberWithInt:risultato];
while([array containsObject:number1])
{
riga=arc4random()%9;
colonna=arc4random()%9;
risultato=(riga+colonna)+(riga-colonna);
number1=[NSNumber numberWithInt:risultato];
}
NSNumber*row=[NSNumber numberWithBool:riga];
NSNumber*column=[NSNumber numberWithInt:colonna];
[righe addObject:row];
[colonne addObject:column];
[array addObject:number1];
}
for(int i=0; i<27; i++)
{
NSNumber*one=[righe objectAtIndex:i];
NSNumber*two=[colonne objectAtIndex:i];
NSLog(@"VALUE1 %ld VALUE2 %ld", (long)[one integerValue], (long)[two integerValue]);
}
edit:
I have two arrays (righe, colonne) and I want each of them to hold 27 elements in the range [0-8].
I want to obtain a sequence like this:
righe: 1 2 4 6 7 8 2 3 4 8 8 7
colonne: 1 3 4 4 2 1 5 2 7 6 5 6
I don't want to have that:
righe: 1 2 4 6 2
colonne: 1 3 5 2 3
Here you can see that the pair 2-3 appears twice. Then I'd like to store these values in a primitive 2d array (array[2][27])
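"I found that if a and b are the numbers, (a+b)+(a-b) has to be different for all couples of numbers."
This is just 2 * a and is not a valid test: the b terms cancel, so the result depends only on the first number. With a in 0 - 8 there are only 9 distinct results, so once 9 pairs have been stored the while loop can never find an unused value, which is why it never ends.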
What you are looking for are pairs of digits between 0 - 8, giving a total of 81 possible combinations.
Consider: Numbers written in base 9 (as opposed to the common bases of 2, 10 or 16) use the digits 0 - 8, and if you express the decimal numbers 0 -> 80 in base 9 you will get 0 -> 88 going through all the combinations of 0 - 8 for each digit.
Given that, you can restate your problem as: generate 27 numbers in the range 0 - 80 decimal with no duplicates, and express the resulting numbers in base 9. You can extract the two "digits" of each number using integer division (/ 9) and modulus (% 9).
To perform the duplicate test you can simply use an array of 81 boolean values: false - number not used, true - number used. For collisions you can just seek through the array (wrapping around) till you find an unused number.
"Then I'd like to store these values in a primitive 2d array (array[2][27])"
If that is the case, just store the numbers directly into such an array; using NSMutableArray is pointless.
So after that long explanation, the really short code:
#include <stdbool.h>  // bool
#include <stdlib.h>   // arc4random_uniform
#include <string.h>   // memset

int pairs[2][27];
bool used[81]; // for the collision testing
// set used to all false
memset(used, false, sizeof(used));
for (int ix = 0; ix < 27; ix++)
{
    // get a random number in 0 - 80
    int candidate = arc4random_uniform(81);
    // make sure we haven't used this one yet;
    // on a collision, step forward (wrapping around) to the next unused number
    while (used[candidate]) candidate = (candidate + 1) % 81;
    // record the two base-9 digits
    pairs[0][ix] = candidate / 9;
    pairs[1][ix] = candidate % 9;
    // mark as used
    used[candidate] = true;
}
HTH
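One side note on the collision handling above: stepping forward to the next unused number makes values that sit right after a run of used numbers slightly more likely than the rest. If that matters, a different technique, a partial Fisher-Yates shuffle of the codes 0 - 80, picks every remaining code with equal probability and never has to retry. The sketch below is not the code from this answer; distinct_pairs, SIDE and CELLS are illustrative names, rand() stands in for arc4random_uniform(), and the output array is sized for all 81 pairs rather than exactly 27.

```c
#include <assert.h>
#include <stdlib.h>

enum { SIDE = 9, CELLS = SIDE * SIDE };

/* Fills pairs[0][i] (row) and pairs[1][i] (column) for i < count
 * with distinct pairs of digits in 0..SIDE-1. */
static void distinct_pairs(int count, int pairs[2][CELLS]) {
    assert(count >= 0 && count <= CELLS);
    int codes[CELLS];
    for (int i = 0; i < CELLS; i++) codes[i] = i;
    for (int i = 0; i < count; i++) {
        /* pick one of the not-yet-chosen codes and move it to slot i */
        int j = i + rand() % (CELLS - i); /* Apple: i + arc4random_uniform(CELLS - i) */
        int tmp = codes[i];
        codes[i] = codes[j];
        codes[j] = tmp;
        /* split the code into its two base-9 digits */
        pairs[0][i] = codes[i] / SIDE;
        pairs[1][i] = codes[i] % SIDE;
    }
}
```

Calling distinct_pairs(27, pairs) fills the first 27 columns; the rest of the array is left untouched.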
Your assumption about (a+b)+(a-b) is incorrect: this formula effectively equals 2*a, which is obviously not what you want. I suggest storing the numbers in a CGPoint struct and checking in a do...while loop if you already have the newly generated tuple in your array:
// this array will contain objects of type NSValue,
// since you can not store CGPoint structs in NSMutableArray directly
NSMutableArray* array = [NSMutableArray array];
for(int i=0; i<27; i++) {
// declare a new CGPoint struct
CGPoint newPoint;
do {
// generate values for the CGPoint x and y fields
newPoint = CGPointMake(arc4random_uniform(9), arc4random_uniform(9));
} while([array indexOfObjectPassingTest:^BOOL(NSValue* _Nonnull pointValue, NSUInteger idx, BOOL * _Nonnull stop) {
// here we retrieve CGPoint structs from the array one by one
CGPoint point = [pointValue CGPointValue];
// and check if one of them equals to our new point
return CGPointEqualToPoint(point, newPoint);
}] != NSNotFound);
// previous while loop would regenerate CGPoint structs until
// we have no match in the array, so now we are sure that
// newPoint has unique values, and we can store it in the array
[array addObject:[NSValue valueWithCGPoint:newPoint]];
}
for(int i=0; i<27; i++)
{
NSValue* value = array[i];
// array contains NSValue objects, so we must convert them
// back to CGPoint structs
CGPoint point = [value CGPointValue];
NSInteger one = point.x;
NSInteger two = point.y;
NSLog(@"VALUE1 %ld VALUE2 %ld", (long)one, (long)two);
}

Add missing years and corresponding values into arrays

I've been messing around with the JBChartView library and it seems really good for charting. It's easy enough to use, but I'm having some problems getting my data into the format I need for a particular chart.
The user can enter a value and a corresponding year. This is saved using Core Data. The data could look as follows:
Year: 0 Value: 100
Year: 2 Value: 200
Year: 3 Value: 150
I would create 2 arrays, one for the year numbers and another for the values. In this case, though, I would get 3 bars. What I'd like is an extra bar with value 0 for Year 1.
I think the best way to approach this would be to loop through the Year array, check whether the first value is 0, then check whether every consecutive year value is +1 from the previous one. If not, add 1 to the previous year and insert a value of 0 into the values array at the same index position.
I would like to know if this is the best approach and if I could get some help doing the comparison.
Thanks
OK, I got to an answer to my own question and thought I'd post it, as it may help someone in the future, especially when creating charts using this or other libraries.
I first populate 2 mutable arrays:
chartLegend = [NSMutableArray arrayWithObjects:@1, @3, nil];
chartData = [NSMutableArray arrayWithObjects:@"100", @"300", nil];
So I've got years 1 and 3, each with an associated value in the chartData array.
I now need to create a year 0 and a year 2 so that my bar chart has a bar for every year from 0 to my maximum year, 3.
- (void)addItemsToArray {
    // Inserting at index i pushes the current item to i+1, so it is checked
    // again on the next pass; gaps of several years get filled one year at a time
    for (int i = 0; i < [chartLegend count]; i++)
    {
        //get the values from our array that are required for any calculations
        int intPreviousValue = 0;
        int intCurrentValue = [[chartLegend objectAtIndex:i] intValue];
        if (i > 0)
        {
            intPreviousValue = [[chartLegend objectAtIndex:(i-1)] intValue];
        }
        //Deal with the first item in the array which should be 0
        if (i == 0)
        {
            if (intCurrentValue != 0)
            {
                [chartLegend insertObject:[NSNumber numberWithInt:0] atIndex:i];
                [chartData insertObject:[NSNumber numberWithInt:0] atIndex:i];
            }
        }
        //Now deal with all other array items
        else if (intCurrentValue - intPreviousValue != 1)
        {
            int intNewValue = intPreviousValue + 1;
            [chartLegend insertObject:[NSNumber numberWithInt:intNewValue] atIndex:i];
            [chartData insertObject:[NSNumber numberWithInt:0] atIndex:i];
        }
    }
    //create a string with all of the values in the array
    NSString *dates = [chartLegend componentsJoinedByString:@","];
    NSString *values = [chartData componentsJoinedByString:@","];
    //display the text in a couple of labels to check you get the intended result
    self.yearsLabel.text = dates;
    self.valuesLabel.text = values;
}
That seems to be working for me. It should be easy enough to populate your arrays from Core Data; just make sure the data is sorted by year first.
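If the chart only needs one value per year from 0 up to the maximum year, an alternative to inserting into the arrays is to build a dense array in a single pass. Here is a minimal sketch in plain C, assuming the years are non-negative and already sorted ascending; fill_missing_years and its parameters are hypothetical names, not part of JBChartView or Core Data.

```c
#include <assert.h>
#include <stddef.h>

/* Expands sparse (year, value) samples into dense[], where dense[y] is the
 * value for year y and 0 for every year without a sample.
 * years[] must be sorted ascending and non-negative.
 * Returns the number of entries written (max year + 1), or 0 if n == 0. */
static size_t fill_missing_years(const int *years, const int *values, size_t n,
                                 int *dense, size_t capacity) {
    if (n == 0) return 0;
    assert(years[0] >= 0);
    size_t len = (size_t)years[n - 1] + 1; /* last year is the largest */
    assert(len <= capacity);
    for (size_t y = 0; y < len; y++) dense[y] = 0;                /* default bars */
    for (size_t i = 0; i < n; i++) dense[years[i]] = values[i];  /* known bars */
    return len;
}
```

With years {1, 3} and values {100, 300} this yields the four bars 0, 100, 0, 300.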

How to print out an integer raised to the 100th power (handling overflow)

So my friend asked me this question as interview practice:
Using Objective-C & Foundation Kit, Write a method that takes a single digit int, and logs out to the console the precise result of that int being raised to the power of 100.
Initially I thought it sounded easy, but then I realized that even a single digit number raised to the power of 100 would quickly come close to 100 digits, which would overflow.
So I tried tackling this problem by creating an NSArray w/ NSNumbers (for reflection), where each object in the array is a place in the final result number. Then I perform the multiplication math (including factoring in carries), and then print out a string formed by concatenating the objects in the array. Here is my implementation w/ input 3:
NSNumber *firstNum = [NSNumber numberWithInteger:3];
NSMutableArray *numArray = [NSMutableArray arrayWithArray:@[firstNum]];
for (int i = 0; i < 99; i++)
{
    int previousCarry = 0;
    for (int j = 0; j < [numArray count]; j++)
    {
        int newInt = [firstNum intValue] * [[numArray objectAtIndex:j] intValue] + previousCarry;
        NSNumber *calculation = [NSNumber numberWithInteger:newInt];
        previousCarry = [calculation intValue] / 10;
        NSNumber *newValue = [NSNumber numberWithInteger:(newInt % 10)];
        [numArray replaceObjectAtIndex:j withObject:newValue];
    }
    if (previousCarry > 0)
    {
        [numArray addObject:[NSNumber numberWithInteger:previousCarry]];
    }
}
NSArray *reversedArray = [[numArray reverseObjectEnumerator] allObjects];
NSString *finalNumber = [reversedArray componentsJoinedByString:@""];
NSLog(@"%@", finalNumber);
This isn't a problem out of a textbook or anything, so I don't have any reference to double-check my work. How does this solution sound to you guys? I'm a little worried that it's pretty naive even though the complexity is O(N). I can't help but feel like I'm not utilizing a type/class or method unique to Objective-C or Foundation Kit that would maybe produce a more optimal solution, or at the very least make the algorithm cleaner and look more impressive.
"Write a method that takes a single digit int, and logs out to the console the precise result of that int being raised to the power of 100."
That strikes me as a typical interview "trick"[*] question - "single digit", "logs out to console"...
Here goes:
NSString *singleDigitTo100(int d)
{
    static NSString *powers[] =
    {
        @"0",
        @"1",
        @"1267650600228229401496703205376",
        @"515377520732011331036461129765621272702107522001",
        @"1606938044258990275541962092341162602522202993782792835301376",
        @"7888609052210118054117285652827862296732064351090230047702789306640625",
        @"653318623500070906096690267158057820537143710472954871543071966369497141477376",
        @"3234476509624757991344647769100216810857203198904625400933895331391691459636928060001",
        @"2037035976334486086268445688409378161051468393665936250636140449354381299763336706183397376",
        @"265613988875874769338781322035779626829233452653394495974574961739092490901302182994384699044001"
    };
    return powers[d % 10]; // simple bounds check (assumes d >= 0)...
}
And the rest is easy :-)
And if you are wondering, those numbers came from bc - standard command line calculator in U*ix and hence OS X. You could of course invoke bc from Objective-C if you really want to calculate the answers on the fly.
[*] It is not really a "trick" question but asking if you understand that sometimes the best solution is a simple lookup table.
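If you would rather compute the digits than hard-code them, the digit-array idea from the question carries over to plain C without any NSNumber boxing. The following is only a sketch under a few assumptions: digit_power is a made-up name, the base must be a single digit, and the fixed 128-digit buffer is enough for a single digit raised to the 100th power (9^100 has 96 digits) but not for arbitrary exponents.

```c
#include <assert.h>
#include <stddef.h>

/* Writes base^exponent in decimal into out (NUL-terminated).
 * base must be 0..9. Returns the number of digits written,
 * or 0 if out is too small. */
static size_t digit_power(int base, int exponent, char *out, size_t out_size) {
    assert(base >= 0 && base <= 9 && exponent >= 0);
    unsigned char digits[128] = {1}; /* least significant digit first; value 1 */
    size_t count = 1;
    for (int e = 0; e < exponent; e++) {
        int carry = 0;
        for (size_t i = 0; i < count; i++) {
            int product = digits[i] * base + carry;
            digits[i] = (unsigned char)(product % 10);
            carry = product / 10;
        }
        if (carry > 0) { /* carry is at most 8, so it is a single digit */
            assert(count < sizeof digits);
            digits[count++] = (unsigned char)carry;
        }
    }
    if (count + 1 > out_size) return 0;
    /* digits are stored in reverse, so copy them out back to front */
    for (size_t i = 0; i < count; i++) out[i] = (char)('0' + digits[count - 1 - i]);
    out[count] = '\0';
    return count;
}
```

For base 2 and exponent 100 this produces the same string as the third entry of the lookup table above, which makes the table a handy cross-check.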
As you have correctly figured out, you will need to use some sort of big integer library. This is a nice example you can refer to: https://mattmccutchen.net/bigint/
Furthermore, you can calculate x^n in O(lg(n)) rather than in O(n), using divide and conquer:
f(x, n):
    if n == 0:  # Stopping condition
        return 1
    temp = f(x, n/2)  # integer division
    result = temp * temp
    if n%2 == 1:
        result *= x
    return result

x = 5  # Or another one digit number.
n = 100
result = f(x, n)  # This is the result you are looking for.
Note that x represents your integer and n the power you are raising x to.
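To make the squaring idea concrete, here is an iterative sketch in C using a fixed-width integer. The name ipow is illustrative. Note the limitation: a 64-bit result cannot hold x^100 for x >= 2, so for the interview question the multiplications would have to be done by a big-integer type; the asserts below simply stop the function if a multiplication would overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Computes x^n with O(log n) multiplications by repeated squaring.
 * The result must fit in 64 bits; overflow trips an assert. */
static uint64_t ipow(uint64_t x, unsigned n) {
    uint64_t result = 1;
    while (n > 0) {
        if (n & 1u) {               /* this bit of n contributes the current power */
            assert(x == 0 || result <= UINT64_MAX / x);
            result *= x;
        }
        n >>= 1;
        if (n > 0) {                /* square only if another bit remains */
            assert(x <= UINT32_MAX);
            x *= x;
        }
    }
    return result;
}
```

For n = 100 the loop runs 7 times instead of 100, which is where the O(lg(n)) bound comes from.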

Comparing Multiple Word names with Levenshtein Distances

I'm comparing building names on my campus with input from various databases. People entered these names, and everyone uses their own abbreviation scheme. I'm trying to find the best match from a user input to a canonical form of the name.
I've implemented a recursive Levenshtein Distance method, but there are a few edge cases I'm trying to tackle. My implementation is on GitHub.
Some of the building names are one word, while others are two. A single word on a single word produces fairly accurate results, but there are two things that I need to keep in mind.
Abbreviations: Assuming an input is a shortened version of a name, I can sometimes get the same Levenshtein Distance between the input and an arbitrary name, as well as the correct name.
For example, if my input is "Ing" and the building names [1] are ["Boylan", "Ingersoll", "Whitman", "Whitehead", "Roosevelt", and "Library"], I end up with an LD of 6 for both Boylan and Ingersoll. The desired result here is Ingersoll.
Multiword Strings: The second edge case is when the input and/or output is two words, separated by a space. For example, New Ing is an abbreviation for New Ingersoll. In this case, New Ingersoll and Boylan both score a Levenshtein Distance of 6. If I were to split the strings, New matches New perfectly, and then I just have to refer back to the solution to my previous edge case.
What's the best way to handle these two edge cases?
[1] For the curious, these are the buildings at Brooklyn College, in New York City.
I think you should use the length of the Longest Common Subsequence instead of the Levenshtein Distance. That seems to be a better metric for your case. In essence, it prioritizes insertions and deletions over substitutions, as I suggested in my comment.
It clearly distinguishes between "Ing" -> "Ingersoll" and "Ing" -> "Boylan" (scores of 3 and 1), handles spaces without a problem ("New Ing" -> "New Ingersoll" scores 7, whereas "New Ing" -> "Boylan" again scores 1), and will also work nicely should you come across an abbreviation like "Ingsl".
The algorithm is straightforward. Where your two strings have length m and n, compare successive prefixes of the strings characterwise (starting with the empty prefixes), keeping scores in a matrix of size m+1, n+1. If a particular pair matches, add one to the score of the previous two prefixes (one row up and one column left in the matrix); otherwise keep the highest of the two scores of those prefixes (compare the entry immediately above and the entry immediately left and take the best). When you've gone through both strings, the last entry in the score matrix is the length of the LCS.
Example score matrix for "Ingsll" and "Ingersoll":
         0  1  2  3  4  5  6
            I  n  g  s  l  l
       ---------------------
0     |  0  0  0  0  0  0  0
1  I  |  0  1  1  1  1  1  1
2  n  |  0  1  2  2  2  2  2
3  g  |  0  1  2  3  3  3  3
4  e  |  0  1  2  3  3  3  3
5  r  |  0  1  2  3  3  3  3
6  s  |  0  1  2  3  4  4  4
7  o  |  0  1  2  3  4  4  4
8  l  |  0  1  2  3  4  5  5
9  l  |  0  1  2  3  4  5  6
Here's an ObjC implementation of the length. Most of the complexity here is just due to wanting to handle composed character sequences -- @"o̶" for example -- correctly.
#import <Foundation/Foundation.h>
@interface NSString (WSSComposedLength)
- (NSUInteger)WSSComposedLength;
@end
@implementation NSString (WSSComposedLength)
- (NSUInteger)WSSComposedLength
{
__block NSUInteger length = 0;
[self enumerateSubstringsInRange:(NSRange){0, [self length]}
options:NSStringEnumerationByComposedCharacterSequences | NSStringEnumerationSubstringNotRequired
usingBlock:^(NSString *substring, NSRange substringRange, NSRange enclosingRange, BOOL *stop) {
length++;
}];
return length;
}
@end
@interface NSString (WSSLongestCommonSubsequence)
- (NSUInteger)WSSLengthOfLongestCommonSubsequenceWithString:(NSString *)target;
- (NSString *)WSSLongestCommonSubsequenceWithString:(NSString *)target;
@end
@implementation NSString (WSSLongestCommonSubsequence)
- (NSUInteger)WSSLengthOfLongestCommonSubsequenceWithString:(NSString *)target
{
NSUInteger * const * scores;
scores = [[self scoreMatrixForLongestCommonSubsequenceWithString:target] bytes];
return scores[[target WSSComposedLength]][[self WSSComposedLength]];
}
- (NSString *)WSSLongestCommonSubsequenceWithString:(NSString *)target
{
NSUInteger * const * scores;
scores = [[self scoreMatrixForLongestCommonSubsequenceWithString:target] bytes];
//FIXME: Implement this.
return nil;
}
- (NSData *)scoreMatrixForLongestCommonSubsequenceWithString:(NSString *)target{
NSUInteger selfLength = [self WSSComposedLength];
NSUInteger targetLength = [target WSSComposedLength];
NSMutableData * scoresData = [NSMutableData dataWithLength:(targetLength + 1) * sizeof(NSUInteger *)];
NSUInteger ** scores = [scoresData mutableBytes];
for( NSUInteger i = 0; i <= targetLength; i++ ){
scores[i] = [[NSMutableData dataWithLength:(selfLength + 1) * sizeof(NSUInteger)] mutableBytes];
}
/* Ranges in the enumeration Block are the same measure as
* -[NSString length] -- i.e., 16-bit code units -- as opposed to
* _composed_ length, which counts composed character sequences. Thus:
*
* Enumeration will miss the last character if composed length is used
* as the range and there's a substring range with length greater than one.
*/
NSRange selfFullRange = (NSRange){0, [self length]};
NSRange targetFullRange = (NSRange){0, [target length]};
/* Have to keep track of these indexes by hand, rather than using the
* Block's substringRange.location because, e.g., @"o̶", will have
* range {x, 2}, and the next substring will be {x+2, l}.
*/
__block NSUInteger col = 0;
__block NSUInteger row = 0;
[target enumerateSubstringsInRange:targetFullRange
options:NSStringEnumerationByComposedCharacterSequences
usingBlock:^(NSString * targetSubstring,
NSRange targetSubstringRange,
NSRange _, BOOL * _0)
{
row++;
col = 0;
[self enumerateSubstringsInRange:selfFullRange
options:NSStringEnumerationByComposedCharacterSequences
usingBlock:^(NSString * selfSubstring,
NSRange selfSubstringRange,
NSRange _, BOOL * _0)
{
col++;
NSUInteger newScore;
if( [selfSubstring isEqualToString:targetSubstring] ){
newScore = 1 + scores[row - 1][col - 1];
}
else {
NSUInteger upperScore = scores[row - 1][col];
NSUInteger leftScore = scores[row][col - 1];
newScore = MAX(upperScore, leftScore);
}
scores[row][col] = newScore;
}];
}];
return scoresData;
}
@end
int main(int argc, const char * argv[])
{
    @autoreleasepool {
        NSArray * testItems = @[@{@"source" : @"Ingso̶ll",
                                  @"targets": @[
                                      @{@"string" : @"Ingersoll",
                                        @"score" : @6,
                                        @"sequence" : @"Ingsll"},
                                      @{@"string" : @"Boylan",
                                        @"score" : @1,
                                        @"sequence" : @"n"},
                                      @{@"string" : @"New Ingersoll",
                                        @"score" : @6,
                                        @"sequence" : @"Ingsll"}]},
                                @{@"source" : @"Ing",
                                  @"targets": @[
                                      @{@"string" : @"Ingersoll",
                                        @"score" : @3,
                                        @"sequence" : @"Ing"},
                                      @{@"string" : @"Boylan",
                                        @"score" : @1,
                                        @"sequence" : @"n"},
                                      @{@"string" : @"New Ingersoll",
                                        @"score" : @3,
                                        @"sequence" : @"Ing"}]},
                                @{@"source" : @"New Ing",
                                  @"targets": @[
                                      @{@"string" : @"Ingersoll",
                                        @"score" : @3,
                                        @"sequence" : @"Ing"},
                                      @{@"string" : @"Boylan",
                                        @"score" : @1,
                                        @"sequence" : @"n"},
                                      @{@"string" : @"New Ingersoll",
                                        @"score" : @7,
                                        @"sequence" : @"New Ing"}]}];
        for( NSDictionary * item in testItems ){
            NSString * source = item[@"source"];
            for( NSDictionary * target in item[@"targets"] ){
                NSString * targetString = target[@"string"];
                NSCAssert([target[@"score"] integerValue] ==
                          [source WSSLengthOfLongestCommonSubsequenceWithString:targetString],
                          @"");
                // NSCAssert([target[@"sequence"] isEqualToString:
                //            [source longestCommonSubsequenceWithString:targetString]],
                //           @"");
            }
        }
    }
    return 0;
}
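The method that returns the subsequence itself is left as a FIXME above. For reference, the usual way to recover it is to walk the score matrix back from the bottom-right corner. Below is a rough sketch of that in plain C rather than as an NSString category. It compares single bytes, so unlike the ObjC code it does not handle composed character sequences; lcs and LCS_MAX are illustrative names, and the 64-character limit is an arbitrary assumption to keep the matrix on the stack.

```c
#include <assert.h>
#include <string.h>

enum { LCS_MAX = 64 }; /* illustrative upper bound on input length */

/* Writes a longest common subsequence of a and b into out
 * (at least LCS_MAX + 1 bytes) and returns its length. */
static size_t lcs(const char *a, const char *b, char *out) {
    size_t m = strlen(a), n = strlen(b);
    assert(m <= LCS_MAX && n <= LCS_MAX);
    unsigned char score[LCS_MAX + 1][LCS_MAX + 1];
    for (size_t i = 0; i <= m; i++) score[i][0] = 0; /* empty prefix of b */
    for (size_t j = 0; j <= n; j++) score[0][j] = 0; /* empty prefix of a */
    for (size_t i = 1; i <= m; i++) {
        for (size_t j = 1; j <= n; j++) {
            if (a[i - 1] == b[j - 1]) {
                score[i][j] = (unsigned char)(score[i - 1][j - 1] + 1);
            } else {
                unsigned char up = score[i - 1][j], left = score[i][j - 1];
                score[i][j] = up > left ? up : left;
            }
        }
    }
    size_t len = score[m][n];
    out[len] = '\0';
    /* Backtrack: a match consumes one character of each string,
     * otherwise move toward the neighbour with the higher score. */
    size_t i = m, j = n, k = len;
    while (i > 0 && j > 0) {
        if (a[i - 1] == b[j - 1]) {
            out[--k] = a[i - 1];
            i--;
            j--;
        } else if (score[i - 1][j] >= score[i][j - 1]) {
            i--;
        } else {
            j--;
        }
    }
    return len;
}
```

When several subsequences share the maximum length, this returns just one of them; which one depends on the tie-break in the backtracking step.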
I think the Levenshtein distance is only useful when you are dealing with nearly similar words like casual misspellings. If the Levenshtein distance is longer than the word itself, it has no valuable meaning as likeness value. (In your example, "Ing" and "Boylan" haven't got anything in common; no-one would confuse these words. To get from "Ing" to "Boylan", you need six edits, twice as many as the word has letters.) I wouldn't even consider the Levenshtein distance between words that have significantly different lengths like "Ing" and "Ingersoll" and declare them different.
Instead, I'd check words that are shorter than the original in abbreviation mode. To check whether a word is an abbreviation of a longer word, you could check that all letters of the abbreviation appear in the original in the same order. You should also enforce that the words start with the same letter. That method doesn't account for mistyped abbreviations, however.
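The abbreviation check just described (same first letter, all letters present in the same order) is short enough to sketch in C. The function name is_abbreviation is made up, and comparing case-insensitively is my own assumption; drop the tolower calls if case should matter.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* True if every letter of abbr appears in name in the same order
 * and both start with the same letter (compared case-insensitively). */
static bool is_abbreviation(const char *abbr, const char *name) {
    assert(abbr != NULL && name != NULL);
    if (*abbr == '\0' ||
        tolower((unsigned char)*abbr) != tolower((unsigned char)*name)) {
        return false;
    }
    while (*abbr != '\0' && *name != '\0') {
        /* advance in abbr only when its current letter is found in name */
        if (tolower((unsigned char)*abbr) == tolower((unsigned char)*name)) abbr++;
        name++;
    }
    return *abbr == '\0'; /* all letters of abbr were matched */
}
```

With this, "Ing" and "Ingsl" are accepted for "Ingersoll", while "Ing" is rejected for "Boylan".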
I think that multiword strings are better parsed word-wise. Do you need to distinguish between Ingersoll and New Ingersoll? In that case, you could establish a scoring system where a word match scores 100, maybe with ten times the Levenshtein distance subtracted. A non-match has a negative score, say -100. Then you assess the score of each word and divide by the number of words in the building:
If your string is "Ingersoll":
    "Ingersoll" scores 100 / 1 == 100
    "New Ingersoll" scores 100 / 2 == 50
If your string is "New Ingersoll":
    "Ingersoll" scores (100 - 100) / 1 == 0
    "New Ingersoll" scores (100 + 100) / 2 == 100
The word-wise approach falls flat when you have abbreviations that contain letters from various words, e.g. "NI" or "NIng" for New Ingersoll, so maybe you should try the abbreviation match above on the whole building name if you can't find a match in word-to-word matching.
(I realise that all this isn't really an answer, but more a loose bunch of thoughts.)

Optimizing algorithm for matching duplicates

I've written a small utility program that identifies duplicate tracks in iTunes.
The actual matching of tracks takes a long time, and I'd like to optimize it.
I am storing track data in an NSMutableDictionary that stores individual track data in
NSMutableDictionaries keyed by trackID. These individual track dictionaries have
at least the following keys:
TrackID
Name
Artist
Duration (in milli ####.####)
To determine if any tracks match one another, I must check:
If the duration of two tracks are within 5 seconds of each other
Name matches
Artist matches
The slow way for me to do it is using two for-loops:
-(void)findDuplicateTracks {
NSArray *allTracks = [tracks allValues];
BOOL isMatch = NO;
int numMatches = 0;
// outer loop
NSMutableDictionary *track = nil;
NSMutableDictionary *otherTrack = nil;
for (int i = 0; i < [allTracks count]; i++) {
track = [allTracks objectAtIndex:i];
NSDictionary *summary = nil;
if (![claimedTracks containsObject:track]) {
NSAutoreleasePool *aPool = [[NSAutoreleasePool alloc] init];
NSInteger duration1 = [[track objectForKey:kTotalTime] integerValue];
NSString *nName = [track objectForKey:knName];
NSString *nArtist = [track objectForKey:knArtist];
// inner loop - no need to check tracks that have
// already appeared in i
for (int j = i + 1; j < [allTracks count]; j++) {
otherTrack = [allTracks objectAtIndex:j];
if (![claimedTracks containsObject:otherTrack]) {
NSInteger duration2 = [[otherTrack objectForKey:kTotalTime] integerValue];
// duration check
isMatch = (labs(duration1 - duration2) < kDurationThreshold);
// match name
if (isMatch) {
NSString *onName = [otherTrack objectForKey:knName];
isMatch = [nName isEqualToString:onName];
}
// match artist
if (isMatch) {
NSString *onArtist = [otherTrack objectForKey:knArtist];
isMatch = [nArtist isEqualToString:onArtist];
}
// save match data
if (isMatch) {
++numMatches;
// claim both tracks
[claimedTracks addObject:track];
[claimedTracks addObject:otherTrack];
if (![summary isMemberOfClass:[NSDictionary class]]) {
[track setObject:[NSNumber numberWithBool:NO] forKey:@"willDelete"];
summary = [self dictionarySummaryForTrack:track];
}
[otherTrack setObject:[NSNumber numberWithBool:NO] forKey:@"willDelete"];
[[summary objectForKey:kMatches]
addObject:otherTrack];
}
}
}
[aPool drain];
}
}
}
This becomes quite slow for large music libraries, and only uses 1
processor. One recommended optimization was to use blocks and process
the tracks in batches (of 100 tracks). I tried that. If my code
originally took 9 hours to run, it now takes about 2 hours on a
quad-core. That's still too slow. But (talking above my pay grade here)
perhaps there is a way to store all the values I need in a C structure that "fits on the stack" and then I wouldn't have to fetch the values from slower memory. This seems too low-level for me, but I'm willing to learn if I had an example.
BTW, I profiled this in Instruments and [NSCFSet member:] takes up
86.6% of the CPU time.
Then I thought I should extract all the durations into a sorted array so I would not have
to look up the duration value in the dictionary. I think that is a good
idea, but when I started to implement it, I wondered how to determine
the best batch size.
If I have the following durations:
Duration:  2  2  3  4  5  6  6 16 17 38 59
Index:     0  1  2  3  4  5  6  7  8  9 10
Then just by iterating over the array, I know that to find matching
tracks of the song at index 0, I only need to compare it against songs
up to index 6. That's great, I have my first batch. But now I have to
start over at index 1 only to find that its batch should also stop at
index 6 and exclude index 0. I'm assuming I'm wasting a lot of
processing cycles here determining what the batch should be/the duration
matches. This seemed like a "set" problem, but we didn't do much of
that in my Intro to Algorithms class.
My questions are:
1) What is the most efficient way to identify matching tracks? Is it
something similar to what's above? Is it using disjoint and [unified]
set operations that are slightly above my knowledge level? Is it
filtering arrays using NSArray? Is there an online resource that
describes this problem and solution?
I am willing to restructure the tracks dictionary in whatever way
(datastructure) is most efficient. I had at first thought I needed to
perform many lookups by TrackID, but that is no longer the case.
2) Is there a more efficient way to approach this problem? How do you
rock stars go from paragraph 1 to an optimized solution?
I have searched for the answer, longer than I care to admit, and found
these interesting, but unhelpful answers:
find duplicates
Find all duplicates and missing values in a sorted array
Thanks for any help you can provide,
Lance
My first thought is to maintain some sorted collections as indices into your dictionary so you can stop doing an O(n^2) search comparing every track to every other track.
If you had arrays of TrackIDs sorted by duration then for any track you could do a more efficient O(log n) binary search to find tracks with durations within your 5 second tolerance.
Even better for artist and name you can store a dictionary keyed on the artist or track name whose values are arrays of TrackIDs. Then you only need a O(1) lookup to get the set of tracks for a particular artist which should allow you to very quickly determine if there are any possible duplicates.
Finally, if you've built that sort of dictionary of titles to TrackIDs, then you can go through all of its keys and only search for duplicates when there are multiple tracks with the same title. Doing further comparisons only when there are multiple tracks with the same title should eliminate a significant percentage of the library and massively reduce your search time (down to O(n) to build the dictionary and another O(n) for a worst case search for duplicates still leaves you at O(n) rather than the O(n^2) you have now).
If nothing else, do that last optimization; the resulting performance increase should be huge for a library without a significant number of duplicates:
NSMutableArray *possibleDuplicates = [NSMutableArray array];
NSMutableDictionary *knownTitles = [NSMutableDictionary dictionary];
// iterate over the track dictionaries (the values), not their TrackID keys
for (NSMutableDictionary *track in [tracks allValues]) {
    if ([knownTitles objectForKey:[track objectForKey:@"title"]] != nil) {
        [possibleDuplicates addObject:track];
    }
    else {
        [knownTitles setObject:[track objectForKey:@"TrackID"] forKey:[track objectForKey:@"title"]];
    }
}
//check for duplicates of the tracks in possibleDuplicates only.
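On the sorted-durations idea: the batch for each track does not have to be searched from scratch. Because the durations are sorted, the end of the window can only move forward as the start moves forward, so a two-pointer sweep finds every candidate pair in one pass over the array plus the number of pairs. A minimal sketch in C, with count_close_pairs as a made-up name and the durations in whatever unit the threshold uses:

```c
#include <assert.h>
#include <stddef.h>

/* Counts pairs (i, j) with i < j and durations[j] - durations[i] < threshold.
 * durations[] must be sorted ascending. */
static size_t count_close_pairs(const int *durations, size_t n, int threshold) {
    assert(threshold > 0);
    size_t pairs = 0;
    size_t end = 0; /* one past the last index still within threshold of i */
    for (size_t i = 0; i < n; i++) {
        if (end < i + 1) end = i + 1;
        /* end never moves backwards, so the sweep itself is O(n) */
        while (end < n && durations[end] - durations[i] < threshold) end++;
        pairs += end - i - 1; /* candidates for track i are i+1 .. end-1 */
    }
    return pairs;
}
```

For the durations 2 2 3 4 5 6 6 16 17 38 59 from the question and a threshold of 5, the window for index 0 ends at index 6, as described there. In real code you would compare name and artist for each candidate inside the window instead of just counting.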
There are several ways to do this, but here's my first naïve guess:
Have a mutable dictionary.
The keys in this dictionary are the names of the songs.
The value of each key is another mutable dictionary.
The keys of this secondary mutable dictionary are the artists.
The value of each key is a mutable array of songs.
You'd end up with something like this:
NSArray *songs = ...; //your array of songs
NSMutableDictionary *nameCache = [NSMutableDictionary dictionary];
for (Song *song in songs) {
    NSString *name = [song name];
    NSMutableDictionary *artistCache = [nameCache objectForKey:name];
    if (artistCache == nil) {
        artistCache = [NSMutableDictionary dictionary];
        [nameCache setObject:artistCache forKey:name];
    }
    NSString *artist = [song artist];
    NSMutableArray *songCache = [artistCache objectForKey:artist];
    if (songCache == nil) {
        songCache = [NSMutableArray array];
        [artistCache setObject:songCache forKey:artist];
    }
    for (Song *otherSong in songCache) {
        //these are songs that have the same name and artist
        NSTimeInterval myDuration = [song duration];
        NSTimeInterval otherDuration = [otherSong duration];
        if (fabs(myDuration - otherDuration) < 5.0f) {
            //name matches, artist matches, and their difference in duration is less than 5 seconds
        }
    }
    [songCache addObject:song];
}
This is a worst-case O(n^2) algorithm (if every song has the same name, artist, and duration). It's a best-case O(n) algorithm (if every song has a different name/artist/duration), and will realistically end up being closer to O(n) than to O(n^2) (most likely).
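If you are willing to drop down to C structs, as the question hints, a different way to get the same grouping is to sort the tracks by name, then artist, then duration, and compare only neighbours. That is O(n log n) regardless of how many duplicates there are. The sketch below is not the dictionary approach above; the Track struct, its field names and count_duplicates are all hypothetical. It also assumes that "duplicate" may chain: a track counts if it is within the threshold of the track just before it in sorted order, even if it is further away from the first track of that run.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    const char *name;
    const char *artist;
    long duration_ms;
} Track;

/* Orders tracks by name, then artist, then duration. */
static int compare_tracks(const void *lhs, const void *rhs) {
    const Track *a = lhs, *b = rhs;
    int c = strcmp(a->name, b->name);
    if (c == 0) c = strcmp(a->artist, b->artist);
    if (c != 0) return c;
    return (a->duration_ms > b->duration_ms) - (a->duration_ms < b->duration_ms);
}

/* Sorts tracks in place and returns how many tracks lie within threshold_ms
 * of the preceding track with the same name and artist. */
static size_t count_duplicates(Track *tracks, size_t n, long threshold_ms) {
    assert(threshold_ms > 0);
    qsort(tracks, n, sizeof *tracks, compare_tracks);
    size_t duplicates = 0;
    for (size_t i = 1; i < n; i++) {
        const Track *prev = &tracks[i - 1], *cur = &tracks[i];
        if (strcmp(prev->name, cur->name) == 0 &&
            strcmp(prev->artist, cur->artist) == 0 &&
            cur->duration_ms - prev->duration_ms < threshold_ms) {
            duplicates++;
        }
    }
    return duplicates;
}
```

Instead of counting, the loop body could just as well collect each matching pair for the summary the question's code builds.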