Parse NSString and replace certain substrings [duplicate] - objective-c

Closed 9 years ago.
I'm trying to write a method that will search an NSString, determine if an individual word within the string is over 6 characters long and replace that word with some other word (something arbitrary like 'hello').
I am starting with a long paragraph and I need to end up with a single NSString object whose format and spacing has not been affected by the find and replace.

Why another answer?
There are a couple of subtle problems with the simple solutions using componentsSeparatedByString::
Punctuation is not handled as word delimiters.
Whitespace other than the space character (newline, tab) is simply dropped.
On long strings a lot of memory is wasted.
It's slow.
Example
Assuming a substitution word of "–" a string like ...
“Essentially,” the D.H.C. concluded,
”bokanovskification consists of a series of arrests of development.”
... would result in ...
– the D.H.C. – – of a series of – of –
... while the correct output would be:
“–,” the D.H.C. –,”– – of a series of – of –.”
Solution
Fortunately there's a much better, yet simple solution in Cocoa: -[NSString enumerateSubstringsInRange:options:usingBlock:]
It provides fast iteration over substrings defined by the options argument. One possibility is NSStringEnumerationByWords, which enumerates all substrings that are actual words (in the current locale). It even detects individual words in languages that don't use delimiters (spaces) to separate words, like Japanese.
Comparing Solutions
Here's a simple demo project that works on the jargon file (1.6 MB, 237,239 words). It compares three different solutions:
componentsSeparatedByString: 270 ms
enumerateSubstringsInRange: 125 ms
stringByReplacingOccurrencesOfString, as described by @Monolo: 200 ms
Implementation
The core of it is the replacement loop:
NSMutableString *result = [NSMutableString stringWithCapacity:[originalString length]];
__block NSUInteger location = 0;
[originalString enumerateSubstringsInRange:(NSRange){0, [originalString length]}
                                   options:NSStringEnumerationByWords | NSStringEnumerationLocalized | NSStringEnumerationSubstringNotRequired
                                usingBlock:^(NSString *substring, NSRange substringRange, NSRange enclosingRange, BOOL *stop) {
    if (substringRange.length > maxChar) {
        // Copy the text since the last long word unchanged, then the replacement
        NSString *charactersBetweenLongWords = [originalString substringWithRange:(NSRange){ location, substringRange.location - location }];
        [result appendString:charactersBetweenLongWords];
        [result appendString:replaceWord];
        location = substringRange.location + substringRange.length;
    }
}];
// Append whatever follows the last long word
[result appendString:[originalString substringFromIndex:location]];
Caveat
As pointed out by Monolo, the proposed code uses NSString's length to determine the number of characters in a word. That's a questionable approach, to say the least. In fact, a string's length specifies the number of UTF-16 code units used to encode the string, a value that often differs from what a human would assume the number of characters to be.
As the term "character" has different meanings in various contexts and the OP didn't specify which kind of character count to use I just leave the code as it was. If you want a different count please refer to the documentation that discusses the topic:
Apple's String Programming Guide, Characters and Grapheme Clusters
Unicode FAQ: How are characters counted when measuring the length or position of a character in a string?
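If the count you want is user-perceived characters rather than UTF-16 code units, one possible approach is to count composed character sequences. The following is only a minimal sketch; the helper name ComposedCharacterCount is made up for illustration and is not part of the answer's demo project.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the original answer): counts composed
// character sequences, i.e. what a user would usually call "characters",
// instead of UTF-16 code units.
static NSUInteger ComposedCharacterCount(NSString *string) {
    __block NSUInteger count = 0;
    [string enumerateSubstringsInRange:NSMakeRange(0, [string length])
                               options:NSStringEnumerationByComposedCharacterSequences | NSStringEnumerationSubstringNotRequired
                            usingBlock:^(NSString *substring, NSRange substringRange, NSRange enclosingRange, BOOL *stop) {
        count++;
    }];
    return count;
}
```

In the replacement loop, the test substringRange.length > maxChar could then be swapped for a comparison against this count of the word's substring, at the cost of an extra enumeration per word.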

As you can see from the answers, there are several ways to accomplish what you are after, but personally I prefer to use the NSString class's stringByReplacingOccurrencesOfString:withString:options:range: method, which is made exactly to replace substrings with another string.
In your case we need the NSRegularExpressionSearch option, which lets us identify words with 7 or more letters (i.e., more than 6 letters, as you state it).
If you use the \w* character expression you will automatically get Unicode support, so it works on as many languages as Apple (actually, ICU) supports.
It goes like this:
NSString *stringWithLongWords = @"There are some words of extended length in this text. One of them is Escher's. They will be identified with a regular expression and changed for some arbitrary word.";
NSString *overSixCharsPattern = @"(?w)\\b[\\w]{7,}\\b";
NSString *replacementString = @"hello";
NSString *result = [stringWithLongWords stringByReplacingOccurrencesOfString:overSixCharsPattern
                                                                  withString:replacementString
                                                                     options:NSRegularExpressionSearch
                                                                       range:NSMakeRange(0, stringWithLongWords.length)];
The \b expressions denote a word boundary, which ensures that the whole word is matched and substituted. The w modifier makes \b use a more natural definition of word boundaries. Specifically, it handles the string "Escher's", the example mentioned by @NikolaiRuhe. Docs here, with a specific discussion of boundary detection here.
Also notice that a literal NSString (i.e., one you type directly in your Objective-C source file) needs two backslashes in the source code to produce one in the generated string.
There is more information in the NSString documentation
* Technically \w matches word characters, which also includes numbers in the definition used by regexes.

Related


How do I convert a unicode code point range into an NSString character range?

I have an NSString and a unicode code point range that represents a specific section of the text in that NSString. Since the characters in that NSString do not correspond one-to-one with code points, I need to somehow convert my code point range into the corresponding character range. How do I do this?
I know I can use the NSString method -rangeOfComposedCharacterSequencesForRange: to convert a character range to a grapheme cluster range, but what I want to do is sort of the opposite of that, and I can't find an inverse of that method in the APIs. And even if there was such a method available, I don't think this is exactly what I'm looking for, since (if I understand this correctly) a grapheme cluster is not the same thing as a unicode code point, and can in fact be composed of more than one code point.
What you have is kind of mixed data from two different worlds. You might typically get a Unicode code point range along with a UTF-32 string (where the correspondence is one-to-one) so that extracting the substring would be trivial. You have two options:
Work in the UTF-32 world before you put the data into an NSString
Convert the Unicode code point range into a UTF-16 unit range
I assume from your question that #2 is the easiest option in your case.
As you say, characters in an NSString do not correspond one-to-one with Unicode code points, since an NSString character is a UTF-16 unit. However, a Unicode code point corresponds to exactly 1 or 2 characters in an NSString. You can fairly easily write your own range conversion routine by iterating through the NSString characters and counting Unicode code points. This is made somewhat easier by the fact that you don't even care about the endianness of the UTF-16 data, since valid BMP characters, lead surrogates, and trail surrogates are disjoint. CFString provides some functions to determine what each character is. So in pseudocode your counting would look like:
for each NSString character {
    if (CFStringIsSurrogateHighCharacter(character) &&
        CFStringIsSurrogateLowCharacter(next character))
    {
        Skip forward another character in the NSString
    }
    Increment count of Unicode code points stepped through
}
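The counting loop above could be turned into a range conversion along the following lines. This is a sketch under stated assumptions rather than a definitive implementation: the function name UTF16RangeFromCodePointRange is hypothetical, an unpaired surrogate is counted as one code point, and an out-of-bounds input yields {NSNotFound, 0}.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: converts a range counted in Unicode code points into
// the equivalent range counted in UTF-16 code units (NSString indices).
// Returns {NSNotFound, 0} if the code point range runs past the end of the string.
static NSRange UTF16RangeFromCodePointRange(NSString *string, NSRange codePointRange) {
    NSUInteger length = [string length];
    NSUInteger index = 0;          // position in UTF-16 code units
    NSUInteger codePoints = 0;     // code points stepped through so far
    NSUInteger start = NSNotFound; // UTF-16 index where the range begins
    NSUInteger end = NSMaxRange(codePointRange);

    while (YES) {
        if (codePoints == codePointRange.location) start = index;
        if (codePoints == end) return NSMakeRange(start, index - start);
        if (index >= length) break;
        unichar c = [string characterAtIndex:index];
        // A well-formed surrogate pair is two code units but one code point.
        if (CFStringIsSurrogateHighCharacter(c) && index + 1 < length &&
            CFStringIsSurrogateLowCharacter([string characterAtIndex:index + 1])) {
            index += 2;
        } else {
            index += 1;
        }
        codePoints++;
    }
    return NSMakeRange(NSNotFound, 0);
}
```

The resulting range can be passed straight to -substringWithRange:. The walk is linear in the string length, so for repeated conversions on the same string it may be worth caching intermediate positions.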

numerical value of a unicode character in objective c

is it possible to get a numerical value from a unicode character in objective-c?
@"A" is 0041, @"➜" is 279C, @"Ω" is 03A9, @"झ" is 091D... ?
OK, so it’s perhaps worth pointing a few things out in a separate answer here. First, the term “character” is ambiguous, so we should choose a more appropriate term depending on what we mean. (See Characters and Grapheme Clusters in the Apple developer docs, as well as the Unicode website for more detail.)
If you are asking for the UTF-16 code unit, then you can use
unichar ch = [myString characterAtIndex:ndx];
Note that this is only equivalent to a Unicode code point when the code point lies within the Basic Multilingual Plane (i.e. it is U+FFFF or below).
If you are asking for the Unicode code point, then you should be aware that UTF-16 supports characters outside of the BMP (i.e. U+10000 and above) using surrogate pairs. Thus there will be two UTF-16 code units for any code point at or above U+10000. To detect this case, you need to do something like
uint32_t codepoint = [myString characterAtIndex:ndx];
if ((codepoint & 0xfc00) == 0xd800) {
    // Lead surrogate: combine it with the trail surrogate that follows
    unichar ch2 = [myString characterAtIndex:ndx + 1];
    codepoint = (((codepoint & 0x3ff) << 10) | (ch2 & 0x3ff)) + 0x10000;
}
Note that in production code, you should also test for and cope with the case where the surrogate pair has been truncated somehow.
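One way to cope with truncated pairs is sketched below. The helper name CodePointAtIndex and the choice to substitute U+FFFD (REPLACEMENT CHARACTER) for an unpaired surrogate are assumptions made for illustration; other policies, such as returning an error, are equally reasonable.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the code point that starts at `index`.
// An unpaired surrogate is reported as U+FFFD (an assumption of this sketch).
// If `outLength` is non-NULL it receives the number of UTF-16 code units consumed.
static uint32_t CodePointAtIndex(NSString *string, NSUInteger index, NSUInteger *outLength) {
    unichar c1 = [string characterAtIndex:index];
    uint32_t codePoint = c1;
    NSUInteger consumed = 1;
    if ((c1 & 0xfc00) == 0xd800) {
        // Lead surrogate: only valid if a trail surrogate follows it.
        unichar c2 = (index + 1 < [string length]) ? [string characterAtIndex:index + 1] : 0;
        if ((c2 & 0xfc00) == 0xdc00) {
            codePoint = ((((uint32_t)c1 & 0x3ff) << 10) | (c2 & 0x3ff)) + 0x10000;
            consumed = 2;
        } else {
            codePoint = 0xFFFD; // truncated pair
        }
    } else if ((c1 & 0xfc00) == 0xdc00) {
        codePoint = 0xFFFD;     // trail surrogate without a lead
    }
    if (outLength) *outLength = consumed;
    return codePoint;
}
```

Advancing by the returned length rather than by one keeps a loop over the string aligned on code point boundaries.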
Importantly, neither UTF-16 code units nor Unicode code points necessarily correspond to anything that an end-user would regard as a “character” (the Unicode consortium generally refers to this as a grapheme cluster to distinguish it from other possible meanings of “character”). There are many examples, but the simplest to understand are probably the combining diacritical marks. For instance, the character ‘Ä’ can be represented as the Unicode code point U+00C4, or as a pair of code points, U+0041 U+0308.
Sometimes people (like @DietrichEpp in the comments on his answer) will claim that you can deal with this by converting to precomposed form before dealing with your string. This is something of a red herring, because precomposed form only deals with characters that have a precomposed equivalent in Unicode. e.g. it will not help with all combining marks; it will not help with Indic or Arabic scripts; it will not help with Hangul Jamos. There are many other cases as well.
If you are trying to manipulate grapheme clusters (things the user might think of as “characters”), you should probably make use of the NSString methods -rangeOfComposedCharacterSequencesForRange: and -rangeOfComposedCharacterSequenceAtIndex:, or the CFString function CFStringGetRangeOfComposedCharactersAtIndex. Obviously you cannot hold a grapheme cluster in an integer variable and it has no inherent numerical value; rather, it is represented by a string of code points, which are represented by a string of code units. For instance:
NSRange gcRange = [myString rangeOfComposedCharacterSequenceAtIndex:ndx];
NSString *graphemeCluster = [myString substringWithRange:gcRange];
Note that graphemeCluster may be arbitrarily long(!)
Even then, we have ignored the effects of matters such as Unicode’s support for bidirectional text. That is, the order of the code points represented by the code units in your NSString may in some cases be the reverse of what you might expect. The worst cases involve things like English text embedded in Arabic or Hebrew; this is supported by the Cocoa Text system, and so you really can end up with bidirectional strings in your code.
To summarise: generally speaking one should avoid examining NSString and CFString instances unichar by unichar. If at all possible, use an appropriate NSString method or CFString function instead. If you do find yourself examining the UTF-16 code units, please familiarise yourself with the Unicode standard first (I recommend “Unicode Demystified” if you can’t stomach reading through the Unicode book itself), so that you can avoid the major pitfalls.
Cocoa strings allow you to access the UTF-16 elements using -characterAtIndex:, so the following code will convert the string to a unicode code point:
unsigned strToChar(NSString *str)
{
    unsigned c1, c2;
    c1 = [str characterAtIndex:0];
    if ((c1 & 0xfc00) == 0xd800) {
        c2 = [str characterAtIndex:1];
        return (((c1 & 0x3ff) << 10) | (c2 & 0x3ff)) + 0x10000;
    } else {
        return c1;
    }
}
I am not aware of any convenience functions for this. You can use -characterAtIndex: by itself if you are okay with your code breaking horribly when someone uses characters outside the BMP; a number of applications on OS X break horribly in this way.
The following should render as a musical "G clef", U+1D11E, but if you copy and paste it into some text editors (TextMate), they'll let you do bizarre things like delete half of the character, at which point your text file is garbage.
𝄞

UITextChecker 25 Letter Words

I believe this is an Apple bug, but wanted to run it by you all and see if anyone else had run into the same/similar issues.
Simply, Apple's UITextChecker finds all words 25 letters or more as valid, spelled correctly words. Go ahead and open up Notes on your iOS device (or TextEdit on OS X) and type in a random 24 letter word. Hit enter, underlined red, right? Now add one more letter to that line so it is a 25 letter word. Hit enter again, underline red, right ... nope!
I don't know if this is related, but I have a similar unanswered question out there (UITextChecker is what dictionary?) questioning what dictionary is used for UITextChecker. In /usr/share/dict/words the longest word is 24 letters. Seems rather coincidental that 25 letters would be the first length of word that is not in the dictionary and it is always accepted as a valid word. But I don't know if that word list is the dictionary for UITextChecker.
This is important to note for anyone who might be confirming the spelling of a given word for something like a game. You really don't want players to be able to use a random 25 letters to spell a word and most likely score massive points.
Here's my code to check for valid words:
- (BOOL) isValidWord:(NSString*)word {
    // word is all lowercase
    UITextChecker *checker = [[UITextChecker alloc] init];
    NSRange searchRange = NSMakeRange(0, [word length]);
    NSRange misspelledRange = [checker rangeOfMisspelledWordInString:word range:searchRange startingAt:0 wrap:NO language:@"en" ];
    [checker release];
    BOOL validWord = (misspelledRange.location == NSNotFound);
    BOOL passOneCharTest = ([word length] > 1 || [word isEqualToString:@"a"] || [word isEqualToString:@"i"]);
    BOOL passLengthTest = ([word length] > 0 && [word length] < 25); // I don't know any words more than 24 letters long
    return validWord && passOneCharTest && passLengthTest;
}
So my question to the community, is this a documented 'feature' that I just haven't been able to locate?
This is likely to be caused by the algorithm used for spell-checking itself although I admit it sounds like a bit of a hole.
Even spell-checkers that use a dictionary often tend to use an algorithm to get rid of false negatives. The classic is to ignore:
(a) single-character words followed by certain punctuation (like that (a) back there); and
(b) words consisting of all uppercase like NATO or CHOGM, assuming that they're quite valid acronyms.
If the algorithm for UITextChecker also considers 25+-letter words to be okay, that's just one of the things you need to watch out for.
It may well be related to the expected use case: it may be intended less as a perfect checker and more as a best-guess solution.
If you really want a perfect filter, you're probably better off doing your own, using a copy of the dictionary from somewhere. That way, you can exclude things that aren't valid in your game (acronyms in Scrabble®, for example).
You can also ensure you're not subject to the vagaries of algorithms that assume longer words are valid as appears to be the case here. Instead you could just assume any word not in your dictionary is invalid (but, of course, give the user the chance to add it if your dictionary is wrong).
Other than that, and filing a query/bug with Apple, there's probably not much else you can do.
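A do-it-yourself filter of the kind suggested above might look roughly like this. It is only a sketch: the helper names WordSetFromList and IsValidWord are invented here, and the word list is assumed to be a newline-separated string that you supply yourself (for example, loaded from a file bundled with the game).

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: builds a lookup set from a newline-separated word list.
static NSSet *WordSetFromList(NSString *wordList) {
    NSMutableSet *words = [NSMutableSet set];
    [wordList enumerateLinesUsingBlock:^(NSString *line, BOOL *stop) {
        NSString *word = [[line stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceCharacterSet]] lowercaseString];
        if ([word length] > 0) {
            [words addObject:word];
        }
    }];
    return words;
}

// Hypothetical helper: a word is valid only if it appears in the set,
// regardless of its length.
static BOOL IsValidWord(NSString *word, NSSet *dictionary) {
    return [dictionary containsObject:[word lowercaseString]];
}
```

Because anything absent from the set is rejected, a random 25-letter string fails just like any other non-word, and set membership is a constant-time lookup on average.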

Optimizing scanning large text and matching against list of words or phrases

I'm working on an app that takes an article (simple HTML page), and a list of vocabulary terms (each may be a word, a phrase, or even a sentence), and creates a link for each term it finds. The problem is that for larger texts with more terms it takes a long time. Currently we are dealing with this by initially displaying the unmarked text, processing the links in the background, and finally reloading the web view when processing finishes. Still, it can take a while and some of our users are not happy with it.
Right now the app uses a simple loop on the terms, doing a replacement in the HTML. Basically:
for (int i=0; i<terms.count; i++){
    NSString *term = [terms objectAtIndex:i];
    NSString *replaceString = [NSString stringWithFormat:@"<a href=\"myUrl:\\%d\">%@</a>", i, term];
    htmlString = [htmlString stringByReplacingOccurrencesOfString:term
                                                       withString:replaceString
                                                          options:NSCaseInsensitiveSearch
                                                            range:NSMakeRange(0, [htmlString length] )];
}
However, we are dealing with multiple languages, so there is not just one replacement per term, but twenty! That's because we have to deal with punctuation at the beginning (upside-down question marks in Spanish) and end of each term. We have to replace "term", "term.", and "term?" with an appropriate hyperlink.
Is there a more efficient method I could use to get this HTML marked up?
I need to keep the index of the original term so that it can be retrieved later when the user clicks the link.
You could process the text as follows:
Instead of looping over the vocabulary, split the text into words and look up each word in the vocabulary.
Create some index, hash table or dictionary to make the lookup efficient.
Don't use stringByReplacingOccurrencesOfString. Each time it's called it makes a copy of the whole text, and the memory isn't released until the autorelease pool is drained. (Interestingly, you haven't run into memory problems yet.) Instead, use an NSMutableString instance to which you append each word (and the characters between words), either as it was in the original text or decorated as a link.
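The three steps above can be sketched as follows. Several things here are assumptions for illustration only: the function name LinkTerms, the termIndexes dictionary (lowercased term mapped to its index as an NSNumber), and the placeholder href format. The sketch handles single-word terms only and does not skip HTML markup, both of which a real implementation would have to address.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper. `termIndexes` maps each lowercased single-word term
// to its index (an NSNumber) in the vocabulary list.
static NSString *LinkTerms(NSString *text, NSDictionary *termIndexes) {
    NSMutableString *result = [NSMutableString stringWithCapacity:[text length]];
    __block NSUInteger location = 0; // end of the last linked word
    [text enumerateSubstringsInRange:NSMakeRange(0, [text length])
                             options:NSStringEnumerationByWords
                          usingBlock:^(NSString *word, NSRange wordRange, NSRange enclosingRange, BOOL *stop) {
        NSNumber *termIndex = [termIndexes objectForKey:[word lowercaseString]];
        if (termIndex == nil) {
            return; // not a vocabulary term; it is copied later with the surrounding text
        }
        // Copy everything since the last link unchanged, then the decorated word.
        [result appendString:[text substringWithRange:NSMakeRange(location, wordRange.location - location)]];
        [result appendFormat:@"<a href=\"myUrl:%@\">%@</a>", termIndex, word];
        location = NSMaxRange(wordRange);
    }];
    [result appendString:[text substringFromIndex:location]];
    return result;
}
```

Because word enumeration already treats punctuation as a delimiter, variants such as "term." and "term?" need no separate replacement passes, and the text is traversed once regardless of the size of the vocabulary.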
What you're doing right now is this:
for each vocabulary word 'term'
search the HTML text for instances of term
replace each instance of term with an appropriate hyperlink
If you have a large text, then each search takes that much longer. Further, every time you do a replacement, you have to create a new string containing a copy of the text to do the replacement on, since stringByReplacingOccurrencesOfString:withString:options:range: returns a new string rather than modifying the existing string. Multiply that by N replacements.
A better option would be to make a single pass through the string, searching for all terms at once, and building up the resulting output string in a mutable string to avoid a Shlemiel the Painter-like runtime.
For example, you could use regular expressions like so:
// Create a regular expression that is an alternation of all of the vocabulary
// words. You only need to create this once at startup.
NSMutableString *pattern = [[[NSMutableString alloc] init] autorelease];
[pattern appendString:@"\\b("];
BOOL isFirstTerm = YES;
for (NSString *term in vocabularyList)
{
    if (!isFirstTerm)
    {
        [pattern appendString:@"|"];
    }
    isFirstTerm = NO;
    // Escape the term so that characters such as '.' or '?' match literally
    [pattern appendString:[NSRegularExpression escapedPatternForString:term]];
}
[pattern appendString:@")\\b"];
// Create regular expression object
NSError *error = nil;
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:pattern options:NSRegularExpressionCaseInsensitive error:&error];
// Replace vocabulary matches with a hyperlink. In the template, $1 stands for
// the matched term; wrap it in your hyperlink markup.
NSMutableString *htmlCopy = [[htmlString mutableCopy] autorelease];
[regex replaceMatchesInString:htmlCopy
                      options:0
                        range:NSMakeRange(0, [htmlCopy length])
                 withTemplate:@"$1"];
// Now use htmlCopy
Since the string replace function you're calling is order n (it scans and replaces across n words) and you're doing it for m vocabulary terms, you have an order n × m algorithm, which approaches n^2 as the vocabulary grows.
If you could do it in one pass, that would be optimal (order n, for n words in the HTML). The idea of presenting the un-replaced text first is still a good one unless the delay is unnoticeable even for large docs.
How about a hash set of vocabulary words? Scan through the HTML word by word (skipping HTML markup), and if the current word is in the hash set, append its hyperlinked form to the target buffer instead of the scanned word. That way you have at most 2 × the HTML content plus one hash set of vocabulary words in memory.
There are two approaches.
Hash maps: if the maximum length of your phrases is limited, for example to two words, you can iterate over all words and bigrams (two-word sequences) and look each one up in a hash map. The complexity is linear, since a hash lookup is ideally constant time.
Automata theory
You can combine the simple automata that match individual strings into a single automaton that evaluates faster (much like dynamic programming). For example, given "John Smith"|"John Stuard", we merge them and get John S(mith|tuard). This is so-called prefix optimisation (http://code.google.com/p/graph-expression/wiki/RegexpOptimization).
A more advanced algorithm can be found here: http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
I like this approach more because there is no limitation on phrase length and it allows combining complex regexps.