Apple Emoji / iterate through NSString - objective-c

I want to iterate through the 'characters' of an Emoji input String (from a UITextField) and then, one after another, display those emoji icons with a UILabel.
for (int i=0; i < len; i++) {
unichar c = [transformedString characterAtIndex:i];
[label setText:[NSString stringWithFormat:@"%C", c]];
...
This works for ASCII text but not Emoji fonts (all except the heart symbol are empty). As I noticed, a single Emoji icon is represented by 2 characters in the string.
As far as I know, Emoji uses private area unicode chars.
Is there any way to achieve this?
Thank you very much, you save me some headache ...

You can use one of the enumerate* instance methods on NSString, with the option NSStringEnumerationByComposedCharacterSequences.
- (void)enumerateSubstringsInRange:(NSRange)range
options:(NSStringEnumerationOptions)opts
usingBlock:(void (^)(NSString *substring,
NSRange substringRange,
NSRange enclosingRange,
BOOL *stop))block
NSString uses UTF-16 which represents some codepoints as two 16 bit values. You could also manually check for these 'surrogate pairs' in the string and manually combine them, but then you'd still only be getting codepoints rather than characters.
[transformedString
enumerateSubstringsInRange:NSMakeRange(0, [transformedString length])
options:NSStringEnumerationByComposedCharacterSequences
usingBlock:^(NSString *substring, NSRange substringRange, NSRange enclosingRange, BOOL *stop)
{
// substring holds one composed character sequence, e.g. a whole emoji
[label setText:substring];
}];
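For comparison, here is a rough sketch of the manual route mentioned above, where surrogate pairs are combined by hand. The helper name CodePointsInString is made up for illustration and is not part of Foundation. As noted, this only yields code points, not user-visible characters, so the enumeration approach is usually the better fit.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper (not a Foundation API): returns the Unicode code points
// of a string as NSNumbers, merging UTF-16 surrogate pairs by hand.
static NSArray *CodePointsInString(NSString *string) {
    NSMutableArray *codePoints = [NSMutableArray array];
    NSUInteger length = [string length];
    for (NSUInteger i = 0; i < length; i++) {
        unichar unit = [string characterAtIndex:i];
        uint32_t codePoint = unit;
        // A high surrogate (0xD800-0xDBFF) followed by a low surrogate
        // (0xDC00-0xDFFF) encodes one code point above U+FFFF.
        if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < length) {
            unichar next = [string characterAtIndex:i + 1];
            if (next >= 0xDC00 && next <= 0xDFFF) {
                codePoint = 0x10000 + (((uint32_t)unit - 0xD800) << 10) + ((uint32_t)next - 0xDC00);
                i++; // the low surrogate has been consumed
            }
        }
        [codePoints addObject:[NSNumber numberWithUnsignedInt:codePoint]];
    }
    return codePoints;
}
```

An unpaired surrogate is passed through unchanged in this sketch; real code may want to treat that as an error.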

Related

Call a method on every word in NSString

I would like to loop through an NSString and call a custom function on every word that has certain criterion (For example, "has 2 'L's"). I was wondering what the best way of approaching that was. Should I use Find/Replace patterns? Blocks?
-(NSString *)convert:(NSString *)wordToConvert{
/// This I have already written
return finalWord;
}
-(NSString *) method:(NSString *) sentenceContainingWords{
// match every word that meets the criteria (for example the 2Ls) and replace it with what convert: does.
}
To enumerate the words in a string, you should use -[NSString enumerateSubstringsInRange:options:usingBlock:] with NSStringEnumerationByWords and NSStringEnumerationLocalized. All of the other methods listed use a means of identifying words which may not be locale-appropriate or correspond to the system definition. For example, two words separated by a comma but not whitespace (e.g. "foo,bar") would not be treated as separate words by any of the other answers, but they are in Cocoa text views.
[aString enumerateSubstringsInRange:NSMakeRange(0, [aString length])
options:NSStringEnumerationByWords | NSStringEnumerationLocalized
usingBlock:^(NSString *substring, NSRange substringRange, NSRange enclosingRange, BOOL *stop){
if ([substring rangeOfString:@"ll" options:NSCaseInsensitiveSearch].location != NSNotFound)
/* do whatever */;
}];
As documented for -enumerateSubstringsInRange:options:usingBlock:, if you call it on a mutable string, you can safely mutate the string being enumerated within the enclosingRange. So, if you want to replace the matching words, you can with something like [aString replaceCharactersInRange:substringRange withString:replacementString].
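A minimal sketch of that in-place replacement, assuming a mutable string and using uppercaseString as a stand-in for the asker's convert: method. The function name UppercaseWordsContainingLL is hypothetical.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: upper-cases, in place, every word that contains "ll".
// uppercaseString stands in for whatever conversion you actually need.
static void UppercaseWordsContainingLL(NSMutableString *sentence) {
    [sentence enumerateSubstringsInRange:NSMakeRange(0, [sentence length])
                                 options:NSStringEnumerationByWords | NSStringEnumerationLocalized
                              usingBlock:^(NSString *substring, NSRange substringRange, NSRange enclosingRange, BOOL *stop) {
        if ([substring rangeOfString:@"ll" options:NSCaseInsensitiveSearch].location != NSNotFound) {
            // substringRange lies inside enclosingRange, so mutating it here is allowed.
            [sentence replaceCharactersInRange:substringRange
                                    withString:[substring uppercaseString]];
        }
    }];
}
```

Calling it on a mutable copy of "hello world, tall tree" should leave "world" and "tree" untouched and upper-case the other two words.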
The two ways I know of looping an array that will work for you are as follows:
NSArray *words = [sentence componentsSeparatedByCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
for (NSString *word in words)
{
NSString *transformedWord = [obj method:word];
}
and
NSArray *words = [sentence componentsSeparatedByCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
[words enumerateObjectsWithOptions:NSEnumerationConcurrent usingBlock:^(id word, NSUInteger idx, BOOL *stop){
NSString *transformedWord = [obj method:word];
}];
The other method, –makeObjectsPerformSelector:withObject:, won't work for you. It expects to be able to call [word method:obj] which is backwards from what you expect.
If you could write your criteria with regular expressions, then you could probably do a regular expression matching to fetch these words and then pass them to your convert: method.
You could also do a split of string into an array of words using componentsSeparatedByString: or componentsSeparatedByCharactersInSet:, then go over the words in the array and detect if they fit your criteria somehow. If they fit, then pass them to convert:.
Hope this helps.
As of iOS 12/macOS 10.14 the recommended way to do this is with the Natural Language framework.
For example:
import NaturalLanguage
let myString = "..."
let tokeniser = NLTokenizer(unit: .word)
tokeniser.string = myString
tokeniser.enumerateTokens(in: myString.startIndex..<myString.endIndex) { wordRange, attributes in
performActionOnWord(myString[wordRange])
return true // or return false to stop enumeration
}
Using NLTokenizer also has the benefit of allowing you to optionally specify the language of the string beforehand:
tokeniser.setLanguage(.hebrew)
I would recommend using a while loop to go through the string like this.
NSUInteger length = [sentenceContainingWords length];
NSUInteger wordStart = 0;
while (wordStart < length) {
// search for the next space, starting at the current word
NSRange spaceRange = [sentenceContainingWords rangeOfString:@" " options:0 range:NSMakeRange(wordStart, length - wordStart)];
NSUInteger wordEnd = (spaceRange.location == NSNotFound) ? length : spaceRange.location;
if (wordEnd > wordStart) { // skip empty words caused by consecutive spaces
NSString *wordString = [sentenceContainingWords substringWithRange:NSMakeRange(wordStart, wordEnd - wordStart)];
[self convert:wordString];
}
wordStart = wordEnd + 1; // continue after the space
}
This code would probably need to be rewritten because it's pretty rough, but you should get the idea.
Edit: Just saw Jacob Gorban's post, you should definitely do it like that.

Get Unicode point of NSString and put that into another NSString

What's the easiest way to get the Unicode value from an NSString? For example,
NSString *str = @"A";
NSString *hex;
Now, I want to set the value of hex to the Unicode value of str (i.e. 0041)... How would I go about doing that?
The unichar type is defined to be a 16-bit unicode value (eg, as indirectly documented in the description of the %C specifier), and you can get a unichar from a given position in an NSString using characterAtIndex:, or use getCharacters:range: if you want to fill a C array of unichars from the NSString more quickly than by querying them one by one.
NSUTF32StringEncoding is also a valid string encoding, as are a couple of endian-specific variants, in case you want to be absolutely future proof. You'd get a C array of those using the much more longwinded getBytes:maxLength:usedLength:encoding:options:range:remainingRange:.
EDIT: so, e.g.
NSString *str = @"A";
NSLog(@"16-bit unicode values are:");
for(int index = 0; index < [str length]; index++)
NSLog(@"%04x", [str characterAtIndex:index]);
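To get the value into another NSString, as the question asks, rather than just logging it, something along these lines might do. HexCodeUnitsOfString is a made-up name, and the sketch formats UTF-16 code units, so a character outside the Basic Multilingual Plane shows up as two values.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: formats every UTF-16 code unit of a string as four
// upper-case hex digits, separated by spaces.
static NSString *HexCodeUnitsOfString(NSString *string) {
    NSMutableArray *parts = [NSMutableArray array];
    for (NSUInteger index = 0; index < [string length]; index++) {
        unichar unit = [string characterAtIndex:index];
        [parts addObject:[NSString stringWithFormat:@"%04X", (unsigned int)unit]];
    }
    return [parts componentsJoinedByString:@" "];
}
```

For the string in the question, the hex variable would then be set with NSString *hex = HexCodeUnitsOfString(str);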
You can use
NSData * u = [str dataUsingEncoding:NSUnicodeStringEncoding];
NSString *hex = [u description];
You may replace NSUnicodeStringEncoding by NSUTF8StringEncoding, NSUTF16StringEncoding (the same as NSUnicodeStringEncoding) or NSUTF32StringEncoding, or many other values.
See here
for more

Most efficient way to iterate over all the chars in an NSString

What's the best way to iterate over all the chars in an NSString? Would you want to loop over the length of the string and use the method.
[aNSString characterAtIndex:index];
or would you want to use a char buffer based on the NSString?
I think it's important that people understand how to deal with unicode, so I ended up writing a monster answer, but in the spirit of tl;dr I will start with a snippet that should work fine. If you want to know details (which you should!), please continue reading after the snippet.
NSUInteger len = [str length];
unichar buffer[len+1];
[str getCharacters:buffer range:NSMakeRange(0, len)];
NSLog(@"getCharacters:range: with unichar buffer");
for(int i = 0; i < len; i++) {
NSLog(@"%C", buffer[i]);
}
Still with me? Good!
The current accepted answer seems to be confusing bytes with characters/letters. This is a common problem when encountering unicode, especially from a C background. Strings in Objective-C are represented as unicode characters (unichar) which are much bigger than bytes and shouldn't be used with standard C string manipulation functions.
(Edit: This is not the full story! To my great shame, I'd completely forgotten to account for composable characters, where a "letter" is made up of multiple unicode codepoints. This gives you a situation where you can have one "letter" resolving to multiple unichars, which in turn are multiple bytes each. Hoo boy. Please refer to this great answer for the details on that.)
The proper answer to the question depends on whether you want to iterate over the characters/letters (as distinct from the type char) or the bytes of the string (what the type char actually means). In the spirit of limiting confusion, I will use the terms byte and letter from now on, avoiding the possibly ambiguous term character.
If you want to do the former and iterate over the letters in the string, you need to exclusively deal with unichars (sorry, but we're in the future now, you can't ignore it anymore). Finding the amount of letters is easy, it's the string's length property. An example snippet is as such (same as above):
NSUInteger len = [str length];
unichar buffer[len+1];
[str getCharacters:buffer range:NSMakeRange(0, len)];
NSLog(@"getCharacters:range: with unichar buffer");
for(int i = 0; i < len; i++) {
NSLog(@"%C", buffer[i]);
}
If, on the other hand, you want to iterate over the bytes in a string, it starts getting complicated and the result will depend entirely upon the encoding you choose to use. The decent default choice is UTF8, so that's what I will show.
Doing this you have to figure out how many bytes the resulting UTF8 string will be, a step where it's easy to go wrong and use the string's -length. One main reason this is very easy to get wrong, especially for a US developer, is that a string with letters falling into the 7-bit ASCII spectrum will have equal byte and letter lengths. This is because UTF8 encodes 7-bit ASCII letters with a single byte, so a simple test string and basic English text might work perfectly fine.
The proper way to do this is to use the method -lengthOfBytesUsingEncoding:NSUTF8StringEncoding (or other encoding), allocate a buffer with that length, then convert the string to the same encoding with -cStringUsingEncoding: and copy it into that buffer. Example code here:
NSUInteger byteLength = [str lengthOfBytesUsingEncoding:NSUTF8StringEncoding];
char proper_c_buffer[byteLength+1];
strncpy(proper_c_buffer, [str cStringUsingEncoding:NSUTF8StringEncoding], byteLength);
NSLog(@"strncpy with proper length");
for(int i = 0; i < byteLength; i++) {
NSLog(@"%c", proper_c_buffer[i]);
}
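As an aside, if you would rather not manage the C buffer yourself, a rough alternative is to ask for the encoded bytes as an NSData and walk that. This is a sketch of a swapped-in technique, not the method described above, and the function name LogUTF8Bytes is hypothetical.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: logs every UTF-8 byte of a string in hex and returns
// the number of bytes visited. NSData carries its own length, so no manual
// buffer sizing or copying is needed.
static NSUInteger LogUTF8Bytes(NSString *string) {
    NSData *data = [string dataUsingEncoding:NSUTF8StringEncoding];
    const unsigned char *bytes = [data bytes];
    NSUInteger byteLength = [data length];
    for (NSUInteger i = 0; i < byteLength; i++) {
        NSLog(@"Byte %lu: 0x%02X", (unsigned long)i, (unsigned int)bytes[i]);
    }
    return byteLength;
}
```

For a five-letter Cyrillic word such as буква, -length reports 5 while this visits 10 bytes, which is exactly the mismatch described above.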
Just to drive the point home as to why it's important to keep things straight, I will show example code that handles this iteration in four different ways, two wrong and two correct. This is the code:
#import <Foundation/Foundation.h>
int main() {
NSString *str = @"буква";
NSUInteger len = [str length];
// Try to store unicode letters in a char array. This will fail horribly
// because getCharacters:range: takes a unichar array and will probably
// overflow or do other terrible things. (the compiler will warn you here,
// but warnings get ignored)
char c_buffer[len+1];
[str getCharacters:c_buffer range:NSMakeRange(0, len)];
NSLog(@"getCharacters:range: with char buffer");
for(int i = 0; i < len; i++) {
NSLog(@"Byte %d: %c", i, c_buffer[i]);
}
// Copy the UTF string into a char array, but use the amount of letters
// as the buffer size, which will truncate many non-ASCII strings.
strncpy(c_buffer, [str UTF8String], len);
NSLog(@"strncpy with UTF8String");
for(int i = 0; i < len; i++) {
NSLog(@"Byte %d: %c", i, c_buffer[i]);
}
// Do It Right (tm) for accessing letters by making a unichar buffer with
// the proper letter length
unichar buffer[len+1];
[str getCharacters:buffer range:NSMakeRange(0, len)];
NSLog(@"getCharacters:range: with unichar buffer");
for(int i = 0; i < len; i++) {
NSLog(@"Letter %d: %C", i, buffer[i]);
}
// Do It Right (tm) for accessing bytes, by using the proper
// encoding-handling methods
NSUInteger byteLength = [str lengthOfBytesUsingEncoding:NSUTF8StringEncoding];
char proper_c_buffer[byteLength+1];
const char *utf8_buffer = [str cStringUsingEncoding:NSUTF8StringEncoding];
// We copy here because the documentation tells us the string can disappear
// under us and we should copy it. Just to be safe
strncpy(proper_c_buffer, utf8_buffer, byteLength);
NSLog(@"strncpy with proper length");
for(int i = 0; i < byteLength; i++) {
NSLog(@"Byte %d: %c", i, proper_c_buffer[i]);
}
return 0;
}
Running this code will output the following (with NSLog cruft trimmed out), showing exactly HOW different the byte and letter representations can be (the two last outputs):
getCharacters:range: with char buffer
Byte 0: 1
Byte 1:
Byte 2: C
Byte 3:
Byte 4: :
strncpy with UTF8String
Byte 0: Ð
Byte 1: ±
Byte 2: Ñ
Byte 3:
Byte 4: Ð
getCharacters:range: with unichar buffer
Letter 0: б
Letter 1: у
Letter 2: к
Letter 3: в
Letter 4: а
strncpy with proper length
Byte 0: Ð
Byte 1: ±
Byte 2: Ñ
Byte 3:
Byte 4: Ð
Byte 5: º
Byte 6: Ð
Byte 7: ²
Byte 8: Ð
Byte 9: °
While Daniel's solution will probably work most of the time, I think the solution is dependent on the context. For example, I have a spelling app and need to iterate over each character as it appears onscreen which may not correspond to the way it is represented in memory. This is especially true for text provided by the user.
Using something like this category on NSString:
- (void) dumpChars
{
NSMutableArray *chars = [NSMutableArray array];
NSUInteger len = [self length];
unichar buffer[len+1];
[self getCharacters: buffer range: NSMakeRange(0, len)];
for (int i=0; i<len; i++) {
[chars addObject: [NSString stringWithFormat: @"%C", buffer[i]]];
}
NSLog(@"%@ = %@", self, [chars componentsJoinedByString: @", "]);
}
And feeding it a word like mañana might produce:
mañana = m, a, ñ, a, n, a
But it could just as easily produce:
mañana = m, a, n, ̃, a, n, a
The former will be produced if the string is in precomposed unicode form and the latter if it's in decomposed form.
You might think this could be avoided by using the result of NSString's precomposedStringWithCanonicalMapping or precomposedStringWithCompatibilityMapping, but this is not necessarily the case as Apple warns in Technical Q&A 1225. For example a string like e̊gâds (which I totally made up) still produces the following even after converting to a precomposed form.
e̊gâds = e, ̊, g, â, d, s
The solution for me is to use NSString's enumerateSubstringsInRange passing NSStringEnumerationByComposedCharacterSequences as the enumeration option. Rewriting the earlier example to look like this:
- (void) dumpSequences
{
NSMutableArray *chars = [NSMutableArray array];
[self enumerateSubstringsInRange: NSMakeRange(0, [self length]) options: NSStringEnumerationByComposedCharacterSequences
usingBlock: ^(NSString *inSubstring, NSRange inSubstringRange, NSRange inEnclosingRange, BOOL *outStop) {
[chars addObject: inSubstring];
}];
NSLog(@"%@ = %@", self, [chars componentsJoinedByString: @", "]);
}
If we feed this version e̊gâds then we get
e̊gâds = e̊, g, â, d, s
as expected, which is what I want.
The section of documentation on Characters and Grapheme Clusters may also be helpful in explaining some of this.
Note: Looks like some of the unicode strings I used are tripping up SO when formatted as code. The strings I used are mañana, and e̊gâds.
Neither. The "Optimize Your Text Manipulations" section of the "Cocoa Performance Guidelines" in the Xcode Documentation recommends:
If you want to iterate over the
characters of a string, one of the
things you should not do is use the
characterAtIndex: method to retrieve
each character separately. This method
is not designed for repeated access.
Instead, consider fetching the
characters all at once using the
getCharacters:range: method and
iterating over the bytes directly.
If you want to search a string for
specific characters or substrings, do
not iterate through the characters one
by one. Instead, use higher level
methods such as rangeOfString:,
rangeOfCharacterFromSet:, or
substringWithRange:, which are
optimized for searching the NSString
characters.
See this Stack Overflow answer on How to remove whitespace from right end of NSString for an example of how to let rangeOfCharacterFromSet: iterate over the characters of the string instead of doing it yourself.
I would definitely get a unichar buffer first, then iterate over that.
NSString *someString = ...
NSUInteger len = [someString length];
// getCharacters:range: writes UTF-16 code units, so the buffer must hold unichars, not chars
unichar buffer[len];
[someString getCharacters:buffer range:NSMakeRange(0, len)];
for(NSUInteger i = 0; i < len; ++i) {
unichar current = buffer[i];
//do something with current...
}
Try enumerating the string with blocks.
Create a category on NSString:
.h
@interface NSString (Category)
- (void)enumerateCharactersUsingBlock:(void (^)(NSString *character, NSInteger idx, bool *stop))block;
@end
.m
@implementation NSString (Category)
- (void)enumerateCharactersUsingBlock:(void (^)(NSString *character, NSInteger idx, bool *stop))block
{
bool _stop = NO;
for(NSInteger i = 0; i < [self length] && !_stop; i++)
{
NSString *character = [self substringWithRange:NSMakeRange(i, 1)];
block(character, i, &_stop);
}
}
@end
Example:
NSString *string = @"Hello World";
[string enumerateCharactersUsingBlock:^(NSString *character, NSInteger idx, bool *stop) {
NSLog(@"char %@, i: %li",character, (long)idx);
}];
This is a slightly different solution to the question, but I thought it might be useful to someone. What I wanted was to iterate over the actual Unicode characters in an NSString. So, I found this solution:
NSString * str = @"hello 🤠💩";
NSRange range = NSMakeRange(0, str.length);
[str enumerateSubstringsInRange:range
options:NSStringEnumerationByComposedCharacterSequences
usingBlock:^(NSString *substring, NSRange substringRange,
NSRange enclosingRange, BOOL *stop)
{
NSLog(@"%@", substring);
}];
Although you would technically be getting individual NSString values, here is an alternative approach:
NSRange range = NSMakeRange(0, 1);
for (__unused int i = range.location; range.location < [aNSString length]; range.location++) {
NSLog(@"%@", [aNSString substringWithRange:range]);
}
(The __unused int i bit is necessary to silence the compiler warning.)
You should not use
NSUInteger len = [str length];
unichar buffer[len+1];
you should use memory allocation
NSUInteger len = [str length];
unichar* buffer = (unichar*) malloc((len+1)*sizeof(unichar));
and in the end use
free(buffer);
in order to avoid memory problems.

How to get a single NSString character from an NSString

I want to get a character from somewhere inside an NSString. I want the result to be an NSString.
This is the code I use to get a single character at index it:
[[s substringToIndex:i] substringToIndex:1]
Is there a better way to do it?
This will also retrieve a character at index i as an NSString, and you're only using an NSRange struct rather than an extra NSString.
NSString * newString = [s substringWithRange:NSMakeRange(i, 1)];
If you just want to get one character from an NSString, you can try this.
- (unichar)characterAtIndex:(NSUInteger)index;
Used like so:
NSString *originalString = @"hello";
int index = 2;
NSString *theCharacter = [NSString stringWithFormat:@"%C", [originalString characterAtIndex:index-1]];
//returns "e".
Your suggestion only works for simple characters like ASCII. NSStrings store unicode and if your character is several unichars long then you could end up with gibberish. Use
- (NSRange)rangeOfComposedCharacterSequenceAtIndex:(NSUInteger)index;
if you want to determine how many unichars your character is. I use this to step through my strings to determine where the character borders occur.
Being fully Unicode-capable is a bit of work, but how much depends on what languages you use. I see a lot of Asian text, so most characters spill over from one space and so it's work that I need to do.
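Stepping through a string this way could look roughly like the following. ComposedCharactersOfString is a made-up helper name; the only Foundation calls used are rangeOfComposedCharacterSequenceAtIndex: and substringWithRange:.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: splits a string at composed-character borders and
// returns each piece as its own NSString.
static NSArray *ComposedCharactersOfString(NSString *string) {
    NSMutableArray *characters = [NSMutableArray array];
    NSUInteger length = [string length];
    NSUInteger index = 0;
    while (index < length) {
        // The returned range covers every unichar belonging to the character at index.
        NSRange range = [string rangeOfComposedCharacterSequenceAtIndex:index];
        [characters addObject:[string substringWithRange:range]];
        index = NSMaxRange(range); // jump to the next character border
    }
    return characters;
}
```

Each element of the result can then be used wherever a single-character NSString is wanted, including for emoji and letters with combining marks.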
NSMutableString *myString=[NSMutableString stringWithFormat:@"Malayalam"];
NSMutableString *revString=[NSMutableString string];
for (int i=0; i<myString.length; i++) {
revString=[NSMutableString stringWithFormat:@"%C%@",[myString characterAtIndex:i],revString];
}
NSLog(@"%@",revString);

NSString - Convert to pure alphabet only (i.e. remove accents+punctuation)

I'm trying to compare names without any punctuation, spaces, accents etc.
At the moment I am doing the following:
-(NSString*) prepareString:(NSString*)a {
//remove any accents and punctuation;
a=[[[NSString alloc] initWithData:[a dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES] encoding:NSASCIIStringEncoding] autorelease];
a=[a stringByReplacingOccurrencesOfString:@" " withString:@""];
a=[a stringByReplacingOccurrencesOfString:@"'" withString:@""];
a=[a stringByReplacingOccurrencesOfString:@"`" withString:@""];
a=[a stringByReplacingOccurrencesOfString:@"-" withString:@""];
a=[a stringByReplacingOccurrencesOfString:@"_" withString:@""];
a=[a lowercaseString];
return a;
}
However, I need to do this for hundreds of strings and I need to make this more efficient. Any ideas?
NSString* finish = [[start componentsSeparatedByCharactersInSet:[[NSCharacterSet letterCharacterSet] invertedSet]] componentsJoinedByString:@""];
Before using any of these solutions, don't forget to use decomposedStringWithCanonicalMapping to decompose any accented letters. This will turn, for example, é (U+00E9) into e ‌́ (U+0065 U+0301). Then, when you strip out the non-alphanumeric characters, the unaccented letters will remain.
The reason why this is important is that you probably don't want, say, “dän” and “dün”* to be treated as the same. If you stripped out all accented letters, as some of these solutions may do, you'll end up with “dn”, so those strings will compare as equal.
So, you should decompose them first, so that you can strip the accents and leave the letters.
*Example from German. Thanks to Joris Weimar for providing it.
On a similar question, Ole Begemann suggests using stringByFoldingWithOptions: and I believe this is the best solution here:
NSString *accentedString = @"ÁlgeBra";
NSString *unaccentedString = [accentedString stringByFoldingWithOptions:NSDiacriticInsensitiveSearch locale:[NSLocale currentLocale]];
Depending on the nature of the strings you want to convert, you might want to set a fixed locale (e.g. English) instead of using the user's current locale. That way, you can be sure to get the same results on every machine.
One important clarification regarding the answer of BillyTheKid18756 (it was corrected by Luiz, but that was not obvious from the explanation of the code):
DO NOT USE stringWithCString as a second step to remove accents, it can add unwanted characters at the end of your string as the NSData is not NULL-terminated (as stringWithCString expects it).
Or use it and add an additional NULL byte to your NSData, like Luiz did in his code.
I think a simpler answer is to replace:
NSString *sanitizedText = [NSString stringWithCString:[sanitizedData bytes] encoding:NSASCIIStringEncoding];
By:
NSString *sanitizedText = [[[NSString alloc] initWithData:sanitizedData encoding:NSASCIIStringEncoding] autorelease];
If I take back the code of BillyTheKid18756, here is the complete correct code:
// The input text
NSString *text = @"BûvérÈ!#$&%^&(*^(_()-*/48";
// Defining what characters to accept
NSMutableCharacterSet *acceptedCharacters = [[NSMutableCharacterSet alloc] init];
[acceptedCharacters formUnionWithCharacterSet:[NSCharacterSet letterCharacterSet]];
[acceptedCharacters formUnionWithCharacterSet:[NSCharacterSet decimalDigitCharacterSet]];
[acceptedCharacters addCharactersInString:@" _-.!"];
// Turn accented letters into normal letters (optional)
NSData *sanitizedData = [text dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES];
// Corrected back-conversion from NSData to NSString
NSString *sanitizedText = [[[NSString alloc] initWithData:sanitizedData encoding:NSASCIIStringEncoding] autorelease];
// Removing unaccepted characters
NSString* output = [[sanitizedText componentsSeparatedByCharactersInSet:[acceptedCharacters invertedSet]] componentsJoinedByString:@""];
If you are trying to compare strings, use one of these methods. Don't try to change data.
- (NSComparisonResult)localizedCompare:(NSString *)aString
- (NSComparisonResult)localizedCaseInsensitiveCompare:(NSString *)aString
- (NSComparisonResult)compare:(NSString *)aString options:(NSStringCompareOptions)mask range:(NSRange)range locale:(id)locale
You NEED to consider user locale to do things right with strings, particularly things like names.
In most languages, characters like ä and å are not the same other than they look similar. They are inherently distinct characters with meaning distinct from others, but the actual rules and semantics are distinct to each locale.
The correct way to compare and sort strings is by considering the user's locale. Anything else is naive, wrong and very 1990's. Stop doing it.
If you are trying to pass data to a system that cannot support non-ASCII, well, this is just a wrong thing to do. Pass it as data blobs.
https://developer.apple.com/library/ios/documentation/cocoa/Conceptual/Strings/Articles/SearchingStrings.html
Also normalize your strings first (see Peter Hosey's post), either precomposing or decomposing; basically, pick a normalized form.
- (NSString *)decomposedStringWithCanonicalMapping
- (NSString *)decomposedStringWithCompatibilityMapping
- (NSString *)precomposedStringWithCanonicalMapping
- (NSString *)precomposedStringWithCompatibilityMapping
No, it's not nearly as simple and easy as we tend to think.
Yes, it requires informed and careful decision making. (and a bit of non-English language experience helps)
Consider using the RegexKit framework. You could do something like:
NSString *searchString = @"This is neat.";
NSString *regexString = @"[\\W]";
NSString *replaceWithString = @"";
NSString *replacedString = [searchString stringByReplacingOccurrencesOfRegex:regexString withString:replaceWithString];
NSLog (@"%@", replacedString);
//... Thisisneat
Consider using NSScanner, and specifically the methods -setCharactersToBeSkipped: (which accepts an NSCharacterSet) and -scanString:intoString: (which accepts a string and returns the scanned string by reference).
You may also want to couple this with -[NSString localizedCompare:], or perhaps -[NSString compare:options:] with the NSDiacriticInsensitiveSearch option. That could simplify having to remove/replace accents, so you can focus on removing punctuation, whitespace, etc.
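A small sketch of the comparison route, assuming you pass in whichever locale is appropriate for your data. NamesMatch is a hypothetical helper name; the compare call itself is the standard compare:options:range:locale: method.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: YES when two names compare as equal while ignoring
// case and diacritics. The locale is supplied by the caller.
static BOOL NamesMatch(NSString *first, NSString *second, NSLocale *locale) {
    NSStringCompareOptions options = NSCaseInsensitiveSearch | NSDiacriticInsensitiveSearch;
    NSComparisonResult result = [first compare:second
                                       options:options
                                         range:NSMakeRange(0, [first length])
                                        locale:locale];
    return result == NSOrderedSame;
}
```

This leaves both strings untouched, so nothing has to be stripped or re-encoded just to compare them; punctuation and whitespace would still need separate handling.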
If you must use an approach like you presented in your question, at least use an NSMutableString and replaceOccurrencesOfString:withString:options:range: — that will be much more efficient than creating tons of nearly-identical autoreleased strings. It could be that just reducing the number of allocations will boost performance "enough" for the time being.
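Roughly, the mutable-string variant could look like this. PreparedName is a made-up name, the list of characters to remove is copied from the question, and the sketch covers only the punctuation and lowercasing steps, not accent removal.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: lower-cases a name and removes the separator
// characters from the question, reusing one mutable string throughout.
static NSString *PreparedName(NSString *name) {
    NSMutableString *result = [NSMutableString stringWithString:[name lowercaseString]];
    NSArray *unwanted = [NSArray arrayWithObjects:@" ", @"'", @"`", @"-", @"_", nil];
    for (NSString *token in unwanted) {
        // Edits the same buffer in place instead of allocating a new string per pass.
        [result replaceOccurrencesOfString:token
                                withString:@""
                                   options:NSLiteralSearch
                                     range:NSMakeRange(0, [result length])];
    }
    return result;
}
```

Whether this is fast enough for hundreds of strings is something to measure rather than assume.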
To give a complete example by combining the answers from Luiz and Peter, adding a few lines, you get the code below.
The code does the following:
Creates a set of accepted characters
Turn accented letters into normal letters
Remove characters not in the set
Objective-C
// The input text
NSString *text = @"BûvérÈ!#$&%^&(*^(_()-*/48";
// Create set of accepted characters
NSMutableCharacterSet *acceptedCharacters = [[NSMutableCharacterSet alloc] init];
[acceptedCharacters formUnionWithCharacterSet:[NSCharacterSet letterCharacterSet]];
[acceptedCharacters formUnionWithCharacterSet:[NSCharacterSet decimalDigitCharacterSet]];
[acceptedCharacters addCharactersInString:@" _-.!"];
// Turn accented letters into normal letters (optional)
NSData *sanitizedData = [text dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES];
NSString *sanitizedText = [NSString stringWithCString:[sanitizedData bytes] encoding:NSASCIIStringEncoding];
// Remove characters not in the set
NSString* output = [[sanitizedText componentsSeparatedByCharactersInSet:[acceptedCharacters invertedSet]] componentsJoinedByString:@""];
Swift (2.2) example
let text = "BûvérÈ!#$&%^&(*^(_()-*/48"
// Create set of accepted characters
let acceptedCharacters = NSMutableCharacterSet()
acceptedCharacters.formUnionWithCharacterSet(NSCharacterSet.letterCharacterSet())
acceptedCharacters.formUnionWithCharacterSet(NSCharacterSet.decimalDigitCharacterSet())
acceptedCharacters.addCharactersInString(" _-.!")
// Turn accented letters into normal letters (optional)
let sanitizedData = text.dataUsingEncoding(NSASCIIStringEncoding, allowLossyConversion: true)
let sanitizedText = String(data: sanitizedData!, encoding: NSASCIIStringEncoding)
// Remove characters not in the set
let components = sanitizedText!.componentsSeparatedByCharactersInSet(acceptedCharacters.invertedSet)
let output = components.joinWithSeparator("")
Output
The output for both examples would be: BuverE!_-48
Just bumped into this, maybe it's too late, but here is what worked for me:
// text is the input string, and this just removes accents from the letters
// lossy encoding turns accented letters into normal letters
NSMutableData *sanitizedData = [NSMutableData dataWithData:[text dataUsingEncoding:NSASCIIStringEncoding
allowLossyConversion:YES]];
// increase length by 1 adds a 0 byte (increaseLengthBy
// guarantees to fill the new space with 0s), effectively turning
// sanitizedData into a c-string
[sanitizedData increaseLengthBy:1];
// now we just create a string with the c-string in sanitizedData
NSString *final = [NSString stringWithCString:[sanitizedData bytes] encoding:NSASCIIStringEncoding];
@interface NSString (Filtering)
- (NSString*)stringByFilteringCharacters:(NSCharacterSet*)charSet;
@end
@implementation NSString (Filtering)
- (NSString*)stringByFilteringCharacters:(NSCharacterSet*)charSet {
NSMutableString * mutString = [NSMutableString stringWithCapacity:[self length]];
for (NSUInteger i = 0; i < [self length]; i++){
// keep the full 16-bit unichar; storing it in a char would truncate non-ASCII characters
unichar c = [self characterAtIndex:i];
if(![charSet characterIsMember:c]) [mutString appendFormat:@"%C", c];
}
return [NSString stringWithString:mutString];
}
@end
These answers didn't work as expected for me. Specifically, decomposedStringWithCanonicalMapping didn't strip accents/umlauts as I'd expected.
Here's a variation on what I used that answers the brief:
// replace accents, umlauts etc with equivalent letter i.e 'é' becomes 'e'.
// Always use en_GB (or a locale without the characters you wish to strip) as locale, no matter which language we're taking as input
NSString *processedString = [string stringByFoldingWithOptions: NSDiacriticInsensitiveSearch locale: [NSLocale localeWithLocaleIdentifier: #"en_GB"]];
// remove non-letters
processedString = [[processedString componentsSeparatedByCharactersInSet:[[NSCharacterSet letterCharacterSet] invertedSet]] componentsJoinedByString:@""];
// trim whitespace
processedString = [processedString stringByTrimmingCharactersInSet: [NSCharacterSet whitespaceCharacterSet]];
return processedString;
Peter's Solution in Swift:
let newString = oldString.componentsSeparatedByCharactersInSet(NSCharacterSet.letterCharacterSet().invertedSet).joinWithSeparator("")
Example:
let oldString = "Jo_ - h !. nn y"
// "Jo_ - h !. nn y"
oldString.componentsSeparatedByCharactersInSet(NSCharacterSet.letterCharacterSet().invertedSet)
// ["Jo", "h", "nn", "y"]
oldString.componentsSeparatedByCharactersInSet(NSCharacterSet.letterCharacterSet().invertedSet).joinWithSeparator("")
// "Johnny"
I wanted to filter out everything except letters and numbers, so I adapted Lorean's implementation of a Category on NSString to work a little different. In this example, you specify a string with only the characters you want to keep, and everything else is filtered out:
@interface NSString (PraxCategories)
+ (NSString *)lettersAndNumbers;
- (NSString*)stringByKeepingOnlyLettersAndNumbers;
- (NSString*)stringByKeepingOnlyCharactersInString:(NSString *)string;
@end
@implementation NSString (PraxCategories)
+ (NSString *)lettersAndNumbers { return @"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"; }
- (NSString*)stringByKeepingOnlyLettersAndNumbers {
return [self stringByKeepingOnlyCharactersInString:[NSString lettersAndNumbers]];
}
- (NSString*)stringByKeepingOnlyCharactersInString:(NSString *)string {
NSCharacterSet *characterSet = [NSCharacterSet characterSetWithCharactersInString:string];
NSMutableString * mutableString = @"".mutableCopy;
for (NSUInteger i = 0; i < [self length]; i++){
// use unichar rather than char so non-ASCII characters are not truncated
unichar character = [self characterAtIndex:i];
if([characterSet characterIsMember:character]) [mutableString appendFormat:@"%C", character];
}
return mutableString.copy;
}
@end
Once you've made your Categories, using them is trivial, and you can use them on any NSString:
NSString *string = someStringValueThatYouWantToFilter;
string = [string stringByKeepingOnlyLettersAndNumbers];
Or, for example, if you wanted to get rid of everything except vowels:
string = [string stringByKeepingOnlyCharactersInString:@"aeiouAEIOU"];
If you're still learning Objective-C and aren't using Categories, I encourage you to try them out. They're the best place to put things like this because it gives more functionality to all objects of the class you Categorize.
Categories simplify and encapsulate the code you're adding, making it easy to reuse on all of your projects. It's a great feature of Objective-C!