UTF8String giving different value for same string in Objective-C

I am comparing two strings in a condition. Both strings look exactly the same, and I have also trimmed all whitespace and newline characters, but the comparison says they are not the same.
After a lot of investigation I found that the two strings have different UTF8String values.
po otherPersonName
"76000 13590"
po [otherPersonName UTF8String]
"76000 13590"
po findPersonName
"76000 13590"
po [findPersonName UTF8String]
"\xffffffc2\xffffffa076000\xffffffc2\xffffffa013590\xffffffe2\xffffff80\xffffffac"
Can anyone explain what to do to match these strings correctly?

In findPersonName there are non-breaking spaces (U+00A0, UTF-8 C2 A0, which po is showing as \xffffffc2\xffffffa0) at the start and between the numbers, and a POP DIRECTIONAL FORMATTING (U+202C, UTF-8 E2 80 AC, \xffffffe2\xffffff80\xffffffac) at the end (suggesting the value has come from a larger text with mixed scripts, left-to-right and right-to-left).
If these are the only characters that might occur, a couple of calls to stringByReplacingOccurrencesOfString:withString: may be used to replace or remove them. However, if other whitespace characters might occur, look at other approaches to cleaning up the string: see NSString, NSCharacterSet, NSRegularExpression, etc.
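As a minimal sketch of the replacement approach, assuming U+00A0 and U+202C really are the only stray characters: CleanedName is a hypothetical helper name, and the sample strings are rebuilt from the debugger output in the question. The sketch uses the variant of the method that also takes options and a range, so that NSLiteralSearch can be passed explicitly.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the original post): removes U+202C,
// turns non-breaking spaces into ordinary spaces, then trims the ends.
static NSString *CleanedName(NSString *name) {
    // U+202C POP DIRECTIONAL FORMATTING: delete it outright.
    NSString *result = [name stringByReplacingOccurrencesOfString:@"\u202C"
                                                       withString:@""
                                                          options:NSLiteralSearch
                                                            range:NSMakeRange(0, name.length)];
    // U+00A0 NO-BREAK SPACE: replace with a plain U+0020 space.
    result = [result stringByReplacingOccurrencesOfString:@"\u00A0"
                                               withString:@" "
                                                  options:NSLiteralSearch
                                                    range:NSMakeRange(0, result.length)];
    // Trim the ordinary whitespace that is now left at either end.
    return [result stringByTrimmingCharactersInSet:
            [NSCharacterSet whitespaceAndNewlineCharacterSet]];
}

int main(void) {
    @autoreleasepool {
        // Sample values reconstructed from the question's debugger output.
        NSString *otherPersonName = @"76000 13590";
        NSString *findPersonName = @"\u00A076000\u00A013590\u202C";

        NSLog(@"raw equal: %d",
              [otherPersonName isEqualToString:findPersonName]);
        NSLog(@"cleaned equal: %d",
              [otherPersonName isEqualToString:CleanedName(findPersonName)]);
    }
    return 0;
}
```

Under these assumptions the raw comparison should report 0 and the cleaned comparison 1.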
HTH

Related

Displaying special characters in a UILabel

I have a string which contains a mix of normal text and special characters. When setting the label text with this string the character codes are being displayed rather than the actual character. I was wondering is there any support for special characters or if there is a way to decode the values?
I have tried stringWithCString but haven't had any luck with it.
Setting:
self.stringLabel.text = myNSString
Result:
Hello world!
èéêëÄÄÄ
ÿ
ûüùúÅ«
Anyone else come across a similar issue?

How do I remove hidden characters from a NSString?

After copying and pasting text from the web into an NSTextArea in my Mac app, I see
EE
If I copy these 2 letters in a browser I see:
E?E
If I copy them in google translator I get
E 'E
I cannot identify this character in between the two E. But the question is: how do I remove these hidden characters from my NSString?
In your uploaded file the specific hex code for the hidden character is 0x18
(found via Hex Fiend)
This character, along with others, is part of the 'control character set'. The set also contains characters such as tab (0x09) and newline (0x0A), which we obviously don't want to remove.
In Objective-C, we can use the NSCharacterSet controlCharacterSet in conjunction with whitespaceAndNewlineCharacterSet to get just the blank characters that have no rendered width.
NSMutableCharacterSet* zeroWidthCharacterSet = [[NSCharacterSet controlCharacterSet] mutableCopy];
[zeroWidthCharacterSet formIntersectionWithCharacterSet:[[NSCharacterSet whitespaceAndNewlineCharacterSet] invertedSet]];
Then we can simply use the good old split by character set method
string = [[string componentsSeparatedByCharactersInSet:zeroWidthCharacterSet] componentsJoinedByString:@""];
Note that if a special character that needs more than one UTF-8 code unit to represent itself (like an emoji) contains 0x18, then stripping it will break the character sequence.
Because the control characters are special, I don't believe you'd ever find them in an Emoji sequence.

How do I check for this odd space character - " " in Objective-C?

I wrote some RegEx to play with spaces in strings, and it works beautifully, except for when I come across this character: " " instead of " ". You probably think I'm crazy, but apparently they're different. Check out this RegEx app (oddly enough, it often crashes it):
When I use the weird space:
When I use a normal space:
As you can see, there are many more spaces detected here, but it doesn't detect the weird spaces.
What is this space? How do I get rid of it?
Unicode has a lot of different space characters. The space you posted in your question -- in both the title and the body -- is a regular ASCII space, good old U+0020.
If you want to check exactly what you've copied onto your clipboard, you can run the command pbpaste(1) on Mac OS X. For example, if you copied a non-breaking space (U+00A0), you could identify it like so:
# Write pasteboard contents to stdout, convert from UTF-8 to UTF-32 for easy
# code point identification, then hex dump the contents
$ pbpaste | iconv -f utf-8 -t utf-32be | hexdump -C
00000000 00 00 00 a0 |....|
00000004
Depending on the regex engine you're using, it may not support them all, especially if you use the \s character class. If you want to be sure to match the space character you have, then include it explicitly in your character class, e.g. [\s<YOURSPACEHERE>], where <YOURSPACEHERE> is copy+pasted from the character you want to match.
Try "\p{Z}" for your regular expression. It's the Unicode property for any kind of whitespace or invisible separator.
See: NSRegularExpression and Unicode Regular Expressions.
Just as a test of my answer, I constructed the following unit test.
- (void)testPattern
{
    NSString *string = @"xxx\u00A0yyy";
    NSString *pattern = @"\\p{Z}";
    NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:pattern options:0 error:NULL];
    NSUInteger number = [regex numberOfMatchesInString:string options:0 range:NSMakeRange(0, [string length])];
    STAssertEquals(number, 1U, @"");
}
They're probably non-breaking spaces, seeing as all the lines end with spaces that are matched by \s rather than these mystery spaces. Try matching \xA0.
You can match Unicode characters with \x{NNNN}, where NNNN is the hexadecimal code of the character. See the ICU User Guide.
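To actually get rid of such spaces rather than just detect them, the \p{Z} property suggested above can be combined with \s in one character class. The following is a rough sketch, not a definitive implementation: NormalizeSpaces is a hypothetical helper name, and the sample string is made up for illustration.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: collapses every run of Unicode separators (\p{Z})
// or ordinary whitespace (\s) into a single U+0020 space.
static NSString *NormalizeSpaces(NSString *input) {
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"[\\p{Z}\\s]+"
                                                  options:0
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"Invalid pattern: %@", error);
        return input;
    }
    return [regex stringByReplacingMatchesInString:input
                                           options:0
                                             range:NSMakeRange(0, input.length)
                                      withTemplate:@" "];
}

int main(void) {
    @autoreleasepool {
        // Illustrative input: U+00A0 (no-break space) and U+2003 (em space).
        NSString *weird = @"xxx\u00A0yyy\u2003zzz";
        NSString *normalized = NormalizeSpaces(weird);
        NSLog(@"normalized matches plain-space version: %d",
              [normalized isEqualToString:@"xxx yyy zzz"]);
    }
    return 0;
}
```

Replacing with a plain space instead of deleting keeps the words apart, which is usually what later regex work on "normal" spaces expects.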

Which Unicode characters are "composing" characters (whose sole purpose is to add an accent or tilde)?

This is related to: What are the characters that count as the same character under collation of UTF8 Unicode? And what VB.net function can be used to merge them?
This is how I plan to do this:
Use http://msdn.microsoft.com/en-us/library/dd374126%28v=vs.85%29.aspx to turn the string into KD form.
Basically, it will turn most variations, such as superscripts, into the normal number. It also decomposes characters with a tilde or accent into 2 characters.
The next step would be to remove all characters whose sole purpose is to add a tilde or accent to another character.
How do I know which characters are like that? Which characters are just "composing characters"?
How do I find such characters? After I find them, how do I get rid of them? Should I scan character by character and remove all such "combining characters"?
For example:
Characters from 300 to 362 can be gotten rid of.
Then what?
Combining characters are listed in UnicodeData.txt as having a nonzero Canonical_Combining_Class, and a General_Category of Mn (Mark, nonspacing).
For each character in the string, call GetUnicodeCategory and check the UnicodeCategory for NonSpacingMark, SpacingCombiningMark or EnclosingMark.
You may be able to do it more efficiently using a regex, e.g. Regex.Replace(str, "\p{M}", "").
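The same two-step idea (decompose to KD form, then drop everything in the \p{M} mark categories) is sketched here in Objective-C, to stay consistent with the other examples on this page rather than with the VB.net setting of the question. StripCombiningMarks is a hypothetical helper name and the sample text is made up; the .NET calls mentioned above are not used.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: NFKD-decomposes the input, then removes all
// combining marks (Unicode general categories Mn, Mc and Me).
static NSString *StripCombiningMarks(NSString *input) {
    // Compatibility decomposition (NFKD): splits accented letters into
    // base letter + combining mark and maps e.g. superscripts to digits.
    NSString *decomposed = [input decomposedStringWithCompatibilityMapping];

    NSRegularExpression *marks =
        [NSRegularExpression regularExpressionWithPattern:@"\\p{M}+"
                                                  options:0
                                                    error:NULL];
    return [marks stringByReplacingMatchesInString:decomposed
                                           options:0
                                             range:NSMakeRange(0, decomposed.length)
                                      withTemplate:@""];
}

int main(void) {
    @autoreleasepool {
        // "crème brûlée x²" written with escapes for the non-ASCII characters.
        NSString *sample = @"cr\u00E8me br\u00FBl\u00E9e x\u00B2";
        NSLog(@"%@", StripCombiningMarks(sample));
    }
    return 0;
}
```

Using the \p{M} property avoids hard-coding a code point range, so marks outside the basic combining diacritics block are covered as well.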

RegEx to find % symbols in a string that don't form the start of a legal two-digit escape sequence?

I would like a regular expression to find the %s in the source string that don't form the start of a valid two-hex-digit escaped character (defined as a % followed by exactly two hexadecimal digits, upper or lower case) that can be used to replace only these % symbols with %25.
(The motivation is to make the best guess attempt to create legally escaped strings from strings of various origins that may be legally percent escaped and may not, and may even be a mixture of the two, without damaging the data intent if the original string was already correctly encoded, e.g. by blanket re-encoding).
Here's an example input string.
He%20has%20a%2050%%20chance%20of%20living%2C%20but%20there%27s%20only%20a%2025%%20chance%20of%20that.
This doesn't conform to any encoding standard because it is a mix of valid escaped characters eg. %20 and two loose percentage symbols. I'd like to convert those %s to %25s.
My progress so far is to identify a regex %[0-9a-z]{2} that finds the % symbols that are legal but I can't work out how to modify it to find the ones that aren't legal.
%(?![0-9a-fA-F]{2})
Should do the trick. Use a look-ahead to find a % NOT followed by a valid two-digit hexadecimal value then replace the found % symbol with your %25 replacement.
(Hopefully this works with (presumably) NSRegularExpression, or whatever you're using)
%(?![a-fA-F0-9]{2})
That's a percent followed by a negative lookahead for two hex digits.
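A minimal sketch of applying that lookahead pattern with NSRegularExpression, which the answers above only presume is the engine in use. EscapeLoosePercents is a hypothetical helper name; the input is the example string from the question.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: replaces every % that is NOT followed by two
// hexadecimal digits with %25, leaving valid escapes such as %20 alone.
static NSString *EscapeLoosePercents(NSString *input) {
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"%(?![0-9a-fA-F]{2})"
                                                  options:0
                                                    error:NULL];
    // "%25" contains no template metacharacters ($ or \), so it is
    // inserted literally.
    return [regex stringByReplacingMatchesInString:input
                                           options:0
                                             range:NSMakeRange(0, input.length)
                                      withTemplate:@"%25"];
}

int main(void) {
    @autoreleasepool {
        NSString *input = @"He%20has%20a%2050%%20chance%20of%20living%2C%20but"
                          @"%20there%27s%20only%20a%2025%%20chance%20of%20that.";
        // Passed as an argument to %@, so the percent signs in the
        // result are not interpreted as format specifiers.
        NSLog(@"%@", EscapeLoosePercents(input));
    }
    return 0;
}
```

For the sample input, only the two loose percent signs (after "50" and after "25") should change, each becoming %25. Note that a literal percent that happens to be followed by two hex digits cannot be told apart from a valid escape by this approach.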