What are the characters that stringByAddingPercentEscapesUsingEncoding escapes? - objective-c

I've had to switch from stringByAddingPercentEscapesUsingEncoding to CFURLCreateStringByAddingPercentEscapes because it doesn't escape question marks (?). I'm curious what exactly it does escape, and the rationale behind the partial escaping vs RFC 3986.

Be careful not to leak memory on conversions when using CFStringRef. Here's what I came up with to work with Latin characters and others. I use this to escape my parameters, not the entire URL. Depending on your use case, you may need to add or remove characters from "escapeChars":
CFStringRef escapeChars = (__bridge CFStringRef)@"%;/?¿:@&=$+,[]#!'()*<>¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ \"\n";
// Plain __bridge on the input: __bridge_retained would add a +1 retain that is never released.
NSString *encodedString = (__bridge_transfer NSString *) CFURLCreateStringByAddingPercentEscapes(NULL, (__bridge CFStringRef) url, NULL, escapeChars, kCFStringEncodingUTF8);
I hope this helps.

Some good categories have been created for doing just what you need:
http://iosdevelopertips.com/networking/a-better-url-encoding-method.html
http://www.cocoanetics.com/2009/08/url-encoding/
The rationale for leaving certain characters out is beyond me... except to say that the definition of the function is: Returns a representation of the receiver using a given encoding to determine the percent escapes necessary to convert the receiver into a legal URL string.
To be completely correct, + and & are legal characters within a URL, whereas a space is not. Hence the method will correctly escape a space, but leaves + and & intact.
Reading RFC2396 http://www.ietf.org/rfc/rfc2396.txt - there is a set of reserved and unreserved characters defined. My guess is that none of these characters are escaped by stringByAddingPercentEscapesUsingEncoding.
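For comparison, here is a minimal sketch of the strictest policy: percent-encode every byte except the RFC 3986 "unreserved" characters (letters, digits, '-', '.', '_', '~'). It is written in plain C, which also compiles as Objective-C. The names percent_encode and is_unreserved are hypothetical helpers, not Apple APIs, and the input is assumed to be NUL-terminated UTF-8. This suits a single parameter value, not a whole URL.

```c
#include <assert.h>
#include <stddef.h>

/* Nonzero for RFC 3986 "unreserved" bytes: ALPHA / DIGIT / "-" / "." / "_" / "~". */
static int is_unreserved(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           (c >= '0' && c <= '9') ||
           c == '-' || c == '.' || c == '_' || c == '~';
}

/* Percent-encodes the NUL-terminated UTF-8 string `in` into `out`.
 * Returns the encoded length (excluding the NUL), or (size_t)-1 if
 * `out_size` is too small. Hypothetical helper, not an Apple API. */
size_t percent_encode(const char *in, char *out, size_t out_size)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t n = 0;

    assert(in != NULL && out != NULL);
    for (; *in != '\0'; in++) {
        unsigned char c = (unsigned char)*in;
        size_t need = is_unreserved(c) ? 1 : 3;
        if (n + need + 1 > out_size)   /* +1 keeps room for the NUL */
            return (size_t)-1;
        if (need == 1) {
            out[n++] = (char)c;
        } else {
            out[n++] = '%';
            out[n++] = hex[c >> 4];
            out[n++] = hex[c & 0x0F];
        }
    }
    if (n + 1 > out_size)
        return (size_t)-1;
    out[n] = '\0';
    return n;
}
```

With a 64-byte buffer, percent_encode("a b&c?", buf, sizeof buf) writes "a%20b%26c%3F" and returns 12.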

Related

Can I use memchr safely with an internal UTF-8 char * returned from an NSString?

I'd like to use memchr instead of strlen to find the length of a C string potentially used as the backing string of an NSString. Is this safe to do, or do I risk reading memory that I don't own, causing a crash? Let's assume that the NSString will not be released before I'm done with the internal buffer.
memchr(s, 0, XXX) and strlen(s) should behave pretty much identically, save for memchr()'s ability to stop after XXX bytes. But strnlen() can do that, too.
And that behavior is probably exactly what you don't want.
Neither function accounts for any kind of Unicode encoding. Thus, the returned length will be the length in bytes and not the number of characters.
Use -length on the NSString if you want the string length. Beyond that, what are you trying to do?
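To make the bytes-versus-characters distinction concrete, here is a small sketch in plain C (it also compiles as Objective-C). The names bounded_length and utf8_code_points are hypothetical helpers, and the second function assumes well-formed UTF-8. Note that NSString's -length counts UTF-16 code units, which is a third measure again.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length in bytes of `s`, scanning at most `max` bytes.
 * Returns `max` if no NUL terminator is found within the bound. */
size_t bounded_length(const char *s, size_t max)
{
    assert(s != NULL);
    const char *end = memchr(s, '\0', max);
    return end ? (size_t)(end - s) : max;
}

/* Counts Unicode code points in well-formed UTF-8 by skipping
 * continuation bytes (those of the form 10xxxxxx). */
size_t utf8_code_points(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the string "\xCE\xA9mega" (the UTF-8 encoding of "Ωmega"), bounded_length with a bound of 7 returns 6 bytes, while utf8_code_points returns 5.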

Understanding urls correctly

I'm writing an RSS reader and taking article URLs from feeds, but I often get invalid URLs when parsing with NSXMLParser. Sometimes there are extra characters at the end of the URL (for example \n or \t). I have fixed that issue.
The hardest problem is URLs with queries that contain characters which must not be URL-encoded.
A URL that works in a URL request: http://www.bbc.co.uk/news/education-23809095#sa-ns_mchannel=rss&ns_source=PublicRSS20-sa
The '#' character is replaced with "%23" by the "stringByAddingPercentEscapesUsingEncoding:" method, and the result does not work: the site says the page was not found. I believe the part after the '#' character is a query string.
Is there a way to encode any URL from a feed correctly, or at least to always remove the query string from the XML?
There are two approaches you could use to create a legal URL string: stringByAddingPercentEncodingWithAllowedCharacters:, or the Core Foundation CFURL functions, which give you a whole range of options.
Example 1 (NSCharacterSet):
NSString *nonFormattedURL = @"http://www.bbc.co.uk/news/education-23809095#sa-ns_mchannel=rss&ns_source=PublicRSS20-sa";
NSLog(@"%@", [nonFormattedURL stringByAddingPercentEncodingWithAllowedCharacters:[[NSCharacterSet illegalCharacterSet] invertedSet]]);
This keeps the hash sign in place by using the inverse of NSCharacterSet's illegalCharacterSet as the allowed set. If you would like more control, you can also create your own mutable character set.
Example 2 (CFURL.h):
NSString *nonFormattedURL = @"http://www.bbc.co.uk/news/education-23809095#sa-ns_mchannel=rss&ns_source=PublicRSS20-sa";
CFAllocatorRef allocator = CFAllocatorGetDefault();
// __bridge_transfer hands the +1 result of the Create function to ARC, so it is not leaked.
NSString *formattedURL = (__bridge_transfer NSString *)CFURLCreateStringByAddingPercentEscapes(allocator,
    (__bridge CFStringRef) nonFormattedURL,
    (__bridge CFStringRef) @"#", // characters to leave unescaped
    (__bridge CFStringRef) @"", // legal characters to be escaped anyway, like / = # ? etc.
    kCFStringEncodingUTF8); // a CFStringEncoding; NSUTF8StringEncoding is a different value
NSLog(@"%@", formattedURL);
This does the same as the code above but with much more control: it replaces certain characters with the equivalent percent-escape sequences based on the specified encoding. See the log output for an example.
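In the BBC URL above, the part after '#' is a fragment in RFC 3986 terms, and fragments are not sent to the server in an HTTP request. One option is therefore to clean the string before encoding it. The sketch below is plain C (it also compiles as Objective-C) and assumes a writable, NUL-terminated buffer; trim_trailing_whitespace and split_fragment are hypothetical helper names.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Removes trailing whitespace (such as "\n" or "\t") from `url` in place. */
void trim_trailing_whitespace(char *url)
{
    assert(url != NULL);
    size_t n = strlen(url);
    while (n > 0 && isspace((unsigned char)url[n - 1]))
        url[--n] = '\0';
}

/* Cuts `url` at the first '#', dropping the fragment in place.
 * Returns a pointer to the fragment text (after the '#'),
 * or NULL if the URL had no fragment. */
char *split_fragment(char *url)
{
    assert(url != NULL);
    char *hash = strchr(url, '#');
    if (hash == NULL)
        return NULL;
    *hash = '\0';
    return hash + 1;
}
```

Applied to the feed URL above with a trailing "\n\t", this leaves "http://www.bbc.co.uk/news/education-23809095" in the buffer and returns the fragment "sa-ns_mchannel=rss&ns_source=PublicRSS20-sa" separately.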

How to add UTF-8 characters to an NSString?

I need to write chemistry formulas (SO4^2-), and the easiest way to make subscripts and superscripts seems to be adding UTF-8 characters, since the kCTSuperscriptAttributeName attribute of NSAttributedString doesn't work.
Is it possible to make an NSString with both normal characters and UTF-8 characters?
Thanks
According to NSString Reference "NSString is implemented to represent an array of Unicode characters, in other words, a text string."
It would be convenient to write as below:
NSString *myStr = @"Any Unicode Character You Want";
Just make sure that your default text encoding is Unicode (for example UTF-8).
Justin's answer is good, but I think what you might really be looking for is NSAttributedString or NSMutableAttributedString, where you can add superscripts, subscripts, and other character styles that NSString by itself can't handle.
Take a look at other NSAttributedString questions here or via Google.
Hope this helps you out!
Yes. I assume you know the normal way to create an NSString; here is one method for creating an NSString from a UTF-8 C string: -[NSString initWithUTF8String:].

iphone mail and special characters

In my iPhone app, I pass email content to the standalone iPhone mail app, but the content is truncated when it contains special characters. It's the same even if I pre-process the content with stringByAddingPercentEscapesUsingEncoding:.
stringByAddingPercentEscapesUsingEncoding: will not escape characters that are valid in a URL, such as &. In this case you need to escape them though because otherwise they would be interpreted as part of the URL's structure (indicating a new parameter) and not as part of the parameter itself. Use CFURLCreateStringByAddingPercentEscapes instead:
NSString *escaped = [(NSString *)CFURLCreateStringByAddingPercentEscapes(kCFAllocatorDefault,
    (CFStringRef)someURLParameter,
    NULL,
    (CFStringRef)@"!*'();:@&=+$,/?%#[]",
    kCFStringEncodingUTF8) autorelease];
Percent escaping characters is only used in URLs. It's not part of the MIME spec.
I don't see why it wouldn't work. Are you sure these are proper UTF-8 characters?! So long as you're passing a string, Mail should package everything in an email itself.

numerical value of a unicode character in objective c

Is it possible to get a numerical value from a Unicode character in Objective-C?
@"A" is 0041, @"➜" is 279C, @"Ω" is 03A9, @"झ" is 091D... ?
OK, so it’s perhaps worth pointing a few things out in a separate answer here. First, the term “character” is ambiguous, so we should choose a more appropriate term depending on what we mean. (See Characters and Grapheme Clusters in the Apple developer docs, as well as the Unicode website for more detail.)
If you are asking for the UTF-16 code unit, then you can use
unichar ch = [myString characterAtIndex:ndx];
Note that this is only equivalent to a Unicode code point when the code point is within the Basic Multilingual Plane (i.e. it is U+FFFF or below).
If you are asking for the Unicode code point, then you should be aware that UTF-16 supports characters outside of the BMP (i.e. U+10000 and above) using surrogate pairs. Thus there will be two UTF-16 code units for any code point at or above U+10000. To detect this case, you need to do something like
uint32_t codepoint = [myString characterAtIndex:ndx];
if ((codepoint & 0xfc00) == 0xd800) {
    unichar ch2 = [myString characterAtIndex:ndx + 1];
    codepoint = (((codepoint & 0x3ff) << 10) | (ch2 & 0x3ff)) + 0x10000;
}
Note that in production code, you should also test for and cope with the case where the surrogate pair has been truncated somehow.
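As a rough illustration of that extra checking, here is a sketch in plain C (it also compiles as Objective-C) that decodes one code point from a buffer of UTF-16 code units, such as one filled by -getCharacters:range:. The function name next_code_point is hypothetical, and returning U+FFFD for unpaired or truncated surrogates is an assumption made for this example; other policies (such as reporting an error) are equally valid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHAR 0xFFFDu  /* U+FFFD, used here for malformed input */

/* Decodes the code point starting at units[*index] and advances *index
 * past it. Unpaired or truncated surrogates yield U+FFFD. */
uint32_t next_code_point(const uint16_t *units, size_t length, size_t *index)
{
    assert(units != NULL && index != NULL && *index < length);
    uint32_t u = units[(*index)++];

    if ((u & 0xFC00u) == 0xD800u) {            /* high (lead) surrogate */
        if (*index < length && (units[*index] & 0xFC00u) == 0xDC00u) {
            uint32_t low = units[(*index)++];
            return (((u & 0x3FFu) << 10) | (low & 0x3FFu)) + 0x10000u;
        }
        return REPLACEMENT_CHAR;               /* truncated or unpaired */
    }
    if ((u & 0xFC00u) == 0xDC00u)              /* stray low surrogate */
        return REPLACEMENT_CHAR;
    return u;
}
```

For the code units 0xD834 0xDD1E (the surrogate pair for U+1D11E), the function returns 0x1D11E and advances the index by two. If the buffer ends after 0xD834, it returns 0xFFFD instead of reading past the end.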
Importantly, neither UTF-16 code units nor Unicode code points necessarily correspond to anything that an end user would regard as a “character” (the Unicode Consortium generally refers to this as a grapheme cluster to distinguish it from other possible meanings of “character”). There are many examples, but the simplest to understand are probably the combining diacritical marks. For instance, the character ‘Ä’ can be represented as the Unicode code point U+00C4, or as a pair of code points, U+0041 U+0308.
Sometimes people (like @DietrichEpp in the comments on his answer) will claim that you can deal with this by converting to precomposed form before dealing with your string. This is something of a red herring, because precomposed form only deals with characters that have a precomposed equivalent in Unicode. For example, it will not help with all combining marks; it will not help with Indic or Arabic scripts; it will not help with Hangul Jamos. There are many other cases as well.
If you are trying to manipulate grapheme clusters (things the user might think of as “characters”), you should probably make use of the NSString methods -rangeOfComposedCharacterSequencesForRange: and -rangeOfComposedCharacterSequenceAtIndex:, or the CFString function CFStringGetRangeOfComposedCharactersAtIndex. Obviously you cannot hold a grapheme cluster in an integer variable and it has no inherent numerical value; rather, it is represented by a string of code points, which are represented by a string of code units. For instance:
NSRange gcRange = [myString rangeOfComposedCharacterSequenceAtIndex:ndx];
NSString *graphemeCluster = [myString substringWithRange:gcRange];
Note that graphemeCluster may be arbitrarily long(!)
Even then, we have ignored the effects of matters such as Unicode’s support for bidirectional text. That is, the order of the code points represented by the code units in your NSString may in some cases be the reverse of what you might expect. The worst cases involve things like English text embedded in Arabic or Hebrew; this is supported by the Cocoa Text system, and so you really can end up with bidirectional strings in your code.
To summarise: generally speaking one should avoid examining NSString and CFString instances unichar by unichar. If at all possible, use an appropriate NSString method or CFString function instead. If you do find yourself examining the UTF-16 code units, please familiarise yourself with the Unicode standard first (I recommend “Unicode Demystified” if you can’t stomach reading through the Unicode book itself), so that you can avoid the major pitfalls.
Cocoa strings allow you to access the UTF-16 code units using -characterAtIndex:, so the following code returns the first Unicode code point of a non-empty string:
// Returns the first code point of str, combining a surrogate pair if present.
unsigned strToChar(NSString *str)
{
    unsigned c1, c2;
    c1 = [str characterAtIndex:0];
    if ((c1 & 0xfc00) == 0xd800) {
        // High surrogate: combine with the low surrogate that follows.
        c2 = [str characterAtIndex:1];
        return (((c1 & 0x3ff) << 10) | (c2 & 0x3ff)) + 0x10000;
    } else {
        return c1;
    }
}
I am not aware of any convenience functions for this. You can use -characterAtIndex: by itself if you are okay with your code breaking horribly when someone uses characters outside the BMP; a number of applications on OS X break horribly in this way.
The following should render as a musical "G clef", U+1D11E, but if you copy and paste it into some text editors (TextMate), they'll let you do bizarre things like delete half of the character, at which point your text file is garbage.
𝄞