How do I match non-ASCII characters with RegexKitLite? - objective-c

I am using RegexKitLite and I'm trying to match a pattern.
The following regex patterns do not capture my word that includes N with a titlde: ñ.
Is there a string conversion I am missing?
subjectString = #"define_añadir";
//regexString = #"^define_(.*)"; //this pattern does not match, so I assume to add the ñ
//regexString = #"^define_([.ñ]*)"; //tried this pattern first with a range
regexString = #"^define_((?:\\w|ñ)*)"; //tried second
NSString *captured= [subjectString stringByMatching:regexString capture:1L];
//I want captured == añadir

Looks like an encoding problem to me. Either you're saving the source code in an encoding that can't handle that character (like ASCII), or the compiler is using the wrong encoding to read the source files. Going back to the original regex, try creating the subject string like this:
subjectString = #"define_a\xC3\xB1adir";
or this:
subjectString = #"define_a\u00F1adir";
If that works, check the encoding of your source code files and make sure it's the same encoding the compiler expects.
EDIT: I've never worked with the iPhone technology stack, but according to this doc you should be using the stringWithUTF8String method to create the NSString, not the #"" literal syntax. In fact, it says you should never use non-ASCII characters (that is, anything not in the range 0x00..0x7F) in your code; that way you never have to worry about the source file's encoding. That's good advice no matter what language or toolset you're using.

Related

Inconsistencies in URL encoding methods across Objective-C and Swift

I have the following Objective-C code:
[#"http://www.google.com" stringByAddingPercentEncodingWithAllowedCharacters:[NSCharacterSet URLPathAllowedCharacterSet]];
// http%3A//www.google.com
And yet, in Swift:
"http://www.google.com".addingPercentEncoding(withAllowedCharacters: .urlPathAllowed)
// http://www.google.com
To what can I attribute this discrepancy?
..and for extra credit, can I rely on this code to encode for url path reserved characters while passing a full url like this?
The issue actually rests in the difference between NSString method stringByAddingPercentEncodingWithAllowedCharacters and String method addingPercentEncoding(withAllowedCharacters:). And this behavior has been changing from version to version. (It looks like the latest beta of iOS 11 now restores this behavior we used to see.)
I believe the root of the issue rests in the particulars of how paths are percent encoded. Section 3.3 of RFC 3986 says that colons are permitted in paths except in the first segment of a relative path.
The NSString method captures this notion, e.g. imagine a path whose first directory was foo: (with a colon) and a subdirectory of bar: (also with a colon):
NSString *string = #"foo:/bar:";
NSCharacterSet *cs = [NSCharacterSet URLPathAllowedCharacterSet];
NSLog(#"%#", [string stringByAddingPercentEncodingWithAllowedCharacters:cs]);
That results in:
foo%3A/bar:
The : in the first segment of the page is percent encoded, but the : in subsequent segments are not. This captures the logic of how to handle colons in relative paths per RFC 3986.
The String method addingPercentEncoding(withAllowedCharacters:), however, does not do this:
let string = "foo:/bar:"
os_log("%#", string.addingPercentEncoding(withAllowedCharacters: .urlPathAllowed)!)
Yields:
foo:/bar:
Clearly, the String method does not attempt that position-sensitive logic. This implementation is more in keeping with the name of the method (it considers solely what characters are "allowed" with no special logic that tries to guess, based upon where the allowed character appears, whether it's truly allowed or not.)
I gather that you are saddled with the code supplied in the question, but we should note that this behavior of percent escaping colons in relative paths, while interesting to explain what you experienced, is not really relevant to your immediate problem. The code you have been provided is simply incorrect. It is attempting to percent encode a URL as if it was just a path. But, it’s not a path; it’s a URL, which is a different thing with its own rules.
The deeper insight in percent encoding URLs is to acknowledge that different components of a URL allow different sets of characters, i.e. they require different percent encoding. That’s why NSCharacterSet has so many different URL-related character sets.
You really should percent encode the individual components, percent encoding each with the character set allowed for that type of component. Only when the individual components are percent encoded should they then be concatenated together to form the whole the URL.
Alternatively, NSURLComponents is designed precisely for this purpose, getting you out of the weeds of percent-encoding the individual components yourself. For example:
var components = URLComponents(string: "http://httpbin.org/post")!
let foo = URLQueryItem(name: "foo", value: "bar & baz")
let qux = URLQueryItem(name: "qux", value: "42")
components.queryItems = [foo, qux]
let url = components.url!
That yields the following, with the & and the two spaces properly percent escaped within the foo value, but it correctly left the & in-between foo and qux:
http://httpbin.org/post?foo=bar%20%26%20baz&qux=42
It’s worth noting, though, that NSURLComponents has a small, yet fairly fundamental flaw: Specifically, if you have query values, NSURLQueryItem, that could have + characters, most web services need that percent escaped, but NSURLComponents won’t. If your URL has query components and if those query values might include + characters, I’d advise against NSURLComponents and would instead advise percent encoding the individual components of a URL yourself.

Sabre Web/ .NET - Special Characters In SabreCommandLLSRQ Response Not Handed Properly

I'm using VB.NET to consume Sabre Web Services, primarily using SabreCommandLLSRQ to send native Sabre commands. Sending special characters without any special encoding works fine, but when I try to manipulate any response that contain the Cross of Lorraine using the Response element of SabreCommandLLSRS all of the Cross of Lorraine chars are missing if I display my string in a MsgBox or try to manipulate it.
If I push that string into my clipboard and view it in Notepad++, the characters are there but they seem to be encoded improperly - they come through as something like "‡". I'm pretty new to unicode encoding so that's all a bit above my head.
I've tried using the Replace method of String Builder to change those characters to something visible no avail - anyone have a way around this issue?
Strangely, the other special characters (e.g. "¤") seem to come through just fine.
This section in Dev Studio includes references to special character hex codes:
https://developer.sabre.com/docs/read/soap_apis/management/utility/Send_Sabre_Command
Does this help?
This is a pain in the behind due to the invisible characters.
String replace does work you just need to make sure you capture the invisible character after the Â
Simply in the SabreCommandSend function before you send the string to Sabre put something like the below.
Hopefully this should copy and paste straight out including the invisible character.
if (tempCommand.Contains("‡"))
{
tempCommand = tempCommand.Replace("‡", "Â");
}
I figured out how to get this to work, but its not pretty so if anyone has a better way to do it, I'm all ears.
I couldn't figure out what char to use to do the simple string Replace method, so instead I'm casting the string to a byte array, iterating through the array and replacing any strange characters I find, recasting the byte array into a raw string and doing the string replace on that:
Imports System.Text
Dim byteArray() As Byte = System.Text.Encoding.ASCII.GetBytes(sabreResponse)
For i = 0 To byteArray.Length - 1
If byteArray(i) = 63 Then 'this is a question mark char
byteArray(i) = 94 'caret that doesn't exist in native Sabre
End If
Next
MyClass.respString = System.Text.ASCIIEncoding.ASCII.GetString(byteArray)
MyClass.respString = MyClass.respString.Replace("^", "¥")
For whatever reason, the string replace method works after I swap out the offending byte with a dummy character but not before.

Objective C - char with umlaute to NSString

I am using libical which is a library to parse the icalendar format (RFC 2445).
The problem is, that there may be some german umlaute for example in the location field.
Now libical returns a const char * for each value like:
"K\303\203\302\274nstlerhaus in M\303\203\302\274nchen"
I tried to convert it to NSString with:
[NSString stringWithCString:icalvalue_as_ical_string_r(value) encoding:NSUTF8StringEncoding];
But what I get is:
Künstlerhaus in München
Any suggestions? I would appreciate any help!
Seems like your string got doubly-UTF-8-encoded, because "Künstlerhaus in München" actually is UTF-8, if you UTF-8-decode that again you should get the correct string.
Bear in mind though that you shouldn't be satisfied with that result. There are combinations where a doubly-UTF-8-encoded string can't be simply be decoded by doing a double-UTF-8-decode. Some encoding combinations are irreversible. So in your situation I'd suggest you find out why the string got doubly-UTF-8-encoded in the first place, probably the ical is stored in the wrong encoding on the hard disk, or libical uses the wrong character set to access it, or if you're getting the ical from a server, perhaps the charset there is wrong for text/ical, etc, etc...
The C string does not seem to be encoded in UTF-8, as there are four bytes for each of the characters. For example ü would be encoded as \xc3\xbc (or \195\188) in UTF-8. So the input is either already garbled when you receive it or it uses some other encoding.

Objective-C RegexKitLite match one string or another

I'm trying to use regexkitlite for string matching in objective-c and I'm having some problems with it. What I'm trying to do is search a large string for substrings matching:
"http://[something].jpg"
"http://[something].png"
Basically, I want to find all links to images from the original string. What I have currently is:
NSString *regexString = #"http://[a-zA-Z0-9._%+-/]+\.jpg";
Now this is working for .jpg images, but of course it doesn't match .png images. I would really like to use one regexString that would match either, but I can't figure out how.
Reading some regex tutorials for other languages, I think it is something along the lines of:
NSString *regexString = #"http://[a-zA-Z0-9._%+-/]+\.(?:jpg|png)";
But I can't quite get it right.
Any help would be greatly appreciated.
You don't need a non-capturing group around the file extensions. It's good practice to use them, but it could be causing an error here. (Does the library support it?)
Also, I simplified your regex slightly by using a predefined character class.
NSString *regexString = #"http://[\w.%+-/]+\.(jpg|png)";
You can see this in action here.
You can also add any file extensions that you want. Ex: (jpg|png|gif|...).
Updated: Apple now includes regular expression support with NSRegularExpression, which is available in OS X v10.7 and later.

Remove & character from string objective c

How would I go about removing the "&" symbol from a string. It's making my xml parser fail.
I have tried
[currentParsedCharacterData setString: [currentParsedCharacterData stringByReplacingOccurrencesOfString:#"&" withString:#"and"]];
But it seems to have no effect
Really what this boils down to is you want to gracefully handle invalid XML. The XML Parser is properly telling you that this XML is invalid, and is thusly failing to parse. Assuming you have no control over this XML content, I would suggest pre-parsing it for common errors like this, the output of which would be a sanitized XML doc that has a better chance of success.
To sanitize the doc, it may be as simple as doing search and replace, the problem with just doing a blanket replace on any & is that there are valid uses of &, for example & or ©. You would end up munging the XML by creating something like this: andcopy;
You could search for "ampersand space" but that won't catch a string that has an ampersand as the last character (an out-case that might be easily handled). What you are really searching for are occurrences of & that are not followed by a ; or those of which where any type of whitespace is encountered before the following ; because the semi-colon is fine on its own.
If you need more power because you need to detect this, and other errors, I would suggest going to NSScanner or RegEx matching to search for occurrences of this and other common errors during your sanitization step. It is also very common for XML files to be rather large things, so you need to be careful when dealing with these as in-memory strings as this can easily lead to application crashes. Breaking it up into manageable chunks is something NSScanner can do very well.
For a quick attempt look at stringByReplacingOccurrencesOfString on NSString
NSString* str = #"a & b";
[str stringByReplacingOccurrencesOfString:#"&" withString:#"and"]; // better replace by &
However you should also deal with other characters i.e. < >