Regular expressions in an Objective-C Cocoa application - objective-c

Initial Googling indicates that there's no built-in way to do regular expressions in an Objective-C Cocoa application.
So four questions:
Is that really true?
Are you kidding me?
Ok, then is there a nice open-source library you recommend?
What are ways to get close enough without importing a library, perhaps with the NSScanner class?

I noticed that as of iOS 4.0 Apple provides an NSRegularExpression class. Additionally, as of 10.7, the class is available under OS X.

Yes, there's no regex support in Cocoa. If you're only interested in boolean matching, you can use NSPredicate which supports ICU regex syntax. But usually you're interested in the position of the match or position of subexpressions, and you cannot get it with NSPredicate.
As mentioned you can use regex POSIX functions. But they are considered slow, and the regex syntax is limited compared to other solutions (ICU/pcre).
There are many OSS libraries; CocoaDev has an extensive list.
RegExKitLite, for example, doesn't require any libraries; just add the .m and .h to your project.
(My complaint against RegExKitLite is that it extends NSString via category, but it can be considered as a feature too. Also it uses the nonpublic ICU libraries shipped with the OS, which isn't recommended by Apple.)

RegexKit is the best I've found yet. Very Cocoa:y. I'm using the "Lite" version in several of our iPhone apps:
sourceforge
lingonikorg

You can use the POSIX Regular Expressions library (Yay for a POSIX compliant OS). Try
man 3 regex
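For illustration, here is a minimal sketch of the POSIX calls that man page documents (regcomp, regexec, regfree), written as C++ so it can sit in a .mm file. The helper name firstMatch is made up for this example and is not part of any library; it assumes a POSIX system and extended regex syntax.

```cpp
#include <cassert>
#include <regex.h>   // POSIX regex API: regcomp, regexec, regfree
#include <string>

// Hypothetical helper: returns the first substring of `text` that matches the
// POSIX extended regex `pattern`, or an empty string if there is no match or
// the pattern does not compile.
std::string firstMatch(const std::string& pattern, const std::string& text) {
    regex_t re;
    if (regcomp(&re, pattern.c_str(), REG_EXTENDED) != 0) {
        return "";  // pattern failed to compile
    }
    regmatch_t m[1];
    std::string result;
    if (regexec(&re, text.c_str(), 1, m, 0) == 0) {
        // rm_so and rm_eo are byte offsets of the match start and end.
        assert(m[0].rm_so >= 0 && m[0].rm_eo >= m[0].rm_so);
        result = text.substr(m[0].rm_so, m[0].rm_eo - m[0].rm_so);
    }
    regfree(&re);  // always release the compiled pattern
    return result;
}
```

Calling firstMatch("[0-9]+", "abc 123 def") should give back the digits. Unlike NSPredicate, this gives you the match position through rm_so and rm_eo.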

The cheap and dirty hack solution that I use to solve regex and JSON parsing issues is to create a UIWebView object and inject JavaScript function(s) to do the parsing. The JavaScript function then returns a string of the value (or list of values) I care about. In fact, you can store a small library of functions customized for particular tasks and then just call them as needed.
I don't know if this technique scales to huge volumes of repeated parsing requests, but for quick transactional stuff it gets the job done without depending on any extra external resources or code you might not understand.

I like the AGRegex framework which uses PCRE, handy if you are used to the PCRE syntax. The best version of this framework is the one in the Colloquy IRC client as it has been upgraded to use PCRE 6.7:
http://colloquy.info/project/browser/trunk/Frameworks/AGRegex
It's very lightweight, much more so than RegExKit (although not as capable of course).

NSRegularExpression is available since Mac OS X v10.7 and iOS 4.0.

During my search on this topic I came across CocoaOniguruma which uses Oniguruma, the Regular Expression engine behind Ruby1.9 and PHP5. It seems a bit newer compared to the existing OregKit (in Japanese). Not sure how these stack up against other bindings.

Googling a little, I found this library:
RegexOnNSString
Open source library, containing functions like:
-(NSString *) stringByReplacingRegexPattern:(NSString *)regex withString:(NSString *) replacement caseInsensitive:(BOOL)ignoreCase
and using NSRegularExpression class. Quite easy to use and no need to worry about anything.
Please note that NSRegularExpression is available since Mac OS X v10.7 and iOS 4.0, as Datasmid mentioned.

I make it easy. I add a new C++ file to my Objective C project, rename it as .mm, and then create a standard C++ class inside. Then, I make a static class method in the "public:" section for a C++ function that takes an NSString and returns an NSString (or NSArray, if that's what you want). I then convert NSString to C++ std::string like so:
// If anyone knows a more efficient way, let me know in the comments.
// The "if" condition below is needed because constructing a std::string
// crashes if the string is nil ([nil UTF8String] returns NULL); the empty
// check just skips a pointless copy.
// assume #include <string>
std::string s = "";
if (sInput != nil && ![sInput isEqualToString:@""]) {
    std::string sTemp([sInput UTF8String]);
    s = sTemp;
}
From there, I can use regex_replace like so:
// assume #include <regex> and that sRegExp is a std::regex, not a std::string
std::string sResult = std::regex_replace(sSource, sRegExp, sReplaceWith);
Then, I can convert that std::string back into an NSString with:
NSString *sResponse2 = @(sResult.c_str());
If you're only using C++ for this one function, then you may find it suitable to call this file extra.mm (class name Extra) and put this static class method in it, and then add other static class methods when a situation arises where it just makes sense to do it in C++ because it's less hassle. (There are cases where ObjC does something in fewer lines of code, and some cases where C++ does it in fewer lines of code.)
P.S. Still yet another way with this is to use a .mm file but make an Objective C wrapper around the use of std::string and std::regex_replace() (or regex_match()).
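To make the C++ side concrete, here is a rough sketch of just the std::string and std::regex part, with the NSString conversion left out so it compiles as plain C++. The names toStdString and replaceAll are invented for this example. Note that std::regex defaults to ECMAScript syntax and works on bytes, so multi-byte UTF-8 characters are not treated as single characters.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: null-safe conversion of a C string (such as the result
// of -UTF8String, which is NULL for a nil NSString) to std::string.
std::string toStdString(const char* utf8) {
    return utf8 ? std::string(utf8) : std::string();
}

// Hypothetical helper: replaces every match of `pattern` in `source` with
// `replacement`. An invalid pattern makes std::regex throw std::regex_error.
std::string replaceAll(const std::string& source,
                       const std::string& pattern,
                       const std::string& replacement) {
    // An empty pattern matches at every position, which is rarely intended.
    assert(!pattern.empty());
    const std::regex re(pattern);  // compile the pattern once per call
    return std::regex_replace(source, re, replacement);
}
```

In the .mm file you would feed it something like toStdString([sInput UTF8String]) and box the result back with @(result.c_str()), as described above.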

Related

Why does Math.sin() delegate to StrictMath.sin()?

I was wondering why Math.sin(double) delegates to StrictMath.sin(double) after I came across the question in a Reddit thread. The code fragment in question looks like this (JDK 7u25):
Math.java :
public static double sin(double a) {
return StrictMath.sin(a); // default impl. delegates to StrictMath
}
StrictMath.java :
public static native double sin(double a);
The second declaration is native, which seems reasonable to me. The doc of Math states that:
Code generators are encouraged to use platform-specific native libraries or microprocessor instructions, where available (...)
And the question is: isn't the native library that implements StrictMath platform-specific enough? What more can a JIT know about the platform than an installed JRE (please only concentrate on this very case)? In other words, why isn't Math.sin() native already?
I'll try to wrap up the entire discussion in a single post.
Generally, Math delegates to StrictMath. Obviously, the call can be inlined so this is not a performance issue.
StrictMath is a final class with native methods backed by native libraries. One might think that native means optimal, but this isn't necessarily the case. Looking through the StrictMath javadoc one can read the following:
(...) the definitions of some of the numeric functions in this package require that they produce the same results as certain published algorithms. These algorithms are available from the well-known network library netlib as the package "Freely Distributable Math Library," fdlibm. These algorithms, which are written in the C programming language, are then to be understood as executed with all floating-point operations following the rules of Java floating-point arithmetic.
My understanding of this doc is that the native library implementing StrictMath is implemented in terms of the fdlibm library, which is multi-platform and known to produce predictable results. Because it's multi-platform, it can't be expected to be an optimal implementation on every platform, and I believe that this is the place where a smart JIT can fine-tune the actual performance, e.g. by statistical analysis of input ranges and adjusting the algorithms/implementation accordingly.
Digging deeper into the implementation, it quickly turns out that the native library backing StrictMath actually uses fdlibm:
StrictMath.c source in OpenJDK 7 looks like this:
#include "fdlibm.h"
...
JNIEXPORT jdouble JNICALL
Java_java_lang_StrictMath_sin(JNIEnv *env, jclass unused, jdouble d)
{
return (jdouble) jsin((double)d);
}
and the sine function is defined in fdlibm/src/s_sin.c, referring in a few places to the __kernel_sin function that comes directly from the header fdlibm.h.
While I'm temporarily accepting my own answer, I'd be glad to accept a more competent one when it comes up.
Why does Math.sin() delegate to StrictMath.sin()?
The JIT compiler should be able to inline the StrictMath.sin(a) call. So there's little point creating an extra native method for the Math.sin() case ... and adding extra JIT compiler smarts to optimize the calling sequence, etcetera.
In the light of that, your objection really boils down to an "elegance" issue. But the "pragmatic" viewpoint is more persuasive:
Fewer native calls make the JVM core and JIT easier to maintain, less fragile, etcetera.
If it ain't broken, don't fix it.
At least, that's how I imagine how the Java team would view this.
The question assumes that the JVM actually runs the delegation code. On many JVMs, it won't. Calls to Math.sin(), etc. will potentially be replaced by the JIT with some intrinsic function code (if suitable) transparently. This will typically be done in a way that is unobservable to the end user. This is a common trick for JVM implementers where interesting specializations can happen (even if the method is not tagged as native).
Note however that most platforms can't simply drop in the single processor instruction for sin, because it is only suitable for certain input ranges (e.g., see: Intel discussion).
The Math API permits non-strict but better-performing implementations of its methods but does not require them, and by default Math simply uses the StrictMath implementation.

How to statically dump all ObjC methods called in a Cocoa App?

Assume I have a Cocoa-based Mac or iOS app. I'd like to run a static analyzer on my app's source code or my app's binary to retrieve a list of all Objective-C methods called therein. Is there a tool that can do this?
A few points:
I am looking for a static solution. I am not looking for a dynamic solution.
Something which can be run against either a binary or source code is acceptable.
Ideally the output would just be a massive de-duped list of Objective-C methods like:
…
-[MyClass foo]
…
+[NSMutableString stringWithCapacity:]
…
-[NSString length]
…
(If it's not de-duped that's cool)
If other types of symbols (C functions, static vars, etc) are present, that is fine.
I'm familiar with class-dump, but AFAIK, it dumps the declared Classes in your binary, not the called methods in your binary. That's not what I'm looking for. If I am wrong, and you can do this with class-dump, please correct me.
I'm not entirely sure this is feasible. So if it's not, that's a good answer too. :)
The closest I'm aware of is otx, which is a wrapper around otool and can reconstruct the selectors at objc_msgSend() call sites.
http://otx.osxninja.com/
If you are asking for finding a COMPLETE list of all methods called then this is impossible, both statically and dynamically. The reason is that methods may be called in a variety of ways and even be dynamically and programmatically assembled.
In addition to regular method invocations using the Objective-C messages like [Object message] you can also dispatch messages using the C-API functions from objc/message.h, e.g. objc_msgSend(str, del). Or you can dispatch them using the NSInvocation API or with performSelector:withObject: (and similar methods), see the examples here. The selectors used in all these cases can be static strings or they can even be constructed programmatically from strings, using things like NSSelectorFromString.
To make matters worse Objective-C even supports dynamic message resolution which allows an object to respond to messages that do not correspond to methods at all!
If you are satisfied with only specific method invocations then parsing the source code for the patterns listed above will give you a minimal list of methods that may be called during execution. But the list may be both incomplete (i.e., not contain methods that may be called) as well as overcomplete (i.e., may contain methods that are not called in practice).
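As a rough illustration of that kind of source parsing, here is a small sketch that only looks for two of the patterns mentioned above: literal @selector(...) expressions and NSSelectorFromString(@"...") calls with a string literal. The function name literalSelectors is invented for this example, it ignores comments and preprocessor tricks, and it does not attempt to handle bracketed message sends, so treat the result as a lower bound at best.

```cpp
#include <cassert>
#include <regex>
#include <set>
#include <string>

// Hypothetical helper: collects selector names that appear literally in
// Objective-C source text as @selector(name) or NSSelectorFromString(@"name").
// The std::set de-duplicates the names.
std::set<std::string> literalSelectors(const std::string& source) {
    static const std::regex re(
        R"re(@selector\(\s*([A-Za-z_][A-Za-z0-9_:]*)\s*\)|NSSelectorFromString\(\s*@"([^"]*)"\s*\))re");
    std::set<std::string> found;
    for (std::sregex_iterator it(source.begin(), source.end(), re), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        // Exactly one of the two alternatives captured the selector name.
        assert(m[1].matched || m[2].matched);
        found.insert(m[1].matched ? m[1].str() : m[2].str());
    }
    return found;
}
```

Selectors assembled at runtime (for example from concatenated strings) are invisible to a scan like this, which is exactly the incompleteness described above.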
Another great tool is class-dump, which was always my first choice for static analysis.
otool -oV /path to executable/ | grep name | awk '{print $3}'

Add keyword to Objective-C using Clang

How would I go about adding a relatively trivial keyword to Objective-C using the Clang compiler? For example, adding a literal @yes which maps to [NSNumber numberWithBool:YES].
I have looked at the (excellent) source code for Clang and believe that most of the work I would need to do is in lib/Rewrite/RewriteObjC.cpp. There is the method RewriteObjC::RewriteObjCStringLiteral (see previous link) which does a similar job for literal NSString * instances.
I ask this question as Clang is very modular and I'm not sure which .td (see tablegen) files, .h files and AST visitor passes I would need to modify to achieve my goal.
If I understand clang's code correctly (I'm still learning, so take this with caution), I think the starting point for this type of addition would be in Parser::ParseObjCAtExpression within clang/lib/Parse/ParseObjc.cpp.
One thing to note is that the Parser class is implemented in several files (seemingly separated by input language), but is declared entirely in clang/include/Parser.h.
Parser has many methods following the pattern of ParseObjCAt, e.g.,
ParseObjCAtExpression
ParseObjCAtStatement
ParseObjCAtDirectives
etc.
Specifically, line 1779 of ParseObjc.cpp appears to be where the parser detects an Objective-C string literal in the form of @"foo". However, it also calls ParsePostfixExpressionSuffix which I don't fully understand yet. I haven't figured out how it knows to parse a string literal (vs. an @synchronized, for example).
ExprResult Parser::ParseObjCAtExpression(SourceLocation AtLoc) {
...
return ParsePostfixExpressionSuffix(ParseObjCStringLiteral(AtLoc));
...
}
If you haven't yet, visit clang's "Getting Started" page to get started with compiling.

Is there a way to read in files in TypedStream format

I have a file in the following format:
NeXT/Apple typedstream data, little endian, version 4, system 1000
Looking at it in a hex editor, it's clearly made up of NSsomething objects (NSArray, NSValue, etc). It also appears to have an embedded plist!
I'm guessing there's a straightforward way to read this file and output it in some more readable fashion (similar to the output of repr() or print_r()).
I assume I'll need to do this using Objective-C?
First, some history:
Older versions of the Objective-C runtime (pre-OS X) included a pseudo-class called NXTypedStream, which is the pre-OPENSTEP ancestor of NSCoder. Older versions of Foundation contained a header called NSCompatibility.h, which had functions and categories for dealing with old NeXTStep formats. NSCompatibility.h no longer exists, but a (deprecated) subset of that functionality can still be found in NSCoder.h.
NSCoder debuted as part of the original Foundation Kit in OPENSTEP, but probably used typedstreams as its serialization format. At some point, it was switched over to a plist-based format. The current version of Interface Builder (as part of Xcode) is still able to read older, typedstream-based NIBs, which is a good clue that this functionality still exists in OS X.
Now, the solution:
I can't find this in any (current) Apple documentation, but it turns out that NSCoder/NSUnarchiver can still read typedstream files just fine. If you want to read a typedstream file in a Cocoa/Objective-C program, just do this:
NSUnarchiver *typedStreamUnarchiver = [[NSUnarchiver alloc] initForReadingWithData:[NSData dataWithContentsOfFile:@"<path to your typedstream file>"]];
That's it! The decoding is handled internally in a function called _decodeObject_old. Now you can unarchive using standard NSCoder methods, like:
id object = [typedStreamUnarchiver decodeObject];
NSLog(@"Decoded object: %@", object);
Note that if the class in the typedstream is not a valid class in your program, it will throw an NSArchiverArchiveInconsistency exception.
See also: http://www.stone.com/The_Cocoa_Files/Legacy_File_Formats.html
If it's a binary plist, it should be easy to read it with Xcode / Plist Editor, plutil, or your own Objective-C code. Otherwise, depending on the file, it could be more challenging. The file stores a serialized representation of Objective-C objects. As such, it's only intended to be readable by the software that created it.

Objective-C equivalent of Java Language Specification or C++ Standard?

What is the Objective-C equivalent of the Java Language Specification or the C++ Standard?
Is it this:
http://developer.apple.com/documentation/Cocoa/Conceptual/ObjectiveC/Introduction/introObjectiveC.html ?
(I'm just looking for an (official) authoritative document which will explain the little nitty-gritties of the language. I'll skip the why for now :)
Appendix A of the document you linked to is a description of all of the language features, which is the closest we have to a specification (Appendix B used to be a grammar specification, but they've clearly removed that from the later versions of the document).
There has never been a standardisation of Objective-C and it's always been under the control of a single vendor: initially StepStone, then NeXT Computer licensed it (and ultimately bought the IP), and finally Apple consumed NeXT Software. I expect there's little motivation to go through the laborious process of standardisation on Apple's part, especially as there are no accusations of ObjC being an anticompetitive platform which standardisation could mitigate.
There is none. The link you provided is the only 'official' documentation, which is essentially a prose description, and not a rigorous language specification. Apple employees suggest that this is sufficient for most purposes, and if you require something more formal you should file a bug report (!). Sadly, the running joke is the Objective-C standard is defined by whatever the compiler is able to compile.
Many consider Objective-C to be either a "strict superset" or "superset" of C. IMHO, for 'classic' Objective-C (or, Objective-C 1.0), I would consider this to be a true statement. In particular, I'm not aware of any Objective-C language addition that does not map to an equivalent "plain C" statement. In essence, this means the Objective-C additions are pure syntactic sugar, and one can use the C standard in effect to reason about the nitty gritty. I'm not convinced that this is entirely true for Objective-C 2.0 with GC enabled. This is because pointers to GC managed memory need to be handled specially (the compiler must insert various barriers depending on the particulars of the pointer). Since the GC pointer type qualifiers, such as __strong, are implemented as __attribute__(()) inside gcc, this means that void *p; and void __strong *p; are similarly qualified pointers according to the C99 standard. The problems that this can cause, and even the ability to write programs that operate in a deterministic manner, are either self evident or not (consult your local language lawyer or compiler writer for more information).
Another problem that comes up from time to time is that the C language has continued to evolve relative to the Objective-C language. Objective-C dates back to the mid 1980s, which predates the ANSI C standard. Consider the following code fragment:
NSMutableArray *mutableArray = [NSMutableArray array];
NSArray *array = mutableArray;
This is legal Objective-C code as defined by the official prose description of the language. This is also one of the main concepts behind Object Oriented programming. However, when one considers those statements from the perspective of "strict superset of C99", one runs into a huge problem. In particular, this violates C99's strict aliasing rules. A standards grade language specification would help clarify the treatment and behavior of such conflicts. Unfortunately, because no such document exists, there can be much debate over such details, which ultimately results in bugs in the compiler. This has resulted in a bug in gcc that dates all the way back to version 3.0 (gcc bug #39753).
Apple's document is about the best you're going to get. Like many other languages, Objective-C doesn't have a formal standard or specification; rather, it is described mostly by its canonical implementations.
Further resources include:
The Objective-C Language and GNUstep Base Library Programming Manual.
The NeXT developers library
Apple now uses clang from the llvm.org project.
Some of the language elements are defined in this context,
e.g. Objective-C literals --> http://clang.llvm.org/docs/ObjectiveCLiterals.html
But I didn't find a clear overview of all elements.
--- updated --
The source of Apple's clang is available (as open source) here:
http://opensource.apple.com/source/clang/