How to Parse Some Wiki Markup - objective-c

Hey guys, given a data set in plain text such as the following:
==Events==
* [[312]] – [[Constantine the Great]] is said to have received his famous [[Battle of Milvian Bridge#Vision of Constantine|Vision of the Cross]].
* [[710]] – [[Saracen]] invasion of [[Sardinia]].
* [[939]] – [[Edmund I of England|Edmund I]] succeeds [[Athelstan of England|Athelstan]] as [[King of England]].
*[[1275]] – Traditional founding of the city of [[Amsterdam]].
*[[1524]] – [[Italian Wars]]: The French troops lay siege to [[Pavia]].
*[[1553]] – Condemned as a [[Heresy|heretic]], [[Michael Servetus]] is [[burned at the stake]] just outside [[Geneva]].
*[[1644]] – [[Second Battle of Newbury]] in the [[English Civil War]].
*[[1682]] – [[Philadelphia]], [[Pennsylvania]] is founded.
I would like to end up with an NSDictionary or other form of collection so that I can have the year (The Number on the left) mapping to the excerpt (The text on the right). So this is what the 'template' is like:
*[[YEAR]] – THE_TEXT
Though I would like the excerpt to be plain text, that is, no wiki markup so no [[ sets. Actually, this could prove difficult with alias links such as [[Edmund I of England|Edmund I]].
I am not all that experienced with regular expressions so I have a few questions. Should I first try to 'beautify' the data? For example, removing the first line which will always be ==Events==, and removing the [[ and ]] occurrences?
Or perhaps a better solution: Should I do this in passes? So for example, the first pass I can separate each line into * [[710]] and [[Saracen]] invasion of [[Sardinia]]. and store them into different NSArrays.
Then go through the first NSArray of years and only get the text within the [[]] (I say text and not number because it can be 530 BC), so * [[710]] becomes 710.
And then for the excerpt NSArray, go through and if an [[some_article|alias]] is found, make it only be [[alias]] somehow, and then remove all of the [[ and ]] sets?
Is this possible? Should I use regular expressions? Are there any ideas you can come up with for regular expressions that might help?
Thanks! I really appreciate it.
EDIT: Sorry for the confusion, but I only want to parse the above data. Assume that that's the only type of markup that I will encounter. I'm not necessarily looking forward to parsing wiki markup in general, unless there is already a pre-existing library which does this. Thanks again!

This code assumes you are using RegexKitLite:
NSString *data = @"* [[312]] – [[Constantine the Great]] is said to have received his famous [[Battle of Milvian Bridge#Vision of Constantine|Vision of the Cross]].\n\
* [[710]] – [[Saracen]] invasion of [[Sardinia]].\n\
* [[939]] – [[Edmund I of England|Edmund I]] succeeds [[Athelstan of England|Athelstan]] as [[King of England]].\n\
*[[1275]] – Traditional founding of the city of [[Amsterdam]].";
// Capture 1 is the year, capture 2 is the rest of the line.
NSString *captureRegex = @"(?i)(?:\\* *\\[\\[)([0-9]*)(?:\\]\\] \\– )(.*)";
NSRange captureRange;
NSRange stringRange;
stringRange.location = 0;
stringRange.length = data.length;
do
{
    captureRange = [data rangeOfRegex:captureRegex inRange:stringRange];
    if ( captureRange.location != NSNotFound )
    {
        NSString *year = [data stringByMatching:captureRegex options:RKLNoOptions inRange:stringRange capture:1 error:NULL];
        NSString *textStuff = [data stringByMatching:captureRegex options:RKLNoOptions inRange:stringRange capture:2 error:NULL];
        // Continue searching just past the match we found.
        stringRange.location = captureRange.location + captureRange.length;
        stringRange.length = data.length - stringRange.location;
        NSLog(@"Year:%@, Stuff:%@", year, textStuff);
    }
}
while ( captureRange.location != NSNotFound );
Note that you really need to study up on regular expressions to build these well, but here is what this one says:
(?i)
Ignore case. I could have left that out, since I'm not matching letters.
(?:\* *\[\[)
?: means don't capture this block. I escape * to match it literally, then allow zero or more spaces (" *"), then escape two opening brackets (since brackets are also special characters in a regex).
([0-9]*)
Capture a run of digits (the year). Note that this will not match a year such as 530 BC.
(?:\]\] \– )
Here we ignore stuff again, basically matching "]] – ". Note that for any "\" in the regex, I have to add another one in the Objective-C string above, since "\" is a special character in a string. And yes, that means a regex-escaped single "\" ends up as "\\" in an Obj-C string.
(.*)
Just grab anything else. By default "." does not match line terminators, so the match stops at the end of the line, which is why it doesn't just match everything else. You'll have to add code to strip out the [[LINK]] stuff from the text.
The NSRange variables are used to keep matching through the data without re-matching earlier matches.
Don't forget: after you add the RegexKitLite class files, you also need to add the special linker flag (-licucore), or you'll get lots of link errors (the RegexKitLite site has installation instructions).
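The remaining step, stripping the [[LINK]] markup and collecting the results, could be sketched as follows. This is only a sketch under some assumptions: it uses Foundation's NSRegularExpression rather than RegexKitLite, the function names PlainTextFromWikiText and EventsByYear are made up for illustration, and it handles only the link forms shown in the question ([[Target]] and [[Target|Alias]]).

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: "[[Target|Alias]]" becomes "Alias", "[[Target]]" becomes "Target".
static NSString *PlainTextFromWikiText(NSString *wikiText) {
    // Optional non-capturing "Target|" part, then capture 1 = the visible label.
    NSRegularExpression *link =
        [NSRegularExpression regularExpressionWithPattern:@"\\[\\[(?:[^\\]|]*\\|)?([^\\]|]*)\\]\\]"
                                                  options:0
                                                    error:NULL];
    return [link stringByReplacingMatchesInString:wikiText
                                          options:0
                                            range:NSMakeRange(0, wikiText.length)
                                     withTemplate:@"$1"];
}

// Hypothetical helper: maps each year to its excerpt with the link markup removed.
static NSDictionary *EventsByYear(NSString *data) {
    // Capture 1 = text inside the first [[ ]] (so "530 BC" would work too),
    // capture 2 = everything after the en dash (U+2013) up to the end of the line.
    NSRegularExpression *entry =
        [NSRegularExpression regularExpressionWithPattern:@"^\\* *\\[\\[([^\\]]+)\\]\\] *\u2013 *(.*)$"
                                                  options:NSRegularExpressionAnchorsMatchLines
                                                    error:NULL];
    NSMutableDictionary *events = [NSMutableDictionary dictionary];
    NSArray *matches = [entry matchesInString:data options:0 range:NSMakeRange(0, data.length)];
    for (NSTextCheckingResult *match in matches) {
        NSString *year = [data substringWithRange:[match rangeAtIndex:1]];
        NSString *text = [data substringWithRange:[match rangeAtIndex:2]];
        [events setObject:PlainTextFromWikiText(text) forKey:year];
    }
    return events;
}
```

Lines that do not start with "*", such as ==Events==, simply fail to match and are skipped, so there is no need to remove them beforehand. A dictionary keeps only one entry per year; if a year can appear more than once, an array of pairs would be a safer container.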

I'm no good with regular expressions, but this sounds like a job for them. I imagine a regex would sort this out for you quite easily.
Have a look at the RegexKitLite library.

If you want to be able to parse Wikitext in general, you have a lot of work to do. Just one complicating factor is templates. How much effort do you want to go to cope with these?
If you're serious about this, you probably should be looking for an existing library which parses Wikitext. A brief look round finds this CPAN library, but I have not used it, so I can't cite it as a personal recommendation.
Alternatively, you might want to take a simpler approach and decide which particular parts of Wikitext you're going to cope with. This might be, for example, links and headings, but not lists. Then you have to focus on each of these and turn the Wikitext into whatever you want that to look like. Yes, regular expressions will help a lot with this bit, so read up on them, and if you have specific problems, come back and ask.
Good luck!

Related

Regexp in iOS to find comments

I am trying to find and process 'java-style' comments within a string in objective-C.
I have a few regex snippets which almost work but I am stuck on one hurdle: different options seem to make the different styles work.
For example, I am using this to match:
NSArray* matches = [[NSRegularExpression regularExpressionWithPattern:expression options:NSRegularExpressionAnchorsMatchLines error:nil] matchesInString:string options:0 range:searchRange];
The options here allow me to find and process single-line comments (//) but not multiline ones (/* */). If I change the option to NSRegularExpressionDotMatchesLineSeparators, multiline works fine, but I can't find the 'end' of a single-line comment.
I suppose really I need dot-matches-line-separators but I need a better way of finding the end of a single line comment?
The regexps I have so far are:
@"/\\*.*?\\*/"
@"//.*$"
It's clear that if dot matches a line separator then the second one (single line) never 'finishes', but how do I fix this? I found some suggestions for single-line comments that were more like:
@"(\/\/[^"\n\r]*(?:"[^"\n\r]*"[^"\n\r]*)*[\r\n])"
But that doesn't seem to work at all!
Thanks in advance for any pointers.
So it turns out the example I had was pretty close; for some reason I had some additional backslashes in there that weren't needed. It now reads:
@"(//[^\"\n\r]*(?:\"[^\"\n\r]*\"[^\"\n\r]*)*[\r\n])"
(that is, in the Objective-C code). To clarify my own point: I am using NSRegularExpressionDotMatchesLineSeparators, and this now works exactly as I'd expect.
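An alternative that does not depend on either option is to avoid "." altogether and spell out what each comment style may contain. The sketch below assumes that approach; the function name CommentsInSource is hypothetical, and like the patterns above it does not try to recognize comment markers that appear inside string literals.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: returns every /* ... */ and // ... comment found in the source text.
static NSArray *CommentsInSource(NSString *source) {
    // First alternative: block comment, where [\s\S] matches any character including newlines.
    // Second alternative: line comment, which runs until the next CR or LF (or end of input).
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"/\\*[\\s\\S]*?\\*/|//[^\\r\\n]*"
                                                  options:0
                                                    error:NULL];
    NSMutableArray *comments = [NSMutableArray array];
    NSArray *matches = [regex matchesInString:source options:0 range:NSMakeRange(0, source.length)];
    for (NSTextCheckingResult *match in matches) {
        [comments addObject:[source substringWithRange:match.range]];
    }
    return comments;
}
```

Because both styles live in one pattern, a single pass finds them in document order, and a "//" that occurs inside a block comment is consumed as part of that block comment rather than reported separately.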

How to properly convert to a canonical string for searching in Cocoa?

I have a string field that I know that users will want to search on later. Inspired by the WWDC 2012 Core Data Best Practices session I plan to store a normalized version of the string into a separate field so I can optimize my search predicates.
My primary concern is case insensitivity, but while I'm normalizing strings I figure that I should also normalize the unicode representation. But I want to be sure I use the right normalization form (i.e. C,D,KC or KD). And does it matter whether I convert to lowercase first? (Localization is not my strong suit.)
So:
What are the proper methods to call to do the search normalization of the NSString?
What would be the optimal way to make sure the normalized version is stored.
I will post my first attempt as an answer, but I'd love to hear where I am wrong, other suggestions, or improvements. (Unfortunately while they showed the search predicates in that video, I don't think they showed the code from the session.)
For the use case you describe, it doesn't matter whether you pick precomposed or decomposed (C or D; although you will save a bit of space with precomposed), but think carefully about whether you want canonical or compatibility (K forms). TR15 has a nice figure that summarises the differences (Figure 6):
That is: if someone searches for "ſ" (a 'long s') do you want to match "s" (and vice versa)? These are regarded as "formatting distinctions", so you shouldn't replace the text the user enters with these forms (as you lose data), but you may want to ignore them when searching.
With regard to a case-insensitive comparison, it's not enough to simply make both strings lowercase and compare them. It will work for English, but there are languages where the mapping between lower and uppercase (if such a distinction even exists) is not so clear. The W3C wiki has a nice summary of these "case folding" issues. Unfortunately, you can't optimise this in your storage by keeping the data in one "case"; you can only do a proper comparison when you know both strings and the locale.
Luckily, when working with an NSString, its -compare:options:range:locale: lets you specify an NSCaseInsensitiveSearch option and the locale (if you know it), which will handle these case folding problems for you (also take a look at NSDiacriticInsensitiveSearch and NSWidthInsensitiveSearch to see if you want to be agnostic about those differences too).
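As a rough illustration of that call, the following sketch wraps it in a hypothetical helper named StringsMatchForSearch. Whether to include the diacritic and width options is an assumption here; drop them if those differences should count.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: YES when the two strings are equal, ignoring case,
// diacritics and character width, according to the given locale.
static BOOL StringsMatchForSearch(NSString *a, NSString *b, NSLocale *locale) {
    NSStringCompareOptions options = NSCaseInsensitiveSearch
                                   | NSDiacriticInsensitiveSearch
                                   | NSWidthInsensitiveSearch;
    // The range refers to the receiver; here the whole of `a` is compared with `b`.
    NSComparisonResult result = [a compare:b
                                   options:options
                                     range:NSMakeRange(0, a.length)
                                    locale:locale];
    return result == NSOrderedSame;
}
```

A call might look like StringsMatchForSearch(storedName, query, [NSLocale currentLocale]), where storedName and query stand in for whatever strings are being compared.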
What I currently plan to do is override the setter for the field, like so:
- (void)setName:(NSString *)value
{
    [self willChangeValueForKey:@"name"];
    [self setPrimitiveValue:value forKey:@"name"];
    [self didChangeValueForKey:@"name"];
    // Store a normalized form for searching
    [self willChangeValueForKey:@"searchName"];
    [self setPrimitiveValue:[[value lowercaseStringWithLocale:[NSLocale currentLocale]] decomposedStringWithCompatibilityMapping] forKey:@"searchName"];
    [self didChangeValueForKey:@"searchName"];
}
I also made the searchName property read-only.

Change Url using Regex

I have url, for example:
http://i.myhost.com/myimage.jpg
I want to change this url to
http://i.myhost.com/myimageD.jpg.
(Add D after image name and before point)
i.e I want add some words after image name and before point using regex.
What is the best way do it using regex?
Try using ^(.*)\.([a-zA-Z]{3,5}) and replacing with \1D.\2 (the dot sits outside both capture groups, so the replacement has to put it back). I'm assuming the extension is 3-5 letters, but you can modify it to suit. E.g. if it's just jpg images, you can put jpg in place of [a-zA-Z]{3,5}.
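With NSRegularExpression, the same idea might look like the sketch below. The function name AddDBeforeExtension is made up for illustration, a trailing $ anchor has been added on the assumption that the string is a single URL, and note that replacement templates in NSRegularExpression use $1 rather than \1.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: inserts "D" between the file name and its extension.
static NSString *AddDBeforeExtension(NSString *url) {
    // Capture 1 = everything before the last dot, capture 2 = a 3-5 letter extension.
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"^(.*)\\.([a-zA-Z]{3,5})$"
                                                  options:0
                                                    error:NULL];
    // "$1" and "$2" refer to the capture groups; "D." is literal text.
    return [regex stringByReplacingMatchesInString:url
                                           options:0
                                             range:NSMakeRange(0, url.length)
                                      withTemplate:@"$1D.$2"];
}
```

If the string does not end in a dot followed by 3-5 letters, nothing matches and the input is returned unchanged.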
Sounds like a homework question given the solution must use a regex, on that assumption here is an outline to get you going.
If all you have is a URL then @mathematical.coffee's solution will suit. However, if you have a chunk of text containing one or more URLs and you have to locate and change just those, then you'll need something a little more involved.
Look at the structure of a URL: {protocol}{address}{item}; where
{protocol} is "http://", "ftp://" etc.;
{address} is a name, e.g. "www.google.com", or a number, e.g. "74.125.237.116" - there will always be at least one dot in the address; and
{item} is "/name" where name is quite flexible - there will be zero or more items, you can think of them as directories and a file but this isn't strictly true. Also the sequence of items can end in a "/" (including when there are zero of them).
To make a regex which matches a URL start by matching each part. In the case of the items you'll want to match the last in the sequence separately - you'll have zero or more "directories" and one "file", the latter must be of the form "name.extension".
Once you have regexes for each part you just concatenate them to produce a regex for the whole. To form the replacement pattern you can surround parts of your regex with parentheses and refer to those parts using \number in the replacement string - see @mathematical.coffee's solution for an example.
The best way to learn regexes is to use an editor which supports them and just experiment. The exact syntax may not be the same as NSRegularExpression's, but they are mostly pretty similar for the basic stuff and you can translate from one to another easily.

Objective-C RegexKitLite match one string or another

I'm trying to use regexkitlite for string matching in objective-c and I'm having some problems with it. What I'm trying to do is search a large string for substrings matching:
"http://[something].jpg"
"http://[something].png"
Basically, I want to find all links to images from the original string. What I have currently is:
NSString *regexString = @"http://[a-zA-Z0-9._%+-/]+\.jpg";
Now this is working for .jpg images, but of course it doesn't match .png images. I would really like to use one regexString that would match either, but I can't figure out how.
Reading some regex tutorials for other languages, I think it is something along the lines of:
NSString *regexString = @"http://[a-zA-Z0-9._%+-/]+\.(?:jpg|png)";
But I can't quite get it right.
Any help would be greatly appreciated.
You don't need a non-capturing group around the file extensions. It's good practice to use them, but it could be causing an error here. (Does the library support it?)
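Also, I simplified your regex slightly by using a predefined character class. Note that each backslash has to be doubled inside an Objective-C string literal, otherwise \w and \. never reach the regex engine.
NSString *regexString = @"http://[\\w.%+-/]+\\.(jpg|png)";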
You can see this in action here.
You can also add any file extensions that you want. Ex: (jpg|png|gif|...).
Updated: Apple now includes regular expression support with NSRegularExpression, which is available in OS X v10.7 and later.
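For reference, a minimal sketch of the same search using NSRegularExpression instead of RegexKitLite could look like this. The function name ImageURLsInString is hypothetical, and the hyphen has been moved to the end of the character class so that it is read literally rather than as part of a range.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: returns every http:// link in the text that ends in .jpg or .png.
static NSArray *ImageURLsInString(NSString *text) {
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"http://[\\w.%+/-]+\\.(?:jpg|png)"
                                                  options:0
                                                    error:NULL];
    NSMutableArray *urls = [NSMutableArray array];
    NSArray *matches = [regex matchesInString:text options:0 range:NSMakeRange(0, text.length)];
    for (NSTextCheckingResult *match in matches) {
        [urls addObject:[text substringWithRange:match.range]];
    }
    return urls;
}
```

Passing NSRegularExpressionCaseInsensitive as the options value would also pick up extensions such as .JPG, if that matters for the input.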

Add spaces between words in spaceless string

I'm on OS X, and in objective-c I'm trying to convert
for example,
"Bobateagreenapple"
into
"Bob ate a green apple"
Is there any way to do this efficiently? Would something involving a spell checker work?
EDIT: Just some extra information:
I'm attempting to build something that takes some misformatted text (for example, text copy-pasted from old PDFs that ends up without spaces, especially from internet archives like JSTOR). Since the misformatted text is probably going to be long... well, I'm just trying to figure out whether this is feasible before I actually attempt to write a system, only to find out it takes 2 hours to fix a paragraph of text.
One possibility, which I will describe in a non-OS-specific manner, is to perform a search through all the possible words that make up the collection of letters.
Basically you chop off the first letter of your letter collection and add it to the current word you are forming. If it makes a word (e.g. according to a dictionary lookup) then add it to the current sentence. If you manage to use up all the letters in your collection and form words out of all of them, then you have a full sentence. But you don't have to stop here. Instead, you keep running, and eventually you will produce all possible sentences.
Pseudo-code would look something like this:
FindWords(vector<Sentence> &sentences, Sentence s, Word w, Letters l)
{
    if (l.empty() and w.empty())
    {
        add s to sentences;
        return;
    }
    if (l.empty())
        return;
    move first letter from l to end of w;
    if w in dictionary
    {
        add w to s;
        FindWords(sentences, s, empty word, l);
        remove w from s;
    }
    FindWords(sentences, s, w, l);
    put last letter from w back onto front of l;
}
There are, of course, a number of optimizations you could perform to make it go fast, for instance checking whether the current word is the start of any word in the dictionary and abandoning that branch if it is not. But this is the basic approach that will give you all possible sentences.
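In Objective-C, the search described above could be sketched roughly as follows. This is an illustration rather than the answer's own code: it tries every prefix length at the current position instead of moving one letter at a time (the same search, expressed as a loop), the names CollectSentences and AllSentences are made up, and the dictionary is assumed to be an NSSet of lowercase words supplied by the caller.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: extends the current word list with every dictionary word
// that starts at `start`, recursing until the letters are used up.
static void CollectSentences(NSString *letters, NSUInteger start, NSSet *dictionary,
                             NSMutableArray *words, NSMutableArray *sentences) {
    if (start == letters.length) {
        // Every letter has been consumed, so the current words form a full sentence.
        [sentences addObject:[words componentsJoinedByString:@" "]];
        return;
    }
    for (NSUInteger end = start + 1; end <= letters.length; end++) {
        NSString *piece = [letters substringWithRange:NSMakeRange(start, end - start)];
        if ([dictionary containsObject:[piece lowercaseString]]) {
            [words addObject:piece];
            CollectSentences(letters, end, dictionary, words, sentences);
            [words removeLastObject]; // backtrack and try a longer piece
        }
    }
}

// Hypothetical entry point: returns all ways to split `letters` into dictionary words.
static NSArray *AllSentences(NSString *letters, NSSet *dictionary) {
    NSMutableArray *sentences = [NSMutableArray array];
    CollectSentences(letters, 0, dictionary, [NSMutableArray array], sentences);
    return sentences;
}
```

With a dictionary containing only bob, ate, a, tea, green and apple, the input "Bobateagreenapple" produces both "Bob a tea green apple" and "Bob ate a green apple", which shows why some ranking of the candidates is still needed afterwards. The running time can grow exponentially on long inputs, so for whole paragraphs some pruning or memoization would probably be required.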
Solving this problem is much harder than anything you'll find in a framework. Notice that even in your example, there are other "solutions": "Bob a tea green apple," for one.
A very naive (and not very functional) approach might be to use a spell-checker to try to isolate one "real word" at a time in the string; of course, in this example, that would only work because "Bob" happens to be an English word.
This is not to say that there is no way to accomplish what you want, but the way you phrase this question indicates to me that it might be a lot more complicated than what you're expecting. Maybe someone can give you an acceptable solution, but I bet they'll need to know a lot more about what exactly you're trying to do.
Edit: in response to your edit, it would probably take less effort to run some kind of OCR tool on a PDF and correct its output than it would just to correct what this system might give you, let alone program it.
I implemented a solution; the code is available on CodeProject:
http://www.codeproject.com/Tips/704003/How-to-add-spaces-between-spaceless-strings
My idea was to prioritize results that use up most of the characters (preferably all of them), then favor the ones with the longest words, because 2-, 3- or 4-character words can often come up by chance from left-over characters. Most of the time this provides the correct solution.
To find all possible permutations I used recursion. The code is quite fast even with big dictionaries (tested with 50 000 words).