NSPredicate, whitespaces in CoreData. How to trim in predicate? - objective-c

I have a CoreData/SQLite application in which I have "Parent Categories" and "Categories". I do not have control over the data, and some of the "Parent Categories" values have trailing whitespace.
I could use CONTAINS (or rather, it works with CONTAINS, but that is something I cannot use). For example, I have 2 entries, MEN and MENS. If I use CONTAINS, a search for MEN returns both records, so you can see how this would be an issue.
I can easily trim on my side, but the predicate will compare that with the database and will not match. So my question is: how can I account for whitespace in the predicate, if that is possible at all?
I have a category "MENS" which someone has selected in the application, and it is compared against "MENS " in the database.

I would trim the data prior to doing the lookup. You can do this easily using stringByTrimmingCharactersInSet. By doing it beforehand, you'll also avoid any performance hit. That could be expensive if you're doing a character based comparison with CONTAINS.
So, let's say your search string is "MEN".
Here's the way to strip out any dodgy characters:
NSString *trimmed = [@"MEN " stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceCharacterSet]];
There's also whitespaceAndNewlineCharacterSet, which does what it says on the tin.
Alternatively, it's easy to create your own custom character set of stuff you want to trim.
For that, have a look at:
NSCharacterSet Class Reference
and
Apple's String Programming Guide
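As a minimal sketch of the trimming described above: the helper names TrimmedCategory and TrimmedWithCustomSet are made up for illustration, and the custom set (spaces, asterisks and hyphens) is only an assumed example of "stuff you want to trim". Note that this cleans the search string on the application side; it does not change what is stored in the database.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: strips leading and trailing whitespace and newlines.
static NSString *TrimmedCategory(NSString *raw) {
    NSCharacterSet *set = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    return [raw stringByTrimmingCharactersInSet:set];
}

// Hypothetical helper: trims with a custom character set. The characters
// chosen here (space, '*', '-') are an assumption for the example.
static NSString *TrimmedWithCustomSet(NSString *raw) {
    NSCharacterSet *custom = [NSCharacterSet characterSetWithCharactersInString:@" *-"];
    return [raw stringByTrimmingCharactersInSet:custom];
}
```

For instance, TrimmedCategory(@"MENS \n") and TrimmedWithCustomSet(@"*- MENS -*") should both give @"MENS". Only the ends of the string are trimmed; characters in the middle are left alone.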

Related

How to properly convert to a canonical string for searching in Cocoa?

I have a string field that I know that users will want to search on later. Inspired by the WWDC 2012 Core Data Best Practices session I plan to store a normalized version of the string into a separate field so I can optimize my search predicates.
My primary concern is case insensitivity, but while I'm normalizing strings I figure that I should also normalize the unicode representation. But I want to be sure I use the right normalization form (i.e. C,D,KC or KD). And does it matter whether I convert to lowercase first? (Localization is not my strong suit.)
So:
What are the proper methods to call to do the search normalization of the NSString?
What would be the optimal way to make sure the normalized version is stored.
I will post my first attempt as an answer, but I'd love to hear where I am wrong, other suggestions, or improvements. (Unfortunately while they showed the search predicates in that video, I don't think they showed the code from the session.)
For the use case you describe, it doesn't matter whether you pick precomposed or decomposed (C or D; although you will save a bit of space with precomposed), but think carefully about whether you want canonical or compatibility (K forms). TR15 has a nice figure that summarises the differences (Figure 6):
That is: if someone searches for "ſ" (a 'long s') do you want to match "s" (and vice versa)? These are regarded as "formatting distinctions", so you shouldn't replace the text the user enters with these forms (as you lose data), but you may want to ignore them when searching.
With regard to a case-insensitive comparison, it's not enough to simply make both strings lowercase and compare them. It will work for English, but there are languages where the mapping between lower and uppercase (if such a distinction even exists) is not so clear. The W3C wiki has a nice summary of these "case folding" issues. Unfortunately, you can't optimise this in your storage by keeping the data in one "case"; you can only do a proper comparison when you know both strings and the locale.
Luckily, when working with an NSString, its -compare:options:range:locale: method lets you specify an NSCaseInsensitiveSearch option and the locale (if you know it), which will handle these case folding problems for you (also take a look at NSDiacriticInsensitiveSearch and NSWidthInsensitiveSearch to see if you want to be agnostic about those differences too).
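A rough sketch of such a comparison follows. The function name MatchesLoosely is hypothetical, and combining all three options is an assumption about what a search might want; drop any option whose distinction you need to keep.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: YES when the two strings are equal once case,
// diacritics and character width are ignored for the user's current locale.
static BOOL MatchesLoosely(NSString *stored, NSString *query) {
    NSStringCompareOptions options = NSCaseInsensitiveSearch
                                   | NSDiacriticInsensitiveSearch
                                   | NSWidthInsensitiveSearch;
    // The range refers to the receiver; here the whole string is compared.
    NSComparisonResult result = [stored compare:query
                                        options:options
                                          range:NSMakeRange(0, stored.length)
                                         locale:[NSLocale currentLocale]];
    return result == NSOrderedSame;
}
```

This compares in memory, so it suits filtering already-fetched objects rather than replacing a stored, pre-normalized search field.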
What I currently plan to do is override the setter for the field, like so:
- (void)setName:(NSString *)value
{
    [self willChangeValueForKey:@"name"];
    [self setPrimitiveValue:value forKey:@"name"];
    [self didChangeValueForKey:@"name"];
    // Store normalized form for searching
    [self willChangeValueForKey:@"searchName"];
    [self setPrimitiveValue:[[value lowercaseStringWithLocale:[NSLocale currentLocale]] decomposedStringWithCompatibilityMapping] forKey:@"searchName"];
    [self didChangeValueForKey:@"searchName"];
}
I also made the searchName property read-only.

User input text translation

I'm working on a translator that will take English language text (as user input into a UITextView) and (with a button press) replace specific words with alternatives. I have both the English words in scope plus their alternatives in separate Arrays (englishArray and alternativeArray), indexed correspondingly.
My challenge is finding an algorithm that will allow me to identify a word in the input text (a UITextView) ignoring characters like <",.()>, lookup the word in englishArray (case insensitive), locate the corresponding word in alternativeArray and then use that word in place of the original - writing it back to the UITextView.
Any help greatly appreciated.
NB. I have created a Category extending the NSArray functionality with an indexOfCaseInsensitiveString method that ignores case when doing an indexOfObject type lookup, if that helps.
Tony.
I think that using an NSScanner would be best to parse the string into separate words which you could then pass to your indexOfCaseInsensitiveString method. scanCharactersFromSet:intoString: using a set of all the characters you want to ignore, including whitespace and newline characters should get you to the start of a word, and then you could use scanUpToCharactersFromSet:intoString: using the same set to scan to the end of the word. Using scanLocation at the beginning and end of each scan should allow you to get the range of that word, so if you find a match in your array, you will know where in your string to make the replacement.
Thanks for your suggestion. It's working with one exception.
I want to capture all punctuation so I can recreate the original input but with the substituted words. Even though I have a 'space' in my Character Set, the scanner is not putting the spaces into the 'intoString'. Other characters I specify in the Character Set such as '(' and ';' are represented in the 'intoString'.
Net is that when I recreate the input, it's perfect except that I get individual words running into each other.
UPDATE: I fixed that issue by including:
[theScanner setCharactersToBeSkipped:nil];
Thanks again.
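Putting the scanner approach and the charactersToBeSkipped fix together, the loop might look roughly like the sketch below. TranslateText is a made-up name, the list of punctuation characters is an assumption, and the case-insensitive lookup is done inline with indexOfObjectPassingTest: rather than with the asker's own category method, which is not shown in the thread.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: rebuilds `input`, swapping any word found in `english`
// (ignoring case) for the entry at the same index in `alternatives`.
static NSString *TranslateText(NSString *input,
                               NSArray<NSString *> *english,
                               NSArray<NSString *> *alternatives) {
    // Characters that separate words; the punctuation list is an assumption.
    NSMutableCharacterSet *separators = [NSMutableCharacterSet whitespaceAndNewlineCharacterSet];
    [separators addCharactersInString:@"\",.()<>;:!?"];

    NSScanner *scanner = [NSScanner scannerWithString:input];
    scanner.charactersToBeSkipped = nil; // keep spaces so the text can be rebuilt

    NSMutableString *output = [NSMutableString string];
    while (!scanner.isAtEnd) {
        NSString *chunk = nil;
        // Copy any run of separators through unchanged.
        if ([scanner scanCharactersFromSet:separators intoString:&chunk]) {
            [output appendString:chunk];
        }
        // Then read one word and replace it if it is in the English list.
        if ([scanner scanUpToCharactersFromSet:separators intoString:&chunk]) {
            NSString *word = chunk;
            NSUInteger index = [english indexOfObjectPassingTest:
                ^BOOL(NSString *candidate, NSUInteger idx, BOOL *stop) {
                    return [candidate caseInsensitiveCompare:word] == NSOrderedSame;
                }];
            [output appendString:(index == NSNotFound ? word : alternatives[index])];
        }
    }
    return output;
}
```

Because the output is rebuilt as the scan proceeds, there is no need to track scanLocation ranges; the result can be assigned back to the text view's text property in one step.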

Is it possible to ignore characters in a string when matching with a regular expression

I'd like to create a regular expression such that when I compare a string against an array of strings, matches are returned with the regex ignoring certain characters.
Here's one example. Consider the following array of names:
{
"Andy O'Brien",
"Bob O'Brian",
"Jim OBrien",
"Larry Oberlin"
}
If a user enters "ob", I'd like the app to apply a regex predicate to the array and all of the names in the above array would match (e.g. the ' is ignored).
I know I can run the match twice, first against each name and second against each name with the ignored chars stripped from the string. I'd rather this be done by a single regex so I don't need two passes.
Is this possible? This is for an iOS app and I'm using NSPredicate.
EDIT: clarification on use
From the initial answers I realized I wasn't clear. The example above is a specific one. I need a general solution where the array of names is a large array with diverse names and the string I am matching against is entered by the user. So I can't hard code the regex like [o]'?[b].
Also, I know how to do case-insensitive searches so don't need the answer to focus on that. Just need a solution to ignore the chars I don't want to match against.
Since you have discarded all the answers showing the ways it can be done, you are left with the answer:
NO, this cannot be done. Regex does not have an option to 'ignore' characters. Your only options are to modify the regex to match them, or to do a pass on your source text to get rid of the characters you want to ignore and then match against that. (Of course, then you may have the problem of correlating your 'cleaned' text with the actual source text.)
If I understand correctly, you want a way to match the characters "ob" 1) regardless of capitalization, and 2) regardless of whether there is an apostrophe in between them. That should be easy enough.
1) Use a case-insensitivity modifier, or use a regexp that specifies that the capital and lowercase version of the letter are both acceptable: [Oo][Bb]
2) Use the ? modifier to indicate that a character may be present either one or zero times. o'?b will match both "o'b" and "ob". If you want to include other characters that may or may not be present, you can group them with the apostrophe. For example, o['-~]?b will match "ob", "o'b", "o-b", and "o~b".
So the complete answer would be [Oo]'?[Bb].
Update: The OP asked for a solution that would cause the given character to be ignored in an arbitrary search string. You can do this by inserting '? after every character of the search string. For example, if you were given the search string oleary, you'd transform it into o'?l'?e'?a'?r'?y'?. Foolproof, though probably not optimal for performance. Note that this would match "o'leary" but also "o'lea'r'y'" if that's a concern.
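A sketch of that transformation for use with NSPredicate is shown below. Both function names are hypothetical. Two details go beyond the text above and are assumptions of this sketch: each character of the user's input is escaped with NSRegularExpression's escapedPatternForString: so that input like "." is matched literally, and the pattern is wrapped in .* on both sides because MATCHES has to match the whole string.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: turns "ob" into "o'?b'?" so that an apostrophe is
// tolerated after every character of the search string.
static NSString *PatternIgnoringApostrophes(NSString *search) {
    NSMutableString *pattern = [NSMutableString string];
    [search enumerateSubstringsInRange:NSMakeRange(0, search.length)
                               options:NSStringEnumerationByComposedCharacterSequences
                            usingBlock:^(NSString *substring, NSRange range,
                                         NSRange enclosingRange, BOOL *stop) {
        [pattern appendString:[NSRegularExpression escapedPatternForString:substring]];
        [pattern appendString:@"'?"];
    }];
    return pattern;
}

// Hypothetical helper: filters `names` case-insensitively with that pattern.
static NSArray<NSString *> *NamesMatching(NSArray<NSString *> *names, NSString *search) {
    NSString *regex = [NSString stringWithFormat:@".*%@.*", PatternIgnoringApostrophes(search)];
    NSPredicate *predicate = [NSPredicate predicateWithFormat:@"SELF MATCHES[c] %@", regex];
    return [names filteredArrayUsingPredicate:predicate];
}
```

With the four names from the question, a search for "ob" should keep all of them, while "obr" should drop "Larry Oberlin".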
In this particular case, just throw the set of characters into the middle of the regex as optional. This works specifically because you have only two characters in your match string, otherwise the regex might get a bit verbose. For example, match case-insensitive against:
o[']*b
You can add more characters to that character class in the middle to ignore them. Note that the * matches any number of characters (so O'''Brien will match) - for a single instance, change to ?:
o[']?b
You can make particular characters optional with a question mark, which means that it will match whether they're there or not, e.g:
/o\'?b/
Would match all of the above, add .+ to either side to match all other characters, and a space to denote the start of the surname:
/.+? o\'?b.+/
And use the case-insensitivity modifier to make it match regardless of capitalisation.

Remove & character from string objective c

How would I go about removing the "&" symbol from a string. It's making my xml parser fail.
I have tried
[currentParsedCharacterData setString: [currentParsedCharacterData stringByReplacingOccurrencesOfString:@"&" withString:@"and"]];
But it seems to have no effect
Really what this boils down to is you want to gracefully handle invalid XML. The XML Parser is properly telling you that this XML is invalid, and is thusly failing to parse. Assuming you have no control over this XML content, I would suggest pre-parsing it for common errors like this, the output of which would be a sanitized XML doc that has a better chance of success.
To sanitize the doc, it may be as simple as doing a search and replace. The problem with just doing a blanket replace on any & is that there are valid uses of &, for example &amp; or &copy;. You would end up munging the XML by creating something like this: andcopy;
You could search for "ampersand space" but that won't catch a string that has an ampersand as the last character (an out-case that might be easily handled). What you are really searching for are occurrences of & that are not followed by a ; or those of which where any type of whitespace is encountered before the following ; because the semi-colon is fine on its own.
If you need more power because you need to detect this, and other errors, I would suggest going to NSScanner or RegEx matching to search for occurrences of this and other common errors during your sanitization step. It is also very common for XML files to be rather large things, so you need to be careful when dealing with these as in-memory strings as this can easily lead to application crashes. Breaking it up into manageable chunks is something NSScanner can do very well.
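One way to sketch the regex variant of this sanitization step: replace only those ampersands that do not begin something shaped like an entity. The function name EscapeBareAmpersands is hypothetical, and the pattern only checks the shape of an entity (a name, a decimal reference or a hex reference followed by a semicolon), not whether the parser actually knows that entity. It also works on the whole string in memory, so the caveat above about large documents still applies.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: rewrites bare "&" as "&amp;" and leaves anything that
// already looks like an entity (&name; &#123; &#x1F;) untouched.
static NSString *EscapeBareAmpersands(NSString *xml) {
    // Negative lookahead: the "&" must not be followed by an entity body and ";".
    NSString *pattern = @"&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#[xX][0-9A-Fa-f]+;)";
    NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:pattern
                                                                           options:0
                                                                             error:NULL];
    return [regex stringByReplacingMatchesInString:xml
                                           options:0
                                             range:NSMakeRange(0, xml.length)
                                      withTemplate:@"&amp;"];
}
```

For example, "AT&T &copy; 2010" should become "AT&amp;T &copy; 2010", and an ampersand at the very end of the string is escaped as well.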
For a quick attempt look at stringByReplacingOccurrencesOfString on NSString
NSString *str = @"a & b";
// NSString is immutable, so keep the returned string
str = [str stringByReplacingOccurrencesOfString:@"&" withString:@"and"]; // better to replace with &amp;
However you should also deal with other characters i.e. < >

How to Parse Some Wiki Markup

Hey guys, given a data set in plain text such as the following:
==Events==
* [[312]] – [[Constantine the Great]] is said to have received his famous [[Battle of Milvian Bridge#Vision of Constantine|Vision of the Cross]].
* [[710]] – [[Saracen]] invasion of [[Sardinia]].
* [[939]] – [[Edmund I of England|Edmund I]] succeeds [[Athelstan of England|Athelstan]] as [[King of England]].
*[[1275]] – Traditional founding of the city of [[Amsterdam]].
*[[1524]] – [[Italian Wars]]: The French troops lay siege to [[Pavia]].
*[[1553]] – Condemned as a [[Heresy|heretic]], [[Michael Servetus]] is [[burned at the stake]] just outside [[Geneva]].
*[[1644]] – [[Second Battle of Newbury]] in the [[English Civil War]].
*[[1682]] – [[Philadelphia]], [[Pennsylvania]] is founded.
I would like to end up with an NSDictionary or other form of collection so that I can have the year (The Number on the left) mapping to the excerpt (The text on the right). So this is what the 'template' is like:
*[[YEAR]] – THE_TEXT
Though I would like the excerpt to be plain text, that is, no wiki markup so no [[ sets. Actually, this could prove difficult with alias links such as [[Edmund I of England|Edmund I]].
I am not all that experienced with regular expressions so I have a few questions. Should I first try to 'beautify' the data? For example, removing the first line which will always be ==Events==, and removing the [[ and ]] occurrences?
Or perhaps a better solution: Should I do this in passes? So for example, the first pass I can separate each line into * [[710]] and [[Saracen]] invasion of [[Sardinia]]. and store them into different NSArrays.
Then go through the first NSArray of years and only get the text within the [[]] (I say text and not number because it can be 530 BC), so * [[710]] becomes 710.
And then for the excerpt NSArray, go through and if an [[some_article|alias]] is found, make it only be [[alias]] somehow, and then remove all of the [[ and ]] sets?
Is this possible? Should I use regular expressions? Are there any ideas you can come up with for regular expressions that might help?
Thanks! I really appreciate it.
EDIT: Sorry for the confusion, but I only want to parse the above data. Assume that that's the only type of markup that I will encounter. I'm not necessarily looking forward to parsing wiki markup in general, unless there is already a pre-existing library which does this. Thanks again!
This code assumes you are using RegexKitLite:
NSString *data = @"* [[312]] – [[Constantine the Great]] is said to have received his famous [[Battle of Milvian Bridge#Vision of Constantine|Vision of the Cross]].\n\
* [[710]] – [[Saracen]] invasion of [[Sardinia]].\n\
* [[939]] – [[Edmund I of England|Edmund I]] succeeds [[Athelstan of England|Athelstan]] as [[King of England]].\n\
*[[1275]] – Traditional founding of the city of [[Amsterdam]].";
NSString *captureRegex = @"(?i)(?:\\* *\\[\\[)([0-9]*)(?:\\]\\] \\– )(.*)";
NSRange captureRange;
NSRange stringRange;
stringRange.location = 0;
stringRange.length = data.length;
do
{
    captureRange = [data rangeOfRegex:captureRegex inRange:stringRange];
    if ( captureRange.location != NSNotFound )
    {
        NSString *year = [data stringByMatching:captureRegex options:RKLNoOptions inRange:stringRange capture:1 error:NULL];
        NSString *textStuff = [data stringByMatching:captureRegex options:RKLNoOptions inRange:stringRange capture:2 error:NULL];
        stringRange.location = captureRange.location + captureRange.length;
        stringRange.length = data.length - stringRange.location;
        NSLog(@"Year:%@, Stuff:%@", year, textStuff);
    }
}
while ( captureRange.location != NSNotFound );
Note that you really need to study up on RegEx's to build these well, but here's what the one I have is saying:
(?i)
Ignore case, I could have left that out since I'm not matching letters.
(?:\* *\[\[)
?: means don't capture this block, I escape * to match it, then there are zero or more spaces (" *") then I escape out two brackets (since brackets are also special characters in a regex).
([0-9]*)
Grab anything that is a number.
(?:\]\] \– )
Here's where we ignore stuff again, basically matching "]] – ". Note that for any "\" in the regex, I have to add another one to it in the Objective-C string above, since "\" is a special character in a string... and yes, that means a single regex escape "\" ends up as "\\" in an Obj-C string.
(.*)
Just grab anything else, by default the RegEX engine will stop matching at the end of a line which is why it doesn't just match everything else. You'll have to add code to strip out the [[LINK]] stuff from the text.
The NSRange variables are used to keep matching through the file without re-matching original matches. So to speak.
Don't forget after you add the RegExKitLite class files, you also need to add the special linker flag or you'll get lots of link errors (the RegexKitLite site has installation instructions).
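For the remaining step of stripping the [[LINK]] markup, here is a sketch that uses Foundation's NSRegularExpression instead of RegexKitLite (a swapped-in technique, not the code above). ParseEvents is a hypothetical name, and two assumptions are baked in: the year is whatever text sits inside the first [[ ]] pair, so "530 BC" would also be captured, and a later entry with the same year overwrites an earlier one in the dictionary.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: maps each year to its excerpt with the link markup removed.
static NSDictionary<NSString *, NSString *> *ParseEvents(NSString *wikiText) {
    // One event per line: "*", optional spaces, [[YEAR]], an en dash (U+2013), the text.
    NSRegularExpression *lineRegex =
        [NSRegularExpression regularExpressionWithPattern:@"^\\*\\s*\\[\\[([^\\]]+)\\]\\]\\s*\\u2013\\s*(.*)$"
                                                  options:NSRegularExpressionAnchorsMatchLines
                                                    error:NULL];
    // [[article|alias]] becomes alias, and [[article]] becomes article.
    NSRegularExpression *linkRegex =
        [NSRegularExpression regularExpressionWithPattern:@"\\[\\[(?:[^\\]|]*\\|)?([^\\]]*)\\]\\]"
                                                  options:0
                                                    error:NULL];

    NSMutableDictionary<NSString *, NSString *> *events = [NSMutableDictionary dictionary];
    NSRange whole = NSMakeRange(0, wikiText.length);
    for (NSTextCheckingResult *match in [lineRegex matchesInString:wikiText options:0 range:whole]) {
        NSString *year = [wikiText substringWithRange:[match rangeAtIndex:1]];
        NSString *excerpt = [wikiText substringWithRange:[match rangeAtIndex:2]];
        excerpt = [linkRegex stringByReplacingMatchesInString:excerpt
                                                      options:0
                                                        range:NSMakeRange(0, excerpt.length)
                                                 withTemplate:@"$1"];
        events[year] = excerpt; // a repeated year overwrites the earlier entry
    }
    return events;
}
```

Lines that do not start with an asterisk, such as ==Events==, simply fail to match and are skipped, so no separate clean-up pass is needed for them.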
I'm no good with regular expressions, but this sounds like a job for them. I imagine a regex would sort this out for you quite easily.
Have a look at the RegexKitLite library.
If you want to be able to parse Wikitext in general, you have a lot of work to do. Just one complicating factor is templates. How much effort do you want to go to cope with these?
If you're serious about this, you probably should be looking for an existing library which parses Wikitext. A brief look round finds this CPAN library, but I have not used it, so I can't cite it as a personal recommendation.
Alternatively, you might want to take a simpler approach and decide which particular parts of Wikitext you're going to cope with. This might be, for example, links and headings, but not lists. Then you have to focus on each of these and turn the Wikitext into whatever you want that to look like. Yes, regular expressions will help a lot with this bit, so read up on them, and if you have specific problems, come back and ask.
Good luck!