Input character validation using word validation regular expression - objective-c

Let's say I have a regular expression that validates an input value as a whole. For example, for an email input box, when the user hits enter I check the value against ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$ to see if it is a valid email address.
What I want to achieve is to intercept character input as well, and check every single typed character to see whether it is a valid character. I can do this by adding an extra regular expression, e.g. [A-Z0-9._%+-], but that is not what I want.
Is there a way to extract the widest possible range of acceptable characters from a given regular expression? So in the example above, can I programmatically extract all the valid characters that are defined by the original regular expression (i.e. ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$)?
I would appreciate any help or hint.
P.S. This is project for iOS written in Objective-C.

If you don't mind writing half a regex parser, certainly. You would have to be able to distinguish literals from meta-characters and to unroll/merge all character classes (including negated character classes, and nested negated character classes, if your regex flavor supports them).
Unless NSRegularExpression comes with some convenience method for this, I cannot imagine how it would be possible otherwise. Just think about ^. When it is outside of a character class, it's a meta-character that you can ignore. If it is inside a character class, it's a meta-character that negates the class, but only if it is the first character; anywhere else in the class it is a literal. - is a meta-character inside character classes, unless it is the first character, the last character, or right after another character range (depending on regex flavor), in which case it is a literal. And I'm not even speaking about escaped characters.
I don't know about NSRegularExpression, but some flavors also support nested character classes (like Java's [a-z&&[^aeiou]] for all consonants). I think you get where I am going with this.
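To give a feel for what "half a regex parser" means, here is a minimal sketch in Python (chosen for brevity; the logic is not tied to any regex engine). The function name allowed_chars is made up for this illustration. It assumes the pattern contains only literals, backslash-escaped literals, simple quantifiers, and non-negated character classes without escapes inside them, and it raises an error for anything else rather than guessing.

```python
def allowed_chars(pattern):
    """Collect the characters a very simple regex can match.

    Hypothetical helper: supports literals, escaped literals such as \\.,
    quantifiers, and non-negated classes with ranges. Escapes inside
    classes, negation, and the dot are deliberately not supported.
    """
    allowed = set()
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":                      # escaped literal, e.g. \.
            allowed.add(pattern[i + 1])
            i += 2
        elif ch == "[":                     # character class
            end = pattern.index("]", i + 1)
            body = pattern[i + 1:end]
            if body.startswith("^"):
                raise ValueError("negated classes are not supported")
            j = 0
            while j < len(body):
                if j + 2 < len(body) and body[j + 1] == "-":
                    lo, hi = ord(body[j]), ord(body[j + 2])
                    allowed.update(chr(c) for c in range(lo, hi + 1))
                    j += 3
                else:                       # single member, incl. a trailing '-'
                    allowed.add(body[j])
                    j += 1
            i = end + 1
        elif ch == "{":                     # skip quantifiers like {2,4}
            i = pattern.index("}", i) + 1
        elif ch == ".":
            raise ValueError("'.' matches any character; not supported")
        elif ch in "^$+*?()|":              # meta-characters carry no literal
            i += 1
        else:                               # plain literal such as @
            allowed.add(ch)
            i += 1
    return allowed


email = r"^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$"
print("".join(sorted(allowed_chars(email))))
```

Even this toy version needs special cases for ^ and -, which is the point made above: a general solution has to understand the full syntax of the regex flavor in use.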

Related

ANTLR lexer pattern [\p{Emoji}]+ is matching numbers

The ANTLR4 lexer pattern [\p{Emoji}]+ is matching numbers. See screenshot. Note that it correctly rejects alpha chars. Is there an issue with the pattern?
\p{Emoji} matches everything that has the Unicode Emoji property. Numbers do have that property, so \p{Emoji} is correct in matching them. Why though?
The Unicode standard defines any codepoint to have the Emoji property if it can appear as part of an emoji. Digits can appear as parts of emojis (for example the keycap emojis, which consist of the digit, followed by a variation selector, followed by a combining enclosing keycap), so they have that property.
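As a small illustration of how a digit ends up inside an emoji, the following Python sketch takes apart the keycap sequence for 1 using only the standard unicodedata module. It shows the structure of the sequence; it does not query the Emoji property itself, which the standard library does not expose as far as I know.

```python
import unicodedata

# The keycap emoji for "1": the digit, then VS16, then the enclosing keycap.
keycap_one = "1\uFE0F\u20E3"

names = [unicodedata.name(ch) for ch in keycap_one]
for ch, name in zip(keycap_one, names):
    print(f"U+{ord(ch):04X} {name}")
```

The first codepoint is the plain ASCII digit, which is why a pattern built on \p{Emoji} alone accepts digits.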
If you only want to match codepoints that are emojis by themselves, you can just use the Emoji_Presentation property instead. This will fail to match combined emojis though.
If you want to match any sequence that creates an emoji, I think you'll want to match something like "Emoji_Presentation, followed by zero or more of '(Join_Control or Variation_Selector) followed by Emoji'" (here you want Emoji instead of Emoji_Presentation because that's where numbers are allowed).
However, for the purpose of allowing emojis in identifiers (as opposed to a lexer rule to match emojis and nothing else), you don't actually have to worry about whether a number is part of an emoji or not, just that it doesn't appear as the first character of the identifier. So you could simply define your fragment for the starting character to only include Emoji_Presentation and then the fragment for continuing characters to include Emoji as well as Join_Control and Variation_Selector.
So something like this would work:
fragment IdStart
: [_\p{Alpha}\p{General_Category=Other_Letter}\p{Emoji_Presentation}]
;
fragment IdContinue
: IdStart
// The `\p{Number}` might be redundant, I'm not sure. I don't know
// whether there are any (non-ascii) numeric codepoints that don't
// also have the `Emoji` property.
| [\p{Number}\p{Emoji}\p{Join_Control}\p{Variation_Selector}]
;
Identifier: IdStart IdContinue*;
Of course that's assuming you actually want to allow characters besides emojis. The definition in your question only included emojis (or was meant to anyway), but since it was called Identifier, I'm assuming you just removed the other allowed categories to simplify it.
Looking at the code that seems to define emoji code points:
UnicodeSet emojiRKUnicodeSet = new UnicodeSet("[\\p{GCB=Regional_Indicator}\\*#0-9\\u00a9\\u00ae\\u2122\\u3030\\u303d]");
it looks to be including digits (why, I don't know; check out sepp2k's excellent explanation). You can always raise an issue if you think something is wrong.
You could also just use a character class like this instead:
Identifier
: [\u00a9\u00ae\u2000-\u3300\ud83c\ud000-\udfff\ud83d\ud000-\udfff\ud83e\ud000-\udfff]+
;

Usage of Regular Expression Extractor JMeter?

Using the Regular Expression Extractor in JMeter, I need to get the value of "fullBkupUNIXTime" from the response below:
{"fullBackupTimeString":["Mon 10 Apr 2017 14:14:36"],"fullBkupUNIXTime":["1491833676"],"fullBackupDirName":["10_04_2017_0636"]}
I tried with Ref Name as time and
Regular Expression: "fullBkupUNIXTime": "([0-9])" and "(.+?)"
and then passed the result as input to the second request as ${time}
Neither of the two expressions works for me.
Please help me out.
First of all: why not just use this thing?
Then, if you are determined to go the regular expression route:
The first expression is not going to work because you've defined it to match exactly one [0-9] character.
Add the appropriate repetition character, like "fullBkupUNIXTime": "([0-9]+)".
It also makes sense to tell the engine to stop at the first, narrowest match: "fullBkupUNIXTime": "([0-9]+?)"
Next, make sure you handle any whitespace between the key, the colon, and the value. It is better to match it explicitly, if there is any, with \s
And last but not least: make sure you handle multiple lines properly (if appropriate, of course). Add the (?m) modifier to your expression.
Or use (?im) to make it case-insensitive as well.
[ is a reserved character in regex, so you need to escape it. In your case use:
Regular Expression fullBkupUNIXTime":\["(\d+)
Template: $1$
Match No.: 1
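To see why the escaped bracket matters, here is a quick check of both expressions against the response from the question. It uses Python's re module purely for illustration; JMeter uses its own regex engine, but \d, \s and the escaped [ are assumed to behave the same way for this pattern.

```python
import re

response = ('{"fullBackupTimeString":["Mon 10 Apr 2017 14:14:36"],'
            '"fullBkupUNIXTime":["1491833676"],'
            '"fullBackupDirName":["10_04_2017_0636"]}')

# Original attempt: expects a space after the colon, no '[', and one digit.
attempt = re.search(r'"fullBkupUNIXTime": "([0-9])"', response)

# Escaped '[', optional whitespace around the colon, one or more digits.
fixed = re.search(r'"fullBkupUNIXTime"\s*:\s*\["(\d+)"', response)

print(attempt)          # None, because the literal text does not line up
print(fixed.group(1))   # 1491833676
```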

Replacing first and last character of every word using REGEXP_REPLACE

My question is somewhat specific. I'm not using any kind of code compiler to achieve the result in the title; I am using an IRC client that allows the use of "Quirks" so the users can have specific mannerisms when chatting, like starting every word with an uppercase letter, or changing every "s" into a "2".
The problem is that I can't see the whole code, and since I'm not familiar with REGEXP_REPLACE, that makes things harder to learn.
The client simplifies the whole coding process; here's a screenshot of the interface.
Filling the text boxes with "^(\w)" and "upper(\1)" respectively makes the first character capitalized; "(\w)$" and "upper(\1)" does the same with the last character.
I've discovered that "\b(\w)" will uppercase the first character of every word. I've tried "\b(\w)%" for the last character but it didn't work, probably because of some syntax error.
So, how do I get every last character capitalized?
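One pattern worth trying, sketched here in Python rather than in the client's own quirk syntax (which I cannot verify), is to mirror the first-letter pattern by putting the word boundary after the captured character: (\w)\b. The helper names below are invented for this example, and the lambda stands in for the client's upper(\1) replacement.

```python
import re

def upper_first(text):
    # \b(\w): a word character right after a word boundary
    return re.sub(r"\b(\w)", lambda m: m.group(1).upper(), text)

def upper_last(text):
    # (\w)\b: a word character right before a word boundary
    return re.sub(r"(\w)\b", lambda m: m.group(1).upper(), text)

print(upper_first("hello big world"))  # Hello Big World
print(upper_last("hello big world"))   # hellO biG worlD
```

Whether the client accepts \b at the end of a pattern depends on the regex engine it uses, so treat this as something to test rather than a guaranteed answer.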

User input text translation

I'm working on a translator that will take English language text (as user input into a UITextView) and (with a button press) replace specific words with alternatives. I have both the English words in scope plus their alternatives in separate Arrays (englishArray and alternativeArray), indexed correspondingly.
My challenge is finding an algorithm that will allow me to identify a word in the input text (a UITextView) while ignoring characters like <",.()>, look up the word in englishArray (case insensitive), locate the corresponding word in alternativeArray, and then use that word in place of the original, writing it back to the UITextView.
Any help greatly appreciated.
NB. I have created a Category extending the NSArray functionality with an indexOfCaseInsensitiveString method that ignores case when doing an indexOfObject type lookup, if that helps.
Tony.
I think that using an NSScanner would be best to parse the string into separate words which you could then pass to your indexOfCaseInsensitiveString method. scanCharactersFromSet:intoString: using a set of all the characters you want to ignore, including whitespace and newline characters should get you to the start of a word, and then you could use scanUpToCharactersFromSet:intoString: using the same set to scan to the end of the word. Using scanLocation at the beginning and end of each scan should allow you to get the range of that word, so if you find a match in your array, you will know where in your string to make the replacement.
Thanks for your suggestion. It's working with one exception.
I want to capture all punctuation so I can recreate the original input but with the substituted words. Even though I have a 'space' in my Character Set, the scanner is not putting the spaces into the 'intoString'. Other characters I specify in the Character Set such as '(' and ';' are represented in the 'intoString'.
Net is that when I recreate the input, it's perfect except that I get individual words running into each other.
UPDATE: I fixed that issue by including:
[theScanner setCharactersToBeSkipped:nil];
Thanks again.
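For comparison, the same idea (find each word, look it up ignoring case, and leave all punctuation and spacing untouched) can be sketched with a regular expression substitution instead of a scanner. This is a different technique from the NSScanner approach above, written in Python for brevity; the word lists and the translate function are placeholders invented for the example.

```python
import re

# Placeholder data standing in for englishArray and alternativeArray.
english = ["hello", "world"]
alternative = ["howdy", "planet"]
lookup = dict(zip(english, alternative))

def translate(text):
    def swap(match):
        word = match.group(0)
        # Case-insensitive lookup; unknown words are returned unchanged.
        return lookup.get(word.lower(), word)
    # Only runs of letters are treated as words, so punctuation and
    # whitespace between them are copied through as-is.
    return re.sub(r"[A-Za-z]+", swap, text)

print(translate('Hello, (world)! "Nothing" else changes.'))
```

Because only the matched words are replaced, there is no need to capture and re-emit the separators, which sidesteps the skipped-whitespace problem described above.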

Is it possible to ignore characters in a string when matching with a regular expression

I'd like to create a regular expression such that when I compare a string against an array of strings, matches are returned with the regex ignoring certain characters.
Here's one example. Consider the following array of names:
{
"Andy O'Brien",
"Bob O'Brian",
"Jim OBrien",
"Larry Oberlin"
}
If a user enters "ob", I'd like the app to apply a regex predicate to the array and all of the names in the above array would match (e.g. the ' is ignored).
I know I can run the match twice, first against each name and second against each name with the ignored chars stripped from the string. I'd rather this be done by a single regex so I don't need two passes.
Is this possible? This is for an iOS app and I'm using NSPredicate.
EDIT: clarification on use
From the initial answers I realized I wasn't clear. The example above is a specific one. I need a general solution where the array of names is a large array with diverse names and the string I am matching against is entered by the user. So I can't hard code the regex like [o]'?[b].
Also, I know how to do case-insensitive searches so don't need the answer to focus on that. Just need a solution to ignore the chars I don't want to match against.
Since you have discarded all the answers showing the ways it can be done, you are left with the answer:
NO, this cannot be done. Regex does not have an option to 'ignore' characters. Your only options are to modify the regex to match them, or to do a pass on your source text to get rid of the characters you want to ignore and then match against that. (Of course, then you may have the problem of correlating your 'cleaned' text with the actual source text.)
If I understand correctly, you want a way to match the characters "ob" 1) regardless of capitalization, and 2) regardless of whether there is an apostrophe in between them. That should be easy enough.
1) Use a case-insensitivity modifier, or use a regexp that specifies that the capital and lowercase version of the letter are both acceptable: [Oo][Bb]
2) Use the ? modifier to indicate that a character may be present either one or zero times. o'?b will match both "o'b" and "ob". If you want to include other characters that may or may not be present, you can group them with the apostrophe in a character class. For example, o['~-]?b will match "ob", "o'b", "o-b", and "o~b" (the hyphen goes last so that it is not read as a range).
So the complete answer would be [Oo]'?[Bb].
Update: The OP asked for a solution that would cause the given character to be ignored in an arbitrary search string. You can do this by inserting '? after every character of the search string. For example, if you were given the search string oleary, you'd transform it into o'?l'?e'?a'?r'?y'?. Foolproof, though probably not optimal for performance. Note that this would match "o'leary" but also "o'lea'r'y'" if that's a concern.
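The transformation described in the update can be sketched as follows. This is Python for illustration only, and the function name pattern_ignoring is made up; the resulting pattern string is what you would hand to the regex predicate. Each character of the user's input is escaped so that input such as "." or "(" is matched literally.

```python
import re

def pattern_ignoring(search, ignored="'"):
    # Optional class of ignorable characters, placed after every character.
    optional = "[" + re.escape(ignored) + "]?"
    return "".join(re.escape(ch) + optional for ch in search)

names = ["Andy O'Brien", "Bob O'Brian", "Jim OBrien", "Larry Oberlin", "Tom Smith"]
pattern = pattern_ignoring("ob")
matches = [n for n in names if re.search(pattern, n, re.IGNORECASE)]
print(pattern)
print(matches)
```

Passing more characters in ignored (for example "'-") widens the set of characters that are skipped, at the cost of a longer pattern.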
In this particular case, just throw the set of characters into the middle of the regex as optional. This works specifically because you have only two characters in your match string, otherwise the regex might get a bit verbose. For example, match case-insensitive against:
o[']*b
You can add more characters to that character class in the middle to ignore them. Note that the * matches any number of characters (so O'''Brien will match) - for a single instance, change to ?:
o[']?b
You can make particular characters optional with a question mark, which means that it will match whether they're there or not, e.g:
/o\'?b/
That would match all of the above. Add .+ to either side to match all other characters, and a space to denote the start of the surname:
/.+? o\'?b.+/
And use the case-insensitivity modifier to make it match regardless of capitalisation.