ANTLR lexer patern [\p{Emoji}]+ is matching numbers - antlr

The ANTLR4 lexer pattern [\p{Emoji}]+ is matching numbers. See screenshot. Note that it correctly rejects alpha chars. Is there an issue with the pattern?

\p{Emoji} matches everything that has the Unicode Emoji property. Numbers do have that property, so \p{Emoji} is correct in matching them. Why though?
The Unicode standard defines any codepoint to have the Emoji property if it can appear as part of an Emoji. Numbers can appear as parts of emojis (for example I think shapes with numbers on them, which for them reason count as emojis, consist of a shape, followed by a join, followed by the number), so they have that property.
If you only want to match codepoints that are emojis by themselves, you can just use the Emoji_Presentation property instead. This will fail to match combined emojis though.
If you want to match any sequence that creates an emoji, I think you'll want to match something like "Emoji_Presentation, followed by zero or more of '(Join_Control or Variation_Selector) followed by Emoji'" (here you want Emoji instead of Emoji_Presentation because that's where numbers are allowed).
However, for the purpose of allowing emojis in identifiers (as opposed to a lexer rule to match emojis and nothing else), you don't actually have to worry about whether a number is part of an emoji or not, just that it doesn't appear as the first character of the identifier. So you could simply define your fragment for the starting character to only include Emoji_Presentation and then the fragment for continuing characters to include Emoji as well as Join_Control and Variation_Selector.
So something like this would work:
fragment IdStart
: [_\p{Alpha}\p{General_Category=Other_Letter}\p{Emoji_Presentation}]
;
fragment IdContinue
: IdStart
// The `\p{Number}` might be redundant, I'm not sure. I don't know
// whether there are any (non-ascii) numeric codepoints that don't
// also have the `Emoji` property.
| [\p{Number}\p{Emoji}\p{Join_Control}\p{Variation_Selector}]
;
Identifier: IdStart IdContinue*;
Of course that's assuming you actually want to allow characters besides emojis. The definition in your question only included emojis (or was meant to anyway), but since it was called Identifier, I'm assuming you just removed the other allowed categories to simplify it.

Looking at the code that seems to define emoji code points:
UnicodeSet emojiRKUnicodeSet = new UnicodeSet("[\\p{GCB=Regional_Indicator}\\*#0-9\\u00a9\\u00ae\\u2122\\u3030\\u303d]");
it looks to be including digits (why, I don't know, checkout sepp2k's excellent explanation). You can always raise an issue if you think something is wrong.
You could also just use a character class like this instead:
Identifier
: [\u00a9\u00ae\u2000-\u3300\ud83c\ud000-\udfff\ud83d\ud000-\udfff\ud83e\ud000-\udfff]+
;

Related

Regex matching sequence of characters

I have a test string such as: The Sun and the Moon together, forever
I want to be able to type a few characters or words and be able to match this string if the characters appear in the correct sequence together, even if there are missing words. For example, the following search word(s) should all match against this string:
The Moon
Sun tog
Tsmoon
The get ever
What regex pattern should I be using for this? I should add that the supplied test strings are going to be dynamic within an app, and so I'd like to be able to use a pattern based on the search string.
From your example Tsmoon you show partial words (T), ignoring case (s, m) and allow anything between each entered character. So as a first attempt you can:
Set the ignore case option
Between each chapter input insert the regular expression to match zero or more of anything. You can choose whether to match the shortest or longest run.
Try that, reading the documentation for NSRegularExpression if you're stuck, and see how it goes. If you get stuck ask a new question showing your code and the RE constructed and explain what happens/doesn't work as expected.
HTH

Input character validation using word validation regular expression

Let's say, I have a regular expression that checks the validation of the input value as a whole. For example, it is an email input box and when user hits enter, I check it against ^[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}$ to see if it is a valid email address.
What I want to achieve is, I want to intercept the character input too, and check every single input character to see if that character is also a valid character. I can do this by adding an extra regular expression, e.g. [A-Z0-9._%+-] but that is not what I want.
Is there a way to extract the widest possible range of acceptable characters from a given regular expression? So in the example above, can I extract all the valid characters that are defined by the original regular expression (i.e. ^[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}$) programmatically?
I would appreciate any help or hint.
P.S. This is project for iOS written in Objective-C.
If you don't mind writing half a regex parser, certainly. You would have to be able to distinguish literals from meta-characters and to unroll/merge all character classes (including negated character classes, and nested negated character classes, if you regex flavor supports them).
If NSRegularExpressions doesn't come with some convenience method, I cannot imagine how it would be possible otherwise. Just think about ^. When it is outside of a character class, it's a meta-character that you can ignore. If it is inside a character class, it's a meta-character, that negates the character class unless it is not the first character. - is a meta-character inside character classes, unless it is the first character, the last character, or right after another character range (depending on regex flavor). And I'm not even speaking about escaped characters.
I don't know about NSRegularExpressions, but some flavors also support nested character classes (like [a-z[^aeiou]] for all consonants). I think you get where I am going with this.

How to check if text in a UITextField matches a specific pattern?

I have a UITextField and I need to check, as the user types, if the text they have entered in the textfield so far matches a specific pattern.
More specifically, I need the text to match the pattern ####-##-## where # is any digit 0-9 and - is a dash (note that this is NOT a phone number or email). For example, the entry 1990-12-09 matches, 1990:12:09 does not match and 1990-12 DOES match because it has not yet violated the pattern (even though the text does not yet completely match the pattern).
How should I approach this? Ideally I would not have to hard code in a series of if statements .
The difficulty is that I want to check if the text in the textfield matches this pattern, as the user types. I don't want to just check it at the end.
I'm thinking that regular expressions are probably the way, but I'm not experienced enough at them to know if they hold the solution.
You could set the keyboard type to myTextField.keyboardType = UIKeyboardTypeNumberPad that why you are limiting the type of input.
Then this post should help answer your formatting question: UITextField format in xx-xx-xxx

RFC 6570 URL Templates : the role of / vs. other prefixes

I recently read some of : https://www.rfc-editor.org/rfc/rfc6570#section-1
And I found the following URL template examples :
GIVEN :
var="value";
x=1024;
path=/foo/bar;
{/var,x}/here /value/1024/here
{#path,x}/here #/foo/bar,1024/here
These seem contradictory.
In the first one, it appears that the / replaces ,
In the 2nd one, it appears that the , is kept .
Thus, I'm wondering wether there are inconsistencies in this particular RFC. I'm new to these RFC's so maybe I don't fully understand the culture behind how these develop.
There's no contradiction in those two examples. They illustrate the point that the rules for expanding an expression whose first character is / are different from the rules for expanding an expression whose first character is #. These alternative expansion rules are pretty much the entire point of having a variety of different magic leading characters -- which are called operators in the RFC.
The expression with the leading / is expanded according to a rule that says "each variable in the expression is replaced by its value, preceded by a / character". (I'm paraphrasing the real rule, which is described in section 3.2.6 of that RFC.) The expression with the leading # is expanded according to a rule that says "each variable in the expression is replaced by its value, with the first variable preceded by a # and subsequent variables preceded by a ,. (Again paraphrased, see section 3.2.4 for the real rule.)

Is it possible to ignore characters in a string when matching with a regular expression

I'd like to create a regular expression such that when I compare the a string against an array of strings, matches are returned with the regex ignoring certain characters.
Here's one example. Consider the following array of names:
{
"Andy O'Brien",
"Bob O'Brian",
"Jim OBrien",
"Larry Oberlin"
}
If a user enters "ob", I'd like the app to apply a regex predicate to the array and all of the names in the above array would match (e.g. the ' is ignored).
I know I can run the match twice, first against each name and second against each name with the ignored chars stripped from the string. I'd rather this by done by a single regex so I don't need two passes.
Is this possible? This is for an iOS app and I'm using NSPredicate.
EDIT: clarification on use
From the initial answers I realized I wasn't clear. The example above is a specific one. I need a general solution where the array of names is a large array with diverse names and the string I am matching against is entered by the user. So I can't hard code the regex like [o]'?[b].
Also, I know how to do case-insensitive searches so don't need the answer to focus on that. Just need a solution to ignore the chars I don't want to match against.
Since you have discarded all the answers showing the ways it can be done, you are left with the answer:
NO, this cannot be done. Regex does not have an option to 'ignore' characters. Your only options are to modify the regex to match them, or to do a pass on your source text to get rid of the characters you want to ignore and then match against that. (Of course, then you may have the problem of correlating your 'cleaned' text with the actual source text.)
If I understand correctly, you want a way to match the characters "ob" 1) regardless of capitalization, and 2) regardless of whether there is an apostrophe in between them. That should be easy enough.
1) Use a case-insensitivity modifier, or use a regexp that specifies that the capital and lowercase version of the letter are both acceptable: [Oo][Bb]
2) Use the ? modifier to indicate that a character may be present either one or zero times. o'?b will match both "o'b" and "ob". If you want to include other characters that may or may not be present, you can group them with the apostrophe. For example, o['-~]?b will match "ob", "o'b", "o-b", and "o~b".
So the complete answer would be [Oo]'?[Bb].
Update: The OP asked for a solution that would cause the given character to be ignored in an arbitrary search string. You can do this by inserting '? after every character of the search string. For example, if you were given the search string oleary, you'd transform it into o'?l'?e'?a'?r'?y'?. Foolproof, though probably not optimal for performance. Note that this would match "o'leary" but also "o'lea'r'y'" if that's a concern.
In this particular case, just throw the set of characters into the middle of the regex as optional. This works specifically because you have only two characters in your match string, otherwise the regex might get a bit verbose. For example, match case-insensitive against:
o[']*b
You can add more characters to that character class in the middle to ignore them. Note that the * matches any number of characters (so O'''Brien will match) - for a single instance, change to ?:
o[']?b
You can make particular characters optional with a question mark, which means that it will match whether they're there or not, e.g:
/o\'?b/
Would match all of the above, add .+ to either side to match all other characters, and a space to denote the start of the surname:
/.+? o\'?b.+/
And use the case-insensitivity modifier to make it match regardless of capitalisation.