RFC 6570 URI Templates: the role of / vs. other prefixes

I recently read some of : https://www.rfc-editor.org/rfc/rfc6570#section-1
And I found the following URL template examples :
GIVEN :
var="value";
x=1024;
path=/foo/bar;
{/var,x}/here    ->  /value/1024/here
{#path,x}/here   ->  #/foo/bar,1024/here
These seem contradictory.
In the first one, it appears that the / replaces the ,
In the second one, it appears that the , is kept.
Thus, I'm wondering whether there are inconsistencies in this particular RFC. I'm new to these RFCs, so maybe I don't fully understand the culture behind how they develop.

There's no contradiction in those two examples. They illustrate the point that the rules for expanding an expression whose first character is / are different from the rules for expanding an expression whose first character is #. These alternative expansion rules are pretty much the entire point of having a variety of different magic leading characters -- which are called operators in the RFC.
The expression with the leading / is expanded according to a rule that says "each variable in the expression is replaced by its value, preceded by a / character". (I'm paraphrasing the real rule, which is described in section 3.2.6 of that RFC.) The expression with the leading # is expanded according to a rule that says "each variable in the expression is replaced by its value, with the first variable preceded by a # and subsequent variables preceded by a ,". (Again paraphrased; see section 3.2.4 for the real rule.)
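To make the two rules concrete, here is a minimal Python sketch of just these two operators. The expand function and the OPERATORS table are names made up for this illustration, not part of any library, and the sketch ignores everything else in the RFC (the other operators, the : and * modifiers, lists, associative arrays, and pass-through of existing percent-encoded triplets).

```python
from urllib.parse import quote

# RFC 3986 reserved characters; the "#" operator leaves these unescaped.
RESERVED = ":/?#[]@!$&'()*+,;="

# operator -> (prefix before the first value, separator, characters left unescaped)
OPERATORS = {
    "/": ("/", "/", ""),
    "#": ("#", ",", RESERVED),
}

def expand(expression, variables):
    """Expand the inside of one {...} expression, e.g. "/var,x"."""
    prefix, separator, safe = OPERATORS[expression[0]]
    names = expression[1:].split(",")
    # Undefined variables are simply skipped.
    values = [quote(str(variables[n]), safe=safe) for n in names if n in variables]
    return prefix + separator.join(values) if values else ""

variables = {"var": "value", "x": 1024, "path": "/foo/bar"}
print(expand("/var,x", variables) + "/here")   # /value/1024/here
print(expand("#path,x", variables) + "/here")  # #/foo/bar,1024/here
```

The only things that differ between the two operators are the three entries in the table, which is roughly how the RFC itself presents them.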

Related

URL-parameters input seems inconsistent

I have reviewed multiple instructions on URL parameters, which all suggest two approaches:
Parameters can follow forward slashes (/) or be specified by parameter name followed by parameter value. So either:
1) http://numbersapi.com/42
or
2) http://numbersapi.com/random?min=10&max=20
For the 2nd one, I provide parameter name and then parameter value by using the ?. I also provide multiple parameters using ampersand.
Now I have seen the request below, which works fine but does not fit the rules above:
http://numbersapi.com/42?json
I understand that the request sets 42 as a parameter, but why is the ? followed just by a value and not by a parameter name? Also, the ? seems to be used like an ampersand.
From Wikipedia:
Every HTTP URL conforms to the syntax of a generic URI. The URI generic syntax consists of a hierarchical sequence of five components:
URI = scheme:[//authority]path[?query][#fragment]
where the authority component divides into three subcomponents:
authority = [userinfo@]host[:port]
As you can see, the ? ends the path part of the URL and starts the query part.
The query part is usually a &-separated string of name=value pairs, but it doesn't have to be, so json is a valid value for the query part.
Or, as the Wikipedia article puts it:
An optional query component preceded by a question mark (?), containing a query string of non-hierarchical data. Its syntax is not well defined, but by convention is most often a sequence of attribute–value pairs separated by a delimiter.
It is also fairly common for request processors to treat a name=value pair that is missing the = sign as if it were name=.
E.g. if you're writing Servlet code and call servletRequest.getParameter("json"), it would return an empty string ("") for that last URL in the question.
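The same split can be seen with Python's standard urllib.parse module. This is only an illustration of how one common parser treats these URLs; other frameworks may handle a bare name differently.

```python
from urllib.parse import urlsplit, parse_qs

parts = urlsplit("http://numbersapi.com/42?json")
print(parts.path)   # /42
print(parts.query)  # json

# By default a name without "=value" is dropped; with keep_blank_values=True
# it is treated like "json=".
print(parse_qs(parts.query))                          # {}
print(parse_qs(parts.query, keep_blank_values=True))  # {'json': ['']}

print(parse_qs(urlsplit("http://numbersapi.com/random?min=10&max=20").query))
# {'min': ['10'], 'max': ['20']}
```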

ANTLR lexer pattern [\p{Emoji}]+ is matching numbers

The ANTLR4 lexer pattern [\p{Emoji}]+ is matching numbers. See screenshot. Note that it correctly rejects alpha chars. Is there an issue with the pattern?
\p{Emoji} matches everything that has the Unicode Emoji property. Numbers do have that property, so \p{Emoji} is correct in matching them. Why though?
The Unicode standard gives a code point the Emoji property if it can appear as part of an emoji. Digits can appear as part of emojis (for example, I think the keycap emojis, which for some reason count as emojis, consist of the digit followed by a variation selector and a combining enclosing keycap character), so they have that property.
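As a quick check of how a digit ends up inside an emoji, the following Python snippet prints the code points of the "keycap 1" emoji sequence. It uses only the standard unicodedata module; the variable name keycap_one is just for this example. Note that the standard library cannot query the Emoji property itself, so this only shows the structure of the sequence.

```python
import unicodedata

# The "keycap 1" emoji is a sequence of three code points, starting with
# the ordinary ASCII digit.
keycap_one = "1\uFE0F\u20E3"

for ch in keycap_one:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+0031 DIGIT ONE
# U+FE0F VARIATION SELECTOR-16
# U+20E3 COMBINING ENCLOSING KEYCAP
```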
If you only want to match codepoints that are emojis by themselves, you can just use the Emoji_Presentation property instead. This will fail to match combined emojis though.
If you want to match any sequence that creates an emoji, I think you'll want to match something like "Emoji_Presentation, followed by zero or more of '(Join_Control or Variation_Selector) followed by Emoji'" (here you want Emoji instead of Emoji_Presentation because that's where numbers are allowed).
However, for the purpose of allowing emojis in identifiers (as opposed to a lexer rule to match emojis and nothing else), you don't actually have to worry about whether a number is part of an emoji or not, just that it doesn't appear as the first character of the identifier. So you could simply define your fragment for the starting character to only include Emoji_Presentation and then the fragment for continuing characters to include Emoji as well as Join_Control and Variation_Selector.
So something like this would work:
fragment IdStart
    : [_\p{Alpha}\p{General_Category=Other_Letter}\p{Emoji_Presentation}]
    ;
fragment IdContinue
    : IdStart
    // The `\p{Number}` might be redundant, I'm not sure. I don't know
    // whether there are any (non-ascii) numeric codepoints that don't
    // also have the `Emoji` property.
    | [\p{Number}\p{Emoji}\p{Join_Control}\p{Variation_Selector}]
    ;
Identifier: IdStart IdContinue*;
Of course that's assuming you actually want to allow characters besides emojis. The definition in your question only included emojis (or was meant to anyway), but since it was called Identifier, I'm assuming you just removed the other allowed categories to simplify it.
Looking at the code that seems to define emoji code points:
UnicodeSet emojiRKUnicodeSet = new UnicodeSet("[\\p{GCB=Regional_Indicator}\\*#0-9\\u00a9\\u00ae\\u2122\\u3030\\u303d]");
it appears to include digits (as for why, check out sepp2k's excellent explanation). You can always raise an issue if you think something is wrong.
You could also just use a character class like this instead:
Identifier
: [\u00a9\u00ae\u2000-\u3300\ud83c\ud000-\udfff\ud83d\ud000-\udfff\ud83e\ud000-\udfff]+
;

ABNF rule `zero = ["0"] "0"` matches `00` but not `0`

I have the following ABNF grammar:
zero = ["0"] "0"
I would expect this to match the strings 0 and 00, but it only seems to match 00? Why?
repl-it demo: https://repl.it/@DanStevens/abnf-rule-zero-0-0-matches-00-but-not-0
Good question.
ABNF ("Augmented Backus-Naur Form") is defined by RFC 5234, which is the current version of a document intended to clarify a notation used (with variations) by many RFCs.
Unfortunately, while RFC 5234 exhaustively describes the syntax of ABNF, it does not provide much in the way of a clear statement of semantics. In particular, it does not specify whether ABNF alternation is unordered (as it is in the formal language definitions of BNF) or ordered (as it is in "PEG" -- Parsing Expression Grammar -- notation). Note that optionality/repetition are just types of alternation, so if you choose one convention for alternation, you'll most likely choose it for optionality and repetition as well.
The difference is important in cases like this. If alternation is ordered, then the parser will not back up to try a different alternative after some alternative succeeds. In terms of optionality, this means that if an optional element is present in the stream, the parser will never reconsider the decision to accept the optional element, even if some subsequent element cannot be matched. If you take that view, then alternation does not distribute over concatenation. ["0"]"0" is precisely ("0"/"")"0", which is different from "00"/"0". The latter expression would match a single 0 because the second alternative would be tried after the first one failed. The former expression, which you use, will not.
I do not believe that the authors of RFC 5234 took this view, although it would have been a lot more helpful had they made that decision explicit in the document. My only real evidence to support my belief is that the ABNF included in RFC 5234 to describe ABNF itself would fail if repetition was considered ordered. In particular, the rule for repetitions:
repetition = [repeat] element
repeat = 1*DIGIT / (*DIGIT "*" *DIGIT)
cannot match 7*"0", since the 7 will be matched by the first alternative of repeat, which will be accepted as satisfying the optional [repeat] in repetition, and element will subsequently fail.
In fact, this example (or one similar to it) was reported to the IETF as an erratum in RFC 5234, and the erratum was rejected as unnecessary, because the verifier believed that the correct parse should be produced, thus providing evidence that the official view is that ABNF is not a variant of PEG. Apparently, this view is not shared by the author of the APG parser generator (who also does not appear to document their interpretation.) The suggested erratum chose roughly the same solution as you came up with:
repeat = *DIGIT ["*" *DIGIT]
although that's not strictly speaking the same; the original repeat cannot match the empty string, but the replacement one can. (Since the only use of repeat in the grammar is optional, this doesn't make any practical difference.)
(Disclosure note: I am not a fan of PEG. So it's possible the above answer is not free of bias.)
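The ordered versus unordered readings of ["0"] "0" discussed above can be sketched with two tiny hand-written matchers in Python. Both functions are hypothetical illustrations written for this answer; neither is how APG or any other ABNF tool is actually implemented.

```python
def match_ordered(text):
    """PEG-style reading: once the optional "0" matches, never reconsider."""
    pos = 0
    if text.startswith("0", pos):   # optional element, committed if present
        pos += 1
    if text.startswith("0", pos):   # mandatory element
        pos += 1
        return pos == len(text)
    return False

def match_unordered(text):
    """Context-free reading: try the rule both with and without the option."""
    for take_optional in (True, False):
        pos = 0
        if take_optional:
            if not text.startswith("0", pos):
                continue
            pos += 1
        if text.startswith("0", pos) and pos + 1 == len(text):
            return True
    return False

for s in ("0", "00", "000"):
    print(s, match_ordered(s), match_unordered(s))
# 0 False True
# 00 True True
# 000 False False
```

Under the ordered reading the single 0 is consumed by the optional element, leaving nothing for the mandatory one, which is the behaviour described in the question.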

Input character validation using word validation regular expression

Let's say I have a regular expression that validates the input value as a whole. For example, it is an email input box, and when the user hits enter, I check it against ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$ to see if it is a valid email address.
What I want to achieve is, I want to intercept the character input too, and check every single input character to see if that character is also a valid character. I can do this by adding an extra regular expression, e.g. [A-Z0-9._%+-] but that is not what I want.
Is there a way to extract the widest possible range of acceptable characters from a given regular expression? So in the example above, can I extract all the valid characters that are defined by the original regular expression (i.e. ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$) programmatically?
I would appreciate any help or hint.
P.S. This is project for iOS written in Objective-C.
If you don't mind writing half a regex parser, certainly. You would have to be able to distinguish literals from meta-characters and to unroll/merge all character classes (including negated character classes, and nested negated character classes, if your regex flavor supports them).
If NSRegularExpression doesn't come with some convenience method for this, I cannot imagine how it would be possible otherwise. Just think about ^. Outside a character class, it's a meta-character that you can ignore. Inside a character class, it negates the class if it is the first character, and is a literal otherwise. - is a meta-character inside character classes, unless it is the first character, the last character, or comes right after another character range (depending on the regex flavor). And I'm not even talking about escaped characters.
I don't know about NSRegularExpression, but some flavors also support nested character classes (like [a-z[^aeiou]] for all consonants). I think you get where I am going with this.
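To give an idea of what "half a regex parser" means, here is a deliberately limited sketch. It is in Python rather than Objective-C, purely for illustration, and the function name allowed_characters is made up. It assumes the pattern uses only anchors, plain and backslash-escaped literals, non-negated [...] classes with simple ranges, and the quantifiers ? * + {m,n}; anything else is rejected rather than guessed at.

```python
import string

def allowed_characters(pattern):
    """Collect every character that a very simple pattern can consume."""
    allowed = set()
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "[":
            end = pattern.index("]", i + 1)
            body = pattern[i + 1:end]
            if body.startswith("^") or "\\" in body or "[" in body:
                raise ValueError("negated, escaped or nested classes are not handled")
            j = 0
            while j < len(body):
                if j + 2 < len(body) and body[j + 1] == "-":
                    # A range such as A-Z: add every character in it.
                    allowed.update(chr(c) for c in range(ord(body[j]), ord(body[j + 2]) + 1))
                    j += 3
                else:
                    # A single literal, including a leading or trailing "-".
                    allowed.add(body[j])
                    j += 1
            i = end + 1
        elif ch == "{":
            i = pattern.index("}", i) + 1   # skip a {m,n} quantifier
        elif ch == "\\":
            if pattern[i + 1].isalnum():
                raise ValueError("escapes such as \\d or \\w are not handled")
            allowed.add(pattern[i + 1])     # escaped literal, e.g. \.
            i += 2
        elif ch in "^$?*+":
            i += 1                          # anchors and quantifiers consume nothing
        elif ch in ".()|":
            raise ValueError("wildcards, groups and alternation are not handled")
        else:
            allowed.add(ch)                 # plain literal, e.g. @
            i += 1
    return allowed

chars = allowed_characters(r"^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$")
print("".join(sorted(chars)))
```

Even this restricted version needs special cases for - inside classes and for escapes, which is the point made above: a general solution has to know the exact rules of the regex flavor in use.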

IP Address/Hostname match regex

I need to match two ipaddress/hostname with a regular expression:
Like 20.20.20.20
should match with 20.20.20.20
should match with [http://20.20.20.20/abcd]
should not match with 20.20.20.200
should not match with [http://20.20.20.200/abcd]
should not match with [http://120.20.20.20/abcd]
should match with AB_20.20.20.20
should match with 20.20.20.20_AB
At present I am using this regular expression: "(.*[^(\w)]|^)20.20.20.20([^(\w)].*|$)"
But it does not work for the last two cases, because "\w" is equal to [a-zA-Z0-9_], and I also want to treat the underscore "_" as a separator. I tried different combinations without success. Please help me with this regular expression.
(.*[_]|[^(\w)]|^)10.10.10.10([_]|[^(\w)].*|$)
I spent some more time on this. This regular expression seems to work.
I don't know which language you're using, but with Perl-like regular expressions you could use the following, shorter expression:
(?:\b|\D)20\.20\.20\.20(?:\b|\D)
This effectively says:
Match word boundary (\b, here: the start of the word) or a non-digit (\D).
Match IP address.
Match word boundary (\b, here: the end of the word) or a non-digit (\D).
Note 1: ?: causes the grouping (\b|\D) not to create a backreference, i.e. to store what it has found. You probably don't need the word boundaries/non-digits to be stored. If you actually need them stored, just remove the two ?:s.
Note 2: This might be nit-picking, but you need to escape the dots in the IP address part of the regular expression, otherwise you'd also match any other character at those positions. Using 20.20.20.20 instead of 20\.20\.20\.20, you might for example match a line carrying a timestamp when you're searching through a log file...
2012-07-18 20:20:20,20 INFO Application startup successful, IP=20.20.20.200
...even though you're looking for IP addresses and that particular one (20.20.20.200) explicitly shouldn't match, according to your question. Admittedly though, this example is quite an edge case.
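The expression can be checked against the cases from the question with a few lines of Python, whose re module accepts the same Perl-like syntax. The names PATTERN, UNESCAPED and log_line are just for this example.

```python
import re

PATTERN = re.compile(r"(?:\b|\D)20\.20\.20\.20(?:\b|\D)")
# Same expression with unescaped dots, to show the problem from Note 2.
UNESCAPED = re.compile(r"(?:\b|\D)20.20.20.20(?:\b|\D)")

should_match = [
    "20.20.20.20",
    "http://20.20.20.20/abcd",
    "AB_20.20.20.20",
    "20.20.20.20_AB",
]
should_not_match = [
    "20.20.20.200",
    "http://20.20.20.200/abcd",
    "http://120.20.20.20/abcd",
]

for text in should_match + should_not_match:
    print(text, bool(PATTERN.search(text)))

log_line = "2012-07-18 20:20:20,20 INFO Application startup successful, IP=20.20.20.200"
print(bool(PATTERN.search(log_line)))    # False
print(bool(UNESCAPED.search(log_line)))  # True, it matches the timestamp
```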