ANTLR Matching all tokens except - antlr

Is there any way to match a token in antlr except a specific one?
I have a rule which states that a '_' can be an ID. Now I have a specific situation in which I want to match an ID, but in this particular case I want it to ignore the '_' alternative. Is it possible?

I think something like
(ID {!$ID.text.equals("_")}?)
should do it (if you are using Java as target language). Otherwise you will have to write that semantic predicate in a way that your language understands it.
In short, this will check whether the text does not equal "_" and only then will the subrule match.
Another possible way to do this:
id: ID
| '_'
;
ID: // lexer rule to match every valid identifier EXCEPT '_' ;
That way, whenever you mean "either '_' or any other ID", you use the parser rule id to match this; where "_" must not be allowed, you use the token ID directly.
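The predicate approach can be modelled outside ANTLR as a small sketch. It is plain Python, not ANTLR API: the `Token` tuple, `match_id` and `not_underscore` are hypothetical names introduced only for illustration. The token type has to match first, and only then is the predicate consulted.

```python
from typing import Callable, Tuple

Token = Tuple[str, str]  # (token type, token text); hypothetical representation


def accept_all(text: str) -> bool:
    """Default predicate: every ID text is acceptable."""
    return True


def not_underscore(text: str) -> bool:
    """Mirrors the semantic predicate {!$ID.text.equals("_")}?"""
    return text != "_"


def match_id(token: Token, predicate: Callable[[str], bool] = accept_all) -> bool:
    """Match an ID token only if the predicate accepts its text."""
    kind, text = token
    return kind == "ID" and predicate(text)


tokens = [("ID", "foo"), ("ID", "_"), ("INT", "42")]
accepted = [text for kind, text in tokens if match_id((kind, text), not_underscore)]
print(accepted)
```

Without the predicate, `match_id(("ID", "_"))` still succeeds, which corresponds to the ordinary rule where '_' is a valid ID.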

Related

ANTLR4 predicates with greedy * quantifier: avoid unnecessary predicate calls (lexing)

The following lexer grammar snippet is supposed to tokenize 'custom names' depending on a predicate that is defined in a class LexerHelper:
fragment NUMERICAL : [0-9];
fragment XML_NameStartChar
: [:a-zA-Z]
| '\u2070'..'\u218F'
| '\u2C00'..'\u2FEF'
| '\u3001'..'\uD7FF'
| '\uF900'..'\uFDCF'
| '\uFDF0'..'\uFFFD'
;
fragment XML_NameChar : XML_NameStartChar
| '-' | '_' | '.' | NUMERICAL
| '\u00B7'
| '\u0300'..'\u036F'
| '\u203F'..'\u2040'
;
fragment XML_NAME_FRAG : XML_NameStartChar XML_NameChar*;
CUSTOM_NAME : XML_NAME_FRAG ':' XML_NAME_FRAG {LexerHelper.myPredicate(getText())}?;
The correct match for CUSTOM_NAME is always the longest possible match. Now if the lexer encounters a custom name such as some:cname then I would like it to lex the entire string some:cname and then call the predicate once with 'some:cname' as argument.
Instead, the lexer calls the predicate with each possible 'valid' match it finds along the way, so some:c, some:cn, some:cna, some:cnam until finally some:cname.
Is there a way to change the behaviour to force antlr4 to first find the longest possible match, before calling the predicate? Alternatively, is there an efficient way for the predicate to determine that the match is not the longest one yet to simply return with false in that case?
EDIT: The funny thing about this behavior is that as long as only partial matches are passed to the predicate, the result of the predicate seems to be completely ignored by the lexer anyway. This seems oddly inefficient.
As it turns out, the behavior is known and permitted by Antlr. Antlr may or may not call predicates more than necessary (see here for more details). To avoid that behavior I am now using actions instead, which only get executed once the rule has completely and successfully matched. This allows me to e.g. switch modes in an action.
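The "check once, after the whole match" idea can be illustrated with a minimal sketch. It does not use ANTLR at all: `NAME_FRAG` is a simplified, ASCII-only stand-in for XML_NAME_FRAG, and `my_predicate` is a hypothetical replacement for LexerHelper.myPredicate whose acceptance rule is invented for the example.

```python
import re

# Simplified, ASCII-only stand-in for XML_NAME_FRAG (an assumption of this sketch).
NAME_FRAG = r"[:A-Za-z][-_.:A-Za-z0-9]*"
CUSTOM_NAME = re.compile(rf"{NAME_FRAG}:{NAME_FRAG}")

calls = []  # records every text the check was asked about


def my_predicate(text):
    """Hypothetical stand-in for LexerHelper.myPredicate."""
    calls.append(text)
    return text.startswith("some:")


def lex_custom_name(source, pos=0):
    """Find the complete match first, then run the check exactly once."""
    match = CUSTOM_NAME.match(source, pos)
    if match is None:
        return None
    text = match.group()
    return text if my_predicate(text) else None


result = lex_custom_name("some:cname rest")
print(result, calls)
```

Here the check sees only the complete text, never the partial prefixes such as some:c or some:cn, which is the effect the action-based workaround is after.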

Why does the order in this recursive rule not give the same result?

can anyone tell me what's the difference between the following two rules (Notice the order)?
the first which doesn't work
without => "[" "]" without | "[" "]"
with => "[" INDEX "]" with | "[" INDEX "]"
array => ID with | ID without | ID with without
the second which seemingly works
without => without "[" "]"| "[" "]"
with => with "[" INDEX "]" | "[" INDEX "]"
array => ID with | ID without | ID with without
I am trying to achieve the syntax of an n-dimensional array with sizes, like C# arrays. So the following syntax should work: arr[], arr[1], arr[1][], arr[1][1], arr[][], but not ones like arr[][1].
I'm assuming that by "doesn't work", you mean that bison reports a shift/reduce conflict. If you go ahead and use the generated parser anyway, then it will not parse correctly in many cases, because the conflict is real and cannot be resolved by any static rule.
The issue is simple. Remember that a LALR(1) bottom-up parser like the one generated by bison performs every reduction exactly at the end of the right-hand side, taking into account only the next token (the "lookahead token"). So it must know which production to use at the moment the production is completely read. (That gives it a lot more latitude than a top-down parser, which needs to know which production it will use at the beginning of the production. But it's still not always enough.)
The problematic case is the production ID with without. Here, whatever input matches with needs to be reduced to a single non-terminal with before the parser continues with without. To get to this point, the parser must have passed over some number of '[' INDEX ']' dimensions, and the lookahead token must be [, regardless of whether the next dimension has a definite size or not.
If the with rule is right-recursive:
with: '[' INDEX ']' with
| '[' INDEX ']'
then the parser is really stuck. If what follows has a definite dimension, it needs to continue trying the first production, which means shifting the [. If what follows has no INDEX, it needs to reduce the second production, which will trigger a chain of reductions leading back to the beginning of the list of dimensions.
On the other hand, with a left recursive rule:
with: with '[' INDEX ']'
| '[' INDEX ']'
the parser has no problem at all, because each with is reduced as soon as the ] is seen. That means that the parser doesn't have to know what follows in order to decide to reduce. It decides between the two rules based on the past, not the future: the first dimension in the array uses the second production, and the remaining ones (which follow a with) use the first one.
That's not to say that left-recursion is always the answer, although it often is. As can be seen in this case, right-recursion of a list means that individual list elements pile up on the parser stack until the list is eventually terminated, while left-recursion allows the reductions to happen immediately, so that the parser stack doesn't need to grow. So if you have a choice, you should generally prefer left-recursion.
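The difference in stack growth can be made concrete with a rough simulation. This is a hand-written model of the shift and reduce steps for the two versions of with, not bison's actual tables; `max_stack_depth` is a hypothetical helper, and it assumes the input consists only of sized dimensions.

```python
def max_stack_depth(num_dims, left_recursive):
    """Return the deepest parser stack while reading num_dims '[' INDEX ']' groups."""
    if num_dims < 1:
        raise ValueError("at least one dimension is required")
    tokens = ["[", "INDEX", "]"] * num_dims
    stack, deepest = [], 0
    for token in tokens:
        stack.append(token)  # shift
        deepest = max(deepest, len(stack))
        if left_recursive and token == "]":
            # with: with '[' INDEX ']' | '[' INDEX ']'  (reduce as soon as ']' is seen)
            handle = 4 if len(stack) >= 4 and stack[-4] == "with" else 3
            del stack[-handle:]
            stack.append("with")
    if not left_recursive:
        # with: '[' INDEX ']' with | '[' INDEX ']'  (reductions start only at the end)
        del stack[-3:]
        stack.append("with")
        while len(stack) > 1:
            del stack[-4:]
            stack.append("with")
    assert stack == ["with"]
    return deepest


print(max_stack_depth(5, left_recursive=True), max_stack_depth(5, left_recursive=False))
```

In this model the left-recursive version never holds more than four symbols, while the right-recursive version holds three symbols per dimension before the first reduction happens.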
But sometimes right-recursion can be convenient, particularly in syntaxes like this where the end of the list is different from the beginning. Another way of writing the grammar could be:
array : ID dims
dims : without
| '[' INDEX ']'
| '[' INDEX ']' dims
without: '[' ']'
| '[' ']' without
Here, the grammar only accepts empty dimensions at the end of the list because of the structure of dims. But to achieve that effect, dims must be right-recursive, since it is the end of the list which has the expanded syntax.
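The language described by this last grammar is regular, so it can be checked quickly with a regular expression. The following is only a sketch for trying out inputs, not a replacement for the bison grammar; it assumes that ID is a C-style identifier and INDEX a decimal integer, neither of which is specified in the question.

```python
import re

# array : ID dims ; sized dimensions come first, empty dimensions only at the end.
# Assumed token shapes: ID = [A-Za-z_]\w*, INDEX = \d+
ARRAY = re.compile(r"[A-Za-z_]\w*(?:\[\d+\])*(?:\[\])*")


def is_valid_array(text):
    """Accept ID followed by one or more dimensions, with empty ones last."""
    # The pattern alone would also accept a bare ID, so require a bracket.
    return ARRAY.fullmatch(text) is not None and "[" in text


for sample in ["arr[]", "arr[1]", "arr[1][]", "arr[1][1]", "arr[][]", "arr[][1]"]:
    print(sample, is_valid_array(sample))
```

Only arr[][1] is rejected among the samples, matching the requirement stated in the question.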

ANTLR recognize single character

I'm pretty sure this isn't possible, but I want to ask just in case.
I have the common ID token definition:
ID: LETTER (LETTER | DIG)*;
The problem is that in the grammar I need to parse, there are some instructions in which you have a single character as operand, like:
a + 4
but
ab + 4
is not possible.
So I can't write a rule like:
sum: (INT | LETTER) ('+' (INT | LETTER))*
Because the lexer will consider 'a' as an ID, due to the higher priority of ID. (And I can't change that priority because it wouldn't recognize single character IDs then)
So I can only use ID instead of LETTER in that rule. It's ugly because there shouldn't be an ID, just a single letter, and I will have to do a second syntactic analysis to check that.
I know there's little to be done about it, since the lexer knows nothing about context. What I'm wondering is whether ANTLR4 already has some built-in way to check the token's length inside the rule. Something like:
sum: (INT | ID{length=1})...
I would also like to know if there are some kind of "token alias" so I can do:
SINGLE_CHAR is alias of => ID
In order to avoid writing "ID" in the rule, since that can be confusing.
PS: I'm not parsing a simple language like this one, this is just a little example. In reality, an ID could also be a string, there are other tokens which can only be a subset of letters, etc... So I think I will have to do that second analysis anyway after parsing the input, to check that it is legal. I'm just curious if something like this exists.
Checking the size of an identifier is a semantic problem and should hence be handled in the semantic phase, which usually follows the parsing step. Parse your input with the usual ID rule and check in the constructed parse tree the size of the recognized ids (and act accordingly). Don't try to force this kind of decision into your grammar.
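A minimal sketch of such a separate check, in plain Python rather than with the ANTLR runtime: a flat token list stands in for the parse tree, and `tokenize` and `check_sum_operands` are hypothetical helpers. The token shapes (letters and digits for ID, digits for INT) are assumptions based on the ID rule in the question.

```python
import re

TOKEN = re.compile(r"\s*(?:(?P<INT>\d+)|(?P<ID>[A-Za-z][A-Za-z0-9]*)|(?P<PLUS>\+))")


def tokenize(text):
    """Lex with the ordinary ID rule; no length restriction at this stage."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append((match.lastgroup, match.group(match.lastgroup)))
        pos = match.end()
    return tokens


def check_sum_operands(tokens):
    """Semantic pass: return the ID operands that are not a single character."""
    return [text for kind, text in tokens if kind == "ID" and len(text) != 1]


print(check_sum_operands(tokenize("a + 4")))
print(check_sum_operands(tokenize("ab + 4")))
```

The grammar stays simple, and the error message for ab can be as specific as needed because it is produced by ordinary code rather than by the parser.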

ANTLR behaviour with conflicting tokens

How is ANTLR lexer behavior defined in the case of conflicting tokens?
Let me explain what I mean by "conflicting" tokens.
For example, assume that the following is defined:
INT_STAGE : '1'..'6';
INT : '0'..'9'+;
There is a conflict here, because after reading a sequence of digits, the lexer would not know whether there is one INT or many INT_STAGE tokens (or different combinations of both).
After a test, it looks like that if INT is defined after INT_STAGE, the lexer would prefer to find INT_STAGE, but maybe not INT then? Otherwise, no INT_STAGE would ever be found.
Another example would be:
FOOL: 'fool';
FOO: 'foo';
ID : ('a'..'z'|'A'..'Z'|'_'|'%') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'%')*;
I was told that this is the "right" order to recognize all the tokens:
while reading "fool" the lexer will find one FOOL token and not FOO ID or something else.
The following logic applies:
1. the lexer matches as many characters as possible
2. if after applying rule 1, there are 2 or more rules that match the same number of characters, the rule defined first will "win"
Taking this into account, the input "1", "2", ..., "6" is tokenized as an INT_STAGE: both INT_STAGE and INT match the same amount of characters, but INT_STAGE is defined first.
The input "12" is tokenized as an INT since it matches the most characters.
I was told that this is the "right" order to recognize all the tokens: while reading "fool" the lexer will find one FOOL token and not FOO ID or something else.
That is correct.
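The two rules (longest match first, then definition order as the tie-breaker) can be sketched in a few lines. This is a simplified model written with Python's re module, not ANTLR's implementation; `RULES` and `next_token` are names made up for the example.

```python
import re

# Token rules in definition order; on a tie in length the earlier rule wins.
RULES = [
    ("INT_STAGE", r"[1-6]"),
    ("INT", r"[0-9]+"),
    ("FOOL", r"fool"),
    ("FOO", r"foo"),
    ("ID", r"[A-Za-z_%][A-Za-z0-9_%]*"),
]


def next_token(text, pos=0):
    """Return (rule name, matched text) for the best rule at pos, or None."""
    best = None
    for name, pattern in RULES:
        match = re.compile(pattern).match(text, pos)
        # Strictly longer matches replace the current best; equal length keeps
        # the rule that was defined first.
        if match and (best is None or len(match.group()) > len(best[1])):
            best = (name, match.group())
    return best


for sample in ["1", "12", "7", "fool", "foo", "fools"]:
    print(sample, next_token(sample))
```

The input fools comes out as a single ID, because ID matches five characters while FOOL matches only four.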

Using SQL like for pattern query

I have a PHP function that accepts a parameter called $letter and I want to set the default value of the parameter to a pattern which is "any number or any symbol". How can I do that?
This is my query, by the way:
"select ID from $wpdb->posts where post_title LIKE '".$letter."%'"
I tried posting at wordpress stackexchange and they told me to post it here as this is an SQL/general programming question that is not specific to wordpress.
Thank you! Replies much appreciated :)
In order to match just numbers or letters (I'm not sure exactly what you mean by symbols) you can use the RLIKE operator in MySQL:
SELECT ... WHERE post_title RLIKE '^[A-Za-z0-9]'
That means by default $letter would be [A-Za-z0-9] - this means all letters from a to z (both cases) and numbers from 0-9. If you need specific symbols you can add them to the list (but - has to be first or last, since otherwise it has the special meaning of a range). The ^ character anchors the match at the beginning of the string. So you will need something like:
"select ID from $wpdb->posts where post_title RLIKE '^".$letter."'"
Note that there is no trailing % here: in a regular expression % is a literal character, not a wildcard, and a pattern anchored only with ^ already matches any title that starts with $letter.
Of course I have to warn you against SQL injection attacks if you build your query like this without sanitizing the input (making sure it doesn't have any ' (apostrophe) in it).
Edit
To match a title that starts with a number just use [0-9] - that means it will match one digit from 0 to 9
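To see which titles a character class such as [A-Za-z0-9] or [0-9] selects, the pattern can be tried out locally. The sketch below uses Python's re module as a stand-in for MySQL's RLIKE; the two regex dialects differ in places, but not for simple character classes like these. `title_matches` is a hypothetical helper, not part of WordPress or MySQL.

```python
import re


def title_matches(title, letter="[A-Za-z0-9]"):
    """Emulate: post_title RLIKE '^<letter>' for a simple character class."""
    return re.search("^" + letter, title) is not None


print(title_matches("Hello world"))
print(title_matches("#hashtag"))
print(title_matches("42 reasons", "[0-9]"))
print(title_matches("Hello world", "[0-9]"))
```

A title only has to start with a matching character; whatever follows it is ignored, which is why no wildcard is needed after the class.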