Do logical AND and NOT exist in ANTLR? - antlr

Is there NOT logic in ANTLR? I'm basically trying to negate a rule that I have and was wondering if it's possible. Also, is there AND logic?

@larsmans already supplied the answer; I'd just like to give an example of the legal negations in ANTLR rules (since mistakes are made with them quite often).
The negation operator in ANTLR is ~ (tilde). Inside lexer rules, the ~ negates a single character:
NOT_A : ~'A';
matches any character except 'A' and:
NOT_LOWER_CASE : ~('a'..'z');
matches any character except a lowercase ASCII letter. The last example could also be written as:
NOT_LOWER_CASE : ~LOWER_CASE;
LOWER_CASE : 'a'..'z';
As long as you negate just a single character, it's valid to use ~. It is invalid to do something like this:
INVALID : ~('a' | 'aa');
because you can't negate the string 'aa'.
Inside parser rules, negation does not work with characters, but on tokens. So the parse rule:
parse
: ~B
;
A : 'a';
B : 'b';
C : 'c';
does not match any character other than 'b', but matches any token other than the B token. So it'd match either token A (character 'a') or token C (character 'c').
The same logic applies to the . (DOT) operator:
inside lexer rules it matches any character from the set \u0000..\uFFFF;
inside parser rules it matches any token (any lexer rule).
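
The difference between the two levels can be made concrete with a small Python toy model. This is only a sketch of what ~ means in each place, not how ANTLR implements it; the helper names lexer_not and parser_not are made up for illustration, and the token types come from the A, B, C rules above.

```python
# Toy model only: it mimics the meaning of ~ at each level, not ANTLR internals.

def lexer_not(excluded_chars, ch):
    """Lexer-level ~: accept a single character outside the excluded set."""
    return len(ch) == 1 and ch not in excluded_chars


def parser_not(excluded_types, token_type):
    """Parser-level ~: accept a token whose type is outside the excluded types."""
    return token_type not in excluded_types


# Token types produced by the lexer rules A, B and C from the example.
TOKEN_TYPES = {"a": "A", "b": "B", "c": "C"}
LOWER_CASE = set("abcdefghijklmnopqrstuvwxyz")

print(lexer_not({"A"}, "B"))                 # ~'A' accepts the character 'B'
print(lexer_not(LOWER_CASE, "q"))            # ~('a'..'z') rejects 'q'
print(lexer_not({"a"}, "aa"))                # 'aa' is a string, not one character
print(parser_not({"B"}, TOKEN_TYPES["a"]))   # ~B accepts token A
print(parser_not({"B"}, TOKEN_TYPES["b"]))   # ~B rejects token B
```

The lexer-level helper only ever looks at one character, which is why something like ~('a' | 'aa') has no sensible meaning.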

ANTLR produces parsers for context-free languages (CFLs). In that context, "not" would translate to complement and "and" to intersection. However, CFLs aren't closed under complement or intersection, i.e. not(rule) is not necessarily expressible as a CFG rule.
In other words, it's impossible to implement "not" and "and" in a sane way, so they're not supported.
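
The textbook illustration of the closure problem uses two context-free languages whose intersection is not context-free. The Python sketch below checks membership by counting; the function names are made up for illustration, and nothing here is ANTLR-specific.

```python
import re

# Every string of interest has the shape a...ab...bc...c.
SHAPE = re.compile(r"(a*)(b*)(c*)")


def in_l1(s):
    """L1 = a^n b^n c^m: as many a's as b's, any number of c's (context-free)."""
    m = SHAPE.fullmatch(s)
    return bool(m) and len(m.group(1)) == len(m.group(2))


def in_l2(s):
    """L2 = a^m b^n c^n: any number of a's, as many b's as c's (context-free)."""
    m = SHAPE.fullmatch(s)
    return bool(m) and len(m.group(2)) == len(m.group(3))


def in_intersection(s):
    """L1 AND L2 = a^n b^n c^n, which no context-free grammar can describe."""
    return in_l1(s) and in_l2(s)


for sample in ["aabbc", "abbcc", "aabbcc"]:
    print(sample, in_l1(sample), in_l2(sample), in_intersection(sample))
```

Each language on its own is easy to write as a grammar rule, but a generic "rule1 AND rule2" operator would have to recognise a^n b^n c^n, which is outside what a CFG can express.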

Related

Conflict in lexer rules

I'm trying to use ANTLR4 to parse a file where elements can be the character "b" or simple literals. The problem appears when the literal is just the single character "b".
Here's a simplified grammar
Lexer file:
B
: 'b'
;
LETTER
: [a-z]
;
LETTERS
: LETTER+
;
Parser file:
pointer
: B '.' LETTERS
;
b.f works but b.b doesn't: I get "line 1:2 mismatched input 'b' expecting LETTERS". How can I avoid the conflict between the two lexical rules without moving LETTER above B, which would only shift the problem to B?
First note that the problem isn't just going to occur with b, but with any single letter. Letters other than b would simply be matched by the LETTER rule, which is still not the same as LETTERS. Since you never actually use LETTER, you can solve that part of the problem by simply removing LETTER from the grammar altogether.
As far as B is concerned, this is what's known as a contextual keyword: something that matches the rule for an identifier (or a LETTERS in this case) and should be treated specially in some positions, but still be allowed as an identifier in other positions. The common way to implement contextual keywords is to define a non-terminal for identifiers that can match either an actual identifier or any of the language's contextual keywords. So in your case, you could do this:
letters: LETTERS | B; // You can add "| LETTER" if you want to keep LETTER
pointer: B '.' letters;
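
To see why b.b lexes as B '.' B and why the letters rule fixes it, here is a toy Python model of the lexer's selection rule (longest match wins, ties go to the rule listed first). It is a sketch under assumptions, not ANTLR's algorithm: LETTER is dropped as suggested above, DOT is a hypothetical name for the '.' literal, and tokenize and is_pointer are made-up helpers.

```python
import re

# Lexer rules in grammar order. LETTER is removed; DOT stands in for '.'.
RULES = [
    ("B", re.compile(r"b")),
    ("DOT", re.compile(r"\.")),
    ("LETTERS", re.compile(r"[a-z]+")),
]


def tokenize(text):
    """Longest match wins; on a tie the rule listed first wins."""
    pos, tokens = 0, []
    while pos < len(text):
        best = None
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strictly longer only, so an earlier rule keeps a tie.
            if m and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return tokens


def is_pointer(tokens):
    """pointer: B '.' letters;  with  letters: LETTERS | B;"""
    kinds = [kind for kind, _ in tokens]
    return (len(kinds) == 3 and kinds[0] == "B" and kinds[1] == "DOT"
            and kinds[2] in ("LETTERS", "B"))


for text in ["b.f", "b.b", "b.bb"]:
    print(text, tokenize(text), is_pointer(tokenize(text)))
```

The lexer still emits B for a lone b after the dot; the parser simply accepts that token wherever an identifier is allowed.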

Optional Prefix in ANTLR parser/lexer

I'm trying to use ANTLR4 to parse input strings that are described by a grammar like:
grammar MyGrammar;
parse : PREFIX? SEARCH;
PREFIX
: [0-9]+ ':'
;
SEARCH
: .+
;
e.g. valid input strings include:
0: maracujá
apple
3:€53.60
1: 10kg
2:chilli pepper
But the SEARCH rule always matches the whole string - whether it has a prefix or not.
I understand this is because the ANTLR4 lexer gives preference to the rules that match the longest string. Therefore the SEARCH rule matches all input, not giving the PREFIX rule a chance.
And the non-greedy version (i.e. SEARCH : .+? ;) has the same problem because (as I understand) it's only non-greedy within the rule - and the SEARCH rule doesn't have any other parts to constrain it.
If it helps, I could constrain the SEARCH text to exclude ':' but I really would prefer it recognise anything else - unicode characters, symbols, numbers, space etc.
I've read "Lexer to handle lines with line number prefix", but in that case the body of the string (after the prefix) is significantly more constrained.
Note: SEARCH text might have a structure to it - like €53.00 and 10kg above (which I'd also like ANTLR4 to parse) or it might just be free text - like apple, maracujá and chilli pepper above. But I've tried to simplify so I can solve the problem of extracting the PREFIX first.
ANTLR does lexing before parsing. The lexer prefers long matches and SEARCH tokens match every PREFIX token and even any character appended to it, so your complete line is matched by SEARCH.
To prevent this, keep the lexer rules disjoint, or at least make sure that no token subsumes another:
parse : prefix? search;
search: (WORD | NUMBER)+;
prefix: NUMBER ':';
NUMBER : [0-9]+;
WORD : (~[0-9:])+;
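
A rough Python model of this token split shows how the prefix separates from the search text once no rule can swallow the whole line. It is a sketch only: COLON is a hypothetical name for the ':' literal, tokenize and parse are made-up helpers, and the regex alternation stands in for the three disjoint lexer rules.

```python
import re

# NUMBER, ':' and WORD never overlap, so every character belongs to one token.
TOKEN = re.compile(r"(?P<NUMBER>[0-9]+)|(?P<COLON>:)|(?P<WORD>[^0-9:]+)")


def tokenize(line):
    """Return (token type, text) pairs for one input line."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(line)]


def parse(line):
    """parse : prefix? search;  returns (prefix digits or None, search text)."""
    tokens = tokenize(line)
    prefix = None
    # prefix: NUMBER ':'
    if len(tokens) >= 2 and tokens[0][0] == "NUMBER" and tokens[1][0] == "COLON":
        prefix = tokens[0][1]
        tokens = tokens[2:]
    # search: (WORD | NUMBER)+
    if not tokens or any(kind == "COLON" for kind, _ in tokens):
        raise ValueError("search must be one or more WORD or NUMBER tokens")
    return prefix, "".join(text for _, text in tokens)


for line in ["0: maracujá", "apple", "3:€53.60", "1: 10kg", "2:chilli pepper"]:
    print(parse(line))
```

As in the grammar, a ':' inside the search text is rejected; that is the constraint the question mentions as acceptable.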

ANTLR with non-greedy rules

I would like to have the following grammar (part of it):
expression
:
expression 'AND' expression
| expression 'OR' expression
| StringSequence
;
StringSequence
:
StringCharacters
;
fragment
StringCharacters
: StringCharacter+
;
fragment
StringCharacter
: ~["\\]
| EscapeSequence
;
It should match things like "a b c d f" (without the quotes), as well as things like "a AND b AND c".
The problem is that my rule StringSequence is greedy, and consumes the OR/AND as well. I've tried different approaches but couldn't get my grammar to work in the correct way. Is this possible with ANTLR4? Note that I don't want to put quotes around every string. Putting quotes works fine because the rule becomes non greedy, i.e.:
StringSequence
: '"' StringCharacters? '"'
;
You have no whitespace rule, so StringCharacter matches everything except quote and backslash characters (plus the escape sequence). Include a whitespace rule so that AND and OR are matched as individual tokens. Additionally, I recommend defining lexer rules for the string literals ('AND', 'OR') instead of embedding them in the parser rule(s). That way you not only get descriptive names for the tokens (instead of auto-generated ones), but you can also better control the match order.
Here is a naive solution:
StringSequence :
(StringCharacter | NotAnd | NotOr)+
;
fragment NotAnd :
'AN' ~'D'
| 'A' ~'N'
;
fragment NotOr:
'O' ~('R')
;
fragment StringCharacter :
~('O'|'A')
;
This gets a bit more complex once whitespace rules are added. Another solution would be semantic predicates that look ahead and prevent keywords from being consumed.
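
The whitespace-based suggestion from the first answer can be sketched in Python as follows. This is a toy model, not generated ANTLR code: whitespace separates tokens, AND and OR become their own token types, and runs of ordinary words are grouped back into operands. The names tokenize and split_operands are made up for illustration.

```python
KEYWORDS = {"AND", "OR"}


def tokenize(text):
    """Whitespace separates tokens; AND/OR are keyword tokens, the rest WORD."""
    return [(w, w) if w in KEYWORDS else ("WORD", w) for w in text.split()]


def split_operands(text):
    """Group runs of WORD tokens into operands separated by AND/OR."""
    operands, operators, current = [], [], []
    for kind, value in tokenize(text):
        if kind in KEYWORDS:
            operands.append(" ".join(current))
            operators.append(kind)
            current = []
        else:
            current.append(value)
    operands.append(" ".join(current))
    return operands, operators


print(split_operands("a b c d f"))
print(split_operands("a AND b AND c"))
print(split_operands("ANDROID phone OR tablet"))
```

Because a keyword is only recognised as a whole whitespace-delimited token, a word such as ANDROID stays an ordinary word, which is the case the character-level NotAnd/NotOr fragments have to work hard to get right.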

Antlr Lexer exclude a certain pattern

In Antlr Lexer, How can I achieve parsing a token like this:
A word that contains any non-space character but not '.{' inside it. The best I can come up with is using a semantic predicate.
WORD: WL+ {!getText().contains(".{")}?;
WL: ~[ \n\r\t];
I'm a bit worried about using a semantic predicate, though: WORD will be lexed millions of times, so I would expect a semantic predicate to hurt performance.
This is coming from the requirement that I need to parse something like:
TOKEN_ONE.{TOKEN_TWO}
while TOKEN_ONE can include . and { among its characters.
I'm using Antlr 4.
You need to limit your predicate evaluation to the case immediately following a . in the input.
WORD
: ( ~[. \t\r\n]
| '.' {_input.LA(1)!='{'}?
)+
;
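
The scanning logic of this rule can be mirrored in a few lines of Python. It is only a sketch of the rule's behaviour, with a made-up helper name match_word: ordinary characters are consumed unconditionally, and the one-character lookahead runs only when the current character is a dot.

```python
WHITESPACE = " \t\r\n"


def match_word(text, start=0):
    """Return the end index of the WORD starting at start (start if none)."""
    pos = start
    while pos < len(text):
        ch = text[pos]
        if ch in WHITESPACE:
            break
        # Mirrors '.' {_input.LA(1)!='{'}? : a dot is allowed unless '{' follows.
        if ch == "." and text[pos + 1:pos + 2] == "{":
            break
        pos += 1
    return pos


for sample in ["TOKEN_ONE.{TOKEN_TWO}", "a.b{c d", "end."]:
    end = match_word(sample)
    print(repr(sample[:end]))
```

The lookahead check is skipped for every character that is not a dot, which is what keeps the predicate cheap compared with calling getText().contains(".{") on the whole token.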
How about rephrasing your question to the equivalent "A word contains any character except whitespace or dot or left brace-bracket."
Then the lexer rule is just:
WORD: ~[ \n\r\t.{]+;

antlr3 - read closure value to a variable

I would like to parse and read a closure value in a simple text line like this:
1 !something
line
: (NUMBER EXCLAMATION myText=~('\r\n')*)
{ myFunction($myText.text); }
;
NUMBER
: '0'..'9'+;
EXCLAMATION
: '!';
What I get in the myText variable is just the final 'g' of 'something' because, as can be seen in the generated code, myText is overwritten in a while loop for each occurrence of ~('\r\n').
My question is: is there an elegant way to read the 'something' value into the variable 'myText'?
TIA
Inside parser rules, the ~ does not negate characters, but tokens. So ~('\r\n') would match any token other than the literal '\r\n' token (in your example, that would be a NUMBER or EXCLAMATION).
The lexer cannot be "driven" by the parser: after the parser has matched a NUMBER and an EXCLAMATION, you can't tell the lexer to produce different tokens than it already has. The lexer will always produce tokens based on some simple rules, regardless of what the parser "needs".
In other words: you can't handle this in the parser.