Meaning of [\p{Lu}] in ANTLR grammar - antlr

How should one interpret the rule
fragment LETTER_UPPERCASE : [\p{Lu}] ;
at
https://github.com/okellogg/ada_antlr_grammar/blob/master/antlr4/ada.g4#L200
Is \p an ANTLR-specific escape sequence?

\p{Lu} or \p{Uppercase_Letter} is a unicode category. It matches an uppercase letter that has a lowercase variant. By wrapping [ and ] around it, it is made into a character set.
See:
https://www.regular-expressions.info/unicode.html
https://github.com/antlr/antlr4/pull/1688

Related

ANTLR with non-greedy rules

I would like to have the following grammar (part of it):
expression
:
expression 'AND' expression
| expression 'OR' expression
| StringSequence
;
StringSequence
:
StringCharacters
;
fragment
StringCharacters
: StringCharacter+
;
fragment
StringCharacter
: ~["\]
| EscapeSequence
;
It should match things like "a b c d f" (without the quotes), as well as things like "a AND b AND c".
The problem is that my rule StringSequence is greedy, and consumes the OR/AND as well. I've tried different approaches but couldn't get my grammar to work in the correct way. Is this possible with ANTLR4? Note that I don't want to put quotes around every string. Putting quotes works fine because the rule becomes non greedy, i.e.:
StringSequence
: '"' StringCharacters? '"'
;
You have no whitespace rule so StringCharacter matches everything except quote and backslash chars (+ the escape sequenc). Include a whitespace rule to make it match individual AND/OR tokens. Additionally, I recommend to define lexer rules for string literals ('AND', 'OR') instead of embedding them in the (parser) rule(s). This way you not only get speaking names for the tokens (instead of auto generated ones) but you also can better control the match order.
Yet a naive solution:
StringSequence :
(StringCharacter | NotAnd | NotOr)+
;
fragment NotAnd :
'AN' ~'D'
| 'A' ~'N'
;
fragment NotOr:
'O' ~('R')
;
fragment StringCharacter :
~('O'|'A')
;
Gets a bit more complex with Whitespace rules. Another solution would be with semantic predicates looking ahead and preventing the read of keywords.

Antlr Lexer exclude a certain pattern

In Antlr Lexer, How can I achieve parsing a token like this:
A word that contains any non-space letter but not '.{' inside it. Best I can come up with is using a semantics predicate.
WORD: WL+ {!getText().contains(".{")};
WL: ~[ \n\r\t];
I'm a bit worried to use semantics predicate though cause WORD here will be lexed millions of times I would think to put a semantics predicate will hit the performance.
This is coming from the requirement that I need to parse something like:
TOKEN_ONE.{TOKEN_TWO}
while TOKEN_ONE can include . and { in its letter.
I'm using Antlr 4.
You need to limit your predicate evaluation to the case immediately following a . in the input.
WORD
: ( ~[. \t\r\n]
| '.' {_input.LA(1)!='{'}?
)+
;
How about rephrasing your question to the equivalent "A word contains any character except whitespace or dot or left brace-bracket."
Then the lexer rule is just:
WORD: ~[ \n\r\t.{]*

ANTLR v4: Same character has different meaning in different contexts

This is my first crack at parser generators, and, consequently ANTLR. I'm using ANTLR v4 trying to generate a simple practice parser for Morse Code with the following extra rules:
A letter (e.g., ... [the letter 's']) can be denoted as capitalized if a '^' precedes it
ex.: ^... denotes a capital 'S'
Special characters can be embeded in parentheses
ex.: (#)
Each encoded entity will be separated by whitespace
So I could encode the following sentence:
ABC a#b.com
as (with corresponding letters shown underneath):
^.- ^-... ^-.-. ( ) ._ (#) -... (.) -.-. --- --
A B C ' ' a '#' b '.' c o m
Particularly note the two following entities: ( ) (which denotes a space) and (.) (which denotes a period.
There is mainly one things that I'm finding hard to wrap my head around: The same token can take on different meanings depending on whether it is in parentheses or not. That is, I want to tell ANTLR that I want to discard whitespace, yet not in the ( ) case. Also, a Morse Code character can consist of dots-and-dashes (periods-and-dashes), yet, I don't want to consider the period in (.) as "any charachter".
Here is the grammar I have got so far:
grammar MorseCode;
file: entity*;
entity:
special
| morse_char;
special: '(' SPECIAL ')';
morse_char: '^'? (DOT_OR_DASH)+;
SPECIAL : .; // match any character
DOT_OR_DASH : ('.' | '-');
WS : [ \t\r\n]+ -> skip; // we don't care about whitespace (or do we?)
When I try it against the following input:
^... --- ...(#)
I get the following output (from grun ... -tokens):
[#0,0:0='^',<1>,1:0]
[#1,1:1='.',<4>,1:1]
...
[#15,15:14='<EOF>',<-1>,1:15]
line 1:1 mismatched input '.' expecting DOT_OR_DASH
It seems there is trouble with ambiguity between SPECIAL and DOT_OR_DASH?
It seems like your (#) syntax behaves like a quoted string in other programming languages. I would start by defining SPECIAL as:
SPECIAL : '(' .*? ')';
To ensure that . . and .. are actually different, you can use this:
SYMBOL : [.-]+;
Then you can define your ^ operator:
CARET : '^';
With these three tokens (and leaving WS as-is), you can simplify your parser rules significantly:
file
: entity* EOF
;
entity
: morse_char
| SPECIAL
;
morse_char
: CARET? SYMBOL
;

Antlr greedy-option

(I edited my question based on the first comment of #Bart Kiers - thank you!)
I have the following grammar:
SPACE : (' '|'\t'|'\n'|'\r')+ {$channel = HIDDEN;};
START : 'START:';
STRING_LITERAL : ('"' .* '"')+;
rule : START STRING_LITERAL;
and I want to parse languages like: 'START: "abcd" START: "img src="test.jpg""' (string literals could be inside string literals).
The grammar defined above does not work if there are string literals inside a string literal because for the language 'START: "img src="test.jpg""' the lexer translates it into the following tokens: START('START:') STRING_LITERAL("img src=") test.jpg.
Is there any way to define a grammar which is fine for my problem?
There are a couple of things wrong here:
you cannot use fragment rules inside parser rules. You grammar will never create a START token;
a . char (DOT-char) inside a parser rule matches any token, while inside a lexer rule, it matches any character;
if you let .* match greedily (and you had defined a proper lexer rule that matches a string literal), the input START: "abcd" START: "img src="test.jpg"" would then have one large string in it: "abcd" START: "img src="test.jpg"" (the first and the last quote would be matched).
So, you cannot embed string literals inside string literals using the same quotes. The lexer is not able to determine if a quote is meant to close the string, or if it's the start of a (new) embedded string. You will need to change that in your grammar.

Does logical AND and NOT exists in ANTLR?

Is there NOT logic in ANTLR? Im basically trying to negate a rule that i have and was wondering if its possible, also is there AND logic?
#larsmans already supplied the answer, I just like to give an example of the legal negations in ANTLR rules (since it happens quite a lot that mistakes are made with them).
The negation operator in ANTLR is ~ (tilde). Inside lexer rules, the ~ negates a single character:
NOT_A : ~'A';
matches any character except 'A' and:
NOT_LOWER_CASE : ~('a'..'z');
matches any character except a lowercase ASCII letter. The lats example could also be written as:
NOT_LOWER_CASE : ~LOWER_CASE;
LOWER_CASE : 'a'..'z';
As long as you negate just a single character, it's valid to use ~. It is invalid to do something like this:
INVALID : ~('a' | 'aa');
because you can't negate the string 'aa'.
Inside parser rules, negation does not work with characters, but on tokens. So the parse rule:
parse
: ~B
;
A : 'a';
B : 'b';
C : 'c';
does not match any character other than 'b', but matches any token other than the B token. So it'd match either token A (character 'a') or token C (character 'c').
The same logic applies to the . (DOT) operator:
inside lexer rules it matches any character from the set \u0000..\uFFFF;
inside parser rules it matches any token (any lexer rule).
ANTLR produces parsers for context-free languages (CFLs). In that context, not would translate to complement and and to intersection. However, CFLs aren't closed under complement and intersection, i.e. not(rule) is not necessarily a CFG rule.
In other words, it's impossible to implement not and and in a sane way, so they're not supported.