Conflict in lexer rules - antlr

I'm trying to use ANTLR4 to parse a file where elements can be the character "b" or simple literals. The problem appears when the literal is just the single character "b".
Here's a simplified grammar
Lexer file:
B
: 'b'
;
LETTER
: [a-z]
;
LETTERS
: LETTER+
;
Parser file:
pointer
: B '.' LETTERS
;
b.f works but b.b doesn't; I get "line 1:2 mismatched input 'b' expecting LETTERS". How can I avoid the conflict between the two lexical rules without putting LETTER above B, which would just move the problem to B?

First note that the problem isn't just going to occur with b, but with any single letter. Letters other than b would simply be matched by the LETTER rule, which is still not the same as LETTERS. Since you never actually use LETTER, you can solve that part of the problem by simply removing LETTER from the grammar altogether.
As far as B is concerned, this is what's known as a contextual keyword: something that matches the rule for an identifier (or LETTERS in this case) and should be treated specially in some positions, but should still be allowed as an identifier in other positions. The common way to implement contextual keywords is to define a non-terminal for identifiers that can match either an actual identifier or any of the language's contextual keywords. So in your case, you could do this:
letters: LETTERS | B; // You can add "| LETTER" if you want to keep LETTER
pointer: B '.' letters;
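To see why the original grammar tokenizes the way it does, here is a minimal Python sketch of the selection rule the lexer applies (longest match wins, ties go to the rule listed first). It is only a simulation built on Python's re module, not ANTLR itself; the lex helper and the DOT token name are assumptions made for illustration.

```python
import re

def lex(rules, text):
    """Tokenize text: the longest match wins, ties go to the earliest rule."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so an earlier rule keeps the token when lengths are equal.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# Rule order mirrors the lexer file; DOT stands in for the literal '.'.
ORIGINAL = [("B", r"b"), ("DOT", r"\."), ("LETTER", r"[a-z]"), ("LETTERS", r"[a-z]+")]
WITHOUT_LETTER = [("B", r"b"), ("DOT", r"\."), ("LETTERS", r"[a-z]+")]

print(lex(ORIGINAL, "b.b"))        # [('B', 'b'), ('DOT', '.'), ('B', 'b')]
print(lex(ORIGINAL, "b.f"))        # [('B', 'b'), ('DOT', '.'), ('LETTER', 'f')]
print(lex(WITHOUT_LETTER, "b.f"))  # [('B', 'b'), ('DOT', '.'), ('LETTERS', 'f')]
```

The second "b" always comes out as a B token, which is why the parser rule has to accept B wherever LETTERS is allowed.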

Related

unable to write lexer to parse this

I am designing my own data format :
-key=value-123
It is "DASH KEY EQUAL IDENTIFIER". The problem is that the identifier can also contain a dash, so it eats all the characters. Please help.
DASH : '-';
EQUAL : '=';
IDENTIFIER : [a-zA-Z0-9 -_<>#:\\.#()/]+;
thanks
Peter
This is how ANTLR lexers work. If multiple lexer rules match the input stream of characters, the rule with the longest match wins (when the lengths are equal, the first rule wins). Since your IDENTIFIER rule includes '-' but excludes '=', ANTLR will create the longer token for IDENTIFIER. You won't be able to get a match for DASH unless your input begins with "-=" (of course, then there'd be no IDENTIFIER).
If you are designing your own format, you could make the choice to disallow "-" in identifiers and you should be good to go.
Is this the full picture of what you are attempting to parse, or just a small subset? If this is the full picture, then you could easily "parse" this with a regex and capture groups. ANTLR would be overkill.
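As a rough illustration of the regex route, here is a minimal Python sketch. It assumes the key never contains "=" and that everything after the first "=" is the value; the pattern and the parse_line helper are hypothetical, not part of the format described in the question.

```python
import re

# Hypothetical pattern for a "-key=value" line: the key runs up to the first '='.
LINE = re.compile(r"-(?P<key>[^=]+)=(?P<value>.*)")

def parse_line(line):
    """Split a "-key=value" line into its key and value."""
    m = LINE.fullmatch(line)
    if m is None:
        raise ValueError(f"not a -key=value line: {line!r}")
    return m.group("key"), m.group("value")

print(parse_line("-key=value-123"))  # ('key', 'value-123')
```

Because the leading dash is anchored by the pattern, dashes inside the key or the value need no special handling.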
You could take the following approach if you really have to have a DASH in your identifier:
1 - remove the "-" from the IDENTIFIER Lexer rule (we'll call that ID), and we'll handle the full identifier in an identifier parse rule:
keyValue : DASH key=identifier EQUAL val=identifier;
identifier: ID (DASH ID)*;
DASH : '-';
EQUAL : '=';
ID : [a-zA-Z0-9 _<>#:\\.#()/]+;
In a listener (or visitor) for the IdentifierContext (e.g. enterIdentifier/exitIdentifier in a listener), you can call ctx.getText() to get the full text matched by the identifier rule.

ANTLR parser for alpha numeric words which may have whitespace in between

First I tried to identify a normal word and below works fine:
grammar Test;
myToken: WORD;
WORD: (LOWERCASE | UPPERCASE )+ ;
fragment LOWERCASE : [a-z] ;
fragment UPPERCASE : [A-Z] ;
fragment DIGIT: '0'..'9' ;
WHITESPACE : (' ' | '\t')+;
As soon as I added the rule below just beneath "myToken", even my WORD tokens were no longer recognised for the input string "abc":
ALPHA_NUMERIC_WS: ( WORD | DIGIT | WHITESPACE)+;
Does anyone have any idea why is that?
This is because of how ANTLR's lexer chooses between rules: it takes the rule that matches the longest stretch of input, and when several rules match the same length, the one specified first in the grammar wins.
In your case ALPHA_NUMERIC_WS matches the same content as WORD (and more), and it is specified before WORD. WORD will therefore never be used, because there is no input that WORD can match that ALPHA_NUMERIC_WS cannot match at least as well. (The same applies to the WHITESPACE rule; DIGIT is a fragment, so it never produces a token of its own anyway.)
I guess that what you want is not to create an ALPHA_NUMERIC_WS token (as is done by specifying it as a lexer rule) but to make it a parser rule instead, so it can be referenced from another parser rule to allow an arbitrary sequence of WORDs, DIGITs and WHITESPACEs. For the parser to see DIGIT tokens, DIGIT would have to be a regular lexer rule rather than a fragment.
Therefore you'd want to write it like this:
alpha_numeric_ws: ( WORD | DIGIT | WHITESPACE)+;
If you actually want to create the respective token you can either remove the following rules or you need to think about what a lexer's job is and where to draw the line between lexer and parser (You need to redesign your grammar in order for this to work).

Antlr Lexer exclude a certain pattern

In Antlr Lexer, How can I achieve parsing a token like this:
A word that contains any non-space character but not '.{' inside it. The best I can come up with is using a semantic predicate.
WORD: WL+ {!getText().contains(".{")}?;
WL: ~[ \n\r\t];
I'm a bit worried about using a semantic predicate, though, because WORD will be lexed millions of times and I would think a semantic predicate will hurt performance.
This is coming from the requirement that I need to parse something like:
TOKEN_ONE.{TOKEN_TWO}
while TOKEN_ONE can include . and { in its letter.
I'm using Antlr 4.
You need to limit your predicate evaluation to the case immediately following a . in the input.
WORD
: ( ~[. \t\r\n]
| '.' {_input.LA(1)!='{'}?
)+
;
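The effect of that rule can be approximated outside ANTLR with a regular expression, where a negative lookahead plays the role of the predicate. This is only a Python sketch of the matching behaviour, not generated lexer code; the first_word helper is a name made up for illustration.

```python
import re

# Regex analogue of the WORD rule: any character that is neither a dot nor
# whitespace, or a dot that is not immediately followed by '{'.
WORD = re.compile(r"(?:[^. \t\r\n]|\.(?!\{))+")

def first_word(text):
    """Return the WORD at the start of text, or None if there is none."""
    m = WORD.match(text)
    return m.group() if m else None

print(first_word("TOKEN_ONE.{TOKEN_TWO}"))  # TOKEN_ONE
print(first_word("a.b{c.{d}"))              # a.b{c
```

The lookahead is only evaluated after a dot, which mirrors the idea of limiting the predicate to that one case.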
How about rephrasing your question to the equivalent "A word contains any character except whitespace or dot or left brace-bracket."
Then the lexer rule is just:
WORD: ~[ \n\r\t.{]+;

ANTLR v4: Same character has different meaning in different contexts

This is my first crack at parser generators, and, consequently ANTLR. I'm using ANTLR v4 trying to generate a simple practice parser for Morse Code with the following extra rules:
A letter (e.g., ... [the letter 's']) can be denoted as capitalized if a '^' precedes it
ex.: ^... denotes a capital 'S'
Special characters can be embedded in parentheses
ex.: (#)
Each encoded entity will be separated by whitespace
So I could encode the following sentence:
ABC a#b.com
as (with corresponding letters shown underneath):
^.-  ^-...  ^-.-.  ( )  .-  (#)  -...  (.)  -.-.  ---  --
A    B      C      ' '  a   '#'  b     '.'  c     o    m
Particularly note the following two entities: ( ) (which denotes a space) and (.) (which denotes a period).
There is mainly one thing that I'm finding hard to wrap my head around: the same token can take on different meanings depending on whether it is in parentheses or not. That is, I want to tell ANTLR that I want to discard whitespace, yet not in the ( ) case. Also, a Morse Code character can consist of dots-and-dashes (periods-and-dashes), yet I don't want to consider the period in (.) as "any character".
Here is the grammar I have got so far:
grammar MorseCode;
file: entity*;
entity:
special
| morse_char;
special: '(' SPECIAL ')';
morse_char: '^'? (DOT_OR_DASH)+;
SPECIAL : .; // match any character
DOT_OR_DASH : ('.' | '-');
WS : [ \t\r\n]+ -> skip; // we don't care about whitespace (or do we?)
When I try it against the following input:
^... --- ...(#)
I get the following output (from grun ... -tokens):
[#0,0:0='^',<1>,1:0]
[#1,1:1='.',<4>,1:1]
...
[#15,15:14='<EOF>',<-1>,1:15]
line 1:1 mismatched input '.' expecting DOT_OR_DASH
It seems there is trouble with ambiguity between SPECIAL and DOT_OR_DASH?
It seems like your (#) syntax behaves like a quoted string in other programming languages. I would start by defining SPECIAL as:
SPECIAL : '(' .*? ')';
To ensure that . . and .. are actually different, you can use this:
SYMBOL : [.-]+;
Then you can define your ^ operator:
CARET : '^';
With these three tokens (and leaving WS as-is), you can simplify your parser rules significantly:
file
: entity* EOF
;
entity
: morse_char
| SPECIAL
;
morse_char
: CARET? SYMBOL
;
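To check that these three tokens are enough for the example sentence, here is a small Python sketch that tokenizes with equivalent regular expressions and decodes the result. It is not ANTLR output: the decode helper is hypothetical, the Morse table is deliberately partial, and unlike the parser rules it does not enforce that a CARET is followed by a SYMBOL.

```python
import re

# Regex counterparts of SPECIAL, SYMBOL, CARET and WS from the grammar above.
TOKEN = re.compile(
    r"(?P<SPECIAL>\(.*?\))|(?P<SYMBOL>[.-]+)|(?P<CARET>\^)|(?P<WS>[ \t\r\n]+)",
    re.DOTALL,
)

# Partial Morse table, just enough for the example input (an assumption).
MORSE = {".-": "a", "-...": "b", "-.-.": "c", "---": "o", "--": "m", "...": "s"}

def decode(text):
    """Decode the extended Morse notation into plain text."""
    out, capitalize, pos = [], False, 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "WS":
            continue                    # whitespace outside parentheses is skipped
        if kind == "CARET":
            capitalize = True           # applies to the next entity
            continue
        if kind == "SPECIAL":
            out.append(lexeme[1:-1])    # keep what is between the parentheses
        else:
            letter = MORSE[lexeme]
            out.append(letter.upper() if capitalize else letter)
        capitalize = False
    return "".join(out)

print(decode("^.- ^-... ^-.-. ( ) .- (#) -... (.) -.-. --- --"))  # ABC a#b.com
```

Because "( )" and "(.)" are each consumed as a single SPECIAL token, the space and the period inside them never reach the WS or SYMBOL patterns.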

Does logical AND and NOT exists in ANTLR?

Is there NOT logic in ANTLR? I'm basically trying to negate a rule that I have and was wondering if it's possible. Also, is there AND logic?
@larsmans already supplied the answer; I'd just like to give an example of the legal negations in ANTLR rules (since mistakes are made with them quite often).
The negation operator in ANTLR is ~ (tilde). Inside lexer rules, the ~ negates a single character:
NOT_A : ~'A';
matches any character except 'A' and:
NOT_LOWER_CASE : ~('a'..'z');
matches any character except a lowercase ASCII letter. The last example could also be written as:
NOT_LOWER_CASE : ~LOWER_CASE;
LOWER_CASE : 'a'..'z';
As long as you negate just a single character, it's valid to use ~. It is invalid to do something like this:
INVALID : ~('a' | 'aa');
because you can't negate the string 'aa'.
Inside parser rules, negation does not work with characters, but on tokens. So the parse rule:
parse
: ~B
;
A : 'a';
B : 'b';
C : 'c';
does not match any character other than 'b', but matches any token other than the B token. So it'd match either token A (character 'a') or token C (character 'c').
The same logic applies to the . (DOT) operator:
inside lexer rules it matches any character from the set \u0000..\uFFFF;
inside parser rules it matches any token (any lexer rule).
ANTLR produces parsers for context-free languages (CFLs). In that context, NOT would translate to complement and AND to intersection. However, CFLs aren't closed under complement and intersection, i.e. not(rule) is not necessarily a CFG rule.
In other words, it's impossible to implement NOT and AND in a sane way, so they're not supported.