ANTLR v4 signed integer rule

In the ANTLR4 grammar for the MiniJava example, I want to parse signed integers with the following rules:
IntegerLiteral:IntegerSign? DecimalIntegerLiteral;
fragment
DecimalIntegerLiteral:DecimalNumeral IntegertypeSuffix?;
fragment
IntegerSign:'+'|'-';
fragment
IntegertypeSuffix:[lL];
fragment
DecimalNumeral:'0'| NonZeroDigit(Digits?| Underscores Digits);
fragment
Digits:Digit(DigitsAndUnderscores? Digit)?;
fragment
Digit:'0'| NonZeroDigit;
fragment
NonZeroDigit:[1-9];
fragment
DigitsAndUnderscores:DigitOrUnderscore+;
fragment
DigitOrUnderscore:Digit| '_';
fragment
Underscores:'_'+;
but parsing fails with errors caused by this expression rule:
expression:expression PLUS expression # addExpression
How should I avoid this conflict?

Related

Can a fragment use another fragment in ANTLR4

When I use a fragment in ANTLR4, can I use another fragment?
For example, I want to define a fragment NUM_FRAGMENT which uses other fragments:
fragment DIGIT: [1-9];
fragment ZERO: [0];
fragment NUM_FRAGMENT: ZERO | DIGIT | [0-9];
Is the example above allowed in ANTLR4?
Yes, fragments can use other fragments.
In your example, fragment NUM_FRAGMENT: ZERO | DIGIT | [0-9]; can be shortened to fragment NUM_FRAGMENT: ZERO | DIGIT;, because ZERO and DIGIT together already cover [0-9].
Note that the rule names are somewhat misleading: DIGIT suggests it matches any digit (from 0 to 9), and NUM_FRAGMENT suggests it matches a number, which should be one or more digits.
I'd write the rules like this:
fragment NON_ZERO : [1-9];
fragment ZERO : '0';
fragment DIGIT : ZERO | NON_ZERO;
fragment NUM : DIGIT+;
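As a rough illustration of how these fragments compose, here is a minimal Python sketch that rebuilds the same rules as regular-expression strings. It is only a toy model of the grammar above, not ANTLR itself, and the helper name matches_num is made up for this example.

```python
import re

# Each "fragment" is a regex string; larger ones are built from smaller ones,
# the same way NUM is built from DIGIT in the grammar above.
NON_ZERO = r"[1-9]"
ZERO = r"0"
DIGIT = f"(?:{ZERO}|{NON_ZERO})"
NUM = f"(?:{DIGIT})+"


def matches_num(text):
    """Return True if the whole text is one or more digits (toy NUM rule)."""
    return re.fullmatch(NUM, text) is not None


print(matches_num("1024"))  # True
print(matches_num("12a"))   # False
```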

antlr4 two lexer rule match the same string

I'm currently using antlr4 to build a parser, but I ran into a problem that I couldn't figure out despite my best efforts. Can you help me explain and solve it?
# grammar file: PluginDoc.g4
grammer PluginDoc
pluginDef : pluginName | pluginDesc;
pluginName : PluginName IDENTIFIER;
pluginDesc : PluginDesc TEXT;
PluginName '#pluginName'
PluginDesc '#pluginDesc'
IDENTIFIER : [a-zA-Z_]+;
TEXT : ~( ' ' | '\n' | '\t' )+;
The input content is:
#pluginName kafka
#pluginDesc abc
If I put IDENTIFIER before TEXT, I get "mismatched input 'abc' expecting TEXT".
If I put TEXT before IDENTIFIER, I get "mismatched input 'kafka' expecting IDENTIFIER".
It looks like both IDENTIFIER and TEXT match. How can I match only IDENTIFIER in pluginName and only TEXT in pluginDesc?
First of all, you have several errors in the grammar that you posted:
The header of the file should say grammar, not grammer, and it must end with a semicolon. Your lexer rules PluginName and PluginDesc are missing the colon after the rule name and the semicolon that terminates the rule. ANTLR also requires parser rule names to start with a lower-case letter and lexer rule names to start with an upper-case letter; writing lexer rules in all upper-case is a common convention on top of that.
grammar PluginDoc;
pluginDef : pluginName | pluginDesc;
pluginName : PLUGIN_NAME IDENTIFIER;
pluginDesc : PLUGIN_DESC TEXT;
PLUGIN_NAME : '#pluginName';
PLUGIN_DESC : '#pluginDesc';
IDENTIFIER : [a-zA-Z_]+;
TEXT : ~( ' ' | '\n' | '\t' )+;
Some of the problems that I encountered while testing your grammar were due to unhandled whitespace. You should add a lexer rule that skips whitespace, placed at the end of the grammar file after all of the other lexer rules.
WS: [ \n\t\r]+ -> skip;
Next, your TEXT and IDENTIFIER rules clash with each other. When the lexer tokenizes the character stream, kafka and abc each match both IDENTIFIER and TEXT. When two lexer rules match the same (longest) run of characters, the lexer picks the rule defined first in the grammar, so both words are tokenized as whichever rule comes first. This causes the error you encounter: the rule you define second is never emitted as a token for these inputs, so the parser rule that expects it cannot match.
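The tie-breaking described above can be sketched with a toy lexer in Python. This is only a simplified model (longest match wins, ties go to the rule listed first), not ANTLR's actual implementation, and first_token is a name made up for this example.

```python
import re


def first_token(rules, text):
    """Return (rule_name, lexeme) for the first token a toy lexer would emit.

    Toy model: the longest match wins, and a tie goes to the rule listed first.
    """
    best_name, best_lexeme = None, ""
    for name, pattern in rules:
        m = re.match(pattern, text)
        # Only a strictly longer match replaces the current best, so on a tie
        # the earlier rule keeps priority.
        if m and len(m.group()) > len(best_lexeme):
            best_name, best_lexeme = name, m.group()
    return best_name, best_lexeme


# Regex equivalents of the two lexer rules from the question.
IDENTIFIER = ("IDENTIFIER", r"[a-zA-Z_]+")
TEXT = ("TEXT", r"[^ \n\t]+")

print(first_token([IDENTIFIER, TEXT], "kafka"))  # ('IDENTIFIER', 'kafka')
print(first_token([TEXT, IDENTIFIER], "kafka"))  # ('TEXT', 'kafka')
```

Swapping the order of the two rules changes the token type for the same input, which is why the parser rule that expects the second rule never sees it.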
As suggested by Lucas, you should probably match both of these as TEXT and do the subsequent checking for validity of the input in your Listener/Visitor.
grammar PluginDoc;
pluginDef : (pluginName | pluginDesc)* EOF;
pluginName : PLUGIN_NAME TEXT;
pluginDesc : PLUGIN_DESC TEXT;
PLUGIN_NAME: '#pluginName';
PLUGIN_DESC: '#pluginDesc';
TEXT : ~[ \r\n\t]+;
WS: [ \r\n\t]+ -> skip;
I also changed the pluginDef Parser rule to
pluginDef : (pluginName | pluginDesc)* EOF;
since it was my impression that you want to input both #pluginName X and #pluginDesc Y at once and identify them. If this is not the case, feel free to change back to what you had before.
The modified grammar above parses your sample input into one pluginName subtree and one pluginDesc subtree under pluginDef (parse-tree image omitted).
You can also run this with a text file as an input.
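The validity check that is deferred to the Listener/Visitor could look roughly like the following Python sketch. The generated listener class is not shown; check_plugin_name is a hypothetical helper that such a listener might call with the text of the TEXT token, and the pattern simply mirrors the IDENTIFIER rule from the question.

```python
import re

# Mirrors the IDENTIFIER rule from the question: [a-zA-Z_]+
IDENTIFIER_RE = re.compile(r"[a-zA-Z_]+")


def check_plugin_name(text):
    """Hypothetical helper: accept the value only if it is a valid identifier."""
    if IDENTIFIER_RE.fullmatch(text) is None:
        raise ValueError(f"invalid plugin name: {text!r}")
    return text


print(check_plugin_name("kafka"))  # kafka
```

With this split, the lexer stays free of ambiguity, and the rule about what counts as a plugin name lives in one place in ordinary code.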

how to define a rule of a pattern repeated by a fixed number of times using antlr pure lexer grammar

I'm trying to define a pure lexer grammar in ANTLR that recognizes 32-bit numbers in hexadecimal notation.
For now I have:
lexer grammar Grammar;
WS : [ \r\t\n]+ -> skip;
fragment HexDigit : ([0-9]|[A-F]|[a-f]);
fragment HexDigitNoZero : ([1-9]|[A-F]|[a-f]);
fragment HexNumber : (HexDigitNoZero)(HexDigit)*;
fragment Eight : HexDigit HexDigit HexDigit HexDigit HexDigit HexDigit HexDigit HexDigit;
Hex :'0x'Eight;
I'd like to know if there's any way to specify a repetition count of 8 in a pure lexer grammar.
Like Flex does with 'a'{8}.
You could use a semantic predicate (written like an action, but with a ? after the closing brace):
(HexDigitNoZero)(HexDigit)* {getText().length() != 8}? {do_something;};
Technically this is still a pure lexer grammar, but it uses grammar actions (the snippet assumes the Java target).
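As far as I know, ANTLR 4 has no counterpart to Flex's {8} quantifier, so apart from a predicate the remaining option is to write the element out eight times, as the Eight fragment in the question already does. If that gets tedious, the rule text can be generated. The sketch below is a small helper for that; repeat_rule is a name made up for this example and is not part of ANTLR.

```python
def repeat_rule(name, element, count):
    """Build the text of a fragment rule that repeats `element` `count` times."""
    body = " ".join([element] * count)
    return f"fragment {name} : {body};"


# Prints a rule equivalent to the hand-written Eight fragment from the question.
print(repeat_rule("Eight", "HexDigit", 8))
```

The generated line can be pasted into the grammar, which keeps the lexer free of target-specific actions.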

Problems with ANTLR4 grammar

I have a very simple grammar file, which looks like this:
grammar Wort;
// Parser Rules:
word
: ANY_WORD EOF
;
// Lexer Rules:
ANY_WORD
: SMALL_WORD | CAPITAL_WORD
;
SMALL_WORD
: SMALL_LETTER (SMALL_LETTER)+
;
CAPITAL_WORD
: CAPITAL_LETTER (SMALL_LETTER)+
;
fragment SMALL_LETTER
: ('a'..'z')
;
fragment CAPITAL_LETTER
: ('A'..'Z')
;
If I try to parse the input "Hello", everything is OK, but if I modify my grammar file like this:
...
// Parser Rules:
word
: CAPITAL_WORD EOF
;
...
the input "Hello" is no longer recognized as valid input. Can anybody explain what is going wrong?
Thanx, Lars
The issue here has to do with precedence in the lexer grammar. Because ANY_WORD is listed before CAPITAL_WORD, it is given higher precedence. The lexer will identify Hello as a CAPITAL_WORD, but since an ANY_WORD can be just a CAPITAL_WORD, and the lexer is set up to prefer ANY_WORD, it will output the token ANY_WORD. The parser acts on the output of the lexer, and since ANY_WORD EOF doesn't match any of its rules, the parse fails.
You can make the lexer behave differently by moving CAPITAL_WORD above ANY_WORD in the grammar, but that will create the opposite problem -- capitalized words will never lex as ANY_WORDs. The best thing to do is probably what Mephy suggested -- make ANY_WORD a parser rule.
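To see why moving ANY_WORD out of the lexer helps, here is a toy Python model. It is only a sketch, not ANTLR: the lexer assigns exactly one token type per word, and the "any word" check happens afterwards, at the parser level. The names token_type and is_any_word are made up for this example.

```python
import re

# Only the two concrete word rules remain in the (toy) lexer.
LEXER_RULES = [
    ("SMALL_WORD", r"[a-z][a-z]+"),
    ("CAPITAL_WORD", r"[A-Z][a-z]+"),
]


def token_type(word):
    """Return the name of the first lexer rule that matches the whole word."""
    for name, pattern in LEXER_RULES:
        if re.fullmatch(pattern, word):
            return name
    return None


def is_any_word(word):
    """Parser-level alternative: anyWord : SMALL_WORD | CAPITAL_WORD."""
    return token_type(word) in {"SMALL_WORD", "CAPITAL_WORD"}


print(token_type("Hello"))   # CAPITAL_WORD
print(is_any_word("Hello"))  # True
```

Because the token keeps its specific type, a parser rule that asks for CAPITAL_WORD and one that asks for any word can both accept "Hello".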

ANTLR: Lexer Throwing NoViableAltException

After a break of a few weeks, it's time to fight with ANTLR again...
Anyhow, I have the following Lexer tokens defined:
fragment EQ: '=';
fragment NE: '<>';
BOOLEAN_FIELD
: ('ISTRAINED'|'ISCITIZEN')
;
BOOLEAN_CONSTANT
: ('TRUE'|'FALSE'|'Y'|'N')
;
BOOLEAN_LOGICAL
: BOOLEAN_FIELD (EQ|NE) (BOOLEAN_FIELD|BOOLEAN_CONSTANT)
;
Unfortunately, the BOOLEAN_LOGICAL token is throwing NoViableAltException on simple terms such as "ISTRAINED = ISTRAINED".
I know some of the responses are going to be "This should be in the parser". It WAS previously in the parser, however, I'm trying to offload some simple items into the lexer since I just need a "Yes/No, is this text block valid?"
Any help is appreciated.
BOOLEAN_LOGICAL should not be a lexer rule. A lexer rule must (or should) be a single token. As a lexer rule, there cannot be any spaces between BOOLEAN_FIELD and (EQ|NE) (you might have skipped spaces during lexing, but that will only cause spaces to be skipped from inside parser rules!).
Do this instead:
boolean_logical
: BOOLEAN_FIELD (EQ|NE) (BOOLEAN_FIELD|BOOLEAN_CONSTANT)
;
which would also mean that EQ and NE cannot be fragment rules anymore:
EQ : '=';
NE : '<>';
This does look like it should be a parser rule. However, if you want to keep it as a lexer rule, you need to allow whitespace.
BOOLEAN_LOGICAL
: BOOLEAN_FIELD WS+ (EQ|NE) WS+ (BOOLEAN_FIELD|BOOLEAN_CONSTANT)
;
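The effect of the missing whitespace can be checked with a small Python sketch that models the two versions of BOOLEAN_LOGICAL as regular expressions. This is only an approximation of the lexer rules, and the WS pattern is an assumption, since the original WS rule is not shown.

```python
import re

FIELD = r"(?:ISTRAINED|ISCITIZEN)"
CONSTANT = r"(?:TRUE|FALSE|Y|N)"
OP = r"(?:=|<>)"
WS = r"[ \t\r\n]"  # assumed definition of WS; not shown in the question

# Original rule: no whitespace allowed between the parts.
NO_WS = re.compile(f"{FIELD}{OP}(?:{FIELD}|{CONSTANT})")
# Second suggestion: WS+ on both sides of the operator.
WITH_WS = re.compile(f"{FIELD}{WS}+{OP}{WS}+(?:{FIELD}|{CONSTANT})")

print(NO_WS.fullmatch("ISTRAINED = ISTRAINED") is not None)    # False
print(WITH_WS.fullmatch("ISTRAINED = ISTRAINED") is not None)  # True
print(WITH_WS.fullmatch("ISTRAINED=ISTRAINED") is not None)    # False
```

Note that in this model WS+ makes the whitespace mandatory, so input without spaces around the operator stops matching; WS* would accept both forms.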