My grammar doesn't work as required in ANTLR

I have written a grammar in antlr as follows:
grammar names;
init : stat+;
stat : name NEWLINE
| name SPACE NEWLINE
|NEWLINE
| name SPACE name SPACE
;
name : ID ;
ID : [a-zA-Z]+ ;
NEWLINE:'\r'? '\n' ;
SPACE : ' ';
This grammar should accept input of the form:
name1
name1<space>name2<space>
name1<space>
I am not getting the required output: the generated tree currently shows only the first value. I am new to ANTLR, and any help would be appreciated.

Your stat rule should be more like this:
stat: name (SPACE name)* SPACE? NEWLINE;
assuming from your small example that each line represents a complete entry matched by the stat rule.
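As a rough illustration of what that rule accepts, here is a minimal Python sketch that rewrites it as a regular expression. The names STAT and is_stat are made up for this example, and the sample lines use letters only because ID is defined as [a-zA-Z]+. Note that this version of the rule does not accept a blank line.

```python
import re

# Regex equivalent of: stat : name (SPACE name)* SPACE? NEWLINE ;
# with ID : [a-zA-Z]+, SPACE : ' ', NEWLINE : '\r'? '\n'.
STAT = re.compile(r"[a-zA-Z]+( [a-zA-Z]+)* ?\r?\n")


def is_stat(line: str) -> bool:
    """Return True if the whole line is one complete stat entry."""
    return STAT.fullmatch(line) is not None


for sample in ["alice\n", "alice bob \n", "alice \n", "alice bob"]:
    # The last sample has no NEWLINE, so it is not a complete entry.
    print(repr(sample), is_stat(sample))
```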

Related

ANTLR: "for" keyword used for loops conflicts with "for" used in messages

I have the following grammar:
myg : line+ EOF ;
line : ( for_loop | command params ) NEWLINE;
for_loop : FOR WORD INT DO NEWLINE stmt_body;
stmt_body: line+ END;
params : ( param | WHITESPACE)*;
param : WORD | INT;
command : WORD;
fragment LOWERCASE : [a-z] ;
fragment UPPERCASE : [A-Z] ;
fragment DIGIT : [0-9] ;
WORD : (LOWERCASE | UPPERCASE | DIGIT | [_."'/\\-])+ (DIGIT)* ;
INT : DIGIT+ ;
WHITESPACE : (' ' | '\t')+ -> skip;
NEWLINE : ('\r'? '\n' | '\r')+ -> skip;
FOR: 'for';
DO: 'do';
END: 'end';
My problem is that the following two lines are both valid in this language:
message please wait for 90 seconds
This would be a valid command printing a message with the word "for".
for n 2 do
This would be the beginning of a for loop.
The problem is that the current lexer doesn't match the for loop, because 'for' is matched by the WORD rule, which appears first.
I could solve that by putting the FOR rule before the WORD rule, but then the 'for' in the message would be matched by the FOR rule.
This is the typical keywords versus identifiers problem, and I thought there were quite a number of questions about it here on Stack Overflow. To my surprise, I can only find an old answer of mine for ANTLR3.
Even though the principle mentioned there remains the same, with ANTLR4 you can no longer change the returned token type in a parser rule.
There are 2 steps required to make your scenario work:
1. Define the keywords before the WORD rule. This way they get their own token types, which you need for the grammar parts that require specific keywords.
2. Add the keywords selectively to the rules that parse names, wherever you want to allow those keywords too.
For the second step modify your rules:
param: WORD | INT | commandKeyword;
command: WORD | commandKeyword;
commandKeyword: FOR | DO | END; // Keywords allowed as names in commands.
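To see why both steps are needed, here is a small Python sketch of the idea. It is a simplified stand-in for the grammar, not ANTLR itself: the tokenize and classify helpers are hypothetical, WORD is reduced to letters and underscores, and the lexer is modelled as "longest match wins, earlier rule wins on a tie".

```python
import re

# Step 1: keywords are listed before WORD, so they win when match lengths tie.
# Simplified assumption: WORD is letters/underscores only, INT is digits.
RULES = [
    ("FOR", re.compile(r"for")),
    ("DO", re.compile(r"do")),
    ("END", re.compile(r"end")),
    ("INT", re.compile(r"[0-9]+")),
    ("WORD", re.compile(r"[A-Za-z_]+")),
]


def tokenize(text):
    """Longest match wins; on equal length the earlier rule wins."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():  # whitespace is skipped
            pos += 1
            continue
        best_type, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m and len(m.group()) > best_len:  # strict '>' keeps the earlier rule on ties
                best_type, best_len = name, len(m.group())
        if best_type is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_type, text[pos:pos + best_len]))
        pos += best_len
    return tokens


# Step 2: commandKeyword : FOR | DO | END, allowed wherever a name is allowed.
NAME_LIKE = {"WORD", "FOR", "DO", "END"}


def classify(line):
    types = [t for t, _ in tokenize(line)]
    if types == ["FOR", "WORD", "INT", "DO"]:  # for_loop header, tried first
        return "for_loop"
    if types and types[0] in NAME_LIKE and all(t in NAME_LIKE | {"INT"} for t in types[1:]):
        return "command"
    return "error"


print(classify("for n 2 do"))
print(classify("message please wait for 90 seconds"))
```

In this model the 'for' inside the message is still lexed as FOR, but the command and param checks accept it as a name, which mirrors what the commandKeyword rule does in the grammar.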

Not able to parse continuous string using antlr (without spaces)

I have to parse the following query using antlr
sys_nameLIKEvalue
Here sys_name is a variable which has lower case and underscores.
LIKE is a fixed key word.
value is a variable which can contain lowercase letters, uppercase letters, and digits.
Below is the grammar I am using:
expression : parameter 'LIKE' values EOF;
parameter : (ID);
ID : (LOWERCASE) (LOWERCASE | UNDERSCORE)* ;
values : (VALUE);
VALUE : (LOWERCASE | NUMBER | UPPERCASE)+ ;
LOWERCASE : 'a'..'z' ;
UPPERCASE : 'A'..'Z' ;
NUMBER : '0'..'9' ;
UNDERSCORE : '_' ;
Test Case 1
Input : sys_nameLIKEabc
error thrown : line 1:8 missing 'LIKE' at 'LIKEabc'
Test Case 2
Input : sysnameLIKEabc
error thrown : line 1:0 mismatched input 'sysnameLIKEabc' expecting ID
A literal token inside your parser rule will be translated into a plain lexer rule. So, your grammar really looks like this:
expression : parameter LIKE values EOF;
parameter : ID;
values : VALUE;
LIKE : 'LIKE';
ID : LOWERCASE (LOWERCASE | UNDERSCORE)* ;
VALUE : (LOWERCASE | NUMBER | UPPERCASE)+ ;
// Fragment rules will never become tokens of their own: good practice!
fragment LOWERCASE : 'a'..'z' ;
fragment UPPERCASE : 'A'..'Z' ;
fragment NUMBER : '0'..'9' ;
fragment UNDERSCORE : '_' ;
Lexer rules are greedy, and if two or more lexer rules match the same number of characters, the one defined first "wins". Your input is therefore tokenized as follows:
Input: sys_nameLIKEabc, 2 tokens:
sys_name: ID
LIKEabc: VALUE
Input: sysnameLIKEabc, 1 token:
sysnameLIKEabc: VALUE
Consequently, the token LIKE is never created for your test input, so your parser rules can never match. It also seems a bit odd to parse input without any delimiters, like spaces.
To fix your issue, you will either have to introduce delimiters or disallow uppercase letters in VALUE.
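The tokenization above can be reproduced with a small Python model of the lexer. This is only a sketch under the assumption that the lexer behaves as "longest match wins, first rule wins on a tie"; the tokenize helper is made up for this example, and the rules are regex translations of LIKE, ID and VALUE.

```python
import re

# Rules in grammar order: LIKE, ID, VALUE.
RULES = [
    ("LIKE", re.compile(r"LIKE")),
    ("ID", re.compile(r"[a-z][a-z_]*")),
    ("VALUE", re.compile(r"[a-zA-Z0-9]+")),
]


def tokenize(text):
    """Longest match wins; on equal length the earlier rule wins."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos] == " ":  # only used by the delimited example below
            pos += 1
            continue
        best_type, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m and len(m.group()) > best_len:
                best_type, best_len = name, len(m.group())
        if best_type is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_type, text[pos:pos + best_len]))
        pos += best_len
    return tokens


print(tokenize("sys_nameLIKEabc"))   # [('ID', 'sys_name'), ('VALUE', 'LIKEabc')]
print(tokenize("sysnameLIKEabc"))    # [('VALUE', 'sysnameLIKEabc')]
# With delimiters, LIKE and VALUE both match 4 characters and LIKE is listed first.
# The value here contains an uppercase letter and a digit so that it cannot also be an ID.
print(tokenize("sys_name LIKE Abc1"))
```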

Simple ANTLR grammar behaves different than expected [duplicate]

I have a Hello.g4 grammar file with a grammar definition:
definition : wordsWithPunctuation ;
words : (WORD)+ ;
wordsWithPunctuation : word ( word | punctuation word | word punctuation | '(' wordsWithPunctuation ')' | '"' wordsWithPunctuation '"' )* ;
NUMBER : [0-9]+ ;
word : WORD ;
WORD : [A-Za-z-]+ ;
punctuation : PUNCTUATION ;
PUNCTUATION : (','|'!'|'?'|'\''|':'|'.') ;
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
Now, if I am trying to build a parse tree from the following input:
a b c d of at of abc bcd of
a b c d at abc, bcd
a b c d of at of abc, bcd of
it returns errors:
Hello::definition:1:31: extraneous input 'of' expecting {<EOF>, '(', '"', WORD, PUNCTUATION}
though the:
a b c d at: abc bcd!
works correctly.
What is wrong with the grammar, the input, or the interpreter?
If I modify the wordsWithPunctuation rule by adding (... | 'of' | ',' word | ...), then it matches the input completely, but that looks suspicious to me. How is the word of different from the word a or abc? And why is , different from the other punctuation characters (i.e., why does it match : or !, but not ,)?
Update1:
I am working with ANTLR4 plugin for Eclipse, so the project build happens with the following output:
ANTLR Tool v4.2.2 (/var/folders/.../antlr-4.2.2-complete.jar)
Hello.g4 -o /Users/.../eclipse_workspace/antlr_test_project/target/generated-sources/antlr4 -listener -no-visitor -encoding UTF-8
Update2:
The grammar presented above is just a part of:
grammar Hello;
text : (entry)+ ;
entry : blub 'abrr' '-' ('1')? '.' ('(' NUMBER ')')? sims '-' '(' definitionAndExamples ')' 'Hello' 'all' 'the' 'people' 'of' 'the' 'world';
blub : WORD ;
sims : sim (',' sim)* ;
sim : words ;
definitionAndExamples : definitions (';' examples)? ;
definitions : definition (';' definition )* ;
definition : wordsWithPunctuation ;
examples : example (';' example )* ;
example : '"' wordsWithPunctuation '"' ;
words : (WORD)+ ;
wordsWithPunctuation : word ( word | punctuation word | word punctuation | '(' wordsWithPunctuation ')' | '"' wordsWithPunctuation '"' )* ;
NUMBER : [0-9]+ ;
word : WORD ;
WORD : [A-Za-z-]+ ;
punctuation : PUNCTUATION ;
PUNCTUATION : (','|'!'|'?'|'\''|':'|'.') ;
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
It now looks to me as if the words from the entry rule are somehow breaking the other rules within the entry rule. But why? Is this some kind of anti-pattern in the grammar?
By including 'of' in a parser rule, ANTLR is creating an implicit anonymous token to represent that input. The word of will always have that special token type, so it will never have the type WORD. The only place it may appear in your parse tree is at a location where 'of' appears in a parser rule.
You can prevent ANTLR from creating these anonymous token types by separating your grammar into a separate lexer grammar HelloLexer in HelloLexer.g4 and parser grammar HelloParser in HelloParser.g4. I highly recommend you always use this form for the following reasons:
Lexer modes only work if you do this.
Implicitly-defined tokens are one of the most common sources of bugs in a grammar, and separating the grammar prevents it from ever happening.
Once you have the grammar separated, you can update your word parser rule to allow the special token of to be treated as a word.
word
: WORD
| 'of'
| ... other keywords which are also "words"
;
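The effect of the implicit tokens can be sketched with a small Python model. It assumes that literals used in parser rules (here 'of' and ',' from the full grammar) behave like lexer rules listed before WORD and PUNCTUATION, and that the lexer picks the longest match with the first rule winning ties. The tokenize helper and the token names are made up for this example.

```python
import re

# Implicit tokens from parser-rule literals come first, then the named lexer rules.
RULES = [
    ("'of'", re.compile(r"of")),
    ("','", re.compile(r",")),
    ("WORD", re.compile(r"[A-Za-z-]+")),
    ("PUNCTUATION", re.compile(r"[,!?':.]")),
]


def tokenize(text):
    """Longest match wins; on equal length the earlier rule wins."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos] in " \t\r\n":  # WS -> skip
            pos += 1
            continue
        best_type, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m and len(m.group()) > best_len:
                best_type, best_len = name, len(m.group())
        if best_type is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_type, text[pos:pos + best_len]))
        pos += best_len
    return tokens


for tok in tokenize("a of abc, often!"):
    print(tok)
```

In this model 'of' and ',' never come out as WORD or PUNCTUATION, while "often" and "!" still do, which is consistent with the errors described in the question.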

ANTLR grammar to detect ambiguous tokens

I'm creating a simple grammar in ANTLR to match some kind of commands. I'm stuck with tokens which use special characters.
Those commands would match sentences like...
connect "HAL" computer 4
connect "HAL256" computer 8
connect "HAL2⁸" computer 16
connect "HAL 9000" computer 32
connect "HAL \x0A25 | 32" computer 64
... to produce something like:
It's clear that my problem is in the ID token, but I don't know how to solve it. Here is my current grammar:
grammar foo;
ID : '"' ('\u0000'..'\uFFFF')+ '"' ;
NUMBER : ('0'..'9')* ;
SENTENCE : 'connect ' ID ' computer' NUMBER ;
How could I do it?
There are a couple of issues with your grammar:
NUMBER matches an empty string: lexer rules must always match at least 1 character
SENTENCE should be a parser rule (see: Practical difference between parser rules and lexer rules in ANTLR?)
('\u0000'..'\uFFFF')+ also matches a '"', which you most probably don't want
Try something like this instead:
sentence : K_CONNECT ID K_COMPUTER NUMBER;
K_CONNECT : 'connect';
K_COMPUTER : 'computer';
ID : '"' (~'"')+ '"';
NUMBER : ('0'..'9')+;
SPACE : (' ' | '\t' | '\r' | '\n')+ {skip();};
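The first and third issues can be illustrated with Python regular expressions standing in for the lexer rules. This is only an analogy, assuming the rules behave like greedy regexes; the variable names are made up for this example.

```python
import re

text = 'connect "HAL" computer 4\nconnect "HAL 9000" computer 32\n'

# Original ID: '"' ('\u0000'..'\uFFFF')+ '"'  -> any character, including '"' and newlines.
greedy_id = re.compile(r'"[\s\S]+"')
# Fixed ID: '"' (~'"')+ '"'  -> any character except a double quote.
fixed_id = re.compile(r'"[^"]+"')

# The greedy version runs from the first quote to the very last one in the input.
print(greedy_id.findall(text))  # ['"HAL" computer 4\nconnect "HAL 9000"']
print(fixed_id.findall(text))   # ['"HAL"', '"HAL 9000"']

# Original NUMBER: ('0'..'9')*  can succeed without consuming a single character.
empty_number = re.compile(r"[0-9]*")
fixed_number = re.compile(r"[0-9]+")
print(repr(empty_number.match("computer").group()))  # ''
print(fixed_number.match("computer"))                # None
```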

ANTLR AST Grammar Issue Mismatched Token Exception

My real grammar is far more complex, but I was able to strip the problem down to this grammar:
grammar test2;
options {language=CSharp3;}
@parser::namespace { Test.Parser }
@lexer::namespace { Test.Parser }
start : 'VERSION' INT INT project;
project : START 'project' NAME TEXT END 'project';
START: '/begin';
END: '/end';
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
INT : '0'..'9'+;
NAME: ('a'..'z' | 'A'..'Z')+;
TEXT : '"' ( '\\' (.) |'"''"' |~( '\\' | '"' | '\n' | '\r' ) )* '"';
STARTA
: '/begin hello';
And I want to parse this (for example):
VERSION 1 1
/begin project
testproject "description goes here"
/end
project
Now it will not work like this (Mismatched token exception). If I remove the last Token STARTA, it works. But why? I don't get it.
Help is really appreciated.
Thanks.
When the lexer sees the input "/begin " (including the space!), it is committed to the rule STARTA. When it can't match said rule, because the next char in the input is a "p" (from "project") and not a "h" (from "hello"), it will try to match another rule that can match "/begin " (including the space!). But there is no such rule, producing the error:
mismatched character 'p' expecting 'h'
and the lexer will not give up the space and match the START rule.
Remember that last part: once the lexer has matched something, it will not give up on it. It might try other rules that match the same input, but it will not backtrack to match a rule that matches fewer characters!
This is simply how the lexer works in ANTLR 3.x, no way around it.
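The behaviour described above can be mimicked with a deliberately simplified Python model. It is an assumption made for illustration, not ANTLR 3's actual prediction algorithm: the toy lexer commits to the literal that agrees with the input for the most characters and reports an error instead of falling back to a shorter literal. The names LITERALS, common_prefix_len and lex_literal are hypothetical.

```python
# Toy model only: commit to the literal sharing the longest prefix with the input,
# then fail if the rest of that literal does not match. No backtracking.
LITERALS = {"START": "/begin", "STARTA": "/begin hello"}


def common_prefix_len(a, b):
    """Number of leading characters that a and b have in common."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def lex_literal(text, literals):
    # Commit to the rule whose literal agrees with the input for the most characters.
    name, literal = max(literals.items(), key=lambda kv: common_prefix_len(kv[1], text))
    n = common_prefix_len(literal, text)
    if n < len(literal):
        found = text[n] if n < len(text) else "<EOF>"
        raise ValueError(f"mismatched character {found!r} expecting {literal[n]!r}")
    return name


try:
    lex_literal("/begin project", LITERALS)
except ValueError as err:
    print(err)  # mismatched character 'p' expecting 'h'

# Without STARTA there is nothing to commit to beyond '/begin', so START matches.
print(lex_literal("/begin project", {"START": "/begin"}))  # START
```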