ANTLR4 grammar: help required for grammar creation

I would like to create a grammar for the following input:
LogFile = c:\folder\logfile.txt
or
LogFile = \192.168.120.120\folder\logfile.txt
The file path should contain a file extension.
See below my grammar:
grammar CustomLanguage;
autoTask : logFileCommand;
logFileCommand : 'LogFile' ASSIGNMENT PARAMETER_PATH_FILENAME;
ASSIGNMENT : '=';
PARAMETER_PATH_FILENAME : ??????? FILE_EXTENSION ;
FILE_EXTENSION : '.'('a'..'z'|'A'..'Z')('a'..'z'|'A'..'Z')('a'..'z'|'A'..'Z');
What is the correct rule for PARAMETER_PATH_FILENAME?
PARAMETER_PATH_FILENAME : ??????? FILE_EXTENSION ;
Thanks

You can start with an existing URL grammar such as https://github.com/antlr/grammars-v4/blob/master/url/url.g4 and extend the parts you need, namely adding backslash support and a file-path variant that starts with a drive letter instead of a protocol.
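Before writing the lexer rule, it can help to pin down the exact shape of an accepted path. The following is a minimal sketch in Python rather than ANTLR. The names `DRIVE`, `HOST`, `SEGMENT`, `EXTENSION`, `PATH_RE` and `is_log_path` are hypothetical, and the set of characters allowed in host, folder and file names is an assumption. Each piece could later be turned into a fragment rule of the grammar.

```python
import re

# Assumed building blocks; adjust the character classes to your real inputs.
DRIVE = r"[A-Za-z]:"               # e.g. c:
HOST = r"\\{1,2}[A-Za-z0-9.]+"     # e.g. \192.168.120.120 (one or two leading backslashes)
SEGMENT = r"\\[A-Za-z0-9_-]+"      # e.g. \folder or \logfile (no dots inside a name)
EXTENSION = r"\.[A-Za-z]{3}"       # exactly three letters, like FILE_EXTENSION

# (drive | host), then one or more segments, then the mandatory extension.
PATH_RE = re.compile(f"(?:{DRIVE}|{HOST})(?:{SEGMENT})+{EXTENSION}")


def is_log_path(text: str) -> bool:
    """Return True if the whole string has the assumed path shape."""
    return PATH_RE.fullmatch(text) is not None
```

Under these assumptions both sample paths from the question are accepted, while a path without an extension is rejected.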

Related

antlr4 two lexer rule match the same string

I'm currently using antlr4 to build a parser, but I ran into a problem that I couldn't figure out despite my best efforts. Can you help me explain and solve it?
# grammer file : PluginDoc.g4:
grammer PluginDoc
pluginDef : pluginName | pluginDesc;
pluginName : PluginName IDENTIFIER;
pluginDesc : PluginDesc TEXT;
PluginName '#pluginName'
PluginDesc '#pluginDesc'
IDENTIFIER : [a-zA-Z_]+;
TEXT : ~( ' ' | '\n' | '\t' )+;
input content is:
#pluginName kafka
#pluginDesc abc
If I put IDENTIFIER before TEXT, I will get "mismatched input 'abc' expecting TEXT"
If I put TEXT before IDENTIFIER, I will get "mismatched input 'kafka' expecting IDENTIFIER"
Looks like both IDENTIFIER and TEXT are matched, how can I only match IDENTIFIER in pluginName and only match TEXT in pluginDesc ?
First of all, you have several errors in the grammar that you posted:
The header of the file should specify grammar, not grammer, and must end with a semicolon. Your lexer rules PluginName and PluginDesc are missing the colon after the rule name and the semicolon that terminates the rule. ANTLR itself only requires parser rule names to start with a lower-case letter and lexer rule names to start with an upper-case letter, but by convention lexer rules are written in all upper-case.
grammar PluginDoc;
pluginDef : pluginName | pluginDesc;
pluginName : PLUGIN_NAME IDENTIFIER;
pluginDesc : PLUGIN_DESC TEXT;
PLUGIN_NAME : '#pluginName';
PLUGIN_DESC : '#pluginDesc';
IDENTIFIER : [a-zA-Z_]+;
TEXT : ~( ' ' | '\n' | '\t' )+;
Some of the problems that I encountered while testing your grammar were due to unhandled whitespace. You should add a lexer rule that skips whitespace, placed at the end of the grammar file after all of the other lexer rules.
WS: [ \n\t\r]+ -> skip;
Next, your TEXT and IDENTIFIER rules clash with each other. When the lexer tokenizes the character stream, kafka and abc can each be either an IDENTIFIER or a TEXT token. Since both rules match the same number of characters, the lexer resolves the tie in favour of whichever rule comes first in your grammar. This causes the error that you encounter: whatever you define as the second rule cannot be matched in the parser, because the lexer never emits that token type for these inputs.
As suggested by Lucas, you should probably match both of these as TEXT and do the subsequent checking for validity of the input in your Listener/Visitor.
grammar PluginDoc;
pluginDef : (pluginName | pluginDesc)* EOF;
pluginName : PLUGIN_NAME TEXT;
pluginDesc : PLUGIN_DESC TEXT;
PLUGIN_NAME: '#pluginName';
PLUGIN_DESC: '#pluginDesc';
TEXT : ~[ \r\n\t]+;
WS: [ \r\n\t]+ -> skip;
I also changed the pluginDef Parser rule to
pluginDef : (pluginName | pluginDesc)* EOF;
since it was my impression that you want to input both #pluginName X and #pluginDesc Y at once and identify them. If this is not the case, feel free to change back to what you had before.
The resulting AST produced by the modified grammar above on your sample input:
You can also run this with a text file as an input.
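The validity check suggested above (lex everything as TEXT, then check it afterwards) could look roughly like this. It is a sketch in plain Python with no ANTLR runtime involved. The names `IDENTIFIER_RE` and `check_plugin_name` are hypothetical; the pattern is simply the one from the original IDENTIFIER rule.

```python
import re

# Same character set as the original IDENTIFIER lexer rule.
IDENTIFIER_RE = re.compile(r"[a-zA-Z_]+")


def check_plugin_name(text: str) -> str:
    """Return text unchanged if it is a valid identifier, else raise ValueError."""
    if IDENTIFIER_RE.fullmatch(text) is None:
        raise ValueError(f"invalid plugin name: {text!r}")
    return text
```

In a generated listener or visitor, such a function would presumably be called with the text of the TEXT token that follows #pluginName, while the text after #pluginDesc is accepted as is.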

ANTLR4 disambiguation of terminal tokens

This is my grammar in ANTLR4:
grammar Hello;
r : WORD ID ;
ID : [a-z]+ ;
WORD : [a-z]+ ;
WS : [ \t\r\n]+ -> skip ;
When I type in something like:
hello buddy
I got the following error message:
line 1 missing WORD at 'hello'
But, if I change the grammar in
grammar Hello;
r : WORD ID ;
ID : [a-z]+ ;
WORD : [1-9]+ ;
WS : [ \t\r\n]+ -> skip ;
where now WORD is a number, everything is ok.
I strongly suspect that since the first grammar has two terminals with the same regex, the parser doesn't know which terminal a given word corresponds to.
Am I wrong in thinking so? If not, how would you solve this issue while keeping more than one terminal with the same regex?
You cannot have two terminals that match the same pattern.
If your grammar actually needs to match [a-z]+ twice, then use a production like
r : WORD WORD ;
and the discrimination will be done at the parser / tree traversal level.
If either WORD or ID can be restricted to a fixed list, you could declare all the possible words as terminals then use them to define e.g. what a WORD can be.
where now WORD is a number, everything is ok.
Not really :
$ alias
alias grun='java org.antlr.v4.gui.TestRig'
$ grun Hello r -tokens data.txt
[#0,0:4='hello',<ID>,1:0]
[#1,6:10='buddy',<ID>,1:6]
[#2,12:11='<EOF>',<EOF>,2:0]
line 1:0 missing WORD at 'hello'
When the lexer can match some input with two rules, there is an ambiguity, and it chooses the first rule. With a hello buddy input, the lexer produces two ID tokens:
with the first grammar, because it's ambiguous and ID comes first
with the second grammar, the input can only be matched by ID WS ID
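The behaviour described above can be reproduced with a toy model of the lexer. This is only a sketch in Python, not how ANTLR is implemented. The function `tokenize` and the constants `FIRST_GRAMMAR` and `SECOND_GRAMMAR` are hypothetical names, and the regular expressions are transcriptions of the lexer rules from the question.

```python
import re


def tokenize(rules, text, skip=("WS",)):
    """Toy lexer: rules is an ordered list of (name, regex) pairs.

    At each position the longest match wins; on a tie the rule
    listed first wins. Tokens whose name is in skip are dropped.
    """
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' means a later rule never replaces an equally long match.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name not in skip:
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


FIRST_GRAMMAR = [("ID", r"[a-z]+"), ("WORD", r"[a-z]+"), ("WS", r"[ \t\r\n]+")]
SECOND_GRAMMAR = [("ID", r"[a-z]+"), ("WORD", r"[1-9]+"), ("WS", r"[ \t\r\n]+")]
```

With either rule list, `tokenize(..., "hello buddy")` yields two ID tokens and no WORD token, which is why the parser rule `r : WORD ID ;` reports a missing WORD in both cases.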
You can disambiguate with a predicate in the lexer rule like so :
grammar Question;
/* Ambiguous input */
file
: HELLO ID
;
HELLO
: [a-z]+ {getText().equals("hello")}? ;
ID : [a-z]+ ;
WS : [ \t\r\n]+ -> skip ;
Execution :
$ grun Question file -tokens data.txt
[#0,0:4='hello',<HELLO>,1:0]
[#1,6:10='buddy',<ID>,1:6]
[#2,12:11='<EOF>',<EOF>,2:0]
More on semantic predicates in The Definitive ANTLR Reference.

Problems with ANTLR4 grammar

I have a very simple grammar file, which looks like this:
grammar Wort;
// Parser Rules:
word
: ANY_WORD EOF
;
// Lexer Rules:
ANY_WORD
: SMALL_WORD | CAPITAL_WORD
;
SMALL_WORD
: SMALL_LETTER (SMALL_LETTER)+
;
CAPITAL_WORD
: CAPITAL_LETTER (SMALL_LETTER)+
;
fragment SMALL_LETTER
: ('a'..'z')
;
fragment CAPITAL_LETTER
: ('A'..'Z')
;
If I try to parse the input "Hello", everything is OK, but if I modify my grammar file like this:
...
// Parser Rules:
word
: CAPITAL_WORD EOF
;
...
the input "Hello" is no longer recognized as a valid input. Can anybody explain, what is going wrong?
Thanx, Lars
The issue here has to do with precedence in the lexer grammar. Because ANY_WORD is listed before CAPITAL_WORD, it is given higher precedence. The lexer will identify Hello as a CAPITAL_WORD, but since an ANY_WORD can be just a CAPITAL_WORD, and the lexer is set up to prefer ANY_WORD, it will output the token ANY_WORD. The parser acts on the output of the lexer, and since ANY_WORD EOF doesn't match any of its rules, the parse fails.
You can make the lexer behave differently by moving CAPITAL_WORD above ANY_WORD in the grammar, but that will create the opposite problem -- capitalized words will never lex as ANY_WORDs. The best thing to do is probably what Mephy suggested -- make ANY_WORD a parser rule.

ANTLR rule works on its own, but fails when included in another rule

I am trying to write an ANTLR grammar for a reparsed and retagged kconfig file (retagged to solve a couple of ambiguities). A simplified version of the grammar is:
grammar FailureExample;
options {
language = Java;
}
@lexer::header {
package parse.failure.example;
}
reload
: configStatement*
EOF
;
configStatement
: CONFIG IDENT
configOptions
;
configOptions
: (type
| defConfigStatement
| dependsOnStatement
| helpStatement
| rangeStatement
| defaultStatement
| selectStatement
| visibleIfStatement
| prompt
)*
;
type : FAKE1;
dependsOnStatement: FAKE2;
helpStatement: FAKE3;
rangeStatement: FAKE4;
defaultStatement: FAKE5;
selectStatement:FAKE6;
visibleIfStatement:FAKE7;
prompt:FAKE8;
defConfigStatement
: defConfigType expression
;
defConfigType
: DEF_BOOL
;
//expression parsing
primative
: IDENT
| L_PAREN expression R_PAREN
;
negationExpression
: NOT* primative
;
orExpression
: negationExpression (OR negationExpression)*
;
andExpression
: orExpression (AND orExpression)*
;
unequalExpression
: andExpression (NOT_EQUAL andExpression)?
;
equalExpression
: unequalExpression (EQUAL unequalExpression)?
;
expression
: equalExpression (BECOMES equalExpression)?
;
DEF_BOOL: 'def_bool';
CONFIG : 'config';
COMMENT : '#' .* ('\n'|'\r') {$channel = HIDDEN;};
AND : '&&';
OR : '||';
NOT : '!';
L_PAREN : '(';
R_PAREN : ')';
BECOMES : '::=';
EQUAL : '=';
NOT_EQUAL : '!=';
FAKE1 : 'fake1';
FAKE2: 'fake2';
FAKE3: 'fake3';
FAKE4: 'fake4';
FAKE5: 'fake5';
FAKE6: 'fake6';
FAKE7: 'fake7';
FAKE8: 'fake8';
IDENT : (LETTER | DIGIT | '_')*;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
fragment LETTER : ('a'..'z' | 'A'..'Z') ;
fragment DIGIT : '0'..'9';
With input:
config HAVE_DEBUG_RAM_SETUP
def_bool n
I can set antlrworks to parse just the second line (commenting out the first) and I get the proper defConfigStatement token emitted with the proper expression following. However, if I exercise either the configOptions rule or the configStatement rule (with the first line uncommented), my configOptions results in an empty set and a NoViableAlt exception is thrown.
What would cause this behavior? I know that the defConfigStatement rule is accurate and can parse correctly, but as soon as it's added as a potential option in another rule, it fails. I know I don't have conflicting rules, and I've made DEF_BOOL and DEF_TRISTATE rules the top in my list of lexer rules, so they have priority over the other lexer rules.
/Added since edit/
To further complicate the issue, if I move the defConfigStatement choice in the configOptions rule, it will work, but other rules will fail.
Edit: Using full, simplified grammar.
In short, why does the rule work on its own, but fail when it's in configOptions (especially since configOptions is in (A | B | C)* form)?
When I parse the input:
config HAVE_DEBUG_RAM_SETUP
def_bool n
with the parser generated from your grammar, I get the following parse tree:
So, I see no issues here. My guess is that you're using ANTLRWorks' interpreter: don't. It's buggy. Always test your grammar with a class of your own, or use ANTLRWorks' debugger (press CTRL+D to launch it). The debugger works like a charm (without the package declaration, btw). The image I posted above is an export from the debugger.
EDIT
If the debugger doesn't work, try (temporarily) removing the package declaration (note that you're only declaring a package for the lexer, not the parser, but that might be caused by posting a minimal grammar). You could also try changing the port number the debugger should connect to. It could be that the port is already in use (see: File -> Preferences -> Debugger-tab).

How to consume text until newline in ANTLR?

How do you do something like this with ANTLR?
Example input:
title: hello world
Grammar:
header : IDENT ':' REST_OF_LINE ;
IDENT : 'a'..'z'+ ;
REST_OF_LINE : ~'\n'* '\n' ;
It fails, with line 1:0 mismatched input 'title: hello world\n' expecting IDENT
(I know ANTLR is overkill for parsing MIME-like headers, but this is just at the top of a more complex file.)
It fails, with line 1:0 mismatched input 'title: hello world\n' expecting IDENT
You must understand that the lexer operates independently from the parser. No matter what the parser would "like" to match at a certain time, the lexer simply creates tokens following some strict rules:
1. try to match tokens from top to bottom in the lexer rules (rules defined first are tried first);
2. match as much text as possible. In case 2 rules match the same amount of text, the rule defined first will be matched.
Because of rule 2, your REST_OF_LINE will always "win" over the IDENT rule. The only time an IDENT token will be created is when no \n follows it in the input. That is what's going wrong with your grammar: the error message states that it expects an IDENT token, which isn't found (a REST_OF_LINE token is produced instead).
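The longest-match rule can be checked by hand for the sample input. The snippet below is a rough Python sketch and not ANTLR itself. The names `RULES` and `first_token` are hypothetical, and the two regular expressions are transcriptions of the IDENT and REST_OF_LINE rules from the question.

```python
import re

# Lexer rules in the order they appear in the grammar.
RULES = [("IDENT", r"[a-z]+"), ("REST_OF_LINE", r"[^\n]*\n")]


def first_token(text):
    """Return (rule name, matched text) for the token starting at offset 0.

    Raises ValueError if no rule matches at all.
    """
    candidates = [(name, re.match(pattern, text)) for name, pattern in RULES]
    matches = [(name, m.group()) for name, m in candidates if m]
    # max() returns the first maximal item, so earlier rules win ties.
    return max(matches, key=lambda item: len(item[1]))
```

For `"title: hello world\n"` IDENT can only match `title`, whereas REST_OF_LINE matches the whole line, so the whole line becomes one REST_OF_LINE token. Without a trailing newline REST_OF_LINE cannot match, and IDENT is chosen.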
I know ANTLR is overkill for parsing MIME-like headers, but this is just at the top of a more complex file.
You can't just define tokens (lexer rules) you want to apply to the header of a file. These tokens will also apply to the rest of the more complex file. Perhaps you should pre-process the header separately from the rest of the file?
ANTLR parsing is usually done in 2 steps.
1. construct your AST
2. define your grammar
pseudo code (been a few years since I played with antlr) - AST:
WORD : 'a'..'z'+ ;
SEPARATOR : ':';
SPACE : ' ';
pseudo code - tree parser:
header: WORD SEPARATOR WORD (SPACE WORD)+ ;
Hope that helps....