How to use similar lexers - antlr

I have the following grammar:
cmds
: cmd+
;
cmd
: include_cmd | other_cmd
;
include_cmd
: INCLUDE DOUBLE_QUOTE FILE_NAME DOUBLE_QUOTE
;
other_cmd
: CMD_NAME ARG+
;
INCLUDE
: '#include'
;
DOUBLE_QUOTE
: '"'
;
CMD_NAME
: ('a'..'z')*
;
ARG
: ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')+
;
FILE_NAME
: ('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '.')+
;
So the difference between CMD_NAME, ARG and FILE_NAME is not large, CMD_NAME must be lower case letters, ARG can have upper case letter and "_" and FILE_NAME yet can have ".".
But this has a problem, when I test the rule with - #include "abc", 'abc' is interpreted as CMD_NAME instead of FILE_NAME, I think it is because CMD_NAME is before FILE_NAME in the grammar file, this leads to parsing error.
Do I have to rely on such technique as predict to deal with this? Is there a pure EBNF solution other than relying on host programming language?
Thanks.

But this has a problem, when I test the rule with - #include "abc", 'abc' is interpreted as CMD_NAME instead of FILE_NAME, I think it is because CMD_NAME is before FILE_NAME in the grammar file, this leads to parsing error.
The set of all valid CMD_NAMEs intersects with the set of all valid FILE_NAMEs. Input abc qualifies as both. The lexer matches the input with the first rule listed (as you suspected) because it's the first one matched.
Do I have to rely on such technique as [predicate] to deal with this? Is there a pure EBNF solution other than relying on host programming language?
It depends on what you're willing accept in your grammar. Consider changing your include_cmd rule to something more conventional, like this:
include_cmd : INCLUDE STRING;
STRING
: '"' ~('"'|'\r'|'\n')* '"' {String text = getText(); setText(text.substring(1, text.length() - 1));}
;
Now input #include "abc" turns into tokens [INCLUDE : #include] [STRING : abc].
I don't think the grammar should be responsible for determining whether a file name is valid or not: a valid file name doesn't imply a valid file, and the grammar has to understand OS file naming conventions (valid characters, paths, etc) that probably have no bearing on the grammar itself. I think you'll be fine if you're willing to drop rule FILE_NAME for something like the rules the above.
Also worth noting, your CMD_NAME rule matches zero-length input. Consider changing ('a'..'z')* to ('a'..'z')+ unless a CMD_NAME really can be empty.
Keep in mind, too, that you'll have the same problem with ARG that you did with FILE_NAME. It's listed after CMD_NAME, so any input that qualifies for both rules (like abc again) will hit CMD_NAME. Consider breaking these rules up into more conventional ones like so:
other_cmd : ID (ID | NUMBER)+ SEMI; //instead of CMD_NAME ARG+
ID : ('a'..'z'|'A'..'Z'|'_')+; //instead of CMD_NAME, "id" part of ARG
NUMBER : ('0'..'9')+; //"number" part of ARG
SEMI : ';';
I added rule SEMI to mark the end of a command. Otherwise the parser won't know if input a b c d is supposed to be one command with three arguments (a(b,c,d)) or two commands with one argument each (a(b), c(d)).

Related

Parsing letter ranges with ANTLR

I have the following parser rules:
defDirective : defType whiteSpace letterSpec (whiteSpace? COMMA whiteSpace? letterSpec)*;
defType :
DEFBOOL | DEFBYTE | DEFINT | DEFLNG | DEFLNGLNG | DEFLNGPTR | DEFCUR |
DEFSNG | DEFDBL | DEFDATE |
DEFSTR | DEFOBJ | DEFVAR
;
letterSpec : universalLetterRange | letterRange | singleLetter;
singleLetter : RESTRICTED_LETTER;
universalLetterRange : upperCaseA whiteSpace? MINUS whiteSpace? upperCaseZ;
upperCaseA : {_input.Lt(1).Text.Equals("A")}? RESTRICTED_LETTER;
upperCaseZ : {_input.Lt(1).Text.Equals("Z")}? RESTRICTED_LETTER;
letterRange : firstLetter whiteSpace? MINUS whiteSpace? lastLetter;
firstLetter : RESTRICTED_LETTER;
lastLetter : RESTRICTED_LETTER;
whiteSpace : (WS | LINE_CONTINUATION)+;
with the relevant Lexer Rules:
RESTRICTED_LETTER : [a-zA-Z];
MINUS : '-';
COMMA : ',';
WS : [ \t];
LINE_CONTINUATION : [ \t]* UNDERSCORE [ \t]* '\r'? '\n';
and the DefTypes matching their camel-case spelling.
Now when I try to test this on the following inputs, it works exactly as expected:
DefInt I,J,K
DefBool A-Z
It does not work however on arbitary letter ranges (see rule letterRange). When I use the input DefByte B-F, I get the error message "line 1:8 mismatched input 'B' expecting RESTRICTED_LETTER"
I've tried expressing RESTRICTED_IDENTIFIER as a range ('A'..'Z'|'a'..'z'), but that didn't change anything about the error message.
When changing the first whiteSpace in defDirective to whiteSpace+ the error message gets a little longer (now including WS and LINE_CONTINUATION in the expected alternatives).
Also the parse-tree generated by the IntelliJ ANTLR Plugin suddenly starts recognizing the F as a singleLetter, which it previously didn't.
This behaviour seems to be consistent between targetlanguages Java and CSharp.
Previously the rule used to be a lot more relaxed, but that led to incorrect parse-trees, so I kinda want to fix this.
How can I correctly recognize letterRange here?
So ... #BartKiers had the right suspicion. The given Lexer rules weren't all the rules involved in the process.
The full grammar contains a lexer rule B_CHAR : B that's used in a special case of an unrelated grammar rule. That B_CHAR took precedence over RESTRICTED_LETTER when lexing the input stream.
The grammar rules presented are correct (and work fine), but the B_CHAR token needs to be removed from the Tokens lexed.

ANTLR with non-greedy rules

I would like to have the following grammar (part of it):
expression
:
expression 'AND' expression
| expression 'OR' expression
| StringSequence
;
StringSequence
:
StringCharacters
;
fragment
StringCharacters
: StringCharacter+
;
fragment
StringCharacter
: ~["\]
| EscapeSequence
;
It should match things like "a b c d f" (without the quotes), as well as things like "a AND b AND c".
The problem is that my rule StringSequence is greedy, and consumes the OR/AND as well. I've tried different approaches but couldn't get my grammar to work in the correct way. Is this possible with ANTLR4? Note that I don't want to put quotes around every string. Putting quotes works fine because the rule becomes non greedy, i.e.:
StringSequence
: '"' StringCharacters? '"'
;
You have no whitespace rule so StringCharacter matches everything except quote and backslash chars (+ the escape sequenc). Include a whitespace rule to make it match individual AND/OR tokens. Additionally, I recommend to define lexer rules for string literals ('AND', 'OR') instead of embedding them in the (parser) rule(s). This way you not only get speaking names for the tokens (instead of auto generated ones) but you also can better control the match order.
Yet a naive solution:
StringSequence :
(StringCharacter | NotAnd | NotOr)+
;
fragment NotAnd :
'AN' ~'D'
| 'A' ~'N'
;
fragment NotOr:
'O' ~('R')
;
fragment StringCharacter :
~('O'|'A')
;
Gets a bit more complex with Whitespace rules. Another solution would be with semantic predicates looking ahead and preventing the read of keywords.

Antlr 3 keywords and identifiers colliding

Surprise, I am building an SQL like language parser for a project.
I had it mostly working, but when I started testing it against real requests it would be handling, I realized it was behaving differently on the inside than I thought.
The main issue in the following grammar is that I define a lexer rule PCT_WITHIN for the language keyword 'pct_within'. This works fine, but if I try to match a field like 'attributes.pct_vac', I get the field having text of 'attributes.ac' and a pretty ANTLR error of:
line 1:15 mismatched character u'v' expecting 'c'
GRAMMAR
grammar Select;
options {
language=Python;
}
eval returns [value]
: field EOF
;
field returns [value]
: fieldsegments {print $field.text}
;
fieldsegments
: fieldsegment (DOT (fieldsegment))*
;
fieldsegment
: ICHAR+ (USCORE ICHAR+)*
;
WS : ('\t' | ' ' | '\r' | '\n')+ {self.skip();};
ICHAR : ('a'..'z'|'A'..'Z');
PCT_CONTAINS : 'pct_contains';
USCORE : '_';
DOT : '.';
I have been reading everything I can find on the topic. How the Lexer consumes stuff as it finds it even if it is wrong. How you can use semantic predication to remove ambiguity/how to use lookahead. But everything I read hasn't helped me fix this issue.
Honestly I don't see how it even CAN be an issue. I must be missing something super obvious because other grammars I see have Lexer rules like EXISTS but that doesn't cause the parser to take a string like 'existsOrNot' and spit out and IDENTIFIER with the text of 'rNot'.
What am I missing or doing completely wrong?
Convert your fieldsegment parser rule into a lexer rule. As it stands now it will accept input like
"abc
_ abc"
which is probably not what you want. The keyword "pct_contains" won't be matched by this rule since it is defined separately. If you want to accept the keyword in certain sequences as regular identifier you will have to include it in the accepted identifier rule.

multi alternative and rule of thumb of grammar granularity

I asked related questions here and here, now I have a new question, but really I am asking for some general rule of thinking.
Here is the grammar:
grammar post2;
post2: action_cmd+
;
action_cmd
: cmd_name action_cmd_def
;
action_cmd_def
: (cmd_chars | cmd_literal)+ Semi_colon
;
cmd_name
: 'a'..'z' ('a'..'z' | '0'..'9' | '_' )*
;
cmd_chars
: ('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '.' | ':' | '-' |'\\')
;
cmd_literal
: SINGLE_QUOTE ~(SINGLE_QUOTE | '\n' | '\r') SINGLE_QUOTE
;
SINGLE_QUOTE
: '\''
;
Semi_colon
: ';'
;
WS : ('\t' | ' ')+ {$channel = HIDDEN;};
New_Line : ('\r' | '\n')+ {$channel = HIDDEN;};
It is not a surprise I got this error -
warning(200): post2.g:16:45:
Decision can match input such as "'_'" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
The error is about rule "cmd_name".
I believe the reason is, as Bart indicated in another thread, when there is such input as "abc__", it can be parsed as either "abc_"(cmd_name) and "_"(action_cmd_def/cmd_chars) or "abc__"(cmd_name).
Here are my questions:
1) How to fix it? I tried adding "options {greedy=true;}" in front of cmd_name, but the error persists.
2) I know if I combine cmd_name and action_cmd_def into one, then the problem will be gone, this leads to the question of grammar granularity. Since ANTLR has such a powerful lexer/parser function, I really like to use the grammar to filter out meaningful string out, in this case, I know the input data for "action_cmd" must start with a command name string and then follow some messy stuff, so I like the grammar to do separate the 2 parts; otherwise I will have to write in action part using the target language (C in my case), but going deeper granularity brings so much trouble, I am in doubt if I am at a wrong track.
With this, I like to ask, what is your rule of thumb as of the grammar granularity? Am I going nuts in using grammar?
This is a genuine ambiguity, but the greedy option should work for you. Maybe it needs to be at the subrule level? See if this works:
cmd_name
: 'a'..'z' (options {greedy=true;} : 'a'..'z' | '0'..'9' | '_' )*
As for the second part of your question, I think your rule granularity is fine. You can also resort to using syntactic predicates if there is an ambiguity that needs more than just the greedy flag to solve. It is well documented in the ANTLR 3 book, but not so well on the website.
It amounts to trying to match the predicate syntactically. If it succeeds then it matches it for real, if it fails then it uses the other alternatives. For instance, in C you don't know if you have a function declaration or definition until you see the end of the declaration, which has no lower limit on its length. So you use a syntactic predicate to say "let's see if it is a declaration, if it is, then match it for real, if not then try the other alternatives.
externalDef
: ( "typedef" | declaration )=> declaration
| functionDef
| asm_expr
;

ANTLR rule works on its own, but fails when included in another rule

I am trying to write an ANTLR grammar for a reparsed and retagged kconfig file (retagged to solve a couple of ambiguities). A simplified version of the grammar is:
grammar FailureExample;
options {
language = Java;
}
#lexer::header {
package parse.failure.example;
}
reload
: configStatement*
EOF
;
configStatement
: CONFIG IDENT
configOptions
;
configOptions
: (type
| defConfigStatement
| dependsOnStatement
| helpStatement
| rangeStatement
| defaultStatement
| selectStatement
| visibleIfStatement
| prompt
)*
;
type : FAKE1;
dependsOnStatement: FAKE2;
helpStatement: FAKE3;
rangeStatement: FAKE4;
defaultStatement: FAKE5;
selectStatement:FAKE6;
visibleIfStatement:FAKE7;
prompt:FAKE8;
defConfigStatement
: defConfigType expression
;
defConfigType
: DEF_BOOL
;
//expression parsing
primative
: IDENT
| L_PAREN expression R_PAREN
;
negationExpression
: NOT* primative
;
orExpression
: negationExpression (OR negationExpression)*
;
andExpression
: orExpression (AND orExpression)*
;
unequalExpression
: andExpression (NOT_EQUAL andExpression)?
;
equalExpression
: unequalExpression (EQUAL unequalExpression)?
;
expression
: equalExpression (BECOMES equalExpression)?
;
DEF_BOOL: 'def_bool';
CONFIG : 'config';
COMMENT : '#' .* ('\n'|'\r') {$channel = HIDDEN;};
AND : '&&';
OR : '||';
NOT : '!';
L_PAREN : '(';
R_PAREN : ')';
BECOMES : '::=';
EQUAL : '=';
NOT_EQUAL : '!=';
FAKE1 : 'fake1';
FAKE2: 'fake2';
FAKE3: 'fake3';
FAKE4: 'fake4';
FAKE5: 'fake5';
FAKE6: 'fake6';
FAKE7: 'fake7';
FAKE8: 'fake8';
IDENT : (LETTER | DIGIT | '_')*;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
fragment LETTER : ('a'..'z' | 'A'..'Z') ;
fragment DIGIT : '0'..'9';
With input:
config HAVE_DEBUG_RAM_SETUP
def_bool n
I can set antlrworks to parse just the second line (commenting out the first) and I get the proper defConfigStatement token emitted with the proper expression following. However, if I exercise either the configOptions rule or the configStatement rule (with the first line uncommented), my configOptions results in an empty set and a NoViableAlt exception is thrown.
What would cause this behavior? I know that the defConfigStatement rule is accurate and can parse correctly, but as soon as it's added as a potential option in another rule, it fails. I know I don't have conflicting rules, and I've made DEF_BOOL and DEF_TRISTATE rules the top in my list of lexer rules, so they have priority over the other lexer rules.
/Added since edit/
To further complicate the issue, if I move the defConfigStatement choice in the configOptions rule, it will work, but other rules will fail.
Edit: Using full, simplified grammar.
In short, why does the rule work on its own, but fail when it's in configOptions (especially since configOptions is in (A | B | C)* form)?
When I parse the input:
config HAVE_DEBUG_RAM_SETUP
def_bool n
with the parser generated from your grammar, I get the following parse tree:
So, I see no issues here. My guess is that you're using ANTLRWorks' interpreter: don't. It's buggy. Always test your grammar with a class of your own, or use ANTLWorks' debugger (press CTRL+D to launch is). The debugger works like a charm (without the package declaration, btw). The image I posted above is an export from the debugger.
EDIT
If the debugger doesn't work, try (temporarily) removing the package declaration (note that you're only declaring a package for the lexer, not the parser, but that might be a caused by posting a minimal grammar). You could also try changing the port number the debugger should connect to. It could be the port is already in use (see: File -> Preferences -> Debugger-tab).