ANTLR4 input grammar - grammar

How can I write this grammar expression for ANTLR4 input?
Originally expression:
<int_literal> = 0|(1 -9){0 -9}
<char_literal> = ’( ESC |~( ’|\| LF | CR )) ’
<string_literal> = "{ ESC |~("|\| LF | CR )}"
I tried the following expression:
int_literal : '0' | ('1'..'9')('0'..'9')*;
char_literal : '('ESC' | '~'('\'|'''|'LF'|'CR'))';
But it returned:
syntax error: '\' came as a complete surprise to me
syntax error: mismatched input ')' expecting SEMI while matching a rule
unterminated string literal

Your quotes don't match:
'('ESC' | '~'('\'|'''|'LF'|'CR'))'
^ ^ ^ ^ ^ ^ ^
| | | | | | |
o c o c o c error
o is open, c is close
I'd read "{ ESC |~("|\| LF | CR )}" as this:
// A string literal is zero or more chars other than ", \, \r and \n
// enclosed in double quotes
StringLiteral
: '"' ( Escape | ~( '"' | '\\' | '\r' | '\n' ) )* '"'
;
Escape
: '\\' ???
;
Also note that ANTLR4 has short hand char classes ([0-9] equals '0'..'9'), so you can do this:
IntLiteral
: '0'
| [1-9] [0-9]*
;
StringLiteral
: '"' ( Escape | ~["\\\r\n] )* '"'
;
Also not that lexer rules start with an uppercase letter! Otherwise they become parser rules (see: Practical difference between parser rules and lexer rules in ANTLR?).

Related

Inline comments and empty line in antlr4 grammar

please can anyone explain me, what i need to change i this grammar to support inline comments (such as // some text) and empty line (which contains any number of space characters). I write following grammar, but this doesn't work.
program: line* EOF ;
line: (expression | assignment) (NEWLINE | EOF);
assignment : VARIABLE '=' expression ;
expression : '(' expression ')' #parenthesisExpression
| '-' expression #unaryExpression
| left=expression OP1 right=expression #firstPriorityExpression
| left=expression OP2 right=expression #secondPriorityExpression
| number=NUMBER #numericExpression
| variable=VARIABLE #variableExpression
;
NUMBER : [0-9]+ ;
VARIABLE : [a-zA-Z][a-zA-Z0-9]* ;
OP1 : '*' | '/' ;
OP2 : '+' | '-' ;
NEWLINE : '\r'? '\n' ;
WHITESPACE : [ \t\r]+ -> skip ;
COMMENT : '//' ~[\n\r]* -> skip ;
The fact you added - in a parser rule as a literal token, and also made OP2 match this character causes OP2 to never match a -. You need to have a lexer rule that matches only the single minus sign (as I showed earlier):
op1
: MUL
| DIV
;
op2
: ADD
| MIN
;
...
MUL : '*' ;
DIV : '/' ;
ADD : '+' ;
MIN : '-' ;
and then use MIN in your unary alternative:
...
| MIN expression #unaryExpression
...
When you have a separate MIN : '-' ; rule, you could do this:
...
| '-' expression #unaryExpression
...
because now ANTLR "knows" you mean the rule that matches a single -, but ANTLR does not "know" this when you have a lexer rule that matches a either a - or + like your OP2 rule:
OP2 : '+' | '-' ;

how to parse string of SQL contain ESCAPE in ANTLR4?

there are two ESCAPE type in SQL: \' AND ''
a input may like:
SELECT '\'', '''';
I parse the string with this grammar:
STRING_LITERAL
: '\'' ( '\\\'' | '\'\'' | ~'\'' )* '\''
;
but ANTLR parse the input error, the tree like this:
error parsed tree
I also tried another type of STRING_LITERAL grammar with GREEDY: "?":
STRING_LITERAL
: '\'' ( '\\\'' | '\'\'' | ~'\'' )*? '\''
;
but it also give me a error parse resule like this:
error parsed tree in another grammar
the '''' should parsed as a string contain but not two empty string.
How should I modify the grammar to fix the problem?
You didn't exclude the \ in the ( ... )*. Try this:
STRING_LITERAL
: '\'' ( '\\\'' | '\'\'' | ~['\\] )* '\''
;
where ~['\\] matches any char except ' and \. You may want to include line break chars in it: ~[\r\n'\\].

Missing token and extraneous input

I use Python3.g4 grammar from here and try to modify it. I want to add type hints, starting with 3 chars "#t ". They can be on separate line and after statements.
Added and modified rules:
simple_stmt
: small_stmt ( ';' small_stmt )* ';'? type_comment? NEWLINE
| type_comment NEWLINE
;
type_comment
: TYPE_COMMENT
;
TYPE_COMMENT
: '#' 't' ' ' ~[\r\n]*
;
Other relevant rules:
stmt
: simple_stmt
| compound_stmt
;
fragment COMMENT
: '#' ~[\r\n]*
;
compound_stmt
: if_stmt
| while_stmt
| for_stmt
| try_stmt
| with_stmt
| funcdef
| classdef
| decorated
;
while_stmt
: WHILE test ':' suite ( ELSE ':' suite )?
;
suite
: simple_stmt
| NEWLINE INDENT stmt+ DEDENT
;
With input
a = 1 #t int
#t int
#t str
s = "string"
I get following errors:
line 3:0 missing NEWLINE at '#t int'
line 5:0 extraneous input '#t str' expecting NEWLINE
When line
| type_comment NEWLINE
changed to
| type_comment
I receive other similar errors. What is the correct version of this grammar?
My guess would be that the #t ... gets matched by a rule defined before your TYPE_COMMENT. Try to define TYPE_COMMENT as the first lexer rule. If that doesn't help, please post your entire grammar.

Parsing multiline strings with antlr

I have a grammar that looks like this:
a: b c d ;
b: x STRING y ;
where
STRING: '"' (~('"' | '\\' | '\r' | '\n') | '\\' ('"' | '\\'))* '"';
And my file contains one 'a' production in each line so I'm currently dropping all newlines. I would however want to parse multiline strings, how can I do that? It doesn't work if I just allow '\r' and '\n' inside the string.
IIUC, you are just looking for a multi-line string lexer rule. The fact that you are dropping newlines really does not affect the construction of the string rule. The newlines that match within the string rule will be consumed there before the lexer ever considers the whitespace rule.
STRING : DQUOTE ( STR_TEXT | EOL )* DQUOTE ;
WS : [ \t\r\n] -> skip;
fragment STR_TEXT: ( ~["\r\n\\] | ESC_SEQ )+ ;
fragment ESC_SEQ : '\\' ( [btf"\\] | EOF )
fragment DQUOTE : '"' ;
fragment EOL : '\r'? '\n' ;

Why is this grammar giving me a "non LL(*) decision" error?

I am trying to add support for expressions in my grammar. I am following the example given by Scott Stanchfield's Antlr Tutorial. For some reason the add rule is causing an error. It is causing a non-LL(*) error saying, "Decision can match input such as "'+'..'-' IDENT" using multiple alternatives"
Simple input like:
a.b.c + 4
causes the error. I am using the AntlrWorks Interpreter to test my grammar as I go. There seems to be a problem with how the tree is built for the unary +/- and the add rule. I don't understand why there are two possible parses.
Here's the grammar:
path : (IDENT)('.'IDENT)* //(NAME | LCSTNAME)('.'(NAME | LCSTNAME))*
;
term : path
| '(' expression ')'
| NUMBER
;
negation
: '!'* term
;
unary : ('+' | '-')* negation
;
mult : unary (('*' | '/' | '%') unary)*
;
add : mult (( '+' | '-' ) mult)*
;
relation
: add (('==' | '!=' | '<' | '>' | '>=' | '<=') add)*
;
expression
: relation (('&&' | '||') relation)*
;
multiFunc
: IDENT expression+
;
NUMBER : DIGIT+ ('.'DIGIT+)?
;
IDENT : (LCLETTER|UCLETTER)(LCLETTER|UCLETTER|DIGIT|'_')*
;
COMMENT
: '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
| '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
;
WS : (' ' | '\t' | '\r' | '\n' | '\f')+ {$channel = HIDDEN;}
;
fragment
LCLETTER
: 'a'..'z'
;
fragment
UCLETTER: 'A'..'Z'
;
fragment
DIGIT : '0'..'9'
;
I need an extra set of eyes. What am I missing?
The fact that you let one or more expressions match in:
multiFunc
: IDENT expression+
;
makes your grammar ambiguous. Let's say you're trying to match "a 1 - - 2" using the multiFunc rule. The parser now has 2 possible ways to parse this: a is matched by IDENT, but the 2 minus signs 1 - - 2 cause trouble for expression+. The following 2 parses are possible:
parse 1
parse 2
Your grammar in rule multiFunc has a list of expressions. An expression can begin with + or - on behalf of unary, thus due to the list, it can also be followed by the same tokens. This is in conflict with the add rule: there is a problem deciding between continuation and termination.