There are two ways to escape a single quote in SQL: \' and ''.
An input may look like this:
SELECT '\'', '''';
I parse the string with this grammar:
STRING_LITERAL
: '\'' ( '\\\'' | '\'\'' | ~'\'' )* '\''
;
But ANTLR parses the input incorrectly. The tree looks like this:
[image: error parsed tree]
I also tried another version of the STRING_LITERAL rule with the non-greedy operator "?":
STRING_LITERAL
: '\'' ( '\\\'' | '\'\'' | ~'\'' )*? '\''
;
But it also gives me an incorrect parse result, like this:
[image: error parsed tree in another grammar]
The '''' should be parsed as one string containing a single quote, not as two empty strings.
How should I modify the grammar to fix the problem?
You didn't exclude the \ in the ( ... )*, so ~'\'' can also match a backslash on its own. Try this:
STRING_LITERAL
: '\'' ( '\\\'' | '\'\'' | ~['\\] )* '\''
;
where ~['\\] matches any char except ' and \. You may want to include line break chars in it: ~[\r\n'\\].
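As a rough cross-check, the corrected rule can be mirrored with a Python regular expression. This is only a sketch: Python's re module is a backtracking engine, not ANTLR's longest-match lexer, so it illustrates the character classes rather than ANTLR's exact behaviour. For the sample input, though, both should produce the same two tokens.

```python
import re

# Regex analogue of the corrected lexer rule:
#   '\'' ( '\\\'' | '\'\'' | ~['\\] )* '\''
# A backslash can now only be consumed as part of the \' escape.
STRING_LITERAL = re.compile(r"'(?:\\'|''|[^'\\])*'")

sql = r"SELECT '\'', '''';"
tokens = STRING_LITERAL.findall(sql)

# Each literal is one token that contains a single escaped quote.
for token in tokens:
    print(token)
```

Here '''' comes out as one string token, not as two empty strings.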
I'm a beginner with ANTLR.
I'm trying to write an ANTLR grammar (.g4 file) for the following rules:
Accept AND (&) between two variables: A&B, ABC&X, ...
Accept a Unicode string that begins with U&'hex string': U&'000b', U&'0020', ...
Accept string concatenation between a variable (of string type) and a string: A&'123', ABC&'XyZ', ...
My question is: how do I reject string concatenation for U&'XyZ'? U& is the prefix of a Unicode string.
Thank you for reading.
U&'XyZ' should never be recognised as an AND expression by the parser, because U&'XyZ' is already tokenised as a single Unicode string token in the lexer:
expr
: expr '&' expr
| STRING
| ID
;
STRING
: 'U&'? '\'' ( ~[\\'\r\n] | '\\' ~[\r\n] )* '\''
;
ID
: [a-zA-Z_] [a-zA-Z_0-9]*
;
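which will parse U&'XyZ'&X as expr '&' expr, where the left operand is the single STRING token U&'XyZ' and the right operand is the ID token X.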
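The tokenisation can be approximated in Python to see why U&'XyZ' never reaches the parser as three separate pieces. This is a sketch under assumptions: the group name AMP and the helper tokenize are made up for illustration, and Python's ordered alternation stands in for ANTLR's longest-match rule by listing STRING first.

```python
import re

# Regex analogues of the STRING and ID lexer rules, plus the '&' literal.
# STRING is listed first so that U&'...' is tried before ID can take the U.
TOKEN = re.compile(
    r"(?P<STRING>(?:U&)?'(?:[^\\'\r\n]|\\[^\r\n])*')"
    r"|(?P<ID>[a-zA-Z_][a-zA-Z_0-9]*)"
    r"|(?P<AMP>&)"
)

def tokenize(text):
    """Return (token_type, lexeme) pairs; unmatched characters are skipped."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]

print(tokenize("U&'XyZ'&X"))
print(tokenize("A&'123'"))
```

In the first input the U& prefix is absorbed into the string token, while in the second the A is an ordinary identifier followed by & and a plain string.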
How can I write this grammar expression for ANTLR4 input?
Original expression:
<int_literal> = 0|(1 -9){0 -9}
<char_literal> = ’( ESC |~( ’|\| LF | CR )) ’
<string_literal> = "{ ESC |~("|\| LF | CR )}"
I tried the following expression:
int_literal : '0' | ('1'..'9')('0'..'9')*;
char_literal : '('ESC' | '~'('\'|'''|'LF'|'CR'))';
But it returned:
syntax error: '\' came as a complete surprise to me
syntax error: mismatched input ')' expecting SEMI while matching a rule
unterminated string literal
Your quotes don't match:
'('ESC' | '~'('\'|'''|'LF'|'CR'))'
^ ^   ^   ^ ^ ^^
| |   |   | | |+-- error
o c   o   c o c
o is open, c is close
I'd read "{ ESC |~("|\| LF | CR )}" as this:
// A string literal is zero or more chars other than ", \, \r and \n
// enclosed in double quotes
StringLiteral
: '"' ( Escape | ~( '"' | '\\' | '\r' | '\n' ) )* '"'
;
Escape
: '\\' ???
;
Also note that ANTLR4 has short hand char classes ([0-9] equals '0'..'9'), so you can do this:
IntLiteral
: '0'
| [1-9] [0-9]*
;
StringLiteral
: '"' ( Escape | ~["\\\r\n] )* '"'
;
Also note that lexer rules start with an uppercase letter! Otherwise they become parser rules (see: Practical difference between parser rules and lexer rules in ANTLR?).
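The IntLiteral and StringLiteral rules above can be sanity-checked with equivalent Python regular expressions. This is a sketch, not the ANTLR lexer itself. The Escape rule is left open in the answer, so the set of escapes used here (a backslash followed by one of b, t, n, f, r, ", ' or \) is purely an assumption, and the helper names are hypothetical.

```python
import re

# Analogue of: IntLiteral : '0' | [1-9] [0-9]* ;
INT_LITERAL = re.compile(r"0|[1-9][0-9]*")

# ASSUMPTION: Escape is a backslash followed by one of b t n f r " ' \
# (the original answer leaves this rule unspecified).
ESCAPE = r"""\\[btnfr"'\\]"""

# Analogue of: StringLiteral : '"' ( Escape | ~["\\\r\n] )* '"' ;
STRING_LITERAL = re.compile(r'"(?:' + ESCAPE + r'|[^"\\\r\n])*"')

def is_int_literal(text):
    return INT_LITERAL.fullmatch(text) is not None

def is_string_literal(text):
    return STRING_LITERAL.fullmatch(text) is not None

print(is_int_literal("42"), is_int_literal("007"))
print(is_string_literal(r'"a\"b"'), is_string_literal('"a\nb"'))
```

A leading zero is rejected for multi-digit integers, and a raw line break inside the quotes is rejected for strings, which matches the intent of the original expressions.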
I use the Python3.g4 grammar from here and am trying to modify it. I want to add type hints that start with the 3 chars "#t ". They can appear on a separate line or after a statement.
Added and modified rules:
simple_stmt
: small_stmt ( ';' small_stmt )* ';'? type_comment? NEWLINE
| type_comment NEWLINE
;
type_comment
: TYPE_COMMENT
;
TYPE_COMMENT
: '#' 't' ' ' ~[\r\n]*
;
Other relevant rules:
stmt
: simple_stmt
| compound_stmt
;
fragment COMMENT
: '#' ~[\r\n]*
;
compound_stmt
: if_stmt
| while_stmt
| for_stmt
| try_stmt
| with_stmt
| funcdef
| classdef
| decorated
;
while_stmt
: WHILE test ':' suite ( ELSE ':' suite )?
;
suite
: simple_stmt
| NEWLINE INDENT stmt+ DEDENT
;
With input
a = 1 #t int
#t int
#t str
s = "string"
I get the following errors:
line 3:0 missing NEWLINE at '#t int'
line 5:0 extraneous input '#t str' expecting NEWLINE
When the line
| type_comment NEWLINE
is changed to
| type_comment
I get other, similar errors. What is the correct version of this grammar?
My guess would be that the #t ... gets matched by a rule defined before your TYPE_COMMENT. Try to define TYPE_COMMENT as the first lexer rule. If that doesn't help, please post your entire grammar.
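The reason rule order can matter is that ANTLR's lexer picks the longest match, and when two rules match the same text it prefers the one defined first. The sketch below simulates only that tie-break in Python. The helper first_token is hypothetical, and COMMENT here stands in for whatever more general rule might be capturing the "#t ..." text in the full grammar.

```python
import re

def first_token(rules, text):
    """Return (rule_name, lexeme) for the token at the start of text.

    Mimics the lexer's choice: longest match wins, and on a tie the
    rule listed first is kept.
    """
    best = None
    for name, pattern in rules:
        m = re.match(pattern, text)
        # Only a strictly longer match replaces the current best.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

COMMENT = ("COMMENT", r"#[^\r\n]*")
TYPE_COMMENT = ("TYPE_COMMENT", r"#t [^\r\n]*")

print(first_token([COMMENT, TYPE_COMMENT], "#t int"))
print(first_token([TYPE_COMMENT, COMMENT], "#t int"))
```

Both rules match the whole of "#t int", so the winner is decided purely by which rule comes first.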
I have a grammar that looks like this:
a: b c d ;
b: x STRING y ;
where
STRING: '"' (~('"' | '\\' | '\r' | '\n') | '\\' ('"' | '\\'))* '"';
And my file contains one 'a' production in each line so I'm currently dropping all newlines. I would however want to parse multiline strings, how can I do that? It doesn't work if I just allow '\r' and '\n' inside the string.
IIUC, you are just looking for a multi-line string lexer rule. The fact that you are dropping newlines really does not affect the construction of the string rule. The newlines that match within the string rule will be consumed there before the lexer ever considers the whitespace rule.
STRING : DQUOTE ( STR_TEXT | EOL )* DQUOTE ;
WS : [ \t\r\n] -> skip;
fragment STR_TEXT: ( ~["\r\n\\] | ESC_SEQ )+ ;
fragment ESC_SEQ : '\\' ( [btf"\\] | EOF ) ;
fragment DQUOTE : '"' ;
fragment EOL : '\r'? '\n' ;
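To see that line breaks inside the quotes are consumed by the string rule itself, the STRING rule can be mirrored with a Python regular expression. This is a sketch only: the EOF alternative of ESC_SEQ is left out because it has no direct regex counterpart, and the sample text is made up.

```python
import re

# Regex analogue of: STRING : DQUOTE ( STR_TEXT | EOL )* DQUOTE ;
#   STR_TEXT -> [^"\r\n\\] or an escape \b \t \f \" \\
#   EOL      -> optional \r followed by \n
STRING = re.compile(r'"(?:[^"\r\n\\]|\\[btf"\\]|\r?\n)*"')

text = 'x "line one\nline two" y'
match = STRING.search(text)

# The whole two-line string is one match, newline included.
print(repr(match.group()))
```

The newline never reaches a whitespace rule, because it is already part of the string token by the time the lexer moves on.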
I'm trying to develop a grammar to parse a DSL using ANTLR4 (my first attempt at using it).
The grammar itself is somewhat similar to SQL.
It should be able to parse commands like the following:
select type1.attribute1 type2./xpath_expression[#id='test 1'] type3.* from source1 source2
fromDate 2014-01-12T00:00:00.123456+00:00 toDate 2014-01-13T00:00:00.123456Z
where (type1.attribute2 = "XX" AND
(type1.attribute3 <= "2014-01-12T00:00:00.123456+00:00" OR
type2./another_xpath_expression = "YY"))
EDIT: I've updated the grammar switching CHAR, SYMBOL and DIGIT to fragment as suggested by [lucas_trzesniewski], but I did not manage to get improvements.
Attached is the parse tree as suggested by Terence. I get also in the console the following (I'm getting more confused...):
warning(125): API.g4:16:8: implicit definition of token 'CHAR' in parser
warning(125): API.g4:20:31: implicit definition of token 'SYMBOL' in parser
line 1:12 mismatched input 'p' expecting {'.', NUMBER, CHAR, SYMBOL}
line 1:19 mismatched input 't' expecting {'.', NUMBER, CHAR, SYMBOL}
line 1:27 mismatched input 'm' expecting {'.', NUMBER, CHAR, SYMBOL}
line 1:35 mismatched input '#' expecting {NUMBER, CHAR, SYMBOL}
line 1:58 no viable alternative at input 'm'
line 3:13 no viable alternative at input '(deco.m'
I was able to put together the bulk of the grammar, but it fails to properly match all the tokens, therefore resulting in incorrect parsing depending on the complexity of the input.
From browsing the internet, it seems to me that the main cause is the lexer selecting the longest matching sequence, but even after several attempts at rewriting the lexer and parser rules I could not arrive at a robust set.
Below are my grammar and some test cases.
What would be the correct way to specify the rules? Should I use lexer modes?
GRAMMAR
grammar API;
get : K_SELECT (((element) )+ | '*')
'from' (source )+
( K_FROM_DATE dateTimeOffset )? ( K_TO_DATE dateTimeOffset )?
('where' expr )?
EOF
;
element : qualifier DOT attribute;
qualifier : 'raw' | 'std' | 'deco' ;
attribute : ( word | xpath | '*') ;
word : CHAR (CHAR | NUMBER)*;
xpath : (xpathFragment+);
xpathFragment
: '/' ( DOT | CHAR | NUMBER | SYMBOL )+
| '[' (CHAR | NUMBER | SYMBOL )+ ']'
;
source : ( 'system1' | 'system2' | 'ALL') ; // should be generalised.
date : (NUMBER MINUS NUMBER MINUS NUMBER) ;
time : (NUMBER COLON NUMBER (COLON NUMBER ( DOT NUMBER )?)? ( 'Z' | SIGN (NUMBER COLON NUMBER )));
dateTimeOffset : date 'T' time;
filter : (element OP value) ;
value : QUOTE .+? QUOTE ;
expr
: filter
| '(' expr 'AND' expr ')'
| '(' expr 'OR' expr ')'
;
K_SELECT : 'select';
K_RANGE : 'range';
K_FROM_DATE : 'fromDate';
K_TO_DATE : 'toDate' ;
QUOTE : '"' ;
MINUS : '-';
SIGN : '+' | '-';
COLON : ':';
COMMA : ',';
DOT : '.';
OP : '=' | '<' | '<=' | '>' | '>=' | '!=';
NUMBER : DIGIT+;
fragment DIGIT : ('0'..'9');
fragment CHAR : [a-z] | [A-Z] ;
fragment SYMBOL : '#' | [-_=] | '\'' | '/' | '\\' ;
WS : [ \t\r\n]+ -> skip ;
NONWS : ~[ \t\r\n];
TEST 1
select raw./priobj/tradeid/margin[#id='222'] deco.* deco.marginType from system1 system2
fromDate 2014-01-12T00:00:00.123456+00:00 toDate 2014-01-13T00:00:00.123456Z
where ( deco.marginType >= "MV" AND ( ( raw.CretSysInst = "RMS_EXODUS" OR deco.ExtSysNum <= "1234" ) OR deco.ExtSysStr = "TEST Spaced" ) )
TEST 2
select * from ALL
TEST 3
select deco./xpath/expr/text() deco./xpath/expr[a='3' and b gt '6] raw.* from ALL where raw.attr3 = "myvalue"
The image shows that my grammar is unable to recognise several parts of the commands.
What puzzles me a bit is that the individual parts work properly on their own,
e.g. when parsing only the 'expr', as shown by the tree below.
That kind of thing: word : (CHAR (CHAR | NUMBER)+); is indeed a job for the lexer, not the parser.
This: DIGIT : ('0'..'9'); should be a fragment. The same goes for this: CHAR : [a-z] | [A-Z] ;. That way, you could write NUMBER : DIGIT+;, and WORD: CHAR (CHAR | NUMBER)*;
The reason is simple: you want to deal with meaningful tokens in your parser, not with parts of words. Think of the lexer as the thing that will "cut" the input text at meaningful points. Later on, you want to process full words, not individual characters. So think about where it is most meaningful to make those cuts.
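A small Python sketch shows what "cutting" at word level looks like. It is an illustration under assumptions, not the ANTLR lexer: the token names WORD, NUMBER, DOT and WS and the helper lex_words are chosen here for the example, with WORD and NUMBER written as regex analogues of lexer rules built from the CHAR and DIGIT fragments.

```python
import re

# WORD and NUMBER are whole tokens; single letters and digits never
# reach the consumer on their own.
TOKEN = re.compile(
    r"(?P<WORD>[a-zA-Z][a-zA-Z0-9]*)"
    r"|(?P<NUMBER>[0-9]+)"
    r"|(?P<DOT>\.)"
    r"|(?P<WS>[ \t\r\n]+)"
)

def lex_words(text):
    """Return (token_type, lexeme) pairs, dropping whitespace."""
    tokens = []
    for m in TOKEN.finditer(text):
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(lex_words("deco.marginType raw.attr3"))
```

The parser then sees a short stream such as WORD DOT WORD for deco.marginType, which is far easier to write rules against than a stream of individual characters.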
Now, as the ANTLR master has pointed out, to debug your problem, dump the parse tree and see what goes on.