Parsing multiline strings with antlr - antlr

I have a grammar that looks like this:
a: b c d ;
b: x STRING y ;
where
STRING: '"' (~('"' | '\\' | '\r' | '\n') | '\\' ('"' | '\\'))* '"';
And my file contains one 'a' production in each line so I'm currently dropping all newlines. I would however want to parse multiline strings, how can I do that? It doesn't work if I just allow '\r' and '\n' inside the string.

IIUC, you are just looking for a multi-line string lexer rule. The fact that you are dropping newlines really does not affect the construction of the string rule. The newlines that match within the string rule will be consumed there before the lexer ever considers the whitespace rule.
STRING : DQUOTE ( STR_TEXT | EOL )* DQUOTE ;
WS : [ \t\r\n] -> skip;
fragment STR_TEXT: ( ~["\r\n\\] | ESC_SEQ )+ ;
fragment ESC_SEQ : '\\' ( [btf"\\] | EOF )
fragment DQUOTE : '"' ;
fragment EOL : '\r'? '\n' ;

Related

ANTLR: too greedy rule

It looks like I have a problem understanding a too greedy rule match. I'm trying to lex a .g4 file for syntax coloring. Here is a minimum (simplified) extract for making this problem reproducible:
lexer grammar ANTLRv4Lexer;
Range
: '[' RangeChar+ ']'
;
fragment EscapedChar
: '\\' ~[u]
| '\\u' EscapedCharHex EscapedCharHex EscapedCharHex EscapedCharHex
;
fragment EscapedCharHex
: [0-9A-Fa-f]
;
fragment RangeChar
: ~']'
| EscapedChar
;
Punctuation
: [:;()+\->*[\]~|]
;
Identifier
: [a-zA-Z0-9]+
;
Whitespace
: [ \t]+
-> skip
;
Newline
: ( '\r' '\n'?
| '\n'
)
-> skip
;
LineComment
: '//' ~[\r\n]*
;
The (incomplete) test file is following:
: (~ [\]\\] | EscAny)+ -> more
;
// ------
fragment Id
: NameStartChar NameChar*
;
String2Part
: ( ~['\\]
| EscapeSequence
)+
;
I don't understand why it matches Range so greedy:
[#0,3:3=':',<Punctuation>,1:3]
[#1,5:5='(',<Punctuation>,1:5]
[#2,6:6='~',<Punctuation>,1:6]
[#3,8:135='[\]\\] | EscAny)+ -> more\r\n ;\r\n\r\n // ------\r\n\r\nfragment Id\r\n : NameStartChar NameChar*\r\n ;\r\n\r\n\r\nString2Part\r\n\t: ( ~['\\]',<Range>,1:8]
[#4,141:141='|',<Punctuation>,13:3]
[#5,143:156='EscapeSequence',<Identifier>,13:5]
[#6,162:162=')',<Punctuation>,14:3]
[#7,163:163='+',<Punctuation>,14:4]
[#8,167:167=';',<Punctuation>,15:1]
[#9,170:169='<EOF>',<EOF>,16:0]
I understand why in the first line it matches [, \] and \\, but why it obviously treats ] as RangeChar?
Your lexer matches the first \ in \\] using the ~']' alternative and then matches the remaining \] as an EscapedChar. The reason it does this is that this interpretation leads to a longer match than the one where \\ is the EscapedChar and ] is the end of the range and when there are multiple valid ways to match a lexer rule, ANTLR always chooses the longest one (except when *? is involved).
To fix this, you should change RangeChar, so that backslashes are only allowed as part of escape sequences, i.e. replace ~']' with ~[\]\\].

ANTLR4 grammar regex and tilde

I want to have an ANTLR grammar for CSV input.
What's the difference between (~["])+ and (~['"'])+ ?
Why ~ is important?
Here is my grammar:
grammar Exercice4;
csv : ligne+
;
ligne : exp (',' exp)* ('\n' | EOF)
;
exp : ID
| INT
| STRING
;
INT : '0'..'9'+;
ID : ('0'..'9' | 'a'..'z' | 'A'..'Z')+;
STRING : '"' (~["])+ '"';
WS : [ ,\n, \t, \r] -> skip;
In a lexer rule, the characters inside square brackets define a character set. So ["] is the set with the single character ". Being a set, every character is either in the set or not, so defining a character twice, as in [""] makes no difference, it's the same as ["].
~ negates the set, so ~["] means any character except ".

How to allow an identifer which can start with a digit without causing MismatchedTokenException

I want to match the following input:
statement span=1m 0_dur=12
with the following grammar:
options {
language = Java;
output=AST;
ASTLabelType=CommonTree;
}
statement :'statement' 'span' '=' INTEGER 'm' ident '=' INTEGER;
INTEGER
: DIGIT+
;
ident : IDENT | 'AVG' | 'COUNT';
IDENT
: (LETTER | DIGIT | '_')+ ;
WHITESPACE
: ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
fragment
LETTER : ('a'..'z' | 'A'..'Z') ;
fragment
DIGIT : '0'..'9';
but it cause an error:
MismatchedTokenException : line 1:15 mismatched input '1m' expecting '\u0004'
Does anyone has any idea how to solve this?
THanks
Charles
I think your grammar is context sensitive, even at the lexical analyser(Tokenizer) level. The string "1m" is recognized as IDENT, not INTEGER followed by 'm'. You either redefine your syntax, or use predicated parsing, or embed Java code in your grammar to detect the context (e.g. If the number is presented after "span" followed by "=", then parse it as INTEGER).

Why is this grammar giving me a "non LL(*) decision" error?

I am trying to add support for expressions in my grammar. I am following the example given by Scott Stanchfield's Antlr Tutorial. For some reason the add rule is causing an error. It is causing a non-LL(*) error saying, "Decision can match input such as "'+'..'-' IDENT" using multiple alternatives"
Simple input like:
a.b.c + 4
causes the error. I am using the AntlrWorks Interpreter to test my grammar as I go. There seems to be a problem with how the tree is built for the unary +/- and the add rule. I don't understand why there are two possible parses.
Here's the grammar:
path : (IDENT)('.'IDENT)* //(NAME | LCSTNAME)('.'(NAME | LCSTNAME))*
;
term : path
| '(' expression ')'
| NUMBER
;
negation
: '!'* term
;
unary : ('+' | '-')* negation
;
mult : unary (('*' | '/' | '%') unary)*
;
add : mult (( '+' | '-' ) mult)*
;
relation
: add (('==' | '!=' | '<' | '>' | '>=' | '<=') add)*
;
expression
: relation (('&&' | '||') relation)*
;
multiFunc
: IDENT expression+
;
NUMBER : DIGIT+ ('.'DIGIT+)?
;
IDENT : (LCLETTER|UCLETTER)(LCLETTER|UCLETTER|DIGIT|'_')*
;
COMMENT
: '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
| '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
;
WS : (' ' | '\t' | '\r' | '\n' | '\f')+ {$channel = HIDDEN;}
;
fragment
LCLETTER
: 'a'..'z'
;
fragment
UCLETTER: 'A'..'Z'
;
fragment
DIGIT : '0'..'9'
;
I need an extra set of eyes. What am I missing?
The fact that you let one or more expressions match in:
multiFunc
: IDENT expression+
;
makes your grammar ambiguous. Let's say you're trying to match "a 1 - - 2" using the multiFunc rule. The parser now has 2 possible ways to parse this: a is matched by IDENT, but the 2 minus signs 1 - - 2 cause trouble for expression+. The following 2 parses are possible:
parse 1
parse 2
Your grammar in rule multiFunc has a list of expressions. An expression can begin with + or - on behalf of unary, thus due to the list, it can also be followed by the same tokens. This is in conflict with the add rule: there is a problem deciding between continuation and termination.

Parse sentences with different word types

I'm looking for a grammar for analyzing two type of sentences, that
means words separated by white spaces:
ID1: sentences with words not beginning with numbers
ID2: sentences with words not beginning with numbers and numbers
Basically, the structure of the grammar should look like
ID1 separator ID2
ID1: Word can contain number like Var1234 but not start with a number
ID2: Same as above but 1234 is allowed
separator: e. g. '='
#Bart
I just tried to add two tokens '_' and '"' as lexer-rule Special for later use in lexer-rule Word.
Even I haven't used Special in the following grammar, I get the following error in ANTLRWorks 1.4.2:
The following token definitions can never be matched because prior tokens match the same input: Special
But when I add fragment before Special, I don't get that error. Why?
grammar Sentence1b1;
tokens
{
TCUnderscore = '_' ;
TCQuote = '"' ;
}
assignment
: id1 '=' id2
;
id1
: Word+
;
id2
: ( Word | Int )+
;
Int
: Digit+
;
// A word must start with a letter
Word
: ( 'a'..'z' | 'A'..'Z') ('a'..'z' | 'A'..'Z' | Digit )*
;
Special
: ( TCUnderscore | TCQuote )
;
Space
: ( ' ' | '\t' | '\r' | '\n' ) { $channel = HIDDEN; }
;
fragment Digit
: '0'..'9'
;
Lexer-rule Special shall then be used in lexer-rule Word:
Word
: ( 'a'..'z' | 'A'..'Z' | Special ) ('a'..'z' | 'A'..'Z' | Special | Digit )*
;
I'd go for something like this:
grammar Sentence;
assignment
: id1 '=' id2
;
id1
: Word+
;
id2
: (Word | Int)+
;
Int
: Digit+
;
// A word must start with a letter
Word
: ('a'..'z' | 'A'..'Z') ('a'..'z' | 'A'..'Z' | Digit)*
;
Space
: (' ' | '\t' | '\r' | '\n') {skip();}
;
fragment Digit
: '0'..'9'
;
which will parse the input:
Word can contain number like Var1234 but not start with a number = Same as above but 1234 is allowed
as follows:
EDIT
To keep lexer rule nicely packed together, I'd keep them all at the bottom of the grammar instead of partly in the tokens { ... } block, which I only use for defining "imaginary tokens" (used in AST creation):
// wrong!
Special : (TCUnderscore | TCQuote);
TCUnderscore : '_';
TCQuote : '"';
Now, with the rules above, TCUnderscore and TCQuote can never become a token because when the lexer stumbles upon a _ or ", a Special token is created. Or in this case:
// wrong!
TCUnderscore : '_';
TCQuote : '"';
Special : (TCUnderscore | TCQuote);
the Special token can never be created because the lexer would first create TCUnderscore and TCQuote tokens. Hence the error:
The following token definitions can never be matched because prior tokens match the same input: ...
If you make TCUnderscore and TCQuote a fragment rule, you don't have that problem because fragment rules only "serve" other lexer rules. So this works:
// good!
Special : (TCUnderscore | TCQuote);
fragment TCUnderscore : '_';
fragment TCQuote : '"';
Also, fragment rules can therefor never be "visible" in any of your parser rules (the lexer will never create a TCUnderscore or TCQuote token!).
// wrong!
parse : TCUnderscore;
Special : (TCUnderscore | TCQuote);
fragment TCUnderscore : '_';
fragment TCQuote : '"';
I'm not sure if that fits your needs but with Bart's help in my post
ANTLR - identifier with whitespace
i came to this grammar:
grammar PropertyAssignment;
assignment
: id_nodigitstart '=' id_digitstart EOF
;
id_nodigitstart
: ID_NODIGITSTART+
;
id_digitstart
: (ID_DIGITSTART|ID_NODIGITSTART)+
;
ID_NODIGITSTART
: ('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'0'..'9')*
;
ID_DIGITSTART
: ('0'..'9'|'a'..'z'|'A'..'Z')+
;
WS : (' ')+ {skip();}
;
"a name = my 4value" works while "4a name = my 4value" causes an exception.