Problems matching text and containing syntax elements - antlr

I've written a small portion of a combined ANTLR4 grammar:
grammar TestCombined;
NL
: [\r\n]
;
SUBHEADLINE
: '##' .*? '##'
;
HEADLINE
: '#' .*? '#'
;
LEAD
: '###' .*? '###'
;
SUBHEADING
: '####' .*? '####'
;
TEXT
: .+?
;
/* ---- */
dnpMD
: subheadline headline lead bodyElements*
;
subheadline
: SUBHEADLINE NL NL
;
headline
: HEADLINE NL NL
;
lead
: LEAD NL NL
;
subheading
: SUBHEADING
;
bodyElements
: TEXT
| subheading
;
The first three headline types are working extremely well. Thanks to another question (and the answer) this is way clearer to me than before.
But I've problems understanding, why the TEXT rule/token is not getting matched correctly. I'm new to ANTLR4 and I think I'm missing something very important that hampers me of understanding the underlying problem.
This is an example input:
## Test ##
# Test123 #
### Test1234 ###
#### Another Test ####
this is not getting recognized.
What am I missing? Is it impossible to write those things in/with ANTLR4? The text could possibly contain more syntax elements like italic and stuff like that.

The current solution looks like this lexer and grammar rules:
lexer grammar dnpMDAuslagernLexer;
/*#members {
public static final int COMMENTS = 1;
}*/
NL
: [\r\n]
;
SUBHEADLINE
: '##' (~[\r\n])+? '##'
;
HEADLINE
: '#' ('\\#'|~[\r\n])+? '#'
;
LEAD
: '###' (~[\r\n])+? '###'
;
SUBHEADING
: '####' (~[\r\n])+? '####'
;
CAPTION
: '#####' (~[\r\n])+? '#####'
;
LISTING
: '~~~~~' .+? '~~~~~'
;
ELEMENTPATH
: '[[[[[' (~[\r\n])+? ']]]]]'
;
LABELREF
: '{##' (~[\r\n])+? '##}'
;
LABEL
: '{#' (~[\r\n])+? '#}'
;
ITALIC
: '*' (~[\r\n])+? '*'
;
SINGLE_COMMENT
: '//' (~[\r\n])+ -> channel(1)
;
MULTI_COMMENT
: '/*' .*? '*/' -> channel(1)
;
STAR
: '*'
;
BRACE_OPEN
: '{'
;
TEXT
: (~[\r\n*{])+
;
parser grammar dnpMDAuslagernParser;
options { tokenVocab=dnpMDAuslagernLexer; }
dnpMD
: head body
;
head
: subheadline headline lead
;
subheadline
: SUBHEADLINE NL+
;
headline
: HEADLINE NL+
;
lead
: LEAD
;
subheading
: SUBHEADING
;
caption
: CAPTION
;
listing
: LISTING (NL listingPath)? (NL label)? NL caption
;
image
: caption (NL label)? (NL imagePath)?
;
listingPath
: ELEMENTPATH
;
imagePath
: ELEMENTPATH
;
labelRef
: LABELREF
;
label
: LABEL
;
italic
: ITALIC
;
singleComment
: SINGLE_COMMENT
;
multiComment
: MULTI_COMMENT
;
paragraph
: TEXT? italic TEXT?
| TEXT? STAR TEXT?
| TEXT? labelRef TEXT?
| TEXT? BRACE_OPEN TEXT?
| TEXT? LABEL TEXT?
| ELEMENTPATH
| TEXT
;
newlines
: NL+
;
body
: bodyElements+
;
bodyElements
: singleComment
| multiComment
| paragraph
| subheading
| listing
| image
| newlines
;
This language is working fine and maybe someone can benefit from it.
Thanks to all who helped out!
Fabian

Related

ANTLR4: wrong lexer rule matches

I'm at a very beginning of learning ANTLR4 lexer rules. My goal is to create a simple grammar for Java properties files. Here is what I have so far:
lexer grammar PropertiesLexer;
LineComment
: ( LineCommentHash
| LineCommentExcl
)
-> skip
;
fragment LineCommentHash
: '#' ~[\r\n]*
;
fragment LineCommentExcl
: '!' ~[\r\n]*
;
fragment WrappedLine
: '\\'
( '\r' '\n'?
| '\n'
)
;
Newline
: ( '\r' '\n'?
| '\n'
)
-> skip
;
Key
: KeyLetterStart
( KeyLetter
| Escaped
)*
;
fragment KeyLetterStart
: ~[ \t\r\n:=]
;
fragment KeyLetter
: ~[\t\r\n:=]
;
fragment Escaped
: '\\' .?
;
Equal
: ( '\\'? ':'
| '\\'? '='
)
;
Value
: ValueLetterBegin
( ValueLetter
| Escaped
| WrappedLine
)*
;
fragment ValueLetterBegin
: ~[ \t\r\n]
;
fragment ValueLetter
: ~ [\r\n]+
;
Whitespace
: [ \t]+
-> skip
;
My test file is this one:
# comment 1
# comment 2
#
.key1= value1
key2\:sub=value2
key3 \= value3
key4=value41\
value42
# comment3
#comment4
key=value
When I run grun, I'm getting following output:
[#0,30:42='.key1= value1',<Value>,4:0]
[#1,45:60='key2\:sub=value2',<Value>,5:0]
[#2,63:76='key3 \= value3',<Value>,6:0]
[#3,81:102='key4=value41\\r\nvalue42',<Value>,8:0]
[#4,130:138='key=value',<Value>,13:0]
[#5,141:140='<EOF>',<EOF>,14:0]
I don't understand why the Value definition is matched. When commenting out the Value definition, however, it recognizes the Key and Equal definitions:
[#0,30:34='.key1',<Key>,4:0]
[#1,35:35='=',<Equal>,4:5]
[#2,37:42='value1',<Key>,4:7]
[#3,45:49='key2\',<Key>,5:0]
[#4,50:50=':',<Equal>,5:5]
[#5,51:53='sub',<Key>,5:6]
[#6,54:54='=',<Equal>,5:9]
[#7,55:60='value2',<Key>,5:10]
[#8,63:68='key3 \',<Key>,6:0]
[#9,69:69='=',<Equal>,6:6]
[#10,71:76='value3',<Key>,6:8]
[#11,81:84='key4',<Key>,8:0]
[#12,85:85='=',<Equal>,8:4]
[#13,86:93='value41\',<Key>,8:5]
[#14,96:102='value42',<Key>,9:0]
[#15,130:132='key',<Key>,13:0]
[#16,133:133='=',<Equal>,13:3]
[#17,134:138='value',<Key>,13:4]
[#18,141:140='<EOF>',<EOF>,14:0]
but how to let it recognize the Key, Equal and Value definitons?
ANTLR's lexer rules match as much characters as possible, that is why you're seeing all these Value tokens being created (they match the most characters).
Lexical modes seem like a good fit to use here. Something like this:
lexer grammar PropertiesLexer;
COMMENT
: [!#] ~[\r\n]* -> skip
;
KEY
: ( '\\' ~[\r\n] | ~[\r\n\\=:] )+
;
EQUAL
: [=:] -> pushMode(VALUE_MODE)
;
NL
: [\r\n]+ -> skip
;
mode VALUE_MODE;
VALUE
: ( ~[\\\r\n] | '\\' . )+
;
END_VALUE
: [\r\n]+ -> skip, popMode
;

ParserRule matching the wrong token

I'm trying to learn a bit ANTLR4 and define a grammar for some 4GL language.
This is what I've got:
compileUnit
:
typedeclaration EOF
;
typedeclaration
:
ID LPAREN DATATYPE INT RPAREN
;
DATATYPE
:
DATATYPE_ALPHANUMERIC
| DATATYPE_NUMERIC
;
DATATYPE_ALPHANUMERIC
:
'A'
;
DATATYPE_NUMERIC
:
'N'
;
fragment
DIGIT
:
[0-9]
;
fragment
LETTER
:
[a-zA-Z]
;
INT
:
DIGIT+
;
ID
:
LETTER
(
LETTER
| DIGIT
)*
;
LPAREN
:
'('
;
RPAREN
:
')'
;
WS
:
[ \t\f]+ -> skip
;
What I want to be able to parse:
TEST (A10)
what I get:
typedeclaration:1:6: mismatched input 'A10' expecting DATATYPE
I am however able to write:
TEST (A 10)
Why do I need to put a whitespace in here? The LPAREN DATATYPE in itself is working, so there is no need for a space inbetween. Also the INT RPAREN is working.
Why is a space needed between DATATYPE and INT? I'm a bit confused on that one.
I guess that it's matching ID because it's the "longest" match, but there must be some way to force to be lazier here, right?
You should ignore 'A' and 'N' chats at first position of ID. As #CoronA noticed ANTLR matches token as long as possible (length of ID 'A10' more than length of DATATYPE_ALPHANUMERIC 'A'). Also read this: Priority rules. Try to use the following grammar:
grammar expr;
compileUnit
: typedeclaration EOF
;
typedeclaration
: ID LPAREN datatype INT RPAREN
;
datatype
: DATATYPE_ALPHANUMERIC
| DATATYPE_NUMERIC
;
DATATYPE_ALPHANUMERIC
: 'A'
;
DATATYPE_NUMERIC
: 'N'
;
INT
: DIGIT+
;
ID
: [b-mo-zB-MO-Z] (LETTER | DIGIT)*
;
LPAREN
: '('
;
RPAREN
: ')'
;
WS
: [ \t\f]+ -> skip
;
fragment
DIGIT
: [0-9]
;
fragment
LETTER
: [a-zA-Z]
;
Also you can use the following grammar without id restriction. Data types will be recognized earlier than letters. it's not clear too:
grammar expr;
compileUnit
: typedeclaration EOF
;
typedeclaration
: id LPAREN datatype DIGIT+ RPAREN
;
id
: (datatype | LETTER) (datatype | LETTER | DIGIT)*
;
datatype
: DATATYPE_ALPHANUMERIC
| DATATYPE_NUMERIC
;
DATATYPE_ALPHANUMERIC: 'A';
DATATYPE_NUMERIC: 'N';
// List with another Data types.
LETTER: [a-zA-Z];
LPAREN
: '('
;
RPAREN
: ')'
;
WS
: [ \t\f]+ -> skip
;
DIGIT
: [0-9]
;

ANTLR3 Resolving Grammar With non-LL(*) decisions

As a homework I should create a parser for VC language V10.1 using ANTLR.
Here is my work:
grammar MyVCgrm2;
program : ( funcDecl | varDecl )*
;
funcDecl :
type identifier paraList compoundStmt
;
varDecl
: type initDeclaratorList ';'
;
initDeclaratorList
:  initDeclarator ( ',' initDeclarator )*
;
initDeclarator
: declarator ( '=' initialiser )?
;
declarator
: identifier identifier2 
;
identifier2
: |'[' INTLITERAL? ']'
;
initialiser
: expr
                |  '{' expr ( ',' expr ) *  '}'
;
// primitive types
type
: 'void' | 'boolean' | 'int' | 'float'
;
// identifiers
identifier
: ID
;
// statements
compoundStmt
: '{' varDecl*  stmt* '}'
;
stmt
:
compoundStmt
        |  ifStmt
        |  forStmt
        |  whileStmt
        |  breakStmt
        |  continueStmt
        |  returnStmt
|  exprStmt
;
ifStmt
: 'if' '(' expr ')' stmt ( 'else' stmt )?
;
forStmt
: 'for' '(' expr? ';' expr? ';' expr? ')' stmt
;
whileStmt
: 'while' '(' expr ')' stmt
;
breakStmt
: 'break' ';'
;
continueStmt
: 'continue' ';'
;
returnStmt
: 'return' expr? ';'
;
exprStmt
: expr? ';'
;
// expressions
expr
: assignmentExpr
;
assignmentExpr
: ( condOrExpr '=' )*  condOrExpr
;
condOrExpr
: condAndExpr condOrExpr2
;
condOrExpr2
: '||' condAndExpr condOrExpr
|
;
condAndExpr
: equalityExpr condAndExpr2
;
condAndExpr2
: '&&' equalityExpr condAndExpr2|
;
equalityExpr
: relExpr equalityExpr2
;
equalityExpr2
: '==' relExpr equalityExpr2 | '!=' relExpr equalityExpr2|
;
relExpr
:       additiveExpr relExpr2 
;
relExpr2
: '<' additiveExpr relExpr | '<=' additiveExpr relExpr | '>' additiveExpr relExpr|  '>=' additiveExpr relExpr|
;
additiveExpr
:
multiplicativeExpr additiveExpr2
;
additiveExpr2
:
'+' multiplicativeExpr additiveExpr2
| '-' multiplicativeExpr additiveExpr2
|
;
multiplicativeExpr
:
unaryExpr multiplicativeExpr2
;
multiplicativeExpr2
:
'*' unaryExpr multiplicativeExpr2
| '/' unaryExpr multiplicativeExpr2
|
;
unaryExpr
: '+' unaryExpr
                |  '-' unaryExpr
                |  '!' unaryExpr
                |  primaryExpr
;
primaryExpr
: identifier primaryExpr2
        | '(' expr ')'
                | INTLITERAL
                | FLOATLITERAL
                | BOOLLITERAL
                | STRINGLITERAL
;
primaryExpr2
: argList?
                | '[' expr ']'
;
// parameters
paraList
: '(' properParaList? ')'
;
properParaList
: paraDecl ( ',' paraDecl ) *
;
paraDecl
: type declarator
;
argList
:'(' properArgList? ')'
;
properArgList
: arg ( ',' arg ) *
;
arg
: expr
;
BOOLLITERAL
: 'true'|'false'
;
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'?')*
;
INTLITERAL : '0'..'9'+
;
FLOATLITERAL
: ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
| '.' ('0'..'9')+ EXPONENT?
| ('0'..'9')+ EXPONENT
;
COMMENT
: '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
| '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
STRINGLITERAL
: '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
;
CHAR: '\'' ( ESC_SEQ | ~('\''|'\\') ) '\''
;
fragment
EXPONENT : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;
fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
fragment
ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
fragment
OCTAL_ESC
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment
UNICODE_ESC
: '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;
But when by trying generate code I got these errors in console:
[13:48:38] Checking Grammar MyVCgrm2.g...
[13:48:38] warning(200): MyVCgrm2.g:53:27:
Decision can match input such as "'else'" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
[13:48:38] error(211): MyVCgrm2.g:78:21: [fatal] rule assignmentExpr has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
[13:48:38] error(211): MyVCgrm2.g:111:2: [fatal] rule additiveExpr2 has non-LL(*) decision due to recursive rule invocations reachable from alts 1,3. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
[13:48:38] warning(200): MyVCgrm2.g:111:2:
Decision can match input such as "'+' ID" using multiple alternatives: 1, 3
As a result, alternative(s) 3 were disabled for that input
[13:48:38] error(211): MyVCgrm2.g:142:5: [fatal] rule primaryExpr2 has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
Warnings and errors are about these productions (in order): ifStmt, assignmentExpr, additiveExpr2, primaryExpr2.
I've done all left factorings and left recursion eliminations. I don't understand what ANTLR meant by "Resolve by left-factoring".
Why do I get these errors and how do I resolve them?
Thank you

Guide or approval for ANTLR example

I have an AlgebraRelacional.g4 file with this. I need to read a file with a syntax like a CSV file, put the content in some memory tables and then resolve relational algebra operations with that. Can you tell me if I am doing it right?
Example data file to read:
cod_buy(char);name_suc(char);Import(int);date_buy(date)
“P-11”;”DC Med”;900;01/03/14
“P-14”;”Center”;1500;02/05/14
Current ANTLR grammar:
grammar AlgebraRelacional;
SEL : '\u03C3'
;
PRO : '\u220F'
;
UNI : '\u222A'
;
DIF : '\u002D'
;
PROC : '\u0058'
;
INT : '\u2229'
;
AND : 'AND'
;
OR : 'OR'
;
NOT : 'NOT'
;
EQ : '='
;
DIFERENTE : '!='
;
MAYOR : '>'
;
MENOR : '<'
;
SUMA : '+'
;
MULTI : '*'
;
IPAREN : '('
;
DPAREN : ')'
;
COMA : ','
;
PCOMA : ';'
;
Comillas: '"'
;
file : hdr row+ ;
hdr : row ;
row : field (',' field)* '\r'? '\n' ;
field : TEXT | STRING | ;
TEXT : ~[,\n\r"]+ ;
STRING : '"' ('""'|~'"')* '"' ;
I suggest you that read this document (http://is.muni.cz/th/208197/fi_b/bc_thesis.pdf), It contains usefull information about how to write a parser for relational algebra. That is not ANTLR, but you only has to translate the grammar in BNF to EBNF.

Antlr parsing numbers problem

I have a problem parsing integer & hex numbers. I want to parse C++ enums with the following rules:
grammar enum;
rule_enum
: 'enum' ID '{' enum_values+ '}'';';
enum_values
: enum_value (COMMA enum_value)+;
enum_value
: ID ('=' number)?;
number : hex_number | integer_number;
hex_number
: '0' 'x' HEX_DIGIT+;
integer_number
: DIGIT+;
fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
fragment
DIGIT : ('0'..'9');
COMMA : ',';
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
The problem I have is the following - when parsing code like:
enum Enum
{
Option1 = 0,
Option2 = 1
};
it does not recognize the 0 as integer_number but tries to parse it as hex_number. How can I resolve this?
Thank you.
Tobias
First, fragment rules can only be "seen" by lexer rules, not parser rules. So, the following is invalid:
integer_number
: DIGIT+; // can't use DIGIT here!
fragment
DIGIT : ('0'..'9');
To fix your ambiguity with these numbers, it's IMO best to make these integer- and hex numbers lexer rules instead of parser rules.
An example:
grammar enum;
rule_enum
: 'enum' ID '{' enum_values+ '}'';';
enum_values
: enum_value (COMMA enum_value)+;
enum_value
: ID ('=' number)?;
number
: HEX_NUMBER
| INTEGER_NUMBER
;
HEX_NUMBER
: '0' 'x' HEX_DIGIT+;
INTEGER_NUMBER
: DIGIT+;
fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
fragment
DIGIT : ('0'..'9');
COMMA : ',';
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
SPACE : (' ' | '\t' | '\r' | '\n') {skip();};
which produces the following parse tree of your example snippet:
The following ANTLR works for just the number bit of the enum.
(editted to include Bart's advice below)
grammar enum;
number :
integer_number | hex_number ;
hex_number
: HEX_NUMBER;
integer_number
: INT_NUMBER;
HEX_NUMBER
: HEX_INTRO HEX_DIGIT+;
INT_NUMBER
: DIGIT+;
HEX_INTRO
: '0x';
DIGIT : ('0'..'9');
HEX_DIGIT
: ('0'..'9'|'a'..'f'|'A'..'F') ;