I am new to ANTLR4 and I am trying to visualize the parse tree of a simple text input with this grammar:
grammar Expr;
contract: (I WS SEND WS quantity WS asset WS TO WS beneficiary WS ON WS send_date WS)*;
asset: '$'| 'TND' | 'USD';
quantity:Q;
beneficiary: B;
send_date : day SLASH month SLASH year;
day: D ;
month: M ;
year: Y ;
B : LETTERUP (LETTERLOW+)+ LETTERLOW*;
Q : DIGITO DIGITZ*|DIGITO DIGITZ* POINT DIGITZ*;
D : DIGIT0 DIGITO|(DIGIT1|DIGIT2)DIGITZ|DIGIT3(DIGIT0|DIGIT1);
M : DIGIT0 DIGITO| DIGIT1(DIGIT0|DIGIT1|DIGIT2);
Y : DIGIT2 DIGIT0((DIGIT1(DIGIT7|DIGIT8|DIGIT9))|(DIGIT2 DIGITZ));
I: 'I';
SEND: 'send';
TO:'to' ;
ON: 'on';
LETTER : [a-zA-Z];
LETTERUP : [A-Z];
LETTERLOW : [a-z];
DIGITZ : [0-9];
DIGITO : [1-9];
DIGIT0 : [0];
DIGIT1 : [1];
DIGIT2 : [2];
DIGIT3 : [3];
DIGIT4 : [4];
DIGIT5 : [5];
DIGIT6 : [6];
DIGIT7 : [7];
DIGIT8 : [8];
DIGIT9 : [9];
SLASH:'/';
POINT:'.'|',';
WS : (' ' | '\t' |'\n' |'\r' )+ ;
But the parser keeps reporting a mismatch on send_date.
I know it is a seriously complex numerical grammar. I just wanted some control over the values: 01 <= day <= 31, 01 <= month <= 12 and 2017 <= year <= 2029, that's all.
Is there any help? Thanks.
The problem happens because your lexer rules overlap: 07 matches D as well as M, and 2017 matches Q as well as Y.
You can fix it like this:
grammar Expr;
contract: (I WS SEND WS quantity WS asset WS TO WS beneficiary WS ON WS send_date WS)*;
asset: '$'| 'TND' | 'USD';
quantity:Q;
beneficiary: B;
send_date : day month year ;
day: D ;
month: M ;
year: Y ;
D : DIGIT0 DIGITO|(DIGIT1|DIGIT2)DIGITZ|DIGIT3(DIGIT0|DIGIT1);
M : SLASH (DIGIT0 DIGITO| DIGIT1(DIGIT0|DIGIT1|DIGIT2));
Y : SLASH (DIGIT2 DIGIT0((DIGIT1(DIGIT7|DIGIT8|DIGIT9))|(DIGIT2 DIGITZ)));
B : LETTERUP (LETTERLOW+)+ LETTERLOW*;
Q : DIGITO DIGITZ*|DIGITO DIGITZ* POINT DIGITZ*;
I: 'I';
SEND: 'send';
TO:'to' ;
ON: 'on';
LETTER : [a-zA-Z];
LETTERUP : [A-Z];
LETTERLOW : [a-z];
DIGITZ : [0-9];
DIGITO : [1-9];
DIGIT0 : [0];
DIGIT1 : [1];
DIGIT2 : [2];
DIGIT3 : [3];
DIGIT4 : [4];
DIGIT5 : [5];
DIGIT6 : [6];
DIGIT7 : [7];
DIGIT8 : [8];
DIGIT9 : [9];
SLASH:'/';
POINT:'.'|',';
WS : (' ' | '\t' |'\n' |'\r' )+ ;
That's a seriously complex numerical grammar. Perhaps you could simplify:
day: NUMBER ;
month: NUMBER ;
year: NUMBER ;
NUMBER : DIGITZ+ ;
DIGITZ : [0-9];
You could enforce semantics like limiting year to [2017...2020] or whatever in your code. Just an idea. Simplifying often helps and then you can enhance it from there, knowing if you make a mistake you can always revert to something that will at least work.
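As a rough illustration of enforcing the ranges in code rather than in the lexer, here is a minimal Python sketch. The function name `validate_send_date` and the default year bounds (taken from the question, 2017 to 2029) are assumptions for illustration; in a real project the check would live in a listener or visitor over the parse tree.

```python
from datetime import date

def validate_send_date(text, min_year=2017, max_year=2029):
    """Parse 'dd/mm/yyyy' and enforce the ranges from the question."""
    parts = text.split("/")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected dd/mm/yyyy, got {text!r}")
    day, month, year = (int(p) for p in parts)
    if not min_year <= year <= max_year:
        raise ValueError(f"year {year} is outside {min_year}..{max_year}")
    # date() raises ValueError for impossible values such as month 13 or day 32
    return date(year, month, day)
```

Because `date()` does the day and month checks, impossible dates such as 31/02 are rejected as well, which the lexer-only approach would have let through.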
EDIT:
The reason your grammar doesn't work is because the month is being lexed as a day:
[#0,0:0='I',<'I'>,1:0]
[#1,1:1=' ',<WS>,1:1]
[#2,2:5='send',<'send'>,1:2]
[#3,6:6=' ',<WS>,1:6]
[#4,7:9='300',<Q>,1:7]
[#5,10:10=' ',<WS>,1:10]
[#6,11:11='$',<'$'>,1:11]
[#7,12:12=' ',<WS>,1:12]
[#8,13:14='to',<'to'>,1:13]
[#9,15:15=' ',<WS>,1:15]
[#10,16:20='Ahmed',<B>,1:16]
[#11,21:21=' ',<WS>,1:21]
[#12,22:23='on',<'on'>,1:22]
[#13,24:24=' ',<WS>,1:24]
[#14,25:26='03',<D>,1:25]
[#15,27:27='/',<'/'>,1:27]
[#16,28:29='07',<D>,1:28] <-- see, this is being lexed as a D (day)
[#17,30:30='/',<'/'>,1:30]
[#18,31:34='2017',<Q>,1:31] <-- and this is being lexed as a Q (quantity)
[#19,35:36='\r\n',<WS>,1:35]
[#20,37:36='<EOF>',<EOF>,2:0]
line 1:28 mismatched input '05' expecting M
line 1:31 mismatched input '2017' expecting Y
When several lexer rules match the same longest piece of input, ANTLR picks the rule that appears first in the grammar. D (day) appears before M (month), and Q (quantity) appears before Y (year), hence the improper lexing. Honestly, this is a scenario where I think you need to simplify and just accept numbers. Then, in your code, enforce the semantics (make sure the year is in range, etc.) and provide a helpful error message to the user if values are out of range. Your total effort will be lower that way.
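To make the tie-break concrete, here is a toy Python simulation (not ANTLR itself, which builds a proper automaton). The regexes are hand-written approximations of the question's Q, D, M and Y rules, and `next_token` / `tokenize` are hypothetical helper names. The only behaviour it models is: longest match wins, and on a tie the rule listed first wins.

```python
import re

# Simplified regex versions of the question's lexer rules, in grammar order.
RULES = [
    ("Q", r"[1-9][0-9]*(?:[.,][0-9]*)?"),
    ("D", r"0[1-9]|[12][0-9]|3[01]"),
    ("M", r"0[1-9]|1[0-2]"),
    ("Y", r"20(?:1[789]|2[0-9])"),
    ("SLASH", r"/"),
]

def next_token(text, pos, rules=RULES):
    """Pick the longest match; on a tie the rule listed first wins."""
    best = None
    for name, pattern in rules:
        m = re.compile(pattern).match(text, pos)
        # strictly longer only, so an earlier rule keeps a tie
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

def tokenize(text, rules=RULES):
    pos, out = 0, []
    while pos < len(text):
        tok = next_token(text, pos, rules)
        if tok is None:
            raise ValueError(f"no rule matches at position {pos}")
        out.append(tok)
        pos += len(tok[1])
    return out

print(tokenize("03/07/2017"))
```

With this rule order the month comes out as D and the year as Q, matching the token dump above. No reordering can fix D versus M, since both rules match 01 to 12 with the same length.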
NEW VERSION
grammar Test2;
contract: (I SEND quantity asset TO beneficiary ON send_date)*;
asset: '$'| 'TND' | 'USD';
send_date : DATE ;
quantity: NUMBER;
beneficiary: B;
DATE : NUMBER SLASH NUMBER SLASH NUMBER ;
B : LETTERUP (LETTERLOW+)+ LETTERLOW*;
I: 'I';
SEND: 'send';
TO:'to' ;
ON: 'on';
LETTER : [a-zA-Z];
LETTERUP : [A-Z];
LETTERLOW : [a-z];
NUMBER: DIGIT+;
DIGIT : [0-9];
SLASH:'/';
POINT:'.'|',';
WS : [ \t\n\r]+ -> skip;
Improvements:
1. Whitespace is skipped by the lexer, which is much more conventional.
2. Simplified number syntax.
3. It works:
[#0,0:0='I',<'I'>,1:0]
[#1,2:5='send',<'send'>,1:2]
[#2,7:9='300',<NUMBER>,1:7]
[#3,11:11='$',<'$'>,1:11]
[#4,13:14='to',<'to'>,1:13]
[#5,16:20='Ahmed',<B>,1:16]
[#6,22:23='on',<'on'>,1:22]
[#7,25:34='03/07/2017',<DATE>,1:25]
[#8,37:36='<EOF>',<EOF>,2:0]
Problem: I simplified away the ability to do decimal numbers for quantity. You can add that back in as you wish.
Hello, when running antlr4 with the following input I get the following error:
I have been trying to fix it by making changes here and there, but it only seems to work if I write every component of whileLoop on a new line.
Could you please tell me what I am missing here and why the problem persists?
grammar AM;
COMMENTS :
'{'~[\n|\r]*'}' -> skip
;
body : ('BODY' ' '*) anything | 'BODY' 'BEGIN' anything* 'END' ;
anything : whileLoop | write ;
write : 'WRITE' '(' '"' sentance '"' ')' ;
read : 'READ' '(' '"' sentance '"' ')' ;
whileLoop : 'WHILE' expression 'DO' ;
block : 'BODY' anything 'END';
expression : 'TRUE'|'FALSE' ;
test : ID? {System.out.println("Done");};
logicalOperators : '<' | '>' | '<>' | '<=' | '>=' | '=' ;
numberExpressionS : (NUMBER numberExpression)* ;
numberExpression : ('-' | '/' | '*' | '+' | '%') NUMBER ;
sentance : (ID)* {System.out.println("Sentance");};
WS : [ \t\r\n]+ -> skip ;
NUMBER : [0-9]+ ;
ID : [a-zA-Z0-9]* ;
Your lexer rules produce conflicts:
body : ('BODY' ' '*) anything | 'BODY' 'BEGIN' anything* 'END' ;
vs
WS : [ \t\r\n]+ -> skip ;
The critical part is the ' '*. The literal ' ' defines an implicit lexer token that matches a single space, and implicit tokens are defined above WS. So a single space is not handled as WS (and skipped) but is emitted as the implicit token.
If I am right, putting tabs between the components of whileLoop will work, and putting more than one space between them should work too, because WS then gives the longer match. You should simply remove ' '*, since whitespace is skipped anyway.
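The effect can be imitated with a small Python toy (again not ANTLR, just the longest-match, first-rule-wins idea). The rule table and the name `visible_tokens` are assumptions for illustration, and WORD stands in for the keywords.

```python
import re

# Rules in priority order: the implicit literal ' ' comes before the named rule WS.
# The third field says whether the token is skipped.
RULES = [
    ("' '", re.compile(r" "), False),
    ("WS", re.compile(r"[ \t\r\n]+"), True),
    ("WORD", re.compile(r"[A-Z]+"), False),
]

def visible_tokens(text):
    """Return the token names the parser would see for the given input."""
    pos, seen = 0, []
    while pos < len(text):
        candidates = []
        for name, pattern, skip in RULES:
            m = pattern.match(text, pos)
            if m:
                candidates.append((name, m.group(), skip))
        if not candidates:
            raise ValueError(f"no rule matches at position {pos}")
        # max() keeps the first of several equally long matches
        name, lexeme, skip = max(candidates, key=lambda c: len(c[1]))
        if not skip:
            seen.append(name)
        pos += len(lexeme)
    return seen
```

For `"WHILE TRUE DO"` the single spaces show up as `' '` tokens between the words, while for `"WHILE  TRUE\tDO"` (two spaces, then a tab) only the three words remain.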
Hello all, I have developed an ANTLR4 grammar. While parsing the string
Time;25 10 * * *;'faccalc_minus1_cron.out.'yyyyMMdd.HHmm;America/New_York
I get the following errors:
Invalid chars in expression! Expression: ;' Invalid chars: ;'
extraneous input ';' expecting {'', INTEGER, '-', '/', ','}
missing ';' at '_'
Incorrect timezone format :faccalc_minus1
I don't understand why, as the regEx rule contains '_'.
How to fix it?
Regards,
Vladimir
lexer grammar FileTriggerLexer;
CRON
:
'cron'
;
MARKET_CRON
:
'marketCron'
;
COMBINED
:
'combined'
;
FILE_FEED
:
'FileFeed'
;
MANUAL_NOTICE
:
'ManualNotice'
;
TIME
:
'Time'
;
MARKET_TIME
:
'MarketTime'
;
SCHEDULE
:
'Schedule'
;
PRODUCT
:
'Product'
;
UCA_CLIENT
:
'UCAClient'
;
APEX_GSM
:
'ApexGSM'
;
DELAY
:
'Delay'
;
CATEGORY
:
'Category'
;
EXCHANGE
:
'Exchange'
;
CALENDAR_EXCHANGE
:
'CalendarExchange'
;
FEED
:
'Feed'
;
RANGE
:
'Range'
;
SYNTH
:
'Synth'
;
TRIGGER
:
'Trigger'
;
DELAYED_TRIGGER
:
'DelayedTrigger'
;
INTRA_TRIGGER
:
'IntraTrigger'
;
CURRENT_TRIGGER
:
'CurrentTrigger'
;
CALENDAR_FILE_FEED
:
'CalendarFileFeed'
;
PREVIOUS
:
'Previous'
;
LATE_DELAY
:
'LateDelay'
;
BUILD_ARCHIVE
:
'BuildArchive'
;
COMPRESS
:
'Compress'
;
LATE_TIME
:
'LateTime'
;
CALENDAR_CATEGORY
:
'CalendarCategory'
;
APEX_GPM
:
'ApexGPM'
;
PORTFOLIO_NOTICE
:
'PortfolioNotice'
;
FixedTimeOfDay: 'FixedTimeOfDay';
SEMICOLON
:
';'
;
ASTERISK
:
'*'
;
LBRACKET
:
'('
;
RBRACKET
:
')'
;
PERCENT
:
'%'
;
INTEGER
:
[0-9]+
;
DASH
:
'-'
;
DOUBLE_QUOTE
:
'"'
;
QUOTE
:
'\''
;
SLASH
:
'/'
;
DOT
:
'.'
;
COMMA
:
','
;
UNDERSCORE
:
'_'
;
EQUAL
:
'='
;
MORE_THAN
:
'>'
;
LESS
:
'<'
;
ID
:
[a-zA-Z] [a-zA-Z0-9]*
;
WS
:
[ \t\r\n]+ -> skip
;
/**
* Define File Trigger validator grammar
*/
grammar FileTriggerValidator;
options
{
tokenVocab = FileTriggerLexer;
}
r
:
(
schedule
| file_feed
| time_feed
| market_time_feed
| manual_notice
| portfolio_notice
| not_checked
)+
;
not_checked
:
(
PRODUCT
| UCA_CLIENT
| APEX_GSM
| APEX_GPM
| DELAY
| CATEGORY
| CALENDAR_CATEGORY
| EXCHANGE
| CALENDAR_EXCHANGE
| FEED
| RANGE
| SYNTH
| TRIGGER
| DELAYED_TRIGGER
| INTRA_TRIGGER
| CURRENT_TRIGGER
| CALENDAR_FILE_FEED
| PREVIOUS
| LATE_DELAY
| LATE_TIME
| COMPRESS
| BUILD_ARCHIVE
)
(
SEMICOLON anyList
)?
;
anyList
:
anyElement
(
SEMICOLON anyElement
)*
;
anyElement
:
cron
| file_name
| with_step_value
| source_file
| timezone
| regEx
;
portfolio_notice
:
PORTFOLIO_NOTICE SEMICOLON regEx
;
manual_notice
:
MANUAL_NOTICE SEMICOLON file_name SEMICOLON timezone
;
time_feed
:
TIME SEMICOLON cron_part
(
timezone?
) SEMICOLON file_name SEMICOLON timezone
;
market_time_feed
:
MARKET_TIME SEMICOLON cron_part timezone SEMICOLON file_name SEMICOLON
timezone
(
SEMICOLON UNDERSCORE? INTEGER
)*
;
file_feed
:
file_feed_name SEMICOLON source_file SEMICOLON source_host SEMICOLON
source_host SEMICOLON regEx SEMICOLON regEx
(
SEMICOLON source_host
)*
;
regEx
:
(
ID
| DOT
| ASTERISK
| INTEGER
| PERCENT
| UNDERSCORE
| DASH
| LESS
| MORE_THAN
| EQUAL
| SLASH
| LBRACKET
| RBRACKET
| DOUBLE_QUOTE
| QUOTE
| COMMA
)+
;
source_host
:
ID
(
DASH ID
)*
;
file_feed_name
:
FILE_FEED
;
source_file
:
(
ID
| DASH
| UNDERSCORE
)+
;
schedule
:
SCHEDULE SEMICOLON schedule_defining SEMICOLON file_name SEMICOLON timezone
(
SEMICOLON DASH? INTEGER
)*
;
schedule_defining
:
cron
| market_cron
| combined_cron
;
cron
:
CRON LBRACKET DOUBLE_QUOTE cron_part timezone DOUBLE_QUOTE RBRACKET
;
market_cron
:
MARKET_CRON LBRACKET DOUBLE_QUOTE cron_part timezone DOUBLE_QUOTE COMMA
DOUBLE_QUOTE ID DOUBLE_QUOTE RBRACKET
;
combined_cron
:
COMBINED LBRACKET cron_list_element
(
COMMA cron_list_element
)* RBRACKET
;
mic_defining
:
ID
;
file_name
:
regEx
;
cron_list_element
:
cron
| market_cron
;
//
schedule_defined_string
:
cron
;
//
cron_part
:
minutes hours days_of_month month week_days
;
//
minutes
:
with_step_value
;
hours
:
with_step_value
;
//
int_list
:
INTEGER
| interval
(
COMMA INTEGER
| interval
)*
;
interval
:
INTEGER DASH INTEGER
;
//
days_of_month
:
with_step_value
;
//
month
:
with_step_value
;
//
week_days
:
with_step_value
;
//
timezone
:
timezone_part
(
SLASH timezone_part
)?
;
//
timezone_part
:
ID
(
UNDERSCORE ID
)?
;
//
with_step_value
:
(
INTEGER
| COMMA
| SLASH
| ASTERISK
| DASH
)+
;
step
:
SLASH int_list
;
To analyze this kind of problem, dump the token stream to see what the lexer is actually doing. To dump the tokens directly, see this answer. AntlrDT, for example, also provides a graphical analysis of the corresponding parse tree (I am the author of AntlrDT).
From this, it is easy to see that the first error occurs in the with_step_value rule: it does not allow for a trailing semicolon.
The second error is in the timezone_part rule: it does not allow for repeated ID UNDERSCORE occurrences.
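The difference between an optional and a repeated `UNDERSCORE ID` group can be sketched in Python by treating a sequence of token types as text and the rule as a regex over type names. This is only an analogy. `matches`, `OPTIONAL` and `REPEATED` are hypothetical names, and the repeated form is one possible relaxation, not necessarily the fix the grammar author wants.

```python
import re

def matches(rule, token_types):
    """Check a token-type sequence against a regex written over type names."""
    return re.fullmatch(rule, " ".join(token_types)) is not None

OPTIONAL = r"ID( UNDERSCORE ID)?"   # timezone_part as written in the grammar
REPEATED = r"ID( UNDERSCORE ID)*"   # assumed relaxed version

one_underscore = ["ID", "UNDERSCORE", "ID"]                        # e.g. New_York
two_underscores = ["ID", "UNDERSCORE", "ID", "UNDERSCORE", "ID"]   # e.g. a_b_c

print(matches(OPTIONAL, one_underscore))    # accepted by the rule as written
print(matches(OPTIONAL, two_underscores))   # rejected by the rule as written
print(matches(REPEATED, two_underscores))   # accepted once the group may repeat
```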
I'm new to Xtext and I'm trying to create a simple DSL for railway systems. Here's my grammar:
grammar org.xtext.railway.RailWay with org.eclipse.xtext.common.Terminals
generate railWay "http://www.xtext.org/railway/RailWay"
Model:
(trains+=Train)*
| (paths+=Path)*
| (sections+=Section)*
;
Train:
'Train' name=ID ':'
'Path' path=[Path]
'Speed' speed=INT
'end'
;
Path:
'Path' name=ID ':'
'Sections' ('{' sections+=[Section] (',' sections+=[Section] )+ '}' ) | sections+=[Section]
'end'
;
Section:
'Section' name=ID ':'
'Start' start=INT
'End' end=INT
('SpeedMax' speedMax=INT)?
'end'
;
But when I enter this code in the Eclipse instance:
Section brestStBrieux :
Start 0
End 5
end
Section StBrieuxLeMan :
Start 5
End 10
end
Section leManParis :
Start 1
End 12
end
Path brestParis :
Sections { brestStBrieux, StBrieuxLeMan, leManParis}
end
Train tgv :
Path brestParis
Speed 23
end
I got this error three times:
mismatched input '0' expecting RULE_INT
mismatched input '1' expecting RULE_INT
mismatched input '5' expecting RULE_INT
I can't see where those errors come from. What can I do to fix them? Any ideas?
Christian is right: since the FLOAT terminal is no longer defined, the original problem is resolved. However, a remaining issue is the rule
Path:
'Path' name=ID ':'
'Sections' ('{' sections+=[Section] (',' sections+=[Section] )+ '}' ) | sections+=[Section]
'end'
;
which currently has this precedence:
Path:
(
'Path' name=ID ':' 'Sections'
('{' sections+=[Section] (',' sections+=[Section] )+ '}' )
)
|
(sections+=[Section] 'end')
;
You may want to rewrite it to
Path:
'Path' name=ID ':'
'Sections'
(
('{' sections+=[Section] (',' sections+=[Section] )+ '}' )
| sections+=[Section]
) 'end'
;
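The same precedence trap exists in regular expressions, where `|` also binds loosest, so it can be reproduced in a few lines of Python. This is only an analogy to the Xtext rule: the two patterns below are hand-written stand-ins for the ungrouped and grouped versions of Path, and cross-references are reduced to `\w+`.

```python
import re

# Ungrouped: reads as ("Path x : Sections {a, b}") | ("a end")
UNGROUPED = r"Path \w+ : Sections \{\w+(, \w+)+\}|\w+ end"
# Grouped: the alternative only covers the section list, 'end' follows both branches
GROUPED = r"Path \w+ : Sections (\{\w+(, \w+)+\}|\w+) end"

multi = "Path p : Sections {a, b} end"
single = "Path p : Sections a end"

print(bool(re.fullmatch(UNGROUPED, multi)))   # the trailing 'end' is not part of the first branch
print(bool(re.fullmatch(UNGROUPED, "a end")))  # the second branch stands on its own
print(bool(re.fullmatch(GROUPED, multi)))
print(bool(re.fullmatch(GROUPED, single)))
```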
Lexing and parsing are different steps, so it does not matter that the terminal is never used: the lexer still produces it. Your grammar also becomes ambiguous (have a look at the warnings when generating the language). You should turn the terminal into a datatype rule (simply omit the terminal keyword).
=> change your grammar to
grammar org.xtext.example.mydsl2.MyDsl with org.eclipse.xtext.common.Terminals
generate myDsl "http://www.xtext.org/example/mydsl2/MyDsl"
Model:
(trains+=Train)*
| (paths+=Path)*
| (sections+=Section)*
;
Train:
'Train' name=ID ':'
'Path' path=[Path]
'Speed' speed=INT
'end'
;
Path:
'Path' name=ID ':'
'Sections' ('{' sections+=[Section] (',' sections+=[Section] )+ '}' ) | sections+=[Section]
'end'
;
Section:
'Section' name=ID ':'
'Start' start=INT
'End' end=INT
('SpeedMax' speedMax=INT)?
'end'
;
FLOAT : '-'? INT ('.' INT)?;
I would like to parse the following expressions with ANTLR4:
termspannear ( xxx, xxx , 5 , true )
termspannear ( xxx, termspannear ( xxx, xxx , 5 , true ) , 5 , true )
where termspannear functions can be nested.
Here is my grammar:
//Define a grammar to parse TermSpanNear
grammar TermSpanNear;
start : TERMSPAN ;
TERMSPAN : TERMSPANNEAR | 'xxx' ;
TERMSPANNEAR: 'termspannear' OPENP BODY CLOSEP ;
BODY : TERMSPAN COMMA TERMSPAN COMMA SLOP COMMA ORDERED ;
COMMA : ',' ;
OPENP : '(' ;
CLOSEP : ')' ;
SLOP : [0-9]+ ;
ORDERED : 'true' | 'false' ;
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
After running:
antlr4 TermSpanNear.g4
javac TermSpanNear*.java
grun TermSpanNear start -gui
termspannear ( xxx, xxx , 5 , true )
^D
line 1:0 token recognition error at: 'termspannear '
line 1:13 extraneous input '(' expecting TERMSPAN
and the tree looks like:
Can someone help me with this grammar, so that the parse tree contains all the parameters and nesting also works?
NOTE:
Following a suggestion, I rewrote it to:
//Define a grammar to parse TermSpanNear
grammar TermSpanNear;
start : termspan EOF;
termspan : termspannear | 'xxx' ;
termspannear: 'termspannear' '(' body ')' ;
body : termspan ',' termspan ',' SLOP ',' ORDERED ;
SLOP : [0-9]+ ;
ORDERED : 'true' | 'false' ;
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
I think it works now.
I'm getting the following trees:
For
termspannear ( xxx, xxx , 5 , true )
For
termspannear ( xxx, termspannear ( xxx, xxx , 5 , true ) , 5 , true )
You're using way too many lexer rules.
When you're defining a token like this:
BODY : TERMSPAN COMMA TERMSPAN COMMA SLOP COMMA ORDERED ;
then the tokenizer (lexer) will try to create the (single!) token xxx,xxx,5,true, i.e. it does not allow any spaces inside it. Lexer rules (the ones starting with a capital letter) should really be the "atoms" of your language (the smallest parts). Whenever you start creating elements like a body, you glue atoms together in parser rules, not in lexer rules.
Try something like this:
grammar TermSpanNear;
// parser rules (the elements)
start : termspan EOF ;
termspan : termspannear | 'xxx' ;
termspannear : 'termspannear' OPENP body CLOSEP ;
body : termspan COMMA termspan COMMA SLOP COMMA ORDERED ;
// lexer rules (the atoms)
COMMA : ',' ;
OPENP : '(' ;
CLOSEP : ')' ;
SLOP : [0-9]+ ;
ORDERED : 'true' | 'false' ;
WS : [ \t\r\n]+ -> skip ;
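The split between atoms and elements can also be seen in a hand-written recursive-descent sketch in Python. This is not what ANTLR generates, and the names `tokenize`, `expect`, `termspan` and `parse` are made up for illustration. The tokenizer only knows the atoms and skips whitespace; the nesting lives entirely in the parsing function.

```python
import re

# Atoms only: keywords, numbers and punctuation, with leading whitespace skipped.
TOKEN = re.compile(r"\s*(termspannear|xxx|true|false|[0-9]+|[(),])")

def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"token recognition error at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def expect(tokens, value):
    if not tokens or tokens[0] != value:
        raise SyntaxError(f"expected {value!r}")
    return tokens[1:]

def termspan(tokens):
    """termspan : 'xxx' | 'termspannear' '(' termspan ',' termspan ',' SLOP ',' ORDERED ')'"""
    if tokens and tokens[0] == "xxx":
        return "xxx", tokens[1:]
    tokens = expect(tokens, "termspannear")
    tokens = expect(tokens, "(")
    left, tokens = termspan(tokens)
    tokens = expect(tokens, ",")
    right, tokens = termspan(tokens)
    tokens = expect(tokens, ",")
    if not tokens or not tokens[0].isdigit():
        raise SyntaxError("expected SLOP")
    slop, tokens = int(tokens[0]), tokens[1:]
    tokens = expect(tokens, ",")
    if not tokens or tokens[0] not in ("true", "false"):
        raise SyntaxError("expected ORDERED")
    ordered, tokens = tokens[0] == "true", tokens[1:]
    tokens = expect(tokens, ")")
    return ("termspannear", left, right, slop, ordered), tokens

def parse(text):
    tree, rest = termspan(tokenize(text))
    if rest:
        raise SyntaxError(f"extraneous input {rest[0]!r}")
    return tree
```

Calling `parse("termspannear ( xxx, termspannear ( xxx, xxx , 5 , true ) , 5 , true )")` returns a nested tuple with the inner call as the second argument, and spacing around the commas and parentheses makes no difference.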