Handling blank lines when White Space is important in ANTLR4 - grammar

This may be a newbee question, since I don't have a lot of ANTLR experience, but I've done a lot of research and troubleshooting and have not found a solution so resorting to asking. I am trying to write a parser for a very odd format file (PCGEN open source role playing game character editor) that I plan to use for several uses, not the least of which is learning ANTLR. I am to the point that I have everything I want working on the LEX and Parse, except that it stops parsing when it hits blank lines. I know I could add a line to throw away all whitespace, but the file format is such that strings are not really quoted, and white space is usually important, so the only white space that should be ignored is a totally blank line. When I run the Lexer it gives the tokens for the entire file, so I thought the Parser would process the tokens without concern for where they came from, so I am missing something simple. Here is the beggining of my input:
PCGVERSION:2.0
# System Information
CAMPAIGN:Advanced Player's Guide|CAMPAIGN:Ultimate Magic|CAMPAIGN:Ultimate Combat
VERSION:6.07.05
ROLLMETHOD:3|EXPRESSION:2d6+6
PURCHASEPOINTS:N
And this is my current grammar:
grammar PCG;
pcgFile : lines=line+;
line : statement (NEWLINE | EOF)
;
statement : KEYWORD ASSIGN
| KEYWORD ASSIGN YES_NO
| KEYWORD ASSIGN TEXT
| KEYWORD ASSIGN VERSIONNUM
| KEYWORD ( ASSIGN INT )+
| KEYWORD ASSIGN INT
| KEYWORD ASSIGN SUB_START statement SUB_END
| statement SEP statement
;
NEWLINE : '\r\n' | 'r' | '\n' ;
YES_NO : ('Y'|'N');
KEYWORD : [A-Z]+;
INT : [0-9]+;
TEXT : ~(':'|'|'|'\r'|'\n'|'['|']')+;
ASSIGN : ':';
SEP : '|';
COMMENT : '#' ~[\r\n]*->skip ;
VERSIONNUM : ([0-9]+ ('.' [0-9]+)?)
| ('.' [0-9]+)
| ([0-9]+ ('.' [0-9]+) ('.' [0-9]+)?)
;
ROLL : INT [dD] INT (('+'|'-') INT)?;
SUB_START : '[';
SUB_END : ']';
Any help would be appreciated.

You need to allow for more than 1 new line between statements. Do that by removing the rule and rewriting to this:
pcgFile : NEWLINE* statement ( NEWLINE+ statement )* NEWLINE* EOF;
The main problem is that your lexer matches # System Information as a TEXT token. Whenever 2 or more rules match the same amount of characters, the rule defined first will "win" *. So that's TEXT. When you place COMMENT before TEXT, it will work:
grammar PCG;
pcgFile : NEWLINE* statement ( NEWLINE+ statement )* NEWLINE* EOF;
statement : KEYWORD ASSIGN
| KEYWORD ASSIGN YES_NO
| KEYWORD ASSIGN TEXT
| KEYWORD ASSIGN VERSIONNUM
| KEYWORD ( ASSIGN INT )+
| KEYWORD ASSIGN INT
| KEYWORD ASSIGN SUB_START statement SUB_END
| statement SEP statement
;
NEWLINE : '\r\n' | 'r' | '\n' ;
YES_NO : ('Y'|'N');
KEYWORD : [A-Z]+;
INT : [0-9]+;
COMMENT : '#' ~[\r\n]* ->skip ;
TEXT : ~(':'|'|'|'\r'|'\n'|'['|']')+;
ASSIGN : ':';
SEP : '|';
VERSIONNUM : ([0-9]+ ('.' [0-9]+)?)
| ('.' [0-9]+)
| ([0-9]+ ('.' [0-9]+) ('.' [0-9]+)?)
;
ROLL : INT [dD] INT (('+'|'-') INT)?;
SUB_START : '[';
SUB_END : ']';
Keep in mind that ~(':'|'|'|'\r'|'\n'|'['|']')+ is dangerous: it could easily match a lot of characters.
* because the lexer works like this, input like 12 will never be tokenised as a VERSIONNUM token since INT matches this too an occurs before VERSIONNUM. Fix it by doing something like this:
statement : ...
| KEYWORD ASSIGN versionnum
| ...
;
versionnum : VERSIONNUM
| INT
;
...
INT : [0-9]+;
...
VERSIONNUM : [0-9]* '.' [0-9]+ ('.' [0-9]+)?
;
...

Related

ANTLR4 - How to close "longest-match-wins" and use first match rule?

Orignial question:
My code to parse:
N100G1M4
What I expcted: N100 G1 M4
But ANTLR can not idetify this because ANTLR always match longest substring?
How to handle the case?
Update
What I am going to do:
I am trying to parse CNC G-Code txt and get keywords from a file stream, which is usually used to control a machine and drive motors to move.
The G-Code rule is :
// Define a grammar called Hello
grammar GCode;
script : blocks+ EOF;
blocks:
assign_stat
| ncblock
| NEWLINE
;
ncblock :
ncelements NEWLINE //
;
ncelements :
ncelement+
;
ncelement
:
LINENUMEXPR // linenumber N100
| GCODEEXPR // G10 G54.1
| MCODEEXPR // M30
| coordexpr // X100 Y100 Z[A+b*c]
| FeedExpr // F10.12
| AccExpr // E2.0
// | callSubroutine
;
assign_stat:
VARNAME '=' expression NEWLINE
;
expression:
multiplyingExpression ('+' | '-') multiplyingExpression
;
multiplyingExpression
: powExpression (('*' | '/') powExpression)*
;
powExpression
: signedAtom ('^' signedAtom)*
;
signedAtom
: '+' signedAtom
| '-' signedAtom
| atom
;
atom
: scientific
| variable
| '(' expression ')'
;
LINENUMEXPR: 'N' Digit+ ;
GCODEEXPR : 'G' GPOSTFIX;
MCODEEXPR : 'M' INT;
coordexpr:
CoordExpr
| ParameterKeyword getValueExpr
;
getValueExpr:
'[' expression ']'
;
CoordExpr
:
ParameterKeyword SCIENTIFIC_NUMBER
;
ParameterKeyword: [XYZABCUVWIJKR];
FeedExpr: 'F' SCIENTIFIC_NUMBER;
AccExpr: 'E' SCIENTIFIC_NUMBER;
fragment
GPOSTFIX
: Digit+ ('.' Digit+)*
;
variable
: VARNAME
;
scientific
: SCIENTIFIC_NUMBER
;
SCIENTIFIC_NUMBER
: SIGN? NUMBER (('E' | 'e') SIGN? NUMBER)?
;
fragment NUMBER
: ('0' .. '9') + ('.' ('0' .. '9') +)?
;
HEX_INTEGER
: '0' [xX] HEX_DIGIT+
;
fragment HEX_DIGIT
: [0-9a-fA-F]
;
INT : Digit+;
fragment
Digit : [0-9];
fragment
SIGN
: ('+' | '-')
;
VARNAME
: [a-zA-Z_][a-zA-Z_0-9]*
;
NEWLINE
: '\r'? '\n'
;
WS : [ \t]+ -> skip ; // skip spaces, tabs, newlines
Sample program(it works well except the last line):
N200 G54.1
a = 100
b = 10
c = a + b
Z[a + b*c]
N002 G2 X30.1 Y20.1 I20.1 J0.1 K0.2 R20
N100 G1X100.5Z[VAR1+100]M3H3 // it works well except the last line
I want to parse N100G1X100.5YE5Z[VAR1+100]M3H3 to
-> N100 G1 X100 Z[VAR1+100]
-> or it will be better to split the node X100 to two subnode X 100:
I am trying to use ANTLR, but ANTLR always take the rule "longest match wins". N100G1X100 is identified to a word.
Append question:
What's the best tool to finish the task?
ANTLR has a strict separation between pasrer and lexer, and therefor the lexer operates in a predictable way (longest match wins). So if you have some sort of identifier rule that matches N100G1M4 but sometimes want to match N100, G1 and M4 separately, you're out of luck.
How to handle the case?
The only answer one can give (with the amount of details given) is: remove the rule that matches N100G1M4 as 1 token. If that is something you cannot do, then don't use ANTLR, but use a "scannerless" parser.
Scannerless Parser Generators

Grammar for ANLTR 4

I'm trying to develop a grammar to parse a DSL using ANTLR4 (first attempt at using it)
The grammar itself is somewhat similar to SQL in the sense that should
It should be able to parse commands like the following:
select type1.attribute1 type2./xpath_expression[#id='test 1'] type3.* from source1 source2
fromDate 2014-01-12T00:00:00.123456+00:00 toDate 2014-01-13T00:00:00.123456Z
where (type1.attribute2 = "XX" AND
(type1.attribute3 <= "2014-01-12T00:00:00.123456+00:00" OR
type2./another_xpath_expression = "YY"))
EDIT: I've updated the grammar switching CHAR, SYMBOL and DIGIT to fragment as suggested by [lucas_trzesniewski], but I did not manage to get improvements.
Attached is the parse tree as suggested by Terence. I get also in the console the following (I'm getting more confused...):
warning(125): API.g4:16:8: implicit definition of token 'CHAR' in parser
warning(125): API.g4:20:31: implicit definition of token 'SYMBOL' in parser
line 1:12 mismatched input 'p' expecting {'.', NUMBER, CHAR, SYMBOL}
line 1:19 mismatched input 't' expecting {'.', NUMBER, CHAR, SYMBOL}
line 1:27 mismatched input 'm' expecting {'.', NUMBER, CHAR, SYMBOL}
line 1:35 mismatched input '#' expecting {NUMBER, CHAR, SYMBOL}
line 1:58 no viable alternative at input 'm'
line 3:13 no viable alternative at input '(deco.m'
I was able to put together the bulk of the grammar, but it fails to properly match all the tokens, therefore resulting in incorrect parsing depending on the complexity of the input.
By browsing on internet it seems to me that the main reason is down to the lexer selecting the longest matching sequence, but even after several attempts of rewriting lexer and grammar rules I could not achieve a robust set.
Below are my grammar and some test cases.
What would be the correct way to specify the rules? should I use lexer modes ?
GRAMMAR
grammar API;
get : K_SELECT (((element) )+ | '*')
'from' (source )+
( K_FROM_DATE dateTimeOffset )? ( K_TO_DATE dateTimeOffset )?
('where' expr )?
EOF
;
element : qualifier DOT attribute;
qualifier : 'raw' | 'std' | 'deco' ;
attribute : ( word | xpath | '*') ;
word : CHAR (CHAR | NUMBER)*;
xpath : (xpathFragment+);
xpathFragment
: '/' ( DOT | CHAR | NUMBER | SYMBOL )+
| '[' (CHAR | NUMBER | SYMBOL )+ ']'
;
source : ( 'system1' | 'system2' | 'ALL') ; // should be generalised.
date : (NUMBER MINUS NUMBER MINUS NUMBER) ;
time : (NUMBER COLON NUMBER (COLON NUMBER ( DOT NUMBER )?)? ( 'Z' | SIGN (NUMBER COLON NUMBER )));
dateTimeOffset : date 'T' time;
filter : (element OP value) ;
value : QUOTE .+? QUOTE ;
expr
: filter
| '(' expr 'AND' expr ')'
| '(' expr 'OR' expr ')'
;
K_SELECT : 'select';
K_RANGE : 'range';
K_FROM_DATE : 'fromDate';
K_TO_DATE : 'toDate' ;
QUOTE : '"' ;
MINUS : '-';
SIGN : '+' | '-';
COLON : ':';
COMMA : ',';
DOT : '.';
OP : '=' | '<' | '<=' | '>' | '>=' | '!=';
NUMBER : DIGIT+;
fragment DIGIT : ('0'..'9');
fragment CHAR : [a-z] | [A-Z] ;
fragment SYMBOL : '#' | [-_=] | '\'' | '/' | '\\' ;
WS : [ \t\r\n]+ -> skip ;
NONWS : ~[ \t\r\n];
TEST 1
select raw./priobj/tradeid/margin[#id='222'] deco.* deco.marginType from system1 system2
fromDate 2014-01-12T00:00:00.123456+00:00 toDate 2014-01-13T00:00:00.123456Z
where ( deco.marginType >= "MV" AND ( ( raw.CretSysInst = "RMS_EXODUS" OR deco.ExtSysNum <= "1234" ) OR deco.ExtSysStr = "TEST Spaced" ) )
TEST 2
select * from ALL
TEST 3
select deco./xpath/expr/text() deco./xpath/expr[a='3' and b gt '6] raw.* from ALL where raw.attr3 = "myvalue"
The image shows that my grammar is unable to recognise several parts of the commands
What is a bit puzzling me is that the single parts are instead working properly,
e.g. parsing only the 'expr' as shown by the tree below
That kind of thing: word : (CHAR (CHAR | NUMBER)+); is indeed a job for the lexer, not the parser.
This: DIGIT : ('0'..'9'); should be a fragment. Same goes for this: CHAR : [a-z] | [A-Z] ;. That way, you could write NUMBER : CHAR+;, and WORD: CHAR (CHAR | NUMBER)*;
The reason is simple: you want to deal with meaningful tokens in your parser, not with parts of words. Think of the lexer as the thing that will "cut" the input text at meaningful points. Later on, you want to process full words, not individual characters. So think about where is it most meaningful to make those cuts.
Now, as the ANTLR master has pointed out, to debug your problem, dump the parse tree and see what goes on.

How to allow an identifer which can start with a digit without causing MismatchedTokenException

I want to match the following input:
statement span=1m 0_dur=12
with the following grammar:
options {
language = Java;
output=AST;
ASTLabelType=CommonTree;
}
statement :'statement' 'span' '=' INTEGER 'm' ident '=' INTEGER;
INTEGER
: DIGIT+
;
ident : IDENT | 'AVG' | 'COUNT';
IDENT
: (LETTER | DIGIT | '_')+ ;
WHITESPACE
: ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
fragment
LETTER : ('a'..'z' | 'A'..'Z') ;
fragment
DIGIT : '0'..'9';
but it cause an error:
MismatchedTokenException : line 1:15 mismatched input '1m' expecting '\u0004'
Does anyone has any idea how to solve this?
THanks
Charles
I think your grammar is context sensitive, even at the lexical analyser(Tokenizer) level. The string "1m" is recognized as IDENT, not INTEGER followed by 'm'. You either redefine your syntax, or use predicated parsing, or embed Java code in your grammar to detect the context (e.g. If the number is presented after "span" followed by "=", then parse it as INTEGER).

antlr lexer rule matching a prefix of another rule

I am not sure that the issue is actually the prefixes, but here goes.
I have these two rules in my grammar (among many others)
DOT_T : '.' ;
AND_T : '.AND.' | '.and.' ;
and I need to parse strings like this:
a.eq.b.and.c.ne.d
c.append(b)
this should get lexed as:
ID[a] EQ_T ID[b] AND_T ID[c] NE_T ID[d]
ID[c] DOT_T ID[append] LPAREN_T ID[b] RPAREN_T
the error I get for the second line is:
line 1:3 mismatched character "p"; expecting "n"
It doesn't lex the . as a DOT_T but instead tries to match .and. because it sees the a after ..
Any idea on what I need to do to make this work?
UPDATE
I added the following rule and thought I'd use the same trick
NUMBER_T
: DIGIT+
( (DECIMAL)=> DECIMAL
| (KIND)=> KIND
)?
;
fragment DECIMAL
: '.' DIGIT+ ;
fragment KIND
: '.' DIGIT+ '_' (ALPHA+ | DIGIT+) ;
but when I try parsing this:
lda.eq.3.and.dim.eq.3
it gives me the following error:
line 1:9 no viable alternative at character "a"
while lexing the 3. So I'm guessing the same thing is happening as above, but the solution doesn't work in this case :S Now I'm properly confused...
Yes, that is because of the prefixed '.'-s.
Whenever the lexer stumbles upon ".a", it tries to create a AND_T token. If the characters "nd" can then not be found, the lexer tries to construct another token that starts with a ".a", which isn't present (and ANTLR produces an error). So, the lexer will not give back the character "a" and fall back to create a DOT_T token (and then an ID token)! This is how ANTLR works.
What you can do is optionally match these AND_T, EQ_T, ... inside the DOT_T rule. But still, you will need to "help" the lexer a bit by adding some syntactic predicates that force the lexer to look ahead in the character stream to be sure it can match these tokens.
A demo:
grammar T;
parse
: (t=. {System.out.printf("\%-10s '\%s'\n", tokenNames[$t.type], $t.text);})* EOF
;
DOT_T
: '.' ( (AND_T)=> AND_T {$type=AND_T;}
| (EQ_T)=> EQ_T {$type=EQ_T; }
| (NE_T)=> NE_T {$type=NE_T; }
)?
;
ID
: ('a'..'z' | 'A'..'Z')+
;
LPAREN_T
: '('
;
RPAREN_T
: ')'
;
SPACE
: (' ' | '\t' | '\r' | '\n')+ {skip();}
;
NUMBER_T
: DIGIT+ ((DECIMAL)=> DECIMAL)?
;
fragment DECIMAL : '.' DIGIT+ ;
fragment AND_T : ('AND' | 'and') '.' ;
fragment EQ_T : ('EQ' | 'eq' ) '.' ;
fragment NE_T : ('NE' | 'ne' ) '.' ;
fragment DIGIT : '0'..'9';
And if you feed the generated parser the input:
a.eq.b.and.c.ne.d
c.append(b)
the following output will be printed:
ID 'a'
EQ_T '.eq.'
ID 'b'
AND_T '.and.'
ID 'c'
NE_T '.ne.'
ID 'd'
ID 'c'
DOT_T '.'
ID 'append'
LPAREN_T '('
ID 'b'
RPAREN_T ')'
And for the input:
lda.eq.3.and.dim.eq.3
the following is printed:
ID 'lda'
EQ_T '.eq.'
NUMBER_T '3'
AND_T '.and.'
ID 'dim'
EQ_T '.eq.'
NUMBER_T '3'
EDIT
The fact that DECIMAL and KIND both start with '.' DIGIT+ is not good. Try something like this:
NUMBER_T
: DIGIT+ ((DECIMAL)=> DECIMAL ((KIND)=> KIND)?)?
;
fragment DECIMAL : '.' DIGIT+;
fragment KIND : '_' (ALPHA+ | DIGIT+); // removed ('.' DIGIT+) from this fragment
Note that the rule NUMBER_T will now never produce DECIMAL or KIND tokens. If you want that to happen, you need to change the type:
NUMBER_T
: DIGIT+ ((DECIMAL)=> DECIMAL {/*change type*/} ((KIND)=> KIND {/*change type*/})?)?
;

Parse sentences with different word types

I'm looking for a grammar for analyzing two type of sentences, that
means words separated by white spaces:
ID1: sentences with words not beginning with numbers
ID2: sentences with words not beginning with numbers and numbers
Basically, the structure of the grammar should look like
ID1 separator ID2
ID1: Word can contain number like Var1234 but not start with a number
ID2: Same as above but 1234 is allowed
separator: e. g. '='
#Bart
I just tried to add two tokens '_' and '"' as lexer-rule Special for later use in lexer-rule Word.
Even I haven't used Special in the following grammar, I get the following error in ANTLRWorks 1.4.2:
The following token definitions can never be matched because prior tokens match the same input: Special
But when I add fragment before Special, I don't get that error. Why?
grammar Sentence1b1;
tokens
{
TCUnderscore = '_' ;
TCQuote = '"' ;
}
assignment
: id1 '=' id2
;
id1
: Word+
;
id2
: ( Word | Int )+
;
Int
: Digit+
;
// A word must start with a letter
Word
: ( 'a'..'z' | 'A'..'Z') ('a'..'z' | 'A'..'Z' | Digit )*
;
Special
: ( TCUnderscore | TCQuote )
;
Space
: ( ' ' | '\t' | '\r' | '\n' ) { $channel = HIDDEN; }
;
fragment Digit
: '0'..'9'
;
Lexer-rule Special shall then be used in lexer-rule Word:
Word
: ( 'a'..'z' | 'A'..'Z' | Special ) ('a'..'z' | 'A'..'Z' | Special | Digit )*
;
I'd go for something like this:
grammar Sentence;
assignment
: id1 '=' id2
;
id1
: Word+
;
id2
: (Word | Int)+
;
Int
: Digit+
;
// A word must start with a letter
Word
: ('a'..'z' | 'A'..'Z') ('a'..'z' | 'A'..'Z' | Digit)*
;
Space
: (' ' | '\t' | '\r' | '\n') {skip();}
;
fragment Digit
: '0'..'9'
;
which will parse the input:
Word can contain number like Var1234 but not start with a number = Same as above but 1234 is allowed
as follows:
EDIT
To keep lexer rule nicely packed together, I'd keep them all at the bottom of the grammar instead of partly in the tokens { ... } block, which I only use for defining "imaginary tokens" (used in AST creation):
// wrong!
Special : (TCUnderscore | TCQuote);
TCUnderscore : '_';
TCQuote : '"';
Now, with the rules above, TCUnderscore and TCQuote can never become a token because when the lexer stumbles upon a _ or ", a Special token is created. Or in this case:
// wrong!
TCUnderscore : '_';
TCQuote : '"';
Special : (TCUnderscore | TCQuote);
the Special token can never be created because the lexer would first create TCUnderscore and TCQuote tokens. Hence the error:
The following token definitions can never be matched because prior tokens match the same input: ...
If you make TCUnderscore and TCQuote a fragment rule, you don't have that problem because fragment rules only "serve" other lexer rules. So this works:
// good!
Special : (TCUnderscore | TCQuote);
fragment TCUnderscore : '_';
fragment TCQuote : '"';
Also, fragment rules can therefor never be "visible" in any of your parser rules (the lexer will never create a TCUnderscore or TCQuote token!).
// wrong!
parse : TCUnderscore;
Special : (TCUnderscore | TCQuote);
fragment TCUnderscore : '_';
fragment TCQuote : '"';
I'm not sure if that fits your needs but with Bart's help in my post
ANTLR - identifier with whitespace
i came to this grammar:
grammar PropertyAssignment;
assignment
: id_nodigitstart '=' id_digitstart EOF
;
id_nodigitstart
: ID_NODIGITSTART+
;
id_digitstart
: (ID_DIGITSTART|ID_NODIGITSTART)+
;
ID_NODIGITSTART
: ('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'0'..'9')*
;
ID_DIGITSTART
: ('0'..'9'|'a'..'z'|'A'..'Z')+
;
WS : (' ')+ {skip();}
;
"a name = my 4value" works while "4a name = my 4value" causes an exception.