I'm using a cut-down version of a Pascal grammar to create a compiler that converts Pascal to JavaScript, but I keep running into these errors:
line 3:4 no viable alternative at input 'PROCEDURE'
line 3:38 extraneous input ':' expecting {'END', ';'}
line 5:4 no viable alternative at input 'VAR'
The following are the relevant parts of my grammar:
grammar pascal;
program
: programHeading ('INTERFACE')?
block
DOT
;
programHeading
: 'PROGRAM' identifier (LPAREN identifierList RPAREN)? SEMI
| 'UNIT' identifier SEMI
;
identifier
: IDENT
;
block
: ( labelDeclarationPart
| constantDefinitionPart
| typeDefinitionPart
| variableDeclarationPart
| procedureAndFunctionDeclarationPart
| usesUnitsPart
| 'IMPLEMENTATION'
)*
| compoundStatement
;
procedureAndFunctionDeclarationPart
: procedureOrFunctionDeclaration SEMI
;
procedureOrFunctionDeclaration
: procedureDeclaration
| functionDeclaration
;
procedureDeclaration
: 'PROCEDURE' identifier (formalParameterList)? SEMI
( block | directive )
;
functionDeclaration
: 'FUNCTION' identifier (formalParameterList)? COLON resultType SEMI
( block | directive )
;
compoundStatement
: 'BEGIN'
statements
'END'
;
statements
: statement ( SEMI statement )*
;
statement
: label COLON unlabelledStatement
| unlabelledStatement
;
I'm using antlr-4.5-complete and was hoping someone could shed some light on this.
This is the program I'm trying to compile:
PROGRAM Lesson1_PROGRAM3;
BEGIN
PROCEDURE DrawLine(X : Integer; Y : Integer);
VAR
Num1, Num2, Sum : Integer;
BEGIN
Write('Input number 1:');
Readln(Num1);
Writeln('Input number 2:');
Readln(Num2);
Sum := Num1 + Num2;
Writeln(Sum);
Readln;
IF Sel = '1' THEN
BEGIN
Total := N1 + N2;
Write('Press any key TO continue...');
Readkey;
GOTO 1;
END;
FOR Counter := 1 TO 7 DO
writeln('for loop');
Readln;
END;
END.
Original question:
My code to parse:
N100G1M4
What I expected: N100 G1 M4
But ANTLR cannot identify this, because ANTLR always matches the longest substring?
How to handle the case?
Update
What I am going to do:
I am trying to parse CNC G-code text and extract keywords from a file stream. G-code is typically used to control a machine and drive its motors.
The G-code grammar is:
// Define a grammar called GCode
grammar GCode;
script : blocks+ EOF;
blocks:
assign_stat
| ncblock
| NEWLINE
;
ncblock :
ncelements NEWLINE //
;
ncelements :
ncelement+
;
ncelement
:
LINENUMEXPR // linenumber N100
| GCODEEXPR // G10 G54.1
| MCODEEXPR // M30
| coordexpr // X100 Y100 Z[A+b*c]
| FeedExpr // F10.12
| AccExpr // E2.0
// | callSubroutine
;
assign_stat:
VARNAME '=' expression NEWLINE
;
expression:
multiplyingExpression (('+' | '-') multiplyingExpression)*
;
multiplyingExpression
: powExpression (('*' | '/') powExpression)*
;
powExpression
: signedAtom ('^' signedAtom)*
;
signedAtom
: '+' signedAtom
| '-' signedAtom
| atom
;
atom
: scientific
| variable
| '(' expression ')'
;
LINENUMEXPR: 'N' Digit+ ;
GCODEEXPR : 'G' GPOSTFIX;
MCODEEXPR : 'M' INT;
coordexpr:
CoordExpr
| ParameterKeyword getValueExpr
;
getValueExpr:
'[' expression ']'
;
CoordExpr
:
ParameterKeyword SCIENTIFIC_NUMBER
;
ParameterKeyword: [XYZABCUVWIJKR];
FeedExpr: 'F' SCIENTIFIC_NUMBER;
AccExpr: 'E' SCIENTIFIC_NUMBER;
fragment
GPOSTFIX
: Digit+ ('.' Digit+)*
;
variable
: VARNAME
;
scientific
: SCIENTIFIC_NUMBER
;
SCIENTIFIC_NUMBER
: SIGN? NUMBER (('E' | 'e') SIGN? NUMBER)?
;
fragment NUMBER
: ('0' .. '9') + ('.' ('0' .. '9') +)?
;
HEX_INTEGER
: '0' [xX] HEX_DIGIT+
;
fragment HEX_DIGIT
: [0-9a-fA-F]
;
INT : Digit+;
fragment
Digit : [0-9];
fragment
SIGN
: ('+' | '-')
;
VARNAME
: [a-zA-Z_][a-zA-Z_0-9]*
;
NEWLINE
: '\r'? '\n'
;
WS : [ \t]+ -> skip ; // skip spaces and tabs (newlines are significant, see NEWLINE)
Sample program (it works well except for the last line):
N200 G54.1
a = 100
b = 10
c = a + b
Z[a + b*c]
N002 G2 X30.1 Y20.1 I20.1 J0.1 K0.2 R20
N100 G1X100.5Z[VAR1+100]M3H3 // it works well except the last line
I want to parse N100G1X100.5YE5Z[VAR1+100]M3H3 into
-> N100 G1 X100 Z[VAR1+100]
-> or, better still, split the node X100 into two subnodes, X and 100.
I am trying to use ANTLR, but ANTLR always applies the "longest match wins" rule, so N100G1X100 is identified as a single word.
Additional question:
What's the best tool to finish the task?
ANTLR has a strict separation between parser and lexer, and therefore the lexer operates in a predictable way (longest match wins). So if you have some sort of identifier rule that matches N100G1M4 but sometimes want to match N100, G1 and M4 separately, you're out of luck.
How to handle the case?
The only answer one can give (with the amount of details given) is: remove the rule that matches N100G1M4 as 1 token. If that is something you cannot do, then don't use ANTLR, but use a "scannerless" parser.
Scannerless Parser Generators
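As a rough illustration of the "remove the rule that matches N100G1M4 as 1 token" advice, here is a minimal hand-written scanner in Python. It is not ANTLR and not the asker's grammar; the names WORD and split_words are made up for this sketch, and it only handles plain address-letter-plus-number words (no bracketed expressions). Because no catch-all identifier pattern exists, nothing can swallow the whole block as one token.

```python
import re

# One G-code "word": an address letter followed by a (possibly signed, decimal) number.
# Deliberately no identifier pattern such as [A-Za-z_][A-Za-z_0-9]* here.
WORD = re.compile(r"([A-Z])([+-]?\d+(?:\.\d+)?)")


def split_words(block):
    """Return (letter, number) pairs; raise on anything that is not a plain word."""
    text = block.replace(" ", "")
    words = []
    pos = 0
    while pos < len(text):
        m = WORD.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}: {text[pos:]!r}")
        words.append((m.group(1), m.group(2)))
        pos = m.end()
    return words


print(split_words("N100G1M4"))  # [('N', '100'), ('G', '1'), ('M', '4')]
```

The same effect in an ANTLR lexer would presumably require that no other token rule can match a longer prefix of the block than the individual word rules do.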
This may be a newbie question, since I don't have a lot of ANTLR experience, but I've done a lot of research and troubleshooting without finding a solution, so I'm resorting to asking. I am trying to write a parser for a very odd file format (PCGEN, an open source role-playing game character editor) that I plan to use for several purposes, not the least of which is learning ANTLR. I have everything I want working in the lexer and parser, except that parsing stops when it hits blank lines.
I know I could add a rule to throw away all whitespace, but the file format is such that strings are not really quoted and whitespace is usually significant, so the only whitespace that should be ignored is a totally blank line. When I run the lexer it produces the tokens for the entire file, so I assumed the parser would process the tokens without concern for where they came from; I must be missing something simple. Here is the beginning of my input:
PCGVERSION:2.0
# System Information
CAMPAIGN:Advanced Player's Guide|CAMPAIGN:Ultimate Magic|CAMPAIGN:Ultimate Combat
VERSION:6.07.05
ROLLMETHOD:3|EXPRESSION:2d6+6
PURCHASEPOINTS:N
And this is my current grammar:
grammar PCG;
pcgFile : lines=line+;
line : statement (NEWLINE | EOF)
;
statement : KEYWORD ASSIGN
| KEYWORD ASSIGN YES_NO
| KEYWORD ASSIGN TEXT
| KEYWORD ASSIGN VERSIONNUM
| KEYWORD ( ASSIGN INT )+
| KEYWORD ASSIGN INT
| KEYWORD ASSIGN SUB_START statement SUB_END
| statement SEP statement
;
NEWLINE : '\r\n' | '\r' | '\n' ;
YES_NO : ('Y'|'N');
KEYWORD : [A-Z]+;
INT : [0-9]+;
TEXT : ~(':'|'|'|'\r'|'\n'|'['|']')+;
ASSIGN : ':';
SEP : '|';
COMMENT : '#' ~[\r\n]*->skip ;
VERSIONNUM : ([0-9]+ ('.' [0-9]+)?)
| ('.' [0-9]+)
| ([0-9]+ ('.' [0-9]+) ('.' [0-9]+)?)
;
ROLL : INT [dD] INT (('+'|'-') INT)?;
SUB_START : '[';
SUB_END : ']';
Any help would be appreciated.
You need to allow for more than 1 newline between statements. Do that by removing the line rule and rewriting pcgFile like this:
pcgFile : NEWLINE* statement ( NEWLINE+ statement )* NEWLINE* EOF;
The main problem is that your lexer matches # System Information as a TEXT token. Whenever 2 or more rules match the same number of characters, the rule defined first "wins" *, and here that is TEXT. If you place COMMENT before TEXT, it will work:
grammar PCG;
pcgFile : NEWLINE* statement ( NEWLINE+ statement )* NEWLINE* EOF;
statement : KEYWORD ASSIGN
| KEYWORD ASSIGN YES_NO
| KEYWORD ASSIGN TEXT
| KEYWORD ASSIGN VERSIONNUM
| KEYWORD ( ASSIGN INT )+
| KEYWORD ASSIGN INT
| KEYWORD ASSIGN SUB_START statement SUB_END
| statement SEP statement
;
NEWLINE : '\r\n' | '\r' | '\n' ;
YES_NO : ('Y'|'N');
KEYWORD : [A-Z]+;
INT : [0-9]+;
COMMENT : '#' ~[\r\n]* ->skip ;
TEXT : ~(':'|'|'|'\r'|'\n'|'['|']')+;
ASSIGN : ':';
SEP : '|';
VERSIONNUM : ([0-9]+ ('.' [0-9]+)?)
| ('.' [0-9]+)
| ([0-9]+ ('.' [0-9]+) ('.' [0-9]+)?)
;
ROLL : INT [dD] INT (('+'|'-') INT)?;
SUB_START : '[';
SUB_END : ']';
Keep in mind that ~(':'|'|'|'\r'|'\n'|'['|']')+ is dangerous: it could easily match a lot of characters.
* Because the lexer works like this, input like 12 will never be tokenised as a VERSIONNUM token, since INT matches it too and occurs before VERSIONNUM. Fix it by doing something like this:
statement : ...
| KEYWORD ASSIGN versionnum
| ...
;
versionnum : VERSIONNUM
| INT
;
...
INT : [0-9]+;
...
VERSIONNUM : [0-9]* '.' [0-9]+ ('.' [0-9]+)?
;
...
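The tie-breaking behaviour described in this answer can be sketched with a small Python simulation. This is not ANTLR itself: the function next_token is made up for this sketch, and Python regexes stand in for the lexer rules (for these particular patterns, the greedy regex match is also the longest match). It picks the longest match and, on a tie, the rule listed first.

```python
import re


def next_token(rules, text, pos=0):
    """Pick the rule with the longest match at pos; ties go to the earliest rule."""
    best_name, best_len = None, 0
    for name, pattern in rules:
        m = re.compile(pattern).match(text, pos)
        # Strict '>' means a later rule must match MORE characters to win.
        if m and len(m.group()) > best_len:
            best_name, best_len = name, len(m.group())
    return best_name, text[pos:pos + best_len]


TEXT = r"[^:|\r\n\[\]]+"
COMMENT = r"#[^\r\n]*"
original = [("TEXT", TEXT), ("COMMENT", COMMENT)]
reordered = [("COMMENT", COMMENT), ("TEXT", TEXT)]

line = "# System Information"
print(next_token(original, line))   # ('TEXT', '# System Information')
print(next_token(reordered, line))  # ('COMMENT', '# System Information')

# The footnote case: INT listed before a VERSIONNUM that requires a '.'.
version_rules = [("INT", r"[0-9]+"),
                 ("VERSIONNUM", r"[0-9]*\.[0-9]+(\.[0-9]+)?")]
print(next_token(version_rules, "12"))       # ('INT', '12')
print(next_token(version_rules, "6.07.05"))  # ('VERSIONNUM', '6.07.05')
```

With the revised VERSIONNUM, a plain integer is always an INT, which is why the parser rule versionnum has to accept both token types.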
I am using ANTLR 2.7.6.
I am writing a parser for the PLC language IEC 61131-3 ST (Structured Text), and I can't resolve an issue with my grammar.
The grammar is:
case_Stmt : 'CASE' expression 'OF' case_Selection + ( 'ELSE' stmt_List )? 'END_CASE';
case_Selection : case_List ':' stmt_List;
case_List : case_List_Elem ( ',' case_List_Elem )*;
case_List_Elem : subrange | constant_Expr;
constant_Expr : constant | enum_Value;
stmt_List : ( stmt ? ';' )*;
stmt : assign_Stmt | subprog_Ctrl_Stmt | selection_Stmt | iteration_Stmt;
assign_Stmt : ( variable ':=' expression );
enum_Value : ( identifier '#' )? identifier;
variable : identifier | ...
The problem occurs when an "enum_Value" starts a "case_Selection": the parser interprets it as a new "stmt" instead of the new "case_Selection" it was supposed to be.
Example:
CASE (enumVariable) OF
enum#literal1: Variable1 := 1;
enum#literal2: Variable1 := 2;
enum#literal3: Variable1 := 3;
ELSE
Variable1 := 4;
END_CASE;
In the above example, instead of taking "enum#literal2" as the new "case_Selection", it interprets it as an "assign_Stmt" and reports an error because it doesn't find the ':='.
Is there a way to read ahead until we find the ':' or the ':=' to determine whether we really have a new "stmt" or not?
Thank you!
Edit1: better syntax;
I'm trying to develop a grammar to parse a DSL using ANTLR4 (my first attempt at using it).
The grammar itself is somewhat similar to SQL, in the sense that it should be able to parse commands like the following:
select type1.attribute1 type2./xpath_expression[#id='test 1'] type3.* from source1 source2
fromDate 2014-01-12T00:00:00.123456+00:00 toDate 2014-01-13T00:00:00.123456Z
where (type1.attribute2 = "XX" AND
(type1.attribute3 <= "2014-01-12T00:00:00.123456+00:00" OR
type2./another_xpath_expression = "YY"))
EDIT: I've updated the grammar, switching CHAR, SYMBOL and DIGIT to fragments as suggested by [lucas_trzesniewski], but I did not see any improvement.
Attached is the parse tree, as suggested by Terence. I also get the following in the console (I'm getting more confused...):
warning(125): API.g4:16:8: implicit definition of token 'CHAR' in parser
warning(125): API.g4:20:31: implicit definition of token 'SYMBOL' in parser
line 1:12 mismatched input 'p' expecting {'.', NUMBER, CHAR, SYMBOL}
line 1:19 mismatched input 't' expecting {'.', NUMBER, CHAR, SYMBOL}
line 1:27 mismatched input 'm' expecting {'.', NUMBER, CHAR, SYMBOL}
line 1:35 mismatched input '#' expecting {NUMBER, CHAR, SYMBOL}
line 1:58 no viable alternative at input 'm'
line 3:13 no viable alternative at input '(deco.m'
I was able to put together the bulk of the grammar, but it fails to match all the tokens properly, which results in incorrect parsing depending on the complexity of the input.
From browsing the internet, it seems that the main reason is the lexer selecting the longest matching sequence, but even after several attempts at rewriting the lexer and parser rules I could not arrive at a robust set.
Below are my grammar and some test cases.
What would be the correct way to specify the rules? Should I use lexer modes?
GRAMMAR
grammar API;
get : K_SELECT (((element) )+ | '*')
'from' (source )+
( K_FROM_DATE dateTimeOffset )? ( K_TO_DATE dateTimeOffset )?
('where' expr )?
EOF
;
element : qualifier DOT attribute;
qualifier : 'raw' | 'std' | 'deco' ;
attribute : ( word | xpath | '*') ;
word : CHAR (CHAR | NUMBER)*;
xpath : (xpathFragment+);
xpathFragment
: '/' ( DOT | CHAR | NUMBER | SYMBOL )+
| '[' (CHAR | NUMBER | SYMBOL )+ ']'
;
source : ( 'system1' | 'system2' | 'ALL') ; // should be generalised.
date : (NUMBER MINUS NUMBER MINUS NUMBER) ;
time : (NUMBER COLON NUMBER (COLON NUMBER ( DOT NUMBER )?)? ( 'Z' | SIGN (NUMBER COLON NUMBER )));
dateTimeOffset : date 'T' time;
filter : (element OP value) ;
value : QUOTE .+? QUOTE ;
expr
: filter
| '(' expr 'AND' expr ')'
| '(' expr 'OR' expr ')'
;
K_SELECT : 'select';
K_RANGE : 'range';
K_FROM_DATE : 'fromDate';
K_TO_DATE : 'toDate' ;
QUOTE : '"' ;
MINUS : '-';
SIGN : '+' | '-';
COLON : ':';
COMMA : ',';
DOT : '.';
OP : '=' | '<' | '<=' | '>' | '>=' | '!=';
NUMBER : DIGIT+;
fragment DIGIT : ('0'..'9');
fragment CHAR : [a-z] | [A-Z] ;
fragment SYMBOL : '#' | [-_=] | '\'' | '/' | '\\' ;
WS : [ \t\r\n]+ -> skip ;
NONWS : ~[ \t\r\n];
TEST 1
select raw./priobj/tradeid/margin[#id='222'] deco.* deco.marginType from system1 system2
fromDate 2014-01-12T00:00:00.123456+00:00 toDate 2014-01-13T00:00:00.123456Z
where ( deco.marginType >= "MV" AND ( ( raw.CretSysInst = "RMS_EXODUS" OR deco.ExtSysNum <= "1234" ) OR deco.ExtSysStr = "TEST Spaced" ) )
TEST 2
select * from ALL
TEST 3
select deco./xpath/expr/text() deco./xpath/expr[a='3' and b gt '6] raw.* from ALL where raw.attr3 = "myvalue"
The image shows that my grammar is unable to recognise several parts of the commands.
What puzzles me a bit is that the individual parts work properly on their own,
e.g. parsing only the 'expr', as shown by the tree below.
That kind of thing: word : (CHAR (CHAR | NUMBER)+); is indeed a job for the lexer, not the parser.
This: DIGIT : ('0'..'9'); should be a fragment. Same goes for this: CHAR : [a-z] | [A-Z] ;. That way, you could write NUMBER : DIGIT+;, and WORD : CHAR (CHAR | DIGIT)*;
The reason is simple: you want to deal with meaningful tokens in your parser, not with parts of words. Think of the lexer as the thing that will "cut" the input text at meaningful points. Later on, you want to process full words, not individual characters. So think about where it is most meaningful to make those cuts.
Now, as the ANTLR master has pointed out, to debug your problem, dump the parse tree and see what goes on.
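To make the "cut the input at meaningful points" idea concrete, here is a small Python sketch of a tokenizer that hands whole words and numbers to the next stage instead of single characters. It is an illustration only, not the ANTLR lexer: TOKEN_SPEC, MASTER and tokenize are names invented here, the token set is a simplified guess at what this DSL needs, and the regex alternation uses ordered choice rather than ANTLR's longest-match rule.

```python
import re

# Hypothetical token set, loosely modelled on the WORD / NUMBER suggestion above.
TOKEN_SPEC = [
    ("WORD",   r"[A-Za-z][A-Za-z0-9]*"),
    ("NUMBER", r"[0-9]+"),
    ("STRING", r'"[^"]*"'),
    ("OP",     r"<=|>=|!=|=|<|>"),   # two-character operators listed first
    ("DOT",    r"\."),
    ("WS",     r"[ \t\r\n]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Return (type, text) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"cannot tokenize at offset {pos}: {text[pos:]!r}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


print(tokenize('deco.marginType >= "MV"'))
# [('WORD', 'deco'), ('DOT', '.'), ('WORD', 'marginType'), ('OP', '>='), ('STRING', '"MV"')]
```

A parser rule such as element then only has to deal with five tokens here, rather than with every individual letter of deco and marginType.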
I'm currently working on an ANTLR grammar that accepts SQL statements. I want to use this grammar to let programmers write MySQL queries, which the application will automatically convert into the proper format for the target database.
For example, if you use LIMIT 0,5 in your query, it will automatically be transformed into the proper format for MSSQL.
This is my grammar so far:
grammar sql2;
query
: select ';'? EOF
;
select
: 'SELECT' top? select_exp 'FROM' table ('WHERE' compare_exp)? limit?
;
compare_exp
: field CompareOperator param (BooleanOperator compare_exp)?
;
param
: '#' ID
;
top
: 'TOP' INT
;
limit
: 'LIMIT' INT (',' INT)?
;
select_exp
: field (',' select_exp)?
;
table
: '`' ID '`' ('.`' ID '`')?
| ID ('.' ID)?
;
field
: table
;
ID : ('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
CompareOperator
: ( '=' | '<>' )
;
BooleanOperator
: ('AND' | 'OR')
;
INT : '0'..'9'+
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
If I test this with the input
SELECT TOP 5 bla_asdf, bal, lab FROM `x`.`y` WHERE xdf = #tf AND bla = #b LIMIT 0, 5
it stops parsing my query at the AND bla = #b; at that point it gives me a NoViableAltException...
If I put the input
SELECT TOP 5 bla_asdf, bal, lab FROM `x`.`y` WHERE xdf = #tf LIMIT 0, 5
It will give me no problems whatsoever.
I'm absolutely no expert at ANTLR, which also means that I don't see what I'm doing wrong here.
Can anybody help me on this?
Cheers
AND is tokenized as an ID, because the first lexer rule matching the longest input fragment wins.
So you should define BooleanOperator before ID.
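The effect of rule order on the failing WHERE clause can be sketched in Python. This simulates the behaviour as the answer describes it (longest match, then first rule) and is not the ANTLR-generated lexer; tokenize and the HASH rule name are made up for this sketch (in the grammar, '#' is an inline literal).

```python
import re


def tokenize(rules, text):
    """Longest match wins; on a tie the rule listed first wins. Whitespace is dropped."""
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best = None  # (rule name, end offset)
        for name, rx in compiled:
            m = rx.match(text, pos)
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"cannot tokenize at offset {pos}: {text[pos:]!r}")
        tokens.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return tokens


ID = ("ID", r"[A-Za-z][A-Za-z0-9_]*")
BOOL = ("BooleanOperator", r"AND|OR")
OTHER = [("CompareOperator", r"=|<>"), ("HASH", r"#")]

clause = "xdf = #tf AND bla = #b"
as_posted = tokenize([ID, BOOL] + OTHER, clause)   # ID defined before BooleanOperator
reordered = tokenize([BOOL, ID] + OTHER, clause)   # BooleanOperator defined first

print(as_posted[4])   # ('ID', 'AND')
print(reordered[4])   # ('BooleanOperator', 'AND')
print(tokenize([BOOL, ID] + OTHER, "ORDER"))  # [('ID', 'ORDER')]
```

The last line shows that moving BooleanOperator up does not break identifiers that merely start with a keyword: the longer ID match still wins.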