I'm trying to figure out the grammar for the following syntax.
foreach
where x = 1 when some_variable = true
where x = 2 when some_variable = false
where y = 0
print z // Main block
when none // optional
print 'not found' // Exception block
endfor
My grammar looks like:
foreach_stmt : 'for' 'each' where_opt* blockstmt* whennone_opt? 'endfor'
;
where_opt : 'where' cond_clause
;
cond_clause : test when_opt*
;
when_opt : 'when' test
;
whennone_opt : 'when' 'none' blockstmt*
;
test : or_test
;
// further rules omitted
But when the main block is blank, for example
foreach
where x = 1
// main block is blank, do nothing
when none
print 'none'
endfor
In this case my grammar treats "when none" as part of the cond_clause belonging to "where x = 1", which is not what I expect.
Also consider the following case:
foreach
where x = 1 when none = 2
print 'none'
// exceptional block is blank
endfor
Here "none" can be a variable, and "none = 2" should match the "test" rule, so it is part of "where...when...".
However, when "none" is not part of an expression, I want "when none" to match the "foreach" rather than the preceding "where". How can I modify my grammar to do this?
Sorry this title sucks but I don't know how to describe the problem in a few words. Any help would be greatly appreciated.
The parser generated from the following ANTLR grammar:
grammar Genexus;
parse
: foreach_stmt* EOF
;
foreach_stmt
: 'foreach' where_opt* blockstmt* whennone_opt? 'endfor'
;
where_opt
: 'where' cond_clause
;
cond_clause
: test when_opt*
;
when_opt
: 'when' test
;
whennone_opt
: 'when' 'none' blockstmt*
;
test
: identifier '=' atom
;
identifier
: 'none'
| Identifier
;
blockstmt
: 'print' atom
;
atom
: Boolean
| Number
| StringLiteral
| Identifier
;
Number
: '0'..'9'+
;
Boolean
: 'true'
| 'false'
;
Identifier
: ('a'..'z' | 'A'..'Z' | '_')+
;
StringLiteral
: '\'' ~'\''* '\''
;
Ignore
: (' ' | '\t' | '\r' | '\n') {skip();}
| '//' ~('\r' | '\n')* {skip();}
| '/*' .* '*/' {skip();}
;
Produces the following 3 parse-trees from your examples:
1
Source:
foreach
where x = 1 when some_variable = true
where x = 2 when some_variable = false
where y = 0
print z // Main block
when none // optional
print 'not found' // Exception block
endfor
Parse tree:
2
Source:
foreach
where x = 1
// main block is blank, do nothing
when none
print 'none'
endfor
Parse tree:
3
Source:
foreach
where x = 1 when none = 2
print 'none'
// exceptional block is blank
endfor
Parse tree:
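To see why the grammar above can tell the two uses of "when none" apart, here is a minimal hand-written sketch in Python of the lookahead involved. It is only an illustration, not the code ANTLR generates: the function name classify_when and the whitespace-split token lists are assumptions made for this example. The point is that test always requires an '=' after the identifier, so "when none" without a following '=' can only be the whennone_opt of the foreach.

```python
def classify_when(tokens, i):
    """Decide which rule the 'when' at tokens[i] belongs to (illustrative only)."""
    assert tokens[i] == "when"
    nxt = tokens[i + 1] if i + 1 < len(tokens) else None
    after = tokens[i + 2] if i + 2 < len(tokens) else None
    # 'none' directly followed by '=' starts a test, so 'none' is a variable there.
    # In every other case 'when none' can only be the loop's exception block.
    if nxt == "none" and after != "=":
        return "whennone_opt"
    return "when_opt"


# Examples 2 and 3 from the question, split on whitespace.
example2 = "foreach where x = 1 when none print 'none' endfor".split()
example3 = "foreach where x = 1 when none = 2 print 'none' endfor".split()

print(classify_when(example2, example2.index("when")))  # whennone_opt
print(classify_when(example3, example3.index("when")))  # when_opt
```

Two tokens of lookahead after 'when' are enough here only because test has the fixed shape identifier '=' atom in this reduced grammar; a fuller expression grammar may need a different check.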
I am working on my own grammar to parse Apache Velocity, and I ran into the issue that it detects neither normal text nor the markup.
I am getting this message for the first line of the source.
line 1:0 extraneous input '// ${Name}.java' expecting {BREAK, FOREACH, IF, INCLUDE, PARSE, SET, STOP, '#[[', RAW_TEXT, '$'}
The input '// ${Name}.java' should be tokenized to RAW_TEXT '$' '{' IDENTIFIER '}' RAW_TEXT. The parser rules should be rawText reference rawText. These parser rules are statements.
This is my source file. It is a Java template in this case, but the source file could also be an HTML template, as mentioned in the Apache Velocity user guide.
// ${Name}.java
#foreach ( $vertice in $Vertices )
#if ( $vertice.Type == "Class" )
public class $vertice.Name {
#foreach ( $edge in $Edges )
#if ( $edge.from == $vertice.Name)
// From $edge.from to $edge.to
private $edge.to $edge.to.toLowerCase();
public $edge.to get{$edge.to}() {
return this.${edge.to.toLowerCase()};
}
public void set${edge.to}(${edge.to} new${edge.to}) {
$edge.to old${edge.to} = this.${edge.to.toLowerCase()};
if (old${edge.to} != new${edge.to}) {
if (old${edge.to} != null) {
this.${edge.to.toLowerCase()} = null;
old${edge.to}.set${edge.from}(null);
}
this.${edge.to.toLowerCase()} = new${edge.to};
if (new${edge.to} != null) {
new${edge.to}.set${edge.from}(this);
}
}
}
public $edge.from with${edge.to}(${edge.to} new${edge.to}) {
this.set${edge.to}(new${edge.to});
return this;
}
#end
#end
}
#end
#end
This is my grammar.
grammar Velocity;
/* -- Parser Rules --- */
/*
* Start Rule
*/
template
: statementSet EOF?
;
/*
* Statements
*/
statementSet
: statement+
;
statement
: rawText # RawTextStatement
| unparsed # UnparsedStatement
| reference # ReferenceStatement
| setDirective # SetStatement
| ifDirective # IfStatement
| foreachDirective # ForeachStatement
| includeDirective # IncludeStatement
| parseDirective # ParseStatement
| breakDirective # BreakStatement
| stopDirective # StopStatement
;
rawText
: RAW_TEXT
;
unparsed
: UNPARSED UnparsedText=(TEXT | NL)* UNPARSED_END
;
setDirective
: SET '(' assignment ')'
;
ifDirective
: ifPart (elseifPart)* (elsePart)? END
;
foreachDirective
: FOREACH '(' variableReference 'in' enumerable ')' statementSet END
;
includeDirective
: INCLUDE '(' stringValue (',' stringValue)* ')'
;
parseDirective
: PARSE '(' stringValue ')'
;
breakDirective
: BREAK
;
stopDirective
: STOP
;
/*
* Expressions
*/
assignment
: assignableReference '=' expression
;
expression
: reference # ReferenceExpression
| string # StringLiteralExpression
| NUMBER # NumberLiteralExpression
| array # ArrayExpression
| map # MapExpression
| range # RangeExpression
| arithmeticOperation # ArithmeticOperationExpression
| booleanOperation # BooleanOperationExpression
;
enumerable
: array
| map
| range
| reference
;
stringValue
: string # StringValue_String
| reference # StringValue_Reference
;
/*
* References
*/
reference
: DOLLAR Quiet='!'? (referenceType | '{' referenceType '}')
;
assignableReference
: DOLLAR Quiet='!'? (assignableReferenceType | '{' assignableReferenceType '}')
;
referenceType
: assignableReferenceType # ReferenceType_AssignableReferenceType
| methodReference # ReferenceType_MethodReference
;
assignableReferenceType
: variableReference # AssignableReferenceType_VariableReference
| propertyReference # AssignableReferenceType_PropertyReference
;
variableReference
: IDENTIFIER indexNotation?
;
propertyReference
: IDENTIFIER ('.' IDENTIFIER)+ indexNotation?
;
methodReference
: IDENTIFIER ('.' IDENTIFIER)* '.' IDENTIFIER '(' (expression (',' expression)*)? ')' indexNotation?
;
indexNotation
: '[' NUMBER ']' # IndexNotation_Number
| '[' reference ']' # IndexNotation_Reference
| '[' string ']' # IndexNotation_String
;
/*
* Parsed Types
*/
string
: '"' stringText* '"' # DoubleQuotedString
| '\'' TEXT? '\'' # SingleQuotedString
;
stringText
: TEXT # StringText_Text
| reference # StringText_Reference
;
/*
* Container Types
*/
array
: '[' (expression (',' expression)*)? ']'
;
map
: '{' (expression ':' expression (',' expression ':' expression))? '}'
;
range
: '[' n=NUMBER '..' m=NUMBER ']'
;
/*
* Arithmetic Operators
*/
arithmeticOperation
: sum
;
sum
: term (followingTerm)*
;
followingTerm
: Operator=('+' | '-') term
;
term
: factor (followingFactor)*
;
followingFactor
: Operator=('*' | '/' | '%') factor
;
factor
: NUMBER # Factor_Number
| reference # Factor_Reference
| '(' arithmeticOperation ')' # Factor_InnerArithmeticOperation
;
/*
* Boolean Operators
*/
booleanOperation
: disjunction
;
disjunction
: conjunction (followingConjunction)*
;
followingConjunction
: Operator=OR conjunction
;
conjunction
: booleanComparison (followingBooleanComparison)*
;
followingBooleanComparison
: Operator=AND booleanComparison
;
booleanComparison
: booleanFactor (followingBooleanFactor)*
;
followingBooleanFactor
: Operator=(EQUALS | NOT_EQUALS) booleanFactor
;
booleanFactor
: BOOLEAN # BooleanFactor_Boolean
| reference # BooleanFactor_Reference
| negation # BooleanFactor_Negation
| arithmeticComparison # BooleanFactor_ArithmeticComparison
| '(' booleanOperation ')' # BooleanFactor_InnerBooleanOperation
;
arithmeticComparison
: LeftHandSide=arithmeticOperation Operator=(EQUALS | NOT_EQUALS | GREATER_THAN | GREATER_THAN_OR_EQUAL_TO | LESS_THAN | LESS_THAN_OR_EQUAL_TO) RightHandSide=arithmeticOperation
;
negation
: NOT booleanFactor
;
/*
* Conditionals
*/
ifPart
: IF '(' booleanOperation ')' statementSet
;
elseifPart
: ELSEIF '(' booleanOperation ')' statementSet
;
elsePart
: ELSE statementSet
;
/* --- Lexer Rules --- */
/*
* Comments
*/
SINGLE_LINE_COMMENT
: '##' TEXT? NL -> skip
;
MULTI_LINE_COMMENT
: '#*' (TEXT | NL)* '*#' -> skip
;
COMMENT_BLOCK
: '#**' (TEXT | NL)* '*#' -> skip
;
/*
* Directives
*/
BREAK
: '#break'
| '#{break}'
;
DEFINE
: '#define'
| '#{define}'
;
ELSE
: '#else'
| '#{else}'
;
ELSEIF
: '#elseif'
| '#{elseif}'
;
END
: '#end'
| '#{end}'
;
EVALUATE
: '#evaluate'
| '#{evaluate}'
;
FOREACH
: '#foreach'
| '#{foreach}'
;
IF
: '#if'
| '#{if}'
;
INCLUDE
: '#include'
| '#{include}'
;
MACRO
: '#macro'
| '#{macro}'
;
PARSE
: '#parse'
| '#{parse}'
;
SET
: '#set'
| '#{set}'
;
STOP
: '#stop'
| '#{stop}'
;
UNPARSED
: '#[['
;
UNPARSED_END
: ']]#'
;
/*
* Identifier
*/
DOLLAR
: '$' -> more
;
IDENTIFIER
: CHARACTER+ (CHARACTER | INTEGER | HYPHEN | UNDERSCORE)*
;
/*
* Boolean Values
*/
TRUE
: 'true'
;
FALSE
: 'false'
;
/*
* Boolean Operators
*/
EQUALS
: '=='
| 'eq'
;
NOT_EQUALS
: '!='
| 'ne'
;
GREATER_THAN
: '>'
| 'gt'
;
GREATER_THAN_OR_EQUAL_TO
: '>='
| 'ge'
;
LESS_THAN
: '<'
| 'lt'
;
LESS_THAN_OR_EQUAL_TO
: '<='
| 'le'
;
OR
: '||'
;
AND
: '&&'
;
NOT
: '!'
| 'not'
;
/*
* Literals
*/
BOOLEAN
: TRUE
| FALSE
;
NUMBER
: '-'? INTEGER
| '-'? INTEGER '.' INTEGER
;
/*
* Content
*/
RAW_TEXT
: ~[*#$]+
;
TEXT
: (ESC | SAFE_CODE_POINT)+
;
fragment ESC
: '\\' (["\\/#$!bftrn] | UNICODE)
;
fragment UNICODE
: 'u' HEX HEX HEX HEX
;
fragment HEX
: [0-9a-fA-F]
;
fragment SAFE_CODE_POINT
: ~["\\\u0000-\u001F]
;
/*
* Atomic elements
*/
CHARACTER
: [a-zA-Z]+
;
INTEGER
: [0-9]+
;
HYPHEN
: '-'
;
UNDERSCORE
: '_'
;
NL
: '\r'
| '\n'
| '\r\n'
;
WS
: ('\t' | ' ' | '\r' | '\n' | '\r\n')+ -> skip
;
What details am I missing here? What has to be done to actually parse Velocity code?
Best Regards
Update:
I have changed these lexer rules.
DOLLAR
: '$'
;
RAW_TEXT
: ~[*#$]*
;
TEXT
: (ESC | SAFE_CODE_POINT)*?
;
fragment SAFE_CODE_POINT
: ~[$"\\\u0000-\u001F]
;
And now I'm getting these messages.
[0] line 1:4 mismatched input '{Name}.java\r\n' expecting {'!', '{', IDENTIFIER}
[0] line 2:8 mismatched input ' ( ' expecting '('
[0] line 2:12 mismatched input 'vertice in ' expecting {'!', '{', IDENTIFIER}
[0] line 2:24 mismatched input 'Vertices )\r\n' expecting {'!', '{', IDENTIFIER}
[0] line 3:3 mismatched input ' ( ' expecting '('
[0] line 3:7 mismatched input 'vertice.Type == "Class" )\r\npublic class ' expecting {'!', '{', IDENTIFIER}
[0] line 4:14 mismatched input 'vertice.Name {\r\n\t' expecting {'!', '{', IDENTIFIER}
[0] line 5:9 mismatched input ' ( ' expecting '('
[0] line 5:13 mismatched input 'edge in ' expecting {'!', '{', IDENTIFIER}
[0] line 5:22 mismatched input 'Edges )\r\n\t' expecting {'!', '{', IDENTIFIER}
[0] line 6:4 mismatched input ' ( ' expecting '('
[0] line 6:8 mismatched input 'edge.from == ' expecting {'!', '{', IDENTIFIER}
[0] line 6:22 mismatched input 'vertice.Name)\r\n\t' expecting {'!', '{', IDENTIFIER}
It helped, but the lexer is still stealing the $ symbol. And why is it expecting a '{' character when the input starts with a '{' character? I will have a look at this problem.
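The first error message (mismatched input '{Name}.java\r\n') suggests that the '{' never reaches the parser as its own token: an ANTLR lexer takes the longest match at each position, and RAW_TEXT matches everything up to the next '*', '#' or '$'. The following Python sketch imitates that longest-match behaviour. It is a simplified model under stated assumptions, not the generated lexer: the rule list is a hand-picked subset, LBRACE stands in for the implicit '{' token, and RAW_TEXT uses + instead of * so that it cannot match the empty string.

```python
import re

# Simplified stand-ins for some of the updated lexer rules, in definition order.
RULES = [
    ("DOLLAR", re.compile(r"\$")),
    ("LBRACE", re.compile(r"\{")),  # hypothetical name for the implicit '{' token
    ("IDENTIFIER", re.compile(r"[A-Za-z][A-Za-z0-9_-]*")),
    ("RAW_TEXT", re.compile(r"[^*#$]+")),
]


def tokenize(text):
    """Longest-match tokenizer: the longest match wins, ties go to the earlier rule."""
    pos, out = 0, []
    while pos < len(text):
        best = None
        for name, rx in RULES:
            m = rx.match(text, pos)
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        out.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return out


tokens = tokenize("// ${Name}.java\r\n")
for kind, text in tokens:
    print(kind, repr(text))
```

In this model the text after '$' comes out as a single RAW_TEXT token, which is consistent with the error reported at line 1:4. One commonly used way around this kind of clash is to tokenize references in a separate lexer mode, so that RAW_TEXT is not active there; whether that fits this grammar is left open here.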
I'm trying to skip/ignore the text outside a custom tag:
This text is a unique token to skip <?compo 5+5 ?> also this <?compo 1+1 ?>
I tried the following lexer:
TAG_OPEN : '<?compo ' -> pushMode(COMPOSER);
mode COMPOSER;
TAG_CLOSE : ' ?>' -> popMode;
NUMBER_DIGIT : '1'..'9';
ZERO : '0';
LOGICOP
: OR
| AND
;
COMPAREOP
: EQ
| NE
| GT
| GE
| LT
| LE
;
WS : ' ';
NEWLINE : ('\r\n'|'\n'|'\r');
TAB : ('\t');
...
and parser:
instructions
: (TAG_OPEN statement TAG_CLOSE)+?;
statement
: if_statement
| else
| else_if
| if_end
| operation_statement
| mnemonic
| comment
| transparent;
But it doesn't work (I tested it with the IntelliJ tester on the rule "instructions").
I have also added a skip rule outside the "COMPOSER" mode:
TEXT_SKIP : TAG_CLOSE .*? (TAG_OPEN | EOF) -> skip;
But I don't get any results.
Can someone help me?
EDIT:
I changed "instructions" and now the parse tree is built correctly for every instruction of every tag:
instructions : (.*? TAG_OPEN statement TAG_CLOSE .*?)+;
But I get an unrecognized-character error outside the tags.
Below is a quick demo that worked for me.
Lexer grammar:
lexer grammar CompModeLexer;
TAG_OPEN
: '<?compo' -> pushMode(COMPOSER)
;
OTHER
: . -> skip
;
mode COMPOSER;
TAG_CLOSE
: '?>' -> popMode
;
OPAR
: '('
;
CPAR
: ')'
;
INT
: '0'
| [1-9] [0-9]*
;
LOGICOP
: 'AND'
| 'OR'
;
COMPAREOP
: [<>!] '='
| [<>=]
;
MULTOP
: [*/%]
;
ADDOP
: [+-]
;
SPACE
: [ \t\r\n\f] -> skip
;
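The mode switching in the lexer grammar above can be mimicked by hand to see what the parser ends up receiving. The Python sketch below is a rough model, not ANTLR's implementation: it covers only a subset of the COMPOSER rules (INT, ADDOP, MULTOP and whitespace), and the function name lex and the tuple layout are assumptions made for this illustration.

```python
import re

# Subset of the COMPOSER-mode rules; whitespace has no group and is dropped.
TOKEN_RX = re.compile(r"\s+|(?P<INT>0|[1-9][0-9]*)|(?P<ADDOP>[+-])|(?P<MULTOP>[*/%])")


def lex(text):
    mode, pos, out = "DEFAULT", 0, []
    while pos < len(text):
        if mode == "DEFAULT":
            if text.startswith("<?compo", pos):
                out.append(("TAG_OPEN", "<?compo"))
                pos += len("<?compo")
                mode = "COMPOSER"  # corresponds to pushMode(COMPOSER)
            else:
                pos += 1  # corresponds to OTHER : . -> skip
        else:
            if text.startswith("?>", pos):
                out.append(("TAG_CLOSE", "?>"))
                pos += 2
                mode = "DEFAULT"  # corresponds to popMode
                continue
            m = TOKEN_RX.match(text, pos)
            if not m:
                raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
            if m.lastgroup:  # skip whitespace, keep everything else
                out.append((m.lastgroup, m.group()))
            pos = m.end()
    return out


source = "This text is a unique token to skip <?compo 5+5 ?> also this <?compo 1+1 ?>"
kinds = [kind for kind, _ in lex(source)]
print(kinds)
```

Everything outside the tags is consumed one character at a time and never produces a token, which is why the parser rule only has to describe what sits between TAG_OPEN and TAG_CLOSE.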
Parser grammar:
parser grammar CompModeParser;
options {
tokenVocab=CompModeLexer;
}
parse
: tag* EOF
;
tag
: TAG_OPEN statement TAG_CLOSE
;
statement
: expr
;
expr
: '(' expr ')'
| expr MULTOP expr
| expr ADDOP expr
| expr COMPAREOP expr
| expr LOGICOP expr
| INT
;
A test with the input This text is a unique token to skip <?compo 5+5 ?> also this <?compo 1+1 ?> resulted in the following tree:
I found another solution (not as elegant as the previous one):
Create a generic TEXT token in the general context (so outside the tag's mode)
TEXT : ( ~[<] | '<' ~[?])+ -> skip;
Create a parser rule to handle generic text
code
: TEXT
| (TEXT? instruction TEXT?)+;
Create a parser rule to handle an instruction
instruction
: TAG_OPEN statement TAG_CLOSE;
In my grammar I use:
WS: [ \t\r\n]+ -> skip;
When I change this to use the HIDDEN channel:
WS: [ \t\r\n]+ -> channel(HIDDEN);
I receive errors (extraneous input ' '...) that I did not receive while using 'skip'.
I thought that skipping a token and sending it to a channel make no difference to the content passed to the parser.
Below you can find a code excerpt in which the parser is executed:
CharStream charStream = new ANTLRInputStream(formulaString);
FormulaLexer lexer = new FormulaLexer(charStream);
BufferedTokenStream tokens = new BufferedTokenStream(lexer);
FormulaParser parser = new FormulaParser(tokens);
ParseTree tree = parser.startRule();
StartRuleVisitor startRuleVisitor = new StartRuleVisitor();
startRuleVisitor.visit(tree);
VariableVisitor variableVisitor = new VariableVisitor(tokens);
variableVisitor.visit(tree);
And the grammar itself:
grammar Formula;
startRule
: variable RELATION_OPERATOR integer
;
integer
: DIGIT+
;
identifier
: (LETTER | DIGIT) ( DIGIT | LETTER | '_' | '.')+
;
tableId
: 'T_' (identifier | WILDCARD)
;
rowId
: 'R_' (identifier | WILDCARD)
;
columnId
: 'C_' (identifier | WILDCARD)
;
sheetId
: 'S_' (identifier | WILDCARD)
;
variable
: L_CURLY_BRACKET cellIdComponent (COMMA cellIdComponent)+ R_CURLY_BRACKET
;
cellIdComponent
: tableId | rowId | columnId | sheetId
;
COMMA
: ','
;
RELATION_OPERATOR
: EQ
;
WILDCARD
: 'NNN'
;
L_CURLY_BRACKET
: '{'
;
R_CURLY_BRACKET
: '}'
;
LETTER
: ('a' .. 'z') | ('A' .. 'Z')
;
DIGIT
: ('0' .. '9')
;
EQ
: '='
| 'EQ' | 'eq'
;
WS
: [ \t\r\n]+ -> channel(HIDDEN)
;
String I try to parse:
{T_C 00.01, R_010, C_010} = 1
Output I get with channel(HIDDEN) used:
line 1:4 extraneous input ' ' expecting {'_', '.', LETTER, DIGIT}
line 1:11 extraneous input ' ' expecting {'T_', 'R_', 'C_', 'S_'}
line 1:18 extraneous input ' ' expecting {'T_', 'R_', 'C_', 'S_'}
line 1:27 extraneous input ' ' expecting RELATION_OPERATOR
line 1:29 extraneous input ' ' expecting DIGIT
But if I change channel(HIDDEN) to 'skip' there are no errors.
What is more, I have observed that for a more complex grammar than this one I get 'no viable alternative at input...' if I use channel(HIDDEN), and once again the error disappears with 'skip'.
Do you know what may be the cause of this?
You should use CommonTokenStream instead of BufferedTokenStream. See the BufferedTokenStream description on GitHub:
This token stream ignores the value of {@link Token#getChannel}. If your
parser requires the token stream filter tokens to only those on a particular
channel, such as {@link Token#DEFAULT_CHANNEL} or
{@link Token#HIDDEN_CHANNEL}, use a filtering token stream such a
{@link CommonTokenStream}.
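The difference between the two streams can be pictured with a toy model. The Python sketch below does not use the ANTLR runtime; Token, lex, buffered_view and common_view are hypothetical names invented for this illustration, and the lexer simply emits one token per character. It shows that a stream which ignores channels hands whitespace to the parser, while a channel-filtering stream (or a lexer that skips whitespace) does not.

```python
from dataclasses import dataclass

DEFAULT_CHANNEL, HIDDEN_CHANNEL = 0, 1


@dataclass
class Token:
    text: str
    channel: int = DEFAULT_CHANNEL


def lex(source, whitespace="hidden"):
    """Toy lexer: one token per character; whitespace is hidden or skipped."""
    tokens = []
    for ch in source:
        if ch.isspace():
            if whitespace == "skip":
                continue  # like '-> skip': the token is never created
            tokens.append(Token(ch, HIDDEN_CHANNEL))  # like '-> channel(HIDDEN)'
        else:
            tokens.append(Token(ch))
    return tokens


def buffered_view(tokens):
    """Ignores channels, so the parser would see every token."""
    return [t.text for t in tokens]


def common_view(tokens, channel=DEFAULT_CHANNEL):
    """Filters on one channel, so hidden tokens never reach the parser."""
    return [t.text for t in tokens if t.channel == channel]


hidden = lex("} = 1")
print(buffered_view(hidden))                # ['}', ' ', '=', ' ', '1']
print(common_view(hidden))                  # ['}', '=', '1']
print(buffered_view(lex("} = 1", "skip")))  # ['}', '=', '1']
```

Applied to the Java snippet in the question, this corresponds to constructing the parser on a new CommonTokenStream(lexer) rather than a new BufferedTokenStream(lexer); the hidden whitespace tokens remain in the stream but are not offered to the parser.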
I've written a very simple grammar definition for a calculation expression:
grammar SimpleCalc;
options {
output=AST;
}
tokens {
PLUS = '+' ;
MINUS = '-' ;
MULT = '*' ;
DIV = '/' ;
}
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
ID : ('a'..'z' | 'A' .. 'Z' | '0' .. '9')+ ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { Skip(); } ;
/*------------------------------------------------------------------
* PARSER RULES
*------------------------------------------------------------------*/
start: expr EOF;
expr : multExpr ((PLUS | MINUS)^ multExpr)*;
multExpr : atom ((MULT | DIV)^ atom )*;
atom : ID
| '(' expr ')' -> expr;
I've tried the invalid expression ABC &* DEF with the start rule, but it passed. It looks like the & character is ignored. What's the problem here?
Actually, your invalid expression ABC &* DEF does not pass; it causes a NoViableAltException.
I've created a small grammar in ANTLR using Python (a grammar that accepts either a list of numbers or a list of IDs). Yet when I input a string such as December 12 1965, ANTLR runs on the file and shows me no errors with the following code (all of the Python code that I'm using is embedded via @main):
grammar ParserLang;
options {
language=Python;
}
@header {
import sys
import antlr3
from ParserLangLexer import ParserLangLexer
}
@main {
def main(argv, otherArg=None):
    char_stream = antlr3.ANTLRInputStream(open(sys.argv[1],'r'))
    lexer = ParserLangLexer(char_stream)
    tokens = CommonTokenStream(lexer)
    parser = ParserLangParser(tokens);
    rule = parser.entry_rule()
}
program : idList EOF
| integerList EOF
;
idList : ID whitespace idList
| ID
;
integerList : INTEGER whitespace integerList
| INTEGER
;
whitespace : (WHITESPACE | COMMENT) +;
ID : LETTER (DIGIT | LETTER)*;
INTEGER : (NONZERO_DIGIT DIGIT*) | ZERO ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
COMMENT : ('/*' .* '*/') | ('//' .* '\n') { $channel = HIDDEN; } ;
fragment ZERO : '0' ;
fragment DIGIT : '0' .. '9';
fragment NONZERO_DIGIT : '1' .. '9';
fragment LETTER : 'a' .. 'z' | 'A' .. 'Z';
Am I doing something wrong?
EDIT: When I use ANTLRWorks with the same grammar and input, a NoViableAltException is thrown. How do I get that error via code?
I could not reproduce it. When I generate a lexer and parser from your input after fixing the error in the grammar (rule = parser.entry_rule() should be: rule = parser.program()), and parse the input "December 12 1965" (either as input from a file, or as a plain string), I get the following error:
line 1:0 no viable alternative at input u'December'
This may seem strange, since 'December' could be the start of an idList. The fact is, your grammar contains one more error and a small thing that could be improved:
WHITESPACE and COMMENT are placed on the HIDDEN channel, and are therefore not available in parser rules (at least, not without changing the channel from which the parser reads its tokens...);
a COMMENT at the end of the input, that is, without a \n at the end, will not be properly tokenized. Better define a single line comment like this: '//' ~('\r' | '\n')*. The trailing line break will be captured by the WHITESPACE rule after all.
Since the whitespace rule prevents the parser from matching an idList (or an integerList, for that matter), an error is produced pointing at the very first token ('December').
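The point about single line comments can be checked with ordinary regular expressions. The Python sketch below is only an analogy: the two patterns are hand translations of '//' .* '\n' and '//' ~('\r' | '\n')*, and the sample strings are made up for this illustration.

```python
import re

with_newline = re.compile(r"//.*?\n")        # mirrors '//' .* '\n'
without_newline = re.compile(r"//[^\r\n]*")  # mirrors '//' ~('\r' | '\n')*

# A comment at the very end of the input, with no trailing line break.
source = "abc // trailing comment"
start = source.index("//")

print(with_newline.match(source, start))             # None: no '\n' to close the comment
print(without_newline.match(source, start).group())  # // trailing comment

# With a line break present, the second pattern stops before it,
# leaving the '\n' for the whitespace rule.
multi = "x // c\ny"
print(repr(without_newline.match(multi, multi.index("//")).group()))  # '// c'
```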
Here's a grammar that works (as expected):
grammar ParserLang;
options {
language=Python;
}
@header {
import sys
import antlr3
from ParserLangLexer import ParserLangLexer
}
@main {
def main(argv, otherArg=None):
    lexer = ParserLangLexer(antlr3.ANTLRStringStream('December 12 1965'))
    parser = ParserLangParser(CommonTokenStream(lexer))
    parser.program()
}
program : idList EOF
| integerList EOF
;
idList : ID+
;
integerList : INTEGER+
;
ID : LETTER (DIGIT | LETTER)*;
INTEGER : (NONZERO_DIGIT DIGIT*) | ZERO ;
WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ { $channel = HIDDEN; } ;
COMMENT : ('/*' .* '*/' | '//' ~('\r' | '\n')*) { $channel = HIDDEN; } ;
fragment ZERO : '0' ;
fragment DIGIT : '0' .. '9';
fragment NONZERO_DIGIT : '1' .. '9';
fragment LETTER : 'a' .. 'z' | 'A' .. 'Z';
Running the parser generated from the grammar above will also produce an error:
line 1:9 missing EOF at u'12'
but that is expected: after an idList, the parser expects the EOF, but it encounters '12' instead.
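To make the last point concrete, here is a small hand-written recogniser for the corrected program rule, in plain Python without the ANTLR runtime. It is a sketch under assumptions: tokenize and parse_program are names invented for this example, whitespace is simply dropped, and the error strings only imitate the style of the messages shown above.

```python
import re

# ID : LETTER (DIGIT | LETTER)* ; INTEGER : NONZERO_DIGIT DIGIT* | ZERO ; whitespace dropped.
TOKEN_RX = re.compile(r"\s+|(?P<ID>[A-Za-z][A-Za-z0-9]*)|(?P<INTEGER>0|[1-9][0-9]*)")


def tokenize(text):
    pos, out = 0, []
    while pos < len(text):
        m = TOKEN_RX.match(text, pos)
        if not m:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if m.lastgroup:  # whitespace has no group and is not passed on
            out.append((m.lastgroup, m.group(), pos))
        pos = m.end()
    out.append(("EOF", "<EOF>", len(text)))
    return out


def parse_program(text):
    """program : ID+ EOF | INTEGER+ EOF (single-line input assumed)."""
    tokens = tokenize(text)
    kind = tokens[0][0]
    if kind not in ("ID", "INTEGER"):
        return f"line 1:{tokens[0][2]} no viable alternative at input {tokens[0][1]!r}"
    i = 0
    while tokens[i][0] == kind:  # consume ID+ or INTEGER+
        i += 1
    if tokens[i][0] != "EOF":
        return f"line 1:{tokens[i][2]} missing EOF at {tokens[i][1]!r}"
    return "ok"


print(parse_program("December 12 1965"))  # line 1:9 missing EOF at '12'
print(parse_program("December January"))  # ok
print(parse_program("12 1965"))           # ok
```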