why is this grammar an error 208? - antlr

I don't understand why the following grammar leads to error 208 complaining IF will be never matched:
error(208): test.g:11:1: The following token definitions can never be matched because prior tokens match the same input: IF
ANTLRWorks 1.4.3
ANTLT 3.4
grammar test;
#lexer::members {
private boolean rawAhead() {
}
}
parse : IF*;
RAW : ({rawAhead()}?=> . )+;
IF : 'if';
ID : ('A'..'Z'|'a'..'z')+;
Either remove RAW rule or ID rule solves the error...
From my point of view, IF does have the possibility to be matched when rawAhead() returns false.

Bood wrote:
I think it actually matters, say if we have an and just an 'if' outside of the mmode, e.g. <#/>if<#/>, then the if here will be matched with IF, not RAW it should be (same length, match the first), right?
Yeah, you're right, good point. Giving it some more thought that is the expected behavior AFAIK. But, it seems things work a bit differently: the RAW rule gets precedence over the ID and IF rules, even when placed at the end of the lexer grammar as you can see:
freemarker_simple.g
grammar freemarker_simple;
#lexer::members {
private boolean mmode = false;
private boolean rawAhead() {
if(mmode) return false;
int ch1 = input.LA(1), ch2 = input.LA(2), ch3 = input.LA(3);
return !(
(ch1 == '<' && ch2 == '#') ||
(ch1 == '<' && ch2 == '/' && ch3 == '#') ||
(ch1 == '$' && ch2 == '{')
);
}
}
parse
: (t=. {System.out.printf("\%-15s '\%s'\n", tokenNames[$t.type], $t.text);})* EOF
;
OUTPUT_START : '${' {mmode=true;};
TAG_START : '<#' {mmode=true;};
TAG_END_START : '</' ('#' {mmode=true;} | ~'#' {$type=RAW;});
OUTPUT_END : '}' {mmode=false;};
TAG_END : '>' {mmode=false;};
EQUALS : '==';
IF : 'if';
STRING : '"' ~'"'* '"';
ID : ('a'..'z' | 'A'..'Z')+;
SPACE : (' ' | '\t' | '\r' | '\n')+ {skip();};
RAW : ({rawAhead()}?=> . )+;
Main.java
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
freemarker_simpleLexer lexer = new freemarker_simpleLexer(new ANTLRStringStream("<#/if>if<#if>foo<#if>"));
freemarker_simpleParser parser = new freemarker_simpleParser(new CommonTokenStream(lexer));
parser.parse();
}
}
will print the following to the console:
TAG_START '<#'
IF 'if'
TAG_END '>'
RAW 'if'
TAG_START '<#'
IF 'if'
TAG_END '>'
RAW 'foo'
TAG_START '<#'
IF 'if'
TAG_END '>'
As you can see, the 'if' and 'foo' are tokenized as RAW in the input:
<#/if>if<#if>foo<#if>
^^ ^^^

Related

Using Antlr to parse formulas with multiple locales

I'm very new to Antlr, so forgive what may be a very easy question.
I am creating a grammar which parses Excel-like formulas and it needs to support multiple locales based on the list separator (, for en-US) and decimal separator (. for en-US). I would prefer not to choose between separate grammars to parse with based on locale.
Can I modify or inherit from the CommonTokenStream class to accomplish this, or is there another way to do this? Examples would be helpful.
I am using the Antlr v4.5.0-alpha003 NuGet package in my VS2015 C# project.
What you can do is add a locale (or custom separator- and grouping-characters) to your lexer, and add a semantic predicate before the lexer rule that inspects your custom separator- and grouping-characters and match these tokens dynamically.
I don't have ANTLR and C# running here, but the Java demo should be pretty similar:
grammar LocaleDemo;
#lexer::header {
import java.text.DecimalFormatSymbols;
import java.util.Locale;
}
#lexer::members {
private char decimalSeparator = '.';
private char groupingSeparator = ',';
public LocaleDemoLexer(CharStream input, Locale locale) {
this(input);
DecimalFormatSymbols dfs = new DecimalFormatSymbols(locale);
this.decimalSeparator = dfs.getDecimalSeparator();
this.groupingSeparator = dfs.getGroupingSeparator();
}
}
parse
: .*? EOF
;
NUMBER
: D D? ( DG D D D )* ( DS D+ )?
;
OTHER
: .
;
fragment D : [0-9];
fragment DS : {_input.LA(1) == decimalSeparator}? . ;
fragment DG : {_input.LA(1) == groupingSeparator}? . ;
To test the grammar above, run this class:
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.Token;
import java.util.Locale;
public class Main {
private static void tokenize(String input, Locale locale) {
LocaleDemoLexer lexer = new LocaleDemoLexer(new ANTLRInputStream(input), locale);
System.out.printf("\ninput='%s', locale=%s, tokens:\n", input, locale);
for (Token t : lexer.getAllTokens()) {
System.out.printf(" %-10s '%s'\n", LocaleDemoLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
}
public static void main(String[] args) throws Exception {
tokenize("1.23", Locale.ENGLISH);
tokenize("1.23", Locale.GERMAN);
tokenize("12.345.678,90", Locale.ENGLISH);
tokenize("12.345.678,90", Locale.GERMAN);
}
}
which would print:
input='1.23', locale=en, tokens:
NUMBER '1.23'
input='1.23', locale=de, tokens:
NUMBER '1'
OTHER '.'
NUMBER '23'
input='12.345.678,90', locale=en, tokens:
NUMBER '12.345'
OTHER '.'
NUMBER '67'
NUMBER '8'
OTHER ','
NUMBER '90'
input='12.345.678,90', locale=de, tokens:
NUMBER '12.345.678,90'
Related Q&A's:
What is a 'semantic predicate' in ANTLR?
What does "fragment" mean in ANTLR?
As a follow-up to Bart's answer, this is the grammar I created with his suggestions:
grammar ExcelScript;
#lexer::header
{
using System;
using System.Globalization;
}
#lexer::members
{
private Int32 listseparator = 44; // UTF16 value for comma
private Int32 decimalseparator = 46; // UTF16 value for period
/// <summary>
/// Creates a new lexer object
/// </summary>
/// <param name="input">The input stream</param>
/// <param name="locale">The locale to use in parsing numbers</param>
/// <returns>A new lexer object</returns>
public ExcelScriptLexer (ICharStream input, CultureInfo locale)
: this(input)
{
this.listseparator = Convert.ToInt32(locale.TextInfo.ListSeparator[0]);
this.decimalseparator = Convert.ToInt32(locale.NumberFormat.NumberDecimalSeparator[0]);
// special case for 8 locales where the list separator is a , and the number separator is a , too
// Excel uses semicolon for list separator, so we will too
if (this.listseparator == 44 && this.decimalseparator == 44)
this.listseparator = 59; // UTF16 value for semicolon
}
}
/*
* Parser Rules
*/
formula
: numberLiteral
| Identifier
| '=' expression
;
expression
: primary # PrimaryExpression
| Identifier arguments # FunctionCallExpression
| ('+' | '-') expression # UnarySignExpression
| expression ('*' | '/' | '%') expression # MulDivModExpression
| expression ('+' | '-') expression # AddSubExpression
| expression ('<=' | '>=' | '>' | '<') expression # CompareExpression
| expression ('=' | '<>') expression # EqualCompareExpression
;
primary
: '(' expression ')' # ParenExpression
| literal # LiteralExpression
| Identifier # IdentifierExpression
;
literal
: numberLiteral # NumberLiteralRule
| booleanLiteral # BooleanLiteralRule
;
numberLiteral
: IntegerLiteral
| FloatingPointLiteral
;
booleanLiteral
: TrueKeyword
| FalseKeyword
;
arguments
: '(' expressionList? ')'
;
expressionList
: expression (ListSeparator expression)*
;
/*
* Lexer Rules
*/
AddOperator : '+' ;
SubOperator : '-' ;
MulOperator : '*' ;
DivOperator : '/' ;
PowOperator : '^' ;
EqOperator : '=' ;
NeqOperator : '<>' ;
LeOperator : '<=' ;
GeOperator : '>=' ;
LtOperator : '<' ;
GtOperator : '>' ;
ListSeparator : {_input.La(1) == listseparator}? . ;
DecimalSeparator : {_input.La(1) == decimalseparator}? . ;
TrueKeyword : [Tt][Rr][Uu][Ee] ;
FalseKeyword : [Ff][Aa][Ll][Ss][Ee] ;
Identifier
: Letter (Letter | Digit)*
;
fragment Letter
: [A-Z_a-z]
;
fragment Digit
: [0-9]
;
IntegerLiteral
: '0'
| [1-9] [0-9]*
;
FloatingPointLiteral
: [0-9]+ DecimalSeparator [0-9]* Exponent?
| DecimalSeparator [0-9]+ Exponent?
| [0-9]+ Exponent
;
fragment Exponent
: ('e' | 'E') ('+' | '-')? ('0'..'9')+
;
WhiteSpace
: [ \t]+ -> channel(HIDDEN)
;

ANTLR: How to skip multiline comments

Given the following lexer:
lexer grammar CodeTableLexer;
#header {
package ch.bsource.ice.parsers;
}
CodeTabHeader : OBracket Code ' ' Table ' ' Version CBracket;
CodeTable : Code ' '* Table;
EndCodeTable : 'end' ' '* Code ' '* Table;
Code : 'code';
Table : 'table';
Version : '1.0';
Row : 'row';
Tabdef : 'tabdef';
Override : 'override' | 'no_override';
Obsolete : 'obsolete';
Substitute : 'substitute';
Status : 'activ' | 'inactive';
Pkg : 'include_pkg' | 'exclude_pkg';
Ddic : 'include_ddic' | 'exclude_ddic';
Tab : 'tab';
Naming : 'naming';
Dfltlang : 'dfltlang';
Language : 'english' | 'german' | 'french' | 'italian' | 'spanish';
Null : 'null';
Comma : ',';
OBracket : '[';
CBracket : ']';
Boolean
: 'true'
| 'false'
;
Number
: Int* ('.' Digit*)?
;
Identifier
: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '$' | '#' | '.' | Digit)*
;
String
#after {
setText(getText().substring(1, getText().length() - 1).replaceAll("\\\\(.)", "$1"));
}
: '"' (~('"'))* '"'
;
Comment
: '--' ~('\r' | '\n')* { skip(); }
| '/*' .* '*/' { skip(); }
;
Space
: (' ' | '\t') { skip(); }
;
NewLine
: ('\r' | '\n' | '\u000C') { skip(); }
;
fragment Int
: '1'..'9'
| '0'
;
fragment Digit
: '0'..'9'
;
... and the following parser:
parser grammar CodeTableParser;
options {
tokenVocab = CodeTableLexer;
backtrack = true;
output = AST;
}
#header {
package ch.bsource.ice.parsers;
}
parse
: block EOF
;
block
: CodeTabHeader^ codeTable endCodeTable
;
codeTable
: CodeTable^ codeTableData
;
codeTableData
: (Identifier^ obsolete?) (tabdef | row)*
;
endCodeTable
: EndCodeTable
;
tabdef
: Tabdef^ Identifier+
;
row
: Row^ rowData
;
rowData
: (Number^ | (Identifier^ (Comma Number)?))
Override?
obsolete?
status?
Pkg?
Ddic?
(tab | field)*
;
tab
: Tab^ value+
;
field
: (Identifier^ value) | naming
;
value
: OBracket? (Identifier | String | Number | Boolean | Null) CBracket?
;
naming
: Naming^ defaultNaming (l10nNaming)*
;
defaultNaming
: Dfltlang^ String
;
l10nNaming
: Language^ String?
;
obsolete
: Obsolete^ Substitute String
;
status
: Status^ Override?
;
... finally my class for making the parser case-insensitive:
package ch.bsource.ice.parsers;
import java.io.IOException;
import org.antlr.runtime.*;
public class ANTLRNoCaseFileStream extends ANTLRFileStream {
public ANTLRNoCaseFileStream(String fileName) throws IOException {
super (fileName, null);
}
public ANTLRNoCaseFileStream(String fileName, String encoding) throws IOException {
super (fileName, null);
}
public int LA(int i) {
if (i == 0) return 0;
if (i < 0) i++;
if ((p + 1 - 1) >= n) return CharStream.EOF
return Character.toLowerCase(data[p + 1 - 1]);
}
}
... single-line comments are skipped as expected, while multi-line comments aren't... here is the error message I get:
codetable_1.txt line 38:0 mismatched character '<EOF>' expecting '*'
codetable_1.txt line 38:0 mismatched input '<EOF>' expecting EndCodeTable
java.lang.NullPointerException
...
Am I missing something? Is there anything I should be aware of? I'm using antlr 3.4.
Here is also the example source code I'm trying to parse:
[code table 1.0]
/*
This is a multi-line comment
*/
code table my_table
-- this is a single-line comment
row 1
id "my_id_1"
name "my_name_1"
descn "my_description_1"
naming
dfltlang "My description 1"
english "My description 1"
german "Meine Beschreibung 1"
-- this is another single-line comment
row 2
id "my_id_2"
name "my_name_2"
descn "my_description_2"
naming
dfltlang "My description 2"
english "My description 2"
german "Meine Beschreibung 2"
end code table
Any help would be really appreciated :-)
Thanks,
j3d
To do this in antlr4
BlockComment
: '/*' .*? '*/' -> skip
;
Bart gave me an amazing support and I think we all really appreciate him :-)
Anyway, the problem was a bug in the FileStream class I use to convert parsed char stream to lowercase. Here below is the correct Java source code:
import java.io.IOException;
import org.antlr.runtime.*;
public class ANTLRNoCaseFileStream extends ANTLRFileStream {
public ANTLRNoCaseFileStream(String fileName) throws IOException {
super (fileName, null);
}
public ANTLRNoCaseFileStream(String fileName, String encoding) throws IOException {
super (fileName, null);
}
public int LA(int i) {
if (i == 0) return 0;
if (i < 0) i++;
if ((p + i - 1) >= n) return CharStream.EOF;
return Character.toLowerCase(data[p + i - 1]);
}
}
I use 2 rules that I use to skip line and block comments (I print them during parsing for debug purposes). They are split in 2 for better readability, and the block comment does support nested comments.
Also, I do not skip EOL chars (\r and / or \n) in my grammar because I need them explicitly for some rules.
LineComment
: '//' ~('\n'|'\r')* //NEWLINE
{System.out.println("lc > " + getText());
skip();}
;
BlockComment
#init { int depthOfComments = 0;}
: '/*' {depthOfComments++;}
( options {greedy=false;}
: ('/' '*')=> BlockComment {depthOfComments++;}
| '/' ~('*')
| ~('/')
)*
'*/' {depthOfComments--;}
{
if (depthOfComments == 0) {
System.out.println("bc >" + getText());
skip();
}
}
;

How to handle the precedence of assignment operator in a PHP parser?

I wrote a PHP5 parser in ANTLR 3.4, which is almost ready, but I can not handle one of the tricky feature of PHP. My problem is with the precedence of assignment operator. As the PHP manual says the precedence of assignment is almost at the end of the list. Only and, xor, or and , are after it in the list.
But there is a note on this the manual page which says:
Although = has a lower precedence than most other operators, PHP will
still allow expressions similar to the following: if (!$a = foo()), in
which case the return value of foo() is put into $a.
The small example in the note isn't a problem for my parser, I can handle this as a special case in the assigment rule.
But there are more complex codes eg:
if ($a && $b = func()) {}
My parser fails here, because it recognizes first $a && $b and can not deal with the rest of the conditioin. This is because the && has higher precedence, than =.
If I put brackets around the right side of &&:
if ($a && ($b = func())) {}
In this way the parser recognizes the structure well.
The operators are built in the way that the ANTLR book recommends: there are the base exressions at the first step and each level of operators are coming after each other.
Is there any way to handle this precedence jumping?
Don't look at it as an assignment, but let's name it an assignment expression. Put this assignment expression "below" the unary expressions (so they have a higher precedence than the unary ones):
grammar T;
options {
output=AST;
}
tokens {
BLOCK;
FUNC_CALL;
EXPR_LIST;
}
parse
: stat* EOF!
;
stat
: assignment ';'!
| if_stat
;
assignment
: Var '='^ expr
;
if_stat
: If '(' expr ')' block -> ^(If expr block)
;
block
: '{' stat* '}' -> ^(BLOCK stat*)
;
expr
: or_expr
;
or_expr
: and_expr ('||'^ and_expr)*
;
and_expr
: unary_expr ('&&'^ unary_expr)*
;
unary_expr
: '!'^ assign_expr
| '-'^ assign_expr
| assign_expr
;
assign_expr
: Var ('='^ atom)*
| atom
;
atom
: Num
| func_call
;
func_call
: Id '(' expr_list ')' -> ^(FUNC_CALL Id expr_list)
;
expr_list
: (expr (',' expr)*)? -> ^(EXPR_LIST expr*)
;
If : 'if';
Num : '0'..'9'+;
Var : '$' Id;
Id : ('a'..'z')+;
Space : (' ' | '\t' | '\r' | '\n')+ {skip();};
If you'd now parse the source:
if (!$a = foo()) { $a = 1 && 2; }
if ($a && $b = func()) { $b = 2 && 3; }
if ($a = baz() && $b) { $c = 3 && 4; }
the following AST would get constructed:

ASN.1/SMI comment syntax on ANTLR 3

On ANTLR 2, the comment syntax is like this,
// Single-line comments
SL_COMMENT
: (options {warnWhenFollowAmbig=false;}
: '--'( { LA(2)!='-' }? '-' | ~('-'|'\n'|'\r'))* ( (('\r')? '\n') { newline(); }| '--') )
{$setType(Token.SKIP); }
;
However, when porting this to ANTLR 3,
SL_COMMENT
: (
: '--'( { input.LA(2)!='-' }? '-' | ~('-'|'\n'|'\r'))* ( (('\r')? '\n') | '--') )
{$channel = HIDDEN;}
;
because there is no more options {warnWhenFollowAmbig=false;}, the following comment cannot be parsed correctly,
-- some comment -- some not comment
Then, what is the possible way to define this SL_COMMENT rule for ANTLR 3?
Personally, I like to keep grammar rules as "empty" as possible. In this case, I would create a lexer method that returns true if the next two characters in the input are "--". As long as this is not the case, match any character other than \r and \n, and repeat that zero or more times until an optional "--" is encountered. Note that I didn't put a new line at the end because there is not necessarily a new line at the end (it could also be a EOF). Besides, \r and \n will likely be matched by a SPACE rule which is put on the HIDDEN channel: so there's no harm in doing it as I suggest.
A demo:
...
#lexer::members {
private boolean endCommentAhead() {
return input.LA(1) == '-' && input.LA(2) == '-';
}
}
...
SL_COMMENT
: '--' ({!endCommentAhead()}?=> ~('\r' | '\n'))* '--'?
;
...
And if you don't like the lexer members-block, you simply do:
SL_COMMENT
: '--' ({!(input.LA(1) == '-' && input.LA(2) == '-')}?=> ~('\r' | '\n'))* '--'?
;
EDIT
A small, complete demo:
grammar T;
#parser::members {
public static void main(String[] args) throws Exception {
String source = "12 - 34 -- foo - bar -- 42 \n - - 5678 -- more comments 666\n--\n--";
TLexer lexer = new TLexer(new ANTLRStringStream(source));
TParser parser = new TParser(new CommonTokenStream(lexer));
parser.parse();
}
}
#lexer::members {
private boolean endCommentAhead() {
return input.LA(1) == '-' && input.LA(2) == '-';
}
}
parse
: (t=. {System.out.printf("\%-15s\%s\n", tokenNames[$t.type], $t.text);})* EOF
;
SL_COMMENT
: '--' ({!endCommentAhead()}?=> ~('\r' | '\n'))* '--'?
;
MINUS
: '-'
;
INT
: '0'..'9'+
;
SPACE
: (' ' | '\t' | '\r' | '\n') {skip();}
;
which, after parsing the input:
12 - 34 -- foo - bar -- 42
- - 5678 -- more comments 666
will print:
INT 12
MINUS -
INT 34
SL_COMMENT -- foo - bar --
INT 42
MINUS -
MINUS -
INT 5678
SL_COMMENT -- more comments 666
SL_COMMENT --
SL_COMMENT --
I came across a solution finally,
SL_COMMENT
: COMMENT ( ({input.LA(2) != '-'}? '-') => '-' | ~('-'|'\n'|'\r'))* ( (('\r')? '\n') | COMMENT)
{ $channel = HIDDEN; }
;

Switching lexer state in antlr3 grammar

I'm trying to construct an antlr grammar to parse a templating language. that language can be embedded in any text and the boundaries are marked with opening/closing tags: {{ / }}. So a valid template looks like this:
foo {{ someVariable }} bar
Where foo and bar should be ignored, and the part inside the {{ and }} tags should be parsed. I've found this question which basically has an answer for the problem, except that the tags are only one { and }. I've tried to modify the grammar to match 2 opening/closing characters, but as soon as i do this, the BUFFER rule consumes ALL characters, also the opening and closing brackets. The LD rule is never being invoked.
Has anyone an idea why the antlr lexer is consuming all tokens in the Buffer rule when the delimiters have 2 characters, but does not consume the delimiters when they have only one character?
grammar Test;
options {
output=AST;
ASTLabelType=CommonTree;
}
#lexer::members {
private boolean insideTag = false;
}
start
: (tag | BUFFER )*
;
tag
: LD IDENT^ RD
;
LD #after {
// flip lexer the state
insideTag=true;
System.err.println("FLIPPING TAG");
} : '{{';
RD #after {
// flip the state back
insideTag=false;
} : '}}';
SPACE : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
IDENT : (LETTER)*;
BUFFER : { !insideTag }?=> ~(LD | RD)+;
fragment LETTER : ('a'..'z' | 'A'..'Z');
You can match any character once or more until you see {{ ahead by including a predicate inside the parenthesis ( ... )+ (see the BUFFER rule in the demo).
A demo:
grammar Test;
options {
output=AST;
ASTLabelType=CommonTree;
}
#lexer::members {
private boolean insideTag = false;
}
start
: tag EOF
;
tag
: LD IDENT^ RD
;
LD
#after {insideTag=true;}
: '{{'
;
RD
#after {insideTag=false;}
: '}}'
;
BUFFER
: ({!insideTag && !(input.LA(1)=='{' && input.LA(2)=='{')}?=> .)+ {$channel=HIDDEN;}
;
SPACE
: (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;}
;
IDENT
: ('a'..'z' | 'A'..'Z')+
;
Note that it's best to keep the BUFFER rule as the first lexer rule in your grammar: that way, it will be the first token that is tried.
If you now parse "foo {{ someVariable }} bar", the following AST is created:
Wouldn't a grammar like this fit your needs? I don't see why the BUFFER needs to be that complicated.
grammar test;
options {
output=AST;
ASTLabelType=CommonTree;
}
#lexer::members {
private boolean inTag=false;
}
start
: tag* EOF
;
tag
: LD IDENT RD -> IDENT
;
LD
#after { inTag=true; }
: '{{'
;
RD
#after { inTag=false; }
: '}}'
;
IDENT : {inTag}?=> ('a'..'z'|'A'..'Z'|'_') 'a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
BUFFER
: . {$channel=HIDDEN;}
;