ANTLR string parser rule takes precedence over other rules

I have the following grammar:
cell
: operator value
;
operator
: EQ
;
value
: StringCharacters
;
EQ
: '='
;
StringCharacters
: StringCharacter+
;
fragment
StringCharacter
: ~[\\\r\n]
;
WS : [ \t\r\n\u000C]+ -> skip
;
The idea is to allow the following inputs:
= 3
=3
=asdkfljer
=skdfj wkrje slkjf
and so on, and have the parser recognize the leading operator every time. That is exactly what does not happen: the parser always recognizes the whole input as a value.
How can I implement the grammar in such a way that the parser always recognizes the operator first and basically accepts the rest as the value?

The problem is that StringCharacters matches your entire input line, including the leading =, and the ANTLR lexer always produces the longest token it can match.
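The longest-match behaviour can be sketched in plain Python. This is only an illustration, not ANTLR itself: the regular expressions are hypothetical stand-ins for the EQ and StringCharacters rules from the question, and first_token is a made-up helper name.

```python
import re

# Hypothetical regex stand-ins for the two lexer rules in the question.
RULES = [
    ("EQ", re.compile(r"=")),
    ("StringCharacters", re.compile(r"[^\\\r\n]+")),
]


def first_token(text):
    """Return (rule name, matched text) for the longest match at offset 0.

    When two rules match the same length, the rule listed first is kept.
    """
    best = None
    for name, pattern in RULES:
        m = pattern.match(text)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best


print(first_token("=3"))                  # the '=' is swallowed by StringCharacters
print(first_token("=skdfj wkrje slkjf"))  # same here: one long StringCharacters token
```

Because StringCharacter excludes only the backslash and line breaks, the = never gets a chance to become an EQ token unless it stands alone.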
To solve this I'd suggest using Lexical Modes, something like:
EQ
: '=' -> pushMode(VALUE_MODE)
;
mode VALUE_MODE;
StringCharacters
: StringCharacter+ -> popMode
;
fragment
StringCharacter
: ~[\\\r\n]
;
WS : [ \t\r\n\u000C]+ -> skip
;
Note that the example above can parse only a single line.
If you want to parse multiple lines of values, you have to modify the lexer and the parser:
Lexer:
EQ
: '=' -> pushMode(VALUE_MODE)
;
mode VALUE_MODE;
StringCharacters
: StringCharacter+ [\r\n]* -> popMode
;
fragment
StringCharacter
: ~[\\\r\n]
;
WS : [ \t\r\n\u000C]+ -> skip
;
Parser:
cell
: (operator value)*
;
operator
: EQ
;
value
: StringCharacters
;

Related

Problems defining an ANTLR parser for template file

I have started working with ANTLR4 to create a syntax parser for a self defined template file format.
The format basically consists of a mandatory part called '#settings' and at least one part called '#region'. Each part's body is surrounded by braces.
I have created a sample file and also copy-pasted-modified an antlr g4 file to parse it. Works fine so far:
File:
#settings
{
setting1: value1
setting2: value2
}
#region
{
[Key1]=Value1(Comment1)
[Key2]=Value2(Comment2)
}
The G4 file for this sample:
grammar Template;
start
: section EOF
;
section
: settings regions
;
settings
: '#settings' '{' (settingsText)* '}'
;
settingsText
: TEXT
;
regions
: (region)+
;
region
: '#region' '{' (regionText)* '}'
;
regionName
: NOSPACE
;
regionText
: TEXT
;
TEXT
: (~[\u0000-\u001F])+
;
NOSPACE
: (~[\u0000-\u0020])+
;
WS
: [ \t\n\r] + -> skip
;
This works as expected. Now I want to add complexity to the file format and the parser and extend the #region header by #region NAME (Attributes).
So what I changed in the sample and in the G4 file is:
Sample changed to
...
#region name (attributes, moreAttributes)
{
...
and g4 file modified to
grammar Template;
start
: section EOF
;
section
: settings regions
;
settings
: '#settings' '{' (settingsText)* '}'
;
settingsText
: TEXT
;
regions
: (region)+
;
region
: '#region' regionName (regionAttributes)? '{' (regionText)* '}'
;
regionName
: NOSPACE
;
regionAttributes
: '(' regionAttribute (',' regionAttribute)* ')'
;
regionAttribute
: NOSPACE
;
regionText
: TEXT
;
TEXT
: (~[\u0000-\u001F])+
;
NOSPACE
: (~[\u0000-\u0020])+
;
WS
: [ \t\n\r] + -> skip
;
Now the parser brings up the following error:
Parser error (7, 1): mismatched input '#region name (attributes, moreAttributes)' expecting '#region'
And I don't get why it is behaving like this. I expected the parser to not concat the whole line when comparing. What am I doing wrong?
Thanks.
There are a couple of problems here:
whatever NOSPACE matches, is also matched by TEXT
TEXT is waaaaay too greedy
Issue 1
ANTLR's lexer works independently from the parser, and the lexer will match as many characters as possible.
When 2 (or more) lexer rules match the same number of characters, the one defined first "wins".
So, if the input is Foo and the parser is trying to match a NOSPACE token, you're out of luck: because both TEXT and NOSPACE match the text Foo and TEXT is defined first, the lexer will produce a TEXT token. There's nothing you can do about that: it's the way ANTLR works.
Issue 2
As explained in issue 1, the lexer tries to match as many characters as possible. Because of that, your TEXT rule is too greedy. This is what your input is tokenised as:
'#settings' `#settings`
'{' `{`
TEXT `setting1: value1`
TEXT `setting2: value2`
'}' `}`
TEXT `#region name (attributes, moreAttributes)`
'{' `{`
TEXT `[Key1]=Value1(Comment1)`
TEXT `[Key2]=Value2(Comment2)`
'}' `}`
As you can see, TEXT matches too much. And this is what the error
Parser error (7, 1): mismatched input '#region name (attributes, moreAttributes)' expecting '#region'
is telling you: #region name (attributes, moreAttributes) is a single TEXT token at the point where the parser expects a '#region' token.
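The tokenisation above can be reproduced with a small Python sketch. It is not ANTLR: the regular expressions are hypothetical stand-ins for the rules of the modified grammar, listed in the priority order ANTLR uses (literal tokens from parser rules first, then the lexer rules as written), and tokenize and SAMPLE are made-up names.

```python
import re

# Hypothetical stand-ins: implicit literal tokens first, then the lexer rules.
RULES = [
    ("'#settings'", re.compile(r"#settings")),
    ("'{'", re.compile(r"\{")),
    ("'}'", re.compile(r"\}")),
    ("'#region'", re.compile(r"#region")),
    ("'('", re.compile(r"\(")),
    ("','", re.compile(r",")),
    ("')'", re.compile(r"\)")),
    ("TEXT", re.compile(r"[^\x00-\x1f]+")),
    ("NOSPACE", re.compile(r"[^\x00-\x20]+")),
    ("WS", re.compile(r"[ \t\n\r]+")),
]

SAMPLE = """#settings
{
setting1: value1
setting2: value2
}
#region name (attributes, moreAttributes)
{
[Key1]=Value1(Comment1)
[Key2]=Value2(Comment2)
}
"""


def tokenize(text):
    """Longest match wins; on equal length the rule listed first wins."""
    pos, tokens = 0, []
    while pos < len(text):
        best = None
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        name, m = best
        if name != "WS":  # WS is skipped, as in the grammar
            tokens.append((name, m.group()))
        pos = m.end()
    return tokens


for name, text in tokenize(SAMPLE):
    print(name, repr(text))
```

In this sketch NOSPACE never appears in the output: TEXT matches everything NOSPACE does, at least as long, and is listed first.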
Solution?
Remove NOSPACE and make the TEXT token less greedy (or the other way around).
Bart,
thank you very much for clarifying this. The key point was that the lexer matches as many characters as possible. This is a behavior I still need to get used to. I redesigned my lexer and parser rules, and it now seems to work for my test case.
For completeness, this is my g4 File now:
grammar Template;
start
: section EOF
;
section
: settings regions
;
settings
: '#settings' '{' (settingsText)* '}'
;
regions
: (region)+
;
region
: '#region' regionName (regionAttributes)? '{' (regionText)* '}'
;
regionName
: TEXT
;
settingsText
: TEXT
;
regionAttributes
: '(' regionAttribute (',' regionAttribute)* ')'
;
regionAttribute
: TEXT
;
regionText
: regionLine '('? (regionComment?) ')'?
;
regionLine
: TEXT
;
regionComment
: TEXT
;
TEXT
: ([A-z0-9:\-|= ])+
;
WS
: [ \t\n\r] + -> skip
;

Semantic Predicate affected scope

Please consider the following grammar, which behaves unexpectedly:
lexer grammar TLexer;
WS : [ \t]+ -> channel(HIDDEN) ;
NEWLINE : '\n' -> channel(HIDDEN) ;
ASTERISK : '*' ;
SIMPLE_IDENTIFIER : [a-zA-Z_] [a-zA-Z0-9_$]* ;
NUMBER : [0-9] [0-9_]* ;
and
parser grammar TParser;
options { tokenVocab=TLexer; }
seq_input_list :
level_input_list | edge_input_list ;
level_input_list :
( level_symbol_any )+ ;
edge_input_list :
( level_symbol_any )* edge_symbol ;
level_symbol_any :
{getCurrentToken().getText().matches("[0a]")}? ( NUMBER | SIMPLE_IDENTIFIER ) ;
edge_symbol :
SIMPLE_IDENTIFIER | ASTERISK ;
The input 0 * is parsed fine, but 0 f is not recognized by the parser (no viable alternative at input 'f'). If I change the order of the alternatives in seq_input_list, both inputs are recognized.
My question is whether this is indeed an ANTLR issue or whether I misunderstand how semantic predicates work. I would expect the input 0 f to be recognized as (seq_input_list (edge_input_list (level_symbol_any NUMBER) (edge_symbol SIMPLE_IDENTIFIER))).
Thank you in advance!
Julian

ANTLR4 stop parsing section of file and put it into an accessible buffer (string)

I would like to have a grammar whose structure is strictly defined, but part of that structure should not be parsed by my grammar; instead it should be put into some sort of buffer (string) for later use.
My grammar looks like this:
grammar RSL;
rsl: sectionStructs? sectionProgram;
sectionProgram: 'section' 'program' '{' '}';
sectionStructs: 'section' 'structs' '{' structDef+ '}';
sectionName: ID;
structDef: 'struct' ID '{' varDef+ '}' ';';
varDef: ID ID ';';
ID: [a-zA-Z_][a-zA-Z_\-0-9]*;
WS : [ \t\r\n\u000C]+ -> skip
;
COMMENT
: '/*' .*? '*/' -> skip
;
LINE_COMMENT
: '//' ~[\r\n]* -> skip
;
And my wish is to have this sort of parsing going on:
section structs {
struct TestStruct {
int var1;
float var2;
...
};
struct Struct2 {
int var1;
...
};
}
section program {
// Do not parse anything that would be in this section
// just store it in a buffer for later use.
}
So all contents of section program should be stored in a string for a later use and no grammar rules should apply to program.
What is the best way of approaching this problem?
Thanks!
One way would be to create a lexer rule that matches this section program { ... }:
grammar RSL;
rsl
: sectionStructs? SECTION_PROGRAM EOF
;
sectionStructs
: 'section' 'structs' '{' structDef+ '}'
;
structDef
: 'struct' ID '{' varDef+ '}' ';'
;
varDef
: ID ID ';'
;
SECTION
: 'section'
;
ID
: [a-zA-Z_][a-zA-Z_\-0-9]*
;
SECTION_PROGRAM
: 'section' S+ 'program' S* BLOCK
;
WS
: S+ -> skip
;
COMMENT
: '/*' .*? '*/' -> skip
;
LINE_COMMENT
: '//' ~[\r\n]* -> skip
;
fragment BLOCK
: '{' ( ~[{}] | BLOCK )* '}'
;
fragment S
: [ \t\r\n]
;
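With this grammar, the whole section program { ... } block, braces and contents included, reaches the parser as a single SECTION_PROGRAM token, so its text can be stored and used later.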
Of course, if your language allows for things like string literals, you would also need to account for that in the fragment BLOCK rule.
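The brace counting that the recursive BLOCK fragment performs can be sketched in Python as follows. This is an illustration under assumptions, not part of the grammar: split_program_section is a hypothetical helper, and like the fragment it does not account for braces inside string literals or comments.

```python
import re


def split_program_section(source):
    """Return (text before the program section, raw body between its braces).

    Mirrors 'section' S+ 'program' S* BLOCK: find the header, then count
    nested braces until the opening brace is closed again.
    """
    header = re.search(r"section[ \t\r\n]+program[ \t\r\n]*\{", source)
    if header is None:
        raise ValueError("no 'section program {' header found")
    depth = 1
    pos = header.end()
    while pos < len(source) and depth > 0:
        if source[pos] == "{":
            depth += 1
        elif source[pos] == "}":
            depth -= 1
        pos += 1
    if depth != 0:
        raise ValueError("unbalanced braces in program section")
    body = source[header.end():pos - 1]  # exclude the closing brace
    return source[:header.start()], body


text = "section structs { struct A { int x; }; }\nsection program { if (a) { b(); } }"
before, program_body = split_program_section(text)
print(repr(program_body))
```

The part returned first can then be handed to the real parser, while the body stays untouched in a string.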

Grammar to negate two like characters in a lexer rule inside a single quoted string

ANTLR 4:
I need to support a single quoted string literal with escaped characters AND the ability to use double curly braces as an 'escape sequence' that will need additional parsing. So both of these examples need to be supported. I'm not so worried about the second example because that seems trivial if I can get the first to work and not match double curly brace characters.
1. 'this is a string literal with an escaped\' character'
2. 'this is a string {{functionName(x)}} literal with double curlies'
StringLiteral
: '\'' (ESC | AnyExceptDblCurlies)*? '\'' ;
fragment
ESC : '\\' [btnr\'\\];
fragment
AnyExceptDblCurlies
: '{' ~'{'
| ~'{' .;
I've done a lot of research on this and understand that you can't negate multiple characters, and I have even seen a similar approach work in Bart's answer to the post "Negating inside lexer- and parser rules".
But what I'm seeing is that in example 1 above, the escaped single quote is not being recognized and I get a parser error that it cannot match ' character'.
If I alter the string literal token rule to the following, it works:
StringLiteral
: '\'' (ESC | .)*? '\'' ;
Any ideas how to handle this scenario better? I can deduce that the escaped character is getting matched by AnyExceptDblCurlies instead of ESC, but I'm not sure how to solve this problem.
To parse the template definition out of the string pretty much requires handling in the parser. Use lexer modes to distinguish between string characters and the template name.
Parser:
options {
tokenVocab = TesterLexer ;
}
test : string EOF ;
string : STRBEG ( SCHAR | template )* STREND ; // allow multiple templates per string
template : TMPLBEG TMPLNAME TMPLEND ;
Lexer:
STRBEG : Squote -> pushMode(strMode) ;
mode strMode ;
STRESQ : Esqote -> type(SCHAR) ; // predeclare SCHAR in tokens block
STREND : Squote -> popMode ;
TMPLBEG : DBrOpen -> pushMode(tmplMode) ;
STRCHAR : . -> type(SCHAR) ;
mode tmplMode ;
TMPLEND : DBrClose -> popMode ;
TMPLNAME : ~'}'+ ; // '+' so the rule cannot match the empty string
fragment Squote : '\'' ;
fragment Esqote : '\\\'' ;
fragment DBrOpen : '{{' ;
fragment DBrClose : '}}' ;
Updated to correct the TMPLNAME rule, add main rule and options block.
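How the mode stack drives the tokenisation can be sketched in Python. This is a simulation under assumptions, not ANTLR: the regular expressions are hypothetical stand-ins for the lexer rules above, the type(SCHAR) commands are modelled by simply naming both rules SCHAR, and the third field of each entry is either None, "pop", or the name of the mode to push.

```python
import re

# (token name, pattern, action) per mode; action: None, "pop", or a mode to push.
MODES = {
    "DEFAULT": [
        ("STRBEG", re.compile(r"'"), "strMode"),
    ],
    "strMode": [
        ("SCHAR", re.compile(r"\\'"), None),           # STRESQ, retyped to SCHAR
        ("STREND", re.compile(r"'"), "pop"),
        ("TMPLBEG", re.compile(r"\{\{"), "tmplMode"),
        ("SCHAR", re.compile(r".", re.DOTALL), None),  # STRCHAR, retyped to SCHAR
    ],
    "tmplMode": [
        ("TMPLEND", re.compile(r"\}\}"), "pop"),
        ("TMPLNAME", re.compile(r"[^}]+"), None),
    ],
}


def tokenize(text):
    stack = ["DEFAULT"]
    pos, tokens = 0, []
    while pos < len(text):
        best = None
        for name, pattern, action in MODES[stack[-1]]:
            m = pattern.match(text, pos)
            # Longest match wins; on a tie the rule listed first is kept.
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m, action)
        if best is None:
            raise ValueError(f"no rule matches at offset {pos} in mode {stack[-1]}")
        name, m, action = best
        tokens.append((name, m.group()))
        if action == "pop":
            stack.pop()
        elif action is not None:
            stack.append(action)
        pos = m.end()
    return tokens


for token in tokenize("'a\\'b {{f(x)}} c'"):
    print(token)
```

The escaped quote is two characters long, so it beats the one-character STREND and STRCHAR candidates, and {{ switches to a mode in which nothing but the template name and the closing }} can match.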

Antlr4 Grammar for Function Application

I'm trying to write a simple lambda calculus grammar (shown below). The issue I am having is that function application seems to be treated as right associative instead of left associative, e.g. "f 1 2" is parsed as (f (1 2)) instead of ((f 1) 2). ANTLR has an assoc option for tokens, but I don't see how that helps here since there is no operator for function application. Does anyone see a solution?
LAMBDA : '\\';
DOT : '.';
OPEN_PAREN : '(';
CLOSE_PAREN : ')';
fragment ID_START : [A-Za-z+\-*/_];
fragment ID_BODY : ID_START | DIGIT;
fragment DIGIT : [0-9];
ID : ID_START ID_BODY*;
NUMBER : DIGIT+ (DOT DIGIT+)?;
WS : [ \t\r\n]+ -> skip;
parse : expr EOF;
expr : variable #VariableExpr
| number #ConstantExpr
| function_def #FunctionDefinition
| expr expr #FunctionApplication
| OPEN_PAREN expr CLOSE_PAREN #ParenExpr
;
function_def : LAMBDA ID DOT expr;
number : NUMBER;
variable : ID;
Thanks!
This breaks 4.1's pattern matcher for left recursion. I believe it has been cleaned up in the main branch; try downloading the latest master and building it. Currently, 4.1 generates:
expr[int _p]
: ( {} variable
| number
| function_def
| OPEN_PAREN expr CLOSE_PAREN
)
(
{2 >= $_p}? expr
)*
;
for that rule. The expr reference in the loop is actually expr[0], which isn't right.
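Why the precedence argument of the inner call matters can be sketched in Python. This is a hand-written imitation of the loop shape shown above, not ANTLR's generated code; parse_application is a hypothetical helper, atoms stand in for the primary alternatives, and the value 3 for a correct left-associative rewrite is an assumption (one more than the 2 used in the predicate).

```python
def parse_application(tokens, inner_precedence):
    """Parse juxtaposed atoms with the shape: primary ( {2 >= p}? expr[inner] )*"""
    pos = 0

    def expr(p):
        nonlocal pos
        node = tokens[pos]  # primary: a single atom
        pos += 1
        # The loop is guarded by the predicate {2 >= p}?
        while pos < len(tokens) and 2 >= p:
            argument = expr(inner_precedence)
            node = (node, argument)
        return node

    return expr(0)


# Inner call expr[0]: the predicate also holds inside the nested call, so the
# nested call consumes all remaining atoms and the tree leans to the right.
print(parse_application(["f", "1", "2"], 0))  # ('f', ('1', '2'))

# Inner call with a precedence above 2 (assumed value 3): the nested call
# returns after one atom, so the outer loop builds a left-leaning tree.
print(parse_application(["f", "1", "2"], 3))  # (('f', '1'), '2')
```

With expr[0] in the loop, the sketch reproduces the (f (1 2)) shape from the question; with the higher value it yields ((f 1) 2).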