Antlr4 parser not parsing reassignment statement correctly - antlr

I've been creating a grammar parser using Antlr4 and wanted to add variable reassignment (without having to declare a new variable).
I've tried changing the reassignment statement to be an expression, but that didn't change anything.
Here's a shortened version of my grammar:
grammar MyLanguage;
program: statement* EOF;
statement
: expression EOC
| variable EOC
| IDENTIFIER ASSIGNMENT expression EOC
;
variable: type IDENTIFIER (ASSIGNMENT expression)?;
expression
: STRING
| INTEGER
| IDENTIFIER
| expression MATH expression
| ('+' | '-') expression
;
MATH: '+' | '-' | '*' | '/' | '%' | '//' | '**';
ASSIGNMENT: MATH? '=';
EOC: ';';
WHITESPACE: [ \t\r\n]+ -> skip;
STRING: '"' (~[\u0000-\u0008\u0010-\u001F"] | [\t])* '"' | '\'' (~[\u0000-\u0008\u0010-\u001F'] | [\t])* '\'';
INTEGER: '0' | ('+' | '-')? [1-9][0-9]*;
IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]*;
type: 'str';
If anything else might be relevant, please ask.
I tried to parse
str test = "empty";
test = "not empty";
which worked, but when I tried the following (part of a Fibonacci function)
temp = n1;
n1 = n1 + n2;
n2 = temp;
I got an error, and the input was parsed as
temp = n1; //statement
n1 = n1 //statement - <missing ';'>
+n2; //statement
n2 = temp; //statement

Your problem has nothing to do with assignment statements. Additions simply don't work at all - whether they're part of an assignment or not. So the simplest input to get the error would be x+y;. If you print the token stream for that input (using grun with the -tokens option for example), you'll get the following output:
[#0,0:0='x',<IDENTIFIER>,1:0]
[#1,1:1='+',<'+'>,1:1]
[#2,2:2='y',<IDENTIFIER>,1:2]
[#3,3:3=';',<';'>,1:3]
[#4,4:3='<EOF>',<EOF>,1:4]
line 1:1 no viable alternative at input 'x+'
Now compare this to x*y;, which works fine:
[#0,0:0='x',<IDENTIFIER>,1:0]
[#1,1:1='*',<MATH>,1:1]
[#2,2:2='y',<IDENTIFIER>,1:2]
[#3,3:3=';',<';'>,1:3]
[#4,4:3='<EOF>',<EOF>,1:4]
The important difference here is that * is recognized as a MATH token, but + isn't. It's recognized as a '+' token instead.
This happens because you introduced a separate '+' (and '-') token type in the alternative | ('+' | '-') expression. So whenever the lexer sees a + it produces a '+' token, not a MATH token: literals used in parser rules become implicit token types that are placed ahead of the named lexer rules, and when two rules match the same length of input, the one defined first wins.
If you turn MATH into a parser rule math (or maybe mathOperator) instead, all of the operators will be literals and the problem will go away. That said, you probably don't want a single rule for all math operators because that doesn't give you the precedence you want, but that's a different issue.
PS: Something like x+1 still won't work because it will see +1 as a single INTEGER token. You can fix that by removing the leading + and - from the INTEGER rule (that way x = -2 would be parsed as a unary minus applied to the integer 2 instead of just the integer -2, but that's not a problem).
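Putting both fixes together, the changed rules might look roughly like the sketch below. It assumes the operators become literals in the parser, that the MATH lexer rule is dropped, and that ASSIGNMENT spells out its optional operator prefix itself (a lexer rule cannot refer to a parser rule). The precedence order is a conventional one chosen for illustration; the question does not specify it. All rules not shown stay as in the question.

```antlr
// Sketch only: the precedence levels below are an assumption, not taken from the question.
expression
    : <assoc=right> expression '**' expression         // binds tightest, right-associative
    | ('+' | '-') expression                           // unary sign
    | expression ('*' | '/' | '//' | '%') expression
    | expression ('+' | '-') expression
    | STRING
    | INTEGER
    | IDENTIFIER
    ;

// '+=' is longer than '+' alone, so the lexer prefers ASSIGNMENT for compound assignments.
ASSIGNMENT: ('+' | '-' | '*' | '/' | '%' | '//' | '**')? '=';

// No leading sign any more: x+1 lexes as IDENTIFIER '+' INTEGER.
INTEGER: '0' | [1-9][0-9]*;
```

With these rules, n1 = n1 + n2; should tokenize as IDENTIFIER ASSIGNMENT IDENTIFIER '+' IDENTIFIER EOC and parse as a single statement.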

Related

Parsing strings with embedded multi line control character sequences

I am writing a compiler for the realtime programming language PEARL.
PEARL supports strings with embedded control character sequences, e.g.
'some text'\1B 1B 1B\'some more text'.
The control character sequence is prefixed with '\ and ends with \'.
Inside the control sequence are two-digit numbers, each of which specifies a control character.
In the above example the resulting string would be
'some textESCESCESCsome more text'
ESC stands for the non-printable ASCII escape character.
Furthermore, newlines are allowed inside the control char sequence to build multi-line strings, e.g.
'some text'\1B
1B
1B\'some more text'.
which results in the same string as above.
grammar stringliteral;
tokens {
CHAR,CHARS,CTRLCHARS,ESC,WHITESPACE,NEWLINE
}
stringLiteral: '\'' CHARS? '\'' ;
fragment
CHARS: CHAR+ ;
fragment
CHAR: CTRLCHARS | ~['\n\r] ;
fragment
ESC: '\'\\' ;
fragment
CTRLCHARS: ESC ~['] ESC;
WHITESPACE: (' ' | '\t')+ -> channel(HIDDEN);
NEWLINE: ( '\r' '\n'? | '\n' ) -> channel(HIDDEN);
The lexer/parser above behaves very strangely, because it only accepts
strings of the form 'x' and ignores multiple characters and the control char sequence.
I am probably overlooking something obvious. Any hint or idea on how to solve this issue is welcome!
I have now corrected the grammar according to the hints from Mike:
grammar stringliteral;
tokens {
STRING
}
stringLiteral: STRING;
STRING: '\'' ( '\'' '\\' | '\\' '\'' | . )*? '\'';
There is still a problem with the recognition of the end of the control char sequence:
The input 'A STRING'\CTRL\'' produces the errors
Line 1:10 token recognition error at: '\'
line 1:11 token recognition error at: 'C'
line 1:12 token recognition error at: 'T'
line 1:13 token recognition error at: 'R'
line 1:14 token recognition error at: 'L'
line 1:15 token recognition error at: '\'
Any idea? Btw: We are using antlr v 4.5.
There are multiple issues with this grammar:
You cannot use a fragment lexer rule in a parser rule.
Your string rule is a parser rule, so it's subject to automatic whitespace removal you defined with your WHITESPACE and NEWLINE rules.
You have no rule to accept a control char sequence like \1B 1B 1B.
The third point especially is a real problem, since you don't know where your control sequence ends (unless this was just a typo and you actually meant: \1B \1B \1B).
In any case, don't deal with escape sequences in your lexer (except for the minimum handling required to make the rule work, i.e. handling of the \' sequence). Your rule just needs to match the entire text, and you can figure out the escape sequences in your semantic phase:
STRING: '\'' ('\\' '\'' | . )*? '\'';
Note *? is the non-greedy operator to stop at the first closing quote char. Without that the lexer would continue to match all following (escaped and non-escaped) quote chars in the same string rule (greedy behavior). Additionally, the string rule is now a lexer rule, which is not affected by the whitespace skipping.
I solved the problem with this grammar snippet by adapting the appropriate rules from the latest Java grammar example:
StringLiteral
: '\'' StringCharacters? '\''
;
fragment
StringCharacters
: StringCharacter+
;
fragment
StringCharacter
: ~['\\\r\n]
| EscapeSequence
;
fragment
EscapeSequence
: '\'\\' (HexEscape| ' ' | [\r\n])* '\\\''
;
fragment
HexEscape
: B4Digit B4Digit
;
fragment
B4Digit
: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' | 'A' | 'B' | 'C' | 'D' | 'E' | 'F'
;

How do I parse PDF strings with nested string delimiters in antlr?

I'm working on parsing PDF content streams. Strings are delimited by parentheses but can contain nested unescaped parentheses. From the PDF Reference:
A literal string shall be written as an arbitrary number of characters enclosed in parentheses. Any characters may appear in a string except unbalanced parentheses (LEFT PARENTHESIS (28h) and RIGHT PARENTHESIS (29h)) and the backslash (REVERSE SOLIDUS (5Ch)), which shall be treated specially as described in this sub-clause. Balanced pairs of parentheses within a string require no special treatment.
EXAMPLE 1:
The following are valid literal strings:
(This is a string)
(Strings may contain newlines
and such.)
(Strings may contain balanced parentheses ( ) and special characters (*!&}^% and so on).)
It seems like pushing lexer modes onto a stack would be the thing to handle this. Here's a stripped-down version of my lexer and parser.
lexer grammar PdfStringLexer;
Tj: 'Tj' ;
TJ: 'TJ' ;
NULL: 'null' ;
BOOLEAN: ('true'|'false') ;
LBRACKET: '[' ;
RBRACKET: ']' ;
LDOUBLEANGLE: '<<' ;
RDOUBLEANGLE: '>>' ;
NUMBER: ('+' | '-')? (INT | FLOAT) ;
NAME: '/' ID ;
// A sequence of literal characters enclosed in parentheses.
OPEN_PAREN: '(' -> more, pushMode(STR) ;
// Hexadecimal data enclosed in angle brackets
HEX_STRING: '<' [0-9A-Za-z]+ '>' ;
fragment INT: DIGIT+ ; // match 1 or more digits
fragment FLOAT: DIGIT+ '.' DIGIT* // match 1. 39. 3.14159 etc...
| '.' DIGIT+ // match .1 .14159
;
fragment DIGIT: [0-9] ; // match single digit
// Accept all characters except whitespace and defined delimiters ()<>[]{}/%
ID: ~[ \t\r\n\u000C\u0000()<>[\]{}/%]+ ;
WS: [ \t\r\n\u000C\u0000]+ -> skip ; // PDF defines six whitespace characters
mode STR;
LITERAL_STRING : ')' -> popMode ;
STRING_OPEN_PAREN: '(' -> more, pushMode(STR) ;
TEXT : . -> more ;
parser grammar PdfStringParser;
options { tokenVocab=PdfStringLexer; }
array: LBRACKET object* RBRACKET ;
dictionary: LDOUBLEANGLE (NAME object)* RDOUBLEANGLE ;
string: (LITERAL_STRING | HEX_STRING) ;
object
: NULL
| array
| dictionary
| BOOLEAN
| NUMBER
| string
| NAME
;
content : stat* ;
stat
: tj
;
tj: ((string Tj) | (array TJ)) ; // Show text
When I process this file:
(Oliver’s Army) Tj
((What’s So Funny ’Bout) Peace, Love, and Understanding) Tj
I get this error and parse tree:
line 2:24 extraneous input ' Peace, Love, and Understanding)' expecting 'Tj'
So maybe pushMode doesn't push duplicate modes onto the stack. If not, what would be the way to handle nested parentheses?
Edit
I left out the instructions regarding escape sequences within the string:
Within a literal string, the REVERSE SOLIDUS is used as an escape character. The character immediately following the REVERSE SOLIDUS determines its precise interpretation as shown in Table 3. If the character following the REVERSE SOLIDUS is not one of those shown in Table 3, the REVERSE SOLIDUS shall be ignored.
Table 3 lists \n, \r, \t, \b backspace (08h), \f formfeed (FF), \(, \), \\, and \ddd character code ddd (octal)
An end-of-line marker appearing within a literal string without a preceding REVERSE SOLIDUS shall be treated as a byte value of (0Ah), irrespective of whether the end-of-line marker was a CARRIAGE RETURN (0Dh), a LINE FEED (0Ah), or both.
EXAMPLE 2:
(These \
two strings \
are the same.)
(These two strings are the same.)
EXAMPLE 3:
(This string has an end-of-line at the end of it.
)
(So does this one.\n)
Should I use this STRING definition:
STRING
: '(' ( ~[()]+ | STRING )* ')'
;
without modes and deal with escape sequences in my code or create a lexer mode for strings and deal with escape sequences in the grammar?
You could do this with lexical modes, but in this case it's not really needed. You could simply define a lexer rule like this:
STRING
: '(' ( ~[()]+ | STRING )* ')'
;
And with escape sequences, you could try:
STRING
: '(' ( ~[()\\]+ | ESCAPE_SEQUENCE | STRING )* ')'
;
fragment ESCAPE_SEQUENCE
: '\\' ( [nrtbf()\\] | [0-7] [0-7] [0-7] )
;

How to fix this yacc shift/reduce conflict

I have this grammar
value
: INTEGER
| REAL
| LEFTBRACKET value RIGHTBRACKET
| op expression
| expression binaryop expression
;
and I am getting this shift reduce error
47 expression: value .
53 value: LEFTBRACKET value . RIGHTBRACKET
RIGHTBRACKET shift, and go to state 123
RIGHTBRACKET [reduce using rule 47 (expression)]
$default reduce using rule 47 (expression)
So far I tried setting %left and %right priorities with no luck. I have also tried to use a new grammar for value that does not call itself again but I get conflicts. I tried this solution too
Any thoughts?
Thank you in advance
EDIT
expression
: lvalue
| value
;
lvalue
: IDENTIFIER
| lvalue LEFTSQBRACKET expression RIGHTSQBRACKET
| LEFTBRACKET lvalue RIGHTBRACKET
;
binaryop
: PLUS
| MINUS
| MUL
| DIVISION
| DIV
| MOD
;
I managed to overcome most of the conflicts using this grammar, but I still get the conflict I mentioned above
binaryop
: expression PLUS expression
| expression MINUS expression
| expression MUL expression
| expression DIVISION expression
| expression DIV expression
| expression MOD expression
;
Why do you have both value and expression? Without seeing the rest of the grammar, I hesitate to guess the use of expression which leads to that conflict, but my guess is that it has to do with the unnecessary unit production.
On the other hand, you will not be able to resolve precedences if you lump all operator terminals into binaryop (unless all binary operators have the same precedence). So I'd suggest you find a standard expression grammar (such as in the bison manual or wikipedia) and use it as a base.

ANTLR4 - How do I get the token TYPE as the token text in ANTLR?

Say I have a grammar that has tokens like this:
AND : 'AND' | 'and' | '&&' | '&';
OR : 'OR' | 'or' | '||' | '|' ;
NOT : 'NOT' | 'not' | '~' | '!';
When I visualize the ParseTree using TreeViewer or print the tree using tree.toStringTree(), each node's text is the same as what was matched.
So if I parse "A and B or C", the two binary operators will be "and" / "or".
If I parse "A && B || C", they'll be "&&" / "||".
What I would LIKE is for them to always be "AND" / "OR" / "NOT", regardless of what literal symbol was matched. Is this possible?
This is what the vocabulary is for. Use yourLexer.getVocabulary() or yourParser.getVocabulary() and then vocabulary.getSymbolicName(tokenType) for the text representation of the token type. If that returns nothing (null or an empty string, depending on the target runtime), try vocabulary.getLiteralName(tokenType) as a second step, which returns the text used to define the token.

Antlr evaluation order

I defined the following expression rule using Antlr 4 for a scripting language.
Basically, I am trying to evaluate
x = y.z.aa * 6
The correct evaluation order should be y.z, then y.z.aa, and then the multiplication by 6:
((y.z).aa) * 6
However, after parsing, aa*6 is evaluated first, then z.(aa*6), then y.(z.(aa*6)), so it becomes
y.(z.(aa * 6))
The square-bracket form is evaluated correctly:
x = y[z][aa] * 6
Can anyone point out what I did wrong in the dot access rule?
expression
: primary #PrimaryExpression
| expression ('.' expression ) + #DotAccessExpression
| expression ('[' expression ']')+ #ArrayAccessExpression
| expression ('*'|'/') expression #MulExpression
| expression ('+'|'-') expression #AddExpression
;
primary
: '(' expression ')'
| literal
| ident
;
literal
: NUMBER
| STRING
| NULL
| TRUE
| FALSE
;
You used the following rule:
expression ('.' expression)+
This rule does not fit the syntax pattern for a binary expression, so it's actually getting treated as a suffix expression. In particular, the expression following a . character is no longer restricted within the precedence hierarchy. You may be additionally affected by issue #679, but the real resolution is the same either way. You need to replace this alternative with the following:
expression '.' expression
The same goes for the ArrayAccessExpression, which should be written as follows:
expression '[' expression ']' #ArrayAccessExpression
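Applied to the rule from the question, the whole expression rule would then look something like this (a sketch; the primary and literal rules stay unchanged, and the #DotAccessExpression label is carried over from the question):

```antlr
expression
    : primary                              #PrimaryExpression
    | expression '.' expression            #DotAccessExpression
    | expression '[' expression ']'        #ArrayAccessExpression
    | expression ('*'|'/') expression      #MulExpression
    | expression ('+'|'-') expression      #AddExpression
    ;
```

Alternatives listed earlier bind tighter, and binary alternatives are left-associative by default, so y.z.aa * 6 should now group as ((y.z).aa) * 6.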