Why does this grammar have conflicts?

The grammar is compiled with Lemon, which is an LALR(1) parser generator:
program ::= statement.
statement ::= ifstatement Newline.
statement ::= returnstatement Newline.
ifstatement ::= If Number A statement B.
ifstatement ::= If Number A statement B Newline Else A statement B.
returnstatement ::= Return Number.
The error message is:
user#/tmp > lemon test.lm
test.lm:6: This rule can not be reduced.
1 parsing conflicts.
The debug output is:
State 0:
program ::= * statement
statement ::= * ifstatement Newline
statement ::= * returnstatement Newline
ifstatement ::= * If Number A statement B
ifstatement ::= * If Number A statement B Newline Else A statement B
returnstatement ::= * Return Number
If shift 10
Return shift 3
program accept
statement shift 13
ifstatement shift 12
returnstatement shift 11
State 1:
statement ::= * ifstatement Newline
statement ::= * returnstatement Newline
ifstatement ::= * If Number A statement B
ifstatement ::= * If Number A statement B Newline Else A statement B
ifstatement ::= If Number A statement B Newline Else A * statement B
returnstatement ::= * Return Number
If shift 10
Return shift 3
statement shift 4
ifstatement shift 12
returnstatement shift 11
State 2:
statement ::= * ifstatement Newline
statement ::= * returnstatement Newline
ifstatement ::= * If Number A statement B
ifstatement ::= If Number A * statement B
ifstatement ::= * If Number A statement B Newline Else A statement B
ifstatement ::= If Number A * statement B Newline Else A statement B
returnstatement ::= * Return Number
If shift 10
Return shift 3
statement shift 8
ifstatement shift 12
returnstatement shift 11
State 3:
returnstatement ::= Return * Number
Number shift 14
State 4:
ifstatement ::= If Number A statement B Newline Else A statement * B
B shift 15
State 5:
ifstatement ::= If Number A statement B Newline Else * A statement B
A shift 1
State 6:
ifstatement ::= If Number A statement B Newline * Else A statement B
Else shift 5
State 7:
(3) ifstatement ::= If Number A statement B *
ifstatement ::= If Number A statement B * Newline Else A statement B
Newline shift 6
Newline reduce 3 ** Parsing conflict **
State 8:
ifstatement ::= If Number A statement * B
ifstatement ::= If Number A statement * B Newline Else A statement B
B shift 7
State 9:
ifstatement ::= If Number * A statement B
ifstatement ::= If Number * A statement B Newline Else A statement B
A shift 2
State 10:
ifstatement ::= If * Number A statement B
ifstatement ::= If * Number A statement B Newline Else A statement B
Number shift 9
State 11:
statement ::= returnstatement * Newline
Newline shift 16
State 12:
statement ::= ifstatement * Newline
Newline shift 17
State 13:
(0) program ::= statement *
$ reduce 0
State 14:
(5) returnstatement ::= Return Number *
{default} reduce 5
State 15:
(4) ifstatement ::= If Number A statement B Newline Else A statement B *
{default} reduce 4
State 16:
(2) statement ::= returnstatement Newline *
{default} reduce 2
State 17:
(1) statement ::= ifstatement Newline *
{default} reduce 1
----------------------------------------------------
Symbols:
0: $:
1: Newline
2: If
3: Number
4: A
5: B
6: Else
7: Return
8: error:
9: program: If Return
10: statement: If Return
11: ifstatement: If
12: returnstatement: Return

Take a look at state 7 in the debug output. It describes the situation where the parser has already accepted the following sequence of symbols:
ifstatement ::= If Number A statement B *
When a Newline token arrives in this state, the parser has two options to choose from:
1. Shift the token and switch to state 6. This shift is prescribed by the following rule of your grammar:
ifstatement ::= If Number A statement B Newline Else A statement B.
2. Consider the current rule complete and return to the enclosing rule. This reduce is prescribed by this rule of your grammar:
ifstatement ::= If Number A statement B.
An LALR(1) parser cannot resolve this choice, because it has only one token of lookahead: it sees the Newline but cannot see whether an Else follows it.
Revise your grammar to avoid this conflict. I can only add that newline characters are commonly not part of a language's grammar. The tokenizer usually treats them as token boundaries, like other whitespace characters.
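To make the lookahead problem concrete, here is a minimal hand-written sketch in Python (not Lemon output, and not a drop-in fix for the grammar). It assumes the tokens arrive as plain strings named after the grammar's terminals; the class name Parser and the helper names peek and expect are made up for this illustration. The only point is the decision after B: the sketch has to look at two tokens, Newline and whatever follows it, which is exactly what an LALR(1) parser cannot do.

```python
class Parser:
    """Hand-written parser for the grammar above, using two tokens of lookahead."""

    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0

    def peek(self, k=0):
        # Return the k-th upcoming token, or "$" once the input is exhausted.
        i = self.pos + k
        return self.tokens[i] if i < len(self.tokens) else "$"

    def expect(self, kind):
        if self.peek() != kind:
            raise SyntaxError(f"expected {kind}, got {self.peek()} at token {self.pos}")
        self.pos += 1

    def statement(self):
        # statement ::= ifstatement Newline | returnstatement Newline
        node = self.ifstatement() if self.peek() == "If" else self.returnstatement()
        self.expect("Newline")
        return node

    def returnstatement(self):
        # returnstatement ::= Return Number
        self.expect("Return")
        self.expect("Number")
        return ("return",)

    def ifstatement(self):
        self.expect("If")
        self.expect("Number")
        self.expect("A")
        then_branch = self.statement()
        self.expect("B")
        # This is the situation of state 7: the next token is Newline.
        # Only the token AFTER Newline tells the two rules apart.
        if self.peek(0) == "Newline" and self.peek(1) == "Else":
            self.expect("Newline")
            self.expect("Else")
            self.expect("A")
            else_branch = self.statement()
            self.expect("B")
            return ("if", then_branch, else_branch)
        return ("if", then_branch, None)


def parse(tokens):
    parser = Parser(tokens)
    tree = parser.statement()
    parser.expect("$")  # make sure nothing is left over
    return tree


without_else = "If Number A Return Number Newline B Newline".split()
with_else = ("If Number A Return Number Newline B Newline "
             "Else A Return Number Newline B Newline").split()
print(parse(without_else))
print(parse(with_else))
```

If the second lookahead (`self.peek(1)`) is removed, the sketch has to guess at Newline, which is the same shift/reduce choice Lemon reports.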

Related

How to make a regex of alpha numeric in objective c

How can I write a regex that requires a string to contain both letters and digits? If the string contains only letters or only digits, the match should fail.
Eg:
123swift -> true
swift123 -> true
1231 -> false
swift -> false
My regex:
[a-z]|[0-9]
Use
^(?=.*?[A-Za-z])(?=.*?[0-9])[0-9A-Za-z]+$
Or, a presumably more efficient version:
^(?=[^A-Za-z]*[A-Za-z])(?=[^0-9]*[0-9])[0-9A-Za-z]+$
Explanation:
NODE                     EXPLANATION
--------------------------------------------------------------------------------
  ^                        the beginning of the string
--------------------------------------------------------------------------------
  (?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
    .*?                      any character except \n (0 or more times
                             (matching the least amount possible))
--------------------------------------------------------------------------------
    [A-Za-z]                 any character of: 'A' to 'Z', 'a' to 'z'
--------------------------------------------------------------------------------
  )                        end of look-ahead
--------------------------------------------------------------------------------
  (?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
    .*?                      any character except \n (0 or more times
                             (matching the least amount possible))
--------------------------------------------------------------------------------
    [0-9]                    any character of: '0' to '9'
--------------------------------------------------------------------------------
  )                        end of look-ahead
--------------------------------------------------------------------------------
  [0-9A-Za-z]+             any character of: '0' to '9', 'A' to 'Z',
                           'a' to 'z' (1 or more times (matching the
                           most amount possible))
--------------------------------------------------------------------------------
  $                        before an optional \n, and the end of the
                           string
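As a quick check of the pattern against the examples from the question, here is a small sketch using Python's re module rather than Objective-C (the lookahead syntax used here is the same in both engines, but that is an assumption worth verifying in your own environment). The function name contains_letters_and_digits is made up for this illustration.

```python
import re

# The "presumably more efficient" pattern from the answer, unchanged.
PATTERN = re.compile(r"^(?=[^A-Za-z]*[A-Za-z])(?=[^0-9]*[0-9])[0-9A-Za-z]+$")

# The pattern from the question, for comparison.
NAIVE = re.compile(r"[a-z]|[0-9]")


def contains_letters_and_digits(text):
    """True only if text is alphanumeric and has at least one letter and one digit."""
    return PATTERN.match(text) is not None


for sample in ["123swift", "swift123", "1231", "swift"]:
    print(sample, "->", contains_letters_and_digits(sample))

# The alternation in the original pattern succeeds on a single letter OR a
# single digit, so it cannot require both kinds of character.
print(bool(NAIVE.search("swift")))
```

The two lookaheads each assert one requirement (a letter somewhere, a digit somewhere) without consuming input, and the final character class then restricts the whole string to alphanumerics.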

Antlr4 unexpectedly stops parsing expression

I'm developing a simple calculator with the formula grammar:
grammar Formula ;
expr : <assoc=right> expr POW expr # pow
| MINUS expr # unaryMinus
| PLUS expr # unaryPlus
| expr PERCENT # percent
| expr op=(MULTIPLICATION|DIVISION) expr # multiplyDivide
| expr op=(PLUS|MINUS) expr # addSubtract
| ABS '(' expr ')' # abs
| '|' expr '|' # absParenthesis
| MAX '(' expr ( ',' expr )* ')' # max
| MIN '(' expr ( ',' expr )* ')' # min
| '(' expr ')' # parenthesis
| NUMBER # number
| '"' COLUMN '"' # column
;
MULTIPLICATION: '*' ;
DIVISION: '/' ;
PLUS: '+' ;
MINUS: '-' ;
PERCENT: '%' ;
POW: '^' ;
ABS: [aA][bB][sS] ;
MAX: [mM][aA][xX] ;
MIN: [mM][iI][nN] ;
NUMBER: [0-9]+('.'[0-9]+)? ;
COLUMN: (~[\r\n"])+ ;
WS : [ \t\r\n]+ -> skip ;
The input "column a"*"column b" gives me the expected parse tree.
But the input "column a" * "column b" unexpectedly stops parsing.
What am I missing?
Your WS rule is broken by the COLUMN rule, which has a higher precedence. More precisely, the issue is that ~[\r\n"] matches space characters too.
"column a"*"column b" lexes as follows: '"' COLUMN '"' MULTIPLICATION '"' COLUMN '"'
"column a" * "column b" lexes as follows: '"' COLUMN '"' COLUMN '"' COLUMN '"'
Yes, "space star space" got lexed as a COLUMN token because that's how ANTLR lexer rules work: longer token matches get priority.
As you can see, this token stream does not match the expr rule as a whole, so expr matches as much as it could, which is '"' COLUMN '"'.
Declaring a lexer rule that consists only of a negated character set, as you did, is a bad idea. And having separate '"' tokens doesn't feel right to me either.
What you should have done is to include the quotes in the COLUMN rule as they're logically part of the token:
COLUMN: '"' (~["\r\n])* '"';
Then remove the standalone quotes from your parser rule. You can either unquote the text later, when you process the parse tree, or change the token emission logic in the lexer so that it changes the underlying value of the token.
And in order to not ignore trailing input, add another rule which will make sure you've consumed the whole input:
formula: expr EOF;
Then use this rule as your entry rule instead of expr when calling your parser.
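The lexing behaviour described above can be imitated with a small Python sketch. It is not ANTLR; it only applies the two rules mentioned in this answer (the longest match wins, and on a tie the rule listed first wins) to a cut-down rule set. The token name QUOTE stands in for the implicit '"' token, and the rule lists ORIGINAL and FIXED are assumptions made for this illustration.

```python
import re


def tokenize(text, rules, skip=("WS",)):
    """Longest match wins; on a tie, the rule listed first wins."""
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in compiled:
            match = regex.match(text, pos)
            # Strict ">" keeps the earlier rule when two matches have equal length.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name not in skip:
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


# Cut-down version of the grammar in the question.
ORIGINAL = [
    ("QUOTE", r'"'),
    ("MULTIPLICATION", r"\*"),
    ("COLUMN", r'[^\r\n"]+'),
    ("WS", r"[ \t\r\n]+"),
]

# Same rules with the quotes moved into COLUMN, as suggested above.
FIXED = [
    ("MULTIPLICATION", r"\*"),
    ("COLUMN", r'"[^"\r\n]*"'),
    ("WS", r"[ \t\r\n]+"),
]

text = '"column a" * "column b"'
print(tokenize(text, ORIGINAL))
print(tokenize(text, FIXED))
```

With ORIGINAL, the three characters between the quoted names come out as a single COLUMN token, because COLUMN matches three characters where WS matches only one. With FIXED, the spaces are skipped and the star becomes a MULTIPLICATION token.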
But "column a" * "column b" input unexpectedly stops parsing
If I run your grammar with ANTLR 4.6, it does not stop parsing; it parses the whole file and displays in pink what the parser can't match:
The dots represent spaces.
And there is an important error message:
line 1:10 mismatched input ' * ' expecting {<EOF>, '*', '/', '+', '-', '%', '^'}
As soon as you get a "mismatched input" error, add -tokens to grun.
With "column a"*"column b" :
$ grun Formula expr -tokens -diagnostics t1.text
[#0,0:0='"',<'"'>,1:0]
[#1,1:8='column a',<COLUMN>,1:1]
[#2,9:9='"',<'"'>,1:9]
[#3,10:10='*',<'*'>,1:10]
[#4,11:11='"',<'"'>,1:11]
[#5,12:19='column b',<COLUMN>,1:12]
[#6,20:20='"',<'"'>,1:20]
[#7,22:21='<EOF>',<EOF>,2:0]
With "column a" * "column b":
$ grun Formula expr -tokens -diagnostics t2.text
[#0,0:0='"',<'"'>,1:0]
[#1,1:8='column a',<COLUMN>,1:1]
[#2,9:9='"',<'"'>,1:9]
[#3,10:12=' * ',<COLUMN>,1:10]
[#4,13:13='"',<'"'>,1:13]
[#5,14:21='column b',<COLUMN>,1:14]
[#6,22:22='"',<'"'>,1:22]
[#7,24:23='<EOF>',<EOF>,2:0]
line 1:10 mismatched input ' * ' expecting {<EOF>, '*', '/', '+', '-', '%', '^'}
You immediately see that " * " is interpreted as COLUMN.
Many questions about matching input with lexer rules have been asked recently:
extraneous input
ordering
greedy
ambiguity
expression
They come up so often that Lucas posted a question solely to answer it with a summary of the whole problem: disambiguate.

ANTLR is taking the wrong branch

I have this very simple grammar:
grammar LispExp;
expression : LITERAL #LiteralExp
| '(' '-' expression ')' #UnaryMinusExp
| '(' OP expression expression ')' #OpExp
| '(' 'if' expression expression expression ')' #IfExp;
OP : '+' | '-' | '*' | '/' | '==' | '<';
LITERAL : '0'|('1'..'9')('0'..'9')*;
WS : ('\t' | '\n' | '\r' | ' ') -> skip;
It should be able to parse a "lisp-like" expression, but when I try to parse this:
(+ (+ 5 (* 7 (/ 5 (- 2 (- 9) ) ) ) ) 8)
ANTLR fails to recognize the last unary minus, and generates the following (with ANTLR v4):
(expression ( + (expression ( + (expression 5) (expression ( * (expression 7) (expression ( / (expression 5) (expression ( - (expression 2))) ( -) 9 )) expression ))
So, how can I make ANTLR understand the priority of unary minus over binary expression?
You are using a combined grammar LispExp, as opposed to separate lexer grammar LispExpLexer and parser grammar LispExpParser. When working with combined grammars, if you use a string literal in a parser rule the code generator will create anonymous tokens according to those string literals, and silently override the lexer.
In this case, your expression rule includes the string literal '-'. All instances of - in your input will be assigned this token type, which means they will not ever have the token type OP. Your input contains a subexpression (- 2 (- 9) ) which can only be parsed if the first - is an OP token, so according to the parser you have a syntax error in your input.
If you update your code to use separate lexer and parser grammars, any attempt to use a string literal in the parser grammar which is not defined in the lexer grammar will produce an error when you attempt to generate your lexer and parser.

Re-write Parsing Expression Grammar (PEG) without left recursion

Using Grammar-Kit (https://github.com/JetBrains/Grammar-Kit), how can I rewrite this grammar without left recursion?
grammar ::= exprs
exprs::= (sum_expr (';')?)*
private sum_expr::= sum_expr_infix | sum_expr_prefix
sum_expr_infix ::= number sum_expr_prefix
left sum_expr_prefix::= op_plus number
private op_plus ::= '+'
number ::= float | integer
float ::= digit+ '.' digit*
integer ::= digit+
private digit ::=('0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9')
Sample input:
10+20+30.0;
10+20+30.0
The answer should preserve the property that each parse tree node has 2 or 3 children.
This question pointed me in the right direction: "Parsing boolean expression without left hand recursion". The resulting grammar:
grammar ::= e*
e ::= math separator?
math ::= add
add ::=
mul op_plus math
| mul op_minus math
| mul
mul ::=
factorial op_mul mul
| factorial op_div mul
| factorial
factorial ::= term op_factorial space* | term
op_factorial ::= '!'
term ::= parentheses | space* number space*
parentheses ::= '(' math ')'
op_minus ::= '-'
op_plus ::= '+'
op_div ::= '/'
op_mul ::= '*'
number ::= float | integer
float ::= (digit+'.') digit*
integer ::=digit+
digit ::= '0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9'
space ::= ' ' | '\t'
separator ::= ';'
test input:
1!
3*2+1
3*2+1+3.0!
3*2+1 + 3.0!
1+1+(1+1)!
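To show what the right-recursive shape of add and mul means for the resulting tree, here is a minimal recursive-descent sketch in Python. It is not Grammar-Kit output: it covers only add, mul, parentheses and numbers (factorial and the separator are left out), and the function names and the tuple-based tree are assumptions made for this illustration.

```python
import re

# A number (float ::= digit+ '.' digit*, or integer) or a single operator/paren.
TOKEN = re.compile(r"\s*(\d+\.\d*|\d+|[-+*/()])")


def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_add(tokens, pos):
    # add ::= mul ('+' | '-') add | mul   (recursion is on the right)
    left, pos = parse_mul(tokens, pos)
    if pos < len(tokens) and tokens[pos] in ("+", "-"):
        op = tokens[pos]
        right, pos = parse_add(tokens, pos + 1)
        return (op, left, right), pos
    return left, pos


def parse_mul(tokens, pos):
    # mul ::= term ('*' | '/') mul | term
    left, pos = parse_term(tokens, pos)
    if pos < len(tokens) and tokens[pos] in ("*", "/"):
        op = tokens[pos]
        right, pos = parse_mul(tokens, pos + 1)
        return (op, left, right), pos
    return left, pos


def parse_term(tokens, pos):
    # term ::= '(' add ')' | number
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    token = tokens[pos]
    if token == "(":
        node, pos = parse_add(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        return node, pos + 1
    if token[0].isdigit():
        return (float(token) if "." in token else int(token)), pos + 1
    raise SyntaxError(f"unexpected token {token!r}")


def parse(text):
    tokens = tokenize(text)
    tree, pos = parse_add(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree


print(parse("10+20+30.0"))
print(parse("3*2+1"))
```

Each operator node in this sketch has three children (operator, left operand, right operand). One thing to keep in mind: because the recursion is on the right, a chain such as 10-2-3 groups as 10-(2-3), so an evaluator built on this tree shape needs to handle associativity for - and / separately.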

Markup parser failing

For a markup language I'm trying to parse, I decided to give parser generation a try with ANTLR. I'm new to the field, and I'm messing something up.
My grammar is
grammar Test;
DIGIT : ('0'..'9');
LETTER : ('A'..'Z');
SLASH : '/';
restriction
: ('E' ap)
| ('L' ap)
| 'N';
ap : LETTER LETTER LETTER;
car : LETTER LETTER;
fnum : DIGIT DIGIT DIGIT DIGIT? LETTER?;
flt : car fnum?;
message : 'A' (SLASH flt)? (SLASH restriction)?;
which does exactly what I want, when I give it an input string A/KK543/EPOS. When I give it A/KL543/EPOS however, it fails (MismatchedTokenException(9!=5)). It seems like some sort of conflict; it wants to generate restriction on the first L, so it seems I'm doing something wrong in the language definition, but I can't properly find out what.
For the input "A/KK543/EPOS", the following tokens are created:
'A' 'A'
SLASH '/'
LETTER 'K'
LETTER 'K'
DIGIT '5'
DIGIT '4'
DIGIT '3'
SLASH '/'
'E' 'E'
LETTER 'P'
LETTER 'O'
LETTER 'S'
But for the input "A/KL543/EPOS", these are created:
'A' 'A'
SLASH '/'
LETTER 'K'
'L' 'L'
DIGIT '5'
DIGIT '4'
DIGIT '3'
SLASH '/'
'E' 'E'
LETTER 'P'
LETTER 'O'
LETTER 'S'
As you can see, the char 'L' does not get tokenized as a LETTER. For the literal tokens 'A', 'E', 'L' and 'N' inside your parser rules, ANTLR (automatically) creates separate lexer rules that are placed before all other lexer rules. This causes your lexer to look like this behind the scenes:
A : 'A';
E : 'E';
L : 'L';
N : 'N';
DIGIT : '0'..'9';
LETTER : 'A'..'Z';
SLASH : '/';
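The effect of that rule order can be imitated in a few lines of Python. This is only a sketch of the "first matching rule wins" behaviour for single-character tokens, not ANTLR itself; the names RULES, LETTER_LIKE and is_letter are made up for this illustration, and is_letter mimics the idea of a parser rule that accepts the literal tokens as letters too.

```python
import re

# Same order as the behind-the-scenes lexer: implicit literals first.
RULES = [
    ("'A'", r"A"),
    ("'E'", r"E"),
    ("'L'", r"L"),
    ("'N'", r"N"),
    ("DIGIT", r"[0-9]"),
    ("LETTER", r"[A-Z]"),
    ("SLASH", r"/"),
]


def tokenize(text):
    """Every token is one character long, so the first matching rule wins."""
    tokens = []
    for ch in text:
        for name, pattern in RULES:
            if re.fullmatch(pattern, ch):
                tokens.append((name, ch))
                break
        else:
            raise ValueError(f"no rule matches {ch!r}")
    return tokens


# Token types that a parser-level "letter" rule would have to accept.
LETTER_LIKE = {"'A'", "'E'", "'L'", "'N'", "LETTER"}


def is_letter(token):
    return token[0] in LETTER_LIKE


for text in ["A/KK543/EPOS", "A/KL543/EPOS"]:
    tokens = tokenize(text)
    print(text, tokens[2:4], [is_letter(t) for t in tokens[2:4]])
```

In the second input the L comes out with the literal token type instead of LETTER, which is why a rule that expects LETTER LETTER fails there, while a check like is_letter accepts both.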
Therefore, any single 'A', 'E', 'L' or 'N' will never become a LETTER token. This is simply how ANTLR works. If you want to match them as letters, you'll need to create a parser rule letter and let it match these tokens too. Something like this:
message
: A (SLASH flt)? (SLASH restriction)?
;
flt
: car fnum?
;
fnum
: DIGIT DIGIT DIGIT DIGIT? letter?
;
restriction
: E ap
| L ap
| N
;
ap
: letter letter letter
;
car
: letter letter
;
letter
: A
| E
| L
| N
| LETTER
;
A : 'A';
E : 'E';
L : 'L';
N : 'N';
DIGIT : '0'..'9';
LETTER : 'A'..'Z';
SLASH : '/';
This grammar will parse the input "A/KL543/EPOS".