BNF: input going to wrong nonterminal - yacc

I'm developing a BNF for chess algebraic notation and ran into an interesting case: input going to the wrong non-terminal.
My start BNF rule is as follows (note that this intentionally doesn't include castling or notes):
algebraic_notation : piece start_position capture end_position promotion
piece, start_position, capture, and promotion can be empty, thus allowing a move like 'd4'. The problem is that when such a move is entered, the input ('d4') is taken by start_position, resulting in an error because there is no more input left for end_position, which cannot be empty.
The obvious hack/workaround is to allow end_position to be empty and then check to see if we got any input for it and act accordingly.
This does work, but I would like to know if there is a better way to deal with it. Is it possible for input not to go to the first matching symbol when doing so causes the entire expression not to match?
Another question is whether this is standard behaviour for BNF, or a problem with the yacc implementation I'm using (PLY v3.3). I tried flex/bison and got the same thing, so it appears it's not specific to PLY.
Here are all the relevant rules for completeness:
algebraic_notation : piece start_position capture end_position promotion
piece : KING
| QUEEN
| BISHOP
| KNIGHT
| ROOK
| pawn
pawn : empty
start_position : FILE
| NUMBER
| FILE NUMBER
| empty
end_position : FILE NUMBER
| empty // this line is the hack/workaround
capture : CAPTURE
| empty
promotion : EQUAL QUEEN
| EQUAL ROOK
| EQUAL KNIGHT
| EQUAL BISHOP
| empty
empty :

The problem is that you're ignoring the shift/reduce conflict you get from your parser generator. While yacc/bison (and presumably PLY) will resolve conflicts for you, that resolution might not be doing what you want, and might result in a parser that parses a language other than the one you are trying to parse.
Whenever you get a shift/reduce (or reduce/reduce) conflict from an LR parser generator, you really need to understand what the conflict is (and why it occurs) to know whether you can ignore it or whether you need to fix it. So let's fix your grammar by getting rid of the 'hack' (which is clearly wrong and not something you want to parse), as well as the useless 'empty' rule (which just confuses things):
%token FILE NUMBER
%%
algebraic_notation : piece start_position capture end_position promotion
piece : 'K' | 'Q' | 'B' | 'N' | 'R' | /*pawn*/
start_position : FILE | NUMBER | FILE NUMBER | /*empty*/
end_position : FILE NUMBER
capture : 'x' | /*empty*/
promotion : '=' 'Q' | '=' 'R' | '=' 'N' | '=' 'B' | /*empty*/
Now when you run this through 'bison -v' (ALWAYS use -v to get the verbose output file -- I'm not sure what PLY's equivalent is), you get the message about a shift/reduce conflict, and if you look in the .output file you can see what it is:
state 7
1 algebraic_notation: piece . start_position capture end_position promotion
FILE shift, and go to state 9
NUMBER shift, and go to state 10
FILE [reduce using rule 11 (start_position)]
$default reduce using rule 11 (start_position)
start_position go to state 11
This is telling you that after seeing a piece, when the next token is FILE, it doesn't know whether it should shift (treating the FILE as (part of) the start_position) or reduce (giving an empty start_position). That's because it needs more lookahead to see if there's a second position to use as an end_position to know what to do, so simply ignoring the conflict will result in a parser that fails to parse lots of valid things (basically, anything with an empty start_position and capture).
The best way to solve a lookahead-related shift-reduce conflict involving an empty production like this (or pretty much any conflict involving an empty production, really) is to unfactor the grammar -- get rid of the empty rule and duplicate any rule that uses the non-terminal both with and without it. In your case, this gives you the rules:
algebraic_notation : piece capture end_position promotion
algebraic_notation : piece start_position capture end_position promotion
start_position : FILE | NUMBER | FILE NUMBER
(the other rules are unchanged)
With that you still have a shift-reduce conflict:
state 7
1 algebraic_notation: piece . capture end_position promotion
2 | piece . start_position capture end_position promotion
FILE shift, and go to state 9
NUMBER shift, and go to state 10
'x' shift, and go to state 11
FILE [reduce using rule 14 (capture)]
start_position go to state 12
capture go to state 13
Basically, we've just moved the conflict one step and now have the problem with the empty capture rule. So we unfactor that as well:
algebraic_notation : piece end_position promotion
algebraic_notation : piece capture end_position promotion
algebraic_notation : piece start_position end_position promotion
algebraic_notation : piece start_position capture end_position promotion
capture : 'x'
and now bison reports no more conflicts, so we can be reasonably confident it will parse the way we want. You can simplify it a bit more by getting rid of the capture rule and using a literal 'x' in the algebraic_notation rule. I personally prefer this, as I think it is clearer to avoid the unnecessary indirection:
%token FILE NUMBER
%%
algebraic_notation : piece end_position promotion
algebraic_notation : piece 'x' end_position promotion
algebraic_notation : piece start_position end_position promotion
algebraic_notation : piece start_position 'x' end_position promotion
piece : 'K' | 'Q' | 'B' | 'N' | 'R' | /*pawn*/
start_position : FILE | NUMBER | FILE NUMBER
end_position : FILE NUMBER
promotion : '=' 'Q' | '=' 'R' | '=' 'N' | '=' 'B' | /*empty*/
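As a quick sanity check on the language the final grammar accepts, the same alternatives can be expressed as a single backtracking regular expression. This Python sketch is my illustration, not part of the grammar; it is instructive because a backtracking engine happily re-assigns 'd4' from start_position to end_position, which is exactly what an LR(1) parser cannot do without the unfactoring above:

```python
import re

# [KQBNR]?      optional piece
# [a-h]?[1-8]?  optional start_position (file, rank, or both)
# x?            optional capture marker
# [a-h][1-8]    mandatory end_position
# (=[QRNB])?    optional promotion
SAN = re.compile(r'^[KQBNR]?[a-h]?[1-8]?x?[a-h][1-8](=[QRNB])?$')

for move in ('d4', 'Nf3', 'exd5', 'Qh4e1', 'e8=Q'):
    assert SAN.match(move), move   # accepted: backtracking finds a parse
assert not SAN.match('d')          # no end_position, so rejected
```

For 'd4' the engine first tries 'd4' as the start_position, fails to find an end_position, and backtracks; yacc/bison commit to shift or reduce with one token of lookahead instead, which is why the grammar itself has to be unfactored.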

I haven't used PLY, and without seeing the full flex/bison files you tried I might be picking on a non-issue, but it seems to me you aren't giving the parser any indication that no more input is coming for the current algebraic_notation rule. You don't say how you know the input 'd4' was matched to start_position, but if the parser knew it had all the tokens for the rule, and the only symbol that cannot be empty is end_position, it would have to match 'd4' to that.
What about introducing a token that marks the end of a line, like EOL. So your first rule becomes:
algebraic_notation : piece start_position capture end_position promotion EOL
and the parser now sees the token 'd4' followed by EOL -- does that change the behavior?

What happens if you wrap start_position capture end_position into a middle block, and remove FILE NUMBER from start_pos, like this:
middle: start_pos capture end_pos
| end_pos capture end_pos
| capture end_pos
start_pos : FILE
| NUMBER
| empty
end_pos : FILE NUMBER
capture : CAPTURE
| empty

This question is a good illustration of a problem in computer science theory: the removal of epsilon (or empty) productions from a grammar. The ambiguity problems with the chess notation can be resolved (for yacc or PLY) by transforming the grammar to remove the empty productions. There is much material on this on both SO/SE and on other sites; I append a bibliography for the interested reader.
By performing a mindless transformation of the rules to remove blind/empty/epsilon productions we get the following CFG:
algebraic_notation : piece start_position capture end_position promotion
| piece start_position capture end_position
| piece start_position capture promotion
| piece start_position end_position promotion
| piece capture end_position promotion
| piece start_position capture
| piece start_position end_position
| piece capture end_position
| piece start_position promotion
| piece capture promotion
| piece end_position promotion
| piece promotion
| piece end_position
| piece capture
| piece start_position
| piece
| start_position capture end_position promotion
| start_position capture end_position
| start_position capture promotion
| start_position end_position promotion
| capture end_position promotion
| start_position capture
| start_position end_position
| capture end_position
| end_position promotion
| start_position promotion
| capture promotion
| start_position
| capture
| end_position
| promotion
piece : KING
| QUEEN
| BISHOP
| KNIGHT
| ROOK
start_position : FILE
| NUMBER
| FILE NUMBER
end_position : FILE NUMBER
capture : CAPTURE
promotion : EQUAL QUEEN
| EQUAL ROOK
| EQUAL KNIGHT
| EQUAL BISHOP
(This could probably be simplified by removing those combinations that could not occur in chess notation, but that is an exercise for the reader).
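As a cross-check, the mindless transformation should produce one alternative for every non-empty subsequence of the five nullable symbols, 2^5 - 1 = 31 in total. A short Python sketch (my illustration, not part of the original answer) generates them:

```python
from itertools import combinations

symbols = ('piece', 'start_position', 'capture', 'end_position', 'promotion')

# One alternative of algebraic_notation per non-empty subsequence of the
# five nullable symbols, longest alternatives first.
alternatives = [
    ' '.join(combo)
    for r in range(len(symbols), 0, -1)
    for combo in combinations(symbols, r)
]

assert len(alternatives) == 2 ** len(symbols) - 1   # 31 alternatives
assert alternatives[0] == 'piece start_position capture end_position promotion'
```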
Bibliography
The best books for this are probably:
Hopcroft & Ullman, Introduction to Automata Theory, Languages, and Computation
Aho & Ullman, The Theory of Parsing, Translation, and Compiling
Or just go to the slides from Jeff Ullman's class: Normal Forms for CFG's
Or a bunch of related questions on SO/SE:
https://math.stackexchange.com/questions/563363/removing-epsilon-productions-from-a-context-free-grammar
Removing epsilon production from context-free grammar
Converting grammar to Chomsky Normal Form?


Parsing strings with embedded multi-line control character sequences

I am writing a compiler for the realtime programming language PEARL.
PEARL supports strings with embedded control character sequence like this e.g.
'some text'\1B 1B 1B\'some more text'.
The control character sequence is prefixed with '\ and ends with \'.
Inside the control sequence are two digits numbers, which specify the control character.
In the above example the resulting string would be
'some textESCESCESCsome more text'
ESC stands for the non-printable ASCII escape character.
Furthermore, newlines are allowed inside the control character sequence, to build multi-line strings like e.g.
'some text'\1B
1B
1B\'some more text'.
which results in the same string as above.
grammar stringliteral;
tokens {
CHAR,CHARS,CTRLCHARS,ESC,WHITESPACE,NEWLINE
}
stringLiteral: '\'' CHARS? '\'' ;
fragment
CHARS: CHAR+ ;
fragment
CHAR: CTRLCHARS | ~['\n\r] ;
fragment
ESC: '\'\\' ;
fragment
CTRLCHARS: ESC ~['] ESC;
WHITESPACE: (' ' | '\t')+ -> channel(HIDDEN);
NEWLINE: ( '\r' '\n'? | '\n' ) -> channel(HIDDEN);
The lexer/parser above behaves very strangely: it accepts only strings of the form 'x' and ignores multiple characters and the control character sequence.
Probably I am overlooking something obvious. Any hint or idea how to solve this issue is welcome!
I have now corrected the grammar according to the hints from Mike:
grammar stringliteral;
tokens {
STRING
}
stringLiteral: STRING;
STRING: '\'' ( '\'' '\\' | '\\' '\'' | . )*? '\'';
There is still a problem with the recognition of the end of the control char sequence:
The input 'A STRING'\CTRL\'' produces the errors
Line 1:10 token recognition error at: '\'
line 1:11 token recognition error at: 'C'
line 1:12 token recognition error at: 'T'
line 1:13 token recognition error at: 'R'
line 1:14 token recognition error at: 'L'
line 1:15 token recognition error at: '\'
Any idea? Btw: We are using antlr v 4.5.
There are multiple issues with this grammar:
You cannot use a fragment lexer rule in a parser rule.
Your string rule is a parser rule, so it's subject to automatic whitespace removal you defined with your WHITESPACE and NEWLINE rules.
You have no rule to accept a control char sequence like \1B 1B 1B.
Especially the third point is a real problem, since you don't know where your control sequence ends (unless this was just a typo and you actually meant: \1B \1B \1B).
In any case, don't deal with escape sequences in your lexer (except the minimum handling required to make the rule work, i.e. handling of the \' sequence). Your rule just needs to capture the entire text, and you can figure out escape sequences in your semantic phase:
STRING: '\'' ('\\' '\'' | . )*? '\'';
Note *? is the non-greedy operator to stop at the first closing quote char. Without that the lexer would continue to match all following (escaped and non-escaped) quote chars in the same string rule (greedy behavior). Additionally, the string rule is now a lexer rule, which is not affected by the whitespace skipping.
I solved the problem with this grammar snippet by adapting the appropriate rules from the latest Java grammar example:
StringLiteral
: '\'' StringCharacters? '\''
;
fragment
StringCharacters
: StringCharacter+
;
fragment
StringCharacter
: ~['\\\r\n]
| EscapeSequence
;
fragment
EscapeSequence
: '\'\\' (HexEscape| ' ' | [\r\n])* '\\\''
;
fragment
HexEscape
: B4Digit B4Digit
;
fragment
B4Digit
: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' | 'A' | 'B' | 'C' | 'D' | 'E' | 'F'
;
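Following the earlier advice, the actual decoding of the hex escapes belongs in a semantic phase once the lexer has delivered the raw literal. Here is a hypothetical Python sketch (function name and regex are mine, based only on the question's description of the '\HH HH\' sections):

```python
import re

# Hypothetical helper: a control section starts with '\ and ends with \'
# and contains two-digit hex codes separated by whitespace (including
# newlines, per the multi-line form described in the question).
CTRL_SECTION = re.compile(r"'\\((?:[0-9A-Fa-f]{2}\s*)+)\\'")

def decode_pearl_string(literal):
    def to_chars(match):
        # Turn each two-digit hex code into the control character it names.
        return ''.join(chr(int(code, 16)) for code in match.group(1).split())
    return CTRL_SECTION.sub(to_chars, literal)

# 'some text'\1B 1B 1B\'some more text'  ->  text with three ESC characters
decoded = decode_pearl_string("'some text'\\1B 1B 1B\\'some more text'")
assert decoded == "'some text\x1b\x1b\x1bsome more text'"
```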

Q: ANTLR 4 Grammar recognition of whole odd value not only the last digit

I'm trying to write a grammar for a calculator; however, it has to work only for odd numbers.
For example it works like that:
If I put 123 the result is 123.
If I put 1234 the result is 123, with a token recognition error at: 4, but the error should be at: 1234.
There is my grammar:
grammar G;
DIGIT: ('0'..'9') * ('1' | '3' | '5' | '7'| '9');
operator : ('+' | '-' | '*' | ':');
result: DIGIT operator (DIGIT | result);
I mean specifically that 1234 should be recognized as an error as a whole, not only the last digit.
The way that tokenization works is that it tries to find the longest prefix of the input that matches any of your regular expressions and then produces the appropriate token, consuming that prefix. So when the input is 1234, it sees 123 as the longest prefix that matches the DIGIT pattern (which should really be called ODD_INT or something) and produces the corresponding token. Then it sees the remaining 4 and produces an error because no rule matches it.
Note that it's not necessarily only the last digit that produces the error. For the input 1324, it would produce a DIGIT token for 13 and then a token recognition error for 24.
So how can you get the behaviour that you want? One approach would be to rewrite your pattern to match all sequences of digits and then use a semantic predicate to verify that the number is odd. The way that semantic predicates work on lexer rules is that it first takes the longest prefix that matches the pattern (without taking into account the predicate) and then checks the predicate. If the predicate is false, it moves on to the other patterns - it does not try to match the same pattern to a smaller input to make the predicate return true. So for the input 1234, the pattern would match the entire number and then the predicate would return false. Then it would try the other patterns, none of which match, so you'd get a token recognition error for the full number.
ODD_INT: ('0'..'9') + { Integer.parseInt(getText()) % 2 == 1 }?;
The down side of this approach is that you'll need to write some language-specific code (and if you're not using Java, you'll need to adjust the above code accordingly).
Alternatively, you could just recognize all integers in the lexer - not just odd ones - and then check whether they're odd later during semantic analysis.
If you do want to check the oddness using patterns only, you can also work around the problem by defining rules for both odd and even integers:
ODD_INT: ('0'..'9') * ('1' | '3' | '5' | '7'| '9');
EVEN_INT: ('0'..'9') * ('0' | '2' | '4' | '6'| '8');
This way for an input like 1234, the longest match would always be 1234, not 123. It's just that this would match the EVEN_INT pattern, not ODD_INT. So you wouldn't get a token recognition error, but, if you consistently only use ODD_INT in the grammar, you would get an error saying that an ODD_INT was expected, but an EVEN_INT found.
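The maximal-munch behaviour is easy to reproduce outside ANTLR. This small Python sketch (my illustration, not ANTLR output) tries both patterns and keeps the longest match, so '1234' comes out as a single EVEN_INT rather than an ODD_INT '123' followed by an error:

```python
import re

TOKEN_PATTERNS = [
    ('ODD_INT',  re.compile(r'[0-9]*[13579]')),
    ('EVEN_INT', re.compile(r'[0-9]*[02468]')),
]

def longest_match(text):
    """Maximal munch: of all patterns matching a prefix, keep the longest."""
    best = None
    for name, pattern in TOKEN_PATTERNS:
        m = pattern.match(text)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

assert longest_match('123')  == ('ODD_INT', '123')
assert longest_match('1234') == ('EVEN_INT', '1234')   # whole number, one token
assert longest_match('1324') == ('EVEN_INT', '1324')
```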

ANTLR fuzzy parsing

I'm building a kind of pre-processor in ANTLRv3, which of course only works with fuzzy parsing. At the moment I'm trying to parse include statements and replace them with the corresponding file content. I used this example:
ANTLR: removing clutter
Based on this example, I wrote the following code:
grammar preprocessor;
options {
language='Java';
}
@lexer::header {
package antlr_try_1;
}
@parser::header {
package antlr_try_1;
}
parse
: (t=. {System.out.print($t.text);})* EOF
;
INCLUDE_STAT
: 'include' (' ' | '\r' | '\t' | '\n')+ ('A'..'Z' | 'a'..'z' | '_' | '-' | '.')+
{
setText("Include statement found!");
}
;
Any
: . // fall through rule, matches any character
;
This grammar does only for printing the text and replacing the include statements with the "Include statement found!" string. The example text to be parsed looks like this:
some random input
some random input
some random input
include some_file.txt
some random input
some random input
some random input
The output of the result looks in the following way:
C:\Users\andriyn\Documents\SandBox\text_files\asd.txt line 1:14 mismatched character 'p' expecting 'c'
C:\Users\andriyn\Documents\SandBox\text_files\asd.txt line 2:14 mismatched character 'p' expecting 'c'
C:\Users\andriyn\Documents\SandBox\text_files\asd.txt line 3:14 mismatched character 'p' expecting 'c'
C:\Users\andriyn\Documents\SandBox\text_files\asd.txt line 7:14 mismatched character 'p' expecting 'c'
C:\Users\andriyn\Documents\SandBox\text_files\asd.txt line 8:14 mismatched character 'p' expecting 'c'
C:\Users\andriyn\Documents\SandBox\text_files\asd.txt line 9:14 mismatched character 'p' expecting 'c'
some random ut
some random ut
some random ut
Include statement found!
some random ut
some random ut
some random ut
As far as I can judge, it is confused by the "in" in the word "input", because it "thinks" it would be the INCLUDE_STAT token.
Is there a better way to do it? The filter option I cannot use, since I need not only the include statements, but also the rest of the code. I've tried several other things, but couldn't find a proper solution.
You are observing one of ANTLR 3's limitations. You could use either of these options to correct the immediate problem:
Upgrade to ANTLR 4, which does not have this limitation.
Include the following syntactic predicate at the beginning of the INCLUDE_STAT rule:
`('include' (' ' | '\r' | '\t' | '\n')+ ('A'..'Z' | 'a'..'z' | '_' | '-' | '.')+) =>`

PostgreSQL string search for partial patterns, removing extraneous characters

Looking for a simple SQL (PostgreSQL) regular expression or similar solution (maybe soundex) that will allow a flexible search, so that dashes, spaces, and similar characters are omitted during the search and only the raw characters are compared against the table:
Currently using:
SELECT * FROM Productions WHERE part_no ~* '%search_term%'
If the user types UTR-1, it fails to bring up UTR1 or UTR 1 stored in the database. The matches do not happen when a part_no has a dash and the user omits this character (or vice versa).
EXAMPLE search for part UTR-1 should find all matches below.
UTR1
UTR --1
UTR 1
any suggestions...
You may well find the official, built-in (from 8.3 at least) full-text search capabilities in PostgreSQL worth looking at:
http://www.postgresql.org/docs/8.3/static/textsearch.html
For example:
It is possible for the parser to produce overlapping tokens from the same piece of text. As an example, a hyphenated word will be reported both as the entire word and as each component:
SELECT alias, description, token FROM ts_debug('foo-bar-beta1');
alias | description | token
-----------------+------------------------------------------+---------------
numhword | Hyphenated word, letters and digits | foo-bar-beta1
hword_asciipart | Hyphenated word part, all ASCII | foo
blank | Space symbols | -
hword_asciipart | Hyphenated word part, all ASCII | bar
blank | Space symbols | -
hword_numpart | Hyphenated word part, letters and digits | beta1
SELECT *
FROM Productions
WHERE REGEXP_REPLACE(part_no, '[^[:alnum:]]', '', 'g')
= REGEXP_REPLACE('UTR-1', '[^[:alnum:]]', '', 'g')
Note the 'g' flag: without it, REGEXP_REPLACE removes only the first non-alphanumeric character. Create an index on REGEXP_REPLACE(part_no, '[^[:alnum:]]', '', 'g') for this to work fast.
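The normalization can be prototyped outside the database. This Python sketch (my illustration, mirroring the alphanumeric-stripping comparison) shows why all three stored variants match the search term:

```python
import re

def normalize(part_no):
    """Strip every non-alphanumeric character, mirroring a global
    REGEXP_REPLACE(part_no, '[^[:alnum:]]', '') in PostgreSQL."""
    return re.sub(r'[^0-9A-Za-z]', '', part_no)

search = normalize('UTR-1')
# All three stored forms collapse to the same key as the search term.
assert all(normalize(stored) == search
           for stored in ('UTR1', 'UTR --1', 'UTR 1'))
```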

How can my ANTLR lexer match a token made of characters that are subset of another kind of token?

I have what I think is a simple ANTLR question. I have two token types: ident and special_ident. I want my special_ident to match a single letter followed by a single digit. I want the generic ident to match a single letter, optionally followed by any number of letters or digits. My (incorrect) grammar is below:
expr
: special_ident
| ident
;
special_ident : LETTER DIGIT;
ident : LETTER (LETTER | DIGIT)*;
LETTER : 'A'..'Z';
DIGIT : '0'..'9';
When I try to check this grammar, I get this warning:
Decision can match input such as "LETTER DIGIT" using multiple alternatives: 1, 2.
As a result, alternative(s) 2 were disabled for that input
I understand that my grammar is ambiguous and that input such as A1 could match either ident or special_ident. I really just want the special_ident to be used in the narrowest of cases.
Here's some sample input and what I'd like it to match:
A : ident
A1 : special_ident
A1A : ident
A12 : ident
AA1 : ident
How can I form my grammar such that I correctly identify my two types of identifiers?
Seems that you have 3 cases:
A
AN
A(A|N)(A|N)+
You could classify the middle one as special_ident and the other two as ident; seems that should do the trick.
I'm a bit rusty with ANTLR; I hope this hint is enough. I can try to write out the rules for you, but they could be wrong:
long_ident : LETTER (LETTER | DIGIT) (LETTER | DIGIT)+
special_ident : LETTER DIGIT;
ident : LETTER | long_ident;
Expanding on Carl's thought, I would guess you have four different cases:
A
AN
AA(A|N)*
AN(A|N)+
Only option 2 should be the token special_ident; the other three should be ident. All tokens can be identified by syntax alone. Here is a quick grammar I was able to test in ANTLRWorks, and it appeared to work properly for me. I think Carl's might have one bug when trying to check AA, but getting you 99% there is a huge benefit, so this is only a minor modification to his quick thought.
prog
: (expr WS)+ EOF;
expr
: special_ident {System.out.println("Found special_ident:" + $special_ident.text + "\n");}
| ident {System.out.println("Found ident:" + $ident.text + "\n");}
;
special_ident : LETTER DIGIT;
ident : LETTER
|LETTER DIGIT (LETTER|DIGIT)+
|LETTER LETTER (LETTER|DIGIT)*;
LETTER : 'A'..'Z';
DIGIT : '0'..'9';
WS
: (' '|'\t'|'\n'|'\r')+;
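To double-check the token split outside ANTLR, here is a small Python sketch (my illustration, not generated code) that applies the same two patterns, trying the narrower special_ident first:

```python
import re

SPECIAL_IDENT = re.compile(r'[A-Z][0-9]$')    # exactly one letter + one digit
IDENT         = re.compile(r'[A-Z][A-Z0-9]*$')

def classify(token):
    # Try the narrower special_ident first, then fall back to ident.
    if SPECIAL_IDENT.match(token):
        return 'special_ident'
    if IDENT.match(token):
        return 'ident'
    return None

expected = {'A': 'ident', 'A1': 'special_ident', 'A1A': 'ident',
            'A12': 'ident', 'AA1': 'ident'}
assert all(classify(tok) == kind for tok, kind in expected.items())
```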