Lex and yacc program to find palindrome string - yacc

Here are my lex and yacc files to recognise palindrome strings, but the program prints "INVALID" for both valid and invalid strings. Please help me find the problem; I am new to lex and yacc. Thanks in advance.
LEX file
%{
#include "y.tab.h"
%}
%%
a return A;
b return B;
. return *yytext;
%%
YACC file
%{
#include<stdio.h>
#include "lex.yy.c"
int i=0;
%}
%token A B
%%
S: pal '\n' {i=1;}
pal:
| A pal A {printf("my3");i=1;}
| B pal B {printf("my4");i=1;}
| A {printf("my1");i=1;}
| B {printf("my2");i=1;}
;
%%
int main()
{
printf("Enter Valid string\n");
yyparse();
if(i==1)
printf("Valid");
return 0;
}
int yyerror(char* s)
{
printf("Invalid\n");
return 0;
}
Example: the entered string is aba.
The expected output is VALID, but it gives INVALID.

It is impossible to solve this problem with plain Yacc.
Yacc is an LALR(1) parser generator. LALR refers to a class of grammars; a grammar is a mathematical tool for reasoning about parsing. The 1 in parentheses is the lookahead: the maximum number of tokens the parser considers before committing to one of the alternative productions (or "rules"). Remember that the parsing algorithm is one-pass: it can't backtrack and try another alternative, as some regular expression engines do.
Concerning your palindrome problem: when the parser encounters 'a', it somehow has to pick the right choice among
pal: A - 'a' alone is a valid palindrome all by itself, let's call it the inner core
pal: [A] pal A - outer layer, increasing nesting level
pal: A pal [A] - outer layer, decreasing nesting level
Making the right choice means knowing where the middle of the string is, which is impossible without unbounded lookahead, but Yacc has only one token of lookahead.
The way Yacc handles this grammar is interesting as well.
If a grammar is ambiguous or not LALR(1), the generated stack automaton is non-deterministic. There are some built-in tools to fix it.
The first tool is precedence and associativity declarations, which deal with operators in programming languages (not relevant here).
Another one is a quirk: by default Yacc prefers "shift" to "reduce". These two terms refer to the internal operation of the parsing algorithm. Basically, tokens are shifted onto a stack. Once a group of tokens on the top matches a rule, it is possible to "reduce" them, replacing the entire group with the single non-terminal from the left side of the rule.
Hence once we have 'a' at the top, we can either reduce it to a pal, or we can shift another token, assuming that a nested pal will emerge eventually. Yacc prefers the latter.
The reason for this preference? The same ambiguity arises in the if-then-else statement of most languages. Consider two nested if statements but only one else clause. Yacc attaches the else to the innermost if statement, which seems to be the right thing to do.
Besides, Yacc can generate a report highlighting issues in the grammar, such as the shift-reduce conflicts mentioned above.
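For comparison, outside Yacc the check is easy once the whole string is in memory, because the middle is known up front and both ends can be compared directly. This is a minimal C sketch of that idea, not something Yacc generates; is_palindrome is a hypothetical helper name.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 when s reads the same in both directions, 0 otherwise.
   Hypothetical helper for illustration; it is not part of the grammar. */
int is_palindrome(const char *s) {
    size_t left = 0;
    size_t right = strlen(s);   /* one past the last character */

    /* Compare the outermost pair, then move inwards until the ends meet. */
    while (left + 1 < right) {
        if (s[left] != s[right - 1]) {
            return 0;
        }
        left++;
        right--;
    }
    return 1;
}
```

The function needs the length of the input before it starts, which is exactly the information a one-token-lookahead parser does not have when it sees the first 'a'.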

Continuing from the comments by #ChrisDod and #NickZavaritsky, here is a working version that uses a GLR parser (bison). The lexer:
%option noyywrap
%%
a return A;
b return B;
\n return '\n';
. {fprintf(stderr, "Error\n"); exit(1);}
%%
and the Yacc/bison file:
%{
#include <stdio.h>
int i=0;
%}
%token A B
%glr-parser
%%
S : pal '\n' {i=1; return 1 ;}
| error '\n' {i=0; return 1 ;}
pal: A pal A
| B pal B
| A
| B
|
;
%%
#include "lex.yy.c"
int main() {
yyparse();
if(i==1) printf("Valid\n");
else printf("inValid\n");
return 0;
}
int yyerror(char* s) { return 0; }
Two changes were made in the lexer: (1) a rule for \n, which was missing, was added; (2) unknown characters are now fatal errors.
The error token (bison's error recovery) is used to detect the "invalid palindrome" case.

Related

Erratic parser. Same grammar, same input, cycles through different results. What am I missing?

I'm writing a basic parser that reads from stdin and prints results to stdout. The problem is that I'm having trouble with this grammar:
%token WORD NUM TERM
%%
stmt: /* empty */
| word word term { printf("[stmt]\n"); }
| word number term { printf("[stmt]\n"); }
| word term
| number term
;
word: WORD { printf("[word]\n"); }
;
number: NUM { printf("[number]\n"); }
;
term: TERM { printf("[term]\n"); /* \n */}
;
%%
When I run the program and type hello world\n, the output is (as I expected) [word] [word] [term] [stmt]. So far, so good, but if I then type hello world\n again, I get syntax error [word][term].
When I type hello world\n for the third time it works, then it fails again, then it works, and so on and so forth.
Am I missing something obvious here?
(I have some experience with hand-rolled compilers, but I've not used lex/yacc et al.)
This is the main func:
int main() {
do {
yyparse();
} while(!feof(yyin));
return 0;
}
Any help would be appreciated. Thanks!
Your grammar recognises a single stmt. Yacc/bison expect the grammar to describe the entire input, so after the statement is recognised, the parser waits for an end-of-input indication. But it doesn't get one, since you typed a second statement. That causes the parser to report a syntax error. But note that it has now read the first token in the second line.
You are calling yyparse() in a loop and not stopping when you get a syntax error return value. So when you call yyparse() again, it will continue where the last one left off, which is just before the second token in the second line. What remains is just a single word, which it then correctly parses.
What you probably should do is write your parser so that it accepts any number of statements, and perhaps so that it does not die when it hits an error. That would look something like this:
%%
prog: %empty
| prog line
line: stmt '\n' { puts("Got a statement"); }
| error '\n' { yyerrok; /* Simple error recovery */ }
...
Note that I print a message for a statement only after I know that the line was correctly parsed. That usually turns out to be less confusing. But the best solution is not to use printf at all, but rather to use Bison's trace facility, which is as simple as putting -t on the bison command line and setting the global variable yydebug to 1. See Tracing your parser

yacc lex when parsing CNC GCODES

I have to parse motion control programs (CNC machines, GCODE)
It is GCODE plus similar looking code specific to hardware.
There are lots of commands that consist of a single letter and number, example:
C100Z0.5C100Z-0.5
C80Z0.5C80Z-0.5
So part of my (abbreviated) lex file (racc & rex, actually) looks like:
A {[:A,text]}
B {[:B,text]}
...
Z {[:Z,text]}
Then I found a command that takes ANY letter as an argument, and in racc I started typing:
letter : A
| B
| C
......
Then I stopped. I haven't used yacc in 30 years; is there some kind of shortcut for the above? Have I gone horribly off course?
It is not clear what you are trying to accomplish. If you want a Yacc rule that covers all letters, you could create a token for that:
%token letter_token
In lex you would match each letter with a regular expression and simply return letter_token:
Regex for letters {
return letter_token;
}
Now you can use letter_token in Yacc rules:
letter : letter_token
Also, you haven't said what language you're using. But if you need to, you can get the specific character that matched letter_token by defining a union:
%union {
char c;
}
%token <c> letter_token
Let's say you want to read single characters; the lex rule that assigns the character to the token would be:
[A-Z] {
yylval.c = *yytext;
return letter_token;
}
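To make the idea concrete without any generator, here is a hand-written C sketch that treats every upper-case letter as one token class and keeps the letter itself as the value, together with the number that follows it. The names struct gword and scan_gwords are hypothetical, and the accepted number format (optional sign, digits, optional fraction) is an assumption, not something taken from a G-code specification.

```c
#include <ctype.h>

/* One word of the program: an address letter and its numeric argument. */
struct gword {
    char letter;
    double value;
};

/* Splits text such as "C100Z-0.5" into words. Returns the number of words
   stored in out, or -1 on malformed input or when more than max words occur.
   Hypothetical helper for illustration only. */
int scan_gwords(const char *text, struct gword *out, int max) {
    int n = 0;
    while (*text != '\0') {
        if (isspace((unsigned char)*text)) { text++; continue; }
        if (!isupper((unsigned char)*text) || n >= max) return -1;
        char letter = *text++;              /* any letter: one token class */

        double sign = 1.0, mantissa = 0.0, divisor = 1.0;
        int digits = 0;
        if (*text == '-' || *text == '+') {
            if (*text == '-') sign = -1.0;
            text++;
        }
        while (isdigit((unsigned char)*text)) {
            mantissa = mantissa * 10.0 + (*text++ - '0');
            digits++;
        }
        if (*text == '.') {
            text++;
            while (isdigit((unsigned char)*text)) {
                mantissa = mantissa * 10.0 + (*text++ - '0');
                divisor *= 10.0;            /* one more decimal place */
                digits++;
            }
        }
        if (digits == 0) return -1;         /* letter without a number */

        out[n].letter = letter;
        out[n].value = sign * mantissa / divisor;
        n++;
    }
    return n;
}
```

The number is parsed by hand rather than with strtod because strtod would read the "0X10" in something like G0X10 as a hexadecimal constant and swallow the X.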
Feel free to ask any further questions, and read more here about How to create a Minimal, Complete, and Verifiable example.

Bison Syntax Error easy file

I'm trying to run this .y file:
%{
#include <stdlib.h>
#include <stdio.h>
int yylex();
int yyerror();
%}
%start BEGIN
%%
BEGIN: 'a' | BEGIN 'a'
%%
int yylex(){
return getchar();
}
int yyerror(char* s){
fprintf(stderr, "*** ERROR: %s\n", s);
return 0;
}
int main(int argn, char **argv){
yyparse();
return 0;
}
It's a simple bison program and the syntax seems correct to me, but I always get a syntax error.
Thanks for your help.
The lexer function yylex needs to return 0 to indicate the end of the input. However, your implementation simply passes through the value returned by getchar, which will be EOF (normally -1).
Also, your input is almost certain to include a newline character, which will also be passed through to the parser.
Since the parser recognizes neither \n nor EOF, it produces an error when it receives one of them.
At a minimum, you would need to modify yylex to correctly respond to end of input:
int yylex(void) {
int ch = getchar();
return (ch == EOF) ? 0 : ch;
}
But you will still have to deal with newline characters, either by handling them in your lexer (possibly ignoring them or possibly returning an end-of-input indication), or by handling them in your grammar.
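As an illustration of the "ignore them in the lexer" option, here is a minimal sketch. It assumes whitespace carries no meaning in the input; next_token is a hypothetical stand-in for yylex that takes the stream as a parameter so it can be tried outside a generated parser.

```c
#include <stdio.h>

/* Hypothetical stand-in for yylex(): returns the next significant
   character from in, or 0 at end of input. */
int next_token(FILE *in) {
    int ch;

    /* Skip newlines and other whitespace instead of passing them on. */
    do {
        ch = getc(in);
    } while (ch == '\n' || ch == '\r' || ch == ' ' || ch == '\t');

    /* The parser expects 0, not EOF (normally -1), as end of input. */
    return (ch == EOF) ? 0 : ch;
}
```

In the real program the body would go into yylex and read from stdin.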
Note that bison/yacc-generated parsers always parse the entire input stream, not just the longest sequence satisfying the grammar. That can be adjusted with some work -- see the documentation for the YYACCEPT special action -- but the standard behaviour is usually what is desired when parsing.
By the way, please use standard style conventions in your bison/yacc grammars, in order to avoid problems and in order to avoid confusing readers. Normally we reserve UPPER_CASE for terminal symbols, since those are also used as compile-time constants in the lexer. Non-terminals are usually written in lower_case although some prefer to use CamelCase. For the terminals, you need to avoid the use of names reserved by the standard library (such as EOF) or by (f)lex (BEGIN) or bison/yacc (END). There are lists of reserved names in the manuals.

Syntax error in lex yacc

Here is my lex/yacc code to parse an XML file and print the contents between the <XML> and </XML> tags.
LEX
%{
%}
%%
"<XML>" {return XMLSTART;}
"</XML>" {return XMLEND;}
[a-z]+ {yylval=strdup(yytext); return TEXT;}
"<" {yylval=strdup(yytext);return yytext[0];}
">" {yylval=strdup(yytext);return yytext[0];}
"\n" {yylval=strdup(yytext);return yytext[0];}
. {}
%%
YACC
%{
#include<stdio.h>
#include<stdlib.h>
#include<string.h>
#define YYSTYPE char *
%}
%token XMLSTART
%token XMLEND
%token TEXT
%%
program : XMLSTART '\n' A '\n' XMLEND {printf("%s",$3);
}
A : '<' TEXT '>' '\n' A '\n' '<' TEXT '>' { $$ = strcat($1,strcat($2,strcat($3,strcat($4,strcat($5,strcat($6,strcat($7,strcat($8,$9))))))));}
| TEXT
%%
#include"lex.yy.c"
I'm getting Syntax error, tried using ECHOs at some places but didn't find the error.
The input file I'm using is:
<XML>
<hello>
hi
<hello>
</XML>
Please help me figure out the error. I have relatively little experience with lex and yacc.
That grammar will only successfully parse a file which has XMLEND at the end. However, all text files end with a newline.
Although you could presumably fix that by adding a newline at the end of the start rule, it's almost always a bad idea to try to parse whitespace. In general, except for line-oriented languages -- which xml is not -- it is best to ignore whitespace.
Your use of strcat is incorrect. Quoting man strcat from a GNU/Linux system:
The strcat() function appends the src string to the dest string, overwriting the terminating null byte ('\0') at the end of dest, and then adds a terminating null byte. The strings may not overlap, and the dest string must have enough space for the result. If dest is not large enough, program behavior is unpredictable; buffer overruns are a favorite avenue for attacking secure programs.
You might want to use asprintf if it exists in your standard library.
Also, you never free() the strings produced by strdup, so all of them leak memory. In general, it's better to not set strdup tokens whose string representation is known -- particularly single-character tokens -- but the important thing is to keep track of the tokens whose string value has been freshly allocated. That would apply to semantic values produced with asprintf if the above suggestion is taken.
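If asprintf is not available, the same effect can be had with standard C by measuring first and then formatting into a buffer of the right size. This is a minimal sketch; concat3 is a hypothetical helper name, and the caller is assumed to take ownership of the result.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns a freshly allocated string holding a, b and c joined together,
   or NULL if allocation fails. The caller must free() the result.
   Hypothetical helper for illustration only. */
char *concat3(const char *a, const char *b, const char *c) {
    size_t len = strlen(a) + strlen(b) + strlen(c);
    char *out = malloc(len + 1);            /* +1 for the terminating '\0' */
    if (out == NULL) {
        return NULL;
    }
    /* The destination is exactly large enough, unlike the strcat calls. */
    snprintf(out, len + 1, "%s%s%s", a, b, c);
    return out;
}
```

In a semantic action this would be used along the lines of $$ = concat3("<", $2, ">"), followed by free($2) once the old value is no longer needed.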

Using precedence in Bison for unary minus doesn't solve shift/reduce conflict

I'm devising a very simple grammar, where I use the unary minus operator. However, I get a shift/reduce conflict. In the Bison manual, and everywhere else I look, it says that I should define a new token and give it higher precedence than the binary minus operator, and then use "%prec TOKEN" in the rule.
I've done that, but I still get the warning. Why?
I'm using bison (GNU Bison) 2.4.1. The grammar is shown below:
%{
#include <string>
extern "C" int yylex(void);
%}
%union {
std::string token;
}
%token <token> T_IDENTIFIER T_NUMBER
%token T_EQUAL T_LPAREN T_RPAREN
%right T_EQUAL
%left T_PLUS T_MINUS
%left T_MUL T_DIV
%left UNARY
%start program
%%
program : statements expr
;
statements : '\n'
| statements line
;
line : assignment
| expr
;
assignment : T_IDENTIFIER T_EQUAL expr
;
expr : T_NUMBER
| T_IDENTIFIER
| expr T_PLUS expr
| expr T_MINUS expr
| expr T_MUL expr
| expr T_DIV expr
| T_MINUS expr %prec UNARY
| T_LPAREN expr T_RPAREN
;
%prec doesn't do as much as you might hope here. It tells Bison that in a situation where you have - a * b you want to parse this as (- a) * b instead of - (a * b). In other words, here it will prefer the UNARY rule over the T_MUL rule. In either case, you can be certain that the UNARY rule will get applied eventually, and it is only a question of the order in which the input gets reduced to the unary argument.
In your grammar, things are very different. Any run of line non-terminals forms a valid statements sequence, and there is nothing to say that a line non-terminal must end at an end-of-line. In fact, any expression can be a line. So there are basically two ways to parse a - b: either as a single line with a binary minus, or as two “lines”, the second starting with a unary minus. There is nothing to decide which of these rules will apply, so the rule-based precedence won't work here yet.
The solution is to correct your line splitting by requiring every line to actually end with, or be followed by, an end-of-line symbol.
If you really want the behaviour your grammar indicates with respect to line endings, you'd need two separate non-terminals for expressions which can and which cannot start with a T_MINUS. You'd have to propagate this up the tree: the first line may start with a unary minus, but subsequent ones must not. Inside a parenthesis, starting with a minus would be all right again.
The expr rule is ok (without the %prec UNARY). Your shift/reduce conflict comes from the rule:
statements : '\n'
| statements line
;
The rule does not do what you think. For example, you can write:
a + b c + d
I think that is not supposed to be valid input.
But also the program rule is not very sane:
program : statements expr
;
The rules should be something like:
program: lines;
lines: line | lines line;
line: statement '\n' | '\n';
statement: assignment | expr;
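To see why ending every line with a newline removes the ambiguity, here is a small hand-written recursive-descent sketch in C. It is not bison output and only handles numbers; eval_line and the parse_* functions are hypothetical names. Unary minus is parsed at the factor level, so it binds tighter than any binary operator, and a line is accepted only if the expression is followed directly by '\n'.

```c
#include <stdlib.h>

/* Cursor into the line being parsed (hypothetical, for illustration). */
static const char *cur;

static void skip_blanks(void) {
    while (*cur == ' ' || *cur == '\t') cur++;
}

static double parse_expr(void);

/* factor: NUMBER | '-' factor | '(' expr ')' */
static double parse_factor(void) {
    skip_blanks();
    if (*cur == '-') {              /* unary minus */
        cur++;
        return -parse_factor();
    }
    if (*cur == '(') {
        cur++;
        double v = parse_expr();
        skip_blanks();
        if (*cur == ')') cur++;
        return v;
    }
    char *end;
    double v = strtod(cur, &end);
    cur = end;
    return v;
}

/* term: factor (('*' | '/') factor)* */
static double parse_term(void) {
    double v = parse_factor();
    for (;;) {
        skip_blanks();
        if (*cur == '*') { cur++; v *= parse_factor(); }
        else if (*cur == '/') { cur++; v /= parse_factor(); }
        else return v;
    }
}

/* expr: term (('+' | '-') term)* */
static double parse_expr(void) {
    double v = parse_term();
    for (;;) {
        skip_blanks();
        if (*cur == '+') { cur++; v += parse_term(); }
        else if (*cur == '-') { cur++; v -= parse_term(); }  /* binary minus */
        else return v;
    }
}

/* line: expr '\n'. Returns 1 and stores the value if the line is well
   formed, 0 otherwise. */
int eval_line(const char *text, double *out) {
    cur = text;
    double v = parse_expr();
    skip_blanks();
    if (*cur != '\n') return 0;
    *out = v;
    return 1;
}
```

With this structure "1 - 2\n" can only be a binary minus, and input such as "1 + 2 3 + 4\n" is rejected instead of being split silently into two expressions.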