Syntax error in lex yacc - yacc

Here is my lex/yacc code to parse an XML file and print the contents between the <XML> and </XML> tags.
LEX
%{
%}
%%
"<XML>" {return XMLSTART;}
"</XML>" {return XMLEND;}
[a-z]+ {yylval=strdup(yytext); return TEXT;}
"<" {yylval=strdup(yytext);return yytext[0];}
">" {yylval=strdup(yytext);return yytext[0];}
"\n" {yylval=strdup(yytext);return yytext[0];}
. {}
%%
YACC
%{
#include<stdio.h>
#include<stdlib.h>
#include<string.h>
#define YYSTYPE char *
%}
%token XMLSTART
%token XMLEND
%token TEXT
%%
program : XMLSTART '\n' A '\n' XMLEND {printf("%s",$3);
}
A : '<' TEXT '>' '\n' A '\n' '<' TEXT '>' { $$ = strcat($1,strcat($2,strcat($3,strcat($4,strcat($5,strcat($6,strcat($7,strcat($8,$9))))))));}
| TEXT
%%
#include"lex.yy.c"
I'm getting a syntax error. I tried adding ECHOs in some places but couldn't find the cause.
The input file I'm using is:
<XML>
<hello>
hi
<hello>
</XML>
Please help me figure out the error. I have relatively little experience with lex and yacc.

That grammar will only successfully parse a file which has XMLEND at the end. However, all text files end with a newline.
Although you could presumably fix that by adding a newline at the end of the start rule, it's almost always a bad idea to try to parse whitespace. In general, except for line-oriented languages -- which xml is not -- it is best to ignore whitespace.
Your use of strcat is incorrect. Quoting man strcat from a GNU/Linux system:
The strcat() function appends the src string to the dest string, overwriting the terminating null byte ('\0') at the end of dest, and then adds a terminating null byte. The strings may not overlap, and the dest string must have enough space for the result. If dest is not large enough, program behavior is unpredictable; buffer overruns are a favorite avenue for attacking secure programs.
You might want to use asprintf if it exists in your standard library.
Also, you never free() the strings produced by strdup, so all of them leak memory. In general, it's better not to strdup tokens whose string representation is already known -- particularly single-character tokens -- but the important thing is to keep track of the tokens whose string value has been freshly allocated. The same applies to semantic values produced with asprintf, if you take the above suggestion.

Related

warning: rule useless in parser due to conflicts

here CR is create
SP is space
RE is replace
I am getting the correct output for "create or replace" but not for just "create". Could anyone please tell me what is wrong with the code?
I am still getting this warning, and the parser does not work:
p.y:10.5-6: warning: rule useless in parser due to conflicts
%token CR TRI SP RE OR BEF AFT IOF INS UPD DEL ON OF
%%
s:e '\n' { printf("valid variable\n");f=1; };
e:TPR SP TRI;
TPR:CR
|CR SP OR SP RE;
It's rarely a good idea to pass whitespace to the parser. It only complicates the grammar, providing little or no additional value.
It is also always a good idea to adopt a single convention for the names of terminals and non-terminals. If you are going to use ALL CAPS for terminals (which is the normal convention), then don't use it also for non-terminals such as TPR. Also, the use of meaningful names and literal strings will make your grammar much more readable.
The "rule useless in parser due to conflicts" warning is always accompanied by one or more shift/reduce or reduce/reduce conflicts. Normally, the solution is to fix the conflicts. In this case, you could do so by simply not passing the whitespace to the parser.
Here is your grammar, I think: (I'm guessing what your abbreviations mean)
%token CR "create" OR "or" RE "replace"
%token TABLE_IDENTIFIER
%%
statement: expr '\n' { /* Some action */ }
expr: table_producer TABLE_IDENTIFIER
table_producer
: "create"
| "create" "or" "replace"
Written this way, without the whitespace, the grammar does not have any conflicts. If we reintroduce the whitespace:
%token CR "create" OR "or" RE "replace"
%token TABLE_IDENTIFIER SPACE
%%
statement: expr '\n' { /* Some action */ }
expr: table_producer SPACE TABLE_IDENTIFIER
table_producer
: "create"
| "create" SPACE "or" SPACE "replace"
then there is a shift/reduce conflict after create is recognized. The lookahead will be SPACE, but the parser cannot know whether that SPACE is part of the second table_producer production (create or...) or part of the expr production (create table_name).
There must be some punctuation between two words, otherwise they would be recognized by the lexer as a single-word. So the fact that the words are separated by whitespace is not meaningful; if the lexer simply keeps the whitespace to itself, as is normal, then the conflict disappears.

Bison Syntax Error easy file

I'm trying to run this .y file:
%{
#include <stdlib.h>
#include <stdio.h>
int yylex();
int yyerror();
%}
%start BEGIN
%%
BEGIN: 'a' | BEGIN 'a'
%%
int yylex(){
return getchar();
}
int yyerror(char* s){
fprintf(stderr, "*** ERROR: %s\n", s);
return 0;
}
int main(int argn, char **argv){
yyparse();
return 0;
}
It's a simple bison program and the syntax looks correct to me, but I always get a syntax error.
Thanks for your help.
The lexer function yylex needs to return 0 to indicate the end of the input. However, your implementation simply passes through the value returned by getchar, which at end of input is EOF (normally -1), not 0.
Also, your input is almost certain to include a newline character, which will also be passed through to the parser.
Since the parser recognizes neither \n nor EOF, it produces an error when it receives one of them.
At a minimum, you would need to modify yylex to correctly respond to end of input:
int yylex(void) {
int ch = getchar();
return (ch == EOF) ? 0 : ch;
}
But you will still have to deal with newline characters, either by handling them in your lexer (possibly ignoring them or possibly returning an end of input indication), or by handling them in your grammar.
Note that bison/yacc-generated parsers always parse the entire input stream, not just the longest sequence satisfying the grammar. That can be adjusted with some work -- see the documentation for the YYACCEPT special action -- but the standard behaviour is usually what is desired when parsing.
By the way, please use standard style conventions in your bison/yacc grammars, in order to avoid problems and in order to avoid confusing readers. Normally we reserve UPPER_CASE for terminal symbols, since those are also used as compile-time constants in the lexer. Non-terminals are usually written in lower_case although some prefer to use CamelCase. For the terminals, you need to avoid the use of names reserved by the standard library (such as EOF) or by (f)lex (BEGIN) or bison/yacc (END). There are lists of reserved names in the manuals.

Lex and yacc program to find palindrome string

Here are my lex and yacc files to recognise palindrome strings, but the program prints "INVALID" for both valid and invalid strings. Please help me find the problem; I am new to lex and yacc. Thanks in advance.
LEX file
%{
#include "y.tab.h"
%}
%%
a return A;
b return B;
. return *yytext;
%%
YACC file
%{
#include<stdio.h>
#include "lex.yy.c"
int i=0;
%}
%token A B
%%
S: pal '\n' {i=1;}
pal:
| A pal A {printf("my3");i=1;}
| B pal B {printf("my4");i=1;}
| A {printf("my1");i=1;}
| B {printf("my2");i=1;}
;
%%
int main()
{
printf("Enter Valid string\n");
yyparse();
if(i==1)
printf("Valid");
return 0;
}
int yyerror(char* s)
{
printf("Invalid\n");
return 0;
}
Example : entered string is : aba
expected output should be VALID but it is giving INVALID
It is impossible to solve this problem with plain LALR(1) Yacc.
Yacc is an LALR(1) parser generator. LALR refers to a class of grammars; a grammar is a mathematical tool for reasoning about parsing. The 1 in parentheses is the lookahead, that is, the maximum number of tokens the parser may consider before definitely deciding which of the alternative productions (or "rules") to follow. Remember, the parsing algorithm is one-pass; it can't backtrack and try another alternative as some regular expression engines do.
Concerning your palindrome problem, when the parser encounters 'a', it somehow has to pick the right choice among:
pal: A - 'a' alone is a valid palindrome all by itself, let's call it the inner core
pal: [A] pal A - outer layer, increasing nesting level
pal: A pal [A] - outer layer, decreasing nesting level
Making the right choice is impossible without infinite lookahead, but Yacc has only one token of lookahead.
The way Yacc handles this grammar is interesting as well.
If a grammar is ambiguous or not LR(1), the generated stack automaton is non-deterministic. There are some built-in tools to fix it.
The first tool is priorities and associativity to deal with operators in programming languages (not relevant here).
Another one is a quirk: by default Yacc prefers "shift" to "reduce". These two terms are technicalities referring to the internal operation of the parsing algorithm. Basically, tokens are shifted onto a stack. Once a group of tokens on top of the stack matches a rule, it is possible to "reduce" them, replacing the entire group with the single non-terminal from the left side of the rule.
Hence once we have 'a' at the top, we can either reduce it to a pal, or we can shift another token, assuming that a nested pal will emerge eventually. Yacc prefers the latter.
The reason for this preference? The same ambiguity arises in the if-then-else statement in most languages. Consider two nested if statements but only one else clause. Yacc attaches the else to the innermost if statement, which seems to be the right thing to do.
Besides, Yacc can generate a report highlighting issues in the grammar, such as the shift-reduce conflicts mentioned above.
Following up on the comments by @ChrisDod and @NickZavaritsky, here is a working version using bison's GLR parser.
%option noyywrap
%%
a return A;
b return B;
\n return '\n';
. {fprintf(stderr, "Error\n"); exit(1);}
%%
and Yacc / bison
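%{
#include <stdio.h>
int yylex(void);      /* defined in lex.yy.c, included below */
int yyerror(char* s); /* defined at the end of this file */
int i=0;
%}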
%token A B
%glr-parser
%%
S : pal '\n' {i=1; return 1 ;}
| error '\n' {i=0; return 1 ;}
pal: A pal A
| B pal B
| A
| B
|
;
%%
#include "lex.yy.c"
int main() {
yyparse();
if(i==1) printf("Valid\n");
else printf("inValid\n");
return 0;
}
int yyerror(char* s) { return 0; }
Some changes were made to the lexer: (1) a rule for \n was missing; (2) unknown characters are now fatal errors.
The error-recovery token error is used to catch the "invalid palindrome" cases.

Antlr generated Java doesn't match Antlr IDE

I have a grammar that accepts key / value pairs that appear one per line. The values may be multi-line.
The Eclipse plug-in ANTLR IDE works correctly and accepts a valid test string. However, the generated Java does not accept the same string.
Here is the grammar:
message: block4 ;
block4: STARTBLOCK '4' COLON expr4+ ENDBLOCK ;
expr4: NEWLINE (COLON key COLON expr | '-')+;
key: FIELDVALUE* ;
expr: FIELDVALUE* ;
NEWLINE : ('\n'|'\r') ;
FIELDVALUE : (~('-'|COLON|ENDBLOCK|STARTBLOCK))+;
COLON : ':' ;
STARTBLOCK : '{' ;
ENDBLOCK : '}' ;
ANTLR IDE parses this correctly. It divides up the key/expression pairs whether they are single-line values (like 23B / CRED) or multiline values (like 59 / /13212312\r\nRECEIVER NAME S.A\r\n).
Here is the input string:
{4:
:20:007505327853
:23B:CRED
:32A:050902JPY3520000,
:33B:JPY3520000,
:50K:EUROXXXEI
:52A:FEBXXXM1
:53A:MHCXXXJT
:54A:FOOBICXX
:59:/13212312
RECEIVER NAME S.A
:70:FUTURES
:71A:SHA
:71F:EUR12,00
:71F:EUR2,34
-}
When Eclipse runs antlr-3.4-complete.jar on the grammar, it generates SwiftTinyLexer.java and SwiftTinyParser.java. The lexer splits the input into 35 tokens, starting with:
STARTBLOCK
4
COLON
FIELDVALUE
COLON
I would like token 4 to be an expr4 rather than a FIELDVALUE (and the IDE seems to agree with me). But since it is a FIELDVALUE, the parser is choking on that token with line 1:3 required (...)+ loop did not match anything at input '\r\n'.
Why is there a difference between the way that antlr 3.4 and ANTLR IDE 2.1.2.201108281759 lex the same string?
Is there a way to fix the grammar so that it matches expr4 before it matches FIELDVALUE?
The IDE input string has a single \n while the Java test code is getting a Windows-style \r\n.
I changed NEWLINE to match one or more line-ending characters, that is, from
NEWLINE : ('\n'|'\r') ;
to
NEWLINE : ('\n'|'\r')+ ;
This allowed the parse to go forward without the lexical error, and now it makes sense why the IDE behaved differently from the generated Java: they were getting slightly different input strings.

Parsing Newlines, EOF as End-of-Statement Marker with ANTLR3

My question is in regards to running the following grammar in ANTLRWorks:
INT :('0'..'9')+;
SEMICOLON: ';';
NEWLINE: ('\r\n'|'\n'|'\r');
STMTEND: (SEMICOLON (NEWLINE)*|NEWLINE+);
statement
: STMTEND
| INT STMTEND
;
program: statement+;
I get the following results with the following input (with program as the start rule), regardless of which newline NL (CR/LF/CRLF) or integer I choose:
"; NL" or "32; NL" parses without error.
";" or "45;" (without newlines) result in EarlyExitException.
"NL" by itself parses without error.
"456 NL", without the semicolon, results in MismatchedTokenException.
What I want is for a statement to be terminated by a newline, semicolon, or semicolon followed by newline, and I want the parser to eat as many contiguous newlines as it can on a termination, so "; NL NL NL NL" is just one termination, not four or five. Also, I would like the end-of-file case to be a valid termination as well, but I don't know how to do that yet.
So what's wrong with this, and how can I make this terminate nicely at EOF? I'm completely new to all of parsing, ANTLR, and EBNF, and I haven't found much material to read on it at a level somewhere in between the simple calculator example and the reference (I have The Definitive ANTLR Reference, but it really is a reference, with a quick start in the front which I haven't yet got to run outside of ANTLRWorks), so any reading suggestions (besides Wirth's 1977 ACM paper) would be helpful too. Thanks!
In case of input like ";" or "45;", the token STMTEND will never be created.
";" will create a single token: SEMICOLON, and "45;" will produce: INT SEMICOLON.
What you (probably) want is that SEMICOLON and NEWLINE never make it to real tokens themselves, but they will always be a STMTEND. You can do that by making them so called "fragment" rules:
program: statement+;
statement
: STMTEND
| INT STMTEND
;
INT : '0'..'9'+;
STMTEND : SEMICOLON NEWLINE* | NEWLINE+;
fragment SEMICOLON : ';';
fragment NEWLINE : '\r' '\n' | '\n' | '\r';
Fragment rules are only available for other lexer rules, so they will never end up in parser (production) rules. To emphasize: the grammar above will only ever create either INT or STMTEND tokens.