Lex and Yacc and EBNF specification - yacc

If I just want to check whether the syntax of a language is correct,
what is the easiest way to write a syntax analyzer using yacc?

Note that the ISO standard for EBNF is ISO 14977:1996 and the 'EBNF' you've used in the question bears limited resemblance to the standard version. That leaves us having to interpret your grammar rule.
Non-terminals are written as single words in all lower-case.
Terminals are written as single words in all upper-case.
Colon is used to separate a non-terminal from its definition.
Dot (period) is used to mark the end of a rule.
Square brackets enclose optional (zero or once) material.
With those definitions in mind, you need:
A lexical analyzer that recognizes DECLARATION, OF, CONST, VAR, END as terminals (keywords).
A grammar that contains rules for declaration_unit, ident, const_declaration, var_declaration, procedure_interface, function_interface.
Given:
%token DECLARATION
%token OF
%token CONST
%token VAR
%token END
%%
declaration_unit
: DECLARATION OF ident opt_const_declaration opt_var_declaration
opt_procedure_interface opt_function_interface DECLARATION END
;
opt_const_declaration
: /* Nothing */
| CONST const_declaration
;
opt_var_declaration
: /* Nothing */
| VAR var_declaration
;
opt_procedure_interface
: /* Nothing */
| procedure_interface
;
opt_function_interface
: /* Nothing */
| function_interface
;
You now just have to fill in the rules for ident, const_declaration, var_declaration, procedure_interface, function_interface.
For simple syntax checking, you could add placeholder tokens and rules for the parts of the grammar that you've not yet fully defined. For example, you might add:
%token IDENT
%token CONST_DECLARATION
%token VAR_DECLARATION
%token PROCEDURE_INTERFACE
%token FUNCTION_INTERFACE
and
ident
: IDENT
;
const_declaration
: CONST_DECLARATION
;
var_declaration
: VAR_DECLARATION
;
procedure_interface
: PROCEDURE_INTERFACE
;
function_interface
: FUNCTION_INTERFACE
;
Your lexical analyzer simply needs to be able to recognize those dummy tokens reliably until you provide the correct rules.
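As a rough illustration of the division of labour described above, here is a minimal sketch in Python rather than lex/yacc. It assumes the keywords and dummy tokens appear in the input spelled exactly like their token names, and that anything else that looks like a name is an IDENT. The helper names tokenize and is_declaration_unit are made up for this example.

```python
import re

KEYWORDS = {"DECLARATION", "OF", "CONST", "VAR", "END"}
# Assumption: the dummy tokens are spelled like their token names in the input.
PLACEHOLDERS = {"CONST_DECLARATION", "VAR_DECLARATION",
                "PROCEDURE_INTERFACE", "FUNCTION_INTERFACE"}

def tokenize(text):
    """Return a list of token names; whitespace is dropped by the lexer."""
    tokens = []
    for word in re.findall(r"\S+", text):
        if word in KEYWORDS or word in PLACEHOLDERS:
            tokens.append(word)
        elif re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", word):
            tokens.append("IDENT")
        else:
            raise ValueError(f"unexpected input: {word!r}")
    return tokens

def is_declaration_unit(tokens):
    """Mirror the declaration_unit rule; each opt_* part may be absent."""
    pos = 0

    def accept(*names):
        # Consume the given token names if they come next, else do nothing.
        nonlocal pos
        if tokens[pos:pos + len(names)] == list(names):
            pos += len(names)
            return True
        return False

    if not accept("DECLARATION", "OF", "IDENT"):
        return False
    accept("CONST", "CONST_DECLARATION")   # opt_const_declaration
    accept("VAR", "VAR_DECLARATION")       # opt_var_declaration
    accept("PROCEDURE_INTERFACE")          # opt_procedure_interface
    accept("FUNCTION_INTERFACE")           # opt_function_interface
    return accept("DECLARATION", "END") and pos == len(tokens)
```

For a pure syntax check this yes/no answer is all that is needed. The yacc grammar does the same job once yyparse() returns zero.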

Related

warning: rule useless in parser due to conflicts

Here CR is create,
SP is space, and
RE is replace.
I am getting the correct output for "create or replace" but not for just "create". Could anyone please tell me what is wrong with the code?
I am still getting this warning, and the parser does not work:
p.y:10.5-6: warning: rule useless in parser due to conflicts
%token CR TRI SP RE OR BEF AFT IOF INS UPD DEL ON OF
%%
s:e '\n' { printf("valid variable\n");f=1; };
e:TPR SP TRI;
TPR:CR
|CR SP OR SP RE;
It's rarely a good idea to pass whitespace to the parser. It only complicates the grammar, providing little or no additional value.
It is also always a good idea to adopt a single convention for the names of terminals and non-terminals. If you are going to use ALL CAPS for terminals (which is the normal convention), then don't use it also for non-terminals such as TPR. Also, the use of meaningful names and literal strings will make your grammar much more readable.
The "rule useless in parser due to conflicts" warning is always accompanied by one or more shift/reduce or reduce/reduce conflicts. Normally, the solution is to fix the conflicts. In this case, you could do so by simply not passing the whitespace to the parser.
Here is your grammar, I think: (I'm guessing what your abbreviations mean)
%token CR "create" OR "or" RE "replace"
%token TABLE_IDENTIFIER
%%
statement: expr '\n' { /* Some action */ }
expr: table_producer TABLE_IDENTIFIER
table_producer
: "create"
| "create" "or" "replace"
Written this way, without the whitespace, the grammar does not have any conflicts. If we reintroduce the whitespace:
%token CR "create" OR "or" RE "replace"
%token TABLE_IDENTIFIER SPACE
%%
statement: expr '\n' { /* Some action */ }
expr: table_producer SPACE TABLE_IDENTIFIER
table_producer
: "create"
| "create" SPACE "or" SPACE "replace"
then there is a shift/reduce conflict after create is recognized. The lookahead will be SPACE, but the parser cannot know whether that SPACE is part of the second table_producer production (create or...) or part of the expr production (create table_name).
There must be some punctuation between two words, otherwise the lexer would recognize them as a single word. So the fact that the words are separated by whitespace is not meaningful; if the lexer simply keeps the whitespace to itself, as is normal, then the conflict disappears.
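The effect of keeping whitespace in the lexer can be sketched in a few lines of Python (an illustration only, not the yacc code). The token names follow the grammar above; the function names tokenize and parse_statement are invented for the example. Once whitespace is gone, the token after "create" is either OR or TABLE_IDENTIFIER, so one token of lookahead is enough to choose the production.

```python
def tokenize(text):
    """Split on whitespace and discard it, as a typical lexer would."""
    names = {"create": "CR", "or": "OR", "replace": "RE"}
    # Assumption: every other word is treated as a TABLE_IDENTIFIER.
    return [names.get(word, "TABLE_IDENTIFIER") for word in text.split()]

def parse_statement(tokens):
    """expr: table_producer TABLE_IDENTIFIER. Returns the producer or None."""
    pos = 0
    if tokens[pos:pos + 1] != ["CR"]:
        return None
    pos += 1
    replace = False
    # No SPACE token is in the way, so the next token alone tells us
    # whether the longer table_producer alternative applies.
    if tokens[pos:pos + 2] == ["OR", "RE"]:
        replace = True
        pos += 2
    if tokens[pos:] != ["TABLE_IDENTIFIER"]:
        return None
    return "create or replace" if replace else "create"
```

With this arrangement both "create t" and "create   or replace t" are accepted, regardless of how much whitespace separates the words.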

How do I ignore arbitrary stuff inside braces in ANTLR?

I am trying to write a config file grammar and get ANTLR4 to handle it. I am quite new to ANTLR (this is my first project with it).
Largely, I understand what needs to be done (or at least I think I do) for most of the config file grammar, but the files that I will be reading will have arbitrary C code inside of curly braces. Here is an example:
#DEVICE: servo "servos are great"
#ACTION: turnRight "turning right is fun"
{
arbitrary C source code goes here;
some more arbitrary C source code;
}
#ACTION: secondAction "this is another action"
{
some more code;
}
And it could be many of those. I can't seem to get it to understand that I want to just ignore (without skipping) the source code. Here is my grammar so far:
/**
ANTLR4 grammar for practicing
*/
grammar practice;
file: (devconfig)*
;
devconfig: devid (action)+
;
devid: DEV_HDR (COMMENT)?
;
action: ACTN_HDR '{' C_BLOCK '}'
;
DEV_HDR: '#DEVICE: ' ALPHA+(IDCHAR)*
;
fragment
ALPHA: [a-zA-Z]
;
fragment
IDCHAR: ALPHA
| [0-9]
| '_'
;
COMMENT: '"' .*? '"'
;
ACTN_HDR: '#ACTION: ' ACTION_ID
;
fragment
ACTION_ID: ALPHA+(IDCHAR)*
;
C_BLOCK: WHAT DO I PUT HERE?? -> channel(HIDDEN)
;
WS: [ \t\n\r]+ -> skip
;
The problem is that whatever I put in the C_BLOCK lexer rule seems to break the whole thing. For example, if I put .*? -> channel(HIDDEN), it doesn't work at all (and ANTLR reports an error on the grammar to the tune of ".*? can match the empty string"). What should I put there instead, so that it ignores the C code but still lets me access it later (i.e., without skipping it)?
Your C_BLOCK rule can be defined just like the usual multi-line comment rule in so many languages. Make the curly braces part of the lexer rule too:
C_BLOCK: '{' .*? '}' -> channel(HIDDEN);
If you need to nest blocks, you could write something like:
C_BLOCK: '{' .*? C_BLOCK? .*? '}' -> channel(HIDDEN);
or maybe:
C_BLOCK:
'{' (
C_BLOCK
| .
)*?
'}'
;
(untested).
Update: changed code to use the non-greedy kleene operator as suggested by a comment.
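To make the nesting idea concrete, here is a small Python sketch (not ANTLR, and not part of the answer above) that finds the end of a brace-delimited block by counting depth. The function name extract_block is made up for this illustration, and the sketch assumes braces never occur inside C string literals, character constants, or comments, which real C code can violate.

```python
def extract_block(text, start):
    """Return (body, end) for the brace block that opens at text[start].

    body is the text between the outermost braces; end is the index just
    past the matching closing brace.
    """
    if text[start] != "{":
        raise ValueError("expected '{'")
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "{":
            depth += 1          # entering a (possibly nested) block
        elif text[i] == "}":
            depth -= 1          # leaving a block
            if depth == 0:
                return text[start + 1:i], i + 1
    raise ValueError("unbalanced braces")
```

A non-nesting rule such as '{' .*? '}' corresponds to stopping at the first closing brace, which is why it cuts nested blocks short.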

Lex and yacc program to find palindrome string

Here are my lex and yacc files to recognise palindrome strings, but the program prints "INVALID" for both valid and invalid strings. Please help me find the problem; I am new to lex and yacc. Thanks in advance.
LEX file
%{
#include "y.tab.h"
%}
%%
a return A;
b return B;
. return *yytext;
%%
YACC file
%{
#include<stdio.h>
#include "lex.yy.c"
int i=0;
%}
%token A B
%%
S: pal '\n' {i=1;}
pal:
| A pal A {printf("my3");i=1;}
| B pal B {printf("my4");i=1;}
| A {printf("my1");i=1;}
| B {printf("my2");i=1;}
;
%%
int main()
{
printf("Enter Valid string\n");
yyparse();
if(i==1)
printf("Valid");
return 0;
}
int yyerror(char* s)
{
printf("Invalid\n");
return 0;
}
Example : entered string is : aba
expected output should be VALID but it is giving INVALID
It is impossible to solve this problem with a deterministic LALR(1) parser such as the one Yacc generates.
Yacc is an LALR(1) parser generator. LALR names a class of grammars, and a grammar is a mathematical tool for reasoning about parsing. The 1 in parentheses is the lookahead: the maximum number of tokens the parser may examine before committing to one of the alternative productions (or "rules"). Remember that the parsing algorithm is single-pass; it cannot backtrack and try another alternative, as some regular expression engines do.
Concerning your palindrome problem: when the parser encounters 'a', it somehow has to pick the right choice among:
pal: A - 'a' alone is a valid palindrome all by itself; let's call it the inner core
pal: [A] pal A - outer layer, increasing nesting level
pal: A pal [A] - outer layer, decreasing nesting level
Making the right choice is impossible without infinite lookahead, but Yacc has only one token of lookahead.
The way Yacc handles this grammar is interesting as well.
If a grammar is ambiguous or not LR(1), the generated stack automaton is non-deterministic. There are some built-in tools to fix it.
The first tool is priorities and associativity to deal with operators in programming languages (not relevant here).
Another one is a quirk: by default Yacc prefers "shift" to "reduce". These two terms refer to the internal operation of the parsing algorithm. Basically, tokens are shifted onto a stack. Once a group of tokens on top of the stack matches a rule, it is possible to reduce them, replacing the entire group with the single non-terminal from the left side of the rule.
Hence, once we have 'a' on top, we can either reduce it to a pal, or shift another token in, assuming that a nested pal will emerge eventually. Yacc prefers the latter.
The reason for this preference? The same ambiguity arises in the if-then-else statement in most languages. Consider two nested if statements but only one else clause. Yacc attaches the else to the innermost if statement, which seems to be the right thing to do.
Besides, Yacc can generate a report highlighting issues in the grammar, such as the shift/reduce conflicts mentioned above.
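To see what "trying every alternative" buys, here is a hedged Python sketch of a backtracking recognizer for the same pal grammar. It is not how Yacc works; it only shows that the language is easy to recognize once the parser is allowed to explore all the choices listed above instead of committing after one token. The names match_pal and is_palindrome are invented for the example.

```python
def match_pal(s, pos=0):
    """Yield every position at which `pal` can end when it starts at pos."""
    # pal: /* empty */
    yield pos
    if pos < len(s) and s[pos] in "ab":
        c = s[pos]
        # pal: A | B   (the single-letter "inner core")
        yield pos + 1
        # pal: A pal A | B pal B   (an outer layer around a nested pal)
        for mid in match_pal(s, pos + 1):
            if mid < len(s) and s[mid] == c:
                yield mid + 1

def is_palindrome(s):
    """True if the whole string can be derived from `pal`."""
    return any(end == len(s) for end in match_pal(s))
```

The recognizer keeps all candidate end positions alive, whereas the deterministic parser must pick between shifting and reducing as soon as it sees the next token.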
Following the comments from #ChrisDod and #NickZavaritsky, here is a working version using bison's GLR parser.
%option noyywrap
%%
a return A;
b return B;
\n return '\n';
. {fprintf(stderr, "Error\n"); exit(1);}
%%
and Yacc / bison
%{
#include <stdio.h>
int i=0;
%}
%token A B
%glr-parser
%%
S : pal '\n' {i=1; return 1 ;}
| error '\n' {i=0; return 1 ;}
pal: A pal A
| B pal B
| A
| B
|
;
%%
#include "lex.yy.c"
int main() {
yyparse();
if(i==1) printf("Valid\n");
else printf("inValid\n");
return 0;
}
int yyerror(char* s) { return 0; }
Some changes were made to the lexer: (1) the rule for \n was missing; (2) unknown characters are now fatal errors.
The error recovery token error is used to catch the "invalid palindrome" cases.

Syntax error in lex yacc

Here is my lex/yacc code to parse an XML file and print the contents between the <XML> and </XML> tags.
LEX
%{
%}
%%
"<XML>" {return XMLSTART;}
"</XML>" {return XMLEND;}
[a-z]+ {yylval=strdup(yytext); return TEXT;}
"<" {yylval=strdup(yytext);return yytext[0];}
">" {yylval=strdup(yytext);return yytext[0];}
"\n" {yylval=strdup(yytext);return yytext[0];}
. {}
%%
YACC
%{
#include<stdio.h>
#include<stdlib.h>
#include<string.h>
#define YYSTYPE char *
%}
%token XMLSTART
%token XMLEND
%token TEXT
%%
program : XMLSTART '\n' A '\n' XMLEND {printf("%s",$3);
}
A : '<' TEXT '>' '\n' A '\n' '<' TEXT '>' { $$ = strcat($1,strcat($2,strcat($3,strcat($4,strcat($5,strcat($6,strcat($7,strcat($8,$9))))))));}
| TEXT
%%
#include"lex.yy.c"
I'm getting a syntax error. I tried using ECHO in some places but couldn't find the cause.
The input file I'm using is:
<XML>
<hello>
hi
<hello>
</XML>
Please help me figure out the error. I have relatively little experience using lex and yacc.
That grammar will only successfully parse a file which has XMLEND at the end. However, all text files end with a newline.
Although you could presumably fix that by adding a newline at the end of the start rule, it's almost always a bad idea to try to parse whitespace. In general, except for line-oriented languages -- which xml is not -- it is best to ignore whitespace.
Your use of strcat is incorrect. Quoting man strcat from a GNU/Linux system:
The strcat() function appends the src string to the dest string, overwriting the terminating null byte ('\0') at the end of dest, and then adds a terminating null byte. The strings may not overlap, and the dest string must have enough space for the result. If dest is not large enough, program behavior is unpredictable; buffer overruns are a favorite avenue for attacking secure programs.
You might want to use asprintf if it exists in your standard library.
Also, you never free() the strings produced by strdup, so all of them leak memory. In general, it's better not to strdup tokens whose string representation is known (particularly single-character tokens), but the important thing is to keep track of the tokens whose string value has been freshly allocated. The same applies to semantic values produced with asprintf if the above suggestion is taken.

Using precedence in Bison for unary minus doesn't solve shift/reduce conflict

I'm devising a very simple grammar, where I use the unary minus operator. However, I get a shift/reduce conflict. In the Bison manual, and everywhere else I look, it says that I should define a new token and give it higher precedence than the binary minus operator, and then use "%prec TOKEN" in the rule.
I've done that, but I still get the warning. Why?
I'm using bison (GNU Bison) 2.4.1. The grammar is shown below:
%{
#include <string>
extern "C" int yylex(void);
%}
%union {
std::string token;
}
%token <token> T_IDENTIFIER T_NUMBER
%token T_EQUAL T_LPAREN T_RPAREN
%right T_EQUAL
%left T_PLUS T_MINUS
%left T_MUL T_DIV
%left UNARY
%start program
%%
program : statements expr
;
statements : '\n'
| statements line
;
line : assignment
| expr
;
assignment : T_IDENTIFIER T_EQUAL expr
;
expr : T_NUMBER
| T_IDENTIFIER
| expr T_PLUS expr
| expr T_MINUS expr
| expr T_MUL expr
| expr T_DIV expr
| T_MINUS expr %prec UNARY
| T_LPAREN expr T_RPAREN
;
%prec doesn't do as much as you might hope here. It tells Bison that in a situation where you have - a * b you want to parse this as (- a) * b instead of - (a * b). In other words, here it will prefer the UNARY rule over the T_MUL rule. In either case, you can be certain that the UNARY rule will get applied eventually, and it is only a question of the order in which the input gets reduced to the unary argument.
In your grammar, things are very different. Any sequence of line non-terminals will make up a sequence, and there is nothing to say that a line non-terminal must end at an end-of-line. In fact, any expression can be a line. So there are basically two ways to parse a - b: either as a single line with a binary minus, or as two "lines", the second starting with a unary minus. There is nothing to decide which of these rules will apply, so the rule-based precedence won't work here yet.
The solution is to correct your line splitting by requiring every line to actually end with, or be followed by, an end-of-line symbol.
If you really want the behaviour your grammar indicates with respect to line endings, you'd need two separate non-terminals for expressions which can and which cannot start with a T_MINUS. You'd have to propagate this up the tree: the first line may start with a unary minus, but subsequent ones must not. Inside a parenthesis, starting with a minus would be all right again.
The expr rule is ok (without the %prec UNARY). Your shift/reduce conflict comes from the rule:
statements : '\n'
| statements line
;
The rule does not do what you think. For example, you can write:
a + b c + d
I think that is not supposed to be valid input.
But also the program rule is not very sane:
program : statements expr
;
The rules should be something like:
program: lines;
lines: line | lines line;
line: statement '\n' | '\n';
statement: assignment | expr;