How to define a default rule in EBNF/TatSu?

I have a problem with my EBNF grammar and its TatSu implementation.
Here is an extract of the EBNF grammar for TatSu:
define = '#define' constantename [constante] ;
constante = CONSTANTE ;
CONSTANTE = ( Any | `true` ) ;
Any = /.*/ ;
constantename = /[A-Z0-9_()]*/ ;
When I test with:
#define _TEST01_ "test01"
#define _TEST_
#define _TEST02_ "test02"
I get:
[
"#define",
"_TEST01_",
"\"test01\""
],
[
"#define",
"_TEST_",
"#define _TEST02_ \"test02\""
]
But I want this:
[
"#define",
"_TEST01_",
"\"test01\""
],
[
"#define",
"_TEST_",
"true"
],
[
"#define",
"_TEST02_",
"\"test02\""
]
Where is my mistake?
Thanks a lot...

The problem is that TatSu skips whitespace, including newlines, between elements by default. So when you apply the rule '#define' constantename [constante] to the input:
#define _TEST_
#define _TEST02_ "test02"
It first matches #define with '#define', then skips the space, then matches _TEST_ with constantename, then skips the newline, and then matches #define _TEST02_ "test02" with Any (via constante).
Note that that's exactly the behaviour you'd want (I assume) if the newline weren't there:
#define _TEST_ #define _TEST02_ "test02"
Here you'd want the output ["#define", "_TEST_", "#define _TEST02_ \"test02\""], right? At least the C preprocessor would handle it the same way in that case.
So what that tells us is that the newline is significant; therefore you can't ignore it. You can tell TatSu to ignore only tabs and spaces (not newlines), either by passing whitespace = '\t ' as an option when creating the parser, or by adding this line to the grammar:
##whitespace :: /[\t ]+/
Now you'll need to explicitly mention newlines anywhere where newlines should go, so your rule becomes:
define = '#define' constantename [constante] '\n' ;
Now it's clear that the constant, if present, should appear before the line break, so for the line #define _TEST_, it would realize that there is no constant.
Note that you'll also want a rule to match empty lines, so empty lines aren't syntax errors.
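Putting those pieces together, the whole grammar might look something like this. This is a sketch, untested: the file/line rules and the tightened Any pattern are my additions. In particular, Any is changed from /.*/ to /[^\n]+/ so that an empty match can't shadow the `true` default; when there is nothing before the newline, the Any alternative fails and the `true` constant is produced instead.

```
##whitespace :: /[\t ]+/

file = { line } $ ;
line = define | '\n' ;
define = '#define' constantename [constante] '\n' ;
constante = CONSTANTE ;
CONSTANTE = ( Any | `true` ) ;
Any = /[^\n]+/ ;
constantename = /[A-Z0-9_()]*/ ;
```

The line rule also accepts a bare '\n', which covers the empty-line case mentioned above.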

Related

Fail to continue parsing after correct input

I have two input numbers separated by ','.
The program works fine for the first try, but the second try always ends with an error.
How do I keep parsing?
lex file snippet:
#include "y.tab.h"
%%
[0-9]+ { yylval = atoi(yytext); return NUMBER; }
. return yytext[0];
%%
yacc file snippet:
%{
#include <stdio.h>
int yylex();
int yyerror();
%}
%start s
%token NUMBER
%%
s: NUMBER ',' NUMBER{
if(($1 % 3 == 0) && ($3 % 2 == 0)) {printf("OK");}
else{printf("NOT OK, try again.");}
};
%%
int main(){ return yyparse(); }
int yyerror() { printf("Error Occured.\n"); return 0; }
output snippet:
benjamin#benjamin-VirtualBox:~$ ./ex1
15,4
OK
15,4
Error Occured.
Your start rule (indeed, your only rule) is:
s: NUMBER ',' NUMBER
That means that an input consists of a NUMBER, a ',' and another NUMBER.
That's it. After the parser encounters those three things, it expects an end of input indicator, because that's what you've told it a complete input looks like.
If you want to accept multiple lines, each consisting of two numbers separated by a comma, you'll need to write a grammar which describes that input. (And in order to describe the fact that they are lines, you'll probably want to make a newline character a token. Right now, it falls through to the scanner's default rule, because in (f)lex . doesn't match a newline character.) You'll also probably want to include an error production so that your parser doesn't suddenly terminate on the first error.
Alternatively, you could parse your input one line at a time by reading the lines yourself, perhaps using fgets or the POSIX-standard getline function, and then passing each line to your scanner using yy_scan_string.
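For the first approach, such a grammar might look like the sketch below (untested). It assumes the scanner is extended with a rule like \n { return '\n'; } so the parser sees line boundaries; the error production resynchronizes at the end of the bad line.

```
%start input
%token NUMBER
%%
input: /* empty */
     | input line
     ;
line: NUMBER ',' NUMBER '\n' {
          if (($1 % 3 == 0) && ($3 % 2 == 0)) printf("OK\n");
          else printf("NOT OK, try again.\n");
      }
    | '\n'              /* ignore blank lines */
    | error '\n'        { yyerrok; }  /* recover and continue at next line */
    ;
```

With this shape, a single call to yyparse keeps reading pairs of numbers until end of input.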

Simple calculator program in lex and yacc not giving output

I am trying to write a very simple calculator program using lex and yacc but getting stuck in printing the output. The files are:
calc.l:
%{
#include "y.tab.h"
extern int yylval;
%}
%%
[0-9]+ {yylval = atoi(yytext); return NUMBER;}
[ \t] ;
\n return 0;
. return yytext[0];
%%
calc.y:
%{
#include <stdio.h>
void yyerror(char const *s) {
fprintf(stderr, "%s\n", s);
}
%}
%token NAME NUMBER
%%
statement: NAME '=' expression
| expression {printf(" =%d\n", $1);}
;
expression: expression '+' NUMBER {$$ = $1 + $3;}
| expression '-' NUMBER {$$ = $1 - $3;}
| NUMBER {$$ = $1;}
;
The commands I have used:
flex calc.l
bison calc.y -d
gcc lex.yy.c calc.tab.c -lfl
./a.out
After running the last command, the program takes input from the keyboard but does not print anything; it simply terminates. I didn't get any warnings or errors while compiling, but it doesn't give any output. Please help.
You have no definition of main, so the main function in -lfl will be used. That library is for flex programs, and its main function will call yylex -- the lexical scanner -- until it returns 0.
You need to call the parser. Furthermore, you need to call it repeatedly, because your lexical scanner returns 0, indicating end of input, every time it reads a newline.
So you might use something like this:
int main(void) {
do {
yyparse();
} while (!feof(stdin));
return 0;
}
However, that will reveal some other problems. Most irritatingly, your grammar will not accept an empty input, so an empty line will trigger a syntax error. That will certainly happen at the end of the input, because the EOF will cause yylex to return 0 immediately, which is indistinguishable from an empty line.
Also, any error encountered during the parse will cause the parse to terminate immediately, leaving the remainder of the input line unread.
On the whole, it is often better for the scanner to return a newline token (or \n) for newline characters.
Other than the main function which you don't require, the only thing in -lfl is a default definition of yywrap. You could just define this function yourself (it only needs to return 1), or you could avoid the need for the function by adding
%option noyywrap
to your flex file. In fact, I usually recommend
%option noyywrap noinput nounput
which will avoid the compiler warnings (which you didn't see because you didn't supply -Wall when you compiled the program, which you should do).
Another compiler warning will be avoided by adding a declaration of yylex to your bison input file before the definition of yyerror:
int yylex(void);
Finally, yylval is declared in y.tab.h, so there is no need for extern int yylval; in your flex file. In this case, it doesn't hurt, but if you change the type of the semantic value, which you will probably eventually want to do, this line will need to be changed as well. Better to just eliminate it.
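Putting all of those suggestions together, the two files might look something like this. This is a sketch, untested; I've followed the newline-token approach (so a single yyparse call suffices and no main loop is needed), used calc.tab.h as the header name produced by the bison calc.y -d command above, and dropped the unused NAME alternative, since the scanner never returns NAME.

calc.l:

```
%option noyywrap noinput nounput
%{
#include "calc.tab.h"
%}
%%
[0-9]+  { yylval = atoi(yytext); return NUMBER; }
[ \t]   ;
\n      { return '\n'; }
.       { return yytext[0]; }
```

calc.y:

```
%{
#include <stdio.h>
int yylex(void);
void yyerror(char const *s) { fprintf(stderr, "%s\n", s); }
%}
%token NUMBER
%%
input: /* empty */
     | input line
     ;
line: expression '\n'  { printf(" =%d\n", $1); }
    | '\n'             /* blank line: not an error */
    | error '\n'       { yyerrok; }  /* recover at end of bad line */
    ;
expression: expression '+' NUMBER { $$ = $1 + $3; }
    | expression '-' NUMBER       { $$ = $1 - $3; }
    | NUMBER
    ;
%%
int main(void) { return yyparse(); }
```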

Need help to understand a small piece of a lex/flex hello world example

I'm a beginner of lex/flex and yacc. I'm now reading a book which gives a hello world example of lex/flex input file, to implement a simple calculator lexer.
The code is here:
%{
#include <stdio.h>
#include "y.tab.h"
int
yywrap(void)
{
return 1;
}
%}
%%
"+" return ADD;
"-" return SUB;
"*" return MUL;
"/" return DIV;
"\n" return CR;
([1-9][0-9]*)|0|([0-9]+\.[0-9]+) {
double temp;
sscanf(yytext,"%lf",&temp);
yylval.double_value=temp;
return DOUBLE_LITERAL;
}
[ \t] ;
. {
fprintf(stderr, "lexical error.\n");
exit(1);
}
%%
I don't quite understand what the line [ \t] ; does here. Could anybody help me? thx.
The brackets indicate a "character class." Any character that appears within the brackets is considered a match. Here we have two characters, space and horizontal tab (\t). These characters are often called "whitespace."
The bare semicolon says "do nothing."
So the rule says, "whenever you see either a space or a tab (a whitespace character), do nothing and get the next character."
Since the input to the lexer might have multiple whitespace characters repeated together, this lexer rule could be applied multiple times. As a simplification, it is common to see a quantifier like + (1 or more) or * (zero or more) after the character class. This rule means, "whenever you see one or more whitespace characters, do nothing and get the next character."
[ \t]+ ;

How do I handle newlines in a Bison Grammar, without allowing all characters?

I've gone right back to basics to try and understand how the parser can match an input line such as "asdf", or any other jumble of characters, where there is no rule defined for this.
My lexer:
%{
#include
%}
%%
"\n" {return NEWLINE; }
My Parser:
%{
#include <stdlib.h>
%}
%token NEWLINE
%%
program:
| program line
;
line: NEWLINE
;
%%
#include <stdio.h>
int yyerror(char *s)
{
printf("%s\n", s);
return(0);
}
int main(void)
{
yyparse();
exit(0);
}
It is my understanding that this, when compiled and run, should accept nothing more than blank lines, but it also allows any string to be input without a syntax error.
What am I missing?
Thanks
Currently, your lexer echoes and ignores all non-newline characters (that's the default action in lex for characters that don't match any rule), so the parser will only ever see newlines.
In general, your lexer needs to do something with any/every possible input character. It can ignore them (silently or with a message), or return tokens for the parser. The usual approach is to have the last lexer rule be:
. return *yytext;
which matches any single character (other than a newline) and sends it on to the parser as-is. This is the last rule, so that any earlier rule that matches a single character takes precedence.
This is completely independent of the parser, which only sees that part of the input the lexer gives it.
Your lexer is relying on flex's default rule. Add the option nodefault so that flex flags input your rules don't cover. Your lexer will then look like this instead:
%option nodefault
%{
#include <stdlib.h>
%}
%%
"\n" {return NEWLINE; }

Is it possible to override rebol path operator?

It is possible to override REBOL system words like print, make, etc., so is it possible to do the same with the path operator? If so, what's the syntax?
Another possible approach is to use REBOL's metaprogramming capabilities and preprocess your own code to catch path accesses and add your handler code. Here's an example:
apply-my-rule: func [spec [block!] /local value][
print [
"-- path access --" newline
"object:" mold spec/1 newline
"member:" mold spec/2 newline
"value:" mold set/any 'value get in get spec/1 spec/2 newline
"--"
]
:value
]
my-do: func [code [block!] /local rule pos][
parse code rule: [
any [
pos: path! (
pos: either object? get pos/1/1 [
change/part pos reduce ['apply-my-rule to-block pos/1] 1
][
next pos
]
) :pos
| into rule ;-- dive into nested blocks
| skip ;-- skip every other values
]
]
do code
]
;-- example usage --
obj: make object! [
a: 5
]
my-do [
print mold obj/a
]
This will give you:
-- path access --
object: obj
member: a
value: 5
--
5
Another (slower but more flexible) approach could be to pass your code in string mode to the preprocessor, freeing yourself from any REBOL-specific syntax rules, as in:
my-alternative-do {
print mold obj..a
}
The preprocessor code would then spot all the .. places, change the code to properly insert calls to 'apply-my-rule, and in the end run the code with:
do load code
There's no real limits on how far you can process and change your whole code at runtime (the so-called "block mode" of the first example being the most efficient way).
You mean replace (say)....
print mold system/options
with (say)....
print mold system..options
....where I've replaced REBOL's forward slash with dot dot syntax?
Short answer: no. Some things are hardwired into the parser.