Can I choose my own token values? - yacc

When I make a grammar file and do a yacc -d on it, I get a y.tab.h output file. Is there any way that I can feed the values of the tokens I want into yacc instead of it picking the values?
For example,
%token FIRST_NAME
%token LAST_NAME
...
produces (in y.tab.h):
#define FIRST_NAME 257
#define LAST_NAME 258
I know that the first 256 values are reserved for single-character matches, but I would really like FIRST_NAME to be #defined as 1001 and LAST_NAME to be #defined as 1002. In other words, I would choose the #defines myself and put them into an include file before running yacc on the grammar file.
Is this possible?
Thanks

I started reading the GNU bison manual and it said that you could do
%token FIRST_NAME 1001
%token LAST_NAME 1002
in bison and it would use these values. I then just tried it for yacc, and it works as well.
Thanks for your time.

Related

yacc lex when parsing CNC GCODES

I have to parse motion control programs (CNC machines, GCODE)
It is GCODE plus similar looking code specific to hardware.
There are lots of commands that consist of a single letter and number, example:
C100Z0.5C100Z-0.5
C80Z0.5C80Z-0.5
So part of my (abbreviated) lex file (racc & rex, actually) looks like:
A {[:A,text]}
B {[:B,text]}
...
Z {[:Z,text]}
So I find a command that takes ANY letter as an argument, and in racc started typing:
letter : A
| B
| C
......
Then I stopped. I haven't used yacc in 30 years; is there some kind of shortcut for the above? Have I gone horribly off course?
It is not clear what you are trying to accomplish. If you want a Yacc rule that covers all letters, you could create a single token for that:
%token letter_token
In lex you would match each letter with a regular expression and simply return letter_token:
[A-Z] {
return letter_token;
}
Now you can use letter_token in Yacc rules:
letter : letter_token
You also haven't said what language you're using. If you need it, you can recover the specific character behind letter_token by defining a union:
%union {
char c;
}
%token <c> letter_token
Say you want to read single characters; the Lex rule that assigns the character to the token would be:
[A-Z] {
yylval.c = *yytext;
return letter_token;
}
Feel free to ask any further questions, and read more here about How to create a Minimal, Complete, and Verifiable example.

YACC or Bison Action Variables positional max value

In Yacc and other Yacc-like programs there are positional action variables for the symbols of the rule currently being parsed. I might want to process CSV input in which the number of columns changes for unknown reasons. With my rules, quoted strings and numbers can each occur one or more times.
rule : DATE_TOKEN QUOTED_NUMBERS q_string numbers { printf(..... $1,$2....}
q_string
: QUOTED_STRING
| QUOTED_STRING q_string
;
numbers
: number numbers
| number
;
number
: INT_VALUE
| FLOAT_VALUE
;
Actions can be added to do things with whatever has been parsed, as in
{ printf("%s %s %s \n",$<string>1, $<string>1, $<string>1); }
Is there a runtime macro, construct or variable that tells me how many tokens have been read, so that I can write a loop to print all token values?
What is $max?
The $n variables in a bison action refer to right-hand side symbols, not to tokens. If the corresponding rhs object is a non-terminal, $n refers to that non-terminal's semantic value, which was set by assigning to $$ in the semantic action of that nonterminal.
So if there are five symbols on the right-hand side of a rule, then you can use $1 through $5. There is no variable notation which allows you to refer to the "nth" symbol.

Lex and yacc program to find palindrome string

Here are my lex and yacc files to recognise palindrome strings, but they give "INVALID" for both valid and invalid strings. Please help me find the problem; I am new to lex and yacc. Thanks in advance.
LEX file
%{
#include "y.tab.h"
%}
%%
a return A;
b return B;
. return *yytext;
%%
YACC file
%{
#include<stdio.h>
#include "lex.yy.c"
int i=0;
%}
%token A B
%%
S: pal '\n' {i=1;}
pal:
| A pal A {printf("my3");i=1;}
| B pal B {printf("my4");i=1;}
| A {printf("my1");i=1;}
| B {printf("my2");i=1;}
;
%%
int main()
{
printf("Enter Valid string\n");
yyparse();
if(i==1)
printf("Valid");
return 0;
}
int yyerror(char* s)
{
printf("Invalid\n");
return 0;
}
Example : entered string is : aba
expected output should be VALID but it is giving INVALID
It is impossible to solve this problem with plain LALR(1) Yacc.
Yacc is an LALR(1) parser generator. LALR refers to a class of grammars. A grammar is a mathematical tool for reasoning about parsing. The 1 in parentheses refers to the lookahead: the maximum number of tokens the parser considers before definitively deciding which of the alternative productions (or "rules") to follow. Remember, the parsing algorithm is one-pass; it can't backtrack and try another alternative as some regular expression engines do.
Concerning your palindrome problem, when the parser encounters 'a', it somehow has to pick the right choice:
pal: A - 'a' alone is a valid palindrome all by itself, let's call it the inner core
pal: [A] pal A - outer layer, increasing nesting level
pal: A pal [A] - outer layer, decreasing nesting level
Making the right choice is impossible without infinite lookahead, but Yacc has only one token of lookahead.
The way Yacc handles this grammar is interesting as well.
If a grammar is ambiguous or not LALR(1), the generated stack automaton is non-deterministic. There are some built-in tools to fix it.
The first tool is precedence and associativity, which deal with operators in programming languages (not relevant here).
Another one is a quirk: by default Yacc prefers "shift" to "reduce". These two terms refer to the internal operation of the parsing algorithm. Basically, tokens are shifted onto a stack. Once a group of symbols on top of the stack matches a rule, it is possible to "reduce" them, replacing the entire group with the single non-terminal from the left side of the rule.
Hence, once we have 'a' on top, we can either reduce it to a pal or shift another token, assuming that a nested pal will emerge eventually. Yacc prefers the latter.
The reason for this preference? The same ambiguity arises in the if-then-else statement in most languages. Consider two nested if statements but only one else clause. Yacc attaches the else to the innermost if statement, which seems to be the right thing to do.
Besides, Yacc can generate a report highlighting issues in the grammar, such as the shift-reduce conflicts mentioned above.
Following up on the comments by #ChrisDod and #NickZavaritsky, here is a working version that uses a GLR (bison) parser.
%option noyywrap
%%
a return A;
b return B;
\n return '\n';
. {fprintf(stderr, "Error\n"); exit(1);}
%%
and Yacc / bison
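%{
#include <stdio.h>
int yylex(void);      /* defined in lex.yy.c, included below */
int yyerror(char* s); /* defined at the end of this file */
int i=0;
%}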
%token A B
%glr-parser
%%
S : pal '\n' {i=1; return 1 ;}
| error '\n' {i=0; return 1 ;}
pal: A pal A
| B pal B
| A
| B
|
;
%%
#include "lex.yy.c"
int main() {
yyparse();
if(i==1) printf("Valid\n");
else printf("inValid\n");
return 0;
}
int yyerror(char* s) { return 0; }
Some changes were made to the lexer: (1) the \n rule was missing; (2) unknown characters are now fatal errors.
The error-recovery token error is used to catch the "invalid palindrome" cases.

Xtext: How do I test the Xtext lexer?

I have a list of terminals in my Xtext grammar. How can I test that they work and that there are no token conflicts?
For example the following terminals:
terminal COMMA: ',';
terminal QUESTION: '?';
terminal IDENTIFIER: ('a'..'z'| 'A'..'Z')+;
terminal LENGTH: 'LENGTH' | 'l' | 'len';
terminal SEMICOLON: ';' ;
I want to make sure that, for example, IDENTIFIER and LENGTH do not conflict with each other, so that LENGTH or len yields a LENGTH token and not an IDENTIFIER token.
(The grammar above gets this wrong, assuming that tokens defined first take priority.)
When I try your example and generate the language, Antlr reports the token conflict.
Dedicated lexer tests are rather easy to set up if you inject a Provider into your test. You may also want to look into xtext-utils, which unfortunately no longer seems to be maintained. Still, its wiki has some insight into how such tests could look.

Syntax error in lex yacc

Here is my lex/yacc code to parse an XML file and print the contents between the <XML> and </XML> tags.
LEX
%{
%}
%%
"<XML>" {return XMLSTART;}
"</XML>" {return XMLEND;}
[a-z]+ {yylval=strdup(yytext); return TEXT;}
"<" {yylval=strdup(yytext);return yytext[0];}
">" {yylval=strdup(yytext);return yytext[0];}
"\n" {yylval=strdup(yytext);return yytext[0];}
. {}
%%
YACC
%{
#include<stdio.h>
#include<stdlib.h>
#include<string.h>
#define YYSTYPE char *
%}
%token XMLSTART
%token XMLEND
%token TEXT
%%
program : XMLSTART '\n' A '\n' XMLEND {printf("%s",$3);
}
A : '<' TEXT '>' '\n' A '\n' '<' TEXT '>' { $$ = strcat($1,strcat($2,strcat($3,strcat($4,strcat($5,strcat($6,strcat($7,strcat($8,$9))))))));}
| TEXT
%%
#include"lex.yy.c"
I'm getting a syntax error. I tried adding ECHOs in some places but couldn't find the cause.
The input file I'm using is:
<XML>
<hello>
hi
<hello>
</XML>
Please help me figure out the error. I have relatively little experience with lex and yacc.
That grammar will only successfully parse a file which has XMLEND at the end. However, all text files end with a newline.
Although you could presumably fix that by adding a newline at the end of the start rule, it's almost always a bad idea to try to parse whitespace. In general, except for line-oriented languages -- which xml is not -- it is best to ignore whitespace.
Your use of strcat is incorrect. Quoting man strcat from a GNU/Linux system:
The strcat() function appends the src string to the dest string, overwriting the terminating null byte ('\0') at the end of dest, and then adds a terminating null byte. The strings may not overlap, and the dest string must have enough space for the result. If dest is not large enough, program behavior is unpredictable; buffer overruns are a favorite avenue for attacking secure programs.
You might want to use asprintf if it exists in your standard library.
Also, you never free() the strings produced by strdup, so all of them leak memory. In general, it's better not to strdup tokens whose string representation is already known, particularly single-character tokens, but the important thing is to keep track of the tokens whose string value has been freshly allocated. The same applies to semantic values produced with asprintf, if the above suggestion is taken.