Troubles using Bison's recursive rules, and storing values using it - grammar

I am trying to make a flex+bison scanner and parser for Newick file format trees in order to do operations on them. The implemented grammar and explanation are based on a simplification of this example (labels and lengths are always of the same type, returned by flex).
This is essentially a parser for a file format which represents a tree with a series of (recursive) subtrees and/or leaves.
The main tree will always end on ; and said tree and all subtrees within will contain a series of nodes between ( and ), with a name and a length to the right of the rightmost parenthesis specified by name and :length, which are optional (you can avoid specifying them, put one of them (name or :length), or both with name:length).
If any node lacks either the name or a length, default values will be applied. (for example: 'missingName' and '1')
An example would be (child1:4, child2:6)root:6; , ((child1Of1:2, child2Of1:9)child1:5, child2:6)root:6;
The implementation of said grammar is the following one (NOTE: I translated my own code, as it was in my language, and lots of side stuff got removed for clarity):
%union {
struct node {
char* name; /*the node's assigned name, either from the file or from default values*/
float length; /*node's length*/
} dataOfNode;
}
%start tree
%token<dataOfNode> OP CP COMMA SEMICOLON COLON DISTANCE NAME
%type<dataOfNode> tree subtrees recursive_subtrees subtree leaf
%%
tree: subtrees NAME COLON DISTANCE SEMICOLON {} // with name and distance
| subtrees NAME SEMICOLON {} // without distance
| subtrees COLON DISTANCE SEMICOLON {} // without name
| subtrees SEMICOLON {} // without name nor distance
;
subtrees: OP recursive_subtrees CP {}
;
recursive_subtrees: subtree {} // just one subtree, or the last one of the list
| recursive_subtrees COMMA subtree {} // (subtree, subtree, subtree...)
subtree: subtrees NAME COLON DISTANCE { $$.name= $2.name; $$.length = $4.length; $$.lengthAcum = $$.lengthAcum + $4.length;
} // group of subtrees, same as the main tree but without ";" at the end, with name and distance
| subtrees NAME { $$.name= $2.name; $$.length = 1.0;} // without distance
| subtrees COLON DISTANCE { $$.name= "missingName"; $$.length = $3.length;} // without name
| subtrees { $$.name= "missingName"; $$.length = 1.0;} // without name nor distance
| leaf { $$.name= $1.name; $$.length = $1.length;} // a leaf
leaf: NAME COLON DISTANCE { $$.name= $1.name; $$.length = $3.length;} // with name and distance
| NAME { $$.name= $1.name; $$.length = 1.0;} // without distance
| COLON DISTANCE { $$.name= "missingName"; $$.length = $2.length;} // without name
| { $$.name= "missingName"; $$.length = 1.0;} // without name nor distance
;
%%
Now, let's say that I want to distinguish who is the parent of each subtree and leaf, so that I can accumulate the length of a parent subtree with the length of the "longest" child, recursively.
I do not know if I chose a bad design for this, but I can't get past
assigning names and lengths to each subtree (and leaf, which is also considered a subtree), and
I don't think I understand either how recursivity works in order to identify the parents in the matching process.

This is mostly a matter of defining the data structure you want to hold your trees, and building that "bottom up" in the actions of the rules. The "bottom up" part is an important implication of the way that bison parsers work: they recognize constructs from the leaves of the grammar and then assemble them into higher non-terminals (and ultimately into the start non-terminal, whose action runs last). You can also simplify things by not having so many redundant rules. Finally, IMO it's always better to use character literals for single-character tokens rather than names. So you might end up with:
%{
#include <assert.h>
#include <stdlib.h>
struct tree {
struct tree *next; /* each tree is potentially part of a list of (sub)trees */
struct tree *subtree; /* and may contain subtrees */
const char *name;
double length;
};
struct tree *new_leaf(const char *name, double length); /* malloc a new leaf "tree" */
void append_tree(struct tree **list, struct tree *t); /* append a tree on to a list of trees */
void print_tree(const struct tree *t); /* used in the tree rule below; definition not shown here */
%}
%union {
const char *name;
double value;
struct tree *tree;
}
%type<tree> subtrees recursive_subtrees subtree leaf
%token<name> NAME
%token<value> DISTANCE
%%
tree: subtrees leaf ';' { $2->subtree = $1; print_tree($2); }
;
subtrees: '(' recursive_subtrees ')' { $$ = $2; }
;
recursive_subtrees: subtree { $$ = $1; } // just one subtree, or the last one of the list
| recursive_subtrees ',' subtree { append_tree(&($$=$1)->next, $3); } // (subtree, subtree, subtree...)
;
subtree: subtrees leaf { ($$=$2)->subtree = $1; }
| leaf { $$ = $1; }
;
leaf: NAME ':' DISTANCE { $$ = new_leaf($1, $3);} // with name and distance
| NAME { $$ = new_leaf($1, 1.0);} // without distance
| ':' DISTANCE { $$ = new_leaf("missingName", $2); } // without name
| { $$ = new_leaf("missingName", 1.0); } // without name nor distance
;
%%
struct tree *new_leaf(const char *name, double length) {
struct tree *rv = malloc(sizeof(struct tree));
rv->subtree = rv->next = NULL;
rv->name = name;
rv->length = length;
return rv;
}
void append_tree(struct tree **list, struct tree *t) {
assert(t->next == NULL); // must not be part of a list yet
while (*list) list = &(*list)->next;
*list = t;
}
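The question also asks how to accumulate a parent's length with that of its "longest" child. Once the tree has been built bottom-up as above, that can be done as an ordinary recursive walk after parsing rather than inside the grammar actions. The following is only a sketch under that assumption: longest_path is a hypothetical helper that is not part of the original answer, and the struct and the two builder functions are repeated (with a malloc check added) so that the block compiles on its own.

```c
#include <assert.h>
#include <stdlib.h>

struct tree {
    struct tree *next;     /* next sibling in the parent's list */
    struct tree *subtree;  /* first child, or NULL for a leaf */
    const char *name;
    double length;
};

static struct tree *new_leaf(const char *name, double length) {
    struct tree *rv = malloc(sizeof *rv);
    assert(rv != NULL);    /* sketch only: real code should report the failure */
    rv->subtree = rv->next = NULL;
    rv->name = name;
    rv->length = length;
    return rv;
}

static void append_tree(struct tree **list, struct tree *t) {
    assert(t->next == NULL);  /* must not be part of a list yet */
    while (*list) list = &(*list)->next;
    *list = t;
}

/* Hypothetical helper: this node's length plus the longest
 * accumulated length found among its children (0 for a leaf). */
static double longest_path(const struct tree *t) {
    double best = 0.0;
    for (const struct tree *c = t->subtree; c != NULL; c = c->next) {
        double d = longest_path(c);
        if (d > best) best = d;
    }
    return t->length + best;
}
```

In the parser this could be called from the action of the tree rule, where the whole tree is available, for example next to the print_tree call.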

Related

In Bison (or yacc for that matter), is there an order defined by a grammar?

I have the following grammar in a Bison file:
item
: "ITEM" t_name t_type v_storage t_prefix t_tag ';'
;
t_name
: [$_A-Za-z][$_A-Z0-9a-z]*
;
t_type
: "BYTE"
| "WORD"
| "LONG"
| "QUAD"
;
v_storage
: %empty
| "TYPEDEF"
;
t_prefix
: %empty
| "PREFIX" t_name
;
t_tag
: %empty
| "TAG" t_name
;
When I attempt to parse the following string ITEM foobar BYTE PREFIX str_ TAG S TYPEDEF; I get an unexpected "TYPEDEF" error, and the parser would only accept the ";" at that point. Is there something I need to do to allow any order to be specified? If so, I'm hoping that there is a simple solution. Otherwise, I'll need to do a little more work.
It is not possible to tell bison (or yacc) that order doesn't matter. Rules are strictly ordered.
So you have two options:
List all possible orders. If you do this, watch out for ambiguities caused by optional productions. You'll actually need to list all orders and subsets. That mounts up exponentially.
Just accept any list of components, as a list. That will accept repeated components so you'll need to catch that in the semantic action if you care.
The second option is almost always the one you want. Implementation is usually trivial, because you will want to store the components somewhere; as long as that somewhere has a unique value (such as NULL) which means "not yet set", then you only need to test that value before setting it. For example (with a slightly different grammar rather than the one in the question):
%{
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
enum Type {
TYPE_DEFAULT = 0, TYPE_BYTE, TYPE_WORD, TYPE_LONG, TYPE_QUAD
};
typedef struct Item Item;
struct Item {
const char *name;
enum Type type;
int storage; /* 0: unset, 1: TYPEDEF */
const char *prefix;
const char *tag;
};
// ...
// Relies on the fact that NULL and 0 are converted to boolean
// false. Returns true if it's ok to do the set (i.e. thing
// wasn't set).
bool check_dup(bool already_set, const char* thing) {
if (already_set)
fprintf(stderr, "Duplicate %s ignored at line %d\n", thing, yylineno);
return !already_set;
}
%}
%union {
const char *str;
Item *item;
// ...
}
%type <item> item item-def
%token <str> NAME STRING
%%
/* Many of the actions below depend on $$ having been set to $1.
* If you use a template which doesn't provide that guarantee, you
* will have to add $$ = $1; to some actions.
*/
item: item-def { /* Do whatever is necessary to finalise $1 */ }
item-def
: "ITEM" NAME
{ $$ = calloc(1, sizeof *$$); $$->name = $2; }
| item-def "BYTE"
{ if (check_dup($$->type, "type")) $$->type = TYPE_BYTE; }
| item-def "WORD"
{ if (check_dup($$->type, "type")) $$->type = TYPE_WORD; }
| item-def "LONG"
{ if (check_dup($$->type, "type")) $$->type = TYPE_LONG; }
| item-def "QUAD"
{ if (check_dup($$->type, "type")) $$->type = TYPE_QUAD; }
| item-def "TYPEDEF"
{ if (check_dup($$->storage, "storage")) $$->storage = 1; }
| item-def "PREFIX" STRING
{ if (check_dup($$->prefix, "prefix")) $$->prefix = $3; }
| item-def "TAG" STRING
{ if (check_dup($$->tag, "tag")) $$->tag = $3; }
You can separate all those item-def productions into something like:
item-def: "ITEM" NAME { /* ... */ }
| item-def item-option
item-option: type | storage | prefix | tag
But then in the actions you need to get at the item object, which is not part of the option production. You can do that with a Bison feature which lets you look into the parser stack:
prefix: "PREFIX" STRING { if (check_dup($<item>0->prefix, "prefix")
$<item>0->prefix = $2; }
In this context, $0 will refer to whatever came before prefix, which is whatever came before item-option, which is an item-def. See the end of this section in the Bison manual, where it describes this practice as "risky", which it is. It also requires you to explicitly specify the tag, because bison doesn't do the grammar analysis necessary to validate the use of $0, which would identify its type.
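The "accept the options as a list and reject duplicates" idea does not depend on Bison, so it may help to see it in plain C first. This is a rough sketch only: apply_option is a hypothetical function standing in for the semantic actions, the keywords are passed in as strings instead of tokens, and the line number from yylineno is left out of the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

enum Type { TYPE_DEFAULT = 0, TYPE_BYTE, TYPE_WORD, TYPE_LONG, TYPE_QUAD };

struct Item {
    const char *name;
    enum Type type;
    int storage;          /* 0: unset, 1: TYPEDEF */
    const char *prefix;
    const char *tag;
};

/* Returns true if it is ok to set the field (it was not set before). */
static bool check_dup(bool already_set, const char *thing) {
    if (already_set)
        fprintf(stderr, "Duplicate %s ignored\n", thing);
    return !already_set;
}

/* Hypothetical stand-in for the grammar actions: applies one option,
 * in whatever order the options arrive. Returns false for an unknown
 * keyword. arg is only used by PREFIX and TAG. */
static bool apply_option(struct Item *item, const char *kw, const char *arg) {
    assert(item != NULL && kw != NULL);
    enum Type t = TYPE_DEFAULT;
    if (strcmp(kw, "BYTE") == 0) t = TYPE_BYTE;
    else if (strcmp(kw, "WORD") == 0) t = TYPE_WORD;
    else if (strcmp(kw, "LONG") == 0) t = TYPE_LONG;
    else if (strcmp(kw, "QUAD") == 0) t = TYPE_QUAD;

    if (t != TYPE_DEFAULT) {
        if (check_dup(item->type != TYPE_DEFAULT, "type")) item->type = t;
    } else if (strcmp(kw, "TYPEDEF") == 0) {
        if (check_dup(item->storage != 0, "storage")) item->storage = 1;
    } else if (strcmp(kw, "PREFIX") == 0) {
        if (check_dup(item->prefix != NULL, "prefix")) item->prefix = arg;
    } else if (strcmp(kw, "TAG") == 0) {
        if (check_dup(item->tag != NULL, "tag")) item->tag = arg;
    } else {
        return false;
    }
    return true;
}
```

Because every field starts out with a recognisable "unset" value, the order in which BYTE, PREFIX, TAG and TYPEDEF are applied does not matter, and a second attempt to set the same field is reported and ignored.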

Building a linked list in yacc with left recursive Grammar

I want to build a linked list of data in yacc.
My Grammar reads like this:
list: item
| list ',' item
;
I have put the appropriate structures in place in the declarations section. But I am not able to figure out a way to get a linked list out of this data. I have to store the recursively obtained data and then redirect it for other purposes.
Basically I am looking for a solution like this one:
https://stackoverflow.com/a/1429820/5134525
But this solution is for right recursion and doesn't work with left.
It depends heavily on how you implement your linked list, but once you have that, it is straightforward. Something like:
struct list_node {
struct list_node *next;
value_t value;
};
struct list {
struct list_node *head, **tail;
};
struct list *new_list() {
struct list *rv = malloc(sizeof(struct list));
rv->head = 0;
rv->tail = &rv->head;
return rv;
}
void push_back(struct list *list, value_t value) {
struct list_node *node = malloc(sizeof(struct list_node));
node->next = 0;
node->value = value;
*list->tail = node;
list->tail = &node->next;
}
allows you to write your yacc code as:
list: item { push_back($$ = new_list(), $1); }
| list ',' item { push_back($$ = $1, $3); }
;
Of course, you should probably add checks for running out of memory, and exit gracefully in that case.
If you use a left recursive rule, then you need to push the new item at the end of the list rather than the beginning.
If your linked list implementation doesn't support push_back, then push the successive items at the front and reverse the list when it's finished.
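The "push at the front, then reverse" alternative might look like the sketch below. The names push_front and reverse are illustrative, and value_t is assumed to be int here only so that the block compiles by itself; in a real parser it would be whatever type your items have.

```c
#include <assert.h>
#include <stdlib.h>

typedef int value_t;   /* assumption for this sketch */

struct list_node {
    struct list_node *next;
    value_t value;
};

/* Prepends a value and returns the new head of the list. */
static struct list_node *push_front(struct list_node *head, value_t value) {
    struct list_node *node = malloc(sizeof *node);
    assert(node != NULL);   /* sketch only: handle allocation failure properly */
    node->value = value;
    node->next = head;
    return node;
}

/* Reverses the list in place and returns the new head. */
static struct list_node *reverse(struct list_node *head) {
    struct list_node *prev = NULL;
    while (head != NULL) {
        struct list_node *next = head->next;
        head->next = prev;
        prev = head;
        head = next;
    }
    return prev;
}
```

With this approach the two list actions would call push_front, and the rule that consumes the finished list would call reverse once, so the items end up in source order.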
Very simple.
list
: item
{
$$ = new MyList<SomeType>();
$$->add($1);
}
| list ',' item
{
$1->add($3);
$$ = $1;
}
;
assuming you are using C++, which you didn't state, and assuming you have some MyList<T> class with an add(T) method.

what is the need of else block in the method "push_links" of the following code?

This code is for the Aho-Corasick algorithm, which I took from here.
I understand this code up to the if block of the push_links method, but I don't get the purpose of the else part of the same method.
More specifically, the first method constructs the trie. The remaining work is done by the second method, i.e. linking each node to its longest proper suffix that is also a prefix of some pattern. This is carried out by the if block, so what is the need for the else part?
Please help me with this.
const int MAXN = 404, MOD = 1e9 + 7, sigma = 26;
int term[MAXN], len[MAXN], to[MAXN][sigma], link[MAXN], sz = 1;
// this method is for constructing trie
void add_str(string s)
{
// here cur denotes current node
int cur = 0;
// this for loop adds string to trie character by character
for(auto c: s)
{
if(!to[cur][c - 'a'])
{
//here two nodes or characters are linked using transition
//array "to"
to[cur][c - 'a'] = sz++;
len[to[cur][c - 'a']] = len[cur] + 1;
}
// for updating the current node
cur = to[cur][c - 'a'];
}
//for marking the leaf nodes or terminals
term[cur] = cur;
}
void push_links()
{
//here queue is used for breadth first search of the trie
int que[sz];
int st = 0, fi = 1;
//very first node is enqueued
que[0] = 0;
while(st < fi)
{
// here nodes in the queue are dequeued
int V = que[st++];
// link[] array contains the suffix links.
int U = link[V];
if(!term[V]) term[V] = term[U];
// here links for the other nodes are find out using assumption that the
// link for the parent node is defined
for(int c = 0; c < sigma; c++)
// this if condition ensures that transition is possible to the next node
// for input 'c'
if(to[V][c])
{
// for the possible transitions link to the reached node is assigned over
// here which is nothing but transition on input 'c' from the link of the
// current node
link[to[V][c]] = V ? to[U][c] : 0;
que[fi++] = to[V][c];
}
else
{
to[V][c] = to[U][c];
}
}
}
IMO you don't need the else branch. If there are no children, it's either already a link or nothing.
There are some variations of Aho-Corasick algorithm.
Base algorithm assumes that if edge from current node (cur) over symbol (c) is missing, then you go via suffix links to the first node that has edge over c (you make move via this edge).
But your way over suffix links is the same (from the same cur and c), because you don't change automaton while searching. So you can cache it (save result of
// start from node
while (parent of node doesn't have an edge over c) {
node = parent
}
// set trie position
node = to[node][c]
// go to next character
in to[node][c]). So next time you won't do this again. This transforms the automaton from a non-deterministic into a deterministic state machine (you don't have to use the link array after pushing; you can use only the to array).
There are some problems with this implementation. First, you can't get the index of the string you found (you don't save it). Also, the len array isn't used anywhere.
As for the follow-up question (the algorithm only checks for the character in the current node's link, using "link[to[V][c]] = V ? to[U][c] : 0;"; shouldn't it also check the parents' links?):
Yes, it's ok, because if to[U][c] is 0, then there are no edges via c anywhere in the chain U -> suffix_parent -> suffix parent of suffix_parent ... -> root = 0. So you should set to[V][c] to zero.
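To illustrate the point that, after push_links, a search only needs the to array, here is a sketch in C. The trie construction and push_links follow the code from the question (with the unused len array dropped and link renamed to suf_link); count_match_positions is a hypothetical helper added for this illustration, and the input is assumed to contain only lowercase letters.

```c
#include <assert.h>

#define MAXN 404
#define SIGMA 26

static int term[MAXN], to[MAXN][SIGMA], suf_link[MAXN], sz = 1;

/* Adds one lowercase pattern to the trie. */
static void add_str(const char *s) {
    int cur = 0;
    for (; *s; s++) {
        int c = *s - 'a';
        if (!to[cur][c]) {
            assert(sz < MAXN);
            to[cur][c] = sz++;
        }
        cur = to[cur][c];
    }
    term[cur] = cur;   /* non-zero marks "a pattern ends here" */
}

/* BFS over the trie: the if branch sets suffix links, the else branch
 * fills every missing transition with the one cached from the link. */
static void push_links(void) {
    int que[MAXN];
    int st = 0, fi = 1;
    que[0] = 0;
    while (st < fi) {
        int v = que[st++];
        int u = suf_link[v];
        if (!term[v]) term[v] = term[u];
        for (int c = 0; c < SIGMA; c++) {
            if (to[v][c]) {
                suf_link[to[v][c]] = v ? to[u][c] : 0;
                que[fi++] = to[v][c];
            } else {
                to[v][c] = to[u][c];
            }
        }
    }
}

/* Hypothetical helper: counts the text positions at which at least
 * one pattern ends. Note that it never touches suf_link. */
static int count_match_positions(const char *text) {
    int node = 0, hits = 0;
    for (; *text; text++) {
        node = to[node][*text - 'a'];   /* always defined thanks to the else branch */
        if (term[node]) hits++;
    }
    return hits;
}
```

Without the else branch, the search loop would have to walk suf_link by hand every time a transition is missing; with it, each input character costs exactly one table lookup.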

PEGJS predicate grammar

I need to create a grammar with the help of predicate. The below grammar fails for the given case.
startRule = a:namespace DOT b:id OPEN_BRACE CLOSE_BRACE {return {"namespace": a, "name": b}}
namespace = id (DOT id)*
DOT = '.';
OPEN_BRACE = '(';
CLOSE_BRACE = ')';
id = [a-zA-Z]+;
It fails for the given input as
com.mytest.create();
which should have given "create" as value of "name" key in the result part.
Any help would be great.
There are several things here.
The most important is that you must be aware that PEG is greedy. That means that your (DOT id)* rule matches ALL the DOT id sequences, including the one that you have in startRule as DOT b:id.
That can be solved using lookahead.
The other thing is that you must remember to use join, since by default the match is returned as an array with one element per character.
I also added a rule for semicolons.
Try this:
start =
namespace:namespace DOT name:string OPEN_BRACE CLOSE_BRACE SM nl?
{
return { namespace : namespace, name : name };
}
/* Here I'm using the lookahead: (member !OPEN_BRACE)* */
namespace =
first:string rest:(member !OPEN_BRACE)*
{
rest = rest.map(function (x) { return x[0]; });
rest.unshift(first);
return rest;
}
member =
DOT str:string
{ return str; }
DOT =
'.'
OPEN_BRACE =
'('
CLOSE_BRACE =
')'
SM =
';'
nl =
"\n"
string =
str:[a-zA-Z]+
{ return str.join(''); }
And as far as I can tell, I'm parsing that line correctly.
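The effect of the greedy repetition, and of the negative lookahead that fixes it, can also be reproduced by hand. The following C sketch is not PEG.js; it is a small hand-written matcher, with hypothetical function names, that mimics (DOT id)* with and without the !OPEN_BRACE predicate. Anything after the closing parenthesis (such as the semicolon) is ignored.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* id = [a-zA-Z]+ ; returns the number of characters matched (0 = no match). */
static size_t match_id(const char *s) {
    size_t n = 0;
    while (isalpha((unsigned char)s[n])) n++;
    return n;
}

/* Matches (DOT id)* greedily and returns the position after it.
 * With use_lookahead set, an iteration is only accepted when it is
 * NOT followed by '(', mimicking (member !OPEN_BRACE)*. */
static const char *match_members(const char *s, bool use_lookahead) {
    for (;;) {
        if (*s != '.') return s;
        size_t n = match_id(s + 1);
        if (n == 0) return s;
        const char *after = s + 1 + n;
        if (use_lookahead && *after == '(') return s;  /* predicate fails, stop here */
        s = after;
    }
}

/* startRule = namespace DOT id '(' ')' ; stores the final id in name_out. */
static bool parse_call(const char *s, bool use_lookahead,
                       char *name_out, size_t cap) {
    assert(name_out != NULL && cap > 0);
    size_t n = match_id(s);
    if (n == 0) return false;
    s = match_members(s + n, use_lookahead);
    if (*s != '.') return false;       /* greedy version has already eaten it */
    const char *name = s + 1;
    n = match_id(name);
    if (n == 0 || n >= cap) return false;
    s = name + n;
    if (s[0] != '(' || s[1] != ')') return false;
    memcpy(name_out, name, n);
    name_out[n] = '\0';
    return true;
}
```

With use_lookahead set to false the repetition swallows .create as well, so nothing is left for the DOT id that startRule expects and the match fails, which is the behaviour described in the question.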

How can I build an ANTLR Works style parse tree?

I've read that you need to use the '^' and '!' operators in order to build a parse tree similar to the ones displayed in ANTLR Works (even though you don't need to use them to get a nice tree in ANTLR Works). My question then is how can I build such a tree? I've seen a few pages on tree construction using the two operators and rewrites, and yet say I have an input string abc abc123 and a grammar:
grammar test;
program : idList;
idList : id* ;
id : ID ;
ID : LETTER (LETTER | NUMBER)* ;
LETTER : 'a' .. 'z' | 'A' .. 'Z' ;
NUMBER : '0' .. '9' ;
ANTLR Works will output:
What I don't understand is how you can get the 'idList' node on top of this tree (as well as the grammar one as a matter of fact). How can I reproduce this tree using rewrites and those operators?
You can't use ^ and ! alone. These operators only operate on existing tokens, while you want to create extra tokens (and make these the root of your sub trees). You can do that using rewrite rules and defining some imaginary tokens.
A quick demo:
grammar test;
options {
output=AST;
ASTLabelType=CommonTree;
}
tokens {
IdList;
Id;
}
@parser::members {
private static void walk(CommonTree tree, int indent) {
if(tree == null) return;
for(int i = 0; i < indent; i++, System.out.print(" "));
System.out.println(tree.getText());
for(int i = 0; i < tree.getChildCount(); i++) {
walk((CommonTree)tree.getChild(i), indent + 1);
}
}
public static void main(String[] args) throws Exception {
testLexer lexer = new testLexer(new ANTLRStringStream("abc abc123"));
testParser parser = new testParser(new CommonTokenStream(lexer));
walk((CommonTree)parser.program().getTree(), 0);
}
}
program : idList EOF -> idList;
idList : id* -> ^(IdList id*);
id : ID -> ^(Id ID);
ID : LETTER (LETTER | DIGIT)*;
SPACE : ' ' {skip();};
fragment LETTER : 'a' .. 'z' | 'A' .. 'Z';
fragment DIGIT : '0' .. '9';
If you run the demo above, you will see the following being printed to the console:
IdList
Id
abc
Id
abc123
As you can see, imaginary tokens must also start with an upper case letter, just like lexer rules. If you want to give the imaginary tokens the same text as the parser rule they represent, do something like this instead:
idList : id* -> ^(IdList["idList"] id*);
id : ID -> ^(Id["id"] ID);
which will print:
idList
id
abc
id
abc123