In Bison (or yacc for that matter), is there an order defined by a grammar?

I have the following grammar in a Bison file:
item
: "ITEM" t_name t_type v_storage t_prefix t_tag ';'
;
t_name
: [$_A-Za-z][$_A-Z0-9a-z]*
;
t_type
: "BYTE"
| "WORD"
| "LONG"
| "QUAD"
;
v_storage
: %empty
| "TYPEDEF"
;
t_prefix
: %empty
| "PREFIX" t_name
;
t_tag
: %empty
| "TAG" t_name
;
When I attempt to parse the string ITEM foobar BYTE PREFIX str_ TAG S TYPEDEF; I get an unexpected "TYPEDEF" and it accepts the ";". Is there something I need to do to allow any order to be specified? If so, I'm hoping that there is a simple solution. Otherwise, I'll need to do a little more work.

It is not possible to tell bison (or yacc) that order doesn't matter. Rules are strictly ordered.
So you have two options:
List all possible orders. If you do this, watch out for ambiguities caused by optional productions. You'll actually need to list all orders and subsets. That mounts up exponentially.
Just accept any list of components, as a list. That will accept repeated components so you'll need to catch that in the semantic action if you care.
The second option is almost always the one you want. Implementation is usually trivial, because you will want to store the components somewhere; as long as that somewhere has a unique value (such as NULL) which means "not yet set", then you only need to test that value before setting it. For example (using a grammar slightly different from the one in the question):
%{
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
enum Type {
TYPE_DEFAULT = 0, TYPE_BYTE, TYPE_WORD, TYPE_LONG, TYPE_QUAD
};
typedef struct Item Item;
struct Item {
const char *name;
enum Type type;
int storage; /* 0: unset, 1: TYPEDEF */
const char *prefix;
const char *tag;
};
// ...
// Relies on the fact that NULL and 0 are converted to boolean
// false. Returns true if it's ok to do the set (i.e. thing
// wasn't set).
bool check_dup(bool already_set, const char* thing) {
if (already_set)
fprintf(stderr, "Duplicate %s ignored at line %d\n", thing, yylineno);
return !already_set;
}
%}
%union {
const char *str;
Item *item;
// ...
}
%type <item> item item-def
%token <str> NAME STRING
%%
/* Many of the actions below depend on $$ having been set to $1.
* If you use a template which doesn't provide that guarantee, you
* will have to add $$ = $1; to some actions.
*/
item: item-def { /* Do whatever is necessary to finalise $1 */ }
item-def
: "ITEM" NAME
{ $$ = calloc(1, sizeof *$$); $$->name = $2; }
| item-def "BYTE"
{ if (check_dup($$->type, "type")) $$->type = TYPE_BYTE; }
| item-def "WORD"
{ if (check_dup($$->type, "type")) $$->type = TYPE_WORD; }
| item-def "LONG"
{ if (check_dup($$->type, "type")) $$->type = TYPE_LONG; }
| item-def "QUAD"
{ if (check_dup($$->type, "type")) $$->type = TYPE_QUAD; }
| item-def "TYPEDEF"
{ if (check_dup($$->storage, "storage")) $$->storage = 1; }
| item-def "PREFIX" STRING
{ if (check_dup($$->prefix, "prefix")) $$->prefix = $3; }
| item-def "TAG" STRING
{ if (check_dup($$->tag, "tag")) $$->tag = $3; }
You can separate all those item-def productions into something like:
item-def: "ITEM" NAME { /* ... */ }
| item-def item-option
item-option: type | storage | prefix | tag
But then in the actions you need to get at the item object, which is not part of the option production. You can do that with a Bison feature which lets you look into the parser stack:
prefix: "PREFIX" STRING { if (check_dup($<item>0->prefix, "prefix"))
$<item>0->prefix = $2; }
In this context, $0 will refer to whatever came before prefix, which is whatever came before item-option, which is an item-def. See the end of this section in the Bison manual, where it describes this practice as "risky", which it is. It also requires you to explicitly specify the tag, because bison doesn't do the grammar analysis necessary to validate the use of $0, which would identify its type.
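Outside of a grammar, the "test before setting" idea can be sketched in plain C. This is only an illustration under assumptions: the setter functions and the duplicate_count counter are hypothetical helpers that are not part of the answer, and the message omits the line number because yylineno only exists inside a scanner.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

enum Type { TYPE_DEFAULT = 0, TYPE_BYTE, TYPE_WORD, TYPE_LONG, TYPE_QUAD };

struct Item {
    const char *name;
    enum Type type;
    int storage;          /* 0: unset, 1: TYPEDEF */
    const char *prefix;
    const char *tag;
};

/* Hypothetical counter, kept only so the behaviour can be observed. */
int duplicate_count = 0;

/* Returns true if the field is still unset; otherwise reports the duplicate. */
bool check_dup(bool already_set, const char *thing) {
    if (already_set) {
        fprintf(stderr, "Duplicate %s ignored\n", thing);
        ++duplicate_count;
    }
    return !already_set;
}

/* Hypothetical setters: each one writes its field only the first time. */
void set_type(struct Item *item, enum Type type) {
    if (check_dup(item->type != TYPE_DEFAULT, "type")) item->type = type;
}

void set_storage(struct Item *item) {
    if (check_dup(item->storage != 0, "storage")) item->storage = 1;
}

void set_prefix(struct Item *item, const char *prefix) {
    if (check_dup(item->prefix != NULL, "prefix")) item->prefix = prefix;
}

void set_tag(struct Item *item, const char *tag) {
    if (check_dup(item->tag != NULL, "tag")) item->tag = tag;
}
```

Because every field starts out zeroed (calloc in the grammar, a zero initializer here), the setters can be called in any order, and a second attempt to set the same field is reported and ignored.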

Related

Troubles using Bison's recursive rules, and storing values using it

I am trying to make a flex+bison scanner and parser for Newick file format trees in order to do operations on them. The implemented grammar and its explanation are based on a simplification of this example (labels and lengths are always of the same type, returned by flex).
This is essentially a parser for a file format which represents a tree with a series of (recursive) subtrees and/or leaves.
The main tree will always end on ; and said tree and all subtrees within will contain a series of nodes between ( and ), with a name and a length to the right of the rightmost parenthesis specified by name and :length, which are optional (you can avoid specifying them, put one of them (name or :length), or both with name:length).
If any node lacks either the name or a length, default values will be applied. (for example: 'missingName' and '1')
An example would be (child1:4, child2:6)root:6; , ((child1Of1:2, child2Of1:9)child1:5, child2:6)root:6;
The implementation of said grammar is the following one (NOTE: I translated my own code, as it was in my language, and lots of side stuff got removed for clarity):
%union {
struct node {
char* name; /*the node's assigned name, either from the file or from default values*/
float length; /*node's length*/
} dataOfNode;
}
%start tree
%token<dataOfNode> OP CP COMMA SEMICOLON COLON DISTANCE NAME
%type<dataOfNode> tree subtrees recursive_subtrees subtree leaf
%%
tree: subtrees NAME COLON DISTANCE SEMICOLON {} // with name and distance
| subtrees NAME SEMICOLON {} // without distance
| subtrees COLON DISTANCE SEMICOLON {} // without name
| subtrees SEMICOLON {} // without name nor distance
;
subtrees: OP recursive_subtrees CP {}
;
recursive_subtrees: subtree {} // just one subtree, or the last one of the list
| recursive_subtrees COMMA subtree {} // (subtree, subtree, subtree...)
subtree: subtrees NAME COLON DISTANCE { $$.NAME= $2.name; $$.length = $4.length; $$.lengthAcum = $$.lengthAcum + $4.length;
} // group of subtrees, same as the main tree but without ";" at the end, with name and distance
| subtrees NAME { $$.name= $2.name; $$.length = 1.0;} // without distance
| subtrees COLON DISTANCE { $$.name= "missingName"; $$.length = $3.length;} // without name
| subtrees { $$.name= "missingName"; $$.length = 1.0;} // without name nor distance
| leaf { $$.name= $1.name; $$.length = $1.length;} // a leaf
leaf: NAME COLON DISTANCE { $$.name= $$.name; $$.length = $3.length;} // with name and distance
| NAME { $$.name= $1.name; $$.length = 1.0;} // without distance
| COLON DISTANCE { $$.name= "missingName"; $$.length = $2.length;} // without name
| { $$.name= "missingName"; $$.length = 1.0;} // without name nor distance
;
%%
Now, let's say that I want to distinguish who is the parent of each subtree and leaf, so that I can accumulate the length of a parent subtree with the length of the "longest" child, recursively.
I do not know if I chose a bad design for this, but I can't get past
assigning names and lengths to each subtree (and leaf, which is also considered a subtree), and
I don't think I understand either how recursivity works in order to identify the parents in the matching process.
This is mostly a matter of defining the data structure you want to hold your trees, and building that "bottom up" in the actions of the rules. The "bottom up" part is an important implication of the way that bison parsers work -- they are "bottom up", recognizing constructs from the leaves of the grammar and then assembling them into higher non-terminals (and ultimately into the start non-terminal, which will be the last action run). You can also simplify things by not having so many redundant rules. Finally, IMO it's always better to use character literals for single character tokens rather than names. So you might end up with:
%{
#include <assert.h>
#include <stdlib.h>
struct tree {
struct tree *next; /* each tree is potentially part of a list of (sub)trees */
struct tree *subtree; /* and may contain subtrees */
const char *name;
double length;
};
struct tree *new_leaf(const char *name, double length); /* malloc a new leaf "tree" */
void append_tree(struct tree **list, struct tree *t); /* append a tree on to a list of trees */
void print_tree(struct tree *t); /* print a finished tree; defined elsewhere */
%}
%union {
const char *name;
double value;
struct tree *tree;
}
%type<tree> subtrees recursive_subtrees subtree leaf
%token<name> NAME
%token<value> DISTANCE
%%
tree: subtrees leaf ';' { $2->subtree = $1; print_tree($2); }
;
subtrees: '(' recursive_subtrees ')' { $$ = $2; }
;
recursive_subtrees: subtree { $$ = $1; } // just one subtree, or the last one of the list
| recursive_subtrees ',' subtree { append_tree(&($$=$1)->next, $3); } // (subtree, subtree, subtree...)
;
subtree: subtrees leaf { ($$=$2)->subtree = $1; }
| leaf { $$ = $1; }
;
leaf: NAME ':' DISTANCE { $$ = new_leaf($1, $3);} // with name and distance
| NAME { $$ = new_leaf($1, 1.0);} // without distance
| ':' DISTANCE { $$ = new_leaf("missingName", $2); } // without name
| { $$ = new_leaf("missingName", 1.0); } // without name nor distance
;
%%
struct tree *new_leaf(const char *name, double length) {
struct tree *rv = malloc(sizeof(struct tree));
rv->subtree = rv->next = NULL;
rv->name = name;
rv->length = length;
return rv;
}
void append_tree(struct tree **list, struct tree *t) {
assert(t->next == NULL); // must not be part of a list yet
while (*list) list = &(*list)->next;
*list = t;
}
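Once the tree has been built this way, the accumulation the question asks about (a node's length plus the length of its "longest" child, recursively) becomes an ordinary walk over the structure. The following is a minimal sketch, assuming the struct tree layout shown above; accumulated_length is a hypothetical helper name, not something the parser provides.

```c
#include <stddef.h>

struct tree {
    struct tree *next;     /* next sibling in the parent's list of subtrees */
    struct tree *subtree;  /* first child, or NULL for a leaf */
    const char *name;
    double length;
};

/* Returns this node's own length plus the largest accumulated length
 * found among its children (0 if the node is a leaf). */
double accumulated_length(const struct tree *t) {
    double longest = 0.0;
    for (const struct tree *child = t->subtree; child != NULL; child = child->next) {
        double len = accumulated_length(child);
        if (len > longest)
            longest = len;
    }
    return t->length + longest;
}
```

For the question's example ((child1Of1:2, child2Of1:9)child1:5, child2:6)root:6; this gives 6 + (5 + 9) = 20 for the root. The parent/child relationship never has to be tracked during parsing, because it is already encoded in the subtree and next pointers.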

Yield a modified token in ANTLR4

I have a syntax like the following
Identifier
: [a-zA-Z0-9_.]+
| '`' Identifier '`'
;
When I match an identifier such as `someone`, I'd like to strip the backticks and yield a different token, i.e. someone.
Of course, I could walk through the final token array, but is it possible to do it during token parsing?
If I understand correctly, given the input (file t.text) :
one `someone`
two `fred`
tree `henry`
you would like that tokens are automatically produced as if the grammar had the lexer rules :
SOMEONE : 'someone' ;
FRED : 'fred' ;
HENRY : 'henry' ;
ID : [a-zA-Z0-9_.]+ ;
But tokens are identified by a type, i.e. an integer, not by the name of the lexer rule. You can change this type with setType() :
grammar Question;
/* Change `someone` to SOMEONE, `fred` to FRED, etc. */
@lexer::members { int next_number = 1001; }
question
@init {System.out.println("Question last update 1117");}
: expr+ EOF
;
expr
: ID BACKTICK_ID
;
ID : [a-zA-Z0-9_.]+ ;
BACKTICK_ID : '`' ID '`' { setType(next_number); next_number+=1; } ;
WS : [ \r\n\t] -> skip ;
Execution :
$ grun Question question -tokens -diagnostics t.text
[@0,0:2='one',<ID>,1:0]
[@1,4:12='`someone`',<1001>,1:4]
[@2,14:16='two',<ID>,2:0]
[@3,18:23='`fred`',<1002>,2:4]
[@4,25:28='tree',<ID>,3:0]
[@5,30:36='`henry`',<1003>,3:5]
[@6,38:37='<EOF>',<EOF>,4:0]
Question last update 1117
line 1:4 mismatched input '`someone`' expecting BACKTICK_ID
line 2:4 mismatched input '`fred`' expecting BACKTICK_ID
line 3:5 mismatched input '`henry`' expecting BACKTICK_ID
The basic types come from the lexer rules :
$ cat Question.tokens
ID=1
BACKTICK_ID=2
WS=3
The others come from setType. Instead of incrementing a number for each token, you could record the names found in a table and, before creating a new type, check the table to see whether the name already exists, so that a repeated name does not receive a different number.
Anyway you can do nothing useful in the parser because parser rules need to know the type number.
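The table idea mentioned above can be sketched as follows. It is written in C purely to show the lookup logic; in the ANTLR grammar the equivalent would be Java code in the lexer's members block. The names type_table, type_for_name, FIRST_DYNAMIC_TYPE and MAX_NAMES are all hypothetical.

```c
#include <string.h>

#define FIRST_DYNAMIC_TYPE 1001  /* same starting number as next_number above */
#define MAX_NAMES 256            /* hypothetical fixed capacity for this sketch */

struct type_table {
    const char *names[MAX_NAMES];
    int count;
};

/* Returns the type number already assigned to name, or assigns the next
 * free number the first time the name is seen. Returns -1 if the table
 * is full. The pointer is stored as-is, so the string must outlive the table. */
int type_for_name(struct type_table *table, const char *name) {
    for (int i = 0; i < table->count; ++i) {
        if (strcmp(table->names[i], name) == 0)
            return FIRST_DYNAMIC_TYPE + i;
    }
    if (table->count == MAX_NAMES)
        return -1;
    table->names[table->count] = name;
    return FIRST_DYNAMIC_TYPE + table->count++;
}
```

With this in place a second occurrence of someone would get 1001 again instead of a fresh number. A linear search is enough for a sketch; a hash map would be the natural choice for a large vocabulary.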
If you have a set of names known in advance, you can list them in a tokens statement :
grammar Question;
/* Change `someone` to SOMEONE, `fred` to FRED, etc. */
@lexer::header {
import java.util.*;
}
tokens { SOMEONE, FRED, HENRY }
@lexer::members {
Map<String,Integer> keywords = new HashMap<String,Integer>() {{
put("someone", QuestionParser.SOMEONE);
put("fred", QuestionParser.FRED);
put("henry", QuestionParser.HENRY);
}};
}
question
@init {System.out.println("Question last update 1746");}
: expr+ EOF
;
expr
: ID SOMEONE
| ID FRED
| ID HENRY
;
ID : [a-zA-Z0-9_.]+ ;
BACKTICK_ID : '`' ID '`'
{ String textb = getText();
String texta = textb.substring(1, textb.length() - 1);
System.out.println("text before=" + textb + ", text after="+ texta);
if ( keywords.containsKey(texta)) {
setType(keywords.get(texta)); // reset token type
setText(texta); // remove backticks
}
}
;
WS : [ \r\n\t] -> skip ;
Execution :
$ grun Question question -tokens -diagnostics t.text
text before=`someone`, text after=someone
text before=`fred`, text after=fred
text before=`henry`, text after=henry
[@0,0:2='one',<ID>,1:0]
[@1,4:12='someone',<4>,1:4]
[@2,14:16='two',<ID>,2:0]
[@3,18:23='fred',<5>,2:4]
[@4,25:28='tree',<ID>,3:0]
[@5,30:36='henry',<6>,3:5]
[@6,38:37='<EOF>',<EOF>,4:0]
Question last update 1746
$ cat Question.tokens
ID=1
BACKTICK_ID=2
WS=3
SOMEONE=4
FRED=5
HENRY=6
As you can see, there are no more errors because the expr rule is happy with well identified tokens. Even if there are no
SOMEONE : 'someone' ;
FRED : 'fred' ;
HENRY : 'henry' ;
only ID and BACKTICK_ID, the types have been defined behind the scene by the tokens statement :
public static final int
ID=1, BACKTICK_ID=2, WS=3, SOMEONE=4, FRED=5, HENRY=6;
I'm afraid that if you want a free list of names, it's not possible because the parser works with types, not the name of lexer rules :
public static class ExprContext extends ParserRuleContext {
public TerminalNode ID() { return getToken(QuestionParser.ID, 0); }
public TerminalNode SOMEONE() { return getToken(QuestionParser.SOMEONE, 0); }
public TerminalNode FRED() { return getToken(QuestionParser.FRED, 0); }
public TerminalNode HENRY() { return getToken(QuestionParser.HENRY, 0); }
...
public final ExprContext expr() throws RecognitionException {
try { ...
setState(17);
case 1:
enterOuterAlt(_localctx, 1);
{
setState(11);
match(ID);
setState(12);
match(SOMEONE);
}
break;
In
match(SOMEONE);
SOMEONE is a constant representing the number 4.
If you don't have a list of known names, emit will not solve your problem because it creates a Token whose most important field is _type :
public Token emit() {
Token t = _factory.create(_tokenFactorySourcePair, _type, _text, _channel, _tokenStartCharIndex, getCharIndex()-1,
_tokenStartLine, _tokenStartCharPositionInLine);
emit(t);
return t;
}

Lex: Gather all text not defined in rules

I'm trying to gather all text that is not defined by a previous rule into a string and prefix it with a formatting string using lex. I'm wondering if there's a standard way of doing this.
For example, say I have the rules:
word1|word2|word3|word4 {printf("%s%s", "<>", yytext);}
[0-9]+ {printf("%s%s", "{}", yytext);}
everything else {printf("%s%s", "[]", yytext);}
And I attempt to lex the string:
word1 this is some other text ; word2 98 foo bar .
I would want this to produce the following when run through the lexer:
<>word1[] this is some other text ; <>word2[] {}98[] foo bar .
I attempted to do this using states, but realize I can't determine when to stop the check, like:
%x OTHER
%%
. {yymore(); BEGIN OTHER;}
<OTHER>.|\n yymore();
<OTHER>how to determine when to end? {printf("%s%s", "[]", yytext); BEGIN INITIAL;}
What is a good way to do this? Is there some way to continue as long as another rule isn't met?
AFAIK, there is no "standard" solution, but a simple one is to keep a bit of context (the prefix last printed) and use that to decide whether or not to print a new prefix. For example, you could use a custom printer like this:
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
enum OutputType { NO_TOKEN = 0, WORD, NUMBER, OTHER };
void print_with_prefix(enum OutputType type, const char* token) {
static enum OutputType prev = NO_TOKEN;
const char* prefix = "";
switch (type) {
case WORD: prefix = "<>"; break;
case NUMBER: prefix = "{}"; break;
case OTHER: if (prev != OTHER) prefix = "[]"; break;
default: assert(false);
}
prev = type;
printf("%s%s", prefix, token);
}
Then you just need to change the calls to printf to invoke print_with_prefix instead (and, as written, to supply an enum value instead of a string).
For the OTHER case, you then don't need to do anything special to accumulate the token. Just
. { print_with_prefix(OTHER, yytext); }
(I'm skating over the handling of whitespace and newlines, but it's just conceptual.)
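To check that the idea produces the output asked for in the question, here is a small hand-written stand-in for the scanner. It is a sketch under assumptions, not lex output: scan and emit_with_prefix are hypothetical names, the result is written to a buffer instead of stdout, and the previous token type is passed explicitly rather than kept in a static variable.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

enum OutputType { NO_TOKEN = 0, WORD, NUMBER, OTHER };

/* Appends prefix + token to out. A run of OTHER tokens gets a single "[]". */
void emit_with_prefix(char *out, size_t outsize, enum OutputType *prev,
                      enum OutputType type, const char *text, size_t len) {
    const char *prefix = "";
    if (type == WORD) prefix = "<>";
    else if (type == NUMBER) prefix = "{}";
    else if (type == OTHER && *prev != OTHER) prefix = "[]";
    *prev = type;
    size_t used = strlen(out);
    snprintf(out + used, outsize - used, "%s%.*s", prefix, (int)len, text);
}

/* Mimics the three rules: keywords, digit runs, then any single character. */
void scan(const char *in, char *out, size_t outsize) {
    static const char *const words[] = { "word1", "word2", "word3", "word4" };
    enum OutputType prev = NO_TOKEN;
    out[0] = '\0';
    while (*in != '\0') {
        size_t len = 0;
        enum OutputType type = OTHER;
        for (size_t i = 0; i < sizeof words / sizeof words[0]; ++i) {
            size_t n = strlen(words[i]);
            if (strncmp(in, words[i], n) == 0) {
                len = n;
                type = WORD;
                break;
            }
        }
        if (len == 0 && isdigit((unsigned char)*in)) {
            while (isdigit((unsigned char)in[len])) ++len;
            type = NUMBER;
        }
        if (len == 0) len = 1;  /* fallback rule: one character of "other" text */
        emit_with_prefix(out, outsize, &prev, type, in, len);
        in += len;
    }
}
```

Scanning the question's sample string with this yields <>word1[] this is some other text ; <>word2[] {}98[] foo bar . which is the requested result. Like the lex rules, it treats a keyword as a keyword even in the middle of another word.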

ANTLR4 change listener during parse

I have an ANTLR4 listener which handles a standard and well-formed grammar, but I am struggling with how to deal with the non-standard implementations. Although all of the variants go through the lexer without problems, the parse stage is a lot trickier.
A traditional way of doing this would be something like
// Header of document
variant = STANDARD;
if (header.indexOf("microsoft") != -1) {
variant = MICROSOFT;
} else if (header.indexOf("google") != -1) {
variant = GOOGLE;
}
...
// Parsing a particular element
if (variant.equals(MICROSOFT)) {
// Microsoft-specific stuff
} else if (variant.equals(GOOGLE)) {
// Google-specific stuff
} else {
// Standard stuff
}
but this quickly becomes unmaintainable. The obvious solution is to have a ParseTreeListener for the standard implementation and then subclass it for each variant, but I don't know which variant it is until I've started the parse.
So how can I either switch from one listener to another part-way through the parse, or restart the parse with a new listener once I know which variant I'm dealing with?
If these variants occur frequently, you might want to consider embedding custom code to handle context sensitive parsing by using predicates (the {...}? construct in the following pseudo grammar):
rule
: { boolean-expression-a }? a-alternative
| { boolean-expression-b }? b-alternative
| /* fall through */ not-a-or-b-alternative
;
Let's say you want to parse a file containing chunks. A chunk consists of a header and a data row. In the header you can set your variant. The data of a normal variant contains 3 NUMBERs, Google's variant contains 2 NUMBERs and Microsoft's variant contains a single NUMBER. An example of such a file would look like this:
header: none
data: 1 2 3
header: google
data: 4 5
header: microsoft
data: 6
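The rule just described (the header selects a variant, and the variant decides how many NUMBERs a data row must contain) can be written down in plain C as a rough model. This is only an illustration of the decision being made, with hypothetical names (variant_from_header, expected_number_count, data_row_ok); it does not use ANTLR.

```c
#include <ctype.h>
#include <stdbool.h>

enum variant { VARIANT_OTHER, VARIANT_GOOGLE, VARIANT_MICROSOFT };

/* Case-insensitive string comparison; returns true when both are equal. */
bool equals_ignore_case(const char *a, const char *b) {
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return false;
        ++a;
        ++b;
    }
    return *a == *b;
}

/* Maps the NAME in a header line to a variant; unknown names are "other". */
enum variant variant_from_header(const char *name) {
    if (equals_ignore_case(name, "google")) return VARIANT_GOOGLE;
    if (equals_ignore_case(name, "microsoft")) return VARIANT_MICROSOFT;
    return VARIANT_OTHER;
}

/* Number of NUMBER tokens a data row has in each variant, as described above. */
int expected_number_count(enum variant v) {
    switch (v) {
    case VARIANT_MICROSOFT: return 1;
    case VARIANT_GOOGLE:    return 2;
    default:                return 3;
    }
}

/* True when a data row with number_count values fits the current variant. */
bool data_row_ok(enum variant v, int number_count) {
    return number_count == expected_number_count(v);
}
```

In a grammar the same condition is attached to an alternative as a predicate, so the state set while parsing the header steers which alternative the parser may choose for the data row.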
And here's a demo of a context sensitive ANTLR v4 grammar able to parse this:
grammar T;
@parser::members {
enum Variant {
GOOGLE,
MICROSOFT,
OTHER;
public static Variant tryValueOf(String name) {
try {
return Variant.valueOf(name.toUpperCase());
}
catch(Exception e) {
return OTHER;
}
}
}
private Variant variant = Variant.OTHER;
}
parse
: chunk+ EOF
;
chunk
: header data
;
header
: K_HEADER COLON NAME {variant = Variant.tryValueOf($NAME.text);}
;
data
: {variant == Variant.MICROSOFT}? K_DATA COLON NUMBER #MicrosoftData
| {variant == Variant.GOOGLE}? K_DATA COLON NUMBER NUMBER #GoogleData
| K_DATA COLON NUMBER NUMBER NUMBER #OtherData
;
K_DATA : 'data';
K_HEADER : 'header';
NAME : [a-zA-Z]+;
NUMBER : [0-9]+;
COLON : ':';
SPACE : [ \t\r\n] -> skip;
Resulting in the following parse (the parse tree image is not reproduced here).

ANTLR Variable Troubles

In short: how do I implement dynamic variables in ANTLR?
I come to you again with a basic ANTLR question.
I have this grammar:
grammar Amethyst;
options {
language = Java;
}
@header {
package org.omer.amethyst.generated;
import java.util.HashMap;
}
@lexer::header {
package org.omer.amethyst.generated;
}
@members {
HashMap memory = new HashMap();
}
begin: expr;
expr: (defun | println)*
;
println:
'println' atom {System.out.println($atom.value);}
;
defun:
'defun' VAR INT {memory.put($VAR.text, Integer.parseInt($INT.text));}
| 'defun' VAR STRING_LITERAL {memory.put($VAR.text, $STRING_LITERAL.text);}
;
atom returns [Object value]:
INT {$value = Integer.parseInt($INT.text);}
| ID
{
Object v = memory.get($ID.text);
if (v != null) $value = v;
else System.err.println("undefined variable " + $ID.text);
}
| STRING_LITERAL
{
String v = (String) memory.get($STRING_LITERAL.text);
if (v != null) $value = String.valueOf(v);
else System.err.println("undefined variable " + $STRING_LITERAL.text);
}
;
INT: '0'..'9'+ ;
STRING_LITERAL: '"' .* '"';
VAR: ('a'..'z'|'A'..'Z')('a'..'z'|'A'..'Z'|'0'..'9')* ;
ID: ('a'..'z'|'A'..'Z'|'0'..'9')+ ;
LETTER: ('a..z'|'A'..'Z')+ ;
WS: (' '|'\t'|'\n'|'\r')+ {skip();} ;
What it does (or should do), so far, is have a built-in "println" function to do exactly what you think it does, and a "defun" rule to define variables.
When "defun" is called on either a string or integer, the value is put into the "memory" HashMap with the first parameter being the variable's name and the second being its value.
When println is called on an atom, it should display the atom's value. The atom can be either a string or integer. It gets its value from memory and returns it. So for example:
defun greeting "Hello world!"
println greeting
But when I run this code, I get this error:
line 3:8 no viable alternative at input 'greeting'
null
NOTE: This output comes when I do:
println "greeting"
Output:
undefined variable "greeting"null
Does anyone know why this is so? Sorry if I'm not being clear, I don't understand most of this.
defun greeting "Hello world!"
println greeting
But when I run this code, I get this error:
line 3:8 no viable alternative at input 'greeting'
Because the input greeting is being tokenized as a VAR, and a VAR is not an atom. So the input defun greeting "Hello world!" is properly matched by the 2nd alternative of the defun rule:
defun
: 'defun' VAR INT // 1st alternative
| 'defun' VAR STRING_LITERAL // 2nd alternative
;
but the input println greeting cannot be matched by the println rule:
println
: 'println' atom
;
You must realize that the lexer does not produce tokens based on what the parser tries to match at a particular time. Both VAR and ID match the input greeting, and because VAR is defined before ID, that input will always be tokenized as a VAR, never as an ID.
What you need to do is remove the ID rule from the lexer, and replace ID with VAR inside your parser rules.
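The way a lexer settles on one rule when several match can be modelled in a few lines of C. This is a simplified model, not ANTLR's actual algorithm: it assumes "longest match wins, and on a tie the rule defined first wins", covers only the INT, VAR and ID rules, and uses hypothetical names (classify, match_int, match_var, match_id).

```c
#include <ctype.h>
#include <stddef.h>

enum token_type { TOK_NONE, TOK_INT, TOK_VAR, TOK_ID };

/* Each matcher returns how many leading characters of s its rule accepts. */
size_t match_int(const char *s) {          /* INT: '0'..'9'+ */
    size_t n = 0;
    while (isdigit((unsigned char)s[n])) ++n;
    return n;
}

size_t match_var(const char *s) {          /* VAR: letter (letter|digit)* */
    if (!isalpha((unsigned char)s[0])) return 0;
    size_t n = 1;
    while (isalnum((unsigned char)s[n])) ++n;
    return n;
}

size_t match_id(const char *s) {           /* ID: (letter|digit)+ */
    size_t n = 0;
    while (isalnum((unsigned char)s[n])) ++n;
    return n;
}

/* Tries the rules in definition order and keeps the longest match.
 * The strict '>' means a later rule never replaces an equally long earlier one. */
enum token_type classify(const char *s, size_t *len) {
    const struct {
        enum token_type type;
        size_t (*match)(const char *);
    } rules[] = {
        { TOK_INT, match_int },
        { TOK_VAR, match_var },
        { TOK_ID,  match_id  },
    };
    enum token_type best = TOK_NONE;
    size_t best_len = 0;
    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; ++i) {
        size_t n = rules[i].match(s);
        if (n > best_len) {
            best = rules[i].type;
            best_len = n;
        }
    }
    if (len != NULL) *len = best_len;
    return best;
}
```

In this model greeting is classified as VAR because VAR and ID match the same eight characters and VAR comes first. ID can only win for text that VAR cannot match at all, such as 9abc.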