What does it mean when a yacc {code} action is in the middle of a rule?

extdefs:
{$<ttype>$ = NULL_TREE; } extdef
| extdefs {$<ttype>$ = NULL_TREE; } extdef
;
Why is it in the middle?

An action can appear anywhere in a rule. Sometimes it's useful to run code between the symbols of a rule, especially in alternatives like these.
The standard description of the yacc utility says:
Actions can occur anywhere in a rule
(not just at the end); an action can
access values returned by actions to
its left, and in turn the value it
returns can be accessed by actions to
its right. An action appearing in the
middle of a rule shall be equivalent
to replacing the action with a new
non-terminal symbol and adding an
empty rule with that non-terminal
symbol on the left-hand side. The
semantic action associated with the
new rule shall be equivalent to the
original action. The use of actions
within rules might introduce conflicts
that would not otherwise exist.
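The equivalence described in the quote can be sketched as a small rewrite. This is only an illustration in Python, not how any yacc implementation is written: the function name `desugar_mid_rule_actions` and the marker names `mid_rule_1`, `mid_rule_2`, ... are made up for this example.

```python
from itertools import count


def desugar_mid_rule_actions(lhs, rhs):
    """Rewrite one rule so that every action sits at the end of some rule.

    `rhs` is a list of grammar symbols (plain strings) and actions
    (strings starting with "{"). Returns a list of (lhs, rhs, action) rules.
    """
    fresh = count(1)
    rules = []
    new_rhs = []
    final_action = None
    for i, item in enumerate(rhs):
        is_action = item.startswith("{")
        if is_action and i == len(rhs) - 1:
            # A trailing action stays attached to the original rule.
            final_action = item
        elif is_action:
            # Hypothetical marker name; real tools pick their own.
            marker = f"mid_rule_{next(fresh)}"
            # New empty rule whose only job is to run the action.
            rules.append((marker, [], item))
            new_rhs.append(marker)
        else:
            new_rhs.append(item)
    rules.append((lhs, new_rhs, final_action))
    return rules


rules = desugar_mid_rule_actions(
    "extdefs", ["extdefs", "{$<ttype>$ = NULL_TREE; }", "extdef"]
)
for lhs, rhs, action in rules:
    print(lhs, ":", " ".join(rhs) or "/* empty */", action or "")
```

The second alternative of extdefs becomes `extdefs : extdefs mid_rule_1 extdef` plus an empty rule `mid_rule_1` carrying the action. The extra empty rule is also why mid-rule actions can introduce conflicts: the parser has to decide to reduce it before it has seen extdef.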


ANTLR4: parse number as identifier instead as numeric literal

I need to treat an integer as an identifier in some contexts.
The underlying language syntax (unfortunately) allows this.
grammar excerpt:
grammar Alang;
...
NLITERAL : [0-9]+ ;
...
IDENTIFIER : [a-zA-Z0-9_]+ ;
Example code, that has to be dealt with:
/** declaration block **/
Method 465;
...
In the example above, because NLITERAL has to be placed before IDENTIFIER, the lexer tokenizes 465 as NLITERAL.
What is a good way to deal with such situations?
(Ideally, avoiding application code within grammar, to keep it runtime agnostic)
I found similar questions on SO, not exactly helpful though.
There's no good way to make 465 produce either an NLITERAL token or an IDENTIFIER token depending on context (you might be able to use lexer modes, but that's probably not a good fit for your needs).
What you can do rather easily, though, is allow NLITERALs in addition to IDENTIFIERs in certain places. So you could define a parser rule
methodName: IDENTIFIER | NLITERAL;
and then use that rule instead of IDENTIFIER where appropriate.
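To show why this works, here is a minimal hand-written sketch in Python rather than generated ANTLR code. It assumes a lexer that picks the longest match and breaks ties by rule order, as the grammar excerpt implies. The `method_decl` rule and the treatment of `Method` as a plain IDENTIFIER are assumptions made for the example.

```python
import re

# Lexer rules in priority order, mirroring the grammar excerpt:
# NLITERAL comes before IDENTIFIER, so "465" always becomes NLITERAL.
TOKEN_SPEC = [
    ("NLITERAL", re.compile(r"[0-9]+")),
    ("IDENTIFIER", re.compile(r"[a-zA-Z0-9_]+")),
    ("SEMI", re.compile(r";")),
    ("WS", re.compile(r"\s+")),
]


def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        best = None
        for name, pattern in TOKEN_SPEC:
            m = pattern.match(text, pos)
            # Longest match wins; on a tie the earlier rule wins.
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise ValueError(f"unexpected character at offset {pos}")
        name, m = best
        if name != "WS":
            tokens.append((name, m.group()))
        pos = m.end()
    return tokens


def method_name(tokens, i):
    """methodName: IDENTIFIER | NLITERAL;"""
    kind, text = tokens[i]
    if kind in ("IDENTIFIER", "NLITERAL"):
        return text, i + 1
    raise SyntaxError(f"expected a method name, got {kind}")


def method_decl(tokens):
    """Hypothetical rule for the example: 'Method' methodName ';'"""
    if tokens[0] != ("IDENTIFIER", "Method"):
        raise SyntaxError("expected 'Method'")
    name, i = method_name(tokens, 1)
    if tokens[i][0] != "SEMI":
        raise SyntaxError("expected ';'")
    return name


print(tokenize("Method 465;"))
print(method_decl(tokenize("Method 465;")))
```

The lexer still reports 465 as NLITERAL; only the parser rule decides that an NLITERAL is acceptable where a name is expected, so no application code is needed in the grammar.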

antlr2 to antlr4 class specifier, options, TOKENS and more

I need to rewrite a grammar file from antlr2 syntax to antlr4 syntax and have the following questions.
1) Bart Kiers states there is a strict order: grammar, options, tokens, #header, #members in this SO post. This antlr2.org post disagrees stating header is before options. Is there a resource that states the correct order (if one exists) for antlr4?
2) The same antlr2.org post states: "The options section for a grammar, if specified, must immediately follow the ';' of the class specifier:"
class MyParser extends Parser;
options { k=2; }
However, when running with antlr4, any class specifier creates this error:
syntax error: missing COLON at 'MyParser' while matching a rule
3) What happened to options in antlr4? says there are no rule-level options at the time.
warning(83): MyGrammar.g4:4:4: unsupported option k
warning(83): MyGrammar.g4:5:4: unsupported option exportVocab
warning(83): MyGrammar.g4:6:4: unsupported option codeGenMakeSwitchThreshold
warning(83): MyGrammar.g4:7:4: unsupported option codeGenBitsetTestThreshold
warning(83): MyGrammar.g4:8:4: unsupported option defaultErrorHandler
warning(83): MyGrammar.g4:9:4: unsupported option buildAST
i.) does antlr4's adaptive LL(*) parsing algorithm no longer require k token lookahead?
ii.) is there an equivalent in antlr4 for exportVocab?
iii.) are there equivalents in antlr4 for optimizations codeGenMakeSwitchThreshold and codeGenBitsetTestThreshold or have they become obsolete?
iv.) is there an equivalent for defaultErrorHandler ?
v.) I know antlr4 no longer builds AST. I'm still trying to get a grasp of how this will affect what uses the currently generated *Parser.java and *Lexer.java.
4) My current grammar file specifies a TOKENS section
tokens {
ROOT; FOO; BAR; TRUE="true"; FALSE="false"; NULL="null";
}
I changed the double quotes to single quotes and the semi-colons to commas and the equal sign to a colon to try and get rid of each syntax error but have this error:
mismatched input ':' expecting RBRACE
along with others. Rewritten looks like:
tokens {
ROOT; FOO; BAR; TRUE:'true'; FALSE:'false' ...
}
so I removed :'true' and :'false' and TRUE and FALSE will appear in the generated MyGrammar.tokens but I'm not sure if it will function the same as before.
Thanks!
1) Just look at the ultimate source for the syntax: the ANTLR4 grammar. As you can see, the order plays no role in the prequel section (which includes named actions, options and the like; you can even have more than one options section). The only condition is that the prequel section must appear before any rule.
2) The error is about a wrong option. Remove that and the error will go away.
3) Many (actually most) of the old options are no longer needed or supported in ANTLR4.
i.) ANTLR4 uses unlimited lookahead (hence the * in ALL(*)). You cannot specify any other lookahead.
ii.) The exportVocab option is long gone (not even ANTLR3 supports it). It only specifies a name for the .tokens file. Use the default instead.
iii.) Nothing like that is needed or supported anymore. The prediction algorithm has completely changed in ANTLR4.
iv.) You use an error listener instead. There are many examples of how to do that (also here on SO).
v.) Is that a question or just thinking out loud? Hint: ANTLR4-based parsers generate a parse tree.
4) I'm not 100% sure about this one, but I believe you can no longer specify the value a token should match in the tokens section. Instead, this section is only for virtual tokens and everything else must be specified as normal lexer tokens.
To sum up: most of the special options and tricks required for older ANTLR grammars are not needed anymore and must be removed. The new parsing algorithm automatically deals with the ambiguities that former versions had trouble with and needed guidance from the user for.

Semantic predicates fail but don't go to the next one

I tried to use ANTLR4 to identify a range notation like <1..100>, and here is my attempt:
@parser::members {
def evalRange(self, minnum, maxnum, num):
    if minnum <= num <= maxnum:
        return True
    return False
}
range_1_100 : INT { self.evalRange(1, 100, $INT.int) }? ;
But it does not work for more than one range like:
some_rule : range_1_100 | range_200_300 ;
When I input a number (200), it just stops at the first rule:
200
line 3:0 rule range_1_100 failed predicate: { self.evalRange(1, 100, $INT.int) }?
(top (range_1_100 200))
It is not as I expected. How can I make the token match the next rule (range_200_300)?
Here's an excerpt from the docs (emphasis mine):
Predicates can appear anywhere within a parser rule just like actions can, but only those appearing on the left edge of alternatives can affect prediction (choosing between alternatives).
[...]
ANTLR's general decision-making strategy is to find all viable alternatives and then ignore the alternatives guarded with predicates that currently evaluate to false. (A viable alternative is one that matches the current input.) If more than one viable alternative remains, the parser chooses the alternative specified first in the decision.
Which basically means your predicate must be the first item in the alternation to be taken into account during the prediction phase.
Of course, you won't be able to use $INT as it hasn't been matched yet at this point, but you can inspect the lookahead token instead with something like _input.LT(1) (note that _input.LA(1) returns only the token type, not the token itself) - the exact syntax depends on your language target.
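The decision strategy quoted above can be imitated in a few lines of plain Python. This is a simulation, not ANTLR's implementation: `ALTERNATIVES`, `in_range` and `predict` are names invented for the example, and the lookahead token is passed in as plain text.

```python
def in_range(lo, hi):
    """Build a guard predicate that checks lo <= value <= hi."""
    return lambda value: lo <= value <= hi


# Each alternative is guarded by a predicate on its left edge, so the
# predicate is evaluated on the lookahead token before anything is matched.
ALTERNATIVES = [
    ("range_1_100", in_range(1, 100)),
    ("range_200_300", in_range(200, 300)),
]


def predict(lookahead_text):
    """Return the first viable alternative whose guard is true, else None."""
    value = int(lookahead_text)
    for name, guard in ALTERNATIVES:
        if guard(value):
            return name
    return None


print(predict("50"))
print(predict("200"))
print(predict("150"))
```

With the guard on the left edge, 200 is rejected by the first alternative during prediction and the second one is chosen. With the predicate after INT, as in the question, the parser has already committed to range_1_100 by the time the predicate fails.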
As a side note, I'd advise you to not validate the input through the grammar, it's easier and better to perform a separate validation pass after the parse. Let the grammar handle the syntax, not the semantics.
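A separate validation pass could look roughly like the sketch below. It assumes the parse has already produced a list of (line, value) pairs for the matched INT tokens. How you collect them, for example with a listener or a visitor, depends on your target, and `validate_ranges` is a made-up helper.

```python
def validate_ranges(values, allowed):
    """Return one error message per value that falls outside every allowed range.

    `values` is a list of (line, value) pairs gathered after parsing;
    `allowed` is a list of inclusive (lo, hi) ranges.
    """
    errors = []
    for line, value in values:
        if not any(lo <= value <= hi for lo, hi in allowed):
            errors.append(f"line {line}: {value} is outside the allowed ranges")
    return errors


parsed = [(1, 50), (2, 200), (3, 150)]
for message in validate_ranges(parsed, [(1, 100), (200, 300)]):
    print(message)
```

The grammar then only needs `some_rule : INT ;`, and range errors can be reported with messages of your own choosing instead of "failed predicate".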

Custom objects in ParseKit Actions

I am very intrigued by the ability to add actions to ParseKit grammars. There is surprisingly little documentation on what is available in those actions. Say I have two rules like:
databaseName = Word;
createTableStmt ='CREATE' ('TEMP'| 'TEMPORARY')? 'TABLE' 'IF NOT EXISTS'? databaseName;
This obviously isn't a whole grammar but will serve as an example. When parsing, I'd like to "return" a CreateTableStmt object that has certain properties. If I understand the tool correctly, I'd add an action to the rule, do stuff, then push the result onto the assembly, which will carry it around for the next rule to deal with or use.
So for example it would look like:
createTableStmt ='CREATE' ('TEMP'| 'TEMPORARY')? 'TABLE' 'IF NOT EXISTS'? databaseName;
{
AnotherObj* dbName = Pop(); //gives me the top most object
CreateTableStmt* createTable = [[CreateTableStmt alloc] initWith:dbName];
//set if it was temporary
// set 'IF NOT EXISTS'
PUSH(createTable);//push back on stack for next rule to use
}
Then, when everything is parsed, I can just get that root object off the stack and it is a fully instantiated custom representation of the grammar. Somewhat like building an AST, if I remember correctly. I can then work with that representation much more easily than with the passed-in string.
My question is how I can see whether it matched ('TEMP' | 'TEMPORARY') so I can set the value. Are those tokens on the stack? Is there a better way than to pop back to the 'CREATE' and see if we passed it? Should I be popping back to the bottom of the stack anyway on each match?
Also if my rule was instead
qualifiedTableName = (databaseName '.')? tableName (('INDEXED' 'BY' indexName) | ('NOT' 'INDEXED'))?;
Is it correct to assume that the action would not be called until the rule had been matched? So in this case, when the action is called, the stack could look like:
possibly:
|'INDEXED'
|'NOT'
or:
|indexName (A custom object possibly)
|'BY'
|'INDEXED'
|tableName (for sure will be here)
and possibly these
|'.' (if this is here I know the database name must be here) if not push last one on?
|databaseName
--------------(perhaps more things from other rules)
Are these correct assessments? Is there any other documentation on actions? I know it is heavily based on ANTLR, but it's the subtle differences that can really get you in trouble.
Creator of ParseKit here. A few items:
ParseKit deprecation:
Just this week, I have forked ParseKit to a cleaner/smaller/faster library called PEGKit. ParseKit should be considered deprecated, and PEGKit should be used for all new development. Please move to PEGKit.
PEGKit is nearly identical to the grammar and code-gen features of ParseKit, and your ParseKit grammars are usable with PEGKit with a few small changes. In fact, all of the examples in your question here are usable with no changes in PEGKit.
See the Deprecation Notice in the ParseKit README.
And this tutorial on PEGKit.
Syntax errors in your grammar:
I spot 3 syntax errors in your grammar samples above (this applies equally to both ParseKit and PEGKit).
This line:
createTableStmt ='CREATE' ('TEMP'| 'TEMPORARY')? 'TABLE' 'IF NOT EXISTS'? databaseName;
Should be:
createTableStmt ='CREATE' ('TEMP'| 'TEMPORARY')? 'TABLE' ('IF' 'NOT' 'EXISTS')? databaseName;
Notice the breakup of the invalid 'IF NOT EXISTS' construct into individual literal tokens. This is not only necessary, but also desirable so that variable whitespace between the words is allowed.
The POP() macro should be all upper case.
Your createTableStmt rule is missing a semicolon at the very end (after the action's closing }).
Before Answering:
Make sure you are using v0.3.1 PEGKit or later (HEAD of master). I fixed an important bug while finding the answer to your question, and my solutions below require this fix.
Answer to your first question:
My question is how can I see if it matched ('TEMP' | 'TEMPORARY') so I can set the value?
Good question! You basically have the right idea in your further comments above.
Specifically, I would probably break up the createTableStmt rule into 4 rules like this:
createTableStmt = 'CREATE'! tempOpt 'TABLE'! existsOpt databaseName ';'!;
databaseName = QuotedString;
tempOpt
= ('TEMP'! | 'TEMPORARY'!)
| Empty
;
existsOpt
= ('IF'! 'NOT'! 'EXISTS'!)
| Empty
;
Notice all of the vital ! discard directives for discarding unneeded literal tokens.
Also notice that I've changed the last two rules to use | Empty rather than ?. This is so I can add Actions to the Empty alternatives (you'll see that in a sec).
Then you can either add Actions to your grammar, or use ObjC parser delegate callbacks if you prefer to work in pure code.
If you use Actions in your grammar, something like the following will work:
createTableStmt = 'CREATE'! tempOpt 'TABLE'! existsOpt databaseName ';'!
{
    NSString *dbName = POP();
    BOOL ifNotExists = POP_BOOL();
    BOOL isTemp = POP_BOOL();
    NSLog(@"create table: %@, %d, %d", dbName, ifNotExists, isTemp);
    // go to town
    // myCreateTable(dbName, ifNotExists, isTemp);
};
databaseName = QuotedString
{
    // pop the string value of the `PKToken` on the top of the stack
    NSString *dbName = POP_STR();
    // trim quotes
    dbName = [dbName substringWithRange:NSMakeRange(1, [dbName length]-2)];
    // leave it on the stack for later
    PUSH(dbName);
};
tempOpt
    = ('TEMP'! | 'TEMPORARY'!) { PUSH(@YES); }
    | Empty { PUSH(@NO); }
    ;
existsOpt
    = ('IF'! 'NOT'! 'EXISTS'!) { PUSH(@YES); }
    | Empty { PUSH(@NO); }
    ;
I've added this Grammar and a test case to the PEGKit project.
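The stack discipline these actions rely on can be traced with a small Python model. This is not PEGKit code: the `Assembly` class and the function names are stand-ins for the grammar rules above, and each function only mimics what the corresponding action pushes or pops.

```python
class Assembly:
    """Minimal stand-in for the parser's assembly stack."""

    def __init__(self):
        self.stack = []

    def push(self, obj):
        self.stack.append(obj)

    def pop(self):
        return self.stack.pop()


def temp_opt(assembly, matched):
    # Both alternatives push a flag, so the slot is always present.
    assembly.push(matched)


def exists_opt(assembly, matched):
    assembly.push(matched)


def database_name(assembly, quoted):
    # Trim the surrounding quotes and leave the name on the stack.
    assembly.push(quoted[1:-1])


def create_table_stmt(assembly):
    # Pops happen in the reverse order of the pushes.
    db_name = assembly.pop()
    if_not_exists = assembly.pop()
    is_temp = assembly.pop()
    return {"db_name": db_name, "if_not_exists": if_not_exists, "is_temp": is_temp}


# Simulated match of: CREATE TEMP TABLE 'foo';
assembly = Assembly()
temp_opt(assembly, True)
exists_opt(assembly, False)
database_name(assembly, "'foo'")
result = create_table_stmt(assembly)
print(result)
```

Because the Empty alternatives also push a value, the final action can pop a fixed number of items without having to inspect the stack for optional tokens.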
As for your second question, please break it out as a new SO question, and tag it ParseKit and PEGKit and I will get to it ASAP.

Ignore MSGTokenError in JAVACC

I use JAVACC to parse some string defined by a bnf grammar with initial non-terminal G.
I would like to catch errors thrown by TokenMgrError.
In particular, I want to handle the following two cases:
If some prefix of the input satisfies G, but not all of the symbols are read from the input, consider this case as normal and return AST for found prefix by a call to G().
If the input has no prefix satisfying G, return null from G().
Currently I'm getting TokenMgrErrors in each of these cases instead.
I started to modify the generated files (i.e, to change Error to Exception and add appropriate try/catch/throws statements), but I found it to be tedious. In addition, automatic generation of the modified files produced by JAVACC does not work. Is there a smarter way to accomplish this?
You can always eliminate all TokenMgrErrors by including
<*> TOKEN : { <UNEXPECTED: ~[] > }
as the final rule. This pushes all your issues to the grammar level, where you can generally deal with them more easily.
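The effect of such a catch-all token can be sketched in Python. This is an illustration of the idea, not JavaCC output: the toy grammar G = NUMBER (PLUS NUMBER)* and the names `tokenize` and `parse_sum_prefix` are assumptions made for the example.

```python
import re

# The last alternative matches any single character, so the lexer
# never fails; unknown input simply becomes an UNEXPECTED token.
TOKEN_RE = re.compile(
    r"""
    (?P<NUMBER>[0-9]+)
  | (?P<PLUS>\+)
  | (?P<WS>\s+)
  | (?P<UNEXPECTED>.)
    """,
    re.VERBOSE | re.DOTALL,
)


def tokenize(text):
    tokens = []
    for m in TOKEN_RE.finditer(text):
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
    return tokens


def parse_sum_prefix(text):
    """Parse the longest prefix matching NUMBER (PLUS NUMBER)*.

    Returns the sum of that prefix, or None if no prefix matches.
    """
    tokens = tokenize(text)
    if not tokens or tokens[0][0] != "NUMBER":
        return None
    total = int(tokens[0][1])
    i = 1
    while (
        i + 1 < len(tokens)
        and tokens[i][0] == "PLUS"
        and tokens[i + 1][0] == "NUMBER"
    ):
        total += int(tokens[i + 1][1])
        i += 2
    return total


print(tokenize("1 + 2 $ 3"))
print(parse_sum_prefix("1 + 2 $ 3"))
print(parse_sum_prefix("$ 1"))
```

Since the tokenizer never raises, both cases from the question are decided in the parser: an UNEXPECTED token after a valid prefix just ends the match, and an UNEXPECTED token at the start yields None.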