ANTLR: Define new channel in grammar - antlr

I know it is possible to switch between the default and hidden token channels in an ANTLR grammar, but let's say I want a third channel. How can I define a new token channel in the grammar? For instance, let's say I want a channel named ALTERNATIVE.

They're just final ints in the Token class, so you could simply introduce an extra int in your lexer like this:
grammar T;
@lexer::members {
public static final int ALTERNATIVE = HIDDEN + 1;
}
// parser rules ...
FOO
: 'foo' {$channel=ALTERNATIVE;}
;
// other lexer rules ...
A related Q&A: How do I get an Antlr Parser rule to read from both default AND hidden channel

For the C target you can use
//This must be assigned somewhere
@lexer::context {
ANTLR3_UINT32 defaultChannel;
}
TOKEN : 'blah' {$channel=defaultChannel;};
This gets reset after every rule so if you want a channel assignment to persist across rules you may have to override nextTokenStr().

Related

Which rule does the string match?

I'm using the Java syntax defined at https://github.com/antlr/grammars-v4/tree/master/java/java
My users are free to input anything, for example
assert image != null;
,
public Color[][] smooth(Color[][] image, int neighberhoodSize)
{
...
}
,
package myapplication.mylibrary;
, and
import static java.lang.System.out; //'out' is a static field in java.lang.System
import static screen.ColorName.*;
My program should tell which syntax the input matches.
What I have up to now is
var stream = CharStreams.fromString(input);
ITokenSource lexer = new JavaLexer(stream);
ITokenStream tokens = new CommonTokenStream(lexer);
var parser = new JavaParser(tokens); // typed as JavaParser so that statement() is available
parser.ErrorHandler = new BailErrorStrategy();
try
{
var tree = parser.statement();
Console.WriteLine("The input is a statement");
}
catch (Exception e)
{
Console.WriteLine("The input is not a statement");
}
Is there a better way to check whether the input matches any of the 100 rules?
No, there's no other way than trial-and-error. Note that your generated parser has the property:
public static final String[] ruleNames
which you can use in combination with reflection to call all parser rules automatically instead of trying them manually.
Also, trying parser.statement() might not be enough: the input String s = "mu"; FUBAR could be properly parsed by parser.statement() and leave the trailing Identifier (FUBAR) in the token stream. After all, the statement rule probably does not end with an EOF token forcing the parser to consume all tokens. You'll probably have to manually check if all tokens are consumed before determining the input was successfully parsed by a certain parser rule. Also see this Q&A: How to test ANTLR translation without adding EOF to every rule
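The trial-and-error loop over ruleNames can be sketched roughly as follows. This is only an illustration: ToyParser is a hypothetical stand-in written with regular expressions so that the snippet runs without the ANTLR runtime, and its three rule methods are invented for the example. With a real generated parser you would create a fresh lexer and parser for each attempt, install BailErrorStrategy, and additionally check that the next token is EOF after the rule returns (the stand-in gets the same effect by matching the whole string).

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a generated parser (NOT ANTLR's API): each "rule"
// is a public no-arg method that throws when the whole input does not match.
class ToyParser {
    public static final String[] ruleNames = {
        "packageDeclaration", "importDeclaration", "assertStatement"
    };
    private final String input;

    ToyParser(String input) { this.input = input.trim(); }

    public void packageDeclaration() { require(input.matches("package\\s+[\\w.]+;")); }
    public void importDeclaration() { require(input.matches("import\\s+(static\\s+)?[\\w.]+(\\.\\*)?;")); }
    public void assertStatement() { require(input.matches("assert\\s+.+;")); }

    private static void require(boolean ok) {
        if (!ok) throw new IllegalStateException("input does not match this rule");
    }
}

class RuleProbe {
    // Tries every rule named in ruleNames and returns the names that accepted the input.
    static List<String> matchingRules(String input) {
        List<String> matches = new ArrayList<>();
        for (String rule : ToyParser.ruleNames) {
            try {
                Method ruleMethod = ToyParser.class.getMethod(rule);
                ruleMethod.invoke(new ToyParser(input)); // fresh parser per attempt
                matches.add(rule);
            } catch (InvocationTargetException e) {
                // The rule method threw: this rule rejected the input, try the next one.
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("rule method not callable: " + rule, e);
            }
        }
        return matches;
    }
}
```

Calling RuleProbe.matchingRules("package myapplication.mylibrary;") returns a list containing only packageDeclaration, while input that no rule accepts yields an empty list.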
Unless you really mean that your users can enter anything (and I would suspect that, with some thought, that's not really the case), you could add a parser rule that includes alternatives for each construct your users could enter. You might have to take a little care with the order.
Since parser rules are evaluated by recursive descent, if your new rule isn't referenced by any other rules, it would have no impact on the rest of the grammar.
Could be worth a shot.

What scope does ":my $foo" have and what is it used for?

With a regex, token or rule, it's possible to define a variable like so:
token directive {
:my $foo = "in command";
<command> <subject> <value>?
}
There is nothing about it in the language documentation here, and very little in S05 - Regexes and Rules, to quote:
Any grammar regex is really just a kind of method, and you may declare variables in such a routine using a colon followed by any scope declarator parsed by the Perl 6 grammar, including my, our, state, and constant. (As quasi declarators, temp and let are also recognized.) A single statement (up through a terminating semicolon or line-final closing brace) is parsed as normal Perl 6 code:
token prove-nondeterministic-parsing {
:my $threshold = rand;
'maybe' \s+ <it($threshold)>
}
I get that regexen within grammars are very similar to methods in classes; I get that you can start a block anywhere within a rule and if parsing successfully gets to that point, the block will be executed - but I don't understand what on earth this thing is for.
Can someone clearly define what its scope is, explain what need it fulfills, and give the typical use case?
What scope does :my $foo; have?
:my $foo ...; has the lexical scope of the rule/token/regex in which it appears.
(And :my $*foo ...; -- note the extra * signifying a dynamic variable -- has both the lexical and dynamic scope of the rule/token/regex in which it appears.)
What this is used for
Here's what happens without this construct:
regex scope-too-small { # Opening `{` opens a regex lexical scope.
{ my $foo = / bar / } # Block with its own inner lexical scope.
$foo # ERROR: Variable '$foo' is not declared
}
grammar scope-too-large { # Opening `{` opens lexical scope for grammar.
my $foo = / bar / ;
regex r1 { ... } # `$foo` is recognized inside `r1`...
...
regex r999 { ... } # ...but also inside r999
}
So the : ... ; syntax is used to get exactly the desired scope -- neither too broad nor too narrow.
Typical use cases
This feature is typically used in large or complex grammars to avoid lax scoping (which breeds bugs).
For a suitable example of precise lexical-only scoping see the declaration and use of @extra_tweaks in token babble as defined in a current snapshot of Rakudo's Grammar.nqp source code.
P6 supports action objects. These are classes with methods corresponding one-to-one with the rules in a grammar. Whenever a rule matches, it calls its corresponding action method. Dynamic variables provide precisely the right scoping for declaring variables that are scoped to the block (method, rule, etc.) they're declared in both lexically and dynamically -- which latter means they're available in the corresponding action method too. For an example of this, see the declaration of @*nibbles in Rakudo's Grammar module and its use in Rakudo's Actions module.

Serialization of ANTLR ParseTree

I have a generated grammar that does two things:
Check the syntax of a domain specific language
Evaluate input against that domain specific language
These two functions are separate; let's call them validate() and evaluate().
The validate() function builds the tree from a String input while ensuring it meets the requirements of the BNF for the language. The evaluate() function plugs in values to that tree to get a result (usually true or false).
What the code is currently doing is running validate() each time on the input, just to generate the tree that evaluate() uses. Some of the inputs take up to 60 seconds to be checked. What I would LIKE to do is serialize the results of validate() (assuming it meets the syntax requirements), store the serialized form in the backend database, and just load it from the database as part of evaluate().
I noticed that I can execute the method toStringTree() on the parse tree, and retrieve a LISP style tree. However, can I restore a LISP style tree to an ANTLR parse tree? If not, can anyone recommend another way to serialize and store the generated parse tree?
Thanks for any help.
Jason
ANTLR 4's ParserRuleContext data structure (the specific implementation of ParseTree used by generated parsers to represent grammar rules in the parse tree) is not serializable by default. Open issue #233 on the project issue tracker covers the feature request. However, based on my experience with many applications using ANTLR for parsing, I'm not convinced serializing the parse trees would be useful in the long run. For each problem serializing the parse tree is meant to address, a better solution already exists.
Another option is to store a hash of the last known valid file in the database. After you use the parser to create a parse tree, you could skip the validation step if the input file has the same hash as the last time it was validated. This leverages two aspects of ANTLR 4:
For the same input file, running the parser twice will produce the same parse tree.
The ANTLR 4 parser is extremely fast in almost all cases (e.g. the Java grammar can process around 20MB of source per second). The remaining cases tend to be caused by poorly structured grammar rules that the new parser interpreter feature in ANTLRWorks 2.2 can analyze and make suggestions for improvement.
If you need performance beyond what you get with this, then a parse tree isn't the data structure you should be using. StringTemplate 4's enormous performance advantage over StringTemplate 3 came primarily from the fact that the interpreter switched from using ASTs (equivalent to parse trees for this reasoning) to a linear bytecode representation/interpreter. The ASTs for ST4 would never need to be serialized for performance reasons because the bytecode would be serialized instead. In fact, the C# port of StringTemplate 4 provides exactly this feature.
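The hash-based option described above could look roughly like the sketch below. The class name ValidationCache, the in-memory set standing in for the database table, and the Predicate standing in for the expensive validation step are all assumptions made for illustration; only MessageDigest and the other JDK classes are real APIs.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical helper: remembers the hashes of inputs that already passed validation.
class ValidationCache {
    private final Set<String> validHashes = new HashSet<>(); // stand-in for a database table
    private final Predicate<String> validator;               // stand-in for the slow validation step
    int validatorCalls = 0;                                   // exposed only to show cache hits

    ValidationCache(Predicate<String> validator) {
        this.validator = validator;
    }

    // Hex-encoded SHA-256 of the input text.
    static String sha256(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    // Runs the validator only when this exact input has not been validated before.
    boolean isValid(String input) {
        String hash = sha256(input);
        if (validHashes.contains(hash)) {
            return true; // same input as a previous valid run: skip validation
        }
        validatorCalls++;
        boolean ok = validator.test(input);
        if (ok) {
            validHashes.add(hash);
        }
        return ok;
    }
}
```

Only successful validations are cached here, so invalid input is re-checked every time; whether that is the right trade-off depends on how often invalid input is resubmitted.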
If the input data to your grammar is made of several independent blocks, you could try to store the string of each block separately, and run the parsing process again for each block independently, using a ThreadPool for example.
Say for example your input data is a set of method declarations:
int add(int a, int b) {
return a+b;
}
int mul(int a, int b) {
return a*b;
}
...
and the grammar is something like:
methodList : methodDeclaration methodList
|
;
methodDeclaration : // your method declaration rules...
The first run of the parser just collects the text of each method and stores it. The parser starts the process at the methodList rule.
void visitMethodList(MethodListContext ctx) {
if(ctx.methodDeclaration() != null) {
String methodStr = formatParseTree(ctx.methodDeclaration(), " ");
// store methodStr for later parsing
}
// visit next method list item, if any
if(ctx.methodList() != null) {
visit(ctx.methodList());
}
}
The second run launches the parsing of each method declaration (in a separate thread, for example). For this, the parser starts at the methodDeclaration rule.
void visitMethodDeclaration(MethodDeclarationContext ctx) {
// parse the method block
}
The text of a methodDeclaration rule is formatted because calling ctx.methodDeclaration().getText() directly would concatenate the text of all child nodes with no separators (see the ANTLR documentation), possibly making it unusable for parsing again. If white space is a token separator in the grammar, then adding one space between tokens should not change the parse tree.
String formatParseTree(ParseTree tree, String separator) {
StringBuilder builder = new StringBuilder();
for(int i = 0; i < tree.getChildCount(); i ++) {
ParseTree child = tree.getChild(i);
if(child instanceof TerminalNode) {
builder.append(child.getText());
builder.append(separator);
} else if(child instanceof RuleContext) {
builder.append(formatParseTree(child, separator));
}
}
return builder.toString();
}
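The second run over the stored blocks could be driven by a thread pool along the following lines. This is a sketch under assumptions: BlockParsing and parseAll are hypothetical names, and the parseBlock function stands in for whatever creates a lexer and parser for one block and starts at the methodDeclaration rule. Each task should build its own lexer and parser rather than sharing one instance between threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: parses independent blocks in parallel, keeping input order.
class BlockParsing {
    static <T> List<T> parseAll(List<String> blocks, Function<String, T> parseBlock, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per stored block.
            List<Future<T>> futures = new ArrayList<>();
            for (String block : blocks) {
                Callable<T> task = () -> parseBlock.apply(block);
                futures.add(pool.submit(task));
            }
            // Collect results in the same order the blocks were given.
            List<T> results = new ArrayList<>();
            for (Future<T> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            throw new IllegalStateException("interrupted while parsing blocks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a block failed to parse", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

For instance, BlockParsing.parseAll(storedMethods, text -> text.length(), 4) would return the length of each stored block; in practice the function would return a parse tree or an evaluation result instead.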

antlr global rule scope declaration vs @members declaration

Which one would you prefer for declaring a variable, and in which case: a global scope or a @members declaration? It seems to me that they can serve the same purpose.
UPDATE: here is a grammar to explain what I mean.
grammar GlobalVsScope;
scope global{
int i;
}
@lexer::header{package org.inanme.antlr;}
@parser::header{package org.inanme.antlr;}
@parser::members {
int j;
}
start
scope global;
@init{
System.out.println($global::i);
System.out.println(j);
}:R EOF;
R:'which one';
Note that besides global (ANTLR) scopes, you can also have local rule-scopes, like this:
grammar T;
options { backtrack=true; }
parse
scope { String x; }
: 'foo'? ID {$parse::x = "xyz";} rule*
| 'foo' ID
;
rule
: ID {System.out.println("x=" + $parse::x);}
;
The only time I'd consider using local rule-scopes is when there are a lot of predicates, or global backtracking is enabled (resulting in all rules having predicates in front of them). In that case, you could create a member variable String x (or define it in a global scope) and set it in the parse rule, but you might change this instance/scope variable, after which the parser could backtrack, and this backtracking will not cause the global variable to be reset to its original state! The locally scoped variable will also not be "unset", but that is likely to be less of a risk, since it is local to a single rule.
To summarize: yes, you're right, global scopes and member/instance variables are much alike. But I'd sooner opt for members-variables because of the friendlier syntax.

Seeking very simple ANTLR error handling example when generating C code

I want to generate C code. I will not be reading from an input file, one line at a time (as, for instance, a compiler might). Rather, I will be parsing user input as it arrives, one line at a time.
I would prefer to detect and handle bad input in the lexer/parser, e.g.
/* lexer tokens */
foo : "FOO";
bar : "BAR";
baz : "BAZ";
/* grammar*/
grammar : foo "=" BAZ
| foo "=" BAR
| <some non-existent ANTLR-else> : {fprintf(stderr, "bad input\n");}
;
OK, if I can't catch it in the lexer/parser, it seems like I need to use displayRecognitionError() but how??
Can anyone point me at a very simple example which generates C code and shows some error handling of invalid input?
Thanks!
Ok, bounty, yippee!
But only for a real, working answer, with real, working code. No "use method X()" without an example.
What you are most likely looking for is the displayRecognitionError() function. This function is called in the cases that you are interested in, and is part of the C runtime.
If you want to see an example of how to use this function, look at this mailing list post. Although this code mixes C and C++, you should be able to work out what you need from it.
Handling a recognition exception in Java would go like this:
grammar X;
// ...
@rulecatch{
catch(RecognitionException rex) {
// do something
}
}
// parser rules
// lexer rules
In other words, simply add some custom C code inside the @rulecatch{ ... } block.