ANTLR: always return the same number of children

I have the following rule:
statement : TOKEN1 opt1=TOKEN2? opt2=TOKEN3 TOKEN4 -> ^(TOKEN1 $opt1? $opt2);
The AST generated by this rule will have one or two children (depending on
whether opt1 was matched).
I always need a fixed number of children (in this case 2). I know that
this can be achieved by doing the following (UNDEFINED is an imaginary token):
statement : TOKEN1 opt1=TOKEN2 TOKEN4 -> ^(TOKEN1 $opt1 UNDEFINED)
| TOKEN1 opt1=TOKEN2 opt2=TOKEN3 TOKEN4 -> ^(TOKEN1 $opt1 $opt2);
This is fine for just one optional token. The problem arises when I have a higher
number of optional tokens: a lot of rules must be written in order to catch all
possible combinations. How can this issue be solved in an elegant way?
I'm using ANTLR 3.4/C target by the way.
Thanks,
T.

You could do this:
grammar G;
tokens {
CHILD1;
CHILD2;
CHILD3;
}
...
statement
: ROOT t2=TOKEN2? t3=TOKEN3? t4=TOKEN4?
-> ^(ROOT ^(CHILD1 $t2?) ^(CHILD2 $t3?) ^(CHILD3 $t4?))
;
which will cause the AST to always have 3 child nodes (each of which may or may not have a token as a child itself).
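To illustrate why the wrapper nodes help a consumer of the tree, here is a minimal sketch of the resulting shape. It is a plain in-memory model, not the ANTLR C runtime API; `AstNode`, `wrapOptional`, `buildStatement` and `optionText` are hypothetical names introduced only for this example.

```typescript
// Hypothetical in-memory model of the tree shape; this is not the ANTLR C runtime API.
interface AstNode {
  type: string;
  text?: string;
  children: AstNode[];
}

// Wrap an optional token in an imaginary parent so its slot always exists.
function wrapOptional(type: string, tokenText?: string): AstNode {
  const children: AstNode[] =
    tokenText === undefined ? [] : [{ type: "TOKEN", text: tokenText, children: [] }];
  return { type, children };
}

// Mirrors ^(ROOT ^(CHILD1 $t2?) ^(CHILD2 $t3?) ^(CHILD3 $t4?)).
function buildStatement(t2?: string, t3?: string, t4?: string): AstNode {
  return {
    type: "ROOT",
    children: [wrapOptional("CHILD1", t2), wrapOptional("CHILD2", t3), wrapOptional("CHILD3", t4)],
  };
}

// Read an option by position; an empty wrapper means the token was absent.
function optionText(stmt: AstNode, index: number): string | undefined {
  return stmt.children[index]?.children[0]?.text;
}

// Only TOKEN3 is present, yet the root still has three children.
const onlyT3 = buildStatement(undefined, "c");
console.log(onlyT3.children.length, optionText(onlyT3, 0), optionText(onlyT3, 1));
```

A tree walker can then address each option by its fixed index and only needs to check whether the wrapper is empty.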

Related

ANTLR4 Correctly continuing to parse sections after error

I'm trying to write some tooling (validation, possibly autocomplete) for a SQL-esque query language. However, the parser is handling invalid/incomplete inputs in a way that makes them more difficult to work with.
I've reduced my scenario to its simplest reproducible form. Here is my minimized grammar:
grammar SOQL;
WHITE_SPACE : ( ' '|'\r'|'\t'|'\n' ) -> channel(HIDDEN) ;
FROM : 'FROM' ;
SELECT : 'SELECT' ;
/********** SYMBOLS **********/
COMMA : ',' ;
ID: ( 'A'..'Z' | 'a'..'z' | '_' | '$') ( 'A'..'Z' | 'a'..'z' | '_' | '$' | '0'..'9' )* ;
soql_query: select_clause from_clause;
select_clause: SELECT field ( COMMA field )*;
from_clause: FROM table;
field : ID;
table : ID;
When I run the following code (using antlr4ts, but it should be similar to any other port):
const input = 'SELECT ID, Name, Website, Contact, FROM Account'; //invalid trailing ,
let inputStream = new ANTLRInputStream(input);
let lexer = new SOQLLexer(inputStream);
let tokenStream = new CommonTokenStream(lexer);
let parser = new SOQLParser(tokenStream);
let qry = parser.soql_query();
let select = qry.select_clause();
console.log('FIELDS: ', select.field().map(field => field.text));
console.log('FROM: ', qry.from_clause().text);
Console Log
line 1:35 extraneous input 'FROM' expecting ID
line 1:47 mismatched input '<EOF>' expecting 'FROM'
FIELDS: Array(5) ["ID", "Name", "Website", "Contact", "FROMAccount"]
FROM:
I get errors (which is expected), but I was hoping it would still be able to correctly pick out the FROM clause.
It was my understanding that, since FROM is an identifier, it's not a valid field in the select_clause (maybe I'm just misunderstanding)?
Is there some way to set up the grammar or parser so that it will continue on to properly identify the FROM clause in this scenario (and other common work-in-progress query states)?
It was my understanding that, since FROM is an identifier, it's not a valid
field in the select_clause (maybe I'm just misunderstanding)?
All the parser sees is a discrete stream of typed tokens coming from the lexer. The parser has no intrinsic way to tell whether a token is intended to be an identifier or, for that matter, whether it has any particular semantic nature.
In designing a fault-tolerant grammar, plan for the parser to be fairly permissive of syntax errors, and expect to use several tree-walkers to progressively identify and, where possible, resolve the syntactic and semantic ambiguities.
Two ANTLR features are particularly useful to this end:
1) Implement a lexer TokenFactory and a custom token, typically extending CommonToken. The custom token provides a convenient space for flags and logic for identifying the correct syntactic/semantic use and expected context of a particular token instance.
2) Implement a parser error strategy, extending or expanding on the DefaultErrorStrategy. The error strategy allows modest modifications to how the parser operates on the token stream when an attempted match results in a recognition error. If the error cannot be fully resolved and appropriately fixed by examining the surrounding (custom) tokens, at least those same custom tokens can be suitably annotated to ease problem resolution during the subsequent tree-walks.
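The recovery idea behind the second point can be sketched outside ANTLR. The following is a hand-written stand-in for the reduced grammar, not the antlr4ts API; `tokenize`, `parseQuery`, `SoqlToken` and `ParseResult` are hypothetical names. It assumes one particular recovery policy: a keyword token such as FROM is never deleted to reach the next ID, so a dangling comma is reported and parsing resumes at the FROM clause. In ANTLR, a comparable policy would presumably live in a class extending DefaultErrorStrategy.

```typescript
type SoqlTokenType = "SELECT" | "FROM" | "COMMA" | "ID";
interface SoqlToken { type: SoqlTokenType; text: string; }
interface ParseResult { fields: string[]; table: string | undefined; errors: string[]; }

// Minimal lexer for the reduced grammar; whitespace and anything else is skipped.
function tokenize(input: string): SoqlToken[] {
  const tokens: SoqlToken[] = [];
  const pattern = /,|[A-Za-z_$][A-Za-z_$0-9]*/g;
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(input)) !== null) {
    const text = m[0];
    const type: SoqlTokenType =
      text === "," ? "COMMA" : text === "SELECT" ? "SELECT" : text === "FROM" ? "FROM" : "ID";
    tokens.push({ type, text });
  }
  return tokens;
}

function parseQuery(input: string): ParseResult {
  const tokens = tokenize(input);
  const errors: string[] = [];
  const fields: string[] = [];
  let pos = 0;
  const peek = (): SoqlToken | undefined => tokens[pos];

  // select_clause: SELECT field ( COMMA field )*
  if (peek()?.type === "SELECT") pos++;
  else errors.push("expected SELECT");

  let needField = true;
  while (true) {
    const tok = peek();
    if (tok === undefined || tok.type === "FROM") break; // FROM ends the field list
    if (tok.type === "ID" && needField) {
      fields.push(tok.text);
      needField = false;
    } else if (tok.type === "COMMA" && !needField) {
      needField = true;
    } else {
      errors.push(`unexpected '${tok.text}' in field list`);
    }
    pos++;
  }
  // Recovery point: a dangling comma is reported here instead of deleting FROM.
  if (needField) errors.push("expected field before FROM");

  // from_clause: FROM table
  let table: string | undefined;
  if (peek()?.type === "FROM") {
    pos++;
    const tok = peek();
    if (tok !== undefined && tok.type === "ID") {
      table = tok.text;
      pos++;
    } else {
      errors.push("expected table name");
    }
  } else {
    errors.push("expected FROM");
  }
  return { fields, table, errors };
}

const recovered = parseQuery("SELECT ID, Name, Website, Contact, FROM Account");
console.log("FIELDS:", recovered.fields);
console.log("FROM:", recovered.table);
console.log("ERRORS:", recovered.errors);
```

With this policy the trailing comma still produces an error, but the four fields and the table are kept apart, which is the outcome the question was hoping for.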

Antlr3 warning: Decision can match multiple alternatives but I don't see how

Here is the warning message in question:
BB_LLVM2AST.g:120:15: Decision can match input such as "'a'..'z'" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
And here is the rule:
fragment
IDENTIFIER
:
((LOWERCHARS)+ (('0'..'9')+)? PERIOD?)+
| ('0'..'9')+
;
here are the other rules called:
fragment
LOWERCHARS
:
('a'..'z')
;
fragment
PERIOD
:
'.'
;
So, I tried using syntactic predicates, but still have the same warning message.
fragment
IDENTIFIER
:
(LOWERCHARS) => ((LOWERCHARS)+ (INT)? PERIOD?)+
|(INT) => INT
;
I took out an INT fragment that was there in a blind attempt to get rid of the warning. I don't understand how input such as "'a'..'z'" could possibly be matched using an alternative that is just ('0'..'9')+. Also, what can I do to get rid of this warning? I hate warning messages.
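One plausible reading of the warning, offered as an assumption rather than a confirmed diagnosis: the decision being reported is not the top-level choice between the letter alternative and the digit alternative, but the inner (LOWERCHARS)+ loop. Because the digits and the PERIOD are both optional, a letter can either continue the inner loop or end it and start a new iteration of the outer (...)+ loop. The sketch below counts how many ways the first alternative can derive a string; `countDerivations` and its helpers are hypothetical names, and the function is an illustration of the rule's shape, not of ANTLR's analysis.

```typescript
const isLowerChar = (c: string): boolean => c >= "a" && c <= "z";
const isDigitChar = (c: string): boolean => c >= "0" && c <= "9";

// Counts the distinct ways to match s as ((LOWERCHARS)+ (('0'..'9')+)? PERIOD?)+.
// Exponential in the worst case; intended for short illustrative inputs only.
function countDerivations(s: string): number {
  // Number of ways to match s.slice(i) as one or more groups.
  function groupsFrom(i: number): number {
    let total = 0;
    // (LOWERCHARS)+ : try every non-empty prefix of the letter run starting at i.
    for (let j = i; j < s.length && isLowerChar(s.charAt(j)); j++) {
      // Positions reachable after the optional digit run (skipped, or any non-empty prefix).
      const afterDigits: number[] = [j + 1];
      for (let k = j + 1; k < s.length && isDigitChar(s.charAt(k)); k++) {
        afterDigits.push(k + 1);
      }
      for (const d of afterDigits) {
        // PERIOD? : either skip it or consume it when present.
        const afterPeriod: number[] = [d];
        if (d < s.length && s.charAt(d) === ".") afterPeriod.push(d + 1);
        for (const p of afterPeriod) {
          // Either the whole string is consumed, or another group must follow.
          total += p === s.length ? 1 : groupsFrom(p);
        }
      }
    }
    return total;
  }
  return s.length === 0 ? 0 : groupsFrom(0);
}

console.log(countDerivations("a"));    // 1
console.log(countDerivations("ab"));   // 2: one group "ab", or two groups "a" and "b"
console.log(countDerivations("a1.b")); // 1
```

If this reading is right, the predicates do not help because both paths start with the same letter, and rewriting the rule so that each letter can be consumed in only one way (that is, without a + loop nested directly inside another + loop whose remaining elements are all optional) would be the direction to try.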

Sequence of rules

My application is not working correctly with my input parameters.
I have 2 rules in the urlManager config:
'<controller:\w+>/<action:\w+>/<factor:\w+>/<id:\d+>'=>'<controller>/<action>',
'<controller:\w+>/<action:\w+>/<factor:\w+>/<ids:((id\d+)|\d)+>'=>'<controller>/<action>'
In my action I try 2 inputs: id12id78 and 87 (any number).
With the first input, the action gets id12id78, but if I try the second input, my $ids parameter is empty.
How can I fix this bug?
Well, nothing strange:
id12id78: the second rule will be applied: $ids => id12id78
87: the first rule will be applied: $id => 87
I don't think you need two different params here; you should use only id, e.g.:
'<controller:\w+>/<action:\w+>/<factor:\w+>/id<id:\d+>'=>'<controller>/<action>',
'<controller:\w+>/<action:\w+>/<factor:\w+>/<id:\d+>'=>'<controller>/<action>',
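The first-match-wins behaviour described above can be reproduced with two plain regular expressions. This is only a sketch of the matching order, assuming the rules are tried in declaration order as the answer describes; it is not Yii's urlManager implementation, and `UrlRule`, `urlRules`, `matchRoute` and the sample paths are hypothetical.

```typescript
interface UrlRule { param: string; pattern: RegExp; }

// Stand-ins for the two original rules, in the order they were declared.
const urlRules: UrlRule[] = [
  { param: "id", pattern: /^\w+\/\w+\/\w+\/(\d+)$/ },
  { param: "ids", pattern: /^\w+\/\w+\/\w+\/((?:(?:id\d+)|\d)+)$/ },
];

// Return the parameter name and value captured by the first rule that matches.
function matchRoute(path: string): { param: string; value: string } | undefined {
  for (const rule of urlRules) {
    const m = rule.pattern.exec(path);
    if (m !== null) return { param: rule.param, value: m[1] };
  }
  return undefined;
}

console.log(matchRoute("site/view/x/87"));       // captured as id, so ids stays empty
console.log(matchRoute("site/view/x/id12id78")); // only the second rule matches: ids
```

A purely numeric value satisfies both patterns, so the earlier rule always claims it; using a single parameter name, as suggested above, avoids depending on which rule matched.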

multiple return values in ANTLR

I use ANTLR4 with Java and I would like to store the values that a rule returns while it parses the input. I use a grammar like this:
db : 'DB' '(' 'ID' '=' ID ',' query* ')'
{
System.out.println("creating db");
System.out.println("Number of queries -> "+$query.qs.size());
}
;
query returns [ArrayList<Query> qs]
@init
{
$qs = new ArrayList<Query>();
}
: 'QUERY' '(' 'ID' '=' ID ',' smth ')'
{
System.out.println("creating query with id "+$ID.text);
Query query = new Query();
query.setId($ID.text);
$qs.add(query);
}
;
but what happens is that the number of queries printed (the size of $query.qs) is always one. This happens because each time a QUERY element is recognized in the input, a new ArrayList is instantiated and that query is added to it, so every invocation of the query rule returns its own one-element list. When all the queries have been recognized, the action for the db rule is invoked, but $query.qs holds only the last query. I solved this problem by maintaining global ArrayLists that store the queries. But is there another way to do it with ANTLR, using what the rules return, without having my own global ArrayLists?
Many thanks in advance,
Dimos.
Well, the problem is resolved. I just added the ArrayList to the db rule like this:
db [ArrayList queries] : 'DB' ....
and then at query rule:
$db::queries.add(query)
So, everything is fine!
Thanks for looking, anyway!
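The difference between the two versions can be modelled without ANTLR. The sketch below treats each rule as a plain function; all names are hypothetical, and it only illustrates list ownership, not the code ANTLR generates.

```typescript
interface QueryRecord { id: string; }

// Mirrors the original grammar: every invocation runs its own init action,
// so each call returns a fresh one-element list.
function queryRuleOwnList(id: string): QueryRecord[] {
  const qs: QueryRecord[] = [];
  qs.push({ id });
  return qs;
}

// Mirrors the fix: the enclosing rule owns the list and each query appends to it.
function queryRuleSharedList(id: string, queries: QueryRecord[]): void {
  queries.push({ id });
}

// The db rule only sees the result of the last query invocation.
function dbWithOwnLists(ids: string[]): number {
  let last: QueryRecord[] = [];
  for (const id of ids) last = queryRuleOwnList(id);
  return last.length;
}

// The db rule creates one list up front and every query adds to it.
function dbWithSharedList(ids: string[]): number {
  const queries: QueryRecord[] = [];
  for (const id of ids) queryRuleSharedList(id, queries);
  return queries.length;
}

console.log(dbWithOwnLists(["q1", "q2", "q3"]));   // 1
console.log(dbWithSharedList(["q1", "q2", "q3"])); // 3
```

The shared-list version corresponds to moving the ArrayList up to the db rule, which is what the fix above does.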

ANTLR ambiguous reference - how to get output?

So I have a rule for statement which can lead to more statements:
statement returns[String txt]
: '{'{
$txt="{";
}
(statement{
$txt+=$statement.txt;
})*
'}'{
$txt+="}";
}
| ... //more rules // ...
;
I am getting
reference $statement is ambiguous; rule statement is enclosing rule and referenced in the production (assuming enclosing rule)
but don't know how to resolve it. Somehow I would need to tell ANTLR that I need the return txt of statement inside parent statement. Please help me out :)
If you use $statement, ANTLR doesn't know if you mean the rule itself, or the statement inside ( ... )*.
Try something like this:
statement returns[String txt]
: '{'{
$txt="{";
}
(s=statement{
$txt+=$s.txt;
})*
'}'{
$txt+="}";
}
| ...
;