Difficulty writing PEG recursive expression grammar with Arpeggio - grammar

My input text might have a simple statement like this:
aircraft
In my language I call this a name which represents a set of instances with various properties.
It yields an instance_set of all aircraft instances in this example.
I can apply a filter in parenthesis to any instance_set:
aircraft(Altitude < ceiling)
It yields another, possibly reduced instance_set.
And since it is an instance set, I can filter it yet again:
aircraft(Altitude < ceiling)(Speed > min_speed)
I foolishly thought I could do something like this in my grammar:
instance_set = expr
expr = source / instance_set
source = name filter?
It parses my first two cases correctly, but chokes on the last one:
aircraft(Altitude < ceiling)(Speed > min_speed)
The error reported being just before the second open paren.
Why doesn't Arpeggio see that there is just a filtered instance_set which is itself a filtered instance set?
I humbly submit my appeal to the peg parsing gods and await insight...

Your first two cases both match source. Once source is matched, it's matched; that's the PEG contract. So the parser isn't going to explore the alternative.
But suppose it did. How could that help? The rule says that if an expr is not a source, then it's an instance_set. But an instance_set is just an expr. In other words, an expr is either a source or it's an expr. Clearly the alternative doesn't get us anywhere.
I'm pretty sure Arpeggio has repetitions, which is what you really want here:
source = name filter*

Related

ANTLR recognize single character

I'm pretty sure this isn't possible, but I want to ask just in case.
I have the common ID token definition:
ID: LETTER (LETTER | DIG)*;
The problem is that in the grammar I need to parse, there are some instructions in which you have a single character as operand, like:
a + 4
but
ab + 4
is not possible.
So I can't write a rule like:
sum: (INT | LETTER) ('+' (INT | LETTER))*
Because the lexer will consider 'a' as an ID, due to the higher priority of ID. (And I can't change that priority because it wouldn't recognize single character IDs then)
So I can only use ID instead of LETTER in that rule. It's ugly because there shouldn't be an ID, just a single letter, and I will have to do a second syntactic analysis to check that.
I know that there's nothing to do about it, since the lexer doesn't understand about context. What I'm thinking that maybe there's already built-in ANTLR4 is some kind of way to check the token's length inside the rule. Something like:
sum: (INT | ID{length=1})...
I would also like to know if there are some kind of "token alias" so I can do:
SINGLE_CHAR is alias of => ID
In order to avoid writing "ID" in the rule, since that can be confusing.
PD: I'm not parsing a simple language like this one, this is just a little example. In reality, an ID could also be a string, there are other tokens which can only be a subset of letters, etc... So I think I will have to do that second analysis anyways after parsing the entry to check that syntactically is legal. I'm just curious if something like this exists.
Checking the size of an identifier is a semantic problem and should hence be handled in the semantic phase, which usually follows the parsing step. Parse your input with the usual ID rule and check in the constructed parse tree the size of the recognized ids (and act accordingly). Don't try to force this kind of decision into your grammar.

Semantic predicates fail but don't go to the next one

I tried to use ANTLR4 to identify a range notation like <1..100>, and here is my attempt:
#parser::members {
def evalRange(self, minnum, maxnum, num):
if minnum <= num <= maxnum:
return True
return False
}
range_1_100 : INT { self.evalRange(1, 100, $INT.int) }? ;
But it does not work for more than one range like:
some_rule : range_1_100 | range_200_300 ;
When I input a number (200), it just stops at the first rule:
200
line 3:0 rule range_1_100 failed predicate: { self.evalRange(1, 100, $INT.int) }?
(top (range_1_100 200))
It is not as I expected. How can I make the token match the next rule (range_200_300)?
Here's an excerpt from the docs (emphasis mine):
Predicates can appear anywhere within a parser rule just like actions can, but only those appearing on the left edge of alternatives can affect prediction (choosing between alternatives).
[...]
ANTLR's general decision-making strategy is to find all viable alternatives and then ignore the alternatives guarded with predicates that currently evaluate to false. (A viable alternative is one that matches the current input.) If more than one viable alternative remains, the parser chooses the alternative specified first in the decision.
Which basically means your predicate must be the first item in the alternation to be taken into account during the prediction phase.
Of course, you won't be able to use $INT as it wasn't matched yet at this point, but you can replace it with something like _input.LA(1) instead (lookahead of one token) - the exact syntax depends on your language target.
As a side note, I'd advise you to not validate the input through the grammar, it's easier and better to perform a separate validation pass after the parse. Let the grammar handle the syntax, not the semantics.

How to tell if an identifier is being assigned or referenced? (FLEX/BISON)

So, I'm writing a language using flex/bison and I'm having difficulty with implementing identifiers, specifically when it comes to knowing when you're looking at an assignment or a reference,
for example:
1) A = 1+2
2) B + C (where B and C have already been assigned values)
Example one I can work out by returning an ID token from flex to bison, and just following a grammar that recognizes that 1+2 is an integer expression, putting A into the symbol table, and setting its value.
examples two and three are more difficult for me because: after going through my lexer, what's being returned in ex.2 to bison is "ID PLUS ID" -> I have a grammar that recognizes arithmetic expressions for numerical values, like INT PLUS INT (which would produce an INT), or DOUBLE MINUS INT (which would produce a DOUBLE). if I have "ID PLUS ID", how do I know what type the return value is?
Here's the best idea that I've come up with so far: When tokenizing, every time an ID comes up, I search for its value and type in the symbol table and switch out the ID token with its respective information; for example: while tokenizing, I come across B, which has a regex that matches it as being an ID. I look in my symbol table and see that it has a value of 51.2 and is a DOUBLE. So instead of returning ID, with a value of B to bison, I'm returning DOUBLE with a value of 51.2
I have two different solutions that contradict each other. Here's why: if I want to assign a value to an ID, I would say to my compiler A = 5. In this situation, if I'm using my previously described solution, What I'm going to get after everything is tokenized might be, INT ASGN INT, or STRING ASGN INT, etc... So, in this case, I would use the former solution, as opposed to the latter.
My question would be: what kind of logical device do I use to help my compiler know which solution to use?
NOTE: I didn't think it necessary to post source code to describe my conundrum, but I will if anyone could use it effectively as a reference to help me understand their input on this topic.
Thank you.
The usual way is to have a yacc/bison rule like:
expr: ID { $$ = lookupId($1); }
where the the lookupId function looks up a symbol in the symbol table and returns its type and value (or type and storage location if you're writing a compiler rather than a strict interpreter). Then, your other expr rules don't need to care whether their operands come from constants or symbols or other expressions:
expr: expr '+' expr { $$ = DoAddition($1, $3); }
The function DoAddition takes the types and values (or locations) for its two operands and either adds them, producing a result, or produces code to do the addition at run time.
If possible redesign your language so that the situation is unambiguous. This is why even Javascript has var.
Otherwise you're going to need to disambiguate via semantic rules, for example that the first use of an identifier is its declaration. I don't see what the problem is with your case (2): just generate the appropriate code. If B and C haven't been used yet, a value-reading use like this should be illegal, but that involves you in control flow analysis if taken to the Nth degree of accuracy, so you might prefer to assume initial values of zero.
In any case you can see that it's fundamentally a language design problem rather than a coding problem.

How to use Regular Expressions inside treePatterns?

I am working with the example about Parse Tree Matching and XPath shown here. More specifically, I was trying to understand how the following code works:
// assume we are parsing Java
ParserRuleContext tree = parser.compilationUnit();
String xpath = "//blockStatement/*"; // get children of blockStatement
String treePattern = "int <Identifier> = <expression>;";
ParseTreePattern p =
parser.compileParseTreePattern(treePattern,
ExprParser.RULE_localVariableDeclarationStatement);
List<ParseTreeMatch> matches = p.findAll(tree, xpath);
System.out.println(matches);
What I wanted to ask is if we can have regular expressions inside the treePattern string?
For example, I want to write a pattern which identifies all the localVariableDeclarations inside a for loop.
I would like to be able to identify the following code:
for (Object o : list) {
int tempVariable=0;
if ( o.id ==12) {
System.out.println(t);
}
}
The way I have written the pattern (which works) to identify this code is as follows:
String pattern3 = " for ( <className1:type> <localName1:Identifier> : <listName1:expression> ) { <localVariables1:localVariableDeclarationStatement> "
+ "if (<parameter1:expression>.<identifier1:Identifier> == <value1:primary> ) <block1:statement> }";
However, if I have more than one local variables, the pattern doesn't match. I tried to add a '*' at the end as it would happen in the grammar file, but I get an
* invalid tag error.
<localVariables1:localVariableDeclarationStatement>*
Of course I can also add a pattern with two localVariableDeclarationStatement statements, but this again means that I have to create many different patterns for each number of local variables that I want to identify:
<localVariables1:localVariableDeclarationStatement> <localVariables2:localVariableDeclarationStatement> and identify the pattern with
At this time, we don't support repeated elements within the patterns. I thought about that but it essentially means making yet another parser generator whereas static patterns like that are fairly easy to match. It's possible to build one of these, as the last version of ANTLR had tree grammars where you could in fact specify the grammatical structure of subtrees. Until we decide what sort of enhancement to the patterns we can make, I suggest you get creative.
In your specific case, find all of the localVariableDeclarations within for loops as you are doing now and then use a small bit of code to walk that list to identify the contiguous sequences (they are all siblings) and the ones terminated by that particular IF pattern. Would that work?

ANTLR reports error and I think it should be able to resolve input with backtracking

I have a simple grammar that works for the most part, but at one place it reports error and I think it shouldn't, because it can be resolved using backtracking.
Here is the portion that is problematic.
command: object message_chain;
object: ID;
message_chain: unary_message_chain keyword_message?
| binary_message_chain keyword_message?
| keyword_message;
unary_message_chain: unary_message+;
binary_message_chain: binary_message+;
unary_message: ID;
binary_message: BINARY_OPERATOR object;
keyword_message: (ID ':' object)+;
This is simplified version, object is more complex (it can be result of other command, raw value and so on, but that part works fine). Problem is in message_chain, in first alternative. For input like obj unary1 unary2 it works fine, but for intput like obj unary1 unary2 keyword1:obj2 is trys to match keyword1 as unary message and fails when it reaches :. I would think that it this situation parser would backtrack and figure that there is : and recognize that that is keyword message.
If I make keyword message non-optional it works fine, but I need keyword message to be optional.
Parser finds keyword message if it is in second alternative (binary_message) and third alternative (just keyword_message). So something like this gives good results: 1 + 2 + 3 Keyword1:Value
What am I missing? Backtracking is set to true in options and it works fine in other cases in the same grammar.
Thanks.
This is not really a case for PEG-style backtracking, because upon failure that returns to decision points in uncompleted derivations only. For input obj unary1 unary2 keyword1:obj2, with a single token lookahead, keyword1 could be consumed by unary_message_chain. The failure may not occur before keyword_message, and next to be tried would be the second alternative of message_chain, i.e. binary_message_chain, thus missing the correct parse.
However as this grammar is LL(2), it should be possible to extend lookahead to avoid consuming keyword1 from within unary_message_chain. Have you tried explicitly setting k=2, without backtracking?