Why does the order in this recursive rule change the result? - grammar

Can anyone tell me the difference between the following two rules (notice the order)?
The first, which doesn't work:
without => "[" "]" without | "[" "]"
with => "[" INDEX "]" with | "[" INDEX "]"
array => ID with | ID without | ID with without
The second, which seemingly works:
without => without "[" "]"| "[" "]"
with => with "[" INDEX "]" | "[" INDEX "]"
array => ID with | ID without | ID with without
I am trying to achieve the syntax of an n-dimensional array with a size, like C# arrays. So the following syntax should work: arr[], arr[1], arr[1][], arr[1][1], arr[][], but not ones like arr[][1].

I'm assuming that by "doesn't work", you mean that bison reports a shift/reduce conflict. If you go ahead and use the generated parser anyway, then it will not parse correctly in many cases, because the conflict is real and cannot be resolved by any static rule.
The issue is simple. Remember that a LALR(1) bottom-up parser like the one generated by bison performs every reduction exactly at the end of the right-hand side, taking into account only the next token (the "lookahead token"). So it must know which production to use at the moment the production is completely read. (That gives it a lot more latitude than a top-down parser, which needs to know which production it will use at the beginning of the production. But it's still not always enough.)
The problematic case is the production ID with without. Here, whatever input matches with needs to be reduced to a single non-terminal with before the parser continues with without. To get to this point, the parser must have passed over some number of '[' INDEX ']' dimensions, and the lookahead token must be [, regardless of whether the next dimension has a definite size or not.
If the with rule is right-recursive:
with: '[' INDEX ']' with
| '[' INDEX ']'
then the parser is really stuck. If what follows has a definite dimension, it needs to continue trying the first production, which means shifting the [. If what follows has no INDEX, it needs to reduce the second production, which will trigger a chain of reductions leading back to the beginning of the list of dimensions.
On the other hand, with a left recursive rule:
with: with '[' INDEX ']'
| '[' INDEX ']'
the parser has no problem at all, because each with is reduced as soon as the ] is seen. That means that the parser doesn't have to know what follows in order to decide to reduce. It decides between the two rules based on the past, not the future: the first dimension in the array uses the second production, and the remaining ones (which follow a with) use the first one.
That's not to say that left-recursion is always the answer, although it often is. As can be seen in this case, right-recursion of a list means that individual list elements pile up on the parser stack until the list is eventually terminated, while left-recursion allows the reductions to happen immediately, so that the parser stack doesn't need to grow. So if you have a choice, you should generally prefer left-recursion.
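To make the stack behaviour concrete, here is a toy Python simulation. It is not bison and does not build parser tables; it only mimics, by hand, the shifts and reductions for a list of n identical items under each style of rule and records the deepest stack seen. The function names and the ITEM/list symbols are made up for illustration.

```python
def max_stack_left(n):
    """Deepest stack while parsing n items with  list: list ITEM | ITEM."""
    stack, deepest = [], 0
    for _ in range(n):
        stack.append("ITEM")                 # shift one item
        deepest = max(deepest, len(stack))
        if stack == ["list", "ITEM"]:
            stack[-2:] = ["list"]            # reduce  list: list ITEM
        else:
            stack[-1:] = ["list"]            # reduce  list: ITEM
    return deepest


def max_stack_right(n):
    """Deepest stack while parsing n items with  list: ITEM list | ITEM."""
    stack, deepest = [], 0
    for _ in range(n):
        stack.append("ITEM")                 # shift; nothing can be reduced yet
        deepest = max(deepest, len(stack))
    stack[-1:] = ["list"]                    # end of input: reduce  list: ITEM
    while len(stack) > 1:
        stack[-2:] = ["list"]                # then unwind  list: ITEM list
    return deepest
```

With the left-recursive rule the stack never holds more than two symbols, while with the right-recursive rule it holds all n items before the first reduction happens.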
But sometimes right-recursion can be convenient, particularly in syntaxes like this where the end of the list is different from the beginning. Another way of writing the grammar could be:
array : ID dims
dims : without
| '[' INDEX ']'
| '[' INDEX ']' dims
without: '[' ']'
| '[' ']' without
Here, the grammar only accepts empty dimensions at the end of the list because of the structure of dims. But to achieve that effect, dims must be right-recursive, since it is the end of the list which has the expanded syntax.
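As a quick sanity check on which inputs belong to the intended language, the same shape (sized dimensions first, empty ones only at the end, at least one dimension overall) can be written as a regular expression. This Python sketch assumes that ID is a C-style identifier and INDEX is a run of decimal digits; neither is stated in the question, and the name is_array is made up.

```python
import re

# ID, then either  [n]...[n] followed by optional []...[]  or just  []...[]
ARRAY = re.compile(r"[A-Za-z_]\w*(?:(?:\[\d+\])+(?:\[\])*|(?:\[\])+)")


def is_array(text):
    """True if text has sized dimensions only before the empty ones."""
    return ARRAY.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ["arr[]", "arr[1]", "arr[1][]", "arr[1][1]", "arr[][]", "arr[][1]"]:
        print(sample, is_array(sample))
```

This is only a membership test for experimenting with examples; it says nothing about how the bison grammar resolves its reductions.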

Related

Construct CFG from {w element of {a, b}* : 2#a(w)=3#b(w)}

If I have the following language: { x is an element of {a,b}* where 2#a(x)=3#b(x) }, then the CFG of that language is:
S=>SaSaSaSbSbS |SaSaSbSaSbS|SaSaSbSbSaS | SaSbSaSaSbS| SaSbSaSbSaS | SaSbSbSaSaS |SbSaSaSaSbS |SbSaSaSbSaS |SbSaSbSaSaS |SbSbSaSaSaS | epsilon/lambda
Is this correct? If it isn't correct, or if there is a simpler form, can you tell me what it is? I have no clue about any form other than that.
At a glance it looks like this probably works:
your base case is good; the empty string is in the language
you cover all your inductive cases: every production adds exactly 3 a's and 2 b's (which preserves 2#a = 3#b), and you cover all arrangements of those five symbols
I am not seeing a fundamentally simpler solution than this, although you might be able to remove either the leading or the trailing S from the right-hand side of all productions; then, by choosing a production you'd be committing to that first or last terminal symbol, but I think that still works out. Possibly even removing both leading and trailing S so you commit to both the first and the last. Any other simplification seems like it would increase the number of productions or the number of nonterminals, or both, which while possibly reducing the total number of symbols needed to encode the grammar, arguably doesn't make the grammar any simpler (indeed, more nonterminals and productions is typically seen as more complicated, not less). If you want to experiment with adding productions or nonterminals, consider e.g. T => Sa and R => Sb, just to cut down on repetition.
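One way to gain some confidence in the grammar is a brute-force comparison on short strings. The Python sketch below is not a proof; it enumerates everything the productions can derive up to a length bound and compares that with every string over {a, b} satisfying 2#a = 3#b. The names PATTERNS, generated and expected are made up for this example.

```python
from itertools import permutations, product

# Right-hand sides of S with the S's removed: all arrangements of aaabb.
PATTERNS = sorted({"".join(p) for p in permutations("aaabb")})


def generated(max_len):
    """All strings of length <= max_len derivable from S (fixed-point iteration)."""
    lang = {""}                                   # S => epsilon
    while True:
        found = set(lang)
        for pattern in PATTERNS:
            partial = set(lang)                   # the leading S
            for terminal in pattern:              # then: terminal, S, terminal, S, ...
                partial = {p + terminal + s
                           for p in partial for s in lang
                           if len(p) + 1 + len(s) <= max_len}
            found |= partial
        if found == lang:                         # nothing new: done
            return lang
        lang = found


def expected(max_len):
    """All strings over {a, b} of length <= max_len with 2#a = 3#b."""
    return {"".join(t)
            for n in range(max_len + 1)
            for t in product("ab", repeat=n)
            if 2 * t.count("a") == 3 * t.count("b")}


if __name__ == "__main__":
    print(generated(10) == expected(10))
```

The same two functions can be used to try the suggested simplifications, for example by dropping the leading S (start partial from {""} instead of lang) and checking whether the sets still agree.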

ANTLR4 predicates with greedy * quantifier: avoid unnecessary predicate calls (lexing)

The following lexer grammar snippet is supposed to tokenize 'custom names' depending on a predicate that is defined in a class LexerHelper:
fragment NUMERICAL : [0-9];
fragment XML_NameStartChar
: [:a-zA-Z]
| '\u2070'..'\u218F'
| '\u2C00'..'\u2FEF'
| '\u3001'..'\uD7FF'
| '\uF900'..'\uFDCF'
| '\uFDF0'..'\uFFFD'
;
fragment XML_NameChar : XML_NameStartChar
| '-' | '_' | '.' | NUMERICAL
| '\u00B7'
| '\u0300'..'\u036F'
| '\u203F'..'\u2040'
;
fragment XML_NAME_FRAG : XML_NameStartChar XML_NameChar*;
CUSTOM_NAME : XML_NAME_FRAG ':' XML_NAME_FRAG {LexerHelper.myPredicate(getText())}?;
The correct match for CUSTOM_NAME is always the longest possible match. Now if the lexer encounters a custom name such as some:cname then I would like it to lex the entire string some:cname and then call the predicate once with 'some:cname' as argument.
Instead, the lexer calls the predicate with each possible 'valid' match it finds along the way, so some:c, some:cn, some:cna, some:cnam until finally some:cname.
Is there a way to change the behaviour to force antlr4 to first find the longest possible match, before calling the predicate? Alternatively, is there an efficient way for the predicate to determine that the match is not the longest one yet to simply return with false in that case?
EDIT: The funny thing about this behavior is that as long as only partial matches are passed to the predicate, the result of the predicate seems to be completely ignored by the lexer anyway. This seems oddly inefficient.
As it turns out, the behavior is known and permitted by Antlr. Antlr may or may not call predicates more than necessary (see here for more details). To avoid that behavior I am now using actions instead, which only get executed once the rule has completely and successfully matched. This allows me to e.g. switch modes in an action.
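The idea of "match the longest candidate first, then validate it once" can be illustrated outside ANTLR. The following Python sketch is not ANTLR code and does not use its API; it approximates the name fragments with ASCII-only character classes (the Unicode ranges from the grammar are left out), and lex_custom_name and predicate are hypothetical names.

```python
import re

# ASCII-only approximation of XML_NAME_FRAG (an assumption for brevity).
NAME = r"[:A-Za-z][:A-Za-z0-9_.-]*"
CUSTOM_NAME = re.compile(NAME + ":" + NAME)


def lex_custom_name(text, pos, predicate):
    """Find the longest CUSTOM_NAME at pos, then call predicate exactly once."""
    match = CUSTOM_NAME.match(text, pos)
    if match is None:
        return None
    candidate = match.group()
    return candidate if predicate(candidate) else None


if __name__ == "__main__":
    seen = []

    def predicate(name):
        seen.append(name)          # record every call
        return True

    print(lex_custom_name("some:cname = 1", 0, predicate), seen)
```

Here the check runs after the match is complete, which is the same ordering that an action at the end of the rule gives you.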

How to define Pascal variables in PetitParser

Here is the (simplified) EBNF section I'm trying to implement in PetitParser:
variable :: component / identifier
component :: indexed / field
indexed :: variable , $[ , blah , $]
field :: variable , $. , identifier
What I did was to add all these productions (except identifier) as ivars of my subclass of PPCompositeParser and define the corresponding methods as follows:
variable
^component / self identifier
component
^indexed / field
identifier
^(#letter asParser, (#word asParser) star) flatten
indexed
^variable , $[ asParser, #digit asParser, $] asParser
field
^variable , $. asParser, self identifier
start
^variable
Finally, I created a new instance of my parser and sent to it the message parse: 'a.b[0]'.
The problem: I get a stack overflow.
The grammar has a left recursion: variable -> component -> indexed -> variable. PetitParser uses Parsing Expression Grammars (PEGs) that cannot handle left recursion. A PEG parser always takes the left option until it finds a match. In this case it will not find a match due to the left recursion. To make it work you need to first eliminate left recursion. Eliminating all left recursion could be more tricky as you will also get one through field after eliminating the first. For example, you can write the grammar as follows to make the left recursion more obvious:
variable = (variable , $[ , blah , $]) | (variable , $. , identifier) | identifier
If you have a left recursion like:
A -> A a | b
you can eliminate it like (e is an empty parser)
A -> b A'
A' -> a A' | e
You'll need to apply this twice to get rid of the recursion.
Alternatively you can choose to simplify the grammar if you do not want to parse all possible combinations of identifiers.
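The A -> A a | b rewrite above is mechanical, so it can be sketched as a small function. This Python version handles only direct left recursion in a single rule, represents each alternative as a list of symbol names, and uses an empty list for the empty parser e; the representation and the name eliminate_left_recursion are assumptions made for this example.

```python
def eliminate_left_recursion(name, alternatives):
    """Rewrite  A -> A x | y  as  A -> y A'  and  A' -> x A' | e."""
    # Alternatives that start with the rule itself, minus that leading symbol.
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == name]
    # Everything else is a base case.
    base = [alt for alt in alternatives if not alt or alt[0] != name]
    if not recursive:
        return {name: alternatives}
    tail = name + "'"
    return {
        name: [alt + [tail] for alt in base],
        tail: [alt + [tail] for alt in recursive] + [[]],   # [] is e
    }


if __name__ == "__main__":
    rules = eliminate_left_recursion("variable", [
        ["variable", "[", "blah", "]"],
        ["variable", ".", "identifier"],
        ["identifier"],
    ])
    for lhs, alts in rules.items():
        print(lhs, "->", " | ".join(" ".join(a) or "e" for a in alts))
```

Applied to the flattened variable rule, both left-recursive alternatives are handled in one pass because they have been written as alternatives of the same rule.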
The problem is that your grammar is left recursive. PetitParser uses a top-down greedy algorithm to parse the input string. If you follow the steps, you'll see that it goes from start to variable -> component -> indexed -> variable. This becomes a loop that gets executed infinitely without consuming any input, and is the reason for the stack overflow (that is the left-recursiveness in practice).
The trick to solve the situation is to rewrite the parser by adding intermediate steps to avoid left-recursing. The basic idea is that the rewritten version will consume at least one character in each cycle. Let's start by simplifying the parser a bit, factoring out the non-recursive parts of indexed and field and moving them to the bottom.
variable
^component / self identifier
component
^indexed / field
indexed
^variable, subscript
field
^variable, fieldName
start
^variable
subscript
^$[ asParser, #digit asParser, $] asParser
fieldName
^$. asParser, self identifier
identifier
^(#letter asParser, (#word asParser) star) flatten
Now you can more easily see (by following the loop) that if the recursion in variable is to end, an identifier has to be found at the beginning. That's the only way to start; after it, either more input follows or the input ends. Let's call that second part variable':
variable
^self identifier, variable'
Now variable' actually refers to something with the identifier already consumed, and we can safely move the recursion from the left of indexed and field to the right in variable':
variable'
^(component', variable') / nil asParser
component'
^indexed' / field'
indexed'
^subscript
field'
^fieldName
I've written this answer without actually testing the code, but it should be okish. The parser can be further simplified; I leave that as an exercise ;).
For more information on left-recursion elimination you can have a look at left recursion elimination
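For readers who want to see the rewritten grammar run, here is a hand-written recognizer in Python rather than PetitParser. It follows the same shape (an identifier first, then any number of subscripts or field names, with the variable' recursion written as a loop) and rebuilds the left-nested structure as it goes. The tuple-based tree and the name parse_variable are assumptions made for this sketch, and identifiers are restricted to ASCII letters and digits.

```python
import re

IDENT = re.compile(r"[A-Za-z][A-Za-z0-9]*")    # letter followed by word characters


def parse_variable(text):
    """variable -> identifier variable' ;  variable' -> (subscript | fieldName)*"""
    match = IDENT.match(text)
    if match is None:
        raise ValueError("expected an identifier at position 0")
    node = ("identifier", match.group())
    pos = match.end()
    while pos < len(text):
        if text[pos] == ".":                   # fieldName: '.' identifier
            match = IDENT.match(text, pos + 1)
            if match is None:
                raise ValueError(f"expected an identifier at position {pos + 1}")
            node = ("field", node, match.group())
            pos = match.end()
        elif text[pos] == "[":                 # subscript: '[' digit ']'
            if text[pos + 1:pos + 2] in tuple("0123456789") and text[pos + 2:pos + 3] == "]":
                node = ("indexed", node, int(text[pos + 1]))
                pos += 3
            else:
                raise ValueError(f"malformed subscript at position {pos}")
        else:
            raise ValueError(f"unexpected character at position {pos}")
    return node


if __name__ == "__main__":
    print(parse_variable("a.b[0]"))
```

Each pass through the loop consumes at least one character, which is exactly the property the rewritten grammar is designed to have.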

ANTLR recognize single character

I'm pretty sure this isn't possible, but I want to ask just in case.
I have the common ID token definition:
ID: LETTER (LETTER | DIG)*;
The problem is that in the grammar I need to parse, there are some instructions in which you have a single character as operand, like:
a + 4
but
ab + 4
is not possible.
So I can't write a rule like:
sum: (INT | LETTER) ('+' (INT | LETTER))*
Because the lexer will consider 'a' as an ID, due to the higher priority of ID. (And I can't change that priority because it wouldn't recognize single character IDs then)
So I can only use ID instead of LETTER in that rule. It's ugly because there shouldn't be an ID, just a single letter, and I will have to do a second syntactic analysis to check that.
I know that there's nothing to do about it, since the lexer doesn't understand context. What I'm wondering is whether ANTLR4 already has some built-in way to check the token's length inside the rule. Something like:
sum: (INT | ID{length=1})...
I would also like to know if there is some kind of "token alias" so I can do:
SINGLE_CHAR is alias of => ID
In order to avoid writing "ID" in the rule, since that can be confusing.
PS: I'm not parsing a simple language like this one; this is just a little example. In reality, an ID could also be a string, there are other tokens which can only be a subset of letters, etc. So I think I will have to do that second analysis anyway after parsing the input, to check that it is syntactically legal. I'm just curious if something like this exists.
Checking the size of an identifier is a semantic problem and should hence be handled in the semantic phase, which usually follows the parsing step. Parse your input with the usual ID rule and check in the constructed parse tree the size of the recognized ids (and act accordingly). Don't try to force this kind of decision into your grammar.
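A minimal sketch of such a post-parse check, written in plain Python rather than against the ANTLR runtime: it tokenizes with the ordinary ID rule and then reports operands that are longer than one character in a separate pass. The token names, the error message, and the functions tokenize and check_operands are all made up for this example; a real implementation would walk the parse tree produced by ANTLR instead.

```python
import re

# One token per match: INT, ID (the usual multi-character rule) or '+'.
TOKEN = re.compile(r"\s*(?:(?P<INT>\d+)|(?P<ID>[A-Za-z][A-Za-z0-9]*)|(?P<PLUS>\+))")


def tokenize(text):
    """Split text into (kind, text) pairs using the ordinary ID rule."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append((match.lastgroup, match.group(match.lastgroup)))
        pos = match.end()
    return tokens


def check_operands(tokens):
    """Semantic pass: every ID used as an operand must be a single letter."""
    return [f"operand '{text}' must be a single letter"
            for kind, text in tokens
            if kind == "ID" and len(text) != 1]


if __name__ == "__main__":
    print(check_operands(tokenize("a + 4")))
    print(check_operands(tokenize("ab + 4")))
```

Keeping the length rule out of the grammar also means the error message can say what is wrong, instead of surfacing as a generic syntax error.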

ANTLR Matching all tokens except

Is there any way to match a token in antlr except a specific one?
I have a rule which states that a '_' can be an ID. Now I have a specific situation in which I want to match an ID, but in this particular case I want it to ignore the '_' alternative. Is it possible?
I think something like
(ID {!$ID.text.equals("_")}?)
should do it (if you are using Java as target language). Otherwise you will have to write that semantic predicate in a way that your language understands it.
In short, this will check whether the text does not equal "_" and only then will the subrule match.
Another possible way to do this:
id: ID
| '_'
;
ID: // lexer rule to match every valid identifier EXCEPT '_' ;
That way, whenever you mean "either '_' or any other ID", you use id to match this; if you want to disallow "_", you can use ID.