How to define Pascal variables in PetitParser - smalltalk

Here is the (simplified) EBNF section I'm trying to implement in PetitParser:
variable :: component / identifier
component :: indexed / field
indexed :: variable , $[ , blah , $]
field :: variable , $. , identifier
What I did was to add all these productions (except identifier) as ivars of my subclass of PPCompositeParser and define the corresponding methods as follows:
variable
^component / self identifier
component
^indexed / field
identifier
^(#letter asParser, (#word asParser) star) flatten
indexed
^variable , $[ asParser, #digit asParser, $] asParser
field
^variable , $. asParser, self identifier
start
^variable
Finally, I created a new instance of my parser and sent to it the message parse: 'a.b[0]'.
The problem: I get a stack overflow.

The grammar has a left recursion: variable -> component -> indexed -> variable. PetitParser uses Parsing Expression Grammars (PEGs) that cannot handle left recursion. A PEG parser always takes the left option until it finds a match. In this case it will not find a match due to the left recursion. To make it work you need to first eliminate left recursion. Eliminating all left recursion could be more tricky as you will also get one through field after eliminating the first. For example, you can write the grammar as follows to make the left recursion more obvious:
variable = (variable , $[ , blah , $]) | (variable , $. , identifier) | identifier
If you have a left recursion like:
A -> A a | b
you can eliminate it like (e is an empty parser)
A -> b A'
A' -> a A' | e
You'll need to apply this twice to get rid of the recursion.
Alternatively you can choose to simplify the grammar if you do not want to parse all possible combinations of identifiers.

The problem is that your grammar is left recursive. PetitParser uses a top-down greedy algorithm to parse the input string. If you follow the steps, you'll see that it goes from start then variable -> component -> indexed -> variable. This is becomes a loop that gets executed infinitely without consuming any input, and is the reason of the stack overflow (that is the left-recursiveness in practice).
The trick to solve the situation is to rewrite the parser by adding intermediate steps to avoid left-recursing. The basic idea is that the rewritten version will consume at least one character in each cycle. Let's start by simplifying a bit the parser refactoring the non-recursive parts of ´indexed´ and ´field´, and moving them to the bottom.
variable
^component, self identifier
component
^indexed / field
indexed
^variable, subscript
field
^variable, fieldName
start
^variable
subscript
^$[ asParser, #digit asParser, $] asParser
fieldName
^$. asParser, self identifier
identifier
^(#letter asParser, (#word asParser) star) flatten
Now you can more easily see (by following the loop) that if the recursion in variable is to end, an identifier has to be found at the beginning. That's the only way to start, and then comes more input (or ends). Let's call that second part variable':
variable
^self identifier, variable'
now the variable' actually refers to something with the identifier consumed, and we can safely move the recusion from the left of indexed and field to the right in variable':
variable'
component', variable' / nil asParser
component'
^indexed' / field'
indexed'
^subscript
field'
^fieldName
I've written this answer without actually testing the code, but should be okish. The parser can be further simplified, I leave that as an excercise ;).
For more information on left-recursion elimination you can have a look at left recursion elimination

Related

Fast lookup of tree with placeholders?

For an application I'm considering, there would be a large (100,000+) 'database' of trees (think expressions in a programming language, or S-expressions), and I would need to query that database for expressions that match a specific given expression.
Before giving the details of what I'd like to have, note that I'd appreciate any information related to indexing a large set of trees for optimizing lookup by a subtree.
In my specific situation (which would be for a backend to be used by Metamath proof assistants), expressions have the following structure (in Haskell-like notation):
data Expression = Placeholder Id | VarName Id | ConstName Id [Expression]
or as a BNF for an S-expression form:
Expression = '?' Id | Id | '(' Id Expression* ')'
where Id is some kind of identifier.
For example, I could have a database with expressions like
(equiv ?ph ?ps)
(not (in (appl (sqrt) (2)) (Q)))
(equiv (eq ?A ?B) (forall ?x (equiv (in ?x ?A) (in ?x ?B))))
In this context, two expressions match if they can be made equal by substitution of expressions for placeholders. So looking up (equiv (eq A (emptyset)) ?ph) in the above mini-database would result in the first and last expressions.
So again: how would I implement fast lookups in a large set of (expression) trees with placeholders? What kind of index data structure could I use?
I would implement the lookup with a trie. Each key would consist of one of the following:
ConstName Identifier
Variable w/ context info
ConstValue
Placeholder
These should be ordered in some fashion- possibly Placeholder, then all ConstNames (alphabetical), then variables (scope ordering, then argument order), then ConstValues (numerical order). As long as there's a concrete ordering for usage in the trie, you're fine.
Traverse the expression's tree, injecting the appropriate keys into the trie as they are encountered. Do this for all the expressions you want to insert into your data structure. When it comes time to query it, you can traverse the trie in a similar fashion, but with a few new rules.
Everything matches a placeholder node. If it matches some other key as well, then you'll need to explore both branches (easily done via a recursive DFS-like approach).
A placeholder matches everything. This is not equivalent to the previous point- we are talking about placeholders in the query here, the previous bullet is regarding placeholders as trie keys.
Now, this does mean that the search space can somewhat "explode" as you encounter placeholders, but there is one thing you can do to try to mitigate this in practice. Traverse the expression's tree in a breadth-first fashion (both in construction of the trie, and querying). This means if one of the arguments is a placeholder, you won't have to full-depth search every single subtree that matches that expression so far- instead you jump ahead to the next argument- which may not be a placeholder, and will thus greatly prune the search space (compared to matching "everything").
For completeness sake, lets take one of your examples
(not (in (appl (sqrt) (2)) (Q)))
and make a trie entry from that-
not -> in -> apply -> "Q" -> sqrt -> 2
adding (not (in ?ph E)) to this would result in-
not -> in -> apply -> "Q" -> sqrt -> 2
\-> ?ph -> "E"
Continue in this fashion injecting expressions into the trie. Also traverse in this fashion for querying until you reach the ends of your searches into the trie, and return those that matched.
Note- the uniqueness of these entries is based on the assumption you do not have to support variadic functions. If you do, attach to each key some context info (read the next paragraphs for info on how to do this) to distinguish which arguments go to which functions
There is one detail I glossed over- variables. If you only want it to match if they are the exact same variable name, then no work is necessary. But this likely isn't what you want; you probably want it to match generic variables as long as they are "consistent" with each other. The way to do this is to assign each variable an identifier that represents the scope of which it was first defined.
The easiest way to do this is just compose an identifier from the concatenation of the argument ordering of its ancestors. That is, if a variable is first defined as the second argument to a function which is the fifth argument to the root function, then we might label it as (5, 2) or (2, 5), whichever makes more sense intuitively. Either way, this will ensure the variable is given a consistent identifier regardless of other variables / functions elsewhere. Then proceed as normal with this new variable name.

ANTLR recognize single character

I'm pretty sure this isn't possible, but I want to ask just in case.
I have the common ID token definition:
ID: LETTER (LETTER | DIG)*;
The problem is that in the grammar I need to parse, there are some instructions in which you have a single character as operand, like:
a + 4
but
ab + 4
is not possible.
So I can't write a rule like:
sum: (INT | LETTER) ('+' (INT | LETTER))*
Because the lexer will consider 'a' as an ID, due to the higher priority of ID. (And I can't change that priority because it wouldn't recognize single character IDs then)
So I can only use ID instead of LETTER in that rule. It's ugly because there shouldn't be an ID, just a single letter, and I will have to do a second syntactic analysis to check that.
I know that there's nothing to do about it, since the lexer doesn't understand about context. What I'm thinking that maybe there's already built-in ANTLR4 is some kind of way to check the token's length inside the rule. Something like:
sum: (INT | ID{length=1})...
I would also like to know if there are some kind of "token alias" so I can do:
SINGLE_CHAR is alias of => ID
In order to avoid writing "ID" in the rule, since that can be confusing.
PD: I'm not parsing a simple language like this one, this is just a little example. In reality, an ID could also be a string, there are other tokens which can only be a subset of letters, etc... So I think I will have to do that second analysis anyways after parsing the entry to check that syntactically is legal. I'm just curious if something like this exists.
Checking the size of an identifier is a semantic problem and should hence be handled in the semantic phase, which usually follows the parsing step. Parse your input with the usual ID rule and check in the constructed parse tree the size of the recognized ids (and act accordingly). Don't try to force this kind of decision into your grammar.

Recursive rule in ANTLR

I need an idea how to express a statement like the following:
Int<Double<Float>>
So, in an abstract form we should have:
1.(easiest case): a<b>
2. case: a<a<b>>
3. case: a<a<a<b>>>
4. ....and so on...
The thing is that I should enable the possibility to embed a statement of the form a < b > within the < .. > - signs such that I have a nested statement. In other words: I should replace the b with a< b >.
The 2nd thing is that the number of the opening and closed <>-signs should be equal.
How can I do that in ANTLR ?
A rule can refer to itself without any problem¹. Let's say we have a rule type which describes your case, in a minimalist approach:
type: typeLiteral ('<' type '>')?;
typeLiteral: 'Int' | 'Double' | 'Float';
Note the ('<' type '>') is optional, denoted by the ? symbol, so using only a typeLiteral is a valid type. Here are the synta trees generated by these rules in your example Int<Double<Float>>:
¹: As long some terminals (like '<' or '>') can diferentiate when the recursion stop.
Image generated by http://ironcreek.net/phpsyntaxtree/

Not able to understand ANTLR parser rule

I am new to ANTLR , and walking through existing grammar(got at Internet). See the given rule , i am not able to understand it for what is it all about?
Specially $model_expr inside Tree construct and initial (unary_expr -> unary_expr). Please help me understanding the same.
model_expr
: (unary_expr -> unary_expr)
(LEFT_BRACKET model_expr_element RIGHT_BRACKET
-> ^(MODEL_EXPR[$LEFT_BRACKET] $model_expr model_expr_element))?
;
Thanks
See the detailed explanation of above syntax with example (copied from book)
Referencing Previous Rule ASTs in Rewrite Rules
Sometimes you can’t build the proper AST in a purely declarative manner. In other words, executing a single rewrite after the parser has matched everything in a rule is insufficient. Sometimes you need to iteratively build up the AST . To iteratively build an AST, you need to be able to reference the previous value of the current rule’s AST. You can reference the previous value by using $r within a rewrite rule where r
is the enclosing rule. For example, the following rule matches either a single integer or a series of integers addedtogether:
expr : (INT -> INT) ( '+' i=INT -> ^( '+' $expr $i) ) * ;
The (INT->INT) subrule looks odd but makes sense. It says to match INT
and then make its AST node the result of expr . This sets a result AST
in case the (...)* subrule that follows matches nothing. To add another
integer to an existing AST, you need to make a new ’+’ root node that
has the previous expression as the left child and the new integer as the
right child.
That grammar with embedded rewrite rules recognizes the same input and generates the same tree as the following version that uses the construction operators:
expr : INT ('+'^ INT)*;

How to tell if an identifier is being assigned or referenced? (FLEX/BISON)

So, I'm writing a language using flex/bison and I'm having difficulty with implementing identifiers, specifically when it comes to knowing when you're looking at an assignment or a reference,
for example:
1) A = 1+2
2) B + C (where B and C have already been assigned values)
Example one I can work out by returning an ID token from flex to bison, and just following a grammar that recognizes that 1+2 is an integer expression, putting A into the symbol table, and setting its value.
examples two and three are more difficult for me because: after going through my lexer, what's being returned in ex.2 to bison is "ID PLUS ID" -> I have a grammar that recognizes arithmetic expressions for numerical values, like INT PLUS INT (which would produce an INT), or DOUBLE MINUS INT (which would produce a DOUBLE). if I have "ID PLUS ID", how do I know what type the return value is?
Here's the best idea that I've come up with so far: When tokenizing, every time an ID comes up, I search for its value and type in the symbol table and switch out the ID token with its respective information; for example: while tokenizing, I come across B, which has a regex that matches it as being an ID. I look in my symbol table and see that it has a value of 51.2 and is a DOUBLE. So instead of returning ID, with a value of B to bison, I'm returning DOUBLE with a value of 51.2
I have two different solutions that contradict each other. Here's why: if I want to assign a value to an ID, I would say to my compiler A = 5. In this situation, if I'm using my previously described solution, What I'm going to get after everything is tokenized might be, INT ASGN INT, or STRING ASGN INT, etc... So, in this case, I would use the former solution, as opposed to the latter.
My question would be: what kind of logical device do I use to help my compiler know which solution to use?
NOTE: I didn't think it necessary to post source code to describe my conundrum, but I will if anyone could use it effectively as a reference to help me understand their input on this topic.
Thank you.
The usual way is to have a yacc/bison rule like:
expr: ID { $$ = lookupId($1); }
where the the lookupId function looks up a symbol in the symbol table and returns its type and value (or type and storage location if you're writing a compiler rather than a strict interpreter). Then, your other expr rules don't need to care whether their operands come from constants or symbols or other expressions:
expr: expr '+' expr { $$ = DoAddition($1, $3); }
The function DoAddition takes the types and values (or locations) for its two operands and either adds them, producing a result, or produces code to do the addition at run time.
If possible redesign your language so that the situation is unambiguous. This is why even Javascript has var.
Otherwise you're going to need to disambiguate via semantic rules, for example that the first use of an identifier is its declaration. I don't see what the problem is with your case (2): just generate the appropriate code. If B and C haven't been used yet, a value-reading use like this should be illegal, but that involves you in control flow analysis if taken to the Nth degree of accuracy, so you might prefer to assume initial values of zero.
In any case you can see that it's fundamentally a language design problem rather than a coding problem.