PLY conditional rule - yacc

How can I handle one-to-many rules in PLY?
tokens = (
    'VAR1',
    'VAR2',
    'TRUE',
    'SINGLE_CHAR',
)

def t_VAR1(t):
    r'var1'
    return t

def t_VAR2(t):
    r'var2'
    return t

def t_TRUE(t):
    r't|T'
    return t

def t_SINGLE_CHAR(t):
    r'[A-Za-z]'
    return t
Now suppose I have these two rules:
def p_expression_variable2(p):
    '''variable2 : VAR2 SINGLE_CHAR'''
    print("got VAR2")

def p_expression_variable1(p):
    '''variable1 : VAR1 TRUE'''
    print("got VAR1")
I want to make the rules so that var1 T matches p_expression_variable1 and var2 T matches p_expression_variable2. But since t_TRUE is a subset of t_SINGLE_CHAR, T always matches t_TRUE, which causes an error.
I am a novice in lex and yacc and want to know how to handle rules for this kind of issue. I know it can be handled by conditional lexing (defining two states), but is there a way to handle it with a single state?

Not in the way you want to do it.
In the yacc/lex parsing model, lexical analysis is performed without regard to the parser state. That's the point of separating syntactic and lexical analysis into distinct components. If you wanted them to interact in the way you suggest, you would have to use a scannerless parser, but that involves a certain cost, both in grammar complexity and in the parsing algorithm, because the resulting grammar is unlikely to work with only a single character of lookahead.
With the lex/yacc framework, you could use multiple lexical states, but that requires either feedback from the parser to the lexical analyser, generally using mid-rule actions, which Ply doesn't support, or reproducing part of the syntactic analysis inside a hand-built state machine in the lexical analyser. Both of these solutions are inelegant, unscalable, and unnecessarily bulky. (Nonetheless, they are more common than one might hope.)
What you can do is to make the lexical analysis unambiguous:
def t_TRUE(t):
    r'[tT]'
    return t

def t_SINGLE_CHAR(t):
    r'[A-SU-Za-su-z]'
    return t
and implement a kind of lexical fallback in the parser:
def p_any_char(p):
    '''any_char : SINGLE_CHAR
                | TRUE
    '''
    p[0] = p[1]

def p_expression_variable2(p):
    '''variable2 : VAR2 any_char'''
    print("got VAR2")

def p_expression_variable1(p):
    '''variable1 : VAR1 TRUE'''
    print("got VAR1")

Related

Computation of dependencies (related to the K prelude)

I'm particularly interested in understanding the K prelude (how it is structured, why its content is the way it is, how "kompile" calculates dependencies, etc.).
The main question is: what is the criterion for a hooked symbol from the K prelude to be copied into the generated Kore file?
Here are some examples of potential problems:
The symbol andBool is copied with its associated rewrite rules, which does not seem to be the case for the symbol in_keys, which is simply copied without its rewrite rules.
Other symbols seem to be useless (for the IMP semantics) but still appear, with or without their rewrite rules, in the generated Kore file, such as countAllOccurrences, findChar, signExtendBitRangeInt or Float2String.
It seems that SortId is generated by the line syntax Id [token]. However, the lines syntax Bool ::= "true" [token] and syntax Bool ::= "false" [token] do not generate true and false symbols.
(Moreover, is it a deliberate choice that true and false are values and not constructors?)
The sort named SortId is not generated for the following example, whereas some generated hooked symbols depend on this sort. This problem does not exist with the IMP semantics.
module MAX-OW-SYNTAX
  imports INT
  imports BOOL

  syntax Exp ::= Int | "(" Exp ")" [bracket]
               | "max" Exp Exp
endmodule

module MAX-OW
  imports MAX-OW-SYNTAX

  syntax KResult ::= Int

  rule max X Y => Y requires X <Int Y
  rule max X _ => X [owise]
endmodule
Is it correct that the K prelude is implemented in each language of each backend, and that an implementation in the Kore language is available in the K prelude?
Is there a defined interface that a new backend must implement? (For instance, Bag is obsolete, but Set, List and Map are not; however, I don't know the list of set operators, map operators, etc. that a new backend must provide.)
Is there a reason why andThenBool and andBool have the same semantics once implemented in the Kore syntax (Booleans module)?
Where are the rewrite rules defined for ==Bool, used in the definition of =/=Bool (Booleans module)?
The best reference point for the K internals is the User Manual, along with the K source for the prelude. To respond to your specific questions as best I can:
in_keys only has simplification rules that apply on symbolic backends. These will not apply on concrete backends, and so those backends use the hooked implementation MAP.in_keys. Some functions (such as andBool) can be implemented both in K and as an efficient backend hook. For example, on the K LLVM backend, andBool is implemented by code generation. If a backend didn't support that hook, the (relatively) inefficient K rewriting implementation would be used.
The Id sort is built in for convenience. It represents program identifiers.
You haven't imported DOMAINS in this example. Doing so will pull in the Id sort and related rewrites.
Very roughly, and largely for internal purposes. Do you have a hypothetical K backend in mind, or is there a way in which the LLVM / Haskell backends provided by K are inadequate for your specific use case?
andThenBool is required to short-circuit its arguments; andBool is permitted to short-circuit, but may evaluate both arguments strictly. An implementation that makes both perform short-circuiting is valid.
==Bool is implemented only in terms of a hook. In domains.md, you can see the hook(BOOL.eq) attribute that indicates how ==Bool is implemented.
Do let us know if you have further questions, or would like help implementing a specific semantics in K.

Modify the stack of a PLY parser

I'm trying to modify a value already pushed onto the stack of a PLY/yacc parser. I'm using PLY on Python 3.
Basically, I want to swap the previous two values when a SWAP token is encountered.
Imagine we have this stack:
1, 2, 3, 4, SWAP
I need it to reduce to:
1, 2, 4, 3
The value you write to p[0] will be pushed onto the stack, but how can I push more than one value?
# This fails because it consumes two values and pushes only one.
# Results in: `1`, `2`, `4`
def p_swap(p):
    'value : value value SWAP'
    p[0] = p[2]

# This was just a try... it fails as well.
def p_swap(p):
    'value : value value SWAP'
    p[0] = p[2]
    p[1] = p[1]

# This looked like a good idea since it consumes only one value and modifies
# the previous one in place. It fails because the stack (negative indexes) is
# immutable:
# https://github.com/dabeaz/ply/blob/master/ply/yacc.py#L234
# Results in: `1`, `2`, `3`, `3`
def p_swap(p):
    'value : value SWAP'
    p[0] = p[-1]
    p[-1] = p[1]  # this is a NOP
p is an instance of this class
I guess it was designed to be immutable to enforce that parsing is done a certain way (the correct way), but I'm missing it: what is the correct way to modify the stack, or to design such a parser?
It sounds like you're trying to create a stack-based language like Forth or Joy. If so, you shouldn't need a bottom-up parser, and you shouldn't be surprised that a bottom-up parser-generator doesn't work the way you want it to.
Stack-based languages are, for the most part, simply streams of tokens. Each token has some kind of stack effect, and they are just applied in sequence; there's usually little or no syntactic structure beyond that. Consequently, the languages really aren't parsed; at best, they are tokenised.
Most stack-based languages contain some kind of nested control structures which are not strictly conformant with the above (but not all; see Postscript, for example). But even these are so simple that a real parser is unnecessary.
Of course, nothing stops you from using a generated parser to parse a trivial language. But if you do that, you should certainly not expect to be able to gain access to the parser's internal data structures. The parser stack is used by the parser in ways which might not be fully obvious, and which certainly must not be interfered with. If you want to implement a stack-based language interpreter, you need to use your own value stack. (Or stacks; many stack-based languages have several different stacks, each with its own semantics.)
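As a rough sketch of that approach (plain Python, no PLY, assuming the tokens arrive as a simple list), each token's stack effect is applied to a value stack owned by the interpreter rather than to the parser's internals:

def interpret(tokens):
    """Apply each token's stack effect to our own value stack."""
    stack = []
    for tok in tokens:
        if tok == 'SWAP':
            # Swap the two topmost values on our own stack.
            stack[-1], stack[-2] = stack[-2], stack[-1]
        else:
            # Any other token is treated as a literal value to push.
            stack.append(tok)
    return stack

print(interpret([1, 2, 3, 4, 'SWAP']))  # [1, 2, 4, 3]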

How to synthesise compiler testing data?

I am writing a simple compiler as a school assignment. I am looking for an automated approach to generate both positive and negative testing data to test my compiler, given the formal grammar and other specifications. The language I am dealing with is of moderate size, with 38 or so non-terminals. For the sake of illustration, here is a snapshot of the grammar:
program: const_decl* declaration* ENDMARKER
# statement
stmt: flow_stmt | '{' stmt* '}' | NAME [stmt_trailer] ';' | ';'
stmt_trailer: arglist | ['[' expr ']'] '=' expr
flow_stmt: if_stmt | for_stmt | while_stmt | read_stmt ';' | write_stmt ';' | return_stmt ';'
return_stmt: 'return' ['(' expr ')']
if_stmt: 'if' '(' condition ')' stmt ['else' stmt]
condition: expr ('<'|'<='|'>'|'>='|'!='|'==') expr | expr
for_stmt: ('for' '(' NAME '=' expr ';' condition ';'
           NAME '=' NAME ('+'|'-') NUMBER ')' stmt)
Are there any tools to generate input files with the help of the grammar? Hand-written tests are too tedious to write or too weak to discover problems. Here is an example program in this language:
void main() {
    int N;
    int temp;
    int i, j;
    int array_size;
    reset_heap;
    scanf(N);
    for (i = 0; i < N; i = i + 1) {
        scanf(array_size);
        if (array_size > max_heap_size) {
            printf("array_size exceeds max_heap_size");
        } else {
            for (j = 0; j < array_size; j = j + 1) {
                scanf(temp);
                heap[j] = temp;
            }
            heap_sort(array_size);
            print_heap(array_size);
        }
    }
}
Generating controllable testing data automatically could save days of work. Given the simplicity of the language, there must be some way to do this effectively. Any pointers and insights are greatly appreciated.
This should have the subtopic of "How to avoid combinatorial explosion when generating test data."
While I would not be surprised if there are tools to do this, having had the same need to generate test data for grammars, I have created a few one-off applications.
One of the best series of articles I have found on this is by Eric Lippert, Every Binary Tree There Is; think of BNF converted to binary operators and then converted to an AST when you read the tree. However, he uses Catalan (every branch has two leaves), while when I wrote my app I preferred Motzkin (a branch can have one or two leaves).
Also, he did his in C# with LINQ, and I did mine in Prolog using DCGs.
Generating the data based on the BNF or DCG is not hard; the real trick is to limit the area of expansion and the size of the expansion, and to inject bad data.
By area of expansion, let's say you want to test nested if statements three levels deep, but you still need valid code that compiles. Obviously you need the boilerplate code to make it compile, and then you start changing the deeply nested if by adding or removing the else clause. So you need to put in constraints so that the boilerplate code stays constant and the testing part is variable.
By size of expansion, let's say that you want to test conditional expressions. You can easily calculate that if you have many operators and you want to test them in all combinations, you soon run into combinatorial explosion. The trick is to ensure you test deep enough and with enough breadth, but not every combination. Again, the judicious use of constraints helps.
So the point of all of this is that you start with a tool that takes in the BNF and generates valid code. Then you modify the BNF to add constraints, and modify the generator to understand those constraints when generating the code examples.
Then you modify the BNF for invalid data and likewise the generator to understand those rules.
Once that is working, you can start layering on levels of automation.
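As a rough illustration of that first step, here is a minimal Python sketch (a toy for illustration, not my actual generator; the grammar and names are made up): it walks a small BNF-like grammar and emits random, syntactically valid fragments, with a depth cutoff standing in for the kind of constraints described above:

import random

# Toy grammar: each non-terminal maps to a list of alternatives; an alternative
# is a sequence of symbols. Anything not appearing as a key is a terminal.
# Alternatives are ordered so the first one is the cheapest, which the depth
# cutoff below relies on to terminate.
GRAMMAR = {
    'stmt':    [[';'], ['NAME', '=', 'expr', ';'], ['if_stmt']],
    'if_stmt': [['if', '(', 'cond', ')', 'stmt'],
                ['if', '(', 'cond', ')', 'stmt', 'else', 'stmt']],
    'cond':    [['expr', '<', 'expr'], ['expr', '==', 'expr']],
    'expr':    [['NAME'], ['NUMBER'], ['expr', '+', 'expr']],
    'NAME':    [['x'], ['y'], ['z']],
    'NUMBER':  [['0'], ['1'], ['42']],
}

def generate(symbol, depth=0, max_depth=6):
    alternatives = GRAMMAR.get(symbol)
    if alternatives is None:
        return [symbol]                 # terminal: emit as-is
    if depth >= max_depth:
        alt = alternatives[0]           # constrain: cheapest alternative only
    else:
        alt = random.choice(alternatives)
    out = []
    for sym in alt:
        out.extend(generate(sym, depth + 1, max_depth))
    return out

for _ in range(5):
    print(' '.join(generate('stmt')))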
If you do go this route and decide that you will have to learn Prolog, take a look at Mercury first. I have not done this with Mercury, but if I do it again Mercury is high on the list.
While my actual code is not public, this and this are the closest to it that are public.
Along the way I had some fun with it in Code Golf.
When generating terminals such as reserved words or values for types, you can use predefined lists with both valid and invalid data. For example, for if, if the language is case-sensitive, I would include if, If, IF, iF, etc. in the list. For value types such as unsigned byte I would include -1, 0, 255 and 256.
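In code, such pools can be as simple as a couple of lists (hypothetical names; the values follow the examples above):

# Hypothetical pools mixing valid and invalid entries.
IF_KEYWORDS = ['if', 'If', 'IF', 'iF']      # only 'if' is valid if the language is case-sensitive
UNSIGNED_BYTE_VALUES = [-1, 0, 255, 256]    # boundary values; -1 and 256 are invalid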
When I was testing basic binary math expressions with +, -, * and ^, I generated all of the tests using the basic numbers -2, -1, 0, 1, and 2. I thought it would be useless since I already had hundreds of test cases, but since it only took a few minutes to generate all of the test cases and several hours to run them, to my surprise it found a pattern I had not covered. The point here is that, contrary to what most people say about having too many test cases, the cost is only computer time; by changing a few constraints you can generate a large number of tests, so do run them.

How to modify parsing grammar to allow assignment and non-assignment statements?

So the question is about the grammar below. I'm working on a mini interpreted language for fun (we learned about some compiler design in class, so I want to take it to the next level and try something on my own). I'm stuck trying to define the non-terminal symbol Expr.
Statement ::= Expr SC
Expr      ::= /* I need help here */
Assign    ::= Name EQUAL Expr
AddSub    ::= MulDiv {(+|-) AddSub}
MulDiv    ::= Primary {(*|/) MulDiv}
Primary   ::= INT | FLOAT | STR | LP Expr RP | Name
Name      ::= ID {. Name}
Expr has to be defined such that Statement allows for these two cases:
x = 789; (regular assignment, followed by semicolon)
x+2; (no assignment, just calculation, discarded; followed by a semicolon)
The purpose of the second case is to set up the foundation for more changes in the future. I was thinking about unary increment and decrement operators, and also function calls, neither of which requires assignment to be meaningful.
I've looked at other grammars (namely C#'s), but it was too complicated and lengthy to understand. Naturally, I'm not looking for solutions, only for guidance on how I could modify my grammar.
All help is appreciated.
EDIT: I should say that my initial thought was Expr ::= Assign | AddSub, but that wouldn't work because it would create ambiguity, since both could start with the non-terminal symbol Name. I have made my tokenizer such that it allows one token of lookahead (peek), but I have not made anything similar for the non-terminals, since that would be trying to fix a problem that could be avoided (the ambiguity). In the grammar, the terminals are the ones in all-caps.
The simplest solution is the one actually taken by the designers of C, and thus by the various C derivatives: treat assignment simply as yet another operator, without restricting it to being at the top-level of a statement. Hence, in C, the following is unproblematic:
while ((ch = getchar()) != EOF) { ... }
Not everyone will consider that good style, but it is certainly common (particularly in the clauses of the for statement, whose syntax more or less requires that assignment be an expression).
There are two small complications, both of which are relatively easy to handle:
Logically, and unlike most operators, assignment associates to the right so that a = b = 0 is parsed as a = (b = 0) and not (a = b) = 0 (which would be highly unexpected). It also binds very weakly, at least to the right.
Opinions vary as to how tightly it should bind to the left. In C, a strict precedence model is mostly followed, so that a = 2 + b = 3 is rejected because it is parsed as a = ((2 + b) = 3). a = 2 + b = 3 might seem like terrible style, but consider also a < b ? (x = a) : (y = a). In C++, where the result of the ternary operator can be a reference, you could write that as (a < b ? x : y) = a, in which the parentheses are required even though assignment has lower precedence than the ternary operator.
None of these options are difficult to implement in a grammar, though.
In many languages, the left-hand side of an assignment has a restricted syntax. In C++, which has reference values, the restriction could be considered semantic, and I believe it is usually implemented with a semantic check, but in many C derivatives lvalue can be defined syntactically. Such definitions are unambiguous, but they are often not amenable to parsing with a top-down grammar, and they can create complications even for a bottom-up grammar. Doing the check post-parse is always a simple solution.
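Sketched in the notation of the question's grammar (with the check that the left-hand side is actually a Name deferred to a post-parse step, as discussed above), that might look like:

Expr   ::= Assign
Assign ::= AddSub [EQUAL Assign]

Because Assign always begins by parsing an AddSub, the prediction problem mentioned in the question's edit disappears; the optional right-recursive tail makes assignment right-associative and the weakest-binding operator.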
If you really want to distinguish assignment statements from expression statements, then you indeed run into the problem of prediction failure (not ambiguity) if you use a top-down parsing technique such as recursive descent. Since the grammar is not ambiguous, a simple solution is to use an LALR(1) parser generator such as bison/yacc, which has no problem parsing such a grammar since it does not require an early decision as to which kind of statement is being parsed. On the whole, the use of LALR(1) or even GLR parser generators simplifies the implementation of a parser by allowing you to specify a grammar in a form which is easily readable and corresponds to the syntactic analysis. (For example, an LALR(1) parser can handle left-associative operators naturally, while an LL(1) grammar can only produce right-associative parses and therefore requires some kind of reconstruction of the syntax tree.)
A recursive descent parser is a computer program, not a grammar, and its expressiveness is thus not limited by the formal constraints of LL(1) grammars. That is both a strength and a weakness: the strength is that you can find solutions which are not limited by the limitations of LL(1) grammars; the weakness is that it is much more complicated (even, sometimes, impossible) to extract a clear statement about the precise syntax of the language. This power, for example, allows recursive descent grammars to handle left associativity in a more-or-less natural way despite the restriction mentioned above.
If you want to go down this road, then the solution is simple enough. You will have some sort of function:
/* This function parses and returns a single expression */
Node expr() {
    Node left = value();
    while (true) {
        switch (lookahead) {
            /* Handle each possible operator token. I left out
             * the detail of handling operator precedence since it's
             * not relevant here.
             */
            case OP_PLUS: {
                accept(lookahead);
                left = MakeNode(OP_PLUS, left, value());
                break;
            }
            /* If no operator found, return the current expression */
            default:
                return left;
        }
    }
}
That can easily be modified to parse both expressions and statements. First, refactor the function so that it parses the "rest" of an expression, given the first operand. (The only changes are a new prototype and the deletion of the first line in the body.)
/* This function parses and returns a single expression
 * after the first value has been parsed. The value must be
 * passed as an argument.
 */
Node expr_rest(Node left) {
    while (true) {
        switch (lookahead) {
            /* Handle each possible operator token. I left out
             * the detail of handling operator precedence since it's
             * not relevant here.
             */
            case OP_PLUS: {
                accept(lookahead);
                left = MakeNode(OP_PLUS, left, value());
                break;
            }
            /* If no operator found, return the current expression */
            default:
                return left;
        }
    }
}
With that in place, it is straightforward to implement both expr and stmt:
Node expr() {
    return expr_rest(value());
}

Node stmt() {
    /* Check lookahead for statements which start with
     * a keyword. Omitted for simplicity.
     */
    /* Either the first value in an expr or the target of an assignment */
    Node left = value();
    switch (lookahead) {
        case OP_ASSIGN: {
            accept(lookahead);
            return MakeAssignment(left, expr());
        }
        /* Handle += and other mutating assignments if desired */
        default: {
            /* Not an assignment, just an expression */
            return MakeExpressionStatement(expr_rest(left));
        }
    }
}

Generating a list of lists of Int with QuickCheck

I'm working through Real World Haskell; one of the exercises of chapter 4 is to implement a foldr-based version of concat. I thought this would be a great candidate for testing with QuickCheck since there is an existing implementation to validate my results. This, however, requires me to define an instance of the Arbitrary typeclass that can generate arbitrary [[Int]]. So far I have been unable to figure out how to do this. My first attempt was:
module FoldExcercises_Test
    where

import Test.QuickCheck
import Test.QuickCheck.Batch
import FoldExcercises

prop_concat xs =
    concat xs == fconcat xs
  where types = xs :: [[Int]]

options = TestOptions { no_of_tests     = 200
                      , length_of_tests = 1
                      , debug_tests     = True }

allChecks = [ run (prop_concat)
            ]

main = do
    runTests "simple" options allChecks
This results in no tests being performed. Looking at various bits and pieces, I guessed that an Arbitrary instance declaration was needed and added:
instance Arbitrary a => Arbitrary [[a]] where
    arbitrary = sized arb'
      where arb' n = vector n (arbitrary :: Gen a)
This resulted in GHCi complaining that my instance declaration was invalid and that adding -XFlexibleInstances might solve my problem. Adding the {-# OPTIONS_GHC -XFlexibleInstances #-} directive results in a type mismatch and an overlapping instances warning. So my question is: what's needed to make this work? I'm obviously new to Haskell and am not finding any resources that help me out. Any pointers are much appreciated.
Edit
It appears I was misled by QuickCheck's output: working test-first, fconcat was defined as
fconcat = undefined
Actually implementing the function correctly indeed gives the expected result. DOOP!
[[Int]] is already an Arbitrary instance (because Int is an Arbitrary instance, and [a] is an Arbitrary instance for every a that is itself an instance of Arbitrary). So that is not the problem.
I ran your code myself (replacing import FoldExcercises with fconcat = concat) and it ran 200 tests as I would have expected, so I am mystified as to why it doesn't do it for you. But you do NOT need to add an Arbitrary instance.