How to deal with variable references in yacc/bison (with OCaml)

I was wondering how to deal with variable references inside statements while writing grammars with ocamlyacc and ocamllex.
The problem is that statements of the form
var x = y + z
var b = true | f;
should both be correct, but in the first case the variables refer to numbers, while in the second case f is a boolean variable.
In the grammar I'm writing I have got this:
numeric_exp_val:
| nint { Syntax.Int $1 }
| FLOAT { Syntax.Float $1 }
| LPAREN; ne = numeric_exp; RPAREN { ne }
| INCR; r = numeric_var_ref { Syntax.VarIncr (r,1) }
| DECR; r = numeric_var_ref { Syntax.VarIncr (r,-1) }
| var_ref { $1 }
;
boolean_exp_val:
| BOOL { Syntax.Bool $1 }
| LPAREN; be = boolean_exp; RPAREN { be }
| var_ref { $1 }
;
which obviously can't work, since both var_ref non-terminals reduce to the same thing (a reduce/reduce conflict). But I would like type checking that is mostly done statically (with respect to variable references) during the parsing phase itself.
That's why I'm wondering what the best way is to have variable references while keeping this structure. As additional information, I have functions that compile the syntax tree by translating it into a byte code similar to this one:
let rec compile_numeric_exp exp =
  match exp with
  | Int i -> [Push (Types.I.Int i)]
  | Float f -> [Push (Types.I.Float f)]
  | Bop (BNSum,e1,e2) -> (compile_numeric_exp e1) @ (compile_numeric_exp e2) @ [Types.I.Plus]
  | Bop (BNSub,e1,e2) -> (compile_numeric_exp e1) @ (compile_numeric_exp e2) @ [Types.I.Minus]
  | Bop (BNMul,e1,e2) -> (compile_numeric_exp e1) @ (compile_numeric_exp e2) @ [Types.I.Times]
  | Bop (BNDiv,e1,e2) -> (compile_numeric_exp e1) @ (compile_numeric_exp e2) @ [Types.I.Div]
  | Bop (BNOr,e1,e2) -> (compile_numeric_exp e1) @ (compile_numeric_exp e2) @ [Types.I.Or]
  | VarRef n -> [Types.I.MemoryGet (Memory.index_for_name n)]
  | VarIncr ((VarRef n) as vr,i) -> (compile_numeric_exp vr) @ [Push (Types.I.Int i); Types.I.Plus; Types.I.Dupe] @ (compile_assignment_to n)
  | _ -> []
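For context, byte code like this can be executed by a small stack machine. The following is a hypothetical sketch only: it inlines local value/instr types instead of the real Types.I module and omits the memory instructions.

```ocaml
(* A minimal stack machine for the numeric part of the byte code.
   Constructor names mirror the snippet above, but this is a sketch,
   not the author's actual Types.I module. *)
type value = Int of int | Float of float

type instr =
  | Push of value
  | Plus
  | Minus
  | Times

(* Execute one instruction against the current stack. *)
let step stack instr =
  match instr, stack with
  | Push v, _ -> v :: stack
  | Plus, Int b :: Int a :: rest -> Int (a + b) :: rest
  | Minus, Int b :: Int a :: rest -> Int (a - b) :: rest
  | Times, Int b :: Int a :: rest -> Int (a * b) :: rest
  | _ -> failwith "stack underflow or type error"

(* Run a whole program, starting from an empty stack. *)
let run prog = List.fold_left step [] prog
```

For example, running the code produced for 2 + 3, i.e. `run [Push (Int 2); Push (Int 3); Plus]`, leaves `[Int 5]` on the stack.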

Parsing is simply not the right place to do type-checking. I don't understand why you insist on doing this in this pass. You would have much clearer code and greater expressive power by doing it in a separate pass.
Is it for efficiency reasons? I'm confident you could devise efficient incremental-typing routines elsewhere, to be called from the grammar productions (but I'm not sure you would win that much). This looks like premature optimization.
There has been work on writing type systems as attribute grammars (which could be seen as a declarative way to express typing derivations), but I don't think it is meant to conflate parsing and typing in a single pass.
If you really want to go further in this direction, I would advise you to use a simple lexical differentiation between num-typed and bool-typed variables. This sounds ugly but is simple.

If you want to treat numeric expressions and boolean expressions as different syntactic categories, then consider how you must parse var x = ( ( y + z ) ). You don't know which type of expression you're parsing until you hit the +. Therefore, you need to eat up several tokens before you know whether you are seeing a numeric_exp_val or a boolean_exp_val: you need some unbounded lookahead. Yacc does not provide such lookahead (Yacc only provides a restricted form of lookahead, roughly described as LALR, which puts bounds on parsing time and memory requirements). There is even an ambiguous case that makes your grammar context-sensitive: with a definition like var x = y, you need to look up the type of y.
You can solve this last ambiguity by feeding back the type information into the lexer, and you can solve the need for lookahead by using a parser generator that supports unbounded lookahead. However, both of these techniques will push your parser towards a point where it can't easily evolve if you want to expand the language later on (for example to distinguish between integer and floating-point numbers, to add strings or lists, etc.).
If you want a simple but constraining fix with a low technological overhead, I'll second gasche's suggestion of adding a syntactic distinguisher for numeric and boolean variable definitions, something like bvar b = … and nvar x = …. There again, this will make it difficult to support other types later on.
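In the question's grammar, that lexical distinction could look something like the fragment below. This is only a sketch: the NVAR and BVAR tokens and the Syntax.NumVarRef/Syntax.BoolVarRef constructors are hypothetical, assuming the lexer classifies each identifier from its declaration prefix.

```
numeric_exp_val:
| nint { Syntax.Int $1 }
| FLOAT { Syntax.Float $1 }
| LPAREN; ne = numeric_exp; RPAREN { ne }
| r = NVAR { Syntax.NumVarRef r }
;
boolean_exp_val:
| BOOL { Syntax.Bool $1 }
| LPAREN; be = boolean_exp; RPAREN { be }
| r = BVAR { Syntax.BoolVarRef r }
;
```

The reduce/reduce conflict disappears because the two variable-reference rules now begin with different tokens.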
You will have an easier time overall if you separate type checking from parsing. Once you've built an abstract syntax tree, do a pass of type checking (in which you will infer the types of variables):
type numeric_expression = Nconst of float | Nplus of numeric_expression * numeric_expression | …
and boolean_expression = Bconst of bool | Bor of boolean_expression * boolean_expression | …
type typed_expression = Tnum of numeric_expression | Tbool of boolean_expression
type typed_statement = Tvar of string * typed_expression

let rec type_expression : Syntax.expression -> typed_expression = function
  | Syntax.Float x -> Tnum (Nconst x)
  | Syntax.Plus (e1, e2) ->
    begin match type_expression e1, type_expression e2 with
    | Tnum n1, Tnum n2 -> Tnum (Nplus (n1, n2))
    | _, (Tbool _ as t2) -> raise (Invalid_argument_type ("+", t2))
    | (Tbool _ as t1), _ -> raise (Invalid_argument_type ("+", t1))
    end
  | …
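Filled out, such a pass can be made self-contained. The sketch below uses hypothetical unqualified constructors in place of the Syntax module and covers only + and |:

```ocaml
(* Untyped AST, as a parser might produce it (hypothetical constructors). *)
type expression =
  | Float of float
  | Bool of bool
  | Plus of expression * expression
  | Or of expression * expression

(* Typed ASTs, as in the answer above. *)
type numeric_expression =
  | Nconst of float
  | Nplus of numeric_expression * numeric_expression

type boolean_expression =
  | Bconst of bool
  | Bor of boolean_expression * boolean_expression

type typed_expression = Tnum of numeric_expression | Tbool of boolean_expression

exception Invalid_argument_type of string * typed_expression

(* Infer the type of an expression, or raise on a type error. *)
let rec type_expression = function
  | Float x -> Tnum (Nconst x)
  | Bool b -> Tbool (Bconst b)
  | Plus (e1, e2) ->
    (match type_expression e1, type_expression e2 with
     | Tnum n1, Tnum n2 -> Tnum (Nplus (n1, n2))
     | _, (Tbool _ as t) | (Tbool _ as t), _ ->
       raise (Invalid_argument_type ("+", t)))
  | Or (e1, e2) ->
    (match type_expression e1, type_expression e2 with
     | Tbool b1, Tbool b2 -> Tbool (Bor (b1, b2))
     | _, (Tnum _ as t) | (Tnum _ as t), _ ->
       raise (Invalid_argument_type ("|", t)))
```

With this, numeric and boolean expressions land in different branches of typed_expression, and mixing them (e.g. Plus (Float 1., Bool true)) raises Invalid_argument_type.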

Related

Is it possible to do pattern binding a la Haskell in Idris?

An example would be:
fib: Stream Integer
fib@(1::tfib) = 1 :: 1 :: [ a+b | (a,b) <- zip fib tfib]
But this generates the error:
50 | fib@(1::tfib) = 1 :: 1 :: [ a+b | (a,b) <- zip fib tfib]
   |    ^
unexpected "@(1::tfib)"
expecting "<==", "using", "with", ':', argument expression, constraint argument, expression, function right hand side, implementation
block, implicit function argument, or with pattern
This doesn't look promising, given that the parser doesn't recognize @ at the likely position.
Note that the related concept of as-patterns works the same in Haskell and Idris:
growHead : List a -> List a
growHead nnl@(x::_) = x::nnl
growHead ([]) = []
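For comparison with the OCaml question above, OCaml spells the same idea with the as keyword (a minimal sketch):

```ocaml
(* OCaml's as-pattern: `pattern as name` binds the whole matched value
   in addition to its components, like Haskell/Idris name@(pattern). *)
let grow_head = function
  | (x :: _) as nnl -> x :: nnl
  | [] -> []
```

For example, grow_head [1; 2] evaluates to [1; 1; 2].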

How do you define a non-generic recursive datatype in Idris?

This is literally my first line of Idris code. When I looked up the documentation, all appeared proper:
Idris> data T = Foo Bool | Bar (T -> T)
(input):1:6:
|
1 | data T = Foo Bool | Bar (T -> T)
| ^
unexpected reserved data
expecting dependent type signature
This makes me think I may need to declare T to be a symbol in some fashion?
It works as expected inside an Idris source file. At the REPL, however, declarations need to be prefixed with the :let command:
:let data T = Foo Bool | Bar (T -> T)
Thanks for the question. I learned something trying to answer it.

Creating Context-Free-Grammar with restrictions

I'm trying to understand CFG by using an example with some obstacles in it.
For example, I want to match the declaration of a double variable:
double d;
In this case, "d" could be any other valid identifier.
There are some cases that should not be matched, e.g. "double double;", but I don't understand how to prevent the second "double" from matching.
My approach:
G = (Σ, V, S, P)
Σ = {a-z}
V = {S,T,U,W}
P = { S -> doubleTUW
T -> _(space)
U -> (a-z)U | (a-z)
W -> ;
}
Now there must be a way to limit the possible outcomes of this grammar, i.e. the language L(G). Unfortunately, I couldn't find a syntax that meets my requirement of denying a second "double".
Here's a somewhat tedious regular expression to match any identifier other than double:
([a-ce-z]|d[a-np-z]|do[a-tv-z]|dou[ac-z]|doub[a-km-z]|doubl[a-df-z]|double[a-z])[a-z]*
Converting it to a CFG can be done mechanically but it is even more tedious:
ALPHA → a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z
NOT_B → a|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z
NOT_D → a|b|c|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z
NOT_E → a|b|c|d|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z
NOT_L → a|b|c|d|e|f|g|h|i|j|k|m|n|o|p|q|r|s|t|u|v|w|x|y|z
NOT_O → a|b|c|d|e|f|g|h|i|j|k|l|m|n|p|q|r|s|t|u|v|w|x|y|z
NOT_U → a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|v|w|x|y|z
WORD → NOT_D
| d NOT_O
| do NOT_U
| dou NOT_B
| doub NOT_L
| doubl NOT_E
| double ALPHA
| WORD ALPHA
This is why many of us usually use scanner generators like (f)lex which handle such exclusions automatically.
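The check the regular expression encodes ("one or more lowercase letters, but not exactly double") can also be written directly; a sketch in OCaml (hypothetical function name, using String.for_all from OCaml ≥ 4.13). Note that the regular expression above, as written, also rejects proper prefixes of double such as d or doub, whereas this predicate accepts them:

```ocaml
(* Accept any non-empty string of lowercase letters except "double". *)
let is_identifier s =
  s <> "double"
  && String.length s > 0
  && String.for_all (fun c -> c >= 'a' && c <= 'z') s
```

For example, is_identifier "doubled" holds, while is_identifier "double" does not.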

With respect to grammars, when are epsilon production rules allowed?

I'm trying to understand a concept with respect to grammar and Production Rules.
According to most material on this subject:
1) Epsilon production rules are only allowable if they do not appear on the RHS of any other production rule.
However, taking a grammar:
G = { T,N,P,S }
Where:
T = {a,b}
N = {S,S1}
S = {S}
P {
S -> aSb
S -> ab
S1 -> SS1
S1 -> E //Please note, using E to represent Epsilon.
}
Where, the language of the grammar is:
L(G) = { a^n b^n | n >= 1 }
In this case, a production rule containing Epsilon exists (derived from S1) but S1 also forms part of a RHS of another production rule (S1 -> SS1).
Doesn't this violate point 1?
Your statement:
Epsilon production rules are only allowable if they do not appear on the RHS of any other production rule.
would be better stated as
A non-terminal may have an epsilon production rule if that non-terminal does not appear on the right-hand side of any production rule.
In Chomsky's original hierarchy, epsilon productions were banned for all but Type 0 (unrestricted) grammars. If all epsilon productions are banned, then it is impossible for the grammar to produce the empty string. I believe this was not a concern for Chomsky; consequently, most modern formulations allow the start symbol to have an empty right-hand side as long as the start symbol itself does not appear on the right-hand side of any production.
As it happens, the restriction on epsilon-productions is somewhat stronger than is necessary. In the case of both context-free grammars and regular grammars (Chomsky type 2 and type 3 grammars), it is always possible to create a weakly-equivalent grammar without epsilon productions (except possibly the single production S → ε if the grammar can produce the empty string.) It is also possible to remove a number of other anomalies which complicate grammar analysis: unreachable symbols, unproductive symbols, and cyclic productions. The result of the combination of all these eliminations is a "proper context-free grammar".
Consequently, most modern formulations of context-free grammars do not require the right-hand sides to be non-empty.
Your grammar G = {T, N, S, P} with
T = {a, b}
N = {S, S1}
S = {S}
P {
S → a S b
S → a b
S1 → S S1
S1 → ε
}
contains an unreachable symbol, S1. We can easily eliminate it, producing the equivalent grammar G' = { T, N', S, P' }:
N' = {S}
P' {
S → a S b
S → a b
}
G' does not contain any epsilon productions (but even if it had, they could have been eliminated).
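The first step of the elimination mentioned above is computing which non-terminals are nullable (can derive ε), a small fixed-point iteration. A sketch, with a hypothetical grammar representation (each rule as a non-terminal paired with its right-hand side):

```ocaml
(* Grammar symbols: terminals carry a character, non-terminals a name. *)
type symbol = T of char | N of string

(* A non-terminal is nullable if some rule rewrites it to a string of
   nullable symbols (in particular, to the empty right-hand side).
   Iterate until the set of nullable non-terminals stops growing. *)
let nullable rules =
  let rec fix nul =
    let grow acc (lhs, rhs) =
      let all_nullable =
        List.for_all (function N n -> List.mem n acc | T _ -> false) rhs
      in
      if all_nullable && not (List.mem lhs acc) then lhs :: acc else acc
    in
    let nul' = List.fold_left grow nul rules in
    if List.length nul' = List.length nul then nul else fix nul'
  in
  fix []
```

On the grammar G above, only S1 is nullable: nullable [("S", [T 'a'; N "S"; T 'b']); ("S", [T 'a'; T 'b']); ("S1", [N "S"; N "S1"]); ("S1", [])] evaluates to ["S1"].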

Why the need for terminals? Is my solution sufficient?

I'm trying to get my head around context free grammars and I think I'm close. What is baffling me is this one question (I'm doing practise questions as I have an exam in a month's time):
I've come up with this grammar, but I believe it's wrong.
S --> aSb | A | B
A --> aA | Σ
B --> bB | Σ
Apparently this is the correct solution:
S --> aSb | aA | bB
A --> aA | Σ
B --> bB | Σ
What I don't quite understand is why we have S --> aSb | aA | bB and not just S --> aSb | A | B. What is the need for the terminals? Can't I just call A instead and grab my terminals that way?
Testing to see if I can generate the string: aaabbbb
S --> aSb --> aaSbb --> aaaSbbb --> aaaBbbb --> aaabbbb
I believe I generate the string correctly, but I'm not quite sure. I'm telling myself that the reason for S --> aSb | aA | bB is that if we start with aA and then replace A with a, we have two a's, which gives us a correct string since the counts are unequal; the same can be done with b. Any advice is greatly appreciated.
Into the Tuple (G-4-tuple)
V (Non-terminals) = {A, B}
Σ (Terminals) = {a, b}
P = { } // not quite sure how to express my solution in R? Would I have to use a test string to do so?
S = A
First, some notation:
Σ means the language symbols; in your language Σ = {a, b}.
^ means the null symbol (a theoretical device; ^ is not a member of any language's symbols).
ε means the empty string (ε can be a member of some languages).
The ^ symbol stands for nothing, but we use it for theoretical purposes, much like the ∞ symbol in mathematics: no number is actually ∞, yet we use it to state and prove theorems. Similarly, ^ is nothing, but we use it.
This point is not written in any book; I am adding it purely to aid understanding. The subject belongs more to theory and mathematics, and I am coming at it from computer science.
As you say, your language is L = {a^m b^n | m != n}. Suppose the productions are as follows:
First:
S --> aSb | A | B
A --> aA | Σ
B --> bB | Σ
It means the following (very few books use Σ in grammar rules):
S --> aSb | A | B
A --> aA | a | b
B --> bB | a | b
I replaced Σ by a | b (a and b being the language symbols).
This grammar can generate strings with equal numbers of the symbol a and the symbol b (a^n b^n). How can it generate a^n b^n? See the example derivation below:
S ---> aSb ---> aAb ---> aaAb ---> aabb
(rules used, left to right: S --> aSb, S --> A, A --> aA, A --> b)
But strings of this kind are not in the language L, because L requires m != n.
Second:
For the same reason, the production rules S --> aSb | aA | bB also do not form a correct grammar if A --> aA | Σ or B --> bB | Σ are in the grammar.
I think in the second grammar you mean:
S --> aSb | aA | bB
A --> aA | ^
B --> bB | ^
Then this is a correct grammar for the language L = {a^m b^n | m != n}. Using:
S --> aSb
you can only generate equal numbers of a's and b's, and by replacing S with either aA or bB you create a sentential form with unequal numbers of a and b symbols that can never lead back to a string of the form a^n b^n (since A never generates b and B never generates a).
Third:
But usually we write grammar rules like:
S --> aSb | A | B
A --> aA | a
B --> bB | b
Both forms are equivalent (they generate the same language L = {a^m b^n | m != n}), because once you convert S into either A or B you must generate at least one a or b respectively, and thus the constraint m != n holds.
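One way to convince yourself (short of a proof) is to write the language itself as a predicate and test strings derived from either grammar against it; a sketch:

```ocaml
(* Membership test for L = { a^m b^n | m <> n }: a run of a's followed
   by a run of b's, covering the whole string, with unequal counts. *)
let in_lang s =
  let len = String.length s in
  let m = ref 0 in
  while !m < len && s.[!m] = 'a' do incr m done;   (* count leading a's *)
  let j = ref !m in
  while !j < len && s.[!j] = 'b' do incr j done;   (* count following b's *)
  !j = len && !m <> len - !m                       (* all consumed, m <> n *)
```

For example, in_lang "aaabbbb" is true, while in_lang "aabb" and in_lang "ba" are false.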
Remember that proving whether two grammars are equivalent is an undecidable problem: there is no algorithm for it in general, although a proof by hand may still be possible in particular cases.
Fourth:
Finally, I would also like to add that the grammar:
S --> aSb | A | B
A --> aA | ^
B --> bB | ^
does not produce L = {a^m b^n | m != n}, because it can generate a^n b^n, for example:
S ---> aSb ---> aAb ---> ab
(using S --> A and then A --> ^)
Grammar in formal languages
A formal language can be represented by a formal grammar consisting of the four-tuple (S, V, Σ, P). (Note that a grammar or an automaton is a finite representation, whether the language itself is finite or infinite; check figures one and two.)
Σ: finite set of language symbols.
In a grammar we commonly call this the finite set of terminals (in contrast with the variables V). Language symbols, or terminals, are the things from which the language's strings (sentences) are constructed. In your example the set of terminals Σ is {a, b}. In a natural language you can correlate terminals with vocabulary or dictionary words.
(By natural language I mean the languages we speak, such as Hindi or English.)
V: finite set of non-terminals.
A non-terminal, or 'variable', should always participate in the grammar's production rules; otherwise it counts as a useless variable, that is, a variable that derives no terminal string.
The ultimate aim of a grammar is to produce the language's strings in correct form, hence every variable should be useful in some way.
In a natural language you can correlate the variable set with nouns/verbs/tenses that define a specific semantic property of the language (e.g. a verb means eating/sleeping, a noun means he/she, etc.).
Note: in some books you will find V ∩ Σ = ∅, which means the variables are not terminals.
S: start variable (S ∈ V).
S is a special variable symbol called the 'start symbol'. We only consider a string to be in the language of the grammar, L(G), if it can be derived from the start variable S. If a string cannot be derived from S (even if it consists only of language symbols from Σ), the string is not in the language of the grammar; it belongs instead to the complement language L' = Σ* - L(G). (Compare: the complement language in the case of regular languages.)
P: finite set of production rules.
Production rules define replacement rules of the form α --> β, meaning that during the derivation of a string from S, α (the left-hand side) can be replaced by β (the right-hand side) at any time. (This is similar to how, in a natural language, a noun can be replaced by he or she, and a verb by eating, sleeping, and so on.)
Production rules define the formation rules of the language's sentences. Formal languages are similar to natural languages in having a pattern: certain things can occur only in certain forms, which in programming languages we call syntax. It is this ability of grammars that makes them usable for syntax checking, i.e. parsing.
Note: in α --> β, both α and β consist of variables and terminals, i.e. strings in (V ∪ Σ)*, with the constraint that α must contain at least one variable (we can only replace a string containing a variable by the right-hand side of a rule; a terminal cannot be replaced by another terminal, or in other words a sentence cannot be replaced by another sentence).
Remember that a string can be in one of two forms, a sentential form or a sentence:
Sentence: all symbols are terminals (a sentence is either in L(G) or in the complement language L' = Σ* - L).
Sentential form: at least one symbol is a variable (not a language string, but a derivation string).
From #MAV (Thanks!!):
To represent the grammar of the above language L = {a^m b^n | m != n}, the 4-tuple is:
V = {S, A, B}
Σ = {a, b}
P = {S --> aSb | A | B, A --> aA | a, B --> bB | b}
S = S
Note: I generally use P for production rules; your book may use R for rules.
Terminology used in the theory of formal languages and automata
Capital letters are used for variables, e.g. S, A, B, in grammar construction.
Lowercase letters from the start of the alphabet are used for terminals (language symbols), for example a, b. (Sometimes digits such as 0, 1 are used; ^ is the null symbol.)
Lowercase letters from the end of the alphabet, such as w, x, y, z, are used for strings of terminals (you will find these notations in the pumping lemma, for example; these symbols stand for language strings or substrings).
α, β, γ are used for sentential forms.
Σ is used for the set of language symbols.
Γ is used for input or output tape symbols other than the language symbols.
^ is used for the null symbol; # or ☐ for the blank symbol in Turing machines and PDAs (^, #, ☐ are not language symbols).
ε is used for the empty string (which can be part of a language string; for example, { } is an empty body in the C language: while(1); and while(1){ } are both valid, the latter being a valid program with an empty statement).
∅ means the empty set, in set theory.
Φ, Ψ are used for substrings of sentential forms.
Note: ∅ means a set is empty, ε means a string is empty, and ^ means no symbol; don't mix them up, as they are semantically different.
There are no fixed rules for notation that I know of, but this is the commonly used terminology found in most standard books.