I have syntax:
syntax Process ::= KVar "(" KVar ")" "." Process [binder]
| "new" KVar "." Process [binder]
syntax Program ::= KVar "(" KVarVec ")" "=" Process [binder]
syntax KVarVec ::= KVar | KVar "," KVarVec
The two syntax has three productions that bind differently:
a(x).P, where x is bound in P, but a is a name that isn't being bound by that term.
new a.P binds a in P like a lambda.
f(a,b,c) = P binds a vector a,b,c of KVar in P. Each KVar in the vector is supposed to be bound in P.
How can I tell binder to bind specific variables in a production? Is there something like binder(2) to tell it that the second KVar is supposed to be bound? what if its several KVars defined by another syntax?
Currently one of the limitations of the binder attribute is that the variable bound must be the first nonterminal in the production, and the term that it is bound in must be the last nonterminal. Feel free to make a feature request for the generalization you propose on GitHub and I'll get to it at some point. Might not be right away though.
Related
I am trying to understand this answer: https://stackoverflow.com/a/44180583/481061 and particularly this part:
if the first line of the statement is a valid statement, it won't work:
val text = "This " + "is " + "a "
+ "long " + "long " + "line" // syntax error
This does not seem to be the case for the dot operator:
val text = obj
.getString()
How does this work? I'm looking at the grammar (https://kotlinlang.org/docs/reference/grammar.html) but am not sure what to look for to understand the difference. Is it built into the language outside of the grammar rules, or is it a grammar rule?
It is a grammar rule, but I was looking at an incomplete grammar.
In the full grammar https://github.com/Kotlin/kotlin-spec/blob/release/grammar/src/main/antlr/KotlinParser.g4 it's made clear in the rules for memberAccessOperator and identifier.
The DOT can always be preceded by NL* while the other operators cannot, except in parenthesized contexts which are defined separately.
How can I convert this EBNF rules below with K Framework ?
An element can be used to mean zero or more of the previous:
items ::= {"," item}*
For now, I am using a List from the Domain module. But inline List declarations are not allowed, like this one:
syntax Foo ::= Stmt List{Id, ""}
For now, I have to create a new syntax rule for the listed item to counter the problem:
syntax Ids ::= List{Id, ""}
syntax Foo ::= Stmt Ids
Is there another way to counter this creation of a new rule?
An element can appear zero or one time. In other words it can be optional:
array-decl ::= <variable> "[" {Int}? "]"
Where we want to accept: a[4] and a[]. For now, to bypass this one I create 2 rules, where one branch has the item and the other not. But this solution duplicate rules in an unnecessary way in my opinion.
An element can appear one or more of the previous:
e ::= {a-z}+
Where we don't accept any non-zero length sequence of lower case letters. Right now, I didn't find a way to simulate this.
Thank you in advance!
Inline zero-or-more productions have been restricted in the K-framework because the backend doesn't support terms with a variable number of arguments.
Therefore we ask that each list is declared as a separate production which will produce a cons list. Typical functional style matching can be used to process the AST.
Your typical EBNF extensions would look something like this:
{Id ","}* - syntax Ids ::= List{Id, ","}
{Id ","}+ - syntax Ids ::= NeList{Id, ","}
Id? - syntax OptionalId ::= "" [klabel(none)] | Id [klabel(some)]
The optional (?) production has the same problem. So we ask the user to specify labels that can be referenced by rules. Note that the empty production is not allowed in the semantics module because it may interfere with parsing the concrete syntax in rules. So you will need to create a COMMON module with most of the syntax, and a *-SYNTAX module with the productions that can interfere with rule parsing (empty productions and tokens that can conflict with K variables).
No, there is currently no mechanism to do this without the extra production.
I typically do this as follows:
syntax MaybeFoo ::= ".MaybeFoo" | Foo
syntax ArrayDecl ::= Variable "[" MaybeFoo "]"
Non-empty lists may be declared similar to lists:
syntax Bars ::= NeList{Bar, ","}
It is a common K idiom to define a programming language's syntax with a top-sort of well-formed programs (e.g. Pgm) and then to restrict the <k> cell to have this sort in the configuration declaration using the special $PGM variable which is passed automatically by krun. This prevents users from executing programs with krun that are not well-formed. My question is:
Are the sort of cells checked only at start-up time or after each rule evaluation?
Do different cells show different behavior depending on their identity (e.g. the <k> cell) or how they are typed (e.g. user-defined types versus builtin types)?
Here is a partial example to show what I mean:
configuration
<mylang>
<k> $PGM:Pgm </k>
<env> .Env:Env </env> // Env is a custom map structure defined for environments
<store> .Map </store> // For the store we use the K builtin Map
...
</mylang>
For the <k> cell, I conclude that it is definitely only checked at start-up time, since program evaluation typically tears a program apart into an expression and a continuation (e.g. ADD ~> ...) which cannot have the sort Pgm anymore because ~> is builtin.
So, elaborating on questions (1-2) above, is the <k> cell exceptional in this sense?
Each rule is sort-checked at kompile time to be sort-preserving, so it's not needed to check this at runtime. If something of the correct sort goes in, something of the correct sort comes out.
The <k> cell gets sort K, at least for example, in this definition: https://github.com/kframework/evm-semantics/blob/272608d70f363ed3d8d921887b98a26102a03032/evm.md#configuration
it results in compiled.txt (found at .build/defn/java/driver-kompiled/compiled.txt) which looks like:
...
syntax KCell ::= "project:KCell" "(" K ")" [function, projection]
syntax KCell ::= "initKCell" "(" Map ")" [function, initializer, noThread]
syntax KCell ::= "<k>" K "</k>" [cell, cellName(k), contentStartColumn(7), contentStartLine(31), format(%1%i%n%2%d%n%3), maincell, org.kframework.definition.Production(syntax #RuleContent ::= #RuleBody [klabel(#ruleNoConditions), symbol])]
...
But other cells get more specific sorts:
...
syntax JumpDestsCell ::= "project:JumpDestsCell" "(" K ")" [function, projection]
syntax JumpDestsCell ::= "initJumpDestsCell" [function, initializer, noThread]
syntax JumpDestsCell ::= "<jumpDests>" Set "</jumpDests>" [cell, cellName(jumpDests), contentStartColumn(7), contentStartLine(31), format(%1%i%n%2%d%n%3), org.kframework.definition.Production(syntax #RuleContent ::= #RuleBody [klabel(#ruleNoConditions), symbol])]
...
I'm not sure how K decides that the <k> cell needs to get sort K, but I don't think it's based on analyzing the rules. I think it's likely that it sees $PGM in that cell, so it adds the maincell attribute you see and gives it sort K. Everething is a subsort of K.
I'm fairly certain it's not any $ variable in the configuration that gives it sort K, because the <chainID> cell in KEVM gets these declarations:
...
syntax ChainIDCell ::= "project:ChainIDCell" "(" K ")" [function, projection]
syntax ChainIDCell ::= "initChainIDCell" "(" Map ")" [function, initializer, noThread]
syntax ChainIDCell ::= "<chainID>" Int "</chainID>" [cell, cellName(chainID), contentStartColumn(7), contentStartLine(31), format(%1%i%n%2%d%n%3), org.kframework.definition.Production(syntax #RuleContent ::= #RuleBody [klabel(#ruleNoConditions), symbol])]
...
Note that there isn't very much special about the _~>_ operator. It's declared here: https://github.com/kframework/k/blob/135469ea0ebea96dacf0f9a49261ff1171440c20/k-distribution/include/kframework/builtin/kast.k#L57
I'm trying to write my own type inference algorithm for a toy language, but I'm running into a wall - I think algorithm W can only be used for excessively general types.
Here are the expressions:
Expr ::= EAbs String Expr
| EApp Expr Expr
| EVar String
| ELit
| EConc Expr Expr
The typing rules are straightforward - we proceed to use type variables for abstraction and application. Here are all possible types:
Type ::= TVar String
| TFun Type Type
| TMono
As you might have guessed, ELit : TMono, and more specifically, EConc :: TMono → TMono → TMono.
My issue comes from doing the actual type inference. When recursing down an expression structure, the general technique when seeing an EAbs is to generate a fresh type variable representing the newly bound variable, replace any occurrences of typing in our context with the (String : TVar fresh) judgment, then continue down the expression.
Now, when I hit EConc, I was thinking about taking the same approach - replace the free expression variables of the sub expressions with TMon in the context, then type-infer the sub expressions, and take the most-general unifier of the two results as the main substitution to return. However, when I try this with an expression like EAbs "x" $ EConc ELit (EVar "x"), I get the incorrect TFun (TVar "fresh") TMon.
You need to use mgu to coerce sub-expressions. If you directly manipulate the context to affect sub expressions, you don't know how that affects earlier types. Use mgu to get the substitution that unifies sub expressions to TMon, then compose that substitution in the result.
I'm working on a small compiler in order to get a greater appreciation of the difficulties of creating one's own language. Right now I'm at the stage of adding pointer functionality to my grammar but I got a reduce/reduce conflict by doing it.
Here is a simplified version of my grammar that is compilable by bnfc. I use the happy parser generator and that's the program telling me there is a reduce/reduce conflict.
entrypoints Stmt ;
-- Statements
-------------
SDecl. Stmt ::= Type Ident; -- ex: "int my_var;"
SExpr. Stmt ::= Expr; -- ex: "printInt(123); "
-- Types
-------------
TInt. Type ::= "int" ;
TPointer. Type ::= Type "*" ;
TAlias. Type ::= Ident ; -- This is how I implement typedefs
-- Expressions
--------------
EMult. Expr1 ::= Expr1 "*" Expr2 ;
ELitInt. Expr2 ::= Integer ;
EVariable. Expr2 ::= Ident ;
-- and the standard corecions
_. Expr ::= Expr1 ;
_. Expr1 ::= Expr2 ;
I'm in a learning stage of how grammars work. But I think I know what happens. Consider these two programs
main(){
int a;
int b;
a * b;
}
and
typedef int my_type;
main(){
my_type * my_type_pointer_variable;
}
(The typedef and main(){} part isn't relevant and in my grammar. But they give some context)
In the first program I wish it would parse a "*" b as Stmt ==(SExpr)==> Expr ==(EMult)==> Expr * Expr ==(..)==> Ident "*" Ident, that is to essentially start stepping using the SExpr rule.
At the same time I would like my_type * my_type_pointer_variable to be expanded using the rules. Stmt ==(SDecl)==> Type Ident ==(TPointer)==> Type "*" Ident ==(TAlias)==> Ident "*" Ident.
But the grammar stage have no idea if an identifier originally is a type alias or a variable.
(1) How can I get rid of the reduce/reduce conflict and (2) am I the only one having this issue? Is there an obvious solution and how does the c grammar resolve this issue?
So far I have successfully just been able to change the syntax of my language by using "&" or some other symbol instead of "*", but that's very undesirable. Also I cannot make sense from various public c grammars and tried to see why they don't have this issue but I have had no luck in this.
And last, how do I resolve issues like these on my own? All I understood from happys more verbose output is how the conflict happens, is cleverness the only way to work around these conflicts? I'm afraid I'll stumble on even more issues for example when introducing EIndir. Expr = '*' Expr;
The usual way this problem is dealt with in C parsers is something generally called "the lexer feedback hack". Its a 'hack' in the sense that it doesn't deal with it in the grammar at all; instead, when the lexer recognizes an identifier, it classifies that identifier as either a typename or a non-typename, and returns a different token for each case (usually designated 'TypeIdent' for an identifier that is a typename and simply 'Ident' for any other). The lexer makes this selection by looking at the current state of the symbol table, so it sees all the typedefs that have occurred prior to the current point in the parse, but not typedefs that are after the current point. This is why C requires that you declare typedefs before their first use in each compilation unit.