Why does a list of values not cool on the LLVM backend of K?

When trying to define the syntax for a Scheme-like language, I found that the definition kompiled with the Java backend
kompile --backend java scheme.k -d .
behaves differently from the one kompiled with the LLVM backend
kompile --backend llvm scheme.k -d .
Here's my code for scheme.k:
module SCHEME-COMMON
  imports DOMAINS-SYNTAX

  syntax Name ::= "+" | "-" | "*" | "/"
                | "display" | "newline"
  syntax Names ::= List{Name," "}

  syntax Exp ::= Int | Bool | String | Name
               | "[" Name Exps "]" [strict(2)]
  syntax Exps ::= List{Exp," "} [strict]

  syntax Val
  syntax Vals ::= List{Val," "}
  syntax Bottom
  syntax Bottoms ::= List{Bottom," "}

  syntax Pgm ::= Exp Pgm [strict(1)]
               | "eof"
endmodule

module SCHEME-SYNTAX
  imports SCHEME-COMMON
  imports BUILTIN-ID-TOKENS

  syntax Name ::= r"[a-z][_a-zA-Z0-9]*" [token, prec(2)]
                | #LowerId [token]
endmodule

module SCHEME-MACROS
  imports SCHEME-COMMON
endmodule

module SCHEME
  imports SCHEME-COMMON
  imports SCHEME-MACROS
  imports DOMAINS

  configuration <T color="yellow">
                  <k color="green"> $PGM:Pgm </k>
                  <env color="violet"> .Map </env>
                  <store color="white"> .Map </store>
                  <input color="magenta" stream="stdin"> .List </input>
                  <output color="brown" stream="stdout"> .List </output>
                </T>

  syntax Val ::= Int | Bool | String
  syntax Exp ::= Val
  syntax Exps ::= Vals
  syntax Vals ::= Bottoms
  syntax Exps ::= Names
  syntax Names ::= Bottoms
  syntax KResult ::= Vals | Val

  rule _:Val P:Pgm => P
    when notBool(P ==K eof)
  rule V:Val eof => V

  rule [+ I1 I2 Vals] => [+ (I1 +Int I2) Vals] [arith]
  rule [+ I .Vals] => I [arith]
  rule [- I1 I2 Vals] => [- (I1 -Int I2) Vals] [arith]
  rule [- I .Vals] => I [arith]
  rule [* I1 I2 Vals] => [* (I1 *Int I2) Vals] [arith]
  rule [* I .Vals] => I [arith]
  rule [/ I1 I2 Vals] => [/ (I1 /Int I2) Vals]
    when I2 =/=K 0 [arith]
  rule [/ I .Vals] => I [arith]

  rule <k> [newline .Exps] => "" ...</k>
       <output>... .List => ListItem("\n") </output> [io]
  rule <k> [display V:Val] => "" ...</k>
       <output>... .List => ListItem(V) </output> [io]
endmodule
and this is the test file I'm trying to run:
[display 8]
eof
Strangely, the version kompiled with the Java backend runs this test case normally, while the version kompiled with the LLVM backend gets stuck at
<k>
8 .Bottoms ~> #freezer[__]_SCHEME-COMMON_Exp_Name_Exps0_ ( display ) ~> #freezer___SCHEME-COMMON_Pgm_Exp_Pgm1_ ( eof )
</k>
What might be the reason? The version information for kompile is:
RV-K version 1.0-SNAPSHOT
Git revision: a7c2937
Git branch: UNKNOWN
Build date: Wed Feb 12 09:46:03 CST 2020

In the LLVM and Haskell backends, two productions are said to overload one another when they share the same arity and klabel attribute, all the argument sorts of one production are less than or equal to the corresponding argument sorts of the other, and the result sort of the first is less than the result sort of the other. Terms built from overloaded productions receive special consideration during matching: in your example, if the list of Exps and the list of Vals were declared to overload, then a pattern V:Vals would match the term V:Val, .Exps of sort Exps.
By default, the Java backend assumes that all List productions between sorts that have a subsort relationship overload. The LLVM and Haskell backends do not make this assumption, so your example will work if you give the same klabel attribute to your Exps list and your Vals list. We do not replicate the Java behavior in the LLVM backend because we have found that it tends to introduce serious ambiguity into grammars in places where you do not expect it.
For example:
module SCHEME-COMMON
  imports DOMAINS-SYNTAX

  syntax Name ::= "+" | "-" | "*" | "/"
                | "display" | "newline"
  syntax Names ::= List{Name," "} [klabel(exps)]

  syntax Exp ::= Int | Bool | String | Name
               | "[" Name Exps "]" [strict(2)]
  syntax Exps ::= List{Exp," "} [strict, klabel(exps)]

  syntax Val
  syntax Vals ::= List{Val," "} [klabel(exps)]
  syntax Bottom
  syntax Bottoms ::= List{Bottom," "} [klabel(exps)]

  syntax Pgm ::= Exp Pgm [strict(1)]
               | "eof"
endmodule
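With these shared klabels, the stuck term shown above should now cool: 8 .Bottoms is matched as a term of sort Vals, which is a KResult, so the display and Pgm freezers can apply their cooling rules.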

Related

K Framework: Cannot convert to subtype

I'm trying to evaluate Expressions to Values (Exps ::= Values) for function calls.
Here's a simple example:
module ERL-SYNTAX
  imports INT-SYNTAX
  imports STRING

  syntax Atom ::= "main" | "f"
  syntax Exp ::= Atom | Int
  syntax Exp ::= Exp "(" Exps ")" [seqstrict]
  syntax Exps ::= List{Exp, ","} [seqstrict]
endmodule

module ERL-CONFIGURATION
  imports ERL-SYNTAX
  imports MAP

  syntax Value ::= Atom | Int | "{" Values "}"
  syntax Values ::= List{Value, ","}
  syntax Exp ::= Value
  syntax Exps ::= Values
  syntax KResult ::= Value
  syntax KResult ::= Values

  configuration <cfg color="yellow">
                  <k color="green"> $PGM:Exp </k>
                  <fundefs> // some default function definitions
                    .Map (f |-> 5 , .Exps
                          main |-> f ( 2 , 3 , .Exps ) , .Exps )
                  </fundefs>
                </cfg>
endmodule

module ERL
  imports ERL-SYNTAX
  imports ERL-CONFIGURATION

  // rule .Exps => .Values
  rule <k> F:Atom(_:Values) => L ...</k>
       <fundefs>... F |-> L ...</fundefs>
endmodule
This gets stuck at
.Exps ~> #freezer_(_)ERL-SYNTAX1 ( main )
So I tried adding the rule .Exps => .Values to evaluate main().
To me, the strange thing is that this time heating 3 works:
.Values ~> #freezer_,ERL-SYNTAX1 ( 3 ) ~> #freezer,_ERL-SYNTAX1 ( 2 ) ~> ...
becomes
3 , .Values ~> #freezer_,_ERL-SYNTAX1 ( 2 ) ~> ..
but then it gets stuck again.
How should I approach this problem?
Put the productions for Exps and Values in the same module and give them the same klabel attribute. This will make them overload one another, at which point the fact that .Values is a KResult should solve your problem.
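Concretely, that could look like the following sketch (the klabel name exps is arbitrary, and the Value productions are moved into ERL-SYNTAX so both list productions sit in the same module):
module ERL-SYNTAX
  imports INT-SYNTAX
  imports STRING

  syntax Atom ::= "main" | "f"
  syntax Exp ::= Atom | Int
  syntax Exp ::= Exp "(" Exps ")" [seqstrict]
  syntax Exps ::= List{Exp, ","} [seqstrict, klabel(exps)]

  // moved here from ERL-CONFIGURATION so the two lists overload
  syntax Value ::= Atom | Int | "{" Values "}"
  syntax Values ::= List{Value, ","} [klabel(exps)]
endmodule
With that change, the .Exps in front of the freezer should be matched as .Values, which is a KResult, so cooling can proceed without the explicit .Exps => .Values rule.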

Antlr production producing a bunch of single element arrays

My grammar is working, but I have a bunch of elements in the tree that are single element arrays, and I don't really understand why. I tried reading the information about visitors, but I'm pretty sure the "problem" is with the grammar and perhaps its verbosity. Does anything jump out here? Or perhaps I'm just visiting things incorrectly. In the example below I do not react to visitFnArgs or visitArgs, but just visitFunctionCall. Things like function arguments and statements seem to sometimes be wrapped in single element arrays.
grammar Txl;
root: program;
// High level language
program: stmt (NEWLINE stmt)* NEWLINE? EOF # Statement
;
stmt: require # Condition
| entry # CreateEntry
| assignment # Assign
;
require: REQUIRE valueExpression;
entry: (CREDIT | DEBIT) journal valueExpression (IF valueExpression)? (LPAREN 'id:' valueExpression RPAREN)?;
assignment: IDENT ASSIGN valueExpression;
journal: IDENT COLON IDENT;
valueExpression: expr # Expression;
expr: expr (MULT | DIV) expr # MulDiv
| expr (PLUS | MINUS) expr # AddSub
| expr MOD expr # Mod
| expr POW expr # Pow
| MINUS expr # Negative
| expr AND expr # And
| expr OR expr # Or
| NOT expr # Not
| expr EQ expr # Equality
| expr NEQ expr # Inequality
| expr (LTE | GTE) expr # CmpEqual
| expr (LT | GT) expr # Cmp
| expr QUESTION expr COLON expr # Ternary
| LPAREN expr RPAREN # Parens
| NUMBER # NumberLiteral
| IDENT LPAREN args RPAREN # FunctionCall
| IDENT # Identifier
| STRING_LITERAL # StringLiteral
;
fnArg: expr | journal;
args: (fnArg (',' fnArg)*)?;
// Reserved words
CREDIT: 'credit';
DEBIT: 'debit';
IF: 'if';
REQUIRE: 'require';
// Operators
MULT: '*';
DIV: '/';
MINUS: '-';
PLUS: '+';
POW: '^';
MOD: '%';
LPAREN: '(';
RPAREN: ')';
LBRACE: '[';
RBRACE: ']';
COMMA: ',';
EQ: '==';
NEQ: '!=';
GTE: '>=';
LTE: '<=';
GT: '>';
LT: '<';
ASSIGN: '=';
QUESTION: '?';
COLON: ':';
AND: 'and';
OR: 'or';
NOT: 'not';
HASH: '#';
NEWLINE : [\r\n];
WS: [ \t] + -> skip;
// Entities
NUMBER: ('0' .. '9') + ('.' ('0' .. '9') +)?;
IDENT: [a-zA-Z]+[0-9a-zA-Z]*;
EXTID: [a-zA-Z0-9-]+;
STRING_LITERAL : '"' (~('"' | '\\' | '\r' | '\n') | '\\' ('"' | '\\'))* '"';
This input:
require balance(assets:cash) + balance(assets:earnings) > AMT
Produces the following single element arrays:
SINGLE ELEMENT INSTRUCTION MathOperation (>)
SINGLE ELEMENT INSTRUCTION JournalReference { identifier: 'assets:cash' }
SINGLE ELEMENT INSTRUCTION JournalReference { identifier: 'assets:earnings' }
I wonder if part of my problem is that I'm not visiting things properly. Here's my Math visitor:
visitMath(ctx) {
  const visited = this.visitChildren(ctx);
  return new MathOperation(
    visited[0],
    ctx.getChild(1).getText(),
    visited[2],
  );
}
But I assume the problem is in the thing that contains the math operation, which I think is visitRequire:
visitRequire(ctx) {
  return new Condition(this.visitExpression(ctx.getChild(1)));
}
Or perhaps in visitValueExpression or visitCondition, which are not overridden in my visitor.
Really short answer: there's nothing wrong with single-element arrays. If a thing that can occur multiple times occurs only once, it still has to be in an array (or List), and that list will simply contain the one item.
ANTLR won't "unwrap" a single item to not be in an array. (That would only work in untyped languages or languages that allow union types, and it would be a pain to use, as you'd always have to check whether you had a "thing" or a list of "thing"s.)
Any time the "same type of thing" can exist more than once when matching a rule, ANTLR will make it available as an Array/List of that type.
Example:
journal: IDENT COLON IDENT;
has two IDENT tokens, so they'll be made accessible via the context as a List of that type
(in Java; I'm not positive which language you're using):
public List<TerminalNode> IDENT() { return getTokens(TxlParser.IDENT); }
Two of your examples are JournalReferences, so this would explain getting a list (if you use the ctx.IDENT() or ctx.getChild(n) accessors).
If I change the Journal rule to be:
journal: j1=IDENT COLON j2=IDENT;
I've given names to each IDENT so I get individual accessors for them (in addition to the IDENT() accessor that returns a list):
public static class JournalContext extends ParserRuleContext {
    public Token j1;
    public Token j2;
    public TerminalNode COLON() { return getToken(TxlParser.COLON, 0); }
    public List<TerminalNode> IDENT() { return getTokens(TxlParser.IDENT); }
    // ...
}
With the labels you can use ctx.j1 or ctx.j2 to get the individual tokens. (Of course you'd name them as appropriate to your use case.)
Since the FunctionCall alternative of the expr rule uses the args rule
args: (fnArg (',' fnArg)*)?;
and that rule can have more than one fnArg, it will necessarily be a list of fnArgs in the context:
public static class ArgsContext extends ParserRuleContext {
    public List<FnArgContext> fnArg() {
        return getRuleContexts(FnArgContext.class);
    }
    // ...
}
There's really not much you can do (or should want to do) to avoid having that in a List; there can be one or more of them.
Since none of the code you present shows where you're writing your output, it's a bit difficult to be more specific than that.
Your visitMath(ctx) example is also a bit perplexing, as math is not a rule in your grammar, so it would not exist in the Visitor interface.
I would suggest taking a closer look at the *Context classes that are generated for you. They provide utility methods that are much easier to use and read than getChild(n). getChild(n) is obscure, in that you have to refer back to the rule and diligently count rule members to determine which child to get, and it is also VERY brittle, in that n will change with any modification to your grammar. (Maintainers, or future you, will appreciate using the utility methods instead.)
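For instance, a visitor built on those labeled tokens might look like this Java sketch (JournalReference is a stand-in for whatever node class you actually construct, and the generated names assume the combined grammar Txl above):
public class AstBuilder extends TxlBaseVisitor<Object> {
    @Override
    public Object visitJournal(TxlParser.JournalContext ctx) {
        // The j1/j2 labels make the two IDENT tokens addressable by name,
        // so there is no child counting to break when the grammar changes.
        return new JournalReference(ctx.j1.getText() + ":" + ctx.j2.getText());
    }
}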

Grammar for string interpolation where malformed interpolations are treated as normal strings

Here is a subset of the language I want to parse:
A program consists of statements
A statement is an assignment: A = "b"
Assignment's left side is an identifier (all caps)
Assignment's right side is a string enclosed by quotation marks
A string supports string interpolation by inserting a bracket-enclosed identifier (A = "b[C]d")
So far this is straightforward enough. Here is what works:
Lexer:
lexer grammar string_testLexer;
STRING_START: '"' -> pushMode(STRING);
WS: [ \t\r\n]+ -> skip ;
ID: [A-Z]+;
EQ: '=';
mode STRING;
VAR_START: '[' -> pushMode(INTERPOLATION);
DOUBLE_QUOTE_INSIDE: '"' -> popMode;
REGULAR_STRING_INSIDE: ~('"'|'[')+;
mode INTERPOLATION;
ID_INSIDE: [A-Z]+;
CLOSE_BRACKET_INSIDE: ']' -> popMode;
Parser:
parser grammar string_testParser;
options { tokenVocab=string_testLexer; }
mainz: stat *;
stat: ID EQ string;
string: STRING_START string_part* DOUBLE_QUOTE_INSIDE;
string_part: interpolated_var | REGULAR_STRING_INSIDE;
interpolated_var: VAR_START ID_INSIDE CLOSE_BRACKET_INSIDE;
So far so good. However there is one more language feature:
if there is no valid identifier (that is, all caps) in the brackets, treat it as a normal string.
Eg:
A = "hello" => "hello"
B = "h[A]a" => "h", A, "a"
C="h [A] a" => "h ", A, " a"
D="h [A][V] a" => "h ", A, V, " a"
E = "h [A] [V] a" => "h ", A, " ", V, " a"
F = "h [aVd] a" => "h [aVd] a"
G = "h [Va][VC] a" => "h [Va]", VC, " a"
H = "h [V][][ff[Z]" => "h ", V, "[][ff", Z
I tried to replace REGULAR_STRING_INSIDE: ~('"'|'[')+; with just REGULAR_STRING_INSIDE: ~('"')+;, but that does not work in ANTLR: it results in matching all the lines above as plain strings.
Since there is no backtracking to enable in ANTLR4, I'm not sure how to overcome this and tell ANTLR that if it did not match the interpolated_var rule, it should go ahead and match REGULAR_STRING_INSIDE instead; it seems to always choose the latter.
I read that the lexer always matches the longest token, so I tried to lift REGULAR_STRING_INSIDE and VAR_START into parser rules, hoping that the order of alternatives in the parser would be honoured:
r: REGULAR_STRING_INSIDE;
v: VAR_START;
string: STRING_START string_part* DOUBLE_QUOTE_INSIDE;
string_part: v ID_INSIDE CLOSE_BRACKET_INSIDE | r;
That did not seem to make any difference at all.
I also read that ANTLR4 semantic predicates could help, but I have trouble coming up with the ones that need to be applied in this case.
How do I modify this grammar above so that it can match both interpolated bits, or treat them as strings if they are malformed?
Test input:
A = "hello"
B = "h[A]a"
C="h [A] a"
D="h [A][V] a"
E = "h [A] [V] a"
F = "h [aVd] a"
G = "h [Va][VC] a"
H = "h [V][][ff[Z]"
How I compile / test:
antlr4 string_testLexer.g4
antlr4 string_testParser.g4
javac *.java
grun string_test mainz st.txt -tree
I tried to replace REGULAR_STRING_INSIDE: ~('"'|'[')+; with just REGULAR_STRING_INSIDE: ~('"')+;, but that does not work in ANTLR. It results in matching all the lines above as strings.
Correct, ANTLR tries to match as much as possible. So ~('"')+ will be far too greedy.
I also read that antlr4 semantic predicates could help.
Only use predicates as a last resort. It introduces target specific code in your grammar. If it's not needed (which in this case it isn't), then don't use them.
Try something like this:
REGULAR_STRING_INSIDE
: ( ~( '"' | '[' )+
| '[' [A-Z]* ~( ']' | [A-Z] )
| '[]'
)+
;
The rule above would read as:
match any char other than " or [ once or more
OR match a [ followed by zero or more capitals, followed by any char other than ] or a capital (your [Va and [aVd cases)
OR match an empty block, []
And match one of these 3 alternatives above once or more to create a single REGULAR_STRING_INSIDE.
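For instance, hand-tracing input G = "h [Va][VC] a" against this rule (a sketch of what the lexer should do): the first alternative matches "h ", the second matches "[Va" (the lowercase a stops the run of capitals), and the first matches "]" again, yielding the single token "h [Va]". At "[VC]" the second alternative cannot match, because VC is immediately followed by "]", so VAR_START, ID_INSIDE and CLOSE_BRACKET_INSIDE take over, and a final REGULAR_STRING_INSIDE matches " a". That is exactly the "h [Va]", VC, " a" from the expected outputs.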
And if a string can end with one or more [, you may also want to do this:
DOUBLE_QUOTE_INSIDE
: '['* '"' -> popMode
;

GOLD Parser comment grammar

I'm having some trouble with comment blocks in my grammar. The syntax is fine, but Step 3 (the DFA scanner) is complaining about the way I'm going about it.
The language I'm trying to parse looks like this:
{statement}{statement} etc.
Within each statement can be a couple of different types of comments:
{% This is a comment.
It can contain multiple lines
and continues until the statement end}
{statement REM This is a comment.
It can contain multiple lines
and continues until the statement end}
This is a simplified grammar that displays the problem I'm running into:
"Start Symbol" = <Program>
{String Chars} = {Printable} + {HT} - ["\]
StringLiteral = '"' ( {String Chars} | '\' {Printable} )* '"'
Comment Start = '{%'
Comment End = '}'
Comment Block @= { Ending = Closed } ! Eat the } and produce an empty statement
!Comment @= { Type = Noise } ! Implied by GOLD
Remark Start = 'REM'
Remark End = '}'
Remark Block @= { Ending = Open } ! Don't eat the }; the statement expects it
Remark @= { Type = Noise }
<Program> ::= <Statements>
<Statements> ::= '{' <Statement> '}' <Statements> | <>
<Statement> ::= StringLiteral
Step 3 is complaining about the } in <Statements> and the } for the End of the lexical group.
Anyone know how to accomplish what I need?
[Edit]
I got the REM portion working with the following:
{Remark Chars} = {Printable} + {WhiteSpace} - [}]
Remark = 'REM' {Remark Chars}* '}'
<Statements> ::= <Statements> '{' <Statement> '}'
              |  <Statements> '{' <Statement> <Remark Stmt>
              |  <>
<Remark Stmt> ::= Remark
This is actually ideal, since Remarks are not necessarily noise to me.
Still having issues with the comment lexical group. I'll look at solving in the same way.
I don't think capturing the REM comment with a lexical group is possible.
I think you need to define a new terminal like this:
Remark = 'REM' ({Printable} - '}')*
This, however, means that you need to be able to handle this new terminal in your productions.
E.g., from:
<CurlyStatement> ::= '{' <Statement> '}'
To:
<CurlyStatement> ::= '{' <Statement> '}'
                  |  '{' <Statement> Remark '}'
I haven't checked the syntax in the above examples, but I hope you get my idea.

ANTLR: how to parse a region within matching brackets with a lexer

I want to parse something like this in my lexer:
( begin expression )
where expressions are also surrounded by brackets. It isn't important what is in the expression; I just want to have everything between the (begin and the matching ) as a token. An example would be:
(begin
(define x (+ 1 2)))
So the text of the token should be (define x (+ 1 2))).
Something like
PROGRAM : LPAREN BEGIN .* RPAREN;
does (obviously) not work, because as soon as the lexer sees a ')', it thinks the rule is over, but I need the matching bracket for this.
How can I do that?
Inside lexer rules, you can invoke rules recursively. So, that's one way to solve this. Another approach would be to keep track of the number of open- and close parenthesis and let a gated semantic predicate loop as long as your counter is more than zero.
A demo:
T.g
grammar T;
parse
: BeginToken {System.out.println("parsed :: " + $BeginToken.text);} EOF
;
BeginToken
@init{int open = 1;}
    : '(' 'begin' ( {open > 0}?=>      // keep repeating `( ... )` as long as open > 0
                    ( ~('(' | ')')     // match anything other than a parenthesis
                    | '(' {open++;}    // match a '(' and increase the counter `open`
                    | ')' {open--;}    // match a ')' and decrease the counter `open`
                    )
                  )*
    ;
Main.java
import org.antlr.runtime.*;

public class Main {
    public static void main(String[] args) throws Exception {
        String input = "(begin (define x (+ (- 1 3) 2)))";
        TLexer lexer = new TLexer(new ANTLRStringStream(input));
        TParser parser = new TParser(new CommonTokenStream(lexer));
        parser.parse();
    }
}
java -cp antlr-3.3-complete.jar org.antlr.Tool T.g
javac -cp antlr-3.3-complete.jar *.java
java -cp .:antlr-3.3-complete.jar Main
parsed :: (begin (define x (+ (- 1 3) 2)))
Note that you'll need to beware of string literals inside your source that might include parentheses:
BeginToken
@init{int open = 1;}
    : '(' 'begin' ( {open > 0}?=>            // ...
                    ( ~('(' | ')' | '"')     // ...
                    | '(' {open++;}          // ...
                    | ')' {open--;}          // ...
                    | '"' ...                // TODO: define a string literal here
                    )
                  )*
    ;
or comments that may contain parentheses.
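For example, assuming the source language uses ;-style line comments (an assumption here; adjust to your actual comment syntax), you could exclude ';' from the catch-all and add an alternative that consumes the comment, so any parentheses inside it are never counted:
( ~('(' | ')' | '"' | ';')   // ...
| '(' {open++;}              // ...
| ')' {open--;}              // ...
| '"' ...                    // TODO: define a string literal here
| ';' ~('\r' | '\n')*        // a ;-style line comment: parentheses inside are ignored
)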
The suggestion with the predicate uses some language-specific code (Java, in this case). An advantage of calling a lexer rule recursively is that you don't have custom code in your lexer:
BeginToken
    : '(' Spaces? 'begin' Spaces? NestedParens Spaces? ')'
    ;

fragment NestedParens
    : '(' ( ~('(' | ')') | NestedParens )* ')'
    ;

fragment Spaces
    : (' ' | '\t')+
    ;