How do I make an item optional or repeatable in a K syntax rule? - kframework

How can I convert the EBNF rules below to K Framework syntax?
An element can be repeated zero or more times:
items ::= {"," item}*
For now, I am using a List from the Domain module. But inline List declarations are not allowed, like this one:
syntax Foo ::= Stmt List{Id, ""}
For now, I have to create a new syntax rule for the listed item to counter the problem:
syntax Ids ::= List{Id, ""}
syntax Foo ::= Stmt Ids
Is there another way to counter this creation of a new rule?
An element can appear zero or one time. In other words it can be optional:
array-decl ::= <variable> "[" {Int}? "]"
Where we want to accept: a[4] and a[]. For now, to work around this, I create 2 rules, where one branch has the item and the other does not. But this solution duplicates rules unnecessarily, in my opinion.
An element can appear one or more times:
e ::= {a-z}+
Where we accept any non-zero-length sequence of lowercase letters. So far, I haven't found a way to simulate this.
Thank you in advance!

Inline zero-or-more productions have been restricted in the K-framework because the backend doesn't support terms with a variable number of arguments.
Therefore we ask that each list is declared as a separate production which will produce a cons list. Typical functional style matching can be used to process the AST.
Your typical EBNF extensions would look something like this:
{Id ","}* - syntax Ids ::= List{Id, ","}
{Id ","}+ - syntax Ids ::= NeList{Id, ","}
Id? - syntax OptionalId ::= "" [klabel(none)] | Id [klabel(some)]
The optional (?) production has the same problem. So we ask the user to specify labels that can be referenced by rules. Note that the empty production is not allowed in the semantics module because it may interfere with parsing the concrete syntax in rules. So you will need to create a COMMON module with most of the syntax, and a *-SYNTAX module with the productions that can interfere with rule parsing (empty productions and tokens that can conflict with K variables).
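To illustrate what "a cons list processed with functional-style matching" means, here is a minimal Python sketch (not K code). The names Nil, Cons, from_items and length are made up for this illustration; they only mimic the shape of the AST that a production such as syntax Ids ::= List{Id, ","} is described as producing above, with one case for the empty list and one for an element followed by the rest.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Nil:
    """The empty list."""


@dataclass(frozen=True)
class Cons:
    """One element followed by the rest of the list."""
    head: str
    tail: "IdList"


IdList = Union[Nil, Cons]


def from_items(items: List[str]) -> IdList:
    # Build the nested structure right to left: a, b, c -> Cons(a, Cons(b, Cons(c, Nil)))
    result: IdList = Nil()
    for item in reversed(items):
        result = Cons(item, result)
    return result


def length(ids: IdList) -> int:
    # Functional-style matching: one case per constructor.
    if isinstance(ids, Nil):
        return 0
    return 1 + length(ids.tail)
```

Every term here has a fixed number of arguments (zero for Nil, two for Cons), which is the property the restriction on inline lists is meant to preserve.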

No, there is currently no mechanism to do this without the extra production.
I typically do this as follows:
syntax MaybeFoo ::= ".MaybeFoo" | Foo
syntax ArrayDecl ::= Variable "[" MaybeFoo "]"
Non-empty lists may be declared similarly to lists:
syntax Bars ::= NeList{Bar, ","}
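As an illustration of the "explicit alternative for the missing item" idea, here is a minimal Python sketch (not K code). None plays the role of the ".MaybeFoo" alternative; the function name parse_array_decl and the regular expression are assumptions made for this example only.

```python
import re
from typing import Optional, Tuple

# name "[" optional-integer "]"
ARRAY_DECL = re.compile(r"([A-Za-z_][A-Za-z0-9_]*)\[([0-9]*)\]")


def parse_array_decl(text: str) -> Tuple[str, Optional[int]]:
    match = ARRAY_DECL.fullmatch(text)
    if match is None:
        raise SyntaxError(f"not an array declaration: {text!r}")
    name, size = match.groups()
    # An empty size is represented explicitly, like the ".MaybeFoo" branch.
    return name, (int(size) if size else None)
```

Code that consumes the result then has to handle both cases explicitly, much like rules matching on the two alternatives of MaybeFoo.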

Did "!", "^" and "$" have a special meaning in ANTLR3?

I don't have any prior knowledge of ANTLR (I recently learned a little about ANTLR4), but I have to translate an old grammar to a newer version. Eclipse tells me that there are no viable alternatives for those characters and shows the syntax error " '!' came as a complete surprise to me".
I already deleted those characters and it does not seem to be a problem, but maybe they had a special function in ANTLR3.
Thanks in advance.
global_block:
DATABASE! IDENTIFIER!
| GLOBALS! define_section!+ END! GLOBALS!
| GLOBALS! STRING!
;
main_block: MAIN sequence? END em=MAIN
-> ^(MAIN MAIN '(' ')' sequence? $em)
;
^ and -> are related to tree rewriting, and ! excludes a token or rule from the generated tree: https://theantlrguy.atlassian.net/wiki/spaces/ANTLR3/pages/2687090/Tree+construction
ANTLR4 does not support it (v4 has listeners and visitors for tree traversal, but no rewriting anymore). Just remove all of these ! and -> ... in parser rules (do not remove the -> ... inside lexer rules like -> channel(...), which is still supported in v4).
So in your case, these rules would be valid in ANTLR4:
global_block
: DATABASE IDENTIFIER
| GLOBALS define_section+ END GLOBALS
| GLOBALS STRING
;
main_block
: MAIN sequence? END MAIN
;
The $ can still be used in ANTLR4: it is used to reference labeled sub-rules or tokens:
expression
: lhs=expression operator=(PLUS | MINUS) rhs=expression
| NUMBER
;
so that in an embedded code block, you can do: $lhs.someField.someMethod(). In your case, you can also just remove them because they are probably only used in the tree rewrite rules.
EDIT
kaby76 has a Github page with some instructions for converting grammars to ANTLR4: https://github.com/kaby76/AntlrVSIX/blob/master/doc/Import.md#antlr3
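The mechanical part of the conversion described in this answer (dropping the ! suffixes and the -> ^(...) rewrites from parser rules) could be sketched in Python roughly as follows. This is a rough helper written for illustration, not an official migration tool: the function names are made up, it only recognises rewrites that start with "-> ^(", and it would also strip a "!" that follows a letter inside a quoted literal, so the output should be reviewed by hand.

```python
import re

REWRITE_START = re.compile(r"->\s*\^\(")


def _end_of_rewrite(text: str, start: int) -> int:
    """Return the index just past the ")" closing a "^(" that ended at start."""
    depth, i = 1, start
    while i < len(text) and depth > 0:
        ch = text[i]
        if ch == "'":
            # Skip a quoted literal such as '(' so it does not affect the depth.
            i += 1
            while i < len(text) and text[i] != "'":
                i += 2 if text[i] == "\\" else 1
            i += 1
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        i += 1
    return i


def strip_tree_operators(rule: str) -> str:
    out, i = [], 0
    while i < len(rule):
        m = REWRITE_START.match(rule, i)
        if m:
            i = _end_of_rewrite(rule, m.end())  # drop the whole rewrite
        else:
            out.append(rule[i])
            i += 1
    text = "".join(out)
    text = re.sub(r"(?<=[A-Za-z0-9_])!", "", text)  # drop "!" suffixes
    text = re.sub(r"[ \t]+\n", "\n", text)          # trailing blanks
    return re.sub(r"\n{2,}", "\n", text)            # empty lines left behind
```

Lexer commands such as -> channel(...) do not start with "-> ^(" and are left alone. Labels such as em=MAIN are also kept, since labels remain legal in ANTLR4.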

When are K configuration cells type-checked?

It is a common K idiom to define a programming language's syntax with a top-sort of well-formed programs (e.g. Pgm) and then to restrict the <k> cell to have this sort in the configuration declaration using the special $PGM variable which is passed automatically by krun. This prevents users from executing programs with krun that are not well-formed. My question is:
1. Are the sorts of cells checked only at start-up time or after each rule evaluation?
2. Do different cells show different behavior depending on their identity (e.g. the <k> cell) or how they are typed (e.g. user-defined types versus builtin types)?
Here is a partial example to show what I mean:
configuration
<mylang>
<k> $PGM:Pgm </k>
<env> .Env:Env </env> // Env is a custom map structure defined for environments
<store> .Map </store> // For the store we use the K builtin Map
...
</mylang>
For the <k> cell, I conclude that it is definitely only checked at start-up time, since program evaluation typically tears a program apart into an expression and a continuation (e.g. ADD ~> ...) which cannot have the sort Pgm anymore because ~> is builtin.
So, elaborating on questions (1-2) above, is the <k> cell exceptional in this sense?
Each rule is sort-checked at kompile time to be sort-preserving, so it's not needed to check this at runtime. If something of the correct sort goes in, something of the correct sort comes out.
The <k> cell gets sort K, at least in this definition, for example: https://github.com/kframework/evm-semantics/blob/272608d70f363ed3d8d921887b98a26102a03032/evm.md#configuration
It results in a compiled.txt (found at .build/defn/java/driver-kompiled/compiled.txt) which looks like:
...
syntax KCell ::= "project:KCell" "(" K ")" [function, projection]
syntax KCell ::= "initKCell" "(" Map ")" [function, initializer, noThread]
syntax KCell ::= "<k>" K "</k>" [cell, cellName(k), contentStartColumn(7), contentStartLine(31), format(%1%i%n%2%d%n%3), maincell, org.kframework.definition.Production(syntax #RuleContent ::= #RuleBody [klabel(#ruleNoConditions), symbol])]
...
But other cells get more specific sorts:
...
syntax JumpDestsCell ::= "project:JumpDestsCell" "(" K ")" [function, projection]
syntax JumpDestsCell ::= "initJumpDestsCell" [function, initializer, noThread]
syntax JumpDestsCell ::= "<jumpDests>" Set "</jumpDests>" [cell, cellName(jumpDests), contentStartColumn(7), contentStartLine(31), format(%1%i%n%2%d%n%3), org.kframework.definition.Production(syntax #RuleContent ::= #RuleBody [klabel(#ruleNoConditions), symbol])]
...
I'm not sure how K decides that the <k> cell needs to get sort K, but I don't think it's based on analyzing the rules. I think it's likely that it sees $PGM in that cell, so it adds the maincell attribute you see and gives it sort K. Everything is a subsort of K.
I'm fairly certain it's not any $ variable in the configuration that gives it sort K, because the <chainID> cell in KEVM gets these declarations:
...
syntax ChainIDCell ::= "project:ChainIDCell" "(" K ")" [function, projection]
syntax ChainIDCell ::= "initChainIDCell" "(" Map ")" [function, initializer, noThread]
syntax ChainIDCell ::= "<chainID>" Int "</chainID>" [cell, cellName(chainID), contentStartColumn(7), contentStartLine(31), format(%1%i%n%2%d%n%3), org.kframework.definition.Production(syntax #RuleContent ::= #RuleBody [klabel(#ruleNoConditions), symbol])]
...
Note that there isn't very much special about the _~>_ operator. It's declared here: https://github.com/kframework/k/blob/135469ea0ebea96dacf0f9a49261ff1171440c20/k-distribution/include/kframework/builtin/kast.k#L57

Antlr4 grammar - Allow variable names with spaces

I am new to ANTLR and I want to write a compiler for a custom programming language that has variable names with spaces. The following is sample code:
SET Variable with a Long Name TO FALSE
SET Variable with Numbers 1 2 3 in the Name TO 3 JUN 1990
SET Variable with Symbols # %^& TO "A very long text string"
Variable rules:
Can contain white spaces
Can contain special symbols
I want to write the compiler in javascript. Following is my grammar:
grammar Foo;
compilationUnit: stmt*;
stmt:
assignStmt
| invocationStmt
;
assignStmt: SET ID TO expr;
invocationStmt: name=ID ((expr COMMA)* expr)?;
expr: ID | INT | STRING;
COMMA: ',';
SAY: 'say';
SET: 'set';
TO: 'to';
INT: [0-9]+;
STRING: '"' (~('\n' | '"'))* '"';
ID: [a-zA-Z_] [ a-zA-Z0-9_]*;
WS: [ \n\t\r]+ -> skip;
I tried supplying input source code as:
"set variable one to 1".
But got the error "Undefined token identifier".
Any help is greatly appreciated.
ID: [a-zA-Z_] [ a-zA-Z0-9_]*;
will match "set variable one to 1". Like most lexical analysers, ANTLR's scanners greedily match as much as they can. set doesn't get matched even though it has a specific pattern. (And even if you managed that, "variable one to 1" would match on the next token; the match doesn't stop just because to happens to appear.)
The best way to handle multi-word variable names is to treat them as multiple words. That is, recognise each word as a separate token, and recognise an identifier as a sequence of words. That has the consequence that the same two words written with different amounts of whitespace between them end up being the same identifier, but IMHO, that's a feature, not a bug.
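To make the "identifier as a sequence of words" idea concrete, here is a small Python sketch of a hand-written tokenizer and parser for the set ... to ... statement only. It is not ANTLR output; the names tokenize and parse_assignment are made up, and it assumes the first standalone word "to" after "set" ends the variable name (so a name cannot itself contain the word "to").

```python
import re
from typing import List, Tuple

# A token is either a double-quoted string or a run of non-space characters.
TOKEN = re.compile(r'"[^"\n]*"|\S+')


def tokenize(line: str) -> List[str]:
    return TOKEN.findall(line)


def parse_assignment(line: str) -> Tuple[str, List[str]]:
    tokens = tokenize(line)
    if not tokens or tokens[0].lower() != "set":
        raise SyntaxError("expected 'set'")
    # Keywords are matched per word, so they cannot be swallowed by a name.
    to_index = next(
        (i for i, tok in enumerate(tokens) if i > 0 and tok.lower() == "to"),
        None,
    )
    if to_index is None:
        raise SyntaxError("expected 'to'")
    name_words = tokens[1:to_index]
    value_tokens = tokens[to_index + 1:]
    if not name_words or not value_tokens:
        raise SyntaxError("expected a variable name and a value")
    # The identifier is the sequence of words, normalised to single spaces.
    return " ".join(name_words), value_tokens
```

For "set variable one to 1" this yields the name "variable one" and the value tokens ["1"]. In an ANTLR grammar the equivalent would be a lexer rule for a single word without spaces and a parser rule that accepts one or more such words.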

Parsing a command with an ANTLR3 grammar fails if the command contains a word that is declared as a lexer rule

I am facing a problem while parsing some commands with a parser that I have implemented using ANTLR3. The parser fails on commands that contain any word that is declared as a lexer rule in the grammar.
For Example take a look following grammar:
show :
SHOW TABLES '[' projectName? tableName']' -> ^(SHOW TABLES_ ^(PROJECT_NAME projectName)? ^(DATASET_TABLE tableName));
SHOW : S H O W;
If I try to parse the command 'SHOW TABLES [sample-project:SHOW]' then parsing fails. But if I change the SHOW word, it works.
SHOW TABLES [sample-project:SHOW] - this works.
I don't want to have to write the name as a string surrounded by quotes (").
Can anyone suggest solution? I am using ANTLR3.
Thanks in advance.
This is a typical effect of using a reserved word as an identifier. In ANTLR, when you define a reserved word like your SHOW rule, it is implicitly excluded from any identifier rule you might have defined after that keyword rule.
The solution to allow such keywords as identifiers in rules like your tableName is to make that rule accept certain (or all) keywords that could appear in that place (they will then not act as keywords). Example:
tableName:
IDENTIFIER
| SHOW
| <others go here>
;
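The effect and the fix can be reproduced outside ANTLR with a small hand-written lexer and parser. The Python sketch below is only an illustration: the token pattern, the ":" separator between project and table name, and the keywords_as_names switch are assumptions made for this example and are not taken from the grammar in the question.

```python
import re
from typing import List, Optional, Tuple

KEYWORDS = {"SHOW", "TABLES"}
TOKEN = re.compile(r"\s*(?:([A-Za-z][A-Za-z0-9_-]*)|([\[\]:]))")


def tokenize(text: str) -> List[Tuple[str, str]]:
    tokens, pos, text = [], 0, text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        word, punct = m.groups()
        if word is not None:
            # A keyword wins over IDENTIFIER, as in the lexer described above.
            upper = word.upper()
            tokens.append((upper if upper in KEYWORDS else "IDENTIFIER", word))
        else:
            tokens.append((punct, punct))
        pos = m.end()
    return tokens


def parse_show_tables(text: str, keywords_as_names: bool = True) -> Tuple[Optional[str], str]:
    tokens, pos = tokenize(text), 0

    def expect(kind: str) -> None:
        nonlocal pos
        if pos >= len(tokens) or tokens[pos][0] != kind:
            raise SyntaxError(f"expected {kind} at token {pos}")
        pos += 1

    def name() -> str:
        nonlocal pos
        if pos < len(tokens):
            kind, value = tokens[pos]
            # The fix: a name may be an IDENTIFIER or (optionally) a keyword.
            if kind == "IDENTIFIER" or (keywords_as_names and kind in KEYWORDS):
                pos += 1
                return value
        raise SyntaxError(f"expected a name at token {pos}")

    expect("SHOW")
    expect("TABLES")
    expect("[")
    project, table = None, name()
    if pos < len(tokens) and tokens[pos][0] == ":":
        pos += 1
        project, table = table, name()
    expect("]")
    return project, table
```

With keywords_as_names=False the call parse_show_tables("SHOW TABLES [sample-project:SHOW]") raises SyntaxError, which mirrors the failure in the question; with the default it returns ("sample-project", "SHOW").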

antlr4: need to convert sequences of symbols to characters in lexer

I am writing a parser for Wolfram Language. The language has a concept of "named characters", which are specified by a name delimited by \[ and ]. For example: \[Pi].
Suppose I want to specify a regular expression for an identifier. Identifiers can include named characters. I see two ways to do it: one is to have a preprocessor that would convert all named characters to their unicode representation, and two is to enumerate all possible named characters in their source form as part of the regular expression.
The second approach does not seem feasible because there are a lot of named characters. I would prefer to have ranges of unicode characters in my regex.
So I want to preprocess my token stream. In other words, it seems to me that the lexer needs to check if the named characters syntax is correct and then look up the name and convert it to unicode.
But if the syntax is incorrect or the name does not exist I need to tell the user about it. How do I propagate this error to the user and yet let antlr4 recover from the error and resume? Maybe I can sort of "pipe" lexers/parsers? (I am new to antlr).
EDIT:
In Wolfram Language I can have this string as an identifier: \[Pi]Squared. The part between brackets is called "named character". There is a limited set of named characters, each of which corresponds to a unicode code point. I am trying to figure out how to tokenize identifiers like this.
I could have a rule for my token like this (simplified to just a combination of named characters and ASCII characters):
NAME : ('\\[' [a-z]+ ']'|[a-zA-Z])+ ;
but I would like to check if the named character actually exists (and other attributes such as if it is a letter, but the latter part is outside of the scope of the question), so this regex won't work.
I considered making a list of allowed named characters and just making a long regex that enumerates all of them, but this seems ugly.
What would be a good approach to this?
END OF EDIT
A common approach is to write the lexer/parser to allow syntactically correct input and defer semantic issues to the analysis of the generated parse tree. In this case, the lexer can naively accept named characters:
NChar : NCBeg .*? RBrack ;
fragment NCBeg : '\\[' ;
fragment LBrack: '[' ;
fragment RBrack: ']' ;
Update
In the parser, allow the NChar's to exist in the parse-tree as discrete terminal nodes:
idents : ident+ ;
ident : NChar // named character string
| ID // simple character string?
| Literal // something quoted?
| ....
;
This makes analysis of the parse tree considerably easier: each ident context will contain only one non-null value for a discretely identifiable alt; and isolates analysis of all ordering issues to the idents context.
Update2
For an input \[Pi]Squared, the parse tree form that would be easiest to analyze would be an idents node with two well-ordered children, \[Pi] and Squared.
Best practice would be not to pack both children into the same token: you would just have to manually break the token text into the two parts later, to check whether it contains a valid named character and whether the particular sequence of parts is allowable.
No regex is going to allow conclusive verification of the named characters. That will require a list. Tightening the lexer definition of an NChar can, however, achieve a result equivalent to a regex:
NChar : NCBeg [A-Z][A-Za-z]+ RBrack ;
If the concern is that there might be a space after the named character, consider that this circumstance is likely better treated with a semantic warning as opposed to a syntactic error. Rather than skipping whitespace in the lexer, put the whitespace on the hidden channel. Then, in the verification analysis of each idents context, check the hidden channel for intervening whitespace and issue a warning as appropriate.
----
A parse-tree visitor can then examine, validate, and warn as appropriate regarding unknown or misspelled named characters.
To do the validation in the parser, if more desirable, use a predicated rule to distinguish known from unknown named characters:
@members {
ArrayList<String> keyList = .... // list of named chars
public boolean inList(String id) {
return keyList.contains(id) ;
}
}
nChar : known
| unknown
;
known : NChar { inList($NChar.getText()) }? ;
unknown : NChar { error("Unknown " + $NChar.getText()); } ;
The inList function could implement a distance metric to detect misspellings, but correcting the text directly in the parse-tree is a bit complex. Easier to do when implemented as a parse-tree decoration during a visitor operation.
Finally, a scrape and munge of the named characters into a usable map (both unicode and ascii) is likely worthwhile to handle both representations as well as conversions and misspelling.
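The "accept naively, validate later" approach can be sketched independently of ANTLR. The Python example below is only an illustration: NAMED_CHARS is a tiny hand-picked subset standing in for the full scraped table, expand_named_chars is a made-up name, and difflib's similarity matching stands in for the distance metric mentioned above. Unknown names are reported and left in place, so processing continues after an error.

```python
import difflib
import re
from typing import List, Tuple

# Tiny illustrative subset; the real table would be scraped from the documentation.
NAMED_CHARS = {"Pi": "\u03c0", "Alpha": "\u03b1", "Beta": "\u03b2"}

# Same shape as the lexer rule: "\[" letters "]"
NCHAR = re.compile(r"\\\[([A-Za-z]+)\]")


def expand_named_chars(text: str) -> Tuple[str, List[str]]:
    warnings: List[str] = []

    def replace(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name in NAMED_CHARS:
            return NAMED_CHARS[name]
        message = f"unknown named character {match.group(0)} at offset {match.start()}"
        hint = difflib.get_close_matches(name, list(NAMED_CHARS), n=1)
        if hint:
            message += f"; did you mean \\[{hint[0]}]?"
        warnings.append(message)
        return match.group(0)  # keep the source text and carry on

    return NCHAR.sub(replace, text), warnings
```

In an ANTLR-based tool the same lookup would run in a parse-tree visitor over the NChar terminal nodes, attaching the warning to the token's line and column instead of a character offset.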