I want to create a dsl to define an attributed grammar. Here is an example grammar specification written in my dsl:
RootSymbol -> AdditiveValue {
impl = {
1.calc
"out " 1.result;
}
}
AdditiveValue -> AdditiveValue "+" Number {
result = #0
calc = {
1.calc
3.calc
"add " 0.result " " 1.result " " 3.result;
}
}
AdditiveValue -> AdditiveValue "-" Number {
result = #0
calc = {
1.calc
3.calc
"sub " 0.result " " 1.result " " 3.result;
}
}
AdditiveValue -> Number {
result = 1.result
calc = 1.calc
}
Number -> "0|[1-9][0-9]*" {
result = #0
calc = {
"set " 0.result " " 1.value;
}
}
Input:
100 + 42 - 63 + 55
Output:
set 0 100
set 1 42
add 2 0 1
set 3 63
sub 4 2 3
set 5 55
add 6 4 5
out 6
Explanation:
The grammar consists of three non-terminal symbols:
RootSymbol has one rule
AdditiveValue has three rules, which allow deriving sentences like the input above
Number has one rule, which maps to the regex that matches a single number
The grammar specification also defines how the input is translated to the output: Number just emits a set command; AdditiveValue emits the output of both operands followed by an add or sub command; and RootSymbol emits the output of its AdditiveValue followed by the out command.
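To make the intended translation concrete, here is a minimal Python sketch of what the example grammar computes. It is not the DSL or the external compiler, only a hand-written stand-in. It assumes that #0 hands out a fresh slot number the first time a node's result is needed (my reading of the sample output), and it replaces the left-recursive AdditiveValue rules with a loop. The name translate is hypothetical.

```python
import itertools
import re

def translate(source: str) -> list[str]:
    """Emit the set/add/sub/out commands for a flat sum such as '100 + 42 - 63'."""
    # Number uses the regex from the grammar; '+' and '-' are the only operators.
    tokens = re.findall(r"0|[1-9][0-9]*|[+-]", source)
    fresh = itertools.count()  # assumption: '#0' means "next unused slot number"
    out = []

    def number(text: str) -> int:
        slot = next(fresh)
        out.append(f"set {slot} {text}")
        return slot

    left = number(tokens[0])
    # Each (operator, number) pair corresponds to one AdditiveValue rule application.
    for op, text in zip(tokens[1::2], tokens[2::2]):
        right = number(text)
        slot = next(fresh)
        command = "add" if op == "+" else "sub"
        out.append(f"{command} {slot} {left} {right}")
        left = slot
    out.append(f"out {left}")  # RootSymbol
    return out

print("\n".join(translate("100 + 42 - 63 + 55")))
```

Running it on the input above reproduces the eight output lines listed in the example.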
Now, I want to define an Xtext grammar for this dsl and have the following two questions:
Question 1
How can I split the declaration of an identifier? In the example above, I have three rules for the symbol AdditiveValue. I know that if I have one place where I declare a symbol, I can easily implement that in Xtext. However, in my dsl I have multiple places to declare one symbol because I need to be able to declare multiple rules per symbol.
Question 2
How can I reference a symbol by its position? In the example above (second rule), the expression 1.calc should reference the calc attribute of the first part of the production of the surrounding rule. To be clear, it should reference the calc attribute of the AdditiveValue symbol.
Any help is appreciated!
Update
Be sure to understand the difference between the grammar specification in Xtext and the grammar specification written in my dsl. They are two different things: I want to define a grammar in Xtext that matches a grammar specification written in my dsl. I want this only to get an editor for my dsl, not to use Xtext as a code generator. The grammar written in my dsl is used by an external compiler program to translate an input to an output. Also, the example grammar above is just that: a simple example of my dsl. A grammar could be much more complex and describe any context-free language.
I am trying to understand this answer: https://stackoverflow.com/a/44180583/481061 and particularly this part:
if the first line of the statement is a valid statement, it won't work:
val text = "This " + "is " + "a "
+ "long " + "long " + "line" // syntax error
This does not seem to be the case for the dot operator:
val text = obj
.getString()
How does this work? I'm looking at the grammar (https://kotlinlang.org/docs/reference/grammar.html) but am not sure what to look for to understand the difference. Is it built into the language outside of the grammar rules, or is it a grammar rule?
It is a grammar rule, but I was looking at an incomplete grammar.
In the full grammar https://github.com/Kotlin/kotlin-spec/blob/release/grammar/src/main/antlr/KotlinParser.g4 it's made clear in the rules for memberAccessOperator and identifier.
The DOT can always be preceded by NL*, while operators such as + cannot, except in parenthesized contexts, which are defined separately.
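As a rough illustration of that rule, the following Python sketch models only the effect described here: a newline ends a statement unless the next non-blank line starts with a dot. It is a toy model, not how the Kotlin parser is implemented, and the function name join_continuations is hypothetical.

```python
def join_continuations(source: str) -> list[str]:
    """Split source into statements, gluing lines that start with '.' to the previous one."""
    statements: list[str] = []
    for raw in source.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines are skipped, mirroring NL*
        if line.startswith(".") and statements:
            statements[-1] += line  # member access continues the previous statement
        else:
            statements.append(line)  # anything else (including '+ ...') starts a new one
    return statements

print(join_continuations('val text = obj\n    .getString()'))
print(join_continuations('val text = "a "\n    + "b"'))
```

The first call yields a single statement, while the second yields two, which is why the leading + line is not treated as a continuation.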
How can I express the EBNF rules below in the K Framework?
An element can be repeated zero or more times:
items ::= {"," item}*
For now, I am using a List from the Domain module. But inline List declarations are not allowed, like this one:
syntax Foo ::= Stmt List{Id, ""}
For now, I have to create a new syntax rule for the list to work around the problem:
syntax Ids ::= List{Id, ""}
syntax Foo ::= Stmt Ids
Is there another way that avoids creating a new rule?
An element can appear zero or one time. In other words it can be optional:
array-decl ::= <variable> "[" {Int}? "]"
Where we want to accept: a[4] and a[]. For now, to work around this I create two rules, one with the item and one without. But in my opinion this duplicates rules unnecessarily.
An element can appear one or more of the previous:
e ::= {a-z}+
Here we accept any non-empty sequence of lowercase letters. So far, I haven't found a way to express this.
Thank you in advance!
Inline zero-or-more productions have been restricted in the K-framework because the backend doesn't support terms with a variable number of arguments.
Therefore we ask that each list is declared as a separate production which will produce a cons list. Typical functional style matching can be used to process the AST.
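To illustrate what a cons list and functional-style matching look like, here is a small Python sketch. It is only an analogy and not K code; the names Cons, from_items, length and to_list are hypothetical, and None stands in for the empty list.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Cons:
    head: str
    tail: Optional["Cons"]  # None plays the role of the empty list

def from_items(items: list[str]) -> Optional[Cons]:
    """Build the cons list right to left, so the first item ends up at the head."""
    result: Optional[Cons] = None
    for item in reversed(items):
        result = Cons(item, result)
    return result

def length(ids: Optional[Cons]) -> int:
    # Two cases, as in a functional match: the empty list, or head plus tail.
    return 0 if ids is None else 1 + length(ids.tail)

def to_list(ids: Optional[Cons]) -> list[str]:
    return [] if ids is None else [ids.head] + to_list(ids.tail)

ids = from_items(["a", "b", "c"])
print(length(ids), to_list(ids))
```

Every node has exactly two fields, so no term ever needs a variable number of arguments.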
Your typical EBNF extensions would look something like this:
{Id ","}* - syntax Ids ::= List{Id, ","}
{Id ","}+ - syntax Ids ::= NeList{Id, ","}
Id? - syntax OptionalId ::= "" [klabel(none)] | Id [klabel(some)]
The optional (?) production has the same problem. So we ask the user to specify labels that can be referenced by rules. Note that the empty production is not allowed in the semantics module because it may interfere with parsing the concrete syntax in rules. So you will need to create a COMMON module with most of the syntax, and a *-SYNTAX module with the productions that can interfere with rule parsing (empty productions and tokens that can conflict with K variables).
No, there is currently no mechanism to do this without the extra production.
I typically do this as follows:
syntax MaybeFoo ::= ".MaybeFoo" | Foo
syntax ArrayDecl ::= Variable "[" MaybeFoo "]"
Non-empty lists may be declared similarly to lists:
syntax Bars ::= NeList{Bar, ","}
I'm trying to write a Raku grammar that can parse commands that ask for programming puzzles.
This is a simplified version just for my question, but the commands combine a difficulty level with an optional list of languages.
Sample valid input:
No language: easy
One language: hard javascript
Multiple languages: medium javascript python raku
I can get it to match one language, but not multiple languages. I'm not sure where to add the :g.
Here's an example of what I have so far:
grammar Command {
rule TOP { <difficulty> <languages>? }
token difficulty { 'easy' | 'medium' | 'hard' }
rule languages { <language>+ }
token language { \w+ }
}
multi sub MAIN(Bool :$test) {
use Test;
plan 5;
# These first 3 pass.
ok Command.parse('hard', :token<difficulty>), '<difficulty> can parse a difficulty';
nok Command.parse('no', :token<difficulty>), '<difficulty> should not parse random words';
# Why does this parse <languages>, but <language> fails below?
ok Command.parse('js', :rule<languages>), '<languages> can parse a language';
# These last 2 fail.
ok Command.parse('js', :token<language>), '<language> can parse a language';
# Why does this not match both words? Can I use :g somewhere?
ok Command.parse('js python', :rule<languages>), '<languages> can parse multiple languages';
}
This works, even though my test #4 fails:
my token wrd { \w+ }
'js' ~~ &wrd; #=> 「js」
Extracting multiple languages works with a regex using this syntax, but I'm not sure how to use that in a grammar:
'js python' ~~ m:g/ \w+ /; #=> (「js」 「python」)
Also, is there an ideal way to make the order unimportant so that difficulty could come anywhere in the string? Example:
rule TOP { <languages>* <difficulty> <languages>? }
Ideally, I'd like for anything that is not a difficulty to be read as a language. Example: raku python medium js should read medium as a difficulty and the rest as languages.
There are two things at issue here.
To specify a subrule in a grammar parse, the named argument is always :rule, regardless of whether it's declared in the grammar as a rule, token, method, or regex. Your first two tests pass only because the unknown :token named argument is ignored, so they run as full-grammar parses (that is, from TOP): 'hard' parses and 'no' does not.
That gets us:
ok Command.parse('hard', :rule<difficulty>), '<difficulty> can parse a difficulty';
nok Command.parse('no', :rule<difficulty>), '<difficulty> should not parse random words';
ok Command.parse('js', :rule<languages> ), '<languages> can parse a language';
ok Command.parse('js', :rule<language> ), '<language> can parse a language';
ok Command.parse('js python', :rule<languages> ), '<languages> can parse multiple languages';
# Output
ok 1 - <difficulty> can parse a difficulty
ok 2 - <difficulty> should not parse random words
ok 3 - <languages> can parse a language
ok 4 - <language> can parse a language
not ok 5 - <languages> can parse multiple languages
The second issue is how implied whitespace is handled in a rule. In a token, the following are equivalent:
token foo { <alpha>+ }
token bar { <alpha> + }
But in a rule, they would be different. Compare the token equivalents for the following rules:
rule foo { <alpha>+ }
token foo { <alpha>+ <.ws> }
rule bar { <alpha> + }
token bar { [<alpha> <.ws>] + }
In your case, you have <language>+, and since language is \w+, it's impossible to match two (because the first one will consume all the \w). Easy solution though, just change <language>+ to <language> +.
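For readers more familiar with conventional regexes, the difference can be approximated in Python. This is only an analogy: \s* is used here as a stand-in for Raku's <.ws>, which is not exactly the same thing, and the names TIGHT, SPACED and parses are hypothetical.

```python
import re

# Roughly 'rule languages { <language>+ }': whitespace is allowed only after the whole group.
TIGHT = re.compile(r"(?:\w+)+\s*")
# Roughly 'rule languages { <language> + }': whitespace is allowed after every word.
SPACED = re.compile(r"(?:\w+\s*)+")

def parses(pattern: re.Pattern, text: str) -> bool:
    """True if the pattern matches the entire text, like a successful .parse."""
    return pattern.fullmatch(text) is not None

print(parses(TIGHT, "js"))          # one word is fine either way
print(parses(TIGHT, "js python"))   # fails: nothing can consume the inner space
print(parses(SPACED, "js python"))  # succeeds: each word may be followed by whitespace
```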
To allow the <difficulty> token to float around, the first solution that jumps to my mind is to match it and bail in a <language> token:
token language { <!difficulty> \w+ }
<!foo> will fail if <foo> can match at that position. This works almost perfectly until you get a language like 'easyFoo'. The easy fix there is to ensure that the difficulty token always ends at a word boundary:
token difficulty {
[
| easy
| medium
| hard
]
>>
}
where >> asserts a word boundary on the right.
I am new to ANTLR and I want to write a compiler for a custom programming language that has variable names with spaces. Following is some sample code:
SET Variable with a Long Name TO FALSE
SET Variable with Numbers 1 2 3 in the Name TO 3 JUN 1990
SET Variable with Symbols # %^& TO "A very long text string"
Variable rules:
Can contain white spaces
Can contain special symbols
I want to write the compiler in javascript. Following is my grammar:
grammar Foo;
compilationUnit: stmt*;
stmt:
assignStmt
| invocationStmt
;
assignStmt: SET ID TO expr;
invocationStmt: name=ID ((expr COMMA)* expr)?;
expr: ID | INT | STRING;
COMMA: ',';
SAY: 'say';
SET: 'set';
TO: 'to';
INT: [0-9]+;
STRING: '"' (~('\n' | '"'))* '"';
ID: [a-zA-Z_] [ a-zA-Z0-9_]*;
WS: [ \n\t\r]+ -> skip;
I tried supplying input source code as:
"set variable one to 1".
But got the error "Undefined token identifier".
Any help is greatly appreciated.
ID: [a-zA-Z_] [ a-zA-Z0-9_]*;
will match "set variable one to 1". Like most lexical analysers, ANTLR's scanners greedily match as much as they can. set doesn't get matched even though it has a specific pattern. (And even if you managed that, "variable one to 1" would match on the next token; the match doesn't stop just because to happens to appear.)
The best way to handle multi-word variable names is to treat them as multiple words. That is, recognise each word as a separate token, and recognise an identifier as a sequence of words. That has the consequence that "two words" and "two   words" (the same words with different spacing) end up being the same identifier, but IMHO, that's a feature, not a bug.
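A minimal sketch of that approach, written in Python rather than as an ANTLR grammar, might look like the following. It assumes that the name runs from SET up to the first TO, that keywords are case-insensitive, and that a double-quoted string is a single token; parse_assignment is a hypothetical helper, not part of ANTLR.

```python
import re

# A quoted string is one token; otherwise every run of non-whitespace is a word.
TOKEN = re.compile(r'"[^"\n]*"|\S+')

def parse_assignment(line: str) -> tuple[str, str]:
    """Parse 'SET <name words> TO <value words>' into (name, value)."""
    tokens = TOKEN.findall(line)
    lowered = [token.lower() for token in tokens]
    if len(tokens) < 4 or lowered[0] != "set" or "to" not in lowered[1:]:
        raise ValueError("expected: SET <name words> TO <value>")
    to_index = lowered.index("to", 1)  # assumption: the first TO ends the name
    name_words = tokens[1:to_index]
    value_words = tokens[to_index + 1:]
    if not name_words or not value_words:
        raise ValueError("name and value must not be empty")
    # Joining with a single space normalises the identifier's spacing.
    return " ".join(name_words), " ".join(value_words)

print(parse_assignment("set variable one to 1"))
print(parse_assignment("set variable   one to 1"))
```

Both calls produce the same identifier, variable one, because the spacing between words is not part of the name.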
I'm hoping someone can help me understand a question I have. It's not homework; it's just an example question I am trying to work out. The problem is to define a grammar that generates all sums of any number of operands, for example 54 + 3 + 78 + 2 + 5 and so on. The way I found easiest to define the problem is:
non-terminal {S,B}
terminal {0..9,+,epsilon}
Rules:
S -> [0..9]S
S -> + B
B -> [0..9]B
B -> + S
S -> epsilon
B -> epsilon
epsilon is an empty string.
This seems like it should be correct to me as you could define the first number recursively with the first rule, then to add the next integer, you could use the second rule and then define the second integer using the third rule. You could then use the fourth rule to go back to S and define as many integers as you need.
This solution seems to me to be a regular grammar, as it obeys the rule A -> aB or A -> a, but the notes for this question say that it is not possible to define this problem using a regular grammar. Can anyone please explain to me why my attempt is wrong and why this needs to be context free?
Thanks.
Although it's not the correct definition, it's easier to think that for a language to be non-regular it would need to balance something (like parentheses).
Your problem can be solved using direct recursion only on the sides of the rules, not in the middle, so it can be solved using a regular language. (Again, this is not the correct definition, but it's easier to remember!)
For example, for a regular expression engine (like in Perl or JavaScript) one could easily write /(\d+)(\+(\d+))*/.
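The same pattern can be tried in Python. This sketch assumes the input contains no spaces around the + signs and uses [0-9] rather than \d to stay with ASCII digits; is_sum is a hypothetical helper name.

```python
import re

# One number, followed by zero or more "+ number" groups.
SUM = re.compile(r"[0-9]+(?:\+[0-9]+)*")

def is_sum(text: str) -> bool:
    """True if the whole text is a sum of one or more non-negative integers."""
    return SUM.fullmatch(text) is not None

print(is_sum("54+3+78+2+5"))
print(is_sum("54+"))
print(is_sum("+5"))
```

Since a plain regular expression with no backreferences or recursion recognises these strings, the language is regular.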
You could write it this way:
non-terminal {S,R,N,N'}
terminal {0..9,+,epsilon}
Rules:
S -> N R
S -> epsilon
N -> [0..9] N'
N' -> N
N' -> epsilon
R -> + N R
R -> epsilon
Which should work correctly.
The language is regular. A regular expression would be:
((0|1|2|...|9)*(0|1|2|...|9)+)*(0|1|2|...|9)*(0|1|2|...|9)
Terminals are: {0,1,2,...,9,+}
"|" means union and * stands for Star closure
If you need to have "(" and ")" in your language, then it will not be regular as it needs to match parentheses.
A sample context free grammar would be:
E->E+E
E->(E)
E->F
F-> 0F | 1F | 2F | ... | 9F | 0 | 1 | ... | 9