How to implement a string data container and use it to add two strings with Kotlin and Antlr - kotlin

I'm trying to make my own language parser using Kotlin and Antlr. I'm trying to implement a data container for the string data and have the code execute.
Code to be executed:
val program = """
x = "Hello";
y = "World";
// Expect "Hello World"
print(x ++ y);
"""
So far, my Kotlin backend is:
package backend
import org.antlr.v4.runtime.*
import mygrammar.*
abstract class Data
data class StringData(val value: String) : Data()
data class IntData(val value: Int) : Data()
class Context : HashMap<String, Data>()
abstract class Expr {
    abstract fun eval(scope: Context): Data
}
class Compiler : PLBaseVisitor<Expr>() {
}
My Antlr grammar is:
grammar PL;
@header {
package mygrammar;
}
program : statement* EOF
;
statement : assignment ';' # assignmentStatement
| expr ';' # exprStatement
;
assignment : 'let' ID '=' expr
;
expr : x=expr '+' y=expr # addExpr
| x=expr '-' y=expr # subExpr
| x=expr '*' y=expr # mulExpr
| x=expr '/' y=expr # divExpr
| '(' expr ')' # parenExpr
| value # valueExpr
;
value : NUMERIC # numericValue
| STRING # stringValue
| ID # idValue
;
NUMERIC : ('0' .. '9')+ ('.' ('0' .. '9')*)?
;
STRING : '"' ( '\\"' | ~'"' )* '"'
;
ID : ('a' .. 'z' | 'A' .. 'Z' | '_') ('a' .. 'z' | 'A' .. 'Z' | '0' .. '9' | '_')*
;
COMMENT : '/*' .*? '*/' -> skip
;
WHITESPACE : (' ' | '\t' | '\r' | '\n')+ -> skip
;
I've been trying to search for the next steps, but whatever I search always seems to give me results for how to compile Kotlin code, not how to compile your own code using Kotlin.

There are quite a few steps between having an ANTLR grammar and executing your code.
The very next step is to use ANTLR to generate the source code for a parser that recognizes input matching your grammar. Since Kotlin has good Java interop, you could simply generate and compile the Java target for your grammar. If you want to stick with Kotlin, there is a Kotlin target: ANTLR Kotlin. I've not used it myself, but I know the Strumenta Community, so I would expect it to be good, and it looks like they have a good intro to it.
Once you have compiled your generated parser, you should be able to find sample code for calling the parser on your source (program in your example). This will give you a ParseTree. ANTLR provides convenience classes in the form of Listeners and/or Visitors that make it easy to process the resulting parse tree. At this point, ANTLR has provided its value for you. ANTLR is a tool for generating source code for a parser to recognize your input. The ParseTree is the result of that parsing. From there, it is up to you to decide how to interpret that parse tree and execute the logic.
If your language does not get much more complicated, and is not particularly performance sensitive, you might get by with logic that visits the tree and performs the operations represented there (keeping a dictionary of values, interpreting and performing the operations in the ParseTree, etc.).
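As a rough illustration of the "visit the tree and keep a dictionary of values" idea, here is a minimal sketch in Python (used only for brevity; the shape maps onto the Data, StringData and Expr classes from the question). The node classes Literal, Var, Assign and Concat are hypothetical names, and the tree is built by hand here, where a real Visitor would build it from the ParseTree.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class StringData:
    value: str


@dataclass(frozen=True)
class IntData:
    value: int


Data = Union[StringData, IntData]
Scope = Dict[str, Data]  # plays the role of the question's Context class


class Expr:
    """Base class: every node knows how to evaluate itself in a scope."""

    def eval(self, scope: Scope) -> Data:
        raise NotImplementedError


@dataclass
class Literal(Expr):  # hypothetical node for a STRING or NUMERIC value
    data: Data

    def eval(self, scope: Scope) -> Data:
        return self.data


@dataclass
class Var(Expr):  # hypothetical node for an ID used as a value
    name: str

    def eval(self, scope: Scope) -> Data:
        return scope[self.name]


@dataclass
class Assign(Expr):  # hypothetical node for an assignment statement
    name: str
    value: Expr

    def eval(self, scope: Scope) -> Data:
        scope[self.name] = self.value.eval(scope)
        return scope[self.name]


@dataclass
class Concat(Expr):  # hypothetical node for the string "++" operator
    left: Expr
    right: Expr

    def eval(self, scope: Scope) -> Data:
        lhs, rhs = self.left.eval(scope), self.right.eval(scope)
        if not (isinstance(lhs, StringData) and isinstance(rhs, StringData)):
            raise TypeError("++ expects two strings")
        return StringData(lhs.value + rhs.value)


if __name__ == "__main__":
    scope: Scope = {}
    Assign("x", Literal(StringData("Hello"))).eval(scope)
    Assign("y", Literal(StringData("World"))).eval(scope)
    print(Concat(Var("x"), Var("y")).eval(scope).value)
```

Plain concatenation gives HelloWorld; if "Hello World" is the expected result, the space has to come from one of the operands or from how you define the operator.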
Your question suggests that you may expect ANTLR to be able to compile and execute your logic. That's just simply not the case. What I've outlined would be the next steps in a way forward, and there are many ways that you may choose to get to execution of the logic.
If you need an intro to ANTLR this one is pretty good: ANTLR Mega Tutorial
And this page links to many more Strumenta ANTLR articles

Related

What is the meaning of the ANTLR syntax in this grammar file?

I am trying to parse a file using ANTLR4 via Python. I am following a tutorial (https://faun.pub/introduction-to-antlr-python-af8a3c603d23); I am able to execute the code and get responses like the ones shown in the tutorial, but I'm failing to understand the logic of the grammar file.
grammar MyGrammer;
expr: left=expr op=('*'|'/') right=expr # InfixExpr
| left=expr op=('+'|'-') right=expr # InfixExpr
| atom=INT # NumberExpr
| '(' expr ')' # ParenExpr
| atom=HELLO # HelloExpr
| atom=BYE # ByeExpr
;
HELLO: ('hello'|'hi') ;
BYE : ('bye'| 'tata') ;
INT : [0-9]+ ;
WS : [ \t]+ -> skip ;
From my understanding, the constants (what I call them since they are all capitals) HELLO, BYE, INT, and WS define rules for what that set of text can contain. I think they relate to functions somehow, but I am not sure. So the HELLO function will be executed if the parser encounters something that says either 'hello' or 'hi'. The expr is what is confusing me.
expr: left=expr op=('*'|'/') right=expr # InfixExpr
| left=expr op=('+'|'-') right=expr # InfixExpr
| atom=INT # NumberExpr
| '(' expr ')' # ParenExpr
| atom=HELLO # HelloExpr
| atom=BYE # ByeExpr
;
HELLO: ('hello'|'hi') ;
BYE : ('bye'| 'tata') ;
INT : [0-9]+ ;
WS : [ \t]+ -> skip ;
When I run the command
antlr4 -Dlanguage=Python3 MyGrammer.g4 -visitor -o dist
it produces many files but the main one contains InfixExpr, NumberExpr, ParenExpr, HelloExpr, and ByeExpr. I can see that somehow the author knows that he is doing something with the constants HELLO, BYE, etc. Is there any documentation on the expr piece above and what do the keywords atom, left, right mean?
Any rule that begins with a capital letter (we often capitalize the entire rule name to make it obvious) is a Lexer rule.
Rules that begin with lower case letters are parser rules.
It’s VERY important to understand the difference and the flow of your input all the way through to a parse tree.
Your input stream of characters is first processed by the Lexer (using the Lexer rules) to produce a stream of tokens for the parser to act upon. It’s important to understand that the parser has NO impact on how the Lexer interprets the input.
When multiple Lexer rules could match your input, two “tie breakers” come into play.
1 - if one rule matches more characters in your input stream than the other rules, then that rule will be used to produce a token.
2 - if multiple Lexer rules match the same sequence of input characters, then the Lexer rule that appears first in your grammar will be used to generate a token.
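The two tie breakers can be sketched with a toy tokenizer. This is not how ANTLR implements its lexer; it is only a small Python illustration that uses the re module as a stand-in, with the HELLO, BYE, INT and WS rules from the question plus a hypothetical WORD rule added so that there is something to tie with.

```python
import re

# Rules in "grammar order". WORD is a hypothetical extra rule that is not in
# the question's grammar; it exists only to show the tie breakers.
RULES = [
    ("HELLO", re.compile(r"hello|hi")),
    ("BYE", re.compile(r"bye|tata")),
    ("INT", re.compile(r"[0-9]+")),
    ("WORD", re.compile(r"[a-z]+")),
    ("WS", re.compile(r"[ \t]+")),
]


def tokenize(text):
    """Split text into (rule name, matched text) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            match = pattern.match(text, pos)
            # Tie breaker 1: the longest match wins.
            # Tie breaker 2: the comparison is strict, so on equal length
            # the rule listed earlier is kept.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":  # WS is skipped, like "-> skip"
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


if __name__ == "__main__":
    for token in tokenize("bye 42 hi history"):
        print(token)
```

Here "hi" is matched by both HELLO and WORD with the same length, so the earlier rule (HELLO) wins, while "history" becomes a WORD because WORD matches more characters than the two that HELLO can match.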
Your parser rules are evaluated using a recursive descent approach, beginning with whatever start rule you specify. ANTLR uses several techniques to do its best to recognize your input, including trying alternatives until one is found that matches, ignoring a token (and producing an error) if that allows the parser to continue, and inserting a missing token (and producing an error) if that allows the parser to continue.
re: the expr portion:
The rule says that there are 6 possible ways to recognize an expr
left=expr op=('*'|'/') right=expr (which will create an InfixExprContext node in the parse tree)
left=expr op=('+'|'-') right=expr (InfixExprContext (also))
atom=INT (NumberExprContext)
'(' expr ')' (ParenExprContext)
atom=HELLO (HelloExprContext)
atom=BYE (ByeExprContext)
The benefit of the labels (e.g. # InfixExpr) is that, by creating a Context more specific than ExprContext, you get visitInfixExpr, visitNumberExpr, etc. methods that you can override in your Visitor, instead of a single visitExpr method that has to handle all the alternatives. Similarly, your Listener classes get enterXX and exitXX methods for each label.
In the left=expr op=('*'|'/') right=expr alternative, the left, op and right names generate accessors that make it easier to reach those parts of your parse tree in your *Context class. Without them you'd just have an array of expr, where expr[0] would be the first expr and expr[1] the second. It's probably a good idea to look at the generated code with and without the names and labels to see the difference. Both make it MUCH easier to write the logic in your visitors/listeners.
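To make the effect of the labels and names concrete without needing the generated files, here is a small self-contained sketch. The Token and *Context classes below are hand-written stand-ins, only loosely modelled on what ANTLR generates (the real classes contain much more), and EvalVisitor is a hypothetical visitor; the point is just the dispatch to one method per label and the use of left, op and right.

```python
class Token:
    """Stand-in for a lexer token; only the text is modelled."""

    def __init__(self, text):
        self.text = text


class InfixExprContext:
    """Stand-in for the node created by the '# InfixExpr' alternatives."""

    def __init__(self, left, op, right):
        self.left, self.op, self.right = left, Token(op), right

    def accept(self, visitor):
        return visitor.visitInfixExpr(self)


class NumberExprContext:
    """Stand-in for the node created by 'atom=INT # NumberExpr'."""

    def __init__(self, text):
        self.atom = Token(text)

    def accept(self, visitor):
        return visitor.visitNumberExpr(self)


class ParenExprContext:
    """Stand-in for '# ParenExpr'; the inner expr has no name of its own."""

    def __init__(self, inner):
        self._inner = inner

    def expr(self):
        return self._inner

    def accept(self, visitor):
        return visitor.visitParenExpr(self)


class EvalVisitor:
    """Hypothetical visitor: one method per labelled alternative."""

    def visit(self, tree):
        return tree.accept(self)

    def visitInfixExpr(self, ctx):
        left, right = self.visit(ctx.left), self.visit(ctx.right)
        op = ctx.op.text
        if op == "*":
            return left * right
        if op == "/":
            return left / right
        if op == "+":
            return left + right
        return left - right

    def visitNumberExpr(self, ctx):
        return int(ctx.atom.text)

    def visitParenExpr(self, ctx):
        return self.visit(ctx.expr())


if __name__ == "__main__":
    # Hand-built tree for: 2 * (3 + 4)
    tree = InfixExprContext(
        NumberExprContext("2"),
        "*",
        ParenExprContext(
            InfixExprContext(NumberExprContext("3"), "+", NumberExprContext("4"))
        ),
    )
    print(EvalVisitor().visit(tree))
```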

ANTLR rule works on its own, but fails when included in another rule

I am trying to write an ANTLR grammar for a reparsed and retagged kconfig file (retagged to solve a couple of ambiguities). A simplified version of the grammar is:
grammar FailureExample;
options {
language = Java;
}
@lexer::header {
package parse.failure.example;
}
reload
: configStatement*
EOF
;
configStatement
: CONFIG IDENT
configOptions
;
configOptions
: (type
| defConfigStatement
| dependsOnStatement
| helpStatement
| rangeStatement
| defaultStatement
| selectStatement
| visibleIfStatement
| prompt
)*
;
type : FAKE1;
dependsOnStatement: FAKE2;
helpStatement: FAKE3;
rangeStatement: FAKE4;
defaultStatement: FAKE5;
selectStatement:FAKE6;
visibleIfStatement:FAKE7;
prompt:FAKE8;
defConfigStatement
: defConfigType expression
;
defConfigType
: DEF_BOOL
;
//expression parsing
primative
: IDENT
| L_PAREN expression R_PAREN
;
negationExpression
: NOT* primative
;
orExpression
: negationExpression (OR negationExpression)*
;
andExpression
: orExpression (AND orExpression)*
;
unequalExpression
: andExpression (NOT_EQUAL andExpression)?
;
equalExpression
: unequalExpression (EQUAL unequalExpression)?
;
expression
: equalExpression (BECOMES equalExpression)?
;
DEF_BOOL: 'def_bool';
CONFIG : 'config';
COMMENT : '#' .* ('\n'|'\r') {$channel = HIDDEN;};
AND : '&&';
OR : '||';
NOT : '!';
L_PAREN : '(';
R_PAREN : ')';
BECOMES : '::=';
EQUAL : '=';
NOT_EQUAL : '!=';
FAKE1 : 'fake1';
FAKE2: 'fake2';
FAKE3: 'fake3';
FAKE4: 'fake4';
FAKE5: 'fake5';
FAKE6: 'fake6';
FAKE7: 'fake7';
FAKE8: 'fake8';
IDENT : (LETTER | DIGIT | '_')*;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
fragment LETTER : ('a'..'z' | 'A'..'Z') ;
fragment DIGIT : '0'..'9';
With input:
config HAVE_DEBUG_RAM_SETUP
def_bool n
I can set antlrworks to parse just the second line (commenting out the first) and I get the proper defConfigStatement token emitted with the proper expression following. However, if I exercise either the configOptions rule or the configStatement rule (with the first line uncommented), my configOptions results in an empty set and a NoViableAlt exception is thrown.
What would cause this behavior? I know that the defConfigStatement rule is accurate and can parse correctly, but as soon as it's added as a potential option in another rule, it fails. I know I don't have conflicting rules, and I've made DEF_BOOL and DEF_TRISTATE rules the top in my list of lexer rules, so they have priority over the other lexer rules.
/Added since edit/
To further complicate the issue, if I move the defConfigStatement choice in the configOptions rule, it will work, but other rules will fail.
Edit: Using full, simplified grammar.
In short, why does the rule work on its own, but fail when it's in configOptions (especially since configOptions is in (A | B | C)* form)?
When I parse the input:
config HAVE_DEBUG_RAM_SETUP
def_bool n
with the parser generated from your grammar, I get the following parse tree:
So, I see no issues here. My guess is that you're using ANTLRWorks' interpreter: don't. It's buggy. Always test your grammar with a class of your own, or use ANTLRWorks' debugger (press CTRL+D to launch it). The debugger works like a charm (without the package declaration, btw). The image I posted above is an export from the debugger.
EDIT
If the debugger doesn't work, try (temporarily) removing the package declaration (note that you're only declaring a package for the lexer, not the parser, but that might be caused by posting a minimal grammar). You could also try changing the port number the debugger should connect to. It could be that the port is already in use (see: File -> Preferences -> Debugger-tab).

How to create / specify the AST input for testing a tree grammar with ANTLRWorks?

Background: I have created an ANTLR grammar. I am able to test and debug it with ANTLRWorks and have verified that the parser creates the AST that I had in my mind. Now, I want to write a tree grammar for the AST, parse the tree and debug the tree grammar with ANTLRWorks.
Question: I want to test and debug a tree grammar with ANTLRWorks. I thus want to parse the AST that has been created by the parser. How do I specify the AST as input when testing the tree grammar with ANTLRWorks?
P.S.
I have studied the question / answer at Does anyone know of a way to debug tree grammars in ANTLRWorks but it doesn't answer my question. Although accepted by the OP, he made a similar comment.
How do I specify the AST as input when testing the tree grammar with ANTLRWorks?
You don't need to provide the AST yourself, only the parser that produces the AST.
Given the following grammar that produces an AST:
grammar ASTDemo;
options {
output=AST;
}
tokens {
ROOT;
U_MIN;
}
parse
: expression EOF -> ^(ROOT expression)
;
expression
: addition
;
addition
: multiplication (('+' | '-')^ multiplication)*
;
multiplication
: unary (('*' | '/')^ unary)*
;
unary
: '-' atom -> ^(U_MIN atom)
| atom
;
atom
: ID
| NUMBER
| '(' expression ')' -> expression
;
ID : ('a'..'z' | 'A'..'Z')+;
NUMBER : '0'..'9'+ ('.' '0'..'9'*)?;
SPACE : (' ' | '\t' | '\r' | '\n')+ {skip();};
The following would be a tree grammar for the AST produced by grammar above:
tree grammar ASTDemoWalker;
options {
output=AST;
tokenVocab=ASTDemo;
ASTLabelType=CommonTree;
}
parse
: ^(ROOT expression)
;
expression
: ^('+' expression expression)
| ^('-' expression expression)
| ^('*' expression expression)
| ^('/' expression expression)
| ^(U_MIN expression)
| atom
;
atom
: ID
| NUMBER
;
Be sure to put both ASTDemo.g and ASTDemoWalker.g in the same folder. Open both grammars in ANTLRWorks and generate the lexer & parser from ASTDemo.g first by pressing CTRL+SHIFT+G and then generate the tree walker by opening ASTDemoWalker.g and pressing CTRL+SHIFT+G.
Now, from the ASTDemoWalker.g editor panel, fire up the debugger by pressing CTRL+D and paste the following source in the text area:
42 * ((a + 3) / -3.14)
and press OK.
You can now step through the debugging process and at the end, you can see both what AST the parser generated:
and how the tree walker walked over said AST:
Suppose you now make an "accidental" mistake in the tree grammar and define ^('*' expression) instead of ^('*' expression expression). If you debug the tree grammar again, you will see it fail after passing the 42 node:
In the AST there is another node after the 42 node, while the tree walker expected only 1 single node (42) after the * root node.
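The mismatch can also be reproduced outside ANTLRWorks with a toy shape check. The sketch below is not generated tree-walker code: the AST for 42 * ((a + 3) / -3.14) is written by hand as nested tuples, and the hypothetical arity tables stand in for the number of children each tree-grammar alternative expects.

```python
# AST the parser would produce for: 42 * ((a + 3) / -3.14)
# Each inner node is (root, child, child, ...); atoms are plain strings.
AST = ("ROOT", ("*", "42", ("/", ("+", "a", "3"), ("U_MIN", "3.14"))))

# Children expected per root token in the tree grammar's expression rule.
CORRECT = {"+": 2, "-": 2, "*": 2, "/": 2, "U_MIN": 1}
# The "accidental" mistake: ^('*' expression) instead of two expressions.
BROKEN = dict(CORRECT, **{"*": 1})


def walk_expression(node, arity):
    """Return True if node has the shape the expression rule expects."""
    if isinstance(node, str):  # atom: ID or NUMBER
        return True
    root, *children = node
    expected = arity.get(root)
    if expected is None or len(children) != expected:
        return False  # too many or too few child nodes under this root
    return all(walk_expression(child, arity) for child in children)


def walk_parse(tree, arity):
    """Check the parse rule: ^(ROOT expression)."""
    return (
        isinstance(tree, tuple)
        and len(tree) == 2
        and tree[0] == "ROOT"
        and walk_expression(tree[1], arity)
    )


if __name__ == "__main__":
    print(walk_parse(AST, CORRECT))
    print(walk_parse(AST, BROKEN))
```

With the CORRECT table the whole tree is accepted; with the BROKEN table the check fails at the * node, because that node has a second child after 42.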
Of course, this is an easy grammar, but even if you're familiar with ANTLR, it's sometimes a pain in the #$& to track down errors in a tree grammar! :)

How to have unstructured sections in a file parsed using Antlr

I am creating a translator from my language into many (all?) other object oriented languages. As part of the language I want to support being able to insert target language code sections into the file. This is actually rather similar to how Antlr supports actions in rules.
So I would like to be able to have the sections begin and end with curlies like this:
{ ...target lang code... }
The issue is that it is quite possible for { ... } to show up in the target language code, so I need to be able to match pairs of curlies.
What I want to be able to do is something like this fragment that I've pulled into its own grammar:
grammar target_lang_block;
options
{
output = AST;
}
entry
: target_lang_block;
target_lang_block
: '{' target_lang_code* '}'
;
target_lang_code
: target_lang_block
| NO_CURLIES
;
WS
: (' ' | '\r' | '\t' | '\n')+ {$channel = HIDDEN;}
;
NO_CURLIES
: ~('{'|'}')+
;
This grammar works by itself (at least to the extent I have tested it).
However, when I put these rules into the larger language, NO_CURLIES seems to eat everything and cause MismatchedTokenExceptions.
I'm not sure how to deal with this situation. It seems that what I want is to be able to turn NO_CURLIES on and off depending on whether I'm inside target_lang_block, but that does not seem to be possible.
Is it possible? Is there another way?
Thanks
Handle the target_lang_block inside the lexer instead:
Target_lang_block
: '{' (~('{' | '}') | Target_lang_block)* '}'
;
And remove NO_CURLIES, of course.
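What that recursive lexer rule does can be sketched as a small brace-matching scanner. This is a hand-written Python illustration of the same recursion, not ANTLR output, and match_block is a hypothetical helper name.

```python
def match_block(text, start=0):
    """Return the index just past the '}' that closes the block at text[start].

    Mirrors: Target_lang_block : '{' (~('{' | '}') | Target_lang_block)* '}'
    """
    if start >= len(text) or text[start] != "{":
        raise ValueError(f"expected '{{' at position {start}")
    pos = start + 1
    while pos < len(text):
        ch = text[pos]
        if ch == "}":
            return pos + 1  # closing curly of this block
        if ch == "{":
            pos = match_block(text, pos)  # nested Target_lang_block
        else:
            pos += 1  # any character that is not a curly
    raise ValueError("unterminated block")


if __name__ == "__main__":
    source = "{ if (x) { y(); } else { z(); } } rest of the file"
    end = match_block(source)
    print(source[:end])
```

Like the lexer rule, this only counts curlies, so a lone { or } inside a string literal or comment of the target language would throw the matching off.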

ANTLR: rule Tokens has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2

grammar AdifyMapReducePredicate;
PREDICATE
: PREDICATE_BRANCH
| EXPRESSION
;
PREDICATE_BRANCH
: '(' PREDICATE (('&&' PREDICATE)+ | ('||' PREDICATE)+) ')'
;
EXPRESSION
: ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
Trying to interpret this in ANTLRWorks 1.4 and getting the following error:
[12:18:21] error(211): <notsaved>:1:8: [fatal] rule Tokens has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
[12:18:21] Interpreting...
When I interpret, I'm trying to interpret a PREDICATE and my test case is (A||B)
What am I missing?
By ANTLR's conventions, parser rule names start with a lower case letter, while lexer rules start with capital letters. So the grammar, as you wrote it, has three lexer rules, defining tokens. This may not be what you want.
The reason for the error message apparently is the ambiguity between these tokens: your input pattern matches the definitions of both PREDICATE and PREDICATE_BRANCH.
Just use names starting in lower case letters instead of PREDICATE and PREDICATE_BRANCH. You may also have to add an extra rule for the target symbol, that is not directly involved in the recursion.
By the way, this grammar is recursive, but not left-recursive, and when using parser rules, it definitely is LL(1).
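The LL(1) remark can be illustrated with a hand-written recursive-descent parser in which a single token of lookahead is enough to pick every alternative. This is only a Python sketch of the original three rules treated as parser rules, not ANTLR-generated code; the tokenizer, the Parser class and the tuple-shaped result are all assumptions made for the example.

```python
import re

TOKEN = re.compile(r"\s*(&&|\|\||[()]|[A-Za-z_][A-Za-z0-9_]*)")
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, token):
        if self.peek() != token:
            raise SyntaxError(f"expected {token!r}, got {self.peek()!r}")
        self.pos += 1

    def predicate(self):
        # predicate : predicate_branch | expression
        # One token of lookahead decides: '(' starts a branch.
        if self.peek() == "(":
            return self.predicate_branch()
        return self.expression()

    def predicate_branch(self):
        # '(' predicate (('&&' predicate)+ | ('||' predicate)+) ')'
        self.expect("(")
        operands = [self.predicate()]
        op = self.peek()
        if op not in ("&&", "||"):
            raise SyntaxError(f"expected '&&' or '||', got {op!r}")
        while self.peek() == op:  # the same operator must repeat
            self.pos += 1
            operands.append(self.predicate())
        self.expect(")")
        return (op, *operands)

    def expression(self):
        token = self.peek()
        if token is None or not IDENT.fullmatch(token):
            raise SyntaxError(f"expected an identifier, got {token!r}")
        self.pos += 1
        return token


def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.predicate()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected trailing token {parser.peek()!r}")
    return tree


if __name__ == "__main__":
    print(parse("(A||B)"))
    print(parse("(A && (B || C) && D)"))
```

Mixing && and || at the same nesting level, as in (A||B&&C), is rejected, which matches the original PREDICATE_BRANCH rule.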
You don't have a parser rule (parser rules start with a lower case letter), although I'm not sure that last part is necessary when interpreting some test cases in ANTLRWorks.
Anyway, try something like this:
grammar AdifyMapReducePredicate;
parse
: (p=predicate {System.out.println("parsed :: "+$p.text);})+ EOF
;
predicate
: expression
;
expression
: booleanExpression
;
booleanExpression
: atom ((AND | OR) atom)*
;
atom
: ID
| '(' predicate ')'
;
AND
: '&&'
;
OR
: '||'
;
ID
: ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
SPACE
: (' ' | '\t' | '\r' | '\n') {skip();}
;
With the following test class:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
    public static void main(String[] args) throws Exception {
        ANTLRStringStream in = new ANTLRStringStream("(A || B) (C && (D || F || G))");
        AdifyMapReducePredicateLexer lexer = new AdifyMapReducePredicateLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        AdifyMapReducePredicateParser parser = new AdifyMapReducePredicateParser(tokens);
        parser.parse();
    }
}
which after generating a lexer & parser (a), compiling all .java files (b) and running the test class (c), produces the following output:
parsed :: (A||B)
parsed :: (C&&(D||F||G))
a
java -cp antlr-3.2.jar org.antlr.Tool AdifyMapReducePredicate.g
b
javac -cp antlr-3.2.jar *.java
c (*nix/MacOS)
java -cp .:antlr-3.2.jar Main
c (Windows)
java -cp .;antlr-3.2.jar Main
HTH