Is it possible to implement [stmt]+ in yacc?

I have a homework assignment to write a compiler in yacc.
The grammar I have to implement is:
PROGRAM ::= STMT+
STMT ::= EXP | PRINT-STMT
.......
I tried to implement STMT+ like this:
Program : STMT_PLUS { printf( "Program !\n" );}
;
STMT_PLUS : STMT STMT_PLUS {}
| ;
STMT : STMT {}
| EXP {}
| PRINT_STMT {}
|
;
The input will span multiple lines.
Will it work? If it is wrong, how should I change my code?

Personally, I'd use left recursion but it depends on what you are going to do in the actions. It's certainly one of the ways of creating a grammar for STMT*. But it's not STMT+, because it evidently accepts the empty sentence.
STMT+ would be one of:
STMT_PLUS: STMT
| STMT_PLUS STMT
;
or
STMT_PLUS: STMT
| STMT STMT_PLUS
;
If your goal is to build a syntax tree of some kind, either will work but the right recursion probably expresses the semantics better. If your goal is to build a line-by-line interpreter, you definitely want the left recursion; otherwise, the evaluation order will be unexpected.
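To make the evaluation-order point concrete, here is a small Python sketch (not yacc) that mimics when an LR parser would run the action attached to the list rule. The function names are made up for illustration, and the model assumes the interesting action sits on the STMT_PLUS rule rather than on STMT.

```python
def left_recursive_order(stmts):
    # STMT_PLUS: STMT | STMT_PLUS STMT
    # The list rule is reduced right after each statement is parsed,
    # so its action fires in source order.
    fired = []
    for stmt in stmts:
        fired.append(stmt)
    return fired


def right_recursive_order(stmts):
    # STMT_PLUS: STMT | STMT STMT_PLUS
    # Every statement is pushed on the parser stack first; the list rule
    # is only reduced once the last statement has been seen, and the
    # innermost (last) list is reduced first.
    stack = list(stmts)
    fired = []
    while stack:
        fired.append(stack.pop())
    return fired


program = ["x = 1", "y = x + 1", "print y"]
print(left_recursive_order(program))
print(right_recursive_order(program))
```

In this model the right-recursive version fires its list actions for the last statement first, which is why an interpreter that executes inside that action would appear to run the program backwards. It also shows why the parser stack grows with the number of statements under right recursion.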
If you want to learn how all this works, experiment. Try different options and see how they work.

How to implement a string data container and use it to add two strings with Kotlin and Antlr

I'm trying to make my own language parser using Kotlin and Antlr. I'm trying to implement a data container for the string data and have the code execute.
Code to be executed:
val program = """
x = "Hello";
y = "World";
// Expect "Hello World"
print(x ++ y);
"""
So far, my Kotlin backend is:
package backend
import org.antlr.v4.runtime.*
import mygrammar.*
abstract class Data
data class StringData(val value: String) : Data()
data class IntData(val value: Int): Data()
class Context: HashMap<String, Data>()
abstract class Expr {
abstract fun eval(scope: Context): Data
}
class Compiler: PLBaseVisitor<Expr>() {
}
My Antlr grammar is:
grammar PL;
@header {
package mygrammar;
}
program : statement* EOF
;
statement : assignment ';' # assignmentStatement
| expr ';' # exprStatement
;
assignment : 'let' ID '=' expr
;
expr : x=expr '+' y=expr # addExpr
| x=expr '-' y=expr # subExpr
| x=expr '*' y=expr # mulExpr
| x=expr '/' y=expr # divExpr
| '(' expr ')' # parenExpr
| value # valueExpr
;
value : NUMERIC # numericValue
| STRING # stringValue
| ID # idValue
;
NUMERIC : ('0' .. '9')+ ('.' ('0' .. '9')*)?
;
STRING : '"' ( '\\"' | ~'"' )* '"'
;
ID : ('a' .. 'z' | 'A' .. 'Z' | '_') ('a' .. 'z' | 'A' .. 'Z' | '0' .. '9' | '_')*
;
COMMENT : '/*' .*? '*/' -> skip
;
WHITESPACE : (' ' | '\t' | '\r' | '\n')+ -> skip
;
I've been trying to search for the next steps, but whatever I search always seems to give me results for how to compile Kotlin code, not how to compile your own code using Kotlin.
There are quite a few steps between having an ANTLR grammar and executing your code.
The very next step is to use ANTLR to generate source code for a parser that recognizes input matching your grammar. Since Kotlin has good Java interop, you could simply generate and compile the Java target for your grammar. If you want to stick with Kotlin, there is a Kotlin target: ANTLR Kotlin. I've not used it myself, but I know the Strumenta community, so I would expect it to be good, and it looks like they have a good intro to it.
Once you have compiled your generated parser, you should be able to find sample code for calling the parser on your source (program in your example). This will give you a ParseTree. ANTLR provides convenience classes in the form of Listeners and/or Visitors that make it easy to process the resulting parse tree. At this point, ANTLR has provided its value for you. ANTLR is a tool for generating source code for a parser to recognize your input. The ParseTree is the result of that parsing. From there, it is up to you to decide how to interpret that parse tree and execute the logic.
If your language does not get much more complicated and is not particularly performance sensitive, you might get by with logic that visits the tree and directly performs what it represents (keep a dictionary of values, interpret and perform the operations in the ParseTree, etc.).
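As a rough illustration of that last idea, here is a Python sketch of a tree-walking evaluator that keeps a dictionary of values. It does not use the ANTLR runtime: the nested tuples stand in for the parse tree, and the node kinds ("assign", "add", "str", "num", "id") are invented for this example. In a Kotlin backend the same logic would live in the visitor methods.

```python
def evaluate(node, scope):
    """Evaluate one tree node, reading and writing variables in scope."""
    kind = node[0]
    if kind in ("num", "str"):
        return node[1]                      # literal value
    if kind == "id":
        if node[1] not in scope:
            raise NameError(f"undefined variable {node[1]!r}")
        return scope[node[1]]
    if kind == "add":
        left = evaluate(node[1], scope)
        right = evaluate(node[2], scope)
        if type(left) is not type(right):
            raise TypeError("operands of '+' must have the same type")
        return left + right                 # numeric sum or string concat
    if kind == "assign":
        scope[node[1]] = evaluate(node[2], scope)
        return scope[node[1]]
    raise ValueError(f"unknown node kind {kind!r}")


def run(program):
    """Evaluate statements in order; return the last value and the scope."""
    scope = {}
    result = None
    for statement in program:
        result = evaluate(statement, scope)
    return result, scope


program = [
    ("assign", "x", ("str", "Hello")),
    ("assign", "y", ("str", "World")),
    ("add", ("id", "x"), ("id", "y")),
]
result, scope = run(program)
print(result)  # HelloWorld
```

This approach re-walks the tree on every evaluation, which is usually fine for small scripts and is the simplest thing to get working before considering anything like compilation.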
Your question suggests that you may expect ANTLR to be able to compile and execute your logic. That's just simply not the case. What I've outlined would be the next steps in a way forward, and there are many ways that you may choose to get to execution of the logic.
If you need an intro to ANTLR this one is pretty good: ANTLR Mega Tutorial
And this page links to many more Strumenta ANTLR articles

I need a semantic check using flex/bison

There are 5 rules. I have to check for incompatible types in initializations and for double declarations. I know that I need a symbol table, and I know about $i and all the other stuff, but I don't have any idea how to implement the code. Sorry for my English, I am not a native speaker.
1)prog:
PROGMY IDENT ';' decls BEGINMY stats ENDMY '.' ;
2) decl:
CONSTMY IDENT '=' NUM ';' {}
|
VARMY VARFULL {}
|
error ';' ;
3) VARFULL:
MYPEREMEN ':' MYTYPE ';' {}
|
VARFULL MYPEREMEN ':' MYTYPE ';'
4) MYTYPE :
MYINT {} //int
|
MYBOOL {} //bool
;
5) MYPEREMEN :
IDENT {}
|
MYPEREMEN ',' IDENT {}
;
bison executes the semantic action associated with a rule when it reduces the rule. So generally what you do is put code in the action associated with a declaration to check to see if a symbol is already in the symbol table, and then add it to the symbol table. The symbol table itself is a global variable. So you might have a rule like:
declaration: type IDENT {
if (symbol_exists(symbol_table, $2))
Error("duplicate symbol %s", $2);
else
AddSymbolWithType(symbol_table, $2, $1); }
Alternately, you could put the error check into the function AddSymbolWithType and make your grammar file cleaner.
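The bookkeeping behind such helpers can be sketched independently of bison. The Python below is only a model of the logic, with invented names (SymbolTable, declare, check_init); in the real parser these would be C functions called from the actions. The type names "int" and "bool" are taken from the MYINT/MYBOOL comments in the question.

```python
class SemanticError(Exception):
    pass


class SymbolTable:
    def __init__(self):
        self._types = {}  # symbol name -> declared type name

    def declare(self, name, type_name):
        # The duplicate check lives here so the grammar actions stay short.
        if name in self._types:
            raise SemanticError(f"duplicate symbol {name}")
        self._types[name] = type_name

    def check_init(self, name, value_type):
        # Reject initializing a symbol with a value of another type.
        declared = self._types.get(name)
        if declared is None:
            raise SemanticError(f"undeclared symbol {name}")
        if declared != value_type:
            raise SemanticError(
                f"cannot initialize {declared} {name} with {value_type}")


table = SymbolTable()
table.declare("a", "int")
table.declare("flag", "bool")
table.check_init("a", "int")  # compatible, no error

errors = []
for action in (lambda: table.declare("a", "bool"),
               lambda: table.check_init("flag", "int")):
    try:
        action()
    except SemanticError as exc:
        errors.append(str(exc))
print(errors)
```

Collecting the messages instead of stopping at the first one mirrors what a parser usually wants: report the error and keep going so that later declarations are still checked.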
Bison uses the following paradigm:
// declarations
%%
non-terminal : rule { c/c++ action }
%%
// your functions
So to elaborate on the previous answer:
prog:
PROGMY IDENT ';' decls BEGINMY stats ENDMY '.'
{ your code; }
; // bison (not C/C++) terminal semi-colon
As a separate note, if parsing speed matters you might want to declare the helper functions called from your actions "inline" where reasonable. The previous answer is right about what code to write: if you are going to insert a symbol into a symbol table, you first need to check whether the symbol is already there, and insert it only if it is not.

Jison: Reduce conflict where there is actually no conflict

I'm trying to generate a small JavaScript parser which also includes typed variables for a small project.
Luckily, Jison already provides a jscore.js, which I just adjusted to fit my needs. After adding types I ran into a reduce conflict. I minimized the problem to this minimal Jison grammar:
%start SourceElements
%%
// This is up to become more complex soon
Type
: VAR
| IDENT
;
// Can be a list of statements
SourceElements
: Statement
| SourceElements Statement
;
// Either be a declaration or an expression
Statement
: VariableStatement
| ExprStatement
;
// Parses something like: MyType hello;
VariableStatement
: Type IDENT ";"
;
// Parses something like hello;
ExprStatement
: PrimaryExprNoBrace ";"
;
// Parses something like hello;
PrimaryExprNoBrace
: IDENT
;
This script does nothing more than parse two kinds of statements:
VariableStatement
IDENT IDENT ";"
ExprStatement
IDENT ";"
As this is an extremely minimized Jison script, I can't simply replace "Type" with "IDENT" (which, by the way, worked).
Generating the parser throws the following conflicts:
Conflict in grammar: multiple actions possible when lookahead token is IDENT in state 8
- reduce by rule: PrimaryExprNoBrace -> IDENT
- reduce by rule: Type -> IDENT
Conflict in grammar: multiple actions possible when lookahead token is ; in state 8
- reduce by rule: PrimaryExprNoBrace -> IDENT
- reduce by rule: Type -> IDENT
States with conflicts:
State 8
Type -> IDENT . #lookaheads= IDENT ;
PrimaryExprNoBrace -> IDENT . #lookaheads= IDENT ;
Is there any trick to fix this conflict?
Thank you in advance!
~Benjamin
This looks like a Jison bug to me. It is complaining about ambiguity in the cases of these two sequences of tokens:
IDENT IDENT
IDENT ";"
The state in question is the one reached after shifting the first IDENT token. Jison observes that it needs to reduce that token, and (it claims) it doesn't know whether to reduce to a Type or to a PrimaryExprNoBrace.
But Jison should be able to distinguish based on the next token: if it is a second IDENT then only reducing to a Type can lead to a valid parse, whereas if it is ";" then only reducing to PrimaryExprNoBrace can lead to a valid parse.
Are you sure the given output goes with the given grammar? It would be possible either to add rules or to modify the given ones to produce an ambiguity such as the one described. This just seems like such a simple case that I'm surprised Jison gets it wrong. If it in fact does, however, then you should consider filing a bug report.
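The lookahead argument can be checked mechanically by computing FOLLOW sets for the grammar as posted. The Python sketch below is not part of Jison; it hard-codes the posted rules, adds a hypothetical Start rule with a $end marker, and relies on the fact that this grammar has no empty productions.

```python
GRAMMAR = {
    "Start": [["SourceElements", "$end"]],  # hypothetical augmented start
    "Type": [["VAR"], ["IDENT"]],
    "SourceElements": [["Statement"], ["SourceElements", "Statement"]],
    "Statement": [["VariableStatement"], ["ExprStatement"]],
    "VariableStatement": [["Type", "IDENT", ";"]],
    "ExprStatement": [["PrimaryExprNoBrace", ";"]],
    "PrimaryExprNoBrace": [["IDENT"]],
}


def first_sets(grammar):
    # No rule is empty, so FIRST of a rule is FIRST of its first symbol.
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, prods in grammar.items():
            for prod in prods:
                head = prod[0]
                new = first[head] if head in grammar else {head}
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


def follow_sets(grammar, first):
    follow = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, prods in grammar.items():
            for prod in prods:
                for i, sym in enumerate(prod):
                    if sym not in grammar:
                        continue  # terminals have no FOLLOW set
                    if i + 1 < len(prod):
                        nxt = prod[i + 1]
                        new = first[nxt] if nxt in grammar else {nxt}
                    else:
                        new = follow[nt]  # sym ends the rule
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow


follow = follow_sets(GRAMMAR, first_sets(GRAMMAR))
print(sorted(follow["Type"]))
print(sorted(follow["PrimaryExprNoBrace"]))
```

For the grammar as posted, the two sets come out disjoint (IDENT for Type, ";" for PrimaryExprNoBrace), which supports the suspicion that the reported conflict belongs to a different version of the grammar or to a generator bug.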

How to have unstructured sections in a file parsed using Antlr

I am creating a translator from my language into many (all?) other object oriented languages. As part of the language I want to support being able to insert target language code sections into the file. This is actually rather similar to how Antlr supports actions in rules.
So I would like to be able to have the sections begin and end with curlies like this:
{ ...target lang code... }
The issue is that it is quite possible that { ... } can show up in the target language code, so I need to be able to match pairs of curlies.
What I want to be able to do is something like this fragment that I've pulled into its own grammar:
grammar target_lang_block;
options
{
output = AST;
}
entry
: target_lang_block;
target_lang_block
: '{' target_lang_code* '}'
;
target_lang_code
: target_lang_block
| NO_CURLIES
;
WS
: (' ' | '\r' | '\t' | '\n')+ {$channel = HIDDEN;}
;
NO_CURLIES
: ~('{'|'}')+
;
This grammar works by itself (at least to the extent I have tested it).
However, when I put these rules into the larger language, NO_CURLIES seems to eat everything and cause MismatchedTokenExceptions.
I'm not sure how to deal with this situation, but it seems that what I want is to be able to turn NO_CURLIES on and off based on whether I'm in target_lang_block, and it does not seem that is possible.
Is it possible? Is there another way?
Thanks
Handle the target_lang_block inside the lexer instead:
Target_lang_block
: '{' (~('{' | '}') | Target_lang_block)* '}'
;
And remove NO_CURLIES, of course.
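What that recursive lexer rule does can be pictured as a scan that tracks nesting depth. Here is a Python sketch of the same idea, with a made-up function name. It shares the rule's limitation: curlies inside target-language string literals or comments are not treated specially.

```python
def scan_target_block(text, start):
    """Return the index just past the '}' matching the '{' at text[start]."""
    if text[start] != "{":
        raise ValueError("block must start with '{'")
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "{":
            depth += 1          # entering a nested block
        elif text[i] == "}":
            depth -= 1          # leaving a block
            if depth == 0:
                return i + 1    # matched the opening curly
    raise ValueError("unbalanced curlies")


source = "rule { if (x) { y(); } } next"
end = scan_target_block(source, 5)
print(source[5:end])
```

Because the whole block becomes a single token, the rest of the grammar never sees the target-language text, so no other lexer rule can compete for it.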

How can I construct a clean, Python like grammar in ANTLR?

G'day!
How can I construct a simple ANTLR grammar handling multi-line expressions without the need for either semicolons or backslashes?
I'm trying to write a simple DSL for expressions:
# sh style comments
ThisValue = 1
ThatValue = ThisValue * 2
ThisOtherValue = (1 + 2 + ThisValue * ThatValue)
YetAnotherValue = MAX(ThisOtherValue, ThatValue)
Overall, I want my application to provide the script with some initial named values and pull out the final result. I'm getting hung up on the syntax, however. I'd like to support multiple line expressions like the following:
# Note: no backslashes required to continue expression, as we're in brackets
# Note: no semicolon required at end of expression, either
ThisValueWithAReallyLongName = (ThisOtherValueWithASimilarlyLongName
+AnotherValueWithAGratuitouslyLongName)
I started off with an ANTLR grammar like this:
exprlist
: ( assignment_statement | empty_line )* EOF!
;
assignment_statement
: assignment NL!?
;
empty_line
: NL;
assignment
: ID '=' expr
;
// ... and so on
It seems simple, but I'm already in trouble with the newlines:
warning(200): StackOverflowQuestion.g:11:20: Decision can match input such as "NL" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
Graphically, in org.antlr.works.IDE:
Decision Can Match NL Using Multiple Alternatives http://img.skitch.com/20090723-ghpss46833si9f9ebk48x28b82.png
I've kicked the grammar around, but always end up with violations of expected behavior:
A newline is not required at the end of the file
Empty lines are acceptable
Everything in a line from a pound sign onward is discarded as a comment
Assignments end with end-of-line, not semicolons
Expressions can span multiple lines if wrapped in brackets
I can find example ANTLR grammars with many of these characteristics. I find that when I cut them down to limit their expressiveness to just what I need, I end up breaking something. Others are too simple, and I break them as I add expressiveness.
Which angle should I take with this grammar? Can you point to any examples that aren't either trivial or full Turing-complete languages?
I would let your tokenizer do the heavy lifting rather than mixing your newline rules into your grammar:
Count parentheses, brackets, and braces, and don't generate NL tokens while there are unclosed groups. That'll give you line continuations for free without your grammar being any the wiser.
Always generate an NL token at the end of file whether or not the last line ends with a '\n' character, then you don't have to worry about a special case of a statement without a NL. Statements always end with an NL.
The second point would let you simplify your grammar to something like this:
exprlist
: ( assignment_statement | empty_line )* EOF!
;
assignment_statement
: assignment NL
;
empty_line
: NL
;
assignment
: ID '=' expr
;
How about this?
exprlist
: (expr)? (NL+ expr)* NL!? EOF!
;
expr
: assignment | ...
;
assignment
: ID '=' expr
;
I assume you chose to make NL optional, because the last statement in your input code doesn't have to end with a newline.
While it makes a lot of sense, you are making life a lot harder for your parser. Separator tokens (like NL) should be cherished, as they disambiguate and reduce the chance of conflicts.
In your case, the parser doesn't know if it should parse "assignment NL" or "assignment empty_line". There are many ways to solve it, but most of them are just band-aids for an unwise design choice.
My recommendation is an innocent hack: Make NL mandatory, and always append NL to the end of your input stream!
It may seem a little unsavory, but in reality it will save you a lot of future headaches.