I need a semantic check using flex/bison - semantics

There are 5 rules. I have to check for incompatible types in initialization and for duplicate declarations. I know that I need a symbol table, and I know about $i and the rest, but I have no idea how to implement the code. Sorry for my English, I am not a native speaker.
1)prog:
PROGMY IDENT ';' decls BEGINMY stats ENDMY '.' ;
2) decl:
CONSTMY IDENT '=' NUM ';' {}
|
VARMY VARFULL {}
|
error ';' ;
3) VARFULL:
MYPEREMEN ':' MYTYPE ';' {}
|
VARFULL MYPEREMEN ':' MYTYPE ';'
4) MYTYPE :
MYINT {} //int
|
MYBOOL {} //bool
;
5) MYPEREMEN :
IDENT {}
|
MYPEREMEN ',' IDENT {}
;

bison executes the semantic action associated with a rule when it reduces the rule. So generally what you do is put code in the action associated with a declaration to check to see if a symbol is already in the symbol table, and then add it to the symbol table. The symbol table itself is a global variable. So you might have a rule like:
declaration: type IDENT {
    /* $1 is the semantic value of type, $2 that of IDENT */
    if (symbol_exists(symbol_table, $2))
        Error("duplicate symbol %s", $2);
    else
        AddSymbolWithType(symbol_table, $2, $1);
} ;
Alternatively, you could put the error check into the function AddSymbolWithType and make your grammar file cleaner.

Bison uses the following paradigm:
// declarations
%%
non-terminal : rule { c/c++ action }
%%
// your functions
So to elaborate on the previous answer:
prog:
PROGMY IDENT ';' decls BEGINMY stats ENDMY '.'
{ your code; }
; // bison (not C/C++) terminal semi-colon
As a separate note, since parsing can be slow, you might want to mark small helper functions "inline" where reasonable. The previous answer is right about what code to write: before inserting a symbol into the symbol table, check whether it is already there, and insert it only if it is not.


How to implement a string data container and use it to add two strings with Kotlin and Antlr

I'm trying to make my own language parser using Kotlin and Antlr. I'm trying to implement a data container for the string data and have the code execute.
Code to be executed:
val program = """
x = "Hello";
y = "World";
// Expect "Hello World"
print(x ++ y);
"""
So far, my Kotlin backend is:
package backend
import org.antlr.v4.runtime.*
import mygrammar.*
abstract class Data
data class StringData(val value: String) : Data()
data class IntData(val value: Int): Data()
class Context: HashMap<String, Data>()
abstract class Expr {
abstract fun eval(scope: Context): Data
}
class Compiler: PLBaseVisitor<Expr>() {
}
My Antlr grammar is:
grammar PL;
@header {
package mygrammar;
}
program : statement* EOF
;
statement : assignment ';' # assignmentStatement
| expr ';' # exprStatement
;
assignment : 'let' ID '=' expr
;
expr : x=expr '+' y=expr # addExpr
| x=expr '-' y=expr # subExpr
| x=expr '*' y=expr # mulExpr
| x=expr '/' y=expr # divExpr
| '(' expr ')' # parenExpr
| value # valueExpr
;
value : NUMERIC # numericValue
| STRING # stringValue
| ID # idValue
;
NUMERIC : ('0' .. '9')+ ('.' ('0' .. '9')*)?
;
STRING : '"' ( '\\"' | ~'"' )* '"'
;
ID : ('a' .. 'z' | 'A' .. 'Z' | '_') ('a' .. 'z' | 'A' .. 'Z' | '0' .. '9' | '_')*
;
COMMENT : '/*' .*? '*/' -> skip
;
WHITESPACE : (' ' | '\t' | '\r' | '\n')+ -> skip
;
I've been trying to search for the next steps, but whatever I search always seems to give me results for how to compile Kotlin code, not how to compile your own code using Kotlin.
There are quite a few steps between having an ANTLR grammar and executing your code.
The very next step is to use ANTLR to generate source code for a parser that recognizes source code matching your grammar. Since Kotlin is pretty good with Java interop, you could just generate and compile the Java target for your grammar. If you want to stick with Kotlin, there is support for that in ANTLR Kotlin. I've not used it myself, but I do know the Strumenta community, so I would expect it to be good, and it looks like they have a good intro to it.
Once you have compiled your generated parser, you should be able to find sample code for calling the parser on your source (program in your example). This will give you a ParseTree. ANTLR provides convenience classes in the form of Listeners and/or Visitors that make it easy to process the resulting parse tree. At this point, ANTLR has provided its value for you. ANTLR is a tool for generating source code for a parser to recognize your input. The ParseTree is the result of that parsing. From there, it is up to you to decide how to interpret that parse tree and execute the logic.
If your language does not get too much more complicated, and is not particularly performance sensitive, you might get by with logic that visits the tree and performs the operations represented there (keeping a dictionary of values, interpreting and performing the operations in the ParseTree, etc.).
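As a rough illustration of the "visit the tree and keep a dictionary of values" idea, here is a minimal sketch in plain Java. It does not use the ANTLR runtime: the Node interface and the literal, variable, assign and add helpers are hypothetical stand-ins for what a visitor over the generated context classes would build or evaluate, and treating + on non-integers as string concatenation is an assumption made only for this example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TreeWalkDemo {
    /** Hypothetical node type standing in for an ANTLR parse-tree context. */
    interface Node {
        Object eval(Map<String, Object> scope);
    }

    static Node literal(Object value) {
        return scope -> value;
    }

    static Node variable(String name) {
        return scope -> {
            if (!scope.containsKey(name)) {
                throw new IllegalStateException("undefined variable: " + name);
            }
            return scope.get(name);
        };
    }

    static Node assign(String name, Node value) {
        return scope -> {
            Object result = value.eval(scope);
            scope.put(name, result); // the "dictionary of values"
            return result;
        };
    }

    static Node add(Node left, Node right) {
        return scope -> {
            Object l = left.eval(scope);
            Object r = right.eval(scope);
            if (l instanceof Integer && r instanceof Integer) {
                return Integer.valueOf((Integer) l + (Integer) r);
            }
            // Assumption for this sketch: anything else is concatenated as text.
            return String.valueOf(l) + r;
        };
    }

    /** Evaluates the statements in order and returns the value of the last one. */
    static Object run(List<Node> program, Map<String, Object> scope) {
        Object last = null;
        for (Node statement : program) {
            last = statement.eval(scope);
        }
        return last;
    }

    public static void main(String[] args) {
        List<Node> program = Arrays.asList(
                assign("x", literal("Hello")),
                assign("y", literal("World")),
                add(variable("x"), variable("y")));
        System.out.println(run(program, new HashMap<>())); // prints HelloWorld
    }
}
```

With a real ANTLR visitor, each visitXxx method would play the role of one of these helpers, either evaluating directly or returning an Expr-like object to evaluate later.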
Your question suggests that you may expect ANTLR to be able to compile and execute your logic. That's just simply not the case. What I've outlined would be the next steps in a way forward, and there are many ways that you may choose to get to execution of the logic.
If you need an intro to ANTLR this one is pretty good: ANTLR Mega Tutorial
And this page links to many more Strumenta ANTLR articles

ANTLR4 No Viable Alternative At Input

I'm implementing a simple PseudoCode language with ANTLR4, this is my current grammar:
// Define a grammar called PseudoCode
grammar PseudoCode;
prog : FUNCTION SIGNATURE '(' ')'
| FUNCTION SIGNATURE '{' VARB '}' ;
param: VARB | VARB ',' param ;
assignment: VARB '=' NUMBER ;
FUNCTION: 'function' ;
VARB: [a-z0-9]+ ;
SIGNATURE: [a-zA-Z0-9]+ ;
NUMBER: [0-9]+ | [0-9]+ '.' [0-9]+ ;
WS: [ \t\r\n]+ -> skip ;
The problem is after compiling and generating the Parser, Lexer, etc... and then running with grun PseudoCode prog -tree with the input being for example: function bla{bleh}
I keep on getting the following error:
line 1:9 no viable alternative at input 'functionbla'
Can someone point out what is wrong with my grammar?
bla is a VARB, not a SIGNATURE, because it matches both rules and VARB comes first in the grammar. The way you defined your lexer rules, an identifier can only be matched as a SIGNATURE if it contains capital letters.
The simplest solution to this problem would be to have a single lexer rule for identifiers and then use that everywhere where you currently use SIGNATURE or VARB. If you want to disallow capital letters in certain places, you could simply check for this condition in an action or listener, which would also allow you to produce clearer error messages than syntax errors (e.g. "capital letters are not allowed in variable names").
If you absolutely do need capital letters in variable names to be syntax errors, you could define one rule for identifiers with capital letters and one without. Then you could use ID_WITH_CAPITALS | ID_LOWER_CASE_ONLY in places where you want to allow both and ID_LOWER_CASE_ONLY in cases where you only want to allow lower case letters.
PS: You'll also want to make sure that your identifier rule does not match numbers (which both VARB and SIGNATURE currently do). Currently NUMBER tokens will only be generated for numbers with a decimal point.
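To see why bla ends up as VARB, the tie-breaking behaviour described above can be imitated in a few lines of plain Java. This is only a simulation under stated assumptions, not how ANTLR is implemented: it uses java.util.regex, tries every rule at the current position, keeps the longest match, and lets the rule listed first win a tie. The class and method names are made up for the example, and characters that no rule matches are reported as quoted literals.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LexerPriorityDemo {
    // Rules in the same order as in the question's grammar.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("FUNCTION", Pattern.compile("function"));
        RULES.put("VARB", Pattern.compile("[a-z0-9]+"));
        RULES.put("SIGNATURE", Pattern.compile("[a-zA-Z0-9]+"));
        // Written with an optional fraction because Java alternation is ordered, not longest-match.
        RULES.put("NUMBER", Pattern.compile("[0-9]+(\\.[0-9]+)?"));
        RULES.put("WS", Pattern.compile("[ \\t\\r\\n]+"));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestLength = 0;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Strictly longer only: on a tie the earlier rule is kept.
                if (m.lookingAt() && m.end() - pos > bestLength) {
                    bestRule = rule.getKey();
                    bestLength = m.end() - pos;
                }
            }
            if (bestRule == null) {
                tokens.add("'" + input.charAt(pos) + "'"); // e.g. '{' or '}'
                pos++;
                continue;
            }
            if (!bestRule.equals("WS")) { // WS is skipped, as in the grammar
                tokens.add(bestRule + "(" + input.substring(pos, pos + bestLength) + ")");
            }
            pos += bestLength;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("function bla{bleh}"));
        System.out.println(tokenize("Bla 42 4.2"));
    }
}
```

In this simulation bla and 42 both come out as VARB, Bla as SIGNATURE, and only 4.2 as NUMBER, which matches the behaviour described in the answer.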

Parse string antlr

I have strings as a parser rule rather than lexer because strings may contain escapes with expressions in them, such as "The variable is \(variable)".
string
: '"' character* '"'
;
character
: escapeSequence
| .
;
escapeSequence
: '\(' expression ')'
;
IDENTIFIER
: [a-zA-Z][a-zA-Z0-9]*
;
WHITESPACE
: [ \r\t,] -> skip
;
This doesn't work because . matches any token rather than any character, so many identifiers will be matched and whitespace will be completely ignored.
How can I parse strings that can have expressions inside of them?
Looking into the parser for Swift and Javascript, both languages that support things like this, I can't figure out how they work. From what I can tell, they just output a string such as "my string with (variables) in it" without actually being able to parse the variable as its own thing.
This problem can be approached using lexical modes by having one mode for the inside of strings and one (or more) for the outside. Seeing a " on the outside would switch to the inside mode and seeing a \( or " would switch back outside. The only complicated part would be seeing a ) on the outside: sometimes it should switch back to the inside (because it corresponds to a \() and sometimes it shouldn't (when it corresponds to a plain ().
The most basic way to achieve this would be like this:
Lexer:
lexer grammar StringLexer;
IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]* ;
DQUOTE: '"' -> pushMode(IN_STRING);
LPAR: '(' -> pushMode(DEFAULT_MODE);
RPAR: ')' -> popMode;
mode IN_STRING;
TEXT: ~[\\"]+ ;
BACKSLASH_PAREN: '\\(' -> pushMode(DEFAULT_MODE);
ESCAPE_SEQUENCE: '\\' . ;
DQUOTE_IN_STRING: '"' -> type(DQUOTE), popMode;
Parser:
parser grammar StringParser;
options {
tokenVocab = 'StringLexer';
}
start: exp EOF ;
exp : '(' exp ')'
| IDENTIFIER
| DQUOTE stringContents* DQUOTE
;
stringContents : TEXT
| ESCAPE_SEQUENCE
| '\\(' exp ')'
;
Here we push the default mode every time we see a ( or \( and pop the mode every time we see a ). This way it will go back inside the string only if the mode on top of the stack is the string mode, which would only be the case if there aren't any unclosed ( left since the last \(.
This approach works, but has the downside that an unmatched ) will cause an empty stack exception rather than a normal syntax error because we're calling popMode on an empty stack.
To avoid this, we can add a member that tracks how deeply nested we are inside parentheses and doesn't pop the stack when the nesting level is 0 (i.e. if the stack is empty):
@members {
int nesting = 0;
}
LPAR: '(' {
nesting++;
pushMode(DEFAULT_MODE);
};
RPAR: ')' {
if (nesting > 0) {
nesting--;
popMode();
}
};
mode IN_STRING;
BACKSLASH_PAREN: '\\(' {
nesting++;
pushMode(DEFAULT_MODE);
};
(The parts I left out are the same as in the previous version).
This works and produces normal syntax errors for unmatched )s. However, it contains actions and is thus no longer language-agnostic, which is only a problem if you plan to use the grammar from multiple languages (and depending on the language, you might even be lucky and the code might be valid in all of your targeted languages).
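For readers who want to trace the mode stack by hand, the following is a small hand-written simulation of the version with the nesting counter, in plain Java and without the ANTLR runtime. It is a sketch under assumptions: the class and helper names are invented, the current mode is modelled as the top of a single stack (ANTLR keeps the current mode separately from the stack), and characters outside strings that match no rule are silently skipped.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class ModeStackDemo {
    enum Mode { DEFAULT, IN_STRING }

    static boolean isIdentStart(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
    }

    static boolean isIdentPart(char c) {
        return isIdentStart(c) || (c >= '0' && c <= '9');
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Deque<Mode> modes = new ArrayDeque<>();
        modes.push(Mode.DEFAULT); // bottom entry: code outside any string
        int nesting = 0;          // number of currently open '(' and '\('
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (modes.peek() == Mode.DEFAULT) {
                if (c == '"') {
                    tokens.add("DQUOTE");
                    modes.push(Mode.IN_STRING);
                    i++;
                } else if (c == '(') {
                    tokens.add("LPAR");
                    nesting++;
                    modes.push(Mode.DEFAULT);
                    i++;
                } else if (c == ')') {
                    tokens.add("RPAR");
                    if (nesting > 0) { // an unmatched ')' must not pop
                        nesting--;
                        modes.pop();
                    }
                    i++;
                } else if (isIdentStart(c)) {
                    int start = i;
                    while (i < input.length() && isIdentPart(input.charAt(i))) i++;
                    tokens.add("IDENTIFIER(" + input.substring(start, i) + ")");
                } else {
                    i++; // assumption: skip whitespace and anything unrecognized
                }
            } else { // IN_STRING
                if (c == '"') {
                    tokens.add("DQUOTE");
                    modes.pop();
                    i++;
                } else if (c == '\\' && i + 1 < input.length()) {
                    if (input.charAt(i + 1) == '(') {
                        tokens.add("BACKSLASH_PAREN");
                        nesting++;
                        modes.push(Mode.DEFAULT);
                    } else {
                        tokens.add("ESCAPE_SEQUENCE");
                    }
                    i += 2;
                } else if (c == '\\') {
                    i++; // lone trailing backslash: ignored in this sketch
                } else {
                    int start = i;
                    while (i < input.length() && input.charAt(i) != '"' && input.charAt(i) != '\\') i++;
                    tokens.add("TEXT(" + input.substring(start, i) + ")");
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("\"a \\(b) c\"")); // the input "a \(b) c"
        System.out.println(tokenize(")"));             // unmatched ')': no exception
    }
}
```

The ) after b pops back into the string mode because the entry on top of the stack was pushed by \(, while a lone ) leaves the stack alone and simply produces an RPAR for the parser to reject.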
If you want to avoid actions, the last alternative would be to have three modes: One for code that's outside of any strings, one for the inside of the string and one for the inside of \(). The third one will be almost identical to the outer one, except that it will push and pop the mode when seeing parentheses, whereas the outer one will not. To make both modes produce the same types of tokens, the rules in the third mode will all call type(). This will look like this:
lexer grammar StringLexer;
IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]* ;
DQUOTE: '"' -> pushMode(IN_STRING);
LPAR: '(';
RPAR: ')';
mode IN_STRING;
TEXT: ~[\\"]+ ;
BACKSLASH_PAREN: '\\(' -> pushMode(EMBEDDED);
ESCAPE_SEQUENCE: '\\' . ;
DQUOTE_IN_STRING: '"' -> type(DQUOTE), popMode;
mode EMBEDDED;
E_IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]* -> type(IDENTIFIER);
E_DQUOTE: '"' -> pushMode(IN_STRING), type(DQUOTE);
E_LPAR: '(' -> type(LPAR), pushMode(EMBEDDED);
E_RPAR: ')' -> type(RPAR), popMode;
Note that we now can no longer use string literals in the parser grammar because string literals can't be used when multiple lexer rules are defined using the same string literal. So now we have to use LPAR instead of '(' in the parser and so on (we already had to do this for DQUOTE for the same reason).
Since this version involves a lot of duplication (especially as the number of tokens rises) and prevents the use of string literals in the parser grammar, I generally prefer the version with the actions.
The full code for all three alternatives can also be found on GitHub.

What is the ANTLR4 equivalent of a ! in a lexer rule?

I'm working on converting an old ANTLR 2 grammar to ANTLR 4, and I'm having trouble with the string rule.
STRING :
'\''!
(
~('\'' | '\\' | '\r' | '\n')
)*
'\''!
;
This creates a STRING token whose text contains the contents of the string, but does not contain the starting and ending quotes, because of the ! symbol after the quote literals.
ANTLR 4 chokes on the ! symbol, ('!' came as a complete surprise to me (AC0050)) but if I leave it off, I end up with tokens that contain the quotes, which is not what I want. What's the correct way to port this to ANTLR 4?
Antlr4 generally treats tokens as being immutable, at least in the sense that there is no support for a language neutral equivalent of !.
Perhaps the simplest way to accomplish the equivalent is:
string : str=STRING { Strings.unquote($str); } ;
STRING : SQuote ~[\r\n\\']* SQuote ;
fragment SQuote : '\'' ;
where Strings.unquote is:
public static void unquote(Token token) {
CommonToken ct = (CommonToken) token;
String text = ct.getText();
text = .... unquote it ....
ct.setText(text);
}
The reason for using a parser rule is that attribute references are not (currently) supported in the lexer. Still, it could be done on the lexer rule; it would just require slightly more effort to get at the token.
An alternative to modifying the token text is to implement a custom token with custom fields and methods. See this answer if of interest.
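The unquoting step itself is left open above. One possible version of it, working on a plain String so that it can be tried without the ANTLR runtime, might look like the sketch below. The class name UnquoteDemo and the decision to return unquoted input unchanged are assumptions for illustration; inside Strings.unquote the result would then be passed to ct.setText(...).

```java
final class UnquoteDemo {
    /**
     * Strips one leading and one trailing single quote, if both are present.
     * Text that is not wrapped in single quotes is returned unchanged.
     */
    static String unquote(String text) {
        if (text != null
                && text.length() >= 2
                && text.charAt(0) == '\''
                && text.charAt(text.length() - 1) == '\'') {
            return text.substring(1, text.length() - 1);
        }
        return text;
    }

    public static void main(String[] args) {
        System.out.println(unquote("'abc'")); // prints abc
    }
}
```

Because the STRING rule above only matches text that starts and ends with a quote, the guard is mostly a safety net for callers that pass other tokens.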
I believe in ANTLR4 your problem can be solved using lexical modes and lexer commands.
Here is an example from there that I think does exactly what you need (although for double quotes but it's an easy fix):
lexer grammar Strings;
LQUOTE : '"' -> more, mode(STR) ;
WS : [ \r\t\n]+ -> skip ;
mode STR;
STRING : '"' -> mode(DEFAULT_MODE) ; // token we want parser to see
TEXT : . -> more ; // collect more text for string

How to use similar lexers

I have the following grammar:
cmds
: cmd+
;
cmd
: include_cmd | other_cmd
;
include_cmd
: INCLUDE DOUBLE_QUOTE FILE_NAME DOUBLE_QUOTE
;
other_cmd
: CMD_NAME ARG+
;
INCLUDE
: '#include'
;
DOUBLE_QUOTE
: '"'
;
CMD_NAME
: ('a'..'z')*
;
ARG
: ('a'..'z' | 'A'..'Z' | '0'..'9' | '_')+
;
FILE_NAME
: ('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '.')+
;
So the difference between CMD_NAME, ARG and FILE_NAME is not large, CMD_NAME must be lower case letters, ARG can have upper case letter and "_" and FILE_NAME yet can have ".".
But this has a problem, when I test the rule with - #include "abc", 'abc' is interpreted as CMD_NAME instead of FILE_NAME, I think it is because CMD_NAME is before FILE_NAME in the grammar file, this leads to parsing error.
Do I have to rely on such technique as predict to deal with this? Is there a pure EBNF solution other than relying on host programming language?
Thanks.
But this has a problem, when I test the rule with - #include "abc", 'abc' is interpreted as CMD_NAME instead of FILE_NAME, I think it is because CMD_NAME is before FILE_NAME in the grammar file, this leads to parsing error.
The set of all valid CMD_NAMEs intersects with the set of all valid FILE_NAMEs. Input abc qualifies as both. The lexer matches the input with the first rule listed (as you suspected) because it's the first one matched.
Do I have to rely on such technique as [predicate] to deal with this? Is there a pure EBNF solution other than relying on host programming language?
It depends on what you're willing to accept in your grammar. Consider changing your include_cmd rule to something more conventional, like this:
include_cmd : INCLUDE STRING;
STRING
: '"' ~('"'|'\r'|'\n')* '"' {String text = getText(); setText(text.substring(1, text.length() - 1));}
;
Now input #include "abc" turns into tokens [INCLUDE : #include] [STRING : abc].
I don't think the grammar should be responsible for determining whether a file name is valid or not: a valid file name doesn't imply a valid file, and the grammar would have to understand OS file naming conventions (valid characters, paths, etc.) that probably have no bearing on the grammar itself. I think you'll be fine if you're willing to drop rule FILE_NAME for something like the rules above.
Also worth noting, your CMD_NAME rule matches zero-length input. Consider changing ('a'..'z')* to ('a'..'z')+ unless a CMD_NAME really can be empty.
Keep in mind, too, that you'll have the same problem with ARG that you did with FILE_NAME. It's listed after CMD_NAME, so any input that qualifies for both rules (like abc again) will hit CMD_NAME. Consider breaking these rules up into more conventional ones like so:
other_cmd : ID (ID | NUMBER)+ SEMI; //instead of CMD_NAME ARG+
ID : ('a'..'z'|'A'..'Z'|'_')+; //instead of CMD_NAME, "id" part of ARG
NUMBER : ('0'..'9')+; //"number" part of ARG
SEMI : ';';
I added rule SEMI to mark the end of a command. Otherwise the parser won't know if input a b c d is supposed to be one command with three arguments (a(b,c,d)) or two commands with one argument each (a(b), c(d)).