Doubts regarding Compiler Construction (Flex/Bison) - yacc

Am trying to construct a simple compiler in my class, its second week and am totally stuck at these points: Am providing my simple.l as (flex and bison files are snipped to save space):
..snip..
end {return(END);}
skip {return(SKIP);}
in {return(IN);}
integer {return(INTEGER);}
let {return(LET);}
..snip..
[ \t\n\r]+
and simple.y as :
%start program
%token LET IN END
%token SKIP IF THEN ELSE WHILE DO READ WRITE FI ASSGNOP
%token NUMBER PERIOD COMMA SEMICOLON INTEGER
%token IDENTIFIER EWHILE LT
%left '-' '+'
%left '*' '/'
%right '^'
%%
program : LET declarations IN commands END SEMICOLON
declarations :
|INTEGER id_seq IDENTIFIER PERIOD
;
id_seq:
|id_seq IDENTIFIER COMMA
;
commands :
| commands command SEMICOLON
;
command : SKIP
;
exp : NUMBER
| IDENTIFIER
| '('exp')'
;
..snip..
%%
My first problem is when I compile and execute this, it properly accepts my input till the end but it does not stop at end; i.e it again comes to start state , isn't it supposed to terminate when it encounters an end :
On the input :
let
integer x.
in
skip;
end;
here is an output :
Starting parse
Entering state 0
Reading a token: let
Next token is token LET ()
Shifting token LET ()
Entering state 1
Reading a token: integer x.
Next token is token INTEGER ()
Shifting token INTEGER ()
Entering state 3
Reducing stack by rule 4 (line 22):
-> $$ = nterm id_seq ()
Stack now 0 1 3
Entering state 6
Reading a token: Next token is token IDENTIFIER ()
Shifting token IDENTIFIER ()
Entering state 8
Reading a token: Next token is token PERIOD ()
Shifting token PERIOD ()
Entering state 10
Reducing stack by rule 3 (line 20):
$1 = token INTEGER ()
$2 = nterm id_seq ()
$3 = token IDENTIFIER ()
$4 = token PERIOD ()
-> $$ = nterm declarations ()
Stack now 0 1
Entering state 4
Reading a token: in
Next token is token IN ()
Shifting token IN ()
Entering state 7
Reducing stack by rule 6 (line 25):
-> $$ = nterm commands ()
Stack now 0 1 4 7
Entering state 9
Reading a token: skip;
Next token is token SKIP ()
Shifting token SKIP ()
Entering state 13
Reducing stack by rule 8 (line 28):
$1 = token SKIP ()
-> $$ = nterm command ()
Stack now 0 1 4 7 9
Entering state 19
Reading a token: Next token is token SEMICOLON ()
Shifting token SEMICOLON ()
Entering state 29
Reducing stack by rule 7 (line 26):
$1 = nterm commands ()
$2 = nterm command ()
$3 = token SEMICOLON ()
-> $$ = nterm commands ()
Stack now 0 1 4 7
Entering state 9
Reading a token: end;
Next token is token END ()
Shifting token END ()
Entering state 12
Reading a token: Next token is token SEMICOLON ()
Shifting token SEMICOLON ()
Entering state 20
Reducing stack by rule 1 (line 18):
$1 = token LET ()
$2 = nterm declarations ()
$3 = token IN ()
$4 = nterm commands ()
$5 = token END ()
$6 = token SEMICOLON ()
-> $$ = nterm program ()
Stack now 0
Entering state 2
Reading a token:
Why is it ready to read a token again when I have entered end; ?? What am I missing ? Shouldn't it end here ? If I enter anything now it gives me following error :
Reading a token: let
Next token is token LET ()
syntax error, unexpected LET, expecting $end
Error: popping nterm program ()
Stack now 0
Cleanup: discarding lookahead token LET ()
Stack now 0
My Second doubt is what should be the next step in implementing this compiler ? I mean what more steps are required between this and code generation part ? How do I implement Symbol table now ? and How do I make this parser accept code from file . Till now am providing input in terminal, what if I want to make this accept code from file like my_program.simple ?
Thank You.

declarations :
|INTEGER id_seq IDENTIFIER PERIOD
;
...
I think that you're using a wrong syntax: you state that declarations (as well as idseq and commands) could be epsilon, i.e. an empty production. That because | it's the alternative operator. Alternative between an empty body and the actual pattern. Doesn't make sense.
I think could be the cause of your parser looping.
For a symbol table you can use a map (I hope you're generating C++), declared global outside the parser. Then insert symbols when you seen them.
Before to get a compiler, could be useful to have a working interpreter, it's easier and clarifies many aspects that will be reused building the compiler.

Related

Don't read specific token at a given point

At some point in my grammar file, I want ANTLR to read my input as 2 tokens instead of one.
In my source file I have the value
12345.name
and the lexer consumes
12345.
as a FLOAT-Token. At this specific point in the source file I want ANTLR to read this as
12345 (INT)
. (DOT)
name (NAME)
Is there a way to tell ANTLR that it should ignore FLOAT-Types at some given point?
This is my current .g4 file:
grammar Quest;
import Lua;
#header {
package dev.codeflush.m2qc.antlr;
}
/*
prefixed everything with "m2" to avoid nameclashes
*/
m2QuestFile
: m2Define* m2Quest* EOF
;
m2Define
: 'define' NAME m2DefineValue
;
m2DefineValue
: ~('\r\n' | '\r' | '\n')
;
m2Quest
: 'quest' NAME 'begin' m2State* 'end'
;
m2State
: 'state' NAME 'begin' (m2TriggerBlock | m2Function)* 'end'
;
m2TriggerBlock
: 'when' m2Trigger ('or' m2Trigger)* ('with' exp)? 'begin' block 'end'
;
m2Function
: 'function' NAME funcbody
;
m2Trigger
: m2TriggerTarget DOT m2TriggerEvent DOT m2TriggerSubEvent DOT m2TriggerArgument
| m2TriggerTarget DOT m2TriggerEvent DOT m2TriggerArgument
| m2TriggerTarget DOT m2TriggerEvent
| m2TriggerEvent
;
m2TriggerTarget
: NAME
| INT
| NORMALSTRING
;
/*
not complete
*/
m2TriggerEvent
: 'button'
| 'enter'
| 'info'
| 'item_informer'
| 'kill'
| 'leave'
| 'letter'
| 'levelup'
| 'login'
| 'logout'
| 'unmount'
| 'target'
| 'chat'
| 'timer'
| 'server_timer'
;
m2TriggerSubEvent
: 'click'
| 'chat'
| 'arrive'
;
m2TriggerArgument
: exp
;
DOT
: '.'
;
I'm using the Lua grammar from https://github.com/antlr/grammars-v4/blob/master/lua/Lua.g4
My current sample input file looks like this:
quest test begin
state start begin
when kill begin
end
when "12345".kill begin
end
when 12345.kill begin
end
end
end
Where the first two work as intended but the third one doesn't (because the lexer reads '12345.' as one FLOAT-Token)
I had a similar need in my grammar where I wanted to issue multiple tokens (2 actually) for a single match under a specific condition (here: when a dot is directly followed by an identifier, including a keyword).
// Special rule that should also match all keywords if they are directly preceded by a dot.
// Hence it's defined before all keywords.
// Here we make use of the ability in our base lexer to emit multiple tokens with a single rule.
DOT_IDENTIFIER:
DOT_SYMBOL LETTER_WHEN_UNQUOTED_NO_DIGIT LETTER_WHEN_UNQUOTED* { emitDot(); } -> type(IDENTIFIER)
;
A helper function is needed to emit the extra token(s):
/**
* Puts a DOT token onto the pending token list.
*/
void MySQLBaseLexer::emitDot() {
_pendingTokens.emplace_back(_factory->create({this, _input}, MySQLLexer::DOT_SYMBOL, _text, channel,
tokenStartCharIndex, tokenStartCharIndex, tokenStartLine,
tokenStartCharPositionInLine));
++tokenStartCharIndex;
}
which in turn requires custom handling of the token production. You have to override the nextToken method in your token stream, to consider the pending token list before returning the next real token.
/**
* Allow a grammar rule to emit as many tokens as it needs.
*/
std::unique_ptr<antlr4::Token> MySQLBaseLexer::nextToken() {
// First respond with pending tokens to the next token request, if there are any.
if (!_pendingTokens.empty()) {
auto pending = std::move(_pendingTokens.front());
_pendingTokens.pop_front();
return pending;
}
// Let the main lexer class run the next token recognition.
// This might create additional tokens again.
auto next = Lexer::nextToken();
if (!_pendingTokens.empty()) {
auto pending = std::move(_pendingTokens.front());
_pendingTokens.pop_front();
_pendingTokens.push_back(std::move(next));
return pending;
}
return next;
}
Keep in mind: the lexer rule still issues its own token (which I set to be an IDENTIFIER here), which means you only have to issue the additional tokens.

How can I use the perl6 regex metasyntax, <foo regex>?

In perl6 grammars, as explained here (note, the design documents are not guaranteed to be up-to-date as the implementation is finished), if an opening angle bracket is followed by an identifier then the construct is a call to a subrule, method or function.
If the character following the identifier is an opening paren, then it's a call to a method or function eg: <foo('bar')>. As explained further down the page, if the first char after the identifier is a space, then the rest of the string up to the closing angle will be interpreted as a regex argument to the method - to quote:
<foo bar>
is more or less equivalent to
<foo(/bar/)>
What's the proper way to use this feature? In my case, I'm parsing line oriented data and I'm trying to declare a rule that will instigate a seperate search on the current line being parsed:
#!/usr/bin/env perl6
# use Grammar::Tracer ;
grammar G {
my $SOLpos = -1 ; # Start-of-line pos
regex TOP { <line>+ }
method SOLscan($regex) {
# Start a new cursor
my $cur = self."!cursor_start_cur"() ;
# Set pos and from to start of the current line
$cur.from($SOLpos) ;
$cur.pos($SOLpos) ;
# Run the given regex on the cursor
$cur = $regex($cur) ;
# If pos is >= 0, we found what we were looking for
if $cur.pos >= 0 {
$cur."!cursor_pass"(self.pos, 'SOLscan')
}
self
}
token line {
{ $SOLpos = self.pos ; say '$SOLpos = ' ~ $SOLpos }
[
|| <word> <ws> 'two' { say 'matched two' } <SOLscan \w+> <ws> <word>
|| <word>+ %% <ws> { say 'matched words' }
]
\n
}
token word { \S+ }
token ws { \h+ }
}
my $mo = G.subparse: q:to/END/ ;
hello world
one two three
END
As it is, this code produces:
$ ./h.pl
$SOLpos = 0
matched words
$SOLpos = 12
matched two
Too many positionals passed; expected 1 argument but got 2
in method SOLscan at ./h.pl line 14
in regex line at ./h.pl line 32
in regex TOP at ./h.pl line 7
in block <unit> at ./h.pl line 41
$
Line 14 is $cur.from($SOLpos). If commented out, line 15 produces the same error. It appears as though .pos and .from are read only... (maybe :-)
Any ideas what the proper incantation is?
Note, any proposed solution can be a long way from what I've done here - all I'm really wanting to do is understand how the mechanism is supposed to be used.
It does not seem to be in the corresponding directory in roast, so that would make it a "Not Yet Implemented" feature, I'm afraid.

Yacc/bison: what's wrong with my syntax equations?

I'm writing a "compiler" of sorts: it reads a description of a game (with rooms, characters, things, etc.) Think of it as a visual version of an Adventure-style game, but with much simpler problems.
When I run my "compiler" I'm getting a syntax error on my input, and I can't figure out why. Here's the relevant section of my yacc input:
character
: char-head general-text character-insides { PopChoices(); }
;
character-insides
: LEFTBRACKET options RIGHTBRACKET
;
char-head
: char-namesWT opt-imgsWT char-desc opt-cond
;
char-desc
: general-text { SetText($1); }
;
char-namesWT
: DOTC ID WORD { AddCharacter($3, $2); expect(EXP_TEXT); }
;
opt-cond
: %empty
| condition
;
condition
: condition-reason condition-main general-text
{ AddCondition($1, $2, $3); }
;
condition-reason
: DOTU { $$ = 'u'; }
| DOTV { $$ = 'v'; }
;
condition-main
: money-conditionWT
| have-conditionWT
| moves-conditionWT
| flag-conditionWT
;
have-conditionWT
: PERCENT_SLASH opt-bang ID
{ $$ = MkCondID($1, $2, $3) ; expect(EXP_TEXT); }
;
opt-bang
: %empty { $$ = TRUE; }
| BANG { $$ = FALSE; }
;
ID: WORD
;
Things in all caps are terminal symbols, things in lower or mixed case are non-terminals. If a non-terminal ends in WT, then it "wants text". That is, it expects that what comes after it may be arbitrary text.
Background: I have written my own token recognizer in C++ because(*) I want the syntax to be able to change the way the lexer's behavior. Two types of tokens should be matched only when the syntax expects them: FILENAME (with slashes and other non-alphameric characters) and TEXT, which means "all the text from here to the end of the line" (but not starting with certain keywords).
The function "expect" tells the lexer when to look for these two symbols. The expectation is reset to EXP_NORMAL after each token is returned.
I have added code to yylex that prints out the tokens as it recognizes them, and it looks to me like the tokenizer is working properly -- returning the tokens I expect.
(*) Also because I want to be able to ask the tokenizer for the column where the error occurred, and get the contents of the line being scanned at the time so I can print out a more useful error message.
Here is the relevant part of the input:
.c Wendy wendy
OK, now you caught me, what do you want to do with me?
.u %/lasso You won't catch me like that.
[
Here is the last part of the debugging output from yylex:
token: 262: DOTC/
token: 289: WORD/Wendy
token: 289: WORD/wendy
token: 292: TEXT/OK, now you caught me, what do you want to do with me?
token: 286: DOTU/
token: 274: PERCENT_SLASH/%/
token: 289: WORD/lasso
token: 292: TEXT/You won't catch me like that.
token: 269: LEFTBRACKET/
here's my error message:
: line 124, columns 3-4: syntax error, unexpected LEFTBRACKET, expecting TEXT
[
To help you understand the equations above, here is the relevant part of the description of the input syntax that I wrote the yacc code from.
// Character:
// .c id charactername,[imagename,[animationname]]
// description-text
// .u condition on the character being usable [optional]
// .v condition on the character being visible [optional]
// [
// (options)
// ]
// Conditions:
// %$[-]n Must [not] have at least n dollars
// %/[-]name Must [not] have named thing
// %t-nnn At/before specified number of moves
// %t+nnn At/after specified number of moves
// %#[-]name named flag must [not] be set
// Condition-char: $, /, t, or #, as described above
//
// Condition:
// % condition-char (identifier/int) ['/' text-if-fail ]
// description-text: Can be either on-line text or multi-line text
// On-line text is the rest of the line
brackets mark optional non-terminals, but a bracket standing alone (represented by LEFTBRACKET and RIGHTBRACKET in the yacc) is an actual token, e.g.
// [
// (options)
// ]
above.
What am I doing wrong?
To debug parsing problems in your grammar, you need to understand the shift/reduce machine that yacc/bison produces (described in the .output file produced with the -v option), and you need to look at the trail of states that the parser goes through to reach the problem you see.
To enable debugging code in the parser (which can print the states and the shift and reduce actions as they occur), you need to compile with -DYYDEBUG or put #define YYDEBUG 1 in the top of your grammar file. The debugging code is controlled by the global variable yydebug -- set to non-zero to turn on the trace and zero to turn it off. I often use the following in main:
#ifdef YYDEBUG
extern int yydebug;
if (char *p = getenv("YYDEBUG"))
yydebug = atoi(p);
#endif
Then you can include -DYYDEBUG in your compiler flags for debug builds and turn on the debugging code by something like setenv YYDEBUG 1 to set the envvar prior to running your program.
I suppose your syntax error message was generated by bison. What is striking is that it claims to have found a LEFTBRACKET when it expects a [. Naively, you might expect it to be satisfied with the LEFTBRACKET it found, but of course bison knows nothing about LEFTBRACKET except its numeric value, which will be some integer larger than 256.
The only reason bison might expect [ is if your grammar includes the terminal '['. But since your scanner seems to return LEFTBRACKET when it sees a [, the parser will never see '['.

How to use Start States in ML-Lex?

I am creating a tokeniser in ML-Lex a part of the definition of which is
datatype lexresult = STRING
| STRINGOP
| EOF
val error = fn x => TextIO.output(TextIO.stdOut,x ^ "\n")
val eof = fn () => EOF
%%
%structure myLang
digit=[0-9];
ws=[\ \t\n];
str=\"[.*]+\";
strop=\[[0-9...?\^]\];
%s alpha;
alpha=[a-zA-Z];
%%
<alpha> {alphanum}+ => (ID);
. => (error ("myLang: ignoring bad character " ^ yytext); lex());
I want that the type ID should be detected only when it starts with or is found after "alpha". I know that writing it as
{alpha}+ {alphanum}* => (ID);
is an option but I need to learn to use the use of start states as well for some other purposes. Can someone please help me on this?
The information you need is in the documentation which comes with SML available in various places. Many university courses have online notes which contain working examples.
The first thing to note from your example code is that you have overloaded the name alpha and used it to name a state and a pattern. This is probably not a good idea. The pattern alphanum is not not defined, and the result ID is not declared. Some basic errors which you should probably fix before thinking about using states - or posting a question here on SO. Asking for help for code with such obvious faults in it is not encouraging help from the experts. :-)
Having fixed up those errors, we can start using states. Here is my version of your code:
datatype lexresult = ID
| EOF
val error = fn x => TextIO.output(TextIO.stdOut,x ^ "\n")
val eof = fn () => EOF
%%
%structure myLang
digit=[0-9];
ws=[\ \t\n];
str=\"[.*]+\";
strop=\[[0-9...?\^]\];
%s ALPHA_STATE;
alpha=[a-zA-Z];
alphanum=[a-zA-Z0-9];
%%
<INITIAL>{alpha} => (YYBEGIN ALPHA_STATE; continue());
<ALPHA_STATE>{alphanum}+ => (YYBEGIN INITIAL; TextIO.output(TextIO.stdOut,"ID\n"); ID);
. => (error ("myLang: ignoring bad character " ^ yytext); lex());
You can see I've added ID to the lexresult, named the state ALPHA_STATE and added the alphanum pattern. Now lets look at how the state code works:
There are two states in this program, they are called INITIAL and ALPHA_STATE (all lex programs have an INITIAL default state). It always begins recognising in the INITIAL state. Having a rule <INITIAL>{alpha} => indicates that if you encounter a letter when in the initial state (i.e. NOT in the ALPHA_STATE) then it is a match and the action should be invoked. The action for this rule works as follows:
YYBEGIN ALPHA_STATE; (* Switch from INITIAL state to ALPHA_STATE *)
continue() (* and keep going *)
Now we are in ALPHA_STATE it enables those rules defined for this state, which enable the rule <ALPHA_STATE>{alphanum} =>. The action on this rule switch back to the INITIAL state and record the match.
For a longer example of using states (lex rather than ML-lex) you can see my answer here: Error while parsing comments in lex.
To test this ML-LEX program I referenced this helpful question: building a lexical analyser using ml-lex, and generated the following SML program:
use "states.lex.sml";
open myLang
val lexer =
let
fun input f =
case TextIO.inputLine f of
SOME s => s
| NONE => raise Fail "Implement proper error handling."
in
myLang.makeLexer (fn (n:int) => input TextIO.stdIn)
end
val nextToken = lexer();
and just for completeness, it generated the following output demonstrating the match:
c:\Users\Brian>"%SMLNJ_HOME%\bin\sml" main.sml
Standard ML of New Jersey v110.78 [built: Sun Dec 21 15:52:08 2014]
[opening main.sml]
[opening states.lex.sml]
[autoloading]
[library $SMLNJ-BASIS/basis.cm is stable]
[autoloading done]
structure myLang :
sig
structure UserDeclarations : <sig>
exception LexError
structure Internal : <sig>
val makeLexer : (int -> string) -> unit -> Internal.result
end
val it = () : unit
hello
ID

Handle hidden channel in antlr 3

I am writing an ANTRL grammar for translating one language to another but the documentation on using the HIDDEN channel is very scarce. I cannot find an example anywhere. The only thing I have found is the FAQ on www.antlr.org which tells you how to access the hidden channel but not how best to use this functionality. The target language is Java.
In my grammar file, I pass whitespace and comments through like so:
// Send runs of space and tab characters to the hidden channel.
WHITESPACE
: (SPACE | TAB)+ { $channel = HIDDEN; }
;
// Single-line comments begin with --
SINGLE_COMMENT
: ('--' COMMENT_CHARS NEWLINE) {
$channel=HIDDEN;
}
;
fragment COMMENT_CHARS
: ~('\r' | '\n')*
;
// Treat runs of newline characters as a single NEWLINE token.
NEWLINE
: ('\r'? '\n')+ { $channel = HIDDEN; }
;
In my members section I have defined a method for writing hidden channel tokens to my output StringStream...
#members {
private int savedIndex = 0;
void ProcessHiddenChannel(TokenStream input) {
List<Token> tokens = ((CommonTokenStream)input).getTokens(savedIndex, input.index());
for(Token token: tokens) {
if(token.getChannel() == token.HIDDEN_CHANNEL) {
output.append(token.getText());
}
}
savedIndex = input.index();
}
}
Now to use this, I have to call the method after every single token in my grammar.
myParserRule
: MYTOKEN1 { ProcessHiddenChannel(input); }
MYTOKEN2 { ProcessHiddenChannel(input); }
;
Surely there must be a better way?
EDIT: This is an example of the input language:
-- -----------------------------------------------------------------
--
--
-- Name Description
-- ==================================
-- IFM1/183 Freq Spectrum Inversion
--
-- -----------------------------------------------------------------
PROCEDURE IFM1/183
TITLE "Freq Spectrum Inversion";
HELP
Freq Spectrum Inversion
ENDHELP;
PRIVILEGE CTRL;
WINDOW MANDATORY;
INPUT
$Input : #NO_YES
DEFAULT select %YES when /IFMS1/183.VALUE = %NO;
%NO otherwise
endselect
PROMPT "Spec Inv";
$Forced_Cmd : BOOLEAN
Default FALSE
Prompt "Forced Commanding";
DEFINE
&RetCode : #PSTATUS := %OK;
&msg : STRING;
&Input : BOOLEAN;
REQUIRE AVAILABLE(/IFMS1)
MSG "IFMS1 not available";
REQUIRE /IFMS1/001.VALUE = %MON_AND_CTRL
MSG "IFMS1 not in control mode";
BEGIN -- Procedure Body --
&msg := "IFMS1/183 -> " + toString($Input) + " : ";
-- pre-check
IF /IFMS1/183.VALUE = $Input
AND $Forced_Cmd = FALSE THEN
EXIT (%OK, MSG &msg + "already set");
ENDIF;
-- command
IF $Input = %YES THEN &Input:= TRUE;
ELSE &Input:= FALSE;
ENDIF;
SET &RetCode := SEND IFMS1.FREQPLAN
( $FreqSpecInv := &Input);
IF &RetCode <> %OK THEN
EXIT (&RetCode, MSG &msg + "command failed");
ENDIF;
-- verify
SET &RetCode := VERIFY /IFMS1/183.VALUE = $Input TIMEOUT '10';
IF &RetCode <> %OK THEN
EXIT (&RetCode, MSG &msg + "verification failed");
ELSE
EXIT (&RetCode, MSG &msg + "verified");
ENDIF;
END
Look into inheriting CommonTokenStream and feeding an instance of your subclass into ANTLR. From the code example that you give, I suspect that you might be interested in taking a look at the filter and the rewrite options available in version 3.
Also, take a look at this other related stack overflow question.
I have just been going through some of my old questions and thought it was worth responding with the final solution that worked the best. In the end, the best way to translate a language was to use StringTemplate. This takes care of re-indenting the output for you. There is a very good example called 'cminus' in the ANTLR example pack that shows how to use it.