Bison custom syntax error

I have some grammar rules for a C compiler that translates to the Matlab language. I want to catch the syntax error caused by a missing ';' at the end of a statement.
For example, I have the return statement:
stmt_return : RETURN {...some actions...}
exp ';' {...others actions...}
| RETURN {...some actions...}
';' {...others actions...}
How can I handle the missing ';' and print a custom error message instead of the default message "syntax error"?
I tried adding these rules, but, as expected, they produce conflicts:
stmt_return : RETURN exp { yyerror("...")}
| RETURN { yyerror("...")}

I found this solution:
stmt_return : RETURN {...some actions...}
exp sc {...others actions...}
| RETURN {...some actions...}
sc {...others actions...}
;
sc : ';'
| { yyerror("Missing ';'"); } error
;

How to prevent default "syntax error" in Bison

As the title says, I am using Bison and Flex to build a parser, and I need to handle an error and continue parsing after I find one. So I use:
Stmt: Reference '=' Expr ';' { printf(" Reference = Expr ;\n");}
| '{' Stmts '}' { printf("{ Stmts }");}
| WHILE '(' Bool ')' '{' Stmts '}' { printf(" WHILE ( Bool ) { Stmts } ");}
| FOR NAME '=' Expr TO Expr BY Expr '{' Stmts '}' { printf(" FOR NAME = Expr TO Expr BY Expr { Stmts } ");}
| IF '(' Bool ')' THEN Stmt { printf(" IF ( Bool ) THEN Stmt ");}
| IF '(' Bool ')' THEN Stmt ELSE Stmt { printf(" IF ( Bool ) THEN Stmt ELSE Stmt ");}
| READ Reference ';' { printf(" READ Reference ;");}
| WRITE Expr ';' { printf(" WRITE Expr ;");}
| error ';' { yyerror("Statement is not valid"); yyclearin; yyerrok;}
;
However, I always get the message "syntax error"; I do not know where it comes from or how to prevent it so that my own error code is executed instead.
I am trying to do error recovery here so that my parser will continue parsing the input until EOF.
People often confuse the purpose of error rules in yacc/bison: they are for error RECOVERY, not for error HANDLING. An error rule is not called in response to an error; the error happens first, and then the error rule is used to recover.
If you want to handle the error yourself (and so avoid printing the "syntax error" message), you need to define your own yyerror function (that is the error handler) that does something with the "syntax error" string other than printing it. One option is to do nothing there, and then print a message in your error recovery rule (e.g., where you call yyerror, call printf instead). The drawback is that if error recovery fails, you won't get any message, although yyparse will return a failure code, so you could print a message at that point.

Parse Nested Block Structure using ANTLR

I have this program
{
run_and_branch(Test1)
then
{
}
else
{
}
{
run_and_branch(Test2)
then
{
}
else
{
run(Test3);
run(Test4);
run(Test5);
}
}
run_and_branch(Test6)
then
{
}
else
{
}
run(Test7);
{
run(Test8);
run(Test9);
run(Test_10);
}
}
Below is my ANTLR grammar file:
prog
: block EOF;
block
: START_BLOCK END_BLOCK -> BLOCK|
START_BLOCK block* END_BLOCK -> block*|
test=run_statement b=block* -> ^($test $b*)|
test2=run_branch_statement THEN pass=block ELSE fail=block -> ^($test2 ^(PASS $pass) ^(FAIL $fail))
;
run_branch_statement
: RUN_AND_BRANCH OPEN_BRACKET ID CLOSE_BRACKET -> ID;
run_statement
: RUN OPEN_BRACKET ID CLOSE_BRACKET SEMICOLON -> ID;
THEN : 'then';
ELSE : 'else';
RUN_AND_BRANCH : 'run_and_branch';
RUN : 'run';
START_BLOCK
: '{' ;
END_BLOCK
: '}' ;
OPEN_BRACKET
: '(';
CLOSE_BRACKET
: ')';
SEMICOLON
: ';'
;
ID : ('a'..'z'|'A'..'Z'|'_'|'0'..'9') (':'|'%'|'='|'\''|'a'..'z'|'A'..'Z'|'0'..'9'|'_'|'-'|'.'|'+'|'*'|'/'|'\\')*
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
Using ANTLRWorks I get the following AST:
As you can see, the AST has no link between Test1 and Test2 as a dependency. I want the AST to show this information so that I can traverse it and get the test dependency structure.
I am expecting the AST to look something like this:
ANTLR doesn't work this way. ANTLR produces a tree, not a graph, so there is no way to represent the desired output at the grammar level. In addition, if you tried to write tail-recursive rules to link control flow this way you would quickly run into stack overflow exceptions since ANTLR produces recursive-descent parsers.
You need to take the AST produced by ANTLR and perform separate control flow analysis on it to get a control flow graph.
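A rough sketch of what such a separate pass could look like. Everything here is an assumption for illustration: FlowNode and FlowGraphBuilder are hypothetical names (a real pass would walk ANTLR's tree nodes instead), statements in a block are assumed to run in order, and whatever follows a run_and_branch is assumed to run after whichever arm was taken.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified stand-in for the AST nodes produced by the grammar.
final class FlowNode {
    enum Kind { TEST, BLOCK, BRANCH }

    final Kind kind;
    final String name;              // test name for TEST and BRANCH, null for BLOCK
    final List<FlowNode> children;  // BLOCK: statements; BRANCH: [pass block, fail block]

    private FlowNode(Kind kind, String name, List<FlowNode> children) {
        this.kind = kind;
        this.name = name;
        this.children = children;
    }

    static FlowNode test(String name) {
        return new FlowNode(Kind.TEST, name, Collections.<FlowNode>emptyList());
    }

    static FlowNode block(FlowNode... statements) {
        return new FlowNode(Kind.BLOCK, null, Arrays.asList(statements));
    }

    static FlowNode branch(String name, FlowNode pass, FlowNode fail) {
        return new FlowNode(Kind.BRANCH, name, Arrays.asList(pass, fail));
    }
}

// Walks the tree once and records "from -> to" edges between tests.
final class FlowGraphBuilder {
    final List<String> edges = new ArrayList<>();

    // Links node after every test in preds; returns the tests that can run last in node.
    Set<String> link(FlowNode node, Set<String> preds) {
        if (node.kind == FlowNode.Kind.BLOCK) {
            Set<String> current = preds;
            for (FlowNode statement : node.children) {
                current = link(statement, current);
            }
            return current; // an empty block simply passes its predecessors through
        }
        // TEST and BRANCH both start by running node.name after every predecessor.
        for (String from : preds) {
            edges.add(from + " -> " + node.name);
        }
        if (node.kind == FlowNode.Kind.TEST) {
            return Collections.singleton(node.name);
        }
        // BRANCH: each arm starts from the branching test; the exits of both arms merge.
        Set<String> exits = new LinkedHashSet<>();
        for (FlowNode arm : node.children) {
            exits.addAll(link(arm, Collections.singleton(node.name)));
        }
        return exits;
    }
}
```

Called as new FlowGraphBuilder().link(root, new LinkedHashSet<String>()), this fills edges with the dependency links. Because the result is a list of edges rather than a tree, a test can have several predecessors, which is exactly what the tree alone cannot express.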

ANTLR: How to skip multiline comments

Given the following lexer:
lexer grammar CodeTableLexer;
@header {
package ch.bsource.ice.parsers;
}
CodeTabHeader : OBracket Code ' ' Table ' ' Version CBracket;
CodeTable : Code ' '* Table;
EndCodeTable : 'end' ' '* Code ' '* Table;
Code : 'code';
Table : 'table';
Version : '1.0';
Row : 'row';
Tabdef : 'tabdef';
Override : 'override' | 'no_override';
Obsolete : 'obsolete';
Substitute : 'substitute';
Status : 'activ' | 'inactive';
Pkg : 'include_pkg' | 'exclude_pkg';
Ddic : 'include_ddic' | 'exclude_ddic';
Tab : 'tab';
Naming : 'naming';
Dfltlang : 'dfltlang';
Language : 'english' | 'german' | 'french' | 'italian' | 'spanish';
Null : 'null';
Comma : ',';
OBracket : '[';
CBracket : ']';
Boolean
: 'true'
| 'false'
;
Number
: Int* ('.' Digit*)?
;
Identifier
: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '$' | '#' | '.' | Digit)*
;
String
@after {
setText(getText().substring(1, getText().length() - 1).replaceAll("\\\\(.)", "$1"));
}
: '"' (~('"'))* '"'
;
Comment
: '--' ~('\r' | '\n')* { skip(); }
| '/*' .* '*/' { skip(); }
;
Space
: (' ' | '\t') { skip(); }
;
NewLine
: ('\r' | '\n' | '\u000C') { skip(); }
;
fragment Int
: '1'..'9'
| '0'
;
fragment Digit
: '0'..'9'
;
... and the following parser:
parser grammar CodeTableParser;
options {
tokenVocab = CodeTableLexer;
backtrack = true;
output = AST;
}
@header {
package ch.bsource.ice.parsers;
}
parse
: block EOF
;
block
: CodeTabHeader^ codeTable endCodeTable
;
codeTable
: CodeTable^ codeTableData
;
codeTableData
: (Identifier^ obsolete?) (tabdef | row)*
;
endCodeTable
: EndCodeTable
;
tabdef
: Tabdef^ Identifier+
;
row
: Row^ rowData
;
rowData
: (Number^ | (Identifier^ (Comma Number)?))
Override?
obsolete?
status?
Pkg?
Ddic?
(tab | field)*
;
tab
: Tab^ value+
;
field
: (Identifier^ value) | naming
;
value
: OBracket? (Identifier | String | Number | Boolean | Null) CBracket?
;
naming
: Naming^ defaultNaming (l10nNaming)*
;
defaultNaming
: Dfltlang^ String
;
l10nNaming
: Language^ String?
;
obsolete
: Obsolete^ Substitute String
;
status
: Status^ Override?
;
... finally my class for making the parser case-insensitive:
package ch.bsource.ice.parsers;
import java.io.IOException;
import org.antlr.runtime.*;
public class ANTLRNoCaseFileStream extends ANTLRFileStream {
public ANTLRNoCaseFileStream(String fileName) throws IOException {
super (fileName, null);
}
public ANTLRNoCaseFileStream(String fileName, String encoding) throws IOException {
super (fileName, encoding);
}
public int LA(int i) {
if (i == 0) return 0;
if (i < 0) i++;
if ((p + 1 - 1) >= n) return CharStream.EOF;
return Character.toLowerCase(data[p + 1 - 1]);
}
}
... single-line comments are skipped as expected, while multi-line comments aren't... here is the error message I get:
codetable_1.txt line 38:0 mismatched character '<EOF>' expecting '*'
codetable_1.txt line 38:0 mismatched input '<EOF>' expecting EndCodeTable
java.lang.NullPointerException
...
Am I missing something? Is there anything I should be aware of? I'm using antlr 3.4.
Here is also the example source code I'm trying to parse:
[code table 1.0]
/*
This is a multi-line comment
*/
code table my_table
-- this is a single-line comment
row 1
id "my_id_1"
name "my_name_1"
descn "my_description_1"
naming
dfltlang "My description 1"
english "My description 1"
german "Meine Beschreibung 1"
-- this is another single-line comment
row 2
id "my_id_2"
name "my_name_2"
descn "my_description_2"
naming
dfltlang "My description 2"
english "My description 2"
german "Meine Beschreibung 2"
end code table
Any help would be really appreciated :-)
Thanks,
j3d
To do this in ANTLR 4:
BlockComment
: '/*' .*? '*/' -> skip
;
Bart gave me amazing support and I think we all really appreciate him :-)
Anyway, the problem was a bug in the FileStream class I use to convert the parsed char stream to lowercase: LA used p + 1 - 1 instead of p + i - 1, so the lookahead offset i was ignored. Here is the corrected Java source code:
import java.io.IOException;
import org.antlr.runtime.*;
public class ANTLRNoCaseFileStream extends ANTLRFileStream {
public ANTLRNoCaseFileStream(String fileName) throws IOException {
super (fileName, null);
}
public ANTLRNoCaseFileStream(String fileName, String encoding) throws IOException {
super (fileName, encoding);
}
public int LA(int i) {
if (i == 0) return 0;
if (i < 0) i++;
if ((p + i - 1) >= n) return CharStream.EOF;
return Character.toLowerCase(data[p + i - 1]);
}
}
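The effect of the index fix can be checked without ANTLR on the classpath. Below is a minimal sketch, assuming a standalone class (LowerCaseLookahead is a hypothetical name, not part of the ANTLR runtime) whose data, p and n fields play the same role as in the stream above. The guard for negative lookahead is an addition that is not in the code above.

```java
// Hypothetical stand-in for the character stream; not an ANTLR class.
final class LowerCaseLookahead {
    static final int EOF = -1;   // same value as CharStream.EOF in ANTLR 3

    private final char[] data;
    private final int n;         // number of characters in data
    int p = 0;                   // index of the next character to consume

    LowerCaseLookahead(String text) {
        this.data = text.toCharArray();
        this.n = data.length;
    }

    // Corrected lookahead: la(1) is data[p], la(2) is data[p + 1], and so on.
    int la(int i) {
        if (i == 0) return 0;
        if (i < 0) {
            i++;                              // la(-1) is the character before p
            if (p + i - 1 < 0) return EOF;    // extra guard, not in the answer's code
        }
        if (p + i - 1 >= n) return EOF;
        return Character.toLowerCase(data[p + i - 1]);
    }

    // The arithmetic from the question: the offset i is ignored entirely.
    int buggyLa(int i) {
        if (i == 0) return 0;
        if (p + 1 - 1 >= n) return EOF;
        return Character.toLowerCase(data[p + 1 - 1]);
    }
}
```

For the input "/*A", la(2) returns '*' while buggyLa(2) returns '/', the same character as la(1). Any rule that needs more than one character of lookahead therefore sees the wrong input with the original arithmetic.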
I use two rules to skip line and block comments (I print them during parsing for debugging purposes). They are split in two for better readability, and the block comment rule supports nested comments.
Also, I do not skip EOL chars (\r and/or \n) in my grammar because I need them explicitly for some rules.
LineComment
: '//' ~('\n'|'\r')* //NEWLINE
{System.out.println("lc > " + getText());
skip();}
;
BlockComment
@init { int depthOfComments = 0;}
: '/*' {depthOfComments++;}
( options {greedy=false;}
: ('/' '*')=> BlockComment {depthOfComments++;}
| '/' ~('*')
| ~('/')
)*
'*/' {depthOfComments--;}
{
if (depthOfComments == 0) {
System.out.println("bc >" + getText());
skip();
}
}
;
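The depth-counting idea behind the nested block comment rule can be tried outside ANTLR. This is only a sketch of the technique in plain Java, not a translation of the lexer rule; NestedComments and skipBlockComment are hypothetical names.

```java
// Hypothetical helper showing depth counting for nested block comments.
final class NestedComments {
    // Returns the index just past the closing delimiter that matches the
    // comment opening at start, or -1 if start is not at an opening
    // delimiter or the comment is never closed.
    static int skipBlockComment(String s, int start) {
        if (!s.startsWith("/*", start)) {
            return -1;
        }
        int depth = 0;
        int i = start;
        while (i < s.length()) {
            if (s.startsWith("/*", i)) {
                depth++;            // entering a (possibly nested) comment
                i += 2;
            } else if (s.startsWith("*/", i)) {
                depth--;            // leaving one nesting level
                i += 2;
                if (depth == 0) {
                    return i;       // the outermost comment is closed
                }
            } else {
                i++;                // ordinary comment text
            }
        }
        return -1;                  // ran out of input while still inside a comment
    }
}
```

For example, skipBlockComment("/* a /* b */ c */x", 0) returns 17, the index of the 'x', while an input that closes only the inner comment returns -1.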

why is this grammar an error 208?

I don't understand why the following grammar leads to error 208, complaining that IF can never be matched:
error(208): test.g:11:1: The following token definitions can never be matched because prior tokens match the same input: IF
ANTLRWorks 1.4.3
ANTLR 3.4
grammar test;
@lexer::members {
private boolean rawAhead() {
}
}
parse : IF*;
RAW : ({rawAhead()}?=> . )+;
IF : 'if';
ID : ('A'..'Z'|'a'..'z')+;
Removing either the RAW rule or the ID rule makes the error go away...
From my point of view, IF can still be matched when rawAhead() returns false.
Bood wrote:
I think it actually matters: say we have just an 'if' outside of mmode, e.g. <#/>if<#/>. Then the 'if' here will be matched as IF, not as RAW as it should be (same length, so the first rule wins), right?
Yeah, you're right, good point. Giving it some more thought, that is the expected behavior AFAIK. But it seems things work a bit differently: the RAW rule gets precedence over the ID and IF rules, even when placed at the end of the lexer grammar, as you can see:
freemarker_simple.g
grammar freemarker_simple;
@lexer::members {
private boolean mmode = false;
private boolean rawAhead() {
if(mmode) return false;
int ch1 = input.LA(1), ch2 = input.LA(2), ch3 = input.LA(3);
return !(
(ch1 == '<' && ch2 == '#') ||
(ch1 == '<' && ch2 == '/' && ch3 == '#') ||
(ch1 == '$' && ch2 == '{')
);
}
}
parse
: (t=. {System.out.printf("\%-15s '\%s'\n", tokenNames[$t.type], $t.text);})* EOF
;
OUTPUT_START : '${' {mmode=true;};
TAG_START : '<#' {mmode=true;};
TAG_END_START : '</' ('#' {mmode=true;} | ~'#' {$type=RAW;});
OUTPUT_END : '}' {mmode=false;};
TAG_END : '>' {mmode=false;};
EQUALS : '==';
IF : 'if';
STRING : '"' ~'"'* '"';
ID : ('a'..'z' | 'A'..'Z')+;
SPACE : (' ' | '\t' | '\r' | '\n')+ {skip();};
RAW : ({rawAhead()}?=> . )+;
Main.java
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
freemarker_simpleLexer lexer = new freemarker_simpleLexer(new ANTLRStringStream("<#/if>if<#if>foo<#if>"));
freemarker_simpleParser parser = new freemarker_simpleParser(new CommonTokenStream(lexer));
parser.parse();
}
}
will print the following to the console:
TAG_START '<#'
IF 'if'
TAG_END '>'
RAW 'if'
TAG_START '<#'
IF 'if'
TAG_END '>'
RAW 'foo'
TAG_START '<#'
IF 'if'
TAG_END '>'
As you can see, the 'if' and 'foo' are tokenized as RAW in the input:
<#/if>if<#if>foo<#if>
^^ ^^^

ANTLR antlrWorks error messages are not displayed to the output console

When I enter the following input, which has an error on the third line:
SELECT entity_one, entity_two FROM myTable;
first_table, extra_table as estable, tineda as cam;
asteroid tenga, tenta as myName, new_eNoal as coble
I debugged it with ANTLRWorks and found that the error message for the third line is shown in the debugger output window:
output/__Test___input.txt line 3:8 required (...)+ loop did not match anything at input ' '
output/__Test___input.txt line 3:9 missing END_COMMAND at 'tenga'
But when I run the application by itself, these error messages are not displayed on the console.
The error messages are displayed on the console when the error is on the first line, like this:
asteroid tenga, tenta as myName, new_eNoal as coble
SELECT entity_one, entity_two FROM myTable;
first_table, extra_table as estable, tineda as cam;
console output:
inputSql.rst line 1:8 required (...)+ loop did not match anything at input ' '
inputSql.rst line 1:9 missing END_COMMAND at 'tenga'
How can I have them displayed on the console when the errors are not on the first line?
UserRequest.g
grammar UserRequest;
tokens{
COMMA = ',' ;
WS = ' ' ;
END_COMMAND = ';' ;
}
@header {
package com.linktechnology.input;
}
@lexer::header {
package com.linktechnology.input;
}
@members{
public static void main(String[] args) throws Exception {
UserRequestLexer lex = new UserRequestLexer(new ANTLRFileStream(args[0]));
CommonTokenStream tokens = new CommonTokenStream(lex);
UserRequestParser parser = new UserRequestParser(tokens);
try {
parser.request();
} catch (RecognitionException e) {
e.printStackTrace();
}
}
}
/*------------------------------------------------------------------
* PARSER RULES
*------------------------------------------------------------------*/
process : request* EOF ;
request : (sqlsentence | create) END_COMMAND ;
sqlsentence : SELECT fields tableName ;
fields : tableName (COMMA tableName)* FROM ;
create : tableName (COMMA tableName)+ ;
tableName : WS* NAME (ALIAS NAME)? ;
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
NAME : LETTER ( LETTER |DIGIT | '-' | '_' )* ;
fragment LETTER: LOWER | UPPER;
fragment LOWER: 'a'..'z';
fragment UPPER: 'A'..'Z';
fragment DIGIT: '0'..'9';
SELECT : ('SELECT ' |'select ' ) ;
FROM : (' FROM '|' from ') ;
ALIAS : ( ' AS ' |' as ' ) ;
WHITESPACE : ( '\r' | '\n' | '\t' | WS | '\u000C' )+ { $channel = HIDDEN; } ;
That is because your main method invokes parser.request(), whereas when debugging you chose the process rule as the starting point. Since request consumes only a single (sqlsentence | create) END_COMMAND from your input, it stops after the first statement and produces no error.
Change the main method to:
@members{
public static void main(String[] args) throws Exception {
UserRequestLexer lex = new UserRequestLexer(new ANTLRFileStream(args[0]));
CommonTokenStream tokens = new CommonTokenStream(lex);
UserRequestParser parser = new UserRequestParser(tokens);
try {
parser.process();
} catch (RecognitionException e) {
e.printStackTrace();
}
}
}
and you'll see the same errors on the console since process forces the parser to consume the entire input, all the way to EOF.
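The difference between the two entry points can be seen in a toy hand-written parser. This is only an illustration under simplified assumptions: StartRuleDemo is a hypothetical class, its grammar (whitespace-separated names ending in ';') is much smaller than UserRequest.g, and it is not what ANTLR generates.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy parser: request stops after one statement, process runs to EOF.
final class StartRuleDemo {
    private final String[] tokens;
    private int pos = 0;
    final List<String> errors = new ArrayList<>();

    StartRuleDemo(String input) {
        String trimmed = input.trim();
        this.tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    private boolean atEof() {
        return pos >= tokens.length;
    }

    // request : NAME+ ';'   (returns right after the first ';')
    void request() {
        int names = 0;
        while (!atEof() && !tokens[pos].equals(";")) {
            pos++;
            names++;
        }
        if (names == 0) {
            errors.add("expected a name at token " + pos);
        }
        if (atEof()) {
            errors.add("missing ';' at end of input");
        } else {
            pos++; // consume ';'
        }
    }

    // process : request* EOF   (keeps going until the input is exhausted)
    void process() {
        while (!atEof()) {
            request();
        }
    }
}
```

With the input "a b ; c d ; e f", where the third statement lacks its ';', calling request() leaves errors empty because parsing stops after the first statement, while calling process() on a fresh instance records one error.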