ANTLR lexer/parser exception understanding - antlr

I have the following language that I wish to parse using ANTLR 1.2.2.
TEST <name>
{
<param_name> = <param value>;
}
where <...> denotes a user-supplied value, not a language keyword.
For example:
TEST myTest
{
my_param = 1.0;
}
The value can be an integer, a real, or a quoted string:
my_param = 1.0;, my_param = 1; and my_param = "myStringValue"; are all valid inputs.
Here is the grammar for this language.
parse_test : TESTKEYWORD TEST_NAME '{' param_value_def '}';
param_value_def : ID EQUALS param_value ';';
param_value : REAL|INTEGER|QUOTED_STRING;
TESTKEYWORD : 'TEST';
QUOTED_STRING : '"' ~('"')* '"';
INTEGER : MINUS? DIGIT DIGIT*;
REAL : INTEGER '.' DIGIT DIGIT*;
EQUALS : '=';
fragment
MINUS : '-';
fragment
DIGIT : '0'..'9';
When I feed the sample input to the ANTLR interpreter, I get a `MismatchedTokenException` related to the param_value rule.
Can you help me decipher the error message and understand what I am doing wrong?
Thanks

ANTLRWorks is not the most polished tool, but its debugger shows which token in the input triggers this exception, and from there you can see which rules need to be revised (it is hard to say more, since you did not post the full grammar).
http://www.antlr.org/works/index.html
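Before stepping through the debugger, it can help to check by hand which lexer rule each sample value is expected to match. The following is a minimal sketch in plain Java, not ANTLR itself: the class name ParamValueKind and its regular expressions are assumptions that approximate the REAL, INTEGER and QUOTED_STRING rules posted in the question.

```java
import java.util.regex.Pattern;

// Hypothetical helper: approximates the question's lexer rules with regexes.
class ParamValueKind {
    // REAL : INTEGER '.' DIGIT DIGIT*  (with INTEGER : MINUS? DIGIT DIGIT*)
    private static final Pattern REAL = Pattern.compile("-?[0-9]+\\.[0-9]+");
    private static final Pattern INTEGER = Pattern.compile("-?[0-9]+");
    // QUOTED_STRING : '"' ~('"')* '"'
    private static final Pattern QUOTED_STRING = Pattern.compile("\"[^\"]*\"");

    // Returns the name of the rule whose pattern matches the whole value.
    static String classify(String value) {
        if (REAL.matcher(value).matches()) return "REAL";
        if (INTEGER.matcher(value).matches()) return "INTEGER";
        if (QUOTED_STRING.matcher(value).matches()) return "QUOTED_STRING";
        return "NO_MATCH";
    }

    public static void main(String[] args) {
        for (String v : new String[] {"1.0", "1", "\"myStringValue\"", "abc"}) {
            System.out.println(v + " -> " + classify(v));
        }
    }
}
```

If a value classifies as expected here but the parser still fails, the problem is more likely in rules that were not posted (for example the ones for ID and TEST_NAME) than in param_value.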

Related

Yield a modified token in ANTLR4

I have a syntax like the following
Identifier
: [a-zA-Z0-9_.]+
| '`' Identifier '`'
;
When I match an identifier such as `someone`, I'd like to strip the backticks and yield a different token, i.e. someone.
Of course, I could walk through the final token array, but is it possible to do it during tokenization?
If I understand correctly, given the input (file t.text) :
one `someone`
two `fred`
tree `henry`
you would like that tokens are automatically produced as if the grammar had the lexer rules :
SOMEONE : 'someone' ;
FRED : 'fred' ;
HENRY : 'henry' ;
ID : [a-zA-Z0-9_.]+ ;
But tokens are identified by a type, i.e. an integer, not by the name of the lexer rule. You can change this type with setType() :
grammar Question;
/* Change `someone` to SOMEONE, `fred` to FRED, etc. */
@lexer::members { int next_number = 1001; }
question
@init {System.out.println("Question last update 1117");}
: expr+ EOF
;
expr
: ID BACKTICK_ID
;
ID : [a-zA-Z0-9_.]+ ;
BACKTICK_ID : '`' ID '`' { setType(next_number); next_number+=1; } ;
WS : [ \r\n\t] -> skip ;
Execution :
$ grun Question question -tokens -diagnostics t.text
[@0,0:2='one',<ID>,1:0]
[@1,4:12='`someone`',<1001>,1:4]
[@2,14:16='two',<ID>,2:0]
[@3,18:23='`fred`',<1002>,2:4]
[@4,25:28='tree',<ID>,3:0]
[@5,30:36='`henry`',<1003>,3:5]
[@6,38:37='<EOF>',<EOF>,4:0]
Question last update 1117
line 1:4 mismatched input '`someone`' expecting BACKTICK_ID
line 2:4 mismatched input '`fred`' expecting BACKTICK_ID
line 3:5 mismatched input '`henry`' expecting BACKTICK_ID
The basic types come from the lexer rules :
$ cat Question.tokens
ID=1
BACKTICK_ID=2
WS=3
The other types come from setType(). Instead of incrementing a number for each token, you could record the names already seen in a table and look each name up before allocating a new number, so that the same name always receives the same type.
In any case, this is of little use in the parser, because parser rules need to know the type number in advance.
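The table idea can be sketched in plain Java, independently of ANTLR. The class name TokenTypeTable and the method typeFor are hypothetical; inside the lexer action one would presumably pass the result to setType() instead of incrementing next_number directly.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of ANTLR: one type number per distinct name.
class TokenTypeTable {
    private final Map<String, Integer> types = new LinkedHashMap<>();
    private int nextNumber;

    TokenTypeTable(int firstNumber) {
        this.nextNumber = firstNumber;
    }

    // Returns the type already assigned to this name, or allocates the next free number.
    int typeFor(String name) {
        Integer existing = types.get(name);
        if (existing != null) {
            return existing;
        }
        int allocated = nextNumber++;
        types.put(name, allocated);
        return allocated;
    }

    public static void main(String[] args) {
        TokenTypeTable table = new TokenTypeTable(1001);
        System.out.println(table.typeFor("someone")); // 1001
        System.out.println(table.typeFor("fred"));    // 1002
        System.out.println(table.typeFor("someone")); // 1001 again, no duplicate
    }
}
```

This only keeps the numbering stable; the parser still cannot name these types in its rules.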
If you have a set of names known in advance, you can list them in a tokens statement :
grammar Question;
/* Change `someone` to SOMEONE, `fred` to FRED, etc. */
@lexer::header {
import java.util.*;
}
tokens { SOMEONE, FRED, HENRY }
@lexer::members {
Map<String,Integer> keywords = new HashMap<String,Integer>() {{
put("someone", QuestionParser.SOMEONE);
put("fred", QuestionParser.FRED);
put("henry", QuestionParser.HENRY);
}};
}
question
@init {System.out.println("Question last update 1746");}
: expr+ EOF
;
expr
: ID SOMEONE
| ID FRED
| ID HENRY
;
ID : [a-zA-Z0-9_.]+ ;
BACKTICK_ID : '`' ID '`'
{ String textb = getText();
String texta = textb.substring(1, textb.length() - 1);
System.out.println("text before=" + textb + ", text after="+ texta);
if ( keywords.containsKey(texta)) {
setType(keywords.get(texta)); // reset token type
setText(texta); // remove backticks
}
}
;
WS : [ \r\n\t] -> skip ;
Execution :
$ grun Question question -tokens -diagnostics t.text
text before=`someone`, text after=someone
text before=`fred`, text after=fred
text before=`henry`, text after=henry
[@0,0:2='one',<ID>,1:0]
[@1,4:12='someone',<4>,1:4]
[@2,14:16='two',<ID>,2:0]
[@3,18:23='fred',<5>,2:4]
[@4,25:28='tree',<ID>,3:0]
[@5,30:36='henry',<6>,3:5]
[@6,38:37='<EOF>',<EOF>,4:0]
Question last update 1746
$ cat Question.tokens
ID=1
BACKTICK_ID=2
WS=3
SOMEONE=4
FRED=5
HENRY=6
As you can see, there are no more errors, because the expr rule now receives properly typed tokens. Even though the grammar has no lexer rules
SOMEONE : 'someone' ;
FRED : 'fred' ;
HENRY : 'henry' ;
but only ID and BACKTICK_ID, the types have been defined behind the scenes by the tokens statement :
public static final int
ID=1, BACKTICK_ID=2, WS=3, SOMEONE=4, FRED=5, HENRY=6;
I'm afraid that if you want an open-ended list of names, it's not possible, because the parser works with types, not with the names of lexer rules :
public static class ExprContext extends ParserRuleContext {
public TerminalNode ID() { return getToken(QuestionParser.ID, 0); }
public TerminalNode SOMEONE() { return getToken(QuestionParser.SOMEONE, 0); }
public TerminalNode FRED() { return getToken(QuestionParser.FRED, 0); }
public TerminalNode HENRY() { return getToken(QuestionParser.HENRY, 0); }
...
public final ExprContext expr() throws RecognitionException {
try { ...
setState(17);
case 1:
enterOuterAlt(_localctx, 1);
{
setState(11);
match(ID);
setState(12);
match(SOMEONE);
}
break;
In
match(SOMEONE);
SOMEONE is a constant representing the number 4.
If you don't have a list of known names, emit will not solve your problem because it creates a Token whose most important field is _type :
public Token emit() {
Token t = _factory.create(_tokenFactorySourcePair, _type, _text, _channel, _tokenStartCharIndex, getCharIndex()-1,
_tokenStartLine, _tokenStartCharPositionInLine);
emit(t);
return t;
}

skipping parts of a matched lexical element or token

I would like to match "{NUM}" and then have the lexer rule return "NUM". So I tried:
NUM : ('{' { skip(); }) 'NUM' ('}' { skip(); });
But that seems to skip everything and return an empty match. Is it possible to skip only parts of a lexer match?
I am using ANTLR 3.4.
Invoking skip() anywhere in your rule will remove the entire token from the lexer, not just certain characters.
What you could do is this:
NUM
: '{NUM}' {setText("NUM");}
;
Or, if NUM is variable, do:
NUM
: '{' 'A'..'Z'+ '}' {setText($text.substring(1, $text.length() - 1));}
;
which removes the first and last char from the token.
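The substring call in that action is ordinary Java, so its effect can be checked outside the grammar. A minimal sketch, with a hypothetical class name StripDelimiters and an assumed guard for inputs shorter than two characters:

```java
// Hypothetical helper showing what the setText(...) argument evaluates to.
class StripDelimiters {
    // Drops the first and last character, e.g. the braces around {NUM}.
    static String strip(String text) {
        if (text.length() < 2) {
            return text; // assumption: leave very short input unchanged
        }
        return text.substring(1, text.length() - 1);
    }

    public static void main(String[] args) {
        System.out.println(strip("{NUM}")); // NUM
        System.out.println(strip("{ABC}")); // ABC
    }
}
```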
EDIT
smartnut007 wrote:
Is there an equivalent way to do this for Tokens ?
If you mean how to change the text of tokens inside parser rules, try this:
parser_rule
: LEXER_RULE {$LEXER_RULE.setText("new-text");}
;
LEXER_RULE
: 'old-text'
;

ANTLR Date and Integer Matching

I am evaluating a relatively simple IF/THEN language but have run into a problem: I need to match both integers AND dates in the format YYYYMMDD. If I could write a real regular expression I could solve this pretty easily, but I haven't figured out an ANTLR solution.
Grammar looks like this:
//overall rule to evaluate a single expression
singleEvaluation returns [boolean evalResult]
: integerEvaluation {$evalResult = $integerEvaluation.evalResult;}
| dateEvaluation {$evalResult = $dateEvaluation.evalResult;}
// etc
;
dateEvaluation returns [boolean evalResult]
: expr1=(INTEGER|'TODAY'|DATE_FIELD_IDENTIFIER) (leftOp=('+'|'-') leftModifier=INTEGER leftQualifier=DATE_QUALIFIER)?
operator=(EQ|NE|LT|LE|GT|GE)
expr2=(INTEGER|'TODAY'|DATE_FIELD_IDENTIFIER) (rightOp=('+'|'-') rightModifier=INTEGER rightQualifier=DATE_QUALIFIER)?
{ // code
}
;
integerEvaluation returns [boolean evalResult]
: expr1=(NUM_FIELD_IDENTIFIER|INTEGER)
operator=(EQ|NE|LT|LE|GT|GE)
expr2=(NUM_FIELD_IDENTIFIER|INTEGER)
{ // code
}
;
fragment DIGIT: '0'..'9';
INTEGER: DIGIT+;
DATE_FIELD_IDENTIFIER: ('DOB'|'DATE_OF_HIRE');
NUM_FIELD_IDENTIFIER: ('AGE'|'DEPARTMENT_ID');
DATE_QUALIFIER:('YEAR'|'YEARS'|'MONTH'|'MONTHS'|'DAY'|'DAYS'|'TODAY');
EQ:'=';
NE: '<>';
LT: '<';
LE: '<=';
GT: '>';
GE: '>=';
An example of a statement needing to be parsed would be something like "65 > AGE" or "AGE < 65", or "DOB > 19500101".
Can someone suggest a way to make the parser differentiate between an INTEGER and the 8 digit date format?
After the lexer matches an INTEGER, you can inspect its matched text (referenced through $text) and, based on that custom check, decide to change its type from INTEGER to DATE. The DATE rule can be declared as an empty fragment rule, and can then be used inside a parser rule just as if it were a normal lexer rule.
A quick demo:
INTEGER
: DIGIT+
{
// If this token starts with either '19' or '20', followed
// by 6 digits, change it to a DATE-token.
if ($text.matches("(19|20)\\d{6}")) {
$type = DATE;
}
}
;
fragment DATE : /* empty! */ ;
And then in a parser rule, you can just use DATE:
dateEvaluation
: DATE ...
;
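The regular expression in that action only checks the shape of the text, so a value such as 19509999 would still be retyped as a DATE. The sketch below, in plain Java outside the grammar, shows the same shape check plus an optional stricter calendar check. The class and method names are hypothetical, and using java.time here is an assumption about what is acceptable in the surrounding application.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical helper mirroring the check done in the INTEGER lexer action.
class DateOrInteger {
    // Same test as in the grammar: '19' or '20' followed by six digits.
    static boolean looksLikeDate(String text) {
        return text.matches("(19|20)\\d{6}");
    }

    // Stricter variant: the text must also be a valid calendar date (yyyyMMdd).
    static boolean isRealDate(String text) {
        if (!looksLikeDate(text)) {
            return false;
        }
        try {
            LocalDate.parse(text, DateTimeFormatter.BASIC_ISO_DATE);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String s : new String[] {"65", "19500101", "19509999"}) {
            System.out.println(s + " shape=" + looksLikeDate(s) + " valid=" + isRealDate(s));
        }
    }
}
```

Whether the stricter check belongs in the lexer or in later validation is a design choice; an 8-digit integer that starts with 19 or 20 is ambiguous either way.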

ANTLR Variable Troubles

In short: how do I implement dynamic variables in ANTLR?
I come to you again with a basic ANTLR question.
I have this grammar:
grammar Amethyst;
options {
language = Java;
}
@header {
package org.omer.amethyst.generated;
import java.util.HashMap;
}
@lexer::header {
package org.omer.amethyst.generated;
}
@members {
HashMap memory = new HashMap();
}
begin: expr;
expr: (defun | println)*
;
println:
'println' atom {System.out.println($atom.value);}
;
defun:
'defun' VAR INT {memory.put($VAR.text, Integer.parseInt($INT.text));}
| 'defun' VAR STRING_LITERAL {memory.put($VAR.text, $STRING_LITERAL.text);}
;
atom returns [Object value]:
INT {$value = Integer.parseInt($INT.text);}
| ID
{
Object v = memory.get($ID.text);
if (v != null) $value = v;
else System.err.println("undefined variable " + $ID.text);
}
| STRING_LITERAL
{
String v = (String) memory.get($STRING_LITERAL.text);
if (v != null) $value = String.valueOf(v);
else System.err.println("undefined variable " + $STRING_LITERAL.text);
}
;
INT: '0'..'9'+ ;
STRING_LITERAL: '"' .* '"';
VAR: ('a'..'z'|'A'..'Z')('a'..'z'|'A'..'Z'|'0'..'9')* ;
ID: ('a'..'z'|'A'..'Z'|'0'..'9')+ ;
LETTER: ('a'..'z'|'A'..'Z')+ ;
WS: (' '|'\t'|'\n'|'\r')+ {skip();} ;
What it does (or should do), so far, is have a built-in "println" function to do exactly what you think it does, and a "defun" rule to define variables.
When "defun" is called on either a string or integer, the value is put into the "memory" HashMap with the first parameter being the variable's name and the second being its value.
When println is called on an atom, it should display the atom's value. The atom can be either a string or integer. It gets its value from memory and returns it. So for example:
defun greeting "Hello world!"
println greeting
But when I run this code, I get this error:
line 3:8 no viable alternative at input 'greeting'
null
NOTE: This output comes when I do:
println "greeting"
Output:
undefined variable "greeting"null
Does anyone know why this is so? Sorry if I'm not being clear, I don't understand most of this.
defun greeting "Hello world!"
println greeting
But when I run this code, I get this error:
line 3:8 no viable alternative at input 'greeting'
Because the input greeting is tokenized as a VAR, and the atom rule has no VAR alternative. So the input defun greeting "Hello world!" is properly matched by the 2nd alternative of the defun rule:
defun
: 'defun' VAR INT // 1st alternative
| 'defun' VAR STRING_LITERAL // 2nd alternative
;
but the input println greeting cannot be matched by the println rule:
println
: 'println' atom
;
You must realize that the lexer does not produce tokens based on what the parser tries to match at a particular time. The input greeting will always be tokenized as a VAR, never as an ID, because both rules match it and VAR is defined first.
What you need to do is remove the ID rule from the lexer, and replace ID with VAR inside your parser rules.
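The tie-breaking described above can be illustrated with a simplified model in plain Java. This is not how ANTLR is implemented; it is a sketch under the assumption that the longest match wins and that, on a tie, the rule listed first wins. The class name FirstRuleWins is hypothetical, and the regular expressions approximate the INT, VAR and ID rules from the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified model of lexer rule selection, not ANTLR's actual algorithm.
class FirstRuleWins {
    // Rules in the same order as in the grammar.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("INT", Pattern.compile("[0-9]+"));
        RULES.put("VAR", Pattern.compile("[a-zA-Z][a-zA-Z0-9]*"));
        RULES.put("ID", Pattern.compile("[a-zA-Z0-9]+"));
    }

    // Longest match wins; a later rule replaces the current best only if strictly longer.
    static String tokenTypeOf(String input) {
        String best = null;
        int bestLength = 0;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            if (m.lookingAt() && m.end() > bestLength) {
                best = rule.getKey();
                bestLength = m.end();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(tokenTypeOf("greeting")); // VAR, although ID matches too
        System.out.println(tokenTypeOf("42"));       // INT
        System.out.println(tokenTypeOf("4you"));     // ID, the only rule matching all of it
    }
}
```

In this model ID can only ever win for text that starts with a digit and contains a letter, which is why the parser never sees an ID for greeting.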

ANTLR Parser, need to know which parser rule is matched

In ANTLR, for a given token, is there a way to tell which parser rule is matched?
For example, from the ANTLR grammar:
tokens
{
ADD='Add';
SUB='Sub';
}
fragment
ANYDIGIT : '0'..'9';
fragment
UCASECHAR : 'A'..'Z';
fragment
LCASECHAR : 'a'..'z';
fragment
DATEPART : ('0'..'1') (ANYDIGIT) '/' ('0'..'3') (ANYDIGIT) '/' (ANYDIGIT) (ANYDIGIT) (ANYDIGIT) (ANYDIGIT);
fragment
TIMEPART : ('0'..'2') (ANYDIGIT) ':' ('0'..'5') (ANYDIGIT) ':' ('0'..'5') (ANYDIGIT);
SPACE : ' ';
NEWLINE : '\r'? '\n';
TAB : '\t';
FORMFEED : '\f';
WS : (SPACE|NEWLINE|TAB|FORMFEED)+ {$channel=HIDDEN;};
IDENTIFIER : (LCASECHAR|UCASECHAR|'_') (LCASECHAR|UCASECHAR|ANYDIGIT|'_')*;
TIME : '\'' (TIMEPART) '\'';
DATE : '\'' (DATEPART) (' ' (TIMEPART))? '\'';
STRING : '\''! (.)* '\''!;
DOUBLE : (ANYDIGIT)+ '.' (ANYDIGIT)+;
INT : (ANYDIGIT)+;
literal : INT|DOUBLE|STRING|DATE|TIME;
var : IDENTIFIER;
param : literal|fcn_call|var;
fcn_name : ADD |
SUB |
DIVIDE |
MOD |
DTSECONDSBETWEEN |
DTGETCURRENTDATETIME |
APPEND |
STRINGTOFLOAT;
fcn_call : fcn_name WS? '('! WS? ( param WS? ( ','! WS? param)*)* ')'!;
expr : fcn_call WS? EOF;
And in Java:
CommonTreeNodeStream nodes = new CommonTreeNodeStream(tree);
nodes.reset();
Object obj;
while((obj = nodes.nextElement()) != null)
{
if(nodes.isEOF(obj))
{
break;
}
System.out.println(obj);
}
So what I want to know is: at System.out.println(obj), did the node match the fcn_name rule or the var rule?
The reason is that I am trying to handle vars differently from fcn_names.
Add this to your listener/visitor:
String[] ruleNames;
public void loadParser(gramParser parser) { //get parser
ruleNames = parser.getRuleNames(); //load parser rules from parser
}
Call loadParser() from wherever you create your listener/visitor, e.g.:
MyParser parser = new MyParser(tokens);
MyListener listener = new MyListener();
listener.loadParser(parser); //so we can access rule names
Then inside each rule you can get the name of the rule like this:
ruleName = ruleNames[ctx.getRuleIndex()];
No, you cannot get the name of a parser rule (at least, not without an ugly hack ➊).
But if tree is an instance of CommonTree, it means you've already invoked the expr rule of your parser, which means you already know expr matches first (which in turn matches fcn_name).
➊ On a related note, see: Get active Antlr rule