I am evaluating a relatively simple IF/THEN language but have run into a problem: I need to match both integers AND dates that are in the format YYYYMMDD. If I could write a real regular expression I could solve this pretty easily, but haven't figured out an ANTLR solution.
Grammar looks like this:
//overall rule to evaluate a single expression
singleEvaluation returns [boolean evalResult]
: integerEvaluation {$evalResult = $integerEvaluation.evalResult;}
| dateEvaluation {$evalResult = $dateEvaluation.evalResult;}
// etc
;
dateEvaluation returns [boolean evalResult]
: expr1=(INTEGER|'TODAY'|DATE_FIELD_IDENTIFIER) (leftOp=('+'|'-') leftModifier=INTEGER leftQualifier=DATE_QUALIFIER)?
operator=(EQ|NE|LT|LE|GT|GE)
expr2=(INTEGER|'TODAY'|DATE_FIELD_IDENTIFIER) (rightOp=('+'|'-') rightModifier=INTEGER rightQualifier=DATE_QUALIFIER)?
{ // code }
integerEvaluation returns [boolean evalResult]
: expr1=(NUM_FIELD_IDENTIFIER|INTEGER)
operator=(EQ|NE|LT|LE|GT|GE)
expr2=(NUM_FIELD_IDENTIFIER|INTEGER)
{ // code
}
;
fragment DIGIT: '0'..'9';
INTEGER: DIGIT+;
DATE_FIELD_IDENTIFIER: ('DOB'|'DATE_OF_HIRE');
NUM_FIELD_IDENTIFIER: ('AGE'|'DEPARTMENT_ID');
DATE_QUALIFIER:('YEAR'|'YEARS'|'MONTH'|'MONTHS'|'DAY'|'DAYS'|'TODAY');
EQ:'=';
NE: '<>';
LT: '<';
LE: '<=';
GT: '>';
GE: '>=';
An example of a statement needing to be parsed would be something like "65 > AGE" or "AGE < 65", or "DOB > 19500101".
Can someone suggest a way to make the parser differentiate between an INTEGER and the 8 digit date format?
After the lexer matches a INTEGER, you can inspect its matched text (referenced through $text), and based on that custom check, decide to change it's type from INTEGER to DATE. The DATE rule can be made as an empty fragment rule, and can then be used inside a parser rule just as if it were a normal lexer rule.
A quick demo:
INTEGER
: DIGIT+
{
// If this token starts with either '19' or '20', followed
// by 6 digits, change it to a DATE-token.
if ($text.matches("(19|20)\\d{6}")) {
$type = DATE;
}
}
;
fragment DATE : /* empty! */ ;
And then in a parser rule, you can just use DATE:
dateEvaluation
: DATE ...
;
Related
I'm trying to implement a lexer rule for an oracle Q quoted string mechanism where we have something like q'$some string$'
Here you can have any character in place of $ other than whitespace, (, {, [, <, but the string must start and end with the same character. Some examples of accepted tokens would be:
q'!some string!'
q'ssome strings'
Notice how s is the custom delimiter but it is fine to have that in the string as well because we would only end at s'
Here's how I was trying to implement the rule:
Q_QUOTED_LITERAL: Q_QUOTED_LITERAL_NON_TERMINATED . QUOTE-> type(QUOTED_LITERAL);
Q_QUOTED_LITERAL_NON_TERMINATED:
Q QUOTE ~[ ({[<'"\t\n\r] { setDelimChar( (char)_input.LA(-1) ); }
( . { !isValidEndDelimChar() }? )*
;
I have already checked the value I get from !isValidEndDelimChar() and I'm getting a false predicate here at the right place so everything should work, but antlr simply ignores this predicate. I've also tried moving the predicate around, putting that part in a separate rule, and a bunch of other stuff, after a day and a half of research on the same I'm finally raising this issue.
I have also tried to implement it in other ways but there doesn't seem to be a way to implement a custom char delimited string in antlr4 (The antlr3 version used to work).
Not sure why the { ... } action isn't invoked, but it's not needed. The following grammar worked for me (put the predicate in front of the .!):
grammar Test;
#lexer::members {
boolean isValidEndDelimChar() {
return (_input.LA(1) == getText().charAt(2)) && (_input.LA(2) == '\'');
}
}
parse
: .*? EOF
;
Q_QUOTED_LITERAL
: 'q\'' ~[ ({[<'"\t\n\r] ( {!isValidEndDelimChar()}? . )* . '\''
;
SPACE
: [ \t\f\r\n] -> skip
;
If you run the class:
import org.antlr.v4.runtime.*;
public class Main {
public static void main(String[] args) {
Lexer lexer = new TestLexer(CharStreams.fromString("q'ssome strings' q'!foo!'"));
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
for (Token t : tokens.getTokens()) {
System.out.printf("%-20s %s\n", TestLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
}
}
the following output will be printed:
Q_QUOTED_LITERAL q'ssome strings'
Q_QUOTED_LITERAL q'!foo!'
EOF <EOF>
I would like to match a "{NUM}" and then have the lexer rule return "NUM". so, I tried
NUM : ('{' { skip(); }) 'NUM' ('}' { skip(); });
But, that seems to skip everything and return empty on a match. would it be possible to skip parts of a lexer match ?
antlr 3.4
Invoking skip() anywhere in your rule will remove the entire token from the lexer, not just certain characters.
What you could do is this:
NUM
: '{NUM}' {setText("NUM");}
;
Or, if NUM is variable, do:
NUM
: '{' 'A'..'Z'+ '}' {setText($text.substring(1, $text.length() - 1));}
;
which removes the first and last char from the token.
EDIT
smartnut007 wrote:
Is there an equivalent way to do this for Tokens ?
If you mean how to change the text of tokens inside parser rules, try this:
parser_rule
: LEXER_RULE {$LEXER_RULE.setText("new-text");}
;
LEXER_RULE
: 'old-text'
;
In short: how do I implement dynamic variables in ANTLR?
I come to you again with a basic ANTLR question.
I have this grammar:
grammar Amethyst;
options {
language = Java;
}
#header {
package org.omer.amethyst.generated;
import java.util.HashMap;
}
#lexer::header {
package org.omer.amethyst.generated;
}
#members {
HashMap memory = new HashMap();
}
begin: expr;
expr: (defun | println)*
;
println:
'println' atom {System.out.println($atom.value);}
;
defun:
'defun' VAR INT {memory.put($VAR.text, Integer.parseInt($INT.text));}
| 'defun' VAR STRING_LITERAL {memory.put($VAR.text, $STRING_LITERAL.text);}
;
atom returns [Object value]:
INT {$value = Integer.parseInt($INT.text);}
| ID
{
Object v = memory.get($ID.text);
if (v != null) $value = v;
else System.err.println("undefined variable " + $ID.text);
}
| STRING_LITERAL
{
String v = (String) memory.get($STRING_LITERAL.text);
if (v != null) $value = String.valueOf(v);
else System.err.println("undefined variable " + $STRING_LITERAL.text);
}
;
INT: '0'..'9'+ ;
STRING_LITERAL: '"' .* '"';
VAR: ('a'..'z'|'A'..'Z')('a'..'z'|'A'..'Z'|'0'..'9')* ;
ID: ('a'..'z'|'A'..'Z'|'0'..'9')+ ;
LETTER: ('a..z'|'A'..'Z')+ ;
WS: (' '|'\t'|'\n'|'\r')+ {skip();} ;
What it does (or should do), so far, is have a built-in "println" function to do exactly what you think it does, and a "defun" rule to define variables.
When "defun" is called on either a string or integer, the value is put into the "memory" HashMap with the first parameter being the variable's name and the second being its value.
When println is called on an atom, it should display the atom's value. The atom can be either a string or integer. It gets its value from memory and returns it. So for example:
defun greeting "Hello world!"
println greeting
But when I run this code, I get this error:
line 3:8 no viable alternative at input 'greeting'
null
NOTE: This output comes when I do:
println "greeting"
Output:
undefined variable "greeting"null
Does anyone know why this is so? Sorry if I'm not being clear, I don't understand most of this.
defun greeting "Hello world!"
println greeting
But when I run this code, I get this error:
line 3:8 no viable alternative at input 'greeting'
Because the input "greeting" is being tokenized as a VAR and a VAR is no atom. So the input defun greeting "Hello world!" is properly matched by the 2nd alternative of the defun rule:
defun
: 'defun' VAR INT // 1st alternative
| 'defun' VAR STRING_LITERAL // 2nd alternative
;
but the input println "greeting" cannot be matched by the println rule:
println
: 'println' atom
;
You must realize that the lexer does not produce tokens based on what the parser tries to match at a particular time. The input "greeting" will always be tokenized as a VAR, never as an ID rule.
What you need to do is remove the ID rule from the lexer, and replace ID with VAR inside your parser rules.
I've read that you need to use the '^' and '!' operators in order to build a parse tree similar to the ones displayed in ANTLR Works (even though you don't need to use them to get a nice tree in ANTLR Works). My question then is how can I build such a tree? I've seen a few pages on tree construction using the two operators and rewrites, and yet say I have an input string abc abc123 and a grammar:
grammar test;
program : idList;
idList : id* ;
id : ID ;
ID : LETTER (LETTER | NUMBER)* ;
LETTER : 'a' .. 'z' | 'A' .. 'Z' ;
NUMBER : '0' .. '9' ;
ANTLR Works will output:
What I dont understand is how you can get the 'idList' node on top of this tree (as well as the grammar one as a matter of fact). How can I reproduce this tree using rewrites and those operators?
What I dont understand is how you can get the 'idList' node on top of this tree (as well as the grammar one as a matter of fact). How can I reproduce this tree using rewrites and those operators?
You can't use ^ and ! alone. These operators only operate on existing tokens, while you want to create extra tokens (and make these the root of your sub trees). You can do that using rewrite rules and defining some imaginary tokens.
A quick demo:
grammar test;
options {
output=AST;
ASTLabelType=CommonTree;
}
tokens {
IdList;
Id;
}
#parser::members {
private static void walk(CommonTree tree, int indent) {
if(tree == null) return;
for(int i = 0; i < indent; i++, System.out.print(" "));
System.out.println(tree.getText());
for(int i = 0; i < tree.getChildCount(); i++) {
walk((CommonTree)tree.getChild(i), indent + 1);
}
}
public static void main(String[] args) throws Exception {
testLexer lexer = new testLexer(new ANTLRStringStream("abc abc123"));
testParser parser = new testParser(new CommonTokenStream(lexer));
walk((CommonTree)parser.program().getTree(), 0);
}
}
program : idList EOF -> idList;
idList : id* -> ^(IdList id*);
id : ID -> ^(Id ID);
ID : LETTER (LETTER | DIGIT)*;
SPACE : ' ' {skip();};
fragment LETTER : 'a' .. 'z' | 'A' .. 'Z';
fragment DIGIT : '0' .. '9';
If you run the demo above, you will see the following being printed to the console:
IdList
Id
abc
Id
abc123
As you can see, imaginary tokens must also start with an upper case letter, just like lexer rules. If you want to give the imaginary tokens the same text as the parser rule they represent, do something like this instead:
idList : id* -> ^(IdList["idList"] id*);
id : ID -> ^(Id["id"] ID);
which will print:
idList
id
abc
id
abc123
I have the following language i wish to parse using antlr 1.2.2.
TEST <name>
{
<param_name> = <param value>;
}
while
<...> - means user value, not part of the language keywords
for example
TEST myTest
{
my_param = 1.0;
}
the value can be an integer, a real or a quated string
my_param = 1.0;, my_param = 1; and my_param = "myStringValue"; are all valid inputs.
here is the grammer for this parsing.
parse_test : TESTKEYWORD TEST_NAME '{' param_value_def '}';
param_value_def : ID EQUALS param_value ';';
param_value : REAL|INTEGER|QUOTED_STRING;
TESTKEYWORD : 'TEST';
QUOTED_STRING : '"' ~('"')* '"';
INTEGER : MINUS? DIGIT DIGIT*
REAL : INTEGER '.' DIGIT DIGIT*;
EQUALS : '=';
fragment
MINUS : '-';
fragment
DIGIT : '0'..'9';
when i feed the sample input to the antlr interpreter, i get a `MismatchedTokenException' related to the param_value rule.
can you help me cipher the error message and what i am doing wrong?
thanks
Although ANTLRWorks is not a tool well written, you can use its debugger to see which token in the input leads to this exception, and then you can see which rules need to be revised (since you did not post the full grammar).
http://www.antlr.org/works/index.html