ANTLR - Field that accepts attributes with more than one word

My Grammar file (see below) parses queries of the type:
(name = Jon AND age != 16 OR city = NY);
However, it doesn't allow something like:
(name = 'Jon Smith' AND age != 16);
i.e., it doesn't allow assigning a field a value of more than one word separated by whitespace. How can I modify my grammar file to accept that?
options
{
language = Java;
output = AST;
}
tokens {
BLOCK;
RETURN;
QUERY;
ASSIGNMENT;
INDEXES;
}
@parser::header {
package pt.ptinovacao.agorang.antlr;
}
@lexer::header {
package pt.ptinovacao.agorang.antlr;
}
query
: expr ('ORDER BY' NAME AD)? ';' EOF
-> ^(QUERY expr ^('ORDER BY' NAME AD)?)
;
expr
: logical_expr
;
logical_expr
: equality_expr (logical_op^ equality_expr)*
;
equality_expr
: NAME equality_op atom -> ^(equality_op NAME atom)
| '(' expr ')' -> ^('(' expr)
;
atom
: ID
| id_list
| Int
| Number
;
id_list
: '(' ID (',' ID)* ')'
-> ID+
;
NAME
: 'equipType'
| 'equipment'
| 'IP'
| 'site'
| 'managedDomain'
| 'adminState'
| 'dataType'
;
AD : 'ASC' | 'DESC' ;
equality_op
: '='
| '!='
| 'IN'
| 'NOT IN'
;
logical_op
: 'AND'
| 'OR'
;
Number
: Int ('.' Digit*)?
;
ID
: ('a'..'z' | 'A'..'Z' | '_' | '.' | '-' | Digit)*
;
String
@after {
setText(getText().substring(1, getText().length()-1).replaceAll("\\\\(.)", "$1"));
}
: '"' (~('"' | '\\') | '\\' ('\\' | '"'))* '"'
| '\'' (~('\'' | '\\') | '\\' ('\\' | '\''))* '\''
;
Comment
: '//' ~('\r' | '\n')* {skip();}
| '/*' .* '*/' {skip();}
;
Space
: (' ' | '\t' | '\r' | '\n' | '\u000C') {skip();}
;
fragment Int
: '1'..'9' Digit*
| '0'
;
fragment Digit
: '0'..'9'
;
indexes
: ('[' expr ']')+ -> ^(INDEXES expr+)
;

Your lexer already defines a String token for quoted text, but no parser rule uses it. Include the String token as an alternative in your atom rule:
atom
: ID
| id_list
| Int
| Number
| String
;
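
To see what the String rule and its @after action do to a quoted value, here is a minimal Python sketch of the same idea. It is not ANTLR output: the names DOUBLE, SINGLE, STRING and string_value are made up for illustration, and the regular expressions are my translation of the two alternatives in the String rule.

```python
import re

# Hypothetical Python translation of the two String alternatives:
# inside the quotes, only a backslash or the enclosing quote may be escaped.
DOUBLE = r'"(?:[^"\\]|\\[\\"])*"'
SINGLE = r"'(?:[^'\\]|\\[\\'])*'"
STRING = re.compile(DOUBLE + "|" + SINGLE)


def string_value(lexeme):
    """Mimic the @after action: drop the outer quotes, then unescape."""
    if STRING.fullmatch(lexeme) is None:
        raise ValueError("not a string literal: " + lexeme)
    return re.sub(r"\\(.)", r"\1", lexeme[1:-1])


if __name__ == "__main__":
    query = "(name = 'Jon Smith' AND age != 16);"
    lexeme = STRING.search(query).group()
    print(lexeme, "->", string_value(lexeme))
```

The point is that the whole quoted run, spaces included, becomes a single token, so the parser only needs one extra alternative in atom.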

Related

ANTLR arithmetic and comparison expressions grammar

How can I add relational operations to my code?
Thanks.
My code is:
grammar denem1;
options {
output=AST;
}
tokens {
ROOT;
}
parse
: stat+ EOF -> ^(ROOT stat+)
;
stat
: expr ';'
;
expr
: Id Assign expr -> ^(Assign Id expr)
| add
;
add
: mult (('+' | '-')^ mult)*
;
mult
: atom (('*' | '/')^ atom)*
;
atom
: Id
| Num
| '('! expr ')' !
;
Assign : '=' ;
Comment : '//' ~('\r' | '\n')* {skip();};
Id : 'a'..'z'+;
Num : '0'..'9'+;
Space : (' ' | '\t' | '\r' | '\n')+ {skip();};
Like this:
...
expr
: Id Assign expr -> ^(Assign Id expr)
| rel
;
rel
: add (('<=' | '<' | '>=' | '>')^ add)?
;
add
: mult (('+' | '-')^ mult)*
;
...
If possible, use ANTLR v4 instead of the old v3. In v4, you can simply do this:
stat
: expr ';'
;
expr
: Id Assign expr
| '-' expr
| expr ('*' | '/') expr
| expr ('+' | '-') expr
| expr ('<=' | '<' | '>=' | '>') expr
| Id
| Num
| '(' expr ')'
;
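
If it helps to see why rel sits above add in the v3 version, here is a small hand-written recursive-descent sketch in Python that mirrors the rel, add, mult and atom rules. It is only an illustration of the precedence layering, not what ANTLR generates; the function names and the tuple-shaped tree are my own choices, and assignment is left out to keep it short.

```python
import re


def tokenize(text):
    # Characters outside this small token set are silently dropped.
    return re.findall(r"<=|>=|[<>+\-*/()]|[a-z]+|[0-9]+", text)


def parse(text):
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        pos += 1
        return tokens[pos - 1]

    def rel():   # rel : add (('<=' | '<' | '>=' | '>')^ add)?
        left = add()
        if peek() in ("<=", "<", ">=", ">"):
            op = take()
            left = (op, left, add())
        return left

    def add():   # add : mult (('+' | '-')^ mult)*
        left = mult()
        while peek() in ("+", "-"):
            op = take()
            left = (op, left, mult())
        return left

    def mult():  # mult : atom (('*' | '/')^ atom)*
        left = atom()
        while peek() in ("*", "/"):
            op = take()
            left = (op, left, atom())
        return left

    def atom():  # atom : Id | Num | '('! expr ')'!
        tok = take()
        if tok == "(":
            inner = rel()
            if take() != ")":
                raise SyntaxError("expected ')'")
            return inner
        if not tok.isalnum():
            raise SyntaxError("unexpected token " + tok)
        return tok

    tree = rel()
    if pos != len(tokens):
        raise SyntaxError("unexpected token " + tokens[pos])
    return tree


if __name__ == "__main__":
    print(parse("a + 1 < b * 2"))
```

Because rel calls add, which calls mult, the comparison operator ends up at the root of the tree and binds loosest, which is the usual expectation for relational operators.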

Parsing DECAF grammar in ANTLR

I am creating a parser for DECAF with ANTLR:
grammar DECAF ;
//********* LEXER ******************
LETTER: ('a'..'z'|'A'..'Z') ;
DIGIT : '0'..'9' ;
ID : LETTER( LETTER | DIGIT)* ;
NUM: DIGIT(DIGIT)* ;
COMMENTS: '//' ~('\r' | '\n' )* -> channel(HIDDEN);
WS : [ \t\r\n\f | ' '| '\r' | '\n' | '\t']+ ->channel(HIDDEN);
CHAR: (LETTER|DIGIT|' '| '!' | '"' | '#' | '$' | '%' | '&' | '\'' | '(' | ')' | '*' | '+'
| ',' | '-' | '.' | '/' | ':' | ';' | '<' | '=' | '>' | '?' | '#' | '[' | '\\' | ']' | '^' | '_' | '`'| '{' | '|' | '}' | '~'
'\t'| '\n' | '\"' | '\'');
// ********** PARSER *****************
program : 'class' 'Program' '{' (declaration)* '}' ;
declaration: structDeclaration| varDeclaration | methodDeclaration ;
varDeclaration: varType ID ';' | varType ID '[' NUM ']' ';' ;
structDeclaration : 'struct' ID '{' (varDeclaration)* '}' ;
varType: 'int' | 'char' | 'boolean' | 'struct' ID | structDeclaration | 'void' ;
methodDeclaration : methodType ID '(' (parameter (',' parameter)*)* ')' block ;
methodType : 'int' | 'char' | 'boolean' | 'void' ;
parameter : parameterType ID | parameterType ID '[' ']' ;
parameterType: 'int' | 'char' | 'boolean' ;
block : '{' (varDeclaration)* (statement)* '}' ;
statement : 'if' '(' expression ')' block ( 'else' block )?
| 'while' '(' expression ')' block
|'return' expressionA ';'
| methodCall ';'
| block
| location '=' expression
| (expression)? ';' ;
expressionA: expression | ;
location : (ID|ID '[' expression ']') ('.' location)? ;
expression : location | methodCall | literal | expression op expression | '-' expression | '!' expression | '('expression')' ;
methodCall : ID '(' arg1 ')' ;
arg1 : arg2 | ;
arg2 : (arg) (',' arg)* ;
arg : expression;
op: arith_op | rel_op | eq_op | cond_op ;
arith_op : '+' | '-' | '*' | '/' | '%' ;
rel_op : '<' | '>' | '<=' | '>=' ;
eq_op : '==' | '!=' ;
cond_op : '&&' | '||' ;
literal : int_literal | char_literal | bool_literal ;
int_literal : NUM ;
char_literal : '\'' CHAR '\'' ;
bool_literal : 'true' | 'false' ;
When I give it the input:
class Program {
void main(){
return 3+5 ;
}
}
The parse tree is not building correctly since it is not recognizing the 3+5 as an expression. Is there anything wrong with my grammar that is causing the problem?
The lexer tries to match as many characters as possible. When two or more lexer rules match the same number of characters, the one defined first wins. Because of that, a single-digit integer gets matched as a DIGIT instead of a NUM.
Try parsing the following instead:
class Program {
void main(){
return 33 + 55 ;
}
}
which will be parsed just fine, because 33 and 55 are matched as NUM tokens: NUM can match two characters there while DIGIT can match only one, so NUM wins.
To fix it, make DIGIT a fragment (and LETTER as well):
fragment LETTER: ('a'..'z'|'A'..'Z') ;
fragment DIGIT : '0'..'9' ;
ID : LETTER( LETTER | DIGIT)* ;
NUM: DIGIT(DIGIT)* ;
Lexer fragments are only used internally by other lexer rules, and will never become tokens of their own.
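
The rule "longest match wins, ties go to the rule defined first" can be sketched in a few lines of Python. This is a toy lexer for illustration only, not how ANTLR is implemented; the rule tables BROKEN and FIXED and the PLUS token name are my own assumptions (in the grammar above, '+' is an implicit token from the parser rules).

```python
import re


def lex(text, rules):
    """Toy lexer: longest match wins; on a tie the rule listed first wins."""
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in compiled:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when two matches are equally long.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError("no rule matches at position %d" % pos)
        if best_name != "WS":  # whitespace is dropped here for brevity
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


# LETTER and DIGIT are real tokens here, as in the original grammar.
BROKEN = [
    ("LETTER", r"[a-zA-Z]"),
    ("DIGIT", r"[0-9]"),
    ("ID", r"[a-zA-Z][a-zA-Z0-9]*"),
    ("NUM", r"[0-9]+"),
    ("PLUS", r"\+"),
    ("WS", r"\s+"),
]

# Making them fragments means they no longer compete as tokens.
FIXED = [rule for rule in BROKEN if rule[0] not in ("LETTER", "DIGIT")]

if __name__ == "__main__":
    print(lex("3+5", BROKEN))
    print(lex("33 + 55", BROKEN))
    print(lex("3+5", FIXED))
```

With the BROKEN table, 3 and 5 come out as DIGIT tokens, which the parser rule int_literal : NUM cannot use; with the FIXED table they come out as NUM.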
A couple of other things: your WS rule matches far too much (it currently also matches a | and a '); it should be:
WS : [ \t\r\n\f]+ ->channel(HIDDEN);
and you shouldn't match a char literal in your parser: do it in the lexer:
CHAR : '\'' ( ~['\r\n\\] | '\\' ['\\] ) '\'';
If you don't, the following will not get parsed properly:
class Program {
void main(){
return '1';
}
}
because the 1 will be tokenized as a NUM and not as a CHAR.
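
As a rough check of what the lexer-level CHAR rule accepts, here is the same pattern written as a Python regular expression. The translation and the helper name first_token are mine, so treat it as a sketch rather than a description of ANTLR's behaviour.

```python
import re

# Hypothetical translation of:  CHAR : '\'' ( ~['\r\n\\] | '\\' ['\\] ) '\'';
CHAR = re.compile(r"'(?:[^'\r\n\\]|\\['\\])'")
NUM = re.compile(r"[0-9]+")


def first_token(text):
    """Return (name, lexeme) for the first token of text, or None."""
    for name, pattern in (("CHAR", CHAR), ("NUM", NUM)):
        m = pattern.match(text)
        if m:
            return name, m.group()
    return None


if __name__ == "__main__":
    print(first_token("'1'"))   # the quotes are part of the CHAR token
    print(first_token("1"))     # a bare digit is still a NUM
```

Since the quotes belong to the token, the digit between them is never offered to NUM on its own.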

ANTLR Group expressions and save Variables

If I have expressions like:
(name = Paul AND age = 16) OR country = china;
And I want to get:
         QUERY
           |
    |-------------|
    ()            |
    |             |
   AND            OR
    |             |
|-------|         |
name    age    country
|       |         |
Paul    16      china
How can I print the () and the condition (AND/OR) before the fields name, age and country?
My grammar file is something like this:
parse
: block EOF -> block
;
block
: (statement)* (Return ID ';')?
-> ^(QUERY statement*)
;
statement
: assignment ';'
-> assignment
;
assignment
: expression (condition expression)*
-> ^(condition expression*)
| '(' expression (condition expression)* ')' (condition expression)*
-> ^(Brackets ^(condition expression*))
;
condition
: AND
| OR
;
Brackets: '()' ;
OR : 'OR' ;
AND : 'AND' ;
..
But it only prints the first condition that appears in the expression ('AND' in this example), and I can't group what is between brackets, and what is not...
Your grammar looks odd to me, and there are errors in it: if the parser does not match "()", you can't use Brackets inside a rewrite rule. And why would you ever want to have the token "()" inside your AST?
Given your example input:
(name = Paul AND age = 16) OR country = china;
here's a possible way to construct an AST:
grammar T;
options {
output=AST;
}
query
: expr ';' EOF -> expr
;
expr
: logical_expr
;
logical_expr
: equality_expr ( logical_op^ equality_expr )*
;
equality_expr
: atom ( equality_op^ atom )*
;
atom
: ID
| INT
| '(' expr ')' -> expr
;
equality_op
: '='
| 'IS' 'NOT'?
;
logical_op
: 'AND'
| 'OR'
;
ID : ('a'..'z' | 'A'..'Z')+;
INT : '0'..'9'+;
WS : (' ' | '\t' | '\r' | '\n')+ {skip();};
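which, for the example input, would result in an AST with OR at the root, the AND subtree on its left and the country comparison on its right:
^(OR ^(AND ^(= name Paul) ^(= age 16)) ^(= country china))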
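
As a rough illustration of the tree shape that grammar builds, here is a hand-written Python sketch that follows the logical_expr, equality_expr and atom rules and returns nested tuples. It is not generated by ANTLR; the function names are hypothetical, and the 'IS' 'NOT'? alternative is left out to keep it short.

```python
import re


def tokenize(text):
    # Characters outside this small token set are silently dropped.
    return re.findall(r"[A-Za-z]+|[0-9]+|[=();]", text)


def parse_query(text):
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        pos += 1
        return tokens[pos - 1]

    def logical():   # logical_expr : equality_expr (logical_op^ equality_expr)*
        left = equality()
        while peek() in ("AND", "OR"):
            op = take()
            left = (op, left, equality())
        return left

    def equality():  # equality_expr : atom (equality_op^ atom)*
        left = atom()
        while peek() == "=":
            op = take()
            left = (op, left, atom())
        return left

    def atom():      # atom : ID | INT | '(' expr ')' -> expr
        tok = take()
        if tok == "(":
            inner = logical()
            if take() != ")":
                raise SyntaxError("expected ')'")
            return inner  # the parentheses themselves leave no node
        if not tok.isalnum() or tok in ("AND", "OR"):
            raise SyntaxError("unexpected token " + tok)
        return tok

    tree = logical()
    if take() != ";" or pos != len(tokens):  # query : expr ';' EOF
        raise SyntaxError("expected ';' at end of query")
    return tree


if __name__ == "__main__":
    print(parse_query("(name = Paul AND age = 16) OR country = china;"))
```

The grouping from the parentheses survives as nesting in the tree, which is why there is no need for a "()" token in the AST.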

Small grammar that doesn't work; what am I missing (antlr4)

I have the following grammar. It's supposed to accept the strings shown in the comment at the top of the file. It does not. I must be missing something fundamental. Hints on how to debug this would also be appreciated.
/*
Should accept:
b
a:b
a:b^10
b^10
Should not accept:
:b
a:
a:^10
*/
grammar test;
filter:
boostedField EOF
;
boostedField
: qualifiedField (CARET NUMBER)?
;
qualifiedField
: (FIELDNAME COLON)? term
;
term
: TERM
;
FIELDNAME: (LETTER | UNDERSCORE) (ALPHANUM | UNDERSCORE)* ;
NUMBER : NUM_CHAR+ ('.' NUM_CHAR+)? ;
COLON : ':' ;
CARET : '^' ;
WS : (' ' | '\t' | '\n' | '\r' | '\u3000') -> skip ;
UNDERSCORE: '_' ;
// a term may not have a colon or a caret (unless escaped)
TERM : TERM_START_CHAR TERM_CHAR*;
fragment TERM_START_CHAR
: ~( ' ' | '\t' | '\n' | '\r' | '\u3000' | ':' | '^' ) ;
fragment TERM_CHAR : (TERM_START_CHAR | ESCAPED_CHAR) ;
fragment ESCAPED_CHAR : ( '\\' . ) ;
fragment NUM_CHAR: '0'..'9';
fragment LETTER: 'a'..'z' | 'A'..'Z' ;
fragment ALPHANUM: LETTER | NUM_CHAR;

antlr gated predicate

This is a follow-up to my question "Antlr superfluous Predicate required?", where I stated my problem in a simplified way; however, it could not be solved there.
I have the following grammar and when I delete the {true}?=> predicates, the text is not recognized anymore. The input string is MODULE main LTLSPEC H {} {} {o} FALSE;. Note that the trailing ; is not tokenized as EOC, but as IGNORE. When I add {true}?=> to the EOC rule, ; is tokenized as EOC.
I tried this from the command line with ANTLR v3.3 and v3.4, with no difference. Thanks in advance, I appreciate your help.
grammar NusmvInput;
options {
language = Java;
}
@parser::members{
public static void main(String[] args) throws Exception {
NusmvInputLexer lexer = new NusmvInputLexer(new ANTLRStringStream("MODULE main LTLSPEC H {} {} {o} FALSE;"));
NusmvInputParser parser = new NusmvInputParser(new CommonTokenStream(lexer));
parser.specification();
}
}
@lexer::members{
private boolean inLTL = false;
}
specification :
module+ EOF
;
module :
MODULE module_decl
;
module_decl :
NAME parameter_list ;
parameter_list
: ( LP (parameter ( COMMA parameter )*)? RP )?
;
parameter
: (NAME | INTEGER )
;
/**************
*** LEXER
**************/
COMMA
:{!inLTL}?=> ','
;
OTHER
: {!inLTL}?=>( '&' | '|' | 'xor' | 'xnor' | '=' | '!' |
'<' | '>' | '-' | '+' | '*' | '/' |
'mod' | '[' | ']' | '?')
;
RCP
: {!inLTL}?=>'}'
;
LCP
: {!inLTL}?=>'{'
;
LP
: {!inLTL}?=>'('
;
RP
: {!inLTL}?=>')'
;
MODULE
: {true}?=> 'MODULE' {inLTL = false;}
;
LTLSPEC
: {true}?=> 'LTLSPEC'
{inLTL = true; skip(); }
;
EOC
: ';'
{
if (inLTL){
inLTL = false;
skip();
}
}
;
WS
: (' ' | '\t' | '\n' | '\r')+ {$channel = HIDDEN;}
;
COMMENT
: '--' .* ('\n' | '\r') {$channel = HIDDEN;}
;
INTEGER
: {!inLTL}?=> ('0'..'9')+
;
NAME
:{!inLTL}?=> ('A'..'Z' | 'a'..'z') ('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '$' | '#' | '-')*
;
IGNORE
: {inLTL}?=> . {skip();}
;
It seems that without a predicate before MODULE and LTLSPEC, the NAME gets precedence over them even if these tokens are defined before the NAME token. Whether this is by design or a bug, I don't know.
However, the way you're trying to solve it seems rather complicated. As far as I can see, you seem to want to ignore (or skip) input starting with LTLSPEC and ending with a semicolon. Why not do something like this instead:
specification : module+ EOF;
module : MODULE module_decl;
module_decl : NAME parameter_list;
parameter_list : (LP (parameter ( COMMA parameter )*)? RP)?;
parameter : (NAME | INTEGER);
MODULE : 'MODULE';
LTLSPEC : 'LTLSPEC' ~';'* ';' {skip();};
COMMA : ',';
OTHER : ( '&' | '|' | 'xor' | 'xnor' | '=' | '!' |
'<' | '>' | '-' | '+' | '*' | '/' |
'mod' | '[' | ']' | '?')
;
RCP : '}';
LCP : '{';
LP : '(';
RP : ')';
EOC : ';';
WS : (' ' | '\t' | '\n' | '\r')+ {$channel = HIDDEN;};
COMMENT : '--' .* ('\n' | '\r') {$channel = HIDDEN;};
INTEGER : ('0'..'9')+;
NAME : ('A'..'Z' | 'a'..'z') ('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '$' | '#' | '-')*;
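
To show the effect of letting a single LTLSPEC rule swallow everything up to the semicolon, here is a toy Python lexer over a subset of those rules. It approximates the lexer as "longest match, first rule wins on ties", which is an assumption on my part and not a description of the ANTLR v3 internals; the RULES table and the lex function are made up for illustration.

```python
import re

# Subset of the rules above, in the same order; patterns are my translation.
RULES = [
    ("MODULE", r"MODULE"),
    ("LTLSPEC", r"LTLSPEC[^;]*;"),   # 'LTLSPEC' ~';'* ';' {skip();}
    ("COMMA", r","),
    ("LP", r"\("),
    ("RP", r"\)"),
    ("EOC", r";"),
    ("WS", r"[ \t\n\r]+"),
    ("INTEGER", r"[0-9]+"),
    ("NAME", r"[A-Za-z][A-Za-z0-9_$#-]*"),
]
SKIPPED = {"LTLSPEC", "WS"}  # WS is on the hidden channel in the grammar
COMPILED = [(name, re.compile(pattern)) for name, pattern in RULES]


def lex(text):
    """Toy lexer: longest match wins; on a tie the rule listed first wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in COMPILED:
            m = pattern.match(text, pos)
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError("no rule matches at position %d" % pos)
        if best_name not in SKIPPED:
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


if __name__ == "__main__":
    print(lex("MODULE main LTLSPEC H {} {} {o} FALSE;"))
```

The whole LTLSPEC section, including its closing semicolon, disappears as one skipped token, so no lexer state such as inLTL and no predicates are needed.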