Lark simple sql grammar - lark-parser

I'm trying to parse a simple sql via this grammar:
grammar = ```
program : stmnt*
stmnt : select_stmnt | drop_stmnt
select_stmnt : select_clause from_clause? group_by_clause? having_clause? order_by_clause? limit_clause? SEMICOLON
select_clause : "select"i selectables
selectables : column_name ("," column_name)*
from_clause : "from"i source where_clause?
where_clause : "where"i condition
group_by_clause : "group"i "by"i column_name ("," column_name)*
having_clause : "having"i condition
order_by_clause : "order"i "by" (column_name ("asc"i|"desc"i)?)*
limit_clause : "limit"i INTEGER_NUMBER ("offset"i INTEGER_NUMBER)?
// NOTE: there should be no on-clause on cross join and this will have to enforced post parse
source : joining? table_name table_alias?
joining : source join_modifier? JOIN source ON condition
//source : table_name table_alias? joined_source?
//joined_source : join_modifier? JOIN table_name table_alias? ON condition
join_modifier : "inner" | ("left" "outer"?) | ("right" "outer"?) | ("full" "outer"?) | "cross"
condition : or_clause+
or_clause : and_clause ("or" and_clause)*
and_clause : predicate ("and" predicate)*
// NOTE: order of operator should be longest tokens first
predicate : comparison ( ( EQUAL | NOT_EQUAL ) comparison )*
comparison : term ( ( LESS_EQUAL | GREATER_EQUAL | LESS | GREATER ) term )*
term : factor ( ( "-" | "+" ) factor )*
factor : unary ( ( "/" | "*" ) unary )*
unary : ( "!" | "-" ) unary
| primary
primary : INTEGER_NUMBER | FLOAT_NUMBER | STRING | "true" | "false" | "null"
| IDENTIFIER
drop_stmnt : "drop" "table" table_name
FLOAT_NUMBER : INTEGER_NUMBER "." ("0".."9")*
column_name : IDENTIFIER
table_name : IDENTIFIER
table_alias : IDENTIFIER
// keywords
// define keywords as they have higher priority
SELECT.5 : "select"i
FROM.5 : "from"i
WHERE.5 : "where"i
JOIN.5 : "join"i
ON.5 : "on"i
// operators
STAR : "*"
LEFT_PAREN : "("
RIGHT_PAREN : ")"
LEFT_BRACKET : "["
RIGHT_BRACKET : "]"
DOT : "."
EQUAL : "="
LESS : "<"
GREATER : ">"
COMMA : ","
// 2-char ops
LESS_EQUAL : "<="
GREATER_EQUAL : ">="
NOT_EQUAL : ("<>" | "!=")
SEMICOLON : ";"
IDENTIFIER.9 : ("_" | ("a".."z") | ("A".."Z"))* ("_" | ("a".."z") | ("A".."Z") | ("0".."9"))+
%import common.ESCAPED_STRING -> STRING
%import common.SIGNED_NUMBER -> INTEGER_NUMBER
%import common.WS
%ignore WS
However, when I call the parser with text,
"""select cola, colb from foo left outer join bar b on x = 1 join jar j on jb > xw where cola <> colb and colx > coly""",
it parses the second join as a term, i.e. as part of the first join's condition. Any thoughts on how to do this correctly?

Related

"The following sets of rules are mutually left-recursive"

I have tried to write a grammar to recognize expressions like:
(A + MAX(B) ) / ( C - AVERAGE(A) )
IF( A > AVERAGE(A), 0, 1 )
X / (MAX(X)
Unfortunately antlr3 fails with these errors:
error(210): The following sets of rules are mutually left-recursive [unaryExpression, additiveExpression, primaryExpression, formula, multiplicativeExpression]
error(211): DerivedKeywords.g:110:13: [fatal] rule booleanTerm has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
error(206): DerivedKeywords.g:110:13: Alternative 1: after matching input such as decision cannot predict what comes next due to recursion overflow to additiveExpression from formula
I have spent some hours trying to fix these, it would be great if anyone could at least help me fix the first problem. Thanks
Code:
grammar DerivedKeywords;
options {
output=AST;
//backtrack=true;
}
WS : ( ' ' | '\t' | '\n' | '\r' )
{ skip(); }
;
//for numbers
DIGIT
: '0'..'9'
;
//for both integer and real number
NUMBER
: (DIGIT)+ ( '.' (DIGIT)+ )?( ('E'|'e')('+'|'-')?(DIGIT)+ )?
;
// Boolean operatos
AND : 'AND';
OR : 'OR';
NOT : 'NOT';
EQ : '=';
NEQ : '!=';
GT : '>';
LT : '<';
GTE : '>=';
LTE : '<=';
COMMA : ',';
// Token for Functions
IF : 'IF';
MAX : 'MAX';
MIN : 'MIN';
AVERAGE : 'AVERAGE';
VARIABLE : 'A'..'Z' ('A'..'Z' | '0'..'9')*
;
// OPERATORS
LPAREN : '(' ;
RPAREN : ')' ;
DIV : '/' ;
PLUS : '+' ;
MINUS : '-' ;
STAR : '*' ;
expression : formula;
formula
: functionExpression
| additiveExpression
| LPAREN! a=formula RPAREN! // First Problem
;
additiveExpression
: a=multiplicativeExpression ( (MINUS^ | PLUS^ ) b=multiplicativeExpression )*
;
multiplicativeExpression
: a=unaryExpression ( (STAR^ | DIV^ ) b=unaryExpression )*
;
unaryExpression
: MINUS^ u=unaryExpression
| primaryExpression
;
functionExpression
: f=functionOperator LPAREN e=formula RPAREN
| IF LPAREN b=booleanExpression COMMA p=formula COMMA s=formula RPAREN
;
functionOperator :
MAX | MIN | AVERAGE;
primaryExpression
: NUMBER
// Used for scientific numbers
| DIGIT
| VARIABLE
| formula
;
// Boolean stuff
booleanExpression
: orExpression;
orExpression : a=andExpression (OR^ b=andExpression )*
;
andExpression
: a=notExpression (AND^ b=notExpression )*
;
notExpression
: NOT^ t=booleanTerm
| booleanTerm
;
booleanOperator :
GT | LT | EQ | GTE | LTE | NEQ;
booleanTerm : a=formula op=booleanOperator b=formula
| LPAREN! booleanTerm RPAREN! // Second problem
;
error(210): The following sets of rules are mutually left-recursive [unaryExpression, additiveExpression, primaryExpression, formula, multiplicativeExpression]
- this means that if the parser enters unaryExpression rule, it has the possibility to match additiveExpression, primaryExpression, formula, multiplicativeExpression and unaryExpression again without ever consuming a single token from input - so it cannot decide whether to use those rules or not, because even if it uses the rules, the input will be the same.
You're probably trying to allow subexpressions in expressions by this sequence of rules - you need to make sure that path will consume the left parenthesis of the subexpression. Probably the formula alternative in primaryExpression should be changed to LPAREN formula RPAREN, and the rest of grammar be adjusted accordingly.

Antlr4: unexpected parsing

all I have develop ANTR4 grammar. During parse the string
Time;25 10 * * *;'faccalc_minus1_cron.out.'yyyyMMdd.HHmm;America/New_York
I have following errors
Invalid chars in expression! Expression: ;' Invalid chars: ;'
extraneous input ';' expecting {'', INTEGER, '-', '/', ','}
missing ';' at '_'
Incorrect timezone format :faccalc_minus1
I don't undestand why, as regex rule contain '_'.
How to fix it?
Regards,
Vladimir
lexer grammar FileTriggerLexer;
CRON
:
'cron'
;
MARKET_CRON
:
'marketCron'
;
COMBINED
:
'combined'
;
FILE_FEED
:
'FileFeed'
;
MANUAL_NOTICE
:
'ManualNotice'
;
TIME
:
'Time'
;
MARKET_TIME
:
'MarketTime'
;
SCHEDULE
:
'Schedule'
;
PRODUCT
:
'Product'
;
UCA_CLIENT
:
'UCAClient'
;
APEX_GSM
:
'ApexGSM'
;
DELAY
:
'Delay'
;
CATEGORY
:
'Category'
;
EXCHANGE
:
'Exchange'
;
CALENDAR_EXCHANGE
:
'CalendarExchange'
;
FEED
:
'Feed'
;
RANGE
:
'Range'
;
SYNTH
:
'Synth'
;
TRIGGER
:
'Trigger'
;
DELAYED_TRIGGER
:
'DelayedTrigger'
;
INTRA_TRIGGER
:
'IntraTrigger'
;
CURRENT_TRIGGER
:
'CurrentTrigger'
;
CALENDAR_FILE_FEED
:
'CalendarFileFeed'
;
PREVIOUS
:
'Previous'
;
LATE_DELAY
:
'LateDelay'
;
BUILD_ARCHIVE
:
'BuildArchive'
;
COMPRESS
:
'Compress'
;
LATE_TIME
:
'LateTime'
;
CALENDAR_CATEGORY
:
'CalendarCategory'
;
APEX_GPM
:
'ApexGPM'
;
PORTFOLIO_NOTICE
:
'PortfolioNotice'
;
FixedTimeOfDay: 'FixedTimeOfDay';
SEMICOLON
:
';'
;
ASTERISK
:
'*'
;
LBRACKET
:
'('
;
RBRACKET
:
')'
;
PERCENT
:
'%'
;
INTEGER
:
[0-9]+
;
DASH
:
'-'
;
DOUBLE_QUOTE
:
'"'
;
QUOTE
:
'\''
;
SLASH
:
'/'
;
DOT
:
'.'
;
COMMA
:
','
;
UNDERSCORE
:
'_'
;
EQUAL
:
'='
;
MORE_THAN
:
'>'
;
LESS
:
'<'
;
ID
:
[a-zA-Z] [a-zA-Z0-9]*
;
WS
:
[ \t\r\n]+ -> skip
;
/**
* Define Fied Trigger valdiator grammar
*/
grammar FileTriggerValidator;
options
{
tokenVocab = FileTriggerLexer;
}
r
:
(
schedule
| file_feed
| time_feed
| market_time_feed
| manual_notice
| portfolio_notice
| not_checked
)+
;
not_checked
:
(
PRODUCT
| UCA_CLIENT
| APEX_GSM
| APEX_GPM
| DELAY
| CATEGORY
| CALENDAR_CATEGORY
| EXCHANGE
| CALENDAR_EXCHANGE
| FEED
| RANGE
| SYNTH
| TRIGGER
| DELAYED_TRIGGER
| INTRA_TRIGGER
| CURRENT_TRIGGER
| CALENDAR_FILE_FEED
| PREVIOUS
| LATE_DELAY
| LATE_TIME
| COMPRESS
| BUILD_ARCHIVE
)
(
SEMICOLON anyList
)?
;
anyList
:
anyElement
(
SEMICOLON anyElement
)*
;
anyElement
:
cron
| file_name
| with_step_value
| source_file
| timezone
| regEx
;
portfolio_notice
:
PORTFOLIO_NOTICE SEMICOLON regEx
;
manual_notice
:
MANUAL_NOTICE SEMICOLON file_name SEMICOLON timezone
;
time_feed
:
TIME SEMICOLON cron_part
(
timezone?
) SEMICOLON file_name SEMICOLON timezone
;
market_time_feed
:
MARKET_TIME SEMICOLON cron_part timezone SEMICOLON file_name SEMICOLON
timezone
(
SEMICOLON UNDERSCORE? INTEGER
)*
;
file_feed
:
file_feed_name SEMICOLON source_file SEMICOLON source_host SEMICOLON
source_host SEMICOLON regEx SEMICOLON regEx
(
SEMICOLON source_host
)*
;
regEx
:
(
ID
| DOT
| ASTERISK
| INTEGER
| PERCENT
| UNDERSCORE
| DASH
| LESS
| MORE_THAN
| EQUAL
| SLASH
| LBRACKET
| RBRACKET
| DOUBLE_QUOTE
| QUOTE
| COMMA
)+
;
source_host
:
ID
(
DASH ID
)*
;
file_feed_name
:
FILE_FEED
;
source_file
:
(
ID
| DASH
| UNDERSCORE
)+
;
schedule
:
SCHEDULE SEMICOLON schedule_defining SEMICOLON file_name SEMICOLON timezone
(
SEMICOLON DASH? INTEGER
)*
;
schedule_defining
:
cron
| market_cron
| combined_cron
;
cron
:
CRON LBRACKET DOUBLE_QUOTE cron_part timezone DOUBLE_QUOTE RBRACKET
;
market_cron
:
MARKET_CRON LBRACKET DOUBLE_QUOTE cron_part timezone DOUBLE_QUOTE COMMA
DOUBLE_QUOTE ID DOUBLE_QUOTE RBRACKET
;
combined_cron
:
COMBINED LBRACKET cron_list_element
(
COMMA cron_list_element
)* RBRACKET
;
mic_defining
:
ID
;
file_name
:
regEx
;
cron_list_element
:
cron
| market_cron
;
//
schedule_defined_string
:
cron
;
//
cron_part
:
minutes hours days_of_month month week_days
;
//
minutes
:
with_step_value
;
hours
:
with_step_value
;
//
int_list
:
INTEGER
| interval
(
COMMA INTEGER
| interval
)*
;
interval
:
INTEGER DASH INTEGER
;
//
days_of_month
:
with_step_value
;
//
month
:
with_step_value
;
//
week_days
:
with_step_value
;
//
timezone
:
timezone_part
(
SLASH timezone_part
)?
;
//
timezone_part
:
ID
(
UNDERSCORE ID
)?
;
//
with_step_value
:
(
INTEGER
| COMMA
| SLASH
| ASTERISK
| DASH
)+
;
step
:
SLASH int_list
;
To analyze this kind of problem, dump the token stream to see what the lexer is actually doing. To directly dump the tokens, see this answer. AntlrDT, for example, also provides a graphical analysis of the corresponding parse-tree (I am the author of AntlrDT).
From this, easy to see that the first error occurs in the with_step_value rule: does not allow for a trailing semicolon.
Second error is in the timezone_part rule: does not allow for repeated ID UNDERSCORE occurrences.

ANTLR4 Token is not recognized when substituted

I try to modify the grammar of the sqlite syntax (I'm interested in a variant of the where clause only) and I'm keep having a weird error when substituting AND to it's own token.
grammar wtfql;
/*
SQLite understands the following binary operators, in order from highest to
lowest precedence:
||
* / %
+ -
<< >> & |
< <= > >=
= != <> IS IS NOT IN LIKE GLOB MATCH REGEXP
AND
OR
*/
start : expr EOF?;
expr
: literal_value
//BIND_PARAMETER
| ( table_name '.' )? column_name
| unary_operator expr
| expr '||' expr
| expr ( '*' | '/' | '%' ) expr
| expr ( '+' | '-' ) expr
| expr ( '<' | '<=' | '>' | '>=' ) expr
| expr ( '=' | '<>' | K_IN ) expr
| expr K_AND expr
| expr K_OR expr
| function_name '(' ( expr ( ',' expr )* )? ')'
| '(' expr ')'
| expr K_NOT expr
| expr ( K_NOT K_NULL )
| expr K_NOT? K_IN ( '(' ( expr ( ',' expr )* ) ')' )
;
unary_operator
: '-'
| '+'
| K_NOT
;
literal_value
: NUMERIC_LITERAL
| STRING_LITERAL
| K_NULL
;
function_name
: IDENTIFIER
;
table_name
: any_name
;
column_name
: any_name
;
any_name
: IDENTIFIER
| keyword
// | '(' any_name ')'
;
keyword
: K_AND
| K_NOT
| K_NULL
| K_IN
| K_OR
;
IDENTIFIER
: [a-zA-Z_] [a-zA-Z_0-9]* // TODO check: needs more chars in set
;
NUMERIC_LITERAL
: DIGIT+ ( '.' DIGIT* )? ( E [-+]? DIGIT+ )?
| '.' DIGIT+ ( E [-+]? DIGIT+ )?
;
STRING_LITERAL
: '\"' ( ~'\"' | '\"\"' )* '\"'
;
SPACES
: [ \u000B\t\r\n] -> channel(HIDDEN)
;
DOT : '.';
OPEN_PAR : '(';
CLOSE_PAR : ')';
COMMA : ',';
STAR : '*';
PLUS : '+';
MINUS : '-';
TILDE : '~';
DIV : '/';
MOD : '%';
AMP : '&';
PIPE : '|';
LT : '<';
LT_EQ : '<=';
GT : '>';
GT_EQ : '>=';
EQ : '=';
NOT_EQ2 : '<>';
K_AND : A N D;
K_NOT : N O T;
K_NULL : N U L L;
K_OR : O R;
K_IN : I N;
fragment DIGIT : [0-9];
fragment A : [aA];
fragment B : [bB];
fragment C : [cC];
fragment D : [dD];
fragment E : [eE];
fragment F : [fF];
fragment G : [gG];
fragment H : [hH];
fragment I : [iI];
fragment J : [jJ];
fragment K : [kK];
fragment L : [lL];
fragment M : [mM];
fragment N : [nN];
fragment O : [oO];
fragment P : [pP];
fragment Q : [qQ];
fragment R : [rR];
fragment S : [sS];
fragment T : [tT];
fragment U : [uU];
fragment V : [vV];
fragment W : [wW];
fragment X : [xX];
fragment Y : [yY];
fragment Z : [zZ];
writing
| expr K_AND expr
with the input
field1=1 and field2 = 2
results in
line 1:8 mismatched input 'and' expecting {<EOF>, '||', '*', '+', '-', '/', '%', '<', '<=', '>', '>=', '=', '<>', K_AND, K_NOT, K_OR, K_IN}
while
| expr 'and' expr
works like a charm:
$ antlr4 wtfql.g4 && javac -classpath /usr/local/Cellar/antlr/4.4/antlr-4.4-complete.jar wtfql*.java && cat test.txt | grun wtfql start -tree -gui
(start (expr (expr (expr (column_name (any_name feld1))) = (expr (literal_value 1))) and (expr (expr (column_name (any_name feld2))) = (expr (literal_value 2)))) <EOF>)
What am I missing?
I presume "and" is an IDENTIFIER since the rule for IDENTIFIER comes before the rule for AND and thus wins.
If you write 'and' in the parser rule this implicitly creates a token (not AND!) which comes before IDENTIFIER and thus wins.
Rule of thumb: More specific lexer rules first. Don't create new lexer tokens implicitly in parser rules.
If you check the token type, you'll get a clue what's going on.

ANTLR Group expressions and save Variables

If I have expressions like:
(name = Paul AND age = 16) OR country = china;
And I want to get:
QUERY
|
|-------------|
() |
| |
AND OR
| |
|-------| |
name age country
| | |
Paul 16 china
How can I print the () and the condition (AND/OR) before the fields name, age country?
My grammar file is something like this:
parse
: block EOF -> block
;
block
: (statement)* (Return ID ';')?
-> ^(QUERY statement*)
;
statement
: assignment ';'
-> assignment
;
assignment
: expression (condition expression)*
-> ^(condition expression*)
| '(' expression (condition expression)* ')' (condition expression)*
-> ^(Brackets ^(condition expression*))
;
condition
: AND
| OR
;
Brackets: '()' ;
OR : 'OR' ;
AND : 'AND' ;
..
But it only prints the first condition that appears in the expression ('AND' in this example), and I can't group what is between brackets, and what is not...
Your grammar looks odd to me, and there are errors in it: if the parser does not match "()", you can't use Brackets inside a rewrite rule. And why would you ever want to have the token "()" inside your AST?
Given your example input:
(name = Paul AND age = 16) OR country = china;
here's possible way to construct an AST:
grammar T;
options {
output=AST;
}
query
: expr ';' EOF -> expr
;
expr
: logical_expr
;
logical_expr
: equality_expr ( logical_op^ equality_expr )*
;
equality_expr
: atom ( equality_op^ atom )*
;
atom
: ID
| INT
| '(' expr ')' -> expr
;
equality_op
: '='
| 'IS' 'NOT'?
;
logical_op
: 'AND'
| 'OR'
;
ID : ('a'..'z' | 'A'..'Z')+;
INT : '0'..'9'+;
WS : (' ' | '\t' | '\r' | '\n')+ {skip();};
which would result in this:

ANTLR grammar error

I'm trying to built C-- compiler using ANTLR 3.4.
Full set of the grammar listed here,
program : (vardeclaration | fundeclaration)* ;
vardeclaration : INT ID (OPENSQ NUM CLOSESQ)? SEMICOL ;
fundeclaration : typespecifier ID OPENP params CLOSEP compoundstmt ;
typespecifier : INT | VOID ;
params : VOID | paramlist ;
paramlist : param (COMMA param)* ;
param : INT ID (OPENSQ CLOSESQ)? ;
compoundstmt : OPENCUR vardeclaration* statement* CLOSECUR ;
statementlist : statement* ;
statement : expressionstmt | compoundstmt | selectionstmt | iterationstmt | returnstmt;
expressionstmt : (expression)? SEMICOL;
selectionstmt : IF OPENP expression CLOSEP statement (options {greedy=true;}: ELSE statement)?;
iterationstmt : WHILE OPENP expression CLOSEP statement;
returnstmt : RETURN (expression)? SEMICOL;
expression : (var EQUAL expression) | sampleexpression;
var : ID ( OPENSQ expression CLOSESQ )? ;
sampleexpression: addexpr ( ( LOREQ | LESS | GRTR | GOREQ | EQUAL | NTEQL) addexpr)?;
addexpr : mulexpr ( ( PLUS | MINUS ) mulexpr)*;
mulexpr : factor ( ( MULTI | DIV ) factor )*;
factor : ( OPENP expression CLOSEP ) | var | call | NUM;
call : ID OPENP arglist? CLOSEP;
arglist : expression ( COMMA expression)*;
Used lexer rules as following,
ELSE : 'else' ;
IF : 'if' ;
INT : 'int' ;
RETURN : 'return' ;
VOID : 'void' ;
WHILE : 'while' ;
PLUS : '+' ;
MINUS : '-' ;
MULTI : '*' ;
DIV : '/' ;
LESS : '<' ;
LOREQ : '<=' ;
GRTR : '>' ;
GOREQ : '>=' ;
EQUAL : '==' ;
NTEQL : '!=' ;
ASSIGN : '=' ;
SEMICOL : ';' ;
COMMA : ',' ;
OPENP : '(' ;
CLOSEP : ')' ;
OPENSQ : '[' ;
CLOSESQ : ']' ;
OPENCUR : '{' ;
CLOSECUR: '}' ;
SCOMMENT: '/*' ;
ECOMMENT: '*/' ;
ID : ('a'..'z' | 'A'..'Z')+/*(' ')*/ ;
NUM : ('0'..'9')+ ;
WS : (' ' | '\t' | '\n' | '\r')+ {$channel = HIDDEN;};
COMMENT: '/*' .* '*/' {$channel = HIDDEN;};
But I try to save this it give me the error,
error(211): /CMinusMinus/src/CMinusMinus/CMinusMinus.g:33:13: [fatal] rule expression has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
|---> expression : (var EQUAL expression) | sampleexpression;
1 error
How can I resolve this problem?
As already mentioned: your grammar rule expression is ambiguous: both alternatives in that rule start, or can be, a var.
You need to "help" your parser a bit. If the parse can see a var followed by an EQUAL, it should choose alternative 1, else alternative 2. This can be done by using a syntactic predicate (the (var EQUAL)=> part in the rule below).
expression
: (var EQUAL)=> var EQUAL expression
| sampleexpression
;
More about predicates in this Q&A: What is a 'semantic predicate' in ANTLR?
The problem is this:
expression : (var EQUAL expression) | sampleexpression;
where you either start with var or sampleexpression. But sampleexpression can be reduced to var as well by doing sampleexpression->addExpr->MultExpr->Factor->var
So there is no way to find a k-length predicate for the compiler.
You can as suggested by the error message set backtrack=true to see whether this solves your problem, but it might lead not to the AST - parsetrees you would expect and might also be slow on special input conditions.
You could also try to refactor your grammar to avoid such recursions.