I write some Antlr4 rules to parse SQL,just want to distinguish between fields and tables.But they did something unexpected.My rule:
grammar Col;
stat : SELECT select_list FROM table_ref_list;
select_list : select_ele (',' select_ele)* ;
select_ele : Subquery_in_field_nor //subquery select column
| ID '(' .*? ')' //function calls
| NoDotId '(' .*? ')' NoDotId '(' .*? ')' //window function call
| ID //like column of tab
| DIGIT //number
| STRING //like this 'dad'
;
table_ref_list
: table_ref (',' table_ref)*
;
table_ref:table_block (NoDotId)?;
table_block : ID //So much like select_ele
| Subquery_in_field_nor
;
Subquery_in_field_nor : '(' (Subquery_in_field | ~[()])* ')'; //Resolving function nesting
Subquery_in_field : '(' .*? ')' ;
SELECT : [Ss][Ee][Ll][Ee][Cc][Tt];
FROM :[Ff][Rr][Oo][Mm];
NL : [ \r\n]+ ->skip;
ID : [A-Za-z] [A-Za-z0-9.]*;
NoDotId : [A-Za-z] [A-Za-z0-9]*;
DIGIT : ('-')? [0-9]+('.' [0-9]+)?;
STRING : '\'' .*? '\'';
And my sql file like this
SELECT substr(A.EMPNO,1,2),
A.ENAME,
'1',
'wwet',
18,
A.DEPTNO FROM EMP A
Display message
line 1:13 missing FROM at '(A.EMPNO,1,2)'
line 3:7 mismatched input ''1'' expecting {Subquery_in_field_nor, ID}
line 4:7 mismatched input ''wwet'' expecting {Subquery_in_field_nor, ID}
line 5:7 mismatched input '18' expecting {Subquery_in_field_nor, ID}
(stat SELECT (select_list (select_ele substr)) <missing FROM> (table_ref_list (table_ref (table_block (A.EMPNO,1,2))) , (table_ref (table_block A.ENAME)) , (table_ref (table_block '1')) , (table_ref (table_block 'wwet')) , (table_ref (table_block 18)) , (table_ref (table_block A.DEPTNO))))
I don't know why ?Why i user select_list : select_ele (',' select_ele)* ,but still Match table_ref?
substr(A.EMPNO,1,2) is not matched as a select_ele because it produces the tokens:
ID: substr
Subquery_in_field_nor: (A.EMPNO,1,2)
and not by any of the two alternatives:
| ID '(' .*? ')' //function calls
| NoDotId '(' .*? ')' NoDotId '(' .*? ')' //window function call
The part '(' .*? ')' is interpreted as: match a ( token, reluctantly followed by zero or more other tokens, ending with a ) token. But because ANTLR's lexer tries to grab as much characters as possible when tokenising the input, (A.EMPNO,1,2) will always be matched by a single Subquery_in_field_nor token.
Besides the greedy tokenisation in the lexer, you must also be aware that when 2 or more lexer rules can match the same characters, the one defined first "wins". So the input select can be matched by SELECT, ID and NoDotId. But since SELECT is defined first, that input will always become a SELECT token. However, the input foo can be matched by ID and NoDotId and since ID is defined first, it will win. You'll notice that there will never be a NoDotId token at all since whatever it matches is also matched by ID.
Here is a slightly modified grammar that works for your input:
grammar Col;
stat
: SELECT select_list FROM table_ref_list
;
select_list
: select_ele (',' select_ele)*
;
select_ele
: id Subquery_in_field_nor (id Subquery_in_field_nor)?
| id
| DIGIT
| STRING
;
table_ref_list
: table_ref (',' table_ref)*
;
table_ref
: table_block id?
;
table_block
: id
| Subquery_in_field_nor
;
id
: ID ('.' ID)*
;
Subquery_in_field_nor : '(' (Subquery_in_field | ~[()])* ')'; //Resolving function nesting
Subquery_in_field : '(' .*? ')' ;
SELECT : [Ss][Ee][Ll][Ee][Cc][Tt];
FROM : [Ff][Rr][Oo][Mm];
NL : [ \r\n]+ ->skip;
ID : [A-Za-z] [A-Za-z0-9]*;
DIGIT : ('-')? [0-9]+('.' [0-9]+)?;
STRING : '\'' .*? '\'';
Or better, grab an existing SQL grammar from the ANTLR4 Github repo: https://github.com/antlr/grammars-v4 Be aware that all of these grammars are opensource contributions from the community: test them properly, because many of them will have limitations (both in accuracy and in performance).
Related
Orignial question:
My code to parse:
N100G1M4
What I expcted: N100 G1 M4
But ANTLR can not idetify this because ANTLR always match longest substring?
How to handle the case?
Update
What I am going to do:
I am trying to parse CNC G-Code txt and get keywords from a file stream, which is usually used to control a machine and drive motors to move.
The G-Code rule is :
// Define a grammar called Hello
grammar GCode;
script : blocks+ EOF;
blocks:
assign_stat
| ncblock
| NEWLINE
;
ncblock :
ncelements NEWLINE //
;
ncelements :
ncelement+
;
ncelement
:
LINENUMEXPR // linenumber N100
| GCODEEXPR // G10 G54.1
| MCODEEXPR // M30
| coordexpr // X100 Y100 Z[A+b*c]
| FeedExpr // F10.12
| AccExpr // E2.0
// | callSubroutine
;
assign_stat:
VARNAME '=' expression NEWLINE
;
expression:
multiplyingExpression ('+' | '-') multiplyingExpression
;
multiplyingExpression
: powExpression (('*' | '/') powExpression)*
;
powExpression
: signedAtom ('^' signedAtom)*
;
signedAtom
: '+' signedAtom
| '-' signedAtom
| atom
;
atom
: scientific
| variable
| '(' expression ')'
;
LINENUMEXPR: 'N' Digit+ ;
GCODEEXPR : 'G' GPOSTFIX;
MCODEEXPR : 'M' INT;
coordexpr:
CoordExpr
| ParameterKeyword getValueExpr
;
getValueExpr:
'[' expression ']'
;
CoordExpr
:
ParameterKeyword SCIENTIFIC_NUMBER
;
ParameterKeyword: [XYZABCUVWIJKR];
FeedExpr: 'F' SCIENTIFIC_NUMBER;
AccExpr: 'E' SCIENTIFIC_NUMBER;
fragment
GPOSTFIX
: Digit+ ('.' Digit+)*
;
variable
: VARNAME
;
scientific
: SCIENTIFIC_NUMBER
;
SCIENTIFIC_NUMBER
: SIGN? NUMBER (('E' | 'e') SIGN? NUMBER)?
;
fragment NUMBER
: ('0' .. '9') + ('.' ('0' .. '9') +)?
;
HEX_INTEGER
: '0' [xX] HEX_DIGIT+
;
fragment HEX_DIGIT
: [0-9a-fA-F]
;
INT : Digit+;
fragment
Digit : [0-9];
fragment
SIGN
: ('+' | '-')
;
VARNAME
: [a-zA-Z_][a-zA-Z_0-9]*
;
NEWLINE
: '\r'? '\n'
;
WS : [ \t]+ -> skip ; // skip spaces, tabs, newlines
Sample program(it works well except the last line):
N200 G54.1
a = 100
b = 10
c = a + b
Z[a + b*c]
N002 G2 X30.1 Y20.1 I20.1 J0.1 K0.2 R20
N100 G1X100.5Z[VAR1+100]M3H3 // it works well except the last line
I want to parse N100G1X100.5YE5Z[VAR1+100]M3H3 to
-> N100 G1 X100 Z[VAR1+100]
-> or it will be better to split the node X100 to two subnode X 100:
I am trying to use ANTLR, but ANTLR always take the rule "longest match wins". N100G1X100 is identified to a word.
Append question:
What's the best tool to finish the task?
ANTLR has a strict separation between pasrer and lexer, and therefor the lexer operates in a predictable way (longest match wins). So if you have some sort of identifier rule that matches N100G1M4 but sometimes want to match N100, G1 and M4 separately, you're out of luck.
How to handle the case?
The only answer one can give (with the amount of details given) is: remove the rule that matches N100G1M4 as 1 token. If that is something you cannot do, then don't use ANTLR, but use a "scannerless" parser.
Scannerless Parser Generators
I have the following ANTLR4 Grammar
grammar ExpressionGrammar;
parse: (expr)
;
expr: MIN expr
| expr ( MUL | DIV ) expr
| expr ( ADD | MIN ) expr
| NUM
| function
| '(' expr ')'
;
function : ID '(' arguments? ')';
arguments: expr ( ',' expr)*;
/* Tokens */
MUL : '*';
DIV : '/';
MIN : '-';
ADD : '+';
OPEN_PAR : '(' ;
CLOSE_PAR : ')' ;
NUM : '0' | [1-9][0-9]*;
ID : [a-zA-Z_] [a-zA-Z]*;
COMMENT: '//' ~[\r\n]* -> skip;
WS: [ \t\n]+ -> skip;
I have an input expression like this :-
(Fields.V1)*(Fields.V2) + (Constants.Value1)*(Constants.Value2)
The ANTLR parser generated the following text from the grammar above :-
(FieldsV1)*(FieldsV2)+(Constants<missing ')'>
As you can see, the "dots" in Fields.V1 and Fields.V2 are missing from the text and also there is a <missing ')' Error node. I believe I should somehow make ANTLR understand that an expression can also have fields with dot operators.
A question on top of this :-
(Var1)(Var2)
ANTLR is not throwing me error for this above scenario , the expressions should not be (Var1)(Var2) -- It should always have the operator (var1)*(var2) or (var1)+(var2) etc. The parser error tree is not generating this error. How should the grammar be modified to make sure even this scenario is taken into consideration.
To recognize IDs like Fields.V1, change you Lexer rule for ID to something like this:
fragment ID_NODE: [a-zA-Z_][a-zA-Z0-9]*;
ID: ID_NODE ('.' ID_NODE)*;
Notice, since each "node" of the ID follows the same rule, I made it a lexer fragment that I could use to compose the ID rule. I also added 0-9 to the second part of the fragment, since it appears that you want to allow numbers in IDs
Then the ID rule uses the fragment to build out the Lexer rule that allows for dots in the ID.
You also didn't add ID as a valid expr alternative
To handle detection of the error condition in (Var1)(Var2), you need Mike's advice to add the EOF Lexer rule to the end of the parse parser rule. Without the EOF, ANTLR will stop parsing as soon as it reaches the end of a recognized expr ((Var1)). The EOF says "and then you need to find an EOF", so ANTLR will continue parsing into the (Var2) and give you the error.
A revised version that handles both of your examples:
grammar ExpressionGrammar;
parse: expr EOF;
expr:
MIN expr
| expr ( MUL | DIV) expr
| expr ( ADD | MIN) expr
| NUM
| ID
| function
| '(' expr ')';
function: ID '(' arguments? ')';
arguments: expr ( ',' expr)*;
/* Tokens */
MUL: '*';
DIV: '/';
MIN: '-';
ADD: '+';
OPEN_PAR: '(';
CLOSE_PAR: ')';
NUM: '0' | [1-9][0-9]*;
fragment ID_NODE: [a-zA-Z_][a-zA-Z0-9]*;
ID: ID_NODE ('.' ID_NODE)*;
COMMENT: '//' ~[\r\n]* -> skip;
WS: [ \t\n]+ -> skip;
(Now that I've read through the comments, this is pretty much just applying the suggestions in the comments)
I want to parse query expressions that look like this:
Person Name=%John%
(Person Name=John% and Address=%Ontario%)
Person Fullname_3="John C. Smith"
But I'm totally new to Antlr4 and can't even figure out how to parse one single TABLE FIELD=QUERY clause. When I run the grammar below in Go as target, I get
line 1:7 mismatched input 'Name' expecting {'not', '(', FIELDNAME}
for a simple query like
Person Name=John
Why can't the Grammar parse FIELDNAME via parsing fieldsearch->field EQ searchterm->FIELDNAME?
I guess I'm misunderstanding something very fundamental here about how Antlr Grammars work, but what?
/* ANTLR Grammar for Minidb Query Language */
grammar Mdb;
start : searchclause EOF ;
searchclause
: table expr
;
expr
: fieldsearch
| unop fieldsearch
| LPAREN expr relop expr RPAREN
;
unop
: NOT
;
relop
: AND
| OR
;
fieldsearch
: field EQ searchterm
;
field
: FIELDNAME
;
table
: TABLENAME
;
searchterm
: STRING
;
AND
: 'and'
;
OR
: 'or'
;
NOT
: 'not'
;
EQ
: '='
;
LPAREN
: '('
;
RPAREN
: ')'
;
fragment VALID_ID_START
: ('a' .. 'z') | ('A' .. 'Z') | '_'
;
fragment VALID_ID_CHAR
: VALID_ID_START | ('0' .. '9')
;
TABLENAME
: VALID_ID_START VALID_ID_CHAR*
;
FIELDNAME
: VALID_ID_START VALID_ID_CHAR*
;
STRING: '"' ~('\n'|'"')* ('"' | { panic("syntax-error - unterminated string literal") } ) ;
WS
: [ \r\n\t] + -> skip
;
Try looking at the tokens produced for that input using grun Mdb tokens -tokens. It will tell you that the input consists of two table names, an equals sign and then another table name. To match your grammar it would have needed to be a table name, a field name, an equals sign and a string.
The first problem is that TABLENAME and FIELDNAME have the exact same definition. In cases where two lexer rules would produce a match of the same length on the current input, ANTLR prefers the one that comes first in the grammar. So it will never produce a FIELDNAME token. To fix that just replace both of those rules with a single ID rule. If you want to, you can then introduce parser rules tableName : ID ; and fieldName : ID ; if you want to keep the names.
The other problem is more straight forward: John simply does not match your rules for a string since it's not in quotes. If you do want to allow John as a valid search term, you might want to define it as searchterm : STRING | ID ; instead of only allowing STRINGs.
I am translating a grammar from LALR to ANTLR and I am having trouble with translating this one rule, piecewise expression.
Attached is the sample grammar:
grammar Test;
options {
language = Java;
output = AST;
}
parse : expression ';'
;
expression : binaryExpression
| piecesExpression
;
binaryExpression : addingExpression (('=='|'!='|'<='|'>='|'>'|'<') addingExpression)*
;
addingExpression : multiplyingExpression (('+'|'-') multiplyingExpression)*
;
multiplyingExpression : unaryExpression
(('*'|'/') unaryExpression)*
;
unaryExpression: ('!'|'-')* primitiveElement;
primitiveElement : literalExpression
| id
| '(' expression ')'
;
literalExpression : INT
;
id : IDENTIFIER
;
piecesExpression : 'piecewise' '{' piece expression '}' ('(' expression ',' expression ')')? expression?
;
piece : expression '->' expression ';' (expression '->' expression ';')*
;
// L E X I C A L R U L E S
INT : DIGITS ;
IDENTIFIER : LETTER (LETTER | DIGIT)*;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
fragment LETTER : ('a'..'z' | 'A'..'Z' | '_') ;
fragment DIGITS: DIGIT+;
fragment DIGIT : '0'..'9';
ANTLR v3.5 is complaining about the piecesExpression rule. It has 2 fatal errors and I would rather not use backtrack option.
Expected results:
piecewise {t -> s; t -> x; 100}
piecewise {t -> s; t -> x; 100} (0, x+1)
piecewise {t -> s; t -> x; 100} (0, x+1) y+5
How can piecesExpression be able to capture the above results?
Thanks in advance!
ANTLR has problems determining which alternatives to take in (at least) 2 cases:
piece starts with a expression but inside the piecewise{...}, it should also end with an expression
piecesExpression ends with '(' expression ... but also has an optional trailing expression (and an primitiveElement also matches '(' expression ... in its turn)
There's no need to use global backtracking, but without rewriting many rules, you do need to add some predicates (the (...)=> in the example below) to fix the two issues outlined above.
Try this:
piecesExpression
: 'piecewise' '{' ((expression '->')=> piece)+ expression '}'
( ('(' expression ',')=> '(' expression ',' expression ')' expression?
| expression
)
;
piece
: expression '->' expression ';'
;
I am not sure that the issue is actually the prefixes, but here goes.
I have these two rules in my grammar (among many others)
DOT_T : '.' ;
AND_T : '.AND.' | '.and.' ;
and I need to parse strings like this:
a.eq.b.and.c.ne.d
c.append(b)
this should get lexed as:
ID[a] EQ_T ID[b] AND_T ID[c] NE_T ID[d]
ID[c] DOT_T ID[append] LPAREN_T ID[b] RPAREN_T
the error I get for the second line is:
line 1:3 mismatched character "p"; expecting "n"
It doesn't lex the . as a DOT_T but instead tries to match .and. because it sees the a after ..
Any idea on what I need to do to make this work?
UPDATE
I added the following rule and thought I'd use the same trick
NUMBER_T
: DIGIT+
( (DECIMAL)=> DECIMAL
| (KIND)=> KIND
)?
;
fragment DECIMAL
: '.' DIGIT+ ;
fragment KIND
: '.' DIGIT+ '_' (ALPHA+ | DIGIT+) ;
but when I try parsing this:
lda.eq.3.and.dim.eq.3
it gives me the following error:
line 1:9 no viable alternative at character "a"
while lexing the 3. So I'm guessing the same thing is happening as above, but the solution doesn't work in this case :S Now I'm properly confused...
Yes, that is because of the prefixed '.'-s.
Whenever the lexer stumbles upon ".a", it tries to create a AND_T token. If the characters "nd" can then not be found, the lexer tries to construct another token that starts with a ".a", which isn't present (and ANTLR produces an error). So, the lexer will not give back the character "a" and fall back to create a DOT_T token (and then an ID token)! This is how ANTLR works.
What you can do is optionally match these AND_T, EQ_T, ... inside the DOT_T rule. But still, you will need to "help" the lexer a bit by adding some syntactic predicates that force the lexer to look ahead in the character stream to be sure it can match these tokens.
A demo:
grammar T;
parse
: (t=. {System.out.printf("\%-10s '\%s'\n", tokenNames[$t.type], $t.text);})* EOF
;
DOT_T
: '.' ( (AND_T)=> AND_T {$type=AND_T;}
| (EQ_T)=> EQ_T {$type=EQ_T; }
| (NE_T)=> NE_T {$type=NE_T; }
)?
;
ID
: ('a'..'z' | 'A'..'Z')+
;
LPAREN_T
: '('
;
RPAREN_T
: ')'
;
SPACE
: (' ' | '\t' | '\r' | '\n')+ {skip();}
;
NUMBER_T
: DIGIT+ ((DECIMAL)=> DECIMAL)?
;
fragment DECIMAL : '.' DIGIT+ ;
fragment AND_T : ('AND' | 'and') '.' ;
fragment EQ_T : ('EQ' | 'eq' ) '.' ;
fragment NE_T : ('NE' | 'ne' ) '.' ;
fragment DIGIT : '0'..'9';
And if you feed the generated parser the input:
a.eq.b.and.c.ne.d
c.append(b)
the following output will be printed:
ID 'a'
EQ_T '.eq.'
ID 'b'
AND_T '.and.'
ID 'c'
NE_T '.ne.'
ID 'd'
ID 'c'
DOT_T '.'
ID 'append'
LPAREN_T '('
ID 'b'
RPAREN_T ')'
And for the input:
lda.eq.3.and.dim.eq.3
the following is printed:
ID 'lda'
EQ_T '.eq.'
NUMBER_T '3'
AND_T '.and.'
ID 'dim'
EQ_T '.eq.'
NUMBER_T '3'
EDIT
The fact that DECIMAL and KIND both start with '.' DIGIT+ is not good. Try something like this:
NUMBER_T
: DIGIT+ ((DECIMAL)=> DECIMAL ((KIND)=> KIND)?)?
;
fragment DECIMAL : '.' DIGIT+;
fragment KIND : '_' (ALPHA+ | DIGIT+); // removed ('.' DIGIT+) from this fragment
Note that the rule NUMBER_T will now never produce DECIMAL or KIND tokens. If you want that to happen, you need to change the type:
NUMBER_T
: DIGIT+ ((DECIMAL)=> DECIMAL {/*change type*/} ((KIND)=> KIND {/*change type*/})?)?
;