ANTLR - parameters - antlr

I have started to learn ANTLR and i was looking at the example grammars.
I found the CSV grammar quite interesting:
grammar CSV;
file: hdr row+ ;
hdr : row ;
row : field (',' field)* '\r'? '\n' ;
field
: TEXT
| STRING
|
;
TEXT : ~[,\n\r"]+ ;
STRING : '"' ('""'|~'"')* '"' ;
Is it possible to tell antlr that every row should have the same length as the hdr?
In my application i would need such a mechanism. To put it simply Id like to read in a INT variable contaniing the length N of an array and after it N values. Somthing like this:
// example-Input I want to parse
A = 3
a_0 = 2
a_1 = 5 //there have to be exactly A = 3 a_ entries
a_2 = 3
Is this possible with antlr? And if not directly in the grammar, can it be dione afterwards somehow?

Related

Array support for Hplsql.g4 or Hive.g4

Good day everyone,
I am using antlr4 to create a parser and lexer for Hive SQL (Hplsql.g4).
I believe this is the latest grammar file.
https://github.com/AngersZhuuuu/Spark-Hive/blob/master/hplsql/src/main/antlr4/org/apache/hive/hplsql/Hplsql.g4
However, I found at least two additions that are needed: IF and array indices.
For example, in a select statement, I may have:
a) SELECT if(a>8,12,20) FROM x
b) SELECT column_name[2] FROM x
Both are valid in Hive but both do not parse when I create a parser and lexer for java from the Hplsql.g4 above. I added an expression for the IF and it appears to work.
I added
expr :
...
| expr_if //I added
and a new rule:
expr_if :
T_IF T_OPEN_P bool_expr T_COMMA expr T_COMMA expr T_CLOSE_P //I added
;
However, figuring out how to allow an array index is not so easy because the grammar allows aliases:
select a from x
select a alias_of_a from x
select a[1] from x
select a[1] alias_of_a from x
should all be valid.
I tried adding a new expression for this like so:
expr :
...
| expr_array //I added
expr_array :
T_OPEN_SB L_INT T_OPEN_CB //I added
;
This didn't work for me. (T_OPEN_SB L_INT T_OPEN_CB are [ integer ] respectively). I tried so many variations on this as well. My questions are:
Am I using the right grammar file - if not is there a newer one with IF and array handling?
Has anyone been successful in extending this grammar to handle my cases above?
As per Bart's recommendations:
I updated ident.
I updated expr_atom.
I added array_index.
I had // | '[' .*? ']' commented out before.
Test Sql: select a[0] from t
Result:
line 1:8 no viable alternative at input 'selecta[0]'
line 1:8 mismatched input '[0]'
Tree
(program (block stmt (stmt select) (stmt (expr_stmt (expr (expr_atom (ident a)))))) [0] from t)
I feel like the problem is somehow related to select_list_alias below.
With select_list_alias containing ident and T_AS optional, ident is matching the array index.
I can't reconcile why this happens, especially since ident has been updated.
Excerpt from Hplsql.sql:
select_list :
select_list_set? select_list_limit? select_list_item (T_COMMA select_list_item)*
;
select_list_item :
(ident T_EQUAL)? expr select_list_alias?
| select_list_asterisk
;
select_list_alias :
{!_input.LT(1).getText().equalsIgnoreCase("INTO") && !_input.LT(1).getText().equalsIgnoreCase("FROM")}? T_AS? ident
| T_OPEN_P T_TITLE L_S_STRING T_CLOSE_P
;
If I pass in a simple SQL stmt to grun such as
select a[1] from t
The parse tree should look similar to this:
Instead of expr_atom, I want to see expr_array where it would split into expr_atom for the a and array_index for the [1].
Note that there is one SQL statement here. With my existing g4, the array index [1] (and the remainder of the stmt) gets parsed as a separate SQL statement.
Bart, I see from your parse tree that parsing resulted in two SQL statements from "select a[0] from t" - I was getting the same situation.
I will continue to explore different approaches - I am still suspicious of the select_list_alias which has T_AS? ident at the end. Just to confirm, I have commented out one line from ident_part like this: // | '[' .*? ']'
As mentioned in the comments: [ ... ] will be tokenised as a L_ID token. If you don;t want that, remove the | '[' .*? ']' part:
fragment
L_ID_PART :
[a-zA-Z] ([a-zA-Z] | L_DIGIT | '_')* // Identifier part
| ('_' | '#' | ':' | '#' | '$') ([a-zA-Z] | L_DIGIT | '_' | '#' | ':' | '#' | '$')+ // (at least one char must follow special char)
| '"' .*? '"' // Quoted identifiers
// | '[' .*? ']' <-- removed
| '`' .*? '`'
;
and create/edit the grammar like this:
expr_atom :
date_literal
| timestamp_literal
| bool_literal
| expr_array // <-- added
| ident
| string
| dec_number
| int_number
| null_const
;
// new rule
expr_array
: ident array_index+
;
// new rule
array_index
: T_OPEN_SB expr T_CLOSE_SB
;
The rules above will cause select a[1] alias_of_a from x to be parsed successfully, but wil fail on input like select a[1] alias_of_a from [identifier]: the [identifier] will not be matched as an identifier.
You could try adding something like this:
ident :
L_ID
| T_OPEN_SB ~T_CLOSE_SB+ T_CLOSE_SB // <-- added
| non_reserved_words
;
which will parse select a[1] alias_of_a from [identifier] properly, but have no good picture of the whole grammar (or deep knowledge of HPL/SQL) to determine if that will mess up other things :)
EDIT
With my proposed changes, the grammar looks like this: https://gist.github.com/bkiers/4aedd6074726cbcd5d87ede00000cd0d (I cannot post it here on SO because of the char limit)
Parsing select a[0] from t with this will result in the parse tree:
And parsing select a[0] from [t] with this will result in this parse tree:
You're also able to test it by running the following Java code:
String source = "select a[0] from [t]";
HplsqlLexer lexer = new HplsqlLexer(CharStreams.fromString(source));
HplsqlParser parser = new HplsqlParser(new CommonTokenStream(lexer));
ParseTree root = parser.program();
JFrame frame = new JFrame("Antlr AST");
JPanel panel = new JPanel();
TreeViewer viewer = new TreeViewer(Arrays.asList(parser.getRuleNames()), root);
viewer.setScale(1.5);
panel.add(viewer);
frame.add(panel);
frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
frame.pack();
frame.setVisible(true);

ANTLR rules to match unquoted or quoted multiline string

I would like my grammar to be able to match either a single line string assignment terminated by a newline (\r\n or \n), possibly with a comment at the end, or a multiline assignment, denoted by double quotes. So for example:
key = value
key = spaces are allowed
key = until a new line or a comment # this is a comment
key = "you can use quotes as well" # this is a comment
key = "and
with quotes
you can also do
multiline"
Is that doable? I've been bashing my head on this, and got everything working except the multiline. It seems so simple, but the rules simply won't match appropriately.
Add on: this is just a part of a bigger grammar.
Looking at your example input:
# This is the most simple configuration
title = "FML rulez"
# We use ISO notations only, so no local styles
releaseDateTime = 2020-09-12T06:34
# Multiline strings
description = "So,
I'm curious
where this will end."
# Shorcut string; no quotes are needed in a simple property style assignment
# Or if a string is just one word. These strings are trimmed.
protocol = http
# Conditions allow for overriding, best match wins (most conditions)
# If multiple condition sets equally match, the first one will win.
title[env=production] = "One config file to rule them all"
title[env=production & os=osx] = "Even on Mac"
# Lists
hosts = [alpha, beta]
# Hierarchy is implemented using groups denoted by curly brackets
database {
# indenting is allowed and encouraged, but has no semantic meaning
url = jdbc://...
user = "admin"
# Strings support default encryption with a external key file, like maven
password = "FGFGGHDRG#$BRTHT%G%GFGHFH%twercgfg"
# groups can nest
dialect {
database = postgres
}
}
servers {
# This is a table:
# - the first row is a header, containing the id's
# - the remaining rows are values
| name | datacenter | maxSessions | settings |
| alpha | A | 12 | |
| beta | XYZ | 24 | |
| "sys 2" | B | 6 | |
# you can have sub blocks, which are id-less groups (id is the column)
| gamma | C | 12 | {breaker:true, timeout: 15} |
# or you reference to another block
| tango | D | 24 | $environment |
}
# environments can be easily done using conditions
environment[env=development] {
datasource = tst
}
environment[env=production] {
datesource = prd
}
I'd go for something like this:
grammar TECL;
input_file
: configs EOF
;
configs
: NL* ( config ( NL+ config )* NL* )?
;
config
: property
| group
| table
;
property
: WORD conditions? ASSIGN value
;
group
: WORD conditions? NL* OBRACE configs CBRACE
;
conditions
: OBRACK property ( AMP property )* CBRACK
;
table
: row ( NL+ row )*
;
row
: PIPE ( col_value PIPE )+
;
col_value
: ~( PIPE | NL )*
;
value
: WORD
| VARIABLE
| string
| list
;
string
: STRING
| WORD+
;
list
: OBRACK ( value ( COMMA value )* )? CBRACK
;
ASSIGN : '=';
OBRACK : '[';
CBRACK : ']';
OBRACE : '{';
CBRACE : '}';
COMMA : ',';
PIPE : '|';
AMP : '&';
VARIABLE
: '$' WORD
;
NL
: [\r\n]+
;
STRING
: '"' ( ~[\\"] | '\\' . )* '"'
;
WORD
: ~[ \t\r\n[\]{}=,|&]+
;
COMMENT
: '#' ~[\r\n]* -> skip
;
SPACES
: [ \t]+ -> skip
;
which will parse the example in the following parse tree:
And the input:
key = value
key = spaces are allowed
key = until a new line or a comment # this is a comment
key = "you can use quotes as well" # this is a comment
key = "and
with quotes
you can also do
multiline"
into the following:
For now: multiline quoted works, spaces in unquoted string not.
As you can see in the tree above, it does work. I suspect you used part of the grammar in your existing one and that doesn't work.
[...] and am I the process inserting the actions.
I would not embed actions (target code) inside your grammar: it makes it hard to read, and making changes to the grammar will be harder to do. And of course, your grammar will only work for 1 language. Better use a listener or visitor instead of these actions.
Good luck!

ANTLR: "for" keyword used for loops conflicts with "for" used in messages

I have the following grammar:
myg : line+ EOF ;
line : ( for_loop | command params ) NEWLINE;
for_loop : FOR WORD INT DO NEWLINE stmt_body;
stmt_body: line+ END;
params : ( param | WHITESPACE)*;
param : WORD | INT;
command : WORD;
fragment LOWERCASE : [a-z] ;
fragment UPPERCASE : [A-Z] ;
fragment DIGIT : [0-9] ;
WORD : (LOWERCASE | UPPERCASE | DIGIT | [_."'/\\-])+ (DIGIT)* ;
INT : DIGIT+ ;
WHITESPACE : (' ' | '\t')+ -> skip;
NEWLINE : ('\r'? '\n' | '\r')+ -> skip;
FOR: 'for';
DO: 'do';
END: 'end';
My problem is that the 2 following are valid in this language:
message please wait for 90 seconds
This would be a valid command printing a message with the word "for".
for n 2 do
This would be the beginning of a for loop.
The problem is that with the current lexer it doesn't match the for loop since 'for' is matched by the WORD rule as it appears first.
I could solve that by putting the FOR rule before the WORD rule but then 'for' in message would be matched by the FOR rule
This is the typical keywords versus identifier problem and I thought there were quite a number of questions regarding that here on Stackoverflow. But to my surprise I can only find an old answer of mine for ANTLR3.
Even though the principle mentioned there remains the same, you no longer can change the returned token type in a parser rule, with ANTLR4.
There are 2 steps required to make your scenario work.
Define the keywords before the WORD rule. This way they get own token types you need for grammar parts which require specific keywords.
Add keywords selectively to rules, which parse names, where you want to allow those keywords too.
For the second step modify your rules:
param: WORD | INT | commandKeyword;
command: WORD | commandKeyword;
commandKeyword: FOR | DO | END; // Keywords allowed as names in commands.

What's wrong with this ANTLR grammar?

I want to parse query expressions that look like this:
Person Name=%John%
(Person Name=John% and Address=%Ontario%)
Person Fullname_3="John C. Smith"
But I'm totally new to Antlr4 and can't even figure out how to parse one single TABLE FIELD=QUERY clause. When I run the grammar below in Go as target, I get
line 1:7 mismatched input 'Name' expecting {'not', '(', FIELDNAME}
for a simple query like
Person Name=John
Why can't the Grammar parse FIELDNAME via parsing fieldsearch->field EQ searchterm->FIELDNAME?
I guess I'm misunderstanding something very fundamental here about how Antlr Grammars work, but what?
/* ANTLR Grammar for Minidb Query Language */
grammar Mdb;
start : searchclause EOF ;
searchclause
: table expr
;
expr
: fieldsearch
| unop fieldsearch
| LPAREN expr relop expr RPAREN
;
unop
: NOT
;
relop
: AND
| OR
;
fieldsearch
: field EQ searchterm
;
field
: FIELDNAME
;
table
: TABLENAME
;
searchterm
: STRING
;
AND
: 'and'
;
OR
: 'or'
;
NOT
: 'not'
;
EQ
: '='
;
LPAREN
: '('
;
RPAREN
: ')'
;
fragment VALID_ID_START
: ('a' .. 'z') | ('A' .. 'Z') | '_'
;
fragment VALID_ID_CHAR
: VALID_ID_START | ('0' .. '9')
;
TABLENAME
: VALID_ID_START VALID_ID_CHAR*
;
FIELDNAME
: VALID_ID_START VALID_ID_CHAR*
;
STRING: '"' ~('\n'|'"')* ('"' | { panic("syntax-error - unterminated string literal") } ) ;
WS
: [ \r\n\t] + -> skip
;
Try looking at the tokens produced for that input using grun Mdb tokens -tokens. It will tell you that the input consists of two table names, an equals sign and then another table name. To match your grammar it would have needed to be a table name, a field name, an equals sign and a string.
The first problem is that TABLENAME and FIELDNAME have the exact same definition. In cases where two lexer rules would produce a match of the same length on the current input, ANTLR prefers the one that comes first in the grammar. So it will never produce a FIELDNAME token. To fix that just replace both of those rules with a single ID rule. If you want to, you can then introduce parser rules tableName : ID ; and fieldName : ID ; if you want to keep the names.
The other problem is more straight forward: John simply does not match your rules for a string since it's not in quotes. If you do want to allow John as a valid search term, you might want to define it as searchterm : STRING | ID ; instead of only allowing STRINGs.

Grammar for ANLTR 4

I'm trying to develop a grammar to parse a DSL using ANTLR4 (first attempt at using it)
The grammar itself is somewhat similar to SQL in the sense that should
It should be able to parse commands like the following:
select type1.attribute1 type2./xpath_expression[#id='test 1'] type3.* from source1 source2
fromDate 2014-01-12T00:00:00.123456+00:00 toDate 2014-01-13T00:00:00.123456Z
where (type1.attribute2 = "XX" AND
(type1.attribute3 <= "2014-01-12T00:00:00.123456+00:00" OR
type2./another_xpath_expression = "YY"))
EDIT: I've updated the grammar switching CHAR, SYMBOL and DIGIT to fragment as suggested by [lucas_trzesniewski], but I did not manage to get improvements.
Attached is the parse tree as suggested by Terence. I get also in the console the following (I'm getting more confused...):
warning(125): API.g4:16:8: implicit definition of token 'CHAR' in parser
warning(125): API.g4:20:31: implicit definition of token 'SYMBOL' in parser
line 1:12 mismatched input 'p' expecting {'.', NUMBER, CHAR, SYMBOL}
line 1:19 mismatched input 't' expecting {'.', NUMBER, CHAR, SYMBOL}
line 1:27 mismatched input 'm' expecting {'.', NUMBER, CHAR, SYMBOL}
line 1:35 mismatched input '#' expecting {NUMBER, CHAR, SYMBOL}
line 1:58 no viable alternative at input 'm'
line 3:13 no viable alternative at input '(deco.m'
I was able to put together the bulk of the grammar, but it fails to properly match all the tokens, therefore resulting in incorrect parsing depending on the complexity of the input.
By browsing on internet it seems to me that the main reason is down to the lexer selecting the longest matching sequence, but even after several attempts of rewriting lexer and grammar rules I could not achieve a robust set.
Below are my grammar and some test cases.
What would be the correct way to specify the rules? should I use lexer modes ?
GRAMMAR
grammar API;
get : K_SELECT (((element) )+ | '*')
'from' (source )+
( K_FROM_DATE dateTimeOffset )? ( K_TO_DATE dateTimeOffset )?
('where' expr )?
EOF
;
element : qualifier DOT attribute;
qualifier : 'raw' | 'std' | 'deco' ;
attribute : ( word | xpath | '*') ;
word : CHAR (CHAR | NUMBER)*;
xpath : (xpathFragment+);
xpathFragment
: '/' ( DOT | CHAR | NUMBER | SYMBOL )+
| '[' (CHAR | NUMBER | SYMBOL )+ ']'
;
source : ( 'system1' | 'system2' | 'ALL') ; // should be generalised.
date : (NUMBER MINUS NUMBER MINUS NUMBER) ;
time : (NUMBER COLON NUMBER (COLON NUMBER ( DOT NUMBER )?)? ( 'Z' | SIGN (NUMBER COLON NUMBER )));
dateTimeOffset : date 'T' time;
filter : (element OP value) ;
value : QUOTE .+? QUOTE ;
expr
: filter
| '(' expr 'AND' expr ')'
| '(' expr 'OR' expr ')'
;
K_SELECT : 'select';
K_RANGE : 'range';
K_FROM_DATE : 'fromDate';
K_TO_DATE : 'toDate' ;
QUOTE : '"' ;
MINUS : '-';
SIGN : '+' | '-';
COLON : ':';
COMMA : ',';
DOT : '.';
OP : '=' | '<' | '<=' | '>' | '>=' | '!=';
NUMBER : DIGIT+;
fragment DIGIT : ('0'..'9');
fragment CHAR : [a-z] | [A-Z] ;
fragment SYMBOL : '#' | [-_=] | '\'' | '/' | '\\' ;
WS : [ \t\r\n]+ -> skip ;
NONWS : ~[ \t\r\n];
TEST 1
select raw./priobj/tradeid/margin[#id='222'] deco.* deco.marginType from system1 system2
fromDate 2014-01-12T00:00:00.123456+00:00 toDate 2014-01-13T00:00:00.123456Z
where ( deco.marginType >= "MV" AND ( ( raw.CretSysInst = "RMS_EXODUS" OR deco.ExtSysNum <= "1234" ) OR deco.ExtSysStr = "TEST Spaced" ) )
TEST 2
select * from ALL
TEST 3
select deco./xpath/expr/text() deco./xpath/expr[a='3' and b gt '6] raw.* from ALL where raw.attr3 = "myvalue"
The image shows that my grammar is unable to recognise several parts of the commands
What is a bit puzzling me is that the single parts are instead working properly,
e.g. parsing only the 'expr' as shown by the tree below
That kind of thing: word : (CHAR (CHAR | NUMBER)+); is indeed a job for the lexer, not the parser.
This: DIGIT : ('0'..'9'); should be a fragment. Same goes for this: CHAR : [a-z] | [A-Z] ;. That way, you could write NUMBER : CHAR+;, and WORD: CHAR (CHAR | NUMBER)*;
The reason is simple: you want to deal with meaningful tokens in your parser, not with parts of words. Think of the lexer as the thing that will "cut" the input text at meaningful points. Later on, you want to process full words, not individual characters. So think about where is it most meaningful to make those cuts.
Now, as the ANTLR master has pointed out, to debug your problem, dump the parse tree and see what goes on.