So I have a simple string that I was hoping to run through a ragel state machine.
key1=value1; key2="value2"; key3=value3
Here is a simplified version of my ragel
# Key Value Parts
name = ( token+ ) %on_name ;
value = ( ascii+ -- (" " | ";" | "," | "\r" | "\n" | "\\" | "\"" ) ) %on_value ;
pair = ( name "=" (value | "\"" value "\"") "; " ) ; ## ISSUE WITH FORCING ;
string = ( pair )+ ;
# MACHINE
main := string >begin_parser #end_parser ;
The issue I'm having is that I'll never have a semicolon after the last key/value pair, so I'd like it to be optional, but when I do that the state machine finds several patches for the value. Is there some sort of syntax where I can say pair has to end with a (";" | *eof*)?
I did change, my main line to this, but it seems like a hack and doesn't really work with some of the other things I'd like to do with this state machine.
main := string >begin_parser #end_parser $/on_value;
I was too wrapped up in Ragel syntax and wasn't think about how I was processing. Instead of trying to add an optional semi-colon onto the end I should have force one on the front after it had already process a key value pair.
pair = ( name "=" (value | "\"" value "\"") ) ;
string = pair ( "; " pair )* ;
Related
Good day everyone,
I am using antlr4 to create a parser and lexer for Hive SQL (Hplsql.g4).
I believe this is the latest grammar file.
https://github.com/AngersZhuuuu/Spark-Hive/blob/master/hplsql/src/main/antlr4/org/apache/hive/hplsql/Hplsql.g4
However, I found at least two additions that are needed: IF and array indices.
For example, in a select statement, I may have:
a) SELECT if(a>8,12,20) FROM x
b) SELECT column_name[2] FROM x
Both are valid in Hive but both do not parse when I create a parser and lexer for java from the Hplsql.g4 above. I added an expression for the IF and it appears to work.
I added
expr :
...
| expr_if //I added
and a new rule:
expr_if :
T_IF T_OPEN_P bool_expr T_COMMA expr T_COMMA expr T_CLOSE_P //I added
;
However, figuring out how to allow an array index is not so easy because the grammar allows aliases:
select a from x
select a alias_of_a from x
select a[1] from x
select a[1] alias_of_a from x
should all be valid.
I tried adding a new expression for this like so:
expr :
...
| expr_array //I added
expr_array :
T_OPEN_SB L_INT T_OPEN_CB //I added
;
This didn't work for me. (T_OPEN_SB L_INT T_OPEN_CB are [ integer ] respectively). I tried so many variations on this as well. My questions are:
Am I using the right grammar file - if not is there a newer one with IF and array handling?
Has anyone been successful in extending this grammar to handle my cases above?
As per Bart's recommendations:
I updated ident.
I updated expr_atom.
I added array_index.
I had // | '[' .*? ']' commented out before.
Test Sql: select a[0] from t
Result:
line 1:8 no viable alternative at input 'selecta[0]'
line 1:8 mismatched input '[0]'
Tree
(program (block stmt (stmt select) (stmt (expr_stmt (expr (expr_atom (ident a)))))) [0] from t)
I feel like the problem is somehow related to select_list_alias below.
With select_list_alias containing ident and T_AS optional, ident is matching the array index.
I can't reconcile why this happens, especially since ident has been updated.
Excerpt from Hplsql.sql:
select_list :
select_list_set? select_list_limit? select_list_item (T_COMMA select_list_item)*
;
select_list_item :
(ident T_EQUAL)? expr select_list_alias?
| select_list_asterisk
;
select_list_alias :
{!_input.LT(1).getText().equalsIgnoreCase("INTO") && !_input.LT(1).getText().equalsIgnoreCase("FROM")}? T_AS? ident
| T_OPEN_P T_TITLE L_S_STRING T_CLOSE_P
;
If I pass in a simple SQL stmt to grun such as
select a[1] from t
The parse tree should look similar to this:
Instead of expr_atom, I want to see expr_array where it would split into expr_atom for the a and array_index for the [1].
Note that there is one SQL statement here. With my existing g4, the array index [1] (and the remainder of the stmt) gets parsed as a separate SQL statement.
Bart, I see from your parse tree that parsing resulted in two SQL statements from "select a[0] from t" - I was getting the same situation.
I will continue to explore different approaches - I am still suspicious of the select_list_alias which has T_AS? ident at the end. Just to confirm, I have commented out one line from ident_part like this: // | '[' .*? ']'
As mentioned in the comments: [ ... ] will be tokenised as a L_ID token. If you don;t want that, remove the | '[' .*? ']' part:
fragment
L_ID_PART :
[a-zA-Z] ([a-zA-Z] | L_DIGIT | '_')* // Identifier part
| ('_' | '#' | ':' | '#' | '$') ([a-zA-Z] | L_DIGIT | '_' | '#' | ':' | '#' | '$')+ // (at least one char must follow special char)
| '"' .*? '"' // Quoted identifiers
// | '[' .*? ']' <-- removed
| '`' .*? '`'
;
and create/edit the grammar like this:
expr_atom :
date_literal
| timestamp_literal
| bool_literal
| expr_array // <-- added
| ident
| string
| dec_number
| int_number
| null_const
;
// new rule
expr_array
: ident array_index+
;
// new rule
array_index
: T_OPEN_SB expr T_CLOSE_SB
;
The rules above will cause select a[1] alias_of_a from x to be parsed successfully, but wil fail on input like select a[1] alias_of_a from [identifier]: the [identifier] will not be matched as an identifier.
You could try adding something like this:
ident :
L_ID
| T_OPEN_SB ~T_CLOSE_SB+ T_CLOSE_SB // <-- added
| non_reserved_words
;
which will parse select a[1] alias_of_a from [identifier] properly, but have no good picture of the whole grammar (or deep knowledge of HPL/SQL) to determine if that will mess up other things :)
EDIT
With my proposed changes, the grammar looks like this: https://gist.github.com/bkiers/4aedd6074726cbcd5d87ede00000cd0d (I cannot post it here on SO because of the char limit)
Parsing select a[0] from t with this will result in the parse tree:
And parsing select a[0] from [t] with this will result in this parse tree:
You're also able to test it by running the following Java code:
String source = "select a[0] from [t]";
HplsqlLexer lexer = new HplsqlLexer(CharStreams.fromString(source));
HplsqlParser parser = new HplsqlParser(new CommonTokenStream(lexer));
ParseTree root = parser.program();
JFrame frame = new JFrame("Antlr AST");
JPanel panel = new JPanel();
TreeViewer viewer = new TreeViewer(Arrays.asList(parser.getRuleNames()), root);
viewer.setScale(1.5);
panel.add(viewer);
frame.add(panel);
frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
frame.pack();
frame.setVisible(true);
I would like my grammar to be able to match either a single line string assignment terminated by a newline (\r\n or \n), possibly with a comment at the end, or a multiline assignment, denoted by double quotes. So for example:
key = value
key = spaces are allowed
key = until a new line or a comment # this is a comment
key = "you can use quotes as well" # this is a comment
key = "and
with quotes
you can also do
multiline"
Is that doable? I've been bashing my head on this, and got everything working except the multiline. It seems so simple, but the rules simply won't match appropriately.
Add on: this is just a part of a bigger grammar.
Looking at your example input:
# This is the most simple configuration
title = "FML rulez"
# We use ISO notations only, so no local styles
releaseDateTime = 2020-09-12T06:34
# Multiline strings
description = "So,
I'm curious
where this will end."
# Shorcut string; no quotes are needed in a simple property style assignment
# Or if a string is just one word. These strings are trimmed.
protocol = http
# Conditions allow for overriding, best match wins (most conditions)
# If multiple condition sets equally match, the first one will win.
title[env=production] = "One config file to rule them all"
title[env=production & os=osx] = "Even on Mac"
# Lists
hosts = [alpha, beta]
# Hierarchy is implemented using groups denoted by curly brackets
database {
# indenting is allowed and encouraged, but has no semantic meaning
url = jdbc://...
user = "admin"
# Strings support default encryption with a external key file, like maven
password = "FGFGGHDRG#$BRTHT%G%GFGHFH%twercgfg"
# groups can nest
dialect {
database = postgres
}
}
servers {
# This is a table:
# - the first row is a header, containing the id's
# - the remaining rows are values
| name | datacenter | maxSessions | settings |
| alpha | A | 12 | |
| beta | XYZ | 24 | |
| "sys 2" | B | 6 | |
# you can have sub blocks, which are id-less groups (id is the column)
| gamma | C | 12 | {breaker:true, timeout: 15} |
# or you reference to another block
| tango | D | 24 | $environment |
}
# environments can be easily done using conditions
environment[env=development] {
datasource = tst
}
environment[env=production] {
datesource = prd
}
I'd go for something like this:
grammar TECL;
input_file
: configs EOF
;
configs
: NL* ( config ( NL+ config )* NL* )?
;
config
: property
| group
| table
;
property
: WORD conditions? ASSIGN value
;
group
: WORD conditions? NL* OBRACE configs CBRACE
;
conditions
: OBRACK property ( AMP property )* CBRACK
;
table
: row ( NL+ row )*
;
row
: PIPE ( col_value PIPE )+
;
col_value
: ~( PIPE | NL )*
;
value
: WORD
| VARIABLE
| string
| list
;
string
: STRING
| WORD+
;
list
: OBRACK ( value ( COMMA value )* )? CBRACK
;
ASSIGN : '=';
OBRACK : '[';
CBRACK : ']';
OBRACE : '{';
CBRACE : '}';
COMMA : ',';
PIPE : '|';
AMP : '&';
VARIABLE
: '$' WORD
;
NL
: [\r\n]+
;
STRING
: '"' ( ~[\\"] | '\\' . )* '"'
;
WORD
: ~[ \t\r\n[\]{}=,|&]+
;
COMMENT
: '#' ~[\r\n]* -> skip
;
SPACES
: [ \t]+ -> skip
;
which will parse the example in the following parse tree:
And the input:
key = value
key = spaces are allowed
key = until a new line or a comment # this is a comment
key = "you can use quotes as well" # this is a comment
key = "and
with quotes
you can also do
multiline"
into the following:
For now: multiline quoted works, spaces in unquoted string not.
As you can see in the tree above, it does work. I suspect you used part of the grammar in your existing one and that doesn't work.
[...] and am I the process inserting the actions.
I would not embed actions (target code) inside your grammar: it makes it hard to read, and making changes to the grammar will be harder to do. And of course, your grammar will only work for 1 language. Better use a listener or visitor instead of these actions.
Good luck!
I have started to learn ANTLR and i was looking at the example grammars.
I found the CSV grammar quite interesting:
grammar CSV;
file: hdr row+ ;
hdr : row ;
row : field (',' field)* '\r'? '\n' ;
field
: TEXT
| STRING
|
;
TEXT : ~[,\n\r"]+ ;
STRING : '"' ('""'|~'"')* '"' ;
Is it possible to tell antlr that every row should have the same length as the hdr?
In my application i would need such a mechanism. To put it simply Id like to read in a INT variable contaniing the length N of an array and after it N values. Somthing like this:
// example-Input I want to parse
A = 3
a_0 = 2
a_1 = 5 //there have to be exactly A = 3 a_ entries
a_2 = 3
Is this possible with antlr? And if not directly in the grammar, can it be dione afterwards somehow?
I'm still trying to parse a simple Javadoc style format using ANTLR. Basically the format looks like this:
/**
* Description
*
* #name someId
*/
My parser grammar is here:
query_doc : BEGIN_QDOC description name NOMANSLAND* END_QDOC;
description : (DESCRIPTION_TEXT | NOMANSLAND)*;
name : OPEN_NAME INNER_WS NAMEID INNER_WS* CLOSE_NAME;
My lexer grammar is here:
BEGIN_QDOC : '/**';
END_QDOC : ('*/');
NOMANSLAND : '\r'? '\n' (' ' | '\t')* '*' (' ' | '\t')*;
DESCRIPTION_TEXT : ~('\n');
OPEN_NAME : '#name' -> mode(NAME);
mode NAME;
INNER_WS : (' ' | '\t')+;
NAMEID : ('a'..'z' | 'A'..'Z' | '0'..'9' | '-' | '_' | '?')+;
CLOSE_NAME : (('\r'? '\n') | '*/') -> mode(DEFAULT_MODE);
This appears to be working okay for the most part except for closing the #name definition in the following case:
/**
* #name someId*/
The above should be perfectly valid. We should not need a new line before ending the comment with '*/'. The issue I am having is that '*/' terminates the name definition successfully, but it consumes the token and only returns to the default mode so I need to have:
/**
* #name someId*/*/
if I actually want it to end the comment. I want it to return to the default mode and then realize that this token should end the comment (i.e. it should match END_QDOC). How can I accomplish this in ANTLR? I tried fixing it so that CLOSE_NAME is the inverse of ID:
CLOSE_NAME : ~('a'..'z' | 'A'..'Z' | '0'..'9' | '-' | '_' | '?');
But ANTLR still consumes the * leaving a unrecognized token error on the remaining '/'. What I would really like to do is have ANTLR exit the mode without consuming the token so that '*/' is the next token when we return to DEFAULT_MODE. Any thoughts?
First of all, rather than use the mode command, you probably want to use -> pushMode(NAME) and -> popMode to go back to the default mode.
For your CLOSE_NAME rule, you could use a predicate instead of a matching literal for handling the end of a comment:
CLOSE_NAME
: ( '\r'? '\n'
| {_input.LA(1) == '*' && _input.LA(2) == '/'}?
)
-> popMode
;
This can produce a zero-length token and wasn't allowed in ANTLR 4.0, but the restriction was removed (changed to a warning) in ANTLR 4.1 since we realized that a zero-length token could be used to trigger a mode change and thus avoid infinite loops.
my real grammar is way more complex but I could strip down my problem. So this is the grammar:
grammar test2;
options {language=CSharp3;}
#parser::namespace { Test.Parser }
#lexer::namespace { Test.Parser }
start : 'VERSION' INT INT project;
project : START 'project' NAME TEXT END 'project';
START: '/begin';
END: '/end';
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
INT : '0'..'9'+;
NAME: ('a'..'z' | 'A'..'Z')+;
TEXT : '"' ( '\\' (.) |'"''"' |~( '\\' | '"' | '\n' | '\r' ) )* '"';
STARTA
: '/begin hello';
And I want to parse this (for example):
VERSION 1 1
/begin project
testproject "description goes here"
/end
project
Now it will not work like this (Mismatched token exception). If I remove the last Token STARTA, it works. But why? I don't get it.
Help is really appreciated.
Thanks.
When the lexer sees the input "/begin " (including the space!), it is committed to the rule STARTA. When it can't match said rule, because the next char in the input is a "p" (from "project") and not a "h" (from "hello"), it will try to match another rule that can match "/begin " (including the space!). But there is no such rule, producing the error:
mismatched character 'p' expecting 'h'
and the lexer will not give up the space and match the START rule.
Remember that last part: once the lexer has matched something, it will not give up on it. It might try other rules that match the same input, but it will not backtrack to match a rule that matches less characters!
This is simply how the lexer works in ANTLR 3.x, no way around it.