Let's say I was trying to do a grammar for a simple JPA query syntax like so
select e from Entity e where e.name=:name and e.data>:time
Is there any documentation on how to do that alias part(the "e" basically)?
I am basically trying to get errors if the user types in
select a from Entity e where a.name=:name
Notice a is not defined so this should fail. Should I be doing this in the grammar at all? or should I do it after the grammar is parsed when I walk the tree?
Should I be doing this in the grammar at all?
What you should or shouldn't be doing is your business, of course :)
or should I do it after the grammar is parsed when I walk the tree?
Yes, this is usually done while evaluating your AST, not during AST creation (so not during parsing).
Related
I am looking for a solution to a simple problem.
The example :
SELECT date, date(date)
FROM date;
This is a rather stupid example where a table, its column, and a function all have the name "date".
The snippet of my grammar (very simplified) :
simple_select
: SELECT selected_element (',' selected_element) FROM from_element ';'
;
selected_element
: function
| REGULAR_WORD
;
function
: REGULAR_WORD '(' function_argument ')'
;
function_argument
: REGULAR_WORD
;
from_element
: REGULAR_WORD
;
DATE: D A T E;
FROM: F R O M;
SELECT: S E L E C T;
REGULAR_WORD
: (SIMPLE_LETTER) (SIMPLE_LETTER | '0'..'9')*
;
fragment SIMPLE_LETTER
: 'a'..'z'
| 'A'..'Z'
;
DATE is a keyword (it is used somewhere else in the grammar).
If I want it to be recognised by my grammar as a normal word, here are my solutions :
1) I add it everywhere I used REGULAR_WORD, next to it.
Example :
selected_element
: function
| REGULAR_WORD
| DATE
;
=> I don't want this solution. I don't have only "DATE" as a keyword, and I have many rules using REGULAR_WORD, so I would need to add a list of many (50+) keywords like DATE to many (20+) parser rules : it would be absolutely ugly.
PROS: make a clean tree
CONS: make a dirty grammar
2) I use a parser rule in between to get all those keywords, and then, I replace every occurrence of REGULAR_WORD by that parser rule.
Example :
word
: REGULAR_WORD
| DATE
;
selected_element
: function
| word
;
=> I do not want this solution either, as it adds one more parser rule in the tree and polluting the informations (I do not want to know that "date" is a word, I want to know that it's a selected_element, a function, a function_argument or a from_element ...
PROS: make a clean grammar
CONS: make a dirty tree
Either way, I have a dirty tree or a dirty grammar. Isn't there a way to have both clean ?
I looked for aliases, parser fragment equivalent, but it doesn't seem like ANTLR4 has any ?
Thank you, have a nice day !
There are four different grammars for SQL dialects in the Antlr4 grammar repository and all four of them use your second strategy. So it seems like there is a consensus among Antlr4 sql grammar writers. I don't believe there is a better solution given the design of the Antlr4 lexer.
As you say, that leads to a bit of noise in the full parse tree, but the relevant non-terminal (function, selected_element, etc.) is certainly present and it does not seem to me to be very difficult to collapse the unit productions out of the parse tree.
As I understand it, when Antlr4 was being designed, a decision was made to only automatically produce full parse trees, because the design of condensed ("abstract") syntax trees is too idiosyncratic to fit into a grammar DSL. So if you find an AST more convenient, you have the responsibility to generate one yourself. That's generally straight-forward although it involves a lot of boilerplate.
Other parser generators do have mechanisms which can handle "semireserved keywords". In particular, the Lemon parser generator, which is part of the Sqlite project, includes a %fallback declaration which allows you to specify that one or more tokens should be automatically reclassified in a context in which no grammar rule allows them to be used. Unfortunately, Lemon does not generate Java parsers.
Another similar option would be to use a parser generator which supports "scannerless" parsing. Such parsers typically use algorithms like Earley/GLL/GLR, capable of parsing arbitrary CFGs, to get around the need for more lookahead than can conveniently be supported in fixed-lookahead algorithms such as LALR(1).
This is the socalled keywords-as-identifiers problem and has been discussed many times before. For instance I asked a similar question already 6 years ago in the ANTLR mailing list. But also here at Stackoverflow there are questions touching this area, for instance Trying to use keywords as identifiers in ANTLR4; not working.
Terence Parr wrote a wiki article for ANTLR3 in 2008 that shortly describes 2 possible solutions:
This grammar allows "if if call call;" and "call if;".
grammar Pred;
prog: stat+ ;
stat: keyIF expr stat
| keyCALL ID ';'
| ';'
;
expr: ID
;
keyIF : {input.LT(1).getText().equals("if")}? ID ;
keyCALL : {input.LT(1).getText().equals("call")}? ID ;
ID : 'a'..'z'+ ;
WS : (' '|'\n')+ {$channel=HIDDEN;} ;
You can make those semantic predicates more efficient by intern'ing those strings so that you can do integer comparisons instead of string compares.
The other alternative is to do something like this
identifier : KEY1 | KEY2 | ... | ID ;
which is a set comparison and should be faster.
Normally, as #rici already mentioned, people prefer the solution where you keep all keywords in an own rule and add that to your normal identifier rule (where such a keyword is allowed).
The other solution in the wiki can be generalized for any keyword, by using a lookup table/list in an action in the ID lexer rule, which is used to check if a given string is a keyword. This solution is not only slower, but also sacrifies clarity in your parser grammar, since you can no longer use keyword tokens in your parser rules.
I am trying to solve this question but I really don't know how to get started. I would appreciate some help.
The bitwise operators for a language are shown in the table below alongside the grammar. The operators and the grammar rules are in order of precedence from highest to lowest. The characters a, b and c represent terminals in the language.
Grammar table:
Show that the grammar is ambiguous using expression: a >> b ^ c
Rewrite the grammar so that it is unambiguous.
The Dragon Book says: "A grammar that produces more than one parse tree for some sentence is said to be ambiguous." So to show that a grammar is ambiguous, you need to show at least two parse trees for a single sentence generated by the grammar. In this case, the sentence to use is already given to you, so for Q1 you just need to find two different parse trees for a >> b ^ c. Shiping's comment gives you a big clue for that.
For Q2, where they ask you to "rewrite the grammar", I suspect the unspoken requirement is that the resulting grammar generate exactly the same language as the original. (So Shiping's suggestion to introduce parentheses to the language would not be accepted.) The general approach for doing this is to introduce a nonterminal for each precedence level in the precedence chart, and then modify the grammar rules to use the new nonterminals in such a way that the grammar can only generate parse trees that respect the precedence chart.
For example, look at the two trees you found for Q1. You should observe that one of them conforms to the precedence chart and one does not. You want a new grammar that allows the precedence-conforming tree but not the other.
As another clue, consider the difference between these two grammars:
E -> E + E
E -> E * E
E -> a | b
and
E -> E + T
T -> T * F
F -> a | b
Although they generate the same language, the first is ambiguous but the second is not.
I'm working on a parser for a grammar in ANTLR. I'm currently working on expressions where () has the highest order precedence, then Unary Minus, etc.
When I add the line ANTLR gives the error: The following sets of rules are mutually left-recursive [add, mul, unary, not, and, expr, paren, accessMem, relation, or, assign, equal] How can I go about solving this issue? Thanks in advance.
Easiest answer is to use antlr4 not 3, which has no problem with immediate left recursion. It automatically rewrites the grammar underneath the covers to do the right thing. There are plenty of examples. one could for example examine the Java grammar or my little blog entry on left recursive rules. If you are stuck with v3, then there are earlier versions of the Java grammar and also plenty of material on how to build arithmetic expression rules in the documentation and book.
I'm starting to study ANTLR.
The aim is to 'translate' Strings into SQL statements.
One simple example of what I want to do:
If I receive the String "name = A and age = B" --- ANTLR ---> "select * from USERS where name = 'A' and age = 'B'"
I've been reading some information about ANTLR, and following some examples, but those just convert the input stream of characters (source file) into a AST. But how can I use ANTLR to translate the input message, and use the translated output?
Can you give me some highlights or tell me where can I found some information about that?
I'm using the Eclipse IDE and Maven ANTLR Plugin.
ANTLR is just a parser generator. You can insert actions into the grammar that collect information or directly print output. The most common mechanism is to allow ANTLR to create an intermediate presentation in the form of an AST or, with ANTLR 4, a parse tree. From there, you build a tree walker to either build an internal model or directly generate output. From the internal model, which represent constructs in your output language, you can then generate the output. I typically use StringTemplate for generating structured text.
When the input and output are very similar and, more importantly, the order of output is very similar, you can get away with syntax directed translation: i.e. actions directly in the grammar or actions applied directly to a parse tree.
When the order of output is very different, you have to build some form of intermediate representation. Imagine simply reading in a bunch of integers and printing them back out in reverse order. You can do that by simply printing out the numbers as you see them. This is all explained in my [shameless plug] book, Language Implementation Patterns: Create Your Own Domain-Specific and General Programming Languages http://amzn.com/B00A376HGG
is there a tool to check my bnf grammar?
For example:
<assign>::=<var>=<expr>
<var>::=A|B|C
<expr>::=<expr>+<expr>
|<var>
A = B + C is a valid statement according to my bnf grammar and
A = B * C is not.
Is there a tool to check if given statement is valid or not?
Have used this while in my CS classes, I think it can pretty much do what you're looking for, that is, validating a statement with a given grammar.