How to get a parameter to the ANTLR lexer object?


I'm writing a Java program to parse SQL queries. To do so, I'm using ANTLR with presto.g4.
The code I'm currently using is pretty standard:
PrestoLexer lexer = new PrestoLexer(
new CaseChangingCharStream(CharStreams.fromString(query), true));
lexer.removeErrorListeners();
lexer.addErrorListener(errorListener);
CommonTokenStream tokens = new CommonTokenStream(lexer);
PrestoParser parser = new PrestoParser(tokens);
I wonder whether it's possible to pass a parameter to the lexer so that lexing differs depending on that parameter?
Update:
I've used @Mike's suggestion below: my lexer now inherits from the built-in lexer and adds a predicate function. My issue is now purely one of grammar.
This is my string definition:
STRING
: '\'' ( '\\' .
| '\\\\' . {HelperUtils.isNeedSpecialEscaping(this)}? // match \ followed by any char
| ~[\\'] // match anything other than \ and '
| '\'\'' // match ''
)*
'\''
;
I sometimes have a query with weird escaping for which the predicate returns true. For example:
select
table1(replace(replace(some_col,'\\'',''),'\"' ,'')) as features
from table1
And when I try to parse it, I get:
'\'',''),'
as a single string.
How can I handle this one?

I don't know what you need the parameter for, but you mentioned SQL, so let me present a solution I have used for years: predicates.
In MySQL (which is the dialect I work with) the syntax differs depending on the MySQL version number. So in my grammar I use semantic predicates to switch off and on language parts that belong to a specific version. The approach is simple:
test:
{serverVersion < 80014}? ADMIN_SYMBOL
| ONLY_SYMBOL
;
The ADMIN keyword is only acceptable for version < 8.0.14 (just an example, not true in reality), while the ONLY keyword is a possible alternative in any version.
The variable serverVersion is a member of a base class from which I derive my parser. That can be specified by:
options {
superClass = MySQLBaseRecognizer;
tokenVocab = MySQLLexer;
}
The lexer is also derived from that class, so the version number is available in both lexer and parser (in addition to other important settings like the SQL mode). With this approach you can also implement more complex predicate functions that need additional processing.
You can find the full code + grammars at the MySQL Workbench Github repository.
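In Java, a generated lexer and a generated parser extend different runtime base classes, so one class cannot serve as the superclass of both. One possible way to get a similar effect is to keep the settings in a small object that both recognizers hold. The sketch below is plain Java without any ANTLR dependency; DialectSettings and its methods are hypothetical names, and the version threshold is simply the one from the example rule above.

```java
/** Hypothetical holder for settings that lexer and parser predicates consult. */
final class DialectSettings {
    private final int serverVersion; // e.g. 80014 stands for 8.0.14

    DialectSettings(int serverVersion) {
        this.serverVersion = serverVersion;
    }

    int serverVersion() {
        return serverVersion;
    }

    /** Same condition as the predicate {serverVersion < 80014}? in the rule above. */
    boolean adminKeywordAllowed() {
        return serverVersion < 80014;
    }

    public static void main(String[] args) {
        DialectSettings oldServer = new DialectSettings(80013);
        DialectSettings newServer = new DialectSettings(80014);
        System.out.println(oldServer.adminKeywordAllowed()); // true
        System.out.println(newServer.adminKeywordAllowed()); // false
    }
}
```

A predicate in the grammar could then call into such an object, for example {settings.adminKeywordAllowed()}?, where settings is a field you would add to your own lexer and parser base classes.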

I wonder whether it's possible to pass a parameter to the lexer so that lexing differs depending on that parameter?
No, the lexer works independently from the parser. You cannot direct the lexer while parsing.

Related

ANTLR4 : clean grammar and tree with keywords (aliases ?)

I am looking for a solution to a simple problem.
The example :
SELECT date, date(date)
FROM date;
This is a rather stupid example where a table, its column, and a function all have the name "date".
The snippet of my grammar (very simplified) :
simple_select
: SELECT selected_element (',' selected_element) FROM from_element ';'
;
selected_element
: function
| REGULAR_WORD
;
function
: REGULAR_WORD '(' function_argument ')'
;
function_argument
: REGULAR_WORD
;
from_element
: REGULAR_WORD
;
DATE: D A T E;
FROM: F R O M;
SELECT: S E L E C T;
REGULAR_WORD
: (SIMPLE_LETTER) (SIMPLE_LETTER | '0'..'9')*
;
fragment SIMPLE_LETTER
: 'a'..'z'
| 'A'..'Z'
;
DATE is a keyword (it is used somewhere else in the grammar).
If I want it to be recognised by my grammar as a normal word, here are my solutions :
1) I add it everywhere I used REGULAR_WORD, next to it.
Example :
selected_element
: function
| REGULAR_WORD
| DATE
;
=> I don't want this solution. I don't have only "DATE" as a keyword, and I have many rules using REGULAR_WORD, so I would need to add a list of many (50+) keywords like DATE to many (20+) parser rules : it would be absolutely ugly.
PROS: make a clean tree
CONS: make a dirty grammar
2) I use a parser rule in between to get all those keywords, and then, I replace every occurrence of REGULAR_WORD by that parser rule.
Example :
word
: REGULAR_WORD
| DATE
;
selected_element
: function
| word
;
=> I do not want this solution either, as it adds one more parser rule to the tree and pollutes the information (I do not want to know that "date" is a word, I want to know that it's a selected_element, a function, a function_argument or a from_element ...).
PROS: make a clean grammar
CONS: make a dirty tree
Either way, I have a dirty tree or a dirty grammar. Isn't there a way to have both clean ?
I looked for aliases, parser fragment equivalent, but it doesn't seem like ANTLR4 has any ?
Thank you, have a nice day !

There are four different grammars for SQL dialects in the Antlr4 grammar repository and all four of them use your second strategy. So it seems like there is a consensus among Antlr4 sql grammar writers. I don't believe there is a better solution given the design of the Antlr4 lexer.
As you say, that leads to a bit of noise in the full parse tree, but the relevant non-terminal (function, selected_element, etc.) is certainly present and it does not seem to me to be very difficult to collapse the unit productions out of the parse tree.
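Collapsing such unit productions can be sketched as a small tree rewrite. The example below does not use ANTLR types: TreeNode is a hypothetical stand-in for a parse tree node, and splice replaces every node with a given rule name (here the word rule from the question) by its children. The root itself is never removed.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal stand-in for a parse tree node; not an ANTLR type. */
final class TreeNode {
    final String label;
    final List<TreeNode> children = new ArrayList<>();

    TreeNode(String label, TreeNode... kids) {
        this.label = label;
        for (TreeNode kid : kids) {
            children.add(kid);
        }
    }

    /** Returns a copy of the tree in which every non-root node labelled ruleName is replaced by its children. */
    static TreeNode splice(TreeNode node, String ruleName) {
        TreeNode copy = new TreeNode(node.label);
        for (TreeNode child : node.children) {
            TreeNode cleaned = splice(child, ruleName);
            if (cleaned.label.equals(ruleName)) {
                copy.children.addAll(cleaned.children); // drop the wrapper, keep what it wrapped
            } else {
                copy.children.add(cleaned);
            }
        }
        return copy;
    }

    @Override
    public String toString() {
        if (children.isEmpty()) {
            return label;
        }
        StringBuilder sb = new StringBuilder("(").append(label);
        for (TreeNode child : children) {
            sb.append(' ').append(child);
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        TreeNode tree = new TreeNode("selected_element",
                new TreeNode("word", new TreeNode("date")));
        System.out.println(tree);                 // (selected_element (word date))
        System.out.println(splice(tree, "word")); // (selected_element date)
    }
}
```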
As I understand it, when Antlr4 was being designed, a decision was made to only automatically produce full parse trees, because the design of condensed ("abstract") syntax trees is too idiosyncratic to fit into a grammar DSL. So if you find an AST more convenient, you have the responsibility to generate one yourself. That's generally straightforward, although it involves a lot of boilerplate.
Other parser generators do have mechanisms which can handle "semireserved keywords". In particular, the Lemon parser generator, which is part of the Sqlite project, includes a %fallback declaration which allows you to specify that one or more tokens should be automatically reclassified in a context in which no grammar rule allows them to be used. Unfortunately, Lemon does not generate Java parsers.
Another similar option would be to use a parser generator which supports "scannerless" parsing. Such parsers typically use algorithms like Earley/GLL/GLR, capable of parsing arbitrary CFGs, to get around the need for more lookahead than can conveniently be supported in fixed-lookahead algorithms such as LALR(1).

This is the so-called keywords-as-identifiers problem and has been discussed many times before. For instance, I asked a similar question on the ANTLR mailing list six years ago. There are also questions here on Stack Overflow touching this area, for instance "Trying to use keywords as identifiers in ANTLR4; not working".
Terence Parr wrote a wiki article for ANTLR3 in 2008 that briefly describes 2 possible solutions:
This grammar allows "if if call call;" and "call if;".
grammar Pred;
prog: stat+ ;
stat: keyIF expr stat
| keyCALL ID ';'
| ';'
;
expr: ID
;
keyIF : {input.LT(1).getText().equals("if")}? ID ;
keyCALL : {input.LT(1).getText().equals("call")}? ID ;
ID : 'a'..'z'+ ;
WS : (' '|'\n')+ {$channel=HIDDEN;} ;
You can make those semantic predicates more efficient by intern'ing those strings so that you can do integer comparisons instead of string compares.
The other alternative is to do something like this
identifier : KEY1 | KEY2 | ... | ID ;
which is a set comparison and should be faster.
Normally, as @rici already mentioned, people prefer the solution where you keep all keywords in a separate rule and add that to your normal identifier rule (where such a keyword is allowed).
The other solution in the wiki can be generalized for any keyword, by using a lookup table/list in an action in the ID lexer rule, which is used to check if a given string is a keyword. This solution is not only slower, but also sacrifices clarity in your parser grammar, since you can no longer use keyword tokens in your parser rules.
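A minimal sketch of such a lookup table, in plain Java and under stated assumptions: KeywordTable, its token type numbers and the three keywords are made up for illustration (a real lexer would use its generated constants), and the lookup is case-insensitive because the grammar in the question matches keywords that way.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical keyword table that an action in the ID lexer rule could consult. */
final class KeywordTable {
    // Illustrative token type numbers, not taken from any generated lexer.
    static final int ID = 1;
    static final int DATE = 2;
    static final int FROM = 3;
    static final int SELECT = 4;

    private static final Map<String, Integer> KEYWORDS = Map.of(
            "date", DATE,
            "from", FROM,
            "select", SELECT);

    /** Returns the keyword token type for text, or ID when text is not a keyword. */
    static int tokenTypeFor(String text) {
        return KEYWORDS.getOrDefault(text.toLowerCase(Locale.ROOT), ID);
    }

    public static void main(String[] args) {
        System.out.println(tokenTypeFor("DATE"));  // 2
        System.out.println(tokenTypeFor("dates")); // 1
    }
}
```

In ANTLR 4, an action in the ID rule could then call something like setType(KeywordTable.tokenTypeFor(getText())); setType and getText are methods of the runtime's Lexer class, while everything else here is illustrative.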

Identifying the version of the grammar with Antlr4

Is there a good way to have Antlr4 identify the version of a grammar used to parse input?
If I have two grammars, GA and GB, where GA is a subset of GB and GB imports GA, is there a way to have Antlr4 report whether the input was parsed using GA or GB?
I could simply try parsing it with GB first and, if that failed, try parsing it with GA, but I was wondering if there was a more efficient way to have Antlr keep track of which rules were used and say, "I successfully parsed this but only used rules from the GA grammar".

The right approach would be to correlate each rule (or only key rules) to a parser version.
First of all you are going to need a field to track current version:
@members {
int currentVersion = 1;
}
Now, let's suppose you have a rule rule_one which correlates with version one and rule_two which correlates with version two.
Each time a rule correlated with a higher version is accepted, the currentVersion field should be updated:
rule_one
: some_token {currentVersion = Math.max(1, currentVersion);} // 1 is the parser version
;
rule_two
: some_token {currentVersion = Math.max(2, currentVersion);} // 2 is the parser version
;
Thus, when parsing is done, you can get the maximum version which has been used.
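The bookkeeping itself is independent of ANTLR and can be sketched in plain Java. VersionTracker and ruleMatched are hypothetical names; in the grammar above the same Math.max update sits directly in the rule actions.

```java
/** Hypothetical tracker for the highest grammar version whose rules were used. */
final class VersionTracker {
    private int currentVersion = 1;

    /** Called when a rule tied to the given grammar version has matched. */
    void ruleMatched(int ruleVersion) {
        currentVersion = Math.max(ruleVersion, currentVersion);
    }

    int currentVersion() {
        return currentVersion;
    }

    public static void main(String[] args) {
        VersionTracker tracker = new VersionTracker();
        tracker.ruleMatched(1); // a version 1 rule matched
        tracker.ruleMatched(2); // a version 2 rule matched
        tracker.ruleMatched(1); // later version 1 rules do not lower the result
        System.out.println(tracker.currentVersion()); // 2
    }
}
```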

Not exactly what you are asking, but in my MySQL grammar I have to support multiple server versions, which I do by using semantic predicates. That means I can use a single grammar and enable/disable certain paths depending on a serverVersion field I have in my parser. This is what it looks like:
alterDatabase:
DATABASE_SYMBOL schemaRef (
createDatabaseOption+
| {serverVersion < 80000}? UPGRADE_SYMBOL DATA_SYMBOL DIRECTORY_SYMBOL NAME_SYMBOL
)
;
This works very well. I can use this approach even in the lexer (but there with validating semantic predicates, for performance reasons), which allows me to switch keywords on and off, like this:
CONTRIBUTORS_SYMBOL: C O N T R I B U T O R S {serverVersion < 50700}?;

ANTLR4 parse tree simplification

Is there any means to get ANTLR4 to automatically remove redundant nodes in generated parse trees?
More specifically, I've been experimenting with a grammar for GLSL and you end up with long linear sequences of "expressions" in the parse tree due to the rule forwarding needed to give the automatic handling of operator precedence.
Most of the generated tree nodes are simply "forward to the next level of precedence", so don't provide any useful syntactic information - you only really need the last expression node in each sequence (i.e. the point at which the rule forwarding stopped), or the point where it becomes an actual tree node with more than one child (i.e. an actual expression was encountered in the source) ...
I was hoping there would be an easy way to kill off the dummy intermediate expression nodes - this type of structure must be common in any grammar with operator precedence.
The basic structure of the grammar is a fairly direct clone taken from the Khronos specification for the language:
https://www.khronos.org/registry/gles/specs/3.1/es_spec_3.1.pdf

ANTLR v4 is able to generate code from a single recursive rule dealing with different precedence levels, if you use a grammar like this (example for basic math):
expr : '(' expr ')'
| '-' expr
| expr ('*'|'/') expr
| expr ('+'|'-') expr
| INT
;
ANTLR v3 was unable to do so and basically required you to write one rule per precedence level. So I'd advise you to rewrite your grammar to avoid these boilerplate rules.
Then, I think you're confusing the parse tree (aka concrete syntax tree) with the AST (abstract syntax tree). The AST is like a simplified version of the parse tree, which keeps only what's needed for your purpose. For instance, with the expr rule above, the AST wouldn't contain any node for parentheses, since the precedence is encoded in the tree itself and you usually don't need to know whether a part of a given expression was parenthesized or not.
Your program should build an AST from the parse tree and then go from there. Don't deal with parse trees directly, even if it seems convenient at first sight because the tool generates them for you. It'll quickly become cumbersome. Build your own tree structure (AST), tailored for the task at hand.
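As a rough illustration of that step, the sketch below converts a parse tree for the expr rule above into a smaller AST. It is plain Java, not the ANTLR API: ParseNode is a hypothetical stand-in for the generated context classes, and Ast is one possible tree shape among many. Parentheses disappear because the nesting of the AST already encodes them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class AstSketch {
    /** Stand-in for a parse tree node (not an ANTLR class): token text, or "expr" plus children. */
    static final class ParseNode {
        final String text;
        final List<ParseNode> children;

        ParseNode(String text, ParseNode... kids) {
            this.text = text;
            this.children = Arrays.asList(kids);
        }
    }

    /** AST node: an operator or literal plus its operands. */
    static final class Ast {
        final String value;
        final List<Ast> operands = new ArrayList<>();

        Ast(String value, Ast... args) {
            this.value = value;
            operands.addAll(Arrays.asList(args));
        }

        @Override
        public String toString() {
            if (operands.isEmpty()) {
                return value;
            }
            StringBuilder sb = new StringBuilder("(").append(value);
            for (Ast operand : operands) {
                sb.append(' ').append(operand);
            }
            return sb.append(')').toString();
        }
    }

    /** Converts an expr parse node, one case per alternative of the expr rule. */
    static Ast toAst(ParseNode expr) {
        List<ParseNode> c = expr.children;
        if (c.size() == 1) {                 // INT
            return new Ast(c.get(0).text);
        }
        if (c.size() == 2) {                 // '-' expr
            return new Ast("neg", toAst(c.get(1)));
        }
        if (c.get(0).text.equals("(")) {     // '(' expr ')': drop the parentheses
            return toAst(c.get(1));
        }
        return new Ast(c.get(1).text, toAst(c.get(0)), toAst(c.get(2))); // expr op expr
    }

    public static void main(String[] args) {
        // Parse tree for "(1 + 2) * 3"
        ParseNode one = new ParseNode("expr", new ParseNode("1"));
        ParseNode two = new ParseNode("expr", new ParseNode("2"));
        ParseNode three = new ParseNode("expr", new ParseNode("3"));
        ParseNode sum = new ParseNode("expr", one, new ParseNode("+"), two);
        ParseNode parens = new ParseNode("expr", new ParseNode("("), sum, new ParseNode(")"));
        ParseNode product = new ParseNode("expr", parens, new ParseNode("*"), three);
        System.out.println(toAst(product)); // (* (+ 1 2) 3)
    }
}
```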

Use the Visitor implementation to access each node in sequence. Build your own tree by adding nodes to parents as they are visited. Decide at the time the node is visited whether to add it to your new tree or not. For example:
public T visitExpression(@NotNull AcParser.ExpressionContext ctx) {
    // Expressionable parent = getParent(Expressionable.class, ctx);
    // Class<? extends AcExpression> expClass = AcExpression.class;
    AcExpression obj = null; // placeholder for the node you would add to your own tree
    String text = ctx.getText();
    // do something with text or children, e.g. print each child's text
    for (int i = 0; i < ctx.getChildCount(); i++) {
        System.out.println(ctx.getChild(i).getText() + "/");
    }
    return visitChildren(ctx);
}

ParseKit: What built-in Productions should I use in my Grammars?

I just started using ParseKit to explore language creation and perhaps build a small toy DSL. However, the current SVN trunk from Google is throwing a -[PKToken intValue]: unrecognized selector sent to instance ... when parsing this grammar:
@start = identifier ;
identifier = (Letter | '_') | (letterOrDigit | '_') ;
letterOrDigit = Letter | Digit ;
Against this input:
foo
Clearly, I am missing something or have incorrectly configured my project. What can I do to fix this issue?

Developer of ParseKit here.
First, see the ParseKit Tokenization docs.
Basically, ParseKit can work in one of two modes: Let's call them Tokens Mode and Chars Mode. (There are no formal names for these two modes, but perhaps there should be.)
Tokens Mode is more popular by far. Virtually every example you will find of using ParseKit will show how to use Tokens Mode. I believe all of the documentation on http://parsekit.com is using Tokens Mode. ParseKit's grammar feature (that you are using in your example) only works in Tokens Mode.
Chars Mode is a very little-known feature of ParseKit. I've never had anyone ask about it before.
So the differences in the modes are:
In Tokens Mode, the ParseKit Tokenizer emits multi-char tokens (like Words, Symbols, Numbers, QuotedStrings etc) which are then parsed by the ParseKit parsers you create (programmatically or via grammars).
In Chars Mode, the ParseKit Tokenizer always emits single-char tokens which are then parsed by the ParseKit parsers you create programmatically. (grammars don't currently work with this mode as this mode is not popular).
You could use Chars Mode to implement Regular Expressions which parse on a char-by-char basis.
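To make the difference between the two modes concrete, here is a much simplified model in plain Java. It is not ParseKit's tokenizer or API (ParseKit is an Objective-C library); TokenizerModes and its rules for what counts as a word are assumptions made only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of the two modes; not ParseKit code. */
final class TokenizerModes {
    /** "Chars Mode": every character becomes its own token. */
    static List<String> charsMode(String input) {
        List<String> tokens = new ArrayList<>();
        for (char ch : input.toCharArray()) {
            tokens.add(String.valueOf(ch));
        }
        return tokens;
    }

    /** "Tokens Mode": runs of word characters form one token, other non-space characters stand alone. */
    static List<String> tokensMode(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char ch = input.charAt(i);
            if (Character.isWhitespace(ch)) {
                i++; // whitespace separates tokens and is dropped
                continue;
            }
            int start = i;
            if (isWordChar(ch)) {
                while (i < input.length() && isWordChar(input.charAt(i))) {
                    i++;
                }
            } else {
                i++; // a single symbol character
            }
            tokens.add(input.substring(start, i));
        }
        return tokens;
    }

    private static boolean isWordChar(char ch) {
        return Character.isLetterOrDigit(ch) || ch == '_';
    }

    public static void main(String[] args) {
        System.out.println(charsMode("foo"));           // [f, o, o]
        System.out.println(tokensMode("foo = bar_1;")); // [foo, =, bar_1, ;]
    }
}
```

A grammar written against the second kind of output sees foo as one unit, which is why productions that match single characters do not fit there.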
For your example, you should be ignoring Chars Mode and just use Tokens Mode. The following Built-in Productions are for Chars Mode only. Do not use them in your grammars:
(PK)Letter
(PK)Digit
(PK)Char
(PK)SpecificChar
Notice how all of those Productions sound like they match individual chars. That's because they do.
Your example above should probably look like:
@start = identifier;
identifier = Word; // by default Words start with a-zA-Z_ and contain -0-9a-zA-Z_'
Keep in mind the Productions in your grammars (parsers like identifier) will be working on Tokens already emitted from ParseKit's Tokenizer. Not individual chars.
IOW: by the time your grammar goes to work parsing input, the input has already been tokenized into Tokens of type Word, Number, Symbol, QuotedString, etc.
Here are all of the Built-in Productions available for use in your Grammar:
Word
Number
Symbol
QuotedString
Comment
Any
S // Whitespace. only available when @preservesWhitespaceTokens=YES. NO by default.
Also:
DelimitedString('start', 'end', 'allowedCharset')
/xxx/i // RegEx match
There are also operators for composite parsers:
// Sequence
| // Alternation
? // Optional
+ // Multiple
* // Repetition
~ // Negation
& // Intersection
- // Difference

Stripping actions from ANTLR grammar changes its parsing algorithm

I have a grammar Foo.xtext (too complex to include it here). Xtext generates InternalFoo.g from it. After some tweaking it also generates DebugInternalFoo.g which claims to be the same thing without actions. Now, I strip off actions with ANTLR directly
java -cp antlr-3.4.jar org.antlr.tool.Strip Internal.g > Stripped.g
I'd expect the three grammars to behave the same way when I check them. But here is what I experienced
InternalFoo.g - error, rule assignment has non-LL(*) decision
DebugInternalFoo.g - no problem, parses fine
Stripped.g - warnings at rule assignment, decision can match using multiple alternatives. It fails to parse properly.
Is it possible that a grammar parses a text differently with or without actions? Or is it a bug in any of the action-remover tools? (The rule in question has syntactic predicates, and without them, it would really have a non-LL(*) decision.)
UPDATE:
I partly found what caused the problem. The rule in question was like this
trickyRule:
({ some complex action})
(expression '=')=>...
Stripping with Antlr removed the action, but left an empty group there:
// Stripped.g
trickyRule:
() (expression '=')=>...
The generation of the debug grammar removes both the action, and the now empty group around it:
// DebugInternalFoo.g
trickyRule:
(expression '=')=>...
So the lesson learned is: an empty group before a syntactic predicate is not the same as nothing at all.

Is it possible that a grammar parses a text differently with or without actions?
Yes, that is possible. org.antlr.tool.Strip leaves syntactic predicates (1), but removes validating (2) and gated (3) semantic predicates (and member sections that these semantic predicates might use).
For example, the following rules would only match an A_TOKEN:
parser_rule1
: (parser_rule2)=> parser_rule2
;
parser_rule2
: {input.LT(1).getType() == A_TOKEN}? .
;
but if you use the Strip tool on it, it leaves the following:
parser_rule1
: (parser_rule2)=> parser_rule2
;
parser_rule2
: /*{input.LT(1).getType() == A_TOKEN}?*/ .
;
making it match any token.
In other words, Strip could change the behavior of the generated lexer or parser.
1 syntactic predicate: ( ... )=>
2 validating semantic predicate { ... }?
3 gated semantic predicate { ... }?=>
