Just starting to write my first lexer and I've come across this:
RPAREN options { paraphrase = ")"; } : ")";
I'd like to know what paraphrase actually does. Does it mean that in this case RPAREN can also be written simply as ) in the parser?
Thanks!
EDIT: I just found this online:
We can use paraphrases in Rules to make error messages user-friendly
Is this correct?
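paraphrase is not a valid option in ANTLR 3 or ANTLR 4. Including it produces either a warning or an error, and it has no effect on how the generated lexer behaves. The option comes from ANTLR 2, where it supplied a readable description of a token for use in error messages; it never changed what the parser accepts.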
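To make the quoted description concrete: a paraphrase in that sense only changes how a token is named in an error message. The sketch below shows the idea in plain Java. Nothing in it is ANTLR API; the class name, the map, and the message format are all made up for illustration.

```java
import java.util.Map;

// Hypothetical sketch, not ANTLR API: maps internal token names to
// reader-friendly descriptions that are used only in error messages.
final class ParaphraseSketch {
    private static final Map<String, String> PARAPHRASES = Map.of(
        "RPAREN", "')'",
        "LPAREN", "'('");

    // Falls back to the raw token name when no paraphrase is registered.
    static String describe(String tokenName) {
        return PARAPHRASES.getOrDefault(tokenName, tokenName);
    }

    static String errorMessage(String expected, String found) {
        return "expecting " + describe(expected) + ", found " + describe(found);
    }

    public static void main(String[] args) {
        // prints: expecting ')', found IDENT
        System.out.println(errorMessage("RPAREN", "IDENT"));
    }
}
```

The token itself is still called RPAREN everywhere else, which is why a paraphrase would not let you write ) in place of RPAREN in parser rules.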
I'm writing Java software to parse SQL queries. To do so, I'm using ANTLR with presto.g4.
The code I'm currently using is pretty standard:
PrestoLexer lexer = new PrestoLexer(
new CaseChangingCharStream(CharStreams.fromString(query), true));
lexer.removeErrorListeners();
lexer.addErrorListener(errorListener);
CommonTokenStream tokens = new CommonTokenStream(lexer);
PrestoParser parser = new PrestoParser(tokens);
I wonder whether it's possible to pass a parameter to the lexer so that lexing differs depending on that parameter?
Update:
I've used @Mike's suggestion below: my lexer now inherits from the built-in lexer, and I added a predicate function. My issue is now purely about the grammar.
This is my string definition:
STRING
: '\'' ( '\\' .
| '\\\\' . {HelperUtils.isNeedSpecialEscaping(this)}? // match \ followed by any char
| ~[\\'] // match anything other than \ and '
| '\'\'' // match ''
)*
'\''
;
I sometimes have a query with weird escaping for which the predicate returns true. For example:
select
table1(replace(replace(some_col,'\\'',''),'\"' ,'')) as features
from table1
And when I try to parse it I'm getting:
'\'',''),'
As a single string.
How can I handle this one?
I don't know what you need the parameter for, but since you mentioned SQL, let me present a solution I have used for years: predicates.
In MySQL (which is the dialect I work with) the syntax differs depending on the MySQL version number. So in my grammar I use semantic predicates to switch off and on language parts that belong to a specific version. The approach is simple:
test:
{serverVersion < 80014}? ADMIN_SYMBOL
| ONLY_SYMBOL
;
The ADMIN keyword is only acceptable for version < 8.0.14 (just an example, not true in reality), while the ONLY keyword is a possible alternative in any version.
The variable serverVersion is a member of a base class from which I derive my parser. That can be specified by:
options {
superClass = MySQLBaseRecognizer;
tokenVocab = MySQLLexer;
}
The lexer is also derived from that class, so the version number is available in both the lexer and the parser (along with other important settings such as the SQL mode). With this approach you can also implement more complex predicate functions that need additional processing.
You can find the full code + grammars at the MySQL Workbench Github repository.
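As a rough illustration of the shared-base-class idea, here is a minimal plain-Java sketch. It is not the MySQL Workbench code and does not use the ANTLR runtime: the class names, the whitespace-splitting "lexer", and the token type strings are assumptions made up for the example. In a real setup the base class would extend ANTLR's Lexer or Parser and the check would sit in a grammar predicate.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical base class holding settings shared by lexer and parser.
class VersionedRecognizerSketch {
    protected final int serverVersion;

    VersionedRecognizerSketch(int serverVersion) {
        this.serverVersion = serverVersion;
    }
}

// Toy whitespace-splitting "lexer"; a real ANTLR lexer would extend Lexer.
final class VersionedLexerSketch extends VersionedRecognizerSketch {
    VersionedLexerSketch(int serverVersion) {
        super(serverVersion);
    }

    // Plays the role of the predicate {serverVersion < 80014}?
    private boolean adminIsKeyword() {
        return serverVersion < 80014;
    }

    List<String> tokenTypes(String input) {
        List<String> types = new ArrayList<>();
        for (String word : input.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // empty input yields no tokens
            }
            if (word.equalsIgnoreCase("admin") && adminIsKeyword()) {
                types.add("ADMIN_SYMBOL");
            } else if (word.equalsIgnoreCase("only")) {
                types.add("ONLY_SYMBOL");
            } else {
                types.add("IDENTIFIER");
            }
        }
        return types;
    }

    public static void main(String[] args) {
        // prints: [ADMIN_SYMBOL, ONLY_SYMBOL]
        System.out.println(new VersionedLexerSketch(80013).tokenTypes("admin only"));
        // prints: [IDENTIFIER, ONLY_SYMBOL]
        System.out.println(new VersionedLexerSketch(80020).tokenTypes("admin only"));
    }
}
```

The point is only that the parameter lives in a field set before lexing starts, and the rule consults it each time a token is matched.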
I wonder whether it's possible to pass a parameter to the lexer so that lexing differs depending on that parameter?
No, the lexer works independently from the parser. You cannot direct the lexer while parsing.
Using yacc, I want to parse text like
begin foo ... end foo
The string foo is not known at compile time, and several different such strings can appear in the same input.
So far, the only option I see is to check for syntactical correctness after parsing:
block : BEGIN IDENT something END IDENT
{ if (strcmp($2, $5) != 0) yyerror("Mismatch"); }
This feels wrong. The parser should already detect the errors. Is there something built-in to yacc?
yacc only knows about the token types that the lexer returns. Since both names arrive as the same IDENT token, the lexer could only improve this case by using states.
That is, you could tell lex to remember that it saw a BEGIN and to count the tokens itself, and return a different type of IDENT (and do the checking there).
However, yacc is better suited to this sort of thing, so the answer to the original question is "no", there is no better solution.
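For what it's worth, requiring the two names to match is generally not something a context-free grammar can express, so a check in the action is the usual place for it. The same check extends to nested blocks if the open names are kept on a stack. Below is a minimal Java sketch of that idea, outside yacc entirely; the class name, the whitespace tokenization, and the error strings are assumptions for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class BlockNameCheckSketch {
    // Returns null when every "end NAME" matches the innermost open
    // "begin NAME"; otherwise returns a description of the first problem.
    static String check(String input) {
        Deque<String> open = new ArrayDeque<>();
        String[] words = input.trim().split("\\s+");
        for (int i = 0; i < words.length; i++) {
            boolean isBegin = words[i].equals("begin");
            boolean isEnd = words[i].equals("end");
            if (!isBegin && !isEnd) {
                continue; // block body, ignored in this sketch
            }
            if (i + 1 >= words.length) {
                return "missing name after '" + words[i] + "'";
            }
            String name = words[++i];
            if (isBegin) {
                open.push(name);
            } else if (open.isEmpty()) {
                return "'end " + name + "' without matching begin";
            } else {
                String expected = open.pop();
                if (!expected.equals(name)) {
                    return "Mismatch: begin " + expected + " closed by end " + name;
                }
            }
        }
        return open.isEmpty() ? null : "unclosed block '" + open.peek() + "'";
    }

    public static void main(String[] args) {
        // prints: null
        System.out.println(check("begin foo x y end foo"));
        // prints: Mismatch: begin foo closed by end baz
        System.out.println(check("begin foo begin bar end bar end baz"));
    }
}
```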
I'm constructing an English-like domain-specific language with ANTLR. Its keywords are context-sensitive. (I know it sounds dirty, but it makes a lot of sense for the non-programmer target users.) For example, the usual logical operators such as or and not are to be treated as identifiers when enclosed in brackets, [like this and this]. My current approach looks like this:
bracketedStatement
: '[' bracketedWord+ ']'
;
bracketedWord
: (~(']'))+
;
This, when combined with lexical definitions such as the following:
AND: 'and' ;
OR: 'or' ;
produces the warning "Decision can match input such as "{AND..PROCESS, RPAREN..'with'}" using multiple alternatives: 1, 2". I'm clearly creating ambiguity for ANTLR, but I don't know how to resolve it. How do I fix this?
For anyone who finds this, check out this Stack Overflow question. It clarifies how to use the negation symbol correctly.
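Separately from the negation fix, one way to think about context-sensitive keywords is to let the tokenizer track whether it is inside brackets and only emit keyword tokens outside them. The Java sketch below is hand-written and is not ANTLR code; the class name, the keyword set, and the token spellings are assumptions for illustration, and it does not handle nested brackets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class BracketAwareLexerSketch {
    private static final Set<String> KEYWORDS = Set.of("and", "or", "not");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        boolean inBrackets = false;
        // Pad brackets with spaces so splitting on whitespace isolates them.
        String spaced = input.replace("[", " [ ").replace("]", " ] ");
        for (String word : spaced.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            if (word.equals("[")) {
                inBrackets = true;
                tokens.add("LBRACKET");
            } else if (word.equals("]")) {
                inBrackets = false;
                tokens.add("RBRACKET");
            } else if (!inBrackets && KEYWORDS.contains(word)) {
                tokens.add(word.toUpperCase(Locale.ROOT)); // keyword token
            } else {
                tokens.add("ID(" + word + ")");            // plain identifier
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints: [ID(a), AND, LBRACKET, ID(b), ID(and), ID(c), RBRACKET]
        System.out.println(tokenize("a and [b and c]"));
    }
}
```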
This has got to be one of those well-known examples that's somewhere on the internet, but I can't seem to find it.
I'm trying to learn Xtext and I figured a calculator expression parser would be a good start. But I'm getting syntax errors in my grammar:
Expression:
Term (('+'|'-') Term)*;
Term:
Factor (('*'|'/') Factor)*;
Factor:
number=Number | variable=ID | ('(' expression=Expression ')');
I get this error in the Expression and Term lines:
Multiple markers at this line
- Cannot change type twice within a rule
- An unassigned rule call is not allowed, when the 'current'
was already created.
What gives? How can I fix this? And when do I have instanceName=Rule vs. Rule entries in a grammar?
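I downloaded Xtext bundled with Eclipse, and it comes with a calculator example called arithmetics that does approximately what you want. From what I can gather, you need to tell Xtext which node to create for each operator and how it attaches to what has been parsed so far. The actions in braces do that: {Plus.left=current}, for example, creates a Plus node whose left feature is the expression parsed so far, which is also what makes the operators left-associative. This grammar runs fine for me: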
Expression:
Term (({Plus.left=current}'+'|{Minus.left=current}'-') right=Term)*;
Term:
Factor (({Multiply.left=current} '*'| {Division.left=current}'/') right=Factor)*;
Factor:
number=NUMBER | variable=ID | ('(' expression=Expression ')');
The example grammar they have for arithmetics can be viewed here. It includes a bit more than yours, like function calls, but the basics are the same.
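To show what an action like {Plus.left=current} amounts to, here is a small hand-written Java sketch of the same three rules. It is not Xtext-generated code: the class name is made up, tokens must be separated by spaces, and the tree is rendered as a parenthesized string instead of model objects to keep it short.

```java
final class LeftAssocSketch {
    private final String[] tokens;
    private int pos = 0;

    LeftAssocSketch(String input) {
        this.tokens = input.trim().split("\\s+");
    }

    private String peek() {
        return pos < tokens.length ? tokens[pos] : "";
    }

    // Expression: Term (({Plus.left=current} '+' | {Minus.left=current} '-') right=Term)*
    String expression() {
        String current = term();
        while (peek().equals("+") || peek().equals("-")) {
            String op = tokens[pos++];
            // The action step: the tree built so far becomes the left child.
            current = "(" + current + " " + op + " " + term() + ")";
        }
        return current;
    }

    // Term: Factor (({Multiply.left=current} '*' | {Division.left=current} '/') right=Factor)*
    String term() {
        String current = factor();
        while (peek().equals("*") || peek().equals("/")) {
            String op = tokens[pos++];
            current = "(" + current + " " + op + " " + factor() + ")";
        }
        return current;
    }

    // Factor: NUMBER | ID | '(' Expression ')'
    String factor() {
        if (peek().equals("(")) {
            pos++; // consume '('
            String inner = expression();
            if (!peek().equals(")")) {
                throw new IllegalArgumentException("expected ')'");
            }
            pos++; // consume ')'
            return inner;
        }
        if (pos >= tokens.length) {
            throw new IllegalArgumentException("unexpected end of input");
        }
        return tokens[pos++];
    }

    static String parse(String input) {
        return new LeftAssocSketch(input).expression();
    }

    public static void main(String[] args) {
        // prints: ((1 - 2) - (3 * 4))
        System.out.println(parse("1 - 2 - 3 * 4"));
    }
}
```

Each pass through the loop wraps the previous result, so 1 - 2 - 3 groups as (1 - 2) - 3 and not 1 - (2 - 3).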
I have a grammar Foo.xtext (too complex to include here). Xtext generates InternalFoo.g from it. After some tweaking it also generates DebugInternalFoo.g, which claims to be the same thing without actions. Now, I strip off the actions with ANTLR directly:
java -cp antlr-3.4.jar org.antlr.tool.Strip Internal.g > Stripped.g
I'd expect the three grammars to behave the same way when I check them. But here is what I experienced:
InternalFoo.g - error, rule assignment has non-LL(*) decision
DebugInternalFoo.g - no problem, parses fine
Stripped.g - warnings at rule assignment, decision can match using multiple alternatives. It fails to parse properly.
Is it possible that a grammar parses a text differently with or without actions? Or is it a bug in any of the action-remover tools? (The rule in question has syntactic predicates, and without them, it would really have a non-LL(*) decision.)
UPDATE:
I partly found what caused the problem. The rule in question was like this
trickyRule:
({ some complex action})
(expression '=')=>...
Stripping with Antlr removed the action, but left an empty group there:
// Stripped.g
trickyRule:
() (expression '=')=>...
The generation of the debug grammar removes both the action, and the now empty group around it:
// DebugInternalFoo.g
trickyRule:
(expression '=')=>...
So the lesson learned is: an empty group before a syntactic predicate is not the same as nothing at all.
Is it possible that a grammar parses a text differently with or without actions?
Yes, that is possible. org.antlr.tool.Strip leaves syntactic predicates [1], but removes validating [2] and gated [3] semantic predicates (and member sections that these semantic predicates might use).
For example, the following rules would only match an A_TOKEN:
parser_rule1
: (parser_rule2)=> parser_rule2
;
parser_rule2
: {input.LT(1).getType() == A_TOKEN}? .
;
but if you use the Strip tool on it, it leaves the following:
parser_rule1
: (parser_rule2)=> parser_rule2
;
parser_rule2
: /*{input.LT(1).getType() == A_TOKEN}?*/ .
;
making it match any token.
In other words, Strip could change the behavior of the generated lexer or parser.
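The effect can be mimicked in a few lines of plain Java. This is only a model of the two versions of parser_rule2, not ANTLR code; the class name and the use of strings for token types are assumptions for illustration.

```java
import java.util.function.Predicate;

final class StrippedPredicateSketch {
    // Original rule: {input.LT(1).getType() == A_TOKEN}? .
    static final Predicate<String> ORIGINAL = type -> type.equals("A_TOKEN");

    // After Strip the predicate is commented out, so the wildcard accepts anything.
    static final Predicate<String> STRIPPED = type -> true;

    // Models parser_rule2: match one token, but only if the guard holds.
    static boolean matchesOneToken(String tokenType, Predicate<String> guard) {
        return guard.test(tokenType);
    }

    public static void main(String[] args) {
        System.out.println(matchesOneToken("B_TOKEN", ORIGINAL)); // prints: false
        System.out.println(matchesOneToken("B_TOKEN", STRIPPED)); // prints: true
    }
}
```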
[1] syntactic predicate: ( ... )=>
[2] validating semantic predicate: { ... }?
[3] gated semantic predicate: { ... }?=>