How to match a string, but case-insensitively? - antlr

Let's say that I want to match "beer", but don't care about case sensitivity.
Currently I am defining a token to be ('b'|'B' 'e'|'E' 'e'|'E' 'r'|'R') but I have a lot of such and don't really want to handle 'verilythisisaverylongtokenindeedomyyesitis'.
The antlr wiki seems to suggest that it can't be done (in antlr) ... but I just wondered if anyone had some clever tricks ...

I would like to add to the accepted answer: a ready -made set can be found at case insensitive antlr building blocks, and the relevant portion included below for convenience
fragment A:[aA];
fragment B:[bB];
fragment C:[cC];
fragment D:[dD];
fragment E:[eE];
fragment F:[fF];
fragment G:[gG];
fragment H:[hH];
fragment I:[iI];
fragment J:[jJ];
fragment K:[kK];
fragment L:[lL];
fragment M:[mM];
fragment N:[nN];
fragment O:[oO];
fragment P:[pP];
fragment Q:[qQ];
fragment R:[rR];
fragment S:[sS];
fragment T:[tT];
fragment U:[uU];
fragment V:[vV];
fragment W:[wW];
fragment X:[xX];
fragment Y:[yY];
fragment Z:[zZ];
So an example is
HELLOWORLD : H E L L O W O R L D;

How about define a lexer token for each permissible identifier character, then construct the parser token as a series of those?
beer: B E E R;
A : 'A'|'a';
B: 'B'|'b';
etc.

A case-insensitive option was just added to ANTLR
options { caseInsensitive = true; }
https://github.com/antlr/antlr4/commit/7bc825776357a0e6e7fc399bb0841d570a7e824b
The old links are now broken, these should continue to work.
Case-Insensitive Lexing
CaseChangingCharStream
CaseChangingCharStream.java

Define case-insensitive tokens with
BEER: [Bb] [Ee] [Ee] [Rr];

New documentation page has appeared in ANTLR GitHub repo: Case-Insensitive Lexing. You can use two approaches:
The one described in #javadba's answer
Or add a character stream to your code, which will transform an input stream to lower or upper case. Examples for the main languages you can find on the same doc page.
My opinion, it's better to use the first approach and have the grammar which describes all the rules. But if you use well-known grammar, for example from Grammars written for ANTLR v4, then second approach may be more appropriate.

A solution I used in C#: use ASCII code to shift character to smaller case.
class CaseInsensitiveStream : Antlr4.Runtime.AntlrInputStream {
public CaseInsensitiveStream(string sExpr)
: base(sExpr) {
}
public override int La(int index) {
if(index == 0) return 0;
if(index < 0) index++;
int pdx = p + index - 1;
if(pdx < 0 || pdx >= n) return TokenConstants.Eof;
var x1 = data[pdx];
return (x1 >= 65 && x1 <= 90) ? (97 + x1 - 65) : x1;
}
}

Related

No way to implement a q quoted string with custom delimiters in Antlr4

I'm trying to implement a lexer rule for an oracle Q quoted string mechanism where we have something like q'$some string$'
Here you can have any character in place of $ other than whitespace, (, {, [, <, but the string must start and end with the same character. Some examples of accepted tokens would be:
q'!some string!'
q'ssome strings'
Notice how s is the custom delimiter but it is fine to have that in the string as well because we would only end at s'
Here's how I was trying to implement the rule:
Q_QUOTED_LITERAL: Q_QUOTED_LITERAL_NON_TERMINATED . QUOTE-> type(QUOTED_LITERAL);
Q_QUOTED_LITERAL_NON_TERMINATED:
Q QUOTE ~[ ({[<'"\t\n\r] { setDelimChar( (char)_input.LA(-1) ); }
( . { !isValidEndDelimChar() }? )*
;
I have already checked the value I get from !isValidEndDelimChar() and I'm getting a false predicate here at the right place so everything should work, but antlr simply ignores this predicate. I've also tried moving the predicate around, putting that part in a separate rule, and a bunch of other stuff, after a day and a half of research on the same I'm finally raising this issue.
I have also tried to implement it in other ways but there doesn't seem to be a way to implement a custom char delimited string in antlr4 (The antlr3 version used to work).
Not sure why the { ... } action isn't invoked, but it's not needed. The following grammar worked for me (put the predicate in front of the .!):
grammar Test;
#lexer::members {
boolean isValidEndDelimChar() {
return (_input.LA(1) == getText().charAt(2)) && (_input.LA(2) == '\'');
}
}
parse
: .*? EOF
;
Q_QUOTED_LITERAL
: 'q\'' ~[ ({[<'"\t\n\r] ( {!isValidEndDelimChar()}? . )* . '\''
;
SPACE
: [ \t\f\r\n] -> skip
;
If you run the class:
import org.antlr.v4.runtime.*;
public class Main {
public static void main(String[] args) {
Lexer lexer = new TestLexer(CharStreams.fromString("q'ssome strings' q'!foo!'"));
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
for (Token t : tokens.getTokens()) {
System.out.printf("%-20s %s\n", TestLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
}
}
the following output will be printed:
Q_QUOTED_LITERAL q'ssome strings'
Q_QUOTED_LITERAL q'!foo!'
EOF <EOF>

Capturing content which can start with Parser keywords in Xtext

The following is the simplified version of my actual grammar :-
grammar org.hello.World
import "http://www.eclipse.org/emf/2002/Ecore" as ecore
generate world "http://www.hello.org/World"
Model:
content=AnyContent greetings+=Greeting*;
AnyContent:
(ID | ANY_OTHER)*
;
Greeting:
'<hello>' name=ID '</hello>';
terminal ID:
('a'..'z'|'A'..'Z')+
;
terminal ANY_OTHER:
.
;
So using above grammar if my input is like :-
<hi><hello>world</hello>
Then I am getting an syntax error saying that mismatched character 'i' expecting 'e' at Column 2 .
My requirement is that AnyContent should match "<hi>" , can anyone guide me about how to achieve that?
If you want to make it with Xtext. I advice you to split your problem. You first problem is syntaxic, you need to parser your file. The second problem is semantic, you want to give a "sense" to your objets and tell who is the container. Define the container and the containment for XML can't be done inside your grammar.
Make a custom Ecore and make an easy grammar, with start and end tag. You don't really care about the name of your tag.
Example :
Model returns XmlFile: (StartTag|EndTag|Text)+;
Text returns Text: text=STRING;
StartTag returns StartTag: '<' name=ID '>';
EndTag returns EndTag: '</' name=ID '>';
Change the TokenSource. The token source will give the token to your Parser. You can override the nature of your token, merge or split them.
The idea here is to merge all token outside the between of ">" and "</".
This token represent a Text, so you can create a single token for all elements containing between this elements. Example :
class CustomTokenSource extends XtextTokenStream{
new(TokenSource tokenSource, ITokenDefProvider tokenDefProvider) {
super(tokenSource,tokenDefProvider)
}
override LT(int k) {
var Token token = super.LT(k)
if(token != null && token.text != null) token.tokenOverride(k);
token
}
In this example you need to add your custom code on the method "tokenOverride".
Add your custom token source on your parser :
class XDSLParser extends DSLParser{
override protected XtextTokenStream createTokenStream(TokenSource tokenSource) {
return new CustomTokenSource(tokenSource, getTokenDefProvider());
}
}
Compute the containement : the containment of your elements can be compute after the parsing. After it, you can get your model and change it as you will. To make it, you need to override the method "doParse" of your Parser "XDSLParser" as follow :
override protected IParseResult doParse(String ruleName, CharStream in, NodeModelBuilder nodeModelBuilder, int initialLookAhead) {
var IParseResult result = super.doParse( ruleName, in, nodeModelBuilder, initialLookAhead)
//Give you model
result.rootASTElement;
return result
}
Note : The model that you obtain after the parsing will be flat. The xmlFile Object will contain all the elements in the good order. You need to write an algorithm to build the containement on your AST model.
This will require a lot of tweaking in the grammar due to the nature of the antlr lexer that is used by Xtext. The lexer will not roll back for the keyword <hello>: As soon as it sees a < followed by an h it'll try consume the hello-token. Something along these lines could work though:
Model:
content=AnyContent greetings+=Greeting*;
AnyContent:
(ID | ANY_OTHER | '<' (ID | ANY_OTHER | '/' | '>') | '/' | '>' | 'hello')*
;
Greeting:
'<' 'hello '>' name=ID '<' '/' 'hello' '>';
terminal ID:
('a'..'z'|'A'..'Z')+
;
terminal ANY_OTHER:
.
;
The approach won't scale for real world grammars but maybe it helps to get on the some working track.

antlr rule boolean parameter showing up in syntactic predicate code one level higher, causing compilation errors

I have a grammar that can parse expressions like 1+2-4 or 1+x-y, creating an appropriate structure on the fly which later, given a Map<String, Integer> with appropriate content, can be evaluated numerically (after parsing is complete, i.e. for x or y only known later).
Inside the grammar, there are also places where an expression that can be evaluated on the spot, i.e. does not contain variables, should occur. I figured I could parse these with the same logic, adding a boolean parameter variablesAllowed to the rule, like so:
grammar MiniExprParser;
INT : ('0'..'9')+;
ID : ('a'..'z'| 'A'..'Z')('a'..'z'| 'A'..'Z'| '0'..'9')*;
PLUS : '+';
MINUS : '-';
numexpr returns [Double val]:
expr[false] {$val = /* some evaluation code */ 0.;};
varexpr /*...*/:
expr[true] {/*...*/};
expr[boolean varsAllowed] /*...*/:
e=atomNode[varsAllowed] {/*...*/}
(PLUS e2=atomNode[varsAllowed] {/*...*/}
|MINUS e2=atomNode[varsAllowed] {/*...*/}
)* ;
atomNode[boolean varsAllowed] /*...*/:
(n=INT {/*...*/})
|{varsAllowed}?=> ID {/*...*/}
;
result:
(numexpr) => numexpr {System.out.println("Numeric result: " + $numexpr.val);}
|varexpr {System.out.println("Variable expression: " + $varexpr.text);};
However, the generated Java code does not compile. In the part apparently responsible for the final rule's syntactic predicate, varsAllowed occurs even although the variable is never defined at this level.
/* ... */
else if ( (LA3_0==ID) && ((varsAllowed))) {
int LA3_2 = input.LA(2);
if ( ((synpred1_MiniExprParser()&&(varsAllowed))) ) {
alt3=1;
}
else if ( ((varsAllowed)) ) {
alt3=2;
}
/* ... */
Am I using it wrong? (I am using Eclipse' AntlrIDE 2.1.2 with Antlr 3.5.2.)
This problem is part of the hoisting process the parser uses for prediction. I encountered the same problem and ended up with a member var (or static var for the C target) instead of a parameter.

How can I build an ANTLR Works style parse tree?

I've read that you need to use the '^' and '!' operators in order to build a parse tree similar to the ones displayed in ANTLR Works (even though you don't need to use them to get a nice tree in ANTLR Works). My question then is how can I build such a tree? I've seen a few pages on tree construction using the two operators and rewrites, and yet say I have an input string abc abc123 and a grammar:
grammar test;
program : idList;
idList : id* ;
id : ID ;
ID : LETTER (LETTER | NUMBER)* ;
LETTER : 'a' .. 'z' | 'A' .. 'Z' ;
NUMBER : '0' .. '9' ;
ANTLR Works will output:
What I dont understand is how you can get the 'idList' node on top of this tree (as well as the grammar one as a matter of fact). How can I reproduce this tree using rewrites and those operators?
What I dont understand is how you can get the 'idList' node on top of this tree (as well as the grammar one as a matter of fact). How can I reproduce this tree using rewrites and those operators?
You can't use ^ and ! alone. These operators only operate on existing tokens, while you want to create extra tokens (and make these the root of your sub trees). You can do that using rewrite rules and defining some imaginary tokens.
A quick demo:
grammar test;
options {
output=AST;
ASTLabelType=CommonTree;
}
tokens {
IdList;
Id;
}
#parser::members {
private static void walk(CommonTree tree, int indent) {
if(tree == null) return;
for(int i = 0; i < indent; i++, System.out.print(" "));
System.out.println(tree.getText());
for(int i = 0; i < tree.getChildCount(); i++) {
walk((CommonTree)tree.getChild(i), indent + 1);
}
}
public static void main(String[] args) throws Exception {
testLexer lexer = new testLexer(new ANTLRStringStream("abc abc123"));
testParser parser = new testParser(new CommonTokenStream(lexer));
walk((CommonTree)parser.program().getTree(), 0);
}
}
program : idList EOF -> idList;
idList : id* -> ^(IdList id*);
id : ID -> ^(Id ID);
ID : LETTER (LETTER | DIGIT)*;
SPACE : ' ' {skip();};
fragment LETTER : 'a' .. 'z' | 'A' .. 'Z';
fragment DIGIT : '0' .. '9';
If you run the demo above, you will see the following being printed to the console:
IdList
Id
abc
Id
abc123
As you can see, imaginary tokens must also start with an upper case letter, just like lexer rules. If you want to give the imaginary tokens the same text as the parser rule they represent, do something like this instead:
idList : id* -> ^(IdList["idList"] id*);
id : ID -> ^(Id["id"] ID);
which will print:
idList
id
abc
id
abc123

ANTLR Grammar gives me upsidedown-tree

I have a grammar which parses dot notion expressions like this:
a.b.c
memberExpression returns [Expression value]
: i=ID { $value = ParameterExpression($i.value); }
('.' m=memberExpression { $value = MemberExpression($m.value, $i.value); }
)*
;
This parses expressions fine and gives me a tree structure like this:
MemberExpression(
MemberExpression(
ParameterExpression("c"),
"b"
)
, "a"
)
But my problem is that I want a tree that looks like this:
MemberExpression(
MemberExpression(
ParameterExpression("a"),
"b"
)
, "c"
)
for the same expression "a.b.c"
How can I achieve this?
You could do this by collecting all tokens in a java.util.List using ANTLR's convenience += operator and create the desired tree using a custom method in your #parser::members section:
// grammar def ...
// options ...
#parser::members {
private Expression customTree(List tks) {
// `tks` is a java.util.List containing `CommonToken` objects
}
}
// parser ...
memberExpression returns [Expression value]
: ids+=ID ('.' ids+=ID)* { $value = customTree($ids); }
;
I think what you are asking for is mutually left recursive, and therefore ANTLR is not a good choice to parse it.
To elaborate, you need C at the root of the tree and therefore your rule would be:
rule: rule ID;
This rule will be uncertain whether it should match
a.b
or
a.b.c