ANTLR extraneous input x expecting x - antlr

I am trying to match a very basic ANTLR grammar. But ANTLR is keep telling me that he got the input '.' and expects '.' .
The full error is:
line 1:0 extraneous input '.' expecting '.'
line 1:2 missing '*' at '<EOF>'
With the grammar:
grammar regex;
#parser::header
{
package antlr;
}
#lexer::header
{
package antlr;
}
WHITESPACE : (' ' | '\t' | '\n' | '\r') -> channel(HIDDEN);
COMP : '.';
KLEENE : '*';
start : COMP KLEENE;
And input:
.*
Both files have the same charset:
regex.g: text/plain; charset=us-ascii
test.grammar: text/plain; charset=us-ascii
There should be no Lexer rule mix up. Why does this not work as expected?

Given your example grammar and this test class:
import org.antlr.v4.runtime.*;
public class Main {
public static void main(String[] args) {
String source = ".*";
regexLexer lexer = new regexLexer(CharStreams.fromString(source));
regexParser parser = new regexParser(new CommonTokenStream(lexer));
System.out.println(parser.start().toStringTree(parser));
}
}
the following is printed to my console:
(start . *)
My guess is you have either dumbed down the grammar too much causing the error in your original grammar to disappear, or you haven't generated new lexer/parser classes.

Related

Getting plain text in antlr instead of tokens

I'm trying to create a parser using antlr. My grammar is as follows.
code : codeBlock* EOF;
codeBlock
: text
| tag1Ops
| tag2Ops
;
tag1Ops: START_1_TAG ID END_2_TAG ;
tag2Ops: START_2_TAG ID END_2_TAG ;
text: ~(START_1_TAG|START_2_TAG)+;
START_1_TAG : '<%' ;
END_1_TAG : '%>' ;
START_2_TAG : '<<';
END_2_TAG : '>>' ;
ID : [A-Za-z_][A-Za-z0-9_]*;
INT_NUMBER: [0-9]+;
WS : ( ' ' | '\n' | '\r' | '\t')+ -> channel(HIDDEN);
SPACES: SPACE+;
ANY_CHAR : .;
fragment SPACE : ' ' | '\r' | '\n' | '\t' ;
Along with various tags, I also need to implement a rule to get text which is not inside any of the tags. Things seem to be working fine with the current grammar, but since the 'text' rules falls to the Lexer side, any text entered is tokenized and I get a list of tokens, instead of a single string token. The antlr profiler in intellij also shows ambiguous calls for each token.
For example, 'Hi Hello, how are you??' needs to be a single token, instead of multiple tokens, which is generated by this grammar.
I think I might be looking at the wrong angle, and would like to know if there is any other way to handle the 'text' rule.
First: you have a WS rule that places space chars on the hidden channel, yet later in the grammar, you have a SPACES rule. Given this SPACES rule is placed after WS and matches exactly the same, the SPACES rule will never be matched.
For example, 'Hi Hello, how are you??' needs to be a single token, instead of multiple tokens, which is generated by this grammar.
You can't do that in your current setup. What you can do is utilise lexical modes. A quick demo:
// Must be in a separate file called DemoLexer.g4
lexer grammar DemoLexer;
START_1_TAG : '<%' -> pushMode(IN_TAG);
START_2_TAG : '<<' -> pushMode(IN_TAG);
TEXT : ( ~[<] | '<' ~[<%] )+;
mode IN_TAG;
ID : [A-Za-z_][A-Za-z0-9_]*;
INT_NUMBER : [0-9]+;
END_1_TAG : '%>' -> popMode;
END_2_TAG : '>>' -> popMode;
SPACE : [ \t\r\n] -> channel(HIDDEN);
To test this lexer grammar, run this class:
import org.antlr.v4.runtime.*;
public class Main {
public static void main(String[] args) {
String source = "<%FOO%>FOO BAR<<123>>456 mu!";
DemoLexer lexer = new DemoLexer(CharStreams.fromString(source));
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
tokenStream.fill();
for (Token t : tokenStream.getTokens()) {
System.out.printf("%-20s %s\n", DemoLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
}
}
which will print:
START_1_TAG <%
ID FOO
END_1_TAG %>
TEXT FOO BAR
START_2_TAG <<
INT_NUMBER 123
END_2_TAG >>
TEXT 456 mu!
EOF <EOF>
Use your lexer grammar in a separate parser grammar like this:
// Must be in a separate file called DemoParser.g4
parser grammar DemoParser;
options {
tokenVocab=DemoLexer;
}
code
: codeBlock* EOF
;
...
EDIT
[...] but I am a bit confused on the TEXT : ( ~[<] | '<' ~[<%] )+; rule. can you elaborate what it does a bit further?
A breakdown of ( ~[<] | '<' ~[<%] )+:
( # start group
~[<] # match any char other than '<'
| # OR
'<' ~[<%] # match a '<' followed by any char other than '<' and '%'
)+ # end group, and repeat it once or more
And, can lexical modes be considered an alternative to semantic predicates?
Sort of. Semantic predicate are much more powerful: you can check whatever you like inside them through plain code. However, a big disadvantage is that you mix target specific code in your grammar, whereas lexical modes work with all targets. So, a rule of thumb is to avoid predicates if possible.

Antlr 4 Lexer with multiple modes failing to tokenise correctly

I'm trying to create a lexer with multiple modes using Antlr 4.7. My lexer currently is:
ACTIONONLY : 'AO';
BELIEFS : ':Initial Beliefs:' -> mode(INITIAL_BELIEFS);
NAME : ':name:';
WORD: ('a'..'z'|'A'..'Z'|'0'..'9'|'_')+;
COMMENT : '/*' .*? '*/' -> skip ;
LINE_COMMENT : '//' ~[\n]* -> skip ;
NEWLINE:'\r'? '\n' -> skip ;
WS : (' '|'\t') -> skip ;
mode INITIAL_BELIEFS;
GOAL_IB : ':Initial Goal:' -> mode(GOALS);
IB_COMMENT : '/*' .*? '*/' -> skip ;
IB_LINE_COMMENT : '//' ~[\n]* -> skip ;
IB_NEWLINE:'\r'? '\n' -> skip ;
IB_WS : (' '|'\t') -> skip ;
BELIEF_BLOCK: ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'('|')'|','|'.')+;
mode REASONING_RULES;
R1: 'a';
R2: 'b';
mode GOALS;
GL_COMMENT : '/*' .*? '*/' -> skip ;
GL_LINE_COMMENT : '//' ~[\n]* -> skip ;
GL_NEWLINE:'\r'? '\n' -> skip ;
GL_WS : (' '|'\t') -> skip ;
GOAL_BLOCK: ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'('|')'|','|'.')+;
Note that there is no way, at present, to get into the REASONING_RULES mode (so this should not, as I understand it have any effect on the operation of the lexer). Obviously I do want to use this mode, but this is the minimal version of the lexer that seems to display the problem I'm having.
My parser is:
grammar ActionOnly;
options { tokenVocab = ActionOnlyLexer; }
// Mas involving ActionOnly Agents
mas : aoagents;
aoagents: ACTIONONLY (aoagent)+;
// Agent stuff
aoagent :
(ACTIONONLY?)
NAME w=WORD
BELIEFS (bs=BELIEF_BLOCK )?
GOAL_IB gs=GOAL_BLOCK;
and I'm trying to parse:
AO
:name: robot
:Initial Beliefs:
abelief
:Initial Goal:
at(4, 2)
This fails with the error
line 35:0 mismatched input 'at(4,' expecting GOAL_BLOCK
which I'm assuming is because it isn't tokenising correctly.
If I omit rule R2 in the REASONING_RULES mode then it parses correctly (in general I seem to be able to have one rule in REASONING_RULES and it will work, but more than one rule and it fails to match GOAL_BLOCK)
I'm really struggling to see what I'm doing wrong here, but this is the first time I've tried to use lexer modes with Antlr.
I don't get that error when I try your grammars. I also tested with ANTLR 4.7.
Here's my test rig:
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.ParserRuleContext;
import org.antlr.v4.runtime.Token;
public class Main {
public static void main(String[] args) {
String source = "AO\n" +
"\n" +
":name: robot\n" +
"\n" +
":Initial Beliefs:\n" +
"\n" +
"abelief\n" +
"\n" +
":Initial Goal:\n" +
"\n" +
"at(4, 2)";
ActionOnlyLexer lexer = new ActionOnlyLexer(CharStreams.fromString(source));
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
System.out.println("[TOKENS]");
for (Token t : tokens.getTokens()) {
System.out.printf(" %-20s %s\n", ActionOnlyLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
System.out.println("\n[PARSE-TREE]");
ActionOnlyParser parser = new ActionOnlyParser(tokens);
ParserRuleContext context = parser.mas();
System.out.println(" "+context.toStringTree(parser));
}
}
And this is printed to my console:
[TOKENS]
ACTIONONLY AO
NAME :name:
WORD robot
BELIEFS :Initial Beliefs:
BELIEF_BLOCK abelief
GOAL_IB :Initial Goal:
GOAL_BLOCK at(4,
GOAL_BLOCK 2)
EOF <EOF>
[PARSE-TREE]
(mas (aoagents AO (aoagent :name: robot :Initial Beliefs: abelief :Initial Goal: at(4,)))
Perhaps you need to generate new lexer/parser classes?
PS. note that ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'('|')'|','|'.')+ can be written as [a-zA-Z0-9_(),.]+

ANTLR4: lexer rule for: Any string as long as it doesn't contain these two side-by-side characters?

Is there any way to express this in ANTLR4:
Any string as long as it doesn't contain the asterisk immediately
followed by a forward slash?
This doesn't work: (~'*/')* as ANTRL throws this error: multi-character literals are not allowed in lexer sets: '*/'
This works but isn't correct: (~[*/])* as it prohibits a string containing the individual character * or /.
I had similar problem, my solution: ( ~'*' | ( '*'+ ~[/*]) )* '*'*.
The closest I can come is to put the test in the parser instead of the lexer. That's not exactly what you're asking for, but it does work.
The trick is to use a semantic predicate before any string that must be tested for any Evil Characters. The actual testing is done in Java.
grammar myTest;
#header
{
import java.util.*;
}
#parser::members
{
boolean hasEvilCharacters(String input)
{
if (input.contains("*/"))
{
return false;
}
else
{
return true;
}
}
}
// Mimics a very simple sentence, such as:
// I am clean.
// I have evil char*/acters.
myTest
: { hasEvilCharacters(_input.LT(1).getText()) }? String
(Space { hasEvilCharacters(_input.LT(1).getText()) }? String)*
Period EOF
;
String
: ('A'..'Z' | 'a'..'z')+
;
Space
: ' '
;
Period
: '.'
;
Tested with ANTLR 4.4 via the TestRig in ANTLRWorks 2 in NetBeans 8.0.1.
If the disallowed sequences are few there exists a solution without parser/lexer actions:
grammar NotParser;
program
: (starslash | notstarslash)+
;
notstarslash
: NOT_STAR_SLASH
;
starslash
: STAR_SLASH
;
STAR_SLASH
: '*'+ '/'
;
NOT_STAR_SLASH
: (F_NOT_STAR_SLASH | F_STAR_NOT_SLASH) +
;
fragment F_NOT_STAR_SLASH
: ~('*'|'/')
;
fragment F_STAR_NOT_SLASH
: '*'+ ~('*'|'/')
| '*'+ EOF
| '/'
;
The idea is to compose the token of
all tokens that are neither '*' nor '/'
all tokens that begin with '*' but are not followed with '/' or single '/'
There are some rules that deal with special situations (multiple '' followed by '/', or trailing '')

ANTLR3 NoViableAltException

This is the part of my grammar that make error:
expr : func_name '(' constant (',' constant)* ')' ;
constant
: '"' (~'"')* '"';
WS : (' '|'\t')+ {skip();} ;
And the error is about this part of text:
"w9ygS99Qp_", "vuPfq6YcbX"
The interpreter of ANTLRWorks give me the next leaves, which have a node constant as parent:
"
w9ygS99Qp_",
"
Then it is an NoViableAltException error.
Normally, it should have this leaves:
"
w9ygS99Qp_
"
Apparently, the problem is the _ before the ", because I tried to suppress the _, but the same error appears when te parser meet the next _"
Your constant should be a lexer rule, not a parser rule. Inside a parser rule, ~'"' matches any token other than a double quote-token. It does not match any charatcer except the double quote-char.
Do it like this instead:
expr : func_name '(' Constant (',' Constant)* ')' ;
Constant
: '"' (~'"')* '"';
OK, thus I have to compile the grammar in Java, and then test it in Java, if I understood well?
Yes, or use ANTLRWorks' debugger instead. The debugger works like a charm.
To test in plain Java, do something like this:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
TLexer lexer = new TLexer(new ANTLRStringStream("name(\"w9ygS99Qp_\", \"vuPfq6YcbX\")"));
TParser parser = new TParser(new CommonTokenStream(lexer));
parser.expr();
}
}
Maybe should I use ANTLR4, to be able to use ANTLRWorks correctly?
If you have a choice to use either ANTLR3 or ANTLR4, then go for ANTLR4. Note that there's a new (rewritten) version of ANTLRWorks for ANTLR4 grammars: http://tunnelvisionlabs.com/products/demo/antlrworks

Using antlr to parse a | separated file

So I think this should be easy, but I'm having a tough time with it. I'm trying to parse a | delimited file, and any line that doesn't start with a | is a comment. I guess I don't understand how comments work. It always errors out on a comment line. This is a legacy file, so there's no changing it. Here's my grammar.
grammar Route;
#header {
package org.benheath.codegeneration;
}
#lexer::header {
package org.benheath.codegeneration;
}
file: line+;
line: route+ '\n';
route: ('|' elt) {System.out.println("element: [" + $elt.text + "]");} ;
elt: (ELEMENT)*;
COMMENT: ~'|' .* '\n' ;
ELEMENT: ('a'..'z'|'A'..'Z'|'0'..'9'|'*'|'_'|'#'|'#') ;
WS: (' '|'\t') {$channel=HIDDEN;} ; // ignore whitespace
Data:
! a comment
Another comment
| a | abc | b | def | ...
A grammar for that would look like this:
parse
: line* EOF
;
line
: ( comment | values ) ( NL | EOF )
;
comment
: ELEMENT+
;
values
: PIPE ( ELEMENT PIPE )+
;
PIPE
: '|'
;
ELEMENT
: ('a'..'z')+
;
NL
: '\r'? '\n' | '\r'
;
WS
: (' '|'\t') {$channel=HIDDEN;}
;
And to test it, you just need to sprinkle a bit of code in your grammar like this:
grammar Route;
#members {
List<List<String>> values = new ArrayList<List<String>>();
}
parse
: line* EOF
;
line
: ( comment | v=values {values.add($v.line);} ) ( NL | EOF )
;
comment
: ELEMENT+
;
values returns [List<String> line]
#init {line = new ArrayList<String>();}
: PIPE ( e=ELEMENT {line.add($e.text);} PIPE )*
;
PIPE
: '|'
;
ELEMENT
: ('a'..'z')+
;
NL
: '\r'? '\n' | '\r'
;
WS
: (' '|'\t') {$channel=HIDDEN;}
;
Now generate a lexer/parser by invoking:
java -cp antlr-3.2.jar org.antlr.Tool Route.g
create a class RouteTest.java:
import org.antlr.runtime.*;
import java.util.List;
public class RouteTest {
public static void main(String[] args) throws Exception {
String data =
"a comment\n"+
"| xxxxx | y | zzz |\n"+
"another comment\n"+
"| a | abc | b | def |";
ANTLRStringStream in = new ANTLRStringStream(data);
RouteLexer lexer = new RouteLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
RouteParser parser = new RouteParser(tokens);
parser.parse();
for(List<String> line : parser.values) {
System.out.println(line);
}
}
}
Compile all source files:
javac -cp antlr-3.2.jar *.java
and run the class RouteTest:
// Windows
java -cp .;antlr-3.2.jar RouteTest
// *nix/MacOS
java -cp .:antlr-3.2.jar RouteTest
If all goes well, you see this printed to your console:
[xxxxx, y, zzz]
[a, abc, b, def]
Edit: note that I simplified it a bit by only allowing lower case letters, you can always expand the set of course.
It's a nice idea to use ANTLR for a job like this, although I do think it's overkill. For example, it would be very easy to (in pseudo-code):
for each line in file:
if line begins with '|':
fields = /|\s*([a-z]+)\s*/g
Edit: Well, you can't express the distinction between comments and lines lexically, because there is nothing lexical that distinguishes them. A hint to get you in one workable direction.
line: comment | fields;
comment: NONBAR+ (BAR|NONBAR+) '\n';
fields = (BAR NONBAR)+;
This seems to work, I swear I tried it. Changing comment to lower case switched it to the parser vs the lexer, I still don't get it.
grammar Route;
#header {
package org.benheath.codegeneration;
}
#lexer::header {
package org.benheath.codegeneration;
}
file: (line|comment)+;
line: route+ '\n';
route: ('|' elt) {System.out.println("element: [" + $elt.text + "]");} ;
elt: (ELEMENT)*;
comment : ~'|' .* '\n';
ELEMENT: ('a'..'z'|'A'..'Z'|'0'..'9'|'*'|'_'|'#'|'#') ;
WS: (' '|'\t') {$channel=HIDDEN;} ; // ignore whitespace