How to generate XML directly within the parser using Java System.out.println - antlr

The input file that I want to parse consists of one or more rows and each row consists of a greeting (Hello or Greeting) followed by a person's first name. Here is a sample input:
Hello Roger
Greeting Sally
I want to create a parser that outputs XML. For the sample input I want the parser to generate this XML:
<messages>
<message>
<greeting>Hello</greeting>
<person>Roger</person>
</message>
<message>
<greeting>Greeting</greeting>
<person>Sally</person>
</message>
</messages>
I want the XML generated directly within the parser file (MyParser.g4) using Java System.out.println
Here is my lexer:
lexer grammar MyLexer;
GREETING : ('Hello' | 'Greeting') ;
ID : [a-zA-Z]+ ;
WS : [ \t]+ -> skip ;
EOL : [\n] ;
Here is my parser:
parser grammar MyParser;
options { tokenVocab=MyLexer; }
document: (message+ {System.out.println("<messages>" + $message.value + "</messages>");});
message returns [String value]: (GREETING ID {value = "<message><greeting>" + $GREETING.text + "</greeting><name>" + $ID.text + "</name></message>";}) ;
I ran ANTLR on the lexer and parser and then compiled the Java code that ANTLR generated. This resulted in the following error message.
MyParser.java:154: error: cannot find symbol
value = "<message><greeting>" + (((MessageContext)_localctx).GREETING!=null?((MessageContext)_localctx).GREETING.getText():null) + "</greeting><name>" + (((MessageContext)_localctx).ID!=null?((MessageContext)_localctx).ID.getText():null) + "</name></message>";
^
symbol: variable value
location: class MyParser
What am I doing wrong, please?

You forgot the $ before value. It must be: $value = "<message><greeting>" + ….
You also want to print every message, so not:
message+ {System.out.println( … );}
which will print just once, but like this instead:
(message {System.out.println( … );})+
This ought to do it:
parser grammar MyParser;
options { tokenVocab=MyLexer; }
document
: {System.out.println("<messages>");}
( message EOL? {
System.out.println(" " + $message.value);
})+
{System.out.println("</messages>");}
EOF
;
message returns [String value]
: GREETING ID {
$value = "<message><greeting>" + $GREETING.text + "</greeting><name>" + $ID.text + "</name></message>";
}
;
It can be tested like this:
String source = "Hello Roger\nGreeting Sally";
MyLexer lexer = new MyLexer(CharStreams.fromString(source));
MyParser parser = new MyParser(new CommonTokenStream(lexer));
parser.document();
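To have something to compare the parser's output against, here is a rough plain-Java sketch that mimics what the two rules print, without any ANTLR dependency. The class and method names (GreetingXml, message, document) are made up for illustration, and unlike the lexer it does not check that the first word is Hello or Greeting:

```java
import java.util.ArrayList;
import java.util.List;

public class GreetingXml {
    // Mirrors the action in the `message` rule: greeting text followed by the name.
    static String message(String greeting, String name) {
        return "<message><greeting>" + greeting + "</greeting><name>" + name + "</name></message>";
    }

    // Mirrors the `document` rule: opening tag, one line per message, closing tag.
    static List<String> document(String source) {
        List<String> out = new ArrayList<>();
        out.add("<messages>");
        for (String row : source.split("\n")) {
            String[] parts = row.trim().split("[ \t]+");
            if (parts.length != 2) {
                continue; // this sketch silently skips blank or malformed rows
            }
            out.add(" " + message(parts[0], parts[1]));
        }
        out.add("</messages>");
        return out;
    }

    public static void main(String[] args) {
        for (String line : document("Hello Roger\nGreeting Sally")) {
            System.out.println(line);
        }
    }
}
```

Running it should print the <messages> line, one indented <message> line per input row, and the closing </messages> line, which is the shape the grammar above is expected to produce.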

How to get the line of a token in parse rules?

I have searched everywhere and can't find a solution. I am new to ANTLR and for an assignment, I need to print out (using similar syntax that I have below) an error message when my parser comes across an unidentified token with the line number and token. The documentation for antlr4 says line is an attribute for Token objects that gives "[t]he line number on which the token occurs, counting from 1; translates to a call to getLine. Example: $ID.line."
I attempted to implement this in the following chunk of code:
not_valid : not_digit { System.out.println("Line " + $not_digit.line + " Contains Unrecognized Token " + $not_digit.text); };
not_digit : ~( DIGIT );
But I keep getting the error
unknown attribute line for rule not_digit in $not_digit.line
My first thought was that I was applying an attribute for a lexer token to a parser rule, because the documentation splits Token and Rule attributes into two different tables. So I then changed the code to:
not_valid : NOT_DIGIT { System.out.println("Line " + $NOT_DIGIT.line + " Contains Unrecognized Token " + $NOT_DIGIT.text); };
NOT_DIGIT : ~ ( DIGIT ) ;
and also
not_valid : NOT_DIGIT { System.out.println("Line " + $NOT_DIGIT.line + " Contains Unrecognized Token " + $NOT_DIGIT.text); };
NOT_DIGIT : ~DIGIT ;
like the example in the documentation, but both times I got the error
rule reference DIGIT is not currently supported in a set
I'm not sure what I am missing. All my searches turn up how to do this in Java outside of the parser, but I need to work in the action code (I think that's what it is called) in the parser.
A block like { ... } is called an action. You embed target-specific code in it. So if you're working with Java, you need to write Java between { and }.
A quick demo:
grammar T;
parse
: not_valid EOF
;
not_valid
: r=not_digit
{
System.out.printf("line=%s, charPositionInLine=%s, text=%s\n",
$r.start.getLine(),
$r.start.getCharPositionInLine(),
$r.start.getText()
);
}
;
not_digit
: ~DIGIT
;
DIGIT
: [0-9]
;
OTHER
: ~[0-9]
;
Test it with the code:
String source = "a";
TLexer lexer = new TLexer(CharStreams.fromString(source));
TParser parser = new TParser(new CommonTokenStream(lexer));
parser.parse();
which will print:
line=1, charPositionInLine=0, text=a

ANTLR3 Tree Grammar: loop did not match anything at input 'EOF'

I am trying to create my first ANTLR3 tree grammar, but I keep hitting the same problem. The output of the parser is:
$ echo 'foo, bar' | ./run.sh
foo bar
TreeGrammar.g: node from line 0:0 required (...)+ loop did not match anything at input 'EOF'
Exception in thread "main" java.lang.NullPointerException
at Driver.main(Driver.java:29)
The output clearly shows that the stage-1 parser results in the right tokens ('foo' and 'bar'). Somehow the stage-2 tree parser refuses to parse the results from stage 1. Since the code is very basic, it must be some simple, dumb oversight on my part ;-)
Here's my simple test code:
Grammar.g:
grammar Grammar;
options {
output = AST;
}
statement: word (','! word)* EOF!;
word: ID;
ID: ('a'..'z'|'A'..'Z')+;
WS: (' ' | '\t' | '\n' | '\r')+ { $channel = HIDDEN; } ;
TreeGrammar.g:
tree grammar TreeGrammar;
options {
tokenVocab = Grammar;
ASTLabelType = CommonTree;
output = template;
}
statement: word+;
word: ID;
Driver.java:
import java.io.*;
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Driver {
public static void main(String[] args) throws Exception {
FileReader groupFileR = new FileReader("Template.stg" );
StringTemplateGroup templates = new StringTemplateGroup(groupFileR);
groupFileR.close();
ANTLRInputStream input = new ANTLRInputStream(System.in);
GrammarLexer lexer = new GrammarLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
GrammarParser parser = new GrammarParser(tokens);
GrammarParser.statement_return result = parser.statement();
CommonTree t = (CommonTree)result.getTree();
System.out.println(t.toStringTree());
CommonTreeNodeStream nodes = new CommonTreeNodeStream(t);
nodes.setTokenStream(tokens);
TreeGrammar walker = new TreeGrammar(nodes);
walker.setTemplateLib(templates);
walker.statement();
TreeGrammar.statement_return r2 = walker.statement();
StringTemplate output = (StringTemplate) r2.getTemplate();
System.out.println(output.toString());
}
}
Assuming your StringTemplate group is properly formed, your problem is most probably the fact that you walk your AST twice:
walker.statement();
TreeGrammar.statement_return r2 = walker.statement();
I.e., you call walker.statement() twice. This is what the (first) error is telling you:
TreeGrammar.g: node from line 0:0 required (...)+ loop did not match anything at input 'EOF'
The first walker.statement() call consumes the input, leaving the node stream at its end (EOF). The second call then expects to walk word+ again, yet there is only an EOF left.
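The effect can be reproduced without ANTLR by treating the node stream as an ordinary iterator. The sketch below is only an analogy: WalkTwice and its statement method are invented stand-ins for the generated tree walker, and it throws an exception where, as far as I know, ANTLR's generated walker would report the error and carry on:

```java
import java.util.Arrays;
import java.util.Iterator;

public class WalkTwice {
    // Stand-in for the tree rule `statement: word+;`: it needs at least one word.
    static int statement(Iterator<String> nodes) {
        int count = 0;
        while (nodes.hasNext()) {
            nodes.next();
            count++;
        }
        if (count == 0) {
            throw new IllegalStateException(
                "required (...)+ loop did not match anything at input 'EOF'");
        }
        return count;
    }

    public static void main(String[] args) {
        Iterator<String> nodes = Arrays.asList("foo", "bar").iterator();
        System.out.println(statement(nodes)); // the first walk consumes both words
        try {
            statement(nodes); // the second walk starts at the end of the stream
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Removing the first, unused call in Driver.java and keeping only the one whose result is assigned to r2 avoids the second walk.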

Antlr syntactic predicate no matching

I have the following grammar:
rule : (PATH)=> (PATH) SLASH WORD
{System.out.println("file: " + $WORD.text + " path: " + $PATH.text);};
WORD : ('a'..'z')+;
SLASH : '/';
PATH : (WORD SLASH)* WORD;
but it does not work for a string like "a/b/c/filename".
I thought I could solve this "path"-problem with the syntactic predicate feature. Maybe I am doing something wrong here and I have to redefine the grammar. Any suggestion for this problem?
You must understand that a syntactic predicate will not cause the parser to give the lexer some sort of direction w.r.t. what token the parser would "like" to retrieve. A syntactic predicate is used to force the parser to look ahead in an existing token stream to resolve ambiguities (emphasis on 'existing': the parser has no control over which tokens are created!).
The lexer operates independently from the parser, creating tokens in a systematic way:
it tries to match as many characters as possible;
whenever 2 (or more) rules match the same number of characters, the rule defined first will get precedence over the rule(s) defined later.
So in your case, given the input "a/b/c/filename", the lexer will greedily match the entire input as a single PATH token.
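The two rules above can be modelled in a few lines of plain Java. This is a simplified sketch built on java.util.regex, not how ANTLR's generated lexer works internally (that one decides through lookahead and may behave differently in edge cases); the MaximalMunch class and the regex translations of WORD, SLASH and PATH are my own:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MaximalMunch {
    // Rules in definition order, translated by hand from the grammar in the question.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("WORD", Pattern.compile("[a-z]+"));
        RULES.put("SLASH", Pattern.compile("/"));
        RULES.put("PATH", Pattern.compile("(?:[a-z]+/)*[a-z]+"));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestLen = 0;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // A strictly longer match wins; on a tie the earlier rule is kept.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestRule = rule.getKey();
                    bestLen = m.end() - pos;
                }
            }
            if (bestRule == null) {
                throw new IllegalArgumentException("no rule matches at index " + pos);
            }
            tokens.add(bestRule + ":" + input.substring(pos, pos + bestLen));
            pos += bestLen;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a/b/c/filename")); // prints [PATH:a/b/c/filename]
        System.out.println(tokenize("filename"));       // prints [WORD:filename]
    }
}
```

In this model the parser never gets to see separate WORD and SLASH tokens for the input from the question, which is why the predicate has nothing to choose between.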
If you want to get the file name, either retrieve it from the PATH:
rule : PATH
{
String file = $PATH.text.substring($PATH.text.lastIndexOf('/') + 1);
System.out.println("file: " + file + ", path: " + $PATH.text);
}
;
WORD : ('a'..'z')+;
SLASH : '/';
PATH : (WORD SLASH)* WORD;
or create a parser rule that matches a path:
rule : dir WORD
{
System.out.println("file: " + $WORD.text + ", dir: " + $dir.text);
}
;
dir : (WORD SLASH)+;
WORD : ('a'..'z')+;
SLASH : '/';

antlr3 - Generating a Parse Tree

I'm having trouble figuring out the antlr3 API so I can generate and use a parse tree in some javascript code. When I open the grammar file using antlrWorks (their IDE), the interpreter is able to show me the parse tree, and it's even correct.
I'm having a lot of difficulties tracking down resources on how to get this parse tree in my code using the antlr3 runtime. I've been messing around with the various functions in the runtime and Parser files but to no avail:
var input = "(PR=5000)",
cstream = new org.antlr.runtime.ANTLRStringStream(input),
lexer = new TLexer(cstream),
tstream = new org.antlr.runtime.CommonTokenStream(lexer),
parser = new TParser(tstream);
var tree = parser.query().tree;
var nodeStream = new org.antlr.runtime.tree.CommonTreeNodeStream(tree);
nodeStream.setTokenStream(tstream);
parseTree = new org.antlr.runtime.tree.TreeParser(nodeStream);
Since antlrWorks can display the parse tree without any tree grammar from myself, and since I have read that antlr automatically generates a parse tree from the grammar file, I'm assuming that I can access this basic parse tree with some runtime functions that I am probably not aware of. Am I correct in this thinking?
HugeAntlrs wrote:
Since antlrWorks can display the parse tree without any tree grammar from myself, and since I have read that antlr automatically generates a parse tree from the grammar file, I'm assuming that I can access this basic parse tree with some runtime functions that I am probably not aware of. Am I correct in this thinking?
No, that is incorrect. ANTLR creates a flat, 1 dimensional stream of tokens.
ANTLRWorks creates its own parse tree on the fly when interpreting some source. You have no access to this tree (not with JavaScript or even with Java). You will have to define the tokens that you think should be the roots of your (sub)trees and/or define the tokens that need to be removed from your AST. Check out the following Q&A that explains how to create a proper AST: How to output the AST built using ANTLR?
EDIT
Since there's no proper JavaScript demo on SO yet, here's a quick demo.
The following grammar parses boolean expressions with the following operators:
or
and
is
not
where not has the highest precedence.
Of course there are true and false, and the expressions can be grouped using parentheses.
file: Exp.g
grammar Exp;
options {
output=AST;
language=JavaScript;
}
parse
: exp EOF -> exp
;
exp
: orExp
;
orExp
: andExp (OR^ andExp)*
;
andExp
: eqExp (AND^ eqExp)*
;
eqExp
: unaryExp (IS^ unaryExp)*
;
unaryExp
: NOT atom -> ^(NOT atom)
| atom
;
atom
: TRUE
| FALSE
| '(' exp ')' -> exp
;
OR : 'or' ;
AND : 'and' ;
IS : 'is' ;
NOT : 'not' ;
TRUE : 'true' ;
FALSE : 'false' ;
SPACE : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;} ;
The grammar above produces an AST which can be fed to the tree-walker below:
file: ExpWalker.g
tree grammar ExpWalker;
options {
tokenVocab=Exp;
ASTLabelType=CommonTree;
language=JavaScript;
}
// `walk` returns a string
walk returns [expr]
: exp {expr = ($exp.expr == 1) ? 'True' : 'False';}
;
// `exp` returns either 1 (true) or 0 (false)
exp returns [expr]
: ^(OR a=exp b=exp) {expr = ($a.expr == 1 || $b.expr == 1) ? 1 : 0;}
| ^(AND a=exp b=exp) {expr = ($a.expr == 1 && $b.expr == 1) ? 1 : 0;}
| ^(IS a=exp b=exp) {expr = ($a.expr == $b.expr) ? 1 : 0;}
| ^(NOT a=exp) {expr = ($a.expr == 1) ? 0 : 1;}
| TRUE {expr = 1;}
| FALSE {expr = 0;}
;
(apologies for the messy JavaScript code inside { ... }: I have very little experience with JavaScript!)
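For readers more at home in Java, the evaluation that the tree walker performs can be pictured as a recursive function over AST nodes. This is only a sketch of the idea: the Node class below is a hypothetical stand-in rather than ANTLR's CommonTree, and the tree is built by hand instead of by the parser:

```java
import java.util.Arrays;
import java.util.List;

public class ExpEval {
    // Hypothetical AST node: the token text plus its child nodes.
    static final class Node {
        final String token;
        final List<Node> children;

        Node(String token, Node... children) {
            this.token = token;
            this.children = Arrays.asList(children);
        }
    }

    // One case per alternative of the `exp` rule in the tree grammar.
    static boolean eval(Node n) {
        switch (n.token) {
            case "or":    return eval(n.children.get(0)) || eval(n.children.get(1));
            case "and":   return eval(n.children.get(0)) && eval(n.children.get(1));
            case "is":    return eval(n.children.get(0)) == eval(n.children.get(1));
            case "not":   return !eval(n.children.get(0));
            case "true":  return true;
            case "false": return false;
            default:
                throw new IllegalArgumentException("unexpected token: " + n.token);
        }
    }

    public static void main(String[] args) {
        // Hand-built tree for: not (true and false) is true
        Node tree = new Node("is",
            new Node("not", new Node("and", new Node("true"), new Node("false"))),
            new Node("true"));
        System.out.println(eval(tree) ? "True" : "False"); // prints True
    }
}
```

Each case corresponds to one ^(...) pattern in the walker, with the operator token as the root and its operands as children.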
Now download ANTLR 3.3 (no earlier version!) and the JavaScript runtime files:
http://www.antlr.org/download/antlr-3.3-complete.jar
http://www.antlr.org/download/antlr-javascript-runtime-3.1.zip
Rename antlr-3.3-complete.jar to antlr-3.3.jar and unzip antlr-javascript-runtime-3.1.zip and store all files in the same folder as your Exp.g and ExpWalker.g files.
Now generate the lexer, parser and tree-walker:
java -cp antlr-3.3.jar org.antlr.Tool Exp.g
java -cp antlr-3.3.jar org.antlr.Tool ExpWalker.g
And test it all with the following html file:
<html>
<head>
<script type="text/javascript" src="antlr3-all-min.js"></script>
<script type="text/javascript" src="ExpLexer.js"></script>
<script type="text/javascript" src="ExpParser.js"></script>
<script type="text/javascript" src="ExpWalker.js"></script>
<script type="text/javascript">
function init() {
var evalButton = document.getElementById("eval");
evalButton.onclick = evalExpression;
}
function evalExpression() {
document.getElementById("answer").innerHTML = "";
var expression = document.getElementById("exp").value;
if(expression) {
var lexer = new ExpLexer(new org.antlr.runtime.ANTLRStringStream(expression));
var tokens = new org.antlr.runtime.CommonTokenStream(lexer);
var parser = new ExpParser(tokens);
var nodes = new org.antlr.runtime.tree.CommonTreeNodeStream(parser.parse().getTree());
nodes.setTokenStream(tokens);
var walker = new ExpWalker(nodes);
var value = walker.walk();
document.getElementById("answer").innerHTML = expression + " = " + value;
}
else {
document.getElementById("exp").value = "enter an expression here first";
}
}
</script>
</head>
<body onload="init()">
<input id="exp" type="text" size="35" />
<button id="eval">evaluate</button>
<div id="answer"></div>
</body>
</html>

Using ANTLR to parse a log file

I'm just getting started with ANTLR and am trying to parse some patterns out of a log file.
For example, the log file:
7114422 2009-07-16 15:43:07,078
[LOGTHREAD] INFO StatusLog - Task 0
input :
uk.project.Evaluation.Input.Function1(selected=["red","yellow"]){}
7114437 2009-07-16 15:43:07,093
[LOGTHREAD] INFO StatusLog - Task 0
output :
uk.org.project.Evaluation.Output.Function2(selected=["Rocket"]){}
7114422 2009-07-16 15:43:07,078
[LOGTHREAD] INFO StatusLog - Task 0
input :
uk.project.Evaluation.Input.Function3(selected=["blue","yellow"]){}
7114437 2009-07-16 15:43:07,093
[LOGTHREAD] INFO StatusLog - Task 0
output :
uk.org.project.Evaluation.Output.Function4(selected=["Speech"]){}
Now I have to parse this file to find only 'Evaluation.Input.Function1' and its values 'red' and 'yellow', and 'Evaluation.Output.Function2' and its value 'Rocket', ignoring everything else; the same goes for the other two input and output functions (3 and 4) below. There are many such Input and Output functions, and I have to find such sets of input/output functions. This is my attempted grammar, which is not working. Any help would be appreciated. As this is my first attempt at writing a grammar and using ANTLR, it is becoming quite daunting.
grammar test;
tag : inputtag+ outputtag+ ;
// Input tag consists of at least one input function with one or more values
inputtag: INPUTFUNCTIONS INPUTVALUES+;
// Output tag consists of at least one output function with one or more output values
outputtag : OUTPUTFUNCTIONS OUTPUTVALUES+;
INPUTFUNCTIONS
: INFUNCTION1 | INFUNCTION2;
OUTPUTFUNCTIONS
:OUTFUNCTION1 | OUTFUNCTION2;
// Possible input functions in the log file
fragment INFUNCTION1
:'Evaluation.Input.Function1';
fragment INFUNCTION2
:'Evaluation.Input.Function3';
//Possible values in the input functions
INPUTVALUES
: 'red' | 'yellow' | 'blue';
// Possible output functions in the log file
fragment OUTFUNCTION1
:'Evaluation.Output.Function2';
fragment OUTFUNCTION2
:'Evaluation.Output.Function4';
// Possible output values in the output functions
fragment OUTPUTVALUES
: 'Rocket' | 'Speech';
When you're only interested in part of the file you're parsing, you don't need a parser or a grammar for the entire file format. A lexer grammar with ANTLR's options{filter=true;} will suffice. That way, you will only grab the tokens you defined in your grammar and ignore the rest of the file.
Here's a quick demo:
lexer grammar TestLexer;
options{filter=true;}
@lexer::members {
public static void main(String[] args) throws Exception {
String text =
"7114422 2009-07-16 15:43:07,078 [LOGTHREAD] INFO StatusLog - Task 0 input : uk.project.Evaluation.Input.Function1(selected=[\"red\",\"yellow\"]){}\n"+
"\n"+
"7114437 2009-07-16 15:43:07,093 [LOGTHREAD] INFO StatusLog - Task 0 output : uk.org.project.Evaluation.Output.Function2(selected=[\"Rocket\"]){}\n"+
"\n"+
"7114422 2009-07-16 15:43:07,078 [LOGTHREAD] INFO StatusLog - Task 0 input : uk.project.Evaluation.Input.Function3(selected=[\"blue\",\"yellow\"]){}\n"+
"\n"+
"7114437 2009-07-16 15:43:07,093 [LOGTHREAD] INFO StatusLog - Task 0 output : uk.org.project.Evaluation.Output.Function4(selected=[\"Speech\"]){}";
ANTLRStringStream in = new ANTLRStringStream(text);
TestLexer lexer = new TestLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
for(Object obj : tokens.getTokens()) {
Token token = (Token)obj;
System.out.println("> token.getText() = "+token.getText());
}
}
}
Input
: 'Evaluation.Input.Function' '0'..'9'+ Params
;
Output
: 'Evaluation.Output.Function' '0'..'9'+ Params
;
fragment
Params
: '(selected=[' String ( ',' String )* '])'
;
fragment
String
: '"' ( ~'"' )* '"'
;
Now do:
java -cp antlr-3.2.jar org.antlr.Tool TestLexer.g
javac -cp antlr-3.2.jar TestLexer.java
java -cp .:antlr-3.2.jar TestLexer // or on Windows: java -cp .;antlr-3.2.jar TestLexer
and you'll see the following being printed to the console:
> token.getText() = Evaluation.Input.Function1(selected=["red","yellow"])
> token.getText() = Evaluation.Output.Function2(selected=["Rocket"])
> token.getText() = Evaluation.Input.Function3(selected=["blue","yellow"])
> token.getText() = Evaluation.Output.Function4(selected=["Speech"])
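For comparison, the same "grab only what matches, skip the rest" idea can be sketched in plain Java with java.util.regex. This is not ANTLR's filter mode, just an approximation of its effect on this particular log format; the LogFilter class and the regular expression are my own translation of the Input, Output, Params and String rules:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogFilter {
    // Function name, one or more digits, then (selected=["...","..."]).
    static final Pattern CALL = Pattern.compile(
        "Evaluation\\.(?:Input|Output)\\.Function[0-9]+"
        + "\\(selected=\\[\"[^\"]*\"(?:,\"[^\"]*\")*\\]\\)");

    // Returns every matching call in order; all other text is ignored.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = CALL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text =
            "7114422 2009-07-16 15:43:07,078 [LOGTHREAD] INFO StatusLog - Task 0 input : "
            + "uk.project.Evaluation.Input.Function1(selected=[\"red\",\"yellow\"]){}\n"
            + "7114437 2009-07-16 15:43:07,093 [LOGTHREAD] INFO StatusLog - Task 0 output : "
            + "uk.org.project.Evaluation.Output.Function2(selected=[\"Rocket\"]){}";
        for (String call : extract(text)) {
            System.out.println(call);
        }
    }
}
```

A regex is fine for a flat pattern like this one; the lexer grammar becomes the better fit once the tokens need to nest or feed a parser.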