ANTLR parse multiple files to produce one AST

How do I parse multiple source files and end up with a single AST to perform analysis and code generation on? Typically, I find example usage of ANTLR in the form of
public void process(String source)
{
    ANTLRStringStream input = new ANTLRStringStream(source);
    TLexer lex = new TLexer(input);
    CommonTokenStream tokens = new CommonTokenStream(lex);
    TParser parser = new TParser(tokens);
    var tree = parser.parse().Tree;
}
but neither the lexer nor the parser seems to be able to take additional files. Am I supposed to create a lexer and parser per input file and use tree.Add() to add the trees from the other files to the tree of the first file?

Here are three ways you could do this:
Use Bart's suggestion and combine the files into a single buffer. This would require adding a lexer rule that implements identical functionality to the C++ #line directive.
Combine the trees returned by the parser rule.
Use multiple input streams with a single lexer. This can be done with code similar to that which handles include files, by pushing all the buffers onto a stack before lexing; a rough sketch follows below.
The second option would probably be the easiest. I don't use the Java target, so I can't give the code details that any of these solutions would require.
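For the third option, an untested sketch for the Java target might look like the following. It assumes the generated TLexer from the question and relies on the ANTLR 3 Lexer.setCharStream method, which resets the lexer before switching input (note that line numbers restart with each file):

import java.util.ArrayDeque;
import java.util.Deque;
import org.antlr.runtime.CharStream;
import org.antlr.runtime.Token;

// Feed several files through one lexer by swapping in the next
// CharStream whenever the current one runs out.
class MultiFileLexer extends TLexer {
    private final Deque<CharStream> pending = new ArrayDeque<CharStream>();

    public MultiFileLexer(CharStream first, CharStream... rest) {
        super(first);
        for (CharStream s : rest) {
            pending.add(s);
        }
    }

    @Override
    public Token nextToken() {
        Token t = super.nextToken();
        // on EOF, switch to the next buffered file (the loop also skips empty files)
        while (t.getType() == Token.EOF && !pending.isEmpty()) {
            setCharStream(pending.poll());
            t = super.nextToken();
        }
        return t;
    }
}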

I think this is close to what you are after. I've hard-coded two files to process, but you can process as many as needed by adding a loop. At the step // create new parent node and merge trees here into fulltree, see Bart's answer on duplicating a tree; it has the steps to create a parent node and attach children to it (sorry, but I've not done this myself and didn't have time to integrate his code and test it; a rough sketch of that merge step follows the code below).
import java.io.IOException;

import org.antlr.runtime.ANTLRFileStream;
import org.antlr.runtime.CharStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import org.antlr.runtime.tree.CommonTree;
import org.antlr.runtime.tree.CommonTreeNodeStream;

public class OneASTfromTwoFiles {
    public static String source1 = "file1.txt";
    public static String source2 = "file2.txt";

    public static void main(String[] args) throws RecognitionException {
        CommonTree fulltree = null;
        CommonTree nodes1 = process(source1);
        CommonTree nodes2 = process(source2);
        // create new parent node and merge trees here into fulltree

        CommonTreeNodeStream nodes = new CommonTreeNodeStream(fulltree); // create node stream
        treeEval walker = new treeEval(nodes);
        walker.startRule(); // walk the combined tree
    }

    public static CommonTree process(String source) throws RecognitionException {
        CharStream afs = null;
        // read file; exit if error
        try {
            afs = new ANTLRFileStream(source);
        } catch (IOException e) {
            System.out.println("file not found: " + source);
            System.exit(1);
        }
        TLexer lex = new TLexer(afs);
        CommonTokenStream tokens = new CommonTokenStream(lex);
        TParser parser = new TParser(tokens);
        // note: startRule is the name of the first rule in your parser grammar
        TParser.startRule_return r = parser.startRule(); // parse this file
        CommonTree ast = (CommonTree) r.getTree();       // AST from the parse of this file
        return ast;                                      // and return it
    }
}
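For the merge step itself, a minimal sketch following Bart's duplicate-a-tree answer would replace the merge comment in main() with something like this. PROGRAM is a hypothetical imaginary token; you would declare it in the grammar's tokens{} section so the generated TParser exposes it as a constant:

// requires: import org.antlr.runtime.CommonToken;
// create a new root from an imaginary token and hang both file trees off it
CommonTree fulltree = new CommonTree(new CommonToken(TParser.PROGRAM, "PROGRAM"));
fulltree.addChild(nodes1);
fulltree.addChild(nodes2);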

Related

Can ANTLR detect missing lexer tokens

I'm using the Java grammar defined at https://github.com/antlr/grammars-v4/tree/master/java/java
My user somehow entered his Java code as the following:
class HelloWorld {
    public static void main(String[] args) {
        * my first program !
        */
        System.out.println("Hello, World!");
    }
}
He just forgot the /* before line 3, and my parser gets screwed up.
var stream = CharStreams.fromString(input);
ITokenSource lexer = new JavaLexer(stream);
ITokenStream tokens = new CommonTokenStream(lexer);
Parser parser = new JavaParser(tokens);
var tree = parser.compilationUnit();
The Definitive ANTLR 4 Reference claims ANTLR can do single-token insertion and single-token deletion, but I don't see it inserting /* for me.
How would I ask ANTLR to recover from the missing /*?

Simple grammar parse returning no statements

Using C# and ANTLR4, I'm trying to parse a simple grammar, which is just a simple assignment statement that would look like:
int someinteger = 3;.
Below are my parser rules, which contain a compile unit, block and basic statement.
//The final compile unit sent to the interpreter.
compileUnit
: block EOF
;
//A block, array of statements.
block: statement*
;
//A single statement.
statement: stat_ass;
//An assign statement.
stat_ass: IDENTIFIER IDENTIFIER SET_EQUALS INTEGER ENDLINE;
When parsing int banana = 142;, the tokens returned are:
[IDENTIFIER, int]
[IDENTIFIER, banana]
[SET_EQUALS, =]
[INTEGER, 142]
[ENDLINE, ;]
However, when printing my parse tree, it just contains a block which has no statements.
ANTLR Parse Tree:
([] [10] <EOF>)
Can someone enlighten me as to why this fails? Apologies if this is a simple mistake; I've run out of options I can think of to fix it.
Program.cs:
using Antlr4.Runtime;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace stork
{
    class Program
    {
        static void Main(string[] args)
        {
            //Test input string.
            string input = "int banana = 142;";
            var chars = new AntlrInputStream(input);
            var lexer = new storkLexer(chars);
            var tokens = new CommonTokenStream(lexer);

            //Debug print.
            ANTLRDebug.PrintTokens(lexer);

            //Debug print tree.
            var parser = new storkParser(tokens);
            ANTLRDebug.PrintParseList(parser);

            //Getting tree.
            parser.BuildParseTree = true;
            var tree = parser.compileUnit();
        }
    }
}
ANTLRDebug.cs
https://github.com/c272/stork-lang/blob/master/stork/ANTLRDebug.cs
stork.g4
https://github.com/c272/stork-lang/blob/master/stork/Stork.g4
Your ANTLRDebug.PrintTokens method iterates over all the tokens from the lexer, consuming all of them. Afterwards the lexer is empty (it's like an iterator that way), so you're invoking the parser on an empty token stream.
You should call lexer.reset() after calling ANTLRDebug.PrintTokens (or call it at the end of that method) to reset the lexer to the beginning of the input stream.
PS: I recommend calling ToStringTree(parser) instead of just ToStringTree(), as that will produce more readable output (rule names instead of numbers).
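In the Java runtime, the corrected pipeline might look like the sketch below (the question uses the C# runtime, where these members are capitalized, e.g. Reset(); storkLexer/storkParser are the classes generated from the question's grammar):

import org.antlr.v4.runtime.*;

public class Demo {
    public static void main(String[] args) {
        CharStream chars = CharStreams.fromString("int banana = 142;");
        storkLexer lexer = new storkLexer(chars);
        CommonTokenStream tokens = new CommonTokenStream(lexer);

        // debug pass: printing every token drains the lexer...
        for (Token t = lexer.nextToken(); t.getType() != Token.EOF; t = lexer.nextToken()) {
            System.out.println(t);
        }
        lexer.reset(); // ...so rewind it before the parser pulls from the token stream

        storkParser parser = new storkParser(tokens);
        parser.setBuildParseTree(true);
        ParserRuleContext tree = parser.compileUnit();
        System.out.println(tree.toStringTree(parser)); // rule names, not numbers
    }
}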

Lucene's WordnetSynonymParser

I am trying to use Lucene's WordnetSynonymParser class to create a synonym filter, but I'm not sure which of the Prolog files I'm meant to be passing into the parse() function.
The documentation says:
See http://wordnet.princeton.edu/man/prologdb.5WN.html for a
description of the format.
so I've downloaded the Prolog files, but I'm not sure which ones I should be passing in, or how to go about it.
Could someone please point me in the right direction?
Thanks for your help
EDIT:
Thanks to femtoRgon for pointing me in the direction of wn_s.pl. I have now got the following code:
Analyzer tempanalyzer = new SimpleAnalyzer(Version.LUCENE_40);
WordnetSynonymParser synparser = new WordnetSynonymParser(true, true, tempanalyzer);
FileReader doctoread = new FileReader("wn_s.pl");
synparser.parse(doctoread);
SynonymMap synmap = synparser.build();

Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        EnglishAnalyzer enganalyzer = new EnglishAnalyzer(Version.LUCENE_40);
        CharArraySet engstopset = enganalyzer.getDefaultStopSet();
        Tokenizer source = new StandardTokenizer(Version.LUCENE_40, reader);
        TokenStream filter = new SynonymFilter(source, synmap, true);
        filter = new StandardFilter(Version.LUCENE_40, filter);
        filter = new LowerCaseFilter(Version.LUCENE_40, filter);
        filter = new StopFilter(Version.LUCENE_40, filter, engstopset);
        /*TokenStream filter = new StandardFilter(Version.LUCENE_40, source);
        filter = new LowerCaseFilter(Version.LUCENE_40, filter);
        filter = new StopFilter(Version.LUCENE_40, filter, engstopset);*/
        return new TokenStreamComponents(source, filter);
    }
};
which I then plan on passing into IndexWriterConfig; however, I get the following compile error:
IndexFilesDB.java:133: cannot find symbol
symbol : method parse(java.io.FileReader)
location: class org.apache.lucene.analysis.synonym.WordnetSynonymParser
synparser.parse(doctoread);
I still don't fully understand WordnetSynonymParser. Is it an error to do with the class, or is it just a simple error where the file is not being passed in correctly?
Thanks for your help.
wn_s.pl contains the synset pointers (that is, it defines groups of synonyms), which is what you need for a synonym filter, to my knowledge. I'd start with that.

How to add options for Analyzer in Apache Lucene?

Lucene has Analyzers that basically tokenize and filter the corpus when indexing. Operations include converting tokens to lowercase, stemming, removing stopwords, etc.
I'm running an experiment where I want to try all possible combinations of analysis operations: stemming only, stopping only, stemming and stopping, ...
In total, there are 36 combinations that I want to try.
How can I easily and gracefully do this?
I know that I can extend the Analyzer class and implement the tokenStream() function to create my own Analyzer:
public class MyAnalyzer extends Analyzer
{
    public TokenStream tokenStream(String field, final Reader reader) {
        return new NameFilter(
                   new CaseNumberFilter(
                       new StopFilter(
                           new LowerCaseFilter(
                               new StandardFilter(
                                   new StandardTokenizer(reader))),
                           StopAnalyzer.ENGLISH_STOP_WORDS)));
    }
}
What I'd like to do is write one such class, which can somehow take boolean values for each of the possible operations (doStopping, doStemming, etc.). I don't want to have to write 36 different Analyzer classes that each perform one of the 36 combinations. What makes it difficult is the way the filters are all combined together in their constructors.
Any ideas on how to do this gracefully?
EDIT: By "gracefully", I mean that I can easily create a new Analyzer in some sort of loop:
analyzer = new MyAnalyzer(doStemming, doStopping, ...)
where doStemming and doStopping change with each loop iteration.
Solr solves this problem by using Tokenizer and TokenFilter factories. You could do the same, for example:
public interface TokenizerFactory {
    Tokenizer newTokenizer(Reader reader);
}

public interface TokenFilterFactory {
    TokenFilter newTokenFilter(TokenStream source);
}

public class ConfigurableAnalyzer {
    private final TokenizerFactory tokenizerFactory;
    private final List<TokenFilterFactory> tokenFilterFactories;

    public ConfigurableAnalyzer(TokenizerFactory tokenizerFactory,
                                TokenFilterFactory... tokenFilterFactories) {
        this.tokenizerFactory = tokenizerFactory;
        this.tokenFilterFactories = Arrays.asList(tokenFilterFactories);
    }

    public TokenStream tokenStream(String field, Reader source) {
        TokenStream sink = tokenizerFactory.newTokenizer(source);
        for (TokenFilterFactory tokenFilterFactory : tokenFilterFactories) {
            sink = tokenFilterFactory.newTokenFilter(sink);
        }
        return sink;
    }
}
This way, you can configure your analyzer by passing a factory for one tokenizer and 0 to n filters as constructor arguments.
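To make the sweep concrete, a driver along these lines could work. This is a hypothetical sketch, not part of the original answer: it shows only the stopping/stemming flags (the same pattern extends to all 36 combinations), assumes Java 8 lambdas for the two single-method factory interfaces, and reuses the Lucene 3.x-era constructors that appear in the answer below; runExperiment is a placeholder:

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopFilter;

public class AnalyzerSweep {
    // builds one analyzer per flag combination
    static void runAllCombinations(Set<?> customStopSet) {
        for (boolean doStopping : new boolean[] {false, true}) {
            for (boolean doStemming : new boolean[] {false, true}) {
                List<TokenFilterFactory> filters = new ArrayList<TokenFilterFactory>();
                if (doStopping) {
                    filters.add(source -> new StopFilter(true, source, customStopSet));
                }
                if (doStemming) {
                    filters.add(source -> new PorterStemFilter(source));
                }
                ConfigurableAnalyzer analyzer = new ConfigurableAnalyzer(
                        reader -> new LowerCaseTokenizer(reader),
                        filters.toArray(new TokenFilterFactory[0]));
                // runExperiment(analyzer, doStopping, doStemming);
            }
        }
    }
}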
Add some class variables to the custom Analyzer class which can be easily set and unset on the fly. Then, in the tokenStream() function, use those variables to determine which filters to perform.
public class MyAnalyzer extends Analyzer {
private Set customStopSet;
public static final String[] STOP_WORDS = ...;
private boolean doStemming = false;
private boolean doStopping = false;
public JavaSourceCodeAnalyzer(){
super();
customStopSet = StopFilter.makeStopSet(STOP_WORDS);
}
public void setDoStemming(boolean val){
this.doStemming = val;
}
public void setDoStopping(boolean val){
this.doStopping = val;
}
public TokenStream tokenStream(String fieldName, Reader reader) {
// First, convert to lower case
TokenStream out = new LowerCaseTokenizer(reader);
if (this.doStopping){
out = new StopFilter(true, out, customStopSet);
}
if (this.doStemming){
out = new PorterStemFilter(out);
}
return out;
}
}
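With that in place, the loop from the question's EDIT becomes a matter of flipping the flags. A small usage sketch, where the field name and input text are placeholders:

// requires: import java.io.StringReader;
MyAnalyzer analyzer = new MyAnalyzer();
for (boolean doStopping : new boolean[] {false, true}) {
    for (boolean doStemming : new boolean[] {false, true}) {
        analyzer.setDoStopping(doStopping);
        analyzer.setDoStemming(doStemming);
        TokenStream stream = analyzer.tokenStream("contents", new StringReader("some sample text"));
        // ... consume the stream for this combination ...
    }
}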
There is one gotcha: LowerCaseTokenizer takes as input the reader variable, and returns a TokenStream. This is fine for the following filters (StopFilter, PorterStemFilter), because they take TokenStreams as input and return them as output, and so we can chain them together nicely. However, this means you can't have a filter before the LowerCaseTokenizer that returns a TokenStream. In my case, I wanted to split camelCase words into parts, and this has to be done before converting to lower case. My solution was to perform the splitting manually in the custom Indexer class, so by the time MyAnalyzer sees the text, it has already been split.
(I have also added a boolean flag to my custom Indexer class, so now both can work based solely on flags.)
Is there a better answer?

ANTLR forward references

I need to create a grammar for a language with forward references. I think the easiest way to achieve this is to make several passes over the generated AST, but I need a way to store symbol information in the tree.
Right now my parser correctly generates an AST and computes scopes of the variables and function definitions. The problem is, I don't know how to save the scope information into the tree.
Fragment of my grammar:
composite_instruction
scope JScope;
@init {
    $JScope::symbols = new ArrayList();
    $JScope::name = "level " + $JScope.size();
}
@after {
    System.out.println("code block scope " + $JScope::name + " = " + $JScope::symbols);
}
    : '{' instruction* '}' -> ^(INSTRUCTION_LIST instruction*)
    ;
I would like to put a reference to the current scope into the tree, something like:
: '{' instruction* '}' -> ^(INSTRUCTION_LIST instruction* {$JScope::symbols})
Is it even possible? Is there any other way to store the current scopes in a generated tree? I can generate the scope info in a tree grammar, but it won't change anything, because I still have to store it somewhere for the second pass over the tree.
To my knowledge, the syntax for the rewrite rules doesn't allow for directly assigning values as your tentative snippet suggests. This is in part because the parser wouldn't really know what part of the tree/node the values should be added to.
However, one of the cool features of ANTLR-produced ASTs is that the parser makes no assumptions about the type of the nodes. One just needs to implement a TreeAdaptor, which serves as a factory for new nodes and as a navigator of the tree structure. One can therefore stuff whatever info may be needed into the nodes, as explained below.
ANTLR provides a default tree node implementation, CommonTree, and in most cases (as in the situation at hand) we merely need to
subclass CommonTree by adding some custom fields to it
subclass the CommonTreeAdaptor to override its create() method, i.e. the way it produces new nodes.
but one could also create a novel type of node altogether, for some odd graph structure or whatnot. For the case at hand, the following should be sufficient (adapt for the specific target language if this isn't Java):
import java.util.ArrayList;
import org.antlr.runtime.tree.*;
import org.antlr.runtime.Token;

public class NodeWithScope extends CommonTree {
    /* Just declare the extra fields for the node */
    public ArrayList symbols;
    public String name;
    public Object whatever_else;

    public NodeWithScope(Token t) {
        super(t);
    }
}

/* TreeAdaptor: we just need to override the create method */
class NodeWithScopeAdaptor extends CommonTreeAdaptor {
    public Object create(Token standardPayload) {
        return new NodeWithScope(standardPayload);
    }
}
One then needs to slightly modify the way the parsing process is started, so that ANTLR (or rather the ANTLR-produced parser) knows to use the NodeWithScopeAdaptor rather than the default CommonTreeAdaptor
(step 4.1 below; the rest is a rather standard ANTLR test rig).
// ***** Typical ANTLR pipe rig *****

// ** 1. input stream
ANTLRInputStream input = new ANTLRInputStream(my_input_file);

// ** 2. Lexer
MyGrammarLexer lexer = new MyGrammarLexer(input);

// ** 3. token stream produced by lexer
CommonTokenStream tokens = new CommonTokenStream(lexer);

// ** 4. Parser
MyGrammarParser parser = new MyGrammarParser(tokens);

// 4.1 !!! Specify the TreeAdaptor
NodeWithScopeAdaptor adaptor = new NodeWithScopeAdaptor();
parser.setTreeAdaptor(adaptor); // use my adaptor

// ** 5. Start the process by invoking the root rule
MyGrammarParser.MyTopRule_return r = parser.MyTopRule();

// ** 6. AST tree
NodeWithScope t = (NodeWithScope) r.getTree();

// ** 7. etc. Walk the tree or do whatever else is needed on it.
Finally, your grammar would have to be adapted with something akin to what follows
(note that the node [for the current rule] is only available in the @after section; it may, however, reference any token attribute and other contextual variable from the grammar level, using the usual $rule.attribute notation):
composite_instruction
scope JScope;
@init {
    $JScope::symbols = new ArrayList();
    $JScope::name = "level " + $JScope.size();
}
@after {
    ($composite_instruction.tree).symbols = $JScope::symbols;
    ($composite_instruction.tree).name = $JScope::name;
    ($composite_instruction.tree).whatever_else
        = new myFancyObject($x.Text, $y.line, whatever, blah);
}
    : '{' instruction* '}' -> ^(INSTRUCTION_LIST instruction*)
    ;
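A later pass can then read the stored info straight off the nodes. A small sketch, assuming the test rig above, where r is the return value of parser.MyTopRule():

// Every node produced by NodeWithScopeAdaptor is a NodeWithScope, so a
// second pass can simply cast and read the scope info stored by @after.
// (The fields are null on nodes whose rules didn't set them.)
NodeWithScope root = (NodeWithScope) r.getTree();
for (int i = 0; i < root.getChildCount(); i++) {
    NodeWithScope child = (NodeWithScope) root.getChild(i);
    System.out.println(child.name + " -> " + child.symbols);
}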