Regenerate source from Antlr4 ParseTree preserving whitespaces - antlr

I am able to use the Java.g4 grammar and generate lexers and parsers; I used this grammar to get started. I have a ParseTree, and I walk it, change whatever I want, and write the result back to a Java source code file. The ParseTree itself is not directly changed, though.
So, for example, this is the code. I also modify method names, add annotations to methods, and in some cases remove method parameters.
@Override
public void enterClassDeclaration(JavaParser.ClassDeclarationContext ctx) {
printer.write( intervals );
printer.writeList(importFilter );
ParseTree pt = ctx.getChild(1);
/*
The class name is changed by a visitor because
we get a ParseTree back
*/
pt.accept(new ParseTreeVisitor<Object>() {
@Override
public Object visitChildren(RuleNode ruleNode) {
return null;
}
@Override
public Object visitErrorNode(ErrorNode errorNode) {
return null;
}
@Override
public Object visit(ParseTree parseTree) {
return null;
}
@Override
public Object visitTerminal(TerminalNode terminalNode) {
String className = terminalNode.getText();
System.out.println("Name of the class is [ " + className + "]");
printer.writeText( classModifier.get() + " class " + NEW_CLASS_IDENTIFIER );
return null;
}
});
}
But I am not sure how to print the changed Java code while preserving all the original whitespace.
How is that done?
Update: It seems that the whitespace and comments are there but not easily accessible. So it looks like I need to keep track of them explicitly and write them out along with the code. Not sure, though.
So more specifically the code is this.
package x;
import java.util.Enumeration;
import java.util.*;
As I hit the first ImportDeclarationContext I need to store all the hidden space tokens. When I write this code back I want to include those spaces too.
Solution:
Don't skip the whitespace and comment tokens; send them to the HIDDEN channel instead:
//
// Whitespace and comments
//
WS : [ \t\r\n\u000C]+ -> channel(HIDDEN)
;
COMMENT
: '/*' .*? '*/' -> channel(HIDDEN)
;
Use code like this to get them back.
I use this to get the whitespace and comments before each method, but it should be possible to get whitespace from other places too, I think:
((CommonTokenStream) tokens).getHiddenTokensToLeft( classOrInterfaceModifierContext.getStart().getTokenIndex(),
Token.HIDDEN_CHANNEL);
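Putting that together, here is a rough sketch of emitting the hidden tokens before a rule's own text; it assumes tokens is the CommonTokenStream the parser was built on and printer is the writer from the question:
// Sketch: write out any whitespace/comment tokens hidden to the left of this
// rule's first token, then the (possibly modified) text for the rule itself.
List<Token> hidden = ((CommonTokenStream) tokens).getHiddenTokensToLeft(
        ctx.getStart().getTokenIndex(), Token.HIDDEN_CHANNEL);
if (hidden != null) {                     // null when nothing is hidden there
    for (Token t : hidden) {
        printer.writeText(t.getText());
    }
}
printer.writeText(ctx.getText());         // or whatever modified text you produce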

Related

How to add a semicolon ; automatically to each generated SQL statement using jOOQ

I'm trying to add a semicolon ; to every jOOQ-generated SQL statement, as I'm writing multiple DDL and insert statements to an output file.
I found a similar question suggesting the use of an ExecuteListener: https://jooq-user.narkive.com/6adKecpt/adding-semicolon-at-the-end-of-sql-statement.
My setup is now as follows (using Groovy):
private DSLContext createDSLContext() {
def configuration = new DefaultConfiguration()
configuration.settings = new Settings()
.withRenderFormatted(true)
.withRenderKeywordCase(RenderKeywordCase.LOWER)
.withRenderQuotedNames(RenderQuotedNames.ALWAYS)
.withStatementType(StatementType.STATIC_STATEMENT)
configuration.set(
new DefaultExecuteListenerProvider(new DefaultExecuteListener() {
@Override
void renderEnd(ExecuteContext ctx) {
ctx.sql(ctx.sql() + ";")
}
}),
new DefaultExecuteListenerProvider(new DefaultExecuteListener() {
@Override
void start(ExecuteContext ctx) {
println "YEAH!!!"
}
}))
// return configuration.dsl();
return DSL.using(configuration)
}
but it is not adding the semicolon, nor is it getting into the renderEnd method at all.
I added another execute listener to print something at the start (as I have seen in other examples), but it is also never called.
My code looks like:
file.withWriter { writer ->
// Drop schema objects.
DEFAULT_SCHEMA.tables.each {
switch (it.type) {
case TABLE:
writer.writeLine(dsl.dropTableIfExists(it).SQL)
break
case VIEW:
writer.writeLine(dsl.dropViewIfExists(it).SQL)
break
}
}
writer.writeLine("")
// Create schema objects.
def ddlStatements = dsl.ddl(DEFAULT_SCHEMA)
ddlStatements.each {
writer.writeLine(it.SQL)
writer.writeLine("")
}
// Insert data.
def insert = dsl.insertInto(Tales.CUSTOMER).columns(Tales.CUSTOMER.fields())
customers.each {insert.values(it) }
writer.writeLine(insert.SQL)
}
The ExecuteListener lifecycle is only triggered when you execute your queries with jOOQ. You're not doing that; you're just calling Query.getSQL().
You could wrap your queries into DSLContext.queries(Query...), and jOOQ will separate the statements using ; when you call Queries.toString(). Of course, that's not reliable; the behaviour of toString() might change in the future, which is why it would make sense to offer methods like Queries.getSQL() and the like: https://github.com/jOOQ/jOOQ/issues/11755
For the time being, why not just add the semicolon manually to the writer in your code?
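For illustration, here is a rough sketch of both options as a plain Java-style fragment (the question's code is Groovy, but the jOOQ calls are the same); dsl, writer, DEFAULT_SCHEMA and Tales.CUSTOMER are the objects from the question, and SOME_VIEW is just a placeholder:
// Option 1: append the semicolon yourself when writing each statement.
for (Table<?> table : DEFAULT_SCHEMA.getTables()) {
    writer.write(dsl.dropTableIfExists(table).getSQL() + ";");
    writer.write(System.lineSeparator());
}
// Option 2: wrap the statements in a Queries object; its toString() currently
// renders them separated by semicolons (not guaranteed behaviour, as noted above).
Queries drops = dsl.queries(
        dsl.dropTableIfExists(Tales.CUSTOMER),
        dsl.dropViewIfExists(SOME_VIEW));
writer.write(drops.toString());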

How to implement just some basic keywords highlighting in text editor?

I'm a novice programmer trying to learn plug-in development. I'd like to upgrade the sample XML editor so that some words like "cat", "dog", "hamster", "rabbit" and "bird" would be highlighted when they appear in an XML file (it's just for learning purposes). Can anyone give me some implementation tips or suggestions? I am clueless. (But I am carrying out my own research on this as well; I'm not being lazy. You have my word.) Thanks in advance.
You can detect words in the plain text part of the XML by modifying the sample XML editor as follows.
We can use the provided WordRule class to detect the words. The XMLScanner class which scans the plain text needs to be updated to include the word rule:
public XMLScanner(final ColorManager manager)
{
IToken procInstr = new Token(new TextAttribute(manager.getColor(IXMLColorConstants.PROC_INSTR)));
WordRule words = new WordRule(new WordDetector());
words.addWord("cat", procInstr);
words.addWord("dog", procInstr);
// TODO add more words here
IRule [] rules = new IRule [] {
// Add rule for processing instructions
new SingleLineRule("<?", "?>", procInstr),
// Add generic whitespace rule.
new WhitespaceRule(new XMLWhitespaceDetector()),
// Words rules
words
};
setRules(rules);
}
I have used the existing processing instruction token here to reduce the amount of new code, but you should define a new color and use a new token.
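For instance, a hypothetical fragment for the XMLScanner constructor; the RGB value and the names animalColor/animalToken are made up, and manager.getColor(RGB) is the method already used above:
// Define a dedicated color and token instead of reusing the processing-
// instruction token, then pass it to words.addWord(...) above.
RGB animalColor = new RGB(0, 128, 0);    // arbitrary green
IToken animalToken = new Token(new TextAttribute(manager.getColor(animalColor)));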
The WordRule constructor requires an IWordDetector; we can use a very simple detector here:
class WordDetector implements IWordDetector
{
@Override
public boolean isWordStart(final char c)
{
return Character.isLetter(c);
}
@Override
public boolean isWordPart(final char c)
{
return Character.isLetter(c);
}
}
This is just accepting letters in words.

Lucene 4.1: How to split words that contain "dots" when indexing?

I'm trying to figure out what I should do to index my keywords that contain ".".
ex: this.name
I want to index the terms this and name in my index.
I use the StandardAnalyser. I tried extending WhitespaceTokenizer or TokenFilter, but I'm not sure if I'm going in the right direction.
If I use the StandardAnalyser, I'll obtain "this.name" as a keyword, and that's not what I want, but the analyser does the rest correctly for me.
You can put a CharFilter in front of StandardTokenizer that converts periods and underscores to spaces. MappingCharFilter will work.
Here's MappingCharFilter added to a stripped-down StandardAnalyzer (see the original 4.1 version here):
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.util.StopwordAnalyzerBase;
import org.apache.lucene.util.Version;
import java.io.IOException;
import java.io.Reader;
public final class MyAnalyzer extends StopwordAnalyzerBase {
private int maxTokenLength = 255;
public MyAnalyzer() {
super(Version.LUCENE_41, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
}
@Override
protected TokenStreamComponents createComponents
(final String fieldName, final Reader reader) {
final StandardTokenizer src = new StandardTokenizer(matchVersion, reader);
src.setMaxTokenLength(maxTokenLength);
TokenStream tok = new StandardFilter(matchVersion, src);
tok = new LowerCaseFilter(matchVersion, tok);
tok = new StopFilter(matchVersion, tok, stopwords);
return new TokenStreamComponents(src, tok) {
@Override
protected void setReader(final Reader reader) throws IOException {
src.setMaxTokenLength(MyAnalyzer.this.maxTokenLength);
super.setReader(reader);
}
};
}
@Override
protected Reader initReader(String fieldName, Reader reader) {
NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
builder.add(".", " ");
builder.add("_", " ");
NormalizeCharMap normMap = builder.build();
return new MappingCharFilter(normMap, reader);
}
}
Here's a quick test to demonstrate it works:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.BaseTokenStreamTestCase;
public class TestMyAnalyzer extends BaseTokenStreamTestCase {
private Analyzer analyzer = new MyAnalyzer();
public void testPeriods() throws Exception {
BaseTokenStreamTestCase.assertAnalyzesTo
(analyzer,
"this.name; here.i.am; sentences ... end with periods.",
new String[] { "name", "here", "i", "am", "sentences", "end", "periods" } );
}
public void testUnderscores() throws Exception {
BaseTokenStreamTestCase.assertAnalyzesTo
(analyzer,
"some_underscore_term _and____ stuff that is_not in it",
new String[] { "some", "underscore", "term", "stuff" } );
}
}
If I understand you correctly, you need to use a tokenizer that removes dots -- that is, any name that contains a dot should be split at that point ("here.i.am" becomes "here" + "i" + "am").
You are getting caught by the behavior documented here:
However, a dot that's not followed by whitespace is considered part of a token.
StandardTokenizer introduces some more complex parsing rules that you may not be looking for. This one, in particular, is intended to prevent tokenization of URLs, IPs, identifiers, etc. A simpler implementation might suit your needs, like LetterTokenizer.
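If LetterTokenizer fits, a minimal analyzer around it might look like the sketch below (written against the 4.1 API; the class name is made up):
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LetterTokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.util.Version;
public final class LetterOnlyAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(final String fieldName, final Reader reader) {
        // LetterTokenizer splits on every non-letter character, so "this.name"
        // yields "this" and "name"; digits and punctuation are dropped entirely.
        Tokenizer source = new LetterTokenizer(Version.LUCENE_41, reader);
        TokenStream result = new LowerCaseFilter(Version.LUCENE_41, source);
        return new TokenStreamComponents(source, result);
    }
}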
If that doesn't really suit your needs (and it might well turn out to be throwing the baby out with the bathwater), then you may need to modify StandardTokenizer yourself, which is explicitly encouraged by the Lucene docs:
Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.
Sebastien Dionne: I didn't understand how to split a word; do I have to parse the document char by char?
Sebastien Dionne: I still want to know how to split a token into multiple parts and index them all.
You may have to write a custom analyzer.
Analyzer is a combination of Tokenizer and possibly a chain of TokenFilter instances.
Tokenizer: takes in the input text you pass, probably as a java.io.Reader. It JUST breaks the text down; it doesn't alter anything.
TokenFilter: takes in the tokens emitted by the Tokenizer, adds / removes / alters them, and emits them one by one until all are finished.
If it replaces a token with multiple tokens based on your requirements, it buffers them all and emits them one by one to the indexer (see the sketch at the end of this answer).
You may check the following resource; unfortunately, you may have to sign up for a trial membership.
By writing a custom analyzer, you can break down the text the way you want to. You may even reuse some existing components like LowerCaseFilter. Fortunately, with Lucene it is achievable to come up with an Analyzer that serves your purpose if you can't find one built in or on the web.
" Writing Custom Filters: Lucene in Action 2"

Multiple stringtemplates from one rule

Being new to ANTLR I am trying to figure out how stringtemplates work. I would like to generate a piece of Java code based on a very simple input file. Because of its flexible concept I would like to use (string)templates.
In Java, one typically has to generate member declarations, initialize them somewhere else, and use them in yet another place. The identifier names have to match and are thus repeated. This means small template instantiations are needed here and there. Surely it can be done, but I cannot seem to find out how; maybe I am missing some important 'clue'?
I wrote a test program to investigate the concept. It takes a simple input file:
red = #FF0000
green = #00FF00
blue = #0000FF
and should produce something like the following output:
class MyColors {
// Class members
public java.awt.Color red;
public java.awt.Color green;
public java.awt.Color blue;
// Constructor
/* Question: How to access the right initializer value here?!? The values are not accessible at this level of the grammar*/
public MyColors() {
red = java.awt.Color.getColor("#FF0000");
green = java.awt.Color.getColor("#00FF00");
blue = java.awt.Color.getColor("#0000FF");
}
};
...where the names of the variables and initializers in the constructor are filled in according to the input.
The grammar I have defined is as follows:
grammar Test;
options {
output=template;
}
colors: (a+=def)+ -> colorClassDef(name={$a});
def: ident '=' name -> colorDef(id={$ident.text}, name={$name.text});
ident: ID;
name: ID;
ID: ('a'..'z'|'A'..'Z'|'#'|'0'..'9')+;
WS: (' '|'\t'|'\r'|'\n')+ { skip(); };
The template definitions are as follows:
group Test;
colorClassDef(name, id) ::= <<
class MyColors {
// Class members
<name:{ v | public java.awt.Color <v>;
}>
// Constructor
/* Question: How to access the initializer value here?!? */
public MyColors() {
<name:{ v | <v> = java.awt.Color.getColor("<id>");
}>
}
};
>>
/* How to return both id and name here separately, as id should go into the declaration and name should go into the init? */
colorDef(id, name) ::= <<
<id>
>>
Can anyone suggest how I can get <id> and <name> out of the rule 'def' in order to include them in the right portion of the generated code?
I have found multiple questions regarding multiple return values, like Returning multiple values in ANTLR rule and antlr2 return multiple values, but none include stringtemplates.
I even bought 'the book' and worked my way through the Java bytecode generator, but did not find my answer there. All examples seem to generate one bit of output for one bit of input. (No regrets though, the book makes excellent bed-time reading ;-)
Can anyone point out to me what clue I am missing? What would be the most appropriate way to fix this problem? Some examplary code and pointers to the documentation would be much appreciated.
Thanks,
Maarten
Can anyone suggest how I can get <id> and <name> out of the rule 'def' in order to include them in the right portion of the generated code?
Here's a straightforward, Java-centric approach to getting what you want. It's not as graceful as I would like (I assume there's room for improvement), but I think it solves the problem without a great deal of hassle. I renamed a few things, but I think I kept the spirit of your approach intact.
First, the template. Note that template colorClassDef requires every bit of information that's determined by the grammar: every id, every name, and the association between each id and its corresponding name. Here's one way of accessing all of that from the template:
group Colors;
colorClassDef(ids, colors) ::= <<
class MyColors {
// Class members
<ids:{ id | public java.awt.Color <id>;
}>
// Constructor
public MyColors() {
<ids:{ id | <id> = java.awt.Color.getColor("<colors.(id)>");
}>
}
};
>>
Here I'm using parameter ids to store a list of all the incoming ids and parameter colors to store a map that associates an id (the key) to a name (the value). For the constructor portion, ST accesses the id's name from colors with the "indirect property lookup" syntax: <colors.(id)>. Since colors is a map, id is used as a key into the map and the value is written into the template.
Template colorClassDef handles everything, so I removed template colorDef.
Second, the grammar. It needs to provide the ids and color map. Here's one way of doing that:
grammar Colors;
options {
output=template;
}
colors
@init {
java.util.LinkedList<String> ids = new java.util.LinkedList<String>();
java.util.HashMap<String, String> colors = new java.util.HashMap<String, String>();
}
: (ident '=' name
{ids.add($ident.text); colors.put($ident.text, $name.text);}
)+ EOF
-> colorClassDef(ids={ids}, colors={colors})
;
ident: ID;
name: ID;
ID: ('a'..'z'|'A'..'Z'|'#'|'0'..'9')+;
WS: (' '|'\t'|'\r'|'\n')+ { skip(); };
(To keep the grammar relatively simple, I merged rules colors and def into colors.)
Each ident is added to list ids and each name is added to map colors as the value to the corresponding ident key. Then off they go to the template.
Here is a test class to test out the works:
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CharStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import org.antlr.stringtemplate.StringTemplateGroup;
public class ColorsTest {
public static void main(String[] args) throws Exception {
final String code = "red = #FF0000\ngreen = #00FF00\nblue = #0000FF";
process(code, "Colors.stg");
}
private static void process(final String code, String templateResourceName)
throws IOException, RecognitionException, Exception {
CharStream input = new ANTLRStringStream(code);
ColorsLexer lexer = new ColorsLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
ColorsParser parser = new ColorsParser(tokens);
InputStream stream = ColorsTest.class.getResourceAsStream(templateResourceName);
Reader reader = new InputStreamReader(stream);
parser.setTemplateLib(new StringTemplateGroup(reader));
reader.close();
stream.close();
ColorsParser.colors_return result = parser.colors();
if (parser.getNumberOfSyntaxErrors() > 0){
throw new Exception("Syntax Errors encountered!");
}
System.out.println(result.toString());
}
}
Here's a test case based on the input in your question.
Input
red = #FF0000
green = #00FF00
blue = #0000FF
Output
class MyColors {
// Class members
public java.awt.Color red;
public java.awt.Color green;
public java.awt.Color blue;
// Constructor
public MyColors() {
red = java.awt.Color.getColor("#FF0000");
green = java.awt.Color.getColor("#00FF00");
blue = java.awt.Color.getColor("#0000FF");
}
};

ANTLR forward references

I need to create a grammar for a language with forward references. I think that the easiest way to achieve this is to make several passes on the generated AST, but I need a way to store symbol information in the tree.
Right now my parser correctly generates an AST and computes scopes of the variables and function definitions. The problem is, I don't know how to save the scope information into the tree.
Fragment of my grammar:
composite_instruction
scope JScope;
@init {
$JScope::symbols = new ArrayList();
$JScope::name = "level "+ $JScope.size();
}
@after {
System.out.println("code block scope " +$JScope::name + " = " + $JScope::symbols);
}
: '{' instruction* '}' -> ^(INSTRUCTION_LIST instruction*)
;
I would like to put a reference to current scope into a tree, something like:
: '{' instruction* '}' -> ^(INSTRUCTION_LIST instruction* {$JScope::symbols})
Is it even possible? Is there any other way to store current scopes in a generated tree? I can generate the scope info in a tree grammar, but it won't change anything, because I still have to store it somewhere for the second pass on the tree.
To my knowledge, the syntax for the rewrite rules doesn't allow for directly assigning values as your tentative snippet suggests. This is in part due to the fact that the parser wouldn't really know to what part of the tree/node the values should be added.
However, one of the cool features of ANTLR-produced ASTs is that the parser makes no assumptions about the type of the nodes. One just needs to implement a TreeAdaptor, which serves as a factory for new nodes and as a navigator of the tree structure. One can therefore stuff whatever info may be needed into the nodes, as explained below.
ANTLR provides a default tree node implementation, CommonTree, and in most cases (as in the situation at hand) we merely need to
subclass CommonTree by adding some custom fields to it
subclass the CommonTreeAdaptor to override its create() method, i.e. the way it produces new nodes.
but one could also create a novel type of node altogether, for some odd graph structure or whatnot. For the case at hand, the following should be sufficient (adapt for the specific target language if this isn't Java):
import org.antlr.runtime.tree.*;
import org.antlr.runtime.Token;
public class NodeWithScope extends CommonTree {
/* Just declare the extra fields for the node */
public ArrayList symbols;
public String name;
public Object whatever_else;
public NodeWithScope (Token t) {
super(t);
}
}
/* TreeAdaptor: we just need to override create method */
class NodeWithScopeAdaptor extends CommonTreeAdaptor {
public Object create(Token standardPayload) {
return new NodeWithScope(standardPayload);
}
}
One then needs to slightly modify the way the parsing process is started, so that ANTLR (or rather the ANTLR-produced parser) knows to use the NodeWithScopeAdaptor rather than the default CommonTree adaptor.
(Step 4.1 below; the rest is a rather standard ANTLR test rig.)
// ***** Typical ANTLR pipe rig *****
// ** 1. input stream
ANTLRInputStream input = new ANTLRInputStream(my_input_file);
// ** 2, Lexer
MyGrammarLexer lexer = new MyGrammarLexer(input);
// ** 3. token stream produced by lexer
CommonTokenStream tokens = new CommonTokenStream(lexer);
// ** 4. Parser
MyGrammarParser parser = new MyGrammarParser(tokens);
// 4.1 !!! Specify the TreeAdapter
NodeWithScopeAdaptor adaptor = new NodeWithScopeAdaptor();
parser.setTreeAdaptor(adaptor); // use my adaptor
// ** 5. Start process by invoking the root rule
MyGrammarParser.MyTopRule_return r = parser.MyTopRule();
// ** 6. AST tree
NodeWithScope t = (NodeWithScope)r.getTree();
// ** 7. etc. parse the tree or do whatever is needed on it.
Finally your grammar would have to be adapted with something akin to what follows
(note that the node [for the current rule] is only available in the @after section. It may, however, reference any token attribute and other contextual variables from the grammar level, using the usual $rule.attribute notation)
composite_instruction
scope JScope;
@init {
$JScope::symbols = new ArrayList();
$JScope::name = "level "+ $JScope.size();
}
@after {
($composite_instruction.tree).symbols = $JScope::symbols;
($composite_instruction.tree).name = $JScope::name;
($composite_instruction.tree).whatever_else
= new myFancyObject($x.Text, $y.line, whatever, blah);
}
: '{' instruction* '}' -> ^(INSTRUCTION_LIST instruction*)
;
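To illustrate the second pass mentioned in the question, here is a rough sketch (not part of the answer above) that walks the resulting tree and reads the scope information back out of each NodeWithScope:
// Depth-first walk over the AST built with NodeWithScopeAdaptor, printing the
// symbols recorded for every node that carries a scope. Call it as dumpScopes(t)
// after step 6 of the test rig above.
static void dumpScopes(NodeWithScope node) {
    if (node.symbols != null) {
        System.out.println(node.name + " -> " + node.symbols);
    }
    for (int i = 0; i < node.getChildCount(); i++) {
        dumpScopes((NodeWithScope) node.getChild(i));
    }
}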