Retrieve token list from ParserRuleContext - antlr

Main question
Is there an easy way to get a list of tokens (ideally in the form of a TokenStream) from the parser rule class ParserRuleContext?
Related answers
In an answer to the question "Traversal of tokens using ParserRuleContext in listener - ANTLR4", this solution appeared:
ParserRuleContext pctx = ctx.getParent();
List<TerminalNode> nodes = pctx.getTokens(pctx.getStart(), pctx.getStop());
But there is no method with signature ParserRuleContext::getTokens(Token, Token) in ANTLRv4.
My solution
I thought about retrieving a list of tokens from the TokenStream by using the TokenStream::get(int index) method, iterating index over the range between the start and stop token indices of the given ParserRuleContext.
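A minimal sketch of that idea (ctx and tokens are assumed to be the ParserRuleContext and the parser's TokenStream from the surrounding parse):
// Sketch only: `ctx` and `tokens` come from the surrounding parse.
List<Token> ruleTokens = new ArrayList<>();
for (int i = ctx.getStart().getTokenIndex(); i <= ctx.getStop().getTokenIndex(); i++) {
    ruleTokens.add(tokens.get(i));
}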
Additional question
Is there a way to get a subset of tokens from a TokenStream in the form of another TokenStream?

So, I had overlooked some classes and their interfaces in the ANTLR v4 API.
Answer to main question
The solution proposed above is correct. Also, the BufferedTokenStream and CommonTokenStream classes have the method public List<Token> getTokens(int start, int stop), which retrieves the list of tokens in a given index range (in particular, the range between the start and stop tokens of a ParserRuleContext).
Answer to additional question
You can use the ListTokenSource class, which implements the TokenSource interface, and then construct a CommonTokenStream from that ListTokenSource.
Code example of ParserRuleRewriter
I encapsulated the ideas above in a small code example featuring ParserRuleRewriter - a TokenStreamRewriter wrapper that rewrites only the given parser rule. In the code, the tokenStream parameter is the token stream of the full program.
import org.antlr.v4.runtime.*;
import java.util.List;

public class ParserRuleRewriter {

    private final TokenStreamRewriter rewriter;

    public ParserRuleRewriter(ParserRuleContext parserRule, CommonTokenStream tokenStream) {
        // Extract the tokens covered by the rule from the full token stream.
        Token start = parserRule.getStart();
        Token stop = parserRule.getStop();
        List<Token> ruleTokens = tokenStream.getTokens(start.getTokenIndex(), stop.getTokenIndex());

        // Wrap them in a fresh token stream and rewrite only that stream.
        ListTokenSource tokenSource = new ListTokenSource(ruleTokens);
        CommonTokenStream commonTokenStream = new CommonTokenStream(tokenSource);
        commonTokenStream.fill();
        rewriter = new TokenStreamRewriter(commonTokenStream);
    }

    public void replace(Token token, ParserRuleContext rule) {
        rewriter.replace(token, rule.getText());
    }

    @Override
    public String toString() {
        return rewriter.getText();
    }
}
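A quick usage sketch (the names ctx, tokens, tok and otherRule are placeholders, not part of the class above):
// Hypothetical usage: `ctx` is some ParserRuleContext from the parse,
// `tokens` is the full program's CommonTokenStream.
ParserRuleRewriter ruleRewriter = new ParserRuleRewriter(ctx, tokens);
ruleRewriter.replace(tok, otherRule);        // swap one token for another rule's text
System.out.println(ruleRewriter.toString()); // rewritten text of just this rule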

Related

ANTLR4 store all tokens in specific channel

I have a lexer which puts every token the parser is interested in into the default channel and all comment tokens into channel 1.
The default channel is used to create the actual tree, while the comment channel is used to separate those tokens out and to store all comments.
In chapter 12.1, pp. 206-208, of The Definitive ANTLR4 Reference there is a comparable situation where comment tokens are shifted inside the token stream. The approach presented there is to read out the comment channel in an exit method inside the parser.
In my opinion this is a very rough option for my problem, because I don't want to overwhelm my listener with those backward-looking operations. Is there a possibility to override a method that puts tokens into the comment channel?
It looks like you misunderstand how channels work in ANTLR. What happens is that the lexer, as it encounters a token, assigns it the default channel (just a number) during initialization of the token. That value is only changed when the lexer finds a -> channel() command or you change it explicitly in a lexer action. So there is nothing to do in a listener or whatever to filter out such tokens.
Later, when you want to get all tokens "in" a given channel (i.e. all tokens that have a specific channel number assigned), you can just iterate over all tokens returned by your token stream and compare the channel value. Alternatively, you can create a new CommonTokenStream instance and pass it the channel you are interested in. It will then only give you the tokens from that channel (it uses a token source, e.g. a lexer, to get the actual tokens and cache them).
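For illustration, a minimal Java sketch of both options, assuming comments land on channel 1:
// Option 1: fill the stream, then filter by channel number.
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
List<Token> comments = new ArrayList<Token>();
for (Token t : tokens.getTokens()) {
    if (t.getChannel() == 1) { // assumed comment channel
        comments.add(t);
    }
}

// Option 2: a stream that only serves tokens from channel 1
// (`commentLexer` is a fresh lexer instance; a token source can't be shared).
CommonTokenStream commentStream = new CommonTokenStream(commentLexer, 1);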
I found out that there is an easy way to override how tokens are created. To do this, one can override a method inside the CommonTokenFactory and give the factory to the lexer. At that point I can check the channel and push the tokens into a separate collection.
In my opinion this is a little bit hacky, but I do not need to iterate over the whole CommonTokenStream later on.
This code only demonstrates the idea behind it (in C#).
internal class HeadAnalyzer
{
    #region Methods

    internal void AnalyzeHeader(Stream headerSourceStream)
    {
        var antlrFileStream = new AntlrInputStream(headerSourceStream);
        var mcrLexer = new MCRLexer(antlrFileStream);

        // Plug in the custom token factory so comments get collected during lexing.
        var commentSaverTokenFactory = new MyTokenFactory();
        mcrLexer.TokenFactory = commentSaverTokenFactory;

        var commonTokenStream = new CommonTokenStream(mcrLexer);
        var mcrParser = new MCRParser(commonTokenStream);
        mcrParser.AddErrorListener(new DefaultErrorListener());

        MCRParser.ProgramContext tree;
        try
        {
            tree = mcrParser.program(); // create the tree
        }
        catch (SyntaxErrorException syntaxErrorException)
        {
            throw new NotImplementedException();
        }

        var headerContext = new HeaderContext();
        var headListener = new HeadListener(headerContext);
        ParseTreeWalker.Default.Walk(headListener, tree);

        var comments = commentSaverTokenFactory.CommentTokens; // contains all comments :)
    }

    #endregion
}

internal class MyTokenFactory : CommonTokenFactory
{
    internal readonly List<CommonToken> CommentTokens = new List<CommonToken>();

    public override CommonToken Create(Tuple<ITokenSource, ICharStream> source, int type,
        string text, int channel, int start, int stop, int line, int charPositionInLine)
    {
        var token = base.Create(source, type, text, channel, start, stop, line, charPositionInLine);

        // Collect every token the lexer put on the comment channel (channel 1).
        if (token.Channel == 1)
        {
            CommentTokens.Add(token);
        }

        return token;
    }
}
Maybe there are some better approaches. For my use case it works as expected.

How to implement just some basic keyword highlighting in a text editor?

I'm a novice programmer trying to learn plug-in development. I'd like to upgrade the sample XML editor so that some words like "cat", "dog", "hamster", "rabbit" and "bird" would be highlighted when they appear in an XML file (it's just for learning purposes). Can anyone give me some implementation tips or suggestions? I am clueless.. (But I am carrying out my research on this as well, I'm not being lazy. You have my word.) Thanks in advance.
You can detect words in the plain text part of the XML by modifying the sample XML editor as follows.
We can use the provided WordRule class to detect the words. The XMLScanner class, which scans the plain text, needs to be updated to include the word rule:
public XMLScanner(final ColorManager manager)
{
    IToken procInstr = new Token(new TextAttribute(manager.getColor(IXMLColorConstants.PROC_INSTR)));

    // Word rule that maps each keyword to the token used for highlighting.
    WordRule words = new WordRule(new WordDetector());
    words.addWord("cat", procInstr);
    words.addWord("dog", procInstr);
    // TODO add more words here

    IRule[] rules = new IRule[] {
        // Add rule for processing instructions
        new SingleLineRule("<?", "?>", procInstr),
        // Add generic whitespace rule.
        new WhitespaceRule(new XMLWhitespaceDetector()),
        // Word rules
        words
    };

    setRules(rules);
}
I have used the existing processing instruction token here to reduce the amount of new code, but you should define a new color and use a new token.
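For instance, a hypothetical variant with its own token (ANIMAL_KEYWORD is an assumed new constant you would add to IXMLColorConstants yourself; it is not part of the sample editor):
// Hypothetical: ANIMAL_KEYWORD is a new RGB constant you define yourself.
IToken animalToken = new Token(
        new TextAttribute(manager.getColor(IXMLColorConstants.ANIMAL_KEYWORD)));
words.addWord("cat", animalToken);
words.addWord("dog", animalToken);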
The WordRule constructor requires an IWordDetector; we can use a very simple detector here:
class WordDetector implements IWordDetector
{
    @Override
    public boolean isWordStart(final char c)
    {
        return Character.isLetter(c);
    }

    @Override
    public boolean isWordPart(final char c)
    {
        return Character.isLetter(c);
    }
}
This is just accepting letters in words.

Lucene 4.1: How to split words that contain "dots" when indexing?

I'm trying to figure out what I should do to index my keywords that contain ".".
ex : this.name
I want to index the terms : this and name in my index.
I use the StandardAnalyzer. I tried extending WhitespaceTokenizer or TokenFilter, but I'm not sure I'm going in the right direction.
If I use the StandardAnalyzer, I'll obtain "this.name" as a keyword, and that's not what I want, but the analyzer does the rest correctly for me.
You can put a CharFilter in front of StandardTokenizer that converts periods and underscores to spaces. MappingCharFilter will work.
Here's MappingCharFilter added to a stripped-down StandardAnalyzer (see the original 4.1 version here):
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.util.StopwordAnalyzerBase;
import org.apache.lucene.util.Version;
import java.io.IOException;
import java.io.Reader;

public final class MyAnalyzer extends StopwordAnalyzerBase {

    private int maxTokenLength = 255;

    public MyAnalyzer() {
        super(Version.LUCENE_41, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    }

    @Override
    protected TokenStreamComponents createComponents(final String fieldName, final Reader reader) {
        final StandardTokenizer src = new StandardTokenizer(matchVersion, reader);
        src.setMaxTokenLength(maxTokenLength);
        TokenStream tok = new StandardFilter(matchVersion, src);
        tok = new LowerCaseFilter(matchVersion, tok);
        tok = new StopFilter(matchVersion, tok, stopwords);
        return new TokenStreamComponents(src, tok) {
            @Override
            protected void setReader(final Reader reader) throws IOException {
                src.setMaxTokenLength(MyAnalyzer.this.maxTokenLength);
                super.setReader(reader);
            }
        };
    }

    // The char filter runs before the tokenizer: map '.' and '_' to spaces.
    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
        builder.add(".", " ");
        builder.add("_", " ");
        NormalizeCharMap normMap = builder.build();
        return new MappingCharFilter(normMap, reader);
    }
}
Here's a quick test to demonstrate it works:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.BaseTokenStreamTestCase;

public class TestMyAnalyzer extends BaseTokenStreamTestCase {

    private Analyzer analyzer = new MyAnalyzer();

    public void testPeriods() throws Exception {
        BaseTokenStreamTestCase.assertAnalyzesTo(
            analyzer,
            "this.name; here.i.am; sentences ... end with periods.",
            new String[] { "name", "here", "i", "am", "sentences", "end", "periods" });
    }

    public void testUnderscores() throws Exception {
        BaseTokenStreamTestCase.assertAnalyzesTo(
            analyzer,
            "some_underscore_term _and____ stuff that is_not in it",
            new String[] { "some", "underscore", "term", "stuff" });
    }
}
If I understand you correctly, you need to use a tokenizer that removes dots -- that is, any name that contains a dot should be split at that point ("here.i.am" becomes "here" + "i" + "am").
You are getting caught by the behavior documented here:
However, a dot that's not followed by whitespace is considered part of a token.
StandardTokenizer introduces more complex parsing rules than you may be looking for. This one, in particular, is intended to prevent tokenization of URLs, IPs, identifiers, etc. A simpler implementation, like LetterTokenizer, might suit your needs.
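For instance, a minimal sketch of an Analyzer built on LetterTokenizer (assuming Lucene 4.1), which splits on any non-letter character, so "this.name" becomes "this" and "name":
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.LetterTokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.util.Version;
import java.io.Reader;

public final class LetterOnlyAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // LetterTokenizer emits maximal runs of letters, dropping '.', '_', digits, etc.
        LetterTokenizer src = new LetterTokenizer(Version.LUCENE_41, reader);
        return new TokenStreamComponents(src,
            new LowerCaseFilter(Version.LUCENE_41, src));
    }
}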
If that doesn't really suit your needs (and it might well turn out to be throwing the baby out with the bathwater), then you may need to modify StandardTokenizer yourself, which is explicitly encouraged by the Lucene docs:
Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.
Sebastien Dionne: I didn't understand how to split a word - do I have to parse the document char by char?
Sebastien Dionne: I still want to know how to split a token into multiple parts, and index them all
You may have to write a custom analyzer.
Analyzer is a combination of Tokenizer and possibly a chain of TokenFilter instances.
Tokenizer: Takes in the input text you pass, probably as a java.io.Reader. It JUST breaks the text down; it doesn't alter it, just breaks it down.
TokenFilter: Takes in the tokens emitted by the Tokenizer, adds/removes/alters tokens, and emits them one by one until all are finished.
If it replaces a token with multiple tokens based on requirements, it buffers them all and emits them one by one to the indexer.
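As a hedged sketch of that buffering pattern (not a drop-in solution; offsets and position increments are left untouched for brevity), a TokenFilter that splits terms on dots might look like:
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

public final class DotSplitFilter extends TokenFilter {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final Deque<String> pending = new ArrayDeque<String>();

    public DotSplitFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        // Emit previously buffered parts first, one per call.
        if (!pending.isEmpty()) {
            termAtt.setEmpty().append(pending.poll());
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        String term = termAtt.toString();
        if (term.indexOf('.') >= 0) {
            // Buffer all non-empty parts, then emit the first one now.
            for (String part : term.split("\\.")) {
                if (!part.isEmpty()) {
                    pending.add(part);
                }
            }
            if (!pending.isEmpty()) {
                termAtt.setEmpty().append(pending.poll());
            }
        }
        return true;
    }
}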
You may check the following resource; unfortunately, you may have to sign up for a trial membership.
By writing a custom analyzer, you can break down the text the way you want. You may even reuse some existing components like LowerCaseFilter. Fortunately, with Lucene you can come up with an Analyzer that serves your purpose if you can't find one built in or on the web.
" Writing Custom Filters: Lucene in Action 2"

How to add options for Analyzers in Apache Lucene?

Lucene has Analyzers that basically tokenize and filter the corpus when indexing. Operations include converting tokens to lowercase, stemming, removing stopwords, etc.
I'm running an experiment where I want to try all possible combinations of analysis operations: stemming only, stopping only, stemming and stopping, ...
In total, there are 36 combinations that I want to try.
How can I easily and gracefully do this?
I know that I can extend the Analyzer class and implement the tokenStream() function to create my own Analyzer:
public class MyAnalyzer extends Analyzer {
    public TokenStream tokenStream(String field, final Reader reader) {
        return new NameFilter(
            new CaseNumberFilter(
                new StopFilter(
                    new LowerCaseFilter(
                        new StandardFilter(
                            new StandardTokenizer(reader))),
                    StopAnalyzer.ENGLISH_STOP_WORDS)));
    }
}
What I'd like to do is write one such class, which can somehow take boolean values for each of the possible operations (doStopping, doStemming, etc.). I don't want to have to write 36 different Analyzer classes that each perform one of the 36 combinations. What makes it difficult is the way the filters are all combined together in their constructors.
Any ideas on how to do this gracefully?
EDIT: By "gracefully", I mean that I can easily create a new Analyzer in some sort of loop:
analyzer = new MyAnalyzer(doStemming, doStopping, ...)
where doStemming and doStopping change with each loop iteration.
Solr solves this problem by using Tokenizer and TokenFilter factories. You could do the same, for example:
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import java.io.Reader;
import java.util.Arrays;
import java.util.List;

public interface TokenizerFactory {
    Tokenizer newTokenizer(Reader reader);
}

public interface TokenFilterFactory {
    TokenFilter newTokenFilter(TokenStream source);
}

public class ConfigurableAnalyzer {

    private final TokenizerFactory tokenizerFactory;
    private final List<TokenFilterFactory> tokenFilterFactories;

    public ConfigurableAnalyzer(TokenizerFactory tokenizerFactory, TokenFilterFactory... tokenFilterFactories) {
        this.tokenizerFactory = tokenizerFactory;
        this.tokenFilterFactories = Arrays.asList(tokenFilterFactories);
    }

    // Build the chain: tokenizer first, then each filter in order.
    public TokenStream tokenStream(String field, Reader source) {
        TokenStream sink = tokenizerFactory.newTokenizer(source);
        for (TokenFilterFactory tokenFilterFactory : tokenFilterFactories) {
            sink = tokenFilterFactory.newTokenFilter(sink);
        }
        return sink;
    }
}
This way, you can configure your analyzer by passing a factory for one tokenizer and 0 to n filters as constructor arguments.
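A hedged usage sketch: enumerating combinations by toggling which filter factories are included. StandardTokenizerFactory, StopFilterFactory and StemFilterFactory are assumed implementations of the interfaces above, not existing classes:
// All factory names here are assumed implementations, written by you.
List<TokenFilterFactory> optional = Arrays.asList(
        new StopFilterFactory(), new StemFilterFactory());

for (int mask = 0; mask < (1 << optional.size()); mask++) {
    List<TokenFilterFactory> chosen = new ArrayList<TokenFilterFactory>();
    for (int i = 0; i < optional.size(); i++) {
        if ((mask & (1 << i)) != 0) {
            chosen.add(optional.get(i));
        }
    }
    ConfigurableAnalyzer analyzer = new ConfigurableAnalyzer(
            new StandardTokenizerFactory(),
            chosen.toArray(new TokenFilterFactory[0]));
    // ... run one experiment configuration with this analyzer ...
}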
Add some class variables to the custom Analyzer class which can be easily set and unset on the fly. Then, in the tokenStream() function, use those variables to determine which filters to perform.
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import java.io.Reader;
import java.util.Set;

public class MyAnalyzer extends Analyzer {

    private Set customStopSet;
    public static final String[] STOP_WORDS = ...;

    private boolean doStemming = false;
    private boolean doStopping = false;

    public MyAnalyzer() {
        super();
        customStopSet = StopFilter.makeStopSet(STOP_WORDS);
    }

    public void setDoStemming(boolean val) {
        this.doStemming = val;
    }

    public void setDoStopping(boolean val) {
        this.doStopping = val;
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        // First, convert to lower case
        TokenStream out = new LowerCaseTokenizer(reader);
        if (this.doStopping) {
            out = new StopFilter(true, out, customStopSet);
        }
        if (this.doStemming) {
            out = new PorterStemFilter(out);
        }
        return out;
    }
}
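Usage would then look something like this (a sketch of the loop from the question, using the setters above):
// Iterate over all four combinations of this two-flag example.
for (boolean stop : new boolean[] { false, true }) {
    for (boolean stem : new boolean[] { false, true }) {
        MyAnalyzer analyzer = new MyAnalyzer();
        analyzer.setDoStopping(stop);
        analyzer.setDoStemming(stem);
        // ... index the corpus with this analyzer ...
    }
}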
There is one gotcha: LowerCaseTokenizer takes as input the reader variable, and returns a TokenStream. This is fine for the following filters (StopFilter, PorterStemFilter), because they take TokenStreams as input and return them as output, and so we can chain them together nicely. However, this means you can't have a filter before the LowerCaseTokenizer that returns a TokenStream. In my case, I wanted to split camelCase words into parts, and this has to be done before converting to lower case. My solution was to perform the splitting manually in the custom Indexer class, so by the time MyAnalyzer sees the text, it has already been split.
(I have also added a boolean flag to my custom Indexer class, so now both can work based solely on flags.)
Is there a better answer?

ANTLR forward references

I need to create a grammar for a language with forward references. I think that the easiest way to achieve this is to make several passes on the generated AST, but I need a way to store symbol information in the tree.
Right now my parser correctly generates an AST and computes scopes of the variables and function definitions. The problem is, I don't know how to save the scope information into the tree.
Fragment of my grammar:
composite_instruction
scope JScope;
@init {
    $JScope::symbols = new ArrayList();
    $JScope::name = "level "+ $JScope.size();
}
@after {
    System.out.println("code block scope " + $JScope::name + " = " + $JScope::symbols);
}
    : '{' instruction* '}' -> ^(INSTRUCTION_LIST instruction*)
    ;
I would like to put a reference to the current scope into the tree, something like:
: '{' instruction* '}' -> ^(INSTRUCTION_LIST instruction* {$JScope::symbols})
Is it even possible? Is there any other way to store current scopes in a generated tree? I can generate the scope info in a tree grammar, but it won't change anything, because I still have to store it somewhere for the second pass on the tree.
To my knowledge, the syntax for the rewrite rules doesn't allow for directly assigning values as your tentative snippet suggests. This is in part due to the fact that the parser wouldn't really know to which part of the tree/node the values should be added.
However, one of the cool features of ANTLR-produced ASTs is that the parser makes no assumptions about the type of the nodes. One just needs to implement a TreeAdaptor, which serves as a factory for new nodes and as a navigator of the tree structure. One can therefore stuff whatever info may be needed into the nodes, as explained below.
ANTLR provides a default tree node implementation, CommonTree, and in most cases (as in the situation at hand) we merely need to
subclass CommonTree by adding some custom fields to it
subclass the CommonTreeAdaptor to override its create() method, i.e. the way it produces new nodes.
but one could also create a novel type of node altogether, for some odd graph structure or whatnot. For the case at hand, the following should be sufficient (adapt for the specific target language if this isn't Java):
import org.antlr.runtime.tree.*;
import org.antlr.runtime.Token;
import java.util.ArrayList;

public class NodeWithScope extends CommonTree {
    /* Just declare the extra fields for the node */
    public ArrayList symbols;
    public String name;
    public Object whatever_else;

    public NodeWithScope(Token t) {
        super(t);
    }
}

/* TreeAdaptor: we just need to override the create method */
class NodeWithScopeAdaptor extends CommonTreeAdaptor {
    public Object create(Token standardPayload) {
        return new NodeWithScope(standardPayload);
    }
}
One then needs to slightly modify the way the parsing process is started, so that ANTLR (or rather the ANTLR-produced parser) knows to use the NodeWithScopeAdaptor rather than CommonTree.
(Step 4.1 below; the rest is a rather standard ANTLR test rig.)
// ***** Typical ANTLR pipe rig *****
// ** 1. Input stream
ANTLRInputStream input = new ANTLRInputStream(my_input_file);
// ** 2. Lexer
MyGrammarLexer lexer = new MyGrammarLexer(input);
// ** 3. Token stream produced by the lexer
CommonTokenStream tokens = new CommonTokenStream(lexer);
// ** 4. Parser
MyGrammarParser parser = new MyGrammarParser(tokens);
// 4.1 !!! Specify the TreeAdaptor
NodeWithScopeAdaptor adaptor = new NodeWithScopeAdaptor();
parser.setTreeAdaptor(adaptor); // use my adaptor
// ** 5. Start the process by invoking the root rule
MyGrammarParser.myTopRule_return r = parser.myTopRule();
// ** 6. AST tree
NodeWithScope t = (NodeWithScope) r.getTree();
// ** 7. etc. Parse the tree or do whatever is needed on it.
Finally, your grammar would have to be adapted with something akin to what follows.
(Note that the node [for the current rule] is only available in the @after section. It may however reference any token attribute and other contextual variables from the grammar level, using the usual $rule.attribute notation.)
composite_instruction
scope JScope;
@init {
    $JScope::symbols = new ArrayList();
    $JScope::name = "level "+ $JScope.size();
}
@after {
    ($composite_instruction.tree).symbols = $JScope::symbols;
    ($composite_instruction.tree).name = $JScope::name;
    ($composite_instruction.tree).whatever_else
        = new myFancyObject($x.Text, $y.line, whatever, blah);
}
    : '{' instruction* '}' -> ^(INSTRUCTION_LIST instruction*)
    ;
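For the second pass, a hedged sketch of a plain Java walk that reads the stored scope info back out of the tree (instead of a tree grammar):
// Sketch only: resolve forward references in a second pass over the AST.
void resolvePass(NodeWithScope node) {
    if (node.symbols != null) {
        // Look up forward-referenced names against this scope's symbols.
        System.out.println("scope " + node.name + " -> " + node.symbols);
    }
    for (int i = 0; i < node.getChildCount(); i++) {
        resolvePass((NodeWithScope) node.getChild(i));
    }
}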