Is there a way to edit nodes on an ANTLR ParseTree?

I am recursively traversing an ANTLR parse tree and want to edit the text of the TerminalNodes in the tree. I want to be able to do this for any ParseTree, without writing a specific visitor for each ParseTree I may encounter.
I have looked through The Definitive ANTLR 4 Reference and seen that ANTLR has no direct support for tree rewriting. I am looking for any possible workarounds or alternative solutions.
private void editTree(ParseTree tree) {
    for (int i = 0; i < tree.getChildCount(); i++) {
        ParseTree child = tree.getChild(i);
        if (child instanceof TerminalNode) {
            // Edit child's text
        } else {
            editTree(child);
        }
    }
}

TerminalNode has a getSymbol() method, which returns the lexed token. This is usually a CommonToken instance, which lets you set the text and other properties such as the line number and type. ParseTree.getText() does nothing more than ask the symbol for its text (which in turn is either what you set or what came from the input stream).
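A minimal sketch of that approach, assuming the default token factory (which produces mutable CommonToken instances); the uppercasing here is just an illustrative edit, not part of the original question:

```java
import org.antlr.v4.runtime.CommonToken;
import org.antlr.v4.runtime.Token;
import org.antlr.v4.runtime.tree.ParseTree;
import org.antlr.v4.runtime.tree.TerminalNode;

class TreeEditor {
    // Recursively walk any ParseTree and rewrite each terminal's text in place.
    static void editTree(ParseTree tree) {
        for (int i = 0; i < tree.getChildCount(); i++) {
            ParseTree child = tree.getChild(i);
            if (child instanceof TerminalNode) {
                Token symbol = ((TerminalNode) child).getSymbol();
                // The default token factory produces CommonToken, which is mutable.
                if (symbol instanceof CommonToken) {
                    ((CommonToken) symbol).setText(symbol.getText().toUpperCase());
                }
            } else {
                editTree(child);
            }
        }
    }
}
```

Because ParseTree.getText() concatenates the symbols' text, any later getText() call on the tree reflects the change.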

Related

Does C++/WinRT provide mapping from enum symbol to string name?

I'm using C++/WinRT. The projection includes many enums. I find myself building my own table of enum values to string literals. This is not a big deal for enums with only a few defined values, but it's a pain when there are a lot of them.
What I really want is some form of compile-time or run-time reflection that converts an enum value into the string representation of the compile-time name that represents a given enum value. The code snippet below demonstrates. How can this be automated?
std::wostream& operator<< (
    std::wostream& wout,
    winrt::Windows::Graphics::DirectX::DirectXPixelFormat e)
{
    // https://learn.microsoft.com/en-us/uwp/api/windows.graphics.directx.directxpixelformat
    using winrt::Windows::Graphics::DirectX::DirectXPixelFormat;
    switch (e) {
    case DirectXPixelFormat::R8G8B8A8Int:
        wout << L"R8G8B8A8Int";
        break;
    case DirectXPixelFormat::B8G8R8A8UIntNormalized:
        wout << L"B8G8R8A8UIntNormalized";
        break;
    default:
        // TODO: Many enum cases are missing.
        // Find a way to compile-time-generate the string values from the enum value.
        wout << L"Unknown (" << std::to_wstring(static_cast<int32_t>(e)) << L")";
    }
    return wout;
}
I could build something that parses the winrt/*.h files to generate a header containing arrays of string literals, then #include the generated header. There probably exists sample code for doing this sort of thing unrelated to C++/WinRT. But maybe C++/WinRT ships metadata in the SDK which, combined with one of the C++/WinRT command-line tools, can easily do this for me? If it's there, I have not found it.
I did find the ApiInformation interface from winrt/Windows.Foundation.Metadata.h, as well as the explanation of "Version Adaptive Code". I had hoped that the OS COM interface behind ApiInformation had a way to query the name for an enum value, but I was unable to find an answer there.
https://learn.microsoft.com/en-us/uwp/api/Windows.Foundation.Metadata.ApiInformation
How about this:
https://learn.microsoft.com/en-us/windows/uwp/cpp-and-winrt-apis/move-to-winrt-from-cx#tostring
namespace winrt
{
    hstring to_hstring(StatusEnum status)
    {
        switch (status)
        {
        case StatusEnum::Success: return L"Success";
        case StatusEnum::AccessDenied: return L"AccessDenied";
        case StatusEnum::DisabledByPolicy: return L"DisabledByPolicy";
        default: return to_hstring(static_cast<int>(status));
        }
    }
}

After ANTLR4 recognizes an error, how to ask the application to automatically fix it?

We know ANTLR4 uses the sync-and-return recovery mechanism. For example, I have the following simple grammar:
grammar Hello;
r : prefix body ;
prefix: 'hello' ':';
body: INT ID ;
INT: [0-9]+ ;
ID : [a-z]+ ;
WS : [ \t\r\n]+ -> skip ;
I use the following listener to grab the input:
public class HelloLoader extends HelloBaseListener {
String input;
public void exitR(HelloParser.RContext ctx) {
input = ctx.getText();
}
}
The main method in my HelloRunner looks like this:
public static void main(String[] args) throws IOException {
CharStream input = CharStreams.fromStream(System.in);
HelloLexer lexer = new HelloLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
HelloParser parser = new HelloParser(tokens);
ParseTree tree = parser.r();
ParseTreeWalker walker = new ParseTreeWalker();
HelloLoader loader = new HelloLoader();
walker.walk(loader, tree);
System.out.println(loader.input);
}
Now if I enter a correct input "hello : 1 morning", I will get hello:1morning, as expected.
What about an incorrect input, "hello ; 1 morning"? I get the following output:
line 1:6 token recognition error at: ';'
line 1:8 missing ':' at '1'
hello<missing ':'>1morning
It seems that ANTLR4 automatically recognized the wrong token ";" and deleted it; however, it does not smartly add ":" in the corresponding place, but just reports <missing ':'>.
My question is: is there some way to solve this problem so that when ANTLR finds such an error it automatically fixes it? How can this be coded? Do we need other tools?
Typically the input for a parser comes from some source file that contains some code or text that (supposedly) conforms to some grammar. A typical use scenario for syntax errors is to alert the user so that the source file can be corrected.
As the comments noted, you can insert your own error recovery system, but before trying to insert a single token into the token stream and recover, please consider that it would be a very limited solution. Why? Consider a much richer grammar where, for a given token, many -- perhaps dozens or hundreds -- of other tokens can legally follow it. How would a single-token replacement strategy work then?
The hello.g4 example is the epitome of a trivial grammar, the "hello world" of ANTLR. But most of the time, for non-trivial grammars, the best we can do with imperfect syntax is to simply alert the user so the syntax can be corrected.

How to create suggestion messages with ANTLR?

I want to create an interactive version of the ANTLR calculator example, which tells the user what to type next. For instance, in the beginning, the ID, INT, NEWLINE, and WS tokens are possible. Ignoring WS, a suggestion message could be:
Type an identifier, a number, or newline.
After parsing a number, the message should be
Type +, -, *, or newline.
and so on. How to do this?
Edit
What I have tried so far:
private void accept(String sentence) {
    ANTLRInputStream is = new ANTLRInputStream(sentence);
    OperationLexer l = new OperationLexer(is);
    CommonTokenStream cts = new CommonTokenStream(l);
    final OperationParser parser = new OperationParser(cts);
    parser.addParseListener(new OperationBaseListener() {
        @Override
        public void enterEveryRule(ParserRuleContext ctx) {
            ATNState state = parser.getATN().states.get(parser.getState());
            System.out.print("RULE " + parser.ruleNames[state.ruleIndex] + " ");
            IntervalSet following = parser.getATN().nextTokens(state, ctx);
            for (Integer token : following.toList()) {
                System.out.print(parser.tokenNames[token] + " ");
            }
            System.out.println();
        }
    });
    parser.prog();
}
This prints the right suggestion for the first token, but for all other tokens it prints the current token. I guess capturing the state in enterEveryRule() is too early.
Accurately gathering this information in an LL(k) parser, where k>1, requires a thorough understanding of the parser internals. Several years ago, I faced this problem with ANTLR 3, and found the only real solution was so complex that it resulted in me becoming a co-author of ANTLR 4 specifically so I could handle this issue.
ANTLR (including ANTLR 4) disambiguates the parse tree during the parsing phase, which means if your grammar is not LL(1) then performing this analysis in the parse tree means you have already lost information necessary to be accurate. You'll need to write your own version of ParserATNSimulator (or a custom interpreter which wraps it) which does not lose the information.

Object oriented design patterns for parsing text files?

As part of a software package I'm working on, I need to implement a parser for application-specific text files. I've already specified the grammar for these files on paper, but am having a hard time translating it into easily readable/updatable code (right now it just passes each line through a huge number of switch statements).
So, are there any good design patterns for implementing a parser in a Java style OO environment?
An easy way to break a massive switch into an OO design is to have (pseudocode):
class XTokenType {
    public bool isToken(string data);
}

class TokenParse {
    public void parseTokens(string data) {
        for each step in data {
            for each tokenType in tokenTypes {
                if (tokenType.isToken(step)) {
                    parsedTokens[len] = new tokenType(step);
                }
                ...
            }
        }
        ...
    }
}
Here you're breaking each switch statement into a method on that token object that detects whether the next bit of the string is of that token type.
Previously:
class TokenParse {
    public void parseTokens(string data) {
        for each step in data {
            switch (step) {
            case x:
                ...
            case y:
                ...
            ...
            }
        }
        ...
    }
}
One suggestion is to create a property file where you define the rules. Load it at run time and use if/else logic (switch statements do the same thing internally). This way, if you want to change some parsing rules you change the .properties file, not the code. :)
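A sketch of that idea using java.util.Properties; the rule format (token name mapped to a regex) is made up for illustration, and the Properties object stands in for a .properties file loaded at run time:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;
import java.util.regex.Pattern;

class PropertyRuleParser {
    private final Map<String, Pattern> rules = new LinkedHashMap<>();

    // The rules would normally come from Properties.load(...) on a .properties
    // file; any Properties instance works the same way.
    PropertyRuleParser(Properties props) {
        for (String name : props.stringPropertyNames()) {
            rules.put(name, Pattern.compile(props.getProperty(name)));
        }
    }

    // Returns the name of the first rule that matches the token, or "UNKNOWN".
    String classify(String token) {
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            if (rule.getValue().matcher(token).matches()) {
                return rule.getKey();
            }
        }
        return "UNKNOWN";
    }
}
```

Changing the recognized token types is then an edit to the rules file, with no recompilation.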
You need to learn how to express context-free grammars. You should be looking at the GoF Interpreter pattern and parser generators like Bison, ANTLR, lex/yacc, etc.

Lucene stop phrases filter

I'm trying to write a filter for Lucene, similar to StopWordsFilter (thus implementing TokenFilter), but I need to remove phrases (sequences of tokens) instead of single words.
The "stop phrases" are represented themselves as a sequence of tokens: punctuation is not considered.
I think I need to do some kind of buffering of the tokens in the token stream, and when a full phrase is matched, I discard all tokens in the buffer.
What would be the best approach to implement a "stop phrases" filter given a stream of words like Lucene's TokenStream?
In this thread I was given a solution: use Lucene's CachingTokenFilter as a starting point:
That solution was actually the right way to go.
EDIT: I fixed the dead link. Here is a transcript of the thread.
MY QUESTION:
I'm trying to implement a "stop phrases filter" with the new TokenStream
API.
I would like to be able to peek into N tokens ahead, see if the current
token + N subsequent tokens match a "stop phrase" (the set of stop phrases
are saved in a HashSet), then discard all these tokens when they match a
stop phrase, or keep them all if they don't match.
For this purpose I would like to use captureState() and then restoreState()
to get back to the starting point of the stream.
I tried many combinations of these API. My last attempt is in the code
below, which doesn't work.
static private HashSet<String> m_stop_phrases = new HashSet<String>();
static private int m_max_stop_phrase_length = 0;
...
public final boolean incrementToken() throws IOException {
    if (!input.incrementToken())
        return false;
    Stack<State> stateStack = new Stack<State>();
    StringBuilder match_string_builder = new StringBuilder();
    int skippedPositions = 0;
    boolean is_next_token = true;
    while (is_next_token && match_string_builder.length() < m_max_stop_phrase_length) {
        if (match_string_builder.length() > 0)
            match_string_builder.append(" ");
        match_string_builder.append(termAtt.term());
        skippedPositions += posIncrAtt.getPositionIncrement();
        stateStack.push(captureState());
        is_next_token = input.incrementToken();
        if (m_stop_phrases.contains(match_string_builder.toString())) {
            // Stop phrase is found: skip the number of tokens
            // without restoring the state
            posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement() + skippedPositions);
            return is_next_token;
        }
    }
    // No stop phrase found: restore the stream
    while (!stateStack.empty())
        restoreState(stateStack.pop());
    return true;
}
Which is the correct direction I should look into to implement my "stop
phrases" filter?
CORRECT ANSWER:
restoreState only restores the token contents, not the complete stream. So you cannot roll back the token stream (this was also not possible with the old API). Because of this, the while loop at the end of your code does not work as you expect. You may use CachingTokenFilter, which can be reset and consumed again, as a source for further work.
You'll really have to write your own Analyzer, I should think, since whether or not some sequence of words is a "phrase" depends on cues, such as punctuation, that are not available after tokenization.
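Lucene's API aside, the buffer-and-match idea from the thread can be sketched with plain collections; the class and method names here are illustrative, not Lucene's. Tokens are buffered up to the longest stop-phrase length; a window that matches a stop phrase is discarded wholesale, otherwise the window slides by one token:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StopPhraseRemover {
    private final Set<List<String>> stopPhrases = new HashSet<>();
    private int maxPhraseLength = 0;

    void addStopPhrase(String... tokens) {
        stopPhrases.add(Arrays.asList(tokens));
        maxPhraseLength = Math.max(maxPhraseLength, tokens.length);
    }

    // Try the longest candidate window first; on a match, drop the whole
    // phrase, otherwise keep the current token and slide forward by one.
    List<String> filter(List<String> tokens) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            boolean matched = false;
            for (int len = Math.min(maxPhraseLength, tokens.size() - i); len >= 1; len--) {
                if (stopPhrases.contains(tokens.subList(i, i + len))) {
                    i += len;  // discard the matched phrase
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                out.add(tokens.get(i++));  // keep this token
            }
        }
        return out;
    }
}
```

In a real TokenFilter the buffered lookahead would come from CachingTokenFilter (or captured states), as the answer suggests, since the underlying stream cannot be rolled back.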