Does the Antlr4 generated code include anything like an unparser that can use the grammer and the parser tree to reconstruct the original source? How would I invoke that if it exists? I ask because it might be useful in some application and debugging.
It really depends what do you want to achieve. Remember that Lexer tokens which are put onto HIDDEN channel (like comments and which spaces) and are not parsed at all.
The approach I used was
use additional user specific information in lexer token class
parse the source and get AST
rewind the lexer(token source) and loop over all Lexem-es, including the hidden ones
for each hidden Lexeme, append the reference to the corresponding AST leaf
so every AST leaf "know" which white-space Lexemes are following it
recursively traverse the AST and print all the Lexemes
Yes! ANTLR's infrastructure (usually) makes the original source data available.
In the default case, you will be using a CommonTokenStream. This inherits from BufferedTokenStream, which offers a whole slew of methods for getting at stuff.
Methods getHiddenTokensOnLeft (and ...Right) will get you lists of tokens not appearing in the DEFAULT stream. Those tokens will reveal their source text using getText().
What I find even more convenient is BufferedTokenStream.getText(interval), which will give you the text (including hidden) on an Interval, which you can get from your tree element (RuleContext).
To make use of your CommonTokenStream and its methods, you just need to pass it from where you create it and set up your parser to whatever class is examining the parse tree, such as your XXXBaseListener - I just gave my Listener a constructor that stores the CommonTokenStream as an instance field.
So when I want the complete text for a rule ctx, I use this little method:
String originalString(ParserRuleContext ctx) {
return this.tokenStream.getText(ctx.getSourceInterval());
}
Alternatively, the tokens also contain line numbers and offsets, if you want to fiddle with those.
Related
And now, I am trying to rewrite my java application in Kotlin. And then, I met the log statement, like
log.info("do the print thing for {}", arg);
So I have two ways to do the log things in Kotlin like log.info("do the print thing for {}", arg) and log.info("do the print thing for $arg"). The 1st is delegate format to framework like Slf4j or Log4j; the 2nd is using Kotlin string template.
So what's the difference and which one's performance is better?
In general, these two ways produce the same log, unless the logging library is also configured to localise the message and parameters when formatting the message, which Kotlin's string interpolation does not do at all.
The crucial difference lies in the performance, when you turn off logging (at that particular level). As SLF4J's FAQ says:
There exists a very convenient alternative based on message formats.
Assuming entry is an object, you can write:
Object entry = new SomeObject();
logger.debug("The entry is {}.", entry);
After evaluating whether to log or not, and only if the decision is
affirmative, will the logger implementation format the message and
replace the '{}' pair with the string value of entry. In other words,
this form does not incur the cost of parameter construction in case
the log statement is disabled.
The following two lines will yield the exact same output. However, the
second form will outperform the first form by a factor of at least 30,
in case of a disabled logging statement.
logger.debug("The new entry is "+entry+".");
logger.debug("The new entry is {}.", entry);
Basically, if the logging is disabled, the message won't be constructed if you use parameterised logging. If you use string interpolation however, the message will always be constructed.
Note that Kotlin's string interpolation compiles to something similar to what a series of string concatenation (+) in Java compiles to (though this might change in the future).
"foo $bar baz"
is translated into:
StringBuilder().append("foo ").append(bar).append(" baz").toString()
See also: Unable to understand why to use parameterized logging
When I create a pipeline with the default tokenizer for say English, I can then call the method for adding a special case:
tokenizer.add_special_case("don't", case)
The tokenizer will happily accept a special case that contains whitespace:
tokenizer.add_special_case("some odd case", case)
but it appears that does not actually change the behavior of the tokenizer or will never match?
More generally, what is the best way of extending an existing tokenizer so that the some patterns which normally would result in multiple tokens only create one token? For example something like [A-Za-z]+\([A-Za-z0-9]+\)[A-Za-z]+ should not result in three tokens because of the parentheses but in a single token, e.g. for asdf(a33b)xyz while the normal English rules should still apply if that pattern does not match.
Is this something that can be done somehow by augmenting an existing tokenizer or would I have to first tokenize, then find entities that match the corresponding token patterns and then merge the entity tokens?
As you found, Tokenizer.add_special_case() doesn't work for handling tokens that contain whitespace. That's for adding strings like "o'clock" and ":-)", or expanding e.g. "don't" to "do not".
Modifying the prefix, suffix and infix rules (either by setting them on an existing tokenizer or creating a new tokenizer with custom parameters) also doesn't work since those are applied after whitespace splitting.
To override the whitespace splitting behavior, you have four options:
Merge after tokenization. You use Retokenizer.merge(), or possibly merge_entities or merge_noun_chunks. The relevant documentation is here:
https://spacy.io/usage/linguistic-features#retokenization and https://spacy.io/api/pipeline-functions#merge_entities and https://spacy.io/api/pipeline-functions#merge_noun_chunks
This is your best bet for keeping as much of the default behavior as possible.
Subclass Tokenizer and override __call__. Sample code:
from spacy.tokenizer import Tokenizer
def custom_tokenizer(nlp):
class MyTokenizer(Tokenizer):
def __call__(self, string):
# do something before
doc = super().__call__(string)
# do something after
return doc
return MyTokenizer(
nlp.vocab,
prefix_search=nlp.tokenizer.prefix_search,
suffix_search=nlp.tokenizer.suffix_search,
infix_finditer=nlp.tokenizer.infix_finditer,
token_match=nlp.tokenizer.token_match,
)
# usage:
nlp.tokenizer = custom_tokenizer(nlp)
Implement a completely new tokenizer (without subclassing Tokenizer). Relevant docs here: https://spacy.io/usage/linguistic-features#custom-tokenizer-example
Tokenize externally and instantiate Doc with words. Relevant docs here: https://spacy.io/usage/linguistic-features#own-annotations
To answer the second part of your question, if you don't need to change whitespace splitting behavior, you have two other options:
Add to the default prefix, suffix and infix rules. The relevant documentation is here: https://spacy.io/usage/linguistic-features#native-tokenizer-additions
Note from https://stackoverflow.com/a/58112065/594211: "You can add new patterns without defining a custom tokenizer, but there's no way to remove a pattern without defining a custom tokenizer."
Instantiate Tokenizer with custom prefix, suffix and infix rules. The relevant documentation is here: https://spacy.io/usage/linguistic-features#native-tokenizers
To get the default rules, you read the existing tokenizer's attributes (as shown above) or use the nlp object’s Defaults. There are code samples for the latter approach in https://stackoverflow.com/a/47502839/594211 and https://stackoverflow.com/a/58112065/594211.
Use token match for combining multiple tokens to single one
I have an antlr generated Java parser that uses the C target and it works quite well. The problem is I also want it to parse erroneous code and produce a meaningful AST. If I feed it a minimal Java class with one import after which a semicolon is missing it produces two "Tree Error Node" objects where the "import" token and the tokens for the imported class should be.
But since it parses the following code correctly and produces the correct nodes for this code it must recover from the error by adding the semicolon or by resyncing. Is there a way to make antlr reflect this fixed input it produces internally in the AST? Or can I at least get the tokens/text that produced the "Tree Node Errors" somehow?
In the C targets
antlr3commontreeadaptor.c around line 200 the following fragment indicates that the C target only creates dummy error nodes so far:
static pANTLR3_BASE_TREE
errorNode (pANTLR3_BASE_TREE_ADAPTOR adaptor, pANTLR3_TOKEN_STREAM ctnstream, pANTLR3_COMMON_TOKEN startToken, pANTLR3_COMMON_TOKEN stopToken, pANTLR3_EXCEPTION e)
{
// Use the supplied common tree node stream to get another tree from the factory
// TODO: Look at creating the erronode as in Java, but this is complicated by the
// need to track and free the memory allocated to it, so for now, we just
// want something in the tree that isn't a NULL pointer.
//
return adaptor->createTypeText(adaptor, ANTLR3_TOKEN_INVALID, (pANTLR3_UINT8)"Tree Error Node");
}
Am I out of luck here and only the error nodes the Java target produces would allow me to retrieve the text of the erroneous nodes?
I haven't used antlr much, but typically the way you handle this type of error is to add rules for matching wrong syntax, make them produce error nodes, and try to fix up after errors so that you can keep parsing. Fixing up afterwards is the problem because you don't want one error to trigger more and more errors for each new token until the end.
I solved the problem by adding new alternate rules to the grammer for all possible erroneous statements.
Each Java import statement gets translated to an AST subtree with the artificial symbol IMPORT as the root for example. To make sure that I can differentiate between ASTs from correct and erroneous code the rules for the erroneous statements rewrite them to an AST with a root symbol with the prefix ERR_, so in the example of the import statement the artifical root symbol would be ERR_IMPORT.
More different root symbols could be used to encode more detailed information about the parse error.
My parser is now as error tolerant as I need it to be and it's very easy to add rules for new kinds of erroneous input whenever I need to do so. You have to watch out to not introduce any ambiguities into your grammar, though.
In my language I can use a class variable in my method when the definition appears below the method. It can also call methods below my method and etc. There are no 'headers'. Take this C# example.
class A
{
public void callMethods() { print(); B b; b.notYetSeen();
public void print() { Console.Write("v = {0}", v); }
int v=9;
}
class B
{
public void notYetSeen() { Console.Write("notYetSeen()\n"); }
}
How should I compile that? what i was thinking is:
pass1: convert everything to an AST
pass2: go through all classes and build a list of define classes/variable/etc
pass3: go through code and check if there's any errors such as undefined variable, wrong use etc and create my output
But it seems like for this to work I have to do pass 1 and 2 for ALL files before doing pass3. Also it feels like a lot of work to do until I find a syntax error (other than the obvious that can be done at parse time such as forgetting to close a brace or writing 0xLETTERS instead of a hex value). My gut says there is some other way.
Note: I am using bison/flex to generate my compiler.
My understanding of languages that handle forward references is that they typically just use the first pass to build a list of valid names. Something along the lines of just putting an entry in a table (without filling out the definition) so you have something to point to later when you do your real pass to generate the definitions.
If you try to actually build full definitions as you go, you would end up having to rescan repatedly, each time saving any references to undefined things until the next pass. Even that would fail if there are circular references.
I would go through on pass one and collect all of your class/method/field names and types, ignoring the method bodies. Then in pass two check the method bodies only.
I don't know that there can be any other way than traversing all the files in the source.
I think that you can get it down to two passes - on the first pass, build the AST and whenever you find a variable name, add it to a list that contains that blocks' symbols (it would probably be useful to add that list to the corresponding scope in the tree). Step two is to linearly traverse the tree and make sure that each symbol used references a symbol in that scope or a scope above it.
My description is oversimplified but the basic answer is -- lookahead requires at least two passes.
The usual approach is to save B as "unknown". It's probably some kind of type (because of the place where you encountered it). So you can just reserve the memory (a pointer) for it even though you have no idea what it really is.
For the method call, you can't do much. In a dynamic language, you'd just save the name of the method somewhere and check whether it exists at runtime. In a static language, you can save it in under "unknown methods" somewhere in your compiler along with the unknown type B. Since method calls eventually translate to a memory address, you can again reserve the memory.
Then, when you encounter B and the method, you can clear up your unknowns. Since you know a bit about them, you can say whether they behave like they should or if the first usage is now a syntax error.
So you don't have to read all files twice but it surely makes things more simple.
Alternatively, you can generate these header files as you encounter the sources and save them somewhere where you can find them again. This way, you can speed up the compilation (since you won't have to consider unchanged files in the next compilation run).
Lastly, if you write a new language, you shouldn't use bison and flex anymore. There are much better tools by now. ANTLR, for example, can produce a parser that can recover after an error, so you can still parse the whole file. Or check this Wikipedia article for more options.
Aside from getting any real work done, I have an itch. My itch is to write a view engine that closely mimics a template system from another language (Template Toolkit/Perl). This is one of those if I had time/do it to learn something new kind of projects.
I've spent time looking at CoCo/R and ANTLR, and honestly, it makes my brain hurt, but some of CoCo/R is sinking in. Unfortunately, most of the examples are about creating a compiler that reads source code, but none seem to cover how to create a processor for templates.
Yes, those are the same thing, but I can't wrap my head around how to define the language for templates where most of the source is the html, rather than actual code being parsed and run.
Are there any good beginner resources out there for this kind of thing? I've taken a ganer at Spark, which didn't appear to have the grammar in the repo.
Maybe that is overkill, and one could just test-replace template syntax with c# in the file and compile it. http://msdn.microsoft.com/en-us/magazine/cc136756.aspx#S2
If you were in my shoes and weren't a language creating expert, where would you start?
The Spark grammar is implemented with a kind-of-fluent domain specific language.
It's declared in a few layers. The rules which recognize the html syntax are declared in MarkupGrammar.cs - those are based on grammar rules copied directly from the xml spec.
The markup rules refer to a limited subset of csharp syntax rules declared in CodeGrammar.cs - those are a subset because Spark only needs to recognize enough csharp to adjust single-quotes around strings to double-quotes, match curley braces, etc.
The individual rules themselves are of type ParseAction<TValue> delegate which accept a Position and return a ParseResult. The ParseResult is a simple class which contains the TValue data item parsed by the action and a new Position instance which has been advanced past the content which produced the TValue.
That isn't very useful on it's own until you introduce a small number of operators, as described in Parsing expression grammar, which can combine single parse actions to build very detailed and robust expressions about the shape of different syntax constructs.
The technique of using a delegate as a parse action came from a Luke H's blog post Monadic Parser Combinators using C# 3.0. I also wrote a post about Creating a Domain Specific Language for Parsing.
It's also entirely possible, if you like, to reference the Spark.dll assembly and inherit a class from the base CharGrammar to create an entirely new grammar for a particular syntax. It's probably the quickest way to start experimenting with this technique, and an example of that can be found in CharGrammarTester.cs.
Step 1. Use regular expressions (regexp substitution) to split your input template string to a token list, for example, split
hel<b>lo[if foo]bar is [bar].[else]baz[end]world</b>!
to
write('hel<b>lo')
if('foo')
write('bar is')
substitute('bar')
write('.')
else()
write('baz')
end()
write('world</b>!')
Step 2. Convert your token list to a syntax tree:
* Sequence
** Write
*** ('hel<b>lo')
** If
*** ('foo')
*** Sequence
**** Write
***** ('bar is')
**** Substitute
***** ('bar')
**** Write
***** ('.')
*** Write
**** ('baz')
** Write
*** ('world</b>!')
class Instruction {
}
class Write : Instruction {
string text;
}
class Substitute : Instruction {
string varname;
}
class Sequence : Instruction {
Instruction[] items;
}
class If : Instruction {
string condition;
Instruction then;
Instruction else;
}
Step 3. Write a recursive function (called the interpreter), which can walk your tree and execute the instructions there.
Another, alternative approach (instead of steps 1--3) if your language supports eval() (such as Perl, Python, Ruby): use a regexp substitution to convert the template to an eval()-able string in the host language, and run eval() to instantiate the template.
There are sooo many thing to do. But it does work for on simple GET statement plus a test. That's a start.
http://github.com/claco/tt.net/
In the end, I already had too much time in ANTLR to give loudejs' method a go. I wanted to spend a little more time on the whole process rather than the parser/lexer. Maybe in version 2 I can have a go at the Spark way when my brain understands things a little more.
Vici Parser (formerly known as LazyParser.NET) is an open-source tokenizer/template parser/expression parser which can help you get started.
If it's not what you're looking for, then you may get some ideas by looking at the source code.