I have an ANTLR4 listener which handles a standard, well-formed grammar, but I am struggling with how to deal with the non-standard implementations. Although all of the variants go through the lexer without problems, the parse stage is a lot trickier.
A traditional way of doing this would be something like
// Header of document
variant = STANDARD;
if (header.indexOf("microsoft") != -1) {
variant = MICROSOFT;
} else if (header.indexOf("google") != -1) {
variant = GOOGLE;
}
...
// Parsing a particular element
if (variant.equals(MICROSOFT)) {
// Microsoft-specific stuff
} else if (variant.equals(GOOGLE)) {
// Google-specific stuff
} else {
// Standard stuff
}
but this quickly becomes unmaintainable. The obvious solution is to have a ParseTreeListener for the standard implementation and then subclass it for each variant, but I don't know which variant it is until I've started the parse.
So how can I either switch from one listener to another part-way through the parse, or restart the parse with a new listener once I know which variant I'm dealing with?
If these variants occur frequently, you might want to consider embedding custom code to handle context sensitive parsing by using predicates (the {...}? construct in the following pseudo grammar):
rule
: { boolean-expression-a }? a-alternative
| { boolean-expression-b }? b-alternative
| /* fall through */ not-a-or-b-alternative
;
Let's say you want to parse a file containing chunks. A chunk consists of a header and a data row. In the header you can set your variant. The data of a normal variant contains 3 NUMBERs, Google's variant contains 2 NUMBERs and Microsoft's variant contains a single NUMBER. An example of such a file would look like this:
header: none
data: 1 2 3
header: google
data: 4 5
header: microsoft
data: 6
And here's a demo of a context sensitive ANTLR v4 grammar able to parse this:
grammar T;
@parser::members {
enum Variant {
GOOGLE,
MICROSOFT,
OTHER;
public static Variant tryValueOf(String name) {
try {
return Variant.valueOf(name.toUpperCase());
}
catch(Exception e) {
return OTHER;
}
}
}
private Variant variant = Variant.OTHER;
}
parse
: chunk+ EOF
;
chunk
: header data
;
header
: K_HEADER COLON NAME {variant = Variant.tryValueOf($NAME.text);}
;
data
: {variant == Variant.MICROSOFT}? K_DATA COLON NUMBER #MicrosoftData
| {variant == Variant.GOOGLE}? K_DATA COLON NUMBER NUMBER #GoogleData
| K_DATA COLON NUMBER NUMBER NUMBER #OtherData
;
K_DATA : 'data';
K_HEADER : 'header';
NAME : [a-zA-Z]+;
NUMBER : [0-9]+;
COLON : ':';
SPACE : [ \t\r\n] -> skip;
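Parsing the example file with this grammar yields one chunk per header/data pair, with each data row matched by the alternative for its variant (OtherData, GoogleData and MicrosoftData, in that order).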
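For comparison, the decision that the predicates make can be sketched without ANTLR at all. The following is only an illustration in plain Java, not part of the grammar: the class name VariantChunks, the numbersPerRow field and the "VARIANT:count" result format are made up for this sketch, and it assumes that "header:" and "data:" are written without a space before the colon, as in the example file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class VariantChunks {
    enum Variant {
        GOOGLE(2), MICROSOFT(1), OTHER(3);

        final int numbersPerRow; // how many NUMBERs a data row must contain

        Variant(int numbersPerRow) {
            this.numbersPerRow = numbersPerRow;
        }

        // Mirrors tryValueOf in the grammar: unknown names fall back to OTHER.
        static Variant tryValueOf(String name) {
            try {
                return valueOf(name.toUpperCase(Locale.ROOT));
            } catch (IllegalArgumentException e) {
                return OTHER;
            }
        }
    }

    // Returns one "VARIANT:count" entry per chunk; throws on a malformed chunk.
    static List<String> parse(String input) {
        String[] t = input.trim().split("\\s+");
        List<String> chunks = new ArrayList<>();
        int i = 0;
        while (i < t.length) {
            if (i + 1 >= t.length || !t[i].equals("header:")) {
                throw new IllegalArgumentException("expected header at token " + i);
            }
            // The header sets the context, like the action in the header rule.
            Variant variant = Variant.tryValueOf(t[i + 1]);
            i += 2;
            if (i >= t.length || !t[i].equals("data:")) {
                throw new IllegalArgumentException("expected data at token " + i);
            }
            i++;
            int count = 0;
            while (i < t.length && t[i].matches("[0-9]+")) {
                count++;
                i++;
            }
            // The context decides which shape of data row is acceptable.
            if (count != variant.numbersPerRow) {
                throw new IllegalArgumentException(variant + " row with " + count + " numbers");
            }
            chunks.add(variant + ":" + count);
        }
        return chunks;
    }
}
```

Calling VariantChunks.parse on the example file above returns the three chunks in order, while a google header followed by three numbers is rejected.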
I'm trying to implement a lexer rule for an oracle Q quoted string mechanism where we have something like q'$some string$'
Here you can have any character in place of $ other than whitespace, (, {, [, <, but the string must start and end with the same character. Some examples of accepted tokens would be:
q'!some string!'
q'ssome strings'
Notice how s is the custom delimiter but it is fine to have that in the string as well because we would only end at s'
Here's how I was trying to implement the rule:
Q_QUOTED_LITERAL: Q_QUOTED_LITERAL_NON_TERMINATED . QUOTE-> type(QUOTED_LITERAL);
Q_QUOTED_LITERAL_NON_TERMINATED:
Q QUOTE ~[ ({[<'"\t\n\r] { setDelimChar( (char)_input.LA(-1) ); }
( . { !isValidEndDelimChar() }? )*
;
I have already checked the value I get from !isValidEndDelimChar(), and the predicate is false at the right place, so everything should work, but ANTLR simply ignores this predicate. I've also tried moving the predicate around, putting that part in a separate rule, and a bunch of other things. After a day and a half of research, I'm finally raising this issue.
I have also tried to implement it in other ways, but there doesn't seem to be a way to implement a string with a custom delimiter character in ANTLR4 (the ANTLR3 version used to work).
Not sure why the { ... } action isn't invoked, but it's not needed. The following grammar worked for me (note that the predicate goes in front of the .):
grammar Test;
@lexer::members {
boolean isValidEndDelimChar() {
return (_input.LA(1) == getText().charAt(2)) && (_input.LA(2) == '\'');
}
}
parse
: .*? EOF
;
Q_QUOTED_LITERAL
: 'q\'' ~[ ({[<'"\t\n\r] ( {!isValidEndDelimChar()}? . )* . '\''
;
SPACE
: [ \t\f\r\n] -> skip
;
If you run the class:
import org.antlr.v4.runtime.*;
public class Main {
public static void main(String[] args) {
Lexer lexer = new TestLexer(CharStreams.fromString("q'ssome strings' q'!foo!'"));
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
for (Token t : tokens.getTokens()) {
System.out.printf("%-20s %s\n", TestLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
}
}
the following output will be printed:
Q_QUOTED_LITERAL q'ssome strings'
Q_QUOTED_LITERAL q'!foo!'
EOF <EOF>
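To see what the predicate is checking, the same scan can be written by hand. This is a sketch only, in plain Java with no ANTLR involved; the class QQuoteScanner and its scan method are hypothetical names, and the set of forbidden delimiters is copied from the ~[...] set in the grammar above.

```java
final class QQuoteScanner {
    // Characters that may not act as the delimiter (the ~[...] set in the grammar).
    private static final String FORBIDDEN = " ({[<'\"\t\n\r";

    // Returns the q-quoted literal starting at 'start', or null if there is none.
    static String scan(String input, int start) {
        int n = input.length();
        if (start + 2 >= n || input.charAt(start) != 'q' || input.charAt(start + 1) != '\'') {
            return null;
        }
        char delim = input.charAt(start + 2);
        if (FORBIDDEN.indexOf(delim) >= 0) {
            return null;
        }
        int i = start + 3;
        // Same test as isValidEndDelimChar(): stop only where the delimiter
        // is immediately followed by a closing quote.
        while (i + 1 < n && !(input.charAt(i) == delim && input.charAt(i + 1) == '\'')) {
            i++;
        }
        if (i + 1 >= n) {
            return null; // ran off the end: unterminated literal
        }
        return input.substring(start, i + 2);
    }
}
```

A delimiter character inside the string is skipped because it is not followed by a quote, which is why q'ssome strings' is accepted as a single literal.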
In Edit distance: Ignore start/end, I offered a Perl 6 solution to a fuzzy matching problem. I had a grammar like this (although maybe I've improved it after Edit #3):
grammar NString {
regex n-chars { [<.ignore>* \w]**4 }
regex ignore { \s }
}
The literal 4 itself was the length of the target string in the example. But the next problem might be some other length. So how can I tell the grammar how long I want that match to be?
Although the docs don't show an example of using the $args parameter, I found one in S05-grammar/example.t in roast.
Specify the arguments in :args and give the regex an appropriate signature. Inside the regex, access the arguments in a code block:
grammar NString {
regex n-chars ($length) { [<.ignore>* \w]**{ $length } }
regex ignore { \s }
}
class NString::Actions {
method n-chars ($/) {
put "Found $/";
}
}
my $string = 'The quick, brown butterfly';
loop {
state $from = 0;
my $match = NString.subparse(
$string,
:rule('n-chars'),
:actions(NString::Actions),
:c($from++),
:args( \(5) )
);
last unless ?$match;
}
I'm still not sure about the rules for passing the arguments though. This doesn't work:
:args( 5 )
I get:
Too few positionals passed; expected 2 arguments but got 1
This works:
:args( 5, )
But that's enough thinking about this for one night.
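For readers more at home outside Perl 6, the idea of passing the length in as an argument can be approximated in Java by building the repetition count into the pattern. This is a rough analogue under stated assumptions, not a translation: the class NChars and its match method are hypothetical, and Java's \w is ASCII-only by default, unlike Perl 6's.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NChars {
    // Matches 'length' word characters beginning at offset 'from', allowing
    // ignorable whitespace before each one. Returns null if no such match
    // starts exactly at 'from' (similar to subparse with :c).
    static String match(String text, int from, int length) {
        Pattern p = Pattern.compile("(?:\\s*\\w){" + length + "}");
        Matcher m = p.matcher(text).region(from, text.length());
        return m.lookingAt() ? m.group() : null;
    }
}
```

With the string from the example, a length of 5 starting at offset 0 gives "The qu", since the space is skipped without counting towards the five characters.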
I was curious if I could insert things into the Match tree without actually matching anything. There's no associated problem I'm trying to solve.
In this example, I have a token market that checks that its match is a key in the hash. I was trying to then insert the value of that hash into the match tree somehow. I figured I could have a token that always matches, long_market_string, and then look into the tree somehow to see what market had matched.
grammar OrderNumber::Grammar {
token TOP {
<channel> <product> <market> <long_market_string> '/' <revision>
}
token channel { <[ M F P ]> }
token product { <[ 0..9 A..Z ]> ** 4 }
token market {
(<[ A..Z ]>** 1..2) <?{ %Market_Shortcode{$0}:exists }>
}
# this should figure out what market matched
# I don't particularly care how this happens as long as
# I can insert this into the match tree
token long_market_string { <?> }
token revision { <[ A..C ]> }
}
Is there some way to mess with the Match tree as it is being created?
I could do something that inverts things:
grammar AppleOrderNumber::Grammar {
token TOP {
<channel> <product> <long_market_string> '/' <revision>
}
token channel { <[ M F P ]> }
token product { <[ 0..9 A..Z ]> ** 4 }
token market {
(<[ A..Z ]>** 1..2) <?{ %Market_Shortcode{$0}:exists }>
}
token long_market_string { <market> }
token revision { <[ A..C ]> }
}
But that only handles this one case. I'm more interested in inserting an arbitrary number of things.
Tokens are a type of method, so if you wrote a method that did all of the setup work that a token does for you, you could do almost anything.
This is not specced, and is currently not easy.
( I only have a vague idea of where to start looking in the source code to figure it out )
What you can do easily is add to the .made/.ast of the result
( .made and .ast are synonyms )
$/ = grammar {
token TOP {
.*
{
make 'World'
}
}
}.parse('Hello');
say "$/ $/.made()"; # Hello World
It doesn't even have to be inside of a grammar
'asdf' ~~ /{make 42}/;
say $/; # 「」
say $/.made # 42
Most of the time you would use an Actions class for this type of thing
grammar example-grammar {
token TOP {
[ <number> | <word> ]+ % \s*
}
token word {
<.alpha>+
}
token number {
\d+
{ make +$/ }
}
}
class example-actions {
method TOP ($/) { make $/.pairs.map:{ .key => .value».made} }
method number ($/) { #`( already done in grammar, so this could be removed ) }
method word ($/) { make ~$/ }
}
.say for example-grammar.parse(
'Hello 123 World',
:actions(example-actions)
).made».perl
# :number([123])
# :word(["Hello", "World"])
It sounds like you want to subvert the match tree into doing something the match tree isn't really supposed to do. The match tree tracks what substrings were matched where in the input string, not arbitrary data generated by the parser. If you want to track arbitrary data, what's wrong with the AST tree?
Sure, in one sense the AST tree has to mirror the parse tree, since it's constructed in a bottom-up fashion as the match methods complete successfully. But the AST itself, in the sense of "the object attached to any given node" is not so restricted. Consider for example:
grammar G {
token TOP { <foo> <bar> {make "TOP is: " ~ $<foo> ~ $<bar>} }
token foo { foo {make "foo"} }
token bar { bar {make "bar"} }
}
G.parse("foobar");
Here $/.made will simply be the string "TOP is: foobar" while the match tree has child nodes with the components that were used to construct the top-level AST. If we then return to your example, we can make it:
grammar G {
my %Market_Shortcode = :AA('Double A');
token TOP {
<channel> <product> <market>
{} # Force the computation of the $/ object. Note that this will also terminate LTM here.
<long_market_string(~$<market>)> '/' <revision>
}
token channel { <[ M F P ]> }
token product { <[ 0..9 A..Z ]> ** 4 }
token market {
(<[ A..Z ]>** 1..2) <?{ %Market_Shortcode{$0}:exists }>
}
token long_market_string($shortcode) { <?> { say 'c='~$shortcode; make %Market_Shortcode{$shortcode} } }
token revision { <[ A..C ]> }
}
G.parse('M0000AA/A');
$<long_market_string>.ast will now be 'Double A'. Of course, I'd probably dispense with token long_market_string and just make the AST of token market whatever is in %Market_Shortcode (or a Market object with both short and long name, if you want to track both at once).
A less trivial example of this kind of thing would be something like a grammar of Python. Since Python's block-level structure is line-based, your grammar (and thus match tree) needs to reflect this in some way. But you can also chain several simple statements together on a single line by separating them with semicolons. Now, you'll probably want the AST of a block to be a list of statements, while the AST of a single line may itself be a list of several statements. Thus you'd construct the AST of the block by (for example) flatmapping together the lists of the lines (or something along those lines, depending on how you represent block statements like if and while).
Now, if you really, really, really want to do nasty things to the match tree, I'm pretty sure it can be done, of course. You'll have to implement the parsing code yourself with method long_market_string, the API for which is undocumented and internal, and will likely involve at least some dropping down into nqp::ops. The stuff pointed to here will probably be useful. Other relevant files are src/core/{Match,Cursor}.pm in the Rakudo repo. Note also that the stringification of Matches is computed by extracting the matched substring from the input string, so if you want it to stringify usefully, you'll have to subclass Match.
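The earlier suggestion, attaching a Market object with both names to the market match instead of inventing an empty token, can be pictured outside Perl 6 as well. The following Java sketch is only an analogy: OrderNumbers, Market and parseMarket are hypothetical names, the lookup table holds the single entry used in the example, and a backtracking regex stands in for the tokens, so the lookup happens after the match rather than during it.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OrderNumbers {
    // Stand-in for %Market_Shortcode in the grammar above.
    static final Map<String, String> MARKET_SHORTCODE = Map.of("AA", "Double A");

    // The payload attached to the market match: short and long name together.
    static final class Market {
        final String shortCode;
        final String longName;

        Market(String shortCode, String longName) {
            this.shortCode = shortCode;
            this.longName = longName;
        }
    }

    // channel, product, market, '/', revision, as in token TOP.
    private static final Pattern ORDER =
            Pattern.compile("([MFP])([0-9A-Z]{4})([A-Z]{1,2})/([A-C])");

    // Returns the Market payload, or null if the input does not parse
    // or the market shortcode is unknown.
    static Market parseMarket(String orderNumber) {
        Matcher m = ORDER.matcher(orderNumber);
        if (!m.matches()) {
            return null;
        }
        String longName = MARKET_SHORTCODE.get(m.group(3));
        return longName == null ? null : new Market(m.group(3), longName);
    }
}
```

The matched text stays what it was in the input ("AA"); the long name travels alongside it as data, which is the division of labour between match tree and AST described above.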
I'm trying to gather all text that is not defined by a previous rule into a string and prefix it with a formatting string using lex. I'm wondering if there's a standard way of doing this.
For example, say I have the rules:
word1|word2|word3|word4 {printf("%s%s", "<>", yytext);}
[0-9]+ {printf("%s%s", "{}", yytext);}
everything else {printf("%s%s", "[]", yytext);}
And I attempt to lex the string:
word1 this is some other text ; word2 98 foo bar .
I would want this to produce the following when run through the lexer:
<>word1[] this is some other text ; <>word2[] {}98[] foo bar .
I attempted to do this using states, but realize I can't determine when to stop the check, like:
%x OTHER
%%
. {yymore(); BEGIN OTHER;}
<OTHER>.|\n yymore();
<OTHER>how to determine when to end? {printf("%s%s", "[]", yytext); BEGIN INITIAL;}
What is a good way to do this? Is there some way to continue as long as another rule isn't met?
AFAIK, there is no "standard" solution, but a simple one is to keep a bit of context (the prefix last printed) and use that to decide whether or not to print a new prefix. For example, you could use a custom printer like this:
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
enum OutputType { NO_TOKEN = 0, WORD, NUMBER, OTHER };
void print_with_prefix(enum OutputType type, const char* token) {
static enum OutputType prev = NO_TOKEN;
const char* prefix = "";
switch (type) {
case WORD: prefix = "<>"; break;
case NUMBER: prefix = "{}"; break;
case OTHER: if (prev != OTHER) prefix = "[]"; break;
default: assert(false);
}
prev = type;
printf("%s%s", prefix, token);
}
Then you just need to change the calls to printf to invoke print_with_prefix instead (and, as written, to supply an enum value instead of a string).
For the OTHER case, you then don't need to do anything special to accumulate the token. Just
. { print_with_prefix(OTHER, yytext); }
(I'm skating over the handling of whitespace and newlines, but it's just conceptual.)
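To check that keeping only the previous token type is enough to produce the output asked for in the question, here is the same idea as a small self-contained Java sketch. It is not flex: the class PrefixLexer is a hypothetical name, a regex with ordered alternatives stands in for the lex rules, and the catch-all alternative also matches newlines.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PrefixLexer {
    private enum Kind { NONE, WORD, NUMBER, OTHER }

    // Group 1: keyword, group 2: number, group 3: any other single character.
    private static final Pattern TOKEN =
            Pattern.compile("(word[1-4])|([0-9]+)|([\\s\\S])");

    static String annotate(String input) {
        StringBuilder out = new StringBuilder();
        Kind prev = Kind.NONE; // the one bit of context that is kept
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            Kind kind = m.group(1) != null ? Kind.WORD
                      : m.group(2) != null ? Kind.NUMBER
                      : Kind.OTHER;
            if (kind == Kind.WORD) {
                out.append("<>");
            } else if (kind == Kind.NUMBER) {
                out.append("{}");
            } else if (prev != Kind.OTHER) {
                out.append("[]"); // only at the start of a run of other text
            }
            out.append(m.group());
            prev = kind;
        }
        return out.toString();
    }
}
```

Applied to the input from the question, annotate returns the string the question asks for: <>word1[] this is some other text ; <>word2[] {}98[] foo bar .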
I want to keep white space when I read the text attribute of a token. Is there any way to do it?
Here is the situation:
We have the following code
IF L > 40 THEN;
ELSE
IF A = 20 THEN
PUT "HELLO";
In this case, I want to transform it into:
if (!(L>40)){
if (A=20)
put "hello";
}
The rule in ANTLR is:
stmt_if_block: IF expr
THEN x=stmt
(ELSE y=stmt)?
{
if ($x.text.equalsIgnoreCase(";"))
{
WriteLn("if(!(" + $expr.text +")){");
WriteLn($stmt.text);
WriteLn("}");
}
}
But the result looks like:
if(!(L>40))
{
ifA=20put"hello";
}
The reason is that the white space in $stmt was removed. I was wondering if there is any way to keep this white space.
Thank you so much
Update: If I add
SPACE: [ ] -> channel(HIDDEN);
The space will be preserved, and the result would look like below, many spaces between tokens:
IF SUBSTR(WNAME3,M-1,1) = ')' THEN M = L; ELSE M = L - 1;
This is the C# extension method I use for exactly this purpose:
public static string GetFullText(this ParserRuleContext context)
{
if (context.Start == null || context.Stop == null || context.Start.StartIndex < 0 || context.Stop.StopIndex < 0)
return context.GetText(); // Fallback
return context.Start.InputStream.GetText(Interval.Of(context.Start.StartIndex, context.Stop.StopIndex));
}
Since you're using Java, you'll have to translate it, but it should be straightforward - the API is the same.
Explanation: Get the first token, get the last token, and get the text from the input stream between the first char of the first token and the last char of the last token.
@Lucas's solution, but in Java, in case you have trouble translating it:
private String getFullText(ParserRuleContext context) {
if (context.start == null || context.stop == null || context.start.getStartIndex() < 0 || context.stop.getStopIndex() < 0)
return context.getText();
return context.start.getInputStream().getText(Interval.of(context.start.getStartIndex(), context.stop.getStopIndex()));
}
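Why slicing the input stream helps can be shown without ANTLR. The sketch below is an illustration only: FullTextDemo and Tok are hypothetical stand-ins for a rule context and its tokens, with whitespace dropped during tokenizing (as with -> skip) but each token remembering its inclusive start and stop offsets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FullTextDemo {
    // Hypothetical stand-in for a token: text plus inclusive character offsets.
    static final class Tok {
        final String text;
        final int start;
        final int stop;

        Tok(String text, int start, int stop) {
            this.text = text;
            this.start = start;
            this.stop = stop;
        }
    }

    // Whitespace is discarded, but every token keeps its position in the input.
    static List<Tok> tokenize(String input) {
        List<Tok> toks = new ArrayList<>();
        Matcher m = Pattern.compile("\\S+").matcher(input);
        while (m.find()) {
            toks.add(new Tok(m.group(), m.start(), m.end() - 1));
        }
        return toks;
    }

    // Gluing token texts together loses everything that was skipped.
    static String concatenated(List<Tok> toks) {
        StringBuilder sb = new StringBuilder();
        for (Tok t : toks) {
            sb.append(t.text);
        }
        return sb.toString();
    }

    // Slicing the original input from the first token's start to the last
    // token's stop keeps the skipped whitespace in between.
    static String fullText(String input, List<Tok> toks) {
        if (toks.isEmpty()) {
            return "";
        }
        return input.substring(toks.get(0).start, toks.get(toks.size() - 1).stop + 1);
    }
}
```

For the input "IF A = 20 THEN", concatenated gives "IFA=20THEN" while fullText gives back the original spacing. The slice always comes from the original characters, so edits made to the tree afterwards do not show up in it.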
Looks like InputStream is not always updated after removeLastChild/addChild operations. This solution helped me for one grammar, but it doesn't work for another.
Works for this grammar.
Doesn't work for modern groovy grammar (for some reason inputStream.getText contains old text).
I am trying to implement function name replacement like this:
enterPostfixExpression(ctx: PostfixExpressionContext) {
// Get identifierContext from ctx
...
const token = CommonTokenFactory.DEFAULT.createSimple(GroovyParser.Identifier, 'someNewFnName');
const node = new TerminalNode(token);
identifierContext.removeLastChild();
identifierContext.addChild(node);
UPD: I used visitor pattern for the first implementation