antlr2 to antlr4 class specifier, options, TOKENS and more

I need to rewrite a grammar file from antlr2 syntax to antlr4 syntax and have the following questions.
1) Bart Kiers states in this SO post that there is a strict order: grammar, options, tokens, #header, #members. This antlr2.org post disagrees, stating that header comes before options. Is there a resource that states the correct order (if one exists) for antlr4?
2) The same antlr2.org post states: "The options section for a grammar, if specified, must immediately follow the ';' of the class specifier:
class MyParser extends Parser;
options { k=2; }
However, when running with antlr4, any class specifier creates this error:
syntax error: missing COLON at 'MyParser' while matching a rule
3) The SO question What happened to options in antlr4? says there are no rule-level options at this time. Running antlr4 on my grammar produces these warnings:
warning(83): MyGrammar.g4:4:4: unsupported option k
warning(83): MyGrammar.g4:5:4: unsupported option exportVocab
warning(83): MyGrammar.g4:6:4: unsupported option codeGenMakeSwitchThreshold
warning(83): MyGrammar.g4:7:4: unsupported option codeGenBitsetTestThreshold
warning(83): MyGrammar.g4:8:4: unsupported option defaultErrorHandler
warning(83): MyGrammar.g4:9:4: unsupported option buildAST
i.) does antlr4's adaptive LL(*) parsing algorithm no longer require k tokens of lookahead?
ii.) is there an equivalent in antlr4 for exportVocab?
iii.) are there equivalents in antlr4 for optimizations codeGenMakeSwitchThreshold and codeGenBitsetTestThreshold or have they become obsolete?
iv.) is there an equivalent for defaultErrorHandler ?
v.) I know antlr4 no longer builds ASTs. I'm still trying to grasp how this will affect the code that uses the currently generated *Parser.java and *Lexer.java.
4) My current grammar file specifies a TOKENS section
tokens {
ROOT; FOO; BAR; TRUE="true"; FALSE="false"; NULL="null";
}
I changed the double quotes to single quotes, the semicolons to commas, and the equals signs to colons to get rid of each syntax error, but now I have this error:
mismatched input ':' expecting RBRACE
along with others. The rewritten section looks like:
tokens {
ROOT; FOO; BAR; TRUE:'true'; FALSE:'false' ...
}
so I removed :'true' and :'false'. TRUE and FALSE still appear in the generated MyGrammar.tokens, but I'm not sure whether it will behave the same as before.
Thanks!

Just look at the ultimate source for the syntax: the ANTLR4 grammar. As you can see, the order plays no role in the prequel section (which includes named actions, options and the like; you can even have more than one options section). The only condition is that the prequel sections must appear before any rule.
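For instance, a skeleton along these lines (the grammar, package and rule names are made up) is accepted no matter how you shuffle the prequel sections, as long as they all come before the first rule:
grammar MyGrammar;                      // the grammar declaration always comes first
@header { package com.example; }        // named action
options { language = Java; }            // options section
tokens { ROOT, FOO, BAR }               // imaginary/virtual tokens
@members { int count = 0; }             // another named action

start : ID+ EOF ;
ID    : [a-zA-Z_]+ ;
WS    : [ \t\r\n]+ -> skip ;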
The error is about a wrong option. Remove that and the error will go away.
Many (actually most) of the old options are no longer needed or supported in ANTLR4.
i.) ANTLR4 uses unlimited lookahead (hence the * in ALL(*)). You cannot specify any other lookahead.
ii.) exportVocab is long gone (not even ANTLR3 supports it). It only specified a name for the .tokens file; use the default instead.
iii.) Nothing like that is needed nor supported anymore. The prediction algorithm has completely changed in ANTLR4.
iv.) You use an error listener instead. There are many examples of how to do that (also here on SO); see the sketch right after this list.
v.) Is that a question or just thinking out loud? Hint: ANTLR4-based parsers produce a parse tree.
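For iv.) and v.), a minimal sketch in Java (assuming a grammar called MyGrammar with a start rule named start; adjust the class and rule names to whatever ANTLR generates for you):
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.ParseTree;

public class Demo {
    // iv.) instead of defaultErrorHandler: a listener that turns syntax errors into exceptions
    static class ThrowingErrorListener extends BaseErrorListener {
        @Override
        public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol,
                                int line, int charPositionInLine,
                                String msg, RecognitionException e) {
            throw new IllegalStateException("line " + line + ":" + charPositionInLine + " " + msg);
        }
    }

    public static void main(String[] args) {
        MyGrammarLexer lexer = new MyGrammarLexer(CharStreams.fromString("your input here"));
        MyGrammarParser parser = new MyGrammarParser(new CommonTokenStream(lexer));
        parser.removeErrorListeners();                      // drop the default console listener
        parser.addErrorListener(new ThrowingErrorListener());

        // v.) the start rule returns a parse tree context instead of an AST
        ParseTree tree = parser.start();
        System.out.println(tree.toStringTree(parser));      // LISP-style dump of the tree
    }
}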
I'm not 100% sure about this one, but I believe you can no longer specify the value a token should match in the tokens section. It is now only for virtual tokens; everything else must be specified as normal lexer rules.
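So a sketch of the converted section (reusing the token names from the question) could look like this, with the literal tokens moved out into ordinary lexer rules:
tokens { ROOT, FOO, BAR }   // virtual tokens: comma-separated, no values

TRUE  : 'true' ;
FALSE : 'false' ;
NULL  : 'null' ;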
To sum up: most of the special options and tricks required for older ANTLR grammars are not needed anymore and must be removed. The new parsing algorithm automatically deals with the ambiguities that former versions had trouble with and needed guidance from the user for.

Related

Parsing annotations within comments / Negative matching of strings in regexps

I'm defining the syntax of the Move IR. The test suite for this language includes various annotations to enable testing. I need to treat comments of this form specially:
//! new-transaction
// check: "Keep(ABORTED { code: 123,"
This file, arithmetic_operators_u8.mvir, is an example.
So far, I've got this working by disallowing ordinary single-line comments.
module MOVE-ANNOTATION-SYNTAX-CONCRETE
imports INT-SYNTAX
syntax #Layout ::= r"([\\ \\n\\r\\t])" // Whitespace
syntax Annotation ::= "//!" "new-transaction" [klabel(NewTransaction), symbol]
syntax Check ::= "//" "check:" "\"Keep(ABORTED { code:" Int ",\"" [klabel(CheckCode), symbol]
endmodule
module MOVE-ANNOTATION-SYNTAX-ABSTRACT
imports INT-SYNTAX
syntax Annotation ::= "#NewTransaction" [klabel(NewTransaction), symbol]
syntax Check ::= #CheckCode(Int) [klabel(CheckCode), symbol]
endmodule
I'd like to also be able to use ordinary comments.
As a first step, I was able to change the Layout so that only comments that do not begin with a ! are swallowed as layout, using r"(\\/\\/[^!][^\\n\\r]*)"
I'd like to exclude all comments that start with either //! or // check: from being treated as ordinary comments. What's a good way of implementing this?
Where can I find documentation for the regular expression language that K uses?
K uses flex for its scanner, and thus for its regular expression language. As a result, you can find documentation on its regular expression language here.
You want a regular expression that expresses that comments can't start with ! or check:, but flex doesn't support negative lookahead or "not" patterns, so you will have to exhaustively enumerate all the cases of comments that don't start with those sequences of characters. It's a bit tedious, sadly.
For reference, here is a (simplified) regular expression drawn from the syntax of C that represents all pragmas that don't start with STDC. It should give you an idea of how to proceed:
#pragma([:space:]*)([^S].*|S|S[^T].*|ST|ST[^D].*|STD|STD[^C].*|STDC[_a-zA-Z0-9].*)?$
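Applying the same idea to the comments from the question, an untested sketch of such an enumeration might look like the following. It is shown as a raw flex-style pattern, anchored at end of line with $ (as in the example above) and assuming LF line endings; it would still need to be escaped for K's r"..." syntax (doubled backslashes, escaped spaces) like the existing Layout rule, and it should sit alongside the whitespace Layout production rather than replace it:
\/\/([^! \n\r][^\n\r]*|[ ]|[ ][^c\n\r][^\n\r]*|[ ]c|[ ]c[^h\n\r][^\n\r]*|[ ]ch|[ ]ch[^e\n\r][^\n\r]*|[ ]che|[ ]che[^c\n\r][^\n\r]*|[ ]chec|[ ]chec[^k\n\r][^\n\r]*|[ ]check|[ ]check[^:\n\r][^\n\r]*)?$
The short alternatives ([ ]c, [ ]ch, ...) only exist so that a comment ending exactly at one of those prefixes still counts as layout; the $ anchor keeps the rule from consuming just the // check prefix of an annotation.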

ANTLR4: parse number as identifier instead as numeric literal

I have a situation where I have to treat an integer as an identifier.
The underlying language syntax (unfortunately) allows this.
grammar excerpt:
grammar Alang;
...
NLITERAL : [0-9]+ ;
...
IDENTIFIER : [a-zA-Z0-9_]+ ;
Example code that has to be dealt with:
/** declaration block **/
Method 465;
...
In the above code example, because NLITERAL has to be placed before IDENTIFIER, the lexer produces 465 as an NLITERAL token.
What is a good way to deal with such situations?
(Ideally, avoiding application code within grammar, to keep it runtime agnostic)
I found similar questions on SO, not exactly helpful though.
There's no good way to make 465 produce either an NLITERAL token or an IDENTIFIER token depending on context (you might be able to use lexer modes, but that's probably not a good fit for your needs).
What you can do rather easily, though, is allow NLITERALs in addition to IDENTIFIERs in certain places. So you could define a parser rule
methodName: IDENTIFIER | NLITERAL;
and then use that rule instead of IDENTIFIER where appropriate.
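For instance, if method declarations look like the Method 465; line in the question (the 'Method' keyword is just an assumption taken from that snippet), the rule would be used like this:
methodDecl : 'Method' methodName ';' ;
methodName : IDENTIFIER | NLITERAL ;
In the resulting parse tree, 465 then simply shows up as the NLITERAL alternative of methodName.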

UTF-32 code points (surrogate pairs) in ANTLR3 lexer?

Is there any way to specify UTF-32 code points in the ANTLR3 lexer? More specifically I have a value like 0xAEC35 being returned from the UTF16LA (because I used a surrogate pair), but I cannot figure out how to specify this type of character (larger than 0xFFFF) in the lexer. As is, the lexer is throwing an error because the character is not matching anything.
I'm using ANTLR 3.5.2, and the internal handling has been changed to return UTF-32, but it seems unusual that the lexer doesn't handle those values very well.
ANTLR3 doesn't support Unicode beyond the BMP. You really should upgrade to ANTLR4, which has been available for years.
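If you do move to ANTLR4, note that since version 4.7 the lexer accepts code points above U+FFFF directly via \u{...} escapes, provided the input is created with CharStreams so it is read as code points. A small sketch, using the value from the question (rule names are made up):
// assumes ANTLR 4.7 or later
MYCHAR : '\u{AEC35}' ;               // a single code point beyond the BMP
ASTRAL : [\u{10000}-\u{10FFFF}] ;    // any supplementary-plane code point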

Ignore MSGTokenError in JAVACC

I use JAVACC to parse strings defined by a BNF grammar with initial non-terminal G.
I would like to catch errors thrown by TokenMgrError.
In particular, I want to handle the following two cases:
If some prefix of the input satisfies G but not all of the input is read, treat this case as normal and return the AST for the found prefix from a call to G().
If the input has no prefix satisfying G, return null from G().
Currently I'm getting TokenMgrErrors in each of these cases instead.
I started to modify the generated files (i.e., to change Error to Exception and add appropriate try/catch/throws statements), but I found it tedious. In addition, automatically regenerating the modified files with JAVACC does not work. Is there a smarter way to accomplish this?
You can always eliminate all TokenMgrErrors by including
<*> TOKEN : { <UNEXPECTED: ~[] > }
as the final rule. This pushes all your issues to the grammar level, where you can generally deal with them more easily.
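With that catch-all rule in place, stray characters reach the parser as ordinary (unexpected) tokens, so failures surface as a ParseException rather than a TokenMgrError and can be handled in plain Java at the call site. A rough sketch (MyParser, G() and ASTNode are placeholders for your generated parser and node type):
import java.io.StringReader;

ASTNode parsePrefix(String input) {
    MyParser parser = new MyParser(new StringReader(input));
    try {
        // G() consumes only the prefix it matches (unless the production itself
        // demands <EOF>), which covers the "valid prefix, trailing input" case.
        return parser.G();
    } catch (ParseException e) {
        // no prefix of the input satisfied G
        return null;
    }
}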

ParseKit: What built-in Productions should I use in my Grammars?

I just started using ParseKit to explore language creation and perhaps build a small toy DSL. However, the current SVN trunk from Google is throwing a -[PKToken intValue]: unrecognized selector sent to instance ... when parsing this grammar:
#start = identifier ;
identifier = (Letter | '_') | (letterOrDigit | '_') ;
letterOrDigit = Letter | Digit ;
Against this input:
foo
Clearly, I am missing something or have incorrectly configured my project. What can I do to fix this issue?
Developer of ParseKit here.
First, see the ParseKit Tokenization docs.
Basically, ParseKit can work in one of two modes: Let's call them Tokens Mode and Chars Mode. (There are no formal names for these two modes, but perhaps there should be.)
Tokens Mode is more popular by far. Virtually every example you will find of using ParseKit will show how to use Tokens Mode. I believe all of the documentation on http://parsekit.com uses Tokens Mode. ParseKit's grammar feature (which you are using in your example) only works in Tokens Mode.
Chars Mode is a very little-known feature of ParseKit. I've never had anyone ask about it before.
So the differences in the modes are:
In Tokens Mode, the ParseKit Tokenizer emits multi-char tokens (like Words, Symbols, Numbers, QuotedStrings etc) which are then parsed by the ParseKit parsers you create (programmatically or via grammars).
In Chars Mode, the ParseKit Tokenizer always emits single-char tokens which are then parsed by the ParseKit parsers you create programmatically. (grammars don't currently work with this mode as this mode is not popular).
You could use Chars Mode to implement Regular Expressions which parse on a char-by-char basis.
For your example, you should ignore Chars Mode and just use Tokens Mode. The following Built-in Productions are for Chars Mode only. Do not use them in your grammars:
(PK)Letter
(PK)Digit
(PK)Char
(PK)SpecificChar
Notice how all of those Productions sound like they match individual chars. That's because they do.
Your example above should probably look like:
#start = identifier;
identifier = Word; // by default Words start with a-zA-Z_ and contain -0-9a-zA-Z_'
Keep in mind the Productions in your grammars (parsers like identifier) will be working on Tokens already emitted from ParseKit's Tokenizer. Not individual chars.
IOW: by the time your grammar goes to work parsing input, the input has already been tokenized into Tokens of type Word, Number, Symbol, QuotedString, etc.
Here are all of the Built-in Productions available for use in your Grammar:
Word
Number
Symbol
QuotedString
Comment
Any
S // Whitespace. only available when #preservesWhitespaceTokens=YES. NO by default.
Also:
DelimitedString('start', 'end', 'allowedCharset')
/xxx/i // RegEx match
There are also operators for composite parsers:
// Sequence
| // Alternation
? // Optional
+ // Multiple
* // Repetition
~ // Negation
& // Intersection
- // Difference
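For example, a toy grammar that sticks to the Tokens-Mode productions and a few of the operators above might look like this (untested; the rule names and the 'set' keyword are made up):
#start = statement+;
statement = 'set' Word '=' (Number | QuotedString) ';';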