Parsing annotations within comments / Negative matching of strings in regexps

Parsing annotations within comments / Negative matching of strings in regexps - kframework

I'm defining the syntax of the Move IR. The test suite for this language includes various annotations to enable testing. I need to treat comments of this form specially:
//! new-transaction
// check: "Keep(ABORTED { code: 123,"
This file is an example arithmetic_operators_u8.mvir.
So far, I've got this working by disallowing ordinary single-line comments.
module MOVE-ANNOTATION-SYNTAX-CONCRETE
imports INT-SYNTAX
syntax #Layout ::= r"([\\ \\n\\r\\t])" // Whitespace
syntax Annotation ::= "//!" "new-transaction" [klabel(NewTransaction), symbol]
syntax Check ::= "//" "check:" "\"Keep(ABORTED { code:" Int ",\"" [klabel(CheckCode), symbol]
endmodule
module MOVE-ANNOTATION-SYNTAX-ABSTRACT
imports INT-SYNTAX
syntax Annotation ::= "#NewTransaction" [klabel(NewTransaction), symbol]
syntax Check ::= #CheckCode(Int) [klabel(CheckCode), symbol]
endmodule
I'd like to also be able to use ordinary comments.
As a first step, I was able to change the Layout to allow commits only if the begin with a ! using r"(\\/\\/[^!][^\\n\\r]*)"
I'd like to exclude all comments that start with either //! or // check: from comments. What's a good way of implementing this?
Where can I find documentation for the regular expression language that K uses?

K uses flex for its scanner, and thus for its regular expression language. As a result, you can find documentation on its regular expression language here.
You want a regular expression that expresses that comments can't start with ! or check:, but flex doesn't support negative lookahead or not patterns, so you will have to exhaustively enumerate all the cases of comments that don't start with those sequence of characters. It's a bit tedious, sadly.
For reference, here is a (simplified) regular expression drawn from the syntax of C that represents all pragmas that don't start with STDC. It should give you an idea of how to proceed:
#pragma([:space:]*)([^S].*|S|S[^T].*|ST|ST[^D].*|STD|STD[^C].*|STDC[_a-zA-Z0-9].*)?$

Related

Grammar-Kit: how to handle comment tokens

From the documentation provided for grammar-kit, I cannot figure out how I am supposed to correctly handle something like comments. My lexer currently returns TokenType.WHITE_SPACE for any comment blocks, but then no unique IElementType is generated for me to do syntax highlighting on.
If I create an IElementType and tell flex to return that for comments, I can perform syntax highlighting, but then that token is not a part of my language spec in the BNF, and so it is considered invalid.
What is the correct way to pass comments through as white space, but perform syntax highlighting on them in Intellij/grammar-kit/jflex?

You can use Grammar-Kit implementation as a reference:
Lexer
Grammar
ParserDefinition
Using TokenType.WHITE_SPACE for comments is a bad idea.
More details can be found here.

antlr2 to antlr4 class specifier, options, TOKENS and more

I need to rewrite a grammar file from antlr2 syntax to antlr4 syntax and have the following questions.
1) Bart Kiers states there is a strict order: grammar, options, tokens, #header, #members in this SO post. This antlr2.org post disagrees stating header is before options. Is there a resource that states the correct order (if one exists) for antlr4?
2) The same antlr2.org post states: "The options section for a grammar, if specified, must immediately follow the ';' of the class specifier:
class MyParser extends Parser;
options { k=2; }
However, when running with antlr4, any class specifier creates this error:
syntax error: missing COLON at 'MyParser' while matching a rule
3) What happened to options in antlr4? says there are no rule-level options at the time.
warning(83): MyGrammar.g4:4:4: unsupported option k
warning(83): MyGrammar.g4:5:4: unsupported option exportVocab
warning(83): MyGrammar.g4:6:4: unsupported option codeGenMakeSwitchThreshold
warning(83): MyGrammar.g4:7:4: unsupported option codeGenBitsetTestThreshold
warning(83): MyGrammar.g4:8:4: unsupported option defaultErrorHandler
warning(83): MyGrammar.g4:9:4: unsupported option buildAST
i.) does antlr4's adaptive LL(*) parsing algorithm no longer require k token lookhead?
ii.) is there an equivalent in antlr4 for exportVocab?
iii.) are there equivalents in antlr4 for optimizations codeGenMakeSwitchThreshold and codeGenBitsetTestThreshold or have they become obsolete?
iv.) is there an equivalent for defaultErrorHandler ?
v.) I know antlr4 no longer builds AST. I'm still trying to get a grasp of how this will affect what uses the currently generated *Parser.java and *Lexer.java.
4) My current grammar file specifies a TOKENS section
tokens {
ROOT; FOO; BAR; TRUE="true"; FALSE="false"; NULL="null";
}
I changed the double quotes to single quotes and the semi-colons to commas and the equal sign to a colon to try and get rid of each syntax error but have this error:
mismatched input ':' expecting RBRACE
along with others. Rewritten looks like:
tokens {
ROOT; FOO; BAR; TRUE:'true'; FALSE:'false' ...
}
so I removed :'true' and :'false' and TRUE and FALSE will appear in the generated MyGrammar.tokens but I'm not sure if it will function the same as before.
Thanks!

Just look at the ultimate source for the syntax: the ANTLR4 grammar. As you can see the order plays no role in the prequel section (which includes named actions, options and the like, you can even have more than one option section). The only condition is that the prequel section must appear before any rule.
The error is about a wrong option. Remove that and the error will go away.
Many (actually most of the old) options are no longer needed and supported in ANTLR4.
i.) ANTLR4 uses unlimited lookahead (hence the * in ALL(*)). You cannot specify any other lookahead.
ii.) The exportVocab has long gone (not even ANTLR3 supports it). It only specifies a name for the .tokens file. Use the default instead.
iii.) Nothing like that is needed nor supported anymore. The prediction algorithm has completely changed in ANTLR4.
iv.) You use an error listener instead. There are many examples how to do that (also here at SO).
v.) Is that a question or just thinking loudly? Hint: ANTLR4 based parsers generate a parse tree.
I'm not 100% sure about this one, but I believe you can no longer specify the value a token should match in the tokens section. Instead this is only for virtual tokens and everything else must be specified as normal lexer tokens.
To sum up: most of the special options and tricks required for older ANTLR grammars are not needed anymore and must be removed. The new parsing algorithm can deal with all the ambiquities automatically, which former versions had trouble with and needed guidance from the user for.

software\tool for checking syntax format

Im looking for a tool that can validate if a given text\paragraph subject to a specific format .
for example :
I can be able to check if the text is as following :
xxx{
sss:aaa;
}
yyy();
preferably open source tool, with easy rule sets like xml or something .
by text i mean a string that i get from i.e fgets(), or any function that reads from a file .

For something like this I'd suggest a parser (see, for instance, What is Parse/parsing?). You can build one from a definition of the language that you want to parse using a parser generator like Yacc or its free GNU equivalent Bison, or any number of other parser generators, many of which are also freely available.
Most parsers are used to transform a text that complies with a grammar into some other form (e.g. an intermediate language or a machine code) but that isn't neccesary - in your case the parser could simply say (at a minimum) "Yes" if the text conforms to a given grammar.
Parsers for simple grammars can be built by hand but, if you have the tools available, using a parser generator is easier and more robust in my experience.
Further, the text that you've shown is similar to a portion of code written in the C language (something close to a struct declaration followed by a function call), so you would be able to re-use parts of the grammar that you need from an existing Yacc grammar for C like this one.

Antlr: ignore keywords in specific context

I'm constructing an English-like domain specific language with ANTLR. Its keywords are context-sensitive. (I know it sounds dirty, but it makes a lot of sense for the non-programmer target users.) For example, the usual logical operators such as or and not are to be treated as identifiers when surrounded in brackets, [like this and this]. My current approach looks like this:
bracketedStatement
: '[' bracketedWord+ ']'
;
bracketedWord
: (~(']')+
;
This, when combined with lexical definitions such as the following:
AND: 'and' ;
OR: 'or' ;
Produces the warning"Decision can match input such as "{AND..PROCESS, RPAREN..'with'}" using multiple alternatives: 1, 2". I'm clearly creating ambiguity for ANTLR, but I don't know how to resolve it. How do I fix this?

For anyone who finds this, check out this stack overflow question. It clarifies how to use the negation symbol correctly.

Flex (lexical analyzer) {+} and {-} operators

I am trying to make a simple scanner using Flex. In the declarations section, I am trying to use the {-} operator to exclude reserved words from an id, but I can't get it to work. Every example I have found uses the {+} and {-} operators as in the following code:
[a-z]{-}[d]
However, I am trying to use these operators as in the following code, but I always get errors:
invalid_id "char"|"else"|"if"|"class"|"new"|"return"|"void"|"while"|"int"
all_ids [a-zA-Z_][a-zA-Z0-9_]*
id {all_ids}{-}{invalid_id}
Is there any way to make it work? Can these operators be used without square brackets?

The {-} and {+} operators only work on character classes like [a-z], not on full regular expressions or strings. flex does not have support for a more general {-} operator. The more general version of {+} of course is |. Your particular problem is in general solved by the fact that if two patterns match the same string, then the first pattern will be used. So changing your specification to the following will actually exclude all keywords from the IDs.
%%
"char"|"else"|"if"|"class"|"new"|"return"|"void"|"while"|"int" { return KEYWORD; }
[a-zA-Z_][a-zA-Z0-9_]* { return ID; }
%%

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas