Grammar-Kit: how to handle comment tokens - intellij-idea

From the documentation provided for grammar-kit, I cannot figure out how I am supposed to correctly handle something like comments. My lexer currently returns TokenType.WHITE_SPACE for any comment blocks, but then no unique IElementType is generated for me to do syntax highlighting on.
If I create an IElementType and tell flex to return that for comments, I can perform syntax highlighting, but then that token is not a part of my language spec in the BNF, and so it is considered invalid.
What is the correct way to pass comments through as white space, but perform syntax highlighting on them in Intellij/grammar-kit/jflex?

You can use Grammar-Kit implementation as a reference:
Lexer
Grammar
ParserDefinition
Using TokenType.WHITE_SPACE for comments is a bad idea.
More details can be found here.

Related

Parsing annotations within comments / Negative matching of strings in regexps

I'm defining the syntax of the Move IR. The test suite for this language includes various annotations to enable testing. I need to treat comments of this form specially:
//! new-transaction
// check: "Keep(ABORTED { code: 123,"
This file is an example arithmetic_operators_u8.mvir.
So far, I've got this working by disallowing ordinary single-line comments.
module MOVE-ANNOTATION-SYNTAX-CONCRETE
imports INT-SYNTAX
syntax #Layout ::= r"([\\ \\n\\r\\t])" // Whitespace
syntax Annotation ::= "//!" "new-transaction" [klabel(NewTransaction), symbol]
syntax Check ::= "//" "check:" "\"Keep(ABORTED { code:" Int ",\"" [klabel(CheckCode), symbol]
endmodule
module MOVE-ANNOTATION-SYNTAX-ABSTRACT
imports INT-SYNTAX
syntax Annotation ::= "#NewTransaction" [klabel(NewTransaction), symbol]
syntax Check ::= #CheckCode(Int) [klabel(CheckCode), symbol]
endmodule
I'd like to also be able to use ordinary comments.
As a first step, I was able to change the Layout to allow commits only if the begin with a ! using r"(\\/\\/[^!][^\\n\\r]*)"
I'd like to exclude all comments that start with either //! or // check: from comments. What's a good way of implementing this?
Where can I find documentation for the regular expression language that K uses?
K uses flex for its scanner, and thus for its regular expression language. As a result, you can find documentation on its regular expression language here.
You want a regular expression that expresses that comments can't start with ! or check:, but flex doesn't support negative lookahead or not patterns, so you will have to exhaustively enumerate all the cases of comments that don't start with those sequence of characters. It's a bit tedious, sadly.
For reference, here is a (simplified) regular expression drawn from the syntax of C that represents all pragmas that don't start with STDC. It should give you an idea of how to proceed:
#pragma([:space:]*)([^S].*|S|S[^T].*|ST|ST[^D].*|STD|STD[^C].*|STDC[_a-zA-Z0-9].*)?$

ANTLR4 and parsing a type-length-value format

I am trying create a grammar for a format that follows a type-length-value convention. Can ANTLR4 read in a length value and then parse that many characters?
NO ...
From your question (which is very short so I could miss something ...) I gather you are mixing grammar and encoding rules.
When you say type-length-value, it sounds like an encoding rule to me (how to serialize a data). In my experience, you write this code yourself.
A grammar is at a higher level: it's a piece of text that describes something. Antlr will help you breaking this text into tokens and then into a tree that you can navigate.
This step only handles text: if you were going that way to solve your problem, you would still have to handle type, length and value yourself.
EDIT:
with a bit of googling I found this https://github.com/NickstaDB/SerializationDumper

How to disable ParseKit's default Parsers?

From Parsekit: how to match individual quote characters?
If you define a parser:
#start = int;
int = /[+-]?[0-9]+/
Unfortunately it isn't going to be parsing any integers prefixed with a "+", unless you include:
#numberState = "+" // at the top.
In the number parse above, the "Symbol" default parser wasn't even mentioned, yet it is still active and overrides user defined parsers.
Okay so with numbers you can still fix it by adding the directive. What if you're trying to create a parser for "++"? I haven't found any directive that can make the following parser work.
#start = plusplus;
plusplus = "++";
The effects of default parsers on the user parser seems so arbitrary. Why can't I parse "++"?
Is it possible to just turn off default Parsers altogether? They seem to get in the way if I'm not doing something common.
Or maybe I've got it all wrong.
EDIT:
I've found a parser that would parse plus plus:
#start = plusplus;
plusplus = plus plus;
plus = "+";
I am guessing the answer is: the literal symbols defined in your parser cannot overlap between default parsers; It must be contained completely by at least once of them.
Developer of ParseKit here.
I have a few responses.
I think you'll find the ParseKit API highly elegant and sensible, the more you learn. Keep in mind that I'm not tooting my own horn by saying that. Although I built ParseKit, I did not design the ParseKit API. Rather, the design of ParseKit is based almost entirely on the designs found in Steven Metsker's Building Parsers In Java. I highly recommend you checkout the book if you want to deeply understand ParseKit. Plus it's a fantastic book about parsing in general.
You're confusing Tokenizer States with Parsers. They are two distinct things, but the details are more complex than I can answer here. Again, I recommend Metsker's book.
In the course of answering your question, I did find a small bug in ParseKit. Thanks! However, it was not affecting your outcome described above as you were not using the correct grammar to get the outcome it seems you were looking for. You'll need to update your source code from The Google Code Project now, or else my advice below will not work for you.
Now to answer your question.
I think you are looking for a grammar which both recognizes ++ as a single multi-char Symbol token and also recognizes numbers with leading + chars as explicitly-positive numbers rather than a + Symbol token followed by a Number token.
The correct grammar I believe you are looking for is something like this:
#symbols = '++'; // declare ++ as a multi-char symbol
#numberState = '+'; // allow explicitly-positive numbers
#start = (Number|Symbol)*;
Input like this:
++ +1 -2 + 3 ++
Will be tokenized like so:
[++, +1, -2, +, 3, ++]++/+1/-2/+/3/++^
Two reminders:
Again, you will need to update your source code now to see this work correctly. I had to fix a bug in this case.
This stuff is tricky, and I recommend reading Metsker's book to fully understand how ParseKit works.

Bison input analyzer - basic question on optional grammar and input interpretation

I am very new to Flex/Bison, So it is very navie question.
Pardon me if so. May look like homework question - but I need to implement project based on below concept.
My question is related to two parts,
Question 1
In Bison parser, How do I provide rules for optional input.
Like, I need to parse the statment
Example :
-country='USA' -state='INDIANA' -population='100' -ratio='0.5' -comment='Census study for Indiana'
Here the ratio token can be optional. Similarly, If I have many tokens optional, then How do I provide the grammar in the parser for the same?
My code looks like,
%start program
program : TK_COUNTRY TK_IDENTIFIER TK_STATE TK_IDENTIFIER TK_POPULATION TK_IDENTIFIER ...
where all the tokens are defined in the lexer. Since there are many tokens which are optional, If I use "|" then there will be many different ways of input combination possible.
Question 2
There are good chance that the comment might have quotes as part of the input, so I have added a token -tag which user can provide to interpret the same,
Example :
-country='USA' -state='INDIANA' -population='100' -ratio='0.5' -comment='Census study for Indiana$'s population' -tag=$
Now, I need to reinterpret Indiana$'s as Indiana's since -tag=$.
Please provide any input or related material for to understand these topic.
Q1: I am assuming we have 4 possible tokens: NAME , '-', '=' and VALUE
Then the grammar could look like this:
attrs:
attr attrs
| attr
;
attr:
'-' NAME '=' VALUE
;
Note that, unlike you make specific attribute names distinguished tokens, there is no way to say "We must have country, state and population, but ratio is optional."
This would be the task of that part of the program that analyses the data produced by the parser.
Q2: I understand this so, that you think of changing the way lexical analysis works while the parser is running. This is not a good idea, at least not for a beginner. Have you even started to think about lexical analysis, as opposed to parsing?

Handling invalid input in the Lexer / Parser

so, I am parsing Hayes modem AT commands. Not read from a file, but passed as char * (I am using C).
1) what happens if I get something that I totally don't recognize? How do I handle that?
2) what if I have something like
my_token: "cmd param=" ("value_1" | "value_2");
and receive an invalid value for "param"?
I see some advice to let the back-end program (in C) handle it, but that goes against the grain for me. Catch teh problem as early as you can, is my motto.
Is there any way to catch "else" conditions in lexer/parser rules?
Thanks in advance ...
That's the thing: the whole point of your parser and lexer is to blow up if you get bad input, then you catch the blow up and present a pretty error message to the user.
I think you're looking for Custom Syntax Error Recovery to embed in your grammar.
EDIT
I've no experience with ANTLR and C (or C alone for that matter), so follow this advice with caution! :)
Looking at the page: http://www.antlr.org/api/C/using.html, perhaps the part at the bottom, Implementing Customized Methods is what you're after.
HTH