ANTLR4 and parsing a type-length-value format - antlr

I am trying to create a grammar for a format that follows a type-length-value convention. Can ANTLR4 read in a length value and then parse that many characters?

NO ...
From your question (which is very short, so I may be missing something) I gather you are mixing up grammar and encoding rules.
When you say type-length-value, that sounds like an encoding rule to me (how to serialize data). In my experience, you write this code yourself.
A grammar works at a higher level: it describes the structure of a piece of text. ANTLR will help you break that text into tokens and then into a tree that you can navigate.
This step only handles text: if you went that way to solve your problem, you would still have to handle the type, length and value yourself.
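If it helps, decoding a record by hand is usually only a few lines. The following is a rough sketch of my own (not from the question), assuming a 1-byte type and a 2-byte big-endian length; adjust the field widths to your actual encoding:
#include <stddef.h>
#include <stdint.h>

struct tlv {
    uint8_t        type;
    uint16_t       length;
    const uint8_t *value;   /* points into the input buffer */
};

/* Reads one record; returns bytes consumed, or 0 if the buffer is too short. */
size_t tlv_read(const uint8_t *buf, size_t len, struct tlv *out)
{
    if (len < 3)
        return 0;
    out->type   = buf[0];
    out->length = (uint16_t)((buf[1] << 8) | buf[2]);   /* big-endian length */
    if (len < 3u + out->length)
        return 0;
    out->value = buf + 3;
    return 3u + out->length;
}
A loop over tlv_read then walks a whole buffer of records.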
EDIT:
With a bit of googling I found this: https://github.com/NickstaDB/SerializationDumper

Related

Grammar-Kit: how to handle comment tokens

From the documentation provided for grammar-kit, I cannot figure out how I am supposed to correctly handle something like comments. My lexer currently returns TokenType.WHITE_SPACE for any comment blocks, but then no unique IElementType is generated for me to do syntax highlighting on.
If I create an IElementType and tell flex to return that for comments, I can perform syntax highlighting, but then that token is not a part of my language spec in the BNF, and so it is considered invalid.
What is the correct way to pass comments through as white space, but perform syntax highlighting on them in Intellij/grammar-kit/jflex?
You can use the Grammar-Kit implementation as a reference:
Lexer
Grammar
ParserDefinition
Using TokenType.WHITE_SPACE for comments is a bad idea.
More details can be found here.

Software/tool for checking syntax format

I'm looking for a tool that can validate whether a given text/paragraph conforms to a specific format.
For example:
I want to be able to check whether the text looks like the following:
xxx{
sss:aaa;
}
yyy();
Preferably an open-source tool, with easy rule sets, like XML or something.
By text I mean a string that I get from e.g. fgets(), or any function that reads from a file.
For something like this I'd suggest a parser (see, for instance, What is Parse/parsing?). You can build one from a definition of the language that you want to parse using a parser generator like Yacc or its free GNU equivalent Bison, or any number of other parser generators, many of which are also freely available.
Most parsers are used to transform a text that complies with a grammar into some other form (e.g. an intermediate language or machine code), but that isn't necessary - in your case the parser could simply say (at a minimum) "Yes" if the text conforms to a given grammar.
Parsers for simple grammars can be built by hand but, if you have the tools available, using a parser generator is easier and more robust in my experience.
Further, the text that you've shown is similar to a portion of code written in the C language (something close to a struct declaration followed by a function call), so you would be able to re-use parts of the grammar that you need from an existing Yacc grammar for C like this one.
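To make the hand-built option concrete, here is a minimal sketch of my own (not from any existing tool) that answers only "Yes" or "No" for text shaped like the example in the question:
/* Minimal hand-written checker for text of the form:
 *   name{ key:value; ... }
 *   name();
 * A sketch only; a real tool would report positions, allow digits, etc. */
#include <ctype.h>
#include <stdio.h>

static const char *p;                    /* cursor into the input string */

static void skip_ws(void) { while (isspace((unsigned char)*p)) p++; }

static int ident(void) {                 /* one or more letters */
    if (!isalpha((unsigned char)*p)) return 0;
    while (isalpha((unsigned char)*p)) p++;
    return 1;
}

static int expect(char c) { skip_ws(); if (*p != c) return 0; p++; return 1; }

static int block(void) {                 /* name{ key:value; ... } */
    skip_ws();
    if (!ident() || !expect('{')) return 0;
    skip_ws();
    while (isalpha((unsigned char)*p)) { /* zero or more key:value; pairs */
        if (!ident() || !expect(':')) return 0;
        skip_ws();
        if (!ident() || !expect(';')) return 0;
        skip_ws();
    }
    return expect('}');
}

static int call(void) {                  /* name(); */
    skip_ws();
    return ident() && expect('(') && expect(')') && expect(';');
}

int conforms(const char *text) {         /* "Yes"/"No" style check */
    p = text;
    if (!block() || !call()) return 0;
    skip_ws();
    return *p == '\0';                   /* nothing may follow */
}

int main(void) {
    printf("%s\n", conforms("xxx{\nsss:aaa;\n}\nyyy();\n") ? "Yes" : "No");
    return 0;
}
A Flex/Bison version would have the same shape, with grammar rules taking the place of the hand-written functions.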

How to disable ParseKit's default Parsers?

From Parsekit: how to match individual quote characters?
If you define a parser:
#start = int;
int = /[+-]?[0-9]+/
Unfortunately it isn't going to be parsing any integers prefixed with a "+", unless you include:
#numberState = "+" // at the top.
In the number parser above, the "Symbol" default parser wasn't even mentioned, yet it is still active and overrides user-defined parsers.
Okay, so with numbers you can still fix it by adding the directive. But what if you're trying to create a parser for "++"? I haven't found any directive that can make the following parser work.
#start = plusplus;
plusplus = "++";
The effects of the default parsers on a user-defined parser seem so arbitrary. Why can't I parse "++"?
Is it possible to just turn off default Parsers altogether? They seem to get in the way if I'm not doing something common.
Or maybe I've got it all wrong.
EDIT:
I've found a parser that would parse plus plus:
#start = plusplus;
plusplus = plus plus;
plus = "+";
I am guessing the answer is: the literal symbols defined in your parser cannot overlap between default parsers; each must be contained completely by at least one of them.
Developer of ParseKit here.
I have a few responses.
I think you'll find the ParseKit API highly elegant and sensible the more you learn. Keep in mind that I'm not tooting my own horn by saying that: although I built ParseKit, I did not design the ParseKit API. Rather, the design of ParseKit is based almost entirely on the designs found in Steven Metsker's Building Parsers with Java. I highly recommend you check out the book if you want to deeply understand ParseKit. Plus, it's a fantastic book about parsing in general.
You're confusing Tokenizer States with Parsers. They are two distinct things, but the details are more complex than I can answer here. Again, I recommend Metsker's book.
In the course of answering your question, I did find a small bug in ParseKit. Thanks! However, it was not affecting the outcome described above, as you were not using the correct grammar to get the result it seems you were looking for. You'll need to update your source code from the Google Code project now, or else my advice below will not work for you.
Now to answer your question.
I think you are looking for a grammar which both recognizes ++ as a single multi-char Symbol token and also recognizes numbers with leading + chars as explicitly-positive numbers rather than a + Symbol token followed by a Number token.
The correct grammar I believe you are looking for is something like this:
#symbols = '++'; // declare ++ as a multi-char symbol
#numberState = '+'; // allow explicitly-positive numbers
#start = (Number|Symbol)*;
Input like this:
++ +1 -2 + 3 ++
Will be tokenized like so:
[++, +1, -2, +, 3, ++]++/+1/-2/+/3/++^
Two reminders:
Again, you will need to update your source code now to see this work correctly. I had to fix a bug in this case.
This stuff is tricky, and I recommend reading Metsker's book to fully understand how ParseKit works.

Bison input analyzer - basic question on optional grammar and input interpretation

I am very new to Flex/Bison, so this is a very naive question.
Pardon me if so. It may look like a homework question, but I need to implement a project based on the concept below.
My question has two parts.
Question 1
In a Bison parser, how do I provide rules for optional input?
For instance, I need to parse this statement.
Example :
-country='USA' -state='INDIANA' -population='100' -ratio='0.5' -comment='Census study for Indiana'
Here the ratio token can be optional. Similarly, if many of the tokens are optional, how do I write the grammar for that in the parser?
My code looks like,
%start program
program : TK_COUNTRY TK_IDENTIFIER TK_STATE TK_IDENTIFIER TK_POPULATION TK_IDENTIFIER ...
where all the tokens are defined in the lexer. Since many of the tokens are optional, if I use "|" there will be many different possible combinations of input.
Question 2
There is a good chance that the comment might have quotes as part of the input, so I have added a -tag token which the user can provide so that they can be interpreted correctly,
Example :
-country='USA' -state='INDIANA' -population='100' -ratio='0.5' -comment='Census study for Indiana$'s population' -tag=$
Now, I need to reinterpret Indiana$'s as Indiana's since -tag=$.
Please provide any input or related material to help me understand these topics.
Q1: I am assuming we have 4 possible tokens: NAME, '-', '=' and VALUE
Then the grammar could look like this:
attrs:
attr attrs
| attr
;
attr:
'-' NAME '=' VALUE
;
Note that, unless you make specific attribute names distinguished tokens, there is no way to say "We must have country, state and population, but ratio is optional."
This would be the task of that part of the program that analyses the data produced by the parser.
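A sketch of what that post-parse check could look like (the attr struct and the choice of required names are my assumptions, not part of Bison or of your code):
#include <stdio.h>
#include <string.h>

struct attr { const char *name; const char *value; };   /* filled in by the parser actions */

static const char *find_attr(const struct attr *attrs, int n, const char *name)
{
    for (int i = 0; i < n; i++)
        if (strcmp(attrs[i].name, name) == 0)
            return attrs[i].value;
    return NULL;
}

/* Returns 1 if all required attributes are present; ratio and comment stay optional. */
int check_required(const struct attr *attrs, int n)
{
    const char *required[] = { "country", "state", "population" };
    for (int i = 0; i < 3; i++)
        if (find_attr(attrs, n, required[i]) == NULL) {
            fprintf(stderr, "missing required attribute: %s\n", required[i]);
            return 0;
        }
    return 1;
}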
Q2: If I understand correctly, you are thinking of changing the way lexical analysis works while the parser is running. This is not a good idea, at least not for a beginner. Have you even started to think about lexical analysis, as opposed to parsing?

Take input from function parameters, not a file

All of the examples I see read in a file, then lex and parse it.
I need a function which takes a string (char *, I'm generating C code) as a parameter and acts upon that.
How can I do that best? I thought of writing the string to a stream, then feeding that to the lexer, but it doesn't feel right. Is there any better way?
Thanks in advance
You would need to use the antlr3NewAsciiStringInPlaceStream method.
You didn't say what version of ANTLR you are using, so I'll assume ANTLR v3.
The inputs to this method are the string to parse and its length, and then you can probably use NULL for the last input.
This produces an input stream similar to the one from antlr3AsciiFileStreamNew that you would use to parse a file.
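Put together, the setup might look roughly like this (an untested sketch; "MyGrammar" stands in for your grammar's name and startRule for its entry rule, so the generated type and function names below are assumptions):
#include <string.h>
#include <antlr3.h>
#include "MyGrammarLexer.h"
#include "MyGrammarParser.h"

void parse_string(const char *text)
{
    pANTLR3_INPUT_STREAM input = antlr3NewAsciiStringInPlaceStream(
            (pANTLR3_UINT8)text, (ANTLR3_UINT32)strlen(text), NULL);

    pMyGrammarLexer lexer = MyGrammarLexerNew(input);
    pANTLR3_COMMON_TOKEN_STREAM tokens =
            antlr3CommonTokenStreamSourceNew(ANTLR3_SIZE_HINT, TOKENSOURCE(lexer));
    pMyGrammarParser parser = MyGrammarParserNew(tokens);

    parser->startRule(parser);   /* call your grammar's entry rule here */

    /* tear everything down in reverse order */
    parser->free(parser);
    tokens->free(tokens);
    lexer->free(lexer);
    input->close(input);
}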
I see that you mentioned writing the input to a stream. If you can use C++, then that's probably the best method you'll come by.
This is the barebones code I normally use:
// Lexer, Parser and TreeWalker here are the classes ANTLR generated from my grammar.
std::istringstream issInput(strInput); // strInput: the std::string (or char*) you were handed; use an ifstream if you want to parse a file
Lexer oLexer(issInput);
Parser oParser(oLexer);
antlr::ASTFactory oFactory("CommonASTWithHiddenTokens", &antlr::CommonASTWithHiddenTokens::factory);
oParser.initializeASTFactory(oFactory);
oParser.setASTFactory(&oFactory);
oParser.main(); // start rule of the grammar
antlr::RefAST ast = oParser.getAST();
if (ast)
{
    TreeWalker oTreeWalker;
    oTreeWalker.main(ast, rPCode); // rPCode is an output parameter from the surrounding code
}
I think you should feed it to a stream. You could feed it to stdin if you'd like. That way, your code shouldn't differ too much from reading strings from a file.