Why this notation for the lexer production in ANTLR?

In the following SQL lexer:
https://github.com/tshprecher/antlr_psql/blob/master/antlr4/PostgreSQLLexer.g4
It defines true as:
TRUE : T R U E;
Why are the capitals spaced out like that instead of just TRUE: 'TRUE'? What's the reasoning for that notation? Does T refer to another production, and is that why it's spelled like that?

These single letters are (fragment) lexer rules too. Check the grammar out! This way you can define case-insensitive keywords. This was the usual approach for case-insensitivity until this was built into ANTLR4 in version 4.10.
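To make the effect of the per-letter rules concrete, here is a minimal Python sketch (not ANTLR code). It assumes each letter fragment is defined along the lines of `fragment T : [tT];`, so that `T R U E` amounts to one two-case character class per letter. The helper name `case_insensitive_pattern` is made up for this illustration.

```python
import re

def case_insensitive_pattern(keyword: str) -> str:
    """Expand a keyword into one two-case character class per letter,
    mirroring fragment rules of the assumed form  fragment T : [tT];"""
    parts = []
    for ch in keyword:
        if ch.isalpha():
            parts.append(f"[{ch.lower()}{ch.upper()}]")
        else:
            # Non-letters are matched literally.
            parts.append(re.escape(ch))
    return "".join(parts)

TRUE = re.compile(case_insensitive_pattern("TRUE"))

print(case_insensitive_pattern("TRUE"))  # [tT][rR][uU][eE]
print(bool(TRUE.fullmatch("TrUe")))      # True
print(bool(TRUE.fullmatch("TRUTH")))     # False
```

Writing `TRUE : 'TRUE';` instead would correspond to the plain pattern `TRUE`, which matches only the upper-case spelling.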

Related

Is there a parser tag for comments in K?

Is there any built-in tag for block, line or in-line comments for the parser generator?
e.g.
comment blocks "(*" Exp "*)" or inline comments "//" Exp.
In a parser generator like menhir, I would normally handle comments by pattern matching with the lexer, so comments wouldn't be part of the AST. Is there an equivalent in K?
If not, what is the recommended way of implementing comments?
You can declare the builtin sort #Layout to be the concatenation via pipes of a set of regular expression terminals (e.g., r"//[^\\n]*"). Any text that lexes as one of these terminals is simply discarded by the lexer, and the parser does not even see it. Note that this applies only to parsing terms using a generated parser or kast; parsing rules in .k files will still require the usual K syntax for comments.
Note that this is also how whitespace is parsed, so unless your language is whitespace sensitive, make sure to include in #Layout any whitespace characters which you want the parser to ignore.

Modifying ANTLR v4 auto-generated lexer?

I am writing a small language using ANTLR v4 as my tool. ANTLR generates the lexer and parser files when you compile your grammar file (.g4). I am using javac, by the way. I want my language to have no semicolons, and the way I want to do this is: if an identifier or ")" is the last token on a line, the lexer automatically inserts a semicolon (similar to what Go does). How would I approach something like this? The lexer file also contains things like an ATN (which I think is an augmented transition network) and a DFA (which I think is a deterministic finite automaton) that I don't understand. How do they relate to the lexing process? Any help is appreciated. (I am still working on the grammar file, so it is not fully complete.)
Several points here: the ATN and the DFA are internal structures of the parser and lexer, not something you would touch to change parsing behavior. Also, it's not clear to me why you want the lexer to insert a semicolon at some point. What exactly do you want to accomplish by that (don't say: to make semicolons optional in the parser, I mean the underlying reason)?
If you want to accept a command without a trailing semicolon you can make that optional:
assignment: (simpleAssignment | complexAssignment) SEMI?;
The parser will give you the content of the assignment rule regardless of whether there is a trailing semicolon or not. Is that what you want?
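For reference, the Go-style rule described in the question can be expressed as a small pass over the token list, without touching the ATN or DFA. The following is a language-neutral Python sketch, not ANTLR code: the token type names (`IDENT`, `RPAREN`, `NEWLINE`, `SEMI`) and the tuple representation are assumptions made for illustration. In an ANTLR project the same logic would presumably sit in a filter between the lexer and the parser; that wiring is not shown here.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (type, text); the type names are hypothetical

# Token types after which a line end turns into a semicolon.
TRIGGERS = ("IDENT", "RPAREN")

def insert_semicolons(tokens: List[Token]) -> List[Token]:
    """Replace each NEWLINE that follows an identifier or ')' with SEMI
    and drop all other NEWLINE tokens."""
    out: List[Token] = []
    for tok in tokens:
        if tok[0] == "NEWLINE":
            if out and out[-1][0] in TRIGGERS:
                out.append(("SEMI", ";"))
            continue  # the newline itself never reaches the parser
        out.append(tok)
    # End of input is treated like the end of a line.
    if out and out[-1][0] in TRIGGERS:
        out.append(("SEMI", ";"))
    return out

tokens = [
    ("IDENT", "x"), ("ASSIGN", "="), ("IDENT", "f"),
    ("LPAREN", "("), ("RPAREN", ")"), ("NEWLINE", "\n"),
    ("IDENT", "y"), ("PLUS", "+"), ("NEWLINE", "\n"),
    ("IDENT", "z"),
]
print([t[0] for t in insert_semicolons(tokens)])
```

The line ending after `+` produces no semicolon, so an expression can continue on the next line, which is the behavior the question attributes to Go.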

Failed to parse command using ANTLR3 grammar, if command has same word which is declared as rule

I am facing a problem while parsing some commands with a parser that I have implemented using ANTLR3. The parser fails to parse commands that contain any word that is declared as a lexer rule in the grammar.
For example, take a look at the following grammar:
show :
SHOW TABLES '[' projectName? tableName']' -> ^(SHOW TABLES_ ^(PROJECT_NAME projectName)? ^(DATASET_TABLE tableName));
SHOW : S H O W;
If I try to parse the command 'SHOW TABLES [sample-project:SHOW]', parsing fails. But if I change the word SHOW, it works.
SHOW TABLES [sample-project:SHOW] - this works.
I don't want to have to give the name as a string surrounded by quotes (").
Can anyone suggest a solution? I am using ANTLR3.
Thanks in advance.
This is a typical effect of using a reserved word as an identifier. In ANTLR, when you define a reserved word like your SHOW rule, it is implicitly excluded from any identifier rule you might have defined after that keyword rule.
The solution to allow such keywords as identifiers in rules like your tableName is to make that rule accept certain (or all) keywords that could appear in that place (where they then do not act as keywords). Example:
tableName:
IDENTIFIER
| SHOW
| <others go here>
;

Solve ambiguity in my grammar with LALR parser

I'm using whittle to parse a grammar, but I'm running into the classical LALR ambiguity problem. My grammar looks like this (simplified):
<comment> ::= '{' <string> '}' # string enclosed in braces
<tag> ::= '[' <name> <quoted-string> ']' # [tagname "tag value"]
<name> ::= /[A-Za-z_]+/ # subset of all printable chars
<quoted-string> ::= '"' <string> '"' # string enclosed in quotes
<string> ::= /[:print:]/ # regex for all printable chars
The problem, of course, is <string>. It contains all printable characters and is therefore very greedy. Since it's an LALR parser, it tries to parse a <name> as a <string> and everything breaks. The grammar complicates things because it uses different string delimiters for different things, which is why I tried to make the <string> rule in the first place.
Is there a canonical way to normalize this grammar to make it LALR compliant, if it's even possible?
This is not "the classical LALR ambiguity problem", whatever that might be. It is simply an error in the lexical specification of the language.
I took a quick glance at the Whittle readme, but it didn't bear any resemblance to the grammar in the OP. So I'm assuming that the text in the OP is conceptual rather than literal, and the fact that it includes the obviously incorrect
<string> ::= /[:print:]/ # regex for all printable chars
is just a typo.
Better would have been /[:print:]*/, assuming that Ruby lets you get away with [:print:] rather than the Posix-standard [[:print:]].
But that wouldn't be correct either because lexing (usually) matches the longest possible string, and consequently that will gobble up the closing quote and any following text.
So the correct solution for quoted-string is to write it out correctly:
<quoted-string> ::= /"[^"]*"/
or even
<quoted-string> ::= /"([^\\"]|\\.)*"/
# any number of characters other than quote or escape, or escaped pairs
You might have other ideas about how to escape internal double quotes; those are just examples. In both cases, you need to postprocess the token in order to (at least) strip the double quotes and possibly interpret escape sequences. That's just the way it goes.
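The difference between the greedy pattern and the two corrected ones, plus the postprocessing step, can be tried out with a short Python sketch. Python's `re` module stands in for Whittle's lexer here, which is an assumption; the `unquote` helper and the sample strings are made up for this illustration.

```python
import re

# Greedy "any characters" between quotes: runs on to the LAST quote.
GREEDY = re.compile(r'".*"')

# Same shape as /"([^\\"]|\\.)*"/ : ordinary characters or escaped pairs.
QUOTED = re.compile(r'"(?:[^\\"]|\\.)*"')

def unquote(token: str) -> str:
    """Strip the surrounding quotes, then replace each backslash-escaped
    character with the character itself."""
    body = token[1:-1]
    return re.sub(r'\\(.)', r'\1', body)

two = '"a" and "b"'
print(GREEDY.search(two).group(0))   # "a" and "b"
print(QUOTED.search(two).group(0))   # "a"

text = r'[event "say \"hi\""] trailing'
token = QUOTED.search(text).group(0)
print(token)                         # "say \"hi\""
print(unquote(token))                # say "hi"
```

This `unquote` only handles the simple "backslash means take the next character literally" convention; sequences such as `\n` would need their own table.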
Your comment sequences present a more difficult issue, assuming that your intention was that a comment might include nested braces (eg. {This comment {with this} ends here}) because the nested brace syntax is not regular and thus cannot be matched with a regular expression. Of course, very few "regular expression" libraries are really regular these days, and I don't know if Ruby contains some sort of brace-counting extension, like for example Lua's pattern syntax. The nested brace syntax is certainly context-free but to actually parse it you need to lexically analyze the contents of the outer {...} in a different way than the rest of the program.
It is this latter observation, and not any weakness in the LALR algorithm, that is causing you pain, and I'd say that this is a weakness with the (mostly undocumented afaics) lexical analysis section of whittle. In a flex-generated lexer, for example, it would be normal to use start conditions to separate the lexical environments (program / quoted string / braced comment), and the parser would then have no ambiguity.
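As a rough illustration of lexing the inside of a braced comment differently from the rest of the input, here is a hand-written Python scanner that counts brace depth. It is not Whittle or flex code and is only one possible approach; the function name `scan_braced_comment` is hypothetical, and it assumes braces inside comments are always balanced and never escaped.

```python
def scan_braced_comment(text: str, start: int) -> int:
    """Return the index just past the '}' that matches the '{' at
    text[start]. Raises ValueError if there is no such brace."""
    if start >= len(text) or text[start] != "{":
        raise ValueError("expected '{' at start")
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return i + 1
    raise ValueError("unterminated comment")

src = '{This comment {with this} ends here} [tag "v"]'
end = scan_braced_comment(src, 0)
print(src[:end])  # {This comment {with this} ends here}
print(src[end:])  # the rest, starting with the space before [tag "v"]
```

A surrounding lexer would call this when it sees `{` in program context, emit one comment token for the returned span, and resume normal tokenizing at `end`.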
Hope that helps.

ANTLR3 Dynamic quotes in lexer

I need to match something like the Perl regexp matcher
m/my regex!*/
where the quotes can be any character from a range. So the above is the same as
m%my regex!*%
A naive guess of a lexer rule would be
REGEX: 'm' quote=. (~(quote))* quote;
but that does not work, because the latter quote is not referring to the quote= but to some rule.
I can do it with a lot of own code, like
REGEX: 'm' quote=. { ... implement the loop and final match myself ... } ;
but somehow I think there should be a canonical way to do such things.
... but somehow I think there should be a canonical way to do such things.
There is not. You'll have to do this with custom code.
Take a look at the PL/SQL parser. Oracle also supports Perl-style quoted strings of this kind.
Like:
q':select * from employees where last_name = 'smith':'
Use the custom code as an example. (It contains C and Java implementations.)
Maybe in your case it can be even simplified.
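To show the shape of the custom matching logic, here is a Python sketch. It is neither ANTLR action code nor taken from the PL/SQL grammar mentioned above. The function name `match_m_regex` is hypothetical, and the sketch assumes that the delimiter is a single non-alphanumeric, non-whitespace character and that the body cannot contain an escaped delimiter.

```python
from typing import Optional, Tuple

def match_m_regex(text: str, pos: int = 0) -> Optional[Tuple[str, int]]:
    """Match m<delim>body<delim> starting at text[pos].
    Returns (body, index just past the closing delimiter) or None."""
    if pos + 1 >= len(text) or text[pos] != "m":
        return None
    delim = text[pos + 1]
    if delim.isalnum() or delim.isspace():
        return None  # assumption: delimiters are punctuation only
    close = text.find(delim, pos + 2)
    if close == -1:
        return None  # no closing delimiter
    return text[pos + 2:close], close + 1

print(match_m_regex("m/my regex!*/"))  # ('my regex!*', 13)
print(match_m_regex("m%my regex!*%"))  # ('my regex!*', 13)
print(match_m_regex("m/unterminated")) # None
```

The key point is that the closing delimiter is a runtime value captured from the input, which is why a static lexer rule cannot refer to it and the loop has to be written by hand.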
Ivan