Antlr: "spaced token" lexer style for keywords - why? - antlr

I'm studying the collection of grammars for various languages.
The SQLite lexer uses this "spaced letter" style for defining the SQL keywords.
So, for example:
CREATE: C R E A T E
... and then a set of fragments at the bottom for each letter in the alphabet.
I would have probably done the style below:
CREATE: 'CREATE'
I was curious what the spaced style they have used means. I tried both styles in the ANTLR IntelliJ plugin, and when given the program text CREATE, both yield the same parse tree. Does the style they use have some intrinsic advantage, or is it just stylistic?

The grammar you link uses a fragment for each character. That way the lexer can match keywords in a case-insensitive way.
At the bottom of the grammar, you see fragments defined like:
fragment A: [aA];
fragment B: [bB];
fragment C: [cC];
...
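In other words, when you write CREATE: C R E A T E, those spaced letters are actually fragment references, so the rule is equivalent to CREATE: [cC] [rR] [eE] [aA] [tT] [eE]. The literal 'CREATE' matches only the upper-case spelling, which is why both styles gave the same parse tree for the input CREATE.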
For more details, see Case-Insensitive Lexing in the ANTLR documentation.
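To see the difference between the two styles outside ANTLR, here is a minimal Python sketch that mimics them with regular expressions. It is only an analogy: `fragment_pattern` is a hypothetical helper, and Python's `re` module stands in for the generated lexer.

```python
import re


def fragment_pattern(keyword: str) -> str:
    """Expand each letter into a two-case class, like `fragment C: [cC];`."""
    parts = []
    for ch in keyword:
        if ch.isalpha():
            parts.append(f"[{ch.lower()}{ch.upper()}]")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


# Analogue of  CREATE: C R E A T E;  (each letter is a fragment)
spaced = re.compile(fragment_pattern("CREATE"))
# Analogue of  CREATE: 'CREATE';  (exact literal)
literal = re.compile(re.escape("CREATE"))

for text in ["CREATE", "create", "Create"]:
    print(text, bool(spaced.fullmatch(text)), bool(literal.fullmatch(text)))
```

Both patterns accept CREATE, which matches what the plugin showed, but only the fragment-style pattern also accepts create and Create.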

Related

Why this notation for the Lexer production in antlr?

In the following SQL lexer:
https://github.com/tshprecher/antlr_psql/blob/master/antlr4/PostgreSQLLexer.g4
It defines true as:
TRUE : T R U E;
Why are the capitals spaced out like that instead of just TRUE: 'TRUE' ? What's the reasoning for that notation? Does T refer to another production or something and that's why it's spelled like that?
These single letters are (fragment) lexer rules too. Check the grammar out! This way you can define case-insensitive keywords. This was the usual approach for case-insensitivity until this was built into ANTLR4 in version 4.10.

How can I show that this grammar is ambiguous?

I want to prove that this grammar is ambiguous, but I'm not sure how I am supposed to do that. Do I have to use parse trees?
S -> if E then S | if E then S else S | begin S L | print E
L -> end | ; S L
E -> i
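You can show it is ambiguous if you can find a string that parses more than one way. The string if i then if i then print i else print i can be grouped in two ways (the parentheses only mark the grouping; they are not part of the string):
if i then ( if i then print i else print i )
if i then ( if i then print i ) else print i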
This happens to be the classic "dangling else" ambiguity. Googling your tag(s), title & grammar gives other hits.
However, if you don't happen to guess an ambiguous string, googling your tag(s) & title turns up this answer to
"How can I prove that this grammar is ambiguous?":
There is no easy method for proving a context-free grammar ambiguous. In fact,
the question is undecidable, by reduction from the Post correspondence problem.
You can put the grammar into a parser generator which supports all context-free grammars, a context-free general parser generator. Generate the parser, then parse a string which you think is ambiguous and find out by looking at the output of the parser.
A context-free general parser generator generates parsers which produce all derivations in polynomial time. Examples of such parser generators include SDF2, Rascal, DMS, Elkhound, ART. There is also a backtracking version of yacc (btyacc) but I don't think it does it in polynomial time. Usually the output is encoded as a graph where alternative trees for sub-sentences are encoded with a nested set of alternative trees.
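The idea of parsing a candidate string and checking how many derivations come out can be sketched without a parser generator. The following is a small Python sketch, hand-written for the grammar in the question only (it is not a general context-free parser, and `count_parses` is a name made up for this illustration). It counts the distinct parse trees of a whitespace-separated token string.

```python
from functools import lru_cache


def count_parses(text: str) -> int:
    """Count parse trees of `text` under the grammar from the question:
         S -> if E then S | if E then S else S | begin S L | print E
         L -> end | ; S L
         E -> i
    Tokens are separated by whitespace."""
    tokens = tuple(text.split())
    n = len(tokens)

    @lru_cache(maxsize=None)
    def count_E(i, j):
        # E -> i : exactly one token, and it must be "i"
        return 1 if j == i + 1 and tokens[i] == "i" else 0

    @lru_cache(maxsize=None)
    def count_L(i, j):
        if j - i == 1 and tokens[i] == "end":      # L -> end
            return 1
        total = 0
        if j > i and tokens[i] == ";":             # L -> ; S L
            for k in range(i + 2, j):
                total += count_S(i + 1, k) * count_L(k, j)
        return total

    @lru_cache(maxsize=None)
    def count_S(i, j):
        if j <= i:
            return 0
        total = 0
        head = tokens[i]
        if head == "print":                        # S -> print E
            total += count_E(i + 1, j)
        elif head == "begin":                      # S -> begin S L
            for k in range(i + 2, j):
                total += count_S(i + 1, k) * count_L(k, j)
        elif (head == "if" and j - i >= 4
              and count_E(i + 1, i + 2) and tokens[i + 2] == "then"):
            total += count_S(i + 3, j)             # S -> if E then S
            for k in range(i + 4, j - 1):          # S -> if E then S else S
                if tokens[k] == "else":
                    total += count_S(i + 3, k) * count_S(k + 1, j)
        return total

    return count_S(0, n)


print(count_parses("if i then if i then print i else print i"))
```

A count greater than one for any string proves the grammar ambiguous; a count of one for the strings you happened to try proves nothing, which is consistent with the undecidability result quoted above.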

Antlr Arrow Syntax

I found this syntax in an Antlr parser for bash:
file_descriptor
: DIGIT -> ^(FILE_DESCRIPTOR DIGIT)
| DIGIT MINUS -> ^(FILE_DESCRIPTOR_MOVE DIGIT);
What does the -> syntax do?
What is it called such that I can google it to read about it?
The 'Definitive Guide to Antlr4' only has one page about it. It refers to "lexer command", but it never names the operator. The usage in the book differs from the usage in the bash parser.
In ANTLR3, -> is used in parser rules and signifies a tree rewrite rule, which is no longer supported in ANTLR4.
In ANTLR4, -> is used in lexer rules to attach a lexer command (for example -> skip or -> channel(HIDDEN)) and has nothing to do with the old v3 functionality.

What is the antlr4 (v-4.1) equivalent form of the following grammar rule (written for antlr3 (v-3.2))?

text
: tag => (tag)!
| outsidetag
;
The following is invalid in ANTLR 3:
text
: tag => (tag)!
| outsidetag
;
You probably meant the following:
text
: (tag)=> (tag)!
| outsidetag
;
where ( ... )=> is a syntactic predicate, which has no ANTLR4 equivalent: simply remove it. As 280Z28 mentioned (and as also explained in the previous link), syntactic predicates are not so much a feature that was removed from ANTLR 4 as a workaround for a weakness in ANTLR 3's prediction algorithm that no longer applies to ANTLR 4.
The exclamation mark in v3 excludes the matched element from the generated AST. Since ANTLR4 does not produce ASTs, just remove the exclamation mark as well.
So, the v4 equivalent would look like this:
text
: tag
| outsidetag
;

Lexing space separated words in ANTLR3 where some words are keywords

I am working on a project that involves transforming part of speech tagged text into an ANTLR3 AST with phrases as nodes of the AST.
The input to ANTLR looks like:
DT-THE The NN dog VBD sat IN-ON on DT-THE the NN mat STOP .
i.e. (tag token)+ where neither the tag nor the token contains whitespace.
Is the following a good way of lexing this:
WS : (' ')+ {skip();};
TOKEN : (~' ')+;
The grammar then has entries like the following to describe the lowest level of the AST:
dtTHE:'DT-THE' TOKEN -> ^('DT-THE' TOKEN);
nn:'NN' TOKEN -> ^('NN' TOKEN);
(and 186 more of these!)
This approach seems to work, but it results in a ~9000 line Java lexer and takes a large amount of memory to build (~2 GB), so I was wondering whether this is the optimal way of solving this problem.
Could you combine the TAG space TOKEN into a single AST tree? Then you could pass both the TAG and TOKEN into your source code for handling. If the Java code used to handle the resulting tree is very similar between the various TAGs, then you could perhaps simplify the ANTLR with the trade-off of a bit more complication in your Java code.
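As a rough illustration of that trade-off, here is a Python sketch that treats every tag as plain data instead of a grammar keyword. It is a stand-in for the handling code, not ANTLR output: `pair_tags` is a hypothetical helper, and it assumes the input really is (tag token)+ separated by whitespace.

```python
from typing import List, Tuple


def pair_tags(text: str) -> List[Tuple[str, str]]:
    """Split (tag token)+ input into (tag, token) pairs."""
    fields = text.split()
    if not fields or len(fields) % 2 != 0:
        raise ValueError("expected one or more 'tag token' pairs")
    # Even positions are tags, odd positions are the tagged words.
    return list(zip(fields[0::2], fields[1::2]))


pairs = pair_tags("DT-THE The NN dog VBD sat IN-ON on DT-THE the NN mat STOP .")
for tag, token in pairs:
    print(tag, token)
```

With the tags carried as data, a single generic rule pairing two TOKENs would be enough at the lowest level, and the per-tag behaviour moves into a lookup in the handling code rather than 188 separate grammar rules.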