How can I use the same lexer to provide token streams with and without whitespace? - antlr

I have a lexer grammar that defines a lexer that is used in two ways: to identify tokens for a syntax-aware editor, and to identify tokens for the parser. In the first case, the lexer should return comments and whitespace, but in the second case, the comments and whitespace are not wanted. Do I need two different lexer classes, each defined by its own variant of the grammar? Or can I accomplish this with a single lexer by using channels? How?
If I need two separate grammars, I assume I can factor out all the rules except for comments and whitespace, and then import those rules from that separate "common" grammar.

Usually you filter out tokens like whitespace via token channels (or skip them entirely). This is part of your grammar, so you'd need two grammars if you want whitespace in one use case and not in the other. And yes, you can import a base grammar with all the common rules into specialized grammars that hold only the differences. You can even override rules (e.g. define the whitespace rule in the base grammar and redefine it in your main grammar).
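For example, here is a minimal sketch of the import-and-override approach (ANTLR4 syntax; the grammar and rule names are illustrative, not from the question):

// CommonLexer.g4 -- the base grammar keeps comments and whitespace as ordinary tokens
lexer grammar CommonLexer;

ID      : [a-zA-Z_] [a-zA-Z_0-9]* ;
COMMENT : '//' ~[\r\n]* ;
WS      : [ \t\r\n]+ ;

// ParsingLexer.g4 -- redefines the rules to hide those tokens from the parser
lexer grammar ParsingLexer;
import CommonLexer;

COMMENT : '//' ~[\r\n]* -> channel(HIDDEN) ;
WS      : [ \t\r\n]+    -> channel(HIDDEN) ;

The editor would instantiate CommonLexer, while the parser would sit on top of ParsingLexer, whose hidden-channel tokens never reach it.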
But keep in mind that not filtering out whitespace has consequences for all your other rules: you would have to explicitly add whitespace handling to your parser rules everywhere. For instance:
blah: a or b;
versus
blah: a WS* or WS* b;

Related

Regex/token/rule to match nested curly braces?

I need to match the values of key = value pairs in BibTeX files, which are delimited by braces and can contain arbitrarily nested braces. I've got as far as matching at most two levels of nested curly braces, like {some {stuff} like {this}}, with the kludgey:
token brace-value {
    '{' <-[{}]>* ['{' <-[}]>* '}' <-[{}]>* ]* '}'
}
I shudder at the idea of going one level further down... but proper parsing of my BibTeX stuff needs at least three levels deep.
Yes, I know there are BibTeX parsers around, but I need to grab the complete entry for further processing, and peek at a few keys meanwhile. My *.bib files are rather tame (and I wouldn't mind handling a few stray entries by hand); the problem is that I have a lot of them, with much overlap. But some of the "same" entries have different keys, or extra data. I want to consolidate them into a few master files (the whole idea behind BibTeX, right?). Not fun by hand when bibtool's supposedly duplicate-free output (ha!) still runs to some 20 thousand lines...
After perusing Lenz's "Parsing with Perl 6 Regexes and Grammars" (Apress, 2017), I realized the "regex" machinery (based on backtracking) might actually be a lot more capable than officially admitted, as a regex can call another, and nowhere do I see a prohibition on recursive calls.
Before digging in, a bit of context-free grammar background: a way to describe nested braces (and nothing else) is with the grammar:
S -> { S } S | <nothing>
I.e., nested braces are either an opening brace, nested braces, a closing brace, more nested braces; or nothing whatsoever. This translates more or less directly to Raku (there is no empty regex, fake it by making the construction optional):
my regex nb {
    [ '{' <nb> '}' <nb> ]?
}
Lo and behold, this works. Need to fix up to avoid captures, kill backtracking (if it doesn't match on the first try, it won't ever match), and decorate with "anything else" fillers.
my regex nested-braces {
    :ratchet
    <-[{}]>*
    [ '{' <.nested-braces> '}' <.nested-braces> ]?
    <-[{}]>*
};
This checks out with my test cases.
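For example, anchored against a couple of test strings of my own (assuming nested-braces is in lexical scope):

say so '{a {b {c}} d}' ~~ /^ <nested-braces> $/;   # True
say so '{unbalanced {'  ~~ /^ <nested-braces> $/;  # False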
For not-so-adventurous souls, there is the Text::Balanced module for Perl (formerly Perl 5, callable from Raku using Inline::Perl5). Not directly useful to me inside a grammar, unfortunately.
Solution
A way to describe nested braces (and nothing else)
Presuming a rule named &R, I'd likely write the following pattern if I were writing a quick small one-off script:
\{ <&R>* \}
If I were writing a larger program that should be maintainable, I'd likely be writing a grammar and, using a rule named R, the pattern would be:
'{' ~ '}' <R>*
The latter avoids leaning toothpick syndrome and uses the regex ~ operator.
These will both parse arbitrarily deeply nested paired braces, e.g.:
say '{{{{}}}}' ~~ token { \{ <&?ROUTINE>* \} } # 「{{{{}}}}」
(&?ROUTINE refers to the routine in which it appears; a regex is a routine. Note that you can't use <&?ROUTINE> in a regex declared with / ... / syntax.)
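To show the grammar form end to end, here is a self-contained sketch (the grammar and rule names are mine):

grammar Braces {
    token TOP { <R> }
    token R   { '{' ~ '}' <R>* }
}
say Braces.parse('{{{}{}}}');   # 「{{{}{}}}」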
regex vs token
kill backtracking
my regex nested-braces {
:ratchet
The only difference between patterns declared with regex and token is that the former turns ratcheting off. So using it and then immediately turning ratcheting on is notably unidiomatic. Instead:
my token nested-braces {
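That is, keeping the body exactly as before, the idiomatic form of the whole rule is:

my token nested-braces {
    <-[{}]>*
    [ '{' <.nested-braces> '}' <.nested-braces> ]?
    <-[{}]>*
}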
Backtracking
the "regex" machinery (based on backtracking)
The grammar/regex engine does include backtracking as an optional feature because that's occasionally exactly what one wants.
But the engine is not "based on backtracking", and many grammars/parsers make little or no use of backtracking.
Recursion
a regex can call another, and nowhere do I see a prohibition on recursive calls.
This alone is nothing special for contemporary regex engines.
PCRE has supported recursion since 2000, and named regexes since 2003. Perl's default regex engine has supported both since 2007.
Their support for deeper levels of recursion and more named regexes being stored at once has been increasing over time.
Damian Conway's PPR uses these features of regexes to build non-trivial (but still small) parse trees.
Capabilities
a lot more capable
Raku "regexes" can be viewed as a cleaned up take on the unfolding regex evolution. To the degree this helps someone understand them, great.
But really, it's a whole new deal. For example, they're Turing complete, in a sensible way, and thus able to parse anything.
than officially admitted
Well that's an odd thing to say! Raku's Grammars are frequently touted as one of Raku's most innovative features.
There are three major caveats:
Performance: the primary current caveat is that a well-written C parser will blow the socks off a well-written Raku Grammar based parser.
Payoff: it's often not worth the effort it takes to write a fully correct parser for a non-trivial format if there's an existing parser.
Left recursion: Raku does not automatically rewrite left recursion, so left-recursive rules loop forever.
Using existing parsers
I know there are BibTeX parsers around, but I need to grab the complete entry for further processing, and peek at a few keys meanwhile.
Using a foreign module in Raku can be a bit of a revelation. It is not necessarily like anything you'll have experienced before. Raku's foreign language adaptors can do smart marshaling for you so it can be like you're using native Raku features.
Two of the available foreign language adaptors are already sufficiently polished to be amazing -- the ones for Perl and for C.
I'm pretty sure there's a BibTeX package for Perl that wraps a C BibTeX parser. If you used that you'd hopefully get parsing results all nicely wrapped up into Raku objects as if it was all Raku in the first place, but retaining much of the high performance of the C code.
A Raku BibTeX Grammar?
Perhaps your needs do call for creating and using a small Raku Grammar.
(Maybe you're doing this partly as an exercise to familiarize yourself with Raku, or the regex/grammar aspect of Raku. For that it sounds pretty ideal.)
As soon as you begin to use multiple regexes together -- even just two -- you are closing in on grammar territory. After all, they're just an easy-to-use construct for using multiple regexes together.
So if you decide you want to stick with writing parsing code in Raku, expect to write it something like this:
grammar BiBTeX {
    token TOP { ... }
    token ...
    token ...
}

BiBTeX.parse: my-bib-file
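To give a flavor, here is a tiny hedged sketch of how such a grammar might start; the rule names and the drastically simplified entry format are my assumptions, not a complete or correct BibTeX grammar:

grammar BibTeXSketch {
    rule TOP     { <entry>* }
    rule entry   { '@' <ident> '{' <ident> [',' <field>]* ','? '}' }
    rule field   { <ident> '=' <value> }
    token ident  { <-[\s{},=@"]>+ }
    token value  { \d+ | <braced> }
    token braced { '{' ~ '}' [ <-[{}]>+ | <.braced> ]* }   # nested braces, as above
}

say BibTeXSketch.parse('@article{key1, title = {A {nested} title}, year = 2020}');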
For more details, see the official doc's Grammar tutorial or read Moritz's book.
OK, just (re)checked. The documentation of '{' ~ '}' leaves a whole lot to be desired; it is not at all clear that it is meant to handle balanced, correctly nested delimiters.
So my final solution is really just along the lines:
my regex nested-braces {
    :ratchet
    '{' ~ '}' .*
}
Thanks everyone! Learned quite a bit today.

Regular grammars

In a regular grammar with the following rules
S → aS | aSbS | ε
is it acceptable to do the following steps:
S → aSbS → a{aSbS}bS → aa{aSbS}bSbS → aaa{aSbS}bSbSbS
Do I have to replace every S in each step, or can I replace just one S out of two, for example? In aSbS, can I do aSb (following the rule S → ε)? And if I cannot, must I replace all the S's using the same rule, (aSbS) → a(aS)b(aS) (following the rule S → aS), or can I do (aSbS) → a(aS)b(aSbS)?
P.S. I use the parentheses to indicate which S's I have replaced.
In a formal grammar, a derivation step consists of replacing a single instance of the left-hand side of a rule with the right-hand side of that rule. In the case of a context-free grammar, the left-hand side is a single non-terminal, so a derivation step consists of replacing a single instance of a non-terminal with one of the possible corresponding right-hand sides.
You never perform two replacements at the same time, and each derivation step is independent of every other derivation step, so in a context-free grammar, there is no way to express the constraint that two non-terminals need to be replaced with the same right-hand side. (In fact, you cannot directly express this constraint with a context-sensitive grammar either, but you can achieve the effect by using markers.)
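For example, with the asker's grammar, each of the following is a legal single derivation step from aSbS, rewriting exactly one of the two S's:

aSbS ⇒ abS    (S → ε applied to the first S)
aSbS ⇒ aSb    (S → ε applied to the second S)
aSbS ⇒ aaSbS  (S → aS applied to the first S)

So yes: aSb is a legal successor of aSbS, and the two S's never have to be replaced using the same rule.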
Regular grammars are a subset of context-free grammars where either every right-hand side which includes a non-terminal has the form aC, or every right-hand side which includes a non-terminal has the form Ba. (This comes straight out of the Wikipedia page which you link.) Your grammar is not a regular grammar, because the rule S → aSbS has two non-terminals on the right-hand side, which is not either of the regular forms.

ParseKit: What built-in Productions should I use in my Grammars?

I just started using ParseKit to explore language creation and perhaps build a small toy DSL. However, the current SVN trunk from Google is throwing a -[PKToken intValue]: unrecognized selector sent to instance ... when parsing this grammar:
#start = identifier ;
identifier = (Letter | '_') | (letterOrDigit | '_') ;
letterOrDigit = Letter | Digit ;
Against this input:
foo
Clearly, I am missing something or have incorrectly configured my project. What can I do to fix this issue?
Developer of ParseKit here.
First, see the ParseKit Tokenization docs.
Basically, ParseKit can work in one of two modes: Let's call them Tokens Mode and Chars Mode. (There are no formal names for these two modes, but perhaps there should be.)
Tokens Mode is more popular by far. Virtually every example you will find of using ParseKit shows how to use Tokens Mode. I believe all of the documentation on http://parsekit.com uses Tokens Mode. ParseKit's grammar feature (which you are using in your example) only works in Tokens Mode.
Chars Mode is a very little-known feature of ParseKit. I've never had anyone ask about it before.
So the differences in the modes are:
In Tokens Mode, the ParseKit Tokenizer emits multi-char tokens (like Words, Symbols, Numbers, QuotedStrings etc) which are then parsed by the ParseKit parsers you create (programmatically or via grammars).
In Chars Mode, the ParseKit Tokenizer always emits single-char tokens which are then parsed by the ParseKit parsers you create programmatically. (Grammars don't currently work with this mode, as it is not popular.)
You could use Chars Mode to implement Regular Expressions which parse on a char-by-char basis.
For your example, you should be ignoring Chars Mode and just use Tokens Mode. The following Built-in Productions are for Chars Mode only. Do not use them in your grammars:
(PK)Letter
(PK)Digit
(PK)Char
(PK)SpecificChar
Notice how all of those Productions sound like they match individual chars. That's because they do.
Your example above should probably look like:
#start = identifier;
identifier = Word; // by default Words start with a-zA-Z_ and contain -0-9a-zA-Z_'
Keep in mind the Productions in your grammars (parsers like identifier) will be working on Tokens already emitted from ParseKit's Tokenizer, not individual chars.
IOW: by the time your grammar goes to work parsing input, the input has already been tokenized into Tokens of type Word, Number, Symbol, QuotedString, etc.
Here are all of the Built-in Productions available for use in your Grammar:
Word
Number
Symbol
QuotedString
Comment
Any
S // Whitespace; only available when #preservesWhitespaceTokens=YES (NO by default)
Also:
DelimitedString('start', 'end', 'allowedCharset')
/xxx/i // RegEx match
There are also operators for composite parsers:
(juxtaposition) // Sequence
| // Alternation
? // Optional
+ // Multiple
* // Repetition
~ // Negation
& // Intersection
- // Difference
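Putting a few of these together, a Tokens Mode grammar might look like this (a sketch of my own, not from the ParseKit docs):

#start = assignment+;
assignment = Word '=' value ';';
value = Word | Number | QuotedString;

Each production on the right-hand side matches whole tokens already produced by the tokenizer, so there is no need to spell out identifiers character by character.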

Stripping actions from ANTLR grammar changes its parsing algorithm

I have a grammar Foo.xtext (too complex to include here). Xtext generates InternalFoo.g from it. After some tweaking, it also generates DebugInternalFoo.g, which claims to be the same thing without actions. Now, I strip off actions with ANTLR directly:
java -cp antlr-3.4.jar org.antlr.tool.Strip Internal.g > Stripped.g
I'd expect the three grammars to behave the same way when I check them, but here is what I experienced:
InternalFoo.g - error, rule assignment has non-LL(*) decision
DebugInternalFoo.g - no problem, parses fine
Stripped.g - warnings at rule assignment, decision can match using multiple alternatives. It fails to parse properly.
Is it possible that a grammar parses a text differently with or without actions? Or is it a bug in any of the action-remover tools? (The rule in question has syntactic predicates, and without them, it would really have a non-LL(*) decision.)
UPDATE:
I partly found what caused the problem. The rule in question was like this
trickyRule:
    ({ some complex action})
    (expression '=')=>...
Stripping with ANTLR removed the action, but left an empty group there:
// Stripped.g
trickyRule:
    () (expression '=')=>...
The generation of the debug grammar removes both the action, and the now empty group around it:
// DebugInternalFoo.g
trickyRule:
    (expression '=')=>...
So the lesson learned is: an empty group before a syntactic predicate is not the same as nothing at all.
Is it possible that a grammar parses a text differently with or without actions?
Yes, that is possible. org.antlr.tool.Strip leaves syntactic predicates¹ but removes validating² and gated³ semantic predicates (and member sections that these semantic predicates might use).
For example, the following rules would only match an A_TOKEN:
parser_rule1
    : (parser_rule2)=> parser_rule2
    ;

parser_rule2
    : {input.LT(1).getType() == A_TOKEN}? .
    ;
but if you use the Strip tool on it, it leaves the following:
parser_rule1
    : (parser_rule2)=> parser_rule2
    ;

parser_rule2
    : /*{input.LT(1).getType() == A_TOKEN}?*/ .
    ;
making it match any token.
In other words, Strip could change the behavior of the generated lexer or parser.
¹ syntactic predicate: ( ... )=>
² validating semantic predicate: { ... }?
³ gated semantic predicate: { ... }?=>

How does a lexer return a semantic value that the parser uses?

Is it always necessary to do so? What does it look like?
Lexers don't deal with semantics; they only deal with turning a stream of characters into tokens (sequences of characters that have meaning to the compiler). Semantics are determined during the later stages of compilation, such as syntactic and semantic analysis. See this answer to a previous question for further details on the stages of compilation.
In yacc, your lexer communicates through a global variable named yylval, which is a C union. Back in the yacc grammar, this becomes the value for $1, $2, etc.
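A minimal sketch of that convention (illustrative, not from the question; build with bison/flex as parser.y and scanner.l):

/* parser.y */
%{
#include <stdio.h>
int yylex(void);
void yyerror(const char *s) { fprintf(stderr, "error: %s\n", s); }
%}
%union { int ival; }            /* the type(s) behind yylval */
%token <ival> NUM
%type  <ival> expr
%%
input : expr '\n'     { printf("= %d\n", $1); }
      ;
expr  : NUM           { $$ = $1; }        /* $1 is the value the lexer stored in yylval */
      | expr '+' NUM  { $$ = $1 + $3; }
      ;
%%
int main(void) { return yyparse(); }

/* scanner.l */
%{
#include <stdlib.h>
#include "parser.tab.h"   /* generated by bison; defines NUM and declares yylval */
%}
%%
[0-9]+  { yylval.ival = atoi(yytext); return NUM; }  /* hand the semantic value to the parser */
[+\n]   { return yytext[0]; }
[ \t]   ;                                            /* skip blanks */
%%
int yywrap(void) { return 1; }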
Lexers don't care about semantics; their only mission in life is to convert the source code (a stream of characters) into tokens, each of the form <token_type, information_related_to_token>. The information may be the value of the token (a string), the name of an operator (=), and so on.
The tokens are then sent to a parser that deals with syntactic analysis. As a side job, a lexer can also build a symbol table.