How to consume an unprocessed string? - yacc

I am using Lex and Yacc to build a parser and have run into an issue with comments.
I use the following Lex rule.
'#'[^('\r'|'\n')]* { /* do nothing */ }
It works, but at the end of execution all the comments have been printed to standard output. Is there a way to suppress that? Thank you for any suggestions.

The characters ', |, (, and ) have no special meaning in [], so you're only matching (and discarding) comments that don't contain them. In addition, in most versions of lex ' has no special meaning at all -- only " can be used to quote literal strings. What you probably want is:
"#"[^\r\n]* { /* do nothing */ }
In addition, # has no special meaning either, so there's no real need to quote it.
In general, if you're using lex (or flex) to feed a parser, you NEVER want the default echoing behavior, so it's best to add a 'catch-all' rule at the very end:
.|\n { fprintf(stderr, "Unexpected character '%c' in input\n", *yytext); }
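Putting those pieces together, a minimal self-contained scanner might look like this (NUMBER is an illustrative token code; in a real yacc setup it would come from the generated header):

%option noyywrap
%{
#include <stdio.h>
enum { NUMBER = 258 };  /* illustrative; normally from y.tab.h */
%}
%%
"#"[^\r\n]*    { /* discard comments */ }
[[:space:]]+   { /* discard whitespace, including newlines */ }
[[:digit:]]+   { return NUMBER; }
.              { fprintf(stderr, "Unexpected character '%c' in input\n", *yytext); }
%%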

Related

Conditionally skipping an ANTLR lexer rule based on current line number

I have this pair of rules in my ANTLR lexer grammar, which match the same pattern, but have mutually exclusive predicates:
MAGIC: '#' ~[\r\n]* {getLine() == 1}? ;
HASH_COMMENT: '#' ~[\r\n]* {getLine() != 1}? -> skip;
When I look at the tokens in the ANTLR Preview, it seems like the predicate isn't being used: regardless of the line I'm on, the token comes out as MAGIC.
I also tried a different approach to try and work around this:
tokens { MAGIC }
HASH_COMMENT: '#' ~[\r\n]* {if (getLine() == 1) setType(MAGIC); else skip();};
But now, both tokens come out as HASH_COMMENT.
I really expected the first attempt using two predicates to work, so that was surprising, but now it seems like the action doesn't work either, which is even more odd.
How do I make this work?
I'd rather not try to match "#usda ..." as a different token because that comment could occur further down the file, and it should be treated as a normal comment unless it's on the first line.
I would not try to force semantics in the parse step. The letter combination is a HASH_COMMENT, period.
Instead I would handle that as normal syntax and handle anything special you might need in the step after parsing. For example:
document: HASH_COMMENT? content EOF;
This way you define an optional HASH_COMMENT (which you might interpret as MAGIC later, without needing such a token type) before any content. It might not be on line one, but it comes before anything else, which models a real document better, since whitespace can appear before the hash comment.
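A rough sketch of that approach, assuming the Java target (content and the variable names here are illustrative):

// In the grammar: only one token type, no predicates:
HASH_COMMENT : '#' ~[\r\n]* ;

document : HASH_COMMENT? content EOF ;

// After parsing, decide whether the leading comment is "magic":
DocumentContext doc = parser.document();
TerminalNode comment = doc.HASH_COMMENT();
boolean isMagic = comment != null && comment.getText().startsWith("#usda");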

Split a BibTeX author field into parts

I am trying to parse a BibTeX author field using the following grammar:
use v6;
use Grammar::Tracer;
# Extract BibTeX author parts from string. The parts are separated
# by a comma and optional space around the comma
grammar Author {
    token TOP {
        <all-text>
    }
    token all-text {
        [<author-part> [[\s* ',' \s*] || [\s* $]]]+
    }
    token author-part {
        [<-[\s,]> || [\s* <!before ','>]]+
    }
}
my $str = "Rockhold, Mark L";
my $result = Author.parse( $str );
say $result;
Output:
TOP
| all-text
| | author-part
| | * MATCH "Rockhold"
| | author-part
But here the program hangs (I have to press CTRL-C to abort).
I suspect the problem is related to the negative lookahead assertion. I tried to remove it, and then the program does not hang anymore, but then I am also not able to extract the last part "Mark L" with an internal space.
Note that for debugging purposes, the Author grammar above is a simplified version of the one used in my actual program.
The expression [\s* <!before ','>] may not make any progress. Since it's in a quantifier, it will be retried again and again (but not move forward), resulting in the hang observed.
Such a construct will reliably hang at the end of the string; doing [\s* <!before ',' || $>] fixes it by making the lookahead fail at the end of the string too (without the $ alternative, being at the end of the string is a valid way to not be before a ',', so the assertion succeeds without consuming anything).
At least for this simple example, it looks like the whole author-part token could just be <-[,]>+, but perhaps that's an oversimplification for the real problem that this was reduced from.
Glancing at all-text, I'd also point out the % quantifier modifier which makes matching comma-separated (or anything-separated, really) things easier.
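For instance, a simplified sketch using % (not tested against the full BibTeX rules, but it avoids the zero-width trap entirely):

grammar Author {
    token TOP         { <all-text> }
    token all-text    { <author-part>+ % [\s* ',' \s*] }
    token author-part { <-[,]>+ }
}

say Author.parse("Rockhold, Mark L");  # matches both parts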

Does .parse anchor or :sigspace first in a Perl 6 rule?

I have two questions. Is the behavior I show correct, and if so, is it documented somewhere?
I was playing with the grammar TOP method. Declared as a rule, it implies beginning- and end-of-string anchors along with :sigspace:
grammar Number {
    rule TOP { \d+ }
}

my @strings = '137', '137 ', ' 137 ';

for @strings -> $string {
    my $result = Number.parse( $string );
    given $result {
        when Match { put "<$string> worked!" }
        when Any   { put "<$string> failed!" }
    }
}
With no whitespace or trailing whitespace only, the string parses. With leading whitespace, it fails:
<137> worked!
<137 > worked!
< 137 > failed!
I figure this means that rule is applying :sigspace first and the anchors afterward:
grammar Foo {
    regex TOP { ^ :sigspace \d+ $ }
}
I expected a rule to allow leading whitespace, which would happen if you switched the order:
grammar Foo {
    regex TOP { :sigspace ^ \d+ $ }
}
I could add an explicit anchor for the beginning of the string to the rule:
grammar Number {
    rule TOP { ^ \d+ }
}
Now everything works:
<137> worked!
<137 > worked!
< 137 > worked!
I don't have any reason to think it should be one way or the other. The Grammars docs say two things happen, but they do not say in which order these effects apply:
Note that if you're parsing with .parse method, token TOP is automatically anchored
and
When rule instead of token is used, any whitespace after an atom is turned into a non-capturing call to ws.
I think the answer is that the rule isn't actually anchored in the pattern sense. It's the way .parse works. The cursor has to start at position 0 and end at the last position in the string. That's something outside of the pattern.
The behavior is intended, and is a culmination of these language features:
Sigspace ignores whitespace before the first atom.
From the design docs[1] (S05: Regexes and Rules, line 348, emphasis added):
The new :s (:sigspace) modifier causes certain whitespace sequences to be considered "significant"; they are replaced by a whitespace matching rule, <.ws>. Only whitespace sequences immediately following a matching construct (atom, quantified atom, or assertion) are eligible. Initial whitespace is ignored at the front of any regex, to make it easy to write rules that can participate in longest-token-matching alternations. Trailing space inside the regex delimiters is significant.
This means:
rule TOP { \d+ }
              ^-------- <.ws> automatically inserted

rule TOP { ^ \d+ $ }
            ^---^-^---- <.ws> automatically inserted
Regexes are first-class compiled code with lexical scoping.
A regex/rule is not a string that may have characters concatenated to it later to change its behavior. It is a self-contained routine, which is parsed and has its behavior nailed down at compile time.
Regex modifiers like :sigspace, including the one implicitly added by the rule keyword, apply only to their lexical scope - i.e. to the fragment of source code they appear in at compile time. S05, line 629[1]:
The :i, :m, :r, :s, :dba, :Perl5, and Unicode-level modifiers can be placed inside the regex (and are lexically scoped)
The anchoring of rule TOP is done at run time by .parse.
S05, line 4423[1]:
The .parse and .parsefile methods anchor to the beginning and ending of the text, and fail if the end of text is not reached. (The TOP rule can check against $ itself if it wishes to produce its own error message.)
I.e. the anchoring to the beginning of the string is not intrinsic to the rule TOP, and doesn't affect how the lexical scope of TOP is parsed and compiled. It is done when method .parse is called.
It has to be this way, because the same grammar can be used with different starting rules instead of TOP, using .parse(..., rule => ...).
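For example, with a hypothetical second rule added purely to illustrate:

grammar Number {
    rule TOP  { \d+ }
    rule pair { \d+ ',' \d+ }
}

say Number.parse('137');                   # anchored match via TOP
say Number.parse('1, 2', rule => 'pair');  # the same anchoring applies to pair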
So when you write
rule TOP { \d+ }
it is compiled as
regex TOP { :r \d+ <.ws> }
And when you .parse that grammar, it effectively invokes the regex code ^ <TOP> $, with the anchors not being part of TOP's lexical scope but rather of a scope that merely calls the routine TOP. The combined behavior is as if the rule TOP had been written as:
regex TOP { ^ [:r :s \d+] $ }
[1] The design docs are in general not to be taken as gospel for what is or isn't part of the Perl 6 language, but S05 is pretty accurate in that regard, except that it mentions some features that haven't been implemented yet but are planned. Anyone who wants to truly grok the intricacies of Perl 6 regexes/grammars is IMO well served by reading the full S05 from top to bottom at least once.
There aren't two regex effects going on. The rule applies :sigspace. After that, the grammar is defined. When you call .parse, it starts at the beginning of the string and goes to the end (or fails). That anchoring isn't part of the grammar. It's part of how .parse applies the grammar.
My main issue was the odd way some of the things are worded in the docs. They aren't technically wrong, but they also tend to assume knowledge about things the reader might not know. In this case, the casual comment about anchoring TOP isn't as special as it seems. Any rule passed to .parse is anchored in the same way. There's no special behavior for that rule name other than it's the default value for :rule in a call to .parse.

antlr4: need to convert sequences of symbols to characters in lexer

I am writing a parser for Wolfram Language. The language has a concept of "named characters", which are specified by a name delimited by \[, and ]. For example: \[Pi].
Suppose I want to specify a regular expression for an identifier. Identifiers can include named characters. I see two ways to do it: one is to have a preprocessor that converts all named characters to their unicode representation, and the other is to enumerate all possible named characters in their source form as part of the regular expression.
The second approach does not seem feasible because there are a lot of named characters. I would prefer to have ranges of unicode characters in my regex.
So I want to preprocess my token stream. In other words, it seems to me that the lexer needs to check if the named characters syntax is correct and then look up the name and convert it to unicode.
But if the syntax is incorrect or the name does not exist I need to tell the user about it. How do I propagate this error to the user and yet let antlr4 recover from the error and resume? Maybe I can sort of "pipe" lexers/parsers? (I am new to antlr).
EDIT:
In Wolfram Language I can have this string as an identifier: \[Pi]Squared. The part between brackets is called "named character". There is a limited set of named characters, each of which corresponds to a unicode code point. I am trying to figure out how to tokenize identifiers like this.
I could have a rule for my token like this (simplified to just a combination of named characters and ASCII characters):
NAME : ('\\[' [a-z]+ ']'|[a-zA-Z])+ ;
but I would like to check if the named character actually exists (and other attributes such as if it is a letter, but the latter part is outside of the scope of the question), so this regex won't work.
I considered making a list of allowed named characters and just making a long regex that enumerates all of them, but this seems ugly.
What would be a good approach to this?
END OF EDIT
A common approach is to write the lexer/parser to allow syntactically correct input and defer semantic issues to the analysis of the generated parse tree. In this case, the lexer can naively accept named characters:
NChar : NCBeg .*? RBrack ;
fragment NCBeg : '\\[' ;
fragment LBrack: '[' ;
fragment RBrack: ']' ;
Update
In the parser, allow the NChar's to exist in the parse-tree as discrete terminal nodes:
idents : ident+ ;

ident : NChar    // named character string
      | ID       // simple character string?
      | Literal  // something quoted?
      | ....
      ;
This makes analysis of the parse tree considerably easier: each ident context will contain only one non-null value for a discretely identifiable alt; and isolates analysis of all ordering issues to the idents context.
Update2
For an input \[Pi]Squared, the parse tree form that would be easiest to analyze would be an idents node with two well-ordered children, \[Pi] and Squared.
Best practice would be not to pack both children into the same token; you would just have to manually break the token text into the two parts later to check whether it contains a valid named character and whether the particular sequence of parts is allowable.
No regex is going to allow conclusive verification of the named characters. That will require a list. Tightening the lexer definition of an NChar can, however, achieve a result equivalent to a regex:
NChar : NCBeg [A-Z][A-Za-z]+ RBrack ;
If the concern is that there might be a space after the named character, consider that this circumstance is likely better treated with a semantic warning as opposed to a syntactic error. Rather than skipping whitespace in the lexer, put the whitespace on the hidden channel. Then, in the verification analysis of each idents context, check the hidden channel for intervening whitespace and issue a warning as appropriate.
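For example, instead of skipping, something like:

WS : [ \t\r\n]+ -> channel(HIDDEN) ;

In the Java target, BufferedTokenStream.getHiddenTokensToRight(tokenIndex) then exposes any whitespace that followed a given token.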
----
A parse-tree visitor can then examine, validate, and warn as appropriate regarding unknown or misspelled named characters.
To do the validation in the parser, if more desirable, use a predicated rule to distinguish known from unknown named characters:
@members {
    ArrayList<String> keyList = .... // list of named chars

    public boolean inList(String id) {
        return keyList.contains(id) ;
    }
}
nChar : known
      | unknown
      ;

known   : NChar { inList($NChar.getText()) }? ;
unknown : NChar { error("Unknown " + $NChar.getText()); } ;
The inList function could implement a distance metric to detect misspellings, but correcting the text directly in the parse-tree is a bit complex. Easier to do when implemented as a parse-tree decoration during a visitor operation.
Finally, a scrape and munge of the named characters into a usable map (both unicode and ascii) is likely worthwhile to handle both representations as well as conversions and misspelling.
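A minimal sketch of such a map in the Java target (the names and entries are illustrative, not the real Wolfram table):

import java.util.HashMap;
import java.util.Map;

Map<String,String> namedChars = new HashMap<>();
namedChars.put("Pi",    "\u03C0");   // \[Pi]
namedChars.put("Alpha", "\u03B1");   // \[Alpha]

String name  = "Pi";                  // the text between \[ and ]
String value = namedChars.get(name);  // null => unknown named character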

How to define (f)lex/bison pattern ( /* comment*/ ) and/or ( 100 /* comment*/ )

How can I define lex patterns matching ( ), ( /* rem */ ), and ( /* foo */ 100 /* foo */ ) using the GNU flex tool?
_space [ \t]
id [a-zA-Z_]+[a-zA-Z0-9_]
digit [0-9]
math_ops [\+\-\/\*\^\%]
rem_expr (({_space}*)*|("/*".*"*/")*|("//".*)*|([\n]*))*
arr_digid ("("*({digit}*|{id}*)*")"*){arr_expr1}*{math_ops}+
arr_expr1 {rem_expr}*{digit}*{rem_expr}*
arr_expr2 {rem_expr}*
%%
\({arr_expr2}*\) {
    return _REM_;
}

\({arr_expr1}*\) {
    return _PATTERN2_;
}
Generally, you do not return comments or whitespace from a lexer. Why would you? They are, by definition, not part of the semantics of the program you are trying to parse.
On the whole, the easiest way to deal with them is to just ignore them. Below, the first pattern matches horizontal whitespace, i.e. spaces and tabs (use [[:space:]] instead to also ignore newlines), and the second one is a way of matching C-style comments. ("/*".*"*/" doesn't work because it will match from the beginning of the first comment on a line to the end of the last one.)
[[:blank:]] ;
[/][*][^*]*[*]+([^/*][^*]*[*]+)*[/] ;
The fact that the patterns do not have an action (or, in general, do not have a return statement in their action) means that the (f)lex-generated scanner will simply proceed to analyze the next token.
Some other notes:
It's really not necessary to define a shortcut for every pattern. There is no problem with putting a pattern directly in the lex actions. And you certainly don't need to define shortcuts for character classes which already have shortcuts (like [[:blank:]] and [[:digit:]]).
You don't need to backslash escape characters inside a character class, although with a couple of characters order is important. (That's why I used [*] in the C-comment pattern; I could equally have used "*" or \*, but I personally prefer [*].) So you could have defined:
math_ops [+/*^%-]
The - must go either at the end or the beginning of the list; ^ cannot go at the beginning, and (though you don't use it) ] would have to go at the beginning. The only character which requires backslash-escaping is a backslash itself.
However, my preference is always to let single-character tokens be handled with a single default rule at the end:
. { return yytext[0]; }
This is much more maintainable, and avoids the need to invent arbitrary token names for single-character tokens. You can just use a single-quoted character in your bison/yacc file.
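For example, the bison side can then reference those characters directly (rule and token names illustrative):

expr : expr '+' expr
     | expr '*' expr
     | '(' expr ')'
     | NUMBER
     ;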