Split a BibTeX author field into parts - raku

I am trying to parse a BibTeX author field using the following grammar:
use v6;
use Grammar::Tracer;
# Extract BibTeX author parts from string. The parts are separated
# by a comma and optional space around the comma
grammar Author {
    token TOP {
        <all-text>
    }
    token all-text {
        [<author-part> [[\s* ',' \s*] || [\s* $]]]+
    }
    token author-part {
        [<-[\s,]> || [\s* <!before ','>]]+
    }
}
my $str = "Rockhold, Mark L";
my $result = Author.parse( $str );
say $result;
Output:
TOP
| all-text
| | author-part
| | * MATCH "Rockhold"
| | author-part
But here the program hangs (I have to press CTRL-C to abort).
I suspect the problem is related to the negative lookahead assertion. I tried to remove it, and then the program does not hang anymore, but then I am also not able to extract the last part "Mark L" with an internal space.
Note that for debugging purposes, the Author grammar above is a simplified version of the one used in my actual program.

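The expression [\s* <!before ','>] may not make any progress: \s* can match zero characters and the lookahead consumes nothing, so the whole group can succeed without moving forward. Since it's in a quantifier, it will be retried again and again at the same position, resulting in the hang observed.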
Such a construct will reliably hang at the end of the string; doing [\s* <!before ',' || $>] fixes it by making the lookahead fail at the end of the string also (being at the end of the string is a valid way to not be before a ,).
At least for this simple example, it looks like the whole author-part token could just be <-[,]>+, but perhaps that's an oversimplification for the real problem that this was reduced from.
Glancing at all-text, I'd also point out the % quantifier modifier which makes matching comma-separated (or anything-separated, really) things easier.
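As a rough sketch of those last two suggestions combined (assuming that every part is simply "anything up to the next comma"), the grammar could look like the following. The separator token and the author-parts helper are names made up for this illustration, not part of the original grammar.

```raku
use v6;

# Sketch only: assumes a part is any run of non-comma characters.
grammar Author {
    # "+ % separator" means: one or more parts, separated by the separator.
    token TOP         { <author-part>+ % <.separator> }
    token separator   { \s* ',' \s* }     # hypothetical helper token
    token author-part { <-[,]>+ }
}

# Hypothetical helper: returns the trimmed parts, or an empty list on failure.
sub author-parts(Str $field) {
    my $match = Author.parse($field);
    return $match ?? $match<author-part>».Str».trim !! ();
}

# Prints "Rockhold" and "Mark L" on separate lines.
.say for author-parts("Rockhold, Mark L");
```

Because <-[,]>+ also accepts whitespace, a part may carry leading or trailing spaces, which is why the helper trims each one.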

Related

How do I match using :global in Raku grammar?

I'm trying to write a Raku grammar that can parse commands that ask for programming puzzles.
This is a simplified version just for my question, but the commands combine a difficulty level with an optional list of languages.
Sample valid input:
No language: easy
One language: hard javascript
Multiple languages: medium javascript python raku
I can get it to match one language, but not multiple languages. I'm not sure where to add the :g.
Here's an example of what I have so far:
grammar Command {
    rule TOP { <difficulty> <languages>? }
    token difficulty { 'easy' | 'medium' | 'hard' }
    rule languages { <language>+ }
    token language { \w+ }
}
multi sub MAIN(Bool :$test) {
    use Test;
    plan 5;
    # These first 3 pass.
    ok Command.parse('hard', :token<difficulty>), '<difficulty> can parse a difficulty';
    nok Command.parse('no', :token<difficulty>), '<difficulty> should not parse random words';
    # Why does this parse <languages>, but <language> fails below?
    ok Command.parse('js', :rule<languages>), '<languages> can parse a language';
    # These last 2 fail.
    ok Command.parse('js', :token<language>), '<language> can parse a language';
    # Why does this not match both words? Can I use :g somewhere?
    ok Command.parse('js python', :rule<languages>), '<languages> can parse multiple languages';
}
This works, even though my test #4 fails:
my token wrd { \w+ }
'js' ~~ &wrd; #=> 「js」
Extracting multiple languages works with a regex using this syntax, but I'm not sure how to use that in a grammar:
'js python' ~~ m:g/ \w+ /; #=> (「js」 「python」)
Also, is there an ideal way to make the order unimportant so that difficulty could come anywhere in the string? Example:
rule TOP { <languages>* <difficulty> <languages>? }
Ideally, I'd like for anything that is not a difficulty to be read as a language. Example: raku python medium js should read medium as a difficulty and the rest as languages.
There are two things at issue here.
To specify a subrule in a grammar parse, the named argument is always :rule, regardless of whether in the grammar it's a rule, token, method, or regex. Your first two tests are passing because they represent valid full-grammar parses (that is, TOP): the :token named argument is unknown to .parse and therefore ignored.
That gets us:
ok Command.parse('hard', :rule<difficulty>), '<difficulty> can parse a difficulty';
nok Command.parse('no', :rule<difficulty>), '<difficulty> should not parse random words';
ok Command.parse('js', :rule<languages> ), '<languages> can parse a language';
ok Command.parse('js', :rule<language> ), '<language> can parse a language';
ok Command.parse('js python', :rule<languages> ), '<languages> can parse multiple languages';
# Output
ok 1 - <difficulty> can parse a difficulty
ok 2 - <difficulty> should not parse random words
ok 3 - <languages> can parse a language
ok 4 - <language> can parse a language
not ok 5 - <languages> can parse multiple languages
The second issue is how implied whitespace is handled in a rule. In a token, the following are equivalent:
token foo { <alpha>+ }
token bar { <alpha> + }
But in a rule, they would be different. Compare the token equivalents for the following rules:
rule foo { <alpha>+ }
token foo { <alpha>+ <.ws> }
rule bar { <alpha> + }
token bar { [<alpha> <.ws>] + }
In your case, you have <language>+, and since language is \w+, it's impossible to match two (because the first one will consume all the \w). Easy solution though, just change <language>+ to <language> +.
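To see that one-character change in context, here is a minimal sketch of the grammar from the question with only the whitespace before the + added. The sample input is the "multiple languages" line from the question.

```raku
use v6;

grammar Command {
    rule  TOP        { <difficulty> <languages>? }
    token difficulty { 'easy' | 'medium' | 'hard' }
    # The space before + puts an implicit <.ws> inside the repetition,
    # i.e. this behaves like [ <language> <.ws> ]+
    rule  languages  { <language> + }
    token language   { \w+ }
}

my $m = Command.parse('medium javascript python raku');
say ~$m<difficulty>;                      # medium
.say for $m<languages><language>».Str;    # one language per line
```

Since <languages> is quantified with ?, it is captured as a single match rather than a list, so the individual languages are reached through $m<languages><language>.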
To allow the <difficulty> token to float around, the first solution that jumps to my mind is to match it and bail in a <language> token:
token language { <!difficulty> \w+ }
<!foo> will fail if at that position, it can match <foo>. This will work almost perfectly until you get a language like 'easyFoo'. The easy fix there is to ensure that the difficulty token always occurs at a word boundary:
token difficulty {
    [
    | easy
    | medium
    | hard
    ]
    >>
}
where >> asserts a word boundary on the right.
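Putting the pieces of this answer together, a floating difficulty could be sketched as below. This is only one way to arrange it: the TOP rule uses an optional <languages> on each side of <difficulty> (a small variation on the TOP proposed in the question), and the @langs variable is introduced just for this illustration.

```raku
use v6;

grammar Command {
    rule  TOP        { <languages>? <difficulty> <languages>? }
    token difficulty { [ easy | medium | hard ] >> }
    rule  languages  { <language> + }
    # A language is any word that is not a difficulty.
    token language   { <!difficulty> \w+ }
}

my $m = Command.parse('raku python medium js');

# <languages> is used twice in TOP, so collect the languages
# from every <languages> match that took part in the parse.
my @langs = $m<languages>.map({ $_<language>».Str.Slip });

say ~$m<difficulty>;    # medium
.say for @langs;        # raku, python, js (one per line)
```

With the >> boundary in place, a word such as easyFoo is not mistaken for a difficulty, so it is read as a language.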

Does .parse anchor or :sigspace first in a Perl 6 rule?

I have two questions. Is the behavior I show correct, and if so, is it documented somewhere?
I was playing with the grammar TOP method. Declared as a rule, it implies beginning- and end-of-string anchors along with :sigspace:
grammar Number {
    rule TOP { \d+ }
}
my @strings = '137', '137 ', ' 137 ';
for @strings -> $string {
    my $result = Number.parse( $string );
    given $result {
        when Match { put "<$string> worked!" }
        when Any { put "<$string> failed!" }
    }
}
With no whitespace or trailing whitespace only, the string parses. With leading whitespace, it fails:
<137> worked!
<137 > worked!
< 137 > failed!
I figure this means that rule is applying :sigspace first and the anchors afterward:
grammar Foo {
regex TOP { ^ :sigspace \d+ $ }
}
I expected a rule to allow leading whitespace, which would happen if you switched the order:
grammar Foo {
regex TOP { :sigspace ^ \d+ $ }
}
I could add an explicit token in rule for the beginning of the string:
grammar Number {
rule TOP { ^ \d+ }
}
Now everything works:
<137> worked!
<137 > worked!
< 137 > worked!
I don't have any reason to think it should be one way or the other. The Grammars docs say two things happen, but they do not say in which order these effects apply:
Note that if you're parsing with .parse method, token TOP is automatically anchored
and
When rule instead of token is used, any whitespace after an atom is turned into a non-capturing call to ws.
I think the answer is that the rule isn't actually anchored in the pattern sense. It's the way .parse works. The cursor has to start at position 0 and end at the last position in the string. That's something outside of the pattern.
The behavior is intended, and is a culmination of these language features:
Sigspace ignores whitespace before the first atom.
From the design docs [1] (S05: Regexes and Rules, line 348, emphasis added):
The new :s (:sigspace) modifier causes certain whitespace sequences to be considered "significant"; they are replaced by a whitespace matching rule, <.ws>. Only whitespace sequences immediately following a matching construct (atom, quantified atom, or assertion) are eligible. Initial whitespace is ignored at the front of any regex, to make it easy to write rules that can participate in longest-token-matching alternations. Trailing space inside the regex delimiters is significant.
This means:
rule TOP { \d+ }
              ^-------- <.ws> automatically inserted
rule TOP { ^ \d+ $ }
            ^---^-^---- <.ws> automatically inserted
Regexes are first-class compiled code with lexical scoping.
A regex/rule is not a string that may have characters concatenated to it later to change its behavior. It is a self-contained routine, which is parsed and has its behavior nailed down at compile time.
Regex modifiers like :sigspace, including the one implicitly added by the rule keyword, apply only to their lexical scope - i.e. to the fragment of source code they appear in at compile time. S05, line 629 [1]:
The :i, :m, :r, :s, :dba, :Perl5, and Unicode-level modifiers can be placed inside the regex (and are lexically scoped)
The anchoring of rule TOP is done at run time by .parse.
S05, line 4423 [1]:
The .parse and .parsefile methods anchor to the beginning and ending of the text, and fail if the end of text is not reached. (The TOP rule can check against $ itself if it wishes to produce its own error message.)
I.e. the anchoring to the beginning of the string is not intrinsic to the rule TOP, and doesn't affect how the lexical scope of TOP is parsed and compiled. It is done when method .parse is called.
It has to be this way, because the same grammar can be used with different starting rules instead of TOP, using .parse(..., rule => ...).
So when you write
rule TOP { \d+ }
it is compiled as
regex TOP { :r \d+ <.ws> }
And when you .parse that grammar, it effectively invokes the regex code ^ <TOP> $, with the anchors not being part of TOP's lexical scope but rather of a scope that merely calls the routine TOP. The combined behavior is as if the rule TOP had been written as:
regex TOP { ^ [:r :s \d+] $ }
[1] The design docs are in general not to be taken as gospel for what is or isn't part of the Perl 6 language, but S05 is pretty accurate in that regard, except that it mentions some features that haven't been implemented yet but are planned. Anyone who wants to truly grok the intricacies of Perl 6 regexes/grammars is IMO well served by reading the full S05 from top to bottom at least once.
There aren't two regex effects going on. The rule applies :sigspace. After that, the grammar is defined. When you call .parse, it starts at the beginning of the string and goes to the end (or fails). That anchoring isn't part of the grammar. It's part of how .parse applies the grammar.
My main issue was the odd way some of the things are worded in the docs. They aren't technically wrong, but they also tend to assume knowledge about things the reader might not know. In this case, the casual comment about anchoring TOP isn't as special as it seems. Any rule passed to .parse is anchored in the same way. There's no special behavior for that rule name other than it's the default value for :rule in a call to .parse.
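A small sketch may make this concrete. It assumes a second rule, here called padded (a name invented for this example), that starts with an explicit ^ so that the whitespace after it becomes a <.ws> call. Both rules are anchored by .parse in the same way; only the whitespace handling inside each rule differs.

```raku
use v6;

grammar Number {
    rule TOP    { \d+ }      # nothing in front of \d+, so no leading <.ws>
    rule padded { ^ \d+ }    # hypothetical rule: the space after ^ becomes <.ws>
}

for '137', ' 137 ', '137 x' -> $string {
    # .parse must start at position 0 and reach the end of the string,
    # whichever rule it is told to start from.
    my $top    = so Number.parse($string);
    my $padded = so Number.parse($string, :rule<padded>);
    put "<$string> TOP: $top, padded: $padded";
}
```

The string with leading whitespace parses only with padded, while the string with trailing junk fails with both rules, because the end-of-text requirement comes from .parse and not from either rule.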

Why don't parser combinators backtrack in case of failure?

I looked through the Artima guide on parser combinators, which says that we need to append failure(msg) to our grammar rules to make error-reporting meaningful for the user
def value: Parser[Any] =
obj | stringLit | num | "true" | "false" | failure("illegal start of value")
This breaks my understanding of the recursive mechanism used in these parsers. On one hand, the Artima guide makes sense: if all productions fail, the parser arrives at failure("illegal start of value"), which is returned to the user. On the other hand, it does not make sense once we understand that a grammar is not a list of value alternatives but a tree. That is, the value parser is a node that is called when a value is expected in the input. This means that the calling parser, which is also a parent, detects the failure of value parsing and proceeds with the sibling alternatives of value. Suppose that all alternatives to value also fail. The grandparent parser will then try its alternatives. If those fail in turn, the process unwinds upward until the starting symbol parser fails. So, what will the error message be? It seems that the last alternative of the topmost parser is reported as erroneous.
To figure out who is right, I have created a demo where program is the topmost (starting symbol) parser:
import scala.util.parsing.combinator._
object ExprParserTest extends App with JavaTokenParsers {
  // Grammar
  val declaration = wholeNumber ~ "to" ~ wholeNumber | ident | failure("declaration not found")
  val term = wholeNumber | ident
  lazy val expr: Parser[_] = term ~ rep("+" ~ expr)
  lazy val statement: Parser[_] = ident ~ " = " ~ expr | "if" ~ expr ~ "then" ~ rep(statement) ~ "else" ~ rep(statement)
  val program = rep(declaration) ~ rep(statement)
  // Test
  println(parseAll(program, "1 to 2")) // OK
  println(parseAll(program, "1 to '2")) // failure, regex `-?\d+' expected but `'' found at '2
  println(parseAll(program, "abc")) // OK
}
It fails with 1 to '2 due to the extra ' tick. Yes, it seems to get stuck in the program -> declaration -> num "to" num rule and does not even try the ident and failure("declaration not found") alternatives! It does not backtrack to the statements either, for the same reason. So neither my guess nor the Artima guide seems right about what parser combinators are actually doing. I wonder: what is the real logic behind rule sensing, backtracking and error reporting in parser combinators? Why does the error message suggest that no backtracking to declaration -> ident | failure(), nor to statements, occurred? What is the point of the Artima guide suggesting to place failure() at the end if, as we see, it is either not reached or ignored, as the backtracking logic would have it anyway?
Isn't a parser combinator just a plain dumb PEG? It behaves like a predictive parser. I expected it to be a PEG and, thus, that the starting symbol parser would return all failed branches, and I wonder why/how the actual parser manages to select the most appropriate failure.
Many parser combinators backtrack, unless they're in an 'or' block. As a speed optimization, they'll commit to the 1st successful 'or' item and not backtrack. So 1) try to avoid '|' as much as possible in your grammar, and 2) if using '|' is unavoidable, place the longest or least-likely-to-match items first.

how to consume unprocessed string?

I am using Lex and Yacc to design a parser and have run into an issue with comments.
I use the following Lex rule.
'#'[^('\r'|'\n')]* { /* do nothing */ }
It works, but at the end of execution all the comments are printed to the standard output. Is there a way to clear that? Thank you for the suggestion.
The characters ', |, (, and ) have no special meaning in [], so you're only matching (and discarding) comments that don't contain them. In addition, in most versions of lex ' has no special meaning at all -- only " can be used to quote literal strings. What you probably want is:
"#"[^\r\n]* { /* do nothing */ }
In addition, # has no special meaning either, so there's no real need to quote it.
In general, if you're using lex (or flex) as the input to a parser, you NEVER want the default echoing behavior, so it's best to add a 'catch-all' rule at the very end:
.|\n { fprintf(stderr, "Unexpected character '%c' in input\n", *yytext); }

Why do I have a shift/reduce conflict on the ')' and not '('?

I have syntax like
%(var)
and
%var
and
(var)
My rules are something like
optExpr:
| '%''('CommaLoop')'
| '%' CommaLoop
CommaLoop:
val | CommaLoop',' val
Expr:
MoreRules
| '(' val ')'
The problem is that it doesn't seem to be able to tell whether ) belongs to %(CommaLoop) or % (val), but it complains on the ) instead of the (. Shouldn't it complain on the (? And how should I fix the error? I think making %( a token is a good solution, but I want to be sure why $( isn't an error before doing this.
This is due to the way LR parsing works. LR parsing is effectively bottom-up, grouping together tokens according to the RHS of your grammar rules, and replacing them with the LHS. When the parser 'shifts', it puts a token on the stack, but doesn't actually match a rule yet. Instead, it tracks partially matched rules via the current state. When it gets to a state that corresponds to the end of the rule, it can reduce, popping the symbols for the RHS off the stack and pushing back a single symbol denoting the LHS. So if there are conflicts, they don't show up until the parser gets to the end of some rule and can't decide whether to reduce (or what to reduce).
In your example, after seeing % ( val, that is what will be on the stack (top is at the right side here). When the lookahead is ), it can't decide whether it should pop the val and reduce via the rule CommaLoop: val, or if it should shift the ) so it can then pop 3 things and reduce with the rule Expr: '(' val ')'
I'm assuming here that you have some additional rules such as CommaLoop: Expr, otherwise your grammar doesn't actually match anything and bison/yacc will complain about unused non-terminals.
Right now, your explanation and your grammar don't seem to match. In your explanation, you show all three phrases as having 'var', but your grammar shows the ones starting with '%' as allowing a comma-separated list, while the one without allows only a single 'val'.
For the moment, I'll assume all three should allow a comma-separated list. In this case, I'd factor the grammar more like this:
optExpr: '%' aList
aList: CommaLoop
| parenList
parenList: '(' CommaLoop ')'
CommaLoop:
| val
| CommaLoop ',' val
Expr: MoreRules
| parenList
I've changed optExpr and Expr so neither can match an empty sequence -- my guess is you probably didn't intend that to start with. I've fleshed this out enough to run it through byacc; it produces no warnings or errors.