How to get concise syntax error messages from grako/TatSu - error-handling

If the input to a grako/TatSu-generated parser has a syntax error, such as 3 + / 3 in the calc.py example, one gets a long Python stack trace in addition to the relevant part:
3 + / 3
^
I could use try/except constructions, but then I lose the relevant part of the error message as well.
I would like to use grako/TatSu to parse grammar rules for a rule compiler, and I appreciate being able to separate the syntax and semantics in a clean way. The users would be quite annoyed by the excessive error messages. Is there a way to get clean error messages?

This should be the same as in any Python program. If you let the exception escape main(), then a stack trace will be printed. Instead, you can write:
try:
    do_parse()
except Exception as e:
    print(str(e))
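Here is the same pattern as a self-contained sketch. `ParseError` and `do_parse` are hypothetical stand-ins for a generated parser's exception type and entry point; they are not grako/TatSu names, and the message layout is only an illustration. The point is that the exception's own text still carries the input line and the caret, so catching it drops only the traceback.

```python
import sys


class ParseError(Exception):
    """Hypothetical stand-in for the exception a generated parser raises."""


def do_parse(text):
    # Hypothetical parser entry point: rejects an operator right after '+'.
    pos = text.find("+ /")
    if pos >= 0:
        col = pos + 2  # column of the offending '/'
        raise ParseError(f"{text}\n{' ' * col}^\nexpecting a number")
    return text


def main(text):
    try:
        do_parse(text)
    except ParseError as e:
        # Print only the exception's own message. No traceback appears
        # because the exception never escapes main().
        print(str(e), file=sys.stderr)
        return 1
    return 0
```

Catching the parser's specific exception class rather than bare Exception keeps genuine programming errors visible with their full traceback.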

Related

Precedence inside a function call

Using the defined-or operator ( // ) in a function call produces the result I'd expect:
say( 'nan'.Int // 42); # OUTPUT: «42»
However, using the lower-precedence orelse operator instead throws an error:
say( 'nan'.Int orelse 42);
# OUTPUT: «Error: Unable to parse expression in argument list;
# couldn't find final ')'
# (corresponding starter was at line 1)»
What am I missing about how precedence works?
(Or is the error a bug and I'm just overthinking this?)
I'd say, it's a grammar bug, as
say ("nan".Int orelse 42); # 42
works.
TL;DR My super useful naanswer (not-an-answer / non-authoritative answer / food for thought) is it might be a bug or it might not. :)
Other examples:
say(42 and 42);
say(42 ==> 99);
yield the same error.
What am I missing about how precedence works?
Perhaps nothing. Perhaps it will be desirable and possible to fix the grammar so these function-call-arg-list-signifying parens determine precedence just like plain expression parens do.
If so, perhaps fixing it would best wait, or perhaps realistically must wait, until when or after RakuAST lands (6.e?). Or perhaps even later, if/when grammar cleanup/slangs lands (6.f?).
Or perhaps it's going to always stay as it is for reasons such as good usability (despite the initial "huh?") and/or expediency and/or single-pass parsing and/or whatever.
I've dug a little to see if I could find relevant commentary. Here are some (juicy?) bits:
the OPP is a bit more complex than a standard binary-operator OPP
(from a comment on #perl6)
If you scroll backwards from Larry's comment you'll see he said this in the context of Raku's extraordinary seamless parsing (no delimiters introduced) in a single pass of nested sub-languages that each can have arbitrary grammars.
(Btw, one thought I had: did std parse say(42 and 42) fine? I'm not sure if there's a running std anywhere these days.)
While we do have complete control of stock Raku, I'm not convinced there's anything compelling about bending over backwards to fix every wrinkle of this sort (foo(... op ...) in this case) when the general case (..... where the middle ... inside the outer pair of .s has arbitrary syntax) means we'll be hitting limits in how "perfect" it can all be when there's a huge amount of anarchic language / syntax mixing going on in userland/module space, as I anticipate will emerge in years to come.
So, imo, if it's reasonably easy to fix, without unduly cramping or burdening user slang freedom, great. If not, I think the current situation is fair enough (though perhaps it'll be desirable, viable and reasonable to improve the error message).
Perhaps consider the foregoing in combination with:
Raku borrows many concepts from human language ...
(from the doc)
in combination with:
☞ Self-clocking code produces better syntax error messages
(from Seeing Wrong Right)
in combination with:
Break that clock and your error messages will turn to mush
(from a mailing list comment)
But then again:
Please don't assume that rakudo's idiosyncracies and design fossils are canonical.
Do you mean this, maybe...?
> say ( NaN.Int orelse 42 )
42
since
> say( NaN.Int orelse 42 )
===SORRY!=== Error while compiling:
Unable to parse expression in argument list; couldn't find final ')' (corresponding starter was at line 1)
------> say( '42'.Int⏏ orelse 42 )
expecting any of:
infix
infix stopper
I would tend to agree with @lizmat that there is a grammar bug in the compiler.

error recovery in byacc/j and jflex using error token like in yacc

I am writing a compiler for a small language using byacc/j and jflex. I have no problem finding the first error in a given input file; the problem is that I can't find more errors. I previously used yacc and lex, where I put the special built-in 'error' token at the end of some grammar rules and used 'yyerrok' to continue parsing and find more errors. In byacc/j I can't find anything like that: yyerrok does not work, and byacc/j does not recognize it. Any suggestions for finding more than one error in byacc/j? Or is there an 'error' token and 'yyerrok' in byacc/j?
The only thing that yyerrok does is reset the count of tokens since the last error notification. Yacc parsers suppress error messages in the first three tokens after an error recovery, to prevent cascading error messages.
Using yyerrok -- or setting yyerrflag to 0 -- indicates that error recovery was successful and that error messages should now be produced. It does not have any other effect: with or without yyerrok, parsing will continue.
yyerrok is a C macro, and Java doesn't have macros. So apparently it was dropped from the Java interface. But yyerrflag exists as a parser class member and you should be able to just set it to zero in a parser action.
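The counter described above can be modelled in a few lines. This is a toy Python model of the suppression logic only, not byacc/j code; the class and method names are invented for illustration, and `errflag` plays the role of yyerrflag.

```python
class ErrorThrottle:
    """Toy model of yacc-style error-message suppression (hypothetical names)."""

    def __init__(self):
        self.errflag = 0    # plays the role of yyerrflag
        self.messages = []  # error messages that were actually reported

    def on_error(self, msg):
        # Report only if we are not still recovering from an earlier error.
        if self.errflag == 0:
            self.messages.append(msg)
        self.errflag = 3    # stay quiet until three tokens shift cleanly

    def on_shift(self):
        # Each successfully shifted token counts down the quiet window.
        if self.errflag > 0:
            self.errflag -= 1

    def errok(self):
        # What yyerrok amounts to: declare recovery complete immediately.
        self.errflag = 0
```

In this model, an error arriving one token after a previous error is swallowed, while the same error arriving after `errok()` is reported. Setting yyerrflag to zero in a byacc/j action would correspond to calling `errok()` here.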

How to serialize/deserialize YAML::Binary?

UPDATE
It now seems that anything put in a vector breaks. I have tried char and u/int/8/16/32 and they all generate some kind of error. I'm a bit perplexed. There may be an error in my code, but I'm not sure what the YAML should look like, so I'm probably not doing a very good job of spotting where the data becomes incorrect.
Is YAML::Binary from yaml-cpp finished yet? I've tried serializing my data as ints, but yaml-cpp seems to be confused about ints and chars, and this generally never works. Instead, I'm now trying to use YAML::Binary, but I get an error when I try to recover the YAML::Binary node on the receiving end. Specifically, this chunk fails:
3: 0\n6: WAUAAAAAAABYBQAAAAAAAP////84UBspV0FVQUFBQUFBQUJZQlFBQUFBQUFBUC8vLy84NFVCc3BWMEZWUVVGQlFVRkJRVUpaUWxGQlFVRkJRVUZCVUM4dkx5ODRORlZDYzNCV01FWldVVlZHUWxGVlJrSlJWVXBhVVd4R1FsRlZSa0pSVlVaQ1ZVTTRka3g1T0RST1JsWkRZek5DVjAxRldsZFZWbFpIVVd4R1ZsSnJTbEpXVlhCaFZWZDRSMUZzUmxaU2EwcFNWbFZhUTFaVlRUUmthM2cxVDBSU1QxSnNXa1JaZWs1RFZqQXhSbGRzWkZaV2JGcElWVmQ0UjFac1NuSlRiRXBYVmxoQ2FGWldaRFJTTVVaelVteGFVMkV3Y0ZOV2JGWmhVVEZhVmxSVVVtdGhNMmN4VkRCU1UxUXhTbk5YYTFKYVpXczFSRlpxUVhoU2JHUnpXa1phVjJKR2NFbFdWbVEwVWpGYWMxTnVTbFJpUlhCWVZteG9RMkZHV2xkYVJGSlRUVlZhZWxWdGVHRlZNa1YzWTBaT1YySkdXbWhWVkVaaFZteFNWVlZ0ZEdoTk1tTjRWa1JDVTFVeFVYaFRiazVZWVRGS1lWcFhjekZTUmxweFVWaG9VMkpIVW5wWGExcGhWakpLUjJORmJGZFdiVkV3VldwR1lXTXhUblZUYkZKcFVsaENXVlp0ZUc5Uk1rWkhWMnhrWVZKR1NsUlVWbFpoWld4V2RHVkhSbFpOYTFZeldUQmFUMVl5U2tkWGJXaFdWa1ZhYUZadGVGTldWbFowWkVkb1RrMXRUalJXYTFKRFZURlZlRlZZYUZSaWF6VlpXVlJHUzFsV2NGaGpla1pUVW14d2VGVldhRzlWTWtwSVZXNXdXR0V4Y0doV2FrcExVakpPUm1KR1pGZGlWa1YzVmxkd1IxbFhUWGhVYmxaVVlrWktjRlZzYUVOWFZscDBaVWM1VWsxcldraFdNbmhyV1ZaS1IxTnNVbFZXYkZwb1dsZDRWMlJIVmtoU2JGcE9ZVEZaZWxkVVFtRlVNVmw1VTJ0a1dHSlhhRmRXYTFaaFlVWmFkR1ZHVGxkV2JGb3dXa1ZrYjFSck1YUlVhbEpYWVRGS1JGWlVSbFpsUmxaWllVWlNhV0Y2VmxwWFZsSkhVekZzVjJOR2FHcGxhMXBVVlcxNGQyVkdWbGRoUnpsV1RXdHdTVlpYTlhkWFIwVjRZMGRvVjJGcmNFeFZha3BQVW0xS1IxcEdaR2xXYTFZelZteGtkMUl4YkZoVVdHaFZZbXhhVlZscldrdGpSbFp6WVVWT1dGWnNjREJhVldNMVZXc3hjbGRyYUZkTmJtaHlWMVphUzFJeA==\n7: /USER_NAMES/src/sockets/rsc/atkrscs.tar.gz\n1: 0\n4: 0\n5: 651633\n2: 0
As:
terminate called after throwing an instance of 'YAML::ParserException'
what(): yaml-cpp: error at line 7, column 7: unknown escape character:
What should I do? Is there another way to send/receive binary? Did I do something wrong?

ANTLR reports error and I think it should be able to resolve input with backtracking

I have a simple grammar that works for the most part, but at one place it reports error and I think it shouldn't, because it can be resolved using backtracking.
Here is the portion that is problematic.
command: object message_chain;
object: ID;
message_chain: unary_message_chain keyword_message?
| binary_message_chain keyword_message?
| keyword_message;
unary_message_chain: unary_message+;
binary_message_chain: binary_message+;
unary_message: ID;
binary_message: BINARY_OPERATOR object;
keyword_message: (ID ':' object)+;
This is a simplified version; object is more complex (it can be the result of another command, a raw value and so on, but that part works fine). The problem is in message_chain, in the first alternative. For input like obj unary1 unary2 it works fine, but for input like obj unary1 unary2 keyword1:obj2 it tries to match keyword1 as a unary message and fails when it reaches the :. I would think that in this situation the parser would backtrack, see that there is a :, and recognize that this is a keyword message.
If I make keyword message non-optional it works fine, but I need keyword message to be optional.
Parser finds keyword message if it is in second alternative (binary_message) and third alternative (just keyword_message). So something like this gives good results: 1 + 2 + 3 Keyword1:Value
What am I missing? Backtracking is set to true in options and it works fine in other cases in the same grammar.
Thanks.
This is not really a case for PEG-style backtracking, because upon failure that returns to decision points in uncompleted derivations only. For input obj unary1 unary2 keyword1:obj2, with a single token lookahead, keyword1 could be consumed by unary_message_chain. The failure may not occur before keyword_message, and next to be tried would be the second alternative of message_chain, i.e. binary_message_chain, thus missing the correct parse.
However as this grammar is LL(2), it should be possible to extend lookahead to avoid consuming keyword1 from within unary_message_chain. Have you tried explicitly setting k=2, without backtracking?
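To illustrate what the second token of lookahead buys, here is a hand-written recursive-descent sketch of the rules from the question. It is Python rather than ANTLR output, and the token names and the choice of + - * / for BINARY_OPERATOR are assumptions made for the example. An ID is taken as a unary message only when the token after it is not ':'.

```python
import re

# Assumed token set: identifiers/numbers, ':', and four binary operators.
TOKEN_RE = re.compile(r"\s*(?:(?P<ID>\w+)|(?P<COLON>:)|(?P<BINOP>[-+*/]))")


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at column {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens


def parse_command(tokens):
    """command: object message_chain, choosing unary vs keyword with k=2."""
    def kind(i):
        return tokens[i][0] if i < len(tokens) else None

    if kind(0) != "ID":
        raise SyntaxError("expected an object")
    pos, unary, binary, keyword = 1, [], [], []

    # unary_message+: an ID is unary only if the NEXT token is not ':'.
    while kind(pos) == "ID" and kind(pos + 1) != "COLON":
        unary.append(tokens[pos][1])
        pos += 1

    # binary_message+: only tried when no unary message was consumed.
    while not unary and kind(pos) == "BINOP":
        if kind(pos + 1) != "ID":
            raise SyntaxError("expected an object after operator")
        binary.append((tokens[pos][1], tokens[pos + 1][1]))
        pos += 2

    # keyword_message: (ID ':' object)+
    while kind(pos) == "ID" and kind(pos + 1) == "COLON":
        if kind(pos + 2) != "ID":
            raise SyntaxError("expected an object after ':'")
        keyword.append((tokens[pos][1], tokens[pos + 2][1]))
        pos += 3

    if pos != len(tokens) or not (unary or binary or keyword):
        raise SyntaxError(f"unexpected input at token {pos}")
    return {"object": tokens[0][1], "unary": unary,
            "binary": binary, "keyword": keyword}
```

With the extra lookahead, the unary loop stops before keyword1 in obj unary1 unary2 keyword1:obj2, so no backtracking is needed to reach keyword_message.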

Handling invalid input in the Lexer / Parser

So, I am parsing Hayes modem AT commands. They are not read from a file, but passed as a char * (I am using C).
1) what happens if I get something that I totally don't recognize? How do I handle that?
2) what if I have something like
my_token: "cmd param=" ("value_1" | "value_2");
and receive an invalid value for "param"?
I see some advice to let the back-end program (in C) handle it, but that goes against the grain for me. Catch the problem as early as you can, is my motto.
Is there any way to catch "else" conditions in lexer/parser rules?
Thanks in advance ...
That's the thing: the whole point of your parser and lexer is to blow up if you get bad input, then you catch the blow up and present a pretty error message to the user.
I think you're looking for Custom Syntax Error Recovery to embed in your grammar.
EDIT
I've no experience with ANTLR and C (or C alone for that matter), so follow this advice with caution! :)
Looking at the page: http://www.antlr.org/api/C/using.html, perhaps the part at the bottom, Implementing Customized Methods is what you're after.
HTH