My problem is that the message passed to yyerror is already formatted (i.e. it is actually an English explanation of what went wrong), while what I would like to get is just the current token (i.e. the one before the error pseudo-token).
So how do I get it?
I use gplex/gppg which are lex/yacc implementations in C#.
I am sorry for not being 100% precise -- what I need is the token (symbol), not the body (text) that was matched by it.
Let's say I have a rule saying that [A-Za-z0-9_]+ constitutes an ID. I would then like to get the token ID, not the text foobar.
Found this in an old project of mine, with a redefined yyerror:
int yyerror(char *msg) {
    printf("oha, %s: '%s' in line %d\n", msg, yytext, yylineno);
    return 0;
}
This was a C++ project using flex/bison; the interesting part, I think, is yytext.
There's no standard, but bison and most versions of yacc store the current token in yychar. Unfortunately, this is generally a local variable (of yyparse), so you can't access it in other functions (such as yyerror), only in parser actions.
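If you do need the token in yyerror, one common workaround is to have the lexer record each token it returns in a global of your own and read that back when reporting. A minimal sketch (last_token is my own variable, not part of bison):

/* In the .l file: remember the last token handed to the parser. */
int last_token;

[A-Za-z0-9_]+    { return (last_token = ID); }

/* In the .y file: yyerror can now report which token was current. */
int yyerror(const char *msg) {
    fprintf(stderr, "%s (token number %d)\n", msg, last_token);
    return 0;
}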
It might be helpful if you said WHY you want the current token -- it's not generally a useful piece of information. You mention the error pseudo-token, which makes no sense, as that is associated with error recovery, not with errors as such: by the time it comes into the picture, a bunch of tokens from the input have normally already been discarded.
I want to handle undefined and unexpected token errors in the yyerror function (or in another function, if that's possible).
For example, I get this error output from Bison:
...
LAC: checking lookahead EXECSQL: S4
Error: popping nterm component_list ()
Stack now 0
Cleanup: discarding lookahead token $undefined ()
Stack now 0
ERRSTAT = "%X0000002C"
But I want to print which token wasn't recognized, along with the line number. Is it possible to implement this in Bison, and how?
The special token $undefined is reported when yylex returns a token number which doesn't appear in any parser rule. Most of the time, that's the result of the lexer fallback rule:
. { return yytext[0]; }
But it can also happen if you declare a token in your parser file, and the lexer returns that token, but the token is never actually used in any rule.
Unused tokens don't have names, in the sense that the array of names which Bison includes in your parser doesn't include unused tokens, and so there's no way to look up what the token name originally was. You can, however, often get the token number from the variable yychar. If that number is greater than 0 and less than 256, then the token is probably a single-character token, and you could use that to print an additional error message. However, there's no simple way to modify the error message generated by Bison's verbose error messages; if you're using that feature, you'll still see the invalid token message.
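For instance, inside an error-recovery action you can inspect yychar and report single-character tokens yourself (a sketch; the stmt rule here is hypothetical):

stmt: error ';'
        {
          /* yychar is visible inside actions and holds the lookahead token. */
          if (yychar > 0 && yychar < 256)
            fprintf(stderr, "offending character: '%c'\n", yychar);
        }
    ;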
In order to print line numbers, you only need to enable line number counting in the lexical scanner, using
%option yylineno
in your Flex (.l) file. Then you can print the value of yylineno in yyerror. (If you're using a "pure" (reentrant) scanner, then yylineno lives in the yyscan_t object; in the normal use case where that object is an extra parser argument, it will also be available inside yyerror.)
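In the non-reentrant case, that can be as simple as this (sketch):

/* In the .y file; flex maintains yylineno because of %option yylineno. */
extern int yylineno;

void yyerror(const char *msg) {
    fprintf(stderr, "line %d: %s\n", yylineno, msg);
}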
I know that the above is a bit confusing because there are a lot of different code-generation options with slightly different behaviours. You didn't specify the particular options you're using, so the answer is a bit generic.
I need to be able to split one token into two for highlighting purposes. I have a token that looks like this:
ID_INTERP: '$' IDEN;
but I want to highlight the dollar sign differently from the identifier. So, is it possible to split this token into two, one with the dollar sign and the other with the identifier? I know I can change the entire token to a different type under certain conditions, but I'd also like to be able to add tokens and change the text they contain; basically, to change the token stream so that instead of seeing
ID_INTERP["$foo"]
it would see something like this:
DOLLAR_SIGN["$"] IDEN["foo"]
It is possible by extending your token source to emit more than a single token for a given match. I have used this idea to generate two tokens for the lexer rule DOT_IDENTIFIER (see the MySQL grammar in the MySQL Workbench parser). On a match it pushes a dot token and sets the result to IDENTIFIER, effectively creating two separate tokens for a single rule.
Sam Harwell described the technique to extend your lexer for this approach in his answer with some Java code. And here is a possible C++ implementation that I'm using:
std::unique_ptr<antlr4::Token> MySQLBaseLexer::nextToken() {
  // First respond with pending tokens to the next token request, if there are any.
  if (!_pendingTokens.empty()) {
    auto pending = std::move(_pendingTokens.front());
    _pendingTokens.pop_front();
    return pending;
  }

  // Let the main lexer class run the next token recognition.
  // This might create additional tokens again.
  auto next = Lexer::nextToken();
  if (!_pendingTokens.empty()) {
    auto pending = std::move(_pendingTokens.front());
    _pendingTokens.pop_front();
    _pendingTokens.push_back(std::move(next));
    return pending;
  }
  return next;
}
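For completeness, a sketch of the emitting side, roughly following what the DOT_IDENTIFIER action does: _pendingTokens is assumed to be a std::deque<std::unique_ptr<antlr4::Token>>, and the factory and position fields below come from the ANTLR4 C++ Lexer base class (details may differ between runtime versions):

void MySQLBaseLexer::emitDot() {
  // Queue a token covering only the '.' character; the caller then sets the
  // current token's type to IDENTIFIER, so one match yields two tokens.
  _pendingTokens.push_back(_factory->create({this, _input}, MySQLLexer::DOT_SYMBOL,
                                            ".", channel, tokenStartCharIndex,
                                            tokenStartCharIndex, tokenStartLine,
                                            tokenStartCharPositionInLine));
  ++tokenStartCharIndex;  // the identifier part starts one character later
}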
I have a method. The language for this question is unimportant, but here are stubs in Java and Python so people have something to relate to:
Token getToken(long seconds) {
    ...
}
def get_token(seconds):
    ...
The documentation for this method reads:
Get the current token, or a new one.
Guarantees that the returned token will be valid for at least the given amount of seconds.
Since I am not a native English speaker, two things are puzzling me.
First, I would like to give the argument of my method a more descriptive name than seconds, but it should not be too long.
I have considered the following (Python styled):
timeout_seconds
minimum_timeout_seconds
minimum_timeout
required_timeout
required_timeout_seconds
I don't think any of them is spot on, and two of them are a bit long for my taste.
What do people prefer?
Is there a word that can express the purpose better than the ones I have used?
Secondly, the documentation for the argument reads:
The number of seconds there should at least be left until the current token expires.
If there is less than this number of seconds left until expiration, the token will be renewed automatically.
I don't feel the wording here is right. Any thoughts?
As you are dealing with tokens, I'd take the JSON Web Token (JWT) RFC as inspiration. Hence I would use
expires_in_seconds
as the variable name if keeping with Python styling.
The word "timeout" is more commonly used when an operation ceases to try to succeed in whatever it is trying to do, whereas "expires" indicates that the subject (in this case a token) is coming to the end of its period of validity.
As for the documentation I'd rather:
The number of seconds for which the token is valid.
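Putting the name and the wording together, a hypothetical stub could look like this:

def get_token(expires_in_seconds):
    """Return the current token, or a new one.

    The returned token is guaranteed to be valid for at least
    expires_in_seconds seconds; if the current token would expire
    sooner than that, it is renewed automatically.
    """
    ...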
However, it does feel like the code you are using may be trying to create its own web token standard, which is something I would warn against! E.g. "If there is less than this number of seconds left until expiration, the token will be renewed automatically" seems odd.
I've been breaking my head over this for a few days now and can't seem to be able to figure it out. Perhaps it's glaringly obvious, but I don't seem to be able to spot it. I've read up on all the basics of unicode, UTF-8, UTF-16, normalisation, etc, but to no avail. Hopefully somebody's able to help me out here...
I'm using Go's Value function from the testing/quick package to generate random values for the fields in my data structs, in order to implement the Generator interface for the structs in question. Specifically, given a Metadata struct, I've defined the implementation as follows:
func (m *Metadata) Generate(r *rand.Rand, size int) (value reflect.Value) {
    value = reflect.ValueOf(m).Elem()
    for i := 0; i < value.NumField(); i++ {
        if t, ok := quick.Value(value.Field(i).Type(), r); ok {
            value.Field(i).Set(t)
        }
    }
    return
}
Now, in doing so, I'll end up with both the receiver and the return value set with randomly generated values of the appropriate types (strings, ints, etc.), the returned reflect.Value simply wrapping the receiver.
Now, the implementation for the Value function states that it will return something of type []rune converted to type string. As far as I know, this should allow me to then use the functions in the runes, unicode and norm packages to define a filter which filters out everything which is not part of 'Latin', 'Letter' or 'Number'. I defined the following filter which uses a transform to filter out letters which are not in those character rangetables (as defined in the unicode package):
func runefilter(in reflect.Value) (out reflect.Value) {
    out = in // Make sure you return something
    if in.Kind() == reflect.String {
        instr := in.String()
        t := transform.Chain(norm.NFD, runes.Remove(runes.NotIn(rangetable.Merge(unicode.Letter, unicode.Latin, unicode.Number))), norm.NFC)
        outstr, _, _ := transform.String(t, instr)
        out = reflect.ValueOf(outstr)
    }
    return
}
Now, I think I've tried just about anything, but I keep ending up with a series of strings which are far from the Latin range, e.g.:
𥗉똿穊
𢷽嚶
秓䝏小𪖹䮋
𪿝ท솲
𡉪䂾
ʋ𥅮ᦸ
堮𡹯憨𥗼𧵕ꥆ
𢝌𐑮𧍛併怃𥊇
鯮
𣏲𝐒
⓿ꐠ槹𬠂黟
𢼭踁퓺𪇖
俇𣄃𔘧
𢝶
𝖸쩈𤫐𢬿詢𬄙
𫱘𨆟𑊙
欓
So, can anybody explain what I'm overlooking here and how I could instead define a transformer which removes/replaces non-letter/number/latin characters so that I can use the Value function as intended (but with a smaller subset of 'random' characters)?
Thanks!
Confusingly, the Generate interface needs a method on the value type, not on the pointer to the type. You want your type signature to look like
func (m Metadata) Generate(r *rand.Rand, size int) (value reflect.Value)
You can play with this here. Note: the most important thing to do in that playground is to switch the type of the generate function from m Metadata to m *Metadata and see that Hi Mom! never prints.
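A reduced version of what that playground demonstrates (my own reconstruction):

package main

import (
    "fmt"
    "math/rand"
    "reflect"
    "testing/quick"
)

type Metadata struct{ Name string }

// With a value receiver, testing/quick finds and calls this method.
// Change the receiver to *Metadata and "Hi Mom!" never prints.
func (m Metadata) Generate(r *rand.Rand, size int) reflect.Value {
    fmt.Println("Hi Mom!")
    m.Name = "generated"
    return reflect.ValueOf(m)
}

func main() {
    f := func(m Metadata) bool { return true }
    if err := quick.Check(f, nil); err != nil {
        fmt.Println(err)
    }
}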
In addition, I think you would be better served by using your own type and writing a Generate method for that type, using a list of all of the characters you want to allow. For example:
type LatinString string
const latin = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
and then use the generator
func (l LatinString) Generate(rand *rand.Rand, size int) reflect.Value {
    var buffer bytes.Buffer
    for i := 0; i < size; i++ {
        buffer.WriteString(string(latin[rand.Intn(len(latin))]))
    }
    s := LatinString(buffer.String())
    return reflect.ValueOf(s)
}
playground
Edit: also this library is pretty cool, thanks for showing it to me
The answer to my own question is, it seems, a combination of the answers provided in the comments by @nj_ and @jimb and the answer provided by @benjaminkadish.
In short, the answer boils down to:
"Not such a great idea as you thought it was", or "Bit of an ill-posed question"
"You were using the union of 'Letter', 'Latin' and 'Number' (Letter || Number || Latin), instead of the intersection of 'Latin' with the union of 'Letter' and 'Number' ((Letter || Number) && Latin))
Now for the longer version...
The idea behind using the testing/quick package is that I wanted random data for (fuzzy) testing of my code. In the past, I had always written the code for doing things like that myself, again and again, which meant a lot of the same code across different projects. Now, I could of course have written my own package for it, but it turns out that, even better, there's a standard package which does just about exactly what I want.
And it turns out the package does this very well: the codepoints in the strings it generates are actually random, not just restricted to what we're accustomed to using in everyday life. This is of course exactly what you want in fuzzy testing, in order to exercise the code with values outside the usual assumptions.
In practice, that means I'm running into two problems:
There are some limits on what I would consider reasonable input for a string. In testing the processing of a Name field or a URL field, I can reasonably assume there's not going to be a value like 'James Mc⌢' (let alone 'James Mc🙁') or 'www.🕸site.com', but just 'James McFrown' and 'www.website.com'. Of course, things shouldn't completely break down on such input, but the system can't be expected to handle the former examples without any problems either.
When I filter the generated string down to values one might consider reasonable, the chance of ending up with a valid string is very small. The set of possible characters used by testing/quick is just so large (0x10FFFF codepoints) and the set of reasonable characters so small that you end up with empty strings most of the time.
So, what do we need to take away from this?
So, whilst I had hoped to use the standard testing/quick package to replace my oft-repeated code for generating random data for fuzzy testing, it does this so well that it provides data outside the range of what I would consider reasonable for the code to handle. It seems that the choice, in the end, is to:
Either be able to actually handle all fuzzy options, meaning that if somebody's name is 'Arnold 💰💰' ('Arnold Moneybags'), it shouldn't go arse over end. Or...
Use custom/derived types with their own Generator. This means you're going to have to use the derived type instead of the basic type throughout the code. (Comparable to defining a string as wchar_t instead of char in C++ and working with those by default.). Or...
Don't use testing/quick for fuzzy testing, because whenever a generated string value is involved, you can (and should) expect a very random string.
As always, further comments are of course welcome, as it's quite possible I overlooked something.
I tried error reporting in the following manner.
@members {
    public String getErrorMessage(RecognitionException e, String[] tokenNames) {
        List stack = getRuleInvocationStack(e, this.getClass().getName());
        String msg = null;
        if (e instanceof NoViableAltException) {
            <some code>
        }
        else {
            msg = super.getErrorMessage(e, tokenNames);
        }
        String[] inputLines = e.input.toString().split("\r\n");
        String line = "";
        if (e.token.getCharPositionInLine() == 0)
            line = "at \"" + inputLines[e.token.getLine() - 2];
        else if (e.token.getCharPositionInLine() > 0)
            line = "at \"" + inputLines[e.token.getLine() - 1];
        return ": " + msg.split("at")[0] + line + "\" => [" + stack.get(stack.size() - 1) + "]";
    }

    public String getTokenErrorDisplay(Token t) {
        return t.toString();
    }
}
And now errors are displayed as follows.
line 6:7 : missing CLOSSB at "int a[6;" => [var_declaration]
line 8:0 : missing SEMICOL at "int p" => [var_declaration]
line 8:5 : missing CLOSB at "get(2;" => [call]
I have 2 questions.
1) Is there a proper way to do the same thing I have done?
2) I want to replace CLOSSB, SEMICOL, CLOSB etc. with their real symbols. How can I do that using the map in .g file?
Thank you.
1) Is there a proper way to do the same thing I have done?
I don't know if there is a defined proper way of showing errors. My take on showing errors is a litmus test: if the user can figure out how to fix the error based on what you have given them, then it is good; if the user is confused by the error message, then the message needs more work. Based on the examples given in the question, the symbols were only char constants.
My favorite way of presenting errors is to show the line with an arrow pointing at the location, i.e.:
Expected closing bracket on line 6.
int a[6;
       ^
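Producing that format is straightforward once you have the offending line and character position (a hypothetical helper):

static String caretMessage(String expectation, int lineNumber, String line, int charPosition) {
    StringBuilder sb = new StringBuilder();
    sb.append(expectation).append(" on line ").append(lineNumber).append(".\n");
    sb.append(line).append('\n');
    for (int i = 0; i < charPosition; i++)
        sb.append(' ');                     // pad up to the error column
    return sb.append('^').toString();      // arrow under the offending character
}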
2) I want to replace CLOSSB, SEMICOL, CLOSB etc. with their real symbols. How can I do that using the map in .g file?
You will have to read the separately generated tokens file and then build a map, i.e. a dictionary data structure, to translate the token name into the token's character(s).
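A minimal sketch of building that map (a .tokens file contains lines like SEMICOL=15 and ';'=15; the merging logic here is my own):

import java.io.*;
import java.util.*;

class TokenNames {
    // Map token names ("SEMICOL") to their literals ("';'") via the token number.
    static Map<String, String> literalForName(File tokensFile) throws IOException {
        Map<Integer, String> literals = new HashMap<>(); // 15 -> "';'"
        Map<String, Integer> names = new HashMap<>();    // "SEMICOL" -> 15
        try (BufferedReader r = new BufferedReader(new FileReader(tokensFile))) {
            String line;
            while ((line = r.readLine()) != null) {
                int eq = line.lastIndexOf('=');
                if (eq < 0)
                    continue;
                String key = line.substring(0, eq);
                int value = Integer.parseInt(line.substring(eq + 1).trim());
                if (key.startsWith("'"))
                    literals.put(value, key); // a char/string literal entry
                else
                    names.put(key, value);    // a named token entry
            }
        }
        Map<String, String> result = new HashMap<>();
        for (Map.Entry<String, Integer> e : names.entrySet())
            if (literals.containsKey(e.getValue()))
                result.put(e.getKey(), literals.get(e.getValue()));
        return result;
    }
}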
EDIT
First we have to clarify what is meant by symbol. If you limit the definition of symbol to tokens that are defined in the tokens file with a char or string constant, e.g. '!'=13 or 'public'=92, then this can be done. If, however, you take symbol to mean any text associated with a token, then that is something other than what I address here.
When ANTLR generates its token map it uses three different sources:
The char or string constants in the lexer.
The char or string constants in the parser.
Internal tokens such as Invalid, Down, Up.
Since the tokens in the lexer are not the complete set, one should use the tokens file as a starting point. If you look at the tokens file you will note that the lowest value is 4. If you look at the TokenTypes file (This is the C# version name) you will find the remaining defined tokens.
If you find names like T__ in the tokens file, those are the names ANTLR generated for the char/string literals in the parser.
If you are using string and/or char literals in parser rules, then ANTLR must create a new set of lexer rules that include all of the string and/or char literals in the parser rules. Remember that the parser can only see tokens and not raw text. So string and/or char literals cannot be passed to the parser.
To see the new set of lexer rules, run org.antlr.Tool with the -Xsavelexer option, and then open the generated grammar file (its name will end in .g). If you have string and/or char literals in your parser rules, you will see lexer rules with names starting with T__.
Now that you know all of the tokens and their values you can create a mapping table from the info given in the error to the string you want to output instead for the symbol.
The code at http://markmail.org/message/2vtaukxw5kbdnhdv#query:+page:1+mid:2vtaukxw5kbdnhdv+state:results is an example.
However, the mapping of the tokens can change due to things like changing rules in the lexer or changing char/string literals in the parser. So if the message suddenly outputs the wrong string for a symbol, you will have to update the mapping table by hand.
While this is not a perfect solution, it is a possible solution depending on how you define symbol.
Note: Last time I looked, ANTLR 4.x creates the table automatically for access within the parser, because this was such a problem for so many people with ANTLR 3.x.
Bhathiya wrote:
1) Is there a proper way to do the same thing I have done?
There is no single way to do this. Note that proper error-handling and reporting is tricky. Terence Parr spends a whole chapter on this in The Definitive ANTLR Reference (chapter 10). I recommend you get hold of a copy and read it.
Bhathiya wrote:
2) I want to replace CLOSSB, SEMICOL, CLOSB etc. with their real symbols. How can I do that using the map in .g file?
You can't. For SEMICOL this may seem easy to do, but how would you get this information for a token like FOO:
FOO : (X | Y)+;
fragment X : '4'..'6';
fragment Y : 'a' | 'bc' | . ;