How to use yylval with strings in yacc - yacc

I want to pass the actual string of a token. If I have a token called ID, then I want my yacc file to actually know what ID is called. I think I have to pass a string using yylval to the yacc file from the flex file. How do I do that?

The key to returning a string (or any complex type) via yylval is the YYSTYPE union created by yacc in the y.tab.h file. YYSTYPE is a union with a member for each type of token defined within the yacc source file. For example, to return the string associated with a SYMBOL token, you declare this YYSTYPE union using %union in the yacc source file:
/*** Yacc's YYSTYPE Union ***/
/* The yacc parser maintains a stack (array) of token values while
 * it is parsing. This union defines all the possible values tokens
 * may have. Yacc creates a typedef of YYSTYPE for this union. All
 * token types (see the %type declarations below) are taken from
 * the field names of this union. The global variable yylval, which lex
 * uses to return token values, is declared as a YYSTYPE union.
 */
%union {
    long       int4;      /* Constant integer value */
    float      fp;        /* Constant floating point value */
    char      *str;       /* Ptr to constant string (strings are malloc'd) */
    exprT      expr;      /* Expression - constant or address */
    operatorT *operatorP; /* Pointer to run-time expression operator */
};
%type <str> SYMBOL
Then in the lex source file there is a pattern that matches the SYMBOL token. It is the responsibility of the code associated with that rule to return the actual string that represents the SYMBOL. You can't just pass a pointer to the yytext buffer, because it is a static buffer that is reused for each token that is matched. To return the matched text, the static yytext buffer must be replicated on the heap with _strdup() and a pointer to this string passed via yylval.str. It is then the responsibility of the yacc rule that matches the SYMBOL token to free the heap-allocated string when it is done with it.
[A-Za-z_][A-Za-z0-9_]*  {
    int i;
    /*** KEYWORDS and SYMBOLS ***/
    /* Here we match a keyword or SYMBOL as a letter
     * followed by zero or more letters, digits or
     * underscores.
     */
    _strupr(yytext);                    /* Convert matched text to uppercase */
    /* First we search the keyword table */
    for (i = 0; i < NITEMS(keytable); i++) {
        if (strcmp(keytable[i].name, yytext) == 0)
            return keytable[i].token;
    }
    /* No keyword matched, so set the lexical value string to a heap
     * copy of the matched text and return a SYMBOL.
     */
    yylval.str = _strdup(yytext);
    return SYMBOL;
}

See the Flex manual section on Interfacing with YACC.
15 Interfacing with Yacc
One of the main uses of flex is as a companion to the yacc parser-generator. yacc parsers expect to call a routine named yylex() to find the next input token. The routine is supposed to return the type of the next token as well as putting any associated value in the global yylval. To use flex with yacc, one specifies the `-d' option to yacc to instruct it to generate the file y.tab.h containing definitions of all the %tokens appearing in the yacc input. This file is then included in the flex scanner. For example, if one of the tokens is TOK_NUMBER, part of the scanner might look like:
%{
#include "y.tab.h"
%}
%%
[0-9]+ yylval = atoi( yytext ); return TOK_NUMBER;

Setting up the context
Syntax analysis (checking whether an input text follows a specified grammar) consists of two phases:
tokenizing, which is done by tools like lex or flex (with the interface yylex()), and
parsing the stream of tokens generated in step 1, as per a user-specified grammar, which is done by tools like bison/yacc (with the interface yyparse()).
While doing phase 1, given an input stream, each call to yylex() identifies a token (a character string) and yytext points to the first character of that string. For example: with an input stream of "int x = 10;" and lex rules for tokenization conforming to the C language, the first 5 calls to yylex() will identify the following 5 tokens: "int", "x", "=", "10", ";", and each time yytext will point to the first character of the returned token.
In phase 2, the parser (which you mentioned as yacc) is a program that calls this yylex function each time it needs a token, and uses these tokens to see whether the input matches the rules of a grammar. These calls to yylex return tokens as integer codes. For example, in the previous example, the first 5 calls to yylex() may return the following integers to the parser: TYPE, ID, EQ_OPERATOR, INTEGER and SEMICOLON (whose actual integer values are defined in some header file).
Now all the parser can see is those integer codes, which may not be useful at times. For example, in the running example you may want to associate TYPE with int, ID with some symbol-table pointer, and INTEGER with the decimal 10. To facilitate that, each token returned by yylex is associated with another VALUE, whose default type is int, but for which you may define custom types. In the lex environment this VALUE is accessed as yylval.
For example, again as per the running example, yylex may have the following rule to identify 10
[0-9]+ { yylval.intval = atoi(yytext); return INTEGER; }
and following to identify x
[a-zA-Z][a-zA-Z0-9]* {yylval.sym_tab_ptr = SYM_TABLE(yytext); return ID;}
Note that here I have defined the VALUE's (or yylval's) type as a union containing an int (intval) and an int* (sym_tab_ptr).
But in the yacc world, this VALUE is identified / accessed as $n. For example, consider the following yacc rule to identify a specific assignment statement
stmt : TYPE ID '=' INTEGER ';' { /* In this action part of the yacc rule, use $2 to get the symbol table pointer associated with ID, and $4 to get the decimal 10. */ } ;
Answering your question
If you want to access the yytext value of a certain token (which belongs to the lex world) in the yacc world, use that old friend VALUE as follows:
Augment the union type of VALUE to add another field say char* lex_token_str
In the lex rule, do yylval.lex_token_str = strdup(yytext)
Then in yacc world access it using the appropriate $n.
In case you want to access more than a single value of a token (for example, for the lex-identified token ID, the parser may want to access both the name and the symbol table pointer), augment the union type of VALUE with a structure member containing a char* (for the name) and an int* (for the symtab pointer).

Related

How to use define in objective c in another define

I need to use a define inside another define, to simplify the code by making the replacement in only one place.
Problem (Objective C code)
#define URL @"www.example.com/"
#define UserLogin @"<Login xmlns=\"http://www.example.com/\"><Email>%@</Email><Password>%@</Password></Login>"
.
.
.
#define UserRegistration @"<Reg xmlns=\"http://www.example.com/\"><Email>%@</Email></Reg>"
I have a list of statements like this. Can I use URL in place of the xmlns=\".../\" part, i.e. can I use the above-defined URL like xmlns=\"URL/\", so that I can change the url in only one place?
You can in the following way --
#define URL #"www.example.com/"
#define UrlWithUrl [NSString stringWithFormat:@"<Login xmlns=\"http://%@/\">", URL]
A #define token will be expanded in a subsequent #define provided it is not in a string; in your case you wish to use URL inside a string, so just writing it in the string will not result in it being expanded.
However in (Objective-)C adjacent string literals are automatically concatenated by the compiler to become a single string literal, e.g.:
@"one " @"two"
is transformed by the compiler into:
@"one two"
Knowing that you can rewrite your definition of UserLogin as three strings which will be joined by the compiler into one:
#define UserLogin @"<Login xmlns=\"" URL @"\"><Email>%@</Email><Password>%@</Password></Login>"
and a use of UserLogin in your code will be replaced by the three strings which are then joined by the compiler, e.g.
NSLog(@"%@", UserLogin);
becomes after preprocessing:
NSLog(@"%@", @"<Login xmlns=\"" @"www.example.com/" @"\"><Email>%@</Email><Password>%@</Password></Login>");
and then the adjacent string literals are joined:
NSLog(@"%@", @"<Login xmlns=\"www.example.com/\"><Email>%@</Email><Password>%@</Password></Login>");
For more complex cases you will need to read up on the preprocessor, in particular stringification.
In Xcode you can see the results of your macros by selecting the menu item Product:Perform Action:Preprocess "...", this opens a window showing your source file after all the macros have been expanded, i.e. the resultant source code the compiler will compile.
HTH

Bison parser with operator tokens in variable name

I am new to bison, and have the misfortune of needing to write a parser for a language that may have what would otherwise be an operator within a variable name. For example, depending on context, the expression
FOO = BAR-BAZ
could be interpreted as either:
the variable "FOO" being assigned the value of the variable "BAR" minus the value of the variable "BAZ", OR
the variable "FOO" being assigned the value of the variable "BAR-BAZ"
Fortunately the language requires variable declarations ahead of time, so I can determine whether a given string is a valid variable via a function I've implemented:
bool isVariable(char* name);
that will return true if the given string is a valid variable name, and false otherwise.
How do I tell bison to attempt the second scenario above first, and only if (through use of isVariable()) that path fails, go back and try it as the first scenario above? I've read that you can have bison try multiple parsing paths and cull invalid ones when it encounters a YYERROR, so I've tried a set of rules similar to:
variable:
STRING { if(!isVariable($1)) YYERROR; }
;
expression:
expression '-' expression
| variable
;
but when given "BAR-BAZ" the parser tries it as a single variable and just stops completely when it hits the YYERROR instead of exploring the "BAR" - "BAZ" path as I expect. What am I doing wrong?
Edit:
I'm beginning to think that my flex rule for STRING might be the culprit:
(([A-Z0-9][-A-Z0-9_/.]+)|([A-Z])) { yylval.sval = strdup(yytext); return STRING; }
In this case, if '-' appears in the middle of alphanumeric characters, the whole lot is treated as one STRING, with no possibility of subdivision by the parser (and therefore only one path is explored). I suppose I could manually parse the STRING in the parser action, but it seems like there should be a better way. Perhaps flex could give back alternate token streams (one for the "BAR-BAZ" case and another for the "BAR"-"BAZ" case) that are diverted to different parser stacks for exploration? Is something like that possible?
It's not impossible to solve this problem within a bison-generated parser, but it's not easy, and the amount of hackery required might detract from the readability and verifiability of the grammar.
To be clear, GLR parsers are not fallback parsers. The GLR algorithm explores all possible parses in parallel, and rejects invalid ones as it goes. (The bison implementation requires that the parse converge to a single possible parse; the original GLR algorithm produces forest of parse trees.) Also, the GLR algorithm does not contemplate multiple lexical analyses.
If you want to solve this problem in the context of the parser, you'll probably need to introduce special handling for whitespace, or at least for a - which is not surrounded by whitespace. Otherwise, you will not be able to distinguish between a - b (presumably always subtraction) and a-b (which might be the variable a-b if that variable were defined). Leaving aside that issue, you would be looking for something like this (but this won't work, as explained below):
expr   : term
       | expr '-' term
       ;
term   : factor
       | term '*' factor
       ;
factor : var
       | '(' expr ')'
       ;
var    : ident  { if (!isVariable($1)) { /* reject this production */ } }
       ;
ident  : WORD
       | ident '-' WORD  { $$ = concatenate($1, "-", $3); }
       ;
This won't work because the action associated with var : ident is not executed until after the parse has been disambiguated. So if the production is rejected, the parse fails, because the parser has already determined that the production is necessary. (Until the parser makes that determination, actions are deferred.)
Bison allows GLR grammars to use semantic predicates, which are executed immediately instead of being deferred. But that doesn't help, because semantic predicates cannot make use of computed semantic values (since the semantic value computations are still deferred when the semantic predicate is evaluated). You might think you could get around this by making the computation of the concatenated identifier (in the second ident production) a semantic predicate, but then you run into another limitation: semantic predicates do not themselves have semantic values.
Probably there is a hack which will get around this problem, but that might leave you with a different problem. Suppose that a, c, a-b and b-c are defined variables. Then, what is the meaning of a-b-c? Is it (a-b) - c or a - (b-c) or an error?
If you expect it to be an error, then there is no problem, since the GLR parser will find both possible parses and bison-generated GLR parsers signal a syntax error if the parse is ambiguous. But then the question becomes: is a-b-c only an error if it is ambiguous? Or is it an error because you cannot use a subtraction operator without surrounding whitespace if its arguments are hyphenated variables? (So that a-b-c can only be resolved to (a - b) - c or to (a-b-c), regardless of whether a-b and b-c exist?) To enforce the latter requirement, you'll need yet more complication.
If, on the other hand, your language is expected to model a "fallback" approach, then the result should be (a-b) - c. But making that selection is not a simple merge procedure between two expr reductions, because of the possibility of a higher precedence * operator: d * a-b-c either resolves to (d * a-b) - c or (d * a) - b-c; in those two cases, the parse trees are radically different.
An alternative solution is to put the disambiguation of hyphenated variables into the scanner, instead of the parser. This leads to a much simpler and somewhat clearer definition, but it leads to a different problem: how do you tell the scanner when you don't want the semantic disambiguation to happen? For example, you don't want the scanner to insist on breaking up a variable name into segments when the name occurs in a declaration.
Even though the semantic tie-in with the scanner is a bit ugly, I'd go with that approach in this case. A rough outline of a solution is as follows:
First, the grammar. Here I've added a simple declaration syntax, which may or may not have any resemblance to the one in your grammar. See notes below.
expr : term
| expr '-' term
term : factor
| term '*' factor
factor: VARIABLE
| '(' expr ')'
decl : { splitVariables(false); } "set" VARIABLE
{ splitVariables(true); } '=' expr ';'
{ addVariable($2); /* ... */ }
(See below for the semantics of splitVariables.)
Now, the lexer. Again, it's important to know what the intended result for a-b-c is; I'll outline two possible strategies. First, the fallback strategy, which can be implemented in flex:
int candidate_len = 0;
[[:alpha:]][[:alnum:]]*/"-"[[:alpha:]] { yymore();
candidate_len = yyleng;
BEGIN(HYPHENATED);
}
[[:alpha:]][[:alnum:]]* { yylval.id = strdup(yytext);
return WORD;
}
<HYPHENATED>"-"[[:alpha:]][[:alnum:]]*/"-"[[:alpha:]] {
yymore();
if (isVariable(yytext))
candidate_len = yyleng;
}
<HYPHENATED>"-"[[:alpha:]][[:alnum:]]* { if (!isVariable(yytext))
yyless(candidate_len);
yylval.id = strdup(yytext);
BEGIN(INITIAL);
return WORD;
}
That uses yymore and yyless to find the longest prefix sequence of hyphenated words which is a valid variable. (If there is no such prefix, it chooses the first word. An alternative would be to select the entire sequence if there is no such prefix.)
A similar alternative, which only allows the complete hyphenated sequence (in the case where that is a valid variable) or individual words. Again, we use yyless and yymore, but this time we don't bother checking intermediate prefixes and we use a second start condition for the case where we know we're not going to combine words:
int candidate_len = 0;
[[:alpha:]][[:alnum:]]*/"-"[[:alpha:]] { yymore();
candidate_len = yyleng;
BEGIN(HYPHENATED);
}
[[:alpha:]][[:alnum:]]* { yylval.id = strdup(yytext);
return WORD;
}
<HYPHENATED>("-"[[:alpha:]][[:alnum:]]*)*[[:alpha:]][[:alnum:]]* {
if (isVariable(yytext)) {
yylval.id = strdup(yytext);
BEGIN(INITIAL);
return WORD;
} else {
yyless(candidate_len);
yylval.id = strdup(yytext);
BEGIN(NO_COMBINE);
return WORD;
}
}
<NO_COMBINE>[[:alpha:]][[:alnum:]]* { yylval.id = strdup(yytext);
return WORD;
}
<NO_COMBINE>"-" { return '-'; }
<NO_COMBINE>.|\n { yyless(0); /* rescan */
BEGIN(INITIAL);
}
Both of the above solutions use isVariable to decide whether or not a hyphenated sequence is a valid variable. As mentioned earlier, there must be a way to turn off the check, for example in the case of a declaration. To accomplish this, we need to implement splitVariables(bool). The implementation is straightforward; it simply needs to set a flag visible to isVariable. If the flag is set to true, then isVariable always returns true without actually checking for the existence of the variable in the symbol table.
All of that assumes that the symbol table and the splitVariables flag are shared between the parser and the scanner. A naïve solution would make both of these variables globals; a cleaner solution would be to use a pure parser and lexer, and pass the symbol table structure (including the flag) from the main program into the parser, and from there (using %lex-param) into the lexer.

Issues of Error handling with ANTLR3

I tried error reporting in the following manner.
@members {
    public String getErrorMessage(RecognitionException e, String[] tokenNames) {
        List stack = getRuleInvocationStack(e, this.getClass().getName());
        String msg = null;
        if (e instanceof NoViableAltException) {
            <some code>
        } else {
            msg = super.getErrorMessage(e, tokenNames);
        }
        String[] inputLines = e.input.toString().split("\r\n");
        String line = "";
        if (e.token.getCharPositionInLine() == 0)
            line = "at \"" + inputLines[e.token.getLine() - 2];
        else if (e.token.getCharPositionInLine() > 0)
            line = "at \"" + inputLines[e.token.getLine() - 1];
        return ": " + msg.split("at")[0] + line + "\" => [" + stack.get(stack.size() - 1) + "]";
    }

    public String getTokenErrorDisplay(Token t) {
        return t.toString();
    }
}
And now errors are displayed as follows.
line 6:7 : missing CLOSSB at "int a[6;" => [var_declaration]
line 8:0 : missing SEMICOL at "int p" => [var_declaration]
line 8:5 : missing CLOSB at "get(2;" => [call]
I have 2 questions.
1) Is there a proper way to do the same thing I have done?
2) I want to replace CLOSSB, SEMICOL, CLOSB etc. with their real symbols. How can I do that using the map in .g file?
Thank you.
1) Is there a proper way to do the same thing I have done?
I don't know if there is a defined proper way of showing errors. My take on showing errors is a litmus test: if the user can figure out how to fix the error based on what you have given them, then it is good; if the user is confused by the error message, then the message needs more work. Based on the examples given in the question, the symbols were only char constants.
My favorite way of seeing errors is with the line with an arrow pointing at the location.
i.e.
Expected closing brace on line 6.
int a[6;
^
2) I want to replace CLOSSB, SEMICOL, CLOSB etc. with their real symbols. How can I do that using the map in .g file?
You will have to read the separately generated token file and then make a map, i.e. a dictionary data structure, to translate the token name into the token character(s).
EDIT
First we have to clarify what is meant by symbol. If you limit the definition of symbol to only tokens that are defined in the tokens file with a char or string then this can be done, i.e. '!'=13, or 'public'=92, if however you chose to use the definition of symbol to be any text associated with a token, then that is something other than what I was or plan to address.
When ANTLR generates its token map it uses three different sources:
The char or string constants in the lexer
The char or string constants in the parser.
Internal tokens such as Invalid, Down, Up
Since the tokens in the lexer are not the complete set, one should use the tokens file as a starting point. If you look at the tokens file you will note that the lowest value is 4. If you look at the TokenTypes file (This is the C# version name) you will find the remaining defined tokens.
If you find names like T__ in the tokens file, those are the names ANTLR generated for the char/string literals in the parser.
If you are using string and/or char literals in parser rules, then ANTLR must create a new set of lexer rules that include all of the string and/or char literals in the parser rules. Remember that the parser can only see tokens and not raw text. So string and/or char literals cannot be passed to the parser.
To see the new set of lexer rules, use org.antlr.Tool -Xsavelexer, and then open the created grammar file (its name ends in .g). If you have string and/or char literals in your parser rules you will see lexer rules with names starting with T__.
Now that you know all of the tokens and their values you can create a mapping table from the info given in the error to the string you want to output instead for the symbol.
The code at http://markmail.org/message/2vtaukxw5kbdnhdv#query:+page:1+mid:2vtaukxw5kbdnhdv+state:results
is an example.
However, the mapping of the tokens can change for such things as changing rules in the lexer or changing char/string literals in the parser. So if the message all of a sudden outputs the wrong string for a symbol, you will have to update the mapping table by hand.
While this is not a perfect solution, it is a possible solution depending on how you define symbol.
Note: Last time I looked ANTLR 4.x creates the table automatically for access within the parser because it was such a problem for so many with ANTLR 3.x.
Bhathiya wrote:
1) Is there a proper way to do the same thing I have done?
There is no single way to do this. Note that proper error-handling and reporting is tricky. Terence Parr spends a whole chapter on this in The Definitive ANTLR Reference (chapter 10). I recommend you get hold of a copy and read it.
Bhathiya wrote:
2) I want to replace CLOSSB, SEMICOL, CLOSB etc. with their real symbols. How can I do that using the map in .g file?
You can't. For SEMICOL this may seem easy to do, but how would you get this information for a token like FOO:
FOO : (X | Y)+;
fragment X : '4'..'6';
fragment Y : 'a' | 'bc' | . ;

No spaces between tokens in ANTLR

I'm writing a very simple subset of a C# grammar as an exercise.
However, I have a rule in which whitespace is giving me some trouble.
I want to distinguish the following:
int a;
int? b;
where the first is a "regular" int type and the second is a nullable int type.
However, with my current grammar I'm not being able to parse this.
type : typeBase x='?'? -> {x == null}? typeBase
                       -> ^('?' typeBase)
     ;
typeBase : 'int'
| 'float'
;
The thing is that with these rules, it only works with whitespace before '?', like this:
int ? a;
Which I don't want.
Any ideas?
1) Your definition of whitespace seems to be flawed ... the grammar you present should accept both "int?" and "int ?". Maybe you should take a look at your definition of whitespace.
2) If you want to disallow "int ? a", you can define extra tokens 'int?' and 'float?' ... normally you allow whitespace to appear between every token, so you have to make it a single token.

How to use JMS Properties on IBM MQ JMS Interface?

I'm using MQ JMS interface with MQ 6.0.2.
It seems that only pre-defined properties are supported, not arbitrary ones.
For instance, I can properly call getJMSCorrelationID(), getJMSPriority(), etc. However, when I set an arbitrary property on the sender:
message.setStringProperty("my arbitrary name", "value");
I can't get the property from the message on the receiver:
message.getStringProperty("my arbitrary name");
I simply get null.
Is there a way to do that as in any JMS implementation, or is that an MQ JMS limitation?
If you have the complete client install, you can go to C:\Program Files\IBM\WebSphere MQ\tools\jms\samples\interactive\ or somewhere in /opt/mqm/samp and look for SampleConsumerJava.java and SampleProducerJava.java.
From the sample Producer program:
// Set custom properties
msg.setStringProperty("MyStringProperty", "My Year Of Birth");
msg.setIntProperty("MyIntProperty", 2007);
And from the sample Consumer:
// Get values for custom properties, if available
String property1 = msg.getStringProperty("MyStringProperty");
// Get value for an int property, store the result in long to validate
// the get operation.
long property2 = ((long) Integer.MAX_VALUE) + 1;
property2 = msg.getIntProperty("MyIntProperty");
if ((property1 != null) && (property2 < Integer.MAX_VALUE)) {
System.out.println("[Message has my custom properties]");
Property names follow the rules for Java variable names and can't have spaces in them.
Per the JMS 1.1 specification:
An identifier is an unlimited-length character sequence that must begin with a Java identifier start character; all following characters must be Java identifier part characters. An identifier start character is any character for which the method Character.isJavaIdentifierStart returns true. This includes ‘_’ and ‘$’. An identifier part character is any character for which the method Character.isJavaIdentifierPart returns true.
Following the clues here takes us to the Javadoc for the Character.isJavaIdentifierPart method which lists the valid characters for an identifier:
A character may be part of a Java identifier if any of the following are true:
* it is a letter
* it is a currency symbol (such as '$')
* it is a connecting punctuation character (such as '_')
* it is a digit
* it is a numeric letter (such as a Roman numeral character)
* it is a combining mark
* it is a non-spacing mark
* isIdentifierIgnorable(codePoint) returns true for the character
Note that white space is specifically excluded from the set of valid identifier characters. The set of valid first characters is a little more restrictive and includes the following characters:
* isLetter(ch) returns true
* getType(ch) returns LETTER_NUMBER
* ch is a currency symbol (such as "$")
* ch is a connecting punctuation character (such as "_").
Use a valid identifier and try again. For example:
message.setStringProperty("my.arbitrary.name", "value");
message.getStringProperty("my.arbitrary.name");
Or possibly...
message.setStringProperty("myArbitraryName", "value");
message.getStringProperty("myArbitraryName");
By the way, switch to V7 at your earliest opportunity. Not only is the support for properties much better in general, but the ability to directly read/write MQMD headers is vastly improved as shown in the IBM example.