Creating multiple variables in yacc

I am creating a compiler in yacc, but I cannot find a way to let the user create multiple variables with individual identifiers. Currently the user can assign a number to a word, but all words share the same value. The code I'm using is:
...
%{
float var=0;
%}
...
exp: NUMBER
| WORD { $$ = var; }
| exp '/' exp { $$ = $1 / $3; }
| ...

$$ = var assigns the value of var to the exp nonterminal, so the result is static: every WORD yields the same value.
If you want to parse a WORD and get its own value, you should use $$ = $1, where $1 is the semantic value of the first symbol of the rule (that is, the WORD token).
Is that what you intended to do? I'm not sure, since you've done it correctly for exp '/' exp.
EDIT: To store each word's value, I'd suggest using a table of floats.
You will need a counter to increment the table index, and you should take care that the values of the different words are stored in matching order.
EDIT2:
(I don't know if it will compile as is.)
I think it would look like:
exp: NUMBER
| variable AFFECT exp { $$ = $3; names[ctr] = strdup($1); values[ctr] = $3; ctr++; }
| variable { $$ = lookupVar($1); }
And define lookupVar to search the table for the string $1.

Your code seems to be similar to the mfcalc example in the bison manual.
The mfcalc example will probably provide useful information, even if it isn't fully identical to your purpose.
mfcalc has a symbol table to keep the names of VAR (which probably corresponds to WORD in your code).
Actually, mfcalc performs symbol-name lookup during lexical analysis and assigns a pointer to the symbol record as the semantic value of VAR.
In the bison source code, that semantic value can then be referred to simply as $1->value.var.
Hope this helps

Is it possible to interpolate Array values in token?

I'm working on homoglyphs module and I have to build regular expression that can find homoglyphed text corresponding to ASCII equivalent.
So for example I have character with no homoglyph alternatives:
my $f = 'f';
and character that can be obfuscated:
my @o = 'o', 'о', 'ο'; # ASCII o, Cyrillic o, Greek omicron
I can easily build regular expression that will detect homoglyphed phrase 'foo':
say 'Suspicious!' if $text ~~ / $f @o @o /;
But how should I compose such regular expression if I don't know the value to detect in compile time? Let's say I want to detect phishing that contains homoglyphed 'cash' word in messages. I can build sequence with all the alternatives:
my @lookup = ['c', 'с', 'ϲ', 'ς'], ['a', 'а', 'α'], 's', 'h'; # arbitrary runtime length
Now obviously following solution cannot "unpack" array elements into the regular expression:
/ @lookup / # doing LTM, not searching elements in sequence
I can work around this by manually quoting each element and composing a textual representation of the alternatives, yielding a string that can be evaluated as a regular expression, and then building a token from it using string interpolation:
my $regexp-ish = textualize( @lookup ); # string "[ 'c' | 'с' | 'ϲ' | 'ς' ] [ 'a' | 'а' | 'α' ] 's' 'h'"
my $token = token { <$regexp-ish> }
But that is quite error-prone.
Is there any cleaner solution to compose regular expression on the fly from arbitrary amount of elements not known at compile time?
The Unicode::Security module implements confusables by using the Unicode consortium tables. It's actually not using regular expressions, just looking up different characters in those tables.
I'm not sure this is the best approach to use.
I haven't implemented a confusables¹ module yet in Intl::, though I do plan on getting around to it eventually. Here are two different ways I could imagine such a token looking.²
my token confusable($source) {
:my $i = 0; # create a counter var
[
<?{ # succeed only if
my $a = self.orig.substr: self.pos+$i, 1; # the test character A
my $b = $source.substr: $i++, 1; # the source character B and
so $a eq $b # are the same or
|| $a eq %*confusables{$b}.any; # the A is one of B's confusables
}>
. # because we succeeded, consume a char
] ** {$source.chars} # repeat for each grapheme in the source
}
Here I used the dynamic hash %*confusables, which would be populated in some way; that will depend on your module, and it may not even need to be dynamic (for example, it could come in via a signature like :($source, %confusables) or via a module variable).
You can then have your code work as follows:
say $foo ~~ /<confusable: 'foo'>/
This is probably the best way to go about things, as it will give you a lot more control. I took a peek at your module, and it's clear you want to enable 2-to-1 glyph relationships; eventually you'll probably want to be running code directly over the characters.
If you are okay with just 1-to-1 relationships, you can go with a much simpler token:
my token confusable($source) {
:my @chars = $source.comb; # split the source
@( # match the array based on
|( # a slip of
%confusables{@chars.head} # the confusables
// Empty # (or nothing, if none)
), #
@chars.shift # and the char itself
) #
** {$source.chars} # repeating for each source char
}
The @(…) structure lets you effectively create an ad hoc array to be interpolated. In this case, we just slip in the confusables along with the original character, and that's that. You have to be careful, though, because a non-existent hash item returns the type object (Any), which messes things up here (hence the // Empty).
In either case, you'll want to use arguments with your token, as constructing regexes on the fly is fraught with potential gotchas and interpolation errors.
¹ Unicode calls homographs both "visually similar characters" and "confusables".
² The dynamic hash here, %*confusables, could be populated any number of ways, and may not necessarily need to be dynamic: it could be populated via the arguments (using a signature like :($source, %confusables)) or by referencing a module variable.

Split a BibTeX author field into parts

I am trying to parse a BibTeX author field using the following grammar:
use v6;
use Grammar::Tracer;
# Extract BibTeX author parts from string. The parts are separated
# by a comma and optional space around the comma
grammar Author {
token TOP {
<all-text>
}
token all-text {
[<author-part> [[\s* ',' \s*] || [\s* $]]]+
}
token author-part {
[<-[\s,]> || [\s* <!before ','>]]+
}
}
my $str = "Rockhold, Mark L";
my $result = Author.parse( $str );
say $result;
Output:
TOP
| all-text
| | author-part
| | * MATCH "Rockhold"
| | author-part
But here the program hangs, and I have to press CTRL-C to abort.
I suspect the problem is related to the negative lookahead assertion. I tried to remove it, and then the program does not hang anymore, but then I am also not able to extract the last part "Mark L" with an internal space.
Note that for debugging purposes, the Author grammar above is a simplified version of the one used in my actual program.
The expression [\s* <!before ','>] may not make any progress. Since it's in a quantifier, it will be retried again and again (but not move forward), resulting in the hang observed.
Such a construct will reliably hang at the end of the string; doing [\s* <!before ',' || $>] fixes it by making the lookahead fail at the end of the string also (being at the end of the string is a valid way to not be before a ,).
At least for this simple example, it looks like the whole author-part token could just be <-[,]>+, but perhaps that's an oversimplification for the real problem that this was reduced from.
Glancing at all-text, I'd also point out the % quantifier modifier, which makes matching comma-separated (or anything-separated, really) things easier.

yacc lex when parsing CNC GCODES

I have to parse motion control programs (CNC machines, GCODE)
It is GCODE plus similar looking code specific to hardware.
There are lots of commands that consist of a single letter and number, example:
C100Z0.5C100Z-0.5
C80Z0.5C80Z-0.5
So part of my (abbreviated) lex (racc & rex, actually) looks like:
A {[:A,text]}
B {[:B,text]}
...
Z {[:Z,text]}
So I find a command that takes ANY letter as an argument, and in racc started typing:
letter : A
| B
| C
......
Then I stopped. I haven't used yacc in 30 years; is there some kind of shortcut for the above? Have I gone horribly off course?
It is not clear what you are trying to accomplish. If you want to create a Yacc rule that covers all letters, you could create a token for that:
%token letter_token
In lex you would match each letter with a regular expression and simply return letter_token:
[A-Za-z] {
return letter_token;
}
Now you can use letter_token in Yacc rules:
letter : letter_token
Also, you haven't said which language you're using. But if you need it, you can get the specific character associated with letter_token by defining a union:
%union {
char c;
}
%token <c> letter_token
Let's say you want to read single characters. The lex part assigning the character to the token would then be:
[A-Z] {
yylval.c = *yytext;
return letter_token;
}
Feel free to ask any further questions, and read more here about How to create a Minimal, Complete, and Verifiable example.

YACC or Bison Action Variables positional max value

In YACC and other yacc-like programs there are positional action variables ($1, $2, ...) for the currently parsed group of symbols. I might want to process CSV input in which the number of columns changes for unknown reasons. With my rules, quoted_strings and numbers can each match one or more instances.
rule : DATE_TOKEN QUOTED_NUMBERS q_string numbers { printf(..... $1,$2....}
q_string
: QUOTED_STRING
| QUOTED_STRING q_string
;
numbers
: number numbers
| number
;
number
: INT_VALUE
| FLOAT_VALUE
;
Actions can be added to do things with whatever has been parsed, as in
{ printf("%s %s %s \n",$<string>1, $<string>1, $<string>1); }
Is there a runtime macro, construct, or variable that tells me how many tokens have been read, so that I can write a loop to print all the token values?
What is $max?
The $n variables in a bison action refer to right-hand side symbols, not to tokens. If the corresponding rhs object is a non-terminal, $n refers to that non-terminal's semantic value, which was set by assigning to $$ in the semantic action of that nonterminal.
So if there are five symbols on the right-hand side of a rule, then you can use $1 through $5. There is no variable notation which allows you to refer to the "nth" symbol.

Bison/Yacc, make literal token return its own value?

Below is my rule; when I replace $2 with '=', my code works. I know that by default all literal tokens use their ASCII value (hence why multi-character tokens require a definition).
The code below doesn't work: the function is called with 0 instead of '=' as I expect. Is there an option I can set? (It doesn't appear so from the man pages.)
AssignExpr: var '=' rval { $$ = func($1, $2, $3); }
In another piece of code I have MathOp: '=' | '+' | '%' ..., hence my interest.
The value for $2 in this context will be whatever the yylex function put into the global variable yylval before it returned the token '='. If the lexer doesn't put anything into yylval, it will probably still be 0, as you're seeing.
I think you are right; Bison just doesn't work that way.
You can easily fix it, of course:
- declare a token for =, recognize it in your lexer, and return its semantic value, or
- declare a production for it and return it with $$, or
- assign '=' to yylval in yylex()