YACC or Bison Action Variables positional max value - yacc

In YACC and other yacc-like programs, there are positional action variables for the currently parsed group of symbols. I might want to process CSV input in which the number of columns changes for unknown reasons. With my rules, quoted_strings and numbers can each match one or more instances.
rule : DATE_TOKEN QUOTED_NUMBERS q_string numbers { printf(..... $1,$2....}
q_string
: QUOTED_STRING
| QUOTED_STRING q_string
;
numbers
: number numbers
| number
;
number
: INT_VALUE
| FLOAT_VALUE
;
Actions can be added to do things with whatever has been parsed, as in:
{ printf("%s %s %s \n", $<string>1, $<string>2, $<string>3); }
Is there a runtime macro, construct, or variable that tells me how many symbols have been matched, so that I can write a loop to print all their values?
What is $max?

The $n variables in a bison action refer to right-hand-side symbols, not to tokens. If the corresponding right-hand-side symbol is a non-terminal, $n refers to that non-terminal's semantic value, which was set by assigning to $$ in the semantic action of that non-terminal.
So if there are five symbols on the right-hand side of a rule, you can use $1 through $5. There is no variable notation that lets you refer to the "nth" symbol.
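The usual workaround is to have the recursive rules collect their matches into a data structure through $$ and iterate over that in the outer action. A minimal sketch, assuming hypothetical list_new/list_push helpers and a suitable %union declaration (none of which appear in the question):

```yacc
numbers
    : number           { $$ = list_new(); list_push($$, $1); }
    | numbers number   { list_push($1, $2); $$ = $1; }  /* left recursion is preferred in yacc */
    ;

rule
    : DATE_TOKEN QUOTED_NUMBERS q_string numbers
      {
          /* loop over however many numbers were actually matched */
          for (int i = 0; i < $4->len; i++)
              printf("%f\n", $4->vals[i]);
      }
    ;
```

Note that the recursion was flipped to left-recursive (numbers number) so the parser stack stays bounded; the action then appends each new element to the growing list.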

Related

Is it possible to interpolate Array values in token?

I'm working on a homoglyphs module and I have to build a regular expression that can find homoglyphed text corresponding to an ASCII equivalent.
So for example I have character with no homoglyph alternatives:
my $f = 'f';
and character that can be obfuscated:
my @o = 'o', 'о', 'ο'; # ASCII o, Cyrillic o, Greek omicron
I can easily build regular expression that will detect homoglyphed phrase 'foo':
say 'Suspicious!' if $text ~~ / $f @o @o /;
But how should I compose such regular expression if I don't know the value to detect in compile time? Let's say I want to detect phishing that contains homoglyphed 'cash' word in messages. I can build sequence with all the alternatives:
my @lookup = ['c', 'с', 'ϲ', 'ς'], ['a', 'а', 'α'], 's', 'h'; # arbitrary runtime length
Now obviously following solution cannot "unpack" array elements into the regular expression:
/ @lookup / # doing LTM, not searching elements in sequence
I can workaround this by manually quoting each element and compose text representation of alternatives to get string that can be evaluated as regular expression. And build token from that using string interpolation:
my $regexp-ish = textualize( @lookup ); # string "[ 'c' | 'с' | 'ϲ' | 'ς' ] [ 'a' | 'а' | 'α' ] 's' 'h'"
my $token = token { <$regexp-ish> }
But that is quite error-prone.
Is there any cleaner solution to compose regular expression on the fly from arbitrary amount of elements not known at compile time?
The Unicode::Security module implements confusables by using the Unicode consortium tables. It's actually not using regular expressions, just looking up different characters in those tables.
I'm not sure this is the best approach to use.
I haven't implemented a confusables[1] module yet in Intl::, though I do plan on getting around to it eventually. Here are two different ways I could imagine such a token looking.[2]
my token confusable($source) {
    :my $i = 0;                                        # create a counter var
    [
        <?{                                            # succeed only if
            my $a = self.orig.substr: self.pos + $i, 1;  # the test character A
            my $b = $source.substr: $i++, 1;           # the source character B and
            so $a eq $b                                # they are the same or
                || $a eq %*confusables{$b}.any;        # A is one of B's confusables
        }>
        .                                              # because we succeeded, consume a char
    ] ** {$source.chars}                               # repeat for each grapheme in the source
}
Here I used the dynamic hash %*confusables, which would be populated in some way. How will depend on your module; it may not even need to be dynamic (for example, it could come in through a signature like :($source, %confusables), or from a module variable).
You can then have your code work as follows:
say $foo ~~ /<confusable: 'foo'>/
This is probably the best way to go about things, as it will give you a lot more control. I took a peek at your module, and it's clear you want to enable 2-to-1 glyph relationships; eventually you'll probably want to be running code directly over the characters.
If you are okay with just 1-to-1 relationships, you can go with a much simpler token:
my token confusable($source) {
    :my @chars = $source.comb;          # split the source
    @(                                  # match the array based on
        |(                              # a slip of
            %confusables{@chars.head}   #   the confusables
                // Empty                #   (or nothing, if none)
        ),                              #
        @chars.shift                    # and the char itself
    )                                   #
    ** {$source.chars}                  # repeating for each source char
}
The @(…) structure lets you effectively create an ad hoc array to be interpolated. In this case, we just slip in the confusables with the original, and that's that. You have to be careful, though, because a non-existent hash item will return the type object (Any), and that messes things up here (hence the // Empty).
In either case, you'll want to use arguments with your token, as constructing regexes on the fly is fraught with potential gotchas and interpolation errors.
[1] Unicode calls homographs both "visually similar characters" and "confusables".
[2] The dynamic hash %*confusables here could be populated any number of ways, and may not necessarily need to be dynamic, as it could be populated via the arguments (using a signature like :($source, %confusables)) or by referencing a module variable.

Antlr4 grammar - Allow variable names with spaces

I am new to ANTLR and I want to write a compiler for a custom programming language that has variable names with spaces. Following is some sample code:
SET Variable with a Long Name TO FALSE
SET Variable with Numbers 1 2 3 in the Name TO 3 JUN 1990
SET Variable with Symbols # %^& TO "A very long text string"
Variable rules:
Can contain white spaces
Can contain special symbols
I want to write the compiler in javascript. Following is my grammar:
grammar Foo;
compilationUnit: stmt*;
stmt:
assignStmt
| invocationStmt
;
assignStmt: SET ID TO expr;
invocationStmt: name=ID ((expr COMMA)* expr)?;
expr: ID | INT | STRING;
COMMA: ',';
SAY: 'say';
SET: 'set';
TO: 'to';
INT: [0-9]+;
STRING: '"' (~('\n' | '"'))* '"';
ID: [a-zA-Z_] [ a-zA-Z0-9_]*;
WS: [ \n\t\r]+ -> skip;
I tried supplying input source code as:
"set variable one to 1".
But got the error "Undefined token identifier".
Any help is greatly appreciated.
ID: [a-zA-Z_] [ a-zA-Z0-9_]*;
will match the entire input "set variable one to 1" as a single token. Like most lexical analysers, ANTLR's lexers greedily match as much as they can, so set doesn't get matched as a keyword even though it has a specific pattern. (And even if you managed that, "variable one to 1" would match as the next token; the match doesn't stop just because to happens to appear.)
The best way to handle multi-word variable names is to treat them as multiple words. That is, recognise each word as a separate token, and recognise an identifier as a sequence of words. That has the consequence that two words and two words end up being the same identifier, but IMHO, that's a feature, not a bug.
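Applied to the grammar in the question, that idea could be sketched roughly as follows (the ident rule name is illustrative, and the space has been removed from ID's character set so each word is its own token):

```antlr
assignStmt : SET ident TO expr ;
ident      : ID+ ;                       // an identifier is a sequence of words
ID         : [a-zA-Z_] [a-zA-Z0-9_]* ;   // one word, no embedded spaces
WS         : [ \n\t\r]+ -> skip ;
```

Because the keyword rules SET and TO are defined before ID and match the same text length, "set variable one to 1" now tokenizes as SET ID ID TO INT, and the ID+ loop stops naturally when the TO token appears.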

antlr4: need to convert sequences of symbols to characters in lexer

I am writing a parser for the Wolfram Language. The language has a concept of "named characters", which are specified by a name delimited by \[ and ]. For example: \[Pi].
Suppose I want to specify a regular expression for an identifier. Identifiers can include named characters. I see two ways to do it: one is to have a preprocessor that converts all named characters to their Unicode representation; the other is to enumerate all possible named characters in their source form as part of the regular expression.
The second approach does not seem feasible because there are a lot of named characters. I would prefer to have ranges of unicode characters in my regex.
So I want to preprocess my token stream. In other words, it seems to me that the lexer needs to check if the named characters syntax is correct and then look up the name and convert it to unicode.
But if the syntax is incorrect or the name does not exist I need to tell the user about it. How do I propagate this error to the user and yet let antlr4 recover from the error and resume? Maybe I can sort of "pipe" lexers/parsers? (I am new to antlr).
EDIT:
In Wolfram Language I can have this string as an identifier: \[Pi]Squared. The part between brackets is called "named character". There is a limited set of named characters, each of which corresponds to a unicode code point. I am trying to figure out how to tokenize identifiers like this.
I could have a rule for my token like this (simplified to just a combination of named characters and ASCII characters):
NAME : ('\\[' [a-z]+ ']'|[a-zA-Z])+ ;
but I would like to check if the named character actually exists (and other attributes such as if it is a letter, but the latter part is outside of the scope of the question), so this regex won't work.
I considered making a list of allowed named characters and just making a long regex that enumerates all of them, but this seems ugly.
What would be a good approach to this?
END OF EDIT
A common approach is to write the lexer/parser to allow syntactically correct input and defer semantic issues to the analysis of the generated parse tree. In this case, the lexer can naively accept named characters:
NChar : NCBeg .*? RBrack ;
fragment NCBeg : '\\[' ;
fragment LBrack: '[' ;
fragment RBrack: ']' ;
Update
In the parser, allow the NChar's to exist in the parse tree as discrete terminal nodes:
idents : ident+ ;
ident  : NChar    // named character string
       | ID       // simple character string?
       | Literal  // something quoted?
       | ....
       ;
This makes analysis of the parse tree considerably easier: each ident context will contain only one non-null value for a discretely identifiable alt; and isolates analysis of all ordering issues to the idents context.
Update2
For an input \[Pi]Squared, the parse-tree form that would be easiest to analyze is an idents node with two well-ordered children, \[Pi] and Squared.
Best practice would be not to pack both children into the same token; you would just have to manually break the token text into the two parts later to check whether it contains a valid named character and whether the particular sequence of parts is allowable.
No regex is going to allow conclusive verification of the named characters. That will require a list. Tightening the lexer definition of an NChar can, however, achieve a result equivalent to a regex:
NChar : NCBeg [A-Z][A-Za-z]+ RBrack ;
If the concern is that there might be a space after the named character, consider that this circumstance is likely better treated with a semantic warning as opposed to a syntactic error. Rather than skipping whitespace in the lexer, put the whitespace on the hidden channel. Then, in the verification analysis of each idents context, check the hidden channel for intervening whitespace and issue a warning as appropriate.
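In ANTLR4 syntax, routing whitespace to the hidden channel instead of skipping it is a one-line change to the WS rule:

```antlr
WS : [ \n\t\r]+ -> channel(HIDDEN) ;   // keep whitespace available to later analysis
```

A consumer of the token stream can then inspect the surrounding whitespace, for example via BufferedTokenStream's getHiddenTokensToLeft/getHiddenTokensToRight methods, while the parser itself never sees those tokens.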
----
A parse-tree visitor can then examine, validate, and warn as appropriate regarding unknown or misspelled named characters.
To do the validation in the parser, if more desirable, use a predicated rule to distinguish known from unknown named characters:
@members {
    ArrayList<String> keyList = .... // list of named chars
    public boolean inList(String id) {
        return keyList.contains(id);
    }
}
nChar : known
| unknown
;
known : NChar { inList($NChar.getText()) }? ;
unknown : NChar { error("Unknown " + $NChar.getText()); } ;
The inList function could implement a distance metric to detect misspellings, but correcting the text directly in the parse-tree is a bit complex. Easier to do when implemented as a parse-tree decoration during a visitor operation.
Finally, a scrape and munge of the named characters into a usable map (both unicode and ascii) is likely worthwhile to handle both representations as well as conversions and misspelling.

antlr3 - read closure value to a variable

I would like to parse and read a closure value in a simple text line like this:
1 !something
line
    : (NUMBER EXCLAMATION myText=~('\r\n')*)
      { myFunction($myText.text); }
    ;

NUMBER
    : '0'..'9'+
    ;

EXCLAMATION
    : '!'
    ;
What I get in the myText variable is just the final 'g' of 'something', because, as can be seen in the generated code, myText is rewritten in a while loop for each occurrence of ~('\r\n').
My question is: is there any elegant way to read the whole 'something' value into the variable myText?
TIA
Inside parser rules, the ~ does not negate characters, but tokens. So ~('\r\n') would match any token other than the literal '\r\n' token (in your example, that would be a NUMBER or EXCLAMATION).
The lexer cannot be "driven" by the parser: after the parser matched a NUMBER and a EXCLAMATION, you can't tell the lexer to produce some other tokens than it has previously done. The lexer will always produce tokens based on some simple rules, regardless of what the parser "needs".
In other words: you can't handle this in the parser.
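A common workaround (a sketch, not from the original answer; the BANG_LINE rule name is invented) is to move the rest-of-line match into the lexer, anchoring it on the '!' so it cannot swallow the leading NUMBER:

```antlr
line
    : NUMBER text=BANG_LINE { myFunction($text.text.substring(1)); }  // drop the leading '!'
    ;

NUMBER
    : '0'..'9'+
    ;

BANG_LINE
    : '!' ~('\r' | '\n')*   // '!' plus everything up to the end of the line
    ;
```

Because BANG_LINE must start with '!', the input "1 !something" still lexes the 1 as a NUMBER, and the action receives "!something" as a single token whose leading '!' is stripped off.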

Creating multiple variables yacc

I am creating a compiler in yacc but I cannot find a way to allow the user to create multiple variables with individual identifiers. Currently they can assign a number to a word, but all words have the same value. The code I'm using is:
...
%{
float var=0;
%}
...
exp: NUMBER
| WORD { $$ = var; }
| exp '/' exp { $$ = $1 / $3; }
| ...
$$ = var assigns the value of the single global var to the exp, so it is static: every WORD produces the same value.
If you want to parse some WORD and get its own value, you should use $$ = $1, where $1 is the semantic value of the first symbol of your rule (id est, the WORD token).
Is that what you intended to do? I'm not sure, since you've done it right for exp '/' exp.
EDIT: To store each word in a variable, I'd suggest using a table of floats. You will need a counter to increment the table index, but you should take care that the different words' values are stored in matching order.
EDIT2:
(Don't know if it will compile as is)
I think it would look like:
exp : NUMBER
    | variable AFFECT exp { $$ = $3; names[ctr] = $1; values[ctr] = $3; ctr++; }
    | variable            { $$ = lookupVar($1); }
    ;
And define lookupVar to look for the string $1 within the table
Your code seems to be similar to the mfcalc sample in the bison manual. The mfcalc sample will probably provide useful information, even if it isn't fully identical to your purpose.
mfcalc has a symbol table in order to keep the names of VAR (which probably corresponds to WORD in your code). Actually, mfcalc performs the symbol-name look-up during lexical analysis and assigns a pointer to the symbol record to the semantic value of VAR. In the bison source code, the semantic value can then be referred to simply as $1->value.var.
Hope this helps