Order of precedence for token matching in Flex

My apologies if the title of this thread is a little confusing. What I'm asking is: how does Flex (the lexical analyzer) handle precedence between patterns?
For example, let's say I have two tokens with similar regular expressions, written in the following order:
"//"[!\/]{1} return FIRST;
"//"[!\/]{1}\< return SECOND;
Given the input "//!<", will FIRST or SECOND be returned? Or both? The FIRST string would be reached before the SECOND string, but it seems that returning SECOND would be the right behavior.

The longest match is returned.
From flex & bison, Text Processing Tools:
How Flex Handles Ambiguous Patterns
Most flex programs are quite ambiguous, with multiple patterns that can match
the same input. Flex resolves the ambiguity with two simple rules:
Match the longest possible string every time the scanner matches input.
In the case of a tie, use the pattern that appears first in the program.
You can test this yourself, of course:
file: demo.l
%%
"//"[!/] {printf("FIRST");}
"//"[!/]< {printf("SECOND");}
%%
int main(int argc, char **argv)
{
    while (yylex() != 0);
    return 0;
}
Note that / and < don't need escaping, and {1} is redundant.
bart@hades:~/Programming/GNU-Flex-Bison/demo$ flex demo.l
bart@hades:~/Programming/GNU-Flex-Bison/demo$ cc lex.yy.c -lfl
bart@hades:~/Programming/GNU-Flex-Bison/demo$ ./a.out < in.txt
SECOND
where in.txt contains //!<.
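The two disambiguation rules can also be sketched outside of flex. The following is a minimal Python illustration, not how flex is implemented (flex compiles all patterns into a single automaton); the RULES table and the next_token helper are hypothetical names chosen for this example.

```python
import re

# Rules in the same order as in demo.l; the names are illustrative only.
RULES = [
    ("FIRST", re.compile(r"//[!/]")),
    ("SECOND", re.compile(r"//[!/]<")),
]

def next_token(text, pos=0):
    """Return (rule_name, lexeme) for the longest match at pos, or None."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # Only a strictly longer match replaces the current best,
        # so on a tie the rule listed first is kept.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

print(next_token("//!<"))  # SECOND matches 4 characters, FIRST only 3
print(next_token("//!x"))  # only FIRST matches here
```

With the input //!< the longer SECOND match wins even though FIRST is listed earlier; rule order only matters when two matches have the same length.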

Related

Partial Match in a Grammar

I have a simple grammar, and I am using it to parse some text. The text is user input, but my program guarantees that it starts with a match to the grammar (i.e., if my grammar matched only a, the text might be abc or a or a_). However, when I use the .parse method on my grammar, it fails on any non-exact match. How can I perform a partial match?
In Raku, Grammar.parse has to match the whole string. This is what causes it to fail if your grammar would only match a in the string abc. To allow matching only part of the input string, you can use Grammar.subparse instead.
grammar Foo {
token TOP { 'a' }
}
my $string = 'abc';
say Foo.parse($string); # Nil
say Foo.subparse($string); # 「a」
The input string will need to start with the potential Match. Otherwise, you will get a failed match.
say Foo.subparse('cbacb'); # #<failed match>
You can work around this using a Capture marker.
grammar Bar {
token TOP {
<-[a]>* # Match 0 or more characters that are *not* a
<( 'a' # Start the match, and match a single 'a'
}
}
say Bar.parse('a'); # 「a」
say Bar.subparse('a'); # 「a」
say Bar.parse('abc'); # Nil
say Bar.subparse('abc'); # 「a」
say Bar.parse('cbabc'); # Nil
say Bar.subparse('cbabc'); # 「a」
This works because <-[a]>*, a character class that includes any character except the letter a, will consume all the characters before a potential a. However, the Capture marker will cause these to be dropped from the eventual Match object, leaving you with just the a you wanted to match.
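For readers more familiar with conventional regex libraries, the distinction can be sketched with Python's re module. This is only a loose analogy and not Raku: re.fullmatch plays the role of .parse, re.match the role of .subparse, and a capture group stands in for the capture marker. The helper names are hypothetical.

```python
import re

def parse(pattern, text):
    """Whole-string match, loosely like Grammar.parse."""
    m = re.fullmatch(pattern, text)
    return m.group() if m else None

def subparse(pattern, text):
    """Prefix match, loosely like Grammar.subparse."""
    m = re.match(pattern, text)
    return m.group() if m else None

def skip_then_capture(text):
    """Consume leading non-'a' characters but report only the 'a'."""
    m = re.match(r"[^a]*(a)", text)
    return m.group(1) if m else None

print(parse("a", "abc"))           # None: the whole string must match
print(subparse("a", "abc"))        # a
print(subparse("a", "cbacb"))      # None: the match must start at position 0
print(skip_then_capture("cbabc"))  # a
```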
TL;DR
grammar foo { token TOP { a* } }
# Partial match anchored at start of string:
say .subparse: 'abcaa' given foo; # 「a」
# Partial match anchored to end of string:
say 'abcaa' ~~ / <.foo::TOP> $ /; # 「aa」
# Longest partial match, no anchoring:
say ('abcaaabcaabc' ~~ m:g/ <.foo::TOP> /).max(*.chars); # 「aaa」
Vocabulary
There are traditionally two takes on the general notion of text "matching":
"Parsing"
"Regexes"
Raku:
Provides a unified text pattern language and engine that do both jobs.
Makes it easy to stick to one perspective, or other, or blend them, or refactor between them, as suits an individual dev and/or individual use case.
Takes "parsing" to mean more or less a single match starting at the start of the input string whereas "regexes" are much more flexible.
What you've written in your question and your first comment on Tyil's answer reflects the inherent ambiguity of the topic. I'll provide two answers rather than one to try to help you and/or other readers be clearer about Raku's use of vocabulary and your options functionality-wise.
Limited "partial matching" via .parse et al
You began with:
Partial match in a grammar ... I have a simple grammar ... my program guarantees that it starts with a match to the grammar
With that in mind, here's your question:
How can I perform a partial match?
The phrases "guarantees that it starts" and "partial match" are ambiguous.
One take is that you want what I'll call a "prefix" match, matching one or more characters anchored from the start of the string, and not merely any sub-string starting and ending anywhere in the input string.
This nicely fits with "parsing", or at least Raku's use of the word in its grammar methods.
All the built in Grammar methods with parse in their name insert an anchor to the start of the string in whatever grammar rule they use to start the parsing process. You cannot remove that anchor. This reflects the choice of vocabulary; "parse" is taken to mean matching from the start no matter what else happens.
The parse method for this "prefix" scenario is .subparse:
grammar foo { token TOP { a* } }
# Partial match anchored at start of string:
say .subparse: 'abcaa' given foo; # 「a」
See also:
Search of SO for "[raku] subparse".
raku doc for .subparse.
But perhaps "guarantees that it starts" and "partial match" did not mean that you wanted anchoring at the start. Your comment on Tyil's answer highlights this ambiguity:
Will .subparse only match at the start, or match anywhere in the string?
Tyil provides a workaround. You can do what Tyil shows, but it'll only match if the very first a encountered in the input string is the one that's at the start of the sub-string you want your "parse" to match.
If instead the first a was a false positive, and there was a second or a subsequent a you wanted the "parse" match to start at, then, at least in the Raku world, it's helpful to call that "regexing" rather than "parsing" and to use "regex" matching via the ~~ smartmatch operator.
Unlimited "partial matching" via ~~
Raku lets you do unlimited partial matching if you use its ~~ construct with a regex.
For example, you could write:
# End of match at end of string:
#                         ↓
say 'abcaa' ~~ token { a* $ } # 「aa」
~~ with a regex tells Raku to:
Try to match starting at the first character position in the string on the LHS;
If that fails, step forward one character, and try again, with the new position in the input string treated as a fresh starting point;
Repeat that until either matching once, or failing to find any match in the entire string.
Here I've left the start position of the match unspecified (which ~~ takes to mean it can be anywhere in the string) and anchored the end of the pattern to the end of the input string. So it successfully matches the aa at the end of the string.
This anchoring freedom illustrates just one of the many ways that ~~ smart matching provides much greater matching flexibility than using the parse methods.
If you have an existing grammar you can still use that:
grammar foo { token TOP { a* } }
# Anchor matching to end of string:
#                            ↓
say 'abcaa' ~~ / <.foo::TOP> $ /; # 「aa」
You have to name both the grammar and the rule within it you wish to invoke and put them inside <...>. And you need to insert a . to avoid a correspondingly named sub-capture, presuming you don't want that.
Here's another example:
# Longest partial match, no anchoring:
say ('abcaaabcaabc' ~~ m:g/ <.foo::TOP> /).max(*.chars); # 「aaa」
"Parsing" in Raku always starts at the beginning of an input string and results in either no match or one match.
In contrast, a "regex" can match arbitrary fragments, and can match any number of fragments. (You can even match overlapping fragments.)
In my last example I used :g, which is short for :global, which is a well known feature among traditional regex engines. :g matches as many times as a match is found in the input string (but not overlapping).
The match operation then returns either Nil (no matches at all) or a list of match objects (one or more). I've applied a .max(*.chars) to yield the longest match (the first if there are multiple longest sub-strings).
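As a rough cross-check of the two regex-style examples, here is the same idea in Python's re module. It is an analogy under the assumption that re.search approximates ~~ with a regex and re.finditer approximates :g; the function names are hypothetical and this is not Raku semantics in every detail.

```python
import re

def end_anchored(s):
    """Scan forward for the first start position where a*$ matches."""
    m = re.search(r"a*$", s)
    return m.group() if m else None

def longest(s):
    """Collect all non-overlapping matches of a* and keep the longest one."""
    # max() returns the first item among equally long candidates.
    return max((m.group() for m in re.finditer(r"a*", s)), key=len)

print(end_anchored("abcaa"))    # aa
print(longest("abcaaabcaabc"))  # aaa
```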

yacc lex when parsing CNC GCODES

I have to parse motion control programs (CNC machines, GCODE)
It is GCODE plus similar looking code specific to hardware.
There are lots of commands that consist of a single letter and number, example:
C100Z0.5C100Z-0.5
C80Z0.5C80Z-0.5
So part of my (abbreviated) lex (racc & rex actually) looks like:
A {[:A,text]}
B {[:B,text]}
...
Z {[:Z,text]}
So I find a command that takes ANY letter as an argument, and in racc started typing:
letter : A
| B
| C
......
Then I stopped. I haven't used yacc in 30 years; is there some kind of shortcut for the above? Have I gone horribly off course?
It is not clear what you are trying to accomplish. If you want to create a Yacc rule that covers all letters, you could create a token for that:
%token letter_token
In lex you would match each letter with a regular expression and simply return letter_token:
Regex for letters {
return letter_token;
}
Now you can use letter_token in Yacc rules:
letter : letter_token
Also, you haven't said what language you're using. But if you need to, you can get the specific character that was matched for letter_token by defining a union:
%union {
char c;
}
%token <c> letter_token
Let's say you want to read single characters. The Lex part that assigns the character to the token would be:
[A-Z] {
yylval.c = *yytext;
return letter_token;
}
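The same idea, one token type for every letter with the actual character carried as the token's value, can be sketched in Python. This is an illustration only: the LETTER and NUMBER token names, the number syntax, and the tokenize helper are assumptions, not part of racc, rex, or the question's grammar.

```python
import re

# One generic LETTER token instead of 26 separate ones; NUMBER is assumed
# to be an optionally signed decimal such as 100, 0.5 or -0.5.
TOKEN_RE = re.compile(r"(?P<LETTER>[A-Z])|(?P<NUMBER>-?\d+(?:\.\d+)?)")

def tokenize(line):
    """Split a line into (token_type, text) pairs."""
    tokens = []
    pos = 0
    while pos < len(line):
        m = TOKEN_RE.match(line, pos)
        if m is None:
            raise ValueError(f"unexpected character {line[pos]!r} at {pos}")
        # lastgroup is the name of the alternative that matched.
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize("C100Z-0.5"))
```

The parser then needs only a single rule for "any letter" and can inspect the token's text when a command cares which letter it received.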
Feel free to ask any further questions, and read more here about How to create a Minimal, Complete, and Verifiable example.

Bison Syntax Error easy file

I'm trying to run this .y file:
%{
#include <stdlib.h>
#include <stdio.h>
int yylex();
int yyerror();
%}
%start BEGIN
%%
BEGIN: 'a' | BEGIN 'a'
%%
int yylex(){
    return getchar();
}
int yyerror(char* s){
    fprintf(stderr, "*** ERROR: %s\n", s);
    return 0;
}
int main(int argn, char **argv){
    yyparse();
    return 0;
}
It's a simple bison program and the syntax seems correct to me, but I always get a syntax error ...
Thanks for your help.
The lexer function yylex needs to return 0 to indicate the end of the input. However, your implementation simply passes through the value returned by getchar, which will be EOF (normally -1).
Also, your input is almost certain to include a newline character, which will also be passed through to the parser.
Since the parser recognizes neither \n nor EOF, it produces an error when it receives one of them.
At a minimum, you would need to modify yylex to correctly respond to end of input:
int yylex(void) {
    int ch = getchar();
    return (ch == EOF) ? 0 : ch;
}
But you will still have to deal with newline characters, either by handling them in your lexer (possibly ignoring them or possibly returning an end of input indication), or by handling them in your grammar.
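The lexer contract described here can be modelled in a few lines of Python. This is a sketch of one possible policy (ignore newlines, report end of input as 0), not bison's API; make_yylex is a hypothetical helper and the input stream is an in-memory stand-in for stdin.

```python
import io

def make_yylex(stream):
    """Build a yylex-like function that reads one character per call."""
    def yylex():
        while True:
            ch = stream.read(1)
            if ch == "":    # end of input: a yacc-style parser expects 0 here
                return 0
            if ch == "\n":  # assumed policy for this sketch: skip newlines
                continue
            return ord(ch)  # any other character is its own token code
    return yylex

yylex = make_yylex(io.StringIO("aa\n"))
print([yylex(), yylex(), yylex()])  # two 'a' tokens, then end of input
```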
Note that bison/yacc-generated parsers always parse the entire input stream, not just the longest sequence satisfying the grammar. That can be adjusted with some work -- see the documentation for the YYACCEPT special action -- but the standard behaviour is usually what is desired when parsing.
By the way, please use standard style conventions in your bison/yacc grammars, in order to avoid problems and in order to avoid confusing readers. Normally we reserve UPPER_CASE for terminal symbols, since those are also used as compile-time constants in the lexer. Non-terminals are usually written in lower_case although some prefer to use CamelCase. For the terminals, you need to avoid the use of names reserved by the standard library (such as EOF) or by (f)lex (BEGIN) or bison/yacc (END). There are lists of reserved names in the manuals.

Does .parse anchor or :sigspace first in a Perl 6 rule?

I have two questions. Is the behavior I show correct, and if so, is it documented somewhere?
I was playing with the grammar TOP method. Declared as a rule, it implies beginning- and end-of-string anchors along with :sigspace:
grammar Number {
rule TOP { \d+ }
}
my @strings = '137', '137 ', ' 137 ';
for @strings -> $string {
my $result = Number.parse( $string );
given $result {
when Match { put "<$string> worked!" }
when Any { put "<$string> failed!" }
}
}
With no whitespace or trailing whitespace only, the string parses. With leading whitespace, it fails:
<137> worked!
<137 > worked!
< 137 > failed!
I figure this means that rule is applying :sigspace first and the anchors afterward:
grammar Foo {
regex TOP { ^ :sigspace \d+ $ }
}
I expected a rule to allow leading whitespace, which would happen if you switched the order:
grammar Foo {
regex TOP { :sigspace ^ \d+ $ }
}
I could add an explicit token in rule for the beginning of the string:
grammar Number {
rule TOP { ^ \d+ }
}
Now everything works:
<137> worked!
<137 > worked!
< 137 > worked!
I don't have any reason to think it should be one way or the other. The Grammars docs say two things happen, but the docs do not say which order these effects apply:
Note that if you're parsing with .parse method, token TOP is automatically anchored
and
When rule instead of token is used, any whitespace after an atom is turned into a non-capturing call to ws.
I think the answer is that the rule isn't actually anchored in the pattern sense. It's the way .parse works. The cursor has to start at position 0 and end at the last position in the string. That's something outside of the pattern.
The behavior is intended, and is a culmination of these language features:
Sigspace ignores whitespace before the first atom.
From the design docs¹ (S05: Regexes and Rules, line 348, emphasis added):
The new :s (:sigspace) modifier causes certain whitespace sequences to be considered "significant"; they are replaced by a whitespace matching rule, <.ws>. Only whitespace sequences immediately following a matching construct (atom, quantified atom, or assertion) are eligible. Initial whitespace is ignored at the front of any regex, to make it easy to write rules that can participate in longest-token-matching alternations. Trailing space inside the regex delimiters is significant.
This means:
rule TOP { \d+ }
              ^-------- <.ws> automatically inserted
rule TOP { ^ \d+ $ }
            ^---^-^---- <.ws> automatically inserted
Regexes are first-class compiled code with lexical scoping.
A regex/rule is not a string that may have characters concatenated to it later to change its behavior. It is a self-contained routine, which is parsed and has its behavior nailed down at compile time.
Regex modifiers like :sigspace, including the one implicitly added by the rule keyword, apply only to their lexical scope - i.e. to the fragment of source code they appear in at compile time. S05, line 629¹:
The :i, :m, :r, :s, :dba, :Perl5, and Unicode-level modifiers can be placed inside the regex (and are lexically scoped)
The anchoring of rule TOP is done at run time by .parse.
S05, line 4423¹:
The .parse and .parsefile methods anchor to the beginning and ending of the text, and fail if the end of text is not reached. (The TOP rule can check against $ itself if it wishes to produce its own error message.)
I.e. the anchoring to the beginning of the string is not intrinsic to the rule TOP, and doesn't affect how the lexical scope of TOP is parsed and compiled. It is done when method .parse is called.
It has to be this way, because the same grammar can be used with different starting rules instead of TOP, using .parse(..., rule => ...).
So when you write
rule TOP { \d+ }
it is compiled as
regex TOP { :r \d+ <.ws> }
And when you .parse that grammar, it effectively invokes the regex code ^ <TOP> $, with the anchors not being part of TOP's lexical scope but rather of a scope that merely calls the routine TOP. The combined behavior is as if the rule TOP had been written as:
regex TOP { ^ [:r :s \d+] $ }
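The observed results can be reproduced with a rough Python analogy. It assumes that <.ws> can be approximated by \s* (the real ws rule is subtler) and that re.fullmatch stands in for the outer anchoring done by .parse; the names IMPLICIT, EXPLICIT, and parses are hypothetical.

```python
import re

# rule TOP { \d+ }    ~  digits, then the <.ws> inserted after the atom
IMPLICIT = re.compile(r"\d+\s*")
# rule TOP { ^ \d+ }  ~  the space after the ^ atom also becomes <.ws>
EXPLICIT = re.compile(r"\s*\d+\s*")

def parses(pattern, s):
    """Anchor from the outside, the way .parse does, using fullmatch."""
    return pattern.fullmatch(s) is not None

for s in ["137", "137 ", " 137 "]:
    print(repr(s), parses(IMPLICIT, s), parses(EXPLICIT, s))
```

Only the pattern with an explicit leading atom gains a whitespace slot before the digits, which is why the leading-space string is accepted in the second case but not the first.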
1) The design docs are in general not to be taken as gospel for what is or isn't part of the Perl 6 language, but S05 is pretty accurate in that regard, except that it mentions some features that haven't been implemented yet but are planned. Anyone who wants to truly grok the intricacies of Perl 6 regexes/grammars, is IMO well served by reading the full S05 from top to bottom at least once.
There aren't two regex effects going on. The rule applies :sigspace. After that, the grammar is defined. When you call .parse, it starts at the beginning of the string and goes to the end (or fails). That anchoring isn't part of the grammar. It's part of how .parse applies the grammar.
My main issue was the odd way some of the things are worded in the docs. They aren't technically wrong, but they also tend to assume knowledge about things the reader might not know. In this case, the casual comment about anchoring TOP isn't as special as it seems. Any rule passed to .parse is anchored in the same way. There's no special behavior for that rule name other than it's the default value for :rule in a call to .parse.

Xcode - replace function with regex and two-digit capture group (back reference)

I would like to use the Xcode's find in project option to normalize the signatures of methods.
I wrote the find expression:
^\s*([+-])\s*\((\w+)\s*(\*?)\s*\)\s*(\w+)(\s*(:)\s*(\()\s*(\w+)\s*(\*?)\s*(\))\s*(\w+))?
and the replacement expression:
\1 \(\2\3\)\4\6\7\8\9\10\11
The test string is:
+(NSString *) testFunction : (NSInteger ) arg1
and the desired result:
+ (NSString*)testFunction:(NSInteger)arg1
Unfortunately, Xcode isn't able to recognize the two-digit capture group \10; it translates it to \1 followed by a literal '0' character, and so on. How can I solve this problem (or bug)?
Thanks in advance,
Michał
I believe @trojanfoe is correct; Xcode's replacement field only recognizes the nine back-references \1 through \9. This is waaay more than you need for your particular example, though.
^\s*([+-])\s*\((\w+)\s*(\*?)\s*\)\s*(\w+)(\s*(:)\s*(\()\s*(\w+)\s*(\*?)\s*(\))\s*(\w+))?
\1 \(\2\3\)\4\6\7\8\9\10\11
The first thing I notice is that you're not using \5, so there's no reason to capture it at all. Next, I notice that \6 corresponds to the regex (:), so you can avoid capturing it and replace \6 with : in the output. \7 corresponds to (\(), so you can replace \7 with ( in the output. ...Iterating this approach yields a much simpler pair of regexes: one for zero-argument methods and one for one-argument methods.
^\s*([+-])\s*\((\w+)\s*(\*?)\s*\)\s*(\w+)
\1 \(\2\3\)\4
^([+-] \(\w+\*?\)\w+)\s*:\s*\(\s*(\w+)\s*(\*?)\s*\)\s*(\w+)
\1:\(\2\3\)\4
Notice that I can capture the whole regex [+-] \(\w+\*?\)\w+ without all those noisy \s*s, because it's been normalized already by the first regex's pass.
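The two-pass approach can be checked outside Xcode. Below is a small Python sketch that applies both find/replace pairs in sequence to the test string from the question. One adaptation is assumed: Python's replacement templates would keep a backslash in front of a parenthesis, so the replacements are written with plain ( and ). The normalize helper is a hypothetical name.

```python
import re

# Pass 1: normalize the return type and the method name.
PASS1 = (r"^\s*([+-])\s*\((\w+)\s*(\*?)\s*\)\s*(\w+)", r"\1 (\2\3)\4")
# Pass 2: normalize a single argument, relying on pass 1's output format.
PASS2 = (r"^([+-] \(\w+\*?\)\w+)\s*:\s*\(\s*(\w+)\s*(\*?)\s*\)\s*(\w+)",
         r"\1:(\2\3)\4")

def normalize(signature):
    """Apply both passes in order; neither needs more than four groups."""
    for pattern, replacement in (PASS1, PASS2):
        signature = re.sub(pattern, replacement, signature)
    return signature

print(normalize("+(NSString *) testFunction : (NSInteger ) arg1"))
# + (NSString*)testFunction:(NSInteger)arg1
```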
However, this whole idea is a huge mistake. Consider the following Objective-C method declarations:
-(const char *)toString;
-(id)initWithA: (A) a andB: (B) b andC: (C) c;
-(NSObject **)pointerptr;
-(void)performBlock: (void (^)(void)) block;
-(id)stringWithFormat: (const char *) fmt, ...;
None of these are going to be parsed correctly by your regex. The first one contains a two-word type const char instead of a single word; the second has more than one parameter; the third has a double pointer; the fourth has a very complicated type instead of a single word; and the fifth has not only const char but a variadic argument list. I could go on, through out parameters and arrays and __attribute__ syntax, but surely you're beginning to see why regexes are a bad match for this problem.
What you're really looking for is an indent program (named after GNU indent, which unfortunately doesn't do Objective-C). The best-known and best-supported Objective-C indent program is called uncrustify; get it here.