Following up from an earlier question, I'm a bit confused about the precedence of the /.+/ regex line; I would expect the test below to produce:
line
line x
chunk abc
instead I get:
line
line x
line abc
from lark import Lark

def test_tokenizing(self):
    p = Lark(r"""
        _NL: /\n/
        line.-1: /.+/? _NL
        chunk: /abc/ _NL
        start: (line|chunk)+
        """, parser='lalr')
    text = '\nx\nabc\n'
    print(p.parse(text).pretty())
In Lark, priorities mean different things for rules and for terminals.
Just a quick reminder, rules have lowercase names, while terminals have UPPERCASE names.
In LALR mode, a priority on a rule only affects which rule is chosen in case of a reduce/reduce collision; it has no effect on the terminals inside that rule.
What you want is to change the priority on the terminal itself:
from lark import Lark

def test_tokenizing():
    p = Lark(r"""
        _NL: /\n/
        line: EVERYTHING? _NL
        EVERYTHING.-1: /.+/
        chunk: /abc/ _NL
        start: (line|chunk)+
        """, parser='lalr')
    text = '\nx\nabc\n'
    print(p.parse(text).pretty())
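To double-check which rule each line reduces to after the change, one option is to inspect the children of the parse tree. This is a minimal sketch based on the grammar above; the labels in the comment are the expected outcome implied by the answer, not verified output:

from lark import Lark

p = Lark(r"""
    _NL: /\n/
    line: EVERYTHING? _NL
    EVERYTHING.-1: /.+/
    chunk: /abc/ _NL
    start: (line|chunk)+
""", parser='lalr')

tree = p.parse('\nx\nabc\n')
# With EVERYTHING demoted to priority -1, 'abc' should be lexed as the
# anonymous 'abc' terminal and reduce via `chunk` instead of `line`.
print([child.data for child in tree.children])  # expected: ['line', 'line', 'chunk']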
I am trying to avoid matching whitespace at the end of a string while still matching whitespace in the middle of words.
Here is an example of a regex that matches underscores between the x's but does not match up to three trailing underscores.
say 'x_x___x________' ~~ /
    [
    | 'x'
    | '_' <!before [
          | $
          | '_' <?before $>
          | '_' <?before ['_' <?before $>]>
          | '_' <?before ['_' <?before ['_' <?before $>]>]>
          # ...
          ]>
    ]+
/;
Is there a way to construct the rest of the pattern implied by the ...?
It is a little difficult to discern what you are asking for.
You could be looking for something as simple as this:
say 'x_x___x________' ~~ / 'x'+ % '_' ** 1..3 /
# 「x_x___x」
or
say 'x_x___x________' ~~ / 'x'+ % '_' ** 1..2 /
# 「x_x」
or
say 'x_x___x________' ~~ / 'x'+ % '_'+ /
# 「x_x___x」
I would suggest using a Capture..., thusly:
'x_x___x________' ~~ /(.*?) _* $/;
say $0; #「x_x___x」
(The ? modifier makes the * 'non-greedy'.)
Please let me know if I have missed the point!
avoid matching whitespace at the end of a string while still matching whitespace in the middle of words
Per Brad's answer, and your comment on it, something like this:
/ \w+ % \s+ /
what I'm looking for is a way to match arbitrarily long streams that end with a known pattern
Per @user0721090601's comment on your Q, and as a variant of @p6steve's answer, something like this:
/ \w+ % \s+ )> \s* $ /
The )> capture marker marks where capture is to end.
You can use arbitrary patterns on the left and right of that marker.
an infinite family of <!before> patterns
Generalizing to an infinite family of patterns of any type, whether they are zero-width or not, the most natural solution in a regex is iteration using any of the standard quantifiers that are open ended. For example, \s+ for one or more whitespace characters.[1] [2]
Is there a way to construct the rest of the pattern implied by the ...?
I'll generalize that to "Is there a way in a Raku regex to match some arbitrary pattern that could in theory be recognized by a computer program?"
The answer is always "Yes":
While Raku rules/regexes might look like traditional regexes, they are in fact arbitrary functions embedded in an arbitrary program over which you ultimately have full control.
Rules have arbitrary read access to capture state.[3]
Rules can do arbitrary Turing-complete computation.[4]
A collection of rules/regexes can arbitrarily consume input and drive the parse/match state, i.e. can implement any parser.
In short, if it can be matched/parsed by any program written in any programming language, it can be matched/parsed using Raku rules/regexes.
Footnotes
[1] If you use an open ended quantifier you do need to make sure that each match iteration/recursion either consumes at least one character, or fails, so that you avoid an infinite loop. For example, the * quantifier will succeed even if the pattern it qualifies does not match, so be careful that that won't lead to an infinite loop.
[2] Given the way you wrote your example, perhaps you are curious about recursion rather than iteration. Suffice to say, it's easy to do that too.[1]
[3] In Raku rules, captures form a hierarchy. There are two special variables that track the capture state of two key levels of this hierarchy:
$¢ is the capture state of the innermost enclosing overall capture. Think of it as something analogous to a return value being constructed by the current function call in a stack of function calls.
$/ is the capture state of the innermost enclosing capture. Think of it as something analogous to a value being constructed by a particular block of code inside a function.
For example:
'123' ~~ / 1* ( 2* { print "$¢ $/" } ) 3* { print "$¢ $/" } / ; # 1 2123 123
The overall / ... / is analogous to an ordinary function call. The first 1 and first 123 of the output show what has been captured by that overall regex.
The ( ... ) sets up an inner capture for a part of the regex. The 2* { print "$¢ $/" } within it is analogous to a block of code. The 2 shows what it has captured.
The final 123 shows that, at the top level of the regex, $/ and $¢ have the same value.
[4] For example, the code in footnote 3 above includes arbitrary code inside the { ... } blocks. More generally:
Rules can be invoked recursively;
Rules can have full signatures and pass arguments;
Rules can contain arbitrary code;
Rules can use multiple dispatch semantics for resolution. Notably, this can include resolution based on longest match length.
I’m wondering if Raku’s trim() routines might suit your purpose, for example: .trim, .trim-trailing or even .trim-leading. In the Raku REPL:
> say 'x x x ' ~~ m:g/ 'x'+ \s* /;
(「x 」 「x 」 「x 」)
> say 'x x x '.trim-trailing ~~ m:g/ 'x'+ \s* /;
(「x 」 「x 」 「x」)
HTH.
https://docs.raku.org/routine/trim
https://docs.raku.org/routine/trim-trailing
https://docs.raku.org/routine/trim-leading
I am trying to make sure that spaCy treats a dot as a separate token except when it is between two digits. I noticed nlp.Defaults.infixes uses lookaround operators extensively, so I followed the example:
infixes = nlp.Defaults.infixes + (r'''[;,:]''',
                                  r'(?<=[a-zA-Z_])[\.^]', r'[\.^](?=[a-zA-Z_])',
                                  )
infix_regex = spacy.util.compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_regex.finditer
Now upper case behaves as expected:
list(nlp.tokenizer('HELLO.WORLD'))
[HELLO, ., WORLD]
But if the right-hand side is lowercase, it fails:
list(nlp.tokenizer('HELLO.world'))
[HELLO.world]
list(nlp.tokenizer('hello.world'))
[hello.world]
list(nlp.tokenizer('hello.WORLD'))
[hello, ., WORLD]
Another example, where the regex splits on parentheses and slashes, but not on dots (and, once a dot is present, not even on slashes):
infixes = nlp.Defaults.infixes + (r'(?<=[a-zA-Z_])[\.\(\)/](?=[a-zA-Z_])', )
infix_regex = spacy.util.compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_regex.finditer
tests = [
    'mid(inferior',
    'mid/inferior',
    'left.mid',
    'left.mid/inferior']

pattern = re.compile(r'(?<=[a-zA-Z_])[\.\(\)/](?=[a-zA-Z_])')

for tt in tests:
    print("-" * 20)
    print(pattern.split(tt))
    print(list(nlp.tokenizer(tt)))
result:
--------------------
['mid', 'inferior']
[mid, (, inferior]
--------------------
['mid', 'inferior']
[mid, /, inferior]
--------------------
['left', 'mid']
[left.mid]
--------------------
['left', 'mid', 'inferior']
[left.mid/inferior]
As one can see, for some reason within spaCy the regex pattern does not split on the dot, and if the word is not purely alphabetic it fails to split on the slash either, even though that should not be an issue for the regex above.
The problem here is the order of the infix patterns. The tokenizer uses the first match it finds, and another infix pattern is matching before yours. You can add your custom patterns first instead:
infixes = (r'(?<=[a-zA-Z_])[\.\(\)/](?=[a-zA-Z_])', ) + nlp.Defaults.infixes
This can potentially lead to side effects for other patterns, especially if you add very short patterns first, so double-check the overall tokenization accuracy for your data before and after making the changes to make sure it's doing what you expect. This is probably fine as-is, but you might need to add your pattern in the middle rather than at the beginning to get the intended results.
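If prepending turns out to be too aggressive for your data, a minimal sketch of the "add it in the middle" idea could look like this (the insertion index and the blank pipeline are illustrative assumptions, not recommendations):

import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")  # assumption: a blank English pipeline for illustration
custom_infix = r'(?<=[a-zA-Z_])[\.\(\)/](?=[a-zA-Z_])'

infixes = list(nlp.Defaults.infixes)
infixes.insert(3, custom_infix)  # hypothetical position; tune it against your data
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

# Note: url_match (see the edit below) may still pre-empt the infix split
# for URL-like strings such as 'left.mid/inferior'.
print([t.text for t in nlp.tokenizer('left.mid/inferior')])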
Edit:
The HELLO.world case still doesn't work, because the URL pattern is jumping in first:
print(nlp.tokenizer.explain("HELLO.world"))
# [('URL_MATCH', 'HELLO.world')]
If you don't care about URL tokenization, you can just set:
nlp.tokenizer.url_match = None
to remove the URL matching that's going to interact a lot with . and / in tokens.
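A minimal sketch combining the reordered custom infix with the disabled URL matcher (a blank English pipeline is assumed here, and the commented results are the expected outcome, not captured output):

import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")  # assumption: blank pipeline for illustration
print(nlp.tokenizer.explain("HELLO.world"))  # expected: [('URL_MATCH', 'HELLO.world')]

custom_infix = r'(?<=[a-zA-Z_])[\.\(\)/](?=[a-zA-Z_])'
infixes = [custom_infix] + list(nlp.Defaults.infixes)
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer
nlp.tokenizer.url_match = None

print([t.text for t in nlp.tokenizer("HELLO.world")])  # expected: ['HELLO', '.', 'world']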
I have the following rule, where I'm trying to constrain the wildcard sensor to any string except those starting with fitbit. The problem I'm facing is that the regex I'm using seems to match any string, so it is as if the rule did not exist (no output file is going to be generated).
rule readable_datetime:
    input:
        sensor_input = rules.download_dataset.output
    params:
        timezones = None,
        fixed_timezone = config["READABLE_DATETIME"]["FIXED_TIMEZONE"]
    wildcard_constraints:
        sensor = "^(?!fitbit).*" # ignoring fitbit sensors
    output:
        "data/raw/{pid}/{sensor}_with_datetime.csv"
    script:
        "../src/data/readable_datetime.R"
I'm getting this error message with a rule (light_metrics) that needs the output of readable_datetime with sensor=light as input:
MissingInputException in line 112 of features.snakefile:
Missing input files for rule light_metrics:
data/raw/p01/light_with_datetime.csv
I prefer to stay away from regexes if I can, and maybe this works for you.
Assuming sensor is a list like:
sensor = ['fitbit', 'spam', 'eggs']
In rule readable_datetime use:
wildcard_constraints:
    sensor = '|'.join([re.escape(x) for x in sensor if x != 'fitbit'])
Explained: re.escape(x) escapes metacharacters in x so that we don't get spurious matches if x contains '.' or '*'. The for x in sensor if x != 'fitbit' part should be self-explanatory, and you can make it as complicated as you want. Finally, '|'.join() stitches everything together into a regex that can match only the items in sensor captured by the list comprehension.
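For illustration, this is what the constraint expands to for the example sensor list above (plain Python, outside of Snakemake):

import re

sensor = ['fitbit', 'spam', 'eggs']
constraint = '|'.join([re.escape(x) for x in sensor if x != 'fitbit'])
print(constraint)  # spam|eggs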
(Why your regex doesn't work I haven't investigated...)
My solution is simply to remove the ^ from the wildcard_constraints regex. This works because the regex is applied to the whole path containing the wildcard rather than just to the wildcard itself.
This is discussed briefly here:
https://bitbucket.org/snakemake/snakemake/issues/904/problems-with-wildcard-constraints
My understanding is that the regex you specify for each wildcard is substituted in to a larger regex for the entire output path.
Without wildcard_constraints:
Searches for something like data/raw/(.*)/(.*)_with_datetime.csv, taking the first and second capture groups to be pid and sensor respectively.
With wildcard_constraints:
Searches for data/raw/(.*)/((?!fitbit).*)_with_datetime.csv, again taking the first and second capture groups to be pid and sensor respectively.
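To see concretely why the leading ^ kills the match, here is a rough standalone check with Python's re module, using a simplified version of the pattern Snakemake builds (the exact generated regex may differ):

import re

path = "data/raw/p01/light_with_datetime.csv"

with_caret    = r"data/raw/(.*)/(^(?!fitbit).*)_with_datetime\.csv"
without_caret = r"data/raw/(.*)/((?!fitbit).*)_with_datetime\.csv"

print(bool(re.match(with_caret, path)))     # False: ^ can never match in the middle of the path
print(bool(re.match(without_caret, path)))  # True
print(bool(re.match(without_caret, "data/raw/p01/fitbit_light_with_datetime.csv")))  # False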
Here is a smaller working example:
rule all:
    input:
        "data/raw/p01/light_with_datetime.csv",
        "data/raw/p01/fitbit_light_with_datetime.csv",
        "data/raw/p01/light_fitbit_with_datetime.csv",
        "data/raw/p01/light_fitbit_foo_with_datetime.csv",

rule A:
    output:
        "data/raw/{pid}/{sensor}_with_datetime.csv"
    wildcard_constraints:
        sensor="(?!fitbit).*"
    shell:
        "touch {output}"
When you run snakemake, it only complains about the file with sensor starting with fitbit being missing, but happily finds the others.
snakemake
Building DAG of jobs...
MissingInputException in line 1 of Snakefile:
Missing input files for rule all:
data/raw/p01/fitbit_light_with_datetime.csv
I have two questions: is the behavior I show below correct, and if so, is it documented somewhere?
I was playing with the grammar TOP method. Declared as a rule, it implies beginning- and end-of-string anchors along with :sigspace:
grammar Number {
    rule TOP { \d+ }
}

my @strings = '137', '137 ', ' 137 ';

for @strings -> $string {
    my $result = Number.parse( $string );
    given $result {
        when Match { put "<$string> worked!" }
        when Any   { put "<$string> failed!" }
    }
}
With no whitespace or trailing whitespace only, the string parses. With leading whitespace, it fails:
<137> worked!
<137 > worked!
< 137 > failed!
I figure this means that rule is applying :sigspace first and the anchors afterward:
grammar Foo {
    regex TOP { ^ :sigspace \d+ $ }
}
I expected a rule to allow leading whitespace, which would happen if you switched the order:
grammar Foo {
    regex TOP { :sigspace ^ \d+ $ }
}
I could add an explicit token in the rule for the beginning of the string:
grammar Number {
    rule TOP { ^ \d+ }
}
Now everything works:
<137> worked!
<137 > worked!
< 137 > worked!
I don't have any reason to think it should be one way or the other. The Grammars docs say two things happen, but they do not say in which order these effects apply:
Note that if you're parsing with .parse method, token TOP is automatically anchored
and
When rule instead of token is used, any whitespace after an atom is turned into a non-capturing call to ws.
I think the answer is that the rule isn't actually anchored in the pattern sense. It's the way .parse works. The cursor has to start at position 0 and end at the last position in the string. That's something outside of the pattern.
The behavior is intended, and is a culmination of these language features:
Sigspace ignores whitespace before the first atom.
From the design docs[1] (S05: Regexes and Rules, line 348, emphasis added):
The new :s (:sigspace) modifier causes certain whitespace sequences to be considered "significant"; they are replaced by a whitespace matching rule, <.ws>. Only whitespace sequences immediately following a matching construct (atom, quantified atom, or assertion) are eligible. Initial whitespace is ignored at the front of any regex, to make it easy to write rules that can participate in longest-token-matching alternations. Trailing space inside the regex delimiters is significant.
This means:
rule TOP { \d+ }       # <.ws> automatically inserted after \d+
rule TOP { ^ \d+ $ }   # <.ws> automatically inserted after ^, after \d+, and after $
Regexes are first-class compiled code with lexical scoping.
A regex/rule is not a string that may have characters concatenated to it later to change its behavior. It is a self-contained routine, which is parsed and has its behavior nailed down at compile time.
Regex modifiers like :sigspace, including the one implicitly added by the rule keyword, apply only to their lexical scope - i.e. to the fragment of source code they appear in at compile time. S05, line 629[1]:
The :i, :m, :r, :s, :dba, :Perl5, and Unicode-level modifiers can be placed inside the regex (and are lexically scoped)
The anchoring of rule TOP is done at run time by .parse.
S05, line 4423[1]:
The .parse and .parsefile methods anchor to the beginning and ending of the text, and fail if the end of text is not reached. (The TOP rule can check against $ itself if it wishes to produce its own error message.)
I.e. the anchoring to the beginning of the string is not intrinsic to the rule TOP, and doesn't affect how the lexical scope of TOP is parsed and compiled. It is done when method .parse is called.
It has to be this way, because the same grammar can be used with different starting rules instead of TOP, using .parse(..., rule => ...).
So when you write
rule TOP { \d+ }
it is compiled as
regex TOP { :r \d+ <.ws> }
And when you .parse that grammar, it effectively invokes the regex code ^ <TOP> $, with the anchors not being part of TOP's lexical scope but rather of a scope that merely calls the routine TOP. The combined behavior is as if the rule TOP had been written as:
regex TOP { ^ [:r :s \d+] $ }
[1] The design docs are in general not to be taken as gospel for what is or isn't part of the Perl 6 language, but S05 is pretty accurate in that regard, except that it mentions some features that haven't been implemented yet but are planned. Anyone who wants to truly grok the intricacies of Perl 6 regexes/grammars is IMO well served by reading the full S05 from top to bottom at least once.
There aren't two regex effects going on. The rule applies :sigspace. After that, the grammar is defined. When you call .parse, it starts at the beginning of the string and goes to the end (or fails). That anchoring isn't part of the grammar. It's part of how .parse applies the grammar.
My main issue was the odd way some of the things are worded in the docs. They aren't technically wrong, but they also tend to assume knowledge about things the reader might not know. In this case, the casual comment about anchoring TOP isn't as special as it seems. Any rule passed to .parse is anchored in the same way. There's no special behavior for that rule name other than it's the default value for :rule in a call to .parse.
I looked through the Artima guide on parser combinators, which says that we need to append failure(msg) to our grammar rules to make error reporting meaningful for the user:
def value: Parser[Any] =
  obj | stringLit | num | "true" | "false" | failure("illegal start of value")
This breaks my understanding of the recursive mechanism used in these parsers. On one hand, the Artima guide makes sense when it says that if all productions fail, the parser will arrive at the failure("illegal start of value") that is returned to the user. On the other hand, it stops making sense once we understand that the grammar is not a flat list of value alternatives but a tree. That is, the value parser is a node that is called when a value is expected in the input. This means that the calling parser, which is also its parent, detects the failure of the value parse and proceeds with the sibling alternatives of value. Suppose all of those alternatives fail as well. The grandparent parser will then try its alternatives. If those fail in turn, the process unwinds upward until the start-symbol parser fails. So, what will be the error message? It seems that the last alternative of the topmost parser would be the one reported as erroneous.
To figure out who is right, I created a demo where program is the topmost (start symbol) parser:
import scala.util.parsing.combinator._

object ExprParserTest extends App with JavaTokenParsers {

  // Grammar
  val declaration = wholeNumber ~ "to" ~ wholeNumber | ident | failure("declaration not found")
  val term = wholeNumber | ident
  lazy val expr: Parser[_] = term ~ rep("+" ~ expr)
  lazy val statement: Parser[_] = ident ~ " = " ~ expr |
    "if" ~ expr ~ "then" ~ rep(statement) ~ "else" ~ rep(statement)
  val program = rep(declaration) ~ rep(statement)

  // Test
  println(parseAll(program, "1 to 2"))  // OK
  println(parseAll(program, "1 to '2")) // failure, regex `-?\d+' expected but `'' found at '2
  println(parseAll(program, "abc"))     // OK
}
It fails with 1 to '2 because of the extra ' tick. Yes, it seems to get stuck in the program -> declaration -> num "to" num rule and does not even try the ident and failure("declaration not found") alternatives! It does not backtrack to the statements either, for the same reason. So neither my guess nor the Artima guide seems right about what parser combinators are actually doing. I wonder: what is the real logic behind rule sensing, backtracking and error reporting in parser combinators? Why does the error message suggest that there was no backtracking to declaration -> ident | failure(), nor to the statements? What is the point of the Artima guide suggesting to place failure() at the end if it is either not reached, as we see, or ignored by the backtracking logic anyway?
Isn't a parser combinator just a plain dumb PEG? It behaves like a predictive parser. I expected it to be a PEG and, thus, that the start-symbol parser would return all of the failed branches, and I wonder why/how the actual parser manages to select the most appropriate failure.
Many parser combinators backtrack, unless they're in an 'or' block. As a speed optimization, they'll commit to the 1st successful 'or' item and not backtrack. So 1) try to avoid '|' as much as possible in your grammar, and 2) if using '|' is unavoidable, place the longest or least-likely-to-match items first.