Wildcard constraint that does not match a string - snakemake

I have the following rule where I'm trying to constrain the wildcard sensor to any string except those starting with fitbit. The problem I'm facing is that the regex I'm using seems to match any string, so it is as if the rule did not exist (no output file gets generated).
rule readable_datetime:
    input:
        sensor_input = rules.download_dataset.output
    params:
        timezones = None,
        fixed_timezone = config["READABLE_DATETIME"]["FIXED_TIMEZONE"]
    wildcard_constraints:
        sensor = "^(?!fitbit).*" # ignoring fitbit sensors
    output:
        "data/raw/{pid}/{sensor}_with_datetime.csv"
    script:
        "../src/data/readable_datetime.R"
I'm getting this error message with a rule (light_metrics) that needs the output of readable_datetime with sensor=light as input:
MissingInputException in line 112 of features.snakefile:
Missing input files for rule light_metrics:
data/raw/p01/light_with_datetime.csv

I prefer to stay away from regexes if I can; maybe this works for you.
Assuming sensor is a list like:
sensor = ['fitbit', 'spam', 'eggs']
In rule readable_datetime use (note that a Snakefile is Python, so make sure re is imported at the top):
wildcard_constraints:
    sensor = '|'.join([re.escape(x) for x in sensor if x != 'fitbit'])
Explained: re.escape(x) escapes metacharacters in x so that we don't get spurious matches if x contains '.' or '*'. for x in sensor if x != 'fitbit' should be self-explanatory, and you can make it as complicated as you want. Finally, '|'.join() stitches everything together into a regex that matches only the items in sensor captured by the list comprehension.
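For instance, with the example list above the constraint collapses to a plain alternation. A quick check in a Python shell (same names as above):

import re

sensor = ['fitbit', 'spam', 'eggs']
print('|'.join([re.escape(x) for x in sensor if x != 'fitbit']))
# spam|eggs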
(Why your regex doesn't work I haven't investigated...)

My solution is simply to remove the ^ from the wildcard_constraints regex. This works because the regex is applied to the whole path containing the wildcard rather than just to the wildcard itself.
This is discussed briefly here:
https://bitbucket.org/snakemake/snakemake/issues/904/problems-with-wildcard-constraints
My understanding is that the regex you specify for each wildcard is substituted into a larger regex for the entire output path.
Without wildcard_constraints:
Searches for something like data/raw/(.*)/(.*)_with_datetime.csv, taking the first and second capture groups to be pid and sensor respectively.
With wildcard_constraints:
Searches for data/raw/(.*)/((?!fitbit).*)_with_datetime.csv, again taking the first and second capture groups to be pid and sensor respectively.
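You can verify this, and see why the ^ breaks things, by trying the combined patterns directly with Python's re module (a simplified sketch of the pattern Snakemake builds, not its actual implementation):

import re

# Constraint without '^': everything matches except sensors starting with 'fitbit'.
good = re.compile(r"data/raw/(.*)/((?!fitbit).*)_with_datetime\.csv")
print(bool(good.fullmatch("data/raw/p01/light_with_datetime.csv")))        # True
print(bool(good.fullmatch("data/raw/p01/fitbit_light_with_datetime.csv"))) # False

# Constraint with '^': the anchor ends up in the middle of the pattern,
# where it can never succeed, so no output path matches at all.
bad = re.compile(r"data/raw/(.*)/(^(?!fitbit).*)_with_datetime\.csv")
print(bool(bad.fullmatch("data/raw/p01/light_with_datetime.csv")))         # False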
Here is a smaller working example:
rule all:
    input:
        "data/raw/p01/light_with_datetime.csv",
        "data/raw/p01/fitbit_light_with_datetime.csv",
        "data/raw/p01/light_fitbit_with_datetime.csv",
        "data/raw/p01/light_fitbit_foo_with_datetime.csv",

rule A:
    output:
        "data/raw/{pid}/{sensor}_with_datetime.csv"
    wildcard_constraints:
        sensor="(?!fitbit).*"
    shell:
        "touch {output}"
When you run snakemake, it only complains about the file with sensor starting with fitbit being missing, but happily finds the others.
$ snakemake
Building DAG of jobs...
MissingInputException in line 1 of Snakefile:
Missing input files for rule all:
data/raw/p01/fitbit_light_with_datetime.csv

Related

Snakemake: access a list within a dict by using a wildcard

To break it down, I have a dict that looks like this:
dict = {'A': ["sample1", "sample2", "sample3"],
        'B': ["sample1", "sample2"],
        'C': ["sample1", "sample2", "sample3"]}
And I have a rule:
rule example:
    input:
        # some input
    params:
        # some params
    output:
        expand('{{x}}{sample}', sample=dict[wildcards.x])
        # the alternative I tried was
        # expand('{{x}}{sample}', sample=lambda wildcards: dict[wildcards.x])
    log:
        log = '{x}.log'
    run:
        """
        foo
        """
My problem is how I can access the dictionary with wildcards.x as the key, so that I get the list of items corresponding to that wildcard.
The first example just gives me
name 'wildcards' is not defined
while the alternative just gives me
Missing input files for rule all
since Snakemake doesn't even run the example rule.
I need to use expand, since I want the rule to run only once for each x wildcard while creating multiple samples in this one run.
You can use a lambda as a function of a wildcard in the input section only; you cannot use one in the output. Actually, output doesn't have access to any wildcards: it is what defines them.
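For example, a downstream rule can look the wildcard up in the dict from inside an input function (a sketch; the rule name, output file, and shell command are invented for illustration):

samples = {'A': ["sample1", "sample2", "sample3"],
           'B': ["sample1", "sample2"]}

rule collect:
    input:
        # The lambda receives the wildcards object after {x} has been
        # resolved, so the dict lookup works here.
        lambda wildcards: expand('{x}{sample}', x=wildcards.x, sample=samples[wildcards.x])
    output:
        '{x}.collected'
    shell:
        'cat {input} > {output}'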
Let's rethink your task from another point of view. How do you decide how many samples the output produces? You are defining the dict: where does this information come from? You have not shown the actual script, but how does it know how many outputs to produce?
Logically you might have three separate rules (or at least two): one that knows how to produce two samples, another that produces three.
As far as I can see, you are experiencing an XY problem: you are asking the same question twice, but you are not expressing your actual problem (X), while forcing an incorrect implementation (Y) of defining all outputs through a dictionary.
Update: another possible solution to your artificial example would be to use dynamic output:
rule example:
    input:
        # some input
    output:
        dynamic('{x}sample{n}')
That would work in your case because the files match the common pattern "sample{n}".

Partial Match in a Grammar

I have a simple grammar, and I am using it to parse some text. The text is user inputted, but my program guarantees that it starts with a match to the grammar. (i.e., if my grammar matched only a, the text might be abc or a or a_.) However, when I use the .parse method on my grammar, it fails on any non-exact match. How can I perform a partial match?
In Raku, Grammar.parse has to match the whole string. This is what causes it to fail if your grammar would only match a in the string abc. To allow matching only part of the input string, you can use Grammar.subparse instead.
grammar Foo {
    token TOP { 'a' }
}

my $string = 'abc';
say Foo.parse($string);    # Nil
say Foo.subparse($string); # 「a」
The input string will need to start with the potential Match. Otherwise, you will get a failed match.
say Foo.subparse('cbacb'); # #<failed match>
You can work around this using a Capture marker.
grammar Bar {
    token TOP {
        <-[a]>* # Match 0 or more characters that are *not* 'a'
        <( 'a'  # Start the match, and match a single 'a'
    }
}
say Bar.parse('a'); # 「a」
say Bar.subparse('a'); # 「a」
say Bar.parse('abc'); # Nil
say Bar.subparse('abc'); # 「a」
say Bar.parse('cbabc'); # Nil
say Bar.subparse('cbabc'); # 「a」
This works because <-[a]>*, a character class that includes any character except the letter a, will consume all the characters before a potential a. However, the Capture marker will cause these to be dropped from the eventual Match object, leaving you with just the a you wanted to match.
TL;DR
grammar foo { token TOP { a* } }
# Partial match anchored at start of string:
say .subparse: 'abcaa' given foo; # 「a」
# Partial match anchored to end of string:
say 'abcaa' ~~ / <.foo::TOP> $ /; # 「aa」
# Longest partial match, no anchoring:
say ('abcaaabcaabc' ~~ m:g/ <.foo::TOP> /).max(*.chars); # 「aaa」
Vocabulary
There are traditionally two takes on the general notion of text "matching":
"Parsing"
"Regexes"
Raku:
Provides a unified text pattern language and engine that do both jobs.
Makes it easy to stick to one perspective, or other, or blend them, or refactor between them, as suits an individual dev and/or individual use case.
Takes "parsing" to mean more or less a single match starting at the start of the input string whereas "regexes" are much more flexible.
What you've written in your question and your first comment on Tyil's answer reflects the inherent ambiguity of the topic. I'll provide two answers rather than one to try to help you and/or other readers be clearer about Raku's use of vocabulary, and your options functionality-wise.
Limited "partial matching" via .parse et al
You began with:
Partial match in a grammar ... I have a simple grammar ... my program guarantees that it starts with a match to the grammar
With that in mind, here's your question:
How can I perform a partial match?
The phrases "guarantees that it starts" and "partial match" are ambiguous.
One take is that you want what I'll call a "prefix" match, matching one or more characters anchored from the start of the string, and not merely any sub-string starting and ending anywhere in the input string.
This nicely fits with "parsing", or at least Raku's use of the word in its grammar methods.
All the built in Grammar methods with parse in their name insert an anchor to the start of the string in whatever grammar rule they use to start the parsing process. You cannot remove that anchor. This reflects the choice of vocabulary; "parse" is taken to mean matching from the start no matter what else happens.
The parse method for this "prefix" scenario is .subparse:
grammar foo { token TOP { a* } }
# Partial match anchored at start of string:
say .subparse: 'abcaa' given foo; # 「a」
See also:
Search of SO for "[raku] subparse".
raku doc for .subparse.
But perhaps "guarantees that it starts" and "partial match" did not mean that you wanted anchoring at the start. Your comment on Tyil's answer highlights this ambiguity:
Will .subparse only match at the start, or match anywhere in the string?
Tyil provides a workaround. You can do what Tyil shows, but it'll only match if the very first a encountered in the input string is the one that's at the start of the sub-string you want your "parse" to match.
If instead the first a was a false positive, and there was a second or a subsequent a you wanted the "parse" match to start at, then, at least in the Raku world, it's helpful to call that "regexing" rather than "parsing" and to use "regex" matching via the ~~ smartmatch operator.
Unlimited "partial matching" via ~~
Raku lets you do unlimited partial matching if you use its ~~ construct with a regex.
For example, you could write:
# End of match at end of string:
↓
say 'abcaa' ~~ token { a* $ } # 「aa」
~~ with a regex tells Raku to:
Try match starting at the first character position in the string on the LHS;
If that fails, step forward one character, and try again, with the new position in the input string treated as a fresh starting point;
Repeat that until either matching once, or failing to find any match in the entire string.
Here I've left the start position of the match unspecified (which ~~ takes to mean it can be anywhere in the string) and anchored the end of the pattern to the end of the input string. So it successfully matches the aa at the end of the string.
This anchoring freedom illustrates just one of the many ways that ~~ smart matching provides much greater matching flexibility than using the parse methods.
If you have an existing grammar you can still use that:
grammar foo { token TOP { a* } }
# Anchor matching to end of string:
↓
say 'abcaa' ~~ / <.foo::TOP> $ /; # 「aa」
You have to name both the grammar and the rule within it you wish to invoke and put them inside <...>. And you need to insert a . to avoid a correspondingly named sub-capture, presuming you don't want that.
Here's another example:
# Longest partial match, no anchoring:
say ('abcaaabcaabc' ~~ m:g/ <.foo::TOP> /).max(*.chars); # 「aaa」
"Parsing" in Raku always starts at the beginning of an input string and results in either no match or one match.
In contrast, a "regex" can match arbitrary fragments, and can match any number of fragments. (You can even match overlapping fragments.)
In my last example I used :g, which is short for :global, a well-known feature among traditional regex engines. :g matches as many times as a match can be found in the input string (but without overlapping).
The match operation then returns either Nil (no matches at all) or a list of match objects (one or more). I've applied a .max(*.chars) to yield the longest match (the first if there are multiple longest sub-strings).
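If mapping this onto a more traditional regex API helps, here is a rough Python analogy of the three behaviours (it approximates the semantics only; parse and subparse are Raku's names, everything else here is Python):

import re

# Foo.parse:     whole string must match  -> like re.fullmatch
# Foo.subparse:  anchored at the start    -> like re.match
# ~~ smartmatch: match anywhere           -> like re.search
print(re.fullmatch('a+', 'abcaa'))       # None
print(re.match('a+', 'abcaa').group())   # a
print(re.search('a+$', 'abcaa').group()) # aa   (anchored to the end)

# m:g/.../ is roughly re.findall; take the longest match with max(..., key=len)
print(max(re.findall('a+', 'abcaaabcaabc'), key=len))  # aaa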

Recursion/looping of rules in snakemake

I'm trying to bootstrap a hmm training, thus I need to loop through a few rules a couple of times. My idea was to do this:
dict = {'boot1': 'init', 'boot2': 'boot1', 'final': 'boot2'} # Define the workflow

rule a_rule_to_initialize_and_make_the_first_input:
    output:
        'init_hmm'

rule make_model:
    input:
        '{0}_hmm'.format(dict[{run}]) # Create the loop by referencing the dict.
    output:
        '{run}_training_data'

rule train:
    input:
        '{run}_training_data'
    output:
        '{run}_hmm'
However, I can't access the wildcard {run} in the format function. Any hints as how I could get a hold of {run} within the input line. Or maybe a better way of performing the iteration?
I'm not sure if there's a better way to do the iteration, but the reason you can't access run is that wildcards are only parsed when they appear in a string directly in the list of inputs or outputs. Snakemake lets you define lambda functions that get passed a wildcards object, so you need to do:
input:
    lambda wildcards: '{0}_hmm'.format(dict[wildcards.run])
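Putting it together, the loop then resolves backwards from whichever file you request (a sketch; the dict is renamed to iterations to avoid shadowing Python's built-in dict):

iterations = {'boot1': 'init', 'boot2': 'boot1', 'final': 'boot2'}

rule make_model:
    input:
        # Requesting final_hmm needs final_training_data (via rule train),
        # which needs boot2_hmm, and so on down the chain to init_hmm.
        lambda wildcards: '{0}_hmm'.format(iterations[wildcards.run])
    output:
        '{run}_training_data'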

How to remove diacritics in Perl 6

Two related questions.
Perl 6 is so smart that it understands a grapheme as one character, whether it is one Unicode symbol (like ä, U+00E4) or two or more combined symbols (like p̄ and ḏ̣). This little code
my @symb;
@symb.push("ä");
@symb.push("p" ~ 0x304.chr); # "p̄"
@symb.push("ḏ" ~ 0x323.chr); # "ḏ̣"
say "$_ has {$_.chars} character" for @symb;
gives the following output:
ä has 1 character
p̄ has 1 character
ḏ̣ has 1 character
But sometimes I would like to be able to do the following.
1) Remove diacritics from ä. So I need some method like
"ä".mymethod → "a"
2) Split "combined" symbols into parts, i.e. split p̄ into p and Combining Macron U+0304. E.g. something like the following in bash:
$ echo p̄ | grep . -o | wc -l
2
Perl 6 has great Unicode processing support in the Str class. To do what you are asking in (1), you can use the samemark method/routine.
Per the documentation:
multi sub samemark(Str:D $string, Str:D $pattern --> Str:D)
method samemark(Str:D: Str:D $pattern --> Str:D)
Returns a copy of $string with the mark/accent information for each character changed such that it matches the mark/accent of the corresponding character in $pattern. If $string is longer than $pattern, the remaining characters in $string receive the same mark/accent as the last character in $pattern. If $pattern is empty no changes will be made.
Examples:
say 'åäö'.samemark('aäo'); # OUTPUT: «aäo␤»
say 'åäö'.samemark('a'); # OUTPUT: «aao␤»
say samemark('Pêrl', 'a'); # OUTPUT: «Perl␤»
say samemark('aöä', ''); # OUTPUT: «aöä␤»
This can be used both to remove marks/diacritics from letters, as well as to add them.
For (2), there are a few ways to do this (TIMTOWTDI). If you want a list of all the codepoints in a string, you can use the ords method to get a List (technically a Positional) of all the codepoints in the string.
say "p̄".ords; # OUTPUT: «(112 772)␤»
You can use the uniname method/routine to get the Unicode name for a codepoint:
.uniname.say for "p̄".ords; # OUTPUT: «LATIN SMALL LETTER P␤COMBINING MACRON␤»
or just use the uninames method/routine:
.say for "p̄".uninames; # OUTPUT: «LATIN SMALL LETTER P␤COMBINING MACRON␤»
If you just want the number of codepoints in the string, you can use codes:
say "p̄".codes; # OUTPUT: «2␤»
This is different from chars, which just counts the number of characters in the string:
say "p̄".chars; # OUTPUT: «1␤»
Also see hobbs' answer below, which uses NFD.
This is the best I was able to come up with from the docs — there might be a simpler way, but I'm not sure.
my $in = "Él está un pingüino";
my $stripped = Uni.new($in.NFD.grep: { !uniprop($_, 'Grapheme_Extend') }).Str;
say $stripped; # El esta un pinguino
The .NFD method converts the string to normalization form D (decomposed), which separates graphemes out into base codepoints and combining codepoints whenever possible. The grep then returns a list of only those codepoints that don't have the "Grapheme_Extend" property, i.e. it removes the combining codepoints. The Uni.new(...).Str then assembles those codepoints back into a string.
You can also put these pieces together to answer your second question; e.g.:
$in.NFD.map: { Uni.new($_).Str }
will return a list of 1-character strings, each with a single decomposed codepoint, or
$in.NFD.map(&uniname).join("\n")
will make a nice little Unicode debugger.
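For comparison, the same decompose-and-filter algorithm is easy to express in other languages as well; here it is in Python with the standard unicodedata module (a cross-language sketch, not Raku):

import unicodedata

def strip_marks(s):
    # NFD splits each character into a base codepoint plus combining marks;
    # unicodedata.combining() is nonzero for the combining mark codepoints.
    decomposed = unicodedata.normalize('NFD', s)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

print(strip_marks('Él está un pingüino'))  # El esta un pinguino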
I can't say whether this is better or faster, but I strip diacritics this way:
my $s = "åäö";
say $s.comb.map({.NFD[0].chr}).join; # output: "aao"

SWI-Prolog predicate for reading in lines from input file

I'm trying to write a predicate to accept a line from an input file. Every time it's used, it should give the next line, until it reaches the end of the file, at which point it should return false. Something like this:
database :-
    see('blah.txt'),
    loop,
    seen.

loop :-
    accept_line(Line),
    write('I found a line.\n'),
    loop.

accept_line([Char | Rest]) :-
    get0(Char),
    Char =\= "\n",
    !,
    accept_line(Rest).
accept_line([]).
Obviously this doesn't work. It works for the first line of the input file and then loops endlessly. I can see that I need to have some line like "C =\= -1" in there somewhere to check for the end of the file, but I can't see where it'd go.
So an example input and output could be...
INPUT
this is
an example
OUTPUT
I found a line.
I found a line.
Or am I doing this completely wrong? Maybe there's a built-in rule that does this simply?
In SWI-Prolog, the most elegant way to do this is to first use a DCG to describe what a "line" means, and then use library(pio) to apply the DCG to a file.
An important advantage of this is that you can then easily apply the same DCG also on queries on the toplevel with phrase/2 and do not need to create a file to test the predicate.
There is a DCG tutorial that explains this approach, and you can easily adapt it to your use case.
For example:
:- use_module(library(pio)).
:- set_prolog_flag(double_quotes, codes).

lines --> call(eos), !.
lines --> line, { writeln('I found a line.') }, lines.

line --> ( "\n" ; call(eos) ), !.
line --> [_], line.

eos([], []). % succeeds only at the end of the input
Example usage:
?- phrase_from_file(lines, 'blah.txt').
I found a line.
I found a line.
true.
Example usage, using the same DCG to parse directly from character codes without using a file:
?- phrase(lines, "test1\ntest2").
I found a line.
I found a line.
true.
This approach can be very easily extended to parse more complex file contents as well.
If you want to read into code lists, see library(readutil), in particular read_line_to_codes/2 which does exactly what you need.
You can of course use the character I/O primitives, but at least use the ISO predicates. "Edinburgh-style" I/O is deprecated, at least for SWI-Prolog. Then:
get_line(L) :-
    get_code(C),
    get_line_1(C, L).

get_line_1(-1, []) :- !.   % EOF
get_line_1(0'\n, []) :- !. % EOL
get_line_1(C, [C|Cs]) :-
    get_code(C1),
    get_line_1(C1, Cs).
This is of course a lot of unnecessary code; just use read_line_to_codes/2 and the other predicates in library(readutil).
Since strings were introduced to Prolog, there are some new nifty ways of reading. For example, to read all input and split it to lines, you can do:
read_string(user_input, _, S),
split_string(S, "\n", "", Lines)
See the examples in read_string/5 for reading linewise.
PS. Drop the see and seen etc. Instead:
setup_call_cleanup(open(Filename, read, In),
                   read_string(In, N, S), % or whatever reading you need to do
                   close(In))