Recursion/looping of rules in snakemake - snakemake

I'm trying to bootstrap an HMM training, so I need to loop through a few rules a couple of times. My idea was to do this:
dict = {'boot1': 'init', 'boot2': 'boot1', 'final': 'boot2'}  # Define the workflow
rule a_rule_to_initialize_and_make_the_first_input:
    output:
        'init_hmm'
rule make_model:
    input:
        '{0}_hmm'.format(dict[{run}])  # Create the loop by referencing the dict.
    output:
        '{run}_training_data'
rule train:
    input:
        '{run}_training_data'
    output:
        '{run}_hmm'
However, I can't access the wildcard {run} in the format function. Any hints on how I could get hold of {run} within the input line? Or is there maybe a better way of performing the iteration?

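I'm not sure if there's a better way to do the iteration, but the reason you can't access run is that wildcards are only substituted in plain strings listed directly under input or output. Inside a Python expression, {run} is ordinary Python syntax (a set literal containing the undefined name run), and it is evaluated before Snakemake knows any wildcard values. Snakemake lets you pass a function (for example a lambda) as an input; it is called with a wildcards object, so you need to do:
    input:
        lambda wildcards: '{0}_hmm'.format(dict[wildcards.run])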
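Outside Snakemake, the lookup that the lambda performs can be sketched in plain Python. This is only an illustration: SimpleNamespace stands in for the wildcards object that Snakemake would pass, and the names previous_run and hmm_input are assumptions, not part of the original workflow (the mapping is renamed here so that it does not shadow Python's built-in dict).

```python
from types import SimpleNamespace

# Maps each bootstrap round to the round whose model it starts from.
previous_run = {'boot1': 'init', 'boot2': 'boot1', 'final': 'boot2'}

def hmm_input(wildcards):
    """Return the HMM file that a given run depends on."""
    return '{0}_hmm'.format(previous_run[wildcards.run])

# Snakemake would call the function with the matched wildcards;
# here a SimpleNamespace plays that role (hypothetical stand-in).
chain = [hmm_input(SimpleNamespace(run=r)) for r in ('boot1', 'boot2', 'final')]
print(chain)
```

Each round resolves to the model of the previous one, which is the chain that the dictionary was meant to encode.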

Related

Snakemake: access a list within a dict by using a wildcard

To break it down, I have a dict that looks like this:
dict = {'A': ["sample1","sample2","sample3"],
        'B': ["sample1","sample2"],
        'C': ["sample1","sample2","sample3"]}
And I have a rule:
rule example:
    input:
        #some input
    params:
        # some params
    output:
        expand('{{x}}{sample}', sample=dict[wildcards.x])
        # the alternative I tried was
        # expand('{{x}}{sample}', sample=lambda wildcards: dict[wildcards.x])
    log:
        log = '{x}.log'
    run:
        """
        foo
        """
My problem is how to access the dictionary with wildcards.x as the key, so that I get the list of items corresponding to that wildcard.
The first example just gives me
    name 'wildcards' is not defined
while the alternative just gives me
    Missing input files for rule all
since Snakemake doesn't even run the example rule.
I need to use expand because I want the rule to run only once for each x wildcard while creating multiple samples in that one run.
You can use a lambda as a function of the wildcards in the input section only; it cannot be used in output. The output section does not receive wildcards, it defines them.
Let's look at your task from another angle. How do you decide how many samples the output produces? You are defining the dict: where does this information come from? You have not shown the actual script, but how does it know how many outputs to produce?
Logically you might have three separate rules (or at least two): one knows how to produce two samples, the other how to produce three.
As far as I can see, this is an XY problem: you are asking the same question twice without stating your actual problem (X), while forcing an incorrect implementation that defines all outputs in a dictionary (Y).
Update: another possible solution to your artificial example would be to use dynamic output (note that dynamic() has since been deprecated in Snakemake in favor of checkpoints):
rule example:
    input:
        #some input
    output:
        dynamic('{x}sample{n}')
That would work in your case because the files match the common pattern "sample{n}".
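To illustrate the point that functions of wildcards belong in input: if a downstream rule needs all samples of one x, the list can be built by an input function. Below is a plain-Python sketch of such a function. The names samples_by_group and sample_files, and the SimpleNamespace stand-in for the wildcards object, are assumptions made for illustration.

```python
from types import SimpleNamespace

samples_by_group = {
    'A': ["sample1", "sample2", "sample3"],
    'B': ["sample1", "sample2"],
    'C': ["sample1", "sample2", "sample3"],
}

def sample_files(wildcards):
    """List the files '{x}{sample}' for the group named by wildcards.x."""
    return [f"{wildcards.x}{sample}" for sample in samples_by_group[wildcards.x]]

# Hypothetical call, mimicking what Snakemake does for x = 'B'.
files_for_b = sample_files(SimpleNamespace(x='B'))
print(files_for_b)
```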

How to choose the second wildcard for a rule depending on the first?

I have two wildcards: the first is a species, and the second is basically a counter for each file from a specific species. So I have Cow_A, Cow_B and Pig_A, Pig_B, Pig_C, for example (the number of files varies from species to species).
Now what I want is for the input to select all files of one species, using something like this:
    input: expand("{{species}}_{rep}", rep=count(species))
    output: "{species}_combined"
How do I tell the input function count to use the current species wildcard?
You may use a function (or lambda) in the input section:
    input:
        lambda wildcards: expand("{species}_{rep}", species=[wildcards.species], rep=get_reps(wildcards.species))
    output:
        "{species}_combined"
You need to define a function that returns the reps for a species. That may be read from config or you may employ the glob_wildcards function:
    def get_reps(species):
        return glob_wildcards(f"{species}_{{rep}}").rep
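Conceptually, get_reps extracts the rep part from every file name that fits the pattern for one species. The following plain-Python sketch imitates that with a regular expression over an explicit list of names. It is not how glob_wildcards is implemented, and reps_for and the file list are made up for illustration.

```python
import re

def reps_for(species, filenames):
    """Return the rep part of every name of the form '<species>_<rep>'."""
    pattern = re.compile(re.escape(species) + r"_(?P<rep>.+)")
    reps = []
    for name in filenames:
        m = pattern.fullmatch(name)
        if m:
            reps.append(m.group("rep"))
    return reps

# Hypothetical directory listing.
files = ["Cow_A", "Cow_B", "Pig_A", "Pig_B", "Pig_C"]
cow_inputs = [f"Cow_{rep}" for rep in reps_for("Cow", files)]
print(cow_inputs)
```

In this sketch a pattern this loose also picks up a name such as Cow_combined once it exists, so a real workflow may need a stricter pattern or a wildcard constraint on rep.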
You have to make use of a function, either an input function:
    def inputfunction(wildcards):
        return expand("{species}_{rep}", species=wildcards.species, rep=count(wildcards.species))
    input: inputfunction
    output: "{species}_combined"
Or you can make use of a lambda function:
    input: lambda wildcards: expand("{species}_{rep}", species=wildcards.species, rep=count(wildcards.species))
    output: "{species}_combined"

Wildcard constraint that does not match a string

I have the following rule, in which I'm trying to constrain the wildcard sensor to any string except those starting with fitbit. The problem I'm facing is that the regex I'm using doesn't seem to match any string, so it is as if the rule did not exist (no output file is going to be generated).
rule readable_datetime:
    input:
        sensor_input = rules.download_dataset.output
    params:
        timezones = None,
        fixed_timezone = config["READABLE_DATETIME"]["FIXED_TIMEZONE"]
    wildcard_constraints:
        sensor = "^(?!fitbit).*" # ignoring fitbit sensors
    output:
        "data/raw/{pid}/{sensor}_with_datetime.csv"
    script:
        "../src/data/readable_datetime.R"
I'm getting this error message from a rule (light_metrics) that needs the output of readable_datetime with sensor=light as input:
MissingInputException in line 112 of features.snakefile:
Missing input files for rule light_metrics:
data/raw/p01/light_with_datetime.csv
I prefer to stay away from regexes if I can and maybe this works for you.
Assuming sensor is a list like:
sensor = ['fitbit', 'spam', 'eggs']
In rule readable_datetime use (with import re at the top of the Snakefile):
    wildcard_constraints:
        sensor = '|'.join([re.escape(x) for x in sensor if x != 'fitbit'])
Explained: re.escape(x) escapes metacharacters in x so that we are not going to have spurious matches if x contains '.' or '*'. x in sensor if x != 'fitbit' should be self-explanatory and you can make it as complicated as you want. Finally, '|'.join() stitches everything together in a regex that can match only the items in sensor captured by the list comprehension.
(Why your regex doesn't work I haven't investigated...)
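A quick way to see what the joined constraint accepts is to build it in plain Python and test it with re.fullmatch. The sensor names are the illustrative ones from this answer plus 'a.b', which is added here only to show the effect of re.escape. Using fullmatch is an assumption about how the constraint is meant to apply to a wildcard value.

```python
import re

sensor = ['fitbit', 'spam', 'eggs', 'a.b']

# Alternation of the escaped names, leaving out 'fitbit'.
constraint = '|'.join(re.escape(x) for x in sensor if x != 'fitbit')

def allowed(value):
    """True if value is exactly one of the non-fitbit sensor names."""
    return re.fullmatch(constraint, value) is not None

# 'aXb' checks that the escaped '.' no longer matches any character.
results = {name: allowed(name) for name in ['spam', 'eggs', 'fitbit', 'a.b', 'aXb']}
print(results)
```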
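My solution is simply to remove the ^ from the wildcard_constraints regex. This works because the constraint is not matched against the wildcard value on its own: it is embedded in a regex for the whole path, where ^ would have to match at the start of the path and therefore never matches at the wildcard's position.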
This is discussed briefly here:
https://bitbucket.org/snakemake/snakemake/issues/904/problems-with-wildcard-constraints
My understanding is that the regex you specify for each wildcard is substituted into a larger regex for the entire output path.
Without wildcard_constraints:
Searches for something like data/raw/(.*)/(.*)_with_datetime.csv, taking the first and second capture groups to be pid and sensor respectively.
With wildcard_constraints:
Searches for data/raw/(.*)/((?!fitbit).*)_with_datetime.csv, again taking the first and second capture groups to be pid and sensor respectively.
Here is a smaller working example:
rule all:
    input:
        "data/raw/p01/light_with_datetime.csv",
        "data/raw/p01/fitbit_light_with_datetime.csv",
        "data/raw/p01/light_fitbit_with_datetime.csv",
        "data/raw/p01/light_fitbit_foo_with_datetime.csv",
rule A:
    output:
        "data/raw/{pid}/{sensor}_with_datetime.csv"
    wildcard_constraints:
        sensor="(?!fitbit).*"
    shell:
        "touch {output}"
When you run snakemake, it only complains about the file with sensor starting with fitbit being missing, but happily finds the others.
snakemake
Building DAG of jobs...
MissingInputException in line 1 of Snakefile:
Missing input files for rule all:
data/raw/p01/fitbit_light_with_datetime.csv
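The substitution described in this answer can be checked in plain Python. The pattern below only approximates what Snakemake builds internally (the named groups and the .+ used for pid are assumptions for this sketch), but it shows why a leading ^ inside the constraint can never match in the middle of the path.

```python
import re

def path_regex(sensor_constraint):
    """Embed a sensor constraint into a regex for the whole output path."""
    return re.compile(
        r"data/raw/(?P<pid>.+)/(?P<sensor>" + sensor_constraint + r")_with_datetime\.csv"
    )

with_caret = path_regex(r"^(?!fitbit).*")
without_caret = path_regex(r"(?!fitbit).*")

light = "data/raw/p01/light_with_datetime.csv"
fitbit = "data/raw/p01/fitbit_light_with_datetime.csv"

caret_hit = with_caret.fullmatch(light)        # ^ cannot match mid-path
plain_hit = without_caret.fullmatch(light)     # matches, sensor is captured
fitbit_hit = without_caret.fullmatch(fitbit)   # rejected by the lookahead
```

With the caret, even the light file fails to match, which corresponds to the MissingInputException in the question; without it, only the path whose sensor starts with fitbit is rejected.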

SWI-Prolog predicate for reading in lines from input file

I'm trying to write a predicate to accept a line from an input file. Every time it's used, it should give the next line, until it reaches the end of the file, at which point it should return false. Something like this:
database :-
    see('blah.txt'),
    loop,
    seen.
loop :-
    accept_line(Line),
    write('I found a line.\n'),
    loop.
accept_line([Char | Rest]) :-
    get0(Char),
    C =\= "\n",
    !,
    accept_line(Rest).
accept_line([]).
Obviously this doesn't work. It works for the first line of the input file and then loops endlessly. I can see that I need to have some line like "C =\= -1" in there somewhere to check for the end of the file, but I can't see where it'd go.
So an example input and output could be...
INPUT
this is
an example
OUTPUT
I found a line.
I found a line.
Or am I doing this completely wrong? Maybe there's a built in rule that does this simply?
In SWI-Prolog, the most elegant way to do this is to first use a DCG to describe what a "line" means, and then use library(pio) to apply the DCG to a file.
An important advantage of this is that you can then easily apply the same DCG also on queries on the toplevel with phrase/2 and do not need to create a file to test the predicate.
There is a DCG tutorial that explains this approach, and you can easily adapt it to your use case.
For example:
:- use_module(library(pio)).
:- set_prolog_flag(double_quotes, codes).
lines --> call(eos), !.
lines --> line, { writeln('I found a line.') }, lines.
line --> ( "\n" ; call(eos) ), !.
line --> [_], line.
eos([], []).
Example usage:
?- phrase_from_file(lines, 'blah.txt').
I found a line.
I found a line.
true.
Example usage, using the same DCG to parse directly from character codes without using a file:
?- phrase(lines, "test1\ntest2").
I found a line.
I found a line.
true.
This approach can be very easily extended to parse more complex file contents as well.
If you want to read into code lists, see library(readutil), in particular read_line_to_codes/2 which does exactly what you need.
You can of course use the character I/O primitives, but at least use the ISO predicates. "Edinburgh-style" I/O is deprecated, at least for SWI-Prolog. Then:
get_line(L) :-
    get_code(C),
    get_line_1(C, L).
get_line_1(-1, []) :- !. % EOF
get_line_1(0'\n, []) :- !. % EOL
get_line_1(C, [C|Cs]) :-
    get_code(C1),
    get_line_1(C1, Cs).
This is of course a lot of unnecessary code; just use read_line_to_codes/2 and the other predicates in library(readutil).
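For readers who find the control flow easier to follow outside Prolog, the same character-at-a-time logic can be sketched in Python. This is only an analogy: the empty string returned by read(1) plays the role of the -1 end-of-file code, and the in-memory stream and its contents are assumptions taken from the example input in the question.

```python
import io

def get_line(stream):
    """Read characters up to (not including) the next newline or EOF."""
    chars = []
    while True:
        c = stream.read(1)          # '' signals end of file
        if c == "" or c == "\n":    # EOF or EOL ends the line
            return "".join(chars)
        chars.append(c)

stream = io.StringIO("this is\nan example")
first = get_line(stream)
second = get_line(stream)
```

As in the Prolog version, an empty result alone does not distinguish an empty line from end of file.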
Since strings were introduced to SWI-Prolog, there are some nifty new ways of reading. For example, to read all input and split it into lines, you can do:
read_string(user_input, _, S),
split_string(S, "\n", "", Lines)
See the examples in read_string/5 for reading linewise.
PS. Drop the see and seen etc. Instead:
setup_call_cleanup(open(Filename, read, In),
                   read_string(In, N, S), % or whatever reading you need to do
                   close(In))

Limiting character input to specific characters

I'm making a fully working add and subtract program as a nice little easy project. One thing I would love to know is if there is a way to restrict input to certain characters (such as 1 and 0 for the binary inputs and A and B for the add or subtract inputs). I could always replace all characters that aren't these with empty strings to get rid of them, but doing something like this is quite tedious.
Here is some simple code to filter out the specified characters from a user's input:
local filter = "10abAB"
local input = io.read()
input = input:gsub("[^" .. filter .. "]", "")
The filter variable is just set to whatever characters you want to be allowed in the user's input. As an example, if you want to allow c, add c: local filter = "10abcABC".
Although I assume that you get input from io.read(), it is possible that you get it from somewhere else, so you can just replace io.read() with whatever you need there.
The third line of code in my example is what actually filters the text. It uses string:gsub to do this, meaning that it could also be written like this:
input = string.gsub(input, "[^" .. filter .. "]", "")
The benefit of writing it like this is that it's clear that input is meant to be a string.
The gsub pattern is [^10abAB]. The ^ at the start of the set negates it, so the pattern matches every character that is not 1, 0, a, b, A, or B, and each match is replaced with the empty string given as the last argument, which removes it.
Bonus super-short one-liner that you probably shouldn't use:
local input = io.read():gsub("[^10abAB]", "")
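The same negated-character-set idea can be sketched in Python for comparison. The function name keep_only is made up for this illustration, and re.escape is added so that characters such as ] or ^ in the allowed set stay literal, which the Lua snippet above does not attempt.

```python
import re

def keep_only(text, allowed="10abAB"):
    """Remove every character of text that is not listed in allowed."""
    # re.escape keeps characters such as ']' or '^' literal inside the set.
    return re.sub("[^" + re.escape(allowed) + "]", "", text)

cleaned = keep_only("1021xA b")
print(cleaned)
```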