How to choose the second wildcard for a rule depending on the first? - snakemake

I have two wildcards: the first is a species, the second is essentially a counter for each file from that species. For example, I have Cow_A, Cow_B and Pig_A, Pig_B, Pig_C (the number of files varies from species to species).
What I want is for the input to select all files of one species, using something like this:
input: expand("{{species}}_{rep}", rep=count(species))
output: "{species}_combined"
How do I tell the input function count to use the current species wildcard?

You may use a function (or lambda) in the input section:
input:
    lambda wildcards: expand("{species}_{rep}", species=[wildcards.species], rep=get_reps(wildcards.species))
output:
    "{species}_combined"
You need to define a function that returns the reps for a species. That may be read from config or you may employ the glob_wildcards function:
def get_reps(species):
    return glob_wildcards(f"{species}_{{rep}}").rep
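As a minimal sketch of the config-based variant mentioned above, the reps could come from a plain dictionary. The REPS mapping and the combined_inputs helper are hypothetical names used only for illustration; in a Snakefile the same list would be returned from the input function.

```python
# Hypothetical stand-in for a config entry such as config["reps"];
# the mapping is an assumption, not part of the original question.
REPS = {"Cow": ["A", "B"], "Pig": ["A", "B", "C"]}


def get_reps(species):
    """Return the replicate labels known for one species."""
    return REPS[species]


def combined_inputs(species):
    """Build the file list an input function would return for '{species}_combined'."""
    return [f"{species}_{rep}" for rep in get_reps(species)]


print(combined_inputs("Pig"))
```

Reading the reps from a mapping keeps the workflow independent of what happens to be on disk, whereas glob_wildcards discovers them from existing files.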

You have to use a function, either a named input function:
def inputfunction(wildcards):
    return expand("{species}_{rep}", species=wildcards.species, rep=count(wildcards.species))
input: inputfunction
output: "{species}_combined"
Or a lambda function:
input: lambda wildcards: expand("{species}_{rep}", species=wildcards.species, rep=count(wildcards.species))
output: "{species}_combined"

Related

Snakemake: access a list within a dict by using a wildcard

To break it down, I have a dict that looks like this:
dict = {'A': ["sample1","sample2","sample3"],
        'B': ["sample1","sample2"],
        'C': ["sample1","sample2","sample3"]}
And I have a rule:
rule example:
    input:
        #some input
    params:
        # some params
    output:
        expand('{{x}}{sample}', sample=dict[wildcards.x])
        # the alternative I tried was
        # expand('{{x}}{sample}', sample=lambda wildcards: dict[wildcards.x])
    log:
        log = '{x}.log'
    run:
        """
        foo
        """
My problem is how to access the dictionary with wildcards.x as the key, so that I get the list of items corresponding to that wildcard.
The first example just gives me
name 'wildcards' is not defined
while the alternative just gives me
Missing input files for rule all
since Snakemake doesn't even run the example rule.
I need to use expand, since I want the rule to run only once for each x wildcard while creating multiple samples in this one run.
You can use a lambda as a function of the wildcards in the input section only; you cannot use one in the output. In fact, the output doesn't receive any wildcards; it defines them.
Let's rethink your task from another angle. How do you decide how many samples the output produces? You are defining the dict: where does this information come from? You have not shown the actual script, but how does it know how many outputs to produce?
Logically you might have three separate rules (or at least two): one knows how to produce two samples, the other how to produce three.
As far as I can see, you are experiencing an XY problem: you are asking the same question twice, but you are not expressing your actual problem (X), while forcing an incorrect implementation by defining all outputs as a dictionary (Y).
Update: another possible solution to your artificial example would be to use dynamic output:
rule example:
    input:
        #some input
    output:
        dynamic('{x}sample{n}')
That would work in your case because the files match the common pattern "sample{n}". Note that dynamic() has since been deprecated in Snakemake in favor of checkpoints.
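Independently of how the rule itself is written, the full list of files implied by such a dictionary can be built in plain Python, for example to pass to rule all. This is only a sketch: samples_by_x and all_targets are hypothetical names, and the '{x}{sample}' naming is taken from the question.

```python
# Same shape as the dictionary in the question, renamed so that it does
# not shadow the built-in dict type (the new name is an assumption).
samples_by_x = {
    "A": ["sample1", "sample2", "sample3"],
    "B": ["sample1", "sample2"],
    "C": ["sample1", "sample2", "sample3"],
}


def all_targets(mapping):
    """List every '{x}{sample}' file name implied by the mapping."""
    return [f"{x}{sample}" for x, samples in mapping.items() for sample in samples]


targets = all_targets(samples_by_x)
print(targets)
```

Because the list is computed before any rule is evaluated, no wildcards object is needed at this point.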

Wildcard constraint that does not match a string

I have the following rule, where I'm trying to constrain the wildcard sensor to any string except those starting with fitbit. The problem I'm facing is that the regex I'm using does not seem to match any string, so it is as if the rule did not exist (no output file is going to be generated).
rule readable_datetime:
    input:
        sensor_input = rules.download_dataset.output
    params:
        timezones = None,
        fixed_timezone = config["READABLE_DATETIME"]["FIXED_TIMEZONE"]
    wildcard_constraints:
        sensor = "^(?!fitbit).*" # ignoring fitbit sensors
    output:
        "data/raw/{pid}/{sensor}_with_datetime.csv"
    script:
        "../src/data/readable_datetime.R"
I'm getting this error message with a rule (light_metrics) that needs the output of readable_datetime with sensor=light as input:
MissingInputException in line 112 of features.snakefile:
Missing input files for rule light_metrics:
data/raw/p01/light_with_datetime.csv
I prefer to stay away from regexes if I can and maybe this works for you.
Assuming sensor is a list like:
sensor = ['fitbit', 'spam', 'eggs']
In rule readable_datetime use (this needs import re at the top of the Snakefile):
wildcard_constraints:
    sensor = '|'.join([re.escape(x) for x in sensor if x != 'fitbit'])
Explained: re.escape(x) escapes metacharacters in x so that we are not going to have spurious matches if x contains '.' or '*'. x in sensor if x != 'fitbit' should be self-explanatory and you can make it as complicated as you want. Finally, '|'.join() stitches everything together in a regex that can match only the items in sensor captured by the list comprehension.
(Why your regex doesn't work I haven't investigated...)
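The constraint built by that expression can be inspected in plain Python. The sketch below reuses the example list from this answer; the allowed helper is a hypothetical name added only to show which values the resulting pattern accepts when matched against a whole wildcard value.

```python
import re

# Example list from the answer above.
sensor = ["fitbit", "spam", "eggs"]

# Alternation of the permitted names, with regex metacharacters escaped.
constraint = "|".join(re.escape(x) for x in sensor if x != "fitbit")
print(constraint)


def allowed(name):
    """Return True if the whole name matches the constraint."""
    return re.fullmatch(constraint, name) is not None


print(allowed("spam"), allowed("fitbit"))
```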
My solution is simply to remove the ^ from the wildcard_constraints regex. This works because the regex is applied to the whole path containing the wildcard rather than just to the wildcard itself.
This is discussed briefly here:
https://bitbucket.org/snakemake/snakemake/issues/904/problems-with-wildcard-constraints
My understanding is that the regex you specify for each wildcard is substituted into a larger regex for the entire output path.
Without wildcard_constraints:
Searches for something like data/raw/(.*)/(.*)_with_datetime.csv, taking the first and second capture groups to be pid and sensor respectively.
With wildcard_constraints:
Searches for data/raw/(.*)/((?!fitbit).*)_with_datetime.csv, again taking the first and second capture groups to be pid and sensor respectively.
Here is a smaller working example:
rule all:
    input:
        "data/raw/p01/light_with_datetime.csv",
        "data/raw/p01/fitbit_light_with_datetime.csv",
        "data/raw/p01/light_fitbit_with_datetime.csv",
        "data/raw/p01/light_fitbit_foo_with_datetime.csv",
rule A:
    output:
        "data/raw/{pid}/{sensor}_with_datetime.csv"
    wildcard_constraints:
        sensor="(?!fitbit).*"
    shell:
        "touch {output}"
When you run snakemake, it only complains about the file with sensor starting with fitbit being missing, but happily finds the others.
snakemake
Building DAG of jobs...
MissingInputException in line 1 of Snakefile:
Missing input files for rule all:
data/raw/p01/fitbit_light_with_datetime.csv
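The effect of the ^ can be reproduced outside Snakemake with Python's re module. This is a rough sketch under the assumption stated above, namely that the constraint is embedded in one regex for the whole path; path_regex is a hypothetical helper and only approximates the pattern Snakemake builds internally.

```python
import re


def path_regex(sensor_constraint):
    """Approximate how a sensor constraint is embedded in the full output pattern."""
    return re.compile(
        r"data/raw/(?P<pid>.+)/(?P<sensor>"
        + sensor_constraint
        + r")_with_datetime\.csv"
    )


with_anchor = path_regex(r"^(?!fitbit).*")
without_anchor = path_regex(r"(?!fitbit).*")

light = "data/raw/p01/light_with_datetime.csv"
fitbit = "data/raw/p01/fitbit_light_with_datetime.csv"

# ^ can only match at the start of the whole string, so it never
# succeeds in the middle of the path and the rule matches nothing.
print(with_anchor.fullmatch(light))

# Without ^ the lookahead is checked where the sensor wildcard begins.
print(without_anchor.fullmatch(light).group("sensor"))
print(without_anchor.fullmatch(fitbit))
```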

How can I run a subset of my snakemake rules several times with wildcards?

I have a snakemake pipeline in which the input files are divided into two groups - input I would like to pass through the entire pipeline (true input) and input that should only pass through the first few rules (control input). How can I pass the true input through all rules and the control input only through the first few?
The most obvious solution would be delegation i.e. running all rules on the first group (true) and then copy-pasting the rules that I want to run on the second group (control) and providing these with the second group of input separately.
However I think this isn't good practice for code maintainability and I would much prefer a solution that utilised wildcards somehow.
The code below is a simplification of the problem with fewer rules:
INPUT = ["NAME1", "NAME2", "NAME3", "CONTROL"]
LABELS = ["A", "B", "C", "D"]
rule all:
    input:
        expand("output/{input}_results.txt",
               input = INPUT)
rule split_data:
    '''
    Read the true input and control then split them
    '''
    input:
        "data/{input}.txt"
    output:
        expand("data/{{input}}/{label}.txt", label = LABELS)
    script:
        "scripts/split_data.py"
rule run_true_data:
    '''
    Read only the true split and produce results.
    '''
    input:
        expand("data/{{input}}/{label}.txt", label = LABELS)
    output:
        "output/{input}_results.txt"
    script:
        "scripts/produce_results.py"
In the ideal version of the above, the input wildcard should produce [NAME1, NAME2, NAME3, CONTROL] for split_data only. Whilst run_true_data and all should receive only [NAME1, NAME2, NAME3].
In addition the labels should be generated depending on the wildcard (with a lambda, for example) but this is not important for now so I didn't include it to avoid confusing things.
Maybe you need to add some more details on the exact nature of the problem. In case you just need a different set of inputs for your second rule, why not add another wildcard for this step that limits the input to the required entries? Something along the lines of the script below:
INPUT = ["NAME1", "NAME2", "NAME3", "CONTROL"]
TRUE_INPUT = ["NAME1", "NAME2", "NAME3"]
LABELS = ["A", "B", "C", "D"]
rule all:
    input:
        expand("data/{input}/{label}.txt",
               input = INPUT, label = LABELS),
        expand("output/{true_input}_results.txt", true_input = TRUE_INPUT)
rule split_data:
    '''
    Read the true input and control then split them
    '''
    input:
        "data/{input}.txt"
    output:
        "data/{input}/{label}.txt"
    script:
        "scripts/split_data.py"
rule run_true_data:
    '''
    Read only the true split and produce results.
    '''
    input:
        lambda wildcards: ["data/{}/{}.txt".format(wildcards.true_input, label) for label in LABELS]
    output:
        "output/{true_input}_results.txt"
    script:
        "scripts/produce_results.py"
In this way you should be able to control both the labels and the input for the rule run_true_data.
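To see which files this setup requests, the target lists can be reproduced in plain Python. The sketch below uses itertools.product as a stand-in for what expand does with two lists; run_true_data_inputs is a hypothetical name that mirrors the lambda in the rule.

```python
from itertools import product

INPUT = ["NAME1", "NAME2", "NAME3", "CONTROL"]
TRUE_INPUT = ["NAME1", "NAME2", "NAME3"]
LABELS = ["A", "B", "C", "D"]

# Every input, including the control, is split into all labels.
split_targets = [f"data/{name}/{label}.txt" for name, label in product(INPUT, LABELS)]

# Only the true inputs are carried through to the results.
result_targets = [f"output/{name}_results.txt" for name in TRUE_INPUT]


def run_true_data_inputs(true_input):
    """Mirror the lambda used as input of run_true_data."""
    return [f"data/{true_input}/{label}.txt" for label in LABELS]


print(len(split_targets), result_targets)
```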

Recursion/looping of rules in snakemake

I'm trying to bootstrap an HMM training, so I need to loop through a few rules a couple of times. My idea was to do this:
dict={'boot1':'init', 'boot2':'boot1', 'final':'boot2'} # Define the workflow
rule a_rule_to_initialize_and_make_the_first_input:
    output:
        'init_hmm'
rule make_model:
    input:
        '{0}_hmm'.format(dict[{run}]) # Create the loop by referencing the dict.
    output:
        '{run}_training_data'
rule train:
    input:
        '{run}_training_data'
    output:
        '{run}_hmm'
However, I can't access the wildcard {run} in the format function. Any hints as to how I could get hold of {run} within the input line? Or is there maybe a better way of performing the iteration?
I'm not sure if there's a better way to do the iteration, but the reason you can't access run is because wildcards aren't parsed unless they're in a string directly in the list of inputs or outputs. Snakemake lets you define lambda functions that get passed a wildcards object, so you need to do:
input:
    lambda wildcards: '{0}_hmm'.format(dict[wildcards.run])
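The lookup performed by that lambda, and the order of models it implies, can be checked in plain Python. In this sketch PREVIOUS is the dictionary from the question under a name that does not shadow the built-in dict, and model_input and training_chain are hypothetical helpers for illustration.

```python
# Each run is trained from the model of the run it maps to.
PREVIOUS = {"boot1": "init", "boot2": "boot1", "final": "boot2"}


def model_input(run):
    """Same lookup as the lambda: the HMM file that a given run starts from."""
    return "{0}_hmm".format(PREVIOUS[run])


def training_chain(run):
    """Follow the mapping back to the initial model and return runs in build order."""
    chain = [run]
    while chain[-1] in PREVIOUS:
        chain.append(PREVIOUS[chain[-1]])
    return list(reversed(chain))


print(model_input("boot1"))
print(training_chain("final"))
```

Because 'init' has no entry in the mapping, the chain ends there, which corresponds to the initializing rule that produces 'init_hmm' without an HMM input.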

Learning jython string manipulation

I'm learning jython, and I want to see how to replace the suffix of a string.
For example, I have string:
com.foo.ear
and I want to replace the suffix to get:
com.foo.war
I cannot get replace or re.sub to work.
You mention re.sub; here's one way to use that:
import re
re.sub(r'\.ear$', '.war', 'com.foo.ear')
# -> 'com.foo.war'
The $ matches the end of the string, and the escaped \. matches a literal dot (an unescaped . would match any character).
Using replace would be even simpler, although it replaces every occurrence of 'ear', not only the suffix:
'com.foo.ear'.replace('ear','war')
# -> 'com.foo.war'
Edit:
And since that looks like a path, you may want to look into using os.path.splitext:
import os
'{0}{1}'.format(os.path.splitext('com.foo.ear')[0],'.war')
# -> 'com.foo.war'
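If neither regular expressions nor os.path is wanted, the suffix can also be swapped with plain string methods. The replace_suffix helper below is a hypothetical name, written as a small sketch that avoids newer syntax so it should also run on the Python 2 level that Jython provides.

```python
def replace_suffix(text, old, new):
    """Replace old with new only when text ends with old."""
    if text.endswith(old):
        return text[: len(text) - len(old)] + new
    return text


print(replace_suffix("com.foo.ear", ".ear", ".war"))
# A name that also contains 'ear' elsewhere is only changed at the end.
print(replace_suffix("ear.foo.ear", ".ear", ".war"))
```

Unlike str.replace, this leaves earlier occurrences of the suffix text untouched and returns the input unchanged when the suffix is absent.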