Snakemake: Pass all wildcards to a single shell command

Snakemake supports generalizing rules with wildcards, like so:
rule conversion:
    input:
        "file_{name}.txt"
    output:
        "file_{name}.csv"
    shell:
        "python process.py {wildcards.name}"
Now say I have another rule like:
rule all:
    input:
        expand("file_{output_types}.csv", output_types=["A", "B"])
When the all rule is invoked, Snakemake will see that conversion can produce the requested files file_A.csv and file_B.csv (assuming the input files exist).
What ends up running are two shell commands:
python process.py A
python process.py B
Is it possible to run just a single shell command using wildcards?
What I would like to happen is to use the wildcards to run the command:
python process.py A B
(In my use-case, process.py has a long spin-up, so I want to avoid running it multiple times in a row.)

What about this?
names = ['A', 'B']

rule conversion:
    input:
        expand("file_{output_types}.txt", output_types=names),
    output:
        expand("file_{output_types}.csv", output_types=names),
    params:
        names=names,
    shell:
        "python process.py {params.names}"

Related

Use wildcard in python shell rule

I would like to do something like the following in Snakemake:
rule RULE:
    input: ...
    output: "{tool}/{file}"
    shell: lambda wildcards: command_for_tool[wildcards.tool]
possibly with the shell command wrapped in a format(..., file=wildcards.file) to expand the {file} that will be inside the command_for_tool.
Currently I can do this using a run: block that calls shell(...), but I can't use that here because I'm benchmarking the memory usage of the rule, and going via Python adds 30+ MB of overhead.
It is possible to use some python code inside the shell: rule that returns a string, but in this case I cannot figure out how to use the wildcards.
It is also possible to use wildcards directly in a string value, where they will be substituted automatically, but this doesn't allow for the map-lookup I need.
Is there a clean solution to this? (Currently I'm trying to work around it using params:.) To me it seems like an omission/inconsistency in how snakemake works.
Using your own suggestion, a solution using params seems quite clean:
rule RULE:
    input:
        'in.txt',
    output:
        '{foo}.txt',
    params:
        cmd=lambda wc: command_for_tool[wc.foo],
    shell:
        """
        {params.cmd}
        """
although I agree that, for consistency with the input and params directives, shell: lambda wildcards: command_for_tool[wildcards.tool] ought to work as well.
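For completeness, here is a minimal sketch (with a hypothetical command_for_tool mapping and made-up tool commands) of how the {file} placeholder mentioned in the question could be filled in from the same lambda:

# hypothetical mapping from tool name to a command template containing {file}
command_for_tool = {
    'toolA': "toolA --fast --out {file}",
    'toolB': "toolB -o {file}",
}

rule RULE:
    input:
        'in.txt',
    output:
        '{tool}/{file}',
    params:
        # look up the tool's command template, then substitute the file wildcard into it
        cmd=lambda wc: command_for_tool[wc.tool].format(file=wc.file),
    shell:
        '{params.cmd}'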

Discard part of filename in Snakemake: "Wildcards in input files cannot be determined from output files"

I am running into a WildcardError: Wildcards in input files cannot be determined from output files problem with Snakemake. The issue is that I don't want to keep a variable part of my input file name. For instance, suppose I have these files.
$ mkdir input
$ touch input/a-foo.txt
$ touch input/b-wsdfg.txt
$ touch input/c-3523.txt
And I have a Snakemake file like this:
subjects = ['a', 'b', 'c']
result_pattern = "output/{kind}.txt"

rule all:
    input:
        expand(result_pattern, kind=subjects)

rule step1:
    input:
        "input/{kind}-{fluff}.txt"
    output:
        "output/{kind}.txt"
    shell:
        """
        cp {input} {output}
        """
I want the output file names to just have the part I'm interested in. I understand the principle that every wildcard in input needs a corresponding wildcard in output. So is what I'm trying to do a sort of anti-pattern? For instance, I suppose there could be two files input/a-foo.txt and input/a-bar.txt, and they would overwrite each other. Should I be renaming my input files prior to feeding into snakemake?
I want the output file names to just have the part I'm interested in [...]. I suppose there could be two files input/a-foo.txt and input/a-bar.txt, and they would overwrite each other.
It seems to me you need to decide how to resolve such conflicts. If the input files are:
input/a-bar.txt
input/a-foo.txt <- Note duplicate {a}
input/b-wsdfg.txt
input/c-3523.txt
How do you want the output files to be named and according to what criteria? The answer is independent of snakemake, but depending on your circumstances you could include Python code within the Snakefile to handle such conflicts automatically.
Basically, once you make such decisions you can work on the solution.
But suppose there are no file name conflicts, it seems like the wildcard system doesn't handle cases where you want to remove some variable fluff from a filename
The variable part can be handled using python's glob patterns inside an input function, so that the glob is built from the {kind} wildcard of each job:
import glob
...

rule step1:
    input:
        lambda wc: glob.glob("input/%s-*.txt" % wc.kind)
    output:
        "output/{kind}.txt"
    shell:
        """
        cp {input} {output}
        """
You could even be more elaborate and use a dedicated function to match files given the {kind} wildcard:
def get_kind_files(wc):
    ff = glob.glob("input/%s-*.txt" % wc.kind)
    if len(ff) != 1:
        raise Exception('Expected exactly 1 file for kind "%s"' % wc.kind)
    # Possibly more checks that you got the right file
    return ff

rule step1:
    input:
        get_kind_files,
    output:
        "output/{kind}.txt"
    shell:
        """
        cp {input} {output}
        """

Using the expand() function in snakemake to perform a shell command multiple times

I would like to run an R script multiple times on different input files with the help of Snakemake. To do this I tried using the expand function.
I am relatively new to Snakemake, and if I understand it correctly, the expand function gives me, for example, multiple input files, which are then all concatenated and available via {input}.
Is it possible to call the shell command on the files one by one?
Let's say I have this definition in my config.yaml:
types:
  - "A"
  - "B"
This would be my example rule:
rule manual_groups:
    input:
        expand("chip_{type}.bed", type=config["types"])
    output:
        expand("data/p_chip_{type}.model", type=config["types"])
    shell:
        "Rscript scripts/pre_process.R {input}"
This would lead to the command:
Rscript scripts/pre_process.R chip_A.bed chip_B.bed
Is it possible to instead call the command two times independently with two types like this:
Rscript scripts/pre_process.R chip_A.bed
Rscript scripts/pre_process.R chip_B.bed
Thank you for any help in advance!
Define the final target files in rule all, and then just use the appropriate wildcard (i.e., type) in rule manual_groups. This will run rule manual_groups separately for each output file listed in rule all.
rule all:
    input:
        expand("data/p_chip_{type}.model", type=config["types"])

rule manual_groups:
    input:
        "chip_{type}.bed"
    output:
        "data/p_chip_{type}.model"
    shell:
        "Rscript scripts/pre_process.R {input}"
PS: You may want to rename the wildcard type because of a potential conflict with Python's built-in type.
I agree with the answer of @ManavalanGajapathy that this is the most reliable solution to your problem. This, however, is not a full answer.
expand is just a regular Python function defined by Snakemake. That means you can use it anywhere you can use Python. It is simply a utility that takes a string and parameters for substitution, and returns a list of strings, each the result of a single substitution. This utility can be handy in many places. Below I'm providing a fancy example that illustrates the idea. Let's imagine that you need to take a text file as input and substitute some characters (the list being provided from the config), and that the only way you know to do it is as a pipeline of sed scripts, like this:
cat input.txt | sed 's/A/a/g' | sed 's/B/b/g' | sed 's/C/c/g' > output.txt
You come to the conclusion that you need to pipe together a chain of sed commands that differ only in two characters: sed 's/X/x/g'. Here is a solution using the expand function:
# 'from' is a reserved word in Python, and params are not yet available while
# the shell string is being built, so the substitution lists live in plain variables:
FROM = ["A", "B", "C"]
TO = ["a", "b", "c"]

rule substitute:
    input: "input.txt"
    output: "output.txt"
    shell: "cat {input} | " + " | ".join(expand("sed 's/{frm}/{to}/g'", zip, frm=FROM, to=TO)) + " > {output}"

Varying (known) number of outputs in Snakemake

I have a Snakemake rule that works on a data archive and essentially unpacks the data in it. The archives contain a varying number of files that I know before my rule starts, so I would like to exploit this and do something like
rule unpack:
    input: '{id}.archive'
    output:
        lambda wildcards: ARCHIVE_CONTENTS[wildcards.id]
but I can't use functions in output, and for good reason. However, I can't come up with a good replacement. The rule is very expensive to run, so I cannot do
rule unpack:
    input: '{id}.archive'
    output: '{id}/{outfile}'
and run the rule several times for each archive. Another alternative could be
rule unpack:
    input: '{id}.archive'
    output: '{id}/{outfile}'
    run:
        if os.path.isfile(output[0]):
            return
        ...
but I am afraid that would introduce a race condition.
Is marking the rule output with dynamic really the only option? I would be fine with auto-generating a separate rule for every archive, but I haven't found a way to do so.
Here it comes in handy that Snakemake is an extension of plain Python. You can generate a separate rule for each archive:
for id, contents in ARCHIVE_CONTENTS.items():
    rule:
        input:
            '{id}.tar.gz'.format(id=id)
        output:
            expand('{id}/{outfile}', id=id, outfile=contents)
        params:
            # the outputs are concrete paths, so no wildcards exist at run time;
            # pass the archive id through params instead
            id=id
        shell:
            'tar -C {params.id} -xf {input}'
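A target rule can then request every extracted file so that all of the generated per-archive rules are triggered; a sketch, assuming the same ARCHIVE_CONTENTS dict:

# flat list of all files expected from all archives
ALL_EXTRACTED = [
    '{id}/{outfile}'.format(id=id, outfile=outfile)
    for id, contents in ARCHIVE_CONTENTS.items()
    for outfile in contents
]

rule all:
    input: ALL_EXTRACTED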
Depending on what kind of archive this is, you could also have a single rule that just extracts the desired file, e.g.:
rule unpack:
    input:
        '{id}.tar.gz'
    output:
        '{id}/{outfile}'
    shell:
        'tar -C {wildcards.id} -xf {input} {wildcards.outfile}'

how can I pass a string in config file into the output section?

I'm new to Snakemake and have been trying to convert my shell-script-based pipeline into a Snakemake-based one today, and I've run into a lot of syntax issues. I think most of my trouble is around getting all the files in a particular directory and inferring output names from input names, since that's how I use shell scripts (a for loop). In particular, I tried to use the expand function in the output section and it always gave me an error.
After checking some example Snakefiles, I realized people never use expand in the output section. So my first question is: is output the only section where expand can't be used, and if so, why? What if I want to pass a prefix defined in the config.yaml file as part of the output file name, and that prefix cannot be inferred from the input file names; how can I achieve that, just like what I did below for the log section where {runid} is my prefix?
Second question, about syntax: I tried to pass a user-defined id from the configuration file (config.yaml) into the log section, and it seems that here I have to use expand in the following form. Is there a better way of passing strings defined in the config.yaml file?
log:
    expand("fastq/fastqc/{runid}_fastqc_log.txt", runid=config["run"])
where the config.yaml contains:
run: "run123"
Third question: I initially tried the following two approaches, but they gave me errors. Does that mean that inside the log (and probably the input and output) sections, Python syntax is not followed?
log:
    "fastq/fastqc/" + config["run"] + "_fastqc_log.txt"
log:
    "fastq/fastqc/{config["run"]}_fastqc_log.txt"
Here is an example of a small workflow:
# Sample IDs
SAMPLES = ["sample1", "sample2"]
CONTROL = ["sample1"]
TREATMENT = ["sample2"]

rule all:
    input: expand("{treatment}_vs_{control}.bed", treatment=TREATMENT, control=CONTROL)

rule peak_calling:
    input: control="{control}.sam", treatment="{treatment}.sam"
    output: "{treatment}_vs_{control}.bed"
    shell: "touch {output}"

rule mapping:
    input: "{samples}.fastq"
    output: "{samples}.sam"
    shell: "cp {input} {output}"
I used the expand function only in my final target. From there, snakemake can deduce the different values of the wildcards used in the rules "mapping" and "peak_calling".
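For instance, requesting the default target here resolves to three jobs (assuming sample1.fastq and sample2.fastq exist): Snakemake maps each .fastq to a .sam and then calls peak_calling on the pair, i.e. roughly:

cp sample1.fastq sample1.sam
cp sample2.fastq sample2.sam
touch sample2_vs_sample1.bed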
As for the last part, the right way to put it would be the first one:
log:
    "fastq/fastqc/" + config["run"] + "_fastqc_log.txt"
But again, snakemake can deduce it from your target (the rule all, in my example).
rule mapping:
    input: "{samples}.fastq"
    output: "{samples}.sam"
    log: "{samples}.log"
    shell: "cp {input} {output}"
Hope this helps!
You can use f-strings:
If this is your folder_with_configs/some_config.yaml:
var: value
Then simply
configfile:
    "folder_with_configs/some_config.yaml"

rule one_to_rule_all:
    output:
        f"results/{config['var']}.file"
    shell:
        "touch {output[0]}"
Do remember Python's rules about nesting different types of quotes (single quotes inside the double-quoted f-string above).
config in a Snakemake rule is a simple Python dictionary.
If you need to keep additional wildcards in the path, e.g. some_param, use doubled curly brackets: the f-string turns {{some_param}} into {some_param}, which Snakemake then treats as an ordinary wildcard.
rule one_to_rule_all:
    output:
        f"results/{config['var']}.{{some_param}}"
    shell:
        "touch {output[0]}"
Enjoy!