NameError when using awk in snakemake - snakemake

First, here is a test example (Snakefile):
rule test:
    input: "path/file1", "path/file2"
    output: "path/file3"
    shell:
        """
        awk 'NR==FNR{score[$3]=$5;next}{{sum=0}for(i=$2;i<=$3;i++){sum+=score[i]}printf "%-10s\\t%-10s\\n",sum,$4}' {input[0]} {input[1]} >> {output}
        """
When I run this Snakefile, it fails with: NameError: The name 'score' is unknown in this context. Please make sure that you defined that variable. Also note that braces not used for variable access have to be escaped by repeating them, i.e. {{print $1}}. I have tried wrapping score in braces ({score}) and repeating the braces, but neither works. How can I solve this? Thanks.

That is because snakemake formats the shell string and tries to replace every {...} placeholder with a variable. Since score is part of the awk script rather than a snakemake variable, the lookup fails and snakemake crashes. To escape this behaviour, double every curly bracket that belongs to awk, e.g. {{score[$3]=$5;next}}. It gets rather ugly with nested curly brackets, as in your rule:
rule test:
    input: "path/file1", "path/file2"
    output: "path/file3"
    shell:
        """
        awk 'NR==FNR{{score[$3]=$5;next}}{{{{sum=0}}for(i=$2;i<=$3;i++){{sum+=score[i]}}printf "%-10s\\t%-10s\\n",sum,$4}}' {input[0]} {input[1]} >> {output}
        """
(I hope I didn't miss any, but I think you get the idea)
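To avoid counting braces by hand, the doubling can be checked in plain Python. This is only a sketch: it uses str.format as a stand-in for snakemake's own formatter, and the helper escape_braces and the placeholder names infile_a, infile_b and outfile are made up for illustration.

```python
def escape_braces(text: str) -> str:
    """Double every brace so that str.format treats it as a literal character."""
    return text.replace("{", "{{").replace("}", "}}")


# awk program from the question, written with single braces as awk expects it.
awk_program = (
    'NR==FNR{score[$3]=$5;next}'
    '{{sum=0}for(i=$2;i<=$3;i++){sum+=score[i]}'
    'printf "%-10s\\t%-10s\\n",sum,$4}'
)

# Only the awk part is escaped; the file placeholders keep their single braces.
template = "awk '" + escape_braces(awk_program) + "' {infile_a} {infile_b} >> {outfile}"

# Formatting substitutes the placeholders and turns doubled braces back into single ones.
command = template.format(
    infile_a="path/file1", infile_b="path/file2", outfile="path/file3"
)
print(command)
```

After formatting, the awk program comes out exactly as it was written before escaping, which is what the shell should receive.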

Related

Use wildcard in python shell rule

I would like to do something like the following in Snakemake:
rule RULE:
    input: ...
    output: "{tool}/{file}"
    shell: lambda wildcards: command_for_tool[wildcards.tool]
possibly with the shell command wrapped in a format(.., file=wildcards.file) to expand the {file} that will be inside the command_for_tool.
Currently I can do this using a run: calling shell(..), but I can't use this because I'm benchmarking the memory usage of the rule and going via python adds 30+MB overhead.
It is possible to use some python code inside the shell: rule that returns a string, but in this case I cannot figure out how to use the wildcards.
It is also possible to use wildcards directly in a string value, where they will be substituted automatically, but this doesn't allow for the map-lookup I need.
Is there a clean solution to this? (Currently I'm trying to work around it using params:.) To me it seems like an omission/inconsistency in how snakemake works.
Using your own suggestion, a solution using params seems quite clean:
rule RULE:
    input:
        'in.txt',
    output:
        '{foo}.txt',
    params:
        cmd=lambda wc: command_for_tool[wc.foo],
    shell:
        """
        {params.cmd}
        """
That said, I can see that, for consistency with the input and params directives, shell: lambda wildcards: command_for_tool[wildcards.tool] should also work.
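The lookup that the params lambda performs can be tried outside snakemake. The sketch below is an assumption-laden stand-in: SimpleNamespace plays the role of the wildcards object, and the contents of command_for_tool (which the question does not show) as well as the helper command_for are invented for illustration.

```python
from types import SimpleNamespace

# Hypothetical lookup table: one shell command template per tool.
command_for_tool = {
    "gzip": "gzip -c {file} > {file}.gz",
    "wc": "wc -l {file}",
}


def command_for(wildcards) -> str:
    """Pick the template for the tool and fill in the file wildcard."""
    return command_for_tool[wildcards.tool].format(file=wildcards.file)


# Stand-in for the wildcards object that snakemake would pass to the lambda.
wc = SimpleNamespace(tool="gzip", file="reads.txt")
print(command_for(wc))
```

In a rule, the same function body could sit behind params: cmd=lambda wc: ..., so the lookup and the {file} expansion both happen before the shell command runs.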

Using the expand() function in snakemake to perform a shell command multiple times

I would like to run an R script multiple times on different input files with the help of snakemake. To do this I tried using the expand function.
I am relatively new to snakemake and, if I understand it correctly, the expand function gives me, for example, multiple input files, which are then all concatenated and available via {input}.
Is it possible to call the shell command on the files one by one?
Let's say I have this definition in my config.yaml:
types:
- "A"
- "B"
This would be my example rule:
rule manual_groups:
    input:
        expand("chip_{type}.bed", type=config["types"])
    output:
        expand("data/p_chip_{type}.model", type=config["types"])
    shell:
        "Rscript scripts/pre_process.R {input}"
This would lead to the command:
Rscript scripts/pre_process.R chip_A.bed chip_B.bed
Is it possible to instead call the command two times independently with two types like this:
Rscript scripts/pre_process.R chip_A.bed
Rscript scripts/pre_process.R chip_B.bed
Thank you for any help in advance!
Define the final target files in rule all, and then use the appropriate wildcard (i.e., type) in rule manual_groups. Snakemake then runs rule manual_groups separately for each output file listed in rule all.
rule all:
    input:
        expand("data/p_chip_{type}.model", type=config["types"])

rule manual_groups:
    input:
        "chip_{type}.bed"
    output:
        "data/p_chip_{type}.model"
    shell:
        "Rscript scripts/pre_process.R {input}"
PS: You may want to rename the wildcard type because of a potential conflict with Python's built-in type function.
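Why this gives one command per file can be illustrated with a much simplified model of wildcard matching. The sketch below is not snakemake's implementation: it assumes a regular expression as a stand-in for the output pattern, and OUTPUT_PATTERN and job_for are hypothetical names.

```python
import re

# Simplified stand-in for the output pattern "data/p_chip_{type}.model".
OUTPUT_PATTERN = re.compile(r"data/p_chip_(?P<type>.+)\.model")


def job_for(target: str) -> str:
    """Match one requested target and build the command for that single job."""
    match = OUTPUT_PATTERN.fullmatch(target)
    if match is None:
        raise ValueError(f"no rule produces {target}")
    # The wildcard value found in the output is reused in the input name.
    input_file = "chip_{type}.bed".format(**match.groupdict())
    return f"Rscript scripts/pre_process.R {input_file}"


# rule all requests one target per type; each target becomes its own job.
targets = [f"data/p_chip_{t}.model" for t in ["A", "B"]]
commands = [job_for(target) for target in targets]
for command in commands:
    print(command)
```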
I agree with the answer of @ManavalanGajapathy: it is the most reliable solution for your problem. This, however, is not a full answer.
expand is just a regular Python function defined in Snakemake, so you can use it anywhere you can use Python. It is a utility that takes a string and parameters for substitution and returns a list of strings, where each string is the result of a single substitution. This utility can be handy in many places. Below is a fancy example that illustrates the idea. Imagine that you need to take a text file as input and substitute some characters (the list should be provided from config), and that the only way you know to do it is a pipeline of sed commands, like this:
cat input.txt | sed 's/A/a/g' | sed 's/B/b/g' | sed 's/C/c/g' > output.txt
You come to the conclusion that you need to pipeline a chain of sed commands that differ in two symbols: sed 's/X/x/g'. Here is a solution using the expand function:
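# "from" is a Python keyword, so the two lists use the names SRC and DST.
# Module-level lists avoid relying on params inside the shell expression.
SRC = ["A", "B", "C"]
DST = ["a", "b", "c"]

rule substitute:
    input: "input.txt"
    output: "output.txt"
    shell: "cat {input} | " + " | ".join(expand("sed 's/{src}/{dst}/g'", zip, src=SRC, dst=DST)) + " > {output}"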
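The string that this rule assembles can be checked in plain Python without snakemake. In this sketch a list comprehension over zip stands in for expand with zip, and the helper name sed_pipeline is made up for illustration.

```python
def sed_pipeline(sources, targets, infile="input.txt", outfile="output.txt"):
    """Build one sed stage per (source, target) pair and chain them with pipes."""
    stages = [f"sed 's/{src}/{dst}/g'" for src, dst in zip(sources, targets)]
    return f"cat {infile} | " + " | ".join(stages) + f" > {outfile}"


command = sed_pipeline(["A", "B", "C"], ["a", "b", "c"])
print(command)
```

The result is the same hand-written pipeline shown earlier, with one sed stage per character pair.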

snakemake: correct quoting when using singularity

I want to run the following shell command
shell:
    """
    Rscript -e "rmarkdown::render('{input.markdown}', output_dir = 'output/{wildcards.version}', params = list(datapath = '../data/{wildcards.version}', max_lab_days = {config[max_lab_days]}, seed = {config[seed]}))"
    """
Everything is fine in normal mode, but it breaks down when setting --use-singularity. I guess this is some quoting-related issue, since singularity exec adds another layer of quotes here, right?
So, I guess my question is how to avoid this quotation hell - any ideas?
Okay, it turns out that the single quotes (') are the problem: never use them in a snakemake shell command, or it will not be portable to singularity execution. Fortunately, one can avoid them in the Rscript -e command by replacing ' with \".
Is that really necessary?
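The escaped form can at least be sanity-checked with Python's shlex module, which splits a string roughly the way a POSIX shell would. This sketch does not involve Singularity at all, and report.Rmd and output/v1 are placeholder values, not paths from the question.

```python
import shlex

# Rscript call with every single quote replaced by an escaped double quote.
command = r'Rscript -e "rmarkdown::render(\"report.Rmd\", output_dir = \"output/v1\")"'

# POSIX-style splitting: the whole R expression should arrive as one argument.
args = shlex.split(command)
for arg in args:
    print(arg)
```

The R expression reaches Rscript as a single argument with plain double quotes and without any single quotes left in the command.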

How can one access Snakemake config variables inside `shell` section?

In snakemake I would like to access keys from the config from within the shell: directive. I can use {input.foo}, {output.bar}, and {params.baz}, but {config.quux} is not supported. Is there a way to achieve this?
rule do_something:
    input: "source.txt"
    output: "target.txt"
    params:
        # access config[] here. parameter tracking is a side effect
        tmpdir = config['tmpdir']
    shell:
        # using {config.tmpdir} or {config['tmpdir']} here breaks the build
        "./scripts/do.sh --tmpdir {params.tmpdir} {input} > {output}; "
I could assign the parts of the config I want to a key under params and then use a {params.x} replacement, but this has unwanted side effects (e.g. the parameter is saved in the snakemake metadata, i.e. .snakemake/params_tracking). Using run: instead of shell: would be another workaround, but accessing {config.tmpdir} directly from the shell block would be most desirable.
"./scripts/do.sh --tmpdir {config[tmpdir]} {input} > {output}; "
should work here.
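The difference between the quoted and unquoted key can be reproduced with Python's own str.format, used here as a stand-in for snakemake's formatting. The config value /scratch/tmp is invented for illustration.

```python
config = {"tmpdir": "/scratch/tmp"}

# Unquoted key: the format mini-language looks up config["tmpdir"].
ok = "./scripts/do.sh --tmpdir {config[tmpdir]}".format(config=config)
print(ok)

# Quoted key: the quotes become part of the key, so the lookup fails.
failed_key = None
try:
    "{config['tmpdir']}".format(config=config)
except KeyError as err:
    failed_key = err.args[0]
print(repr(failed_key))
```

With the quotes, the lookup is for a key that literally contains the quote characters, which does not exist in the dictionary.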
It is stated in the documentation:
http://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html#standard-configuration
"For adding config placeholders into a shell command, Python string formatting syntax requires you to leave out the quotes around the key name, like so:"
shell:
    "mycommand {config[foo]} ..."

how can I pass a string in config file into the output section?

I am new to snakemake. Today I have been trying to convert my shell-script-based pipeline to snakemake and have run into a lot of syntax issues. Most of my trouble is around getting all the files in a particular directory and inferring output names from input names, since that is how my shell script works (a for loop). In particular, I tried to use the expand function in the output section and it always gave me an error.
After checking some example Snakefile, I realized people never use expand in the output section. So my first question is: is output the only section where expand can't be used and if so, why? What if I want to pass a prefix defined in config.yaml file as part of the output file and that prefix can not be inferred from input file names, how can I achieve that, just like what I did below for the log section where {runid} is my prefix?
Second question about syntax: I tried to pass a user defined id in the configuration file (config.yaml) into the log section and it seems to me that here I have to use expand in the following form, is there a better way of passing strings defined in config.yaml file?
log:
    expand("fastq/fastqc/{runid}_fastqc_log.txt", runid=config["run"])
where in the config.yaml
run:
  "run123"
Third question: I initially tried the following 2 methods but they gave me errors so does it mean that inside log (probably input and output) section, Python syntax is not followed?
log:
    "fastq/fastqc/"+config["run"]+"_fastqc_log.txt"

log:
    "fastq/fastqc/{config["run"]}_fastqc_log.txt"
Here is an example of a small workflow:
# Sample IDs
SAMPLES = ["sample1", "sample2"]
CONTROL = ["sample1"]
TREATMENT = ["sample2"]

rule all:
    input: expand("{treatment}_vs_{control}.bed", treatment=TREATMENT, control=CONTROL)

rule peak_calling:
    input: control="{control}.sam", treatment="{treatment}.sam"
    output: "{treatment}_vs_{control}.bed"
    shell: "touch {output}"

rule mapping:
    input: "{samples}.fastq"
    output: "{samples}.sam"
    shell: "cp {input} {output}"
I used the expand function only in my final target. From there, snakemake can deduce the different values of the wildcards used in the rules "mapping" and "peak_calling".
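What expand hands to rule all here is just a list of file names, one per combination of the given values. A rough plain-Python equivalent, using itertools.product as a stand-in for expand and the sample lists from the workflow above, could look like this:

```python
from itertools import product

CONTROL = ["sample1"]
TREATMENT = ["sample2"]

# One target per (treatment, control) combination.
targets = [f"{t}_vs_{c}.bed" for t, c in product(TREATMENT, CONTROL)]
print(targets)

# With more values the list grows to every combination.
more = [f"{t}_vs_{c}.bed" for t, c in product(["t1", "t2"], ["c1"])]
print(more)
```

Each name in the list is then matched against the output pattern of peak_calling, which is how the wildcard values are found.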
As for the last part, the right way to put it would be the first one:
log:
    "fastq/fastqc/" + config["run"] + "_fastqc_log.txt"
But again, snakemake can deduce it from your target (the rule all, in my example).
rule mapping:
    input: "{samples}.fastq"
    output: "{samples}.sam"
    log: "{samples}.log"
    shell: "cp {input} {output}"
Hope this helps!
You can use f-strings:
If this is your folder_with_configs/some_config.yaml:
var: value
Then simply
configfile:
    "folder_with_configs/some_config.yaml"

rule one_to_rule_all:
    output:
        f"results/{config['var']}.file"
    shell:
        "touch {output[0]}"
Do remember the Python rules for nesting different types of quotes.
config in the Snakemake rule is a simple Python dictionary.
If you need to use additional variables in a path, e.g. some_param, double the curly brackets so that the f-string leaves them in place.
rule one_to_rule_all:
    output:
        f"results/{config['var']}.{{some_param}}"
    shell:
        "touch {output[0]}"
enjoy
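What the f-string leaves behind can be checked in plain Python. In this sketch the config dictionary mirrors the YAML example above, and the later str.format call is only a stand-in for the step where a value is filled in for some_param.

```python
config = {"var": "value"}

# The f-string fills in the config value right away; the doubled braces
# collapse to single ones and survive as a placeholder.
pattern = f"results/{config['var']}.{{some_param}}"
print(pattern)

# The remaining placeholder can still be filled in later.
concrete = pattern.format(some_param="txt")
print(concrete)
```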