How to get a rule that would work the same on a directory and its sub-directories - snakemake

I am trying to make a rule that would work the same on a directory and any of its sub-directories (to avoid having to repeat the rule several times). I would like to have access to the name of the subdirectory if there is one.
My approach was to make the sub-directory optional. Since wildcards can be made to accept an empty string by explicitly giving the ".*" pattern, I tried the following rule:
rule test_optional_sub_dir:
    input:
        "{adir}/{bdir}/a.txt"
    output:
        "{adir}/{bdir,.*}/b.txt"
    shell:
        "cp {input} {output}"
I was hoping that this rule would match both A/b.txt and A/B/b.txt.
However, A/b.txt doesn't match the rule. (Neither does A//b.txt, which would be the literal omission of bdir; I guess the double / gets normalized away before the matching happens.)
The following rule works with both A/b.txt and A/B/b.txt:
rule test_optional_sub_dir2:
    input:
        "{path}/a.txt"
    output:
        "{path,.*}/b.txt"
    shell:
        "cp {input} {output}"
but the problem in this case is that I don't have easy access to the names of the directories in path. I could break {path} up with pathlib.Path, but this seems overly complicated.
Is there a better way to accomplish what I am trying to do?
Thanks a lot for your help.

How exactly you want to use the sub-directory in your rule might determine the best way to do this. Maybe something like:
def get_subdir(path):
    dirs = path.split('/')
    if len(dirs) > 1:
        return dirs[1]
    else:
        return ''

rule myrule:
    input:
        "{dirpath}/a.txt"
    output:
        "{dirpath}/b.txt"
    params:
        subdir = lambda wildcards: get_subdir(wildcards.dirpath)
    shell:
        # use {params.subdir} in your command, e.g.:
        "cp {input} {output}"
Of course, if your rule uses "run" or "script" instead of "shell" you don't even need that function and the subdir param, and can just figure out the subdir from the wildcard that gets passed into the script.
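For illustration, here is a minimal sketch of that variant (my addition, reusing the {dirpath} wildcard from above); with run, the helper function and the subdir param are unnecessary because the wildcard is available as a plain Python string:

rule myrule_run:
    input:
        "{dirpath}/a.txt"
    output:
        "{dirpath}/b.txt"
    run:
        # wildcards.dirpath is an ordinary string inside a run block
        dirs = wildcards.dirpath.split('/')
        subdir = dirs[1] if len(dirs) > 1 else ''
        # use subdir as needed, then produce the output
        shell("cp {input} {output}")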

With some further fiddling, I found something that is close to what I want:
Let's say I want at least one directory and no more than 2 optional ones below it. The following works. The only downside is that opt_dir1 and opt_dir2 contain the trailing slash rather than just the name of the directory.
rule test_optional_sub_dir3:
    input:
        "{mand_dir}/{opt_dir1}{opt_dir2}a.txt"
    output:
        "{mand_dir}/{opt_dir1}{opt_dir2}b.txt"
    wildcard_constraints:
        mand_dir="[^/]+",
        opt_dir1="([^/]+/)?",
        opt_dir2="([^/]+/)?"
    shell:
        "cp {input} {output}"
Still interested in better approaches if anyone has one.

Related

How to parallelize jobs for a list of files using snakemake (beginner question)

I am struggling with a very simple thing. As input to my snakemake pipeline I would like to take a directory, list its contents, and process each file from that directory in parallel. Naively I thought something like this should work:
rule all:
    input:
        "in/{test}.txt"
    output:
        "out/{test}.txt"
    shell:
        "echo {input} >> {output}"
This ends with the error
WorkflowError:
Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards.
All the resources I could find start with hard-coding the list of jobs in the script, which is something I want to avoid to keep the pipeline generic. The idea is to just point the pipeline to a directory with a list of files and let it do its job. Is this possible? Seems fairly simple and intuitive, but couldn't find an example showing that.
I don't know what command you used for this rule, but the following workflow should serve your purpose:
rule all:
    input:
        expand("out/{prefix}.txt", prefix=glob_wildcards("in/{test}.txt").test)

rule test:
    input:
        "in/{test}.txt"
    output:
        "out/{test}.txt"
    shell:
        "echo {input} >> {output}"
glob_wildcards is a Snakemake function that finds all files matching the specified pattern (in/{test}.txt in this case); .test then extracts the list of strings that matched {test} in the filenames (for example, "ab" in "in/ab.txt").
expand fills the placeholder variables wrapped in curly brackets with those values and generates the list of input file names.
So rule all requests a list of files corresponding to all the txt files in the in folder, which makes snakemake execute rule test for every one of them.
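To make that concrete, here is what the two functions return for a toy in/ directory containing ab.txt and cd.txt (illustrative values, not from the original answer):

names = glob_wildcards("in/{test}.txt").test
# names == ['ab', 'cd']
targets = expand("out/{prefix}.txt", prefix=names)
# targets == ['out/ab.txt', 'out/cd.txt']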

snakemake wildcard scoping between rules

I've read the docs on rules, the FAQ, and this question, but I still can't tell: if a wildcard foo is defined in rule bar, can its values be accessed in rule baz?
If I understood your question correctly, the answer should be "no".
By using a wildcard in a rule you just define a pattern that can be applied to many different files. For example, in this rule you define a way to produce files whose names match the pattern "new_name{n}.txt", where {n} can be any string:
rule example:
    input: "old_name{n}.txt"
    output: "new_name{n}.txt"
    shell: "cp {input} {output}"
Of course, this rule will be applied only if the file "old_name{n}.txt" exists for the same string {n}.
Returning to your question: how could you access the value if this is just a pattern that may be applied to different values?
Another possible interpretation of your question is that you need to know all the values (all the inputs) the rule bar was applied to. In this case you probably need to employ a checkpoint: this is the way to delay the pattern application until that part of the pipeline finishes. But even in this case you wouldn't be accessing "the values of a wildcard" explicitly.
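For reference, a minimal sketch of that checkpoint pattern, modeled on the structure in the Snakemake documentation (the rule and file names here are illustrative):

import os

checkpoint bar:
    output:
        directory("clustered")
    shell:
        "mkdir -p {output} && touch {output}/a.txt {output}/b.txt"

def gather_bar_outputs(wildcards):
    # Evaluated only after checkpoint 'bar' has finished, so
    # glob_wildcards sees the files it actually produced.
    outdir = checkpoints.bar.get(**wildcards).output[0]
    return expand(os.path.join(outdir, "{i}.txt"),
                  i=glob_wildcards(os.path.join(outdir, "{i}.txt")).i)

rule aggregate:
    input:
        gather_bar_outputs
    output:
        "aggregated.txt"
    shell:
        "cat {input} > {output}"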
I'm not sure I'm answering your question and what follows may not be entirely correct... Snakemake only cares that you have one and only one path that leads to the requested files (i.e., the files defined in the first rule, usually called all).
If rule bar defines wildcards that can lead to the final output, then yes, those wildcards are visible to the following rules.
In the script below we ask for files A.txt and B.txt. To produce A.txt we don't need any wildcard. To produce B.txt we need to pass by wildcard {wc} defined in rule bar and used in rule B. Note that wildcard {sample} doesn't appear at all outside rule all. Note also that rule bar produces two files, B.tmp and C.tmp, but rule B only needs B.tmp. (You should be able to dry-run this script with snakemake -p -n)
rule all:
    input:
        expand('{sample}.txt', sample=['A', 'B']),

rule A:
    output:
        'A.txt',
    shell:
        "touch {output}"

rule bar:
    output:
        expand('{wc}.tmp', wc=['B', 'C'])
    shell:
        r"""
        touch {output}
        """

rule B:
    input:
        '{wc}.tmp',
    output:
        '{wc}.txt',
    shell:
        r"""
        touch {input} {output}
        """

Varying (known) number of outputs in Snakemake

I have a Snakemake rule that works on a data archive and essentially unpacks the data in it. The archives contain a varying number of files that I know before my rule starts, so I would like to exploit this and do something like
rule unpack:
    input: '{id}.archive'
    output:
        lambda wildcards: ARCHIVE_CONTENTS[wildcards.id]
but I can't use functions in output, and for good reason. However, I can't come up with a good replacement. The rule is very expensive to run, so I cannot do
rule unpack:
    input: '{id}.archive'
    output: '{id}/{outfile}'
and run the rule several times for each archive. Another alternative could be
rule unpack:
    input: '{id}.archive'
    output: '{id}/{outfile}'
    run:
        if os.path.isfile(output[0]):
            return
        ...
but I am afraid that would introduce a race condition.
Is marking the rule output with dynamic really the only option? I would be fine with auto-generating a separate rule for every archive, but I haven't found a way to do so.
Here, it becomes handy that Snakemake is an extension of plain Python. You can generate a separate rule for each archive:
for id, contents in ARCHIVE_CONTENTS.items():
    rule:
        input:
            '{id}.tar.gz'.format(id=id)
        output:
            expand('{id}/{outfile}', id=id, outfile=contents)
        params:
            id=id
        shell:
            'tar -C {params.id} -xf {input}'
Depending on what kind of archive this is, you could also have a single rule that just extracts the desired file, e.g.:
rule unpack:
    input:
        '{id}.tar.gz'
    output:
        '{id}/{outfile}'
    shell:
        'tar -C {wildcards.id} -xf {input} {wildcards.outfile}'

How to gather files from subdirectories to run jobs in Snakemake?

I am currently working on a project where I am struggling with the following issue.
My current directory structure is
/shared/dir1/file1.bam
/shared/dir2/file2.bam
/shared/dir3/file3.bam
I want to convert various .bam files to fastq in the results directory
results/file1_1.fastq.gz
results/file1_2.fastq.gz
results/file2_1.fastq.gz
results/file2_2.fastq.gz
results/file3_1.fastq.gz
results/file3_2.fastq.gz
I have the following code:
END = ["1", "2"]
(dirs, files) = glob_wildcards("/shared/{dir}/{file}.bam")

rule all:
    input:
        expand("/results/{sample}_{end}.fastq.gz", sample=files, end=END)

rule bam_to_fq:
    input:
        "{dir}/{sample}.bam"
    output:
        left="/results/{sample}_1.fastq",
        right="/results/{sample}_2.fastq"
    shell:
        "/shared/packages/bam2fastq/bam2fastq --force -o /results/{sample}.fastq {input}"
This outputs the following error:
Wildcards in input files cannot be determined from output files:
'dir'
Any help would be appreciated
You're just missing an assignment for "dir" in the input directive of the rule bam_to_fq. In your code, you are asking Snakemake to determine "{dir}" from the output of the same rule, because you have it set up as a wildcard. Since it doesn't exist as a variable in your output directive, you get an error.
input:
    "{dir}/{sample}.bam"
output:
    left="/results/{sample}_1.fastq",
    right="/results/{sample}_2.fastq",
Rule of thumb: input and output wildcards must match
rule all:
    input:
        expand("/results/{sample}_{end}.fastq.gz", sample=files, end=END)

rule bam_to_fq:
    input:
        expand("{dir}/{{sample}}.bam", dir=dirs)
    output:
        left="/results/{sample}_1.fastq",
        right="/results/{sample}_2.fastq"
    shell:
        "/shared/packages/bam2fastq/bam2fastq --force -o /results/{wildcards.sample}.fastq {input}"
NOTES
the sample variable in the input directive now requires double {}, because that is how wildcards are escaped inside an expand (see the example after these notes).
dir is no longer a wildcard; it is explicitly set to the list of directories returned by the glob_wildcards call and assigned to the variable "dirs", which I assume you create earlier in your script, since the assignment of the other variable already works in your rule all input ("sample=files").
I like and recommend easily differentiable variable names. I'm not a huge fan of "dir" and "dirs": the similarity makes you prone to pedantic spelling errors, and one day someone will miss an 's' somewhere and it will be frustrating to debug. Consider changing them to "dirLIST" and "dir", or anything really. I'm personally guilty, and thus a slight hypocrite, as I use "sample=samples" in my core Snakefile; it has caused me minor stress, which is why I make this recommendation. Distinct names also make your code easier for others to read.
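To make the double-brace escaping concrete (illustrative values):

expand("{dir}/{{sample}}.bam", dir=["dirA", "dirB"])
# == ['dirA/{sample}.bam', 'dirB/{sample}.bam']
# {{sample}} survives expansion as the wildcard {sample}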
EDIT 1: adding to the response, as I had initially missed the requirement for key-value matching of dir and sample.
I recommend keeping the path and the sample name in separate variables. Two approaches I can think of:
Keep using glob_wildcards to make a blanket search for all possible variables, and then use a Python function to validate which path+file combinations are legitimate.
Drop the usage of glob_wildcards. Propagate the directory name as a wildcard variable, {dir}, throughout your rules. Just set it as a sub-directory of "results". Use pandas to pass known key-value pairs listed in a file to the rule all. Initially I suggest generating the key-value pairs file manually, but eventually its generation could just be a rule upstream of the others.
Generalizing bam_to_fq a little bit... utilizing an external config, something like....
from pandas import read_table

rule all:
    input:
        expand("/results/{sample[1][dir]}/{sample[1][file]}_{end}.fastq.gz",
               sample=read_table(config["sampleFILE"], sep=" ").iterrows(), end=['1', '2'])

rule bam_to_fq:
    input:
        "{dir}/{sample}.bam"
    output:
        left="/results/{dir}/{sample}_1.fastq",
        right="/results/{dir}/{sample}_2.fastq"
    shell:
        "/shared/packages/bam2fastq/bam2fastq --force -o /results/{wildcards.dir}/{wildcards.sample}.fastq {input}"
sampleFILE
dir file
dir1 file1
dir2 file2
dir3 file3

how can I pass a string in config file into the output section?

New to snakemake, and I've been trying to transform my shell-script-based pipeline into a snakemake-based one today and ran into a lot of syntax issues. I think most of the trouble comes from getting all the files in a particular directory and inferring output names from input names, since that's how I did it with shell scripts (for loops). In particular, I tried to use the expand function in the output section and it always gave me an error.
After checking some example Snakefiles, I realized people never use expand in the output section. So my first question is: is output the only section where expand can't be used, and if so, why? What if I want to pass a prefix defined in the config.yaml file as part of the output file, where that prefix cannot be inferred from the input file names? How can I achieve that, just like what I did below for the log section where {runid} is my prefix?
Second question, about syntax: I tried to pass a user-defined id from the configuration file (config.yaml) into the log section, and it seems that here I have to use expand in the following form. Is there a better way of passing strings defined in the config.yaml file?
log:
    expand("fastq/fastqc/{runid}_fastqc_log.txt", runid=config["run"])
where config.yaml contains:
run:
  "run123"
Third question: I initially tried the following two methods but they gave me errors, so does that mean that inside the log (and probably input and output) sections, Python syntax is not followed?
log:
    "fastq/fastqc/" + config["run"] + "_fastqc_log.txt"
log:
    "fastq/fastqc/{config["run"]}_fastqc_log.txt"
Here is an example of a small workflow:
# Sample IDs
SAMPLES = ["sample1", "sample2"]
CONTROL = ["sample1"]
TREATMENT = ["sample2"]

rule all:
    input: expand("{treatment}_vs_{control}.bed", treatment=TREATMENT, control=CONTROL)

rule peak_calling:
    input: control="{control}.sam", treatment="{treatment}.sam"
    output: "{treatment}_vs_{control}.bed"
    shell: "touch {output}"

rule mapping:
    input: "{samples}.fastq"
    output: "{samples}.sam"
    shell: "cp {input} {output}"
I used the expand function only in my final target. From there, snakemake can deduce the different values of the wildcards used in the rules "mapping" and "peak_calling".
As for the last part, the right way to put it would be the first one:
log:
    "fastq/fastqc/" + config["run"] + "_fastqc_log.txt"
But again, snakemake can deduce it from your target (the rule all, in my example).
rule mapping:
    input: "{samples}.fastq"
    output: "{samples}.sam"
    log: "{samples}.log"
    shell: "cp {input} {output}"
Hope this helps!
You can use f-strings:
If this is your folder_with_configs/some_config.yaml:
var: value
Then simply
configfile: "folder_with_configs/some_config.yaml"

rule one_to_rule_all:
    output:
        f"results/{config['var']}.file"
    shell:
        "touch {output[0]}"
Do remember the Python rules about nesting different types of quotes.
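For example (my illustration; before Python 3.12 you cannot reuse the outer quote character inside an f-string expression):

f"results/{config['var']}.file"    # OK: single quotes inside a double-quoted f-string
f'results/{config["var"]}.file'    # also OK: quote types swapped
# f"results/{config["var"]}.file"  # SyntaxError before Python 3.12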
config in the Snakemake rule is a simple Python dictionary.
If you need to use additional wildcard variables in a path, e.g. some_param, use doubled curly brackets so the f-string leaves the wildcard intact:
rule one_to_rule_all:
    output:
        f"results/{config['var']}.{{some_param}}"
    shell:
        "touch {output[0]}"
Enjoy!