I can't find the right words to describe what I need, so please see the code.
My ideal workflow would do this:
I know in advance which categories will be created during the workflow (cates).
I don't know how many files, or which files, will be created in each category (files).
For each category, the rule create_file runs first.
Once the checkpoint has been triggered, I know which files were created for each category.
Then, for each file created by create_file, a mock rule append_to_file_name takes the file as input and does its operation.
The files produced by create_file are specific to the value of the cate wildcard, so I call what I need a "wildcard-specific" wildcard.
cates=["A", "B"]
# pretend that you don't know about the files about to be created
files={
    "A": ["a.txt", "b.txt"],
    "B": ["c.txt", "d.txt"]
}

def get_append_to_file_output(wildcards):
    files = glob_wildcards(f"{wildcards.cate}/{{file}}.txt").sample
    appened = expand(f"{wildcards.cate}/{{file}}_append.txt", file = files)
    return appened

rule all:
    input:
        get_append_to_file_output,

checkpoint create_file:
    output: ddd=directory("{cate}"),
    run:
        from pathlib import Path
        Path(output.ddd).mkdir(parents=True, exist_ok=True)
        for file in files[wildcards.cate]:
            Path(file).touch()

rule append_to_file_name:
    input: ddd="{cate}",
    output: "{cate}/{file}_append.txt",
    run:
        from pathlib import Path
        Path(output[0]).touch()
cates=["A", "B"]
# pretend that you don't know about the files about to be created
files={
    "A": ["a", "b"],
    "B": ["c", "d"]
}

def get_append_to_file_output(wildcards):
    cate_dir = checkpoints.create_file.get(**wildcards).output.ddd
    files = glob_wildcards(f"{cate_dir}/{{file}}.txt").file
    print(files)
    appened = expand(f"results/append_filename/{wildcards.cate}/{{file}}_append.txt", file = files)
    return appened

rule all:
    input:
        expand("results/flags/{cate}_aggregate.flag", cate=cates)

checkpoint create_file:
    output:
        ddd=directory("results/create_file/{cate}"),
    run:
        from pathlib import Path
        Path(output.ddd).mkdir(parents=True, exist_ok=True)
        for file in files[wildcards.cate]:
            Path(f"{output.ddd}/{file}.txt").touch()

rule append_to_filename:
    input:
        ddd=rules.create_file.output.ddd
    output:
        appended="results/append_filename/{cate}/{file}_append.txt"
    shell:
        "cp {input.ddd}/{wildcards.file}.txt {output.appended}"

rule fake_aggregate:
    input:
        get_append_to_file_output
    output:
        touch("results/flags/{cate}_aggregate.flag")
I've solved this by adding a fake aggregating rule
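The input function get_append_to_file_output is the key piece: it only lists the category directory once the checkpoint has produced it. The plain-Python sketch below imitates that mapping outside Snakemake. The helper name appended_targets is hypothetical, and pathlib globbing stands in for glob_wildcards, so treat it as an illustration of the path logic rather than as part of the workflow.

```python
from pathlib import Path
import tempfile


def appended_targets(cate_dir, cate):
    """Map every <name>.txt in cate_dir to its expected *_append.txt target.

    Hypothetical helper: mirrors the list that get_append_to_file_output
    builds, but uses pathlib globbing instead of Snakemake's glob_wildcards.
    """
    stems = sorted(p.stem for p in Path(cate_dir).glob("*.txt"))
    return [f"results/append_filename/{cate}/{stem}_append.txt" for stem in stems]


def demo():
    # Stand-in for the directory that checkpoint create_file writes for cate "A".
    with tempfile.TemporaryDirectory() as tmp:
        cate_dir = Path(tmp) / "results" / "create_file" / "A"
        cate_dir.mkdir(parents=True)
        for name in ("a", "b"):
            (cate_dir / f"{name}.txt").touch()
        return appended_targets(cate_dir, "A")
```

Calling demo() returns one target per file found in the directory, which is the list that the fake aggregating rule requests as its input.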
I have the following basic structure of the workflow:
files are downloaded from a remote server,
converted locally and then
analyzed.
One of the analyses is time-consuming, but it scales well if run on multiple input files at a time. The output of this rule does not depend on which files are analyzed together as a batch, as long as they all share the same settings. Upstream and downstream rules operate on individual files, so from the perspective of the workflow this rule is an outlier. Which files are to be run together can be told in advance, although ideally, if some of the inputs fail to be produced along the way, the rule should run on the reduced set of files.
The following example illustrates the problem:
samples = [ 'a', 'b', 'c', 'd', 'e', 'f' ]
groups = {
    'A': samples[0:3],
    'B': samples[3:6]
}

rule all:
    input:
        expand("done/{sample}.txt", sample = samples)

rule create:
    output:
        "created/{sample}.txt"
    shell:
        "echo {wildcards.sample} > {output}"

rule analyze:
    input:
        "created/{sample}.txt"
    output:
        "analyzed/{sample}.txt"
    params:
        outdir = "analyzed/"
    shell:
        """
        sleep 1 # or longer
        parallel md5sum {{}} \> {params.outdir}/{{/}} ::: {input}
        """

rule finalize:
    input:
        "analyzed/{sample}.txt"
    output:
        "done/{sample}.txt"
    shell:
        "touch {output}"
The rule analyze is the one that should produce multiple output files from multiple inputs, according to the assignment in groups. The rules create and finalize operate on individual files upstream and downstream, respectively.
Is there a way to implement such logic? I'd like to avoid splitting the workflow to accommodate this irregularity.
Note: this question is not related to the similar sounding question here.
If I understand correctly, rule analyze takes the input files created/a.txt, created/b.txt, created/c.txt for group A and produces the output files
analyzed/a.txt, analyzed/b.txt, analyzed/c.txt. The same applies to group B, so rule analyze runs twice and everything else runs 6 times.
If so, I would make rule analyze output a dummy file signaling that the files in group A (or B, etc.) have been analyzed. Downstream rules take this dummy file as input and will find the corresponding analyzed/{sample}.txt available.
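This approach relies on knowing, for each sample, which group's dummy file to wait for. A minimal plain-Python sketch of that lookup follows. The helper names done_flag_for and group_inputs are hypothetical, and the sketch assumes every sample belongs to exactly one group.

```python
samples = ["a", "b", "c", "d", "e", "f"]
groups = {"A": samples[0:3], "B": samples[3:6]}

# Invert the mapping: sample -> group (assumes each sample is in one group only).
inv_groups = {sample: group for group, members in groups.items() for sample in members}


def done_flag_for(sample):
    """Return the dummy file that signals the sample's group was analyzed."""
    return "created/%s.done" % inv_groups[sample]


def group_inputs(group):
    """Return the created/ files that the batch rule receives for one group."""
    return ["created/%s.txt" % s for s in groups[group]]
```

The first helper corresponds to what a per-sample downstream rule would request as input, the second to what the batch rule would request for one group.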
Here's your example:
samples = [ 'a', 'b', 'c', 'd', 'e', 'f' ]
groups = {
    'A': samples[0:3],
    'B': samples[3:6]
}

# Map samples to groups by inverting dict groups
inv_groups = {}
for x in samples:
    for k in groups:
        if x in groups[k]:
            inv_groups[x] = k

rule all:
    input:
        expand("done/{sample}.txt", sample = samples)

rule create:
    output:
        "created/{sample}.txt"
    shell:
        "echo {wildcards.sample} > {output}"

rule analyze:
    input:
        # Collect input for this group (A, B, etc)
        grp= lambda wc: ["created/%s.txt" % x for x in groups[wc.group]]
    output:
        done = touch('created/{group}.done'),
    shell:
        """
        # analyzed/ is not a declared output, so create it explicitly
        mkdir -p analyzed
        # Code that actually does the job...
        for x in {input.grp}
        do
            sn=`basename $x .txt`
            touch analyzed/$sn.txt
        done
        """

rule finalize:
    input:
        # Get dummy file for this {sample}.
        # If the dummy exists also the corresponding analyzed/{sample}.txt exists.
        done = lambda wc: 'created/%s.done' % inv_groups[wc.sample],
    output:
        fout= "done/{sample}.txt"
    params:
        fin= "analyzed/{sample}.txt",
    shell:
        "cp {params.fin} {output.fout}"
I would like to change the folder names and the names of the read files. I found something similar here (Move and rename files from multiple folders using snakemake):
workdir: "/path/to/workdir/"
import pandas as pd
from io import StringIO

sample_file = StringIO("""fastq sampleID
BOB_1234/fastq/BOB_1234.R1.fastq.gz TAG_1/fastq/TAG_1.R1.fastq.gz
BOB_1234/fastq/BOB_1234.R2.fastq.gz TAG_1/fastq/TAG_1.R2.fastq.gz
BOB_3421/fastq/BOB_3421.R1.fastq.gz TAG_2/fastq/TAG_2.R1.fastq.gz
BOB_3421/fastq/BOB_3421.R2.fastq.gz TAG_2/fastq/TAG_2.R2.fastq.gz""")

df = pd.read_table(sample_file, sep="\s+", header=0)
sampleID = df.sampleID
fastq = df.fastq

rule all:
    input:
        expand("{sample}", sample=df.sampleID)

rule move_and_rename_fastqs:
    input: fastq = lambda w: df[df.sampleID == w.sample].fastq.tolist()
    output: "{sample}"
    shell:
        """echo mv {input.fastq} {output}"""
I get an error:
MissingOutputException in line 19 of snakefile:
Job Missing files after 5 seconds:
TAG_1/fastq/TAG_1.R1.fastq.gz
What I currently want to do is add all the files created in "someDir/" to my DAG and add them to my report. The main problem is that those files are created in the checkpoint, so I can't define them as wildcards beforehand. allFiles(wildcards) currently returns the directory and not the files.
checkpoint someRule:
    input:
        "output/some.rds"
    output:
        directory("someDir/")

def allFiles(wildcards):
    checkpoints.someRule.get(**wildcards).output[0] # is "output/some.rds" instead of wildcards
    filenames, = glob_wildcards("someDir/{filenames}")
    return expand("someDir/{fn}", fn=filenames)

rule all:
    input:
        allFiles
Found a workaround. If someone has the same problem, this worked for me.
def aggregate_input(wildcards):
    checkpoint_output = checkpoints.someRule.get(**wildcards).output[0]
    return expand('someDir/{i}',
                  i=glob_wildcards(os.path.join(checkpoint_output, '{i}')).i)
One problem remains: the DAG doesn't include the checkpoint "someRule".
I am trying to use a file that is written during the run as an input to another rule, but it always gives me the error FileNotFoundError: [Errno 2] No such file or directory:
Is there a way to fix it, or another implementation with the same logic?
def vc_list(wildcards):
    my_list = []
    with open(wildcards.mydir+"/file_B.txt", 'r') as data_in:
        for line in data_in:
            my_list.append(line.strip())
    return(my_list)

# rule A will process file_A.txt and give me file_B.txt
rule A:
    input: "{mydir}/file_A.txt"
    output: "{mydir}/file_B.txt"
    shell: "seq 1 5 > {output}" # assume that `seq 1 5` is the output from processing the file

rule B:
    input: "{value}"
    output: "{value}.vc"
    shell: "pythoncode.py {input} {output}"

# rule C will process file_B.txt to give me the list of values used to expand the input, then will use rule B to produce them
rule C:
    input:
        processed_file = rules.A.output, #"{mydir}/file_B.txt",
        my_list = lambda wildcards: expand("{mydir}/{value}.vc", mydir=wildcards.mydir, value=vc_list(wildcards))
    output: "{mydir}/done.txt"
    shell: "touch {output}"

#I always have the error that "{mydir}/file_B.txt" does not exist
The error now:
test_loop.snakefile:
FileNotFoundError: [Errno 2] No such file or directory: 'read_file/file_B.txt'
Wildcards:
mydir=read_file
Thanks,
The answer to my question is to use a checkpoint, since dynamic will be deprecated.
Here is how the logic should be changed:
rule:
    input: 'done.txt'

checkpoint A:
    output: 'B.txt'
    shell: 'seq 1 2 > {output}'

rule N:
    input: "genome.fa"
    output: '{num}.bam'
    shell: "touch {output}"

rule B:
    input: '{num}.bam'
    output: '{num}.vc'
    shell: "touch {output}"

def aggregate_input(wildcards):
    with open(checkpoints.A.get(**wildcards).output[0], 'r') as f:
        return [num.rstrip() + '.vc' for num in f]

rule C:
    input: aggregate_input
    output: touch('done.txt')
Credit goes to Eric Lim
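The aggregate_input function boils down to reading the checkpoint's output file and turning each line into a target name. Below is a standalone sketch of that step. The helper name targets_from_listing is hypothetical, and a temporary file stands in for B.txt.

```python
import os
import tempfile


def targets_from_listing(path, suffix=".vc"):
    """Turn each non-empty line of the listing file into '<line><suffix>'."""
    with open(path) as handle:
        return [line.strip() + suffix for line in handle if line.strip()]


def demo():
    # Stand-in for B.txt as written by `seq 1 2`.
    with tempfile.TemporaryDirectory() as tmp:
        listing = os.path.join(tmp, "B.txt")
        with open(listing, "w") as handle:
            handle.write("1\n2\n")
        return targets_from_listing(listing)
```

Unlike the one-liner in aggregate_input, this sketch skips blank lines, which is a defensive choice of the example and not something the original workflow requires.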
Your script fails even before the workflow starts, during the pipeline construction phase.
So there is nothing surprising about rules A and B: Snakemake reads their input and output sections and finds no problem with them. Then it starts reading rule C, where the input section calls the vc_list() function, which in turn tries to read the file 'read_file/file_B.txt' even before the workflow has started! Of course it doesn't find the file and produces the error.
As for what to do, you need to clarify the task first. Most probably you are trying to use dynamic information in the input of the rule. In this case you need to use dynamic files or checkpoints.
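The timing problem can be reproduced without Snakemake. In this sketch, read_values is a hypothetical stand-in for vc_list: it fails when called before the file exists and succeeds once the producing step has run, which is the ordering a checkpoint is meant to enforce.

```python
import os
import tempfile


def read_values(path):
    """Read one value per line; raises FileNotFoundError if path is missing."""
    with open(path) as handle:
        return [line.strip() for line in handle]


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "file_B.txt")

        # Reader called while the workflow is still being built: no file yet.
        try:
            read_values(path)
            early = "ok"
        except FileNotFoundError:
            early = "missing"

        # Stand-in for rule A producing file_B.txt (like `seq 1 5`).
        with open(path, "w") as handle:
            handle.write("\n".join(str(i) for i in range(1, 6)) + "\n")

        # Reader called after the producer has run: the values are available.
        late = read_values(path)
        return early, late
```

The first call fails for the same reason the Snakefile does: the reader runs before anything has had a chance to create the file.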
So far I have used snakemake to generate individual plots. This has worked great! Now, though, I want to create a rule that creates a combined plot across the topics, without explicitly putting the topic names in the rule. See the combined_plot rule below.
topics=["soccer", "football"]
params=[1, 2, 3, 4]

rule all:
    input:
        expand("plot_p={param}_{topic}.png", topic=topics, param=params),
        expand("combined_p={param}_plot.png", param=params),

rule plot:
    input:
        "data_p={param}_{topic}.csv"
    output:
        "plot_p={param}_{topic}.png"
    shell:
        "plot.py --input={input} --output={output}"

rule combined_plot:
    input:
        # all data_p={param}_{topic}.csv files
    output:
        "combined_p={param}_plot.png"
    shell:
        "plot2.py " + # one "--input=" and one "--output" for each csv file
Is there a simple way to do this with snakemake?
If I understand correctly, the code below should be more straightforward as it replaces the lambda and the glob with the expand function. It will execute the two commands:
plot2.py --input=data_p=1_soccer.csv --input=data_p=1_football.csv --output combined_p=1_plot.png
plot2.py --input=data_p=2_soccer.csv --input=data_p=2_football.csv --output combined_p=2_plot.png
topics=["soccer", "football"]
params=[1, 2]

rule all:
    input:
        expand("combined_p={param}_plot.png", param=params),

rule combined_plot:
    input:
        csv= expand("data_p={{param}}_{topic}.csv", topic= topics)
    output:
        "combined_p={param}_plot.png",
    run:
        inputs= ['--input=' + x for x in input.csv]
        shell("plot2.py {inputs} --output {output}")
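The doubled braces in the expand call keep {param} as a wildcard while {topic} is filled in. Python's own str.format treats braces the same way, so the effect can be previewed with the sketch below. It uses str.format as a stand-in and does not call Snakemake's expand, and build_command is a hypothetical helper.

```python
topics = ["soccer", "football"]

# {{param}} survives formatting as {param}; {topic} is substituted.
patterns = ["data_p={{param}}_{topic}.csv".format(topic=t) for t in topics]


def build_command(param):
    """Assemble the plot2.py call for one value of the param wildcard."""
    csvs = [p.format(param=param) for p in patterns]
    inputs = " ".join("--input=" + c for c in csvs)
    return f"plot2.py {inputs} --output combined_p={param}_plot.png"
```

For param 1, build_command reproduces the first of the two commands listed above the Snakefile.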
I got a working version by using a lambda function as input, which receives the wildcards as 'wcs' (see here), and by using run instead of shell. In the run section I could first define a variable before executing the command with shell(...).
Instead of finding the files with glob, I could also have used the topics directly in the lambda function.
If anyone with more experience sees this, please tell me if this is the "right" way to do it.
from glob import glob

topics=["soccer", "football"]
params=[1, 2]

rule all:
    input:
        expand("plot_p={param}_{topic}.png", topic=topics, param=params),
        expand("combined_p={param}_plot.png", param=params),

rule plot:
    input:
        "data_p={param}_{topic}.csv"
    output:
        "plot_p={param}_{topic}.png"
    shell:
        "echo plot.py {input} {output}"

rule combined_plot:
    input:
        lambda wcs: glob("data_p={param}_*.csv".format(**wcs))
    output:
        "combined_p={param}_plot.png"
    run:
        inputs=" ".join(["--input " + inp for inp in input])
        shell("echo plot2.py {inputs}")