I have the following basic structure of the workflow:
files are downloaded from a remote server,
converted locally and then
analyzed.
One of the analyses is time-consuming, but it scales well if run on multiple input files at a time. The output of this rule is independent of what files are analyzed together as a batch as long as they all share the same set of settings. Upstream and downstream rules operate on individual files, so from the perspective of the workflow, this rule is an outlier. What files are to be run together can told in advance, although ideally if some of the inputs failed to be produced along the way, the rule should be run on a reduced of files.
The following example illustrates the problem:
samples = [ 'a', 'b', 'c', 'd', 'e', 'f' ]
groups = {
'A': samples[0:3],
'B': samples[3:6]
}
rule all:
input:
expand("done/{sample}.txt", sample = samples)
rule create:
output:
"created/{sample}.txt"
shell:
"echo {wildcards.sample} > {output}"
rule analyze:
input:
"created/{sample}.txt"
output:
"analyzed/{sample}.txt"
params:
outdir = "analyzed/"
shell:
"""
sleep 1 # or longer
parallel md5sum {{}} \> {params.outdir}/{{/}} ::: {input}
"""
rule finalize:
input:
"analyzed/{sample}.txt"
output:
"done/{sample}.txt"
shell:
"touch {output}"
The rule analyze is the one to produce multiple output files from multiple inputs according to the assignment in groups. The rules create and finalize operate on individual files upstream and downstream, respectively.
Is there a way to implement such logic? I'd try like to try to avoid splitting the workflow to accommodate this irregularity.
Note: this question is not related to the similar sounding question here.
If I understand correctly. rule analyze takes in input files created/a.txt, created/b.txt, created/c.txt for group A and gives in output
analyzed/a.txt, analyzed/b.txt, analyzed/c.txt. The same for group B so rule analyze runs twice, everything else runs 6 times.
If so, I make rule analyze output a dummy file signaling that files in group A (or B, etc.) has been analyzed. Downstream rules will take in input this dummy file and will find the corresponding analyzed/{sample}.txtavailable.
Here's your example:
samples = [ 'a', 'b', 'c', 'd', 'e', 'f' ]
groups = {
'A': samples[0:3],
'B': samples[3:6]
}
# Map samples to groups by inverting dict groups
inv_groups = {}
for x in samples:
for k in groups:
if x in groups[k]:
inv_groups[x] = k
rule all:
input:
expand("done/{sample}.txt", sample = samples)
rule create:
output:
"created/{sample}.txt"
shell:
"echo {wildcards.sample} > {output}"
rule analyze:
input:
# Collect input for this group (A, B, etc)
grp= lambda wc: ["created/%s.txt" % x for x in groups[wc.group]]
output:
done = touch('created/{group}.done'),
shell:
"""
# Code that actually does the job...
for x in {input.grp}
do
sn=`basename $x .txt`
touch analyzed/$sn.txt
done
"""
rule finalize:
input:
# Get dummy file for this {sample}.
# If the dummy exists also the corresponding analyzed/{sample}.txt exists.
done = lambda wc: 'created/%s.done' % inv_groups[wc.sample],
output:
fout= "done/{sample}.txt"
params:
fin= "analyzed/{sample}.txt",
shell:
"cp {params.fin} {output.fout}"
Related
I need to process my input file values, turning them into a comma-separated string (instead of white space) in order to pass them to a CLI program. To do this, I want to run the input files through a Python function. How can I reference the input files of a rule in the params section of the same rule?
This is what I've tried, but it doesn't work:
rule a:
input:
foo="a.txt"
bar=expand({build}.txt,build=config["build"])
output:
baz=result.txt
params:
joined_bar=lambda w: ",".join(input.bar) # this doesn't work
shell:
"""
qux --comma-separated-files {params.joined_bar} \
--foo {input.foo} \
>{output.baz}
"""
It fails with:
InputFunctionException:
AttributeError: 'builtin_function_or_method' object has no attribute 'bar'
Potentially related but (over-)complicated questions:
How to define parameters for a snakemake rule with expand input
Is Snakemake params function evaluated before input file existence?
Turns out I need to explicitly add input to the lambda w: part:
rule a:
input:
foo="a.txt"
bar=expand({build}.txt,build=config["build"])
output:
baz=result.txt
params:
joined_bar=lambda w, input: ",".join(input.bar) # ', input' was added
shell:
"""
qux --comma-separated-files {params.joined_bar} \
--foo {input.foo} \
>{output.baz}
"""
Interestingly, I found that one needs to use input in the lambda w, input. In my testing, lambda w, i did not work.
And alternative is to refer to the rule input in the standard way: rules.a.input.bar:
rule a:
input:
foo="a.txt"
bar=expand({build}.txt,build=config["build"])
output:
baz=result.txt
params:
joined_bar=lambda w: ",".join(rules.a.input.bar) # 'rules.a.' was added
shell:
"""
qux --comma-separated-files {params.joined_bar} \
--foo {input.foo} \
>{output.baz}
"""
Also see http://biolearnr.blogspot.com/2017/11/snakemake-using-inputoutput-values-in.html for a discussion.
I have a massive Snakefile. The bits that are likely important are these below. I want to make this a bit more flexible with input files.
in the runini file; if lanenumlanelaeve = 1, I want snakemake to start on rule cutadapt (as samples would have been merged already) with corresponding input files, if not follow what normal flow of rules with those corresponding input files. I know an if else needs to placed. But, I am not seeing how to add this/where. Maybe add something in the configfile?
# config file
configfile:'rna.config.yaml
# check run.ini file for various things
runini = configparser.ConfigParser()
runini.read('../Run/ini')
ss = runini['File']['SS']
rule all:
input: complete
def fastq(wildcards):
names = glob.glob(config['fq_glob'] %wildcards.sampleID)
return sorted(names)
rule merge:
input: fastq
output:
'merged_{sampleID}_merged_R1.fastq.gz'
'merged_{sampleID}_merged_R2.fastq.gz'
threads: 8
params:
r1 = config['pari_id'][0],
r2 = config['pari_id'][1]
run:
r1 = [x for x in input if params.r1 in x]
r2 = [x for x in input if params.r2 in x]
shell('cat %s > {output[0]}' %' '.join(r1))
shell('cat %s > {output[1]}' %' '.join(r2))
rule cutadapt:
input: rules.merge_fastqs.output
output:
r1 = 'trimmed/{sampleID}_trimmed_R1.fastq.gz',
r2 = 'trimmed/{sampleID}_trimmed_R2.fastq.gz'
log: 'multiqc/cutadapt/{sampleID}.cutadapt.log'
threads: 16
params: adapter = config['adapter_fa']
run:
shell('cutadapt -b {params.adapter} -B {params.adapter} \
--cores={threads} \
--minimum-length=20 \
-q 20 \
-o {output.r1} \
-p {output.r2} \
{input} > {log}')
It's not clear from the snippet you posted if the following would work, since a lot of values and relations have to be guessed. One possibility is to add an explicit python conditional statement along these lines:
if myvar==1:
rule x:
input: some_files,
output: processed_files,
else:
rule y:
input: other_files,
output: processed_files,
This type of conditional rule definition can be avoided by having a more wholesome rule definitions, but that would require knowing the full workflow.
I am trying to use file that will be written during the run as an input to another rule, but it always give me error FileNotFoundError: [Errno 2] No such file or directory:
Is there a way to fix it or other implementation to have the same logic.
def vc_list(wildcards):
my_list = []
with open(wildcards.mydir+"/file_B.txt", 'r') as data_in:
for line in data_in:
my_list.append(line.strip())
return(my_list)
# rule A will process file_A.txt and give me file_B.txt
rule A:
input: "{mydir}/file_A.txt"
output: "{mydir}/file_B.txt"
shell: "seq 1 5 > {output}" # assume that `seq 1 5` is the output from proicessing the file
rule B:
input: "{vlaue}"
output: "{vlaue}.vc"
shell: "pythoncode.py {input} {output}"
# rule C will process file_B.txt to give me list of values that will be used to expanded the input, then will use rile B to produce it
rule C:
input:
processed_file = rules.A.output, #"{mydir}/file_B.txt",
my_list = lambda wildcards: expand("{mydir}/{value}.vc", mydir=wildcards.mydir, value=vc_list(wildcards))
output: "{mydir}/done.txt"
shell: "touch {output}"
#I always have the error that "{mydir}/file_B.txt" does not exist
The error now:
test_loop.snakefile:
FileNotFoundError: [Errno 2] No such file or directory: 'read_file/file_B.txt'
Wildcards:
mydir=read_file
Thanks,
The answer to my question is to use checkpoint as dynamic will be deprecated.
Here is how the logic should be changed:
rule:
input: 'done.txt'
checkpoint A:
output: 'B.txt'
shell: 'seq 1 2 > {output}'
rule N:
input: "genome.fa"
output: '{num}.bam'
shell: "touch {output}"
rule B:
input: '{num}.bam'
output: '{num}.vc'
shell: "touch {output}"
def aggregate_input(wildcards):
with open(checkpoints.A.get(**wildcards).output[0], 'r') as f:
return [num.rstrip() + '.vc' for num in f]
rule C:
input: aggregate_input
output: touch('done.txt')
Credit goes to Eric Lim
Your script fails even before the workflow starts, on the phase of the pipeline construction.
So, there is nothing surprising regarding the rules A and B: Snakemake reads their input and output sections and finds no problem with them. Then it starts reading the rule C where the input section calls the vc_list() function which in turn tries to read the file 'read_file/file_B.txt' even before the workflow has started! For sure it doesn't find the file and produces the error.
As for what to do, you need to clarify the task first. Most probable you are trying to use dynamic information in the input rule. In this case you need to use dynamic files or checkpoints.
I've some troubles with Snakemake, up to now I didn’t found pertinent informations
in the documentation (or somewhere else).
In fact, I've a big file with different samples (multiplexed analyses) and I would like to stop the execution of the pipeline for some sample according to result found after rules.
I've already tried to change this value out of a rule definition (using a checkpoint or a def), to make conditional input for folowing rules and to considere wildcards as a simple list to delete one item.
Below is an example of what I want to do (the conditional if is only indicative here) :
# Import the config file(s)
configfile: "../PATH/configfile.yaml"
# Wildcards
sample = config["SAMPLE"]
lauch = config["LAUCH"]
# Rules
rule all:
input:
expand("PATH_TO_OUTPUT/{lauch}.{sample}.output", lauch=lauch, sample=sample)
rule one:
input:
"PATH_TO_INPUT/{lauch}.{sample}.input"
output:
temp("PATH_TO_OUTPUT/{lauch}.{sample}.output.tmp")
shell:
"""
somescript.sh {input} {output}
"""
rule two:
input:
"PATH_TO_OUTPUT/{lauch}.{sample}.output.tmp"
output:
"PATH_TO_OUTPUT/{lauch}.{sample}.output"
shell:
"""
somecheckpoint.sh {input} # Print a message and write in the log file for now
if [ file_dont_pass_checkpoint ]; then
# Delete the correspondant sample to the wildcard {sample}
# to continu the analysis only with samples who are pass the validation
fi
somescript2.sh {input} {output}
"""
If someone has an idea I'm interested.
Thank you in advance for your answers.
I think this is an interesting situation if I understand it correctly. If a sample passes some checks, then keep analysing it. Otherwise, stop early.
At the end of the pipeline, every sample must have a PATH_TO_OUTPUT/{lauch}.{sample}.output since this what the rule all asks for regardless of the check results.
You could have the rule(s) performing the checks writing a file containing a flag indicating whether for that sample the checks passed or not (say flag PASS or FAIL). Then according to that flag, the rule(s) doing the analysis either go for the full analysis (if PASS) or write an empty file (or whathever) if the flag is FAIL. Here's the gist:
rule all:
input:
expand('{sample}.output', sample= samples),
rule checker:
input:
'{sample}.input',
output:
'{sample}.check',
shell:
r"""
if [ some_check_is_ok ]
then
echo "PASS" > {output}
else
echo "FAIL" > {output}
fi
"""
rule do_analysis:
input:
chk= '{sample}.check',
smp= '{sample}.input',
output:
'{sample}.output',
shell:
r"""
if [ {input.chk} contains "PASS"]:
do_long_analysis.sh {input.smp} > {output}
else:
> {output} # Do nothing: empty file
"""
If you don't want to see the failed, empty output files at all, you could use the onsuccess directive to get rid of them at the end of the pipeline:
onsuccess:
for x in expand('{sample}.output', sample= samples):
if os.path.getsize(x) == 0:
print('Removing failed sample %s' % x)
os.remove(x)
The canonical solution to problems like this is to use checkpoints. Consider the following example:
import pandas as pd
def get_results(wildcards):
qc = pd.read_csv(checkpoints.qc.get().output[0].open(), sep="\t")
return expand(
"results/processed/{sample}.txt",
sample=qc[qc["some-qc-criterion"] > config["qc-threshold"]]["sample"]
)
rule all:
input:
get_results
checkpoint qc:
input:
expand("results/preprocessed/{sample}.txt", sample=config["samples"])
output:
"results/qc.tsv"
shell:
"perfom-qc {input} > {output}"
rule process:
input:
"results/preprocessed/{sample}.txt"
output:
"results/processed/{sample.txt}"
shell:
"process {input} > {output}"
The idea is the following: at some point in your pipeline, after some (let's say) preprocessing, you add a checkpoint rule, which aggregates over all samples and generates some kind of QC table. Then, downstream of that, there is a rule that aggregates over samples (e.g. the rule all, or some other aggregation inside of the workflow). Let's say in that aggregation you only want to consider samples that pass the QC. For that, you let the required files ("results/processed/{sample}.txt") be determined via an input function, which reads the QC table generated by the checkpoint rule. Snakemake's checkpoint mechanism ensures that this input function is evaluated after the checkpoint has been executed, so that you can actually read the table results and base your decision about the samples on the qc criteria contained in that table. Any intermediate rules (like here the process rule) will then be automatically applied by Snakemake when re-evaluating the DAG.
So far I used snakemake to generate individual plots with snakemake. This has worked great! Now though, I want to create a rule that creates a combined plot across the topics, without explicitly putting the name in the plot. See the combined_plot rule below.
topics=["soccer", "football"]
params=[1, 2, 3, 4]
rule all:
input:
expand("plot_p={param}_{topic}.png", topic=topics, param=params),
expand("combined_p={param}_plot.png", param=params),
rule plot:
input:
"data_p={param}_{topic}.csv"
output:
"plot_p={param}_{topic}.png"
shell:
"plot.py --input={input} --output={output}"
rule combined_plot:
input:
# all data_p={param}_{topic}.csv files
output:
"combined_p={param}_plot.png"
shell:
"plot2.py " + # one "--input=" and one "--output" for each csv file
Is there a simple way to do this with snakemake?
If I understand correctly, the code below should be more straightforward as it replaces the lambda and the glob with the expand function. It will execute the two commands:
plot2.py --input=data_p=1_soccer.csv --input=data_p=1_football.csv --output combined_p=1_plot.png
plot2.py --input=data_p=2_soccer.csv --input=data_p=2_football.csv --output combined_p=2_plot.png
topics=["soccer", "football"]
params=[1, 2]
rule all:
input:
expand("combined_p={param}_plot.png", param=params),
rule combined_plot:
input:
csv= expand("data_p={{param}}_{topic}.csv", topic= topics)
output:
"combined_p={param}_plot.png",
run:
inputs= ['--input=' + x for x in input.csv]
shell("plot2.py {inputs} --output {output}")
I got a working version, by using a function called 'wcs' as input (see here) and I used run instead of shell. In the run section I could first define a variable before executing the result with shell(...).
Instead of referring to the files with glob I could also have directly used the topics in the lambda function.
If anyone with more experience sees this, please tell me if this is the "right" way to do it.
from glob import glob
topics=["soccer", "football"]
params=[1, 2]
rule all:
input:
expand("plot_p={param}_{topic}.png", topic=topics, param=params),
expand("combined_p={param}_plot.png", param=params),
rule plot:
input:
"data_p={param}_{topic}.csv"
output:
"plot_p={param}_{topic}.png"
shell:
"echo plot.py {input} {output}"
rule combined_plot:
input:
lambda wcs: glob("data_p={param}_*.csv".format(**wcs))
output:
"combined_p={param}_plot.png"
run:
inputs=" ".join(["--input " + inp for inp in input])
shell("echo plot2.py {inputs}")