I need to use nested checkpoints in Snakemake, since for every dynamically created file I have to create further dynamic files. So far, I am unable to resolve the two checkpoints properly. Below, you find a minimal toy example.
It seems that until the first checkpoint is properly resolved, the second checkpoint is not even executed, so a single aggregate rule won't work.
I don't know how to invoke the two checkpoints and resolve their wildcards.
import os.path
import glob

rule all:
    input:
        'collect/all_done.txt'

# generate a random number of files
checkpoint create_files:
    output:
        directory('files')
    run:
        import os
        import random
        r = random.randint(1, 10)
        for x in range(r):
            output_dir = output[0] + '/' + str(x + 1)
            if not os.path.isdir(output_dir):
                os.makedirs(output_dir, exist_ok=True)
            output_file = output_dir + '/test.txt'
            print(output_file)
            with open(output_file, 'w') as f:
                f.write(str(x + 1))

checkpoint create_other_files:
    input: 'files/{i}/test.txt'
    output: directory('other_files/{i}/')
    shell:
        '''
        L=$(( $RANDOM % 10 ))
        for j in $(seq 1 $L); do
            mkdir -p {output}/$j
            cp -f {input} {output}/$j/test2.txt
        done
        '''

def aggregate(wildcards):
    i_wildcard = checkpoints.create_files.get(**wildcards).output[0]
    print('in_def_aggregate')
    print(i_wildcard)
    j_wildcard = checkpoints.create_other_files.get(**wildcards).output[0]
    print(j_wildcard)
    split_files = expand('other_files/{i}/{j}/test2.txt',
                         i=glob_wildcards(os.path.join(i_wildcard, '{i}/test.txt')).i,
                         j=glob_wildcards(os.path.join(j_wildcard, '{j}/test2.txt')).j)
    return split_files

# nonsense collect rule
rule collect:
    input: aggregate
    output: touch('collect/all_done.txt')
Currently, I get the following error from snakemake:
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       all
        1       collect
        1       create_files
        3
[Thu Nov 14 14:45:01 2019]
checkpoint create_files:
output: files
jobid: 2
Downstream jobs will be updated after completion.
Job counts:
        count   jobs
        1       create_files
        1
files/1/test.txt
files/2/test.txt
files/3/test.txt
files/4/test.txt
files/5/test.txt
files/6/test.txt
files/7/test.txt
files/8/test.txt
files/9/test.txt
files/10/test.txt
Updating job 1.
in_def_aggregate
files
[Thu Nov 14 14:45:02 2019]
Error in rule create_files:
jobid: 2
output: files
InputFunctionException in line 53 of /TL/stat_learn/work/feldmann/Phd/Projects/HIVImmunoAdapt/HIVIA/playground/Snakefile2:
WorkflowError: Missing wildcard values for i
Wildcards:
Removing output files of failed job create_files since they might be corrupted:
files
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
I am interested in having the files other_files/{checkpoint_1_wildcard}/{checkpoint_2_wildcard}/test2.txt.

I am not entirely sure what you were trying to do, so I rewrote it quite a bit. Does this clarify the problem?
import glob
import random
from pathlib import Path

rule all:
    input:
        'collect/all_done.txt'

checkpoint first:
    output:
        directory('first')
    run:
        for i in range(random.randint(1, 10)):
            Path(f"{output[0]}/{i}").mkdir(parents=True, exist_ok=True)
            Path(f"{output[0]}/{i}/test.txt").touch()

checkpoint second:
    input:
        'first/{i}/test.txt'
    output:
        directory('second/{i}')
    run:
        for j in range(random.randint(1, 10)):
            Path(f"{output[0]}/{j}").mkdir(parents=True, exist_ok=True)
            Path(f"{output[0]}/{j}/test2.txt").touch()

rule copy:
    input:
        'second/{i}/{j}/test2.txt'
    output:
        'copy/{i}/{j}/test2.txt'
    shell:
        """
        cp -f {input} {output}
        """

def aggregate(wildcards):
    # resolve the first checkpoint, then resolve the second checkpoint once per directory
    outputs_i = glob.glob(f"{checkpoints.first.get().output}/*/")
    outputs_i = [output.split('/')[-2] for output in outputs_i]
    split_files = []
    for i in outputs_i:
        outputs_j = glob.glob(f"{checkpoints.second.get(i=i).output}/*/")
        outputs_j = [output.split('/')[-2] for output in outputs_j]
        for j in outputs_j:
            split_files.append(f"copy/{i}/{j}/test2.txt")
    return split_files

rule collect:
    input:
        aggregate
    output:
        touch('collect/all_done.txt')
I have a massive Snakefile. The bits that are likely important are below. I want to make this a bit more flexible with the input files.
In the runini file, if lanenumlanelaeve = 1, I want Snakemake to start at rule cutadapt (as the samples would have been merged already) with the corresponding input files; if not, it should follow the normal flow of rules with the corresponding input files. I know an if/else needs to be placed, but I am not seeing how or where to add it. Maybe add something in the config file?
import configparser
import glob

# config file
configfile: 'rna.config.yaml'

# check run.ini file for various things
runini = configparser.ConfigParser()
runini.read('../Run/ini')
ss = runini['File']['SS']

rule all:
    input: complete

def fastq(wildcards):
    names = glob.glob(config['fq_glob'] % wildcards.sampleID)
    return sorted(names)

rule merge:
    input: fastq
    output:
        'merged_{sampleID}_merged_R1.fastq.gz',
        'merged_{sampleID}_merged_R2.fastq.gz'
    threads: 8
    params:
        r1 = config['pari_id'][0],
        r2 = config['pari_id'][1]
    run:
        r1 = [x for x in input if params.r1 in x]
        r2 = [x for x in input if params.r2 in x]
        shell('cat %s > {output[0]}' % ' '.join(r1))
        shell('cat %s > {output[1]}' % ' '.join(r2))

rule cutadapt:
    input: rules.merge.output
    output:
        r1 = 'trimmed/{sampleID}_trimmed_R1.fastq.gz',
        r2 = 'trimmed/{sampleID}_trimmed_R2.fastq.gz'
    log: 'multiqc/cutadapt/{sampleID}.cutadapt.log'
    threads: 16
    params: adapter = config['adapter_fa']
    run:
        shell('cutadapt -b {params.adapter} -B {params.adapter} '
              '--cores={threads} '
              '--minimum-length=20 '
              '-q 20 '
              '-o {output.r1} '
              '-p {output.r2} '
              '{input} > {log}')
It's not clear from the snippet you posted whether the following would work, since a lot of values and relationships have to be guessed. One possibility is to add an explicit Python conditional statement along these lines:
if myvar == 1:
    rule x:
        input: some_files
        output: processed_files
else:
    rule y:
        input: other_files
        output: processed_files
This type of conditional rule definition can often be avoided by writing more general rule definitions, but that would require knowing the full workflow. A sketch of the input-function alternative follows.
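For illustration, here is a minimal sketch of that alternative, assuming the decision is read from the config file; the key lanes_merged and the raw/ paths are made up, and the real condition would come from your run.ini:

# a minimal sketch, assuming a config flag decides whether merging already happened;
# the key 'lanes_merged' and the 'raw/' paths are hypothetical
def cutadapt_input(wildcards):
    if config.get('lanes_merged', False):
        # lanes were merged upstream: start cutadapt from the raw files
        return expand('raw/{sampleID}_R{read}.fastq.gz',
                      sampleID=wildcards.sampleID, read=[1, 2])
    # otherwise depend on the merge rule's outputs so it runs first
    return expand('merged_{sampleID}_merged_R{read}.fastq.gz',
                  sampleID=wildcards.sampleID, read=[1, 2])

rule cutadapt:
    input: cutadapt_input
    output:
        r1 = 'trimmed/{sampleID}_trimmed_R1.fastq.gz',
        r2 = 'trimmed/{sampleID}_trimmed_R2.fastq.gz'
    shell: 'cutadapt ... -o {output.r1} -p {output.r2} {input}'

This keeps a single cutadapt rule; only its dependencies change with the flag.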
When executing the Snakemake pipeline below, I get an error: IndexError: list index out of range. I think it is because fastqc_pretrim is being executed for all SAMPLES. However, not all samples pass basecalling QC, so only some of them will have files that need to be processed here. I am trying to use checkpointing to get this to run. Looking at the log, we can see it is trying to run fastqc_pretrim for sample "FAQ20773_pass_barcode01_68fda206_1". However, if you look above that line in the log, FAQ20773_fail_barcode03_68fda206_0 is actually the only sample that produced a .fastq.gz file. I am not sure why the correct sample is not being run.
LOG:
snakemake --use-conda --jobs 1 -pr
['FAQ20773_fail_barcode01_68fda206_0', 'FAQ20773_pass_barcode01_68fda206_2', 'FAQ20773_fail_barcode03_68fda206_0', 'FAQ20773_fail_barcode02_68fda206_0', 'FAQ20773_pass_barcode01_68fda206_0', 'FAQ20773_pass_barcode01_68fda206_1']
The flag 'directory' used in rule guppy_basecall_persample is only valid for outputs, not inputs.
Building DAG of jobs...
Updating job fastqc_pretrim.
basecall/FAQ20773_fail_barcode01_68fda206_0
[]
Updating job all.
Updating job fastqc_pretrim.
basecall/FAQ20773_pass_barcode01_68fda206_2
[]
Updating job all.
Updating job fastqc_pretrim.
basecall/FAQ20773_fail_barcode03_68fda206_0
['basecall/FAQ20773_fail_barcode03_68fda206_0/pass/fastq_runid_68fda20603fe08e9e2a4eef8718997203b603497_0_0.fastq.gz']
Updating job all.
Updating job fastqc_pretrim.
basecall/FAQ20773_fail_barcode02_68fda206_0
[]
Updating job all.
Updating job fastqc_pretrim.
basecall/FAQ20773_pass_barcode01_68fda206_0
[]
Updating job all.
Updating job fastqc_pretrim.
basecall/FAQ20773_pass_barcode01_68fda206_1
[]
Updating job all.
Using shell: /usr/bin/bash
[Thu Aug 26 13:13:51 2021]
rule fastqc_pretrim:
output: qc/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1.html, qc/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1_fastqc.zip
log: logs/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1.log
jobid: 19
reason: Missing output files: qc/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1_fastqc.zip
wildcards: sample=FAQ20773_pass_barcode01_68fda206_1
resources: tmpdir=/tmp
/home/hvasquezgross/miniconda3/envs/snakemake/bin/python3.9 /mypool/projects/steve_frese/snakemake_guppy_basecall/.snakemake/scripts/tmpzxqxx28h.wrapper.py
Activating conda environment: /mypool/projects/steve_frese/snakemake_guppy_basecall/.snakemake/conda/224336800b4f74953334e368c2f338c4
Traceback (most recent call last):
  File "/mypool/projects/steve_frese/snakemake_guppy_basecall/.snakemake/scripts/tmpzxqxx28h.wrapper.py", line 41, in <module>
    shell(
  File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/site-packages/snakemake/shell.py", line 130, in __new__
    cmd = format(cmd, *args, stepout=2, **kwargs)
  File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/site-packages/snakemake/utils.py", line 427, in format
    return fmt.format(_pattern, *args, **variables)
  File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/string.py", line 161, in format
    return self.vformat(format_string, args, kwargs)
  File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/string.py", line 165, in vformat
    result, _ = self._vformat(format_string, args, kwargs, used_args, 2)
  File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/string.py", line 205, in _vformat
    obj, arg_used = self.get_field(field_name, args, kwargs)
  File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/string.py", line 278, in get_field
    obj = obj[i]
  File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/site-packages/snakemake/io.py", line 1536, in __getitem__
    return super().__getitem__(key)
IndexError: list index out of range
[Thu Aug 26 13:13:52 2021]
Error in rule fastqc_pretrim:
jobid: 19
output: qc/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1.html, qc/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1_fastqc.zip
log: logs/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1.log (check log file(s) for error message)
conda-env: /mypool/projects/steve_frese/snakemake_guppy_basecall/.snakemake/conda/224336800b4f74953334e368c2f338c4
RuleException:
CalledProcessError in line 60 of /mypool/projects/steve_frese/snakemake_guppy_basecall/Snakefile:
Command 'source /home/hvasquezgross/miniconda3/bin/activate '/mypool/projects/steve_frese/snakemake_guppy_basecall/.snakemake/conda/224336800b4f74953334e368c2f338c4'; /home/hvasquezgross/miniconda3/envs/snakemake/bin/python3.9 /mypool/projects/steve_frese/snakemake_guppy_basecall/.snakemake/scripts/tmpzxqxx28h.wrapper.py' returned non-zero exit status 1.
File "/mypool/projects/steve_frese/snakemake_guppy_basecall/Snakefile", line 60, in __rule_fastqc_pretrim
File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/concurrent/futures/thread.py", line 52, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Snakefile:
import glob
import os

configfile: "config.yaml"
inputdirectory = config["directory"]
SAMPLES, = glob_wildcards(inputdirectory + "/{sample}.fast5", followlinks=True)
print(SAMPLES)

wildcard_constraints:
    sample="\w+\d+_\w+_\w+\d+_.+_\d"

##### target rules #####
rule all:
    input:
        expand('basecall/{sample}/sequencing_summary.txt', sample=SAMPLES),
        "qc/multiqc.html"

rule make_indvidual_samplefiles:
    input:
        inputdirectory + "/{sample}.fast5",
    output:
        "lists/{sample}.txt",
    shell:
        "basename {input} > {output}"

checkpoint guppy_basecall_persample:
    input:
        # note: the directory() flag is only valid for outputs, hence a plain path here
        directory=inputdirectory,
        samplelist="lists/{sample}.txt",
    output:
        summary="basecall/{sample}/sequencing_summary.txt",
        directory=directory("basecall/{sample}/"),
    params:
        config["basealgo"]
    shell:
        "guppy_basecaller -i {input.directory} --input_file_list {input.samplelist} "
        "-s {output.directory} -c {params} --compress_fastq -x \"auto\" "
        "--gpu_runners_per_device 3 --num_callers 2 --chunks_per_runner 200"

def aggregate_input(wildcards):
    checkpoint_output = checkpoints.guppy_basecall_persample.get(**wildcards).output[1]
    print(checkpoint_output)
    exparr = expand("basecall/{sample}/pass/{runid}.fastq.gz", sample=wildcards.sample,
                    runid=glob_wildcards(os.path.join(checkpoint_output, "pass/", "{runid}.fastq.gz")).runid)
    print(exparr)
    return exparr

rule fastqc_pretrim:
    input:
        aggregate_input
    output:
        html="qc/fastqc_pretrim/{sample}.html",
        # the suffix _fastqc.zip is necessary for multiqc to find the file;
        # if not using multiqc, you are free to choose an arbitrary filename
        zip="qc/fastqc_pretrim/{sample}_fastqc.zip"
    params: ""
    log:
        "logs/fastqc_pretrim/{sample}.log"
    threads: 1
    wrapper:
        "0.77.0/bio/fastqc"

rule multiqc:
    input:
        #expand("basecall/{sample}.fastq.gz", sample=SAMPLES)
        expand("qc/fastqc_pretrim/{sample}_fastqc.zip", sample=SAMPLES)
    output:
        "qc/multiqc.html"
    params:
        ""  # Optional: extra parameters for multiqc.
    log:
        "logs/multiqc.log"
    wrapper:
        "0.77.0/bio/multiqc"
I think you are making things more complicated than necessary by using checkpoint and wrapper. This is what I would do, more or less:
rule guppy_basecall_persample:
    input:
        ...
    output:
        summary="basecall/{sample}/sequencing_summary.txt",
        directory=directory("basecall/{sample}/"),
    shell:
        r"""
        guppy ...
        """

rule fastqc_pretrim:
    input:
        # plain path: the directory() flag is only valid for outputs
        directory="basecall/{sample}/",
    output:
        html="qc/fastqc_pretrim/{sample}.html",
        zip="qc/fastqc_pretrim/{sample}_fastqc.zip"
    shell:
        r"""
        fastqc {input.directory}/pass/*.fastq.gz
        """
I would like to change the folder names and the names of the read files. I found something similar here (Move and rename files from multiple folders using snakemake):
workdir: "/path/to/workdir/"

import pandas as pd
from io import StringIO

sample_file = StringIO("""fastq sampleID
BOB_1234/fastq/BOB_1234.R1.fastq.gz TAG_1/fastq/TAG_1.R1.fastq.gz
BOB_1234/fastq/BOB_1234.R2.fastq.gz TAG_1/fastq/TAG_1.R2.fastq.gz
BOB_3421/fastq/BOB_3421.R1.fastq.gz TAG_2/fastq/TAG_2.R1.fastq.gz
BOB_3421/fastq/BOB_3421.R2.fastq.gz TAG_2/fastq/TAG_2.R2.fastq.gz""")

df = pd.read_table(sample_file, sep="\s+", header=0)
sampleID = df.sampleID
fastq = df.fastq

rule all:
    input:
        expand("{sample}", sample=df.sampleID)

rule move_and_rename_fastqs:
    input: fastq = lambda w: df[df.sampleID == w.sample].fastq.tolist()
    output: "{sample}"
    shell:
        """echo mv {input.fastq} {output}"""
I get an error:
MissingOutputException in line 19 of snakefile:
Job Missing files after 5 seconds:
TAG_1/fastq/TAG_1.R1.fastq.gz
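For anyone hitting the same thing: the echo in the shell command only prints the mv command instead of executing it, so the declared output is never created, which is exactly what MissingOutputException reports. A minimal sketch of the fix, with the echo dropped (Snakemake pre-creates the parent directories of declared outputs, so no mkdir is needed):

rule move_and_rename_fastqs:
    input:
        fastq = lambda w: df[df.sampleID == w.sample].fastq.tolist()
    output:
        "{sample}"
    shell:
        # actually move the file instead of just echoing the command
        "mv {input.fastq} {output}"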
I am trying to use a file that will be written during the run as an input to another rule, but it always gives me the error FileNotFoundError: [Errno 2] No such file or directory.
Is there a way to fix this, or another implementation with the same logic?
def vc_list(wildcards):
    my_list = []
    with open(wildcards.mydir + "/file_B.txt", 'r') as data_in:
        for line in data_in:
            my_list.append(line.strip())
    return my_list

# rule A will process file_A.txt and give me file_B.txt
rule A:
    input: "{mydir}/file_A.txt"
    output: "{mydir}/file_B.txt"
    shell: "seq 1 5 > {output}"  # assume that `seq 1 5` is the output from processing the file

rule B:
    input: "{value}"
    output: "{value}.vc"
    shell: "pythoncode.py {input} {output}"

# rule C will process file_B.txt to get the list of values used to expand the input,
# then will use rule B to produce them
rule C:
    input:
        processed_file = rules.A.output,  # "{mydir}/file_B.txt"
        my_list = lambda wildcards: expand("{mydir}/{value}.vc", mydir=wildcards.mydir, value=vc_list(wildcards))
    output: "{mydir}/done.txt"
    shell: "touch {output}"

# I always get the error that "{mydir}/file_B.txt" does not exist
The error now:
test_loop.snakefile:
FileNotFoundError: [Errno 2] No such file or directory: 'read_file/file_B.txt'
Wildcards:
mydir=read_file
Thanks,
The answer to my question is to use a checkpoint, since dynamic() will be deprecated.
Here is how the logic should be changed:
rule:
    input: 'done.txt'

checkpoint A:
    output: 'B.txt'
    shell: 'seq 1 2 > {output}'

rule N:
    input: "genome.fa"
    output: '{num}.bam'
    shell: "touch {output}"

rule B:
    input: '{num}.bam'
    output: '{num}.vc'
    shell: "touch {output}"

def aggregate_input(wildcards):
    with open(checkpoints.A.get(**wildcards).output[0], 'r') as f:
        return [num.rstrip() + '.vc' for num in f]

rule C:
    input: aggregate_input
    output: touch('done.txt')
Credit goes to Eric Lim
Your script fails even before the workflow starts, during the pipeline-construction phase.
So there is nothing surprising regarding rules A and B: Snakemake reads their input and output sections and finds no problem with them. Then it starts reading rule C, where the input section calls the vc_list() function, which in turn tries to read the file 'read_file/file_B.txt' before the workflow has even started! It certainly doesn't find the file and produces the error.
As for what to do, you need to clarify the task first. Most probably you are trying to use dynamic information in the rule's input. In that case you need to use dynamic files or checkpoints.
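A minimal sketch of the checkpoint route, assuming rule A is declared as checkpoint A (essentially the pattern from the accepted answer above): calling checkpoints.A.get() inside the input function makes Snakemake postpone evaluating that function until checkpoint A has finished, so the file is only read once it exists.

def vc_list_deferred(wildcards):
    # get() forces re-evaluation after checkpoint A has run,
    # so file_B.txt is guaranteed to exist when we read it
    b_file = checkpoints.A.get(**wildcards).output[0]
    with open(b_file) as data_in:
        return [f"{wildcards.mydir}/{line.strip()}.vc" for line in data_in]

rule C:
    input: vc_list_deferred
    output: "{mydir}/done.txt"
    shell: "touch {output}"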
So far I have used Snakemake to generate individual plots, and this has worked great! Now, though, I want to create a rule that creates a combined plot across the topics, without explicitly putting the topic names in the rule. See the combined_plot rule below.
topics = ["soccer", "football"]
params = [1, 2, 3, 4]

rule all:
    input:
        expand("plot_p={param}_{topic}.png", topic=topics, param=params),
        expand("combined_p={param}_plot.png", param=params),

rule plot:
    input:
        "data_p={param}_{topic}.csv"
    output:
        "plot_p={param}_{topic}.png"
    shell:
        "plot.py --input={input} --output={output}"

rule combined_plot:
    input:
        # all data_p={param}_{topic}.csv files
    output:
        "combined_p={param}_plot.png"
    shell:
        "plot2.py "  # plus one "--input=" and one "--output" for each csv file
Is there a simple way to do this with snakemake?
If I understand correctly, the code below should be more straightforward, as it replaces the lambda and the glob with the expand function. It will execute these two commands:
plot2.py --input=data_p=1_soccer.csv --input=data_p=1_football.csv --output combined_p=1_plot.png
plot2.py --input=data_p=2_soccer.csv --input=data_p=2_football.csv --output combined_p=2_plot.png
topics = ["soccer", "football"]
params = [1, 2]

rule all:
    input:
        expand("combined_p={param}_plot.png", param=params),

rule combined_plot:
    input:
        # double braces keep {param} as a wildcard while topic is expanded
        csv = expand("data_p={{param}}_{topic}.csv", topic=topics)
    output:
        "combined_p={param}_plot.png",
    run:
        inputs = ['--input=' + x for x in input.csv]
        shell("plot2.py {inputs} --output {output}")
I got a working version by using a lambda function (with the wildcards passed in as 'wcs') as input (see here), and I used run instead of shell. In the run section I could first define a variable before executing the result with shell(...).
Instead of referring to the files with glob, I could also have used the topics directly in the lambda function.
If anyone with more experience sees this, please tell me if this is the "right" way to do it.
from glob import glob

topics = ["soccer", "football"]
params = [1, 2]

rule all:
    input:
        expand("plot_p={param}_{topic}.png", topic=topics, param=params),
        expand("combined_p={param}_plot.png", param=params),

rule plot:
    input:
        "data_p={param}_{topic}.csv"
    output:
        "plot_p={param}_{topic}.png"
    shell:
        "echo plot.py {input} {output}"

rule combined_plot:
    input:
        lambda wcs: glob("data_p={param}_*.csv".format(**wcs))
    output:
        "combined_p={param}_plot.png"
    run:
        inputs = " ".join(["--input " + inp for inp in input])
        shell("echo plot2.py {inputs}")