Snakemake: remove output file - snakemake

I don't see how to use a Snakemake rule to remove a Snakemake output file that has become useless.
In concrete terms, I have a rule bwa_mem_sam that creates a file named {sample}.sam.
I have another rule, bwa_mem_bam, that creates a file named {sample}.bam.
As the two files contain the same information in different formats, I'd like to remove the first one, but I cannot manage to do this.
Any help would be very much appreciated.
Ben.
rule bwa_mem_map:
    input:
        sam="{sample}.sam",
        bam="{sample}.bam"
    shell:
        "rm {input.sam}"
# Convert SAM to BAM.
rule bwa_mem_map_bam:
    input:
        rules.sam_to_bam.output
# Use bwa mem to map reads on a reference genome.
rule bwa_mem_map_sam:
    input:
        reference=reference_genome(),
        index=reference_genome_index(),
        fastq=lambda wildcards: config["units"][SAMPLE_TO_UNIT[wildcards.sample]],
    output:
        "mapping/{sample}.sam"
    threads: 12
    log:
        "mapping/{sample}.log"
    shell:
        "{BWA} mem -t {threads} {input.reference} {input.fastq} > {output} 2> {log} "\
        "|| (rc=$?; cat {log}; exit $rc;)"
rule sam_to_bam:
    input:
        "{prefix}.sam"
    output:
        "{prefix}.bam"
    threads: 8
    shell:
        "{SAMTOOLS} view --threads {threads} -b {input} > {output}"

You don't need a rule to remove your sam files. Just mark the output sam file in the "bwa_mem_map_sam" rule as temporary:
rule bwa_mem_map_sam:
    input:
        reference=reference_genome(),
        index=reference_genome_index(),
        fastq=lambda wildcards: config["units"][SAMPLE_TO_UNIT[wildcards.sample]],
    output:
        temp("mapping/{sample}.sam")
    threads: 12
    log:
        "mapping/{sample}.log"
    shell:
        "{BWA} mem -t {threads} {input.reference} {input.fastq} > {output} 2> {log} "\
        "|| (rc=$?; cat {log}; exit $rc;)"
As soon as a temp file is no longer needed (i.e. every job that uses it as input has finished), snakemake removes it.
EDIT AFTER COMMENT:
If I understand correctly, your statement "if the user asks for a sam..." means the sam file is put in the target rule. If this is the case, then as long as the input of the target rule contains the sam file, the file won't be deleted (I guess). If the bam file is put in the target rule (and not the sam), then it will be deleted.
The other way is this:
rule bwa_mem_map:
    input:
        sam="{sample}.sam",
        bam="{sample}.bam"
    output:
        touch("{sample}_samErased.txt")
    shell:
        "rm {input.sam}"
and ask for "{sample}_samErased.txt" in the target rule.

Based on the comments above, you want to let the user choose between sam and bam output.
You could pass this as a config argument:
snakemake --config output_format=sam
Then use a Snakefile like this:
samples = ['A','B']
rule all:
    input:
        expand('{sample}.mapped.{output_format}', sample=samples, output_format=config['output_format'])
rule bwa:
    input: '{sample}.fastq'
    output: temp('{sample}.mapped.sam')
    shell:
        """touch {output}"""
rule sam_to_bam:
    input: '{sample}.mapped.sam'
    output: '{sample}.mapped.bam'
    shell:
        """touch {output}"""
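As a rough illustration of how the requested targets follow from the config value, here is a plain-Python sketch (it does not use snakemake itself). The helper name requested_targets, the fallback to "bam" when no format is given, and the check on allowed values are assumptions added for the example; the Snakefile above has no default.

```python
def requested_targets(samples, config, default_format="bam"):
    """Return the files rule `all` would ask for, given the chosen format."""
    # Assumption: fall back to a default when --config output_format=... is omitted.
    fmt = config.get("output_format", default_format)
    if fmt not in ("sam", "bam"):
        raise ValueError(f"output_format must be 'sam' or 'bam', got {fmt!r}")
    return [f"{sample}.mapped.{fmt}" for sample in samples]


samples = ["A", "B"]
# Stand-ins for the `config` dict that snakemake fills from --config.
print(requested_targets(samples, {"output_format": "sam"}))
print(requested_targets(samples, {}))
```

In a Snakefile, the same fallback could be written as config.get('output_format', 'bam') inside the expand call.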

snakemake - define input for aggregate rule without wildcards

I am writing a snakemake workflow to produce SARS-CoV-2 variants from Nanopore sequencing. The pipeline that I am writing is based on the artic network, so I am using artic guppyplex and artic minion.
The snakemake that I wrote has the following steps:
zip all the fastq files for all barcodes (rule zipFq)
perform read filtering with guppyplex (rule guppyplex)
call the artic minion pipeline (rule minion)
move the stderr and stdout from qsub to a folder under the working directory (rule mvQsubLogs)
Below is the snakemake that I wrote so far, which works
barcodes = ['barcode49', 'barcode50', 'barcode51']
rule all:
    input:
        expand([
            # zip fq
            "zipFastq/{barcode}/{barcode}.zip",
            # guppyplex
            "guppyplex/{barcode}/{barcode}.fastq",
            # nanopolish
            "nanopolish/{barcode}",
            # directory where the logs will be moved to
            "logs/{barcode}"
        ], barcode = barcodes)
rule zipFq:
    input:
        FQ = f"{FASTQ_PATH}/{{barcode}}"
    output:
        "zipFastq/{barcode}/{barcode}.zip"
    shell:
        "zip {output} {input.FQ}/*"
rule guppyplex:
    input:
        FQ = f"{FASTQ_PATH}/{{barcode}}" # FASTQ_PATH is parsed from config.yaml
    output:
        "guppyplex/{barcode}/{barcode}.fastq"
    shell:
        "/home/ngs/miniconda3/envs/artic-ncov2019/bin/artic guppyplex --skip-quality-check --min-length {MINLENGTHGUPPY} --max-length {MAXLENGTHGUPPY} --directory {input.FQ} --prefix {wildcards.barcode} --output {output}" # variables in CAPITALS are parsed from config.yaml
rule minion:
    input:
        INFQ = rules.guppyplex.output,
        FAST5 = f"{FAST5_PATH}/{{barcode}}"
    params:
        OUTDIR = "nanopolish/{barcode}"
    output:
        directory("nanopolish/{barcode}")
    shell:
        """
        mkdir {params.OUTDIR};
        cd {params.OUTDIR};
        export PATH=/home/ngs/miniconda3/envs/artic-ncov2019/bin:$PATH;
        artic minion --normalise {NANOPOLISH_NORMALISE} --threads {THREADS} --scheme-directory {PRIMERSDIR} --read-file ../../{input.INFQ} --sequencing-summary {Seq_Sum} --fast5-directory {input.FAST5} nCoV-2019/{PRIMERVERSION} {wildcards.barcode} # variables in CAPITALS are parsed from config.yaml
        """
rule mvQsubLogs:
    input:
        # zipFQ
        rules.zipFq.output,
        # guppyplex
        rules.guppyplex.output,
        # nanopolish
        rules.minion.output
    output:
        directory("logs/{barcode}")
    shell:
        "mkdir -p {output} \n"
        "mv {LOGDIR}/{wildcards.barcode}* {output}/"
The above workflow works, and now I am trying to add another rule. The difference is that this rule is an aggregate step, i.e. it should not be called for every barcode, but only once, after all the other rules have run for all barcodes.
The rule that I am trying to incorporate (catFasta) would cat all {barcode}.consensus.fasta files (generated by rule minion) into a single file, as shown below (incorporated into the workflow above):
barcodes = ['barcode49', 'barcode50', 'barcode51']
rule all:
    input:
        expand([
            # zip fq
            "zipFastq/{barcode}/{barcode}.zip",
            # guppyplex
            "guppyplex/{barcode}/{barcode}.fastq",
            # nanopolish
            "nanopolish/{barcode}",
            # catFasta
            "catFasta/cat_consensus.fasta",
            # directory where the logs will be moved to
            "logs/{barcode}"
        ], barcode = barcodes)
rule zipFq:
    input:
        FQ = f"{FASTQ_PATH}/{{barcode}}"
    output:
        "zipFastq/{barcode}/{barcode}.zip"
    shell:
        "zip {output} {input.FQ}/*"
rule guppyplex:
    input:
        FQ = f"{FASTQ_PATH}/{{barcode}}" # FASTQ_PATH is parsed from config.yaml
    output:
        "guppyplex/{barcode}/{barcode}.fastq"
    shell:
        "/home/ngs/miniconda3/envs/artic-ncov2019/bin/artic guppyplex --skip-quality-check --min-length {MINLENGTHGUPPY} --max-length {MAXLENGTHGUPPY} --directory {input.FQ} --prefix {wildcards.barcode} --output {output}" # variables in CAPITALS are parsed from config.yaml
rule minion:
    input:
        INFQ = rules.guppyplex.output,
        FAST5 = f"{FAST5_PATH}/{{barcode}}"
    params:
        OUTDIR = "nanopolish/{barcode}"
    output:
        directory("nanopolish/{barcode}")
    shell:
        """
        mkdir {params.OUTDIR};
        cd {params.OUTDIR};
        export PATH=/home/ngs/miniconda3/envs/artic-ncov2019/bin:$PATH;
        artic minion --normalise {NANOPOLISH_NORMALISE} --threads {THREADS} --scheme-directory {PRIMERSDIR} --read-file ../../{input.INFQ} --sequencing-summary {Seq_Sum} --fast5-directory {input.FAST5} nCoV-2019/{PRIMERVERSION} {wildcards.barcode} # variables in CAPITALS are parsed from config.yaml
        """
rule catFasta:
    input:
        expand("nanopolish/{barcode}/{barcode}.consensus.fasta", barcode = barcodes)
    output:
        "catFasta/cat_consensus.fasta"
    shell:
        "cat {input} > {output}"
rule mvQsubLogs:
    input:
        # zipFQ
        rules.zipFq.output,
        # guppyplex
        rules.guppyplex.output,
        # nanopolish
        rules.minion.output,
        # catFasta
        rules.catFasta.output
    output:
        directory("logs/{barcode}")
    shell:
        "mkdir -p {output} \n"
        "mv {LOGDIR}/{wildcards.barcode}* {output}/"
However, when I call snakemake with
(artic-ncov2019) ngs@bngs05b:/nexusb/SC2/ONT/scripts/SnakeMake> snakemake -np -s Snakefile_v2 --cluster "qsub -q onlybngs05b -e {LOGDIR} -o {LOGDIR} -j y" -j 5 --jobname "{wildcards.barcode}.{rule}.{jobid}" all # LOGDIR parsed from config.yaml
I get:
Building DAG of jobs...
MissingInputException in line 178 of /nexusb/SC2/ONT/scripts/SnakeMake/Snakefile_v2:
Missing input files for rule guppyplex:
/nexus/Gridion/20210521_Covid7/Covid7/20210521_0926_X1_FAL11796_a5b62ac2/fastq_pass/barcode49/barcode49.consensus.fasta
I find this hard to understand: snakemake is complaining about /nexus/Gridion/20210521_Covid7/Covid7/20210521_0926_X1_FAL11796_a5b62ac2/fastq_pass/barcode49/barcode49.consensus.fasta, whereas /nexus/Gridion/20210521_Covid7/Covid7/20210521_0926_X1_FAL11796_a5b62ac2/fastq_pass/ is FASTQ_PATH and I am not defining f"{FASTQ_PATH}/{{barcode}}.consensus.fasta" anywhere.
The very same problem is described here, though the strategy in the accepted answer (the input for rule catFasta would be expand("nanopolish/{{barcode}}/{{barcode}}.consensus.fasta")) does not work for me.
Does anyone know how I can circumvent this?
The rule that fails is rule guppyplex, which looks for an input of the form {FASTQ_PATH}/{{barcode}}.
It looks like the wildcard {barcode} is filled with barcode49/barcode49.consensus.fasta, which I think happened for two reasons:
First (and most important): the workflow does not find a better way to produce the final output. In rule catFasta, you give an input file that is never declared as an output in your workflow. The rule minion has the directory as an output, but not the file, so it is not clear to the workflow which rule should produce this input file.
It therefore infers that the {barcode} wildcard somehow has to contain this .consensus.fasta that it has never seen before. This wildcard value is then passed up the chain of rules, where the workflow crashes because it cannot find a matching input file.
Second: filling the wildcard with something you don't want is only possible because you did not constrain the wildcard properly. You can, for example, forbid the wildcard from containing a . (see wildcard_constraints here).
However, the main problem is that catFasta does not find the desired input. I'd suggest changing the output of minion to "nanopolish/{barcode}/{barcode}.consensus.fasta"; since you already take the OUTDIR from the params, that should not hurt your rule.
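To see why the unconstrained wildcard swallows the extra path component, here is a minimal sketch in plain Python (not snakemake's own matching code). It relies on the documented default that a wildcard matches the regular expression .+; the helper match_output and the constraint [^/.]+ are illustrative choices, not something taken from the question.

```python
import re


def match_output(pattern_regex, path):
    """Return the captured wildcard values if the whole path matches, else None."""
    m = re.fullmatch(pattern_regex, path)
    return m.groupdict() if m else None


# File requested by rule catFasta.
requested = "nanopolish/barcode49/barcode49.consensus.fasta"

# Output pattern "nanopolish/{barcode}" with the default wildcard regex ".+".
unconstrained = r"nanopolish/(?P<barcode>.+)"
# Same pattern with a hypothetical constraint that forbids "/" and ".".
constrained = r"nanopolish/(?P<barcode>[^/.]+)"

print(match_output(unconstrained, requested))
print(match_output(constrained, requested))
print(match_output(constrained, "nanopolish/barcode49"))
```

With the default regex, the directory output of rule minion matches the requested file and {barcode} becomes barcode49/barcode49.consensus.fasta. With the constraint, only nanopolish/barcode49 matches. In a Snakefile, such a constraint could be declared globally with a wildcard_constraints block, e.g. barcode="[^/.]+".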
Edit: Dummy test example:
barcodes = ['barcode49', 'barcode50', 'barcode51']
rule all:
    input:
        expand([
            # guppyplex
            "guppyplex/{barcode}/{barcode}.fastq",
            # catFasta
            "catFasta/cat_consensus.fasta",
        ], barcode = barcodes)
rule guppyplex:
    input:
        FQ = f"fastq/{{barcode}}" # FASTQ_PATH is parsed from config.yaml
    output:
        "guppyplex/{barcode}/{barcode}.fastq"
    shell:
        "touch {output}" # variables in CAPITALS are parsed from config.yaml
rule minion:
    input:
        INFQ = rules.guppyplex.output,
        FAST5 = f"fasta/{{barcode}}"
    params:
        OUTDIR = "nanopolish/{barcode}"
    output:
        "nanopolish/{barcode}/{barcode}.consensus.fasta"
    shell:
        """
        touch {output} && echo {wildcards.barcode} > {output}
        """
rule catFasta:
    input:
        expand("nanopolish/{barcode}/{barcode}.consensus.fasta", barcode = barcodes)
    output:
        "catFasta/cat_consensus.fasta"
    shell:
        "cat {input} > {output}"

Snakemake cannot handle very long command line?

This is a very strange problem.
When the {input} specified in the rule section is a list of fewer than 200 files, snakemake works fine.
But when {input} has more than 500 files, snakemake just quits with the message (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!). The complete log does not provide any further error messages.
For the log, please see: https://github.com/snakemake/snakemake/files/5285271/2020-09-25T151835.613199.snakemake.log
The rule that worked is (NOTE the input is capped to 200 files):
rule combine_fastq:
    input:
        lambda wildcards: samples.loc[(wildcards.sample), ["fq"]].dropna()[0].split(',')[:200]
    output:
        "combined.fastq/{sample}.fastq.gz"
    group: "minion_assemble"
    shell:
        """
        echo {input} > {output}
        """
The rule that failed is:
rule combine_fastq:
    input:
        lambda wildcards: samples.loc[(wildcards.sample), ["fq"]].dropna()[0].split(',')
    output:
        "combined.fastq/{sample}.fastq.gz"
    group: "minion_assemble"
    shell:
        """
        echo {input} > {output}
        """
My question is also posted in GitHub: https://github.com/snakemake/snakemake/issues/643.
I second Maarten's answer, with that many files you are running up against a shell limit; snakemake is just doing a poor job helping you identify the problem.
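To get a feeling for how quickly such a command line grows, here is a small sketch that estimates the length of the expanded command and compares it with a limit. The path layout is made up, and the 128 KiB per-argument limit is an assumption about typical Linux systems (where the whole command is handed to bash -c as one argument); whether this is exactly the limit hit in the question is not confirmed by the log.

```python
import os


def command_length(paths, prefix="cat "):
    """Length in bytes of a command built by joining all paths with spaces."""
    return len((prefix + " ".join(paths)).encode())


def exceeds(paths, limit):
    """True if the joined command would be longer than `limit` bytes."""
    return command_length(paths) > limit


# Hypothetical input list: 500 long paths, similar in number to the failing case.
paths = ["/data/run/" + "x" * 280 + f"/{i:04d}.fastq.gz" for i in range(500)]

# Assumption: a single argument may not exceed 128 KiB on Linux.
SINGLE_ARG_LIMIT = 128 * 1024

print(exceeds(paths[:200], SINGLE_ARG_LIMIT))  # the capped list
print(exceeds(paths, SINGLE_ARG_LIMIT))        # the full list
# Limit for all arguments and environment together, as reported by the OS.
print(os.sysconf("SC_ARG_MAX"))
```

With these made-up paths, the capped list of 200 files stays under the assumed limit while the full list of 500 does not, which would match the behaviour described in the question.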
Based on the issue you reference, it seems like you are using cat to combine all of your files. Maybe following the answer here would help:
rule combine_fastq_list:
    input:
        lambda wildcards: samples.loc[(wildcards.sample), ["fq"]].dropna()[0].split(',')
    output:
        temp("{sample}.tmp.list")
    group: "minion_assemble"
    run:
        # write one input path per line
        with open(output[0], 'w') as out:
            out.write('\n'.join(input))
rule combine_fastq:
    input:
        "{sample}.tmp.list"
    output:
        'combined.fastq/{sample}.fastq.gz'
    group: "minion_assemble"
    shell:
        'cat {input} | ' # this is reading the list of files from the file
        'xargs zcat -f | '
        '...'
Hope it gets you on the right track.
edit
Note that with xargs the command may be run more than once, each time on a batch of the listed files. A different option that runs the command once for the whole list of inputs is:
rule combine_fastq:
    ...
    shell:
        """
        command $(< {input}) ...
        """
For those landing here with similar questions (like Snakemake expand function alternative), snakemake 6 can handle long command lines. The following test fails on snakemake < 6 but succeeds on 6.0.0 on my Ubuntu machine:
rule all:
    input:
        'output.txt',
rule one:
    output:
        'output.txt',
    params:
        x=list(range(0, 1000000))
    shell:
        r"""
        echo {params.x} > {output}
        """

snakemake. How to pass target from command line when creating multiple targets

With help following a previous question, this code creates targets (copies of the file named "practice_phased_reversed.vcf") in each of two directories.
dirs=['k_1','k2_10']
rule all:
    input:
        expand("{f}/practice_phased_reversed.vcf",f=dirs)
rule r1:
    input:
        "practice_phased_reversed.vcf"
    output:
        "{f}/{input}"
    shell:
        "cp {input} {output}"
However, I would like to pass the target file on the snakemake command line.
I tried this (below), with the command "snakemake practice_phased_reversed.vcf", but it gave an error : "MissingRuleException: No rule to produce practice_phased_reversed.vcf"
dirs=['k_1','k2_10']
rule all:
    input:
        expand("{f}/{{base}}_phased_reversed.vcf",f=dirs)
rule r1:
    input:
        "{base}_phased_reversed.vcf"
    output:
        "{f}/{input}"
    shell:
        "cp {input} {output}"
Thanks for any help
I think you should pass the target file name as a configuration option on the command line and use that option to construct the file names in the Snakefile:
target = config['target']
dirs = ['k_1','k2_10']
rule all:
    input:
        expand("{f}/%s" % target, f=dirs),
rule r1:
    input:
        target,
    output:
        "{f}/%s" % target,
    shell:
        "cp {input} {output}"
To be executed as:
snakemake -C target=practice_phased_reversed.vcf
Your target file practice_phased_reversed.vcf doesn't satisfy the output pattern of rule r1: it is missing a value for the wildcard {f}.
In contrast, with the following rule, snakemake data/practice_phased_reversed.vcf, where data matches the wildcard f, will work as expected.
Code:
rule r1:
    input:
        "{base}_phased_reversed.vcf"
    output:
        "{f}/{base}_phased_reversed.vcf"
    shell:
        "cp {input} {output}"

ChildIOException: error in snake make after running flye

I have an issue when I run other programs after running flye in my snakemake pipeline. This is because the output from flye is a directory. My rules are as follows:
samples, = glob_wildcards("data/samples/{sample}.fastq")
rule all:
    input:
        [f"assembled/" for sample in samples],
        [f"nanopolish/draft.fa" for sample in samples],
        [f"nanopolish/reads.sorted.bam" for sample in samples],
        [f"nanopolish/reads.indexed.sorted.bam" for sample in samples]
rule fly:
    input:
        "unzipped/read.fastq"
    output:
        directory("assembled/")
    conda:
        "envs/flye.yaml"
    shell:
        "flye --nano-corr {input} --genome-size 5m --out-dir {output}"
rule bwa:
    input:
        "assembled/assembly.fasta"
    output:
        "nanopolish/draft.fa"
    conda:
        "nanopolish.yaml"
    shell:
        "bwa index {input} {output}"
rule nanopolish:
    input:
        "nanopolish/draft.fa",
        "zipped/zipped.gz"
    output:
        "nanopolish/reads.sorted.bam"
    conda:
        "nanopolish.yaml"
    shell:
        "bwa mem -x ont2d -t 8 {input} | samtools sort -o {output}"
There are a few steps before this, but they work just fine. When I run this, it gives the following error:
ChildIOException:
File/directory is a child to another output:
/home/fronglesquad/snakemake_poging_1/assembled
/home/fronglesquad/snakemake_poging_1/assembled/assembly.fasta
I have googled the error. All I could find is that it happens because snakemake doesn't work well with output directories. But this tool needs an output directory to work. Does anyone know how to bypass this?
(I think) The problem lies somewhere else in your code.
You have defined two rules: the first outputs the directory assembled, and the second (bwa) takes assembled/assembly.fasta, a file inside that directory, as input. Since this file always lies inside the directory assembled, Snakemake complains. You can solve it by using the directory as input:
rule second:
    input:
        "assembled"
    output:
        ...
    shell:
        "cat {input}/assembly.fasta > {output}"
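The parent/child relationship behind the error message can be illustrated with a few lines of plain Python. This is only a sketch of the kind of check involved, not snakemake's actual implementation; the helper names is_child and conflicts are hypothetical.

```python
from pathlib import PurePosixPath


def is_child(path, directory):
    """True if `path` lies somewhere inside `directory`."""
    return PurePosixPath(directory) in PurePosixPath(path).parents


def conflicts(file_paths, directory_outputs):
    """List (file, directory) pairs where a file sits inside a directory output."""
    return [(f, d) for f in file_paths for d in directory_outputs if is_child(f, d)]


# Files referenced by the rules in the question, and the directory() output of rule fly.
files = ["assembled/assembly.fasta", "nanopolish/draft.fa"]
print(conflicts(files, ["assembled/"]))
```

Only assembled/assembly.fasta is reported, which is the pair named in the ChildIOException. Referring to the directory itself, as in the rule above, avoids naming a file inside it.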

snakemake - output one only file from multiple input files in one rule

I'm using snakemake for the first time in order to build a basic pipeline using cutadapt, bwa and GATK (trimming; mapping; calling). I would like to run this pipeline on every fastq file contained in a directory, without having to specify their names in the Snakefile or in the config file.
The first two steps (cutadapt and bwa / trimming and mapping) are running fine, but I'm encountering some problems with GATK.
First, I have to generate g.vcf files from bam files. I'm doing this using these rules:
configfile: "config.yaml"
import os
import glob
rule all:
    input:
        "merge_calling.g.vcf"
rule cutadapt:
    input:
        read="data/Raw_reads/{sample}_R1_{run}.fastq.gz",
        read2="data/Raw_reads/{sample}_R2_{run}.fastq.gz"
    output:
        R1=temp("trimmed_reads/{sample}_R1_{run}.fastq.gz"),
        R2=temp("trimmed_reads/{sample}_R2_{run}.fastq.gz")
    threads:
        10
    shell:
        "cutadapt -q {config[Cutadapt][Quality_value]} -m {config[Cutadapt][min_length]} -a {config[Cutadapt][forward_adapter]} -A {config[Cutadapt][reverse_adapter]} -o {output.R1} -p '{output.R2}' {input.read} {input.read2}"
rule bwa_map:
    input:
        genome="data/genome.fasta",
        read=expand("trimmed_reads/{{sample}}_{pair}_{{run}}.fastq.gz", pair=["R1", "R2"])
    output:
        temp("mapped_bam/{sample}_{run}.bam")
    threads:
        10
    params:
        rg="@RG\\tID:{sample}\\tPL:ILLUMINA\\tSM:{sample}"
    shell:
        "bwa mem -t 2 -R '{params.rg}' {input.genome} {input.read} | samtools view -Sb - > {output}"
rule picard_sort:
    input:
        "mapped_bam/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam"
    shell:
        "java -Xmx4g -jar /home/alexandre/picard-tools/picard.jar SortSam I={input} O={output} SO=coordinate VALIDATION_STRINGENCY=SILENT"
rule picard_rmdup:
    input:
        bam="sorted_reads/{sample}.bam"
    output:
        "rmduped_reads/{sample}.bam",
        "picard_stats/{sample}.bam"
    params:
        reads="rmduped_reads/{sample}.bam",
        stats="picard_stats/{sample}.bam",
    shell:
        "java -jar -Xmx2g /home/alexandre/picard-tools/picard.jar MarkDuplicates "
        "I={input.bam} "
        "O='{params.reads}' "
        "VALIDATION_STRINGENCY=SILENT "
        "MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000 "
        "REMOVE_DUPLICATES=TRUE "
        "M='{params.stats}'"
rule samtools_index:
    input:
        "rmduped_reads/{sample}.bam"
    output:
        "rmduped_reads/{sample}.bam.bai"
    shell:
        "samtools index {input}"
rule GATK_raw_calling:
    input:
        bam="rmduped_reads/{sample}.bam",
        bai="rmduped_reads/{sample}.bam.bai",
        genome="data/genome.fasta"
    output:
        "Raw_calling/{sample}.g.vcf",
    shell:
        "java -Xmx4g -jar /home/alexandre/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -ploidy 2 --emitRefConfidence GVCF -T HaplotypeCaller -R {input.genome} -I {input.bam} --genotyping_mode DISCOVERY -o {output}"
These rules work fine. For example, if I have the files :
Cla001d_S281_L001_R1_001.fastq.gz
Cla001d_S281_L001_R2_001.fastq.gz
I can create one bam file (Cla001d_S281_L001_001.bam) and from that bam file create a GVCF file (Cla001d_S281_L001_001.g.vcf). I have a lot of samples like this one, and I need to create one GVCF file for each, and then merge these GVCF files into one file. The problem is that I'm unable to give the list of files to merge to the following rule:
rule GATK_merge:
    input:
        ???
    output:
        "merge_calling.g.vcf"
    shell:
        "java -Xmx4g -jar /home/alexandre/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar "
        "-T CombineGVCFs "
        "-R data/genome.fasta "
        "--variant {input} "
        "-o {output}"
I tried several things in order to do that, but without success. The problem is the link between the two rules (GATK_raw_calling and GATK_merge, which is supposed to merge the output of GATK_raw_calling). I can't output one single file if I specify the output of GATK_raw_calling as the input of the following rule (Wildcards in input files cannot be determined from output files), and I'm unable to link the two rules if I don't specify these files as an input...
Is there a way to do that? The difficulty, I think, is that I'm not defining a list of names anywhere.
Thank you in advance for your help.
You can try to generate lists of sample IDs and run IDs using glob_wildcards on the initial fastq.gz files:
sample_ids, run_ids = glob_wildcards("data/Raw_reads/{sample}_R1_{run}.fastq.gz")
Then, you can use these to expand the input of GATK_merge:
rule GATK_merge:
    input:
        expand("Raw_calling/{sample}_{run}.g.vcf",
               sample=sample_ids, run=run_ids)
If each run ID always comes with one specific sample ID, you will need to zip instead of expanding, in order to avoid non-existing combinations:
rule GATK_merge:
    input:
        ["Raw_calling/{sample}_{run}.g.vcf".format(
            sample=sample_id,
            run=run_id) for sample_id, run_id in zip(sample_ids, run_ids)]
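The difference between the two variants can be checked in plain Python: expand with keyword lists builds every combination, much like itertools.product, while zip pairs the lists element by element. In the sketch below the second sample name and the run IDs are made up for illustration.

```python
from itertools import product

# First entry is from the question; the second one is a hypothetical extra sample.
sample_ids = ["Cla001d_S281_L001", "Cla002d_S282_L001"]
run_ids = ["001", "002"]
template = "Raw_calling/{sample}_{run}.g.vcf"

# Every sample combined with every run (what a plain expand produces).
all_combinations = [
    template.format(sample=s, run=r) for s, r in product(sample_ids, run_ids)
]
# Only the pairs that were actually found together on disk.
paired = [template.format(sample=s, run=r) for s, r in zip(sample_ids, run_ids)]

print(len(all_combinations), len(paired))
print(paired)
```

Here the product contains four paths, two of which correspond to no existing fastq pair, while the zipped list contains only the two real ones. Snakemake's expand also accepts zip as its second argument, i.e. expand(template, zip, sample=sample_ids, run=run_ids), which should give the same paired list.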
You can achieve this by using a Python function as the input of your rule, as described in the snakemake documentation here.
It could look like this, for example:
# Define input files
def gatk_inputs(wildcards):
    files = expand("Raw_calling/{sample}.g.vcf", sample=<samples list>)
    return files
# Rule
rule gatk:
    input: gatk_inputs
    output: <output file name>
    run: ...
Hope this helps.