So I have an issue when I run other programs after I ran flye in my snakemake pipeline. This is because the output from flye is a directory. My rules are as followd:
samples, = glob_wildcards("data/samples/{sample}.fastq")
rule all:
input:
[f"assembled/" for sample in samples],
[f"nanopolish/draft.fa" for sample in samples],
[f"nanopolish/reads.sorted.bam" for sample in samples],
[f"nanopolish/reads.indexed.sorted.bam" for sample in samples]
rule fly:
input:
"unzipped/read.fastq"
output:
directory("assembled/")
conda:
"envs/flye.yaml"
shell:
"flye --nano-corr {input} --genome-size 5m --out-dir {output}"
rule bwa:
input:
"assembled/assembly.fasta"
output:
"nanopolish/draft.fa"
conda:
"nanopolish.yaml"
shell:
"bwa index {input} {output}"
rule nanopolish:
input:
"nanopolish/draft.fa",
"zipped/zipped.gz"
output:
"nanopolish/reads.sorted.bam"
conda:
"nanopolish.yaml"
shell:
"bwa mem -x ont2d -t 8 {input} | samtools sort -o {output}"
there are a few steps before this but they work just fine. when I run this it gives the following error:
ChildIOException:
File/directory is a child to another output:
/home/fronglesquad/snakemake_poging_1/assembled
/home/fronglesquad/snakemake_poging_1/assembled/assembly.fasta
I have googled the error. All I could find there that its because snakemake doesnt work well with output directorys. But this tool needs a output directory to work. Does anyone know how to bypass this?
(I think) The problem lies somewhere else in your code.
You have defined two rules, the first that outputs directory assembled, the second that outputs assembled/assembly.fasta. Since the output of the second rule is always at least the directory assembled, Snakemake complains. You can solve it by using the directory as input:
rule second:
input:
"assembled"
output:
...
shell:
cat {input}/assembly.fasta > {output}
Related
A few days ago I started using Snakemake for the first time. I am having an issue when I am trying to run the megahit rule in my pipeline.
It gives me the following error "Outputs of incorrect type (directories when expecting files or vice versa). Output directories must be flagged with directory(). ......"
So initially it runs and then crashes with the above error. I implemented the solution with the directory() option in my pipeline but I think its not a good practice since, for various reasons, you can loose files without even knowing it.
Is there a way to run the rule without using the directory() ?
I would appreciate any help on the issue!
Thanking you in advance
sra = []
with open("run_ids") as f:
for line in f:
sra.append(line.strip())
rule all:
input:
expand("raw_reads/{sample}/{sample}.fastq", sample=sra),
expand("trimmo/{sample}/{sample}.trimmed.fastq", sample=sra),
expand("megahit/{sample}/final.contigs.fa", sample=sra)
rule download:
output:
"raw_reads/{sample}/{sample}.fastq"
params:
"--split-spot --skip-technical"
log:
"logs/fasterq-dump/{sample}.log"
benchmark:
"benchmarks/fastqdump/{sample}.fasterq-dump.benchmark.txt"
threads: 8
shell:
"""
fasterq-dump {params} --outdir /home/raw_reads/{wildcards.sample} {wildcards.sample} -e {threads}
"""
rule trim:
input:
"raw_reads/{sample}/{sample}.fastq"
output:
"trimmo/{sample}/{sample}.trimmed.fastq"
params:
"HEADCROP:15 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36"
log:
"logs/trimmo/{sample}.log"
benchmark:
"benchmarks/trimmo/{sample}.trimmo.benchmark.txt"
threads: 6
shell:
"""
trimmomatic SE -phred33 -threads {threads} {input} trimmo/{wildcards.sample}/{wildcards.sample}.trimmed.fastq {params}
"""
rule megahit:
input:
"trimmo/{sample}/{sample}.trimmed.fastq"
output:
"megahit/{sample}/final.contigs.fa"
params:
"-m 0.7 -t"
log:
"logs/megahit/{sample}.log"
benchmark:
"benchmarks/megahit/{sample}.megahit.benchmark.txt"
threads: 10
shell:
"""
megahit -r {input} -o {output} -t {threads}
"""
IMHO it is a bad design of the megahit software that it takes a directory as a parameter and outputs into a file in this directory with a hardcoded name. Flagging the filename with directory() doesn't solve the issue, as in this case what you expect to be a file with the .fa extension megahit treats as a directory. The rest of the pipeline is broken in this case.
But this issue can be solved in Snakemake like that:
rule megahit:
input:
"trimmo/{sample}/{sample}.trimmed.fastq"
output:
"megahit/{sample}/final.contigs.fa"
# ...
shell:
"""
megahit -r {input} -o megahit/{wildcards.sample} -t {threads}
"""
A better design of the megahit rule would look as follows:
rule megahit:
input:
"trimmo/{sample}/{sample}.trimmed.fastq"
output:
out_dir = directory("megahit/{sample}/"),
fasta = "megahit/{sample}/final.contigs.fa"
log:
"logs/megahit/{sample}.log"
benchmark:
"benchmarks/megahit/{sample}.megahit.benchmark.txt"
threads:
10
shell:
"megahit -r {input} -f -o {output.out_dir} -t {threads}"
This guarantees that the output directory is removed upon failure, while the -f argument to megahit tells it to ignore the fact that the output folder exists (it is created by Snakemake automatically because one of the outputs is a file inside it: final.contigs.fa).
BTW, the -m (--memory) parameter is best implemented as a resource. The only problem though is that snakemake's default resource, mem_mb is in megabytes. One workaround would be as follows:
resources:
mem_mb = mem_mb_limit_for_megahit # could be a fraction of a global constant
params:
mem_bytes = lambda w, resources: round(resources.mem_mb * 1e6)
shell:
"megahit ... -m {params.mem_bytes}"
This is a very strange problem.
When my {input} specified in the rule section is a list of <200 files, snakemake worked all right.
But when {input} has more than 500 files, snakemake just quitted with messages (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!). The complete log did not provide any error messages.
For the log, please see: https://github.com/snakemake/snakemake/files/5285271/2020-09-25T151835.613199.snakemake.log
The rule that worked is (NOTE the input is capped to 200 files):
rule combine_fastq:
input:
lambda wildcards: samples.loc[(wildcards.sample), ["fq"]].dropna()[0].split(',')[:200]
output:
"combined.fastq/{sample}.fastq.gz"
group: "minion_assemble"
shell:
"""
echo {input} > {output}
"""
The rule that failed is:
rule combine_fastq:
input:
lambda wildcards: samples.loc[(wildcards.sample), ["fq"]].dropna()[0].split(',')
output:
"combined.fastq/{sample}.fastq.gz"
group: "minion_assemble"
shell:
"""
echo {input} > {output}
"""
My question is also posted in GitHub: https://github.com/snakemake/snakemake/issues/643.
I second Maarten's answer, with that many files you are running up against a shell limit; snakemake is just doing a poor job helping you identify the problem.
Based on the issue you reference, it seems like you are using cat to combine all of your files. Maybe following the answer here would help:
rule combine_fastq_list:
input:
lambda wildcards: samples.loc[(wildcards.sample), ["fq"]].dropna()[0].split(',')
output:
temp("{sample}.tmp.list")
group: "minion_assemble"
script:
with open(output[0]) as out:
out.write('\n'.join(input))
rule combine_fastq:
input:
temp("{sample}.tmp.list")
output:
'combined.fastq/{sample}.fastq.gz'
group: "minion_assemble"
shell:
'cat {input} | ' # this is reading the list of files from the file
'xargs zcat -f | '
'...'
Hope it gets you on the right track.
edit
The first option executes your command separately for each input file. A different option that executes the command once for the whole list of input is:
rule combine_fastq:
...
shell:
"""
command $(< {input}) ...
"""
For those landing here with similar questions (like Snakemake expand function alternative), snakemake 6 can handle long command lines. The following test fails on snakemake < 6 but succeeds on 6.0.0 on my Ubuntu machine:
rule all:
input:
'output.txt',
rule one:
output:
'output.txt',
params:
x= list(range(0, 1000000))
shell:
r"""
echo {params.x} > {output}
"""
With help following a previous question, this code creates targets (copies of the file named "practice_phased_reversed.vcf" in each of two directories.
dirs=['k_1','k2_10']
rule all:
input:
expand("{f}/practice_phased_reversed.vcf",f=dirs)
rule r1:
input:
"practice_phased_reversed.vcf"
output:
"{f}/{input}"
shell:
"cp {input} {output}"
However, I would like to pass the target file on the snakemake command line.
I tried this (below), with the command "snakemake practice_phased_reversed.vcf", but it gave an error : "MissingRuleException: No rule to produce practice_phased_reversed.vcf"
dirs=['k_1','k2_10']
rule all:
input:
expand("{f}/{{base}}_phased_reversed.vcf",f=dirs)
rule r1:
input:
"{base}_phased_reversed.vcf"
output:
"{f}/{input}"
shell:
"cp {input} {output}"
Thanks for any help
I think you should pass the target file name as configuration option on the command line and use that option to construct the file names in the Snakefile:
target = config['target']
dirs = ['k_1','k2_10']
rule all:
input:
expand("{f}/%s" % target, f=dirs),
rule r1:
input:
target,
output:
"{f}/%s" % target,
shell:
"cp {input} {output}"
To be executed as:
snakemake -C target=practice_phased_reversed.vcf
Your target file practice_phased_reversed.vcf doesn't satisfy output requirements of rule r1. It is missing wildcard value for {f}.
Instead this following example, snakemake data/practice_phased_reversed.vcf, where data matches wildcard f, will work as expected.
Code:
rule r1:
input:
"{base}_phased_reversed.vcf"
output:
"{f}/{base}_phased_reversed.vcf"
shell:
"cp {input} {output}"
My code is like this, I have set the rule all:
rule all:
input:
expand("data/sam/{sample}.sam", sample=SAMPLE_NAMES)
rule trimmomatic:
input:
"data/samples/{sample}.fastq"
output:
"data/samples/{sample}.clean.fastq"
shell:
"trimmomatic SE -threads 5 -phred33 -trimlog trim.log {input} {output} LEADING:20 TRAILING:20 MINLEN:16"
rule hisat2:
input:
fa="data/genome.fa",
fastq="data/samples/{sample}.clean.fastq"
output:
"data/sam/{sample}.sam"
shell:
"hisat2-build {input.fa} ./index/geneindex | hisat2 -x - -q samples/{inout.fastq} -S {output}"
but it still shows that :
Nothing to be done.
I have try to find the way out. but useless.
help!
I don't see how to use a Snakemake rule to remove a Snakemake output file that has become useless.
In concrete terms, I have a rule bwa_mem_sam that creates a file named {sample}.sam.
I have this other rule, bwa_mem_bam that creates a file named {sample.bam}.
Has the two files contain the same information in different formats, I'd like to remove the first one cannot succeed doing this.
Any help would be very much appreciated.
Ben.
rule bwa_mem_map:
input:
sam="{sample}.sam",
bam="{sample}.bam"
shell:
"rm {input.sam}"
# Convert SAM to BAM.
rule bwa_mem_map_bam:
input:
rules.sam_to_bam.output
# Use bwa mem to map reads on a reference genome.
rule bwa_mem_map_sam:
input:
reference=reference_genome(),
index=reference_genome_index(),
fastq=lambda wildcards: config["units"][SAMPLE_TO_UNIT[wildcards.sample]],
output:
"mapping/{sample}.sam"
threads: 12
log:
"mapping/{sample}.log"
shell:
"{BWA} mem -t {threads} {input.reference} {input.fastq} > {output} 2> {log} "\
"|| (rc=$?; cat {log}; exit $rc;)"
rule sam_to_bam:
input:
"{prefix}.sam"
output:
"{prefix}.bam"
threads: 8
shell:
"{SAMTOOLS} view --threads {threads} -b {input} > {output}"
You don't need a rule to remove you sam files. Just mark the ouput sam file in "bwa_mem_map_sam" rule as temporary:
rule bwa_mem_map_sam:
input:
reference=reference_genome(),
index=reference_genome_index(),
fastq=lambda wildcards: config["units"][SAMPLE_TO_UNIT[wildcards.sample]],
output:
temp("mapping/{sample}.sam")
threads: 12
log:
"mapping/{sample}.log"
shell:
"{BWA} mem -t {threads} {input.reference} {input.fastq} > {output} 2> {log} "\
"|| (rc=$?; cat {log}; exit $rc;)"
as soon as a temp file is not needed anymore (ie: not used as input in any other rule), it will be removed by snakemake.
EDIT AFTER COMMENT:
If I understand correctly, your statement "if the user asks for a sam..." means the sam file is put in the target rule. If this is the case, then as long as the input of the target rule contains the sam file, the file won't be deleted (I guess). If the bam file is put in the target rule (and not the sam), then it will be deleted.
The other way is this:
rule bwa_mem_map:
input:
sam="{sample}.sam",
bam="{sample}.bam"
output:
touch("{sample}_samErased.txt")
shell:
"rm {input.sam}"
and ask for "{sample}_samErased.txt" in the target rule.
Based on the comments above, you want to ask the user if he wants a sam or bam output.
You could use this as a config argument:
snakemake --config output_format=sam
Then you use this kind Snakefile:
samples = ['A','B']
rule all:
input:
expand('{sample}.mapped.{output_format}', sample=samples, output_format=config['output_format'])
rule bwa:
input: '{sample}.fastq'
output: temp('{sample}.mapped.sam')
shell:
"""touch {output}"""
rule sam_to_bam:
input: '{sample}.mapped.sam'
output: '{sample}.mapped.bam'
shell:
"""touch {output}"""