snakemake: copying a file to multiple folders - snakemake

I'm new to snakemake. I have a rule to copy a file to multiple folders. The folders are made in python.
I must be misunderstanding something about working with multiple targets.
The following code, when run with "Snakemake practice_phased_reversed.vcf"
returns "No rule to produce practice_phased_reversed.vcf"
import os

fs = ['k_1', 'k2_10']
fullfs = []
cdir = os.getcwd()
for f in fs:
    path = os.path.join(cdir, f)
    fullfs.append(path)
    try:
        os.mkdir(path)
    except OSError:
        # e.g. the folder already exists
        pass

rule r1:
    input:
        "{basename}_phased_reversed.vcf"
    output:
        expand("{f}/{{basename}}_phased_reversed.vcf", f=fullfs)
    shell:
        "cp {input} {output}"

Thanks to Maarten-vd-Sande. Now this works using "snakemake" i.e. without passing a target file name.
dirs = ['k_1', 'k2_10']

rule all:
    input:
        expand("{f}/practice_phased_reversed.vcf", f=dirs)

rule r1:
    input:
        "practice_phased_reversed.vcf"
    output:
        "{f}/practice_phased_reversed.vcf"
    shell:
        "cp {input} {output}"
But I am still missing something: I need to be able to call this with a target, i.e. "snakemake practice_phased_reversed.vcf". Thanks for any help.
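For readers unsure what the expand() call in rule all produces, here is a minimal plain-Python sketch. The helper name expand_dirs is hypothetical and only imitates the single-keyword case used above; it is not Snakemake's own implementation.

```python
def expand_dirs(pattern, f):
    """Hypothetical stand-in for expand(pattern, f=...) with one keyword."""
    return [pattern.format(f=value) for value in f]


dirs = ['k_1', 'k2_10']
targets = expand_dirs("{f}/practice_phased_reversed.vcf", f=dirs)
print(targets)
```

Rule all therefore asks for one concrete path per folder, and rule r1 is run once per path with {f} filled in.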

Related

Using * to glob within an input file, or using multiple wildcards in input, then using only one wildcard for output?

Is there a way to write a rule so that I don't need to use the wildcards for all inputs/outputs, or can I use a "*" to glob rather than using wildcards? I want to symlink a file that is auto-created in subfolders to the main directory.
This is the error I get when trying to run Snakemake:
WildcardError in line 42 of snakemake_guppy_basecall/Snakefile:
Wildcards in input files cannot be determined from output files:
'failpass'
import glob

configfile: "config.yaml"

inputdirectory = config["directory"]
SAMPLES, = glob_wildcards(inputdirectory+"/{sample}.fast5", followlinks=True)
print(SAMPLES)

wildcard_constraints:
    sample="\w+\d+_\w+_\w+\d+_.+_\d"

##### target rules #####
rule all:
    input:
        #expand('basecall/{sample}/sequencing_summary.txt', sample=SAMPLES),
        "qc/multiqc.html"

rule make_indvidual_samplefiles:
    input:
        inputdirectory+"/{sample}.fast5",
    output:
        "lists/{sample}.txt",
    shell:
        "basename {input} > {output}"

rule guppy_basecall_persample:
    input:
        directory=directory(inputdirectory),
        samplelist="lists/{sample}.txt",
    output:
        summary="basecall/{sample}/sequencing_summary.txt",
        directory=directory("basecall/{sample}/"),
    params:
        config["basealgo"]
    shell:
        "guppy_basecaller -i {input.directory} --input_file_list {input.samplelist} -s {output.directory} -c {params} --compress_fastq -x \"auto\" --gpu_runners_per_device 3 --num_callers 2 --chunks_per_runner 200"

rule guppy_linkfastq:
    input:
        #glob_wildcards("basecall/{sample}/*/*.fastq.gz"),
        "basecall/{sample}/{failpass}/{runid}.fastq.gz",
    output:
        "basecall/{sample}.fastq.gz",
    shell:
        "ln -s {input} {output}"

rule fastqc_pretrim:
    input:
        #"basecall/{sample}/{failpass}/{runid}.fastq.gz",
        "basecall/{sample}.fastq.gz"
    output:
        html="qc/fastqc_pretrim/{sample}.html",
        zip="qc/fastqc_pretrim/{sample}_fastqc.zip" # the suffix _fastqc.zip is necessary for multiqc to find the file. If not using multiqc, you are free to choose an arbitrary filename
    params: ""
    log:
        "logs/fastqc_pretrim/{sample}.log"
    threads: 1
    wrapper:
        "v0.75.0/bio/fastqc"

rule multiqc:
    input:
        #expand("basecall/{sample}.fastq.gz", sample=SAMPLES)
        expand("qc/fastqc_pretrim/{sample}_fastqc.zip", sample=SAMPLES)
    output:
        "qc/multiqc.html"
    params:
        "" # Optional: extra parameters for multiqc.
    log:
        "logs/multiqc.log"
    wrapper:
        "0.77.0/bio/multiqc"
I am trying to create a pipeline that does the following: get Nanopore fast5 sequence files -> run the guppy basecaller in GPU mode -> run FastQC on the resulting fastq files -> run MultiQC on everything.
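One possible way around the WildcardError above is to stop using {failpass} and {runid} as wildcards in the input and to collect the files with an input function instead, since Snakemake can only fill input wildcards that also occur in the output. The sketch below is plain Python; the function name find_fastq, its root parameter, and the SimpleNamespace stand-in for Snakemake's wildcards object are assumptions for illustration. Note that globbing only sees files that already exist when the function is called, so in a real workflow the basecalling output would have to be present before this rule is evaluated.

```python
import glob
import os
import tempfile
from types import SimpleNamespace


def find_fastq(wildcards, root="basecall"):
    """Return all fastq.gz files two levels below <root>/<sample>/.

    In a Snakefile this could be used as
    `input: lambda wc: find_fastq(wc)`; `root` is a hypothetical
    parameter added only so the demo can run in a temp folder.
    """
    pattern = os.path.join(root, wildcards.sample, "*", "*.fastq.gz")
    return sorted(glob.glob(pattern))


# Demo on a throwaway directory tree with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for sub in ("pass", "fail"):
        folder = os.path.join(tmp, "sampleA", sub)
        os.makedirs(folder)
        open(os.path.join(folder, "run1.fastq.gz"), "w").close()

    found = find_fastq(SimpleNamespace(sample="sampleA"), root=tmp)
    subfolders = [os.path.basename(os.path.dirname(p)) for p in found]
    print(subfolders)
```

If several files match, as in this demo, a single `ln -s {input} {output}` is no longer enough, and the rule would have to concatenate the files or choose among them.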

Snakemake cannot handle very long command line?

This is a very strange problem.
When the {input} specified in the rule is a list of fewer than 200 files, snakemake works fine.
But when {input} has more than 500 files, snakemake just quits with the message "one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!". The complete log does not provide any further error messages.
For the log, please see: https://github.com/snakemake/snakemake/files/5285271/2020-09-25T151835.613199.snakemake.log
The rule that worked is (NOTE the input is capped to 200 files):
rule combine_fastq:
    input:
        lambda wildcards: samples.loc[(wildcards.sample), ["fq"]].dropna()[0].split(',')[:200]
    output:
        "combined.fastq/{sample}.fastq.gz"
    group: "minion_assemble"
    shell:
        """
        echo {input} > {output}
        """
The rule that failed is:
rule combine_fastq:
    input:
        lambda wildcards: samples.loc[(wildcards.sample), ["fq"]].dropna()[0].split(',')
    output:
        "combined.fastq/{sample}.fastq.gz"
    group: "minion_assemble"
    shell:
        """
        echo {input} > {output}
        """
My question is also posted on GitHub: https://github.com/snakemake/snakemake/issues/643.
I second Maarten's answer: with that many files you are running up against a shell limit, and snakemake is just doing a poor job of helping you identify the problem.
Based on the issue you reference, it seems like you are using cat to combine all of your files. Maybe following the answer here would help:
rule combine_fastq_list:
    input:
        lambda wildcards: samples.loc[(wildcards.sample), ["fq"]].dropna()[0].split(',')
    output:
        temp("{sample}.tmp.list")
    group: "minion_assemble"
    run:
        # write one input path per line (the file must be opened for writing)
        with open(output[0], 'w') as out:
            out.write('\n'.join(input))

rule combine_fastq:
    input:
        "{sample}.tmp.list"
    output:
        'combined.fastq/{sample}.fastq.gz'
    group: "minion_assemble"
    shell:
        'cat {input} | '  # this is reading the list of files from the file
        'xargs zcat -f | '
        '...'
Hope it gets you on the right track.
Edit:
The first option hands the file names to xargs, which may split them across several invocations of your command. A different option that executes the command once for the whole list of inputs is:
rule combine_fastq:
    ...
    shell:
        """
        command $(< {input}) ...
        """
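As a shell-free variant of the file-list idea in this answer, the concatenation could also be done in Python, so that no long command line is built at all. This is only a sketch under assumptions: the helper names write_list and combine_from_list are hypothetical, every listed file is assumed to be gzip-compressed (unlike `zcat -f`, which also accepts plain files), and the demo data is made up.

```python
import gzip
import os
import shutil
import tempfile


def write_list(paths, list_file):
    """Write one path per line, like the list-writing rule above."""
    with open(list_file, "w") as out:
        out.write("\n".join(paths))


def combine_from_list(list_file, combined):
    """Concatenate the decompressed content of every listed gzip file."""
    with open(list_file) as handle:
        paths = [line.strip() for line in handle if line.strip()]
    with gzip.open(combined, "wb") as out:
        for path in paths:
            with gzip.open(path, "rb") as src:
                shutil.copyfileobj(src, out)


# Demo with two tiny made-up fastq files.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i in range(2):
        path = os.path.join(tmp, f"part{i}.fastq.gz")
        with gzip.open(path, "wt") as fh:
            fh.write(f"@read{i}\nACGT\n+\nIIII\n")
        paths.append(path)

    list_file = os.path.join(tmp, "sample.tmp.list")
    combined = os.path.join(tmp, "combined.fastq.gz")
    write_list(paths, list_file)
    combine_from_list(list_file, combined)

    with gzip.open(combined, "rt") as fh:
        text = fh.read()
    print(text.count("@read"))
```

Inside a Snakefile, the body of combine_from_list could sit in a run: block, with input[0] and output[0] in place of the two arguments.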
For those landing here with similar questions (like Snakemake expand function alternative), snakemake 6 can handle long command lines. The following test fails on snakemake < 6 but succeeds on 6.0.0 on my Ubuntu machine:
rule all:
    input:
        'output.txt',

rule one:
    output:
        'output.txt',
    params:
        x= list(range(0, 1000000))
    shell:
        r"""
        echo {params.x} > {output}
        """
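To get a rough feeling for how long such a command is compared with a system limit, the length of the expanded command can be measured in Python. This is only a sketch: the file names are made up, command_length is a hypothetical helper, and SC_ARG_MAX (available on POSIX systems only) is just one of several limits that can apply; which one is actually hit in the report above is not established here.

```python
import os


def command_length(files, prefix="echo "):
    """Length in bytes of a command that lists every file name."""
    return len((prefix + " ".join(files)).encode())


# Made-up file names standing in for the real fastq paths.
files = [f"/data/run/reads_{i:04d}.fastq.gz" for i in range(500)]
length = command_length(files)

# POSIX limit on the total size of arguments plus environment.
arg_max = os.sysconf("SC_ARG_MAX")
print(length, arg_max)
```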

snakemake. How to pass target from command line when creating multiple targets

With help following a previous question, this code creates targets (copies of the file named "practice_phased_reversed.vcf") in each of two directories.
dirs = ['k_1', 'k2_10']

rule all:
    input:
        expand("{f}/practice_phased_reversed.vcf", f=dirs)

rule r1:
    input:
        "practice_phased_reversed.vcf"
    output:
        "{f}/{input}"
    shell:
        "cp {input} {output}"
However, I would like to pass the target file on the snakemake command line.
I tried this (below), with the command "snakemake practice_phased_reversed.vcf", but it gave an error: "MissingRuleException: No rule to produce practice_phased_reversed.vcf"
dirs = ['k_1', 'k2_10']

rule all:
    input:
        expand("{f}/{{base}}_phased_reversed.vcf", f=dirs)

rule r1:
    input:
        "{base}_phased_reversed.vcf"
    output:
        "{f}/{input}"
    shell:
        "cp {input} {output}"
Thanks for any help
I think you should pass the target file name as a configuration option on the command line and use that option to construct the file names in the Snakefile:
target = config['target']
dirs = ['k_1', 'k2_10']

rule all:
    input:
        expand("{f}/%s" % target, f=dirs),

rule r1:
    input:
        target,
    output:
        "{f}/%s" % target,
    shell:
        "cp {input} {output}"
To be executed as:
snakemake -C target=practice_phased_reversed.vcf
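To see why the "%s" trick in this answer works, it may help to look at the strings it builds. In the plain-Python sketch below, the config dict is a stand-in for what `-C target=...` would provide, and the list comprehension is a hypothetical substitute for expand(); the point is that % formatting fills in the file name but leaves {f} in place for Snakemake to treat as a wildcard.

```python
# Stand-in for the config that "snakemake -C target=..." would supply.
config = {"target": "practice_phased_reversed.vcf"}
target = config["target"]
dirs = ['k_1', 'k2_10']

# The file name is substituted; {f} survives as a wildcard.
output_pattern = "{f}/%s" % target

# Hypothetical substitute for expand(output_pattern, f=dirs).
all_inputs = [output_pattern.format(f=d) for d in dirs]

print(output_pattern)
print(all_inputs)
```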
Your target file practice_phased_reversed.vcf doesn't satisfy the output pattern of rule r1: it is missing a value for the wildcard {f}.
With the following rule, snakemake data/practice_phased_reversed.vcf will work as expected, because data matches the wildcard f.
Code:
rule r1:
    input:
        "{base}_phased_reversed.vcf"
    output:
        "{f}/{base}_phased_reversed.vcf"
    shell:
        "cp {input} {output}"
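The matching described in this answer can be imitated with a regular expression. The sketch below is a simplified illustration, not Snakemake's actual code: pattern_to_regex is a hypothetical helper that turns each {name} into a named group matching one or more characters and ignores wildcard constraints.

```python
import re


def pattern_to_regex(pattern):
    """Turn a pattern like '{f}/{base}_x.vcf' into a compiled regex."""
    # re.split with a capturing group alternates literal text and names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2:   # odd positions hold wildcard names
            regex += f"(?P<{part}>.+)"
        else:       # even positions hold literal text
            regex += re.escape(part)
    return re.compile(regex + "$")


output = pattern_to_regex("{f}/{base}_phased_reversed.vcf")

no_match = output.match("practice_phased_reversed.vcf")
match = output.match("data/practice_phased_reversed.vcf")
print(no_match)
print(match.groupdict())
```

The first target has no folder part, so nothing can be assigned to f and no rule is found; the second target yields a value for both wildcards.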

Snakemake: strip off path from input

I want to write a pipeline in snakemake that takes an input file from config.yaml, runs a command, and writes the output to the current directory under the original filename + new suffix.
Snakefile
configfile: "config.yaml"

rule target:
    input:
        config["reads"]+".fasta.gz",

rule raw_convert:
    input:
        config["reads"]
    output:
        config["reads"]+".fasta.gz" # old path specified here
    shell:
        "sed -n '1~4s/^#/>/p;2~4p' {input} | gzip > {output}"
config.yaml
reads: /path/to/dir/myreads.fq.gz
Using bash, I would write something like this to get the file myreads.fq.gz.fasta.gz:
sed -n '1~4s/^#/>/p;2~4p' ${input} | gzip >$(basename ${input}).fasta.gz
In this solution, I pair read basenames to their full path in a dict, and then use it in rules. This would fail if basenames are not unique though.
import os

# assumes config["reads"] is a list of paths
d = {}
for read in config["reads"]:
    basename = os.path.basename(read)
    d[basename] = read

rule all:
    input:
        expand('{read_basename}.fasta.gz', read_basename=list(d.keys()))

rule xxx:
    input:
        lambda wildcards: d[wildcards.read_basename]
    output:
        "{read_basename}.fasta.gz"
    shell:
        'something'
For readability, you may want to replace .fq.gz with .fasta.gz instead of appending it.
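The two caveats in this answer (basenames must be unique, and the suffix could be replaced rather than appended) can be combined in one small helper. This is a sketch under assumptions: map_basenames is a hypothetical name, and the example paths are made up to stand in for config["reads"].

```python
import os


def map_basenames(reads):
    """Map each output name to its full input path, refusing duplicates."""
    mapping = {}
    for read in reads:
        basename = os.path.basename(read)
        # Swap the suffix instead of appending to it.
        if basename.endswith(".fq.gz"):
            basename = basename[: -len(".fq.gz")]
        out_name = basename + ".fasta.gz"
        if out_name in mapping:
            raise ValueError(f"duplicate output name: {out_name}")
        mapping[out_name] = read
    return mapping


# Hypothetical paths standing in for config["reads"].
reads = ["/path/to/dir/myreads.fq.gz", "/other/dir/more.fq.gz"]
mapping = map_basenames(reads)
print(mapping)
```

In the Snakefile above, the keys of this mapping would feed rule all and the lambda would look up mapping[wildcards.read_basename + ".fasta.gz"], or the wildcard could be changed to cover the whole output name.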
I finally came up with some code that seems to do the trick:
configfile: "config.yaml"

import os

basenamereads = os.path.basename(config["reads"])

rule target:
    input: expand("{myoutput}.fasta.gz", myoutput=basenamereads)

rule xxx:
    input:
        config["reads"]
    output:
        os.path.basename(config["reads"])+".fasta.gz"
    shell:
        "cat {input} >{output}"

snakemake: Use different folders for input and output files

This is most likely a very basic issue, but I could not find it documented anywhere.
rule all:
    input:
        "fasta_file.fna"
    output:
        "headers.txt"
    shell:
        "grep '^>' {input} > {output}"
I want to run this for a set of files that are not necessarily in the same folder. Is there a way to provide the input file name from another directory on the command line (or in a config file)?
Okay, never mind, this was probably not a smart question after all.
rule all:
    input:
        "input/{sample}.fna"
    output:
        "output/{sample}_headers.txt"
    shell:
        "grep '^>' {input} > {output}"
And with that I can just run snakemake for my target file, something like snakemake output/A1_headers.txt, or build a for loop over my input sequences.
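The "for loop over my input sequences" could look roughly like the following sketch, which derives one target per input file and prints a single snakemake command. The helper name header_targets and the file names are made up for illustration; in practice the list might come from os.listdir("input").

```python
import os


def header_targets(fna_files):
    """Derive output/{sample}_headers.txt for every input/{sample}.fna."""
    targets = []
    for path in fna_files:
        sample, ext = os.path.splitext(os.path.basename(path))
        if ext == ".fna":
            targets.append(f"output/{sample}_headers.txt")
    return targets


# Made-up input names; files with other extensions are skipped.
targets = header_targets(["input/A1.fna", "input/B2.fna", "input/notes.txt"])
command = " ".join(["snakemake"] + targets)
print(command)
```

Passing all targets in one call lets snakemake schedule them together instead of starting once per file.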