Snakemake complains that "Only input files can be specified as functions" in the shell line.
def get_filename(wildcards):
    sampleid = wildcards.sample.split('-')[1]
    GeneFuse_vcf = f"{sampleid}.fusion.vcf"
    return GeneFuse_vcf
rule GeneFuse:
    input:
        bam_path = f"{outputdir}/"+"{sample}/13_genefusion"
    params:
        svabaflow = config["svabaflow"],
    output:
        GeneFuse_vcf = get_filename
    shell:
        "{params.svabaflow} {input} {wildcards.sample}"
In the rule GeneFuse, my {sample} values look like ctn-305A26000547,
and I want to tell Snakemake that the output file (GeneFuse_vcf) is named 305A26000547.fusion.vcf.
Of course, if {sample} is ctn-367A23594285, the file name should be "367A23594285.fusion.vcf".
Any suggestions on how to fix this? Thanks.
Assuming you already have the list of SAMPLEIDS as you state in the comment, you can construct a rule all that requests the outputs of rule GeneFuse like this:
rule all:
    input:
        expand("{sample}.fusion.vcf", sample=SAMPLEIDS),
    default_target: True
rule GeneFuse:
    input:
        bam_path=f"{outputdir}/" + "{sample}/13_genefusion",
    params:
        svabaflow=config["svabaflow"],
    output:
        GeneFuse_vcf="{sample}.fusion.vcf",
    shell:
        "{params.svabaflow} {input} {wildcards.sample}"
If the input directories still carry the prefix (for example ct1-305A26000547), an input function can map each bare ID back to its prefixed name:
rule all:
    input:
        expand("{sample}.fusion.vcf", sample=SAMPLEIDS),
    default_target: True
dictionary = {"305A26000547": "ct1-305A26000547",
              "367A23594285": "ct5-367A23594285",
              "302A67458112": "ct9-302A67458112"}
def get_path(wildcards):
    ss = dictionary[wildcards.sample]
    bam_path = f"{outputdir}/{ss}/13_genefusion"
    return bam_path
rule GeneFuse:
    input:
        get_path,
    params:
        svabaflow=config["svabaflow"],
    output:
        GeneFuse_vcf="{sample}.fusion.vcf",
    shell:
        "{params.svabaflow} {input} {wildcards.sample}"
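The lookup dictionary above does not have to be typed by hand. Here is a minimal sketch, assuming the prefixed names are available as a plain Python list; the list PREFIXED_SAMPLES and the helper strip_prefix are hypothetical names and not part of the original answer.

```python
# Hypothetical list of prefixed sample names, e.g. read from a config file
# or from a directory listing.
PREFIXED_SAMPLES = ["ct1-305A26000547", "ct5-367A23594285", "ct9-302A67458112"]


def strip_prefix(name):
    """Return the part after the first '-', e.g. 'ct1-305A26000547' -> '305A26000547'."""
    return name.split("-", 1)[1]


# Map bare ID -> prefixed name, the same shape as the hand-written dictionary.
dictionary = {strip_prefix(name): name for name in PREFIXED_SAMPLES}

# The bare IDs are what rule all would expand over.
SAMPLEIDS = sorted(dictionary)
```

With this in place, get_path can keep using dictionary[wildcards.sample] unchanged.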
Is it possible to use Dynamic Branching/Plumbing in a snakefile?
I wish to perform the following:
A -> B -> D
or
A -> C -> D
Depending on whether a config variable is true.
For example:
*(rules.B if config["deblur"] == True else rules.C),
In this instance it runs both rules B and C.
I have tried
if config["deblur"] == True:
    rules.B,
else:
    rules.C,
But this gives me a syntax error.
In the next rule the input is as follows.
input:
    qiime_feature_table_input = rules.qiime_deblur.output.qiime_deblur_table if config["deblur"] == "True" else rules.qiime_denoise.output.qiime_denoise_table
Thanks for your help!
Since the value of the configuration variable is known before runtime, there's no need for dynamic modification of the DAG in this case. Here's a simple snakefile that will run rules a -> b -> d if config_var is true and rules a -> c -> d if config_var is false:
config_var = True
rule all:
    input:
        "d/out.txt",
rule a:
    output:
        "a/a.txt",
    shell:
        """
        echo 'a' > '{output}'
        """
rule b:
    input:
        rules.a.output,
    output:
        "b/b.txt",
    shell:
        """
        echo 'b' > '{output}'
        """
rule c:
    input:
        rules.a.output,
    output:
        "c/c.txt",
    shell:
        """
        echo 'c' > '{output}'
        """
rule d:
    input:
        rules.b.output if config_var else rules.c.output,
    output:
        "d/out.txt",
    shell:
        """
        cat '{input}' > '{output}'
        """
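One thing to watch in the question: config["deblur"] is compared to True in one place and to the string "True" in another, and depending on how the value is supplied it may arrive as either type. Below is a small sketch of normalizing it before choosing a branch. The helper as_bool is a hypothetical name, and the list of accepted spellings is an assumption.

```python
def as_bool(value):
    """Interpret a config value as a boolean; accepts bools and common string spellings."""
    if isinstance(value, bool):
        return value
    text = str(value).strip().lower()
    if text in {"true", "yes", "1"}:
        return True
    if text in {"false", "no", "0"}:
        return False
    raise ValueError(f"Cannot interpret {value!r} as a boolean")


# Stand-in for the Snakemake config dictionary (here the value is a string).
config = {"deblur": "True"}

# Pick the upstream file the same way rule d picks its input.
branch = "b/b.txt" if as_bool(config["deblur"]) else "c/c.txt"
```

In the Snakefile, config_var = as_bool(config["deblur"]) could then replace the hard-coded config_var = True.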
Not sure if this applies to your case, but one option is to have the two rules produce the same file (it could be a dummy file) and define only one of them at a time with a conditional. Here's rough pseudocode:
config_var = True
rule all:
    input: 'test.txt'
if config_var:
    rule B:
        output: 'test.txt'
else:
    rule C:
        output: 'test.txt'
As in the title, my Snakefile gives me a SyntaxError for the expand function in the all rule. I am aware that this is typically caused by whitespace or indentation errors; however, I have confirmed that there are no tabs in the file. I've gone through and deleted every whitespace character, and searched the file with grep. I appreciate any advice.
Error Message:
SyntaxError in line 14 of /PATH/to/Snakefile:
Unexpected keyword expand in rule definition (Snakefile, line 14)
Code:
from glob import glob
from numpy import unique
reads = glob('{}/*'.format(config['readDir']))
samples = []
for i in reads:
    sampleName = i.replace('{}/'.format(config['readDir']), '')
    sampleName = sampleName.replace('{}'.format(config['readSuffix1']), '')
    sampleName = sampleName.replace('{}'.format(config['readSuffix2']), '')
    samples.append(sampleName)
samples = unique(samples)
rule all:
    expand('fastqc/{sample}_1_fastqc.html', sample=samples),
    expand('gene_count/{sample}.count', sample=samples)
rule fastqc:
    input:
        r1 = config['readDir'] + '/{sample}' + config['readSuffix1'],
        r2 = config['readDir'] + '/{sample}' + config['readSuffix2']
    output:
        o1 = 'fastqc/{sample}_1_fastqc.html',
        o2 = 'fastqc/{sample}_2_fastqc.html'
    params:
        'fastqc'
    shell:
        'fastqc {input.r1} {input.r2} -o {params}'
rule trim:
    input:
        r1 = config['readDir'] + '/{sample}' + config['readSuffix1'],
        r2 = config['readDir'] + '/{sample}' + config['readSuffix2']
    output:
        'trimmed_reads/{sample}_val_1.fq',
        'trimmed_reads/{sample}_val_2.fq'
    params:
        outDir = 'trimmed_reads',
        suffix = '{sample}',
        minPhred = config['minPhred'],
        minOverlap = config['minOverlap']
    shell:
        'trim_galore --paired --quality {params.minPhred} '
        '--stringency {params.minOverlap} --basename {params.suffix} '
        '--output_dir {params.outDir} {input.r1} {input.r2}'
rule align:
    input:
        r1 = 'trimmed_reads/{sample}_val_1.fq',
        r2 = 'trimmed_reads/{sample}_val_2.fq'
    output:
        sam = temp('aligned_reads/{sample}.sam'),
        bam = 'aligned_reads/{sample}.bam'
    params:
        ref = config['hisatRef']
    threads:
        config['threads']
    log:
        'logs/{sample}_hisat2.log'
    shell:
        'hisat2 --dta -p {threads} -x {params.ref} '
        '-1 {input.r1} -2 {input.r2} -S {output.sam} 2> {log}; '
        'samtools sort -@ {threads} -o {output.bam} {output.sam}; '
rule sort_name:
    input:
        'aligned_reads/{sample}.bam'
    output:
        bam = temp('aligned_reads/{sample}_name_sorted.bam'),
        index = temp('aligned_reads/{sample}_name_sorted.bam.bai')
    threads:
        config['threads']
    shell:
        'samtools sort -n -@ {threads} -o {output.bam} {input}; '
rule count:
    input:
        bam = 'aligned_reads/{sample}.bam'
    output:
        'gene_count/{sample}.count'
    params:
        annotations = config['annotations'],
        minMapq = config['minMapq'],
        stranded = config['stranded']
    shell:
        'htseq-count -s {params.stranded} -a {params.minMapq} '
        '--additional_attr=gene_name --additional_attr=gene_type '
        '{input.bam} {params.annotations} > {output}'
The error comes from rule all: the two expand calls are not placed under an input: directive, so Snakemake reads expand as an unknown keyword in the rule definition. Once the calls sit under input:, they can stay as two comma-separated items, or you can join the two lists with + as shown below.
expand('fastqc/{sample}_1_fastqc.html', sample=samples) + expand('gene_count/{sample}.count', sample=samples)
You could also combine both into a single expand call as follows:
expand(['fastqc/{sample}_1_fastqc.html', 'gene_count/{sample}.count'], sample=samples)
The following code solves the problem by adding the missing input: directive:
rule all:
    input:
        expand('fastqc/{sample}_1_fastqc.html', sample=samples),
        expand('gene_count/{sample}.count', sample=samples)
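The two expand calls, the + form and the single-call form all describe the same list of targets. To illustrate, here is a plain-Python stand-in for what expand does with a single wildcard. The function expand_like is hypothetical and only mimics the simplest case; it is not Snakemake's implementation.

```python
def expand_like(patterns, **wildcards):
    """Fill each pattern with every value of one wildcard (simplified stand-in for expand)."""
    if isinstance(patterns, str):
        patterns = [patterns]
    (name, values), = wildcards.items()  # this sketch supports exactly one wildcard
    return [p.format(**{name: v}) for p in patterns for v in values]


samples = ["s1", "s2"]

# Two separate calls joined with +.
separate = (expand_like("fastqc/{sample}_1_fastqc.html", sample=samples)
            + expand_like("gene_count/{sample}.count", sample=samples))

# One call with a list of patterns.
combined = expand_like(
    ["fastqc/{sample}_1_fastqc.html", "gene_count/{sample}.count"],
    sample=samples,
)
```

Both variables end up holding the same four file names, which is why any of the three spellings works once the calls are under input:.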
I'm finding that the name of the output file per rule seems to need a static portion, e.g. "data/{wildcard}_data.csv" vs. "{wildcard}_data.csv"
For example, the script below returns the following error on dryrun:
Building DAG of jobs...
MissingInputException in line 12 of /home/rebecca/workflows/exploring_tools/affymetrix_preprocess/snakemake/Snakefile:
Missing input files for rule getDatFiles:
GSE4290
Script:
rule all:
    input: expand("{geoid}_datout.scaled.expr.csv", geoid = config['geoid'], out_dir = config['out_dir'])
    benchmark: "benchmark.csv"
rule getDatFiles:
    input: "{geoid}"
    output: temp("{geoid}_datFiles.RData")
    shell:
        "Rscript scripts/getDatFiles.R"
rule maskProbes:
    input: "{geoid}_datFiles.RData"
    output: temp("{geoid}_datFiles.masked.RData")
    params:
        probeFilterFxn = lambda x: config['probeFilterFxn'],
        minProbeNumber = lambda x: config['minProbeNumber'],
        probeSingle = lambda x: config['probeSingle']
    script: "scripts/maskProbes.R"
rule runExpresso:
    input: "{geoid}_datFiles.masked.RData"
    output: temp("{geoid}_datout.RData")
    params:
        bgcorrect_method = lambda x: config['bgcorrect_method'],
        normalize = lambda x: config['normalize'],
        pmcorrect_method = lambda x: config['pmcorrect_method'],
        summary_method = lambda x: config['summary_method']
    script: "scripts/runExpresso.R"
rule scaleData:
    input: "{geoid}_datout.RData"
    output: temp("{geoid}_datout.scaled.RData")
    params: sc = lambda x: config['sc']
    script: "scripts/scaleData.R"
rule getExpr:
    input: "{geoid}_datout.scaled.RData"
    output: temp("{geoid}_datout.scaled.expr.csv")
    script: "scripts/getExpr.R"
... while the following script runs without error (the only difference is the "output/" prefix on the file names):
rule all:
    input: expand("output/{geoid}_datout.scaled.expr.csv", geoid = config['geoid'], out_dir = config['out_dir'])
    benchmark: "output/benchmark.csv"
rule getDatFiles:
    input: "output/{geoid}"
    output: temp("output/{geoid}_datFiles.RData")
    shell:
        "Rscript scripts/getDatFiles.R"
rule maskProbes:
    input: "output/{geoid}_datFiles.RData"
    output: temp("output/{geoid}_datFiles.masked.RData")
    params:
        probeFilterFxn = lambda x: config['probeFilterFxn'],
        minProbeNumber = lambda x: config['minProbeNumber'],
        probeSingle = lambda x: config['probeSingle']
    script: "scripts/maskProbes.R"
rule runExpresso:
    input: "output/{geoid}_datFiles.masked.RData"
    output: temp("output/{geoid}_datout.RData")
    params:
        bgcorrect_method = lambda x: config['bgcorrect_method'],
        normalize = lambda x: config['normalize'],
        pmcorrect_method = lambda x: config['pmcorrect_method'],
        summary_method = lambda x: config['summary_method']
    script: "scripts/runExpresso.R"
rule scaleData:
    input: "output/{geoid}_datout.RData"
    output: temp("output/{geoid}_datout.scaled.RData")
    params: sc = lambda x: config['sc']
    script: "scripts/scaleData.R"
rule getExpr:
    input: "output/{geoid}_datout.scaled.RData"
    output: temp("output/{geoid}_datout.scaled.expr.csv")
    script: "scripts/getExpr.R"
I'm having a hard time understanding why this might be happening. Ultimately, I'd like to write workflows that are as flexible as possible, and ideally, that entails making the output directory variable.
Any insight would be much appreciated.
You have:
rule getDatFiles:
    input: "{geoid}"
which means there should be a file in the current directory named just {geoid}, e.g. ./GSE4290. I suspect what you want is:
rule getDatFiles:
    input: "data/{geoid}_data.csv"
    ...
input: "output/{geoid}" may work because there is already a file named output/GSE4290 created elsewhere.
(I haven't looked at the rest of the scripts.)
Are you running them in the same directory?
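On the goal of making the output directory variable: one common approach is to build the file patterns from a Python variable, doubling the braces so that the wildcard survives the f-string. Here is a minimal sketch, assuming the directory comes from a config key named out_dir as in the question's expand call; the default "output" and the variable names are assumptions.

```python
# Stand-in for the Snakemake config dictionary.
config = {"geoid": ["GSE4290"], "out_dir": "results"}

OUT_DIR = config.get("out_dir", "output")

# {{geoid}} stays a literal {geoid} after the f-string is evaluated,
# so each result is still a wildcard pattern.
dat_files = f"{OUT_DIR}/{{geoid}}_datFiles.RData"
expr_csv = f"{OUT_DIR}/{{geoid}}_datout.scaled.expr.csv"

# What rule all would request for each geoid.
targets = [expr_csv.format(geoid=g) for g in config["geoid"]]
```

In a Snakefile these strings would be used directly as input: and output: values. The input of getDatFiles still has to name a file that actually exists or that another rule produces.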
I'm trying to combine these two rules together
rule fastqc:
    input:
        fastq = "{sample}.fastq.gz",
    output:
        zip1 = "{sample}_fastqc.zip",
        html = "{sample}_fastqc.html",
    threads:8
    shell:
        "fastqc -t {threads} {input.fastq}"
rule renamefastqc:
    input:
        zip1 = "{sample}_fastqc.zip",
        html = "{sample}_fastqc.html",
    output:
        zip1 = "{sample}__fastqc.zip",
        html = "{sample}__fastqc.html",
    shell:
        "mv {input.zip1} {output.zip1} && "
        "mv {input.html} {output.html} "
To look like this.
rule fastqc:
    input:
        fastq = "{sample}.fastq.gz"
    output:
        zip1 = "{sample}__fastqc.zip",
        html = "{sample}__fastqc.html"
    threads:8
    shell:
        "fastqc -t {threads} {input.fastq} && "
        "mv {outfile.zip} {output.zip1} && "
        "mv {outfile.html} {output.html}"
FastQC does not let you specify output file names; it always takes a file ending in fastq.gz and creates two files ending in _fastqc.zip and _fastqc.html. Normally I just write a rule that takes those outputs and produces the files with two underscores (the renamefastqc rule). But this means that every time I run the pipeline, Snakemake sees that the outputs of the fastqc rule are gone and wants to rebuild them. I'm therefore trying to combine both rules into one step.
You could use params to define files that are to be renamed.
rule all:
    input:
        "a123__fastqc.zip",
rule fastqc:
    input:
        fastq = "{sample}.fastq.gz",
    output:
        zip1 = "{sample}__fastqc.zip",
        html = "{sample}__fastqc.html",
    threads:8
    params:
        zip1 = lambda wildcards, output: output.zip1.replace('__', '_'),
        html = lambda wildcards, output: output.html.replace('__', '_')
    shell:
        """
        fastqc -t {threads} {input.fastq}
        mv {params.zip1} {output.zip1} \\
        && mv {params.html} {output.html}
        """
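One caveat with replace('__', '_') is that it rewrites every double underscore in the path, including any that appear in the sample name or a directory name. Below is a sketch of a narrower helper that only touches the suffix; the name fastqc_default_name is hypothetical.

```python
def fastqc_default_name(renamed_path):
    """Map '<sample>__fastqc.<ext>' back to the '<sample>_fastqc.<ext>' name FastQC writes."""
    for ext in ("zip", "html"):
        suffix = f"__fastqc.{ext}"
        if renamed_path.endswith(suffix):
            # Drop the double-underscore suffix and append the single-underscore one.
            return renamed_path[: -len(suffix)] + f"_fastqc.{ext}"
    raise ValueError(f"Unexpected file name: {renamed_path}")


zip_name = fastqc_default_name("a123__fastqc.zip")
html_name = fastqc_default_name("my__sample__fastqc.html")
```

In the rule it could be used as zip1 = lambda wildcards, output: fastqc_default_name(output.zip1), and likewise for html.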
I want to use a function on params.
Snakemake:
def mitico(x):
    res = int(x) + 1
    return res
I have a wildcard {sample} whose values are integers, and I want to use {sample}+1.
How can I do this inside the Snakemake params?
In the rule:
rule create_pt:
    input:
        read="CALL2/{sample}.vcf",
    output:
        out="OUT/{sample}.txt"
    conda:
        "envs/mb.yml"
    params:
        db_ens = "/mnt/mpwor2k/",
        fst = "/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa",
        tumor_id="{sample}",
        normal_id=lambda wildcards: mitico('{sample}')
    shell:
I get this error:
ValueError: invalid literal for int() with base 10: '{sample}'
Wildcards:
sample=432
'{sample}' inside your lambda function is just a literal string, not the wildcard value. This is how to use the wildcard in a lambda:
lambda wildcards: mitico(wildcards.sample)
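The difference can be reproduced outside Snakemake. Here is a small sketch that uses SimpleNamespace as a stand-in for the wildcards object Snakemake passes to the function; that stand-in is an assumption for illustration only.

```python
from types import SimpleNamespace


def mitico(x):
    """Return the integer value of x plus one."""
    return int(x) + 1


# Stand-in for Snakemake's wildcards object; wildcard values arrive as strings.
wildcards = SimpleNamespace(sample="432")

# Same lambda as in the answer, called by hand here.
normal_id = (lambda wildcards: mitico(wildcards.sample))(wildcards)

# Passing the literal text '{sample}' reproduces the error from the question.
try:
    mitico("{sample}")
except ValueError as err:
    error_message = str(err)
```

Because mitico converts with int(), the string value of the wildcard is handled without any extra casting in the rule.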