Nextflow reports an error: No such variable: from

I'm trying to learn Nextflow but it's not going very well.
I used NGS-based paired-end sequencing data to build an analysis pipeline from fastq files to vcf files with Nextflow, but I got stuck right at the beginning, as shown in the code below. The first process, soapnuke, works fine, but when passing the files from its output channels (clean_fq1 / clean_fq2) to the next process I get: ERROR: No such variable: from. What should I do? Thanks for any help.
params.fq1 = "/data/mPCR/220213_I7_V350055104_L3_SZPVL22000812-81/*1.fq.gz"
params.fq2 = "/data/mPCR/220213_I7_V350055104_L3_SZPVL22000812-81/*2.fq.gz"
params.index = "/home/duxu/project/data/index.list"
params.primer = "/home/duxu/project/data/primer_*.fasta"
params.output = 'results'

fq1 = Channel.fromPath(params.fq1)
fq2 = Channel.fromPath(params.fq2)
index = Channel.fromPath(params.index)
primer = Channel.fromPath(params.primer)
process soapnuke {
    conda 'soapnuke'
    tag { "soapnuke ${fq1} ${fq2}" }
    publishDir "${params.outdir}/SOAPnuke", mode: 'copy'

    input:
    file rawfq1 from fq1
    file rawfq2 from fq2

    output:
    file 'clean1.fastq.gz' into clean_fq1
    file 'clean2.fastq.gz' into clean_fq2

    script:
    """
    SOAPnuke filter -1 $rawfq1 -2 $rawfq2 -l 12 -q 0.5 -Q 2 -o . \
        -C clean1.fastq.gz -D clean2.fastq.gz
    """
}
I get stuck on this:
process barcode_splitter {
    conda 'barcode_splitter'
    tag { "barcode_splitter ${fq1} ${fq2}" }
    publishDir "${params.outdir}/barcode_splitter", mode: 'copy'

    input:
    file split1 from clean_fq1
    file split2 from clean_fq2
    index from params.index

    output:
    file '*-read-1.fastq.gz' into trimmed_index1
    file '*-read-2.fastq.gz' into trimmed_index2

    script:
    """
    barcode_splitter --bcfile $index $split1 $split2 --idxread 1 2 --mismatches 1 --suffix .fastq --gzipout
    """
}

The code below will produce the error you see:
index = Channel.fromPath( params.index )

process barcode_splitter {
    ...
    input:
    index from params.index
    ...
}
What you want is:
index = file( params.index )

process barcode_splitter {
    ...
    input:
    path index
    ...
}
Note that when the file input name is the same as the channel name, the from channel declaration can be omitted. I also used the path qualifier above, as it should be preferred over the file qualifier when using Nextflow 19.10.0 or later.
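As a quick illustration of that shorthand, here's a minimal DSL1 sketch (the process name is made up; the index path is the one from the question) showing that the implicit form is just the explicit from declaration with the matching name dropped:

params.index = "/home/duxu/project/data/index.list"

index = file( params.index )

process show_index {
    input:
    path index           // same as writing: path index from index

    script:
    """
    cat "${index}"
    """
}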
You may also want to consider refactoring to use the fromFilePairs factory method. Here's one way, untested of course:
params.reads = "/data/mPCR/220213_I7_V350055104_L3_SZPVL22000812-81/*_{1,2}.fq.gz"
params.index = "/home/duxu/project/data/index.list"
params.outdir = 'results'

reads_ch = Channel.fromFilePairs( params.reads )
index = file( params.index )

process soapnuke {
    tag { sample }
    publishDir "${params.outdir}/SOAPnuke", mode: 'copy'
    conda 'soapnuke'

    input:
    tuple val(sample), path(reads) from reads_ch

    output:
    tuple val(sample), path('clean{1,2}.fastq.gz') into clean_reads_ch

    script:
    def (rawfq1, rawfq2) = reads

    """
    SOAPnuke filter \\
        -1 "${rawfq1}" \\
        -2 "${rawfq2}" \\
        -l 12 \\
        -q 0.5 \\
        -Q 2 \\
        -o . \\
        -C "clean1.fastq.gz" \\
        -D "clean2.fastq.gz"
    """
}

process barcode_splitter {
    tag { sample }
    publishDir "${params.outdir}/barcode_splitter", mode: 'copy'
    conda 'barcode_splitter'

    input:
    tuple val(sample), path(reads) from clean_reads_ch
    path index

    output:
    tuple val(sample), path('*-read-{1,2}.fastq.gz') into trimmed_index

    script:
    def (splitfq1, splitfq2) = reads

    """
    barcode_splitter \\
        --bcfile "${index}" \\
        "${splitfq1}" \\
        "${splitfq2}" \\
        --idxread 1 2 \\
        --mismatches 1 \\
        --suffix ".fastq" \\
        --gzipout
    """
}

Related

Snakemake: Only input files can be specified as functions

Snakemake complains that "Only input files can be specified as functions" in the shell line.
def get_filename(wildcards):
    sampleid = wildcards.sample.split('-')[1]
    GeneFuse_vcf = f"{sampleid}.fusion.vcf"
    return GeneFuse_vcf

rule GeneFuse:
    input:
        bam_path = f"{outputdir}/" + "{sample}/13_genefusion"
    params:
        svabaflow = config["svabaflow"],
    output:
        GeneFuse_vcf = get_filename
    shell:
        "{params.svabaflow} {input} {wildcards.sample}"
In the rule GeneFuse, my {sample} format is ctn-305A26000547, and I want to tell Snakemake that my output file (GeneFuse_vcf) is named 305A26000547.fusion.vcf.
Of course, if the {sample} is ctn-367A23594285, the filename should be "367A23594285.fusion.vcf".
Any suggestion to fix it? Thanks.
Assuming you already have the list of SAMPLEIDS as you state in the comment, you can construct a rule all which calls rule GeneFuse like this:
rule all:
    input:
        expand("{sample}.fusion.vcf", sample=SAMPLEIDS),
    default_target: True

rule GeneFuse:
    input:
        bam_path=f"{outputdir}/" + "{sample}/13_genefusion",
    params:
        svabaflow=config["svabaflow"],
    output:
        GeneFuse_vcf="{sample}.fusion.vcf",
    shell:
        "{params.svabaflow} {input} {wildcards.sample}"
If the folder names keep a per-sample prefix (ct1-, ct5-, ...) while the target files use the bare sample ID, you can map one to the other with a dictionary and an input function:

rule all:
    input:
        expand("{sample}.fusion.vcf", sample=SAMPLEIDS),
    default_target: True

dictionary = {"305A26000547": "ct1-305A26000547",
              "367A23594285": "ct5-367A23594285",
              "302A67458112": "ct9-302A67458112"}

def get_path(wildcards):
    ss = dictionary[wildcards.sample]
    bam_path = f"{outputdir}/{ss}/13_genefusion"
    return bam_path

rule GeneFuse:
    input:
        get_path,
    params:
        svabaflow=config["svabaflow"],
    output:
        GeneFuse_vcf="{sample}.fusion.vcf",
    shell:
        "{params.svabaflow} {input} {wildcards.sample}"

Nextflow DSL2 output from different processes mixed up as input in later processes

I have a DSL2 Nextflow pipeline that branches out to 2 FILTER processes. Then in the CONCAT process, I reuse the two previous process outputs as input. Also in the SUMMARY process, I reuse previous process outputs as input.
I am finding that when I run the pipeline with 2 or more pairs of fastq samples, that the inputs are mixed up.
For example, at the CONCAT step, I end up concating the bwa_2_ch output of one pair of fastq samples with the filter_1_ch of another pair of fastq samples instead of samples with the same pair_id.
I believe I am not writing the workflow { } channels and inputs correctly so that the workflow runs through the steps without mixing samples, but I am not sure how to define the inputs so that there is no mix-up.
//trimmomatic read trimming
process TRIM {
    tag "trim ${pair_id}"
    publishDir "${params.outdir}/$pair_id/trim_results"

    input:
    tuple val(pair_id), path(reads)

    output:
    tuple val(pair_id), path("trimmed_${pair_id}_...")

    script:
    """
    """
}

//bwa alignment
process BWA_1 {
    tag "align-1 ${pair_id}f"
    publishDir "${params.outdir}/$pair_id/..."

    input:
    tuple val(pair_id), path(reads)
    path index

    output:
    tuple val(pair_id), path("${pair_id}_...}")

    script:
    """
    """
}

process FILTER_1 {
    tag "filter ${pair_id}"
    publishDir "${params.outdir}/$pair_id/filter_results"

    input:
    tuple val(pair_id), path(reads)

    output:
    tuple val(pair_id),
        path("${pair_id}_...")

    script:
    """
    """
}

process FILTER_2 {
    tag "filter ${pair_id}"
    publishDir "${params.outdir}/$pair_id/filter_results"

    input:
    tuple val(pair_id), path(reads)

    output:
    tuple val(pair_id),
        path("${pair_id}_...")

    script:
    """
    """
}

//bwa alignment
process BWA_2 {
    tag "align-2 ${pair_id}"
    publishDir "${params.outdir}/$pair_id/bwa_2_results"

    input:
    tuple val(pair_id), path(reads)
    path index

    output:
    tuple val(pair_id), path("${pair_id}_...}")

    script:
    """
    """
}

//concatenate pf and non_human reads
process CONCAT {
    tag "concat ${pair_id}"
    publishDir "${params.outdir}/$pair_id"

    input:
    tuple val(pair_id), path(program_reads)
    tuple val(pair_id), path(pf_reads)

    output:
    tuple val(pair_id), path("${pair_id}_...")

    script:
    """
    """
}

//summary
process SUMMARY {
    tag "summary ${pair_id}"
    publishDir "${params.outdir}/$pair_id"

    input:
    tuple val(pair_id), path(trim_reads)
    tuple val(pair_id), path(non_human_reads)

    output:
    file("summary_${pair_id}.csv")

    script:
    """
    """
}
workflow {
    Channel
        .fromFilePairs(params.reads, checkIfExists: true)
        .set { read_pairs_ch }

    // trim reads
    trim_ch = TRIM(read_pairs_ch)

    // map to pf genome
    bwa_1_ch = BWA_1(trim_ch, params.pf_index)

    // filter mapped reads
    filter_1_ch = FILTER_1(bwa_1_ch)
    filter_2_ch = FILTER_2(bwa_1_ch)

    // map to pf and human genome
    bwa_2_ch = BWA_2(filter_2_ch, params.index)

    // concatenate non human reads
    concat_ch = CONCAT(bwa_2_ch, filter_1_ch)

    // summarize
    summary_ch = SUMMARY(trim_ch, concat_ch)
}
Mix-ups like this usually occur when a process erroneously receives two or more queue channels. Most of the time, what you want is one queue channel and one or more value channels when you require multiple input channels. Here, I'm not sure exactly what pair_id would be bound to, but it likely won't be what you expect:
input:
tuple val(pair_id), path(program_reads)
tuple val(pair_id), path(pf_reads)
What you want to do is replace the above with:
input:
tuple val(pair_id), path(program_reads), path(pf_reads)
And then use the join operator to create the required inputs. For example:
workflow {
    Channel
        .fromFilePairs( params.reads, checkIfExists: true )
        .set { read_pairs_ch }

    pf_index = file( params.pf_index )
    bwa_index = file( params.bwa_index )

    // trim reads
    trim_ch = TRIM( read_pairs_ch )

    // map to pf genome
    bwa_1_ch = BWA_1( trim_ch, pf_index )

    // filter mapped reads
    filter_1_ch = FILTER_1( bwa_1_ch )
    filter_2_ch = FILTER_2( bwa_1_ch )

    // map to pf and human genome
    bwa_2_ch = BWA_2( filter_2_ch, bwa_index )

    // concatenate non human reads
    concat_ch = bwa_2_ch \
        | join( filter_1_ch ) \
        | CONCAT

    // summarize
    summary_ch = trim_ch \
        | join( concat_ch ) \
        | SUMMARY
}
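If the behaviour of join is unclear, here is a minimal standalone sketch (made-up sample IDs and file names) of how it pairs tuples by their first element:

workflow {
    ch_bam   = Channel.of( ['sampleA', 'a.bam'],   ['sampleB', 'b.bam'] )
    ch_fastq = Channel.of( ['sampleB', 'b.fastq'], ['sampleA', 'a.fastq'] )

    // join matches items by their first element (the key), so each emitted tuple
    // combines files from the same sample, regardless of the order they arrive in:
    ch_bam.join( ch_fastq ).view()
    // [sampleA, a.bam, a.fastq]
    // [sampleB, b.bam, b.fastq]
}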

How to avoid "missing input files" error in Snakemake's "expand" function

I get a MissingInputException when I run the following snakemake code:
import re
import os

glob_vars = glob_wildcards(os.path.join(os.getcwd(), "inputs", "{fileName}.{ext}"))

rule end:
    input:
        expand(os.path.join(os.getcwd(), "inputs", "{fileName}_rename.fas"), fileName=glob_vars.fileName)

rule rename:
    '''
    rename fasta file to avoid problems
    '''
    input:
        expand("inputs/{{fileName}}.{ext}", ext=glob_vars.ext)
    output:
        os.path.join(os.getcwd(), "inputs", "{fileName}_rename.fas")
    run:
        list_ = []
        with open(str(input)) as f2:
            line = f2.readline()
            while line:
                while not line.startswith('>') and line:
                    line = f2.readline()
                fas_name = re.sub(r"\W", "_", line.strip())
                list_.append(fas_name)
                fas_seq = ""
                line = f2.readline()
                while not line.startswith('>') and line:
                    fas_seq += re.sub(r"\s", "", line)
                    line = f2.readline()
                list_.append(fas_seq)
        with open(str(output), "w") as f:
            f.write("\n".join(list_))
My Inputs folder contains these files:
G.bullatarudis.fasta
goldfish_protein.faa
guppy_protein.faa
gyrodactylus_salaris.fasta
protopolystoma_xenopodis.fa
salmon_protein.faa
schistosoma_mansoni.fa
The error message is:
Building DAG of jobs...
MissingInputException in line 10 of /home/zhangdong/works/NCBI/BLAST/RHB/test.rule:
Missing input files for rule rename:
inputs/guppy_protein.fasta
inputs/guppy_protein.fa
I assumed that the error is caused by the expand function, because only the guppy_protein.faa file exists, but expand also generates the guppy_protein.fasta and guppy_protein.fa paths. Are there any solutions?
By default, expand will produce all combinations of the input lists, so this is expected behavior. You need your input to look up the proper extension given a fileName. I haven't tested this:
glob_vars = glob_wildcards(os.path.join(os.getcwd(), "inputs", "{fileName}.{ext}"))

# create a dict to lookup extensions given fileNames
glob_vars_dict = {fname: ex for fname, ex in zip(glob_vars.fileName, glob_vars.ext)}

def rename_input(wildcards):
    ext = glob_vars_dict[wildcards.fileName]
    return f"inputs/{wildcards.fileName}.{ext}"

rule rename:
    input: rename_input
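To illustrate the first point, here is a quick sketch (using two of the file names listed above): by default expand takes the Cartesian product of its keyword lists, while passing the zip combinator pairs them element-wise, which is another way to keep each fileName with its own extension.

# Evaluated inside a Snakefile, where expand is already defined
# (or import it with: from snakemake.io import expand)
print(expand("inputs/{fileName}.{ext}",
             fileName=["guppy_protein", "schistosoma_mansoni"], ext=["faa", "fa"]))
# ['inputs/guppy_protein.faa', 'inputs/guppy_protein.fa',
#  'inputs/schistosoma_mansoni.faa', 'inputs/schistosoma_mansoni.fa']

print(expand("inputs/{fileName}.{ext}", zip,
             fileName=["guppy_protein", "schistosoma_mansoni"], ext=["faa", "fa"]))
# ['inputs/guppy_protein.faa', 'inputs/schistosoma_mansoni.fa']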
A few unsolicited style comments:
You don't have to prepend your glob_wildcards with os.getcwd; glob_wildcards("inputs/{fileName}.{ext}") should work, as Snakemake uses paths relative to the working directory by default.
Try to stick with snake_case instead of camelCase for your variable names in Python.
In this case, fileName isn't a great descriptor of what you are capturing. Maybe species_name or species would be clearer
Thanks to Troy Comi, I modified my code and it worked:
import re
import os
import itertools

speciess, exts = glob_wildcards(os.path.join(os.getcwd(), "inputs_test", "{species}.{ext}"))

rule end:
    input:
        expand("inputs_test/{species}_rename.fas", species=speciess)

def required_files(wildcards):
    list_combination = itertools.product([wildcards.species], list(set(exts)))
    exist_file = ""
    for file in list_combination:
        if os.path.exists(f"inputs_test/{'.'.join(file)}"):
            exist_file = f"inputs_test/{'.'.join(file)}"
    return exist_file

rule rename:
    '''
    rename fasta file to avoid problems
    '''
    input:
        required_files
    output:
        "inputs_test/{species}_rename.fas"
    run:
        list_ = []
        with open(str(input)) as f2:
            line = f2.readline()
            while line:
                while not line.startswith('>') and line:
                    line = f2.readline()
                fas_name = ">" + re.sub(r"\W", "_", line.replace(">", "").strip())
                list_.append(fas_name)
                fas_seq = ""
                line = f2.readline()
                while not line.startswith('>') and line:
                    fas_seq += re.sub(r"\s", "", line)
                    line = f2.readline()
                list_.append(fas_seq)
        with open(str(output), "w") as f:
            f.write("\n".join(list_))

Error in Snakemake 'Unexpected keyword expand in rule definition'

As in the title, my Snakefile is giving me a SyntaxError for the expand function in the all rule. I am aware that this is typically caused by whitespace/indentation errors; however, I have confirmed that there are no tabs in the file. I've gone through and deleted every whitespace as well as searched the file with grep. I appreciate any advice.
Error Message:
SyntaxError in line 14 of /PATH/to/Snakefile:
Unexpected keyword expand in rule definition (Snakefile, line 14)
Code:
from glob import glob
from numpy import unique

reads = glob('{}/*'.format(config['readDir']))

samples = []
for i in reads:
    sampleName = i.replace('{}/'.format(config['readDir']), '')
    sampleName = sampleName.replace('{}'.format(config['readSuffix1']), '')
    sampleName = sampleName.replace('{}'.format(config['readSuffix2']), '')
    samples.append(sampleName)
samples = unique(samples)

rule all:
    expand('fastqc/{sample}_1_fastqc.html', sample=samples),
    expand('gene_count/{sample}.count', sample=samples)
rule fastqc:
    input:
        r1 = config['readDir'] + '/{sample}' + config['readSuffix1'],
        r2 = config['readDir'] + '/{sample}' + config['readSuffix2']
    output:
        o1 = 'fastqc/{sample}_1_fastqc.html',
        o2 = 'fastqc/{sample}_2_fastqc.html'
    params:
        'fastqc'
    shell:
        'fastqc {input.r1} {input.r2} -o {params}'

rule trim:
    input:
        r1 = config['readDir'] + '/{sample}' + config['readSuffix1'],
        r2 = config['readDir'] + '/{sample}' + config['readSuffix2']
    output:
        'trimmed_reads/{sample}_val_1.fq',
        'trimmed_reads/{sample}_val_2.fq'
    params:
        outDir = 'trimmed_reads',
        suffix = '{sample}',
        minPhred = config['minPhred'],
        minOverlap = config['minOverlap']
    shell:
        'trim_galore --paired --quality {params.minPhred} '
        '--stringency {params.minOverlap} --basename {params.suffix} '
        '--output_dir {params.outDir} {input.r1} {input.r2}'

rule align:
    input:
        r1 = 'trimmed_reads/{sample}_val_1.fq',
        r2 = 'trimmed_reads/{sample}_val_2.fq'
    output:
        sam = temp('aligned_reads/{sample}.sam'),
        bam = 'aligned_reads/{sample}.bam'
    params:
        ref = config['hisatRef']
    threads:
        config['threads']
    log:
        'logs/{sample}_hisat2.log'
    shell:
        'hisat2 --dta -p {threads} -x {params.ref} '
        '-1 {input.r1} -2 {input.r2} -S {output.sam} 2> {log}; '
        'samtools sort -@ {threads} -o {output.bam} {output.sam}; '

rule sort_name:
    input:
        'aligned_reads/{sample}.bam'
    output:
        bam = temp('aligned_reads/{sample}_name_sorted.bam'),
        index = temp('aligned_reads/{sample}_name_sorted.bam.bai')
    threads:
        config['threads']
    shell:
        'samtools sort -n -@ {threads} -o {output.bam} {input}; '

rule count:
    input:
        bam = 'aligned_reads/{sample}.bam'
    output:
        'gene_count/{sample}.count'
    params:
        annotations = config['annotations'],
        minMapq = config['minMapq'],
        stranded = config['stranded']
    shell:
        'htseq-count -s {params.stranded} -a {params.minMapq} '
        '--additional_attr=gene_name --additional_attr=gene_type '
        '{input.bam} {params.annotations} > {output}'
This is an error from Python, as the rule all has two function calls separated by a comma. In this case the second expand call is causing the error. You could replace the , with a + to resolve the error, as shown below.
expand('fastqc/{sample}_1_fastqc.html', sample=samples) + expand('gene_count/{sample}.count', sample=samples)
You could also combine both into a single expand function as follows
expand(['fastqc/{sample}_1_fastqc.html', 'gene_count/{sample}.count'], sample=samples)
The following code will solve this problem:
rule all:
    input:
        expand('fastqc/{sample}_1_fastqc.html', sample=samples),
        expand('gene_count/{sample}.count', sample=samples)

Snakemake: rename fastQC output in one rule

I'm trying to combine these two rules together
rule fastqc:
    input:
        fastq = "{sample}.fastq.gz",
    output:
        zip1 = "{sample}_fastqc.zip",
        html = "{sample}_fastqc.html",
    threads: 8
    shell:
        "fastqc -t {threads} {input.fastq}"

rule renamefastqc:
    input:
        zip1 = "{sample}_fastqc.zip",
        html = "{sample}_fastqc.html",
    output:
        zip1 = "{sample}__fastqc.zip",
        html = "{sample}__fastqc.html",
    shell:
        "mv {input.zip1} {output.zip1} && "
        "mv {input.html} {output.html} "
So that it looks like this:
rule fastqc:
    input:
        fastq = "{sample}.fastq.gz"
    output:
        zip1 = "{sample}__fastqc.zip",
        html = "{sample}__fastqc.html"
    threads: 8
    shell:
        "fastqc -t {threads} {input.fastq} && "
        "mv {outfile.zip} {output.zip1} && "
        "mv {outfile.html} {output.html}"
FastQC cannot specify file outputs and will always take a file ending in fastq.gz and create two files ending in _fastqc.zip and _fastqc.html. Normally I just write a rule that takes in those outputs and produces the ones with two underscores (the renamefastqc rule). But this means every time I run the pipeline, Snakemake sees that the outputs of the fastqc rule are gone and wants to rebuild them. Therefore I'm trying to combine both rules into one step.
You could use params to define files that are to be renamed.
rule all:
    input:
        "a123__fastqc.zip",

rule fastqc:
    input:
        fastq = "{sample}.fastq.gz",
    output:
        zip1 = "{sample}__fastqc.zip",
        html = "{sample}__fastqc.html",
    threads: 8
    params:
        zip1 = lambda wildcards, output: output.zip1.replace('__', '_'),
        html = lambda wildcards, output: output.html.replace('__', '_')
    shell:
        """
        fastqc -t {threads} {input.fastq}
        mv {params.zip1} {output.zip1} \\
            && mv {params.html} {output.html}
        """