Nextflow DSL2 output from different processes mixed up as input in later processes - nextflow

I have a DSL2 Nextflow pipeline that branches out to 2 FILTER processes. Then in the CONCAT process, I reuse the two previous process outputs as input. Also in the SUMMARY process, I reuse previous process outputs as input.
I am finding that when I run the pipeline with 2 or more pairs of fastq samples, the inputs get mixed up.
For example, at the CONCAT step, I end up concatenating the bwa_2_ch output of one pair of fastq samples with the filter_1_ch output of another pair of fastq samples, instead of combining outputs that share the same pair_id.
I believe I am not writing the workflow { } channels and inputs correctly. I want the workflow to run through the steps without mixing samples, but I am not sure how to define the inputs so that there is no mix-up.
//trimmomatic read trimming
process TRIM {
tag "trim ${pair_id}"
publishDir "${params.outdir}/$pair_id/trim_results"
input:
tuple val(pair_id), path(reads)
output:
tuple val(pair_id), path("trimmed_${pair_id}_...")
script:
"""
"""
}
//bwa alignment
process BWA_1 {
tag "align-1 ${pair_id}"
publishDir "${params.outdir}/$pair_id/..."
input:
tuple val(pair_id), path(reads)
path index
output:
tuple val(pair_id), path("${pair_id}_...")
script:
"""
"""
}
process FILTER_1 {
tag "filter ${pair_id}"
publishDir "${params.outdir}/$pair_id/filter_results"
input:
tuple val(pair_id), path(reads)
output:
tuple val(pair_id),
path("${pair_id}_...")
script:
"""
"""
}
process FILTER_2 {
tag "filter ${pair_id}"
publishDir "${params.outdir}/$pair_id/filter_results"
input:
tuple val(pair_id), path(reads)
output:
tuple val(pair_id),
path("${pair_id}_...")
script:
"""
"""
}
//bwa alignment
process BWA_2 {
tag "align-2 ${pair_id}"
publishDir "${params.outdir}/$pair_id/bwa_2_results"
input:
tuple val(pair_id), path(reads)
path index
output:
tuple val(pair_id), path("${pair_id}_...")
script:
"""
"""
}
//concatenate pf and non_human reads
process CONCAT{
tag "concat ${pair_id}"
publishDir "${params.outdir}/$pair_id"
input:
tuple val(pair_id), path(program_reads)
tuple val(pair_id), path(pf_reads)
output:
tuple val(pair_id), path("${pair_id}_...")
script:
"""
"""
}
//summary
process SUMMARY{
tag "summary ${pair_id}"
publishDir "${params.outdir}/$pair_id"
input:
tuple val(pair_id), path(trim_reads)
tuple val(pair_id), path(non_human_reads)
output:
file("summary_${pair_id}.csv")
script:
"""
"""
}
workflow {
Channel
.fromFilePairs(params.reads, checkIfExists: true)
.set {read_pairs_ch}
// trim reads
trim_ch = TRIM(read_pairs_ch)
// map to pf genome
bwa_1_ch = BWA_1(trim_ch, params.pf_index)
// filter mapped reads
filter_1_ch = FILTER_1(bwa_1_ch)
filter_2_ch = FILTER_2(bwa_1_ch)
// map to pf and human genome
bwa_2_ch = BWA_2(filter_2_ch, params.index)
// concatenate non human reads
concat_ch = CONCAT(bwa_2_ch,filter_1_ch)
// summarize
summary_ch = SUMMARY(trim_ch,concat_ch)
}

Mix-ups like this usually occur when a process erroneously receives two or more queue channels. Most of the time, what you want is one queue channel and one or more value channels when you require multiple input channels. Here, I'm not sure exactly what pair_id would be bound to, but it likely won't be what you expect:
input:
tuple val(pair_id), path(program_reads)
tuple val(pair_id), path(pf_reads)
What you want to do is replace the above with:
input:
tuple val(pair_id), path(program_reads), path(pf_reads)
And then use the join operator to create the required inputs. For example:
workflow {
Channel
.fromFilePairs( params.reads, checkIfExists: true )
.set { read_pairs_ch }
pf_index = file( params.pf_index )
bwa_index = file( params.bwa_index )
// trim reads
trim_ch = TRIM( read_pairs_ch )
// map to pf genome
bwa_1_ch = BWA_1( trim_ch, pf_index)
// filter mapped reads
filter_1_ch = FILTER_1(bwa_1_ch)
filter_2_ch = FILTER_2(bwa_1_ch)
// map to pf and human genome
bwa_2_ch = BWA_2(filter_2_ch, bwa_index)
// concatenate non human reads
concat_ch = bwa_2_ch \
| join( filter_1_ch ) \
| CONCAT
// summarize
summary_ch = trim_ch \
| join( concat_ch ) \
| SUMMARY
}
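To see why joining on pair_id matters, here is a small Python sketch (not Nextflow code) that models the two channels as lists of (pair_id, file) tuples. The sample and file names are made up for illustration, and both helper functions are hypothetical stand-ins: pair_by_position mimics two queue channels being consumed in arrival order, while join_by_key mimics matching on the first tuple element, which is what the join operator does by default.

```python
from typing import List, Tuple

Item = Tuple[str, str]  # (pair_id, file name); names are illustrative only


def pair_by_position(left: List[Item], right: List[Item]) -> List[Tuple[str, str, str]]:
    """Combine items in arrival order, ignoring the pair_id of the right side."""
    return [(l_id, l_file, r_file) for (l_id, l_file), (_, r_file) in zip(left, right)]


def join_by_key(left: List[Item], right: List[Item]) -> List[Tuple[str, str, str]]:
    """Combine items that share the same pair_id, whatever order they arrived in."""
    right_by_id = dict(right)
    return [
        (pair_id, l_file, right_by_id[pair_id])
        for pair_id, l_file in left
        if pair_id in right_by_id
    ]


# Tasks finish in an unpredictable order, so the two channels may be ordered differently.
bwa_2 = [("sampleB", "sampleB.bwa2.bam"), ("sampleA", "sampleA.bwa2.bam")]
filter_1 = [("sampleA", "sampleA.filter1.bam"), ("sampleB", "sampleB.filter1.bam")]

mixed = pair_by_position(bwa_2, filter_1)   # sampleB is paired with sampleA's file
joined = join_by_key(bwa_2, filter_1)       # every tuple holds files from one sample
```

The joined tuples have the shape (pair_id, program_reads, pf_reads), which is the single-tuple input that the rewritten CONCAT process declares.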

Related

Nextflow report an error : No such variable : from

I'm trying to learn Nextflow but it's not going very well.
I am building an analysis workflow with Nextflow that goes from fastq files to vcf files, using NGS paired-end sequencing data. However, I got stuck right at the beginning, as shown in the code. The first process, soapnuke, works fine, but when passing the files from the channels (clean_fq1 and clean_fq2) to the next process I get an error: No such variable: from (the original post showed it in a screenshot). What should I do? Thanks for any help.
params.fq1 = "/data/mPCR/220213_I7_V350055104_L3_SZPVL22000812-81/*1.fq.gz"
params.fq2 = "/data/mPCR/220213_I7_V350055104_L3_SZPVL22000812-81/*2.fq.gz"
params.index = "/home/duxu/project/data/index.list"
params.primer = "/home/duxu/project/data/primer_*.fasta"
params.output='results'
fq1 = Channel.fromPath(params.fq1)
fq2 = Channel.fromPath(params.fq2)
index = Channel.fromPath(params.index)
primer = Channel.fromPath(params.primer)
process soapnuke{
conda 'soapnuke'
tag{"soapnuk ${fq1} ${fq2}"}
publishDir "${params.outdir}/SOAPnuke", mode: 'copy'
input:
file rawfq1 from fq1
file rawfq2 from fq2
output:
file 'clean1.fastq.gz' into clean_fq1
file 'clean2.fastq.gz' into clean_fq2
script:
"""
SOAPnuke filter -1 $rawfq1 -2 $rawfq2 -l 12 -q 0.5 -Q 2 -o . \
-C clean1.fastq.gz -D clean2.fastq.gz
"""
}
I get stuck on this:
process barcode_splitter{
conda 'barcode_splitter'
tag{"barcode_splitter ${fq1} ${fq2}"}
publishDir "${params.outdir}/barcode_splitter", mode: 'copy'
input:
file split1 from clean_fq1
file split2 from clean_fq2
index from params.index
output:
file '*-read-1.fastq.gz' into trimmed_index1
file '*-read-2.fastq.gz' into trimmed_index2
script:
"""
barcode_splitter --bcfile $index $split1 $split2 --idxread 1 2 --mismatches 1 --suffix .fastq --gzipout
"""
}
The code below will produce the error you see:
index = Channel.fromPath( params.index )
process barcode_splitter {
...
input:
index from params.index
...
}
What you want is:
index = file( params.index )
process barcode_splitter {
...
input:
path index
...
}
Note that when the file input name is the same as the channel name, the from channel declaration can be omitted. I also used the path qualifier above, as it should be preferred over the file qualifier when using Nextflow 19.10.0 or later.
You may also want to consider refactoring to use the fromFilePairs factory method. Here's one way, untested of course:
params.reads = "/data/mPCR/220213_I7_V350055104_L3_SZPVL22000812-81/*_{1,2}.fq.gz"
params.index = "/home/duxu/project/data/index.list"
params.outdir = 'results'
reads_ch = Channel.fromFilePairs( params.reads )
index = file( params.index )
process soapnuke {
tag { sample }
publishDir "${params.outdir}/SOAPnuke", mode: 'copy'
conda 'soapnuke'
input:
tuple val(sample), path(reads) from reads_ch
output:
tuple val(sample), path('clean{1,2}.fastq.gz') into clean_reads_ch
script:
def (rawfq1, rawfq2) = reads
"""
SOAPnuke filter \\
-1 "${rawfq1}" \\
-2 "${rawfq2}" \\
-l 12 \\
-q 0.5 \\
-Q 2 \\
-o . \\
-C "clean1.fastq.gz" \\
-D "clean2.fastq.gz"
"""
}
process barcode_splitter {
tag { sample }
publishDir "${params.outdir}/barcode_splitter", mode: 'copy'
conda 'barcode_splitter'
input:
tuple val(sample), path(reads) from clean_reads_ch
path index
output:
tuple val(sample), path('*-read-{1,2}.fastq.gz') into trimmed_index
script:
def (splitfq1, splitfq2) = reads
"""
barcode_splitter \\
--bcfile \\
"${index}" \\
"${splitfq1}" \\
"${splitfq2}" \\
--idxread 1 2 \\
--mismatches 1 \\
--suffix ".fastq" \\
--gzipout
"""
}
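As a rough illustration of what fromFilePairs emits for a glob such as *_{1,2}.fq.gz, the Python sketch below groups file names by their shared prefix and yields (sample, [read1, read2]) tuples. It only approximates the grouping idea, since Nextflow's own matching rules are more general. The function name group_read_pairs and the example file names are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Matches names like "s1_1.fq.gz": the prefix is the sample, the digit is the mate.
PAIR_PATTERN = re.compile(r"^(?P<sample>.+)_(?P<mate>[12])\.fq\.gz$")


def group_read_pairs(file_names: Iterable[str]) -> List[Tuple[str, List[str]]]:
    """Group paired-end file names into (sample, [read1, read2]) tuples."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in sorted(file_names):  # sorting puts mate 1 before mate 2
        match = PAIR_PATTERN.match(name)
        if match:
            groups[match.group("sample")].append(name)
    # Keep only complete pairs, in a stable order.
    return [(sample, files) for sample, files in sorted(groups.items()) if len(files) == 2]


pairs = group_read_pairs(
    ["s2_2.fq.gz", "s1_1.fq.gz", "s2_1.fq.gz", "s1_2.fq.gz", "notes.txt"]
)
```

Because each tuple carries its sample key together with both reads, the two mates travel through the pipeline as one item instead of through two independent channels.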

Dynamic Branching/Plumbing

Is it possible to use Dynamic Branching/Plumbing in a snakefile?
I wish to perform the following:
A -> B -> D
or
A -> C -> D
Depending on whether a config variable is true.
For example:
*(rules.B if config["deblur"] == True else rules.C),
In this instance it runs both rules B and C.
I have tried
if config["deblur"] == True:
    rules.B,
else:
    rules.C,
But this gives me a syntax error.
In the next rule the input is as follows.
input:
    qiime_feature_table_input = rules.qiime_deblur.output.qiime_deblur_table if config["deblur"] == "True" else rules.qiime_denoise.output.qiime_denoise_table
Thanks for your help!
Since the value of the configuration variable is known before runtime, there's no need for dynamic modification of the DAG in this case. Here's a simple snakefile that will run rules a -> b -> d if config_var is true and rules a -> c -> d if config_var is false:
config_var = True
rule all:
    input:
        "d/out.txt",
rule a:
    output:
        "a/a.txt",
    shell:
        """
        echo 'a' > '{output}'
        """
rule b:
    input:
        rules.a.output,
    output:
        "b/b.txt",
    shell:
        """
        echo 'b' > '{output}'
        """
rule c:
    input:
        rules.a.output,
    output:
        "c/c.txt",
    shell:
        """
        echo 'c' > '{output}'
        """
rule d:
    input:
        rules.b.output if config_var else rules.c.output,
    output:
        "d/out.txt",
    shell:
        """
        cat '{input}' > '{output}'
        """
Not sure if this applies to your case, but one option could be to have these two rules produce the same file (it could be a dummy file), but define only one rule at a time with a conditional. Here's a rough pseudocode:
config_var = True
rule all:
    input: 'test.txt'
if config_var:
    rule B:
        output: 'test.txt'
else:
    rule C:
        output: 'test.txt'
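One detail worth checking in the question is that the flag is compared with True in one place and with the string "True" in another. If the config comes from a YAML file, a value written as true is loaded as a Python boolean, so comparing it with the string "True" gives False. Below is a minimal sketch of normalizing the flag once, in plain Python as it could appear at the top of a Snakefile. The helper names as_bool and pick_upstream are hypothetical, and the returned strings merely stand in for the rule outputs.

```python
def as_bool(value) -> bool:
    """Interpret booleans and common string spellings of true/false."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in {"true", "1", "yes"}


def pick_upstream(config: dict) -> str:
    """Return the name of the upstream table to use (stand-in strings)."""
    if as_bool(config.get("deblur", False)):
        return "qiime_deblur_table"
    return "qiime_denoise_table"


from_yaml_bool = pick_upstream({"deblur": True})     # boolean, as YAML would load it
from_string = pick_upstream({"deblur": "True"})      # string spelling
flag_off = pick_upstream({"deblur": "false"})        # any spelling of false
```

With the flag normalized in one place, every conditional in the Snakefile can test the same boolean instead of repeating slightly different comparisons.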

combine two lists to PCollection

I'm using Apache Beam. When writing to tfRecord I need to include the ID of the item along with its text and embedding.
The tutorial works with just one list of text but I also have a list of the IDs to match the list of text so I was wondering how I could pass the ID to the following function:
def to_tf_example(entries):
    examples = []
    text_list, embedding_list = entries
    for i in range(len(text_list)):
        text = text_list[i]
        embedding = embedding_list[i]
        features = {
            # need to pass in ID here like so:
            'id': tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[ids.encode('utf-8')])),
            'text': tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[text.encode('utf-8')])),
            'embedding': tf.train.Feature(
                float_list=tf.train.FloatList(value=embedding.tolist()))
        }
        example = tf.train.Example(
            features=tf.train.Features(
                feature=features)).SerializeToString(deterministic=True)
        examples.append(example)
    return examples
My first thought was just to include the ids in the text column of my database and then extract them via slicing or regex or something but was wondering if there was a better way, I assume converting to a PCollection but don't know where to start. Here is the pipeline:
with beam.Pipeline(args.runner, options=options) as pipeline:
    # parentheses let the expression continue across lines
    query_data = (
        pipeline
        | 'Read data from BigQuery' >> beam.io.Read(
            beam.io.BigQuerySource(
                project='my-project',
                query=get_data(args.limit),
                use_standard_sql=True)))
    # list of texts
    text = query_data | 'get list of text' >> beam.Map(lambda x: x['text'])
    # list of ids
    ids = query_data | 'get list of ids' >> beam.Map(lambda x: x['id'])
    (text
     | 'Batch elements' >> util.BatchElements(
         min_batch_size=args.batch_size, max_batch_size=args.batch_size)
     | 'Generate embeddings' >> beam.Map(
         generate_embeddings, args.module_url, args.random_projection_matrix)
     | 'Encode to tf example' >> beam.FlatMap(to_tf_example)
     | 'Write to TFRecords files' >> beam.io.WriteToTFRecord(
         file_path_prefix='{0}'.format(args.output_dir),
         file_name_suffix='.tfrecords')
    )
    query_data | 'Convert to entity and write to datastore' >> beam.Map(
        lambda input_features: create_entity(
            input_features, args.kind))
I altered generate_embeddings to return List[int], List[string], List[List[float]] and then used the following function to pass the list of ids and text in:
def generate_embeddings_for_batch(batch, module_url, random_projection_matrix):
    embeddings = generate_embeddings([x['id'] for x in batch], [x['text'] for x in batch], module_url, random_projection_matrix)
    return embeddings
Here I'll assume generate_embeddings has the signature List[str], ... -> (List[str], List[List[float]])
What you want to do is avoid splitting your texts and ids into separate PCollections. So you might want to write something like
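from typing import Iterator, List, Tuple

def generate_embeddings_for_batch(
        batch,
        module_url,
        random_projection_matrix) -> Iterator[Tuple[int, str, List[float]]]:
    # generate_embeddings is assumed to return (texts, embeddings), as stated above
    texts, embeddings = generate_embeddings(
        [x['text'] for x in batch], module_url, random_projection_matrix)
    text_to_embedding = dict(zip(texts, embeddings))
    # each row of the batch keeps its own id next to its text and embedding
    for x in batch:
        yield x['id'], x['text'], text_to_embedding[x['text']]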
From there you should be able to write to_tf_example.
It would probably make sense to look at using TFX.
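The idea of keeping each id next to its text through the batching step can be tried without Beam or TensorFlow. The sketch below is an illustration only: fake_generate_embeddings is a hypothetical stand-in for the real embedding function (it returns the texts plus a tiny made-up vector per text), and the rows mimic the dictionaries coming out of the BigQuery read.

```python
from typing import Any, Dict, Iterator, List, Sequence, Tuple

Row = Dict[str, Any]  # e.g. {"id": 7, "text": "hello world"}


def fake_generate_embeddings(texts: Sequence[str]) -> Tuple[List[str], List[List[float]]]:
    """Hypothetical stand-in: one two-number 'embedding' per input text."""
    return list(texts), [[float(len(t)), float(t.count(" "))] for t in texts]


def embed_batch(batch: Sequence[Row]) -> Iterator[Tuple[int, str, List[float]]]:
    """Yield (id, text, embedding) so the id never leaves its text."""
    texts, embeddings = fake_generate_embeddings([row["text"] for row in batch])
    text_to_embedding = dict(zip(texts, embeddings))
    for row in batch:
        yield row["id"], row["text"], text_to_embedding[row["text"]]


batch = [{"id": 7, "text": "hello world"}, {"id": 3, "text": "beam"}]
records = list(embed_batch(batch))
```

In the pipeline this would mean batching the rows themselves rather than the extracted text, so that each element reaching to_tf_example already carries its id.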

Snakemake: output file name seems to require a static path portion

I'm finding that the name of the output file per rule seems to need a static portion, e.g. "data/{wildcard}_data.csv" vs. "{wildcard}_data.csv"
For example, the script below returns the following error on dryrun:
Building DAG of jobs...
MissingInputException in line 12 of /home/rebecca/workflows/exploring_tools/affymetrix_preprocess/snakemake/Snakefile:
Missing input files for rule getDatFiles:
GSE4290
Script:
rule all:
    input: expand("{geoid}_datout.scaled.expr.csv", geoid = config['geoid'], out_dir = config['out_dir'])
    benchmark: "benchmark.csv"
rule getDatFiles:
    input: "{geoid}"
    output: temp("{geoid}_datFiles.RData")
    shell:
        "Rscript scripts/getDatFiles.R"
rule maskProbes:
    input: "{geoid}_datFiles.RData"
    output: temp("{geoid}_datFiles.masked.RData")
    params:
        probeFilterFxn = lambda x: config['probeFilterFxn'],
        minProbeNumber = lambda x: config['minProbeNumber'],
        probeSingle = lambda x: config['probeSingle']
    script: "scripts/maskProbes.R"
rule runExpresso:
    input: "{geoid}_datFiles.masked.RData"
    output: temp("{geoid}_datout.RData")
    params:
        bgcorrect_method = lambda x: config['bgcorrect_method'],
        normalize = lambda x: config['normalize'],
        pmcorrect_method = lambda x: config['pmcorrect_method'],
        summary_method = lambda x: config['summary_method']
    script: "scripts/runExpresso.R"
rule scaleData:
    input: "{geoid}_datout.RData"
    output: temp("{geoid}_datout.scaled.RData")
    params: sc = lambda x: config['sc']
    script: "scripts/scaleData.R"
rule getExpr:
    input: "{geoid}_datout.scaled.RData"
    output: temp("{geoid}_datout.scaled.expr.csv")
    script: "scripts/getExpr.R"
... while the following script runs without error (the only difference being "output/" in front of the file names):
rule all:
    input: expand("output/{geoid}_datout.scaled.expr.csv", geoid = config['geoid'], out_dir = config['out_dir'])
    benchmark: "output/benchmark.csv"
rule getDatFiles:
    input: "output/{geoid}"
    output: temp("output/{geoid}_datFiles.RData")
    shell:
        "Rscript scripts/getDatFiles.R"
rule maskProbes:
    input: "output/{geoid}_datFiles.RData"
    output: temp("output/{geoid}_datFiles.masked.RData")
    params:
        probeFilterFxn = lambda x: config['probeFilterFxn'],
        minProbeNumber = lambda x: config['minProbeNumber'],
        probeSingle = lambda x: config['probeSingle']
    script: "scripts/maskProbes.R"
rule runExpresso:
    input: "output/{geoid}_datFiles.masked.RData"
    output: temp("output/{geoid}_datout.RData")
    params:
        bgcorrect_method = lambda x: config['bgcorrect_method'],
        normalize = lambda x: config['normalize'],
        pmcorrect_method = lambda x: config['pmcorrect_method'],
        summary_method = lambda x: config['summary_method']
    script: "scripts/runExpresso.R"
rule scaleData:
    input: "output/{geoid}_datout.RData"
    output: temp("output/{geoid}_datout.scaled.RData")
    params: sc = lambda x: config['sc']
    script: "scripts/scaleData.R"
rule getExpr:
    input: "output/{geoid}_datout.scaled.RData"
    output: temp("output/{geoid}_datout.scaled.expr.csv")
    script: "scripts/getExpr.R"
I'm having a hard time understanding why this might be happening. Ultimately, I'd like to write workflows that are as flexible as possible, and ideally, that entails making the output directory variable.
Any insight would be much appreciated.
You have:
rule getDatFiles:
    input: "{geoid}"
which means there should be a file in the current directory named just {geoid}, e.g. ./GSE4290. I suspect what you want is:
rule getDatFiles:
    input: "data/{geoid}_data.csv"
    ...
input: "output/{geoid}" may work because a file named output/GSE4290 already exists, created elsewhere.
(I haven't looked at the rest of the scripts.)
Are you running them in the same directory?
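The point about input: "{geoid}" can be reproduced in plain Python. The sketch below is not Snakemake's actual resolution code. It only fills the wildcard into the pattern and checks whether the resulting path exists, which is roughly the check behind MissingInputException. The helper name resolve_input and the temporary directory layout are assumptions made for the illustration.

```python
import os
import tempfile


def resolve_input(pattern: str, workdir: str, **wildcards: str) -> str:
    """Fill the wildcards into the pattern and return the path inside workdir."""
    return os.path.join(workdir, pattern.format(**wildcards))


with tempfile.TemporaryDirectory() as workdir:
    # Simulate a working directory where only output/GSE4290 exists.
    os.makedirs(os.path.join(workdir, "output"))
    with open(os.path.join(workdir, "output", "GSE4290"), "w") as handle:
        handle.write("placeholder\n")

    bare = resolve_input("{geoid}", workdir, geoid="GSE4290")
    prefixed = resolve_input("output/{geoid}", workdir, geoid="GSE4290")

    bare_exists = os.path.exists(bare)          # no ./GSE4290, so this input is missing
    prefixed_exists = os.path.exists(prefixed)  # output/GSE4290 is there, so this one resolves
```

Under these assumptions the "static portion" itself is not what matters. What matters is whether the resolved path exists or can be produced by another rule.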

Snakemake: rename fastQC output in one rule

I'm trying to combine these two rules together
rule fastqc:
    input:
        fastq = "{sample}.fastq.gz",
    output:
        zip1 = "{sample}_fastqc.zip",
        html = "{sample}_fastqc.html",
    threads: 8
    shell:
        "fastqc -t {threads} {input.fastq}"
rule renamefastqc:
    input:
        zip1 = "{sample}_fastqc.zip",
        html = "{sample}_fastqc.html",
    output:
        zip1 = "{sample}__fastqc.zip",
        html = "{sample}__fastqc.html",
    shell:
        "mv {input.zip1} {output.zip1} && "
        "mv {input.html} {output.html} "
To look like this.
rule fastqc:
    input:
        fastq = "{sample}.fastq.gz"
    output:
        zip1 = "{sample}__fastqc.zip",
        html = "{sample}__fastqc.html"
    threads: 8
    shell:
        "fastqc -t {threads} {input.fastq} && "
        "mv {outfile.zip} {output.zip1} && "
        "mv {outfile.html} {output.html}"
FastQC does not let you specify output file names: it always takes a file ending in fastq.gz and creates two files ending in _fastqc.zip and _fastqc.html. Normally I just write a rule that takes those outputs and produces the ones with two underscores (the renamefastqc rule). But this means that every time I run the pipeline, Snakemake sees that the outputs of the fastqc rule are gone and wants to rebuild them. Therefore I'm trying to combine both rules into one step.
You could use params to define files that are to be renamed.
rule all:
    input:
        "a123__fastqc.zip",
rule fastqc:
    input:
        fastq = "{sample}.fastq.gz",
    output:
        zip1 = "{sample}__fastqc.zip",
        html = "{sample}__fastqc.html",
    threads: 8
    params:
        zip1 = lambda wildcards, output: output.zip1.replace('__', '_'),
        html = lambda wildcards, output: output.html.replace('__', '_')
    shell:
        """
        fastqc -t {threads} {input.fastq}
        mv {params.zip1} {output.zip1} \\
        && mv {params.html} {output.html}
        """