How do I get Snakemake to apply all samples to a single rule, before proceeding to the next rule? - snakemake

On a machine with j cores, given a RuleB that depends on a RuleA, I expect Snakemake to execute my workflow as follows:
RuleA Sample1 using j threads
RuleA Sample2 using j threads
...
RuleA SampleN using j threads
RuleB Sample1 using 1 thread
RuleB Sample2 using 1 thread
...
RuleB SampleN using 1 thread
With RuleB being executed on j samples simultaneously.
Instead the workflow is executed as follows:
RuleA Sample1 using j threads
RuleB Sample1 using 1 thread
RuleA Sample2 using j threads
RuleB Sample2 using 1 thread
...
with RuleB being executed on one sample at a time.
Executed in that order, RuleB can't be parallelised, and the workflow runs much slower than it could.
More specifically, I want to align reads to a genome using STAR and quantify them using RNA-SeQC. RNA-SeQC is single-threaded, while STAR can use multiple threads on a single sample.
This results in Snakemake aligning the reads of sample1 and then quantifying them with RNA-SeQC, after which it proceeds to do the same for sample2. I'd like it to align the reads of all samples first and then quantify them (that way, it could run several instances of the single-threaded RNA-SeQC tool in parallel).
The relevant excerpt from the Snakemake file:
sample_basename = ["RNA-seq_L{}_S{}".format(x, y) for x,y in zip(range(1,41), range(1,41))]
sample_lane = [seq + "_L00{}".format(x) for x in [1, 2] for seq in sample_basename]

rule all:
    input:
        expand("rnaseqc/{s_l}/{s_l}.gene_tpm.gct", s_l=sample_lane)

rule run_star:
    input:
        index_dir=rules.star_index.output.index_dir,
        fq1 = "data/fastq/{sample}_R1_001.fastq.gz",
        fq2 = "data/fastq/{sample}_R2_001.fastq.gz",
    output:
        "star/{sample}/{sample}Aligned.sortedByCoord.out.bam",
        "star/{sample}/{sample}Aligned.toTranscriptome.out.bam",
        "star/{sample}/{sample}ReadsPerGene.out.tab",
        "star/{sample}/{sample}Log.final.out"
    log:
        "logs/star/{sample}.log"
    params:
        extra="--quantMode GeneCounts TranscriptomeSAM --chimSegmentMin 20 --outSAMtype BAM SortedByCoordinate",
        sample_name = "{sample}"
    threads: 18
    script:
        "scripts/star_align.py"

rule rnaseqc:
    input:
        bam="star/{sample}/{sample}Aligned.sortedByCoord.out.bam",
        gtf="data/gencode.v19.annotation.patched.collapsed.gtf"
    output:
        "rnaseqc/{sample}/{sample}.exon_reads.gct",
        "rnaseqc/{sample}/{sample}.gene_fragments.gct",
        "rnaseqc/{sample}/{sample}.gene_reads.gct",
        "rnaseqc/{sample}/{sample}.gene_tpm.gct",
        "rnaseqc/{sample}/{sample}.metrics.tsv"
    params:
        extra="-s {sample} --legacy",
        output_dir="rnaseqc/{sample}"
    log:
        "logs/rnaseqc/{sample}"
    shell:
        "rnaseqc.v2.3.4.linux {params.extra} {input.gtf} {input.bam} {params.output_dir} 2> {log}"
Weirdly enough, doing a dry run with snakemake -np -j does the correct thing:
[Mon Oct 21 13:08:11 2019]
rule run_star:
input: data/STAR/, data/fastq/RNA-seq_L182_S16_L002_R1_001.fastq.gz, data/fastq/RNA-seq_L182_S16_L002_R2_001.fastq.gz
output: star/RNA-seq_L182_S16_L002/RNA-seq_L182_S16_L002Aligned.sortedByCoord.out.bam, star/RNA-seq_L182_S16_L002/RNA-seq_L182_S16_L002Aligned.toTranscriptome.out.bam, star/RNA-seq_L182_S16_L002/RNA-seq_L182_S16_L002ReadsPerGene.out.tab, star/RNA-seq_L182_S16_L002/RNA-seq_L182_S16_L002Log.final.out
log: logs/star/RNA-seq_L182_S16_L002.log
jobid: 1026
wildcards: sample=RNA-seq_L182_S16_L002
threads: 18
[Mon Oct 21 13:08:11 2019]
rule run_star:
input: data/STAR/, data/fastq/RNA-seq_L173_S7_L001_R1_001.fastq.gz, data/fastq/RNA-seq_L173_S7_L001_R2_001.fastq.gz
output: star/RNA-seq_L173_S7_L001/RNA-seq_L173_S7_L001Aligned.sortedByCoord.out.bam, star/RNA-seq_L173_S7_L001/RNA-seq_L173_S7_L001Aligned.toTranscriptome.out.bam, star/RNA-seq_L173_S7_L001/RNA-seq_L173_S7_L001ReadsPerGene.out.tab, star/RNA-seq_L173_S7_L001/RNA-seq_L173_S7_L001Log.final.out
log: logs/star/RNA-seq_L173_S7_L001.log
jobid: 737
wildcards: sample=RNA-seq_L173_S7_L001
threads: 18
...
[Mon Oct 21 13:10:50 2019]
rule rnaseqc:
input: star/RNA-seq_L221_S15_L001/RNA-seq_L221_S15_L001Aligned.sortedByCoord.out.bam, data/gencode.v19.annotation.patched.collapsed.gtf
output: rnaseqc/RNA-seq_L221_S15_L001/RNA-seq_L221_S15_L001.exon_reads.gct, rnaseqc/RNA-seq_L221_S15_L001/RNA-seq_L221_S15_L001.gene_fragments.gct, rnaseqc/RNA-seq_L221_S15_L001/RNA-seq_L221_S15_L001.gene_reads.gct, rnaseqc/RNA-seq_L221_S15_L001/RNA-seq_L221_S15_L001.gene_tpm.gct, rnaseqc/RNA-seq_L221_S15_L001/RNA-seq_L221_S15_L001.metrics.tsv
log: logs/rnaseqc/RNA-seq_L221_S15_L001
jobid: 215
wildcards: sample=RNA-seq_L221_S15_L001
rnaseqc.v2.3.4.linux -s RNA-seq_L221_S15_L001 --legacy data/gencode.v19.annotation.patched.collapsed.gtf star/RNA-seq_L221_S15_L001/RNA-seq_L221_S15_L001Aligned.sortedByCoord.out.bam rnaseqc/RNA-seq_L221_S15_L001 2> logs/rnaseqc/RNA-seq_L221_S15_L001
[Mon Oct 21 13:10:50 2019]
rule rnaseqc:
input: star/RNA-seq_L284_S38_L001/RNA-seq_L284_S38_L001Aligned.sortedByCoord.out.bam, data/gencode.v19.annotation.patched.collapsed.gtf
output: rnaseqc/RNA-seq_L284_S38_L001/RNA-seq_L284_S38_L001.exon_reads.gct, rnaseqc/RNA-seq_L284_S38_L001/RNA-seq_L284_S38_L001.gene_fragments.gct, rnaseqc/RNA-seq_L284_S38_L001/RNA-seq_L284_S38_L001.gene_reads.gct, rnaseqc/RNA-seq_L284_S38_L001/RNA-seq_L284_S38_L001.gene_tpm.gct, rnaseqc/RNA-seq_L284_S38_L001/RNA-seq_L284_S38_L001.metrics.tsv
log: logs/rnaseqc/RNA-seq_L284_S38_L001
jobid: 278
wildcards: sample=RNA-seq_L284_S38_L001
but executing snakemake -j without the -np flag does not.
[Mon Oct 21 13:13:49 2019]
rule run_star:
input: data/STAR/, data/fastq/RNA-seq_L249_S3_L001_R1_001.fastq.gz, data/fastq/RNA-seq_L249_S3_L001_R2_001.fastq.gz
output: star/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001Aligned.sortedByCoord.out.bam, star/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001Aligned.toTranscriptome.out.bam, star/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001ReadsPerGene.out.tab, star/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001Log.final.out
log: logs/star/RNA-seq_L249_S3_L001.log
jobid: 813
wildcards: sample=RNA-seq_L249_S3_L001
threads: 18
Aligning RNA-seq_L249_S3_L001
[Mon Oct 21 13:21:33 2019]
Finished job 813.
2 of 478 steps (0.42%) done
[Mon Oct 21 13:21:33 2019]
rule rnaseqc:
input: star/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001Aligned.sortedByCoord.out.bam, data/gencode.v19.annotation.patched.collapsed.gtf
output: rnaseqc/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001.exon_reads.gct, rnaseqc/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001.gene_fragments.gct, rnaseqc/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001.gene_reads.gct, rnaseqc/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001.gene_tpm.gct, rnaseqc/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001.metrics.tsv
log: logs/rnaseqc/RNA-seq_L249_S3_L001
jobid: 243
wildcards: sample=RNA-seq_L249_S3_L001
I'm using the latest version of Snakemake available through Conda:
5.5.2

Maybe what you are looking for is to give a higher priority to the rule running STAR compared to the rule running rnaseqc. If so, look at the priority directive, like:

rule star:
    priority: 50
    ...

rule rnaseqc:
    priority: 0
    ...

(Not tested) This should first run all the star jobs, one at a time since each needs 18 cores, and then all the rnaseqc jobs in parallel.
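Applied to the rule names from the question, that would look something like the sketch below (untested; only the priority lines are new, everything else stays as in your Snakefile):

rule run_star:
    priority: 50
    # ... input, output, threads: 18 and script: as before ...

rule rnaseqc:
    priority: 0
    # ... input, output, params, log and shell: as before ...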

Related

Visualizing processes created by fork() using directed graph

01 int main(){
02 int a=10;
03 fork();
04 fork();
05 printf("%d ", a);
07 return 0;
08 }
I am looking for an elegant way to visualize the spawned processes with the help of a graph. I came up with an idea, but this does not take into account the order of execution of each of the processes. My idea goes like this:
Process states are represented as ProcessName(InstPointer). InstPointer points to the line number to be executed for a process. A parent process always points to its child process.
Program starts, P is the parent process:
P(1)
Line 3 executes, child process A is spawned:
P(4)
|
|
v
A(4)
Line 4 executes, child processes B, C are spawned:
P(5)
| \
| \
v \
A(5) \
| |
| |
v v
B(5) C(5)
It somewhat helps me visualize the spawned processes, but I am still unsure whether this has any serious pitfalls. If anyone has any better ideas, please feel free to share!
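One way to sketch this programmatically (a minimal, hypothetical example, not from this thread): label each process by the lines at which its ancestors forked and emit a Graphviz DOT graph of who spawned whom. Like the scheme above, it ignores execution order and only records the parent-child relationships.

#!/usr/bin/env python3
# Hypothetical sketch: build the tree of processes spawned by a chain of
# fork() calls and print it as Graphviz DOT. With fork() on lines 3 and 4,
# as in the C snippet above, this yields 4 processes and 3 edges.

fork_lines = [3, 4]  # line numbers that contain a fork() call

def build_fork_tree(fork_lines):
    processes = ["P"]   # 'P' is the original parent process
    edges = []          # (parent, child) pairs
    for line in fork_lines:
        # Every process that exists at this point forks exactly once.
        for parent in list(processes):
            child = "{}-f{}".format(parent, line)  # e.g. 'P-f3', 'P-f3-f4'
            processes.append(child)
            edges.append((parent, child))
    return processes, edges

if __name__ == "__main__":
    _, edges = build_fork_tree(fork_lines)
    print("digraph forks {")
    for parent, child in edges:
        print('  "{}" -> "{}";'.format(parent, child))
    print("}")

Piping the output through Graphviz (for example, python fork_tree.py | dot -Tpng -o forks.png) would render the picture.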

Input function for technical / biological replicates in snakemake

I'm currently trying to write a Snakemake workflow that automatically checks, via a sample.tsv file, whether a given sample is a biological or technical replicate, and then, at some point in the workflow, uses a rule to merge technical/biological replicates.
My tsv file looks like this:
|sample | unit_bio | unit_tech | fq1 | fq2 |
|----------|----------|-----------|-----|-----|
| bCalAnn1 | 1 | 1 | /home/assembly_downstream/data/arima_HiC/bCalAnn1_1_1_R1.fastq.gz | /home/assembly_downstream/data/arima_HiC/bCalAnn1_1_1_R2.fastq.gz |
| bCalAnn1 | 1 | 2 | /home/assembly_downstream/data/arima_HiC/bCalAnn1_1_2_R1.fastq.gz | /home/assembly_downstream/data/arima_HiC/bCalAnn1_1_2_R2.fastq.gz |
| bCalAnn2 | 1 | 1 | /home/assembly_downstream/data/arima_HiC/bCalAnn2_1_1_R1.fastq.gz | /home/assembly_downstream/data/arima_HiC/bCalAnn2_1_1_R2.fastq.gz |
| bCalAnn2 | 1 | 2 | /home/assembly_downstream/data/arima_HiC/bCalAnn2_1_2_R1.fastq.gz | /home/assembly_downstream/data/arima_HiC/bCalAnn2_1_2_R2.fastq.gz |
| bCalAnn2 | 2 | 1 | /home/assembly_downstream/data/arima_HiC/bCalAnn2_2_1_R1.fastq.gz | /home/assembly_downstream/data/arima_HiC/bCalAnn2_2_1_R2.fastq.gz |
| bCalAnn2 | 3 | 1 | /home/assembly_downstream/data/arima_HiC/bCalAnn2_3_1_R1.fastq.gz | /home/assembly_downstream/data/arima_HiC/bCalAnn2_3_1_R2.fastq.gz |
My Pipeline looks like this:
import pandas as pd
import os
import yaml

configfile: "config.yaml"

samples = pd.read_table(config["samples"], dtype=str)

rule all:
    input:
        expand(config["arima_mapping"] + "final/{sample}_{unit_bio}_{unit_tech}.bam", zip,
               sample=samples["sample"], unit_bio=samples["unit_bio"], unit_tech=samples["unit_tech"])

..
some rules
..

rule add_read_groups:
    input:
        config["arima_mapping"] + "paired/{sample}_{unit_bio}_{unit_tech}.bam"
    output:
        config["arima_mapping"] + "paired_read_groups/{sample}_{unit_bio}_{unit_tech}.bam"
    params:
        platform = "ILLUMINA",
        sampleName = "{sample}",
        library = "{sample}",
        platform_unit = "None"
    conda:
        "../envs/arima_mapping.yaml"
    log:
        config["logs"] + "arima_mapping/paired_read_groups/{sample}_{unit_bio}_{unit_tech}.log"
    shell:
        "picard AddOrReplaceReadGroups I={input} O={output} SM={params.sampleName} LB={params.library} PU={params.platform_unit} PL={params.platform} 2> {log}"

rule merge_tech_repl:
    input:
        config["arima_mapping"] + "paired_read_groups/{sample}_{unit_bio}_{unit_tech}.bam"
    output:
        config["arima_mapping"] + "merge_tech_repl/{sample}_{unit_bio}_{unit_tech}.bam"
    params:
        val_string = "SILENT"
    conda:
        "../envs/arima_mapping.yaml"
    log:
        config["logs"] + "arima_mapping/merged_tech_repl/{sample}_{unit_bio}_{unit_tech}.log"
    threads:
        2  # uses at most 2
    shell:
        "picard MergeSamFiles -I {input} -O {output} --ASSUME_SORTED true --USE_THREADING true --VALIDATION_STRINGENCY {params.val_string} 2> {log}"

rule mark_duplicates:
    input:
        config["arima_mapping"] + "merge_tech_repl/{sample}_{unit_bio}_{unit_tech}.bam" if config["tech_repl"] else config["arima_mapping"] + "paired_read_groups/{sample}_{unit_bio}_{unit_tech}.bam"
    output:
        bam = config["arima_mapping"] + "final/{sample}_{unit_bio}_{unit_tech}.bam",
        metric = config["arima_mapping"] + "final/metric_{sample}_{unit_bio}_{unit_tech}.txt"
    #params:
    conda:
        "../envs/arima_mapping.yaml"
    log:
        config["logs"] + "arima_mapping/mark_duplicates/{sample}_{unit_bio}_{unit_tech}.log"
    shell:
        "picard MarkDuplicates I={input} O={output.bam} M={output.metric} 2> {log}"
At the moment I have set a boolean in a config file that tells the mark_duplicates rule whether to take its input from the add_read_groups or the merge_tech_repl rule. This is of course not optimal, since some samples may have replicates (in any number) while others don't. Therefore I want logic that checks the tsv table for rows whose sample name and unit_bio number are identical while the unit_tech number differs (and later, analogously, for biological replicates), merges those samples, and lets samples without replicates skip the merging rule.
EDIT
For clarification, since I think I explained my goal confusingly.
My first attempt looks like this; I want "i" to be flexible, in case the number of replicates changes. I don't think my input function returns all matching replicates together, but rather gives them one by one, which is not what I want. I'm also unsure how I would handle samples that have no replicates, since they would have to skip this rule somehow.
def input_function(wildcards):
    return expand("{sample}_{unit_bio}_{i}.bam",
                  sample = wildcards.sample,
                  unit_bio = wildcards.unit_bio,
                  i = samples["sample"].str.count(wildcards.sample))

rule tech_duplicate_check:
    input:
        input_function  # (that returns a list of 2-n duplicates, where n could be different for each sample)
    output:
        "{sample}_{unit_bio}.bam"
    shell:
        "MergeTechDupl_tool {input}"  # input is a list
Therefore I want to create a syntax that checks the tsv table if a given sample name and unit_bio number are identical while the unit_tech number is different (and later analog to this for biological replicates), thus merging these specific samples while nonduplicate samples skip the merging rule.
rule gather_techdups_of_a_biodup:
    output: "{sample}/{unit_bio}"
    input: gather_techdups_of_a_biodup_input_fn
    shell: "true"  # Fill this in

rule gather_biodips_of_a_techdup:
    output: "{sample}/{unit_tech}"
    input: gather_biodips_of_a_techdup_input_fn
    shell: "true"  # Fill this in
After some attempts, the main problem I struggle with is the table checking. As far as I know, Snakemake takes templates as input and finds all samples that match them. But I would need to check the table for every sample that shares (e.g., for technical replicates) the sample name and the unit_bio number, take all those samples, and give them as input to the first rule run. Then I would have to take the next sample that was not already part of a previous run, to prevent merging the same samples multiple times.
The logic you describe here can be implemented in the gather_techdups_of_a_biodup_input_fn and gather_biodips_of_a_techdup_input_fn functions above. For example, read your sample TSV file with pandas, filter for wildcards.sample and wildcards.unit_bio (or wildcards.unit_tech), then extract columns fq1 and fq2 from the filtered data frame.
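A minimal sketch (not from the thread) of what gather_techdups_of_a_biodup_input_fn could look like, assuming the samples data frame loaded at the top of the Snakefile above, and assuming the merge rule consumes the per-replicate BAMs written by add_read_groups; substitute fq1/fq2 or whatever files your rule actually needs:

def gather_techdups_of_a_biodup_input_fn(wildcards):
    # All technical replicates of this sample and biological unit:
    # same sample name, same unit_bio, any unit_tech.
    rows = samples[(samples["sample"] == wildcards.sample) &
                   (samples["unit_bio"] == wildcards.unit_bio)]
    # One path per technical replicate; a sample without technical
    # replicates simply yields a single-element list.
    return expand(config["arima_mapping"] + "paired_read_groups/{sample}_{unit_bio}_{unit_tech}.bam",
                  sample=wildcards.sample,
                  unit_bio=wildcards.unit_bio,
                  unit_tech=rows["unit_tech"])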

snakemake rule error. How can I select rule order?

I'm running Snakemake for an RNA-seq analysis.
I wrote a Snakefile, and an error occurred in the terminal.
I put the rule salmon_quant_reads last, but it runs first.
So Snakemake reports the error in rule salmon_quant_reads.
salmon_quant_reads must run after salmon_index has finished.
Error in rule salmon_quant_reads:
jobid: 173
output: salmon/WT_Veh_11/quant.sf, salmon/WT_Veh_11/lib_format_counts.json
log: logs/salmon/WT_Veh_11.log (check log file(s) for error message)
conda-env: /home/baelab2/LEEJUNEYOUNG/7.Colesevelam/RNA-seq/.snakemake/conda/ff908de630224c1a4118f5dc69c8a761
RuleException:
CalledProcessError in line 111 of /home/baelab2/LEEJUNEYOUNG/7.Colesevelam/RNA-seq/Snakefile_2:
Command 'source /home/baelab2/miniconda3/bin/activate '/home/baelab2/LEEJUNEYOUNG/7.Colesevelam/RNA-seq/.snakemake/conda/ff908de630224c1a4118f5dc69c8a761'; set -euo pipefail; /home/baelab2/miniconda3/envs/snakemake/bin/python3.10 /home/baelab2/LEEJUNEYOUNG/7.Colesevelam/RNA-seq/.snakemake/scripts/tmpr6r8ryk9.wrapper.py' returned non-zero exit status 1.
File "/home/baelab2/LEEJUNEYOUNG/7.Colesevelam/RNA-seq/Snakefile_2", line 111, in __rule_salmon_quant_reads
File "/home/baelab2/miniconda3/envs/snakemake/lib/python3.10/concurrent/futures/thread.py", line 58, in run
How can I fix it?
Here is my Snakefile:
SAMPLES = ["KO_Col_5", "KO_Col_6", "KO_Col_7", "KO_Col_8", "KO_Col_9", "KO_Col_10", "KO_Col_11", "KO_Col_15", "KO_Veh_3", "KO_Veh_4", "KO_Veh_5", "KO_Veh_9", "KO_Veh_11", "KO_Veh_13", "KO_Veh_14", "WT_Col_1", "WT_Col_2", "WT_Col_3", "WT_Col_6", "WT_Col_8", "WT_Col_10", "WT_Col_12", "WT_Veh_1", "WT_Veh_2", "WT_Veh_4", "WT_Veh_7", "WT_Veh_8", "WT_Veh_11", "WT_Veh_14"]
rule all:
    input:
        expand("raw/{sample}_1.fastq.gz", sample=SAMPLES),
        expand("raw/{sample}_2.fastq.gz", sample=SAMPLES),
        expand("qc/fastqc/{sample}_1.before.trim_fastqc.zip", sample=SAMPLES),
        expand("qc/fastqc/{sample}_2.before.trim_fastqc.zip", sample=SAMPLES),
        expand("trimmed/{sample}_1.fastq.gz", sample=SAMPLES),
        expand("trimmed/{sample}_2.fastq.gz", sample=SAMPLES),
        expand("qc/fastqc/{sample}_1.after.trim_fastqc.zip", sample=SAMPLES),
        expand("qc/fastqc/{sample}_2.after.trim_fastqc.zip", sample=SAMPLES),
        expand("salmon/{sample}/quant.sf", sample=SAMPLES),
        expand("salmon/{sample}/lib_format_counts.json", sample=SAMPLES)

rule fastqc_before_trim_1:
    input:
        "raw/{sample}.fastq.gz",
    output:
        html="qc/fastqc/{sample}.before.trim.html",
        zip="qc/fastqc/{sample}.before.trim_fastqc.zip",
    log:
        "logs/fastqc/{sample}.before.log"
    threads: 10
    priority: 1
    wrapper:
        "v1.7.0/bio/fastqc"

rule cutadapt:
    input:
        r1 = "raw/{sample}_1.fastq.gz",
        r2 = "raw/{sample}_2.fastq.gz"
    output:
        fastq1="trimmed/{sample}_1.fastq.gz",
        fastq2="trimmed/{sample}_2.fastq.gz",
        qc="trimmed/{sample}.qc.txt"
    params:
        adapters = "-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT",
        extra = "--minimum-length 1 -q 20"
    log:
        "logs/cutadapt/{sample}.log"
    threads: 10
    priority: 2
    wrapper:
        "v1.7.0/bio/cutadapt/pe"

rule fastqc_after_trim_2:
    input:
        "trimmed/{sample}.fastq.gz"
    output:
        html="qc/fastqc/{sample}.after.trim.html",
        zip="qc/fastqc/{sample}.after.trim_fastqc.zip"
    log:
        "logs/fastqc/{sample}.after.log"
    threads: 10
    priority: 3
    wrapper:
        "v1.7.0/bio/fastqc"

rule salmon_index:
    input:
        sequences="raw/Mus_musculus.GRCm39.cdna.all.fasta"
    output:
        multiext(
            "salmon/transcriptome_index/",
            "complete_ref_lens.bin",
            "ctable.bin",
            "ctg_offsets.bin",
            "duplicate_clusters.tsv",
            "info.json",
            "mphf.bin",
            "pos.bin",
            "pre_indexing.log",
            "rank.bin",
            "refAccumLengths.bin",
            "ref_indexing.log",
            "reflengths.bin",
            "refseq.bin",
            "seq.bin",
            "versionInfo.json",
        ),
    log:
        "logs/salmon/transcriptome_index.log",
    threads: 10
    priority: 10
    params:
        # optional parameters
        extra="",
    wrapper:
        "v1.7.0/bio/salmon/index"

rule salmon_quant_reads:
    input:
        # If you have multiple fastq files for a single sample (e.g. technical replicates)
        # use a list for r1 and r2.
        r1 = "trimmed/{sample}_1.fastq.gz",
        r2 = "trimmed/{sample}_2.fastq.gz",
        index = "salmon/transcriptome_index"
    output:
        quant = "salmon/{sample}/quant.sf",
        lib = "salmon/{sample}/lib_format_counts.json"
    log:
        "logs/salmon/{sample}.log"
    params:
        # optional parameters
        libtype ="A",
        extra="--validateMappings"
    threads: 10
    priority: 20
    wrapper:
        "v1.7.0/bio/salmon/quant"
The only link between salmon_quant_reads and salmon_index is the directory salmon/transcriptome_index. However, the creation of that directory is not a sufficient signal that all the work in salmon_index has been completed. So, a quick way to fix this is to include an explicit file:
rule salmon_quant_reads:
    input:
        # If you have multiple fastq files for a single sample (e.g. technical replicates)
        # use a list for r1 and r2.
        r1 = "trimmed/{sample}_1.fastq.gz",
        r2 = "trimmed/{sample}_2.fastq.gz",
        index = "salmon/transcriptome_index",
        _temp_dependency = "salmon/transcriptome_index/info.json",
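A variation on the same idea (an untested sketch, not from the original answer) is to depend on the index rule's declared outputs directly, so the extra input stays in sync if the output list of salmon_index ever changes:

rule salmon_quant_reads:
    input:
        r1 = "trimmed/{sample}_1.fastq.gz",
        r2 = "trimmed/{sample}_2.fastq.gz",
        index = "salmon/transcriptome_index",
        # extra input whose only purpose is to make this rule wait until
        # salmon_index has written all of its declared output files
        index_files = rules.salmon_index.output,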

Nextflow: structured inputs with files

I have an array of structured data similar to:
- name: foobar
  sex: male
  fastqs:
    - r1: /path/to/foobar_R1.fastq.gz
      r2: /path/to/foobar_R2.fastq.gz
    - r1: /path/to/more/foobar_R1.fastq.gz
      r2: /path/to/more/foobar_R2.fastq.gz
- name: bazquux
  sex: female
  fastqs:
    - r1: /path/to/bazquux_R1.fastq.gz
      r2: /path/to/bazquux_R2.fastq.gz
Note that fastqs come in pairs, and the number of pairs per "sample" may be variable.
I want to write a process in nextflow that processes one sample at a time.
In order for the Nextflow executor to properly marshal the files, they must somehow be typed as path (or file). Thus typed, the executor will copy the files to the compute node for processing. Simply typing the file paths as val will treat the paths as strings, and no files will be copied.
A trivial example of a path input from the docs:
process foo {
    input:
    path x from '/some/data/file.txt'

    """
    your_command --in $x
    """
}
How should I go about declaring the process input so that the files are properly marshaled to the compute node? So far I haven't found any examples in the docs for how to handle structured inputs.
Your structured data looks a lot like YAML. If you can include a top-level object so that your file looks something like this:
samples:
  - name: foobar
    sex: male
    fastqs:
      - r1: ./path/to/foobar_R1.fastq.gz
        r2: ./path/to/foobar_R2.fastq.gz
      - r1: ./path/to/more/foobar_R1.fastq.gz
        r2: ./path/to/more/foobar_R2.fastq.gz
  - name: bazquux
    sex: female
    fastqs:
      - r1: ./path/to/bazquux_R1.fastq.gz
        r2: ./path/to/bazquux_R2.fastq.gz
Then, we can use Nextflow's -params-file option to load the params when we run our workflow. We can access the top-level object from the params, which gives us a list that we can use to create a Channel using the fromList factory method. The following example uses the new DSL 2:
process test_proc {

    tag { sample_name }

    debug true

    stageInMode 'rellink'

    input:
    tuple val(sample_name), val(sex), path(fastqs)

    """
    echo "${sample_name},${sex}:"
    ls -g *.fastq.gz
    """
}

workflow {

    Channel.fromList( params.samples )
        | flatMap { rec ->
            rec.fastqs.collect { rg ->
                readgroup = tuple( file(rg.r1), file(rg.r2) )
                tuple( rec.name, rec.sex, readgroup )
            }
        }
        | test_proc
}
Results:
$ mkdir -p ./path/to/more
$ touch ./path/to/foobar_R{1,2}.fastq.gz
$ touch ./path/to/more/foobar_R{1,2}.fastq.gz
$ touch ./path/to/bazquux_R{1,2}.fastq.gz
$ nextflow run main.nf -params-file file.yaml
N E X T F L O W ~ version 22.04.0
Launching `main.nf` [desperate_colden] DSL2 - revision: 391a9a3b3a
executor > local (3)
[ed/61c5c3] process > test_proc (foobar) [100%] 3 of 3 ✔
foobar,male:
lrwxrwxrwx 1 users 35 Oct 14 13:56 foobar_R1.fastq.gz -> ../../../path/to/foobar_R1.fastq.gz
lrwxrwxrwx 1 users 35 Oct 14 13:56 foobar_R2.fastq.gz -> ../../../path/to/foobar_R2.fastq.gz
bazquux,female:
lrwxrwxrwx 1 users 36 Oct 14 13:56 bazquux_R1.fastq.gz -> ../../../path/to/bazquux_R1.fastq.gz
lrwxrwxrwx 1 users 36 Oct 14 13:56 bazquux_R2.fastq.gz -> ../../../path/to/bazquux_R2.fastq.gz
foobar,male:
lrwxrwxrwx 1 users 40 Oct 14 13:56 foobar_R1.fastq.gz -> ../../../path/to/more/foobar_R1.fastq.gz
lrwxrwxrwx 1 users 40 Oct 14 13:56 foobar_R2.fastq.gz -> ../../../path/to/more/foobar_R2.fastq.gz
As requested, here's a solution that runs per sample. The problem we have is that we cannot simply feed in a list of lists using the path qualifier (since an ArrayList is not a valid path value). We could flatten() the list of file pairs, but this makes it difficult to access each of the file pairs if we need them. You may not necessarily need the file pair relationship but assuming you do, I think the right solution is to feed the R1 and R2 files in separately (i.e. using a path qualifier for R1 and another path qualifier for R2). The following example introspects the instance type to (re-)create the list of readgroups. We can use the stageAs option to localize the files into progressively indexed subdirectories, since some files in the YAML have identical names.
process test_proc {

    tag { sample_name }

    debug true

    stageInMode 'rellink'

    input:
    tuple val(sample_name), val(sex), path(r1, stageAs:'*/*'), path(r2, stageAs:'*/*')

    script:
    if( [r1, r2].every { it instanceof List } )
        readgroups = [r1, r2].transpose()
    else if( [r1, r2].every { it instanceof Path } )
        readgroups = [[r1, r2], ]
    else
        error "Invalid readgroup configuration"

    read_pairs = readgroups.collect { r1, r2 -> "${r1},${r2}" }

    """
    echo "${sample_name},${sex}:"
    echo ${read_pairs.join(' ')}
    ls -g */*.fastq.gz
    """
}

workflow {

    Channel.fromList( params.samples )
        | map { rec ->
            def r1 = rec.fastqs.r1.collect { file(it) }
            def r2 = rec.fastqs.r2.collect { file(it) }
            tuple( rec.name, rec.sex, r1, r2 )
        }
        | test_proc
}
Results:
$ nextflow run main.nf -params-file file.yaml
N E X T F L O W ~ version 22.04.0
Launching `main.nf` [berserk_sanger] DSL2 - revision: 2f317a8cee
executor > local (2)
[93/6345c9] process > test_proc (bazquux) [100%] 2 of 2 ✔
foobar,male:
1/foobar_R1.fastq.gz,1/foobar_R2.fastq.gz 2/foobar_R1.fastq.gz,2/foobar_R2.fastq.gz
lrwxrwxrwx 1 users 38 Oct 19 13:43 1/foobar_R1.fastq.gz -> ../../../../path/to/foobar_R1.fastq.gz
lrwxrwxrwx 1 users 38 Oct 19 13:43 1/foobar_R2.fastq.gz -> ../../../../path/to/foobar_R2.fastq.gz
lrwxrwxrwx 1 users 43 Oct 19 13:43 2/foobar_R1.fastq.gz -> ../../../../path/to/more/foobar_R1.fastq.gz
lrwxrwxrwx 1 users 43 Oct 19 13:43 2/foobar_R2.fastq.gz -> ../../../../path/to/more/foobar_R2.fastq.gz
bazquux,female:
1/bazquux_R1.fastq.gz,1/bazquux_R2.fastq.gz
lrwxrwxrwx 1 users 39 Oct 19 13:43 1/bazquux_R1.fastq.gz -> ../../../../path/to/bazquux_R1.fastq.gz
lrwxrwxrwx 1 users 39 Oct 19 13:43 1/bazquux_R2.fastq.gz -> ../../../../path/to/bazquux_R2.fastq.gz

Aerospike connection errors

We run Aerospike server 3.5.15-1 on Ubuntu 14.04 and are periodically getting connection errors from PHP clients ([-1] Unable to connect to server). The PHP client version is 3.4.1. We run PHP 5.3 clients from a separate server node; connections are created from php-fpm.
There are no corresponding errors in the server logs, and the server didn't have to be restarted, so the problem seems to be on the client side.
The application creates up to 400 simultaneous connections to Aerospike. We use an r3.xlarge EC2 instance, and the server has plenty of available resources.
We followed the Aerospike tuning documentation and tried updating proto-fd and the recommended OS parameters on the server, but it didn't help:
proto-fd-max 100000
proto-fd-idle-ms 15000
That's how we initialize and use Aerospike:
$opts = array(Aerospike::OPT_CONNECT_TIMEOUT => 1250,Aerospike::OPT_WRITE_TIMEOUT => 5000);
$this->db = new Aerospike($config, false, $opts);
//set key
$aero_key = $this->db->initKey($this->keyspace, $this->table, $key);
$aero_value = array("value" => $value);
$status = $this->db->put($aero_key, $aero_value, $ttl, $options);
//get key
$aero_key = $this->db->initKey($this->keyspace, $this->table, $key);
$status = $this->db->get($aero_key, $result);
Aerospike server stats before the disconnect:
Aug 27 2015 19:32:50 GMT: INFO (info): (thr_info.c::4828) trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (237, 16073516, 16073279) : hb (0, 0, 0) : fab (16, 16, 0)
Aug 27 2015 19:33:00 GMT: INFO (info): (thr_info.c::4828) trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (334, 16076516, 16076182) : hb (0, 0, 0) : fab (16, 16, 0)
Aug 27 2015 19:33:10 GMT: INFO (info): (thr_info.c::4828) trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 1 ::: dq 0 : fds - proto (288, 16079478, 16079190) : hb (0, 0, 0) : fab (16, 16, 0)
Aug 27 2015 19:33:20 GMT: INFO (info): (thr_info.c::4828) trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (131, 16082477, 16082346) : hb (0, 0, 0) : fab (16, 16, 0)
Aug 27 2015 19:33:30 GMT: INFO (info): (thr_info.c::4828) trans_in_progress: wr 0 prox 0 wait 0 ::: q 0 ::: bq 0 ::: iq 0 ::: dq 0 : fds - proto (348, 16084665, 16084317) : hb (0, 0, 0)
From the log segment, we can see that there are around 300 client connections open on the node at any one time, well under the 100000 limit in proto-fd-max.
If you are using multicast for heartbeats (and I think you are), the heartbeats of 0 are fine.
I expect that you have already looked at this, but are you able to check network connectivity between the client and server at the time of the failure? I know that under normal conditions, the client and the server happily coexist, but at the time of the failure, do you see any basic connectivity problems?
Do you happen to have other applications installed on the client machine? Do they have any similar failures, possibly at the time of the Aerospike client problems?
Do you have the client installed on more than one server? Do you maybe only see the connectivity errors on one of the servers?
I know you have already been looking at this, so I apologize if I am covering topics that you have already reviewed.
Thank you for your time,
-DM