Nextflow input how to declare tuple in tuple - input

I am working with a nextflow workflow that, at a certain stage, groups a series of files by their sample id using groupTuple(), and resulting in a channel that looks like this:
[sample_id, [file_A, file_B, ... , file_N]]
[sample_id, [file_A, file_B, ... , file_N]]
...
[sample_id, [file_A, file_B, ... , file_N]]
Note that this is the same channel structure that you get from .fromFilePairs().
I want to use these channel items in a process in such a way that, for each item, the process reads the sample_id from the first field and all the files from the inner tuple at once.
The nextflow documentation is somewhat cryptic about this, and it is hard to find how to declare this type of input in a channel, so I thought I'd create a question on stack overflow and then answer it myself for anyone who will ever be looking for this answer.
How does one declare the inner tuple in the input section of a nextflow process?

In the example given above, my inner tuple contains items of only one type (files). I can therefore pass the whole second term of the tuple (i.e. the inner tuple) as a single input item under the file() qualifier. Like this:
input:
tuple \
val(sample_id), \
file(inner_tuple) \
from Input_channel
This will ensure that the tuple content is read as file (one by one), the same way as performing .collect() on a channel of files, in the sense that all files will then be available in the nextflow temp directory where the process is executed.

The question is how you come up with sample_id, but in case they just have different file extensions you might use something like this:
all_files = Channel.fromPath("/path/to/your/files/*")
all_files.map { it -> [it.simpleName, it] }
.groupTuple()
.set { grouped_files }

The path qualifier (previously the file qualifier) can be used to stage a single (file) value or a collection of (file) values into the process execution directory. The note at the bottom of the multiple input files section in the docs also mentions:
The normal file input constructs introduced in the input of files
section are valid for collections of multiple files as well.
This means, you can use a script variable, e.g.:
input:
tuple val(sample_id), path(my_files)
In which case, the variable will hold the list of files (preserving the original filenames). You could use it directly to refer to all of the files in the list, or, you could access specific (file) elements (if you need them) using square bracket (slice) notation.
This is the syntax you will want most of the time. However, if you need predicable filenames or if you need to deal with files with the identical filenames, you may need a different approach:
Alternatively, you could specify a target filename, e.g.:
input:
tuple val(sample_id), path('my_file')
In the case where a single file is received by the process, the file would be staged with the target filename. However, when a collection of files is received by the process, the filename will be appended with a numerical suffix representing its ordinal position in the list. For example:
process test {
tag { sample_id }
debug true
stageInMode 'rellink'
input:
tuple val(sample_id), path('fastq')
"""
echo "${sample_id}:"
ls -g --time-style=+"" fastq*
"""
}
workflow {
readgroups = Channel.fromFilePairs( '*_{1,2}.fastq' )
test( readgroups )
}
Results:
$ touch {foo,bar,baz}_{1,2}.fastq
$ nextflow run .
N E X T F L O W ~ version 22.04.4
Launching `./main.nf` [scruffy_caravaggio] DSL2 - revision: 87a80d6d50
executor > local (3)
[65/66f860] process > test (bar) [100%] 3 of 3 ✔
baz:
lrwxrwxrwx 1 users 20 fastq1 -> ../../../baz_1.fastq
lrwxrwxrwx 1 users 20 fastq2 -> ../../../baz_2.fastq
foo:
lrwxrwxrwx 1 users 20 fastq1 -> ../../../foo_1.fastq
lrwxrwxrwx 1 users 20 fastq2 -> ../../../foo_2.fastq
bar:
lrwxrwxrwx 1 users 20 fastq1 -> ../../../bar_1.fastq
lrwxrwxrwx 1 users 20 fastq2 -> ../../../bar_2.fastq
Note that the names of staged files can be controlled using the * and ? wildcards. See the links above for a table that shows how the wildcards are replaced depending on the cardinality of the input collection.

Related

Nextflow: add unique ID, hash, or row number to tuple

ch_files = Channel.fromPath("myfiles/*.csv")
ch_parameters = Channel.from(['A','B, 'C', 'D'])
ch_samplesize = Channel.from([4, 16, 128])
process makeGrid {
input:
path input_file from ch_files
each parameter from ch_parameters
each samplesize from ch_samplesize
output:
tuple path(input_file), parameter, samplesize, path("config_file.ini") into settings_grid
"""
echo "parameter=$parameter;sampleSize=$samplesize" > config_file.ini
"""
}
gives me a number_of_files * 4 * 3 grid of settings files, so I can run some script for each combination of parameters and input files.
How do I add some ID to each line of this grid? A row ID would be OK, but I would even prefer some unique 6-digit alphanumeric code without a "meaning" because the order in the table doesn't matter. I could extract out the last part of the working folder which is seemingly unique per process; but I don't think it is ideal to rely on sed and $PWD for this, and I didn't see it provided as a runtime metadata variable provider. (plus it's a bit long but OK). In a former setup I had a job ID from the LSF cluster system for this purpose, but I want this to be portable.
Every combination is not guaranteed to be unique (e.g. having parameter 'A' twice in the input channel should be valid).
To be clear, I would like this output
file1.csv A 4 pathto/config.ini 1ac5r
file1.csv A 16 pathto/config.ini 7zfge
file1.csv A 128 pathto/config.ini ztgg4
file2.csv A 4 pathto/config.ini 123js
etc.
Given the input declaration, which uses the each qualifier as an input repeater, it will be difficult to append some unique id to the grid without some refactoring to use either the combine or cross operators. If the inputs are just files or simple values (like in your example code), refactoring doesn't make much sense.
To get a unique code, the simple options are:
Like you mentioned, there's no way, unfortunately, to access the unique task hash without some hack to parse $PWD. Although, it might be possible to use BASH parameter substitution to avoid sed/awk/cut (assuming BASH is your shell of course...) you could try using: "${PWD##*/}"
You might instead prefer using ${task.index}, which is a unique index within the same task. Although the task index is not guaranteed to be unique across executions, it should be sufficient in most cases. It can also be formatted for example:
process example {
...
script:
def idx = String.format("%06d", task.index)
"""
echo "${idx}"
"""
}
Alternatively, create your own UUID. You might be able to take the first N characters but this will of course decrease the likelihood of the IDs being unique (not that there was any guarantee of that anyway). This might not really matter though for a small finite set of inputs:
process example {
...
script:
def uuid = UUID.randomUUID().toString()
"""
echo "${uuid}"
echo "${uuid.take(6)}"
echo "${uuid.takeBefore('-')}"
"""
}

How to merge multiple grouped output files in nextflow

I have a process that outputs multiple files as tuple. Like this:
[chr1,[[chr1.chunk1.bgen],[chr1.chunk1.stat],[chr1.chunk2.bgen],[chr1.chunk2.stat],[chr1.chunk3.bgen],[chr1.chunk3.stat]]]
How could I get chr1.merged.bgen and chr1.merged.stat . I want to use cat to merge all these chunks.
I tried:
input:
tuple val (chrom), file('*.bgen'),file('*.stat') from my_output
"""
cat "${chrom}.${*.bgen}" > "${chrom}.merged.bgen"
cat "${chrom}.${*.stat}" > "${chrom}.merged.stat"
"""
But got " Input tuple does not match input set cardinality decalred
Also for:
input:
tuple val (chrom), path(bgen),path(stat) from my_output
"""
cat "${bgen}" > "${chrom}.merged.bgen"
cat "${stat}" > "${chrom}.merged.stat"
"""
Same error.
I also tried to use my_output.collect() and my_output.toList() But getting same error.
Any help?
In your example process, you defined 3 input variables, but you only provide 2. That's what the error message is trying to tell you - in fact I think this is only a warning.
The way your process' input is defined, nextflow would expect it to be in this form:
[chr1,[chr1.chunk1.bgen, chr1.chunk2.bgen, chr1.chunk3.bgen],[chr1.chunk1.stat, chr1.chunk2.stat, chr1.chunk3.stat]]
So one option would be to reformat your channel before you feed it into that process. Like this for example:
my_output
.map { it -> [it[0],
it.flatten().findAll{ it.getExtension=="bgen"},
it.flatten().findAll{ it.getExtension=="stat"}
]
}

Snakemake variable number of files

I'm in a situation, where I would like to scatter my workflow into a variable number of chunks, which I don't know beforehand. Maybe it is easiest to explain the problem by being concrete:
Someone has handed me FASTQ files demultiplexed using bcl2fastq with the no-lane-splitting option. I would like to split these files according to lane, map each lane individually, and then finally gather everything again. However, I don't know the number of lanes beforehand.
Ideally, I would like a solution like this,
rule split_fastq_file: (...) # results in N FASTQ files
rule map_fastq_file: (...) # do this N times
rule merge_bam_files: (...) # merge the N BAM files
but I am not sure this is possbile. The expand function requires me to know the number of lanes, and can't see how it would be possible to use wildcards for this, either.
I should say that I am rather new to Snakemake, and that I may have complete misunderstood how Snakemake works. It has taken me some time to get used to think about things "upside-down" by focusing on output files and then working backwards.
One option is to use checkpoint when splitting the fastqs, so that you can dynamically re-evaluate the DAG at a later point to get the resulting lanes.
Here's an MWE step by step:
Setup and make an example fastq file.
# Requires Python 3.6+ for f-strings, Snakemake 5.4+ for checkpoints
import pathlib
import random
random.seed(1)
rule make_fastq:
output:
fastq = touch("input/{sample}.fastq")
Create a random number of lanes between 1 and 9 each with random identifier from 1 to 9. Note that we declare this as a checkpoint, rather than a rule, so that we can later access the result. Also, we declare the output here as a directory specific to the sample, so that we can later glob in it to get the lanes that were created.
checkpoint split_fastq:
input:
fastq = rules.make_fastq.output.fastq
output:
lane_dir = directory("temp/split_fastq/{sample}")
run:
pathlib.Path(output.lane_dir).mkdir(exist_ok=True)
n_lanes = random.randrange(1, 10)-
lane_numbers = random.sample(range(1, 10), k = n_lanes)
for lane_number in lane_numbers:
path = pathlib.Path(output.lane_dir) / f"L00{lane_number}.fastq"
path.touch()
Do some intermediate processing.
rule map_fastq:
input:
fastq = "temp/split_fastq/{sample}/L00{lane_number}.fastq"
output:
bam = "temp/map_fastq/{sample}/L00{lane_number}.bam"
run:
bam = pathlib.Path(output.bam)
bam.parent.mkdir(exist_ok=True)
bam.touch()
To merge all the processed files, we use an input function to access the lanes that were created in split_fastq, so that we can do a dynamic expand on these. We do the expand on the last rule in the chain of intermediate processing steps, in this case map_fastq, so that we ask for the correct inputs.
def get_bams(wildcards):
lane_dir = checkpoints.split_fastq.get(**wildcards).output[0]
lane_numbers = glob_wildcards(f"{lane_dir}/L00{{lane_number}}.fastq").lane_number
bams = expand(rules.map_fastq.output.bam, **wildcards, lane_number=lane_numbers)
return bams
This input function now gives us easy access to the bam files we wish to merge, however many there are, and whatever they may be called.
rule merge_bam:
input:
get_bams
output:
bam = "temp/merge_bam/{sample}.bam"
shell:
"cat {input} > {output.bam}"
This example runs, and with random.seed(1) happens to create three lanes (l001, l002, and l005).
If you don't want to use checkpoint, I think you could achieve something similar by creating an input function for merge_bam that opens up the original input fastq, scans the read names for lane info, and predicts what the input files ought to be. This seems less robust, however.

Aggregate undetermined number of files for all wildcards in one rule

I have a set of files which will be individually processed to produce multiple files. Exactly how many files is unknown before runtime. (If it matters, this is demultiplexing DNA sequencing results.) I then have a script which takes all of these files at once.
Right now I have something like this:
checkpoint demultiplex:
input: "{sample}.fastq"
output: directory("{sample}")
shell:
# in reality the number of output files is not known
"mkdir -p {output} &&"
"touch {output}/{wildcards.sample}-1.fastq &&"
"touch {output}/{wildcards.sample}-2.fastq &&"
"touch {output}/{wildcards.sample}-3.fastq"
def find_outputs(wildcards) :
outdir = checkpoints.demultiplex.get(**wildcards)
return glob.glob("{sample}/{sample}-*.fastq".format_map(wildcards))
rule analysis:
input: find_outputs
outputs: "results.txt"
script: "scripts/do_analysis.R"
This obviously doesn't work, because the values of {sample} (Assume they should be A, B, C, D) are never defined.
As I was writing the question, I came up with this answer, which seems to work. However, if you have something cleaner, I would be happy to accept it!
For checkpoints.<rule>.get() to work its magic, it has to be in the body of a function which is given as a reference, not called. Also, this function needs to take one argument, wildcards.
So we make a function that returns closures having the behavior we need. The value of wildcards (which will be empty in this case) is ignored, allowing us to specify the values manually.
def find_outputs(sample):
def f(wildcards):
checkpoints.demultiplex.get(sample = sample)
return glob.glob("{sample}/{sample}-*.fastq".format(sample = sample))
return f
rule analysis:
input:
find_outputs("A"),
find_outputs("B"),
find_outputs("C"),
find_outputs("D")
output: "results.txt"
script: "script/do_analysis.R"

IDL batch processing: fully automatic input selection

I need to process MODIS ocean level 2 data and I obtained an external plugin for ENVI https://github.com/dawhite/EPOC/releases. Now, I want to batch process hundreds of images for which I modified the code like the following code. The code is running fine, but I have to select the input file every time. Can anyone please help me to make the program fully automatic? I really appreciate and thanks a lot for your help!
Pro OCL2convert
dir = 'C:\MODIS\'
CD, dir
; batch processing of level 2 ocean chlorophyll data
files=file_search('*.L2_LAC_OC.x.hdf', count=numfiles)
; this command will search for all files in the directory which end with
; the specified one
counter=0
; this is a counter that tells IDL which file is being read-starts at 0
While (counter LT numfiles) Do begin
; this command tells IDL to start a loop and to only finish when the counter
; is equal to the number of files with the name specified
name=files(counter)
openr, 1, name
proj = envi_proj_create(/utm, zone=40, datum='WGS-84')
ps = [1000.0d,1000.0d]
no_bowtie = 0 ;same as not setting the keyword
no_msg = 1 ;same as setting the keyword
;OUTPUT CHOICES
;0 -> standard product only
;1 -> georeferenced product only
;2 -> standard and georeferenced products
output_choice = 2
;RETURNED VALUES
;r_fid -> ENVI FID for the standard product, if requested
;georef_fid -> ENVI FID for the georeferenced product, if requested
convert_oc_l2_data, fname=fname, output_path=output_path, $
proj=proj, ps=ps, output_choice=output_choice, r_fid=r_fid, $
georef_fid=georef_fid, no_bowtie=no_bowtie, no_msg=no_msg
print,'done!'
close, 1
counter=counter+1
Endwhile
End
Not knowing what convert_oc_l2_data does (it appears to be a program you created, there is no public documentation for it), I would say that the problem might be that the out_path variable is not defined in the rest of your program.