Nextflow: add unique ID, hash, or row number to tuple - nextflow

ch_files = Channel.fromPath("myfiles/*.csv")
ch_parameters = Channel.from(['A','B, 'C', 'D'])
ch_samplesize = Channel.from([4, 16, 128])
process makeGrid {
input:
path input_file from ch_files
each parameter from ch_parameters
each samplesize from ch_samplesize
output:
tuple path(input_file), parameter, samplesize, path("config_file.ini") into settings_grid
"""
echo "parameter=$parameter;sampleSize=$samplesize" > config_file.ini
"""
}
gives me a number_of_files * 4 * 3 grid of settings files, so I can run some script for each combination of parameters and input files.
How do I add some ID to each line of this grid? A row ID would be OK, but I would even prefer some unique 6-digit alphanumeric code without a "meaning" because the order in the table doesn't matter. I could extract out the last part of the working folder which is seemingly unique per process; but I don't think it is ideal to rely on sed and $PWD for this, and I didn't see it provided as a runtime metadata variable provider. (plus it's a bit long but OK). In a former setup I had a job ID from the LSF cluster system for this purpose, but I want this to be portable.
Every combination is not guaranteed to be unique (e.g. having parameter 'A' twice in the input channel should be valid).
To be clear, I would like this output
file1.csv A 4 pathto/config.ini 1ac5r
file1.csv A 16 pathto/config.ini 7zfge
file1.csv A 128 pathto/config.ini ztgg4
file2.csv A 4 pathto/config.ini 123js
etc.

Given the input declaration, which uses the each qualifier as an input repeater, it will be difficult to append some unique id to the grid without some refactoring to use either the combine or cross operators. If the inputs are just files or simple values (like in your example code), refactoring doesn't make much sense.
To get a unique code, the simple options are:
Like you mentioned, there's no way, unfortunately, to access the unique task hash without some hack to parse $PWD. Although, it might be possible to use BASH parameter substitution to avoid sed/awk/cut (assuming BASH is your shell of course...) you could try using: "${PWD##*/}"
You might instead prefer using ${task.index}, which is a unique index within the same task. Although the task index is not guaranteed to be unique across executions, it should be sufficient in most cases. It can also be formatted for example:
process example {
...
script:
def idx = String.format("%06d", task.index)
"""
echo "${idx}"
"""
}
Alternatively, create your own UUID. You might be able to take the first N characters but this will of course decrease the likelihood of the IDs being unique (not that there was any guarantee of that anyway). This might not really matter though for a small finite set of inputs:
process example {
...
script:
def uuid = UUID.randomUUID().toString()
"""
echo "${uuid}"
echo "${uuid.take(6)}"
echo "${uuid.takeBefore('-')}"
"""
}

Related

Nextflow input how to declare tuple in tuple

I am working with a nextflow workflow that, at a certain stage, groups a series of files by their sample id using groupTuple(), and resulting in a channel that looks like this:
[sample_id, [file_A, file_B, ... , file_N]]
[sample_id, [file_A, file_B, ... , file_N]]
...
[sample_id, [file_A, file_B, ... , file_N]]
Note that this is the same channel structure that you get from .fromFilePairs().
I want to use these channel items in a process in such a way that, for each item, the process reads the sample_id from the first field and all the files from the inner tuple at once.
The nextflow documentation is somewhat cryptic about this, and it is hard to find how to declare this type of input in a channel, so I thought I'd create a question on stack overflow and then answer it myself for anyone who will ever be looking for this answer.
How does one declare the inner tuple in the input section of a nextflow process?
In the example given above, my inner tuple contains items of only one type (files). I can therefore pass the whole second term of the tuple (i.e. the inner tuple) as a single input item under the file() qualifier. Like this:
input:
tuple \
val(sample_id), \
file(inner_tuple) \
from Input_channel
This will ensure that the tuple content is read as file (one by one), the same way as performing .collect() on a channel of files, in the sense that all files will then be available in the nextflow temp directory where the process is executed.
The question is how you come up with sample_id, but in case they just have different file extensions you might use something like this:
all_files = Channel.fromPath("/path/to/your/files/*")
all_files.map { it -> [it.simpleName, it] }
.groupTuple()
.set { grouped_files }
The path qualifier (previously the file qualifier) can be used to stage a single (file) value or a collection of (file) values into the process execution directory. The note at the bottom of the multiple input files section in the docs also mentions:
The normal file input constructs introduced in the input of files
section are valid for collections of multiple files as well.
This means, you can use a script variable, e.g.:
input:
tuple val(sample_id), path(my_files)
In which case, the variable will hold the list of files (preserving the original filenames). You could use it directly to refer to all of the files in the list, or, you could access specific (file) elements (if you need them) using square bracket (slice) notation.
This is the syntax you will want most of the time. However, if you need predicable filenames or if you need to deal with files with the identical filenames, you may need a different approach:
Alternatively, you could specify a target filename, e.g.:
input:
tuple val(sample_id), path('my_file')
In the case where a single file is received by the process, the file would be staged with the target filename. However, when a collection of files is received by the process, the filename will be appended with a numerical suffix representing its ordinal position in the list. For example:
process test {
tag { sample_id }
debug true
stageInMode 'rellink'
input:
tuple val(sample_id), path('fastq')
"""
echo "${sample_id}:"
ls -g --time-style=+"" fastq*
"""
}
workflow {
readgroups = Channel.fromFilePairs( '*_{1,2}.fastq' )
test( readgroups )
}
Results:
$ touch {foo,bar,baz}_{1,2}.fastq
$ nextflow run .
N E X T F L O W ~ version 22.04.4
Launching `./main.nf` [scruffy_caravaggio] DSL2 - revision: 87a80d6d50
executor > local (3)
[65/66f860] process > test (bar) [100%] 3 of 3 ✔
baz:
lrwxrwxrwx 1 users 20 fastq1 -> ../../../baz_1.fastq
lrwxrwxrwx 1 users 20 fastq2 -> ../../../baz_2.fastq
foo:
lrwxrwxrwx 1 users 20 fastq1 -> ../../../foo_1.fastq
lrwxrwxrwx 1 users 20 fastq2 -> ../../../foo_2.fastq
bar:
lrwxrwxrwx 1 users 20 fastq1 -> ../../../bar_1.fastq
lrwxrwxrwx 1 users 20 fastq2 -> ../../../bar_2.fastq
Note that the names of staged files can be controlled using the * and ? wildcards. See the links above for a table that shows how the wildcards are replaced depending on the cardinality of the input collection.

Snakemake variable number of files

I'm in a situation, where I would like to scatter my workflow into a variable number of chunks, which I don't know beforehand. Maybe it is easiest to explain the problem by being concrete:
Someone has handed me FASTQ files demultiplexed using bcl2fastq with the no-lane-splitting option. I would like to split these files according to lane, map each lane individually, and then finally gather everything again. However, I don't know the number of lanes beforehand.
Ideally, I would like a solution like this,
rule split_fastq_file: (...) # results in N FASTQ files
rule map_fastq_file: (...) # do this N times
rule merge_bam_files: (...) # merge the N BAM files
but I am not sure this is possbile. The expand function requires me to know the number of lanes, and can't see how it would be possible to use wildcards for this, either.
I should say that I am rather new to Snakemake, and that I may have complete misunderstood how Snakemake works. It has taken me some time to get used to think about things "upside-down" by focusing on output files and then working backwards.
One option is to use checkpoint when splitting the fastqs, so that you can dynamically re-evaluate the DAG at a later point to get the resulting lanes.
Here's an MWE step by step:
Setup and make an example fastq file.
# Requires Python 3.6+ for f-strings, Snakemake 5.4+ for checkpoints
import pathlib
import random
random.seed(1)
rule make_fastq:
output:
fastq = touch("input/{sample}.fastq")
Create a random number of lanes between 1 and 9 each with random identifier from 1 to 9. Note that we declare this as a checkpoint, rather than a rule, so that we can later access the result. Also, we declare the output here as a directory specific to the sample, so that we can later glob in it to get the lanes that were created.
checkpoint split_fastq:
input:
fastq = rules.make_fastq.output.fastq
output:
lane_dir = directory("temp/split_fastq/{sample}")
run:
pathlib.Path(output.lane_dir).mkdir(exist_ok=True)
n_lanes = random.randrange(1, 10)-
lane_numbers = random.sample(range(1, 10), k = n_lanes)
for lane_number in lane_numbers:
path = pathlib.Path(output.lane_dir) / f"L00{lane_number}.fastq"
path.touch()
Do some intermediate processing.
rule map_fastq:
input:
fastq = "temp/split_fastq/{sample}/L00{lane_number}.fastq"
output:
bam = "temp/map_fastq/{sample}/L00{lane_number}.bam"
run:
bam = pathlib.Path(output.bam)
bam.parent.mkdir(exist_ok=True)
bam.touch()
To merge all the processed files, we use an input function to access the lanes that were created in split_fastq, so that we can do a dynamic expand on these. We do the expand on the last rule in the chain of intermediate processing steps, in this case map_fastq, so that we ask for the correct inputs.
def get_bams(wildcards):
lane_dir = checkpoints.split_fastq.get(**wildcards).output[0]
lane_numbers = glob_wildcards(f"{lane_dir}/L00{{lane_number}}.fastq").lane_number
bams = expand(rules.map_fastq.output.bam, **wildcards, lane_number=lane_numbers)
return bams
This input function now gives us easy access to the bam files we wish to merge, however many there are, and whatever they may be called.
rule merge_bam:
input:
get_bams
output:
bam = "temp/merge_bam/{sample}.bam"
shell:
"cat {input} > {output.bam}"
This example runs, and with random.seed(1) happens to create three lanes (l001, l002, and l005).
If you don't want to use checkpoint, I think you could achieve something similar by creating an input function for merge_bam that opens up the original input fastq, scans the read names for lane info, and predicts what the input files ought to be. This seems less robust, however.

Aggregate undetermined number of files for all wildcards in one rule

I have a set of files which will be individually processed to produce multiple files. Exactly how many files is unknown before runtime. (If it matters, this is demultiplexing DNA sequencing results.) I then have a script which takes all of these files at once.
Right now I have something like this:
checkpoint demultiplex:
input: "{sample}.fastq"
output: directory("{sample}")
shell:
# in reality the number of output files is not known
"mkdir -p {output} &&"
"touch {output}/{wildcards.sample}-1.fastq &&"
"touch {output}/{wildcards.sample}-2.fastq &&"
"touch {output}/{wildcards.sample}-3.fastq"
def find_outputs(wildcards) :
outdir = checkpoints.demultiplex.get(**wildcards)
return glob.glob("{sample}/{sample}-*.fastq".format_map(wildcards))
rule analysis:
input: find_outputs
outputs: "results.txt"
script: "scripts/do_analysis.R"
This obviously doesn't work, because the values of {sample} (Assume they should be A, B, C, D) are never defined.
As I was writing the question, I came up with this answer, which seems to work. However, if you have something cleaner, I would be happy to accept it!
For checkpoints.<rule>.get() to work its magic, it has to be in the body of a function which is given as a reference, not called. Also, this function needs to take one argument, wildcards.
So we make a function that returns closures having the behavior we need. The value of wildcards (which will be empty in this case) is ignored, allowing us to specify the values manually.
def find_outputs(sample):
def f(wildcards):
checkpoints.demultiplex.get(sample = sample)
return glob.glob("{sample}/{sample}-*.fastq".format(sample = sample))
return f
rule analysis:
input:
find_outputs("A"),
find_outputs("B"),
find_outputs("C"),
find_outputs("D")
output: "results.txt"
script: "script/do_analysis.R"

Regex speed in Perl 6

I've been previously working only with bash regular expressions, grep, sed, awk etc. After trying Perl 6 regexes I've got an impression that they work slower than I would expect, but probably the reason is that I handle them incorrectly.
I've made a simple test to compare similar operations in Perl 6 and in bash. Here is the Perl 6 code:
my #array = "aaaaa" .. "fffff";
say +#array; # 7776 = 6 ** 5
my #search = <abcde cdeff fabcd>;
my token search {
#search
}
my #new_array = #array.grep({/ <search> /});
say #new_array;
Then I printed #array into a file named array (with 7776 lines), made a file named search with 3 lines (abcde, cdeff, fabcd) and made a simple grep search.
$ grep -f search array
After both programs produced the same result, as expected, I measured the time they were working.
$ time perl6 search.p6
real 0m6,683s
user 0m6,724s
sys 0m0,044s
$ time grep -f search array
real 0m0,009s
user 0m0,008s
sys 0m0,000s
So, what am I doing wrong in my Perl 6 code?
UPD: If I pass the search tokens to grep, looping through the #search array, the program works much faster:
my #array = "aaaaa" .. "fffff";
say +#array;
my #search = <abcde cdeff fabcd>;
for #search -> $token {
say ~#array.grep({/$token/});
}
$ time perl6 search.p6
real 0m1,378s
user 0m1,400s
sys 0m0,052s
And if I define each search pattern manually, it works even faster:
my #array = "aaaaa" .. "fffff";
say +#array; # 7776 = 6 ** 5
say ~#array.grep({/abcde/});
say ~#array.grep({/cdeff/});
say ~#array.grep({/fabcd/});
$ time perl6 search.p6
real 0m0,587s
user 0m0,632s
sys 0m0,036s
The grep command is much simpler than Perl 6's regular expressions, and it has had many more years to get optimized. It is also one of the areas that hasn't seen as much optimizing in Rakudo; partly because it is seen as being a difficult thing to work on.
For a more performant example, you could pre-compile the regex:
my $search = "/#search.join('|')/".EVAL;
# $search = /abcde|cdeff|fabcd/;
say ~#array.grep($search);
That change causes it to run in about half a second.
If there is any chance of malicious data in #search, and you have to do this it may be safer to use:
"/#search».Str».perl.join('|')/".EVAL
The compiler can't quite generate that optimized code for /#search/ as #search could change after the regex gets compiled. What could happen is that the first time the regex is used it gets re-compiled into the better form, and then cache it as long as #search doesn't get modified.
(I think Perl 5 does something similar)
One important fact you have to keep in mind is that a regex in Perl 6 is just a method that is written in a domain specific sub-language.

Command in expect for grep keyword

Expect script query:
In one of my expect script I have to pick keyword from output of send command and store in a file, could some one help me.
send "me\n"
output :
EM/X Nmis Ssh Session/2; Userid =
Impact = ; Scope = ; CustomerId = 0
Here I want to pick keyword : Nmis Ssh Session/2
and my target is to create new command in expect script is :
send "set Nmis Ssh Session/2 \n"
so this value : Nmis Ssh Session/2 should store in a variable. Could some one help me.
I'm not quite sure exactly what information is produced by which side, but maybe something like this will do:
expect -re {EM/X ([^;]+);}
set theVariable $expect_out(1,string)
The key is that we use the -re option to pass a regular expression to the expect command. That makes the text that matches what is in the parentheses (a sequence of non-semicolon characters) be stored in the variable expect_out(1,string) (there are many other things stored in the expect_out array; see the documentation). Copying it from there to a named variable for the purpose of storage and further manipulation is trivial.
I do not know if the RE is the right one; there's something of an art in choosing the right one, and it takes quite a lot of knowledge about what the possible output of the other side could be.