Check if Nextflow channel is empty

I am trying to figure out how to check whether a channel is empty or not.
For instance, I have two processes. The first process runs only if a combination of parameters/flags is set; if so, it also checks whether its input file (received via a channel from another process) is not empty, and then creates a new input file for a second process (to eventually replace the default one). As a simplified example:
.....
.....
// create the channel here to force nextflow to wait for the first process
_chNewInputForProcessTwo = Channel.create()

process processOne {
    when:
    params.conditionOne && params.conditionTwo

    input:
    file inputFile from _channelUpstreamProcess

    output:
    file("my.output.file") into _chNewInputForProcessTwo

    script:
    """
    # check if we need to produce new input for the second process (i.e., the input file is not empty)
    if [ -s ${inputFile} ]
    then
        <super_command_to_generate_new_fancy_input_for_second_process> > "my.output.file"
    else
        echo "No need to create new input"
    fi
    """
}
// and here I would like to check if new input was generated or leave the "default" one
_chInputProcessTwo = Channel.from(_chNewInputForProcessTwo).ifEmpty(Channel.value(params.defaultInputProcessTwo))
process secondProcess {
input:
file inputFile from _chInputProcessTwo
......
......
etc.
When I try running with this approach it fails, because the channel _chNewInputForProcessTwo contains DataflowQueue(queue=[]) and is therefore not actually empty.
I've tried several things based on the documentation and the threads on Google Groups and Gitter: trying to set the channel to empty (but then it complains that I am trying to use the channel twice), putting create().close(), etc.
Is there a clean/reasonable way to do this? I could do it using a value channel and have the first process output some string on stdout to be picked up and checked by the second process, but that seems pretty dirty to me.
Any suggestions/feedback is appreciated. Thank you in advance!
Marius

Best to avoid trying to check if the channel is empty. If your channel could be empty and you need a default value in your channel, you can use the ifEmpty operator to supply one. Note that a single value is implicitly a value channel. I think all you need is:
myDefaultInputFile = file(params.defaultInputProcessTwo)
chInputProcessTwo = chNewInputForProcessTwo.ifEmpty(myDefaultInputFile)
Also, calling Channel.create() is usually unnecessary.
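For context, here is a minimal sketch of how those two lines might slot into the pipeline from the question (DSL1 syntax; marking processOne's output as optional true is an assumption, so the channel may legitimately end up empty):
// in processOne, allow the output file to be absent (assumption: optional true)
output:
file("my.output.file") optional true into chNewInputForProcessTwo

// after processOne: fall back to the default input when nothing was emitted
myDefaultInputFile = file(params.defaultInputProcessTwo)
chInputProcessTwo = chNewInputForProcessTwo.ifEmpty(myDefaultInputFile)

// secondProcess then simply reads from the resulting channel
process secondProcess {
    input:
    file inputFile from chInputProcessTwo

    script:
    """
    wc -l ${inputFile}
    """
}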

Related

Asynchronous reading of stdout

I've written this simple script; it generates one output line per second (generator.sh):
for i in {0..5}; do echo $i; sleep 1; done
The Raku program launches this script and prints the lines as soon as they appear:
my $proc = Proc::Async.new("sh", "generator.sh");
$proc.stdout.tap({ .print });
my $promise = $proc.start;
await $promise;
Everything works as expected: every second we see a new line. But let's rewrite the generator in Raku (generator.raku):
for 0..5 { .say; sleep 1 }
and change the first line of the program to this:
my $proc = Proc::Async.new("raku", "generator.raku");
Now something is wrong: first we see the first line of output ("0"), then a long pause, and finally all the remaining lines appear at once.
I tried to grab the output of the generators via the script command:
script -c 'sh generator.sh' script-sh
script -c 'raku generator.raku' script-raku
Then I analyzed them in a hex editor, and they look the same: after each digit, the bytes 0d and 0a follow.
Why is there such a difference between seemingly identical generators? I need to understand this because I am going to launch an external program and process its output online.
Why is there such a difference between seemingly identical generators?
First, with regard to the title, the issue is not about the reading side, but rather the writing side.
Raku's I/O implementation looks at whether STDOUT is attached to a TTY. If it is a TTY, any output is immediately written to the output handle. However, if it's not a TTY, then it will apply buffering, which results in a significant performance improvement but at the cost of the output being chunked by the buffer size.
If you change generator.raku to disable output buffering:
$*OUT.out-buffer = False; for 0..5 { .say; sleep 1 }
Then the output will be seen immediately.
I need to understand this because I am going to launch an external program and process its output online.
It'll only be an issue if the external program you launch also has such a buffering policy.
In addition to Jonathan Worthington's answer: although buffering is an issue on the writing side, it is possible to cope with it on the reading side. stdbuf, unbuffer, or script can be used on Linux (see this discussion). On Windows only winpty helped me, which I found here.
So, if the winpty.exe, winpty-agent.exe, winpty.dll, and msys-2.0.dll files are in the working directory, this code can be used to run a program without buffering:
my $proc = Proc::Async.new(<winpty.exe -Xallow-non-tty -Xplain raku generator.raku>);
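On Linux, a comparable workaround (a sketch, assuming the unbuffer utility from the expect package is installed) would be to give the child a pseudo-terminal so it believes it is writing to a TTY:
my $proc = Proc::Async.new(<unbuffer raku generator.raku>);
$proc.stdout.tap({ .print });   # lines now arrive as they are produced
await $proc.start;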

Nextflow: add unique ID, hash, or row number to tuple

ch_files = Channel.fromPath("myfiles/*.csv")
ch_parameters = Channel.from(['A', 'B', 'C', 'D'])
ch_samplesize = Channel.from([4, 16, 128])

process makeGrid {
    input:
    path input_file from ch_files
    each parameter from ch_parameters
    each samplesize from ch_samplesize

    output:
    tuple path(input_file), parameter, samplesize, path("config_file.ini") into settings_grid

    """
    echo "parameter=$parameter;sampleSize=$samplesize" > config_file.ini
    """
}
gives me a number_of_files * 4 * 3 grid of settings files, so I can run some script for each combination of parameters and input files.
How do I add some ID to each line of this grid? A row ID would be OK, but I would even prefer some unique 6-character alphanumeric code without a "meaning", because the order in the table doesn't matter. I could extract the last part of the working folder, which is seemingly unique per process, but I don't think it is ideal to rely on sed and $PWD for this, and I didn't see it provided as a runtime metadata variable (plus it's a bit long, but OK). In a former setup I had a job ID from the LSF cluster system for this purpose, but I want this to be portable.
Every combination is not guaranteed to be unique (e.g. having parameter 'A' twice in the input channel should be valid).
To be clear, I would like this output:
file1.csv A 4 pathto/config.ini 1ac5r
file1.csv A 16 pathto/config.ini 7zfge
file1.csv A 128 pathto/config.ini ztgg4
file2.csv A 4 pathto/config.ini 123js
etc.
Given the input declaration, which uses the each qualifier as an input repeater, it will be difficult to append some unique id to the grid without some refactoring to use either the combine or cross operators. If the inputs are just files or simple values (like in your example code), refactoring doesn't make much sense.
To get a unique code, the simple options are:
As you mentioned, there's unfortunately no way to access the unique task hash without some hack to parse $PWD. It might, however, be possible to use Bash parameter substitution to avoid sed/awk/cut (assuming Bash is your shell, of course); you could try: "${PWD##*/}"
You might instead prefer using ${task.index}, which is a unique index within the same process. Although the task index is not guaranteed to be unique across executions, it should be sufficient in most cases. It can also be formatted, for example:
process example {
    ...

    script:
    def idx = String.format("%06d", task.index)
    """
    echo "${idx}"
    """
}
Alternatively, create your own UUID. You might be able to take just the first N characters, but this will of course decrease the likelihood of the IDs being unique (not that there was any guarantee of that anyway). This might not really matter, though, for a small finite set of inputs:
process example {
    ...

    script:
    def uuid = UUID.randomUUID().toString()
    """
    echo "${uuid}"
    echo "${uuid.take(6)}"
    echo "${uuid.takeBefore('-')}"
    """
}
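For instance, a hypothetical adaptation of the makeGrid process from the question, appending the first six characters of a random UUID to each tuple (a sketch only; the val(uid) output and the six-character truncation are assumptions, not a guarantee of uniqueness):
process makeGrid {
    input:
    path input_file from ch_files
    each parameter from ch_parameters
    each samplesize from ch_samplesize

    output:
    tuple path(input_file), val(parameter), val(samplesize), path("config_file.ini"), val(uid) into settings_grid

    script:
    uid = UUID.randomUUID().toString().take(6)   // defined without 'def' so the output block can see it
    """
    echo "parameter=${parameter};sampleSize=${samplesize}" > config_file.ini
    """
}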

My workflow ignores the path decided via the input function

I have two possible paths for Trinity: genome-free (GF) and genome-guided (GG). To decide which way to go I use the variable GUIDED from a config, and depending on it I give the path to files created by either the GG or GF part.
The problem is that no matter what the input function returns, Snakemake always tries to run the GG part (except when the exception is raised, of course).
def GenomeDependentInput() -> str:
    guided = config["GUIDED"]
    if guided == "GF":
        print(rules.aggregate_GF.output.fasta)  # this print is run by snakemake and gives the correct path ...Results/trinityGF/{species}_Trinity_GF.fasta
        return rules.aggregate_GF.output.fasta
    elif guided == "GG":
        print(rules.aggregateTrinity.output.fasta)  # this is not (good)
        return rules.aggregateTrinity.output.fasta
    else:
        raise ValueError("Please fill in the GUIDED variable in the config")

rule Transdecoder:
    input:
        fasta = GenomeDependentInput()
    output:
        pep = path.join(TRANS_DIR, "{species}", path.basename(GenomeDependentInput()) + ".transdecoder.pep")
    envmodules:
        config["PERL"],
        config["PYTHON3"]
    script:
        "scripts/TransDecoder.py"
So I found that, down the line, another rule used only the GG part, and I had to use the GenomeDependentInput function there too.
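For example, a hypothetical downstream rule (the rule name and output file are placeholders) would also need to call the selector instead of hard-coding the GG path:
rule DownstreamExample:
    input:
        # use the same config-dependent selector instead of rules.aggregateTrinity.output.fasta
        fasta = GenomeDependentInput()
    output:
        path.join(TRANS_DIR, "{species}", "downstream.done")
    shell:
        "touch {output}"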
Furthermore, the Transdecoder rule wasn't even active, as the Transdecoder output wasn't yet used as input anywhere.
Maybe this will help somebody else, so I'll just leave this here.

How do I store the value returned by either run or shell?

Let's say I have this script:
# prog.p6
my $info = run "uname";
When I run prog.p6, I get:
$ perl6 prog.p6
Linux
Is there a way to store a stringified version of the returned value and prevent it from being output to the terminal?
There's already a similar question but it doesn't provide a specific answer.
You need to enable the stdout pipe, which otherwise defaults to $*OUT, by setting :out. So:
my $proc = run("uname", :out);
my $stdout = $proc.out;
say $stdout.slurp;
$stdout.close;
which can be shortened to:
my $proc = run("uname", :out);
say $proc.out.slurp(:close);
If you want to capture output on stderr separately from stdout you can do:
my $proc = run("uname", :out, :err);
say "[stdout] " ~ $proc.out.slurp(:close);
say "[stderr] " ~ $proc.err.slurp(:close);
or if you want to capture stdout and stderr to one pipe, then:
my $proc = run("uname", :merge);
say "[stdout and stderr] " ~ $proc.out.slurp(:close);
Finally, if you don't want to capture the output and don't want it output to the terminal:
my $proc = run("uname", :!out, :!err);
exit( $proc.exitcode );
The solution covered in this answer is concise.
This sometimes outweighs its disadvantages:
Doesn't store the result code. If you need that, use ugexe's solution instead.
Doesn't store output to stderr. If you need that, use ugexe's solution instead.
Potential vulnerability. This is explained below. Consider ugexe's solution instead.
Documentation of the features explained below starts with the quote adverb :exec.
Safest unsafe variant: q
The safest variant uses a single q:
say qx[ echo 42 ] # 42
If there's an error then the construct returns an empty string and any error message will appear on stderr.
This safest variant is analogous to a single quoted string like 'foo' passed to the shell. Single quoted strings don't interpolate so there's no vulnerability to a code injection attack.
That said, you're passing a single string to the shell which may not be the shell you're expecting so it may not parse the string as you're expecting.
Least safe unsafe variant: qq
The following line produces the same result as the q line but uses the least safe variant:
say qqx[ echo 42 ]
This double q variant is analogous to a double quoted string ("foo"). This form of string quoting does interpolate which means it is subject to a code injection attack if you include a variable in the string passed to the shell.
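A hypothetical illustration of the risk (the value is imagined to come from untrusted input):
my $name = 'foo; echo INJECTED';   # pretend this arrived from user input
say qqx[ echo $name ];             # the shell runs both commands, so INJECTED is printed too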
By default run just passes the STDOUT and STDERR to the parent process's STDOUT and STDERR.
You have to tell it to do something else.
The simplest is to just give it :out to tell it to keep STDOUT. (Short for :out(True))
my $proc = run 'uname', :out;
my $result = $proc.out.slurp(:close);
my $proc = run 'uname', :out;
for $proc.out.lines(:close) {
    .say;
}
You can also effectively tell it to just send STDOUT to /dev/null with :!out. (Short for :out(False))
There are more things you can do with :out:
{
    my $file will leave {.close} = open :w, 'test.out';
    run 'uname', :out($file);   # write directly to a file
}
print slurp 'test.out';   # Linux
my $proc = run 'uname', :out;
react {
    whenever $proc.out.Supply {
        .print;
        LAST {
            $proc.out.close;
            done;   # in case there are other whenevers
        }
    }
}
If you are going to do that last one, it is probably better to use Proc::Async.
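A rough Proc::Async equivalent of that react block might look like this (a sketch, not a drop-in replacement):
my $proc = Proc::Async.new('uname');
react {
    whenever $proc.stdout.lines { .say }   # stream stdout line by line
    whenever $proc.start { done }          # leave the react block once the process exits
}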

Command in Expect to grep a keyword

Expect script query:
In one of my Expect scripts I have to pick a keyword from the output of a send command and store it in a file. Could someone help me?
send "me\n"
Output:
EM/X Nmis Ssh Session/2; Userid =
Impact = ; Scope = ; CustomerId = 0
Here I want to pick the keyword: Nmis Ssh Session/2
and my goal is to create a new command in the Expect script:
send "set Nmis Ssh Session/2 \n"
so this value, Nmis Ssh Session/2, should be stored in a variable. Could someone help me?
I'm not quite sure exactly what information is produced by which side, but maybe something like this will do:
expect -re {EM/X ([^;]+);}
set theVariable $expect_out(1,string)
The key is that we use the -re option to pass a regular expression to the expect command. That causes the text matching the parenthesized part of the pattern (a sequence of non-semicolon characters) to be stored in the variable expect_out(1,string); there are many other things stored in the expect_out array (see the documentation). Copying it from there to a named variable for storage and further manipulation is trivial.
I do not know if the RE is the right one; there's something of an art in choosing the right one, and it takes quite a lot of knowledge about what the possible output of the other side could be.
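Putting it together, a hypothetical fragment of the script (the surrounding prompt handling is assumed) might look like:
send "me\n"
expect -re {EM/X ([^;]+);}
set theVariable $expect_out(1,string)
send "set $theVariable \n"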