In this program:
use v6;
my $j = +any "33", "42", "2.1";
gather for $j -> $e {
    say $e;
} # prints 33422.1
for $j -> $e {
    say $e; # prints any(33, 42, 2.1)
}
How does gather in front of for change the behavior of the Junction, allowing a loop over it? The documentation does not seem to reflect that behavior. Is that spec?
Fixed by jnthn in code and test commits.
Issue filed.
Golfed:
do put .^name for any 1 ; # Int
put .^name for any 1 ; # Mu
Ten of the thirteen statement prefixes listed in the doc can be used instead of do or gather, with the same result. (supply, unsurprisingly, produces no output, and hyper and race are red herrings because they try and fail to apply methods to the junction's values.)
Any type of junction produces the same results.
Any number of elements of the junction produces the same result for the for loop without a statement prefix, namely a single Mu. With a statement prefix the for loop repeats the primary statement (the put ...) the appropriate number of times.
I've searched both RT and GH issues and failed to find a related bug report.
ch_files = Channel.fromPath("myfiles/*.csv")
ch_parameters = Channel.from(['A', 'B', 'C', 'D'])
ch_samplesize = Channel.from([4, 16, 128])
process makeGrid {
    input:
    path input_file from ch_files
    each parameter from ch_parameters
    each samplesize from ch_samplesize

    output:
    tuple path(input_file), parameter, samplesize, path("config_file.ini") into settings_grid

    """
    echo "parameter=$parameter;sampleSize=$samplesize" > config_file.ini
    """
}
gives me a number_of_files * 4 * 3 grid of settings files, so I can run some script for each combination of parameters and input files.
How do I add some ID to each line of this grid? A row ID would be OK, but I would even prefer some unique 6-digit alphanumeric code without a "meaning", because the order in the table doesn't matter. I could extract the last part of the working folder, which seems to be unique per process execution; but I don't think it is ideal to rely on sed and $PWD for this, and I didn't see it exposed as a runtime metadata variable (plus it's a bit long, but that would be OK). In a former setup I had a job ID from the LSF cluster system for this purpose, but I want this to be portable.
Every combination is not guaranteed to be unique (e.g. having parameter 'A' twice in the input channel should be valid).
To be clear, I would like this output
file1.csv A 4 pathto/config.ini 1ac5r
file1.csv A 16 pathto/config.ini 7zfge
file1.csv A 128 pathto/config.ini ztgg4
file2.csv A 4 pathto/config.ini 123js
etc.
Given the input declaration, which uses the each qualifier as an input repeater, it will be difficult to append some unique id to the grid without some refactoring to use either the combine or cross operators. If the inputs are just files or simple values (like in your example code), refactoring doesn't make much sense.
To get a unique code, the simple options are:
Like you mentioned, there's unfortunately no way to access the unique task hash without some hack to parse $PWD. It might, however, be possible to use BASH parameter substitution to avoid sed/awk/cut (assuming BASH is your shell, of course); you could try using: "${PWD##*/}"
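For example, a minimal sketch of that substitution inside a script block might look like the following; note the leading backslash, since BASH variables need to be escaped with \$ so that Nextflow doesn't try to interpolate them itself:

process example {
    ...
    script:
    """
    # print the last path component of the task work directory (the unique task hash)
    echo "\${PWD##*/}"
    """
}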
You might instead prefer using ${task.index}, which is a unique index for each task within the same process. Although the task index is not guaranteed to be unique across executions, it should be sufficient in most cases. It can also be formatted, for example:
process example {
    ...
    script:
    def idx = String.format("%06d", task.index)
    """
    echo "${idx}"
    """
}
Alternatively, create your own UUID. You might be able to take the first N characters but this will of course decrease the likelihood of the IDs being unique (not that there was any guarantee of that anyway). This might not really matter though for a small finite set of inputs:
process example {
    ...
    script:
    def uuid = UUID.randomUUID().toString()
    """
    echo "${uuid}"
    echo "${uuid.take(6)}"
    echo "${uuid.takeBefore('-')}"
    """
}
Something is going on here that I don't quite understand.
> my @arr = <ac bc abc>
> @arr.grep: (( * ~~ /a/ ) && ( * ~~ /b/ ))
(bc abc)
But
> @arr.grep(* ~~ /a/).grep(* ~~ /b/)
(abc)
What's the reason?
You've come up with perfectly cromulent solutions.
Another would be:
my @arr = <ac bc abc>
@arr.grep: { $_ ~~ /a/ && $_ ~~ /b/ }
(abc)
The rest of this answer just explains the problem. The problem in this question is a more complicated version of the problem covered at WhateverStar && WhateverStar.
The logical ops don't execute their arguments if they're code.
So { $_ ~~ /a/ } && { $_ ~~ /b/ } returns { $_ ~~ /b/ }.
Or * ~~ /a/ && * ~~ /b/ returns * ~~ /b/.
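A quick REPL check of that claim (just a sketch; the two blocks are stored in variables so the result of && can be compared by identity):

> my &lhs = { $_ ~~ /a/ }
> my &rhs = { $_ ~~ /b/ }
> (&lhs && &rhs) === &rhs  # the left block is a truthy value, so && just returns the right block
True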
At the same time, grep does execute its matcher if it's code or if it's a regex, so these are all the same:
foo.grep: { $_ ~~ /.../ }
foo.grep: * ~~ /.../;
foo.grep: /.../;
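For instance, with the same @arr as above, matching on /a/ alone gives the same result whether the matcher is code or a bare regex:

> @arr.grep(* ~~ /a/)
(ac abc)
> @arr.grep(/a/)
(ac abc)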
The magic of Junctions
Your Junction solution seems natural. I'd love it if someone could explain what I'm missing in the following. While I expected it to work, it's bending my head to figure out how it actually works, but I think it's something like this (but not quite):
foo & bar becomes a Junction of foo and bar.
An attempt is made to invoke grep with a Junction as an argument.
Because Junction is outside the normal Any value hierarchy, most routines don't have a matching signature. grep doesn't.
When you invoke a routine and there is no corresponding signature, the initial dispatch fails and we hit a dispatch fallback handler. This may be the method one.
If the dispatch fallback handler sees that there are Junction arguments, it extracts the individual values in the Junction and fires up logically parallel "threads" (currently never actual threads, but the compiler is allowed to make them real threads if it thinks that is a good idea) corresponding to those values. Thus it invokes grep once for each value in the Junction (so each element of the invocant ends up being tested against each value) and gathers the results back into a new Junction of the same type as the incoming Junction.
In boolean context, a Junction collapses to a single result.
(If a Junction is in an overall boolean context then it can short-circuit (in parallel) based on results. If it's an any and any result comes in True then it can cancel the rest -- or not do them in the first place if it's actually doing things sequentially, as is currently always the case. If it's an all and any result comes in False it can cancel the rest. Etc.)
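For example, the boolean collapse can be seen directly by smartmatching a single string against a junction of regexes (a quick sketch):

> so "abc" ~~ /a/ & /b/
True
> so "bc" ~~ /a/ & /b/
False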
Problem
I have several files, each with one column, and I want to compare them to one another to find which elements are contained across all files. Alternatively - if it is easier - I could make a column matrix.
Question
How can I find the common elements across multiple columns?
Request
I am not an expert at awk (obviously). So a verbose explanation of the code would be much appreciated.
Other
@joepvd made some code that was somewhat similar... https://unix.stackexchange.com/questions/216511/comparing-the-first-column-of-two-files-and-printing-the-entire-row-of-the-secon/216515#216515?newreg=f4fd3a8743aa4210863f2ef527d0838b
to find what elements are contained across all files
awk is your friend, as you guessed. Use the procedure below:
# Store the files in an array. Assuming all files are in one place
filelist=( $(find . -maxdepth 1 -type f) ) # array of files
awk -v count="${#filelist[@]}" '{value[$1]++}END{for(i in value){
if(value[i]==count){printf "Value %s is found in all files\n",i}}}' "${filelist[@]}"
Note
We used -v count="${#filelist[@]}" to pass the total file count to awk. Note that # at the beginning of an array name (inside ${...}) gives the element count.
value[$1]++ increments the count of a value as it is seen in a file. It also creates value[$1] with an initial value of zero if it doesn't already exist.
This method fails if a value appears in the same file more than once.
The END block in awk is executed only at the very end, i.e. after every record from all the files has been processed.
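If duplicate values within a single file are possible, one way to guard against that (a sketch along the same lines) is to count each value at most once per file:

awk -v count="${#filelist[@]}" '!seen[FILENAME,$1]++{value[$1]++}END{for(i in value){
if(value[i]==count){printf "Value %s is found in all files\n",i}}}' "${filelist[@]}"

Here seen[FILENAME,$1] remembers whether a value has already been counted for the current file, so repeats within the same file don't inflate the count.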
If you can have the same value multiple times in a single file, we'll need to take care to only count it once for each file.
A couple of variations with GNU awk (which is needed for ARGIND to be available; it could be emulated by checking FILENAME, but that's even uglier):
gawk '{ A[$0] = or(A[$0], lshift(1, ARGIND-1)) }
END { for (x in A) if (A[x] == lshift(1, ARGIND) - 1) print x }'
file1 file2 file3
The array A is keyed by the values (lines), and holds a bitmap of the files in which a line has been found. For each line read, we set bit number ARGIND-1 (since ARGIND starts with one).
At the end of input, run through all saved lines, and print them if the bitmap is all ones (up to the number of files seen).
gawk 'ARGIND > LASTIND {
LASTIND = ARGIND; for (x in CURR) { ALL[x] += 1; delete CURR[x] }
}
{ CURR[$0] = 1 }
END { for (x in CURR) ALL[x] += 1;
for (x in ALL) if (ALL[x] == ARGIND) print x
}' file1 file2 file3
Here, when a line is encountered, the corresponding element in array CURR is set (middle part). When the file number changes (ARGIND > LASTIND), values in array ALL are increased for all values set in CURR, and the latter is cleared. At the END of input, the values in ALL are updated for the last file, and the total count is checked against the total number of files, printing the ones that appear in all of them.
The bitmap approach is likely slightly faster with large inputs, since it doesn't involve creating and walking through a temporary array, but the number of files it can handle is limited by the number of bits the bit operations can handle (which seems to be about 50 on 64-bit Linux).
In both cases, the resulting printout will be in essentially a random order, since associative arrays do not preserve ordering.
I'm going to assume that it's the problem that matters, not the implementation language, so here's an alternative using perl:
#! /usr/bin/perl
use strict;
my %elements=();
my $filecount=@ARGV;
while(<>) {
$elements{$_}->{$ARGV}++;
};
print grep {!/^$/} map {
"$_" if (keys %{ $elements{$_} } == $filecount)
} (keys %elements);
The while loop builds a hash-of-hashes (aka "HoH". See man perldsc and man perllol for details. Also see below for an example), with the top level key being each line from each input file, and the second-level key being the names of the file(s) that value appeared in.
The grep ... map {...} returns each top-level key where the number of files it appears in is equal to the number of input files.
Here's what the data structure looks like, using the example you gave to ilkkachu:
{
'A' => { 'file1' => 1 },
'B' => { 'file2' => 1 },
'C' => { 'file1' => 1, 'file2' => 1, 'file3' => 1 },
'E' => { 'file2' => 1 },
'F' => { 'file1' => 1 },
'K' => { 'file3' => 1 },
'L' => { 'file3' => 1 }
}
Note that if there happen to be any duplicates in a single file, that fact is stored in this structure and can be checked.
The grep before the map isn't strictly required in this particular example, but is useful if you want to store the result in an array for further processing rather than print it immediately.
With the grep, it returns an array of only the matching elements, or in this case just the single value C. Without it, it returns an array of empty strings plus the matching elements, e.g. ("", "", "", "", "C", "", ""). Actually, they return the elements with a newline (\n) at the end, because I didn't use chomp in the while loop as I knew I'd be printing them directly. In most programs, I'd use chomp to strip newlines and/or carriage-returns.
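If you did want the keys stored without trailing newlines, a minimal tweak of the same while loop (just a sketch) would be:

while(<>) {
    chomp;                      # strip the trailing newline from $_ before using it as a key
    $elements{$_}->{$ARGV}++;
};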
I'm working on a relatively big optimization model. I will use 15 timesteps in this model, but now, while testing it, I am only using 4. However, even with 11 fewer time steps than desired, the model still prints 22 000 rows of variables, of which perhaps merely a hundred differ from 0.
Does anyone see a way past this? I.e. a way, using the NEOS server, to print a variable name and its corresponding value only if the value is greater than 0.
What I've tested is:
solve;
option omit_zero_rows 0; (also tried 1;)
display _varname, _var;
Using either omit_zero_rows 0; or omit_zero_rows 1; still prints every result, not just those greater than 0.
I've also tried:
solve;
if _var > 0 then {
display _varname, _var;
}
but it gave me a syntax error. Both (or really, all three) variants were tested in the .run file I use for the NEOS server.
I'm posting a solution to this issue, as I believe it is one more people will stumble upon. Basically, in order to print only non-zero values using the NEOS server, write your command file (.run file) as:
solve;
display {j in 1.._nvars: _var[j] > 0} (_varname[j], _var[j]);
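If you prefer a cleaner, script-friendly listing, the same filtered indexing should also work with printf (a sketch; I haven't verified it on NEOS):

solve;
printf {j in 1.._nvars: _var[j] > 0}: "%s = %g\n", _varname[j], _var[j];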
I have a tuple in my pig script:
((,v1,fb,fql))
I know I can choose elements from the left as $0 (blank), $1 ("v1"), etc. But can I choose elements from the right? The tuples will be different lengths, but I would always like to get the last element.
You can't. You can, however, write a Python UDF to extract it:
# Make sure to include the appropriate outputSchema decorator
def getFromBack(TUPLE, pos):
    # gets elements from the back of TUPLE
    # You can also do TUPLE[-pos], but this way it starts at 0, which is closer
    # to Pig
    return TUPLE[::-1][pos]
    # This works the same way as above, but may be faster:
    # return TUPLE[-(pos + 1)]
And used like:
register 'myudf.py' using jython as pythonUDFs ;
B = FOREACH A GENERATE pythonUDFs.getFromBack(T, 0) ;