Awk get unique elements from array - awk

file.txt:
INTS11:P446P&INTS11:P449P&INTS11:P518P&INTS11:P547P&INTS11:P553P
PLCH2:A1007int&PLCH1:D987int&PLCH2:P977L
I am attempting to create a hyperlink by transforming the content of a file. The hyperlink will have the following style:
somelink&gene=<gene>[&gene=<gene>]&mutation=<gene:key>[&mutation=<gene:key>]
where INTS11:P446P corresponds to gene:key for example
The problem is that I am looping over each row to create an array that contains the genes as values, and thus multiple duplicate entries can be found for the same gene.
My attempt is the following:
Split on & and store in a
For each element a[i], split it on : into array b
The problem is that I don't know how to get unique values from my array. I found this question, but it talks about files, not arrays as in my case.
The code:
awk '@include "join"
{
split($0,a,"&")
for ( i = 1; i <= length(a); i++ ) {
split(a[i], b, ":");
genes[i] = "&gene="b[1];
keys[i] = "&mutation="b[1]":"b[2]
}
print "somelink"join(genes, 1, length(genes),SUBSEP)join(keys, 1, length(keys),SUBSEP)
delete genes
delete keys
}' file.txt
will output:
somelink&gene=INTS11&gene=INTS11&gene=INTS11&gene=INTS11&gene=INTS11&mutation=INTS11:P446P&mutation=INTS11:P449P&mutation=INTS11:P518P&mutation=INTS11:P547P&mutation=INTS11:P553P
somelink&gene=PLCH2&gene=PLCH1&gene=PLCH2&mutation=PLCH2:A1007int&mutation=PLCH1:D987int&mutation=PLCH2:P977L
I wish to obtain something like this (notice how many &gene= entries there are):
somelink&gene=INTS11&mutation=INTS11:P446P&INTS11:P449P&INTS11:P518P&INTS11:P547P&INTS11:P553P
somelink&gene=PLCH2&gene=PLCH1&mutation=PLCH2:A1007int&mutation=PLCH1:D987int&mutation=PLCH2:P977L
EDIT:
my problem was partly solved thanks to Pierre Francois's answer, namely the SUBSEP fix. My remaining issue is that I want to get only unique elements from my arrays genes and keys.
Thank you.

Supposing you want to remove the spaces between the fields concatenated with awk's join function, the 4th argument you have to pass to join is the magic value SUBSEP, not an empty string "" as you did. Try:
awk '@include "join"
{
split($0,a,"&")
for ( i = 1; i <= length(a); i++ ) {
split(a[i], b, ":");
genes[i] = "&gene="b[1];
keys[i] = "&mutation="b[1]":"b[2]
}
print "somelink"join(genes, 1, length(genes),SUBSEP)join(keys, 1, length(keys),SUBSEP)
delete genes
delete keys
}' file.txt
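To also address the remaining issue of duplicate genes, one option (a minimal sketch, assuming gawk with its bundled join library; not part of the original answer) is to track which genes have already been emitted in a seen array and only append new ones:
awk '@include "join"
{
    n = split($0, a, "&")
    delete seen; g = 0; k = 0
    for (i = 1; i <= n; i++) {
        split(a[i], b, ":")
        if (!(b[1] in seen)) {              # first occurrence of this gene on the line
            seen[b[1]] = 1
            genes[++g] = "&gene=" b[1]
        }
        keys[++k] = "&mutation=" b[1] ":" b[2]
    }
    print "somelink" join(genes, 1, g, SUBSEP) join(keys, 1, k, SUBSEP)
    delete genes; delete keys
}' file.txt
For the second sample line this prints somelink&gene=PLCH2&gene=PLCH1&mutation=PLCH2:A1007int&mutation=PLCH1:D987int&mutation=PLCH2:P977L, which matches the desired output.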

Related

normalize column data with average value of that column with awk

I have 3 columns in a data file that looks like the sample below and continues up to 250 rows:
0.9967 0.7765 0.5798
0.9955 0.7742 0.5767
0.9942 0.7769 0.5734
I want to normalise each column based on the average value of that column.
I am using the code below (e.g. for column 1) but it does not print my desired output.
The results should be very close to 1.
awk 'NR==FNR{sum+= $1; next}{avg=(NR/sum)}FNR>1{print($1/avg)}' f.dat f.dat
Expected output for the first column:
1.003
1.001
0.9988
You need separate placeholders for storing the sum and the count for each column. I recommend using arrays indexed by the column number:
awk '
# first pass: accumulate the sum and the number of rows for every column
NR==FNR {
    for (col=1; col<=NF; col++) {
        avg[col] += $col
        len[col] += 1
    }
    next
}
# second pass: divide each value by its column average
{
    for (col=1; col<=NF; col++) {
        colAvg = avg[col]/len[col]
        printf "%.3f%s", $col/colAvg, (col<NF ? FS : ORS)
    }
}
' file file
The snippet above already rewrites the entire table with the new normalized values. If you want to increase the precision of the output, change %.3f in the printf format to however many digits you prefer.
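For reference, run against just the three sample rows above, the snippet prints values close to 1 in every column (the exact figures will differ once all 250 rows contribute to the averages):
1.001 1.001 1.005
1.000 0.998 1.000
0.999 0.998 0.994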

Trying to delete first string where pattern is found and leave second string intact

I have a file which contains multiple rows of data, some of which are duplicates with a date field at the end of each record. I want to be able to scan the file and keep only the most current record. Here's what the data looks like:
00xbdf0c9fd6;joe@easy.us.com;20141231 <- remove this one
00vbdf0c9fd6;joe@easy.us.com;20150403 <- keep this one (newer date)
00dndf0ca080;betty@easy.us.com;20141231 <- keep
00dbkf0ca292;jerry@easy.us.com;20141231 <- keep
0dbds0ca2f6;john@easy.us.com;20141231 <- remove
0dbds0ca2f6;john@easy.us.com;20150403 <- keep (newer date)
I tried various flavors and combinations of sed, awk, grep but I could not get it to work.
Why not sort the file by address and by descending timestamp? Then all you need to do is keep the first record for each address:
<infile sort -t\; -k2,2 -k3r | awk -F\; '!h[$2]++'
The filter !h[$2]++ is true only the first time a given address (field 2) is seen, so only the newest record per address is printed.
Output:
00dndf0ca080;betty@easy.us.com;20141231
00dbkf0ca292;jerry@easy.us.com;20141231
00vbdf0c9fd6;joe@easy.us.com;20150403
0dbds0ca2f6;john@easy.us.com;20150403
Try this:
{
split($0,parts,/;/)
if (link[parts[2]] < parts[3]) {
link[parts[2]] = parts[3]
}
}
END {
for (l in link) {
print l,link[l]
}
}
produces:
sue@easy.us.com 20141231
jerry@easy.us.com 20141231
joe@easy.us.com 20150403
betty@easy.us.com 20141231
john@easy.us.com 20150403
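If you also need to keep the id field (the first column) of the winning record, a small variant of the same idea (a sketch, assuming the data lives in a file called infile) remembers the whole line carrying the newest date for each address:
awk -F';' '
$3 > date[$2] { date[$2] = $3; line[$2] = $0 }   # remember the newest line per address
END           { for (a in line) print line[a] }  # print one full record per address
' infile
The output order is arbitrary; pipe it through sort if you need a stable ordering.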

How can I do a SQL like group by in AWK? Can I calculate aggregates for different columns?

I would like to split CSV files in Unix and run aggregates on some columns. If possible, I want to group by several columns on each of the split-up files using awk.
Does anyone know some unix magic that can do this?
here is a sample file:
customer_id,location,house_hold_type,employed,income
123,Florida,Head,true,100000
124,NJ,NoHead,false,0
125,Florida,NoHead,true,120000
126,Florida,Head,true,72000
127,NJ,Head,false,0
I want to get counts grouped by location and house_hold_type, as well as AVG(income) for the same group-by conditions.
How can I split a file and run awk on it to do this?
This is the output I expect. The format could be different, but this is the overall data structure I am expecting; I will humbly accept other ways of presenting the information:
location:[counts:['Florida':3, 'NJ':2], income_avgs:['Florida':97333, 'NJ':0]]
house_hold_type:[counts:['Head':3, 'NoHead':2], income_avgs:['Head':57333, 'NoHead':60000]]
Thank you in advance.
awk deals best with columns of data, so the input format is fine. The output format could be managed, but it will be much simpler to output it in columns as well:
#set the input and output field separators to comma
BEGIN {
FS = ",";
OFS = FS;
}
#skip the header row
NR == 1 {
next;
}
#for all remaining rows, store counters and sums for each group
{
count[$2,$3]++;
sum[$2,$3] += $5;
}
#after all data, display the aggregates
END {
print "location", "house_hold_type", "count", "avg_income";
#for every key we encountered
for(i in count) {
#split the key back into "location" and "house_hold_type"
split(i,a,SUBSEP);
print a[1], a[2], count[i], sum[i] / count[i];
}
}
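Assuming the program above is saved under a hypothetical name such as group_by.awk, you can run it directly against the CSV (input.csv standing in for the sample data below):
awk -f group_by.awk input.csv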
Sample input:
customer_id,location,house_hold_type,employed,income
123,Florida,Head,true,100000
124,NJ,NoHead,false,0
125,Florida,NoHead,true,120000
126,Florida,Head,true,72000
127,NJ,Head,false,0
and output:
location,house_hold_type,count,avg_income
Florida,Head,2,86000
Florida,NoHead,1,120000
NJ,NoHead,1,0
NJ,Head,1,0

How to load 2D array from a text(csv) file into Octave?

Consider the following text(csv) file:
1, Some text
2, More text
3, Text with comma, more text
How to load the data into a 2D array in Octave? The number can go into the first column, and all text to the right of the first comma (including other commas) goes into the second text column.
If necessary, I can replace the first comma with a different delimiter character.
AFAIK you cannot put strings of different sizes into an array. You need to create a so-called cell array.
A possible way to read the data from your question stored in a file Test.txt into a cell array is
t1 = textread("Test.txt", "%s", "delimiter", "\n");
for i = 1:length(t1)
j = findstr(t1{i}, ",")(1);
T{i,1} = t1{i}(1:j - 1);
T{i,2} = strtrim(t1{i}(j + 1:end));
end
Now
T{3,1} gives you 3 and
T{3,2} gives you Text with comma, more text.
After many long hours of searching and debugging, here's how I got it to work on Octave 3.2.4. Using | as the delimiter (instead of comma).
The data file now looks like:
1|Some text
2|More text
3|Text with comma, more text
Here's how to call it: data = load_data('data/data_file.csv', NUMBER_OF_LINES);
Limitation: You need to know how many lines you want to get. If you want all of them, you will need to write a function to count the number of lines in the file in order to initialize the cell array. It's all very clunky and primitive. So much for "high level languages like Octave".
Note: After the unpleasant exercise of getting this to work, it seems that Octave is not very useful unless you enjoy wasting your time writing code to do the simplest things. Better choices seem to be R, Python, or C#/Java with a machine learning or matrix library.
function all_messages = load_data(filename, NUMBER_OF_LINES)
fid = fopen(filename, "r");
all_messages = cell (NUMBER_OF_LINES, 2 );
counter = 1;
line = fgetl(fid);
while line != -1
separator_index = index(line, '|');
all_messages {counter, 1} = substr(line, 1, separator_index - 1); % Up to the separator
all_messages {counter, 2} = substr(line, separator_index + 1, length(line) - separator_index); % After the separator
counter++;
line = fgetl(fid);
endwhile
fprintf("Processed %i lines.\n", counter -1);
fclose(fid);
end

How do I use Perl to parse the output of the sqlplus command?

I have an SQL file which will give me an output like below:
10|1
10|2
10|3
11|2
11|4
.
.
.
I am using this in a Perl script like below:
my @tmp_cycledef = `sqlplus -s $connstr \@DLCycleState.sql`;
After the above statement, @tmp_cycledef holds all the output of the SQL query.
I want to show the output as:
10 1,2,3
11 2,4
How could I do this using Perl?
EDIT:
I am using the following code:
foreach my $row (@tmp_cycledef)
{
chomp $row;
my ($cycle_code,$cycle_month)= split /\s*\|\s*/, $row;
print "$cycle_code, $cycle_month\n";
$hash{$cycle_code}{$cycle_month}=1
}
foreach my $num ( sort keys %hash )
{
my $h = $hash{$num};
print join(',',sort keys %$h),"\n";
}
the first print statement prints:
2, 1
2, 10
2, 11
2, 12
3, 1
3, 10
3, 11
but the output is always
1,10,11,12
1,10,11,12
1,10,11,12
1,10,11,12
1,10,11,12
1,10,11,12
1,10,11,12
Well, this is actually how you might do it in Perl:
# two must-have pragmas for perl development
use strict;
use warnings;
Perl allows variables to be created as they are used: $feldman = some_function() means that you now have the variable $feldman in your local namespace. The bad part about this is that you can type $fldman and take a long time finding out why what you thought was $feldman has no value. Turning on strictures means that your code fails to compile if it encounters an undeclared variable. You declare a variable with a my or our statement (or, in older Perl code, a use vars statement).
Turning on warnings just warns you when you're not getting the values you expect. Warnings can be too touchy at times, but they are generally a good thing to develop code with.
my %hash; # the base object for the data
Here, I've declared a hash variable that I creatively called %hash. The sigil "%" (pronounced "sijil") tells you that it is a map of name-value pairs. This my statement declares the variable and makes it legal for the compiler; the compiler will then warn me about any use of a misspelled %hsh.
The next item is a foreach loop (which can be abbreviated "for"). The loop will process the list of lines in @tmp_cycledef, assigning each one in turn to $row (my $row).
We chomp the line first, removing the end-of-line character for that platform.
We split the line on the '|' character, creating a list of strings that had been separated by a pipe.
And then we store it in a two-layered hash, since we want to group entries by at least the first number. We could do this with an array, creating an array at that location in the hash like so: push @{$hash{$key}}, $val, but I typically want to collapse duplicates (not that there were any duplicates in your sample).
Here:
foreach my $row ( @tmp_cycledef ) {
chomp $row; # removes the end-of-line character when present.
my ( $key, $val ) = split /\|/, $row;
# One of the best ways to merge lists is a presence-of idea
# with the hash holding whether the value is present
$hash{$key}{$val} = 1;
}
Once we have the data in the structure, we need to iterate over both levels of hash keys. You wanted the "top level" numbers on separate lines, but the second numbers concatenated on the same line. So we print a line for each of the first numbers and join the list of strings stored for that number, delimited by commas. We also sort the list: { $a <=> $b } just takes two keys and compares them numerically, so you get numeric order.
# If they were alpha keys, we would likely just say sort keys %hash
foreach my $num ( sort { $a <=> $b } keys %hash ) {
my $h = $hash{$num};
print "$num ", join( ',', sort { $a <=> $b } keys %$h ), "\n";
}
As I said in the comments, sort by default sorts in character order, so with numeric keys you cannot just say sort keys %hash.
To help you out, you really need to read some of these:
strictures
warnings
perldata
perlfunc -- especially my, foreach, chomp, split, keys, sort and join
And the data structure tutorial
Use a hash of arrays to collect all the values for a single key together, then print them out:
my %hash;                                        # init hash
for my $line (@tmp_cycledef) {                   # for each line:
    chomp $line;
    my ($key, $value) = split /\|/, $line;       #   parse into key|value
    push @{ $hash{$key} }, $value;               #   append value to hash[key]
}
for my $key (sort { $a <=> $b } keys %hash) {    # for each key in hash (sorted, if needed)
    print "$key ", join(',', @{ $hash{$key} }), "\n";   # print out key, list of values
}
If your input is sorted (as it is in the provided sample), you don't actually need to bother with the hash of arrays/hashes. The code is a bit longer, but doesn't require you to understand references and should run faster for large datasets:
#!/usr/bin/perl
use strict;
use warnings;
my @tmp_cycledef = <DATA>;
my $last_key;
my @values;
for (@tmp_cycledef) {
chomp;
my ($key, $val) = split '\|';
# Seed $last_key with the first key value on the first pass
$last_key = $key unless defined $last_key;
# The key has changed, so it's time to print out the values associated
# with the previous key, then reset everything for the new one
if ($key != $last_key) {
print "$last_key " . join(',', #values) . "\n";
$last_key = $key;
@values = ();
}
# Add the current value to the list of values for this key
push @values, $val;
}
# Don't forget to print out the final key when you're done!
print "$last_key " . join(',', #values) . "\n";
__DATA__
10|1
10|2
10|3
11|2
11|4