normalize column data with average value of that column with awk

I have 3 columns in a data file that looks like the sample below and continues up to 250 rows:
0.9967 0.7765 0.5798
0.9955 0.7742 0.5767
0.9942 0.7769 0.5734
I want to normalise each column based on the average value of that column.
I am using the code below (e.g. for column 1) but it does not print my desired output.
The results should all be very close to 1.
awk 'NR==FNR{sum+= $1; next}{avg=(NR/sum)}FNR>1{print($1/avg)}' f.dat f.dat
Expected output for the first column:
1.003
1.001
0.9988

You need separate placeholders for the sum and the count of each column. I recommend using arrays indexed by column number:
awk '
# first pass: accumulate the sum and the row count for every column
NR==FNR {
    for (col=1; col<=NF; col++) {
        avg[col] += $col
        len[col] += 1
    }
    next
}
# second pass: print each value divided by its column average
{
    for (col=1; col<=NF; col++) {
        colAvg = avg[col]/len[col]
        printf "%.3f%s", $col/colAvg, (col<NF ? FS : ORS)
    }
}
' file file
This rewrites the entire table with normalized values. If you want more (or less) precision in the output, change the %.3f format to however many digits you prefer.
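On the three sample rows alone (the averages over your full 250-row file will of course differ slightly), the snippet prints:
1.001 1.001 1.005
1.000 0.998 1.000
0.999 1.001 0.994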


Awk get unique elements from array

file.txt:
INTS11:P446P&INTS11:P449P&INTS11:P518P&INTS11:P547P&INTS11:P553P
PLCH2:A1007int&PLCH1:D987int&PLCH2:P977L
I am attempting to create a hyperlink by transforming the content of a file. The hyperlink will have the following style:
somelink&gene=<gene>[&gene=<gene>]&mutation=<gene:key>[&mutation=<gene:key>]
where, for example, INTS11:P446P corresponds to gene:key.
The problem is that I am looping over each row to build an array that contains the genes as values, so multiple duplicate entries can appear for the same gene.
My attempt is the following:
Split on & and store in a
For each element in a, split on : and add a[i] to array b
The problem is that I don't know how to get unique values from my array. I found this question but it talks about files and not arrays like in my case.
The code:
awk '@include "join"
{
    split($0,a,"&")
    for ( i = 1; i <= length(a); i++ ) {
        split(a[i], b, ":");
        genes[i] = "&gene="b[1];
        keys[i] = "&mutation="b[1]":"b[2]
    }
    print "somelink"join(genes, 1, length(genes),SUBSEP)join(keys, 1, length(keys),SUBSEP)
    delete genes
    delete keys
}' file.txt
will output:
somelink&gene=INTS11&gene=INTS11&gene=INTS11&gene=INTS11&gene=INTS11&mutation=INTS11:P446P&mutation=INTS11:P449P&mutation=INTS11:P518P&mutation=INTS11:P547P&mutation=INTS11:P553P
somelink&gene=PLCH2&gene=PLCH1&gene=PLCH2&mutation=PLCH2:A1007int&mutation=PLCH1:D987int &mutation=PLCH2:P977L
I wish to obtain something like this (notice how many &gene= entries there are):
somelink&gene=INTS11&mutation=INTS11:P446P&INTS11:P449P&INTS11:P518P&INTS11:P547P&INTS11:P553P
somelink&gene=PLCH2&gene=PLCH1&mutation=PLCH2:A1007int&mutation=PLCH1:D987int&mutation=PLCH2:P977L
EDIT:
my problem was partly solved thanks to Pierre Francois's answer, which pointed me to SUBSEP. My remaining issue is that I want only the unique elements of my arrays genes and keys.
Thank you.
Supposing you want to remove the spaces between the fields concatenated by awk's join function, the 4th argument you have to pass to join is the magic value SUBSEP, not an empty string "" as you did. Try:
awk '@include "join"
{
    split($0,a,"&")
    for ( i = 1; i <= length(a); i++ ) {
        split(a[i], b, ":");
        genes[i] = "&gene="b[1];
        keys[i] = "&mutation="b[1]":"b[2]
    }
    print "somelink"join(genes, 1, length(genes),SUBSEP)join(keys, 1, length(keys),SUBSEP)
    delete genes
    delete keys
}' file.txt
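To also get the unique elements asked about in the EDIT, one common idiom is a "seen" lookup array that records which genes and mutations have already been emitted. A rough sketch (assuming gawk, and that the join library from the gawk manual is reachable via AWKPATH; the seen_gene/seen_key names are just illustrative):
awk '@include "join"
{
    delete seen_gene; delete seen_key; delete genes; delete keys
    ng = nk = 0
    n = split($0, a, "&")
    for (i = 1; i <= n; i++) {
        split(a[i], b, ":")
        # record each gene and each mutation only the first time it is seen
        if (!(b[1] in seen_gene)) { seen_gene[b[1]] = 1; genes[++ng] = "&gene=" b[1] }
        if (!(a[i] in seen_key))  { seen_key[a[i]] = 1;  keys[++nk] = "&mutation=" b[1] ":" b[2] }
    }
    print "somelink" join(genes, 1, ng, SUBSEP) join(keys, 1, nk, SUBSEP)
}' file.txt
For the second sample line this keeps both &gene=PLCH2 and &gene=PLCH1 once each; for the first line it keeps a single &gene=INTS11 followed by all five &mutation= entries.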

Reading fields in previous lines for moving average

Main Question
What is the correct syntax for recursively calling AWK inside of another AWK program, and then saving the output to a (numeric) variable?
I want to call AWK using 2/3 variables:
N -> Can be read from Bash or from container AWK script.
Linenum -> Read from container AWK program
J -> Field that I would like to read
This is my attempt.
Container AWK program:
BEGIN {}
{
...
# Loop in j
...
k=NR
# Call to other instance of AWK
var=(awk -f -v n="$n_steps" linenum=k input-file 'linenum-n {printf "%5.4E", $j}'
...
}
END{}
Background for more general questions:
I have a file for which I would like to calculate a moving average of n (for example 2280) steps.
Ideally, for the first n rows the average is of the values 1 to k,
where k <= n.
For rows k > n the average would be of the last n values.
I will eventually execute the code in many large files, with several columns, and thousands to millions of rows, so I'm interested in streamlining the code as much as possible.
Code Excerpt and Description
The code I'm trying to develop looks something like this:
NR>1
{
# Loop over fields
for (j in columns)
{
# Rows before full moving average is done
if ( $1 <= n )
{
cumsum[j]=cumsum[j]+$j #Cumulative sum
$j=cumsum[j]/$1 # Average
}
#moving average
if ( $1 > n )
{
k=NR
last[j]=(awk -f -v n="$n_steps" ln=k input-file 'ln-n {printf "%5.4E", $j}') # Obtain value that will get subtracted from moving average
cumsum[j]=cumsum[j]+$j-last[j] # Cumulative sum adds the new step and removes the unwanted value
$j=cumsum[j]/n # Moving average
}
}
}
My input file contains several columns. The first column contains the row number, and the other columns contain values.
For the cumulative sum of the moving average: If I am in row k, I want to add it to the cumulative sum, but also start subtracting the first value that I don't need (k-n).
I don't want to have to create an array of cumulative sums for the last steps, because I feel it could impact performance. I prefer to directly select the values that I want to subtract.
For that I need to call AWK once again (but on a different line). I attempt to do it in this line:
k=NR
last[j]=(awk -f -v n="$n_steps" ln=k input-file 'ln-n {printf "%5.4E", $j}'
I am sure that this code cannot be correct.
Discussion Questions
What is the best way to obtain information about a field in a line previous to the one AWK is working on? Can it then be saved into a variable?
Is this recursive use of AWK allowed or even recommended?
If not, what could be the most efficient way to update the cumulative sum values so that I get an efficient enough code?
Sample input and Output
Here is a sample of the input (second column) and the desired output (third column). I'm using 3 as the number of averaging steps (n):
N VAL AVG_VAL
1 1 1
2 2 1.5
3 3 2
4 4 3
5 5 4
6 6 5
7 7 6
8 8 7
9 9 8
10 10 9
11 11 10
12 12 11
13 13 12
14 14 13
15 15 14
If you want to do a running average of a single column, you can do it this way:
BEGIN{n=2280; c=7}
{ s += $c - a[NR%n]; a[NR%n] = $c }
{ print $0, s/(NR < n ? NR : n) }
Here we store the last n values in an array a and keep track of the cumulative sum s. Every time we update the sum we first remove the oldest value from it.
If you want to do this for a couple of columns, you have to be a bit handy with keeping track of your arrays:
BEGIN{n=2280; c[0]=7; c[1]=8; c[2]=9}
{ for(i in c) { s[i] += $(c[i]) - a[n*i + NR%n]; a[n*i + NR%n] = $(c[i]) } }
{ printf "%s", $0
  for(i=0;i<length(c);++i) printf "%s%s", OFS, s[i]/(NR < n ? NR : n)
  printf "%s", ORS
}
However, you mentioned that you have to add millions of entries. That is where it becomes a bit more tricky. Summing a lot of values will introduce numeric errors as you lose precision bit by bit (when you add floats). So in this case, I would suggest implementing Kahan summation.
For a single column you get:
BEGIN{n=2280; c=7}
{ y = $c - a[NR%n] - k; t = s + y; k = (t - s) - y; s = t; a[NR%n] = $c }
{ print $0, s/(NR < n ? NR : n) }
or a bit more expanded as:
BEGIN{n=2280; c=7}
{ y = $c - k; t = s + y; k = (t - s) - y; s = t; }
{ y = -a[NR%n] - k; t = s + y; k = (t - s) - y; s = t; }
{ a[NR%n] = $c }
{ print $0, s/(NR < n ? NR : n) }
For a multi-column problem, it is now straightforward to adjust the above script. All you need to know is that y and t are temporary values and that k is the compensation term, which needs to be stored in memory per column.
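For instance, a minimal sketch of the multi-column adaptation (same column layout as above; the sum s and the compensation k become arrays indexed by column):
BEGIN{n=2280; c[0]=7; c[1]=8; c[2]=9}
{ for(i in c) {
    y = $(c[i]) - a[n*i + NR%n] - k[i]   # new value minus expiring value, minus compensation
    t = s[i] + y
    k[i] = (t - s[i]) - y                # recover the low-order bits lost in the addition
    s[i] = t
    a[n*i + NR%n] = $(c[i])
  }
}
{ printf "%s", $0
  for(i=0;i<length(c);++i) printf "%s%s", OFS, s[i]/(NR < n ? NR : n)
  printf "%s", ORS
}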

Trying to delete first string where pattern is found and leave second string intact

I have a file which contains multiple rows of data; some are duplicates, with a date field at the end of each record. I want to be able to scan the file and keep the most current record. Here's what the data looks like:
00xbdf0c9fd6;joe@easy.us.com;20141231 <- remove this one
00vbdf0c9fd6;joe@easy.us.com;20150403 <- keep this one (newer date)
00dndf0ca080;betty@easy.us.com;20141231 <- keep
00dbkf0ca292;jerry@easy.us.com;20141231 <- keep
0dbds0ca2f6;john@easy.us.com;20141231 <- remove
0dbds0ca2f6;john@easy.us.com;20150403 <- keep (newer date)
I tried various flavors and combinations of sed, awk, grep but I could not get it to work.
Why not sort the file on address and descending time stamp? Then all you need to do is keep the first record for each address (!h[$2]++ is true only the first time a given address is seen):
<infile sort -t\; -k2,2 -k3r | awk -F\; '!h[$2]++'
Output:
00dndf0ca080;betty@easy.us.com;20141231
00dbkf0ca292;jerry@easy.us.com;20141231
00vbdf0c9fd6;joe@easy.us.com;20150403
0dbds0ca2f6;john@easy.us.com;20150403
Try this:
{
    split($0,parts,/;/)
    # remember the newest date seen for each address
    # (plain string comparison works because the dates are YYYYMMDD)
    if (link[parts[2]] < parts[3]) {
        link[parts[2]] = parts[3]
    }
}
END {
    # print each address with its newest date
    for (l in link) {
        print l,link[l]
    }
}
produces:
jerry@easy.us.com 20141231
joe@easy.us.com 20150403
betty@easy.us.com 20141231
john@easy.us.com 20150403
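Note that this keeps only the address and the newest date. If you want to keep the entire original record (including the first field), a small variation on the same idea works; a sketch, assuming the date is always the third ;-separated field in YYYYMMDD form:
awk -F\; '$3 > date[$2] { date[$2] = $3; rec[$2] = $0 }
END { for (a in rec) print rec[a] }' infile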

awk - Rounding to 2 decimal places in subtotals

Short version:
Is there a way to tell awk to round to 2 decimal places during the consolidation, not during the printing?
Long version:
I have an incoming file in the format below. I need to compute the net balance per currency and, if the net is NOT zero, print the result in two columns: net balances below zero go to the neg_bal column and positive balances go to the pos_bal column. For some reason, the USD row is still being printed even though it nets to zero.
JPY||170
JPY||40
USD|-42.61|
USD|-166.27
USD||42.61|
GBP|-20|
EUR||18.7
USD||174.6|
USD|-8.33||
EUR|-30.6|
GBP||100
JPY|-210|
Here is the code I am using:
#!/bin/awk -f
BEGIN {
    FS="|";
}
{
    bal[$1]+=$2+$3
    ccy[$1]=$1
}
END {
    for (i in ccy)
    {
        if (bal[i] > 0)
        {
            pos_bal = bal[i]
            neg_bal = 0
        }
        else
        {
            neg_bal = bal[i]
            pos_bal = 0
        }
        if (bal[i] != 0)
        {
            printf "%s|%.2f|%.2f\n",ccy[i],neg_bal,pos_bal
        }
    }
}
Result (notice JPY is not displayed since it nets to zero):
awk]$ ./scr1 file1
EUR|-11.90|0.00
USD|0.00|0.00
GBP|0.00|80.00
If I increase the decimal places to, say, 20, I see that the USD net amount is not really zero. (Why is this, by the way? Even Excel gives a net of -1.59872E-14.)
awk]$ ./scr1 file1
EUR|-11.90000000000000213163|0.00000000000000000000
USD|0.00000000000000000000|0.00000000000001243450
GBP|0.00000000000000000000|80.00000000000000000000
Is there a way to tell awk to round to 2 decimal places during the
consolidation, not during the printing?
Yes: multiply by 100 and convert to int. Then divide by 100 when you're ready to print.
(In other words, count pennies instead of dollars.)
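A sketch of what that could look like applied to the script above (assuming the amounts never carry more than two decimal places):
#!/bin/awk -f
BEGIN { FS = "|" }
{
    # work in pennies: scale up and round to the nearest integer
    # (int() alone truncates, hence the +/-0.5 nudge)
    amt = ($2 + $3) * 100
    bal[$1] += int(amt < 0 ? amt - 0.5 : amt + 0.5)
}
END {
    for (ccy in bal) {
        if (bal[ccy] == 0) continue        # JPY and USD now net to exactly 0
        neg = (bal[ccy] < 0) ? bal[ccy] / 100 : 0
        pos = (bal[ccy] > 0) ? bal[ccy] / 100 : 0
        printf "%s|%.2f|%.2f\n", ccy, neg, pos
    }
}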

How can I do a SQL like group by in AWK? Can I calculate aggregates for different columns?

I would like to split up CSV files in Unix and run aggregates on some columns. I want to group by several columns, if possible, on each of the split-up files using awk.
Does anyone know some unix magic that can do this?
Here is a sample file:
customer_id,location,house_hold_type,employed,income
123,Florida,Head,true,100000
124,NJ,NoHead,false,0
125,Florida,NoHead,true,120000
126,Florida,Head,true,72000
127,NJ,Head,false,0
I want to get counts grouping on location, house_hold_type as well as AVG(income) for the same group by conditions.
How can I split a file and run awk with this?
This is the output I expect. The format could be different, but this is the overall data structure I am expecting; I will humbly accept other ways of presenting
the information:
location:[counts:['Florida':3, 'NJ':2], income_avgs:['Florida':97333, 'NJ':0]]
house_hold_type:[counts:['Head':3, 'NoHead':2], income_avgs:['Head':57333, 'NoHead':60000]]
Thank you in advance.
awk deals best with columns of data, so the input format is fine. The output format could be managed, but it will be much simpler to output it in columns as well:
# set the input and output field separators to comma
BEGIN {
    FS = ",";
    OFS = FS;
}
# skip the header row
NR == 1 {
    next;
}
# for all remaining rows, store counters and sums for each group
{
    count[$2,$3]++;
    sum[$2,$3] += $5;
}
# after all data, display the aggregates
END {
    print "location", "house_hold_type", "count", "avg_income";
    # for every key we encountered
    for(i in count) {
        # split the key back into "location" and "house_hold_type"
        split(i,a,SUBSEP);
        print a[1], a[2], count[i], sum[i] / count[i];
    }
}
Sample input:
customer_id,location,house_hold_type,employed,income
123,Florida,Head,true,100000
124,NJ,NoHead,false,0
125,Florida,NoHead,true,120000
126,Florida,Head,true,72000
127,NJ,Head,false,0
and output:
location,house_hold_type,count,avg_income
Florida,Head,2,86000
Florida,NoHead,1,120000
NJ,NoHead,1,0
NJ,Head,1,0
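Note that the expected output in the question aggregates each column on its own (location alone, house_hold_type alone) rather than by (location, house_hold_type) pair. A sketch of that variant, using the same input layout:
BEGIN { FS = OFS = "," }
# skip the header row
NR == 1 { next }
{
    loc_count[$2]++; loc_sum[$2] += $5
    hht_count[$3]++; hht_sum[$3] += $5
}
END {
    print "location", "count", "avg_income"
    for (l in loc_count) print l, loc_count[l], loc_sum[l] / loc_count[l]
    print "house_hold_type", "count", "avg_income"
    for (h in hht_count) print h, hht_count[h], hht_sum[h] / hht_count[h]
}
On the sample above this gives Florida 3 / 97333.3, NJ 2 / 0, Head 3 / 57333.3 and NoHead 2 / 60000, matching the counts and averages in the expected output (up to rounding).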