averaging multiple columns in awk excluding null value [duplicate] - awk

This question already has answers here:
Calculate mean of each column ignoring missing data with awk
(2 answers)
Closed 6 years ago.
I need to average all the columns in this file from column 3 to the last, excluding row 1:
jd h 3 5 8 10 11 14 15
79 1 52.0 51.0 58.0 45.0 59.0 20.0 27
79 2 52.0 51.0 58.0 45.0 59.0 20.0 -999.0
79 3 52.0 51.0 58.0 45.0 59.0 20.0 -999.0
79 4 -999.0 51.0 58.0 45.0 59.0 20.0 -999.0
This script works fine:
cat myfile.txt | awk ' NR>1{for (i=3;i<=NF;i++){a[i]+=$i;}} END {for (i=3;i<=NF;i++) {print a[i]/(NR-1)}}' > myoutput.txt
The problem is that the columns contain null values (marked as "-999.0"), which I want to exclude from the average.
Any suggestion would be helpful.

awk 'NR > 1 { for (i = 3; i <= NF; i++) if ($i != -999.0) { sum[i] += $i; num[i]++; } }
END { for (i = 3; i <= NF; i++) print i, sum[i], num[i], sum[i]/num[i] }' \
myfile.txt > myoutput.txt
This counts only the valid field values, and counts the number of such rows for each column separately. The printing at the end identifies the field, the raw data (sum, number) and the average.
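If a column can consist entirely of -999.0 markers, the division in the END block would be a division by zero. A defensive variant of the same idea (just a sketch, printing NA for such a column) could look like this:
awk 'NR > 1 { for (i = 3; i <= NF; i++) if ($i != -999.0) { sum[i] += $i; num[i]++ } }
     END { for (i = 3; i <= NF; i++)
             if (num[i] > 0) print i, sum[i], num[i], sum[i]/num[i]   # average over valid values only
             else print i, 0, 0, "NA"                                 # column i had no valid values
     }' myfile.txt > myoutput.txt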

Related

How to calculate anomaly using awk

I have a file:
file.txt
1 32
2 34
3 32
4 43
5 25
6 34
7 65
8 34
9 23
10 44
I would like to find the anomaly of the second column:
My script below prints anomalies for the row-2 through row-10 values, but it does not consider the row-1 value.
awk 'FNR==NR{
f=1;
if($1 >= 1 && $1 <= 10){
count++;
SUM+=$2;
};
next
}
FNR==1 && f==1{
AVG=SUM/count;
next
}
($1 >= 1 && $1 <= 10){
print $1, $2-AVG
}
' file.txt file.txt
My desired output:
1 -4.6
2 -2.6
3 -4.6
4 6.4
5 -11.6
6 -2.6
7 28.4
8 -2.6
9 -13.6
10 7.4
I found a solution for it:
awk '{f=$1>=1 && $1<=10}f && NR==FNR{sum+=$2; c++; next}f{ print $1, $2-(sum/c) }' file.txt file.txt
I am still wondering why the first script is not giving the correct answer.
Since this is just a 2-column file, this can also be done in a single pass of awk:
awk '{map[$1] = $2; s += $2}
END {mean = s/NR; for (i in map) print i, map[i] - mean}' file
1 -4.6
2 -2.6
3 -4.6
4 6.4
5 -11.6
6 -2.6
7 28.4
8 -2.6
9 -13.6
10 7.4
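Note that the order in which for (i in map) visits the keys is not defined by awk, so on another awk implementation the lines above could come out in a different order. An order-preserving single-pass variant (a sketch that indexes by record number instead) would be:
awk '{id[NR] = $1; val[NR] = $2; s += $2}
     END {mean = s / NR; for (j = 1; j <= NR; j++) print id[j], val[j] - mean}' file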
The first script in the OP is not giving the correct value because you skip the first line in the second pass over your file. This is seen in the statement (FNR==1 && f==1) { AVG=SUM/count; next }. Due to the next statement, you skip the computation of the deviation from the mean for the first record.
This is an efficient computation of the deviation from the mean in a double pass:
awk '(NR==FNR){s+=$2;c++;next}
(FNR==1){s/=c}
{print $1,$2-s}' file file
If the file contains values bigger than 10 or smaller than 1 in the first column, but you only want to see the deviation for values in the range [1,10], then you can do:
awk '($1<1 || $1>10) {next}
(NR==FNR){s+=$2;c++;next}
(FNR==1){s/=c}
{print $1,$2-s}' file file
There are still other optimizations that can be done, but these only become beneficial when working with extremely large files (many millions of lines).

how to grep a N (7) column row before a grepped number from a long text file

I have a text file with a special format.
After the top "N" rows, the file has a 7-column row, and then there are "X" rows (X is the value in column number 6 of this 7-column row). Then there is another 7-column row with a further "Y" sub-rows (Y is the value in column number 6 of that row), and this pattern of rows repeats up to some fixed number, say 40.
An example is here
(I am skipping the top few rows).
2.857142857143E-01 2.857142857143E-01-2.857142857143E-01 1 1533 9 1.0
1 -3.52823873905418
2 -3.52823873905417
3 -1.77620635653680
4 -1.77620635653680
5 -1.77620570068355
6 -1.77620570068354
7 -1.77620570066112
8 -1.77620570066112
9 -1.60388273192418
1.428571428571E-01 1.428571428571E-01-1.428571428571E-01 2 1506 14 8.0
1 -3.52823678441811
2 -3.52823678441810
3 -1.77620282216865
4 -1.77620282216865
5 -1.77619365786042
6 -1.77619365786042
7 -1.77619324280126
8 -1.77619324280125
9 -1.60387130881086
10 -1.60387130881086
11 -1.60387074066972
12 -1.60387074066972
13 -1.51340357895078
14 -1.51340357895078
1.000000000000E+00 4.285714285714E-01-1.428571428571E-01 20 1524 51 24.0
1 -3.52823580096110
2 -3.52823580096109
3 -1.77624472106293
4 -1.77624472106293
5 -1.77623455229910
6 -1.77623455229909
7 -1.77620473017160
8 -1.77620473017159
9 -1.60387169115834
10 -1.60387169115834
11 -1.60386634866654
12 -1.60386634866654
13 -1.51340851656332
14 -1.51340851656332
15 -1.51340086887553
16 -1.51340086887553
17 -1.51321967923767
18 -1.51321967923766
19 -1.40212716813452
20 -1.40212716813451
21 -1.40187887062753
22 -1.40187887062753
23 -0.749391485667459
24 -0.749391485667455
25 -0.740712218931955
26 -0.740712218931954
27 -0.714030906779278
28 -0.714030906779278
29 -0.689087278411268
30 -0.689087278411265
31 -0.687054399753234
32 -0.687054399753233
33 -0.677686868127079
34 -0.677686868127075
35 -0.405343895324740
36 -0.405343895324739
37 -0.404786479693490
38 -0.404786479693488
39 -0.269454266134757
40 -0.269454266134755
41 -0.267490250650300
42 -0.267490250650296
43 -0.262198373307171
44 -0.262198373307170
45 -0.260912148881762
46 -0.260912148881761
47 -9.015623907768122E-002
48 -9.015623907767983E-002
49 0.150591609452852
50 0.150591609452856
51 0.201194203960446
I want to grep a particular number from my text file and to do so, I use
awk -v c=2 -v t=$GREP 'NR==1{d=$c-t;d=d<0?-d:d;v=$c;next}{m=$c-t;m=m<0?-m:m}m<d{d=m;v=$c}END{print v}' case.dat
Here $GREP is 0.2011942, which prints the last row (the value will change from file to file)
51 0.201194203960446
I also want to print the header row with this number, i.e., my script should print:
51 0.201194203960446
1.000000000000E+00 4.285714285714E-01-1.428571428571E-01 20 1524 51 24.0
How can I print the header row of the grepped number as well?
I have an idea, but I could not implement it as a script.
Simply: grep the number using my script, then print the first row before this number that has 7 columns.
This may be what you're looking for
awk -v t="$GREP" '
BEGIN { sub("\\.", "\\.", t) }
NF > 2 { header=$0; next }
NF == 2 && $2 ~ t { printf("%s %s\n%s\n", $1, $2, header) }
' file
You can replace the NF > 2 with NF == 7 if you want the strictly seven-column header to be printed (that header contains 6 columns in your sample data, not 7).
Update after the comment "Can you please modify my script so that it should grep upto 13 decimal number":
awk -v t="$GREP" '
BEGIN { if (match(t, "\\.")) {
t = substr(t, 1, RSTART + 13)
sub("\\.", "\\.", t)
}
}
NF > 2 { header=$0; next }
NF == 2 && $2 ~ t { printf("%s %s\n%s\n", $1, $2, header) }
' file
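For example, with the data saved as case.dat and the program above saved as grep_header.awk (a file name chosen only for this example), the call and its output should look like:
GREP=0.2011942
awk -v t="$GREP" -f grep_header.awk case.dat
51 0.201194203960446
1.000000000000E+00 4.285714285714E-01-1.428571428571E-01 20 1524 51 24.0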

extract specific lines based on another file

I have a folder containing text files. I need to extract specific lines from the files in this folder based on another file, input.txt. I tried the following code, but it doesn't work.
awk '
NR==FNR{
if(NF>1)f=$3
else A[f,$1]
next
}
(FILENAME,$3) in A {
print > ( todir "/" FILENAME )
}
' todir=/home/alan/Desktop/output FS=\* /home/alan/Desktop/input.txt FS=" " *
file1
PHE .0 10 .0 0
ASP 59.8 11 59.8 0
LEU 66.8 15 66.8 0
ARG 21.0 16 21.0 0
file2
SER 57.9 43 57.9 0
PHE 2.4 44 2.4 0
GLN 86.1 49 83.2 2.9
THR 33.2 50 33.2 0
SER 10.2 51 .9 9.3
input.txt
*file1
10
16
*file2
43
44
49
Desired output
file1
PHE 0 10 0 0
ARG 21.0 16 21.0 0
file2
SER 57.9 43 57.9 0
PHE 2.4 44 2.4 0
GLN 86.1 49 83.2 2.9
On line 3, $3 needs to be changed to $2.
Since the asterisk is the field separator in input.txt, the (empty, non-existent) string before it is counted as $1 and the file name that comes after it as $2.
awk '
NR==FNR{
if(NF>1)f=$2
else A[f,$1]
next
}
(FILENAME,$3) in A {
print > ( todir "/" FILENAME )
}
' todir=/home/alan/Desktop/output FS=\* /home/alan/Desktop/input.txt FS=" " *
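To see what the asterisk separator does to the marker lines of input.txt, here is a quick illustrative check (not part of the solution itself):
echo '*file1' | awk -F'*' '{print NF, "[" $1 "]", "[" $2 "]"}'
2 [] [file1]
With FS set to *, a marker line such as *file1 has an empty $1 and the file name in $2 (hence f=$2), while the plain number lines have NF equal to 1 and fall through to the else branch.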

awk: divide odd columns by following even column

I want to divide all the odd columns in a file by the next even column, e.g. column1/column2, column3/column4,......, columnN/columnN+1
test1.txt
1 4 1 2 1 3
1 2 4 2 3 9
desired output
0.25 0.5 0.333
0.5 2 0.333
I tried this:
awk 'BEGIN{OFS="\t"} { for (i=2; i<NF+2; i+=2) printf $(i-1)/i OFS; printf "\n"}'
but it doesn't work.
I would like to add that my actual files have a very large and variable (but always even) number of columns and I would like something that would work on all of them.
awk '{for(i=1;i<NF;i+=2)printf "%f%s",$i/$(i+1),OFS;print "";}' input.txt
Output:
0.250000 0.500000 0.333333
0.500000 2.000000 0.333333
You can adjust the printing format to your needs; see the awk printf documentation for more info.
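If you would rather have the output trimmed the way the desired output in the question shows (0.25, 0.5, 0.333), one possible variant (a sketch using %.3g, i.e. three significant digits) is:
awk '{for (i = 1; i < NF; i += 2) printf "%s%.3g", (i > 1 ? OFS : ""), $i / $(i + 1); print ""}' test1.txt
0.25 0.5 0.333
0.5 2 0.333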

Selecting elements of two column whose difference is less than some given value using awk

While doing post processing for a numerical analysis, I have the following problem of selection of data :
time_1 result_1 time_2 result_2
1 10 1.1 10.1
2 20 1.6 15.1
3 30 2.1 20.1
4 40 2.6 25.1
5 50 3.1 30.1
6 60 3.6 35.1
7 70 4.1 40.1
8 80 4.6 45.1
9 90 5.1 50.1
10 100 5.6 55.1
6.1 60.1
6.6 65.1
7.1 70.1
7.6 75.1
8.1 80.1
8.6 85.1
9.1 90.1
9.6 95.1
10.1 100.1
This file has 4 columns, the first column (time_1) represents the calculated instants of a program 1, the second column (result_1) is the results calculated for each instant.
The third column (time_2) represents the calculated instants of another program, and the fourth column (result_2) is the results calculated for each instant of this program 2.
Now I wish to select only the instants of the third column (time_2) that are very near the instants of the first column (time_1); the admitted difference is less than or equal to 0.1. For example:
for instant 1 of the time_1 column, I wish to select instant 1.1 of the time_2 column, because (1.1 - 1) = 0.1; I do not want to select the other instants of the time_2 column because (1.6 - 1) > 0.1, (2.1 - 1) > 0.1, and so on
for instant 2 of the time_1 column, I wish to select instant 2.1 of the time_2 column, because (2.1 - 2) = 0.1; I do not want to select the other instants of the time_2 column because (2.6 - 2) > 0.1, (3.1 - 2) > 0.1, and so on
At the end, I would like to obtain the following data :
time_1 result_1 time_2 result_2
1 10 1.1 10.1
2 20 2.1 20.1
3 30 3.1 30.1
4 40 4.1 40.1
5 50 5.1 50.1
6 60 6.1 60.1
7 70 7.1 70.1
8 80 8.1 80.1
9 90 9.1 90.1
10 100 10.1 100.1
I wish to use awk, but I am not familiar with it. I do not know how to fix an element of the first column and then compare it to all elements of the third column in order to select the right value from that third column. If I do it very simply like this, I can only print the first line:
{if (($3>=$1) && (($3-$1) <= 0.1)) {print $2, $4}}
Thank you in advance for your help!
You can try the following perl script:
#! /usr/bin/perl
use strict;
use warnings;
use autodie;
use File::Slurp qw(read_file);
my @lines=read_file("file");
shift @lines; # skip first line
my @a;
for (@lines) {
    my @fld=split;
    if (@fld == 4) {
        push (@a,{id=>$fld[0], val=>$fld[1]});
    }
}
for (@lines) {
    my @fld=split;
    my $id; my $val;
    if (@fld == 4) {
        $id=$fld[2]; $val=$fld[3];
    } elsif (@fld == 2) {
        $id=$fld[0]; $val=$fld[1];
    }
    my $ind=checkId(\@a,$id);
    if ($ind>=0) {
        $a[$ind]->{sel}=[] if (! exists($a[$ind]->{sel}));
        push(@{$a[$ind]->{sel}},{id=>$id,val=>$val});
    }
}
for my $item (@a) {
    if (exists $item->{sel}) {
        my $s= $item->{sel};
        for (@$s) {
            print $item->{id}."\t".$item->{val}."\t";
            print $_->{id}."\t".$_->{val}."\n";
        }
    }
}
sub checkId {
    my ($a,$id) = @_;
    my $dif=0.1+1e-10;
    for (my $i=0; $i<=$#$a; $i++) {
        return $i if (abs($a->[$i]->{id}-$id)<=$dif)
    }
    return -1;
}
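Assuming the table above (including its header line) is saved as file next to the script, and the script itself is saved as, say, select_near.pl (a name used only for this example), running it prints the matched pairs tab-separated and without the header:
perl select_near.pl
1	10	1.1	10.1
2	20	2.1	20.1
and so on, down to 10, 100, 10.1, 100.1.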
One thing to be aware of: due to the vagaries of floating point numbers, comparing a value to 0.1 is unlikely to give you the results you're looking for:
awk 'BEGIN {x=1; y=x+0.1; printf "%.20f", y-x}'
0.10000000000000008882
here, y=x+0.1, but y-x > 0.1
So, we will look at the diff as diff = 10*y - 10*x:
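A quick check of the scaled difference (on a typical IEEE-754 system the scaling happens to make this particular difference exact, although such details can vary by platform):
awk 'BEGIN {x=1; y=x+0.1; printf "%.20f\n", 10*y - 10*x}'
1.00000000000000000000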
Also, I'm going to process the file twice: once to grab all the time_1/result_1 values, the second time to extract the "matching" time_2/result_2 values.
awk '
NR==1 {print; next}
NR==FNR {if (NF==4) r1[$1]=$2; next}
FNR==1 {next}
{
if (NF == 4) {t2=$3; r2=$4} else {t2=$1; r2=$2}
for (t1 in r1) {
diff = 10*t1 - 10*t2;
if (-1 <= diff && diff <= 1) {
print t1, r1[t1], t2, r2
break
}
}
}
' ~/tmp/timings.txt ~/tmp/timings.txt | column -t
time_1 result_1 time_2 result_2
1 10 1.1 10.1
2 20 2.1 20.1
3 30 3.1 30.1
4 40 4.1 40.1
5 50 5.1 50.1
6 60 6.1 60.1
7 70 7.1 70.1
8 80 8.1 80.1
9 90 9.1 90.1
10 100 10.1 100.1