Calculate Average of Column Data with Headers - awk

I have data for example that looks like this:
Flats 2b
01/1991, 3.45
01/1992, 4.56
01/1993, 4.21
01/1994, 5.21
01/1995, 7.09
01/2013, 6.80
Eagle 2
01/1991, 4.22
01/1992, 6.32
01/1993, 5.21
01/1994, 8.09
01/1995, 7.92
01/2013, 6.33
I'm trying to calculate an average of column 2 so that my desired output looks like this preferably:
Flats 2b
Avg = 4.67
Eagle 2
Avg = 5.26
or even simpler that looks like this without the header:
Avg = 4.67
Avg = 5.26
and so on...since the input file is full of many headers with data like that shown above.
I have tried pattern-matching options and using NR with something like this as an awk one-liner, without success:
awk '/01/1991,/01/1993 {sum+=$2; cnt+=1} {print "Avg =" sum/cnt}' myfile.txt
I get averages but not my desired average for JUST the years 1991, 1992 and 1993 separately for each met tower.
Your help is much appreciated!

If you want to consider only the years 1991-1993
#! /usr/bin/awk -f
# new header: print average if one exists, then reset values
/[a-zA-Z]/ {
    if (cnt > 0) {
        print header;
        printf("Avg = %.2f\n", sum/cnt);
    }
    header=$0; sum=0; cnt=0;
}
# accumulate sum and count for the average
/^01\/199[123]/ { sum+=$2; cnt++; }
# print the last average
END {
    if (cnt > 0) {
        print header;
        printf("Avg = %.2f\n", sum/cnt);
    }
}
This awk script looks for a header, prints an average if there is one, and then resets all variables for the next calculation. If it finds a data row, it adds to the sum needed for the average. After the last line is read, the END block prints the final average.
The script considers only the years 1991 through 1993 inclusive. If you want to include more years, you can either duplicate the calculation line or combine multiple patterns with the or operator ||:
# calculate average
/^01\/199[0-9]/ || /^01\/200[0-9]/ { sum+=$2; cnt++; }
This takes all the 1990s and 2000s into account.
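Alternatively, here is a sketch that compares the year numerically instead of by pattern (assuming the MM/YYYY date format shown in the question):
# accumulate for a numeric year range
substr($1, 4, 4) + 0 >= 1991 && substr($1, 4, 4) + 0 <= 1993 { sum+=$2; cnt++; }
Adjust the two bounds to widen the range without editing a regular expression.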
If you don't want to print the headers, remove the two print header lines.
You call this awk script as
awk -f script.awk myfile.txt
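On the sample data above, averaging only the 1991-1993 values, the script should print:
Flats 2b
Avg = 4.07
Eagle 2
Avg = 5.25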

Related

Bash: Finding average of entries from multiple columns after reading a CSV text file

I am trying to read a CSV text file and find the average of weekly hours (columns 3 through 7) spent by all user IDs (column 2) ending with an even number (2, 4, 6, ...).
The input sample is as below:
Computer ID,User ID,M,T,W,T,F
Computer1,User3,5,7,3,5,2
Computer2,User5,8,8,8,8,8
Computer3,User4,0,8,0,8,4
Computer4,User1,5,4,5,5,8
Computer5,User2,9,8,10,0,0
Computer6,User7,4,7,8,2,5
Computer7,User6,8,8,8,0,0
Computer8,User9,5,2,0,6,8
Computer9,User8,2,5,7,3,6
Computer10,User10,8,9,9,9,10
I have written the following script:
awk -F, '$2~/[24680]$/{for(i=3;i<=7;i++){a+=$i};printf "%s\t%.2g\n",$2,a/5;a=0}' user-list.txt > superuser.txt
The output of this script is:
User4 4
User2 5.4
User6 4.8
User8 4.6
User10 9
However, I want to change the script to only print one average for all user IDs ending with an even number.
The desired output for this would be as below (which is technically the average of all hours for the IDs ending with even numbers):
5.56
Any help would be appreciated.
TIA
Trying to fix the OP's attempt here, adding logic to compute the average of averages once the whole file has been read. Written on mobile so I couldn't test it, but it should work if I understood the OP's description correctly.
awk -F, '
$2~/[24680]$/{
    count++
    for(i=3;i<=7;i++){
        sum+=$i
    }
    tot+=sum/5
    sum=0
}
END{
    print "Average of averages is: " (count?tot/count:"NaN")
}
' user-list.txt > superuser.txt
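On the sample input this should write the following to superuser.txt:
Average of averages is: 5.56
Since every user here has exactly five daily entries, the average of the per-user averages equals the overall average of all hours, matching the desired 5.56.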
You may try:
awk -F, '$2 ~ /[02468]$/ {
    for (i=3; i<=7; i++) {
        s += $i
        ++n
    }
}
END {
    if (n)
        printf "%.2f\n", s/n
}' cust.csv
5.56
awk -F, 'NR == 1 { next } { match($2,/[[:digit:]]+/); num=substr($2,RSTART,RLENGTH); if (num%2==0) { av+=($3+$4+$5+$6+$7)/5; cnt++ } } END { printf "%.2f\n", av/cnt }' user-list.txt
Ignore the first header line. Pick the number out of the user ID with awk's match function and set the num variable to it. Check whether the number is even with num%2. If it is, add that user's average to the variable av and count the user in cnt. At the end, print av divided by the number of even users, to 2 decimal places.
Print the daily average, for all even numbered user IDs:
#!/bin/sh
awk -F , '
(NR>1) &&
($2 ~ /[02468]$/) {
    hours += ($3 + $4 + $5 + $6 + $7)
    users++
}
END {
    print (hours/users/5)
}' \
"$1"
Usage example:
$ script user-list
5.56
One way to test the evenness or oddness of an integer is the modulus operator (%), as in N % 2. For even values of N, this expression evaluates to zero, and for odd values it evaluates to 1.
However, in this case a string operation would be required to extract the number anyway, so you may as well use a single string match to decide odd or even.
Also, IMO, for 5 fields which are not going to change (days of the week), it's more succinct to add them directly instead of looping. (NR>1) skips the titles line too, in case there's a conflict.
Finally, you can of course swap /[02468]$/ for /[13579]$/ to get the same data for odd-numbered users.
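For comparison, here is a minimal sketch of the modulus approach described above (assuming every user ID ends in its number):
awk -F, 'NR > 1 {
    n = $2
    sub(/^[^0-9]*/, "", n)   # strip the leading "User", keeping the number
    if (n % 2 == 0) {
        hours += $3 + $4 + $5 + $6 + $7
        users++
    }
}
END { if (users) printf "%.2f\n", hours / users / 5 }' user-list.txt
It prints 5.56 for the sample input, just like the string-match version.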

Calculating the average length of a column

I have a task, where I have to count the average length of each word in a column with awk.
awk -F'\t' '{print length ($8) } END { print "Average = ",sum/NR}' file
In the output I get the length of the column for each line, but it does not compute the average; the output just says Average = 0, which cannot be the case because the lines printed before it have numbers.
For better understanding I will copy-paste the last lines of the output here:
4
4
3
4
4
2
5
7
6
5
Average = 0
How do I need to change my code to get the average length of the whole column as output?
Thank you very much for your time and help :)
In the output I get the length of the column for each line, but it does not compute the average; the output just says Average = 0, which cannot be the case because the lines printed before it have numbers.
Because you're not adding lengths of columns to sum. Do it like this instead:
awk -F'\t' '{
    print length($8)
    sum += length($8)
}
END {
    print "Average =", sum/NR
}' file
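As a quick sanity check: if the file contained only the ten lines whose lengths are shown in the question (4 4 3 4 4 2 5 7 6 5), sum would end up as 44 and the script would print:
Average = 4.4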
Initialise a sum variable in a BEGIN section and accumulate the length of a column at each iteration.
I don't have your original file so I did a similar exercise for the 1st column of my /etc/passwd file:
awk -F':' 'BEGIN{sum=0} {sum += length($1); print length($1)} END{print "Average = " sum/NR}' /etc/passwd

Search and Print by Two Conditions using AWK

I have this file:
- - - Results from analysis of weight - - -
Akaike Information Criterion 307019.66 (assuming 2 parameters).
Bayesian Information Criterion 307036.93
Approximate stratum variance decomposition
Stratum Degrees-Freedom Variance Component Coefficients
id 39892.82 490.360 0.7 0.6 1.0
damid 0.00 0.00000 0.0 0.0 1.0
Residual Variance 1546.46 320.979 0.0 0.0 1.0
Model_Term Gamma Sigma Sigma/SE % C
id NRM_V 17633 0.18969 13.480 4.22 0 P
damid NRM_V 17633 0.07644 13.845 2.90 0 P
ide(damid) IDV_V 17633 0.00000 32.0979 1.00 0 S
Residual SCA_V 12459 1.0000 320.979 27.81 0 P
And I would like to print the value of Sigma for id; note there are two rows containing id in the file, so I used a condition based on NRM_V too.
I tried this code:
tac myfile | awk '(/id/ && /NRM_V/){print $5}'
but the results printed were:
13.480
13.845
and I need just the first one
Could you please try the following. I have added awk's exit statement here, which exits as soon as the first occurrence of the condition is found; it saves time too, since awk no longer reads the whole Input_file.
awk '(/id/ && /NRM_V/){print $5;exit}' Input_file
OR with columns:
awk '($1=="id" && $2=="NRM_V"){print $5;exit}' Input_file
In case you want to read file from last line towards first line and get its first value then try:
tac Input_file | awk '(/id/ && /NRM_V/){print $5;exit}'
OR with columns comparisons:
tac Input_file | awk '($1=="id" && $2=="NRM_V"){print $5;exit}'
The problem is that /id/ also matches damid. You could use the following to print the Sigma value only if the first field is id and the second field is NRM_V:
awk '$1=="id" && $2=="NRM_V"{ print $5 }' myfile

How to calculate quotient of sums of columns as one number?

I have data:
1 82 0.20971070
2 7200 13659.50038631
3 7443 15389.87972458
and I want to print the quotient of the sum of column 3 and the sum of column 2. How to do that?
I tried:
print((sum+=$3)/(sum+=$2))
and the result is 3 numbers - it computed row by row. The desired result is 1.972807...
EDIT
Please one more question, I have a code:
/Curve No./ { in_f_format=1; next }
/^[[:space:]]*$/ { in_f_format=0; next }
{sum2+=$2; sum3+=$3} END{printf("%.6f\n",sum3/sum2)}
How can I get a column of results for multiple files? I wrote
awk -f program.awk file??.txt
and I get only one result - for file01.txt
awk '{sum2+=$2; sum3+=$3} END{print sum3/sum2}' file
Output:
1.97281
or
awk '{sum2+=$2; sum3+=$3} END{printf("%.20f\n",sum3/sum2)}' file
Output:
1.97280745817249592022
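Regarding the EDIT: the END block runs only once, after awk has read all the files given on the command line, so awk -f program.awk file??.txt produces a single result for all files combined rather than one per file. A minimal sketch of a per-file variant, using the fact that FNR resets to 1 at each new file (leaving the question's /Curve No./ rules aside):
awk '
FNR==1 && NR>1 { printf("%.6f\n", sum3/sum2); sum2=sum3=0 }  # new file: flush previous totals
{ sum2+=$2; sum3+=$3 }
END { printf("%.6f\n", sum3/sum2) }
' file??.txt
GNU awk users could use an ENDFILE block instead.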

Copy lines by rows in awk

I have an input file that contains, per row, a value and two weights.
I would like to generate two output files - where the value in the first column is repeated once per line, according to the weights. This is probably best explained with a short example. If the input file is:
file.in:
35 2 0
37 2 3
38 0 4
Then I would like to generate two output files:
file.out1:
35
35
37
37
file.out2:
37
37
37
38
38
38
38
I will then use these output files to calculate the average and median of the first column according to the weights in the second and third columns.
This is pretty easy in awk.
awk '{for(i=0;i<$2;i++) print $1;}' file.in > file.out1
generates the first file, and
awk '{for(i=0;i<$3;i++) print $1;}' file.in > file.out2
generates the second
It is not clear from your question whether you know how to compute the mean and median from these files - it seems you just wanted to create these output files. Let me know if the rest is giving you trouble, or whether the above scripts are not clear (I think they are pretty self-explanatory).
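As a side note, both passes can be combined into a single awk invocation by writing to the two files directly (a sketch, using the file names from the question):
awk '{
    for (i=0; i<$2; i++) print $1 > "file.out1"
    for (i=0; i<$3; i++) print $1 > "file.out2"
}' file.in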
If I understood correctly, you need the average and median.
Average:
awk '{a+=$1}END{print a/NR}' file.in
36.6667
Median:
awk '{print $1}' file.in | sort -n | awk '{a[NR]=$1}END{ b=NR/2; b=b%1?int(b)+1:b; print a[b] }'
37
Explanation:
Putting it in simple terms: NR is a variable that holds the number of lines read, so for the average you want the sum of every line divided by the number of lines.
For the median you want your input sorted so you can pick the middle value, but it's not quite that simple for your input: dividing the number of lines, which is 3, by 2 gives 1.5, so you need a ceiling function, which awk doesn't have. That is what b=NR/2; b=b%1?int(b)+1:b; does: here b = 1.5, and 1.5 % 1 is 0.5, which is truthy, so b becomes int(1.5)+1 = 2, and a[2] is 37. Note that sort -n is used so the values sort numerically rather than lexically.
I hope this helps.