Calculating the average length of a column - awk

I have a task where I have to compute the average length of the words in a column with awk.
awk -F'\t' '{print length ($8) } END { print "Average = ",sum/NR}' file
In the output I get the length of the 8th column for each line, but it does not compute the average; the output just says Average = 0, which cannot be right because the lines printed before it contain nonzero numbers.
For better understanding, I will copy-paste the last lines of the output here:
4
4
3
4
4
2
5
7
6
5
Average = 0
How do I need to change my code to get the average length of the whole column as output?
Thank you very much for your time and help :)

You're never adding the lengths to sum, so it still has its default value of 0 when the END block runs. Do it like this instead:
awk -F'\t' '{
    print length($8)
    sum += length($8)
}
END {
    print "Average =", sum/NR
}' file
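If the file can contain lines where the 8th field is empty, dividing by NR will drag the average down. A variant that counts only non-empty values (a sketch; the count variable is my own addition):
awk -F'\t' 'length($8) > 0 {
    print length($8)
    sum += length($8)
    count++
}
END {
    if (count) print "Average =", sum/count
}' file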

Initialise a sum variable in a BEGIN section and accumulate the length of a column at each iteration.
I don't have your original file so I did a similar exercise for the 1st column of my /etc/passwd file:
awk -F':' 'BEGIN{sum=0} {sum += length($1); print length($1)} END{print "Average = " sum/NR}' /etc/passwd
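Strictly speaking, the BEGIN section is optional here: awk initialises an unset variable to 0 (or the empty string) on first use, so sum += length($1) works without it; the explicit initialisation just documents the intent.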

Related

comparing subtracting and printing using awk

I have a file which looks like:
4.97911047 1.00000000 631.000000 369.343907
4.98065923 0.999492004 632.000000 369.771568
5.70441060 0.480057937 974.000000 642.686561
5.70448527 0.479704863 975.000000 641.643578
8.23090986 0.310811710 2020.00000 331.182895
8.23096067 0.312290865 2021.00000 331.188128
14.8077297 0.357914635 3181.00000 449.390996
14.8077541 0.352977613 3180.00000 449.377675
I want to subtract the consecutive numbers in the first column. If the result is less than or equal to some threshold in absolute value, say 0.01, it should print the corresponding data range from the 3rd column.
For example, the difference between the 2nd and 1st entries of the 1st column is
4.98065923-4.97911047=0.00154876.
Thus it should print
"631-632"
Then the difference between the 3rd and 2nd entries of the 1st column is
5.70441060-4.98065923=0.723751
Thus it should not print anything. Again, the difference between the 4th and 3rd entries of the 1st column is
5.70448527-5.70441060=7.467e-05
Thus it should print
"974-975"...
in this way. The final output should be like:
631-632
974-975
2020-2021
3181-3180
Note: If the difference is less than 0.01 for 3 or more consecutive numbers of the 1st column, then it should also print the total range, like "631-635" say.
The best I could do until now is to use an awk command to print the consecutive differences:
awk 'NR > 1 { print $0 - prev } { prev = $0 }' < filename
Any help?
Regards
Sitangshu
I would use GNU AWK for this task in the following way. Let file.txt content be
4.97911047 1.00000000 631.000000 369.343907
4.98065923 0.999492004 632.000000 369.771568
5.70441060 0.480057937 974.000000 642.686561
5.70448527 0.479704863 975.000000 641.643578
8.23090986 0.310811710 2020.00000 331.182895
8.23096067 0.312290865 2021.00000 331.188128
14.8077297 0.357914635 3181.00000 449.390996
14.8077541 0.352977613 3180.00000 449.377675
then
awk 'NR>1 && ($1-prev[1])<=0.01 {printf "%d-%d\n", prev[3], $3} {split($0,prev)}' file.txt
gives output
631-632
974-975
2020-2021
3181-3180
Explanation: For each line I save under prev not the whole line (a string) but an array, using the split function, so I can access for example the 3rd field of the previous line as prev[3]. For each line after the 1st one where the difference between the 1st field of the current line and the 1st field of the previous line is less than or equal to 0.01, I printf the 3rd field of the previous line and the 3rd field of the current line, using %d to get the values formatted as integers.
(tested in gawk 4.2.1)
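The question also asks for a merged range when the difference stays within the threshold for 3 or more consecutive lines (like "631-635"). Building on the same prev idea, here is a sketch of one way to do it; the open, start and end variables are my own:
awk 'NR>1 && ($1-prev[1])<=0.01 {
    if (!open) start = prev[3]   # open a range at the previous line
    open = 1
    end = $3                     # keep extending it while the condition holds
}
NR>1 && ($1-prev[1])>0.01 && open {
    printf "%d-%d\n", start, end # condition broke: close and print the range
    open = 0
}
{ split($0, prev) }
END { if (open) printf "%d-%d\n", start, end }' file.txt
For the sample data no run is longer than two lines, so this prints the same four ranges as above.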

Bash one-liner for calculating the average of a specific row of numbers in bash

I just began learning bash.
Trying to figure out how to convert a two-liner into a one-liner using bash.
The first line of code:
searches the first column of input.txt for the word KEYWORD,
captures every number in this KEYWORD row from column 2 until the last column,
and dumps all these numbers into the values.txt file, placing each number on a new line.
The second line of code calculates the average value of all the numbers in the first column of values.txt, then prints out this value.
awk '{if($1=="KEYWORD") for(i=2;i<=NF;i++) print $i}' input.txt > values.txt
awk 'BEGIN{sum=0; counter=0}{sum+=$1; counter+=1}END{print sum/counter}' values.txt
How do I create a one liner from this?
Something like
awk '
BEGIN { count = sum = 0 }
$1 == "KEYWORD" {
    for (n = 2; n <= NF; n++) sum += $n
    count += NF - 1
}
END { print sum/count }' input.txt
Just keep track of the sum and total count of numbers in the first (and only) pass through the file and then average them at the end instead of printing a value for each matching line.
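A quick sanity check with made-up numbers (the file contents here are illustrative only):
$ printf 'KEYWORD 2 4 6\nother 9 9 9\n' > input.txt
$ awk 'BEGIN { count = sum = 0 }
  $1 == "KEYWORD" { for (n = 2; n <= NF; n++) sum += $n; count += NF - 1 }
  END { print sum/count }' input.txt
4
That is (2+4+6)/3 = 4, as expected.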
After reviewing this problem with several people and learning some new bash/awk shortcuts, the code below appears to be the shortest answer.
awk '/KEYWORD/{for(n=3;n<=NF;n++)sum+=$n;print sum/(NF-2)}' input.txt
This code searches the input file for the row containing "KEYWORD",
then sums up all the fields from the 3rd column to the last column,
and prints out the average value of all those numbers (i.e. the mean).
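One caveat: since the print happens inside the matching block and sum is never reset, this short version assumes KEYWORD occurs on exactly one row; with several matching rows it would print a running value for each of them.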

Bash: Finding average of entries from multiple columns after reading a CSV text file

I am trying to read a CSV text file and find the average of weekly hours (columns 3 through 7) spent by all user-ids (column 2) ending with an even number (2, 4, 6, ...).
The input sample is as below:
Computer ID,User ID,M,T,W,T,F
Computer1,User3,5,7,3,5,2
Computer2,User5,8,8,8,8,8
Computer3,User4,0,8,0,8,4
Computer4,User1,5,4,5,5,8
Computer5,User2,9,8,10,0,0
Computer6,User7,4,7,8,2,5
Computer7,User6,8,8,8,0,0
Computer8,User9,5,2,0,6,8
Computer9,User8,2,5,7,3,6
Computer10,User10,8,9,9,9,10
I have written the following script:
awk -F, '$2~/[24680]$/{for(i=3;i<=7;i++){a+=$i};printf "%s\t%.2g\n",$2,a/5;a=0}' user-list.txt > superuser.txt
The output of this script is:
User4 4
User2 5.4
User6 4.8
User8 4.6
User10 9
However, I want to change the script to only print one average for all user-Ids ending with an even number.
The desired output for this would be as below (which is technically the average of all hours for the IDs ending with even numbers):
5.56
Any help would be appreciated.
TIA
Trying to fix the OP's attempt here, adding logic to get the average of averages once the file has been read. Written on mobile so I couldn't test it, but it should work if I understood the OP's description correctly.
awk -F, '
$2~/[24680]$/{
  count++
  for(i=3;i<=7;i++){
    sum+=$i
  }
  tot+=sum/5
  sum=0
}
END{
  print "Average of averages is: " (count?tot/count:"NaN")
}
' user-list.txt > superuser.txt
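As an aside: because every user here has the same five weekday columns, the average of the per-user averages equals the overall mean of all the hours, so this prints 5.56 for the sample just like the direct approach below.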
You may try:
awk -F, '$2 ~ /[02468]$/ {
  for(i=3; i<=7; i++) {
    s += $i
    ++n
  }
}
END {
  if (n)
    printf "%.2f\n", s/n
}' cust.csv
5.56
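Here s accumulates every individual value and n counts them (25 values in the sample, since each of the 5 matching users contributes 5 columns), so this computes the overall mean of the hours directly: 139/25 = 5.56.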
awk -F, 'NR == 1 { next } { match($2,/[[:digit:]]+/);num=substr($2,RSTART,RLENGTH);if(num%2==0) { av+=($3+$4+$5+$6+$7)/5; n++ } } END { printf "%.2f\n",av/n }' user-list.txt
Ignore the first header line. Pick the number out of the userid with awk's match function. Set the num variable to this number. Check to see if the number is even with num%2. If it is, add that user's average to av and count the user in n. At the end, print av divided by the number of matching users, to 2 decimal places.
Print the daily average, for all even numbered user IDs:
#!/bin/sh
awk -F , '
(NR>1) &&
($2 ~ /[02468]$/) {
    hours += ($3 + $4 + $5 + $6 + $7)
    (users++)
}
END {
    print (hours/users/5)
}' \
"$1"
Usage example:
$ script user-list
5.56
One way to get the evenness or oddness of an integer is to use the modulus operator (%), as in N % 2. For even values of N, the expression evaluates to zero, and for odd values, it evaluates to 1.
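For example, awk 'BEGIN { print 4 % 2, 7 % 2 }' prints 0 1.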
However, in this case a string operation would be required to extract the number anyway, so we may as well just use a single string match to get odd or even.
Also, IMO, for 5 fields which are not going to change (the days of the week), it's more succinct to just add them directly instead of using a loop. (NR>1) skips the title line too, in case there's a conflict.
Finally, you can of course swap /[02468]$/ for /[13579]$/ to get the same data for odd numbered users.

awk NR wrong with the total number of lines returned

When awk's NR was used to get the total number of lines of a file, a wrong number was returned. Could you help find out what happened?
File 'test.txt' contents :
> 2012 09 10 30.0 8 14
fdafadf
> 2013 08 11 05.0 9 1.5
fdafa
> 2011 01 12 02.0 7 1.2
daff
The goal was to get the average of the last column of the records beginning with '>'.
Code:
awk 'BEGIN{SUM=0}/^> /{SUM=SUM+$6}END{print SUM/NR}' test.txt
With this code, the wrong mean of the last column was obtained: NR counts all 6 lines instead of the right number, 3 (the lines beginning with '>'). How can I get the right result with awk? Thanks
Could you please try the following. It takes the sum of the last column of every '>' line while the Input_file is being read, and it also counts the number of occurrences of '>' lines, because an average means the SUM divided by the count (here the count of matching lines); in the END block of awk we divide the two and get the average as needed.
awk 'BEGIN{sum=0;count=0}/^>/{sum+=$NF;count++} END{print "avg="sum/count}' Input_file
If you want to take the average of the 6th column, then use $6 in place of $NF in the above code.
Explanation: adding the following for explanation purposes only.
awk '        ##Starting awk command/script here.
/^>/{        ##Checking condition: if a line starts with > then do following.
  sum+=$NF   ##Creating a variable named sum and adding $NF (the last field of the current line) to it.
  count++    ##Creating a variable named count whose value is incremented by 1 each time the cursor comes here.
}
END{         ##END block of awk code here.
  print "avg="sum/count  ##Printing the string avg= followed by the result of dividing sum by count.
}
' Input_file ##Mentioning Input_file name here.
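Against the sample file above, this should print avg=5.56667, i.e. the sum 16.7 of the three '>' lines divided by their count of 3 (rather than by NR, which is 6).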

Grouping and finding max using awk

The data needs to be grouped, each group having 6 values, and then the max in each group needs to be found.
data:
0.0759313
0.0761037
0.0740772
0.0736791
0.0719802
0.0715406
0.0828038
0.0826728
0.0802384
0.0798476
0.0785342
0.0777939
0.0738756
0.0733486
0.0709046
0.0707067
0
0
I used this awk statement, but I am not getting any result.
awk '{for(x=i+1;(x<=(i+5))&&(x<=NF);x++){a[++y]=$x;if(x==(i+5)){c=asort(a);b[z++]=a[c];i=i+6;y=0}}}END{for(j in b) print b[j]}'
I would go for something like this:
awk 'NR % 6 == 1 || $0 > max { max = $0 } NR % 6 == 0 { print max }' file
Set max to the value on the first line of each group of six, or whenever the value is greater than the current maximum. At the end of each group, print the maximum.
You may also want to include some additional logic to deal with printing the maximum of the last few numbers, in case the number of lines is not exactly divisible by 6:
END { if (NR % 6 != 0) print max }
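Putting the two pieces together (a sketch):
awk 'NR % 6 == 1 || $0 > max { max = $0 }
     NR % 6 == 0 { print max }
     END { if (NR % 6 != 0) print max }' file
For the 18 sample values above this prints 0.0761037, 0.0828038 and 0.0738756, one maximum per group of six.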