Grouping and finding max using awk

The data needs to be grouped into groups of 6 values, and then the max in each group needs to be found.
data:
0.0759313
0.0761037
0.0740772
0.0736791
0.0719802
0.0715406
0.0828038
0.0826728
0.0802384
0.0798476
0.0785342
0.0777939
0.0738756
0.0733486
0.0709046
0.0707067
0
0
I used this awk statement, but I am not getting any result:
awk '{for(x=i+1;(x<=(i+5))&&(x<=NF);x++){a[++y]=$x;if(x==(i+5)){c=asort(a);b[z++]=a[c];i=i+6;y=0}}}END{for(j in b) print b[j]}'

I would go for something like this:
awk 'NR % 6 == 1 || $0 > max { max = $0 } NR % 6 == 0 { print max }' file
max is set to the first value in each group of six, and updated whenever a value is greater than the current maximum. At the end of each group, the maximum is printed.
You may also want to include some additional logic to deal with printing the maximum of the last few numbers, in case the number of lines is not exactly divisible by 6:
END { if (NR % 6 != 0) print max }
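Putting the two pieces together, the whole thing could look like this (a sketch along the same lines; file stands in for your data file):
awk '
NR % 6 == 1 || $0 > max { max = $0 }   # first line of a group, or a new maximum
NR % 6 == 0             { print max }  # group complete: print its maximum
END { if (NR % 6 != 0) print max }     # partial last group, if any
' file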

Related

Bash: Finding average of entries from multiple columns after reading a CSV text file

I am trying to read a CSV text file and find the average of weekly hours (columns 3 through 7) spent by all user IDs (column 2) ending with an even number (2, 4, 6, ...).
The input sample is as below:
Computer ID,User ID,M,T,W,T,F
Computer1,User3,5,7,3,5,2
Computer2,User5,8,8,8,8,8
Computer3,User4,0,8,0,8,4
Computer4,User1,5,4,5,5,8
Computer5,User2,9,8,10,0,0
Computer6,User7,4,7,8,2,5
Computer7,User6,8,8,8,0,0
Computer8,User9,5,2,0,6,8
Computer9,User8,2,5,7,3,6
Computer10,User10,8,9,9,9,10
I have written the following script:
awk -F, '$2~/[24680]$/{for(i=3;i<=7;i++){a+=$i};printf "%s\t%.2g\n",$2,a/5;a=0}' user-list.txt > superuser.txt
The output of this script is:
User4 4
User2 5.4
User6 4.8
User8 4.6
User10 9
However, I want to change the script to only print one average for all user IDs ending with an even number.
The desired output for this would be as below (which is technically the average of all hours for the IDs ending with even numbers):
5.56
Any help would be appreciated.
TIA
Trying to fix the OP's attempt here, adding logic to get the average of averages once the whole file has been read. Written on mobile so I couldn't test it, but it should work if I have understood the OP's description correctly.
awk -F, '
$2~/[24680]$/{
  count++
  for(i=3;i<=7;i++){
    sum+=$i
  }
  tot+=sum/5
  sum=0
}
END{
  print "Average of averages is: " (count?tot/count:"NaN")
}
' user-list.txt > superuser.txt
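If the logic is right, then with the sample above tot works out to 27.8 over count = 5 even user IDs, so superuser.txt should contain Average of averages is: 5.56.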
You may try:
awk -F, '$2 ~ /[02468]$/ {
for(i=3; i<=7; i++) {
s += $i
++n
}
}
END {
if (n)
printf "%.2f\n", s/n
}' cust.csv
5.56
awk -F, 'NR == 1 { next } { match($2,/[[:digit:]]+/); num=substr($2,RSTART,RLENGTH); if (num%2==0) { av+=($3+$4+$5+$6+$7)/5; cnt++ } } END { printf "%.2f\n", av/cnt }' user-list.txt
Ignore the first header line. Pick the number out of the user ID with awk's match function, and set the num variable to this number. Check whether the number is even with num%2. If it is, add the row's average to av and count the user. At the end, print the average of those averages to 2 decimal places.
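With the sample input this should print 5.56.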
Print the daily average for all even-numbered user IDs:
#!/bin/sh
awk -F , '
(NR>1) &&
($2 ~ /[02468]$/) {
hours += ($3 + $4 + $5 + $6 + $7)
(users++)
}
END {
print (hours/users/5)
}' \
"$1"
Usage example:
$ script user-list
5.56
One way to get the evenness or oddness of an integer is to use the modulus operator (%), as in N % 2. For even values of N, this expression evaluates to zero, and for odd values, it evaluates to 1.
However, in this case a string operation would be required to extract the number anyway, so we may as well just use a single string match to test for odd or even.
Also, IMO, for 5 fields which are not going to change (days of the week), it's more succinct to just add them directly instead of using a loop. (NR>1) skips the title line too, in case there's a conflict.
Finally, you can of course swap /[02468]$/ for /[13579]$/ to get the same data for odd-numbered users.
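For completeness, here is a rough, untested sketch of the modulus approach described above, stripping the non-digit prefix from the user ID and then testing num % 2:
awk -F, 'NR > 1 {
    num = $2
    sub(/^[^0-9]*/, "", num)          # keep only the digits of the user ID
    if (num % 2 == 0) {
        hours += $3 + $4 + $5 + $6 + $7
        users++
    }
}
END { if (users) printf "%.2f\n", hours / users / 5 }' user-list.txt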

Calculating the average length of a column

I have a task where I have to compute the average length of the words in a column with awk.
awk -F'\t' '{print length ($8) } END { print "Average = ",sum/NR}' file
In the output I get the length of the column on each line, but it does not compute the average length; the output just says Average = 0, which cannot be the case because the lines printed before it have numbers.
For better understanding I will copy-paste the last lines of the output here:
4
4
3
4
4
2
5
7
6
5
Average = 0
How do I need to change my code to get the average length over the whole column as output?
Thank you very much for your time and help :)
Because you're not adding lengths of columns to sum. Do it like this instead:
awk -F'\t' '{
print length($8)
sum += length($8)
}
END {
print "Average =", sum/NR
}' file
Initialise a sum variable in a BEGIN section and accumulate the length of a column at each iteration.
I don't have your original file so I did a similar exercise for the 1st column of my /etc/passwd file:
awk -F':' 'BEGIN{sum=0} {sum += length($1); print length($1)} END{print "Average = " sum/NR}' /etc/passwd

awk count unique occurrences and print other columns

I have the following piece of code:
awk '{h[$1]++}; END { for(k in h) print k, h[k]}' ${infile} >> ${outfile2}
Which does part of what I want: printing out the unique values and then also counting how many times these unique values have occurred. Now, I want to print out the 2nd and 3rd column as well from each unique value. For some reason the following does not seem to work:
awk '{h[$1]++}; END { for(k in h) print k, $2, $3, h[k]}' ${infile} >> ${outfile2}
awk '{h[$1]++}; END { for(k in h) print k, h[$2], h[$3], h[k]}' ${infile} >> ${outfile2}
The first prints the 2nd and 3rd columns of the last record read, whereas the second prints nothing except k and h[k].
${infile} would look like:
20600 33.8318 -111.9286 -1 0.00 0
20600 33.8318 -111.9286 -1 0.00 0
30900 33.3979 -111.8140 -1 0.00 0
29400 33.9455 -113.5430 -1 0.00 0
30600 33.4461 -111.7876 -1 0.00 0
20600 33.8318 -111.9286 -1 0.00 0
30900 33.3979 -111.8140 -1 0.00 0
30600 33.4461 -111.7876 -1 0.00 0
The desired output would be:
20600, 33.8318, -111.9286, 3
30900, 33.3979, -111.8140, 2
29400, 33.9455, -113.5430, 1
30600, 33.4461, -111.7876, 2
You were close, and you can do it all in awk. But if you are going to store the count based on field 1 and also have fields 2 and 3 available in END to output, you need to store fields 2 and 3 in arrays indexed by field 1 (or whatever field you are keeping count of) as well. For example you could do:
awk -v OFS=', ' '
{ h[$1]++; i[$1]=$2; j[$1]=$3 }
END {
for (a in h)
print a, i[a], j[a], h[a]
}
' infile
Here h[$1] holds the count of the number of times each field 1 value is seen, indexing the array with field 1. i[$1]=$2 captures field 2 indexed by field 1, and j[$1]=$3 captures field 3 indexed by field 1.
Then within END all that is needed is to output field 1 (a the index of h), i[a] (field 2), j[a] (field 3), and finally h[a] the count of the number of times field 1 was seen.
Example Use/Output
Using your example data, you can just copy/middle-mouse-paste the code at the terminal with the correct filename, e.g.
$ awk -v OFS=', ' '
> { h[$1]++; i[$1]=$2; j[$1]=$3 }
> END {
> for (a in h)
> print a, i[a], j[a], h[a]
> }
> ' infile
20600, 33.8318, -111.9286, 3
29400, 33.9455, -113.5430, 1
30600, 33.4461, -111.7876, 2
30900, 33.3979, -111.8140, 2
Which provides the output desired, though not necessarily in that order. If you want fields 1, 2 & 3 grouped together as a single key, you can use string concatenation to make fields 1, 2 & 3 the index of the array and then output the index and count, e.g.
$ awk '{a[$1", "$2", "$3]++}END{for(i in a) print i ", " a[i]}' infile
20600, 33.8318, -111.9286, 3
30600, 33.4461, -111.7876, 2
29400, 33.9455, -113.5430, 1
30900, 33.3979, -111.8140, 2
Look things over and let me know if you have further questions.
GNU datamash is a very handy tool for working on groups of columnar data in files, and it makes this trivial to do.
Assuming your file uses tabs to separate columns, as it appears to:
$ datamash -s --output-delimiter=, -g 1,2,3 count 3 < input.tsv
20600,33.8318,-111.9286,3
29400,33.9455,-113.5430,1
30600,33.4461,-111.7876,2
30900,33.3979,-111.8140,2
Though it's not much more complicated in awk, using a multidimensional array:
$ awk 'BEGIN { OFS=SUBSEP="," }
{ group[$1,$2,$3]++ }
END { for (g in group) print g, group[g] }' input.tsv
29400,33.9455,-113.5430,1
30600,33.4461,-111.7876,2
20600,33.8318,-111.9286,3
30900,33.3979,-111.8140,2
If you want sorted output instead of random order for this one: with GNU awk, add PROCINFO["sorted_in"] = "#ind_str_asc" in the BEGIN block; otherwise, pipe the output through sort.
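With GNU awk, that sorted variant might look like this (an untested sketch):
awk 'BEGIN { OFS = SUBSEP = ","; PROCINFO["sorted_in"] = "#ind_str_asc" }
     { group[$1,$2,$3]++ }
     END { for (g in group) print g, group[g] }' input.tsv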
You can also get the same effect by pipelining a bunch of utilities (including awk and uniq):
$ sort -k1,3n input.tsv | cut -f1-3 | uniq -c | awk -v OFS=, '{ print $2, $3, $4, $1 }'
20600,33.8318,-111.9286,3
29400,33.9455,-113.5430,1
30600,33.4461,-111.7876,2
30900,33.3979,-111.8140,2

Calculate sequential average and median from file using awk

This is my input file (there are thousands of rows):
$ cat file.txt
1 495.03
2 503.76
3 512.28
4 520.75
5 529.17
I'd like to use awk to calculate the median of the first column over X rows at a time (let's say rows 1-100) and the average of the corresponding values of the second column. awk would then move to the next set of rows (101-200) and do the same, i.e. the median of the first column and the average of the second column, and so on. Needless to say, I'm trying to learn awk and have tried several previous solutions, but couldn't quite make it work.
From a previous post, I found that I can calculate the average this way:
awk '{sum+=$1} NR%3==0 {print sum/3; sum=0}'
How does this work exactly (i.e. what does the {sum+=$1} expression mean?) and how can I adapt it for the median? Btw, the first column will always be sorted.
Thanks in advance,
TP
If the records are sorted, the median will be just the average of the 50th and 51st values.
$ awk '{r=NR%100; sum+=$2}
r==50 {m=$1}
r==51 {m=(m+$1)/2}
r==0 {print m, sum/100; sum=0}' file
This will work if the number of records is a multiple of 100; otherwise you need to handle the last group, which will have a different size.
There are other definitions of "median" for an even number of records, but that's something you should specify.
Explanation: define r to be the remainder mod 100, in essence the relative position within each block of 100 records. For the median we take the average of the 50th and 51st records; sum aggregates the second field's values over each block. When the remainder is 0 we have completed a block, so we print the median and average (sum/100) values and reset sum for the next block.
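A minimal, untested sketch of that last-group handling: keep the current block's first-column values around and finish up in END, using the same even/odd convention as above:
awk '{ r = NR%100; sum += $2; v[r ? r : 100] = $1 }
     r == 50 { m = $1 }
     r == 51 { m = (m+$1)/2 }
     r == 0  { print m, sum/100; sum = 0 }
     END {
         n = NR%100                    # size of the trailing partial group
         if (n) { m = (n%2 ? v[(n+1)/2] : (v[n/2]+v[n/2+1])/2); print m, sum/n }
     }' file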
Note: this contains a bit more information regarding running means and medians for unsorted data. It should be seen as an addendum to the original question.
If you want to compute the running average over the last n terms (assume n = 100), then you have to take care of how you handle the first m records with m < n. One way to handle this is to place the values in an array whose index is NR modulo n. This way you always have the last n terms in your array:
running average of $i:
awk '{a[NR%100] = $i; s=0; for(j in a) { s+=a[j] }; print "avg:" s/length(a) }'
You can, however, remove the for-loop by keeping track of s:
awk '{s+=$i; if (NR%100 in a) s-=a[NR%100]; a[NR%100]=$i; print "avg:" s/length(a) }'
running median of $i:
The median can be computed with gawk by requesting array traversal sorted by value:
awk 'BEGIN{ PROCINFO["sorted_in"]="#val_num_asc" }
{ a[NR%100] = $i }
{ k=0; m=0;
  for(j in a) { k++
    if (k >= length(a)/2) m+=a[j]   # middle one (odd) or two (even) values
    if (k >  length(a)/2) break
  }
  print "med:", (length(a)%2==0 ? m/2 : m)
}'
or, written with continue/break guards instead:
awk 'BEGIN{ PROCINFO["sorted_in"]="#val_num_asc" }
{ a[NR%100] = $i }
{ k=0; m=0;
for(j in a) { k++
if (k < length(a)/2 ) continue
if (k > length(a)/2+1) break
m+=a[j]
}
print "med:", (length(a)%2==0 ? m/2 : m)
}'
If you don't want to rely on the sorted-traversal concept, the computation of the median becomes much more difficult. A possible way would be making use of a selection algorithm, as explained here.
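Alternatively, gawk's asort() can stand in for the sorted traversal, at the cost of sorting a copy of the window on every record. An untested sketch (passing -v i=1 so that $i selects the first field, matching the $i placeholder above):
gawk -v i=1 '{ a[NR%100] = $i
    n = asort(a, b)                   # b[1..n] is a sorted copy of the window
    print "med:", (n%2 ? b[(n+1)/2] : (b[n/2] + b[n/2+1])/2)
}' file
As a quick sanity check, feeding it seq 1 9 should end with med: 5.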

Average of a given number of rows

I want to get the average of a certain number of rows, where in this case that number is dictated by the second column:
-1 1 22.776109913596883 0.19607208141710716
-1 1 4.2985901827923954 1.0388892840309705
-1 1 4.642271812306717 0.96197712195674756
-1 2 2.8032298255711794 1.5930763994471333
-1 2 2.9358628368936479 1.5211062387604053
-1 2 4.9987168801017106 0.8933811184867273
1 4 2.6211673161014915 1.7037291934441456
1 4 4.483831056393683 0.99596956735821618
1 4 9.7189442154485732 0.4594901646050486
The expected output would be
-1 1 0.732313
-1 2 1.33585
1 4 1.05306
I have done
awk '{sum+=$4} (NR%3)==0 {print $1,$2,sum/3;sum=0;}' test
which works, but I would like to (somehow) generalize (NR%3)==0 in such a way that awk realizes that the value of the second column has changed, and that this means a new average needs to be calculated. For instance, the first three rows have 1 as the value in the second column, so once 1 changes to 2, awk should know that a new average needs to be calculated.
Does this make sense?
Try something like:
awk '{sum[$2] += $4; count[$2] += 1; }
END { for (k in sum) { print k " " sum[k]/count[k]; } }'
Not tested but that is the idea...
With this method, the whole computation is printed at the end; that may not be what you want if the input is some infinite stream, but according to your example I think it should be fine.
If you want to keep the first column as well, you can do it with the same system, as sketched below.
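For example (untested, in the same spirit), remembering the first column seen for each group:
awk '{ sum[$2] += $4; count[$2]++; first[$2] = $1 }
     END { for (k in sum) print first[k], k, sum[k]/count[k] }' test
The for (k in sum) traversal order is unspecified, so pipe through sort -n if you need the rows in order.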
You can also try this:
awk '{sum[$1" "$2]+=$4; cnt[$1" "$2]++} END { for (i in sum) print i" "sum[i]/cnt[i]}' test | sort -n
Test:
$ awk '{sum[$1" "$2]+=$4; cnt[$1" "$2]++} END { for (i in sum) print i" "sum[i]/cnt[i]}' test | sort -n
-1 1 0.732313
-1 2 1.33585
1 4 1.05306