Calculate sequential average and median from file using awk - awk

This is my input file (there are thousands of rows):
$ cat file.txt
1 495.03
2 503.76
3 512.28
4 520.75
5 529.17
I'd like to use awk to calculate the median of the first column over X (let's say 1-100) rows and the average of the corresponding values in the second column. awk would then move on to the next set of rows (101-200) and do the same, i.e. take the median of the first column and the average of the second column, and so on. Needless to say, I'm trying to learn awk and have tried several previous solutions but couldn't quite make them work.
From a previous post, I found that I can calculate the average this way:
awk '{sum+=$1} NR%3==0 {print sum/3; sum=0}'
How does this work exactly (i.e. what does the {sum+=$1} expression mean?) and how can I adapt it for the median? Btw, the first column will always be sorted.
Thanks in advance,
TP

If the records are sorted, the median of each 100-record block is just the average of the 50th and 51st values.
$ awk '{r=NR%100; sum+=$2}
       r==50 {m=$1}
       r==51 {m=(m+$1)/2}
       r==0  {print m, sum/100; sum=0}' file
This will work if the number of records is a multiple of 100; otherwise you need to handle the last group, which will have a different size (a sketch for that follows below).
There are other definitions of "median" for an even number of records, but that's something you should specify.
Explanation: to your sub-question first, {sum+=$1} adds the first field of every record to the running total sum, and the pattern NR%3==0 fires on every third record, printing the average and resetting the total. The script above uses the same idea: we define r as the remainder mod 100, in essence the relative position within each block of 100 records. For the median we take the average of the 50th and 51st records; sum aggregates the second-field values over each block. When the remainder is 0 we have completed a block, so we print the median and average (sum/100) and reset sum for the next block.
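Here is that sketch: a hedged, untested variant (my addition, not part of the original answer) that buffers the current block's first-column values so the END block can take a proper median of the leftover group:
awk '{ r = NR % 100; if (r == 0) r = 100
       v[r] = $1                     # buffer this block's first-column values (already sorted)
       sum += $2 }
     r == 100 {                      # block complete: print block median and average
       print (v[50] + v[51]) / 2, sum / 100
       sum = 0 }
     END { n = NR % 100              # size of the trailing partial block, if any
           if (n) {
             m = (n % 2) ? v[(n+1)/2] : (v[n/2] + v[n/2+1]) / 2
             print m, sum / n } }' file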

Note: this contains a bit more information with respect to running means and medians for unsorted data. It should be seen as an addendum to the original question.
If you want to compute the running average over the last n terms (assume n = 100), then you have to take care of how you handle the first m records with m < n. A way to handle this is to place the values in an array whose index is NR modulo n. This way you always have the last n terms in your array:
running average of $i:
awk '{a[NR%100] = $i; s=0; for(j in a) { s+=a[j] }; print "avg:" s/length(a) }'
You can, however, remove the for-loop by keeping track of s:
awk '{s+=$i; if (NR%100 in a) s-=a[NR%100]; a[NR%100]=$i; print "avg:" s/length(a) }'
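As a concrete instantiation of the above (my example, not from the original answer), here is the same idea with a window of 3 applied to the second column of the sample file from the question; note that length() on an array requires gawk or another extended awk:
awk '{ s += $2                      # add the newest value to the window sum
       if (NR%3 in a) s -= a[NR%3]  # drop the value that falls out of the window
       a[NR%3] = $2
       print "avg:", s/length(a) }' file.txt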
running median of $i:
The median can be computed with gawk by requesting that array traversal visit the elements sorted by value:
awk 'BEGIN{ PROCINFO["sorted_in"]="@val_num_asc" }
{ a[NR%100] = $i }
{ k=0; m=0
  for(j in a) { k++
    if (k >= length(a)/2  ) m+=a[j]
    if (k >= length(a)/2+1) break
  }
  print "med:", m/2
}'
or if you want it a bit lighter on the if-conditions
awk 'BEGIN{ PROCINFO["sorted_in"]="@val_num_asc" }
{ a[NR%100] = $i }
{ k=0; m=0
  for(j in a) { k++
    if (k < length(a)/2  ) continue
    if (k > length(a)/2+1) break
    m+=a[j]
  }
  print "med:", (length(a)%2==0 ? m/2 : m)
}'
If you don't want to rely on the pre-sorted traversal, then the computation of the median becomes much more difficult. A possible way would be to make use of a selection algorithm, as explained here.
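Alternatively (my sketch, not part of the original answer), gawk's asort() can copy the window's values into a scratch array sorted ascending, which avoids the sorted-traversal trick at the cost of re-sorting on every line:
# median via gawk's asort(); b and n are local to the function
function median(a,    b, n) {
    n = asort(a, b)                  # b[1..n] holds the values of a, sorted ascending
    return (n % 2) ? b[(n+1)/2] : (b[n/2] + b[n/2+1]) / 2
}
{ a[NR%100] = $i; print "med:", median(a) }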

Related

Bash one liner for calculating the average of a specific row of numbers in bash

I just began learning bash.
Trying to figure out how to convert a two-liner into a one-liner using bash.
The first line of code...
searches the first column of input.txt for the word KEYWORD,
captures every number in this KEYWORD row from column 2 until the last column, and
dumps all these numbers into the values.txt file, placing each number on a new line.
The second line of code calculates the average of all the numbers in the first column of values.txt, then prints out this value.
awk '{if($1=="KEYWORD") for(i=2;i<=NF;i++) print $i}' input.txt > values.txt
awk 'BEGIN{sum=0; counter=0}{sum+=$1; counter+=1}END{print sum/counter}' values.txt
How do I create a one-liner from this?
Something like
awk '
BEGIN { count = sum = 0 }
$1 == "KEYWORD" {
    for (n = 2; n <= NF; n++) sum += $n
    count += NF - 1
}
END { print sum/count }' input.txt
Just keep track of the sum and total count of numbers in the first (and only) pass through the file and then average them at the end instead of printing a value for each matching line.
After reviewing this problem with several people and learning some new bash/awk shortcuts, the code below appears to be the shortest answer.
awk '/KEYWORD/{for(n=3;n<=NF;n++)sum+=$n;print sum/(NF-2)}' input.txt
This code searches the input file for the row containing "KEYWORD",
then sums up all the fields from the 3rd column to the last column,
and prints out the average value of all those numbers (i.e. the mean).
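One caveat worth flagging (my note, not from the thread): /KEYWORD/ matches the word anywhere on the line, and sum is never reset, so a file with several matching rows would print drifting averages. A guarded variant, assuming (as in the first script) that the numbers start in column 2:
awk '$1 == "KEYWORD" { sum = 0                            # reset for each matching row
                       for (n = 2; n <= NF; n++) sum += $n
                       print sum / (NF - 1) }' input.txt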

Bash: Finding average of entries from multiple columns after reading a CSV text file

I am trying to read a CSV text file and find the average of weekly hours (columns 3 through 7) spent by all user IDs (column 2) ending with an even number (2,4,6,...).
The input sample is as below:
Computer ID,User ID,M,T,W,T,F
Computer1,User3,5,7,3,5,2
Computer2,User5,8,8,8,8,8
Computer3,User4,0,8,0,8,4
Computer4,User1,5,4,5,5,8
Computer5,User2,9,8,10,0,0
Computer6,User7,4,7,8,2,5
Computer7,User6,8,8,8,0,0
Computer8,User9,5,2,0,6,8
Computer9,User8,2,5,7,3,6
Computer10,User10,8,9,9,9,10
I have written the following script:
awk -F, '$2~/[24680]$/{for(i=3;i<=7;i++){a+=$i};printf "%s\t%.2g\n",$2,a/5;a=0}' user-list.txt > superuser.txt
The output of this script is:
User4 4
User2 5.4
User6 4.8
User8 4.6
User10 9
However, I want to change the script to print only a single average across all user IDs ending with an even number.
The desired output for this would be as below (which is technically the average of all hours for the IDs ending with even numbers):
5.56
Any help would be appreciated.
TIA
Trying to fix the OP's attempt here, adding logic to compute the average of averages once the file has been fully read. Written on mobile, so I couldn't test it, but it should work if I've understood the OP's description correctly.
awk -F, '
$2~/[24680]$/{
    count++
    for(i=3;i<=7;i++){
        sum+=$i
    }
    tot+=sum/5
    sum=0
}
END{
    print "Average of averages is: " (count?tot/count:"NaN")
}
' user-list.txt > superuser.txt
You may try:
awk -F, '$2 ~ /[02468]$/ {
    for(i=3; i<=7; i++) {
        s += $i
        ++n
    }
}
END {
    if (n)
        printf "%.2f\n", s/n
}' cust.csv
5.56
awk -F, 'NR == 1 { next } { match($2,/[[:digit:]]+/);num=substr($2,RSTART,RLENGTH);if(num%2==0) { av+=($3+$4+$5+$6+$7)/5 } } END { printf "%.2f\n",av/5}' user-list.txt
Ignore the first header line. Pick the number out of the user ID with awk's match function and set the num variable to it. Check whether the number is even with num%2. If it is, add that row's average to av. At the end, print the average to 2 decimal places (the final division by 5 works because this sample happens to contain five even-numbered users).
Print the daily average, for all even numbered user IDs:
#!/bin/sh
awk -F , '
(NR>1) &&
($2 ~ /[02468]$/) {
    hours += ($3 + $4 + $5 + $6 + $7)
    (users++)
}
END {
    print (hours/users/5)
}' \
"$1"
Usage example:
$ script user-list
5.56
One way to get the evenness or oddness of an integer is to use the modulus operator (%), as in N % 2. For even values of N, this expression evaluates to zero, and for odd values, it evaluates to 1.
However, in this case a string operation would be required to extract the number anyway, so we may as well just use a single string match to get odd or even.
Also, IMO, for 5 fields which are not going to change (the days of the week), it's more succinct to add them directly instead of looping. (NR>1) skips the title line too, in case there's a conflict.
Finally, you can of course swap /[02468]$/ for /[13579]$/ to get the same data for odd-numbered users.
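For the curious, here is a sketch of that modulus route (my untested illustration, using the same file and columns as above): extract the trailing digits of the user ID with match(), then test num % 2:
awk -F, 'NR > 1 && match($2, /[0-9]+$/) {
    num = substr($2, RSTART, RLENGTH)       # trailing digits of the user ID
    if (num % 2 == 0) { hours += $3+$4+$5+$6+$7; users++ }
}
END { if (users) print hours / users / 5 }' user-list.txt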

In a CSV file, subtotal 2 columns based on a third one, using AWK in KSH

Disclaimers:
1) English is my second language, so please forgive any grammatical horrors you may find. I am pretty confident you will be able to understand what I need despite these.
2) I have found several examples on this site that address questions/problems similar to mine, though I was unfortunately not able to figure out the modifications needed to fit my needs.
The "Problem":
I have a CSV file that looks like this:
c1,c2,c3,c4,c5,134.6,,c8,c9,SERVER1,c11
c1,c2,c3,c4,c5,0,,c8,c9,SERVER1,c11
c1,c2,c3,c4,c5,0.18,,c8,c9,SERVER2,c11
c1,c2,c3,c4,c5,0,,c8,c9,SERVER2,c11
c1,c2,c3,c4,c5,416.09,,c8,c9,SERVER3,c11
c1,c2,c3,c4,c5,0,,c8,c9,SERVER3,c11
c1,c2,c3,c4,c5,12.1,,c8,c9,SERVER3,c11
c1,c2,c3,c4,c5,480.64,,c8,c9,SERVER4,c11
c1,c2,c3,c4,c5,,83.65,c8,c9,SERVER5,c11
c1,c2,c3,c4,c5,,253.15,c8,c9,SERVER6,c11
c1,c2,c3,c4,c5,,18.84,c8,c9,SERVER7,c11
c1,c2,c3,c4,c5,,8.12,c8,c9,SERVER7,c11
c1,c2,c3,c4,c5,,22.45,c8,c9,SERVER7,c11
c1,c2,c3,c4,c5,,117.81,c8,c9,SERVER8,c11
c1,c2,c3,c4,c5,,96.34,c8,c9,SERVER9,c11
Complementary facts:
1) File has 11 columns.
2) The data in columns 1, 2, 3, 4, 5, 8, 9 and 11 is irrelevant in this case. In other words, I will only work with columns 6, 7 and 10.
3) Column 10 will typically contain alphanumeric strings (server names), though it may also contain "-" and/or "_".
4) Columns 6 and 7 will contain exclusively numbers, with up to two decimal places (a possible value is 0). Only one of the two will have data per line, never both.
What I need as an output:
- A single occurrence of every string in column 10 (as column 1), then the sum (subtotal) of its values in column 6 (as column 2) and lastly, the sum (subtotal) of its values in column 7 (as column 3).
- If the total for a field is "0" the field must be left empty, but it still must exist (its respective comma has to be printed).
- **Note** that the strings in column 10 will already be alphabetically sorted, so there is no need to do that part of the processing with AWK.
Output sample, using the sample above as an input:
SERVER1,134.6,,
SERVER2,0.18,,
SERVER3,428.19,,
SERVER4,480.64,,
SERVER5,,83.65
SERVER6,,253.15
SERVER7,,26.96
I've already found within these pages not one but two AWK one-liners that PARTIALLY accomplish what I need:
awk -F "," 'NR==1{last=$10; sum=0;}{if (last != $10) {print last "," sum; last=$10; sum=0;} sum += $6;}END{print last "," sum;}' inputfile
awk -F, '{a[$10]+=$6;}END{for(i in a)print i","a[i];}' inputfile
My "problems" in both cases are the same:
- Subtotals of 0 are printed.
- I can only handle the sum of one column at a time. Whenever I try to add the second one, I get either a syntax error or it does simply not print the third column at all.
Thanks in advance for your support people!
Regards,
Martín
something like this?
$ awk 'BEGIN{FS=OFS=","}
{s6[$10]+=$6; s7[$10]+=$7}
END{for(k in s6) print k,(s6[k]?s6[k]:""),(s7[k]?s7[k]:"")}' file | sort
SERVER1,134.6,
SERVER2,0.18,
SERVER3,428.19,
SERVER4,480.64,
SERVER5,,83.65
SERVER6,,253.15
SERVER7,,49.41
SERVER8,,117.81
SERVER9,,96.34
Note that your treatment of commas is not consistent: you're adding an extra one when the last field is zero (count the commas).
Your posted expected output doesn't seem to match your posted sample input so we're guessing but this might be what you're looking for:
$ cat tst.awk
BEGIN { FS=OFS="," }
$10 != prev {
    if (NR > 1) {
        print prev, sum6, sum7
    }
    sum6 = sum7 = ""
    prev = $10
}
$6 { sum6 += $6 }
$7 { sum7 += $7 }
END { print prev, sum6, sum7 }
$ awk -f tst.awk file
SERVER1,134.6,
SERVER2,0.18,
SERVER3,428.19,
SERVER4,480.64,
SERVER5,,83.65
SERVER6,,253.15
SERVER7,,49.41
SERVER8,,117.81
SERVER9,,96.34

Print every line from a large file where the previous N lines meet specific criteria

I'd like to print every line from a large file where the previous 10 lines have a specific value in a specific column (in the example below, column 9 has a value < 1). I don't want to store the whole file in memory. I am trying to use awk for this purpose as follows:
awk 'BEGIN{FS=","}
{
for (i=FNR,i<FNR+10, i++) saved[++s] = $0 ; next
for (i=1,i<s, i++)
if ($9<1)
print saved[s]; delete saved; s=0
}' file.csv
The goal of this command is to save the 10 previous lines, then check that column 9 in each of those lines meets my criteria, then print the current line. Any help with this, or a suggestion for a more efficient approach, is much appreciated!
No need to store anything in memory or do any explicit looping on values. To print the current line if the last 10 lines (inclusive) had a $9 value < 1 is just:
awk -F, '(c=($9<1?c+1:0))>9' file
Untested, of course, since you didn't provide any sample input or expected output, so check the math; but that is the right approach, and if the math is off, the fix is just to change >9 to >10 or whatever you need.
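Spelled out long-hand (my expansion, equivalent to the one-liner above):
awk -F, '{
    c = ($9 < 1) ? c + 1 : 0    # count consecutive lines with $9 < 1, reset otherwise
    if (c > 9) print            # the current line and the 9 before it all qualified
}' file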
Here is a solution for GNU Awk:
chk_prev_lines.awk
BEGIN { FS=","
        CMP_LINE_NR=10
        CMP_VAL = 1 }

FNR > CMP_LINE_NR {
    ok = 1
    # check the stored values
    for( i = 0; i < CMP_LINE_NR; i++ ) {
        if ( !(prev_Field9[ i ] < CMP_VAL) ) {
            ok = 0
            break # early return
        }
    }
    if( ok ) print
}

{ # store $9 for the comparison
    prev_Field9[ FNR % CMP_LINE_NR ] = $9
}
Use it like this: awk -f chk_prev_lines.awk your_file.
Explanation
CMP_LINE_NR determines how many values from previous lines are stored
CMP_VAL determines the value used for the comparison
The condition FNR > CMP_LINE_NR ensures that the first line checked against its previous lines is line CMP_LINE_NR + 1, the first line that has that many predecessors.
The last action stores $9 for the comparison; it is executed for every line.

Is it possible to append an item to an array in awk without specifying an index?

I realize that awk has associative arrays, but I wonder if there is an awk equivalent to this:
http://php.net/manual/en/function.array-push.php
The obvious workaround is to just say:
array[$new_element] = $new_element
However, this seems less readable and more hackish than it needs to be.
I don't think an array length is immediately available in awk (at least not in the versions I fiddle around with). But you could simply maintain the length and then do something like this:
array[arraylen++] = $0;
And then access the elements it via the same integer values:
for ( i = 0; i < arraylen; i++ )
print array[i];
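A minimal self-contained run of this idea (my example; the printed order is deterministic here because the for loop controls it):
$ printf 'a\nb\nc\n' | awk '{ arr[arraylen++] = $0 } END { for (i = 0; i < arraylen; i++) print i, arr[i] }'
0 a
1 b
2 c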
In gawk you can find the length of an array with length(var) so it's not very hard to cook up your own function.
function push(A,B) { A[length(A)+1] = B }
Notice this discussion, though -- all the places I can access right now have gawk 3.1.5 so I cannot properly test my function, duh. But here is an approximation.
vnix$ gawk '# BEGIN: make sure arr is an array
> BEGIN { delete arr[0] }
> { print "=" length(arr); arr[length(arr)+1] = $1;
> print length(arr), arr[length(arr)] }
> END { print "---";
> for (i=1; i<=length(arr); ++i) print i, arr[i] }' <<HERE
> fnord foo
> ick bar
> baz quux
> HERE
=0
1 fnord
=1
2 ick
=2
3 baz
---
1 fnord
2 ick
3 baz
As others have said, awk provides no functionality like this out of the box. Your "hackish" workaround may work for some datasets, but not others. Consider that you might add the same array value twice, and want it represented twice within the array.
$ echo 3 | awk 'BEGIN{ a[1]=5; a[2]=12; a[3]=2 }
> { a[$1] = $1 }
> END {print length(a) " - " a[3]}'
3 - 3
The best solution may be informed by what data are in the array, but here are some thoughts.
First off, if you are certain that your index will always be numeric, will always start at 1, and that you will never delete array elements, then tripleee's suggestion of A[length(A)+1]="value" may work for you. But if you do delete an element, then your next write may overwrite your last element.
If your index does not matter, and you're not worried about wasting space with long keys, you could use a random number that's long enough to reduce the likelihood of collisions. A quick & dirty option might be:
srand()
a[rand() rand() rand()]="value"
Remember to use srand() for better randomization, and don't trust rand() to produce actual random numbers. This is a less than perfect solution in a number of ways, but it has the advantage of being a single line of code.
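Wrapped into a runnable form (my sketch; the three concatenated rand() values merely make key collisions less likely):
awk 'BEGIN { srand() }                   # seed once, before generating any keys
     { a[rand() rand() rand()] = $0 }    # concatenated floats form the string key
     END { for (k in a) print a[k] }'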
If your keys are numeric but possibly sparse, as in the example that would break tripleee's solution, you can add a small search to your push function:
function push (a, v, n) {
    n=length(a)+1
    while (n in a) n++
    a[n]=v
}
The while loop ensures that you'll assign an unused index. This function is also compatible with arrays that use non-numeric indices -- it assigns keys that are numeric, but it doesn't care what's already there.
Note that awk does not guarantee the order of elements within an array, so the idea that you will "push an item onto the end of the array" is wrong. You'll add this element to the array, but there's no guarantee it'll appear last when you step through with a for loop.
$ cat a
#!/usr/bin/awk -f
function push (a, v, n) {
    n=length(a)+1
    while (n in a) n++
    a[n]=v
}
{
    push(a, $0)
}
END {
    print "length=" length(a)
    for(i in a) print i " - " a[i]
}
$ printf '3\nfour\ncinq\n' | ./a
length=3
2 - four
3 - cinq
1 - 3