I am trying to read a CSV text file and find the average of weekly hours (columns 3 through 7) spent by all user IDs (column 2) ending with an even number (2, 4, 6, ...).
The input sample is as below:
Computer ID,User ID,M,T,W,T,F
Computer1,User3,5,7,3,5,2
Computer2,User5,8,8,8,8,8
Computer3,User4,0,8,0,8,4
Computer4,User1,5,4,5,5,8
Computer5,User2,9,8,10,0,0
Computer6,User7,4,7,8,2,5
Computer7,User6,8,8,8,0,0
Computer8,User9,5,2,0,6,8
Computer9,User8,2,5,7,3,6
Computer10,User10,8,9,9,9,10
I have written the following script:
awk -F, '$2~/[24680]$/{for(i=3;i<=7;i++){a+=$i};printf "%s\t%.2g\n",$2,a/5;a=0}' user-list.txt > superuser.txt
The output of this script is:
User4 4
User2 5.4
User6 4.8
User8 4.6
User10 9
However, I want to change the script to print only one average across all user IDs ending with an even number.
The desired output for this would be as below (which is technically the average of all hours for the IDs ending with even numbers):
5.56
Any help would be appreciated.
TIA
Trying to fix the OP's attempt here, adding logic to compute the average of averages once the whole file has been read. Written on mobile, so I couldn't test it, but it should work if I've understood the OP's description correctly.
awk -F, '
$2~/[24680]$/{
count++
for(i=3;i<=7;i++){
sum+=$i
}
tot+=sum/5
sum=0
}
END{
print "Average of averages is: " (count?tot/count:"NaN")
}
' user-list.txt > superuser.txt
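A quick sanity check of that script against the sample from the question (file name user-list.txt as in the question): the five per-user averages are 4, 5.4, 4.8, 4.6 and 9, and (4+5.4+4.8+4.6+9)/5 = 5.56, which matches the desired output.

```shell
cat > user-list.txt <<'EOF'
Computer ID,User ID,M,T,W,T,F
Computer1,User3,5,7,3,5,2
Computer2,User5,8,8,8,8,8
Computer3,User4,0,8,0,8,4
Computer4,User1,5,4,5,5,8
Computer5,User2,9,8,10,0,0
Computer6,User7,4,7,8,2,5
Computer7,User6,8,8,8,0,0
Computer8,User9,5,2,0,6,8
Computer9,User8,2,5,7,3,6
Computer10,User10,8,9,9,9,10
EOF

awk -F, '
$2~/[24680]$/{
  count++
  for(i=3;i<=7;i++){
    sum+=$i
  }
  tot+=sum/5
  sum=0
}
END{
  print "Average of averages is: " (count?tot/count:"NaN")
}
' user-list.txt
# prints: Average of averages is: 5.56
```

Note that the average of averages equals the overall average here only because every user contributes the same number of days.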
You may try:
awk -F, '$2 ~ /[02468]$/ {
for(i=3; i<=7; i++) {
s += $i
++n
}
}
END {
if (n)
printf "%.2f\n", s/n
}' cust.csv
5.56
awk -F, 'NR == 1 { next } { match($2,/[[:digit:]]+/);num=substr($2,RSTART,RLENGTH);if(num%2==0) { av+=($3+$4+$5+$6+$7)/5; cnt++ } } END { printf "%.2f\n",av/cnt }' user-list.txt
Ignore the first header line. Pick the number out of the user ID with awk's match function and set the num variable to it. Check whether the number is even with num%2. If it is, add the row's average to av and increment cnt, so the final divisor is not hard-coded to the number of even users in this particular sample. At the end, print av/cnt to 2 decimal places.
Print the daily average for all even-numbered user IDs:
#!/bin/sh
awk -F , '
(NR>1) &&
($2 ~ /[02468]$/) {
hours += ($3 + $4 + $5 + $6 + $7)
users++
}
END {
print (hours/users/5)
}' \
"$1"
Usage example:
$ script user-list
5.56
One way to get the evenness or oddness of an integer is to use the modulus operator (%), as in N % 2. For even values of N, this expression evaluates to zero, and for odd values, it evaluates to 1.
However, in this case a string operation would be required to extract the number anyway, so we may as well use a single string match to decide odd or even.
Also, IMO, for 5 fields which are not going to change (days of the week), it's more succinct to add them directly instead of using a loop. (NR>1) also skips the titles line, in case there's a conflict.
Finally, you can of course swap /[02468]$/ for /[13579]$/ to get the same data for odd-numbered users.
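For instance, the odd-user variant of the script above, run against the sample data from the question (written to a file named user-list here, matching the usage example):

```shell
cat > user-list <<'EOF'
Computer ID,User ID,M,T,W,T,F
Computer1,User3,5,7,3,5,2
Computer2,User5,8,8,8,8,8
Computer3,User4,0,8,0,8,4
Computer4,User1,5,4,5,5,8
Computer5,User2,9,8,10,0,0
Computer6,User7,4,7,8,2,5
Computer7,User6,8,8,8,0,0
Computer8,User9,5,2,0,6,8
Computer9,User8,2,5,7,3,6
Computer10,User10,8,9,9,9,10
EOF

# Same script, with the character class swapped to odd digits:
awk -F , '
(NR>1) &&
($2 ~ /[13579]$/) {
    hours += ($3 + $4 + $5 + $6 + $7)
    users++
}
END {
    print (hours/users/5)
}' user-list
# prints: 5.44 (136 hours, over 5 users, over 5 days)
```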
Disclaimers:
1) English is my second language, so please forgive any grammatical horrors you may find. I am pretty confident you will be able to understand what I need despite these.
2) I have found several examples on this site that address questions/problems similar to mine, though I was unfortunately not able to figure out the modifications that would need to be introduced to fit my needs.
The "Problem":
I have a CSV file that looks like this:
c1,c2,c3,c4,c5,134.6,,c8,c9,SERVER1,c11
c1,c2,c3,c4,c5,0,,c8,c9,SERVER1,c11
c1,c2,c3,c4,c5,0.18,,c8,c9,SERVER2,c11
c1,c2,c3,c4,c5,0,,c8,c9,SERVER2,c11
c1,c2,c3,c4,c5,416.09,,c8,c9,SERVER3,c11
c1,c2,c3,c4,c5,0,,c8,c9,SERVER3,c11
c1,c2,c3,c4,c5,12.1,,c8,c9,SERVER3,c11
c1,c2,c3,c4,c5,480.64,,c8,c9,SERVER4,c11
c1,c2,c3,c4,c5,,83.65,c8,c9,SERVER5,c11
c1,c2,c3,c4,c5,,253.15,c8,c9,SERVER6,c11
c1,c2,c3,c4,c5,,18.84,c8,c9,SERVER7,c11
c1,c2,c3,c4,c5,,8.12,c8,c9,SERVER7,c11
c1,c2,c3,c4,c5,,22.45,c8,c9,SERVER7,c11
c1,c2,c3,c4,c5,,117.81,c8,c9,SERVER8,c11
c1,c2,c3,c4,c5,,96.34,c8,c9,SERVER9,c11
Complementary facts:
1) File has 11 columns.
2) The data in columns 1, 2, 3, 4, 5, 8, 9 and 11 is irrelevant in this case. In other words, I will only work with columns 6, 7 and 10.
3) Column 10 will be typically alphanumeric strings (server names), though it may contain also "-" and/or "_".
4) Columns 6 and 7 will have exclusively numbers, with up to two decimal places (A possible value is 0). Only one of the two will have data per line, never both.
What I need as an output:
- A single occurrence of every string in column 10 (as column 1), then the sum (subtotal) of its values in column 6 (as column 2) and last, the sum (subtotal) of its values in column 7 (as column 3).
- If the total for a field is "0", the field must be left empty, but it still must exist (its respective comma has to be printed).
- **Note** that the strings in column 10 will be already alphabetically sorted, so there is no need to do that part of the processing with AWK.
Output sample, using the sample above as an input:
SERVER1,134.6,,
SERVER2,0.18,,
SERVER3,428.19,,
SERVER4,480.64,,
SERVER5,,83.65
SERVER6,,253.15
SERVER7,,26.96
I've already found within these pages not one but two AWK one-liners that PARTIALLY accomplish what I need:
awk -F "," 'NR==1{last=$10; sum=0;}{if (last != $10) {print last "," sum; last=$10; sum=0;} sum += $6;}END{print last "," sum;}' inputfile
awk -F, '{a[$10]+=$6;}END{for(i in a)print i","a[i];}' inputfile
My "problems" in both cases are the same:
- Subtotals of 0 are printed.
- I can only handle the sum of one column at a time. Whenever I try to add the second one, I get either a syntax error or it simply does not print the third column at all.
Thanks in advance for your support people!
Regards,
Martín
something like this?
$ awk 'BEGIN{FS=OFS=","}
{s6[$10]+=$6; s7[$10]+=$7}
END{for(k in s6) print k,(s6[k]?s6[k]:""),(s7[k]?s7[k]:"")}' file | sort
SERVER1,134.6,
SERVER2,0.18,
SERVER3,428.19,
SERVER4,480.64,
SERVER5,,83.65
SERVER6,,253.15
SERVER7,,49.41
SERVER8,,117.81
SERVER9,,96.34
Note that your treatment of commas is not consistent: you're adding an extra one when the last field is zero (count the commas).
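The blanking of zero subtotals in the END block above works because awk treats a numeric 0 as false in the ternary, so a zero total prints as an empty string. A minimal illustration (hypothetical values):

```shell
awk 'BEGIN {
  OFS = ","
  s6 = 0; s7 = 83.65                       # one zero total, one non-zero
  print "SERVER5", (s6 ? s6 : ""), (s7 ? s7 : "")
}'
# prints: SERVER5,,83.65
```

This also blanks a subtotal that is genuinely 0, which is exactly the behaviour asked for.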
Your posted expected output doesn't seem to match your posted sample input, so we're guessing, but this might be what you're looking for:
$ cat tst.awk
BEGIN { FS=OFS="," }
$10 != prev {
if (NR > 1) {
print prev, sum6, sum7
}
sum6 = sum7 = ""
prev = $10
}
$6 { sum6 += $6 }
$7 { sum7 += $7 }
END { print prev, sum6, sum7 }
$ awk -f tst.awk file
SERVER1,134.6,
SERVER2,0.18,
SERVER3,428.19,
SERVER4,480.64,
SERVER5,,83.65
SERVER6,,253.15
SERVER7,,49.41
SERVER8,,117.81
SERVER9,,96.34
I'd like to print every line from a large file where the previous 10 lines have a specific value in a specific column (in the example below, a value < 1 in column 9). I don't want to store the whole file in memory. I am trying to use awk for this purpose as follows:
awk 'BEGIN{FS=","}
{
for (i=FNR,i<FNR+10, i++) saved[++s] = $0 ; next
for (i=1,i<s, i++)
if ($9<1)
print saved[s]; delete saved; s=0
}' file.csv
The goal of this command is to save the 10 previous lines, then check that column 9 in each of those lines meets my criterion, then print the current line. Any help with this, or a suggestion of a more efficient way to do this, is much appreciated!
No need to store anything in memory or do any explicit looping on values. To print the current line if the last 10 lines (inclusive) had a $9 value < 1 is just:
awk -F, '(c=($9<1?c+1:0))>9' file
Untested, of course, since you didn't provide any sample input or expected output, so check the math, but that is the right approach; if the math is off, the tweak to fix it is just to change >9 to >10 or whatever you need.
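As a made-up sanity check (no sample input was posted, so the data below is invented): twelve 9-column lines where $9 is 0 on lines 1 through 11 and 5 on line 12. Only a line ending a streak of at least ten qualifying lines (the line itself included) prints:

```shell
# Build the test file: $9 < 1 on lines 1-11, $9 = 5 on line 12.
for i in $(seq 1 12); do
  if [ "$i" -le 11 ]; then v=0; else v=5; fi
  echo "line$i,x,x,x,x,x,x,x,$v"
done > sample.csv

# c counts consecutive lines with $9 < 1; a line prints once c exceeds 9.
awk -F, '(c=($9<1?c+1:0))>9' sample.csv
# prints lines 10 and 11, the only lines where the streak has reached 10
```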
Here is a solution for GNU Awk:
chk_prev_lines.awk
BEGIN { FS=","
CMP_LINE_NR=10
CMP_VAL = 1 }
FNR > CMP_LINE_NR {
ok = 1
# check the stored values
for( i = 0; i< CMP_LINE_NR; i++ ) {
if ( !(prev_Field9[ i ] < CMP_VAL) ) {
ok = 0
break # early return
}
}
if( ok ) print
}
{ # store $9 for the comparison
prev_Field9[ FNR % CMP_LINE_NR] = $9
}
Use it like this: awk -f chk_prev_lines.awk your_file.
Explanation
CMP_LINE_NR determines how many values from previous lines are stored
CMP_VAL determines the values used for the comparison
The condition FNR > CMP_LINE_NR ensures that the first line whose previous lines are checked is line CMP_LINE_NR + 1, the first line that actually has that many previous lines.
The last action stores the value of $9 for the comparison. This action is executed for every line.
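A quick invented check of the ring buffer, with CMP_LINE_NR and CMP_VAL inlined as 10 and 1 (note that, unlike the one-liner above, the current line's own $9 is never examined, only the ten lines before it):

```shell
# 12 lines: $9 = 0 on lines 1-11, $9 = 9 on line 12.
for i in $(seq 1 12); do
  if [ "$i" -lt 12 ]; then v=0; else v=9; fi
  echo "x,x,x,x,x,x,x,x,$v"
done |
awk -F, '
FNR > 10 {
  ok = 1
  for (i = 0; i < 10; i++)     # scan the ring buffer of stored $9 values
    if (!(p[i] < 1)) { ok = 0; break }
  if (ok) print "line " FNR " qualifies"
}
{ p[FNR % 10] = $9 }           # store after checking, so the line itself is excluded
'
# prints that lines 11 and 12 qualify; line 12 qualifies even though its
# own $9 is 9, because only its ten predecessors are tested
```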
I realize that awk has associative arrays, but I wonder if there is an awk equivalent to this:
http://php.net/manual/en/function.array-push.php
The obvious workaround is to just say:
array[$new_element] = $new_element
However, this seems less readable and more hackish than it needs to be.
I don't think an array length is immediately available in awk (at least not in the versions I fiddle around with). But you could simply maintain the length and then do something like this:
array[arraylen++] = $0;
And then access the elements via the same integer values:
for ( i = 0; i < arraylen; i++ )
print array[i];
In gawk you can find the length of an array with length(var) so it's not very hard to cook up your own function.
function push(A,B) { A[length(A)+1] = B }
Notice this discussion, though -- all the places I can access right now have gawk 3.1.5 so I cannot properly test my function, duh. But here is an approximation.
vnix$ gawk '# BEGIN: make sure arr is an array
> BEGIN { delete arr[0] }
> { print "=" length(arr); arr[length(arr)+1] = $1;
> print length(arr), arr[length(arr)] }
> END { print "---";
> for (i=1; i<=length(arr); ++i) print i, arr[i] }' <<HERE
> fnord foo
> ick bar
> baz quux
> HERE
=0
1 fnord
=1
2 ick
=2
3 baz
---
1 fnord
2 ick
3 baz
As others have said, awk provides no functionality like this out of the box. Your "hackish" workaround may work for some datasets, but not others. Consider that you might add the same array value twice, and want it represented twice within the array.
$ echo 3 | awk 'BEGIN{ a[1]=5; a[2]=12; a[3]=2 }
> { a[$1] = $1 }
> END {print length(a) " - " a[3]}'
3 - 3
The best solution may be informed by what data are in the array, but here are some thoughts.
First off, if you are certain that your index will always be numeric, will always start at 1, and that you will never delete array elements, then tripleee's suggestion of A[length(A)+1]="value" may work for you. But if you do delete an element, your next write may overwrite your last element.
If your index does not matter, and you're not worried about wasting space with long keys, you could use a random number that's long enough to reduce the likelihood of collisions. A quick & dirty option might be:
srand()
a[rand() rand() rand()]="value"
Remember to use srand() for better randomization, and don't trust rand() to produce actual random numbers. This is a less than perfect solution in a number of ways, but it has the advantage of being a single line of code.
If your keys are numeric but possibly sparse, as in the example that would break tripleee's solution, you can add a small search to your push function:
function push (a, v, n) {
n=length(a)+1
while (n in a) n++
a[n]=v
}
The while loop ensures that you'll assign an unused index. This function is also compatible with arrays that use non-numeric indices: it assigns keys that are numeric, but it doesn't care what's already there.
Note that awk does not guarantee the order of elements within an array, so the idea that you will "push an item onto the end of the array" is wrong. You'll add the element to the array, but there's no guarantee it will appear last when you step through it with a for loop.
$ cat a
#!/usr/bin/awk -f
function push (a, v, n) {
n=length(a)+1
while (n in a) n++
a[n]=v
}
{
push(a, $0)
}
END {
print "length=" length(a)
for(i in a) print i " - " a[i]
}
$ printf '3\nfour\ncinq\n' | ./a
length=3
2 - four
3 - cinq
1 - 3