multiple statements after if statement - awk

I have an awk line to calculate an average that works fine but when I put it into an if statement, I get a syntax error referring to the part with "END". I want to calculate the average only if certain conditions are fulfilled.
Line for calculating average that works:
awk '{ sum += $2; n++ } END { if (n > 0) print sum / n; }' input.txt
Line for calculating average after if statement which doesn't work:
awk '{if ( $1 > 5 ) { {sum += $2; n++} END { if (n > 0) print sum / n; }}}' input.txt
I would like to know where the error is, changing the type and number of brackets did not help.

END is a special pattern, not a statement: it can only appear at the top level of the program, paired with its own action block, so it cannot be nested inside another action's braces. Move the condition in front of the main action as a pattern and keep END separate:
awk '$1>5 {sum+=$2; n++}
END {if(n) print sum/n}' file
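If you would rather keep an explicit if, the test can also live inside the main action block, with END as its own top-level block; a minimal sketch of your one-liner rearranged that way:
awk '{ if ($1 > 5) { sum += $2; n++ } } END { if (n > 0) print sum / n }' input.txt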

Bash: Finding average of entries from multiple columns after reading a CSV text file

I am trying to read a CSV text file and find the average of weekly hours (columns 3 through 7) spent by all user IDs (column 2) ending with an even number (2, 4, 6, ...).
The input sample is as below:
Computer ID,User ID,M,T,W,T,F
Computer1,User3,5,7,3,5,2
Computer2,User5,8,8,8,8,8
Computer3,User4,0,8,0,8,4
Computer4,User1,5,4,5,5,8
Computer5,User2,9,8,10,0,0
Computer6,User7,4,7,8,2,5
Computer7,User6,8,8,8,0,0
Computer8,User9,5,2,0,6,8
Computer9,User8,2,5,7,3,6
Computer10,User10,8,9,9,9,10
I have written the following script:
awk -F, '$2~/[24680]$/{for(i=3;i<=7;i++){a+=$i};printf "%s\t%.2g\n",$2,a/5;a=0}' user-list.txt > superuser.txt
The output of this script is:
User4 4
User2 5.4
User6 4.8
User8 4.6
User10 9
However, I want to change the script to print just one average covering all user IDs that end with an even number.
The desired output for this would be as below (which is technically the average of all hours for the IDs ending with even numbers):
5.56
Any help would be appreciated.
TIA
Trying to fix OP's attempt here, adding logic to print the average of the per-user averages once the whole file has been read. Written on mobile, so I couldn't test it, but it should work if I've understood OP's description correctly. Since every user has exactly five days, the average of the per-user averages equals the overall average the OP asked for.
awk -F, '
$2~/[24680]$/{                ## user ID ends in an even digit
  count++
  for(i=3;i<=7;i++){          ## sum the five daily columns
    sum+=$i
  }
  tot+=sum/5                  ## accumulate this user's daily average
  sum=0
}
END{
  print "Average of averages is: " (count?tot/count:"NaN")
}
' user-list.txt > superuser.txt
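With the sample input above, superuser.txt comes out as:
Average of averages is: 5.56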
You may try:
awk -F, '$2 ~ /[02468]$/ {
for(i=3; i<=7; i++) {
s += $i
++n
}
}
END {
if (n)
printf "%.2f\n", s/n
}' cust.csv
5.56
awk -F, 'NR == 1 { next } { match($2,/[[:digit:]]+/); num=substr($2,RSTART,RLENGTH); if (num%2==0) { av+=($3+$4+$5+$6+$7)/5; cnt++ } } END { if (cnt) printf "%.2f\n", av/cnt }' user-list.txt
Ignore the first header line. Pick the number out of the user ID with awk's match function and set the num variable to it. Check whether the number is even with num%2. If it is, add that user's daily average to av and count the user in cnt. At the end, print av divided by the number of matching users, to 2 decimal places.
Print the daily average, for all even numbered user IDs:
#!/bin/sh
awk -F , '
(NR>1) &&
($2 ~ /[02468]$/) {
hours += ($3 + $4 + $5 + $6 + $7)
users++
}
END {
if (users) print (hours/users/5)
}' \
"$1"
Usage example:
$ script user-list
5.56
One way to get the evenness or oddness of an integer is the modulus operator (%), as in N % 2. For even values of N, this expression evaluates to zero, and for odd values it evaluates to 1.
However, in this case a string operation is required to extract the number anyway, so you may as well just use a single string match to get odd or even.
Also, IMO, for 5 fields which are not going to change (days of the week), it's more succinct to just add them directly instead of using a loop. (NR>1) skips the title line too, in case there's a conflict.
Finally, you can of course swap /[02468]$/ for /[13579]$/ to run the same analysis for odd-numbered users.
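If you do want the modulus route instead, here is a minimal sketch (not tested beyond the sample; it assumes every user ID ends in a run of digits):
awk -F, 'NR > 1 {
    match($2, /[0-9]+$/)                  # trailing digits of the user ID
    num = substr($2, RSTART, RLENGTH)
    if (num % 2 == 0) {                   # even-numbered user
        hours += $3 + $4 + $5 + $6 + $7
        users++
    }
}
END { if (users) printf "%.2f\n", hours / users / 5 }' user-list.txt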

Print every nth column of a file

I have a rather big file with 255 comma-separated columns and I need to print only every third column.
I was trying something like this
awk '{ for (i=0;i<=NF;i+=3) print $i }' file
but that doesn't seem to be the solution, since it prints everything as one long column. Can anybody help? Thanks
Here is one way to do this.
The script prog.awk:
BEGIN {FS = ","} # field separator
{for (i = 1; i <= NF; i += 3) printf ("%s%c", $i, i + 3 <= NF ? "," : "\n");} # comma between kept fields, newline after the last
Invocation:
awk -f prog.awk <input.csv >output.csv
Example input.csv:
1,2,3,4,5,6,7,8,9,10
11,12,13,14,15,16,17,18,19,20
Example output.csv:
1,4,7,10
11,14,17,20
It behaves like that because by default awk splits fields on whitespace. You have to tell it to split on commas instead, using the FS variable or the -F switch. Besides that, the first field is number one; $0 is the whole line, so also change the initial value of the for loop. Finally, print emits a newline after each field, which is why everything ended up in one long column; printf lets you keep each row on one line:
awk -F',' '{ for (i=1; i<=NF; i+=3) printf "%s%s", $i, (i+3<=NF ? "," : "\n") }' file
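With the example input.csv from the other answer, that produces:
1,4,7,10
11,14,17,20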

New to Awk. Struggling with negative number formatting

Goal: to output only data that is above 1 or below -1
or
output data that is between -1 and 1
I have the basics of awk and can print column 2 (where my data is).
Notice I also specified a range of 0-1:
awk '/[0-1]/ {print $2}' test.dat
I also need the line number, so I added NR...
awk '/[0-1]/ {print $2 NR}' test.dat
To make sure I am clear, the point is to identify which lines of the data are outside of the acceptable range, so we can ignore them in our analysis. (ie anything bigger than 1 or lower than -1 is too much of a change).
Any help you can provide would be great. I have pasted some sample data below.
http://pastebin.com/7tpBAqua
Not sure if you want to evaluate the data in every column, or if there's a specific column you need to test. Testing a single column is simplest; testing multiple or all columns is a fairly simple repetitive extension of the pattern. Since you mention column 2 specifically, let's assume you want to print column 2 only when it is between -1 and 1:
awk -F, '($2 >= -1) && ($2 <= 1) { print $2 }'
To test for the field being greater than 1 or less than -1 instead:
awk -F, '($2 <= -1) || ($2 >= 1) { print $2 }'
Printing a different field, or the entire line instead ($0) should be fairly obvious. To examine each field, either simply repeat the entire ($2 >= -1) && ($2 <= 1) { print $2 } clause for each field you're interested in (which quickly gets verbose), or something like this (not tested):
awk -F, '{ for (i = 1; i <= NF; ++i) if (($i >= -1) && ($i <= 1)) print $i; }'
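Since you also want the line number, NR works in the same kind of condition; a small sketch printing the line number next to any out-of-range value in column 2 (the separator is my choice, use whatever format you like):
awk -F, '($2 < -1) || ($2 > 1) { print NR ": " $2 }' test.dat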
Or, printing just the line number of every data row (skipping the first two lines) with any field from column 2 onward outside the range:
awk -F'[ ,]' 'NR>2{for (i=2;i<=NF;i++) if ($i<-1 || $i>1) { print NR; next } }' file

how to get rid of awk fatal division by zero error

Whenever I try to calculate the mean and standard deviation using awk, I get an "awk: fatal: division by zero attempted" error.
My commands are:
awk '{s+=$3} END{print $2"\t"s/(NR)}' >> mean;
awk '{sum+=$3;sumsq+=$3*$3} END {print $2"\t"sqrt(sumsq/NR - (sum/NR)^2)}' >>sd
Does anyone know how to solve this?
Your trouble is that ... you are dividing by zero.
You have two commands:
awk '{s+=$3} END{print $2"\t"s/(NR)}' >> mean;
awk '{sum+=$3;sumsq+=$3*$3} END {print $2"\t"sqrt(sumsq/NR - (sum/NR)^2)}' >>sd
The first command reads from standard input to EOF. The second command then runs and tries to read standard input, but finds it empty: it reads zero records, so NR is zero, the division by NR is a division by 0, and awk crashes.
You will need to deal with both the mean and the standard deviation in a single command.
awk '{ s1 += $3; s2 += $3*$3 }
END {
    if (NR > 0) {
        print $2 "\t" s1 / NR;
        print $2 "\t" sqrt(s2 / NR - (s1/NR)^2);
    }
}'
This avoids divide-by-zero errors.
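If you still want mean and sd to land in separate files, awk can redirect each print itself instead of leaving it to the shell; a minimal sketch under the same assumptions (values in column 3, label taken from column 2 of the last record; data.txt stands in for your input file):
awk '{ s1 += $3; s2 += $3*$3 }
END {
    if (NR > 0) {
        print $2 "\t" s1/NR                   > "mean"
        print $2 "\t" sqrt(s2/NR - (s1/NR)^2) > "sd"
    }
}' data.txt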

Counting and matching process

I have a matching problem with awk :(
I want to count how many times each value appears in the first column of main.file, and if a value appears more than twice, print the first and second columns of its lines.
main.file
1725009 7211378
3353866 11601802
3353866 8719104
724973 3353866
3353866 7211378
For example, "3353866" appears 3 times in the first column, so output.file should look like this:
output.file
3353866 11601802
3353866 8719104
3353866 7211378
How can I do this in awk?
If you mean items with at least 3 occurrences, you can collect occurrences in one array and the collected values as a preformatted or delimited string in another.
awk '{o[$1]++; v[$1] = v[$1] "\n" $0}
END{for(k in o){if(o[k]<3)continue;
print substr(v[k],2)}}' main.file
Untested, not at my computer. The output order will be essentially random; you'll need another variable to keep track of line numbers if you require the order to be stable.
This would be somewhat less hackish in Perl or Python, where a hash/dict can contain a structured value, such as a list.
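For instance, remembering the order in which each key first appears keeps the output stable, per the note above; a sketch in the same untested spirit (the order array and m counter are additions of this sketch):
awk '{ if (!(o[$1]++)) order[++m] = $1      # record first-seen order of keys
       v[$1] = v[$1] "\n" $0 }              # append the whole line under its key
END { for (i = 1; i <= m; i++) {
          k = order[i]
          if (o[k] >= 3) print substr(v[k], 2)   # drop the leading newline
      } }' main.file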
Another approach is to run through the file twice: it's a little bit slower, but the code is very neat:
awk '
NR==FNR {count[$1]++; next}   # first pass: count occurrences of column 1
count[$1] > 2 {print}         # second pass: print lines whose key was seen 3+ times
' main.file main.file
awk '{ store[$1 "-" lines[$1]] = $0; lines[$1]++ }
END { for (l in store) {
          split(l, pair, "-")               # recover the key from "key-index"
          if (lines[pair[1]] > 2) { print store[l] } } }'
One approach is to track all the records seen, the corresponding key $1 for each record, and how often each key occurs. Once you've recorded those for all the lines, you can iterate through the stored records, printing only those whose key count is greater than two.
awk '{
    record[NR] = $0;
    key[$0] = $1;
    count[$1]++
}
END {
    for (n = 1; n <= NR; n++) {   # NR in END is the total record count
        if (count[key[record[n]]] > 2) {
            print record[n]
        }
    }
}'
Sort first, and then use awk to print only when the 1st field shows up 3 or more times:
sort -n your_file | awk 'prev == $1 {count++; p0=p1; p1=p2; p2=$2}
    prev != $1 {prev=$1; count=1; p2=$2}
    count == 3 {print $1 " " p0; print $1 " " p1; print $1 " " p2}
    count > 3 {print $1 " " $2}'
This keeps awk from using too much memory on a big input file, since it only ever buffers a few values per group instead of the whole file.
Based on how the question looks and the Ray Toal edit, I'm guessing you mean matching based on count, so something like this works:
awk '!y[$1] {y[$1] = 1} x[$1] {if (y[$1]==1) {y[$1]=2; print $1, x[$1]}; print} {x[$1] = $2}'