AWK output problems

I wrote a script for getting the MEAN and the STDEV from a data file.
Let's say the data file has this data:
1
2
3
4
5
The awk script looks like this:
awk '{MEAN+=$1/5}END{print MEAN, STDEV=sqrt(($1-MEAN)**2/4)}' dat.dat>stat1.dat
but it gives me an incorrect value of STDEV=1. It should be 1.5811. Do you know what is incorrect in my script? How could I improve it?

You can do the same in one pass. (Incidentally, your script prints STDEV=1 because the END block sees only the last record: there $1 is 5 and MEAN is 3, so sqrt((5-3)**2/4) = 1.)
$ seq 5 | awk '{sum+=$1; sqsum+=$1^2}
END{mean=sum/NR;
print mean, sqrt((sqsum-NR*mean^2)/(NR-1))}'
3 1.58114
Note that this is the sample standard deviation (divide by N-1), which matches the 1.5811 you expect.
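The END expression relies on the identity sum((x-mean)^2) = sqsum - NR*mean^2, which is what makes a single pass sufficient. As a sanity check, here is the direct computation with the known mean of 3 hard-coded (a throwaway sketch, not something you'd use on unknown data):
$ seq 5 | awk '{d=$1-3; v+=d*d} END{print sqrt(v/(NR-1))}'
1.58114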

Could you please try the following and let me know if it helps. Note that it computes the mean and S.D. across the fields of each line, so it assumes your values sit on one line (it handles any number of fields), and it reports the population form of the standard deviation (dividing by N rather than N-1).
awk '{for(i=1;i<=NF;i++){sum+=$i};mean=sum?sum/NF:0;sum="";for(j=1;j<=NF;j++){$j=($j-mean)*($j-mean);sum+=$j};print "Mean=",mean", S.D=",sqrt(sum/NF)}' Input_file
Adding a non-one-liner form of the solution too:
awk '
{
for(i=1;i<=NF;i++){ sum+=$i };
mean=sum?sum/NF:0;
sum="";
for(j=1;j<=NF;j++){ $j=($j-mean)*($j-mean);
sum+=$j};
print "Mean=",mean", S.D=",sqrt(sum/NF)
}
' Input_file
EDIT: Adding code similar to the above, only with a kind of exception handling: if the sum of squared differences is zero, it prints 0 for the S.D.
awk '
{
for(i=1;i<=NF;i++){ sum+=$i };
mean=sum?sum/NF:0
sum="";
for(j=1;j<=NF;j++){ $j=($j-mean)*($j-mean);
sum+=$j};
val=sum?sqrt(sum/NF):0
print "Mean=",mean", S.D=",val
}
' Input_file
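To illustrate the above on a single line of input (the S.D. here is the population form, hence 1.41421 rather than the 1.58114 the question expects):
$ echo '1 2 3 4 5' | awk '{for(i=1;i<=NF;i++){sum+=$i};mean=sum?sum/NF:0;sum="";for(j=1;j<=NF;j++){$j=($j-mean)*($j-mean);sum+=$j};print "Mean=",mean", S.D=",sqrt(sum/NF)}'
Mean= 3, S.D= 1.41421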

Even though the title and tag say awk, I wanted to add that calculating the mean and stdev for a column of data can be easily accomplished with datamash:
seq 1 5 | datamash mean 1 sstdev 1
3 1.5811388300842
It may be off-topic here (and I realize that programming simple tasks like this in awk can be a good learning opportunity), but I think datamash deserves some attention, especially for straightforward calculations such as this one. The documentation lists all the functions it can perform, with good examples for files with many columns. It is a fast and reliable alternative. Hope it helps!
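For example, a file with several whitespace-separated columns could be summarized in one call; a quick sketch with made-up data (-W makes datamash split on whitespace instead of tabs):
$ printf '1 10\n2 20\n3 30\n4 40\n5 50\n' | datamash -W mean 1 sstdev 1 mean 2 sstdev 2
3 1.5811388300842 30 15.811388300842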

Here is a two-pass streamable version:
parse.awk
# First-pass: sum the numbers
FNR == NR { sum += $1; next }
# After first pass: determine sample size (N) and mean
# Note: runs only once because of the f flag
!f {
N = NR-1 # Number of samples
mean = sum/N # The mean of the samples
f = 1
}
# Second-pass: add the squares of the sample distance to mean
{ varsum += ($1 - mean)**2 }
END {
# Sample standard deviation
sstd = sqrt( varsum/(N-1) )
print "Sample std: " sstd
}
Run it like this for a file:
awk -f parse.awk file.dat{,}
Run it like this for streams:
awk -f parse.awk <(seq 5) <(seq 5)
Output in both cases:
Sample std: 1.58114
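For what it's worth, if numerical robustness matters (the sqsum form in the one-pass answer above can lose precision when the values are large and close together), Welford's online algorithm gets the same result in a single pass; a minimal sketch:
$ seq 5 | awk '{n++; d=$1-mean; mean+=d/n; M2+=d*($1-mean)}
END{print mean, sqrt(M2/(n-1))}'
3 1.58114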

Related

Bash: Finding average of entries from multiple columns after reading a CSV text file

I am trying to read a CSV text file and find the average of weekly hours (columns 3 through 7) spent by all user IDs (column 2) ending with an even number (2, 4, 6, ...).
The input sample is as below:
Computer ID,User ID,M,T,W,T,F
Computer1,User3,5,7,3,5,2
Computer2,User5,8,8,8,8,8
Computer3,User4,0,8,0,8,4
Computer4,User1,5,4,5,5,8
Computer5,User2,9,8,10,0,0
Computer6,User7,4,7,8,2,5
Computer7,User6,8,8,8,0,0
Computer8,User9,5,2,0,6,8
Computer9,User8,2,5,7,3,6
Computer10,User10,8,9,9,9,10
I have written the following script:
awk -F, '$2~/[24680]$/{for(i=3;i<=7;i++){a+=$i};printf "%s\t%.2g\n",$2,a/5;a=0}' user-list.txt > superuser.txt
The output of this script is:
User4 4
User2 5.4
User6 4.8
User8 4.6
User10 9
However, I want to change the script to print only one average across all user IDs ending with an even number.
The desired output for this would be as below (which is technically the average of all hours for the IDs ending with even numbers):
5.56
Any help would be appreciated.
TIA
Trying to fix the OP's attempt here, adding logic to compute the average of the per-user averages once the whole file has been read. Written on mobile, so I couldn't test it, but it should work if I have understood the OP's description correctly.
awk -F, '
$2~/[24680]$/{
count++
for(i=3;i<=7;i++){
sum+=$i
}
tot+=sum/5
sum=0
}
END{
print "Average of averages is: " (count?tot/count:"NaN")
}
' user-list.txt > superuser.txt
You may try:
awk -F, '$2 ~ /[02468]$/ {
for(i=3; i<=7; i++) {
s += $i
++n
}
}
END {
if (n)
printf "%.2f\n", s/n
}' cust.csv
5.56
awk -F, 'NR == 1 { next } { match($2,/[[:digit:]]+/);num=substr($2,RSTART,RLENGTH);if(num%2==0) { av+=($3+$4+$5+$6+$7)/5 } } END { printf "%.2f\n",av/5}' user-list.txt
Ignore the first header line. Pick the number out of the user ID with awk's match function and set the num variable to it. Check whether the number is even with num % 2. If it is, add the line's average to the variable av. At the end, print the average to 2 decimal places. Note that the final av/5 hard-codes the number of matching users; counting them in a variable would make the script general.
Print the daily average, for all even numbered user IDs:
#!/bin/sh
awk -F , '
(NR>1) &&
($2 ~ /[02468]$/) {
hours += ($3 + $4 + $5 + $6 + $7)
users++
}
END {
print (hours/users/5)
}' \
"$1"
Usage example:
$ script user-list
5.56
One way to test whether an integer is even or odd is the modulus operator (%), as in N % 2: for even values of N the expression evaluates to zero, and for odd values it evaluates to 1.
However, in this case a string operation would be required to extract the number anyway, so we may as well just use a single string match to decide odd or even.
Also, IMO, for 5 fields which are not going to change (days of the week), it's more succinct to just add them directly instead of using a loop. (NR>1) skips the title line too, in case there's a conflict.
Finally, you can of course swap /[02468]$/ for /[13579]$/ to get the same data for odd-numbered users.
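For completeness, here is a sketch of what the modulus-based approach described above might look like (it produces the same 5.56 on the sample data):
awk -F, 'NR > 1 && match($2, /[0-9]+$/) {
    num = substr($2, RSTART)          # trailing number of the user ID
    if (num % 2 == 0) {               # even-numbered user
        hours += $3 + $4 + $5 + $6 + $7
        users++
    }
}
END { if (users) printf "%.2f\n", hours / users / 5 }' user-list.txt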

Duplicate Lines 2 times and transpose from row to column

I would like to duplicate each line so it appears twice, printing the values of columns 5 and 6 separately (transposing the values of columns 5 and 6 from columns to rows) for each line:
I mean the value of column 5 on the first copy, and the value of column 6 on the second.
Input File
08,1218864123180000,3201338573,VV,22,27
08,1218864264864000,3243738789,VV,15,23
08,1218864278580000,3244738513,VV,3,13
08,1218864310380000,3243938789,VV,15,23
08,1218864324180000,3244538513,VV,3,13
08,1218864334380000,3200538561,VV,22,27
Desired Output
08,1218864123180000,3201338573,VV,22
08,1218864123180000,3201338573,VV,27
08,1218864264864000,3243738789,VV,15
08,1218864264864000,3243738789,VV,23
08,1218864278580000,3244738513,VV,3
08,1218864278580000,3244738513,VV,13
08,1218864310380000,3243938789,VV,15
08,1218864310380000,3243938789,VV,23
08,1218864324180000,3244538513,VV,3
08,1218864324180000,3244538513,VV,13
08,1218864334380000,3200538561,VV,22
08,1218864334380000,3200538561,VV,27
I use this code to duplicate the lines, but I can't figure out how to handle the values of columns 5 and 6:
awk '{print;print}' file
Thanks in advance
To repeatedly print the start of a line for each of the last N fields where N is 2 in this case:
$ awk -v n=2 '
BEGIN { FS=OFS="," }
{
base = $0
sub("("FS"[^"FS"]+){"n"}$","",base)
for (i=NF-n+1; i<=NF; i++) {
print base, $i
}
}
' file
08,1218864123180000,3201338573,VV,22
08,1218864123180000,3201338573,VV,27
08,1218864264864000,3243738789,VV,15
08,1218864264864000,3243738789,VV,23
08,1218864278580000,3244738513,VV,3
08,1218864278580000,3244738513,VV,13
08,1218864310380000,3243938789,VV,15
08,1218864310380000,3243938789,VV,23
08,1218864324180000,3244538513,VV,3
08,1218864324180000,3244538513,VV,13
08,1218864334380000,3200538561,VV,22
08,1218864334380000,3200538561,VV,27
In this simple case, where the last field has to be removed and then placed on a second copy of the line, you can do
awk -F , -v OFS=, '{ x = $6; NF = 5; print; $5 = x; print }'
Here -F , and -v OFS=, will set the input and output field separators to a comma, respectively, and the code does
{
x = $6 # remember sixth field
NF = 5 # Set field number to 5, so the last one won't be printed
print # print those first five fields
$5 = x # replace value of fifth field with remembered value of sixth
print # print modified line
}
This approach can be extended to handle fields in the middle with a function like the one in the accepted answer of this question; a sketch of such a helper appears at the end of this answer.
EDIT: As Ed notes in the comments, writing to NF is not explicitly defined to trigger a rebuild of $0 (the whole-line record that print prints) in the POSIX standard. The above code works with GNU awk and mawk, but with BSD awk (as found on *BSD and probably Mac OS X) it fails to do anything.
So to be standards-compliant, we have to be a little more explicit and force awk to rebuild $0 from the modified field state. This can be done by assigning to any of the field variables $1...$NF, and it's common to use $1=$1 when this problem pops up in other contexts (for example: when only the field separator needs to be changed but not any of the data):
awk -F , -v OFS=, '{ x = $6; NF = 5; $1 = $1; print; $5 = x; print }'
I've tested this with GNU awk, mawk and BSD awk (which are all the awks I can lay my hands on), and I believe this to be covered by the awk bit in POSIX where it says "setting any other field causes the re-evaluation of $0" right at the top. Mind you, the spec could be more explicit on this point, and I'd be interested to test if more exotic awks behave the same way.
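For reference, the kind of helper alluded to above might look roughly like this (a sketch; the name delfield is mine):
# Delete field number k from the current record and renumber the rest
function delfield(k,    i) {
    for (i = k; i < NF; i++)
        $i = $(i + 1)   # shift the following fields left
    NF--                # drop the now-duplicated last field
    $1 = $1             # force a rebuild of $0, per the portability note above
}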
Could you please try the following (considering that your Input_file is always in the format shown, and that you need to print the first four fields together with each of the remaining fields, one per line).
awk 'BEGIN{FS=OFS=","}{for(i=5;i<=NF;i++){print $1,$2,$3,$4,$i}}' Input_file
This might work for you (GNU awk):
awk '{print gensub(/((.*,).*),/,"\\1\n\\2",1)}' file
Replace the last comma with a newline followed by a copy of the leading fields (everything before the penultimate field), so the first output line keeps fields 1-5 and the second pairs fields 1-4 with the last field.

In a CSV file, subtotal 2 columns based on a third one, using AWK in KSH

Disclaimers:
1) English is my second language, so please forgive any grammatical horrors you may find. I am pretty confident you will be able to understand what I need despite these.
2) I have found several examples in this site that address questions/problems similar to mine, though I was unfortunately not able to figure out the modifications that would need to be introduced to fit my needs.
The "Problem":
I have a CSV file that looks like this:
c1,c2,c3,c4,c5,134.6,,c8,c9,SERVER1,c11
c1,c2,c3,c4,c5,0,,c8,c9,SERVER1,c11
c1,c2,c3,c4,c5,0.18,,c8,c9,SERVER2,c11
c1,c2,c3,c4,c5,0,,c8,c9,SERVER2,c11
c1,c2,c3,c4,c5,416.09,,c8,c9,SERVER3,c11
c1,c2,c3,c4,c5,0,,c8,c9,SERVER3,c11
c1,c2,c3,c4,c5,12.1,,c8,c9,SERVER3,c11
c1,c2,c3,c4,c5,480.64,,c8,c9,SERVER4,c11
c1,c2,c3,c4,c5,,83.65,c8,c9,SERVER5,c11
c1,c2,c3,c4,c5,,253.15,c8,c9,SERVER6,c11
c1,c2,c3,c4,c5,,18.84,c8,c9,SERVER7,c11
c1,c2,c3,c4,c5,,8.12,c8,c9,SERVER7,c11
c1,c2,c3,c4,c5,,22.45,c8,c9,SERVER7,c11
c1,c2,c3,c4,c5,,117.81,c8,c9,SERVER8,c11
c1,c2,c3,c4,c5,,96.34,c8,c9,SERVER9,c11
Complementary facts:
1) File has 11 columns.
2) The data in columns 1, 2, 3, 4, 5, 8, 9 and 11 is irrelevant in this case. In other words, I will only work with columns 6, 7 and 10.
3) Column 10 will be typically alphanumeric strings (server names), though it may contain also "-" and/or "_".
4) Columns 6 and 7 will have exclusively numbers, with up to two decimal places (a possible value is 0). Only one of the two will have data per line, never both.
What I need as an output:
- A single occurrence of every string in column 10 (as column 1), then the sum (subtotal) of its values in column 6 (as column 2) and, last, the sum (subtotal) of its values in column 7 (as column 3).
- If the total for a field is "0" the field must be left empty, but it still must exist (its respective comma has to be printed).
- **Note** that the strings in column 10 will already be alphabetically sorted, so there is no need to do that part of the processing with AWK.
Output sample, using the sample above as an input:
SERVER1,134.6,,
SERVER2,0.18,,
SERVER3,428.19,,
SERVER4,480.64,,
SERVER5,,83.65
SERVER6,,253.15
SERVER7,,26.96
I've already found within these pages not one but two AWK one-liners that PARTIALLY accomplish what I need:
awk -F "," 'NR==1{last=$10; sum=0;}{if (last != $10) {print last "," sum; last=$10; sum=0;} sum += $6;}END{print last "," sum;}' inputfile
awk -F, '{a[$10]+=$6;}END{for(i in a)print i","a[i];}' inputfile
My "problems" in both cases are the same:
- Subtotals of 0 are printed.
- I can only handle the sum of one column at a time. Whenever I try to add the second one, I get either a syntax error or it simply does not print the third column at all.
Thanks in advance for your support people!
Regards,
Martín
Something like this?
$ awk 'BEGIN{FS=OFS=","}
{s6[$10]+=$6; s7[$10]+=$7}
END{for(k in s6) print k,(s6[k]?s6[k]:""),(s7[k]?s7[k]:"")}' file | sort
SERVER1,134.6,
SERVER2,0.18,
SERVER3,428.19,
SERVER4,480.64,
SERVER5,,83.65
SERVER6,,253.15
SERVER7,,49.41
SERVER8,,117.81
SERVER9,,96.34
Note that your treatment of commas is not consistent: you're adding an extra one when the last value is zero (count the commas).
Your posted expected output doesn't seem to match your posted sample input so we're guessing but this might be what you're looking for:
$ cat tst.awk
BEGIN { FS=OFS="," }
$10 != prev {
if (NR > 1) {
print prev, sum6, sum7
}
sum6 = sum7 = ""
prev = $10
}
$6 { sum6 += $6 }
$7 { sum7 += $7 }
END { print prev, sum6, sum7 }
$ awk -f tst.awk file
SERVER1,134.6,
SERVER2,0.18,
SERVER3,428.19,
SERVER4,480.64,
SERVER5,,83.65
SERVER6,,253.15
SERVER7,,49.41
SERVER8,,117.81
SERVER9,,96.34

subtracting values in one column based on another column

I have input file as follows
100A 2000
100B 150
100C 800
100A 1000
100B 100
100C 300
I want to subtract the values in column 2 for each unique value in column 1,
so the output should look like:
100A 1000
100B 50
100C 500
I have tried:
awk '{if(!a[$1])a[$1]=$2; else a[$1]=$2-a[$1]}END{ for(i in a)print i" " a[i]}' file
but the output is:
100A 0
100B 0
100C 0
please advise
So many (slight) variations on the same theme.
awk '
!($1 in a) {a[$1]=$2; next}
{a[$1]-=$2}
END {for (i in a) printf "%s %d\n",i,a[i]}
' input.txt
Stack it up as a one-liner if you like.
Remember that an awk program consists of multiple condition { statement } pairs, so you can sometimes express your requirements more elegantly than by using an if..else. (Not saying that this is the case here - this is a simple enough awk script that it probably doesn't matter, unless you're a purist. :] )
Also, beware of testing for values the way you did in the if condition in your question. Note that a[$1] both tests whether the value at that array index is non-zero and causes the index to exist with a null value if it didn't previously exist. If you want to check for index existence, use $1 in a.
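A quick way to see that side effect in action: the mere reference to a["x"] in the condition creates the index, even though the test itself is false:
$ awk 'BEGIN { if (a["x"]) print "truthy"; for (k in a) print "index exists:", k }'
index exists: x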
Update based on a comment on your question...
If you want to subtract the last from the first entry, ignoring the ones in between, then you need to keep a record of both your firsts and your lasts. Something like this might suffice.
awk '
!($1 in a){a[$1]=$2;next}
{b[$1]=$2}
END {for(i in b)if(i in a)print i,a[i]-b[i]}
' input.txt
Note that as Ed mentioned, this produces output in random order. If you want the output ordered, you'll need an additional array to track the order. For example, this uses the order in which items are first seen:
awk '
!($1 in a) {
a[$1]=$2;
o[++n]=$1;
next
}
{
b[$1]=$2
}
END {
for (n=1;n<=length(o);n++)
print o[n],a[o[n]]-b[o[n]]
}
' input.txt
Note that the length() function used here to determine the number of elements in an array is not universal among dialects of awk, but it does work in both gawk and one-true-awk (used in FreeBSD and others). Alternatively, since the counter n already holds the element count when END runs, the loop could be written as for (i=1; i<=n; i++) and avoid calling length() on an array entirely.
This awk one-liner does the job:
awk '{if($1 in a)a[$1]=a[$1]-$2;else a[$1]=$2}
END{for(x in a) print x, a[x]}' file
In awk, using the conditional operator to choose between adding and subtracting, to keep it tight:
$ awk '{ a[$1]+=($1 in a?-$2:$2) } END{ for(i in a)print i, a[i] }' file
100A 1000
100B 50
100C 500
Explained:
{
a[$1]+=($1 in a?-$2:$2) # if $1 in a already, subtract from it
# otherwise add value to it
}
END {
for(i in a) # go thru all a
print i, a[i] # and print keys and values
}
Given the sample input you provided, all you need is:
$ awk '$1 in a{print $1, a[$1]-$2} {a[$1]=$2}' file
100A 1000
100B 50
100C 500
If that's not all you need then provide more truly representative sample input/output that includes the cases where that's not good enough.
You can use this awk (though note the caveat above about the bare a[$1] test: it misbehaves if the first value for a key is 0):
awk 'a[$1]{a[$1]=a[$1]-$2; next} {a[$1]=$2} END{for(v in a){print v, a[v]}}' file

AWK subtract numeric record two from record one

I cannot solve a very simple problem. My data file looks like:
Crap Crap 0.123456789D+09 Crap Crap
Crap Crap 0.123456798D+09 Crap Crap
I need to use AWK to subtract the number in the third column; the SECOND row minus the FIRST row.
I tried:
cat crap.txt | awk '{ A[NR-1] = $3 } END { print A[1] - A[0] }'
without success. Maybe the format of the number is wrong? (Can AWK read scientific notation with D instead of E?)
Help!
EDIT:
Just so the community knows: AWK does not understand scientific notation that uses D instead of E (as many Fortran outputs produce). It is necessary to replace the D with an E before carrying out any mathematical operation.
You could change the notation on the fly...
cat crap.txt | awk '{ sub(/D/,"E",$3); A[NR-1] = $3; } END { print A[1] - A[0] }'
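A quick check with the two sample rows from the question confirms it (0.123456798x10^9 - 0.123456789x10^9 = 9):
$ printf 'Crap Crap 0.123456789D+09 Crap Crap\nCrap Crap 0.123456798D+09 Crap Crap\n' | awk '{ sub(/D/,"E",$3); A[NR-1] = $3 } END { print A[1] - A[0] }'
9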
awk cannot read D-notation (what does it mean, by the way?). Here is an example:
printf "1.2D+1\n1.2E+1\n12\n" | awk '{ print $1/3; }'
0.4
4
4
The first one is wrong.