I am calculating the average and standard deviation for the 3rd column of a file. Now, without modifying this file, I would like to also calculate these values taking into account only those rows with a value higher than 0.
This is the command I am using:
awk '{sum+=$3; sumsq+=$3*$3} END {print "MEAN:",sum/NR; print "SD:",sqrt(sumsq/NR - (sum/NR)**2)}' myFile > mean.txt
Do you know how I can adapt it to also get the mean and SD taking into account only values higher than 0, as if those rows didn't exist?
This is the head of my file (and in the whole file no number is lower than 0):
A g1 10
B g6 5
C h7 3
D l8 0
F gg 1
T o7 0
O m7 33
My desired output (imagining that this is my whole file) is:
MEAN: 7.428 SD: 10.939
MEAN1: 10.4 SD1: 11.68
Thanks!
You can do it quite easily. In your rules before END you simply need to keep a counter of the number of rows where the value is zero (skipped below). Then in END compute an updated nr = NR - skipped and use that for your second print, e.g.
awk '
$3==0 { skipped++; next }
{ sum+=$3; sumsq+=$3*$3 }
END { nr = NR - skipped
print "MEAN:",sum/NR " SD:",sqrt(sumsq/NR - (sum/NR)**2)
print "MEAN:",sum/nr " SD:",sqrt(sumsq/nr - (sum/nr)**2)
}
' myFile
Example Use/Output
You can simply copy/middle-mouse paste in an xterm where myFile is in the current directory, e.g.:
$ awk '
> $3==0 { skipped++; next }
> { sum+=$3; sumsq+=$3*$3 }
> END { nr = NR - skipped
> print "MEAN:",sum/NR " SD:",sqrt(sumsq/NR - (sum/NR)**2)
> print "MEAN:",sum/nr " SD:",sqrt(sumsq/nr - (sum/nr)**2)
> }
> ' myFile
MEAN: 7.42857 SD: 10.9395
MEAN: 10.4 SD: 11.6893
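For reference, the expression in END is the one-pass population-variance identity (mean of the squares minus the square of the mean). With the sample data sum = 52 and sumsq = 1224, so:

MEAN  = 52/7 ≈ 7.42857,  SD  = sqrt(1224/7 - (52/7)^2) ≈ 10.9395
MEAN1 = 52/5 = 10.4,     SD1 = sqrt(1224/5 - (52/5)^2) ≈ 11.6893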
Let me know if that does what you need and if you have any further questions.
I would use GNU AWK the following way; for simplicity's sake I will deal with the mean only. Let file.txt content be
A g1 10
B g6 5
C h7 3
D l8 0
F gg 1
T o7 0
O m7 33
then
awk '{sum+=$3}$3>0{sum1+=$3;cnt1+=1}END{print "MEAN:",sum/NR,"MEAN1:",sum1/cnt1}' file.txt
output
MEAN: 7.42857 MEAN1: 10.4
Explanation: the sum computation for the whole data set remains unchanged. I added an action that is applied only when the 3rd column ($3) is greater than (>) zero (0); it uses a separate variable sum1 and a counter cnt1. Finally, I print both computed means.
(tested in gawk 4.2.1)
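If you also want the SD restricted to the non-zero rows, the same idea extends by keeping a second sum of squares alongside sum1 and cnt1; a minimal sketch along the same lines (assuming the same file.txt) could be

awk '{sum+=$3; sumsq+=$3*$3}
     $3>0{sum1+=$3; sumsq1+=$3*$3; cnt1+=1}
     END{print "MEAN:",sum/NR,"SD:",sqrt(sumsq/NR-(sum/NR)^2)
         print "MEAN1:",sum1/cnt1,"SD1:",sqrt(sumsq1/cnt1-(sum1/cnt1)^2)}' file.txt

which for the sample data should print MEAN: 7.42857 SD: 10.9395 and MEAN1: 10.4 SD1: 11.6893.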
I have the following piece of code:
awk '{h[$1]++}; END { for(k in h) print k, h[k]}' ${infile} >> ${outfile2}
This does part of what I want: it prints the unique values and counts how many times each unique value occurs. Now I also want to print the 2nd and 3rd columns for each unique value. For some reason the following attempts do not seem to work:
awk '{h[$1]++}; END { for(k in h) print k, $2, $3, h[k]}' ${infile} >> ${outfile2}
awk '{h[$1]++}; END { for(k in h) print k, h[$2], h[$3], h[k]}' ${infile} >> ${outfile2}
The first prints the last record's 2nd and 3rd columns for every key, whereas the second prints nothing except k and h[k].
${infile} would look like:
20600 33.8318 -111.9286 -1 0.00 0
20600 33.8318 -111.9286 -1 0.00 0
30900 33.3979 -111.8140 -1 0.00 0
29400 33.9455 -113.5430 -1 0.00 0
30600 33.4461 -111.7876 -1 0.00 0
20600 33.8318 -111.9286 -1 0.00 0
30900 33.3979 -111.8140 -1 0.00 0
30600 33.4461 -111.7876 -1 0.00 0
The desired output would be:
20600, 33.8318, -111.9286, 3
30900, 33.3979, -111.8140, 2
29400, 33.9455, -113.5430, 1
30600, 33.4461, -111.7876, 2
You were close, and you can do it all in awk. If you are going to store the count keyed on field 1 and also have fields 2 and 3 available to output in END, you need to store fields 2 and 3 in arrays indexed by field 1 (or whatever field you are keeping count of). For example you could do:
awk -v OFS=', ' '
{ h[$1]++; i[$1]=$2; j[$1]=$3 }
END {
for (a in h)
print a, i[a], j[a], h[a]
}
' infile
Here h[$1] holds the count of the number of times each field 1 value is seen, using field 1 as the array index. i[$1]=$2 captures field 2 indexed by field 1, and j[$1]=$3 captures field 3 indexed by field 1.
Then within END all that is needed is to output field 1 (a, the index into h), i[a] (field 2), j[a] (field 3), and finally h[a], the count of how many times that field 1 value was seen.
Example Use/Output
Using your example data, you can just copy/middle-mouse-paste the code at the terminal with the correct filename, e.g.
$ awk -v OFS=', ' '
> { h[$1]++; i[$1]=$2; j[$1]=$3 }
> END {
> for (a in h)
> print a, i[a], j[a], h[a]
> }
> ' infile
20600, 33.8318, -111.9286, 3
29400, 33.9455, -113.5430, 1
30600, 33.4461, -111.7876, 2
30900, 33.3979, -111.8140, 2
Which provides the desired output. You can also use string-concatenation to group fields 1, 2 & 3 into a single array index and then output the index and count (note that for (i in a) makes no ordering guarantee), e.g.
$ awk '{a[$1", "$2", "$3]++}END{for(i in a) print i ", " a[i]}' infile
20600, 33.8318, -111.9286, 3
30600, 33.4461, -111.7876, 2
29400, 33.9455, -113.5430, 1
30900, 33.3979, -111.8140, 2
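If you actually need the groups in first-seen input order (as in the desired output in the question), one sketch is to record the order in which each field 1 value first appears, e.g.

awk -v OFS=', ' '
!($1 in h) { order[++n] = $1 }          # remember first-seen order of field 1
{ h[$1]++; i[$1]=$2; j[$1]=$3 }
END {
    for (k = 1; k <= n; k++) {
        a = order[k]
        print a, i[a], j[a], h[a]
    }
}
' infile

With the sample infile this should print the groups in the 20600, 30900, 29400, 30600 order shown in the question.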
Look things over and let me know if you have further questions.
GNU datamash is a very handy tool for working on groups of columnar data in files; it makes this trivial to do.
Assuming your file uses tabs to separate columns like it appears to:
$ datamash -s --output-delimiter=, -g 1,2,3 count 3 < input.tsv
20600,33.8318,-111.9286,3
29400,33.9455,-113.5430,1
30600,33.4461,-111.7876,2
30900,33.3979,-111.8140,2
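If the columns are actually separated by runs of spaces rather than single tabs, datamash's -W option (treat whitespace as the field delimiter) should do the same job, assuming a reasonably recent datamash; input.txt here is a hypothetical name for the space-separated file:

datamash -W -s --output-delimiter=, -g 1,2,3 count 3 < input.txt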
Though it's not much more complicated in awk, using a multidimensional array:
$ awk 'BEGIN { OFS=SUBSEP="," }
{ group[$1,$2,$3]++ }
END { for (g in group) print g, group[g] }' input.tsv
29400,33.9455,-113.5430,1
30600,33.4461,-111.7876,2
20600,33.8318,-111.9286,3
30900,33.3979,-111.8140,2
If you want sorted output instead of unspecified order for this one: if using GNU awk, add PROCINFO["sorted_in"] = "@ind_str_asc" in the BEGIN block (as in the sketch below); otherwise pipe the output through sort.
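A minimal sketch of that GNU awk variant (the same program with only the sorting directive added to BEGIN), which should emit the groups in ascending key order:

gawk 'BEGIN { OFS = SUBSEP = ","; PROCINFO["sorted_in"] = "@ind_str_asc" }
      { group[$1,$2,$3]++ }
      END { for (g in group) print g, group[g] }' input.tsv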
You can also get the same effect by pipelining a bunch of utilities (including awk and uniq):
$ sort -k1,3n input.tsv | cut -f1-3 | uniq -c | awk -v OFS=, '{ print $2, $3, $4, $1 }'
20600,33.8318,-111.9286,3
29400,33.9455,-113.5430,1
30600,33.4461,-111.7876,2
30900,33.3979,-111.8140,2
I would like an Awk command where I can search a large file for columns which contain numbers both below 3 and above 5. It also needs to skip the first column.
e.g. for the following file
1 2 6 2
2 1 7 3
3 2 5 4
4 2 8 7
5 2 6 8
6 1 9 9
In this case, only column 4 is a match, as it is the only column with values above 5 and below 3 (except for column 1, which we skip).
Currently, I have this code:
awk '{for (i=2; i<=NF; i++) {if ($i < 3 && $i > 5) {print i}}}'
But this only reads one row at a time (so never makes a match). I want to search all of the rows, but I am unable to work out how this is done.
Ideally the output would simply be the column number. So for this example, simply '4'.
Many thanks.
Could you please try the following and let me know if this helps you.
awk '{for(i=1;i<=NF;i++){if($i<3){col[i]++};if($i>5){col1[i]++}}} END{for(j in col){if(col[j]>=1 && col1[j]>=1){print j}}}' Input_file
If you want to start searching from the second column, change i=1 to i=2 in the above code (a sketch of that variant follows the multi-line form below).
EDIT: Adding a non-one-liner form of the solution too now.
awk '
{
  for(i=1;i<=NF;i++){
    if($i<3){ col[i]++ }
    if($i>5){ col1[i]++ }
  }
}
END{
  for(j in col){
    if(col[j]>=1 && col1[j]>=1){ print j }
  }
}' Input_file
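For the sample in the question, the second-column variant mentioned above (the loop starting at i=2) should print just the column number 4, since column 1 is skipped and column 4 is the only remaining column with values both below 3 and above 5. A quick sketch, assuming the data is in Input_file:

awk '
{
  for(i=2;i<=NF;i++){
    if($i<3){ col[i]++ }
    if($i>5){ col1[i]++ }
  }
}
END{
  for(j in col){
    if(col[j]>=1 && col1[j]>=1){ print j }
  }
}' Input_file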
/begpat/, /endpat/ { action } in awk effectively corresponds to a closed line interval (both ends included) -- in mathematical notation [begpat, endpat].
What are standard awk ways of matching open line intervals ((begpat, endpat)) and half-closed line intervals ((begpat, endpat], [begpat,endpat))?
Example:
printf '%s\n' 0 1 2 3 4 5 | awk '/1/,/4/ { print $0 }' #prints 1 through 4 ([1,4])
What is a standard awk way of making it print intervals (1,4) (= [2,3]),
(1,4] (=[2,4]), and [1,4) (=[1,3]) without changing the endpoint patterns?
There is no "standard" way. You have to implement it yourself, for example by setting flag variables. After checking your Stack Overflow profile, I think you know how to do it.
An example for the (start, end) case:
/start/{doIt=1;next} /end/{doIt=0} doIt{doActionHere}
[..), (..] would be pretty similar.
Update
Since you gave an example, I append code for your example below (all examples assume the start and end patterns are different):
[1,4]
kent$ seq 0 5|awk '/1/,/4/'
1
2
3
4
(1,4)
kent$ seq 0 5|awk '/1/{f=1;next}/4/{f=0}f'
2
3
(1,4]
kent$ seq 0 5|awk '/1/{f=1;next}f;/4/{f=0}'
2
3
4
[1,4)
kent$ seq 0 5|awk '/1/{f=1}/4/{f=0}f'
1
2
3
I have text files that each have a single column of numbers:
2
3
4
I want to duplicate the second line n times, where n is the number in the first row, so the output looks like this:
3
3
I've done something similar in awk but can't seem to figure out this specific example.
$ awk 'NR==1{n=$1;} NR==2{for (i=1;i<=n;i++) print; exit;}' file
3
3
How it works
NR==1{n=$1;}
When we reach the first row, save the number in variable n.
NR==2{for (i=1;i<=n;i++) print; exit;}
When we reach the second row, print it n times and exit.
just for fun
{ read c; read d; } < <(head -2 file); yes "$d" | head -n "$c"
Extract the first two rows and assign them to c and d; then repeat $d forever but keep only the first $c rows.
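With the sample file above, c ends up as 2 and d as 3, so this should print:
3
3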