I have a large data file with multiple ID columns followed by several columns of observations. I need to average the observations over one of the ID columns. I think this can be done using awk, but I'm not sure how to set it up.
Data:
ID1 ID2 Observation
1 15_24 -0.00002649
2 15_24 0.00001584
3 15_24 -0.00003168
1 16_2 0.00002649
2 16_2 -0.00002014
3 16_2 -0.00003058
1 12_25 0.00009636
2 12_25 -0.00007514
3 12_25 0.00003021
I need the observations averaged over ID2, like this:
1 15_24 -0.00001411
2 16_2 -0.00000808
3 12_25 0.00001714
Thank you.
Maybe something like this:
awk 'BEGIN{ FS=" " } { cnt[$2] += $3; lincnt[$2] +=1; } END{i=1; for (x in cnt){print i++, x, (cnt[x] /lincnt[x] ) } }' file
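Note that the sample data starts with a header line (ID1 ID2 Observation); if the real file has that header too, it ends up as a bogus group in the averages. A minimal variant that skips it, assuming the header is exactly one line:
awk 'NR>1 { cnt[$2] += $3; lincnt[$2] += 1 } END { i=1; for (x in cnt) print i++, x, cnt[x]/lincnt[x] }' file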
If ordering is relevant, this awk script could help:
#!/usr/bin/awk -f
# Same ID2 as the previous line: keep accumulating.
lastItem==$2{
    observation+=$3
    observationCounter+=1
    next
}
# ID2 changed: print the average for the group that just ended.
observationCounter>0{
    print ++i" "lastItem" "observation/observationCounter
}
# Start a new group.
{
    lastItem=$2
    observation=$3
    observationCounter=1
}
# Flush the last group.
END{
    print ++i" "lastItem" "observation/observationCounter
}
I have a file with ~10,000 columns and ~20,000 rows in the following format:
Id A_1 A_2 A_3 B_2 B_5 C_1
T1 0 1 1 6 1 0
T2 1 1 1 0 0 1
T3 2 0 3 1 1 5
T4 1 1 1 2 3 1
In the header row, the 1st column is the id. From the 2nd column onward are the sample names in the format sampleName_batch#. Now, for each id, I want to add up all the values belonging to the same sampleName and have the sampleName and the summed value in the output. My expected output is:
Id A B C
T1 2 7 0
T2 3 0 1
T3 5 2 5
T4 3 5 1
I have come across this answer https://unix.stackexchange.com/questions/569615/combine-columns-in-one-file-by-matching-header but I don't know how to modify the whole header row.
Thanks
I am trying to edit the solution mentioned in the OP's cross-site post; it is a small tweak and all other lines are the same as in that answer. I am nowhere near "THE Ed Morton" in awk knowledge, so, humbly taking his permission (I hope he is ok with it), I am editing his great solution from the cross site. Could you please try the following.
awk '
NR==1 {
    # Header row: strip the _batch# suffix and map each input column
    # to an output column named by the sample name.
    for (inFldNr=2; inFldNr<=NF; inFldNr++) {
        sub(/_.*/,"",$inFldNr)
        fldName = $inFldNr
        if ( !(fldName in fldName2outFldNr) ) {
            outFldNr2name[++numOutFlds] = fldName
            fldName2outFldNr[fldName] = numOutFlds
        }
        outFldNr = fldName2outFldNr[fldName]
        out2inFldNrs[outFldNr,++numInFlds[outFldNr]] = inFldNr
    }
    printf "%s%s", $1, OFS
    for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
        outFldName = outFldNr2name[outFldNr]
        printf "%s%s", outFldName, (outFldNr<numOutFlds ? OFS : ORS)
    }
    next
}
{
    # Data rows: print the id, then the sum of the input columns
    # that map to each output column.
    printf "%s%s", $1, OFS
    for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
        sum = 0
        for (inFldIdx=1; inFldIdx<=numInFlds[outFldNr]; inFldIdx++) {
            inFldNr = out2inFldNrs[outFldNr,inFldIdx]
            sum += $inFldNr
        }
        printf "%s%s", sum, (outFldNr<numOutFlds ? OFS : ORS)
    }
}
' Input_file
What is added to Ed's existing code:
A sub() call to strip everything from the _ onward in the first line (the header).
Removed \t as the delimiter since the OP's samples are NOT tab delimited.
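To see what that sub() does to the header names, here is a small stand-alone check using the question's header row (the loop mirrors the one in the NR==1 block):
$ echo "Id A_1 A_2 A_3 B_2 B_5 C_1" | awk '{ for (i=2; i<=NF; i++) sub(/_.*/,"",$i); print }'
Id A A A B B C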
I have the following piece of code:
awk '{h[$1]++}; END { for(k in h) print k, h[k]}' ${infile} >> ${outfile2}
This does part of what I want: it prints the unique values and counts how many times each occurs. Now I also want to print the 2nd and 3rd columns for each unique value. For some reason, neither of the following seems to work:
awk '{h[$1]++}; END { for(k in h) print k, $2, $3, h[k]}' ${infile} >> ${outfile2}
awk '{h[$1]++}; END { for(k in h) print k, h[$2], h[$3], h[k]}' ${infile} >> ${outfile2}
The first prints the last record's 2nd and 3rd columns for every key, whereas the second prints nothing except k and h[k].
${infile} would look like:
20600 33.8318 -111.9286 -1 0.00 0
20600 33.8318 -111.9286 -1 0.00 0
30900 33.3979 -111.8140 -1 0.00 0
29400 33.9455 -113.5430 -1 0.00 0
30600 33.4461 -111.7876 -1 0.00 0
20600 33.8318 -111.9286 -1 0.00 0
30900 33.3979 -111.8140 -1 0.00 0
30600 33.4461 -111.7876 -1 0.00 0
The desired output would be:
20600, 33.8318, -111.9286, 3
30900, 33.3979, -111.8140, 2
29400, 33.9455, -113.5430, 1
30600, 33.4461, -111.7876, 2
You were close, and you can do it all in awk. If you are going to store the count keyed on field 1 and also want fields 2 and 3 available in END for output, you need to store fields 2 and 3 in arrays indexed by field 1 (or whatever field you are counting). For example you could do:
awk -v OFS=', ' '
{ h[$1]++; i[$1]=$2; j[$1]=$3 }
END {
for (a in h)
print a, i[a], j[a], h[a]
}
' infile
Here h[$1] holds the number of times each field 1 value is seen, using field 1 as the array index. i[$1]=$2 captures field 2 indexed by field 1, and j[$1]=$3 captures field 3 indexed by field 1.
Then within END all that is needed is to output field 1 (a, the index of h), i[a] (field 2), j[a] (field 3), and finally h[a], the number of times that field 1 value was seen.
Example Use/Output
Using your example data, you can just copy/middle-mouse-paste the code at the terminal with the correct filename, e.g.
$ awk -v OFS=', ' '
> { h[$1]++; i[$1]=$2; j[$1]=$3 }
> END {
> for (a in h)
> print a, i[a], j[a], h[a]
> }
> ' infile
20600, 33.8318, -111.9286, 3
29400, 33.9455, -113.5430, 1
30600, 33.4461, -111.7876, 2
30900, 33.3979, -111.8140, 2
Which provides the desired output. You can also use string concatenation to group fields 1, 2 & 3 into a single array index and then output the index and count, e.g. (note that the for (i in a) traversal order is unspecified, so this on its own does not preserve the input order):
$ awk '{a[$1", "$2", "$3]++}END{for(i in a) print i ", " a[i]}' infile
20600, 33.8318, -111.9286, 3
30600, 33.4461, -111.7876, 2
29400, 33.9455, -113.5430, 1
30900, 33.3979, -111.8140, 2
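If you do need the groups in the order they first appear in the input (as in the desired output in the question), one way is to record that first-seen order in a separate array; a minimal sketch:
awk -v OFS=', ' '
{ key = $1 OFS $2 OFS $3 }
!(key in cnt) { order[++n] = key }   # remember each group the first time it appears
{ cnt[key]++ }
END { for (k = 1; k <= n; k++) print order[k], cnt[order[k]] }
' infile
With the sample data this prints the four groups in the same order as the desired output shown in the question.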
Look things over and let me know if you have further questions.
GNU datamash is a very handy tool for working on groups of columnar data in files that makes this trivial to do.
Assuming your file uses tabs to separate columns like it appears to:
$ datamash -s --output-delimiter=, -g 1,2,3 count 3 < input.tsv
20600,33.8318,-111.9286,3
29400,33.9455,-113.5430,1
30600,33.4461,-111.7876,2
30900,33.3979,-111.8140,2
Though it's not much more complicated in awk, using a multi-dimensional array:
$ awk 'BEGIN { OFS=SUBSEP="," }
{ group[$1,$2,$3]++ }
END { for (g in group) print g, group[g] }' input.tsv
29400,33.9455,-113.5430,1
30600,33.4461,-111.7876,2
20600,33.8318,-111.9286,3
30900,33.3979,-111.8140,2
If you want sorted output instead of an arbitrary order for this one: with GNU awk, add PROCINFO["sorted_in"] = "@ind_str_asc" to the BEGIN block; otherwise pipe the output through sort.
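For instance, the GNU-awk-only version is the same script with that one extra assignment in the BEGIN block (a sketch, assuming a gawk recent enough to support sorted_in):
gawk 'BEGIN { OFS = SUBSEP = ","; PROCINFO["sorted_in"] = "@ind_str_asc" }
      { group[$1,$2,$3]++ }
      END { for (g in group) print g, group[g] }' input.tsv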
You can also get the same effect by pipelining a bunch of utilities (including awk and uniq):
$ sort -k1,3n input.tsv | cut -f1-3 | uniq -c | awk -v OFS=, '{ print $2, $3, $4, $1 }'
20600,33.8318,-111.9286,3
29400,33.9455,-113.5430,1
30600,33.4461,-111.7876,2
30900,33.3979,-111.8140,2
I would like an Awk command where I can search a large file for columns which contain numbers both below 3 and above 5. It also needs to skip the first column.
e.g. for the following file
1 2 6 2
2 1 7 3
3 2 5 4
4 2 8 7
5 2 6 8
6 1 9 9
In this case, only column 4 is a match, as it is the only column with values both below 3 and above 5 (apart from column 1, which we skip).
Currently, I have this code:
awk '{for (i=2; i<=NF; i++) {if ($i < 3 && $i > 5) {print i}}}'
But this tests each value on its own, one row at a time, so the combined condition $i < 3 && $i > 5 can never be true for a single value and it never makes a match. I want the test to span all of the rows, but I am unable to work out how this is done.
Ideally the output would simply be the column number. So for this example, simply '4'.
Many thanks.
Could you please try the following and let me know if this helps you.
awk '{for(i=1;i<=NF;i++){if($i<3){col[i]++};if($i>5){col1[i]++}}} END{for(j in col){if(col[j]>=1 && col1[j]>=1){print j}}}' Input_file
If you want to start searching from the second column, then change i=1 to i=2 in the above code.
EDIT: Adding a non-one-liner form of the solution too now.
awk '
{
    for (i=1; i<=NF; i++) {
        if ($i < 3) { col[i]++ }
        if ($i > 5) { col1[i]++ }
    }
}
END {
    for (j in col) {
        if (col[j] >= 1 && col1[j] >= 1) { print j }
    }
}' Input_file
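As a quick check, starting from the second column (i=2) and with the question's sample saved as cols.txt (a hypothetical name), only column 4 qualifies:
$ awk '{for(i=2;i<=NF;i++){if($i<3){col[i]++};if($i>5){col1[i]++}}} END{for(j in col){if(col[j]>=1 && col1[j]>=1){print j}}}' cols.txt
4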
I want to get the average over a certain number of rows; in this case that number is dictated by the second column:
-1 1 22.776109913596883 0.19607208141710716
-1 1 4.2985901827923954 1.0388892840309705
-1 1 4.642271812306717 0.96197712195674756
-1 2 2.8032298255711794 1.5930763994471333
-1 2 2.9358628368936479 1.5211062387604053
-1 2 4.9987168801017106 0.8933811184867273
1 4 2.6211673161014915 1.7037291934441456
1 4 4.483831056393683 0.99596956735821618
1 4 9.7189442154485732 0.4594901646050486
The expected output would be
-1 1 0.732313
-1 2 1.33585
1 4 1.05306
I have done
awk '{sum+=$4} (NR%3)==0 {print $1,$2,sum/3;sum=0;}' test
which works, but I would like to (somehow) generalize (NR%3)==0 so that awk notices when the value of the second column has changed and therefore knows it needs to start a new average. For instance, the first three rows have 1 in the second column, so once that 1 changes to 2, a new average needs to be calculated.
Does this make sense?
Try something like:
awk '{sum[$2] += $4; count[$2] += 1; }
END { for (k in sum) { print k " " sum[k]/count[k]; } }'
Not tested but that is the idea...
With this method, the whole computation is printed at the end; that may not be what you want if the input is an infinite stream, but judging from your example it should be fine.
If you also want to keep the first column, you can do it with the same approach, as sketched below.
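For instance, by remembering the first column value seen for each key (a minimal sketch in the same spirit, not the original poster's code):
awk '{ sum[$2] += $4; count[$2] += 1; first[$2] = $1 }
     END { for (k in sum) print first[k], k, sum[k]/count[k] }' test
With the sample data this prints the three averaged lines from the expected output, though possibly in a different order, since for (k in sum) does not guarantee one.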
You can also try this (note that the per-group count has to be tracked separately; dividing by length(array) would divide by the number of distinct groups, which only happens to equal the group size here):
awk '{sum[$1" "$2]+=$4; cnt[$1" "$2]++} END { for (i in sum) {print i" "sum[i]/cnt[i]}}' test | sort -n
Test:
$ awk '{sum[$1" "$2]+=$4; cnt[$1" "$2]++} END { for (i in sum) {print i" "sum[i]/cnt[i]}}' test | sort -n
-1 1 0.732313
-1 2 1.33585
1 4 1.05306
I would like to print, grouped by the 2nd column: the count of line items, the sum of the 3rd column, and the number of unique values of the first column. I have around 100 InputTest files and they are not sorted.
I am using the 3 commands below to achieve the desired output and would like to know the simplest way.
InputTest*.txt
abc,xx,5,sss
abc,yy,10,sss
def,xx,15,sss
def,yy,20,sss
abc,xx,5,sss
abc,yy,10,sss
def,xx,15,sss
def,yy,20,sss
ghi,zz,10,sss
Step#1:
cat InputTest*.txt | awk -F, '{key=$2;++a[key];b[key]=b[key]+$3} END {for(i in a) print i","a[i]","b[i]}'
Op#1
xx,4,40
yy,4,60
zz,1,10
Step#2
awk -F ',' '{print $1,$2}' InputTest*.txt | sort | uniq >Op_UniqTest2.txt
Op#2
abc xx
abc yy
def xx
def yy
ghi zz
Step#3
awk '{print $2}' Op_UniqTest2.txt | sort | uniq -c
Op#3
2 xx
2 yy
1 zz
Desired Output:
xx,4,40,2
yy,4,60,2
zz,1,10,1
Looking for suggestions!
BEGIN { FS = OFS = "," }
{ ++lines[$2]; if (!seen[$2,$1]++) ++diff[$2]; count[$2]+=$3 }
END { for(i in lines) print i, lines[i], count[i], diff[i] }
lines tracks the number of occurrences of each value in column 2
seen records unique combinations of the second and first columns, incrementing diff[$2] whenever a new combination is found. The ++ after seen[$2,$1] means the condition is only true the first time a combination is seen: the post-increment raises seen[$2,$1] to 1, so !seen[$2,$1] is false on later occurrences.
count keeps a running total of the third column for each column-2 value
Save the program above in a file (avn.awk here) and run it on the input:
$ awk -f avn.awk file
xx,4,40,2
yy,4,60,2
zz,1,10,1
Using awk:
$ awk '
BEGIN { FS = OFS = "," }
{ keys[$2]++; sum[$2]+=$3 } !seen[$1,$2]++ { count[$2]++ }
END { for(key in keys) print key, keys[key], sum[key], count[key] }
' file
xx,4,40,2
yy,4,60,2
zz,1,10,1
Set the input and output field separators to , in the BEGIN block. We use the keys array to identify and count keys. The sum array keeps the sum for each key. count lets us keep track of the number of unique column 1 values for each column 2 value.
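Since the question mentions around 100 unsorted InputTest files, note that awk accepts all of them in one invocation, so neither cat nor pre-sorting is needed; for example:
awk 'BEGIN { FS = OFS = "," } { keys[$2]++; sum[$2]+=$3 } !seen[$1,$2]++ { count[$2]++ } END { for(key in keys) print key, keys[key], sum[key], count[key] }' InputTest*.txt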