Count number of occurrences of a number larger than x in every row - awk

I have a file with multiple rows and 26 columns. I want to count the number of occurrences of values that are higher than 0 (counting values different from 0 would also work) in each row, excluding the first two columns. The file looks like this:
X Y Sample1 Sample2 Sample3 .... Sample24
a a1 0 7 0 0
b a2 2 8 0 0
c a3 0 3 15 3
d d3 0 0 0 0
I would like to have an output file like this:
X Y Result
a a1 1
b a2 2
c a3 3
d d3 0
awk or sed would be good.
I saw a similar question but in that case the columns were summed and the desired output was different.

awk 'NR==1{printf "X\tY\tResult%s",ORS} # print the header
NR>1{
count=0; # reset the count for each row
for(i=3;i<=NF;i++){ # iterate from field 3 to the last field; NF is the number of fields
if($i>0){ # $i expands to $3, $4 and so on, which are the fields
count++; # increment when the condition is true
}
};
printf "%s\t%s\t%s%s",$1,$2,count,ORS # print the output for each row
}' file
should do that
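For the sample input above, the output should look like this (tab-separated):
X	Y	Result
a	a1	1
b	a2	2
c	a3	3
d	d3	0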

Another awk, relying on the fact that a comparison such as $i>0 evaluates to 1 or 0 and can therefore be summed directly:
$ awk '{if(NR==1) c="Result";
else for(i=3;i<=NF;i++) c+=($i>0);
print $1,$2,c; c=0}' file | column -t
X Y Result
a a1 1
b a2 2
c a3 3
d d3 0

A gsub one-liner also works here: gsub returns the number of substitutions it made, so deleting every occurrence of a space followed by a nonzero digit counts the positive fields (this assumes non-negative integer values, and that the first two columns cannot themselves match):
$ awk '{print $1, $2, (NR>1 ? gsub(/ [1-9]/,"") : "Result")}' file
X Y Result
a a1 1
b a2 2
c a3 3
d d3 0

Related

AWK: print all rows with the max value in one field per the other field, including identical rows with the max value

I am trying to keep the rows with the highest value in column 2 per column 1, including identical rows with the max value, as in the desired output below.
Data is
a 55
a 66
a 130
b 88
b 99
b 99
c 110
c 130
c 130
Desired output is
a 130
b 99
b 99
c 130
c 130
I could find great answers on this site, but not exactly for the current question.
awk '{ max=(max>$2?max:$2); arr[$2]=(arr[$2]?arr[$2] ORS:"")$0 } END{ print arr[max] }' file
yields output which includes the identical rows, but the max value is taken over all rows, not per column 1.
a 130
c 130
c 130
awk '$2>max[$1] {max[$1]=$2 ; row[$1]=$0} END{for (i in row) print row[i]}' file
The output includes the max value per column 1 but does NOT include identical rows with the max value.
a 130
b 99
c 130
Would you please help me trim the data in the desired way? All of the code above was obtained from questions and answers on this site. I appreciate that! Many thanks in advance!
I've used this approach in the past:
awk 'NR==FNR{if($2 > max[$1]){max[$1]=$2}; next} max[$1] == $2' test.txt test.txt
a 130
b 99
b 99
c 130
c 130
This requires you to pass in the same file twice (i.e. awk '...' test.txt test.txt), so it's not ideal, but hopefully it provides the required output with your actual data.
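If reading the file twice is a concern, a single-pass variant (my sketch, not from the original answer: it buffers the current max rows per key in memory) could be:
awk '{ if (!($1 in max) || $2+0 > max[$1]+0) { max[$1]=$2; rows[$1]=$0 }  # new max for this key: reset the buffer
       else if ($2+0 == max[$1]+0) rows[$1] = rows[$1] ORS $0 }           # tie with the max: append the row
     END { for (k in rows) print rows[k] }' test.txt
Like the other END-loop answers here, the order of the keys in the output is not guaranteed.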
Using any awk (note that this relies on the values within each group appearing in ascending order, so that the last value stored for a key is its max):
awk '
{ cnt[$1,$2]++; max[$1]=$2 }
END { for (key in max) { val=max[key]; for (i=1; i<=cnt[key,val]; i++) print key, val } }
' file
a 130
b 99
b 99
c 130
c 130
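If the values within a group are not guaranteed to arrive in ascending order, a small variation (my sketch along the same lines) tracks the max explicitly instead of relying on the last assignment:
awk '
{ cnt[$1,$2]++; if (!($1 in max) || $2+0 > max[$1]+0) max[$1]=$2 }
END { for (key in max) { val=max[key]; for (i=1; i<=cnt[key,val]; i++) print key, val } }
' file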
Here is a ruby to do that:
ruby -e '
grps=$<.read.split(/\R/).
group_by{|line| line[/^\S+/]}
# {"a"=>["a 55", "a 66", "a 130"], "b"=>["b 88", "b 99", "b 99"], "c"=>["c 110", "c 130", "c 130"]}
maxes=grps.map{|k,v| v.max_by{|s| s.split[-1].to_f}}
# ["a 130", "b 99", "c 130"]
grps.values.flatten.each{|s| puts s if maxes.include?(s)}
' file
Prints:
a 130
b 99
b 99
c 130
c 130
Another way using awk. The second loop should be cheap, since it just repeats the duplicated max values.
% awk 'arr[$1] < $2{arr[$1] = $2; # get max value
co[$1]++; if(co[$1] == 1){x++; id[x] = $1}} # count unique ids
arr[$1] == $2{n[$1,arr[$1]]++} # count repeated max
END{for(i=1; i<=x; i++){
for(j=1; j<=n[id[i],arr[id[i]]]; j++){print id[i], arr[id[i]]}}}' file
a 130
b 99
b 99
c 130
c 130
or, if order doesn't matter
% awk 'arr[$1] < $2{arr[$1] = $2}
arr[$1] == $2{n[$1,arr[$1]]++}
END{for(i in arr){
j=0; do{print i, arr[i]; j++} while(j < n[i,arr[i]])}}' file
c 130
c 130
b 99
b 99
a 130
-- EDIT --
Printing data in additional columns
% awk 'arr[$1] < $2{arr[$1] = $2}
arr[$1] == $2{n[$1,arr[$1]]++; line[$1,arr[$1],n[$1,arr[$1]]] = $0}
END{for(i in arr){
j=0; do{j++; print line[i,arr[i],j]} while(j < n[i,arr[i]])}}' file
c 130 data8
c 130 data9
b 99 data5
b 99 data6
a 130 data3
Data
% cat file
a 55 data1
a 66 data2
a 130 data3
b 88 data4
b 99 data5
b 99 data6
c 110 data7
c 130 data8
c 130 data9

How to replace other column-entries when searching for a specific column in a file?

Suppose my file looks like this:
A 1 0
B 1 0
C 1 0
How can I search for the line that has B in the first column and, if found, switch the entries in the second and third columns? My final result would look like:
A 1 0
B 0 1
C 1 0
Try this:
vipin#kali:~$ awk '{if($1 == "B") {print $1,$3,$2} else print $1,$2,$3}' kk
A 1 0
B 0 1
C 1 0
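A slightly shorter variant of the same idea (my sketch, but standard awk) swaps the two fields in place, which also preserves any columns beyond the third:
awk '$1 == "B" { t = $2; $2 = $3; $3 = t } 1' kk
The trailing 1 is a condition that is always true, so every (possibly modified) line is printed.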

average of specific rows in a file

I have 6 rows in a file. I need to find the average of only specific rows in the file, leaving the others as they are. The average should be calculated for A1 and A2, and for B1 and B2; the other lines should stay as they are.
Input:
A1 1 1 2
A2 5 6 1
A3 1 1 1
B1 10 12 12
B2 10 12 10
B3 100 200 300
Output:
A1A2 3 3.5 1.5
A3 1 1 1
B1B2 10 12 11
B3 100 200 300
EDIT: There are n columns in total
awk to the rescue!
$ awk '/[AB][12]/{a=substr($1,1,1);
k=a"1"a"2";
c1[k]+=$2; c2[k]+=$3; c3[k]+=$4; n[k]++; next}
1;
END{for(k in c1)
print k, c1[k]/n[k], c2[k]/n[k], c3[k]/n[k]}' file | sort | column -t
A1A2 3 3.5 1.5
A3 1 1 1
B1B2 10 12 11
B3 100 200 300
Pattern-match the grouped rows, create a key, and accumulate the sum of every field and the count of rows per key; print unmatched rows as they are; when done, print the averaged rows. Since the order is not preserved, sort and pipe to column for easy formatting.
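Since the EDIT says there are n columns in total, a generalized sketch of the same approach (my assumption of how that would look: loop over all fields instead of hard-coding three sum arrays) might be:
$ awk '/^[AB][12]/{a=substr($1,1,1); k=a"1"a"2"
         for(i=2;i<=NF;i++) sum[k,i]+=$i
         n[k]++; nf=NF; next}
       1;
       END{for(k in n){
             out=k
             for(i=2;i<=nf;i++) out=out OFS sum[k,i]/n[k]
             print out}}' file | sort | column -t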
$ cat tst.awk
$1 ~ /^[AB]1$/ { for (i=2;i<=NF;i++) val[$1,i]=$i; next }
$1 ~ /^[AB]2$/ { p=$1; sub(/2/,"1",p); $1=p $1; for (i=2;i<=NF;i++) $i=($i + val[p,i])/2 }
{ print }
$ awk -f tst.awk file | column -t
A1A2 3 3.5 1.5
A3 1 1 1
B1B2 10 12 11
B3 100 200 300

How to do a join using awk

Here is my Input file
Identifier Relation
A 1
A 2
A 3
B 2
B 3
C 1
C 2
C 3
I want to join this file to itself based on the "Relation" field.
Sample Output file
A 1 C 1
A 2 B 2
A 2 C 2
B 2 C 2
A 3 B 3
A 3 C 3
B 3 C 3
I used the following awk script:
awk 'NR==FNR {a[NR]=$0; next} { for (k in a) if (a[k]~$2) print a[k],$0}' input input > output
However, I had to add another awk step to delete lines that joined with themselves, i.e., A 1 A 1; B 2 B 2; etc.
The second issue is that it prints both directions of the join, so
A 1 C 1 is printed along with C 1 A 1 on another line.
Both lines describe the same relation, and I would NOT like to include both. I want to see just one or the other, i.e., "A 1 C 1" or "C 1 A 1", not both.
Any suggestions/directions are highly appreciated.
An alternative solution using join and sort alongside awk (file{,} is brace expansion for file file, so the sorted file is joined with itself):
$ join -j 2 <(sort -k2 -k1,1 file){,} |
    awk '$2!=$3 && !($3 FS $2 in a){a[$2 FS $3]; print $2,$1,$3,$1}'
A 1 C 1
A 2 B 2
A 2 C 2
B 2 C 2
A 3 B 3
A 3 C 3
B 3 C 3
Create the cross product, then eliminate the diagonal and one of each symmetric pair.
Here is an awk-only solution:
awk 'NR>1{ar[$2]=(ar[$2]$1)}
END{ for(key in ar){
for(i=1; i<length(ar[key]); i++) {
for(j=i+1; j<length(ar[key])+1; j++) {
print substr(ar[key],i,1), key, substr(ar[key],j,1), key
}
}
}}' infile
Each number in the second column of the input serves as a key of an awk-array. The value of the corresponding array-element is a sequence of first-column letters (e.g., array[1]=ABC).
Then we build all two-letter combinations for each sequence (e.g., "ABC" gives "AB", "AC", and "BC"); a variant that handles multi-character identifiers is sketched after the notes below.
Output:
A 1 C 1
A 2 B 2
A 2 C 2
B 2 C 2
A 3 B 3
A 3 C 3
B 3 C 3
Note:
If a number occurs only once, no output is generated for this number.
The order of the output depends on the order of the input (no sorting of letters!). That is, if the second input line were C 1, then array[1]="CAB" and the first output line would be C 1 A 1.
First line of input is ignored due to NR>1
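Note also that substr(ar[key],i,1) assumes the first-column identifiers are single characters. A variant that copes with longer identifiers could store them in a per-key array instead (my sketch, same pairing logic):
awk 'NR>1{ item[$2,++cnt[$2]]=$1 }
END{ for (key in cnt)
       for (i=1; i<cnt[key]; i++)
         for (j=i+1; j<=cnt[key]; j++)
           print item[key,i], key, item[key,j], key }' infile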
There is surely a solution with awk only, but I'm going to propose a solution using awk and sort because I think it's quite simple and does not require storing the entire file content in awk variables. The idea is as follows:
rewrite the input file so that the "relation" field is first (A 1 -> 1 A)
use sort -n to put together all lines with same "relation"
use awk to combine consecutive lines having the same "relation"
That would translate to something like:
awk '{print $2 " " $1}' input | sort -n |
awk '{if ($1==lastsel)printf " "; else if(lastsel) printf "\n"; lastsel=$1; printf "%s %s", $2, $1;}END{if(lastsel)printf"\n"}'
A 1 C 1
A 2 B 2 C 2
A 3 B 3 C 3
EDIT: If you want only one i-j relation per line:
awk '{print $2 " " $1}' input | sort -n |
awk '$1!=rel{rel=$1;item=$2;next;} {printf "%s %s %s %s\n", item, rel, $2, $1;}'
A 1 C 1
A 2 B 2
A 2 C 2
A 3 B 3
A 3 C 3
Note the following limitations with this solution:
In case a given n has only one entry, nothing will be output (no output such as D 1)
All relations have the lexicographically first item of the group in the first column, and only pairs involving that first item are output (e.g. A 2 B 2 and A 2 C 2 but never B 2 C 2); see the sketch below for a variant that emits every pair.
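To get every pair within a group, including B 2 C 2, the second awk stage could remember all items seen so far in the current group (again only a sketch along the same lines):
awk '{print $2 " " $1}' input | sort -n |
awk '$1 != rel { rel = $1; n = 0 }            # new relation: reset the item list
     { for (i = 1; i <= n; i++)               # pair the new item with every earlier one
         printf "%s %s %s %s\n", item[i], rel, $2, $1
       item[++n] = $2 }'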

sum rows based on unique columns awk

I'm looking for a more elegant way to do this (for more than 100 columns):
awk '{a[$1]+=$4}{b[$1]+=$5}{c[$1]+=$6}{d[$1]+=$7}{e[$1]+=$8}{f[$1]+=$9}{g[$1]+=$10}END{for(i in a) print i,a[i],b[i],c[i],d[i],e[i],f[i],g[i]}'
Here is the input:
a1 1 1 2 2
a2 2 5 3 7
a2 2 3 3 8
a3 1 4 6 1
a3 1 7 9 4
a3 1 2 4 2
and output:
a1 1 1 2 2
a2 4 8 6 15
a3 3 13 19 7
Thanks :)
I'll break the one-liner down into lines to make it easier to read.
awk '{n[$1];for(i=2;i<=NF;i++)a[$1,i]+=$i}
END{for(x in n){
printf "%s ", x
for(y=2;y<=NF;y++)printf "%s%s", a[x,y],(y==NF?ORS:OFS)
}
}' file
This awk command should work with your 100-column file. Note that it relies on NF in the END block retaining its value from the last record read, which common awks honor and which is fine here because every row has the same number of columns.
Test with your file:
kent$ cat f
a1 1 1 2 2
a2 2 5 3 7
a2 2 3 3 8
a3 1 4 6 1
a3 1 7 9 4
a3 1 2 4 2
kent$ awk '{n[$1];for(i=2;i<=NF;i++)a[$1,i]+=$i}END{for(x in n){printf "%s ", x;for(y=2;y<=NF;y++)printf "%s%s", a[x,y],(y==NF?ORS:OFS)}}' f
a1 1 1 2 2
a2 4 8 6 15
a3 3 13 19 7
Using arrays of arrays in GNU awk version 4:
awk '{for (i=2;i<=NF;i++) a[$1][i]+=$i}
END{for (i in a)
{ printf "%s%s", i, FS # avoid using data as the printf format string
for (j in a[i]) printf "%s%s", a[i][j], FS
printf "%s", RS}
}' file
a1 1 1 2 2
a2 4 8 6 15
a3 3 13 19 7
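Note that for (i in a) makes no promise about traversal order. Since this already requires gnu awk 4+, you can request sorted keys via PROCINFO["sorted_in"]; here is a sketch that also sidesteps the inner for (j in a[i]) by counting the columns instead:
awk 'BEGIN{PROCINFO["sorted_in"]="@ind_str_asc"}  # gawk-only: for-in visits keys in ascending string order
     {for (i=2;i<=NF;i++) a[$1][i]+=$i; nf=NF}
     END{for (k in a){
           out=k
           for (i=2;i<=nf;i++) out=out FS a[k][i]
           print out}}' file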
If you care about the order of the output, try this:
$ cat file
a1 1 1 2 2
a2 2 5 3 7
a2 2 3 3 8
a3 1 4 6 1
a3 1 7 9 4
a3 1 2 4 2
Awk code:
$ cat tester
awk 'FNR==NR{
U[$1] # array U with index being field1
for(i=2;i<=NF;i++) # loop through the columns, that is, column2 to NF
A[$1,i]+=$i # array A holds the sum of each column
next # stop processing the current record and go on to the next record
}
($1 in U){ # here we read the same file again; if field1 is found in array U, the following statements run
for(i=1;i<=NF;i++)
s = s ? s OFS A[$1,i] : A[$1,i] # write the sums to variable s so a single print statement suffices; printf would also work
print $1,s # print column1 and variable s
delete U[$1] # we are done, so delete the array element
s = "" # reset variable s
}' OFS='\t' file{,} # the output field separator is a tab (a comma works too); file{,} expands to "file file", so the file is read twice
Result:
$ bash tester
a1 1 1 2 2
a2 4 8 6 15
a3 3 13 19 7
If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk , /usr/xpg6/bin/awk , or nawk
--edit--
As requested in a comment, here is the one-liner; in the post above I had added comments for readability, which spread it over several lines.
$ awk 'FNR==NR{U[$1];for(i=2;i<=NF;i++)A[$1,i]+=$i;next}($1 in U){for(i=1;i<=NF;i++)s = s ? s OFS A[$1,i] : A[$1,i];print $1,s;delete U[$1];s = ""}' OFS='\t' file{,}
a1 1 1 2 2
a2 4 8 6 15
a3 3 13 19 7