AWK: print ALL rows with the MAX value in one field per the other field, including identical rows with the max value - awk

I am trying to keep the rows with the highest value in column 2 per column 1, including identical rows that share the max value, as in the desired output below.
Data is
a 55
a 66
a 130
b 88
b 99
b 99
c 110
c 130
c 130
Desired output is
a 130
b 99
b 99
c 130
c 130
I could find great answers on this site, but none exactly for the current question.
awk '{ max=(max>$2?max:$2); arr[$2]=(arr[$2]?arr[$2] ORS:"")$0 } END{ print arr[max] }' file
yields output that includes the identical rows, but the max value is taken across all rows, not per column 1.
a 130
c 130
c 130
awk '$2>max[$1] {max[$1]=$2 ; row[$1]=$0} END{for (i in row) print row[i]}' file
The output includes the max value per column 1 but does NOT include the identical rows with the max values.
a 130
b 99
c 130
Would you please help me trim the data in the desired way? Even the code above was obtained from questions and answers on this site. Appreciate that!! Many thanks for the help in advance!!!

I've used this approach in the past:
awk 'NR==FNR{if($2 > max[$1]){max[$1]=$2}; next} max[$1] == $2' test.txt test.txt
a 130
b 99
b 99
c 130
c 130
This requires you to pass in the same file twice (i.e. awk '...' test.txt test.txt), so it's not ideal, but hopefully it provides the required output with your actual data.
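If reading the input twice is a problem (for example, when it comes from a pipe), here is a rough single-pass sketch under the assumptions that the values in column 2 are positive and the groups fit in memory; it buffers the rows holding the current maximum per key, and the groups come out in awk's arbitrary for (k in rows) order:
awk '
$2 >  max[$1] { max[$1] = $2; rows[$1] = $0; next }   # new max for this key: restart its buffer
$2 == max[$1] { rows[$1] = rows[$1] ORS $0 }          # ties the current max: append to the buffer
END { for (k in rows) print rows[k] }
' test.txt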

Using any awk:
awk '
{ cnt[$1,$2]++; if ($2 > max[$1]) max[$1] = $2 }
END { for (key in max) { val=max[key]; for (i=1; i<=cnt[key,val]; i++) print key, val } }
' file
a 130
b 99
b 99
c 130
c 130

Here is a ruby to do that:
ruby -e '
grps=$<.read.split(/\R/).
group_by{|line| line[/^\S+/]}
# {"a"=>["a 55", "a 66", "a 130"], "b"=>["b 88", "b 99", "b 99"], "c"=>["c 110", "c 130", "c 130"]}
maxes=grps.map{|k,v| v.max_by{|s| s.split[-1].to_f}}
# ["a 130", "b 99", "c 130"]
grps.values.flatten.each{|s| puts s if maxes.include?(s)}
' file
Prints:
a 130
b 99
b 99
c 130
c 130

Another way using awk. The second loop should be light, just repeating the duplicated max values.
% awk 'arr[$1] < $2{arr[$1] = $2; # get max value
co[$1]++; if(co[$1] == 1){x++; id[x] = $1}} # count unique ids
arr[$1] == $2{n[$1,arr[$1]]++} # count repeated max
END{for(i=1; i<=x; i++){
for(j=1; j<=n[id[i],arr[id[i]]]; j++){print id[i], arr[id[i]]}}}' file
a 130
b 99
b 99
c 130
c 130
or, if order doesn't matter
% awk 'arr[$1] < $2{arr[$1] = $2}
arr[$1] == $2{n[$1,arr[$1]]++}
END{for(i in arr){
j=0; do{print i, arr[i]; j++} while(j < n[i,arr[i]])}}' file
c 130
c 130
b 99
b 99
a 130
-- EDIT --
Printing data in additional columns
% awk 'arr[$1] < $2{arr[$1] = $2}
arr[$1] == $2{n[$1,arr[$1]]++; line[$1,arr[$1],n[$1,arr[$1]]] = $0}
END{for(i in arr){
j=0; do{j++; print line[i,arr[i],j]} while(j < n[i,arr[i]])}}' file
c 130 data8
c 130 data9
b 99 data5
b 99 data6
a 130 data3
Data
% cat file
a 55 data1
a 66 data2
a 130 data3
b 88 data4
b 99 data5
b 99 data6
c 110 data7
c 130 data8
c 130 data9

Related

Column manipulating using Bash & Awk

Let's assume we have an example1.txt file consisting of a few rows.
item item item
A B C
100 20 2
100 22 3
100 23 4
101 26 2
102 28 2
103 29 3
103 30 2
103 32 2
104 33 2
104 34 2
104 35 2
104 36 3
There are a few commands I would like to perform to filter the txt file and add a few more columns.
First, I want to keep only the rows where item C is equal to 2. Using awk I can do that in the following way:
awk '$3 == 2 { print $1 "\t" $2 "\t" $3} ' example1.txt > example2.txt
The returned text file, example2.txt, would be:
item item item
A B C
100 20 2
101 26 2
102 28 2
103 30 2
103 32 2
104 33 2
104 34 2
104 35 2
Now I want to count two things:
I want to count the total number of unique values in column 1.
For example, in the above case example2.txt, it would be:
(100,101,102,103,104) = 5
And I would like to count how many times each column A value repeats and add that count as a new column.
I would like to have something like this:
item item item item
A B C D
100 20 2 1
101 26 2 1
102 28 2 1
103 30 2 2
103 32 2 2
104 33 2 3
104 34 2 3
104 35 2 3
In the Item D column (4th) above, the 1st row is 1 because its column A value does not repeat, but the 4th row is 2 because 103 repeats twice; therefore I have put 2 in the 4th and 5th rows. Similarly, the last three rows have 3 in column D, because their item A value (104) repeats three times across those rows.
You may try this awk:
awk -v OFS='\t' 'NR <= 2 {
print $0, (NR == 1 ? "item" : "D")
}
FNR == NR && $3 == 2 {
++freq[$1]
next
}
$3 == 2 {
print $0, freq[$1]
}' file{,}
item item item item
A B C D
100 20 2 1
101 26 2 1
102 28 2 1
103 30 2 2
103 32 2 2
104 33 2 3
104 34 2 3
104 35 2 3
Could you please try the following. In case you want to save the output into the same Input_file, append > temp && mv temp Input_file to the code below.
awk '
FNR==NR{
if($3==2){
a[$1,$3]++
}
next
}
FNR==1{
$(NF+1)="item"
print
next
}
FNR==2{
$(NF+1)="D"
print
next
}
$3!=2{
next
}
FNR>2{
$(NF+1)=a[$1,$3]
}
1
' Input_file Input_file | column -t
Output will be as follows.
item item item item
A B C D
100 20 2 1
101 26 2 1
102 28 2 1
103 30 2 2
103 32 2 2
104 33 2 3
104 34 2 3
104 35 2 3
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition if FNR==NR which will be TRUE when 1st time Input_file is being read.
if($3==2){ ##Checking condition if 3rd field is 2 then do following.
a[$1,$3]++ ##Creating an array a whose index is $1,$3 and incrementing its value by 1 here.
}
next ##next will skip further statements from here.
}
FNR==1{ ##Checking condition if this is first line.
$(NF+1)="item" ##Adding a new field with string item in it.
print ##Printing 1st line here.
next ##next will skip further statements from here.
}
FNR==2{ ##Checking condition if this is second line.
$(NF+1)="D" ##Adding a new field with string item in it.
print ##Printing 1st line here.
next ##next will skip further statements from here.
}
$3!=2{ ##Checking condition if 3rd field is NOT equal to 2 then do following.
next ##next will skip further statements from here.
}
FNR>2{ ##Checking condition if line is greater than 2 then do following.
$(NF+1)=a[$1,$3] ##Creating new field with value of array a with index of $1,$3 here.
}
1 ##1 will print edited/non-edited lines here.
' Input_file Input_file ##Mentioning Input_file names 2 times here.
Similar to the others, but using awk with a single pass, storing the records seen and the count for D in arrays, with the arrays ord and Dcnt used to keep the output order and map each record to its D count, e.g.
awk '
FNR == 1 { h1=$0"\titem" } # header 1 with extra "\titem"
FNR == 2 { h2=$0"\tD" } # header 2 with extra "\tD"
FNR > 2 && $3 == 2 { # remaining rows with $3 == 2
D[$1]++ # for D column: times this column A value seen
seen[$1,$2] = $0 # save records seen
ord[++n] = $1 SUBSEP $2 # save order all records appear
Dcnt[n] = $1 # save order mapped to $1 for D
}
END {
printf "%s\n%s\n", h1, h2 # output headers
for (i=1; i<=n; i++) # loop outputting info with D column added
print seen[ord[i]]"\t"D[Dcnt[i]]
}
' example.txt
(note: SUBSEP is a built-in variable holding the subscript separator that awk inserts when a comma is used in an array index, e.g. seen[$1,$2]; building a key with SUBSEP yourself lets you compare or look up such indexes outside the subscript syntax. It is "\034" by default.)
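A quick way to see that in any awk:
awk 'BEGIN { a["x","y"] = 1; k = "x" SUBSEP "y"; print (k in a) }'
This prints 1: the manually concatenated key finds the element that was stored with the comma subscript syntax.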
Example Output
item item item item
A B C D
100 20 2 1
101 26 2 1
102 28 2 1
103 30 2 2
103 32 2 2
104 33 2 3
104 34 2 3
104 35 2 3
Always more than one way to skin-the-cat with awk.
Assuming the file is not a big file:
awk 'NR==FNR && $3 == 2{a[$1]++;next}$3==2{$4=a[$1];print;}' file.txt file.txt
You parse through the file twice. In the first pass, you calculate the counts for the 4th column and store them in an array. In the second pass, the count is set as the 4th column and the whole line is printed.
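If you'd rather read the file only once, here is a hedged single-pass sketch; it buffers every matching line in memory and, like the two-pass version above, drops the two header lines:
awk '$3 == 2 { cnt[$1]++; line[++n] = $0; key[n] = $1 }   # buffer matching rows, count column-1 repeats
     END { for (i = 1; i <= n; i++) print line[i], cnt[key[i]] }' file.txt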

awk: look for duplicated fields in multiple columns, print new column under condition

I would like your help with awk.
I am trying to look for lines where columns $1 and $2 are duplicated in the file and where at least one of the duplicates has the value ref in column $3. If so, print a "1", else print a "2", in a new column.
An example of input file would be:
a 123 exp_a
a 123 ref
b 146 exp_a
c 156 ref
d 205 exp_a
d 205 exp_b
And the output file would be:
a 123 exp_a 1
a 123 ref 1
b 146 exp_a 2
c 156 ref 2
d 205 exp_a 2
d 205 exp_b 2
Here, a 123 is duplicated with one line having ref in $3, so it gets a 1. In contrast, the others are either not duplicated in $1 and $2, or duplicated but with no ref in $3, so they get a 2.
After some fiddling around, I managed to put a 1 on lines where $1 and $2 are duplicated, but it does not take the ref in $3 into account and I cannot tell awk to print a 2 otherwise... SPOILERS: my code is probably very ugly.
awk 'BEGIN {FS=OFS="\t"} {i=$1FS$2} {a[i]=!a[i]?$3:a[i]FS"1\n" i"\t"$3FS"1"} END {for (l in a) {print l,a[l]}}' infile > outfile
The output I get is:
d 205 exp_a 1
d 205 exp_b 1
a 123 exp_a 1
a 123 ref 1
b 146 exp_a
c 156 ref
$ cat tst.awk
BEGIN { OFS="\t" }
NR==FNR {
cnt2[$1,$2]++
cnt3[$1,$2,$3]++
next
}
{ print $0, (cnt2[$1,$2]>1 && cnt3[$1,$2,"ref"]>0 ? 1 : 2) }
$ awk -f tst.awk file file
a 123 exp_a 1
a 123 ref 1
b 146 exp_a 2
c 156 ref 2
d 205 exp_a 2
d 205 exp_b 2
Could you please try the following.
awk 'FNR==NR{a[$1,$2]++;if($3=="ref"){b[$1,$2]=$3};next} {$NF=(b[$1,$2]=="ref" && a[$1,$2]>1?$NF OFS "1":$NF OFS "2")} 1' OFS="\t" Input_file Input_file
Adding a non-one-liner form of the solution here too.
awk '
FNR==NR{
a[$1,$2]++
if($3=="ref"){ b[$1,$2]=$3 }
next
}
{
$NF=(b[$1,$2]=="ref" && a[$1,$2]>1?$NF OFS "1":$NF OFS "2")
}
1
' OFS="\t" Input_file Input_file
This one works in a single pass over the data but expects the file to be ordered by $1 $2, the "key". Records are buffered per "key" group and printed, in input order, when the key changes:
awk '
BEGIN { FS=OFS="\t" }
{
if((p!=$1 OFS $2) && NR>1) { # when the $1 $2 changes from previous
for(i=1;i<=a[0];i++) { # iterate and output buffered records
print p,a[i],2-(a[-1]&&a[0]>1) # more than one record in buffer and ...
} # ... ref for $4=1
delete a # empty buffer after output
}
if($3=="ref") # if there is a match in $3
a[-1]++ # increase counter
a[++a[0]]=$3 # buffer records to a, a[0] counter
p=$1 OFS $2 # p is for previous "key"
}
END {
for(i=1;i<=a[0];i++) # duplicate code from above if
print p,a[i],2-(a[-1]&&a[0]>1)
}' file
Outputs:
a 123 exp_a 1
a 123 ref 1
b 146 exp_a 2
c 156 ref 2
d 205 exp_a 2
d 205 exp_b 2
Record counter a[0] and ref counter a[-1] are in a[] to reset them with a single delete a.

Count the number of occurrences of a number larger than x in every row

I have a file with multiple rows and 26 columns. I want to count, in each row, the number of values that are higher than 0 (counting values different from 0 would also be valid), excluding the first two columns. The file looks like this:
X Y Sample1 Sample2 Sample3 .... Sample24
a a1 0 7 0 0
b a2 2 8 0 0
c a3 0 3 15 3
d d3 0 0 0 0
I would like to have an output file like this:
X Y Result
a a1 1
b b1 2
c c1 3
d d1 0
awk or sed would be good.
I saw a similar question but in that case the columns were summed and the desired output was different.
awk 'NR==1{printf "X\tY\tResult%s",ORS} # Printing the header
NR>1{
count=0; # Initializing count for each row to zero
for(i=3;i<=NF;i++){ #iterating from field 3 to end, NF is #fields
if($i>0){ #$i expands to $3,$4 and so which are the fields
count++; # Incrementing if the condition is true.
}
};
printf "%s\t%s\t%s%s",$1,$2,count,ORS # For each row print o/p
}' file
should do that
another awk
$ awk '{if(NR==1) c="Result";
else for(i=3;i<=NF;i++) c+=($i>0);
print $1,$2,c; c=0}' file | column -t
X Y Result
a a1 1
b a2 2
c a3 3
d d3 0
Or, relying on the fact that gsub() returns the number of substitutions it makes (here, one per field that starts with a nonzero digit, which for this data means one per value greater than 0):
$ awk '{print $1, $2, (NR>1 ? gsub(/ [1-9]/,"") : "Result")}' file
X Y Result
a a1 1
b a2 2
c a3 3
d d3 0

To sum adjacent lines from the same column in AWK

I have a file:
A 1 20
B 2 21
C 3 22
D 4 23
I have to find the sum of the values from line 0 to line 3, then the sum of lines 1 to 3, and finally the sum of lines 2 to 3 (using 0-based line numbers); the last value has to be simply 0. In other words, I want to get an output file with two columns where each row holds the sum of that line and all the following lines, something like this:
10 86
9 66
7 45
0 0
The last row has to have two zeros as values. How can I do it in awk?
This might be what you want: reversing the file turns the required suffix sums into simple running sums (with the first reversed line printed as 0 0), and the final tac restores the original order:
$ tac file | awk 'NR==1{ print 0, 0; a=$2; b=$3; next} { print a+=$2, b+=$3 }' | tac
10 86
9 66
7 45
0 0
Avoid two tacs by accumulating the sums in two arrays:
$ awk '{
for (i = 1; i <= NR; ++i) { sum2[i] += $2; sum3[i] += $3 }
}
END {
sum2[NR] = sum3[NR] = 0
for (i = 1; i <= NR; ++i) print sum2[i], sum3[i]
}' file
10 86
9 66
7 45
0 0
The value of each row is added to the accumulators of that row and of all previous rows, so by the end each accumulator holds the sum of its own line and every line below it. Once all rows have been processed, the last row's values are zeroed out and everything is printed.
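That inner loop makes the work quadratic in the number of lines. If that matters, here is a sketch that stores the two columns and builds the same suffix sums in one backward pass at END (assuming the whole file fits in memory):
awk '{ v2[NR] = $2; v3[NR] = $3 }
END {
    for (i = NR; i >= 1; --i) {            # accumulate suffix sums from the bottom up
        s2 += v2[i]; s3 += v3[i]
        out2[i] = s2; out3[i] = s3
    }
    out2[NR] = out3[NR] = 0                # the last row is forced to 0 0
    for (i = 1; i <= NR; ++i) print out2[i], out3[i]
}' file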

use awk for printing selected rows

I have a text file and I want to print only selected rows of it. Below is the dummy format of the text file:
Name Sub Marks percentage
A AB 50 50
Name Sub Marks percentage
b AB 50 50
Name Sub Marks percentage
c AB 50 50
Name Sub Marks percentage
d AB 50 50
I need the output as below (I don't need the heading before every record, and I need only 3 columns, omitting "Marks"):
Name Sub percentage
A AB 50
b AB 50
c AB 50
d AB 50
Please suggest an awk command with which I can achieve this, and thanks for your support.
You can use:
awk '(NR == 1) || ((NR % 2) == 0) {print $1" "$2" "$4}' inputFile
This will print columns one, two and four but only if the record number is one or even. The results are:
Name Sub percentage
A AB 50
b AB 50
c AB 50
d AB 50
If you want it nicely formatted, you can use printf instead:
awk '(NR == 1) || ((NR % 2) == 0) {printf "%-10s %-10s %10s\n", $1, $2, $4}' inputFile
Name Sub percentage
A AB 50
b AB 50
c AB 50
d AB 50
awk solution:
awk 'NR==1 || !(NR%2){ print $1,$2,$4 }' OFS='\t' file
NR==1 || !(NR%2) - considering only the 1st and each even line
OFS='\t' - output field separator
The output:
Name Sub percentage
A AB 50
b AB 50
c AB 50
d AB 50
In case the input file has a slightly different format, the above solution will fail. For example:
Name Sub Marks percentage
A AB 50 50
Name Sub Marks percentage
b AB 50 50
c AB 50 50
Name Sub Marks percentage
d AB 50 50
In such a case, something like this will handle both formats:
$ awk '$0!=h;NR==1{h=$0}' file1
Name Sub Marks percentage
A AB 50 50
b AB 50 50
c AB 50 50
d AB 50 50
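If you also need to drop the Marks column for that irregular input, one possible sketch combining the header de-duplication with the column selection (assuming the fields stay as shown):
awk 'NR == 1 { h = $0 }                        # remember the first header line
     $0 != h || NR == 1 { print $1, $2, $4 }   # print it once, then every non-header line
' file1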