awk: look for duplicated fields in multiple columns, print new column under condition

I would like your help with awk.
I am trying to look for lines where columns $1 and $2 are duplicated in the file and where at least one of the duplicates has the value ref in column $3. If so, print a "1" in a new column, else print a "2".
An example of input file would be:
a 123 exp_a
a 123 ref
b 146 exp_a
c 156 ref
d 205 exp_a
d 205 exp_b
And the output file would be:
a 123 exp_a 1
a 123 ref 1
b 146 exp_a 2
c 156 ref 2
d 205 exp_a 2
d 205 exp_b 2
Here, a 123 is duplicated, with one of the lines having ref at $3, so it gets a 1. In contrast, the others are either not duplicated at $1 and $2, or duplicated but with no ref at $3, so they get a 2.
After some fiddling around, I managed to put a 1 at lines where $1 and $2 are duplicated, but it does not take the ref at $3 into account, and I cannot tell awk to print a 2 otherwise... SPOILERS: my code is probably very ugly.
awk 'BEGIN {FS=OFS="\t"} {i=$1FS$2} {a[i]=!a[i]?$3:a[i]FS"1\n" i"\t"$3FS"1"} END {for (l in a) {print l,a[l]}}' infile > outfile
The output I get is:
d 205 exp_a 1
d 205 exp_b 1
a 123 exp_a 1
a 123 ref 1
b 146 exp_a
c 156 ref

$ cat tst.awk
BEGIN { OFS="\t" }
NR==FNR {                        # 1st pass: build the counts
    cnt2[$1,$2]++                # occurrences of each $1,$2 pair
    cnt3[$1,$2,$3]++             # occurrences of each $1,$2,$3 triple
    next
}
{ print $0, (cnt2[$1,$2]>1 && cnt3[$1,$2,"ref"]>0 ? 1 : 2) }
$ awk -f tst.awk file file
a 123 exp_a 1
a 123 ref 1
b 146 exp_a 2
c 156 ref 2
d 205 exp_a 2
d 205 exp_b 2
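The file is named twice on the command line so that NR==FNR is true only while the first copy is read: the first pass fills cnt2 and cnt3, the second appends the 1 or 2. A quick way to see the two passes (FNR resets for each input file, NR does not):
$ awk 'NR==FNR { print "pass 1: FNR=" FNR, "NR=" NR; next } { print "pass 2: FNR=" FNR, "NR=" NR }' file file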

Could you please try the following.
awk 'FNR==NR{a[$1,$2]++;b[$1,$2]=$3;next} {$NF=(b[$1,$2]=="ref" && a[$1,$2]>1?$NF OFS "1":$NF OFS "2")} 1' OFS="\t" Input_file Input_file
Adding a non-one-liner form of the solution here too.
awk '
FNR==NR{
    a[$1,$2]++               # count occurrences of the $1,$2 key
    b[$1,$2]=$3              # remember the last $3 seen for the key
    next
}
{
    $NF=(b[$1,$2]=="ref" && a[$1,$2]>1?$NF OFS "1":$NF OFS "2")
}
1
' OFS="\t" Input_file Input_file
Note that b[$1,$2] keeps only the last $3 seen for each key, so this works on the sample because ref is the last line of its group; if ref could appear earlier in a group, store a flag instead (if($3=="ref")b[$1,$2]=1) and test b[$1,$2] rather than b[$1,$2]=="ref".

This one works in one go of the data but expects the file to be ordered by $1 $2, the "key". Records within each "key" group are output in their original order (the buffer is replayed with for(i=1;i<=a[0];i++)):
awk '
BEGIN { FS=OFS="\t" }
{
    if((p!=$1 OFS $2) && NR>1) {            # when the $1 $2 changes from previous
        for(i=1;i<=a[0];i++) {              # iterate and output buffered records
            print p,a[i],2-(a[-1]&&a[0]>1)  # more than one record in buffer and ...
        }                                   # ... ref for $4=1
        delete a                            # empty buffer after output
    }
    if($3=="ref")                           # if there is a match in $3
        a[-1]++                             # increase counter
    a[++a[0]]=$3                            # buffer records to a, a[0] counter
    p=$1 OFS $2                             # p is for previous "key"
}
END {
    for(i=1;i<=a[0];i++)                    # duplicate code from above if
        print p,a[i],2-(a[-1]&&a[0]>1)
}' file
Outputs:
a 123 exp_a 1
a 123 ref 1
b 146 exp_a 2
c 156 ref 2
d 205 exp_a 2
d 205 exp_b 2
Record counter a[0] and ref counter a[-1] are kept inside a[] so that both can be reset with a single delete a.
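As a quick sketch of that reset (deleting a whole array with delete a is supported by gawk, mawk and BSD awk, though not guaranteed by strict POSIX):
awk 'BEGIN {
    a[0]=3; a[-1]=1              # record counter and ref counter live in one array
    delete a                     # a single delete clears counters and buffer alike
    print (0 in a), (-1 in a)    # prints "0 0": both are gone
}'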


How does this AWK program work?

Input data:
2018 Fiat 125
2018-01-17: Opel 2 Volvo 3
2017-01-21: Fiat 5 Fiat 6
2017-02-22: Opel 7 Fiat 8
2018-01-31: Fiat 9 Opel 17
Code:
$1 !~ /17/ {t[$2]+= $3;}
END {for (i in t)
    print i": "t[i];}
The result is:
Fiat: 134
I understand that the condition !~ /17/ is fulfilled only for the first and last lines, because their first column contains no 17. But what does the program do next?
There is an instruction:
{t[$2]+= $3;}
So (as $2 is Fiat and $3 is 125): t[Fiat] = t[Fiat] + 125?
I assume that 134 is the sum of 125 and 9.
What is "i"?
$1 !~ /17/ {t[$2]+= $3;}
When the first column ($1) does NOT contain 17 (because of the !~), the value is added to an array: t[$2]+=$3. For the first line this is t['Fiat']+=125. The += adds the value to the previous value of t['Fiat'].
END {for (i in t)
    print i": "t[i];}
When done (in the END), print all values of this array.
This can be seen/debugged by changing the script to:
$1 !~ /17/ {
    t[$2]+= $3;
    print "for",$2," the count is updated with ",$3," to: ",t[$2]," in the line: ",$0
}
END {for (i in t)
    print i": "t[i];}
which, with this data, outputs:
for Fiat the count is updated with 125 to: 125 in the line: 2018 Fiat 125
for Fiat the count is updated with 9 to: 134 in the line: 2018-01-31: Fiat 9 Opel 17
Fiat: 134
The array can contain multiple values (here it happens to hold just one). The for (i in t) { .... } loop iterates over the indices of the array: i takes each index (here the car names) in turn.
This can be tested by dropping the condition: awk '{ t[$2]+=$3; }END{ for (i in t) { print i":"t[i]; }}' input.txt, which will output:
Fiat:139
Opel:9

AWK command to add a column with the count of grouped columns

I have a data set tab separated like this: (file.txt)
A B
1 111
1 111
1 112
1 113
1 113
1 113
1 113
2 113
2 113
2 113
I want to add a new column C showing the count of rows grouped by A and B.
Desired output:
A B C
1 111 2
1 111 2
1 112 1
1 113 4
1 113 4
1 113 4
1 113 4
2 113 3
2 113 3
2 113 3
I have tried this:
awk 'BEGIN{ FS=OFS="\t" }
NR==FNR{
    if (FNR>1) a[$2]+=$3
    next
}
{ $(NF+1)=(FNR==1 ? "C" : a[$2]) }
1
' file.txt file.txt > file2.txt
Could you please try the following, with the shown samples.
awk '
FNR==NR{
    count[$1,$2]++
    next
}
FNR==1{
    print $0,"C"
    next
}
{
    print $0,count[$1,$2]
}
' Input_file Input_file
Add BEGIN{FS=OFS="\t"} to the above code in case your data is tab-delimited.
Explanation: Adding a detailed explanation of the above.
awk '                      ##Starting awk program from here.
FNR==NR{                   ##Checking condition FNR==NR which will be TRUE when Input_file is read the first time.
    count[$1,$2]++         ##Creating count with index of the 1st and 2nd fields and increasing its count.
    next                   ##next will skip further statements from here.
}
FNR==1{                    ##Checking condition if this is the 1st line, then do the following.
    print $0,"C"           ##Printing current line with the C heading here.
    next                   ##next will skip further statements from here.
}
{
    print $0,count[$1,$2]  ##Printing current line along with the count indexed by the 1st and 2nd fields.
}
' Input_file Input_file    ##Mentioning Input_file(s) here.
Problem in OP's attempt: OP was adding $3 into the values (though the logic looked OK), but there is NO 3rd field present in the Input_file, which is why it was not working. Also, OP was using the 2nd field as the index, but as per OP's comments it should be the 1st and 2nd fields.
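For reference, a minimal repair of the OP's own attempt (a sketch keeping their structure, with the two fixes above applied) would be:
awk 'BEGIN{ FS=OFS="\t" }
NR==FNR{
    if (FNR>1) a[$1,$2]++        # count the $1,$2 key instead of summing a missing $3
    next
}
{ $(NF+1)=(FNR==1 ? "C" : a[$1,$2]) }
1
' file.txt file.txt > file2.txt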
You might consider using GNU Datamash, e.g.:
datamash -HW groupby 1,2 count 1 < file.txt | column -t
Output:
GroupBy(A) GroupBy(B) count(A)
1 111 2
1 112 1
1 113 4
2 113 3

Column manipulation using Bash & Awk

Let's assume we have an example1.txt file consisting of a few rows.
item item item
A B C
100 20 2
100 22 3
100 23 4
101 26 2
102 28 2
103 29 3
103 30 2
103 32 2
104 33 2
104 34 2
104 35 2
104 36 3
There are a few commands I would like to perform to filter the txt files and add a few more columns.
First, I want to apply a condition where item C is equal to 2. Using an awk command I can do that in the following way:
awk '$3 == 2 { print $1 "\t" $2 "\t" $3} ' example1.txt > example2.txt
Therefore the returned text file would be:
item item item
A B C
100 20 2
101 26 2
102 28 2
103 30 2
103 32 2
104 33 2
104 34 2
104 35 2
Now I want to count two things:
I want to count the total number of unique values in column 1.
For example, in the above case example2.txt, it would be:
(100,101,102,103,104) = 5
And I would like to count how many times each column A value repeats and add that count as a new column.
I would like to have it like this:
item item item item
A B C D
100 20 2 1
101 26 2 1
102 28 2 1
103 30 2 2
103 32 2 2
104 33 2 3
104 34 2 3
104 35 2 3
In the item D (4th) column above, the 1st row is 1 because 100 is not repeated, but the 4th row is 2 because 103 is repeated twice; therefore 2 is placed in the 4th and 5th rows. Similarly, the last three rows of item D are 3, because the item A value 104 is repeated three times across those three rows.
You may try this awk:
awk -v OFS='\t' 'NR <= 2 {
print $0, (NR == 1 ? "item" : "D")
}
FNR == NR && $3 == 2 {
++freq[$1]
next
}
$3 == 2 {
print $0, freq[$1]
}' file{,}
item item item item
A B C D
100 20 2 1
101 26 2 1
102 28 2 1
103 30 2 2
103 32 2 2
104 33 2 3
104 34 2 3
104 35 2 3
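Here file{,} is simply bash brace expansion for passing the same file name twice, so awk reads it in two passes:
$ echo file{,}
file file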
Could you please try the following. In case you want to save the output into the same Input_file, then append > temp && mv temp Input_file to the following code.
awk '
FNR==NR{
    if($3==2){
        a[$1,$3]++
    }
    next
}
FNR==1{
    $(NF+1)="item"
    print
    next
}
FNR==2{
    $(NF+1)="D"
    print
    next
}
$3!=2{
    next
}
FNR>2{
    $(NF+1)=a[$1,$3]
}
1
' Input_file Input_file | column -t
Output will be as follows.
item item item item
A B C D
100 20 2 1
101 26 2 1
102 28 2 1
103 30 2 2
103 32 2 2
104 33 2 3
104 34 2 3
104 35 2 3
Explanation: Adding a detailed explanation of the above code.
awk '                    ##Starting awk program from here.
FNR==NR{                 ##Checking condition FNR==NR which will be TRUE when Input_file is read the 1st time.
    if($3==2){           ##Checking condition if the 3rd field is 2, then do the following.
        a[$1,$3]++       ##Creating an array a whose index is $1,$3 and incrementing its value by 1 here.
    }
    next                 ##next will skip further statements from here.
}
FNR==1{                  ##Checking condition if this is the first line.
    $(NF+1)="item"       ##Adding a new field with string item in it.
    print                ##Printing the 1st line here.
    next                 ##next will skip further statements from here.
}
FNR==2{                  ##Checking condition if this is the second line.
    $(NF+1)="D"          ##Adding a new field with string D in it.
    print                ##Printing the 2nd line here.
    next                 ##next will skip further statements from here.
}
$3!=2{                   ##Checking condition if the 3rd field is NOT equal to 2, then do the following.
    next                 ##next will skip further statements from here.
}
FNR>2{                   ##Checking condition if the line number is greater than 2, then do the following.
    $(NF+1)=a[$1,$3]     ##Creating a new field with the value of array a with index $1,$3 here.
}
1                        ##1 will print edited/non-edited lines here.
' Input_file Input_file  ##Mentioning Input_file names 2 times here.
Similar to the others, but using awk in a single pass, storing the records seen and the count for D in arrays, with ord and Dcnt used to map the output order for each, e.g.
awk '
FNR == 1 { h1=$0"\titem" }     # header 1 with extra "\titem"
FNR == 2 { h2=$0"\tD" }        # header 2 with extra "\tD"
FNR > 2 && $3 == 2 {           # remaining rows with $3 == 2
    D[$1]++                    # for D column, times A seen
    seen[$1,$2] = $0           # save records seen
    ord[++n] = $1 SUBSEP $2    # save order all records appear
    Dcnt[n] = $1               # save order mapped to $1 for D
}
END {
    printf "%s\n%s\n", h1, h2  # output headers
    for (i=1; i<=n; i++)       # loop outputting info with D column added
        print seen[ord[i]]"\t"D[Dcnt[i]]
}
' example.txt
(note: SUBSEP is a built-in variable holding the subscript separator awk uses when a comma joins expressions in an array index, e.g. seen[$1,$2]; it lets you construct the same index for comparison outside an array reference. By default it is "\034".)
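A quick check of how the comma index maps to SUBSEP:
$ awk 'BEGIN { a["x","y"]; print (("x" SUBSEP "y") in a) }'
1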
Example Output
item item item item
A B C D
100 20 2 1
101 26 2 1
102 28 2 1
103 30 2 2
103 32 2 2
104 33 2 3
104 34 2 3
104 35 2 3
There is always more than one way to skin the cat with awk.
Assuming the file is not a big file:
awk 'NR==FNR && $3 == 2{a[$1]++;next}$3==2{$4=a[$1];print;}' file.txt file.txt
You parse through the file twice. In the first pass, you compute the count for the 4th column and keep it in an array. In the second pass, the count is set as the 4th column and the whole line is printed.
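Note that this drops the two header lines, while the desired output keeps them. A small extension of the same two-pass idea (a sketch; the appended header labels item and D are taken from the question) could be:
awk 'NR==FNR { if ($3 == 2) a[$1]++; next }                # 1st pass: count each A value
FNR <= 2 { print $0, (FNR == 1 ? "item" : "D"); next }     # keep the two header rows
$3 == 2 { $4 = a[$1]; print }' file.txt file.txt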

If two columns from different files are equal, replace the third column with awk

I am looking for a way to replace a column in a file, if two ID columns match.
I have file A.txt
c a b ID
1 0.01 5 1
2 0.1 6 2
3 2 3
and file B.txt
ID a b
1 10 15
2 20 16
3 30 12
4 40 14
The output I'm looking for is
file A.txt
ID a b
1 0.01 5
2 0.1 6
3 30 2
With awk I can find which IDs from both files match:
awk 'NR==FNR{a[$1];next}$1 in a' B.txt A.txt
But how do I add the replacement? Thank you for any suggestions.
awk solution:
awk 'NR==FNR{ if(NR>1) a[$1]=$2; next }
FNR>1 && $1 in a && NF<3{ f=$2; $2=a[$1]; $3=f }1' B.txt A.txt | column -t
if(NR>1) a[$1]=$2; - capturing the column values from file B.txt, except for the header line (NR>1)
FNR>1 && $1 in a && NF<3 - if the IDs match and a line from A.txt has fewer than 3 fields
The output:
ID a b
1 0.01 5
2 0.1 6
3 30 2
Adapted to your new data format:
awk '
# Load file B reference
FNR==NR && NR > 1 {ColB[$1]=$2; next}
# treat file A
{
    # set the missing field if known in file B (and not on the 1st line)
    if ( NF < 4 && ( $NF in ColB) && FNR > 1) $0 = $NF FS ColB[$NF] FS $2
    # print the result (in any case)
    print
}
# the order of the files is mandatory
' B.txt A.txt
Self-documented. This assumes it is only the second field that is missing, as in your sample.

Compare two files and append the values, leave the mismatches as such in the output file

I'm trying to match two files, file1.txt (50,000 lines) and file2.txt (55,000 lines). I want to compare file2 to file1, extract the values of columns 2 and 3, and leave the mismatches as they are. The output file must contain all the ids from file2, i.e., it should have 55,000 lines. Note: not all the ids in file1 are present in file2, i.e. the actual matches could be fewer than 50,000.
file1.txt
ab1 12 345
ab2 9 456
gh67 6 987
file2.txt
ab2 0 0
ab1 0 345
nh7 0 0
gh67 6 987
Output
ab2 9 456
ab1 12 345
nh7 0 0
gh67 6 987
This is what I tried, but it only prints the matches (so instead of 55,000 lines I have 49,000 lines in my output file):
awk 'NR==FNR {f[$1]=$0;next}$1 in f{print f[$1],$0}' file1.txt file2.txt >output.txt
This awk script will work:
NR == FNR {                 # 1st file: remember each line by its id
    a[$1] = $0
    next
}
$1 in a {                   # id present in file1: print file1's values
    split(a[$1], b)
    print $1, (b[2] == $2 ? $2 : b[2]), (b[3] == $3 ? $3 : b[3])
}
!($1 in a)                  # id not in file1: print the file2 line as-is
If you save this as a.awk and run
awk -f a.awk file1.txt file2.txt
This will output
ab2 9 456
ab1 12 345
nh7 0 0
gh67 6 987
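Since b[2] == $2 ? $2 : b[2] always evaluates to b[2] (and likewise for b[3]), an equivalent shorter sketch of the same two-pass idiom would be:
awk 'NR==FNR { a[$1]=$0; next } { print ($1 in a ? a[$1] : $0) }' file1.txt file2.txt
This prints file1's line for every matching id, and the file2 line unchanged otherwise.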