How to add multiple columns to a file from another file - awk

I have two files like shown below which are tab-delimited:
file A
chr1 123 aa b c d
chr1 234 a b c d
chr1 345 aa b c d
chr1 456 a b c d
....
file B
chr1 123 aa c d e ff
chr1 345 aa e f g gg
chr1 123 aa c d e hh
chr1 567 aa z c a ii
chr1 345 bb x q r kk
chr1 789 df f g s ff
chr1 345 sh d t g ll
...
I want to add a new column to file A from file B based on two key columns (the first two columns, e.g. "chr1" and "123"). If the key columns match in both files, the data in column 7 of file B should be inserted as column 3 of file A.
For example, the key (chr1 123) is found twice in file B, so the 3rd column in file A contains ff and hh separated by a comma. If the key is not found, it should put NA. The output should look like this:
output:
chr1 123 ff,hh aa b c d
chr1 234 NA a b c d
chr1 345 gg,kk,ll aa b c d
chr1 456 NA a b c d
I achieved this using the following awk solution:
awk -F'\t' -v OFS='\t' 'NR==FNR{a[$1FS$2]=a[$1FS$2]?a[$1FS$2]","$7:$7;next}{$3=(($1FS$2 in a)?a[$1FS$2]:"NA")FS $3}1' fileB fileA
Now I would like to add column 6 as well as column 7. Could anyone suggest how to do this?
The output should look like:
chr1 123 ff,hh e,e aa b c d
chr1 234 NA NA a b c d
chr1 345 gg,kk,ll g,r,g aa b c d
chr1 456 NA NA a b c d
Thanks

My suggestion is to use another array to track the next variable you want to add, but to keep the code a little more readable, I've made an executable awk script to generalize it a bit:
#!/usr/bin/awk -f
BEGIN { FS="\t"; OFS="\t" }
{ key = $1 FS $2 }
FNR==NR {
updateArray( a, $7 )
updateArray( b, $6 )
next
}
{ $3 = concat( a, concat( b, $3 ) ) }
1
function updateArray( arr, fld ) {
arr[key] = arr[key]!="" ? arr[key] "," fld : fld
}
function concat( arr, suffix ) {
return( (arr[key]=="" ? "NA" : arr[key]) OFS suffix )
}
Here's the breakdown:
Set the FS and OFS values
Make a global key for every line read
Store data from the first file in arrays a and b where they are passed by reference to the function updateArray and the field value is passed by value
Update $3 using the local concat function
Print the updated line out with 1
As another option, you could make the value stored in a single a[key] equal to all the file B fields you want represented in $3 and have them separated by OFS. That would require parsing and reassembling the value in a[key] every time it changed as file B is parsed, but would make creating the $3 a simple three part concatenation.
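A minimal sketch of that single-array alternative, assuming the column 7 list and the column 6 list are stored together in a[key], separated by OFS. The sample files are built with printf only so the snippet runs on its own; the file names and the helper array part are placeholders, not part of the original answer.

```shell
#!/bin/sh
# Sketch: a[key] holds "col7-list <TAB> col6-list"; it is split and rebuilt
# each time the same key shows up again in file B.
dir=$(mktemp -d)
printf 'chr1\t123\taa\tb\tc\td\nchr1\t234\ta\tb\tc\td\nchr1\t345\taa\tb\tc\td\n' > "$dir/fileA"
printf 'chr1\t123\taa\tc\td\te\tff\nchr1\t345\taa\te\tf\tg\tgg\nchr1\t123\taa\tc\td\te\thh\nchr1\t345\tbb\tx\tq\tr\tkk\n' > "$dir/fileB"

out=$(awk 'BEGIN { FS = OFS = "\t" }
{ key = $1 FS $2 }
FNR == NR {
    if (key in a) {
        split(a[key], part, OFS)      # part[1] = col 7 list, part[2] = col 6 list
        a[key] = part[1] "," $7 OFS part[2] "," $6
    } else {
        a[key] = $7 OFS $6
    }
    next
}
# Three-part concatenation: stored value (or NA placeholders), OFS, old $3
{ $3 = ((key in a) ? a[key] : "NA" OFS "NA") OFS $3 }
1' "$dir/fileB" "$dir/fileA")

printf '%s\n' "$out"
rm -rf "$dir"
```

The trade-off is as described above: the first pass does more work per line, but building $3 needs only one lookup.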

Related

AWK: print ALL rows with MAX value in one field Per the other field including Identical Rows with Max value

I am trying to keep the rows with the highest value in column 2 per column 1, including identical rows with the max value, as in the desired output below.
Data is
a 55
a 66
a 130
b 88
b 99
b 99
c 110
c 130
c 130
Desired output is
a 130
b 99
b 99
c 130
c 130
I found great answers on this site, but none exactly for the current question.
awk '{ max=(max>$2?max:$2); arr[$2]=(arr[$2]?arr[$2] ORS:"")$0 } END{ print arr[max] }' file
yields output that includes the identical rows, but the max value is taken from all rows, not per column 1.
a 130
c 130
c 130
awk '$2>max[$1] {max[$1]=$2 ; row[$1]=$0} END{for (i in row) print row[i]}' file
The output includes the max value per column 1 but does NOT include identical rows with max values.
a 130
b 99
c 130
Would you please help me trim the data in the desired way? All the code above was obtained from questions and answers on this site. Many thanks in advance!
I've used this approach in the past:
awk 'NR==FNR{if($2 > max[$1]){max[$1]=$2}; next} max[$1] == $2' test.txt test.txt
a 130
b 99
b 99
c 130
c 130
This requires you to pass in the same file twice (i.e. awk '...' test.txt test.txt), so it's not ideal, but hopefully it provides the required output with your actual data.
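One caveat worth sketching: comparing against an unset max[$1] treats it as 0, so groups whose values are all zero or negative would never record a maximum. A possible variant, assuming numeric values in column 2, seeds max[] from the first value seen for each key. The sample data here is made up for illustration.

```shell
#!/bin/sh
# Two-pass sketch that also copes with zero or negative values.
f=$(mktemp)
printf 'a -5\na -2\na -2\nb 0\nb -1\n' > "$f"    # made-up sample data

out=$(awk '
# Pass 1: remember the largest value per key, seeding from the first value seen
NR == FNR { if (!($1 in max) || $2 + 0 > max[$1]) max[$1] = $2 + 0; next }
# Pass 2: print every row whose value equals the maximum for its key
$2 + 0 == max[$1]' "$f" "$f")

printf '%s\n' "$out"
rm -f "$f"
```

Adding 0 forces a numeric comparison regardless of how a given awk treats the stored field.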
Using any awk:
awk '
{ cnt[$1,$2]++; if (!($1 in max) || $2+0 > max[$1]+0) max[$1]=$2 }  # keep the largest value, not just the last one read
END { for (key in max) { val=max[key]; for (i=1; i<=cnt[key,val]; i++) print key, val } }
' file
a 130
b 99
b 99
c 130
c 130
Here is a Ruby script to do that:
ruby -e '
grps=$<.read.split(/\R/).
group_by{|line| line[/^\S+/]}
# {"a"=>["a 55", "a 66", "a 130"], "b"=>["b 88", "b 99", "b 99"], "c"=>["c 110", "c 130", "c 130"]}
maxes=grps.map{|k,v| v.max_by{|s| s.split[-1].to_f}}
# ["a 130", "b 99", "c 130"]
grps.values.flatten.each{|s| puts s if maxes.include?(s)}
' file
Prints:
a 130
b 99
b 99
c 130
c 130
Another way using awk. The second loop should be light, just repeating the duplicated max values.
% awk 'arr[$1] < $2{arr[$1] = $2; # get max value
co[$1]++; if(co[$1] == 1){x++; id[x] = $1}} # count unique ids
arr[$1] == $2{n[$1,arr[$1]]++} # count repeated max
END{for(i=1; i<=x; i++){
for(j=1; j<=n[id[i],arr[id[i]]]; j++){print id[i], arr[id[i]]}}}' file
a 130
b 99
b 99
c 130
c 130
or, if order doesn't matter
% awk 'arr[$1] < $2{arr[$1] = $2}
arr[$1] == $2{n[$1,arr[$1]]++}
END{for(i in arr){
j=0; do{print i, arr[i]; j++} while(j < n[i,arr[i]])}}' file
c 130
c 130
b 99
b 99
a 130
-- EDIT --
Printing data in additional columns
% awk 'arr[$1] < $2{arr[$1] = $2}
arr[$1] == $2{n[$1,arr[$1]]++; line[$1,arr[$1],n[$1,arr[$1]]] = $0}
END{for(i in arr){
j=0; do{j++; print line[i,arr[i],j]} while(j < n[i,arr[i]])}}' file
c 130 data8
c 130 data9
b 99 data5
b 99 data6
a 130 data3
Data
% cat file
a 55 data1
a 66 data2
a 130 data3
b 88 data4
b 99 data5
b 99 data6
c 110 data7
c 130 data8
c 130 data9

AWK command to extract distinct values from a column

I have a tab-delimited file. I'm trying to extract all rows that share a given value in column 4 and save them as a CSV. However, I would like to do this for all the distinct values in column 4, saving each as its own CSV, in one go.
I was able to extract one value using this command:
awk -F $'\t' '$4 == "\"C333\"" {print}' dataFile > C333.csv
Let's consider this test file:
$ cat in.csv
a b c d
aa bb cc d
1 2 3 4
12 23 34 4
A B C d
Now, let's write each row to a tab-separated output file that is named after the fourth column:
$ awk -F'\t' '{f=$4".csv"; print>>f; close(f)}' OFS='\t' in.csv
$ cat d.csv
a b c d
aa bb cc d
A B C d
$ cat 4.csv
1 2 3 4
12 23 34 4
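If the output should be comma-separated and the values in column 4 are quoted (as "C333" is in the question), a variant along these lines may help. It is only a sketch: the sample rows, the dataFile name and the dir variable are assumptions for illustration, and joining fields with commas is not full CSV quoting if a field can itself contain a comma.

```shell
#!/bin/sh
# Sketch: one comma-separated file per distinct value of column 4.
dir=$(mktemp -d)
printf 'a\tb\tc\t"C333"\n1\t2\t3\t"C444"\naa\tbb\tcc\t"C333"\n' > "$dir/dataFile"

awk -F'\t' -v OFS=',' -v dir="$dir" '{
    name = $4
    gsub(/"/, "", name)          # strip quotes so the file is C333.csv
    f = dir "/" name ".csv"
    $1 = $1                      # rebuild the record so tabs become commas
    print >> f
    close(f)                     # keep the number of open files small
}' "$dir/dataFile"

c333=$(cat "$dir/C333.csv")
c444=$(cat "$dir/C444.csv")
printf '%s\n' "$c333"
rm -rf "$dir"
```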

Compare 4 columns in two files; and output the line for unique combination (from first file) and line for duplicate combination (from second file)

I have two tab-separated values files, say
File1.txt
chr1 894573 rs13303010 GG
chr2 18674 rs10195681 **CC**
chr3 104972 rs990284 AA <--- Unique Line
chr4 111487 rs17802159 AA
chr5 200868 rs4956994 **GG**
chr5 303686 rs6896163 AA <--- Unique Line
chrX 331033 rs4606239 TT
chrY 2893277 i4000106 **GG**
chrY 2897433 rs9786543 GG
chrM 57 i3002191 **TT**
File2.txt
chr1 894573 rs13303010 GG
chr2 18674 rs10195681 AT
chr4 111487 rs17802159 AA
chr5 200868 rs4956994 CC
chrX 331033 rs4606239 TT
chrY 2893277 i4000106 GA
chrY 2897433 rs9786543 GG
chrM 57 i3002191 TA
Desired Output:
Output.txt
chr1 894573 rs13303010 GG
chr2 18674 rs10195681 AT
chr3 104972 rs990284 AA <--Unique Line from File1.txt
chr4 111487 rs17802159 AA
chr5 200868 rs4956994 CC
chr5 303686 rs6896163 AA <--Unique Line from File1.txt
chrX 331033 rs4606239 TT
chrY 2893277 i4000106 GA
chrY 2897433 rs9786543 GG
chrM 57 i3002191 TA
File1.txt has total 10 entries while File2.txt has 8 entries.
I want to compare both files using Column 1 and Column 2.
If the first two column values are the same in both files, it should print the corresponding line from File2.txt to Output.txt.
When File1.txt has a unique combination (Column1:Column2 not present in File2.txt), it should print the corresponding line from File1.txt to Output.txt.
I tried various awk and perl combinations available on the web, but couldn't get the correct answer.
Any suggestions will be helpful.
Thanks,
Amit
Next time, show the awk code you tried so we can help with the error or the missing piece.
awk 'NR==FNR || (($1,$2) in k){k[$1,$2]=$0}END{for(K in k)print k[K]}' file1 file2
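Note that for (K in k) visits keys in an unspecified order, so the one-liner above may not reproduce the order of the input. A sketch that keeps File1.txt's order, assuming tab-separated input and keys that are unique within File2.txt, reads File2.txt first and then walks File1.txt. The sample rows are a shortened copy of the question's data.

```shell
#!/bin/sh
# Sketch: print File1 in its own order, substituting File2's line when
# the (column 1, column 2) key exists in File2.
dir=$(mktemp -d)
printf 'chr1\t894573\trs13303010\tGG\nchr2\t18674\trs10195681\tCC\nchr3\t104972\trs990284\tAA\n' > "$dir/File1.txt"
printf 'chr1\t894573\trs13303010\tGG\nchr2\t18674\trs10195681\tAT\n' > "$dir/File2.txt"

out=$(awk -F'\t' '
NR == FNR { line[$1,$2] = $0; next }                # File2: remember each line by key
{ print ((($1,$2) in line) ? line[$1,$2] : $0) }    # File1: prefer the File2 line
' "$dir/File2.txt" "$dir/File1.txt")

printf '%s\n' "$out"
rm -rf "$dir"
```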

Comparing fields of two files in awk

I want to compare two fields of two files, such as follows:
Compare the 2nd field of file one with the 1st field of file two, and print the match (even if the match is repeated) together with all the columns of files one and two.
File 1:
G4 b45 3 4
G4 b45 1 3
G3 b23 2 2
G3 b22 2 6
G3 b22 2 4
File 2:
b45 a b c
b64 d e f
b23 g h i
b22 j k l
b20 m n o
Output:
G4 b45 a b c 3 4
G4 b45 a b c 1 3
G3 b23 g h i 2 2
G3 b22 j k l 2 6
G3 b22 j k l 2 4
I have tried this with the following awk command using associative arrays:
awk 'FNR==NR {array1[$2] = $1 ; arrayrest[$2] = substr($0, index($0, $2)); next}($1 in array1) {print array1[$1] "\t" $0 "\t" arrayrest[$1]}' file1 file2
But there are two problems:
It does not print the lines if the match is repeated while I want them to be printed.
It repeats the first field of file two in the output.
How could I make this awk command work nicely? Thanks in advance.
Not quite the exact output formatting you want but the right output contents.
awk 'FNR==NR{seen[$1]=$0; next} ($2 in seen) {$2=seen[$2]}7' file2 file1
Add | column -t to get more consistent column spacing.
This should be simple and clear to you (read file2 first so that repeated matches in file1 are all printed):
awk 'NR==FNR {n[$1]=$2 OFS $3 OFS $4; next} ($2 in n) {print $1,$2,n[$2],$3,$4}' file2 file1
small awk
awk '{x[$1]=$0}$2=x[$2]' f2 f1
If $1 and $2 can contain the same value
awk '{x[$1]=$0}FNR!=NR&&$2=x[$2]' f2 f1
output
G4 b45 a b c 3 4
G4 b45 a b c 1 3
G3 b23 g h i 2 2
G3 b22 j k l 2 6
G3 b22 j k l 2 4

Using an array in AWK when working with two files

I have two files that I merged based on a key using the code below
file1
-------------------------------
1 a t p bbb
2 b c f aaa
3 d y u bbb
2 b c f aaa
2 u g t ccc
2 b j h ccc
file2
--------------------------------
1 11 bbb
2 22 ccc
3 33 aaa
4 44 aaa
I merged these two files based on the key using the code below:
awk 'NR==FNR{a[$3]=$0;next;}{for(x in a){if(x==$5) print $1,$2,$3,$4,a[x]};
My question is how I can save $2 of file2 in a variable or array and print it again after a[x].
My desired result is :
1 a t p 1 11 bbb 11
2 b c f 3 33 aaa 33
2 b c f 4 44 aaa 44
3 d y u 1 11 bbb 11
2 b c f 3 33 aaa 33
2 b c f 4 44 aaa 44
2 u g t 2 22 ccc 22
2 b j h 2 22 ccc 22
As you see, the first 7 columns are the result of my merge code. I need to add the last column (field 2 of a[x]) to my result.
Important:
My next question: if I have an .awk file, how can I use shell constructs such as a pipe (| column -t) or send the result to a file (awk ... > result.txt)? I always use these at the command prompt. Can I use them inside the code in my .awk file?
Simply add all of file2 to an array, and use split to hold the bits you want:
awk 'FNR==NR { two[$0]++; next } { for (i in two) { split(i, one); if (one[3] == $NF) print $1,$2,$3,$4, i, one[2] } }' file2 file1
Results:
1 a t p 1 11 bbb 11
2 b c f 3 33 aaa 33
2 b c f 4 44 aaa 44
3 d y u 1 11 bbb 11
2 b c f 3 33 aaa 33
2 b c f 4 44 aaa 44
2 u g t 2 22 ccc 22
2 b j h 2 22 ccc 22
Regarding your last question; you can also add 'pipes' and 'writes' inside of your awk. Here's an example of a pipe to column -t:
Contents of script.awk:
FNR==NR {
two[$0]++
next
}
{
for (i in two) {
split(i, one)
if (one[3] == $NF) {
print $1,$2,$3,$4, i, one[2] | "column -t"
}
}
}
Run like: awk -f script.awk file2 file1
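The 'write' case works the same way: print can redirect to a file name from inside awk. A small sketch, assuming the output path is passed in with -v; the variable name out, the result.txt name and the shortened sample data are all illustrative, and keying on column 3 keeps only one file2 line per key.

```shell
#!/bin/sh
# Sketch: write matching rows to a file from inside awk instead of using
# a shell redirect.
dir=$(mktemp -d)
printf '1 11 bbb\n2 22 ccc\n' > "$dir/file2"
printf '1 a t p bbb\n2 u g t ccc\n3 x y z zzz\n' > "$dir/file1"

awk -v out="$dir/result.txt" '
FNR == NR { two[$3] = $0 OFS $2; next }                # file2 line plus its 2nd field
($NF in two) { print $1, $2, $3, $4, two[$NF] > out }  # redirect inside awk
END { close(out) }
' "$dir/file2" "$dir/file1"

result=$(cat "$dir/result.txt")
printf '%s\n' "$result"
rm -rf "$dir"
```

Inside awk, > truncates the file once when it is first opened and then keeps appending for the rest of the run, unlike a shell redirect repeated per command.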
EDIT:
Add the following to your shell script:
results=$(awk '
FNR==NR {
two[$0]++
next
}
{
for (i in two) {
split(i, one)
if (one[3] == $NF) {
print $1,$2,$3,$4, i, one[2] | "column -t"
}
}
}
' "$1" "$2")
echo "$results"
Run like:
./script.sh file2.txt file1.txt
Results:
1 a t p 1 11 bbb 11
2 b c f 3 33 aaa 33
2 b c f 4 44 aaa 44
3 d y u 1 11 bbb 11
2 b c f 3 33 aaa 33
2 b c f 4 44 aaa 44
2 u g t 2 22 ccc 22
2 b j h 2 22 ccc 22
Your current script is:
awk 'NR==FNR { a[$3]=$0; next }
{ for (x in a) { if (x==$5) print $1,$2,$3,$4,a[x] } }'
(Actually, the original is missing the second close brace for the second pattern/action pair.)
It seems that you process file2 before you process file1.
You shouldn't need the loop in the second code. And you can make life easier for yourself by using the splitting in the first phase to keep the values you need:
awk 'NR==FNR { c1[$3] = $1; c2[$3] = $2; next }
{ print $1, $2, $3, $4, c1[$5], c2[$5], $5, c2[$5] }' file2 file1
You can upgrade that to check whether c1[$5] and c2[$5] are defined, presumably skipping the row if they are not.
Given your input files, the output is:
1 a t p 1 11 bbb 11
2 b c f 4 44 aaa 44
3 d y u 1 11 bbb 11
2 b c f 4 44 aaa 44
2 u g t 2 22 ccc 22
2 b j h 2 22 ccc 22
Give or take column spacing, that's what was requested. Column spacing can be fixed by using printf instead of print, or setting OFS to tab, or ...
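The check for undefined keys mentioned earlier could be sketched as follows, assuming rows of file1 whose fifth column has no counterpart in file2 should simply be skipped. The shortened sample files are made up so the snippet runs on its own.

```shell
#!/bin/sh
# Sketch: only print file1 rows whose key (column 5) was seen in file2.
dir=$(mktemp -d)
printf '1 11 bbb\n2 22 ccc\n' > "$dir/file2"
printf '1 a t p bbb\n2 b c f aaa\n2 u g t ccc\n' > "$dir/file1"

out=$(awk 'NR == FNR { c1[$3] = $1; c2[$3] = $2; next }
($5 in c1) { print $1, $2, $3, $4, c1[$5], c2[$5], $5, c2[$5] }' "$dir/file2" "$dir/file1")

printf '%s\n' "$out"
rm -rf "$dir"
```

Using ($5 in c1) rather than testing c1[$5] directly avoids creating empty array entries as a side effect of the lookup.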
The c1 and c2 notations for column 1 and 2 is OK for two columns. If you need more, then you should probably use the 2D array notation:
awk 'NR==FNR { for (i = 1; i <= NF; i++) col[i,$3] = $i; next }
{ print $1, $2, $3, $4, col[1,$5], col[2,$5], $5, col[2,$5] }' file2 file1
This produces the same output as before.
To achieve what you ask, append the second field to the whole line while processing the first file, with a[$3]=$0 OFS $2. For your second question, awk has a variable that separates fields on output, OFS; set it to a tab to get tab-separated columns. Your script would look like:
awk '
BEGIN { OFS = "\t"; }
NR==FNR{
a[$3]=$0 OFS $2;
next;
}
{
for(x in a){
if(x==$5) print $1,$2,$3,$4,a[x]
}
}
' file2 file1
That yields:
1 a t p 1 11 bbb 11
2 b c f 4 44 aaa 44
3 d y u 1 11 bbb 11
2 b c f 4 44 aaa 44
2 u g t 2 22 ccc 22
2 b j h 2 22 ccc 22