Print lines in both file when 2 different columns match - awk

I have 2 tab delim files
file 1
B T 4 tab -
1 C 5 - cab
5 A 2 - ttt
D T 18 1111 -
file 2
K A 3 0.1
T B 4 0.3
P 1 5 0.5
P 5 2 0.11
I need to merge the two based on file 1 col1 and 3 and file2 col2 and 3, and print lines in both files. I'm expecting the following output:
B T 4 tab - T B 4 0.3
1 C 5 - cab P 1 5 0.5
5 A 2 - ttt P 5 2 0.11
I tried adapting from a similar question I had in the past:
awk 'NR==FNR {a[$1,$3] = $2"\t"$4"\t"$5; next} $2,$3 in a {print a[$1,$3],$0}' file1 file2
but no success, the output I get looks like this, which is similar to file2:
K A 3 0.1
T B 4 0.3
P 1 5 0.5
P 5 2 0.11

There are two small problems in your code:
awk 'NR==FNR{a[$1,$3]=$0; next} ($2,$3) in a {print a[$2,$3], $0}' file1 file2
# parentheses -^ ----^
# $2,$3 ----^

Related

calculate current line column less previous line column using awk

my input
a 9
b 2
c 5
d 3
e 7
desired output (current line column 2 - previous line column 2)
a 9
b 2 -7
c 5 3
d 3 -2
e 7 4
explanation
a 9
b 2 -7 ( 2-9 = -7 )
c 5 3 ( 5-2 = 3 )
d 3 -2 ( 3-5 = -2 )
e 7 4 ( 7-3 = 4 )
I tried this without success
awk '{ print $1, $2,$2 - $(NR-1) }' input
I want a awk code to generate an additional column contains de calculation of current line less previous line in column 2
You can try this awk
$ awk 'NR==1{ print $0 } NR>1{ print $0,$2 - pre } { pre=$2 }' file
a 9
b 2 -7
c 5 3
d 3 -2
e 7 4

Retrieving lines comparing columns from two files

I'm trying to compare 2 tables and retrieve the matches based on two columns:
file 1
0.736 5 100 T
0.723 1 15 T
0.792 6 100 T
0.634 3 100 T
0.754 7 100 T
0.708 2 100 T
0.722 9 100 T
0.542 1 6 T
File 2
0.736 5
0.634 3
0.542 1
output
0.736 5 100 T
0.634 3 100 T
0.542 1 6 T
When I try this code it tells me that awk is not found, which doesnt make sense because I use awk regularly.. Could you help me out spotting the error here please?
awk 'FNR==NR{a[$1,$2]=$0;next}{if(b=a[$1,$2]){print b}}' file1 file2> output
you could use grep
grep -f file2 file1
or awk
awk 'NR==FNR{A[$1];next}$1 in A' file2 file1
Hope this helps :)

Split a column in lines after every 5 entries - awk

I have a file that looks like the following
1
1
1
1
1
1
12
2
2
2
2
2
2
2
3
4
What I want to do is convert this column in multiple rows. Each new line/row should start after 5 entries, so that the output will like like this
1 1 1 1 1
1 12 2 2 2
2 2 2 2 3
4
I tried to achieve that by using
awk '{printf "%s" (NR%5==0?"RS:FS),$1}' file
but I get the following error
awk: line 1: runaway string constant "RS:FS),$1} ...
Any idea on how to achieve the desired output?
Maybe this awk one liner can help.
awk '{if (NR%5==0){a=a $0" ";print a; a=""} else a=a $0" "}END{print}' file
Output:
1 1 1 1 1
1 12 2 2 2
2 2 2 2 3
4
Longer awk:
{
if (NR%5==0)
{
a=a $0" ";
print a;
a="";
}
else
{
a=a $0" ";
}
}
END
{
print
}
Slightly different approach, still using awk:
$ awk '{if (NR%5) {ORS=""} else {ORS="\n"}{print " "$0}}' input.txt
1 1 1 1 1
1 12 2 2 2
2 2 2 2 3
4
Using perl:
$ perl -p -e 's/\n/ / if $.%5' input.txt
1 1 1 1 1
1 12 2 2 2
2 2 2 2 3
4
No need to complicate...
$ pr -5ats' ' <file
1 1 1 1 1
1 12 2 2 2
2 2 2 2 3
4

AWK: Comparing two different columns in two files

I have these two files
File1:
9 8 6 8 5 2
2 1 7 0 6 1
3 2 3 4 4 6
File2: (which has over 4 million lines)
MN 1 0
JK 2 0
AL 3 90
CA 4 83
MK 5 54
HI 6 490
I want to compare field 6 of file1, and compare field 2 of file 2. If they match, then put field 3 of file2 at the end of file1
I've looked at other solutions but I can't get it to work correctly.
Desired output:
9 8 6 8 5 2 0
2 1 7 0 6 1 0
3 2 3 4 4 6 490
My attempt:
awk 'NR==FNR{a[$2]=$2;next}a[$6]{print $0,a[$6]}' file2 file1
program just hangs after that.
To print all lines in file1 with match if available:
$ awk 'FNR==NR{a[$2]=$3;next;} {print $0,a[$6];}' file2 file1
9 8 6 8 5 2 0
2 1 7 0 6 1 0
3 2 3 4 4 6 490
To print only the lines that have a match:
$ awk 'NR==FNR{a[$2]=$3;next} $6 in a {print $0,a[$6]}' file2 file1
9 8 6 8 5 2 0
2 1 7 0 6 1 0
3 2 3 4 4 6 490
Note that I replaced a[$2]=$2 with a[$2]=$3 and changed the test a[$6] (which is false if the value is zero) to $6 in a.
Your own attempt basically has two bugs as seen in #John1024's answer:
You use field 2 as both key and value in a, where you should be storing field 3 as the value (since you want to keep it for later), i.e., it should be a[$2] = $3.
The test a[$6] is false when the value in a is zero, even if it exists. The correct test is $6 in a.
Hence:
awk 'NR==FNR { a[$2]=$3; next } $6 in a {print $0, a[$6] }' file2 file1
However, there might be better approaches, but it is not clear from your specifications. For instance, you say that file2 has over 4 million lines, but it is unknown if there are also that many unique values for field 2. If yes, then a will also have that many entries in memory. And, you don't specify how long file1 is, or if its order must be preserved for output, or if every line (even without matches in file2) should be output.
If it is the case that file1 has many fewer lines than file2 has unique values for field 2, and only matching lines need to be output, and order does not need to be preserved, then you might wish to read file1 first…

sum rows based on unique columns awk

I'm looking for a more elegant way to do this (for more than >100 columns):
awk '{a[$1]+=$4}{b[$1]+=$5}{c[$1]+=$6}{d[$1]+=$7}{e[$1]+=$8}{f[$1]+=$9}{g[$1]+=$10}END{for(i in a) print i,a[i],b[i],c[i],d[i],e[i],f[i],g[i]}'
Here is the input:
a1 1 1 2 2
a2 2 5 3 7
a2 2 3 3 8
a3 1 4 6 1
a3 1 7 9 4
a3 1 2 4 2
and output:
a1 1 1 2 2
a2 4 8 6 15
a3 3 13 19 7
Thanks :)
I break the one-liner down into lines, to make it easier to read.
awk '{n[$1];for(i=2;i<=NF;i++)a[$1,i]+=$i}
END{for(x in n){
printf "%s ", x
for(y=2;y<=NF;y++)printf "%s%s", a[x,y],(y==NF?ORS:FS)
}
}' file
this awk command should work with your 100 columns file.
test with your file:
kent$ cat f
a1 1 1 2 2
a2 2 5 3 7
a2 2 3 3 8
a3 1 4 6 1
a3 1 7 9 4
a3 1 2 4 2
kent$ awk '{n[$1];for(i=2;i<=NF;i++)a[$1,i]+=$i}END{for(x in n){printf "%s ", x;for(y=2;y<=NF;y++)printf "%s%s", a[x,y],(y==NF?ORS:OFS)}}' f
a1 1 1 2 2
a2 4 8 6 15
a3 3 13 19 7
Using arrays of arrays in gnu awk version 4
awk '{for (i=2;i<=NF;i++) a[$1][i]+=$i}
END{for (i in a)
{ printf i FS;
for (j in a[i]) printf a[i][j] FS
printf RS}
}' file
a1 1 1 2 2
a2 4 8 6 15
a3 3 13 19 7
If you care about order of output try this
$ cat file
a1 1 1 2 2
a2 2 5 3 7
a2 2 3 3 8
a3 1 4 6 1
a3 1 7 9 4
a3 1 2 4 2
Awk Code :
$ cat tester
awk 'FNR==NR{
U[$1] # Array U with index being field1
for(i=2;i<=NF;i++) # loop through columns thats is column2 to NF
A[$1,i]+=$i # Array A holds sum of columns
next # stop processing the current record and go on to the next record
}
($1 in U){ # Here we read same file once again,if field1 is found in array U, then following statements
for(i=1;i<=NF;i++)
s = s ? s OFS A[$1,i] : A[$1,i] # I am writing sum to variable s since I want to use only one print statement, here you can use printf also
print $1,s # print column1 and variable s
delete U[$1] # We have done, so delete array element
s = "" # reset variable s
}' OFS='\t' file{,} # output field separator is tab you can set comma also
Resulting
$ bash tester
a1 1 1 2 2
a2 4 8 6 15
a3 3 13 19 7
If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk , /usr/xpg6/bin/awk , or nawk
--edit--
As requested in comment here is one liner, in above post for better reading purpose I had commented and it became several lines.
$ awk 'FNR==NR{U[$1];for(i=2;i<=NF;i++)A[$1,i]+=$i;next}($1 in U){for(i=1;i<=NF;i++)s = s ? s OFS A[$1,i] : A[$1,i];print $1,s;delete U[$1];s = ""}' OFS='\t' file{,}
a1 1 1 2 2
a2 4 8 6 15
a3 3 13 19 7