AWK - Match First 3 Fields, print $6 & $7 from both files on same line

My two input files have changed, and I now need to match on the first three fields of both files. When a match is found, I want to print $1 (from the matched line), $6 and $7 of file1, and $6 and $7 of file2. The original code was an AWK one-liner that matched on just the first field.
File1
BSTN-SANJ BSTN SANJ 0 0 50 105910
MRFD-SANJ MRFD SANJ 0 0 40 69105
NYRK-SANJ NYRK SANJ 0 0 30 905010
SANJ-HMRD SANJ HMRD 0 0 25 69010
SANJ-NRFK SANJ NRFK 0 0 38 61506
File2
BSTN-SANJ BSTN SANJ 0 0 45 601251
MRFD-SANJ MRFD SANJ 0 0 39 919591
NYRK-SANJ NYRK SANJ 0 0 25 690155
Output
BSTN-SANJ 50 105910 45 601251
MRFD-SANJ 40 69105 39 919591
NYRK-SANJ 30 905010 25 690155

This will do it; file2 is read first and stored, then each line of file1 is looked up in it:
awk -v OFS='\t' '
{key = $1 OFS $2 OFS $3}
NR == FNR {f2[key] = $6 OFS $7; next}
key in f2 {print $1, $6, $7, f2[key]}
' file2 file1
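The join can be checked against the sample data from the question. The sketch below is a small variant of the command above rather than a different method: it uses awk's multi-subscript syntax ($1,$2,$3) for the key instead of joining the fields with OFS by hand. The scratch directory and the out.txt name are assumptions made for the demo.

```shell
cd "$(mktemp -d)"   # scratch directory, so no real files are overwritten

cat > file1 <<'EOF'
BSTN-SANJ BSTN SANJ 0 0 50 105910
MRFD-SANJ MRFD SANJ 0 0 40 69105
NYRK-SANJ NYRK SANJ 0 0 30 905010
SANJ-HMRD SANJ HMRD 0 0 25 69010
SANJ-NRFK SANJ NRFK 0 0 38 61506
EOF

cat > file2 <<'EOF'
BSTN-SANJ BSTN SANJ 0 0 45 601251
MRFD-SANJ MRFD SANJ 0 0 39 919591
NYRK-SANJ NYRK SANJ 0 0 25 690155
EOF

awk -v OFS='\t' '
NR == FNR { f2[$1,$2,$3] = $6 OFS $7; next }          # file2: store $6 and $7 under the 3-field key
($1,$2,$3) in f2 { print $1, $6, $7, f2[$1,$2,$3] }   # file1: print only lines whose key was stored
' file2 file1 > out.txt

cat out.txt
```

Lines of file1 whose three-field key does not occur in file2 (the two SANJ-... lines here) are simply not printed.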

Related

How to insert column from a file to another file at multiple places

I would like to insert columns 1 and 2 from file 2 into file 1 after every second column, up to the last column.
File1.txt (tab-separated, column range from 1-2400 and cell range from 1-4500)
ID IMPACT ID IMPACT ID IMPACT
51 0.288 128 0.4557 156 0.85
625 0.858 15 -0.589 51 0.96
8 0.845 7 0.5891
File2.txt (consists of only two tab-separated columns with 19000 rows)
ID IMPACT
18 -1
165 -1
41 -1
11 -1
Output file
ID IMPACT ID IMPACT ID IMPACT ID IMPACT ID IMPACT ID IMPACT
51 0.288 18 -1 128 0.4557 18 -1 156 0.85 18 -1
625 0.858 165 -1 15 -0.589 165 -1 51 0.96 165 -1
8 0.845 41 -1 7 0.5891 41 -1 41 -1
11 -1 11 -1 11 -1
I tried the commands below, but they do not work:
paste <(cut -f 1,2 File1.txt) <(cut -f 1,2 File2.txt) <(cut -f 3,4 File1.txt) <(cut -f 1,2 File2.txt)......... > File3
Problem: once File1.txt runs out of rows, the File2.txt values shift into different columns.
paste File1.txt File2.txt > File3.txt
awk '{print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" $6 "\t" $3 "\t" $4....}' File3.txt > File4.txt
This does the job; however, it mixes up the values of File1.txt between columns.
I tried everything but did not succeed.
Any help would be appreciated; a bash or pandas solution would be best. Thanks in advance.
$ awk '
BEGIN {
FS=OFS="\t" # tab-separated data
}
NR==FNR { # hash fields of file2
a[FNR]=$1 # index with record numbers FNR
b[FNR]=$2
next
}
{ # print file1 records with file2 fields
print $1,$2,a[FNR],b[FNR],$3,$4,a[FNR],b[FNR],$5,$6,a[FNR],b[FNR]
}
END { # in the end
for(i=(FNR+1);(i in a);i++) # deal with extra records of file2
print "","",a[i],b[i],"","",a[i],b[i],"","",a[i],b[i]
}' file2 file1
Output:
ID IMPACT ID IMPACT ID IMPACT ID IMPACT ID IMPACT ID IMPACT
51 0.288 18 -1 128 0.4557 18 -1 156 0.85 18 -1
625 0.858 165 -1 15 -0.589 165 -1 51 0.96 165 -1
8 0.845 41 -1 7 0.5891 41 -1 41 -1
11 -1 11 -1 11 -1
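The script above spells out the three ID/IMPACT pairs by hand, which does not scale to the 2400 columns mentioned in the question. A possible generalization is sketched below. It assumes that the header row of File1.txt has the full number of columns and that columns always come in pairs. The variable names (nf, n1, n2, sep) and the scratch directory are made up for this sketch.

```shell
cd "$(mktemp -d)"   # scratch directory for the demo

# Sample data from the question, tab-separated
printf 'ID\tIMPACT\tID\tIMPACT\tID\tIMPACT\n51\t0.288\t128\t0.4557\t156\t0.85\n625\t0.858\t15\t-0.589\t51\t0.96\n8\t0.845\t7\t0.5891\t\t\n' > File1.txt
printf 'ID\tIMPACT\n18\t-1\n165\t-1\n41\t-1\n11\t-1\n' > File2.txt

awk '
BEGIN { FS = OFS = "\t" }
NR == FNR { a[FNR] = $1; b[FNR] = $2; n2 = FNR; next }   # File2: remember both columns by line number
FNR == 1  { nf = NF }                                    # File1 header fixes the column count
{
    line = ""
    for (i = 1; i <= nf; i += 2) {                       # after each File1 pair, append the File2 pair
        sep = (i > 1 ? OFS : "")
        line = line sep $i OFS $(i+1) OFS a[FNR] OFS b[FNR]
    }
    print line
    n1 = FNR
}
END {
    for (r = n1 + 1; r <= n2; r++) {                     # File2 rows left over after File1 ended
        line = ""
        for (i = 1; i <= nf; i += 2) {
            sep = (i > 1 ? OFS : "")
            line = line sep OFS OFS a[r] OFS b[r]        # empty File1 pair, then the File2 pair
        }
        print line
    }
}' File2.txt File1.txt > File3.txt

cat File3.txt
```

Reading a field beyond NF, such as $(i+1) on a short row, yields an empty string, so short rows in File1.txt are padded with empty cells.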

Awk to update file based on match and condition in another

The awk below produces the tab-delimited file1, with the difference $3-$2 calculated for each line and printed in $6. Before the awk is executed, only 5 fields exist.
What I am having trouble with is updating each $2 value in file2 with the $7 value of file1 if the $1 value of file2 matches the $5 of file1 and $6 in file1 is not intron. If the value of $5 is intron, then the value of $7 in file1 is zero. For example, line 1 in file1 is intron, so it is equivalent to zero or skipped (those lines are not needed in the calculation).
It is possible that a $1 value in file2 does not exist in file1; in that case the value of $2 in file2 is zero. Line 3 in file2 is an example: it is set to zero because it does not exist in file1. Thank you :).
Awk w/ output
awk '
FNR==NR{ # process same line
b[$4]=$3-$2;
next # process next line
}
{
a[$5]+=($3-$2)
}
{
split($1, b, " "); print b[0], a[b[0]]
}' OFS="\t" file1 file2
Output
-2135
-2135
-2222
-2351
-2351
-2414
File1 tab-delimited
chr5 86667863 86667879 RASA1 intron 16
chr5 86669977 86669995 RASA1 splicing 18
chr5 86670703 86670805 RASA1 exon 102
chr5 86679453 86679547 RASA1 intron 94
chr5 86679571 86679673 RASA1 exon 102
chr19 15088950 15088961 NOTCH2 intron 50
chr19 15288950 15288961 NOTCH3 intron 11
chr19 15308240 15308275 NOTCH3 exon 35
File2 space delimited
RASA1 2135
NOTCH2 0
GIMAP8 87
NOTCH3 129
FOXF2 0
PRB3 63
Desired out after file2 is updated
RASA1 222 `(102+102+18)`
NOTCH2 0
GIMAP8 0
NOTCH3 35 `(35)`
FOXF2 0
PRB3 0
Maybe adding a | after the first awk with:
awk 'FNR==NR { a[$1]=$7; next } { if(a[$5]){$1=a[$5] }; print }'
To update file2
Could you please try the following. It prints the output in the same order as the lines of Input_file2.
awk '
FNR==NR{
if(!b[$1]++){
c[++count]=$1
}
a[$1]
next
}
($4 in a) && $5!="intron"{
a[$4]+=$NF
}
END{
for(i=1;i<=count;i++){
print c[i],a[c[i]]?a[c[i]]:0
}
}' Input_file2 Input_file1
Your sample Input_file1 does not appear to be TAB-delimited, despite your description. If it really is, change Input_file2 Input_file1 to Input_file2 FS="\t" Input_file1. To get TAB-delimited output, set OFS="\t" next to FS="\t" as well; to get aligned columns instead, pipe the output to column -t.
Output will be as follows.
RASA1 222
NOTCH2 0
GIMAP8 0
NOTCH3 35
FOXF2 0
PRB3 0
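If I understood it correctly, this should do what you expect. The intron test sits inside the first block so that every file1 line reaches next and is never printed:
$ awk 'FNR==NR {if ($5!="intron") a[$4]+=$3-$2; next}
{$2=($1 in a)?a[$1]:0}1' file1 file2 > file2.updated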
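To check the result against the sample data, and to replace file2 itself rather than write a separate file, something like the following could be used. It is only a sketch. It assumes the sixth field of file1 already holds the difference, and that overwriting file2 through a temporary file (file2.tmp, a made-up name) is acceptable.

```shell
cd "$(mktemp -d)"   # scratch directory for the demo

# file1 is tab-delimited; printf repeats the format for every group of six arguments
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    chr5 86667863 86667879 RASA1 intron 16 \
    chr5 86669977 86669995 RASA1 splicing 18 \
    chr5 86670703 86670805 RASA1 exon 102 \
    chr5 86679453 86679547 RASA1 intron 94 \
    chr5 86679571 86679673 RASA1 exon 102 \
    chr19 15088950 15088961 NOTCH2 intron 50 \
    chr19 15288950 15288961 NOTCH3 intron 11 \
    chr19 15308240 15308275 NOTCH3 exon 35 > file1

# file2 is space-delimited
printf '%s %s\n' RASA1 2135 NOTCH2 0 GIMAP8 87 NOTCH3 129 FOXF2 0 PRB3 63 > file2

awk '
FNR == NR { if ($5 != "intron") sum[$4] += $6; next }   # file1: total $6 per gene, introns skipped
{ $2 = ($1 in sum) ? sum[$1] : 0; print }               # file2: new $2, or 0 for genes with no total
' file1 file2 > file2.tmp && mv file2.tmp file2

cat file2
```

On the sample data this gives 222 for RASA1 (18+102+102), 35 for NOTCH3, and 0 for every other gene.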

How to use awk and grep combination

I have a file with 10 columns and lots of lines. I want to add a fixed correction to the 10th column on every line that contains the pattern 'G01'.
For example, in the file below
AS G17 2014 3 31 0 2 0.000000 1 -0.809159910000E-04
AS G12 2014 3 31 0 2 0.000000 1 0.195515363000E-03
AS G15 2014 3 31 0 2 0.000000 1 -0.171167837000E-03
AS G29 2014 3 31 0 2 0.000000 1 0.521982134000E-03
AS G07 2014 3 31 0 2 0.000000 1 0.329889640000E-03
AS G05 2014 3 31 0 2 0.000000 1 -0.381588767000E-03
AS G25 2014 3 31 0 2 0.000000 1 0.203352860000E-04
AS G01 2014 3 31 0 2 0.000000 1 0.650180300000E-05
AS G24 2014 3 31 0 2 0.000000 1 -0.258444780000E-04
AS G27 2014 3 31 0 2 0.000000 1 -0.203691700000E-04
the 10th column of the line with G01 should be corrected.
I've used 'awk' with a 'while' loop to do that, but it takes a very long time for massive files.
I would appreciate any help with a more efficient way.
You can use the following:
awk '$2 == "G01" {$10="value"}1' file.txt
To preserve whitespace you can use the solution from this post (it relies on the four-argument split of GNU awk):
awk '$2 == "G01" {
    n=split($0,a," ",b)    # gawk only: b[] receives the separators, b[0] the leading whitespace
    a[10]="value"
    line=b[0]
    for (i=1;i<=n; i++){
        line=(line a[i] b[i])
    }
    print line
    next                   # skip the default print below
}
1' file.txt
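The commands above put a literal string into the 10th column, whereas the question asks to add a correction to the existing number. A minimal sketch of that follows. It assumes the correction is a constant passed in as corr (a made-up variable name and value) and that a 12-digit exponent format is acceptable. Note that sprintf writes the result in normalized form, with one digit before the decimal point, not in the 0.xxx form used in the input file.

```shell
cd "$(mktemp -d)"   # scratch directory for the demo

cat > file.txt <<'EOF'
AS G25 2014 3 31 0 2 0.000000 1 0.203352860000E-04
AS G01 2014 3 31 0 2 0.000000 1 0.650180300000E-05
AS G24 2014 3 31 0 2 0.000000 1 -0.258444780000E-04
EOF

# corr is an illustrative correction value, not taken from the question
awk -v corr=1.0e-06 '
$2 == "G01" { $10 = sprintf("%.12E", $10 + corr) }   # add the correction on G01 lines only
1                                                    # print every line, changed or not
' file.txt > corrected.txt

cat corrected.txt
```

The file is read once, with no explicit loop over lines. Assigning to $10 rebuilds the changed line with single spaces between fields; untouched lines are printed exactly as they were read.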
the 10th column of the line with G01 should be corrected
The syntax is as follows. It searches for the regex given inside /../ anywhere in the current record/line/row, regardless of which field the regex is found in.
Either (this form prints only the matching lines)
$ awk '/regex/{ $10 = "somevalue"; print }' infile
OR
1 at the end does the default operation print $0 for every record/line/row, so lines that do not match are printed too
$ awk '/regex/{ $10 = "somevalue" }1' infile
OR
$0 means current record/line/row
$ awk '$0 ~ /regex/{ $10 = "somevalue"}1' infile
So in the current context, it will be any of the following
$ awk '/G01/{$10 = "somevalue" ; print }' infile
$ awk '/G01/{$10 = "somevalue" }1' infile
$ awk '$0 ~ /G01/{$10 = "somevalue"; print }' infile
$ awk '$0 ~ /G01/{$10 = "somevalue" }1' infile
If you would like to restrict your search to a specific field/column in the record/line/row then
$2 means the 2nd field/column and $10 the 10th
$ awk '$2 == "G01" {$10 = "somevalue"; print }' infile
$ awk '$2 == "G01" {$10 = "somevalue" }1' infile
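The two forms in each pair above differ in what they print. A small sketch on three lines of the sample data makes this visible. The infile name follows the answer; the scratch directory and the shell variable names are assumptions for the demo.

```shell
cd "$(mktemp -d)"   # scratch directory for the demo

cat > infile <<'EOF'
AS G25 2014 3 31 0 2 0.000000 1 0.203352860000E-04
AS G01 2014 3 31 0 2 0.000000 1 0.650180300000E-05
AS G24 2014 3 31 0 2 0.000000 1 -0.258444780000E-04
EOF

# print inside the action: only the matching line is written
only_matches=$(awk '$2 == "G01" { $10 = "somevalue"; print }' infile | wc -l | tr -d ' ')

# trailing 1: every line is written, the matching one in its changed form
all_lines=$(awk '$2 == "G01" { $10 = "somevalue" } 1' infile | wc -l | tr -d ' ')

echo "print inside the action: $only_matches line(s); trailing 1: $all_lines line(s)"
```

To update a whole file, the form with the trailing 1 is usually the one wanted.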
If you would like to pass a word to awk, either as a literal or from a shell variable, then
$ awk -v search="G01" -v replace="foo" '$2 == search {$10 = replace }1' infile
and the same using shell variables
$ search_value="G01"
$ new_value="foo"
$ awk -v search="$search_value" -v replace="$new_value" '$2 == search {$10 = replace }1' infile
From man
-v var=val
--assign var=val
Assign the value val to the variable var, before execution of
the program begins. Such variable values are available to the
BEGIN block of an AWK program.
For additional syntax instructions:
"sed & awk" by Dale Dougherty and Arnold Robbins (O'Reilly)
"UNIX Text Processing," by Dale Dougherty and Tim O'Reilly (Hayden Books)
"GAWK: Effective awk Programming," by Arnold D. Robbins (O'Reilly)
http://www.gnu.org/software/gawk/manual/

What are NR and FNR and what does "NR==FNR" imply?

I am learning file comparison using awk.
I found syntax like below,
awk 'NR==FNR{a[$1];next}$1 in a{print $1}' file1 file2
I couldn't understand the significance of NR==FNR here.
If I try FNR==NR instead, I get the same output.
What exactly does it do?
In Awk:
FNR refers to the record number (typically the line number) in the current file.
NR refers to the total record number.
The operator == is a comparison operator, which returns true when the two surrounding operands are equal.
This means that the condition NR==FNR is normally only true for the first file, as FNR resets back to 1 for the first line of each file but NR keeps on increasing.
This pattern is typically used to perform actions on only the first file. It works assuming that the first file is not empty, otherwise the two variables would continue to be equal while Awk was processing the second file.
The next inside the block means any further commands are skipped, so they are only run on files other than the first.
The condition FNR==NR compares the same two operands as NR==FNR, so it behaves in the same way.
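The empty-first-file caveat mentioned above can be reproduced with a short sketch. It also shows one possible alternative test, FILENAME == ARGV[1], which assumes that the first command-line argument is the first file name (not a var=value assignment) and that the same file is not passed twice. The file and array names are made up for the demo.

```shell
cd "$(mktemp -d)"   # scratch directory for the demo

: > file1                   # empty first file
printf 'x\ny\n' > file2

# NR==FNR is still true while file2 is read, so file2 is stored instead of printed
awk 'NR == FNR { seen[$1]; next } !($1 in seen)' file1 file2 > nr_fnr.out

# Comparing the current file name with the first argument avoids that
awk 'FILENAME == ARGV[1] { seen[$1]; next } !($1 in seen)' file1 file2 > filename.out

echo "NR==FNR printed $(wc -l < nr_fnr.out | tr -d ' ') line(s)"
echo "FILENAME==ARGV[1] printed $(wc -l < filename.out | tr -d ' ') line(s)"
```

With an empty file1 every line of file2 should be printed, since nothing was stored; only the second command does that.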
Look for keys (first word of line) in file2 that are also in file1.
Step 1: fill array a with the first words of file 1:
awk '{a[$1];}' file1
Step 2: Fill array a and ignore file 2 in the same command. To do this, compare the total number of records read so far (NR) with the record number within the current input file (FNR).
awk 'NR==FNR{a[$1]}' file1 file2
Step 3: Ignore actions that might come after } when parsing file 1
awk 'NR==FNR{a[$1];next}' file1 file2
Step 4: print key of file2 when found in the array a
awk 'NR==FNR{a[$1];next} $1 in a{print $1}' file1 file2
Look up NR and FNR in the awk manual and then ask yourself what is the condition under which NR==FNR in the following example:
$ cat file1
a
b
c
$ cat file2
d
e
$ awk '{print FILENAME, NR, FNR, $0}' file1 file2
file1 1 1 a
file1 2 2 b
file1 3 3 c
file2 4 1 d
file2 5 2 e
These are awk built-in variables.
NR - the total number of records processed so far, across all input files.
FNR - the number of records processed so far in the current input file.
Assuming you have Files a.txt and b.txt with
cat a.txt
a
b
c
d
1
3
5
cat b.txt
a
1
2
6
7
Keep in mind
NR and FNR are awk built-in variables.
NR - the running count of records processed so far (here across both a.txt and b.txt)
FNR - the running count of records within the current input file (either a.txt or b.txt)
awk 'NR==FNR{a[$0];}{if($0 in a)print FILENAME " " NR " " FNR " " $0}' a.txt b.txt
a.txt 1 1 a
a.txt 2 2 b
a.txt 3 3 c
a.txt 4 4 d
a.txt 5 5 1
a.txt 6 6 3
a.txt 7 7 5
b.txt 8 1 a
b.txt 9 2 1
Let's add "next" so that the second block is skipped for a.txt (the records where NR==FNR).
Lines in b.txt that are also in a.txt:
awk 'NR==FNR{a[$0];next}{if($0 in a)print FILENAME " " NR " " FNR " " $0}' a.txt b.txt
b.txt 8 1 a
b.txt 9 2 1
Lines in b.txt but not in a.txt:
awk 'NR==FNR{a[$0];next}{if(!($0 in a))print FILENAME " " NR " " FNR " " $0}' a.txt b.txt
b.txt 10 3 2
b.txt 11 4 6
b.txt 12 5 7
awk 'NR==FNR{a[$0];next}!($0 in a)' a.txt b.txt
2
6
7
Here is pseudocode for what the command in the question does.
NR = 1
for (i=1; i<=files.length; ++i) {
    line = read line from files[i]
    FNR = 1
    while (not EOF) {
        columns = getColumns(line)
        if (NR equals FNR) { // processing first file
            add columns[1] to a
        } else { // processing remaining files
            if (columns[1] exists in a) {
                print columns[1]
            }
        }
        NR = NR + 1
        FNR = FNR + 1
        line = read line from files[i]
    }
}

Compare two files and append the values, leave the mismatches as such in the output file

I'm trying to match two files, file1.txt (50,000 lines) and file2.txt (55,000 lines). I want to compare file2 to file1, take the values of columns 2 and 3 from file1 where the id matches, and leave the mismatches as they are. The output file must contain all the ids from file2, i.e. it should have 55,000 lines. Note: not all the ids in file1 are present in file2, i.e. the actual number of matches could be less than 50,000.
file1.txt
ab1 12 345
ab2 9 456
gh67 6 987
file2.txt
ab2 0 0
ab1 0 345
nh7 0 0
gh67 6 987
Output
ab2 9 456
ab1 12 345
nh7 0 0
gh67 6 987
This is what I tried, but it only prints the matches (so instead of 55,000 lines I have 49,000 lines in my output file)
awk "NR==FNR {f[$1]=$0;next}$1 in f{print f[$1],$0}" file1.txt file2.txt >output.txt
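This awk script will work
NR == FNR {
    a[$1] = $0    # first file: store each line under its id
    next
}
$1 in a {         # second file, id found in the first file
    split(a[$1], b)
    print $1, (b[2] == $2 ? $2 : b[2]), (b[3] == $3 ? $3 : b[3])
}
!($1 in a)        # second file, id not found: print the line unchanged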
If you save this as a.awk and run it with file1.txt and file2.txt as the two arguments (named foo.txt and foo1.txt here)
awk -f a.awk foo.txt foo1.txt
This will output
ab2 9 456
ab1 12 345
nh7 0 0
gh67 6 987
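A shorter variant, sketched here on the sample data from the question, prints the stored file1.txt line whenever the id is known and the file2.txt line otherwise. It assumes, as in the sample, that a matching file1.txt line should replace the file2.txt line wholesale. The array name row and the scratch directory are made up for the demo.

```shell
cd "$(mktemp -d)"   # scratch directory for the demo

printf 'ab1 12 345\nab2 9 456\ngh67 6 987\n' > file1.txt
printf 'ab2 0 0\nab1 0 345\nnh7 0 0\ngh67 6 987\n' > file2.txt

awk '
NR == FNR { row[$1] = $0; next }          # file1.txt: remember each line by its id
{ print (($1 in row) ? row[$1] : $0) }    # file2.txt: line from file1.txt if known, else unchanged
' file1.txt file2.txt > output.txt

cat output.txt
```

The output always has exactly as many lines as file2.txt, in file2.txt order.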